A cross-language dependency syntax analysis method based on hierarchical syntax understanding and memory-enhanced large model
By constructing a multilingual dependency tag library and employing an explicit syntactic memory enhancement strategy, the method addresses the insufficient performance of dependency parsing on low-resource languages and achieves high-precision cross-language dependency parsing.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- KUNMING UNIV OF SCI & TECH
- Filing Date
- 2026-03-18
- Publication Date
- 2026-06-12
AI Technical Summary
Existing cross-linguistic dependency parsing methods exhibit significant performance degradation on low-resource languages such as Vietnamese, primarily due to the scarcity of labeled data, the insufficient memory capacity of large language models for weakly occurring dependency labels, and the lack of explicit modeling mechanisms for deep syntactic commonalities.
By constructing a multilingual dependency tag library, evaluating tag memory strength, and guiding model re-parsing, and combining implicit multi-task learning with explicit syntactic memory enhancement strategies, the dependency parsing accuracy of low-resource languages is significantly improved.
It significantly improves the accuracy of dependency parsing for low-resource languages, reaching the current state-of-the-art level, and alleviates the problem of forgetting cross-language syntactic knowledge.
Figure CN122197866A_ABST
Description
Technical Field
[0001] This invention belongs to the field of natural language processing and artificial intelligence technology, and specifically relates to a cross-linguistic dependency parsing method based on a hierarchical syntactic understanding and memory-enhanced large model. It is particularly suited to improving dependency parsing accuracy for low-resource languages by exploiting the syntactic knowledge of high-resource languages.
Background Technology
[0002] To date, natural language processing has made significant progress in cross-linguistic research. Large language models, with their powerful context modeling and zero-shot transfer capabilities, have approached human-level performance in dependency parsing tasks for high-resource languages. However, when dealing with low-resource languages such as Vietnamese, model performance generally drops sharply. This is due to two main reasons: firstly, low-resource languages lack large-scale, high-quality, manually annotated dependency treebanks, making it difficult for models to fully learn complex syntactic patterns; secondly, existing cross-linguistic transfer methods often rely on surface word vector alignment or multi-task joint training, failing to delve into the deep syntactic commonalities between languages and lacking effective modeling mechanisms for low-frequency or weakly occurring dependency relations (such as "nmod:tmod", "obl:agent", "discourse") in the target language.
[0003] Current mainstream cross-linguistic dependency parsing methods mainly fall into two categories: First, direct fine-tuning strategies based on multilingual pre-trained models (such as mBERT and XLM-R). This method is simple and easy to implement, but it is prone to overfitting when target language labeled data is scarce and cannot effectively preserve the rich syntactic knowledge of the source language. Second, transfer frameworks based on adversarial learning or subword embedding alignment. While these can achieve cross-linguistic representation alignment to some extent, their focus is mainly on the lexical or shallow grammatical levels, ignoring the hierarchical, recursive, and label-specific nature of dependency structures themselves. More importantly, large language models exhibit a significant "syntactic memory forgetting" phenomenon during cross-linguistic inference, meaning they lack stable activation capabilities for dependency labels that are rare or unseen during training, leading to systematic biases in the parsing results regarding key syntactic relations.
[0004] Furthermore, although Chinese and Vietnamese share some similarities in macro-level word order (such as SVO structure), significant differences remain in micro-level syntactic construction. For example, Vietnamese extensively uses function words (such as "đã" and "sẽ") to mark tense, while Chinese relies on context; Vietnamese modifiers are typically post-positioned (e.g., "nhà màu đỏ" meaning "red house"), while Chinese modifiers are pre-positioned. These differences make simple cross-linguistic parameter sharing or feature transfer ineffective, necessitating a novel analytical mechanism capable of explicitly modeling syntactic label memory and dynamically guiding large language models to focus on weak syntactic patterns.
[0005] Therefore, how to fully utilize the deep syntactic understanding capabilities of large language models under limited annotation resources, and how to alleviate the forgetting of cross-linguistic syntactic knowledge through effective memory enhancement mechanisms, has become a key challenge in improving dependency parsing for low-resource languages. To address these issues, this invention proposes a cross-linguistic dependency parsing method based on a hierarchical syntactic understanding and memory-enhanced large model. By constructing a multilingual dependency tag library, evaluating tag memory strength, and guiding the model to re-parse, high-precision modeling of low-resource language syntactic structures is achieved.
Summary of the Invention
[0006] The technical problem this invention aims to solve is that existing cross-linguistic dependency parsing methods exhibit significant performance degradation on low-resource languages (such as Vietnamese), primarily due to the scarcity of labeled data, insufficient memory capacity of large language models for weakly occurring dependency tags, and the lack of explicit modeling mechanisms for deep syntactic commonalities. To overcome these shortcomings, this invention provides a cross-linguistic dependency parsing method based on a hierarchical syntactic understanding and memory enhancement model. This method aims to significantly improve the accuracy of dependency parsing for low-resource languages by deeply integrating implicit multi-task learning with explicit syntactic memory guidance.
[0007] The technical solution of this invention is: a cross-linguistic dependency parsing method based on a hierarchical syntactic comprehension and memory enhancement model, the specific steps of which are as follows:
[0008] Step 1: Obtain labeled dependency treebank data for the source languages (Chinese, English) and the target language (Vietnamese) from the Universal Dependencies (UD) treebanks and perform standardized preprocessing.
[0009] Step 2: Construct a multi-task joint fine-tuning framework that trains the large language model (LLM) on cross-linguistic part-of-speech tagging and dependency parsing simultaneously. The LLM parameters are fine-tuned with low-rank adaptation (LoRA) so that grammatical knowledge of the source and target languages is implicitly aligned and the model's grammatical understanding of the low-resource language is activated.
[0010] Step 3: Use the fine-tuned large language model to perform preliminary parsing of the target language sentence, and generate an initial target language parse tree containing predicted dependency relations and part-of-speech tagging.
[0011] Step 4: Analyze the distribution of grammatical structures in the source and target language training corpora, and automatically construct a dependency tag library containing tag features, distribution frequencies, part-of-speech (POS) pairs, and grammatical example sentences. This tag library explicitly records the feature definition, frequency of occurrence, typical POS pairs, and specific bilingual explanatory example sentences for each tag.
[0012] Step 5: By analyzing the prediction accuracy of each label in the initial parse tree and its frequency in the training set, calculate the memory strength score for each dependency label. Based on this, the labels are divided into three levels: weak memory, medium memory, and strong memory, in order to identify the weak links in the model during the analysis process.
[0013] Step 6: According to the memory level of each label, retrieve the corresponding explicit grammatical knowledge from the dependency tag library and construct an enhanced prompt, which guides the fine-tuned LLM to correct the initial parse tree and yields the final target-language dependency parse tree.
[0014] Furthermore, the specific steps of Step 1 are as follows:
[0015] Step 1.1: Collect the latest Chinese, English, and Vietnamese datasets from the Universal Dependencies (UD) website.
[0016] Furthermore, Step 2 includes:
[0017] Step 2.1: Convert sentences in the source and target languages into a unified fine-tuning template that includes language type, part-of-speech sequence, and syntactic structure;
[0018] Step 2.2: Freeze the original weights of the LLM, learn the low-rank decomposition matrices through LoRA, and use the cross-entropy loss function to minimize the part-of-speech tagging loss and the dependency parsing loss during training.
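The unified fine-tuning template of Step 2.1 can be sketched as follows. The patent does not reproduce the template itself, so the field names (`instruction`, `input`, `output`), the wording of the instruction, and the sample sentence are all assumptions made for illustration only.

```python
import json

def make_example(lang, tokens):
    """Build one fine-tuning record from a parsed sentence.

    tokens is a list of (form, upos, head, deprel) tuples in sentence order.
    The record layout is a hypothetical template, not the patent's own.
    """
    sentence = " ".join(form for form, _, _, _ in tokens)
    # Part-of-speech sequence: word/TAG pairs.
    pos_seq = " ".join(f"{form}/{upos}" for form, upos, _, _ in tokens)
    # Syntactic structure: one "id, form, head, deprel" row per token.
    tree = "\n".join(
        f"{i}\t{form}\t{head}\t{deprel}"
        for i, (form, _, head, deprel) in enumerate(tokens, start=1)
    )
    return {
        "instruction": (
            f"Language: {lang}. Tag each word with its part of speech, "
            "then give its head index and dependency label."
        ),
        "input": sentence,
        "output": f"POS: {pos_seq}\nTREE:\n{tree}",
    }

# Illustrative Vietnamese sentence ("I read books"), not taken from the patent.
record = make_example("Vietnamese", [
    ("Tôi", "PRON", 2, "nsubj"),
    ("đọc", "VERB", 0, "root"),
    ("sách", "NOUN", 2, "obj"),
])
serialized = json.dumps(record, ensure_ascii=False)
```

Putting both tasks in a single output string is one way to train part-of-speech tagging and dependency parsing jointly under a single cross-entropy objective; separate records per task would be an equally plausible reading.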
[0019] Furthermore, the specific steps of Step 3 are as follows:
[0020] Step 3.1: Input the sentence to be processed in the target low-resource language into the LLM (Large Language Model) that was fine-tuned in Step 2;
[0021] Step 3.2: Utilize the grammatical knowledge acquired by LLM during the multi-task fine-tuning stage to perform preliminary part-of-speech tagging on the words in the sentence and identify potential dependency connections between words;
[0022] Step 3.3: Based on contextual semantic information and word order features, LLM assigns corresponding dependency labels (nsubj, obj, amod, etc.) to each identified dependency relationship.
[0023] Step 3.4: The model integrates the head index and relation labels of all words, automatically constructs and outputs an initial dependency parse tree for the target language. This provides a foundation for subsequent memory strength analysis and stratified enhancement.
[0024] Furthermore, Step 4 includes:
[0025] Step 4.1: Use the fine-tuned LLM model to summarize the syntactic features, usage and meaning of each dependency tag in the UD corpus and generate the feature values of the tags;
[0026] Step 4.2: Calculate the distribution ratio of each label in the training data as the frequency value, and extract the part-of-speech combination of the center word and the dependent word and its corresponding frequency;
[0027] Step 4.3: Select representative syntactic examples and their corresponding explanations for each part-of-speech combination from the corpus to form example attributes of the tag library.
[0028] Furthermore, Step 5 includes:
[0029] Step 5.1: Re-parse the training dataset using the fine-tuned LLM and calculate the prediction accuracy for each dependency label;
[0030] Step 5.2: Combining the prediction accuracy and the training distribution frequency, calculate the memory strength score of each label using the memory strength formula;
[0031] Step 5.3: Set the stratification thresholds: labels with a memory strength score below 0.6 are defined as weak memory labels, labels with a score of at least 0.6 and below 0.9 are defined as medium memory labels, and labels with a score of 0.9 or above are defined as strong memory labels.
[0032] Furthermore, Step 6 includes:
[0033] Step 6.1: For weak memory tags, simultaneously retrieve corresponding grammatical rules and similar structure examples from the dependency tag libraries of the source language and the target language for dual enhancement;
[0034] Step 6.2: For medium memory tags, retrieve relevant syntactic structure information only from the target language dependency tag library for auxiliary enhancement;
[0035] Step 6.3: For tags with strong memory, retain the initial parsing results and do not introduce additional external knowledge;
[0036] Step 6.4: Integrate the retrieved explicit grammatical knowledge, including part-of-speech matching rules and parsing examples of similar structures, into the Prompt to guide the LLM to capture cross-language syntactic commonalities and complete the final parsing.
[0037] The present invention also provides a cross-linguistic dependency parsing system based on a hierarchical syntactic understanding and memory enhancement large model, the system comprising: a module for executing the cross-linguistic dependency parsing method based on the hierarchical syntactic understanding and memory enhancement large model.
[0038] The present invention also provides an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the program to implement the cross-linguistic dependency parsing method based on a hierarchical syntactic understanding and memory-enhanced large model.
[0039] The present invention also provides a non-transitory computer-readable storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the cross-lingual dependency parsing method based on a hierarchical syntactic understanding and memory-enhanced large model.
[0040] The beneficial effects of this invention are:
[0041] 1. This invention achieves implicit alignment of syntactic knowledge between languages by combining multi-task fine-tuning of cross-language part-of-speech tagging and dependency parsing with the LoRA low-rank adaptation strategy, while freezing the original weights of LLM. This effectively alleviates the "memory forgetting" problem in low-resource language parsing caused by semantic interference in high-resource languages, laying the foundation for subsequent accurate parsing.
[0042] 2. This invention innovatively constructs a multilingual dependency tag library that includes feature descriptions, frequency statistics, POS part-of-speech pairs, and representative example sentences. It deeply integrates implicit parameter fine-tuning with explicit grammatical knowledge guidance, fully explores cross-lingual syntactic commonalities, and significantly improves the overall accuracy of low-resource language dependency parsing.
[0043] 3. This invention designs a memory strength calculation model based on label accuracy and distribution frequency, and adopts a hierarchical enhancement strategy: weak memory labels are reinforced from the bilingual tag libraries, medium memory labels are refined from the target-language library, and strong memory labels are retained directly. This specifically improves the parsing of weak grammatical structures. The method reaches the state-of-the-art (SOTA) level on four low-resource language benchmark datasets, including Vietnamese and Tamil.
Attached Figure Description
[0044] Figure 1 is a flowchart of the present invention.
Detailed Implementation
[0045] As shown in Figure 1, the cross-linguistic dependency parsing method based on a hierarchical syntactic understanding and memory-enhanced large model uses the contextual understanding capabilities of large language models (LLMs) together with an explicit grammatical knowledge enhancement strategy. The specific steps of the method are as follows:
[0046] S1: Collect labeled datasets of the source languages (Chinese, English) and the target low-resource language (Vietnamese) from Universal Dependencies (UD). These datasets include part-of-speech tags, head node indexes, and dependency relation labels. The specific steps of S1 are as follows:
[0047] S1.1: Download the labeled source language (Chinese, English) and target low-resource language (Vietnamese) datasets from the Universal Dependencies (UD) website as experimental data.
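UD treebanks are distributed in the CoNLL-U format, with ten tab-separated columns per token. A minimal reader that keeps only the fields S1 mentions (part-of-speech tag, head index, dependency label) might look like the sketch below. The function name and the sample sentence are illustrative assumptions, and a real pipeline would more likely use an existing CoNLL-U library.

```python
from typing import Dict, List

def read_conllu(text: str) -> List[List[Dict[str, object]]]:
    """Parse CoNLL-U text into sentences of token dicts."""
    sentences, current = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            # A blank line ends the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):
            continue  # sentence-level comment such as "# text = ..."
        cols = line.split("\t")
        # Skip multiword-token ranges ("1-2"), empty nodes ("1.1"),
        # and rows whose HEAD column is not a plain integer.
        if len(cols) != 10 or not cols[0].isdigit() or not cols[6].isdigit():
            continue
        current.append({
            "id": int(cols[0]),
            "form": cols[1],
            "upos": cols[3],     # universal part-of-speech tag
            "head": int(cols[6]),  # 0 marks the root
            "deprel": cols[7],   # dependency relation label
        })
    if current:
        sentences.append(current)
    return sentences

# Hand-written sample in CoNLL-U layout, not taken from a UD treebank.
SAMPLE = (
    "# text = Tôi đọc sách\n"
    "1\tTôi\ttôi\tPRON\t_\t_\t2\tnsubj\t_\t_\n"
    "2\tđọc\tđọc\tVERB\t_\t_\t0\troot\t_\t_\n"
    "3\tsách\tsách\tNOUN\t_\t_\t2\tobj\t_\t_\n"
)
sentences = read_conllu(SAMPLE)
```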
[0048] S2: A multi-task joint fine-tuning strategy is adopted, using cross-lingual part-of-speech tagging and dependency parsing as joint tasks. Low-rank adaptation (LoRA) is used to fine-tune the parameters of the large language model (LLM), achieving implicit alignment of grammatical knowledge between the source and target languages. The specific steps of S2 are as follows:
[0049] S2.1: Simultaneously set up two subtasks: cross-language part-of-speech tagging and cross-language dependency parsing;
[0050] S2.2: Transform sentences in the source and target languages into a unified fine-tuning template that includes language type, part-of-speech sequence, and syntactic structure; freeze the original weights of the LLM, learn the low-rank decomposition matrices through LoRA, and use the cross-entropy loss function to jointly optimize the part-of-speech tagging loss and the dependency parsing loss during training.
[0051] Specifically, the large model first transforms the input sentence, which includes the gold language type, part-of-speech tags, and dependency tree, into a high-dimensional feature vector $x$. It is then fine-tuned with Low-Rank Adaptation (LoRA), which achieves efficient parameter optimization by learning a set of low-rank decomposition matrices while keeping the original weights unchanged. In this experiment, QLoRA was used on Qwen2.5-14B-Instruct to significantly reduce memory consumption during fine-tuning, and LoRA was used on Qwen2.5-7B-Instruct. Assume a linear layer is defined as $h = Wx$, where $W$ is the weight matrix; LoRA modifies it to $h = Wx + BAx$, where $W \in \mathbb{R}^{d \times k}$, $B \in \mathbb{R}^{d \times r}$, $A \in \mathbb{R}^{r \times k}$, and $r \ll \min(d, k)$. This greatly reduces the number of parameters that need to be learned. The parameter settings for fine-tuning are shown in Table 1 below.
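The low-rank update can be illustrated with a small pure-Python sketch of the standard LoRA formulation, in which a frozen weight matrix W (d x k) is augmented by the product of two small trainable matrices B (d x r) and A (r x k). The dimensions, the `scale` parameter, and the random values here are illustrative assumptions; the patent's actual settings are those of its Table 1.

```python
import random

def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def lora_forward(W, A, B, x, scale=1.0):
    """Compute h = W x + scale * B (A x); W stays frozen, A and B are trainable."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + scale * u for b, u in zip(base, update)]

d, k, r = 8, 8, 2  # toy sizes; r is much smaller than d and k in practice
rng = random.Random(0)
W = [[rng.gauss(0, 1) for _ in range(k)] for _ in range(d)]     # frozen, d x k
A = [[rng.gauss(0, 0.01) for _ in range(k)] for _ in range(r)]  # trainable, r x k
B = [[0.0] * r for _ in range(d)]                                # trainable, d x r
x = [float(i) for i in range(k)]

h = lora_forward(W, A, B, x)
full_params = d * k        # parameters updated by full fine-tuning of this layer
lora_params = r * (d + k)  # parameters updated by LoRA for the same layer
```

Because B is initialised to zero, the adapted layer reproduces the frozen layer exactly before training starts, so fine-tuning begins from the pre-trained behaviour. The saving grows with layer size: for d = k = 4096 and r = 8, the trainable count is 65,536 instead of 16,777,216.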
[0052] Table 1: Parameter settings for fine-tuning the large model
[0053] The parameters are updated iteratively by minimizing the multi-task loss function to obtain the fine-tuned large language model. The cross-lingual part-of-speech tagging loss and the cross-lingual dependency parsing loss are calculated as follows:
[0054] (1)
[0055] (2)
[0056] Here P, H, L, and T denote the number of part-of-speech tags, head words, dependency relation labels, and language types, respectively. The predicted distributions are the probabilities over part-of-speech tags, head words, dependency relation labels, and language types, and each gold indicator takes the value 1 only at the index of the correct element. Finally, the parameters of the large model are optimized by minimizing the overall loss given by the following formula:
[0057] (3)
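Equations (1) to (3) are not reproduced in this text. One form that is consistent with the surrounding description (cross-entropy over part-of-speech tags, head words, and dependency relation labels, with one-hot gold indicators) is sketched below. The symbols $y$, $p$, and the loss names are assumptions, the placement of the language-type term over T is not recoverable and is omitted, and the patent's own formulas may differ.

```latex
\begin{aligned}
\mathcal{L}_{\mathrm{pos}} &= -\sum_{i=1}^{P} y^{\mathrm{pos}}_{i}\,\log p^{\mathrm{pos}}_{i} \\
\mathcal{L}_{\mathrm{dep}} &= -\sum_{j=1}^{H} y^{\mathrm{head}}_{j}\,\log p^{\mathrm{head}}_{j}
                              \;-\; \sum_{l=1}^{L} y^{\mathrm{rel}}_{l}\,\log p^{\mathrm{rel}}_{l} \\
\mathcal{L} &= \mathcal{L}_{\mathrm{pos}} + \mathcal{L}_{\mathrm{dep}}
\end{aligned}
```

Since each $y$ is 1 at the correct index and 0 elsewhere, every sum reduces to the negative log-probability that the model assigns to the gold tag, head, or label.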
[0058] S3: Use the fine-tuned large language model to perform preliminary parsing of the target language sentence, generating an initial dependency parse tree of the target language containing predicted dependency relations and part-of-speech tagging. The specific steps of S3 are as follows:
[0059] S3.1: Input the target language text to be processed into the LLM fine-tuned in Step 2. In this step, the target low-resource language sentence to be parsed is formatted according to a predefined template, and specific language type identifiers and parsing instructions are added as the input sequence. Then, this sequence is passed to the large language model (LLM) that has been fine-tuned through multiple tasks, and the model's inference engine is started, enabling it to perform deep decoding of the target language text space based on the cross-language grammatical representations and semantic mapping capabilities learned in the fine-tuning stage.
[0060] S3.2: Utilizing the grammatical knowledge learned by the LLM during the multi-task fine-tuning stage, preliminary part-of-speech tagging is performed on the words in the sentence, and potential dependency connections between words are identified. Based on contextual semantic information and word order features, the LLM assigns corresponding dependency labels (nsubj, obj, amod, etc.) to each identified dependency pair. The model integrates the head index and relation labels of all words to automatically construct and output an initial dependency parse tree for the target language. This provides a foundation for subsequent memory strength analysis and stratified enhancement.
[0061] More specifically, the model's activated grammatical understanding is used to generate an initial target-language parse containing dependency arc indices and relation labels. The fine-tuned model captures hierarchical, structured relationships between target-language words through its self-attention mechanism. First, it predicts the head word index of each word, which outlines the dependency arc skeleton of the sentence. Next, it assigns a dependency relation label to each arc, such as "nsubj" (nominal subject), "obj" (object), or "amod" (adjectival modifier). Finally, the generated indices and labels are integrated to construct an initial dependency parse tree conforming to the Universal Dependencies (UD) standard. The parse records the dependency relations between words and serves as the baseline for the subsequent calculation of tag memory strength and hierarchical knowledge enhancement.
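Head indices generated by a language model are not guaranteed to form a tree. The patent does not describe a validation step, so the following is only a sketch of a check one might add before S5: exactly one root, every head index within the sentence, no self-loops, and no cycles. The function name is hypothetical.

```python
from typing import List

def is_valid_tree(heads: List[int]) -> bool:
    """Check that heads encode a single-rooted tree.

    heads[i] is the 1-based head index of token i + 1; 0 marks the root.
    """
    n = len(heads)
    if n == 0 or heads.count(0) != 1:
        return False  # empty sentence, no root, or several roots
    for i, h in enumerate(heads):
        if h < 0 or h > n or h == i + 1:
            return False  # head out of range or token attached to itself
    # Walk from every token up to the root; revisiting a token means a cycle.
    for start in range(1, n + 1):
        seen = set()
        node = start
        while node != 0:
            if node in seen:
                return False
            seen.add(node)
            node = heads[node - 1]
    return True

ok = is_valid_tree([2, 0, 2])          # every token reaches the root at token 2
no_root = is_valid_tree([2, 3, 1])     # no token has head 0
cycle = is_valid_tree([2, 0, 4, 3])    # tokens 3 and 4 point at each other
```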
[0062] S4: Statistically analyze the grammatical structure distribution in the source and target language training corpora, and automatically construct a dependency tag library containing label features, distribution frequencies, part-of-speech pair information, and grammatical example sentences; the specific steps of S4 are as follows:
[0063] S4.1: Count the frequency of each dependency label in the training set. From the Universal Dependencies (UD) training corpus, the total count of each dependency label (such as "nsubj" or "obj") is computed for both the source and target languages. The proportion of each label in the training data is calculated and defined as its frequency value. This metric reflects how much exposure the model has had to a specific grammatical structure during the pre-training and fine-tuning phases.
[0064] S4.2: Use the fine-tuned LLM to summarize the syntactic features, usage, and meaning of each tag. The fine-tuned LLM is used as a knowledge extractor; given a specific dependency tag, it is guided to automatically summarize and conclude the syntactic features, semantic functions, and usage rules of that tag in a specific context, and these are used as feature attribute values in the tag library.
[0065] S4.3: Extract part-of-speech tag pairs and their frequency distribution. For each dependency tag in the corpus, extract the part-of-speech tag pairs (POS pairs) of the head and dependent words connected to it. Record the frequency of each POS combination and quantify the probability distribution of dependency relationships between different POS pairs.
[0066] S4.4: Select representative bilingual parsing examples and explanations. For each part-of-speech pair, representative typical sentences are selected from the corpus. The dependency relationships in these examples are explained in detail using a fine-tuned LLM, forming example attributes in the tag library, thus providing an explicit contextual reference scheme for subsequent reasoning stages.
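The statistical parts of S4 (label frequency, part-of-speech pairs, example sentences) can be sketched as below. The dictionary layout and the function name are assumptions. Examples are stored per label rather than per part-of-speech pair to keep the sketch short, and the `features` field is left empty because S4.2 fills it with an LLM-generated summary.

```python
from collections import Counter, defaultdict

def build_tag_library(sentences, max_examples=1):
    """Build {label: {frequency, pos_pairs, examples, features}} from parsed sentences.

    Each sentence is a list of token dicts with keys id, form, upos, head, deprel.
    """
    label_counts = Counter()
    pos_pairs = defaultdict(Counter)
    examples = defaultdict(list)
    for sent in sentences:
        upos_by_id = {tok["id"]: tok["upos"] for tok in sent}
        text = " ".join(tok["form"] for tok in sent)
        for tok in sent:
            label = tok["deprel"]
            label_counts[label] += 1
            # The root token has head 0, which has no part-of-speech tag.
            head_pos = upos_by_id.get(tok["head"], "ROOT")
            pos_pairs[label][(head_pos, tok["upos"])] += 1
            if len(examples[label]) < max_examples:
                examples[label].append(text)
    total = sum(label_counts.values())
    return {
        label: {
            "frequency": count / total,           # share of all labels (S4.1)
            "pos_pairs": dict(pos_pairs[label]),  # (head POS, dependent POS) counts (S4.3)
            "examples": examples[label],          # representative sentences (S4.4)
            "features": None,                     # placeholder for the S4.2 summary
        }
        for label, count in label_counts.items()
    }

# Illustrative one-sentence corpus, not taken from a UD treebank.
corpus = [[
    {"id": 1, "form": "Tôi", "upos": "PRON", "head": 2, "deprel": "nsubj"},
    {"id": 2, "form": "đọc", "upos": "VERB", "head": 0, "deprel": "root"},
    {"id": 3, "form": "sách", "upos": "NOUN", "head": 2, "deprel": "obj"},
]]
library = build_tag_library(corpus)
```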
[0067] S5: By analyzing the prediction accuracy of each label in the initial parse tree and its frequency in the training set, calculate the memory strength score of each dependency label. Based on this score, the labels are divided into three levels: weak memory, medium memory, and strong memory. The specific steps of S5 are as follows:
[0068] S5.1: Calculate the prediction accuracy for each label based on the performance of LLM on the training set. The LLM, fine-tuned in S2, is used to re-parse sentences in the training set, and its prediction results are compared with the gold standard labels. The prediction accuracy of each dependency label on the training set is calculated. This is used to measure the model's current level of mastery of a specific grammar.
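Per-label accuracy can be computed by grouping tokens by their gold label and counting how many were predicted correctly. The patent does not state whether "correct" requires the head as well as the label to match; the sketch below assumes both must match, and the function name and data are illustrative.

```python
from collections import Counter

def per_label_accuracy(gold, pred):
    """Return {gold label: accuracy}.

    gold and pred are aligned lists of (head, deprel) pairs, one per token.
    A token counts as correct only if both head and label match (an assumption).
    """
    total, correct = Counter(), Counter()
    for (g_head, g_label), (p_head, p_label) in zip(gold, pred):
        total[g_label] += 1
        if g_head == p_head and g_label == p_label:
            correct[g_label] += 1
    return {label: correct[label] / total[label] for label in total}

gold = [(2, "nsubj"), (0, "root"), (2, "obj"), (5, "nsubj"), (0, "root")]
pred = [(2, "nsubj"), (0, "root"), (2, "obj"), (5, "obj"), (0, "root")]
accuracy = per_label_accuracy(gold, pred)
```

Here one of the two gold "nsubj" tokens is mislabeled, so "nsubj" scores 0.5 while "root" and "obj" score 1.0.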
[0069] S5.2: Substitute the distribution frequency obtained in S4.1 and the prediction accuracy obtained in S5.1 into the memory strength formula to calculate the memory strength score, where the memory factor λ balances the relative weights of frequency and accuracy.
[0070] S5.3: Based on the scores, the tags are divided into three levels. At the weak memory level (score below 0.6), the model understands the label poorly and is easily affected by semantic interference, which leads to forgetting. At the medium memory level (score of at least 0.6 and below 0.9), the model has a basic grasp of the label, but its parsing accuracy still has room for improvement. At the strong memory level (score of 0.9 or above), the model has a firm grasp of the grammatical structure and does not require additional external knowledge.
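The stratification can be sketched with the 0.6 and 0.9 thresholds given in Step 5.3. The memory strength formula itself is not reproduced in this text, so the sketch takes already-computed scores as input; the score values and function names are hypothetical.

```python
def memory_level(score, weak_below=0.6, strong_from=0.9):
    """Map a memory strength score to weak, medium, or strong."""
    if score < weak_below:
        return "weak"
    if score < strong_from:
        return "medium"
    return "strong"

def stratify(scores):
    """Group labels by memory level; labels are sorted for a stable order."""
    levels = {"weak": [], "medium": [], "strong": []}
    for label, score in sorted(scores.items()):
        levels[memory_level(score)].append(label)
    return levels

# Hypothetical scores, chosen to hit both boundary values.
scores = {"nsubj": 0.95, "obl:agent": 0.41, "nmod": 0.6, "amod": 0.9}
levels = stratify(scores)
```

A score of exactly 0.6 falls in the medium level and a score of exactly 0.9 in the strong level, matching the inclusive lower bounds of Step 5.3.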
[0071] S6: According to the memory level of each label, retrieve the corresponding explicit grammatical knowledge from the dependency tag library and construct an enhanced prompt, guiding the fine-tuned LLM to correct the initial parse tree and yielding the final target-language dependency parse tree. The specific steps of S6 are as follows:
[0072] S6.1: Retrieve external knowledge based on memory hierarchy. According to the tag memory strength hierarchy determined in S5, targeted enhancement information is retrieved from the dependency tag library. For weak memory tags, since the model has insufficient understanding of them, the system retrieves features, part-of-speech pairs, and representative example sentences from the tag libraries of both the source language (Chinese and English) and the target language (Vietnamese), using the mature grammatical patterns of the source language to assist in understanding. For medium memory tags, relevant knowledge is retrieved only from the target language tag library for targeted correction. For strong memory tags, since the model already has high confidence, the initial parsing results generated in S3 are directly retained without introducing additional external intervention.
[0073] S6.2: Construct the enhanced prompt. In this step, the system integrates the retrieved explicit grammatical knowledge (including tag definitions, high-frequency part-of-speech combinations, and bilingual parsing examples) with the initial target-language dependency parse tree generated in S3. Following specific instruction templates, this external knowledge is converted into contextual guidance that the large model can understand. In this way, abstract grammatical rules become concrete prompts that give the model clear references for error correction, guiding the LLM to capture syntactic commonalities across languages.
[0074] S6.3: Final parsing and correction using the LLM. The constructed enhanced prompt is input into the fine-tuned LLM. Guided by the external explicit grammatical knowledge, the model re-evaluates the weak links and erroneous dependency relations identified in the initial parse tree. Using its contextual understanding together with the rules and example sentences in the prompt, the LLM corrects the initially predicted dependency arcs and labels, and outputs a high-precision final target-language dependency tree that conforms to the Universal Dependencies (UD) standard.
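The level-dependent retrieval of S6.1 and the prompt assembly of S6.2 can be sketched as follows. The prompt wording, the library layout (`features`, `example`), the treatment of labels without a recorded level as strong, and the sample data are all assumptions; the patent does not disclose its instruction templates.

```python
def retrieve(label, level, source_lib, target_lib):
    """Return (library name, entry) pairs according to the label's memory level."""
    if level == "weak":
        libs = [("source", source_lib), ("target", target_lib)]  # dual enhancement
    elif level == "medium":
        libs = [("target", target_lib)]  # target-language library only
    else:
        libs = []  # strong: keep the initial parse, no external knowledge
    return [(name, lib[label]) for name, lib in libs if label in lib]

def build_prompt(sentence, initial_parse, levels, source_lib, target_lib):
    """Combine the initial parse with retrieved knowledge into one prompt string."""
    lines = ["Sentence: " + sentence, "Initial parse (id, form, head, deprel):"]
    lines += ["{}\t{}\t{}\t{}".format(*row) for row in initial_parse]
    lines.append("Reference knowledge:")
    seen = set()
    for _, _, _, label in initial_parse:
        if label in seen:
            continue  # describe each label once
        seen.add(label)
        level = levels.get(label, "strong")  # assumption: unknown labels are left alone
        for name, entry in retrieve(label, level, source_lib, target_lib):
            lines.append("[{} | {} | {}] {} Example: {}".format(
                label, level, name, entry["features"], entry["example"]))
    lines.append("Correct the head and deprel columns where needed "
                 "and return the parse in the same format.")
    return "\n".join(lines)

# Hypothetical tag-library entries and memory levels.
source_lib = {"amod": {"features": "Adjective modifying a noun; it precedes the noun.",
                       "example": "red house: amod(house, red)"}}
target_lib = {
    "amod": {"features": "Adjective modifying a noun; it follows the noun.",
             "example": "nhà đỏ: amod(nhà, đỏ)"},
    "nsubj": {"features": "Nominal subject of a predicate.",
              "example": "Tôi đọc: nsubj(đọc, Tôi)"},
}
levels = {"amod": "weak", "nsubj": "medium", "root": "strong"}
initial_parse = [(1, "Tôi", 2, "nsubj"), (2, "thích", 0, "root"),
                 (3, "nhà", 2, "obj"), (4, "đỏ", 3, "amod")]
prompt = build_prompt("Tôi thích nhà đỏ", initial_parse, levels, source_lib, target_lib)
```

The resulting string would then be sent to the fine-tuned LLM for the correction pass of S6.3.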
[0075] The present invention also provides a cross-linguistic dependency parsing system based on a hierarchical syntactic understanding and memory enhancement large model, the system comprising: a module for executing the cross-linguistic dependency parsing method based on the hierarchical syntactic understanding and memory enhancement large model.
[0076] The present invention also provides an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the program to implement the cross-linguistic dependency parsing method based on a hierarchical syntactic understanding and memory-enhanced large model.
[0077] The present invention also provides a non-transitory computer-readable storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the cross-lingual dependency parsing method based on a hierarchical syntactic understanding and memory-enhanced large model.
[0078] This invention selects Chinese and English as source languages and Vietnamese as the target language from the Universal Dependencies (UD) treebanks for experiments. The labeled attachment score (LAS) and the unlabeled attachment score (UAS) are used as performance evaluation metrics for the model. The calculation formulas are as follows:
[0079] (4)
[0080] (5)
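Under the standard definitions of these metrics, UAS is the percentage of tokens whose predicted head is correct, and LAS is the percentage of tokens whose predicted head and dependency label are both correct. A minimal sketch follows; the function name and sample data are illustrative, and details such as punctuation handling may differ from the evaluation actually used.

```python
def attachment_scores(gold, pred):
    """Return (UAS, LAS) as percentages.

    gold and pred are aligned lists of (head, deprel) tuples, one per token.
    """
    if not gold or len(gold) != len(pred):
        raise ValueError("gold and pred must be non-empty and of equal length")
    n = len(gold)
    # UAS: only the head index has to match.
    uas_hits = sum(g[0] == p[0] for g, p in zip(gold, pred))
    # LAS: head index and dependency label both have to match.
    las_hits = sum(g == p for g, p in zip(gold, pred))
    return 100.0 * uas_hits / n, 100.0 * las_hits / n

gold = [(2, "nsubj"), (0, "root"), (2, "obj"), (3, "amod")]
pred = [(2, "nsubj"), (0, "root"), (2, "obl"), (2, "amod")]
uas, las = attachment_scores(gold, pred)
```

In this example three of four heads are correct (UAS 75.0) and two of four tokens have both head and label correct (LAS 50.0). Every token that counts toward LAS also counts toward UAS, so LAS cannot exceed UAS when both are measured on the same data.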
[0081] To demonstrate the effectiveness of this invention, this embodiment uses different configurations of the Qwen2.5-7B-Instruct and Qwen2.5-14B-Instruct large models as comparison baselines. The baselines are Zero-shot (the pre-trained large model is used directly for zero-shot inference) and LoRA and QLoRA (parameter-efficient fine-tuning of the large model on target language data, with LoRA for the 7B model and QLoRA for the 14B model), which serve as the basic fine-tuning baselines. Our Method builds on the fine-tuned model and adds explicit dependency tag library guidance and the hierarchical memory enhancement strategy.
[0082] Table 2: Experimental Results
[0083] Table 3
[0084]
[0085] The experimental results of this invention are shown in Tables 2 and 3. They show that the proposed method significantly improves the performance of both Qwen models on the Vietnamese task. For the Qwen2.5-7B model, the method raises the LAS score from 63.26% (basic fine-tuning baseline) to 79.35%. For the larger Qwen2.5-14B model, the method achieves the best performance, with an LAS score of 83.14% and a UAS score of 68.51%. This indicates that the strategy applies to models of different sizes, and that the gain in syntactic parsing grows with the number of model parameters. The Zero-shot, One-shot, and Five-shot settings belong to prompt learning: they test the model's understanding of Vietnamese dependency structures by providing 0, 1, and 5 correct parsing examples as references, respectively, without changing the parameters of the large model. QLoRA is a parameter-efficient fine-tuning technique that uses 8-bit quantization and low-rank matrix updates to achieve implicit alignment of cross-language knowledge with low memory usage, and serves as a key baseline in this study. Our method builds on QLoRA fine-tuning and adds explicit dependency tag library guidance and the hierarchical memory enhancement strategy. By calculating tag memory strength and specifically reinforcing weak structures, the Qwen2.5-14B model achieves state-of-the-art performance of 83.14% LAS on the Vietnamese task.
[0086] In addition, Table 4 shows the impact of the influence factor in the memory strength formula on model performance. When this factor is set to 60, the model achieves a balance between frequency and accuracy and is better able to identify and reinforce weak memory labels.
[0087] Table 4: The impact of hyperparameters on Vietnamese parsing performance
[0088] Further error analysis showed that by combining explicit tag library guidance, the accuracy of tags that originally belonged to the weak memory level in Vietnamese (such as nmod and amod) was significantly improved after optimization. In summary, this invention effectively alleviates cross-linguistic semantic interference and significantly improves the accuracy and robustness of Vietnamese dependency parsing by performing hierarchical memory enhancement on the Qwen series models.
[0089] The specific embodiments of the present invention have been described in detail above with reference to the accompanying drawings. However, the present invention is not limited to the above embodiments. Within the scope of knowledge possessed by those skilled in the art, various changes can be made without departing from the spirit of the present invention.
Claims
1. A cross-linguistic dependency parsing method based on hierarchical syntactic understanding and a memory-enhanced large model, characterized in that the method comprises the following steps: Step 1: Collect labeled datasets of the source language and the target low-resource language from the Universal Dependencies (UD) dataset, the datasets including part-of-speech tags, head node indexes, and dependency relation labels; Step 2: Adopt a multi-task joint fine-tuning strategy with cross-linguistic part-of-speech tagging and dependency parsing as joint tasks, and fine-tune the parameters of the large language model (LLM) through low-rank adaptation (LoRA) to achieve implicit alignment of grammatical knowledge between the source language and the target language; Step 3: Use the fine-tuned LLM to perform preliminary parsing of the target language sentence and generate an initial dependency parse tree of the target language containing the predicted dependency relations and part-of-speech tags; Step 4: Statistically analyze the distribution of grammatical structures in the source and target language training corpora, and automatically construct a dependency tag library containing tag features, distribution frequency, part-of-speech pair information, and grammatical example sentences; Step 5: By analyzing the prediction accuracy of each label in the initial parse tree and its frequency in the training set, calculate a memory strength score for each dependency label, and on this basis divide the labels into three levels: weak memory, medium memory, and strong memory; Step 6: According to the memory level of each label, retrieve the corresponding explicit grammatical knowledge from the dependency tag library to construct enhanced prompts, which guide the fine-tuned LLM to correct the initial parse tree, yielding the final dependency parse tree of the target language.
2. The cross-linguistic dependency parsing method based on hierarchical syntactic understanding and a memory-enhanced large model according to claim 1, characterized in that Step 1 includes: Step 1.1: Collect the latest Chinese, English, and Vietnamese datasets from the official website of the Universal Dependencies (UD) dataset.
3. The cross-linguistic dependency parsing method based on hierarchical syntactic understanding and a memory-enhanced large model according to claim 1, characterized in that Step 2 includes: Step 2.1: Convert sentences in the source and target languages into a unified fine-tuning template that includes language type, part-of-speech sequence, and syntactic structure; Step 2.2: Freeze the original weights of the LLM, learn the low-rank decomposition matrices through LoRA, and use the cross-entropy loss function to minimize the part-of-speech tagging loss and the dependency parsing loss during training.
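The unified template of Step 2.1 can be pictured with the following minimal sketch. The claim only states that the template carries the language type, the part-of-speech sequence, and the syntactic structure; the concrete prompt wording, the field names (`form`, `upos`, `head`, `deprel`), and the function `build_example` are hypothetical choices made for illustration.

```python
from typing import Dict, List


def build_example(language: str, tokens: List[Dict]) -> Dict[str, str]:
    """Turn one annotated sentence into a prompt/response pair (hypothetical layout)."""
    words = [t["form"] for t in tokens]
    # Part-of-speech sequence, written as word/TAG pairs.
    pos_line = " ".join(f'{t["form"]}/{t["upos"]}' for t in tokens)
    # Syntactic structure: one line per token with index, word, head index, relation.
    dep_line = "\n".join(
        f'{i}\t{t["form"]}\t{t["head"]}\t{t["deprel"]}'
        for i, t in enumerate(tokens, start=1)
    )
    prompt = (
        f"Language: {language}\n"
        f"Sentence: {' '.join(words)}\n"
        "Task: tag each word with its part of speech, then give its head index "
        "and dependency relation."
    )
    response = f"POS: {pos_line}\nDependencies:\n{dep_line}"
    return {"prompt": prompt, "response": response}


# Toy Vietnamese sentence with UD-style annotations (illustrative only).
tokens = [
    {"form": "Tôi", "upos": "PRON", "head": 2, "deprel": "nsubj"},
    {"form": "ăn", "upos": "VERB", "head": 0, "deprel": "root"},
    {"form": "cơm", "upos": "NOUN", "head": 2, "deprel": "obj"},
]
example = build_example("Vietnamese", tokens)
print(example["prompt"])
print(example["response"])
```

Putting both tasks into one response is one way to train part-of-speech tagging and dependency parsing jointly with a single cross-entropy objective over the generated text; other layouts would serve equally well.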
4. The cross-linguistic dependency parsing method based on hierarchical syntactic understanding and a memory-enhanced large model according to claim 1, characterized in that Step 3 includes: Step 3.1: Input the sentence to be processed in the target low-resource language into the LLM fine-tuned in Step 2; Step 3.2: Use the grammatical knowledge acquired by the LLM during the multi-task fine-tuning stage to perform preliminary part-of-speech tagging on the words in the sentence and identify potential dependency connections between words; Step 3.3: Based on contextual semantic information and word order features, the LLM assigns a dependency label to each identified dependency relation; Step 3.4: The model integrates the head node indexes and relation labels of all words, and automatically constructs and outputs an initial dependency parse tree for the target language, which provides the basis for the subsequent memory strength analysis and hierarchical enhancement.
5. The cross-linguistic dependency parsing method based on hierarchical syntactic understanding and a memory-enhanced large model according to claim 1, characterized in that Step 4 includes: Step 4.1: Use the fine-tuned LLM to summarize the syntactic features, usage, and meaning of each dependency tag in the UD corpus and generate the feature values of the tags; Step 4.2: Calculate the distribution ratio of each label in the training data as its frequency value, and extract the part-of-speech combinations of the head word and the dependent word together with their corresponding frequencies; Step 4.3: Select representative syntactic examples and their corresponding explanations for each part-of-speech combination from the corpus to form the example attributes of the tag library.
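The statistical part of the tag library (Step 4.2) can be sketched as below. This is a minimal illustration that assumes each token is given as a `(upos, head index, deprel)` tuple with head index 0 marking the root; the function name `build_tag_statistics`, the `"ROOT"` placeholder, and the dictionary layout are assumptions. The LLM-generated feature descriptions of Step 4.1 and the example sentences of Step 4.3 are not modeled here.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple

# One token: (upos, head index, deprel); head index 0 marks the root.
Token = Tuple[str, int, str]


def build_tag_statistics(sentences: List[List[Token]]) -> Dict[str, dict]:
    """Compute label frequencies and head/dependent part-of-speech pair ratios."""
    label_counts: Counter = Counter()
    pair_counts: Dict[str, Counter] = defaultdict(Counter)
    for sent in sentences:
        for upos, head, deprel in sent:
            label_counts[deprel] += 1
            # Head indexes are 1-based; the root has no head word.
            head_pos = "ROOT" if head == 0 else sent[head - 1][0]
            pair_counts[deprel][(head_pos, upos)] += 1
    total = sum(label_counts.values())
    library = {}
    for label, count in label_counts.items():
        library[label] = {
            # Share of all arcs in the corpus that carry this label.
            "frequency": count / total,
            # Share of this label's arcs per (head POS, dependent POS) pair.
            "pos_pairs": {pair: n / count for pair, n in pair_counts[label].items()},
        }
    return library


# Two toy sentences (illustrative only).
corpus = [
    [("PRON", 2, "nsubj"), ("VERB", 0, "root"), ("NOUN", 2, "obj")],
    [("NOUN", 2, "nsubj"), ("VERB", 0, "root"), ("NOUN", 2, "obj"), ("ADJ", 3, "amod")],
]
library = build_tag_statistics(corpus)
print(library["nsubj"])
```

Running the same function separately on the source language and target language corpora would give one library per language, which is the form the retrieval in Step 6 relies on.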
6. The cross-linguistic dependency parsing method based on hierarchical syntactic understanding and a memory-enhanced large model according to claim 1, characterized in that Step 5 includes: Step 5.1: Re-parse the training dataset using the fine-tuned LLM and calculate the prediction accuracy for each dependency label; Step 5.2: Combining the prediction accuracy and the training distribution frequency, calculate the memory strength score for each label using the memory strength formula; Step 5.3: Set the stratification thresholds: labels with a memory strength score less than 0.6 are defined as weak memory labels, labels with a score greater than or equal to 0.6 and less than 0.9 are defined as medium memory labels, and labels with a score greater than or equal to 0.9 are defined as strong memory labels.
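The stratification of Step 5.3 can be sketched as follows, using the 0.6 and 0.9 thresholds stated in the claim. The sketch takes memory strength scores as given and does not reproduce the memory strength formula of Step 5.2; the names `memory_level` and `stratify` and the example scores are hypothetical.

```python
from typing import Dict

WEAK_THRESHOLD = 0.6    # scores below this are weak memory labels
STRONG_THRESHOLD = 0.9  # scores at or above this are strong memory labels


def memory_level(score: float) -> str:
    """Map one memory strength score to its level."""
    if score < WEAK_THRESHOLD:
        return "weak"
    if score < STRONG_THRESHOLD:
        return "medium"
    return "strong"


def stratify(scores: Dict[str, float]) -> Dict[str, str]:
    """Assign a memory level to every dependency label."""
    return {label: memory_level(score) for label, score in scores.items()}


# Hypothetical scores for illustration; not values from the experiments.
levels = stratify({"nmod": 0.42, "amod": 0.6, "obj": 0.89, "nsubj": 0.9})
print(levels)
```

The boundary values are the cases worth checking: a score of exactly 0.6 falls in the medium level and a score of exactly 0.9 falls in the strong level.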
7. The cross-linguistic dependency parsing method based on hierarchical syntactic understanding and a memory-enhanced large model according to claim 1, characterized in that Step 6 includes: Step 6.1: For weak memory labels, retrieve the corresponding grammatical rules and similar structure examples from the dependency tag libraries of both the source language and the target language for dual enhancement; Step 6.2: For medium memory labels, retrieve relevant syntactic structure information only from the target language dependency tag library for auxiliary enhancement; Step 6.3: For strong memory labels, retain the initial parsing results and do not introduce additional external knowledge; Step 6.4: Integrate the retrieved explicit grammatical knowledge, including part-of-speech matching rules and parsing examples of similar structures, into the prompt to guide the LLM to capture cross-language syntactic commonalities and complete the final parsing.
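The level-dependent retrieval of Steps 6.1 to 6.4 can be sketched as below. This is a minimal illustration under stated assumptions: each tag library is reduced to a mapping from label to a list of knowledge strings, the initial parse is a list of `(word, head index, label)` tuples, every label in the parse already has a memory level, and the prompt wording is invented for the example. The names `retrieve_knowledge` and `build_enhanced_prompt` are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

Library = Dict[str, List[str]]    # label -> rules and examples (simplified)
ParsedToken = Tuple[str, int, str]  # (word, head index, dependency label)


def retrieve_knowledge(label: str, level: str,
                       source_lib: Library, target_lib: Library) -> List[str]:
    """Select explicit knowledge for one label according to its memory level."""
    if level == "weak":
        # Dual enhancement: source language and target language libraries.
        return source_lib.get(label, []) + target_lib.get(label, [])
    if level == "medium":
        # Auxiliary enhancement: target language library only.
        return list(target_lib.get(label, []))
    return []  # strong: no external knowledge is introduced


def build_enhanced_prompt(sentence: str, initial_parse: List[ParsedToken],
                          levels: Dict[str, str],
                          source_lib: Library, target_lib: Library) -> Optional[str]:
    """Return a re-parsing prompt, or None when the initial parse is kept as is."""
    notes: List[str] = []
    seen = set()
    for _, _, label in initial_parse:
        if label in seen:
            continue
        seen.add(label)
        for item in retrieve_knowledge(label, levels[label], source_lib, target_lib):
            notes.append(f"[{label}] {item}")
    if not notes:
        return None
    parse_text = "\n".join(f"{i}\t{w}\t{h}\t{l}"
                           for i, (w, h, l) in enumerate(initial_parse, start=1))
    return ("Sentence: " + sentence + "\nInitial parse:\n" + parse_text +
            "\nReference knowledge:\n" + "\n".join(notes) +
            "\nCorrect the initial parse where the reference knowledge applies.")


# Toy libraries and parse (illustrative only).
source_lib = {"nmod": ["EN: a noun modifying another noun, head POS NOUN"]}
target_lib = {"nmod": ["VI: the modifier noun follows its head"],
              "obj": ["VI: the object follows the verb"]}
levels = {"nsubj": "strong", "root": "strong", "obj": "medium", "nmod": "weak"}
parse = [("Tôi", 2, "nsubj"), ("ăn", 0, "root"), ("cơm", 2, "obj"), ("gà", 3, "nmod")]
prompt = build_enhanced_prompt("Tôi ăn cơm gà", parse, levels, source_lib, target_lib)
print(prompt)
```

Returning `None` when every label is strong mirrors Step 6.3, where the initial result is retained without a second pass through the model.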
8. A cross-linguistic dependency parsing system based on hierarchical syntactic understanding and a memory-enhanced large model, characterized in that the system includes a module for performing the cross-linguistic dependency parsing method based on hierarchical syntactic understanding and a memory-enhanced large model as described in any one of claims 1 to 7.
9. An electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that, when the processor executes the program, it implements the cross-linguistic dependency parsing method based on hierarchical syntactic understanding and a memory-enhanced large model as described in any one of claims 1 to 7.
10. A non-transitory computer-readable storage medium having a computer program stored thereon, characterized in that, when the computer program is executed by a processor, it implements the cross-linguistic dependency parsing method based on hierarchical syntactic understanding and a memory-enhanced large model as described in any one of claims 1 to 7.