A Chinese grammar correction method for large language models based on self-rethinking
By enhancing the data synthesis, fine-tuning, and inference stages, the method alleviates the overcorrection problem of large language models in Chinese grammar correction and improves the accuracy and stability of error correction. The generated data is more consistent with the real error distribution, and the cost of manual annotation is reduced.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- GUIZHOU UNIV
- Filing Date
- 2026-03-25
- Publication Date
- 2026-06-19
AI Technical Summary
Large language models are prone to overcorrection in Chinese grammar correction tasks, and existing methods cannot effectively solve this problem.
The method is a Chinese grammar correction method for large language models based on self-rethinking. It enhances the data synthesis, fine-tuning, and inference stages through chain-of-thought data synthesis, two-stage fine-tuning, and control of Top-p and Temperature, and then generates and selects appropriate correction answers.
It effectively alleviates the overcorrection problem of large language models, improves the accuracy and stability of Chinese grammar correction, reduces the cost of manual annotation, and generates data that better reflects the real error distribution.
Figure CN122242644A_ABST
Description
Technical Field
[0001] The present invention belongs to the technical field of electronic digital data processing, and specifically relates to a Chinese grammar correction method for large language models based on self-rethinking.
Background Art
[0002] The task of grammatical error correction (GEC) is to identify and correct grammar errors in sentences. The goal is to modify the sentence as little as possible without changing its original meaning. GEC can serve as a preprocessing step for various tasks (language learning, automatic speech recognition, text data annotation) and can serve industries such as education, media, and publishing. Chinese grammatical error correction (CGEC) focuses on Chinese grammar errors.
[0003] The types of Chinese grammar errors can be roughly divided into seven categories, namely lexical collocation errors, component omission, component redundancy, structural confusion, word order errors, illogicality, and ambiguity. Examples of the seven error types are shown in Table 1. In the first row, "clever and capable" and "hands" cannot be put together for collocation; in the second row, there should be a referential object after "has", otherwise the sentence pattern is incomplete; in the third row, both "about" and "around" can express the same meaning, and putting them together seems redundant; in the fourth row, either "the reason" or "caused by" can make the sentence structure complete, but using both of them together instead makes the sentence structure chaotic; in the fifth row, according to the semantics, one should "recognize" first and then "correct"; in the sixth row, "prevent" and "not occur" form a double negative, making the logic unclear and deviating from the original meaning of the sentence; in the seventh row, the reference of "the one who sees a doctor" is unclear, and it is impossible to determine whether it refers to the patient or the doctor here.
[0004] Table 1 Examples of Chinese Grammar Error Types
[0005]
[0006] Therefore, Chinese grammar error correction is a very challenging task. In actual application scenarios, it may also be necessary to process compound sentences that contain multiple error types. Chinese grammar errors can be roughly divided into two major categories according to their sources: one is the errors made by learners who learn Chinese as a second language; the other is the errors made by native speakers who use Chinese as their mother tongue. Table 1 shows the errors made by native speakers. The errors often made by learners are similar to "Goodbye, I have a lot of homework to do.", and most of these errors are simple. This sentence only needs to delete the word "have" to eliminate the error.
[0007] To meet the demands of real-world applications such as news review, official document proofreading, and announcement checking, models must be able to adapt to errors made by native speakers, thus more accurately reflecting actual Chinese usage. Large Language Models (LLMs) have demonstrated powerful capabilities across various domains, but their highly open-ended responses often lead to severe overcorrection (modifying correct parts of the original sentence). Overcoming this problem is crucial for exploring the application of LLMs in the CGEC task.
[0008] The paper "Yaxin Fan, Feng Jiang, Peifeng Li, Haizhou Li, 'GrammarGPT: Exploring Open-Source LLMs for Native Chinese Grammatical Error Correction with Supervised Fine-Tuning'", presented at the CCF International Conference on Natural Language Processing and Chinese Computing on October 12, 2023, discloses a method using LLMs to synthesize data and performs supervised fine-tuning (SFT) on open-source LLMs to validate its approach. GrammarGPT utilizes a hybrid dataset of data generated by ChatGPT and manually annotated data. For grammatical errors with clues, GrammarGPT guides ChatGPT to generate grammatically incorrect sentences by providing these clues. For grammatical errors without clues, GrammarGPT collects grammatically incorrect sentences from publicly available websites and manually corrects them. Furthermore, the paper employs an error-invariant augmentation method, replacing named entities in parallel data with similar named entities, allowing the model to focus on identifying unchanged errors rather than specific nouns. This paper fine-tunes a large language model using a hybrid dataset obtained through the aforementioned method. The main approach involves extracting error cues to construct different prompt words, which are then used to prompt the large language model to generate synthetic data. For data that the large language model cannot synthesize, manual annotation is employed. Undoubtedly, while this method reduces manual costs, it still cannot eliminate the manual annotation process. Another approach replaces named entities in parallel data with similar named entities to obtain error-invariant augmented data, but this augmented data fails to provide richer semantics.
[0009] Exploration of Chinese grammar correction with LLMs is still insufficient because LLM responses are open-ended. Although LLMs can provide a wider range of modification options, these options often contradict the principle of minimal modification in grammar correction. LLMs are extremely prone to overcorrection, where parts of the sentence that should not be corrected are modified. Existing Chinese grammar correction methods are not yet sufficient to overcome this problem.
Summary of the Invention
[0010] LLM possesses a vast knowledge base, and its open-ended responses can meet the diverse needs of Chinese grammar correction. However, this also brings new challenges, namely the potential for overcorrection. To alleviate the inconsistency problem (multiple different answers to the same erroneous sentence, and not all of them being correct) and the overcorrection problem caused by the open-ended response characteristic of LLM, this invention provides a Chinese grammar correction method based on self-rethinking of large language models.
[0011] This invention can stimulate the direct error correction capability of LLM, alleviate the negative impact of LLM's open response characteristics on CGEC tasks (response inconsistency and overcorrection), and does not compromise the original intention of LLM multi-task fusion.
[0012] The technical solution adopted by this invention to solve the technical problem is as follows:
[0013] This invention provides a Chinese grammar correction method for large language models based on self-rethinking, comprising the following steps:
[0014] Step 1: Augmented data synthesis stage;
[0015] The augmented data synthesis process is broken down into multiple sub-problems using a chain-of-thought approach: the LLM first learns how the examples are corrected and generates explanatory information, then generates synthetic data based on the explanatory information, and finally rethinks and checks whether the generated synthetic data is qualified. If it is not qualified, the synthetic data is regenerated.
[0016] Step 2: Fine-tuning stage;
[0017] The first stage fine-tunes the base LLM using the synthetic dataset.
[0018] The second stage uses the existing dataset to perform a second fine-tuning of the LLM obtained from the first stage.
[0019] Step 3: Inference stage;
[0020] On the basis of controlling Top-p and Temperature, word-level edit distance and LLM rethinking are introduced so that the LLM after secondary fine-tuning outputs the most appropriate answer according to the selection strategy.
[0021] Furthermore, the step of generating the explanatory information is as follows: first, the error-correction pair is converted into a word-level modification step through a converter; then, the error-correction pair and the modification step are used together as input to prompt the LLM to generate explanatory information about why the error-correction pair was modified in this way.
[0022] Furthermore, the steps for generating the synthetic data are as follows: first, the error-correction pair Q, the error type, and the explanation information are optionally combined to prompt the LLM to mimic the error-correction pair Q to generate a new error-correction pair A.
[0023] Furthermore, the error-correction pair A and the error-correction pair Q have the same error mode but different semantics.
[0024] Furthermore, the step of rethinking and checking whether the generated synthetic data is qualified is as follows: constructing Few-shot data and prompting the LLM to rethink whether the synthesized data is qualified; for unqualified data, prompting the LLM to resynthesize data in combination with historical information.
[0025] Furthermore, the historical information includes error-correction pair Q, error type, explanation information, and error-correction pair A.
[0026] Furthermore, in step three, by controlling the values of Top-p and Temperature, the number of candidate words for the LLM response is reduced, while its random sampling strategy gives fewer opportunities to low-probability candidate words.
[0027] Furthermore, Top-p is used to control the number of candidate words; Temperature is used to control the sampling probability.
[0028] Furthermore, in step three, the LLM after secondary fine-tuning samples multiple answers to the same question and selects the answer with the smallest word-level edit distance among the multiple answers.
[0029] Furthermore, if multiple answers have the smallest word-level edit distance, then LLM is introduced to rethink and select the most appropriate answer.
[0030] The beneficial effects of this invention are:
[0031] This invention effectively alleviates the overcorrection problem of LLMs in the CGEC task by jointly enhancing the data synthesis, fine-tuning, and inference stages, and improves the direct error correction capability of LLMs for Chinese grammar. Compared with the prior art, the advantages of this invention are:
[0032] 1. In the augmented data synthesis stage, this invention prepares sufficient restrictive information and uses LLM to mimic existing datasets to generate augmented datasets. Simultaneously, it prompts LLM to rethink its generated augmented data and regenerate any substandard data. This invention eliminates the need for manual annotation in synthesizing augmented data, significantly reducing workload. Furthermore, the generated augmented data provides richer semantic information and more closely reflects real-world error distributions.
[0033] 2. To better fit the CGEC task, the fine-tuning phase is divided into two stages: the first stage uses the augmented dataset for training, and the second stage uses the original dataset for training. This invention utilizes a two-stage fine-tuning method with both augmented and original datasets, which promotes better model fitting.
[0034] 3. In the inference phase, this invention first controls the number of candidate words and the sampling probability, initially improving the error correction stability of LLM. Based on this, it samples multiple LLM responses to the same question and selects the response with the smallest word-level edit distance. If multiple responses have the smallest word-level edit distance, the LLM is introduced to reconsider a suitable response. This invention ensures the error correction performance and stability of LLM through the inference phase.
[0035] 4. In both the data synthesis and inference stages, this invention introduces the idea of LLM itself rethinking, evaluating the answers previously generated by LLM and choosing whether to modify them, thus ensuring the quality of the synthesized data.
[0036] 5. By integrating strategies for augmented data synthesis, fine-tuning, and inference, this invention significantly improves the performance of the LLM compared with existing models that undergo only one round of fine-tuning. At the same time, this invention keeps the error correction performance of the LLM within a relatively stable range and avoids drastic fluctuations, which supports the reliability and practicality of the LLM.
Brief Description of the Drawings
[0037] Figure 1 is a flowchart of the augmented data synthesis stage.
[0038] Figure 2 is a flowchart of the fine-tuning stage.
[0039] Figure 3 is a flowchart of the inference stage.
Detailed Description
[0040] The present invention will be further described in detail below with reference to the accompanying drawings.
[0041] This invention provides a Chinese grammar correction method for large language models (LLMs) based on self-rethinking. By combining strategies in the augmented data synthesis, fine-tuning, and inference stages, it stimulates the direct error correction capability of the LLM for Chinese grammar, generating diverse correction schemes while mitigating overcorrection. In the augmented data synthesis stage, the LLM imitates existing datasets to generate augmented datasets whose error patterns are unchanged but whose semantics are completely different, expanding the semantics without further human intervention. In the fine-tuning stage, a two-stage fine-tuning method is used with the augmented and original datasets to promote better model fitting. In the inference stage, multiple answers are sampled, and for those that are difficult to distinguish, the fine-tuned LLM is prompted to reconsider which answer is more suitable.
[0042] In the augmented data synthesis phase, this invention employs an augmented data synthesis method based on the Chain of Thought (COT). This method primarily breaks down the augmented data synthesis process into multiple sub-problems Q1-Q4 using a chain-of-thought approach. First, the LLM learns how to correct the example and generates explanatory information. Then, the LLM generates synthesized data based on the explanatory information. Finally, the LLM rethinks and checks the generated synthesized data to ensure its quality; if it is not satisfactory, it regenerates the synthesized data. Ultimately, a synthesized dataset can be obtained through the augmented data synthesis phase.
[0043] As shown in Figure 1, the specific implementation process of the augmented data synthesis stage is as follows:
[0044] Step 1: Generate explanation information;
[0045] This invention designs a converter to transform an error-correction pair Q into word-level modification steps. First, the error-correction pair Q is used as the original training data input. The converter transforms the error-correction pair Q into word-level modification steps. Then, the error-correction pair Q and the converter's output, i.e., the modification steps S, are used as input to prompt the LLM to generate explanatory information E about why the error-correction pair Q is modified in this way.
[0046] Sub-problem Q1: Prompt 1 asks the LLM to generate an explanation. The input is sentence information, such as the error-correction pair Q, the error type t, and the modification steps S. Answer A1: an explanation of why the sentence needs to be modified in this way.
[0047] As one example, the error-correction pair Q is as follows:
[0048] Incorrect sentence: Not only has the aircraft's fuel consumption decreased, but its flight speed has also increased.
[0049] Correct sentence: Not only has the aircraft reduced its fuel consumption, but its flight speed has also increased.
[0050] For the above error-correction pair Q, the corresponding modification step S is as follows:
[0051] 1. Swap the word order of the two words "not only" and "fuel consumption".
[0052] 2. Delete the word "of".
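The description does not specify how the converter is built. As a minimal sketch, assuming both sentences have already been segmented into word lists, Python's standard `difflib.SequenceMatcher` can derive word-level modification steps from an error-correction pair. The function name `modification_steps` and the wording of each step are illustrative assumptions, not part of the invention. Note that `difflib` reports a moved word as a deletion plus an insertion rather than as a swap.

```python
from difflib import SequenceMatcher


def modification_steps(wrong_words, right_words):
    """Return readable word-level edit steps (hypothetical output format)."""
    matcher = SequenceMatcher(a=wrong_words, b=right_words, autojunk=False)
    steps = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue  # unchanged words need no modification step
        old = " ".join(wrong_words[i1:i2])
        new = " ".join(right_words[j1:j2])
        if tag == "delete":
            steps.append(f'Delete "{old}".')
        elif tag == "insert":
            steps.append(f'Insert "{new}" at position {j1}.')
        else:  # tag == "replace"
            steps.append(f'Replace "{old}" with "{new}".')
    return steps


# English stand-in for a segmented Chinese sentence pair.
wrong = ["the", "scope", "of", "indications", "has", "been", "reduced"]
right = ["the", "scope", "of", "indications", "has", "been", "narrowed"]
print(modification_steps(wrong, right))
```

The resulting list of steps plays the role of the modification steps S that are passed to the LLM together with the error-correction pair Q.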
[0053] Step 2: Generate synthetic data;
[0054] Optionally combine the error-correction pair Q, the error type t, and the explanation information E to prompt the LLM to imitate the error-correction pair Q and generate a new error-correction pair A. Compared with the error-correction pair Q, the newly generated error-correction pair A keeps the same error pattern but has completely different semantics.
[0055] Sub-problem Q2: Prompt 2 asks the LLM to generate synthetic data. The input is sentence information, such as the error-correction pair Q, the error type t, and the explanation information E. Answer A2: a synthetic data pair.
[0056] As an example, the error-correction pair A is as follows:
[0057] Incorrect sentence: The operating efficiency of the computer not only has improved, but also it is more convenient to use.
[0058] Correct sentence: The computer not only has improved operating efficiency, but also it is more convenient to use.
[0059] Step 3: Check whether the generated synthetic data is qualified;
[0060] Construct few-shot data that provides examples of qualified and unqualified pairs, and prompt the LLM to rethink whether the synthetic data, that is, its own answer, is qualified.
[0061] Sub-problem Q3: Prompt 3 asks the LLM to determine whether the synthetic data pair is qualified. Answer A3: Qualified / Unqualified.
[0062] Step 4: Regenerate the data that fails the check;
[0063] For data that fails the check, prompt the LLM to resynthesize the data by combining the historical information (error-correction pair Q, error type, explanation information, error-correction pair A).
[0064] Sub-problem Q4: Prompt 4 asks the LLM to regenerate the unqualified data. The input is the historical information and the unqualified synthetic data pair. Answer A4: a new synthetic data pair.
[0065] As an example, the new synthetic data pair is as follows:
[0066] Incorrect sentence: The students' grades not only improved, but their attitude towards learning also became more positive.
[0067] Correct sentence: The students not only improved their grades, but also developed a more positive attitude towards learning.
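The four sub-problems Q1 to Q4 can be read as a generate, check, and regenerate loop. The sketch below is one possible arrangement under stated assumptions: `llm` is a hypothetical callable that takes a prompt string and returns a string, the prompt wording is invented for illustration, and both the `max_retries` limit and the policy of returning `None` after repeated failures are assumptions that the description does not specify.

```python
def synthesize_pair(llm, pair_q, error_type, steps, few_shot, max_retries=3):
    """Sketch of sub-problems Q1-Q4; `llm` is an assumed callable str -> str."""
    # Q1: ask why the error-correction pair Q was corrected this way.
    explanation = llm(
        f"Error-correction pair: {pair_q}\nError type: {error_type}\n"
        f"Modification steps: {steps}\n"
        "Explain why the sentence is modified this way."
    )
    # Historical information shared by Q2 and Q4.
    history = (
        f"Error-correction pair: {pair_q}\nError type: {error_type}\n"
        f"Explanation: {explanation}"
    )
    # Q2: imitate Q to produce a new pair A with the same error pattern.
    pair_a = llm(
        history + "\nGenerate a new pair with the same error pattern "
        "but different semantics."
    )
    for _ in range(max_retries):
        # Q3: rethink, guided by few-shot qualified / unqualified examples.
        verdict = llm(
            f"{few_shot}\nCandidate pair: {pair_a}\n"
            "Answer Qualified or Unqualified."
        )
        if verdict.strip().lower().startswith("qualified"):
            return pair_a
        # Q4: regenerate from the history plus the rejected pair.
        pair_a = llm(
            history + f"\nRejected pair: {pair_a}\nRegenerate a qualified pair."
        )
    return None  # assumed policy: drop the sample after repeated failures
```

Keeping the rejected pair in the Q4 prompt mirrors the description's use of historical information, so the LLM can see what it produced before and avoid repeating it.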
[0068] In the fine-tuning phase, this invention uses a two-stage fine-tuning method. As shown in Figure 2, the fine-tuning stage is divided into two sub-stages. The first stage uses a synthetic dataset to fine-tune the base LLM. Specifically, raw data is first collected, and a general-purpose large model imitates the structure and semantics of the raw data, synthesizing new data from the raw data and its error structure information to obtain the synthetic dataset. The synthetic dataset is then used to perform supervised fine-tuning (SFT) on the base LLM to obtain the one-stage fine-tuned large model. The second stage uses an existing dataset (Lvxiaowei Xu, Jianwang Wu, Jiawei Peng, Jiayu Fu, Ming Cai, “FCGEC: Fine-Grained Corpus for Chinese Grammatical Error Correction”, Findings of the Association for Computational Linguistics: EMNLP 2022, published in October 2022) to perform a second fine-tuning on the one-stage fine-tuned large model. Specifically, the LLM trained on the synthetic dataset undergoes SFT on the raw data to obtain the two-stage fine-tuned large model, i.e., the LLM after augmented training. The two-stage fine-tuning method helps the LLM fit the CGEC task quickly.
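The ordering of the two stages is the essential point: synthetic data first, the existing dataset second. A minimal sketch of that ordering follows, assuming a hypothetical routine `sft(model, dataset)` that runs supervised fine-tuning and returns the updated model. The description does not name a training framework, so any SFT implementation could stand behind this callable.

```python
def two_stage_finetune(base_model, synthetic_data, original_data, sft):
    """Run SFT twice: first on synthetic data, then on the original data.

    `sft(model, dataset)` is a hypothetical training routine that returns
    the fine-tuned model.
    """
    # Stage 1: fine-tune the base LLM on the synthetic dataset.
    stage_one_model = sft(base_model, synthetic_data)
    # Stage 2: fine-tune the stage-1 model on the existing dataset.
    stage_two_model = sft(stage_one_model, original_data)
    return stage_two_model
```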
[0069] During the inference phase, the open-ended response characteristic of the LLM is mainly due to its large number of candidate words and its high sampling randomness. More candidate words mean more possible responses, and high sampling randomness makes it easier to select low-probability candidate words. This characteristic can easily lead to unstable Chinese grammar correction performance in the LLM, resulting in overcorrection. Therefore, in addition to data-level control, this invention also employs a selection strategy during the model's inference phase.
[0070] The selection strategy used in the inference phase includes the following steps:
[0071] (1) First, control the Top-p and Temperature values of the LLM within a reasonable range so that the number of candidate words for the LLM's answers is reduced and its random sampling strategy gives fewer opportunities to low-probability candidate words.
[0072] Top-p, also known as nucleus sampling, is a strategy for controlling the generated text, used to adjust the diversity and accuracy of the generated text. In this invention, Top-p is used to control the number of candidate words.
[0073] Temperature is a hyperparameter, typically a value between 0 and 1, used to control the randomness (creativity) of the generated text. In this invention, Temperature controls the sampling probability.
[0074] (2) Then, sample multiple answers to the same question Q from the LLM after secondary fine-tuning, and select the answer with the smallest word-level edit distance among the multiple answers.
[0075] (3) If multiple answers share the smallest word-level edit distance, the LLM is prompted to rethink and select the most appropriate answer.
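To make the roles of the two parameters in step (1) concrete, the following sketch samples a single word from a toy score table. It is a generic illustration of temperature scaling followed by top-p (nucleus) truncation, not the invention's decoding code. The default values 0.7 and 0.9 are placeholders, since the description gives no specific settings, and `temperature` must be greater than zero.

```python
import math
import random


def sample_token(logits, temperature=0.7, top_p=0.9, rng=random):
    """Sample one token from `logits` (dict: token -> score)."""
    # Temperature: dividing scores by T < 1 sharpens the distribution,
    # so low-probability candidates are chosen less often.
    scaled = {tok: score / temperature for tok, score in logits.items()}
    peak = max(scaled.values())
    weights = {tok: math.exp(s - peak) for tok, s in scaled.items()}
    total = sum(weights.values())
    ranked = sorted(((w / total, tok) for tok, w in weights.items()), reverse=True)
    # Top-p: keep the smallest set of top-ranked tokens whose cumulative
    # probability reaches top_p, which limits the number of candidates.
    kept, mass = [], 0.0
    for prob, tok in ranked:
        kept.append((prob, tok))
        mass += prob
        if mass >= top_p:
            break
    tokens = [tok for _, tok in kept]
    return rng.choices(tokens, weights=[prob for prob, _ in kept], k=1)[0]
```

Lowering `top_p` shrinks the candidate set, and lowering `temperature` concentrates probability on the highest-scoring candidates. Both changes push repeated samples toward the same answer, which is the stabilizing effect step (1) aims for.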
[0076] As shown in the example in Figure 3, the input question Q is: convert the input sentence into a language expression that people prefer. The input sentence is as follows: The scope of indications has been reduced, and it is clearly stated that it is prohibited for children under 3 years old. The LLM outputs three answers: A1: It is clearly stated that it is prohibited for children under 3 years old, and the scope of indications has been reduced to the minimum (incorrect correction); A2: The scope of indications has been adjusted, and it is clearly stated that it is prohibited for children under 3 years old (strictly judged incorrect); A3: The scope of indications has been narrowed, and it is clearly stated that it is prohibited for children under 3 years old (correct correction).
[0077] Based on the selection strategy, the word-level edit distances of A1, A2, and A3 are compared. A1 has a word-level edit distance of 12, while A2 and A3 both have a word-level edit distance of 1. Therefore, A2 and A3 are selected. The LLM is then introduced to rethink and, given a limited number of example prompts, selects the most appropriate answer, A3. The final output is A3: The scope of indications has been narrowed, and it is clearly stated that it is prohibited for children under 3 years old (correct correction). The labels in Figure 3 are defined as follows. Strictly judged incorrect: the modification might be considered a correct correction, but its wording is imprecise, so machine-based scoring may judge it as incorrect. For example, in the first example of Figure 3, the correct correction is "narrowed", while A2 uses the broader term "adjusted"; although the corrected answer is grammatically correct, machine scoring may not accept this wording. Incorrect correction: the corrected result is incorrect and grammatical errors still exist. Correct correction: the grammatical errors have been corrected, resulting in a grammatically correct answer.
[0078] As one example, the specific implementation process of the selection strategy used in the inference phase is as follows:
[0079] Input: the incorrect sentence, the fine-tuned large model, and the error-correction prompt;
[0080] Output: the corrected sentence;
[0081] Step 1: Initialize the candidate result set and the word-level edit distance set, whose entries correspond one-to-one to the candidate results, as empty lists;
[0082] Step 2: Sample multiple correction results for the incorrect sentence from the fine-tuned large model and store them in the candidate result set;
[0083] Step 3: Segment the incorrect sentence into a word list;
[0084] Step 4: Traverse the candidate result set to obtain each candidate correction result;
[0085] Step 5: Segment the candidate correction result into a word list;
[0086] Step 6: Calculate the word-level edit distance between the word list of the incorrect sentence and the word list of the candidate correction result;
[0087] Step 7: Obtain the minimum edit distance value in the word-level edit distance set;
[0088] Step 8: Obtain the index of the minimum edit distance value in the word-level edit distance set;
[0089] Step 9: If the minimum edit distance value in the word-level edit distance set is unique, the final corrected sentence is obtained directly and output; otherwise, the candidate correction results with the minimum edit distance value form a subset;
[0090] Step 10: Following the error-correction prompt, build a rethinking error-correction prompt;
[0091] Step 11: Provide some error-correction pairs as background knowledge;
[0092] Step 12: The fine-tuned large model rethinks according to the subset, the rethinking error-correction prompt, and the background knowledge, and outputs the final corrected sentence.
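The twelve steps above can be sketched compactly. In this sketch, `segment` stands for any word segmenter and `rethink` stands for a call to the fine-tuned model with the rethinking prompt and background error-correction pairs. Both are hypothetical callables introduced for illustration, and the sampling of candidates (Step 2) is assumed to have happened already.

```python
def word_edit_distance(a, b):
    """Levenshtein distance between two word lists."""
    prev = list(range(len(b) + 1))
    for i, word_a in enumerate(a, start=1):
        cur = [i]
        for j, word_b in enumerate(b, start=1):
            cost = 0 if word_a == word_b else 1
            cur.append(min(prev[j] + 1,          # delete word_a
                           cur[j - 1] + 1,       # insert word_b
                           prev[j - 1] + cost))  # keep or substitute
        prev = cur
    return prev[-1]


def select_correction(wrong, candidates, segment, rethink):
    """Pick the candidate closest to `wrong`; ask `rethink` to break ties."""
    source_words = segment(wrong)                                   # Step 3
    distances = [word_edit_distance(source_words, segment(c))       # Steps 4-6
                 for c in candidates]
    smallest = min(distances)                                       # Step 7
    subset = [c for c, d in zip(candidates, distances) if d == smallest]
    if len(subset) == 1:                                            # Step 9
        return subset[0]
    return rethink(wrong, subset)                                   # Steps 10-12


# English stand-ins for the Figure 3 example; str.split replaces a real segmenter.
wrong = "the scope of indications has been reduced"
candidates = [
    "the scope of indications has been reduced to the minimum",
    "the scope of indications has been adjusted",
    "the scope of indications has been narrowed",
]
choice = select_correction(wrong, candidates, str.split,
                           rethink=lambda sentence, subset: subset[-1])
print(choice)
```

In this English stand-in the second and third candidates tie at distance 1, so the stub `rethink` decides between them. The distances differ from the figures quoted for the Chinese original because the segmentation differs.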
[0093] The above description is only a preferred embodiment of the present invention. It should be noted that for those skilled in the art, several improvements and modifications can be made without departing from the principle of the present invention, and these improvements and modifications should also be considered within the scope of protection of the present invention.
Claims
1. A Chinese grammar correction method for large language models based on self-rethinking, characterized in that it comprises the following steps: Step 1: Augmented data synthesis stage; the augmented data synthesis process is broken down into multiple sub-problems using a chain-of-thought approach: the LLM first learns how the examples are corrected and generates explanatory information, then generates synthetic data based on the explanatory information, and finally rethinks and checks whether the generated synthetic data is qualified; if it is not qualified, the synthetic data is regenerated. Step 2: Fine-tuning stage; the first stage fine-tunes the base LLM using the synthetic dataset; the second stage uses the existing dataset to perform a second fine-tuning of the LLM obtained from the first stage. Step 3: Inference stage; on the basis of controlling Top-p and Temperature, word-level edit distance and LLM rethinking are introduced so that the LLM after secondary fine-tuning outputs the most appropriate answer according to the selection strategy.
2. The Chinese grammar correction method for large language models based on self-rethinking as described in claim 1, characterized in that the steps for generating the explanatory information are as follows: first, the error-correction pair is converted into word-level modification steps using a converter; then, the error-correction pair and the modification steps are used together as input to prompt the LLM to generate explanatory information about why the error-correction pair was modified in this way.
3. The Chinese grammar correction method for large language models based on self-rethinking as described in claim 1, characterized in that the steps for generating the synthetic data are as follows: the error-correction pair Q, the error type, and the explanation information are selectively combined to prompt the LLM to imitate the error-correction pair Q and generate a new error-correction pair A.
4. The Chinese grammar correction method for large language models based on self-rethinking as described in claim 3, characterized in that the error-correction pair A and the error-correction pair Q have the same error pattern but different semantics.
5. The Chinese grammar correction method for large language models based on self-rethinking as described in claim 1, characterized in that the step of rethinking and checking whether the generated synthetic data is qualified is as follows: construct few-shot data and prompt the LLM to rethink whether the synthesized data is qualified; for unqualified data, prompt the LLM to resynthesize the data by combining historical information.
6. The Chinese grammar correction method for large language models based on self-rethinking as described in claim 5, characterized in that the historical information includes the error-correction pair Q, the error type, the explanation information, and the error-correction pair A.
7. The Chinese grammar correction method for large language models based on self-rethinking as described in claim 1, characterized in that, in Step 3, by controlling the values of Top-p and Temperature, the number of candidate words for the LLM response is reduced, and its random sampling strategy gives fewer opportunities to low-probability candidate words.
8. The Chinese grammar correction method for large language models based on self-rethinking as described in claim 7, characterized in that Top-p is used to control the number of candidate words, and Temperature is used to control the sampling probability.
9. The Chinese grammar correction method for large language models based on self-rethinking as described in claim 1, characterized in that, in Step 3, the LLM after secondary fine-tuning samples multiple answers to the same question and selects the answer with the smallest word-level edit distance among the multiple answers.
10. The Chinese grammar correction method for large language models based on self-rethinking as described in claim 9, characterized in that, if multiple answers share the smallest word-level edit distance, the LLM is introduced to rethink and select the most appropriate answer.