Chinese passage grammar error correction corpus automatic generation method based on edit distance optimization

By using an edit distance-based optimization method, a high-quality Chinese grammar correction corpus is constructed using large language model generation and edit distance evaluation. This solves the problem of insufficient training data for existing models and improves the performance of the model in practical applications.

CN122197870APending Publication Date: 2026-06-12KUNMING UNIV OF SCI & TECH

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
KUNMING UNIV OF SCI & TECH
Filing Date
2026-03-23
Publication Date
2026-06-12

AI Technical Summary

Technical Problem

Existing Chinese grammar correction models have limited training data, cover only a single scenario, and exhibit a distributional offset between synthesized errors and real user input patterns, resulting in insufficient robustness and generalization performance in practical applications.

Method used

By using an edit distance-based optimization method, we generate erroneous sentences that match real user input scenarios using a large language model, and pair them with correct sentences to construct a high-fidelity, high-coverage synthetic error-correction corpus. By combining multi-model collaborative generation and edit distance evaluation, we select high-quality corpus segments.

🎯Benefits of technology

It significantly improves the performance of Chinese grammar correction models, generates corpora with higher consistency with real user input, and significantly enhances the robustness and generalization ability of the model in practical applications.

✦ Generated by Eureka AI based on patent content.

Smart Images

  • Figure CN122197870A_ABST
    Figure CN122197870A_ABST
Patent Text Reader

Abstract

The present application relates to a Chinese text grammar error correction corpus automatic generation method based on edit distance optimization. The present application optimizes the prompt template according to the error correction result, obtains a more suitable prompt; combines the text containing errors and the best prompt, inputs the large model, and obtains the preliminary error correction result of the model; respectively evaluates the error correction results of different large models, and obtains the corresponding editing scores; using the editing score and the preliminary error correction result to guide the large model to obtain more accurate error correction results, and performing artificial secondary verification. The present application automatically generates Chinese grammar error correction corpus by effectively utilizing the characteristics of open source large models, such as deep understanding and efficient processing of natural language. With high-quality correct Chinese text as seed corpus, through the multi-model collaborative generation mechanism, the error-containing text conforming to the real user error mode is generated, and is paired with the original correct text to form the training corpus, and excellent experimental results are obtained on the Chinese grammar error correction task.
Need to check novelty before this filing date? Find Prior Art

Description

[0001] Technical Language

[0002] This invention relates to a method for automatically generating Chinese discourse grammar correction corpora based on edit distance optimization, belonging to the field of natural language processing technology. Background Technology

[0003] With the rapid development of natural language processing technology, users have placed higher demands on the convenience and accuracy of text input in various human-computer interaction scenarios. However, due to limitations in input methods (such as keyboard input on mobile devices and speech recognition transcription) and differences in users' language abilities, user-generated text commonly suffers from grammatical errors, inappropriate word choice, redundant sentences, or disorganized structure. If these errors are not corrected, they will directly affect the performance of downstream tasks (such as information retrieval, machine translation, and intelligent question answering) and degrade the end-user experience.

[0004] To address these challenges, automatic grammatical error correction (GEC) technology has been widely integrated into practical systems such as text editors, smart input methods, email clients, and search engines. In the Chinese context, Chinese grammatical error correction (CGEC) aims to automatically detect and correct various grammatical errors in input sentences while strictly preserving the original semantics. Typical applications include: preprocessing user search queries to improve retrieval accuracy; post-editing speech recognition output text to enhance semantic understanding reliability; and providing real-time grammatical feedback for non-native language learners or writing aids.

[0005] In recent years, deep learning-based sequence-to-sequence (Seq2Seq) models and pre-trained language models have demonstrated significant advantages in the CGEC task. However, the performance of such models is highly dependent on large-scale, high-quality labeled corpora. Ideal training data is typically organized in the form of parallel sentence pairs, i.e., containing manually labeled "incorrect sentence – correct sentence" samples. However, constructing such corpora faces the following key bottlenecks:

[0006] First, Chinese grammatical errors are complex and diverse, encompassing multiple dimensions such as misuse of parts of speech, incomplete sentence components, disordered word order, inappropriate collocation, and redundancy. Furthermore, the manifestation of these errors is highly dependent on the context, demanding extremely high linguistic proficiency from the annotators. Second, manual annotation is costly, inefficient, and susceptible to subjective judgment, resulting in limited scale and narrow coverage of existing public datasets, making it difficult to support the training of models with high generalization capabilities. Third, to alleviate the data scarcity problem, existing research often employs synthetic data strategies, such as simulating errors by applying random character substitutions, deletions, insertions, or word order perturbations to correct text, and then using a Seq2Seq model to learn the "error correction mapping." However, such synthetic errors often exhibit significant distributional shifts from error patterns in real user input, failing to cover complex grammatical problems caused by semantic confusion, dialect interference, or speech recognition errors in real-world scenarios, thus limiting the robustness and generalization performance of models in real-world applications.

[0007] It is worth noting that although some studies have attempted to introduce rules or linguistic knowledge to constrain the synthesis process, their coverage remains limited and they are difficult to adapt to the dynamic evolution of language use. Furthermore, even in manually annotated data, inconsistencies in annotation standards or proofreading oversights often introduce noisy samples, further affecting the stability and convergence of model training. To address these issues, this invention proposes an automatic construction method for Chinese discourse grammar correction corpora based on collaboration among multiple open-source Large Language Models (LLMs) and edit distance optimization. This method fully leverages the powerful capabilities of large language models in semantic understanding, context modeling, and controllable text generation. By designing error-simulating prompts tailored to real user input scenarios, it guides the model to generate erroneous sentences that conform to human language habits and have contextual plausibility, while simultaneously outputting the corresponding standard corrected sentences, thereby constructing a high-fidelity, high-coverage synthetic error-correction corpus. Compared to traditional synthesis strategies based on rules or random perturbations, this method can more realistically simulate user errors in typical scenarios such as multi-scenario input, speech-to-text transcription, and writing in a non-native language. It significantly improves the consistency between the synthesized data and the real error distribution, providing a reliable and scalable data foundation for training high-performance Chinese grammar correction models. Summary of the Invention

[0008] This invention provides an automatic generation method for Chinese discourse grammar correction corpus based on edit distance optimization, which solves the problem of consuming a lot of human resources for manual annotation of Chinese grammar correction corpus. This invention has achieved excellent experimental results in improving the performance of Chinese grammar correction models.

[0009] The technical solution of this invention is: an automatic generation method for Chinese discourse grammar correction corpus based on edit distance optimization, the specific steps of which are as follows:

[0010] Step 1: Based on existing mature Chinese grammar correction prompt templates and prompt templates that have been adjusted multiple times through mainstream models, manual optimization and organization are carried out.

[0011] Step 2: Use the generated prompt template to guide the Qwen / QwQ-32B and DeepSeek-V3 large models to automatically segment and generate Chinese grammar correction corpus;

[0012] Step 3: Perform data cleaning and post-processing on the Chinese grammar correction corpus generated in the previous step.

[0013] Step 4: Use the ChERRANT edit distance tool to evaluate the distance between the candidate Chinese grammar correction corpora generated by the two models, and select the corpus segments whose normalized edit distance between the two model generation results is less than the preset threshold (0.8) to improve the quality of the generated Chinese grammar correction corpora;

[0014] Step 5: By analyzing the feature similarity and distance correlation of the evaluated Chinese grammar correction corpus, the pre-processed corpus with features meeting the requirements is fed into the QwenLong-L1-32B model (this model is the first long-context large-scale reasoning model (LRM) trained using reinforcement learning, specifically optimized for long text reasoning tasks. This model achieves stable transfer from short context to long context through a reinforcement learning framework of progressive context expansion. In seven long-context document question answering benchmarks, QwenLong-L1-32B outperforms flagship models such as OpenAI-o3-mini and Qwen3-235B-A22B, and its performance is comparable to Claude-3.7-Sonnet-Thinking) for error correction processing;

[0015] Step 6: The accuracy of the text content is ensured through manual review. By comparing with Lang8 (1.2M), it can be verified that the automatic generation method of Chinese text grammar correction corpus with edit distance optimization has excellent performance and meets the requirements for existing Chinese grammar errors.

[0016] As a further aspect of the present invention, the specific steps of Step 1 are as follows:

[0017] Step 1.1: DeepSeek-V3 and Qwen / QwQ-32B large language models, such as DeepSeek-R1, kimi-k2, and Wan, are invoked from the silicon-based streaming platform. Through experimental verification, this invention ultimately selects DeepSeek-V3 and Qwen / QwQ-32B large language models as the basic models for constructing this Chinese grammar correction corpus.

[0018] Step 1.2: Deploy DeepSeek-V3 and the Qwen / QwQ-32B open-source large language model as API interfaces for easy calling;

[0019] Step 1.3: From the publicly available Chinese grammar correction corpus Lang8, manually select passages containing typical Chinese grammar errors. Using the selected examples, construct thought chains, existing mature Chinese grammar correction prompt templates, and prompt templates that have been repeatedly adjusted through mainstream models. Manually optimize and organize these to generate suitable prompt templates.

[0020] Step 1.4: Filter out some prompt templates that cannot be processed by the language model, such as prompt templates involving links, special symbols, or non-Chinese characters, and filter out completely identical prompt templates.

[0021] Step 1.5: Strictly write the generated required format into the template to avoid formatting errors.

[0022] As a further aspect of the present invention, the specific steps of Step 2 are as follows:

[0023] Step 2.1: To ensure the correctness of the corpus generated by the model, this invention combines existing mature error correction templates and manually constructs multiple test templates;

[0024] Step 2.2: In order to reduce excessive modifications while maintaining high error correction accuracy, the error correction effect of each prompting scheme was tested independently multiple times from the manually constructed schemes. After selecting the prompting scheme that best meets the requirements, the manually constructed prompting template was optimized using the open source model Qwen / Qwen3-VL-235B-A22B-Instruct. The best prompting template was selected by combining the generation results of multiple models.

[0025] Step 2.3: The corpus is segmented using DeepSeek-V3 and Qwen / QwQ-32B models respectively. The segmentation is made reasonable by means of semantic boundary priority breakpoint identification, intelligent block size control and context continuity processing.

[0026] Step 2.4: Input the same text corpus into DeepSeek-V3 and Qwen / QwQ-32B models respectively, and obtain the error correction results of the two models;

[0027] Step 2.5: While keeping the meaning and logic of Chinese sentences unchanged, the model is required to delete repeated content, unknown symbols, redundant spaces, incorrectly used punctuation, and single short texts with inconsistent content from the Chinese grammar correction corpus, thereby eliminating potentially low-quality corpus in the generated Chinese grammar correction corpus.

[0028] As a further aspect of the present invention, the specific steps of Step 3 are as follows:

[0029] Step 3.1: Use the Chinese-English text filtering tool to actively delete long English paragraphs in the Chinese grammar correction corpus, while retaining appropriate English proper nouns;

[0030] Step 3.2: Use data cleaning tools to perform preliminary data cleaning on the processed Chinese grammar correction corpus.

[0031] Delete texts with short character lengths and ambiguous language from the corpus, delete paragraphs that are exactly the same before and after modification from most of the corpus, and keep only a very small number of identical paragraphs as reference examples;

[0032] As a further aspect of the present invention, the specific steps of Step 4 are as follows:

[0033] Step 4.1 Use the ChERRANT edit distance tool to evaluate the distance between the candidate Chinese grammar correction corpora generated by the two models, and select corpus segments whose normalized edit distance between the two model generation results is less than the preset threshold (0.8) to improve the quality of Chinese grammar correction corpora and select corpus segments that are suitable for model processing and have typicality.

[0034] As a further aspect of the present invention, the specific steps of Step 5 are as follows:

[0035] Step 5.1: Call the API interface of the deployed QwenLong-L1-32B model to reprocess the filtered candidate error correction corpus. Corpus with an evaluation score greater than a preset threshold (e.g., 8.0) is required to be retained, and corpus with a score less than the threshold is reprocessed. A corpus can be processed a maximum of 3 times until the filtering condition value is reached or it is removed to prevent excessive modification.

[0036] Step 5.2: Use the QwenLong-L1-32B model to comprehensively evaluate the semantic consistency, fluency, and error rationality of the initially generated error-correct sentence pairs, and filter or correct low-quality samples. Corpus with evaluation scores greater than 8.0 is retained.

[0037] As a further aspect of the present invention, the specific steps of Step 6 are as follows:

[0038] Step 6.1 Analyze the edit distance optimized Chinese discourse grammar correction corpus, and analyze its data distribution and correlation with publicly available Chinese grammar correction corpora;

[0039] Step 6.2: To verify the quality difference between the Chinese discourse grammar correction corpus optimized by edit distance and existing corpora, this invention selects a mainstream Chinese grammar correction model, namely the sequence-to-sequence model. Experiments are conducted on the publicly available Chinese grammar correction corpus Lang8, the Chinese grammar correction corpus generated collaboratively by multiple models, and the Chinese grammar correction corpus after edit distance evaluation.

[0040] Step 6.3: Evaluate the error correction performance of the Chinese grammar correction model by calculating evaluation metrics such as accuracy, recall, and F-score.

[0041] The present invention also provides an automatic generation system for Chinese discourse grammar correction corpus based on edit distance optimization, the system comprising: a module for executing the automatic generation method for Chinese discourse grammar correction corpus based on edit distance optimization.

[0042] The present invention also provides an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the program to implement the method for automatically generating Chinese discourse grammar correction corpus based on edit distance optimization.

[0043] The present invention also provides a non-transitory computer-readable storage medium storing a computer program thereon, characterized in that the computer program, when executed by a processor, implements the method for automatically generating Chinese discourse grammar correction corpus based on edit distance optimization.

[0044] The present invention also provides a computer program product, including a computer program, characterized in that, when the computer program is executed by a processor, it implements the automatic generation method of Chinese discourse grammar correction corpus based on edit distance optimization.

[0045] The beneficial effects of this invention are:

[0046] This invention automatically generates Chinese grammar correction corpora by effectively utilizing the deep understanding and efficient processing capabilities of open-source large models such as Qwen / QwQ-32B, DeepSeek-V3, and QwenLong-L1-32B for natural language. Using high-quality correct Chinese text as seed corpora, a multi-model collaborative generation mechanism generates erroneous text that conforms to real user error patterns. This erroneous text is then paired with the original correct text to form training corpora, achieving excellent experimental results on Chinese grammar correction tasks. Attached Figure Description

[0047] Figure 1 This is a flowchart from the present invention. Detailed Implementation

[0048] Example 1: As Figure 1As shown, an automatic generation method for Chinese discourse grammar correction corpus based on edit distance optimization is described, the method comprising:

[0049] Based on existing mature Chinese grammar correction prompt templates and prompt templates that have been adjusted multiple times through mainstream large models, manual optimization and organization were carried out.

[0050] The generated prompt templates are used to guide the Qwen / QwQ-32B and DeepSeek-V3 large models to automatically segment and generate Chinese grammar correction corpus.

[0051] Perform data cleaning and post-processing on the Chinese grammar correction corpus generated in the previous step.

[0052] The ChERRANT edit distance tool was used to evaluate the distance between the candidate Chinese grammar correction corpora generated by the two models. Corpus segments with a normalized edit distance of less than a preset threshold (0.8) between the two models were selected to improve the quality of the generated Chinese grammar correction corpora.

[0053] By analyzing the feature similarity and distance correlation of the evaluated Chinese grammar correction corpus, the pre-processed corpus with features meeting the requirements is fed into the QwenLong-L1-32B model (the first long-context large-scale reasoning model (LRM) trained using reinforcement learning, specifically optimized for long text reasoning tasks. This model achieves stable transfer from short to long contexts through a reinforcement learning framework of progressive context expansion. In seven long-context document question answering benchmarks, QwenLong-L1-32B outperforms flagship models such as OpenAI-o3-mini and Qwen3-235B-A22B, and its performance is comparable to Claude-3.7-Sonnet-Thinking) for error correction processing;

[0054] The accuracy of the text content is ensured through manual review. The performance of the automatic generation method of Chinese discourse grammar correction corpus with edit distance optimization can be verified by comparison with Lang8 (1.2M), which meets the requirements for existing Chinese grammar correction.

[0055] Furthermore, the specific steps of the method are as follows:

[0056] a1. Download the DeepSeek-V3 and Qwen / QwQ-32B model files from the Silicon Flow website. Utilize OpenAI's API format for local API deployment, adjusting some hyperparameters of the model, such as Temperature and Top-p, through interface methods and links. Finally, customize the returned data and deploy the model as an API interface for easy subsequent calls.

[0057] a2. In the classification system of Chinese grammatical errors, error types are mainly divided into seven categories: structural confusion, illogicality, incomplete components, redundant components, improper collocation, improper word order, and ambiguity. Among these, the identification and correction of ambiguity errors usually rely on rich contextual information and professional knowledge, making them difficult to handle effectively in general grammar correction systems. Therefore, this invention focuses on constructing a corpus for correcting the first six types of grammatical errors to improve the targeting and efficiency of corpus construction. This invention extracts high-quality corpus samples containing typical grammatical errors from the publicly available Chinese grammar correction corpus Lang8 through a professional manual screening mechanism. Based on these samples, this invention constructs a systematic thought chain guidance mechanism to generate prompt templates for multi-model collaborative generation. This mechanism, through structured guidance, enables models to accurately identify and correct various grammatical errors, effectively solving the technical problems of low efficiency and uneven quality in the construction of grammar correction corpora in existing technologies.

[0058] a3. Filter out the prompt templates that can be processed by the language model from the constructed prompt templates.

[0059] 1) Filter out prompt templates that contain links, special symbols, or non-Chinese characters.

[0060] 2) Filter out identical prompt templates.

[0061] 3) Filter out overly complicated templates.

[0062] a4. In order to ensure the standardization and correctness of the model output content, this invention uses manually defined output format prompt templates and repetitive emphasis prompt templates to ensure the quality of Chinese grammar correction corpus.

[0063] 1) Manually construct multiple prompt templates for Chinese grammar correction corpora.

[0064] 2) In order to reduce excessive modifications while maintaining high error correction accuracy, after selecting the most suitable prompting scheme, the manually constructed prompting template was optimized using the open-source model Qwen / Qwen3-VL-235B-A22B-Instruct.

[0065] a5. Combine the generation results of multiple models to select the best prompt template, and generate candidate error correction corpus for the corresponding model.

[0066] 1) The method used in this invention to guide DeepSeek-V3 and Qwen / QwQ-32B large models strictly follows the paradigm in Table 1 below, which includes a general prompt, an input field that provides the text to be edited, and an output field that requires a model response.

[0067] Table 1. Paradigm Format

[0068] a6. A method for quality assessment and screening of Chinese grammar correction corpora based on the ChERRANT edit distance tool is used to evaluate the distance between candidate Chinese grammar correction corpora generated by two models, and to select Chinese grammar correction corpora with similar distances to improve the quality and consistency of the Chinese grammar correction corpus. This method calculates the ChERRANT distance between the generated candidate grammar correction corpora and the reference grammar correction corpus, and selects corpora with similar distances to ensure that the generated corpora maintain a high degree of consistency with the reference corpus in grammatical correction, thereby significantly improving the quality and usability of the corpus.

[0069] a7. Based on the analysis and evaluation of the feature similarity and distance correlation of the Chinese grammar correction corpus, the pre-processed corpus that meets the distance requirements is input into the QwenLong-L1-32B model for quality assessment and refinement.

[0070] Check whether the number of characters after processing is not less than 80% of the original segment length; if the character count constraint is not met, the processing is repeated a maximum of 3 times; after each processing, a quality assessment is performed, and corpora with an assessment score greater than 8.0 are retained, otherwise they are removed.

[0071] a8. Ensure the accuracy of the text content through manual review. Review criteria include:

[0072] Grammatical correctness: Ensure that the corrected sentences conform to Chinese grammar rules.

[0073] Semantic consistency: The revised statement should retain its original meaning.

[0074] Language naturalness: The revised sentences should conform to the habits of natural language expression.

[0075] a9. To further evaluate the quality of the multi-model generated Chinese grammar correction corpus, this invention uses the ChERRANT editing and alignment tool as an evaluation index for corpus quality. The specific calculation method is as follows:

[0076] 1) The ChERRANT edit alignment tool can perform fine-grained, character-level editing operations to align and classify error types between the model output and the reference answer. The specific calculation formula is shown below:

[0077] (1)

[0078]

[0079]

[0080] Among them, E sysE represents the total number of edits proposed for the model. match E represents the number of edits correctly processed by the model. ref This represents the total number of errors in the reference answer.

[0081] This invention uses F 0.5 As the primary evaluation metric, β = 0.5 indicates that precision is valued twice as much as recall.

[0082] 2) Filtering ChERRANT editing F 0.5 Chinese grammar correction corpus with a value greater than 0.8.

[0083] a10. Verify whether Chinese text correction corpora based on model collaboration and edit distance assessment can improve model error correction performance.

[0084] 1) This invention uses the publicly available Chinese grammar correction corpus Lang8 and the Chinese text correction corpus evaluated by model collaboration and edit distance as experimental data. Precision, recall, and F0.5 are commonly used as evaluation metrics for the Chinese grammar correction model. The specific calculation methods are as follows:

[0085]

[0086]

[0087]

[0088] 2) To verify the effectiveness of the proposed method in improving the performance of Chinese grammar correction models, this invention selects a mainstream Chinese grammar correction model as the benchmark model. The sequence-to-sequence model adopts the most commonly used Transformer structure, aiming to directly "translate" faulty sentences into correct sentences. In this verification, this invention uses the open-source Chinese-BART model and the Mengzi-T5 model for pre-training and fine-tuning.

[0089] Table 2 Performance of different models on this Chinese corpus

[0090] All fine-tuned models were fine-tuned under the same conditions:

[0091] Training data: A pre-trained version trained using a fusion of the Lang8-based training set and the error-correcting sample training set generated in this invention;

[0092] Hyperparameters: batch size=32, learning rate=0.0001, epochs=5, LoRA Rank=16, LoRAAlpha=16, LoRA Dropout=0.05, max token=1024;

[0093] Fine-tuning the dataset: using gec_merge for fine-tuning;

[0094] Evaluation: Use the ChERRANT tool to perform character-level alignment between the model output and the reference answer, and calculate sentence-level Precision, Recall, and F. 0.5 (β=0.5).

[0095] The specific embodiments of the present invention have been described in detail above with reference to the accompanying drawings. However, the present invention is not limited to the above embodiments. Within the scope of knowledge possessed by those skilled in the art, various changes can be made without departing from the spirit of the present invention.

Claims

1. An automatic generation method for Chinese discourse grammar correction corpus based on edit distance optimization, characterized by: The specific steps of the method are as follows: Step 1: Based on existing mature Chinese grammar correction prompt templates and prompt templates that have been adjusted multiple times through mainstream models, manual optimization and organization are carried out. Step 2: Use the generated prompt template to guide the large model to automatically segment and generate Chinese grammar correction corpus; Step 3: Perform data cleaning and post-processing on the Chinese grammar correction corpus generated in the previous step; Step 4: Use the ChERRANT edit distance tool to evaluate the distance between the candidate Chinese grammar correction corpora generated by the two models, and select the corpus segments whose normalized edit distance between the two model results is less than the preset threshold in order to improve the quality of the generated Chinese grammar correction corpora. Step 5: By analyzing the feature similarity and distance correlation of the evaluated Chinese grammar correction corpus, the pre-processed corpus whose features meet the requirements is entered into the large model for error correction. Step 6: The accuracy of the text content is ensured through manual review. The performance of the automatic generation method of Chinese discourse grammar correction corpus optimized by Lang8 is verified to be excellent and meets the requirements for existing Chinese grammar errors.

2. The method for automatically generating Chinese discourse grammar correction corpus based on edit distance optimization according to claim 1, characterized in that: The specific steps of Step 1 are as follows: Step 1.1: Call DeepSeek-V3 and Qwen / QwQ-32B large language model from the silicon-based streaming platform; Step 1.2: Deploy DeepSeek-V3 and the Qwen / QwQ-32B open-source large language model as API interfaces for easy calling; Step 1.3: From the publicly available Chinese grammar correction corpus Lang8, manually select the segments containing typical Chinese grammar errors; use the selected examples to construct the thought chain, existing mature Chinese grammar correction prompt templates, and prompt templates that have been adjusted multiple times through mainstream large models, and manually optimize and organize them to generate suitable prompt templates; Step 1.4: Filter out some prompt templates that cannot be processed by the language model from the prompt templates, and filter out completely identical prompt templates; Step 1.5: Strictly write the generated required format into the template to avoid formatting errors.

3. The method for automatically generating Chinese discourse grammar correction corpus based on edit distance optimization according to claim 1, characterized in that: The specific steps of Step 2 are as follows: Step 2.1: To ensure the correctness of the corpus generated by the model, this invention combines existing mature error correction templates and manually constructs multiple test templates; Step 2.2: In order to reduce excessive modifications while maintaining high error correction accuracy, the error correction effect of each prompting scheme was tested independently multiple times from the manually constructed schemes. After selecting the prompting scheme that best meets the requirements, the manually constructed prompting template was optimized using the open source model Qwen / Qwen3-VL-235B-A22B-Instruct. The best prompting template was selected by combining the generation results of multiple models. Step 2.3: The corpus is segmented using DeepSeek-V3 and Qwen / QwQ-32B models respectively. The segmentation is made reasonable by means of semantic boundary priority breakpoint identification, intelligent block size control and context continuity processing. Step 2.4: Input the same text corpus into DeepSeek-V3 and Qwen / QwQ-32B models respectively, and obtain the error correction results of the two models; Step 2.5: While keeping the meaning and logic of Chinese sentences unchanged, the model is required to delete repeated content, unknown symbols, redundant spaces, incorrectly used punctuation, and single short texts with inconsistent content from the Chinese grammar correction corpus, thereby eliminating potentially low-quality corpus in the generated Chinese grammar correction corpus.

4. The method for automatically generating Chinese discourse grammar correction corpus based on edit distance optimization according to claim 1, characterized in that: The specific steps of Step 3 are as follows: Step 3.1: Use the Chinese-English text filtering tool to actively delete long English paragraphs in the Chinese grammar correction corpus, while retaining appropriate English proper nouns; Step 3.2: Use data cleaning tools to perform preliminary data cleaning on the processed Chinese grammar correction corpus. Delete texts with short character lengths and ambiguous language from the corpus, delete most paragraphs in the corpus that are exactly the same before and after modification, and retain only a very small number of identical paragraphs as reference examples.

5. The method for automatically generating Chinese discourse grammar correction corpus based on edit distance optimization according to claim 1, characterized in that: The specific steps of Step 5 are as follows: Step 5.1: Call the API interface of the deployed QwenLong-L1-32B model to reprocess the filtered candidate error correction corpus. Corpus with evaluation scores greater than the preset threshold is retained, and corpus with scores less than the threshold is reprocessed. A corpus can be processed a maximum of 3 times until the filtering condition value is reached or it is removed to prevent excessive modification. Step 5.2: The QwenLong-L1-32B model is used to comprehensively evaluate the semantic consistency, fluency and error rationality of the initially generated error-correct sentence pairs. Low-quality samples are filtered or corrected, and the corpus with an evaluation score greater than 8.0 is retained.

6. The method for automatically generating Chinese discourse grammar correction corpus based on edit distance optimization according to claim 1, characterized in that: The specific steps of Step 6 are as follows: Step 6.1 Analyze the edit distance optimized Chinese discourse grammar correction corpus, and analyze its data distribution and correlation with publicly available Chinese grammar correction corpora; Step 6.2: Select the mainstream Chinese grammar correction model, namely the sequence-to-sequence model, and conduct experiments on the publicly available Chinese grammar correction corpus Lang8, the Chinese grammar correction corpus generated by multi-model collaborative generation, and the Chinese grammar correction corpus after edit distance evaluation; Step 6.3: Evaluate the error correction performance of the Chinese grammar correction model by calculating evaluation metrics such as accuracy, recall, and F-score.

7. An automatic generation system for Chinese discourse grammar correction corpus based on edit distance optimization, characterized in that, The system includes a module for performing the automatic generation method of Chinese discourse grammar correction corpus based on edit distance optimization as described in any one of claims 1 to 6.

8. An electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that, When the processor executes the program, it implements the automatic generation method for Chinese discourse grammar correction corpus based on edit distance optimization as described in any one of claims 1 to 6.

9. A non-transitory computer-readable storage medium having a computer program stored thereon, characterized in that, When the computer program is executed by the processor, it implements the automatic generation method for Chinese discourse grammar correction corpus based on edit distance optimization as described in any one of claims 1 to 6.

10. A computer program product, comprising a computer program, characterized in that, When the computer program is executed by the processor, it implements the automatic generation method for Chinese discourse grammar correction corpus based on edit distance optimization as described in any one of claims 1 to 6.