Model training method, text correction method, device, equipment, medium and product
By optimizing the text correction model through text edit distance and reward function, and combining GRPO and RAG training methods, the problem of insufficient training data is solved, and efficient text correction and deep thinking capabilities are achieved with limited data.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- NEW ORIENTAL EDUCATION & TECH GRP CO LTD
- Filing Date
- 2026-03-30
- Publication Date
- 2026-06-19
AI Technical Summary
Existing text correction models are constrained by the scarcity of thought chain data during training, which results in poor error correction and reasoning performance and makes efficient error correction with a small amount of training data difficult.
By acquiring erroneous and corrected samples, the text edit distance is calculated, a reward function is generated, and the text correction model is optimized using the GRPO algorithm and RAG method. The model is trained by combining paragraph granularity and contextual information of the entire article, thus expanding the sources of training data.
Under limited training data conditions, the adaptability and error correction accuracy of the text correction model were improved, and the generalization ability and error correction effect of the model in complex scenarios were enhanced.
Figure CN122242491A_ABST
Abstract
Description
Technical Field
[0001] This disclosure relates to the field of artificial intelligence technology, and in particular to a model training method, a text correction method, an apparatus, a device, a medium, and a product.
Background Technology
[0002] Some artificial intelligence (AI) models in this field can correct text errors and show the reasoning behind each correction. Training such models requires sentence pairs, each consisting of an erroneous sentence and its corrected counterpart, together with chain-of-thought (thought chain) data that demonstrates the reasoning by which the erroneous sentence is corrected. This training data is complex and requires human effort to produce, so its quantity is often limited. Current training methods for correction models, however, require large amounts of such data, including thought chains. Constrained by the amount of available training data, existing correction models perform poorly at both error correction and reasoning.
Summary of the Invention
[0003] Therefore, this disclosure aims to provide a model training method, text correction method, apparatus, device, medium, and product that can expand the sources of training data and produce better model training results.
[0004] In one aspect, this disclosure provides a model training method, comprising: acquiring first training data, the first training data including a first erroneous sample and a first corrected sample; inputting the first erroneous sample into a pre-generated text correction model to obtain a first corrected text; determining a text edit distance based on the first corrected sample and the first corrected text; generating a first reward function based on the text edit distance; and optimizing the text correction model based on the first reward function to obtain a target text correction model, the target text correction model being used to correct errors in the input target text.
[0005] In one possible implementation of this disclosure, the text edit distance includes a first text edit distance and/or a second text edit distance; determining the text edit distance based on the first corrected sample and the first corrected text includes: taking the edit distance between the first corrected sample and the first corrected text as the first text edit distance; and/or obtaining a first edit distance between the first corrected sample and the first erroneous sample, and a second edit distance between the first corrected text and the first erroneous sample, to obtain the second text edit distance.
[0006] In one possible implementation of this disclosure, the text edit distance includes a first text edit distance and a second text edit distance; generating the first reward function based on the text edit distance includes: determining a first candidate function based on the first text edit distance; determining a second candidate function based on the second text edit distance; and performing a weighted summation of the first candidate function and the second candidate function to obtain the first reward function.
[0007] In one possible implementation of this disclosure, the method further includes: dividing the first corrected text into multiple text segments; calculating the proportion of repeated text segments among the multiple text segments; and generating a second reward function based on the proportion; wherein optimizing the text correction model based on the first reward function to obtain a target text correction model includes: optimizing the text correction model based on the first reward function and the second reward function to obtain the target text correction model.
[0008] In one possible implementation of this disclosure, optimizing the text correction model based on the first reward function and the second reward function to obtain the target text correction model includes: performing a weighted summation of the first reward function and the second reward function to obtain a third reward function; and optimizing the text correction model based on the third reward function to obtain the target text correction model.
[0009] In one possible implementation of this disclosure, the first training data also includes the context text corresponding to the first erroneous sample.
[0010] In one possible implementation of this disclosure, the text correction model is pre-generated by: acquiring second training data, which includes a second erroneous sample, a second corrected sample, and thought chain data, wherein the thought chain data is used to describe the derivation process of correcting errors in the second erroneous sample to obtain the second corrected sample; and fine-tuning a pre-trained candidate correction model based on the second training data to obtain the text correction model.
[0011] In one possible implementation of this disclosure, fine-tuning the pre-trained candidate correction model based on the second training data to obtain the text correction model includes: fine-tuning the pre-trained candidate correction model using supervised fine-tuning (SFT) based on the second training data to obtain the text correction model.
[0012] In one possible implementation of this disclosure, the thought chain data is obtained by using a pre-generated thought chain model based on the second erroneous sample and the second corrected sample.
[0013] In one possible implementation of this disclosure, optimizing the text correction model based on the first reward function to obtain the target text correction model includes: optimizing the text correction model based on the first reward function using the Group Relative Policy Optimization (GRPO) algorithm to obtain the target text correction model.
[0014] In another aspect, this disclosure provides a text correction method, including: obtaining erroneous text; and inputting the erroneous text into a text correction model to obtain corrected text, wherein the text correction model is trained using the model training method described above.
[0015] In another aspect, this disclosure provides a model training apparatus, comprising: an acquisition module for acquiring first training data, the first training data including a first erroneous sample and a first corrected sample; an input module for inputting the first erroneous sample into a pre-generated text correction model to obtain a first corrected text; a determination module for determining a text edit distance based on the first corrected sample and the first corrected text; a generation module for generating a first reward function based on the text edit distance; and an optimization module for optimizing the text correction model based on the first reward function to obtain a target text correction model, the target text correction model being used to correct errors in the input target text.
[0016] In another aspect, this disclosure provides an electronic device including: a processor; a memory; and an application program stored in the memory and configured to be executed by the processor, the application program including instructions for performing the model training method described above.
[0017] In another aspect, this disclosure provides a computer-readable storage medium storing a computer program for performing the model training method described above.
[0018] In another aspect, this disclosure provides a computer program product, including a computer program that, when executed by a processor, implements the model training method described above.
[0019] According to the model training method, text correction method, apparatus, device, medium, and product disclosed herein, by designing a reward function calculated based on text edit distance, the adaptability of the model to text correction tasks is improved. This allows for the extensive use of training data consisting only of sentence pairs (error and correct sentences) during model training, meaning that the training data may not include thought chain data, yet good training results can still be obtained. This expands the sources of training data, significantly increasing the amount of training data available for model training, as training data consisting only of sentence pairs is far more abundant than training data that includes both sentence pairs and thought chains. This results in a superior performance of the trained text correction model.
Attached Figure Description
[0020] The specific embodiments of this disclosure are described in detail below with reference to the accompanying drawings, wherein:
Figure 1 shows a schematic flowchart of a model training method according to an embodiment of the present disclosure;
Figure 2 shows a schematic flowchart of a model training method according to an embodiment of the present disclosure;
Figure 3 shows a schematic flowchart of a model training method according to an embodiment of the present disclosure;
Figure 4 shows a schematic diagram of the second text edit distance in the model training method according to the embodiment of Figure 3;
Figure 5 shows a schematic flowchart of a model training method according to an embodiment of the present disclosure;
Figure 6 shows a schematic flowchart of a model training method according to an embodiment of the present disclosure;
Figure 7 shows a flowchart of a text correction method according to an embodiment of the present disclosure;
Figure 8 shows a schematic diagram of the structure of a model training apparatus according to an embodiment of the present disclosure;
Figure 9 shows a schematic diagram of the structure of an electronic device according to an embodiment of the present disclosure.
Detailed Implementation
[0021] To enable those skilled in the art to more clearly understand the concepts and ideas of this disclosure, the following detailed description is provided in conjunction with specific embodiments. It should be understood that the embodiments given herein are only a part of all possible embodiments of this disclosure. After reading this specification, those skilled in the art are capable of making improvements, modifications, or substitutions to parts or the entirety of the following embodiments, and such improvements, modifications, or substitutions are also included within the scope of protection claimed in this disclosure.
[0022] In this document, the terms "one," "an," and other similar words are not intended to indicate that only one of the described things exists, but rather that the description refers only to one of the described things, which may have one or more. In this document, the terms "comprising," "including," and other similar words are intended to indicate a logical relationship, not a spatial relationship. For example, "A includes B" means that logically B belongs to A, not that spatially B is located inside A. Furthermore, the meanings of the terms "comprising," "including," and other similar words should be considered open-ended, not closed-ended. For example, "A includes B" means that B belongs to A, but B does not necessarily constitute all of A; A may also include other elements such as C, D, and E.
[0023] In this document, the terms "first," "second," and other similar terms are not intended to imply any order, quantity, or importance, but are merely used to distinguish different elements. The terms "embodiment," "this embodiment," "an embodiment," and "one embodiment" do not indicate that the description applies only to a specific embodiment, but rather that such description may also be applicable to one or more other embodiments. Those skilled in the art will understand that any description made herein with respect to one embodiment can be substituted, combined, or otherwise combined with the descriptions in one or more other embodiments, and the new embodiments resulting from such substitutions, combinations, or other combinations are readily conceived by those skilled in the art and fall within the scope of this disclosure.
[0024] In the various embodiments of this disclosure, a model can refer to a set of mathematical frameworks and algorithms built upon data, capable of learning data patterns and achieving specific tasks. Its core function is to transform input information into output that meets target requirements; essentially, it is an abstraction and simulation of real-world problems or data relationships. Models can encompass traditional machine learning models (such as decision trees and support vector machines) and deep learning models (such as convolutional neural networks and Transformers). Their performance depends on data quality, algorithm design, and parameter scale. They are the core carrier connecting artificial intelligence theory and practical applications, and a key tool for realizing machine intelligence.
[0025] In the various embodiments of this disclosure, model training refers to the core process of using labeled or unlabeled data to iteratively adjust the model's internal parameters through a specific algorithm, enabling the model to gradually learn data patterns and optimize task performance. Essentially, it is a learning phase that transforms the initial model from lacking ability to being able to complete the target task. Through the training process, the model extracts features (such as image edges and text semantics) from a large amount of labeled or unlabeled data, adjusts internal parameters (such as the weights of a neural network), and gradually optimizes its ability to fit data patterns. After training, the model can receive new inputs and perform operations such as prediction, classification, and generation based on the learned patterns (e.g., an image recognition model judging image content, a large language model generating coherent text).
[0026] Among the related technologies disclosed herein, Chinese essay correction is a key task in which AI (Artificial Intelligence) assists the development of the education industry. Achieving good error correction results is crucial. The current conventional approach to error correction tasks is the "naive" SFT (Supervised Fine-Tuning) method, where the model is trained on erroneous sentences as input and on sentences without errors as labels. However, this approach requires high-quality training data and has weak contextual information for individual input sentences.
[0027] In some of the related technologies disclosed herein, Chinese composition error correction is usually accomplished using SFT, and its performance limit is determined by the data, usually requiring a large amount of data.
[0028] In some of the related techniques disclosed herein, model training employs naive sentence-level SFT fine-tuning. Naive sentence-level SFT fine-tuning has the following drawbacks: 1) The sentence granularity is too small, lacking contextual information, resulting in many uncorrected errors. 2) SFT fine-tuning requires a large quantity and high quality of training data; traditional SFT is more like "memorization," and the more high-quality training data, the stronger the model's error correction "ability." 3) Excessive non-COT (Chain of Thought) SFT data significantly reduces the model's exploration ability and is detrimental to error correction generalization.
[0029] In some embodiments of this disclosure, further performance improvements can be achieved using only the limited data available. To that end, these embodiments draw on the multi-stage training paradigm used for large models in related technologies and use deep thinking to improve error correction performance with limited data.
[0030] In some embodiments of this disclosure, a K12 (kindergarten through twelfth grade) Chinese composition error correction method with deep thinking is proposed. First, COT data with deep thinking is constructed, and a model is trained using the SFT method. Next, this model is used as the base model for GRPO (Group Relative Policy Optimization) training, and the final model is obtained by modifying the reward function. In addition, these embodiments employ RAG (Retrieval-Augmented Generation) to supplement contextual information and assist in error correction.
[0031] In some embodiments of this disclosure, a two-stage training method for K12 Chinese composition with deep thinking is presented. First, a model with deep thinking is obtained through COT SFT. This model is then used as the base model for RL (Reinforcement Learning) training. These embodiments employ the GRPO algorithm to train the final error-correcting model. These embodiments supplement the training data input with contextual information in RAG format.
[0032] In some embodiments disclosed herein, a K12 Chinese composition error correction method with deep thinking is proposed, consisting of a two-stage training process that improves error correction performance with limited data. First, high-quality COT data is constructed by training an assistant model, and the model is fine-tuned on this data using SFT to improve the readability and standardization of its output. Next, a reward function is designed to fit the error correction task, and the model is trained with the GRPO algorithm to improve its performance. Furthermore, these embodiments construct the training data in RAG form (entire article + paragraph to be corrected), supplementing contextual information to assist error correction.
[0033] The innovations in some embodiments disclosed herein are as follows: 1) A two-stage fine-tuning method under limited data: a COT-SFT stage and a GRPO stage. 2) High-quality COT data with deep thinking is obtained by training an error correction explanation assistant and is used for SFT of the model, standardizing the readability and format of the output. 3) For the GRPO algorithm, a reward function adapted to the task is designed: the accuracy reward function is replaced with an edit distance reward function, and an N-gram repetition penalty reward function is introduced to penalize repeated generation. 4) Paragraph-level training data is combined with the RAG method: the entire article is added to the instruction as contextual information for the specific paragraph to be corrected.
[0034] Figure 1 shows a schematic flowchart of a model training method according to an embodiment of the present disclosure.
[0035] As shown in Figure 1, a pre-trained base model is trained on data containing deep-thinking COT through SFT (supervised learning) to obtain a fine-tuned SFT model. The SFT model is then trained on data containing deep-thinking COT through GRPO reinforcement learning to obtain an inference model. In the GRPO reinforcement learning process, the reward function is constructed to include a format reward function, an edit distance reward function, and a repetition penalty reward function. Some related technologies use an accuracy reward function and a format reward function; the text correction task in this embodiment continues to use the format reward function.
[0036] Figure 2 shows a schematic flowchart of a model training method according to an embodiment of the present disclosure.
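The composition of the reward described above can be illustrated with a minimal sketch. It is not taken from the disclosure: the use of character-level N-grams as the text segments, the value n = 3, the sign convention of the penalty, the equal default weights, and the function names repeated_ngram_ratio, repetition_penalty_reward, and combined_reward are all assumptions made for illustration.

```python
from collections import Counter


def repeated_ngram_ratio(text: str, n: int = 3) -> float:
    """Proportion of character n-grams that repeat an earlier n-gram.

    Assumption: the "text segments" are overlapping character n-grams.
    """
    ngrams = [text[i:i + n] for i in range(len(text) - n + 1)]
    if not ngrams:
        return 0.0
    counts = Counter(ngrams)
    # Every occurrence beyond the first one counts as a repetition.
    repeated = sum(count - 1 for count in counts.values())
    return repeated / len(ngrams)


def repetition_penalty_reward(text: str, n: int = 3) -> float:
    """Assumed sign convention: more repetition gives a lower (negative) reward."""
    return -repeated_ngram_ratio(text, n)


def combined_reward(format_reward: float,
                    edit_reward: float,
                    repetition_reward: float,
                    weights: tuple = (1.0, 1.0, 1.0)) -> float:
    """Weighted sum of the three reward components (weights are hypothetical)."""
    w_format, w_edit, w_repeat = weights
    return (w_format * format_reward
            + w_edit * edit_reward
            + w_repeat * repetition_reward)
```

In such a sketch, a degenerate output that loops over the same phrase receives a ratio close to 1 and therefore a strongly negative repetition component, while a normal sentence receives a ratio near 0.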
[0037] According to this embodiment, the model training method includes steps S210 to S250, and each step is described in detail below.
[0038] S210. Obtain the first training data, which includes the first erroneous sample and the first corrected sample.
[0039] In this embodiment, the first training data includes a first erroneous sample and a first corrected sample, wherein the first corrected sample is obtained by correcting errors in the first erroneous sample. Obtaining the first corrected sample may require manual intervention, so the amount of first training data is limited, and a better training method is needed to achieve good results with it. In some embodiments, the first training data includes only the first erroneous sample and the first corrected sample and excludes thought chain data. Such data is nevertheless more plentiful than training data that contains thought chain data (such as the second training data described below).
[0040] In this embodiment, the first erroneous sample can also be called src (i.e., source data), and the first error-correcting sample can also be called ref (i.e., reference data) or tgt (i.e., target data).
[0041] S220. Input the first erroneous sample into the pre-generated text correction model to obtain the first corrected text.
[0042] In this embodiment, the pre-generated text correction model can refer to a text correction model that has been fine-tuned using the SFT method, i.e., a text correction model trained in the first stage. Therefore, inputting the first erroneous sample into the pre-generated text correction model can yield a first corrected text that is better than the corrected text generated by the text correction model (candidate correction model) that has not undergone the first stage of training.
[0043] In this embodiment, the first corrected text can also be called hyp (i.e., hypothesis, the data generated by the model).
[0044] S230. Determine the text edit distance based on the first corrected sample and the first corrected text.
[0045] In this embodiment, the first corrected sample can be a manually produced correction of the erroneous sentence, and the first corrected text can be the correct (or nearly correct) sentence that the model produces when correcting the erroneous sentence. For text or essay correction tasks, using the text edit distance to measure the similarity between these two sentences suits the characteristics of text errors.
[0046] S240. Generate the first reward function based on the text edit distance.
[0047] In this embodiment, the first reward function can be determined based on the text edit distance. For example, when the correct sentence generated by the model (i.e., the first corrected text) is closer to the correct sentence annotated by a human (i.e., the first corrected sample), the model can be given a larger reward. The greater the distance between the two, the smaller the reward can be given. In this way, the error correction ability of the text correction model can be effectively trained, and the accuracy of error correction can be improved.
[0048] The following describes the edit distance reward function. The quality of answers to math problems is typically judged using an accuracy reward function, which evaluates whether the model's response is correct: for a math problem, the final answer either equals the reference answer or does not. Taking the accuracy reward function used mainly for math problems as a reference, this embodiment provides a reward function that reflects the quality of essay correction.
[0049] First, edit distance similarity is introduced. The edit distance is the minimum number of single-character editing operations (inserting, deleting, or replacing a character) required to transform one string into another; edit distance similarity is a similarity score derived from it. Operationally, this matches the text correction task, because the transformation takes relative positional information into account. Therefore, this embodiment chooses edit distance similarity.
[0050] The edit distance reward function can be set so that the reward is highest when the first corrected text (hyp) and the first corrected sample (ref) are as similar as possible, that is, by calculating the edit distance similarity between the two. An example is shown below.
[0051] First corrected text (hyp): "I have some friends, like a little monkey, a little dog, a little pig..." First corrected sample (ref): "I have many friends, such as a monkey, a dog, a pig..." The edit distance similarity between the first corrected text and the first corrected sample is then calculated, giving ed_ratio = 0.8717948717948718. Here, ed_ratio is a floating-point number between 0 and 1; the closer it is to 1, the more similar the two texts. The formula is: reward(hyp, ref) = ed_ratio(hyp, ref).
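A minimal sketch of such an edit distance reward is given below. The disclosure does not state how ed_ratio is normalized, so the normalization used here (one minus the Levenshtein distance divided by the length of the longer string) is an assumption, and the sketch is not expected to reproduce the value quoted above, which was computed on the original-language text. The function names levenshtein and edit_distance_reward are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


def ed_ratio(hyp: str, ref: str) -> float:
    """Similarity in [0, 1]; 1 means identical (assumed normalization)."""
    longest = max(len(hyp), len(ref))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(hyp, ref) / longest


def edit_distance_reward(hyp: str, ref: str) -> float:
    """reward(hyp, ref) = ed_ratio(hyp, ref), as stated in the text."""
    return ed_ratio(hyp, ref)
```

Because the reward varies continuously with the number of differing characters, a model output that fixes most errors still earns partial credit, unlike an exact-match accuracy reward.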
[0052] S250. The text correction model is optimized based on the first reward function to obtain the target text correction model, which is used to correct errors in the input target text.
[0053] In this embodiment, optimizing the text correction model using a first reward function can mean using the first reward function as the target and iteratively adjusting the trainable parameters of the text correction model to maximize its reward value, thereby obtaining a better text correction model. When training is complete, the target text correction model is obtained and used to perform the error correction task.
[0054] As an example, the first training data also includes the context text corresponding to the first erroneous sample.
[0055] In this example, the context text corresponding to the first erroneous sample can be the entire or part of the paragraph containing that sentence when the first erroneous sample is a sentence; or the context text can be the entire or part of the article containing that paragraph when the first erroneous sample is a paragraph.
[0056] In this example, RAG can be used to supplement contextual information.
[0057] The inventors of this publication have found in practice that different training data granularities significantly impact error correction effectiveness. At the whole-text level, K12 essays vary considerably in length, making error correction for longer essays time-consuming and negatively affecting the user experience. Furthermore, long texts are more prone to missed errors. At the sentence level, the effectiveness depends on the reasonableness of sentence segmentation. However, elementary school essays often contain punctuation errors, such as misuse of punctuation (using periods instead of commas; using only commas throughout), missing punctuation (absence of punctuation marks), and redundant punctuation. Inappropriate sentence segmentation interferes with error correction, typically resulting in more frequent and inaccurate marking.
[0058] Therefore, paragraph granularity is a relatively appropriate granularity. However, the inventors of this disclosure found in practice that information such as the title and preceding paragraphs can affect the error correction content of the current paragraph. Therefore, supplementing the instruction part of the training data with the entire essay corresponding to the paragraph, i.e., providing the contextual information of the error correction sample, yields the best error correction effect. This method is called RAG. Below is an example of training data.
[0059] In this example, the instruction or context text is as follows:
"For the input composition:\nTitle: Whose Feather Is This?\nBody: When spring is in full bloom, there is a little turtle playing in a puddle. It jumps from one puddle to another. This time, it sees a feather and picks it up... (omitted)... The turtle then gives the feather to the bird, who says, 'Thank you for finding my feather.' The turtle politely replies, 'Don't be so rude.'\nGiven the following paragraph, while maintaining the main content and meaning of the original text, focus on the overall meaning of the composition, correcting punctuation errors, word choices, spelling errors, and grammatical errors; enhance sentence fluency and the coherence of the composition:"
In this example, the first erroneous sample (src) is as follows:
He searched and searched, and a little monkey, pulling at a vine, saw the little turtle and asked, "Little turtle, do you need any help? Do you need my help?" The little turtle took out a feather and replied, "Of course I do! Is this your feather?" Then the little monkey replied, "No, I don't. Go ask someone else." The little turtle said, "Okay, thank you very much."
In this example, the first corrected sample (tgt) is as follows:
"Okay, now I need to carefully review the essay provided by the user to see if there are any areas that need revision. First, I need to read it through to understand the main idea. Next, I will check each paragraph to be corrected, sentence by sentence, to see if there are any issues with punctuation, spelling, or grammar."
[0060] For the sentence: ...omitted...
[0061] Finally, ensure the revised essay retains the original meaning, flows smoothly, and conforms to formal written language. After checking each sentence, integrate and revise to ensure no errors are missed.
He searched and searched, and a little monkey, holding onto a vine, saw the little turtle and asked, "Little turtle, what can I do for you? Do you need my help?" The little turtle took out a feather and replied, "Of course I can! Is this your feather?" The little monkey replied, "No, it's not mine. Go ask someone else." The little turtle said, "Okay, thank you for your answer."
In this example, the first corrected sample includes the preceding reasoning part and the subsequent correction part.
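How a training record of this shape might be assembled can be sketched as follows. The record layout (keys instruction, src, tgt), the shortened task prompt, and the function name build_record are hypothetical and chosen only to mirror the example; the disclosure does not specify a data format.

```python
# Shortened stand-in for the task prompt in the example (assumption).
TASK_PROMPT = (
    "Given the following paragraph, while maintaining the main content and "
    "meaning of the original text, correct punctuation, word-choice, spelling, "
    "and grammatical errors:"
)


def build_record(title: str, body: str, paragraph: str, corrected: str) -> dict:
    """Pair one paragraph with its whole essay as context.

    The entire essay (title + body) goes into the instruction, the paragraph
    to be corrected is the source, and the reference output is the target.
    """
    instruction = (
        "For the input composition:\n"
        f"Title: {title}\n"
        f"Body: {body}\n"
        f"{TASK_PROMPT}"
    )
    return {"instruction": instruction, "src": paragraph, "tgt": corrected}
```

Under this layout, an essay with several paragraphs yields one record per paragraph, and every record carries the same essay-level context.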
[0062] As an example, the text correction model can be optimized based on the first reward function using the Group Relative Policy Optimization (GRPO) algorithm to obtain the target text correction model.
[0063] In this example, the GRPO algorithm is used to optimize the text correction model. This allows for dynamic optimization of the correction strategy, avoiding the rigid correction problems caused by over-reliance on fixed reference texts in supervised fine-tuning. It also enables more flexible handling of complex and unseen error types (such as semantic ambiguity errors). Furthermore, the GRPO algorithm continuously iterates model decisions through reinforcement learning, finding a balance between correction accuracy and text naturalness. This reduces over-correction or incomplete correction, improving the model's adaptability and generalization ability in diverse real-world scenarios (such as colloquial text and domain-specific text).
[0064] In this example, the training data for the GRPO algorithm is described below. The input to the GRPO model is the first erroneous sample (src); the model's error correction reference is the first corrected sample (ref), i.e., the ground truth (real data); the model's predicted output consists of the deep thinking part + the error correction part, where the error correction part is the first corrected text (hyp). As a key component of the GRPO reinforcement learning algorithm, the design of the reward function is crucial.
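The role of the reward in GRPO can be illustrated with a short sketch of the group-relative advantage, which is the general idea behind the algorithm: several outputs are sampled for the same input, each is scored by the reward function, and each score is normalized against the mean and standard deviation of its own group. The use of the population standard deviation, the eps constant, and the function name group_relative_advantages are assumptions; the disclosure does not give these details.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list, eps: float = 1e-8) -> list:
    """Normalize each sampled output's reward against its own group.

    `rewards` holds the reward of every hyp sampled for one src.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation (assumption)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

An output whose reward is above the group mean receives a positive advantage and is reinforced, while one below the mean is discouraged; when all outputs in a group score the same, every advantage is zero.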
[0065] Figure 3 shows a schematic flowchart of a model training method according to an embodiment of the present disclosure.
[0066] According to this embodiment, the model training method includes steps S310 to S380, and each step is described in detail below.
[0067] S310. Obtain the first training data, which includes the first erroneous sample and the first error-corrected sample.
[0068] S320. Input the first erroneous sample into the pre-generated text correction model to obtain the first corrected text.
[0069] For details of S310 and S320, refer to the detailed descriptions of S210 and S220 in the embodiment shown in Figure 2 above, which will not be repeated here.
[0070] S330. The edit distance between the first error correction sample and the first error correction text is taken as the first text edit distance.
[0071] In this embodiment, the first error correction sample is a manually generated error correction result, and the first error correction text is a model-generated error correction result. By comparing the first error correction sample and the first error correction text, the quality of the correct sentence generated by the model can be judged, thereby judging the quality of the model's error correction ability.
[0072] S340. Obtain the first edit distance between the first corrected sample and the first incorrect sample, and the second edit distance between the first corrected text and the first incorrect sample, so as to obtain the second text edit distance.
[0073] In this embodiment, by comparing the first edit distance and the second edit distance, the first erroneous sample is included in the judgment of the model's error correction ability. This makes it possible to judge whether the correct sentence generated by the model is close to the correct sentence written by a human while also taking into account how far each of them departs from the erroneous sample, so that the judgment of the model's error correction ability is more realistic, objective, and accurate.
[0074] In this embodiment, considering that the text to be corrected, i.e., the first erroneous sample (src) input to the model, also affects the correction result, a reward function can be added. The goal is to make the distance from the first corrected text (hyp) to the first erroneous sample (src) as similar as possible to the distance from the first corrected sample (ref) to the first erroneous sample (src). The formula is expressed as follows:
reward(src, ref, hyp) = 1 - abs(ed_ratio(src, hyp) - ed_ratio(src, ref))
Where abs represents the absolute value; ed_ratio(src, hyp) represents the edit distance between the first erroneous sample and the first corrected text; ed_ratio(src, ref) represents the edit distance between the first erroneous sample and the first corrected sample; and reward(src, ref, hyp) represents the reward function. The closer ed_ratio(src, hyp) is to ed_ratio(src, ref), the closer abs(ed_ratio(src, hyp) - ed_ratio(src, ref)) is to 0, and therefore the closer reward(src, ref, hyp) is to 1.
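The formula above can be sketched in code as follows. This is a minimal sketch under stated assumptions: the text does not specify how ed_ratio is normalized, so here it is assumed to be the character-level Levenshtein distance divided by the length of the longer string, and the helper names `levenshtein` and `reward_src_ref_hyp` are illustrative.

```python
def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def ed_ratio(a: str, b: str) -> float:
    """Assumed normalization: distance divided by the longer length."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


def reward_src_ref_hyp(src: str, ref: str, hyp: str) -> float:
    """reward(src, ref, hyp) = 1 - abs(ed_ratio(src, hyp) - ed_ratio(src, ref))."""
    return 1.0 - abs(ed_ratio(src, hyp) - ed_ratio(src, ref))
```

For instance, if the model returns src unchanged while ref differs from src in one of four characters, ed_ratio(src, hyp) is 0 and ed_ratio(src, ref) is 0.25, so the reward is 0.75.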
[0075] Figure 4 illustrates the overall edit distance reward function. The goal is for edge C to approach 0, meaning the distance between the first corrected text (hyp) and the first corrected sample (ref) is as small as possible, and for edge B to approach the length of edge A, meaning the distance from the first corrected text (hyp) to the first incorrect sample (src) is as similar as possible to the distance from the first corrected sample (ref) to the first incorrect sample (src).
[0076] S350. Determine the first candidate function based on the first text edit distance.
[0077] In this embodiment, the first candidate function can be a reward function determined based on the first text edit distance, which optimizes the model with the goal of minimizing the first text edit distance.
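One possible form of the first candidate function is sketched below. The exact formula is not restated in this passage, so the sketch assumes the reward is 1 minus the character-level edit distance between hyp and ref normalized by the longer length. The names `edit_distance` and `first_candidate_reward` are illustrative assumptions.

```python
def edit_distance(a: str, b: str) -> int:
    """Character-level Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,
                            curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def first_candidate_reward(hyp: str, ref: str) -> float:
    """Assumed form: 1 - normalized edit distance between hyp and ref.

    The reward is 1.0 when the model output equals the reference and
    decreases as the first text edit distance grows.
    """
    longest = max(len(hyp), len(ref))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(hyp, ref) / longest
```

Maximizing this reward is equivalent to minimizing the first text edit distance, which matches the stated optimization goal.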
[0078] S360. Determine the second candidate function based on the second text edit distance.
[0079] In this embodiment, the second candidate function can be a reward function determined based on the second text edit distance, which optimizes the model with the goal of minimizing the second text edit distance.
[0080] S370. The first candidate function and the second candidate function are weighted and summed to obtain the first reward function.
[0081] In this embodiment, the first candidate function and the second candidate function can be assigned different weights according to their relative importance, and their weighted sum is taken as the first reward function used to optimize and iterate the model.
[0082] S380. The text correction model is optimized based on the first reward function to obtain the target text correction model, which is used to correct errors in the input target text.
[0083] For details of S380, refer to the detailed description of S250 in the embodiment shown in Figure 2 above, which will not be repeated here.
[0084] Figure 5 shows a schematic flowchart of a model training method according to an embodiment of the present disclosure.
[0085] According to this embodiment, the model training method includes steps S510 to S580, and each step is described in detail below.
[0086] S510. Obtain the first training data, which includes the first erroneous sample and the first error-corrected sample.
[0087] S520. Input the first erroneous sample into the pre-generated text correction model to obtain the first corrected text.
[0088] S530. Based on the first error correction sample and the first error correction text, determine the text edit distance.
[0089] S540. Generate the first reward function based on the text edit distance.
[0090] For details of S510 to S540, refer to the detailed descriptions of S210 to S240 in the embodiment shown in Figure 2 above, which will not be repeated here.
[0091] S550. Divide the first error-correction text into multiple text segments.
[0092] In this embodiment, the first corrected text can be divided into multiple smaller segments. When the first corrected text is a sentence, it can be divided into shorter sentences or phrases; when the first corrected text is a paragraph, it can be divided into sentences or clauses; when the first corrected text is an article, it can be divided into paragraphs or long sentences.
[0093] S560. Calculate the proportion of repeated text segments among multiple text segments.
[0094] In this embodiment, for the multiple text segments that are divided, the number of text segments that are repeated among each other can be calculated and compared with the total number of text segments to obtain the proportion of repeated text segments.
[0095] S570. Generate the second reward function according to the ratio.
[0096] In this embodiment, a second reward function can be generated based on the proportion of repeated text fragments. The goal of the second reward function is to minimize the number or proportion of repeated text. Therefore, when the proportion of repeated text fragments is larger, the reward value of the second reward function should be smaller, or the penalty should be greater.
[0097] In this embodiment, the reward function for penalizing repeated generation is constructed as follows: the text generated by the model is divided into segments using the N-gram method; the repetition ratio of these segments is calculated; and text with a high repetition ratio is penalized. The specific penalty formula is:
repetition_rewards = max_penalty * (1 - len(ngrams) / total)
Where max_penalty represents the maximum penalty, set to -1; len(ngrams) is the number of text segments after deduplication; total is the total number of text segments; and repetition_rewards is a decimal in the range of -1 to 0. The more severe the repetition, the heavier the penalty and the closer repetition_rewards is to -1.
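The repetition penalty can be sketched as follows. This is a minimal sketch that assumes character-level N-grams with N = 3 and returns 0 for text too short to form a single N-gram; the source does not fix the tokenization or the value of N, and the function name `repetition_reward` is illustrative.

```python
def repetition_reward(text: str, n: int = 3, max_penalty: float = -1.0) -> float:
    """repetition_rewards = max_penalty * (1 - len(ngrams) / total).

    `total` is the number of N-gram segments in the text and
    `len(ngrams)` is the number of distinct segments after deduplication.
    """
    tokens = list(text)  # assumption: character-level segmentation
    total = len(tokens) - n + 1
    if total <= 0:
        return 0.0  # assumption: no penalty when no N-gram can be formed
    ngrams = {tuple(tokens[i:i + n]) for i in range(total)}
    return max_penalty * (1 - len(ngrams) / total)
```

With these assumptions, a text with no repeated 3-grams scores 0, while "aaaaaa" has four 3-grams of which one is distinct and scores -0.75.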
[0098] S580. Based on the first reward function and the second reward function, the text error correction model is optimized to obtain the target text error correction model, which is used to correct errors in the input target text.
[0099] In this embodiment, optimizing the text correction model based on the first reward function and the second reward function can refer to comprehensively considering the function values of the first and second reward functions, aiming to maximize the two function values, and continuously iterating and adjusting the parameters of the text correction model to make the reward values of the two reward functions as large as possible.
[0100] As an example, in order to optimize the text correction model based on the first reward function and the second reward function to obtain the target text correction model, we can first perform a weighted summation of the first reward function and the second reward function to obtain the third reward function; then, based on the third reward function, we can optimize the text correction model to obtain the target correction model.
[0101] In this example, the third reward function is obtained by weighted summation of the first and second reward functions. In other words, weights are assigned according to the importance of the first and second reward functions, and the goal is to optimize the text correction model iteratively so that the result generated by the text correction model maximizes the reward value of the third reward function.
[0102] In this example, in practical applications, the results of multiple reward functions can be weighted and summed to give the final reward for a text. The following is a formula for calculating the reward:
final_reward = w1 * reward(hyp, ref) + w2 * reward(src, ref, hyp) + w3 * repetition_rewards
Where w1, w2, and w3 are weight parameters, which are trainable parameters of the model; reward(hyp, ref) is the first candidate function, which is the reward function calculated based on the first text edit distance; reward(src, ref, hyp) is the second candidate function, which is the reward function calculated based on the second text edit distance; and repetition_rewards is the second reward function, which is the reward function that penalizes repeated generation.
[0103] Figure 6 shows a schematic flowchart of a model training method according to an embodiment of the present disclosure.
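The weighted combination can be sketched as follows. The default weights 0.5, 0.3, and 0.2 are placeholder values chosen for illustration only and do not come from the source, which treats w1, w2, and w3 as parameters to be determined. The three reward values are passed in as already-computed numbers.

```python
def final_reward(r_hyp_ref: float,
                 r_src_ref_hyp: float,
                 repetition_rewards: float,
                 w1: float = 0.5,
                 w2: float = 0.3,
                 w3: float = 0.2) -> float:
    """final_reward = w1 * reward(hyp, ref)
                    + w2 * reward(src, ref, hyp)
                    + w3 * repetition_rewards

    The weights are hypothetical placeholders, not values from the source.
    """
    return w1 * r_hyp_ref + w2 * r_src_ref_hyp + w3 * repetition_rewards


# Example: good match to ref, similar edit ratio, some repetition.
score = final_reward(0.8, 0.9, -0.5)
```

Because repetition_rewards lies between -1 and 0, the third term can only lower the total, so heavily repetitive output is ranked below otherwise equivalent output.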
[0104] According to this embodiment, the model training method includes steps S610 to S670, and each step is described in detail below.
[0105] S610. Obtain second training data, which includes a second erroneous sample, a second error-correcting sample, and thought chain data. The thought chain data is used to describe the derivation process of correcting the errors in the first erroneous sample to obtain the first error-correcting sample.
[0106] In this embodiment, the second training data can be data different from the first training data. The second training data includes chain-of-thought data, and the first training data may not include chain-of-thought data. Since the generation and adjustment of chain-of-thought data require human participation, the number of training data containing chain-of-thought data (such as the second training data) is much less than the number of training data without chain-of-thought data (such as the first training data). For the existing relatively small amount of second training data, it can be used to fine-tune the pre-trained error correction model, so that the error correction performance and thinking quality of the model are better.
[0107] In this embodiment, the error correction explanation can be regarded as the error correction thinking process, that is, the in-depth thinking stage. Here is an example of training data with chain-of-thought (CoT).
[0108] The second incorrect sample (src) is as follows: "The long-awaited trip for me has finally arrived! Today, Dad is just free and takes us to the wildlife park to play." The chain-of-thought data and the second error correction sample (tgt) are as follows: "Okay, I have to carefully look at the composition provided by the user now to see if there are any places that need to be corrected. First, I have to read it through to understand the general idea of the article. Next, I will check each sentence in the paragraph to be corrected to see if there are any punctuation, spelling, grammar, etc. problems.
[0109] For the sentence: "The long-awaited trip for me has finally arrived!" The characters [己] and [已] are similar in pronunciation but have different meanings. [己] is usually used to represent oneself, while [已] means already. In this sentence, the use of [己] is incorrect because the sentence expresses the meaning of "already". [己] should be changed to {已}.
[0110] Finally, ensure that the revised composition maintains the original meaning, is smooth in language, and conforms to written language. After checking each sentence, integrate the revisions to ensure that no error points are missed. The long-awaited trip for me has finally arrived! Today, Dad is just free and takes us to the wildlife park to play." In this embodiment, the in-depth thinking process (i.e., the chain-of-thought data) and the error correction target / result (i.e., the second error correction sample) are arranged so that the chain-of-thought data comes first and the second error correction sample comes after.
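The ordering described above, with the chain-of-thought first and the corrected text after it, can be sketched as a simple construction of one fine-tuning example. The `SftExample` container, the `build_sft_example` function, and the newline separator are hypothetical names and choices introduced for illustration; the source does not specify how the two parts are delimited.

```python
from dataclasses import dataclass


@dataclass
class SftExample:
    prompt: str  # second erroneous sample (src)
    target: str  # chain-of-thought followed by the corrected text (tgt)


def build_sft_example(src: str, chain_of_thought: str, tgt: str,
                      sep: str = "\n") -> SftExample:
    """Place the chain-of-thought first and the corrected text after it.

    `sep` is an assumed separator between the two parts.
    """
    return SftExample(prompt=src, target=f"{chain_of_thought}{sep}{tgt}")


example = build_sft_example(
    src="text with an error",
    chain_of_thought="Reasoning about where the error is.",
    tgt="text without the error",
)
```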
[0111] S620. Fine-tune the pre-trained candidate error correction model based on the second training data to obtain a text error correction model.
[0112] In this embodiment, the second training data includes thought chain data. Therefore, by fine-tuning the pre-trained candidate error correction model using the second training data, the error correction capability of the candidate error correction model can be improved, and a higher quality thinking and reasoning process can be provided, thereby making the overall performance of the model better and its application scenarios more extensive.
[0113] S630. Obtain the first training data, which includes the first erroneous sample and the first error-corrected sample.
[0114] S640. Input the first erroneous sample into the pre-generated text correction model to obtain the first corrected text.
[0115] S650. Based on the first error correction sample and the first error correction text, determine the text edit distance.
[0116] S660. Generate the first reward function based on the text edit distance.
[0117] S670. The text correction model is optimized based on the first reward function to obtain the target text correction model, which is used to correct errors in the input target text.
[0118] For details of S630 to S670, refer to the detailed descriptions of S210 to S250 in the embodiment shown in Figure 2 above, which will not be repeated here.
[0119] As an example, the pre-trained candidate error correction model can be fine-tuned on the second training data using supervised fine-tuning (SFT) to obtain the text error correction model.
[0120] In this example, Supervised Fine-Tuning (SFT) is used to train the text correction model, enabling the model to accurately align with the error correction task objectives. First, SFT relies on manually labeled "error text - correct text" pairing data, directly conveying explicit error correction rules to the model. This avoids the ambiguity in error correction direction found in unsupervised methods, significantly improving accuracy, especially for specific types of errors such as grammatical and spelling mistakes. Second, it allows pre-trained models to quickly adapt to error correction scenarios without starting from scratch, shortening the training cycle and reducing data requirements, while preserving the pre-trained model's language understanding capabilities, balancing accuracy and text fluency. Furthermore, SFT can adjust the data distribution to specifically optimize the model's ability to correct high-frequency errors, enhancing its practicality in real-world scenarios (such as copywriting and academic writing) and reducing over-correction or under-correction.
[0121] In this example, the training process begins with Supervised Fine-Tuning (SFT). The SFT phase lays a solid foundation for the subsequent GRPO reinforcement learning phase. Compared to starting RL from the base model, the output produced by the SFT model is more readable, exhibits fewer hallucinations, and is less harmful.
[0122] As an example, the thought chain data is obtained by using a pre-generated thought chain model based on the second erroneous sample and the second error-correcting sample.
[0123] In this example, the thought chain model can be a model that automatically generates thought chain data based on error samples (such as incorrect sentences or paragraphs) and correction samples (such as sentences or paragraphs corrected by humans). The thought chain data generated by the model can fully demonstrate the thinking and identification process of how to identify grammatical errors and word errors in error samples, as well as how to correct these errors to obtain the correct sentences or paragraphs.
[0124] In this example, to obtain high-quality response data with long chains of thought (CoT), a model can be used for synthesis. An error correction and interpretation model can be trained as an assistant that takes a pair consisting of an erroneous text (src) and its corrected text (tgt) as input and outputs two parts: the error type and the error description.
[0125] Figure 7 shows a schematic flowchart of a text correction method according to an embodiment of the present disclosure.
[0126] According to this embodiment, the text correction method includes steps S710 to S720, and each step is described in detail below.
[0127] S710. Obtain the erroneous text.
[0128] In this embodiment, the erroneous text is text with errors entered by the user (e.g., a student), which may be entered while doing practice questions or during an exam.
[0129] S720. Input the erroneous text into the text correction model to obtain the corrected text.
[0130] In this embodiment, after obtaining the erroneous text input by the user, the trained text correction model can identify the errors in the erroneous text. In some cases, it can also provide the thought process or reasoning process and display the correct text after correcting the errors (i.e., the corrected text), so that the user can learn where their mistakes are and learn to write correct sentences or paragraphs.
[0131] In this embodiment, the text correction model can be trained using the aforementioned model training method.
[0132] Figure 8 shows a schematic diagram of a model training apparatus according to an embodiment of the present disclosure.
[0133] In this embodiment, the model training device 800 includes an acquisition module 810, an input module 820, a determination module 830, a generation module 840, and an optimization module 850. The acquisition module 810 acquires first training data, which includes first erroneous samples and first corrected samples. The input module 820 inputs the first erroneous samples into a pre-generated text correction model to obtain first corrected text. The determination module 830 determines the text edit distance based on the first corrected samples and the first corrected text. The generation module 840 generates a first reward function based on the text edit distance. The optimization module 850 optimizes the text correction model based on the first reward function to obtain a target text correction model, which is used to correct errors in the input target text.
[0134] It should be noted that the model training device 800 provided in the embodiment shown in Figure 8 is described, when executing the model training method, only by way of the above division into functional modules. In practical applications, the above functions can be assigned to different functional modules as needed; that is, the internal structure of the device can be divided into different functional modules to complete all or part of the functions described above. Furthermore, the model training device 800 provided in the above embodiment and the model training method embodiment shown in Figure 2 belong to the same concept; for the specific implementation process, refer to the method embodiments, which will not be repeated here.
[0135] In a particular embodiment of this disclosure, the text edit distance includes a first text edit distance and / or a second text edit distance; the determination module 830 is further configured to: take the edit distance between the first corrected sample and the first corrected text as the first text edit distance; and / or obtain the first edit distance between the first corrected sample and the first erroneous sample, and the second edit distance between the first corrected text and the first erroneous sample, to obtain the second text edit distance.
[0136] In a particular embodiment of this disclosure, the text edit distance includes a first text edit distance and a second text edit distance; the generation module 840 is further configured to: determine a first candidate function based on the first text edit distance; determine a second candidate function based on the second text edit distance; and perform a weighted summation of the first candidate function and the second candidate function to obtain a first reward function.
[0137] In a particular embodiment of this disclosure, the apparatus is further configured to: divide the first error-correcting text into multiple text segments; calculate the proportion of repeated text segments among the multiple text segments; and generate a second reward function based on the proportion; wherein the optimization module 850 is further configured to: optimize the text correction model based on the first reward function and the second reward function to obtain a target text correction model.
[0138] In a particular embodiment of this disclosure, the optimization module 850 is further configured to: perform a weighted summation of the first reward function and the second reward function to obtain a third reward function; and optimize the text correction model based on the third reward function to obtain a target error correction model.
[0139] In a particular embodiment of this disclosure, the first training data also includes the context text corresponding to the first erroneous sample.
[0140] In a particular embodiment of this disclosure, the apparatus is further configured to pre-generate a text correction model by: acquiring second training data, the second training data including a second erroneous sample, a second corrected sample, and thought chain data, the thought chain data being used to describe the derivation process of correcting errors in the first erroneous sample to obtain the first corrected sample; and fine-tuning the pre-trained candidate correction model based on the second training data to obtain the text correction model.
[0141] In a particular embodiment of this disclosure, the apparatus is further configured to: fine-tune the pre-trained candidate error correction model using supervised fine-tuning SFT based on the second training data, to obtain a text error correction model.
[0142] In a particular embodiment of this disclosure, the thought chain data is obtained through a pre-generated thought chain model based on the second erroneous sample and the second error-correcting sample.
[0143] In a particular embodiment of this disclosure, the optimization module 850 is further configured to: optimize the text correction model using the Group Relative Policy Optimization (GRPO) algorithm based on the first reward function, thereby obtaining the target text correction model.
[0144] An electronic device according to an embodiment of the present disclosure is described below with reference to Figure 9.
[0145] As shown in Figure 9, the electronic device 900 includes one or more processors 910 and memory 920.
[0146] The processor 910 may be a central processing unit (CPU) or other form of processing unit with data processing capabilities and / or instruction execution capabilities, and may control other components in the electronic device 900 to perform desired functions.
[0147] The memory 920 may include one or more computer program products, which may include various forms of computer-readable storage media, such as volatile memory and / or non-volatile memory. The volatile memory may include, for example, random access memory (RAM) and / or cache memory. The non-volatile memory may include, for example, read-only memory (ROM), hard disk, flash memory, etc. One or more computer program instructions may be stored on the computer-readable storage medium, and the processor 910 may execute the program instructions to implement the model training methods of the various embodiments of this disclosure described above and / or other desired functions.
[0148] In one example, the electronic device 900 may also include an input device 930 and an output device 940, which are interconnected via a bus system and / or other forms of connection mechanism (not shown).
[0149] For example, the input device 930 may be a microphone or microphone array for capturing voice input signals; it may be a communication network connector for receiving the collected input signals from the cloud or other devices; and it may also include, for example, a keyboard, mouse, etc.
[0150] The output device 940 can output various information to the outside, including determined distance information, direction information, etc. The output device 940 may include, for example, a display, a speaker, a printer, and a communication network and its connected remote output devices, etc.
[0151] Of course, for the sake of simplicity, Figure 9 shows only some of the components of the electronic device 900 relevant to this disclosure, omitting components such as buses, input / output interfaces, etc. In addition, the electronic device 900 may include any other suitable components depending on the specific application.
[0152] Embodiments of this disclosure may also be computer-readable storage media storing computer program instructions that, when executed by a processor, cause the processor to perform the steps in the model training methods according to various embodiments of this disclosure described above.
[0153] The computer-readable storage medium may be any combination of one or more readable media. A readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may be, for example, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination thereof. More specific examples of readable storage media (a non-exhaustive list) include: an electrical connection having one or more wires, a portable disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), optical storage device, magnetic storage device, or any suitable combination thereof.
[0154] The concepts, principles, and ideas of this disclosure have been described in detail above with reference to specific embodiments (including examples and instances). Those skilled in the art should understand that the embodiments of this disclosure are not limited to those given above. After reading this disclosure, those skilled in the art can make any possible improvements, substitutions, and equivalents to the steps, methods, apparatus, and components in the above embodiments, and such improvements, substitutions, and equivalents should be considered to fall within the scope of this disclosure. The scope of protection of this disclosure is limited to the claims.
Claims
1. A model training method, characterized in that, The method includes: Acquire first training data, which includes a first erroneous sample and a first corrected sample; The first erroneous sample is input into a pre-generated text correction model to obtain the first corrected text. Based on the first error-corrected sample and the first error-corrected text, determine the text edit distance; Based on the text edit distance, a first reward function is generated; The text correction model is optimized based on the first reward function to obtain a target text correction model, which is used to correct errors in the input target text.
2. The method according to claim 1, characterized in that, The text edit distance includes a first text edit distance and / or a second text edit distance; determining the text edit distance based on the first error correction sample and the first error correction text includes: The edit distance between the first corrected sample and the first corrected text is taken as the first text edit distance; and / or, Obtain the first edit distance between the first corrected sample and the first incorrect sample, and the second edit distance between the first corrected text and the first incorrect sample, to obtain the second text edit distance.
3. The method according to claim 2, characterized in that, The text edit distance includes the first text edit distance and the second text edit distance; the step of generating a target reward function based on the text edit distance includes: The first candidate function is determined based on the first text edit distance; The second candidate function is determined based on the second text edit distance; The first reward function is obtained by weighted summation of the first candidate function and the second candidate function.
4. The method according to claim 1, characterized in that, The method further includes: Divide the first error-corrected text into multiple text segments; Calculate the proportion of repeated text segments among the multiple text segments; Based on the stated ratio, a second reward function is generated; The step of optimizing the text correction model based on the first reward function to obtain the target text correction model includes: Based on the first reward function and the second reward function, the text correction model is optimized to obtain the target text correction model.
5. The method according to claim 4, characterized in that, The step of optimizing the text correction model based on the first reward function and the second reward function to obtain the target text correction model includes: The first reward function and the second reward function are weighted and summed to obtain the third reward function; Based on the third reward function, the text correction model is optimized to obtain the target correction model.
6. The method according to claim 1, characterized in that, The first training data also includes the context text corresponding to the first erroneous sample.
7. The method according to claim 1, characterized in that, The text correction model is pre-generated in the following manner: Acquire second training data, which includes a second erroneous sample, a second error-correcting sample, and thought chain data. The thought chain data is used to describe the derivation process of correcting the errors in the first erroneous sample to obtain the first error-correcting sample. Based on the second training data, the pre-trained candidate error correction model is fine-tuned to obtain the text error correction model.
8. The method according to claim 7, characterized in that, The step of fine-tuning the pre-trained candidate error correction model based on the second training data to obtain the text error correction model includes: Based on the second training data, the pre-trained candidate error correction model is fine-tuned using supervised fine-tuning SFT to obtain the text error correction model.
9. The method according to claim 7, characterized in that the thought chain data is obtained in the following manner: processing the second erroneous sample and the second error-corrected sample with a pre-generated thought chain model to obtain the thought chain data.
10. The method according to any one of claims 1 to 9, characterized in that the step of optimizing the text correction model based on the first reward function to obtain the target text correction model, which is used to correct errors in input target text, includes: optimizing the text correction model based on the first reward function using the Group Relative Policy Optimization (GRPO) algorithm to obtain the target text correction model.
11. A text error correction method, characterized in that the method includes: acquiring erroneous text; and inputting the erroneous text into a text correction model to obtain corrected text, wherein the text correction model is trained using the model training method according to any one of claims 1 to 10.
12. A model training device, characterized in that the device includes: an acquisition module configured to acquire first training data, the first training data including a first erroneous sample and a first error-corrected sample; an input module configured to input the first erroneous sample into a pre-generated text correction model to obtain a first corrected text; a determination module configured to determine a text edit distance based on the first error-corrected sample and the first corrected text; a generation module configured to generate a first reward function based on the text edit distance; and an optimization module configured to optimize the text correction model based on the first reward function to obtain a target text correction model, which is used to correct errors in input target text.
13. An electronic device, characterized by comprising: a processor; a memory; and an application stored in the memory and configured to be executed by the processor, the application including instructions for performing the model training method according to any one of claims 1 to 10.
14. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program for performing the model training method according to any one of claims 1 to 10.
15. A computer program product, comprising a computer program, characterized in that, when the computer program is executed by a processor, the computer program implements the model training method according to any one of claims 1 to 10.
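
Illustrative note (not part of the claims): the reward computations recited in claims 1, 4, 5 and 10 can be sketched as follows. This is a minimal sketch under stated assumptions, not the applicant's implementation. The claims do not fix how the edit distance is normalised, how the text is divided into segments, or which weights are used, so the length normalisation, the character n-gram segmentation, the weights `w_edit` and `w_rep`, and all function names below are hypothetical. The last function shows only the group-relative advantage step commonly used in GRPO (each reward standardised against the other outputs sampled for the same input), not a full policy update.

```python
from typing import List, Sequence


def edit_distance(a: str, b: str) -> int:
    """Character-level Levenshtein distance between a and b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def edit_reward(reference: str, output: str) -> float:
    """Assumed first reward: 1 minus the length-normalised edit distance."""
    longest = max(len(reference), len(output))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(reference, output) / longest


def repetition_reward(output: str, n: int = 3) -> float:
    """Assumed second reward: 1 minus the proportion of repeated segments.

    Segments are overlapping character n-grams (an assumption; the claims
    only say the text is divided into multiple text segments).
    """
    segments = [output[i:i + n] for i in range(len(output) - n + 1)]
    if not segments:
        return 1.0
    repeated = len(segments) - len(set(segments))
    return 1.0 - repeated / len(segments)


def combined_reward(reference: str, output: str,
                    w_edit: float = 0.8, w_rep: float = 0.2) -> float:
    """Assumed third reward: weighted sum of the first and second rewards."""
    return (w_edit * edit_reward(reference, output)
            + w_rep * repetition_reward(output))


def group_relative_advantages(rewards: Sequence[float],
                              eps: float = 1e-8) -> List[float]:
    """Standardise each reward against the mean and std of its group."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]
```

In such a setup, several candidate corrections would be sampled for one erroneous sample, each scored with `combined_reward` against the error-corrected sample, and the resulting list passed to `group_relative_advantages`, so that candidates closer to the reference and with less repetition receive positive advantages.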