A two-stage large-scale model training method and system for Chinese text error correction.
The Chinese text correction model is optimized through a two-stage training method and task-specific reward functions, which solves the problem of over-correction in Chinese text correction. This achieves efficient and accurate correction in Chinese spelling and grammar correction tasks, reduces training and maintenance costs, and improves the model's robustness and cross-scenario applicability.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- UNIV OF SCI & TECH OF CHINA
- Filing Date
- 2026-06-01
- Publication Date
- 2026-06-30
AI Technical Summary
Existing Chinese text correction models struggle to achieve "minimum editing and semantic fidelity" under strict constraints, are prone to overcorrection, and often split spelling and grammar correction tasks into independent models, increasing training and maintenance costs and limiting cross-scenario transfer and reuse.
A two-stage training approach is adopted. First, supervised fine-tuning is performed through low-rank adaptation. Then, reinforcement learning is performed using task-specific reward functions, including anchor action rewards, global edit distance rewards, edit accuracy rewards, and semantic consistency rewards, to optimize Chinese spelling correction and grammar correction tasks, respectively.
It significantly reduces overcorrection, improves correction accuracy and semantic preservation, reduces training and maintenance complexity, enhances cross-task knowledge transfer capabilities, and improves the robustness and practicality of the model in different scenarios.
Figure CN122311352A_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the field of natural language processing technology, specifically to the interdisciplinary field of text processing and language model training and optimization. In particular, it relates to a two-stage large-scale model training method and system for Chinese text error correction.
Background Technology
[0002] With the development of natural language processing and intelligent writing technologies, scenarios such as writing assistance, educational correction, and human-computer interaction have generated widespread demand for automatic error correction of Chinese text. In existing applications, Chinese error correction typically manifests as two main tasks: Chinese spelling correction and Chinese grammar correction. In recent years, large language models have demonstrated strong capabilities in instruction following and open-ended generation, seemingly making them directly applicable for error correction. However, the goals and constraints of Chinese error correction differ significantly from those of general generation, making direct deployment still challenging.
[0003] Unlike general text generation or polishing, error correction tasks emphasize outputting results with "minimal editing and semantic fidelity" under strict constraints: eliminating genuine errors while avoiding unnecessary modifications to already correct content. Therefore, large language models often suffer from "over-correction" in error correction, where the model modifies already correct segments, rewrites based on style preferences, or even produces "improvements" unrelated to the task. This problem is more pronounced in Chinese contexts: Chinese characters are highly ambiguous, word order is relatively flexible, and even minor changes can cause semantic drift; different surface expressions may be equally fluent but not necessarily equally faithful. Specifically, in CSC (Chinese Spelling Correction), the model may perform unnecessary character substitutions, thus disrupting proper nouns, entity names, or domain terminology; in CGEC (Chinese Grammatical Error Correction), the model may reorganize clauses, replace synonyms, or unify style, making the output linguistically acceptable but deviating from the input meaning, thereby reducing correction accuracy and affecting practical usability.
[0004] For Chinese error correction, existing technologies mostly employ supervised fine-tuning or prompt-based generation to adapt to large language models. However, in high-precision applications, Supervised Fine-Tuning (SFT) typically learns by imitating reference answers based on likelihood targets, often treating different reference outputs as equivalent ideal results. This makes it difficult to directly encode key constraints such as "editing only when necessary," "avoiding meaningless rewriting," and "preserving semantics and entities." It also lacks discriminative supervision of error boundaries and editing costs, leading to model behavior that may still be dominated by fluency-driven rewriting. When the training data contains multiple effective correction methods or annotation noise, these problems are more easily amplified, manifesting as over-correction and decreased generalization ability. Furthermore, spelling correction and grammar correction are often separated into different models or independent pipelines in engineering implementations, relying on task-specific engineering designs, which objectively increases training and maintenance costs and limits the transfer and reuse between different error correction settings.
[0005] Therefore, existing technologies urgently need a learning and optimization mechanism that better fits the goals of error correction tasks, so that, among many fluent candidates, the model is more inclined to correct the real errors and maintain the original meaning, thereby achieving a more controllable balance between accuracy and conservatism, reducing the accuracy loss caused by over-correction, and improving cross-scenario robustness and deployability.
Summary of the Invention
[0006] To address the aforementioned issues, this invention provides a two-stage large-scale model training method and system for the field of Chinese text error correction.
[0007] The first aspect discloses a two-stage large-scale model training method for the field of Chinese text error correction, the method comprising:
[0008] A first training dataset is obtained, and the samples in the first training dataset are preprocessed to obtain a second training dataset. The first training dataset includes noisy text-target correction text sample pairs corresponding to the Chinese spelling correction task and the Chinese grammar correction task. The second training dataset includes prompt-response sample pairs, which include domain-specific prompt words with Chinese text correction task characteristics and target correction text. The prompt words are used to indicate the structured results output by the model. Based on the second training dataset, the text correction model to be trained is fine-tuned under supervision using a low-rank adaptation method to obtain an initial text correction model. The initial text correction model is then optimized and trained using a reinforcement learning method based on the reward function of the text correction task to obtain a trained text correction model. The text correction tasks include the Chinese grammar correction task and the Chinese spelling correction task, and different text correction tasks correspond to different reward functions. The reward function of the Chinese spelling correction task includes an anchor action reward and a global edit distance reward, and the reward function of the Chinese grammar correction task includes an edit accuracy reward and a semantic consistency reward.
[0009] The second aspect discloses a two-stage large-scale model training system for the field of Chinese text error correction, the system comprising:
[0010] The training dataset acquisition module is used to acquire a first training dataset and preprocess the samples in the first training dataset to obtain a second training dataset. The first training dataset includes noisy text-target correction text sample pairs corresponding to the Chinese spelling correction task and the Chinese grammar correction task. The second training dataset includes prompt-response sample pairs, which include domain-specific prompt words with Chinese text correction task characteristics and target correction text. The prompt words are used to indicate the structured results output by the model. The supervised fine-tuning module is used to perform supervised fine-tuning of the text correction model to be trained using a low-rank adaptation method based on the second training dataset to obtain an initial text correction model. The task reinforcement learning optimization module is used to optimize and train the initial text correction model using a reinforcement learning method based on the reward function of the text correction task to obtain a trained text correction model. The text correction tasks include the Chinese grammar correction task and the Chinese spelling correction task, and different text correction tasks correspond to different reward functions. The reward function for the Chinese spelling correction task includes an anchor action reward and a global edit distance reward, and the reward function for the Chinese grammar correction task includes an edit accuracy reward and a semantic consistency reward.
[0011] As can be seen from the above technical solutions, the present invention has the following beneficial effects:
[0012] This invention uses a unified training backbone of supervised fine-tuning followed by group relative policy optimization, while adapting pluggable reward functions for Chinese spelling correction and Chinese grammar correction respectively. This avoids building independent model structures and training pipelines for the two tasks, reducing training and maintenance complexity and enhancing cross-task knowledge transfer capabilities. Furthermore, by generating diverse samples based on fine-grained error type annotations, the robustness and generalization ability of the model are improved. In addition, a task-adaptive reward function system is designed that eliminates the need for additional reward models. According to the error correction task type, this system adaptively adjusts the reward weights of sub-indicators such as error correction accuracy, modification cost, and semantic consistency, achieving multi-task optimization within a unified framework.
[0013] This invention also addresses the over-correction problem in text correction using large language models by establishing explicit constraint mechanisms. For Chinese grammar correction, semantic consistency rewards are used to suppress semantic drift; for Chinese spelling correction, mechanisms such as error-free input preservation, prohibition of incorrect editing at correct positions, and clean position penalties are used to reduce unnecessary modifications to originally correct content, thereby significantly alleviating over-correction. Furthermore, this invention improves the stability of reinforcement learning training by constructing denser reward signals. For Chinese spelling correction, anchor-level action scores and global edit distance improvements provide effective learning signals even if candidate outputs are not exactly equal to the standard answer; for Chinese grammar correction, a balance between error correction recall and conservative editing is achieved through the combined effects of high-level error detection, character-level weighted scores, and semantic preservation, improving the model's robustness and practicality in out-of-domain scenarios.
Attached Figure Description
[0014] Figure 1 is an overall flowchart of the two-stage large model training method for Chinese text error correction provided by the present invention.
[0015] Figure 2 is a schematic diagram of the architecture of the two-stage large-scale model training system for the field of Chinese text error correction provided by the present invention.
Detailed Implementation
[0016] To make the objectives, features, and advantages of the present invention more apparent and understandable, specific embodiments of the present invention will be described in detail below with reference to the accompanying drawings. Several embodiments of the present invention are shown in the drawings. However, the present invention can be implemented in many different forms and is not limited to the embodiments described herein. Rather, these embodiments are provided so that the disclosure of the present invention will be thorough and complete.
[0017] The terms “first,” “second,” “third,” “fourth,” etc. (if present) in the specification, claims, and accompanying drawings of this invention are used to distinguish similar objects and are not necessarily used to describe a particular order or sequence. It should be understood that such data can be interchanged where appropriate so that the embodiments described herein can be implemented in orders other than those illustrated or described herein. Furthermore, the terms “comprising” and “having,” and any variations thereof, are intended to cover a non-exclusive inclusion; for example, a process, method, system, product, or apparatus that comprises a series of steps or units is not necessarily limited to those steps or units explicitly listed, but may include other steps or units not explicitly listed or inherent to such processes, methods, products, or apparatus.
[0018] In this invention, unless otherwise explicitly specified and limited, the terms "installation," "connection," "linking," "fixing," etc., should be interpreted broadly. For example, they can refer to a fixed connection, a detachable connection, or an integral connection; they can refer to a mechanical connection or an electrical connection; they can refer to a direct connection or an indirect connection through an intermediate medium; and they can refer to the internal communication between two components. Those skilled in the art can understand the specific meaning of the above terms in this invention according to the specific circumstances. The term "and / or" as used herein includes any and all combinations of one or more of the related listed items.
[0019] Existing supervised fine-tuning-based error correction methods typically rely on imitation learning from reference answers, making it difficult to directly characterize error correction constraints such as minimizing modifications and semantic consistency. This can easily lead to over-correction, causing correct content to be incorrectly modified, thus affecting error correction accuracy and generalization ability. This invention introduces group relative policy optimization after supervised fine-tuning and designs task-specific reward functions for different task types, enabling the model to simultaneously maintain error correction accuracy, edit conservatism, and semantic preservation within a unified training framework. This invention aims to provide a two-stage large-scale model training method and system for the field of Chinese text error correction that handles Chinese spelling and grammar error correction tasks in a unified manner.
[0020] In one embodiment, the present invention provides a two-stage large model training method for the field of Chinese text error correction. As shown in Figure 1, the specific steps include:
[0021] S101. Obtain a first training dataset and preprocess the samples in the first training dataset to obtain a second training dataset. The first training dataset includes noisy text-target correction text sample pairs corresponding to Chinese spelling correction task and Chinese grammar correction task. The second training dataset includes prompt-response sample pairs. The prompt-response sample pairs include domain-specific prompt words with Chinese text correction task features and target correction text. The prompt words are used to indicate the structured results output by the model.
[0022] This invention is based on a controllable data augmentation and quality filtering method with fine-grained error annotation. It generates diverse samples according to error type and removes low-quality and semantically drifting data, thereby improving the robustness and generalization ability of the text correction model.
[0023] Specifically, in another embodiment, the first training dataset is obtained through the following steps, including:
[0024] Obtain an initial training dataset, which includes multiple initial training samples, each of which includes source text and target correction text;
[0025] The initial training samples are identified and labeled with error types using a pre-trained text error correction model, resulting in a sample set with error type labels.
[0026] When the error type label corresponds to the Chinese spelling correction task, character-level confusion augmentation is applied to the source text in the sample under the constraints of edit distance and sentence length to generate the corresponding noisy text.
[0027] When the error type label corresponds to the Chinese grammar correction task, grammatical phenomena are injected into the source text under semantic consistency constraints to generate the corresponding noisy text, thereby obtaining noisy text-target correction text sample pairs for the Chinese spelling correction task and the Chinese grammar correction task.
[0028] Specifically, an initial training dataset is obtained and fine-grained feature annotation is performed. First, an initial training dataset for Chinese text correction is obtained, which includes source text and corresponding target text for correction. A pre-trained large language model is selected as the base model, preferably Qwen3-8B. Then, the base model is used to perform error type annotation on the initial training samples, resulting in labeled samples with fine-grained error type labels, enabling the model to distinguish different error phenomena. The error types include at least one or more of the following: character confusion errors, redundant function words, missing function words, word order errors, and collocation errors.
[0029] Specifically, the source text and the target text to be corrected are first aligned using character-level or phrase-level minimal editing to obtain candidate segments for replacement, insertion, deletion, and word order adjustment. These candidate segments and their contexts are then input into a base model, which outputs error types according to a pre-defined labeling system. After rule-based constraint verification, a sample set with fine-grained error labels is obtained. Subsequently, data augmentation is performed using the error type labels as control conditions to generate new noisy text-correct text sample pairs.
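The alignment step described above can be sketched in Python. This is only an illustration under stated assumptions: it uses the standard-library `difflib.SequenceMatcher`, whose matching is not guaranteed to be a minimum-edit alignment, and the function name `candidate_segments` is hypothetical. Word order adjustments would surface here as paired delete and insert spans and would need further grouping.

```python
from difflib import SequenceMatcher
from typing import List, Tuple


def candidate_segments(source: str, target: str) -> List[Tuple[str, str, str]]:
    """Return (operation, source_span, target_span) for every non-matching span.

    Operations follow difflib's naming: 'replace', 'insert', or 'delete'.
    """
    matcher = SequenceMatcher(None, source, target, autojunk=False)
    segments = []
    for op, s_start, s_end, t_start, t_end in matcher.get_opcodes():
        if op != "equal":
            segments.append((op, source[s_start:s_end], target[t_start:t_end]))
    return segments


# A spelling-style substitution and a grammar-style insertion.
spelling_segments = candidate_segments("我今天很高心", "我今天很高兴")
grammar_segments = candidate_segments("我去学校", "我要去学校")
```

Each returned segment, together with its surrounding context, is what would then be passed to the base model for error type labeling.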
[0030] Furthermore, Chinese spelling correction tasks typically correspond to length-preserving corrected outputs, while Chinese grammar correction tasks allow insertion or deletion operations and thus correspond to length-variable corrected outputs. In the Chinese spelling correction scenario, the augmentation operation is preferably character-level confusion augmentation, which samples plausible replacement candidates from a preset confusion set and generates erroneous text under edit distance and sentence length constraints, while the original correct text remains unchanged. Preferably, the edit distance threshold can be set to 10, and the sentence length can be limited to no more than 50.
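A minimal sketch of character-level augmentation for the spelling scenario follows. The replacement table `CONFUSION_SET` is a tiny hypothetical example, and the function name and sampling policy are assumptions made for illustration; only the two thresholds (10 edits, length 50) come from the text. Because only substitutions are applied, the output keeps the input length and the number of substituted positions bounds the edit distance.

```python
import random

# Hypothetical toy replacement table: each character maps to visually or
# phonetically similar characters. A real table would be far larger.
CONFUSION_SET = {
    "兴": ["心", "星"],
    "在": ["再"],
    "的": ["地", "得"],
}


def inject_spelling_noise(text, confusion_set, max_edits=10, max_len=50, rng=None):
    """Return a noisy, same-length copy of `text`, or None if it is not usable."""
    if len(text) > max_len:
        return None  # sentence length constraint
    rng = rng or random.Random()
    positions = [i for i, ch in enumerate(text) if ch in confusion_set]
    if not positions:
        return None  # nothing in this sentence can be confused
    n_edits = rng.randint(1, min(max_edits, len(positions)))
    chars = list(text)
    for i in rng.sample(positions, n_edits):
        chars[i] = rng.choice(confusion_set[text[i]])
    return "".join(chars)


clean = "我在家很高兴"
noisy = inject_spelling_noise(clean, CONFUSION_SET, rng=random.Random(0))
```

The pair `(noisy, clean)` would then serve as one noisy text-target correction text sample.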
[0031] In Chinese grammar correction scenarios, the preferred augmentation operation is the injection of grammatical phenomena, including but not limited to: redundant function words, missing function words, word order errors, and collocation errors. During the augmentation process, semantic consistency constraints are preferably added to prevent the augmented sample from deviating from the true meaning of the original sentence.
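Three of the grammatical phenomena listed above can be sketched as simple token-level operations. This is an illustration only: it assumes the sentence is already segmented into tokens, the list `FUNCTION_WORDS` is hypothetical, and both collocation errors and the semantic consistency check are left out because the text does not specify how they are implemented.

```python
import random

# Hypothetical list of function words, for illustration only.
FUNCTION_WORDS = ["的", "了", "着", "把", "被"]


def inject_redundant_function_word(tokens, rng):
    """Insert an extra function word at a random position."""
    pos = rng.randint(0, len(tokens))
    return tokens[:pos] + [rng.choice(FUNCTION_WORDS)] + tokens[pos:]


def inject_missing_function_word(tokens, rng):
    """Delete one function word; return None if the sentence has none."""
    positions = [i for i, tok in enumerate(tokens) if tok in FUNCTION_WORDS]
    if not positions:
        return None
    pos = rng.choice(positions)
    return tokens[:pos] + tokens[pos + 1:]


def inject_word_order_error(tokens, rng):
    """Swap two adjacent, different tokens; return None if impossible."""
    positions = [i for i in range(len(tokens) - 1) if tokens[i] != tokens[i + 1]]
    if not positions:
        return None
    pos = rng.choice(positions)
    swapped = list(tokens)
    swapped[pos], swapped[pos + 1] = swapped[pos + 1], swapped[pos]
    return swapped


rng = random.Random(0)
tokens = ["我", "吃", "了", "饭"]
redundant = inject_redundant_function_word(tokens, rng)
missing = inject_missing_function_word(tokens, rng)
reordered = inject_word_order_error(tokens, rng)
```

Unlike the spelling case, the insertion and deletion operations change the sentence length, which matches the length-variable nature of grammar correction.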
[0032] To improve the quality of the training samples, the augmented samples are further screened, including: limiting sentence length to a preset medium range, removing simple samples that the base model can reliably correct, and removing noisy samples that alter the semantics of the original sentences. The augmented samples are then merged with the original training samples to obtain noisy text-target correction text sample pairs for the Chinese spelling correction and Chinese grammar correction tasks.
[0033] Furthermore, the second training dataset, namely the supervised fine-tuning training dataset, is obtained by preprocessing the samples in the first training dataset.
[0034] Specifically, the samples in the first training dataset are preprocessed and converted into prompt-response sample pairs $(p, Y)$, where $p$ is a domain-specific prompt with characteristics of Chinese text correction tasks, assembled according to a preset format from a task template, a task type identifier, output format constraints, and the input text, and $Y$ is the target correction text. The prompt should clearly define the task scenario the model is designed for, the error types, the linguistic knowledge base, and the input text conditions. For example, constraints such as "You are a Chinese text correction expert. Please use your knowledge of modern Chinese lexical, syntactic, semantic, and discourse coherence to correct and modify the following text, and explain the basis for your modifications" can be added. This guides the model to focus on key aspects of Chinese text correction tasks, such as misspelling identification, word order adjustment, collocation correction, ambiguity resolution, and discourse coherence optimization, thereby improving the model's professionalism and reasoning ability in Chinese error correction tasks.
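Assembling a prompt-response pair from the four components named above might look like the following sketch. The task template reuses the example wording from the text; the task type labels, the JSON field names in `OUTPUT_FORMAT`, and the function names are assumptions, since the preset format itself is not given.

```python
TASK_TEMPLATE = (
    "You are a Chinese text correction expert. Please use your knowledge of "
    "modern Chinese lexical, syntactic, semantic, and discourse coherence to "
    "correct and modify the following text, and explain the basis for your "
    "modifications."
)

# Hypothetical output-format constraint: the field names are illustrative.
OUTPUT_FORMAT = (
    'Answer with JSON: {"has_error": <true|false>, "corrected_text": "<text>"}'
)


def build_prompt(task_type: str, input_text: str) -> str:
    """Join task template, task type identifier, format constraint, and input."""
    if task_type not in ("CSC", "CGEC"):
        raise ValueError(f"unknown task type: {task_type}")
    return "\n".join(
        [
            TASK_TEMPLATE,
            f"Task type: {task_type}",
            OUTPUT_FORMAT,
            f"Input text: {input_text}",
        ]
    )


def build_sample(task_type: str, noisy_text: str, target_text: str) -> dict:
    """Convert one noisy/target pair into a prompt-response training sample."""
    return {"prompt": build_prompt(task_type, noisy_text), "response": target_text}


sample = build_sample("CSC", "我今天很高心", "我今天很高兴")
```

Keeping the task type identifier in the prompt lets one shared model receive both spelling and grammar samples during supervised fine-tuning.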
[0035] Suppose the input text is a sequence of characters:
[0036] $X = (x_1, x_2, \dots, x_n)$ (1)
[0037] The target correction text is:
[0038] $Y = (y_1, y_2, \dots, y_m)$ (2)
[0039] where $n$ denotes the length of the input text and $m$ denotes the length of the target correction text. In Chinese spelling correction scenarios, $m = n$ typically holds; in Chinese grammar correction scenarios, since insertion or deletion operations are allowed, $m \neq n$ usually holds. The objective of this invention is to generate a corrected result $\hat{Y}$ that eliminates text errors while preserving the original semantics as much as possible.
[0040] S102. Based on the second training dataset, supervised fine-tuning of the text correction model to be trained is performed using a low-rank adaptation method to obtain the initial text correction model.
[0041] The above steps specifically include:
[0042] The second training dataset is input into the text correction model to be trained. The trainable parameters of the text correction model to be trained are optimized with the goal of minimizing the loss function to obtain the initial text correction model. The trainable parameters include the low-rank trainable incremental parameters introduced into the attention projection matrices and / or feedforward network weights of the model, while the other original parameters of the text correction model to be trained are frozen.
[0043] The loss function in the supervised fine-tuning phase, $\mathcal{L}_{\mathrm{SFT}}(\theta)$, is given by:
[0044] $\mathcal{L}_{\mathrm{SFT}}(\theta) = -\mathbb{E}_{(p, Y) \sim \mathcal{D}_{\mathrm{SFT}}}\left[\log \pi_\theta(Y \mid p)\right]$ (3)
[0045] where $\mathbb{E}$ denotes the expectation, $\mathcal{D}_{\mathrm{SFT}}$ denotes the second training dataset, and $\pi_\theta(Y \mid p)$ is the conditional probability that the policy model with parameters $\theta$ generates the target output $Y$ under the prompt $p$. Supervised fine-tuning is performed by minimizing the negative log-likelihood of the target correction text under the prompt condition. In this embodiment, supervised fine-tuning can be implemented using Low-Rank Adaptation (LoRA), which freezes the original parameters of the text correction model to be trained and introduces low-rank trainable incremental parameters only into the attention projection matrices and / or feedforward network weights, significantly reducing the number of parameters to be updated and the memory usage. Specifically, the low-rank adaptation rank parameter is 16, the scaling factor is 2, and the supervised fine-tuning learning rate is 3e-5. The checkpoint obtained from supervised fine-tuning is used before the subsequent reinforcement learning as the initialization policy model for the reinforcement learning phase.
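The two ideas in this step, a negative log-likelihood objective and a frozen weight plus a low-rank trainable increment, can be illustrated with a small pure-Python sketch. It is not the training code: the helper names are hypothetical, the 4096-dimensional layer in the parameter count is an illustrative size, and treating the stated scaling factor of 2 as a direct multiplier on the low-rank product is an assumption.

```python
import math


def sequence_nll(token_probs):
    """Negative log-likelihood of a target sequence from per-token probabilities."""
    return -sum(math.log(p) for p in token_probs)


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_effective_weight(w, b, a, scale):
    """Return W + scale * (B @ A). W stays frozen; only B and A would be trained."""
    delta = matmul(b, a)
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


def lora_trainable_params(d_out, d_in, rank):
    """Number of trainable values in B (d_out x rank) and A (rank x d_in)."""
    return rank * (d_out + d_in)


w = [[1.0, 2.0], [3.0, 4.0]]  # frozen 2x2 weight
b = [[1.0], [0.0]]            # 2x1 low-rank factor (rank 1 for readability)
a = [[0.5, -0.5]]             # 1x2 low-rank factor
w_eff = lora_effective_weight(w, b, a, scale=2.0)

# Assumption: scale multiplies B @ A directly; layer size is illustrative.
full_params = 4096 * 4096
lora_params = lora_trainable_params(4096, 4096, rank=16)
```

With rank 16 on such a layer, the trainable increment holds 131,072 values against 16,777,216 in the frozen matrix, which is the source of the memory saving mentioned above.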
[0046] This invention employs supervised fine-tuning of the training dataset to enable the text correction model to acquire basic error-correcting capabilities, resulting in an initial text correction model. During training, the model not only focuses on the difference between the predicted results and the actual text but also strives to ensure that the corrected text maintains consistency with the original context in word usage, thereby improving the accuracy of Chinese text correction. Furthermore, the supervised fine-tuning utilizes the LoRA method, saving GPU memory required during training while preserving the fluency of the large language model.
[0047] S103. The initial text correction model is optimized and trained using a reinforcement learning method based on the reward function of the text correction task to obtain a trained text correction model. The text correction task includes Chinese grammar correction task and Chinese spelling correction task, and different text correction tasks correspond to different reward functions. The reward function of the Chinese spelling correction task includes anchor action reward and global edit distance reward, and the reward function of the Chinese grammar correction task includes edit accuracy reward and semantic consistency reward.
[0048] The specific steps involved in optimizing and training the initial text correction model using a reinforcement learning method based on the reward function of the text correction task to obtain the trained text correction model include:
[0049] For each prompt-response sample pair, multiple candidate output texts are sampled from the text correction model under the current policy;
[0050] The scalar reward for each candidate output text is calculated based on the reward function corresponding to the task type identifier, where the task type identifier indicates whether the task is a Chinese spelling correction task or a Chinese grammar correction task;
[0051] The relative advantage value is calculated based on the scalar reward for each candidate output text;
[0052] A group relative policy optimization algorithm is adopted to update the text correction model based on the relative advantage values of the candidate output texts, and a KL regularization constraint is applied to the current policy using the reference policy, thereby obtaining the trained text correction model.
[0053] Specifically, after supervised fine-tuning, group relative policy optimization is performed starting from the initialization policy model. For each input prompt $p$, $K$ candidate outputs are sampled from the current policy $\pi_\theta$:
[0054] $\{\hat{Y}_i\}_{i=1}^{K} \sim \pi_\theta(\cdot \mid p)$ (4)
[0055] where the expression indicates that, given $p$, $K$ outputs $\hat{Y}_1, \dots, \hat{Y}_K$ are sampled from the policy $\pi_\theta$. The tilde ($\sim$) indicates sampling from a given distribution, and $\pi_\theta(\cdot \mid p)$ denotes the output distribution given $p$.
[0056] Then, the scalar reward for each candidate output text is calculated by the reward function corresponding to the task type identifier $t$, as shown in the formula:
[0057] $r_i = R_t(X, \hat{Y}_i, Y)$ (5)
[0058] where $X$ denotes the input text, $\hat{Y}_i$ denotes the $i$-th candidate correction result, $Y$ denotes the target correction text, and $R_t$ denotes the reward function corresponding to the task type identifier $t$.
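Selecting a reward function by task type identifier can be sketched as a small registry. The two reward functions below are exact-match placeholders standing in for the actual spelling and grammar rewards, and the names `REWARD_FUNCTIONS` and `score_group` as well as the labels "CSC" and "CGEC" are assumptions made for illustration.

```python
from typing import Callable, Dict, List

RewardFn = Callable[[str, str, str], float]


def csc_reward_placeholder(source: str, candidate: str, target: str) -> float:
    """Placeholder for the spelling reward (anchor action + global edit distance)."""
    return 1.0 if candidate == target else 0.0


def cgec_reward_placeholder(source: str, candidate: str, target: str) -> float:
    """Placeholder for the grammar reward (edit accuracy + semantic consistency)."""
    return 1.0 if candidate == target else 0.0


# Pluggable registry: the shared training loop only looks up the function.
REWARD_FUNCTIONS: Dict[str, RewardFn] = {
    "CSC": csc_reward_placeholder,
    "CGEC": cgec_reward_placeholder,
}


def score_group(task_type: str, source: str, candidates: List[str], target: str) -> List[float]:
    """Compute one scalar reward per sampled candidate for a single input."""
    reward_fn = REWARD_FUNCTIONS[task_type]
    return [reward_fn(source, candidate, target) for candidate in candidates]


rewards = score_group(
    "CSC",
    source="我今天很高心",
    candidates=["我今天很高兴", "我今天很高心"],
    target="我今天很高兴",
)
```

Because the rewards are plain functions of the input, candidate, and target, no separate learned reward model is involved.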
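[0059] Then, the current policy model is updated through the objective function $\mathcal{J}(\theta)$, as shown in the formula:
[0060] $\mathcal{J}(\theta) = \mathbb{E}\left[\frac{1}{K}\sum_{i=1}^{K}\min\left(\rho_i A_i,\ \mathrm{clip}\left(\rho_i,\, 1-\epsilon_{\mathrm{low}},\, 1+\epsilon_{\mathrm{high}}\right) A_i\right)\right] - \beta\, D_{\mathrm{KL}}\left(\pi_\theta \,\|\, \pi_{\mathrm{ref}}\right)$ (6)
[0061] where $A_i$ denotes the advantage value of the $i$-th candidate output text relative to the other candidate output texts in the same group, computed from its scalar reward $r_i$ relative to $\bar{r}$, the average reward among candidates in the same group, as shown in the formula:
[0062] $\bar{r} = \frac{1}{K}\sum_{j=1}^{K} r_j$ (7)
[0063] where $r_j$ is the scalar reward of the $j$-th candidate. $\rho_i$ denotes the probability ratio with which the current policy model generates the sample $\hat{Y}_i$ relative to the old policy model, as shown in the formula:
[0064] $\rho_i = \dfrac{\pi_\theta(\hat{Y}_i \mid p)}{\pi_{\theta_{\mathrm{old}}}(\hat{Y}_i \mid p)}$ (8)
[0065] where $\rho_i$ measures how much the policy update adjusts the probability of the sample, and is combined with a truncation mechanism to constrain the update step size, thereby improving training stability. $\pi_\theta(\hat{Y}_i \mid p)$ denotes the probability that the current policy $\pi_\theta$ outputs $\hat{Y}_i$ given the prompt $p$, and $\pi_{\theta_{\mathrm{old}}}(\hat{Y}_i \mid p)$ denotes the probability that the old policy $\pi_{\theta_{\mathrm{old}}}$ outputs $\hat{Y}_i$ given $p$.
[0066] In addition, $\epsilon_{\mathrm{low}}$ and $\epsilon_{\mathrm{high}}$ denote the constraint range of the truncation mechanism. In this embodiment, $\epsilon_{\mathrm{low}}$ is set to 0.2 and $\epsilon_{\mathrm{high}}$ is set to 0.28 to improve generation diversity during model updates. Furthermore, $\pi_{\mathrm{ref}}$ denotes the reference policy; $\beta$ denotes the KL regularization constraint coefficient; and $D_{\mathrm{KL}}(\pi_\theta \,\|\, \pi_{\mathrm{ref}})$ denotes the Kullback-Leibler divergence between the current policy and the reference policy. This objective function increases the probability of generating high-reward candidate outputs while constraining the policy distribution shift.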
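The group-relative update can be made concrete with a small numeric sketch. Several choices here are assumptions rather than the patent's stated method: the advantage is taken as reward minus group mean (no standard-deviation normalization), probability ratios are treated at sequence level, the KL term is passed in as a precomputed number, and the `beta` value is illustrative. Only the asymmetric clipping bounds 0.2 and 0.28 come from the text.

```python
def group_advantages(rewards):
    """Advantage of each candidate relative to the mean reward of its group."""
    mean_reward = sum(rewards) / len(rewards)
    return [r - mean_reward for r in rewards]


def clipped_term(ratio, advantage, eps_low=0.2, eps_high=0.28):
    """Pessimistic (min) surrogate with an asymmetric clipping range."""
    clipped_ratio = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
    return min(ratio * advantage, clipped_ratio * advantage)


def grpo_objective(ratios, advantages, kl, beta):
    """Mean clipped surrogate over the group minus a KL penalty."""
    surrogate = sum(
        clipped_term(ratio, adv) for ratio, adv in zip(ratios, advantages)
    ) / len(ratios)
    return surrogate - beta * kl


rewards = [1.0, 0.0, 0.5, 0.5]
advantages = group_advantages(rewards)

# A good candidate whose probability rose by 50% is capped at 1 + 0.28.
capped_gain = clipped_term(1.5, 1.0)
# A bad candidate whose probability fell by 50% is capped at 1 - 0.2.
capped_loss = clipped_term(0.5, -1.0)

objective = grpo_objective([1.0, 1.0], [0.5, -0.5], kl=0.1, beta=0.5)
```

The wider upper bound lets low-probability but well-rewarded candidates gain probability faster than the lower bound lets poor ones lose it, which is one way to read the stated aim of preserving diversity.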
[0067] It should also be noted that in this invention, Chinese spelling correction and Chinese grammar correction share the same supervised fine-tuning process, the same reinforcement learning optimizer, and the same policy update objective. The difference between the two lies not in a bifurcation of the model structure but in the different designs of the reward function, and the reward value does not depend on an additional reward model.
[0068] In one embodiment, the step of updating the text correction model based on the relative advantage value of candidate output text using a group relative strategy optimization algorithm specifically includes:
[0069] When the task type identifier indicates a Chinese spelling correction task, the relative advantage value corresponding to each candidate output text is obtained according to the reward function of the Chinese spelling correction task. A reinforcement learning policy update is then performed on the first text correction model task branch to obtain the optimized parameters of the text correction model corresponding to the Chinese spelling correction task.
[0070] When the task type identifier indicates a Chinese grammar correction task, the relative advantage value corresponding to each candidate output text is obtained according to the reward function of the Chinese grammar correction task. A reinforcement learning policy update is then performed on the second text correction model task branch to obtain the optimized parameters of the text correction model corresponding to the Chinese grammar correction task. The first text correction model task branch and the second text correction model task branch have the same structure and the same initial parameters.
[0071] During training, shared supervised fine-tuning is first performed on the mixed second training dataset to obtain a unified initialization checkpoint. Then, two text correction model task branches with the same structure and the same initial parameters are derived from this checkpoint: the first text correction model task branch and the second text correction model task branch. Next, reinforcement learning policy updates are performed independently on the two branches using the Chinese spelling correction reward function and the Chinese grammar correction reward function, respectively, yielding optimized parameters of the text correction model for the Chinese spelling correction task and for the Chinese grammar correction task. This embodiment preferably uses AdamW as the optimizer. The training stopping condition can be set to at least one of the following: reaching a preset number of training rounds or update steps; the validation set reward or task evaluation metric failing to improve for a preset number of consecutive evaluations; or the KL divergence between the current policy and the reference policy exceeding a preset threshold.
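The three stopping conditions listed above can be combined in a single check, sketched below. The function name, argument names, and all threshold values are hypothetical, since the text only says the thresholds are preset.

```python
def should_stop(step, max_steps, evals_without_improvement, patience, kl, kl_threshold):
    """Return True if any one of the three stopping conditions is met."""
    reached_budget = step >= max_steps
    stalled = evals_without_improvement >= patience
    drifted = kl > kl_threshold
    return reached_budget or stalled or drifted


# Illustrative values only; none of these thresholds come from the source.
keep_going = should_stop(100, 1000, 1, 3, kl=0.02, kl_threshold=0.1)
stop_on_plateau = should_stop(100, 1000, 3, 3, kl=0.02, kl_threshold=0.1)
stop_on_kl = should_stop(100, 1000, 0, 3, kl=0.5, kl_threshold=0.1)
```

The same check would be run separately for the spelling branch and the grammar branch, since the two are updated independently.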
[0072] This invention addresses the overcorrection problem in text correction using large language models by establishing explicit constraint mechanisms. For Chinese spelling correction, it reduces unnecessary modifications to originally correct content by preserving error-free input, prohibiting erroneous editing at correct positions, and penalizing clean positions, thereby significantly mitigating overcorrection. For Chinese grammar correction, it suppresses semantic drift through semantic consistency rewards.
[0073] Furthermore, this invention improves the stability of reinforcement learning training by constructing a denser reward signal. For Chinese spelling correction, it utilizes anchor-level action scores and global edit distance improvements to provide effective learning signals even if candidate outputs are not exactly equal to the standard answer. For Chinese grammar correction, it achieves a balance between error correction recall and conservative editing through the combined effects of high-level error detection, character-level weighted scores, and semantic preservation, thereby enhancing the model's robustness and practicality in out-of-domain scenarios.
[0074] In one embodiment, the reward function for the Chinese grammar correction task includes an editing accuracy reward and a semantic consistency reward, as shown in the formula:
[0075] $$R_{\mathrm{GEC}} = \lambda_{\mathrm{acc}} R_{\mathrm{acc}} + \lambda_{\mathrm{sem}} R_{\mathrm{sem}} \tag{9}$$
[0076] where $\lambda_{\mathrm{acc}}$ is the weight of the editing accuracy reward, which can be set to 0.8; $\lambda_{\mathrm{sem}}$ is the weight of the semantic consistency reward, which can be set to 0.2; $R_{\mathrm{acc}}$ is the editing accuracy reward; $R_{\mathrm{sem}}$ is the semantic consistency reward; and $R_{\mathrm{GEC}}$ is the reward function for the Chinese grammar correction task.
[0077] The primary goal of Chinese text correction is to accurately identify and correct errors. Therefore, editing accuracy should be the main optimization objective. Semantic consistency is mainly used to constrain the model to avoid semantic shifts or excessive rewriting. Using it as an auxiliary weight can maintain the semantic stability of the original sentence while ensuring the accuracy of error correction, thereby achieving a balance between error correction effect and semantic preservation.
[0078] In one embodiment, the calculation process for the editing accuracy reward includes:
[0079] The structured results predicted by the model are compared with the reference structured results, wherein the predicted structured results include at least error flags and output text;
[0080] If the predicted error flag does not match the reference error flag, the editing accuracy reward is set to 0; if the model predicts that an error is present but the output text is identical to the original input, the editing accuracy reward is also set to 0.
[0081] When the predicted error flag is consistent with the reference error flag and the output text is inconsistent with the original input, the character-level weighted score between the model output text and the target corrected text is calculated, where the character-level weighted score represents the fine-grained editing quality.
[0082] The editing accuracy reward is calculated based on the character-level weighted score.
[0083] Specifically, based on the prompt design, the model outputs a structured result that includes at least an error flag and the corrected text. The structured result predicted by the model is compared with the reference structured result: if the predicted error flag does not match the reference error flag, the editing accuracy reward is set to 0; if the model predicts an error but the output text is identical to the original input (i.e., the error is only "detected" but not actually corrected), the editing accuracy reward is also set to 0. When the high-level error judgments are consistent, a character-level weighted score between the model output and the target corrected text is further calculated, as shown in the formula:
[0084] $$F_{\beta} = \frac{(1+\beta^{2}) \cdot \mathrm{Prec} \cdot \mathrm{Rec}}{\beta^{2} \cdot \mathrm{Prec} + \mathrm{Rec}} \tag{10}$$
[0085] where $F_{\beta}$ is the character-level weighted score, which measures fine-grained editing quality; $\mathrm{Prec}$ is the character-level correction precision; $\mathrm{Rec}$ is the character-level correction recall; and $\beta$ is the coefficient that balances precision and recall. In this embodiment, $\beta$ can be set to 2, which weights recall more heavily than precision and thereby reduces missed corrections. The value can also be set freely for different scenarios.
[0086] Finally, the editing accuracy reward is calculated from the character-level weighted score, as shown in the formula:
[0087] $$R_{\mathrm{acc}} = b + k \cdot F_{\beta} \tag{11}$$
[0088] where $R_{\mathrm{acc}}$ is the editing accuracy reward, $F_{\beta}$ is the character-level weighted score, $b$ is the base bias term, which can be set to 0.1, and $k$ is the scaling factor, which can be set to 0.9. The bias term provides a non-zero base reward once the model has made a correct high-level decision, thereby improving optimization stability.
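The gating and scoring steps for the editing accuracy reward can be sketched as follows. This is an illustrative sketch only. The description does not say how character-level precision and recall are obtained, so the code assumes they are computed over sets of character edits extracted with Python's `difflib.SequenceMatcher`, and it assumes the weighted score is the usual F-beta combination. The function names are hypothetical, and returning a full score when neither the prediction nor the reference edits anything is also an assumption.

```python
from difflib import SequenceMatcher


def char_edits(source, output):
    """Set of character-level edits (tag, start, end, new_text) turning source into output."""
    sm = SequenceMatcher(None, source, output, autojunk=False)
    return {(tag, i1, i2, output[j1:j2])
            for tag, i1, i2, j1, j2 in sm.get_opcodes() if tag != "equal"}


def f_beta(precision, recall, beta=2.0):
    """F-beta score; beta > 1 weights recall more heavily than precision."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def edit_accuracy_reward(source, pred_flag, pred_text, ref_flag, ref_text,
                         beta=2.0, bias=0.1, scale=0.9):
    # Gate 1: the high-level error judgment must match the reference.
    if pred_flag != ref_flag:
        return 0.0
    # Gate 2: claiming an error without changing the text earns nothing.
    if pred_flag and pred_text == source:
        return 0.0
    pred = char_edits(source, pred_text)
    ref = char_edits(source, ref_text)
    if not pred and not ref:
        score = 1.0  # assumption: agreeing that nothing needs editing counts as perfect
    else:
        hits = len(pred & ref)
        precision = hits / len(pred) if pred else 0.0
        recall = hits / len(ref) if ref else 0.0
        score = f_beta(precision, recall, beta)
    return bias + scale * score
```

With the bias of 0.1 and scale of 0.9 given above, a candidate that passes both gates earns between 0.1 and 1.0, while a candidate that fails either gate earns exactly 0.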
[0089] In one embodiment, the calculation process for the semantic consistency reward includes:
[0090] A frozen semantic encoder is used to encode the noisy text and the model output text, and the cosine similarity between the two is calculated to obtain the semantic similarity.
[0091] When the semantic similarity is less than the semantic consistency threshold, the semantic consistency reward is set to 0; otherwise, the semantic consistency reward is the semantic similarity.
[0092] Specifically, because large language models have strong rewriting capabilities in generation tasks, a model optimized solely on editing accuracy may tend to rewrite the original sentence substantially in order to produce a more fluent or more common expression. Such rewriting may alter the true meaning or informational content of the original sentence, resulting in semantic drift. To suppress the semantic drift caused by the model's pursuit of fluency during grammatical error correction, a frozen semantic encoder $E(\cdot)$ is used to encode the input text $x$ and the model output text $\hat{y}$, and the cosine similarity of the two encodings is taken as the semantic similarity $s_{\mathrm{sem}}$:
[0093] $$s_{\mathrm{sem}} = \cos\big(E(x), E(\hat{y})\big) = \frac{E(x) \cdot E(\hat{y})}{\lVert E(x) \rVert \, \lVert E(\hat{y}) \rVert} \tag{12}$$
[0094] Furthermore, to explicitly penalize semantic drift, the semantic consistency reward is defined as shown in the formula:
[0095] $$R_{\mathrm{sem}} = \begin{cases} s_{\mathrm{sem}}, & s_{\mathrm{sem}} \ge \tau \\ 0, & s_{\mathrm{sem}} < \tau \end{cases} \tag{13}$$
[0096] where $\tau$ is the semantic consistency threshold, set to 0.7. When the model output deviates too far from the input semantically, its semantic reward is set directly to zero, which prevents the model from being rewarded merely for generating a more fluent sentence.
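The thresholded semantic reward and its combination with the editing accuracy reward can be sketched as below. This is a minimal illustration: the frozen encoder is outside the sketch, so the functions take already computed embedding vectors as plain lists of floats, and all function names are hypothetical. The threshold of 0.7 and the weights 0.8 and 0.2 are the values given in the description.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def semantic_reward(input_vec, output_vec, threshold=0.7):
    """Similarity itself when it reaches the threshold, otherwise zero."""
    sim = cosine_similarity(input_vec, output_vec)
    return sim if sim >= threshold else 0.0


def grammar_reward(r_acc, r_sem, w_acc=0.8, w_sem=0.2):
    """Weighted sum of the editing accuracy and semantic consistency rewards."""
    return w_acc * r_acc + w_sem * r_sem
```

Because the semantic term carries only 0.2 of the weight, it acts as a guard against drift rather than as the main objective: a candidate that rewrites the sentence beyond the threshold loses the whole semantic share, but most of its reward still depends on editing accuracy.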
[0097] In Chinese spelling correction scenarios, the input is typically a natural language sentence that contains few spelling errors; even a long sentence usually has only one or two misspelled characters. This "error sparsity" makes the reward signal sparse during reinforcement learning training: the model receives an effective reward only when it makes the correct correction at the exact error position, while its behavior at most other positions contributes little to the final reward, which reduces training efficiency and harms learning stability. Furthermore, because Chinese spelling correction is usually processed character by character and most positions in a sentence hold correct characters, relying solely on a sentence-level reward or a simple edit distance as the training signal makes it difficult for the model to learn where corrections are needed and where the text must remain unchanged. In long sentences especially, even a few erroneous edits to correct characters can significantly degrade the overall output quality. Therefore, sentence-level evaluation metrics alone are insufficient to guide the model's fine-grained editing behavior.
[0098] To address the aforementioned issues, this invention designs a structured reward mechanism for Chinese spelling correction tasks. This mechanism maps the difference between the model's output and the target corrected text into a set of editing actions anchored at the input character positions. Different types of editing actions are assigned rewards or penalties, thus achieving fine-grained control over the model's editing behavior. This reward mechanism not only focuses on the overall similarity between the final generated text and the target corrected text but also evaluates the model's editing decisions at each character position. For example, a positive reward is given when the model performs a correct replacement at a true error position; a negative penalty is given when the model fails to perform a modification at a required position; a partial reward is given when the model attempts to edit but does not completely hit the correct character; and an additional penalty is imposed when the model makes unnecessary modifications to originally correct characters. In this way, the model gradually learns more precise editing strategies during training.
[0099] Furthermore, this invention incorporates the overall improvement in edit distance as an auxiliary reward term, used to measure whether the model output is closer to the reference corrected text overall compared to the original input. This design not only effectively alleviates the reward sparsity problem but also significantly reduces unnecessary modifications to correct characters by the model, thereby improving the model's editing accuracy and stability in Chinese spelling correction tasks and further reducing the probability of over-correction.
[0100] In one embodiment, the reward function for the Chinese spelling correction task, $R_{\mathrm{CSC}}$, includes an anchor point action reward and a global edit distance reward, as shown in the formula:
[0101] $$R_{\mathrm{CSC}} = R_{\mathrm{anchor}} + R_{\mathrm{dist}} \tag{14}$$
[0102] where $R_{\mathrm{anchor}}$ is the anchor point action reward and $R_{\mathrm{dist}}$ is the global edit distance reward;
[0103] The calculation process for anchor point action rewards includes:
[0104] For the current noisy text and the model output text, edit distance alignment is performed between the two, and the alignment result is mapped to edit actions anchored on the noisy text to obtain at least one anchor point position;
[0105] The predicted action for each anchor point position is compared with the standard action, and the action score corresponding to each anchor point position is calculated based on the comparison results.
[0106] The action scores corresponding to all anchor points are normalized using the total expected edit weight to obtain the anchor point action reward.
[0107] Given the source text $x$ and a candidate output text $\hat{y}$, Levenshtein edit distance alignment is first performed on the two texts, and the alignment result is mapped to editing actions anchored on the source text. For each source character position, the action space is defined as $\mathcal{A} = \{\mathrm{KEEP}, \mathrm{REPLACE}, \mathrm{DELETE}, \mathrm{INSERT}\}$, where $\mathrm{KEEP}$ means the original character is left unchanged, $\mathrm{REPLACE}$ means the character at the current position is replaced with a new character, $\mathrm{DELETE}$ means the character is deleted, and $\mathrm{INSERT}$ means characters are inserted at a gap in the sentence. Further, $a^{*}$ denotes the standard action sequence induced from the source input text to the target corrected text, and $\hat{a}$ denotes the predicted action sequence induced from the input text to the model output. Subsequent rewards are calculated by comparing $\hat{a}$ with $a^{*}$, which measures whether the model's editing behavior is reasonable.
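The alignment step described above can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the function name `align_actions`, the anchor encoding (`("pos", i)` for source character `i`, `("gap", g)` for the gap just before source character `g`), and the tie-breaking order in the backtrace (diagonal first, then deletion, then insertion) are all assumptions.

```python
def align_actions(source, output):
    """Levenshtein-align source to output and return (distance, actions).

    actions maps an anchor on the source text to an edit action:
      ("pos", i) -> ("KEEP",) | ("REPLACE", new_char) | ("DELETE",)
      ("gap", g) -> ("INSERT", inserted_text), text inserted before source[g]
    """
    n, m = len(source), len(output)
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dist[i][0] = i
    for j in range(m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if source[i - 1] == output[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j - 1] + cost,  # keep or replace
                             dist[i - 1][j] + 1,         # delete source char
                             dist[i][j - 1] + 1)         # insert output char

    actions = {}
    i, j = n, m
    while i > 0 or j > 0:
        same = i > 0 and j > 0 and source[i - 1] == output[j - 1]
        if i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + (0 if same else 1):
            actions[("pos", i - 1)] = ("KEEP",) if same else ("REPLACE", output[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            actions[("pos", i - 1)] = ("DELETE",)
            i -= 1
        else:
            # Insertion: the backtrace runs right to left, so prepend the character.
            key = ("gap", i)
            tail = actions.get(key, ("INSERT", ""))[1]
            actions[key] = ("INSERT", output[j - 1] + tail)
            j -= 1
    return dist[n][m], actions
```

Running the same function on the pair (source, target) gives the standard action sequence, and running it on (source, model output) gives the predicted one, so the two can be compared anchor by anchor. Gap anchors that are absent from the returned mapping can be treated as the keep action.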
[0108] If the input text itself does not need to be modified, the following condition is met:
[0109] $$d(x, y^{*}) = 0 \tag{15}$$
[0110] where $d(\cdot,\cdot)$ is the edit distance, $x$ is the input text, and $y^{*}$ is the target corrected text. In this case, the Chinese spelling correction reward function is defined as follows:
[0111] (16)
[0112] where the terms of formula (16) are the model output text and a penalty coefficient. The formula rewards complete preservation of the input when the input is already correct and penalizes any unnecessary modification, which makes it an important component in suppressing overcorrection. The penalty coefficient can be adjusted for the specific scenario; in this embodiment it is set to 0.1.
[0113] When the input contains spelling errors, the predicted action is compared with the standard action at each anchor point. For positions that should be edited, a positive reward is given if the predicted action is exactly the same as the standard action; a trial reward is given if the model performs a non-empty edit that does not exactly match the correct action; and a negative reward is given if the model performs no edit at a position that should be edited. For positions that should remain unchanged, an overcorrection penalty is applied if the model still performs an edit. Different edit types can be assigned different weights to reflect the relative difficulty and importance of deletion, replacement, and insertion. Furthermore, when both the standard action and the predicted action are the keep action, the action score is set to 0 to reduce the influence of unmodified positions on the overall reward. The action score is given by the formula:
[0114] (17)
[0115] where the trial editing reward coefficient can be set to 0.4; the overcorrection penalty coefficient, applied to an erroneous edit at a correct position, can be set to 0.6; and the action weight at each position depends on the edit type: the replacement weight is 0.5, the deletion weight is 0.3, and the insertion weight is 0.2. The three weight parameters are not limited to these values and can be adjusted flexibly for the specific application scenario. In this embodiment, the weights are derived from a statistical analysis of the frequency of each character-level error type in the spelling error dataset, so that they reflect the importance of the different types of spelling errors in the actual data distribution.
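One way to read the five cases above is sketched below. The coefficients 0.4 and 0.6 and the three edit weights come from the description, but the magnitudes chosen for the exact-match reward (plus the weight) and the missed-edit penalty (minus the weight), and the use of the predicted edit's weight for the overcorrection penalty, are assumptions made here for illustration, since the formula body is not available. The names `EDIT_WEIGHTS` and `action_score` are hypothetical.

```python
# Edit-type weights given in the description.
EDIT_WEIGHTS = {"REPLACE": 0.5, "DELETE": 0.3, "INSERT": 0.2}


def action_score(gold, pred, trial_coef=0.4, overcorrect_coef=0.6):
    """Score one anchor by comparing the predicted action with the standard action.

    Actions are tuples such as ("KEEP",), ("REPLACE", "x"), ("DELETE",), ("INSERT", "xy").
    """
    gold_kind, pred_kind = gold[0], pred[0]
    if gold_kind == "KEEP" and pred_kind == "KEEP":
        return 0.0                      # untouched correct position: neutral
    if gold_kind == "KEEP":
        # Edit at a position that should stay unchanged (over-correction).
        # Assumption: the penalty scales with the weight of the predicted edit type.
        return -overcorrect_coef * EDIT_WEIGHTS[pred_kind]
    weight = EDIT_WEIGHTS[gold_kind]
    if pred == gold:
        return weight                   # assumption: full weight for an exact hit
    if pred_kind == "KEEP":
        return -weight                  # assumption: full negative weight for a miss
    return trial_coef * weight          # attempted an edit but did not hit the target
```

Under these assumptions, a wrong replacement at a true error position still scores 0.4 × 0.5 = 0.2, which is what keeps the signal dense when the candidate is close but not exact.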
[0116] Finally, the action scores of all anchor points are normalized by the total expected edit weight to keep the reward scale stable across sentences. The anchor point action reward is given by the formula:
[0117] $$R_{\mathrm{anchor}} = \frac{1}{W} \sum_{i=1}^{N} s_i, \qquad W = \sum_{i=1}^{N} w_i \tag{18}$$
[0118] where $N$ is the total number of anchor points, $w_i$ is the weight of the $i$-th anchor point, $W$ is the total weight, i.e., the sum of the weights of all anchor points, and $s_i$ is the action score of the $i$-th anchor point.
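A small worked example of the normalization step, written as code. It assumes that the "total expected edit weight" is the sum of the weights of the edits required by the reference correction, and it returns 0 when that sum is 0 to avoid dividing by zero; both points are assumptions, and the function name is hypothetical.

```python
def anchor_action_reward(action_scores, expected_weights):
    """Normalize summed per-anchor action scores by the total expected edit weight."""
    total_weight = sum(expected_weights)
    if total_weight == 0:
        return 0.0  # assumption: no expected edits means no anchor reward
    return sum(action_scores) / total_weight


# Reference needs one replacement (weight 0.5) and one deletion (weight 0.3).
# The candidate hits the replacement (+0.5) and misses the deletion (-0.3).
example = anchor_action_reward([0.5, -0.3], [0.5, 0.3])
```

Here the reward is (0.5 − 0.3) / 0.8 = 0.25. Dividing by the expected weight means a sentence with one required edit and a sentence with five are scored on the same scale.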
[0119] In one embodiment, if the model's behavior during reinforcement learning training is evaluated solely by the anchor point action reward, the model may still collect partial rewards through local but ineffective edits while its overall output remains far from the correct result. Therefore, this invention further introduces a relative edit distance improvement measure, which evaluates whether the model-generated result is closer to the correct answer overall by measuring how the edit distance to the target corrected text changes from the input text to the model output text. The relative edit distance improvement is defined as follows:
[0120] $$\Delta = \frac{d(x, y^{*}) - d(\hat{y}, y^{*})}{d(x, y^{*}) + \epsilon} \tag{19}$$
[0121] where $d(\cdot,\cdot)$ is the edit distance, $x$ is the input text, $\hat{y}$ is the model output text, $y^{*}$ is the target corrected text, and $\epsilon$ is a smoothing parameter that prevents the denominator from being 0; it is set to 0.01.
[0122] The global edit distance reward is calculated as follows:
[0123] $$R_{\mathrm{dist}} = \lambda_{d} \cdot \operatorname{clip}(\Delta, -1, 1) \tag{20}$$
[0124] where $R_{\mathrm{dist}}$ is the global edit distance reward, $\lambda_{d}$ is the distance reward scaling factor, $\Delta$ is the relative edit distance improvement, and $\operatorname{clip}(\Delta, -1, 1)$ clips the relative edit distance improvement to the interval [-1, 1]. This term encourages the model output to move closer to the reference answer as a whole even when the local anchor points are not yet completely correct. The relative edit distance improvement evaluates whether the output of the text correction model is closer to the correct answer overall by measuring how the edit distance to the target corrected text changes from the input text to the output text of the text correction model.
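The global edit distance reward can be sketched as below. The sketch assumes that the improvement is the reduction in edit distance to the target, divided by the input's distance to the target plus the smoothing parameter 0.01. The value of the scaling factor is not stated in the description, so the default `scale=1.0` is a placeholder, and the function names are hypothetical.

```python
def levenshtein(a, b):
    """Plain Levenshtein edit distance between two strings (two-row DP)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete from a
                           cur[j - 1] + 1,              # insert into a
                           prev[j - 1] + (ca != cb)))   # substitute or match
        prev = cur
    return prev[-1]


def global_distance_reward(source, output, target, scale=1.0, eps=0.01):
    """Scaled, clipped relative improvement of the output over the source."""
    d_source = levenshtein(source, target)
    d_output = levenshtein(output, target)
    improvement = (d_source - d_output) / (d_source + eps)
    clipped = max(-1.0, min(1.0, improvement))
    return scale * clipped
```

A candidate that copies the input earns 0, a candidate that reaches the target earns close to `scale`, and a candidate that drifts far from the target is capped at `-scale` by the clipping, so a single very bad sample cannot dominate the group.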
[0125] This invention employs a two-stage training method based on task-pluggable reward functions in reinforcement learning. The specific process is as follows: First, based on the initialized error correction policy model, multiple candidate outputs are sampled for each input prompt, and rewards are calculated according to the task type: for Chinese grammar correction tasks, the reward combines at least editing accuracy and semantic consistency; for Chinese spelling correction tasks, the reward combines at least error-free input preservation, anchor-level editing action quality, and global edit distance improvement. Then, a group relative policy optimization algorithm is used to update the policy model based on the relative advantage values of the candidate outputs, and KL regularization constraints are applied to the current policy using a reference policy to increase the probability of generating high-reward candidate outputs while maintaining language quality and training stability. Furthermore, the Chinese spelling correction task and the Chinese grammar correction task share the same optimizer and reinforcement learning objective; task switching is achieved only by replacing the reward function.
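The pluggable-reward step can be sketched as follows. The description says that relative advantage values are computed from the scalar rewards of the sampled candidates but does not give the normalization, so this sketch assumes the commonly used form, in which each reward is centered on the group mean and divided by the group standard deviation. The function names, the `reward_fns` registry, and the task keys `"CSC"` and `"GEC"` are hypothetical. The policy update itself and the KL regularization term are not shown.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """Center each reward on the group mean and scale by the group std (assumed form)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def score_group(task_type, source, candidates, target, reward_fns):
    """Score all candidates sampled for one prompt with the task's reward function.

    reward_fns maps a task type identifier to a callable
    reward(source, candidate, target) -> float. Swapping the callable is the
    only task-specific part; the advantage computation is shared.
    """
    reward_fn = reward_fns[task_type]
    rewards = [reward_fn(source, cand, target) for cand in candidates]
    return rewards, group_relative_advantages(rewards)
```

When every candidate in a group receives the same reward, all advantages are 0 and the group contributes no gradient signal, which is one reason a denser reward is useful here.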
[0126] Finally, during the deployment phase, the system invokes the corresponding text correction model parameters based on the task type identifier in the input prompt, generating the corrected Chinese text output as the error correction result. Specifically, after training, optimized model parameters for Chinese spelling correction and optimized model parameters for Chinese grammar correction are obtained separately. During deployment, the system receives the text to be corrected and the corresponding task prompt information, where the task prompt information indicates the type of error correction task being executed. When the task prompt information indicates a Chinese spelling correction task, the system invokes the corresponding Chinese spelling correction model parameters to perform character-level error correction on the input text and outputs the spelling correction result; when the task prompt information indicates a Chinese grammar correction task, the system invokes the corresponding Chinese grammar correction model parameters to perform grammar correction on the input text and outputs the grammar correction result. Through this method, corresponding model checkpoints can be invoked according to different error correction tasks, achieving task-oriented deployment of the model and improving deployment flexibility and practical application convenience.
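The deployment-time dispatch can be sketched as a lookup from the task type identifier to a checkpoint. Everything named here is hypothetical: the task keys, the checkpoint paths, and the `load_model` callable, which stands in for whatever loading routine the serving stack provides.

```python
# Hypothetical mapping from task type identifier to trained parameters.
CHECKPOINTS = {
    "CSC": "checkpoints/spelling_correction",   # placeholder path
    "GEC": "checkpoints/grammar_correction",    # placeholder path
}


def select_checkpoint(task_type):
    """Return the checkpoint for a task type, rejecting unknown identifiers."""
    try:
        return CHECKPOINTS[task_type]
    except KeyError:
        raise ValueError(f"unknown task type: {task_type!r}") from None


def correct_text(text, task_type, load_model):
    """Load the task-specific parameters and run correction on the input text.

    load_model is a caller-supplied function: checkpoint path -> callable model.
    """
    model = load_model(select_checkpoint(task_type))
    return model(text)
```

In practice the loaded models would be cached rather than reloaded per request; the sketch only shows the routing decision.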
[0127] This invention uses a unified training backbone of supervised fine-tuning followed by group relative policy optimization, with pluggable reward functions that adapt it to Chinese spelling correction and Chinese grammar correction, respectively. This avoids building independent model structures and independent training pipelines for the two types of tasks, reduces training and maintenance complexity, and enhances cross-task knowledge transfer capabilities.
[0128] This application also provides a training system corresponding to the method embodiments described above. Since the system embodiments are basically similar to the method embodiments, the description is relatively simple. For details of the relevant technical features and their effects, please refer to the corresponding descriptions of the method embodiments provided above. This invention provides a two-stage large-scale model training system for Chinese text error correction. As shown in Figure 2, the system mainly includes:
[0129] The training dataset acquisition module is used to acquire a first training dataset and preprocess the samples in the first training dataset to obtain a second training dataset. The first training dataset includes noisy text-target correction text sample pairs corresponding to Chinese spelling correction tasks and Chinese grammar correction tasks. The second training dataset includes prompt-response sample pairs, which include domain-specific prompt words with Chinese text correction task features and target correction text. The prompt words are used to indicate the structured results output by the model.
[0130] The supervised fine-tuning module is used to perform supervised fine-tuning of the text correction model to be trained using a low-rank adaptation method based on the second training dataset, so as to obtain the initial text correction model.
[0131] The task reinforcement learning optimization module is used to optimize and train the initial text correction model using a reinforcement learning method based on the reward function of the text correction task, so as to obtain a trained text correction model. The text correction task includes Chinese grammar correction task and Chinese spelling correction task, and different text correction tasks correspond to different reward functions. The reward function of the Chinese spelling correction task includes anchor action reward and global edit distance reward, and the reward function of the Chinese grammar correction task includes edit accuracy reward and semantic consistency reward.
[0132] This application also provides an electronic device, which includes a processor and a memory. The memory stores at least one instruction or at least one program, which is loaded and executed by the processor to implement the two-stage large model training method for Chinese text error correction provided in the above-described method embodiments.
[0133] Furthermore, the electronic device may participate in or include the apparatus or system provided in the embodiments of this application. The electronic device may include one or more processors (which may include, but are not limited to, processing devices such as microprocessors (MCUs) or programmable logic devices (FPGAs)), a memory for storing data, and a transmission device for communication functions. In addition, it may also include a display, an input/output interface (I/O interface), a universal serial bus (USB) port (which may be included as one of the ports of the I/O interface), a network interface, a power supply, and/or a camera.
[0134] It should be noted that the aforementioned one or more processors and / or other data processing circuits are generally referred to herein as "data processing circuits". These data processing circuits can be implemented wholly or partially as software, hardware, firmware, or any other combination. Furthermore, the data processing circuits can be a single, independent processing module, or wholly or partially integrated into any other element within the device (or mobile device). As involved in the embodiments of this application, the data processing circuit serves as a processor control mechanism (e.g., selection of a variable resistor termination path connected to an interface).
[0135] The memory can be used to store software programs and modules of application software, such as the program instructions / data storage device corresponding to the method described in the embodiments of this application. The processor executes various functional applications and data processing by running the software programs and modules stored in the memory, thereby realizing the above-mentioned data processing method. The memory may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some instances, the memory may further include memory remotely located relative to the processor, and these remote memories can be connected to electronic devices via a network. Examples of the above-mentioned networks include, but are not limited to, the Internet, corporate intranets, local area networks, mobile communication networks, and combinations thereof.
[0136] The transmission device is used to receive or send data via a network. Specific examples of the network described above may include a wireless network provided by the device's communication provider. In one example, the transmission device includes a Network Interface Controller (NIC), which can connect to other network devices via a base station to communicate with the Internet. In another example, the transmission device may be a Radio Frequency (RF) module, used for wireless communication with the Internet.
[0137] The display can be, for example, a touchscreen liquid crystal display (LCD), which allows users to interact with the user interface of an electronic device (or mobile device).
[0138] This application also provides a computer storage medium storing at least one instruction or at least one program, which is loaded and executed by a processor to implement the two-stage large model training method for Chinese text error correction provided in the above-described method embodiments.
[0139] Optionally, in this embodiment, the aforementioned computer storage medium may be located at at least one of the multiple network servers in a computer network. Optionally, in this embodiment, the aforementioned storage medium may include, but is not limited to, various media capable of storing program code, such as USB flash drives, read-only memory (ROM), random access memory (RAM), portable hard drives, magnetic disks, or optical disks.
[0140] This application also provides a computer program product or computer program that includes computer instructions stored in a computer storage medium. The processor of an electronic device reads the computer instructions from the computer storage medium and executes the computer instructions, causing the electronic device to perform the two-stage large model training method for Chinese text error correction provided in the above-described method embodiments.
[0141] It should be noted that the order of the embodiments described above is merely for descriptive purposes and does not represent the superiority or inferiority of the embodiments. Furthermore, specific embodiments have been described above. Other embodiments are within the scope of the appended claims. In some cases, the actions or steps described in the claims can be performed in a different order than that shown in the embodiments and still achieve the desired result. Additionally, the processes depicted in the drawings do not necessarily require a specific or sequential order to achieve the desired result. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.
[0142] It should be understood that the above description of the preferred embodiments is quite detailed, but it should not be considered as a limitation on the scope of protection of this invention. Those skilled in the art, under the guidance of this invention, can make substitutions or modifications without departing from the scope of protection of the claims of this invention, and all such substitutions or modifications fall within the scope of protection of this invention. The scope of protection of this invention should be determined by the appended claims.
Claims
1. A two-stage large-scale model training method for Chinese text error correction, characterized in that, The method includes: A first training dataset is obtained and the samples in the first training dataset are preprocessed to obtain a second training dataset. The first training dataset includes noisy text-target correction text sample pairs corresponding to Chinese spelling correction task and Chinese grammar correction task. The second training dataset includes prompt-response sample pairs. The prompt-response sample pairs include domain-specific prompt words with Chinese text correction task features and target correction text. The prompt words are used to indicate the structured results output by the model. Based on the second training dataset, the text correction model to be trained is subjected to supervised fine-tuning using a low-rank adaptation method to obtain the initial text correction model. The initial text correction model is optimized and trained using a reinforcement learning method based on the reward function of the text correction task to obtain a trained text correction model. The text correction task includes Chinese grammar correction task and Chinese spelling correction task, and different text correction tasks correspond to different reward functions. The reward function of the Chinese spelling correction task includes anchor action reward and global edit distance reward, and the reward function of the Chinese grammar correction task includes edit accuracy reward and semantic consistency reward.
2. The method according to claim 1, characterized in that, The method further includes: Obtain an initial training dataset, which includes multiple initial training samples, each of which includes source text and target correction text; The initial training samples are identified and labeled with error types by using a pre-trained text error correction model, resulting in a sample set with error type labels. When the error type label is Chinese spelling correction task, the source text in the sample is enhanced with character-level obfuscation under the constraints of edit distance and sentence length to generate corresponding noisy text. When the error type label is Chinese grammar correction task, grammatical phenomena are injected into the source text under semantic consistency constraints to generate corresponding noise text, thereby obtaining noise text-target correction text sample pairs for the corresponding Chinese spelling correction task and Chinese grammar correction task.
3. The method according to claim 1, characterized in that, The step of performing supervised fine-tuning of the text correction model to be trained using a low-rank adaptation method based on the second training dataset to obtain the initial text correction model includes: The second training dataset is input into the text correction model to be trained. The trainable parameters of the text correction model to be trained are optimized and trained with the goal of minimizing the loss function to obtain the initial text correction model. The trainable parameters of the text correction model to be trained include the low-rank trainable incremental parameters introduced by the attention projection matrix and / or feedforward network weights in the model, and the other original parameters of the text correction model to be trained are frozen.
4. The method according to claim 1, characterized in that, optimizing and training the initial text correction model using a reinforcement learning method based on the reward function of the text correction task to obtain a trained text correction model includes: for each prompt-response sample pair, sampling multiple candidate output texts from the text correction model of the current policy; calculating a scalar reward for each candidate output text according to the reward function corresponding to the task type identifier, where the task type identifier indicates whether the task is a Chinese spelling correction task or a Chinese grammar correction task; calculating a relative advantage value from the scalar reward of each candidate output text; and using a group relative policy optimization algorithm to update the text correction model based on the relative advantage values of the candidate output texts, while applying a KL regularization constraint to the current policy with respect to the reference policy, thereby obtaining the trained text correction model.
5. The method according to claim 4, characterized in that, using a group relative policy optimization algorithm to update the text correction model based on the relative advantage values of the candidate output texts includes: when the task type identifier indicates a Chinese spelling correction task, obtaining the relative advantage value of each candidate output text according to the reward function of the Chinese spelling correction task, and performing a reinforcement learning policy update on the first text correction model task branch to obtain the optimized text correction model parameters for the Chinese spelling correction task; when the task type identifier indicates a Chinese grammar correction task, obtaining the relative advantage value of each candidate output text according to the reward function of the Chinese grammar correction task, and performing a reinforcement learning policy update on the second text correction model task branch to obtain the optimized text correction model parameters for the Chinese grammar correction task; wherein the first text correction model task branch and the second text correction model task branch have the same structure and the same initial parameters.
6. The method according to claim 5, characterized in that the calculation of the edit accuracy reward comprises: comparing the structured result predicted by the model with the reference structured result, wherein the predicted structured result includes at least an error flag and an output text; if the predicted error flag does not match the reference error flag, setting the edit accuracy reward to 0; if the model predicts that an error exists but the output text is identical to the original input, setting the edit accuracy reward to 0; and when the predicted error flag matches the reference error flag and the output text differs from the original input, calculating a character-level weighted score between the model output text and the target corrected text, wherein the character-level weighted score represents fine-grained editing quality, and calculating the edit accuracy reward based on the character-level weighted score.
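The gating logic of the edit accuracy reward in claim 6 can be sketched as follows. The claim does not define the character-level weighted score, so this sketch substitutes the similarity ratio from Python's `difflib.SequenceMatcher` as a stand-in. The claim also leaves open the case where both flags agree that there is no error and the text is unchanged; here that case is simply scored the same way, which is an assumption. All names are hypothetical.

```python
from difflib import SequenceMatcher


def edit_accuracy_reward(source, pred_flag, pred_text, ref_flag, target):
    """Gate on the error flags, then score the output against the target.

    Assumption: SequenceMatcher.ratio() stands in for the unspecified
    character-level weighted score.
    """
    if pred_flag != ref_flag:
        return 0.0  # predicted error flag disagrees with the reference flag
    if pred_flag and pred_text == source:
        return 0.0  # an error was predicted but nothing was changed
    matcher = SequenceMatcher(None, pred_text, target, autojunk=False)
    return matcher.ratio()


source = "我今天很高心"  # noisy input with one misspelled character
target = "我今天很高兴"  # target corrected text

perfect = edit_accuracy_reward(source, True, target, True, target)
wrong_flag = edit_accuracy_reward(source, False, source, True, target)
unchanged = edit_accuracy_reward(source, True, source, True, target)
```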
7. The method according to claim 5, characterized in that the calculation of the semantic consistency reward comprises: encoding the noisy text and the model output text with a frozen semantic encoder, and calculating the cosine similarity between the two encodings to obtain a semantic similarity; when the semantic similarity is less than a semantic consistency threshold, setting the semantic consistency reward to 0; and otherwise, setting the semantic consistency reward equal to the semantic similarity.
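The thresholding in claim 7 can be sketched on precomputed embedding vectors. The frozen semantic encoder itself is outside the scope of this sketch: the vectors are assumed to have been produced by it already, and the threshold value used in the example is purely illustrative. Function names are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def semantic_consistency_reward(noisy_vec, output_vec, threshold):
    """Return the similarity itself, or 0 when it falls below the threshold."""
    similarity = cosine_similarity(noisy_vec, output_vec)
    return 0.0 if similarity < threshold else similarity


# Toy embeddings standing in for the frozen encoder's outputs.
same_meaning = semantic_consistency_reward([3.0, 4.0], [3.0, 4.0], threshold=0.8)
drifted = semantic_consistency_reward([1.0, 0.0], [0.0, 1.0], threshold=0.8)
```

Because the reward is zero below the threshold rather than merely small, a rewrite that drifts from the input's meaning earns nothing from this term.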
8. The method according to claim 5, characterized in that the calculation of the anchor action reward comprises: for the current noisy text and the model output text, performing an edit-distance alignment between the two, and mapping the alignment result to edit actions anchored on the noisy text to obtain at least one anchor position; comparing the predicted action at each anchor position with the standard action, and calculating the action score of each anchor position based on the comparison result; and normalizing the edit reward scores of all anchor positions by the total expected edit weight to obtain the anchor action reward.
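The anchoring idea in claim 8 can be sketched under several assumptions. The alignment below uses `difflib.SequenceMatcher.get_opcodes`, which is a convenient stand-in and not necessarily a minimal edit-distance alignment. Each expected edit is given a weight of 1, a predicted action scores 1 only if it exactly matches the standard action at the same anchor, and spurious anchors score 0; the claim specifies none of these details. All names are hypothetical.

```python
from difflib import SequenceMatcher


def edit_actions(source, other):
    """Map a source-to-other alignment to actions anchored on source indices."""
    actions = {}
    matcher = SequenceMatcher(None, source, other, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # Anchor at the start index in the source text.
            actions[i1] = (tag, i2 - i1, other[j1:j2])
    return actions


def anchor_action_reward(source, output, target):
    """Score predicted actions against standard actions, anchor by anchor."""
    expected = edit_actions(source, target)   # standard actions
    predicted = edit_actions(source, output)  # actions taken by the model
    if not expected:
        # Assumption: with nothing to fix, only an untouched text scores 1.
        return 1.0 if not predicted else 0.0
    matched = sum(1.0 for pos, act in predicted.items()
                  if expected.get(pos) == act)
    return matched / len(expected)  # normalize by total expected edit weight


source = "我今天很高心"
target = "我今天很高兴"
correct = anchor_action_reward(source, target, target)
untouched = anchor_action_reward(source, source, target)
```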
9. The method according to claim 5, characterized in that the calculation of the global edit distance reward comprises: $R_{\mathrm{dist}} = \lambda \cdot \mathrm{clip}(\Delta, -1, 1)$; wherein $R_{\mathrm{dist}}$ denotes the global edit distance reward; $\lambda$ denotes the distance reward scaling factor; and $\mathrm{clip}(\Delta, -1, 1)$ denotes that the relative edit distance improvement $\Delta$ is clipped to the interval [-1, 1]. The relative edit distance improvement evaluates whether the output of the text correction model is closer overall to the correct answer, based on the change in edit distance between the output text of the text correction model and the target corrected text.
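The global edit distance reward of claim 9 can be sketched as a scaled, clipped improvement. The claim does not give the formula for the relative edit distance improvement; this sketch assumes it is the reduction in Levenshtein distance to the target, relative to the distance of the noisy text from the target. The function names and the default `scale` value are hypothetical.

```python
def levenshtein(a, b):
    """Character-level Levenshtein distance via row-by-row dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]


def global_edit_distance_reward(source, output, target, scale=1.0):
    """Scaling factor times the improvement clipped to [-1, 1].

    Assumption: improvement = (d(source, target) - d(output, target))
    divided by max(d(source, target), 1).
    """
    before = levenshtein(source, target)
    after = levenshtein(output, target)
    improvement = (before - after) / max(before, 1)
    return scale * max(-1.0, min(1.0, improvement))


source = "我今天很高心"
target = "我今天很高兴"
fixed = global_edit_distance_reward(source, target, target, scale=2.0)
no_change = global_edit_distance_reward(source, source, target, scale=2.0)
worse = global_edit_distance_reward(source, "你明日很高心", target, scale=2.0)
```

Clipping bounds the penalty for a heavily over-edited output, so a single poor sample cannot dominate the reward signal.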
10. A two-stage large-scale model training system for Chinese text error correction, characterized in that the system comprises: a training dataset acquisition module, configured to acquire a first training dataset and preprocess the samples in the first training dataset to obtain a second training dataset, wherein the first training dataset includes noisy text-target corrected text sample pairs corresponding to the Chinese spelling correction task and the Chinese grammar correction task, the second training dataset includes prompt-response sample pairs, each prompt-response sample pair includes a domain-specific prompt carrying Chinese text correction task features and the target corrected text, and the prompt instructs the model to output a structured result; a supervised fine-tuning module, configured to perform supervised fine-tuning of the text correction model to be trained using a low-rank adaptation method based on the second training dataset, so as to obtain an initial text correction model; and a task reinforcement learning optimization module, configured to optimize and train the initial text correction model using a reinforcement learning method based on the reward function of the text correction task, so as to obtain a trained text correction model, wherein the text correction task includes the Chinese grammar correction task and the Chinese spelling correction task, different text correction tasks correspond to different reward functions, the reward function of the Chinese spelling correction task includes the anchor action reward and the global edit distance reward, and the reward function of the Chinese grammar correction task includes the edit accuracy reward and the semantic consistency reward.