Text error correction method and device, storage medium and electronic equipment
By using a candidate word classification model and a custom loss function to filter and correct text, the problem of low accuracy in existing text correction methods is solved, and efficient and accurate spell correction is achieved in low-resource scenarios.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- NEW ORIENTAL EDUCATION & TECH GRP CO LTD
- Filing Date
- 2026-03-30
- Publication Date
- 2026-06-19
Description
Technical Field
[0001] This disclosure relates to the field of natural language processing technology, and more specifically, to a text error correction method, apparatus, storage medium, and electronic device.
Background Technology
[0002] With the rapid development of the internet, text information has experienced explosive growth, leading to a continuous expansion of the application scenarios for automatic text correction technology. Text correction is primarily used to automatically identify spelling errors in text and provide corresponding correction suggestions. However, current text correction methods suffer from low accuracy, and the results often fail to meet user expectations. Therefore, accurately correcting text is a pressing technical problem that needs to be solved.
Summary of the Invention
[0003] The purpose of this disclosure is to provide a text correction method, apparatus, storage medium, and electronic device to solve the technical problems existing in the related art.
[0004] In a first aspect, this disclosure provides a text error correction method, the method comprising: obtaining the text to be corrected and identifying the erroneous words in the text; inputting each erroneous word into a candidate word classification model to obtain a candidate word list that includes multiple candidate words, wherein the loss function of the candidate word classification model is obtained by optimizing the cross-entropy loss function of a multi-classification task; and generating multiple corrected texts from the text to be corrected according to the candidate word list, and filtering the multiple corrected texts to obtain the target corrected text.
[0005] Optionally, the method further includes: obtaining a training dataset, which includes real grading data, publicly available data, and simulation-generated data; and training the target large model in a supervised manner based on the training dataset to obtain the candidate word classification model, wherein the loss function of the candidate word classification model is a custom loss function that incorporates the Top-K mechanism.
[0006] Optionally, the custom loss function is obtained by weighted summation of the first loss function, the second loss function, and the third loss function. The first loss function is the standard cross-entropy loss, the second loss function is a function related to the Top-K mechanism, and the third loss function is a function related to the specified word reward mechanism.
[0007] Optionally, the first loss function is calculated as $L_{CE} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{c=1}^{C} y_{i,c}\log(p_{i,c})$, where N is the batch size, C is the number of categories, $y_{i,c}$ is the labeled true value indicating whether the true class of sample i is c, and $p_{i,c}$ is the model's predicted probability that sample i belongs to class c. The second loss function takes a form such as $L_{Top5} = -\frac{1}{N}\sum_{i=1}^{N}\mathbb{1}(y_i \in T_i)$, where $\mathbb{1}(\cdot)$ is the indicator function, $y_i$ is the true token of sample i, and $T_i$ is the set of the top-5 tokens predicted by the model for the i-th sample. The third loss function takes a form such as $L_{K12} = -\frac{1}{N}\sum_{i=1}^{N}\mathbb{1}(\hat{y}_i \in \text{K12})$, where $\hat{y}_i$ is the token predicted for sample i and K12 refers to the vocabulary content covered in the K12 education stage. The custom loss function is calculated as $L = \lambda_1 L_{CE} + \lambda_2 L_{Top5} + \lambda_3 L_{K12}$, where $\lambda_1$, $\lambda_2$, and $\lambda_3$ are the weighting coefficients, and their sum is 1.
[0008] Optionally, the step of filtering the plurality of corrected texts to obtain the target corrected text includes: Obtain a comprehensive score for each of the corrected texts, and select the corrected text with the highest comprehensive score as the target text for error correction.
[0009] Optionally, the comprehensive score is obtained based on at least one of the following parameters: a perplexity score, obtained by evaluating the reasonableness of the corrected text; an edit distance score, obtained by evaluating the similarity between the erroneous word and the replacement word; a confidence score of the replacement word, obtained by evaluating the reliability of the replacement word; a grammatical correctness score, obtained by evaluating the grammatical compliance of the corrected text; a semantic consistency score, obtained by evaluating the consistency between the corrected text and the semantic context; and a domain fit score, obtained by evaluating the domain vocabulary to which the replacement word belongs.
[0010] In a second aspect, this disclosure provides a text correction device, the device comprising: a determination module configured to acquire the text to be corrected and determine the erroneous words in the text to be corrected; an input module configured to input the erroneous word into the candidate word classification model to obtain a candidate word list, the candidate word list including multiple candidate words, wherein the loss function of the candidate word classification model is obtained by optimizing the cross-entropy loss function of a multi-classification task; and a filtering module configured to generate multiple corrected texts from the text to be corrected according to the candidate word list, and to filter the multiple corrected texts to obtain the target corrected text.
[0011] In a third aspect, this disclosure provides a computer-readable storage medium having a computer program stored thereon that, when executed by a processor, implements the steps of the method described in any of the first aspects.
[0012] In a fourth aspect, this disclosure provides an electronic device comprising: a memory on which a computer program is stored; and a processor for executing the computer program in the memory to implement the steps of the method of any one of the first aspects.
[0013] In a fifth aspect, this disclosure provides a computer program product, including a computer program that, when executed by a processor, implements the steps of the method described in any of the first aspects.
[0014] After obtaining the text to be corrected, this disclosure uses a candidate word classification model to obtain a candidate word list and then filters the corrected texts generated from the candidate words, so that a more accurate target corrected text can be obtained. First, the text to be corrected is obtained, and the erroneous words in the text are identified. Then, the erroneous words are input into the candidate word classification model to obtain a candidate word list. This candidate word list includes multiple candidate words. The loss function of the candidate word classification model is obtained by optimizing the cross-entropy loss function of multi-classification tasks. Finally, multiple corrected texts are generated from the text to be corrected according to the candidate word list, and these corrected texts are filtered to obtain the target corrected text. This effectively improves the fluency and accuracy of the overall sentence in spelling correction.
[0015] Other features and advantages of this disclosure will be described in detail in the following detailed description section.
Attached Figure Description
[0016] The accompanying drawings are provided to further illustrate the present disclosure and form part of the specification. They are used together with the following detailed description to explain the present disclosure, but do not constitute a limitation thereof. In the drawings: Figure 1 is a flowchart illustrating a text correction method according to an exemplary embodiment.
[0017] Figure 2 is an example diagram illustrating the construction of a training dataset in a text correction method according to an exemplary embodiment.
[0018] Figure 3 is an example diagram illustrating the training of a candidate word classification model in a text correction method according to an exemplary embodiment.
[0019] Figure 4 is a specific example diagram illustrating how the target corrected text is obtained in a text correction method according to an exemplary embodiment.
[0020] Figure 5 is a block diagram illustrating a text correction apparatus according to an exemplary embodiment of the present disclosure.
[0021] Figure 6 is a block diagram illustrating an electronic device according to an exemplary embodiment of the present disclosure.
Detailed Implementation
[0022] The specific embodiments of this disclosure will be described in detail below with reference to the accompanying drawings. It should be understood that the specific embodiments described herein are for illustration and explanation only and are not intended to limit this disclosure.
[0023] Embodiments of this disclosure will now be described in more detail with reference to the accompanying drawings. While some embodiments of this disclosure are shown in the drawings, it should be understood that this disclosure can be implemented in various forms and should not be construed as limited to the embodiments set forth herein. Rather, these embodiments are provided to provide a more thorough and complete understanding of this disclosure. It should be understood that the accompanying drawings and embodiments of this disclosure are for illustrative purposes only and are not intended to limit the scope of protection of this disclosure.
[0024] It should be understood that the steps described in the method embodiments of this disclosure may be performed in different orders and / or in parallel. Furthermore, the method embodiments may include additional steps and / or omit the steps shown. The scope of this disclosure is not limited in this respect.
[0025] The term "comprising" and its variations as used herein are open-ended inclusions, meaning "including but not limited to". The term "based on" means "at least partially based on". The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one additional embodiment"; the term "some embodiments" means "at least some embodiments". Definitions of other terms will be given in the description below.
[0026] It should be noted that the concepts of "first" and "second" mentioned in this disclosure are used only to distinguish different devices, modules or units, and are not used to limit the order of functions performed by these devices, modules or units or their interdependencies.
[0027] It should be noted that the terms "a" and "a plurality of" used in this disclosure are illustrative rather than restrictive, and those skilled in the art should understand that, unless otherwise expressly indicated in the context, they should be understood as "one or more".
[0028] The names of messages or information exchanged between multiple devices in the embodiments of this disclosure are for illustrative purposes only and are not intended to limit the scope of such messages or information.
[0029] It is understood that before using the technical solutions disclosed in the various embodiments of this disclosure, users should be informed of the types, scope of use, and usage scenarios of the personal information involved in this disclosure in an appropriate manner in accordance with relevant laws and regulations, and user authorization should be obtained.
[0030] English spell correction is a relatively common area in the development of NLP (Natural Language Processing). With the development of the times, the methods of spell correction have become more and more diverse. At the same time, with the current growth of computing resources, spell correction often requires more resources.
[0031] For example, an English essay correction app, as a standalone tutoring app, can be used to help primary and secondary school students improve their English writing skills. Among these features, the spelling correction module is indispensable; therefore, optimizing spelling correction capabilities is particularly important for learning devices.
[0032] Furthermore, spell correction methods vary across different domains. Spell correction in primary and secondary school settings differs fundamentally from spell correction in higher-level settings.
[0033] Among related technologies, error correction methods based on statistical models and knowledge graphs are limited by the number of parameters, making it difficult to achieve high accuracy in the domain. Models based on BERT for fine-tuning tend to modify existing characters due to model structure issues, making it difficult to reconstruct the entire sentence and handle scenarios where words need to be split when they are connected. In addition, due to the limitations of the training method, the selection of which word to use can only be based on probability, resulting in relatively poor error correction performance. Various methods based on large models require high computing power and have relatively higher error correction time, making them difficult to be effective in low-resource situations.
[0034] To address the aforementioned issues, this disclosure proposes a text correction method. This method utilizes a candidate word classification model to obtain more accurate candidate words. Based on this, an optimal path selection algorithm is used to filter the corrected text, further ensuring that the final corrected text is more accurate. In other words, spell correction is achieved by combining a candidate word classification model with a perplexity model, which assists the spell correction language model in making more accurate corrections. At the same time, further sentence path credibility analysis can be performed on the correction results, thereby improving spell correction performance.
[0035] Figure 1 is a flowchart of a text correction method according to an exemplary embodiment. As shown in Figure 1, the text correction method may include the following steps: In step S110, the text to be corrected is obtained, and the erroneous words in the text to be corrected are identified.
[0036] In this embodiment of the disclosure, the text to be corrected can be a sentence, paragraph, or text, etc. The text to be corrected can be the text that the user inputs according to the actual situation and needs to be checked or corrected. The text to be corrected can be Chinese text or English text, etc., and it can be composed of multiple initial words.
[0037] For example, the text to be corrected in this embodiment of the disclosure can be an English sentence.
[0038] As an optional approach, after receiving the text to be corrected input by the user, embodiments of this disclosure can identify erroneous words in the text to be corrected, that is, detect problematic words in the text to be corrected.
[0039] In this process, embodiments of this disclosure can perform a word matching detection operation and a spelling error judgment operation. The word matching detection operation compares each word in the input text one by one against a pre-constructed domain word list to determine whether it exists in the domain word list. The spelling error judgment operation determines that the text has no spelling errors, and terminates the error correction process, when all words in the text are found in the domain word list.
[0040] In other words, after obtaining the text to be corrected, this embodiment of the disclosure can match each initial word in the text with each word in the target domain vocabulary. If it is determined that a word matching the initial word exists in the target domain vocabulary, then the initial word is determined to exist in the target domain vocabulary. Conversely, if it is determined that no word matching the initial word exists in the target domain vocabulary, then the initial word is determined not to exist in the target domain vocabulary.
[0041] The target domain vocabulary can be determined from the text to be corrected; that is, different texts can have different target domain vocabularies. In other words, before performing vocabulary matching on the text to be corrected, this embodiment can first classify the text to be corrected, that is, determine the target domain to which the text belongs, and use the domain vocabulary corresponding to that target domain as the target domain vocabulary. For example, if the text to be corrected belongs to K12 English, then the target domain vocabulary can be the domain vocabulary corresponding to K12. Similarly, if the text to be corrected belongs to K9 English, then the target domain vocabulary can be the domain vocabulary corresponding to K9.
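The word matching detection described above can be sketched as follows. This is a minimal illustration, not the disclosed implementation: the function name `find_erroneous_words`, the regular-expression tokenizer, the case-insensitive comparison, and the miniature `K12_VOCAB` word list are all assumptions introduced for the example.

```python
import re

def find_erroneous_words(text, domain_vocab):
    """Return the words in `text` that are absent from `domain_vocab`."""
    # Tokenize on runs of letters and apostrophes (a simplifying assumption).
    words = re.findall(r"[A-Za-z']+", text)
    vocab = {w.lower() for w in domain_vocab}
    # A word is treated as erroneous when it is not in the domain word list.
    return [w for w in words if w.lower() not in vocab]

# Hypothetical miniature K12 word list, for illustration only.
K12_VOCAB = {"everyone", "has", "the", "same", "rights"}

errors = find_erroneous_words("Overyone has the same rights", K12_VOCAB)
# An empty result would mean no spelling errors, so correction could stop early.
```

An empty list corresponds to the spelling error judgment operation ending the correction process; a non-empty list is what would be passed on to step S120.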
[0042] After obtaining the erroneous word in the text to be corrected, this embodiment of the disclosure can input the erroneous word into the candidate word classification model to obtain a candidate word list through the candidate word classification model, that is, proceed to step S120.
[0043] In step S120, the erroneous word is input into the candidate word classification model to obtain a candidate word list.
[0044] As described above, after obtaining the erroneous word in the text to be corrected, this embodiment of the present disclosure can input the erroneous word into the candidate word classification model to obtain a candidate word list through the candidate word classification model.
[0045] In other words, when it is determined that an initial word in the text to be corrected is not included in the target domain vocabulary, this embodiment of the disclosure can treat it as an erroneous word and call the candidate word classification model (candidate word selection model) to generate a candidate word list for the erroneous word. This candidate word list can also be called a candidate replacement word list.
[0046] As an optional approach, the text to be corrected may contain zero, one, or multiple erroneous words. When multiple erroneous words are included, this embodiment of the disclosure can input each erroneous word separately into the candidate word classification model to obtain a candidate word list corresponding to each erroneous word. In other words, each erroneous word can correspond to its own candidate word list.
[0047] Here, the candidate word list can include multiple candidate words, and the loss function of the candidate word classification model can be obtained by optimizing the cross-entropy loss function of the multi-classification task.
[0048] Optionally, embodiments of this disclosure may obtain a training dataset, which, as shown in the example of Figure 2, may include real grading data 201, publicly available data 202, and simulation-generated data 203. That is, the training data in this embodiment primarily originates from real essay grading data, publicly available datasets, and generated simulation data.
[0049] In addition, as can be understood from Figure 2, simulation data can be generated by at least one of the following methods: generating misspelled words by building an agent based on a large model; obtaining specialized misspelled words through a misspelled English character mapping table; or obtaining data based on statistical models. For example, the simulation data in this embodiment of the disclosure can be generated with simulation data generation tools such as a misspelled English character mapping table or a statistical model (e.g., neuspell). Furthermore, the training dataset in this embodiment may include raw data and labeled data, wherein the raw data may be sentences containing incorrect words, and the labeled data may be the same sentences with the incorrect words corrected. For example, the raw data may be "In my opinion, an egalitarian society is one in which overyone has the same rights and the same opportunities," and the labeled data may be "In my opinion, an egalitarian society is one in which everyone has the same rights and the same opportunities."
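As an illustration of the mapping-table route to simulation data, the sketch below corrupts one character of one word in a correct sentence and returns a (raw, labeled) pair. The contents of `CONFUSION` and the helper names `corrupt_word` and `make_pair` are hypothetical; the disclosure does not give the actual mapping table.

```python
import random

# Hypothetical character confusion table; the real mapping table is not given.
CONFUSION = {"e": ["o", "a"], "i": ["y"], "a": ["e"]}

def corrupt_word(word, rng):
    """Replace one mappable character with a confusable one, if any exists."""
    positions = [i for i, ch in enumerate(word) if ch in CONFUSION]
    if not positions:
        return word  # nothing to corrupt in this word
    i = rng.choice(positions)
    return word[:i] + rng.choice(CONFUSION[word[i]]) + word[i + 1:]

def make_pair(sentence, rng):
    """Return (raw, labeled): a corrupted sentence and its correct original."""
    words = sentence.split()
    idx = rng.randrange(len(words))
    corrupted = words[:]
    corrupted[idx] = corrupt_word(words[idx], rng)
    return " ".join(corrupted), sentence

rng = random.Random(0)  # seeded so the example is reproducible
raw, labeled = make_pair("everyone has the same rights", rng)
```

Pairs produced this way have the same shape as the raw/labeled example given above: the raw sentence carries the error and the labeled sentence is the correction target.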
[0050] After obtaining the training dataset, this embodiment of the disclosure can perform supervised training on the target large model based on the training dataset to obtain a candidate word classification model. The loss function of the candidate word classification model can be a custom loss function that incorporates a Top-K mechanism.
[0051] For example, after collecting and obtaining a high-quality training dataset of spelling errors, this embodiment of the disclosure can load it into a pre-trained semantic model to prepare for subsequent fine-tuning, thus ensuring the correct data format. Additionally, this embodiment of the disclosure can also preprocess the training dataset during this process, such as performing word segmentation and data cleaning on the data in the training dataset.
[0052] In this embodiment of the disclosure, the custom loss function can be obtained by weighted summation of the first loss function, the second loss function, and the third loss function. The first loss function can be the standard cross-entropy loss; the second loss function can be a function related to the Top-K mechanism; and the third loss function can be a function related to a specified word reward mechanism.
[0053] For example, the first loss function is calculated as follows: $L_{CE} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{c=1}^{C} y_{i,c}\log(p_{i,c})$, where N is the batch size; C is the number of categories, i.e., the size of the target vocabulary (domain vocabulary); $y_{i,c}$ is the labeled true value, with $y_{i,c}=1$ if the true class of sample i is c and $y_{i,c}=0$ otherwise; and $p_{i,c}$ is the model's predicted probability that sample i belongs to class c.
[0054] Optionally, the second loss function can be a function related to a Top-5 reward mechanism. The goal of this Top-5 reward mechanism is to award a reward of 1 if the target token is among the Top-5 tokens predicted by the model, thus reducing the loss. Here, the second loss function takes a form such as $L_{Top5} = -\frac{1}{N}\sum_{i=1}^{N}\mathbb{1}(y_i \in T_i)$, where $\mathbb{1}(\cdot)$ is the indicator function, $y_i$ is the true token of sample i, $T_i$ is the set of the top-5 tokens predicted by the model for the i-th sample, and N is the batch size. Optionally, the third loss function can be a function related to a specific vocabulary reward mechanism, such as the K12 vocabulary reward mechanism. The third loss function takes a form such as $L_{K12} = -\frac{1}{N}\sum_{i=1}^{N}\mathbb{1}(\hat{y}_i \in \text{K12})$, where $\hat{y}_i$ is the token predicted for sample i and K12 refers to the vocabulary content covered in the K12 education stage.
[0055] In this embodiment of the disclosure, the custom loss function can be obtained by weighted summation of the first loss function, the second loss function, and the third loss function. The custom loss function is calculated as follows: $L = \lambda_1 L_{CE} + \lambda_2 L_{Top5} + \lambda_3 L_{K12}$, where $\lambda_1$, $\lambda_2$, and $\lambda_3$ are the weighting coefficients, and their sum is 1. When $\lambda_1$ is larger, the model focuses more on the standard cross-entropy loss; when $\lambda_2$ is larger, the model focuses more on the Top-5 word loss; when $\lambda_3$ is larger, the model pays more attention to the K12 vocabulary recall loss.
[0056] In this embodiment of the disclosure, the weighting coefficients $\lambda_1$, $\lambda_2$, and $\lambda_3$ can be fixed, such as $\lambda_1=0.7$, $\lambda_2=0.2$, $\lambda_3=0.1$, meaning the weight of the first loss function can be greater than the weight of the second loss function, and the weight of the second loss function can be greater than the weight of the third loss function.
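The weighted combination with the fixed coefficients 0.7, 0.2, and 0.1 can be sketched in plain Python as below. The exact form of the Top-5 and K12 terms is an assumption here: each is modelled as a reward of 1 per sample, averaged over the batch and subtracted from the loss, and the K12 reward is assumed to apply to the model's top prediction. The function name `custom_loss` and its parameters are hypothetical. The sketch only computes the loss value; how the reward terms are made trainable is not stated in the text.

```python
import math

def custom_loss(probs, targets, k12_ids, weights=(0.7, 0.2, 0.1), k=5):
    """probs: per-sample lists of class probabilities; targets: true class ids."""
    lam1, lam2, lam3 = weights
    n = len(probs)
    ce = topk_reward = k12_reward = 0.0
    for p, y in zip(probs, targets):
        ce += -math.log(p[y])  # standard cross-entropy term for this sample
        # Class ids ordered from most to least probable.
        ranked = sorted(range(len(p)), key=lambda c: p[c], reverse=True)
        if y in ranked[:k]:
            topk_reward += 1.0  # Top-K reward: true class is among the top k
        if ranked[0] in k12_ids:
            k12_reward += 1.0   # assumed: reward when the top prediction is a K12 word
    # Rewards enter with a negative sign, so earning them lowers the loss.
    return (lam1 * ce - lam2 * topk_reward - lam3 * k12_reward) / n

hit = custom_loss([[0.7, 0.2, 0.1]], [0], k12_ids={0}, k=2)
miss = custom_loss([[0.1, 0.2, 0.7]], [0], k12_ids={0}, k=2)
```

In this toy call, `hit` is lower than `miss` because the first sample earns both rewards and has a smaller cross-entropy term.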
[0057] Optionally, the weighting coefficients $\lambda_1$, $\lambda_2$, and $\lambda_3$ (for the first, second, and third loss functions, respectively) can also be flexibly adjusted according to changes in the training data. For example, during the process of training a model using data from the training dataset, this embodiment of the disclosure can obtain the category of the training data, such as determining whether the training data is K12-related data; if so, $\lambda_3$ can be increased and $\lambda_1$, $\lambda_2$ reduced. Conversely, if it is determined that the training data is not K12-related, $\lambda_3$ can be reduced and $\lambda_1$, $\lambda_2$ increased.
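One possible way to adjust the coefficients while keeping their sum at 1 is sketched below. It assumes that the K12 term's weight is the one raised for K12-related data and that the other two weights are rescaled proportionally; the function name `adjust_weights` and the `step` size are hypothetical and not taken from the disclosure.

```python
def adjust_weights(weights, is_k12, step=0.05):
    """Shift weight toward (or away from) the K12 term; result still sums to 1."""
    lam1, lam2, lam3 = weights
    delta = step if is_k12 else -step
    lam3 = min(max(lam3 + delta, 0.0), 1.0)  # keep the K12 weight within [0, 1]
    rest = lam1 + lam2
    if rest == 0:
        return 0.0, 0.0, 1.0  # degenerate input: all weight already on the K12 term
    scale = (1.0 - lam3) / rest  # rescale the other two so the total stays 1
    return lam1 * scale, lam2 * scale, lam3

up = adjust_weights((0.7, 0.2, 0.1), is_k12=True)
down = adjust_weights((0.7, 0.2, 0.1), is_k12=False)
```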
[0058] In summary, by combining standard cross-entropy loss, Top-5 reward mechanism, and K12 vocabulary reward mechanism, a more accurate custom loss function can be obtained. The model's loss function is optimized based on the cross-entropy loss function commonly used in multi-class classification tasks, incorporating a top-k mechanism. If the corresponding token appears in the top-k list, the task model's inference is considered correct. This approach is more suitable for tasks like obtaining candidate word lists compared to the traditional cross-entropy loss function.
[0059] Optionally, before training the candidate word classification model, embodiments of this disclosure may first perform a comparison and evaluation of models and fine-tuning frameworks. This comparison and evaluation operation is used to select an open-source model with better semantic understanding performance as the base model under low-resource conditions, such as the BERT series or ALBERT series models. For example, embodiments of this disclosure may select the BERT-Base model as the base model (foundation model) for the K12 English spelling correction task, and compare and evaluate different fine-tuning frameworks (such as PyTorch, Keras, etc.) to select the most suitable framework.
[0060] Based on this, the embodiments of this disclosure can employ a supervised fine-tuning approach to train the base model. This involves writing scripts based on a selected fine-tuning framework and injecting the optimized training data into the open-source large model. The parameter adjustments (loss function optimization) involved in the fine-tuning process have been described in detail in the above embodiments and will not be repeated here. Furthermore, since the candidate word selection task differs from traditional text multi-classification tasks, domain-specific optimization of the loss function ensures that the model can better learn the task characteristics.
[0061] Optionally, this disclosure can monitor metrics and fine-tune hyperparameters when training the candidate word classification model. That is, during the model training process, the embodiments of this disclosure can monitor the fluctuations of training loss, validation loss and other curves in real time, and optimize and fine-tune various hyperparameters used (such as learning rate, batch size, etc.) to improve the fitting accuracy of the model.
[0062] It should be noted that after obtaining the candidate word classification model through training, the embodiments of this disclosure can perform quality evaluation on the model. For example, a spelling correction quality evaluation strategy can be used to conduct quality evaluation tests on the model to ensure that the model's performance on the K12 English spelling correction task meets the preset standard, and necessary iterative optimization can be performed based on the evaluation results.
[0063] To better illustrate the training process of the candidate word classification model, embodiments of this disclosure provide the example diagram shown in Figure 3. As can be seen from Figure 3, after obtaining the training dataset, this embodiment can first load and prepare the data while comparing and evaluating the model and the fine-tuning framework. Supervised fine-tuning is then performed: the model is trained based on the optimized loss function, with metric monitoring and hyperparameter fine-tuning. After obtaining the trained model, a quality assessment can be performed to determine whether the candidate word classification model obtained through training meets the standards. If it does, this embodiment can end the training of the candidate word classification model. Conversely, if it does not, this embodiment can re-perform the comparison and evaluation of the model and the fine-tuning framework, and repeat the subsequent operations until a candidate word classification model that meets the standards is obtained.
[0064] It should be noted that after obtaining erroneous words (words not included in the target domain vocabulary) by searching the target domain vocabulary, this embodiment can, on the one hand, input them into a candidate word classification model to obtain a candidate word list. On the other hand, this embodiment can also dynamically split the erroneous words, that is, perform a dynamic splitting operation on the unincluded words, such as splitting a single word into multiple sub-words. Based on this, it is determined whether all the sub-words obtained from the splitting exist in the target domain vocabulary. If all the sub-words obtained from the splitting exist in the target domain vocabulary, then the sub-words obtained from the splitting are included as candidates in the candidate word list.
[0065] In other words, if all the sub-words obtained from the splitting are in the vocabulary, the sub-words are added to the candidate word list, i.e., the candidate word list is updated. Conversely, if at least one word obtained from the splitting is not in the vocabulary, the original candidate word list is left unchanged.
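The dynamic splitting rule can be sketched as follows. For simplicity the example only tries two-way splits, which is an assumption; the text allows a word to be split into multiple sub-words. The function names and the tiny vocabulary are hypothetical.

```python
def split_candidates(word, vocab):
    """Return every two-way split of `word` whose parts are both in `vocab`."""
    w = word.lower()
    return [
        f"{w[:i]} {w[i:]}"
        for i in range(1, len(w))
        if w[:i] in vocab and w[i:] in vocab
    ]

def update_candidates(word, candidates, vocab):
    """Append valid splits; the list is unchanged when no split is valid."""
    return candidates + split_candidates(word, vocab)

# Hypothetical miniature domain vocabulary, for illustration only.
VOCAB = {"a", "an", "egalitarian", "everyone"}

splits = split_candidates("anegalitarian", VOCAB)
unchanged = update_candidates("overyone", ["everyone"], VOCAB)
```

Here "anegalitarian" yields the split "an egalitarian", while "overyone" has no split whose parts are all in the vocabulary, so its candidate list is left as it was.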
[0066] In step S130, multiple corrected texts are generated from the text to be corrected according to the candidate word list, and the multiple corrected texts are filtered to obtain the target corrected text.
[0067] As an optional approach, after obtaining the candidate word list, embodiments of this disclosure can generate multiple corrected texts from the text to be corrected according to the candidate word list, and filter these corrected texts to obtain the target corrected text.
[0068] For example, after error detection has been completed for the entire sentence, if there are n erroneous words in the sentence, each erroneous word has k candidate words, and the two most common splits produced by dynamic splitting are also considered, this embodiment of the disclosure can generate (k+2)^n possible sentence correction paths (corrected texts). Based on this, this embodiment of the disclosure can use an optimal path selection algorithm to evaluate and compare all possible correction paths to determine and output the optimal corrected sentence.
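The enumeration of correction paths is a Cartesian product over the options of each erroneous word, which the following sketch makes concrete. The function name `correction_paths`, the token list, and the option lists are hypothetical illustration data; each erroneous word is given one model candidate plus two splits, so k = 1 and n = 2.

```python
from itertools import product

def correction_paths(tokens, options):
    """tokens: words of the sentence; options: {index: [replacement, ...]}."""
    idxs = sorted(options)
    paths = []
    # One path per combination of replacements across all erroneous positions.
    for combo in product(*(options[i] for i in idxs)):
        out = tokens[:]
        for i, replacement in zip(idxs, combo):
            out[i] = replacement
        paths.append(" ".join(out))
    return paths

tokens = ["overyone", "has", "rigts"]
options = {
    0: ["everyone", "over yone", "overy one"],  # 1 candidate + 2 splits
    2: ["rights", "rig ts", "ri gts"],          # 1 candidate + 2 splits
}
paths = correction_paths(tokens, options)
```

With k = 1 and n = 2 this produces (1+2)^2 = 9 paths. Because the count grows exponentially in n, exhaustive enumeration is only practical when a sentence contains few erroneous words.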
[0069] Specifically, in the process of filtering the multiple corrected texts, this embodiment of the disclosure can obtain a comprehensive score for each corrected text and take the corrected text with the highest comprehensive score as the target error correction text.
[0070] Here, the overall score for each corrected text can be obtained based on at least one of the following parameters: perplexity score, edit distance score, confidence score of replaced words, grammatical correctness score, semantic consistency score, and domain fit score.
[0071] The perplexity score (PPL Score) is obtained by evaluating the reasonableness of the corrected text. It can be obtained by evaluating the probability of the sentence through a language model. The lower the perplexity score, the more accurate the model's prediction of the text and the higher the reasonableness of the sentence.
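As a small worked example, perplexity can be computed from the per-token probabilities that a language model assigns to a sentence, as the exponential of the mean negative log-probability. The probabilities below are made-up inputs, and the function name `perplexity` is an assumption; the disclosure does not specify which language model is used.

```python
import math

def perplexity(token_probs):
    """exp of the mean negative log-probability of the tokens."""
    n = len(token_probs)
    return math.exp(-sum(math.log(p) for p in token_probs) / n)

uniform = perplexity([0.25, 0.25, 0.25, 0.25])  # every token has probability 1/4
fluent = perplexity([0.5, 0.5])
awkward = perplexity([0.1, 0.5])
```

A sentence whose tokens each have probability 1/4 has perplexity 4, and the sentence with the less likely token (`awkward`) gets the higher perplexity, matching the statement that lower perplexity indicates a more reasonable sentence.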
[0072] The edit distance score is an evaluation of the similarity between the misspelled word and the replacement word. In other words, the edit distance score is mainly used to measure the similarity between the original misspelled word and the replacement word.
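The text does not say which edit distance is used; the sketch below assumes the common Levenshtein distance (insertions, deletions, and substitutions each cost 1), computed with a row-by-row dynamic program.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, 1):
        cur = [i]  # distance from a[:i] to the empty prefix of b
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = cur
    return prev[-1]

d = edit_distance("overyone", "everyone")
```

For the misspelling used earlier, "overyone" and "everyone" differ by a single substitution, so a replacement this close to the original word would score well on similarity.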
[0073] The confidence score of the replacement word can be obtained by evaluating the reliability of the replacement word. In this embodiment of the disclosure, the confidence of the replacement word can be evaluated by a spelling correction model to obtain the confidence score of the replacement word. The higher the confidence score, the more reliable the replacement is.
[0074] The grammatical correctness score can be obtained by evaluating the grammatical compliance of the corrected text. In this embodiment of the disclosure, a grammar checking tool (such as LanguageTool) can be used to determine whether the corrected text conforms to grammatical rules in order to obtain the grammatical correctness score. The higher the score, the more closely the corrected text conforms to the grammatical rules.
[0075] The semantic consistency score can be obtained by evaluating the consistency between the corrected text and the context semantics. In this embodiment of the disclosure, a semantic similarity model (such as the BERT model) can be used to determine whether the replaced text (corrected text) is consistent with the corresponding context semantics.
[0076] The domain fit score is obtained by evaluating the domain vocabulary to which the replacement word belongs. In other words, it indicates whether the selected replacement word exists in the domain vocabulary. For example, if there are two replacement words, where the first belongs to the K12 domain vocabulary and the second does not, then the domain fit score of the first replacement word can be 1, while that of the second replacement word is 0.
[0077] For example, the overall score of the corrected text can be calculated using the following formula: Score(x) = α×PPL(x) + β×EditDst(x) + γ×(1-Conf(x)) + δ×GE(x) + ε×(1-Similarity(x)) + ζ×(1-Field(x)), where α, β, γ, δ, ε, and ζ are weighting parameters used to adjust the importance of each indicator. PPL(x) is the perplexity of path x; EditDst(x) is the average edit distance between the replaced word and the original word in path x; Conf(x) is the average confidence of the replaced word in path x; GE(x) is the number of grammatical errors in path x; Similarity(x) is the semantic similarity between path x and the context; Field(x) is the domain fit score of path x.
[0078] For each path x, this embodiment of the disclosure can calculate its comprehensive score Score(x). Because every term in Score(x) is a penalty that grows as the path gets worse, the path with the lowest score is selected as the optimal path, as shown in the following formula: x* = argmin_x Score(x).
[0079] Here, path x can be any corrected text.
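The scoring and selection step can be sketched as follows. The sketch assumes that the six per-path indicators have already been computed by the respective tools. The class name `PathMetrics`, the weight names (`alpha` through `zeta`), the weight values, and the indicator values are all hypothetical; the disclosure does not specify them.

```python
from dataclasses import dataclass


@dataclass
class PathMetrics:
    text: str          # the corrected sentence (path x)
    ppl: float         # PPL(x): perplexity of the path
    edit_dst: float    # EditDst(x): average edit distance of the replacements
    conf: float        # Conf(x): average replacement confidence, in [0, 1]
    ge: int            # GE(x): number of grammatical errors
    similarity: float  # Similarity(x): semantic similarity, in [0, 1]
    domain_fit: float  # Field(x): domain fit score, in [0, 1]


# Hypothetical weights; in practice they would be tuned for the target domain.
WEIGHTS = {"alpha": 0.3, "beta": 0.1, "gamma": 0.2,
           "delta": 0.2, "epsilon": 0.1, "zeta": 0.1}


def score(m, w=WEIGHTS):
    """Weighted sum of penalty terms; a lower value means a better path."""
    return (w["alpha"] * m.ppl
            + w["beta"] * m.edit_dst
            + w["gamma"] * (1 - m.conf)
            + w["delta"] * m.ge
            + w["epsilon"] * (1 - m.similarity)
            + w["zeta"] * (1 - m.domain_fit))


def best_path(paths):
    """Return the path with the lowest comprehensive score (the argmin)."""
    return min(paths, key=score)


candidates = [
    PathMetrics("I have an apple", 12.0, 1.5, 0.9, 0, 0.95, 1.0),
    PathMetrics("I half an appeal", 40.0, 1.0, 0.6, 1, 0.70, 0.0),
]
print(best_path(candidates).text)  # I have an apple
```

Since raw perplexity can be much larger than the other terms, a practical implementation would probably normalize the indicators to comparable ranges before weighting them; that normalization is omitted here.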
[0080] To better illustrate the process of obtaining the target error-corrected text (optimal corrected sentence), embodiments of this disclosure provide the example diagram shown in Figure 4. As can be seen from Figure 4, based on vocabulary matching, this embodiment of the disclosure can determine whether each word in the text to be corrected is in the vocabulary. If all words are in the vocabulary, the correction ends. Otherwise, this embodiment of the disclosure can call a candidate word classification model to generate a candidate word list, and can dynamically split words that are not included in the vocabulary.
[0081] During this process, if it is determined that the split word is in the vocabulary, this embodiment of the disclosure can add the split sub-word to the candidate word list. Alternatively, if it is determined that the split word is not in the vocabulary, this embodiment of the disclosure can retain the original candidate word list. Based on this, all possible sentence correction paths are calculated, and an optimal path addressing algorithm is used to select the optimal sentence to output the optimal corrected sentence (target error-corrected text).
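The flow described above (vocabulary matching, candidate generation, and dynamic splitting) can be sketched as follows. This is a minimal illustration under stated assumptions: splits are limited to two sub-words, the first two valid splits are kept, and `model_candidates` is a hypothetical callable standing in for the candidate word classification model. None of these details is specified by the disclosure.

```python
def find_oov(tokens, vocab):
    """Return the indices of words that are not in the vocabulary."""
    return [i for i, t in enumerate(tokens) if t.lower() not in vocab]


def dynamic_splits(word, vocab, max_splits=2):
    """Return up to max_splits two-word splits whose parts are both in the vocabulary."""
    splits = []
    for cut in range(1, len(word)):
        left, right = word[:cut], word[cut:]
        if left.lower() in vocab and right.lower() in vocab:
            splits.append(f"{left} {right}")
            if len(splits) == max_splits:
                break
    return splits


def build_candidates(tokens, vocab, model_candidates):
    """Merge model candidates with valid splits for each out-of-vocabulary word.

    An empty result means every word is in the vocabulary and correction ends.
    """
    result = {}
    for i in find_oov(tokens, vocab):
        candidates = list(model_candidates(tokens, i))
        # Valid splits are appended; if there are none, the list is left unchanged.
        candidates.extend(dynamic_splits(tokens[i], vocab))
        result[i] = candidates
    return result


vocab = {"i", "like", "ice", "cream", "a", "lot"}
tokens = ["I", "like", "icecream"]
# Stub standing in for the fine-tuned model (hypothetical output).
stub_model = lambda toks, idx: ["cream"]
print(build_candidates(tokens, vocab, stub_model))  # {2: ['cream', 'ice cream']}
```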
[0082] As a specific implementation, when performing text correction, this embodiment of the disclosure can take the sentence to be corrected as input, such as an original English sentence, and then obtain the positions of the erroneous words in the sentence. The English vocabulary list used for this purpose can be self-built or obtained from the network, so that the positions and related information of the erroneous English words in the sentence are obtained in advance. Based on this, a candidate path generation operation is performed: the candidate word classification model obtained through model fine-tuning performs inference at the position of each erroneous word to obtain k candidate English words (the candidate word list), where k is a selectable parameter.
[0083] For example, if there are a total of n erroneous words in a sentence, then (k+2)^n candidate paths (corrected texts) can be obtained. Next, the embodiment of this disclosure can determine the optimal path; that is, for the multiple candidate paths generated above, the path with the optimal score (target corrected text) can be calculated based on a custom K12 spelling correction optimal path addressing algorithm.
[0084] It should be noted that after obtaining the optimal path, this embodiment of the disclosure can also perform sentence post-processing operations. That is, based on the sentence with the optimal path obtained above (the target text for error correction), this embodiment of the disclosure can further format and perform rule post-processing on the sentence, and finally return the final error correction result text. Here, formatting and rule post-processing are mainly used to make the target text for error correction more consistent with the actual situation of the sentence. For example, when the erroneous word is not the first word, its first letter can be capitalized, and the format of periods, commas, etc. in the target text for error correction can be uniformly processed.
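The post-processing step can be sketched as a few formatting rules. The disclosure does not list the exact rules, so the ones below (collapsing whitespace, normalizing the spacing around punctuation, and capitalizing the first letter of the sentence) are assumptions chosen for illustration, as is the function name `postprocess`.

```python
import re


def postprocess(sentence):
    """Apply simple formatting rules to a corrected sentence."""
    text = re.sub(r"\s+", " ", sentence).strip()            # collapse whitespace
    text = re.sub(r"\s+([.,!?;:])", r"\1", text)            # no space before punctuation
    text = re.sub(r"([.,!?;:])(?=[A-Za-z])", r"\1 ", text)  # one space after punctuation
    if text:
        text = text[0].upper() + text[1:]                   # capitalize the sentence start
    return text


print(postprocess("i  have an apple ,and a pear ."))  # I have an apple, and a pear.
```

Rules of this kind are deliberately naive; for example, abbreviations such as "e.g." would need additional handling.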
[0085] Through the above implementation methods, the embodiments of this disclosure can achieve text correction more efficiently and accurately, especially for K12 English sentence spelling correction in educational scenarios, ensuring greater accuracy and significantly improving the accuracy of English spelling correction results in low-resource scenarios. Furthermore, the optimized loss function proposed in these embodiments can be used for fine-tuning downstream task models for candidate word acquisition, and can also be extended to other vertical domains to assist in improving model performance within those domains. In addition, the above-mentioned optimal path addressing algorithm also has significant reference value for certain domains with specific requirements for output text.
[0086] This disclosure can be applied to all products in English spelling correction scenarios, such as learning machines for teachers or students. This disclosure provides more accurate automated AI English spelling correction capabilities, thus assisting in further enhancing the capabilities of English spelling correction models in various sub-scenarios. Furthermore, this disclosure also offers valuable reference for fine-tuning subjective text spelling correction models in other languages.
[0087] In this embodiment, after obtaining the text to be corrected, a candidate word list can be obtained through a candidate word classification model. Based on this, the corrected texts generated from the candidate words are filtered to obtain a more accurate target corrected text. First, the text to be corrected is obtained, and the erroneous words in the text are identified. Then, the erroneous words are input into the candidate word classification model to obtain a candidate word list, which includes multiple candidate words. The loss function of the candidate word classification model is obtained by optimizing the cross-entropy loss function of a multi-classification task. Finally, multiple corrected texts are generated from the text to be corrected according to the candidate word list, and these corrected texts are filtered to obtain the target corrected text. This effectively improves the fluency and accuracy of the overall sentence in spelling correction.
[0088] Figure 5 illustrates a text correction device according to an exemplary embodiment. As shown in Figure 5, the text correction device 500 may include a determining module 510, an input module 520, and a filtering module 530.
[0089] The determining module 510 is configured to acquire the text to be corrected and to determine the erroneous words in the text to be corrected; the input module 520 is configured to input the erroneous word into the candidate word classification model to obtain a candidate word list, the candidate word list including multiple candidate words, wherein the loss function of the candidate word classification model is obtained by optimizing the cross-entropy loss function of a multi-classification task; the filtering module 530 is configured to generate multiple corrected texts from the text to be corrected according to the candidate word list, and to filter the multiple corrected texts to obtain the target corrected text.
[0090] In some embodiments, the text correction device 500 may further include: The dataset acquisition module is configured to acquire a training dataset, which includes real graded data, publicly available data, and simulation-generated data. The training module is configured to perform supervised training on the target large model based on the training dataset to obtain the candidate word classification model. The loss function of the candidate word classification model is a custom loss function that incorporates the Top-K mechanism.
[0091] In some implementations, the custom loss function is obtained by weighted summation of the first loss function, the second loss function, and the third loss function. The first loss function is the standard cross-entropy loss, the second loss function is a function related to the Top-K mechanism, and the third loss function is a function related to a specified word reward mechanism.
[0092] In some implementations, the first loss function is the standard cross-entropy loss, calculated as L_CE = -(1/N) Σ_{i=1}^{N} Σ_{c=1}^{C} y_{i,c} log(p_{i,c}), where N is the batch size, C is the number of categories, y_{i,c} is the labeled true value, indicating whether the true class of sample i is c, and p_{i,c} is the model's prediction, that is, the probability that sample i belongs to class c. The second loss function is defined over the set of the top-5 word segments predicted by the model for the i-th sample. The third loss function is defined with respect to K12, which refers to the vocabulary content covered in the K12 education stage. The custom loss function is the weighted sum of the first, second, and third loss functions, where the three weighting coefficients sum to 1.
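The structure of the custom loss can be sketched as follows. Only the first term (standard cross-entropy) and the weighted-sum form with coefficients summing to 1 are taken from the text. The exact formulas of the Top-K term and the K12 reward term are not reproduced in this text, so `topk_miss` and `k12_penalty` below are purely hypothetical stand-ins, as are the weight values and the sample data.

```python
import math


def cross_entropy(probs, labels):
    """First loss: standard cross-entropy averaged over the batch."""
    return -sum(math.log(p[y]) for p, y in zip(probs, labels)) / len(labels)


def topk_miss(probs, labels, k=5):
    """ASSUMED Top-K term: fraction of samples whose true class is outside the top-k."""
    misses = 0
    for p, y in zip(probs, labels):
        topk = sorted(range(len(p)), key=lambda c: p[c], reverse=True)[:k]
        misses += y not in topk
    return misses / len(labels)


def k12_penalty(probs, k12_ids):
    """ASSUMED reward term: average probability mass placed outside the K12 vocabulary."""
    return sum(1.0 - sum(p[c] for c in k12_ids) for p in probs) / len(probs)


def custom_loss(probs, labels, k12_ids, weights=(0.6, 0.2, 0.2), k=5):
    """Weighted sum of the three terms; the weights must sum to 1."""
    w1, w2, w3 = weights
    if abs(w1 + w2 + w3 - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return (w1 * cross_entropy(probs, labels)
            + w2 * topk_miss(probs, labels, k)
            + w3 * k12_penalty(probs, k12_ids))


# Toy batch: 2 samples, 3 classes; classes 0 and 1 are treated as K12 words.
probs = [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]]
labels = [0, 2]
loss = custom_loss(probs, labels, k12_ids={0, 1}, k=2)
```

Note that a hard top-k indicator such as `topk_miss` is not differentiable, so an actual training setup would need a differentiable formulation of that term; the sketch only shows how the three terms combine.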
[0093] In some implementations, the filtering module 530 is further configured to obtain a comprehensive score for each of the corrected texts and to select the corrected text with the highest comprehensive score as the target error-corrected text.
[0094] In some implementations, the overall score is obtained based on at least one of the following parameters: a perplexity score, obtained by evaluating the reasonableness of the corrected text; an edit distance score, obtained by evaluating the similarity between the erroneous word and the replacement word; a confidence score of the replacement word, obtained by evaluating the reliability of the replacement word; a grammatical correctness score, obtained by evaluating the grammatical compliance of the corrected text; a semantic consistency score, obtained by evaluating the consistency between the corrected text and the semantic context; and a domain fit score, obtained by evaluating the domain vocabulary to which the replacement word belongs.
[0095] In this embodiment, after obtaining the text to be corrected, a candidate word list can be obtained through a candidate word classification model. Based on this, the corrected texts generated from the candidate words are filtered to obtain a more accurate target corrected text. First, the text to be corrected is obtained, and the erroneous words in the text are identified. Then, the erroneous words are input into the candidate word classification model to obtain a candidate word list, which includes multiple candidate words. The loss function of the candidate word classification model is obtained by optimizing the cross-entropy loss function of a multi-classification task. Finally, multiple corrected texts are generated from the text to be corrected according to the candidate word list, and these corrected texts are filtered to obtain the target corrected text. This effectively improves the fluency and accuracy of the overall sentence in spelling correction.
[0096] Regarding the apparatus in the above embodiments, the specific manner in which each module performs its operation has been described in detail in the embodiments related to the method, and will not be elaborated upon here.
[0097] Figure 6 is a block diagram illustrating an electronic device 600 according to an exemplary embodiment. As shown in Figure 6, the electronic device 600 may include a processor 601 and a memory 602. The electronic device 600 may also include one or more of a multimedia component 603, an input/output (I/O) interface 604, and a communication component 605.
[0098] The processor 601 controls the overall operation of the electronic device 600 to complete all or part of the steps in the text correction method described above. The memory 602 stores various types of data to support the operation of the electronic device 600. This data may include, for example, instructions for any application or method operating on the electronic device 600, and application-related data such as contact data, sent and received messages, pictures, audio, video, etc. The memory 602 can be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as Static Random Access Memory (SRAM), Electrically Erasable Programmable Read-Only Memory (EEPROM), Erasable Programmable Read-Only Memory (EPROM), Programmable Read-Only Memory (PROM), Read-Only Memory (ROM), magnetic storage, flash memory, magnetic disk, or optical disk. The multimedia component 603 may include a screen and audio components. The screen may be, for example, a touchscreen, and the audio component is used to output and/or input audio signals. For example, the audio component may include a microphone for receiving external audio signals. The received audio signals may be further stored in the memory 602 or transmitted via the communication component 605. The audio component also includes at least one speaker for outputting audio signals. The I/O interface 604 provides an interface between the processor 601 and other interface modules, such as a keyboard, mouse, buttons, etc. These buttons may be virtual or physical buttons. The communication component 605 is used for wired or wireless communication between the electronic device 600 and other devices. Wireless communication may include Wi-Fi, Bluetooth, Near Field Communication (NFC), 2G, 3G, or 4G, or a combination thereof; therefore, the corresponding communication component 605 may include a Wi-Fi module, a Bluetooth module, or an NFC module.
[0099] In an exemplary embodiment, the electronic device 600 may be implemented by one or more application-specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field-programmable gate arrays (FPGAs), controllers, microcontrollers, microprocessors, or other electronic components to perform the text error correction method described above.
[0100] In another exemplary embodiment, a computer-readable storage medium including program instructions is also provided, which, when executed by a processor, implement the steps of the text correction method described above. For example, the computer-readable storage medium may be the memory 602 including the program instructions described above, which may be executed by the processor 601 of the electronic device 600 to complete the text correction method described above.
[0101] In another exemplary embodiment, a computer program product is also provided, which includes a computer program executable by a processor, which, when executed by the processor, implements the steps of the text correction method described above.
[0102] In another exemplary embodiment, a computer-readable storage medium including program instructions is also provided, which, when executed by a processor, implement the steps of the text error correction method described above.
[0103] In another exemplary embodiment, a computer program product is also provided, which includes a computer program executable by a processor, which, when executed by the processor, implements the steps of the text correction method described above.
[0104] The preferred embodiments of this disclosure have been described in detail above with reference to the accompanying drawings. However, this disclosure is not limited to the specific details of the above embodiments. Within the scope of the technical concept of this disclosure, various simple modifications can be made to the technical solutions of this disclosure, and these simple modifications all fall within the protection scope of this disclosure.
[0105] It should also be noted that the various specific technical features described in the above specific embodiments can be combined in any suitable manner without contradiction. In order to avoid unnecessary repetition, this disclosure will not describe the various possible combinations separately.
[0106] Furthermore, various different embodiments of this disclosure can be combined in any way, as long as they do not violate the spirit of this disclosure, they should also be regarded as the content disclosed in this disclosure.
Claims
1. A text error correction method, characterized in that the method includes: obtaining the text to be corrected and identifying the erroneous words in the text; inputting the erroneous word into the candidate word classification model to obtain a candidate word list, which includes multiple candidate words, wherein the loss function of the candidate word classification model is obtained by optimizing the cross-entropy loss function of a multi-classification task; and generating multiple corrected texts from the text to be corrected according to the candidate word list, and filtering the multiple corrected texts to obtain the target corrected text.
2. The text correction method according to claim 1, characterized in that the method further includes: obtaining a training dataset, which includes real graded data, publicly available data, and simulation-generated data; and training the target large model in a supervised manner based on the training dataset to obtain the candidate word classification model, wherein the loss function of the candidate word classification model is a custom loss function that incorporates the Top-K mechanism.
3. The text correction method according to claim 2, characterized in that the custom loss function is obtained by weighted summation of the first loss function, the second loss function, and the third loss function, wherein the first loss function is the standard cross-entropy loss, the second loss function is a function related to the Top-K mechanism, and the third loss function is a function related to the specified word reward mechanism.
4. The text correction method according to claim 3, characterized in that the first loss function is calculated as L_CE = -(1/N) Σ_{i=1}^{N} Σ_{c=1}^{C} y_{i,c} log(p_{i,c}), where N is the batch size, C is the number of categories, y_{i,c} is the labeled true value, indicating whether the true class of sample i is c, and p_{i,c} is the model's prediction, that is, the probability that sample i belongs to class c; the second loss function is defined over the set of the top-5 word segments predicted by the model for the i-th sample; the third loss function is defined with respect to K12, which refers to the vocabulary content covered in the K12 education stage; and the custom loss function is the weighted sum of the first, second, and third loss functions, where the three weighting coefficients sum to 1.
5. The text correction method according to claim 1, characterized in that the step of filtering the plurality of corrected texts to obtain the target corrected text includes: obtaining a comprehensive score for each of the corrected texts, and selecting the corrected text with the highest comprehensive score as the target corrected text.
6. The text correction method according to claim 5, characterized in that the overall score is obtained based on at least one of the following parameters: a perplexity score, obtained by evaluating the reasonableness of the corrected text; an edit distance score, obtained by evaluating the similarity between the erroneous word and the replacement word; a confidence score of the replacement word, obtained by evaluating the reliability of the replacement word; a grammatical correctness score, obtained by evaluating the grammatical compliance of the corrected text; a semantic consistency score, obtained by evaluating the consistency between the corrected text and the semantic context; and a domain fit score, obtained by evaluating the domain vocabulary to which the replacement word belongs.
7. A text error correction device, characterized in that the device includes: a determination module configured to acquire the text to be corrected and determine the erroneous words in the text to be corrected; an input module configured to input the erroneous word into the candidate word classification model to obtain a candidate word list, the candidate word list including multiple candidate words, wherein the loss function of the candidate word classification model is obtained by optimizing the cross-entropy loss function of a multi-classification task; and a filtering module configured to generate multiple corrected texts from the text to be corrected according to the candidate word list, and to filter the multiple corrected texts to obtain the target corrected text.
8. A computer-readable storage medium having a computer program stored thereon, characterized in that, when executed by a processor, the program implements the steps of the method described in any one of claims 1-6.
9. An electronic device, characterized by comprising: a memory on which a computer program is stored; and a processor for executing the computer program in the memory to implement the steps of the method according to any one of claims 1-6.
10. A computer program product, comprising a computer program, characterized in that, when executed by a processor, the computer program implements the steps of the method described in any one of claims 1-6.