A cascaded Chinese error correction method, device and equipment
A cascaded Chinese error correction method uses hierarchical error correction that combines a pinyin perturbation mechanism and hash value comparison with an improved masked language model and a large language model. It addresses the large inference latency and uncontrollable error correction process of existing technologies, thereby improving the accuracy and efficiency of Chinese text error correction.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- GUANGZHOU POWER SUPPLY BUREAU GUANGDONG POWER GRID CO LTD
- Filing Date
- 2026-03-12
- Publication Date
- 2026-06-12
AI Technical Summary
Existing Chinese text correction methods suffer from large inference latency, an uncontrollable correction process, and large models that must process a large amount of input information, resulting in low accuracy and efficiency in actual text correction.
The extracted correct word set is perturbed using a pinyin perturbation mechanism to obtain a confusion dictionary. The confusion dictionary is then used to perform replacement-based correction of the target input text by hash value comparison, yielding the initial corrected text.
The method corrects the target input text progressively through hierarchical error correction, which reduces the amount of information the large model must process, improves inference efficiency, reduces inference latency, makes text processing more flexible and controllable, and meets the text error correction needs of different complex scenarios.
Figure CN122197869A_ABST
Description
Technical Field
[0001] This application relates to the field of large language models, and in particular to a cascaded Chinese error correction method, apparatus and device.
Background Technology
[0002] Chinese text error correction (CTEC) is a fundamental component of applications such as intelligent writing, official document proofreading, and content review. Because Chinese lacks explicit morphological changes, has a high density of homophones and similar-looking characters, and relies heavily on contextual semantics, correcting Chinese text is far more challenging than correcting text in alphabetic languages. In recent years, large language models (LLMs) have demonstrated superior contextual understanding capabilities in generative correction.
[0003] However, direct end-to-end rewriting of the entire text using large language models presents three main problems. First, high inference costs and significant latency make it difficult to meet the demands of real-time or large-scale scenarios. Second, the generation process is uncontrollable, easily introducing semantic drift and style shift. Third, insufficient processing of the original text leads to a waste of computational power for the large model. For example, existing cascaded solutions only perform preliminary filtering using rules or small models and still feed the original text back into the large model for re-understanding. This results in the large model performing a second scan of the entire text, leading to high computational consumption and an inability to constrain the scope of its modifications. Therefore, current Chinese text correction methods fall short of expectations, limiting their application.
Summary of the Invention
[0004] This application provides a cascaded Chinese error correction method, apparatus, and device to solve the technical problems in the prior art of large inference latency, an uncontrollable error correction process, and low accuracy and efficiency in actual text error correction caused by the large model having to process a large amount of input information.
[0005] In view of this, the first aspect of this application provides a cascaded Chinese error correction method, including:
[0006] The extracted correct word set is perturbed using a pinyin perturbation mechanism to obtain a confusion dictionary, which includes multiple Chinese word pairs, each consisting of a correct word and a misspelled word;
[0007] The target input text is subjected to replacement error correction processing based on hash value comparison using the confusion dictionary to obtain the initial corrected text;
[0008] An improved masked language model is used to perform confidence prediction analysis based on the initial corrected text to obtain an uncertain sub-word sequence, which includes triple information;
[0009] The target corrected text is obtained by performing secondary error correction analysis on the initial corrected text and the suspected erroneous sentences in the uncertain sub-word sequence using a pre-set large language model.
[0010] Preferably, the step of using a pinyin perturbation mechanism to perturb the extracted correct word set to obtain a confusion dictionary includes:
[0011] After extracting words of a preset word length from multiple heterogeneous corpus systems, preprocessing is performed to obtain a correct word set. The preprocessing includes deduplication and sensitive word filtering.
[0012] The pinyin of each correct word in the correct word set is obtained, and an inverted index from pinyin to Chinese characters is constructed;
[0013] Based on the inverted index, each character in the correct word set is subjected to a three-level pronunciation confusion mutation and mapped back to the corresponding characters to obtain single-character misspelled words;
[0014] Two characters are arbitrarily selected from a word in the correct word set and their corresponding pinyin is obtained. The pinyin is then combined using a Cartesian product method and mapped back to the corresponding characters to obtain two-character misspelled words.
[0015] A confusion dictionary is generated by combining the correct word set, the single-character misspelled words, and the two-character misspelled words.
[0016] Preferably, the step of performing replacement error correction processing on the target input text based on hash value comparison using the confusion dictionary to obtain the initial corrected text includes:
[0017] All misspelled words in the confusion dictionary are grouped by length, and the rolling hash value of each misspelled word in each group is calculated to construct a three-level inverted index of length - rolling hash value - word form;
[0018] After the target input text is split into independent sentences, a preset sliding window is moved over each independent sentence and the current hash value of the text within the window is calculated.
[0019] The current hash value is compared with the rolling hash value. If they match, correct word replacement is performed based on the three-level inverted index to obtain the first corrected text.
[0020] If there is no match, the suspected error information is recorded, including the location of the misspelled word, sentence-level offset, paragraph ID, and document ID;
[0021] The initial corrected text is generated by combining the suspected error information and the first corrected text.
[0022] Preferably, the step of using an improved masked language model to perform confidence prediction analysis based on the initial corrected text to obtain an uncertain sub-word sequence includes:
[0023] The initial corrected text is scanned character by character using a preset Chinese word segmenter to obtain an initial sub-word sequence, which includes multiple consecutive characters.
[0024] The consecutive characters are segmented into sub-word segments according to the encoding vocabulary, and each sub-word segment is mapped to an integer token ID to obtain a token sequence;
[0025] The token sequence is input into the improved masked language model to predict the confidence level;
[0026] If the confidence level is lower than the confidence threshold, it is determined that there is a potential error in the current token sequence. The triple information corresponding to the current token sequence is recorded, and an uncertain sub-word sequence is generated. The triple information includes the starting offset, sequence length, and confidence level.
[0027] Preferably, the step of performing secondary error correction analysis on the initial error-corrected text and the suspected erroneous sentences in the uncertain sub-word sequence using a pre-set large language model to obtain the target error-corrected text includes:
[0028] Suspected erroneous sentences are identified based on the initial error-corrected text and the triple information in the uncertain sub-word sequence;
[0029] The suspected erroneous sentence is segmented using a character batching mechanism to obtain batched characters.
[0030] The batched characters are input into a pre-set large language model for secondary error correction analysis to obtain the target error-corrected text.
[0031] A second aspect of this application provides a cascaded Chinese error correction apparatus, comprising:
[0032] The perturbation processing unit is used to perturb the extracted correct word set using a pinyin perturbation mechanism to obtain a confusion dictionary, which includes multiple Chinese word pairs, each consisting of a correct word and a misspelled word.
[0033] The initial error correction unit is used to perform replacement error correction processing on the target input text based on hash value comparison using the confusion dictionary to obtain the initial corrected text;
[0034] The model prediction unit is used to perform confidence prediction analysis based on the initial corrected text using an improved masked language model to obtain an uncertain sub-word sequence, wherein the uncertain sub-word sequence includes triple information;
[0035] The secondary error correction unit is used to perform secondary error correction analysis on the initial corrected text and the suspected erroneous sentences in the uncertain sub-word sequence using a pre-set large language model to obtain the target corrected text.
[0036] Preferably, the perturbation processing unit is specifically used for:
[0037] After extracting words of a preset word length from multiple heterogeneous corpus systems, preprocessing is performed to obtain a correct word set. The preprocessing includes deduplication and sensitive word filtering.
[0038] The pinyin of each correct word in the correct word set is obtained, and an inverted index from pinyin to Chinese characters is constructed;
[0039] Based on the inverted index, each character in the correct word set is subjected to a three-level pronunciation confusion mutation and mapped back to the corresponding characters to obtain single-character misspelled words;
[0040] Two characters are arbitrarily selected from a word in the correct word set and their corresponding pinyin is obtained. The pinyin is then combined using a Cartesian product method and mapped back to the corresponding characters to obtain two-character misspelled words.
[0041] A confusion dictionary is generated by combining the correct word set, the single-character misspelled words, and the two-character misspelled words.
[0042] Preferably, the initial error correction unit is specifically used for:
[0043] All misspelled words in the confusion dictionary are grouped by length, and the rolling hash value of each misspelled word in each group is calculated to construct a three-level inverted index of length - rolling hash value - word form;
[0044] After the target input text is split into independent sentences, a preset sliding window is moved over each independent sentence and the current hash value of the text within the window is calculated.
[0045] The current hash value is compared with the rolling hash value. If they match, correct word replacement is performed based on the three-level inverted index to obtain the first corrected text.
[0046] If there is no match, the suspected error information is recorded, including the location of the misspelled word, sentence-level offset, paragraph ID, and document ID;
[0047] The initial corrected text is generated by combining the suspected error information and the first corrected text.
[0048] Preferably, the model prediction unit is specifically used for:
[0049] The initial corrected text is scanned character by character using a preset Chinese word segmenter to obtain an initial sub-word sequence, which includes multiple consecutive characters.
[0050] The consecutive characters are segmented into sub-word segments according to the encoding vocabulary, and each sub-word segment is mapped to an integer token ID to obtain a token sequence;
[0051] The token sequence is input into the improved masked language model to predict the confidence level;
[0052] If the confidence level is lower than the confidence threshold, it is determined that there is a potential error in the current token sequence. The triple information corresponding to the current token sequence is recorded, and an uncertain sub-word sequence is generated. The triple information includes the starting offset, sequence length, and confidence level.
[0053] A third aspect of this application provides a cascaded Chinese error correction device, the device including a processor and a memory;
[0054] The memory is used to store program code and transmit the program code to the processor;
[0055] The processor is used to execute the cascaded Chinese error correction method described in the first aspect according to the instructions in the program code.
[0056] As can be seen from the above technical solutions, the embodiments of this application have the following advantages:
[0057] This application provides a cascaded Chinese error correction method, comprising: using a pinyin perturbation mechanism to perturb the extracted correct word set to obtain a confusion dictionary, the confusion dictionary including multiple Chinese word pairs, each consisting of a correct word and a misspelled word; using the confusion dictionary to perform replacement error correction processing on the target input text based on hash value comparison to obtain an initial corrected text; using an improved masked language model to perform confidence prediction analysis based on the initial corrected text to obtain an uncertain sub-word sequence, the uncertain sub-word sequence including triple information; and using a pre-set large language model to perform secondary error correction analysis on the initial corrected text and the suspected erroneous sentences in the uncertain sub-word sequence to obtain the target corrected text.
[0058] The cascaded Chinese error correction method provided in this application progressively corrects the target input text through hierarchical error correction, rather than directly inputting the original text into a large model for overall error correction. This minimizes the information processing load of the large model, thereby improving inference efficiency and reducing inference latency. Furthermore, the hierarchical process allows the text processing at each stage, such as initial error correction and confidence prediction based on a masked language model, to be clearly controlled, making text processing more flexible and controllable and better meeting the text correction needs of various complex scenarios. Therefore, this application solves the technical problems of existing technologies, such as large inference latency, an uncontrollable error correction process, and the need for large models to process a large amount of information, which result in low accuracy and efficiency in actual text correction.
Attached Figure Description
[0059] Figure 1 is a flowchart of a cascaded Chinese error correction method provided in an embodiment of this application;
[0060] Figure 2 is a schematic diagram of a cascaded Chinese error correction apparatus provided in an embodiment of this application;
[0061] Figure 3 is an example diagram of the uncertainty triple prediction process using a masked language model, provided in an embodiment of this application.
Detailed Implementation
[0062] To enable those skilled in the art to better understand the present application, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present application, and not all embodiments. Based on the embodiments in the present application, all other embodiments obtained by those of ordinary skill in the art without creative effort are within the scope of protection of the present application.
[0063] For ease of understanding, please refer to Figure 1. An embodiment of a cascaded Chinese error correction method provided in this application includes:
[0064] Step 101: Use a pinyin perturbation mechanism to perturb the extracted correct word set to obtain a confusion dictionary. The confusion dictionary includes multiple Chinese word pairs, each consisting of a correct word and a misspelled word.
[0065] It should be noted that, in order to establish a Chinese error mapping resource with linguistic consistency and engineering scalability, this embodiment uses a word-driven, pinyin-based, and rule-constrained approach to automatically generate confusion word pairs. This automatic generation does not rely on manual enumeration, making it more flexible and comprehensive. Each word pair includes a correct word and its corresponding misspelled word. Since misspelled words are obtained by applying different perturbations to correct words, one correct word may correspond to multiple misspelled words. In addition, for ease of analysis, the correct words in the correct word set of this embodiment are all two-character words or four-character idioms.
[0066] The pinyin perturbation mechanism in this embodiment is based on perturbation rules defined over pinyin. It replaces characters of the words in the correct word set, randomly or by rule, to generate new words, i.e., misspelled words. Associating each misspelled word with its correct word creates a correct-misspelled Chinese word pair. It is understood that pinyin comprises initials, finals, and whole syllables, which can be combined and arranged in various ways to generate new misspelled words. The same pronunciation can correspond to different Chinese words, and different combinations of initials and finals also yield different words; the specific combinations are not limited here.
[0067] Further, step 101 includes:
[0068] After extracting words of a preset word length from multiple heterogeneous corpus systems, preprocessing is performed to obtain a correct word set. The preprocessing includes deduplication and sensitive word filtering.
[0069] Obtain the pinyin of each correct word in the correct word set, and construct an inverted index of pinyin-Chinese characters;
[0070] According to the inverted index, perform three-level pronunciation confusion mutation on each character in the correct word set, and map it back to the corresponding character to obtain single-character misspelled words;
[0071] Arbitrarily select two characters from the correct word set and obtain the corresponding character pinyin. Use the Cartesian product method to combine the character pinyin, and then map it back to the corresponding characters to obtain two-character misspelled words;
[0072] Combine the correct word set, single-character misspelled words and two-character misspelled words to generate a confusion word library.
[0073] It should be noted that the correct words in this embodiment are words extracted from multiple heterogeneous corpus systems and then preprocessed. The heterogeneous corpus systems include but are not limited to government bulletins, mainstream news, desensitized input method logs, and social media. The length of the extracted word forms should not be less than the length threshold, such as 2. The extracted word forms go through several preprocessing steps to form a reliable correct word set; in addition to deduplication and sensitive word filtering, other methods can also be designed to improve data quality, which is not limited here.
[0074] Each word form in the correct word set has a corresponding pinyin, which can be extracted using a pinyin tool. An inverted index is then constructed from each syllable to the Chinese characters that share that pinyin, with the pinyin serving as the index key. For example, the Chinese characters indexed under "bu" include "不" and "部". The constructed inverted index can be used as the perturbation source in the subsequent confusion operations.
[0075] The three-level pronunciation confusion in this embodiment refers to introducing a three-level confusion mechanism of "initial-final-syllable". In the initial dimension, according to the results of dialect phonetic surveys, establish common pronunciation deviation tables such as z-zh, c-ch, s-sh, n-l, etc.; in the final dimension, cover the confusion of front and back nasal sounds and the phenomenon of final loss in Mandarin. The confusion of front and back nasal sounds can be en-eng, in-ing, and the final loss can be ian-ia; in the syllable dimension, directly include dialect homophone mappings, such as shí-sí.
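As an illustration of the three-level mechanism described above, the following Python sketch perturbs a single toneless pinyin syllable. The tables contain only the pairs named in the text; the function names (`split_syllable`, `perturb`), the toneless spelling, and the treatment of y and w as initials are assumptions made for this example and are not taken from the embodiment.

```python
# Confusion tables limited to the pairs mentioned in the text (toneless pinyin).
INITIAL_PAIRS = [("zh", "z"), ("ch", "c"), ("sh", "s"), ("n", "l")]
FINAL_PAIRS = [("eng", "en"), ("ing", "in"), ("ian", "ia")]
SYLLABLE_PAIRS = [("shi", "si")]

# Two-letter initials come first so that "zh" is tried before "z".
INITIALS = ["zh", "ch", "sh", "b", "p", "m", "f", "d", "t", "n", "l",
            "g", "k", "h", "j", "q", "x", "r", "z", "c", "s", "y", "w"]


def split_syllable(syllable):
    """Split a toneless syllable into (initial, final)."""
    for initial in INITIALS:
        if syllable.startswith(initial):
            return initial, syllable[len(initial):]
    return "", syllable  # zero-initial syllable such as "en"


def _swap(value, pairs):
    """Return the partners of `value` in a list of symmetric pairs."""
    partners = set()
    for a, b in pairs:
        if value == a:
            partners.add(b)
        if value == b:
            partners.add(a)
    return partners


def perturb(syllable):
    """Generate confusable syllables at the initial, final, and syllable level."""
    initial, final = split_syllable(syllable)
    variants = set()
    for alt in _swap(initial, INITIAL_PAIRS):
        variants.add(alt + final)
    for alt in _swap(final, FINAL_PAIRS):
        variants.add(initial + alt)
    variants |= _swap(syllable, SYLLABLE_PAIRS)
    variants.discard(syllable)
    return variants
```

Each perturbed syllable would then be looked up in the pinyin-to-character inverted index to obtain the replacement characters; syllables with no entry in the index are simply dropped.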
[0076] For any target word in the correct word set, either single-character replacement based on three-level pronunciation confusion mutation or two-character replacement based on Cartesian product combination can be performed. Eventually, new misspelled words can be formed and associated with the correct words to constitute a confusion word library.
[0077] Single-character replacement means that, while the other characters are kept unchanged, the pinyin of the current character is mutated by rule according to the three-level "initial-final-syllable" confusion mechanism and then mapped back to a corresponding character. Two-character replacement selects a character at a position adjacent or non-adjacent to the current character, combines the pinyin of the two characters by Cartesian product, and maps the result back to characters. For example, for the idiom 一言不发 ("not saying a word"), the character 发 (fa), which is non-adjacent to 一 (yi), is selected for replacement.
[0078] It should be noted that, in order to control the number of candidate misspelled words and keep them plausibly similar to the original, this embodiment can also introduce a hard constraint on the edit distance, for example, an edit distance not exceeding 2. A correct-word reverse filtering mechanism can also be added: if a generated misspelled word already exists in the correct word set, it is regarded as an invalid misspelled word and is discarded.
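The two-character replacement and the filtering constraints can be sketched as follows. This is a minimal illustration under stated assumptions: the toy tables `CHAR_PINYIN` and `CORRECT_WORDS`, the helper names, and the use of a position-wise character distance as the "edit distance" are all hypothetical, and the candidate pools contain only exact homophones (perturbed syllables could be added to the pools in the same way).

```python
from itertools import product

# Hypothetical toy data; a real system would derive pinyin with a pinyin tool.
CHAR_PINYIN = {"不": "bu", "部": "bu", "步": "bu",
               "署": "shu", "数": "shu", "树": "shu"}
CORRECT_WORDS = {"部署", "步数"}


def build_inverted_index(char_pinyin):
    """Map each pinyin syllable to the characters that share it."""
    index = {}
    for char, syllable in char_pinyin.items():
        index.setdefault(syllable, []).append(char)
    return index


def char_distance(a, b):
    """Number of positions at which two equal-length words differ."""
    return sum(x != y for x, y in zip(a, b))


def two_char_candidates(word, index, char_pinyin, correct_words, max_distance=2):
    """Cartesian product over same-pinyin characters, followed by filtering."""
    pools = [index[char_pinyin[char]] for char in word]
    candidates = []
    for combo in product(*pools):
        candidate = "".join(combo)
        if candidate == word:
            continue  # identical to the correct word
        if candidate in correct_words:
            continue  # reverse filter: a valid word is not a misspelling
        if char_distance(candidate, word) > max_distance:
            continue  # hard constraint on the distance
        candidates.append(candidate)
    return candidates


index = build_inverted_index(CHAR_PINYIN)
misspelled = two_char_candidates("部署", index, CHAR_PINYIN, CORRECT_WORDS)
```

For a two-character word the distance constraint is trivially satisfied; it only prunes candidates for longer words such as four-character idioms.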
[0079] The misspelled words corresponding to each correct word in the confusion dictionary can be sorted by shape similarity and occurrence frequency, and the top several entries are retained and stored in JSON format to form a correct-word to misspelled-word mapping resource. The method for constructing the confusion dictionary in this embodiment can be iteratively updated with new words, new dialect accents, and new domain corpora without manual annotation, so that the confusion dictionary grows by itself and provides more complete and less redundant prior knowledge for subsequent text error correction.
[0080] Step 102: Perform replacement error correction processing on the target input text through the confusion dictionary based on hash value comparison to obtain the initial error correction text.
[0081] It can be understood that the target input text first needs basic preprocessing, such as segmentation, to obtain word forms comparable to those in the confusion dictionary so that they can be compared. Hash-based comparison means calculating the hash values of the words in the confusion dictionary and of the words in the target input text and then comparing them. If the values are the same, the two are the same misspelled word, and the initial error correction is completed by directly replacing the misspelled word in the target input text with the corresponding correct word. Of course, some words will still have hash values that cannot be matched, and they are not necessarily correct; they can be judged according to the information recorded in the confusion dictionary, retained, and subjected to subsequent error correction analysis. After the initial error correction is completed, the result is still in text form and serves as prior knowledge for subsequent error correction analysis. Such a hierarchical, progressive error correction mode reduces the error correction load on the subsequent large model, speeds up error correction inference, and improves error correction efficiency.
[0082] Further, step 102 includes:
[0083] Group all the misspelled words in the confusion dictionary according to their lengths, calculate the rolling hash values of each group of misspelled words, and construct a three-level inverted index of length - rolling hash value - word form;
[0084] After splitting the target input text into independent sentences, the current hash value of each independent sentence within the window is calculated using a preset sliding window.
[0085] The current hash value is compared with the rolling hash value. If they match, correct word replacement is performed based on the three-level inverted index to obtain the first corrected text.
[0086] If there is no match, the suspected error information is recorded. The suspected error information includes the location of the misspelled word, sentence-level offset, paragraph ID, and document ID.
[0087] The initial error correction text is generated by combining the suspected error information and the first error correction text.
[0088] It should be noted that replacement correction is the initial correction in this embodiment and is based on the confusion dictionary and hash values. Specifically, the information recorded in the correct-misspelled Chinese word pairs of the confusion dictionary is first obtained, including the misspelled word form, the corresponding correct word form, and manually labeled tags. Tag 0 denotes a definite error, and tag 1 denotes a suspected error. For definite errors, replacement correction is performed directly, while for suspected errors, the specific suspected error information is recorded for subsequent error correction analysis.
[0089] First, all misspelled words in the confusion dictionary are grouped by length, and the rolling hash value of each misspelled word in each group is calculated. The base of the rolling hash is 131, which spreads the characters over a 64-bit space and reduces the probability that different substrings collide. From the calculated rolling hash values, a three-level inverted index of "length - rolling hash value - word form" is constructed. So that a given hash value involves only a constant number of string comparisons, this embodiment keeps the matching complexity at the O(n+m) level, where n is the text length and m is the total length of the words in the dictionary.
[0090] The target input text can first be split into sentence units using a sentence segmentation module. For each sentence, substring hash values are calculated using a preset sliding window of length L, yielding the current hash value. This current hash value is then compared with the rolling hash values in the index. If a match is found, correct word replacement is performed based on the three-level inverted index, which completes the first text correction and produces the first corrected text. This embodiment detects all dictionary errors in a single scan, eliminating the need for word-by-word traversal.
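The index construction and the sliding-window scan can be sketched in Python as below. This is an illustrative sketch, not the embodiment's implementation: the function names, the toy dictionary `pairs`, and the policy of skipping overlapping matches are assumptions. The base 131 and the 64-bit space follow the text; the hash is a plain polynomial hash reduced modulo 2^64.

```python
BASE = 131
MASK = (1 << 64) - 1  # keep hash values in a 64-bit space


def poly_hash(text):
    """Polynomial hash of a string with base 131, modulo 2**64."""
    h = 0
    for char in text:
        h = (h * BASE + ord(char)) & MASK
    return h


def build_index(confusion_pairs):
    """Build {length: {hash: {misspelled: correct}}} from misspelled->correct pairs."""
    index = {}
    for wrong, right in confusion_pairs.items():
        index.setdefault(len(wrong), {}).setdefault(poly_hash(wrong), {})[wrong] = right
    return index


def find_matches(sentence, index):
    """Slide one window per word length and return (start, length, correct) hits."""
    matches = []
    n = len(sentence)
    for length, bucket in index.items():
        if length > n:
            continue
        high = pow(BASE, length - 1, 1 << 64)  # weight of the outgoing character
        h = poly_hash(sentence[:length])
        for start in range(n - length + 1):
            if start > 0:
                # Rolling update: drop sentence[start - 1], append the new last char.
                h = ((h - ord(sentence[start - 1]) * high) * BASE
                     + ord(sentence[start + length - 1])) & MASK
            forms = bucket.get(h)
            if forms:
                window = sentence[start:start + length]
                if window in forms:  # confirm the word form to rule out collisions
                    matches.append((start, length, forms[window]))
    return sorted(matches)


def apply_matches(sentence, matches):
    """Replace matched spans left to right, skipping overlapping matches."""
    out, cursor = [], 0
    for start, length, right in matches:
        if start < cursor:
            continue
        out.append(sentence[cursor:start])
        out.append(right)
        cursor = start + length
    out.append(sentence[cursor:])
    return "".join(out)


# Hypothetical dictionary entries (misspelled -> correct).
pairs = {"布署": "部署", "一言不法": "一言不发"}
index = build_index(pairs)
sentence = "他一言不法地完成了布署。"
corrected = apply_matches(sentence, find_matches(sentence, index))
```

In this sketch the scan costs one pass per distinct word length, so it stays close to linear only when the dictionary uses a small number of lengths, as with the two-character and four-character entries described earlier.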
[0091] If a match fails, or if the dictionary tags indicate a suspected error, the suspected error information is recorded. This information includes, but is not limited to, the location of the misspelled word, sentence-level offset, paragraph ID, and document ID. The corrected text and this suspected error information can be used to generate a new text format for subsequent secondary error correction analysis, providing richer prior knowledge for the secondary error correction.
[0092] Step 103: Using an improved masked language model, perform confidence prediction analysis based on the initial error-correcting text to obtain an uncertain sub-word sequence, which includes triple information.
[0093] The improved masked language model in this embodiment is MacBERT, a variant of BERT designed around the Chinese masked language modeling task. It can produce the probability distribution of each token in the segmented text. The model has only 110M parameters and can complete batch inference at a rate of about 9000 sentences/s on a single RTX 4090 with VRAM usage of less than 3 GB. The model's computing power consumption is therefore very low, which further reduces inference time and improves inference efficiency.
[0094] Confidence indicates how likely the current word is to be correct in its context, i.e., whether the word's position in the text is reasonable and appropriate. Before the initial corrected text is input into the improved masked language model, it is segmented and transformed into a word sequence. The prediction result is therefore also a word sequence, with triple information containing the confidence added.
[0095] This embodiment uses the improved masked language model for prediction. Words with low confidence scores can be judged as suspected errors, while words with high confidence scores can be judged as correct. The suspected errors are handed over to the subsequent large model for error correction. This makes full use of the prior knowledge in the text, reduces the reasoning difficulty of the large model, and provides the large model with more accurate and reliable soft hints.
[0096] Further, step 103 includes:
[0097] The initial error-correcting text is scanned character by character using a preset Chinese word segmenter to obtain an initial sub-word sequence, which includes multiple consecutive characters.
[0098] Based on the encoding vocabulary, the consecutive characters are segmented into sub-word fragments, and each sub-word fragment is mapped to an integer token ID to obtain a token sequence;
[0099] Input the token sequence into the improved masked language model to predict confidence;
[0100] If the confidence level is lower than the confidence level threshold, it is determined that there is a potential error in the current token sequence. The triple information corresponding to the current token sequence is recorded, and an uncertain sub-word sequence is generated. The triple information includes the starting offset, sequence length, and confidence level.
[0101] Specifically, in this embodiment, a Chinese word segmenter is first used to scan the initial corrected text character by character to obtain an initial sub-word sequence, which includes multiple consecutive characters. These consecutive characters are segmented and encoded using the encoding vocabulary, so that they are converted into integer token IDs, which form the token sequence. It should be noted that obtaining the token sequence neither changes the order of the original sentence nor truncates it.
[0102] Then, the sentence corresponding to the current token sequence is input into the improved masked language model. The model first determines the sentence length $N$ and then outputs, for the $i$-th token, a logit vector $z_i \in \mathbb{R}^{V}$, where $V$ represents the vocabulary size. The prediction probability is obtained through the Softmax transformation:
[0103] $$p_i(v) = \frac{\exp(z_{i,v})}{\sum_{v'=1}^{V} \exp(z_{i,v'})}, \quad i = 1, \dots, N$$
[0104] Wherein, if the ID of the current token is $t_i$, then its prediction confidence can be expressed as $c_i = p_i(t_i)$. If $c_i < \tau$ and […], where $\tau$ is the set confidence threshold, then the current token sequence is considered to contain a suspected error. The confidence threshold can be set to 0.7. In this case, the triple information of the current token sequence is recorded, including the suspected-error start offset, the sequence length, and the confidence level. For a detailed example of the prediction process, please refer to Figure 3. The triple information and the token sequence together form the uncertain sub-word sequence.
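The confidence test can be sketched in plain Python as follows. The logits are passed in as lists so that the example stays independent of any model library; the function names, the argument layout, and the `(start, length)` offsets per token are assumptions for illustration, and only the threshold comparison is applied here.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one logit vector."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def uncertain_triples(logits_per_token, token_ids, offsets, threshold=0.7):
    """Return (start_offset, length, confidence) for low-confidence tokens.

    logits_per_token: one logit vector of size V per token
    token_ids:        the observed token ID at each position
    offsets:          (start, length) of each token in characters
    """
    triples = []
    for logits, token_id, (start, length) in zip(logits_per_token, token_ids, offsets):
        confidence = softmax(logits)[token_id]  # probability of the observed token
        if confidence < threshold:
            triples.append((start, length, confidence))
    return triples


# Toy example with a vocabulary of size 3 (hypothetical values).
logits = [[5.0, 0.0, 0.0], [1.0, 1.0, 1.0]]
triples = uncertain_triples(logits, token_ids=[0, 1], offsets=[(0, 1), (1, 2)])
```

In the toy example the first token is kept as correct because its probability is well above 0.7, while the second token has a uniform distribution and is recorded as a suspected error.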
[0105] It should be understood that the triple information serves only as an uncertainty signal passed to the subsequent large model for error correction analysis; no replacement correction is made at this stage, which avoids introducing model noise too early. This process achieves high-recall screening of context-sensitive typos without introducing any fine-tuning parameters, while keeping the average per-document latency low (below 5 ms in experiments), which fully meets the cascaded framework's requirement for a low-cost prior.
[0106] Furthermore, in the actual implementation, to ensure that the alignment error between the sub-word fragments and the original Chinese character sequence is 0, this embodiment can adopt a character-by-character length accumulation strategy. Let the character length of the $k$-th sub-word be $l_k$, and dynamically maintain a pointer $e_k = e_{k-1} + l_k$ with $e_0 = 0$, where $e_k$ is the offset of the end of the $k$-th sub-word within the whole sentence. The pointer always points to the position immediately after the last character of the current sub-word, so the interval $[e_k - l_k,\ e_k)$ exactly covers the original Chinese character segment corresponding to that sub-word.
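The length accumulation strategy can be illustrated with a short sketch. The function name `subword_spans` is hypothetical, and the sketch assumes that each sub-word is available as its raw surface string (any tokenizer-specific continuation markers would need to be stripped first, which is not covered here).

```python
from typing import List, Tuple


def subword_spans(subwords: List[str]) -> List[Tuple[int, int]]:
    """Map each sub-word to its half-open character interval in the sentence.

    A running end pointer is advanced by the character length of each
    sub-word, so it always sits just past the last character of the
    current sub-word.
    """
    spans = []
    end = 0  # end pointer before the first sub-word
    for piece in subwords:
        length = len(piece)  # character length of this sub-word
        end += length
        spans.append((end - length, end))
    return spans
```

For example, with the sentence "电力系统运行" split into "电力", "系统", "运", "行", slicing the sentence with each returned interval gives back exactly the corresponding sub-word.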
[0107] Step 104: Perform secondary error correction analysis on the initial error-corrected text and suspected erroneous sentences in the uncertain sub-word sequence using a pre-set large language model to obtain the target error-corrected text.
[0108] It should be noted that the initial error-corrected text is the text after the first correction pass and may still contain suspected erroneous sentences. The uncertain sub-word sequence is the uncertainty information obtained from the confidence prediction analysis of the masked language model; it marks the suspected erroneous spans in the text. Based on the results of these layers of processing, the suspected erroneous sentences can be located and used as the specific input to the large language model. In other words, after this filtering, the amount of information actually entering the large model's error correction process is significantly reduced, which limits the model's resource usage.
[0109] By performing secondary error correction using a pre-built large language model, a cascaded error correction architecture is formed with primary error correction and confidence analysis. This results in greater controllability and efficiency in the error correction process, while effectively reducing the inference pressure on the large model, ensuring a more efficient and reliable overall error correction process.
[0110] Further, step 104 includes:
[0111] Identify potentially erroneous sentences based on triple information in the initial error-corrected text and the uncertain sub-word sequence;
[0112] A character batching mechanism is used to segment suspected erroneous sentences into batches of characters.
[0113] The characters are input in batches into a pre-set large language model for secondary error correction analysis to obtain the target error-corrected text.
[0114] To further reduce the actual number of calls to the large model and to avoid context loss due to long text truncation, this embodiment employs a character batching mechanism to segment suspected erroneous sentences. The sentence length is dynamically accumulated, and when the accumulated characters exceed a character threshold, the sentences are batched. The character threshold can be designed according to actual conditions; in this embodiment, it is set to 3800. This ensures that each prompt does not exceed the model's 4K context limit.
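A minimal sketch of the character batching mechanism is given below. The function name `batch_sentences` is an assumption, as is the choice to close a batch just before the accumulated length would exceed the threshold, so that whole sentences are never split; the default of 3800 characters is the value given in this embodiment.

```python
from typing import List


def batch_sentences(sentences: List[str], char_threshold: int = 3800) -> List[List[str]]:
    """Group sentences into batches by accumulated character count.

    A sentence is never split: when adding it would push the running
    total past the threshold, the current batch is closed first. A single
    sentence longer than the threshold therefore forms its own batch.
    """
    batches: List[List[str]] = []
    current: List[str] = []
    count = 0
    for sentence in sentences:
        if current and count + len(sentence) > char_threshold:
            batches.append(current)
            current, count = [], 0
        current.append(sentence)
        count += len(sentence)
    if current:
        batches.append(current)
    return batches
```

Each returned batch would then be placed into one prompt, so the number of large-model calls equals the number of batches.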
[0115] In this embodiment, the prompt template for the large model is written in natural language; refer to Figure 3. Furthermore, fully greedy decoding can be used to ensure deterministic decoding and reduce the risk of random rewriting. The output error-corrected text can then be post-processed by filtering out auxiliary sequence numbers to obtain the target error-corrected text in the same order as the input.
[0116] It should be understood that if error correction fails due to format drift, the entire line can be rolled back to the original sentence to ensure system robustness. The target error-corrected text is then written back to the document according to its original structure. Additionally, the corrected parts can be highlighted to show the differences visually.
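The removal of auxiliary sequence numbers and the rollback on format drift can be sketched as follows. The numbered-line output format ("1. ...", "2. ..."), the regular expression, and the function name `align_output` are assumptions made for illustration; the actual prompt and output format of the embodiment are those of its own template.

```python
import re
from typing import Dict, List

# Assumed output format: one corrected sentence per line, prefixed by its index.
_NUMBERED_LINE = re.compile(r"^\s*(\d+)[.、)]\s*(.*)$")


def align_output(originals: List[str], llm_output: str) -> List[str]:
    """Strip sequence numbers and map corrected lines back to input order.

    If a line for some index is missing, empty, or unparseable (format
    drift), the original sentence at that index is kept unchanged.
    """
    parsed: Dict[int, str] = {}
    for line in llm_output.splitlines():
        match = _NUMBERED_LINE.match(line)
        if match:
            parsed[int(match.group(1))] = match.group(2)
    result = []
    for index, original in enumerate(originals, start=1):
        corrected = parsed.get(index)
        result.append(corrected if corrected else original)  # roll back on drift
    return result
```

Comparing each returned sentence with its original then gives the positions that could be highlighted when the text is written back.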
[0117] The cascaded Chinese error correction method provided in this application progressively corrects the target input text through hierarchical error correction, rather than directly inputting the original text into a large model for overall error correction. This minimizes the information processing load of the large model, thereby improving inference efficiency and reducing inference latency. Furthermore, the hierarchical error correction process allows for clear control of text processing at each stage, such as initial error correction and confidence prediction based on a masked language model, making text processing more flexible and controllable, and better meeting the text correction needs of various complex scenarios. Therefore, this application's embodiments can solve the technical problems of existing technologies, such as large inference latency, uncontrollable error correction process, and the need for large models to process a large amount of information, resulting in low accuracy and efficiency in actual text correction.
[0118] For ease of understanding, refer to Figure 2. This application provides an embodiment of a cascaded Chinese error correction device, comprising:
[0119] The obfuscation processing unit 201 is used to obfuscate the extracted positive word set using a pinyin perturbation mechanism to obtain an obfuscated word library, which includes multiple positive word-misspelled word pairs;
[0120] The initial error correction unit 202 is used to perform replacement error correction processing on the target input text based on hash value comparison through the obfuscated word library to obtain the initial error-corrected text;
[0121] The model prediction unit 203 is used to perform confidence prediction analysis based on the initial error-corrected text using an improved masked language model to obtain an uncertain sub-word sequence, which includes triple information.
[0122] The secondary error correction unit 204 is used to perform secondary error correction analysis on the initial error correction text and suspected erroneous sentences in the uncertain sub-word sequence through a pre-set large language model to obtain the target error correction text.
[0123] Furthermore, the obfuscation processing unit 201 is specifically used for:
[0124] After extracting words of a preset word length from multiple heterogeneous corpus systems, preprocessing is performed to obtain a positive word set. The preprocessing includes deduplication and sensitive word filtering.
[0125] Obtain the pinyin of each positive word in the positive word set and construct an inverted index of pinyin-Chinese characters;
[0126] Based on the inverted index, each character in the positive word set is subjected to a three-level pronunciation confusion variation, and then mapped back to the corresponding character to obtain single-character misspelled words;
[0127] Two characters are randomly selected from the positive word set and their corresponding pinyin is obtained. The pinyin is then combined using the Cartesian product method and mapped back to the corresponding characters to obtain two-character misspelled words.
[0128] The obfuscated word library is generated by combining the positive word set, the single-character misspelled words, and the two-character misspelled words.
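The construction of misspelled words from a pinyin-Chinese character inverted index can be illustrated with the sketch below. It is deliberately simplified and rests on several assumptions: the tiny `PINYIN` table is a hypothetical stand-in for a real pinyin resource, only exact homophones are used in place of the three-level pronunciation confusion variation, and every pair of character positions is enumerated instead of being randomly selected.

```python
from itertools import combinations, product
from typing import Dict, List, Set

# Hypothetical toy pinyin table (toneless), for illustration only.
PINYIN: Dict[str, str] = {"天": "tian", "添": "tian", "气": "qi", "汽": "qi", "器": "qi"}


def build_inverted_index(pinyin_of: Dict[str, str]) -> Dict[str, Set[str]]:
    """Inverted index: pinyin -> set of characters with that pinyin."""
    index: Dict[str, Set[str]] = {}
    for char, pinyin in pinyin_of.items():
        index.setdefault(pinyin, set()).add(char)
    return index


def homophones(char: str, pinyin_of: Dict[str, str], index: Dict[str, Set[str]]) -> List[str]:
    """Characters sharing the pinyin of `char`, excluding `char` itself."""
    pinyin = pinyin_of.get(char)
    if pinyin is None:
        return []
    return sorted(c for c in index[pinyin] if c != char)


def single_char_typos(word: str, pinyin_of: Dict[str, str], index: Dict[str, Set[str]]) -> Set[str]:
    """Replace one character at a time with a homophone."""
    typos = set()
    for i, char in enumerate(word):
        for alt in homophones(char, pinyin_of, index):
            typos.add(word[:i] + alt + word[i + 1:])
    return typos


def double_char_typos(word: str, pinyin_of: Dict[str, str], index: Dict[str, Set[str]]) -> Set[str]:
    """Replace two characters at once, using the Cartesian product of their homophones."""
    typos = set()
    for i, j in combinations(range(len(word)), 2):
        for a, b in product(homophones(word[i], pinyin_of, index),
                            homophones(word[j], pinyin_of, index)):
            typos.add(word[:i] + a + word[i + 1:j] + b + word[j + 1:])
    return typos
```

Pairing each generated misspelled word with the positive word it came from would give entries of the kind stored in the obfuscated word library.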
[0129] Furthermore, the initial error correction unit 202 is specifically used for:
[0130] Group all misspelled words in the obfuscated word library by length and calculate the rolling hash value of each group of misspelled words to construct a three-level inverted index of length-rolling hash value-word form;
[0131] After splitting the target input text into independent sentences, the current hash value of each independent sentence within the window is calculated using a preset sliding window.
[0132] The current hash value is compared with the rolling hash value. If they match, positive word replacement is performed based on the three-level inverted index to obtain the first error-corrected text.
[0133] If there is no match, the suspected error information is recorded. The suspected error information includes the location of the misspelled word, sentence-level offset, paragraph ID, and document ID.
[0134] The initial error correction text is generated by combining the suspected error information and the first error correction text.
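The hash-based matching performed by the initial error correction unit can be sketched as below. The polynomial rolling hash with `BASE` and `MOD`, the function names, and the verification of the window string after a hash match are assumptions made here; the description specifies only a rolling hash, a sliding window, and a length-rolling hash value-word form index. Recording of suspected error information is not covered by this sketch.

```python
from typing import Dict, List, Tuple

BASE = 257            # assumed hash base
MOD = (1 << 61) - 1   # assumed large prime modulus

Index = Dict[int, Dict[int, Dict[str, str]]]  # length -> hash -> {misspelled: positive}


def poly_hash(text: str) -> int:
    """Polynomial hash of a string over its code points."""
    value = 0
    for char in text:
        value = (value * BASE + ord(char)) % MOD
    return value


def build_index(pairs: Dict[str, str]) -> Index:
    """Three-level index: word length -> hash value -> word form -> positive word."""
    index: Index = {}
    for misspelled, positive in pairs.items():
        index.setdefault(len(misspelled), {}).setdefault(poly_hash(misspelled), {})[misspelled] = positive
    return index


def find_hits(sentence: str, index: Index) -> List[Tuple[int, int, str]]:
    """Slide one window per word length and return (start, length, positive_word)."""
    hits = []
    n = len(sentence)
    for length, bucket in index.items():
        if length == 0 or length > n:
            continue
        high = pow(BASE, length - 1, MOD)  # weight of the outgoing character
        value = poly_hash(sentence[:length])
        for start in range(n - length + 1):
            if start > 0:  # roll: drop the left character, append the right one
                value = (value - ord(sentence[start - 1]) * high) % MOD
                value = (value * BASE + ord(sentence[start + length - 1])) % MOD
            candidates = bucket.get(value)
            if candidates:
                positive = candidates.get(sentence[start:start + length])  # guard against collisions
                if positive is not None:
                    hits.append((start, length, positive))
    return hits


def apply_hits(sentence: str, hits: List[Tuple[int, int, str]]) -> str:
    """Apply non-overlapping replacements from left to right, longest first."""
    pieces, pos = [], 0
    for start, length, positive in sorted(hits, key=lambda h: (h[0], -h[1])):
        if start < pos:
            continue  # overlaps a replacement already applied
        pieces.append(sentence[pos:start])
        pieces.append(positive)
        pos = start + length
    pieces.append(sentence[pos:])
    return "".join(pieces)
```

Because each window hash is updated in constant time, a sentence is scanned once per distinct misspelled-word length, and the string comparison is needed only when a hash value is found in the index.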
[0135] Furthermore, the model prediction unit 203 is specifically used for:
[0136] The initial error-corrected text is scanned character by character using a preset Chinese word segmenter to obtain an initial sub-word sequence, which includes multiple consecutive characters.
[0137] Based on the encoding vocabulary, the consecutive characters are segmented into sub-word fragments, and each sub-word fragment is mapped to an integer word number to obtain a token sequence;
[0138] Input the token sequence into the improved masked language model to predict confidence;
[0139] If the confidence level is lower than the confidence level threshold, it is determined that there is a potential error in the current token sequence. The triple information corresponding to the current token sequence is recorded, and an uncertain sub-word sequence is generated. The triple information includes the starting offset, sequence length, and confidence level.
[0140] This application also provides a cascaded Chinese error correction device, which includes a processor and a memory;
[0141] The memory is used to store program code and transfer the program code to the processor;
[0142] The processor is used to execute the cascaded Chinese error correction method in the above method embodiment according to the instructions in the program code.
[0143] In the several embodiments provided in this application, it should be understood that the disclosed apparatus and methods can be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative; for instance, the division of units is only a logical functional division, and in actual implementation, there may be other division methods. For example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. Furthermore, the coupling or direct coupling or communication connection shown or discussed may be through some interfaces; the indirect coupling or communication connection between apparatuses or units may be electrical, mechanical, or other forms.
[0144] The units described as separate components may or may not be physically separate. The components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units can be selected to achieve the purpose of this embodiment according to actual needs.
[0145] Furthermore, the functional units in the various embodiments of this application can be integrated into one processing unit, or each unit can exist physically separately, or two or more units can be integrated into one unit. The integrated unit can be implemented in hardware or as a software functional unit.
[0146] If the integrated unit is implemented as a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of this application, in essence, or the part that contributes to the prior art, or all or part of the technical solution, can be embodied in the form of a software product. This computer software product is stored in a storage medium and includes several instructions for executing all or part of the steps of the methods described in the various embodiments of this application through a computer device (which may be a personal computer, server, or network device, etc.). The aforementioned storage medium includes: USB flash drives, portable hard drives, read-only memory (ROM), random access memory (RAM), magnetic disks, optical disks, and other media capable of storing program code.
[0147] The above-described embodiments are only used to illustrate the technical solutions of this application, and are not intended to limit them. Although this application has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some of the technical features. Such modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the spirit and scope of the technical solutions of the embodiments of this application.
Claims
1. A cascaded Chinese error correction method, characterized by comprising: obfuscating an extracted positive word set using a pinyin perturbation mechanism to obtain an obfuscated word library, the obfuscated word library including multiple positive word-misspelled word pairs; performing replacement error correction processing on a target input text based on hash value comparison using the obfuscated word library to obtain an initial error-corrected text; performing confidence prediction analysis based on the initial error-corrected text using an improved masked language model to obtain an uncertain sub-word sequence, the uncertain sub-word sequence including triple information; and performing secondary error correction analysis on the initial error-corrected text and suspected erroneous sentences in the uncertain sub-word sequence using a pre-set large language model to obtain a target error-corrected text.
2. The cascaded Chinese error correction method according to claim 1, characterized in that obfuscating the extracted positive word set using the pinyin perturbation mechanism to obtain the obfuscated word library includes: extracting words of a preset word length from multiple heterogeneous corpus systems and then performing preprocessing to obtain the positive word set, the preprocessing including deduplication and sensitive word filtering; obtaining the pinyin of each positive word in the positive word set and constructing an inverted index of pinyin-Chinese characters; based on the inverted index, subjecting each character in the positive word set to a three-level pronunciation confusion variation and mapping it back to the corresponding character to obtain single-character misspelled words; randomly selecting two characters from the positive word set, obtaining their corresponding pinyin, combining the pinyin using a Cartesian product method, and mapping it back to the corresponding characters to obtain two-character misspelled words; and generating the obfuscated word library by combining the positive word set, the single-character misspelled words, and the two-character misspelled words.
3. The cascaded Chinese error correction method according to claim 1, characterized in that performing replacement error correction processing on the target input text based on hash value comparison using the obfuscated word library to obtain the initial error-corrected text includes: grouping all misspelled words in the obfuscated word library by length and calculating the rolling hash value of each group of misspelled words to construct a three-level inverted index of length-rolling hash value-word form; splitting the target input text into independent sentences and then calculating, using a preset sliding window, the current hash value of each independent sentence within the window; comparing the current hash value with the rolling hash value and, if they match, performing positive word replacement based on the three-level inverted index to obtain a first error-corrected text; if they do not match, recording suspected error information, the suspected error information including the location of the misspelled word, sentence-level offset, paragraph ID, and document ID; and generating the initial error-corrected text by combining the suspected error information and the first error-corrected text.
4. The cascaded Chinese error correction method according to claim 1, characterized in that performing confidence prediction analysis based on the initial error-corrected text using the improved masked language model to obtain the uncertain sub-word sequence includes: scanning the initial error-corrected text character by character using a preset Chinese word segmenter to obtain an initial sub-word sequence, the initial sub-word sequence including multiple consecutive characters; segmenting the consecutive characters into sub-word fragments according to an encoding vocabulary and mapping each sub-word fragment to an integer word number to obtain a token sequence; inputting the token sequence into the improved masked language model to predict a confidence level; and, if the confidence level is lower than a confidence threshold, determining that there is a suspected error in the current token sequence, recording the triple information corresponding to the current token sequence, and generating the uncertain sub-word sequence, the triple information including the starting offset, sequence length, and confidence level.
5. The cascaded Chinese error correction method according to claim 1, characterized in that performing secondary error correction analysis on the initial error-corrected text and the suspected erroneous sentences in the uncertain sub-word sequence using the pre-set large language model to obtain the target error-corrected text includes: identifying the suspected erroneous sentences based on the initial error-corrected text and the triple information in the uncertain sub-word sequence; segmenting the suspected erroneous sentences using a character batching mechanism to obtain batched characters; and inputting the batched characters into the pre-set large language model for secondary error correction analysis to obtain the target error-corrected text.
6. A cascaded Chinese error correction device, characterized by comprising: an obfuscation processing unit, used to obfuscate an extracted positive word set using a pinyin perturbation mechanism to obtain an obfuscated word library, the obfuscated word library including multiple positive word-misspelled word pairs; an initial error correction unit, used to perform replacement error correction processing on a target input text based on hash value comparison using the obfuscated word library to obtain an initial error-corrected text; a model prediction unit, used to perform confidence prediction analysis based on the initial error-corrected text using an improved masked language model to obtain an uncertain sub-word sequence, the uncertain sub-word sequence including triple information; and a secondary error correction unit, used to perform secondary error correction analysis on the initial error-corrected text and suspected erroneous sentences in the uncertain sub-word sequence using a pre-set large language model to obtain a target error-corrected text.
7. The cascaded Chinese error correction device according to claim 6, characterized in that the obfuscation processing unit is specifically used for: extracting words of a preset word length from multiple heterogeneous corpus systems and then performing preprocessing to obtain the positive word set, the preprocessing including deduplication and sensitive word filtering; obtaining the pinyin of each positive word in the positive word set and constructing an inverted index of pinyin-Chinese characters; based on the inverted index, subjecting each character in the positive word set to a three-level pronunciation confusion variation and mapping it back to the corresponding character to obtain single-character misspelled words; randomly selecting two characters from the positive word set, obtaining their corresponding pinyin, combining the pinyin using a Cartesian product method, and mapping it back to the corresponding characters to obtain two-character misspelled words; and generating the obfuscated word library by combining the positive word set, the single-character misspelled words, and the two-character misspelled words.
8. The cascaded Chinese error correction device according to claim 6, characterized in that the initial error correction unit is specifically used for: grouping all misspelled words in the obfuscated word library by length and calculating the rolling hash value of each group of misspelled words to construct a three-level inverted index of length-rolling hash value-word form; splitting the target input text into independent sentences and then calculating, using a preset sliding window, the current hash value of each independent sentence within the window; comparing the current hash value with the rolling hash value and, if they match, performing positive word replacement based on the three-level inverted index to obtain a first error-corrected text; if they do not match, recording suspected error information, the suspected error information including the location of the misspelled word, sentence-level offset, paragraph ID, and document ID; and generating the initial error-corrected text by combining the suspected error information and the first error-corrected text.
9. The cascaded Chinese error correction device according to claim 6, characterized in that the model prediction unit is specifically used for: scanning the initial error-corrected text character by character using a preset Chinese word segmenter to obtain an initial sub-word sequence, the initial sub-word sequence including multiple consecutive characters; segmenting the consecutive characters into sub-word fragments according to an encoding vocabulary and mapping each sub-word fragment to an integer word number to obtain a token sequence; inputting the token sequence into the improved masked language model to predict a confidence level; and, if the confidence level is lower than a confidence threshold, determining that there is a suspected error in the current token sequence, recording the triple information corresponding to the current token sequence, and generating the uncertain sub-word sequence, the triple information including the starting offset, sequence length, and confidence level.
10. A cascaded Chinese error correction device, characterized in that the device includes a processor and a memory; the memory is used to store program code and transmit the program code to the processor; and the processor is used to execute, according to the instructions in the program code, the cascaded Chinese error correction method according to any one of claims 1-5.