Text proofreading methods and apparatus, electronic devices, storage media
By using a seq2seq model to combine the probability distribution of the current output vocabulary and the confusion set during the text proofreading process, the problem of BERT model being affected by typos is solved, and higher text proofreading accuracy is achieved.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- ZHUHAI KINGSOFT OFFICE SOFTWARE
- Filing Date
- 2021-12-17
- Publication Date
- 2026-06-30

Figure CN116362238B_ABST
Abstract
Description
Technical Field
[0001] This application relates to the field of artificial intelligence technology, such as a text proofreading method and apparatus, electronic device, and storage medium.
Background Technology
[0002] Users encounter numerous errors when inputting text via keyboard or voice, causing inconvenience. Currently, artificial intelligence is widely used in the field of Natural Language Processing (NLP). Among these, the seq2seq model, which includes a decoder and an encoder, has achieved significant results in areas such as part-of-speech tagging, semantic dependency analysis, and machine translation.
[0003] In the relevant technologies for implementing text proofreading, the semantic vector and error correction label corresponding to the word vector sequence of the input text are first obtained through the seq2seq model, and then the proofread text corresponding to the word vector sequence, semantic vector and error correction label of each input word is obtained through the seq2seq model.
[0004] In the process of implementing the embodiments of this application, at least the following problems were found in the related technology:
[0005] Existing technologies typically utilize error correction models based on Bidirectional Encoder Representations from Transformers (BERT). However, since the input text contains misspellings, these misspellings can easily interfere with the BERT-based error correction model, resulting in poor proofreading accuracy.
Summary of the Invention
[0006] To provide a basic understanding of some aspects of the disclosed embodiments, a brief summary is given below. This summary is not an extensive overview, nor is it intended to identify key or important components or to delimit the scope of protection of these embodiments; it serves as a prelude to the detailed description that follows.
[0007] This application provides a text proofreading method, apparatus, electronic device, and storage medium to improve the accuracy of text proofreading.
[0008] In some embodiments, the text proofreading method includes: during the current decoding process, obtaining a current sub-vector in the encoding vector of the input text, wherein the position of the current sub-vector in the encoding vector corresponds to the number of decoding processes completed; performing decoding processing on the current sub-vector to obtain a current output vocabulary and a probability distribution of the current output vocabulary; determining a word to be proofread in the input text, wherein the position of the word to be proofread in the input text corresponds to the number of decoding processes completed; determining a proofreading result corresponding to the word to be proofread based on the word to be proofread, the confusion set corresponding to the word to be proofread, the current output vocabulary, and the probability distribution; and determining the proofread text based on the proofreading result.
[0009] Optionally, determining the proofreading result corresponding to the word to be proofread based on the word to be proofread, the confusion set corresponding to the word to be proofread, the current output vocabulary, and the probability distribution includes: determining at least one word in the current output vocabulary as the target word set; determining the intersection of the confusion set and the target word set; if the intersection is an empty set, determining the word to be proofread as the proofreading result; if the intersection is not an empty set, determining the proofreading result based on the inclusion relationship between the intersection and the word to be proofread.
[0010] Optionally, determining the proofreading result based on the inclusion relationship between the intersection and the word to be proofread includes: if the intersection does not contain the word to be proofread, determining the word with the highest probability in the intersection as the proofreading result; if the intersection contains the word to be proofread, determining the word to be proofread as the proofreading result.
[0011] Optionally, determining at least one word in the current output vocabulary as the target word set includes: determining a set number of words in the current output vocabulary as the target word set; wherein the probability of any word in the target word set is greater than or equal to the probability of any word in the current output vocabulary that is not in the target word set.
[0012] Optionally, the step of decoding the current subvector to obtain the current output vocabulary and its probability distribution includes: in the case of the first decoding process, decoding the initial value to obtain the current output vocabulary and its probability distribution.
[0013] Optionally, the step of decoding the current subvector to obtain the current output vocabulary and the probability distribution of the current output vocabulary includes: in the case of non-first decoding processing, decoding the previous proofreading result and the current subvector to obtain the current output vocabulary and the probability distribution of the current output vocabulary, wherein the previous proofreading result is the proofreading result obtained in the previous decoding process.
[0014] Optionally, the step of decoding the previous proofreading result and the current sub-vector to obtain the current output vocabulary and its probability distribution includes: obtaining an intermediate decoding state during the previous decoding process; determining the weight of each sub-vector in the encoding vector based on the matching degree between the intermediate decoding state and each sub-vector in the encoding vector; determining a dynamic semantic encoding vector as the weighted sum of the sub-vectors in the encoding vector and their corresponding weights; and decoding the dynamic semantic encoding vector, the previous proofreading result, and the current sub-vector to obtain the current output vocabulary and its probability distribution.
[0015] Optionally, the text proofreading method further includes: after obtaining the proofreading result of the current decoding process, concatenating all obtained proofreading results to obtain the currently proofread text; and ending the decoding process if the length of the currently proofread text is the same as the length of the input text.
[0016] Optionally, the encoding vector of the input text is determined by: performing text embedding and position embedding on the input text to obtain an input vector; processing the input vector multiple times using at least one sub-encoder; and determining the output of the last sub-encoder as the encoding vector.
[0017] Optionally, determining the proofread text based on the proofreading results includes: after multiple decoding processes, sequentially concatenating the proofreading results obtained in each decoding process according to the order of the decoding processes to obtain the proofread text.
[0018] In some embodiments, the text proofreading apparatus includes an acquisition module, a decoding module, a first determination module, a second determination module, and a third determination module; the acquisition module is configured to acquire a current sub-vector in the encoding vector of the input text during the current decoding process, wherein the position of the current sub-vector in the encoding vector corresponds to the number of decoding processes completed; the decoding module is configured to perform decoding processing on the current sub-vector to obtain a current output vocabulary and a probability distribution of the current output vocabulary; the first determination module is configured to determine a word to be proofread in the input text, wherein the position of the word to be proofread in the input text corresponds to the number of decoding processes completed; the second determination module is configured to determine a proofreading result corresponding to the word to be proofread based on the word to be proofread, the confusion set corresponding to the word to be proofread, the current output vocabulary, and the probability distribution; the third determination module is configured to determine the proofread text based on the proofreading result.
[0019] Optionally, the second determining module includes a first determining unit, a second determining unit, and a third determining unit; the first determining unit is configured to determine at least one word in the current output vocabulary as the target word set; the second determining unit is configured to determine the intersection of the confusion set and the target word set; the third determining unit is configured to determine the word to be proofread as the proofreading result when the intersection is an empty set, and to determine the proofreading result based on the inclusion relationship between the intersection and the word to be proofread when the intersection is a non-empty set.
[0020] Optionally, determining the proofreading result based on the inclusion relationship between the intersection and the word to be proofread includes: if the intersection does not contain the word to be proofread, determining the word with the highest probability in the intersection as the proofreading result; if the intersection contains the word to be proofread, determining the word to be proofread as the proofreading result.
[0021] Optionally, the first determining unit is specifically configured to determine a set number of words in the current output vocabulary as the target word set, wherein the probability of any word in the target word set is greater than or equal to the probability of any word in the current output vocabulary that is not in the target word set.
[0022] Optionally, the decoding module includes a first decoding unit, which is configured to decode the initial value in the case of the first decoding process to obtain the current output vocabulary and the probability distribution of the current output vocabulary.
[0023] Optionally, the decoding module includes a second decoding unit, which is configured to decode the previous proofreading result and the current sub-vector in the case of a non-first decoding process, to obtain the current output vocabulary and the probability distribution of the current output vocabulary, wherein the previous proofreading result is the proofreading result obtained in the previous decoding process.
[0024] Optionally, the second decoding unit is specifically configured to: obtain the intermediate decoding state in the previous decoding process; determine the weight of each sub-vector in the encoding vector based on the matching degree between the intermediate decoding state and each sub-vector in the encoding vector; determine the dynamic semantic encoding vector as the weighted sum of the sub-vectors in the encoding vector and their corresponding weights; and perform decoding processing on the dynamic semantic encoding vector, the previous proofreading result, and the current sub-vector to obtain the current output vocabulary and the probability distribution of the current output vocabulary.
[0025] Optionally, the text proofreading device further includes an end module, which is configured to, after obtaining the proofreading result of the current decoding process, connect all the obtained proofreading results to obtain the currently proofread text; and end the decoding process if the length of the currently proofread text is the same as the length of the input text.
[0026] Optionally, the text proofreading apparatus further includes an encoding module configured to perform text embedding and position embedding on the input text to obtain an input vector; process the input vector multiple times using at least one sub-encoder; and determine the output of the last sub-encoder as the encoded vector.
[0027] Optionally, the third determining module is specifically configured to, after multiple decoding processes, sequentially connect the proofreading results obtained in each decoding process according to the order of the decoding process to obtain the proofread text.
[0028] In some embodiments, the electronic device includes a processor and a memory storing program instructions, the processor being configured to execute the text proofreading method provided in the foregoing embodiments when executing the program instructions.
[0029] In some embodiments, the storage medium stores program instructions that, when executed, perform the text proofreading method provided in the foregoing embodiments.
[0030] The text proofreading method, apparatus, electronic device, and storage medium provided in this application embodiment can achieve the following technical effects:
[0031] In the process of decoding the current sub-vector in the encoding vector of the input text, the current output vocabulary and its probability distribution are first obtained. Then, the current output vocabulary and its probability distribution, along with the confusion set of the words to be proofread, are combined to reduce the impact of typos in the input text on the proofreading model, and finally, a proofreading result with a relatively high accuracy is obtained. In this way, the accuracy of text proofreading is improved.
[0032] The above general description and the description below are exemplary and illustrative only and are not intended to limit this application.
Attached Figure Description
[0033] One or more embodiments are illustrated by way of example with reference to the accompanying drawings. These illustrative descriptions and drawings do not constitute a limitation on the embodiments. Elements having the same reference numerals in the drawings are considered similar elements, and wherein:
[0034] Figure 1 is a schematic diagram of a text proofreading method provided in an embodiment of this application;
[0035] Figure 2 is a schematic diagram of a text proofreading method model provided in an embodiment of this application;
[0036] Figures 3a to 3f are detailed schematic diagrams of a text proofreading method model provided in an embodiment of this application;
[0037] Figure 4 is a schematic diagram of a text proofreading device provided in an embodiment of this application;
[0038] Figure 5 is a schematic diagram of an electronic device provided in an embodiment of this application.
Detailed Implementation
[0039] To provide a more detailed understanding of the features and technical content of the embodiments of this application, the implementation of the embodiments of this application will be described in detail below with reference to the accompanying drawings. The accompanying drawings are for illustrative purposes only and are not intended to limit the embodiments of this application. In the following technical description, for ease of explanation, several details are used to provide a full understanding of the disclosed embodiments. However, one or more embodiments may still be implemented without these details. In other cases, well-known structures and devices may be simplified in their depiction to simplify the drawings.
[0040] The terms "first," "second," etc., in the specification, claims, and accompanying drawings of this application are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. It should be understood that such data can be interchanged where appropriate for the embodiments of this application described herein. Furthermore, the terms "comprising" and "having," and any variations thereof, are intended to cover non-exclusive inclusion.
[0041] Unless otherwise stated, the term "multiple" means two or more.
[0042] In this embodiment, the character " / " indicates that the objects before and after it are in an "or" relationship. For example, A / B means: A or B.
[0043] The term "and / or" describes an association between objects and indicates that three relationships can exist. For example, A and / or B means: A alone, B alone, or both A and B.
[0044] The text proofreading method provided in this application is based on a sequence-to-sequence (seq2seq) model. First, an encoder is used to encode the input text to obtain an encoded vector. Then, the encoded vector is decoded multiple times to obtain the final proofread text. The text proofreading method provided in this application details each decoding process.
[0045] Figure 1 is a schematic diagram of a text proofreading method provided in an embodiment of this application.
[0046] As shown in Figure 1, the text proofreading method includes:
[0047] S101. In the current decoding process, obtain the current sub-vector in the encoded vector of the input text.
[0048] The encoding vector of the input text can be determined as follows: text embedding and position embedding are performed on the input text to obtain the input vector; the input vector is processed multiple times using at least one sub-encoder; and the output of the last sub-encoder is determined as the encoding vector.
[0049] In the text embedding process, the input text is converted from natural language into word embedding vectors, which include the vector corresponding to each word in the input text. In the positional embedding process, the position of each word in the input text is marked to obtain positional embedding vectors, and the dimension of the positional embedding vectors is the same as that of the word embedding vectors. The sum of the word embedding vectors and the positional embedding vectors is used to determine the input vector.
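As a non-limiting illustration, the summation of word embedding vectors and positional embedding vectors can be sketched as follows. The lookup tables `WORD_EMBEDDINGS` and `POSITION_EMBEDDINGS`, their 3-dimensional toy values, and the function name `build_input_vector` are assumptions made for illustration only; in a trained model these values would be learned parameters.

```python
# Toy lookup tables (hypothetical values). Both use the same dimension (3),
# which is what makes the element-wise sum possible.
WORD_EMBEDDINGS = {
    "不": [0.1, 0.2, 0.3],
    "同": [0.4, 0.5, 0.6],
}
POSITION_EMBEDDINGS = [
    [0.01, 0.02, 0.03],  # position 0
    [0.04, 0.05, 0.06],  # position 1
]


def build_input_vector(text):
    """Return one input sub-vector per character: word embedding + position embedding."""
    input_vector = []
    for position, char in enumerate(text):
        word_vec = WORD_EMBEDDINGS[char]
        pos_vec = POSITION_EMBEDDINGS[position]
        input_vector.append([w + p for w, p in zip(word_vec, pos_vec)])
    return input_vector
```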
[0050] In this embodiment, the process of obtaining the encoded vector of the input text is referred to as the encoding process, and the module that performs the encoding process is referred to as the encoder. In this embodiment, an encoder may include multiple sub-encoders, wherein the multiple sub-encoders are connected sequentially; processing the input vector multiple times using at least one sub-encoder may include: determining the output of the previous sub-encoder as the input of the next sub-encoder; wherein the first sub-encoder encodes the input vector; and the output of the last sub-encoder is the output of the encoder, i.e., the encoded vector.
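The sequential connection of sub-encoders described above can be sketched as a simple loop. The function `run_encoder` and the two stand-in sub-encoders are hypothetical; real sub-encoders would be trained network layers rather than arithmetic placeholders.

```python
def run_encoder(input_vector, sub_encoders):
    """Feed the input vector through sequentially connected sub-encoders."""
    hidden = input_vector
    for sub_encoder in sub_encoders:
        # The output of the previous sub-encoder is the input of the next one.
        hidden = sub_encoder(hidden)
    # The output of the last sub-encoder is the encoding vector.
    return hidden


# Hypothetical stand-ins for trained sub-encoders.
def scale_by_two(vectors):
    return [[2 * v for v in vec] for vec in vectors]


def add_one(vectors):
    return [[v + 1 for v in vec] for vec in vectors]
```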
[0051] In any sub-encoder, a multi-head self-attention network and a feedforward neural network may be included. The input of the multi-head self-attention network is the input of the sub-encoder, the output of the multi-head self-attention network is the input of the feedforward neural network, and the output of the feedforward neural network is the output of the sub-encoder.
[0052] Furthermore, residual networks can be used to optimize multi-head self-attention networks and feedforward neural networks.
[0053] In some applications, BERT can be used to encode input text to obtain encoded vectors. BERT has strong representational capabilities, and relatively accurate text features can be obtained using BERT.
[0054] Of course, during the encoding process of the input text, a pre-trained BERT is used. That is, in practical applications, after building BERT, it needs to be trained using the correct training set. During training, BERT can be fine-tuned on the training set. In this way, BERT can be trained well using a smaller training set.
[0055] Encoding the input text using BERT can improve the representational power of the encoded vector and increase the recall rate.
[0056] It should be understood that the above examples of using BERT to encode input text are merely illustrative. Those skilled in the art can also use other models with encoding functions in the prior art, such as the Long Short-Term Memory (LSTM) model or the Recurrent Neural Network (RNN) model. The embodiments of this application do not specifically limit the encoding model.
[0057] The number of sub-vectors in the encoding vector corresponds to the length of the input text. For example, when the input text is "不同凡想" ("Think Different"), each character can be encoded as one sub-vector. In this case, the number of sub-vectors in the encoding vector is 4, and the 4 sub-vectors are, in order from first to last: the sub-vector corresponding to "不", the sub-vector corresponding to "同", the sub-vector corresponding to "凡", and the sub-vector corresponding to "想".
[0058] The position of the current sub-vector in the encoding vector corresponds to the number of completed decoding processes. For example, when the number of completed decoding processes is 0, the current decoding process belongs to the first decoding process. In order from first to last, the first sub-vector in the encoding vector is determined as the current sub-vector; when the number of completed decoding processes is 1, the current decoding process belongs to the second decoding process. In order from first to last, the second sub-vector in the encoding vector is determined as the current sub-vector; when the number of completed decoding processes is 2, the current decoding process belongs to the third decoding process. In order from first to last, the third sub-vector in the encoding vector is determined as the current sub-vector; and so on, which will not be elaborated here.
[0059] In this way, based on the number of completed decoding processes, the current sub-vector can be obtained from the encoding vector of the input text.
[0060] S102. Decode the current sub-vector to obtain the current output vocabulary and the probability distribution of the current output vocabulary.
[0061] Each element in the current output vocabulary here consists of one or more characters. For example, each element in the current output vocabulary can be a Chinese character, or a word (consisting of two or more Chinese characters); or each element in the current output vocabulary can be a letter (such as a single letter or a letter in a word), or a word (consisting of two or more letters), or a phrase (consisting of two or more words).
[0062] The decoder for decoding the current sub-vector can be an LSTM, an RNN, a Convolutional Neural Network (CNN) model, a Transformer model, etc.
[0063] In the process of decoding the encoded vector, the decoder used is also a trained decoder. That is, in practical applications, after building the encoder and decoder, it is necessary to train the encoder and decoder using the correct training set. Then, the trained encoder is used to encode the input text to obtain the encoded vector, and the trained decoder is used to decode the encoded vector to obtain the current output vocabulary and the probability distribution of the current output vocabulary.
[0064] During training, the cross-entropy loss function can be used to train the encoder for encoding the input text and the decoder for decoding the encoded vectors. This allows the model to converge faster during training, facilitating the rapid acquisition of a trained encoder and decoder.
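For illustration, a cross-entropy loss over the per-step probability distributions can be sketched as follows. The function names and the choice to average the per-step losses over the target text are assumptions; the embodiment only states that a cross-entropy loss function can be used.

```python
import math


def cross_entropy(predicted_probs, target_word):
    """Cross-entropy loss for one decoding step: -log p(target word)."""
    return -math.log(predicted_probs[target_word])


def sequence_loss(step_probs, target_text):
    """Average the per-step losses over the target text (an assumed reduction)."""
    losses = [cross_entropy(p, t) for p, t in zip(step_probs, target_text)]
    return sum(losses) / len(losses)
```

The loss is small when the decoder assigns a high probability to the correct word at each position, and grows as that probability approaches zero.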
[0065] The decoder provided in this application produces one proofreading result in each decoding process, and multiple proofreading results are obtained through multiple decoding processes. In the case of the first decoding process, the initial value (e.g., <sos>) and the current sub-vector can be decoded to obtain the current output vocabulary and the probability distribution of the current output vocabulary. For example, if the texts corresponding to the sub-vectors in the encoding vector are "不", "同", "凡", and "想" in sequence from first to last, then in the case of the first decoding process, the initial value <sos> and the sub-vector corresponding to "不" are decoded.
[0066] In cases other than the first decoding process, the previous proofreading result and the current sub-vector are decoded to obtain the current output vocabulary and its probability distribution. The previous proofreading result is the proofreading result obtained in the previous decoding process. For example, if the current decoding process is the second decoding process, and the previous (i.e., the first) decoding process yielded a proofreading result of "不", then the previous proofreading result "不" and the current sub-vector, i.e., the sub-vector corresponding to "同", are decoded.
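The way each decoding process consumes the previous proofreading result can be sketched as a loop. This is a minimal sketch under stated assumptions: `proofread`, `decode_step`, and `choose_result` are hypothetical names, the two callables stand in for the trained decoder and for the result-selection rule, and each proofreading result is assumed to be a single character.

```python
SOS = "<sos>"  # initial value used in place of a previous proofreading result


def proofread(input_text, encoding_vector, decode_step, choose_result):
    """Run one decoding process per position and join the proofreading results.

    decode_step and choose_result are hypothetical callables standing in for
    the trained decoder and for the result-selection rule.
    """
    results = []
    previous_result = SOS
    # Decoding ends once the proofread text is as long as the input text.
    while len(results) < len(input_text):
        completed = len(results)  # number of decoding processes completed
        current_sub_vector = encoding_vector[completed]
        word_to_proofread = input_text[completed]
        probabilities = decode_step(previous_result, current_sub_vector)
        result = choose_result(word_to_proofread, probabilities)
        results.append(result)
        previous_result = result  # fed into the next decoding process
    return "".join(results)
```

Feeding the proofreading result, rather than the raw decoder output, into the next decoding process means that later steps are conditioned on text that has already been corrected.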
[0067] The following example illustrates a text proofreading method using LSTM or RNN as the decoder. When using LSTM or RNN as the decoder, decoding the current sub-vector to obtain the current output vocabulary and its probability distribution can include: processing the previous proofreading result and the current sub-vector using LSTM or RNN to obtain the current output vocabulary and its probability distribution, where the previous proofreading result is the proofreading result obtained in the previous decoding process.
[0068] When using an RNN as the decoder, the RNN is used to process the previous proofreading result, the intermediate decoding state of the previous decoding process, and the current subvector to obtain the current output vocabulary and the probability distribution of the current output vocabulary; wherein, the previous proofreading result is obtained in the previous decoding process.
[0069] When using LSTM as the decoder, LSTM is used to process the previous proofreading result, the intermediate decoding state of the previous decoding process, and the current subvector to obtain the current output vocabulary and the probability distribution of the current output vocabulary; wherein, the previous proofreading result is obtained in the previous decoding process.
[0070] Further, the current sub-vector is decoded to obtain the current output vocabulary and its probability distribution. This may include: obtaining the intermediate decoding state in the previous decoding process; determining the weight of each sub-vector in the encoding vector based on the matching degree between the intermediate decoding state and each sub-vector in the encoding vector; determining the dynamic semantic encoding vector as the weighted sum of each sub-vector in the encoding vector and its corresponding weight; and decoding the dynamic semantic encoding vector, the previous proofreading result, and the current sub-vector to obtain the current output vocabulary and its probability distribution.
[0071] Decoders, such as RNNs or LSTMs, typically include an output layer, a hidden layer, and an input layer. The intermediate decoding state refers to the output of the hidden layer of the decoder during a single decoding process.
[0072] The degree of matching between the intermediate decoding state and each subvector in the encoding vector can be obtained using the following method:
[0073] $$e_{ij} = a(s_{i-1}, h_j)$$
[0074] where $e_{ij}$ represents the degree of matching between the intermediate decoding state in the previous decoding process and the j-th sub-vector in the encoding vector; $s_{i-1}$ represents the intermediate decoding state in the previous decoding process; $h_j$ is the j-th sub-vector in the encoding vector; $a$ is a feedforward neural network (FNN) whose parameters can be obtained through training; and $i$ is the number of decoding processes performed so far, including the current decoding process. $s_{i-1}$ and $h_j$ are the inputs of $a$, and $e_{ij}$ is the output of $a$.
[0075] In application, $s_{i-1}$ and $h_j$ are input into the trained $a$, and the output of $a$ is $e_{ij}$.
[0076] The weight of each subvector in the encoded vector can be determined as follows:
[0077] $$\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{n} \exp(e_{ik})}$$
[0078] where $\alpha_{ij}$ represents the weight of the j-th sub-vector in the encoding vector; $e_{ij}$ is the degree of matching between the intermediate decoding state in the previous decoding process and the j-th sub-vector in the encoding vector; $n$ is the total number of sub-vectors in the encoding vector; $e_{ik}$ is the degree of matching between the intermediate decoding state in the previous decoding process and the k-th sub-vector in the encoding vector; and $i$ is the number of decoding processes performed so far, including the current decoding process.
[0079] Furthermore, the dynamic semantic encoding vector can be determined in the following way:
[0080] $$c_i = \sum_{j=1}^{n} \alpha_{ij} h_j$$
[0081] where $c_i$ is the dynamic semantic encoding vector; $\alpha_{ij}$ is the weight of the j-th sub-vector in the encoding vector; $h_j$ is the j-th sub-vector in the encoding vector; $n$ is the total number of sub-vectors in the encoding vector; and $i$ is the number of decoding processes performed so far, including the current decoding process.
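The three steps above (matching degrees, weights, dynamic semantic encoding vector) can be sketched together. This sketch assumes that the weights are obtained by softmax normalization of the matching degrees, as in Bahdanau attention. The names `attention_context`, `score_fn`, and `dot_score` are hypothetical, and `dot_score` is only a stand-in for the trained network a, not the network itself.

```python
import math


def attention_context(prev_state, encoding_vector, score_fn):
    """Return (weights, context) for one decoding process.

    score_fn plays the role of the trained network `a`; it is a hypothetical
    callable mapping (s_{i-1}, h_j) to a scalar matching degree e_ij.
    """
    scores = [score_fn(prev_state, h_j) for h_j in encoding_vector]
    # Softmax over the matching degrees (shifted by the max for stability).
    max_score = max(scores)
    exps = [math.exp(e - max_score) for e in scores]
    total = sum(exps)
    weights = [x / total for x in exps]
    # Dynamic semantic encoding vector: weighted sum of the sub-vectors.
    dim = len(encoding_vector[0])
    context = [
        sum(w * h_j[d] for w, h_j in zip(weights, encoding_vector))
        for d in range(dim)
    ]
    return weights, context


def dot_score(state, sub_vector):
    """Stand-in for `a` (a dot product), used only to make the sketch runnable."""
    return sum(s * h for s, h in zip(state, sub_vector))
```

Because the weights sum to 1, the resulting vector is a convex combination of the sub-vectors and shifts toward the sub-vectors that best match the previous intermediate decoding state.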
[0082] Finally, the dynamic semantic encoding vector, the previous proofreading result, and the current subvector can be decoded using the following method:
[0083] $$p(y_i) = g(y_{i-1}, s_i, c_i, x_i)$$
[0084] $$s_i = f(y_{i-1}, c_i, s_{i-1})$$
[0085] where $y_i$ is the current output vocabulary; $p(y_i)$ is the probability distribution of the current output vocabulary; $g$ and $f$ are connection structures in the RNN or LSTM whose parameters can be obtained through training; $y_{i-1}$ is the previous proofreading result; $s_i$ is the intermediate decoding state in the current decoding process; $s_{i-1}$ is the intermediate decoding state in the previous decoding process; $c_i$ is the dynamic semantic encoding vector; $x_i$ is the current sub-vector; and $i$ is the number of decoding processes performed so far, including the current decoding process.
[0086] In application, $y_{i-1}$, $c_i$, and $s_{i-1}$ can be input into the trained $f$, and the output of $f$ is $s_i$. Then $y_{i-1}$, $s_i$, $c_i$, and $x_i$ are input into the trained $g$, and the output of $g$ is $p(y_i)$.
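The order of the two computations in one decoding process can be sketched as follows. The function name `decode_step` is hypothetical, and `f` and `g` are passed in as callables because their trained forms are not specified here; the sketch only shows that the state update precedes the probability computation.

```python
def decode_step(f, g, y_prev, s_prev, c_i, x_i):
    """One decoding process: update the state with f, then score the vocabulary with g."""
    s_i = f(y_prev, c_i, s_prev)      # s_i = f(y_{i-1}, c_i, s_{i-1})
    p_y_i = g(y_prev, s_i, c_i, x_i)  # p(y_i) = g(y_{i-1}, s_i, c_i, x_i)
    return s_i, p_y_i
```

The returned state is kept for the next decoding process, where it serves as the previous intermediate decoding state.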
[0087] In some practical applications, the Bahdanau attention model can be used to process the encoded vector, and the encoded vector processed by Bahdanau attention can be decoded. Of course, the Bahdanau attention model here is only an exemplary mode, and those skilled in the art can also use other attention models in the prior art to process the encoded vector, such as self-attention, etc., without making specific limitations here.
[0088] For the first decoding process, an initial value (e.g., <sos>) can be determined as the previous proofreading result, and another initial value (such as a null value) can be determined as the intermediate decoding state of the previous decoding process.
[0089] S103. Determine the word to be proofread in the input text.
[0090] The word to be proofread here consists of one or more characters. For example, the word to be proofread can be a Chinese character, or a word (consisting of two or more Chinese characters); or the word to be proofread can be a letter (such as a single letter or a letter in a word), or a word (consisting of two or more letters), or a phrase (consisting of two or more words).
[0091] The position of the word to be proofread in the input text corresponds to the number of completed decoding processes. For example, when the word to be proofread is a single Chinese character and the input text is "不同凡想": when the number of completed decoding processes is 0, the current decoding process is the first decoding process, and the first Chinese character "不" in the input text is determined as the word to be proofread; when the number of completed decoding processes is 1, the current decoding process is the second decoding process, and the second Chinese character "同" in the input text is determined as the word to be proofread; when the number of completed decoding processes is 2, the current decoding process is the third decoding process, and the third Chinese character "凡" in the input text is determined as the word to be proofread; and so on, which will not be elaborated here.
[0092] S104. Determine the proofreading result corresponding to the word to be proofread according to the word to be proofread, the confusion set corresponding to the word to be proofread, the current output word list, and the probability distribution.
[0093] The confusion set of the above-mentioned word to be proofread includes words with similar pronunciations and words with similar glyphs of the word to be proofread.
[0094] For example, when the word to be proofread is "想", the confusion set can be [翔享响险像厢香息乡巷相向祥项线象详箱]; when the word to be proofread is "对", the confusion set can be [队对堆隧怼都]; when the word to be proofread is "禾", the confusion set can be [合和河木].
[0095] In practical applications, the confusion set can be manually marked and stored in the database. After obtaining the word to be proofread, the confusion set of the word to be proofread can be obtained by querying the database.
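The database query described above can be pictured as a simple keyed lookup. The following is a minimal sketch only, not the storage scheme of the embodiment: the table `CONFUSION_SETS` and the helper `get_confusion_set` are hypothetical names, an in-memory dictionary stands in for the database, and the entries are the three example sets given in paragraph [0094].

```python
from typing import Dict, Set

# Hypothetical in-memory stand-in for the manually marked confusion-set database.
# The entries are the three examples from paragraph [0094].
CONFUSION_SETS: Dict[str, str] = {
    "想": "翔享响险像厢香息乡巷相向祥项线象详箱",
    "对": "队对堆隧怼都",
    "禾": "合和河木",
}


def get_confusion_set(word: str) -> Set[str]:
    """Return the confusion set stored for `word`, or an empty set if none is stored."""
    return set(CONFUSION_SETS.get(word, ""))


print(sorted(get_confusion_set("禾")))
```

Returning an empty set for a word with no stored entry is a design choice of this sketch; it leads naturally to the empty-intersection case, in which the word to be proofread is kept.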
[0096] Optionally, determining the proofreading result corresponding to the word to be proofread based on the word to be proofread, the confusion set corresponding to the word to be proofread, the current output vocabulary, and the probability distribution includes: determining at least one word in the current output vocabulary as the target word set, and determining the intersection of the confusion set and the target word set; if the intersection is an empty set, determining the word to be proofread as the proofreading result; if the intersection is not an empty set, determining the proofreading result based on the inclusion relationship between the intersection and the word to be proofread.
[0097] The proofreading result here refers to the final proofreading result obtained in the current decoding process, which corresponds to the word to be proofread.
[0098] In conventional decoder processing, after the current output vocabulary and its probability distribution are obtained, the word with the highest probability in the current output vocabulary is usually determined as the proofreading result. In this embodiment, after the current output vocabulary and its probability distribution are obtained, the decoding result (proofreading result) of this decoding process is not determined directly from the probability distribution of the current output vocabulary. Instead, the intersection of the target word set of the current output vocabulary and the confusion set of the word to be proofread is taken, and the proofreading result is determined from that intersection. That is, each decoding process takes the original input text into account when determining the proofreading result.
[0099] Such proofreading results are closer to the meaning of the input text, making the proofreading more accurate. In existing online culture, many derived terms frequently emerge, such as "lying flat" and "Buddhist youth." Using the text proofreading scheme provided in this application, semantic analysis is performed during the encoding process. Upon initial encounter with these derived terms, the decoding reveals that the current output vocabulary often does not contain them. In such cases, this scheme will identify the word to be proofread as the proofreading result. Therefore, the text proofreading method provided in this application improves the accuracy of text proofreading.
[0100] Specifically, the proofreading result is determined based on the inclusion relationship between the intersection and the word to be proofread. This includes: if the intersection does not contain the word to be proofread, the word with the highest probability in the intersection is determined as the proofreading result; if the intersection contains the word to be proofread, the word to be proofread is determined as the proofreading result. In this process of determining the proofreading result, if the intersection contains the word to be proofread, then that word is still determined as the proofreading result, which can further improve the accuracy of the text proofreading method. For example, when the word to be proofread is a homophone or a homograph intentionally used to express a specific meaning, the text proofreading scheme provided in this application embodiment can still maintain the original input text, further improving the accuracy of text proofreading.
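The three cases described above (empty intersection, intersection containing the word to be proofread, intersection not containing it) can be sketched as a single function. This is an illustrative sketch under stated assumptions, not the implementation of the embodiment: the function name `choose_result` is hypothetical, and the intersection is assumed to be passed in as a mapping from each shared word to its probability. The sample values are those of the example in paragraph [0103], written with the characters 想, 响 and 向.

```python
from typing import Dict


def choose_result(word: str, intersection: Dict[str, float]) -> str:
    """Pick the proofreading result from the intersection of the confusion set
    and the target word set (each shared word mapped to its probability)."""
    if not intersection:
        # Empty intersection: keep the word to be proofread.
        return word
    if word in intersection:
        # The word to be proofread is itself in the intersection: keep it.
        return word
    # Otherwise take the most probable word in the intersection.
    return max(intersection, key=intersection.get)


print(choose_result("想", {"响": 0.45, "向": 0.1}))                # 响
print(choose_result("想", {"响": 0.45, "向": 0.1, "想": 0.00002}))  # 想
print(choose_result("想", {}))                                      # 想
```

Checking for the word to be proofread before taking the maximum is what allows a deliberately chosen homophone to survive even when another candidate has a higher probability.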
[0101] In a specific application, determining at least one word in the current output vocabulary as the target word set may include: determining a set number of words in the current output vocabulary as the target word set, where the probability of any word in the target word set is greater than or equal to the probability of any word of the current output vocabulary that is not in the target word set. For example, in a proofreading process, the input text is "不同凡想". After the fourth decoding, the obtained current output vocabulary and its probability distribution are [我 (I): 0.0001, 你 (you): 0.000003, 想 (think): 0.00002, 响 (sound): 0.45, 向 (toward): 0.1, ..., UNK: 0.005]. The word to be proofread in the input text corresponding to the fourth decoding is "想", and the confusion set of "想" is [翔享响险像厢香息乡巷相向祥项线象详箱]. When the set number is 2, the target word set and its probability distribution are [响: 0.45, 向: 0.1], and the intersection of the confusion set and the target word set, with its probability distribution, is [响: 0.45, 向: 0.1]. When the set number is 3, the target word set and its probability distribution are [响: 0.45, 向: 0.1, UNK: 0.005], and the intersection and its probability distribution are [响: 0.45, 向: 0.1]. When the set number is 4, the target word set and its probability distribution are [响: 0.45, 向: 0.1, UNK: 0.005, 我: 0.0001], and the intersection and its probability distribution are [响: 0.45, 向: 0.1]; and so on. "UNK" is short for unknown words: in the current output vocabulary and its probability distribution, some words with relatively small probabilities are usually replaced by "UNK" to improve computational efficiency.
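The selection of the target word set and its intersection with the confusion set can be sketched as follows. This is a minimal illustration rather than the implementation of the embodiment: the helper names `target_word_set` and `intersect` are hypothetical, the entries elided by "..." in the example are simply left out, and the vocabulary is written with the characters 我, 你, 想, 响 and 向 on the assumption that these are what the glosses I, you, think, sound and toward stand for.

```python
from typing import Dict, Set


def target_word_set(probs: Dict[str, float], k: int) -> Dict[str, float]:
    """Keep the k most probable entries of the current output vocabulary."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    return dict(ranked[:k])


def intersect(confusion: Set[str], target: Dict[str, float]) -> Dict[str, float]:
    """Words shared by the confusion set and the target word set, with probabilities."""
    return {word: p for word, p in target.items() if word in confusion}


# Output vocabulary of the fourth decoding in the example (elided entries omitted).
probs = {"我": 0.0001, "你": 0.000003, "想": 0.00002, "响": 0.45, "向": 0.1, "UNK": 0.005}
confusion = set("翔享响险像厢香息乡巷相向祥项线象详箱")

for k in (2, 3, 4):
    target = target_word_set(probs, k)
    print(k, list(target), intersect(confusion, target))
```

For set numbers 2, 3 and 4 the intersection stays at 响 (0.45) and 向 (0.1), because the extra entries admitted into the target word set (UNK and 我) are not in the confusion set.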
[0102] The set number here is used to balance the proofreading speed and the proofreading accuracy of the text proofreading method. The larger the set number, the slower the proofreading speed and the higher the proofreading accuracy; the smaller the set number, the faster the proofreading speed and the lower the proofreading accuracy. The embodiments of the present application do not limit the specific value of the set number. Those skilled in the art can determine the specific value of the set number according to the actual requirements for proofreading speed and proofreading accuracy.
[0103] In some specific applications, the input text is "不同凡想". After the fourth decoding, the obtained current output vocabulary and its probability distribution are [我 (I): 0.0001, 你 (you): 0.000003, 想 (think): 0.00002, 响 (sound): 0.45, 向 (toward): 0.1, ..., UNK: 0.005]. The word to be proofread in the input text corresponding to the fourth decoding is "想", and the confusion set of "想" is [翔享响险像厢香息乡巷相向祥项线象详箱]. When the set number is 2, 3, or 4, the intersection and its probability distribution are [响: 0.45, 向: 0.1]. The intersection does not contain the word to be proofread, so the word "响", which has the highest probability in the intersection, is determined as the proofreading result. When the set number is 5, the intersection and its probability distribution are [响: 0.45, 向: 0.1, 想: 0.00002]. The intersection contains the word to be proofread, so the word to be proofread "想" is determined as the proofreading result.
[0104] Thus, it can be seen that the word to be proofread in the input text participates in each decoding process. In the first decoding process, the first word to be proofread in the input text participates in the first decoding process in the form of its confusion set, and the first word to be proofread in the input text affects the first proofreading result. In the non-first decoding process, the proofreading result obtained in the current decoding process is also related to the proofreading results obtained in the previous decoding processes, and the proofreading results obtained in the previous decoding processes are all affected by the input text. In this way, the input text affects the current result directly and indirectly. Thus, it can be seen that after the decoding is completed, the entire content of the input text participates in the decoding process. In this way, the degree of fit between the proofreading result obtained by each decoding process of this text proofreading method and the input text is relatively high, and the proofreading is more accurate.
[0105] In particular, when an LSTM is used as the decoder, its long-term memory gives the input text a higher degree of participation in the decoding process. As the number of decoding processes increases, the participation of the input text in the decoding process becomes higher, the degree of fit between the proofreading result obtained by each decoding process and the input text becomes higher, and the proofreading accuracy becomes higher.
[0106] S105. Determine the text after proofreading according to the proofreading result.
[0107] In the process of decoding the current sub-vector in the encoded vector of the input text, first obtain the current output vocabulary and the probability distribution of the current output vocabulary, and then combine the current output vocabulary and its probability distribution with the confusion set of the word to be proofread to reduce the impact of the misspelled word information contained in the input text on the proofreading model, and finally obtain a proofreading result with relatively high accuracy. In this way, the accuracy of text proofreading is improved.
[0108] Each decoding can obtain a proofreading result. The above technical solution describes the process of one-time decoding to obtain the proofreading result and the proofread text. In specific applications, after multiple decoding processes, according to the sequence of the decoding processes, the proofreading results obtained from each decoding process are connected in sequence to obtain the proofread text.
[0109] In some application scenarios, if the proofreading result is the same as the word to be proofread, it is determined that the word to be proofread is correct; if the proofreading result is different from the word to be proofread, it is determined that the word to be proofread may be incorrect, and the word to be proofread can be marked. For example, in a Word text, the word to be proofread that may be incorrect can be marked with highlighting, underlining, changing the font color or type, etc., to prompt the user.
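The comparison between each proofreading result and the corresponding word to be proofread can be sketched as a position-by-position check. This is an illustrative sketch only: the function name `flag_suspect_words` is hypothetical, each word is assumed to be a single character, and how the flagged positions are then rendered (highlighting, underlining, and so on) is left to the host application.

```python
from typing import List, Tuple


def flag_suspect_words(input_text: str, proofread_text: str) -> List[Tuple[int, str, str]]:
    """Return (position, original, suggestion) for every position where the
    proofreading result differs from the word to be proofread."""
    if len(input_text) != len(proofread_text):
        raise ValueError("input text and proofread text must have the same length")
    return [
        (position, original, result)
        for position, (original, result) in enumerate(zip(input_text, proofread_text))
        if original != result
    ]


print(flag_suspect_words("不同凡想", "不同凡响"))  # [(3, '想', '响')]
```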
[0110] Each decoding obtains a proofreading result. After the proofreading result of the current decoding process is obtained, all proofreading results obtained so far are connected to obtain the currently proofread text, and it is judged whether the length of the currently proofread text is the same as the length of the input text. If the length of the currently proofread text is less than the length of the input text, decoding continues; if the length of the currently proofread text is the same as the length of the input text, the decoding process ends.
[0111] For example, the length of the input text "不同凡想" is 4. The text "不同凡想" is encoded to obtain an encoded vector, and then the encoded vector is decoded. The first decoding obtains the proofreading result "不"; the currently proofread text is "不", whose length of 1 is less than the length of the input text, so decoding continues. The second decoding obtains the proofreading result "同"; the currently proofread text is "不同", whose length of 2 is less than the length of the input text, so decoding continues. The third decoding obtains the proofreading result "凡"; the currently proofread text is "不同凡", whose length of 3 is still less than the length of the input text, so decoding continues. The fourth decoding obtains the proofreading result "响"; the currently proofread text is "不同凡响", whose length of 4 is equal to the length of the input text, so decoding stops.
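The length-based stopping condition can be sketched as a loop around a single decoding step. This is a minimal sketch under stated assumptions: the names `proofread`, `Step` and `replay_step` are hypothetical, each proofreading result is assumed to be one character, and the decoding step itself (decoder plus confusion-set selection) is replaced by a stub that replays the results of the example.

```python
from typing import Callable, List, Optional, Tuple

# A decoding step takes (previous result, previous state, position) and returns
# (proofreading result, new state). Its internals are outside this sketch.
Step = Callable[[str, Optional[object], int], Tuple[str, object]]


def proofread(input_text: str, step: Step) -> str:
    results: List[str] = []
    previous, state = "<sos>", None          # initial values for the first decoding
    while len(results) < len(input_text):    # stop once the lengths match
        position = len(results)              # number of completed decoding processes
        previous, state = step(previous, state, position)
        results.append(previous)
    return "".join(results)                  # connect the results in decoding order


def replay_step(previous: str, state: Optional[object], position: int) -> Tuple[str, object]:
    """Stub step that replays the example; a real step would run the decoder."""
    return "不同凡响"[position], position


print(proofread("不同凡想", replay_step))  # 不同凡响
```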
[0112] Figure 2 is a schematic diagram of a model of a text proofreading method provided by an embodiment of the present application. The text proofreading method is described below by way of example, taking the input text "不同凡想" as an example. The lower part of Figure 2 is a schematic diagram of the encoding process. The upper part of Figure 2 is a schematic diagram of the decoding process, showing four decoding processes at successive moments from first to last. Each decoding process uses the decoding unit h for decoding. The input text "不同凡想" is encoded using an encoder to obtain the encoding vector [x1, x2, x3, x4], where x1 is the sub-vector corresponding to "不", x2 is the sub-vector corresponding to "同", x3 is the sub-vector corresponding to "凡", and x4 is the sub-vector corresponding to "想". In the decoding process, the decoder decodes the sub-vectors in the encoding vector in sequence and obtains the decoding result corresponding to each sub-vector. First, the initial value <sos> and the first sub-vector x1 in the encoding vector (the sub-vector corresponding to "不") are input into the decoding unit h of the decoder to obtain the first proofreading result "不". At the next moment, the first proofreading result "不", the intermediate decoding state of the decoding process at the previous moment, and the second sub-vector x2 (the sub-vector corresponding to "同") are input into the decoding unit h of the decoder again to obtain the second proofreading result "同". At the moment after that, the second proofreading result "同", the intermediate decoding state of the decoding process at the previous moment, and the third sub-vector x3 (the sub-vector corresponding to "凡") are input into the decoding unit h of the decoder again to obtain the third proofreading result "凡". At the last moment, the third proofreading result "凡", the intermediate decoding state of the decoding process at the previous moment, and the fourth sub-vector x4 (the sub-vector corresponding to "想") are input into the decoding unit h of the decoder again to obtain the fourth proofreading result "响". At this time, the proofread text obtained by connecting all the proofreading results is "不同凡响", whose length is the same as that of the input text, so decoding stops.
[0113] Figures 3a to 3f are detailed schematic diagrams of a model of a text proofreading method provided by an embodiment of the present application. The text proofreading method is described below by way of example, taking the input text "不同凡想" as an example. For ease of explanation, in order from the earliest to the latest moment, the same decoding unit in the different decoding processes is successively called the first decoding unit h1, the second decoding unit h2, the third decoding unit h3, and the fourth decoding unit h4. The structures of these four decoding units are the same, but their inputs and intermediate decoding states differ.
[0114] As shown in Figure 3a, the input text "不同凡想" is encoded by the encoder to obtain the corresponding encoding vector X = [x1, x2, x3, x4]. Here, the encoder can be BERT, or it can be an RNN or an LSTM.
[0115] As shown in Figure 3b, at the first moment, the decoder performs the first decoding on the encoding vector. Here, the decoder can be an LSTM. The intermediate decoding state (not shown in the figure; the intermediate decoding state at the first moment is 0), the initial value <sos>, and the first sub-vector x1 are input into the first decoding unit h1 of the decoder, and the first output vocabulary and its probability distribution y1 are obtained.
[0116] As shown in Figure 3c, since this is the first decoding, the confusion set of the first word to be proofread, "不" in the input text "不同凡想", is obtained. By combining this confusion set with the first output vocabulary and its probability distribution y1, the first proofreading result "不" is obtained.
[0117] This proofreading result "不" is the same as the first word "不" in the input text, so it is determined that there is no error in the first word of the input text.
[0118] At this time, the text that has been proofread is "不", and its length is less than the length of "不同凡想", so continue decoding.
[0119] As shown in Figure 3d, at the second moment (the second moment is after the first moment), the first proofreading result "不", the intermediate decoding state of the first decoding unit h1, and the second sub-vector x2 in the encoding vector are input into the second decoding unit h2 of the decoder to obtain the second output vocabulary and its probability distribution y2. Then, the confusion set of the second word to be proofread, "同" in the input text "不同凡想", is obtained. By combining this confusion set with the second output vocabulary and its probability distribution y2, the second proofreading result "同" is obtained.
[0120] This proofreading result is the same as the second word "同" in the input text, so it is determined that there is no error in the second word of the input text.
[0121] At this time, the text that has been proofread is "不同", and its length is less than the length of "不同凡想", so continue decoding.
[0122] As shown in Figure 3e, at the third moment (the third moment is after the second moment), the second proofreading result "同", the intermediate decoding state of the second decoding unit h2, and the third sub-vector x3 in the encoding vector are input into the third decoding unit h3 of the decoder to obtain the third output vocabulary and its probability distribution y3. Then, the confusion set of the third word to be proofread, "凡" in the input text "不同凡想", is obtained. By combining this confusion set with the third output vocabulary and its probability distribution y3, the third proofreading result "凡" is obtained.
[0123] This proofreading result is the same as the third word "凡" in the input text, so it is determined that there is no error in the third word of the input text.
[0124] At this time, the text that has been proofread is "不同凡", and its length is less than the length of "不同凡想", so continue decoding.
[0125] As shown in Figure 3f, the third proofreading result "凡", the intermediate decoding state of the third decoding unit h3, and the fourth sub-vector x4 in the encoding vector are input into the fourth decoding unit h4 of the decoder to obtain the fourth output vocabulary and its probability distribution y4. Then, the confusion set of the fourth word to be proofread, "想" in the input text "不同凡想", is obtained. By combining this confusion set with the fourth output vocabulary and its probability distribution y4, the fourth proofreading result "响" is obtained.
[0126] This proofreading result "响" is different from the fourth word "想" in the input text, so it is determined that the fourth word in the input text may be incorrect.
[0127] At this time, the text that has been proofread is "不同凡响", and its length is equal to the length of "不同凡想", so decoding stops.
[0128] Figure 4 is a schematic diagram of a text proofreading device provided by an embodiment of the present application.
[0129] As shown in Figure 4, the text proofreading device includes an acquisition module 41, a decoding module 42, a first determination module 43, a second determination module 44, and a third determination module 45. The acquisition module 41 is configured to obtain the current sub-vector in the encoding vector of the input text during the current decoding process, where the position of the current sub-vector in the encoding vector corresponds to the number of completed decoding processes. The decoding module 42 is configured to perform decoding processing on the current sub-vector to obtain the current output vocabulary and the probability distribution of the current output vocabulary. The first determination module 43 is configured to determine the word to be proofread in the input text, where the position of the word to be proofread in the input text corresponds to the number of completed decoding processes. The second determination module 44 is configured to determine the proofreading result corresponding to the word to be proofread according to the word to be proofread, the confusion set corresponding to the word to be proofread, the current output vocabulary, and the probability distribution. The third determination module 45 is configured to determine the proofread text according to the proofreading result.
[0130] Optionally, the second determination module 44 includes a first determination unit, a second determination unit, and a third determination unit. The first determination unit is configured to determine at least one word in the current output vocabulary as the target word set. The second determination unit is configured to determine the intersection of the confusion set and the target word set. The third determination unit is configured to determine the word to be proofread as the proofreading result when the intersection is an empty set, and to determine the proofreading result according to the inclusion relationship between the intersection and the word to be proofread when the intersection is not an empty set.
[0131] Optionally, determining the proofreading result according to the inclusion relationship between the intersection and the word to be proofread includes: when the intersection does not contain the word to be proofread, determining the word with the highest probability in the intersection as the proofreading result; when the intersection contains the word to be proofread, determining the word to be proofread as the proofreading result.
[0132] Optionally, the first determining unit is specifically configured to determine at least one word in the current output vocabulary as the target word set, including: determining a set number of words in the current output vocabulary as the target word set; wherein the probability of any word in the target word set is greater than or equal to the probability of any word in the current output vocabulary that is not in the target word set.
[0133] Optionally, the decoding module 42 includes a first decoding unit, which is configured to decode the initial value in the case of the first decoding process to obtain the current output vocabulary and the probability distribution of the current output vocabulary.
[0134] Optionally, the decoding module 42 includes a second decoding unit, which is configured to, when the current decoding process is not the first decoding process, decode the previous proofreading result and the current sub-vector to obtain the current output vocabulary and the probability distribution of the current output vocabulary, wherein the previous proofreading result is the proofreading result obtained in the previous decoding process.
[0135] Optionally, the second decoding unit is specifically configured to obtain the intermediate decoding state in the previous decoding process; determine the weight of each sub-vector in the encoding vector based on the matching degree between the intermediate decoding state and each sub-vector in the encoding vector; determine the dynamic semantic encoding vector by weighting the sub-vector in the encoding vector and its corresponding weight; and perform decoding processing on the dynamic semantic encoding vector, the previous proofreading result, and the current sub-vector to obtain the current output vocabulary and the probability distribution of the current output vocabulary.
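The weighting described above can be sketched numerically. This is an illustration under explicit assumptions, not the scoring used by the embodiment: the function name `dynamic_context` is hypothetical, the matching degree is taken to be a plain dot product (Bahdanau attention, mentioned earlier in the description, uses a learned additive score instead), and the weights are assumed to be normalized with a softmax.

```python
import math
from typing import List


def dynamic_context(state: List[float], sub_vectors: List[List[float]]) -> List[float]:
    """Weight each encoder sub-vector by its match with the decoder state."""
    # Matching degree: dot product (an assumption; the text does not fix the score).
    scores = [sum(s * x for s, x in zip(state, vec)) for vec in sub_vectors]
    # Softmax turns the scores into weights that sum to 1.
    peak = max(scores)
    exps = [math.exp(score - peak) for score in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # The weighted sum of the sub-vectors is the dynamic semantic encoding vector.
    dim = len(sub_vectors[0])
    return [sum(w * vec[d] for w, vec in zip(weights, sub_vectors)) for d in range(dim)]


# A state that matches no sub-vector better than another yields their plain average.
print(dynamic_context([0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]]))
```

The resulting vector would then be passed to the decoder together with the previous proofreading result and the current sub-vector, as the paragraph above describes.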
[0136] Optionally, the text proofreading device further includes an end module, which is configured to, after obtaining the proofreading result of the current decoding process, connect all the obtained proofreading results to obtain the currently proofread text; and end the decoding process if the length of the currently proofread text is the same as the length of the input text.
[0137] Optionally, the text proofreading apparatus further includes an encoding module configured to perform text embedding and position embedding on the input text to obtain an input vector; process the input vector multiple times using at least one sub-encoder; and determine the output of the last sub-encoder as the encoded vector.
[0138] Optionally, the third determining module 45 is specifically configured to, after multiple decoding processes, sequentially connect the proofreading results obtained in each decoding process according to the order of the decoding process to obtain the proofread text.
[0139] In some embodiments, the electronic device includes a processor and a memory storing program instructions, the processor being configured to execute the text proofreading method provided in the foregoing embodiments when executing the program instructions.
[0140] Figure 5 is a schematic diagram of an electronic device provided in an embodiment of this application. As shown in Figure 5, the electronic device includes:
[0141] A processor 51 and a memory 52; the electronic device may also include a communication interface 53 and a bus 54. The processor 51, the communication interface 53, and the memory 52 can communicate with each other via the bus 54. The communication interface 53 can be used for information transmission. The processor 51 can invoke logical instructions in the memory 52 to execute the text proofreading method provided in the foregoing embodiments.
[0142] Furthermore, the logic instructions in the aforementioned memory 52 can be implemented as software functional units and, when sold or used as independent products, can be stored in a computer-readable storage medium.
[0143] The memory 52, as a computer-readable storage medium, can be used to store software programs and computer-executable programs, such as program instructions/modules corresponding to the methods in the embodiments of this application. The processor 51 executes functional applications and data processing by running the software programs, instructions, and modules stored in the memory 52, thereby implementing the methods in the above-described method embodiments.
[0144] The memory 52 may include a program storage area and a data storage area. The program storage area may store the operating system and application programs required for at least one function; the data storage area may store data created based on the use of the terminal device. Furthermore, the memory 52 may include high-speed random access memory and may also include non-volatile memory.
[0145] This application provides a storage medium storing computer-executable instructions configured to execute the text proofreading method provided in the foregoing embodiments.
[0146] This application provides a computer program product, which includes a computer program stored on a computer-readable storage medium. The computer program includes program instructions, which, when executed by a computer, cause the computer to perform the text proofreading method provided in the foregoing embodiments.
[0147] The aforementioned computer-readable storage medium may be a transitory computer-readable storage medium or a non-transitory computer-readable storage medium.
[0148] The technical solutions of the embodiments of this application can be embodied in the form of a software product. This computer software product is stored in a storage medium and includes one or more instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute all or part of the steps of the methods in the embodiments of this application. The aforementioned storage medium can be a non-transitory storage medium, including: USB flash drive, portable hard drive, read-only memory (ROM), random access memory (RAM), magnetic disk, or optical disk, and other media capable of storing program code; it can also be a transitory storage medium.
[0149] The foregoing description and accompanying drawings fully illustrate embodiments of this application to enable those skilled in the art to practice them. Other embodiments may include structural, logical, electrical, procedural, and other changes. The embodiments represent only possible variations. Individual components and functions are optional unless explicitly required, and the order of operations may vary. Parts and features of some embodiments may be included in or replace parts and features of other embodiments. Moreover, the terminology used in this application is for describing embodiments only and is not intended to limit the claims. As used in the description of embodiments and claims, the singular forms "a," "an," and "the" are intended to equally include the plural forms unless the context clearly indicates otherwise. Additionally, when used in this application, the term "comprise" and its variations "comprises" and/or "comprising," etc., refer to the presence of stated features, integers, steps, operations, elements, and/or components, but do not exclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. Unless otherwise specified, an element defined by the phrase "comprising a..." does not exclude the presence of other identical elements in the process, method, or apparatus that includes that element. In this document, each embodiment may focus on describing its differences from other embodiments, and similar or identical parts of the embodiments may be cross-referenced. For methods, products, etc., disclosed in the embodiments, if they correspond to the method section disclosed in the embodiments, reference may be made to the description of the method section for the relevant parts.
[0150] Those skilled in the art will recognize that the units and algorithm steps of the various examples described in conjunction with the embodiments disclosed herein can be implemented in electronic hardware, or a combination of computer software and electronic hardware. Whether these functions are implemented in hardware or software depends on the specific application and design constraints of the technical solution. Those skilled in the art can use different methods to implement the described functions for each specific application, but such implementation should not be considered beyond the scope of the embodiments of this application. Those skilled in the art will clearly understand that, for the sake of convenience and brevity, the specific working processes of the systems, devices, and units described above can be referred to the corresponding processes in the foregoing method embodiments, and will not be repeated here.
[0151] The methods and products (including but not limited to devices and equipment) disclosed in the embodiments herein can be implemented in other ways. For example, the device embodiments described above are merely illustrative. For instance, the division of units may be merely a logical functional division, and in actual implementation, there may be other division methods. For example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. In addition, the mutual coupling or direct coupling or communication connection shown or discussed may be through some interfaces, and the indirect coupling or communication connection of devices or units may be electrical, mechanical, or other forms. Units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units, that is, they may be located in one place or distributed across multiple network units. Some or all of the units can be selected to implement this embodiment according to actual needs. Furthermore, the functional units in the embodiments of this application may be integrated into one processing unit, or each unit may exist physically separately, or two or more units may be integrated into one unit.
[0152] The flowcharts and block diagrams in the accompanying drawings illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to embodiments of this application. In this regard, each block in a flowchart or block diagram may represent a module, segment, or portion of code, which contains one or more executable instructions for implementing a specified logical function. In some alternative implementations, the functions marked in the blocks may occur in a different order than those shown in the drawings. For example, two consecutive blocks may actually be executed substantially in parallel, and they may sometimes be executed in reverse order, depending on the functions involved. Each block in the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, may be implemented using a dedicated hardware-based system that performs the specified function or action, or using a combination of dedicated hardware and computer instructions.
Claims
1. A text proofreading method, characterized in that it comprises: in the current decoding process, obtaining the current sub-vector in the encoding vector of the input text, wherein the position of the current sub-vector in the encoding vector corresponds to the number of decoding processes that have been completed; decoding the current sub-vector to obtain the current output vocabulary and the probability distribution of the current output vocabulary; determining the word to be proofread in the input text, wherein the position of the word to be proofread in the input text corresponds to the number of decoding processes that have been completed; determining, based on the word to be proofread, the confusion set corresponding to the word to be proofread, the current output vocabulary, and the probability distribution, the proofreading result corresponding to the word to be proofread; and determining the proofread text based on the proofreading results; wherein the step of determining the proofreading result corresponding to the word to be proofread based on the word to be proofread, the confusion set corresponding to the word to be proofread, the current output vocabulary, and the probability distribution includes: determining the intersection of the confusion set and the target vocabulary; if the intersection is an empty set, determining the word to be proofread as the proofreading result; and if the intersection is not an empty set, determining the proofreading result based on the inclusion relationship between the intersection and the word to be proofread.
2. The text proofreading method according to claim 1, characterized in that the step of determining the proofreading result corresponding to the word to be proofread based on the word to be proofread, the confusion set corresponding to the word to be proofread, the current output vocabulary, and the probability distribution further includes: identifying at least one word in the current output vocabulary as the target vocabulary.
3. The text proofreading method according to claim 2, characterized in that the step of determining the proofreading result based on the inclusion relationship between the intersection and the word to be proofread includes: if the intersection does not contain the word to be proofread, determining the word with the highest probability in the intersection as the proofreading result; and if the intersection contains the word to be proofread, determining the word to be proofread as the proofreading result.
4. The text proofreading method according to any one of claims 1 to 3, characterized in that the step of decoding the current sub-vector to obtain the current output vocabulary and the probability distribution of the current output vocabulary includes: in the first decoding process, decoding the initial value and the current sub-vector to obtain the current output vocabulary and the probability distribution of the current output vocabulary.
5. The text proofreading method according to any one of claims 1 to 3, characterized in that the step of decoding the current sub-vector to obtain the current output vocabulary and the probability distribution of the current output vocabulary includes: in any decoding process other than the first, decoding the previous proofreading result and the current sub-vector to obtain the current output vocabulary and the probability distribution of the current output vocabulary, wherein the previous proofreading result is the proofreading result obtained in the previous decoding process.
6. The text proofreading method according to claim 5, characterized in that the step of decoding the previous proofreading result and the current sub-vector to obtain the current output vocabulary and the probability distribution of the current output vocabulary includes: obtaining the intermediate decoding state from the previous decoding process; determining the weight of each sub-vector in the encoding vector based on the degree of matching between the intermediate decoding state and each sub-vector in the encoding vector; determining the weighted sum of the sub-vectors in the encoding vector, each multiplied by its corresponding weight, as the dynamic semantic encoding vector; and decoding the dynamic semantic encoding vector, the previous proofreading result, and the current sub-vector to obtain the current output vocabulary and the probability distribution of the current output vocabulary.
7. The text proofreading method according to any one of claims 1 to 3, characterized in that, before obtaining the current sub-vector in the encoding vector of the input text, the method further includes: embedding the input text using text embedding and position embedding to obtain an input vector; processing the input vector multiple times using at least one sub-encoder; and determining the output of the last sub-encoder as the encoding vector of the input text.
8. The text proofreading method according to any one of claims 1 to 3, characterized in that the step of determining the proofread text based on the proofreading results includes: after multiple decoding processes, sequentially concatenating the proofreading results obtained from each decoding process, in the order of the decoding processes, to obtain the proofread text.
9. A text proofreading device, characterized in that it comprises: an acquisition module configured to acquire the current sub-vector in the encoding vector of the input text during the current decoding process, wherein the position of the current sub-vector in the encoding vector corresponds to the number of decoding processes that have been completed; a decoding module configured to decode the current sub-vector to obtain the current output vocabulary and the probability distribution of the current output vocabulary; a first determining module configured to determine the word to be proofread in the input text, wherein the position of the word to be proofread in the input text corresponds to the number of decoding processes that have been completed; a second determining module configured to determine the proofreading result corresponding to the word to be proofread, based on the word to be proofread, the confusion set corresponding to the word to be proofread, the current output vocabulary, and the probability distribution, wherein determining the proofreading result corresponding to the word to be proofread includes: determining the intersection of the confusion set and the target vocabulary; if the intersection is an empty set, determining the word to be proofread as the proofreading result; and if the intersection is not an empty set, determining the proofreading result based on the inclusion relationship between the intersection and the word to be proofread; and a third determining module configured to determine the proofread text based on the proofreading results.
10. An electronic device comprising a processor and a memory storing program instructions, characterized in that the processor is configured to perform the text proofreading method according to any one of claims 1 to 8 when executing the program instructions.
11. A storage medium storing program instructions, characterized in that the program instructions, when executed, perform the text proofreading method according to any one of claims 1 to 8.
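
For illustration only, the selection rule recited in claims 1 to 3 and the concatenation recited in claim 8 can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the claimed implementation: the function names `proofread_token` and `proofread_text`, the `top_k` cutoff used to pick the target vocabulary, the `sep` separator, and the per-step probability dictionaries are all hypothetical. In particular, the sketch takes the decoder's probability distributions as precomputed input, whereas in the claimed method each decoding process after the first also consumes the previous proofreading result.

```python
from typing import Dict, List, Set


def proofread_token(
    word: str,
    confusion_set: Set[str],
    output_probs: Dict[str, float],
    top_k: int = 5,
) -> str:
    """Choose the proofreading result for one word to be proofread."""
    # Assumption: the target vocabulary is the top_k most probable output words.
    ranked = sorted(output_probs, key=output_probs.get, reverse=True)
    target_vocab = set(ranked[:top_k])
    intersection = confusion_set & target_vocab
    # Empty intersection, or the word itself is in the intersection: keep the word.
    if not intersection or word in intersection:
        return word
    # Otherwise pick the most probable word in the intersection
    # (sorted first so that ties are resolved deterministically).
    return max(sorted(intersection), key=lambda w: output_probs[w])


def proofread_text(
    words: List[str],
    confusion_sets: Dict[str, Set[str]],
    step_probs: List[Dict[str, float]],
    top_k: int = 5,
    sep: str = "",
) -> str:
    """Apply the rule once per decoding process and concatenate the results in order."""
    results = []
    for word, probs in zip(words, step_probs):
        candidates = confusion_sets.get(word, set())
        results.append(proofread_token(word, candidates, probs, top_k))
    return sep.join(results)


if __name__ == "__main__":
    # Toy data invented for this example.
    words = ["over", "their"]
    confusion_sets = {"their": {"there", "they're"}}
    step_probs = [
        {"over": 0.9, "other": 0.1},
        {"there": 0.7, "here": 0.2, "their": 0.1},
    ]
    print(proofread_text(words, confusion_sets, step_probs, top_k=2, sep=" "))
    # prints "over there"
```

In this sketch a word with no confusion set always yields an empty intersection and is therefore left unchanged, which matches the empty-set branch of claim 1.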