Corpus generation method and apparatus, electronic device, and storage medium

By using a translation model array and edit distance calculation, the problem of low efficiency in manually acquiring parallel corpora is solved, and multiple parallel corpora with the same meaning are efficiently acquired.

CN114239609B · Active · Publication Date: 2026-06-30 · SHANGHAI LIULISHUO INFORMATION TECH CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
SHANGHAI LIULISHUO INFORMATION TECH CO LTD
Filing Date
2021-12-14
Publication Date
2026-06-30

AI Technical Summary

Technical Problem

Existing technologies for obtaining parallel corpora manually are inefficient and cannot meet practical needs.

Method used

A pre-trained array of translation models is used to translate corpora of the target language type. By calculating the edit distance between multiple translation results and the original corpus, the target corpus with the same meaning is selected.

Benefits of technology

It improves the efficiency of obtaining parallel corpora, enabling the acquisition of more corpora with the same meaning and language type at once.

✦ Generated by Eureka AI based on patent content.


Abstract

A corpus generation method and device, electronic equipment and storage medium, wherein the corpus generation method comprises: obtaining a first corpus of a target language type; inputting the first corpus into a trained translation model array for translation to obtain a plurality of translation results; wherein the translation model array comprises translation models for translating the first corpus into a corpus of another language type, and translation models for translating the corpus of the other language type into the target language type, and each translation model in the translation model array is arranged in a predetermined order; and calculating the edit distance between the plurality of translation results and the first corpus to obtain a target corpus corresponding to the first corpus. The above scheme can improve the efficiency of obtaining parallel corpus.

Description

Technical Field

[0001] This specification relates to the field of computer natural language processing technology, and in particular to a corpus generation method, apparatus, electronic device and storage medium.

Background Technology

[0002] Currently, parallel corpora with the same meaning and language type are mainly obtained manually. Specifically, after obtaining the corpus, it is translated manually to obtain translated corpora with the same meaning and language type.

[0003] However, manual translation is an inefficient way of obtaining parallel corpora.

Summary of the Invention

[0004] In view of this, embodiments of this specification provide a corpus generation method, apparatus, electronic device, and storage medium that can improve the efficiency of acquiring parallel corpora.

[0005] First, this specification provides a corpus generation method, including:

[0006] Obtain the first corpus of the target language type;

[0007] The first corpus is input into a pre-trained translation model array for translation, resulting in multiple translation results; wherein, the translation model array includes a translation model that translates the first corpus into corpora of other language types, and a translation model that translates the corpora of other language types into the target language type, and the translation models in the translation model array are set in a preset order;

[0008] The edit distance between the multiple translation results and the first corpus is calculated to obtain the target corpus corresponding to the first corpus.

[0009] Optionally, at least one translation model in the translation model array generates output corpus with at least two expressions for any input corpus.

[0010] Optionally, each translation model in the translation model array includes an encoding layer and a decoding layer;

[0011] The first corpus is input into a pre-trained translation model array for translation, resulting in multiple translation results, including:

[0012] The first corpus is input into the encoding layer of the first translation model in the translation model array for encoding to obtain the corresponding vector matrix;

[0013] The vector matrix is input into the decoding layer of the first translation model, the vector matrix is decoded and translated, and multiple first translation results of a preset language type are obtained according to a preset algorithm;

[0014] The multiple first translation results are respectively input into other translation models in the translation model array to translate the first translation results accordingly, so as to obtain multiple translation results for the target language type.

[0015] Optionally, the step of inputting the vector matrix into the decoding layer of the first translation model, decoding and translating the vector matrix, and obtaining multiple first translation results for a preset language type according to a preset algorithm includes:

[0016] The vector matrix corresponding to the first corpus and the set of start identifiers are input into the encoding layer of the first translation model to obtain the first target vector matrix;

[0017] The first target vector matrix is input into the decoding layer of the first translation model, and the first target vector matrix is decoded and translated to obtain multiple word units. After normalization, the probability value of each word unit is obtained;

[0018] Based on the probability values of each word unit and according to the preset algorithm, a first preset number of word units are selected to obtain the first target word unit set;

[0019] The vector matrix corresponding to the first corpus and the set of first target word units are input into the decoding layer of the translation model, and multiple prediction results are obtained according to the preset algorithm.

[0020] Based on the probability values of the multiple prediction results, the first preset number of prediction results are selected;

[0021] The prediction results are iteratively predicted until an end identifier is found, resulting in multiple first translation results for the other language types.

[0022] Optionally, each translation model in the translation model array includes an encoding layer and a decoding layer;

[0023] The first corpus is input into a pre-trained translation model array for translation, resulting in multiple translation results, including:

[0024] The first corpus is input into the first translation model in the translation model array, and the first corpus is translated to obtain a second translation result of a preset language type;

[0025] The second translation result is input into the encoding layer of at least one other translation model in the translation model array to obtain the corresponding vector matrix;

[0026] The vector matrix is input into the decoding layer of at least one other translation model, and the vector matrix is subjected to corresponding decoding and translation processing. According to a preset algorithm, multiple translation results of the target language type are obtained.

[0027] Optionally, the step of inputting the vector matrix into the decoding layer of at least one other translation model, performing corresponding decoding and translation processing on the vector matrix, and obtaining multiple translation results for the target language type according to a preset algorithm includes:

[0028] The vector matrix corresponding to the second translation result and the set of start identifiers are input into the decoding layer of the other at least one translation model to obtain the second target vector matrix;

[0029] The second target vector matrix is input into the decoding layer of the other at least one translation model, and the second target vector matrix is subjected to corresponding decoding and translation processing to obtain multiple word units. After normalization, the probability value of each word unit is obtained;

[0030] Based on the probability values of each word unit and according to the preset algorithm, a second preset number of word units are selected to obtain the second target word unit set;

[0031] The vector matrix corresponding to the second translation result and the set of second target word units are input into the decoding layer of the other at least one translation model, and multiple prediction results are obtained according to the preset algorithm;

[0032] The prediction results are iteratively predicted until an end identifier is found, so as to obtain multiple translation results for the target language type.

[0033] Optionally, calculating the edit distance between the plurality of translation results and the first corpus to obtain the target corpus corresponding to the first corpus includes:

[0034] Calculate the edit distance between each word in each of the multiple translation results and each word in the first corpus;

[0035] Based on the edit distance, the translation results that meet the preset conditions are selected as the target corpus corresponding to the first corpus.

[0036] Optionally, the translation model array consists of translation models belonging to the same language family.

[0037] Accordingly, embodiments of this specification also provide a corpus generation apparatus, including:

[0038] Corpus acquisition unit, suitable for acquiring the first corpus of the target language type;

[0039] The translation unit is adapted to input the first corpus into a pre-trained translation model array for translation, and obtain multiple translation results; wherein, the translation model array includes a translation model for translating the first corpus into other language types, and a translation model for translating the other language types into the target language type, and the translation models in the translation model array are set in a preset order;

[0040] The processing unit is adapted to calculate the edit distance between the multiple translation results and the first corpus, and obtain the target corpus corresponding to the first corpus.

[0041] This specification also provides an electronic device, including a memory and a processor, wherein the memory is adapted to store one or more computer instructions, and the processor, when executing the computer instructions, performs the steps of the corpus generation method described in any of the foregoing embodiments.

[0042] This specification also provides a computer-readable storage medium storing computer instructions, characterized in that the computer instructions, when executed, perform the steps of the corpus generation method described in any of the foregoing embodiments.

[0043] By employing the above scheme, a pre-trained translation model array is used to translate the first corpus of the target language type, yielding multiple translation results of the target language type. Furthermore, by calculating the edit distance between the multiple translation results and the first corpus, target corpora with the same meaning as the first corpus can be obtained. Compared with manually obtaining parallel corpora with the same meaning and language type, more parallel corpora can be obtained at once, which improves the efficiency of obtaining parallel corpora.

[0044] Furthermore, by enabling at least one translation model in the translation model array to generate output corpus with at least two expressions for any input corpus, the same amount of target corpus can be obtained while reducing the size of the translation model array.

[0045] Furthermore, each translation model in the translation model array includes an encoding layer and a decoding layer. The first corpus is input into the encoding layer of the first translation model in the array for encoding, resulting in a corresponding vector matrix. This vector matrix is then input into the decoding layer of the first translation model for decoding and translation processing. Following a preset algorithm, multiple first translation results for a preset language type are obtained. These multiple first translation results are then input into other translation models in the array to perform corresponding translations, thereby obtaining multiple translation results for the target language type. Using this method, by setting the parameters of the encoding layer of the translation model, multiple first translation results can be obtained. Furthermore, by having other translation models translate these first translation results, multiple translation results for the target language type can be obtained. Moreover, by setting the parameters of the encoding layers of other translation models, even more translation results for the target language type can be obtained, improving the efficiency of acquiring parallel corpora.

Description of the Drawings

[0046] To more clearly illustrate the technical solutions of the embodiments of this specification, the drawings used in the description of the embodiments of this specification or the prior art will be briefly introduced below. Obviously, the drawings described below are only some embodiments of this specification. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0047] Figure 1 shows a flowchart of a corpus generation method according to an embodiment of this specification;

[0048] Figure 2 shows a flowchart of translating a first corpus to obtain multiple translation results according to an embodiment of this specification;

[0049] Figure 3 shows a flowchart of a specific application scenario, in an embodiment of this specification, in which a first corpus is translated to obtain multiple translation results;

[0050] Figure 4 shows a schematic diagram of a corpus generation apparatus according to an embodiment of this specification;

[0051] Figure 5 shows a schematic diagram of the structure of an electronic device according to an embodiment of this specification.

Detailed Implementation

[0052] Currently, daily life requires a large amount of parallel corpora with the same meaning and language type to enrich applications closely related to human life, such as teaching and writing. However, as mentioned in the background section, manually acquiring parallel corpora is inefficient and cannot meet practical needs.

[0053] To address the aforementioned problems, embodiments of this specification provide a corpus generation method, comprising: acquiring a first corpus of a target language type; inputting the first corpus into a pre-trained translation model array for translation to obtain multiple translation results; calculating the edit distance between the multiple translation results and the first corpus to obtain a target corpus corresponding to the first corpus; wherein the translation model array includes a translation model for translating the first corpus into corpora of other language types, and a translation model for translating the corpora of other language types into the target language type, and the translation models in the translation model array are set in a preset order.

[0054] By employing the above method, a first corpus of the target language type can be translated using a pre-trained translation model array, yielding multiple translation results of the target language type. Furthermore, by calculating the edit distance between the multiple translation results and the first corpus, target corpora with the same meaning as the first corpus can be obtained. Compared with manually obtaining parallel corpora with the same meaning and language type, more parallel corpora can be obtained at once, which improves the efficiency of acquiring parallel corpora.

[0055] To enable those skilled in the art to better understand and implement the embodiments of this specification, the following detailed description is provided with reference to the accompanying drawings and specific application examples.

[0056] Referring to Figure 1, which shows a flowchart of a corpus generation method in an embodiment of this specification, the method can be executed according to the following steps:

[0057] S11, Obtain the first corpus of the target language type.

[0058] In practice, the wider the source of the first corpus of the target language type, the more application scenarios are included in the parallel corpus with the same meaning obtained by paraphrasing and translating the first corpus. Therefore, the first corpus can be obtained in multiple different fields.

[0059] For example, if the user is a teacher, they can obtain the relevant text data from the lesson text as the first corpus, and after translation, they can obtain translation results in various ways, enriching the teaching scenario.

[0060] In practice, a corpus of one or more target language types can be selected as the first corpus according to actual needs. For example, the target language type can be Chinese, English, German, French, etc.

[0061] In practice, the first corpus can be a single sentence, a paragraph containing multiple sentences, or a document containing many sentences. When the first corpus contains multiple sentences, each sentence needs to be input sequentially into the pre-trained translation model array. This specification does not limit the specific format of the acquired training corpus, as long as it meets the corpus format requirements.

[0062] S12, input the first corpus into the pre-trained translation model array for translation, and obtain multiple translation results.

[0063] The translation model array includes a translation model for translating the first corpus into corpora of other language types, and a translation model for translating the corpora of other language types into the target language type, and the translation models in the translation model array are set in a preset order.

[0064] In specific implementation, the translation model array includes at least two translation models, and in the two translation models, one translation model can translate the first corpus of the target language type into other language types, and the other translation model can translate the corpus of other language types into the corpus of the target language type and output it. That is, in the embodiments of this specification, the first corpus of the target language type is translated at least twice to obtain multiple translation results.

[0065] It is understood that the translation model array may also include three or more translation models. When the translation model array includes three or more translation models, the input corpus type of the first translation model in the translation model array is the first corpus, and the output corpus of the last translation model in the translation model array is the target language type corpus.
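The ordering constraint described above can be sketched as a simple check. This is a minimal illustration only: the `TranslationModel` wrapper, its field names, and `validate_array` are hypothetical names introduced here, not part of the specification.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class TranslationModel:
    """Hypothetical wrapper: a translate function plus its language pair."""
    source_lang: str
    target_lang: str
    translate: Callable[[str], List[str]]  # may return several candidates


def validate_array(models: List[TranslationModel], target_lang: str) -> None:
    """Check that the preset order forms a chain starting and ending in target_lang."""
    if len(models) < 2:
        raise ValueError("the array needs at least two translation models")
    if models[0].source_lang != target_lang:
        raise ValueError("the first model must accept the target language type")
    if models[-1].target_lang != target_lang:
        raise ValueError("the last model must output the target language type")
    # Each model's output language must match the next model's input language.
    for prev, nxt in zip(models, models[1:]):
        if prev.target_lang != nxt.source_lang:
            raise ValueError(
                f"{prev.target_lang} output cannot feed a {nxt.source_lang} input"
            )
```

Under these assumptions, an English-French-German-English chain passes the check, while the same chain truncated after the French-German model is rejected because its last model does not output English.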

[0066] As an optional example, at least one translation model generates output corpus with at least two expressions for any input corpus.

[0067] By ensuring that at least one translation model in the translation model array generates output corpus with at least two expressions for any input corpus, the same amount of target corpus can be obtained while reducing the size of the translation model array.

[0068] S13, calculate the edit distance between the multiple translation results and the first corpus to obtain the target corpus corresponding to the first corpus.

[0069] In practice, multiple translation results can be obtained through step S12, but there may be cases where some translation results are completely identical to the first corpus. Therefore, it is necessary to remove the target corpus that is completely identical to the first corpus based on the edit distance between the multiple translation results and the first corpus, so as to selectively obtain the target corpus.
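As an illustration of step S13, the following sketch computes a word-level Levenshtein edit distance and discards translation results that are identical to the first corpus. The whitespace tokenization, the `min_distance` threshold, and the function names are assumptions made for this example; the specification only requires that results meeting "preset conditions" be kept.

```python
from typing import List, Sequence


def edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Levenshtein distance between two token sequences (each edit costs 1)."""
    prev = list(range(len(b) + 1))
    for i, tok_a in enumerate(a, start=1):
        curr = [i]
        for j, tok_b in enumerate(b, start=1):
            cost = 0 if tok_a == tok_b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def select_target_corpus(first_corpus: str, results: List[str],
                         min_distance: int = 1) -> List[str]:
    """Keep distinct results whose word-level distance is at least min_distance."""
    source_tokens = first_corpus.split()
    selected: List[str] = []
    for result in results:
        if result in selected:
            continue  # skip duplicate translation results
        if edit_distance(source_tokens, result.split()) >= min_distance:
            selected.append(result)
    return selected
```

With `min_distance=1`, a result whose distance to the first corpus is 0 (an exact copy) is removed, which matches the filtering described in paragraph [0069]. A larger threshold would additionally drop near-copies.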

[0070] By employing the above scheme, a pre-trained translation model array is used to translate the first corpus of the target language type, yielding multiple translation results of the target language type. Furthermore, by calculating the edit distance between the multiple translation results and the first corpus, target corpora with the same meaning as the first corpus can be obtained. Compared with manually obtaining parallel corpora with the same meaning and language type, more parallel corpora can be obtained at once, which improves the efficiency of acquiring parallel corpora.

[0071] In practice, the first corpus can be obtained in various ways. For example, at least one of the following methods can be used to obtain the first corpus:

[0072] Extracting text data from the Internet;

[0073] Manually inputting text data of a preset domain;

[0074] Capturing voice data from the Internet and processing it through speech recognition to obtain the corresponding text data.

[0075] After obtaining the first corpus of the preset domain, and before inputting the first corpus into the pre-trained translation model array for translation and obtaining multiple translation results, the first corpus can be segmented to obtain the word encoding vector corresponding to the first corpus, and the word encoding vector corresponding to the first corpus can be input into the translation model array.

[0076] In one example, Byte Pair Encoding (BPE) can be used to obtain the word encoding vector corresponding to the first corpus. For example, for the corpus {Maintenant, regardez le menu des salades}, BPE encoding yields {Maintenant, regar@@ dez le menu des sal@@ ades}.
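The `@@` notation in this example can be read as a continuation marker: a subword ending in `@@` is joined to the subword that follows it. The sketch below assumes that convention (as used by tools such as subword-nmt) and does not learn or apply BPE merges itself; `mark_subwords` and `merge_bpe` are hypothetical helper names.

```python
from typing import List


def mark_subwords(words: List[List[str]], marker: str = "@@") -> str:
    """Render each word's subword pieces, marking every non-final piece."""
    tokens: List[str] = []
    for pieces in words:
        tokens.extend(piece + marker for piece in pieces[:-1])
        tokens.append(pieces[-1])
    return " ".join(tokens)


def merge_bpe(segmented: str, marker: str = "@@") -> str:
    """Undo the segmentation by gluing each marked piece to the next token."""
    return segmented.replace(marker + " ", "")


segmented = mark_subwords([["regar", "dez"], ["le"], ["menu"], ["des"], ["sal", "ades"]])
restored = merge_bpe(segmented)
```

Here `segmented` holds the marked subword sequence and `restored` holds the original words again, which shows that the segmentation is lossless.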

[0077] In the embodiments of this specification, the translation model may include a sequence-to-sequence (seq2seq) model. A seq2seq model is a neural network model with an encoder-decoder structure, where the input is a sequence and the output is also a sequence, and the lengths of the input and output sequences can be different. For example, when the translation model is an English-German translation model, the length of the translated German text can be longer than that of the English text.

[0078] In specific implementations, the translation model may include a Transformer model with a self-attention mechanism, a long-document Transformer (Longformer) model, a recurrent neural network (RNN) model, a Bidirectional Encoder Representations from Transformers (BERT) model, etc.

[0079] In the embodiments of this specification, a set of parallel corpora of preset language types can be used to train the translation models and thereby obtain the translation model array. The parallel corpora include source language corpora and target language corpora, and the two can be interchanged during training to obtain translation models of different directions.

[0080] For example, if the source language corpus in the parallel corpus is Chinese and the target language corpus is English, a Chinese-English translation model can be trained; if the source language corpus in the parallel corpus is English and the target language corpus is Chinese, an English-Chinese translation model can be trained.

[0081] In practice, translation models can be trained according to the language family type to obtain an array of translation models with the same language family.

[0082] In the embodiments of this specification, the translation model array can be translation models with the same language family. Specifically, the translation models with the same language family can include English-German translation models and German-English translation models, English-French translation models and French-English translation models.

[0083] Using the trained translation model array described above, the first corpus can be translated to obtain multiple translation results. In the process of obtaining multiple translation results, the first corpus can be decoded and translated to obtain multiple first translation results in a preset language type. Then, these multiple first translation results in the preset language type can be directly translated at least once to obtain multiple translation results in the target language type.

[0084] The following, in conjunction with the accompanying drawings, details the process of translating the first corpus and obtaining multiple translation results in the embodiments of this specification through specific examples.

[0085] Referring to Figure 2, which shows a flowchart of translating a first corpus to obtain multiple translation results in an embodiment of this specification, each translation model in the translation model array includes an encoding layer and a decoding layer.

[0086] The first corpus is input into a pre-trained translation model array for translation, resulting in multiple translation results. This can be performed as follows:

[0087] S21, the first corpus is input into the encoding layer of the first translation model in the translation model array for encoding to obtain the corresponding vector matrix.

[0088] Specifically, by inputting the first corpus into the encoding layer of the first translation model, each word in the first corpus can be encoded to obtain the vector matrix corresponding to the word.

[0089] S22, the vector matrix is input to the decoding layer of the first translation model, the vector matrix is decoded and translated, and multiple first translation results of the preset language type are obtained according to the preset algorithm.

[0090] Specifically, when the decoding layer of the first translation model processes the vector matrix, in addition to translating the vector matrix, it also decodes the translation result and performs corresponding operations on the decoded translation result according to a preset algorithm to obtain multiple first translation results of a preset language type.

[0091] In the embodiments of this specification, the preset algorithm may be a beam search algorithm. When applying this algorithm, the corresponding beam size can be set according to actual needs. By setting different beam sizes, different numbers of preset language types of first translation results can be obtained.

[0092] S23, the multiple first translation results are respectively input to other translation models in the translation model array, and the first translation results are translated accordingly to obtain multiple translation results of the target language type.

[0093] Specifically, after the above steps, multiple first translation results can be obtained. These multiple first translation results can be sequentially input into other translation models for at least one direct translation to obtain multiple translation results for the target language type.

[0094] As a specific example, if the translation model array includes two translation models, where the first translation model is an English-French translation model and the second translation model is a French-English translation model, and the first corpus is English, the English-French translation model can decode and translate the first corpus to obtain multiple French corpora, and then the multiple French corpora are sequentially input into the French-English translation model for direct translation to obtain multiple English corpora with the same meaning as the first corpus.
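The English-French-English example can be sketched as a fan-out over the model array, where every candidate produced by one model is fed to the next. The lookup-table "models" below are hypothetical stand-ins for trained translation models, and the sentences are invented for illustration.

```python
from typing import Callable, Dict, List

Translator = Callable[[str], List[str]]


def table_translator(table: Dict[str, List[str]]) -> Translator:
    """Hypothetical stand-in for a trained model: look up candidate translations."""
    return lambda text: table.get(text, [])


def round_trip(first_corpus: str, models: List[Translator]) -> List[str]:
    """Pass every candidate through each model in the preset order."""
    candidates = [first_corpus]
    for translate in models:
        next_candidates: List[str] = []
        for text in candidates:
            for output in translate(text):
                if output not in next_candidates:  # drop duplicate candidates
                    next_candidates.append(output)
        candidates = next_candidates
    return candidates


# Toy English-French model that returns two French candidates per input.
en_fr = table_translator({"I am happy": ["Je suis heureux", "Je suis content"]})
# Toy French-English model that returns one English candidate per input.
fr_en = table_translator({
    "Je suis heureux": ["I am happy"],
    "Je suis content": ["I am glad"],
})

paraphrases = round_trip("I am happy", [en_fr, fr_en])
```

In this toy run one of the two round-trip results reproduces the first corpus exactly, which is the case the edit distance step in S13 is meant to filter out.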

[0095] By using the above method, multiple first translation results can be obtained by setting the parameters of the encoding layer of the translation model. Other translation models can then translate these first translation results accordingly, resulting in multiple translation results for the target language type. Furthermore, by setting the parameters of the encoding layers of other translation models, even more multiple translation results for the target language type can be obtained, thus improving the efficiency of obtaining parallel corpora.

[0096] In practice, decoding and translating the first corpus yields multiple word units, each with a different weight value. Therefore, it is necessary to select appropriate word units to facilitate subsequent combination of these units and obtain a first translation result with the same predefined language type and meaning.

[0097] Referring to Figure 3, which shows a flowchart of a specific application scenario for translating a first corpus to obtain multiple translation results in an embodiment of this specification, the specific steps are as follows:

[0098] S31, input the vector matrix corresponding to the first corpus and the set of start identifiers into the encoding layer of the first translation model to obtain the first target vector matrix.

[0099] In practical implementation, assuming the vector matrix corresponding to the first corpus is X, then X and the set of start identifiers [<start>, <start>, …, <start>] are input together into the encoding layer of the corresponding translation model to determine the corpus that needs to be encoded.

[0100] S32, the first target vector matrix is input to the decoding layer of the first translation model, the first target vector matrix is decoded and translated to obtain multiple word units, and after normalization, the probability value of each word unit is obtained.

[0101] In practice, after decoding and translation, multiple word units can be obtained. After applying a softmax function to the multiple word units, their corresponding probability values can be obtained.

[0102] S33. Based on the probability values of each word unit and according to the preset algorithm, select a first preset number of word units to obtain a first target word unit set.

[0103] Specifically, if the preset algorithm is a beam search algorithm with a beam width of 8 (i.e., the first preset number is 8), then the 8 word units with the highest probability values are selected from the multiple word units.
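Steps S32 and S33 can be illustrated with a small sketch: raw decoder scores are normalized with softmax, and the word units with the highest probabilities are kept. The score values and the function names are invented for this example, and a width of 2 is used instead of 8 to keep it short.

```python
import math
from typing import Dict, List, Tuple


def softmax(scores: Dict[str, float]) -> Dict[str, float]:
    """Normalize raw scores into probabilities that sum to 1."""
    highest = max(scores.values())  # subtracted for numerical stability
    exps = {unit: math.exp(score - highest) for unit, score in scores.items()}
    total = sum(exps.values())
    return {unit: value / total for unit, value in exps.items()}


def top_k(probs: Dict[str, float], k: int) -> List[Tuple[str, float]]:
    """Return the k word units with the highest probability values."""
    return sorted(probs.items(), key=lambda item: item[1], reverse=True)[:k]


# Hypothetical decoder scores for three candidate word units.
probs = softmax({"le": 2.0, "la": 1.0, "un": 0.0})
first_target_units = [unit for unit, _ in top_k(probs, k=2)]
```

Increasing `k` corresponds to a larger beam width and therefore to more candidate word units carried into the next decoding step.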

[0104] S34, the vector matrix corresponding to the first corpus and the first target word unit set are input into the decoding layer of the translation model, and multiple prediction results are obtained according to the preset algorithm.

[0105] S35, select the first preset number of prediction results based on the probability values of the multiple prediction results.

[0106] In practice, since different word units have different probability values, the multiple prediction results obtained also have different probability values. The n prediction results with the highest probability values can be obtained from the multiple prediction results in accordance with step S33, where n is greater than or equal to 1.

[0107] S36, perform iterative prediction on the prediction result until the prediction result shows an end identifier, and obtain multiple first translation results for the other language types.

[0108] Using the above method, multiple first translation results can be obtained. After performing at least one direct translation on the first translation results, multiple translation results of the target language type can be obtained.
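The iterative procedure of steps S31 to S36 resembles a standard beam search, which can be sketched as follows. This is a generic textbook version under stated assumptions, not necessarily the exact procedure of the embodiment: `next_probs` is a hypothetical callback standing in for the decoding layer plus softmax, the encoded first corpus is assumed to be captured inside it, and `toy_model` is an invented probability table.

```python
import math
from typing import Callable, Dict, List, Tuple

NextProbs = Callable[[Tuple[str, ...]], Dict[str, float]]


def beam_search(next_probs: NextProbs, beam_size: int, max_len: int = 20,
                start: str = "<start>", end: str = "<end>"
                ) -> List[Tuple[List[str], float]]:
    """Return up to beam_size finished sequences with their log-probabilities."""
    beams: List[Tuple[Tuple[str, ...], float]] = [((start,), 0.0)]
    finished: List[Tuple[Tuple[str, ...], float]] = []
    for _ in range(max_len):
        # Extend every live prefix by every possible next word unit.
        expanded = [
            (seq + (unit,), logp + math.log(p))
            for seq, logp in beams
            for unit, p in next_probs(seq).items()
            if p > 0
        ]
        if not expanded:
            break
        expanded.sort(key=lambda item: item[1], reverse=True)
        beams = []
        for seq, logp in expanded[:beam_size]:
            # A prefix that reached the end identifier is complete.
            (finished if seq[-1] == end else beams).append((seq, logp))
        if not beams:
            break
    finished.sort(key=lambda item: item[1], reverse=True)
    return [(list(seq[1:-1]), logp) for seq, logp in finished[:beam_size]]


def toy_model(prefix: Tuple[str, ...]) -> Dict[str, float]:
    """Invented next-unit distribution used only to exercise the search."""
    table = {
        ("<start>",): {"je": 0.6, "moi": 0.4},
        ("<start>", "je"): {"suis": 0.9, "<end>": 0.1},
        ("<start>", "moi"): {"<end>": 1.0},
        ("<start>", "je", "suis"): {"<end>": 1.0},
    }
    return table.get(prefix, {"<end>": 1.0})


results = beam_search(toy_model, beam_size=2)
```

With `beam_size=2` the toy run keeps two complete sequences, so a single input yields two first translation results; a beam size of 1 would reduce this to greedy decoding with one result.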

[0109] In some other embodiments of this specification, other trained models may perform multiple decoding and translation processes on the first translation result to obtain more intermediate results, and the last translation model may output multiple translation results of the target type.

[0110] In specific implementation, in addition to using the first translation model in the translation model array to decode and translate the first corpus, the first translation model can also directly translate the first corpus to obtain a second translation result of a preset language type. Then, the second translation result is decoded and translated at least once to obtain multiple translation results of the target language type.

[0111] Specifically, each translation model in the translation model array includes an encoding layer and a decoding layer;

[0112] The first corpus is input into a pre-trained translation model array for translation, resulting in multiple translation results, including:

[0113] The first corpus is input into the first translation model in the translation model array, and the first corpus is translated to obtain a second translation result of a preset language type;

[0114] The second translation result is input into the encoding layer of at least one other translation model in the translation model array to obtain the corresponding vector matrix;

[0115] The vector matrix is input into the decoding layer of the at least one other translation model, and the vector matrix is subjected to corresponding decoding and translation processing. According to a preset algorithm, multiple translation results of the target language type are obtained.
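The flow of paragraphs [0113] to [0115] can be sketched as a chain of models applied in their preset order. The sketch below is illustrative only: the `Translator` signature, the function `round_trip`, and the two lookup-table "models" are hypothetical, and a real translation model would replace each table.

```python
from typing import Callable, List, Sequence

# Assumed interface: a model maps (sentence, n) to its n best translations.
Translator = Callable[[str, int], List[str]]


def round_trip(source: str, models: Sequence[Translator], n_best: int) -> List[str]:
    """Run `source` through `models` in their preset order.

    Every model except the last contributes only its single best output
    (the "second translation result"); the last model returns `n_best`
    candidates in the target language type.
    """
    if not models:
        raise ValueError("the translation model array must not be empty")
    text = source
    for model in models[:-1]:
        text = model(text, 1)[0]
    return models[-1](text, n_best)


# Hypothetical lookup tables standing in for trained models.
def toy_en_fr(text: str, n: int) -> List[str]:
    table = {
        "Now, look at the salad menu.": ["Maintenant, regardez le menu des salades."],
    }
    return table[text][:n]


def toy_fr_en(text: str, n: int) -> List[str]:
    table = {
        "Maintenant, regardez le menu des salades.": [
            "Take a look at the salad menu now.",
            "Now, check out the salad menu.",
            "Now watch the salad menu.",
        ],
    }
    return table[text][:n]


candidates = round_trip(
    "Now, look at the salad menu.", [toy_en_fr, toy_fr_en], n_best=2
)
print(candidates)
```

Passing a longer list of models would correspond to the embodiments in which the intermediate result is translated more than once before the last model produces the target-language candidates.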

[0116] For the encoding, decoding, and translation processes, refer to the description relating to Figure 2, which is not repeated here. The difference is that here it is the second translation result obtained by the first translation model that is encoded and then decoded and translated, yielding multiple translation results of the target language type; the processing procedure is otherwise the same.

[0117] As previously mentioned, decoding and translating the corpus yields multiple word units, each with a different weight value. Therefore, it is necessary to select appropriate word units to facilitate subsequent combination of these units, resulting in translations of the same target language type and meaning. The specific process includes:

[0118] The vector matrix corresponding to the second translation result and the set of start identifiers are input into the decoding layer of the at least one other translation model to obtain the second target vector matrix;

[0119] The second target vector matrix is input into the decoding layer of the at least one other translation model, and the second target vector matrix is subjected to corresponding decoding and translation processing to obtain multiple word units. After normalization, the probability value of each word unit is obtained;

[0120] Based on the probability values of the word units and according to the preset algorithm, a second preset number of word units are selected to obtain the second target word unit set;

[0121] The vector matrix corresponding to the second translation result and the second target word unit set are input into the decoding layer of the at least one other translation model, and multiple prediction results are obtained according to the preset algorithm;

[0122] The prediction results are iteratively predicted until an end identifier is found, so as to obtain multiple translation results for the target language type.
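The normalization and selection in paragraphs [0119] and [0120] can be sketched as follows. The specification does not name the normalization function; softmax is assumed here because it is the usual choice for decoder outputs, and the raw scores in `raw_scores` are invented for illustration.

```python
import math
from typing import Dict, List, Tuple


def softmax(scores: Dict[str, float]) -> Dict[str, float]:
    """Normalize raw decoder scores into probabilities that sum to 1."""
    peak = max(scores.values())  # subtract the maximum for numerical stability
    exps = {unit: math.exp(s - peak) for unit, s in scores.items()}
    total = sum(exps.values())
    return {unit: e / total for unit, e in exps.items()}


def top_k_units(scores: Dict[str, float], k: int) -> List[Tuple[str, float]]:
    """Return the k word units with the highest probability values."""
    probs = softmax(scores)
    return sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]


# Hypothetical raw scores for candidate word units at one decoding step.
raw_scores = {"Take": 2.0, "Now": 1.5, "Look": 1.0, "View": 0.5, "Salad": -1.0}
selected = top_k_units(raw_scores, k=4)
print([unit for unit, _ in selected])
```

Here `k` plays the role of the second preset number; with a beam search of width 8 it would be set to 8.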

[0123] For a detailed explanation of the process of obtaining multiple translation results of the target language type, refer to the description relating to Figure 3, which is not repeated here. The difference is that the technical solution of Figure 3 decodes and translates the first corpus to obtain a corpus of a preset language type, whereas the present technical solution decodes and translates the intermediate result obtained by translating the first corpus to obtain translation results of the target language type; the processing procedures of the two are the same.

[0124] Using the above method, a second translation result can be obtained. By performing at least one pass of decoding and translation processing on the second translation result, multiple translation results of the target language type can be obtained.

[0125] In some other embodiments of this specification, the second translation result may be decoded and translated multiple times by at least one other translation model in the translation model array, and the last translation model may output multiple translation results of the target language type.

[0126] By employing at least two of the above methods, multiple translation results for the target language type can be obtained respectively. At this point, a suitable translation result can be selected as the target corpus based on the edit distance between the multiple translation results and the first corpus.

[0127] In practice, the edit distance between each word in each of the multiple translation results and each word in the first corpus can be calculated separately.

[0128] Based on the edit distance, the translation results that meet the preset conditions are selected as the target corpus corresponding to the first corpus.

[0129] Specifically, if eight translation results are obtained, the edit distance between the words of each of these eight translation results and the words of the first corpus is calculated, and each translation result that meets the preset conditions is selected as target corpus. The preset conditions may include that the edit distance is no less than a preset edit distance. For example, if the preset edit distance is 2, any translation result whose edit distance from the first corpus is greater than or equal to 2 can be used as a final translation result of the target language type.
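One common reading of the word-level edit distance is the Levenshtein distance computed over word sequences; the specification does not fix the exact definition, so the recurrence below is an assumed formalization. Let a = (a_1, ..., a_m) be the words of the first corpus, b = (b_1, ..., b_n) the words of one translation result, [a_i ≠ b_j] equal to 1 when the two words differ and 0 otherwise, and τ the preset edit distance (2 in the example above).

```latex
\[
D_{i,0} = i, \qquad D_{0,j} = j, \qquad
D_{i,j} = \min\bigl\{\, D_{i-1,j} + 1,\; D_{i,j-1} + 1,\; D_{i-1,j-1} + [\,a_i \neq b_j\,] \,\bigr\},
\]
\[
d(a, b) = D_{m,n}, \qquad \text{select } b \text{ as target corpus if } d(a, b) \ge \tau .
\]
```

The three terms inside the minimum correspond to deleting a word, inserting a word, and substituting (or keeping) a word, respectively.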

[0130] To facilitate understanding, the following specific example will be used to explain in detail the process of translating the first corpus using a pre-trained translation model array to obtain multiple translation results in the embodiments of this specification.

[0131] For ease of explanation, let's assume the first corpus S is: {Now, look at the salad menu.}, and the pre-trained translation model array includes an English-French translation model and a French-English translation model.

[0132] After segmenting the first corpus S: {Now, look at the salad menu.}, it is input into the English-French translation model to obtain a translation result t-fr: {Maintenant, regardez le menu des salades.}.

[0133] After performing BPE encoding on the translation result t-fr, we obtain {Maintenant, regar@@ dez le menu des sal@@ ades}, and input this sentence into the encoding layer of the French-English translation model, which outputs the corresponding vector matrix O1.
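The `@@` continuation marker used in the BPE-encoded sentence can be illustrated with a small sketch. Real BPE learns merge rules from training data; the code below substitutes a greedy longest-match split over a hand-written vocabulary, so `segment_word`, `encode`, `decode`, and the vocabulary are all hypothetical. Separating pieces with a space after `@@` follows the common subword-nmt convention and is an assumption here.

```python
from typing import List, Set


def segment_word(word: str, vocab: Set[str]) -> List[str]:
    """Split `word` left to right into the longest pieces found in `vocab`,
    falling back to single characters when nothing matches."""
    pieces: List[str] = []
    i = 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j] in vocab or j == i + 1:
                pieces.append(word[i:j])
                i = j
                break
    return pieces


def encode(sentence: str, vocab: Set[str]) -> str:
    """Mark every non-final piece of a word with the '@@' continuation marker."""
    out: List[str] = []
    for word in sentence.split():
        pieces = segment_word(word, vocab)
        out.extend(piece + "@@" for piece in pieces[:-1])
        out.append(pieces[-1])
    return " ".join(out)


def decode(encoded: str) -> str:
    """Undo the segmentation by joining marked pieces with their successors."""
    return encoded.replace("@@ ", "")


# Hypothetical subword vocabulary, chosen so that "regardez" splits in two.
vocab = {"regar", "dez", "le", "menu"}
encoded = encode("regardez le menu", vocab)
print(encoded)
print(decode(encoded))
```

Because decoding only removes the marker and the following space, the segmentation is lossless, which is what allows the encoded sentence to be fed to the encoding layer and the decoded output to be read as ordinary text.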

[0134] The vector matrix O1 and a set initialized with start identifiers [<start>, <start>, ..., <start>] are input into the encoding layer of the French-English translation model to obtain the target vector matrix O2. The target vector matrix O2 is then input into the decoding layer of the French-English translation model, and the beam search algorithm is used to select the top beam-size word units from the multiple word units obtained after decoding and translation. For example, the following word unit set can be obtained: ["Take", "Now", "Look", "View", ...].

[0135] The vector matrix O1 and the vector matrices corresponding to each word unit in the word unit set ["Take", "Now", "Look", "View", ...] are input into the decoding layer of the French-English translation model, and the beam search algorithm is used to keep the top beam-size results, giving multiple prediction results ["Take a", "Now", "Now check", "Now watch", ...].

[0136] The multiple prediction results ["Take a", "Now", "Now check", "Now watch", ...] are iteratively predicted, and the beam width of the beam search algorithm is set to 8 to obtain 8 prediction results ["Take a look at the salad menu now.", "Now, look at the salad menu.", "Now, check out the salad menu.", "Now watch the salad menu.", "Now look at the salad menu", "Now take a look at the salad menu", "Now look at the menu of salads", "Now, check the salad menu"].

[0137] Calculate the edit distance between each of the eight prediction results and the first corpus S: {Now, look at the salad menu.}. For example, the edit distance between "Take a look at the salad menu now." and {Now, look at the salad menu.} is 3, the edit distance between "Now, check out the salad menu." and {Now, look at the salad menu.} is 2, and so on. The edit distance values are calculated in turn, and the four sentences with the largest edit distance values are selected as the target corpus.
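The selection in this example can be reproduced with a short sketch. The tokenization (lowercasing and stripping punctuation before splitting on whitespace) is an assumption, chosen because it yields the distances 3 and 2 stated above; the function names are hypothetical, and ties between equal distances are broken by original order, which the specification does not prescribe.

```python
import string
from typing import List, Sequence, Tuple


def tokenize(sentence: str) -> List[str]:
    """Lowercase, strip punctuation, and split on whitespace (assumed scheme)."""
    table = str.maketrans("", "", string.punctuation)
    return sentence.lower().translate(table).split()


def edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Word-level Levenshtein distance using two rolling rows."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,              # delete x
                            curr[j - 1] + 1,          # insert y
                            prev[j - 1] + (x != y)))  # substitute or keep
        prev = curr
    return prev[-1]


def select_most_different(
    source: str, candidates: Sequence[str], k: int
) -> List[Tuple[int, str]]:
    """Return the k candidates with the largest edit distance from `source`."""
    src = tokenize(source)
    scored = [(edit_distance(src, tokenize(c)), c) for c in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)  # stable sort keeps input order on ties
    return scored[:k]


S = "Now, look at the salad menu."
PREDICTIONS = [
    "Take a look at the salad menu now.",
    "Now, look at the salad menu.",
    "Now, check out the salad menu.",
    "Now watch the salad menu.",
    "Now look at the salad menu",
    "Now take a look at the salad menu",
    "Now look at the menu of salads",
    "Now, check the salad menu",
]

top4 = select_most_different(S, PREDICTIONS, k=4)
for distance, sentence in top4:
    print(distance, sentence)
```

Under this tokenization the candidates identical to S apart from punctuation receive distance 0 and are never selected, which matches the purpose of preferring rephrasings over near-copies.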

[0138] It will be understood that other translation models can also be used to decode and translate the first corpus to obtain more target corpus.

[0139] For example, the pre-trained translation model array includes an English-German translation model and a German-English translation model. Using the English-German translation model and the German-English translation model to translate the first corpus S: {Now, look at the salad menu.}, eight prediction results can be obtained: ["Check out the salad menu now.", "Look at the salad menu now.", "Check out the salad menu right now.", "Now take a look at the salad menu.", "Now look at the salad menu", "Take a look at the salad menu", "Just look at the salad menu now", "Have a look at the salad menu"].

[0140] Next, the edit distance between each of these 8 prediction results and the first corpus S: {Now, look at the salad menu.} is calculated, and the 4 sentences with the largest edit distance values are selected as the target corpus.

[0141] In other embodiments of this specification, the first corpus S: {Now, look at the salad menu.} is translated using an English-German translation model and a German-English translation model, yielding 8 prediction results; the first corpus S: {Now, look at the salad menu.} is also translated using an English-French translation model and a French-English translation model, yielding 8 prediction results. The edit distances between these 16 prediction results and the first corpus S: {Now, look at the salad menu.} can be calculated, and the 8 sentences with the largest edit distance values are selected as the target corpus.

[0142] Accordingly, embodiments of this specification also provide apparatus corresponding to the above-described corpus generation method, which will be described in detail below with reference to the accompanying drawings and specific embodiments.

[0143] Referring to Figure 4, which shows a structural schematic diagram of a corpus generation device in one embodiment of this specification, in some embodiments of this specification the corpus generation device 40 may include:

[0144] Corpus acquisition unit 41 is suitable for acquiring the first corpus of the target language type;

[0145] Translation unit 42 is adapted to input the first corpus into a pre-trained translation model array for translation, and obtain multiple translation results;

[0146] The translation model array includes a translation model for translating the first corpus into corpora of other language types, and a translation model for translating the corpora of other language types into the target language type, and the translation models in the translation model array are set in a preset order;

[0147] The processing unit 43 is adapted to calculate the edit distance between the plurality of translation results and the first corpus, so as to obtain the target corpus corresponding to the first corpus.
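The division of the device into units 41, 42, and 43 can be pictured as three pluggable components. The class below is only an illustrative sketch: the class name, field names, and the lambda stand-ins are hypothetical, and the placeholder passed as `process` merely drops exact copies rather than computing an edit distance.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class CorpusGenerationDevice:
    """Hypothetical composition mirroring units 41, 42, and 43."""

    acquire: Callable[[], str]                      # corpus acquisition unit 41
    translate: Callable[[str], List[str]]           # translation unit 42
    process: Callable[[str, List[str]], List[str]]  # processing unit 43

    def run(self) -> List[str]:
        first_corpus = self.acquire()
        results = self.translate(first_corpus)
        return self.process(first_corpus, results)


device = CorpusGenerationDevice(
    acquire=lambda: "Now, look at the salad menu.",
    translate=lambda s: [s, "Now, check out the salad menu."],
    # Placeholder for the edit-distance selection performed by unit 43.
    process=lambda s, rs: [r for r in rs if r != s],
)
target_corpus = device.run()
print(target_corpus)
```

Keeping the three units as separate callables reflects the description, in which the translation model array and the selection rule can each be varied independently.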

[0148] By using the aforementioned corpus generation device, and by translating a first corpus of a target language type using a pre-trained translation model array, multiple translation results of the target language type can be obtained. Furthermore, by calculating the edit distance between the multiple translation results and the first corpus, target corpus with the same meaning as the first corpus can be obtained. Compared to obtaining parallel corpus with the same meaning and language type manually, this method can improve the efficiency of obtaining parallel corpus.

[0149] This specification also provides an electronic device for generating multiple corpora with the same meaning and language type. As shown in Figure 5, the electronic device 50 may include a memory 51 and a processor 52, wherein the memory 51 is adapted to store one or more computer instructions, and the processor 52, when running the computer instructions, executes the steps of the corpus generation method described in any of the foregoing embodiments.

[0150] In specific implementation, as shown in Figure 5, the electronic device 50 may also include an expansion interface 53, which is adapted to connect with other devices for data interaction.

[0151] Specifically, electronic device 50 can be a general-purpose or special-purpose computer device, or more specifically, a server or computer terminal, such as a personal computer or portable terminal device.

[0152] In a practical implementation, the memory 51, processor 52, and expansion interface 53 can be connected via a bus.

[0153] In specific implementations, the processor can be implemented by processing chips such as a central processing unit (CPU), a graphics processing unit (GPU), or a field-programmable gate array (FPGA), or by an application-specific integrated circuit (ASIC) or one or more integrated circuits configured to implement the embodiments of this specification.

[0154] The memory may include random access memory (RAM) or non-volatile memory (NVM), such as at least one disk storage device.

[0155] This invention also provides a computer-readable storage medium storing computer instructions, which, when executed, can perform the steps of the corpus generation method described in any of the above embodiments of this invention. The computer-readable storage medium can be any suitable readable storage medium such as an optical disc, a hard disk drive, or a solid-state drive. The instructions stored on the computer-readable storage medium execute the steps of the corpus generation method described in any of the above embodiments; for details, please refer to the above embodiments, which will not be repeated here.

[0156] The computer-readable storage medium may include, for example, any suitable type of memory cell, memory device, memory article, memory medium, storage device, storage article, storage medium and / or storage cell, such as memory, removable or non-removable medium, erasable or non-erasable medium, writable or rewritable medium, digital or analog medium, hard disk, floppy disk, optical disc read-only memory (CD-ROM), recordable optical disc (CD-R), rewritable optical disc (CD-RW), optical disc, magnetic medium, magneto-optical medium, removable memory card or disk, various types of digital universal optical disc (DVD), magnetic tape, cassette tape, etc.

[0157] Computer instructions may include any suitable type of code implemented using any appropriate high-level, low-level, object-oriented, visual, compiled, and / or interpreted programming language, such as source code, compiled code, interpreted code, executable code, static code, dynamic code, encrypted code, etc.

[0158] While the embodiments disclosed in this specification are as described above, they are not limited thereto. Any person skilled in the art can make various modifications and alterations without departing from the spirit and scope of the embodiments in this specification. Therefore, the scope of protection of the embodiments in this specification should be determined by the scope defined in the claims.

Claims

1. A corpus generation method, characterized in that it comprises: obtaining a first corpus of a target language type; inputting the first corpus into a pre-trained translation model array for translation to obtain multiple translation results, wherein the translation model array includes a translation model for translating the first corpus into a corpus of another language type and a translation model for translating the corpus of the other language type into the target language type, the translation models in the translation model array are arranged in a preset order, and each translation model in the translation model array includes an encoding layer and a decoding layer; wherein inputting the first corpus into the pre-trained translation model array for translation to obtain multiple translation results includes: inputting the first corpus into the encoding layer of the first translation model in the translation model array for encoding to obtain a corresponding vector matrix; inputting the vector matrix into the decoding layer of the first translation model, performing decoding and translation processing on the vector matrix, and obtaining multiple first translation results of a preset language type according to a preset algorithm; and inputting the multiple first translation results into the other translation models in the translation model array to translate the first translation results accordingly, thereby obtaining multiple translation results of the target language type; wherein inputting the vector matrix into the decoding layer of the first translation model, performing decoding and translation processing on the vector matrix, and obtaining multiple first translation results of a preset language type according to a preset algorithm includes: inputting the vector matrix corresponding to the first corpus and a set of start identifiers into the encoding layer of the first translation model to obtain a first target vector matrix; inputting the first target vector matrix into the decoding layer of the first translation model and performing decoding and translation processing on the first target vector matrix to obtain multiple word units, and obtaining the probability value of each word unit after normalization; selecting a first preset number of word units based on the probability values of the word units and according to a beam search algorithm to obtain a first target word unit set; inputting the vector matrix corresponding to the first corpus and the first target word unit set into the decoding layer of the translation model, and obtaining multiple prediction results according to the beam search algorithm; selecting the first preset number of prediction results based on the probability values of the multiple prediction results; and iteratively predicting the prediction results until an end identifier appears, to obtain multiple first translation results of the other language type; and calculating the edit distance between the multiple translation results and the first corpus to obtain a target corpus corresponding to the first corpus.

2. The corpus generation method according to claim 1, characterized in that at least one translation model in the translation model array generates, for any input corpus, output corpora with at least two expressions.

3. The corpus generation method according to claim 1, characterized in that each translation model in the translation model array includes an encoding layer and a decoding layer; and inputting the first corpus into the pre-trained translation model array for translation to obtain multiple translation results further includes: inputting the first corpus into the first translation model in the translation model array and translating the first corpus to obtain a second translation result of a preset language type; inputting the second translation result into the encoding layer of at least one other translation model in the translation model array to obtain a corresponding vector matrix; and inputting the vector matrix into the decoding layer of the at least one other translation model, performing corresponding decoding and translation processing on the vector matrix, and obtaining multiple translation results of the target language type according to a preset algorithm.

4. The corpus generation method according to claim 3, characterized in that inputting the vector matrix into the decoding layer of the at least one other translation model, performing corresponding decoding and translation processing on the vector matrix, and obtaining multiple translation results of the target language type according to a preset algorithm includes: inputting the vector matrix corresponding to the second translation result and a set of start identifiers into the decoding layer of the at least one other translation model to obtain a second target vector matrix; inputting the second target vector matrix into the decoding layer of the at least one other translation model and performing corresponding decoding and translation processing on the second target vector matrix to obtain multiple word units, and obtaining the probability value of each word unit after normalization; selecting a second preset number of word units based on the probability values of the word units and according to the preset algorithm to obtain a second target word unit set; inputting the vector matrix corresponding to the second translation result and the second target word unit set into the decoding layer of the at least one other translation model, and obtaining multiple prediction results according to the preset algorithm; and iteratively predicting the prediction results until an end identifier appears, to obtain multiple translation results of the target language type.

5. The corpus generation method according to claim 1, characterized in that calculating the edit distance between the multiple translation results and the first corpus to obtain the target corpus corresponding to the first corpus includes: calculating the edit distance between each word in each of the multiple translation results and each word in the first corpus; and selecting, based on the edit distance, the translation results that meet preset conditions as the target corpus corresponding to the first corpus.

6. The corpus generation method according to any one of claims 1 to 5, characterized in that the translation model array consists of translation models belonging to the same language family.

7. A corpus generation device, characterized in that it comprises: a corpus acquisition unit, adapted to acquire a first corpus of a target language type; a translation unit, adapted to input the first corpus into a pre-trained translation model array for translation to obtain multiple translation results, wherein the translation model array includes a translation model for translating the first corpus into a corpus of another language type and a translation model for translating the corpus of the other language type into the target language type, the translation models in the translation model array are arranged in a preset order, and each translation model in the translation model array includes an encoding layer and a decoding layer; wherein inputting the first corpus into the pre-trained translation model array for translation to obtain multiple translation results includes: inputting the first corpus into the encoding layer of the first translation model in the translation model array for encoding to obtain a corresponding vector matrix; inputting the vector matrix into the decoding layer of the first translation model, performing decoding and translation processing on the vector matrix, and obtaining multiple first translation results of a preset language type according to a preset algorithm; and inputting the multiple first translation results into the other translation models in the translation model array to translate the first translation results accordingly, thereby obtaining multiple translation results of the target language type; wherein inputting the vector matrix into the decoding layer of the first translation model, performing decoding and translation processing on the vector matrix, and obtaining multiple first translation results of a preset language type according to a preset algorithm includes: inputting the vector matrix corresponding to the first corpus and a set of start identifiers into the encoding layer of the first translation model to obtain a first target vector matrix; inputting the first target vector matrix into the decoding layer of the first translation model and performing decoding and translation processing on the first target vector matrix to obtain multiple word units, and obtaining the probability value of each word unit after normalization; selecting a first preset number of word units based on the probability values of the word units and according to a beam search algorithm to obtain a first target word unit set; inputting the vector matrix corresponding to the first corpus and the first target word unit set into the decoding layer of the translation model, and obtaining multiple prediction results according to the beam search algorithm; selecting the first preset number of prediction results based on the probability values of the multiple prediction results; and iteratively predicting the prediction results until an end identifier appears, to obtain multiple first translation results of the other language type; and a processing unit, adapted to calculate the edit distance between the multiple translation results and the first corpus to obtain a target corpus corresponding to the first corpus.

8. An electronic device comprising a memory and a processor, wherein the memory is adapted to store one or more computer instructions, characterized in that the processor, when running the computer instructions, performs the steps of the corpus generation method according to any one of claims 1 to 6.

9. A computer-readable storage medium storing computer instructions, characterized in that the computer instructions, when executed, perform the steps of the corpus generation method according to any one of claims 1 to 6.