A training method and device of a general translation model, a computer device, a medium and a program product
By adding semantic labels and adjusting model parameters during the training process of the general translation model, the off-target problem of the general translation model is solved, and the translation accuracy and efficiency are improved.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- TENCENT TECHNOLOGY (SHENZHEN) CO LTD
- Filing Date
- 2024-12-26
- Publication Date
- 2026-06-26
AI Technical Summary
General translation models are prone to off-target issues, resulting in the translation text containing non-target language text. Existing post-processing operations are difficult to effectively solve this problem, affecting translation accuracy.
During training, semantic labels are added to language sample pairs, translation is performed using an initial general translation model, and model parameters are adjusted based on the differences between the translated text and the target text to train a general translation model.
It improves translation accuracy, reduces the probability of off-target problems, shortens the translation cycle, and reduces the cost of manual correction.
Figure CN122287652A_ABST
Description
Technical Field
[0001] This application relates to the field of computer technology, and in particular to a training method, apparatus, computer device, medium, and program product for a general translation model.
Background Technology
[0002] A general translation model is capable of translating text in multiple languages, such as translating Chinese to English and English to Spanish. While general translation models are highly convenient, they are prone to off-target translation problems. Off-target translation occurs when the target language is not treated as the intended language for translation, resulting in the inclusion of text in other languages in the final translation.
[0003] In related technologies, post-processing is typically employed to reduce the probability of off-target errors in general translation models. Specifically, scanning rules are pre-set based on the off-target problem, and the translation text is scanned using character sets from other languages based on these rules. If text belonging to other languages is detected in the translation text, it is manually corrected to improve translation accuracy.
[0004] However, the off-target problem is not limited to the character set of a single language. For example, when the target language is English, the translated text may contain text from several other languages, such as Chinese, German, and French. This makes the scanning rules in the above post-processing operations difficult to specify, so off-target text can go undetected, which results in lower translation accuracy.
Summary of the Invention
[0005] To address the aforementioned technical problems, this application provides a training method, apparatus, computer equipment, medium, and program product for a general translation model, which reduces the probability of off-target problems and improves translation accuracy.
[0006] The embodiments of this application disclose the following technical solutions:
[0007] In one aspect, embodiments of this application provide a training method for a general translation model, the method comprising:
[0008] Obtain target language sample pairs that include target semantic tags. Each target language sample pair includes source text belonging to the source language and target text belonging to the target language. The source text and the target text express the same semantics but belong to different languages. The semantics of the target semantic tag is to translate the source text into text belonging to the target language.
[0009] Based on the target semantic tags and the source text, translation is performed using an initial general translation model to obtain translated text. The translated text is the text obtained by translating the source text based on the target language indicated by the target semantic tags. The initial general translation model is used to translate texts in multiple languages.
[0010] Based on the differences between the translated text and the target text, the model parameters of the initial general translation model are adjusted to obtain a general translation model.
[0011] In another aspect, embodiments of this application provide a training device for a general translation model, the device including an acquisition unit, a translation unit, and an adjustment unit;
[0012] The acquisition unit is used to acquire target language sample pairs including target semantic tags. The language sample pairs include source text belonging to the source language and target text belonging to the target language. The source text and the target text are texts that express the same semantics but belong to different languages. The target semantic tags have the semantics to translate the source text into text belonging to the target language.
[0013] The translation unit is used to translate based on the target semantic tag and the source text using an initial general translation model to obtain translated text. The translated text is the text obtained by translating the source text based on the target language indicated by the target semantic tag. The initial general translation model is used to translate texts in multiple languages.
[0014] The adjustment unit is used to adjust the model parameters of the initial general translation model according to the difference between the translated text and the target text, so as to obtain a general translation model.
[0015] In another aspect, embodiments of this application provide a computer device, the computer device including a processor and a memory:
[0016] The memory is used to store computer programs and to transfer the computer programs to the processor;
[0017] The processor is configured to execute the methods described above according to instructions in the computer program.
[0018] In another aspect, embodiments of this application provide a computer-readable storage medium for storing a computer program for performing the methods described above.
[0019] In another aspect, embodiments of this application provide a computer program product including a computer program, which, when run on a computer device, causes the computer device to perform the methods described above.
[0020] As can be seen from the above technical solution, during the training process, semantic labels are added to the language sample pairs used for training. The semantics represented by these labels enable the general translation model to clearly define the translation task it performs. Taking a target language sample pair including target semantic labels as an example, this target language sample pair includes source text belonging to the source language and target text belonging to the target language. The target semantic labels have the semantic meaning of translating the text into text belonging to the target language. Based on the target semantic labels and the source text, translation is performed through the initial general translation model. During the translation process, the initial general translation model can clearly define the current translation task based on the semantics of the target semantic labels, and thus translate the source text based on this translation task to obtain the translated text. Based on the differences between the translated text and the target text, the model parameters of the initial general translation model are adjusted to obtain the general translation model. Thus, through the training process, the general translation model can not only learn how to translate text belonging to the source language into text belonging to the target language, but also learn to understand the semantics of the semantic labels, thereby clarifying the target language indicated by the current translation task, reducing the probability of off-target problems during the translation process, and improving the accuracy of the translation.
Attached Figure Description
[0021] To more clearly illustrate the technical solutions in the embodiments of this application or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are only some embodiments of this application. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.
[0022] Figure 1 is a schematic diagram illustrating an application scenario of a training method for a general translation model provided by an embodiment of this application;
[0023] Figure 2 is a schematic diagram illustrating the application of a general translation model in a game development scenario, provided by an embodiment of this application;
[0024] Figure 3 is a schematic diagram illustrating the application of a general translation model in a foreign language learning scenario, provided by an embodiment of this application;
[0025] Figure 4 is a flowchart illustrating a training method for a general translation model provided by an embodiment of this application;
[0026] Figure 5 is a schematic diagram illustrating the training and application process of a general translation model provided by an embodiment of this application;
[0027] Figure 6 is a schematic flowchart of the training and application of another general translation model provided by an embodiment of this application;
[0028] Figure 7 is a schematic flowchart of the training and application of yet another general translation model provided by an embodiment of this application;
[0029] Figure 8 is a schematic flowchart of the training and application of yet another general translation model provided by an embodiment of this application;
[0030] Figure 9 is a schematic flowchart of cleaning an initial sample set provided by an embodiment of this application;
[0031] Figure 10 is a schematic diagram of the application scenario of a training method for a general translation model provided by an embodiment of this application;
[0032] Figure 11 is a schematic diagram of data cleaning using a language classification model provided by an embodiment of this application;
[0033] Figure 12 is a schematic structural diagram of a training device for a general translation model provided by an embodiment of this application;
[0034] Figure 13 is a schematic structural diagram of a server provided by an embodiment of this application;
[0035] Figure 14 is a schematic structural diagram of a terminal device provided by an embodiment of this application.
Detailed Implementation Manners
[0036] The embodiments of this application will be described below with reference to the accompanying drawings.
[0037] Taking translation from Chinese to English as an example, if "我爱吃苹果" is translated as "I love to eat苹果", then this translation task (hereinafter, for convenience of description, the process of translating text belonging to one language into text belonging to another language is called a translation task) exhibits the off-target problem. With the post-processing operations described in the related art, the scanning rule is set to recognize Chinese characters, so that "I love to eat苹果" is scanned against the Chinese character set, "苹果" is recognized, and the text is then manually corrected to "I love to eat apple".
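The character-set scan used by the related-art post-processing can be sketched as follows. This is a minimal illustration only: the function name `find_off_target_spans`, the choice of the CJK Unified Ideographs range U+4E00 to U+9FFF as "the Chinese character set", and the German word in the last call are assumptions made for the example, not details given in this application.

```python
import re

# Assumed scanning rule: flag any run of CJK Unified Ideographs
# (U+4E00..U+9FFF) that appears in a translation meant to be English.
CJK_PATTERN = re.compile(r"[\u4e00-\u9fff]+")


def find_off_target_spans(translation: str) -> list[str]:
    """Return the runs of Chinese characters found in the translated text."""
    return CJK_PATTERN.findall(translation)


# Off-target translation: the Chinese word is detected for manual correction.
flagged = find_off_target_spans("I love to eat苹果")

# Correct translation: nothing is flagged.
clean = find_off_target_spans("I love to eat apple")

# Hypothetical Latin-script leakage (a German word): a rule written for
# Chinese characters cannot see it, which is the limitation noted earlier.
missed = find_off_target_spans("I love to eat Apfel")
```

A rule of this kind only covers the character set it was written for, which is why a single scanning rule is hard to specify when several non-target languages can leak into the output.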
[0038] Analysis revealed that because the general translation model can perform multiple translation tasks, it may become unclear which specific task it is performing, leading to off-target errors. Furthermore, manual correction is a lengthy and costly process.
[0039] Based on this, embodiments of this application provide a training method for a general translation model. During training, semantic labels are added to the language sample pairs used for training. The semantic labels represent semantics that enable the general translation model to clearly understand the translation task it is performing. Thus, the training process is performed based on these semantically labeled language sample pairs, allowing the trained general translation model to not only learn how to translate text belonging to the source language into text belonging to the target language, but also to learn and understand the semantics of the semantic labels, thereby clarifying the target language indicated by the current translation task. This reduces the probability of off-target errors during translation and improves translation accuracy. This process reduces the probability of manual correction, shortens the cycle for obtaining translated text, and reduces overhead.
[0040] The training method for the general translation model provided in this application can be applied to computer devices with the capability to train general translation models, such as terminal devices and servers.
[0041] Specifically, terminal devices can be desktop computers, laptops, smartphones, tablets, IoT devices, and portable wearable devices. IoT devices can be smart speakers, smart TVs, smart air conditioners, smart in-vehicle devices, etc. Smart in-vehicle devices can be in-vehicle navigation terminals and in-vehicle computers, etc. Portable wearable devices can be smartwatches, smart bracelets, head-mounted devices, etc., but are not limited to these.
[0042] The server can be a standalone physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server or server cluster that provides basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communications, middleware services, domain name services, security services, content delivery networks (CDNs), and big data and artificial intelligence platforms. Terminal devices and servers can be connected directly or indirectly via wired or wireless communication; this application does not impose any restrictions on this.
[0043] To facilitate understanding of the training method for the general translation model provided in this application embodiment, the following example uses a server as the execution subject of the training method for the general translation model to illustrate the application scenarios of the training method for the general translation model.
[0044] See Figure 1, which is a schematic diagram of an application scenario of a training method for a general translation model provided by an embodiment of the present application. As shown in Figure 1, this application scenario includes a server 100. The server 100 can be an independent server for training a general translation model. After the training of the general translation model is completed, the trained general translation model can be deployed on the server or terminal device corresponding to a product to provide general translation services. The server 100 can also be a server that provides corresponding services for various products, and the services provided can include, for example, translating source text belonging to the source language into text belonging to the target language. Hereinafter, the server 100 training the general translation model is used as an example for illustration.
[0045] During the training process, semantic labels are added to the language sample pairs used for training. The semantics represented by the semantic labels can enable the general translation model to clearly understand the translation task it performs. Taking the target language sample pair including the target semantic label as an example, the target language sample pair includes the source text belonging to the source language and the target text belonging to the target language, and the target language sample pair further includes the target semantic label. The target semantic label has the semantics of translating the text to be translated (in the embodiment of the present application, it is the source text) into the text belonging to the target language. For example, the source language is Chinese, the source text is "我爱吃苹果", the target language is English, the target text is "I love to eat apple", and the semantics of the target semantic label is to translate the text to be translated into English.
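One way to picture a target language sample pair is as a small record that carries the source text, the target text, and the semantic label. The sketch below is an assumption for illustration: the class name `TargetLanguageSamplePair`, the helper `add_semantic_tag`, and the natural-language wording of the label are hypothetical, since this passage does not fix how the label is encoded.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TargetLanguageSamplePair:
    """A language sample pair plus the semantic label added for training."""
    source_language: str
    source_text: str
    target_language: str
    target_text: str
    semantic_tag: str  # hypothetical encoding of the label's semantics


def add_semantic_tag(source_language: str, source_text: str,
                     target_language: str, target_text: str) -> TargetLanguageSamplePair:
    """Attach a label whose meaning is 'translate the text into <target language>'."""
    tag = f"Translate the text into {target_language}."  # assumed wording
    return TargetLanguageSamplePair(
        source_language, source_text, target_language, target_text, tag
    )


# The example from the description: Chinese source, English target.
sample = add_semantic_tag("Chinese", "我爱吃苹果", "English", "I love to eat apple")
```

Keeping the label next to the text pair means every training example states its own translation task, which is the property the training process relies on.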
[0046] The server 100 translates according to the target semantic label and the source text through the initial general translation model. The initial general translation model can clarify the current translation task based on the semantics of the target semantic label during the translation process, such as translating the source text into English. Thus, the source text is translated based on this translation task to obtain a translation text. For example, the source text "我爱吃苹果" belonging to Chinese is translated into English to obtain the translation text "I love to eat苹果".
[0047] The server 100 adjusts the model parameters of the initial general translation model according to the difference between the translation text and the target text to obtain the general translation model. For example, the difference between the translation text and the target text is reduced, so that the translation text obtained from the initial general translation model moves closer and closer to the target text. The output of the initial general translation model (such as the translation text) thereby becomes more and more accurate, yielding a general translation model with more accurate translation. For example, through continuous training, the translation text "I love to eat苹果" obtained from the initial general translation model can become "I love to eat apple".
[0048] Therefore, through the training process, the general translation model can not only learn how to translate text belonging to the source language into text belonging to the target language, but also learn to understand the semantics of semantic tags, thereby clarifying the target language indicated by the current translation task, thus reducing the probability of off-target problems in the translation process and improving the accuracy of translation.
[0049] The training method for the general translation model provided in this application embodiment can be executed by a server. However, in other embodiments of this application, the terminal device may also have similar functions to the server to execute the training method for the general translation model provided in this application embodiment, or the terminal device and the server may jointly execute the training method for the general translation model provided in this application embodiment. This embodiment does not limit this.
[0050] The training method for the general translation model provided in this application can be applied to various scenarios, including but not limited to game development, translation software, and foreign language learning. Two scenarios are given below as examples.
[0051] Scenario 1: Foreign language learning scenario.
[0052] For example, referring to Figure 2, the illustrated application scenario may include a terminal device 210 and a server 220, which can communicate via a communication network. The communication network uses standard communication technologies and/or protocols and is typically the Internet, but it can also be any other network, including but not limited to Bluetooth, a local area network (LAN), a metropolitan area network (MAN), a wide area network (WAN), a mobile network, a private network, a virtual private network, or any combination thereof. In some embodiments, customized or dedicated data communication technologies may be used to replace or supplement the aforementioned data communication technologies.
[0053] Terminal device 210 is equipped with a foreign language learning client, which provides foreign language learning services. Server 220 deploys a general translation model trained using the training method provided in this application embodiment, which is used to provide text translation services to the foreign language learning client. Taking English learning as an example, as shown in Figure 2, a user selects a section of English text 211 from an English article using the foreign language learning client. The selected section (i.e., the translation content) is highlighted with an underline. The user can click the translation control 212 to choose to translate the English text into Chinese (i.e., the translation target). The terminal device 210 sends the translation target and the translation content to the server 220. The server 220 generates the corresponding semantic tag based on the translation target (this semantic tag has the semantic meaning of translating the text into Chinese). Using the deployed general translation model, the server translates the translation content based on the corresponding semantic tag to obtain the Chinese translation result. This Chinese translation result is then sent to the terminal device 210 and displayed in the foreign language learning client for the user's reference and learning.
[0054] Scenario 2: Game development scenario.
[0055] For example, during game development, due to differences between game versions in different language environments, it is necessary to translate game text in one language (such as game skill descriptions, character introductions, and character dialogues) into game text in other languages, thereby developing the game into multiple versions adapted to different language environments. For instance, based on the Chinese version of the game text, multiple regional versions of the game text (such as English, German, Japanese, and Korean) can be developed so that players in non-Chinese-speaking regions can also participate in the game smoothly.
[0056] Taking the translation of the Chinese version of the game text into English and Spanish versions as an example, as shown in Figure 3, this application scenario may include terminal device 310, terminal device 320, and server 330. Terminal device 310 and server 330 can communicate via a communication network, and terminal device 320 and server 330 can communicate via a communication network. A general translation model trained using the training method provided in this application embodiment is deployed on server 330. If, in the game, the Chinese version of the game text means "The lineup can be adjusted after each battle," then server 330 translates it using the general translation model trained in this application embodiment. Based on a semantic tag that has the semantic meaning of translating text into English, it can obtain the English version of the game text, "The lineup can be adjusted after each battle," and send it to terminal device 310 for display. Similarly, based on a semantic tag that has the semantic meaning of translating text into Spanish, it can obtain the Spanish version of the game text, "La alineación se puede ajustar después de cada batalla," and send it to terminal device 320 for display.
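The per-language fan-out in this scenario can be sketched as a loop that builds one semantic tag per target language and calls the deployed model once for each. Everything named here is hypothetical: `make_tag`, `localize`, and the tag wording are assumptions, the Chinese game text is left as a placeholder, and `stub_translate` is a canned stand-in for the deployed general translation model so that the example runs on its own.

```python
from typing import Callable, Dict, List


def make_tag(target_language: str) -> str:
    """Hypothetical semantic tag meaning 'translate the text into <language>'."""
    return f"Translate the text into {target_language}."


def localize(source_text: str, target_languages: List[str],
             translate: Callable[[str, str], str]) -> Dict[str, str]:
    """Translate one source text into every requested target language."""
    return {lang: translate(make_tag(lang), source_text) for lang in target_languages}


def stub_translate(tag: str, source_text: str) -> str:
    """Stand-in for the deployed model: returns canned output chosen by the tag."""
    canned = {
        make_tag("English"): "The lineup can be adjusted after each battle",
        make_tag("Spanish"): "La alineación se puede ajustar después de cada batalla",
    }
    return canned[tag]


chinese_game_text = "(Chinese game text)"  # placeholder, not the actual text
versions = localize(chinese_game_text, ["English", "Spanish"], stub_translate)
```

Only the tag changes between the two calls; the source text and the model stay the same, which mirrors how server 330 produces the English and Spanish versions.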
[0057] It should be noted that the above application scenarios are merely examples. The training method of the general translation model provided in this embodiment can also be applied to other scenarios, and is not limited here.
[0058] The following describes in detail a training method for a general translation model provided in this application through method embodiments.
[0059] See Figure 4, which is a schematic flowchart illustrating a training method for a general translation model provided in an embodiment of this application. For ease of description, the following embodiments still use a server as the execution entity of the training method as an example. As shown in Figure 4, the training method for this general translation model includes the following steps:
[0060] S401: Obtain target language sample pairs including target semantic labels.
[0061] In related technologies, to reduce the probability of off-target issues, instructions are often used to specify the translation task to be performed by the translation model. For example, before using a general translation model, a user not only inputs the text to be translated but also selects the translation target, such as translating text in language A to text in language B. Based on the user's selection, an instruction is generated to translate the text in language A to text in language B, and the general translation model translates the text to language B based on this instruction. However, executing a corresponding translation task based on a specific instruction is a capability inherent in the general translation model itself; if this capability is insufficient, off-target issues may still occur.
[0062] In other words, current general-purpose translation models are insufficient in their capabilities; they lack the ability to autonomously define translation tasks, which can lead to off-target errors. Therefore, to enhance the capabilities of general-purpose translation models, this application provides a training method for such models. This method enables the trained model to not only execute instructions but also autonomously define translation tasks, effectively adding a new capability to the model and thus improving its overall performance. This reduces off-target errors and increases translation accuracy. The process of training a general-purpose translation model to autonomously define translation tasks is described below.
[0063] Obtain language sample pairs that can be used to train a general translation model. A language sample pair includes a source text and a target text. The source text belongs to the source language and is the text to be translated. The target text belongs to the target language and is a relatively accurate translation of the source text. The source text and the target text belong to different languages and express the same semantics. Taking one language sample pair as an example, its source text is "我爱吃苹果" and its target text is "I love to eat apple". The two texts express the same semantics and belong to Chinese and English respectively.
[0064] To train the general translation model to autonomously clarify translation tasks, in the embodiments of the present application, semantic tags are added to the language sample pairs to obtain target language sample pairs. A semantic tag is a tag that carries semantics, and those semantics are determined by the translation task. Different semantic tags have different semantics. For example, the target semantic tag has the semantics of translating the text (such as the source text) into text belonging to the target language. Taking as an example a target language sample pair whose source text, written in a language other than English, means "The weather is very good today" and whose target text is "The weather is good today", the target semantic tag it includes has the semantics of translating that source text into English.
[0065] The embodiments of the present application do not specifically limit the method of adding target semantic tags to the language sample pairs. Four methods will be used as examples for illustration later, and will not be elaborated here.
[0066] It can be understood that all the data collected in the present application (such as data like target language sample pairs) are obtained under the separate consent and authorization of the object to which the data belongs (such as users, institutions or enterprises), and the collection, use and processing of relevant data need to comply with the relevant laws, regulations and standards of relevant countries and regions.
[0067] S402: Based on the target semantic tag and the source text, perform translation through the initial general translation model to obtain a translation text.
[0068] The initial general translation model is a general translation model that has not been trained yet and is used to translate texts in multiple languages. The structure of the initial general translation model and the general translation model can be the same. When the model parameters of the initial general translation model are adjusted, the general translation model is obtained. The present application does not specifically limit the initial general translation model; it may be, for example, a Transformer, Bidirectional Encoder Representations from Transformers (BERT), or Tencent Machine Translation (TMT).
[0069] This application does not specifically limit how the initial general translation model performs translation. Two methods are described below as examples.
[0070] Method 1 involves combining the target semantic tags and the source text to obtain the input text. This input text is then fed into an initial general translation model for translation, resulting in the translated text. During the translation process, the initial general translation model first performs semantic understanding on the target semantic tags and the source text. Based on the semantics of the target semantic tags, it translates the source text into text belonging to the target language, thus performing a language conversion and obtaining the translated text. In essence, the translated text is the text obtained by translating the source text into the target language indicated by the target semantic tags; that is, the translated text is the text obtained by the initial general translation model based on the target semantic tags.
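Method 1 can be sketched as plain string concatenation. The function name `build_input_text`, the tag wording, and the newline separator are assumptions made for illustration; the passage above only states that the tag and the source text are combined into one input text.

```python
def build_input_text(semantic_tag: str, source_text: str, separator: str = "\n") -> str:
    """Combine the semantic tag and the source text into a single model input."""
    return f"{semantic_tag}{separator}{source_text}"


# Hypothetical tag wording, followed by the source text from the description.
input_text = build_input_text("Translate the text into English.", "我爱吃苹果")
```

Placing the tag ahead of the source text lets the model read the translation task before it reads the text to translate, although the order and separator are design choices rather than requirements stated here.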
[0071] Method 2 involves extracting features from the target semantic tags to obtain tag feature vectors, extracting features from the source text to obtain source text feature vectors, and then inputting the tag feature vectors and source text feature vectors into an initial general translation model. After semantic understanding, the initial general translation model translates the source text feature vectors based on the semantics of the target semantic tags to obtain the translated text.
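Method 2 can be pictured as building two sets of feature vectors and passing them to the model together. The sketch below is a toy under stated assumptions: the hash-based `embed` function stands in for a learned feature extractor, the source text is split into characters, and the tag vector is simply placed in front of the source vectors. None of these choices is specified in this application.

```python
import hashlib
from typing import List

DIM = 4  # assumed feature dimension, kept tiny for readability


def embed(token: str, dim: int = DIM) -> List[float]:
    """Toy deterministic feature extractor: map a token to `dim` floats in [0, 1)."""
    digest = hashlib.sha256(token.encode("utf-8")).digest()
    return [digest[i] / 256.0 for i in range(dim)]


def tag_feature_vector(semantic_tag: str) -> List[float]:
    """Feature vector for the whole semantic tag."""
    return embed(semantic_tag)


def source_feature_vectors(source_text: str) -> List[List[float]]:
    """One feature vector per character of the source text (assumed tokenization)."""
    return [embed(ch) for ch in source_text]


def build_model_input(semantic_tag: str, source_text: str) -> List[List[float]]:
    """Place the tag vector in front of the source text vectors."""
    return [tag_feature_vector(semantic_tag)] + source_feature_vectors(source_text)


model_input = build_model_input("Translate the text into English.", "我爱吃苹果")
```

In a real system the vectors would come from trained embedding layers; the point of the sketch is only that the tag and the source text are featurized separately and then supplied to the model as one input.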
[0072] Based on this, the initial general translation model can clarify the translation task during the translation process according to the semantics represented by the target semantic label, thereby outputting the corresponding translated text for the source text, reducing the amount of non-target language text output in the translated text, and reducing the probability of off-target problems.
[0073] S403: Adjust the model parameters of the initial general translation model based on the differences between the translated text and the target text to obtain a general translation model.
[0074] The general translation model is a model obtained by training an initial general translation model. The general translation model can also be used to translate texts in multiple languages. Furthermore, the general translation model can understand the semantics of semantic tags, thereby clarifying the translation task to be performed and reducing the probability of off-target problems.
[0075] The translated text is the text obtained by the initial general translation model, and its accuracy may be low. The target text is the text that expresses the same semantics as the source text, and its accuracy is high. The difference between the translated text and the target text reflects the accuracy of the initial general translation model. Based on the difference between the translated text and the target text, the model parameters of the initial general translation model can be adjusted so that the difference between the translated text and the target text obtained by the initial general translation model becomes smaller and smaller. In other words, the initial general translation model becomes more and more accurate through continuous training.
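The idea of adjusting parameters so that the difference shrinks can be illustrated with a deliberately tiny model. The sketch assumes cross-entropy as the measure of difference and plain gradient descent as the adjustment rule; this application names neither, and the three-word vocabulary and starting scores are hypothetical. The "model" is just one score per candidate word for the final position of the translation.

```python
import math
from typing import List

VOCAB = ["apple", "苹果", "Apfel"]  # hypothetical candidates for the last word


def softmax(logits: List[float]) -> List[float]:
    """Convert scores to probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits: List[float], target_index: int) -> float:
    """Assumed difference measure between the prediction and the target word."""
    return -math.log(softmax(logits)[target_index])


def sgd_step(logits: List[float], target_index: int, lr: float) -> List[float]:
    """One gradient descent update; d(loss)/d(logit_i) = p_i - 1[i == target]."""
    probs = softmax(logits)
    grads = [p - (1.0 if i == target_index else 0.0) for i, p in enumerate(probs)]
    return [z - lr * g for z, g in zip(logits, grads)]


logits = [0.0, 1.0, 0.0]        # the initial model prefers the off-target "苹果"
target = VOCAB.index("apple")   # the target text ends with "apple"
losses = []
for _ in range(50):
    losses.append(cross_entropy(logits, target))
    logits = sgd_step(logits, target, lr=0.5)

predicted = VOCAB[max(range(len(VOCAB)), key=lambda i: logits[i])]
```

After repeated updates the loss falls and the highest-scoring word changes from the off-target "苹果" to "apple", which mirrors the progression from "I love to eat苹果" toward "I love to eat apple" described for training.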
[0076] The embodiments of this application do not specifically limit the number of training iterations or the training method. For example, the training of the initial general translation model can be terminated after a preset number of iterations or after the initial general translation model has converged. That is, the model parameters of the initial general translation model are no longer adjusted, thereby obtaining a general translation model with fixed parameters.
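The two stopping conditions mentioned above (a preset number of iterations, or convergence) can be sketched as follows. This is only an illustration and not the training procedure of this application: `train`, `step_fn`, `max_iters` and `tol` are hypothetical names, convergence is assumed to mean that the loss barely changes between iterations, and the toy quadratic stands in for the difference between the translated text and the target text.

```python
from typing import Callable, Tuple


def train(step_fn: Callable[[float], Tuple[float, float]],
          param: float,
          max_iters: int = 1000,
          tol: float = 1e-6) -> Tuple[float, int]:
    """Adjust `param` until a preset iteration count or convergence is reached.

    step_fn takes the current parameter and returns (new_param, loss).
    Convergence is assumed to mean the loss changes by less than `tol`
    between two consecutive iterations.
    """
    prev_loss = float("inf")
    for it in range(1, max_iters + 1):
        param, loss = step_fn(param)
        if abs(prev_loss - loss) < tol:
            return param, it  # converged: parameters are no longer adjusted
        prev_loss = loss
    return param, max_iters  # preset number of iterations reached


def toy_step(w: float, lr: float = 0.1) -> Tuple[float, float]:
    """One gradient step on the toy difference (w - 3)^2."""
    grad = 2.0 * (w - 3.0)
    w = w - lr * grad
    return w, (w - 3.0) ** 2


final_w, iterations = train(toy_step, param=0.0)
```

With the toy step, training stops through the convergence branch well before `max_iters`; passing a small `max_iters` exercises the other branch.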
[0077] As mentioned above, by training an initial general translation model to recognize target semantic labels, the initial general translation model can understand the semantics represented by the target semantic labels. This allows the initial general translation model to have target language specificity during the translation task, meaning it can clearly identify the target language of the translation task. Therefore, the general translation model trained using the initial general translation model can also clearly identify the target language of the translation task.
[0078] It should be noted that the initial general translation model can be trained multiple times to learn various semantic labels. The resulting general translation model can recognize multiple semantic labels. When performing translation tasks, it can not only clarify the translation task based on the corresponding semantic labels, but also, combined with its generalization ability, learn the ability to clarify the translation task autonomously through the learning of multiple semantic labels. For example, it can clarify the translation task even without semantic labels. This will be explained later through the first of the four ways to add semantic labels, and will not be elaborated here.
[0079] As can be seen from the above technical solution, during the training process, semantic labels are added to the language sample pairs used for training. The semantics represented by these labels enable the general translation model to clearly define the translation task it performs. Taking a target language sample pair including target semantic labels as an example, this target language sample pair includes source text belonging to the source language and target text belonging to the target language. Based on the target semantic labels and the source text, translation is performed through an initial general translation model. During the translation process, the initial general translation model can clearly define the current translation task based on the semantics of the target semantic labels, and thus translate the source text based on this translation task to obtain the translated text. Based on the differences between the translated text and the target text, the model parameters of the initial general translation model are adjusted to obtain the general translation model. Thus, through the training process, the general translation model can not only learn how to translate text belonging to the source language into text belonging to the target language, but also learn to understand the semantics of the semantic labels, thereby clarifying the target language indicated by the current translation task, reducing the probability of off-target problems during the translation process, and improving the accuracy of the translation.
[0080] The embodiments of this application do not specifically limit the way target semantic tags are added to language sample pairs. The four addition methods are first outlined at a high level through Table 1 and then explained in detail through the embodiments.
[0081] See Table 1, which provides a macro-level explanation of the four ways to add target semantic labels to language samples.
[0082] Table 1
[0083]
Addition method | Source text | Target text
Addition method one | Target semantic tag added | No target semantic tag added
Addition method two | First sub-tag and second sub-tag added | No target semantic tag added
Addition method three | No target semantic tag added | Target semantic tag added
Addition method four | First sub-tag added | Second sub-tag added
[0084] For a single language sample pair, Method 1 adds a target semantic tag to the source text but not to the target text. Method 2 also adds a target semantic tag to the source text but not to the target text. The target semantic tag includes a first sub-tag and a second sub-tag. The first sub-tag has the semantic meaning of translating text belonging to the source language, and the second sub-tag has the semantic meaning of the translated text belonging to the target language. Method 3 adds a target semantic tag to the target text but not to the source text. Method 4 adds a first sub-tag to the source text and a second sub-tag to the target text.
[0085] The following examples illustrate these four ways of adding data; see Table 2 for details.
[0086] Table 2
[0087]
Addition method | Source text | Target text
Addition method one | <es>I like to eat pomelo</es> | Me gusta comer toronja
Addition method two | <en><es>I like to eat pomelo</es></en> | Me gusta comer toronja
Addition method three | I like to eat pomelo | <es>Me gusta comer toronja</es>
Addition method four | <en>I like to eat pomelo</en> | <es>Me gusta comer toronja</es>
[0088] Table 2 includes examples of the four ways to add target semantic labels to language sample pairs, where the source text is "I like to eat pomelo" and the target text is "Me gusta comer toronja". If the target semantic label is not split, then <es> is the target semantic tag. If the target semantic tag includes a first sub-tag and a second sub-tag, then <en> is the first sub-tag, which has the semantic meaning of translating English text, and <es> is the second sub-tag, which has the semantic meaning of translating the text into Spanish.
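The four addition methods can be sketched as a small helper that builds the tagged sample pair. This is a minimal sketch: the function names `wrap` and `add_tags` are hypothetical, and the exact tag syntax (an opening and closing tag with no surrounding spaces) is an assumption based on the examples in Table 2.

```python
from typing import Tuple


def wrap(text: str, lang: str) -> str:
    """Enclose text in a language tag such as <es>...</es> (assumed syntax)."""
    return f"<{lang}>{text}</{lang}>"


def add_tags(method: int, source: str, target: str,
             src_lang: str, tgt_lang: str) -> Tuple[str, str]:
    """Return (source text, target text) after applying addition method 1-4."""
    if method == 1:
        # Target semantic tag on the source text only.
        return wrap(source, tgt_lang), target
    if method == 2:
        # First sub-tag (source language) and second sub-tag (target
        # language) both on the source text.
        return wrap(wrap(source, tgt_lang), src_lang), target
    if method == 3:
        # Target semantic tag on the target text only.
        return source, wrap(target, tgt_lang)
    if method == 4:
        # First sub-tag on the source text, second sub-tag on the target text.
        return wrap(source, src_lang), wrap(target, tgt_lang)
    raise ValueError(f"unknown addition method: {method}")


pairs = {
    m: add_tags(m, "I like to eat pomelo", "Me gusta comer toronja", "en", "es")
    for m in (1, 2, 3, 4)
}
```

Each entry of `pairs` corresponds to one row of the examples above, for an English source text and a Spanish target text.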
[0089] The four methods of addition are explained below.
[0090] (1) Add method one: add the target semantic tag to the source text.
[0091] As shown in Table 1 or Table 2, Method 1 adds the target semantic label to the source text. This means the source text includes the target semantic label, effectively combining the target semantic label and the source text to obtain the input text, which is then processed together. The process of training the initial general translation model based on the input text will be explained below.
[0092] See Figure 5, which illustrates the training and application process of a general translation model provided in this embodiment. The initial general translation model translates based on the target semantic tag and the source text to obtain the translated text. Referring further to Table 2, with the target language being Spanish, a target semantic tag that expresses the semantics of translating the source text into Spanish is generated: <es>. The target semantic tag is then added to the source text.
[0093] During the translation process, the initial general translation model must not only perform semantic understanding of the source text but also of the target semantic tags. That is, training the initial general translation model to understand the target semantic tags enables it to autonomously determine the current translation task, i.e., which language the source text should be converted to. Based on the semantic understanding of the target semantic tags, the initial general translation model performs language conversion on the source text to obtain the translated text. Then, based on the differences between the translated text and the target text, the model parameters of the initial general translation model are adjusted to obtain the general translation model. Related details can be found in S402-S403 above and will not be repeated here.
[0094] This application does not specifically limit how the general translation model obtained with addition method one is used. The following continues to use the goal of translating text belonging to the source language into text belonging to the target language as an example and illustrates two usage methods.
[0095] For usage method one, see Figure 5. After the first text to be translated and the target language are obtained, a target semantic tag is generated according to the target language. Based on the generated target semantic tag and the first text to be translated, the trained general translation model performs the translation to obtain the first target translated text.
[0096] The first text to be translated is untranslated text in the source language, which can be obtained through user input or other means. The first target translated text is the translated text that the general translation model produces from the first text to be translated. The target semantic tag lets the general translation model understand that the first text to be translated must be translated into the target language, which clarifies the translation task and yields a more accurate first target translated text.
[0097] This application does not limit the method of obtaining the target language. For example, the target language can be set by the user, such as the user selecting the language of the text to be translated and the language of the translated text. Alternatively, the target language can be determined from the most recent historical translation data. Historical translation data records the user's previous translation tasks (see Figure 1). In two consecutive translation tasks, the language of the translated text is generally the same, so the target language of the most recently recorded translation is the most likely target language of the next one. For example, if the target language of the most recent translation record is English, the next translation is more likely to target English as well, and when the user does not specify a target language, English can be used as the target language for the next translation.
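The fallback from a user-specified target language to the most recent historical record can be sketched as follows. The function name `choose_target_language` and the representation of historical translation data as a list of target-language codes (oldest first) are assumptions made for illustration.

```python
from typing import Optional, Sequence


def choose_target_language(user_choice: Optional[str],
                           history_targets: Sequence[str]) -> Optional[str]:
    """Pick the target language for the next translation task.

    history_targets is assumed to hold the target languages of past
    translation tasks, oldest first.
    """
    if user_choice:
        return user_choice  # an explicit user setting takes priority
    if history_targets:
        return history_targets[-1]  # most recent record is the likeliest target
    return None  # nothing to go on; the caller must ask the user


chosen = choose_target_language(None, ["es", "fr", "en"])
```

Here `chosen` is the target language of the most recent record, because no user choice was given.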
[0098] In usage method two, after the first text to be translated and the target language are obtained, the target language is used as the instruction for this translation task, and the first text to be translated is translated directly by the general translation model. Although no target semantic tag is generated from the target language, the general translation model has learned during training, through its generalization ability, to define the translation task autonomously. That is, the general translation model can not only execute instructions but also define the translation task on its own, which amounts to an additional capability. The general translation model trained through this embodiment is therefore more powerful. Using the target language as an instruction and translating the first text to be translated with this more powerful general translation model can reduce off-target problems and improve translation accuracy. It should also be noted that, in this respect, addition method one is more effective than addition methods two through four.
[0099] Therefore, by adding target semantic tags to the source text, the initial general translation model can perform semantic understanding of both the source text and the target semantic tags simultaneously during training. Compared to performing semantic understanding twice, this single semantic understanding is faster, resulting in a faster training speed for the general translation model. Furthermore, the trained general translation model can understand, based on the target semantic tags, that the translation task is to translate the first text to be translated into text belonging to the target language, thereby reducing the probability of translating the first text to be translated into a non-target language. In other words, the general translation model can more effectively translate the first text to be translated based on the target semantic tags during the translation process, reducing the probability of off-target errors and improving translation accuracy.
[0100] (2) Add method two: the target semantic tag includes the first sub-tag and the second sub-tag, and add the first sub-tag and the second sub-tag to the source text.
[0101] As shown in Table 1 or Table 2 (addition method two), the target semantic label includes a first sub-label and a second sub-label. The first sub-label has the semantic meaning of translating text belonging to the source language, and the second sub-label has the semantic meaning of the translated text belonging to the target language. In other words, the first sub-label emphasizes the language of the text to be translated, and the second sub-label emphasizes the language of the translated text. Adding the first and second sub-labels to the source text is equivalent to combining the first and second sub-labels with the source text to obtain the input text. The process of training the initial general translation model based on the input text will be explained below.
[0102] See Figure 6, which illustrates the training and application process of another general translation model provided in this application embodiment. The initial general translation model translates based on the first sub-tag, the second sub-tag, and the source text to obtain the translated text. Referring again to Table 2, with the target language being Spanish, a second sub-tag that expresses the semantics of translating the source text into Spanish is generated: <es>, together with a first sub-tag that expresses the semantics of the source text belonging to English: <en>. The first and second sub-tags are then added to the source text.
[0103] During the translation process, the initial general translation model must perform semantic understanding not only on the source text but also on the first and second sub-tags. That is, the initial general translation model is trained to understand the semantics of the first and second sub-tags, which respectively emphasize the languages of the two texts (the text to be translated and the translated text). This ensures that, when defining the translation task, the initial general translation model clearly identifies not only the language of the text to be translated but also the language of the translated text. Based on its semantic understanding of the first and second sub-tags, the initial general translation model converts the source text into the other language to obtain the translated text. Then, based on the differences between the translated text and the target text, the model parameters of the initial general translation model are adjusted to obtain the general translation model. Related details can be found in S402-S403 above and will not be repeated here.
[0104] This application does not specifically limit the usage process of the general translation model obtained based on the second method of addition. The general translation model can translate texts in multiple languages. The following continues to use the translation goal of translating texts belonging to the source language into texts belonging to the target language as an example to illustrate one usage method. For relevant details, please refer to the aforementioned first method of addition, which will not be repeated here.
[0105] See also Figure 6. During model usage, a second text to be translated and the target language are obtained. The second text to be translated is untranslated text belonging to the source language. A first sub-tag can be generated based on the second text to be translated, thus identifying the language of the second text to be translated. A second sub-tag is generated based on the target language, thus identifying the language of the second target translated text. The first and second sub-tags constitute the target semantic tag.
[0106] Based on the first sub-tag, the second sub-tag, and the second text to be translated, a general translation model is used to obtain the second target translated text. The second target translated text is the translated text obtained by translating the second text to be translated using the general translation model. The use of the first and second sub-tags enables the general translation model to understand not only the semantics of translating the second text to be translated into the target language, but also the semantics of the second text to be translated being in the source language. This allows for a clearer understanding of the source and target languages involved in the translation task, resulting in a more accurate second target translated text.
[0107] Therefore, by adding a first sub-label and a second sub-label to the source text, the target semantic label is refined into a first sub-label and a second sub-label. During training, the initial general translation model, when translating the source text, can not only identify the language of the translated text based on the second sub-label, but also identify the language of the text to be translated based on the first sub-label, thus better understanding the translation task. The trained general translation model, based on the first and second sub-labels, can understand that the translated text (e.g., the second target translated text) belongs to the target language, and also that the text to be translated (e.g., the second text to be translated) belongs to the source language, further clarifying the translation task. In other words, the general translation model can more accurately translate the second text to be translated based on the target semantic label, which includes the first and second sub-labels, reducing the probability of off-target errors and improving translation accuracy.
[0108] (3) Add method three: add the target semantic tag to the target text.
[0109] As shown in Table 1 or Table 2, addition method three adds the target semantic tag to the target text, meaning the target text includes the target semantic tag. The training process of the initial general translation model will be explained below.
[0110] See Figure 7, which illustrates the training and application process of another general translation model provided in this application embodiment. The initial general translation model translates the source text to obtain the first undetermined translation text. Referring further to Table 2, with the target language being Spanish, a target semantic tag that expresses the semantics of translating the source text into Spanish is generated: <es>. The target semantic tag is then added to the target text. Semantic recognition is performed on the target semantic tag included in the target text to obtain the first semantic result. Based on the first semantic result, the first undetermined translation text is adjusted to obtain the translated text. According to the differences between the translated text and the target text, the model parameters of the initial general translation model are adjusted to obtain the general translation model.
[0111] This application does not specifically limit the semantic recognition method, such as recognition through recurrent neural networks (RNN), long short-term memory networks (LSTM), etc. The first semantic result is the result describing the semantics of the target semantic label.
[0112] The first undetermined translation text is the text obtained by translating the source text with the initial general translation model. Since it is translated without understanding the semantics corresponding to the target semantic tag, its accuracy is low. The first semantic result is the result obtained by identifying the semantics corresponding to the target semantic tag. It reflects the language that the first undetermined translation text should be in; that is, the first semantic result makes clear that the text is to be translated into the target language. Therefore, the first undetermined translation text is adjusted based on the first semantic result to obtain the translated text.
[0113] This application does not specifically limit the adjustment method. For example, if the first undetermined translation text includes a language different from the language indicated by the first semantic result, the text in that different language is re-translated until its language is the same as that indicated by the first semantic result, thus obtaining the translated text. Take as an example a first undetermined translation text that should read "I like to eat apple" but in which the word for "apple" is output in a non-target language: if the language indicated by the first semantic result is English, that word is re-translated until "apple" is obtained, thus obtaining the translated text "I like to eat apple". Alternatively, text in the first undetermined translation text that has off-target issues can be replaced with text in the target language that expresses the same semantic meaning, thereby obtaining a more accurate translated text.
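The re-translation adjustment can be sketched as below. This is a toy illustration under stated assumptions, not the adjustment method of this application: the language check simply treats pure-ASCII tokens as English, the re-translation is a lookup in a made-up dictionary, and the names `adjust_translation`, `is_english_token`, `toy_retranslate` and `max_retries` are hypothetical. The Chinese word in the sample input is likewise only an assumed example of off-target text.

```python
from typing import Callable, List


def is_english_token(token: str) -> bool:
    """Toy language check: treat pure-ASCII tokens as English (assumption)."""
    return token.isascii()


def adjust_translation(pending: str,
                       in_target_language: Callable[[str], bool],
                       retranslate: Callable[[str], str],
                       max_retries: int = 3) -> str:
    """Re-translate every token that is not in the target language.

    A token is re-translated until it passes the language check or
    `max_retries` attempts have been made; tokens that still fail are
    left unchanged.
    """
    adjusted: List[str] = []
    for token in pending.split():
        attempts = 0
        while not in_target_language(token) and attempts < max_retries:
            token = retranslate(token)
            attempts += 1
        adjusted.append(token)
    return " ".join(adjusted)


# Made-up dictionary standing in for a real re-translation step.
toy_dictionary = {"苹果": "apple"}


def toy_retranslate(token: str) -> str:
    return toy_dictionary.get(token, token)


result = adjust_translation("I like to eat 苹果", is_english_token, toy_retranslate)
```

In this sketch only the off-target token is touched; the tokens already in the target language pass through unchanged.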
[0114] This application does not specifically limit the usage process of the general translation model obtained based on the third method of addition. The general translation model can translate texts in multiple languages. The following continues to use the translation goal of translating texts belonging to the source language into texts belonging to the target language as an example to illustrate one usage method. For relevant details, please refer to the aforementioned first method of addition, which will not be repeated here.
[0115] See also Figure 7. During model usage, the third text to be translated and the target language are obtained. The third text to be translated is the text awaiting translation, and the target language is the language of the translated text in the translation task. The target language can be used as the instruction for this translation task, and the third text to be translated can be translated using the general translation model to obtain the third target translated text. Similarly, although no target semantic tag is generated based on the target language, the general translation model has already acquired, during training, the ability to define the translation task autonomously. Using the target language as the instruction and translating the third text to be translated with this more capable general translation model can reduce off-target problems and improve translation accuracy.
[0116] As one possible approach to improve the accuracy of the third target translation text, target semantic tags can be obtained based on the target language, and then semantic recognition can be performed on the target semantic tags. Based on the semantic results, the third target translation text can be adjusted to obtain the adjusted translation text, thereby improving accuracy.
[0117] Therefore, although the accuracy of the first undetermined translation text obtained by translating the source text using the initial general translation model is low, the accuracy of the first semantic result based on the target semantic label is high. Adjusting the first undetermined translation text based on the first semantic result yields a translation text with higher accuracy, and more clearly defines the translation task as translating the text to be translated into text belonging to the target language. Adjusting the model parameters of the initial general translation model based on the differences between the translation text and the target text results in a general translation model. The trained general translation model not only learns how to translate text belonging to the source language into text belonging to the target language, but also learns the adjustment process based on the target semantic label; that is, it can indirectly understand the semantics of the target semantic label, making the general translation model more powerful, thereby reducing the probability of off-target text and improving translation accuracy.
[0118] (4) Add method four: the target semantic tag includes the first sub-tag and the second sub-tag. Add the first sub-tag to the source text and the second sub-tag to the target text.
[0119] As shown in Table 1 or Table 2, in addition method four, the target semantic label includes a first sub-label and a second sub-label. The first sub-label has the semantic meaning of translating text belonging to the source language, and the second sub-label has the semantic meaning of the translated text belonging to the target language. In other words, the first sub-label emphasizes the language of the text to be translated, and the second sub-label emphasizes the language of the translated text. The training process of the initial general translation model will be explained below.
[0120] See Figure 8, which illustrates the training and application process of another general translation model provided in this application embodiment. The initial general translation model translates based on the first sub-tag and the source text to obtain the second undetermined translation text. Referring again to Table 2, with the target language being Spanish, a second sub-tag that expresses the semantics of translating the source text into Spanish is generated: <es>, together with a first sub-tag that expresses the semantics of the source text belonging to English: <en>. The first sub-tag is added to the source text, and the second sub-tag is added to the target text.
[0121] In the translation process, the initial general translation model performs semantic understanding not only of the source text but also of the first sub-tag. This allows the initial general translation model to identify the language of the text being translated, i.e., the source language, when defining the translation task. Based on the semantic understanding of the first sub-tag, the source text is converted into another language, resulting in the second undetermined translation text. Semantic recognition is then performed on the second sub-tag included in the target text, yielding the second semantic result. The second undetermined translation text is then adjusted based on the second semantic result, resulting in the translated text. Finally, the model parameters of the initial general translation model are adjusted according to the differences between the translated text and the target text, resulting in the general translation model.
[0122] The second undetermined translation text is a text obtained by translating the source text according to the initial general translation model. Since this second undetermined translation text is translated without understanding the semantics corresponding to the second sub-tag, its accuracy is low. The second semantic result is the result obtained by identifying the semantics corresponding to the second sub-tag, which identifies the target language, resulting in higher accuracy. Therefore, adjusting the second undetermined translation text using the more accurate second semantic result yields a more accurate translation text. This application does not specifically limit the adjustment method; relevant details can be found in the adjustment method for the first undetermined translation text, which will not be repeated here.
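The data flow of one training step under addition method four can be sketched as follows. This is a minimal sketch with stand-in components: `toy_translate` and `toy_adjust` replace the initial general translation model and the adjustment step, the token mismatch rate is only a stand-in for whatever difference measure is actually used to adjust the model parameters, and all names as well as the `<xx>...</xx>` tag syntax are assumptions.

```python
import re
from typing import Callable, Tuple

TAG_RE = re.compile(r"^<(\w+)>(.*)</\1>$", re.DOTALL)


def split_tag(tagged: str) -> Tuple[str, str]:
    """Split '<xx>text</xx>' into ('xx', 'text'); the tag syntax is assumed."""
    match = TAG_RE.match(tagged)
    if match is None:
        raise ValueError(f"no enclosing tag found in {tagged!r}")
    return match.group(1), match.group(2)


def token_mismatch(translated: str, target: str) -> float:
    """Fraction of token positions that differ (a stand-in difference measure)."""
    a, b = translated.split(), target.split()
    longest = max(len(a), len(b))
    if longest == 0:
        return 0.0
    same = sum(x == y for x, y in zip(a, b))
    return 1.0 - same / longest


def training_difference(tagged_source: str, tagged_target: str,
                        translate: Callable[[str], str],
                        adjust: Callable[[str, str], str]) -> float:
    pending = translate(tagged_source)  # second undetermined translation text
    target_lang, target_text = split_tag(tagged_target)  # second semantic result
    translated = adjust(pending, target_lang)  # translated text
    return token_mismatch(translated, target_text)  # drives parameter updates


def toy_translate(tagged_source: str) -> str:
    return "Me gusta comer pomelo"  # contains one off-target (English) word


def toy_adjust(pending: str, target_lang: str) -> str:
    return pending.replace("pomelo", "toronja") if target_lang == "es" else pending


diff = training_difference("<en>I like to eat pomelo</en>",
                           "<es>Me gusta comer toronja</es>",
                           toy_translate, toy_adjust)
```

In a real training setup the difference would be a differentiable loss rather than a token mismatch rate; the sketch only shows the order of the steps.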
[0123] This application does not specifically limit the usage process of the general translation model obtained based on the fourth addition method. The general translation model can translate texts in multiple languages. The following continues to use the translation goal of translating texts belonging to the source language into texts belonging to the target language as an example to illustrate one usage method. For relevant details, please refer to the aforementioned first addition method, which will not be repeated here.
[0124] See also Figure 8. During model usage, the fourth text to be translated and the target language are obtained. The fourth text to be translated is the text awaiting translation, and the target language is the language of the translated text in the translation task. A first sub-label can be generated based on the fourth text to be translated, and the target language can be used as the instruction for this translation task. Based on the semantic understanding of the first sub-label, the general translation model translates the fourth text to be translated, obtaining the fourth target translated text. Although there is no second sub-label, the general translation model has already acquired the ability to define the translation task autonomously during training. Therefore, by using the target language as the instruction and translating the fourth text to be translated with the more capable general translation model, off-target problems can be reduced, thereby improving translation accuracy.
[0125] As one possible approach, to improve the accuracy of the fourth target translation text, a second sub-label can be obtained based on the target language, and then semantic recognition can be performed on the second sub-label. Based on the semantic results, the fourth target translation text can be adjusted to obtain the adjusted translation text, thereby improving accuracy.
[0126] Therefore, although the accuracy of the second undetermined translation text obtained by translating the source text using the initial general translation model is low, the accuracy of the second semantic result based on the second sub-label is high. Adjusting the second undetermined translation text based on the second semantic result yields a more accurate translation text, and more clearly defines the translation task as translating the text to be translated into text belonging to the target language. Adjusting the model parameters of the initial general translation model based on the differences between the translation text and the target text results in a general translation model. The trained general translation model not only learns how to translate text belonging to the source language into text belonging to the target language, but also learns the adjustment process based on the second sub-label, meaning it can indirectly understand the semantics of the target semantic label. This makes the general translation model more powerful, reducing the probability of off-target text and improving translation accuracy.
[0127] A prompt is an injected instruction used to direct an artificial intelligence model (such as a large language model) to output content according to the prompt. By optimizing the prompt, the output of the artificial intelligence model can better meet user needs. Based on this, this application embodiment also provides an application method of a general translation model based on prompts, taking the translation of source text belonging to the source language into target text belonging to the target language as an example, see A1-A3 (not shown in the figure).
[0128] A1: Obtain the target translation task, initial prompt words, and the fifth text to be translated.
[0129] The target translation task is used to indicate the language of the translated text, such as translating the current text into English. As one possible implementation, the target translation task can be a task that translates text belonging to the source language (such as the fifth text to be translated) into text belonging to the target language (such as the fifth target translation text), indicating the two languages required for the translation task.
[0130] The fifth text to be translated is the text that has not yet been translated, and the fifth target text to be translated is the text obtained by translating the fifth text to be translated using a general translation model.
[0131] Initial prompts are words that await filling in, and they are less specific to the translation task of a general translation model. For example, if the source language is English and the target language is Spanish, the initial prompts could be as shown in Table 3.
[0132] Table 3
[0133]
[0134] Here, <source> indicates that the text following <source> is the text awaiting translation, and <target> instructs the general translation model to output the translated text in the specified format. "Please play a professional [ ] translator" is a descriptive phrase that adds a translation scenario, asking the model to play the role of a professional [ ] translator. The [ ] is filled in based on the target translation task; for example, filling in "English-Spanish" yields "please play the role of a professional English-Spanish translator".
[0135] A2: Fill in the initial prompt words according to the target translation task and the fifth text to be translated to obtain the prompt words.
[0136] Following the format and positions defined by the initial prompt words, the target translation task and the fifth text to be translated are filled into the corresponding positions to obtain the prompt words, which are the content used to prompt the model. For example, taking English as the source language and Spanish as the target language, the prompt words can be as shown in Table 4.
[0137] Table 4
[0138]
[0139] A3: Input the prompt words into the general translation model to obtain the fifth target translation text.
[0140] First, the prompt words can be converted into a form acceptable to the general translation model to obtain the prompt word information. Then, the general translation model can process the prompt word information to determine the target translation task. The text carried by the prompt words is then translated according to the target translation task indicated by the prompt words to obtain the fifth target translation text.
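As an illustrative sketch only, steps A1-A3 could be expressed as follows. The wording of `INITIAL_PROMPT`, the placeholder names `{pair}` and `{text}`, and the function `fill_prompt` are assumptions made for this example and do not reproduce Table 3 or Table 4; only the <source> and <target> markers and the role slot follow the description above.

```python
# Hypothetical initial prompt words (A1). The exact wording of Table 3 is not
# reproduced; only <source>, <target> and the role slot follow the description.
INITIAL_PROMPT = (
    "Please play a professional {pair} translator.\n"
    "<source> {text}\n"
    "<target>"
)


def fill_prompt(source_lang: str, target_lang: str, text: str) -> str:
    """A2: fill the target translation task and the text into the template."""
    pair = f"{source_lang}-{target_lang}"
    return INITIAL_PROMPT.format(pair=pair, text=text)


prompt = fill_prompt("English", "Spanish", "I like apples.")
print(prompt)
# A3 would pass `prompt` to the general translation model; the model call
# itself is outside the scope of this sketch.
```

Keeping the template separate from the task and the text means that one set of initial prompt words can serve any language pair.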
[0141] It should be noted that the above-mentioned general translation model based on prompt words can be used in parallel with any of the methods 1-4, or it can be superimposed on any of the methods 1-4. This allows the general translation model to learn and understand the semantics of semantic tags, thus clarifying the translation task and further improving the accuracy of the general translation model.
[0142] Therefore, the general translation model trained based on the aforementioned target semantic labels can not only learn how to translate text belonging to the source language into text belonging to the target language, but also learn to understand the semantics of the semantic labels. In addition, it can combine the prompt words that can clarify the translation task (obtained based on the target translation task) to clarify the target language indicated by the current translation task, thereby reducing the probability of off-target problems in the translation process and improving the accuracy of translation.
[0143] A word segment is the basic unit that constitutes a text. For example, the source text may include multiple word segments. Taking the translation of source text belonging to the source language into target text belonging to the target language as an example, some word segments in the source text are often translated into text in a non-target language during translation, so these word segments have a high probability of off-target problems. In order to reduce the number of word segments that have off-target problems during translation, this application embodiment also provides a specific implementation of S402, that is, of performing translation through the initial general translation model based on the target semantic tag and the source text to obtain the translated text. See B1-B3 for details (not shown in the figure):
[0144] B1: Based on the target semantic tags and the source text, the translation is performed using the initial general translation model to obtain the third undetermined translation text and the position of the target word segment.
[0145] The third undetermined translation text is the text obtained by translating the source text with the initial general translation model. In the embodiments of this application, the target semantic tag not only has the semantic meaning of translating the source text into text belonging to the target language, but also has the semantic meaning of indicating the position of the target word segment in the source text. The probability of the target word segment being translated into the target language is less than the first probability threshold; that is, the target word segment is a word segment in the source text with a high probability of off-target problems.
[0146] For example, the first probability threshold can be 90%. When the probability of a word segment being translated into the target language is less than the first probability threshold, it indicates that the probability of the word segment being off-target is relatively high. Taking the source text "I love to eat apples" as an example, the source text includes four word segments: I, love, eat, and apple. Among them, "apple" is the target word, meaning that the probability of "apple" being translated into the target language is less than the first probability threshold.
[0147] This application does not limit how the target semantic tag indicates the position of the target word segment; two methods are given as examples. Method one: the target semantic tag can be embedded next to the target word segment, thus indicating both which word segment is the target word segment and where it is located. If the target semantic tag is <en>, then after the tag is embedded the source text can be "I love to eat <en> apples" or "I love to eat apples <en>". Method two: the position index of the target word segment can be added to the target semantic tag to express its position, such as <en4>; in "<en4> I love to eat apples", <en4> indicates that the 4th word segment is the target word segment in the source text.
[0148] Thus, the initial general translation model can obtain the position of the target word segment by identifying the position semantics expressed by the target semantic tag, and can obtain the third undetermined translation text by performing language conversion on the source text based on the current model parameters.
[0149] B2: If, in the third undetermined translation text, the language of the text at the position of the target word segment differs from the target language, the initial general translation model is used to translate the text at the position of the target word segment in the source text to obtain the translated word segment.
[0150] If, in the third undetermined translation text, the language of the text at the position of the target word segment is the same as the target language, the target word segment has no off-target problem in this translation. The probability that the third undetermined translation text has an off-target problem is therefore relatively small, and the third undetermined translation text can be directly determined as the translated text.
[0151] If, in the third undetermined translation text, the language of the text at the position of the target word segment differs from the target language, the target word segment has an off-target problem in this translation and needs to be corrected. The initial general translation model can be used to translate the text at the position of the target word segment again to obtain the translated word segment.
[0152] The translated word segment is text that is obtained by the initial general translation model translating the text at the position of the target word segment and that conforms to the target language. It should be noted that one or more translations through the initial general translation model may be needed until a translated word segment conforming to the target language is obtained. The translated word segment can be obtained by the initial general translation model re-translating the target word segment in the source text, or by translating the text at the position of the target word segment in the third undetermined translation text.
[0153] Continuing the example in B1, take the source text "I love to eat apples" for illustration. The target semantic tag indicates not only that the target language is English, but also that the 4th word segment is prone to off-target problems and is the target word segment. If, in the third undetermined translation text, the text at the position of the 4th word segment (the word for "apples") is not in English, the initial general translation model must translate that word segment again until "apples", which belongs to the target language, is obtained.
[0154] B3: Obtain the translated text according to the translated word segment and the third undetermined translation text.
[0155] For example, the translated word segment can be used to overwrite or replace the text at the position of the target word segment in the third undetermined translation text, thereby reinforcing the translation of the target word segment and obtaining a more accurate translated text.
[0156] Therefore, the target semantic tag not only has the semantic meaning of translating the source text into text belonging to the target language, but also the semantic meaning of indicating the position of the target word segment in the source text. If the text at the position of the target word segment in the third undetermined translation text obtained by the initial general translation model does not belong to the target language, i.e., the language indicated by the target semantic tag, the target word segment is translated again until a translated word segment belonging to the target language is obtained. The translated word segment then replaces the corresponding text in the third undetermined translation text to obtain the translated text. In this translated text, the target word segment that is prone to off-target problems has no off-target problem, which improves the translation accuracy of the translated text. Moreover, by translating the target word segment multiple times, the initial general translation model can learn which word segments are prone to off-target problems. Thus, the trained general translation model not only learns how to translate text belonging to the source language into text belonging to the target language, but also learns which word segments are prone to off-target problems and pays more attention to translating them.
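As an illustrative sketch only, the flow of B1-B3 under method two could look as follows. The tag format `<en4>`, the one-to-one alignment between source and draft word segments, the retry limit `max_attempts`, and the callbacks `translate_segment` and `is_language` are all assumptions of this example; the toy lexicon stands in for the initial general translation model and for a language detector.

```python
import re
from typing import Callable, List, Tuple

# Assumed tag format for method two: a language code followed by a
# 1-based word-segment position, e.g. "<en4>".
TAG_PATTERN = re.compile(r"^<([a-z]+)(\d+)>$")


def parse_position_tag(tag: str) -> Tuple[str, int]:
    """Return (target language code, 1-based position of the target segment)."""
    match = TAG_PATTERN.match(tag)
    if match is None:
        raise ValueError(f"unrecognized tag: {tag!r}")
    return match.group(1), int(match.group(2))


def correct_target_segment(
    source_segments: List[str],
    draft_segments: List[str],
    tag: str,
    translate_segment: Callable[[str, str], str],
    is_language: Callable[[str, str], bool],
    max_attempts: int = 3,
) -> List[str]:
    """B2-B3: re-translate the flagged segment if the draft left it off-target."""
    lang, position = parse_position_tag(tag)
    index = position - 1
    if is_language(draft_segments[index], lang):
        return list(draft_segments)  # no off-target problem at this position
    for _ in range(max_attempts):
        candidate = translate_segment(source_segments[index], lang)
        if is_language(candidate, lang):
            corrected = list(draft_segments)
            corrected[index] = candidate  # B3: replace the off-target text
            return corrected
    return list(draft_segments)  # give up and leave the draft unchanged


# Toy stand-ins for the model and the language check (hypothetical data).
TOY_LEXICON = {"manzanas": "apples"}
ENGLISH_WORDS = {"I", "love", "eating", "apples"}


def toy_translate(segment: str, lang: str) -> str:
    return TOY_LEXICON.get(segment, segment)


def toy_is_language(segment: str, lang: str) -> bool:
    return lang == "en" and segment in ENGLISH_WORDS


source = ["yo", "amo", "comer", "manzanas"]
draft = ["I", "love", "eating", "manzanas"]  # 4th segment is off-target
result = correct_target_segment(source, draft, "<en4>", toy_translate, toy_is_language)
print(result)
```

Only the flagged position is checked and re-translated, so the remainder of the draft is left untouched.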
[0157] In order to enable the initial general translation model to understand the semantics of the target semantic tag, this application provides a specific implementation in which the target semantic tag is converted according to a vocabulary mapping table, see C1-C4 (not shown in the figure).
[0158] C1: Obtain a vocabulary mapping table that includes various semantic tags.
[0159] A vocabulary mapping table includes the mapping relationship between text and encoding vectors, enabling the conversion between text and encoding vectors. Therefore, a vocabulary mapping table can be obtained by expanding the initial vocabulary mapping table with mapping relationships between various semantic tags and encoding vectors. The initial vocabulary mapping table does not include various semantic tags, while the current vocabulary mapping table includes various semantic tags.
[0160] Encoding vectors are numerical vectors obtained by encoding text. Encoding vectors are a numerical representation of text and can participate more directly in data processing operations such as matrix operations. Through encoding vectors, the initial general translation model can more easily learn the semantics expressed by semantic tags.
[0161] The vocabulary mapping table includes various semantic tags, each of which has the semantic meaning of translating the text to be translated into a particular language. It should be noted that, in this embodiment, the various semantic tags included in the vocabulary mapping table include the target semantic tag, which is used as an example below.
[0162] C2: Transform the target semantic label according to the vocabulary mapping table to obtain the label encoding vector.
[0163] The tag encoding vector is an encoding vector obtained by transforming the target semantic tag through a vocabulary mapping table. The vocabulary mapping table contains the mapping relationship between text and encoding vectors. Different texts correspond to different encoding vectors. Based on this, the encoding vector corresponding to the target semantic tag can be found based on the vocabulary mapping table, that is, the tag encoding vector.
[0164] C3: Transform the source text according to the vocabulary mapping table to obtain the text encoding vector.
[0165] The encoding vector corresponding to the source text can be found based on the vocabulary mapping table, i.e., the text encoding vector. The text encoding vector is the encoding vector obtained by transforming the source text through the vocabulary mapping table.
[0166] C4: Based on the tag encoding vector and text encoding vector, the translation is performed using the initial general translation model to obtain the translated text.
[0167] Both the tag encoding vector and the text encoding vector are numerical vectors, which the initial general translation model can process more easily. Thus, based on the tag encoding vector and the text encoding vector, translation is performed through the initial general translation model to obtain the translated text. As one possible implementation, the initial general translation model can first obtain the encoding vector of the translated text, and then convert this encoding vector according to the vocabulary mapping table to obtain the translated text.
[0168] It should be noted that, in the process of performing translation tasks, the general translation model can also first convert the text to be translated (such as the first text to be translated) and various semantic tags (such as the target semantic tags) through a vocabulary mapping table to obtain the corresponding encoding vectors before translation, so as to better understand the semantics in the text and semantic tags.
[0169] Therefore, the vocabulary mapping table is expanded by adding various semantic tags to it. After the semantic tags are converted into the corresponding encoding vectors based on the vocabulary mapping table, and translation is then performed using the initial general translation model or the general translation model, the model's understanding of the text and of the semantic tags is enhanced. This improves the model's ability to understand the semantics of multiple semantic tags, enabling it to learn the translation task indicated by the semantic tags more directly and fully, thus improving the translation accuracy of the general translation model.
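As an illustrative sketch only, C1-C4 could be pictured as follows. Integer ids stand in for encoding vectors (in a neural model each id would typically index a row of an embedding matrix); the toy vocabulary, the `<unk>` entry, and the functions `expand_vocab` and `encode` are assumptions of this example.

```python
from typing import Dict, List


def expand_vocab(initial_vocab: Dict[str, int], tags: List[str]) -> Dict[str, int]:
    """C1: extend the initial table with one new entry per semantic tag.

    Assumes the existing ids are contiguous (0 .. len - 1), so that
    len(vocab) is always the next free id.
    """
    vocab = dict(initial_vocab)
    for tag in tags:
        if tag not in vocab:
            vocab[tag] = len(vocab)
    return vocab


def encode(tokens: List[str], vocab: Dict[str, int]) -> List[int]:
    """C2/C3: look up each token; unknown tokens map to the <unk> entry."""
    return [vocab.get(token, vocab["<unk>"]) for token in tokens]


initial_vocab = {"<unk>": 0, "I": 1, "like": 2, "apples": 3}
vocab = expand_vocab(initial_vocab, ["<en>", "<es>"])

tag_ids = encode(["<es>"], vocab)                  # C2: tag encoding
text_ids = encode(["I", "like", "apples"], vocab)  # C3: text encoding
model_input = tag_ids + text_ids                   # C4: input handed to the model
print(model_input)
```

Because each tag receives its own entry, the model sees the tag as a single unit rather than as a sequence of unrelated characters.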
[0170] The training process of a general translation model requires multiple language sample pairs. However, if the target text in the target language sample pair contains text that does not belong to the target language, the language sample pair used for training will not be pure enough, which will lead to errors in the training process of the general translation model and thus a higher probability of off-target problems.
[0171] Based on this, embodiments of this application provide a data cleaning method to obtain cleaner data for training an initial general translation model. Language sample pairs can be divided into sample sets of multiple domains based on different factors such as style, background, and paradigm, such as the gaming domain, music domain, and literature domain. The target domain is one of these multiple domains. The following explanation uses data cleaning of the target domain as an example.
[0172] Specifically, an initial sample set for the target domain is obtained, and the language sample pairs whose target text is categorized as another language are removed from the initial sample set to obtain a standard sample set.
[0173] The initial sample set includes multiple language sample pairs, while the standard sample set is obtained after data cleaning of the initial sample set. It should be noted that the standard sample set may include target language sample pairs. The multiple language sample pairs included in the standard sample set are used to train the initial general translation model to obtain the general translation model. Other languages are languages different from the target language; for example, if the target language is German, other languages could be English or Portuguese. The category of the target text is the language to which the target text belongs. When all text in the target text belongs to the target language, the target text is categorized as the target language; when some or all of the text in the target text belongs to other languages, the target text is categorized as other languages.
[0174] By first determining the category of each target text in the initial sample set, we can obtain language sample pairs of target texts belonging to other languages. Then, we can remove these language sample pairs from the initial sample set to obtain a standard sample set for training the initial general translation model.
[0175] As shown in Figure 9, taking source text in Chinese and target text in English as an example, the initial sample set includes language sample pair A, language sample pair B, and language sample pair C. In language sample pair B, the English for the Chinese word meaning "support" should be "support", whereas "soutien" is the French equivalent, indicating that the category of the target text in this language sample pair is another language. Language sample pair B is deleted to obtain the standard sample set.
[0176] Therefore, for the initial sample set corresponding to the target field, delete the language sample pairs whose category of the target text is other languages to obtain a more pure standard sample set for the target field. When the standard sample set is used to train the general translation model for the target field, it will reduce the training error caused by inaccurate language samples, reduce the probability of the off-target problem when the general translation model obtained by training translates the text in the target field, and improve the translation accuracy.
[0177] The embodiment of this application does not limit the data cleaning method for the initial sample set, that is, it does not specifically limit how to delete the language sample pairs whose category of the target text is other languages from the initial sample set to obtain the standard sample set. The following takes two cleaning methods as examples for illustration.
[0178] Cleaning method one: Clean based on the language classification model.
[0179] As known from the foregoing, each language sample pair includes a pair of language samples (such as the source text and the target text). As a possible implementation, the category of the language sample can be predicted by the language classification model, and the language classification model is a model used to classify the language to which the text belongs. After obtaining the multiple language sample pairs included in the initial sample set, classify the source text or the target text in the multiple language sample pairs through the language classification model to obtain the category corresponding to each language sample. Therefore, through the language classification model, the category of each language sample can be identified, and thus data cleaning can be performed based on the category of the language sample.
[0180] The embodiment of this application does not specifically limit the language classification model; it may be, for example, a Support Vector Machine (SVM), a Convolutional Neural Network (CNN), or Bidirectional Encoder Representations from Transformers (BERT).
[0181] Most language classification models classify text by sequence prediction, which predicts the next element or the entire output sequence based on a given input sequence (such as characters, words, or sentences). However, BERT, using a bidirectional transformer encoder, can consider contextual information simultaneously. Compared to other models (those based on sequence prediction), it focuses more on semantic understanding for language classification. Semantic understanding involves parsing and interpreting the meaning of text, including the meaning of words, grammatical structure, and implicit information in the context. Semantic understanding focuses on the actual meaning of the text, not just its form or structure. In the process of classifying language samples, BERT can more fully understand the semantics of the text input to the model, resulting in higher accuracy. This application does not specifically limit the training process of the language classification model; subsequent explanations will use F1-F3 as examples, and will not be repeated here.
[0182] It should be noted that the data cleaning process can adopt different processing methods depending on the granularity of the data. The following will illustrate two methods (i.e., sentence-based or word-based).
[0183] First, the process of data cleaning on a sentence-by-sentence basis will be explained, see D1-D2 (not shown in the figure).
[0184] D1: Classify the target text in each language sample pair included in the initial sample set according to the language classification model to obtain the category of each target text.
[0185] Taking a target text in a language sample pair as an example, the probability of the target text belonging to different languages is predicted based on the language classification model, resulting in the probability distribution of the target text belonging to different languages. The category of the target text is then determined based on the probability distribution of the target text belonging to different languages, such as identifying the language with the highest probability as the category of the target text.
[0186] Each target text in each language sample pair in the initial sample set is identified using a language classification model to determine the category of each target text.
[0187] D2: Remove language sample pairs of target text belonging to other languages from the initial sample set to obtain the standard sample set.
[0188] Thus, only the category of the target text in each language sample pair needs to be identified. When the category of the target text is another language, the target text and its corresponding source text are deleted from the initial sample set, that is, the language sample pair containing the target text is deleted, so as to obtain a relatively pure standard sample set.
[0189] Therefore, by using a language classification model to determine the category of each target text in the initial sample set, language sample pairs containing target texts belonging to other languages are deleted. This allows for the training of a more accurate general translation model based on a purer standard sample set (i.e., a lower probability that the target text in each language sample pair belongs to another language). This method requires fewer language samples to be identified, has a faster recognition speed, and is convenient and efficient.
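As an illustrative sketch only, D1-D2 could be written as follows. The callback `classify` stands in for the language classification model and is assumed to return a probability distribution over languages; `toy_classify`, its fixed probabilities, and the sample pairs (loosely modeled on the Figure 9 example) are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

SamplePair = Tuple[str, str]  # (source text, target text)


def clean_by_sentence(
    pairs: List[SamplePair],
    target_lang: str,
    classify: Callable[[str], Dict[str, float]],
) -> List[SamplePair]:
    """Keep only pairs whose target text is classified as the target language."""
    kept = []
    for source, target in pairs:
        distribution = classify(target)                     # D1
        category = max(distribution, key=distribution.get)  # most probable language
        if category == target_lang:                         # D2
            kept.append((source, target))
    return kept


# Toy classifier (hypothetical): flags a text as French if it contains a
# word from a tiny French word list.
FRENCH_WORDS = {"soutien"}


def toy_classify(text: str) -> Dict[str, float]:
    has_french = any(word.lower() in FRENCH_WORDS for word in text.split())
    return {"en": 0.2, "fr": 0.8} if has_french else {"en": 0.9, "fr": 0.1}


pairs = [
    ("source A", "Thank you for your help"),
    ("source B", "Thank you for your soutien"),
    ("source C", "See you tomorrow"),
]
standard_set = clean_by_sentence(pairs, "en", toy_classify)
print(standard_set)
```

Only the target side of each pair is classified, which is why this method needs fewer classification calls than checking both texts.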
[0190] Next, the process of data cleaning based on word segmentation will be explained, see E1-E3 (not shown in the figure).
[0191] E1: Based on the language classification model, classify the words in the target text of each language sample pair in the initial sample set to obtain the category of each word in each target text.
[0192] Each word in the target text can be obtained by segmenting the target text using a word segmenter. The category of each word is predicted based on the language classification model to obtain the category of each word in the target text.
[0193] One possible approach is to segment the target text using the tokenizer of the BERT model, and then classify each segmented word based on the BERT model, thereby improving the accuracy of the classification results.
[0194] E2: If the category of a word segment is another language, the word segment belonging to the other language is translated into a word segment belonging to the target language to obtain the updated target text.
[0195] If the category of a word segment is another language, training the initial general translation model on the target text containing that word segment carries a higher probability of off-target errors. Therefore, the word segment can be corrected by translating it from the other language into the target language. Then, in the target text, the word segment in the target language replaces the word segment in the other language, resulting in an updated target text.
[0196] E3: Obtain the standard sample set based on the updated target text.
[0197] For example, the target text can be removed from the initial sample set and the updated target text can be added to the initial sample set to obtain the standard sample set.
[0198] Therefore, by re-translating word segments belonging to other languages in the target text, we can minimize the loss of language samples due to data cleaning, thus ensuring a richer set of language samples in the standard sample set and guaranteeing the accuracy of the trained general translation model. Compared to directly deleting training data with off-target issues from the training dataset, this method corrects word segments in the training data with off-target problems, making the training data richer and resulting in a model with stronger generalization and higher accuracy.
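As an illustrative sketch only, E1-E3 could be written as follows. Whitespace splitting stands in for a real word segmenter, and the callbacks `classify_word` and `translate_word`, together with their toy implementations, are assumptions of this example.

```python
from typing import Callable, List, Tuple

SamplePair = Tuple[str, str]  # (source text, target text)


def repair_target_text(
    target: str,
    target_lang: str,
    classify_word: Callable[[str], str],
    translate_word: Callable[[str, str], str],
) -> str:
    """E1-E2: replace each off-language word segment with its translation."""
    repaired: List[str] = []
    for word in target.split():  # whitespace split as a stand-in segmenter
        if classify_word(word) == target_lang:
            repaired.append(word)
        else:
            repaired.append(translate_word(word, target_lang))
    return " ".join(repaired)


# Toy stand-ins (hypothetical) for the classifier and the translator.
TOY_FRENCH_TO_ENGLISH = {"soutien": "support"}


def toy_classify_word(word: str) -> str:
    return "fr" if word.lower() in TOY_FRENCH_TO_ENGLISH else "en"


def toy_translate_word(word: str, lang: str) -> str:
    return TOY_FRENCH_TO_ENGLISH.get(word.lower(), word)


initial_set: List[SamplePair] = [
    ("source A", "Thank you for your help"),
    ("source B", "Thank you for your soutien"),
]
# E3: every pair is kept; only off-language segments are rewritten.
repaired_set = [
    (src, repair_target_text(tgt, "en", toy_classify_word, toy_translate_word))
    for src, tgt in initial_set
]
print(repaired_set)
```

Unlike sentence-level deletion, the sample count is unchanged, which reflects the stated aim of limiting the loss of language samples.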
[0199] The training process for the language classification model can be found in F1-F3 (not shown in the figure).
[0200] F1: Get language samples with category labels.
[0201] Category labels are used to identify the language to which a language sample belongs. This application does not limit the labeling method of category labels. For example, category labels can be manually labeled or automatically labeled for language samples according to a predefined rule set. The rule set can include rules based on keyword matching, regular expressions or other logical conditions.
[0202] Language samples generally come from clean datasets, such as the Workshop on Machine Translation (WMT) dataset or the Technology, Entertainment, Design (TED) dataset. Because these publicly available datasets are typically strongly labeled, they are clean and of high quality. Language classification models trained on clean language samples tend to have higher accuracy.
[0203] F2: Classify the language samples according to the initial language classification model to obtain the predicted category of the language samples.
[0204] The initial language classification model is a language classification model that has not yet been fully trained. Language samples are classified using the current model parameters of the initial language classification model to obtain the predicted category of each language sample. The classification process is described in D1 and will not be repeated here.
[0205] F3: Adjust the model parameters of the initial language classification model based on the difference between the predicted category and the category label to obtain the language classification model.
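As an illustrative sketch only, the loop F1-F3 could be demonstrated with a multiclass perceptron over character bigrams. The perceptron is a deliberately simple substitute chosen for brevity, not one of the models named above (SVM, CNN, BERT); the class `PerceptronLanguageClassifier`, the feature function `char_bigrams`, and the four labeled samples are assumptions of this example.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def char_bigrams(text: str) -> List[str]:
    """Character bigrams of the lower-cased text, padded with spaces."""
    padded = f" {text.lower()} "
    return [padded[i:i + 2] for i in range(len(padded) - 1)]


class PerceptronLanguageClassifier:
    def __init__(self, languages: List[str]) -> None:
        self.languages = languages
        # One weight per (language, character bigram); all start at zero.
        self.weights: Dict[str, Dict[str, float]] = {
            lang: defaultdict(float) for lang in languages
        }

    def predict(self, text: str) -> str:
        """F2: score each language with the current parameters."""
        features = char_bigrams(text)
        scores = {
            lang: sum(self.weights[lang][f] for f in features)
            for lang in self.languages
        }
        return max(self.languages, key=lambda lang: scores[lang])

    def train(self, samples: List[Tuple[str, str]], epochs: int = 5) -> None:
        for _ in range(epochs):
            for text, label in samples:         # F1: samples with category labels
                predicted = self.predict(text)  # F2: predicted category
                if predicted != label:          # F3: adjust only on a mismatch
                    for f in char_bigrams(text):
                        self.weights[label][f] += 1.0
                        self.weights[predicted][f] -= 1.0


samples = [
    ("the apple is red", "en"),
    ("I like to eat apples", "en"),
    ("la manzana es roja", "es"),
    ("me gusta comer manzanas", "es"),
]
model = PerceptronLanguageClassifier(["en", "es"])
model.train(samples)
print([model.predict(text) for text, _ in samples])
```

The update in `train` is the simplest form of adjusting parameters based on the difference between the predicted category and the category label; a gradient-based model would replace the fixed increments of 1.0 with gradient steps.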
[0206] It should be noted that when training the initial language classification model, language samples are typically segmented using a word segmenter. A word segmenter is a tool or algorithm used to divide text into smaller units (such as words). Different languages are difficult to segment using the same word segmenter due to their different grammatical rules. For example, using the same word segmenter (such as the word segmenter built into the initial language classification model) for Chinese and Spanish will lead to inaccurate word segmentation. Therefore, this application provides an implementation method for joint training of the word segmenter and the initial language classification model, as shown in G1-G5 (not shown in the figures).
[0207] G1: Obtain target language samples with category labels.
[0208] The target language sample is one of multiple language samples used to train the initial language classification model.
[0209] G2: Determine the initial target segmenter based on the category of the target language sample, and perform segmentation on the target language sample based on the initial target segmenter to obtain multiple segments corresponding to the target language sample.
[0210] The initial target segmenter is a segmenter that matches the target language samples. Compared to the target segmenter, the initial target segmenter has not yet been fully trained.
[0211] For example, if the category of the target language sample is Japanese, a word segmenter that can segment the target language sample according to Japanese grammar rules is selected as the initial target word segmenter. Segmenting the target language sample with this matching word segmenter yields the multiple word segments corresponding to the target language sample with higher accuracy.
[0212] G3: Classify multiple word segments according to the initial language classification model to obtain the probability distribution of each word segment.
[0213] Multiple word segments can be input into the initial language classification model as a single word segmentation sequence, or multiple word segments can be input into the initial language classification model separately. Then, the probability distribution of each word segment can be predicted based on the initial language classification model, that is, the probability of each word segment belonging to a different language can be predicted.
[0214] G4: Determine the predicted category of the target language sample based on the probability distribution of each word segment.
[0215] Taking the example that each word in the target language sample belongs to English, if the probability of each word belonging to English is greater than the preset threshold, then the predicted category of the target language sample is determined to be English.
[0216] G5: Based on the difference between the predicted category and the category label, adjust the model parameters of the initial language classification model and the initial target word segmenter to obtain the language classification model and the target word segmenter.
[0217] Therefore, by selecting a language-specific word segmenter and training it jointly with the initial language classification model, the ability of the initial language classification model to adapt to the word segmentation of different languages during training is improved, thereby improving the accuracy of the language classification model.
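As an illustrative sketch only, the selection in G2 and the decision in G4 could be expressed as follows; the joint parameter update of G5 is not covered. The `SEGMENTERS` registry, the character-level stand-in for Japanese, the default threshold of 0.5, and the per-segment distributions are assumptions of this example.

```python
from typing import Callable, Dict, List, Optional

Segmenter = Callable[[str], List[str]]

# G2: one (hypothetical) segmenter per language category.
SEGMENTERS: Dict[str, Segmenter] = {
    "en": lambda text: text.split(),  # whitespace-delimited languages
    "ja": lambda text: list(text),    # character-level stand-in for Japanese
}


def predict_sample_category(
    distributions: List[Dict[str, float]], threshold: float = 0.5
) -> Optional[str]:
    """G4: return a language only if every segment exceeds the threshold for it.

    With a threshold of at least 0.5, at most one language can qualify;
    with a lower threshold the alphabetically first qualifying one is returned.
    """
    if not distributions:
        return None
    for lang in sorted(distributions[0]):
        if all(dist.get(lang, 0.0) > threshold for dist in distributions):
            return lang
    return None


segments = SEGMENTERS["en"]("I like apples")
# G3 output is assumed here: one probability distribution per segment.
uniform = [
    {"en": 0.9, "es": 0.1},
    {"en": 0.8, "es": 0.2},
    {"en": 0.7, "es": 0.3},
]
mixed = [{"en": 0.9, "es": 0.1}, {"en": 0.2, "es": 0.8}]
print(predict_sample_category(uniform), predict_sample_category(mixed))
```

Returning no category for the mixed case leaves the handling of such samples to the caller.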
[0218] The second cleaning method is based on the probability of word segmentation, see H1-H4 (not shown in the figure).
[0219] H1: Segment the target text in the undetermined language sample pairs in the initial sample set to obtain multiple segments.
[0220] The undetermined language sample pair is one of multiple language sample pairs.
[0221] H2: Determine the probability of occurrence of each word using the word probability table of the target language.
[0222] The word probability table includes the probability of each word belonging to the target language appearing in the target domain's lexicon. The probability of each word appearing in the target domain's lexicon can be determined based on the number of times each word appears in the target domain's lexicon.
[0223] Taking the encyclopedic knowledge domain as an example, the encyclopedic knowledge domain lexicon could be 1000 Chinese sentences describing knowledge. Translated sentences in different languages are collected. Taking English as the target language, statistical language model-based algorithms (N-grams) can be used to statistically analyze and capture word sequence patterns in the text. For example, 2-grams (i.e., N=2) are used to represent word groups composed of two consecutive word segments, and the frequency or probability of each word segment is recorded statistically.
[0224] For example, firstly, word segmentation yields 5000 word groups, and the frequency of each word group within these 5000 groups is counted. For instance, "I like" appears 500 times, "the pomelo" appears 80 times, "the toronja" appears 0 times, and "teme" appears 3 times. A non-linear normalization method can be used to normalize these frequency counts to probabilities between 0 and 1, resulting in a 2-gram word frequency array for English. This array, along with the probability of each word group, forms a word probability table. In this table, the probability of the word group "I like" is 0.72, the probability of the word group "the pomelo" is 0.52, the probability of the word group "the toronja" is 0, and the probability of the word group "teme" is 0.16, etc.
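A word probability table of this kind could be built roughly as follows. The sketch assumes whitespace tokenization and uses log scaling, log(1 + count) / log(1 + max count), as one possible non-linear normalization. The text does not say which normalization is used, so the values produced here will not reproduce the 0.72, 0.52, and 0.16 figures above. The names `bigrams` and `build_probability_table` are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, Iterable, List, Tuple

Bigram = Tuple[str, str]


def bigrams(tokens: List[str]) -> List[Bigram]:
    """Return consecutive two-token word groups (2-grams)."""
    return list(zip(tokens, tokens[1:]))


def build_probability_table(sentences: Iterable[str]) -> Dict[Bigram, float]:
    """Count 2-grams over a target-language lexicon and map counts into [0, 1]."""
    counts: Counter = Counter()
    for sentence in sentences:
        counts.update(bigrams(sentence.lower().split()))
    if not counts:
        return {}
    max_count = max(counts.values())
    # Assumed normalization: logarithmic scaling relative to the most
    # frequent word group. Unseen word groups are simply absent (probability 0).
    return {
        group: math.log1p(count) / math.log1p(max_count)
        for group, count in counts.items()
    }


# Tiny stand-in for the target-domain lexicon.
table = build_probability_table(["I like the pomelo", "I like tea"])
```

Looking up a word group that never occurs in the lexicon, for example with `table.get(("the", "toronja"), 0.0)`, yields 0, matching the treatment of "the toronja" in the example.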
[0225] H3: Based on the occurrence probability of each word segment, determine the probability that the target text in the undetermined language sample pair belongs to the target language.
[0226] The probability of the target text in the language sample pair belonging to the target language can be obtained by averaging the occurrence probabilities of each segment. Alternatively, different weights can be assigned to each segment based on its part of speech (e.g., preposition, verb, noun, etc.), such as assigning lower weights to prepositions and higher weights to nouns, and then performing a weighted average operation on the occurrence probabilities of each segment to obtain the probability of the target text in the language sample pair belonging to the target language. This embodiment of the application does not limit this.
[0227] H4: If the probability that the language of the target text in the undetermined language sample pair belongs to the target language is less than the second probability threshold, then the undetermined language sample pair is deleted from the initial sample set to obtain the standard sample set.
[0228] The second probability threshold is used to reflect the probability that the target text belongs to the target language. For example, the second probability threshold can be 70%. If the probability that the target text belongs to the target language is less than the second probability threshold, it means that the probability that the target text belongs to the target language is relatively small. Then, the undetermined language sample pairs are deleted from the initial sample set to obtain the standard sample set.
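Steps H1-H4 can be sketched as below, under several assumptions: word segments are whitespace-separated 2-grams, the word probability table is a plain dictionary with made-up values, and the optional part-of-speech weights are supplied by the caller. The names `text_probability` and `clean_by_ngram` are hypothetical.

```python
from typing import Dict, List, Optional, Sequence, Tuple

Bigram = Tuple[str, str]
SamplePair = Tuple[str, str]  # (source text, target text)


def text_probability(
    tokens: Sequence[str],
    table: Dict[Bigram, float],
    weights: Optional[Sequence[float]] = None,
) -> float:
    """H2-H3: look up each word group and return the (weighted) average."""
    groups = list(zip(tokens, tokens[1:]))
    if not groups:
        return 0.0
    if weights is None:
        weights = [1.0] * len(groups)  # plain average
    if len(weights) != len(groups):
        raise ValueError("one weight per word group is required")
    total = sum(weights)
    if total == 0:
        return 0.0
    return sum(w * table.get(g, 0.0) for g, w in zip(groups, weights)) / total


def clean_by_ngram(
    pairs: List[SamplePair],
    table: Dict[Bigram, float],
    second_threshold: float = 0.7,
) -> List[SamplePair]:
    """H4: drop pairs whose target text scores below the second threshold."""
    return [
        (source, target)
        for source, target in pairs
        if text_probability(target.lower().split(), table) >= second_threshold
    ]


# Illustrative table; the values are assumptions, not taken from the application.
table = {("i", "like"): 0.72, ("like", "the"): 0.9, ("the", "pomelo"): 0.52}
pairs = [("我喜欢柚子", "I like the pomelo"), ("我喜欢柚子", "I like the toronja")]
standard = clean_by_ngram(pairs, table)
```

With these illustrative values the first pair averages to about 0.71 and is kept, while the second averages to 0.54 because "the toronja" contributes 0, so it is removed.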
[0229] The embodiments of this application do not specifically limit the method of cleaning the initial sample set. For example, the set can be cleaned using cleaning method one alone, using cleaning method two alone, using cleaning method one first and then cleaning method two, or using cleaning method two first and then cleaning method one. The following description takes cleaning with method one first and then with method two as an example. See I1-I7 for details.
[0230] I1: Classify the target text in each language sample pair included in the initial sample set according to the language classification model to obtain the category of each target text.
[0231] I2: Remove the language sample pairs whose target text belongs to other languages from the initial sample set to obtain the undetermined sample set.
[0232] The undetermined sample set is the sample set obtained by the first cleaning process described above.
[0233] I3: For undetermined language sample pairs in the undetermined sample set, segment the target text in the undetermined language sample pairs to obtain multiple segments.
[0234] An undetermined language sample pair is one of the multiple language sample pairs in the undetermined sample set.
[0235] I4: Determine the occurrence probability of each word using the word probability table of the target language.
[0236] The word probability table includes the probability of each word belonging to the target language appearing in the vocabulary of the target domain.
[0237] I5: Based on the occurrence probability of each word segment, determine the probability that the target text in the undetermined language sample pair belongs to the target language.
[0238] I6: If the probability that the language of the target text in the undetermined language sample pair belongs to the target language is less than the second probability threshold, then the undetermined language sample pair is deleted from the undetermined sample set.
[0239] I7: Treat each language sample pair in the undetermined sample set as an undetermined language sample pair, and perform I3-I6 to obtain the standard sample set.
[0240] Because the language classification model is pre-trained on a large amount of data, it first identifies the category of each target text with high accuracy and catches most target texts belonging to other languages. Cleaning method two then only needs to identify the remaining target texts in the undetermined sample set that belong to other languages, which reduces the number of identification steps in the overall data cleaning process. In other words, after the initial sample set is processed with cleaning method one to obtain the undetermined sample set, cleaning method two is used for further cleaning. This not only removes the remaining target texts belonging to other languages more thoroughly, giving better cleaning results, but also reduces the computational load of data cleaning.
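The two-pass flow of I1-I7 can be sketched as a small pipeline. Both scoring functions are passed in as callables; the stubs `toy_classify` and `toy_ngram_probability` below are placeholders invented for the example and stand in for the trained language classification model and the word-probability-table lookup.

```python
from typing import Callable, List, Tuple

SamplePair = Tuple[str, str]  # (source text, target text)


def two_stage_clean(
    pairs: List[SamplePair],
    classify: Callable[[str], str],
    ngram_probability: Callable[[str], float],
    target_language: str,
    second_threshold: float = 0.7,
) -> List[SamplePair]:
    # I1-I2: the classifier removes pairs whose target text is in another language.
    undetermined = [p for p in pairs if classify(p[1]) == target_language]
    # I3-I7: the N-gram score is computed only for the pairs that survived pass one.
    return [p for p in undetermined if ngram_probability(p[1]) >= second_threshold]


def toy_classify(text: str) -> str:
    """Placeholder classifier: any CJK character marks the text as Chinese."""
    return "zh" if any("\u4e00" <= ch <= "\u9fff" for ch in text) else "en"


def toy_ngram_probability(text: str) -> float:
    """Placeholder score: penalize one known out-of-language word."""
    return 0.2 if "toronja" in text else 0.9


pairs = [
    ("src-1", "Mayhem Mode"),
    ("src-2", "混乱模式 Mode"),
    ("src-3", "I like to eat toronja"),
]
standard = two_stage_clean(pairs, toy_classify, toy_ngram_probability, "en")
```

Because the second pass only iterates over `undetermined`, the N-gram scoring is never run on samples the classifier has already rejected, which is the saving in computation described above.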
[0241] To facilitate a further understanding of the technical solutions provided in the embodiments of this application, the following example uses a server as the execution subject of the training method for the general translation model provided in the embodiments of this application, and the target domain as the game domain, to provide an overall exemplary description of the training method for the general translation model.
[0242] See Figure 10, which is a schematic diagram of an application scenario for the training method of a general translation model provided in an embodiment of this application.
[0243] S1001: Obtain clean data in multiple languages.
[0244] For example, obtain relatively clean datasets such as the WWT dataset and the TED dataset. Since these publicly available datasets are generally strongly labeled, they are clean and high-quality datasets, and language classification models trained based on clean language samples will have higher accuracy.
[0245] S1002: The language classification model is obtained through training.
[0246] The initial language classification model is trained using clean multilingual data to obtain the language classification model. For details, please refer to steps F1-F3 above, which will not be repeated here.
[0247] S1003: Obtain the initial sample set in the game domain.
[0248] The initial sample set includes multiple language sample pairs related to the game domain. Each language sample pair includes a source text and a target text. The language sample pairs included in the initial sample set contain impure language sample pairs, such as Chinese included in a target sample that should be English.
[0249] Using the trained language classification model, an initial sample set in the game domain (which can be further refined based on various games; this application does not specifically limit this) can be classified and cleaned to obtain a standard sample set that accurately belongs to a certain language. This will be explained in detail below.
[0250] S1004: Data cleaning is performed using a language classification model and N-grams to obtain a standard sample set.
[0251] The initial sample set is cleaned for the first time using the language classification model, as detailed in D1-D2 above, and cleaned for the second time using N-grams, as detailed in H1-H4 above; alternatively, the two cleaning passes can be performed as described in I1-I7 above.
[0252] Table 5 shows an example of cleaning the initial sample set.
[0253] Table 5
[0254]
[0255]
[0256] Column 1 contains the Chinese version of the game text (i.e., the source text in the language sample pairs). Column 2 contains the English version of the game text before cleaning, which may contain impurities. Columns 1 and 2 therefore constitute the initial sample set. Column 3 contains the English version of the game text after cleaning, which is largely free of impurities. Columns 1 and 3 therefore constitute the standard sample set. Specifically, the correct description for the first data point in column 3 should be "Mayhem Mode - Movement speed reduction Arthur's Holy Guard Mode". The standard sample set is obtained by deleting from the initial sample set the language sample pairs whose target text is not in English.
[0257] Taking a BERT model as an example of the language classification model, see Figure 11, which is a schematic diagram of data cleaning using a language classification model provided in an embodiment of this application.
[0258] S1101: Segment words using a word segmenter.
[0259] The BERT model's tokenizer segments the target texts in the initial sample set, resulting in multiple tokens. This converts the text of the initial sample set into the input format required by the BERT model, after which the BERT model can be initialized and its structure loaded.
[0260] S1102: Initialize the BERT model and load the model structure.
[0261] The output of the last layer of the BERT model is connected to a fully connected linear layer to act as a classifier, thereby classifying the language to which the text belongs.
[0262] S1103: Feature extraction using the BERT model.
[0263] The BERT model is used to extract features from the target text in the initial sample set, resulting in an embedding vector that describes the characteristics of the target text.
[0264] S1104: Classification of fully connected linear layers.
[0265] Based on embedded vectors, classification is performed through fully connected linear layers to classify the language to which the text belongs.
[0266] For example, to determine whether the sentence "I like to eat toronja" is English, suppose the inference output of the BERT model for it is 0.63. According to the empirical threshold, sentences scoring below 0.9 do not exhibit the characteristics of English and need to be cleaned, i.e., removed from the initial sample set.
[0267] Research revealed that a threshold applied to a single method is rarely completely accurate; even samples scoring above 0.9 occasionally include a small number of erroneous data points. Therefore, a second method is added: N-grams are used to infer which language each language sample pair in the initial sample set most probably belongs to, and the language sample pairs that do not belong to the target language are removed to obtain the standard sample set. The first data cleaning of the initial sample set, using the language classification model, is essentially a preliminary screening. The second data cleaning, using N-grams, serves as verification and a safeguard, ultimately yielding a relatively pure standard sample set.
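The classification head of S1104 and the threshold check in this example can be sketched without any deep-learning library. The embedding, weight matrix, and bias below are made-up numbers standing in for the BERT output and the trained fully connected layer; only the 0.9 threshold and the 0.63 score come from the text.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify_embedding(
    embedding: Sequence[float],
    weights: Sequence[Sequence[float]],  # one row of weights per language
    bias: Sequence[float],
) -> List[float]:
    """Fully connected linear layer followed by softmax over languages."""
    logits = [
        sum(w * x for w, x in zip(row, embedding)) + b
        for row, b in zip(weights, bias)
    ]
    return softmax(logits)


def keep_sample(target_language_probability: float, threshold: float = 0.9) -> bool:
    """Samples scoring below the empirical threshold are removed."""
    return target_language_probability >= threshold


# Placeholder values: a 2-dimensional embedding and two language classes.
probs = classify_embedding([1.0, 0.0], [[2.0, 0.0], [0.0, 0.0]], [0.0, 0.0])
decision_for_example_sentence = keep_sample(0.63)  # score quoted in the text
```

In practice the embedding would come from the last BERT layer described in S1103, and the weights would be learned during the training of the language classification model.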
[0268] S1005: Add semantic labels to language sample pairs in the standard sample set.
[0269] Taking the target language sample pairs in the standard sample set as an example, target semantic labels are generated based on the target language and added to the target language sample pairs to obtain target language sample pairs including target semantic labels.
[0270] This application does not specifically limit the method of adding target semantic labels to language sample pairs. For details, please refer to addition methods one to four above. In this embodiment of the application, addition method one, which gives better results, is adopted.
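As a rough illustration of S1005, the sketch below generates a tag from the target language and attaches it to the source text of a sample pair. The `<2xx>` tag format and the choice to prepend the tag to the source text are assumptions made for the example; they are not claimed to match any particular one of the four addition methods.

```python
from typing import Tuple

SamplePair = Tuple[str, str]  # (source text, target text)


def make_tag(target_language: str) -> str:
    """Build a semantic tag meaning 'translate into target_language'.
    The <2xx> spelling is a hypothetical convention."""
    return f"<2{target_language}>"


def add_tag_to_source(pair: SamplePair, target_language: str) -> SamplePair:
    """Return a new pair whose source text carries the target semantic tag."""
    source, target = pair
    return f"{make_tag(target_language)} {source}", target


tagged = add_tag_to_source(("你好", "Hello"), "en")
```

The target text is left untouched here, so the same pair can still be compared against the model output during training.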
[0271] S1006: Expand the target semantic tags to obtain a lexical mapping table.
[0272] The vocabulary mapping table includes multiple semantic tags, each having the semantic meaning of translating text into a different language. The vocabulary mapping table establishes the mapping relationship between text and encoding vectors, enabling conversion between text and encoding vectors.
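Expanding the table with semantic tags can be sketched as adding new entries to a token-to-id dictionary. Integer ids are used here as a simplified stand-in for the encoding vectors (an embedding lookup would normally follow), and the names `expand_vocab` and `encode`, the `<unk>` entry, and the `<2xx>` tag spelling are all assumptions.

```python
from typing import Dict, Iterable, List


def expand_vocab(vocab: Dict[str, int], tags: Iterable[str]) -> Dict[str, int]:
    """Return a copy of `vocab` with each unseen tag given a fresh id."""
    expanded = dict(vocab)
    next_id = max(expanded.values(), default=-1) + 1
    for tag in tags:
        if tag not in expanded:
            expanded[tag] = next_id
            next_id += 1
    return expanded


def encode(tokens: List[str], vocab: Dict[str, int], unk: str = "<unk>") -> List[int]:
    """Map tokens to ids, falling back to the unknown-token id."""
    return [vocab.get(token, vocab[unk]) for token in tokens]


base_vocab = {"<unk>": 0, "hello": 1, "world": 2}
vocab = expand_vocab(base_vocab, ["<2en>", "<2es>"])
ids = encode(["<2es>", "hello", "world"], vocab)
```

Giving each tag its own entry keeps it from being split into sub-tokens or mapped to the unknown id, so the model can learn a dedicated representation for it.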
[0273] S1007: A general translation model is obtained by training on a standard sample set with semantic labels.
[0274] For example, for target language sample pairs in the standard sample set, translation is performed based on the target semantic tags and source text using an initial general translation model to obtain translated text. Then, based on the differences between the translated text and the target text, the model parameters of the initial general translation model are adjusted to obtain a general translation model. See S402-S403 above for details, which will not be repeated here.
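The idea of adjusting parameters according to the difference between the translated text and the target text can be illustrated with a toy example. The sketch assumes token-level cross-entropy as the difference measure and applies one gradient-descent update directly to a table of per-position logits, which stands in for real model parameters. The application does not specify the loss function or optimizer, so both are assumptions.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits_per_step: List[List[float]], target_ids: List[int]) -> float:
    """Average negative log-probability assigned to the target tokens."""
    loss = 0.0
    for logits, target in zip(logits_per_step, target_ids):
        loss -= math.log(softmax(logits)[target])
    return loss / len(target_ids)


def gradient_step(
    logits_per_step: List[List[float]], target_ids: List[int], lr: float = 0.5
) -> List[List[float]]:
    """One descent step; d(loss)/d(logit) = softmax(logit) - one_hot(target)."""
    updated = []
    for logits, target in zip(logits_per_step, target_ids):
        probs = softmax(logits)
        grad = [p - (1.0 if i == target else 0.0) for i, p in enumerate(probs)]
        updated.append([x - lr * g for x, g in zip(logits, grad)])
    return updated


# Two output positions over a 3-token vocabulary; targets are token ids 1 and 2.
logits = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
targets = [1, 2]
loss_before = cross_entropy(logits, targets)
loss_after = cross_entropy(gradient_step(logits, targets), targets)
```

After the update the target tokens receive higher probability, so `loss_after` is smaller than `loss_before`; a real training run would repeat this over the whole standard sample set with the gradients flowing back into the translation model.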
[0275] Therefore, to address the off-target phenomenon in machine translation, this application's embodiments utilize a language classification model for language detection and N-gram word frequency detection. By combining these two methods, non-pure language training data is comprehensively cleaned to obtain a relatively pure standard sample set. The initial general translation model is then trained based on this pure standard sample set, thereby mitigating the off-target problem in the trained general translation model. Furthermore, by adding semantic labels to each language sample pair in the standard sample set and expanding the vocabulary mapping table, the general translation model's ability to understand semantic labels is enhanced, allowing it to learn more target language information. This further reduces the probability of off-target problems, improves translation accuracy, and enhances the user experience.
[0276] In response to the training method of the general translation model described above, this application also provides a corresponding training device for the general translation model, so that the training method of the general translation model can be applied and implemented in practice.
[0277] See Figure 12, which is a schematic diagram of the structure of a training device for a general translation model provided in an embodiment of this application. As shown in Figure 12, the training device 1200 of the general translation model includes an acquisition unit 1201, a translation unit 1202, and a training unit 1203.
[0278] The acquisition unit 1201 is used to acquire a target language sample pair including a target semantic tag. The target language sample pair includes a source text belonging to the source language and a target text belonging to the target language. The source text and the target text are texts that express the same semantics but belong to different languages. The target semantic tag has the semantics of translating the source text into text belonging to the target language.
[0279] The translation unit 1202 is used to translate based on the target semantic tag and the source text using an initial general translation model to obtain translated text. The translated text is the text obtained by translating the source text based on the target language indicated by the target semantic tag. The initial general translation model is used to translate texts in multiple languages.
[0280] The training unit 1203 is used to adjust the model parameters of the initial general translation model according to the difference between the translated text and the target text, so as to obtain a general translation model.
[0281] As can be seen from the above technical solution, the training device for the general translation model provided in this application includes an acquisition unit, a translation unit, and a training unit. During training, the acquisition unit acquires target language sample pairs including target semantic tags. The semantics represented by these tags enable the general translation model to clearly define the translation task it performs. Taking a target language sample pair including target semantic tags as an example, this pair includes source text belonging to the source language and target text belonging to the target language. The target semantic tags have the semantic meaning of translating the text into text belonging to the target language. Through the translation unit, based on the target semantic tags and the source text, translation is performed using the initial general translation model. During the translation process, the initial general translation model can clearly define the current translation task based on the semantics of the target semantic tags, thereby translating the source text based on this translation task to obtain the translated text. Through the training unit, the model parameters of the initial general translation model are adjusted according to the differences between the translated text and the target text to obtain the general translation model. Therefore, through the training process, the general translation model can not only learn how to translate text belonging to the source language into text belonging to the target language, but also learn to understand the semantics of semantic tags, thereby clarifying the target language indicated by the current translation task, thus reducing the probability of off-target problems in the translation process and improving the accuracy of translation.
[0282] As one possible implementation, the device further includes an application unit for:
[0283] If the source text includes the target semantic tag, obtain the first text to be translated, which belongs to the source language, and the target language;
[0284] Generate the target semantic tag according to the target language;
[0285] Based on the target semantic tags and the first text to be translated, the first target translated text is obtained by translating using the general translation model.
[0286] As one possible implementation, the device further includes an application unit for:
[0287] If the target semantic tag includes a first sub-tag and a second sub-tag, where the first sub-tag has the semantic meaning of translating text belonging to the source language, and the second sub-tag has the semantic meaning that the translated text belongs to the target language, then the source text includes the first sub-tag and the second sub-tag.
[0288] Obtain the second text to be translated, which belongs to the source language, and the target language;
[0289] The first sub-tag is generated based on the second text to be translated, and the second sub-tag is generated based on the target language;
[0290] Based on the first sub-tag, the second sub-tag, and the second text to be translated, the second target translated text is obtained by translating using the general translation model.
[0291] As one possible implementation, if the target text includes the target semantic tag, then the translation unit 1202 is specifically used for:
[0292] The source text is translated using the initial general translation model to obtain the first undetermined translated text;
[0293] Semantic recognition is performed on the target semantic tag to obtain a first semantic result;
[0294] Based on the first semantic result, the first undetermined translated text is adjusted to obtain the translated text.
[0295] As one possible implementation, if the target semantic tag includes a first sub-tag and a second sub-tag, the first sub-tag having the semantic meaning of translating text belonging to the source language, and the second sub-tag having the semantic meaning that the translated text belongs to the target language, the source text includes the first sub-tag, and the target text includes the second sub-tag, then the translation unit 1202 is specifically used for:
[0296] Based on the source text and the first sub-tag, translation is performed using the initial general translation model to obtain the second undetermined translated text;
[0297] Semantic recognition is performed on the second sub-tag to obtain a second semantic result;
[0298] Based on the second semantic result, the second undetermined translated text is adjusted to obtain the translated text.
[0299] As one possible implementation, the target semantic tag also has the semantic meaning of indicating the position of the target word in the source text, and the probability that the target word is translated into the target language is less than a first probability threshold. The translation unit 1202 is specifically used for:
[0300] Based on the target semantic tag and the source text, translation is performed using the initial general translation model to obtain the third undetermined translated text and the position of the target word;
[0301] If the language of the text at the position of the target word in the third undetermined translated text is different from the target language, then the text at the position of the target word in the source text is translated using the initial general translation model to obtain the translated word;
[0302] The translated text is obtained based on the translated word and the third undetermined translated text.
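The position-guided repair described in this implementation can be sketched as follows. The sketch assumes token lists, assumes that the target word's positions in the source text and in the draft translation are both known, and takes the language check and the word translator as caller-supplied callables. All names are hypothetical.

```python
from typing import Callable, List


def repair_draft(
    source_tokens: List[str],
    draft_tokens: List[str],
    source_pos: int,
    draft_pos: int,
    is_target_language: Callable[[str], bool],
    translate_word: Callable[[str], str],
) -> List[str]:
    """If the draft token at the target position is off-target, retranslate
    the corresponding source token and splice the result into the draft."""
    repaired = list(draft_tokens)
    if not is_target_language(repaired[draft_pos]):
        repaired[draft_pos] = translate_word(source_tokens[source_pos])
    return repaired


# Placeholder helpers for the example only.
lexicon = {"柚子": "pomelo"}
source = ["我", "喜欢", "吃", "柚子"]
draft = ["I", "like", "to", "eat", "toronja"]
fixed = repair_draft(
    source, draft, 3, 4,
    is_target_language=lambda word: word != "toronja",
    translate_word=lambda word: lexicon[word],
)
```

Only the flagged position is retranslated, so the rest of the draft is preserved as produced by the model.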
[0303] As one possible implementation, the translation unit 1202 is specifically used for:
[0304] Obtain a vocabulary mapping table that includes multiple semantic tags, the vocabulary mapping table being used to convert text into encoded vectors;
[0305] The target semantic tag is converted according to the vocabulary mapping table to obtain a tag encoding vector;
[0306] The source text is converted according to the vocabulary mapping table to obtain a text encoding vector;
[0307] Based on the tag encoding vector and the text encoding vector, the translation is performed using the initial general translation model to obtain the translated text.
[0308] As one possible implementation, the device further includes a cleaning unit for:
[0309] Obtain an initial sample set for the target domain, the initial sample set including multiple language sample pairs;
[0310] Language sample pairs whose target text belongs to other languages are removed from the initial sample set to obtain a standard sample set. The standard sample set includes the target language sample pairs, and the other languages are languages different from the target language. The multiple language sample pairs included in the standard sample set are used to train the initial general translation model to obtain the general translation model.
[0311] As one possible implementation, the device further includes a cleaning unit for:
[0312] According to the language classification model, the target text in each language sample pair included in the initial sample set is classified to obtain the category of each word in each target text;
[0313] If the category of a word is one of the other languages, the word belonging to the other language is translated into a word belonging to the target language to obtain the updated target text;
[0314] Based on the updated target text, a standard sample set is obtained.
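This word-level variant of cleaning, which repairs a sample instead of deleting it, can be sketched as below. The per-word classifier and the word translator are passed in as callables; the lambdas used in the example are placeholders, and the function name is hypothetical.

```python
from typing import Callable, List


def normalize_target_text(
    tokens: List[str],
    classify_word: Callable[[str], str],
    translate_to_target: Callable[[str], str],
    target_language: str,
) -> List[str]:
    """Replace every word classified as another language with its
    target-language translation; leave target-language words unchanged."""
    return [
        token if classify_word(token) == target_language else translate_to_target(token)
        for token in tokens
    ]


# Placeholder helpers for the example only.
updated = normalize_target_text(
    ["I", "like", "to", "eat", "toronja"],
    classify_word=lambda word: "es" if word == "toronja" else "en",
    translate_to_target=lambda word: {"toronja": "pomelo"}[word],
    target_language="en",
)
```

Compared with deleting the whole pair, this keeps the sample in the standard sample set, which may matter when domain data is scarce.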
[0315] As one possible implementation, the device further includes a cleaning unit for:
[0316] The target text in each language sample pair included in the initial sample set is classified according to the language classification model to obtain the category of each target text;
[0317] The standard sample set is obtained by deleting from the initial sample set the language sample pairs whose target text belongs to other languages.
[0318] As one possible implementation, the device further includes a cleaning unit for:
[0319] Remove the language sample pairs whose target text belongs to other languages from the initial sample set to obtain an undetermined sample set;
[0320] For the undetermined language sample pairs in the undetermined sample set, the target text in the undetermined language sample pairs is segmented into words to obtain multiple words;
[0321] The occurrence probability of each segmented word is determined by the word probability table of the target language, wherein the word probability table includes the probability of each segmented word belonging to the target language appearing in the vocabulary of the target domain;
[0322] Based on the occurrence probability of each of the segmented words, determine the probability that the language of the target text in the undetermined language sample pair belongs to the target language.
[0323] If the probability that the language of the target text in the undetermined language sample pair belongs to the target language is less than the second probability threshold, then the undetermined language sample pair is deleted from the undetermined sample set.
[0324] Each language sample pair in the undetermined sample set is taken as the undetermined language sample pair, and the step of segmenting the target text in the undetermined language sample pair to obtain multiple segmented words, as well as subsequent steps, are performed to obtain the standard sample set.
[0325] This application also provides a computer device, which can be a server or a terminal device. The computer device provided in this application is described below from a hardware implementation perspective. Figure 13 is a schematic structural diagram of the server, and Figure 14 is a schematic structural diagram of the terminal device.
[0326] See Figure 13, which is a schematic diagram of a server structure provided in an embodiment of this application. The server 1400 can vary considerably depending on configuration or performance. It may include one or more processors 1422, such as central processing units (CPUs), memory 1432, and one or more storage media 1430 (e.g., one or more mass storage devices) that store application programs 1442 or data. The memory 1432 and storage media 1430 can be temporary or persistent storage. The program stored in the storage media 1430 may include one or more modules (not shown in the figure), and each module may include a series of instruction operations on the server. Furthermore, the processor 1422 may be configured to communicate with the storage media 1430 and execute, on the server 1400, the series of instruction operations in the storage media 1430.
[0327] Server 1400 may also include one or more power supplies 1426, one or more wired or wireless network interfaces 1450, one or more input/output interfaces 1458, and/or one or more operating systems 1441, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, etc.
[0328] The steps performed by the server in the above embodiments can be based on the server structure shown in Figure 13.
[0329] The processor 1422 is used to perform the following steps:
[0330] Obtain target language sample pairs including target semantic tags. The target language sample pairs include source text belonging to the source language and target text belonging to the target language. The source text and the target text are texts that express the same semantics but belong to different languages. The target semantic tags have the semantics to translate the source text into text belonging to the target language.
[0331] Based on the target semantic tags and the source text, translation is performed using an initial general translation model to obtain translated text. The translated text is the text obtained by translating the source text based on the target language indicated by the target semantic tags. The initial general translation model is used to translate texts in multiple languages.
[0332] Based on the differences between the translated text and the target text, the model parameters of the initial general translation model are adjusted to obtain a general translation model.
[0333] Optionally, the processor 1422 may also execute method steps of any specific implementation of the training method of the general translation model in the embodiments of this application.
[0334] See Figure 14, which is a schematic diagram of the structure of a terminal device provided in an embodiment of this application. The description takes a smartphone as an example. Figure 14 is a partial structural block diagram of the smartphone, which includes a radio frequency (RF) circuit 1510, a memory 1520, an input unit 1530, a display unit 1540, a sensor 1550, an audio circuit 1560, a Wi-Fi module 1570, a processor 1580, and a power supply 1590, among other components. Those skilled in the art will understand that the smartphone structure shown in Figure 14 does not constitute a limitation on smartphones, which may include more or fewer components than shown, combine certain components, or have different component arrangements.
[0335] The components of the smartphone are described in detail below with reference to Figure 14:
[0336] The RF circuit 1510 can be used to receive and transmit signals during information transmission or calls. In particular, it receives downlink information from the base station and processes it with the processor 1580; in addition, it transmits uplink data to the base station.
[0337] The memory 1520 can be used to store software programs and modules, and the processor 1580 runs the software programs and modules stored in the memory 1520 to realize various functions and data processing of the smartphone.
[0338] Input unit 1530 can be used to receive input numeric or character information and generate key signal inputs related to user settings and function control of the smartphone. Specifically, input unit 1530 may include touch panel 1531 and other input devices 1532. Touch panel 1531, also known as a touch screen, can collect touch operations performed by the user on or near it and drive the corresponding connected devices according to a preset program. The other input devices 1532 may include, but are not limited to, one or more of the following: physical keyboard, function keys (such as volume control buttons, power buttons, etc.), trackball, mouse, joystick, etc.
[0339] The display unit 1540 can be used to display information input by the user or information provided to the user, as well as various menus of the smartphone. The display unit 1540 may include a display panel 1541, which may optionally be configured as a liquid crystal display (LCD), an organic light-emitting diode (OLED), or the like.
[0340] Smartphones may also include at least one sensor 1550, such as a light sensor, a motion sensor, and other sensors. Other sensors that smartphones may also be equipped with, such as gyroscopes, barometers, hygrometers, thermometers, and infrared sensors, will not be detailed here.
[0341] Audio circuit 1560, speaker 1561, and microphone 1562 provide an audio interface between the user and the smartphone. Audio circuit 1560 converts received audio data into electrical signals and transmits them to speaker 1561, where speaker 1561 converts them into sound signals for output. On the other hand, microphone 1562 converts collected sound signals into electrical signals, which are received by audio circuit 1560, converted into audio data, and then processed by processor 1580 before being transmitted via RF circuit 1510 to, for example, another smartphone, or the audio data can be output to memory 1520 for further processing.
[0342] The processor 1580 is the control center of the smartphone, connecting various parts of the smartphone through various interfaces and lines. It performs various functions and processes data by running or executing software programs and / or modules stored in the memory 1520, and by calling data stored in the memory 1520. Optionally, the processor 1580 may include one or more processing units.
[0343] The smartphone also includes a power supply 1590 (such as a battery) that supplies power to various components. Preferably, the power supply can be logically connected to the processor 1580 through a power management system, thereby enabling functions such as charging, discharging, and power consumption management through the power management system.
[0344] Although not shown, smartphones may also include a camera, Bluetooth module, etc., which will not be described in detail here.
[0345] In this embodiment of the application, the memory 1520 included in the smartphone can store computer programs and transmit the computer programs to the processor.
[0346] The processor 1580 included in the smartphone can execute the training method of the general translation model provided in the above embodiments according to the instructions in the computer program.
[0347] This application also provides a computer-readable storage medium for storing a computer program for executing the training method of the general translation model provided in the above embodiments.
[0348] On the other hand, embodiments of this application provide a computer program product including a computer program, which, when run on a computer device, causes the computer device to perform a training method for a general translation model provided in various optional implementations of the above aspects.
[0350] Those skilled in the art will understand that all or part of the steps of the above method embodiments can be implemented by hardware related to program instructions. The aforementioned program can be stored in a computer-readable storage medium. When the program is executed, it performs the steps of the above method embodiments. The aforementioned storage medium can be at least one of the following media: read-only memory (ROM), RAM, magnetic disk or optical disk, and other media that can store computer programs.
[0351] The terms "first," "second," "third," "fourth," etc. (if present) in the specification, claims, and accompanying drawings of this application are used to distinguish similar objects and are not necessarily used to describe a particular order or sequence. It should be understood that data so used are interchangeable where appropriate, so that the embodiments of this application described herein can be implemented, for example, in orders other than those illustrated or described herein. Furthermore, the terms "comprising" and "corresponding to," and any variations thereof, are intended to cover a non-exclusive inclusion; for example, a process, method, system, product, or apparatus that comprises a series of steps or units is not necessarily limited to those steps or units explicitly listed, but may include other steps or units not explicitly listed or inherent to such processes, methods, products, or apparatus.
[0352] In this application embodiment, the terms "module" or "unit" refer to a computer program or part of a computer program that has a predetermined function and works with other related parts to achieve a predetermined goal, and can be implemented wholly or partially using software, hardware (such as processing circuitry or memory), or a combination thereof. Similarly, a processor (or multiple processors or memory) can be used to implement one or more modules or units. Furthermore, each module or unit can be part of an overall module or unit that includes the functionality of that module or unit.
[0353] It should be noted that the various embodiments in this specification are described in a progressive manner, and the same or similar parts of the various embodiments can be cross-referenced. Each embodiment focuses on its differences from the other embodiments. In particular, since the device and system embodiments are basically similar to the method embodiments, they are described relatively briefly; for the relevant parts, refer to the description of the method embodiments. The device and system embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the modules can be selected according to actual needs to achieve the purpose of the solution in this embodiment. Those skilled in the art can understand and implement this without creative effort.
[0354] The above description is merely one specific embodiment of this application, but the scope of protection of this application is not limited thereto. Any variations or substitutions that can be easily conceived by those skilled in the art within the technical scope disclosed in this application should be included within the scope of protection of this application. Based on the implementation methods provided in the above aspects, this application can also be further combined to provide more implementation methods. Therefore, the scope of protection of this application should be determined by the scope of the claims.
Claims
1. A training method for a general translation model, characterized in that, The method includes: Obtain target language sample pairs including target semantic tags. The target language sample pairs include source text belonging to the source language and target text belonging to the target language. The source text and the target text are texts that express the same semantics but belong to different languages. The target semantic tags have the semantics to translate the source text into text belonging to the target language. Based on the target semantic tags and the source text, translation is performed using an initial general translation model to obtain translated text. The translated text is the text obtained by translating the source text based on the target language indicated by the target semantic tags. The initial general translation model is used to translate texts in multiple languages. Based on the differences between the translated text and the target text, the model parameters of the initial general translation model are adjusted to obtain a general translation model.
2. The method according to claim 1, characterized in that, If the source text includes the target semantic tag, the method further includes: Obtain a first text to be translated, which belongs to the source language, and obtain the target language; Generate the target semantic tag according to the target language; Based on the target semantic tag and the first text to be translated, translation is performed using the general translation model to obtain a first target translated text.
3. The method according to claim 1, characterized in that, If the target semantic tag includes a first sub-tag and a second sub-tag, where the first sub-tag has the semantic meaning of translating text belonging to the source language, and the second sub-tag has the semantic meaning that the translated text belongs to the target language, then the source text includes the first sub-tag and the second sub-tag, and the method further includes: Obtain a second text to be translated, which belongs to the source language, and obtain the target language; The first sub-tag is generated based on the second text to be translated, and the second sub-tag is generated based on the target language; Based on the first sub-tag, the second sub-tag, and the second text to be translated, translation is performed using the general translation model to obtain a second target translated text.
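As an illustration of the tag generation recited in claims 2 and 3, the following is a minimal sketch and not the claimed implementation. The tag format (`<to_es>`, `<from_en>`), the function names, and the choice to prepend the tags to the text are all assumptions made for the example; the claims do not fix any of them.

```python
from typing import Optional


def make_tag(name: str) -> str:
    """Return a hypothetical tag string such as "<to_es>" (format is assumed)."""
    return f"<{name}>"


def build_tagged_input(text: str, target_lang: str,
                       source_lang: Optional[str] = None) -> str:
    """Prefix the text to be translated with its semantic tag(s)."""
    tags = []
    if source_lang is not None:
        # First sub-tag: marks the language of the text to be translated.
        tags.append(make_tag(f"from_{source_lang}"))
    # Target semantic tag (or second sub-tag): marks the language that the
    # translated text must belong to.
    tags.append(make_tag(f"to_{target_lang}"))
    return " ".join(tags + [text])


# Single target tag, as in claim 2.
single = build_tagged_input("Hello", target_lang="es")
# First and second sub-tags, as in claim 3.
double = build_tagged_input("Hello", target_lang="es", source_lang="en")
```

In this sketch the caller supplies the source language code; deriving it from the text to be translated would require a language identifier, which is outside the scope of the example.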
4. The method according to claim 1, characterized in that, If the target text includes the target semantic tag, then the translation based on the target semantic tag and the source text, using an initial general translation model, to obtain the translated text includes: The source text is translated using the initial general translation model to obtain a first undetermined translated text. Semantic recognition is performed on the target semantic tag to obtain a first semantic result; Based on the first semantic result, the first undetermined translated text is adjusted to obtain the translated text.
5. The method according to claim 1, characterized in that, If the target semantic tag includes a first sub-tag and a second sub-tag, where the first sub-tag has the semantic meaning of translating text belonging to the source language, and the second sub-tag has the semantic meaning that the translated text belongs to the target language, and the source text includes the first sub-tag and the target text includes the second sub-tag, then the translation based on the target semantic tag and the source text, using an initial general translation model, to obtain the translated text includes: Based on the source text and the first sub-tag, translation is performed using the initial general translation model to obtain a second undetermined translated text. Semantic recognition is performed on the second sub-tag to obtain a second semantic result; Based on the second semantic result, the second undetermined translated text is adjusted to obtain the translated text.
6. The method according to claim 1, characterized in that, The target semantic tag also has semantic meaning indicating the position of a target word in the source text. The probability that the target word is translated into the target language is less than a first probability threshold. The translation based on the target semantic tag and the source text, using an initial general translation model, to obtain the translated text includes: Based on the target semantic tag and the source text, translation is performed using the initial general translation model to obtain a third undetermined translated text and the position of the target word. If the language of the text corresponding to the position of the target word in the third undetermined translated text is different from the target language, then the text at the position of the target word in the source text is translated using the initial general translation model to obtain a translated word; The translated text is obtained based on the translated word and the third undetermined translated text.
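The word-level repair recited in claim 6 can be sketched as follows. This is a hedged illustration only: `detect_language` and `translate_word` are hypothetical callables standing in for a language identifier and for the translation model, and the assumption that the draft translation is already split into words with a known position for the target word is made purely for the example.

```python
from typing import Callable, List


def repair_off_target_word(
    source_words: List[str],
    draft_words: List[str],
    source_pos: int,
    draft_pos: int,
    target_lang: str,
    detect_language: Callable[[str], str],
    translate_word: Callable[[str, str], str],
) -> List[str]:
    """Replace the word at draft_pos if it is not in the target language."""
    if detect_language(draft_words[draft_pos]) == target_lang:
        # The draft is already on target at this position; keep it unchanged.
        return list(draft_words)
    repaired = list(draft_words)
    # Re-translate only the flagged source word and splice it into the draft.
    repaired[draft_pos] = translate_word(source_words[source_pos], target_lang)
    return repaired


# Toy stand-ins (assumptions): a two-word Spanish lexicon and a lookup table.
def toy_detect(word: str) -> str:
    return "es" if word in {"hola", "mundo"} else "en"


def toy_translate(word: str, lang: str) -> str:
    return {"world": "mundo"}[word]


fixed = repair_off_target_word(
    ["hello", "world"], ["hola", "world"], 1, 1, "es", toy_detect, toy_translate
)
```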
7. The method according to claim 1, characterized in that, The translation based on the target semantic tag and the source text, using an initial general translation model, to obtain the translated text includes: Obtain a vocabulary mapping table that includes multiple semantic tags, the vocabulary mapping table being used to convert text into encoding vectors; The target semantic tag is converted according to the vocabulary mapping table to obtain a tag encoding vector; The source text is converted according to the vocabulary mapping table to obtain a text encoding vector; Based on the tag encoding vector and the text encoding vector, translation is performed using the initial general translation model to obtain the translated text.
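The vocabulary mapping table of claim 7 can be pictured with a small sketch. It assumes, for illustration only, that the table maps each token (including each semantic tag) to an integer id, that text is tokenized by whitespace, and that unknown tokens map to a reserved `<unk>` entry; none of these details is specified by the claim, and all names are hypothetical.

```python
from typing import Dict, Iterable, List

UNK = "<unk>"  # hypothetical placeholder for out-of-vocabulary tokens


def build_vocab(tokens: Iterable[str], semantic_tags: Iterable[str]) -> Dict[str, int]:
    """Build a mapping table in which semantic tags are ordinary entries."""
    vocab: Dict[str, int] = {UNK: 0}
    for item in list(semantic_tags) + list(tokens):
        if item not in vocab:
            vocab[item] = len(vocab)
    return vocab


def encode(items: Iterable[str], vocab: Dict[str, int]) -> List[int]:
    """Convert tokens to ids, falling back to the <unk> id."""
    return [vocab.get(item, vocab[UNK]) for item in items]


def encode_model_input(tag: str, source_text: str, vocab: Dict[str, int]) -> List[int]:
    """Encode the tag and the source text separately, then concatenate them."""
    tag_vector = encode([tag], vocab)
    text_vector = encode(source_text.split(), vocab)
    return tag_vector + text_vector


vocab = build_vocab(["hello", "world"], ["<to_es>"])
ids = encode_model_input("<to_es>", "hello world", vocab)
```

Registering the tags in the same table as ordinary words means the tag is never split into sub-pieces, so the model sees it as a single symbol.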
8. The method according to claim 1, characterized in that, The method further includes: Obtain an initial sample set for the target domain, the initial sample set including multiple language sample pairs; Language sample pairs whose target text belongs to other languages are deleted from the initial sample set to obtain a standard sample set. The standard sample set includes the target language sample pairs, where the other languages are languages different from the target language. The multiple language sample pairs included in the standard sample set are used to train the initial general translation model to obtain the general translation model.
9. The method according to claim 8, characterized in that, The step of deleting language sample pairs of the target text belonging to other languages from the initial sample set to obtain a standard sample set includes: According to a language classification model, the target text in each language sample pair included in the initial sample set is classified to obtain the category of each segmented word in each target text; If the category of a segmented word is one of the other languages, then the segmented word belonging to the other language is translated into a segmented word belonging to the target language to obtain an updated target text; Based on the updated target text, the standard sample set is obtained.
10. The method according to claim 8, characterized in that, The step of deleting language sample pairs of the target text belonging to other languages from the initial sample set yields a standard sample set, including: The target text in each language sample pair included in the initial sample set is classified according to the language classification model to obtain the category of each target text; The standard sample set is obtained by deleting language sample pairs of the target text belonging to other languages from the initial sample set.
11. The method according to claim 10, characterized in that, The step of deleting language sample pairs of the target text belonging to other languages from the initial sample set to obtain the standard sample set includes: Remove language sample pairs of the target text belonging to other languages from the initial sample set to obtain an undetermined sample set; For an undetermined language sample pair in the undetermined sample set, the target text in the undetermined language sample pair is segmented to obtain multiple segmented words; The occurrence probability of each segmented word is determined from the word probability table of the target language, wherein the word probability table includes the probability of each segmented word belonging to the target language appearing in the vocabulary of the target domain; Based on the occurrence probability of each of the segmented words, determine the probability that the language of the target text in the undetermined language sample pair belongs to the target language. If the probability that the language of the target text in the undetermined language sample pair belongs to the target language is less than a second probability threshold, then the undetermined language sample pair is deleted from the undetermined sample set. Each language sample pair in the undetermined sample set is taken in turn as the undetermined language sample pair, and the step of segmenting the target text in the undetermined language sample pair to obtain multiple segmented words, as well as the subsequent steps, are performed to obtain the standard sample set.
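The second filtering pass of claim 11 can be sketched as below. The claim does not state how the per-word occurrence probabilities are combined into a text-level probability; this sketch assumes a simple mean, whitespace segmentation, and a probability of 0.0 for words missing from the table. The function names and the toy table are hypothetical.

```python
from typing import Dict, List, Tuple


def target_language_probability(target_text: str,
                                word_prob_table: Dict[str, float]) -> float:
    """Estimate how likely the text is to be in the target language.

    Assumption: the estimate is the mean occurrence probability of its words.
    """
    words = target_text.split()
    if not words:
        return 0.0
    return sum(word_prob_table.get(word, 0.0) for word in words) / len(words)


def filter_sample_pairs(
    pairs: List[Tuple[str, str]],
    word_prob_table: Dict[str, float],
    threshold: float,
) -> List[Tuple[str, str]]:
    """Keep only (source, target) pairs whose target text passes the threshold."""
    return [
        (source, target)
        for source, target in pairs
        if target_language_probability(target, word_prob_table) >= threshold
    ]


# Toy Spanish word probability table (values are made up for illustration).
table = {"hola": 0.9, "mundo": 0.7}
pairs = [("hello world", "hola mundo"), ("good morning", "good morning")]
kept = filter_sample_pairs(pairs, table, threshold=0.5)
```

The second pair is dropped because its target text was left untranslated, so none of its words appears in the target-language table.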
12. A training device for a general translation model, characterized in that, The device includes: an acquisition unit, a translation unit, and an adjustment unit; The acquisition unit is used to acquire target language sample pairs including target semantic tags. The target language sample pairs include source text belonging to the source language and target text belonging to the target language. The source text and the target text are texts that express the same semantics but belong to different languages. The target semantic tags have the semantics to translate the source text into text belonging to the target language. The translation unit is used to translate based on the target semantic tag and the source text using an initial general translation model to obtain translated text. The translated text is the text obtained by translating the source text based on the target language indicated by the target semantic tag. The initial general translation model is used to translate texts in multiple languages. The adjustment unit is used to adjust the model parameters of the initial general translation model based on the differences between the translated text and the target text, so as to obtain a general translation model.
13. A computer device, characterized in that, The computer device includes a processor and memory: The memory is used to store computer programs and to transfer the computer programs to the processor; The processor is configured to perform the method according to any one of claims 1-11 according to the computer program.
14. A computer-readable storage medium, characterized in that, The computer-readable storage medium is used to store a computer program for performing the method according to any one of claims 1-11.
15. A computer program product comprising a computer program, characterized in that, When it is run on a computer device, it causes the computer device to perform the method described in any one of claims 1-11.