Large model training method, task processing method, device, medium, and program product
By using an extended network to learn relevant knowledge of a multi-task model under the first parameter-locked state of a large language model, the problems of knowledge forgetting and low efficiency in multi-task model training are solved, and efficient multi-task model training is achieved.
Patent Information
- Authority / Receiving Office
- WO · WO
- Patent Type
- Applications
- Current Assignee / Owner
- CLOUD INTELLIGENCE ASSETS HOLDING (SINGAPORE) PTE LTD
- Filing Date
- 2025-09-16
- Publication Date
- 2026-07-02
AI Technical Summary
When training multi-task models, there are problems such as knowledge forgetting and low training efficiency. In particular, when the model needs to learn new task-related knowledge, it is easy to forget the knowledge that has been learned previously, and fine-tuning all parameters takes a long time.
The first parameter of the large language model is locked, and relevant knowledge of multiple natural language processing tasks is learned through an extension network. Only the second parameter of the extension network is updated, which avoids knowledge forgetting, and training efficiency is improved by using the FlashAttention attention mechanism.
It effectively avoids the knowledge forgetting problem of large language models, significantly improves training efficiency, and enhances the performance and robustness of the model in multiple natural language processing tasks.
Description
Training methods, task processing methods, equipment, media and program products for large models
[0001] This disclosure claims priority to Chinese Patent Application No. 202411942412.7, filed on December 27, 2024, entitled “Training Method, Task Processing Method, Device, Medium and Program Product for Large Models”, the entire contents of which are incorporated herein by reference.
Technical Field
[0002] This disclosure relates to the field of artificial intelligence technology, specifically to a training method, task processing method, device, medium, and program product for a large model.
Background Technology
[0003] With the increasing demands of users and the expansion of service types, it is necessary to utilize a single large model to handle multiple tasks; that is, training a multi-task model is essential. Currently, training multi-task models often involves matching training data from multiple tasks and fine-tuning all the model's parameters. This training method for multi-task models suffers from at least the following problems: First, the knowledge forgetting problem, meaning the model easily forgets previously learned knowledge when learning new task-related knowledge. Second, the efficiency problem, meaning it requires a long training time and is inefficient.
Summary of the Invention
[0004] This disclosure addresses the aforementioned technical problems by providing a training method, task processing method, device, medium, and program product for large models.
[0005] The first aspect of this disclosure provides a method for training a large model, the method comprising:
[0006] In any training round, the model to be trained is invoked to output multiple output texts based on multiple training texts. The model to be trained includes a large language model and at least one extended network. Any extended network is used to learn relevant knowledge of multiple natural language processing tasks. The first parameter of the large language model is locked in the training round.
[0007] The target loss is determined based on the target data, which includes the plurality of output texts;
[0008] The second parameters of the at least one extended network are updated according to the target loss until the training termination condition is met, thus obtaining a multi-task model.
[0009] A second aspect of this disclosure provides a task processing method, the method comprising:
[0010] Obtain text data for a target task, wherein the target task is any one of multiple natural language processing tasks;
[0011] The text data is processed by a multi-task model to obtain output text. The multi-task model is trained according to the training method provided in the first aspect above.
[0012] A third aspect of this disclosure provides a training apparatus for a large model, the apparatus comprising:
[0013] The training module is used to call the model to be trained to output multiple output texts based on multiple training texts in any training round. The model to be trained includes a large language model and at least one extended network. Any extended network is used to learn relevant knowledge of multiple natural language processing tasks. The first parameter of the large language model is locked in the training round.
[0014] The first determining module is used to determine the target loss based on the target data, wherein the target data includes the plurality of output texts;
[0015] An update module is used to update the second parameters of the at least one extended network according to the target loss until the training termination condition is met, thereby obtaining a multi-task model.
[0016] A fourth aspect of this disclosure provides a computing device comprising:
[0017] The acquisition module is used to acquire text data of the target task, which is any one of multiple natural language processing tasks.
[0018] The processing module is used to call the multi-task model to process the text data and obtain the output text. The multi-task model is trained according to the training method provided in the first aspect above.
[0019] A fifth aspect of this disclosure provides an electronic device including a memory, a processor, and a computer program stored in the memory and executable on the processor, the processor executing the program to implement the method as described in the first aspect above.
[0020] A sixth aspect of this disclosure provides a computer-readable storage medium having a computer program stored thereon, the program being executed by a processor to implement the method as described in the first aspect above.
[0021] A seventh aspect of this disclosure provides a computer program product including a computer program that is executed by a processor to implement the method described in the first aspect above.
[0022] This disclosure has at least the following beneficial effects or advantages:
[0023] In this embodiment of the disclosure, in any training round, a model to be trained that includes a large language model and at least one extended network is trained. Because the first parameter of the large language model is locked during training, the model to be trained learns relevant knowledge of multiple natural language processing tasks through the extended network. This not only avoids the knowledge forgetting problem of the large language model, but also requires updating only the second parameter of the extended network rather than all parameters, thus greatly improving training efficiency.
[0024] The above description is only an overview of the technical solution of this disclosure. So that the technical means of this disclosure can be understood more clearly and implemented in accordance with the contents of the specification, and so that the above and other objects, features, and advantages of this disclosure are more apparent, specific embodiments of this disclosure are given below.
Attached Figure Description
[0025] In the accompanying drawings, unless otherwise specified, the same reference numerals throughout the various drawings denote the same or similar parts or elements. These drawings are not necessarily drawn to scale. It should be understood that these drawings depict only some embodiments according to this disclosure and should not be construed as limiting the scope of this disclosure.
[0026] Figure 1 is a schematic diagram of an application scenario of a training method for a multi-task model provided in an embodiment of this disclosure;
[0027] Figure 2 is a first flowchart of a training method for a large model provided in an embodiment of this disclosure;
[0028] Figure 3 is a second flowchart of a training method for a large model provided in an embodiment of this disclosure;
[0029] Figure 4 is a third flowchart of a large model training method provided in an embodiment of this disclosure;
[0030] Figure 5 is a schematic diagram of the model structure provided in the embodiment of this disclosure;
[0031] Figure 6 is a fourth flowchart of a training method for a large model provided in an embodiment of this disclosure;
[0032] Figure 7 is a fifth flowchart of a training method for a large model provided in an embodiment of this disclosure;
[0033] Figure 8 is a sixth flowchart of a training method for a large model provided in an embodiment of this disclosure;
[0034] Figure 9 is a flowchart of a task processing method provided in an embodiment of this disclosure;
[0035] Figure 10 is a schematic diagram of the structure of a large model training device provided in an embodiment of this disclosure;
[0036] Figure 11 is a schematic diagram of the structure of a computing device provided in an embodiment of this disclosure;
[0037] Figure 12 is a schematic diagram of the structure of an electronic device provided in an embodiment of this disclosure.
Detailed Implementation
[0038] In the following description, only certain exemplary embodiments are briefly described. As those skilled in the art will recognize, the described embodiments can be modified in various ways without departing from the spirit or scope of this disclosure. Therefore, the drawings and description are considered to be exemplary in nature and not restrictive.
[0039] To facilitate understanding of the technical solutions of the embodiments of this disclosure, the related technologies of the embodiments of this disclosure are described below. The following related technologies are optional solutions and can be combined with the technical solutions of the embodiments of this disclosure in any way, and all of them fall within the protection scope of the embodiments of this disclosure.
[0040] The following terms will be used in the following text:
[0041] Large Language Models (LLMs): Language models with a large number of parameters. They are typically based on deep learning architectures, such as Transformers, and learn language structures and patterns through unsupervised pre-training on massive amounts of text data. LLMs can perform various natural language processing tasks, such as text generation, question answering, and translation, without requiring additional task-specific training.
[0042] Multi-task learning (MTL) is a machine learning technique that processes multiple related tasks simultaneously to improve performance and efficiency by sharing representations, features, or models. Unlike single-task learning, MTL attempts to leverage the correlations between tasks to achieve better generalization on certain tasks.
[0043] Model fine-tuning refers to further training a pre-trained model on a small amount of domain-specific data to improve its performance on a particular task. This process typically involves adjusting the model's parameters to better adapt to new data distributions and task requirements. Fine-tuning can significantly reduce the need for large-scale training data and computational resources while retaining the rich knowledge learned by the pre-trained model, resulting in superior performance on new tasks.
[0044] With the continuous development of artificial intelligence technology, large models are widely used in various scenarios to perform various tasks, such as Natural Language to SQL (NL2SQL) tasks and multi-turn rewriting tasks. However, deploying a separate model for each task greatly increases the model deployment load. Therefore, it is necessary to train a model that can handle multiple tasks simultaneously, i.e., a multi-task model. Currently, the training process of multi-task models involves matching training data from multiple tasks and fine-tuning all the model's parameters. As a result, at least the following problems exist. First, the knowledge forgetting problem: when learning new task-related knowledge, the model easily forgets previously learned knowledge. Second, the efficiency problem: fine-tuning all the model's parameters is time-consuming and inefficient, and since the amount of data for multiple tasks increases exponentially, an appropriate fine-tuning method is needed to reduce training time and cost. Third, the task and data imbalance problem: interference between tasks may cause the model to perform well on some tasks and poorly on others, and data imbalance can cause the model to favor tasks with larger datasets.
[0045] Based on this, this disclosure provides a method for training a large model. Figure 1 is a schematic diagram of an application scenario of the method for training a large model provided in this disclosure. As shown in Figure 1, the scenario includes a training device and a model to be trained.
[0046] The model to be trained includes a pre-trained large language model and at least one extended network. Each extended network is used to learn relevant knowledge for multiple natural language processing tasks, and each extended network may include multiple processing layers. For the specific structure of the model to be trained, please refer to the relevant descriptions below, which are not repeated here.
[0047] The training device can be either a terminal device or a server. Terminal devices can be mobile phones, tablets, desktop computers, laptops, home appliances, in-vehicle terminals, etc. Servers can be independent servers such as physical servers and cloud servers, or server clusters composed of multiple servers. The training device can maintain an initialization strategy, a training strategy, a loss determination strategy, and a parameter tuning strategy. The initialization strategy specifies that the update state of the first parameter of the large language model is set to locked and that the second parameter of the at least one extended network is initialized. The training strategy specifies that, in any training round, the model to be trained is invoked to output multiple output texts based on multiple training texts. The loss determination strategy specifies that the target loss is determined based on target data, which includes the multiple output texts. The parameter tuning strategy specifies that the second parameter is updated based on the target loss until the training termination condition is met, resulting in a multi-task model.
[0048] Figure 1 illustrates an example where the training device is an independent physical server. It should be understood that Figure 1 is merely a schematic representation of the application scenario of the training method for large models involved in this disclosure, and does not constitute a limitation on the technical solution of this disclosure.
[0049] It should be noted that the application scenarios or examples provided in this disclosure are for ease of understanding, and this disclosure does not specifically limit the application of the technical solutions. Furthermore, the user information (including but not limited to user device information, user personal information, etc.) and data (including but not limited to data used for analysis, stored data, displayed data, etc.) involved in this disclosure are all information and data authorized by the user or fully authorized by all parties. The collection, use, and processing of related data must comply with the relevant laws, regulations, and standards of the relevant countries and regions, and corresponding operation entry points are provided for users to choose to authorize or refuse.
[0050] The technical solutions of this disclosure and how they solve the aforementioned technical problems are described in detail below with specific embodiments. The listed specific embodiments can be combined with each other, and the same or similar concepts or processes may not be repeated in some embodiments. The embodiments of this disclosure will be described in detail below with reference to the accompanying drawings.
[0051] Figure 2 is a flowchart of a training method for a large model provided in an embodiment of this disclosure. The method shown in Figure 2 can be executed by the training device in Figure 1. As shown in Figure 2, the method includes the following steps 101-103.
[0052] Step 101: In any training round, the model to be trained is invoked to output multiple output texts based on multiple training texts. The model to be trained includes a large language model and at least one extended network. Any extended network is used to learn relevant knowledge of multiple natural language processing tasks. In any training round, the first parameter of the large language model is locked.
[0053] Step 102: Determine the target loss based on the target data, which includes multiple output texts.
[0054] Step 103: Update the second parameters of at least one extended network according to the target loss until the training termination condition is met, and obtain the multi-task model.
[0055] To incorporate relevant knowledge from different natural language processing tasks into a single model to obtain a multi-task model, one or more embodiments of this disclosure pre-train a large language model, construct a model to be trained based on the large language model and at least one extended network, and fine-tune the model to obtain the multi-task model. To prevent the large language model from forgetting the knowledge learned during pre-training during fine-tuning, one or more embodiments of this disclosure freeze the first parameters of the large language model, keeping them locked, and learn relevant knowledge from multiple natural language processing tasks through the extended network during fine-tuning, while also optimizing the second parameters of the extended network.
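Steps 101 to 103 with a locked first parameter can be illustrated with a deliberately tiny sketch. This is not the disclosed implementation: the scalar `Param` class, the additive `predict` function, the mean squared error loss, and the plain gradient step are all assumptions chosen so the example runs without any deep learning library. The `requires_grad` flag only imitates the attribute of the same name used by common frameworks.

```python
from dataclasses import dataclass


@dataclass
class Param:
    """A scalar parameter with a trainable flag (hypothetical helper)."""
    value: float
    requires_grad: bool = True


def predict(x, base, ext):
    # Locked base model plus an additive extension term (toy stand-in).
    return base.value * x + ext.value * x


def train(samples, base, ext, lr=0.05, max_rounds=500, tol=1e-8):
    loss = float("inf")
    for _ in range(max_rounds):
        # Step 101: run the model to be trained on the training samples.
        errors = [(predict(x, base, ext) - y, x) for x, y in samples]
        # Step 102: determine the target loss (mean squared error here).
        loss = sum(e * e for e, _ in errors) / len(samples)
        if loss < tol:  # training termination condition (assumed)
            break
        # Step 103: update only the parameters that are not locked.
        grad = sum(2 * e * x for e, x in errors) / len(samples)
        for p in (base, ext):
            if p.requires_grad:
                p.value -= lr * grad
    return loss
```

For example, with `base = Param(2.0, requires_grad=False)` and `ext = Param(0.0)`, training on samples drawn from `y = 3x` leaves `base.value` untouched while `ext.value` moves toward 1.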
[0056] The multiple natural language processing tasks can be any combination of text classification, sentiment analysis, syntactic analysis, text generation, information extraction, and so on; this disclosure does not specifically limit them. The multiple training texts can include training texts corresponding to each natural language processing task, or training texts corresponding to only some of the tasks. For any training text, the model to be trained first segments the training text into multiple tokens, where each token can be a word, a phrase, or a sentence. For each token, a first identifier corresponding to the token is obtained from a preset mapping table. The length of each first identifier is aligned to a preset length, and each preset-length first identifier is encoded to obtain a corresponding first vector. Based on the first vectors, the probability distribution of the first token of the output text is predicted, that is, the probability that the first token is each of the preset tokens in the preset mapping table. The preset token with the highest probability is determined as the first token, and a second identifier corresponding to the first token is output. Then, based on the first vectors and that second identifier, the second identifier of the second token is output in the same manner, and so on until the second identifier of the last token of the output text is obtained. The tokens corresponding to the second identifiers are looked up in the preset mapping table, and the output text is assembled from them.
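The lookup, alignment, and highest-probability selection described above can be sketched as follows. The vocabulary, the whitespace tokenizer, the `<eos>` stop token, and `toy_model` are hypothetical, and the alignment step is read here as padding the identifier sequence to a preset length, which is one interpretation of the text. A real model would compute the probabilities from the encoded vectors.

```python
PAD, EOS = 0, 1
# Hypothetical preset mapping table: token -> identifier.
VOCAB = {"<pad>": PAD, "<eos>": EOS, "hello": 2, "world": 3, "hi": 4}
ID_TO_TOKEN = {i: t for t, i in VOCAB.items()}


def encode(text, preset_length):
    """Split on whitespace, look up identifiers, pad or truncate to preset_length."""
    ids = [VOCAB[tok] for tok in text.split()][:preset_length]
    return ids + [PAD] * (preset_length - len(ids))


def greedy_decode(input_ids, next_token_probs, max_new_tokens=8):
    """Repeatedly pick the most probable next identifier until EOS is chosen."""
    output_ids = []
    for _ in range(max_new_tokens):
        probs = next_token_probs(input_ids, output_ids)  # dict: id -> probability
        best = max(probs, key=probs.get)
        if best == EOS:
            break
        output_ids.append(best)
    # Map the output identifiers back to tokens through the same table.
    return " ".join(ID_TO_TOKEN[i] for i in output_ids)


def toy_model(input_ids, output_ids):
    # Toy distribution (assumption): emit "hi" once, then stop.
    if not output_ids:
        return {4: 0.7, 3: 0.2, EOS: 0.1}
    return {EOS: 0.9, 4: 0.1}
```

Calling `greedy_decode(encode("hello world", 4), toy_model)` walks the loop twice: once to select the identifier for `hi`, and once to select the stop token.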
[0057] In some implementations, a batch size of 16 can be used in the training process described above, meaning that the number of training texts in any given training round is 16. A warmup ratio of 10%, an initial learning rate of 5e-6, and a cosine learning rate scheduler can also be set. The second parameter and the task weights are updated using an adaptive moment estimation optimizer (Adam optimizer). BF16 (Brain Floating Point) mixed precision is employed, with a weight decay of 0.1. To improve training speed, the FlashAttention mechanism can be used; FlashAttention is an attention algorithm that is fast and memory efficient. It should be noted that the training settings are not limited to the above values and can be set as needed in practical applications.
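A warmup ratio combined with a cosine scheduler is commonly realized as linear warmup followed by cosine decay. The sketch below takes the initial learning rate as 5e-6 and treats it as the peak value reached at the end of warmup; both the linear warmup shape and the decay to zero are assumptions, since the text names only the ratio and the scheduler type.

```python
import math


def learning_rate(step, total_steps, peak_lr=5e-6, warmup_ratio=0.1):
    """Linear warmup to peak_lr, then cosine decay toward zero."""
    warmup_steps = max(1, int(total_steps * warmup_ratio))
    if step < warmup_steps:
        # Ramp up linearly over the first 10% of the steps.
        return peak_lr * (step + 1) / warmup_steps
    # Fraction of the post-warmup phase that has elapsed, in [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))
```

With `total_steps=1000`, the rate climbs during steps 0 to 99, sits at the peak at step 100, and falls along a half cosine for the remaining steps.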
[0058] Therefore, in any training epoch, by training the model to be trained, which includes a large language model and at least one extended network, and since the first parameter of the large language model is locked during the training process, the model to be trained can learn relevant knowledge of multiple natural language processing tasks through the extended network. This not only avoids the knowledge forgetting problem of the large language model, but also only requires updating the second parameter of the extended network instead of updating all parameters, thus greatly improving training efficiency.
[0059] To improve the performance of the model under training on multiple natural language processing tasks, in some implementations, a target training set is first constructed based on the initial training set corresponding to each natural language processing task. Specifically, as shown in Figure 3, step 100 may be included before step 101:
[0060] Step 100: Obtain training texts that meet the second preset conditions from multiple initial training sets to obtain the target training set. The multiple initial training sets correspond one-to-one with multiple natural language processing tasks, and each initial training set includes multiple training texts corresponding to the natural language processing task.
[0061] In some implementations, multiple training texts can be pre-acquired for each natural language processing task to obtain an initial training set for each task. To enrich the knowledge system of the model to be trained and enable it to improve upon knowledge it is currently lacking, in some implementations, training texts that meet a second preset condition are obtained from the initial training set to obtain a target training set. The second preset condition characterizes the accuracy of the large language model's processing results on the training texts. Specifically, as shown in Figure 4, step 100 may include steps 1001 to 1003:
[0062] Step 1001: For the initial training set of any of the multiple natural language processing tasks, call the large language model to process each training text in the initial training set to obtain each predicted text that corresponds one-to-one with each training text.
[0063] In some implementations, for any natural language processing task's initial training set, all training texts in the initial training set can be simultaneously input into the large language model to obtain the predicted text corresponding to each training text. In some implementations, a portion of the training texts in the initial training set can be input into the large language model in batches to obtain the predicted text corresponding to each training text. This disclosure does not impose specific limitations on this; it can be set as needed in practical applications.
[0064] Step 1002: For any training text, determine the similarity between the predicted text corresponding to the training text and the target output text corresponding to the training text. The label of the training text includes the target output text corresponding to the training text.
[0065] In some implementations, for any training text, the BLEU (Bilingual Evaluation Understudy) value between the predicted text corresponding to the training text and the target output text corresponding to the training text can be calculated, and the similarity between the predicted text and the target output text can be determined from the BLEU value. The BLEU value ranges from 0 to 1; a higher BLEU value indicates a higher similarity between the predicted text and the target output text. The specific method for calculating the BLEU value can be found in related technologies, and is not specifically limited in this disclosure.
[0066] It should be noted that the similarity between the predicted text and the target output text is not limited to the BLEU value; it can also be calculated using cosine similarity, Euclidean distance, Hamming distance, etc.
[0067] Step 1003: Determine the training texts corresponding to each target similarity as the target training set, where a target similarity is a similarity that is less than the first similarity threshold.
[0068] Specifically, for any similarity, determine whether the similarity is less than the first similarity threshold. If so, the similarity is determined as the target similarity, and the training text corresponding to each target similarity is determined as the target training set.
[0069] A similarity score less than the first similarity threshold indicates a significant difference between the predicted text and the target output text. This means the model's knowledge of the corresponding training text is insufficient, its processing capability is weak, and the accuracy of the processing results is low. In this embodiment, the training texts with similarity scores less than the first similarity threshold are defined as the target training set. The model is then trained based on this target training set. This ensures the model can learn relevant knowledge from multiple natural language processing tasks while simultaneously learning knowledge it currently lacks sufficient understanding of, thereby enriching the model's knowledge system and improving its robustness and accuracy.
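The selection in steps 1001 to 1003 amounts to keeping the samples on which the large language model currently does poorly. In the sketch below, `unigram_precision` is a simplified stand-in for a BLEU value (full BLEU also uses higher-order n-grams and a brevity penalty), and `predict` is a hypothetical callable representing the large language model.

```python
from collections import Counter


def unigram_precision(predicted, target):
    """Clipped unigram precision in [0, 1]; a simplified stand-in for BLEU."""
    pred_tokens = predicted.split()
    if not pred_tokens:
        return 0.0
    target_counts = Counter(target.split())
    matched = sum(min(count, target_counts[tok])
                  for tok, count in Counter(pred_tokens).items())
    return matched / len(pred_tokens)


def build_target_training_set(initial_set, predict, threshold):
    """Keep samples whose prediction is less similar to the label than threshold."""
    selected = []
    for text, target_output in initial_set:
        predicted = predict(text)                                   # step 1001
        similarity = unigram_precision(predicted, target_output)    # step 1002
        if similarity < threshold:                                  # step 1003
            selected.append((text, target_output))
    return selected
```

Samples the model already answers well score above the threshold and are left out, so training concentrates on the remaining ones.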
[0070] It should be noted that the process of obtaining training texts that meet the second preset condition is not limited to the above method. In some embodiments, after obtaining each predicted text that corresponds one-to-one with each training text, for any training text, the predicted text corresponding to the training text is scored according to the target output text corresponding to the training text. The higher the score, the higher the accuracy of the predicted text. The training texts corresponding to scores less than the score threshold are determined as training texts that meet the second preset condition.
[0071] Corresponding to step 100, as shown in Figure 3, step 101 may include the following step 1011:
[0072] Step 1011: In any training round, multiple training texts are obtained from the target training set, and the model to be trained is called to output multiple output texts based on the multiple training texts; the model to be trained includes a large language model and at least one extended network, any extended network is used to learn relevant knowledge of multiple natural language processing tasks, and the first parameter of the large language model is locked.
[0073] It should be emphasized that Figures 3 and 4 are for illustration only, and step 100 is performed only before the first training round, and not in every training round.
[0074] By obtaining training texts that meet the second preset condition from multiple initial training sets corresponding to multiple natural language processing tasks, a target training set is obtained. The model to be trained is then trained based on the target training set, which enables the model to learn more of the knowledge it does not yet have sufficient knowledge of, thus improving the robustness and accuracy of the model to be trained.
[0075] To avoid the knowledge forgetting problem and task imbalance problem of large language models, in some implementations, when the training round is the first training round, before calling the model to be trained to output multiple output texts based on multiple training texts, the method may further include the following steps 1003 and 1004:
[0076] Step 1003: Call the preset function interface to add at least one extended network to the pre-trained large language model to obtain the model to be trained.
[0077] Specifically, in the case of the first training round, before calling the model to be trained to output multiple output texts based on multiple training texts, a preset function interface is first called to add at least one extended network to the pre-trained large language model to obtain the model to be trained.
[0078] In some implementations, the large language model includes multiple connected decoding blocks. Accordingly, adding at least one extension network to the pre-trained large language model may include the following step A:
[0079] Step A: Divide multiple decoding blocks in the large language model into at least one block combination; for any block combination, add at least one extension network at a preset position in the block combination.
[0080] As an example, as shown in Figure 5, the large language model includes an input layer, M*N decoding blocks, and an output layer. Every M decoding blocks are divided into a block combination, for a total of N block combinations. P extension networks are added after the last decoding block in each block combination. Here, M, N, and P are all integers greater than or equal to 1; for example, M = 4, N = 7, and P = 1.
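The layout in this example can be checked with a short sketch that only tracks layer names. The function name and the naming scheme for layers are hypothetical; the grouping rule (P extension networks after the last decoding block of every group of M) follows the example above.

```python
def insert_extension_networks(num_blocks, m, p):
    """Return the layer order after adding p extension networks per group of m blocks."""
    if num_blocks % m != 0:
        raise ValueError("num_blocks must be a multiple of m")
    layers = []
    for block_index in range(num_blocks):
        layers.append(f"decoder_{block_index}")
        if (block_index + 1) % m == 0:  # last decoding block of a block combination
            group = block_index // m
            layers.extend(f"extension_{group}_{k}" for k in range(p))
    return layers
```

With M = 4, N = 7, and P = 1, `insert_extension_networks(28, 4, 1)` yields 28 decoding blocks and 7 extension networks, one after every fourth decoding block.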
[0081] It should be noted that the preset position is not limited to the position in the example above, and it can be set according to the needs in actual application. The way to add extension networks to a large language model is not limited to step A above; multiple extension networks can also be added after a specific decoding block of the large language model.
[0082] By adding at least one extended network to a large language model, it is possible to learn relevant knowledge for multiple natural language processing tasks during training, thereby avoiding the knowledge forgetting problem of large language models.
[0083] Step 1004: Configure the update state of the first parameter of the large language model to the locked state, and initialize the second parameter of the extended network and multiple task weights, with each task weight corresponding to a different natural language processing task.
[0084] In some implementations, the target attribute of the first parameter can be configured as the first data to configure the update state of the first parameter as a locked state. Here, the target attribute can be the `requires_grad` attribute, and the first data can be `false`. In other implementations, a context manager can be used to configure the update state of the first parameter as a locked state. This disclosure does not specifically limit the scope of these implementations.
[0085] In some implementations, any decoding block of the large language model includes multiple first processing layers, and the structure of any extended network is the same as that of the decoding block, i.e., the number of network layers in any extended network is the same as that in the decoding block. The specific structure of the decoding block can be found in related technologies, and will not be detailed here. To achieve identity mapping during training, the aforementioned initialization of the second parameter may include the following step B:
[0086] Step B: Initialize the parameters of the second processing layer in the extended network to zero. For any extension layer, that is, any layer in the extended network other than the second processing layer, determine the first processing layer in the decoding block that corresponds to it, namely the first processing layer whose position in the decoding block is the same as the position of the extension layer in the extended network, and initialize the parameters of the extension layer to the parameters of that first processing layer.
[0087] In other words, the extended network includes a second processing layer and extension layers. In some implementations, the second processing layer can be a linear layer. By initializing the parameters of the second processing layer to zero and the parameters of each extension layer to the parameters of the corresponding first processing layer, an identity mapping can be achieved through the second processing layer, meaning that data is passed from the input to the output unchanged. Since the identity mapping does not introduce additional computational complexity, it can improve training efficiency and prevent the extended network from immediately affecting the output of the large language model. Furthermore, as training progresses, the parameters of the extension layers are gradually adjusted to the needs of the multiple natural language processing tasks, thus gradually integrating knowledge from these tasks into the model.
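Why zero initialization gives an identity mapping is easiest to see on a toy block. The sketch assumes the block contains a residual (skip) connection and that the zeroed second processing layer is the linear layer on the residual branch's output; the text does not state this structure, and the two-matrix block with a ReLU is a hypothetical simplification of a decoding block.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def block_forward(x, w_in, w_out):
    """Toy residual block: x + W_out @ relu(W_in @ x)."""
    hidden = [max(0.0, h) for h in matvec(w_in, x)]
    return [xi + oi for xi, oi in zip(x, matvec(w_out, hidden))]


def init_extension_block(decoder_w_in, decoder_w_out):
    """Copy the decoding block's weights, then zero the output linear layer."""
    w_in = [row[:] for row in decoder_w_in]               # extension layer: copied
    w_out = [[0.0] * len(row) for row in decoder_w_out]   # second processing layer: zeros
    return w_in, w_out
```

Right after initialization the residual branch contributes nothing, so `block_forward` returns its input unchanged; once training updates `w_out`, the copied weights in `w_in` start to influence the output.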
[0088] Furthermore, to avoid task imbalance, initializing multiple task weights can be done by initializing each task weight to 1.
[0089] Therefore, by adding an extended network to the large language model before the first training epoch, configuring the update state of the large language model's first parameter to a locked state, and initializing the second parameter and multiple task weights, it is possible to learn relevant knowledge of multiple natural language processing tasks through the extended network during training, while avoiding the knowledge forgetting problem of the large language model.
[0090] To ensure task balance while updating the second parameter, some embodiments of this disclosure include multiple task weights in the target data, and a first loss corresponding to each natural language processing task is determined based on the output text, and a target loss is determined based on each first loss and each task weight. Specifically, as shown in Figure 6, step 102 may include steps 1021 to 1023:
[0091] Step 1021: Divide the multiple output texts into multiple text sets, with each text set corresponding to a different natural language processing task.
[0092] In some implementations, the task type corresponding to the input text corresponding to the output text is determined as the task type corresponding to the output text, and the output text is divided into multiple text sets according to the task type of each output text. The multiple text sets correspond one-to-one with the multiple task types, that is, the multiple text sets correspond one-to-one with the multiple natural language processing tasks.
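As an illustration of step 1021, the following minimal sketch groups output texts by the task type of their input texts. The function name `split_by_task` and the example task names are hypothetical.

```python
from collections import defaultdict


def split_by_task(task_types, output_texts):
    """Group output texts by the task type of the corresponding input text."""
    text_sets = defaultdict(list)
    for task_type, output_text in zip(task_types, output_texts):
        text_sets[task_type].append(output_text)
    return dict(text_sets)


task_types = ["translation", "summarization", "translation"]
output_texts = ["Bonjour", "A short summary.", "Merci"]
text_sets = split_by_task(task_types, output_texts)
```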
[0093] Step 1022: For any natural language processing task, determine the first loss corresponding to the natural language processing task based on the text set corresponding to the natural language processing task and the task weight corresponding to the natural language processing task.
[0094] Specifically, as shown in Figure 7, step 1022 may include the following steps 1022-1 to 1022-3:
[0095] Step 1022-1: For any output text in the text set corresponding to the natural language processing task, determine multiple probabilities corresponding to the output text. The multiple probabilities correspond one-to-one with multiple target characters in the output text. Any probability is the maximum probability in the probability distribution corresponding to the target character. The probability distribution corresponding to any target character includes the probability that the character is each preset character. The target character is a character that satisfies the first preset condition.
[0096] In one implementation, the first preset condition can be used to constrain the length of the identifier corresponding to the character. Accordingly, before determining the multiple probabilities corresponding to the output text, the method further includes: for any character in the output text, determining the target identifier corresponding to the character in a preset mapping table; if the length of the target identifier is determined to be no greater than the preset length, then the character corresponding to the target identifier is determined as the target character.
[0097] As described above, during the generation of output text, the model to be trained determines each character in the output text sequentially. Upon determining any character, a probability distribution corresponding to that character is generated. This probability distribution includes the probability that the character is one of the preset characters in a preset mapping table. The preset character with the highest probability in the probability distribution is then determined as the given character. Accordingly, in step 1022-1, for any character in any output text, the target identifier corresponding to that character is obtained from the preset mapping table. If the length of the target identifier is not greater than a preset length, the character corresponding to the target identifier is determined as the target character, and the maximum probability is obtained from the probability distribution corresponding to the target character. Thus, multiple probabilities (i.e., multiple maximum probabilities) corresponding to multiple target characters in any output text can be obtained.
[0098] During the process of generating output text, if an identifier is longer than the preset length, part of its content is discarded when it is aligned to the preset length. In this case, the identifier can also be called an invalid token. On the other hand, the target character is the character corresponding to a target identifier whose length is not greater than the preset length. Since the target identifier is complete and no content is discarded, it can also be called a valid token.
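Step 1022-1 can be sketched as follows. This is a minimal illustration that assumes the preset mapping table maps each character to a string identifier whose length can be compared with the preset length; the function name, the table contents, and the probability values are hypothetical.

```python
def max_probs_of_valid_tokens(tokens, distributions, mapping_table, preset_length):
    """Return the maximum probability for each target character (valid token).

    tokens: characters of one output text.
    distributions: per-character probability distributions over preset characters.
    mapping_table: hypothetical preset mapping table from character to identifier.
    """
    probs = []
    for token, dist in zip(tokens, distributions):
        identifier = mapping_table[token]
        if len(identifier) <= preset_length:  # first preset condition
            probs.append(max(dist.values()))
    return probs


mapping_table = {"a": "id1", "b": "identifier2", "c": "id3"}
tokens = ["a", "b", "c"]
distributions = [
    {"a": 0.7, "b": 0.2, "c": 0.1},
    {"a": 0.1, "b": 0.8, "c": 0.1},
    {"a": 0.25, "b": 0.25, "c": 0.5},
]
probs = max_probs_of_valid_tokens(tokens, distributions, mapping_table, preset_length=4)
```

In this toy input, the character `b` is skipped because its identifier is longer than the preset length, so only two maximum probabilities are collected.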
[0099] Step 1022-2: Determine the sub-loss corresponding to the output text based on the multiple probabilities corresponding to the output text.
[0100] In some implementations, for any probability of each output text, the negative logarithm to base 10 can be determined, and the negative logarithms corresponding to each probability can be added together to obtain the sub-loss corresponding to the output text.
[0101] In other words, the sub-loss can be expressed as:
$$L_1 = -\sum_{k=1}^{T_{ij}} \log P_\theta(t_{ijk})$$
[0102] Where $L_1$ represents the sub-loss; k represents the k-th valid token, i.e., the k-th target character; $T_{ij}$ represents the total number of valid tokens corresponding to the j-th training text of the i-th natural language processing task, which is the total number of target characters in the output text corresponding to the j-th training text of the i-th natural language processing task; $t_{ijk}$ represents the k-th valid token corresponding to the j-th training text of the i-th natural language processing task, which is the k-th target character corresponding to the j-th training text of the i-th natural language processing task; $P_\theta(t_{ijk})$ represents the probability of the k-th valid token corresponding to the j-th training text of the i-th natural language processing task (i.e., the aforementioned maximum probability); $-\log$ represents the negative logarithm to the base 10.
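The sub-loss described in step 1022-2 can be sketched in a few lines. The function name `sub_loss` and the sample probabilities are hypothetical; the base-10 logarithm follows the description above.

```python
import math


def sub_loss(probs):
    """Sum of the negative base-10 logarithms of the valid-token probabilities."""
    return sum(-math.log10(p) for p in probs)


# Three valid tokens with maximum probabilities 0.1, 0.01 and 1.0.
l1 = sub_loss([0.1, 0.01, 1.0])
```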
[0103] Step 1022-3: Determine the first loss based on the sub-loss corresponding to each output text in the text set and the task weight corresponding to the natural language processing task.
[0104] Specifically, for any text set corresponding to a natural language processing task, the sub-losses corresponding to each output text in the text set are summed to obtain the first result; the total number of target characters corresponding to each output text is summed to obtain the second result; the square of the task weight corresponding to the natural language processing task is determined to obtain the square value; the logarithm of the task weight of any natural language processing task with base 10 is determined to obtain the target logarithm; the second result is multiplied by the square value to obtain the third result; the first result is divided by the third result to obtain the fourth result; and the fourth result is added to the target logarithm to obtain the first loss.
[0105] In other words, the first loss can be expressed as:
$$L_2 = \frac{\sum_{j=1}^{M_i} L_1}{\sigma_i^{2} \sum_{j=1}^{M_i} T_{ij}} + \log \sigma_i$$
[0106] Where $L_2$ represents the first loss; $L_1$ represents the sub-loss mentioned above, here the sub-loss corresponding to the j-th training text; $\sigma_i$ represents the task weight of the i-th natural language processing task; j represents the j-th training text of the i-th natural language processing task; $M_i$ represents the total number of training texts corresponding to the i-th natural language processing task; $T_{ij}$ represents the total number of valid tokens corresponding to the j-th training text of the i-th natural language processing task, which is the total number of target characters in the output text corresponding to the j-th training text of the i-th natural language processing task; log represents taking the logarithm to the base 10.
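The sequence of intermediate results described in step 1022-3 can be traced with a minimal sketch. The function name `first_loss` and the numeric inputs are hypothetical; the variable names mirror the first to fourth results in the description.

```python
import math


def first_loss(sub_losses, token_counts, task_weight):
    """First loss of one task: sum(L1) / (sigma^2 * sum(T)) + log10(sigma)."""
    first_result = sum(sub_losses)                    # sum of sub-losses
    second_result = sum(token_counts)                 # total number of valid tokens
    third_result = second_result * task_weight ** 2   # scaled by the squared weight
    fourth_result = first_result / third_result
    return fourth_result + math.log10(task_weight)    # add the target logarithm


# With the initial task weight of 1, the loss is the sub-loss per valid token.
l2_initial = first_loss([3.0, 5.0], [4, 6], task_weight=1.0)
# A larger weight shrinks the data term but is penalized by the logarithm.
l2_larger_weight = first_loss([3.0, 5.0], [4, 6], task_weight=10.0)
```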
[0107] Step 1023: Determine the target loss based on the first loss corresponding to each natural language processing task.
[0108] Specifically, the average value of the first loss corresponding to each natural language processing task is determined to obtain the target loss. In other words, the target loss can be expressed as:
$$L_{\mathrm{target}} = \frac{1}{N} \sum_{i=1}^{N} L_2^{(i)}$$
[0109] Where $L_{\mathrm{target}}$ represents the target loss; N represents the total number of natural language processing tasks; i represents the i-th natural language processing task; and $L_2^{(i)}$ represents the first loss of the i-th natural language processing task, namely L2 mentioned above.
[0110] Therefore, when determining the target loss, the loss of each task is normalized by the number of target characters (i.e., the number of valid tokens) in the corresponding output texts and scaled by the task weight, rather than being accumulated over the number of samples. This prevents the model from favoring tasks with a large amount of data and solves the problem of task imbalance caused by unequal sample sizes, in which the model performs well on some tasks but poorly on others.
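The balancing effect can be seen in a small worked example. This is a sketch under the assumption that all task weights are still at their initial value of 1, so that each first loss reduces to the sub-loss per valid token; the function names and the numbers are hypothetical.

```python
def per_task_loss(sub_losses, token_counts):
    """First loss with task weight 1: total sub-loss per valid token."""
    return sum(sub_losses) / sum(token_counts)


def target_loss(first_losses):
    """Average of the per-task first losses."""
    return sum(first_losses) / len(first_losses)


large_task = per_task_loss([2.0] * 1000, [4] * 1000)  # 1000 training texts
small_task = per_task_loss([6.0] * 10, [4] * 10)      # 10 training texts
loss = target_loss([large_task, small_task])
```

Both tasks contribute one term to the average, so the task with 1000 texts does not outweigh the task with 10 texts.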
[0111] In some implementations, the training text may also carry labels, which include the standard output text corresponding to the training text. Accordingly, the aforementioned determination of the first loss corresponding to the natural language processing task may also involve determining the similarity between any output text in the text set corresponding to the natural language processing task and the corresponding standard output text, determining the average similarity of each output text in the text set corresponding to the natural language processing task, and multiplying the average similarity by the task weight to obtain the first loss corresponding to the natural language processing task.
[0112] Furthermore, in order to avoid the aforementioned task imbalance problem, in one or more embodiments of this disclosure, after determining the target loss based on multiple task weights, the method further includes: updating multiple task weights according to the target loss.
[0113] By updating task weights based on the target loss, the task weights can be continuously optimized during training, ensuring task balance and preventing multi-task models from performing well on some tasks while performing poorly on others.
[0114] To ensure the performance of the trained multi-task model, in some implementations, the model is trained for multiple training periods using the target training set, a candidate second parameter is obtained for each training period, and the target second parameter is determined from among the candidate second parameters. Specifically, as shown in Figure 8, step 103 may include the following steps 1031 to 1034:
[0115] Step 1031: Update the second parameters of at least one extended network and multiple task weights according to the target loss.
[0116] Specifically, the gradients of each second parameter and the gradients of each task weight are calculated based on the target loss. The second parameters are then updated based on their gradients, and the task weights are updated based on their gradients. The calculation methods for the gradients of the second parameters and the task weights can be found in relevant technologies, and are not specifically limited in this disclosure.
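For the task weights, one possible gradient computation can be sketched as follows. It assumes the first loss has the form A / sigma^2 + log10(sigma), where A is the summed sub-loss divided by the total number of valid tokens, that the target loss is the average over N tasks, and that plain gradient descent is used. The disclosure does not limit the gradient method, so the optimizer, the function names, and the learning rate are assumptions.

```python
import math


def task_weight_gradient(per_token_loss, task_weight, num_tasks):
    """Gradient of the target loss with respect to one task weight.

    Derived from L2 = A / sigma^2 + log10(sigma), averaged over num_tasks.
    """
    d_first_loss = (
        -2.0 * per_token_loss / task_weight ** 3
        + 1.0 / (task_weight * math.log(10))
    )
    return d_first_loss / num_tasks


def sgd_step(value, gradient, learning_rate):
    """Plain gradient-descent update (an assumed optimizer)."""
    return value - learning_rate * gradient


grad = task_weight_gradient(per_token_loss=0.8, task_weight=1.0, num_tasks=2)
new_weight = sgd_step(1.0, grad, learning_rate=0.1)
```

In this toy setting the gradient is negative, so the update increases the task weight.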
[0117] Step 1032: If it is determined that the current training period has ended, the current second parameter is determined as the candidate second parameter corresponding to the current training period. In any training period, each training text in the target training set participates in training once.
[0118] In this context, a training period (also called a training cycle or epoch) refers to the process by which the model performs a complete traversal and learning of the target training set. In other words, during any given training period, each training text in the target training set participates in training once.
[0119] To avoid the aforementioned data imbalance problem, in one or more embodiments of this disclosure, when obtaining multiple training texts from the target training set for training in the current training round, a non-repeating sampling mechanism is employed, ensuring that every sample in the target training set is sampled exactly once. When all samples in the target training set have been sampled once, it can be determined that each training text in the target training set has participated in training once, and the current second parameter is then determined as the candidate second parameter corresponding to the current training period. Then, training is performed in the next training period using the target training set until a preset number of candidate second parameters are obtained.
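A non-repeating sampling mechanism of this kind can be sketched as shuffling the indices once per training period and slicing them into batches. The function name `epoch_batches`, the batch size, and the seed are hypothetical.

```python
import random


def epoch_batches(training_texts, batch_size, seed=None):
    """Yield batches so that every training text is sampled exactly once."""
    order = list(range(len(training_texts)))
    random.Random(seed).shuffle(order)  # new random order, no repetition
    for start in range(0, len(order), batch_size):
        yield [training_texts[i] for i in order[start:start + batch_size]]


target_training_set = [f"text-{i}" for i in range(10)]
batches = list(epoch_batches(target_training_set, batch_size=4, seed=0))
seen = [text for batch in batches for text in batch]
```

Once the generator is exhausted, every training text has been used once, which marks the end of the training period.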
[0120] Step 1033: If the number of candidate second parameters reaches a preset number, then the training termination condition is met, and the target second parameter that meets the third preset condition is determined among the preset number of candidate second parameters.
[0121] Specifically, when the number of candidate second parameters reaches a preset number, indicating that a preset number of training periods have been performed, the training termination condition is determined to be met, and the target second parameter that satisfies the third preset condition is identified among the candidate second parameters. For example, if the preset number is 5, the model is trained for 5 training periods using the target training set, resulting in 5 candidate second parameters.
[0122] The third preset condition can be used to characterize the accuracy of the candidate second parameters. Specifically, determining, among the preset number of candidate second parameters, the target second parameter that satisfies the third preset condition can include: for any candidate second parameter, validating the model to be trained corresponding to the candidate second parameter using test text to obtain a validation result, where the validation result characterizes the accuracy of that model in processing the test text; and determining the candidate second parameter corresponding to the highest accuracy as the target second parameter. It is understood that the higher the accuracy of the model in processing the test text, the higher the accuracy of the corresponding candidate second parameter. Therefore, determining the candidate second parameter corresponding to the highest accuracy as the target second parameter ensures the accuracy of the multi-task model.
[0123] In some implementations, the target training set can be further divided into a training set and a test set. Accordingly, in step 101, the training text in the training set can be used to train the model to be trained, and in step 1033, the test text in the test set can be used to verify the model to be trained corresponding to the candidate second parameter.
[0124] Furthermore, validating the model to be trained corresponding to the candidate second parameter using test text to obtain the validation result can include: inputting at least one test text into the model to be trained corresponding to the first parameter and the candidate second parameter to obtain the predicted text; determining the similarity between the predicted text and the target output text in the label of the test text; if the similarity is greater than a second similarity threshold, determining that the corresponding predicted text is correctly predicted; if the similarity is not greater than the second similarity threshold, determining that the corresponding predicted text is incorrectly predicted. A first number of correctly predicted texts is counted, and this first number is divided by the total number of test texts to obtain the accuracy corresponding to the candidate second parameter. Finally, the candidate second parameter corresponding to the highest accuracy is determined as the target second parameter.
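The validation and selection described above can be sketched as follows. The disclosure does not fix a similarity measure, so `difflib.SequenceMatcher` from the Python standard library is used here purely as an assumed stand-in; the function names, the threshold, and the example texts are hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Assumed similarity measure; the text does not specify one."""
    return SequenceMatcher(None, a, b).ratio()


def accuracy(predicted_texts, target_texts, threshold):
    """Fraction of predictions whose similarity exceeds the second threshold."""
    correct = sum(
        1
        for pred, target in zip(predicted_texts, target_texts)
        if similarity(pred, target) > threshold
    )
    return correct / len(target_texts)


def select_target(candidates, accuracies):
    """Return the candidate second parameter with the highest accuracy."""
    best_index = max(range(len(candidates)), key=lambda i: accuracies[i])
    return candidates[best_index]


targets = ["good morning", "thank you"]
predictions_per_period = [
    ["good evening", "no"],         # predictions with candidate 1
    ["good morning", "thank you"],  # predictions with candidate 2
]
accuracies = [accuracy(p, targets, threshold=0.9) for p in predictions_per_period]
best = select_target(["candidate_1", "candidate_2"], accuracies)
```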
[0125] Step 1034: Determine the model to be trained corresponding to the first parameter and the target second parameter as the multi-task model.
[0126] Therefore, by training the model for multiple training cycles, multiple candidate second parameters are obtained, and each candidate second parameter is tested to ensure the accuracy of the target second parameter, thereby ensuring the accuracy of the multi-task model.
[0127] Corresponding to the aforementioned large model training method embodiments, this disclosure also provides a task processing method. Figure 9 is a flowchart of the task processing method provided by an embodiment of this disclosure. The method shown in Figure 9 can be executed by a task processing device. As shown in Figure 9, the method includes steps 201 and 202.
[0128] Step 201: Obtain the text data of the target task, which is any one of multiple natural language processing tasks.
[0129] In some implementations, the task processing device can receive text data of the target task input by the user, or it can receive text data of the target task sent by other devices. The method of obtaining the text data of the target task is not specifically limited in this disclosure.
[0130] Step 202: Call the multi-task model to process the text data and obtain the output text.
[0131] The multi-task model is trained according to the training method of the multi-task model provided in any of the foregoing embodiments. For details, please refer to the relevant description above, which is not repeated here.
[0132] The multi-task model used in the task processing method provided in this disclosure is obtained by training a model to be trained that includes a large language model and at least one extended network. During training, since the first parameter of the large language model is locked, the model to be trained can learn relevant knowledge of multiple natural language processing tasks through the extended network, avoiding the knowledge forgetting problem of the large language model and improving the accuracy of the multi-task model. Therefore, processing the target task based on this multi-task model ensures the accuracy of the processing results.
[0133] Corresponding to the aforementioned embodiments of the large model training method, this disclosure also provides embodiments of a large model training apparatus. Figure 10 is a schematic structural diagram of a large model training apparatus according to an exemplary embodiment, which can be configured in the training device shown in Figure 1. As shown in Figure 10, the large model training apparatus includes:
[0134] The training module 301 is used to call the model to be trained to output multiple output texts based on multiple training texts in any training round. The model to be trained includes a large language model and at least one extended network. Any extended network is used to learn relevant knowledge of multiple natural language processing tasks. The first parameter of the large language model is locked in the training round.
[0135] The first determining module 302 is used to determine the target loss based on the target data, which includes multiple output texts.
[0136] The update module 303 is used to update the second parameters of at least one extended network according to the target loss until the training termination condition is met, thus obtaining a multi-task model.
[0137] In some embodiments, the device further includes an adding module and an initialization module.
[0138] The adding module is used to add at least one extended network to the pre-trained large language model when the training round is the first training round, before the training module 301 calls the model to be trained to output multiple output texts based on multiple training texts.
[0139] The initialization module is used to configure the update state of the first parameter to a locked state and initialize the second parameter and multiple task weights before the training module 301 calls the model to be trained to output multiple output texts based on multiple training texts, when the training round is the first training round. The multiple task weights correspond one-to-one with multiple natural language processing tasks.
[0140] In some implementations, the large language model includes multiple contiguous decoding blocks; correspondingly, the adding module is specifically used for:
[0141] Divide multiple decoding blocks into at least one block combination.
[0142] For any block combination, add at least one extended network at a preset position in the block combination.
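The two operations of the adding module can be sketched as follows. This is a minimal illustration that assumes equally sized block combinations and takes the end of each combination as the preset position; neither choice is stated in this passage, and the function name and block labels are hypothetical.

```python
def add_extended_networks(decoding_blocks, blocks_per_combination):
    """Split blocks into combinations and append one extended network to each.

    Placing the extended network at the end of each combination is an
    assumed choice of the preset position.
    """
    layout = []
    for start in range(0, len(decoding_blocks), blocks_per_combination):
        combination = decoding_blocks[start:start + blocks_per_combination]
        layout.extend(combination)
        layout.append(f"extended_after_{combination[-1]}")
    return layout


blocks = [f"block_{i}" for i in range(6)]
layout = add_extended_networks(blocks, blocks_per_combination=3)
```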
[0143] In some implementations, any decoding block includes multiple first processing layers, and the number of network layers in any extended network is the same as the number of network layers in the decoding block. Accordingly, the initialization module is specifically used for:
[0144] The parameters of the second processing layer in the extended network are initialized to zero.
[0145] For any extension layer, determine the first processing layer in the decoding block that corresponds to the extension layer. The position of the extension layer in the extended network is the same as the position of the first processing layer in the decoding block. The extension layer is a layer in the extended network other than the second processing layer.
[0146] Initialize the parameters of the extension layer to the parameters of the first processing layer.
[0147] In some implementations, the first determining module 302 is specifically used for:
[0148] Multiple output texts are divided into multiple text sets, and each text set corresponds to a different natural language processing task.
[0149] For any natural language processing task, the first loss corresponding to the natural language processing task is determined based on the text set corresponding to the natural language processing task and the task weight corresponding to the natural language processing task.
[0150] The target loss is determined based on the first loss corresponding to each natural language processing task.
[0151] In some implementations, the first determining module 302 is further specifically used for:
[0152] For any output text in the text set corresponding to the natural language processing task, determine multiple probabilities corresponding to the output text. These multiple probabilities correspond one-to-one with multiple target characters in the output text. Any probability is the maximum probability in the probability distribution corresponding to the target character. The probability distribution corresponding to any target character includes the probability that the character is each preset character. The target character is a character that satisfies a first preset condition. The first preset condition is used to constrain the length of the identifier corresponding to the character.
[0153] Based on multiple probabilities, determine the sub-loss corresponding to the output text.
[0154] The first loss is determined based on the sub-loss corresponding to each output text in the text set and the task weight corresponding to the natural language processing task.
[0155] In some implementations, the first determining module 302 is further configured to:
[0156] Before determining the multiple probabilities corresponding to the output text, for any character in the output text, determine the target identifier corresponding to the character in the preset mapping table.
[0157] If the length of the target identifier is determined to be no greater than the preset length, then the character corresponding to the target identifier is determined as the target character.
[0158] In some implementations, the update module 303 is also used to update the multiple task weights based on the target loss.
[0159] In some embodiments, the apparatus further includes an acquisition module.
[0160] The acquisition module is used to acquire training texts that meet the second preset condition from multiple initial training sets when the training round is the first training round, and obtain the target training set. The second preset condition is used to characterize the accuracy of the large language model's processing results on the training texts. Multiple initial training sets correspond one-to-one with multiple natural language processing tasks, and each initial training set includes multiple training texts corresponding to the natural language processing task.
[0161] Accordingly, training module 301 is specifically used for:
[0162] Obtain multiple training texts from the target training set, and call the model to be trained to output multiple output texts based on the multiple training texts.
[0163] In some implementations, each training text carries a label, the label including the target output text corresponding to the training text; accordingly, the acquisition module is specifically used for:
[0164] For any initial training set, the large language model is invoked to process each training text in the initial training set, resulting in each predicted text that corresponds one-to-one with each training text.
[0165] For any training text, determine the similarity between the predicted text corresponding to the training text and the target output text corresponding to the training text.
[0166] Each training text corresponding to the target similarity is determined as the target training set. The target similarity is the similarity that is less than the first similarity threshold.
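The selection performed by the acquisition module can be sketched as follows. The similarity measure is not specified in the text, so `difflib.SequenceMatcher` is an assumed stand-in, and `toy_model`, the threshold, and the samples are hypothetical.

```python
from difflib import SequenceMatcher


def select_target_training_set(samples, predict, first_threshold):
    """Keep training texts whose prediction similarity is below the threshold.

    samples: (training_text, target_output_text) pairs from one initial set.
    predict: callable standing in for the pre-trained large language model.
    """
    selected = []
    for training_text, target_output in samples:
        predicted = predict(training_text)
        score = SequenceMatcher(None, predicted, target_output).ratio()
        if score < first_threshold:  # the model handles this text poorly
            selected.append((training_text, target_output))
    return selected


def toy_model(text):
    """Hypothetical stand-in that answers only one prompt correctly."""
    return "4" if text == "2 + 2 =" else "unknown"


samples = [("2 + 2 =", "4"), ("Capital of France?", "Paris")]
target_training_set = select_target_training_set(samples, toy_model, first_threshold=0.8)
```

Only the text that the stand-in model answers poorly is kept, which matches the idea of training on samples the large language model does not yet process accurately.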
[0167] In some embodiments, the apparatus further includes a second determining module.
[0168] The second determining module is used to determine the current second parameter as the candidate second parameter corresponding to the current training period if it is determined that the current training period has ended. In any training period, each training text in the target training set participates in training once.
[0169] Accordingly, update module 303 is specifically used for:
[0170] If the number of candidate second parameters reaches a preset number, the training termination condition is determined to be met; the target second parameter that meets the third preset condition is determined among the preset number of candidate second parameters, where the third preset condition is used to characterize the accuracy of the candidate second parameter; and the model to be trained corresponding to the first parameter and the target second parameter is determined as the multi-task model.
[0171] In some implementations, the update module 303 is further specifically used for:
[0172] For any candidate second parameter, the model to be trained corresponding to the candidate second parameter is validated using test text to obtain the validation result; the validation result characterizes the accuracy of the model to be trained in processing the test text.
[0173] The candidate second parameter corresponding to the maximum accuracy is determined as the target second parameter.
[0174] The large model training apparatus and the large model training method provided in this disclosure are based on the same inventive concept and have the same beneficial effects as the methods they employ, operate, or implement.
[0175] Corresponding to the aforementioned embodiments of the task processing method, this disclosure also provides an embodiment of a computing device. Figure 11 is a schematic diagram of a computing device according to an exemplary embodiment, which can be configured in a task processing device. As shown in Figure 11, the computing device includes:
[0176] The acquisition module 401 is used to acquire text data of the target task, which is any one of multiple natural language processing tasks.
[0177] The processing module 402 is used to call the multi-task model to process the text data and obtain the output text. The multi-task model is trained according to the training method of the multi-task model provided in any of the foregoing embodiments.
[0178] The computing device and the task processing method provided in the embodiments of this disclosure are based on the same inventive concept and have the same beneficial effects as the methods they employ, operate, or implement.
[0179] The specific implementation process of the functions and roles of each module in the above device can be found in the implementation process of the corresponding steps in the above method, and will not be repeated here.
[0180] For the device embodiments, since they basically correspond to the method embodiments, the relevant parts can be referred to in the description of the method embodiments. The device embodiments described above are merely illustrative. The modules described as separate components may or may not be physically separate. The components illustrated as modules may or may not be physical modules, that is, they may be located in one place or distributed across multiple network units. Some or all of the modules can be selected to achieve the purpose of this disclosure according to actual needs. Those skilled in the art can understand and implement this without creative effort.
[0181] It is understandable that the above division of modules is only a logical functional division. In actual implementation, the functions of the above modules can be integrated into hardware entities. For example, the functions of training module 301, first determination module 302, update module 303 and processing module 402 can be integrated into the processor, and the function of acquisition module 401 can be integrated into the transceiver, etc.
[0182] Based on this, some embodiments of this disclosure also provide an electronic device corresponding to the large model training method and task processing method provided in the foregoing embodiments. Figure 12 is a block diagram of an electronic device used to implement the embodiments of this disclosure. As shown in Figure 12, the electronic device includes a memory 501 and a processor 502. The memory 501 stores a computer program that can run on the processor 502. When the processor 502 executes the computer program, it implements the methods in the above embodiments. The number of memories 501 and processors 502 can be one or more. In a specific implementation, the electronic device may also include a communication interface 503 for communicating with external devices and performing data interaction and transmission.
[0183] In practical implementation, if the memory 501, processor 502, and communication interface 503 are implemented independently, they can be interconnected via a bus to complete communication. This bus can be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, or an Extended Industry Standard Architecture (EISA) bus, etc. This bus can be divided into address bus, data bus, control bus, etc. For ease of representation, only one thick line is used in Figure 12, but this does not indicate that there is only one bus or one type of bus.
[0184] Optionally, in a specific implementation, if the memory 501, processor 502 and communication interface 503 are integrated on a single chip, the memory 501, processor 502 and communication interface 503 can communicate with each other through an internal interface.
[0185] This disclosure provides a computer-readable storage medium storing a computer program that, when executed by a processor, implements the methods provided in this disclosure.
[0186] This disclosure provides a computer program product, including a computer program that, when executed by a processor, implements the methods provided in this disclosure.
[0187] This disclosure also provides a chip including a processor for calling and executing instructions stored in a memory, causing a communication device on which the chip is installed to perform the methods provided in this disclosure.
[0188] This disclosure also provides a chip, including: an input interface, an output interface, a processor, and a memory. The input interface, output interface, processor, and memory are connected through an internal connection path. The processor is used to execute code in the memory. When the code is executed, the processor performs the methods provided in the embodiments of this disclosure.
[0189] It should be understood that the aforementioned processor can be a Central Processing Unit (CPU), another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc. A general-purpose processor can be a microprocessor or any conventional processor. It is worth noting that the processor can be a processor supporting the Advanced RISC Machines (ARM) architecture.
[0190] Further, optionally, the aforementioned memory may include read-only memory and random access memory. The memory may be volatile memory or non-volatile memory, or may include both. Non-volatile memory may include read-only memory (ROM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), or flash memory. Volatile memory may include random access memory (RAM), which serves as an external cache. By way of example, but not limitation, many forms of RAM are available, including Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), Synchronous DRAM (SDRAM), Double Data Rate SDRAM (DDR SDRAM), Enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), and Direct Rambus RAM (DR RAM).
[0191] In the above embodiments, implementation can be achieved, in whole or in part, by software, hardware, firmware, or any combination thereof. When implemented in software, it can be implemented, in whole or in part, as a computer program product. A computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, all or part of the processes or functions according to this disclosure are generated. The computer can be a general-purpose computer, a special-purpose computer, a computer network, or other programmable device. The computer instructions can be stored in a computer-readable storage medium or transferred from one computer-readable storage medium to another.
[0192] In the description of this specification, references to terms such as "one embodiment," "some embodiments," "example," "specific example," or "some examples," etc., indicate that a specific feature, structure, material, or characteristic described in connection with that embodiment or example is included in at least one embodiment or example of this disclosure. Furthermore, the specific features, structures, materials, or characteristics described may be combined in any suitable manner in one or more embodiments or examples. Moreover, without contradiction, those skilled in the art can combine and integrate the different embodiments or examples described in this specification, as well as the features of those different embodiments or examples.
[0193] Furthermore, the terms "first" and "second" are used for descriptive purposes only and should not be construed as indicating or implying relative importance or implicitly specifying the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one of that feature. In the description of this disclosure, "a plurality of" means two or more, unless otherwise explicitly specified.
[0194] Any process or method described in the flowchart or otherwise herein can be understood as representing a module, segment, or portion of code comprising one or more executable instructions for implementing a particular logical function or process. Furthermore, the scope of the preferred embodiments of this disclosure includes additional implementations in which functions may be performed not in the order shown or discussed, including substantially simultaneously or in reverse order depending on the functionality involved.
[0195] The logic and/or steps described in the flowchart or otherwise herein can, for example, be considered a sequenced list of executable instructions for implementing logical functions, and can be embodied in any computer-readable medium for use by, or in conjunction with, an instruction execution system, apparatus, or device (such as a computer-based system, a system including a processor, or another system that can fetch instructions from the instruction execution system, apparatus, or device and execute them).
[0196] It should be understood that various parts of this disclosure can be implemented using hardware, software, firmware, or a combination thereof. In the above embodiments, multiple steps or methods can be implemented using software or firmware stored in memory and executed by a suitable instruction execution system. All or part of the steps of the methods in the above embodiments can be implemented by a program instructing related hardware. The program is stored in a computer-readable storage medium and, when executed, performs one or a combination of the steps of the method embodiments.
[0197] Furthermore, the functional units in the various embodiments of this disclosure can be integrated into a processing module, or each unit can exist physically separately, or two or more units can be integrated into a module. The integrated module can be implemented in hardware or as a software functional module. If the integrated module is implemented as a software functional module and sold or used as an independent product, it can also be stored in a computer-readable storage medium. This storage medium can be a read-only memory, a disk, or an optical disk, etc.
[0198] The above description is merely an exemplary embodiment of this disclosure, but the scope of protection of this disclosure is not limited thereto. Any person skilled in the art can easily conceive of various variations or substitutions within the technical scope described in this disclosure, and these should all be included within the scope of protection of this disclosure. Therefore, the scope of protection of this disclosure should be determined by the scope of the claims.
Claims
1. A method for training a large model, the method comprising: in any training round, invoking a model to be trained to output multiple output texts based on multiple training texts, wherein the model to be trained includes a large language model and at least one extended network, any extended network is used to learn relevant knowledge of multiple natural language processing tasks, and a first parameter of the large language model is locked in the training round; determining a target loss based on target data, wherein the target data includes the multiple output texts; and updating second parameters of the at least one extended network according to the target loss until a training termination condition is met, thereby obtaining a multi-task model.
2. The method according to claim 1, wherein, when the training round is the first training round, before invoking the model to be trained to output multiple output texts based on multiple training texts, the method further includes: calling a preset function interface to add at least one extended network to a pre-trained large language model to obtain the model to be trained; and configuring an update state of the first parameter to a locked state, and initializing the second parameter and multiple task weights, wherein each task weight corresponds to one of the multiple natural language processing tasks.
3. The method according to claim 2, wherein the large language model comprises multiple interconnected decoding blocks, and adding at least one extended network to the pre-trained large language model comprises: dividing the multiple decoding blocks into at least one block combination; and for any block combination, adding at least one extended network at a predetermined position in the block combination.
4. The method according to claim 3, wherein any decoding block includes a plurality of first processing layers, and the number of network layers of any extended network is the same as the number of network layers of the decoding block, wherein initializing the second parameter includes: Initialize the parameters of the second processing layer in the extended network to zero; For any extended layer, a first processing layer corresponding to the extended layer is determined in the decoding block. The position of the extended layer in the extended network is the same as the position of the first processing layer in the decoding block. The extended layer is a layer in the extended network other than the second processing layer. The parameters of the extended layer are initialized to the parameters of the first processing layer.
5. The method according to any one of claims 2-4, wherein the target data further includes the plurality of task weights, and the step of determining the target loss based on the target data includes: The multiple output texts are divided into multiple text sets, and each of the multiple text sets corresponds one-to-one with the multiple natural language processing tasks. For any natural language processing task, a first loss corresponding to the natural language processing task is determined based on the text set corresponding to the natural language processing task and the task weight corresponding to the natural language processing task. The target loss is determined based on the first loss corresponding to each natural language processing task.
6. The method according to claim 5, wherein determining the first loss corresponding to the natural language processing task based on the text set corresponding to the natural language processing task and the task weight corresponding to the natural language processing task comprises: For any output text in the text set corresponding to the natural language processing task, determine multiple probabilities corresponding to the output text. The multiple probabilities correspond one-to-one with multiple target characters in the output text. Any probability is the maximum probability in the probability distribution corresponding to the target character. The probability distribution corresponding to any target character includes the probability that the character is each preset character. The target character is a character that satisfies a first preset condition. The first preset condition is used to constrain the length of the identifier corresponding to the character. Based on the multiple probabilities, determine the sub-loss corresponding to the output text; The first loss is determined based on the sub-loss corresponding to each output text in the text set and the task weight corresponding to the natural language processing task.
7. The method according to claim 6, wherein before determining the plurality of probabilities corresponding to the output text, the method further comprises: For any character in the output text, determine the target identifier corresponding to the character in a preset mapping table; If it is determined that the length of the target identifier is not greater than the preset length, then the character corresponding to the target identifier is determined as the target character.
8. The method according to claim 2, wherein after determining the target loss based on the target data, the method further comprises: updating the multiple task weights based on the target loss.
9. The method according to claim 2, wherein if the training round is the first training round, the method further comprises: Training texts that meet the second preset condition are obtained from multiple initial training sets to obtain a target training set. The multiple initial training sets correspond one-to-one with the multiple natural language processing tasks. Each initial training set includes multiple training texts corresponding to the natural language processing task. The second preset condition is used to characterize the accuracy of the processing results of the large language model on the training texts. The process of calling the model to be trained to output multiple output texts based on multiple training texts includes: Multiple training texts are obtained from the target training set, and the model to be trained is invoked to output multiple output texts based on the multiple training texts.
10. The method according to claim 9, wherein any training text carries a label, the label including the target output text corresponding to the training text; and obtaining training texts that satisfy the second preset condition from multiple initial training sets to obtain the target training set includes: for any initial training set, invoking the large language model to process each training text in the initial training set to obtain predicted texts that correspond one-to-one with the training texts; for any training text, determining the similarity between the predicted text corresponding to the training text and the target output text corresponding to the training text; and determining the training texts corresponding to target similarities as the target training set, wherein a target similarity is a similarity that is less than a first similarity threshold.
11. The method according to claim 9, further comprising: if it is determined that the current training period has ended, determining the current second parameter as the candidate second parameter corresponding to the current training period, wherein in any training period, each training text in the target training set participates in training once; wherein updating the second parameters until the training termination condition is met to obtain the multi-task model includes: if the number of candidate second parameters reaches a preset number, determining that the training termination condition is met; determining, among the preset number of candidate second parameters, a target second parameter that satisfies a third preset condition, wherein the third preset condition is used to characterize the accuracy of the candidate second parameter; and determining the model to be trained corresponding to the first parameter and the target second parameter as the multi-task model.
12. The method according to claim 11, wherein determining the target second parameter among the preset number of candidate second parameters that satisfies the third preset condition includes: for any candidate second parameter, verifying the model to be trained corresponding to the candidate second parameter using test text to obtain a verification result, wherein the verification result characterizes the accuracy of the model to be trained in processing the test text; and determining the candidate second parameter corresponding to the maximum accuracy as the target second parameter.
13. The method according to claim 12, wherein verifying the model to be trained corresponding to the candidate second parameter using test text to obtain a verification result includes: inputting at least one test text into the model to be trained corresponding to the first parameter and the candidate second parameter to obtain the predicted text corresponding to each of the at least one test text, wherein the label of any test text contains the target output text; determining the similarity between each predicted text and the corresponding target output text; and determining the verification result based on the similarity.
14. The method according to claim 13, wherein determining the verification result based on the similarity comprises: If the similarity is greater than the second similarity threshold, the corresponding predicted text is determined to be correctly predicted. If the similarity is not greater than the second similarity threshold, the corresponding predicted text is determined to be incorrect. The first number of correctly predicted texts is counted, and the first number is divided by the total number of test texts to obtain the accuracy corresponding to the candidate second parameter. The accuracy is then determined as the verification result.
15. A task processing method, the method comprising: Obtain text data for a target task, wherein the target task is any one of multiple natural language processing tasks; The text data is processed by calling a multi-task model to obtain output text, wherein the multi-task model is trained by the training method according to any one of claims 1-14.
16. A training apparatus for a large model, the apparatus comprising: The training module is used to call the model to be trained to output multiple output texts based on multiple training texts in any training round. The model to be trained includes a large language model and at least one extended network. Any extended network is used to learn relevant knowledge of multiple natural language processing tasks. The first parameter of the large language model is locked in the training round. The first determining module is used to determine the target loss based on the target data, wherein the target data includes the plurality of output texts; An update module is used to update the second parameters of the at least one extended network according to the target loss until the training termination condition is met, thereby obtaining a multi-task model.
17. A computing device, the device comprising: The acquisition module is used to acquire text data of the target task, which is any one of multiple natural language processing tasks. The processing module is used to call a multi-task model to process the text data and obtain output text, wherein the multi-task model is trained by the training method according to any one of claims 1-14.
18. An electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, the processor executing the program to implement the method as claimed in any one of claims 1-15.
19. A computer-readable storage medium having a computer program stored thereon, the program being executed by a processor to implement the method as described in any one of claims 1-15.
20. A computer program product comprising a computer program that is executed by a processor to implement the method of any one of claims 1-15.
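By way of illustration only, the verification and selection recited in claims 12 to 14 can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the claimed implementation: the claims do not name a similarity measure, so the standard library's `difflib.SequenceMatcher` is assumed here, and the names `similarity`, `accuracy`, `select_target`, the candidate labels, and the threshold value 0.8 are hypothetical.

```python
from difflib import SequenceMatcher
from typing import Callable, Dict, List, Tuple

# Hypothetical type: a candidate model maps a test text to a predicted text.
Predictor = Callable[[str], str]


def similarity(predicted: str, target: str) -> float:
    """Assumed similarity in [0, 1]; the claims do not specify a measure."""
    return SequenceMatcher(None, predicted, target).ratio()


def accuracy(predict: Predictor,
             test_set: List[Tuple[str, str]],
             threshold: float) -> float:
    """Number of correctly predicted texts divided by the number of test texts.

    A prediction counts as correct when its similarity to the target output
    text is greater than the (second) similarity threshold.
    """
    if not test_set:
        raise ValueError("test_set must contain at least one test text")
    correct = sum(
        1 for text, target in test_set
        if similarity(predict(text), target) > threshold
    )
    return correct / len(test_set)


def select_target(candidates: Dict[str, Predictor],
                  test_set: List[Tuple[str, str]],
                  threshold: float = 0.8) -> str:
    """Return the name of the candidate with the maximum accuracy."""
    scores = {name: accuracy(p, test_set, threshold)
              for name, p in candidates.items()}
    return max(scores, key=scores.get)


# Toy data: each test text is paired with its labelled target output text.
test_set = [("2+2", "4"), ("capital of France", "Paris")]
answers = dict(test_set)

# Two stand-in candidates, one per training period (hypothetical labels).
candidates = {
    "period_1": lambda text: "unknown",
    "period_2": lambda text: answers.get(text, ""),
}

best = select_target(candidates, test_set)
```

In this sketch each candidate stands in for the model to be trained loaded with one candidate second parameter. A real system would replace the lambdas with calls to the model and could substitute any similarity measure suited to the task.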