Method and apparatus for fine-tuning pre-trained language model

By adding a multi-view compressed representation module and a hierarchical autoencoder to the pre-trained language model, the overfitting problem in low-resource scenarios is solved, the robustness and accuracy of the prediction model are improved, and memory usage and cost are reduced.

CN116090508B · Active · Publication Date: 2026-06-16 · ALIBABA (CHINA) CO LTD +1

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
ALIBABA (CHINA) CO LTD
Filing Date
2022-10-08
Publication Date
2026-06-16

Abstract

The embodiments of the application disclose a method and apparatus for fine-tuning a pre-trained language model. The method comprises: obtaining a pre-constructed augmented model, the augmented model being a model in which a multi-view compressed representation module is added between at least two hidden layers of a pre-trained language model, the multi-view compressed representation module comprising N hierarchical autoencoders; fine-tuning the augmented model using training data to obtain a target model, the target model comprising the augmented model and a downstream prediction model, wherein all model parameters are updated during fine-tuning and the training objective is to minimize the difference between the output of the downstream prediction model and an expected value; and, after fine-tuning is completed, removing the multi-view compressed representation module from the target model to obtain a prediction model. The application can reduce the risk of overfitting and improve the robustness of the model.

Description

Technical Field

[0001] This application relates to the field of deep learning technology, and in particular to a method and apparatus for fine-tuning a pre-trained language model.

Background Technology

[0002] In recent years, fine-tuning large pre-trained language models to obtain predictive models has become one of the most commonly used methods in natural language processing tasks, achieving excellent performance in numerous tasks. However, pre-trained models are prone to overfitting in low-resource scenarios, i.e., when training data is limited, leading to a decline in performance. Labeling large amounts of training data often incurs high time and financial costs. Therefore, there is an urgent need for a model training method that can reduce the risk of overfitting and improve the predictive performance of the predictive model.

Summary of the Invention

[0003] In view of this, this application provides a method and apparatus for fine-tuning a pre-trained language model to reduce the risk of overfitting and improve the prediction performance of the prediction model.

[0004] This application provides the following solution:

[0005] Firstly, a method for fine-tuning a pre-trained language model is provided, the method comprising:

[0006] Obtain a pre-built augmented model, wherein the augmented model is a model obtained by adding a multi-view compressed representation module between at least two hidden layers of a pre-trained language model, and the multi-view compressed representation module includes N hierarchical autoencoders, where N is a positive integer;

[0007] The augmented model is fine-tuned using training data to obtain a target model, which includes the augmented model and a downstream prediction model. During the fine-tuning training, the parameters of the pre-trained language model, the multi-view compressed representation module, and the downstream prediction model are updated. The training objective is to minimize the difference between the output of the downstream prediction model and the expected value.

[0008] After the fine-tuning training is completed, the multi-view compressed representation module is removed from the target model obtained by training to obtain the prediction model.

[0009] According to one achievable method in an embodiment of this application, the N hierarchical autoencoders employ different compression dimensions.

[0010] According to one achievable method in an embodiment of this application, before fine-tuning the augmented model using training data, the method further includes:

[0011] The augmented model is pre-trained using the training data. During the pre-training process, only the parameters of the multi-view compressed representation module are updated. The training objective is to minimize the difference between the input and output of the multi-view compressed representation module.

[0012] Fine-tuning the augmented model using the training data includes: fine-tuning the pre-trained augmented model using the training data.

[0013] According to one achievable method in an embodiment of this application, during the pre-training and fine-tuning training processes, the hidden layer preceding the multi-view compressed representation module outputs its implicit representation to the multi-view compressed representation module, and the module either randomly selects one of the N hierarchical autoencoders, which outputs the implicit representation to the next hidden layer, or randomly selects none of them.

[0014] According to one achievable method in an embodiment of this application, the hierarchical autoencoder includes an encoding module, an intra-layer encoding module, and a decoding module, wherein the intra-layer encoding module includes M intra-layer autoencoders, where M is a positive integer;

[0015] The encoding module in the hierarchical autoencoder that receives the implicit expression outputs the implicit expression to the intra-layer encoding module. The intra-layer encoding module randomly selects one of the M intra-layer autoencoders or randomly does not select any intra-layer autoencoder to output the implicit expression to the decoding module.

[0016] According to one achievable method in an embodiment of this application, the addition of a multi-view compressed representation module between at least two hidden layers of the pre-trained language model includes:

[0017] Add a multi-view compressed representation module between the top hidden layer and its adjacent hidden layers of the pre-trained language model; and / or,

[0018] A multi-view compressed representation module is added between the bottom hidden layer and its adjacent hidden layers of the pre-trained language model.

[0019] According to one achievable method in the embodiments of this application, N is 3, and the compression dimensions of the three hierarchical autoencoders are 128, 256 and 512, respectively.

[0020] According to one achievable method in an embodiment of this application, the training data is text sequence pairs, and the expected value is the relation type of the text sequence pairs; or,

[0021] The training data is a text sequence, and the expected value is the sentiment type of the text sequence; or,

[0022] The training data is a text sequence, and the expected value is a named entity in the text sequence; or,

[0023] The training data is a text sequence, and the expected value is the part-of-speech tag of at least one word in the text sequence.

[0024] Secondly, a fine-tuning device for a pre-trained language model is provided, the device comprising:

[0025] The acquisition unit is configured to acquire a pre-built augmented model, which is a model obtained by adding a multi-view compressed representation module between at least two hidden layers of a pre-trained language model. The multi-view compressed representation module includes N hierarchical autoencoders, where N is a positive integer.

[0026] The fine-tuning unit is configured to fine-tune the augmented model using training data to obtain a target model, which includes the augmented model and a downstream prediction model. During the fine-tuning training, the parameters of the pre-trained language model, the multi-view compressed representation module, and the downstream prediction model are updated, and the training objective is to minimize the difference between the output of the downstream prediction model and the expected value. After the fine-tuning training is completed, the multi-view compressed representation module is removed from the trained target model to obtain the prediction model.

[0027] According to one achievable embodiment of this application, the apparatus further includes:

[0028] A pre-training unit is configured to pre-train the augmented model using the training data, during which only the parameters of the multi-view compressed representation module are updated, and the training objective is to minimize the difference between the input and output of the multi-view compressed representation module.

[0029] The fine-tuning unit is specifically configured to fine-tune the pre-trained augmented model using the training data to obtain the target model.

[0030] According to a third aspect, a computer-readable storage medium is provided having a computer program stored thereon, which, when executed by a processor, implements the steps of the method described in any one of the first aspects above.

[0031] According to a fourth aspect, an electronic device is provided, characterized in that it comprises:

[0032] One or more processors; and

[0033] A memory associated with the one or more processors, the memory being used to store program instructions that, when read and executed by the one or more processors, perform the steps of the method described in any one of the first aspects above.

[0034] According to the specific embodiments provided in this application, the following technical effects are disclosed:

[0035] 1) This application adds N hierarchical autoencoders between at least two hidden layers of the pre-trained language model, which can increase the diversity of implicit representations during the fine-tuning process and reduce noise in the implicit representations by using hierarchical autoencoders. This allows the downstream prediction model to reduce the weight of noise during learning, alleviate overfitting, and improve the robustness and prediction accuracy of the prediction model obtained by fine-tuning.

[0036] 2) Before fine-tuning the augmentation model in this application, the augmentation model is first pre-trained by updating only the parameters of the multi-view compressed representation module to minimize the difference between the input and output of the multi-view compressed representation module. This pre-training can retain the existing knowledge of the pre-trained language model to the greatest extent and reduce the impact of the multi-view compressed representation module on the existing knowledge of the original pre-trained language model, thereby ensuring the prediction performance of the final prediction model.

[0037] 3) The N hierarchical autoencoders added between at least two hidden layers of the pre-trained language model in this application employ different compression dimensions, thereby further improving the diversity of implicit representations and mitigating overfitting.

[0038] 4) In the pre-training and fine-tuning training processes, this application randomly selects or does not select a hierarchical autoencoder in the multi-view compressed representation module for processing, thereby further improving the diversity of implicit representations and reducing overfitting.

[0039] 5) The hierarchical autoencoder in this application includes M intra-layer autoencoders, and one or no intra-layer autoencoder is randomly selected for processing during pre-training and fine-tuning training, thereby further improving the diversity of implicit representation and the degree of information retention, and reducing overfitting.

[0040] 6) Since the bottom-level modules primarily function as data generators, inserting a multi-view compressed representation module between the bottom hidden layer and its adjacent hidden layer in this application can add more diversity to the model. Furthermore, inserting a multi-view compressed representation module between the top hidden layer and its adjacent hidden layer can more effectively mitigate overfitting of the model on downstream tasks.

[0041] 7) This application can be applied to the identification of relation types of text sequence pairs, the identification of sentiment types of text sequences, the identification of named entities of text sequences, and the identification of part-of-speech tags of at least one word in a text sequence. The prediction model obtained by the fine-tuning method based on the pre-trained language model provided in this application can improve the prediction effect of the above identification.

[0042] Of course, any product implementing this application does not necessarily need to achieve all of the advantages described above at the same time.

Attached Figure Description

[0043] To more clearly illustrate the technical solutions in the embodiments of this application or the prior art, the drawings used in the embodiments will be briefly introduced below. Obviously, the drawings described below are only some embodiments of this application. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0044] Figure 1 is an exemplary system architecture diagram that can be applied to embodiments of this application;

[0045] Figure 2 is a flowchart of a fine-tuning method for a pre-trained language model provided in an embodiment of this application;

[0046] Figure 3 is a schematic block diagram of the augmented model provided in the embodiments of this application;

[0047] Figure 4 is a schematic block diagram of a hierarchical autoencoder provided in the embodiments of this application;

[0048] Figure 5 is a schematic block diagram of the target model provided in the embodiments of this application;

[0049] Figure 6 is a flowchart of a fine-tuning device for a pre-trained language model provided in an embodiment of this application;

[0050] Figure 7 is a schematic block diagram of an electronic device provided in an embodiment of this application.

Detailed Implementation

[0051] The technical solutions of the embodiments of this application will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of this application, and not all embodiments. All other embodiments obtained by those skilled in the art based on the embodiments of this application are within the scope of protection of this application.

[0052] The terminology used in the embodiments of this application is for the purpose of describing particular embodiments only and is not intended to limit the application. The singular forms "a," "said," and "the" as used in the embodiments of this application and the appended claims are also intended to include the plural forms unless the context clearly indicates otherwise.

[0053] It should be understood that the term "and / or" used in this article is merely a description of the relationship between related objects, indicating that three relationships can exist. For example, A and / or B can represent: A existing alone, A and B existing simultaneously, and B existing alone. Additionally, the character " / " in this article generally indicates that the preceding and following related objects have an "or" relationship.

[0054] Depending on the context, the word "if" as used here can be interpreted as "when," "upon," "in response to determining," or "in response to detecting." Similarly, depending on the context, the phrase "if it is determined" or "if (the stated condition or event) is detected" can be interpreted as "when it is determined," "in response to determining," "when (the stated condition or event) is detected," or "in response to detecting (the stated condition or event)."

[0055] Fine-tuning pre-trained language models is an effective way to transfer knowledge from large-scale text corpora to downstream NLP (Natural Language Processing) tasks. Most language models are designed to be generalizable across multiple NLP tasks. Therefore, when using pre-trained language models for feature extraction in downstream tasks, the resulting implicit representations often contain a large number of features irrelevant to the downstream task, leading to overfitting. Overfitting refers to a situation where, with limited training data, the model's error rate on the training data gradually decreases, but the learned knowledge has poor generalization ability, resulting in an increased error rate on the test data.

[0056] The existing methods mainly include the following two:

[0057] One approach is Dropout, which randomly sets some values in the implicit representation to 0. This approach is simple to implement, but it adds noise indiscriminately to both task-related and irrelevant features.

[0058] Another approach is Mixout, which randomly replaces some model parameters with initial values during the fine-tuning of the pre-trained language model. However, this method requires saving the initial model parameters, increasing memory usage.
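As a rough illustration of the two existing methods, the sketch below applies Dropout to a toy implicit representation and Mixout to a toy parameter vector. The function names, the plain-list representation, and the probability p are assumptions made for illustration; the rescaling that practical implementations apply is omitted.

```python
import random


def dropout(hidden, p, rng):
    """Set each value of the implicit representation to 0 with probability p."""
    return [0.0 if rng.random() < p else v for v in hidden]


def mixout(params, initial_params, p, rng):
    """Replace each parameter with its initial (pre-trained) value with probability p.

    initial_params has to stay in memory for the whole fine-tuning run,
    which is the extra memory cost mentioned in the text.
    """
    return [w0 if rng.random() < p else w for w, w0 in zip(params, initial_params)]


rng = random.Random(0)
hidden = [0.5, -1.2, 0.3, 2.0]
dropped = dropout(hidden, p=0.5, rng=rng)

initial = [0.1, 0.2, 0.3, 0.4]   # saved copy of the pre-trained parameters
current = [0.9, 0.8, 0.7, 0.6]   # parameters after some fine-tuning updates
mixed = mixout(current, initial, p=0.5, rng=rng)
```

Dropout treats every value alike, whether or not it is relevant to the task, while Mixout needs the second parameter copy `initial`.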

[0059] In view of this, this application provides a novel approach that increases the diversity of training data in implicit representation and reduces noise in features by inserting dynamic random neural networks between the hidden layers of a pre-trained language model.

[0060] To facilitate understanding of this application, the system architecture to which this application applies is briefly described first. Figure 1 shows an exemplary system architecture that can be applied to embodiments of this application. As shown in Figure 1, the system architecture includes a model training device that performs model training offline and a prediction processing device that performs prediction online.

[0061] In this embodiment, after acquiring the training data, the model training device can use the fine-tuning method of the pre-trained language model provided in this application to process the data and obtain the prediction model.

[0062] After receiving a prediction request, the prediction processing device generates prediction results using the established prediction model.

[0063] The model training device and the prediction processing device can be configured as separate servers, or they can be located on the same server or server cluster, or on separate cloud servers or the same cloud server. A cloud server, also known as a cloud computing server or cloud host, is a host product within the cloud computing service system, designed to address the management difficulties and weak service scalability inherent in traditional physical hosts and Virtual Private Server (VPS) services. The model training device and the prediction processing device can also be located on computer terminals with strong computing capabilities.

[0064] It should be understood that the numbers of model training devices, prediction processing devices, pre-trained language models, and prediction models shown in Figure 1 are merely illustrative. Depending on implementation needs, any number of model training devices, prediction processing devices, pre-trained language models, and prediction models can be included.

[0065] Figure 2 is a flowchart of the fine-tuning method for the pre-trained language model provided in the embodiments of this application. As shown in Figure 2, the method may include the following steps:

[0066] Step 202: Obtain the pre-built augmented model, which is the model obtained by adding a multi-view compressed representation module between at least two hidden layers of the pre-trained language model. The multi-view compressed representation module includes N hierarchical autoencoders, where N is a positive integer.

[0067] Step 206: Fine-tune the augmented model using the training data to obtain the target model, which includes the augmented model and the downstream prediction model. During the fine-tuning process, update the parameters of the pre-trained language model, the multi-view compressed representation module, and the downstream prediction model. The training objective is to minimize the difference between the output of the downstream prediction model and the expected value.

[0068] Step 208: After the fine-tuning training is completed, remove the multi-view compressed representation module from the target model obtained by training to obtain the prediction model.

[0069] As can be seen from the above process, this application adds N hierarchical autoencoders between at least two hidden layers of the pre-trained language model. This increases the diversity of implicit representations during the fine-tuning process, while also reducing noise in the implicit representations by means of the hierarchical autoencoders. This enables the downstream prediction model to reduce the weight of noise during learning, alleviate overfitting, and improve the robustness of fine-tuning.

[0070] In addition, no model parameters are randomly replaced with their initial values during training, so the initial model parameters do not need to be saved, which reduces memory usage compared to the Mixout method.
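The flow of steps 202, 206 and 208 can be sketched structurally as follows. This is only a toy outline under stated assumptions: layers are represented by string names, "mvcr_module" is a hypothetical placeholder for the multi-view compressed representation module, and the fine-tuning itself is not shown.

```python
MVCR = "mvcr_module"  # hypothetical placeholder name for the inserted module


def build_augmented_model(hidden_layers, insert_after):
    """Step 202: insert the module after hidden layer number insert_after (1-based)."""
    layers = list(hidden_layers)
    layers.insert(insert_after, MVCR)
    return layers


def remove_mvcr(layers):
    """Step 208: drop the module, restoring the original layer sequence."""
    return [layer for layer in layers if layer != MVCR]


plm = [f"hidden_{i}" for i in range(1, 5)]               # toy model with S = 4
augmented = build_augmented_model(plm, insert_after=3)   # step 202
target = augmented + ["downstream_prediction_model"]     # result of step 206 (training omitted)
prediction_model = remove_mvcr(target)                   # step 208
```

In this sketch the final prediction model has the same structure as an ordinary fine-tuned model, since the module exists only during training.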

[0071] The following details each step in the above process. First, step 202, "obtaining the pre-built augmented model," will be described in detail with reference to an embodiment.

[0072] The augmented model involved in the embodiments of this application is built on a pre-trained language model. The pre-trained language model can be UniLM (Unified Language Model), GPT (Generative Pre-Training), BERT (Bidirectional Encoder Representations from Transformers), XLNet, etc.

[0073] Pre-trained language models are a core component of NLP. Their unsupervised training nature makes it easy to acquire massive amounts of training samples, and well-trained language models contain a wealth of semantic and syntactic knowledge, significantly improving performance on downstream tasks. However, pre-trained language models are prone to overfitting during fine-tuning in resource-constrained scenarios, where training data is limited. Analysis and research have revealed that inserting multi-view compressed representation modules between the hidden layers of a pre-trained language model can effectively augment data at the implicit representation level.

[0074] Pre-trained language models typically have a multi-layered structure. When data is input into the model, each layer processes the output of the previous layer and then outputs it to the next layer. The layers in a pre-trained language model are called hidden layers, and the information passed between hidden layers is usually in vector form, called implicit representation, also known as implicit vector, implicit vector representation, etc. Implicit representation is used in all embodiments of this application.

[0075] To enhance the diversity of implicit representations while reducing noise in features, a multi-view compressed representation module can be added between at least two hidden layers. Each multi-view compressed representation module includes N autoencoders, where N is a positive integer. Verification has shown that when the number of hierarchical autoencoders exceeds three, the improvement in model performance becomes less significant; therefore, a value of three for N is preferred.

[0076] To distinguish them from the autoencoders discussed later, these N autoencoders are referred to here as HAEs (Hierarchical Autoencoders). As shown in Figure 3, the pre-trained language model includes S hidden layers, denoted as Hidden Layer 1 to Hidden Layer S. A multi-view compressed representation module is added between the nth and (n+1)th hidden layers of the pre-trained language model. Each multi-view compressed representation module includes three HAEs. The I module in the multi-view compressed representation module represents the pass-through channel, and its function is described in subsequent embodiments.

[0077] As one possible approach, the aforementioned N HAEs can employ the same compression dimension.

[0078] However, to further enhance the diversity of implicit representations, the aforementioned N HAEs can employ different compression dimensions. Verification has shown that when N is 3, the optimal combination of compression dimensions is 128, 256, and 512.

[0079] An autoencoder is a special type of neural network that reduces noise in the input features that is irrelevant to the downstream task by compressing the input to a lower dimension and then restoring it to the same dimension as the input.
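A minimal sketch of the compress-then-restore behavior is given below. It assumes purely linear encoding and decoding without biases or activation functions, and the weights are random placeholders rather than trained values; the class and helper names are illustrative only.

```python
import random


def linear(x, weight):
    """Multiply vector x (length d_in) by weight (d_out rows of d_in values)."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


def make_weight(d_out, d_in, rng):
    """Randomly initialised weight matrix (placeholder values, not trained)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(d_in)] for _ in range(d_out)]


class AutoEncoder:
    def __init__(self, dim, compressed_dim, rng):
        self.encoder = make_weight(compressed_dim, dim, rng)   # dim -> compressed_dim
        self.decoder = make_weight(dim, compressed_dim, rng)   # compressed_dim -> dim

    def encode(self, x):
        return linear(x, self.encoder)

    def decode(self, z):
        return linear(z, self.decoder)

    def __call__(self, x):
        return self.decode(self.encode(x))


rng = random.Random(0)
ae = AutoEncoder(dim=8, compressed_dim=2, rng=rng)
x = [rng.uniform(-1, 1) for _ in range(8)]
z = ae.encode(x)   # compressed to 2 values
y = ae(x)          # restored to the input dimension of 8
```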

[0080] As one possible approach, the aforementioned hierarchical autoencoder can be a conventional autoencoder, namely an autoencoder consisting of an encoding module and a decoding module.

[0081] However, to further increase the diversity of implicit representations and the degree of information retention, this application provides a more preferred implementation. As shown in Figure 4, each hierarchical autoencoder can include an encoding module, an intra-layer encoding module, and a decoding module. The intra-layer encoding module can include M intra-layer autoencoders, where M is a positive integer. Figure 4 takes M=2 as an example. The I module in the intra-layer encoding module represents the pass-through channel, and its function is described in subsequent embodiments.

[0082] The M intra-layer autoencoders mentioned above can use the same compression dimension or different compression dimensions. As one possible approach, the compression dimension of the intra-layer autoencoder can be half of the compression dimension of its HAE.
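The dimension choices described above can be collected in a small configuration sketch. The hidden dimension of 768 and the class and field names are assumptions for illustration; only the values 128, 256, 512, M=2, and the halving rule come from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HAEConfig:
    hidden_dim: int       # dimension of the implicit representation (assumed: 768)
    compressed_dim: int   # compression dimension of this HAE
    num_intra: int        # M, the number of intra-layer autoencoders

    @property
    def intra_dim(self):
        # One option from the text: half of the HAE's compression dimension.
        return self.compressed_dim // 2


configs = [HAEConfig(hidden_dim=768, compressed_dim=d, num_intra=2) for d in (128, 256, 512)]
intra_dims = [c.intra_dim for c in configs]
```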

[0083] The aforementioned multi-view compressed representation module offers significant flexibility, as it can be inserted between any two hidden layers in a pre-trained language model. Verification has shown that inserting the multi-view compressed representation module between the top hidden layer and its adjacent hidden layer (i.e., the penultimate layer), or between the bottom hidden layer and its adjacent hidden layer (i.e., the second layer), achieves relatively good results. Since the bottom-level modules primarily function as data generators, inserting the module between the bottom hidden layer and its adjacent hidden layer adds greater diversity to the model. Furthermore, inserting the module between the top hidden layer and its adjacent hidden layer can more effectively mitigate overfitting on downstream tasks.
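The two insertion positions reported to work well can be expressed as pairs of 1-based hidden-layer indices. The helper below is a hypothetical illustration; its name and flags are not part of this application.

```python
def insertion_points(num_layers, top=True, bottom=True):
    """Return (n, n + 1) pairs of 1-based hidden-layer indices with a module between them."""
    if num_layers < 2:
        raise ValueError("need at least two hidden layers")
    points = set()
    if bottom:
        points.add((1, 2))                         # bottom layer and the second layer
    if top:
        points.add((num_layers - 1, num_layers))   # penultimate layer and top layer
    return sorted(points)                          # the set de-duplicates when num_layers == 2


points = insertion_points(12)   # e.g. a model with S = 12 hidden layers
```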

[0084] In the embodiment shown in Figure 2, fine-tuning training can be performed directly on the augmented model to obtain the target model. This can alleviate overfitting to some extent and improve the robustness of fine-tuning. To better ensure model performance, step 204 can further be included between steps 202 and 206. Step 204 is a preferred step, indicated by the dashed box in Figure 2.

[0085] Step 204: Pre-train the augmented model using the training data. During the pre-training process, only the parameters of the multi-view compressed representation module are updated. The training objective is to minimize the difference between the input and output of the multi-view compressed representation module.

[0086] The pre-training of the augmented model in this step is in effect pre-training of the multi-view compressed representation module in the augmented model. The purpose is to retain the existing knowledge of the pre-trained language model to the greatest extent possible and to prevent the added multi-view compressed representation module from unduly affecting the implicit representations of the pre-trained language model, thereby ensuring the prediction performance of the final prediction model obtained after subsequent fine-tuning.
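The difference between the two training stages comes down to which parameters are updated. The sketch below selects parameter names per stage; the "mvcr." prefix and the example names are a hypothetical naming convention, not something defined in this application.

```python
def trainable_names(param_names, stage):
    """Select which parameters are updated in each training stage.

    Assumes (hypothetically) that module parameters are named with an "mvcr." prefix.
    """
    if stage == "pretrain":
        # Step 204: only the multi-view compressed representation module is updated.
        return [n for n in param_names if n.startswith("mvcr.")]
    if stage == "finetune":
        # Step 206: language model, module and downstream prediction model are all updated.
        return list(param_names)
    raise ValueError(f"unknown stage: {stage}")


names = [
    "plm.layer1.weight",
    "plm.layer2.weight",
    "mvcr.hae1.encoder.weight",
    "head.weight",
]
pretrain_params = trainable_names(names, "pretrain")
finetune_params = trainable_names(names, "finetune")
```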

[0087] The training data involved is determined by the downstream prediction task. Below are some scenarios for downstream prediction tasks:

[0088] The first type: The training data includes a large number of text sequence pairs and annotations on the relation types of the text sequence pairs. These annotations are the expected values used for subsequent fine-tuning training.

[0089] These numerous text sequence pairs can be taken from datasets such as SNLI (Stanford Natural Language Inference) and MNLI (Multi-Genre Natural Language Inference). The relation types of text sequence pairs can include entailment, contradiction, and neutral relations.

[0090] The prediction model trained using this training data can predict the relation type of input text sequence pairs.

[0091] The second type: The training data includes a large number of text sequences and annotations on the sentiment types of the text sequences. These annotations are the expected values used for subsequent fine-tuning training.

[0092] These large amounts of text sequences can be taken from datasets such as IMDB (Internet Movie Database) and Yelp, and are mostly composed of sentence content. The IMDB dataset contains 50,000 highly polarized reviews from the Internet Movie Database, reflecting users' specific sentiment types. The Yelp dataset consists of approximately 160,000 merchants, 8.63 million reviews, and 200,000 images from eight major metropolitan areas. The 8.63 million reviews also reflect users' specific sentiment types, such as like, dislike, and neutral.

[0093] The prediction model trained using this training data can predict the sentiment type corresponding to the input text sequence.

[0094] The third type: The training data includes a large number of text sequences and annotations of named entities in the text sequences, which serve as expected values for subsequent fine-tuning training.

[0095] These large numbers of text sequences can be taken from datasets such as WikiAnn. WikiAnn is a multilingual named entity recognition dataset consisting of Wikipedia articles labeled with tags such as location, person, and organization.

[0096] The prediction model trained using this training data is able to predict named entities in the input text sequence.

[0097] The fourth type: The training data includes a large number of text sequences and part-of-speech tags for at least one word in the text sequence.

[0098] These large numbers of text sequences can be taken from datasets such as Universal Dependencies v2.5. The Universal Dependencies v2.5 dataset contains annotations for texts in multiple languages, including part-of-speech tagging, morphological features, and syntactic features. This application uses the part-of-speech tagging from this dataset.

[0099] The prediction model trained using this training data can predict the part-of-speech of target words in the input text sequence.
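For orientation, the four kinds of training data could be shaped roughly as follows. All sentences and labels are invented for illustration and are not taken from the datasets named above; the dictionary layout and the tag spellings are assumptions.

```python
# Invented examples; "input" is fed to the augmented model, "expected" is the expected value.
examples = {
    "relation": {
        "input": ("A man is sleeping.", "A man is playing a guitar."),
        "expected": "contradiction",
    },
    "sentiment": {
        "input": "The film was a delight from start to finish.",
        "expected": "like",
    },
    "named_entity": {
        "input": ["Alice", "visited", "Paris"],
        "expected": ["PER", "O", "LOC"],     # one tag per token
    },
    "part_of_speech": {
        "input": ["Dogs", "bark"],
        "expected": ["NOUN", "VERB"],        # one tag per token
    },
}
```

The first two tasks have one expected value per input, while the last two have one per token.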

[0100] During pre-training and fine-tuning training, the input sequence is constructed from the text sequence pairs or text sequences in the training data and then fed into the augmented model.

[0101] Assuming the hidden layer preceding the multi-view compressed representation module is the nth hidden layer, the implicit representation output by the nth hidden layer is fed to the multi-view compressed representation module. The module then randomly selects one of the N HAEs or randomly selects none. If one HAE is selected, the selected HAE processes the input implicit representation and outputs the processed implicit representation to the (n+1)th hidden layer. If no HAE is selected, the implicit representation output by the nth hidden layer follows the pass-through channel represented by the I module shown in Figure 3, that is, it is input directly to the (n+1)th hidden layer. This randomness effectively increases the diversity of implicit representations.
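The random routing through the module can be sketched as below. Choosing uniformly among the N HAEs and the pass-through channel is an assumption, since the text only states that the choice is random; the HAEs here are simple scaling functions standing in for real encoder and decoder networks.

```python
import random


def make_scaler(scale):
    """Stand-in for an HAE: scales the vector instead of compressing and restoring it."""
    def scaler(hidden):
        return [scale * v for v in hidden]
    return scaler


def mvcr_forward(hidden, haes, rng):
    """Route hidden through one of the N HAEs or through the pass-through channel I."""
    choice = rng.randrange(len(haes) + 1)   # uniform over N + 1 options (assumption)
    if choice == len(haes):
        return hidden, "I"                  # pass-through: output equals input
    return haes[choice](hidden), f"HAE{choice + 1}"


haes = [make_scaler(s) for s in (0.5, 0.8, 0.9)]   # N = 3
rng = random.Random(0)
hidden = [1.0, 2.0, 3.0]
outputs = [mvcr_forward(hidden, haes, rng) for _ in range(200)]
chosen = {name for _, name in outputs}
```

Over many training steps the (n+1)th hidden layer therefore sees several differently processed views of the same input.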

[0102] If the HAE adopts the structure shown in Figure 4, its intra-layer encoding module includes M intra-layer autoencoders. For the hierarchical autoencoder that receives the input implicit representation (i.e., the randomly selected hierarchical autoencoder), the encoding module encodes the implicit representation and outputs the result to the intra-layer encoding module. The intra-layer encoding module randomly selects one of the M intra-layer autoencoders, or randomly selects none. If one intra-layer autoencoder is selected, it processes the input implicit representation and outputs the processed implicit representation to the decoding module. If none is selected, the implicit representation output by the encoding module follows the path of the I module shown in Figure 4, which represents a pass-through channel that inputs the implicit representation output by the encoding module directly into the decoding module.
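The encode, intra-layer, decode flow of a single HAE can be sketched as follows. This is an illustration under stated assumptions, not the structure of Figure 4 itself: the layers are plain linear maps without activations, each intra-layer autoencoder is assumed to compress to a smaller `inner_dim` and restore the original width, and the class name `ToyHAE` and all dimensions are hypothetical.

```python
import random

def linear(weights, vec):
    """Multiply a (rows x cols) weight matrix by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weights]

def rand_matrix(rows, cols, rng):
    """Small random weights, standing in for random initialization."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

class ToyHAE:
    """Encoder -> (one of M intra-layer autoencoders, or pass-through) -> decoder."""

    def __init__(self, hidden_dim, compress_dim, inner_dim, m, rng):
        self.rng = rng
        self.enc = rand_matrix(compress_dim, hidden_dim, rng)
        self.dec = rand_matrix(hidden_dim, compress_dim, rng)
        # Assumption: each intra-layer autoencoder is a (down, up) pair.
        self.inner = [
            (rand_matrix(inner_dim, compress_dim, rng),
             rand_matrix(compress_dim, inner_dim, rng))
            for _ in range(m)
        ]

    def forward(self, hidden):
        z = linear(self.enc, hidden)
        # Uniform choice among M autoencoders plus the pass-through branch.
        choice = self.rng.randrange(len(self.inner) + 1)
        if choice < len(self.inner):
            down, up = self.inner[choice]
            z = linear(up, linear(down, z))
        return linear(self.dec, z)

hae = ToyHAE(hidden_dim=8, compress_dim=4, inner_dim=2, m=2, rng=random.Random(0))
print(len(hae.forward([1.0] * 8)))
```

Whichever branch is taken, the output has the same width as the input, so the HAE can sit between two hidden layers without changing their interfaces.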

[0103] During pre-training, the parameters of the multi-view compressed representation module can be randomly initialized. During pre-training of the augmented model, the parameters of the pre-trained language model are kept unchanged; only the parameters of the multi-view compressed representation module are updated. This is because the purpose of pre-training is to ensure that adding the multi-view compressed representation module does not unduly affect the implicit representations of the original pre-trained language model, i.e., the difference between the input and output of the multi-view compressed representation module should be minimized. Therefore, a reconstruction loss function $L_{MSE}$ can be used to optimize the parameters of the multi-view compressed representation module. It can be expressed as follows:

[0104] $$L_{MSE} = \frac{1}{M \cdot L} \sum_{m=1}^{M} \sum_{i=1}^{L} \left\lVert \mathrm{HAE}_m(h_i^n) - h_i^n \right\rVert_2^2$$

[0105] where $h_i^n$ is the implicit representation output by the nth hidden layer for the i-th token (element) in the input sequence of the augmented model, $\mathrm{HAE}_m(\cdot)$ is the processing function of the m-th HAE in the multi-view compressed representation module, M is the number of HAEs in the multi-view compressed representation module, and L is the length of the input sequence of the augmented model, i.e., the number of tokens it contains.
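As a rough illustration of the reconstruction loss, the sketch below assumes it is the squared difference between each HAE's output and its input, averaged over the HAEs and the tokens of the sequence. The function name `reconstruction_loss` and the toy HAEs are hypothetical.

```python
def reconstruction_loss(hidden_states, haes):
    """Average, over all HAEs and all tokens, of the squared L2 distance
    between an HAE's output and its input representation."""
    total = 0.0
    for hae in haes:
        for h in hidden_states:
            out = hae(h)
            total += sum((o - x) ** 2 for o, x in zip(out, h))
    return total / (len(haes) * len(hidden_states))

# Two toy "HAEs": a perfect reconstruction and one that doubles its input.
identity = lambda h: list(h)
doubling = lambda h: [2.0 * x for x in h]

tokens = [[1.0, 2.0], [0.0, 0.0]]  # sequence length 2, hidden size 2
print(reconstruction_loss(tokens, [identity]))
print(reconstruction_loss(tokens, [identity, doubling]))
```

A module that reproduces its input exactly gives a loss of zero, which is the state pre-training pushes toward.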

[0106] In each iteration of pre-training, the parameters of the multi-view compressed representation module in the augmentation model are updated using methods such as gradient descent, based on the value of the aforementioned reconstruction loss function, until a preset training termination condition is met. This termination condition may include, for example, the loss function value being less than or equal to a preset loss function threshold, or the number of iterations reaching a preset threshold.
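The iteration and termination logic can be sketched on a deliberately tiny problem. Here the whole module is assumed to be a single scalar weight `w` applied to each input, so the reconstruction loss has a closed-form gradient; the function name `pretrain` and the default values for `lr`, `loss_threshold`, and `max_iters` are illustrative assumptions.

```python
def pretrain(xs, w=0.0, lr=0.1, loss_threshold=1e-6, max_iters=1000):
    """Gradient descent on loss(w) = mean((w * x - x) ** 2).

    Stops when the loss falls to `loss_threshold` or below, or after
    `max_iters` iterations, whichever comes first.
    Returns (w, loss, iterations_used).
    """
    mean_sq = sum(x * x for x in xs) / len(xs)
    for step in range(1, max_iters + 1):
        loss = (w - 1.0) ** 2 * mean_sq
        if loss <= loss_threshold:
            return w, loss, step
        grad = 2.0 * (w - 1.0) * mean_sq  # d(loss)/dw
        w -= lr * grad
    return w, (w - 1.0) ** 2 * mean_sq, max_iters

print(pretrain([1.0, 2.0]))               # stops on the loss threshold
print(pretrain([1.0, 2.0], max_iters=3))  # stops on the iteration limit
```

The two calls exercise the two termination conditions named above: the first returns once the loss is small enough, the second returns because the iteration budget is exhausted.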

[0107] The following describes step 206, namely "fine-tuning the target model using training data", in detail with reference to the embodiments.

[0108] After pre-training, the resulting augmented model can be used to construct the target model. Depending on the specific downstream prediction task, a downstream prediction model is added to the augmented model to obtain the target model, as shown in Figure 5.

[0109] Fine-tuning is a common training method in deep learning. It involves applying a pre-trained language model to a downstream task (in this embodiment, the downstream task is the prediction task corresponding to the downstream prediction model) and then adjusting the parameters of the pre-trained language model with a small learning rate to optimize its performance on the downstream task. This process is called fine-tuning.

[0110] During fine-tuning, an input sequence is constructed from the text sequence pairs or text sequences in the training data. This input sequence is fed into the augmented model, which outputs an implicit representation of the input sequence to the downstream prediction model. The downstream prediction model uses this implicit representation to obtain the prediction result, and can be a regression or classification model depending on the specific training task. The training objective of fine-tuning is to minimize the difference between the prediction result of the downstream prediction model and the expected value. For predicting the relation type of text sequence pairs, the goal is to minimize the difference between the predicted result and the relation type labeled in the training data. For predicting the sentiment type of a text sequence, the goal is to minimize the difference between the predicted result and the sentiment type labeled in the training data. For predicting named entities in a text sequence, the goal is to minimize the difference between the predicted result and the named entity labels in the training data. For predicting the part-of-speech tag of at least one word in a text sequence, the goal is to minimize the difference between the predicted result and the part-of-speech tags in the training data.

[0111] In the embodiments of this specification, a loss function can be constructed based on the above-mentioned training objective. In each iteration, the model parameters are updated using the value of the loss function and methods such as gradient descent, until a preset training termination condition is met. The training termination condition may include, for example, the value of the loss function being less than or equal to a preset loss function threshold, or the number of iterations reaching a preset threshold.

[0112] During fine-tuning training, all model parameters are updated, including those of the language model, the multi-view compressed representation module, and the downstream prediction model.
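The difference between the two training phases comes down to which parameter groups receive updates. The sketch below makes that explicit with plain lists of numbers; the group names, the `sgd_step` helper, and the learning rate are hypothetical. In practice the downstream prediction model is only attached after pre-training, so it appears in the pre-training call here only to show that it would be left untouched.

```python
PARAM_GROUPS = ("language_model", "multi_view_module", "downstream_model")

def trainable_groups(phase):
    """Parameter groups that receive gradient updates in each phase."""
    if phase == "pretrain":
        return {"multi_view_module"}   # language model stays frozen
    if phase == "finetune":
        return set(PARAM_GROUPS)       # everything is updated
    raise ValueError(f"unknown phase: {phase}")

def sgd_step(params, grads, phase, lr=0.1):
    """Apply one gradient-descent step to the trainable groups only."""
    active = trainable_groups(phase)
    return {
        name: [p - lr * g for p, g in zip(values, grads[name])]
        if name in active else list(values)
        for name, values in params.items()
    }

params = {name: [1.0, 1.0] for name in PARAM_GROUPS}
grads = {name: [1.0, -1.0] for name in PARAM_GROUPS}
print(sgd_step(params, grads, "pretrain"))
print(sgd_step(params, grads, "finetune"))
```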

[0113] As in the pre-training process, assume the hidden layer immediately preceding the multi-view compressed representation module is the nth hidden layer. After the implicit representation output by the nth hidden layer is fed to the multi-view compressed representation module, the module randomly selects one of the N hierarchical autoencoders (HAEs), or randomly selects none. If one HAE is selected, the selected HAE processes the input implicit representation and outputs the processed implicit representation to the (n+1)th hidden layer. If no HAE is selected, the implicit representation output by the nth hidden layer follows the path of the I module shown in Figure 3, which represents a pass-through: the implicit representation output by the nth hidden layer is input directly to the (n+1)th hidden layer. This randomness effectively increases the diversity of implicit representations.

[0114] If the HAE adopts the structure shown in Figure 4, its intra-layer encoding module includes M intra-layer autoencoders. For the hierarchical autoencoder that receives the input implicit representation (i.e., the randomly selected hierarchical autoencoder), the encoding module encodes the implicit representation and outputs the result to the intra-layer encoding module. The intra-layer encoding module randomly selects one of the M intra-layer autoencoders, or randomly selects none. If one intra-layer autoencoder is selected, it processes the input implicit representation and outputs the processed implicit representation to the decoding module. If none is selected, the implicit representation output by the encoding module follows the path of the I module shown in Figure 4, which represents a pass-through channel that inputs the implicit representation output by the encoding module directly into the decoding module.

[0115] By adding an autoencoder, noise in the implicit representation is reduced, enabling the downstream prediction module to reduce the weight of noise during the learning process of fine-tuning training, thereby reducing overfitting.

[0116] The following describes step 208, "After fine-tuning the training, the multi-view compressed representation module is removed from the target model obtained from the training to obtain the prediction model," in conjunction with an embodiment.

[0117] After the multi-view compressed representation module is removed, the fine-tuned pre-trained language model and the downstream prediction model constitute the prediction model. This prediction model can output a prediction result for an input sequence. For example, for an input sequence built from a text sequence pair, it predicts the relation type of the pair; for an input sequence built from a text sequence, it predicts the sentiment type of the text sequence, the named entities in the text sequence, or the part-of-speech tag of at least one word in the text sequence, depending on the task for which it was fine-tuned.
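Removing the module after fine-tuning can be pictured as dropping one stage from a pipeline. In this sketch the target model is assumed to be an ordered list of named stages, each a simple function on a number; the stage names and the helpers `strip_multi_view` and `run` are hypothetical.

```python
def strip_multi_view(target_layers):
    """Return the prediction model: every stage except the multi-view module."""
    return [(name, fn) for name, fn in target_layers if name != "multi_view"]

def run(layers, x):
    """Apply the stages in order."""
    for _, fn in layers:
        x = fn(x)
    return x

# Toy target model after fine-tuning.
target_model = [
    ("hidden_n", lambda x: x + 1.0),
    ("multi_view", lambda x: 0.9 * x),
    ("hidden_n_plus_1", lambda x: 2.0 * x),
    ("downstream", lambda x: x - 3.0),
]

prediction_model = strip_multi_view(target_model)
print([name for name, _ in prediction_model])
print(run(prediction_model, 1.0))
```

The nth hidden layer now feeds the (n+1)th hidden layer directly, so the prediction model has the same size and inference cost as the original language model plus the downstream prediction model.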

[0118] The foregoing has described specific embodiments of this specification. Other embodiments are within the scope of the appended claims. In some cases, the actions or steps recited in the claims may be performed in a different order than that shown in the embodiments and may still achieve the desired result. Furthermore, the processes depicted in the drawings do not necessarily require the specific or sequential order shown to achieve the desired result. In some embodiments, multitasking and parallel processing are possible or may be advantageous.

[0119] According to another embodiment, a fine-tuning device for a pre-trained language model is provided. Figure 6 shows a schematic block diagram of a fine-tuning device for a pre-trained language model according to one embodiment; the device is disposed in the model training device of the architecture shown in Figure 1. As shown in Figure 6, the device 600 includes an acquisition unit 601 and a fine-tuning unit 603, and may further include a pre-training unit 602. The main functions of each unit are as follows:

[0120] The acquisition unit 601 is configured to acquire a pre-built augmented model, which is a model obtained by adding a multi-view compressed representation module between at least two hidden layers of a pre-trained language model. The multi-view compressed representation module includes N hierarchical autoencoders, where N is a positive integer.

[0121] Fine-tuning unit 603 is configured to fine-tune the augmentation model using training data to obtain the target model. The target model includes the pre-trained augmentation model and the downstream prediction model. During fine-tuning training, the parameters of the pre-trained language model, the multi-view compressed representation module, and the downstream prediction model are updated. The training objective is to minimize the difference between the output of the downstream prediction module and the expected value. After fine-tuning training, the multi-view compressed representation module is removed from the trained target model to obtain the prediction model.

[0122] As one possible approach, the pre-training unit 602 is configured to pre-train the augmentation model using training data, updating only the parameters of the multi-view compressed representation module during the pre-training process, with the training objective being to minimize the difference between the input and output of the multi-view compressed representation module.

[0123] Accordingly, the fine-tuning unit 603 is specifically configured to fine-tune the pre-trained augmented model using training data to obtain the target model.

[0124] As a preferred approach, the N hierarchical autoencoders employ different compression dimensions.

[0125] As one possible approach, during pre-training and fine-tuning, the hidden layer preceding the multi-view compressed representation module outputs its implicit representation to the multi-view compressed representation module, and the module randomly selects one of the N hierarchical autoencoders, or randomly selects none, to output the implicit representation to the next hidden layer.

[0126] As one of the preferred methods, the hierarchical autoencoder includes an encoding module, an intra-layer encoding module, and a decoding module. The intra-layer encoding module includes M intra-layer autoencoders, where M is a positive integer.

[0127] The encoding module in the hierarchical autoencoder that receives the implicit expression outputs the implicit expression to the intra-layer encoding module. The intra-layer encoding module randomly selects one of the M intra-layer autoencoders or does not select any intra-layer autoencoder to output the implicit expression to the decoding module.

[0128] As a preferred approach, adding a multi-view compressed representation module between at least two hidden layers of the pre-trained language model includes:

[0129] adding a multi-view compressed representation module between the top hidden layer of the pre-trained language model and its adjacent hidden layer; and/or,

[0130] adding a multi-view compressed representation module between the bottom hidden layer of the pre-trained language model and its adjacent hidden layer.

[0131] As a preferred approach, N is 3, and the compression dimensions of the three hierarchical autoencoders are 128, 256, and 512, respectively.

[0132] Depending on the application scenario, the following situations may occur, including but not limited to:

[0133] The training data consists of text sequence pairs, and the expected value is the relation type of the text sequence pairs; or,

[0134] The training data consists of text sequences, and the expected value is the sentiment type of the text sequence; or,

[0135] The training data consists of text sequences, and the expected value is the named entities in the text sequence; or,

[0136] The training data consists of text sequences, and the expected value is the part-of-speech tag of at least one word in the text sequence.

[0137] In addition, embodiments of this application also provide a computer-readable storage medium storing a computer program thereon, which, when executed by a processor, implements the steps of the method described in any of the foregoing method embodiments.

[0138] Embodiments of this application also provide an electronic device, comprising:

[0139] one or more processors; and

[0140] a memory associated with the one or more processors, the memory being used to store program instructions that, when read and executed by the one or more processors, perform the steps of the method described in any of the foregoing method embodiments.

[0141] This application also provides a computer program product, including a computer program that, when executed by a processor, implements the steps of the method described in any of the foregoing method embodiments.

[0142] Figure 7 shows an exemplary architecture of an electronic device, which may include a processor 710, a video display adapter 711, a disk drive 712, an input/output interface 713, a network interface 714, and a memory 720. The processor 710, video display adapter 711, disk drive 712, input/output interface 713, network interface 714, and memory 720 can communicate with each other via a communication bus 730.

[0143] The processor 710 can be implemented using a general-purpose CPU, microprocessor, application-specific integrated circuit (ASIC), or one or more integrated circuits to execute relevant programs in order to implement the technical solution provided in this application.

[0144] The memory 720 can be implemented in the form of ROM (Read Only Memory), RAM (Random Access Memory), static storage device, dynamic storage device, etc. The memory 720 can store the operating system 721 for controlling the operation of the electronic device 700, and the basic input / output system (BIOS) 722 for controlling the low-level operations of the electronic device 700. Additionally, it can store a web browser 723, a data storage management system 724, and a fine-tuning device 725 for a pre-trained language model, etc. The aforementioned fine-tuning device 725 for the pre-trained language model can be the application program that specifically implements the aforementioned steps in this embodiment. In summary, when the technical solution provided in this application is implemented through software or firmware, the relevant program code is stored in the memory 720 and is called and executed by the processor 710.

[0145] Input / output interface 713 is used to connect input / output modules to realize information input and output. Input / output modules can be configured as components in the device (not shown in the figure) or externally connected to the device to provide corresponding functions. Input devices may include keyboards, mice, touch screens, microphones, various sensors, etc., and output devices may include displays, speakers, vibrators, indicator lights, etc.

[0146] Network interface 714 is used to connect a communication module (not shown in the figure) to enable communication between this device and other devices. The communication module can communicate via wired means (such as USB, Ethernet cable, etc.) or wireless means (such as mobile network, WIFI, Bluetooth, etc.).

[0147] Bus 730 includes a pathway for transmitting information between various components of the device, such as processor 710, video display adapter 711, disk drive 712, input / output interface 713, network interface 714, and memory 720.

[0148] It should be noted that although the above-described device only shows the processor 710, video display adapter 711, disk drive 712, input / output interface 713, network interface 714, memory 720, bus 730, etc., in specific implementations, the device may also include other components necessary for normal operation. Furthermore, those skilled in the art will understand that the above-described device may only include the components necessary for implementing the solution of this application, and does not necessarily include all the components shown in the figures.

[0149] As can be seen from the above description of the embodiments, those skilled in the art can clearly understand that this application can be implemented by means of software plus necessary general-purpose hardware platforms. Based on this understanding, the technical solution of this application, in essence, or the part that contributes to the prior art, can be embodied in the form of a computer program product. This computer program product can be stored in a storage medium, such as ROM / RAM, magnetic disk, optical disk, etc., and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute the methods described in various embodiments or some parts of the embodiments of this application.

[0150] The various embodiments in this specification are described in a progressive manner. Similar or identical parts of the embodiments can be referred to mutually, and each embodiment focuses on its differences from the others. In particular, because the apparatus and system embodiments are basically similar to the method embodiments, their description is relatively brief, and the relevant parts can be found in the descriptions of the method embodiments. The apparatus and system embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate. The components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the modules can be selected according to actual needs to achieve the purpose of this embodiment. Those skilled in the art can understand and implement this without creative effort.

[0151] The technical solutions provided in this application have been described in detail above. Specific examples have been used to illustrate the principles and implementation methods of this application. The descriptions of the above embodiments are only for the purpose of helping to understand the methods and core ideas of this application. Furthermore, those skilled in the art will recognize that, based on the ideas of this application, there will be changes in the specific implementation methods and application scope. Therefore, the content of this specification should not be construed as a limitation of this application.

Claims

1. A method for fine-tuning a pre-trained language model, characterized in that the method includes: obtaining a pre-built augmented model, wherein the augmented model is a model obtained by adding a multi-view compressed representation module between at least two hidden layers of a pre-trained language model, and the multi-view compressed representation module includes N hierarchical autoencoders, where N is a positive integer; fine-tuning the augmented model using training data to obtain a target model, wherein the target model includes the augmented model and a downstream prediction model, the parameters of the pre-trained language model, the multi-view compressed representation module, and the downstream prediction model are updated during fine-tuning, and the training objective is to minimize the difference between the output of the downstream prediction model and the expected value; and after fine-tuning is completed, removing the multi-view compressed representation module from the trained target model to obtain a prediction model.

2. The method according to claim 1, characterized in that the N hierarchical autoencoders employ different compression dimensions.

3. The method according to claim 1, characterized in that, before fine-tuning the augmented model using training data, the method further includes: pre-training the augmented model using the training data, wherein only the parameters of the multi-view compressed representation module are updated during pre-training, and the training objective is to minimize the difference between the input and output of the multi-view compressed representation module; and fine-tuning the augmented model using the training data includes: fine-tuning the pre-trained augmented model using the training data.

4. The method according to claim 3, characterized in that, during pre-training and fine-tuning, the hidden layer preceding the multi-view compressed representation module outputs its implicit representation to the multi-view compressed representation module, and the multi-view compressed representation module randomly selects one of the N hierarchical autoencoders, or randomly selects none, to output the implicit representation to the next hidden layer.

5. The method according to claim 1, 3, or 4, characterized in that the hierarchical autoencoder includes an encoding module, an intra-layer encoding module, and a decoding module; the intra-layer encoding module includes M intra-layer autoencoders, where M is a positive integer; the encoding module in the hierarchical autoencoder that receives the implicit representation outputs the implicit representation to the intra-layer encoding module; and the intra-layer encoding module randomly selects one of the M intra-layer autoencoders, or randomly selects none, to output the implicit representation to the decoding module.

6. The method according to claim 1, characterized in that adding a multi-view compressed representation module between at least two hidden layers of the pre-trained language model includes: adding a multi-view compressed representation module between the top hidden layer of the pre-trained language model and its adjacent hidden layer; and/or adding a multi-view compressed representation module between the bottom hidden layer of the pre-trained language model and its adjacent hidden layer.

7. The method according to claim 2, characterized in that N is 3, and the compression dimensions of the three hierarchical autoencoders are 128, 256, and 512, respectively.

8. The method according to any one of claims 1 to 4, 6 and 7, characterized in that the training data consists of text sequence pairs, and the expected value is the relation type of the text sequence pairs; or the training data consists of text sequences, and the expected value is the sentiment type of the text sequence; or the training data consists of text sequences, and the expected value is the named entities in the text sequence; or the training data consists of text sequences, and the expected value is the part-of-speech tag of at least one word in the text sequence.

9. A fine-tuning device for a pre-trained language model, characterized in that the device includes: an acquisition unit configured to acquire a pre-built augmented model, wherein the augmented model is a model obtained by adding a multi-view compressed representation module between at least two hidden layers of a pre-trained language model, and the multi-view compressed representation module includes N hierarchical autoencoders, where N is a positive integer; and a fine-tuning unit configured to fine-tune the augmented model using training data to obtain a target model, wherein the target model includes the augmented model and a downstream prediction model, the parameters of the pre-trained language model, the multi-view compressed representation module, and the downstream prediction model are updated during fine-tuning, and the training objective is to minimize the difference between the output of the downstream prediction model and the expected value; and, after fine-tuning is completed, to remove the multi-view compressed representation module from the trained target model to obtain a prediction model.

10. The device according to claim 9, characterized in that the device further includes: a pre-training unit configured to pre-train the augmented model using the training data, wherein only the parameters of the multi-view compressed representation module are updated during pre-training, and the training objective is to minimize the difference between the input and output of the multi-view compressed representation module; and the fine-tuning unit is specifically configured to fine-tune the pre-trained augmented model using the training data to obtain the target model.

11. A computer-readable storage medium having a computer program stored thereon, characterized in that, when the program is executed by a processor, it implements the steps of the method according to any one of claims 1 to 8.

12. An electronic device, characterized in that it includes: one or more processors; and a memory associated with the one or more processors, the memory being used to store program instructions that, when read and executed by the one or more processors, perform the steps of the method according to any one of claims 1 to 8.