BERT model training method and system based on multiplier alternating direction method

An alternating direction method and model training technology, applied in the field of BERT model training methods and systems based on the alternating direction method of multipliers, which can solve problems such as large memory consumption, with the effects of improving efficiency and accuracy, improving training efficiency, and reducing the amount of calculation

Pending Publication Date: 2022-07-29
NAT UNIV OF DEFENSE TECH
0 Cites 0 Cited by

AI-Extracted Technical Summary

Problems solved by technology

[0006] 1. Gradient vanishing and gradient explosion: Because the gradients in the backpropagation process are strongly interdependent, gradient values that are too large or too small affect the parameters learned during training and degrade the performance of the final model, giving rise to gradient vanishing and gradient explosion. This is especially true in application fields such as event extraction (or text classification): since the model must compute over the context information of the input sequence, the gradient of the target loss function is prone to these problems during the backward pass, which in turn makes BERT model training difficult, degrades performance, and leaves the model unable to handle long text encoding
[0007] 2. Limited GPU memory: the number of parameters of the BERT model depends on the scale of the matrix multiplications in the model, and the BERT model for event extraction has a large number of parameters, so it needs to consume a large a...

Abstract

The invention discloses a BERT model training method and system based on the alternating direction method of multipliers. The method comprises the steps: S1, sentences to be trained are taken from a training set, word vectors are extracted, and the word vectors are input into a BERT model; S2, when the BERT model trains an input word vector, the objective function is solved using the alternating direction method of multipliers: the objective function is determined, a constraint is added to the input word vector, and the determined objective function is converted into an augmented Lagrangian function; the variable parameters in the objective function and the output result of the BERT model are obtained by solving the augmented Lagrangian function; S3, the variable parameters solved in the objective function are updated until training is completed, and the final BERT model training result is obtained and output. The method has the advantages that gradient vanishing and explosion during training can be avoided, parallel implementation is easy, training efficiency is high, and training performance is good.

Application Domain

Technology Topic

Training performance · Machine learning +4


Examples

  • Experimental program(1)

Example Embodiment

[0057] The present invention will be further described below with reference to the accompanying drawings and specific preferred embodiments, but the protection scope of the present invention is not limited thereby.
[0058] As shown in Figure 1, the steps of the BERT model training method based on the multiplier alternating direction method in this embodiment include:
[0059] Step S1. Data input: take the sentence to be trained from the training set, extract the word vector, and input it into the BERT model;
[0060] Step S2. Multiplier alternating direction method solution: when the BERT model trains the input word vector, the multiplier alternating direction method is used to solve the objective function. The representation of the BERT model by the Encoder module of the Transformer model is used to determine the objective function, constraints are added to the input word vector, the augmented Lagrangian technique is used to transform the objective function into an augmented Lagrangian function, and the variable parameters in the objective function and the output of the BERT model are obtained by solving the augmented Lagrangian function;
[0061] Step S3. Parameter update: update the variable parameters solved in the objective function until training is completed, and obtain and output the final BERT model training result for event extraction.
[0062] The alternating direction method of multipliers (ADMM) is a computational framework for solving separable convex optimization problems with fast processing speed and good convergence; it is well suited to distributed convex optimization problems, particularly statistical learning problems. The method attaches the constraint Ax + By = C to the objective function F(x, y) and uses the augmented Lagrangian technique to transform the original objective function into an augmented Lagrangian function, so that solving the original objective function becomes solving the augmented Lagrangian function; in the solution process, the existing parameters are used to solve the variables x and y alternately. Unlike traditional deep learning training methods, the ADMM-based solution process does not compute the gradient of the objective function at any point; instead, an analytical solution of the objective function is obtained directly, so no gradient is used for backpropagation, which avoids the gradient vanishing and gradient explosion problems common in traditional deep learning. The method also offers robust training, easy parallel implementation, and fast convergence, which makes it especially suitable for event extraction tasks involving a large amount of data and a large number of parameters, effectively improving the efficiency and accuracy of event extraction.
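For reference, the following is a minimal sketch of the generic ADMM formulation and iteration described above, written in standard notation; the symbols f, g, A, B, c, ρ, and λ are generic placeholders rather than the specific quantities of this embodiment.

```latex
% Generic ADMM sketch (illustrative only, not the patent's exact formulation)
\begin{aligned}
&\min_{x,\,y}\; f(x) + g(y) \quad \text{s.t.}\; Ax + By = c\\
&\mathcal{L}_\rho(x, y, \lambda) = f(x) + g(y) + \lambda^\top (Ax + By - c) + \tfrac{\rho}{2}\,\lVert Ax + By - c \rVert_2^2\\
&x^{k+1} = \arg\min_x \mathcal{L}_\rho(x, y^k, \lambda^k), \qquad
 y^{k+1} = \arg\min_y \mathcal{L}_\rho(x^{k+1}, y, \lambda^k),\\
&\lambda^{k+1} = \lambda^k + \rho\,(A x^{k+1} + B y^{k+1} - c)
\end{aligned}
```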
[0063] This embodiment fully takes into account the above characteristics of event extraction and of the alternating direction method of multipliers. After word vectors are extracted from the sentences to be trained in the training set, the BERT model is trained by solving it with the alternating direction method of multipliers. In the solution process, the structural characteristics of the BERT model are exploited: the Encoder module of the Transformer model is used to represent the BERT model and determine the objective function, constraints are added to the input word vector, and solving the objective function is converted into solving an augmented Lagrangian function. On the one hand, this avoids the gradient vanishing and explosion problems that are unavoidable in traditional stochastic gradient algorithms while still training the BERT model; on the other hand, because the method is easy to parallelize, the parameters can be processed in parallel during training, which solves the problem that the BERT model cannot be trained on a single accelerator card when the number of parameters is too large and greatly improves training efficiency. In addition, the length of the input data does not affect the alternating direction method of multipliers, so the method also solves the performance problem of the BERT model on long input text in the event extraction task.
[0064] To apply the multiplier alternating direction method to the training of the BERT model, its mathematical expression must first be formalized for input data and corresponding labels (X, Y) according to the model form of the Transformer model, so as to facilitate the subsequent construction of the desired objective function. The BERT model generally contains L Encoder (encoding) modules of the Transformer model. The Transformer model can be decomposed into three parts: Scaled Dot-Product Attention, Multi-Head Attention (the multi-head attention mechanism), and the overall Transformer structure. The formalization of each part is as follows:
[0065] (1) Scaled Dot-Product Attention. As shown in Figure 2, the weight matrices Q, K, and V are input; Q and K are first multiplied and the result is scaled, the scaled result is passed through the softmax (normalized exponential) function, and the result is finally multiplied by the matrix V. Mathematically this can be expressed as:
[0066] Attention(Q, K, V) = softmax(QK^T / √d_k) V (1)
[0067] where d_k is the dimension of the input K.
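As an illustration of formula (1), the following is a minimal NumPy sketch of scaled dot-product attention; the function name and the assumed input shapes are illustrative and not part of the patent.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Sketch of Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V.

    Assumed shapes: Q (len_q, d_k), K (len_k, d_k), V (len_k, d_v).
    """
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # multiply Q and K, then scale
    scores -= scores.max(axis=-1, keepdims=True)     # numerical stability for softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over the keys
    return weights @ V                               # finally multiply by V
```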
[0068] (2) Multi-Head Attention (the multi-head attention mechanism). As shown in Figure 3, the three input weight matrices Q, K, and V are linearly transformed, and the results of the linear transformations are fed into h Scaled Dot-Product Attention blocks, each of which can be regarded as a head; the results of all heads are concatenated (concat) and linearly transformed. Mathematically this can be expressed as:
[0069] MultiHead(Q, K, V) = concat(head_1, ..., head_h) W^O (2)
[0070] head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V) (3)
[0071] Here Q W_i^Q, K W_i^K, and V W_i^V denote the products of the three input matrices Q, K, and V with the corresponding weights, W^O denotes the output weight of the multi-head attention mechanism, W denotes the trainable weight corresponding to the input, and concat denotes the concatenation operation.
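A minimal sketch of formulas (2)-(3), reusing the scaled_dot_product_attention sketch above; the per-head projection weights W_q, W_k, W_v and the output projection W_o are assumed to be supplied by the caller and are illustrative names only.

```python
import numpy as np

def multi_head_attention(Q, K, V, W_q, W_k, W_v, W_o):
    """Sketch of MultiHead(Q, K, V) = concat(head_1, ..., head_h) W^O.

    W_q, W_k, W_v: lists of h per-head projection matrices; W_o: output projection.
    """
    heads = []
    for Wq_i, Wk_i, Wv_i in zip(W_q, W_k, W_v):
        # head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V), formula (3)
        heads.append(scaled_dot_product_attention(Q @ Wq_i, K @ Wk_i, V @ Wv_i))
    return np.concatenate(heads, axis=-1) @ W_o      # concatenate and linearly transform, formula (2)
```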
[0072] Based on the above expressions, the complete Transformer model can be formalized mathematically. As shown in Figure 4, the Transformer model can be divided into two parts, the Encoder module and the Decoder module, each containing N encoders or decoders respectively. Suppose the input data is X_0; the nth Encoder module is then formulated as:
[0073] m_eo = MultiHead(X_{n-1}, X_{n-1}, X_{n-1})
[0074] l_eo = LayerNorm(X_{n-1} + m_eo)
[0075] f_eo = FeedForward(l_eo)
[0076] X_n = LayerNorm(l_eo + f_eo) (4)
[0077] where m_eo, f_eo, and l_eo denote the outputs of the multi-head attention module, the FeedForward module, and the regularization (LayerNorm) module, respectively.
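A sketch of one Encoder step in formula (4), reusing the multi_head_attention sketch above; LayerNorm and FeedForward are reduced to simple placeholder implementations for illustration.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # simplified per-feature layer normalization (no learned scale/shift)
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def feed_forward(x, W1, b1, W2, b2):
    # position-wise feed-forward: linear -> ReLU -> linear
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

def encoder_block(X_prev, attn_weights, ff_weights):
    # formula (4): self-attention, residual + LayerNorm, feed-forward, residual + LayerNorm
    m_eo = multi_head_attention(X_prev, X_prev, X_prev, *attn_weights)
    l_eo = layer_norm(X_prev + m_eo)
    f_eo = feed_forward(l_eo, *ff_weights)
    return layer_norm(l_eo + f_eo)                   # X_n
```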
[0078] In the nth Decoder module, its formulation is:
[0079] m_do = MultiHead(Y, Y, Y)
[0080] l_do1 = LayerNorm(Y + m_do)
[0081] l_do2 = LayerNorm(l_do1 + MultiHead(X_n, X_n, l_do1))
[0082] l_do3 = LayerNorm(l_do2 + MultiHead(l_do2, l_do2, Y))
[0083] o = LayerNorm(l_do3 + FeedForward(l_do3)) (5)
[0084] where X_n is the output of the nth Encoder module, LayerNorm denotes the regularization (layer normalization) operation, and FeedForward denotes the linear (feed-forward) transformation.
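For completeness, a sketch of one Decoder step that mirrors formula (5) as written (the BERT model itself uses only the Encoder side); it reuses the helper functions from the previous sketches, and the weight arguments are illustrative.

```python
def decoder_block(Y, X_n, self_attn_w, cross_attn_w, third_attn_w, ff_weights):
    # formula (5): decoder self-attention, attention over the encoder output X_n,
    # a further attention step, then feed-forward; each followed by residual + LayerNorm
    m_do = multi_head_attention(Y, Y, Y, *self_attn_w)
    l_do1 = layer_norm(Y + m_do)
    l_do2 = layer_norm(l_do1 + multi_head_attention(X_n, X_n, l_do1, *cross_attn_w))
    l_do3 = layer_norm(l_do2 + multi_head_attention(l_do2, l_do2, Y, *third_attn_w))
    return layer_norm(l_do3 + feed_forward(l_do3, *ff_weights))   # o
```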
[0085] The BERT model generally contains L Encoder modules of the Transformer model, as shown in Figure 5. After the output of the last Decoder module, a linear transformation and a softmax can be applied to obtain the output of the entire Transformer model. On the basis of the above mathematical formulation of the Transformer model, the BERT model is represented by the Encoder module, and the following mathematical formulation is obtained:
[0086] m_eo = MultiHead(X_{n-1}, X_{n-1}, X_{n-1})
[0087] l_eo = LayerNorm(X_{n-1} + m_eo)
[0088] f_eo = FeedForward(l_eo)
[0089] X_n = LayerNorm(l_eo + f_eo) (6)
[0090] where X_n is the output of the nth Encoder module, n is generally 1, 2, ..., 12, and X_0 denotes the output of the embedding layer. This embodiment does not consider the embedding layer in the mathematical representation of the BERT model.
[0091] After the N Encoder modules, the output X_N is passed through a linear classification layer, and the output of the BERT model is:
[0092] o = classifier(X_N) (7)
[0093] Among them, classifier represents the linear classification layer, and o represents the output result.
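Putting formulas (6) and (7) together, the following is a sketch of the full Encoder stack followed by the linear classification layer, reusing the encoder_block sketch above; the parameter lists and the classifier weights W_cls, b_cls are illustrative assumptions.

```python
def bert_forward(X_0, encoder_params, W_cls, b_cls):
    """Sketch of formulas (6)-(7): N stacked Encoder blocks, then o = classifier(X_N)."""
    X = X_0                                           # output of the embedding layer
    for attn_weights, ff_weights in encoder_params:   # one entry per Encoder module
        X = encoder_block(X, attn_weights, ff_weights)
    return X @ W_cls + b_cls                          # linear classification layer
```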
[0094] Since the BERT model needs to use different classification layers for different tasks, in order to simplify the calculation this embodiment takes the output X_N obtained after the N Encoder modules as the target output of the solution; that is, the target solved by the multiplier alternating direction method is X_N, which is the final computational target.
[0095] In the actual calculation process, since the LayerNorm operation does not lend itself to a direct mathematical solution, this embodiment applies the ReZero approximation to the LayerNorm operation, simplifying and approximating the complex regularization calculation in a simple form and thereby reducing the amount of calculation and the memory redundancy. Specifically, when the ReZero approximation is applied in this embodiment, the LayerNorm layers in the Transformer model are simplified and a residual parameter α is defined for each LayerNorm layer; the calculation formula of such a layer then becomes:
[0096] x_{i+1} = x_i + α_i F_i(x_i) (8)
[0097] where x_i and x_{i+1} denote the input and output of the i-th layer respectively, α_i denotes the residual parameter of the i-th layer, and F_i denotes the computation function of the i-th layer.
[0098] After this equivalent transformation, the heavy computation introduced by the original LayerNorm layer reduces to computing only the residual parameter α. The original computational complexity of this layer is O(n^2); after ReZero processing, the complexity of this layer can be regarded as reduced to O(1). By simplifying the LayerNorm calculation with ReZero, this embodiment greatly reduces both the amount of computation and the memory redundancy of the calculation.
[0099] After the ReZero processing in this embodiment, the mathematical expression finally obtained by representing the BERT model with the Encoder module is:
[0100] m_eo = MultiHead(X_{n-1}, X_{n-1}, X_{n-1})
[0101] l_eo = X_{n-1} + α_{n1} m_eo
[0102] f_eo = FeedForward(l_eo)
[0103] X_n = l_eo + α_{n2} f_eo (9)
[0104] where α_{n1} and α_{n2} denote the residual parameters corresponding to m_eo and f_eo, respectively.
[0105] After the above process, the Encoder module can be used to represent the BERT model and its mathematical expression can be obtained, as shown in equation (9), which facilitates the subsequent solution of the BERT model based on the multiplier alternating direction method.
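A sketch of formula (9), in which the LayerNorm operations of formula (6) are replaced by the ReZero residual parameters α_{n1} and α_{n2}; it reuses the helper functions above, and the zero initialization of the residual parameters follows the usual ReZero formulation and is an assumption here.

```python
def rezero_encoder_block(X_prev, attn_weights, ff_weights, alpha_n1=0.0, alpha_n2=0.0):
    # formula (9): LayerNorm replaced by scalar residual parameters (ReZero)
    m_eo = multi_head_attention(X_prev, X_prev, X_prev, *attn_weights)
    l_eo = X_prev + alpha_n1 * m_eo
    f_eo = feed_forward(l_eo, *ff_weights)
    return l_eo + alpha_n2 * f_eo                     # X_n
```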
[0106] The detailed steps of using the Encoder module in the Transformer model to represent the BERT model in step S2 of this embodiment specifically include:
[0107] S201. Use the Encoder module to represent the BERT model, and obtain the mathematical expression of the BERT model, as shown in formula (6);
[0108] S202. Perform the ReZero approximation on the LayerNorm operation: simplify the LayerNorm layers in the Transformer model and define a residual parameter α for each LayerNorm layer, obtaining the mathematical expression that uses the Encoder module to represent the BERT model, as shown in formula (9).
[0109] The target loss function L in this embodiment is specifically a cross-entropy function, which can be expressed as L(X, Y), where (X, Y) denotes the input data and the corresponding data labels. In step S2 of this embodiment, the objective function is determined according to the following formula:
[0110]
[0111] Here, for convenience of notation, Φ denotes the final objective function to be solved, L denotes the loss function, Ω_E and Ω_D are the regularization functions for E and D respectively, E and D denote the Encoder and Decoder modules of the Transformer model respectively, W_Ei, W_Di, and X_{i-1} denote the weights of E_i and D_i and the input data at the (i-1)th step respectively, FF denotes FeedForward, FF(l_eo) denotes the output of FeedForward with input l_eo, and v denotes a hyperparameter.
[0112] In this embodiment, the BERT model is represented by the Encoder module of the Transformer model and the objective function is constructed from the resulting mathematical expression; that is, formula (10) above is taken as the optimization target, yielding the mathematical form of the Transformer to be solved by the multiplier alternating direction method. In this way the multiplier alternating direction method can be applied to the training of the BERT model in event extraction and used to solve the BERT model, so that its advantages can be fully exploited to effectively improve the performance of BERT model training and to solve problems such as exploding gradients, limited memory, and performance degradation on long input texts.
[0113] In step S2 of the present embodiment, constraints are imposed on the mathematical expression that represents the BERT model via the Encoder module, so as to determine constraints of the form AX + BY = C between the input data (X, Y). The constraints are specifically:
[0114] m_eo = MultiHead(X_{n-1}, X_{n-1}, X_{n-1})
[0115] l_eo = X_{n-1} + α_{n1} m_eo
[0116] f_eo = FF(l_eo)
[0117] X_n = l_eo + α_{n2} f_eo (11)
[0118] In this embodiment, by constructing the above restriction conditions according to the specific calculation form of the BERT model, the combination of the BERT model and the multiplier alternating direction method can be realized, so that the BERT model can be solved by using the multiplier alternating direction method.
[0119] On the basis of the above formal expression, the multiplier alternating direction method is applied to the objective function, and the formalized augmented Lagrangian function is formed. In step S2 of this embodiment, the augmented Lagrangian function obtained after formalization is specifically:
[0120]
[0121] where λ_1, λ_2, λ_3, λ_4 are the Lagrange multipliers and ρ_1, ρ_2, ρ_3, ρ_4 are hyperparameters, respectively.
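The equation at [0120] is not reproduced in this extract; the following is only an illustrative sketch of the general shape such an augmented Lagrangian takes, with one multiplier term and one quadratic penalty per constraint of formula (11), and is not the patent's exact expression.

```latex
% Illustrative shape only -- one (lambda_i, rho_i) pair per constraint of formula (11)
\begin{aligned}
\mathcal{L}_\rho =\;& \Phi
 + \langle \lambda_1,\; m_{eo} - \mathrm{MultiHead}(X_{n-1}, X_{n-1}, X_{n-1}) \rangle
 + \tfrac{\rho_1}{2}\,\lVert m_{eo} - \mathrm{MultiHead}(X_{n-1}, X_{n-1}, X_{n-1}) \rVert_2^2\\
 &+ \langle \lambda_2,\; l_{eo} - X_{n-1} - \alpha_{n1} m_{eo} \rangle
 + \tfrac{\rho_2}{2}\,\lVert l_{eo} - X_{n-1} - \alpha_{n1} m_{eo} \rVert_2^2\\
 &+ \langle \lambda_3,\; f_{eo} - \mathrm{FF}(l_{eo}) \rangle
 + \tfrac{\rho_3}{2}\,\lVert f_{eo} - \mathrm{FF}(l_{eo}) \rVert_2^2\\
 &+ \langle \lambda_4,\; X_n - l_{eo} - \alpha_{n2} f_{eo} \rangle
 + \tfrac{\rho_4}{2}\,\lVert X_n - l_{eo} - \alpha_{n2} f_{eo} \rVert_2^2
\end{aligned}
```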
[0122] At this point, the original optimization of the objective function has been transformed into an optimization problem for the augmented Lagrangian function. Unlike traditional deep learning solution methods, solving the augmented Lagrangian function in this embodiment means determining the variables m_eo, X_{n-1}, α_{n1}, l_eo, f_eo, α_{n2}, and X_n in the function, thereby completing the solution of the objective function. The computed value of X_n is then input into the classification layer, and the result of the classification layer is compared with the data label Y to obtain the loss of the calculation.
[0123] In the multiplier alternating direction method, the parameter update order is very important, and an inappropriate update order may cause model training not to converge. According to the general principle of the multiplier alternating direction method, when solving the Transformer model the target parameters are solved in the order X_{n-1} → m_eo → α_{n1} → l_eo → f_eo → α_{n2} → X_n, and the Lagrange multipliers λ_1, λ_2, λ_3, λ_4 must also be updated. In order to accelerate convergence, step S3 of this embodiment updates the variable parameters solved in the objective function with a backward-forward update sequence: in the forward pass the weights are updated first and then the biases, while the update order of the backward pass is exactly the opposite (the biases first and then the weights). This not only ensures model performance but also accelerates convergence and reduces the training time of the model.
[0124] In this embodiment, the variable parameters to be solved in the objective function are updated in the order X_n → α_{n2} → f_eo → l_eo → α_{n1} → m_eo → X_{n-1} → X_{n-1} → m_eo → α_{n1} → l_eo → f_eo → α_{n2} → X_n, where X_n and X_{n-1} are the outputs of the nth and (n-1)th Encoder modules respectively, f_eo, m_eo, and l_eo denote the outputs of the FeedForward module, the multi-head attention module, and the regularization module respectively, and α_{n1}, α_{n2} denote the residual parameters corresponding to m_eo and f_eo respectively.
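A schematic sketch of the backward-forward update sweep described above; the update_fns sub-problem solvers and the update_multipliers routine are hypothetical placeholders, since the patent does not spell out their implementations.

```python
BACKWARD_ORDER = ["X_n", "alpha_n2", "f_eo", "l_eo", "alpha_n1", "m_eo", "X_n-1"]
FORWARD_ORDER = list(reversed(BACKWARD_ORDER))         # X_n-1 -> m_eo -> ... -> X_n

def admm_iteration(variables, update_fns, update_multipliers):
    """One backward-forward sweep over the variables, followed by the multiplier update."""
    for name in BACKWARD_ORDER + FORWARD_ORDER:
        variables[name] = update_fns[name](variables)  # solve the sub-problem for this variable
    update_multipliers(variables)                      # e.g. lambda_i <- lambda_i + rho_i * residual_i
    return variables
```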
[0125] In step S3 of this embodiment, after the update of a target parameter is completed, the memory it occupied is released immediately. Since the update of each parameter in each iteration of the multiplier alternating direction method is self-contained and does not involve gradients, this embodiment releases the memory occupied by parameters that are no longer needed immediately after the corresponding parameter update ends and clears memory in time. This effectively saves memory, allows the GPU memory to be used reasonably, and meets the needs of large-scale model training under limited GPU resources.
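A sketch of this immediate-release idea under the assumption of a PyTorch-style GPU backend; torch.cuda.empty_cache() is the standard call for returning cached, unused blocks to the allocator, while the surrounding function and dictionary are illustrative.

```python
import torch

def replace_and_release(variables, name, new_value):
    """Swap in an updated parameter and free the GPU memory held by the stale tensor."""
    old = variables.get(name)
    variables[name] = new_value
    del old                          # drop the last reference to the stale tensor
    torch.cuda.empty_cache()         # return cached, unused blocks to the GPU allocator
```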
[0126] In the specific application embodiment of the present invention, the detailed steps of using the above method to perform BERT training and realize event extraction are:
[0127] Step 1: Training set acquisition: Use English Wikipedia and BooksCorpus data sets as training sets;
[0128] Step 2: Model training: Obtain the initial BERT model, input the training set into the initial BERT model, and use the above model training method for training on the training set, namely:
[0129] Step 2.1. Data input: take the sentence to be trained from the training set, extract the word vector, and input it into the BERT model;
[0130] Step 2.2. Multiplier alternating direction method solution: when the BERT model trains the input word vector, the multiplier alternating direction method is used to solve the objective function. The representation of the BERT model by the Encoder module of the Transformer model is used to determine the objective function, constraints are added to the input word vector, the augmented Lagrangian technique is used to transform the objective function into an augmented Lagrangian function, and the variable parameters in the objective function and the output results of the BERT model are obtained by solving the augmented Lagrangian function;
[0131] Step 2.3. Parameter update: update the variable parameters solved in the objective function until training is completed, and obtain and output the final BERT model training result for event extraction.
[0132] The detailed process of each step is as described above.
[0133] Step 3: Model fine-tuning: Obtain the pre-training weights of the model trained in step 2, and fine-tune the pre-trained model weights according to the current event extraction task to obtain the final BERT model training result.
[0134] Step 4: Event extraction:
[0135] Step 4.1 Extract the word vector from the sentence to be processed and input it into the BERT model to obtain the context information of each word in the sentence, so as to capture the interdependence between the components of the sentence from the perspective of the entire sentence;
[0136] Step 4.2 Calculate the correlation between the words in the sentence, so as to obtain the event trigger word and event element in each event, as well as the event type and role they represent, and complete the event extraction.
[0137] Take the example shown in Figure 6. There are two events in this example. From the perspective of syntactic dependency, according to the nominal subject relation (nsubj), it can be concluded from the event element cameraman and the trigger word die that the Victim role of the Die event is cameraman, but there is no direct dependency between cameraman and the trigger word fired of the other event. Therefore, it is very difficult to find the event element playing the Target role of the Attack-type event in the figure based on traditional dependency relations alone. To solve this problem, through the sequence-to-sequence structure of the Transformer model, this embodiment can comprehensively utilize the contextual semantic and grammatical information of the entire sentence to capture and judge the potential relations between trigger words and event elements. At the same time, solving the BERT model with the multiplier alternating direction method breaks through the limitations of the backpropagation and stochastic gradient algorithms of traditional deep learning and avoids the inherent, unavoidable gradient vanishing and explosion problems of those algorithms. The analytical solution of the objective function is obtained through the augmented Lagrangian function, so the solution performance of the model is better, the convergence speed is faster, and the solution process is more stable. It also effectively alleviates the GPU memory redundancy of data parallelism caused by too many parameters and the difficulty of implementing model parallelism.
[0138] This embodiment also discloses a BERT model training system based on the multiplier alternating direction method, including a processor and a memory, where the memory is used to store a computer program and the processor is used to execute the computer program so as to perform the above method.
[0139] This embodiment also discloses a computer-readable storage medium storing a computer program, and the above method is implemented when the computer program is executed.
[0140] The present invention can be applied to event extraction in multiple fields such as question answering systems, reading comprehension, text summarization, and text classification. It can also be applied to other types of information extraction, and even to other natural language processing tasks.
[0141] The above are only preferred embodiments of the present invention, and do not limit the present invention in any form. Although the present invention has been disclosed above with preferred embodiments, it is not intended to limit the present invention. Therefore, any simple modifications, equivalent changes and modifications made to the above embodiments according to the technical essence of the present invention should fall within the protection scope of the technical solutions of the present invention without departing from the content of the technical solutions of the present invention.

