Language model training method and device based on knowledge distillation and text classification method and device
A language model training technology based on knowledge distillation, applied in semantic analysis, natural language data processing, character and pattern recognition, etc. It can solve problems such as decreased accuracy on new-domain data, inability to accurately transfer the teacher model's sentence grammar and semantic representations, and the transfer ability of the student model not being guaranteed, so as to achieve the effect of meeting application requirements and improving transfer ability.
Pending Publication Date: 2021-04-30
IFLYTEK CO LTD
AI-Extracted Technical Summary
Problems solved by technology
[0003] The existing pre-trained language model distillation methods usually use a distillation approach that aligns the output scores and the intermediate layers. This approach can make the output scores of the student model close to the output scores of the teacher model on the data of a specific task. However, if data from a new field is tested, ...
Method used
Based on the above situation, the present application provides a language model training method, a text classification method and devices based on knowledge distillation. Positive and negative examples of contrastive learning are constructed in the distillation process, and the second model is trained with these positive and negative examples, so as to transfer the rich sentence grammar and semantic representations of the first model to the second model, so that the distilled second model has better transfer ability and meets cross-domain application requirements.
Different from the prior art, the present embodiment constructs a first memory bank and a second memory bank and uses them to store the first hidden layer sentence representations and the second hidden layer sentence representations respectively, so that the corresponding hidden layer sentence representations can be selected directly from the first memory bank and the second memory bank when constructing positive and negative examples of contrastive learning. This avoids repeatedly constructing hidden layer sentence representations and thereby improves the efficiency of contrastive training.
Different from the prior art, the present embodiment constructs positive and negative examples of contrastive learning and uses them to train the student model, so that the representations of the student model and the teacher model for the same input text are closer while their representations of different input texts are farther apart, thereby transferring the grammatical and semantic representation abilities of the teacher model to the student model, giving the student model better transfer ability and meeting cross-domain application requirements.
Different from the prior art, the present embodiment constructs positive and negative examples of contrastive learning in the distillation process and uses them to train the second model, so as to transfer the rich sentence grammar and semantic representations of the first model to the second model. The distilled second model therefore has better transfer ability, and the trained second model can be applied as a language model to classification tasks in different fields, which not only achieves inference acceleration but also reaches an accuracy comparable to that of the teacher model, thereby meeting cross-domain application requirements.
In the present embodiment, the inter-word relationship matrix is used to construct the hidden layer sentence representation because the magnitudes of the relation values between words can reflect the grammar and semantics of a sentence. For example, in the sentence "he stole a car", the relation values between "he", "stole" and "car" are relatively large, reflecting a grammatical relationship between subject and ver...
Abstract
The invention discloses a language model training method and device based on knowledge distillation, and a text classification method and device. The language model training method comprises: inputting training corpora into a first model and a second model for processing to obtain corresponding intermediate layer data and output results; calculating a first hidden layer sentence representation and a second hidden layer sentence representation from the corresponding intermediate layer data; constructing positive and negative examples of contrastive learning based on the first hidden layer sentence representation and the second hidden layer sentence representation; training the second model using the positive and negative examples of contrastive learning, the corresponding intermediate layer data and the output results; and determining the trained second model as the language model. Through the positive and negative examples of contrastive learning, the rich sentence grammar and semantic representations of the first model can be transferred to the second model, so that the second model obtained through distillation has better transfer ability and cross-domain application requirements are met.
Examples
- Experimental program(1)
Example Embodiment
[0030] The technical solutions in the embodiments of the present application will be described below with reference to the embodiments of the present application. Obviously, the described embodiments are only some of the embodiments of the present application, not all of them. Based on the embodiments in the present application, all other embodiments obtained by those of ordinary skill in the art without creative effort fall within the protection scope of the present application.
[0031] The terms used in the embodiments of the present application are for the purpose of describing particular embodiments only and are not intended to limit the present application. The singular forms "a", "an", "the" and "said" used in the embodiments of the present application and the appended claims are also intended to include the plural forms, unless the context clearly indicates otherwise. "A plurality of" generally includes at least two, but does not exclude the case of including at least one.
[0032] It should be understood that the term "and/or" used herein merely describes an association relationship between associated objects, indicating that three relationships may exist. For example, A and/or B may represent three cases: A alone, both A and B, and B alone. In addition, the character "/" herein generally indicates that the associated objects before and after it are in an "or" relationship.
[0033] It should also be understood that the terms "comprising", "including" or any other variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article or device that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, article or device. Without further limitation, an element defined by the statement "comprising a ..." does not exclude the existence of other identical elements in the process, method, article or device that includes the element.
[0034] Fine-tuning a pre-trained model on downstream tasks has become a new paradigm in natural language processing (NLP). This paradigm brings improvements on many natural language processing tasks, such as reading comprehension tasks and general language understanding evaluation (GLUE) tasks. Common pre-trained models, such as BERT, RoBERTa, ALBERT and ELECTRA, all use a multi-layer Transformer as the core skeleton. While the multi-layer Transformer skeleton brings strong nonlinear fitting and representation capability, it also brings huge parameter storage pressure and slow inference speed. Especially for scenarios with high concurrency and strict requirements on average response time, such as accessing judicial intelligent customer service on a mobile phone or performing document review work on a domestic central processor, the pre-trained model suffers from low throughput and high average response time. Therefore, it is necessary to perform inference acceleration on the pre-trained model.
[0035] Knowledge distillation is a teacher-student model compression method proposed by Hinton et al., in which a large-scale teacher model is introduced to guide the training of a small-scale student model, thereby achieving knowledge transfer. The practice is to first train a teacher model, and then use the output of the teacher model together with the data labels to train the student model, so that the student model can not only learn how to distinguish correct samples from the labeled data, but also learn the relationships between classes captured by the teacher model.
[0036] The existing pre-trained language model distillation methods typically use a distillation approach that aligns the output scores and the intermediate layers; aligning the intermediate layer data can effectively improve the alignment of the final output scores. However, this approach can only make the output scores of the student model close to those of the teacher model on the data of a specific task; when tested on data from a new field, the effect of the student model drops a lot compared with the teacher model. For example, a student model distilled on theft-crime data performs comparably to the teacher model when tested on theft-crime data, but when tested on dangerous-driving-crime data, the effect of the student model is 10-20% worse than that of the teacher model. That is, the transfer ability of the distilled student model is not guaranteed, the purpose of transferring the teacher model's syntactic and semantic representations is not achieved, and cross-domain application requirements cannot be met.
[0037] Based on the above, the present application provides a language model training method, a text classification method and devices based on knowledge distillation. Positive and negative examples of contrastive learning are constructed in the distillation process, and the second model is trained with these positive and negative examples, so as to transfer the rich sentence grammar and semantic representations of the first model to the second model, so that the distilled second model has better transfer ability, thereby meeting cross-domain application requirements.
[0038] Specifically, referring to Figure 1, Figure 1 is a flowchart of an embodiment of the language model training method based on knowledge distillation of the present application. As shown in Figure 1, in the present embodiment, the method includes:
[0039] S11: Obtain a sample data set, where the sample data set includes multiple training corpora and labels of the training corpora.
[0040] In the present embodiment, the training corpus includes data of a classification task and/or data of a sequence labeling task.
[0041] The data of the classification task includes sentiment classification, topic classification and text matching data; the data of the sequence labeling task includes named entity recognition, part-of-speech tagging and semantic role labeling data.
[0042] In this embodiment, the data of the classification task can be extracted from the related-case matching data set and the intelligent customer service data set in the judicial field.
[0043] In other embodiments, the data of the classification task can also be extracted from related data sets in other fields, which is not limited in this application.
[0044] In this embodiment, the data of the sequence labeling task can be extracted from the crime element extraction data set in the judicial field.
[0045] In other embodiments, the data of the sequence labeling task can also be extracted from related data sets in other fields, which is not limited in this application.
[0046] S12: Input the multiple training corpora into the first model, and process the training corpora through the first model to obtain the intermediate layer data and output results of the first model for the training corpora; and input the multiple training corpora into the second model, and process the training corpora through the second model to obtain the intermediate layer data and output results of the second model for the training corpora; wherein the number of intermediate layers of the first model is greater than the number of intermediate layers of the second model.
[0047] In the present embodiment, the first model is a multi-layer model, such as a teacher model, and the second model is a model whose number of layers is smaller than that of the first model, such as a student model. For ease of understanding, the present embodiment takes the first model being a teacher model and the second model being a student model as a specific example.
[0048] For example, the intermediate layers of the teacher model consist of a 12-layer Transformer, and the intermediate layers of the student model consist of a 3-layer Transformer.
[0049] In the present embodiment, the pre-trained 12-layer model is trained on the task and the model parameters are updated by back-propagation until the teacher model parameters are well trained, so as to obtain the teacher model used in distillation training. The pre-trained 3-layer model, or the first 3 layers of the pre-trained 12-layer model, is used as the initialization parameters of the student model in distillation training.
[0050] Specifically, the more layers a pre-trained language model (e.g., BERT) has, the better its effect indicators; therefore, the present embodiment selects the 12-layer Transformer as the teacher model according to the effect indicators.
[0051] In other embodiments, a 24-layer Transformer can also be selected as the teacher model, which is not limited in this application.
[0052] Specifically, the fewer the layers and hidden units of a pre-trained model, the shorter the average response time. For example, the inference time required by the 3-layer student model is only 1/4 of that of the 12-layer teacher model; therefore, the present embodiment selects the 3-layer Transformer as the student model.
[0053] In other embodiments, since a 4-layer student model with 384 hidden units has an inference time on a T4 card of only 1/9 of that of the teacher model, a 4-layer Transformer can also be selected as the student model, which is not limited in this application.
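As an illustrative, non-limiting sketch of the initialization described above (the checkpoint name and the Hugging Face transformers API are assumptions of this example, not part of the present application), a 3-layer student can be initialized from the first 3 Transformer layers of a pre-trained 12-layer model as follows:

```python
# Hedged sketch: initialize a 3-layer student from a pre-trained 12-layer teacher.
# "bert-base-chinese" is a placeholder checkpoint name, not specified by the patent.
from transformers import BertConfig, BertModel

teacher = BertModel.from_pretrained("bert-base-chinese")               # 12-layer teacher
student_config = BertConfig.from_pretrained("bert-base-chinese", num_hidden_layers=3)
student = BertModel(student_config)

# Reuse the teacher's embeddings and its first 3 encoder layers as the student's initialization.
student.embeddings.load_state_dict(teacher.embeddings.state_dict())
for i in range(3):
    student.encoder.layer[i].load_state_dict(teacher.encoder.layer[i].state_dict())
```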
[0054] S13: Calculate the first hidden layer sentence representation of each training corpus from the intermediate layer data of the first model, and calculate the second hidden layer sentence representation of each training corpus from the intermediate layer data of the second model.
[0055] Referring to Figure 2, Figure 2 is a flowchart of a specific embodiment of step S13 in Figure 1. As shown in Figure 2, in the present embodiment, the step of calculating the first hidden layer sentence representation of each training corpus from the intermediate layer data of the first model and the second hidden layer sentence representation of each training corpus from the intermediate layer data of the second model includes:
[0056] S21: Perform inner-product calculations on the intermediate layer data of the first model and on the intermediate layer data of the second model respectively, to obtain the inter-word relationship matrices of the intermediate layer data of the first model and the inter-word relationship matrices of the intermediate layer data of the second model.
[0057] During distillation, since the number of layers of the first model (e.g., the teacher model) is larger than that of the second model (e.g., the student model), in order to align the intermediate layer data of the teacher model and the student model, a structural mapping relationship between the teacher model and the student model needs to be established, so as to obtain intermediate layers with a corresponding relationship.
[0058] In the present embodiment, the intermediate layer data of the first model whose intermediate layers have the same function as those of the second model are selected, and a mapping is established between the selected intermediate layer data of each layer, so as to obtain the mapping relationship between the intermediate layers of the first model and the second model, e.g., between the intermediate layers of the teacher model and the student model.
[0059] An "interval" (equal-spacing) mapping is used to obtain L mapping pairs of intermediate layers {(S_i, T_i)}, i = 1, ..., L, where L is the number of intermediate layers of the second model, i.e., of the student model.
[0060] For example, continuing the above example, the teacher model in the present embodiment has 12 intermediate layers {T_1, T_2, ..., T_12} and the student model has 3 intermediate layers {S_1, S_2, S_3}, so 3 mapping pairs of intermediate layers are obtained, and the mapping result is {(S_i, T_i)}, i = 1, 2, 3, where T_i = T_{4i}, S_i = S_i, L = 3. That is, the 4th, 8th and 12th layers of the teacher model are selected to correspond respectively to the 1st, 2nd and 3rd layers of the student model as intermediate layers with the same function.
[0061] Further, the intermediate layer data H^T_4, H^T_8 and H^T_12 corresponding to the 4th, 8th and 12th layers of the first model, and the intermediate layer data H^S_1, H^S_2 and H^S_3 of the second model, are selected.
[0062] Here, H^T_4, H^T_8 and H^T_12 are the output vectors of the 4th, 8th and 12th Transformer layers of the teacher model, respectively, and H^S_1, H^S_2 and H^S_3 are the output vectors of the 1st, 2nd and 3rd Transformer layers of the student model, respectively.
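The following is a minimal illustrative sketch of the "interval" mapping described above; the function name and 1-based indexing are assumptions of this example:

```python
def interval_layer_mapping(teacher_layers: int = 12, student_layers: int = 3):
    """Return (student_layer, teacher_layer) pairs using an equal-interval mapping.

    For 12 teacher layers and 3 student layers this yields [(1, 4), (2, 8), (3, 12)],
    i.e. T_i = T_{4i} and S_i = S_i as in the example above (1-based indices)."""
    step = teacher_layers // student_layers
    return [(i, i * step) for i in range(1, student_layers + 1)]
```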
[0063] Further, inner-product calculations are performed on the intermediate layer data H^T_i selected from the teacher model to obtain the inter-word relationship matrices of the intermediate layer data of the first model, and on the intermediate layer data H^S_i of the student model to obtain the inter-word relationship matrices of the intermediate layer data of the second model.
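As a hedged illustration of the inner-product calculation, assuming each selected layer output is a PyTorch tensor of shape (batch, seq_len, hidden):

```python
import torch

def inter_word_relation_matrix(h: torch.Tensor) -> torch.Tensor:
    """Inner products between all pairs of word vectors of one layer output.

    h: (batch, seq_len, hidden) layer output, e.g. H^T_i or H^S_i.
    Returns the inter-word relationship matrix of shape (batch, seq_len, seq_len)."""
    return torch.matmul(h, h.transpose(-1, -2))
```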
[0064] S22: Construct, from the inter-word relationship matrices of the intermediate layer data of the first model, the first hidden layer sentence representation of each training corpus corresponding to the intermediate layer data of the first model; and construct, from the inter-word relationship matrices of the intermediate layer data of the second model, the second hidden layer sentence representation of each training corpus corresponding to the intermediate layer data of the second model.
[0065] In the present embodiment, according to the constructed mapping relationship, hidden layer sentence representations are constructed from the inter-word relationship matrices of the selected intermediate layer data of the first model, so as to obtain the first hidden layer sentence representation of each training corpus corresponding to the intermediate layer data of the first model.
[0066] Specifically, denote the inter-word relationship matrix of the intermediate layer data of the first model by R^T_i; the hidden layer sentence representation constructed from this inter-word relationship matrix is denoted t_i.
[0067] Here, t_i is the first hidden layer sentence representation of each training corpus constructed from the intermediate layer data of the first model.
[0068] Similarly, denote the inter-word relationship matrix of the intermediate layer data of the second model by R^S_i; the hidden layer sentence representation constructed from this inter-word relationship matrix is denoted s_i.
[0069] Here, s_i is the second hidden layer sentence representation of each training corpus constructed from the intermediate layer data of the second model.
[0070] Further, since the size of the inter-word relationship matrix is related to the maximum sentence length of the model, and since there are many training corpora, in order to reduce the amount of computation and improve the convergence speed, the present embodiment performs dimension-reduction processing on the inter-word relationship matrix to obtain the above hidden layer sentence representations, thereby improving training efficiency.
[0071] Specifically, the inter-word relationship matrix is concatenated row by row, and a linear transformation is then used to reduce the dimension of the concatenated inter-word relationship matrix, so as to obtain the above hidden layer sentence representations.
[0072] For example, in the case matching task, the maximum sentence length of the model is 512, so the size of the inter-word relationship matrix is 512*512; the inter-word relationship matrix is concatenated row by row and then reduced in dimension by a linear transformation, so that the dimension of the hidden layer sentence representation is reduced from 512 to 256.
[0073] In the present embodiment, the inter-word relationship matrix is used to construct the hidden layer sentence representation because the magnitudes of the relation values between words can reflect the grammar and semantics of a sentence. For example, in the sentence "he stole a car", the relation values between "he", "stole" and "car" are relatively large, reflecting a subject-verb-object grammatical relationship. Training the student model with hidden layer sentence representations constructed from the inter-word relationship matrix enables the student model to obtain more accurate grammatical and semantic representation ability.
[0074] By constructing hidden layer sentence representations from the inter-word relationship matrices of the intermediate layer data of the first model and of the second model, first and second hidden layer sentence representations carrying more grammar and semantics can be constructed, thereby providing as many input-text representations as possible for constructing positive and negative examples.
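One possible, illustrative reading of the row-wise concatenation and linear dimension reduction is sketched below; the mean pooling over rows and the module name are assumptions of this example:

```python
import torch
import torch.nn as nn

class HiddenSentenceRepresentation(nn.Module):
    """Reduce a (seq_len x seq_len) inter-word relationship matrix to a 256-dim
    hidden layer sentence representation; the mean over rows is this sketch's assumption."""

    def __init__(self, max_seq_len: int = 512, rep_dim: int = 256):
        super().__init__()
        self.proj = nn.Linear(max_seq_len, rep_dim)   # 512 -> 256 linear transformation

    def forward(self, relation_matrix: torch.Tensor) -> torch.Tensor:
        # relation_matrix: (batch, seq_len, seq_len)
        rows = self.proj(relation_matrix)             # (batch, seq_len, rep_dim)
        return rows.mean(dim=1)                       # (batch, rep_dim)
```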
[0075] S14: Construct positive and negative examples of contrastive learning from the first hidden layer sentence representations and the second hidden layer sentence representations; wherein a positive example includes the first hidden layer sentence representation and the second hidden layer sentence representation of the same training corpus, and a negative example includes the first hidden layer sentence representation in the positive example and the second hidden layer sentence representation of another, different training corpus.
[0076] In the present embodiment, a positive example of contrastive learning and at least one negative example of contrastive learning are constructed from the first hidden layer sentence representations and the second hidden layer sentence representations.
[0077] Specifically, the positive example of contrastive learning is constructed in the following manner:
[0078] Suppose the training corpus contains a training sample (x_0, y_0), where x_0 is the text input of this training sample and y_0 is the classification label of this training sample;
[0079] For the training sample (x_0, y_0), the hidden layer sentence representation t corresponding to the training sample (x_0, y_0) is selected from the first hidden layer sentence representations, and the hidden layer sentence representation s corresponding to the training sample (x_0, y_0) is then selected from the second hidden layer sentence representations.
[0080] Based on the hidden layer sentence representation t and the hidden layer sentence representation s, the positive example (s, t) of contrastive learning is constructed.
[0081] Further, at least one negative example of contrastive learning is constructed in the following manner:
[0082] K hidden layer sentence representations s^k (k = 1, ..., K) corresponding to other training samples are selected from the second hidden layer sentence representations.
[0083] Based on the hidden layer sentence representation t and the hidden layer sentence representations s^k, K negative examples (s^k, t) of contrastive learning are constructed.
[0084] Referring to Figure 3, Figure 3 is a flowchart of an embodiment of constructing positive and negative examples of contrastive learning in the present application. As shown in Figure 3, in the present embodiment, the method includes:
[0085] S31: Select the intermediate layer data of the first model whose intermediate layers have the same function as the intermediate layers of the second model.
[0086] For example, the intermediate layer data H^T_4, H^T_8 and H^T_12 corresponding to the 4th, 8th and 12th layers of the first model, and the intermediate layer data H^S_1, H^S_2 and H^S_3 of the second model, are selected.
[0087] S32: Establish a mapping between the selected intermediate layer data of each layer, so as to obtain the mapping relationship between the intermediate layers of the first model and the second model.
[0088] S33: Using the mapping relationship, select the first hidden layer sentence representation and the second hidden layer sentence representation of the same training corpus corresponding to the intermediate layer data of the same mapped intermediate layers of the first model and the second model as a positive example; and, using the mapping relationship, select the second hidden layer sentence representations of other, different training corpora corresponding to the intermediate layer data of the same mapped intermediate layers, and use these second hidden layer sentence representations together with the first hidden layer sentence representation in the positive example as negative examples.
[0089] In the present embodiment, a positive example of contrastive learning and at least one negative example are constructed from the first hidden layer sentence representations and the second hidden layer sentence representations corresponding to the intermediate layer data having the mapping relationship.
[0090] Specifically, the positive example of contrastive learning is constructed in the following manner:
[0091] Suppose the training corpus contains a training sample (x_0, y_0), where x_0 is the text input of this training sample and y_0 is the classification label of this training sample;
[0092] Using the mapping relationship, the first hidden layer sentence representation t_i and the second hidden layer sentence representation s_i of the training sample (x_0, y_0) corresponding to the intermediate layer data of the i-th pair of mapped intermediate layers of the first model and the second model are selected, and the positive example (s_i, t_i) of contrastive learning is constructed.
[0093] Further, at least one negative example of contrastive learning is constructed in the following manner:
[0094] Using the mapping relationship, the second hidden layer sentence representations s_i^k (k = 1, ..., K) of the other remaining training samples corresponding to the intermediate layer data of the same mapped intermediate layer are selected; together with the first hidden layer sentence representation t_i in the positive example, K negative examples (s_i^k, t_i) of contrastive learning are constructed.
[0095] In the prior art, the construction and use of positive and negative examples has nothing to do with the distillation process; the approach adopted during distillation is to make the positive-example scores of the student model close to the positive-example scores of the teacher model, and the negative-example scores of the student model close to the negative-example scores of the teacher model.
[0096] In this embodiment, the construction and use of positive and negative examples is applied to the distillation process itself. Each input sample is regarded as a separate class: representations from the same input sample form a positive example, and representations from different input samples form negative examples. This makes the representations of the second model and the first model for the same input sample closer and their representations of different input samples farther apart, thereby maximizing a lower bound of the mutual information between the two probability distributions, strengthening the learning of grammar and semantics during the training of the second model, and thus transferring the grammatical and semantic representations of the first model to the second model to improve the transfer ability and generalization of the second model.
[0097] Referring next to Figure 4, Figure 4 is a flowchart of another embodiment of constructing positive and negative examples of contrastive learning in the present application. As shown in Figure 4, in the present embodiment, the method includes:
[0098] S41: Construct a first memory bank and a second memory bank.
[0099] In the present embodiment, the sizes of the first memory bank and the second memory bank are both expressed as {N * L * D}, where N is the number of training corpora in the sample data set, L is the number of mapped intermediate layers of the model, and D is the dimension of the hidden layer sentence representation.
[0100] Specifically, since the 4th, 8th and 12th layers of the teacher model are selected to correspond to the 1st, 2nd and 3rd layers of the student model as the mapped intermediate layers, L is 3; and since the inter-word relationship matrix is reduced in dimension, the first hidden layer sentence representation and the second hidden layer sentence representation are both 256-dimensional.
[0101] S42: Store the first hidden layer sentence representations in the first memory bank, and store the second hidden layer sentence representations in the second memory bank.
[0102] In the present embodiment, two memory banks are constructed to separately store the first hidden layer sentence representations and the second hidden layer sentence representations constructed from the training corpora on the first model and the second model.
[0103] Specifically, each batch of input data can construct a plurality of hidden layer sentence representations; storing this large number of hidden layer sentence representations in the memory banks avoids re-construction and facilitates the subsequent calculation of the contrastive loss function of the positive and negative examples.
[0104] S43: Select the first hidden layer sentence representation of the positive example from the first memory bank, and query, from the second memory bank, the second hidden layer sentence representation of the same training corpus corresponding to that first hidden layer sentence representation; and select the first hidden layer sentence representation of the negative examples from the first memory bank, and query, from the second memory bank, the second hidden layer sentence representations of different training corpora corresponding to that first hidden layer sentence representation.
[0105] Further, since the first model is fixed during distillation, the first memory bank remains unchanged after being initialized once, while the second memory bank is updated during distillation.
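An illustrative sketch of the two memory banks follows; the class name, update interface and negative-sampling helper are assumptions of this example, with shapes following the {N * L * D} description above:

```python
import torch

class MemoryBank:
    """Stores one hidden layer sentence representation per (sample, mapped layer).

    Shape follows {N * L * D}: N training corpora, L mapped intermediate layers,
    D-dimensional sentence representations (e.g. L = 3, D = 256)."""

    def __init__(self, num_samples: int, num_layers: int = 3, rep_dim: int = 256):
        self.bank = torch.zeros(num_samples, num_layers, rep_dim)

    def update(self, sample_ids: torch.Tensor, layer: int, reps: torch.Tensor) -> None:
        # For the teacher bank this is called once at initialization and then kept fixed;
        # for the student bank it is refreshed after each parameter update.
        self.bank[sample_ids, layer] = reps.detach()

    def sample_negatives(self, exclude_id: int, layer: int, k: int = 4096) -> torch.Tensor:
        # Draw K student representations of *other* samples as negatives.
        candidates = torch.tensor([i for i in range(self.bank.size(0)) if i != exclude_id])
        idx = candidates[torch.randint(len(candidates), (k,))]
        return self.bank[idx, layer]
```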
[0106] Different from the prior art, the present embodiment constructs the first memory bank and the second memory bank and uses them to store the first hidden layer sentence representations and the second hidden layer sentence representations respectively, so that the corresponding hidden layer sentence representations can be selected directly from the first memory bank and the second memory bank when constructing positive and negative examples. This avoids repeatedly constructing hidden layer sentence representations and thereby improves the efficiency of contrastive training.
[0107] S15: Train the second model using the sample data set, the intermediate layer data and output results of the first model, the intermediate layer data and output results of the second model, and the positive and negative examples, and determine the trained second model as the language model.
[0108] Referring to Figure 5, Figure 5 is a flowchart of a specific embodiment of step S15 in Figure 1. As shown in Figure 5, in the present embodiment, the step of training the second model using the sample data set, the intermediate layer data and output results of the first model, the intermediate layer data and output results of the second model, and the positive and negative examples, and determining the trained second model as the language model, includes:
[0109] S51: Calculate the cross-entropy loss function of the output result of the second model for the training corpus relative to the label; calculate the mean-square-error loss function of the intermediate layer data of the first model and the intermediate layer data of the second model; calculate the contrastive loss function of the positive and negative examples; and calculate the relative-entropy loss function of the output result of the first model and the output result of the second model.
[0110] In the present embodiment, the cross-entropy (Cross Entropy, CE) loss function of the output result of the second model for the training corpus relative to the label is calculated based on the output result of the second model for the training corpus, the probability value of the label corresponding to the training corpus, and the compression parameters of the second model relative to the first model.
[0111] Specifically, the calculation formula of the cross-entropy loss function of the output result of the second model for the training corpus relative to the label is:
[0112] L_hard(z^S, y; θ) = CE(z^S, y; θ)    (1)
[0113] where z^S is the output result of the second model for the training corpus, y is the probability value of the label corresponding to the training corpus, and θ denotes the compression parameters of the second model relative to the first model.
[0114] In the present embodiment, the mean-square-error loss function of the intermediate layer data of the first model and the intermediate layer data of the second model is calculated based on the intermediate layer data of the first model whose intermediate layers have the same function as those of the second model, the intermediate layer data of the second model, the compression parameters of the second model relative to the first model, and the linear mapping layers.
[0115] Specifically, the calculation formula of the mean-square-error loss function of the intermediate layer data of the first model and the intermediate layer data of the second model is:
[0116] L_MSE_i(H^T_i, H^S_i; θ) = MSE(H^T_i, W_i H^S_i; θ)    (2)
[0117] where H^T_i is the intermediate layer data of the i-th mapped layer of the first model, H^S_i is the intermediate layer data of the i-th mapped layer of the second model, MSE is the mean-square-error function, W_i is the linear mapping layer of the i-th layer, and θ denotes the compression parameters of the second model relative to the first model.
[0118] Further, the calculation formula of the mean-square-error loss function of the entire distillation process is:
[0119] L_MSE(H^T, H^S; θ) = Σ_{i=1}^{L} MSE(H^T_i, W_i H^S_i; θ)    (3)
[0120] where H^T denotes the intermediate layer data of the first model, H^S denotes the intermediate layer data of the second model, MSE is the mean-square-error function, W_i is the linear mapping layer of the i-th layer, θ denotes the compression parameters of the second model relative to the first model, and L is the number of intermediate layers of the second model.
[0121] In other embodiments, the intermediate layer data of the i-th layer of the second model can also be linearly transformed so that the number of intermediate layer units of the second model is the same as the number of intermediate layer units of the first model.
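For illustration, a sketch of the mean-square-error term over the mapped intermediate layers, with a linear mapping W_i from the student's hidden size to the teacher's; the module name is an assumption of this example:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class IntermediateLayerMSE(nn.Module):
    """MSE between mapped intermediate layers; W_i maps the student's hidden states
    to the teacher's hidden size before the comparison, as in formula (3) above."""

    def __init__(self, student_dim: int, teacher_dim: int, num_layers: int = 3):
        super().__init__()
        self.mappings = nn.ModuleList(
            nn.Linear(student_dim, teacher_dim) for _ in range(num_layers)
        )

    def forward(self, teacher_states, student_states):
        # teacher_states / student_states: lists of L tensors of shape (batch, seq_len, dim)
        return sum(
            F.mse_loss(w(h_s), h_t)
            for w, h_s, h_t in zip(self.mappings, student_states, teacher_states)
        )
```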
[0122] In the present embodiment, the dot product of the two vectors of the positive example, and the dot product of the two vectors of each negative example, are calculated to obtain the similarity of the positive example and of each negative example; the noise contrastive estimation (Noise Contrastive Estimation, NCE) loss function is then calculated using these similarities of the positive and negative examples.
[0123] Specifically, based on the training sample (x_0, y_0), a positive example (s_i, t_i) and K negative examples (s_i^k, t_i) at the i-th mapped intermediate layer are obtained, and the calculation formula of the contrastive loss function of the positive and negative examples is:
[0124] L_NCE_i(θ_i) = -log[ exp(s_i · t_i / τ) / ( exp(s_i · t_i / τ) + Σ_{k=1}^{K} exp(s_i^k · t_i / τ) ) ]    (4)
[0125] where θ_i denotes the compression parameters of the i-th layer of the second model relative to the first model, s_i and t_i are the hidden layer sentence representations of the second model and the first model at the i-th mapped intermediate layer respectively, the "·" operation denotes the dot product of two vectors, log denotes the logarithmic function, K is a constant, and τ is a hyperparameter.
[0126] Here, K is usually taken as 4096.
[0127] Further, the calculation formula of the contrastive loss function of the entire distillation process is:
[0128] L_NCE(θ) = Σ_{i=1}^{L} L_NCE_i(θ_i)    (5)
[0129] where θ denotes the compression parameters of the second model relative to the first model, θ_i denotes the compression parameters of the i-th layer of the second model relative to the first model, and L is the number of intermediate layers of the second model.
[0130] In this embodiment, the contrastive loss function is used to measure the similarity of the positive and negative examples.
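An illustrative PyTorch sketch of the per-layer contrastive loss in formula (4) follows; the temperature value τ = 0.07 is an assumption of this example, since the present application only states that τ is a hyperparameter:

```python
import torch

def nce_contrastive_loss(s_i: torch.Tensor,        # student representation, shape (D,)
                         t_i: torch.Tensor,        # teacher representation, shape (D,)
                         negatives: torch.Tensor,  # K student reps of other samples, (K, D)
                         tau: float = 0.07) -> torch.Tensor:
    """-log( exp(s_i·t_i/τ) / (exp(s_i·t_i/τ) + Σ_k exp(s_i^k·t_i/τ)) )."""
    pos = torch.exp(torch.dot(s_i, t_i) / tau)
    neg = torch.exp(negatives @ t_i / tau).sum()
    return -torch.log(pos / (pos + neg))
```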
[0131] In the present embodiment, the relative-entropy (Relative Entropy, RE) loss function of the output result of the first model and the output result of the second model is calculated based on the output result of the first model, the output result of the second model, and the compression parameters of the second model relative to the first model.
[0132] Specifically, the calculation formula of the relative-entropy loss function of the output result of the first model and the output result of the second model is:
[0133] L_KD(z^T, z^S; θ) = CE(z^S, z^T; θ)    (6)
[0134] where z^T is the output result of the first model, z^S is the output result of the second model, and θ denotes the compression parameters of the second model relative to the first model.
[0135] The relative-entropy loss function of the output result of the first model and the output result of the second model can be used to measure the KL divergence between z^T and z^S.
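As a hedged illustration of formula (6), the relative entropy between the two output distributions can be computed as a KL divergence of the softmax outputs; any temperature scaling is omitted here because it is not mentioned above:

```python
import torch
import torch.nn.functional as F

def kd_relative_entropy_loss(z_teacher: torch.Tensor, z_student: torch.Tensor) -> torch.Tensor:
    """KL(softmax(z^T) || softmax(z^S)), i.e. the soft cross entropy of the student
    outputs against the teacher outputs, up to a constant."""
    log_p_student = F.log_softmax(z_student, dim=-1)
    p_teacher = F.softmax(z_teacher, dim=-1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean")
```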
[0136] S52: Train the second model using the cross-entropy loss function, the mean-square-error loss function, the contrastive loss function and the relative-entropy loss function, and determine the trained second model as the language model.
[0137] In the present embodiment, the loss values of the cross-entropy loss function, the mean-square-error loss function, the contrastive loss function and the relative-entropy loss function are calculated respectively by the above calculation formulas.
[0138] Further, the obtained loss values are weighted and summed to obtain the total distillation loss value of the second model.
[0139] Specifically, the calculation formula of the total distillation loss value of the second model is as follows:
[0140] L_all = α_1 L_NCE(θ) + α_2 L_hard(z^S, y; θ) + α_3 L_KD(z^T, z^S; θ) + α_4 L_MSE(H^T, H^S; θ)    (7)
[0141] where L_NCE(θ) is the contrastive loss function of the entire distillation process, L_hard(z^S, y; θ) is the cross-entropy loss function of the output result of the second model for the training corpus relative to the label, L_KD(z^T, z^S; θ) is the relative-entropy loss function of the output result of the first model and the output result of the second model, L_MSE(H^T, H^S; θ) is the mean-square-error loss function of the entire distillation process, and α_1, α_2, α_3 and α_4 are the weight values corresponding to the above four loss functions respectively.
[0142] In the present embodiment, the model parameters of the second model are trained by back-propagation using the total distillation loss value, so as to obtain the language model.
[0143] Specifically, training the model parameters of the second model by back-propagation using the total distillation loss value means using the Adam optimizer to calculate the gradient values of all model parameters and updating the parameter values of the second model by back-propagation, so as to optimize the model.
[0144] Updating the parameter values of the second model also includes updating the positive and negative examples of contrastive learning in the second memory bank, i.e., calculating the second hidden layer sentence representations corresponding to the updated second model and storing these second hidden layer sentence representations in the second memory bank.
[0145] In the present embodiment, the magnitude of each back-propagation update is relatively small, so as to ensure the smoothness of the effect before and after the parameters of the second model are updated.
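Putting the four losses together, the following illustrative sketch shows the weighted sum of formula (7) and a typical Adam update of the student; the weight values and learning rate are assumptions of this example, not values given by the present application:

```python
import torch

def total_distillation_loss(l_nce: torch.Tensor, l_hard: torch.Tensor,
                            l_kd: torch.Tensor, l_mse: torch.Tensor,
                            alphas=(1.0, 1.0, 1.0, 1.0)) -> torch.Tensor:
    """Weighted sum of the four losses as in formula (7); equal weights are an
    assumption of this sketch."""
    a1, a2, a3, a4 = alphas
    return a1 * l_nce + a2 * l_hard + a3 * l_kd + a4 * l_mse

# Typical usage for one update of the student (teacher parameters stay fixed):
#   optimizer = torch.optim.Adam(student.parameters(), lr=2e-5)   # lr is an assumed value
#   loss = total_distillation_loss(l_nce, l_hard, l_kd, l_mse)
#   optimizer.zero_grad(); loss.backward(); optimizer.step()
# After the step, the second memory bank is refreshed with the updated student's
# hidden layer sentence representations (see the MemoryBank sketch above).
```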
[0146] Further, new training corpora are iteratively input into the first model and the second model, the parameters of the first model are always kept fixed, and the distillation process of steps S12 to S15 is repeated continuously until the effect of distillation converges, so as to obtain the optimal second model, and the optimal second model is determined as the language model.
[0147] The language model obtained in the present embodiment is a compressed 3-layer student model. The number of parameters of the student model is approximately 1/3 of that of the teacher model, the inference speed of the student model is 3 times that of the teacher model, and the effect of the student model on the test set is comparable to that of the teacher model.
[0148] In the present embodiment, positive and negative examples of contrastive learning are constructed and used to train the student model, so that the representations of the student model and the teacher model for the same input text are closer while their representations of different input texts are farther apart, thereby transferring the grammatical and semantic representation ability of the teacher model to the student model, so that the student model has better transfer ability and meets cross-domain application requirements.
[0149] Different from the prior art, the present embodiment constructs positive and negative examples of contrastive learning and uses them to train the student model, so that the representations of the student model and the teacher model for the same input text are closer while their representations of different input texts are farther apart, thereby transferring the grammatical and semantic representations of the teacher model to the student model, giving the student model better transfer ability and meeting cross-domain application requirements.
[0150] To further illustrate the process of the above training method, please refer to Figure 6. Figure 6 is a framework schematic diagram of the language model training method based on knowledge distillation of the present application. As shown in Figure 6, the 4th, 8th and 12th layers of the teacher model are selected to correspond respectively to the 1st, 2nd and 3rd layers of the student model.
[0151] In this embodiment, based on the training sample (x_0, y_0), the intermediate layer data of the teacher model are the output vectors of the 4th, 8th and 12th Transformer layers, and the intermediate layer data of the student model are the output vectors of the 1st, 2nd and 3rd Transformer layers.
[0152] The output vectors are passed through the corresponding linear mapping layers, and the mean-square-error loss function (MSE Loss) of the intermediate layer data of the teacher model and the intermediate layer data of the student model is calculated.
[0153] Inner-product calculations are performed on the intermediate layer data to obtain the inter-word relationship matrices of the intermediate layer data of the first model and of the second model; by reducing the dimension of these inter-word relationship matrices, the first hidden layer sentence representation and the second hidden layer sentence representation corresponding to the training sample (x_0, y_0) are obtained.
[0154] Further, positive and negative examples of contrastive learning are constructed based on the first hidden layer sentence representation and the second hidden layer sentence representation corresponding to the training sample (x_0, y_0), including one positive example and K negative examples.
[0155] The dot product of the two vectors of the positive example and the dot product of the two vectors of each negative example are calculated to obtain the similarity of the positive example and of each negative example, and the contrastive loss function (NCE Loss) is calculated using these similarities.
[0156] In this embodiment, based on the training sample (x_0, y_0), the output results of the teacher model and the student model are the results z^T and z^S obtained after the fully connected layers (FC) of the teacher model and the student model, respectively.
[0157] Based on z^T, z^S and the compression parameters of the student model relative to the teacher model, the relative-entropy loss function (RE Loss) of the output result of the teacher model and the output result of the student model is calculated.
[0158] In this embodiment, based on z^S, the probability value of the label corresponding to the training corpus, and the compression parameters of the student model relative to the teacher model, the cross-entropy loss function (CE Loss) of the output result of the student model for the training corpus relative to the label is calculated.
[0159] Different from the prior art, the present embodiment constructs positive and negative examples of contrastive learning and uses them to train the student model, so that the representations of the student model and the teacher model for the same input text are closer while their representations of different input texts are farther apart, thereby transferring the grammatical and semantic representations of the teacher model to the student model, giving the student model better transfer ability and meeting cross-domain application requirements.
[0160]Correspondingly, the present application provides a text classification method based on a language model.
[0161] Referring to Figure 7, Figure 7 is a flowchart of an embodiment of the text classification method based on a language model of the present application. As shown in Figure 7, in the present embodiment, the language model is the second model trained by the training method of the above embodiments, and the text classification method includes:
[0162] S71: Receive the text to be classified.
[0163] S72: Input the text to be classified into the language model, and process the text to be classified through the language model to obtain the classification result of the text to be classified.
[0164] In a specific implementation scenario, for example related-case matching of judicial documents, a well-trained language model is first obtained; the relevant judicial documents are then received and organized into text data that conforms to the input protocol, and the text data is input into the above language model to obtain the case matching result.
[0165] Since the language model used in this implementation scenario is the student model obtained by distillation training, the average response time of the related-case matching work is about 1/4 of that of the original teacher model, so that the average response time reaches an acceptable level and inference acceleration is achieved.
[0166] Further, since positive and negative examples of contrastive learning are introduced in the process of distillation training of the language model, the rich representations of the original model can be transferred to the trained language model, so that the obtained language model has good transfer ability and can be applied to more implementation scenarios.
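For illustration, a minimal inference sketch for applying the distilled student as the classifier is given below; the tokenizer class, checkpoint path and input text are placeholder assumptions of this example:

```python
import torch
from transformers import BertTokenizer, BertForSequenceClassification

# Placeholder path; the present application does not specify a concrete checkpoint.
tokenizer = BertTokenizer.from_pretrained("path/to/distilled-student")
model = BertForSequenceClassification.from_pretrained("path/to/distilled-student")
model.eval()

text = "text of the judicial document to be classified"   # placeholder input
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
with torch.no_grad():
    logits = model(**inputs).logits
predicted_class = logits.argmax(dim=-1).item()
print(predicted_class)
```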
[0167] Different from the prior art, the present embodiment constructs positive and negative examples of contrastive learning in the distillation process and uses them to train the second model, so as to transfer the rich sentence grammar and semantic representations of the first model to the second model. The distilled second model therefore has better transfer ability, and the trained second model can be applied as a language model to classification tasks in different fields, which not only achieves inference acceleration but also reaches an accuracy comparable to that of the teacher model, thereby meeting cross-domain application requirements.
[0168]Correspondingly, the present application provides a language model training apparatus based on knowledge distillation.
[0169] Referring to Figure 8, Figure 8 is a schematic structural diagram of an embodiment of the language model training apparatus based on knowledge distillation of the present application. As shown in Figure 8, the language model training apparatus 80 includes a processor 81 and a memory 82 coupled to each other.
[0170] In the present embodiment, the memory 82 is used to store program data which, when executed, implements the steps of the language model training method in any of the above embodiments; the processor 81 is used to execute the program instructions stored in the memory 82 to implement the steps of the language model training method in any of the above method embodiments, or the steps correspondingly performed by the language model training apparatus in any of the above method embodiments.
[0171] Specifically, the processor 81 is configured to control itself and the memory 82 to implement the steps of the language model training method in any of the above embodiments. The processor 81 may also be referred to as a CPU (Central Processing Unit). The processor 81 may be an integrated circuit chip with signal processing capability. The processor 81 may also be a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. The general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. In addition, the processor 81 may be implemented jointly by a plurality of integrated circuit chips.
[0172] Different from the prior art, the present embodiment constructs positive and negative examples of contrastive learning and uses them to train the student model, so that the representations of the student model and the teacher model for the same input text are closer while their representations of different input texts are farther apart, thereby transferring the grammatical and semantic representations of the teacher model to the student model, giving the student model better transfer ability and meeting cross-domain application requirements.
[0173]Correspondingly, the present application provides a text classification device based on a language model.
[0174] Referring to Figure 9, Figure 9 is a schematic structural diagram of an embodiment of the text classification device based on a language model of the present application. As shown in Figure 9, the text classification device 90 includes a processor 91 and a memory 92 coupled to each other.
[0175] In the present embodiment, the memory 92 is used to store program data which, when executed, implements the steps of the above text classification method; the processor 91 is used to execute the program instructions stored in the memory 92 to implement the steps of the text classification method in the above method embodiment, or the steps correspondingly performed by the text classification device in the above method embodiment.
[0176] Specifically, the processor 91 is configured to control itself and the memory 92 to implement the steps of the text classification method in any of the above embodiments. The processor 91 may also be referred to as a CPU (Central Processing Unit). The processor 91 may be an integrated circuit chip with signal processing capability. The processor 91 may also be a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. The general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. In addition, the processor 91 may be implemented jointly by a plurality of integrated circuit chips.
[0177] Different from the prior art, the present embodiment constructs positive and negative examples of contrastive learning in the distillation process and uses them to train the second model, so as to transfer the rich sentence grammar and semantic representations of the first model to the second model. The distilled second model therefore has better transfer ability, and the trained second model can be applied as a language model to classification tasks in different fields, which not only achieves inference acceleration but also reaches an accuracy comparable to that of the teacher model, thereby meeting cross-domain application requirements.
[0178]Correspondingly, the present application provides a computer readable storage medium.
[0179] Referring to Figure 10, Figure 10 is a schematic structural diagram of an embodiment of the computer-readable storage medium of the present application.
[0180] The computer-readable storage medium 100 stores a computer program 1001. When executed by a processor, the computer program 1001 performs the steps of the language model training method in any of the above method embodiments or the steps of the above text classification method, as well as the steps correspondingly performed by the language model training apparatus or the text classification device in the above method embodiments.
[0181] Specifically, if the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in the computer-readable storage medium 100. Based on this understanding, the technical solution of the present application, in essence, or the part contributing to the prior art, or all or part of the technical solution, can be embodied in the form of a software product. The computer software product is stored in the computer-readable storage medium 100 and includes a number of instructions for enabling a computer device (which may be a personal computer, a server, a network device, etc.) or a processor to perform all or some of the steps of the methods of the various embodiments of the present application. The aforementioned computer-readable storage medium 100 includes various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disc.
[0182] In the several embodiments provided in the present application, it should be understood that the disclosed methods and apparatus can be implemented in other ways. For example, the apparatus embodiments described above are merely schematic; the division of modules or units is only a logical function division, and there may be other division methods in actual implementation, e.g., multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. In addition, the coupling or direct coupling or communication connection shown or discussed may be electrical, mechanical or in other forms.
[0183] The units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units, i.e., they may be located in one place or distributed over a plurality of network elements. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of the present embodiment.
[0184] In addition, each functional unit in the various embodiments of the present application may be integrated into one processing unit, each unit may exist physically separately, or two or more units may be integrated into one unit. The above integrated unit can be implemented in the form of hardware or in the form of a software functional unit.
[0185] If the integrated unit is implemented in the form of a software functional unit and sold or used as a stand-alone product, it can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present application, in essence, or the part contributing to the prior art, or all or part of the technical solution, can be embodied in the form of a software product. The computer software product is stored in a storage medium and includes a number of instructions for enabling a computer device (which may be a personal computer, a server, a network device, etc.) or a processor to execute all or part of the steps of the methods of the various embodiments of the present application. The aforementioned storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disc.
[0186] The above are only embodiments of the present application and are not intended to limit the patent scope of the present application. Any equivalent structure or equivalent process transformation made using the contents of the specification and drawings of the present application, or any direct or indirect application in other related technical fields, is likewise included within the patent protection scope of the present application.