A language model inference acceleration method, system, device and storage medium
By performing multi-dimensional compression and dynamic exit on pre-trained language models, combined with structured pruning, low-rank decomposition, and knowledge distillation techniques, the problem of low inference efficiency under limited hardware resources is solved, achieving efficient model deployment and performance improvement under limited hardware resources.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- SOUTH CHINA UNIV OF TECH
- Filing Date
- 2024-04-28
- Publication Date
- 2026-06-19
AI Technical Summary
Existing pre-trained language models have low inference efficiency in scenarios with limited hardware resources. Existing compression methods mainly focus on a single dimension, ignoring the advantages of multi-dimensional compression. Furthermore, methods that allow network layers to exit early require all internal classifiers to correctly predict samples.
We employ structured pruning techniques to prune feedforward neural network neurons and attention heads, combine low-rank decomposition and knowledge distillation techniques to compress the model, and design a reward function through a reinforcement learning framework to allow the model to dynamically exit, thereby achieving multi-dimensional compression and acceleration.
While ensuring model performance, we can significantly improve inference efficiency, reduce redundant parameters, and achieve efficient model deployment under limited hardware resources. We can also reduce performance loss after pruning through iterative training methods and dynamically adjust the inference process to achieve a balance between speed and accuracy.
Description
Technical Field
[0001] This invention relates to the field of natural language processing, and more particularly to a method, system, device, and storage medium for accelerating inference of a language model.
Background Technology
[0002] Pre-trained language models have become a hot research topic in recent years, achieving significant breakthroughs in many natural language processing tasks. Their remarkable success is mainly attributed to their large number of model parameters. While performance improves with increasing parameter size, inference efficiency declines significantly. Typical pre-trained models like BERT have hundreds of millions of parameters; however, this large number of parameters incurs expensive computational costs, making them difficult to deploy in scenarios with limited hardware resources. To address this issue, the field of natural language processing currently employs two solutions: 1) static methods, which compress a certain dimension of the model to reduce its parameters; and 2) dynamic methods, which flexibly select inference paths for each sample.
[0003] To address the aforementioned issues, the first existing technical solution proposes a novel knowledge distillation method for BERT models. This method effectively transfers a large amount of knowledge from a large "teacher" model to a small "student" model. A two-stage learning framework for BERT model training is also introduced, which performs knowledge distillation operations in both the pre-training and task-specific learning stages, reducing model size and accelerating inference. However, the BERT model is redundant in multiple dimensions, and current compression research mainly focuses on a single dimension (network layer), neglecting the fact that combining multiple compression techniques can effectively leverage their respective advantages.
[0004] The second existing technical solution proposes an early exit method for network layers. This method adds an internal classifier to each intermediate layer of the model and trains all intermediate-layer classifiers using the weighted sum of the cross-entropy losses of all internal classifiers as the training loss. This allows inference results to be obtained for samples at early layers, without running inference on the complete model, thus reducing inference time. However, because the training loss is the sum of the cross-entropy losses of all internal classifiers, it requires every internal classifier to correctly predict every sample. In reality, during the inference phase, as long as at least one internal classifier can correctly predict a sample, the inference process can be accelerated without affecting accuracy.
Summary of the Invention
[0005] In order to at least partially solve one of the technical problems existing in the prior art, the present invention aims to provide a method, system, device and storage medium for accelerating inference of a language model based on weight pruning and early exit of network layers.
[0006] The first technical solution adopted in this invention is:
[0007] A method for accelerating inference in a language model includes the following steps:
[0008] Build and train a language model;
[0009] Structured pruning techniques are used to prune the feedforward neural network neurons and attention heads in the language model;
[0010] The pruned language model is subjected to low-rank decomposition of the word embedding matrix to obtain the compressed model;
[0011] Based on the iterative training technique of knowledge distillation, the original language model before pruning is used as the teacher model to guide the training of the compressed model and obtain the final language model.
[0012] Based on a reinforcement learning framework, a reward function is designed to train an internal classifier, allowing the final language model to exit dynamically.
[0013] Furthermore, designing the reward function, training the internal classifiers, and allowing the final language model to exit dynamically includes:
[0014] Inserting an internal classifier into each intermediate layer of the language model, so that a sample can exit at an earlier classifier rather than at the final classifier;
[0015] Designing a reward function that accounts for acceleration performance, and determining the exit layer with a policy network, so as to balance accuracy and speed.
[0016] Furthermore, designing a reward function that accounts for acceleration performance and determining the exit layer with a policy network includes:
[0017] The likelihood of the model predicting a label is used as part of the reward function; the greater the likelihood of the predicted label, the greater the reward.
[0018] For acceleration performance, the layer at which a sample exits is considered: a sample that exits at an earlier network layer receives a larger reward, as shown in the following reward function:
[0019]
[0020] In the formula, y is the true label, $P_t(x)$ is the predicted probability of the internal classifier at layer t, H(·) is the cross-entropy function used to measure the difference between the final output and the true label, α is a hyperparameter, and $a_t$ is the action taken at layer t; the action space is represented by the numbers 0 and 1, denoting exiting or continuing.
[0021] Furthermore, the objective function of the policy network is expressed as:
[0022]
[0023] In the formula, τ represents the action trajectory of the sample, and T is the total number of trajectories; $\mathbb{E}$ denotes the expected value over action trajectories; $\pi(a_t \mid s_t; \theta)$ represents the probability of taking action $a_t$ given the current state $s_t$; and R(τ) represents the reward value of the action trajectory;
[0024] Add a classifier to each layer. The weight parameters of all classifiers are denoted as w, and their objective function is expressed as:
[0025]
[0026] In the formula, (x, y) represents the input and label in the training set D; H() represents the cross-entropy loss function; 1(·) is an indicator function that returns 1 if the condition in parentheses is met, and 0 otherwise.
[0027] Furthermore, the use of structured pruning techniques to prune the feedforward neural network neurons and attention heads in the language model includes:
[0028] Importance scores for individual weights are calculated using a metric based on a first-order Taylor expansion.
[0029]
[0030] In the formula, $W_{ij}$ is the weight in the i-th row and j-th column of the weight matrix W; x is a sample in the training dataset D; $\mathbb{E}$ denotes the expected value; and $\mathcal{L}$ is the loss function;
[0031] Less important feedforward neural network neurons and attention heads are eliminated based on the calculated importance scores.
[0032] Furthermore, the step of performing low-rank decomposition of the word embedding matrix on the pruned language model to obtain the compressed model includes:
[0033] The model is compressed by decomposing the embedding matrix into two smaller matrices using Singular Value Decomposition (SVD), as shown in the following formula:
[0034]
[0035] In the formula, the two factors are the decomposition matrices; |v| represents the vocabulary size, and d represents the hidden layer dimension of the model; $\Sigma = \mathrm{diag}(\sigma_1, \ldots, \sigma_r)$ is the diagonal matrix formed by the singular values $\sigma_i$, where r is the rank of the matrix and satisfies $r \ll \min(|v|, d)$; $U_i$ and $V_i$ are the i-th column of U and the i-th row of V, respectively.
[0036] Furthermore, the step of using the original language model before pruning as a teacher model to guide the training of the compressed model to obtain the final language model includes:
[0037] The compressed model is used as the student model, and the prediction results of the teacher model are used as soft labels to train the student model. The loss of the prediction results is represented by the soft cross-entropy loss function:
[0038]
[0039] In the formula, $z^T$ and $z^S$ are the predicted logits of the teacher and the student, respectively, and H(·) represents the cross-entropy loss function;
[0040] Simultaneously, the intermediate hidden states of the teacher model are used as additional supervision signals to guide the training of the student model. The loss on the intermediate hidden states is measured by the mean squared error:
[0041]
[0042] In the formula, the two arguments of MSE(·) are the hidden states of the teacher and the student at the l-th layer, respectively; MSE(·) denotes the mean squared error; L is the total number of layers in the model;
[0043] The total training loss is the sum of the two losses above. Its purpose is to enable the student model to better learn the knowledge of the teacher model, thereby compensating to some extent for the performance loss caused by pruning.
[0044] The second technical solution adopted in this invention is:
[0045] An inference acceleration system for a language model, comprising:
[0046] The model training module is used to build and train language models;
[0047] The model pruning module is used to prune the feedforward neural network neurons and attention heads in the language model using structured pruning techniques;
[0048] The model compression module is used to perform low-rank decomposition of the word embedding matrix on the pruned language model to obtain the compressed model.
[0049] The distillation training module is used for iterative training techniques based on knowledge distillation. It uses the original language model before pruning as the teacher model to guide the training of the compressed model and obtain the final language model.
[0050] The reinforcement learning module is used to design reward functions based on the reinforcement learning framework, train internal classifiers, and allow the final language model to exit dynamically.
[0051] The third technical solution adopted in this invention is:
[0052] An inference acceleration device for a language model, comprising:
[0053] At least one processor;
[0054] At least one memory for storing at least one program;
[0055] When the at least one program is executed by the at least one processor, the at least one processor implements the method described above.
[0056] The fourth technical solution adopted in this invention is:
[0057] A computer-readable storage medium storing a processor-executable program, which, when executed by a processor, performs the method described above.
[0058] The beneficial effects of this invention are: it compresses the language model from different dimensions, reducing a large number of redundant parameters and solving the problem of language model application under limited hardware resources. Furthermore, by employing an iterative training method and utilizing knowledge distillation techniques, this invention effectively reduces performance loss caused by structural differences after pruning.
Description of the Drawings
[0059] To more clearly illustrate the technical solutions in the embodiments of the present invention or the prior art, the following description is provided with accompanying drawings of the relevant technical solutions in the embodiments of the present invention or the prior art. It should be understood that the accompanying drawings described below are only for the purpose of clearly illustrating some embodiments of the technical solutions of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.
[0060] Figure 1 is a flowchart illustrating the steps of a language model inference acceleration method in an embodiment of the present invention;
[0061] Figure 2 is a schematic diagram of a method for integrating multiple acceleration technologies in an embodiment of the present invention;
[0062] Figure 3 is a schematic diagram of the iterative training method based on knowledge distillation in an embodiment of the present invention;
[0063] Figure 4 is a structural diagram of the BERT language model in an embodiment of the present invention.
Detailed Implementation
[0064] The embodiments of the present invention are described in detail below. Examples of these embodiments are shown in the accompanying drawings, wherein the same or similar reference numerals denote the same or similar elements or elements having the same or similar functions throughout. The embodiments described below with reference to the accompanying drawings are exemplary and are only used to explain the present invention, and should not be construed as limiting the present invention. The step numbers in the following embodiments are set only for ease of explanation, and there is no limitation on the order between the steps. The execution order of each step in the embodiments can be adaptively adjusted according to the understanding of those skilled in the art.
[0065] In the description of this invention, it should be understood that the orientation descriptions, such as up, down, front, back, left, right, etc., are based on the orientation or positional relationship shown in the accompanying drawings. They are only for the convenience of describing this invention and simplifying the description, and do not indicate or imply that the device or element referred to must have a specific orientation, or be constructed and operated in a specific orientation. Therefore, they should not be construed as limiting this invention.
[0066] In the description of this invention, "several" means one or more, "multiple" means two or more, "greater than," "less than," and "exceeding" are understood to exclude the stated number, while "above," "below," and "within" are understood to include the stated number. If "first" or "second" is used, it is only for distinguishing technical features and should not be construed as indicating or implying relative importance, or implicitly indicating the number of indicated technical features, or implicitly indicating the order of the indicated technical features. Furthermore, "and / or" describes the relationship between related objects, indicating that three relationships can exist. For example, A and / or B can represent: A alone, A and B simultaneously, or B alone. The character " / " generally indicates that the preceding and following related objects have an "or" relationship.
[0067] In the description of this invention, unless otherwise explicitly defined, terms such as "set up," "install," and "connect" should be interpreted broadly, and those skilled in the art can reasonably determine the specific meaning of the above terms in this invention in conjunction with the specific content of the technical solution.
[0068] To address the existing technical problems, this invention aims to solve the problem of accelerating language model inference under limited hardware conditions. Existing technologies employ single-dimensional compression methods, neglecting the advantages of multi-dimensional compression. This invention compresses the language model from different dimensions (including attention head and neuron weight pruning, and low-rank decomposition of the word embedding matrix), reducing a large number of redundant parameters and solving the challenge of language model application under limited hardware resources. This invention uses an iterative training method and knowledge distillation techniques to reduce performance loss caused by structural differences after pruning. Furthermore, at the network layer level, this invention employs an early exit method based on reinforcement learning, designing reward functions to train multiple policy networks to maintain consistency between training and inference processes. This allows for flexible adjustment of the inference process when handling different samples, achieving an optimal balance between speed and accuracy.
[0069] As shown in Figure 1, this embodiment provides a method for accelerating inference in a language model, including the following steps:
[0070] S1. Construct and train the language model;
[0071] S2. Using structured pruning techniques, the feedforward neural network neurons and attention heads in the language model are pruned.
[0072] S3. Perform low-rank decomposition of the word embedding matrix on the pruned language model to obtain the compressed model;
[0073] S4. Iterative training technique based on knowledge distillation: using the original language model before pruning as the teacher model to guide the training of the compressed model and obtain the final language model.
[0074] S5. Based on the reinforcement learning framework, design a reward function, train an internal classifier, and allow the final language model to exit dynamically.
[0075] Inspired by model compression based on weight pruning and dynamic inference techniques based on early exit of network layers, this invention proposes a language model acceleration method that integrates weight pruning and early exit of network layers. This method includes weight pruning of attention heads and feedforward layer neurons, low-rank decomposition of word embedding matrices, iterative training methods based on knowledge distillation, and early exit strategies of network layers based on reinforcement learning. These techniques reduce redundant computations in different dimensions.
[0076] This embodiment uses the BERT language model as an example. By comprehensively utilizing the above techniques, inference efficiency can be significantly improved while ensuring model performance. It should be noted that this embodiment uses the BERT language model as an example for explanation for convenience, but the language model is not limited to the BERT language model. Other language models are also applicable to the method of this application, that is, other language models also fall within the protection scope of this application.
[0077] (1) BERT model
[0078] BERT is a pre-trained language model derived from the encoder part of a Transformer. Its overall framework consists of stacked Transformer encoder layers. Each encoder layer comprises a multi-head attention layer and a feedforward neural network layer. The model has a total of 12 layers, with 12 attention heads per layer, as shown in Figure 4.
[0079] (2) Weight pruning
[0080] This invention employs structured pruning techniques to compress the BERT model, removing unimportant feedforward neural network neurons and some attention heads, as shown in ② and ③ of Figure 2. This invention uses a metric based on a first-order Taylor expansion to calculate the importance score of individual weights:
[0081]
[0082] Here $W_{ij}$ is the weight in the i-th row and j-th column of the weight matrix W, x is a sample in the training dataset D, and $\mathbb{E}$ denotes the expected value. This metric approximates the change in the loss function when a specific weight is removed. The importance of neurons varies. For feedforward neural network neurons, the weight parameters connected to the input and output of an intermediate neuron are used to calculate the importance of that neuron. For attention heads, the importance score of each head is the sum of the importance scores of its neurons in the output weight matrix of the attention module. Finally, based on the target compression ratio, the less important neurons or attention heads are removed.
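The scoring and selection procedure described above can be sketched in a few lines. This is a minimal illustration, not the patented implementation: it assumes the common form of the first-order Taylor criterion, |∂L/∂W_ij · W_ij| averaged over samples, and it assumes a neuron's score is the plain sum of the scores of its incoming and outgoing weights. The function names and the toy numbers are hypothetical.

```python
def weight_importance(weights, grads_per_sample):
    """Average |dL/dW_ij * W_ij| over samples, for every weight in one matrix."""
    rows, cols = len(weights), len(weights[0])
    scores = [[0.0] * cols for _ in range(rows)]
    for grads in grads_per_sample:
        for i in range(rows):
            for j in range(cols):
                scores[i][j] += abs(grads[i][j] * weights[i][j])
    n = len(grads_per_sample)
    return [[s / n for s in row] for row in scores]


def neuron_importance(in_scores, out_scores):
    """Score each intermediate neuron by the weights entering and leaving it.

    in_scores:  [hidden][intermediate] scores of the input projection
    out_scores: [intermediate][hidden] scores of the output projection
    """
    totals = []
    for k in range(len(out_scores)):
        incoming = sum(row[k] for row in in_scores)
        outgoing = sum(out_scores[k])
        totals.append(incoming + outgoing)
    return totals


def neurons_to_keep(importance, compression_ratio):
    """Indices of the neurons that survive after dropping the least important."""
    n_keep = max(1, round(len(importance) * (1.0 - compression_ratio)))
    ranked = sorted(range(len(importance)), key=lambda k: importance[k], reverse=True)
    return sorted(ranked[:n_keep])


# Toy feedforward block: hidden size 2, 3 intermediate neurons (made-up numbers).
w_in = [[1.0, 0.1, -2.0], [0.5, 0.2, 1.0]]
w_out = [[1.0, 1.0], [0.1, 0.1], [2.0, -1.0]]
g_in = [[[0.1] * 3, [0.1] * 3]]              # gradients of w_in for one sample
g_out = [[[0.1] * 2 for _ in range(3)]]      # gradients of w_out for one sample

importance = neuron_importance(weight_importance(w_in, g_in),
                               weight_importance(w_out, g_out))
kept = neurons_to_keep(importance, compression_ratio=0.3)  # keeps neurons 0 and 2
```

In a real model the gradients would come from backpropagation over batches of the training set, and attention heads would be scored analogously from the output projection of the attention module.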
[0083] (3) Low-rank decomposition of word embedding matrix
[0084] Low-rank decomposition is an effective dimensionality reduction technique that decomposes a large matrix into the product of several smaller matrices, thereby reducing the number of parameters while preserving as much information as possible. By performing low-rank decomposition on the embedding matrix, it can be decomposed into the product of two smaller matrices, as shown in ④ of Figure 2, which reduces the number of model parameters. This embodiment of the invention uses Singular Value Decomposition (SVD) to decompose the embedding matrix into two smaller matrices to compress the model, as shown in the following formula:
[0085]
[0086] Here the two factors are the decomposition matrices, |v| is the vocabulary size, and d is the hidden layer dimension of the model. $\Sigma = \mathrm{diag}(\sigma_1, \ldots, \sigma_r)$ is the diagonal matrix formed by the singular values $\sigma_i$, where r is the rank of the matrix and satisfies $r \ll \min(|v|, d)$. $U_i$ and $V_i$ are the i-th column of U and the i-th row of V, respectively.
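To see why the condition r ≪ min(|v|, d) matters, it helps to count parameters. The sketch below assumes the embedding matrix of shape |v| × d is replaced by two factors of shapes |v| × r and r × d, which is the usual reading of a rank-r truncated SVD. The vocabulary size 30,522 and hidden size 768 are the published BERT-base values; the rank 128 is an illustrative assumption, not a value taken from this document.

```python
def embedding_params(vocab_size, hidden_dim):
    """Parameters in the full embedding matrix (vocab_size x hidden_dim)."""
    return vocab_size * hidden_dim


def factorized_params(vocab_size, hidden_dim, rank):
    """Parameters in the two factors: (vocab_size x rank) and (rank x hidden_dim)."""
    return rank * (vocab_size + hidden_dim)


def compression_factor(vocab_size, hidden_dim, rank):
    """How many times smaller the factorized embedding is."""
    return embedding_params(vocab_size, hidden_dim) / factorized_params(
        vocab_size, hidden_dim, rank
    )


full = embedding_params(30522, 768)            # 23,440,896 parameters
low_rank = factorized_params(30522, 768, 128)  # 4,005,120 parameters
factor = compression_factor(30522, 768, 128)   # roughly 5.85x smaller
```

The saving vanishes as r approaches min(|v|, d), which is why the rank must be much smaller than both dimensions.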
[0087] (4) Iterative training method based on knowledge distillation
[0088] Weight pruning and low-rank decomposition reduce the number of model parameters and improve inference efficiency, but they face a potential challenge: performance degradation after compression. To address this, this invention employs an iterative training technique based on knowledge distillation. The original BERT model before pruning is used as the teacher model to guide the training of the pruned model (i.e., the student model). To further improve the student model, training proceeds iteratively: a pruning frequency is set, a portion of the parameters is removed at each pruning step, and the model is then fine-tuned with knowledge distillation so that the student's performance is gradually recovered, until the parameter count reaches the target value, as shown in Figure 3. Specifically, in this embodiment of the invention, the prediction results of the teacher model are used as soft labels to train the student model, and the loss on the prediction results is given by the soft cross-entropy loss function:
[0089]
[0090] Where $z^T$ and $z^S$ are the predicted logits of the teacher and the student, respectively, and H(·) represents the cross-entropy loss function. Simultaneously, the intermediate hidden states of the teacher model are used as additional supervision signals to guide the training of the student model. The loss on the intermediate hidden states is measured by the mean squared error:
[0091]
[0092] Here the two arguments of MSE(·) are the hidden states of the teacher and the student at layer l, respectively, and MSE(·) denotes the mean squared error. The total training loss is the sum of the two losses above. This allows the student model to better learn the knowledge of the teacher model, thus compensating to some extent for the performance loss caused by pruning.
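A minimal sketch of the two distillation terms follows. It makes several assumptions that the text does not settle: the soft cross-entropy is taken between the softmax of the teacher logits and the softmax of the student logits with no temperature, the per-layer MSE terms are summed rather than averaged, and teacher and student hidden states have the same width. All function names are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def soft_cross_entropy(teacher_logits, student_logits):
    """Cross-entropy of the student distribution against the teacher's soft labels."""
    p_teacher = softmax(teacher_logits)
    p_student = softmax(student_logits)
    return -sum(pt * math.log(ps) for pt, ps in zip(p_teacher, p_student))


def mse(a, b):
    """Mean squared error between two equally sized vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def hidden_state_loss(teacher_hidden, student_hidden):
    """Sum over layers of the MSE between teacher and student hidden states."""
    return sum(mse(t, s) for t, s in zip(teacher_hidden, student_hidden))


def distillation_loss(teacher_logits, student_logits, teacher_hidden, student_hidden):
    """Total loss: prediction term plus hidden-state term."""
    return soft_cross_entropy(teacher_logits, student_logits) + hidden_state_loss(
        teacher_hidden, student_hidden
    )
```

The prediction term is smallest when the student reproduces the teacher's distribution, and the hidden-state term is zero when every layer matches, so the sum pulls the pruned student toward the teacher at both the output and the intermediate layers.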
[0093] (5) Early Exit Strategy for Network Layers Based on Reinforcement Learning
[0094] The early exit method for network layers aims to address the problem of computational redundancy in network layers. This method inserts an internal classifier into each intermediate layer of the model, allowing samples to exit at an early classifier rather than the final classifier. This embodiment of the invention employs a reinforcement learning framework, designs a reward function that accounts for acceleration performance, and determines the exit layer with a policy network to balance accuracy and speed, as shown in ① of Figure 2.
[0095] In this embodiment of the invention, the probability that the model assigns to the predicted label is used as part of the reward function: the higher this probability, the greater the reward. For acceleration performance, the layer at which a sample exits is considered: a sample that exits at an earlier network layer obtains a larger reward. The specific reward function is as follows:
[0096]
[0097] Where y is the true label, $P_t(x)$ is the predicted probability of the internal classifier at layer t, H(·) is the cross-entropy function used to measure the difference between the final output and the true label, α is a hyperparameter, and $a_t$ is the action taken at layer t; the action space is represented by the numbers 0 and 1, denoting exiting or continuing.
[0098] In this embodiment of the invention, a policy network $\pi(a_t \mid s_t; \theta)$ is added to each layer of the BERT model. Based on the hidden state $s_t$ of the sample at layer t, it generates a probability distribution over the action $a_t$. The weight parameters of the policy network are denoted as θ. This invention aims to optimize the policy network to obtain the maximum reward; the objective function of the policy network is expressed as:
[0099]
[0100] Where τ represents the action trajectory of the sample, and T is the total number of trajectories. Meanwhile, the early exit method adds a classifier to each layer, with the weight parameters of all classifiers denoted as w, and its objective function expressed as:
[0101]
[0102] Where (x, y) represents the input and label in the training set D, and 1(·) is an indicator function that returns 1 if the condition in parentheses is met, otherwise it returns 0.
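At inference time, the per-layer policy and internal classifiers described above combine into a simple loop. The sketch below is a hypothetical illustration only: it assumes each policy returns the probability of exiting and that exit is triggered by a fixed threshold, whereas the text specifies only that the policy outputs a distribution over the two actions. The layers, policies, and classifiers are stand-in callables, not BERT components.

```python
def early_exit_inference(x, layers, policies, classifiers, exit_threshold=0.5):
    """Run layers in order and stop at the first layer whose policy chooses to exit.

    layers:      callables mapping a state to the next state
    policies:    callables mapping a state to the probability of exiting
    classifiers: callables mapping a state to a prediction
    Returns (prediction, number_of_layers_executed).
    """
    if not layers:
        raise ValueError("at least one layer is required")
    state = x
    last = len(layers) - 1
    for t, layer in enumerate(layers):
        state = layer(state)
        # The final layer always produces the answer if no earlier exit happened.
        if t == last or policies[t](state) >= exit_threshold:
            return classifiers[t](state), t + 1


# Toy stand-ins: a scalar "hidden state" that grows by 1 per layer.
layers = [lambda s: s + 1.0] * 4
policies = [lambda s: 1.0 if s >= 2.0 else 0.0] * 4
classifiers = [lambda s: "positive" if s > 0 else "negative"] * 4

easy = early_exit_inference(0.0, layers, policies, classifiers)    # exits after 2 layers
hard = early_exit_inference(-10.0, layers, policies, classifiers)  # runs all 4 layers
```

Samples the policy considers easy leave after a few layers, while the rest fall through to the final classifier, which is where the FLOPs saving comes from.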
[0103] (6) GLUE dataset
[0104] To verify the effectiveness of this invention, experiments were conducted using the GLUE dataset. This dataset includes multiple natural language processing tasks, such as natural language inference, textual entailment, sentiment analysis, and semantic similarity, all in English.
[0105] MNLI (The Multi-Genre Natural Language Inference Corpus) is a collection of sentence pairs annotated with textual entailment labels through crowdsourcing, used for natural language inference tasks. Given a premise statement and a hypothesis statement, the task is to predict whether the premise entails the hypothesis, contradicts the hypothesis, or is neutral.
[0106] Number of samples: 392,702 in training set, 9,815 in development set (dev-matched), and 9,796 in test set (test-matched).
[0107] Evaluation criterion: accuracy.
[0108] MRPC (The Microsoft Research Paraphrase Corpus) is a similarity and paraphrase task. Its sentence pairs are automatically extracted from online news sources and manually annotated to indicate whether the two sentences in a pair are semantically equivalent. The categories are unbalanced, with 68% being positive samples.
[0109] Number of samples: 3,668 in training set, 408 in development set, and 1,725 in test set.
[0110] Evaluation criterion: F1 score
[0111] SST-2: The Stanford Sentiment Treebank, a single-sentence classification task containing sentences from movie reviews and their sentiment annotations. This task assigns a sentiment to a given sentence, categorized into two classes: positive sentiment (labeled 1) and negative sentiment (labeled 0), using only sentence-level labels. This is a binary classification task.
[0112] Number of samples: 67,350 in training set, 873 in development set, and 1,821 in test set.
[0113] Evaluation criterion: accuracy.
[0114] QQP (The Quora Question Pairs) is a similarity and paraphrase task based on a set of question pairs from the community question-and-answer website Quora. The task is to determine whether a pair of questions is semantically equivalent. Like MRPC, QQP has imbalanced classes, but unlike MRPC, QQP has 63% negative samples and 37% positive samples.
[0115] Number of samples: 363,870 in training set, 40,431 in development set, and 390,965 in test set.
[0116] Evaluation criterion: F1 score
[0117] QNLI: Question-answering Natural Language Inference, a natural language inference task. QNLI is derived from another dataset, The Stanford Question-Answering Dataset (SQuAD 1.0). SQuAD 1.0 is a question-answering dataset consisting of question-paragraph pairs, where the paragraphs are from Wikipedia, and a sentence in the paragraph contains the answer to the question.
[0118] Number of samples: 104,743 in training set, 5,463 in development set, and 5,461 in test set.
[0119] Evaluation criterion: accuracy.
[0120] RTE: The Recognizing Textual Entailment datasets, a natural language inference task. It is a collection of datasets from a series of annual textual entailment challenges. All these datasets are converted into binary classification. For ternary classification data, to maintain consistency, neutral and contradictory data are converted into non-entailment.
[0121] Number of samples: 2,491 in the training set, 277 in the development set, and 3,000 in the test set.
[0122] Evaluation criterion: accuracy.
[0123] (7) Experimental Results
[0124] Accuracy is a metric used to evaluate a model for classification tasks. Taking binary classification as an example, it can be calculated based on the positive and negative classes as follows:
[0125] $\mathrm{Accuracy} = \dfrac{TP + TN}{TP + TN + FP + FN}$
[0126] In this context, TP represents true positives, TN represents true negatives, FP represents false positives, and FN represents false negatives.
[0127] The F1 score is a statistical metric for evaluating binary classification (or multi-task binary classification) models. It takes both the precision and the recall of the classification model into account. Precision is the proportion of samples identified as positive that are actually positive. The formula for precision is defined as follows:
[0128] $\mathrm{Precision} = \dfrac{TP}{TP + FP}$
[0129] Recall is the proportion of positive samples that are correctly identified as positive. The formulas for recall and F1 score are defined as follows:
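[0130] $\mathrm{Recall} = \dfrac{TP}{TP + FN}$
[0131] $F1 = \dfrac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$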
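The four metrics can be computed directly from the confusion-matrix counts using their standard definitions: accuracy is (TP + TN) over all samples, precision is TP / (TP + FP), recall is TP / (TP + FN), and F1 is the harmonic mean of precision and recall. The helper below is an illustrative sketch; its name and its convention of returning 0.0 when a denominator is zero are assumptions, and the counts in the usage lines are made up.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall, and F1 from binary confusion-matrix counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Made-up counts: 8 true positives, 5 true negatives, 2 false positives, 5 false negatives.
metrics = classification_metrics(tp=8, tn=5, fp=2, fn=5)
```

On imbalanced datasets such as MRPC and QQP, accuracy can look good for a classifier that favors the majority class, which is why F1 is the evaluation criterion there.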
[0132] FLOPs (floating-point operations) denotes the number of floating-point operations performed, which can be used to measure the computational complexity of an algorithm or model. This invention uses this metric to measure the inference efficiency of the method.
[0133] Table 1 below shows the experimental results of the embodiments of the present invention. All tests were conducted on the validation set of each dataset, and the experimental equipment was a server equipped with a 3090 graphics card.
[0134] Table 1 Experimental Results
[0135]
[0136] Experimental results from embodiments of this invention show that, at the cost of a 3.3% decrease in average accuracy relative to the original BERT model, the method achieves a 17.06-fold FLOPs speedup. The model of this invention therefore achieves higher computational efficiency with the same computing resources. Compared to the existing early exit method DeeBERT, the speedup achieved by this invention is more significant.
[0137] In summary, this embodiment has at least the following advantages and beneficial effects compared to the prior art:
[0138] (1) This invention combines acceleration methods along multiple dimensions: weight pruning and low-rank decomposition along the width, and early exit of network layers along the depth. By integrating these static and dynamic acceleration methods, the inference efficiency of the language model is improved while model performance is preserved, thereby achieving more efficient model deployment under limited computing resources.
[0139] (2) The present invention uses the knowledge distillation method to iteratively train the pruned model and uses the knowledge of the original BERT model to guide the training of the student model, thereby alleviating the performance loss after model compression.
[0140] (3) This invention introduces a reinforcement learning framework and designs a reward function that enables the model to autonomously select the layer at which to exit based on the current sample, thereby solving the problem of network layer computational redundancy and improving the speed and accuracy of the model.
[0141] (4) Regarding model width, this invention employs static weight pruning and low-rank decomposition to reduce the number of model parameters, and mitigates the performance loss caused by model compression through iterative training based on knowledge distillation. As for model depth, a dynamic network layer early exit strategy is adopted to further reduce redundant computation and improve inference efficiency.
[0142] (5) In terms of the network layer (depth) dimension, compared to traditional static pruning methods, the dynamic early exit strategy can autonomously select the inference path for each sample, thus better solving the problem of network layer computational redundancy. Current methods use a weighted sum of the cross-entropy losses of all internal classifiers during training, aiming for all internal classifiers to correctly predict all training samples, even though at inference each sample is predicted only by the classifier at which it exits. This invention adopts a reinforcement learning-based network layer early exit strategy, keeping the model's behavior consistent between training and inference and providing a more reliable foundation for model application.
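The dynamic early exit along the depth dimension can be pictured with the following minimal Python sketch of the inference loop only. It does not reproduce the reward function or the policy training of this invention. The names `early_exit_forward`, `policy`, and `threshold` are hypothetical, and the rule "exit when the policy's exit probability reaches a threshold" is an assumption made for illustration. Layers, internal classifiers, and the policy are passed in as plain callables.

```python
import numpy as np


def softmax(z: np.ndarray) -> np.ndarray:
    """Numerically stable softmax over a 1-D logit vector."""
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()


def early_exit_forward(x, layers, classifiers, policy, threshold=0.5):
    """Run layers in order and stop at the first layer where the policy says exit.

    layers[i] maps a hidden state to the next hidden state, classifiers[i] maps a
    hidden state to class logits, and policy(h, i) returns an exit probability
    (hypothetical interface). The last layer always exits.
    Returns (class probabilities, 1-based index of the exit layer).
    """
    h = x
    for i, (layer, clf) in enumerate(zip(layers, classifiers), start=1):
        h = layer(h)
        is_last = i == len(layers)
        if is_last or policy(h, i) >= threshold:
            return softmax(clf(h)), i
    raise ValueError("at least one layer is required")


if __name__ == "__main__":
    # Toy stand-ins: each "layer" adds 1, each classifier emits two logits,
    # and the policy chooses to exit from layer 2 onwards.
    layers = [lambda h: h + 1.0] * 4
    classifiers = [lambda h: np.array([h.sum(), 0.0])] * 4
    probs, exit_layer = early_exit_forward(
        np.zeros(3), layers, classifiers, policy=lambda h, i: 1.0 if i >= 2 else 0.0
    )
    print(exit_layer, probs)
```

In this toy run only two of the four layers are evaluated, which is the source of the depth-wise saving: layers after the exit point are never computed for that sample.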
[0143] This embodiment also provides an inference acceleration system for language models, including:
[0144] The model training module, used to build and train a language model;
[0145] The model pruning module, used to prune the feedforward neural network neurons and attention heads in the language model using structured pruning techniques;
[0146] The model compression module, used to perform low-rank decomposition of the word embedding matrix on the pruned language model to obtain the compressed model;
[0147] The distillation training module, used to apply iterative training based on knowledge distillation, with the original language model before pruning serving as the teacher model that guides the training of the compressed model to obtain the final language model;
[0148] The reinforcement learning module, used to design a reward function based on the reinforcement learning framework, train internal classifiers, and allow the final language model to exit dynamically.
[0149] This embodiment of the language model inference acceleration system can execute the language model inference acceleration method provided in the method embodiment of the present invention, and can execute any combination of the implementation steps of the method embodiment, and has the corresponding functions and beneficial effects of the method.
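As a rough illustration of what the model compression module and the distillation training module compute, the following Python sketch factors an embedding matrix into two smaller matrices by truncated singular value decomposition, and evaluates a distillation loss as the sum of a soft cross-entropy on the logits and a mean squared error on the hidden states. The function names, the retained rank, and the assumption that teacher and student hidden states have the same shape are illustrative choices, not details confirmed by the source.

```python
import numpy as np


def low_rank_embedding(E: np.ndarray, rank: int):
    """Factor E (vocab x hidden) into A (vocab x rank) and B (rank x hidden) via truncated SVD."""
    U, s, Vt = np.linalg.svd(E, full_matrices=False)
    A = U[:, :rank] * s[:rank]  # scale the kept left singular vectors by their singular values
    B = Vt[:rank, :]
    return A, B


def log_softmax(z: np.ndarray) -> np.ndarray:
    """Numerically stable log-softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))


def soft_cross_entropy(teacher_logits: np.ndarray, student_logits: np.ndarray) -> float:
    """Cross-entropy of the student distribution against the teacher's soft labels."""
    p_teacher = np.exp(log_softmax(teacher_logits))
    return float(-(p_teacher * log_softmax(student_logits)).sum(axis=-1).mean())


def hidden_mse(teacher_hidden, student_hidden) -> float:
    """Sum over layers of the mean squared error between hidden states (same shapes assumed)."""
    return float(sum(np.mean((t - s) ** 2) for t, s in zip(teacher_hidden, student_hidden)))


def distillation_loss(teacher_logits, student_logits, teacher_hidden, student_hidden) -> float:
    """Total loss: prediction loss plus hidden-state loss."""
    return soft_cross_entropy(teacher_logits, student_logits) + hidden_mse(teacher_hidden, student_hidden)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    E = rng.standard_normal((1000, 64))  # hypothetical vocab x hidden embedding matrix
    A, B = low_rank_embedding(E, rank=16)
    print("parameters:", E.size, "->", A.size + B.size)
```

The factorization replaces vocab × hidden parameters with rank × (vocab + hidden) parameters, so it only saves memory when the retained rank is well below the hidden size.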
[0150] This embodiment also provides an inference acceleration device for language models, including:
[0151] At least one processor;
[0152] At least one memory for storing at least one program;
[0153] When the at least one program is executed by the at least one processor, the at least one processor implements the method shown in Figure 1.
[0154] This embodiment of the inference acceleration device for a language model can execute the inference acceleration method for a language model provided in the method embodiment of the present invention, and can execute any combination of implementation steps of the method embodiment, and has the corresponding functions and beneficial effects of the method.
[0155] This application also discloses a computer program product or computer program, which includes computer instructions stored in a computer-readable storage medium. A processor of a computer device can read the computer instructions from the computer-readable storage medium and execute them, causing the computer device to perform the method shown in Figure 1.
[0156] This embodiment also provides a storage medium storing instructions or programs that can execute the language model inference acceleration method provided in the method embodiment of the present invention. When the instructions or programs are run, any combination of implementation steps of the method embodiment can be executed, with the corresponding functions and beneficial effects of the method.
[0157] In some alternative embodiments, the functions / operations mentioned in the block diagrams may not occur in the order shown in the operation diagrams. For example, depending on the functions / operations involved, two consecutively shown blocks may actually be executed substantially simultaneously, or the blocks may sometimes be executed in reverse order. Furthermore, the embodiments presented and described in the flowcharts of this invention are provided by way of example to provide a more comprehensive understanding of the technology. The disclosed methods are not limited to the operations and logic flows presented herein. Alternative embodiments are contemplated in which the order of various operations is altered and sub-operations described as part of a larger operation are executed independently.
[0158] Furthermore, although the invention has been described in the context of functional modules, it should be understood that, unless otherwise stated, one or more of the described functions and / or features may be integrated into a single physical device and / or software module, or one or more functions and / or features may be implemented in a separate physical device or software module. It is also understood that a detailed discussion of the actual implementation of each module is unnecessary for understanding the invention. Rather, given the properties, functions, and internal relationships of the various functional modules in the apparatus disclosed herein, the actual implementation of the module will be understood within the scope of conventional skill of an engineer. Therefore, those skilled in the art can implement the invention as set forth in the claims using ordinary techniques without excessive experimentation. It is also understood that the specific concepts disclosed are merely illustrative and not intended to limit the scope of the invention, which is determined by the full scope of the appended claims and their equivalents.
[0159] If the aforementioned functions are implemented as software functional units and sold or used as independent products, they can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of this invention, essentially, or the part that contributes to the prior art, or a portion of the technical solution, can be embodied in the form of a software product. This computer software product is stored in a storage medium and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of this invention. The aforementioned storage medium includes various media capable of storing program code, such as USB flash drives, portable hard drives, read-only memory (ROM), random access memory (RAM), magnetic disks, or optical disks.
[0160] The logic and / or steps represented in the flowchart or otherwise described herein can, for example, be considered a sequenced list of executable instructions for implementing logical functions, and can be embodied in any computer-readable medium for use by, or in conjunction with, an instruction execution system, apparatus, or device (such as a computer-based system, a system including a processor, or another system that can fetch instructions from an instruction execution system, apparatus, or device and execute them). For the purposes of this specification, "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transmit programs for use by, or in conjunction with, an instruction execution system, apparatus, or device.
[0161] More specific examples of computer-readable media (a non-exhaustive list) include: electrical connections (electronic devices) having one or more wires, portable computer diskettes (magnetic devices), random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), fiber optic devices, and portable compact disc read-only memory (CD-ROM). Furthermore, the computer-readable medium can even be paper or another suitable medium on which the program is printed, since the program can be obtained electronically, for example, by optically scanning the paper or other medium, followed by editing, interpreting, or otherwise processing it as necessary, and then stored in computer memory.
[0162] It should be understood that various parts of the present invention can be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, multiple steps or methods can be implemented in software or firmware stored in memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, it can be implemented using any one or a combination of the following techniques known in the art: discrete logic circuits having logic gates for implementing logical functions on data signals, application-specific integrated circuits (ASICs) having suitable combinational logic gates, programmable gate arrays (PGAs), field-programmable gate arrays (FPGAs), etc.
[0163] In the foregoing description of this specification, references to terms such as "one embodiment," "another embodiment," or "some embodiments" indicate that a specific feature, structure, material, or characteristic described in connection with an embodiment or example is included in at least one embodiment or example of the present invention. In this specification, illustrative expressions of the above terms do not necessarily refer to the same embodiment or example. Furthermore, the specific features, structures, materials, or characteristics described may be combined in any suitable manner in one or more embodiments or examples.
[0164] Although embodiments of the invention have been shown and described, those skilled in the art will understand that various changes, modifications, substitutions and alterations can be made to these embodiments without departing from the principles and spirit of the invention, the scope of which is defined by the claims and their equivalents.
[0165] The above is a detailed description of the preferred embodiments of the present invention. However, the present invention is not limited to the above embodiments. Those skilled in the art can make various equivalent modifications or substitutions without departing from the spirit of the present invention. All such equivalent modifications or substitutions are included within the scope defined by the claims of this application.
Claims
1. A method for accelerating inference in a BERT language model, characterized in that, Includes the following steps: Build and train a language model; Structured pruning techniques are used to prune the feedforward neural network neurons and attention heads in the language model; The pruned language model is subjected to low-rank decomposition of the word embedding matrix to obtain the compressed model; Based on the iterative training technique of knowledge distillation, the original language model before pruning is used as the teacher model to guide the training of the compressed model and obtain the final language model. Based on the reinforcement learning framework, a reward function is designed, an internal classifier is trained, and the final language model is dynamically exited. The design of the reward function, training of the internal classifier, and dynamic exit of the final language model include: Insert an internal classifier into each intermediate layer of the language model; The design incorporates a reward function that prioritizes performance acceleration, and determines the exit level based on the policy network to achieve a balance between accuracy and speed. The design considers a reward function that accelerates performance, and determines the exit layer based on the policy network, including: The likelihood of the model predicting a label is used as part of the reward function; the greater the likelihood of the predicted label, the greater the reward. For performance acceleration, the number of layers a sample exits is considered; if a sample exits at an earlier network layer, it can receive a larger reward, as shown in the following reward function: In the formula, It's a real label. It is the first The predicted probabilities of the classifiers within the layer. The cross-entropy function is used to measure the difference between the final output and the true label. 
For hyperparameters, actions It is in the The action taken by the layer is represented by numbers 0 and 1, with the action space being either exit or continue. The objective function of the policy network is expressed as: In the formula, The motion trajectory representing the sample, It represents the total number of trajectories; This represents the expected value of the motion trajectory. Represents the given current state Take action The probability, The reward value representing the action trajectory; Add a classifier to each layer, and the weight parameters of all classifiers are represented as follows: Its objective function is expressed as: In the formula, Indicates training set Input and labels in the text; Represents the cross-entropy loss function; It is an indicator function that returns 1 if the condition in parentheses is met, and 0 otherwise. Experiments were conducted using the GLUE dataset, which includes multiple natural language processing tasks, such as natural language inference, textual entailment, sentiment analysis, and semantic similarity, all in English.
2. The inference acceleration method for the BERT language model according to claim 1, characterized in that, The structured pruning technique is used to prune the feedforward neural network neurons and attention heads in the language model, including: Importance scores for individual weights are calculated using a metric based on a first-order Taylor expansion. In the formula, It is a weight matrix The Line number The weight of the column; It is the training dataset. One of the samples, Indicates the expected value; It is a loss function; Based on the calculated importance scores, feedforward neural network neurons and attention heads are eliminated.
3. The inference acceleration method for the BERT language model according to claim 1, characterized in that, The step of performing low-rank decomposition of the word embedding matrix on the pruned language model to obtain the compressed model includes: Singular value decomposition is used to decompose the embedding matrix into two matrices to compress the model, as shown in the following formula: in, and The decomposition matrix; Representative of singular values The diagonal matrix formed, where It is the rank of the matrix, satisfying the condition ; and They are The column sum The OK; Size of the vocabulary list.
4. The inference acceleration method for the BERT language model according to claim 1, characterized in that, The process of using the original language model before pruning as a teacher model to guide the training of the compressed model to obtain the final language model includes: The compressed model is used as the student model, and the prediction results of the teacher model are used as soft labels to train the student model. The loss of the prediction results is represented by the soft cross-entropy loss function: In the formula, and The predicted logarithms for teachers and students, respectively. Represents the cross-entropy loss function; Simultaneously, the intermediate hidden states of the teacher model are used as additional supervision signals to guide the training of the student model. The loss of the intermediate hidden states is represented by minimizing the mean squared error. In the formula, and The first for teachers and students respectively The hidden state of the layer This represents minimizing the mean square error; This represents the total number of layers in the model. Total training loss is and sum.
5. An inference acceleration system for the BERT language model, characterized in that, include: The model training module is used to build and train language models; The model pruning module is used to prune the feedforward neural network neurons and attention heads in the language model using structured pruning techniques; The model compression module is used to perform low-rank decomposition of the word embedding matrix on the pruned language model to obtain the compressed model. The distillation training module is used for iterative training techniques based on knowledge distillation. It uses the original language model before pruning as the teacher model to guide the training of the compressed model and obtain the final language model. The reinforcement learning module is used to design reward functions based on the reinforcement learning framework, train internal classifiers, and allow the final language model to exit dynamically. The design of the reward function, training of the internal classifier, and dynamic exit of the final language model include: Insert an internal classifier into each intermediate layer of the language model; The design incorporates a reward function that prioritizes performance acceleration, and determines the exit level based on the policy network to achieve a balance between accuracy and speed. The design considers a reward function that accelerates performance, and determines the exit layer based on the policy network, including: The likelihood of the model predicting a label is used as part of the reward function; the greater the likelihood of the predicted label, the greater the reward. For performance acceleration, the number of layers a sample exits is considered; if a sample exits at an earlier network layer, it can receive a larger reward, as shown in the following reward function: In the formula, It's a real label. It is the first The predicted probabilities of the classifiers within the layer. 
The cross-entropy function is used to measure the difference between the final output and the true label. For hyperparameters, actions It is in the The action taken by the layer is represented by numbers 0 and 1, with the action space being either exit or continue. The objective function of the policy network is expressed as: In the formula, The motion trajectory representing the sample, It represents the total number of trajectories; This represents the expected value of the motion trajectory. Represents the given current state Take action The probability, The reward value representing the action trajectory; Add a classifier to each layer, and the weight parameters of all classifiers are represented as follows: Its objective function is expressed as: In the formula, Indicates training set Input and labels in the text; Represents the cross-entropy loss function; It is an indicator function that returns 1 if the condition in parentheses is met, and 0 otherwise. Experiments were conducted using the GLUE dataset, which includes multiple natural language processing tasks, such as natural language inference, textual entailment, sentiment analysis, and semantic similarity, all in English.
6. An inference acceleration device for a language model, characterized in that it includes: at least one processor; and at least one memory for storing at least one program; wherein, when the at least one program is executed by the at least one processor, the at least one processor implements the method of any one of claims 1-4.
7. A computer-readable storage medium storing a processor-executable program, characterized in that the processor-executable program, when executed by a processor, is used to perform the method of any one of claims 1-4.