Lightweight visual transformer model with adaptive MLP pruning

CN122287751A · Pending · Publication Date: 2026-06-26 · CENT SOUTH UNIV

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Applications (China)
Current Assignee / Owner: CENT SOUTH UNIV
Filing Date: 2026-04-03
Publication Date: 2026-06-26


Abstract

This invention discloses an adaptive MLP pruning method for large-scale vision Transformers. It targets the practical deployment requirements of real-time visual perception, low power consumption, low latency, and high throughput in scenarios such as robots, drones, and mobile terminals, and it addresses the technical problems of existing large-scale vision Transformers, including redundant parameters, high computational and memory overhead, slow inference, and difficulty of edge deployment. The method first evaluates the importance of hidden neurons in the MLP using Taylor expansion combined with an information entropy criterion. It then sorts the neurons and, using a binary search algorithm, adaptively prunes them according to the redundancy of each MLP module. Finally, knowledge distillation restores the performance of the pruned model. The result is a reduction of approximately 40% in parameters and computational load and an inference speed of about 1.5 times the original. In multiple benchmarks, such as zero-shot image classification, retrieval, and kNN evaluation, the pruned model performs nearly on par with the original model and slightly surpasses it in some scenarios. The method does not rely on the original model's loss function or on additional modules, and it is compatible with token reduction methods. The invention thus provides a lightweight, high-performance, and easily adaptable solution for computer vision applications that require efficient model inference, and it supports low-cost engineering deployment and commercialization of large vision Transformer models.

Description

Technical Field

[0001] This invention belongs to the field of computer vision and deep learning model compression, and specifically concerns an adaptive MLP pruning method for large visual Transformer models. It makes the model lightweight while keeping performance nearly lossless, and it is suitable for visual Transformer application scenarios such as visual classification, cross-modal retrieval, and visual search.

Background Technology

[0002] Visual Transformers, with their excellent scalability, have achieved leading performance in computer vision tasks such as image classification, cross-modal retrieval, and object detection. Model accuracy continues to improve as capacity increases, making them the mainstream architecture for large-scale visual representation learning. However, in real-world applications such as autonomous robots, drone inspection, intelligent edge devices, and real-time visual perception, large-scale visual Transformers face severe deployment bottlenecks. Their parameter counts and computational loads grow sharply, resulting in high inference latency, low throughput, and high energy consumption, which fail to meet the stringent requirements of robots for real-time environmental perception, rapid decision-making, and continuous interaction. At the same time, edge devices have limited computing power and memory, making it difficult to deploy ultra-large-scale models directly.

[0003] In scenarios requiring real-time robot perception and efficient edge inference, existing pruning techniques cannot maintain performance while significantly reducing the number of parameters and FLOPs, nor can they meet the deployment requirements of low latency, high throughput, and low power consumption. Therefore, designing a label-independent, accurate, adaptive, and near-lossless MLP pruning scheme for large vision Transformers, one that significantly reduces computational and memory overhead while preserving model performance and improving inference speed, has become a key technical challenge for applying large vision Transformers in robots, drones, and edge devices.

Summary of the Invention

[0004] The technical problem addressed by this invention is that existing large-scale visual Transformer models have a large number of parameters and excessive computational and memory requirements. To overcome this shortcoming, the invention provides a lightweight visual Transformer method with adaptive MLP pruning. The method accurately evaluates the importance of neurons in the MLP modules and adaptively prunes redundant neurons. Combined with knowledge distillation to restore model performance, it significantly reduces the number of model parameters and the computational load without significant performance degradation, thus enabling lightweight deployment of large-scale visual Transformers.

[0005] To achieve the above objectives, the technical solution of the present invention is as follows:

[0006] A lightweight visual Transformer model construction method with adaptive MLP pruning includes the following steps:

[0007] Step 1: Based on Taylor expansion, introduce the unlabeled information entropy criterion to evaluate the importance score of hidden neurons in the MLP module of the visual Transformer model.

[0008] Step 2: Sort the hidden neurons of the MLP module according to the obtained importance scores, and use the binary search algorithm to adaptively prune the sorted neurons according to the redundancy of different MLP modules to avoid a preset fixed compression ratio.

[0009] Step 3: Use the original visual Transformer model as the teacher model and the pruned model as the student model to perform knowledge distillation and guide the performance recovery of the pruned model.

[0010] Furthermore, in step 1, the model prediction criterion is expressed as $E(H)$, where $H$ is the hidden feature set obtained after the dataset is input into the model. The importance of the k-th hidden neuron is measured by the magnitude of the change in the prediction criterion caused by pruning that neuron, as shown in the formula:

[0011] $I_k = \left| E(H) - E(H \mid h_k = 0) \right|$

[0012] where $h_k$ is the feature value of the neuron before pruning and $h_k = 0$ indicates that the k-th neuron is pruned. Approximating this change with a Taylor expansion gives an approximate expression for the importance of the neuron, $I_k \approx \left| g_k h_k \right|$, where $g_k = \partial E / \partial h_k$ is the gradient of $E$ with respect to $h_k$. For a visual Transformer with a sequence of N tokens, the final importance score of the k-th hidden neuron is:

[0013] $I_k = \left| \sum_{n=1}^{N} g_{k,n} h_{k,n} \right|$

[0014] where $h_{k,n}$ is the feature value of the k-th neuron for the n-th token and $g_{k,n}$ is the corresponding gradient. Traditional one-hot cross-entropy loss ignores the predictions for other potential categories, which distorts the importance assessment. To address this, unlabeled information entropy is introduced as the criterion for importance assessment. First, the similarity matrix $S \in \mathbb{R}^{B \times B}$ is obtained by computing the inter-instance similarity $S_{ij} = \frac{z_i^{\top} z_j}{\|z_i\| \, \|z_j\|}$, where $z_i$ is the representation of the i-th image output by the last block of the Transformer model and B is the number of images in the mini-batch. Then, a softmax operation is applied to the similarity matrix to obtain the predicted probability matrix $P$, whose entry for the i-th and j-th images is $P_{ij} = \frac{\exp(S_{ij} / \tau)}{\sum_{m=1}^{B} \exp(S_{im} / \tau)}$, where τ is the temperature coefficient, which scales the similarity range from [−1, 1] to [−1/τ, 1/τ]. Finally, the information entropy criterion is obtained:

[0015] $E = -\frac{1}{B} \sum_{i=1}^{B} \sum_{j=1}^{B} P_{ij} \log P_{ij}$

[0016] This criterion is used to assess the importance of hidden neurons in the MLP module.
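The information entropy criterion described above can be illustrated with a minimal pure-Python sketch. It is not the applicant's implementation: the function name `entropy_criterion`, the default temperature of 0.1, the inclusion of each image's similarity to itself, and the averaging over the batch are all assumptions made for illustration.

```python
import math


def entropy_criterion(class_tokens, tau=0.1):
    """Label-free information entropy of a mini-batch of image representations.

    class_tokens: list of B feature vectors (lists of floats), one per image.
    tau: temperature coefficient (the default 0.1 is an assumed value).
    """
    # L2-normalise so that dot products equal cosine similarities in [-1, 1].
    normed = []
    for z in class_tokens:
        norm = math.sqrt(sum(v * v for v in z))
        normed.append([v / norm for v in z])

    batch = len(normed)
    total = 0.0
    for i in range(batch):
        # Row i of the similarity matrix, scaled to [-1/tau, 1/tau].
        logits = [
            sum(a * b for a, b in zip(normed[i], normed[j])) / tau
            for j in range(batch)
        ]
        # Numerically stable softmax over row i.
        top = max(logits)
        exps = [math.exp(x - top) for x in logits]
        denom = sum(exps)
        probs = [e / denom for e in exps]
        # Shannon entropy of row i (terms with zero probability contribute 0).
        total -= sum(p * math.log(p) for p in probs if p > 0.0)
    # Averaging over the batch is an assumption of this sketch.
    return total / batch
```

When all images in the batch have identical representations, every row of the probability matrix is uniform and the entropy reaches its maximum of log B; when the representations are well separated and τ is small, each row concentrates on a single entry and the entropy approaches zero.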

[0017] Further, in step 2, the importance scores of all hidden neurons in the MLP modules obtained in step 1 are first sorted in descending order to determine the pruning order of the neurons and minimize the performance degradation after model compression. Let the hidden layer dimension of the MLP in the original model be denoted as $M$. The initial search range for pruning is $[M_{\min}, M_{\max}]$, where $M_{\min} = 0$ and $M_{\max} = M$. An information entropy increment threshold $\Delta E$ is set so that the change in information entropy after pruning stays within the threshold, which keeps the performance degradation within an acceptable range. A binary search algorithm is then used to search for the optimal pruned hidden layer dimension among the sorted neurons. In the t-th search step, the hidden layer dimension of the l-th Transformer block is pruned to $M_t^l$, the midpoint of the current search range, and the information entropy $E_t^l$ of the model with this hidden layer size is evaluated on the pruning dataset $\mathcal{D}$. If $E_t^l - E_0^l$ is less than $\Delta E$, where $E_0^l$ is the information entropy before block l is pruned, the search range is updated to $[M_{\min}, M_t^l]$ and the current optimal pruning size is recorded as $M_t^l$; otherwise, the search range is updated to $[M_t^l, M_{\max}]$ to reduce the number of pruned neurons. These binary search steps are repeated until the maximum number of search steps $T$ is reached or the search range has narrowed to 1. This yields the optimal hidden layer dimension after pruning for each MLP module and completes the adaptive pruning of the model's MLP modules, while the output dimension of the pruned model remains consistent with the original model.
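The binary search over the retained hidden layer size can be sketched as follows. This is an illustrative sketch under stated assumptions, not the patented procedure itself: `entropy_at` is a hypothetical callback standing in for a full evaluation on the pruning dataset, the candidate is interpreted as the hidden size that remains after pruning, and `toy_entropy` is a made-up curve used only to exercise the search.

```python
def search_hidden_size(entropy_at, full_size, delta_e, max_steps=6):
    """Binary-search a small retained hidden size for one MLP module.

    entropy_at(m): hypothetical callback returning the model's information
        entropy when the module keeps its m most important hidden neurons.
    full_size: hidden layer size of the unpruned module.
    delta_e: information entropy increment threshold.
    """
    base = entropy_at(full_size)      # entropy before pruning
    low, high = 0, full_size          # search range [M_min, M_max]
    best = full_size                  # fall back to the unpruned size
    for _ in range(max_steps):
        if high - low <= 1:           # range narrowed to 1: stop early
            break
        mid = (low + high) // 2       # candidate retained size
        if entropy_at(mid) - base < delta_e:
            best, high = mid, mid     # acceptable: try pruning more
        else:
            low = mid                 # too damaging: prune less
    return best


def toy_entropy(m, full_size=1024):
    """Made-up curve: entropy rises linearly as more neurons are removed."""
    return 1.0 + (full_size - m) / full_size
```

With the toy curve and a threshold of 0.5, six search steps return a retained size of 528 out of 1024; allowing more steps lets the search converge to 513, the smallest size whose entropy increase stays below the threshold. The sketch assumes the entropy increase grows monotonically as neurons are removed, which a real model need not satisfy exactly.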

[0018] Furthermore, in step 3, the performance of the visual Transformer model degrades somewhat after its MLP modules are pruned. To restore performance, the original model is used as the teacher model and the pruned model as the student model for knowledge distillation. Since their weights and structures are compatible and their output dimensions are consistent, distillation can be performed directly without an additional alignment module. The output of the last Transformer block of the teacher model is represented as $z^{t}_{cls} \in \mathbb{R}^{C}$ and $z^{t}_{patch} \in \mathbb{R}^{N \times C}$, where $z^{t}_{cls}$ is the embedding feature of the class token, $z^{t}_{patch}$ contains the embedding features of the patch tokens, N is the number of tokens in the sequence, and C is the feature dimension. The output of the last block of the student model is $z^{s}_{cls}$ and $z^{s}_{patch}$, whose dimensions are identical to those of the teacher model's output. To achieve efficient knowledge transfer, a distillation loss function is constructed by applying mean squared error loss to the embedding features of both the class token and the patch tokens.

[0019] By training the student model using this loss function, the knowledge from the original model can be efficiently transferred to the pruned model, thus restoring the model's performance.
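A minimal sketch of such a distillation loss is shown below, using plain Python lists in place of tensors. The function names and the equal weighting of the class-token and patch-token terms are assumptions of this sketch; the source states only that mean squared error is applied to both kinds of token.

```python
def mse(a, b):
    """Mean squared error between two equally sized flat lists of floats."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def distillation_loss(t_cls, t_patch, s_cls, s_patch):
    """Sum of MSE terms on the class token and on the patch tokens.

    t_cls / s_cls: length-C lists (teacher / student class-token features).
    t_patch / s_patch: N lists of length C (patch-token features).
    Equal weighting of the two terms is an assumption of this sketch.
    """
    # Flatten the N x C patch features so one MSE covers every element.
    flat_t = [v for token in t_patch for v in token]
    flat_s = [v for token in s_patch for v in token]
    return mse(s_cls, t_cls) + mse(flat_s, flat_t)
```

Because the student keeps the teacher's output dimensions, the two outputs can be compared element by element, which is why no projection or alignment layer appears in the sketch.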

[0020] Beneficial effects

[0021] This invention presents a lightweight visual transformer method based on adaptive MLP pruning, which compresses large-scale visual transformer models substantially and in a near-lossless manner, reducing computational and memory requirements without significant performance degradation. The method focuses on the multilayer perceptron (MLP) modules, which account for the largest share of the visual transformer. First, it accurately evaluates the importance of hidden neurons in the MLP using an unlabeled information entropy criterion combined with Taylor expansion. Then, it sorts the neurons and uses a binary search algorithm to prune them adaptively according to the redundancy of each MLP module, abandoning the traditional approach of a predefined compression ratio. Finally, it recovers the performance of the pruned model through knowledge distillation. Experimental results on several mainstream large-scale visual transformers, such as CLIP and DINOv2, show that the method reduces parameters and floating-point operations (FLOPs) by approximately 40%. Without fine-tuning, the pruned model performs significantly better than models produced by other pruning methods, and after distillation it fully recovers, or even slightly surpasses, the performance of the original model. Furthermore, the method is fully compatible with token reduction methods, and combining them further improves the inference efficiency of the visual transformer. This effectively addresses the high deployment cost of large-scale visual transformers and yields efficient, lightweight models.

Attached Figure Description

[0022] Figure 1 is a schematic diagram of the overall method of the present invention.

[0023] Figure 2 is a schematic diagram comparing the use of one-hot cross-entropy and information entropy for neuron importance assessment in the method of this invention.

Detailed Implementation

[0024] As shown in Figure 1, the lightweight visual transformer method based on adaptive MLP pruning proposed in this invention mainly includes the following steps:

[0025] Step 1: Select several mainstream large-scale visual transformer models, including the CLIP-series models OpenCLIP-g, OpenCLIP-G, EVA-CLIP-E, and EVA-CLIP-8B, and the pure visual transformer DINOv2-g. Pruning focuses on the MLP modules, which hold most of the model's parameters. The text encoder of each model remains fixed throughout the process, without pruning or fine-tuning. The hidden neuron feature set of the MLP module is represented as $H$, where $h_k$, the feature value of the k-th hidden neuron, serves as the basis for neuron importance assessment. First, the change in the model prediction criterion after pruning is approximated by Taylor expansion to construct a basic formula for neuron importance assessment, in which the importance of the k-th hidden neuron is expressed as the product of its feature value and the corresponding gradient. Then, the importance values of the same neuron at all positions in the token sequence are summed and the absolute value is taken to obtain the final neuron importance score $I_k = \left| \sum_{n=1}^{N} g_{k,n} h_{k,n} \right|$, where $h_{k,n}$ and $g_{k,n}$ are the feature value and gradient of the k-th neuron for the n-th token.
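The score computation in this step can be sketched in a few lines of Python. The sketch assumes the per-token feature values and gradients have already been collected (in practice they would come from a forward and backward pass through the model); the function names `importance_scores` and `pruning_order` are hypothetical.

```python
def importance_scores(features, grads):
    """Taylor importance |sum_n g[n][k] * h[n][k]| for every hidden neuron k.

    features[n][k]: feature value of neuron k for token n.
    grads[n][k]: gradient of the prediction criterion w.r.t. that value.
    """
    num_neurons = len(features[0])
    scores = []
    for k in range(num_neurons):
        # Sum the gradient-times-activation products over all tokens.
        total = sum(g[k] * h[k] for h, g in zip(features, grads))
        scores.append(abs(total))
    return scores


def pruning_order(scores):
    """Neuron indices sorted from most to least important."""
    return sorted(range(len(scores)), key=lambda k: scores[k], reverse=True)
```

Note that the absolute value is taken after the sum over tokens, so contributions of opposite sign from different tokens cancel: a neuron whose products are +0.5 for one token and -0.5 for another receives a score of zero.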

[0026] Step 2: Instead of the one-hot cross-entropy criterion used in traditional Taylor pruning, the unlabeled information entropy criterion is introduced to accurately evaluate neuron importance. This avoids the problems of one-hot cross-entropy, which ignores the predictions for categories other than the labeled one and therefore evaluates importance inaccurately, and it yields a general importance evaluation that does not rely on the original model's loss function, additional modules, or labeled datasets. First, the inter-instance similarity matrix between the B image representations in a mini-batch is calculated, where the similarity is the cosine similarity between the class tokens output by the last block of the Transformer. Then a softmax operation with a temperature coefficient τ, which scales the similarity range, is applied to the similarity matrix to obtain the prediction probability matrix. Finally, the information entropy is calculated from this probability matrix.

[0027] The information entropy is used as the prediction criterion for Taylor pruning and substituted into the importance score formula to calculate the importance of all hidden neurons in the MLP. The neurons are then sorted in descending order of importance score to determine the pruning order.

[0028] Step 3: Based on the sorted neurons, an adaptive pruning algorithm is used to prune the MLP modules. This avoids the drawbacks of a predefined pruning ratio, because the number of neurons to prune is determined dynamically from the redundancy of each MLP module. First, 50,000 images are randomly sampled from the ImageNet-1K training set to construct the pruning dataset $\mathcal{D}$. An information entropy increment threshold ΔE is set so that the change in information entropy of the pruned model stays within this threshold, keeping performance degradation within an acceptable range. Let the hidden layer size of the original MLP be $M$; the pruning search range is initialized to $[M_{\min}, M_{\max}] = [0, M]$, and the maximum number of search steps is $T = 6$. The MLP modules are pruned block by block, from the last block of the Transformer to the first. At each step, the midpoint of the current search range is taken as the candidate hidden layer size $M_t^l$, and the information entropy $E_t^l$ of the model pruned to this size is evaluated. If $E_t^l - E_0^l$ is less than ΔE, where $E_0^l$ is the information entropy before block l is pruned, the search range is narrowed to $[M_{\min}, M_t^l]$ and pruning continues; otherwise, the search range is adjusted to $[M_t^l, M_{\max}]$ to reduce the number of pruned neurons. The search range is thus halved at each step until the maximum number of search steps is reached, which yields the hidden layer size to be retained for each MLP module and completes the structured pruning. The output dimension of the pruned model remains consistent with the original model, so no additional alignment module is needed.
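Once the retained hidden size is known, structured pruning of a two-layer MLP amounts to dropping rows of the first weight matrix and the matching columns of the second. The sketch below illustrates this with nested Python lists; the weight layout, the function names, and the use of an exact GELU activation are assumptions for illustration, and real models may use a different MLP variant.

```python
import math


def gelu(x):
    """Exact GELU activation (assumed here; the source names no activation)."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def mlp_forward(x, w1, b1, w2, b2):
    """Two-layer MLP: w1 is hidden x in, w2 is out x hidden."""
    hidden = [gelu(sum(w * v for w, v in zip(row, x)) + b)
              for row, b in zip(w1, b1)]
    return [sum(w * h for w, h in zip(row, hidden)) + b
            for row, b in zip(w2, b2)]


def prune_mlp(w1, b1, w2, keep):
    """Keep only the hidden neurons whose indices are listed in `keep`."""
    keep = sorted(keep)
    new_w1 = [w1[k] for k in keep]                    # drop rows of layer 1
    new_b1 = [b1[k] for k in keep]                    # drop matching biases
    new_w2 = [[row[k] for k in keep] for row in w2]   # drop columns of layer 2
    return new_w1, new_b1, new_w2
```

Only the hidden dimension shrinks: the input size of the first layer and the output size of the second layer are untouched, so the pruned module can replace the original one without any adapter.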

[0029] Step 4: The performance of the pruned model degrades to some extent. With the original visual transformer model as the teacher model and the pruned model as the student model, the student's performance is recovered through knowledge distillation. The dataset used for distillation is the unlabeled ImageNet-1K training set, whose size is only about 0.06% of LAION-2B. All images used for distillation and evaluation are resized to 224×224. The images are input into the teacher and student models to obtain the class token and patch tokens output by the last Transformer block of the teacher model, together with the corresponding class token and patch tokens of the student model. To balance the predictive power and feature fitting ability of the model, a distillation loss function is constructed using mean squared error loss on both the class tokens and the patch tokens.

[0030] This loss guides the student model to learn the feature representations of the teacher model.

[0031] Step 5: All models were trained for 10 epochs on a server equipped with 8×A6000 GPUs, using the AdamW optimizer and bfloat16 precision. The first epoch was used for learning rate warm-up, after which the learning rate followed a cosine schedule from the base learning rate down to zero. The learning rate was calculated using the following formula: [formula not recoverable from the source text]. Depending on model size, either distributed data parallelism (DDP) or fully sharded data parallelism (FSDP) was adopted. Hyperparameters such as the temperature coefficient τ and the information entropy increment threshold ΔE were set separately for each model backbone. After distillation, the model was evaluated on multiple tasks, including zero-shot image classification, zero-shot image-text retrieval, and kNN classification, to verify its performance. The pruned and distilled model recovers the performance level of the original model while reducing parameters and FLOPs by about 40%. Some models even show a slight performance improvement, and the inference speed of the pruned model rises to about 1.5 times that of the original, achieving near-lossless lightweighting of large-scale visual transformers.
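The schedule described above, one warm-up epoch out of ten followed by cosine decay to zero, can be sketched as follows. The linear shape of the warm-up and every numeric value passed to the function are assumptions of this sketch; the source does not state the base learning rate or how it was derived.

```python
import math


def learning_rate(step, total_steps, warmup_steps, base_lr):
    """Linear warm-up to base_lr, then cosine decay from base_lr to zero.

    step: zero-based index of the current optimisation step.
    total_steps: number of steps in the whole run (for example 10 epochs).
    warmup_steps: number of steps in the warm-up phase (for example 1 epoch).
    base_lr: peak learning rate (value not given in the source).
    """
    if step < warmup_steps:
        # Linear ramp; the warm-up shape is an assumption of this sketch.
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))
```

With 100 steps per epoch, `learning_rate(step, 1000, 100, base_lr)` reaches `base_lr` at the end of the first epoch, falls to half of it midway through the remaining nine epochs, and reaches zero at step 1000.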

[0032] Step 6: This method is fully compatible with token reduction methods such as token pruning and token merging. Combining the MLP parameter pruning of this method with a token reduction method further reduces the inference computation of the visual transformer and improves the inference efficiency of the model, making the method suitable for lightweight deployment of various large-scale visual transformers. The core idea of the method can also be extended to the adaptive reduction of multi-head self-attention modules and to the accelerated optimization of large language models.
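A back-of-the-envelope estimate shows why the two kinds of reduction compose. The sketch below counts multiply-accumulate operations for a plain two-layer MLP applied to every token; the function name and all sizes are hypothetical, and gated MLP variants with a third weight matrix would need a different constant.

```python
def mlp_macs(num_tokens, embed_dim, hidden_dim):
    """Multiply-accumulates of a two-layer MLP (embed -> hidden -> embed)."""
    # Each token costs embed_dim * hidden_dim MACs in each of the two layers.
    return 2 * num_tokens * embed_dim * hidden_dim


# Hypothetical sizes chosen only to make the ratios easy to read.
baseline = mlp_macs(256, 1024, 4096)
pruned_only = mlp_macs(256, 1024, 2048)   # half of the hidden neurons kept
both = mlp_macs(128, 1024, 2048)          # half of the tokens kept as well
```

Because the MLP cost is proportional to the product of the token count and the hidden size, keeping half of each leaves a quarter of the original MLP computation in this toy setting.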

Claims

1. A lightweight method for adaptive MLP pruning of large vision Transformers, comprising the following steps: Step 1: Based on Taylor expansion, the unlabeled information entropy criterion is introduced as the evaluation criterion to assess the importance scores of the hidden neurons in the MLP modules of the large vision Transformer. Step 2: The hidden neurons of each MLP module are sorted according to their importance scores, and a binary search algorithm is used to adaptively prune the sorted neurons based on the redundancy of the different MLP modules, such that the change in the information entropy of the model's predictions after pruning does not exceed a set threshold. Step 3: The original large vision Transformer is used as the teacher model and the pruned model as the student model, and the student model is guided to restore its performance through knowledge distillation, completing the lightweighting of the large vision Transformer.

2. The lightweight method for adaptive MLP pruning of large vision Transformers according to claim 1, characterized in that, in step 1, the model prediction criterion is denoted as $E(H)$, where $H$ is the hidden feature set obtained after the dataset is input into the model, and the importance of the k-th hidden neuron is measured by the change in the model prediction criterion as follows: $I_k = \left| E(H) - E(H \mid h_k = 0) \right|$. In the formula, $h_k$ is the feature value of the k-th neuron before pruning, $h_k = 0$ indicates pruning the k-th neuron, and $E(H \mid h_k = 0)$ indicates that only the value of $h_k$ in $H$ is set to 0. Performing a Taylor expansion of $E(H \mid h_k = 0)$ at $h_k$ and ignoring the first-order remainder gives the approximate importance of the k-th neuron, $I_k \approx \left| g_k h_k \right|$, where $g_k$ is the gradient of $E$ with respect to $h_k$. For a sequence containing N tokens in a large vision Transformer, the final importance score of the k-th hidden neuron is calculated as follows: $I_k = \left| \sum_{n=1}^{N} I_{k,n} \right|$ with $I_{k,n} = g_{k,n} h_{k,n}$. In the formula, $I_{k,n}$ is the importance of the k-th neuron for the n-th token in the sequence, $h_{k,n}$ is the feature value of the k-th neuron for the n-th token, and $g_{k,n}$ is the corresponding gradient.

3. The lightweight method for adaptive MLP pruning of large vision Transformers according to claim 1, characterized in that, in step 1, the unlabeled information entropy criterion calculates the model's predicted probability based on the similarity between instances, without relying on the original model's loss function, a labeled dataset, or an additional prediction module. The specific calculation process is as follows: First, calculate the instance similarity matrix $S$ of the B image representations in the mini-batch, where the similarity between the i-th image and the j-th image is $S_{ij} = \frac{z_i^{\top} z_j}{\|z_i\| \, \|z_j\|}$. In the formula, $z_i$ is the representation of the i-th image output by the last block of the Transformer model. Then, a softmax operation is applied to the similarity matrix $S$ to obtain the prediction probability matrix $P$: $P_{ij} = \frac{\exp(S_{ij} / \tau)}{\sum_{m=1}^{B} \exp(S_{im} / \tau)}$. In the formula, τ is a temperature coefficient used for scaling the range of values of $S_{ij}$. The final information entropy criterion is: $E = -\frac{1}{B} \sum_{i=1}^{B} \sum_{j=1}^{B} P_{ij} \log P_{ij}$.

4. The lightweight method for adaptive MLP pruning of large vision Transformers according to claim 1, characterized in that, in step 2, the hidden layer size of the MLP in the original model is denoted as $M$, the initial search range of the binary search is $[M_{\min}, M_{\max}] = [0, M]$, and the information entropy increment threshold ΔE is set. The pruning process includes the following operations: Step 2.1: in the t-th step of the pruning search, calculate the hidden layer size to prune to, $M_t^l = (M_{\min} + M_{\max}) / 2$, where l is the index of the Transformer block; Step 2.2: on the pruning dataset $\mathcal{D}$, evaluate the information entropy $E_t^l$ of the model when the hidden layer size is $M_t^l$, and denote the information entropy before block l is pruned as $E_0^l$; Step 2.3: if $E_t^l - E_0^l$ is less than ΔE, update the search range to $[M_{\min}, M_t^l]$ and record the current optimal pruning size as $M_t^l$; otherwise, update the search range to $[M_t^l, M_{\max}]$; Step 2.4: repeat steps 2.1 to 2.3 until the maximum number of search steps $T$ is reached, which yields the optimal pruned hidden layer size of the MLP in each Transformer block and completes the adaptive pruning.

5. The lightweight method for adaptive MLP pruning of large vision Transformers according to claim 1, characterized in that, in step 3, the knowledge distillation process uses mean squared error loss to perform feature alignment on the class tokens and patch tokens output by the last Transformer block of the teacher and student models. Specifically, the class token and patch tokens output by the teacher model are denoted as $z^{t}_{cls} \in \mathbb{R}^{C}$ and $z^{t}_{patch} \in \mathbb{R}^{N \times C}$, and the class token and patch tokens output by the student model are denoted as $z^{s}_{cls} \in \mathbb{R}^{C}$ and $z^{s}_{patch} \in \mathbb{R}^{N \times C}$, where C is the token dimension, N is the number of patch tokens, and the tokens output by the student model have the same dimensions as those output by the teacher model. The loss function for knowledge distillation is: $\mathcal{L} = \mathrm{MSE}(z^{s}_{cls}, z^{t}_{cls}) + \mathrm{MSE}(z^{s}_{patch}, z^{t}_{patch})$. By minimizing this loss function, knowledge is transferred from the teacher model to the student model, thus restoring the performance of the student model.