Projection-based knowledge distillation method based on adaptive mask weighting
By using an adaptive mask-weighted projective knowledge distillation method, the problem of limited student network representation ability caused by random masking is solved, which improves the representation ability and information utilization efficiency of student models, and enhances the robustness and generalization ability of the models.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- CHINA UNIV OF MINING & TECH
- Filing Date
- 2023-11-16
- Publication Date
- 2026-06-12
AI Technical Summary
In existing knowledge distillation methods, random masking leaves the adjacent pixels of the student features with a limited receptive field. This limits the representation ability of the student network and its use of information, which in turn weakens the robustness and generalization ability of the model.
An adaptive mask-weighted projective knowledge distillation method is adopted. By constructing an adaptive mask matrix and a relation matrix, the features of the student network are optimized by adaptive mask relation weighting and projective loss, thereby improving the expressive power of the student model.
It improves the representation ability and information utilization efficiency of student networks, and enhances the robustness and generalization ability of the knowledge distillation model.
Figure CN117454971B_ABST
Abstract
Description
Technical Field
[0001] The present invention relates to the field of computer vision, and particularly to a projection-based knowledge distillation method based on adaptive mask weighting.
Background Art
[0002] Deep convolutional neural networks have been widely applied to various computer vision tasks. Generally speaking, the larger the model, the better the performance, but the slower the inference speed, and it is difficult to deploy in the case of limited resources. To overcome this problem, knowledge distillation has been proposed. Currently, feature-based distillation methods usually make the student imitate the teacher's features as much as possible so that the student's features have stronger representation ability.
[0003] Yang et al. proposed in the paper "Masked Generative Distillation" that improving the representation ability of the student does not necessarily need to be achieved by directly imitating the teacher. Starting from this point, Yang et al. modified the imitation task into a generation task, that is, in the distillation process, by randomly masking the student's features, the student uses its own weaker features to generate the teacher's stronger features to improve the student's representation ability. However, randomly masking the student's features will make the masked areas of the feature map too randomized, affecting the subsequent restoration effect according to the adjacent pixels in the masked area. Moreover, in the paper, the random masking operation is directly performed on the features, and the receptive field of the adjacent pixels of the student's features is relatively limited, and the complete features cannot be effectively restored based on this, that is, the representation ability of the student network is still limited.
Summary of the Invention
[0004] The purpose of the present invention is to provide a projection-based knowledge distillation method based on adaptive mask weighting, which solves the problems of limited representation ability of the student network and insufficient information utilization caused by randomly masking the student's features and the limited receptive field of the adjacent pixels of the student's features, and at the same time improves the robustness and generalization ability of the knowledge distillation model.
[0005] The technical solution for achieving the purpose of the present invention is as follows: A projection-based knowledge distillation method based on adaptive mask weighting, comprising the following steps:
[0006] Step 1: Randomly collect K labeled images in the CIFAR-100 dataset, 10000 < K ≤ 60000, normalize the above K images, and uniformly set the pixel size to h0×w0, where h0 is the image height and w0 is the image width; randomly divide the images with unified size into a training dataset and a test dataset according to a ratio of 5:1, perform data augmentation on the training dataset to form a teacher-student network training dataset, and use the teacher-student network training dataset to pre-train the teacher network to obtain a pre-trained teacher network, and then proceed to Step 2.
[0007] Step 2: Based on the depth of the convolutional layers and the size of the feature maps, divide the teacher network into n teacher modules and the student network into n student modules, then proceed to Step 3.
[0008] Step 3: Construct n-1 relation matrices based on the output features of n student modules in the student network, and proceed to step 4.
[0009] Step 4: Based on the relation matrix constructed in Step 3, construct the corresponding adaptive mask matrix. Use the adaptive mask matrix to perform adaptive mask relation weighting on the output features of the first n-1 student modules of the student network to obtain the first n-1 adaptive mask relation weighted features. Perform adaptive masking on the output features of the nth student module of the student network separately to obtain the adaptive mask features, and then proceed to Step 5.
[0010] Step 5: Construct a projection layer. Use the teacher network to guide the corresponding projection layer so that the projection of the n-1 adaptive mask relationship weighted features obtained by the student network approximates the output features of the corresponding n-1 teacher modules. Calculate the projective loss of the adaptive mask relationship weighted feature. Make the projection of the nth adaptive mask feature approximate the output feature of the nth teacher module. Calculate the projective loss of the adaptive mask feature. Proceed to Step 6.
[0011] Step 6: Calculate the distillation loss of the traditional distillation method using the output features of the nth teacher module of the teacher network and the output features of the nth student module of the student network; then calculate the total distillation loss using the traditional distillation loss and the adaptive mask weighted projection loss, and update the network parameters of the student network accordingly to finally obtain the trained student network, and proceed to step 7.
[0012] Step 7: Input the test dataset into the trained student network, output the prediction result for each sample in the test set, and test the accuracy of the trained student network.
[0013] Compared with the prior art, the advantages of the present invention are as follows:
[0014] 1) Compared with existing knowledge distillation methods, this invention focuses on improving the expressive ability of the student model. Under the guidance of the teacher model, the student model fully explores and expresses the rich information contained in the dual features of relation matrix and feature map. At the same time, it solves the problems of insufficient utilization of feature knowledge and large differences in the expressive ability between student model and teacher model.
[0015] 2) The present invention constructs a projection distillation model that applies adaptive mask relation weighting and adaptive masking to the output features extracted at each stage of the student network. This solves the problems of limited representation ability of the student network and insufficient information utilization, which are caused by randomly masking the student features and by the limited receptive field of adjacent pixels of the student features, and it improves the robustness and generalization ability of the knowledge distillation model.
Brief Description of the Drawings
[0016] Figure 1 is a model diagram of the projection-based knowledge distillation method based on adaptive mask weighting of the present invention.
Detailed Description of the Embodiments
[0017] To make the objectives, technical solutions, and advantages of the present invention clearer, the embodiments of the present invention will be further described in detail below.
[0018] With reference to Figure 1, a projection-based knowledge distillation method based on adaptive mask weighting includes the following steps:
[0019] Step 1: Randomly collect K labeled images in the CIFAR-100 dataset, where 10000 < K ≤ 60000. Normalize the above K images, and uniformly set the pixel size to h0×w0, where h0 is the image height and w0 is the image width. Randomly divide the images with the unified size into a training dataset and a test dataset at a ratio of 5∶1. Perform data augmentation on the training dataset to form a teacher-student network training dataset, and use the teacher-student network training dataset to pre-train the teacher network to obtain a pre-trained teacher network, and then proceed to Step 2.
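The 5:1 split in Step 1 can be sketched with the standard library alone. This is a minimal illustration, not the patented procedure: the function name `split_train_test` and the fixed seed are assumptions made only so the sketch is repeatable, and the indices stand in for the K normalized images.

```python
import random


def split_train_test(num_images, ratio=(5, 1), seed=0):
    """Shuffle image indices and split them train:test by the given ratio."""
    indices = list(range(num_images))
    # Seeded shuffle (hypothetical choice) so the split is reproducible.
    random.Random(seed).shuffle(indices)
    n_train = num_images * ratio[0] // (ratio[0] + ratio[1])
    return indices[:n_train], indices[n_train:]


# With K = 60000 the 5:1 ratio gives 50000 training and 10000 test images.
train_idx, test_idx = split_train_test(60000)
print(len(train_idx), len(test_idx))  # 50000 10000
```

Data augmentation would then be applied to the training indices only, as the step describes.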
[0020] Step 2: Divide the teacher network into n teacher modules and the student network into n student modules according to the depth of the convolutional layer and the size of the feature map for extracting features at each stage, and then proceed to Step 3.
[0021] Step 3: Construct n - 1 relationship matrices based on the output features of the n student modules of the student network, specifically as follows:
[0022] First, denote the output feature of the i-th student module by $F_{S_i}$, 1 ≤ i ≤ n, where S denotes the student network and H, W, and C denote the height, width, and channel dimension of the output feature; denote the output feature of the i-th teacher module by $F_{T_i}$, 1 ≤ i ≤ n, where T denotes the teacher network. Next, use dilated convolution to sparsely sample the feature $F_{S_i}$, which gives a feature map of the same size. Fuse $F_{S_i}$ with this sparsely sampled feature map; the fusion enlarges the receptive field shared by adjacent pixels, which makes it easier for the projection layer to project the masked feature pixels. Finally, use the fused features to construct the relation matrix $G_N$:
[0023]
[0024] That is, $G_N$ denotes the relation matrix constructed from the fused output feature of the i-th student module and the fused output feature of the (i+1)-th student module, where 1 ≤ i ≤ n-1 and 1 ≤ N ≤ n-1; h denotes the pixel position in the height dimension and w denotes the pixel position in the width dimension. Constructing this relation matrix tightens the relationship between adjacent pixels and increases the overlapping receptive field, which favors the projection of pixels after masking.
[0025] Proceed to step 4.
[0026] Step 4: Based on the relation matrix constructed in Step 3, construct the corresponding adaptive mask matrix. Use the adaptive mask matrix to apply adaptive mask relation weighting to the output features of the first n-1 student modules of the student network, obtaining the first n-1 adaptive mask relation weighted features. Apply adaptive mask weighting to the output features of the nth student module of the student network separately to obtain the adaptive mask weighted features, as detailed below:
[0027] First, the relation matrix is processed using the softmax function to obtain its feature map scores. These scores are then sorted from largest to smallest, and the top k1 scores are selected. Finally, the original, unsorted positions of these k1 values in the feature map are used as the attention regions for the adaptive mask relation, and the remaining positions are assigned a value of 0. The adaptive mask relation matrix is represented by the following expression:
[0028]
[0029] Here the adaptive mask matrix is the one corresponding to the relation matrix $G_N$, the retained positions are the original positions of the top k1 scores in the relation matrix, and v and j are the horizontal and vertical coordinates of the relation matrix, respectively. Compared with the earlier random masking operation, the proposed adaptive mask matrix is more targeted: it adaptively retains the high-scoring features as weights, which gives the target features a higher proportion and so projects the important features better.
[0030] Then, the corresponding relation matrix is masked with the adaptive mask matrix through the Hadamard product ⊙, which gives the weight matrix for adaptive mask relation weighting.
[0031] Finally, the weight matrix is used to apply adaptive mask relation weighting to the feature $F_{S_i}$ extracted by the i-th student module, which gives the adaptive mask relation weighted feature.
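The softmax scoring, top-k selection, and Hadamard masking described above can be sketched in NumPy as follows. This is an illustration under stated assumptions rather than the patent's exact formula: the helper name `adaptive_mask` is hypothetical, the softmax is taken over all entries of the matrix, and retained positions are given weight 1 (the text only states that the remaining positions are set to 0).

```python
import numpy as np


def adaptive_mask(matrix, k):
    """Mask that keeps the k highest softmax scores at their original positions."""
    flat = matrix.ravel()
    # Softmax over all entries; the maximum is subtracted for numerical stability.
    exp = np.exp(flat - flat.max())
    scores = exp / exp.sum()
    # Indices of the k largest scores (their relative order is irrelevant here).
    top = np.argpartition(scores, -k)[-k:]
    mask = np.zeros_like(flat)
    mask[top] = 1.0  # assumption: kept positions get weight 1, the rest stay 0
    return mask.reshape(matrix.shape)


# Toy relation matrix standing in for G_N; k stands in for k1.
G = np.array([[0.1, 2.0, 0.3],
              [1.5, 0.2, 0.4],
              [0.0, 0.9, 3.0]])
M = adaptive_mask(G, k=3)
W = M * G  # Hadamard product: the masked relation weights
print(M)
print(W)
```

In this toy matrix the three retained positions are those holding 3.0, 2.0, and 1.5; every other entry of the weight matrix is 0. How the weight matrix is then applied to $F_{S_i}$ depends on a formula that is not reproduced in the text, so the sketch stops at the weight matrix.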
[0032] Similarly, the output features of the nth student module in the student network are processed by the softmax function to obtain its feature map score. These scores are then sorted from largest to smallest, and the top k2 values with the highest scores are selected. Finally, the original, unsorted positions of these k2 values in the feature map are used as the attention regions for the adaptive mask, and the remaining positions are assigned a value of 0. The adaptive mask matrix is represented by the following expression:
[0033]
[0034] Here the adaptive mask matrix is the one corresponding to the output feature $F_{S_n}$ of the nth student module in the student network, the retained positions are the original positions of the top k2 scores in $F_{S_n}$, and v and j are the horizontal and vertical coordinates of $F_{S_n}$. The adaptive mask matrix is then used to mask the output feature $F_{S_n}$ of the nth student module of the student network, which gives the adaptive mask feature.
[0035] Proceed to step 5.
[0036] Step 5: Construct projection layers. Using the teacher network to guide the corresponding projection layers, the projections of the n-1 adaptive mask relationship weighted features obtained from the student network approximate the output features of the corresponding n-1 teacher modules. Calculate the projective loss of the adaptive mask relationship weighted features. Similarly, the projection of the nth adaptive mask weighted feature approximates the output feature of the nth teacher module, and the projective loss of its adaptive mask feature is calculated, as follows:
[0037] First, the projection layer is constructed from convolutional blocks and the ReLU function: it consists of a 3×3 convolutional block, a ReLU function layer, and a 3×3 convolutional block connected in sequence. Then, the adaptive mask relation weighted feature is fed into the projection layer and, under the guidance of the feature $F_{T_i}$ extracted by the corresponding teacher module, the student network is forced to project a relation projection feature whose shape and size approximate $F_{T_i}$. Finally, the adaptive mask relation weighted projection loss $L_{admp1}$ between the output feature $F_{T_i}$ of the corresponding teacher module and the relation projection feature is calculated as follows:
[0038]
[0039] In the formula, $F_{T_i}$ denotes the feature extracted by the i-th teacher module in the teacher network; the projected feature is obtained from the feature $F_{S_i}$ extracted by the i-th student module in the student network after adaptive mask relation weighting; c represents the number of channels, h represents the pixel position in the height dimension, and w represents the pixel position in the width dimension.
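The projection layer (3×3 convolution, ReLU, 3×3 convolution) and a loss over channels, height, and width can be sketched in NumPy as below. Several points are assumptions, because the loss formula itself is not reproduced in the text: the loss is taken as a squared error summed over c, h, and w, the convolutions use stride 1 with zero padding so the spatial size is preserved, and the helper names, channel counts, and random tensors are hypothetical stand-ins for the weighted student feature and $F_{T_i}$.

```python
import numpy as np


def conv3x3(x, weight, bias):
    """3x3 convolution, stride 1, zero padding 1. x has shape (C_in, H, W)."""
    _, h, w = x.shape
    padded = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros((weight.shape[0], h, w))
    for dy in range(3):
        for dx in range(3):
            window = padded[:, dy:dy + h, dx:dx + w]
            # Mix input channels into output channels for this kernel offset.
            out += np.einsum("oc,chw->ohw", weight[:, :, dy, dx], window)
    return out + bias[:, None, None]


def project(x, params):
    """Projection layer: 3x3 conv -> ReLU -> 3x3 conv."""
    w1, b1, w2, b2 = params
    hidden = np.maximum(conv3x3(x, w1, b1), 0.0)
    return conv3x3(hidden, w2, b2)


def projection_loss(teacher_feat, projected):
    """Assumed loss form: squared error summed over channels, height, width."""
    return float(np.sum((teacher_feat - projected) ** 2))


rng = np.random.default_rng(0)
c_student, c_teacher, height, width = 4, 8, 6, 6  # hypothetical sizes
params = (
    rng.normal(size=(c_teacher, c_student, 3, 3)) * 0.1, np.zeros(c_teacher),
    rng.normal(size=(c_teacher, c_teacher, 3, 3)) * 0.1, np.zeros(c_teacher),
)
weighted_student = rng.normal(size=(c_student, height, width))  # stand-in input
teacher = rng.normal(size=(c_teacher, height, width))           # stand-in for F_Ti
projected = project(weighted_student, params)
print(projected.shape, projection_loss(teacher, projected))
```

The first convolution maps the student channel count to the teacher channel count, so the projected feature has the same shape as the teacher feature and the two can be compared element by element. In training, the projection weights would be learned jointly with the student network rather than fixed at random values.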
[0040] Similarly, the adaptive mask weighted feature is fed into the projection layer and, under the guidance of the feature $F_{T_n}$ extracted by the corresponding teacher module, the student network is forced to project a mask projection feature whose shape and size approximate $F_{T_n}$. Finally, the projection loss of the adaptive mask feature, $L_{admp2}$, between the feature $F_{T_n}$ extracted by the corresponding teacher module and the mask projection feature is calculated as follows:
[0041]
[0042] The overall adaptive mask-weighted projection loss of the method is then formed as:
[0043] $L_{admp} = \alpha_1 L_{admp1} + \alpha_2 L_{admp2}$
[0044] In the formula, α1 is the weight hyperparameter of the adaptive mask relation weighted projection loss, and α2 is the weight hyperparameter of the adaptive mask feature projection loss. Applied module by module, this loss corrects the deviation between the features projected from the masked student relation matrices and output features and the output features of the corresponding teacher modules. The teacher network thus guides the student more effectively, so that the student network can better mine and make full use of the information it has learned under the guidance of the teacher network.
[0045] Proceed to step 6.
[0046] Step 6: Calculate the distillation loss using the output features of the nth teacher module in the teacher network and the nth student module in the student network; then calculate the total distillation loss using the traditional distillation loss and the adaptive mask-weighted projection loss, and use this to update the network parameters of the student network, finally obtaining the trained student network, as follows:
[0047] The loss of the most traditional feature-based knowledge distillation method is expressed as:
[0048]
[0049] Here $F_{T_n}$ denotes the output feature of the nth teacher module, i.e. the last of the n teacher modules of the teacher network, and $F_{S_n}$ denotes the output feature of the nth student module, i.e. the last of the n student modules of the student network.
[0050] Therefore, the total loss can be expressed as: $L_{total} = L_{admp} + L_{classical}$.
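The way the loss terms combine can be shown with a few lines of plain Python. The α values are the ones reported for the classification experiments in paragraph [0062]; the three raw loss values are made-up numbers used only to exercise the arithmetic, and the function names are hypothetical.

```python
def admp_loss(l_admp1, l_admp2, alpha1, alpha2):
    """Weighted sum of the two projection losses."""
    return alpha1 * l_admp1 + alpha2 * l_admp2


def total_loss(l_admp, l_classical):
    """Total distillation loss: projection loss plus the classical feature loss."""
    return l_admp + l_classical


# Hypothetical raw loss values; alpha1 and alpha2 follow paragraph [0062].
l_admp = admp_loss(l_admp1=1200.0, l_admp2=50000.0, alpha1=7e-6, alpha2=3e-7)
print(total_loss(l_admp, l_classical=1.8))
```

The small α values suggest that the raw projection losses are large in magnitude (for example if they are summed over all channels and pixels), so the weights bring them into a range comparable with the classical term.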
[0051] Proceed to step 7.
[0052] Step 7: Input the test dataset into the trained student network, output the prediction result for each sample in the test set, and test the accuracy of the trained student network.
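The accuracy check in Step 7 can be sketched as a top-1 comparison between predicted and true labels. The text does not state which accuracy measure is used, so top-1 accuracy is an assumption here, and the label lists are toy values.

```python
def top1_accuracy(predicted_labels, true_labels):
    """Fraction of samples whose predicted class equals the true class."""
    if len(predicted_labels) != len(true_labels):
        raise ValueError("prediction and label lists must have the same length")
    correct = sum(p == t for p, t in zip(predicted_labels, true_labels))
    return correct / len(true_labels)


# Toy example: three of four predictions match their labels.
print(top1_accuracy([3, 7, 7, 1], [3, 7, 2, 1]))  # 0.75
```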
[0053] Example 1
[0054] With reference to Figure 1, the present invention discloses a projection-based knowledge distillation method based on adaptive mask weighting, comprising the following steps:
[0055] Step 1: Randomly collect 60,000 labeled images from the CIFAR-100 dataset. Normalize these 60,000 images to a pixel size of 32×32. Randomly divide the uniformly sized images into training and testing datasets at a ratio of 5:1. Perform data augmentation on the training dataset to form a teacher-student network training dataset. Use the teacher-student network training dataset to pre-train the teacher network to obtain the teacher network. The data augmentation operations include image scaling and random flipping. The image scaling ratio is 10% of the original image, scaling inward and outward. The random flipping angle is between -20° and 20°. The number of image categories is 100.
[0056] Step 2: Based on the depth of the convolutional layers and the size of the feature maps, divide the teacher network into 4 teacher modules and the student network into 4 student modules, then proceed to Step 3.
[0057] Step 3: Construct three relation matrices based on the output features of the four student modules of the student network, and proceed to Step 4.
[0058] Step 4: Based on the relation matrices constructed in Step 3, construct the corresponding adaptive mask matrices. Use the adaptive mask matrices to perform adaptive mask relation weighting on the output features of the first three student modules of the student network to obtain the first three adaptive mask relation weighted features. Apply adaptive masking separately to the output features of the fourth student module of the student network to obtain the adaptive mask features, and then proceed to Step 5.
[0059] Step 5: Construct projection layers and use the teacher network to guide the corresponding projection layers so that the projections of the three adaptive mask relationship weighted features obtained by the student network approximate the output features of the three corresponding teacher modules, and calculate their adaptive mask relationship weighted projection loss; make the projection of the fourth adaptive mask weighted feature approximate the output features of the fourth teacher module, calculate its adaptive mask feature projection loss, and proceed to step 6.
[0060] Step 6: Calculate the distillation loss of the traditional distillation method using the output features of the fourth teacher module of the teacher network and the fourth student module of the student network; then calculate the total distillation loss using the traditional distillation loss and the adaptive mask-weighted projection loss, and update the network parameters of the student network accordingly to obtain the trained student network, and then proceed to step 7.
[0061] Step 7: Input the test dataset into the trained student network, output the prediction result for each sample in the test set, and test the accuracy of the trained student network.
[0062] This invention's method was tested on an Nvidia 2080Ti GPU host using Python and PyTorch to build a network framework. For classification tasks, we calculated the sum of the losses from traditional knowledge distillation and adaptive mask-weighted projective loss. The method uses two hyperparameters α1 and α2 to balance the distillation loss in the equation. Hyperparameters {α1 = 0.000007, α2 = 0.0000003} were set for classification experiments. We trained all models for 240 epochs using the SGD optimizer, with a momentum of 0.9 and weight decay of 0.0001. We initialized the learning rate to 0.025 and decayed it every 30 epochs. Multiple training iterations on the training set yielded a projective knowledge distillation model based on adaptive mask weighting.
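The reported training settings can be collected in one place, together with a step schedule for the learning rate. The decay factor is not given in the text, so `gamma=0.1` below is purely an assumed placeholder, and the names `TRAIN_CONFIG` and `learning_rate` are hypothetical.

```python
# Settings stated in the paragraph above.
TRAIN_CONFIG = {
    "epochs": 240,
    "optimizer": "SGD",
    "momentum": 0.9,
    "weight_decay": 1e-4,
    "base_lr": 0.025,
    "lr_decay_every": 30,  # epochs
    "alpha1": 7e-6,
    "alpha2": 3e-7,
}


def learning_rate(epoch, base_lr=0.025, step=30, gamma=0.1):
    """Step decay: multiply the rate by gamma once every `step` epochs.

    gamma is an assumed value; the source only says the rate decays
    every 30 epochs.
    """
    return base_lr * gamma ** (epoch // step)


for epoch in range(0, TRAIN_CONFIG["epochs"], TRAIN_CONFIG["lr_decay_every"]):
    print(epoch, learning_rate(epoch))
```

With 240 epochs and a 30-epoch interval the rate is reduced seven times after the initial value, whatever factor is actually used.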
[0063] To demonstrate the superior performance of the algorithm of this invention, a popular knowledge distillation algorithm from recent years was selected as a comparison model. The comparative experimental results under the objective conditions of ResNet-32×4 teacher network, ResNet-8×4 student network, the same dataset, and the same equipment are shown in Table 1:
[0064] Table 1. Comparative experimental results under the same objective conditions and datasets.
[0065]
[0066] The experimental results demonstrate the effectiveness of the method of the present invention.
Claims
1. A projection-based knowledge distillation method based on adaptive mask weighting, characterized in that the steps are as follows:
Step 1: Randomly collect K labeled images from the CIFAR-100 dataset, 10000 < K ≤ 60000; normalize the K images and unify the pixel size to h0×w0, where h0 is the image height and w0 is the image width; randomly divide the images of unified size into a training dataset and a test dataset at a ratio of 5:1; perform data augmentation on the training dataset to form a teacher-student network training dataset; pre-train the teacher network using the teacher-student network training dataset to obtain a pre-trained teacher network; proceed to Step 2.
Step 2: Based on the depth of the convolutional layers and the size of the feature maps, divide the teacher network into n teacher modules and the student network into n student modules; proceed to Step 3.
Step 3: Based on the output features of the n student modules of the student network, construct n-1 relation matrices, as follows: first, denote the output feature of the i-th student module by $F_{S_i}$, 1 ≤ i ≤ n, where S represents the student network and H, W, and C represent the height, width, and dimension of the output feature, respectively; denote the output feature of the i-th teacher module by $F_{T_i}$, 1 ≤ i ≤ n, where T represents the teacher network; then use dilated convolution to sparsely sample the feature $F_{S_i}$ to obtain a feature map of the same size, and fuse $F_{S_i}$ with this feature map to increase the receptive field; finally, use the fused features to construct the relation matrix $G_N$, where $G_N$ denotes the relation matrix constructed from the fused output feature of the i-th student module and the fused output feature of the (i+1)-th student module, 1 ≤ i ≤ n-1, 1 ≤ N ≤ n-1, h represents the pixel position in the height dimension, and w represents the pixel position in the width dimension; proceed to Step 4.
Step 4: Based on the relation matrices constructed in Step 3, construct the corresponding adaptive mask matrices; use the adaptive mask matrices to apply adaptive mask relation weighting to the output features of the first n-1 student modules of the student network, obtaining the first n-1 adaptive mask relation weighted features; apply adaptive masking separately to the output feature of the nth student module of the student network, obtaining the adaptive mask feature, as follows: first, process the relation matrix with the softmax function to obtain its feature map scores, sort the scores from largest to smallest, and select the top k1 scores; use the original, unsorted positions of these k1 values in the feature map as the attention regions of the adaptive mask relation and assign 0 to the remaining positions, which gives the adaptive mask matrix corresponding to the relation matrix $G_N$, where v and j are the horizontal and vertical coordinates of the relation matrix, respectively; then mask the corresponding relation matrix with the adaptive mask matrix to obtain the weight matrix for adaptive mask relation weighting, where ⊙ represents the Hadamard product; finally, use the weight matrix to apply adaptive mask relation weighting to the feature $F_{S_i}$ extracted by the i-th student module, obtaining the adaptive mask relation weighted feature; similarly, process the output feature of the nth student module of the student network with the softmax function to obtain its feature map scores, sort the scores from largest to smallest, and select the top k2 scores; use the original, unsorted positions of these k2 values in the feature map as the attention regions of the adaptive mask and assign 0 to the remaining positions, which gives the adaptive mask matrix corresponding to the output feature $F_{S_n}$, where v and j are the horizontal and vertical coordinates of $F_{S_n}$; mask the output feature $F_{S_n}$ of the nth student module with this adaptive mask matrix to obtain the adaptive mask feature; proceed to Step 5.
Step 5: Construct projection layers and use the teacher network to guide the corresponding projection layers, so that the projections of the n-1 adaptive mask relation weighted features obtained by the student network approximate the output features of the corresponding n-1 teacher modules, and calculate the adaptive mask relation weighted projection loss; make the projection of the nth adaptive mask feature approximate the output feature of the nth teacher module, and calculate the projection loss of the adaptive mask feature; proceed to Step 6.
Step 6: Calculate the distillation loss of the distillation method using the output feature of the nth teacher module of the teacher network and the output feature of the nth student module of the student network; then calculate the total distillation loss using the distillation loss and the adaptive mask weighted projection loss, and update the network parameters of the student network accordingly to finally obtain the trained student network; proceed to Step 7.
Step 7: Input the test dataset into the trained student network, output the prediction result for each sample in the test set, and test the accuracy of the trained student network.
2. The projection-based knowledge distillation method based on adaptive mask weighting according to claim 1, characterized in that: [...]
3. The projection-based knowledge distillation method based on adaptive mask weighting according to claim 1, characterized in that, in Step 5, projection layers are constructed and the teacher network guides the corresponding projection layers, so that the projections of the n-1 adaptive mask relation weighted features obtained by the student network approximate the output features of the corresponding n-1 teacher modules, and the adaptive mask relation weighted projection loss is calculated; the projection of the nth adaptive mask feature is made to approximate the output feature of the nth teacher module, and the projection loss of the adaptive mask feature is calculated, as follows: first, the projection layer is constructed from convolutional blocks and the ReLU function, its structure being a 3×3 convolutional block, a ReLU function layer, and a 3×3 convolutional block connected in sequence; then, the adaptive mask relation weighted feature is input to the projection layer and, under the guidance of the feature $F_{T_i}$ extracted by the corresponding teacher module, the student network is forced to project a relation projection feature whose shape and size approximate $F_{T_i}$; finally, the adaptive mask relation weighted projection loss $L_{admp1}$ between the output feature $F_{T_i}$ of the corresponding teacher module and the relation projection feature is calculated, where $F_{T_i}$ represents the feature extracted by the i-th teacher module of the teacher network, the projected feature is obtained from the feature $F_{S_i}$ extracted by the i-th student module of the student network after adaptive mask relation weighting, c represents the number of channels, h represents the pixel position in the height dimension, and w represents the pixel position in the width dimension; similarly, the adaptive mask weighted feature is input to the projection layer and, under the guidance of the feature $F_{T_n}$ extracted by the corresponding teacher module, the student network is forced to project a mask projection feature whose shape and size approximate $F_{T_n}$; finally, the projection loss of the adaptive mask feature, $L_{admp2}$, between the feature $F_{T_n}$ extracted by the corresponding teacher module and the mask projection feature is calculated; the adaptive mask weighted projection loss of the method is formed as $L_{admp} = \alpha_1 L_{admp1} + \alpha_2 L_{admp2}$, where α1 is the weight hyperparameter of the adaptive mask relation weighted projection loss and α2 is the weight hyperparameter of the projection loss of the adaptive mask feature.
4. The projection-based knowledge distillation method based on adaptive mask weighting according to claim 3, characterized in that: [...]
5. The projection-based knowledge distillation method based on adaptive mask weighting according to claim 4, characterized in that, in Step 6, the distillation loss of the distillation method is calculated using the output feature of the nth teacher module of the teacher network and the output feature of the nth student module of the student network; then the total distillation loss is calculated using the distillation loss and the adaptive mask weighted projection loss, and the network parameters of the student network are updated accordingly to finally obtain the trained student network, as follows: the loss of the most traditional feature-based knowledge distillation method is denoted $L_{classical}$ and is computed from $F_{T_n}$ and $F_{S_n}$, where $F_{T_n}$ represents the output feature of the nth teacher module, i.e. the last of the n teacher modules of the teacher network, and $F_{S_n}$ represents the output feature of the nth student module, i.e. the last of the n student modules of the student network; therefore, the total loss can be expressed as $L_{total} = L_{admp} + L_{classical}$.
6. The projection-based knowledge distillation method based on adaptive mask weighting according to claim 1, characterized in that: [...]