A model distillation method and device based on feature information difference

The method obtains anchor box position information and feature differences from the teacher and student models and trains the student model with an entropy distillation loss. This addresses the poor distillation effect caused by large feature differences between the teacher and student models, improving performance and reducing resource use.

CN115457343B · Active · Publication Date: 2026-06-16 · BEIHANG UNIV

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
BEIHANG UNIV
Filing Date
2022-07-26
Publication Date
2026-06-16

Abstract

The application provides a model distillation method and device based on feature information difference, the method comprises the following steps: obtaining a student feature map and a teacher feature map of a sample image; the student feature map is obtained by performing feature extraction on the sample image by using a student model, and the teacher feature map is obtained by performing feature extraction on the sample image by using a teacher model; according to anchor box position information of the sample image, a first image feature and a second image feature corresponding to each anchor box are obtained respectively; the first image feature is an image feature of the student feature map, and the second image feature is an image feature of the teacher feature map; and according to a feature difference degree between the first image feature and the second image feature, the student model is distilled and trained by using the teacher model. Compared with the existing distillation method, the model distillation method based on feature information difference has better distillation effect when the feature information difference degree between the teacher model and the student model is large.
Description

Technical Field

[0001] This invention relates to the field of machine learning technology, and in particular to a model distillation method, apparatus, electronic device, and storage medium based on feature information differences.

Background Technology

[0002] In recent years, the performance of object detection networks has been greatly improved due to the development of deep convolutional neural networks. However, deep convolutional neural network models typically contain millions of parameters and billions of floating-point operations, which limits their deployment on embedded platforms. Techniques such as compact network design, network pruning, low-rank decomposition, and quantization have been developed to address these limitations while effectively ensuring that the accuracy of object detection models is not significantly compromised. For example, binary detection networks utilizing quantization accelerate object detection by speeding up the feature extraction of the backbone network, enabling real-time object localization and foreground object classification.

[0003] Knowledge distillation can effectively improve the performance of object detection networks. However, existing methods ignore the inherent information differences between binary detectors and full-precision detectors. They are effective for the student model only when the feature differences between the teacher model and the corresponding student model at their respective anchor boxes are relatively small; when those differences are relatively large, they perform poorly.

Summary of the Invention

[0004] This invention provides a model distillation method based on feature information differences to address the shortcomings of the prior art. It distills the student model effectively even when the teacher model and the corresponding student model have significant feature differences at the corresponding anchor boxes.

[0005] In a first aspect, the present invention provides a model distillation method based on feature information difference, comprising: acquiring student feature maps and teacher feature maps of sample images; wherein the student feature maps are obtained by extracting features from the sample images using a student model, and the teacher feature maps are obtained by extracting features from the sample images using a teacher model; acquiring a first image feature and a second image feature corresponding to each anchor frame according to the anchor frame position information of the sample images; wherein the first image feature is the image feature of the student feature map, and the second image feature is the image feature of the teacher feature map; and training the student model by distillation using the teacher model based on the feature difference degree between the first image feature and the second image feature.

[0006] According to a model distillation method based on feature information difference provided by the present invention, a student model is distilled and trained using the teacher model based on the feature difference degree between the first image feature and the second image feature. The method includes: obtaining the feature difference degree between the first image feature and the second image feature corresponding to each anchor box; determining at least one target anchor box from all anchor boxes based on each feature difference degree; determining the entropy distillation loss between the teacher model and the student model based on the first image feature and the second image feature corresponding to each target anchor box; and performing distillation training on the student model based on the entropy distillation loss.

[0007] According to a model distillation method based on feature information difference provided by the present invention, before obtaining the first image feature and the second image feature corresponding to each anchor frame according to the anchor frame position information of the sample image, the method further includes: performing target detection on the sample image through the student model to obtain the first anchor frame position information of the sample image; performing target detection on the sample image through the teacher model to obtain the second anchor frame position information of the sample image; and using the first anchor frame position information and the second anchor frame position information as the anchor frame position information of the sample image.

[0008] According to a model distillation method based on feature information difference provided by the present invention, the step of obtaining the feature difference degree between the first image feature and the second image feature corresponding to each anchor box includes: calculating the Mahalanobis distance between the first image feature and the second image feature; and using the Mahalanobis distance as the feature difference degree.

[0009] According to the present invention, a model distillation method based on feature information difference is provided, wherein both the student model and the teacher model are object detection models; the student model is distilled and trained based on the entropy distillation loss, including: obtaining the task loss of the student model for object detection; and updating the parameters of the student model according to the task loss and the entropy distillation loss.

[0010] According to the model distillation method based on feature information differences provided by the present invention, the student model is a binarized network; the teacher model is a full-precision network.

[0011] According to the present invention, a model distillation method based on feature information difference is provided, wherein the student model is a binarized Faster-RCNN network, and the feature extraction backbone network of the student model is one of ResNet-18, ResNet-34 and ResNet-50.

[0012] In a second aspect, the present invention also provides a model distillation apparatus based on feature information differences, comprising:

[0013] The first module is used to obtain student feature maps and teacher feature maps of sample images; the student feature maps are obtained by extracting features from the sample images using a student model, and the teacher feature maps are obtained by extracting features from the sample images using a teacher model.

[0014] The second module is used to obtain a first image feature and a second image feature corresponding to each anchor frame based on the anchor frame position information of the sample image; the first image feature is the image feature of the student feature map, and the second image feature is the image feature of the teacher feature map;

[0015] The third module is used to perform distillation training on the student model using the teacher model based on the feature difference between the first image features and the second image features.

[0016] In a third aspect, the present invention provides an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the program to implement the steps of the model distillation method based on feature information differences as described above.

[0017] In a fourth aspect, the present invention also provides a non-transitory computer-readable storage medium having a computer program stored thereon, which, when executed by a processor, implements the steps of the model distillation method based on feature information differences as described above.

[0018] The present invention provides a model distillation method, apparatus, electronic device, and storage medium based on feature information difference. According to the anchor box position information of the sample image, it obtains the first image feature of the student feature map and the second image feature of the teacher feature map corresponding to each anchor box. Furthermore, based on the feature difference between the first and second image features, it uses the teacher model to perform distillation training on the student model. Compared with existing distillation methods, the present invention achieves better distillation results when the feature information difference between the teacher model and the student model is large.

Description of the Drawings

[0019] To more clearly illustrate the technical solutions in this invention or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are some embodiments of this invention. For those skilled in the art, other drawings can be obtained from these drawings without creative effort.

[0020] Figure 1 is a schematic flowchart of the model distillation method based on feature information differences provided by the present invention;

[0021] Figure 2 is a schematic diagram of the structure of the model distillation apparatus based on feature information differences provided by the present invention;

[0022] Figure 3 is a schematic diagram of the structure of the electronic device provided by the present invention.

Detailed Description

[0023] To make the objectives, technical solutions, and advantages of this invention clearer, the technical solutions of this invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of this invention. All other embodiments obtained by those skilled in the art based on the embodiments of this invention without creative effort are within the scope of protection of this invention.

[0024] It should be noted that in the description of the embodiments of the present invention, the terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or apparatus. Without further limitations, an element defined by the phrase "comprising one..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that includes said element. The terms "upper," "lower," etc., indicating orientation or positional relationships based on the orientation or positional relationships shown in the accompanying drawings, are only for the convenience of describing the present invention and simplifying the description, and do not indicate or imply that the device or element referred to must have a specific orientation, or be constructed and operated in a specific orientation, and therefore should not be construed as a limitation of the present invention. Unless otherwise expressly specified and limited, the terms "installed," "connected," and "linked" should be interpreted broadly, for example, they can refer to a fixed connection, a detachable connection, or an integral connection; they can refer to a mechanical connection or an electrical connection; they can refer to a direct connection or an indirect connection through an intermediate medium; they can refer to the internal communication of two elements. Those skilled in the art can understand the specific meaning of the above terms in this invention according to the specific circumstances.

[0025] The terms "first," "second," etc., used in this application are used to distinguish similar objects and not to describe a specific order or sequence. It should be understood that such data can be interchanged where appropriate so that embodiments of this application can be implemented in orders other than those illustrated or described herein, and the objects distinguished by "first," "second," etc., are generally of the same class, without limiting the number of objects; for example, a first object can be one or more. Furthermore, "and / or" indicates at least one of the connected objects, and the character " / " generally indicates that the preceding and following objects have an "or" relationship.

[0026] The model distillation method and apparatus based on feature information differences provided in embodiments of the present invention are described below with reference to Figures 1-3.

[0027] Distillation learning is an important research area in computer vision and machine learning. It involves two networks: a pre-trained, high-performance teacher network (also called the teacher model) with high computational complexity and large storage requirements; and a student network (the student model) to be trained, which often has significantly lower computational complexity and storage requirements than the teacher network. Distillation learning aims to extract useful information and knowledge from the teacher network to guide the training process of the student network. Under the guidance of the teacher network, the student network can achieve better performance than when trained alone. Thus, distillation learning can yield high-performance, low-computational-complexity, and low-storage-consumption student networks. This method is particularly suitable for mobile and embedded devices with limited computing power.

[0028] Figure 1 is a schematic flowchart of the model distillation method based on feature information differences provided by the present invention. As shown in Figure 1, the method includes but is not limited to the following steps:

[0029] Step 101: Obtain student feature maps and teacher feature maps from the sample images.

[0030] Optionally, the acquired sample image is input into the student model, and feature extraction is performed on the sample image through the student model to obtain the student feature map of the sample image; the acquired sample image is input into the teacher model, and feature extraction is performed on the sample image through the teacher model to obtain the teacher feature map of the sample image. The above student feature map and teacher feature map can reflect the image features of the sample image after feature extraction.

[0031] The student model is lighter and simpler than the teacher model, while the teacher model is more complex, such as a combined model. Taking the training scenario of an object detection model as an example, both the student model and the teacher model are object detection models. In one possible implementation, they can be single-stage detectors, such as the Single Shot MultiBox Detector (SSD), or two-stage detectors, such as Region-based Convolutional Neural Networks (R-CNN) or Faster R-CNN.

[0032] Optionally, the student model is a binarized network; the teacher model is a full-precision network.

[0033] Step 102: Based on the anchor box position information of the sample image, obtain the first image feature and the second image feature corresponding to each anchor box.

[0034] Optionally, the anchor box position information in this invention includes first anchor box position information and second anchor box position information.

[0035] When the student model performs target detection on the sample image, the position information of the object in the sample image can be extracted, and the object can be outlined with an anchor box (such as a rectangle) to obtain the first anchor box position information.

[0036] When the teacher model performs target detection on the sample image, the position information of the object in the sample image can also be extracted, and the object can be outlined with an anchor box (such as a rectangle) to obtain the second anchor box position information.

[0037] The first anchor box position information and the second anchor box position information are used together as the anchor box position information of the sample image.

[0038] Based on the anchor box position information, the first image feature and the second image feature corresponding to each anchor box can be determined. The anchor boxes include both those obtained by the student model and those obtained by the teacher model. The first image feature is the image feature bounded by the anchor box on the student feature map; the second image feature is the image feature bounded by the same anchor box on the teacher feature map.
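
The cropping described in this step can be sketched as follows. This is only an illustration under stated assumptions: feature maps are single-channel nested lists of the same spatial size, anchor boxes are `(x0, y0, x1, y1)` tuples in feature-map coordinates with exclusive upper bounds, and `crop` and `paired_features` are hypothetical helper names that do not appear in the invention.

```python
def crop(feature_map, box):
    """Return the sub-grid of a 2-D feature map bounded by box=(x0, y0, x1, y1)."""
    x0, y0, x1, y1 = box
    return [row[x0:x1] for row in feature_map[y0:y1]]


def paired_features(student_map, teacher_map, student_boxes, teacher_boxes):
    """Crop both maps at every box produced by either model.

    Each result is (box, first_feature, second_feature): the first feature
    comes from the student map and the second from the teacher map, so
    len(student_boxes) + len(teacher_boxes) pairs are returned.
    """
    pairs = []
    for box in list(student_boxes) + list(teacher_boxes):
        pairs.append((box, crop(student_map, box), crop(teacher_map, box)))
    return pairs


# Toy 4x4 maps (assumed to share the same spatial size).
student_map = [[r * 4 + c for c in range(4)] for r in range(4)]
teacher_map = [[10 * (r * 4 + c) for c in range(4)] for r in range(4)]
pairs = paired_features(student_map, teacher_map,
                        student_boxes=[(0, 0, 2, 2)],
                        teacher_boxes=[(1, 1, 4, 3)])
```

Every box, whichever model produced it, is applied to both maps, which is what lets the later steps compare student and teacher features at the same location.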

[0039] Step 103: Based on the feature difference between the first image features and the second image features, the student model is trained using the teacher model.

[0040] Model distillation is most commonly implemented in two ways: distillation at the output layer, and distillation on the intermediate feature maps. This invention can perform distillation through either the intermediate feature maps or the output layer.

[0041] Based on the above embodiments, as an optional embodiment, the teacher model is a full-precision teacher network, and the student model is a binarized student network.

[0042] Optionally, the teacher model and the student model produce $N_T$ and $N_S$ anchor boxes, respectively. Each anchor box in one model selects the corresponding feature at the same location in the other model, so a total of $N_T + N_S$ anchor boxes are considered. Then, the feature difference between the first image feature and the second image feature corresponding to each anchor box is evaluated. This is unlike many existing knowledge distillation methods, which only use the anchor box positions calculated by the student model or the ground-truth anchor box positions.

[0043] To eliminate the huge difference in scale between the teacher model and the student model, this invention can perform feature transformation on the intermediate layer features of the model before evaluating the feature difference between the first image features and the second image features.

[0044]

[0045] Where n is the anchor box index; c is the index of the feature channel; (x,y)∈(W,H) is a pixel position within the transformed feature map; R represents the transformed image features, which are computed from the image features before feature transformation; H and W represent the height and width of the feature map; (x',y')∈(W,H) is a pixel position within the feature map before feature transformation; and T is an adjustment parameter.
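
The formula of paragraph [0044] is not reproduced in this text. One transformation consistent with the symbols defined above (a sum over positions (x',y') and an adjustment parameter T) is a softmax over the spatial positions of each channel with T as a temperature. The sketch below assumes that form; it is an illustrative guess, not the formula of the invention, and `spatial_softmax` is a hypothetical name.

```python
import math


def spatial_softmax(channel, temperature=1.0):
    """Normalize one H x W channel so its values are positive and sum to 1.

    Assumption: each value v becomes exp(v / T) divided by the sum of
    exp(v' / T) over all positions (x', y') of the same channel.
    """
    flat = [v / temperature for row in channel for v in row]
    peak = max(flat)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in flat]
    total = sum(exps)
    normalized = [e / total for e in exps]
    width = len(channel[0])
    return [normalized[i:i + width] for i in range(0, len(normalized), width)]


# Two channels with very different scales end up in the same normalized range
# once T is matched to the scale of each channel.
small = spatial_softmax([[0.1, 0.2], [0.3, 0.4]], temperature=0.1)
large = spatial_softmax([[100.0, 200.0], [300.0, 400.0]], temperature=100.0)
```

Under this assumption, T controls how sharply the normalized map concentrates on the largest responses, which is one way to remove the scale gap between the two models.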

[0046] After transformation, the features of the teacher model and the student model are projected into the same feature space and follow a Gaussian distribution. Next, this invention evaluates the information differences between the image features in the corresponding anchor boxes of the teacher network and the student network.

[0047] The model distillation method based on feature information difference provided by this invention obtains the first image features of the student feature map and the second image features of the teacher feature map corresponding to each anchor frame according to the anchor frame position information of the sample image; and performs distillation training on the student model using the teacher model based on the feature difference degree between the first and second image features. Compared with existing distillation methods, this invention has better distillation effect when the feature information difference between the teacher model and the student model is large.

[0048] Based on the above embodiments, as an optional embodiment, the model distillation method based on feature information difference provided by the present invention, which uses the teacher model to distill and train the student model according to the feature difference degree between the first image feature and the second image feature, includes: obtaining the feature difference degree between the first image feature and the second image feature corresponding to each anchor box; determining at least one target anchor box from all anchor boxes according to each feature difference degree; determining the entropy distillation loss between the teacher model and the student model according to the first image feature and the second image feature corresponding to each target anchor box; and performing distillation training on the student model based on the entropy distillation loss.

[0049] The feature difference can be represented by the Mahalanobis distance between the first image features and the second image features, as shown in the following formula:

[0050]

[0051] Where n is the anchor box index; c takes values from 1 to C; $\Sigma_{n;c}$ is the covariance matrix of the features in the c-th channel within the n-th anchor box; $\varepsilon_n$ is the Mahalanobis distance between the second image feature of the teacher model and the first image feature of the student model corresponding to the n-th anchor box; s denotes the student model and t denotes the teacher model. The Mahalanobis distance considers both the pixel-level distance between feature pairs within the anchor box and the statistical feature differences between anchor box pairs.
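
As a rough illustration of this feature difference, the sketch below sums a per-channel squared Mahalanobis distance, $(u - v)^\top \Sigma^{-1} (u - v)$, over the channels of one anchor box. Several points are assumptions: the exact form in paragraph [0050] (for example, whether a square root or norm is applied) is not reproduced in this text, the inverse covariance matrices are supplied by the caller rather than estimated, and the function names are hypothetical.

```python
def mahalanobis_sq(u, v, cov_inv):
    """Squared Mahalanobis distance (u - v)^T cov_inv (u - v) for vectors u, v."""
    d = [a - b for a, b in zip(u, v)]
    return sum(d[i] * cov_inv[i][j] * d[j]
               for i in range(len(d)) for j in range(len(d)))


def feature_difference(student_feat, teacher_feat, cov_invs):
    """Sum the per-channel distances between two features of shape C x (H*W)."""
    return sum(mahalanobis_sq(t, s, inv)
               for s, t, inv in zip(student_feat, teacher_feat, cov_invs))


identity = [[1.0, 0.0], [0.0, 1.0]]
scaled = [[0.25, 0.0], [0.0, 1.0]]   # inverse of the covariance diag(4, 1)
student = [[0.0, 0.0], [1.0, 1.0]]   # C = 2 channels, 2 flattened pixels each
teacher = [[2.0, 0.0], [1.0, 3.0]]
eps_n = feature_difference(student, teacher, [scaled, identity])
```

With an identity covariance the distance reduces to the squared Euclidean distance; a larger variance along a direction discounts differences along that direction, as in the first channel here.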

[0052] Optionally, after calculating the feature difference between the first image feature and the second image feature corresponding to each anchor box, at least one target anchor box is determined from all the anchor boxes based on each feature difference.

[0053] Optionally, the present invention selects the anchor boxes with the largest feature differences as target anchor boxes based on the magnitude of each feature difference.

[0054] For example, if there are 10 anchor boxes, they can be sorted in descending order of their corresponding feature differences, and then the first 5 anchor boxes can be selected as target anchor boxes.
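
The selection in this example can be sketched in a few lines. The difference values are made up for illustration, and `select_target_boxes` is a hypothetical helper name.

```python
def select_target_boxes(differences, keep):
    """Return the indices of the `keep` anchor boxes with the largest differences."""
    order = sorted(range(len(differences)),
                   key=lambda i: differences[i], reverse=True)
    return order[:keep]


# Ten anchor boxes with illustrative feature differences; keep the top five.
diffs = [0.3, 1.2, 0.1, 0.9, 2.5, 0.4, 0.7, 1.8, 0.2, 0.6]
targets = select_target_boxes(diffs, keep=5)
```

Returning indices rather than values keeps the link back to the anchor boxes, so the corresponding feature pairs can be looked up afterwards.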

[0055] Optionally, in order to apply the above research to the actual distillation process, it is necessary to select several representative anchor boxes (i.e. target anchor boxes) with large feature differences, and then use these target anchor boxes to obtain a 1×H×W mask containing only 0 and 1 for distillation training. The specific rule is that if a pixel is within the selected anchor box range, then the value of the corresponding pixel is 1, otherwise it is 0.
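
The mask rule can be illustrated as follows, assuming boxes are `(x0, y0, x1, y1)` tuples in feature-map coordinates with exclusive upper bounds. The leading singleton dimension of the 1×H×W mask is omitted for brevity, and `build_mask` is a hypothetical name.

```python
def build_mask(height, width, boxes):
    """Return an H x W grid that is 1 inside any selected box and 0 elsewhere."""
    mask = [[0] * width for _ in range(height)]
    for x0, y0, x1, y1 in boxes:
        # Clamp each box to the map so out-of-range coordinates are ignored.
        for y in range(max(y0, 0), min(y1, height)):
            for x in range(max(x0, 0), min(x1, width)):
                mask[y][x] = 1
    return mask


# Two overlapping target anchor boxes on a 4 x 5 feature map.
mask = build_mask(4, 5, [(1, 0, 3, 2), (2, 1, 5, 3)])
```

Pixels covered by more than one selected box still receive the value 1, so the mask stays binary.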

[0056] For each pair of image features within an anchor box (i.e., the first and second image features), aligning the corresponding features of the student and teacher models through distillation is only meaningful when their distributions differ significantly. Based on the above analysis, knowledge distillation is used to align the image features within anchor boxes that have large differences, which optimizes the student network effectively. This invention distills the image feature pairs within the selected anchor boxes by maximizing the conditional probability p(Rs|Rt); in other words, after distillation, the feature distributions of corresponding items in the teacher and student models become similar. Combined with the masks $m_n$, $n \in \{1, \dots, N_T + N_S\}$, this invention defines p(Rs|Rt) as:

[0057]

[0058] Here the formula uses a Gaussian distribution; the mean and the standard deviation of the first image feature corresponding to the n-th target anchor box; the mean and the standard deviation of the second image feature corresponding to the n-th anchor box; and $m_n$, the mask of the n-th target anchor box.

[0059] This invention also provides a distillation method for detection, specifically defined as follows:

[0060]

[0061]

[0062] Using this method, the present invention selects the image feature pair (i.e., the first image feature and the second image feature) corresponding to the target anchor box containing the most representative feature information differences for distillation.

[0063] For each training iteration, this invention first selects the representative anchor boxes through exhaustive sorting. For example, among the $N_T + N_S$ anchor boxes, the top $\gamma (N_T + N_S)$ boxes, ranked from largest to smallest feature difference, can be selected as target anchor boxes, where $\gamma$ is a parameter ranging from 0 to 1; for example, it can be 0.5.

[0064] Then, the entropy distillation loss proposed in this invention is used to distill the image feature pairs corresponding to the target anchor boxes, thereby optimizing the student network.

[0065] Based on the above embodiments, as an optional embodiment, the model distillation method based on feature information difference provided by the present invention, when both the student model and the teacher model are object detection models, performs distillation training on the student model based on the entropy distillation loss, including: obtaining the task loss of the student model for performing object detection; and updating the parameters of the student model according to the task loss and the entropy distillation loss.

[0066] Optionally, after selecting target anchor boxes with significant feature differences, this invention crops the corresponding image features according to the selected target anchor boxes. Most state-of-the-art detection models are based on feature pyramid networks, which can significantly improve robustness in multi-scale detection.

[0067] Optionally, when using the Faster-RCNN framework for object detection in this invention, the size of the selected target anchor box can be adjusted according to the size of each stage of the intermediate layer features, thereby defining the features.

[0068] Optionally, for SSD frameworks, the present invention generates anchor boxes from the regression layer of the SSD framework and clips features from the feature map with the largest spatial size.

[0069] The entropy distillation process can be expressed by the following formula:

[0070]

[0071] Here, the formula includes a covariance matrix term. This invention therefore trains the binary student model end to end, defining the total distillation loss of the student model as:

[0072] $L = L_{GT}(w, \alpha) + L_{P}(w, \alpha; \gamma)$

[0073] Where $L_{GT}(w, \alpha)$ represents the loss of the object detection model under the supervision of the ground-truth labels, that is, the task loss of performing object detection; $L_{P}(w, \alpha; \gamma)$ corresponds to the entropy distillation loss described above; w represents the parameters of the student model and $\alpha$ represents the quantization range.

[0074] It should be noted that the student model in this invention is a binary student network. To achieve better distillation results, the student model can be pre-trained using the stochastic gradient descent algorithm before distilling it using the feature information difference-based model distillation method provided in this invention. The pre-training process of the student model is briefly described below.

[0075] For a CNN, the convolution kernel parameters and the feature maps are defined first, where $C_{out}$ is the number of channels in the output feature map, $C_{in}$ is the number of channels in the input feature map, (W, H) represents the width and height of the feature map, and K represents the size of the convolution kernel. The full-precision convolution is defined in terms of the convolution operation. The computation of a binary convolutional network is defined in terms of the binarized feature map and the binarized convolution kernel parameters, where ⊙ represents the XOR operation, the product is taken over the channel dimension, α represents the quantization range as before, and binarization is achieved through the sign() function. The training objective is to make the binary parameters as similar as possible to the full-precision parameters, and a loss function is used for training accordingly.
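
To illustrate how a binarized product can be evaluated with bitwise operations, the sketch below binarizes values with sign() and computes a dot product with XNOR and bit counting, the two operations named in the experimental section. Two points are assumptions rather than the invention's definitions: the scale factor `alpha` is taken as the mean absolute weight (an XNOR-Net-style choice), and all function names are hypothetical.

```python
def sign(x):
    """Binarize a real value to +1 or -1 (zero is mapped to +1 here)."""
    return 1 if x >= 0 else -1


def pack(bits):
    """Pack a list of +1/-1 values into an int, one bit per value (+1 -> 1)."""
    word = 0
    for i, b in enumerate(bits):
        if b == 1:
            word |= 1 << i
    return word


def xnor_popcount_dot(a_bits, w_bits):
    """Dot product of two +1/-1 vectors via XNOR and bit counting."""
    n = len(a_bits)
    xnor = ~(pack(a_bits) ^ pack(w_bits)) & ((1 << n) - 1)
    matches = bin(xnor).count("1")   # positions where the signs agree
    return 2 * matches - n           # agreements minus disagreements


activations = [0.7, -1.2, 0.1, -0.4, 2.0]
weights = [-0.3, -0.8, 0.5, 0.9, 1.1]
a_bin = [sign(v) for v in activations]
w_bin = [sign(v) for v in weights]
alpha = sum(abs(v) for v in weights) / len(weights)  # assumed scale factor
binary_out = alpha * xnor_popcount_dot(a_bin, w_bin)
reference = alpha * sum(a * w for a, w in zip(a_bin, w_bin))
```

The bitwise result matches the ordinary dot product of the ±1 vectors, which is why binary layers can replace multiply-accumulate operations with XNOR and bit counting.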

[0076] To provide a more convincing demonstration of the advantages of the feature information difference-based model distillation method proposed in this invention, extensive experiments were conducted on two mainstream object detection datasets, PASCAL VOC and COCO, to test the feature information difference-based model distillation method (IDa-Det) of this invention.

[0077] This invention employs two mainstream object detectors to test IDa-Det: the two-stage object detector Faster R-CNN and the single-stage object detector SSD. For Faster R-CNN, this invention uses ResNet-18, ResNet-34, and ResNet-50 as backbone networks, respectively. This invention uses VGG-16 as the backbone network of SSD. This invention implements IDa-Det based on the PyTorch deep learning framework.

[0078] This invention was tested on four NVIDIA RTX 2080 Ti graphics cards, each with 11 GB of video memory, on a machine with 128 GB of RAM. The ImageNet ILSVRC12 dataset was used to pre-train the binary student network backbone.

[0079] This invention uses stochastic gradient descent to train the model, setting the batch size to 24 for SSD and 8 for Faster-RCNN. Within the Faster-RCNN framework, this invention keeps the first layer, the shortcut layers, and the last layers (the 1×1 convolutional layer of the RPN and the fully connected layer of the bounding box module) at full precision, and binarizes the remaining convolutional layers. Following BiDet, the extra layers of the SSD model are also kept at full precision. This invention modifies the ResNet-18 / 34 architecture, adding an extra shortcut module and a PReLU layer.

[0080] In addition, this invention made certain modifications to the ResNet-50 and VGG-16 networks in the experiments. The lateral connections of the FPN were replaced with 3×3 binary convolutional layers to improve performance; this adjustment was applied in all Faster-RCNN model experiments. For Faster-RCNN, the model was trained for 12 epochs with an initial learning rate of 0.02, which was multiplied by 0.1 in the 9th and 11th epochs. For SSD, the model was trained for 24 epochs with an initial learning rate of 0.01, which was multiplied by 0.1 in the 16th and 22nd epochs. The teacher networks are a full-precision Faster-RCNN (mAP of 81.9% on the VOC dataset and 39.8% on the COCO dataset) and a full-precision SSD300 with VGG16 as the backbone network (mAP of 74.5% on the VOC dataset and 25.0% on the COCO dataset).
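The two training schedules described above can be written as a simple step-decay rule. The following is a minimal sketch under stated assumptions: epochs are counted from 1, and the decay is assumed to take effect from the start of each listed epoch (the text does not say whether it applies at the start or the end). The names `step_lr`, `schedule`, `FASTER_RCNN`, and `SSD` are hypothetical.

```python
def step_lr(base_lr, epoch, milestones, gamma=0.1):
    """Learning rate for a 1-indexed epoch: multiply by gamma once per milestone reached."""
    drops = sum(1 for m in milestones if epoch >= m)
    return base_lr * gamma ** drops


# Values taken from the text; the dictionary layout is an assumption.
FASTER_RCNN = {"base_lr": 0.02, "milestones": (9, 11), "epochs": 12}
SSD = {"base_lr": 0.01, "milestones": (16, 22), "epochs": 24}


def schedule(cfg):
    """Return the learning rate used in every epoch of a training run."""
    return [
        step_lr(cfg["base_lr"], epoch, cfg["milestones"])
        for epoch in range(1, cfg["epochs"] + 1)
    ]


for name, cfg in (("Faster-RCNN", FASTER_RCNN), ("SSD", SSD)):
    print(name, schedule(cfg))
```

In a PyTorch training loop, a multi-step scheduler with the same milestones and a decay factor of 0.1 would play the same role; the pure-Python version is used here only so that the schedule can be inspected directly.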

[0081] First, the experimental results based on the PASCAL VOC dataset are analyzed. In the same object detection model, this invention compares the proposed IDa-Det with the state-of-the-art quantization network ReActNet and other knowledge distillation methods (such as FGFI, DeFeat, and LWS-Det). This invention also compares the detection performance of the four-valued quantization network DoReFa-Net. The input resolutions of the two detection models are set to 1000×600 for Faster R-CNN and 300×300 for SSD.

[0082] Table 1 compares several quantization methods and detection models in terms of computational complexity, storage cost, and mAP. According to Table 1, the IDa-Det of this invention significantly accelerates the computation of various detection models and reduces storage requirements. Following XNOR-Net, memory usage is estimated as 32 bits multiplied by the number of full-precision kernels plus 1 bit multiplied by the number of binary kernels in the network. Floating-point operations (FLOPs) are counted in the same way as in Bi-Real-Net. Because current-generation CPUs can process bitwise XNOR and bit-counting operations in parallel, the number of operations (OPs) is calculated as the full-precision FLOPs plus 1 / 64 of the number of binary multiplications.
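The two cost estimates described above reduce to short formulas. The following is a minimal sketch; the function names are hypothetical and the numbers in the example are illustrative only and are not taken from Table 1.

```python
def memory_bits(num_full_precision_kernels, num_binary_kernels):
    """Estimated storage in bits: 32 bits per full-precision kernel, 1 bit per binary kernel."""
    return 32 * num_full_precision_kernels + 1 * num_binary_kernels


def total_ops(full_precision_flops, binary_multiplications):
    """OPs: full-precision FLOPs plus 1/64 of the number of binary multiplications."""
    return full_precision_flops + binary_multiplications / 64


# Hypothetical layer counts, chosen only to make the arithmetic easy to follow.
bits = memory_bits(num_full_precision_kernels=1_000, num_binary_kernels=64_000)
ops = total_ops(full_precision_flops=1_000_000, binary_multiplications=6_400_000)
print(bits / 8, "bytes")
print(ops, "OPs")
```

With these illustrative inputs, the binary part contributes 64 times more multiplications than the full-precision part contributes FLOPs, yet adds only one tenth as many OPs, which is the source of the computational savings reported for binary detectors.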

[0083] The experimental results show that the model trained with this invention achieves strong performance in both the Faster-RCNN and SSD object detection frameworks.

[0084] Next, the experimental results based on the COCO dataset are analyzed. Due to its larger size and greater diversity, the COCO dataset is more challenging than the PASCAL VOC dataset in object detection tasks.

[0085] Table 1 Comparison of experimental results based on the PASCAL VOC dataset

[0086]

[0087] On the COCO dataset, this invention compares the proposed IDa-Det with the state-of-the-art binary detection model ReActNet, as well as the state-of-the-art object detection distillation techniques FGFI, DeFeat, and LWS-Det. In addition, this invention reports the detection performance of the quaternary quantization models FQN and DoReFa-Net for reference. For Faster-RCNN and SSD, the input image resolutions are 1333×800 and 300×300, respectively.

[0088] Table 2 Comparison of experimental results based on the COCO dataset

[0089]

[0090] Table 2 shows the mAP, the AP at different IoU thresholds, and the AP for objects of different scales. On the COCO dataset, the method provided in this invention outperforms existing state-of-the-art models and other knowledge distillation methods on these metrics, demonstrating the superiority and versatility of IDa-Det in many application scenarios.

[0091] To illustrate the hardware deployment efficiency of the method provided in this invention, a binary model implemented with IDa-Det is deployed on an ODROID C4, which has a 2.016 GHz 64-bit quad-core ARM Cortex-A55 processor. Evaluation of its actual speed demonstrates that IDa-Det is sufficiently efficient when deployed to real-world mobile devices.

[0092] To ensure compatibility between the BOLT inference framework and the IDa-Det of this invention, this invention utilizes the SIMD instruction SSHL on ARM NEON. Table 3 compares IDa-Det with the full-precision networks. The models are tested on the VOC dataset. For Faster-RCNN, the input image resolution is adjusted to 1000×600, and for SSD it is 300×300. Under the efficient BOLT framework, IDa-Det achieves significantly faster inference. For example, the speedup achieved on Faster-RCNN is approximately 4.7× to 5.6×.

[0093] Furthermore, IDa-Det achieves a 13.91× speedup on SSD. These deployment results are significant for object detection on real-world hardware devices.

[0094] Table 3 Hardware Deployment Efficiency Analysis Table

[0095]

[0096] In conclusion, by comparing with other state-of-the-art binary CNNs and other knowledge distillation methods for detectors, the superiority of the feature information difference-based model distillation method (IDa-Det) provided by this invention is demonstrated, and IDa-Det also achieves good results in hardware deployment efficiency.

[0097] Figure 2 is a schematic diagram of the structure of the model distillation apparatus based on feature information differences provided by the present invention. As shown in Figure 2, the apparatus includes: a first module 201, a second module 202, and a third module 203.

[0098] The first module 201 is used to acquire student feature maps and teacher feature maps of the sample image; the student feature map is obtained by extracting features from the sample image using a student model, and the teacher feature map is obtained by extracting features from the sample image using a teacher model.

[0099] The second module 202 is used to obtain a first image feature and a second image feature corresponding to each anchor box based on the anchor box position information of the sample image; the first image feature is the image feature of the student feature map, and the second image feature is the image feature of the teacher feature map.

[0100] The third module 203 is used to perform distillation training on the student model using the teacher model based on the feature difference between the first image features and the second image features.
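The anchor selection on which the third module relies can be sketched as follows. According to the claims, the feature difference degree is the Mahalanobis distance between the student and teacher features of an anchor box, and the anchor boxes with the largest difference are selected as target anchor boxes. This is a minimal sketch, not the implementation of this invention: it assumes that each anchor box has already been reduced to a flat feature vector, it approximates the covariance matrix by per-channel variances pooled over both models (a diagonal covariance), and the names `channel_variances`, `mahalanobis`, `select_target_anchors`, and `top_k` are hypothetical. The entropy distillation loss itself is not shown.

```python
import math


def channel_variances(features, eps=1e-6):
    """Per-channel variance over a list of feature vectors (diagonal covariance, an assumption)."""
    n = len(features)
    dims = len(features[0])
    means = [sum(f[d] for f in features) / n for d in range(dims)]
    return [
        sum((f[d] - means[d]) ** 2 for f in features) / n + eps  # eps avoids division by zero
        for d in range(dims)
    ]


def mahalanobis(student_feat, teacher_feat, variances):
    """Mahalanobis distance between two vectors under a diagonal covariance."""
    return math.sqrt(
        sum((s - t) ** 2 / v for s, t, v in zip(student_feat, teacher_feat, variances))
    )


def select_target_anchors(student_feats, teacher_feats, top_k):
    """Return indices of the top_k anchor boxes with the largest feature difference, plus all distances."""
    variances = channel_variances(student_feats + teacher_feats)
    distances = [mahalanobis(s, t, variances) for s, t in zip(student_feats, teacher_feats)]
    order = sorted(range(len(distances)), key=lambda i: distances[i], reverse=True)
    return order[:top_k], distances


# Illustrative features for three anchor boxes, two channels each.
student = [[0.0, 0.0], [1.0, 1.0], [0.5, 0.5]]
teacher = [[0.1, 0.0], [3.0, -1.0], [0.5, 1.5]]
targets, dists = select_target_anchors(student, teacher, top_k=2)
print(targets)
```

In this example the second anchor box differs most between the two models and the first differs least, so with `top_k=2` the second and third anchor boxes are kept for distillation.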

[0101] It should be noted that, in specific operation, the model distillation apparatus based on feature information differences provided in this embodiment of the invention can execute the model distillation method based on feature information differences described in any of the above embodiments; the details are not repeated here.

[0102] Figure 3 is a schematic diagram of the structure of the electronic device provided by the present invention. As shown in Figure 3, the electronic device may include a processor 310, a communication interface 320, a memory 330, and a communication bus 340, wherein the processor 310, the communication interface 320, and the memory 330 communicate with each other via the communication bus 340. The processor 310 can call logical instructions in the memory 330 to execute a model distillation method based on feature information difference. This method includes: acquiring a student feature map and a teacher feature map of a sample image, where the student feature map is obtained by extracting features from the sample image using a student model, and the teacher feature map is obtained by extracting features from the sample image using a teacher model; acquiring a first image feature and a second image feature corresponding to each anchor box based on the anchor box position information of the sample image, where the first image feature is the image feature of the student feature map, and the second image feature is the image feature of the teacher feature map; and training the student model by distillation using the teacher model based on the feature difference degree between the first image feature and the second image feature.

[0103] Furthermore, the logical instructions in the aforementioned memory 330 can be implemented as software functional units and, when sold or used as independent products, can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, or the part that contributes to the prior art, or a part of the technical solution, can be embodied in the form of a software product. This computer software product is stored in a storage medium and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as USB flash drives, portable hard drives, read-only memory (ROM), random access memory (RAM), magnetic disks, or optical disks.

[0104] In another aspect, the present invention also provides a computer program product comprising a computer program stored on a non-transitory computer-readable storage medium, the computer program comprising program instructions. When the program instructions are executed by a computer, the computer is able to execute the model distillation method based on feature information difference provided in the above embodiments. The method includes: acquiring a student feature map and a teacher feature map of a sample image, where the student feature map is obtained by extracting features from the sample image using a student model, and the teacher feature map is obtained by extracting features from the sample image using a teacher model; acquiring a first image feature and a second image feature corresponding to each anchor box according to the anchor box position information of the sample image, where the first image feature is the image feature of the student feature map, and the second image feature is the image feature of the teacher feature map; and training the student model by distillation using the teacher model based on the feature difference degree between the first image feature and the second image feature.

[0105] In yet another aspect, the present invention also provides a non-transitory computer-readable storage medium storing a computer program thereon, which, when executed by a processor, implements the model distillation method based on feature information difference provided in the above embodiments. The method includes: acquiring a student feature map and a teacher feature map of a sample image, where the student feature map is obtained by extracting features from the sample image using a student model, and the teacher feature map is obtained by extracting features from the sample image using a teacher model; acquiring a first image feature and a second image feature corresponding to each anchor box based on the anchor box position information of the sample image, where the first image feature is the image feature of the student feature map, and the second image feature is the image feature of the teacher feature map; and training the student model by distillation using the teacher model based on the feature difference degree between the first image feature and the second image feature.

[0106] The device embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate. The components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the modules can be selected to achieve the purpose of this embodiment according to actual needs. Those skilled in the art can understand and implement this without any creative effort.

[0107] Through the above description of the embodiments, those skilled in the art can clearly understand that each embodiment can be implemented by means of software plus necessary general-purpose hardware platforms, and of course, it can also be implemented by hardware. Based on this understanding, the above technical solutions, in essence or the part that contributes to the prior art, can be embodied in the form of a software product. This computer software product can be stored in a computer-readable storage medium, such as ROM / RAM, magnetic disk, optical disk, etc., and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute the methods described in the various embodiments or some parts of the embodiments.

[0108] Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present invention, and not to limit them; although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some of the technical features; and these modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims

1. A model distillation method based on feature information differences, characterized in that it comprises: obtaining a student feature map and a teacher feature map of a sample image, wherein the student feature map is obtained by extracting features from the sample image using a student model, and the teacher feature map is obtained by extracting features from the sample image using a teacher model; obtaining, based on anchor box position information of the sample image, a first image feature and a second image feature corresponding to each anchor box, wherein the first image feature is an image feature of the student feature map, and the second image feature is an image feature of the teacher feature map; obtaining a feature difference degree between the first image feature and the second image feature corresponding to each anchor box; selecting, based on the feature difference degrees, the anchor boxes with the largest feature difference degrees from all anchor boxes as target anchor boxes; determining, based on the first image features and the second image features corresponding to the target anchor boxes, an entropy distillation loss between the teacher model and the student model; and training the student model by distillation based on the entropy distillation loss; wherein obtaining the feature difference degree between the first image feature and the second image feature corresponding to each anchor box comprises: calculating a Mahalanobis distance between the first image feature and the second image feature; and using the Mahalanobis distance as the feature difference degree.

2. The model distillation method based on feature information differences according to claim 1, characterized in that, before obtaining the first image feature and the second image feature corresponding to each anchor box based on the anchor box position information of the sample image, the method further comprises: performing object detection on the sample image using the student model to obtain first anchor box position information of the sample image; performing object detection on the sample image using the teacher model to obtain second anchor box position information of the sample image; and using the first anchor box position information and the second anchor box position information as the anchor box position information of the sample image.

3. The model distillation method based on feature information differences according to claim 1, characterized in that both the student model and the teacher model are object detection models, and training the student model by distillation based on the entropy distillation loss comprises: obtaining a task loss of the student model for object detection; and updating the parameters of the student model based on the task loss and the entropy distillation loss.

4. The model distillation method based on feature information differences according to claim 1, characterized in that the student model is a binary network and the teacher model is a full-precision network.

5. The model distillation method based on feature information differences according to claim 4, characterized in that the student model is a binarized Faster-RCNN network, and the feature extraction backbone network of the student model is one of ResNet-18, ResNet-34, and ResNet-50.

6. A model distillation apparatus based on feature information differences, characterized in that it comprises: a first module, used to obtain a student feature map and a teacher feature map of a sample image, wherein the student feature map is obtained by extracting features from the sample image using a student model, and the teacher feature map is obtained by extracting features from the sample image using a teacher model; a second module, used to obtain a first image feature and a second image feature corresponding to each anchor box based on anchor box position information of the sample image, wherein the first image feature is an image feature of the student feature map, and the second image feature is an image feature of the teacher feature map; and a third module, used to perform distillation training on the student model using the teacher model based on a feature difference degree between the first image feature and the second image feature; wherein performing distillation training on the student model using the teacher model based on the feature difference degree between the first image feature and the second image feature comprises: obtaining the feature difference degree between the first image feature and the second image feature corresponding to each anchor box; selecting, based on the feature difference degrees, the anchor boxes with the largest feature difference degrees from all anchor boxes as target anchor boxes; determining, based on the first image features and the second image features corresponding to the target anchor boxes, an entropy distillation loss between the teacher model and the student model; and training the student model by distillation based on the entropy distillation loss; and wherein obtaining the feature difference degree between the first image feature and the second image feature corresponding to each anchor box comprises: calculating a Mahalanobis distance between the first image feature and the second image feature; and using the Mahalanobis distance as the feature difference degree.

7. An electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that, when the processor executes the computer program, it implements the steps of the model distillation method based on feature information differences as described in any one of claims 1 to 5.

8. A non-transitory computer-readable storage medium having a computer program stored thereon, characterized in that, when the computer program is executed by a processor, it implements the steps of the model distillation method based on feature information differences as described in any one of claims 1 to 5.