A target detection knowledge distillation method and device, terminal equipment and storage medium

By constructing a teacher-student model system for the object detection task and using attention masks to calculate the loss function that optimizes the student model parameters, the method addresses the difficulty of combining a small parameter count with high accuracy, thereby improving detection performance.

CN115457364B · Active · Publication Date: 2026-06-16 · CHANGSHA INTELLIGENT DRIVING INST CORP LTD


Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents(China)
Current Assignee / Owner: CHANGSHA INTELLIGENT DRIVING INST CORP LTD
Filing Date: 2022-08-30
Publication Date: 2026-06-16

Patent Timeline

30 Aug 2022: Application
16 Jun 2026: Publication (CN115457364B)

IPC: G06V10/25; G06V10/82; G06N3/096; G06N3/0464; G06N3/0495; G06N3/045; G06N3/082

CPC: G06V10/82; G06N3/082; G06V2201/07

AI Tagging

Application Domain

Character and pattern recognition; Neural learning methods


AI Technical Summary

⚠Technical Problem

Existing target detection models struggle to simultaneously maintain a small number of model parameters and high detection accuracy when the computing power of the equipment is insufficient. Knowledge distillation methods are not effective in target detection tasks.

⚗Method used

By constructing a teacher model and a student model, the student model is trained using the supervised information from the teacher model. Spatial attention mask, target channel attention mask, and background channel attention mask are calculated. Target distillation loss and background distillation loss are calculated respectively, and a loss function is constructed to optimize the student model parameters.

🎯Benefits of technology

This improves the distillation effect of the target detection model, enhancing its detection performance and accuracy.

✦ Generated by Eureka AI based on patent content.


Abstract

The application relates to the technical field of target detection, and proposes a target detection knowledge distillation method and device, terminal equipment and a storage medium. The method first inputs a sample image into a teacher model and a student model to obtain feature images output by the two models; then, spatial attention masks, target channel attention masks and background channel attention masks are calculated according to the feature images; next, target distillation loss and background distillation loss are respectively calculated according to data such as the feature images, the spatial attention masks, the target channel attention masks and the background channel attention masks; finally, a loss function is constructed according to the target distillation loss and the background distillation loss, and the parameters of the student model are optimized based on the loss function. The method can refine the granularity of distillation and improve the model distillation effect of target detection to a certain extent.


Description

Technical Field

[0001] This application relates to the field of target detection technology, and in particular to a target detection knowledge distillation method, apparatus, terminal equipment, and storage medium.

Background Technology

[0002] In recent years, deep learning-based object detection methods have developed rapidly. However, ensuring both real-time performance and effectiveness of the model in practical applications remains a challenge. For example, when computing power is insufficient, the number of model parameters and the computational cost need to be sufficiently small, while detection accuracy must also be guaranteed. In general, however, a larger number of model parameters yields better accuracy and a smaller number yields worse accuracy, so the two requirements conflict. Knowledge distillation of object detection models is one way to resolve this conflict, reducing model parameters while maintaining or even improving model accuracy.

[0003] Knowledge distillation is a common method for model compression. It mainly involves constructing a lightweight, smaller model and training it using supervision information from a larger, more powerful model, thereby improving the smaller model's performance and accuracy. Typically, the larger model is called the teacher model, and the smaller model is called the student model.

[0004] Currently, knowledge distillation is mainly used for target classification tasks, but it has not achieved good model distillation results when applied to target detection tasks.

Summary of the Invention

[0005] In view of this, embodiments of this application provide a method, apparatus, terminal device, and storage medium for knowledge distillation of target detection, which can improve the model distillation effect of target detection.

[0006] A first aspect of this application provides a target detection knowledge distillation method, comprising:

[0007] Input the sample images with labeled target detection boxes into the teacher model and the student model respectively;

[0008] Obtain the first feature image output by the teacher model and the second feature image output by the student model;

[0009] The first spatial attention mask, the first target channel attention mask, and the first background channel attention mask are calculated based on the first feature image.

[0010] The target distillation loss is calculated based on the first feature image, the second feature image, the first spatial attention mask, and the first target channel attention mask.

[0011] The background distillation loss is calculated based on the first feature image, the second feature image, the first spatial attention mask, and the first background channel attention mask.

[0012] Construct a loss function based on the target distillation loss and the background distillation loss;

[0013] Based on the loss function, the parameters of the student model are optimized.

[0014] In this embodiment, sample images are first input into the teacher model and student model respectively to obtain the first feature image output by the teacher model and the second feature image output by the student model. Then, spatial attention mask, target channel attention mask, and background channel attention mask are calculated based on the first feature image. Next, target distillation loss and background distillation loss are calculated based on the first feature image, second feature image, spatial attention mask, target channel attention mask, and background channel attention mask. Finally, a loss function is constructed based on the target distillation loss and background distillation loss, and the parameters of the student model are optimized based on this loss function. The above process distinguishes between the target and background when calculating channel attention, calculating channel-level attention adapted to the target and background respectively to obtain the target channel attention mask and background channel attention mask. Then, the target distillation loss and background distillation loss are calculated separately, thereby guiding the distillation of the target and background respectively. This processing can refine the granularity of distillation and improve the model distillation effect of target detection to a certain extent.
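The core idea of treating target and background separately can be illustrated with a minimal NumPy sketch. It is not the patented loss: it uses a plain binary box mask instead of the attention and Gaussian masks, and the function name `split_distillation_loss`, the box format, and the weights `alpha` and `beta` are hypothetical choices made for illustration.

```python
import numpy as np

def split_distillation_loss(f_teacher, f_student, boxes, alpha=1.0, beta=0.5):
    """Feature-imitation loss split into a target part and a background part.

    f_teacher, f_student: arrays of shape (C, H, W).
    boxes: list of (x0, y0, x1, y1) in feature-map pixels, end-exclusive.
    alpha, beta: hypothetical weights for the two parts (not from the source).
    """
    _, h, w = f_teacher.shape
    target = np.zeros((h, w))
    for x0, y0, x1, y1 in boxes:
        target[y0:y1, x0:x1] = 1.0          # pixels covered by a labeled box
    background = 1.0 - target               # everything else

    sq_err = (f_teacher - f_student) ** 2   # (C, H, W)
    loss_target = float((sq_err * target).sum())
    loss_background = float((sq_err * background).sum())
    total = alpha * loss_target + beta * loss_background
    return total, loss_target, loss_background

f_t = np.ones((2, 4, 4))
f_s = np.zeros((2, 4, 4))
total, l_t, l_b = split_distillation_loss(f_t, f_s, boxes=[(1, 1, 3, 3)])
print(total, l_t, l_b)
```

Keeping the two parts as separate terms is what allows each region to receive its own mask and weight in the later steps.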

[0015] In one implementation of this application, after obtaining the first feature image output by the teacher model and the second feature image output by the student model, the method may further include:

[0016] Based on the labels of each target detection box contained in the first feature image and / or the labels of each target detection box contained in the second feature image, the target Gaussian mask and the background Gaussian mask are calculated by Gaussian encoding.

[0017] The calculation of the target distillation loss based on the first feature image, the second feature image, the first spatial attention mask, and the first target channel attention mask can specifically be as follows:

[0018] The target distillation loss is calculated based on the first feature image, the second feature image, the target Gaussian mask, the first spatial attention mask, and the first target channel attention mask.

[0019] The background distillation loss is calculated based on the first feature image, the second feature image, the first spatial attention mask, and the first background channel attention mask. Specifically, this can be achieved by:

[0020] The background distillation loss is calculated based on the first feature image, the second feature image, the background Gaussian mask, the first spatial attention mask, and the first background channel attention mask. Further, the step of calculating the target Gaussian mask and the background Gaussian mask using Gaussian encoding based on the labels of each target detection box in the first feature image and / or the labels of each target detection box in the second feature image may include:

[0021] Based on the labels of each target detection box contained in the first feature image and / or the labels of each target detection box contained in the second feature image, obtain the center point coordinates of each target detection box.

[0022] For each target detection box, a Gaussian radius is set according to the height and width of the target detection box, and the Gaussian mask of the target detection box is calculated based on the two-dimensional Gaussian formula and the coordinates of the center point of the target detection box;

[0023] The target Gaussian mask is constructed based on the Gaussian mask of each target detection box;

[0024] The background Gaussian mask is constructed based on the target Gaussian mask.

[0025] Furthermore, after constructing the background Gaussian mask based on the target Gaussian mask, the process may further include:

[0026] A scale mask is constructed based on the height and width of each target detection box and the background Gaussian mask;

[0027] Perform element-wise multiplication on the target Gaussian mask and the scale mask at corresponding positions to obtain the updated target Gaussian mask;

[0028] The background Gaussian mask and the scale mask are multiplied element-wise at corresponding positions to obtain the updated background Gaussian mask.

[0029] Furthermore, calculating the target distillation loss based on the first feature image, the second feature image, the target Gaussian mask, the first spatial attention mask, and the first target channel attention mask may include:

[0030] Calculate the product of the first spatial attention mask and the first target channel attention mask to obtain the target fusion distillation mask;

[0031] The target distillation loss is calculated based on the first feature image, the second feature image, the target Gaussian mask, and the target fusion distillation mask.

[0032] The step of calculating the background distillation loss based on the first feature image, the second feature image, the background Gaussian mask, the first spatial attention mask, and the first background channel attention mask may include:

[0033] Calculate the product of the first spatial attention mask and the first background channel attention mask to obtain the background fusion distillation mask;

[0034] The background distillation loss is calculated based on the first feature image, the second feature image, the background Gaussian mask, and the background fusion distillation mask.
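A possible reading of the fusion step is sketched below in NumPy. The text only states that the spatial and channel attention masks are multiplied; the broadcasting of a (H, W) spatial mask against a (C,) channel mask, the squared-error form of the loss, and the names `fused_mask` and `masked_distillation_loss` are assumptions made for this illustration.

```python
import numpy as np

def fused_mask(spatial_att, channel_att):
    """Assumed fusion: (H, W) spatial mask times (C,) channel mask -> (C, H, W)."""
    return channel_att[:, None, None] * spatial_att[None, :, :]

def masked_distillation_loss(f_teacher, f_student, gaussian_mask, fusion_mask):
    """Assumed loss form: squared feature error weighted by both masks, summed."""
    diff = f_teacher - f_student
    return float((gaussian_mask[None] * fusion_mask * diff ** 2).sum())

# Tiny example with 2 channels on a 2x2 feature map.
spatial = np.ones((2, 2))
target_channel = np.array([2.0, 0.0])        # channel 0 matters for targets
background_channel = np.array([0.0, 1.0])    # channel 1 matters for background
target_gauss = np.array([[1.0, 0.0], [0.0, 0.0]])
background_gauss = 1.0 - target_gauss
f_t, f_s = np.ones((2, 2, 2)), np.zeros((2, 2, 2))

loss_target = masked_distillation_loss(
    f_t, f_s, target_gauss, fused_mask(spatial, target_channel))
loss_background = masked_distillation_loss(
    f_t, f_s, background_gauss, fused_mask(spatial, background_channel))
print(loss_target, loss_background)
```

The same loss function serves both branches; only the Gaussian mask and the channel attention mask differ between the target and background calls.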

[0035] In one implementation of this application, after obtaining the first feature image output by the teacher model and the second feature image output by the student model, the method may further include:

[0036] The second spatial attention mask, the second target channel attention mask, and the second background channel attention mask are calculated based on the second feature image.

[0037] The model distillation loss is calculated based on the first spatial attention mask, the second spatial attention mask, the first target channel attention mask, the second target channel attention mask, the first background channel attention mask, and the second background channel attention mask.

[0038] The loss function constructed based on the target distillation loss and the background distillation loss can specifically be as follows:

[0039] The loss function is constructed based on the target distillation loss, the background distillation loss, and the model distillation loss.
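The combination of the three loss terms might look like the following NumPy sketch. The source does not give the form of the model distillation loss in this passage, so the L1 distance between teacher and student masks, the weights `alpha`, `beta`, `gamma`, and the function names are all assumptions.

```python
import numpy as np

def attention_mimic_loss(teacher_masks, student_masks):
    """Assumed form: summed L1 distance between corresponding attention masks."""
    return float(sum(np.abs(t - s).sum()
                     for t, s in zip(teacher_masks, student_masks)))

def total_loss(loss_target, loss_background, loss_model,
               alpha=1.0, beta=1.0, gamma=1.0):
    """Weighted sum of the three terms; the weights are hypothetical."""
    return alpha * loss_target + beta * loss_background + gamma * loss_model

# (spatial mask, target channel mask, background channel mask) for each model.
teacher = (np.ones((2, 2)), np.array([1.5, 0.5]), np.array([1.0, 1.0]))
student = (np.array([[0.5, 1.5], [1.0, 1.0]]),
           np.array([1.0, 1.0]), np.array([1.0, 1.0]))

loss_model = attention_mimic_loss(teacher, student)
loss = total_loss(2.0, 3.0, loss_model, beta=0.5, gamma=0.1)
print(loss_model, loss)
```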

[0040] Furthermore, the step of calculating the first spatial attention mask, the first target channel attention mask, and the first background channel attention mask based on the first feature image may include:

[0041] The first intermediate variable is calculated based on the first feature image and the number of channels in the first feature image;

[0042] The first spatial attention mask is calculated based on the height and width of the first feature image and the first intermediate variable.

[0043] The second intermediate variable is calculated based on the number of target detection boxes contained in the first feature image, the height and width of each target detection box contained in the first feature image, and the first feature image.

[0044] The first target channel attention mask is calculated based on the number of channels in the first feature image and the second intermediate variable.

[0045] A third intermediate variable is calculated based on the height and width of the first feature image, a preset indicator function, and the first feature image; wherein the indicator function is constructed based on the background Gaussian mask;

[0046] The first background channel attention mask is calculated based on the number of channels in the first feature image and the third intermediate variable.

[0047] The step of calculating the second spatial attention mask, the second target channel attention mask, and the second background channel attention mask based on the second feature image may include:

[0048] The fourth intermediate variable is calculated based on the second feature image and the number of channels in the second feature image;

[0049] The second spatial attention mask is calculated based on the height and width of the second feature image and the fourth intermediate variable.

[0050] The fifth intermediate variable is calculated based on the number of target detection boxes contained in the second feature image, the height and width of each target detection box contained in the second feature image, and the second feature image itself.

[0051] The second target channel attention mask is calculated based on the number of channels in the second feature image and the fifth intermediate variable.

[0052] The sixth intermediate variable is calculated based on the height and width of the second feature image, the indicator function, and the second feature image;

[0053] The second background channel attention mask is calculated based on the number of channels in the second feature image and the sixth intermediate variable.
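The two channel attention masks can be sketched as follows. The exact formulas are not reproduced in this passage, so the sketch assumes that the intermediate variable is the mean absolute activation (over the labeled boxes for the target branch, over indicator-selected pixels for the background branch) and that the mask is the channel count times a softmax; the `temperature` parameter and all function names are hypothetical.

```python
import numpy as np

def softmax(x, temperature=1.0):
    z = (x - x.max()) / temperature          # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def target_channel_attention(feat, boxes, temperature=1.0):
    """feat: (C, H, W); boxes: non-empty list of (x0, y0, x1, y1), end-exclusive."""
    c = feat.shape[0]
    mag = np.abs(feat)
    # Mean response per channel inside each box, then averaged over the N boxes.
    per_box = [mag[:, y0:y1, x0:x1].mean(axis=(1, 2)) for x0, y0, x1, y1 in boxes]
    g = np.mean(per_box, axis=0)
    return c * softmax(g, temperature)

def background_channel_attention(feat, background_indicator, temperature=1.0):
    """background_indicator: (H, W) array, 1 on background pixels and 0 elsewhere."""
    c, h, w = feat.shape
    g = (np.abs(feat) * background_indicator[None]).sum(axis=(1, 2)) / (h * w)
    return c * softmax(g, temperature)

feat = np.zeros((2, 4, 4))
feat[0, 1:3, 1:3] = 1.0                      # channel 0 fires on the target
feat[1] = 1.0
feat[1, 1:3, 1:3] = 0.0                      # channel 1 fires on the background
indicator = np.ones((4, 4))
indicator[1:3, 1:3] = 0.0

att_target = target_channel_attention(feat, [(1, 1, 3, 3)])
att_background = background_channel_attention(feat, indicator)
print(att_target, att_background)
```

In this toy input the target mask favors channel 0 and the background mask favors channel 1, which is the separation the two-branch design is meant to expose.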

[0054] A second aspect of this application provides a target detection knowledge distillation apparatus, comprising:

[0055] The image input module is used to input sample images with labeled target detection boxes into the teacher model and the student model respectively;

[0056] The feature image acquisition module is used to acquire the first feature image output by the teacher model and the second feature image output by the student model;

[0057] The first attention mask calculation module is used to calculate the first spatial attention mask, the first target channel attention mask, and the first background channel attention mask based on the first feature image.

[0058] The target distillation loss calculation module is used to calculate the target distillation loss based on the first feature image, the second feature image, the first spatial attention mask, and the first target channel attention mask.

[0059] The background distillation loss calculation module is used to calculate the background distillation loss based on the first feature image, the second feature image, the first spatial attention mask, and the first background channel attention mask.

[0060] A loss function construction module is used to construct a loss function based on the target distillation loss and the background distillation loss;

[0061] The model parameter optimization module is used to optimize the parameters of the student model based on the loss function.

[0062] A third aspect of this application provides a terminal device including a memory, a processor, and a computer program stored in the memory and executable on the processor. When the processor executes the computer program, it implements the target detection knowledge distillation method as provided in the first aspect of this application.

[0063] A fourth aspect of this application provides a computer-readable storage medium storing a computer program that, when executed by a processor, implements the target detection knowledge distillation method as provided in the first aspect of this application.

[0064] The fifth aspect of this application provides a computer program product that, when run on a terminal device, causes the terminal device to execute the target detection knowledge distillation method provided in the first aspect of this application.

[0065] It is understood that the beneficial effects of the second to fifth aspects mentioned above can be found in the relevant descriptions in the first aspect mentioned above, and will not be repeated here.

Attached Figure Description

[0066] Figure 1 is a flowchart of a target detection knowledge distillation method provided in an embodiment of this application;

[0067] Figure 2 is a schematic diagram illustrating the principle of a target detection knowledge distillation method provided in an embodiment of this application;

[0068] Figure 3 is a schematic diagram of the structure of a target detection knowledge distillation device provided in an embodiment of this application;

[0069] Figure 4 is a schematic diagram of a terminal device provided in an embodiment of this application.

Detailed Implementation

[0070] In the following description, specific details such as particular system architectures and techniques are set forth for illustrative purposes and not for limitation, in order to provide a thorough understanding of the embodiments of this application. However, those skilled in the art will understand that this application can also be implemented in other embodiments without these specific details. In other instances, detailed descriptions of well-known systems, apparatuses, circuits, and methods have been omitted so as not to obscure the description of this application with unnecessary detail. Furthermore, in the description of this application and the appended claims, the terms "first," "second," "third," etc., are used only for distinguishing descriptions and should not be construed as indicating or implying relative importance.

[0071] This application proposes a knowledge distillation method for object detection, which is suitable for improving the distillation effect of models in object detection tasks and can effectively enhance model detection performance. For more specific technical implementation details of this application's embodiments, please refer to the method embodiments described below.

[0072] It should be understood that the execution subject of the various method embodiments of this application can be various types of terminal devices or servers, such as mobile phones, tablets, wearable devices, in-vehicle devices, augmented reality (AR) / virtual reality (VR) devices, laptops, ultra-mobile personal computers (UMPCs), netbooks, personal digital assistants (PDAs), large-screen TVs, etc. The embodiments of this application do not impose any restrictions on the specific type of terminal device and server.

[0073] Referring to Figure 1, a target detection knowledge distillation method provided in an embodiment of this application comprises:

[0074] 101. Input the sample images with labeled target detection boxes into the teacher model and student model respectively;

[0075] First, the sample images with labeled object detection boxes are input into the teacher model and the student model, respectively. In the field of knowledge distillation, the teacher model is typically a high-performance, complex neural network, while the student model is a lightweight, simple neural network. The knowledge distillation process generally includes: first, pre-training the teacher model to achieve convergence and high detection performance; then, during the training of the student model, using the output of the teacher model as additional supervisory information to train the student model, transferring knowledge from the feature images to improve the training optimization effect of the student model.
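The general training pattern (a frozen teacher supervising a trainable student) can be shown with a deliberately tiny NumPy toy. Here the "student" is only a per-channel scale vector `w` and the loss is plain mean squared feature error; a real student would be a neural network trained by backpropagation, and every name and hyperparameter below is an assumption made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(3, 8, 8))                  # stand-in backbone activations
teacher_scale = np.array([2.0, -1.0, 0.5])
f_teacher = teacher_scale[:, None, None] * x    # frozen teacher features

w = np.zeros(3)                                 # student parameters to learn

def loss_and_grad(w):
    """Mean squared feature error and its gradient with respect to w."""
    f_student = w[:, None, None] * x
    diff = f_student - f_teacher
    loss = np.mean(diff ** 2)
    grad = 2.0 * (diff * x).sum(axis=(1, 2)) / diff.size
    return loss, grad

initial_loss, _ = loss_and_grad(w)
for _ in range(200):                            # the teacher is never updated
    _, grad = loss_and_grad(w)
    w -= 0.5 * grad                             # plain gradient descent step
final_loss, _ = loss_and_grad(w)
print(initial_loss, final_loss, w)
```

Only `w` changes during the loop, mirroring the text: the pre-trained teacher supplies fixed supervisory features and the student parameters are pulled toward them.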

[0076] 102. Obtain the first feature image output by the teacher model and the second feature image output by the student model;

[0077] After inputting sample images into the teacher model, the teacher model can output corresponding feature images, represented by the first feature image. Similarly, after inputting sample images into the student model, the student model can also output corresponding feature images, represented by the second feature image. The output feature images here can generally be intermediate output features of the model, and can be single-scale or multiple feature images at different scales. For example, they can be feature images output from one or more layers of the backbone network, or from one or more layers of the neck network. Generally, the teacher model has a larger number of parameters to ensure stronger detection performance. Using the intermediate output features of the teacher model to guide the learning of the student model is feasible. In extreme cases, if the intermediate output features of the student model are consistent with those of the teacher model, then the detection result obtained by sending this feature to the detection head of the target detector will also be consistent.

[0078] In one implementation of this application, after obtaining the first feature image output by the teacher model and the second feature image output by the student model, the method may further include:

[0079] Based on the labels of each target detection box contained in the first feature image and / or the labels of each target detection box contained in the second feature image, the target Gaussian mask and the background Gaussian mask are calculated by Gaussian encoding.

[0080] Since the sample images are already labeled with ground truth labels (i.e., the location and type of the actual object detection boxes), the first feature image and / or the second feature image will both contain corresponding ground truth labels, i.e., the labels of each object detection box, which may include information such as the location and type of each object detection box. These ground truth labels can be input into the pre-programmed Gaussian encoding module to calculate the target Gaussian mask (corresponding to the foreground object in the image) and the background Gaussian mask (corresponding to the background in the image) through Gaussian encoding.

[0081] Furthermore, the step of calculating the target Gaussian mask and the background Gaussian mask using Gaussian encoding based on the labels of each target detection box contained in the first feature image and / or the labels of each target detection box contained in the second feature image may include:

[0082] (1) Obtain the center point coordinates of each target detection box according to the labels of each target detection box contained in the first feature image and / or the labels of each target detection box contained in the second feature image;

[0083] (2) For each target detection box, set the Gaussian radius according to the height and width of the target detection box, and calculate the Gaussian mask of the target detection box based on the two-dimensional Gaussian formula and the coordinates of the center point of the target detection box;

[0084] (3) Construct the target Gaussian mask based on the Gaussian mask of each target detection box;

[0085] (4) The background Gaussian mask is constructed based on the target Gaussian mask.

[0086] When calculating the target Gaussian mask and the background Gaussian mask using Gaussian encoding, firstly, the center coordinates of each target detection box can be obtained based on the labels of each target detection box in the first feature image and / or the labels of each target detection box in the second feature image (the labels of each target detection box in the first feature image and the second feature image are usually the same). Then, for each target detection box, a Gaussian radius is set according to its height and width, and the Gaussian mask of the target detection box is calculated based on the two-dimensional Gaussian formula and the center coordinates of the target detection box. Next, the target Gaussian mask of the corresponding feature image can be constructed based on the Gaussian mask of each target detection box. Finally, since the target (foreground) and background are usually mutually exclusive, the corresponding background Gaussian mask can be constructed from the target Gaussian mask.

[0087] For example, the Gaussian coding module first obtains the center point coordinates (x, y) of each target detection box from the ground-truth labels, and a Gaussian mask is defined for each box. Suppose the Gaussian mask of the r-th target detection box is to be calculated. Based on the height H_r and width W_r of that box, the corresponding Gaussian radius can be set to (H_r / 2, W_r / 2), and the two-dimensional Gaussian formula is then used to calculate the Gaussian mask of the target detection box, as follows:

[0088]

[0089] Here, sigma is an adjustable parameter. For each target detection box, the corresponding Gaussian mask can be calculated using the above method. After these Gaussian masks are filled in at their corresponding positions, the target Gaussian mask for the entire feature image is obtained, denoted M_G.

[0090] After the target Gaussian mask M_G has been calculated, the background Gaussian mask M_B can be calculated using the following formula:

[0091]

[0092] Where I represents the entire feature image region, and i and j represent individual points within the feature image region.
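Since the formulas themselves are not reproduced in this text, the following NumPy sketch only illustrates one plausible form of the Gaussian encoding. It assumes a standard axis-aligned two-dimensional Gaussian whose standard deviations are sigma times the radius (W_r / 2, H_r / 2), a per-pixel maximum where boxes overlap, and a background mask defined as one minus the target mask; the function name and box format are hypothetical.

```python
import numpy as np

def gaussian_masks(h, w, boxes, sigma=1.0):
    """Build target and background Gaussian masks for an (h, w) feature map.

    boxes: list of (cx, cy, box_w, box_h) in feature-map pixels.
    Assumed forms (not taken from the source): standard 2D Gaussian per box,
    values kept only inside the box, background = 1 - target.
    """
    ys, xs = np.mgrid[0:h, 0:w].astype(float)
    target = np.zeros((h, w))
    for cx, cy, bw, bh in boxes:
        rx, ry = sigma * bw / 2.0, sigma * bh / 2.0   # Gaussian radius per axis
        g = np.exp(-((xs - cx) ** 2 / (2 * rx ** 2)
                     + (ys - cy) ** 2 / (2 * ry ** 2)))
        inside = (np.abs(xs - cx) <= bw / 2.0) & (np.abs(ys - cy) <= bh / 2.0)
        target = np.maximum(target, np.where(inside, g, 0.0))
    background = 1.0 - target
    return target, background

target, background = gaussian_masks(8, 8, boxes=[(3, 3, 4, 4)])
print(target[3, 3], target[3, 5], background[0, 0])
```

The mask peaks at the box center and decays toward the box edge, which matches the stated intent of focusing distillation on the center region of each target instance.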

[0093] Furthermore, after constructing the background Gaussian mask based on the target Gaussian mask, the process may further include:

[0094] (1) A scale mask is constructed based on the height and width of each target detection box and the background Gaussian mask;

[0095] (2) Perform element-wise multiplication on the target Gaussian mask and the scale mask to obtain the updated target Gaussian mask;

[0096] (3) Perform element-wise multiplication on the background Gaussian mask and the scale mask to obtain the updated background Gaussian mask.

[0097] Since different target detection boxes may have different sizes, to balance the impact of different-sized target detection boxes, scale masks can be used to process the target Gaussian mask and the background Gaussian mask separately. For example, a scale mask S is constructed as follows:

[0098]

[0099]

[0100] Here, H_r and W_r represent the height and width of the r-th target detection box at the current feature image scale, respectively, and H and W represent the height and width of the current feature image, respectively.

[0101] Next, the scale mask S is applied to the target Gaussian mask M_G and the background Gaussian mask M_B as follows, yielding the updated target Gaussian mask M_G and the updated background Gaussian mask M_B:

[0102] M_G = M_G * S

[0103] M_B = M_B * S

[0104] Here, * indicates that the elements at the corresponding positions are multiplied.
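The scale-mask formula is not reproduced in this text, so the sketch below assumes one common construction: pixels inside the r-th box receive 1 / (H_r * W_r), and all other pixels receive one over the sum of the background mask. The function name `scale_mask`, the box format, and the binary masks used in the example are illustrative assumptions.

```python
import numpy as np

def scale_mask(h, w, boxes, background_mask):
    """Assumed scale mask: 1/(H_r*W_r) inside box r, 1/sum(background) elsewhere.

    boxes: list of (x0, y0, x1, y1), end-exclusive, in feature-map pixels.
    Assumes the background mask has a non-zero sum; where boxes overlap,
    the box listed last overwrites earlier ones.
    """
    s = np.full((h, w), 1.0 / background_mask.sum())
    for x0, y0, x1, y1 in boxes:
        s[y0:y1, x0:x1] = 1.0 / ((y1 - y0) * (x1 - x0))
    return s

# Binary masks keep the arithmetic easy to check by hand.
target = np.zeros((4, 4))
target[0:2, 0:2] = 1.0
background = 1.0 - target

s = scale_mask(4, 4, [(0, 0, 2, 2)], background)
updated_target = target * s          # element-wise product
updated_background = background * s  # element-wise product
print(updated_target.sum(), updated_background.sum())
```

With this normalization a small box and a large box contribute the same total weight, which is the balancing effect the scale mask is introduced for.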

[0105] In the above process, using Gaussian masks (target Gaussian mask and background Gaussian mask) to separate the target (foreground) and background of the image can guide the model distillation to focus more on the features of the center point region of the target instance, which is beneficial to improving the detection performance of the model. In addition, if the target object is a pedestrian, using Gaussian masks can also achieve better fit.

[0106] 103. Calculate the first spatial attention mask, the first target channel attention mask, and the first background channel attention mask based on the first feature image;

[0107] Next, we introduce the attention mechanism of the image. The attention involved in the model distillation process mainly includes spatial dimension attention and channel dimension attention. In the embodiment of this application, when calculating channel dimension attention, the target instance and the background are distinguished, and channel dimension attention adapted to the target instance and the background are calculated respectively to further improve the effect of model distillation.

[0108] Since the first feature image is output by the teacher model and the second feature image is output by the student model, this embodiment denotes the first feature image by a symbol carrying the superscript T (teacher) and the subscript C (channel), and the second feature image by a symbol carrying the superscript S (student) and the subscript C (channel).

[0109] The first feature image is input into the spatial attention module, which calculates the first spatial attention mask. In its notation, T represents the teacher (the mask is calculated from the feature image output by the teacher model) and S represents space. Spatial attention reduces the dimensionality of the features along the channel dimension, so that each pixel is represented by a single value. This attention reflects the importance of each pixel in the feature image: a stronger response indicates a higher probability that a target is present. It is mainly used to determine which spatial locations in the feature image should receive more attention.
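A spatial attention mask of this kind can be sketched in NumPy as follows. The sketch assumes the channel reduction is a mean of absolute activations and that the mask is H * W times a softmax over all pixels; the `temperature` parameter and the function name are assumptions, not details given in the text.

```python
import numpy as np

def spatial_attention(feat, temperature=1.0):
    """feat: (C, H, W) feature map -> (H, W) spatial attention mask."""
    _, h, w = feat.shape
    g = np.abs(feat).mean(axis=0)            # one value per pixel
    z = (g - g.max()) / temperature          # shift for numerical stability
    e = np.exp(z)
    return h * w * e / e.sum()               # mask values sum to H * W

uniform = spatial_attention(np.ones((3, 2, 2)))

feat = np.ones((3, 2, 2))
feat[:, 0, 0] = 5.0                          # strong response at pixel (0, 0)
peaked = spatial_attention(feat)
print(uniform, peaked)
```

A uniform feature map yields a mask of all ones, while a pixel with a stronger response receives a weight above one at the expense of the others.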

[0110] The first feature image is input into the channel attention module, which calculates the first target channel attention mask. In its notation, the superscript T represents the teacher (the mask is calculated from the feature image output by the teacher model), the superscript t represents the target, and the subscript C represents the channel. Inputting the first feature image into the channel attention module also yields the first background channel attention mask, in whose notation T represents the teacher (the mask is calculated from the feature image output by the teacher model), b represents the background, and C represents the channel. Channel attention reduces the dimensionality of the features along the height and width dimensions, so that each value reflects the response of one channel. This attention reflects the importance of each channel in the feature image. Specifically, target channel attention is used to determine which channels in the feature image are more important for target discrimination, while background channel attention is used to determine which channels in the feature image are more important for background discrimination.

[0111] Similarly, the second spatial attention mask, the second target channel attention mask, and the second background channel attention mask can be calculated based on the second feature image. For example, inputting the second feature image into the spatial attention module yields the second spatial attention mask, in whose notation the superscript S denotes the student (the mask is calculated from the feature image output by the student model) and the subscript S denotes space. Inputting the second feature image into the channel attention module yields the second target channel attention mask, in whose notation the superscript S denotes the student, the superscript t denotes the target, and the subscript C denotes the channel. The second background channel attention mask can be calculated in the same way; in its notation the superscript S denotes the student, the superscript b denotes the background, and the subscript C denotes the channel.

[0112] Specifically, calculating the first spatial attention mask, the first target channel attention mask, and the first background channel attention mask based on the first feature image may include:

[0113] (1) Calculate the first intermediate variable based on the first feature image and the number of channels of the first feature image;

[0114] (2) Calculate the first spatial attention mask based on the height and width of the first feature image and the first intermediate variable;

[0115] (3) Calculate the second intermediate variable based on the number of target detection boxes contained in the first feature image, the height and width of each target detection box contained in the first feature image, and the first feature image;

[0116] (4) Calculate the first target channel attention mask based on the number of channels in the first feature image and the second intermediate variable;

[0117] (5) Calculate a third intermediate variable based on the height and width of the first feature image, a preset indicator function, and the first feature image; wherein the indicator function is constructed based on the background Gaussian mask;

[0118] (6) Calculate the first background channel attention mask based on the number of channels of the first feature image and the third intermediate variable.

[0119] To calculate the first spatial attention mask, an intermediate variable G^T(F), referred to as the first intermediate variable, is first calculated from the first feature image and its number of channels C, using the following formula:

[0120]

[0121] Then, the first spatial attention mask is calculated from the height H and width W of the first feature image and the first intermediate variable G^T(F), using the following formula:

[0122]
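A minimal NumPy sketch of steps (1) and (2), assuming a commonly used attention-distillation construction: the first intermediate variable is the mean absolute response over the C channels, and the mask is a softmax over all H x W positions rescaled by H·W. The exact formulas of this embodiment may differ; the function name and the `temperature` parameter are hypothetical.

```python
import numpy as np

def spatial_attention_mask(feature, temperature=1.0):
    """feature: array of shape (C, H, W). Returns a mask of shape (H, W)."""
    C, H, W = feature.shape
    # Intermediate variable: mean absolute response over the C channels.
    g = np.abs(feature).sum(axis=0) / C
    # Softmax over all H*W positions (assumed), rescaled by H*W.
    logits = g.reshape(-1) / temperature
    logits = logits - logits.max()          # numerical stability
    weights = np.exp(logits) / np.exp(logits).sum()
    return (H * W * weights).reshape(H, W)

# Toy teacher feature image with C=8, H=4, W=5.
rng = np.random.default_rng(0)
teacher_feature = rng.normal(size=(8, 4, 5))
mask = spatial_attention_mask(teacher_feature)
```

Under this assumption the mask values are positive and average to 1 over the feature map, so positions with stronger responses receive weights above 1.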

[0123] To calculate the first target channel attention mask, another intermediate variable, referred to as the second intermediate variable, is first calculated from the number N of target detection boxes contained in the first feature image, the height H_r and width W_r of each target detection box contained in the first feature image, and the first feature image, using the following formula:

[0124]

[0125] Then, the first target channel attention mask is calculated from the number of channels C of the first feature image and the second intermediate variable, using the following formula:

[0126]

[0127] To calculate the first background channel attention mask, another intermediate variable, referred to as the third intermediate variable, is first calculated from the height H and width W of the first feature image, a preset indicator function, and the first feature image, using the following formula:

[0128]

[0129] Where A(i,j) is the preset indicator function, defined as follows:

[0130]

[0131] This represents an element in the background Gaussian mask mentioned above, where M is... The sum of quantities.

[0132] Then, the first background channel attention mask is calculated from the number of channels C of the first feature image and the third intermediate variable, using the following formula:

[0133]

[0134] At this point, the first spatial attention mask, the first target channel attention mask, and the first background channel attention mask have all been calculated.
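Steps (3) to (6) can be sketched in NumPy as follows. This is only an illustration under assumptions: the second intermediate variable is taken as the mean absolute response inside each target detection box, averaged over the N boxes; the indicator function A(i,j) is taken as 1 where the background Gaussian mask exceeds a threshold; and each mask is a softmax over channels rescaled by C. The exact formulas of this embodiment may differ, and all names, the box format, `threshold`, and `temperature` are hypothetical.

```python
import numpy as np

def _channel_softmax(g, temperature=1.0):
    # Softmax over the C channels, rescaled by C (assumed form).
    z = g / temperature
    z = z - z.max()
    e = np.exp(z)
    return g.shape[0] * e / e.sum()

def target_channel_attention(feature, boxes, temperature=1.0):
    """feature: (C, H, W); boxes: non-empty list of (x0, y0, x1, y1),
    in feature-map pixels, end-exclusive."""
    per_box = []
    for x0, y0, x1, y1 in boxes:
        region = np.abs(feature[:, y0:y1, x0:x1])   # (C, H_r, W_r)
        per_box.append(region.mean(axis=(1, 2)))     # average over H_r * W_r
    g = np.mean(per_box, axis=0)                     # average over the N boxes
    return _channel_softmax(g, temperature)

def background_channel_attention(feature, background_mask,
                                 threshold=0.5, temperature=1.0):
    # Indicator A(i, j): 1 on pixels treated as background (assumed rule).
    indicator = (background_mask > threshold).astype(feature.dtype)
    count = max(indicator.sum(), 1.0)
    g = (np.abs(feature) * indicator).sum(axis=(1, 2)) / count
    return _channel_softmax(g, temperature)

# Toy example: C=6 channels on an 8x8 feature map with two boxes.
rng = np.random.default_rng(0)
feature = rng.normal(size=(6, 8, 8))
boxes = [(1, 1, 4, 5), (5, 2, 8, 6)]
background_mask = np.ones((8, 8))
for x0, y0, x1, y1 in boxes:
    background_mask[y0:y1, x0:x1] = 0.0
target_attention = target_channel_attention(feature, boxes)
background_attention = background_channel_attention(feature, background_mask)
```

Both results are vectors of length C whose entries sum to C, so a channel with an above-average response in the target (or background) region gets a weight above 1.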

[0135] The step of calculating the second spatial attention mask, the second target channel attention mask, and the second background channel attention mask based on the second feature image may include:

[0136] (1) Calculate the fourth intermediate variable based on the second feature image and the number of channels of the second feature image;

[0137] (2) The second spatial attention mask is calculated based on the height and width of the second feature image and the fourth intermediate variable;

[0138] (3) The fifth intermediate variable is calculated based on the number of target detection boxes contained in the second feature image, the height and width of each target detection box contained in the second feature image, and the second feature image.

[0139] (4) Calculate the second target channel attention mask based on the number of channels in the second feature image and the fifth intermediate variable;

[0140] (5) Calculate the sixth intermediate variable based on the height and width of the second feature image, the indicator function, and the second feature image;

[0141] (6) The second background channel attention mask is calculated based on the number of channels of the second feature image and the sixth intermediate variable.

[0142] The second spatial attention mask, the second target channel attention mask, and the second background channel attention mask are calculated in a similar way. To calculate the second spatial attention mask, an intermediate variable G^S(F), referred to as the fourth intermediate variable, is first calculated from the second feature image and its number of channels C, using the following formula:

[0143]

[0144] Then, the second spatial attention mask is calculated from the height H and width W of the second feature image and the fourth intermediate variable G^S(F), using the following formula:

[0145]

[0146] To calculate the second target channel attention mask, another intermediate variable, referred to as the fifth intermediate variable, is first calculated from the number N of target detection boxes contained in the second feature image, the height H_r and width W_r of each target detection box contained in the second feature image, and the second feature image, using the following formula:

[0147]

[0148] Then, the second target channel attention mask is calculated from the number of channels C of the second feature image and the fifth intermediate variable, using the following formula:

[0149]

[0150] To calculate the second background channel attention mask, another intermediate variable, referred to as the sixth intermediate variable, is first calculated from the height H and width W of the second feature image, a preset indicator function, and the second feature image, using the following formula:

[0151]

[0152] Where A(i,j) is the preset indicator function, defined as follows:

[0153]

[0154] This represents an element in the background Gaussian mask mentioned above, where M is... The sum of quantities.

[0155] Then, the second background channel attention mask is calculated from the number of channels C of the second feature image and the sixth intermediate variable, using the following formula:

[0156]

[0157] At this point, the second spatial attention mask, the second target channel attention mask, and the second background channel attention mask have all been calculated.

[0158] Next, the model distillation loss is calculated from the first spatial attention mask, the second spatial attention mask, the first target channel attention mask, the second target channel attention mask, the first background channel attention mask, and the second background channel attention mask.

[0159] Specifically, the model distillation loss L_M is set so that the masks of the teacher model guide the student model to produce similar masks, making the spatial locations and channel information that the student model focuses on similar to those of the teacher model. The model distillation loss L_M can be calculated using the following formula:

[0160]

[0161] Where l represents the regression loss function, which can be L1 loss, L2 loss, or Smooth L1 loss, etc.
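As an illustration, the sketch below compares each teacher mask with the corresponding student mask using an L1 regression loss and sums the three terms. The unweighted sum and the function names are assumptions; the exact form of L_M in this embodiment may differ.

```python
import numpy as np

def l1_loss(a, b):
    # Mean absolute difference, one possible choice for the regression loss l.
    return float(np.abs(a - b).mean())

def model_distillation_loss(teacher_masks, student_masks, loss_fn=l1_loss):
    """Each argument is a tuple (spatial, target_channel, background_channel)."""
    return sum(loss_fn(t, s) for t, s in zip(teacher_masks, student_masks))

# Toy masks: only the target channel masks differ, by 1 in every entry.
teacher_masks = (np.ones((2, 2)), np.ones(3), np.ones(3))
student_masks = (np.ones((2, 2)), 2.0 * np.ones(3), np.ones(3))
l_model = model_distillation_loss(teacher_masks, student_masks)
```

Swapping `loss_fn` for an L2 or Smooth L1 implementation would keep the rest unchanged.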

[0162] 104. Calculate the target distillation loss based on the first feature image, the second feature image, the first spatial attention mask, and the first target channel attention mask;

[0163] Next, the target distillation loss L_t is calculated from the first feature image, the second feature image, the first spatial attention mask, and the first target channel attention mask. It mainly evaluates the difference between the student model's output features and the teacher model's output features through mean squared error. Additionally, the target Gaussian mask M_G mentioned above can be introduced when calculating L_t. This guides the model distillation to focus more on the center point region features of the target instance, which is beneficial to improving the model's detection performance. Specifically, the target distillation loss can be calculated based on the first feature image, the second feature image, the target Gaussian mask, the first spatial attention mask, and the first target channel attention mask. In one implementation of this application, calculating the target distillation loss based on the first feature image, the second feature image, the target Gaussian mask, the first spatial attention mask, and the first target channel attention mask may include:

[0164] (1) Calculate the product of the first spatial attention mask and the first target channel attention mask to obtain the target fusion distillation mask;

[0165] (2) The target distillation loss is calculated based on the first feature image, the second feature image, the target Gaussian mask, and the target fusion distillation mask.

[0166] In the specific calculation, the product of the first spatial attention mask and the first target channel attention mask is calculated first to obtain the target fusion distillation mask M_target. Then, the target distillation loss L_t is calculated from the first feature image, the second feature image, the target Gaussian mask M_G, and the target fusion distillation mask M_target, using the following formula:

[0167]

[0168] Where C represents the number of channels in the feature image, H and W represent the height and width of the feature image, respectively, M_target is the target fusion distillation mask, and f represents a dimensionality transformation, typically a 1x1 convolution, that makes the channel dimensions of the first feature image and the second feature image equal. Multiplying by the target Gaussian mask M_G and the target fusion distillation mask M_target restricts the student model to learning only the feature differences that remain after masking.
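A minimal NumPy sketch of a masked feature distillation loss of this kind, under stated assumptions: the dimension transformation f is a 1x1 convolution written as a matrix product over channels, the fusion mask is the outer product of the channel mask (length C) and the spatial mask (H x W), and the masked squared error is normalised by C·H·W. The exact formula and normalisation of this embodiment may differ, and all function and parameter names are hypothetical.

```python
import numpy as np

def conv1x1(feature, weight):
    """feature: (C_in, H, W); weight: (C_out, C_in). Returns (C_out, H, W)."""
    return np.einsum("oc,chw->ohw", weight, feature)

def masked_distillation_loss(teacher_feat, student_feat, weight,
                             gaussian_mask, spatial_mask, channel_mask):
    aligned = conv1x1(student_feat, weight)           # f(student feature)
    # Fusion distillation mask: channel mask times spatial mask, (C, H, W).
    fusion = channel_mask[:, None, None] * spatial_mask[None, :, :]
    squared_diff = (teacher_feat - aligned) ** 2
    C, H, W = teacher_feat.shape
    weighted = gaussian_mask[None, :, :] * fusion * squared_diff
    return float(weighted.sum() / (C * H * W))

# Toy example: teacher has 4 channels, student has 2, on a 3x3 map.
teacher_feat = 3.0 * np.ones((4, 3, 3))
student_feat = np.ones((2, 3, 3))
weight = np.ones((4, 2))                  # maps student features to value 2
l_target = masked_distillation_loss(
    teacher_feat, student_feat, weight,
    gaussian_mask=np.ones((3, 3)),
    spatial_mask=np.ones((3, 3)),
    channel_mask=np.ones(4),
)
```

With all masks set to 1 the result reduces to a plain mean squared error, here 1.0. The same function could be reused for a background loss by passing the background Gaussian mask and the background channel attention mask instead.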

[0169] 105. Calculate the background distillation loss based on the first feature image, the second feature image, the first spatial attention mask, and the first background channel attention mask;

[0170] The background distillation loss L_b is calculated in a similar way to the target distillation loss L_t: it is calculated from the first feature image, the second feature image, the first spatial attention mask, and the first background channel attention mask. Similarly, the background Gaussian mask M_B mentioned above can be introduced when calculating L_b. This guides the model distillation to focus more on the center point region features of the target instance, which is beneficial to improving the model's detection performance. Specifically, the background distillation loss can be calculated based on the first feature image, the second feature image, the background Gaussian mask, the first spatial attention mask, and the first background channel attention mask. In one implementation of this application, calculating the background distillation loss based on the first feature image, the second feature image, the background Gaussian mask, the first spatial attention mask, and the first background channel attention mask may include:

[0171] (1) Calculate the product of the first spatial attention mask and the first background channel attention mask to obtain the background fusion distillation mask;

[0172] (2) The background distillation loss is calculated based on the first feature image, the second feature image, the background Gaussian mask, and the background fusion distillation mask.

[0173] In the specific calculation, the product of the first spatial attention mask and the first background channel attention mask is calculated first to obtain the background fusion distillation mask M_background. Then, the background distillation loss L_b is calculated from the first feature image, the second feature image, the background Gaussian mask M_B, and the background fusion distillation mask M_background, using the following formula:

[0174]

[0175] Where C represents the number of channels in the feature image, H and W represent the height and width of the feature image, respectively, M_background is the background fusion distillation mask, and f represents the dimensional transformation.

[0176] 106. Construct a loss function based on the target distillation loss and the background distillation loss;

[0177] After the target distillation loss L_t and the background distillation loss L_b are calculated, a corresponding loss function can be constructed from these two distillation losses, specifically through a weighted summation. Additionally, the model distillation loss L_M calculated earlier can be added to the loss function.

[0178] Furthermore, after obtaining the first feature image output by the teacher model and the second feature image output by the student model, the process may further include:

[0179] The second feature image is input into the target detection head corresponding to the student model, and the target detection loss of the student model is calculated.

[0180] To further enhance the comprehensiveness of the loss function, the target detection loss of the student model can be introduced when constructing the loss function. Specifically, the second feature image can be input into the target detection head corresponding to the student model (i.e., the network structure in the target detection network used to detect the target location and category, such as the CenterNet detection head or another type of detection head), and the target detection loss L_l of the student model is calculated. L_l is the loss of the CenterNet detection head or of another type of detection head. The specific calculation principle can be found in existing technologies and will not be elaborated here.

[0181] Finally, taking into account the target distillation loss L_t, the background distillation loss L_b, the model distillation loss L_M, and the target detection loss L_l, the resulting loss function L is as follows:

[0182] L = αL_l + βL_t + γL_b + λL_M

[0183] Here, α, β, γ, and λ are adjustable hyperparameters, mainly used to balance the different loss terms.
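The weighted summation can be written directly in code. In the sketch below the function name and the default weights of 1.0 are placeholders, not values given in this application.

```python
def total_loss(l_det, l_target, l_background, l_model,
               alpha=1.0, beta=1.0, gamma=1.0, lam=1.0):
    # L = alpha * L_l + beta * L_t + gamma * L_b + lambda * L_M
    return alpha * l_det + beta * l_target + gamma * l_background + lam * l_model

# Example with arbitrary loss values and weights chosen for illustration.
loss = total_loss(1.0, 2.0, 3.0, 4.0, alpha=1.0, beta=0.5, gamma=0.25, lam=0.1)
```

In practice the four weights would be tuned so that no single term dominates the gradient.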

[0184] 107. Based on the loss function, optimize the parameters of the student model.

[0185] After constructing the loss function, the parameters of the student model can be optimized based on this loss function to achieve the model distillation effect. Specifically, the gradient descent backpropagation algorithm can be used to optimize the parameters of the student model so that the loss function L converges to a minimum, at which point the student model can maximize the knowledge it acquires from the teacher model.
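The optimisation step can be illustrated with a deliberately tiny example. The sketch below is a toy, not the training procedure of this embodiment: the "student" has a single scalar parameter w, the "teacher" feature is 2x, the loss is a masked mean squared error, and the gradient is written out by hand instead of being obtained by backpropagation. All names and values are hypothetical.

```python
import numpy as np

x = np.linspace(-1.0, 1.0, 16)                 # toy student input
teacher = 2.0 * x                              # toy teacher feature
mask = np.where(np.abs(x) < 0.5, 2.0, 0.5)     # stand-in distillation mask

def loss_and_grad(w):
    residual = teacher - w * x
    loss = np.mean(mask * residual ** 2)        # masked mean squared error
    grad = -2.0 * np.mean(mask * x * residual)  # d(loss)/dw
    return loss, grad

w = 0.0
learning_rate = 0.5
for _ in range(200):
    _, grad = loss_and_grad(w)
    w -= learning_rate * grad                   # gradient descent update
final_loss, _ = loss_and_grad(w)
```

After the loop, w has moved from 0 towards the teacher's value of 2 and the masked loss is close to zero, which mirrors how the student parameters are driven to reduce L.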

[0186] In this embodiment, sample images are first input into the teacher model and student model respectively to obtain the first feature image output by the teacher model and the second feature image output by the student model. Then, spatial attention mask, target channel attention mask, and background channel attention mask are calculated based on the first feature image. Next, target distillation loss and background distillation loss are calculated based on the first feature image, second feature image, spatial attention mask, target channel attention mask, and background channel attention mask. Finally, a loss function is constructed based on the target distillation loss and background distillation loss, and the parameters of the student model are optimized based on this loss function. The above process distinguishes between the target and background when calculating channel attention, calculating channel-level attention adapted to the target and background respectively to obtain the target channel attention mask and background channel attention mask. Then, the target distillation loss and background distillation loss are calculated separately, thereby guiding the distillation of the target and background respectively. This processing can refine the granularity of distillation and improve the model distillation effect of target detection to a certain extent.

[0187] A schematic diagram of the principle of a target detection knowledge distillation method provided in this application embodiment is shown in Figure 2. Its basic operation includes: inputting image data into the teacher model and the student model respectively; calculating the target Gaussian mask and the background Gaussian mask with the Gaussian coding module, based on the ground truth labels in the image data; calculating the spatial attention mask of the teacher model and the spatial attention mask of the student model with the spatial attention module; calculating the channel attention mask of the teacher model and the channel attention mask of the student model with the channel attention module, where the channel attention mask includes the target channel attention mask and the background channel attention mask; and constructing a mask loss based on the spatial attention mask and channel attention mask of the teacher model, the spatial attention mask and channel attention mask of the student model, the target Gaussian mask, and the background Gaussian mask, and calculating the target fusion distillation mask and the background fusion distillation mask respectively. Then, the feature image output by the teacher model is multiplied by the target fusion distillation mask to obtain the target distillation features of the teacher model, and the feature image output by the student model is multiplied by the target fusion distillation mask to obtain the target distillation features of the student model. The target distillation loss is calculated based on the target distillation features of the teacher model and the target distillation features of the student model. The feature image output by the teacher model is multiplied by the background fusion distillation mask to obtain the background distillation features of the teacher model, and the feature image output by the student model is multiplied by the background fusion distillation mask to obtain the background distillation features of the student model. The background distillation loss is calculated based on the background distillation features of the teacher model and the background distillation features of the student model. In addition, the detection head loss corresponding to the student model can also be calculated. Finally, a loss function is constructed using the target distillation loss, background distillation loss, and detection head loss to optimize the parameters of the student model and achieve knowledge distillation of the model.

[0188] In summary, this application, based on the detection head of algorithms such as CenterNet, designs a mechanism for distinguishing distillation features using Gaussian masks. This mechanism distills the target and background separately and guides the distillation method to focus more on the features of the target's center point region. Simultaneously, this application employs an instance-level channel attention mechanism, distinguishing the target from the background when calculating channel-dimensional attention. This enhances the accuracy of the model distillation method in a more granular manner, further improving the student model's ability to transfer knowledge from the teacher model.

[0189] It should be understood that the sequence number of each step in the above embodiments does not imply the order of execution. The execution order of each process should be determined by its function and internal logic, and should not constitute any limitation on the implementation process of the embodiments of this application.

[0190] The above mainly describes a knowledge distillation method for target detection. The following will describe a knowledge distillation apparatus for target detection.

[0191] Referring to Figure 3, one embodiment of a target detection knowledge distillation apparatus in this application includes:

[0192] Image input module 301 is used to input sample images with labeled target detection boxes into the teacher model and student model respectively;

[0193] The feature image acquisition module 302 is used to acquire the first feature image output by the teacher model and the second feature image output by the student model;

[0194] The first attention mask calculation module 303 is used to calculate the first spatial attention mask, the first target channel attention mask, and the first background channel attention mask based on the first feature image.

[0195] The target distillation loss calculation module 304 is used to calculate the target distillation loss based on the first feature image, the second feature image, the first spatial attention mask, and the first target channel attention mask.

[0196] Background distillation loss calculation module 305 is used to calculate background distillation loss based on the first feature image, the second feature image, the first spatial attention mask, and the first background channel attention mask.

[0197] Loss function construction module 306 is used to construct a loss function based on the target distillation loss and the background distillation loss;

[0198] The model parameter optimization module 307 is used to optimize the parameters of the student model based on the loss function.

[0199] In one implementation of this application, the target detection knowledge distillation apparatus may further include:

[0200] The Gaussian mask calculation module is used to calculate the target Gaussian mask and the background Gaussian mask by means of Gaussian encoding based on the labels of each target detection box contained in the first feature image and / or the labels of each target detection box contained in the second feature image.

[0201] The target distillation loss calculation module can be specifically used to: calculate the target distillation loss based on the first feature image, the second feature image, the target Gaussian mask, the first spatial attention mask, and the first target channel attention mask;

[0202] The background distillation loss calculation module can be specifically used to: calculate the background distillation loss based on the first feature image, the second feature image, the background Gaussian mask, the first spatial attention mask, and the first background channel attention mask.

[0203] Furthermore, the Gaussian mask calculation module may include:

[0204] The center point coordinate acquisition unit is used to acquire the center point coordinates of each target detection box based on the labels of each target detection box contained in the first feature image and / or the labels of each target detection box contained in the second feature image.

[0205] The Gaussian mask calculation unit is used to set the Gaussian radius for each target detection box according to the height and width of the target detection box, and calculate the Gaussian mask of the target detection box based on the two-dimensional Gaussian formula and the coordinates of the center point of the target detection box.

[0206] A target Gaussian mask construction unit is used to construct the target Gaussian mask based on the Gaussian mask of each target detection box;

[0207] Background Gaussian mask construction unit, used to construct the background Gaussian mask based on the target Gaussian mask.
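The four units above can be sketched as follows, in the style of CenterNet-like Gaussian encoding. Several choices here are assumptions rather than details given in this application: boxes are (x0, y0, x1, y1) in feature-map pixels, the Gaussian spread along each axis is a fixed fraction (`sigma_ratio`) of the box width and height, overlapping boxes are merged with an element-wise maximum, and the background Gaussian mask is 1 minus the target Gaussian mask.

```python
import numpy as np

def box_gaussian(height, width, box, sigma_ratio=0.25):
    """Two-dimensional Gaussian centred on the box centre, shape (height, width)."""
    x0, y0, x1, y1 = box
    cx, cy = (x0 + x1) / 2.0, (y0 + y1) / 2.0        # centre point coordinates
    sx = max((x1 - x0) * sigma_ratio, 1e-6)           # spread from box width
    sy = max((y1 - y0) * sigma_ratio, 1e-6)           # spread from box height
    ys, xs = np.mgrid[0:height, 0:width]
    return np.exp(-((xs - cx) ** 2 / (2 * sx ** 2)
                    + (ys - cy) ** 2 / (2 * sy ** 2)))

def gaussian_masks(height, width, boxes):
    target = np.zeros((height, width))
    for box in boxes:
        # Merge per-box Gaussians with an element-wise maximum (assumed).
        target = np.maximum(target, box_gaussian(height, width, box))
    background = 1.0 - target                          # assumed complement
    return target, background

# One box centred at pixel (4, 4) on an 8x8 feature map.
target_mask, background_mask = gaussian_masks(8, 8, [(2, 2, 6, 6)])
```

The target mask peaks at 1 at the box centre and decays towards the box edges, while the background mask is largest far away from any box.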

[0208] Furthermore, the Gaussian mask calculation module may also include:

[0209] A scale mask construction unit is used to construct a scale mask based on the height and width of each target detection box and the background Gaussian mask;

[0210] The target Gaussian mask update unit is used to perform element-wise multiplication of the target Gaussian mask and the scale mask at corresponding positions to obtain the updated target Gaussian mask.

[0211] The background Gaussian mask update unit is used to perform element-wise multiplication of the background Gaussian mask and the scale mask at corresponding positions to obtain the updated background Gaussian mask.

[0212] Furthermore, the target distillation loss calculation module may include:

[0213] The target fusion distillation mask calculation unit is used to calculate the product of the first spatial attention mask and the first target channel attention mask to obtain the target fusion distillation mask;

[0214] The target distillation loss calculation unit is used to calculate the target distillation loss based on the first feature image, the second feature image, the target Gaussian mask, and the target fusion distillation mask.

[0215] The background distillation loss calculation module may include:

[0216] The background fusion distillation mask calculation unit is used to calculate the product of the first spatial attention mask and the first background channel attention mask to obtain the background fusion distillation mask;

[0217] The background distillation loss calculation unit is used to calculate the background distillation loss based on the first feature image, the second feature image, the background Gaussian mask, and the background fusion distillation mask.

[0218] In one implementation of this application, the target detection knowledge distillation apparatus may further include:

[0219] The second attention mask calculation module is used to calculate the second spatial attention mask, the second target channel attention mask, and the second background channel attention mask based on the second feature image.

[0220] The model distillation loss calculation module is used to calculate the model distillation loss based on the first spatial attention mask, the second spatial attention mask, the first target channel attention mask, the second target channel attention mask, the first background channel attention mask, and the second background channel attention mask.

[0221] Specifically, the loss function construction module can be used to construct the loss function based on the target distillation loss, the background distillation loss, and the model distillation loss.

[0222] Furthermore, the first attention mask calculation module may include:

[0223] The first intermediate variable calculation unit is used to calculate the first intermediate variable based on the first feature image and the number of channels of the first feature image;

[0224] The first spatial attention mask calculation unit is used to calculate the first spatial attention mask based on the height and width of the first feature image and the first intermediate variable.

[0225] The second intermediate variable calculation unit is used to calculate the second intermediate variable based on the number of target detection boxes contained in the first feature image, the height and width of each target detection box contained in the first feature image, and the first feature image.

[0226] The first target channel attention mask calculation unit is used to calculate the first target channel attention mask based on the number of channels of the first feature image and the second intermediate variable.

[0227] The third intermediate variable calculation unit is used to calculate a third intermediate variable based on the height and width of the first feature image, a preset indicator function, and the first feature image; wherein the indicator function is constructed based on the background Gaussian mask;

[0228] The first background channel attention mask calculation unit is used to calculate the first background channel attention mask based on the number of channels of the first feature image and the third intermediate variable.

[0229] The second attention mask calculation module may include:

[0230] The fourth intermediate variable calculation unit is used to calculate the fourth intermediate variable based on the second feature image and the number of channels of the second feature image;

[0231] The second spatial attention mask calculation unit is used to calculate the second spatial attention mask based on the height and width of the second feature image and the fourth intermediate variable.

[0232] The fifth intermediate variable calculation unit is used to calculate the fifth intermediate variable based on the number of target detection boxes contained in the second feature image, the height and width of each target detection box contained in the second feature image, and the second feature image.

[0233] The second target channel attention mask calculation unit is used to calculate the second target channel attention mask based on the number of channels of the second feature image and the fifth intermediate variable.

[0234] The sixth intermediate variable calculation unit is used to calculate the sixth intermediate variable based on the height and width of the second feature image, the indicator function, and the second feature image;

[0235] The second background channel attention mask calculation unit is used to calculate the second background channel attention mask based on the number of channels of the second feature image and the sixth intermediate variable.

[0236] This application embodiment also provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements any of the target detection knowledge distillation methods shown in Figure 1.

[0237] This application also provides a computer program product that, when run on a terminal device, causes the terminal device to perform any of the target detection knowledge distillation methods shown in Figure 1.

[0238] Figure 4 is a schematic diagram of a terminal device provided in an embodiment of this application. As shown in Figure 4, the terminal device 4 in this embodiment includes: a processor 40, a memory 41, and a computer program 42 stored in the memory 41 and executable on the processor 40. When the processor 40 executes the computer program 42, it implements the steps in the embodiments of the various target detection knowledge distillation methods described above, for example steps 101 to 107 shown in Figure 1. Alternatively, when the processor 40 executes the computer program 42, it implements the functions of each module / unit in the above-described device embodiments, for example the functions of modules 301 to 307 shown in Figure 3.

[0239] The computer program 42 can be divided into one or more modules / units, which are stored in the memory 41 and executed by the processor 40 to complete this application. The one or more modules / units can be a series of computer program instruction segments capable of performing specific functions, which describe the execution process of the computer program 42 in the terminal device 4.

[0240] The processor 40 may be a Central Processing Unit (CPU), or other general-purpose processors, digital signal processors (DSPs), application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc. A general-purpose processor may be a microprocessor or any conventional processor.

[0241] The memory 41 can be an internal storage unit of the terminal device 4, such as a hard disk or memory of the terminal device 4. The memory 41 can also be an external storage device of the terminal device 4, such as a plug-in hard disk, Smart Media Card (SMC), Secure Digital (SD) card, or Flash Card equipped on the terminal device 4. Furthermore, the memory 41 can include both internal and external storage units of the terminal device 4. The memory 41 is used to store the computer program and other programs and data required by the terminal device. The memory 41 can also be used to temporarily store data that has been output or will be output.

[0242] Those skilled in the art will clearly understand that, for the sake of convenience and brevity, the above-described division of functional units and modules is merely an example. In practical applications, the above functions can be assigned to different functional units and modules as needed, that is, the internal structure of the device can be divided into different functional units or modules to complete all or part of the functions described above. The functional units and modules in the embodiments can be integrated into one processing unit, or each unit can exist physically separately, or two or more units can be integrated into one unit. The integrated unit can be implemented in hardware or as a software functional unit. Furthermore, the specific names of the functional units and modules are only for easy differentiation and are not intended to limit the scope of protection of this application. The specific working process of the units and modules in the above system can be referred to the corresponding process in the foregoing method embodiments, and will not be repeated here.

[0243] Those skilled in the art will recognize that the units and algorithm steps of the various examples described in conjunction with the embodiments disclosed herein can be implemented in electronic hardware, or a combination of computer software and electronic hardware. Whether these functions are implemented in hardware or software depends on the specific application and design constraints of the technical solution. Those skilled in the art can use different methods to implement the described functions for each specific application, but such implementation should not be considered beyond the scope of this application.

[0244] In the embodiments provided in this application, it should be understood that the disclosed apparatus and methods can be implemented in other ways. For example, the system embodiments described above are merely illustrative. For instance, the division of modules or units is only a logical functional division, and in actual implementation, there may be other division methods. For example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. Furthermore, the coupling or direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection between devices or units through some interfaces, and may be electrical, mechanical, or other forms.

[0245] If the integrated unit is implemented as a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium. Based on this understanding, all or part of the processes in the methods of the above embodiments can also be implemented by a computer program instructing related hardware. The computer program can be stored in a computer-readable storage medium, and when executed by a processor, it can implement the steps of the various method embodiments described above. The computer program includes computer program code, which can be in the form of source code, object code, executable files, or certain intermediate forms. The computer-readable medium can include: any entity or device capable of carrying the computer program code, recording media, USB flash drives, portable hard drives, magnetic disks, optical disks, computer memory, read-only memory (ROM), random access memory (RAM), electrical carrier signals, telecommunication signals, and software distribution media, etc. It should be noted that the content included in the computer-readable medium can be appropriately added or removed according to the requirements of legislation and patent practice in the jurisdiction. For example, in some jurisdictions, according to legislation and patent practice, computer-readable media do not include electrical carrier signals and telecommunication signals.

[0246] The above-described embodiments are only used to illustrate the technical solutions of this application, and are not intended to limit them. Although this application has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some of the technical features. Such modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the spirit and scope of the technical solutions of the embodiments of this application, and should all be included within the protection scope of this application.

Claims

1. A knowledge distillation method for target detection, characterized in that it comprises: inputting sample images with labeled target detection boxes into a teacher model and a student model respectively; obtaining a first feature image output by the teacher model and a second feature image output by the student model; calculating a first spatial attention mask, a first target channel attention mask, and a first background channel attention mask based on the first feature image; calculating a target distillation loss based on the first feature image, the second feature image, the first spatial attention mask, and the first target channel attention mask; calculating a background distillation loss based on the first feature image, the second feature image, the first spatial attention mask, and the first background channel attention mask; constructing a loss function based on the target distillation loss and the background distillation loss; and optimizing the parameters of the student model based on the loss function; wherein, after obtaining the first feature image output by the teacher model and the second feature image output by the student model, the method further comprises: calculating a target Gaussian mask and a background Gaussian mask by Gaussian encoding, based on the labels of each target detection box contained in the first feature image and/or the labels of each target detection box contained in the second feature image; wherein calculating the target distillation loss based on the first feature image, the second feature image, the first spatial attention mask, and the first target channel attention mask is specifically: calculating the target distillation loss based on the first feature image, the second feature image, the target Gaussian mask, the first spatial attention mask, and the first target channel attention mask; and wherein calculating the background distillation loss based on the first feature image, the second feature image, the first spatial attention mask, and the first background channel attention mask is specifically: calculating the background distillation loss based on the first feature image, the second feature image, the background Gaussian mask, the first spatial attention mask, and the first background channel attention mask.

2. The method as described in claim 1, characterized in that the step of calculating the target Gaussian mask and the background Gaussian mask by Gaussian encoding, based on the labels of each target detection box contained in the first feature image and/or the labels of each target detection box contained in the second feature image, comprises: obtaining the center point coordinates of each target detection box based on the labels of each target detection box contained in the first feature image and/or the labels of each target detection box contained in the second feature image; for each target detection box, setting a Gaussian radius according to the height and width of the target detection box, and calculating the Gaussian mask of the target detection box based on the two-dimensional Gaussian formula and the center point coordinates of the target detection box; constructing the target Gaussian mask based on the Gaussian mask of each target detection box; and constructing the background Gaussian mask based on the target Gaussian mask.
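The Gaussian encoding recited in claim 2 can be illustrated with a minimal Python sketch. The claim does not fix the formulas, so several choices here are assumptions made only for illustration: boxes are given as hypothetical `(x_min, y_min, x_max, y_max)` tuples in feature-map cells, the Gaussian radius is taken as half the box side and used directly as the standard deviation, per-box masks are merged with an element-wise maximum, and the background mask is the complement of the target mask. The function names are hypothetical.

```python
import math

def box_gaussian(height, width, box):
    """Two-dimensional Gaussian centred on one box, evaluated on a height x width grid."""
    x_min, y_min, x_max, y_max = box
    cx, cy = (x_min + x_max) / 2.0, (y_min + y_max) / 2.0  # centre point of the box
    # Assumption: radius = half the box side, used as the standard deviation.
    sigma_x = max((x_max - x_min) / 2.0, 1e-6)
    sigma_y = max((y_max - y_min) / 2.0, 1e-6)
    return [
        [
            math.exp(-((x - cx) ** 2 / (2 * sigma_x ** 2)
                       + (y - cy) ** 2 / (2 * sigma_y ** 2)))
            for x in range(width)
        ]
        for y in range(height)
    ]

def gaussian_masks(height, width, boxes):
    """Return (target_mask, background_mask) for a list of boxes."""
    target = [[0.0] * width for _ in range(height)]
    for box in boxes:
        g = box_gaussian(height, width, box)
        for y in range(height):
            for x in range(width):
                # Assumption: overlapping boxes are merged with an element-wise maximum.
                target[y][x] = max(target[y][x], g[y][x])
    # Assumption: the background mask is the complement of the target mask.
    background = [[1.0 - v for v in row] for row in target]
    return target, background
```

With a single box `(2, 2, 6, 6)` on an 8 x 8 grid, the target mask peaks at the box centre and decays towards the corners, while the background mask does the opposite.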

3. The method as described in claim 2, characterized in that, after constructing the background Gaussian mask based on the target Gaussian mask, the method further comprises: constructing a scale mask based on the height and width of each target detection box and the background Gaussian mask; multiplying the target Gaussian mask and the scale mask element-wise at corresponding positions to obtain an updated target Gaussian mask; and multiplying the background Gaussian mask and the scale mask element-wise at corresponding positions to obtain an updated background Gaussian mask.
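One possible reading of the scale mask in claim 3 is sketched below. The exact construction is not given in the claim, so this is an assumption: cells inside a box receive the reciprocal of the box area, cells outside every box receive the reciprocal of the summed background Gaussian mask, and where boxes overlap the last box wins. Boxes are hypothetical integer `(x_min, y_min, x_max, y_max)` tuples, and the function names are illustrative.

```python
def scale_mask(boxes, background_mask):
    """Build a scale mask with the same shape as background_mask (H x W nested lists)."""
    height, width = len(background_mask), len(background_mask[0])
    # Assumption: background cells are weighted by 1 / (sum of the background Gaussian mask).
    bg_total = sum(sum(row) for row in background_mask)
    bg_value = 1.0 / bg_total if bg_total > 0 else 0.0
    scale = [[bg_value] * width for _ in range(height)]
    for x_min, y_min, x_max, y_max in boxes:
        # Assumption: cells inside a box are weighted by 1 / (box height * box width).
        box_value = 1.0 / max((x_max - x_min) * (y_max - y_min), 1)
        for y in range(max(y_min, 0), min(y_max, height)):
            for x in range(max(x_min, 0), min(x_max, width)):
                scale[y][x] = box_value  # later boxes overwrite earlier ones
    return scale

def apply_scale(mask, scale):
    """Element-wise product of a Gaussian mask and the scale mask."""
    return [[m * s for m, s in zip(m_row, s_row)] for m_row, s_row in zip(mask, scale)]
```

Under this reading, small boxes get a larger per-cell weight than large boxes, so the contribution of each target to the loss does not grow with its area.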

4. The method as described in claim 1, characterized in that the step of calculating the target distillation loss based on the first feature image, the second feature image, the target Gaussian mask, the first spatial attention mask, and the first target channel attention mask comprises: calculating the product of the first spatial attention mask and the first target channel attention mask to obtain a target fusion distillation mask; and calculating the target distillation loss based on the first feature image, the second feature image, the target Gaussian mask, and the target fusion distillation mask; and the step of calculating the background distillation loss based on the first feature image, the second feature image, the background Gaussian mask, the first spatial attention mask, and the first background channel attention mask comprises: calculating the product of the first spatial attention mask and the first background channel attention mask to obtain a background fusion distillation mask; and calculating the background distillation loss based on the first feature image, the second feature image, the background Gaussian mask, and the background fusion distillation mask.
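The fusion step in claim 4 can be sketched as follows. The claim states only that the fusion mask is the product of a spatial attention mask and a channel attention mask and that the loss depends on both feature images, the Gaussian mask, and the fusion mask. The outer-product layout and the weighted squared-error form below are assumptions for illustration, as are the function names. Feature images are nested lists indexed `[channel][row][column]`.

```python
def fusion_mask(spatial_mask, channel_mask):
    """Outer product: fused[c][y][x] = channel_mask[c] * spatial_mask[y][x]."""
    return [
        [[a_c * a_s for a_s in row] for row in spatial_mask]
        for a_c in channel_mask
    ]

def masked_distillation_loss(teacher, student, gaussian_mask, fused):
    """Assumed form: sum over c, y, x of gaussian * fused * (teacher - student) ** 2."""
    loss = 0.0
    for t_ch, s_ch, f_ch in zip(teacher, student, fused):
        for t_row, s_row, f_row, g_row in zip(t_ch, s_ch, f_ch, gaussian_mask):
            for t, s, f, g in zip(t_row, s_row, f_row, g_row):
                loss += g * f * (t - s) ** 2
    return loss
```

The same two functions serve both branches: passing the target Gaussian mask and the target channel attention mask gives the target distillation loss, and passing the background counterparts gives the background distillation loss.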

5. The method according to any one of claims 1 to 4, characterized in that, after obtaining the first feature image output by the teacher model and the second feature image output by the student model, the method further comprises: calculating a second spatial attention mask, a second target channel attention mask, and a second background channel attention mask based on the second feature image; and calculating a model distillation loss based on the first spatial attention mask, the second spatial attention mask, the first target channel attention mask, the second target channel attention mask, the first background channel attention mask, and the second background channel attention mask; wherein constructing the loss function based on the target distillation loss and the background distillation loss is specifically: constructing the loss function based on the target distillation loss, the background distillation loss, and the model distillation loss.
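A hedged sketch of the model distillation loss in claim 5 is given below. The claim names the six masks that enter the loss but not the distance measure or the way the three losses are combined, so the L1 distance and the weighted sum with hypothetical weights `alpha`, `beta`, and `gamma` are assumptions chosen for illustration.

```python
def l1_distance(a, b):
    """Sum of absolute differences between two equal-length flat lists."""
    return sum(abs(x - y) for x, y in zip(a, b))

def model_distillation_loss(spatial_t, spatial_s,
                            target_ch_t, target_ch_s,
                            background_ch_t, background_ch_s):
    """Assumed form: L1 distance between teacher (_t) and student (_s) attention masks."""
    spatial_term = sum(l1_distance(r_t, r_s) for r_t, r_s in zip(spatial_t, spatial_s))
    return (spatial_term
            + l1_distance(target_ch_t, target_ch_s)
            + l1_distance(background_ch_t, background_ch_s))

def total_loss(target_loss, background_loss, model_loss,
               alpha=1.0, beta=1.0, gamma=1.0):
    """Assumed combination: weighted sum; alpha, beta, gamma are hypothetical weights."""
    return alpha * target_loss + beta * background_loss + gamma * model_loss
```

The model distillation term penalises the student for attending to different locations or channels than the teacher, independently of the feature values themselves.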

6. The method as described in claim 5, characterized in that the step of calculating the first spatial attention mask, the first target channel attention mask, and the first background channel attention mask based on the first feature image comprises: calculating a first intermediate variable based on the first feature image and the number of channels of the first feature image; calculating the first spatial attention mask based on the height and width of the first feature image and the first intermediate variable; calculating a second intermediate variable based on the number of target detection boxes contained in the first feature image, the height and width of each target detection box contained in the first feature image, and the first feature image; calculating the first target channel attention mask based on the number of channels of the first feature image and the second intermediate variable; calculating a third intermediate variable based on the height and width of the first feature image, a preset indicator function, and the first feature image, wherein the indicator function is constructed based on the background Gaussian mask; and calculating the first background channel attention mask based on the number of channels of the first feature image and the third intermediate variable; and the step of calculating the second spatial attention mask, the second target channel attention mask, and the second background channel attention mask based on the second feature image comprises: calculating a fourth intermediate variable based on the second feature image and the number of channels of the second feature image; calculating the second spatial attention mask based on the height and width of the second feature image and the fourth intermediate variable; calculating a fifth intermediate variable based on the number of target detection boxes contained in the second feature image, the height and width of each target detection box contained in the second feature image, and the second feature image; calculating the second target channel attention mask based on the number of channels of the second feature image and the fifth intermediate variable; calculating a sixth intermediate variable based on the height and width of the second feature image, the indicator function, and the second feature image; and calculating the second background channel attention mask based on the number of channels of the second feature image and the sixth intermediate variable.
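The attention mask calculations in claim 6 list the inputs of each intermediate variable but not the formulas. The sketch below is one plausible reading, offered as an assumption rather than the patented computation: each intermediate variable is a mean of absolute activations (over channels, over box areas averaged across boxes, or over indicator-selected cells divided by H x W), and each mask is a softmax of that variable rescaled by H x W or by the channel count. The `temperature` parameter, the integer in-bounds box tuples, and the function names are hypothetical. The same functions apply to the teacher and the student feature image.

```python
import math

def softmax(values, temperature=1.0):
    scaled = [v / temperature for v in values]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def spatial_attention(feature, temperature=1.0):
    """feature[c][y][x] -> H x W mask: H * W * softmax(mean over channels of |F|)."""
    channels, height, width = len(feature), len(feature[0]), len(feature[0][0])
    inter = [sum(abs(feature[c][y][x]) for c in range(channels)) / channels
             for y in range(height) for x in range(width)]
    weights = softmax(inter, temperature)
    return [[height * width * weights[y * width + x] for x in range(width)]
            for y in range(height)]

def target_channel_attention(feature, boxes, temperature=1.0):
    """C-length mask: C * softmax(per-box area mean of |F|, averaged over boxes)."""
    channels = len(feature)
    inter = []
    for c in range(channels):
        acc = 0.0
        for x_min, y_min, x_max, y_max in boxes:  # assumed non-empty, in bounds
            area = (x_max - x_min) * (y_max - y_min)
            acc += sum(abs(feature[c][y][x])
                       for y in range(y_min, y_max)
                       for x in range(x_min, x_max)) / area
        inter.append(acc / len(boxes))
    return [channels * w for w in softmax(inter, temperature)]

def background_channel_attention(feature, indicator, temperature=1.0):
    """C-length mask: C * softmax(sum of indicator * |F| divided by H * W)."""
    channels, height, width = len(feature), len(feature[0]), len(feature[0][0])
    inter = [sum(abs(v) * i
                 for f_row, i_row in zip(feature[c], indicator)
                 for v, i in zip(f_row, i_row)) / (height * width)
             for c in range(channels)]
    return [channels * w for w in softmax(inter, temperature)]
```

With this normalisation the spatial mask sums to H x W and each channel mask sums to the channel count, so a uniform feature image yields masks of all ones.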

7. A target detection knowledge distillation device, characterized in that it comprises: an image input module, used to input sample images with labeled target detection boxes into a teacher model and a student model respectively; a feature image acquisition module, used to acquire a first feature image output by the teacher model and a second feature image output by the student model; a first attention mask calculation module, used to calculate a first spatial attention mask, a first target channel attention mask, and a first background channel attention mask based on the first feature image; a target distillation loss calculation module, used to calculate a target distillation loss based on the first feature image, the second feature image, the first spatial attention mask, and the first target channel attention mask; a background distillation loss calculation module, used to calculate a background distillation loss based on the first feature image, the second feature image, the first spatial attention mask, and the first background channel attention mask; a loss function construction module, used to construct a loss function based on the target distillation loss and the background distillation loss; and a model parameter optimization module, used to optimize the parameters of the student model based on the loss function; wherein the target detection knowledge distillation device further comprises: a Gaussian mask calculation module, used to calculate a target Gaussian mask and a background Gaussian mask by Gaussian encoding, based on the labels of each target detection box contained in the first feature image and/or the labels of each target detection box contained in the second feature image; the target distillation loss calculation module is specifically used to calculate the target distillation loss based on the first feature image, the second feature image, the target Gaussian mask, the first spatial attention mask, and the first target channel attention mask; and the background distillation loss calculation module is specifically used to calculate the background distillation loss based on the first feature image, the second feature image, the background Gaussian mask, the first spatial attention mask, and the first background channel attention mask.

8. A terminal device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that, when the processor executes the computer program, it implements the target detection knowledge distillation method as described in any one of claims 1 to 6.

9. A computer-readable storage medium storing a computer program, characterized in that, when the computer program is executed by a processor, it implements the target detection knowledge distillation method as described in any one of claims 1 to 6.