Training method for rotating target detection model, rotating target detection method and device

By combining sparsification training and channel pruning with local distillation, the rotating target detection model was optimized, solving the problems of high computational cost and difficulty in maintaining accuracy, and achieving efficient rotating target detection on terminal devices.

CN116206177B · Active · Publication Date: 2026-06-30 · BOE TECHNOLOGY GROUP CO LTD

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents (China)
Current Assignee / Owner: BOE TECHNOLOGY GROUP CO LTD
Filing Date: 2023-02-28
Publication Date: 2026-06-30

AI Technical Summary

Technical Problem

Existing rotating target detection models suffer from high computational cost and difficulty in maintaining accuracy when deployed to terminal devices. In particular, in the case of single-class target detection, the classification Logit distillation method cannot improve accuracy, and existing algorithms are difficult to adapt to the angle prediction of rotating targets.

Method used

The student model is optimized using sparse training and channel pruning methods. Localization distillation is performed by discretizing detection information and combining it with fine-grained feature map distillation to reduce computational cost and improve localization accuracy.

Benefits of technology

It significantly reduces the computational requirements during model inference while maintaining high accuracy, and improves the accuracy of target localization in occluded and blurred scenes.



Abstract

A training method, a rotation target detection method, and an apparatus for a rotation target detection model are disclosed, comprising: inputting a first image into the rotation target detection model to obtain detection information of a target object, wherein the detection information includes the localization information of the predicted rotation bounding box of the target object; the rotation target detection model is trained by: acquiring a teacher model and a student model, wherein the output layer of the teacher model and the output layer of the student model both include an angle regression channel; discretizing the detection information output by the teacher model and the detection information output by the student model respectively; performing sparsity training and channel pruning on the discretized student model; and using the discretized teacher model to perform localization distillation on the channel-pruned student model.

Description

Technical Field

[0001] This disclosure relates to, but is not limited to, the field of rotating target detection technology, and particularly to a training method for a rotating target detection model, a rotating target detection method, and an apparatus.

Background Technology

[0002] In recent years, the field of artificial intelligence has been developing rapidly, and object detection technology is an important component of this development. The task of object detection is to find the center point of an object of interest in an image, as well as its length and width; that is, to predict a rectangle containing the object. Many objects in real life, such as shelves in a shopping mall, have large aspect ratios and arbitrary angles. For these objects, rotated object detection methods are more suitable, i.e., predicting a rectangle with an angle that contains the object.

Summary of the Invention

[0003] The following is an overview of the subject matter described in detail herein. This overview is not intended to limit the scope of the claims.

[0004] This disclosure provides a method for detecting rotating targets, including:

[0005] The first image is input into the rotating target detection model to obtain the detection information of the target object, which includes the positioning information of the predicted rotation box of the target object;

[0006] The rotating target detection model is trained using the following method: a teacher model and a student model are obtained, both of which include an angle regression channel in their output layers; the detection information output by the teacher model and the student model are discretized; the discretized student model is subjected to sparsification training and channel pruning; and the discretized teacher model is used to perform localization distillation on the channel-pruned student model.

[0007] Optionally, the positioning information includes: the X-axis coordinate of the center point, the Y-axis coordinate of the center point, the length of the short side, the length of the long side, and the rotation angle.

[0008] Optionally, the step of using the discretized teacher model to perform localization distillation on the discretized student model includes:

[0009] Obtain the discretized distribution of each location information output by the teacher model and the student model;

[0010] Soften each discretized distribution individually;

[0011] The sum of the KL divergence losses of each discretized distribution output by the teacher model and the student model is used as the localization distillation loss between the teacher model and the student model.
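The softening and KL-divergence steps listed above can be sketched as follows. This is a minimal illustration rather than the disclosure's implementation: the temperature value, the direction of the divergence (teacher relative to student), and all function names are assumptions introduced here.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax with temperature; a larger temperature gives a softer distribution."""
    scaled = [v / temperature for v in logits]
    m = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(v - m) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def localization_distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """Sum of KL(teacher || student) over all discretized localization values.

    Each argument holds one list of N logits per localization value
    (center x, center y, short side, long side, rotation angle).
    The temperature of 2.0 is an assumed value, not taken from the source.
    """
    loss = 0.0
    for t_logits, s_logits in zip(teacher_logits, student_logits):
        p_teacher = softmax(t_logits, temperature)  # softened teacher distribution
        p_student = softmax(s_logits, temperature)  # softened student distribution
        loss += kl_divergence(p_teacher, p_student)
    return loss
```

The loss is zero when the student reproduces the teacher's distributions exactly and grows as the two sets of distributions diverge.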

[0012] Optionally, the rotating target detection model is also trained using the following method:

[0013] Obtain one or more feature maps of the teacher model;

[0014] Determine the distillation region on the feature map;

[0015] For the distillation region, the teacher model is used to distill the student model.

[0016] Optionally, the distillation region is the region where the corresponding predicted rotation box matches the true rotation box.

[0017] Optionally, the detection model includes a detection head, and one or more feature maps of the acquired teacher model are the input feature maps of the detection head.

[0018] Optionally, determining the distillation region on the feature map includes:

[0019] Convert both the predicted rotation box and the true rotation box into horizontal boxes.

[0020] Set the region where the intersection over union (IoU) of the predicted rotation box and the true rotation box is greater than a preset IoU threshold as the distillation region.
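A minimal sketch of this region test is given below. The source does not say how a rotated box is converted into a horizontal box; the sketch assumes the axis-aligned rectangle that encloses the rotated box. The box layout (center x, center y, long side, short side, angle in radians), the 0.5 threshold, and all function names are likewise illustrative assumptions.

```python
import math


def rotated_to_horizontal(xc, yc, long_side, short_side, angle):
    """Axis-aligned box (x1, y1, x2, y2) enclosing a rotated box (assumed conversion)."""
    c, s = abs(math.cos(angle)), abs(math.sin(angle))
    half_w = 0.5 * (long_side * c + short_side * s)
    half_h = 0.5 * (long_side * s + short_side * c)
    return (xc - half_w, yc - half_h, xc + half_w, yc + half_h)


def horizontal_iou(a, b):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def in_distillation_region(pred_box, true_box, iou_threshold=0.5):
    """True if the horizontal-box IoU exceeds the preset threshold (0.5 is assumed)."""
    iou = horizontal_iou(rotated_to_horizontal(*pred_box),
                         rotated_to_horizontal(*true_box))
    return iou > iou_threshold
```

Comparing horizontal boxes avoids computing the exact overlap of two rotated rectangles, which keeps the region selection cheap.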

[0021] This disclosure also provides a rotating target detection apparatus, including a memory; and a processor connected to the memory, the memory being used to store instructions, the processor being configured to execute the steps of the rotating target detection method according to any embodiment of this disclosure based on the instructions stored in the memory.

[0022] This disclosure also provides a non-transitory computer-readable storage medium storing a computer program thereon, which, when executed by a processor, implements the rotating target detection method described in any embodiment of this disclosure.

[0023] This disclosure also provides a method for training a rotating target detection model, including:

[0024] Obtain a teacher model and a student model, wherein the output layer of the teacher model and the output layer of the student model both include an angle regression channel;

[0025] The detection information output by the teacher model and the detection information output by the student model are discretized respectively;

[0026] The discretized student model is subjected to sparsity training and channel pruning.

[0027] The discretized teacher model is used to perform localization distillation on the student model after channel pruning.

[0028] This disclosure also provides a training apparatus for a rotating target detection model, including a memory; and a processor connected to the memory, the memory being used to store instructions, the processor being configured to execute the steps of the training method for the rotating target detection model according to any embodiment of this disclosure based on the instructions stored in the memory.

[0029] This disclosure also provides a non-transitory computer-readable storage medium storing a computer program thereon, which, when executed by a processor, implements the training method for the rotating target detection model described in any embodiment of this disclosure.

[0030] Other aspects can be understood after reading and understanding the accompanying drawings and the detailed description.

Attached Figure Description

[0031] The accompanying drawings are provided to further illustrate the technical solutions of this disclosure and form part of the specification. They are used together with the embodiments of this disclosure to explain the technical solutions of this disclosure and do not constitute a limitation on the technical solutions of this disclosure. The shapes and sizes of the components in the drawings do not reflect actual proportions and are only intended to illustrate the content of this disclosure.

[0032] Figure 1 is a flowchart of a training method for a rotating target detection model provided by an exemplary embodiment of this disclosure;

[0033] Figure 2A shows anchor boxes output from the output layer of a neural network according to some embodiments of this disclosure;

[0034] Figure 2B shows a non-horizontally oriented bounding box according to some embodiments of this disclosure;

[0035] Figures 2C and 2D show two bounding boxes of a target to be detected in an image according to some embodiments of this disclosure;

[0036] Figure 3 shows the SkewIoU approximation process in two-dimensional space based on the Kalman filter;

[0037] Figure 4 shows a pruning process based on BN weights provided by an exemplary embodiment of this disclosure;

[0038] Figure 5 is a schematic diagram of a knowledge distillation process provided by an exemplary embodiment of this disclosure;

[0039] Figure 6 is a schematic diagram of the training process of a rotating target detection model provided by an exemplary embodiment of this disclosure;

[0040] Figure 7 is a schematic diagram of the structure of a training device for a rotating target detection model provided by an exemplary embodiment of this disclosure;

[0041] Figure 8 is a schematic diagram of the structure of a rotating target detection device provided by an exemplary embodiment of this disclosure.

Detailed Implementation

[0042] To make the objectives, technical solutions, and advantages of this disclosure clearer, the embodiments of this disclosure will be described in detail below with reference to the accompanying drawings. It should be noted that, unless otherwise specified, the embodiments and features described in this disclosure can be arbitrarily combined with each other.

[0043] Unless otherwise defined, the technical or scientific terms used in the embodiments of this disclosure shall have the ordinary meaning understood by one of ordinary skill in the art to which this disclosure pertains. The terms "first," "second," and similar terms used in the embodiments of this disclosure do not indicate any order, quantity, or importance, but are merely used to distinguish different components. Terms such as "comprising" or "including" indicate that the element or object preceding the word encompasses the elements or objects listed following the word and their equivalents, but do not exclude other elements or objects.

[0044] As shown in Figure 1, this disclosure provides a training method for a rotating target detection model, including the following steps:

[0045] Step 101: Obtain the teacher model and student model. The output layer of both the teacher model and the student model includes an angle regression channel.

[0046] Step 102: Discretize the detection information output by the teacher model and the detection information output by the student model respectively;

[0047] Step 103: Perform sparsity training and channel pruning on the discretized student model;

[0048] Step 104: Use the discretized teacher model to perform localization distillation on the student model after channel pruning.

[0049] On some terminal development boards, limited computing power constrains the size and computational load of the target detection model that can be deployed. The number of model parameters and the amount of computation therefore need to be reduced while accuracy is maintained, lowering the computing power required for model inference. In addition, for single-class target detection, distillation of the classification logits cannot soften the labels, which makes it difficult to improve accuracy.

[0050] This embodiment of the disclosure significantly reduces the computing power required for model inference through sparsity training and channel pruning, and helps maintain the accuracy of the pruned student model, avoiding the situation in which accuracy drops so far after pruning that fine-tuning can hardly restore it. By discretizing the detection information output by the teacher model and the student model, both models learn the probability distribution of the output localization information. Localization distillation is then performed on the learned probability distributions, which further improves the localization accuracy of the student model and helps improve target localization in scenarios with high uncertainty, such as occlusion and blur.

[0051] The rotating target detection model of this disclosure can be set to single-class detection, that is, it only detects targets (which can be goods or any other arbitrary targets) without distinguishing specific target categories. This disclosure decouples target classification from target positioning. As used in this disclosure, the term "decoupling" means that target classification does not exclusively depend on target positioning. In one example, target classification and target positioning are independent of each other. In some exemplary embodiments, target classification can be implemented using template matching; however, this disclosure does not limit this, and target classification is not within the scope of this disclosure.

[0052] In some exemplary embodiments, the methods of this disclosure include using object detection algorithms such as single-stage deep learning-based algorithms (e.g., YOLO, including YOLOv5) to build a rotating object detection model.

[0053] Various suitable algorithms can be used to build detection models. In one example, object detection algorithms such as deep learning-based object detection algorithms can be used to build a rotational object detection model. Examples of object detection algorithms include any suitable two-stage object detection algorithm and any suitable single-stage object detection algorithm. Examples of suitable object detection algorithms include YOLO, SSD, R-CNN, Fast R-CNN, Faster R-CNN, and CenterNet.

[0054] In some exemplary embodiments, the detection model includes a neural network. In some exemplary embodiments, the neural network includes an output layer configured to output anchor boxes that use dimensional clustering to predict the rotation boxes. Figure 2A shows anchor boxes output from the output layer of a neural network according to some embodiments of this disclosure. In some embodiments, in the detection model according to this disclosure, each anchor box includes an additional channel representing a regression value of an angle. As used herein, the term "neural network" refers to a network used to solve artificial intelligence (AI) problems. A neural network includes multiple hidden layers. Each hidden layer includes multiple neurons (e.g., nodes). The neurons in each hidden layer are connected to neurons in adjacent hidden layers, and the connections between neurons have different weights. Neural networks have a structure that mimics biological neural networks and can solve problems in a nondeterministic manner. The parameters of a neural network can be tuned through pre-training, for example, by inputting a large number of problems into the neural network and obtaining results from it. These results are fed back into the neural network so that it can adjust its parameters. Pre-training gives the neural network stronger problem-solving capabilities. As used herein, the term "channel" refers to a signal path in a neural network.

[0055] In some exemplary embodiments, the detection model of this disclosure includes a teacher model and a student model. In one example, both the teacher model and the student model are built based on the YOLOv5 model. For example, the teacher model may use rotate-YOLOv5l and the student model may use rotate-YOLOv5s; or, the teacher model may use rotate-YOLOv5m and the student model may use rotate-YOLOv5s. However, this disclosure does not limit this and the choice can be made according to the actual situation.

[0056] The overall structure of YOLOv5 can be divided into three parts: the backbone, the neck, and the head. The backbone is primarily used for feature extraction. Commonly used backbone networks include VGG, ResNet, DenseNet, MobileNet, EfficientNet, CSPDarknet 53, and Swin Transformer (YOLOv5 uses CSPDarknet 53 as its backbone). When applied to different scenarios, the model can be fine-tuned to better suit specific situations.

[0057] The neck is designed to make better use of the features extracted by the backbone by reprocessing and combining the feature maps that the backbone extracts at different stages. Commonly used neck structures include FPN, PANet, NAS-FPN, BiFPN, ASFF, and SFAM (YOLOv5 uses a PAN structure). What these structures have in common is the repeated use of upsampling, concatenation, element-wise addition, and element-wise multiplication to design feature aggregation strategies.

[0058] As a classification network, the backbone network cannot complete the localization task. The head uses the feature maps extracted by the backbone network to detect the location and / or category of the target.

[0059] In some exemplary embodiments, the method of this disclosure further includes using an intersection-over-union (IoU) loss function (e.g., the SkewIoU loss function or the KFIoU loss function) to regress the bounding box. Compared with prior-art object detection models based on the CSL (Circular Smooth Label) rotation angle classification algorithm, this disclosure significantly reduces the number of model parameters required in the object detection process.

[0060] Current open-source models generally use classification methods to obtain the rotation angle. Since the rotation angle ranges from -90° to 90°, using classification methods requires adding an angle classification head with 180 channels, making the model's detection head particularly bulky. The rotation target detection model of this disclosure uses a regression method to obtain the angle, so only one channel needs to be set, and there is no such heavy classifier, which helps to reduce the amount of data in the model.

[0061] In some exemplary embodiments, the anchor box can be represented by (x_c, y_c, long, short, angle), where x_c and y_c are the X-axis and Y-axis coordinates of the center point of the anchor box, long and short are the lengths of its long and short sides, and angle is expressed in the long-side format; for example, angle is the angle between the long side of the anchor box and the horizontal axis of the image.

[0062] In other exemplary implementations, angle represents the angle of orientation of the anchor box relative to the horizontal axis of the image (e.g., the original image in which the target to be detected is tilted relative to the horizontal axis of the image).

[0063] In some exemplary implementations, the bounding box predicted by the neural network can be represented by $(b_x, b_y, b_w, b_h, b_\theta)$. The bounding box predicted by the neural network in this disclosure can be a horizontally oriented bounding box or a non-horizontally oriented bounding box.

[0064] Figure 2B shows a non-horizontally oriented bounding box according to some embodiments of this disclosure. In the example shown in Figure 2B, the non-horizontally oriented bounding box NHBB can be represented by $(b_x, b_y, b_w, b_h, b_\theta)$, where $b_x$ and $b_y$ are the coordinates of the center point of the non-horizontally oriented bounding box NHBB, $b_w$ and $b_h$ are the lengths of its long and short sides, and $b_\theta$ is the angle of the orientation OR1 of the non-horizontally oriented bounding box NHBB relative to the horizontal axis of the image (e.g., the original image where the target to be detected is tilted relative to the horizontal axis of the image).

[0065] Figures 2C and 2D show two bounding boxes for a target to be detected in an image according to some embodiments of the present disclosure. The first bounding box BB1 in Figure 2C is a non-horizontally oriented bounding box, and the second bounding box BB2 in Figure 2D is a horizontally oriented bounding box. In Figure 2C, the horizontal axis of the image is denoted as HA, and the target can be a retail product picked up by a customer or any other target. The first bounding box BB1 and the target to be detected have substantially the same orientation. As used herein, the term "substantially the same orientation" covers a deviation of up to 0.5 degrees, 1 degree, 2 degrees, or 5 degrees from a reference orientation.

[0066] Referring to Figure 2C, in some embodiments, the target to be detected can be arbitrarily oriented relative to the horizontal axis HA of the image. As used herein, the orientation OR of the target to be detected refers to an orientation from the typically considered lower side to the typically considered upper side of the target, or from the typically considered upper side to the typically considered lower side. For example, a water bottle typically has a bottom side and a top side, and the orientation of the water bottle runs from the bottom side (on which the bottle typically rests) to the top side (where the cap is located). In retail product applications, the orientation OR of the target to be detected is typically the same as the orientation of the text on the retail product. For example, in Figures 2C and 2D, the orientation OR of the water bottle is perpendicular to the orientation of the text "water" on the bottle. The first bounding box BB1 and the target object have substantially the same orientation, e.g., orientation OR. The first bounding box BB1 is not horizontally or vertically oriented relative to the horizontal axis HA of the image; for example, neither the long side nor the short side of the first bounding box BB1 is parallel to the horizontal axis HA of the image.

[0067] Referring to Figure 2C, the orientation OR of the first bounding box BB1 or of the target to be detected is at an angle α1 relative to the horizontal axis HA of the image, where α1 is neither zero nor 90 degrees. For example, α1 is in the range of 5 degrees to 85 degrees, such as 5 degrees to 10 degrees, 10 degrees to 15 degrees, 15 degrees to 20 degrees, 20 degrees to 25 degrees, 25 degrees to 30 degrees, 30 degrees to 35 degrees, 35 degrees to 40 degrees, 40 degrees to 45 degrees, 45 degrees to 50 degrees, 50 degrees to 55 degrees, 55 degrees to 60 degrees, 60 degrees to 65 degrees, 65 degrees to 70 degrees, 70 degrees to 75 degrees, 75 degrees to 80 degrees, or 80 degrees to 85 degrees. In Figure 2C, the first bounding box BB1 or the target to be detected is neither horizontally oriented nor vertically oriented relative to the horizontal axis HA of the image.

[0068] Referring to Figure 2D, within the portion of the image defined by the second bounding box BB2, the target to be detected is oriented horizontally or vertically relative to the horizontal axis HA of the image. The orientation OR of the second bounding box BB2 or of the target to be detected is at an angle α2 relative to the horizontal axis HA of the image, where α2 is in the range of 85 to 90 degrees or 0 to 5 degrees. As used herein, the term "oriented horizontally or vertically relative to the horizontal axis HA of the image" refers to an orientation at an angle in the range of 85 to 90 degrees or 0 to 5 degrees relative to the horizontal axis HA of the image.

[0069] In some embodiments, the parameters of the bounding box can be represented as:

[0070] $b_x = 2\sigma(t_x) - 0.5 + C_x$;

[0071] $b_y = 2\sigma(t_y) - 0.5 + C_y$;

[0072] $b_w = p_w \left(2\sigma(t_w)\right)^2$;

[0073] $b_h = p_h \left(2\sigma(t_h)\right)^2$;

[0074] $b_\theta = \left(\sigma(t_\theta) - 0.5\right)\pi$;

[0075] where $t_x$, $t_y$, $t_w$, $t_h$, and $t_\theta$ are, respectively, the X-axis and Y-axis coordinates of the center point of the anchor box output from the neural network, the length of the long side of the anchor box, the length of the short side of the anchor box, and the angle of the anchor box's orientation relative to the horizontal axis of the image; $\sigma$ is the sigmoid activation function, which maps the network's predictions $t_x$, $t_y$, $t_w$, and $t_h$ into [0, 1]; $C_x$ and $C_y$ are the offsets of the cell relative to the top-left corner of the image; $p_w$ and $p_h$ are the width and height of the prior box; $b_x$ and $b_y$ are the coordinates of the center point of the bounding box; $b_w$ and $b_h$ are the lengths of the long and short sides of the bounding box; and $b_\theta$ is the angle at which the bounding box is oriented relative to the horizontal axis of the image.
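The decoding equations above can be written out directly. The following is a small sketch for a single anchor box; the function and argument names are illustrative and do not come from the disclosure.

```python
import math


def sigmoid(v):
    """Logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-v))


def decode_box(t_x, t_y, t_w, t_h, t_theta, c_x, c_y, p_w, p_h):
    """Decode raw network outputs into (b_x, b_y, b_w, b_h, b_theta).

    c_x, c_y: offsets of the grid cell relative to the top-left corner.
    p_w, p_h: width and height of the prior (anchor) box.
    """
    b_x = 2.0 * sigmoid(t_x) - 0.5 + c_x
    b_y = 2.0 * sigmoid(t_y) - 0.5 + c_y
    b_w = p_w * (2.0 * sigmoid(t_w)) ** 2
    b_h = p_h * (2.0 * sigmoid(t_h)) ** 2
    # sigmoid(.) - 0.5 lies in (-0.5, 0.5), so the angle lies in (-pi/2, pi/2).
    b_theta = (sigmoid(t_theta) - 0.5) * math.pi
    return b_x, b_y, b_w, b_h, b_theta
```

With all raw outputs equal to zero, the decoded box sits at the center of its cell, keeps the prior width and height, and has a rotation angle of zero.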

[0076] In some exemplary implementations, the neural network is trained using the following loss function:

[0077] $L_{total} = L_{obj} + L_{reg} = L_{obj} + L_c + L_{kf}$;

[0078] where $L_{obj}$ is the confidence loss. In one example, the confidence loss $L_{obj}$ is the BCE loss. Optionally, the BCE loss is expressed as $L_{BCE} = -\left[y\log(\sigma(x)) + (1-y)\log(1-\sigma(x))\right]$;

[0079] where $L_{BCE}$ is the BCE (binary cross-entropy) loss; $\sigma$ is the sigmoid activation function; $x$ is the confidence of the bounding box predicted by the neural network; and $y$ is the ground-truth confidence. In one example, $y$ is either 0 or 1: $y = 1$ when the intersection over union (IoU) is greater than or equal to 0.5, and $y = 0$ when the IoU is less than 0.5. In object detection, IoU is the ratio of the intersection to the union of the predicted bounding box and the ground-truth bounding box. Ideally the two boxes overlap completely, giving a ratio of 1, but in general a detection is considered correct as long as IoU ≥ 0.5. In another example, since an object detection algorithm such as YOLOv5 is used, the ground-truth confidence is the IoU between the current predicted bounding box and the ground-truth bounding box. In a two-dimensional detection task, the value of KFIoU lies in [0, 1/3], so the ground-truth confidence can be set to 3 × KFIoU.

[0080] In some exemplary embodiments, $L_{reg} = L_c + L_{kf}$ and $L_{kf} = 1 - \mathrm{KFIoU}$, where $L_c$ is the center-point distance loss, for which the Smooth L1 loss is used. KFIoU is an approximation of SkewIoU (the intersection over union of rotated boxes) that uses Gaussian distributions to model the SkewIoU calculation without introducing additional hyperparameters.
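As a concrete illustration of the confidence loss and the scaled confidence target described above, a minimal sketch follows. The numerically stable softplus form and the function names are choices made here, not details taken from the disclosure.

```python
import math


def softplus(v):
    """log(1 + exp(v)), computed without overflow for large |v|."""
    return max(v, 0.0) + math.log1p(math.exp(-abs(v)))


def bce_with_logits(x, y):
    """L_BCE = -[y*log(sigmoid(x)) + (1 - y)*log(1 - sigmoid(x))].

    Uses log(sigmoid(x)) = -softplus(-x) and log(1 - sigmoid(x)) = -softplus(x).
    """
    return y * softplus(-x) + (1.0 - y) * softplus(x)


def confidence_target(kfiou):
    """Ground-truth confidence 3 * KFIoU, mapping the range [0, 1/3] onto [0, 1]."""
    return 3.0 * kfiou
```

A raw prediction of 0 corresponds to a confidence of 0.5, so against a target of 1 the loss equals log 2.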

[0081] Figure 3 illustrates the SkewIoU approximation process in two-dimensional space based on a Kalman filter. Referring to Figure 3, in step (a), the bounding box is converted to a Gaussian distribution. In step (a):

[0082] $\mu = (x, y)^{T}$;

[0083]

[0084] $\Sigma = R \Lambda R^{T}$; and

[0085]

[0086] Referring to Figure 3, in step (b), the center-point loss narrows the distance between the centers. In step (b):

[0087]

[0088] Referring to Figure 3, in step (c), the Gaussian distribution of the overlapping region is obtained through Kalman filtering. In step (c):

[0089] $\alpha G_{kf}(\mu, \Sigma) = G_1(\mu_1, \Sigma_1)\, G_2(\mu_2, \Sigma_2)$;

[0090] $K = \Sigma_1 (\Sigma_1 + \Sigma_2)^{-1}$;

[0091] $\mu = \mu_1 + K(\mu_2 - \mu_1)$; and

[0092] $\Sigma = \Sigma_1 - K \Sigma_1$.

[0093] Referring to Figure 3, in step (d), the Gaussian distribution is converted back into a rotated bounding box to compute the approximate SkewIoU. In step (d):

[0094]

[0095]

[0096] $L_{kf}(\Sigma_1, \Sigma_2) = 1 - \mathrm{KFIoU}$.
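The steps above can be traced numerically with the sketch below. Several formulas are not reproduced in this text, so the sketch fills them with the commonly published KFIoU formulation, which is an assumption here: $\Lambda = \mathrm{diag}(w^2/4, h^2/4)$, $R$ is the 2D rotation matrix for the box angle, and the area of the box recovered from a covariance $\Sigma$ is $4\sqrt{\det \Sigma}$. All function names are illustrative.

```python
import math


def box_to_covariance(w, h, theta):
    """Sigma = R * diag(w^2/4, h^2/4) * R^T for a box with sides w, h and angle theta."""
    c, s = math.cos(theta), math.sin(theta)
    a, b = w * w / 4.0, h * h / 4.0
    off = (a - b) * c * s
    return [[a * c * c + b * s * s, off], [off, a * s * s + b * c * c]]


def det2(m):
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]


def inv2(m):
    d = det2(m)
    return [[m[1][1] / d, -m[0][1] / d], [-m[1][0] / d, m[0][0] / d]]


def matmul2(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


def box_area_from_covariance(sigma):
    """Area of the box recovered from a 2D covariance (assumed: 4 * sqrt(det))."""
    return 4.0 * math.sqrt(max(det2(sigma), 0.0))


def kfiou(box1, box2):
    """Approximate SkewIoU of two boxes given as (w, h, theta).

    Centers are not used here: the center distance is handled by the separate
    center-point loss L_c, and L_kf depends only on the two covariances.
    """
    s1 = box_to_covariance(*box1)
    s2 = box_to_covariance(*box2)
    s_sum = [[s1[i][j] + s2[i][j] for j in range(2)] for i in range(2)]
    k = matmul2(s1, inv2(s_sum))                      # K = S1 (S1 + S2)^-1
    ks1 = matmul2(k, s1)
    s3 = [[s1[i][j] - ks1[i][j] for j in range(2)] for i in range(2)]  # S = S1 - K S1
    v1 = box_area_from_covariance(s1)
    v2 = box_area_from_covariance(s2)
    v3 = box_area_from_covariance(s3)
    return v3 / (v1 + v2 - v3)


def kfiou_loss(box1, box2):
    """L_kf = 1 - KFIoU."""
    return 1.0 - kfiou(box1, box2)
```

Under these assumptions two identical boxes give a KFIoU of 1/3, which is consistent with the stated range of [0, 1/3] and with scaling the confidence target by 3.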

[0097] The method disclosed herein enables the target detection model to learn a probability distribution by discretizing the coordinates and angle values of the bounding box (i.e., the output anchor box) in order to achieve subsequent localization distillation.

[0098] In some exemplary embodiments, the detection information output by the teacher model and the detection information output by the student model are discretized, including:

[0099] Multiply the number of channels related to the location of all output anchor boxes in the teacher model and student model by N, where N is a natural number greater than 1, to discretize the value of each location information into a distribution represented by N probability values.

[0100] The bounding box's location information includes five values: the center point's X-axis coordinate, the center point's Y-axis coordinate, the shorter side length w, the longer side length h, and the rotation angle theta. Each value is discretized into a distribution represented by N probability values (default N=9) to model the ambiguity of the bounding box's location. Taking the angle as an example, the sharper the probability distribution, the more certain the angle; the flatter the distribution, the higher the uncertainty of the angle (usually caused by occlusion).

[0101] In specific implementation, the number of channels related to the localization of all output anchor boxes in the model is multiplied by N, that is, the predicted value of each anchor box changes from 5+1 to 5N+1 (where 1 represents the confidence prediction value, that is, the probability value of predicting that the content in the rotating box is the target). Then the number of output channels of the three heads of the network changes from na×80×80×(5+1), na×40×40×(5+1), na×20×20×(5+1) to na×80×80×(5N+1), na×40×40×(5N+1), na×20×20×(5N+1), where na is the number of anchor boxes predicted for each point on the output feature map, which is 3 by default.
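The change in output size can be checked with a few lines of arithmetic. The helper below is illustrative only; its name and the tuple layout are assumptions made for this example.

```python
def head_output_shapes(n_bins, na=3, grid_sizes=(80, 40, 20)):
    """Output shape (na, H, W, channels) of each of the three detection heads.

    Each anchor box predicts 5 localization values, each discretized into
    n_bins probability values, plus 1 confidence value.
    """
    per_anchor = 5 * n_bins + 1
    return [(na, g, g, per_anchor) for g in grid_sizes]


# Without discretization (n_bins = 1) each anchor box predicts 5 + 1 = 6 values;
# with the default N = 9 it predicts 5 * 9 + 1 = 46 values.
baseline = head_output_shapes(1)
discretized = head_output_shapes(9)
```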

[0102] The target detection model of this disclosure requires an additional process for calculating the expected value of the output distribution during the bounding box decoding process.

[0103] In some exemplary embodiments, the process of calculating the expected value of the output distribution includes:

[0104] For each probability distribution, perform a softmax operation on the N discrete values, multiply each resulting probability by the corresponding predefined interval value, and sum the products to obtain the expected value, as shown in the formulas below.

[0105] $P' = \mathrm{Softmax}(P)$;

[0106] $y' = \sum_{i=1}^{N} P'_i \, y_i$;

[0107] where $y'$ is the expected output value, $P$ is the network output, and $y_i$ are the discretized interval values, obtained by dividing the defined range into N parts.
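A minimal sketch of this decoding step is shown below. The source only states that the defined range is divided into N parts; placing the interval values evenly with both endpoints included is an assumption made here, as are the function and parameter names.

```python
import math


def expected_value(logits, lower, upper):
    """Softmax over N logits, then the expectation over N interval values.

    The interval values are spaced evenly from lower to upper inclusive,
    which is an assumed discretization. Requires N > 1.
    """
    n = len(logits)
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    step = (upper - lower) / (n - 1)
    interval_values = [lower + i * step for i in range(n)]
    return sum(p * y for p, y in zip(probs, interval_values))
```

A flat distribution decodes to the middle of the range, while a sharply peaked distribution decodes to the interval value under the peak.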

[0108] In some exemplary implementations, sparsity training and channel pruning are performed on the discretized student model, including:

[0109] Set sparsity coefficient s, perform sparsity training based on the set sparsity coefficient s, and regularize the weights of the BN (Batch Normalization) layer during the training process.

[0110] For the model obtained through sparse training, iterate through each module of the model, obtain the weights of all BN layers, and sort them by absolute value.

[0111] The number of channels to be pruned and the BN weight threshold are calculated based on the pruning ratio P and the number of features in all BN layers of the network (excluding ignored layers);

[0112] Iterate through all BN layers in the model. If the weight of the current channel in a layer is greater than or equal to the BN weight threshold, retain the channel; if it is less than the BN weight threshold, mark the channel for deletion. Count the number of channels to be pruned in the layer, C. If the entire layer would be pruned, retain a minimum number of channels (e.g., 8). Check whether C is divisible by the preset speed optimization parameter round_to (set according to hardware conditions; for example, if a board is speed-optimized for channel counts that are multiples of $2^n$, then round_to = $2^n$). If C is not divisible by round_to, round C down to the nearest multiple of round_to and use this value as the actual number of channels to be pruned in the layer (rounding up is not recommended, as it tends to cause a large loss of accuracy). Then delete the corresponding channels in the layer and in the layers that depend on it, and finally generate the pruned model.

[0113] For rotating target detection models, directly pruning based on the convolution kernel weights (such as L1 and L2 pruning) results in a significant loss of model accuracy. Therefore, sparse training is adopted, and the weights of the BN layer (i.e., the scaling factor γ in BN) are regularized during training, so that the model learns to reduce the weights of redundant channels during training.

[0114] Specifically, a sparsity coefficient 's' is set. For example, 's' can be set to 0.001. A larger 's' indicates a higher sparsity ratio, but the accuracy of the trained model will be lower. Before gradient backpropagation, an L1 regularization term is added to the gradients of all BN layers.

[0115] g′ = g + s * Sign(W_BN);

[0116] Where g is the original gradient of the BN layer, obtained from the detection model loss, and W_BN is the weight of the BN layer; the Sign function gives the gradient of the absolute value of the BN layer weights.
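
As an illustration of the update g′ = g + s * Sign(W_BN), the following sketch adds the L1 term to a list of BN gradients. It assumes gradients and weights are available as plain Python lists; the function name `add_l1_term` is hypothetical, and in a real training loop the same operation would be applied to the framework's gradient tensors before the optimizer step.

```python
def sign(x):
    """Return -1, 0, or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def add_l1_term(grads, bn_weights, s=0.001):
    """Apply g' = g + s * Sign(W_BN) element-wise to the BN gradients."""
    return [g + s * sign(w) for g, w in zip(grads, bn_weights)]
```

Because the added term always has the sign of the weight, a gradient descent step pushes every BN weight toward zero by a constant amount, and only channels with a large enough loss gradient resist it.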

[0117] This embodiment of the disclosure uses the BN layer weight as the criterion for evaluating the importance of the corresponding channel during pruning. As shown in Figure 4, for the model obtained through sparse training, each module of the model is traversed to obtain the weights of all BN layers, which are sorted by absolute value; the number of channels to be pruned and the BN weight threshold are calculated based on the pruning ratio P and the number of features in all BN layers of the network (excluding ignored layers); all BN layers of the model are traversed, and if the current channel weight of the current layer is greater than the threshold, the channel is retained; otherwise, the channel is deleted. The number of channels to be pruned in the current layer, C, is counted. If the entire layer would be pruned, the minimum number of channels is retained (default 8). The round_to parameter is set according to hardware conditions (some boards optimize speed for channel counts of 2^n); if C is not divisible by round_to, C is rounded down to a divisible value (rounding up is not used, as it would severely degrade model accuracy), and then the corresponding filter channels of the dependent layers are deleted, finally generating the pruned model. For example, assuming C = 9 and round_to = 8, the actual number of channels pruned in this layer is 8.

[0118] This disclosure performs pruning based on the sparsity training results. The pruning ratio is small for shallow layers of the network and large for deep layers. In particular, the number of channels to be pruned in the first few layers may be smaller than the set round_to parameter. However, shallow layers have large feature maps and therefore a high computational cost; if they are not pruned, the computational cost is difficult to reduce. Therefore, in some examples, round_to rounding is not applied to shallow layers, and pruning is performed directly based on the number of channels C to be pruned in the layer.
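
The per-layer adjustment of C, including the worked example C = 9 with round_to = 8, can be sketched as below. The function name `actual_prune_count` and the `apply_round` flag are hypothetical; the flag is only an assumed way to switch rounding off for layers where it is not applied.

```python
def actual_prune_count(c, layer_channels, round_to=8, min_channels=8, apply_round=True):
    """Return the number of channels actually pruned in one layer.

    c is the number of channels whose BN weight fell below the threshold.
    """
    if c >= layer_channels:
        # The whole layer would be pruned: keep the minimum number of channels.
        c = max(layer_channels - min_channels, 0)
    if apply_round:
        # Round down to a multiple of round_to; rounding up is avoided.
        c -= c % round_to
    return c
```

For a 64-channel layer with C = 9 this returns 8, and with rounding disabled it returns 9.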

[0119] In some exemplary implementations, the head output layer is ignored during both the sparse training and pruning processes to ensure that the number of anchor boxes predicted by the network remains unchanged.

[0120] After pruning and compression, the number of parameters and computational cost of the model are reduced, but the accuracy will decrease. This disclosure uses the model distillation method to guide the student model to learn using a pre-trained teacher model, transferring the knowledge of the large model to the small model, thereby improving the accuracy of the small model.

[0121] In some exemplary embodiments, the discretized teacher model is used to perform localization distillation on the discretized student model, including:

[0122] Obtain the discretized distribution of each location information output by the teacher model and the student model;

[0123] Soften each discretized distribution individually;

[0124] The sum of the KL (Kullback-Leibler) divergence losses of each discretized distribution output by the teacher and student models is used as the localization distillation loss between the teacher and student models.

[0125] After discretizing the five regression values of the bounding box, the probability distributions of the five regression values can be softened using a temperature-controlled Softmax function. Then, the KL divergence loss between the teacher model and the student model is calculated, and the LD distillation loss is as follows.

[0126]

[0127] Where the two distribution terms are the discrete distributions of a single positioning index output by the student model and the teacher model, respectively (for example, a single positioning index can be any one of the aforementioned center point X-axis coordinate, center point Y-axis coordinate, short side length, long side length, and rotation angle). T is the temperature of the softmax function, used to increase the information entropy of the output distribution, with T > 1; the larger T is, the smoother the distribution, and the smaller T is, the sharper the distribution. For example, T can be 10. KL denotes the KL divergence, RB represents the five localization metrics of the bounding box [x_c, y_c, long, short, angle], and the overall loss L_LD(RB_S, RB_T) is the sum of the KL divergence losses for each positioning index.

[0128] Since negative samples predominate in the anchors predicted by the network, to avoid the feature differences on positive samples being overwhelmed, only positive samples are used for the LD distillation loss calculation; that is, RB_S and RB_T represent positive sample anchors.
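
A minimal sketch of the localization distillation loss for one positive anchor is given here. Several points are assumptions rather than statements of the disclosure: the KL divergence is taken from the teacher distribution to the student distribution (the usual distillation convention), the logits are plain Python lists, a small `eps` guards the logarithm, and the function names are hypothetical.

```python
import math


def soft_distribution(logits, temperature):
    """Softmax of logits / temperature; a larger temperature gives a smoother result."""
    scaled = [v / temperature for v in logits]
    m = max(scaled)
    exps = [math.exp(v - m) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def ld_loss(student_logits, teacher_logits, temperature=10.0):
    """Sum of KL terms over the five indices [x_c, y_c, long, short, angle].

    Each argument is a list of five logit lists for one positive anchor.
    """
    total = 0.0
    for s_logits, t_logits in zip(student_logits, teacher_logits):
        p_teacher = soft_distribution(t_logits, temperature)
        p_student = soft_distribution(s_logits, temperature)
        total += kl_divergence(p_teacher, p_student)
    return total
```

In training, this value would be accumulated over positive sample anchors only, as described above.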

[0129] In some exemplary embodiments, the rotating target detection model is also trained using the following method:

[0130] Obtain one or more feature maps of the teacher model;

[0131] Determine the distillation region on the feature map;

[0132] For the distillation region, the teacher model is used to distill the student model.

[0133] For object detection, images typically contain large background areas, and distilling over the entire map allows a large amount of irrelevant information to overwhelm the feature response differences in the target region. Therefore, this disclosure also employs a fine-grained feature map distillation algorithm to focus on local features.

[0134] In some exemplary embodiments, the distillation region is the region where the corresponding predicted rotation box matches the true rotation box.

[0135] This embodiment of the disclosure only performs fine-grained feature map distillation on the anchor regions that match the ground truth (GT). The fine-grained feature map distillation loss is as follows:

[0136]

[0137] Where n is the number of feature maps used for distillation; α_i are the weights of the different feature layers (in one example, α_i is 1 for all layers); the two feature terms are the student model features and the teacher model features, respectively; f_adap consists of a conv-relu module and is used to match the scale of the student model's feature map to the scale of the teacher model's feature map, after which the response difference between the two models is calculated using MSE loss. Mask is either 1 or 0: it is 1 for the distillation region and 0 for the non-distillation region. The distillation region is the area where the teacher model output matches the ground truth (GT). The Mask has the same height and width scale as the feature map.

[0138] In some exemplary embodiments, determining the distillation region on the feature map includes:
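
The masked feature-map loss can be illustrated with the sketch below. It is not the disclosed formula: it assumes the student features have already been passed through f_adap so that both feature maps have the same shape [C][H][W], and it normalizes by the number of masked elements, which is an assumed choice. The names `masked_mse` and `fd_loss` are hypothetical.

```python
def masked_mse(student, teacher, mask):
    """Mean squared difference over positions where mask is 1.

    student, teacher: nested lists shaped [C][H][W]; mask: nested list [H][W] of 0/1.
    """
    total, count = 0.0, 0
    for s_channel, t_channel in zip(student, teacher):
        for y, mask_row in enumerate(mask):
            for x, m in enumerate(mask_row):
                if m:
                    total += (s_channel[y][x] - t_channel[y][x]) ** 2
                    count += 1
    return total / count if count else 0.0


def fd_loss(student_maps, teacher_maps, masks, alphas=None):
    """Weighted sum of masked MSE terms over the n selected feature maps."""
    if alphas is None:
        alphas = [1.0] * len(student_maps)  # alpha_i = 1 for all layers
    return sum(
        a * masked_mse(s, t, m)
        for a, s, t, m in zip(alphas, student_maps, teacher_maps, masks)
    )
```

Positions where the mask is 0 contribute nothing, so background responses cannot dominate the loss.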

[0139] Convert both the predicted and ground truth rotation boxes into horizontal boxes.

[0140] The region where the IoU between the predicted and ground truth rotation boxes is greater than a preset IoU threshold is set as the distillation region.

[0141] When the bounding box of the target is a non-horizontally oriented bounding box, since the calculation of rotation IoU is sensitive to angle, in order to avoid matching failures due to inaccurate angle prediction, which would lead to a small distillation area, both the ground truth rotated box and the predicted rotated box are converted into horizontal boxes. Then, matching is performed based on IoU. For anchor regions that are successfully matched (IoU > threshold 0.5), the corresponding position on the mask is set to 1, otherwise it is set to 0.
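
The conversion to horizontal boxes and the IoU-based matching might be sketched as follows. The angle convention is an assumption (radians, measured between the long side and the X axis), as are the function names; the threshold 0.5 is the value given above.

```python
import math


def rotated_to_horizontal(xc, yc, long_side, short_side, angle):
    """Axis-aligned box (x1, y1, x2, y2) enclosing a rotated box."""
    c, s = abs(math.cos(angle)), abs(math.sin(angle))
    width = long_side * c + short_side * s
    height = long_side * s + short_side * c
    return (xc - width / 2, yc - height / 2, xc + width / 2, yc + height / 2)


def horizontal_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def mask_value(pred_box, gt_box, iou_threshold=0.5):
    """Return 1 if the anchor belongs to the distillation region, else 0."""
    iou = horizontal_iou(rotated_to_horizontal(*pred_box), rotated_to_horizontal(*gt_box))
    return 1 if iou > iou_threshold else 0
```

Because the angle only affects the enclosing rectangle mildly, a prediction with a small angle error still matches its ground truth box under this test.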

[0142] In some exemplary implementations, one or more feature maps of the acquired teacher model are the input feature maps of the head.

[0143] If the input feature maps of the head are selected for distillation, i.e., the feature maps with height and width scales of 80*80, 40*40, and 20*20, then no anchor mapping is required and the distillation region is directly a point. If feature maps from other layers are selected for distillation, the anchors need to be scaled and mapped. Therefore, optionally, the three feature maps input to the head are used for fine-grained feature map distillation.

[0144] In summary, the overall loss of the student model distillation is a weighted sum of the confidence loss, regression loss, and distillation loss, as shown in the following equation:

[0145] L = ω_1 * L_obj + ω_2 * L_reg + ω_3 * L_LD + ω_4 * L_FD;

[0146] Where L represents the total loss; L_obj represents the confidence loss; L_reg represents the regression loss, with L_reg = L_c + L_kf, where L_c represents the center point distance loss and L_kf = 1 - KFIoU; KFIoU is an approximation of the rotated IoU; ω_i represents the weight of the i-th type of loss, and the value of each ω_i can be set according to actual needs.
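
The weighted sum can be written out directly. In this sketch the individual loss values and KFIoU are assumed to have been computed elsewhere, the default weights of 1 are placeholders, and the function name `total_loss` is hypothetical.

```python
def total_loss(l_obj, l_c, kfiou, l_ld, l_fd, weights=(1.0, 1.0, 1.0, 1.0)):
    """L = w1 * L_obj + w2 * L_reg + w3 * L_LD + w4 * L_FD, with L_reg = L_c + (1 - KFIoU)."""
    w1, w2, w3, w4 = weights
    l_reg = l_c + (1.0 - kfiou)
    return w1 * l_obj + w2 * l_reg + w3 * l_ld + w4 * l_fd
```

Setting ω_3 and ω_4 to zero recovers the ordinary detection loss without distillation.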

[0147] The rotating target detection model provided in this disclosure is adapted to a single-class detection + template matching technical scheme. It therefore has only a single target class and cannot use classification logits for distillation. This disclosure discretizes the bounding box logit values to perform localization distillation, and combines it with fine-grained feature map distillation improved with bounding box adaptation, so that the student model imitates the feature extraction capability of the large model. As shown in Figure 5, S(x,T) represents the probability distribution of the head output. The LD distillation loss is calculated from the two probability distributions output by the teacher model and the student model. The input image is fed to the teacher model and the student model respectively. The teacher model and the student model each extract features from the input image to obtain multiple feature maps, and one or more of these feature maps are selected for fine-grained feature map distillation. For example, the input feature map of the head can be selected. Since the number of channels in the student model's feature map is less than the number of channels in the teacher model's feature map (for example, the student model has 256 channels and the teacher model has 512 channels), a scale transformation is performed through a convolutional layer (Conv) to expand the 256 channels to 512 channels. After feature matching, the fine-grained feature map distillation loss is calculated for the set distillation region (i.e., the region where Mask is 1).

[0148] As shown in Figure 6, in practical use, the channel pruning and knowledge distillation processes can be iterated multiple times as needed. If, after the current channel pruning and knowledge distillation, the computational cost of the model does not meet the preset computational cost requirement (meeting the requirement means that the computational cost of the model is less than or equal to the preset computational cost threshold) and the accuracy loss is lower than the preset accuracy loss threshold (for example, the preset accuracy loss threshold can be 2%), then sparsity training, channel pruning, and knowledge distillation can be repeated, or only channel pruning and knowledge distillation can be repeated, until the computational cost of the model meets the preset computational cost requirement and the accuracy loss is lower than the preset accuracy loss threshold. If, after the current channel pruning and knowledge distillation, the accuracy loss of the model is higher than the preset accuracy loss threshold, the iteration stops and the program exits. If, after the current channel pruning and knowledge distillation, the computational cost of the model meets the preset computational cost requirement and the accuracy loss is lower than the preset accuracy loss threshold, the lightweight model is output.
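
The control flow of this iteration can be sketched as a small loop. Everything here is illustrative: `prune`, `distill`, `cost_of`, and `accuracy_of` are hypothetical callbacks supplied by the caller, `max_rounds` is an added safeguard that the disclosure does not mention, and an accuracy loss exactly equal to the threshold is treated as acceptable, which is an assumption.

```python
def compress(model, prune, distill, cost_of, accuracy_of,
             baseline_accuracy, max_cost, max_acc_loss=0.02, max_rounds=10):
    """Repeat pruning and distillation until the cost target is met.

    Returns the lightweight model, or None if the accuracy loss exceeds
    the threshold or the round limit is reached.
    """
    for _ in range(max_rounds):
        model = distill(prune(model))
        acc_loss = baseline_accuracy - accuracy_of(model)
        if acc_loss > max_acc_loss:
            return None  # accuracy loss too high: stop iterating and exit
        if cost_of(model) <= max_cost:
            return model  # cost requirement met with acceptable accuracy
    return None
```

The `prune` callback may include a sparsity training step or not, covering both repetition variants described above.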

[0149] For example, a lightweight rotating target detection model is constructed according to the training method of this disclosure, including the following steps:

[0150] S1. Build a rotating target detection model based on YOLOv5 (including teacher model and student model);

[0151] S2. Discretize the four coordinate values and one angle value of the output anchor box of the teacher model and the student model respectively, and expand each value to a distribution represented by N probabilities;

[0152] S3. Using rotate-yolov5l as the teacher model, perform normal training;

[0153] S4. Using rotate-yolov5s as the basic model for pruning, perform sparse training.

[0154] S5. Perform channel pruning on rotate-yolov5s based on the weights of the BN layer. For example, the pruning ratio can be 0.2.

[0155] S6. Distill the pruned and compressed model using the teacher model rotate-yolov5l obtained in S3.

[0156] S7. Prune the trained model from S6 again based on the weights of the BN layer. For example, the pruning ratio can be 0.2.

[0157] S8. Distill the pruned and compressed model from S7 again using the teacher model rotate-yolov5l;

[0158] S9. Obtain the final lightweight rotating target detection model.

[0159] In this example, the teacher model is rotate-yolov5l and the student model is rotate-yolov5s, selected based on the results of the distillation experiment. In practice, these models can be changed depending on the data available; for example, yolov5m could be used as the teacher model. This example performs two rounds of channel pruning and knowledge distillation to obtain the final lightweight rotating target detection model. In actual use, the number of iterations can be selected as needed.

[0160] The lightweight rotating target detection model constructed according to the training method provided in this disclosure can maintain model accuracy while reducing the number of model parameters and the computational load. This disclosure improves the accuracy of the pruned model by combining localization distillation and feature map distillation, avoiding the problem that single-class detection models cannot perform label softening using classification logits.

[0161] This disclosure also provides a training apparatus for a rotating target detection model, including a memory; and a processor connected to the memory, the memory being used to store instructions, the processor being configured to execute the steps of a training method for a rotating target detection model as described in any embodiment of this disclosure based on the instructions stored in the memory.

[0162] As shown in Figure 7, in one example, the training device for the rotating target detection model may include: a first processor 710, a first memory 720, and a first bus system 730, wherein the first processor 710 and the first memory 720 are connected through the first bus system 730, the first memory 720 is used to store instructions, and the first processor 710 is used to execute the instructions stored in the first memory 720 to: obtain a teacher model and a student model, wherein the output layer of the teacher model and the output layer of the student model both include an angle regression channel; discretize the detection information output by the teacher model and the detection information output by the student model respectively; perform sparse training and channel pruning on the discretized student model; and use the discretized teacher model to perform localization distillation on the channel-pruned student model.

[0163] It should be understood that the first processor 710 can be a central processing unit (CPU), or it can be another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), or another programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc. The general-purpose processor can be a microprocessor or any conventional processor, etc.

[0164] The first memory 720 may include read-only memory and random access memory, and provides instructions and data to the first processor 710. A portion of the first memory 720 may also include non-volatile random access memory. For example, the first memory 720 may also store device type information.

[0165] In addition to a data bus, the first bus system 730 may also include a power bus, a control bus, and a status signal bus. However, for clarity, all buses are designated as the first bus system 730 in Figure 7.

[0166] In implementation, the processing performed by the processing device can be accomplished through integrated logic circuits in the hardware of the first processor 710 or through software instructions. That is, the method steps of this embodiment can be executed by a hardware processor, or by a combination of hardware and software modules within the processor. The software modules can reside in random access memory, flash memory, read-only memory, programmable read-only memory, electrically erasable programmable memory, registers, or other storage media. This storage medium is located in the first memory 720. The first processor 710 reads information from the first memory 720 and, in conjunction with its hardware, completes the steps of the above method. To avoid repetition, further details are omitted here.

[0167] This disclosure also provides a non-transient computer-readable storage medium storing a computer program thereon. When executed by a processor, the program implements the training method for a rotating target detection model as described in any embodiment of this disclosure. The training method for a rotating target detection model driven by executing executable instructions is essentially the same as the training method for a rotating target detection model provided in the above embodiments of this disclosure, and will not be described in detail here.

[0168] In some possible implementations, various aspects of the training method for the rotating target detection model provided in this disclosure can also be implemented as a program product comprising program code that, when run on a computer device, causes the computer device to perform the steps in the training method for the rotating target detection model according to various exemplary embodiments of this disclosure as described above. For example, the computer device can execute the training method for the rotating target detection model described in the embodiments of this disclosure.

[0169] The program product may employ any combination of one or more readable media. A readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may be, for example, but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination thereof. More specific examples of readable storage media (a non-exhaustive list) include: an electrical connection having one or more wires, a portable disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), optical storage devices, magnetic storage devices, or any suitable combination thereof.

[0170] This disclosure also provides a method for detecting rotating targets, including:

[0171] The first image is input into the rotating target detection model to obtain the detection information of the target object, which includes the positioning information of the predicted rotation box of the target object;

[0172] The rotating target detection model is trained using the following method: a teacher model and a student model are obtained, both of which include an angle regression channel in their output layers; the detection information output by the teacher model and the student model are discretized; the discretized student model is subjected to sparsification training and channel pruning; and the discretized teacher model is used to perform localization distillation on the channel-pruned student model.

[0173] The specific training method for the rotating target detection model can be found in the foregoing description, and will not be repeated here in the embodiments disclosed herein.

[0174] This disclosure also provides a rotating target detection apparatus, including a memory; and a processor connected to the memory, the memory being used to store instructions, the processor being configured to execute the steps of the rotating target detection method as described in any embodiment of this disclosure based on the instructions stored in the memory.

[0175] As shown in Figure 8, in one example, the rotating target detection device may include: a second processor 810, a second memory 820, and a second bus system 830, wherein the second processor 810 and the second memory 820 are connected through the second bus system 830, the second memory 820 is used to store instructions, and the second processor 810 is used to execute the instructions stored in the second memory 820 to input a first image into a rotating target detection model to obtain detection information of the target object, the detection information including the localization information of the predicted rotation box of the target object; the rotating target detection model is trained by the following method: obtaining a teacher model and a student model, the output layer of the teacher model and the output layer of the student model both including an angle regression channel; discretizing the detection information output by the teacher model and the detection information output by the student model respectively; performing sparsity training and channel pruning on the discretized student model; and using the discretized teacher model to perform localization distillation on the channel-pruned student model.

[0176] It should be understood that the second processor 810 can be a central processing unit (CPU), or it can be another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), or another programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc. The general-purpose processor can be a microprocessor or any conventional processor, etc.

[0177] The second memory 820 may include read-only memory and random access memory, and provides instructions and data to the second processor 810. A portion of the second memory 820 may also include non-volatile random access memory. For example, the second memory 820 may also store device type information.

[0178] In addition to the data bus, the second bus system 830 may also include a power bus, a control bus, and a status signal bus. However, for clarity, all buses are designated as the second bus system 830 in Figure 8.

[0179] In implementation, the processing performed by the processing device can be accomplished through integrated logic circuits in the hardware of the second processor 810 or through software instructions. That is, the method steps of this embodiment can be executed by a hardware processor, or by a combination of hardware and software modules within the processor. The software modules can reside in random access memory, flash memory, read-only memory, programmable read-only memory, electrically erasable programmable memory, registers, or other storage media. This storage medium is located in the second memory 820. The second processor 810 reads information from the second memory 820 and, in conjunction with its hardware, completes the steps of the above method. To avoid repetition, further details are omitted here.

[0180] This disclosure also provides a non-transient computer-readable storage medium storing a computer program thereon, which, when executed by a processor, implements the rotating target detection method as described in any embodiment of this disclosure.

[0181] In some possible implementations, various aspects of the rotating target detection method provided in this disclosure can also be implemented as a program product including program code that, when run on a computer device, causes the computer device to perform the steps in the rotating target detection method according to various exemplary embodiments of this disclosure described above. For example, the computer device can execute the rotating target detection method described in the embodiments of this disclosure.

[0182] The program product may employ any combination of one or more readable media. A readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may be, for example, but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination thereof. More specific examples of readable storage media (a non-exhaustive list) include: an electrical connection having one or more wires, a portable disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), optical storage devices, magnetic storage devices, or any suitable combination thereof.

[0183] It will be understood by those skilled in the art that all or some of the steps, systems, or apparatuses disclosed above, and their functional modules / units, can be implemented as software, firmware, hardware, or suitable combinations thereof. In hardware implementations, the division between functional modules / units mentioned above does not necessarily correspond to the division of physical components; for example, a physical component may have multiple functions, or a function or step may be performed collaboratively by several physical components. Some or all components may be implemented as software executed by a processor, such as a digital signal processor or microprocessor, or as hardware, or as an integrated circuit, such as an application-specific integrated circuit (ASIC). Such software may be distributed on a computer-readable medium, which may include computer storage media (or non-transitory media) and communication media (or transient media). As is known to those skilled in the art, the term computer storage media includes volatile and non-volatile, removable and non-removable media implemented in any method or technology for storing information (such as computer-readable instructions, data structures, program modules, or other data). Computer storage media include, but are not limited to, RAM, ROM, EEPROM, flash memory or other memory technologies, CD-ROM, digital versatile disc (DVD) or other optical disc storage, magnetic cartridges, magnetic tape, disk storage or other magnetic storage devices, or any other medium that can be used to store desired information and can be accessed by a computer. Furthermore, it is well known to those skilled in the art that communication media typically contain computer-readable instructions, data structures, program modules, or other data in modulated data signals such as carrier waves or other transmission mechanisms, and may include any information delivery medium.

[0184] While the embodiments disclosed herein are as described above, the content is merely for the purpose of facilitating understanding of this disclosure and is not intended to limit the invention. Any person skilled in the art may make any modifications and changes to the form and details of the implementation without departing from the spirit and scope of this disclosure; however, the patent protection scope of this invention shall still be determined by the scope defined in the appended claims.

Claims

1. A method for detecting rotating targets, characterized in that it includes: inputting a first image into a rotating target detection model to obtain detection information of a target object, the detection information including the positioning information of the predicted rotation box of the target object; wherein the rotating target detection model is trained using the following method: obtaining a teacher model and a student model, both of which have an angle regression channel in their output layers; discretizing the detection information output by the teacher model and the detection information output by the student model respectively; performing sparsity training and channel pruning on the discretized student model; and using the discretized teacher model to perform localization distillation on the channel-pruned student model; the rotating target detection model is further trained by the following method: obtaining one or more feature maps of the teacher model; determining distillation regions on the feature maps; and distilling the student model using the teacher model for the distillation regions; wherein determining the distillation regions on the feature maps includes: converting both the predicted and ground truth rotation boxes into horizontal boxes; and setting the regions where the intersection-over-union (IoU) of the corresponding output predicted and ground truth rotation boxes is greater than a preset IoU threshold as the distillation regions.

2. The method according to claim 1, characterized in that the positioning information includes: the X-axis coordinate of the center point, the Y-axis coordinate of the center point, the length of the short side, the length of the long side, and the rotation angle.

3. The method according to claim 2, characterized in that the step of using the discretized teacher model to perform localization distillation on the discretized student model includes: obtaining the discretized distribution of each piece of positioning information output by the teacher model and the student model; softening each discretized distribution individually; and using the sum of the KL divergence losses of each discretized distribution output by the teacher model and the student model as the localization distillation loss between the teacher model and the student model.

4. The method according to claim 1, characterized in that the distillation region is the area where the corresponding predicted rotation box matches the true rotation box.

5. The method according to claim 1, characterized in that the detection model includes a detection head, and the acquired one or more feature maps of the teacher model are the input feature maps of the detection head.

6. A rotating target detection device, characterized in that it includes a memory; and a processor connected to the memory, the memory being used to store instructions, the processor being configured to perform the steps of the rotating target detection method as described in any one of claims 1 to 5 based on the instructions stored in the memory.

7. A non-transient computer-readable storage medium, characterized in that it stores a computer program that, when executed by a processor, implements the rotating target detection method as described in any one of claims 1 to 5.

8. A training method for a rotating target detection model, characterized in that it includes: obtaining a teacher model and a student model, wherein the output layer of the teacher model and the output layer of the student model both include an angle regression channel; discretizing the detection information output by the teacher model and the detection information output by the student model respectively; performing sparsity training and channel pruning on the discretized student model; using the discretized teacher model to perform localization distillation on the channel-pruned student model; obtaining one or more feature maps of the teacher model; determining the distillation region on the feature map, wherein determining the distillation region on the feature map includes: converting both the predicted rotation box and the ground truth rotation box into horizontal boxes, and setting the region where the intersection-over-union (IoU) of the corresponding output predicted rotation box and the ground truth rotation box is greater than a preset IoU threshold as the distillation region; and, for the distillation region, using the teacher model to distill the student model.

9. A training device for a rotating target detection model, characterized in that it includes a memory; and a processor connected to the memory, the memory being used to store instructions, the processor being configured to perform the steps of the training method for the rotating target detection model as described in claim 8 based on the instructions stored in the memory.

10. A non-transient computer-readable storage medium, characterized in that it stores a computer program that, when executed by a processor, implements the training method for the rotating target detection model as described in claim 8.