A target detection and tracking method and device based on a pruning model

By pruning the convolutional layers of the neural network model and configuring the pruning rate using information entropy, the problem of insufficient computing resources on edge devices is solved, enabling efficient application of target detection and tracking on edge devices.

CN118261840B · Active · Publication Date: 2026-06-26 · SHENZHEN MICROBT ELECTRONICS TECH CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
SHENZHEN MICROBT ELECTRONICS TECH CO LTD
Filing Date
2022-12-20
Publication Date
2026-06-26

AI Technical Summary

Technical Problem

Target detection and tracking algorithms based on neural network models are difficult to apply on edge devices due to limitations such as computing resources and battery capacity.

Method used

A target detection and tracking method based on a pruning model is adopted. By pruning the convolutional layers in an incompletely trained neural network model, the pruning rate is configured using information entropy, thereby reducing the consumption of computing resources.

Benefits of technology

Without compromising model accuracy, it reduces computational resource consumption, improves model compression efficiency, and facilitates application in edge devices.

Abstract

The application discloses a target detection tracking method based on a pruned neural network model, comprising: acquiring a continuous image to be detected, acquiring a current image frame, a previous image frame and a heat map of a detected target in the previous image frame from the continuous image to be detected, processing the current image frame, the previous image frame and the heat map by using a trained pruned neural network model, obtaining a target center point position in the current image frame, an intra-frame target frame offset position, a target frame size and an inter-frame target frame displacement prediction from an output of the trained pruned neural network model; determining a target detection tracking result based on detection results in each current image frame; and obtaining the pruned neural network model in the following manner: for at least one convolution layer in an incompletely trained neural network model, performing filter pruning on filters in each convolution layer according to a pruning rate configured by information entropy corresponding to each convolution layer. The application greatly reduces the consumption of computing resources.

Description

Technical Field

[0001] This invention relates to the field of target detection and tracking, and in particular, to a target detection and tracking method based on a pruning model.

Background Technology

[0002] With the development of application needs, target detection and tracking technology has been applied to terminal electronic devices, such as home cameras for target detection and tracking of pets.

[0003] Due to limitations in computing resources and battery capacity of edge devices such as terminals, target detection and tracking algorithms based on neural network models are difficult to apply to edge devices.

Summary of the Invention

[0004] This invention provides a target detection and tracking method based on a pruning model to reduce the consumption of computing resources.

[0005] The first aspect of this application provides a target detection and tracking method based on a pruned neural network model, the method comprising:

[0006] Acquire continuous images to be detected.

[0007] From the continuous images to be detected, obtain the current image frame, the previous image frame, and a heatmap of the target being detected in the previous image frame. The heatmap is used to represent the center point location information of the target being detected.

[0008] The trained pruned neural network model is used to process the acquired current image frame, previous image frame, and heatmap.

[0009] The target center point position, intra-frame target box offset position, target box size, and inter-frame target box displacement prediction are obtained from the output of the trained pruned neural network model.

[0010] The target detection and tracking results are determined based on the target center point position, intra-frame target bounding box offset position, target bounding box size, and inter-frame target bounding box displacement prediction in each current image frame.

[0011] wherein,

[0012] The pruned neural network model is obtained in the following manner:

[0013] For at least one convolutional layer in an incompletely trained neural network model, filter pruning is performed on the filters in each convolutional layer according to the pruning rate configured by the information entropy corresponding to each convolutional layer. Here, information entropy is used to characterize the importance of the filters in the convolutional layer.

[0014] Preferably, the pruned neural network model includes:

[0015] The multi-feature attention fusion module is used to extract features from the current image frame, the previous image frame, and the heatmap, and then perform feature fusion and filtering.

[0016] The encoding module is used to extract abstract features from the features output by the multi-feature attention fusion module.

[0017] The decoding module is used to process abstract features to output the target center point position, intra-frame target box offset position, target box size, and inter-frame target box displacement prediction.

[0018] Preferably, the multi-feature attention fusion module includes:

[0019] The first feature attention fusion submodule extracts features from the current image frame, obtains the weight ratio of each convolutional layer through attention processing, filters features based on the weight ratio, and provides them to the second feature attention fusion submodule for feature fusion. It then performs convolution processing on the weight ratio of each convolutional layer and outputs the results.

[0020] The second feature attention fusion submodule extracts features from the previous image frame, obtains the weight ratio of each convolutional layer through attention processing, filters features based on the weight ratio, and provides them to the third feature attention fusion submodule for feature fusion. It then performs convolution processing on the weight ratio of each convolutional layer and outputs the results.

[0021] The third feature attention fusion submodule is used to extract heatmap features. It obtains the weight ratio of each convolutional layer through attention processing, performs convolution processing on the weight ratio of each convolutional layer, and outputs the results.

[0022] The splicing submodule is used to splice the output features from the first feature attention fusion submodule, the second feature attention fusion submodule, and the third feature attention fusion submodule, respectively.

[0023] Preferably, the first feature attention fusion submodule includes:

[0024] The first convolutional submodule is used to extract features from the current image frame.

[0025] The first attention submodule is used to perform attention processing on the extracted features of the current image frame to obtain the weights of each convolutional layer.

[0026] The first product submodule is used to multiply the extracted features of the current image frame by the weights of each convolutional layer output by the first attention submodule to obtain the weight proportion of each convolutional layer.

[0027] The fourth convolution submodule performs convolution processing on the weight ratio of each convolutional layer output from the first product submodule, and filters out features based on the weight ratio of each convolutional layer before inputting them into the fifth convolution submodule.

[0028] The second feature attention fusion submodule includes:

[0029] The second convolutional submodule is used to extract features from the previous image frame.

[0030] The second attention submodule is used to perform attention processing on the extracted features from the previous image frame to obtain the weights for each convolutional layer.

[0031] The second product submodule is used to multiply the extracted features of the previous image frame by the weights of each convolutional layer output by the second attention submodule to obtain the weight proportion of each convolutional layer.

[0032] The fifth convolutional submodule performs convolution processing on the weight proportions of each convolutional layer output from the second product submodule and the filtered features output from the fourth convolutional submodule. It then filters out features based on the weight proportions of each convolutional layer and inputs them into the sixth convolutional submodule.

[0033] The third feature attention fusion submodule includes:

[0034] The third convolutional submodule is used to extract heatmap features.

[0035] The third attention submodule is used to perform attention processing on the extracted heatmap features to obtain the weights for each convolutional layer.

[0036] The third product submodule is used to multiply the extracted heatmap features by the weights of each convolutional layer output by the third attention submodule to obtain the weight proportion of each convolutional layer.

[0037] The sixth convolutional submodule is used to perform convolution processing on the weight ratio of each convolutional layer output by the third product submodule and the filtering features output by the fifth convolutional submodule.

[0038] Preferably, the pruning rate configured according to the information entropy of each convolutional layer in the incompletely trained neural network model includes:

[0039] The initial neural network model is trained in only one round.

[0040] For any convolutional layer in a neural network model after one round of training:

[0041] The input and output matrices of the convolutional layer are decomposed into a low-rank matrix to obtain a decomposed feature matrix composed of the decomposed feature values.

[0042] The eigenvalues in the decomposed feature matrix are normalized so that they are constrained to the range from 0 to 1 inclusive, resulting in normalized eigenvalues.

[0043] Determine the weights for each normalized feature value; these weights represent the probabilities of all projections of the convolutional layer into the low-rank space.

[0044] By using the weights of each normalized feature value, the information entropy of the convolutional layer can be obtained.

[0045] The pruning rate of the convolutional layer is determined based on the information entropy. The larger the information entropy, the smaller the pruning rate, and vice versa.

[0046] The filter pruning in the convolutional layer includes:

[0047] Based on the neural network model trained in only one round, filter pruning is performed according to the pruning rate of each convolutional layer;

[0048] The trained pruned neural network model is obtained as follows:

[0049] The pruned neural network model is fully trained using the training sample set to obtain the trained pruned neural network model.
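The paragraphs above fix how many filters each convolutional layer loses, but this passage does not say which filters are removed. The following is a minimal sketch of applying a per-layer pruning rate, assuming filters are ranked by the L1 norm of their weights. That ranking is a common heuristic swapped in here for illustration, not the criterion stated in this application; `prune_filters` and the flattened-weight representation are hypothetical.

```python
def prune_filters(filters, pruning_rate):
    """Remove the lowest-scoring filters of one convolutional layer.

    filters: list of filters, each given as a flat list of weights.
    pruning_rate: fraction of filters to remove, between 0 and 1.
    Returns the kept filters and their original indices.
    """
    n_prune = int(len(filters) * pruning_rate)
    # ASSUMPTION: importance of a filter = L1 norm of its weights.
    scores = [sum(abs(w) for w in f) for f in filters]
    # Indices sorted from least to most important.
    order = sorted(range(len(filters)), key=lambda i: scores[i])
    removed = set(order[:n_prune])
    kept = [i for i in range(len(filters)) if i not in removed]
    return [filters[i] for i in kept], kept


layer = [[0.1, -0.1], [1.0, 2.0], [0.0, 0.05], [-3.0, 0.5]]
kept_filters, kept_indices = prune_filters(layer, pruning_rate=0.5)
print(kept_indices)  # [1, 3]
```

After pruning, the smaller model would then be fully trained on the training sample set, as paragraph [0049] describes.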

[0050] Preferably, the initial neural network model is trained in only one round, including:

[0051] Select the current batch of sample data from the training sample set and input it into the current neural network model to obtain the prediction results for the current batch of sample data.

[0052] Based on the loss function value between the predicted and actual results, adjust the model parameters of the current neural network model.

[0053] Repeat this process until all sample data in the training sample set has been selected to complete one round of training.
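The single round of training in paragraphs [0051] to [0053] can be sketched as one pass over the training set in batches. This is a minimal illustration only: the one-parameter linear model, the mean squared error loss, the batch size, and the learning rate are all assumptions chosen to keep the example short, and stand in for the detection network and its loss.

```python
def train_one_round(samples, w, b, batch_size=2, lr=0.01):
    """Run exactly one pass over `samples`, a list of (x, y) pairs."""
    for start in range(0, len(samples), batch_size):
        # Select the current batch of sample data.
        batch = samples[start:start + batch_size]
        # Prediction results for the current batch.
        preds = [w * x + b for x, _ in batch]
        # Gradients of the mean squared error with respect to w and b.
        grad_w = sum(2 * (p - y) * x for p, (x, y) in zip(preds, batch)) / len(batch)
        grad_b = sum(2 * (p - y) for p, (_, y) in zip(preds, batch)) / len(batch)
        # Adjust the model parameters based on the loss.
        w -= lr * grad_w
        b -= lr * grad_b
    # Every sample has been selected once: one round is complete.
    return w, b


def mse(samples, w, b):
    return sum((w * x + b - y) ** 2 for x, y in samples) / len(samples)


samples = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
w1, b1 = train_one_round(samples, w=0.0, b=0.0)
```

After this single pass the loss is lower than at initialization but the model is not converged, which is the "incompletely trained" state in which the pruning rates are configured.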

[0054] Determining the pruning rate of the convolutional layer based on information entropy includes:

[0055] Calculate the proportion of the information entropy of this convolutional layer to the information entropy of all convolutional layers.

[0056] The difference between the value 1 and this ratio is used to obtain the quota coefficient for configuring the pruning rate of this convolutional layer.

[0057] The pruning rate of convolutional layer l is obtained by multiplying the quota coefficient by the total pruning rate of the network model.
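The allocation rule in paragraphs [0055] to [0057] can be sketched as follows. The function name `layer_pruning_rates` is hypothetical, and "the information entropy of all convolutional layers" is read here as the sum of the per-layer entropies, which is an assumption about the wording.

```python
def layer_pruning_rates(entropies, total_rate):
    """Return one pruning rate per convolutional layer.

    entropies: information entropy of each convolutional layer.
    total_rate: total pruning rate of the network model.
    """
    total_entropy = sum(entropies)  # ASSUMPTION: "all layers" means the sum
    rates = []
    for h in entropies:
        share = h / total_entropy        # proportion of this layer's entropy
        quota = 1.0 - share              # quota coefficient
        rates.append(quota * total_rate)
    return rates


rates = layer_pruning_rates([1.0, 2.0, 1.0], total_rate=0.5)
print(rates)  # [0.375, 0.25, 0.375]
```

The layer with the largest entropy receives the smallest pruning rate, consistent with paragraph [0045].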

[0058] Preferably, the loss function value is the sum of the target center point position loss function value, the intra-frame target box offset position loss function value, the target box size loss function value, and the inter-frame target box displacement prediction loss function value;

[0059] wherein,

[0060] The target center point position loss function value is calculated using the focal loss function.

[0061] The intra-frame target box offset position loss function value is calculated using the first mean absolute error loss function.

[0062] The target bounding box size loss function value is calculated using the second mean absolute error loss function.

[0063] The inter-frame target box displacement prediction loss function value is calculated using the third mean absolute error loss function.
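The composite loss in paragraphs [0058] to [0063] can be sketched as a plain sum of four terms. This is an illustration under assumptions: the focal loss is written in its basic binary form with focusing parameter `gamma = 2`, which this passage does not specify, and all function and argument names are hypothetical.

```python
import math


def focal_loss(probs, labels, gamma=2.0, eps=1e-12):
    """Basic binary focal loss averaged over heatmap positions.

    ASSUMPTION: the exact focal-loss variant and gamma are not given
    in the text; this is the plain -(1 - p_t)^gamma * log(p_t) form.
    """
    total = 0.0
    for p, y in zip(probs, labels):
        p_t = p if y == 1 else 1.0 - p
        total += -((1.0 - p_t) ** gamma) * math.log(max(p_t, eps))
    return total / len(probs)


def mean_absolute_error(pred, target):
    return sum(abs(a - b) for a, b in zip(pred, target)) / len(pred)


def total_loss(center_probs, center_labels,
               offset, offset_gt, size, size_gt, disp, disp_gt):
    """Sum of the center, offset, size, and displacement losses."""
    return (focal_loss(center_probs, center_labels)
            + mean_absolute_error(offset, offset_gt)
            + mean_absolute_error(size, size_gt)
            + mean_absolute_error(disp, disp_gt))


loss = total_loss([0.9, 0.2], [1, 0],
                  [0.1, 0.3], [0.0, 0.5],
                  [10.0, 20.0], [12.0, 20.0],
                  [1.0, -1.0], [1.0, 0.0])
```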

[0064] Preferably, the normalization process for the decomposed eigenvalues in the decomposed feature matrix includes:

[0065] For any decomposed eigenvalue in the decomposed feature matrix:

[0066] Calculate the first difference between the decomposed eigenvalue and the smallest decomposed eigenvalue in the decomposed feature matrix.

[0067] Calculate the second difference between the largest and smallest decomposed eigenvalues in the decomposed feature matrix.

[0068] Calculate the ratio between the first difference and the second difference to obtain the normalized result of the decomposed eigenvalue.
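The normalization in paragraphs [0065] to [0068] is a min-max scaling. A minimal sketch follows; the function name is hypothetical, and the handling of the case where all eigenvalues are equal is an assumption, since the text does not cover it.

```python
def min_max_normalize(values):
    """Scale decomposed eigenvalues into the range [0, 1]."""
    lo, hi = min(values), max(values)
    span = hi - lo  # second difference: largest minus smallest
    if span == 0:
        # ASSUMPTION: degenerate case not described in the text.
        return [0.0 for _ in values]
    # First difference (value minus smallest) divided by the second.
    return [(v - lo) / span for v in values]


normalized = min_max_normalize([2.0, 4.0, 6.0])
print(normalized)  # [0.0, 0.5, 1.0]
```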

[0069] Preferably, determining the weights of each normalized eigenvalue includes:

[0070] For any normalized eigenvalue:

[0071] The weight of the exponential function value of the normalized eigenvalue is obtained by calculating the ratio between the exponential function value of the normalized eigenvalue and the sum of the exponential function values of all normalized eigenvalues in the decomposed feature matrix.
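The ratio described in paragraph [0071] has the form of a softmax over the normalized eigenvalues. A minimal sketch, with a hypothetical function name:

```python
import math


def exponential_weights(normalized):
    """Weight of each normalized eigenvalue: exp(v) / sum of exp(v)."""
    exps = [math.exp(v) for v in normalized]
    total = sum(exps)
    return [e / total for e in exps]


weights = exponential_weights([0.0, 0.5, 1.0])
```

The weights are positive and sum to 1, so they can be treated as a probability distribution when the information entropy is computed.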

[0072] The step of obtaining the information entropy of the convolutional layer using the weights of each normalized feature value includes:

[0073] For the weight of the exponential function value of any normalized eigenvalue of this convolutional layer:

[0074] Calculate the self-information of that weight.

[0075] Taking the weights of the exponential function values of all normalized eigenvalues of the convolutional layer as the probability distribution of a random variable, calculate the average value (expectation) of the self-information;

[0076] The result is the information entropy of the convolutional layer.
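The entropy computation in paragraphs [0073] to [0076] can be sketched as the expected self-information of the weights. The natural logarithm is an assumption, since the text does not state the base, and the function name is hypothetical.

```python
import math


def information_entropy(weights):
    """Expected self-information of a set of weights that sum to 1.

    Self-information of a weight p is -log(p); the entropy is its
    average under the distribution given by the weights themselves.
    ASSUMPTION: natural logarithm (the base is not given in the text).
    """
    return sum(-p * math.log(p) for p in weights if p > 0)


uniform = information_entropy([0.25, 0.25, 0.25, 0.25])
peaked = information_entropy([0.97, 0.01, 0.01, 0.01])
```

Evenly spread weights give a higher entropy than weights concentrated on a single value, so `uniform` exceeds `peaked`. Under paragraph [0045], the layer with the higher entropy would receive the smaller pruning rate.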

[0077] A second aspect of this application provides a target detection and tracking device based on a pruned neural network model, the device comprising:

[0078] The image acquisition module is used to acquire continuous images to be detected, and from the continuous images to be detected, to acquire the current image frame, the previous image frame, and a heatmap of the target to be detected in the previous image frame, wherein the heatmap is used to characterize the center point location information of the target to be detected.

[0079] The neural network model module is used to process the acquired current image frame, previous image frame, and heatmap using a trained pruned neural network model. It obtains the target center point position, intra-frame target box offset position, target box size, and inter-frame target box displacement prediction from the output of the trained pruned neural network model.

[0080] The association module is used to determine the target detection and tracking results based on the target center point position in each current image frame, the target box offset position within the frame, the target box size, and the target box displacement prediction between frames;

[0081] wherein,

[0082] The pruned neural network model is obtained in the following manner:

[0083] For at least one convolutional layer in an incompletely trained neural network model, filter pruning is performed on the filters in each convolutional layer according to the pruning rate configured by the information entropy corresponding to each convolutional layer. Here, information entropy is used to characterize the importance of the filters in the convolutional layer.

[0084] The target detection and tracking method based on the pruning model provided in this application prunes at least one convolutional layer in an incompletely trained neural network model using a filter pruning rate configured according to the information entropy. This improves the efficiency of model compression without affecting model accuracy, thereby reducing the computational resource consumption of the target detection and tracking method and making it easier to apply in edge devices.

Description of the Drawings

[0085] Figure 1 is a flowchart illustrating a target detection and tracking method based on a pruned neural network model according to an embodiment of this application.

[0086] Figure 2 is a schematic diagram of a neural network model used for target detection and tracking in an embodiment of this application.

[0087] Figure 3 is a schematic diagram of a multi-feature attention fusion module according to an embodiment of this application.

[0088] Figure 4 is a schematic diagram illustrating the acquisition of a trained pruned neural network model according to an embodiment of this application.

[0089] Figure 5 is a schematic diagram of a process for determining the pruning rate in an embodiment of this application.

[0090] Figure 6 is a schematic diagram of a target detection and tracking device based on a pruned neural network model according to an embodiment of this application.

[0091] Figure 7 is another schematic diagram of a target detection and tracking device based on a pruned neural network model according to an embodiment of this application.

Detailed Description

[0092] To make the objectives, technical means, and advantages of this application clearer, the following detailed description is provided in conjunction with the accompanying drawings.

[0093] Most target detection and tracking methods based on neural network models follow the idea of tracking-by-detection, that is, first detect the information in the input image, and then realize tracking based on the detected information and a series of post-processing steps. This makes the tracking results highly dependent on the performance of the detector, requiring a powerful detector and complex post-processing. These have become two major bottlenecks hindering the application of target detection and tracking methods based on neural network models in edge devices.

[0094] The principle of tracking-by-detection is that the detector first finds the bounding boxes (target boxes) of all objects in each frame. Tracking then becomes a problem of bounding box association. A filter, such as a Kalman filter, is used to predict the target boxes, and the Hungarian matching algorithm is used to associate each object's target box with the target boxes in the previous frame. This method requires a large amount of computation, which consumes significant resources and places a heavy burden on edge devices.

[0095] In neural network technology, model compression is a technique for deploying state-of-the-art network models on low-power and resource-constrained edge devices without significantly impacting model accuracy. Model compression methods include pruning, quantization, and knowledge distillation. Pruning is a commonly used technique for compressing neural networks and can be divided into two types: weight pruning and filter pruning. The former involves removing individual weights or neurons; however, this is difficult to accelerate in hardware. The latter removes entire channels or filters, which can more easily achieve considerable speedups.

[0096] This application optimizes the two bottlenecks of high-performance detectors and complex post-processing by using filter pruning to trim the network structure and integrate post-processing into the network structure to achieve real-time tracking.

[0097] Figure 1 is a flowchart illustrating a target detection and tracking method based on a pruned neural network model according to an embodiment of this application. As shown in Figure 1, the method includes:

[0098] Step 101: Obtain continuous images to be detected.

[0099] Step 102: From the continuous images to be detected, obtain the current image frame, the previous image frame, and a heatmap of the target being detected in the previous image frame, wherein the heatmap is used to characterize the center point location information of the target being detected.

[0100] It should be understood that the previous image frame may be adjacent to the current image frame or not. For example, the number of image frames between the current image frame and the previous image frame is less than a set threshold. The heatmap can be obtained by target detection based on the previous image frame. The heatmap may include multiple frames. As an example, each category of detected target has its own heatmap.

[0101] This step acquires the previous and current image frames so that, during target detection, the displacement of each target between frames can be predicted and targets can be associated across frames, which reduces tracking post-processing and the workload on the device.

[0102] Step 103: Using the trained pruned neural network model, process the acquired current image frame, previous image frame, and heatmap.

[0103] The target center point position, intra-frame target box offset position, target box size, and inter-frame target box displacement prediction are obtained from the output of the trained pruned neural network model.

[0104] As an example, pruned neural network models include:

[0105] The multi-feature attention fusion module is used to extract features from the current image frame, the previous image frame, and the heatmap, and then perform fusion and filtering.

[0106] The encoding module is used to extract abstract features from the features fused by the multi-feature attention fusion module.

[0107] The decoding module is used to process abstract features to output the target center point position, intra-frame target box offset position, target box size, and inter-frame target box displacement prediction.

[0108] The pruned neural network model is obtained as follows:

[0109] For at least one convolutional layer in an incompletely trained neural network model, filter pruning is performed on the filters in the convolutional layer according to the pruning rate configured by the information entropy corresponding to the convolutional layer, where information entropy is used to characterize the importance of the filters in the convolutional layer.

[0110] Incompletely trained neural network models, as opposed to fully trained neural network models, refer to situations where the prediction results of the neural network model do not meet expectations. This includes, but is not limited to, situations where, after a limited number of training iterations, the loss function value between the predicted results and the actual results does not reach the loss threshold under fully trained conditions.

[0111] A fully trained neural network model usually refers to a trained neural network model, which means that the prediction results of the neural network model meet the expectations, including but not limited to: after a certain number of training rounds, the loss function value between the prediction results and the actual results reaches a set loss threshold.

[0112] Step 104: Determine the target detection and tracking result based on the target center point position, intra-frame target box offset position, target box size, and inter-frame target box displacement prediction in each current image frame.

[0113] As an example, the target detection and tracking results include target location information and its associated target information, such as the target's corresponding number.
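How the inter-frame displacement prediction can yield numbered targets is not spelled out in this passage. The following is a minimal sketch of one possible association step, assuming a greedy nearest-center match. The function name, the distance threshold `max_dist`, the data layout, and the convention that the displacement points from the current center back to the previous-frame center are all assumptions for illustration.

```python
import math


def associate(detections, prev_tracks, next_id, max_dist=50.0):
    """Assign a track number to each detection in the current frame.

    detections: list of dicts with 'center' (x, y) and 'displacement' (dx, dy).
    prev_tracks: dict mapping track number -> (x, y) center in the previous frame.
    next_id: first unused track number.
    """
    unmatched = dict(prev_tracks)
    results = []
    for det in detections:
        cx, cy = det["center"]
        dx, dy = det["displacement"]
        # ASSUMPTION: displacement maps the current center to the previous frame.
        px, py = cx + dx, cy + dy
        best_id, best_dist = None, max_dist
        for tid, (tx, ty) in unmatched.items():
            d = math.hypot(px - tx, py - ty)
            if d < best_dist:
                best_id, best_dist = tid, d
        if best_id is None:
            # No previous target close enough: start a new track.
            best_id = next_id
            next_id += 1
        else:
            del unmatched[best_id]
        results.append((best_id, (cx, cy)))
    return results, next_id


prev_tracks = {1: (10.0, 10.0), 2: (100.0, 100.0)}
detections = [
    {"center": (14.0, 10.0), "displacement": (-4.0, 0.0)},
    {"center": (300.0, 300.0), "displacement": (0.0, 0.0)},
]
results, next_id = associate(detections, prev_tracks, next_id=3)
print(results)  # [(1, (14.0, 10.0)), (3, (300.0, 300.0))]
```

Because the displacement comes directly from the network output, no separate motion filter is needed in this sketch.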

[0114] This application embodiment is based on an incompletely trained neural network model. By configuring the pruning rate of the filters in the convolutional layer through information entropy, it can reduce the number of training rounds, which is beneficial to reducing the consumption of computing resources. It can also make pruning unaffected by model training, thereby improving the deployment efficiency of the neural network model. This is conducive to realizing target detection and tracking based on the neural network model, and improves the intelligence of target detection and tracking.

[0115] To facilitate understanding of this application, the following description uses a pet as an example to illustrate the target detection and tracking method. It should be understood that this application is not limited to the detection and tracking of pets, and can be applied to the detection and tracking of any other target, including but not limited to vehicles, mobile robots, animals, moving objects, and any other moving target.

[0116] Figure 2 is a schematic diagram of a neural network model used for target detection and tracking according to an embodiment of this application. As shown in Figure 2, the neural network model includes:

[0117] The multi-feature attention fusion module is used to extract features from the current image frame, the previous image frame, and the heatmap, and then perform fusion and filtering. The heatmap includes feature maps corresponding to each category of detected targets in the previous image frame, and each feature map can represent the distribution of the center point position of the detected target of that category.

[0118] The encoding module is used to extract deeper, more abstract features from the fused and filtered features. At this point, the features are downsampled for output. For example, if the input image frame has a width of W and a height of H, after a 32x downsampling, the output is W / 32, H / 32.

[0119] The decoding module is used to process the abstract features to obtain the target center point position, target box offset, target box size, and inter-frame target displacement prediction of the current image frame. The peak point of the heatmap is the target center point position.
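Reading center points from heatmap peaks can be sketched as a local-maximum search. This is an illustration under assumptions: the 3x3 neighborhood, the score threshold, and the function name are hypothetical choices that this passage does not specify.

```python
def heatmap_peaks(heatmap, threshold=0.5):
    """Return (x, y, score) for each local maximum at or above `threshold`.

    heatmap: 2D list of scores for one target category.
    ASSUMPTION: a peak is a value not exceeded within its 3x3 neighborhood.
    """
    h, w = len(heatmap), len(heatmap[0])
    peaks = []
    for y in range(h):
        for x in range(w):
            v = heatmap[y][x]
            if v < threshold:
                continue
            # Values in the 3x3 window around (x, y), clipped at the border.
            neighbors = [heatmap[j][i]
                         for j in range(max(0, y - 1), min(h, y + 2))
                         for i in range(max(0, x - 1), min(w, x + 2))]
            if v >= max(neighbors):
                peaks.append((x, y, v))
    return peaks


heatmap = [
    [0.1, 0.2, 0.1, 0.0],
    [0.2, 0.9, 0.2, 0.0],
    [0.1, 0.2, 0.1, 0.3],
    [0.0, 0.0, 0.3, 0.8],
]
print(heatmap_peaks(heatmap))  # [(1, 1, 0.9), (3, 3, 0.8)]
```

Each returned position would then be refined by the intra-frame offset and paired with the predicted box size to form a target box.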

[0120] The encoding / decoding module is the core module for feature extraction in the network. As an example, a deep layer aggregation network is used as the encoding module for feature extraction. The deep layer aggregation network consists of multiple stages, each stage consists of multiple blocks, and each block contains multiple layers. The network structure includes multiple different types of blocks, and iterative deep aggregation (IDA) and hierarchical deep aggregation (HDA) are used to better aggregate information from different layers. IDA is used to fuse information from different stages, and HDA is used to fuse information from different blocks, thereby obtaining the output of the encoding module.

[0121] The decoding module uses two deconvolutions to increase the dimensionality of the abstract features output by the encoding module. The increased dimensionality features are then processed by different convolutional sub-modules to obtain different outputs. Specifically, the increased dimensionality features are processed by a convolutional sub-module for obtaining the target center point position, a convolutional sub-module for obtaining the target box offset, a convolutional sub-module for obtaining the target box size, and a convolutional sub-module for predicting the target displacement to obtain the inter-frame target displacement prediction.

[0122] Figure 3 is a schematic diagram of a multi-feature attention fusion module according to an embodiment of this application.

[0123] As an example, a multi-feature attention fusion module includes:

[0124] The first feature attention fusion submodule extracts features from the current image frame, obtains the weight ratio of each convolutional layer through attention processing, filters features based on the weight ratio, and provides them to the second feature attention fusion submodule for feature fusion. It then performs convolution processing on the weight ratio of each convolutional layer and outputs the results.

[0125] The second feature attention fusion submodule extracts features from the previous image frame, obtains the weight ratio of each convolutional layer through attention processing, filters features based on the weight ratio, and provides them to the third feature attention fusion submodule for feature fusion. It then performs convolution processing on the weight ratio of each convolutional layer and outputs the results.

[0126] The third feature attention fusion submodule is used to extract heatmap features. It obtains the weight ratio of each convolutional layer through attention processing, performs convolution processing on the weight ratio of each convolutional layer, and outputs the results.

[0127] The splicing submodule is used to splice the output features from the first feature attention fusion submodule, the second feature attention fusion submodule, and the third feature attention fusion submodule, respectively.

[0128] As an example,

[0129] The first feature attention fusion submodule includes: a first convolution submodule, used to extract features of the current image frame.

[0130] The second feature attention fusion submodule includes: a second convolution submodule, used to extract features from the previous image frame.

[0131] The third feature attention fusion submodule includes: a third convolution submodule, used to extract heatmap features.

[0132] Since simply adding the features from the first, second, and third convolutional submodules does not effectively integrate different input types and their corresponding weights, the outputs from each submodule are passed in two directions. Specifically, the first output of the first convolutional submodule is processed by the first attention submodule and multiplied by its second output, then input to the fourth convolutional submodule. Similarly, the first output of the second convolutional submodule is processed by the second attention submodule and multiplied by its second output, then input to the fifth convolutional submodule. Finally, the first output of the third convolutional submodule is processed by the third attention submodule and multiplied by its second output, then input to the sixth convolutional submodule.

[0133] The attention submodule is a GCAC module, which consists of a series of operations: G represents global average pooling, C represents depthwise separable convolution, and A represents activation, for which the sigmoid function is used. The first output is processed by the attention submodule to obtain the weights of each convolutional layer. Each weight is multiplied by the convolution result that has not undergone attention processing (that is, the second output, the extracted features) to obtain the weight proportion of each convolutional layer.
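The attention-then-multiply path can be sketched as a channel gate. This is a simplified illustration, not the GCAC module itself: it covers only global average pooling, one depthwise separable convolution applied to the pooled 1x1 map (where the depthwise part reduces to a per-channel scale and the pointwise part to a channel-mixing matrix), and a sigmoid. The exact layer order, the second convolution, and all names and parameters here are assumptions.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def channel_attention(features, dw_scale, pw_matrix):
    """Gate each channel of `features` by a learned weight in (0, 1).

    features: list of C channels, each an H x W nested list.
    dw_scale: C per-channel scales (depthwise conv on a 1x1 map).
    pw_matrix: C x C mixing matrix (pointwise conv on a 1x1 map).
    """
    # G: global average pooling, one value per channel.
    pooled = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0]))
              for ch in features]
    # C: depthwise step, then pointwise step.
    depthwise = [s * p for s, p in zip(dw_scale, pooled)]
    pointwise = [sum(m * d for m, d in zip(row, depthwise))
                 for row in pw_matrix]
    # A: sigmoid activation gives the attention weights.
    weights = [sigmoid(z) for z in pointwise]
    # Product step: weights times the features that bypassed attention.
    gated = [[[w * v for v in row] for row in ch]
             for w, ch in zip(weights, features)]
    return weights, gated


features = [[[1.0, 1.0], [1.0, 1.0]], [[3.0, 3.0], [3.0, 3.0]]]
weights, gated = channel_attention(
    features, dw_scale=[1.0, 1.0], pw_matrix=[[1.0, 0.0], [0.0, 1.0]])
```

In this sketch `features` plays the role of both outputs of a convolution submodule: one copy is pooled to produce the weights, and the other is multiplied by them.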

[0134] Furthermore, the features selected by the fourth convolutional submodule are input into the fifth convolutional submodule, and the features selected by the fifth convolutional submodule are input into the sixth convolutional submodule for fusion.

[0135] The outputs of the fourth, fifth, and sixth convolutional submodules are all input to the splicing submodule to splice the features from the fourth, fifth, and sixth convolutional submodules.

[0136] Thus,

[0137] The first feature attention fusion submodule includes:

[0138] The first attention submodule is used to perform attention processing on the extracted features of the current image frame to obtain the weights of each convolutional layer.

[0139] The first product submodule is used to multiply the extracted features of the current image frame with the weight proportion of each convolutional layer output by the first attention submodule to obtain the weight proportion of each convolutional layer.

[0140] The second feature attention fusion submodule includes:

[0141] The fourth convolution submodule performs convolution processing on the weight ratio of each convolutional layer output by the first product submodule, and filters out features based on the weight ratio of each convolutional layer before inputting them into the fifth convolution submodule.

[0142] The second attention submodule is used to perform attention processing on the extracted features from the previous image frame to obtain the weights for each convolutional layer.

[0143] The second product submodule is used to multiply the extracted features of the previous image frame with the weight proportion of each convolutional layer output by the second attention submodule to obtain the weight proportion of each convolutional layer.

[0144] The fifth convolutional submodule performs convolution processing on the weight proportions of each convolutional layer output from the second product submodule and the filtered features output from the fourth convolutional submodule. It then filters out features based on the weight proportions of each convolutional layer and inputs them into the sixth convolutional submodule.

[0145] The third feature attention fusion submodule includes:

[0146] The third attention submodule is used to perform attention processing on the extracted heatmap features to obtain the weights for each convolutional layer.

[0147] The third product submodule is used to multiply the extracted heatmap features by the weight percentage of each convolutional layer output by the third attention submodule to obtain the weight percentage of each convolutional layer.

[0148] The sixth convolutional submodule is used to perform convolution processing on the weight ratio of each convolutional layer output by the third product submodule and the filtering features output by the fifth convolutional submodule.

[0149] Refer to Figure 4, which is a schematic diagram illustrating how the trained pruned neural network model is obtained according to an embodiment of this application. To obtain the pruned neural network model, this embodiment collects pet images as a training sample set. Samples of the current batch are selected from the training sample set and input into the current neural network model to train it. This process continues until all samples in the training sample set have been selected, which completes one round of training. The initial batch of samples is input into an untrained initial neural network model. The data in the training sample set is labeled with the target's location information and a unique target ID.

[0150] As an example, m sample images are selected from the training sample set as the current batch samples and input into the current neural network model. The loss function value is determined based on the prediction result and the actual result output by the neural network model. The parameters of the current neural network model are adjusted according to the loss function value. Then, the next m sample images are selected as the current batch samples. This process is repeated until the training sample set is completely selected or the loss function value reaches the set threshold, thus completing one round of training.
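The batch procedure above can be sketched as follows. The toy model `y = w * x`, the squared-error loss, the learning rate `lr`, and the function names are all assumptions chosen to keep the example self-contained; the application itself trains a detection network with the loss described later.

```python
def iter_batches(samples, m):
    """Yield consecutive batches of m samples (the last may be smaller)."""
    for start in range(0, len(samples), m):
        yield samples[start:start + m]

def train_one_round(samples, m, w=0.0, lr=0.1, loss_threshold=1e-6):
    """One round over the sample set for a toy model y = w * x.
    Stops early if the batch loss reaches loss_threshold."""
    batches_used = 0
    for batch in iter_batches(samples, m):
        # Loss between the predictions and the ground truth for this batch.
        loss = sum((w * x - y) ** 2 for x, y in batch) / len(batch)
        batches_used += 1
        if loss <= loss_threshold:
            break
        # Adjust the parameter using the gradient of the batch loss.
        grad = sum(2 * (w * x - y) * x for x, y in batch) / len(batch)
        w -= lr * grad
    return w, batches_used
```

With six samples and `m = 2`, the loop performs three parameter updates and then stops, which corresponds to one completed round.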

[0151] After one round of training, training is paused. Based on the neural network model obtained from that round, the model structure is pruned according to the set pruning rate and unimportant convolution kernels are removed; the result is referred to as the pruned neural network model for ease of description. The pruned neural network model is then trained for multiple rounds using the training sample set to obtain the trained pruned neural network model.

[0152] This application embodiment reduces the resource consumption of the neural network model by pruning the neural network model after only one round of training. Compared with existing model pruning methods that require multiple rounds of full training before pruning, this avoids the defect that the pruning results obtained by models with different training levels may have large differences. This application embodiment analyzes the importance of each convolutional layer filter by information entropy, without relying on the filter weight information, thus avoiding the influence of whether the model training has converged.

[0153] The following explains the values of the loss function.

[0154] As an example, the loss function value is the sum of the target center point position loss function value, the target offset position loss function value, the target box size loss function value, and the inter-frame target displacement prediction loss function value.

[0155] The target center point location loss function is calculated using the focal loss function to address the imbalance between positive and negative samples during training and reduce the weight of a large number of simple negative samples. The mathematical expression of the focal loss function is as follows:

[0156] $$L_{heat} = -\frac{1}{N} \sum_{i=1}^{N} \begin{cases} \left(1 - \hat{Y}_{xyc}\right)^{\alpha} \log\left(\hat{Y}_{xyc}\right), & Y_{xyc} = 1 \\ \left(1 - Y_{xyc}\right)^{\beta} \left(\hat{Y}_{xyc}\right)^{\alpha} \log\left(1 - \hat{Y}_{xyc}\right), & \text{otherwise} \end{cases}$$

[0157] where $L_{heat}$ is the target center point position loss function value; $\alpha$ and $\beta$ are modulation coefficients (for example, 2 and 4 respectively); $N$ is the number of pixels i in the heatmap of a given category of detected target; $Y_{xyc}$ is the true category of pixel i; $\hat{Y}_{xyc}$ is the probability value of the predicted category of pixel i, which is output by the neural network model; and $\sum$ denotes accumulation from pixel i = 1 to i = N.

[0158] For example, if pixel i belongs to a pet, then $Y_{xyc} = 1$ for pixel i; if it does not belong to a pet, then $Y_{xyc} = 0$ for pixel i.

[0159] If the category result includes multiple categories, the loss function values of the target center point positions corresponding to each category are summed.

[0160] The target offset position loss function value is calculated using the first mean absolute error loss function (L1Loss) to compensate for the positional offset of the center point of the neural network output when mapped back to the current image frame of the input. The first L1Loss loss function is expressed mathematically as follows:

[0161] $$L_{off} = \frac{1}{N} \sum_{i=1}^{N} \left| \hat{O}_{\tilde{p}_i} - \left( \frac{p_i}{R} - \tilde{p}_i \right) \right|$$

[0162] where $L_{off}$ is the target offset position loss function value; $R$ is the downsampling factor (for example, R = 4); $\hat{O}_{\tilde{p}_i}$ is the predicted offset position information of the predicted target box containing pixel i; $N$ is the number of pixels i in the heatmap of a given category of detected target; $p_i$ is the center point position information of the ground-truth target box containing pixel i; $\tilde{p}_i$ is the integer value obtained by rounding $p_i / R$; $\frac{p_i}{R} - \tilde{p}_i$ is the positional deviation of pixel i caused by downsampling and rounding; and $\sum$ denotes accumulation from pixel i = 1 to i = N.

[0163] The target bounding box size loss function value mainly relates to the size of the target. It is calculated using the second mean absolute error loss function, which is mathematically expressed as:

[0164] $$L_{size} = \frac{1}{N} \sum_{k=1}^{N} \left| \hat{s}_k - s_k \right|$$

[0165] where $L_{size}$ is the target box size loss function value; $N$ is the number of pixels k in the output feature map; $s_k$ is the size of the ground-truth target box containing pixel k; and $\hat{s}_k$ is the size of the predicted target box containing pixel k.

[0166] To establish the relationship between two consecutive frames, two-dimensional predicted displacement information is added to describe the offset of each target's position in the current frame relative to its position in the previous image frame. This offset includes components in both the X and Y directions, and its loss is calculated using a third mean absolute error loss function, which is expressed mathematically as follows:

[0167] $$L_{pre} = \frac{1}{V} \sum_{i=1}^{V} \left| \hat{D}_{p_i^{(t)}} - \left( p_i^{(t-1)} - p_i^{(t)} \right) \right|$$

[0168] where $L_{pre}$ is the inter-frame target displacement prediction loss function value; $V$ is the number of targets in the current frame t; $p_i^{(t-1)}$ is the center point location information of target i in the previous frame t-1; $p_i^{(t)}$ is the center point location information of target i in the current frame t; and $\hat{D}_{p_i^{(t)}}$ is the predicted displacement of the center point of target i in the current frame t.

[0169] The final loss function value is expressed mathematically as follows:

[0170] $$L_{total} = L_{off} + L_{size} + L_{pre} + L_{heat}$$
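The four loss terms and their sum can be sketched as below. The focal loss is written in the penalty-reduced form commonly used for center-point heatmaps, which is an assumption consistent with the coefficients α = 2 and β = 4 named above; rounding down in `offset_target` is also an assumption. All function names are hypothetical.

```python
import math

def focal_loss(y_true, y_pred, alpha=2, beta=4):
    """Focal loss over flattened heatmap pixels (assumed form).
    y_pred values must lie strictly between 0 and 1."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        if y == 1:
            total += (1 - p) ** alpha * math.log(p)
        else:
            total += (1 - y) ** beta * p ** alpha * math.log(1 - p)
    return -total / len(y_true)

def l1_loss(pred, target):
    """Mean absolute error between two equal-length sequences."""
    return sum(abs(a - b) for a, b in zip(pred, target)) / len(pred)

def offset_target(p, r=4):
    """Sub-pixel offset lost when coordinate p is downsampled by r
    and rounded down (flooring is an assumption)."""
    return p / r - math.floor(p / r)

def total_loss(l_off, l_size, l_pre, l_heat):
    """Sum of the four loss terms."""
    return l_off + l_size + l_pre + l_heat
```

For instance, a center coordinate of 10 with R = 4 maps to 2.5 on the output grid, so the offset the network must recover is 0.5.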

[0171] The following explains the process of obtaining the pruning rate of each convolutional layer in the neural network model.

[0172] The process of obtaining the pruning rate of each convolutional layer mainly includes three steps:

[0173] Perform low-rank decomposition on the input and output matrices of each convolutional layer to obtain the decomposed eigenvalues of that convolutional layer, and construct the decomposed feature matrix of that convolutional layer from those eigenvalues;

[0174] Normalize the obtained decomposed eigenvalues to obtain normalized eigenvalues; and

[0175] Use the normalized eigenvalues to obtain the information entropy, so that the importance of the filters can be evaluated based on the information entropy. In this way, regardless of whether the original neural network model has been fully trained, the time consumed by model pruning and training can be greatly reduced.

[0176] Refer to Figure 5, which is a schematic flowchart illustrating the process of determining the pruning rate in an embodiment of this application. For any convolutional layer in the current neural network model, the following steps are performed:

[0177] Step 401: Perform low-rank decomposition on the input and output matrices of the convolutional layer l to obtain the decomposed feature matrix composed of the decomposed feature values.

[0178] Neural network models contain numerous convolutional operations, and their filters often contain a lot of useless information, significantly increasing computational complexity. As an example of low-rank decomposition, Principal Component Analysis (PCA) is used for dimensionality reduction. PCA maps n-dimensional features to s-dimensional orthogonal features, also known as principal components. These s-dimensional orthogonal features are reconstructed from the original n-dimensional features, where s is less than n. The number of eigenvalues represents the dimensionality, and the absolute value of each eigenvalue represents the intensity of the projection information along each dimension.
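To illustrate how PCA eigenvalues measure the intensity of information along each principal direction, the following toy sketch computes the eigenvalues of the covariance matrix for two-dimensional data in closed form. Restricting the example to two features, the use of the sample covariance, and the function name are assumptions made to keep it dependency-free; how the application forms the matrices for a real convolutional layer is not shown here.

```python
import math

def pca_eigenvalues_2d(points):
    """Eigenvalues of the 2x2 sample covariance matrix of (x, y) points,
    largest first. Each eigenvalue is the variance along one principal axis."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
    # Closed-form eigenvalues of a symmetric 2x2 matrix.
    half_trace = (sxx + syy) / 2
    radius = math.sqrt(((sxx - syy) / 2) ** 2 + sxy ** 2)
    return half_trace + radius, half_trace - radius
```

When the points lie on a straight line, the second eigenvalue is zero: all information sits in one direction, so the data can be reduced from two dimensions to one without loss.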

[0179] Step 402: Normalize the eigenvalues in the decomposed feature matrix so that they are constrained to a range greater than or equal to 0 and less than or equal to 1, thus obtaining normalized eigenvalues.

[0180] To compare eigenvalues on the same scale, each eigenvalue in the orthogonal feature matrix (the decomposed feature matrix) composed of the orthogonal eigenvalues is normalized so that it lies within the range of 0 to 1. The normalization process can be expressed mathematically as follows:

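[0181] $$\hat{\lambda}_j^{l} = \frac{\lambda_j^{l} - \lambda_{min}^{l}}{\lambda_{max}^{l} - \lambda_{min}^{l}}$$

[0182] where $\hat{\lambda}_j^{l}$ is the normalized result of eigenvalue j in the decomposed feature matrix of convolutional layer l, i.e., the normalized eigenvalue; $\lambda_{max}^{l}$ is the largest eigenvalue in the decomposed feature matrix of convolutional layer l; $\lambda_{min}^{l}$ is the smallest eigenvalue in the decomposed feature matrix of convolutional layer l; $\lambda_j^{l}$ is the eigenvalue j in the decomposed feature matrix of convolutional layer l that is to be normalized; $\lambda_j^{l} - \lambda_{min}^{l}$ is the first difference; and $\lambda_{max}^{l} - \lambda_{min}^{l}$ is the second difference.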
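The normalization in step 402 (first difference divided by second difference) can be sketched as follows. The function name is hypothetical, and the handling of the case where all eigenvalues are equal is an assumption, since the text does not address it.

```python
def min_max_normalize(eigenvalues):
    """Scale eigenvalues into [0, 1]: (value - min) / (max - min)."""
    lo, hi = min(eigenvalues), max(eigenvalues)
    span = hi - lo  # the "second difference"
    if span == 0:
        # All eigenvalues equal: the ratio is undefined. Returning zeros
        # is an assumption made only for this sketch.
        return [0.0 for _ in eigenvalues]
    # value - lo is the "first difference".
    return [(v - lo) / span for v in eigenvalues]
```

For eigenvalues 2, 4, and 6 the result is 0, 0.5, and 1, so the smallest eigenvalue always maps to 0 and the largest to 1.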

[0183] Step 403: Determine the weights of each normalized feature value. These weights are used to characterize the probability of all projections of the convolutional layer in the low-rank space.

[0184] Introducing the concept of "energy" for Boltzmann machines, we describe the importance of the decomposed feature matrix by converting the absolute value of each eigenvalue in the decomposed feature matrix into a probability, thereby describing the importance of the convolutional layer filters that decompose the feature matrix. Each normalized eigenvalue is considered as a possible state of its effective projection into the convolutional layer. As an example, Softmax is used to calculate the probability of each projection of the convolutional layer into the low-rank space. The Softmax formula is shown below.

[0185] $$p\left(e^{\hat{\lambda}_j^{l}}\right) = \frac{e^{\hat{\lambda}_j^{l}}}{\sum_{i} e^{\hat{\lambda}_i^{l}}}$$

[0186] where $p\left(e^{\hat{\lambda}_j^{l}}\right)$ is the weight of the exponential function value $e^{\hat{\lambda}_j^{l}}$ of the normalized eigenvalue $\hat{\lambda}_j^{l}$ in the decomposed feature matrix of convolutional layer l; the weight is a value between 0 and 1 and is used to characterize the probability of all projections of convolutional layer l into the low-rank space; $e^{\hat{\lambda}_j^{l}}$ is the exponential function value of the normalized eigenvalue $\hat{\lambda}_j^{l}$; and $\sum_{i} e^{\hat{\lambda}_i^{l}}$ is the sum of the exponential function values of all normalized eigenvalues in the decomposed feature matrix of convolutional layer l.

[0187] It should be understood that if other methods are used to calculate the probability of each projection of the convolutional layer into the low-rank space, this probability can be understood as the weight of the normalized eigenvalues.
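As a minimal sketch of step 403, the Softmax weights of a list of normalized eigenvalues can be computed as follows. The function name is hypothetical. Because the inputs are already limited to the range 0 to 1, no overflow protection is included.

```python
import math

def softmax_weights(normalized):
    """Weight of each normalized eigenvalue: exp(v) / sum of all exp(v)."""
    exps = [math.exp(v) for v in normalized]
    total = sum(exps)
    return [e / total for e in exps]
```

The weights are positive and sum to 1, so they can be read as a probability distribution over the projections of the layer, and a larger normalized eigenvalue always receives a larger weight.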

[0188] Step 404: Using the weights of each normalized feature value, obtain the information entropy of the convolutional layer.

[0189] The sum of the probabilities of all normalized eigenvalues within any convolutional layer equals 1. This means that the projection of a convolutional layer can be described using information entropy. Thus, after calculating the total information entropy of all convolutional layers, the importance of each filter can be determined based on the information entropy. The information entropy is calculated as follows.

[0190] $$H_l = -\sum_{x} p(x) \log p(x)$$

[0191] where $x = e^{\hat{\lambda}_j^{l}}$ is the exponential function value of the normalized eigenvalue $\hat{\lambda}_j^{l}$ in the decomposed feature matrix of convolutional layer l; $p(x)$ is the weight of that exponential function value, as given by the Softmax formula above; $-\log p(x)$ is the self-information of that weight; and $H_l$ is the information entropy of convolutional layer l, calculated as the average of the self-information over all normalized eigenvalues of the convolutional layer, taking their exponential function values as random variables. A higher information entropy indicates greater importance of the filters in that convolutional layer. The symbol $\sum$ denotes accumulation over the exponential function values corresponding to all normalized eigenvalues of convolutional layer l.
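Step 404 can be sketched as the entropy of the weight distribution. The natural logarithm is assumed here because the text does not state a base, and the function name is hypothetical.

```python
import math

def layer_entropy(weights):
    """Information entropy of a probability vector: the expected
    self-information -log(p), using the natural logarithm (an assumption)."""
    return -sum(p * math.log(p) for p in weights if p > 0)
```

A layer whose weights are spread evenly over many projections yields a higher entropy than a layer dominated by one projection, which is why a higher value is read as higher filter importance.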

[0192] Step 405: Determine the pruning rate of the convolutional layer based on the information entropy. The larger the information entropy, the smaller the pruning rate, and vice versa.

[0193] Therefore, after obtaining the information entropy of each convolutional layer, the information entropy is sorted, and the pruning rate of each convolutional layer is set according to the sorting result. The convolutional layers with high importance are set with a smaller pruning rate, while the convolutional layers with low importance are set with a larger pruning rate. The filters in the convolutional layers are pruned according to the pruning rate corresponding to each convolutional layer to obtain the pruned neural network model.

[0194] As an example, the pruning rate of any convolutional layer l is determined as follows:

[0195] Calculate the proportion of the information entropy of convolutional layer l to the total information entropy of all convolutional layers.

[0196] The difference between the value 1 and this ratio is used to obtain the quota coefficient for configuring the pruning rate of this convolutional layer.

[0197] The pruning rate of convolutional layer l is obtained by multiplying the quota coefficient by the total pruning rate of the network model.

[0198] Expressed mathematically as follows:

[0199] $$P_l^{layer} = \left( 1 - \frac{H_l}{\sum_{i=1}^{L} H_i} \right) \times P_{total}$$

[0200] where $P_l^{layer}$ is the pruning rate of any convolutional layer l; $L$ is the total number of convolutional layers in the neural network model; $H_l$ is the information entropy of convolutional layer l; $P_{total}$ is the total pruning rate of the network model; and $1 - \frac{H_l}{\sum_{i=1}^{L} H_i}$ is the quota coefficient.
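The three-part rule above (entropy share, quota coefficient, multiplication by the total pruning rate) can be sketched as follows. The function and parameter names are hypothetical.

```python
def layer_pruning_rates(entropies, total_rate):
    """Per-layer pruning rate: (1 - H_l / sum of all H) * total_rate."""
    total_entropy = sum(entropies)
    rates = []
    for h in entropies:
        quota = 1 - h / total_entropy  # quota coefficient of this layer
        rates.append(quota * total_rate)
    return rates
```

For two layers with entropies 1.0 and 3.0 and a total pruning rate of 0.5, the rates are 0.375 and 0.125: the layer with the larger entropy is pruned less, and no layer is pruned by more than the total rate.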

[0201] Refer to Figure 6, which is a schematic diagram of a target detection and tracking device based on a pruned neural network model according to an embodiment of this application. The device includes:

[0202] The image acquisition module is used to acquire continuous images to be detected, and from the continuous images to be detected, to acquire the current image frame, the previous image frame, and a heatmap of the target to be detected in the previous image frame, wherein the heatmap is used to characterize the center point location information of the target to be detected.

[0203] The neural network model module is used to process the acquired current image frame, previous image frame, and heatmap using a trained pruned neural network model. It obtains the target center point position, intra-frame target box offset position, target box size, and inter-frame target box displacement prediction from the output of the trained pruned neural network model.

[0204] The association module is used to determine the target detection and tracking results based on the target center point position in each current image frame, the target box offset position within the frame, the target box size, and the target box displacement prediction between frames;

[0205] wherein,

[0206] The pruned neural network model is obtained in the following manner:

[0207] For an incompletely trained neural network model with at least one convolutional layer, filter pruning is performed on the filters in each convolutional layer according to the pruning rate configured by the information entropy corresponding to each convolutional layer. Here, information entropy is used to characterize the importance of the filters in the convolutional layer.

[0208] Refer to Figure 7, which is another schematic diagram of a target detection and tracking device based on a pruned neural network model according to an embodiment of this application. The device includes a memory and a processor. The memory stores a computer program, and the processor is configured to execute the computer program to implement the steps of the target detection and tracking method based on a pruned neural network model described in this application.

[0209] The memory may include random access memory (RAM) or non-volatile memory (NVM), such as at least one disk storage device. Optionally, the memory may also be at least one storage device located remotely from the aforementioned processor.

[0210] The processors mentioned above can be general-purpose processors, including central processing units (CPUs), network processors (NPs), etc.; they can also be digital signal processors (DSPs), application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), or other programmable logic devices, discrete gate or transistor logic devices, or discrete hardware components.

[0211] This invention also provides a computer-readable storage medium storing a computer program, which, when executed by a processor, implements the steps of the target detection and tracking method based on a pruned neural network model.

[0212] For the device / network-side equipment / storage medium embodiments, since they are basically similar to the method embodiments, the description is relatively simple, and relevant parts can be referred to in the description of the method embodiments.

[0213] In this document, relational terms such as "first" and "second" are used merely to distinguish one entity or operation from another, without necessarily requiring or implying any such actual relationship or order between these entities or operations. Furthermore, the terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that includes said element.

[0214] The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of the present invention should be included within the scope of protection of the present invention.

Claims

1. A target detection and tracking method based on a pruned neural network model, characterized in that the method includes: acquiring continuous images to be detected; obtaining, from the continuous images to be detected, the current image frame, the previous image frame, and a heatmap of the detected target in the previous image frame, wherein the heatmap is used to represent the center point location information of the detected target; processing the acquired current image frame, previous image frame, and heatmap using the trained pruned neural network model, and obtaining the target center point position, intra-frame target box offset position, target box size, and inter-frame target box displacement prediction from the output of the trained pruned neural network model; determining the target detection and tracking results based on the target center point position, intra-frame target box offset position, target box size, and inter-frame target box displacement prediction in each current image frame; wherein the pruned neural network model is obtained in the following manner: for at least one convolutional layer in an incompletely trained neural network model, filter pruning is performed on the filters in each convolutional layer according to the pruning rate configured by the information entropy corresponding to each convolutional layer, the information entropy being used to characterize the importance of the filters in the convolutional layer.

2. The method as described in claim 1, characterized in that, The pruned neural network model includes: The multi-feature attention fusion module is used to extract features from the current image frame, the previous image frame, and the heatmap, and then perform feature fusion and filtering. The encoding module is used to extract abstract features from the features output by the multi-feature attention fusion module. The decoding module is used to process abstract features to output the target center point position, intra-frame target box offset position, target box size, and inter-frame target box displacement prediction.

3. The method as described in claim 2, characterized in that, The multi-feature attention fusion module includes: The first feature attention fusion submodule extracts features from the current image frame, obtains the weight ratio of each convolutional layer through attention processing, filters features based on the weight ratio, and provides them to the second feature attention fusion submodule for feature fusion. The second submodule then performs convolution processing on the weight ratio of each convolutional layer and outputs the results. The second feature attention fusion submodule extracts features from the previous image frame, obtains the weight ratio of each convolutional layer through attention processing, filters features based on the weight ratio, and provides them to the third feature attention fusion submodule for feature fusion. It then performs convolution processing on the weight ratio of each convolutional layer and outputs the results. The third feature attention fusion submodule is used to extract heatmap features. It obtains the weight ratio of each convolutional layer through attention processing, performs convolution processing on the weight ratio of each convolutional layer, and outputs the results. The splicing submodule is used to splice the output features from the first feature attention fusion submodule, the second feature attention fusion submodule, and the third feature attention fusion submodule, respectively.

4. The method as described in claim 3, characterized in that, The first feature attention fusion submodule includes: The first convolutional submodule is used to extract features from the current image frame. The first attention submodule is used to perform attention processing on the extracted features of the current image frame to obtain the weights of each convolutional layer. The first product submodule is used to multiply the extracted features of the current image frame with the weight proportion of each convolutional layer output by the first attention submodule to obtain the weight proportion of each convolutional layer. The fourth convolution submodule performs convolution processing on the weight ratio of each convolutional layer output from the first product submodule, and filters out features based on the weight ratio of each convolutional layer before inputting them into the fifth convolution submodule. The second feature attention fusion submodule includes: The second convolutional submodule is used to extract features from the previous image frame. The second attention submodule is used to perform attention processing on the extracted features from the previous image frame to obtain the weights for each convolutional layer. The second product submodule is used to multiply the extracted features of the current image frame with the weight proportion of each convolutional layer output by the second attention submodule to obtain the weight proportion of each convolutional layer. The fifth convolutional submodule performs convolution processing on the weight proportions of each convolutional layer output from the second product submodule and the filtered features output from the fourth convolutional submodule. It then filters out features based on the weight proportions of each convolutional layer and inputs them into the sixth convolutional submodule. 
The third feature attention fusion submodule includes: The third convolutional submodule is used to extract heatmap features. The third attention submodule is used to perform attention processing on the extracted heatmap features to obtain the weights for each convolutional layer. The third product submodule is used to multiply the extracted heatmap features by the weight percentage of each convolutional layer output by the third attention submodule to obtain the weight percentage of each convolutional layer. The sixth convolutional submodule is used to perform convolution processing on the weight ratio of each convolutional layer output by the third product submodule and the filtering features output by the fifth convolutional submodule.

5. The method as described in claim 1, characterized in that the pruning rate configured according to the information entropy of each convolutional layer in an incompletely trained neural network model includes: training the initial neural network model for only one round; and, for any convolutional layer in the neural network model after one round of training: performing low-rank decomposition on the input and output matrices of the convolutional layer to obtain a decomposed feature matrix composed of the decomposed eigenvalues; normalizing the eigenvalues in the decomposed feature matrix so that they are constrained to a range greater than or equal to 0 and less than or equal to 1, resulting in normalized eigenvalues; determining the weight of each normalized eigenvalue, the weights representing the probabilities of all projections of the convolutional layer into the low-rank space; obtaining the information entropy of the convolutional layer using the weights of the normalized eigenvalues; and determining the pruning rate of the convolutional layer based on the information entropy, wherein the larger the information entropy, the smaller the pruning rate, and vice versa; the filter pruning in the convolutional layer includes: based on the neural network model trained for only one round, performing filter pruning according to the pruning rate of each convolutional layer; and the trained pruned neural network model is obtained as follows: the pruned neural network model is fully trained using the training sample set to obtain the trained pruned neural network model.

6. The method as described in claim 5, characterized in that, The process of training the initial neural network model in only one round includes: Select the current batch of sample data from the training sample set and input it into the current neural network model to obtain the prediction results for the current batch of sample data. Based on the loss function value between the predicted and actual results, adjust the model parameters of the current neural network model. Repeat this process until all sample data in the training sample set has been selected to complete one round of training. Determining the pruning rate of the convolutional layer based on information entropy includes: Calculate the proportion of the information entropy of this convolutional layer to the information entropy of all convolutional layers. The difference between the value 1 and this ratio is used to obtain the quota coefficient for configuring the pruning rate of this convolutional layer. The pruning rate of convolutional layer l is obtained by multiplying the quota coefficient by the total pruning rate of the network model.

7. The method as described in claim 6, characterized in that the loss function value is the sum of the target center point position loss function value, the intra-frame target box offset position loss function value, the target box size loss function value, and the inter-frame target box displacement prediction loss function value; wherein the target center point position loss function value is calculated using the focal loss function; the intra-frame target box offset position loss function value is calculated using the first mean absolute error loss function; the target box size loss function value is calculated using the second mean absolute error loss function; and the inter-frame target box displacement prediction loss function value is calculated using the third mean absolute error loss function.

8. The method as described in claim 5, characterized in that normalizing the decomposed eigenvalues in the decomposed feature matrix includes, for any decomposed eigenvalue in the decomposed feature matrix: calculating the first difference between the decomposed eigenvalue and the smallest decomposed eigenvalue in the decomposed feature matrix; calculating the second difference between the largest and smallest decomposed eigenvalues in the decomposed feature matrix; and calculating the ratio between the first difference and the second difference to obtain the normalized result of the decomposed eigenvalue.

9. The method as described in claim 5, characterized in that determining the weight of each normalized eigenvalue includes: for any normalized eigenvalue, calculating the ratio between the exponential function value of the normalized eigenvalue and the sum of the exponential function values of all normalized eigenvalues in the decomposed feature matrix, to obtain the weight of the exponential function value of the normalized eigenvalue. Obtaining the information entropy of the convolutional layer using the weights of the normalized eigenvalues includes: for the weight of the exponential function value of any normalized eigenvalue of the convolutional layer, calculating the self-information of that weight; and, taking the exponential function values of all normalized eigenvalues of the convolutional layer as random variables, calculating the average value of the self-information to obtain the information entropy of the convolutional layer.
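
The weighting and entropy steps in claim 9 can be read as a softmax followed by a Shannon entropy, sketched below. This reading assumes that the "average value of the self-information" means the expectation under the weights and that the natural logarithm is used; neither is stated in the claim, and the function names are hypothetical.

```python
import math


def softmax_weights(normalized_values):
    """Weight of each value: exp(v) divided by the sum of exp over all values."""
    exps = [math.exp(v) for v in normalized_values]
    exp_sum = sum(exps)
    return [e / exp_sum for e in exps]


def layer_entropy(normalized_values):
    """Expected self-information of the weights (assumed natural logarithm)."""
    weights = softmax_weights(normalized_values)
    # Self-information of a weight w is -log(w); the weights act as probabilities.
    return sum(w * -math.log(w) for w in weights)


weights = softmax_weights([0.0, 0.0])
entropy = layer_entropy([0.0, 0.0])
```

With two equal inputs the weights are equal and the entropy under this reading is the natural logarithm of 2, the maximum for two outcomes.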

10. A target detection and tracking device based on a pruned neural network model, characterized in that the device includes: an image acquisition module, used to acquire continuous images to be detected and, from the continuous images to be detected, to acquire the current image frame, the previous image frame, and a heatmap of the target to be detected in the previous image frame, wherein the heatmap is used to characterize the center point position information of the target to be detected; a neural network model module, used to process the acquired current image frame, previous image frame, and heatmap using a trained pruned neural network model, and to obtain the target center point position in the current image frame, the intra-frame target box offset position, the target box size, and the inter-frame target box displacement prediction from the output of the trained pruned neural network model; and an association module, used to determine the target detection and tracking result based on the target center point position, the intra-frame target box offset position, the target box size, and the inter-frame target box displacement prediction in each current image frame; wherein the pruned neural network model is obtained in the following manner: for at least one convolutional layer in an incompletely trained neural network model, filter pruning is performed on the filters in each such convolutional layer according to the pruning rate configured by the information entropy corresponding to that convolutional layer, where the information entropy is used to characterize the importance of the filters in the convolutional layer.