A target detection model training method, a training device, and a storage medium

By constructing an initial model and utilizing reparameterized channel shuffling units and structural reparameterization, the problem of efficient inference for target detectors on embedded devices and edge mobile platforms is solved, achieving high-precision and low-resource target detection results.

CN117315321B · Active · Publication Date: 2026-06-26 · HANGZHOU HUACHENG SOFTWARE TECH CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
HANGZHOU HUACHENG SOFTWARE TECH CO LTD
Filing Date
2023-08-22
Publication Date
2026-06-26

AI Technical Summary

Technical Problem

Existing object detectors based on convolutional neural networks struggle to achieve efficient inference on embedded devices and edge mobile platforms, with unacceptable model size and inference efficiency.

Method used

An initial model is constructed, including a backbone network, a neck network, and a head network. Feature extraction is performed using reparameterized channel shuffling units, and the multi-path structure is converted into a single-path structure through structural reparameterization, reducing the number of model parameters and computational resource requirements.

Benefits of technology

It enables efficient inference for high-precision target detection on embedded and mobile devices, reducing the number of model parameters and runtime resource requirements.



Abstract

The application discloses a training method, a training device, and a storage medium for a target detection model. The method comprises: constructing an initial model, wherein the initial model comprises a reparameterized channel shuffling unit, the reparameterized channel shuffling unit comprises a branch formed by a channel separation module, a first-stage processing module, a second-stage processing module, a third-stage processing module, a fully connected module, and a channel shuffling module, the first-stage processing module and the third-stage processing module each comprise a convolution branch and a batch normalization branch, and the second-stage processing module comprises two convolution branches; inputting sample data into the initial model for training to obtain an intermediate model; and adjusting the two branches in each of the first-stage, second-stage, and third-stage processing modules of the intermediate model into a single branch to obtain the target detection model. In this way, high model accuracy is maintained while efficient inference is achieved.

Description

Technical Field

[0001] This application relates to the field of image detection technology, and in particular to a training method, training device and storage medium for an object detection model.

Background Technology

[0002] Object detection, as a core research direction in the field of computer vision, is widely used in many fields such as intelligent security, industrial inspection, and medical diagnosis.

[0003] Detection methods based on convolutional neural networks (CNNs) can generally be divided into two-stage detectors and one-stage detectors. Two-stage detectors typically use a region proposal network to generate candidate boxes first, and then refine the positions of the candidate boxes in the next stage to obtain the final prediction. Because of this multi-stage design, such detectors tend to be slow at inference. In contrast, one-stage detectors directly predict the target category and regress bounding boxes on the convolutional feature map. Because the entire network is simplified, one-stage detectors generally have faster inference speeds than two-stage detectors, but their model size and inference efficiency are still unacceptable on some embedded devices and edge mobile platforms. Therefore, designing models for deployment on embedded devices and edge mobile platforms remains a research challenge.

Summary of the Invention

[0004] To address the aforementioned issues, this application discloses a training method, training device, and storage medium for an object detection model. This method can aggregate the extraction of local features while reducing the number of model parameters, thereby ensuring high model accuracy and achieving efficient inference speed. Furthermore, it can be deployed on embedded devices and mobile devices.

[0005] One technical solution adopted in this application is to provide a training method for an object detection model. The training method includes: constructing an initial model; wherein the initial model includes a backbone network, a neck network, and a head network connected in sequence; the backbone network includes a feature extraction module, which includes a reparameter channel shuffling unit; the reparameter channel shuffling unit includes a channel separation module, a first-stage processing module, a second-stage processing module, a third-stage processing module, a fully connected module, and a channel shuffling module connected in sequence to form branches; wherein the first-stage and third-stage processing modules each include convolutional branches and batch normalization branches, and the second-stage processing module includes two-way convolutional branches; inputting sample data into the initial model for training to obtain an intermediate model; adjusting the two-way branches in the first-stage, second-stage, and third-stage processing modules of the intermediate model to single-way branches to obtain the object detection model.

[0006] The first-stage processing module and the third-stage processing module each include a 1×1 basic convolutional branch and a batch normalization branch, while the second-stage processing module includes two 3×3 channel-wise convolutional branches.

[0007] Specifically, the two-way branches in the first-stage processing module, the second-stage processing module, and the third-stage processing module of the intermediate model are adjusted to single-way branches. This includes: merging one 1×1 basic convolutional branch and one batch normalization branch in the first-stage processing module and the third-stage processing module into one 1×1 convolutional branch; and merging two 3×3 channel-wise convolutional branches in the second-stage processing module into one 3×3 convolutional branch.
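
The merge of a 1×1 convolutional branch with a batch normalization branch can be illustrated with a minimal sketch. It assumes (the text does not state this explicitly) that the two branch outputs are summed, that the input and output channel counts are equal, and that batch normalization runs in inference mode with fixed statistics. All function names and numeric values are hypothetical, and a 1×1 convolution is modelled at a single pixel as a matrix-vector product over channels.

```python
import math

def conv1x1(weight, bias, x):
    """Apply a 1x1 convolution at one pixel: a matrix-vector product over channels."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]

def batch_norm(gamma, beta, mean, var, x, eps=1e-5):
    """Inference-mode batch normalization, applied independently per channel."""
    return [g * (xi - m) / math.sqrt(v + eps) + b
            for g, b, m, v, xi in zip(gamma, beta, mean, var, x)]

def fuse_conv1x1_and_bn(weight, bias, gamma, beta, mean, var, eps=1e-5):
    """Fold a parallel BN branch into the 1x1 kernel (assumes branch outputs are summed)."""
    fused_w = [row[:] for row in weight]
    fused_b = bias[:]
    for c in range(len(gamma)):
        scale = gamma[c] / math.sqrt(var[c] + eps)
        fused_w[c][c] += scale                   # BN behaves like a diagonal 1x1 kernel
        fused_b[c] += beta[c] - mean[c] * scale  # constant part of BN goes into the bias
    return fused_w, fused_b

# Hypothetical toy parameters for a 2-channel feature vector.
weight = [[0.5, -1.0], [2.0, 0.25]]
bias = [0.1, -0.2]
gamma, beta = [1.5, 0.8], [0.0, 0.3]
mean, var = [0.2, -0.1], [1.0, 4.0]
x = [1.0, -2.0]

# Training-time form: two branches whose outputs are added.
two_branch = [a + b for a, b in zip(conv1x1(weight, bias, x),
                                    batch_norm(gamma, beta, mean, var, x))]

# Inference-time form: one 1x1 convolution with merged parameters.
fused_w, fused_b = fuse_conv1x1_and_bn(weight, bias, gamma, beta, mean, var)
single_branch = conv1x1(fused_w, fused_b, x)
```

Under these assumptions `two_branch` and `single_branch` agree up to floating-point rounding, which is why the merged model can reproduce the trained model's outputs with fewer branches.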

[0008] The backbone network includes a first feature extraction module, a second feature extraction module, a third feature extraction module, and a fourth feature extraction module connected in sequence. The second feature extraction module, the third feature extraction module, and the fourth feature extraction module include a reparameter channel shuffling unit.

[0009] The first feature extraction module is the Stem module.

[0010] The second feature extraction module includes a first reparameterized channel shuffling unit with a stride of 1 and three second reparameterized channel shuffling units with a stride of 2. The third feature extraction module includes a first reparameterized channel shuffling unit and seven second reparameterized channel shuffling units. The fourth feature extraction module includes a first reparameterized channel shuffling unit and a second reparameterized channel shuffling unit.

[0011] The first reparameterized channel shuffling unit comprises a first branch formed by sequentially connecting a channel separation module, a fully connected module, and a channel shuffling module, and a second branch formed by sequentially connecting a channel separation module, a first-stage processing module, a second-stage processing module, a third-stage processing module, a fully connected module, and a channel shuffling module. The second reparameterized channel shuffling unit comprises a first branch formed by sequentially connecting a channel separation module, a first-stage processing module, a second-stage processing module, a fully connected module, and a channel shuffling module, and a second branch formed by sequentially connecting a channel separation module, a first-stage processing module, a second-stage processing module, a third-stage processing module, a fully connected module, and a channel shuffling module.

[0012] The neck network comprises a first convolutional module, a transposed convolutional cascade module, and a channel-by-channel convolutional cascade module connected in sequence.

[0013] The first convolutional module includes a first convolutional unit, a second convolutional unit, and a third convolutional unit. The first convolutional unit is connected to the second feature extraction module, the second convolutional unit is connected to the third feature extraction module, and the third convolutional unit is connected to the fourth feature extraction module. The transposed convolutional cascade module includes a first transposed convolutional cascade module and a second transposed convolutional cascade module. The first transposed convolutional cascade module is connected to the second convolutional unit and the third convolutional unit, and is used to transform the second feature map output by the second convolutional unit and the third feature map output by the third convolutional unit into a fourth feature map. The second transposed convolutional cascade module is connected to the first convolutional unit and the first transposed convolutional cascade module, and is used to transform the first feature map output by the first convolutional unit and the fourth feature map into a fifth feature map. The channel-by-channel convolutional cascade module includes a first channel-by-channel convolutional cascade module and a second channel-by-channel convolutional cascade module. The first channel-by-channel convolutional cascade module is connected to the first transposed convolutional cascade module and the second transposed convolutional cascade module, and is used to transform the fifth feature map into a first output feature map and to transform the fourth feature map and the fifth feature map into a second output feature map. The second channel-by-channel convolutional cascade module is connected to the first channel-by-channel convolutional cascade module and the third convolutional unit, and is used to transform the third feature map into a third output feature map.

[0014] The head network includes a first prediction unit, a second prediction unit, and a third prediction unit. The first prediction unit includes a second convolutional module, a first compact convolutional module, a second compact convolutional module, a third convolutional module, and a category detection module connected in sequence. The second prediction unit includes a second convolutional module, a third compact convolutional module, a fourth compact convolutional module, a fourth convolutional module, and a regression detection module connected in sequence. The third prediction unit includes a second convolutional module, a third compact convolutional module, a fourth compact convolutional module, a fifth convolutional module, and a confidence detection module connected in sequence.

[0015] The process of inputting sample data into the initial model for training includes: acquiring sample data; dividing the sample data into training set, validation set and test set according to a set ratio; preprocessing the training set; and inputting the preprocessed training set into the initial model for training.

[0016] Another technical solution adopted in this application is: providing a training device for an object detection model, the training device including a processor and a memory, the memory being used to store program data, and the processor being used to execute the program data to implement the above-mentioned training method for the object detection model.

[0017] Another technical solution adopted in this application is to provide a computer-readable storage medium that stores program data, which, when executed by a processor, implements the above-mentioned training method for the target detection model.

[0018] This application provides a training method, training device, and storage medium for an object detection model. The training method includes: constructing an initial model; wherein the initial model includes a backbone network, a neck network, and a head network connected in sequence; the backbone network includes a feature extraction module, which includes a reparameterized channel shuffling unit; the reparameterized channel shuffling unit includes a channel separation module, a first-stage processing module, a second-stage processing module, a third-stage processing module, a fully connected module, and a channel shuffling module connected in sequence to form a branch; wherein the first-stage and third-stage processing modules each include a convolutional branch and a batch normalization branch, and the second-stage processing module includes two convolutional branches; inputting sample data into the initial model for training to obtain an intermediate model; adjusting the two branches in the first-stage, second-stage, and third-stage processing modules of the intermediate model to single branches to obtain the object detection model. In this way, the reparameterized channel shuffling unit retains the original lightweight network structure to the greatest extent while expanding the structure horizontally, which lets the network focus more on the extraction of local features during training. Furthermore, structural reparameterization converts the multi-path structure into a single-path structure for the inference phase, which preserves the model's feature extraction capability and high accuracy while reducing the number of model parameters and the memory and computing resources required to run the model. Inference is therefore faster, and the model can be deployed on embedded devices and mobile devices.

Brief Description of the Drawings

[0019] To more clearly illustrate the technical solutions in the embodiments of this application, the accompanying drawings used in the description of the embodiments will be briefly introduced below. Obviously, the drawings described below are only some embodiments of this application. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort. Wherein:

[0020] Figure 1 is a flowchart of an embodiment of the training method for the object detection model provided in this application;

[0021] Figure 2 is a network structure diagram of an embodiment of the initial model provided in this application;

[0022] Figure 3 is a network structure diagram of an embodiment of the backbone network in the initial model provided in this application;

[0023] Figure 4 is a network structure diagram of an embodiment of the first feature extraction module provided in this application;

[0024] Figure 5 is a network structure diagram of an embodiment of the first reparameterized channel shuffling unit provided in this application;

[0025] Figure 6 is a network structure diagram of an embodiment of the second reparameterized channel shuffling unit provided in this application;

[0026] Figure 7 is a schematic diagram of the structure of an embodiment of the neck network in the initial model provided in this application;

[0027] Figure 8 is a schematic diagram of the structure of an embodiment of the head network in the initial model provided in this application;

[0028] Figure 9 is a network structure diagram of an embodiment of the first reparameterized channel shuffling unit in the inference stage provided in this application;

[0029] Figure 10 is a network structure diagram of an embodiment of the second reparameterized channel shuffling unit in the inference stage provided in this application;

[0030] Figure 11 is a schematic diagram of the structure of an embodiment of the training device for the object detection model provided in this application;

[0031] Figure 12 is a schematic diagram of an embodiment of the computer-readable storage medium provided in this application.

Detailed Description

[0032] The technical solutions of the embodiments of this application will be clearly and completely described below with reference to the accompanying drawings. It is understood that the specific embodiments described herein are only for explaining this application and not for limiting it. Furthermore, it should be noted that, for ease of description, only the parts related to this application are shown in the accompanying drawings, not all structures. All other embodiments obtained by those skilled in the art based on the embodiments of this application without creative effort are within the scope of protection of this application.

[0033] The terms "first," "second," etc., used in this application are used to distinguish different objects, not to describe a specific order. Furthermore, the terms "comprising" and "having," and any variations thereof, are intended to cover non-exclusive inclusion. For example, a process, method, system, product, or apparatus that includes a series of steps or units is not limited to the listed steps or units, but may optionally include steps or units not listed, or may optionally include other steps or units inherent to these processes, methods, products, or apparatuses.

[0034] In this document, the term "embodiment" means that a particular feature, structure, or characteristic described in connection with an embodiment may be included in at least one embodiment of this application. The appearance of this phrase in various places throughout the specification does not necessarily refer to the same embodiment, nor is it a separate or alternative embodiment mutually exclusive with other embodiments. It will be explicitly and implicitly understood by those skilled in the art that the embodiments described herein can be combined with other embodiments.

[0035] Refer to Figure 1, which is a flowchart of an embodiment of the training method for the object detection model provided in this application. The training method for the object detection model includes:

[0036] Step S11: Construct the initial model; wherein the initial model includes a backbone network, a neck network, and a head network connected in sequence. The backbone network includes a feature extraction module, which includes a reparameter channel shuffling unit. The reparameter channel shuffling unit includes a channel separation module, a first-stage processing module, a second-stage processing module, a third-stage processing module, a fully connected module, and a channel shuffling module connected in sequence to form a branch. The first-stage processing module and the third-stage processing module both include convolutional branches and batch normalization branches, and the second-stage processing module includes two convolutional branches.

[0037] Refer to Figure 2 and Figure 3. Figure 2 is a network structure diagram of an embodiment of the initial model provided in this application. The initial model 20 includes a backbone network 30, a neck network 40, and a head network 50 connected in sequence. Figure 3 is a network structure diagram of an embodiment of the backbone network in the initial model provided in this application. The backbone network 30 includes a first feature extraction module 31, a second feature extraction module 32, a third feature extraction module 33, and a fourth feature extraction module 34 connected in sequence. The second feature extraction module 32, the third feature extraction module 33, and the fourth feature extraction module 34 each include a first reparameterized channel shuffling unit with a stride of 1 and a second reparameterized channel shuffling unit with a stride of 2.

[0038] Optionally, the first feature extraction module 31 is a Stem module, as shown in Figure 4, which is a network structure diagram of an embodiment of the first feature extraction module provided in this application. The Stem module includes one branch formed by a 1×1 convolution, a 1×1 convolution, a 3×3 convolution, a fully connected module, and a 3×3 convolution connected in sequence, and another branch formed by a 1×1 convolution, max pooling, a fully connected module, and a 3×3 convolution. The module first applies a convolution with a kernel size of 1×1 to the input sample data, mainly to change the number of channels in the feature map. The network then splits into two branches, and the feature map is likewise divided into two parts. One part is max pooled; the other part first passes through a 1×1 convolution that halves the number of channels and then through a 3×3 convolution with a stride of 2, which performs the second downsampling. The outputs of the two branches are concatenated along the channel dimension, and a final 3×3 convolution restores the number of channels. This process reduces the resolution of the output convolutional feature map to 1/4 of the original input image and removes a large number of parameters while preserving strong feature expression ability and avoiding excessive information loss. Optionally, in one embodiment, the second feature extraction module 32 includes a first reparameterized channel shuffling unit with a stride of 1 and three second reparameterized channel shuffling units with a stride of 2; the third feature extraction module 33 includes a first reparameterized channel shuffling unit and seven second reparameterized channel shuffling units; and the fourth feature extraction module 34 includes a first reparameterized channel shuffling unit and a second reparameterized channel shuffling unit. The second feature extraction module 32, the third feature extraction module 33, and the fourth feature extraction module 34 can alternately use the first reparameterized channel shuffling unit and the second reparameterized channel shuffling unit to extract features from the input feature map.

[0039] Refer to Figure 5 and Figure 6. Figure 5 is a network structure diagram of an embodiment of the first reparameterized channel shuffling unit provided in this application, and Figure 6 is a network structure diagram of an embodiment of the second reparameterized channel shuffling unit provided in this application. Specifically, the first reparameterized channel shuffling unit 510 includes a first branch formed by sequentially connecting a channel separation module 511, a fully connected module 515, and a channel shuffling module 516, and a second branch formed by sequentially connecting the channel separation module 511, a first-stage processing module 512, a second-stage processing module 513, a third-stage processing module 514, the fully connected module 515, and the channel shuffling module 516. The second reparameterized channel shuffling unit 610 includes a first branch formed by sequentially connecting a channel separation module 511, a first-stage processing module 512, a second-stage processing module 513, a fully connected module 515, and a channel shuffling module 516, and a second branch formed by sequentially connecting the channel separation module 511, a first-stage processing module 512, a second-stage processing module 513, a third-stage processing module 514, the fully connected module 515, and the channel shuffling module 516. Both units therefore have a multi-path structure, which enables efficient expression of feature sparsity during training while directing more attention to the extraction of local features.
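
The data flow through such a unit (channel separation, per-branch processing, joining, channel shuffling) can be sketched on a toy list of channel labels. This is only an illustration under assumptions: the "fully connected module" is taken to join the two branches by channel concatenation, as in ShuffleNet-style units, the shuffle uses two groups, and the `process` callback is a hypothetical stand-in for the three processing stages.

```python
def channel_split(channels):
    """Split the channel list into two halves, one per branch."""
    half = len(channels) // 2
    return channels[:half], channels[half:]

def channel_shuffle(channels, groups=2):
    """Interleave channels across groups (reshape to groups x n, transpose, flatten)."""
    n = len(channels) // groups
    return [channels[g * n + i] for i in range(n) for g in range(groups)]

def first_unit(channels, process):
    """Identity on the first branch, processing on the second, then concatenate and shuffle."""
    left, right = channel_split(channels)
    right = [process(c) for c in right]   # stand-in for the three processing stages
    return channel_shuffle(left + right)  # concatenation followed by channel shuffling

# Channel labels are hypothetical; upper-casing marks a channel as "processed".
out = first_unit(["a0", "a1", "b0", "b1"], lambda c: c.upper())
print(out)  # ['a0', 'B0', 'a1', 'B1']
```

The shuffle interleaves untouched and processed channels, so the next unit's split sends a mix of both to each branch.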

[0040] Optionally, in one embodiment, both the first-stage processing module 512 and the third-stage processing module 514 include a 1×1 basic convolution branch and a batch normalization branch, while the second-stage processing module 513 includes two 3×3 channel-wise convolution branches. Specifically, both the first-stage processing module 512 and the third-stage processing module 514 include a 1×1 basic convolution branch controlled by hyperparameter k and a batch normalization branch, while the second-stage processing module 513 includes a 3×3 channel-wise convolution branch controlled by hyperparameter k and a separate 3×3 channel-wise convolution branch.

[0041] Refer to Figure 7, which is a schematic diagram of the structure of an embodiment of the neck network in the initial model provided in this application. The neck network 40 includes a first convolutional module, a transposed convolutional cascade module, and a channel-by-channel convolutional cascade module connected in sequence. The first convolutional module includes a first convolutional unit 711, a second convolutional unit 712, and a third convolutional unit 713. The first convolutional unit 711 is connected to the second feature extraction module 32, the second convolutional unit 712 is connected to the third feature extraction module 33, and the third convolutional unit 713 is connected to the fourth feature extraction module 34. Optionally, the first, second, and third convolutional units are all 1×1 convolutions. After the 1×1 convolution, the number of channels is reduced to 96, thereby reducing the computational cost. The transposed convolutional cascade module includes a first transposed convolutional cascade module 721 and a second transposed convolutional cascade module 722. The first transposed convolutional cascade module 721 connects the second convolutional unit 712 and the third convolutional unit 713. It applies a transposed convolution to the third feature map M3 output by the third convolutional unit 713 and cascades the result along the channel dimension with the second feature map M2 output by the second convolutional unit 712 to obtain a fourth feature map M4 with a size of 20×20. The second transposed convolutional cascade module 722 connects the first convolutional unit 711 and the first transposed convolutional cascade module 721. It applies a further transposed convolution to the fourth feature map M4 and cascades the result along the channel dimension with the first feature map M1 output by the first convolutional unit 711 to obtain a fifth feature map M5 with a size of 40×40. The channel-by-channel convolutional cascade module includes a first channel-by-channel convolutional cascade module 731 and a second channel-by-channel convolutional cascade module 732. The first channel-by-channel convolutional cascade module 731 connects the second transposed convolutional cascade module 722 and the first transposed convolutional cascade module 721. It applies a depthwise separable convolution to the fifth feature map M5, compressing the number of channels to 96, to obtain a first feature map P1 with a size of 40×40. The fifth feature map M5 is then combined with the fourth feature map M4 through channel-by-channel convolution and passed through a depthwise separable convolution to obtain a second feature map P2 with 96 channels and a size of 20×20. The second channel-by-channel convolutional cascade module 732 connects the first channel-by-channel convolutional cascade module 731 and the third convolutional unit 713. It applies a channel-by-channel convolution to the third feature map M3 output by the third convolutional unit 713, followed by a depthwise separable convolution, to obtain a third feature map P3 with 96 channels and a size of 10×10. As can be seen, the neck network adopts a PAN structure, which integrates feature maps at different scales. On the one hand, it replaces the traditional upsampling operation with transposed convolution during feature fusion, which enhances the model's feature extraction capability while adding only a small number of parameters. On the other hand, the use of depthwise separable convolution reduces the computational cost of the model, which is beneficial for deployment on resource-constrained devices.
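
The spatial sizes quoted for the neck (10×10, 20×20, 40×40) can be checked with the standard output-size formulas for convolution and transposed convolution. The sketch below is only a consistency check: the kernel sizes, strides, and paddings are assumptions chosen to reproduce the stated sizes, and the 10×10 size of M3 is inferred from the sizes given for M4 and P3 rather than stated for M3 itself.

```python
def conv_out(size, kernel, stride, padding):
    """Output size of a convolution (dilation 1)."""
    return (size + 2 * padding - kernel) // stride + 1

def transposed_conv_out(size, kernel, stride, padding=0, output_padding=0):
    """Output size of a transposed convolution (dilation 1)."""
    return (size - 1) * stride - 2 * padding + kernel + output_padding

# Top-down path: transposed convolutions double the resolution (assumed kernel 2, stride 2).
m3 = 10
m4 = transposed_conv_out(m3, kernel=2, stride=2)      # matches the 20x20 size of M4
m5 = transposed_conv_out(m4, kernel=2, stride=2)      # matches the 40x40 size of M5

# Bottom-up path: stride-2 3x3 convolutions halve the resolution (assumed padding 1).
p1 = conv_out(m5, kernel=3, stride=1, padding=1)      # P1 keeps the 40x40 size
p2 = conv_out(p1, kernel=3, stride=2, padding=1)      # matches the 20x20 size of P2
p3 = conv_out(p2, kernel=3, stride=2, padding=1)      # matches the 10x10 size of P3

print(m4, m5, p1, p2, p3)  # 20 40 40 20 10
```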

[0042] Refer to Figure 8, which is a schematic diagram of the structure of an embodiment of the head network in the initial model provided in this application. The head network 50 includes a first prediction unit, a second prediction unit, and a third prediction unit. The first prediction unit includes a second convolutional module 80, a first compact convolutional module 801, a second compact convolutional module 802, a third convolutional module 803, and a category detection module 808 connected in sequence. The second prediction unit includes a second convolutional module 80, a third compact convolutional module 804, a fourth compact convolutional module 805, a fourth convolutional module 806, and a regression detection module 809 connected in sequence. The third prediction unit includes a second convolutional module 80, a third compact convolutional module 804, a fourth compact convolutional module 805, a fifth convolutional module 807, and a confidence detection module 810 connected in sequence. The regression task and the classification task are processed separately because they conflict with each other in the object detection process. In addition, two consecutive compact convolutional modules are used in each prediction unit to expand the receptive field. Optionally, the second, third, fourth, and fifth convolutional modules are 1×1 convolutions. Optionally, the category detection module outputs Cls: H×W×C, the regression detection module outputs Reg: H×W×4, where 4 refers to the 4 regression values, and the confidence detection module outputs IOU: H×W×1, where 1 refers to 1 confidence value.

[0043] Step S12: Input the sample data into the initial model for training to obtain the intermediate model.

[0044] Optionally, step S12 can specifically be as follows: obtain sample data, divide the sample data into training set, validation set and test set according to the set ratio of 8:1:1, preprocess the training set, input the preprocessed training set into the initial model for training, and obtain the intermediate model.
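
The 8:1:1 division can be sketched as follows. The shuffling step, the fixed random seed, and the function name are assumptions added for illustration; the text only specifies the ratio.

```python
import random

def split_dataset(samples, ratios=(8, 1, 1), seed=0):
    """Shuffle the samples and split them into training, validation, and test sets."""
    items = list(samples)
    random.Random(seed).shuffle(items)       # assumed: shuffle before splitting
    total = sum(ratios)
    n_train = len(items) * ratios[0] // total
    n_val = len(items) * ratios[1] // total
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]           # the remainder goes to the test set
    return train, val, test

# Hypothetical dataset of 100 sample indices.
train, val, test = split_dataset(range(100))
print(len(train), len(val), len(test))  # 80 10 10
```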

[0045] Optionally, in one embodiment, the preprocessed training set is input into the backbone network for multi-path training and feature extraction to obtain an initial feature map. Then, the initial feature map is fused and a second feature extraction is performed through the neck network to obtain a first feature map P1, a second feature map P2, and a third feature map P3 at different scales. The first feature map P1, the second feature map P2, and the third feature map P3 are then convolved through the head network to output the category, location, and confidence of the predicted target feature map. Then, a loss function is calculated on the output predicted category, location, and confidence to obtain the optimized gradient. The weights and biases are adjusted until the loss function converges or the initial model training reaches the specified number of iterations, thus obtaining an intermediate model.

[0046] Optionally, the constructed multi-path lightweight object detection network model is trained using the preprocessed training set to obtain feature maps at different scales. These feature maps are then divided into grids to obtain the coordinates, category, and confidence score of the predicted target. Based on the prior anchor boxes and the ground truth target boxes, positive and negative samples are assigned according to the aspect ratio matching strategy. The cross-entropy loss function is used to calculate the category prediction loss, the binary cross-entropy loss function is used to calculate the confidence prediction loss, and the CIoU loss function is used to calculate the bounding box regression prediction loss, which describes the regression of the target box in terms of overlap area, center point distance, and aspect ratio. The specific expression of the loss function is as follows:

[0047] $loss = loss_{Cls} + loss_{Reg} + loss_{obj}$

[0048]

[0049]

[0050]

[0051] Here, one indicator function determines whether the j-th anchor box of the i-th grid is a positive sample, and a second indicator function determines whether it is a negative sample; $w_i$ and $h_i$ are the width and height of the j-th anchor box in the i-th grid, respectively; $\lambda_{coord}$ and $\lambda_{noobj}$ are the weighting coefficients for positive and negative samples; $C_i$ and its ground truth counterpart are the predicted value and the ground truth value of the i-th grid; $b$ is the model prediction box and $b^{gt}$ is the ground truth box, and when used as arguments of $\rho$ they denote the center points of the prediction box and the ground truth box, respectively; $c$ is the diagonal length of the minimum enclosing rectangle of the prediction box and the ground truth box; $\rho$ is the Euclidean distance; $\alpha$ is the weighting coefficient; and $v$ measures the consistency of the aspect ratio. The expressions for $\alpha$ and $v$ are as follows:

[0052]
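
For reference, the commonly published CIoU formulation defines the per-box regression term and the quantities $\alpha$ and $v$ as shown below. This is offered as an assumption about the intended expressions, which are not reproduced in this text; $L_{CIoU}$ and $IoU$ (the intersection over union of $b$ and $b^{gt}$) are names introduced here for illustration.

```latex
\begin{aligned}
L_{CIoU} &= 1 - IoU + \frac{\rho^{2}(b,\, b^{gt})}{c^{2}} + \alpha v \\
v &= \frac{4}{\pi^{2}} \left( \arctan\frac{w^{gt}}{h^{gt}} - \arctan\frac{w}{h} \right)^{2} \\
\alpha &= \frac{v}{(1 - IoU) + v}
\end{aligned}
```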

[0053] where $w$ and $h$ represent the width and height of the predicted box, respectively, and $w^{gt}$ and $h^{gt}$ represent the width and height of the ground truth box, respectively.
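
As a hedged illustration of the three aspects named above (overlap area, center point distance, and aspect ratio), the following sketch computes the CIoU loss for a single pair of boxes using the formulation commonly published in the literature. The box layout `(cx, cy, w, h)`, the function name `ciou_loss`, and the stabilizing constant `eps` are assumptions of this sketch, not details taken from this application.

```python
import math


def ciou_loss(pred, gt, eps=1e-9):
    """CIoU loss for two boxes given as (cx, cy, w, h)."""
    px, py, pw, ph = pred
    gx, gy, gw, gh = gt
    # Corner coordinates of both boxes.
    p_x1, p_y1, p_x2, p_y2 = px - pw / 2, py - ph / 2, px + pw / 2, py + ph / 2
    g_x1, g_y1, g_x2, g_y2 = gx - gw / 2, gy - gh / 2, gx + gw / 2, gy + gh / 2
    # Aspect 1: overlap area, expressed as IoU.
    iw = max(0.0, min(p_x2, g_x2) - max(p_x1, g_x1))
    ih = max(0.0, min(p_y2, g_y2) - max(p_y1, g_y1))
    inter = iw * ih
    union = pw * ph + gw * gh - inter
    iou = inter / (union + eps)
    # Aspect 2: squared center distance rho^2 over the squared diagonal c^2
    # of the minimum rectangle enclosing both boxes.
    rho2 = (px - gx) ** 2 + (py - gy) ** 2
    cw = max(p_x2, g_x2) - min(p_x1, g_x1)
    ch = max(p_y2, g_y2) - min(p_y1, g_y1)
    c2 = cw ** 2 + ch ** 2 + eps
    # Aspect 3: aspect-ratio consistency v and its weight alpha.
    v = (4 / math.pi ** 2) * (math.atan(gw / gh) - math.atan(pw / ph)) ** 2
    alpha = v / ((1 - iou) + v + eps)
    return 1 - iou + rho2 / c2 + alpha * v


same = ciou_loss((5.0, 5.0, 4.0, 2.0), (5.0, 5.0, 4.0, 2.0))
apart = ciou_loss((0.0, 0.0, 2.0, 2.0), (10.0, 0.0, 2.0, 2.0))
print(same < 1e-6, apart > 1.0)
```

Identical boxes give a loss close to zero, while disjoint boxes are still penalized through the center distance term even though their overlap is zero.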

[0054] Step S13: Merge the two branches in each of the first-stage processing module, the second-stage processing module, and the third-stage processing module of the intermediate model into a single branch to obtain the target detection model.

[0055] Specifically, structural reparameterization is used. Structural reparameterization means first constructing a set of structures (generally used for training) and then equivalently converting their parameters into another set of parameters (generally used for inference), so that the training-time structures are equivalently converted into the inference-time structures. In the first-stage processing module and the third-stage processing module, the 1×1 basic convolutional branch and the batch normalization branch are merged into a single 1×1 convolutional branch. In addition, the two 3×3 channel-wise convolutional branches in the second-stage processing module are merged into a single 3×3 convolutional branch, yielding the target detection model.
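
The merging of a 1×1 convolutional branch and a batch normalization branch can be illustrated with a minimal sketch. It assumes, as in common reparameterization schemes, that the convolutional branch is followed by its own batch normalization and that the batch-normalization-only branch is treated as a 1×1 convolution with an identity weight; neither detail is confirmed by this application. Because a 1×1 convolution acts on each spatial position independently, the sketch operates on a single channel vector. All function names are hypothetical.

```python
import math


def conv1x1(weight, bias, x):
    """1x1 convolution at one spatial position: a matrix-vector product."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]


def batch_norm(y, gamma, beta, mean, var, eps=1e-5):
    """Inference-time batch normalization with fixed running statistics."""
    return [g * (v - m) / math.sqrt(s + eps) + bt
            for v, g, bt, m, s in zip(y, gamma, beta, mean, var)]


def fuse_conv_bn(weight, bias, gamma, beta, mean, var, eps=1e-5):
    """Fold batch-norm statistics into the conv weight and bias."""
    fused_w, fused_b = [], []
    for o, row in enumerate(weight):
        scale = gamma[o] / math.sqrt(var[o] + eps)
        fused_w.append([scale * w for w in row])
        fused_b.append(beta[o] + (bias[o] - mean[o]) * scale)
    return fused_w, fused_b


def identity_weight(channels):
    """A BN-only branch equals a 1x1 conv whose weight is the identity matrix."""
    return [[1.0 if i == j else 0.0 for j in range(channels)] for i in range(channels)]


def merge_branches(branches):
    """Sum the fused (weight, bias) pairs of parallel branches."""
    w = [row[:] for row in branches[0][0]]
    b = branches[0][1][:]
    for wi, bi in branches[1:]:
        for o in range(len(w)):
            b[o] += bi[o]
            for i in range(len(w[o])):
                w[o][i] += wi[o][i]
    return w, b


x = [0.5, -1.0]
W, b = [[1.0, 2.0], [-0.5, 0.3]], [0.1, -0.2]
bn1 = dict(gamma=[1.2, 0.8], beta=[0.0, 0.5], mean=[0.3, -0.1], var=[1.5, 0.7])
bn2 = dict(gamma=[0.9, 1.1], beta=[0.2, -0.3], mean=[0.0, 0.4], var=[2.0, 0.5])

# Training-time output: conv+BN branch plus BN-only branch.
two_branch = [p + q for p, q in zip(batch_norm(conv1x1(W, b, x), **bn1),
                                    batch_norm(x, **bn2))]

# Inference-time output: one merged 1x1 convolution.
merged_w, merged_b = merge_branches([
    fuse_conv_bn(W, b, **bn1),
    fuse_conv_bn(identity_weight(2), [0.0, 0.0], **bn2),
])
single_branch = conv1x1(merged_w, merged_b, x)
print(all(abs(p - q) < 1e-9 for p, q in zip(two_branch, single_branch)))
```

The merged branch reproduces the two-branch output up to floating-point rounding, which is what makes the conversion equivalent rather than approximate.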

[0056] Referring to Figures 9 and 10, Figure 9 is a network structure diagram of an embodiment of the first-level parameter channel shuffling unit in the inference stage provided in this application, and Figure 10 is a network structure diagram of an embodiment of the second-level parameter channel shuffling unit in the inference stage provided in this application. In the first-stage processing module and the third-stage processing module of both units, the 1×1 basic convolutional branch controlled by hyperparameter k and the batch normalization branch are converted into a single 1×1 convolutional branch by fusing convolution and batch normalization, based on the homogeneity of convolution. Similarly, in the second-stage processing module, the 3×3 channel-wise convolutional branch controlled by hyperparameter k and the separate 3×3 channel-wise convolutional branch are converted into a single 3×3 convolutional branch in the same way, yielding the target detection model.
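
The merging of the two 3×3 channel-wise branches can be sketched in the same spirit. The sketch assumes that any batch normalization has already been folded into each branch's kernel and bias, and it does not model the hyperparameter k. Since a channel-wise convolution processes every channel independently, a single channel suffices; the function names are hypothetical.

```python
def depthwise_conv3x3(channel, kernel, bias=0.0):
    """Valid 3x3 cross-correlation of one channel (list of rows) with one kernel."""
    h, w = len(channel), len(channel[0])
    out = []
    for r in range(h - 2):
        row = []
        for c in range(w - 2):
            acc = bias
            for i in range(3):
                for j in range(3):
                    acc += kernel[i][j] * channel[r + i][c + j]
            row.append(acc)
        out.append(row)
    return out


def merge_kernels(k1, b1, k2, b2):
    """Convolution is linear, so two parallel kernels can be summed into one."""
    merged = [[k1[i][j] + k2[i][j] for j in range(3)] for i in range(3)]
    return merged, b1 + b2


x = [[1.0, 2.0, 0.0, -1.0],
     [0.5, 1.5, 2.5, 3.0],
     [-2.0, 0.0, 1.0, 4.0],
     [3.0, -1.0, 0.5, 2.0]]
k1, b1 = [[0.1, 0.2, 0.3], [0.0, -0.1, 0.4], [0.5, 0.6, -0.2]], 0.05
k2, b2 = [[-0.3, 0.1, 0.0], [0.2, 0.2, 0.2], [0.1, -0.4, 0.3]], -0.02

# Training-time output: sum of the two parallel 3x3 branches.
ya = depthwise_conv3x3(x, k1, b1)
yb = depthwise_conv3x3(x, k2, b2)
two_branch = [[p + q for p, q in zip(ra, rb)] for ra, rb in zip(ya, yb)]

# Inference-time output: a single 3x3 branch with the merged kernel.
k, b = merge_kernels(k1, b1, k2, b2)
single_branch = depthwise_conv3x3(x, k, b)
```

At inference time only the merged kernel is evaluated, so the second branch costs neither extra parameters nor extra computation.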

[0057] Optionally, in one embodiment, the target detection model and its stored weight parameters can be converted and quantized to form an offline network model, which is then deployed to an embedded platform or mobile edge device to detect target data and output the detection results.
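
This application does not specify the quantization scheme. Purely as an assumed example, symmetric per-tensor 8-bit quantization of a list of weights might look like the sketch below; the function names and the choice of the range [-127, 127] are illustrative assumptions.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization: w ~= scale * q with q in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Map the stored integers back to approximate floating-point weights."""
    return [v * scale for v in q]


weights = [0.5, -1.27, 0.0, 1.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
print(q, scale)
```

Each restored weight differs from the original by at most half a quantization step, which is the accuracy cost traded for the smaller stored model.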

[0058] In contrast to the prior art, this application provides a training method for a target detection model, comprising: constructing an initial model, wherein the initial model includes a backbone network, a neck network, and a head network connected in sequence, the backbone network includes a feature extraction module, the feature extraction module includes a reparameterized channel shuffling unit, and the reparameterized channel shuffling unit includes a branch formed by a channel separation module, a first-stage processing module, a second-stage processing module, a third-stage processing module, a fully connected module, and a channel shuffling module connected in sequence, wherein the first-stage processing module and the third-stage processing module each include a convolutional branch and a batch normalization branch, and the second-stage processing module includes two convolutional branches; inputting sample data into the initial model for training to obtain an intermediate model; and adjusting the two branches in each of the first-stage processing module, the second-stage processing module, and the third-stage processing module of the intermediate model into a single branch to obtain the target detection model. In this way, the reparameterized channel shuffling unit retains the original lightweight network structure as far as possible while expanding it horizontally, which lets the network focus more on extracting local features during training. Furthermore, structural reparameterization converts the multi-path structure into a single-path structure for the inference phase, which preserves the high accuracy of the model while reducing the number of model parameters and the memory and computing resources required to run the model, and achieves faster inference. As a result, the model can be deployed on embedded devices and mobile devices.
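
For orientation, the channel separation and channel shuffling steps of such a unit can be sketched on a plain list of channels. The sketch assumes the ShuffleNet-style convention of splitting the channels into two halves and interleaving two groups after the branches are joined, and it assumes the branches are joined by concatenation; this application does not state these details, and the function names are hypothetical.

```python
def channel_split(channels):
    """Split the channel list into two halves, one per branch."""
    half = len(channels) // 2
    return channels[:half], channels[half:]


def channel_shuffle(channels, groups=2):
    """Interleave channels across groups: view as (groups, n/groups), transpose, flatten."""
    n = len(channels)
    assert n % groups == 0, "channel count must be divisible by groups"
    per_group = n // groups
    return [channels[g * per_group + i] for i in range(per_group) for g in range(groups)]


left, right = channel_split(["c0", "c1", "c2", "c3"])
# The per-branch processing is omitted here; the branches are simply rejoined.
print(channel_shuffle(left + right))  # ['c0', 'c2', 'c1', 'c3']
```

Shuffling after the join lets information from both branches mix in the next unit, even though each branch only ever sees half of the channels.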

[0059] Figure 11 is a schematic diagram of an embodiment of the training device for the target detection model provided in this application. The training device 110 includes a memory 111 and a processor 112. The memory 111 is used to store program data, and the processor 112 is used to execute the program data to implement the above-described training method for the target detection model.

[0060] Figure 12 is a schematic diagram of an embodiment of the computer-readable storage medium provided in this application. The computer-readable storage medium 120 stores program data 130, which, when executed by a processor, implements the above-described training method for the target detection model.

[0061] In the above embodiments, the processor can also be referred to as a CPU (Central Processing Unit). The processor may be an integrated circuit chip with signal processing capabilities. The processor can also be a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), or other programmable logic devices, discrete gate or transistor logic devices, or discrete hardware components. A general-purpose processor can be a microprocessor or any conventional processor. Furthermore, the processor can be implemented using multiple integrated circuit chips.

[0062] In the above embodiments, the memory or computer-readable storage medium can be a USB flash drive, a portable hard drive, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk, or a medium that can store program data. Alternatively, it can be a server that stores the program data, which can send the stored program data to other devices for execution or run the stored program data itself.

[0063] In the several embodiments provided in this application, it should be understood that the disclosed methods and devices can be implemented in other ways. For example, the device embodiments described above are merely illustrative: the division of modules or units is only a logical functional division, and other divisions are possible in actual implementation; multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed.

[0064] The units described as separate components may or may not be physically separate. The components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units can be selected to achieve the purpose of this embodiment, depending on actual needs.

[0065] Furthermore, the functional units in the various embodiments of this application can be integrated into one processing unit, or each unit can exist physically separately, or two or more units can be integrated into one unit. The integrated unit can be implemented in hardware or as a software functional unit.

[0066] The above description is merely an embodiment of this application and does not limit the patent scope of this application. Any equivalent structural or procedural transformations made using the content of this application's specification and drawings, or direct or indirect applications in other related technical fields, are similarly included within the patent protection scope of this application.

Claims

1. A target detection method in a mobile edge device, characterized in that the method comprises:
determining a target detection model, wherein the training process of the target detection model includes:
constructing an initial model comprising a backbone network, a neck network, and a head network connected in sequence, wherein the backbone network comprises a first feature extraction module, a second feature extraction module, a third feature extraction module, and a fourth feature extraction module connected in sequence; each of the second, third, and fourth feature extraction modules includes a first-level parameter channel shuffling unit with a stride of 1 and a second-level parameter channel shuffling unit with a stride of 2; both the first-level and second-level parameter channel shuffling units have a multi-path structure, which includes a branch formed by sequentially connecting a channel separation module, a first-stage processing module, a second-stage processing module, a third-stage processing module, a fully connected module, and a channel shuffling module; the first-stage and third-stage processing modules each include a convolutional branch and a batch normalization branch; and the second-stage processing module includes two convolutional branches;
inputting sample image data into the initial model for training to obtain an intermediate model; and
adjusting the two branches in each of the first-stage processing module, the second-stage processing module, and the third-stage processing module of the intermediate model into a single branch to obtain the target detection model;
converting the target detection model into an offline network model; and
deploying the offline network model to the mobile edge device to detect target image data and output detection results.

2. The method according to claim 1, characterized in that the first-stage processing module and the third-stage processing module each include a 1×1 basic convolutional branch and a batch normalization branch, and the second-stage processing module includes two 3×3 channel-wise convolutional branches.

3. The method according to claim 2, characterized in that adjusting the two branches in each of the first-stage processing module, the second-stage processing module, and the third-stage processing module of the intermediate model into a single branch includes:
merging the 1×1 basic convolutional branch and the batch normalization branch in each of the first-stage processing module and the third-stage processing module into a single 1×1 convolutional branch; and
merging the two 3×3 channel-wise convolutional branches in the second-stage processing module into a single 3×3 convolutional branch.

4. The method according to claim 1, characterized in that the first feature extraction module is a Stem module.

5. The method according to claim 1, characterized in that the second feature extraction module includes one first-level parameter channel shuffling unit with a stride of 1 and three second-level parameter channel shuffling units with a stride of 2; the third feature extraction module includes one first-level parameter channel shuffling unit and seven second-level parameter channel shuffling units; and the fourth feature extraction module includes one first-level parameter channel shuffling unit and one second-level parameter channel shuffling unit.

6. The method according to claim 5, characterized in that:
the first-level parameter channel shuffling unit includes a first branch formed by sequentially connecting the channel separation module, the fully connected module, and the channel shuffling module, and a second branch formed by sequentially connecting the channel separation module, the first-stage processing module, the second-stage processing module, the third-stage processing module, the fully connected module, and the channel shuffling module; and
the second-level parameter channel shuffling unit includes a first branch formed by sequentially connecting the channel separation module, the first-stage processing module, the second-stage processing module, the fully connected module, and the channel shuffling module, and a second branch formed by sequentially connecting the channel separation module, the first-stage processing module, the second-stage processing module, the third-stage processing module, the fully connected module, and the channel shuffling module.

7. The method according to claim 1, characterized in that the neck network includes a first convolution module, a transposed convolutional cascade module, and a channel-wise convolutional cascade module connected in sequence.

8. The method according to claim 7, characterized in that:
the first convolution module includes a first convolution unit, a second convolution unit, and a third convolution unit, wherein the first convolution unit is connected to the second feature extraction module, the second convolution unit is connected to the third feature extraction module, and the third convolution unit is connected to the fourth feature extraction module;
the transposed convolutional cascade module includes a first transposed convolutional cascade module and a second transposed convolutional cascade module, wherein the first transposed convolutional cascade module connects the second convolution unit and the third convolution unit, and is used to transform the second feature map output by the second convolution unit and the third feature map output by the third convolution unit into a fourth feature map, and the second transposed convolutional cascade module connects the first convolution unit and the first transposed convolutional cascade module, and is used to transform the first feature map output by the first convolution unit and the fourth feature map into a fifth feature map; and
the channel-wise convolutional cascade module includes a first channel-wise convolutional cascade module and a second channel-wise convolutional cascade module, wherein the first channel-wise convolutional cascade module is connected to the first transposed convolutional cascade module and the second transposed convolutional cascade module, and is used to convert the fifth feature map into a first feature map and to convert the fourth feature map and the fifth feature map into a second feature map, and the second channel-wise convolutional cascade module is connected to the first channel-wise convolutional cascade module and the third convolution unit, and is used to convert the third feature map into a third feature map.

9. The method according to claim 1, characterized in that the head network includes a first prediction unit, a second prediction unit, and a third prediction unit, wherein:
the first prediction unit includes a second convolution module, a first compact convolution module, a second compact convolution module, a third convolution module, and a category detection module connected in sequence;
the second prediction unit includes the second convolution module, a third compact convolution module, a fourth compact convolution module, a fourth convolution module, and a regression detection module connected in sequence; and
the third prediction unit includes the second convolution module, the third compact convolution module, the fourth compact convolution module, a fifth convolution module, and a confidence detection module connected in sequence.

10. The method according to claim 1, characterized in that inputting the sample image data into the initial model for training includes:
acquiring the sample image data;
dividing the sample image data into a training set, a validation set, and a test set according to a set ratio;
preprocessing the training set; and
inputting the preprocessed training set into the initial model for training.

11. A mobile edge device, characterized in that it includes a processor and a memory, the memory being used to store program data, and the processor being used to execute the program data to implement the target detection method according to any one of claims 1-10.

12. A readable storage medium, characterized in that the readable storage medium stores a computer program that, when executed by a processor, implements the target detection method according to any one of claims 1-10.