A training method of a 3D target detection framework based on an auxiliary denoising task and sparse relation modeling

A training method for a 3D object detection framework based on auxiliary denoising tasks and sparse relation modeling optimizes feature processing and the denoising task. It addresses the failure of existing 3D object detection frameworks to fully consider performance and computational complexity in their multi-dimensional improvements, thereby achieving more efficient 3D object detection.

CN116630740B · Active · Publication Date: 2026-06-23 · BRETON TECHNOLOGY CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
BRETON TECHNOLOGY CO LTD
Filing Date
2023-05-10
Publication Date
2026-06-23

AI Technical Summary

Technical Problem

Existing 3D object detection frameworks have failed to fully consider performance and computational complexity when making multi-dimensional improvements, resulting in poor detection performance.

Method used

We employ a 3D object detection framework training method based on auxiliary denoising tasks and sparse relation modeling. Through pre-training, feature aggregation, and 3D prediction steps, combined with a cross-attention mechanism of Dynamic 6d AnchorBox and scale modulation, we optimize feature processing and denoising tasks and reduce unnecessary computation.

Benefits of technology

This improves the performance of 3D object detection and reduces computational complexity, resulting in more efficient 3D object detection results.

✦ Generated by Eureka AI based on patent content.

Abstract

The application discloses a training method for a 3D object detection framework based on an auxiliary denoising task and sparse relation modeling, and belongs to the technical field of machine learning. The method comprises the following steps: S100, pre-training: pre-training on all required images in 2D pixel space to obtain initial parameters; S200, preprocessing: outputting features; S300, feature convergence: dividing the multiple camera views into a front group and a back group, converging the features of the images belonging to the front group into a front-view main image, and converging the features of the images belonging to the back group into a back-view main image; and S400, 3D prediction: inputting the converged features into an FFN to perform box-head prediction and obtain 3D proposals. In the application, all features converge in a single direction only, from specific details to abstract generalization, and never in the reverse direction, which reduces unnecessary computation.

Description

Technical Field

[0001] This invention belongs to the field of machine learning technology, specifically a training method for a 3D object detection framework based on auxiliary denoising tasks and sparse relation modeling.

Background Technology

[0002] Since the Transformer, an attention mechanism for modeling relationships, was introduced into Computer Vision (CV), numerous multi-camera or multi-modal 3D object detection frameworks have emerged in the BEV (bird's-eye view) field. Each framework proposes novel structures or combinations of extensions, and experimental results have demonstrated the feasibility of these structures. However, a comprehensive understanding of the various dimensions and directions of 2D and 3D object detection reveals that each existing model structure is incomplete: it considers improvements in only one dimension while ignoring the significant impact of other aspects on performance, computational complexity, and space complexity. To achieve comprehensive and accurate perception in 3D object detection tasks, multiple dimensions must be considered simultaneously, and each must be improved to its current limit. In this sense, all existing technologies are incomplete.

Summary of the Invention

[0003] 1. The technical problem that the invention aims to solve

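[0004] The purpose of this invention is to solve the problem that existing 3D object detection frameworks improve only a single dimension and fail to fully consider performance and computational complexity, which results in poor detection performance.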

[0005] 2. Technical Solution

[0006] To achieve the above objectives, the technical solution provided by this invention is as follows:

[0007] The training method for a 3D object detection framework based on assisted denoising tasks and sparse relation modeling according to claim 1 of the present invention includes the following steps:

[0008] S100, Pre-training: Pre-train all eligible images in 2D pixel space to obtain the initial parameters of the network's backbone and encoder; S200, Preprocessing: Output features after the images pass through the backbone and encoder.

[0009] S300, Feature Convergence, divides multiple camera views into two groups: front and back. Images from the front angle are used as the main image of the front view, and features of several images belonging to the front converge to the main image of the front view. Images from the back angle are used as the main image of the back view, and features of several images belonging to the back converge to the main image of the back view.

[0010] S400, 3D prediction: The converged features are input into FFN to perform boxhead prediction to obtain 3D proposals.

[0011] Preferably, the convergence method in step S300 is to perform deformable-attention on the C5 feature layer, which includes 3D position embedding.

[0012] Preferably, it also includes a 3D position prior design that can be iteratively updated and takes into account the target scale: Dynamic6dAnchorBox, specifically...

[0013] The (x, y, z) components of (x, y, z, w, h, d) are each independently embedded into 256 dimensions using sine and cosine functions, and scale information is injected on this basis by adding terms in w, h, and d to the formula.

[0014] By dividing the x, y, and z terms by w, h, and d respectively, an adaptive factor is set for each of the x, y, and z terms for adjustment.

[0015] Finally, the concatenation is used as the position embedding and fed into the decoder's cross-attention part for attention calculation; the prior of the unique scale-modulated cross-attention mechanism can be expressed by the following formula (1):

[0016]

[0017] Formula Explanation: (x, y, z) represents the query position part of the Decoder, serving as the query in the Transformer network. x_ref, y_ref, and z_ref represent the query position part of the Encoder, serving as the keys in the Transformer network. `positionembedding` is the positional encoding function; both positions are independently positionally encoded (PE) in the x, y, and z directions and then concatenated. `*` represents the dot product, whose result is the sum of the element-wise products of the two operands. Positions x, y, and z are each encoded into 256 dimensions using the sine and cosine functions. w, h, and d are the width, height, and depth of the bounding box. MLP is a multilayer perceptron model and σ is the sigmoid function. w_{q,ref}, h_{q,ref}, and d_{q,ref} are self-adjusting factors generated from the query content part C_q by the MLP and the σ function, with a value range of 0-1. D is the dimension, 256, after positional encoding.
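As an illustration of the per-axis encoding described above, the following Python sketch encodes one scalar coordinate into 256 interleaved sine and cosine values and concatenates the three axes. The temperature of 10000 and the names `sincos_embed` and `embed_center` are assumptions borrowed from common Transformer practice; the text only states that sine and cosine functions produce 256 dimensions per axis.

```python
import math
from typing import List


def sincos_embed(pos: float, dim: int = 256, temperature: float = 10000.0) -> List[float]:
    """Encode one scalar coordinate as `dim` interleaved sin/cos values.

    The temperature value is an assumption, not taken from the patent text.
    """
    if dim % 2 != 0:
        raise ValueError("dim must be even")
    out: List[float] = []
    for i in range(dim // 2):
        freq = temperature ** (2 * i / dim)  # one frequency per sin/cos pair
        out.append(math.sin(pos / freq))
        out.append(math.cos(pos / freq))
    return out


def embed_center(x: float, y: float, z: float, dim: int = 256) -> List[float]:
    """Encode x, y, z independently and concatenate the three embeddings."""
    return sincos_embed(x, dim) + sincos_embed(y, dim) + sincos_embed(z, dim)
```

With the default settings, `embed_center` returns a vector of length 3 × 256 = 768.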

[0018] Preferably, this also includes adding a bypass denoising task to the decoder, that is, there are two types of queries input to the decoder.

[0019] One type is the 3D proposal predicted from the front-view/back-view aggregation in the encoder; its 6-dimensional anchor is converted into a query embedding according to formula (1), and its target content part is learnable;

[0020] The other type of query is used for the denoising task. Its content (target) part is a class, which covers the target and the background; the background is set to the maximum value (no object). The class embedding is 256-dimensional. The noise applied to the query embedding can be summarized as: center point displacement and scale scaling.

[0021] Preferably, the displacement of the center point is specifically:

[0022] First, sample a perturbation parameter λ1 from the uniform distribution, and then calculate the offsets corresponding to the center point (x, y, z), ensuring that the center point remains within the original box after the perturbation.

[0023] Preferably, the scaling is specifically as follows:

[0024] A perturbation parameter λ2 is sampled from the uniform distribution, and then the offsets corresponding to the width, height, and depth are calculated respectively: |Δw| = λ2·W, |Δh| = λ2·H, |Δd| = λ2·D. Finally, the scaled width, height, and depth are obtained.

[0025] Preferably, the query for the denoising task includes a target class and a background class. After calculation by the decoder layer, the output directly predicts the box and performs loss calculation with the corresponding ground truth box, and completes backpropagation. The query for Hungarian matching comes from the 3D proposal. The result of the decoder calculation is then processed by Hungarian matching to obtain the corresponding matching box, and loss calculation is performed with the matching box. During the training phase, the denoising task and the matching task are performed simultaneously, while only the Hungarian matching task is performed during inference.

[0026] Preferably, in the iterative refinement structure of Deformable DETR, the output reference point (rp) of each layer is detached, which stops the gradient from propagating between layers. In this method, by contrast, the parameters of decoder layer 1 participate in the loss calculation of decoder layer 2 through the predicted 3D box 1, and therefore receive backpropagated gradients twice. Layer 2 to layer 3 follow the same pattern, for a total of 6 layers.
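One reading of the paragraph above is that the parameters of decoder layer i are updated by the loss of layer i and by the loss of layer i+1, while the last layer sees only its own loss. The short Python sketch below makes that bookkeeping explicit. The function name `losses_reaching_layer` is hypothetical, and the code only enumerates which losses reach which layer; it does not perform any gradient computation.

```python
from typing import List


def losses_reaching_layer(layer: int, num_layers: int = 6) -> List[int]:
    """Return the 1-based indices of the per-layer losses that update `layer`.

    Assumed rule: layer i receives gradients from loss i and, through its
    predicted box, from loss i + 1. The final layer has no successor.
    """
    if not 1 <= layer <= num_layers:
        raise ValueError("layer must be in 1..num_layers")
    reached = [layer]
    if layer < num_layers:
        reached.append(layer + 1)
    return reached
```

For example, `losses_reaching_layer(1)` lists losses 1 and 2, matching the statement that layer 1 participates in the loss calculation of layer 2.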

[0027] 3. Beneficial effects

[0028] Compared with the prior art, the technical solution provided by this invention has the following advantages:

[0029] The present invention discloses a training method for a 3D object detection framework based on assisted denoising tasks and sparse relation modeling, comprising the following steps: S100, pre-training: pre-training all eligible images in 2D pixel space to obtain the initialization parameters of the network's backbone and encoder; S200, preprocessing: outputting features after the images pass through the backbone and encoder; S300, feature convergence: dividing multiple camera views into front and rear groups, with images from the front angle used as the front view main image, and features from several front images converged into the front view main image, and images from the rear angle used as the back view main image, and features from several rear images converged into the back view main image; S400, 3D prediction: inputting the converged features into the FFN to perform box-head prediction to obtain 3D proposals. Unlike deformable attention, which samples uniformly across all feature layers without directionality, this invention converges all features in a single direction only, from concrete details to abstract overviews, and never in the reverse direction. This is more in line with the process of human observation, from details to the whole; it reduces unnecessary computation and eliminates the unreasonable reverse process of converging from abstraction back to details.

Attached Figure Description

[0030] Figure 1 is a schematic diagram of the 2D pre-trained network structure in this embodiment;

[0031] Figure 2 is a schematic diagram of the neural network structure for generating 3D proposals in this embodiment;

[0032] Figure 3 is a schematic diagram of the neural network structure of the decoder part in this embodiment;

[0033] Figure 4 is a schematic diagram of the neural network structure of the iterative refinement part in this embodiment, in which each decoder layer participates in two loss calculations.

Detailed Implementation

[0034] To enable those skilled in the art to better understand the present application, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present application, and not all embodiments. Based on the embodiments in the present application, all other embodiments obtained by those of ordinary skill in the art without creative effort should fall within the scope of protection of the present application.

[0035] It should be noted that the terms "first," "second," etc., in the specification, claims, and accompanying drawings of this application are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. It should be understood that such data can be interchanged where appropriate for the embodiments of this application described herein. Furthermore, the terms "comprising" and "having," and any variations thereof, are intended to cover non-exclusive inclusion; for example, a process, method, system, product, or apparatus that comprises a series of steps or units is not necessarily limited to those steps or units explicitly listed, but may include other steps or units not explicitly listed or inherent to such processes, methods, products, or apparatus.

[0036] In this application, the terms "upper," "lower," "left," "right," "front," "rear," "top," "bottom," "inner," "outer," "middle," "vertical," "horizontal," "lateral," and "longitudinal" indicate the orientation or positional relationship based on the orientation or positional relationship shown in the accompanying drawings. These terms are primarily for the purpose of better describing this application and its embodiments, and are not intended to limit the indicated device, element, or component to having a specific orientation, or to be constructed and operated in a specific orientation.

[0037] Furthermore, in addition to indicating location or positional relationship, some of the aforementioned terms may also have other meanings. For example, the term "above" may also be used in some cases to indicate a certain dependency or connection relationship. Those skilled in the art can understand the specific meaning of these terms in this application based on the specific circumstances.

[0038] Furthermore, the terms "installation," "setup," "equipped with," "connection," "linking," and "socketing" should be interpreted broadly. For example, they can refer to a fixed connection, a detachable connection, or an integral structure; they can refer to a mechanical connection or an electrical connection; they can refer to a direct connection or an indirect connection through an intermediate medium, or an internal connection between two devices, components, or parts. Those skilled in the art can understand the specific meaning of these terms in this application based on the specific circumstances.

[0039] It should be noted that, unless otherwise specified, the embodiments and features described in this application can be combined with each other. This application will now be described in detail with reference to the accompanying drawings and embodiments.

[0040] Example 1

[0041] Referring to Figures 1-4, the training method for a 3D object detection framework based on assisted denoising tasks and sparse relation modeling in this embodiment includes the following steps:

[0042] S100, Pre-training: Pre-train all eligible images in 2D pixel space to obtain the initial parameters of the network's backbone and encoder. This ensures that the parameters of these two parts already contain rich semantic information before considering 3D spatial geometric information, allowing for more comprehensive subsequent steps.

[0043] S200, Preprocessing: After passing the image through the backbone and encoder, output the features;

[0044] S300, Feature Convergence, divides multiple camera views into two groups: front and back. Images from the front angle are used as the main image of the front view, and features of several images belonging to the front converge to the main image of the front view. Images from the back angle are used as the main image of the back view, and features of several images belonging to the back converge to the main image of the back view.

[0045] S400, 3D prediction: The converged features are input into FFN to perform boxhead prediction to obtain 3D proposals.
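The grouping in step S300 can be sketched as follows. This Python example assumes the six nuScenes camera channel names and assumes that `CAM_FRONT` and `CAM_BACK` act as the two main views; the text does not say which cameras belong to which group, so both choices, along with the function name `group_views`, are illustrative only.

```python
from typing import Dict, List

# Camera channel names as used in the nuScenes dataset.
NUSCENES_CAMERAS = [
    "CAM_FRONT", "CAM_FRONT_LEFT", "CAM_FRONT_RIGHT",
    "CAM_BACK", "CAM_BACK_LEFT", "CAM_BACK_RIGHT",
]


def group_views(camera_names: List[str]) -> Dict[str, Dict[str, object]]:
    """Split camera views into a front and a back group, each with a main view.

    Assumption: any name containing "FRONT" goes to the front group,
    everything else to the back group.
    """
    groups: Dict[str, Dict[str, object]] = {
        "front": {"main": None, "aux": []},
        "back": {"main": None, "aux": []},
    }
    for name in camera_names:
        side = "front" if "FRONT" in name else "back"
        if name in ("CAM_FRONT", "CAM_BACK"):
            groups[side]["main"] = name       # features converge into this view
        else:
            groups[side]["aux"].append(name)  # features converge from these views
    return groups
```

Features of the `aux` views in each group would then be converged into that group's `main` view before box-head prediction.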

[0046] This embodiment, through a comprehensive and profound understanding of all dimensions and directions of 2D and 3D object detection, weighs the advantages and disadvantages of existing 3D space substructures from the perspective of factors affecting performance, storage, and computational complexity, simplifying and eliminating some seemingly novel designs that are actually mathematically ineffective. It extends designs that significantly improve performance in the 2D space domain to 3D object detection. Thus, a simple yet comprehensive 3D object detection framework is constructed from a holistic perspective. This framework conforms to mathematical completeness, simplicity, and effectiveness, and ablation experiments demonstrate the feasibility and effectiveness of each key design point. By avoiding the shortcomings of various dimensional factors and extending the key performance improvement points in 2D object detection to 3D space, this neural network model significantly improves 3D object detection performance compared to existing neural network models that have advantages in a specific dimension, while relatively reducing storage space and computational complexity.

[0047] The convergence method in step S300 is to perform deformable-attention on the C5 feature layer, which includes 3D position embedding.

[0048] It also includes a 3D position prior design that can be iteratively updated and takes into account the target scale: Dynamic6dAnchorBox, specifically...

[0049] The (x, y, z) components of (x, y, z, w, h, d) are each independently embedded into 256 dimensions using sine and cosine functions, and scale information is injected on this basis by adding terms in w, h, and d to the formula.

[0050] By dividing the x, y, and z terms by w, h, and d respectively, an adaptive factor is set for each of the x, y, and z terms for adjustment.

[0051] Finally, the concatenation is used as the position embedding and fed into the decoder's cross-attention part for attention calculation. The prior of the unique scale-modulated cross-attention mechanism can be expressed by the following formula (1):

[0052]

[0053] Formula Explanation: (x, y, z) represents the query position part of the Decoder, serving as the query in the Transformer network. x_ref, y_ref, and z_ref represent the query position part of the Encoder, serving as the keys in the Transformer network. `positionembedding` is the positional encoding function; both positions are independently positionally encoded (PE) in the x, y, and z directions and then concatenated. `*` represents the dot product, whose result is the sum of the element-wise products of the two operands. Positions x, y, and z are each encoded into 256 dimensions using the sine and cosine functions. w, h, and d are the width, height, and depth of the bounding box. MLP is a multilayer perceptron model and σ is the sigmoid function. w_{q,ref}, h_{q,ref}, and d_{q,ref} are self-adjusting factors generated from the query content part C_q by the MLP and the σ function, with a value range of 0-1. D is the dimension, 256, after positional encoding.
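Formula (1) itself does not appear in this text. Based only on the symbol definitions above, and by analogy with the width- and height-modulated attention used in 2D anchor-box detectors, one plausible form is sketched below. This is an assumption rather than the patent's own expression; in particular, the name ModulateAttn and the placement of the 1/√D scaling are guesses.

```latex
\mathrm{ModulateAttn}\big((x,y,z),(x_{ref},y_{ref},z_{ref})\big)
  = \frac{1}{\sqrt{D}}\left(
      \mathrm{PE}(x)\cdot\mathrm{PE}(x_{ref})\,\frac{w_{q,ref}}{w}
    + \mathrm{PE}(y)\cdot\mathrm{PE}(y_{ref})\,\frac{h_{q,ref}}{h}
    + \mathrm{PE}(z)\cdot\mathrm{PE}(z_{ref})\,\frac{d_{q,ref}}{d}
    \right),
\qquad
(w_{q,ref},\,h_{q,ref},\,d_{q,ref}) = \sigma\big(\mathrm{MLP}(C_q)\big)
```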

[0054] The overall explanation of the formula is as follows. The (x, y, z) terms are divided by (w, h, d) because, when w is much larger than h and d, y and z remain unchanged while moving along the x-direction. To keep the attention response along the horizontal line y = i in the attention map highlighted over the larger extent w, the attention result must favor the y and z directions. Dividing the x, y, and z terms by w, h, and d achieves this: w is large and h and d are relatively small, so the y and z terms take larger values and make up a larger share of the total, and these terms do not change when moving along the x-direction within the range w. As a result, the final attention map between each anchor box and all feature points of the encoder takes the shape of an ellipse determined by the scale information w, h, and d, rather than a scale-independent circle.
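The elliptical effect described above can be checked numerically with a small sketch. The Python code below computes a scale-modulated attention logit as a sum of per-axis positional-encoding dot products, each multiplied by an adaptive factor and divided by the box size on that axis. The exact form of the logit, the temperature of 10000, the 1/√D scaling, and the names `pe` and `modulated_logit` are all assumptions made for illustration.

```python
import math
from typing import List, Sequence


def pe(pos: float, dim: int = 256, temperature: float = 10000.0) -> List[float]:
    """Sinusoidal encoding of one coordinate (interleaved sin/cos)."""
    out: List[float] = []
    for i in range(dim // 2):
        freq = temperature ** (2 * i / dim)
        out.extend((math.sin(pos / freq), math.cos(pos / freq)))
    return out


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(p * q for p, q in zip(a, b))


def modulated_logit(query_xyz, key_xyz, whd, factors=(1.0, 1.0, 1.0), dim=256) -> float:
    """Assumed form: sum over axes of PE(q)·PE(k) * factor / size, scaled by 1/sqrt(dim)."""
    total = 0.0
    for q, k, size, f in zip(query_xyz, key_xyz, whd, factors):
        total += dot(pe(q, dim), pe(k, dim)) * f / size
    return total / math.sqrt(dim)


# A wide box (w = 4, h = d = 1): the logit decays more slowly along x than along y.
whd = (4.0, 1.0, 1.0)
center = modulated_logit((0, 0, 0), (0, 0, 0), whd)
drop_x = center - modulated_logit((0, 0, 0), (0.5, 0, 0), whd)
drop_y = center - modulated_logit((0, 0, 0), (0, 0.5, 0), whd)
```

Here `drop_y` is four times `drop_x`, so the region of high attention stretches along the box's long axis, which is the ellipse the paragraph describes.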

[0055] This also includes adding a bypass denoising task to the decoder, where there are two types of queries input to the decoder.

[0056] One type is the 3D proposal predicted from the front-view/back-view aggregation in the encoder; its 6-dimensional anchor is converted into a query embedding according to formula (1), and its target content part is learnable;

[0057] The other type of query is used for the denoising task. Its content (target) part is a class, which covers the target and the background; the background is set to the maximum value (no object). The class embedding is 256-dimensional. The noise applied to the query embedding can be summarized as: center point displacement and scale scaling.

[0058] The displacement of the center point is specifically as follows:

[0059] First, sample a perturbation parameter λ1 from the uniform distribution, and then calculate the offsets corresponding to the center point (x, y, z), ensuring that the center point remains within the original box after the perturbation.
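A minimal Python sketch of the center displacement follows. The offset formulas are not given in this text, so the bound used here, at most λ1 times half the box size on each axis, is an assumption chosen because it guarantees that the shifted center stays inside the original box. The function name `shift_center` and the per-axis random direction are likewise hypothetical.

```python
import random


def shift_center(box, lam1=None, rng=random):
    """Perturb the center (x, y, z) of a box given as (x, y, z, w, h, d).

    Assumption: each offset is drawn within +/- lam1 * size / 2, so with
    lam1 in [0, 1) the new center cannot leave the original box.
    """
    x, y, z, w, h, d = box
    if lam1 is None:
        lam1 = rng.uniform(0.0, 1.0)  # perturbation parameter
    dx = rng.uniform(-1.0, 1.0) * lam1 * w / 2
    dy = rng.uniform(-1.0, 1.0) * lam1 * h / 2
    dz = rng.uniform(-1.0, 1.0) * lam1 * d / 2
    return (x + dx, y + dy, z + dz, w, h, d)
```

Passing a seeded `random.Random` instance as `rng` makes the perturbation reproducible.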

[0060] The scaling is specifically as follows:

[0061] A perturbation parameter λ2 is sampled from the uniform distribution, and then the offsets corresponding to the width, height, and depth are calculated respectively: |Δw| = λ2·W, |Δh| = λ2·H, |Δd| = λ2·D. Finally, the scaled width, height, and depth are obtained.
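The scale perturbation can be sketched in Python as below. The magnitudes follow the relations given above (the offset on each dimension equals λ2 times that dimension). The sign of each offset is not stated, so drawing it at random per dimension is an assumption, as is the function name `scale_box`.

```python
import random


def scale_box(box, lam2=None, rng=random):
    """Perturb the size (w, h, d) of a box given as (x, y, z, w, h, d).

    Magnitude per dimension: lam2 * size. The random sign is an assumption.
    """
    x, y, z, w, h, d = box
    if lam2 is None:
        lam2 = rng.uniform(0.0, 1.0)  # perturbation parameter

    def jitter(size):
        sign = rng.choice((-1.0, 1.0))
        return size + sign * lam2 * size

    return (x, y, z, jitter(w), jitter(h), jitter(d))
```

With λ2 strictly below 1, every perturbed dimension stays positive and below twice its original value.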

[0062] The denoising task query includes target class and background class. After the decoder layer calculates, the output directly predicts the box and performs loss calculation with the corresponding ground truth box and completes backpropagation. The query for Hungarian matching comes from 3D proposal. The result of the decoder calculation is then processed by Hungarian matching to obtain the corresponding matching box, and loss calculation is performed with the matching box. During the training phase, the denoising task and the matching task are performed simultaneously, while only the Hungarian matching task is performed during inference.
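To make the one-to-one matching step concrete, the Python sketch below finds the assignment of predicted boxes to ground-truth boxes with the lowest total cost. It is a brute-force search over permutations, not the Hungarian algorithm itself, although it returns the same optimal assignment on small inputs. It also uses only an L1 box distance as the cost, which is a simplification. The names `l1_cost` and `match_one_to_one` are hypothetical.

```python
from itertools import permutations
from typing import List, Sequence, Tuple

Box = Sequence[float]  # (x, y, z, w, h, d)


def l1_cost(a: Box, b: Box) -> float:
    """L1 distance between two boxes."""
    return sum(abs(p - q) for p, q in zip(a, b))


def match_one_to_one(pred_boxes: List[Box], gt_boxes: List[Box]) -> List[Tuple[int, int]]:
    """Return (pred_index, gt_index) pairs minimizing the total L1 cost.

    Brute force for illustration only; needs len(pred_boxes) >= len(gt_boxes).
    """
    best_cost, best = float("inf"), []
    for perm in permutations(range(len(pred_boxes)), len(gt_boxes)):
        cost = sum(l1_cost(pred_boxes[p], gt_boxes[g]) for g, p in enumerate(perm))
        if cost < best_cost:
            best_cost, best = cost, [(p, g) for g, p in enumerate(perm)]
    return best
```

Predictions left unmatched would be supervised as background. Denoising queries skip this step because each one is already tied to the ground-truth box it was generated from.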

[0063] The queries described above are one-to-one within a single group. Performing this task simultaneously across multiple groups results in multiple one-to-one relationships, or many-to-one. This allows the system to learn multiple one-to-one correspondences within a single iteration, thus significantly increasing efficiency. Since the total number of targets in each image is different, a fixed number of denoising groups would result in enormous computational overhead for images with a large number of targets. To keep the number of queries per image within a stable range, the number of denoising groups is determined by the total number of targets in the image: fewer groups for more targets, and vice versa. This ensures that the total number of queries in each iteration remains roughly the same.
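The rule that the number of denoising groups shrinks as the number of targets grows can be sketched as follows. The query budget of 200 and the function name `num_denoising_groups` are hypothetical; the text only states that the total number of queries per iteration should stay roughly constant.

```python
def num_denoising_groups(num_targets: int, query_budget: int = 200) -> int:
    """Choose how many denoising groups to build for one image.

    Assumption: groups * num_targets should stay close to a fixed budget,
    with at least one group whenever the image contains a target.
    """
    if num_targets <= 0:
        return 0  # nothing to denoise
    return max(1, query_budget // num_targets)
```

Under this budget, an image with 10 targets gets 20 groups and an image with 100 targets gets 2 groups, so both produce about 200 denoising queries.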

[0064] In the iterative refinement structure of Deformable DETR, the output reference point (rp) of each layer is detached, which stops the gradient from propagating between layers. In this embodiment, by contrast, the parameters of decoder layer 1 participate in the loss calculation of decoder layer 2 through the predicted 3D box 1, and therefore receive backpropagated gradients twice; layer 2 to layer 3 follow the same pattern, for a total of 6 layers. The sparse relation modeling part of this embodiment improves on deformable attention. Unlike deformable attention, which samples uniformly across all feature layers without directionality, in this embodiment, during the encoder process from the shallow layer C2 to the deep layer C5, C3 is generated by combining C2 and C3, C4 by combining C2, C3, and C4, and C5 by combining C2, C3, C4, and C5. All features converge in a single direction only, from specific details to abstract overview, and never in the opposite direction. This is more in line with how the human eye observes things, from details to the whole; it reduces unnecessary computation and eliminates the unreasonable reverse process of converging from abstraction back to details.
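The one-directional convergence can be expressed as a simple source table. In the Python sketch below, each feature level draws only on itself and on shallower levels. Treating C2 as drawing on itself alone is an assumption, since the text starts its enumeration at C3, and the name `sources_for` is hypothetical.

```python
from typing import List

LEVELS = ["C2", "C3", "C4", "C5"]  # shallow to deep


def sources_for(level: str) -> List[str]:
    """Return the levels that feed into `level`: itself and every shallower level."""
    idx = LEVELS.index(level)
    return LEVELS[: idx + 1]


# Count (target, source) pairs: one-directional versus all-to-all sampling.
one_directional_pairs = sum(len(sources_for(lv)) for lv in LEVELS)
all_to_all_pairs = len(LEVELS) ** 2
```

With four levels this gives 10 level pairs instead of 16, which illustrates where the reduction in computation comes from.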

[0065] The above method is used to perform end-to-end 3D object detection training on multiple servers equipped with high-performance GPUs until the loss function converges.

[0066] Training environment:

[0067] Hardware: 4 V100 GPUs, each with 32GB of video memory

[0068] Software: Ubuntu 20.04, CUDA 11.2, cuDNN 8.4, PyTorch 11.1, Anaconda 3.6

[0069] Experimental details:

[0070] Training dataset: nuScenes train

[0071] Validation dataset: nuScenes val

[0072] Backbone network: ResNet-50

[0073] Intermediate network: FPN

[0074] Optimizer: AdamW, weight decay 0.0001

[0075] Mini-batch: batch size 8 on 8 V100 GPUs

[0076] Training epochs: 50

[0077] Learning rate: Initially 2.5 * 10e-5, divided by 10 at epochs 27, 33, and 44 respectively.

[0078] Model parameter initialization: The backbone network is initialized with ImageNet pre-trained weights, and other layers are initialized with Xavier.

[0079] Data augmentation: Random horizontal flip, random resize (shortest side at least 480, maximum 900; longest side at maximum 1600, shortest 1333).

[0080] Weighting coefficient for classification loss: 2

[0081] Weighting coefficient for IoU (intersection-over-union) loss: 2

[0082] Weighting coefficient for L1 norm loss: 5

[0083] Number of queries: 300

[0084] Number of iterations: 6

[0085] Computing power unit: 8 V100 GPUs, each with 16GB of video memory.
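The step learning-rate schedule and the loss weighting listed above can be sketched in Python as follows. The function names `learning_rate` and `total_loss` are hypothetical. The initial learning rate is passed in by the caller rather than hard-coded, and whether the drop takes effect at the start of epochs 27, 33, and 44 under 0-based or 1-based counting is an assumption.

```python
def learning_rate(epoch: int, base_lr: float, milestones=(27, 33, 44), gamma: float = 0.1) -> float:
    """Step schedule: divide the learning rate by 10 at each milestone epoch."""
    drops = sum(1 for m in milestones if epoch >= m)
    return base_lr * gamma ** drops


def total_loss(cls_loss: float, iou_loss: float, l1_loss: float,
               w_cls: float = 2.0, w_iou: float = 2.0, w_l1: float = 5.0) -> float:
    """Weighted sum of the classification, IoU, and L1 loss terms."""
    return w_cls * cls_loss + w_iou * iou_loss + w_l1 * l1_loss
```

Over the 50 training epochs this schedule applies three decays, so the final learning rate is one thousandth of the initial value.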

[0086] The above-described embodiments are merely illustrative of certain implementations of the present invention, and are described in a relatively specific and detailed manner. However, they should not be construed as limiting the scope of the present invention. It should be noted that those skilled in the art can make various modifications and improvements without departing from the concept of the present invention, and these modifications and improvements are all within the scope of protection of the present invention. Therefore, the scope of protection of the present invention should be determined by the appended claims.

Claims

1. A training method for a 3D object detection framework based on assisted denoising tasks and sparse relation modeling, characterized in that it includes the following steps: S100, Pre-training: pre-train all eligible images in 2D pixel space to obtain the initial parameters of the network's backbone and encoder; S200, Preprocessing: after passing the images through the backbone and encoder, output the features; S300, Feature Convergence: divide multiple camera views into two groups, front and back; images from the front angle are used as the main image of the front view, and features of several images belonging to the front converge to the main image of the front view; images from the back angle are used as the main image of the back view, and features of several images belonging to the back converge to the main image of the back view; S400, 3D prediction: input the converged features into the FFN to perform box-head prediction to obtain 3D proposals; the convergence method in step S300 is to perform deformable attention on the C5 feature layer, which has a 3D position embedding; the method also includes a 3D position prior design that can be iteratively updated and takes the target scale into account: a 6-dimensional AnchorBbox, specifically (x, y, z, w, h, d), where x, y, z represent the coordinates of the center point of the AnchorBbox on the x-axis, y-axis, and z-axis of the coordinate system, and w, h, d represent the scale information of the AnchorBbox, namely width, height, and depth; x, y, and z in (x, y, z, w, h, d) are each position-embedded into 256 dimensions using sine and cosine functions, scale information w, h, d is injected on this basis, and w, h, d are added to formula (1); after positional encoding (PE), the x, y, and z terms are divided by the bounding box scale information w, h, and d, respectively; then an adaptive factor, denoted w_{q,ref}, h_{q,ref}, and d_{q,ref} respectively, is assigned to each of the x, y, and z terms for adjustment; finally, the concatenation is used as the position embedding and fed into the decoder cross-attention part for attention calculation; the prior of the characteristic scale-modulated cross-attention mechanism is expressed by formula (1). Formula explanation: x, y, and z are the coordinates of the query position part of the Decoder, used as the query in the Transformer network; x_ref, y_ref, and z_ref are the query position part of the Encoder, used as the key in the Transformer network; positionembedding is the positional encoding function, and both are independently positionally encoded (PE) in the x, y, and z directions and then concatenated; MLP is a multilayer perceptron model and σ is the sigmoid function; w_{q,ref}, h_{q,ref}, and d_{q,ref} are self-adjusting factors with a value range of 0-1, generated from the content part C_q of the query by the MLP and the σ function; and V is the dimension, 256, after positional encoding.

2. The training method for a 3D object detection framework based on assisted denoising tasks and sparse relation modeling as described in claim 1, characterized in that it also includes adding a bypass denoising task to the decoder, so that two types of queries are input to the decoder: one type is the 3D proposal predicted by combining the front and back views from the encoder, whose 6-dimensional anchor is converted into a query embedding according to formula (1) and whose target content part is learnable; the other type of query is used for the denoising task, where the target content is a class, which covers the target and the background, with the background set to the maximum value; the noise applied to the query embedding can be summarized as: center point displacement and scale scaling.

3. The training method for a 3D object detection framework based on assisted denoising tasks and sparse relation modeling as described in claim 2, characterized in that the displacement of the center point is specifically as follows: first, sample a perturbation parameter λ1 from a uniform distribution ranging from 0 to 1; then calculate the offsets corresponding to the center point x, y, and z respectively, ensuring that the center point remains within the original box after the perturbation.

4. The training method for a 3D object detection framework based on assisted denoising tasks and sparse relation modeling as described in claim 2, characterized in that the scaling is specifically as follows: a perturbation parameter λ2 is sampled from a uniform distribution ranging from 0 to 1, and then the offsets corresponding to width, height, and depth are calculated respectively: |Δw| = λ2·W, |Δh| = λ2·H, |Δd| = λ2·D; finally, the scaled width, height, and depth are obtained.

5. The training method for a 3D object detection framework based on assisted denoising tasks and sparse relation modeling as described in claim 2, characterized in that: The query for the denoising task includes a target class and a background class. After calculation by the decoder layer, the output directly predicts the box and performs loss calculation with the corresponding ground truth box, and completes backpropagation. The query for Hungarian matching is obtained by performing Hungarian matching on the result of the 3D proposal after being calculated by the decoder, and then performing loss calculation. During the training phase, the denoising task and the matching task are performed simultaneously, while only the Hungarian matching task is performed during inference.