Image 3D target detection model training enhancement method under limited data labeling

By combining image enhancement with self-distillation training, the method addresses overfitting of 3D object detection models under limited data annotation, improving model robustness and detection performance and yielding a stronger monocular 3D object detection neural network.

CN117011671B · Active · Publication Date: 2026-06-23 · SUZHOU BAICHUAN DATA TECH CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
SUZHOU BAICHUAN DATA TECH CO LTD
Filing Date
2023-09-09
Publication Date
2026-06-23

AI Technical Summary

Technical Problem

Under limited data annotation, existing 3D target detection models are prone to overfitting and generalize poorly. Furthermore, the potential of existing distillation methods to enhance robustness at the target representation level has not been fully explored.

Method used

3D feature descriptions of the images are acquired, and an image enhancement module generates enhanced images. The enhanced images are fed to a student/teacher 3D object detection network that is trained through self-distillation. Supervised learning and backpropagation are employed, and a multi-stage feature mapping representation is constructed, thereby promoting effective training of the network with limited data.

Benefits of technology

It achieves more accurate 3D object detection under limited data conditions, improves the robustness and detection performance of the model, makes full use of manually labeled data, and develops a stronger monocular 3D object detection neural network.

✦ Generated by Eureka AI based on patent content.


Abstract

The application relates to the technical field of image target detection and discloses a training enhancement method for image 3D target detection models under limited data labeling. The method obtains image 3D feature descriptions; target features of different feature layers are passed through a student / teacher 3D target detection network to obtain different detection results; the target features of the different feature layers, through a self-distillation module, and the different detection results, through a supervised learning module, supervise the training of the image 3D target detection network; a network training loss function is obtained from this supervision and is used to train the image 3D target detection network through back propagation. By training the image 3D target detection network from these different aspects, the method makes full use of limited manually labeled data to train the network effectively when a robust monocular 3D target detection neural network model must be developed, yielding a monocular 3D target detection deep neural network model with stronger and more robust performance.

Description

Technical Field

[0001] This invention relates to the field of image target detection technology, specifically to a method for enhancing the training of 3D image target detection models under limited data annotation.

Background Technology

[0002] 3D object detection based on monocular vision primarily uses 3D bounding boxes for image annotation. The motion of objects in the real world can be viewed as a six-degree-of-freedom problem. Typical 3D bounding box annotations include the coordinates of the box's center point, its length, width, and height, and its orientation angles (pitch, roll, and yaw). Since targets in traffic scenarios move on the road plane, pitch and roll are not considered; instead, seven parameters (the three center-point coordinates, length, width, height, and yaw) are typically used to annotate the 3D position and size of the bounding box. 3D object detection can be considered a multi-task learning problem whose goal is to predict the object category and to regress the position and size of the 3D bounding box.
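The seven-parameter annotation described above can be pictured as a small record type. The following is only an illustrative sketch: the class name `Box3D`, its field names, and the example values are assumptions introduced here, not part of the patent or of any dataset format.

```python
from dataclasses import dataclass
import math


@dataclass
class Box3D:
    """Seven-parameter 3D bounding box for targets moving on the road plane."""
    x: float        # center point coordinates
    y: float
    z: float
    length: float   # box dimensions
    width: float
    height: float
    yaw: float      # rotation about the vertical axis, in radians

    def as_vector(self) -> list:
        # Flatten to the 7-element regression target.
        return [self.x, self.y, self.z,
                self.length, self.width, self.height, self.yaw]

    def volume(self) -> float:
        return self.length * self.width * self.height


# Hypothetical example values for a car-sized target.
car = Box3D(x=12.0, y=-1.5, z=0.8, length=4.5, width=1.8, height=1.6,
            yaw=math.pi / 2)
print(len(car.as_vector()))
```

Pitch and roll have no field because, as stated above, they are dropped for road-plane targets.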

[0003] Existing methods rely on fully supervised learning, which easily leads to overfitting to labeled datasets. Feature extraction is only effective in the training domain, resulting in poor generalization performance. Some distillation methods can alleviate these problems to some extent (for example, using a radar detection model as the teacher model to generate pseudo-labels in a teacher-student framework, with the student model trained on both pseudo-labeled and labeled datasets), but their potential for enhancing robustness at the target representation level has not been fully explored. How to perform semi-supervised training using a small amount of labeled data and a large amount of unlabeled data while simultaneously enhancing target representation capabilities remains a challenge. Therefore, a training enhancement method for image 3D target detection models under limited data labeling is proposed.

Summary of the Invention

[0004] The purpose of this invention is to provide a method for enhancing the training of 3D object detection models in images with limited data annotation, so as to solve the problems mentioned in the background art.

[0005] To achieve the above objectives, the present invention provides the following technical solution:

[0006] A method for enhancing the training of image 3D object detection models under limited data annotation comprises:

[0007] An image 3D feature description is obtained, and the image 3D feature description is used to obtain enhanced image A and enhanced image B through an image enhancement module. Enhanced image A and enhanced image B are passed through a target detection network to obtain target features at different feature layers. The target features at the different feature layers are passed through a student / teacher 3D target detection network to obtain different detection results.

[0008] The target features of the different feature layers, through a self-distillation module, and the different detection results, through a supervised learning module, are used to supervise the training of the image 3D target detection network.

[0009] A network training loss function is obtained from this supervised training of the image 3D object detection network, and the training loss function is used to train the image 3D object detection network through backpropagation.

[0010] Optionally, the image 3D feature description is obtained by inputting the images in the labeled data into the data enhancement module to obtain different enhanced images, and by inputting the lidar point cloud belonging to each corresponding labeled target into the 3D feature extraction network.

[0011] Optionally, the image enhancement module is used to generate images with the same scale but different appearances.

[0012] Optionally, the image 3D feature description extraction network is composed of a point cloud feature extraction encoder.

[0013] Optionally, the student / teacher 3D target detection network includes three processes: a backbone network, a region recommendation network, and a detection head.

[0014] Optionally, the self-distillation module includes a scene-level feature self-distillation module, a region of interest feature self-distillation module, and a response-level label self-distillation module.

[0015] Optionally, when the target features of different feature layers and different detection results are processed by the module, half of the target features and half of the detection results are used.

[0016] Optionally, the loss calculation of the computation result of the image 3D object detection network is used for knowledge distillation during the training process.

[0017] Optionally, before the network parameters are finalized, i.e. before the loss function converges, only the student or teacher parameters are updated for each batch.

[0018] Optionally, if the student / teacher 3D object detection network has the same parameters, the image 3D object detection network parameters are updated only once in each batch.

[0019] This invention has at least the following beneficial effects:

[0020] This scheme acquires image 3D feature descriptions and, through an image enhancement module, obtains enhanced image A and enhanced image B. Enhanced image A and enhanced image B are passed through an object detection network to obtain target features at different feature layers, and these target features are passed through a student / teacher 3D object detection network to obtain different detection results. The target features at the different feature layers, through a self-distillation module, and the different detection results, through a supervised learning module, supervise the training of the image 3D object detection network. A network training loss function is obtained from this supervision and is used to train the image 3D object detection network through backpropagation. For the same image 3D object detection neural network trained with limited labeled data, this yields more accurate detection results and better detection performance. By training the image 3D object detection network from these different aspects, the scheme makes full use of limited manually labeled data to train the network effectively when a robust monocular 3D object detection neural network model must be developed, ensuring a more powerful and more robust monocular 3D object detection deep neural network model.

[0021] In addition, this application uses a feature enhancement method based on the prior of the target 3D model and supervises the training of the target features in the detection model, achieving better performance during training and improving the robustness of the model. Based on the data-augmentation self-distillation learning method, interactive distillation training is performed between isomorphic teacher / student networks, constructing mapping representations between the teacher model and the student model at multiple stages (the image feature layer, the target response layer, and the position regression layer) so as to promote alternating learning and updating of the two branches' features in self-distillation.

Description of the Drawings

[0022] To more clearly illustrate the technical solutions of the embodiments of the present invention, the accompanying drawings used in the description of the embodiments will be briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0023] Figure 1 is a schematic diagram of the process of the present invention;

[0024] Figure 2 illustrates the results of training the same 3D object detection neural network (dark gray represents manually labeled results, light gray represents results trained without this invention, and light gray represents results trained with this invention).

Detailed Implementation

[0025] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.

[0026] Referring to Figures 1-2, this invention provides a method for enhancing the training of image 3D object detection models under limited data annotation, comprising:

[0027] An image 3D feature description is obtained, and the image 3D feature description is used to obtain enhanced image A and enhanced image B through an image enhancement module. Enhanced image A and enhanced image B are passed through a target detection network to obtain target features at different feature layers. The target features at the different feature layers are passed through a student / teacher 3D target detection network to obtain different detection results.

[0028] The target features of the different feature layers, through a self-distillation module, and the different detection results, through a supervised learning module, are used to supervise the training of the image 3D target detection network.

[0029] A network training loss function is obtained from this supervised training of the image 3D object detection network, and the training loss function is used to train the image 3D object detection network through backpropagation.

[0030] In some embodiments, the image 3D feature description is obtained by inputting the images in the labeled data into the data enhancement module to obtain different enhanced images, and by inputting the lidar point cloud belonging to each corresponding labeled target into the 3D feature extraction network; the image enhancement module is used to generate images with the same scale but different appearances; the image 3D feature description extraction network is composed of a point cloud feature extraction encoder.

[0031] In some embodiments, the student / teacher 3D target detection network includes three processes: a backbone network, a region recommendation network, and a detection head. Therefore, target features of different feature layers pass through the backbone network, the region recommendation network, and the detection head in sequence, resulting in different detection results. The self-distillation module includes a scene-level feature self-distillation module, a region of interest feature self-distillation module, and a response-level label self-distillation module. Target features of different feature layers pass through the scene-level feature self-distillation module, the region of interest feature self-distillation module, and the response-level label self-distillation module to obtain corresponding scene-level features, region of interest features, and response-level labels.

[0032] In some embodiments, when the target features of different feature layers and different detection results are processed by the module, half of the target features and half of the detection results are used; the loss calculation of the operation result of the image 3D object detection network is used for knowledge distillation during the training process; before the network parameters are finalized, that is, before the loss function converges, only the student or teacher parameters are updated in each batch; if the student / teacher 3D object detection network has the same parameters, only one image 3D object detection network parameter update is performed in each batch.

[0033] The method for enhancing the training of image 3D object detection models under limited data annotation proceeds as follows:

[0034] Step 1: Input the images from the labeled data into the data enhancement module to obtain different enhanced images. Input the LiDAR point cloud corresponding to each labeled target into the 3D feature extraction network to obtain the corresponding 3D feature description. The image enhancement module keeps the image size and orientation unchanged, and generates two images (enhanced image A and enhanced image B) of the same scale but different appearances by changing the image color, adding image degradation effects, and adding image noise. The 3D feature extraction network consists of a point cloud feature extraction encoder. Its input is the point cloud of each 3D target instance, and its output is the one-dimensional feature vector corresponding to each instance.
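The appearance-only enhancement in Step 1 might be sketched as follows. This is a minimal illustration under stated assumptions: the function name `augment`, the nested-list image layout, and the specific operations (a brightness gain for the color change, coarse quantization for the degradation effect, additive Gaussian noise) with their parameter ranges are all chosen here for illustration; the patent does not specify them.

```python
import random


def augment(image, rng):
    """Return a copy of `image` with the same size but a different appearance.

    `image` is a nested list [row][col][channel] of floats in [0, 1].
    No flipping, cropping, or resizing is done, so scale and orientation
    are unchanged.
    """
    gain = rng.uniform(0.7, 1.3)        # color change: global brightness gain
    levels = rng.choice([8, 16, 32])    # degradation: coarse quantization
    sigma = rng.uniform(0.01, 0.05)     # strength of additive Gaussian noise
    out = []
    for row in image:
        new_row = []
        for pixel in row:
            new_pixel = []
            for value in pixel:
                v = value * gain
                v = round(v * (levels - 1)) / (levels - 1)
                v += rng.gauss(0.0, sigma)
                new_pixel.append(min(1.0, max(0.0, v)))  # clip to [0, 1]
            new_row.append(new_pixel)
        out.append(new_row)
    return out


rng = random.Random(0)
# Hypothetical 2 x 4 RGB image with random content.
image = [[[rng.random() for _ in range(3)] for _ in range(4)] for _ in range(2)]
enhanced_a = augment(image, rng)
enhanced_b = augment(image, rng)
```

Because each call draws its own gain, quantization level, and noise, the two outputs share the input's geometry but differ in appearance, which is the property the step relies on.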

[0035] Step 2: Input the enhanced images A and B obtained in Step 1 into the student / teacher 3D object detection network to obtain features from different feature layers. The student / teacher 3D object detection network is an isomorphic, parameter-equivalent Siamese network, consisting of a backbone network, a region recommendation network, and a detection head. The student / teacher backbone network takes enhanced images A and B as input and outputs scene feature 1 and scene feature 2. The student / teacher region recommendation network takes the scene features as input and outputs target feature 1 and target feature 2. The student / teacher detection head takes the target features as input and outputs detection result 1 and detection result 2.
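The data flow of Step 2 can be illustrated with a toy stand-in. The sketch below is not a real detector: the class `Detector`, its single `scale` parameter, and the arithmetic inside each stage are placeholders assumed here purely to show how one shared set of parameters processes both enhanced images and exposes features at three levels.

```python
class Detector:
    """Toy backbone -> region recommendation stage -> detection head pipeline."""

    def __init__(self, scale):
        self.scale = scale  # single shared "parameter", for illustration only

    def backbone(self, image):
        # Scene-level features: here just scaled pixel values.
        return [self.scale * p for p in image]

    def region_stage(self, scene_feature):
        # Target-level features: here a fixed window of the scene features.
        return scene_feature[:2]

    def head(self, target_feature):
        # Detection result: here a single score.
        return sum(target_feature)

    def forward(self, image):
        scene = self.backbone(image)
        target = self.region_stage(scene)
        return {"scene": scene, "target": target,
                "detection": self.head(target)}


# Isomorphic, parameter-equivalent branches: one object plays both roles.
shared = Detector(scale=0.5)
out_1 = shared.forward([0.2, 0.4, 0.6])   # stands in for enhanced image A
out_2 = shared.forward([0.3, 0.3, 0.7])   # stands in for enhanced image B
```

Returning all three intermediate outputs from `forward` mirrors the text: scene features, target features, and detection results are each needed later for a different loss term.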

[0036] Step 3: Perform cross-modal feature distillation between the one-dimensional feature vector corresponding to the 3D target instance obtained in Step 1 and target feature 1 and target feature 2 obtained in Step 2. The cross-modal feature distillation module measures how closely target feature 1 and target feature 2 each match the one-dimensional feature vector of the point cloud, using the KL divergence function. By reducing this cross-modal divergence, the training of the image 3D target detection network (i.e., the student / teacher network) is supervised.
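A KL-divergence comparison of this kind could look like the sketch below. Several choices are assumptions made here rather than details from the patent: features are turned into distributions with a softmax, the point-cloud feature is treated as the reference distribution, and the divergence direction is KL(point || image). The function names and example vectors are hypothetical.

```python
import math


def softmax(values):
    m = max(values)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def cross_modal_loss(image_feature, point_feature):
    # Assumption: the point-cloud feature is the reference distribution.
    return kl_divergence(softmax(point_feature), softmax(image_feature))


point_vec = [0.9, 0.1, -0.3, 0.4]   # hypothetical point-cloud feature vector
target_1 = [0.8, 0.2, -0.2, 0.5]    # image feature close to the reference
target_2 = [-0.5, 1.0, 0.7, 0.0]    # image feature far from the reference
loss = (cross_modal_loss(target_1, point_vec)
        + cross_modal_loss(target_2, point_vec))
```

The loss is zero when an image feature induces the same distribution as the point-cloud feature and grows as they diverge, so minimizing it pulls the image features toward the 3D reference.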

[0037] Step 4: Input scene features 1 / 2, target features 1 / 2, and detection results 1 / 2 obtained in Step 2 into the scene-level feature self-distillation module, the region of interest feature self-distillation module, and the response-level label self-distillation module, respectively, to perform self-distillation training of the student / teacher network. The scene-level feature self-distillation module measures the distance between scene feature 1 and scene feature 2 using the absolute difference; the region of interest feature self-distillation module measures the distance between target feature 1 and target feature 2 using the KL divergence function; the response-level label self-distillation module measures the distance between detection result 1 and detection result 2 by constructing an intersection-over-union (IoU) loss function. By reducing the distances measured by these three modules, the training of the image 3D object detection network (i.e., the student / teacher network) is supervised.
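Two of the distances named in Step 4 can be sketched compactly. This is a simplified illustration, not the patent's formulation: the IoU here is computed for axis-aligned boxes and ignores yaw, the loss is taken as 1 minus IoU, and the KL term for region of interest features is left out for brevity. Function names and example values are assumptions.

```python
def mean_abs_diff(a, b):
    """Scene-level distance: mean absolute difference of two feature vectors."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def iou_3d_axis_aligned(box_a, box_b):
    """IoU of boxes given as (cx, cy, cz, length, width, height).

    Simplification: boxes are treated as axis-aligned, so yaw is ignored.
    """
    inter = 1.0
    for i in range(3):
        lo = max(box_a[i] - box_a[i + 3] / 2, box_b[i] - box_b[i + 3] / 2)
        hi = min(box_a[i] + box_a[i + 3] / 2, box_b[i] + box_b[i + 3] / 2)
        inter *= max(0.0, hi - lo)   # overlap along this axis, 0 if disjoint
    vol_a = box_a[3] * box_a[4] * box_a[5]
    vol_b = box_b[3] * box_b[4] * box_b[5]
    return inter / (vol_a + vol_b - inter)


def iou_loss(box_a, box_b):
    return 1.0 - iou_3d_axis_aligned(box_a, box_b)


# Hypothetical outputs of the two branches.
scene_1, scene_2 = [0.2, 0.5, 0.1], [0.3, 0.4, 0.1]
det_1 = (10.0, 2.0, 0.8, 4.0, 2.0, 1.5)
det_2 = (10.5, 2.0, 0.8, 4.0, 2.0, 1.5)
self_distill = mean_abs_diff(scene_1, scene_2) + iou_loss(det_1, det_2)
```

The same `iou_loss` form could equally compare a detection result against a ground-truth box; a yaw-aware version would need a rotated-box intersection, which this sketch does not attempt.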

[0038] Step 5: Input detection results 1 / 2 obtained in Step 2 and the ground-truth labels of the labeled samples into the supervised learning module. The supervised learning module constructs an Intersection over Union (IoU) loss function to measure the distance between detection result 1 and the ground-truth label, and between detection result 2 and the ground-truth label, respectively. By reducing these measured distances, the training of the image 3D object detection network (i.e., the student / teacher network) is supervised.

[0039] Step 6: Linearly sum the distance metrics calculated in Steps 3, 4, and 5 to obtain the network training loss function. Train the image 3D object detection network (i.e., the student / teacher network) using backpropagation and an alternating update strategy. For each batch of input data, the student and teacher models perform forward computation in parallel; each passes its computation results to the other for loss calculation, achieving knowledge distillation during training; supervision is provided using real labels and pseudo-labels, and target feature distillation is performed using the 3D model of the generated data; before the network parameters are finalized (i.e., before the loss function converges), only the student parameters or only the teacher parameters are updated in each batch; since the student / teacher networks are isomorphic and have the same parameters, the alternating update strategy means that the image 3D object detection network parameters are updated only once per batch.
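The linear summation and the alternating update of Step 6 might be organized as in the sketch below. The patent does not give the weights or the alternation rule, so both are assumptions here: weights default to 1.0, and even-numbered batches are assigned to the student while odd-numbered batches go to the teacher. The loss names and numeric values are made-up placeholders.

```python
def total_loss(terms, weights=None):
    """Linear (weighted) sum of named loss terms; weights default to 1.0."""
    weights = weights or {}
    return sum(weights.get(name, 1.0) * value for name, value in terms.items())


def branch_to_update(batch_index):
    """Hypothetical rule: even batches update the student, odd the teacher."""
    return "student" if batch_index % 2 == 0 else "teacher"


# Placeholder per-batch loss terms (cross-modal, self-distillation, supervised).
batches = [
    {"cross_modal": 0.12, "scene": 0.07, "roi": 0.05,
     "response": 0.22, "supervised": 0.40},
    {"cross_modal": 0.10, "scene": 0.06, "roi": 0.04,
     "response": 0.20, "supervised": 0.35},
]

for i, terms in enumerate(batches):
    # Exactly one branch is updated per batch; with shared parameters this
    # amounts to a single parameter update of the detection network.
    print(i, branch_to_update(i), round(total_loss(terms), 4))
```

In a real training loop the printed line would be replaced by a backward pass on the summed loss followed by an optimizer step for the selected branch.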

[0040] The detection performance of this invention on the Waymo dataset is shown in the table below:

[0041]

[0042] It should be noted that, in this document, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any such actual relationship or order between these entities or operations. Furthermore, the terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such process, method, article, or apparatus.

[0043] Although embodiments of the invention have been shown and described, it will be understood by those skilled in the art that various changes, modifications, substitutions and alterations can be made to these embodiments without departing from the principles and spirit of the invention, the scope of which is defined by the appended claims and their equivalents.

Claims

1. A method for enhancing the training of image 3D object detection models under limited data annotation, characterized in that it comprises: An image 3D feature description is obtained, and the image 3D feature description is used to obtain enhanced image A and enhanced image B through an image enhancement module. Enhanced image A and enhanced image B are passed through a target detection network to obtain target features at different feature layers. The target features at the different feature layers are passed through a student / teacher 3D target detection network to obtain different detection results. The target features of the different feature layers, through a self-distillation module, and the different detection results, through a supervised learning module, are used to supervise the training of the image 3D target detection network. A network training loss function is obtained from this supervised training of the image 3D object detection network, and the training loss function is used to train the image 3D object detection network through backpropagation. The image 3D feature description is obtained by inputting the images in the labeled data into the data enhancement module to obtain different enhanced images, and by inputting the lidar point cloud belonging to each corresponding labeled target into the 3D feature extraction network. The student / teacher 3D target detection network consists of three processes: a backbone network, a region recommendation network, and a detection head. The self-distillation module includes a scene-level feature self-distillation module, a region of interest feature self-distillation module, and a response-level label self-distillation module. When the target features of different feature layers and the different detection results are processed by the module, half of the target features and half of the detection results are used. The loss calculation of the computational results of the image 3D object detection network is used for knowledge distillation during the training process.

2. The method for enhancing the training of an image 3D object detection model under limited data annotation as described in claim 1, characterized in that: The image enhancement module is used to generate images with the same scale but different appearances.

3. The method for enhancing the training of an image 3D object detection model under limited data annotation as described in claim 1, characterized in that: The image 3D feature description extraction network consists of a point cloud feature extraction encoder.

4. The method for enhancing the training of an image 3D object detection model under limited data annotation as described in claim 1, characterized in that: Before the network parameters are finalized, i.e. before the loss function converges, only the parameters of the student or of the teacher are updated in each batch.

5. The method for enhancing the training of an image 3D object detection model under limited data annotation as described in claim 1, characterized in that: If the parameters of the student / teacher 3D object detection network are the same, the image 3D object detection network parameters are updated only once in each batch.