A monocular vision 3D target detection method

By using a monocular vision 3D detection method based on the CenterNet architecture, and leveraging feature pyramids and the JoCon module to accurately predict target depth and distance, this method solves the problems of low accuracy and slow speed in existing methods, achieving efficient and economical 3D target detection.

CN117011681B · Active · Publication Date: 2026-06-26 · ZHEJIANG UFO AUTOMOBILE MFG CO LTD +1

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
ZHEJIANG UFO AUTOMOBILE MFG CO LTD
Filing Date
2023-08-23
Publication Date
2026-06-26

AI Technical Summary

Technical Problem

Existing monocular vision 3D target detection methods have failed to achieve a good balance between overall accuracy, detection speed and implementation cost, and traditional methods have significant errors in predicting target depth information.

Method used

A monocular vision 3D detection method based on the CenterNet architecture is adopted. Low-level and high-level semantic features are extracted through a network backbone and a feature pyramid structure. The target center point position is obtained by combining a candidate network (P-Net) with a refinement network (O-Net). The JoCon module introduces information from surrounding targets to predict depth and distance more accurately. Length, width, height, and heading angle are also predicted. During training, Focal Loss optimizes the center point, and L1 Loss optimizes the center offset, depth, dimensions, and heading angle.

Benefits of technology

It improves the accuracy and speed of monocular vision 3D target detection, reduces computational load, reduces prediction errors, and achieves cost-effective 3D target detection.

✦ Generated by Eureka AI based on patent content.

Smart Images

  • Figure CN117011681B_ABST
Patent Text Reader

Abstract

The present application relates to the field of monocular vision detection, and specifically to a monocular vision 3D target detection method. Low-level and high-level semantic features of the image are extracted using a network backbone and a feature pyramid structure. The approximate position of the target is obtained using the candidate network P-Net, and the center point position of the target is then accurately obtained by the refinement network O-Net. To eliminate the position deviation introduced when the image is downsampled, the position deviation information of the target center is also output. In addition, to address the inaccurate estimation of target depth and distance in monocular 3D target detection, a simple and effective joint context module (JoCon) is proposed to predict the depth and distance information of the target more accurately. The method thereby addresses the problem of balancing overall accuracy, detection speed, and implementation cost in current monocular vision 3D target detection.

Description

Technical Field

[0001] This invention relates to the field of monocular vision detection, and more specifically to a monocular vision 3D target detection method.

Background Technology

[0002] Object detection plays a crucial role in computer vision. Traditional object detection methods based on RGB images primarily detect 2D bounding boxes, meaning they only output the coordinates of the top-left corner or center point of the object, along with its length and width. With the rapid development of artificial intelligence, object detection technology is increasingly being applied in fields such as autonomous driving, robot navigation, and augmented reality. In autonomous driving and similar fields, real-time perception of 3D positional information, including the length, width, and depth of objects, is required, and traditional 2D object detection methods can no longer meet these requirements.

[0003] Currently, the solution to the problem that 2D image detection methods cannot obtain target depth information is to introduce LiDAR point cloud data and fuse it with 2D image data to obtain the target's depth, length, and width information. This method is also one of the mainstream methods. However, due to the high cost of LiDAR equipment and its sensitivity to severe weather, these factors directly restrict the practical application of this method. Therefore, it is of great importance to invent an economical, universal, and efficient 3D target detection method.

[0004] Currently, monocular vision-based 3D object detection methods first feed RGB images into the backbone of a neural network to extract high-level and low-level semantic information. Then, the extracted high-level and low-level semantic information are directly fused. Finally, in the detection head, depending on the task, the detection head directly predicts the center coordinates, depth information, length, width, height, and heading angle of the 3D object in the 2D view. Such a simplistic and crude prediction method inevitably brings great difficulty to the network learning process, resulting in poor prediction performance.

[0005] Although some scholars have recently used the relative positional relationships between targets and complex post-processing operations to alleviate the problem of inaccurate 3D target distance prediction, the complex post-processing operations have brought about serious time consumption problems.

[0006] Currently, 3D object detection methods based on monocular vision are not mature enough. Although detection methods are constantly improving, they have not yet achieved a good balance in terms of overall accuracy, detection speed, and implementation cost.

Summary of the Invention

[0007] The purpose of this invention is to provide a monocular vision 3D target detection method to solve the problems of low overall accuracy and imbalance between detection speed and implementation cost in current monocular vision 3D target detection methods.

[0008] To solve the above-mentioned technical problems, the technical solution adopted by the present invention is as follows:

[0009] A monocular vision 3D detection method, the overall network structure of which is based on the CenterNet architecture. For a given input RGB image $I \in \mathbb{R}^{C \times H \times W}$, where $C$ represents the number of channels in the input image, $H$ represents the height of the input image, and $W$ represents the width of the input image, the method includes the following steps:

[0010] Step 1: Extract low-level and high-level semantic features of the image using the network backbone and feature pyramid structure: $F = \Phi(I)$, where $I$ represents the input RGB image, $\Phi$ represents the backbone network and the feature pyramid network (FPN), $F \in \mathbb{R}^{C_F \times \frac{H}{S} \times \frac{W}{S}}$, $C_F$ is the number of feature channels in the output, $S$ is the downsampling factor of the network, and $F$ represents the fused low-level and high-level features of the target extracted by the backbone network and the FPN;
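As a rough illustration of fusing low-level and high-level features in Step 1, the sketch below upsamples a coarse, high-level map and adds it to a finer, low-level map, in the style of a feature pyramid top-down pathway. The helper names, the nearest-neighbour upsampling, the element-wise addition, and the toy sizes are assumptions for illustration; the patent does not specify the fusion operator.

```python
import numpy as np

def upsample_nearest(x: np.ndarray, factor: int) -> np.ndarray:
    """Nearest-neighbour upsampling of a (C, H, W) feature map."""
    return x.repeat(factor, axis=1).repeat(factor, axis=2)

def fuse_top_down(low: np.ndarray, high: np.ndarray) -> np.ndarray:
    """Fuse a fine low-level map with a coarse high-level map (hypothetical helper).

    Both maps are assumed to already have the same channel count.
    """
    factor = low.shape[1] // high.shape[1]
    return low + upsample_nearest(high, factor)

# Toy sizes (assumed): a 384 x 1280 image with downsampling factor 4.
channels, height, width, stride = 64, 384, 1280, 4
low = np.ones((channels, height // stride, width // stride), dtype=np.float32)
high = np.full((channels, height // (2 * stride), width // (2 * stride)), 2.0,
               dtype=np.float32)

fused = fuse_top_down(low, high)
print(fused.shape)  # (64, 96, 320)
```

The fused map keeps the resolution of the finer input, which is why the output height and width equal the input size divided by the downsampling factor.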

[0011] Step 2: Use the candidate network P-Net to obtain the approximate location of the target, and then use the refinement network O-Net to accurately obtain the center point position of the target:

[0012] $\hat{Y} = O(P(F))$. In the formula, $\hat{Y} \in [0, 1]^{K \times \frac{H}{S} \times \frac{W}{S}}$ represents the predicted center point location of the target in the 2D image, $K$ represents the number of target categories, $P$ represents the target candidate network, $O$ represents the refinement network, $F$ denotes the fused features from Step 1, and $S$ denotes the downsampling factor of the network;
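One common way to read center points from a CenterNet-style heatmap is to keep local maxima above a score threshold. The sketch below assumes that decoding scheme; the 3x3 neighbourhood, the threshold, and the function name `find_centers` are illustrative choices not taken from the patent, and the internal layers of P-Net and O-Net are not modeled.

```python
import numpy as np

def find_centers(heatmap: np.ndarray, threshold: float = 0.5):
    """Return (class, row, col, score) for local maxima of a (K, H, W) heatmap."""
    _, H, W = heatmap.shape
    padded = np.pad(heatmap, ((0, 0), (1, 1), (1, 1)), constant_values=-np.inf)
    # Maximum over each 3x3 neighbourhood, built from nine shifted views.
    neigh_max = np.max(
        [padded[:, dy:dy + H, dx:dx + W] for dy in range(3) for dx in range(3)],
        axis=0,
    )
    keep = (heatmap == neigh_max) & (heatmap >= threshold)
    return [(int(k), int(r), int(c), float(heatmap[k, r, c]))
            for k, r, c in zip(*np.nonzero(keep))]

heat = np.zeros((2, 6, 8), dtype=np.float32)
heat[0, 2, 3] = 0.9   # strong peak for class 0
heat[0, 2, 4] = 0.6   # neighbour of that peak, so it is suppressed
heat[1, 4, 6] = 0.7   # peak for class 1

centers = find_centers(heat)
print([(k, r, c) for k, r, c, _ in centers])  # [(0, 2, 3), (1, 4, 6)]
```

Each class channel is searched independently, so two targets of different categories can share nearby locations without suppressing each other.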

[0013] Step 3: To eliminate the positional deviation introduced during image downsampling, the position offset of the target center is also output;
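The need for an offset output can be seen with a small numeric sketch. When a center coordinate is divided by the downsampling factor and rounded down to a feature-map cell, the fractional part is lost; predicting that fraction allows the original coordinate to be recovered. The function names and the numbers below are hypothetical, and the floor-based discretization is an assumption borrowed from common CenterNet practice.

```python
import math

def center_target(x: float, y: float, stride: int):
    """Split an image-space center into an integer cell and a sub-cell offset."""
    fx, fy = x / stride, y / stride
    cx, cy = math.floor(fx), math.floor(fy)
    return (cx, cy), (fx - cx, fy - cy)

def recover_center(cell, offset, stride: int):
    """Map a cell index plus an offset back to image coordinates."""
    return ((cell[0] + offset[0]) * stride, (cell[1] + offset[1]) * stride)

cell, offset = center_target(x=101.0, y=58.0, stride=4)
with_offset = recover_center(cell, offset, 4)
without_offset = recover_center(cell, (0.0, 0.0), 4)

print(cell, offset)      # (25, 14) (0.25, 0.5)
print(with_offset)       # (101.0, 58.0)
print(without_offset)    # (100.0, 56.0)
```

Without the offset, the recovered center is wrong by up to one cell, that is, up to the downsampling factor in image pixels along each axis.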

[0014] Step 4: To increase the receptive field of the convolutional layers, the JoCon module is used to accurately predict the depth and distance information of the target, and information from surrounding targets is incorporated into the target depth and distance prediction. In a single JoCon module, the input feature map passes through four consecutive cascaded convolutional layers to obtain feature maps with different receptive fields, and these feature maps are finally fused. This not only provides feature maps from different receptive fields but also reduces the computational cost of the model. By stacking this module, the sizes of different targets in the image can be mutually perceived, allowing the model to predict more accurate depth and distance information: $\hat{d}_i = J_n(\cdots J_2(J_1(F)))$. In the formula, $\hat{d}_i$ represents the depth and distance information of the $i$-th target, $J_n$ represents the $n$-th cascaded JoCon module, and $F$ denotes the fused features from Step 1.
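The claim that cascading yields several receptive fields at reduced cost can be checked with simple arithmetic. The sketch below assumes 3x3 kernels with stride 1 and no dilation, which the patent does not state, and compares the cascade against a hypothetical alternative that uses one separate large kernel per receptive field.

```python
def cascaded_receptive_fields(num_layers: int = 4, kernel: int = 3):
    """Receptive field after each layer of a stride-1 cascade of k x k convolutions."""
    fields, rf = [], 1
    for _ in range(num_layers):
        rf += kernel - 1
        fields.append(rf)
    return fields

def weights_per_channel_pair(kernels):
    """Number of weights per input/output channel pair for a set of k x k kernels."""
    return sum(k * k for k in kernels)

fields = cascaded_receptive_fields()               # four cascaded 3x3 layers
cascade_cost = weights_per_channel_pair([3] * 4)   # weights used by the cascade
parallel_cost = weights_per_channel_pair(fields)   # one large kernel per field

print(fields)                       # [3, 5, 7, 9]
print(cascade_cost, parallel_cost)  # 36 164
```

Under these assumptions the cascade reaches receptive fields of 3, 5, 7, and 9 pixels while reusing each intermediate result, which is consistent with the stated reduction in computational cost.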

[0015] Further, in addition to center point and depth distance prediction, the method also includes prediction of the target's length, width, and height and prediction of its heading angle, through Step 5: predict the target's length, width, and height $(\hat{l}, \hat{w}, \hat{h})$ and heading angle $\hat{\theta}$:

[0016] $(\hat{l}, \hat{w}, \hat{h}) = f_{\text{dim}}(F), \quad \hat{\theta} = f_{\theta}(F)$

[0017] In the formula, $f_{\text{dim}}$ represents the module for predicting length, width, and height, $f_{\theta}$ represents the heading angle prediction module, and $F$ denotes the fused features from Step 1. During the model training phase, the target's center point position information is optimized using Focal Loss, while the center point's position deviation, depth distance, length, width, and height dimensions, and heading angle information are optimized using L1 Loss:

[0018] $L_{\text{total}} = \lambda_1 L_{\text{center}} + \lambda_2 L_{\text{off}} + \lambda_3 L_{\text{dim}} + \lambda_4 L_{\theta} + \lambda_5 L_{\text{depth}}$, where $L_{\text{center}}$ represents the center point loss, $L_{\text{off}}$ represents the loss for the center point position deviation, $L_{\text{dim}}$ represents the loss for the dimension information, $L_{\theta}$ represents the loss for the heading angle deviation, $L_{\text{depth}}$ represents the loss for the target depth and distance information, $L_{\text{total}}$ represents the total loss, and each $\lambda$ represents the weight of the corresponding loss term.
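A minimal sketch of the training objective described above: Focal Loss on the center heatmap, L1 Loss on each regression output, and a weighted sum of the terms. The penalty-reduced focal loss form (as popularized by CenterNet), the exponents, the toy values, and the unit weights are all assumptions; the patent names the loss types but does not give these details.

```python
import numpy as np

def focal_loss(pred, target, alpha=2.0, beta=4.0, eps=1e-6):
    """Penalty-reduced focal loss on a center heatmap (CenterNet-style, assumed)."""
    pred = np.clip(pred, eps, 1.0 - eps)
    pos = target == 1.0
    pos_loss = -np.sum(((1.0 - pred) ** alpha * np.log(pred))[pos])
    neg_loss = -np.sum(
        ((1.0 - target) ** beta * pred ** alpha * np.log(1.0 - pred))[~pos]
    )
    return float((pos_loss + neg_loss) / max(int(pos.sum()), 1))

def l1_loss(pred, target):
    """Mean absolute error, used here for every regression output."""
    return float(np.mean(np.abs(np.asarray(pred) - np.asarray(target))))

def total_loss(losses: dict, weights: dict) -> float:
    """Weighted sum of the per-task losses."""
    return sum(weights[name] * value for name, value in losses.items())

gt = np.array([[1.0, 0.0], [0.0, 0.0]])      # one target center
good = np.array([[0.9, 0.1], [0.1, 0.1]])    # confident, correct heatmap
bad = np.array([[0.2, 0.8], [0.8, 0.8]])     # confident, wrong heatmap

losses = {
    "center": focal_loss(good, gt),
    "offset": l1_loss([0.20, 0.40], [0.25, 0.50]),
    "dim": l1_loss([4.1, 1.7, 1.5], [4.0, 1.8, 1.5]),
    "heading": l1_loss([0.30], [0.25]),
    "depth": l1_loss([31.0], [30.0]),
}
weights = {name: 1.0 for name in losses}     # hypothetical weights
total = total_loss(losses, weights)
print(focal_loss(good, gt) < focal_loss(bad, gt))  # True
```

In practice the weights would be tuned so that no single term dominates; the depth term is in metres here and is therefore much larger than the others at equal weights.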

[0019] The advantages of this invention over the current technology are as follows:

[0020] 1. For monocular vision 3D target detection networks, a cascaded target center point localization method is proposed. In order to eliminate the positional deviation introduced during image downsampling, the positional deviation information of the target center is also output.

[0021] 2. For monocular vision 3D target detection networks, a novel network module is proposed. The module is simple and efficient, and it significantly improves the estimation of target depth and distance. Previous algorithms essentially predict a target's depth and distance from the observation that targets appear relatively large in the image plane when they are close and relatively small when they are far away. The problem with this approach is that a large target that is relatively far away can still appear large in the image, and a small target that is close can still appear small. If depth and distance are predicted solely from the target's apparent size, a large error is introduced. By incorporating information from surrounding targets into the depth and distance prediction, the present invention resolves the estimation error caused by focusing only on the target's own size.

Attached Figure Description

[0022] Figure 1 is a flowchart illustrating the logic control of the P-Net module in this invention.

[0023] Figure 2 is a flowchart illustrating the logic control of the O-Net module in this invention.

[0024] Figure 3 is a flowchart illustrating the logic control of the JoCon module in this invention.

[0025] Figure 4 is a diagram of the overall network structure of the present invention.

Detailed Implementation

[0026] To enable those skilled in the art to better understand the technical solution of the present invention, the technical solution of the present invention will be further described below in conjunction with the accompanying drawings and embodiments.

[0027] Any process or method description in the flowchart or otherwise herein can be understood as representing a module, segment, or portion of code comprising one or more executable instructions for implementing a particular logical function or process, and the scope of the preferred embodiments of the invention includes additional implementations in which functions may be performed not in the order shown or discussed, including substantially simultaneously or in reverse order depending on the functions involved, as will be understood by those skilled in the art to which embodiments of the invention pertain.

[0028] Example:

[0029] As shown in Figures 1-4, a monocular vision 3D target detection method includes the following steps:

[0030] The overall network structure is based on the CenterNet architecture. The input is an RGB image $I \in \mathbb{R}^{C \times H \times W}$, where $C$ represents the number of channels in the input image, $H$ represents the height of the input image, and $W$ represents the width of the input image;

[0031] Step 1: Extract low-level and high-level semantic features of the image using the network backbone (DLA34 network in this example) and feature pyramid structure (FPN): $F = \Phi(I)$.

[0032] In the formula, $I$ represents the input RGB image, $\Phi$ represents the backbone network and the feature pyramid network, $F \in \mathbb{R}^{C_F \times \frac{H}{S} \times \frac{W}{S}}$, $C_F$ is the number of feature channels in the output, $S$ is the downsampling factor of the network, and $F$ represents the fused low-level and high-level features of the target extracted by the backbone network and the FPN;

[0033] Step 2: Use a candidate network (P-Net) to obtain the approximate location of the target, and then use a refinement network (O-Net) to accurately obtain the center point location of the target:

[0034] $\hat{Y} = O(P(F))$

[0035] In the formula, $\hat{Y} \in [0, 1]^{K \times \frac{H}{S} \times \frac{W}{S}}$ represents the predicted center point location of the target in the 2D image, $K$ represents the number of target categories, $P$ represents the target candidate network, $O$ represents the refinement network, $F$ denotes the fused features from Step 1, and $S$ denotes the downsampling factor of the network;

[0036] Step 3: To eliminate the positional deviation introduced during image downsampling, this invention also outputs the position offset of the target center;

[0037] Step 4: To increase the receptive field of the convolutional layers, a simple and effective Joint Context Module (JoCon) is proposed to predict the depth and distance information of targets more accurately. Information from surrounding targets is incorporated into the target depth and distance prediction. In a single JoCon module, the input feature map passes through four consecutive cascaded convolutional layers to obtain feature maps with different receptive fields, and these feature maps are finally fused. This approach not only obtains feature maps from different receptive fields but also reduces the computational cost of the model. By stacking this module, the sizes of different targets in the image can be mutually perceived, enabling the model to predict more accurate depth and distance information. Previous algorithms essentially predict a target's depth distance from the observation that targets appear relatively large in the image plane when they are close and relatively small when they are far away. The problem with this approach is that a large target that is relatively far away can still appear large in the image, and a small target that is close can still appear small. If the depth distance of the target is predicted only from its apparent size, a large error is introduced.
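The size ambiguity described above can be made concrete with the pinhole camera model, which is introduced here only for illustration and is not stated in the patent. With an assumed focal length $f = 720$ pixels, an object of physical height $H_{\text{obj}}$ at depth $Z$ has image height $h = f H_{\text{obj}} / Z$. The truck and car heights and depths below are hypothetical values:

```latex
h_{\text{truck}} = \frac{720 \times 3.0\,\text{m}}{60\,\text{m}} = 36\ \text{px},
\qquad
h_{\text{car}} = \frac{720 \times 1.5\,\text{m}}{30\,\text{m}} = 36\ \text{px}
```

Both objects occupy 36 pixels in height although one is twice as far away as the other, so apparent size alone cannot separate the two depths.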

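[0038] The depth prediction is written as $\hat{d}_i = J_n(\cdots J_2(J_1(F)))$. In the formula, $\hat{d}_i$ represents the depth and distance information of the $i$-th target, $J_n$ represents the $n$-th cascaded JoCon module, and $F$ denotes the fused features from Step 1;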
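The cascade-and-fuse structure of a JoCon module, and the effect of stacking modules, can be sketched on a single-channel map. This is a simplified illustration under several assumptions not taken from the patent: 3x3 kernels, fusion by averaging, fixed placeholder weights instead of learned ones, and one channel only. It tracks how far the influence of one active location spreads.

```python
import numpy as np

def conv3x3_same(x: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Single-channel 3x3 convolution with zero padding and stride 1."""
    H, W = x.shape
    padded = np.pad(x, 1)
    out = np.zeros_like(x)
    for dy in range(3):
        for dx in range(3):
            out += kernel[dy, dx] * padded[dy:dy + H, dx:dx + W]
    return out

def jocon_block(x: np.ndarray, kernels) -> np.ndarray:
    """Cascade the convolutions, keep every intermediate map, then fuse them."""
    maps, current = [], x
    for kernel in kernels:
        current = conv3x3_same(current, kernel)
        maps.append(current)
    return np.mean(maps, axis=0)   # fusion by averaging is an assumption

def footprint(x: np.ndarray) -> int:
    """Side length of the square region in which x is non-zero."""
    rows = np.nonzero(x.any(axis=1))[0]
    return int(rows[-1] - rows[0] + 1)

impulse = np.zeros((25, 25))
impulse[12, 12] = 1.0                     # one active location
kernels = [np.ones((3, 3)) / 9.0] * 4     # placeholder weights, not learned

one = jocon_block(impulse, kernels)       # a single module
two = jocon_block(one, kernels)           # a second module stacked on the first
print(footprint(one), footprint(two))     # 9 17
```

Each stacked module widens the region that can influence a given output location, which is the mechanism by which nearby targets can contribute to one target's depth estimate.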

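[0039] Step 5: Predict the length, width, and height of the target $(\hat{l}, \hat{w}, \hat{h})$ and its heading angle $\hat{\theta}$:

[0040] $(\hat{l}, \hat{w}, \hat{h}) = f_{\text{dim}}(F), \quad \hat{\theta} = f_{\theta}(F)$

[0041] In the formula, $f_{\text{dim}}$ represents the module for predicting length, width, and height, $f_{\theta}$ represents the heading angle prediction module, and $F$ denotes the fused features from Step 1. During the model training phase, the target's center point position information is optimized using Focal Loss, while the center point's position deviation, depth distance, length, width, and height dimensions, and heading angle information are optimized using L1 Loss:

[0042] $L_{\text{total}} = \lambda_1 L_{\text{center}} + \lambda_2 L_{\text{off}} + \lambda_3 L_{\text{dim}} + \lambda_4 L_{\theta} + \lambda_5 L_{\text{depth}}$, where $L_{\text{center}}$ represents the center point loss, $L_{\text{off}}$ represents the loss for the center point position deviation, $L_{\text{dim}}$ represents the loss for the dimension information, $L_{\theta}$ represents the loss for the heading angle deviation, $L_{\text{depth}}$ represents the loss for the target depth and distance information, $L_{\text{total}}$ represents the total loss, and each $\lambda$ represents the weight of the corresponding loss term.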

[0043] The above provides a detailed description of a monocular vision 3D target detection method provided by the present invention. The specific embodiments are only used to help understand the method and core ideas of the present invention. It should be noted that for those skilled in the art, several improvements and modifications can be made to the present invention without departing from the present invention, and these improvements and modifications also fall within the protection scope of the claims of the present invention.

Claims

1. A monocular vision 3D detection method, the overall network structure of which is based on the CenterNet architecture, for a given input RGB image $I \in \mathbb{R}^{C \times H \times W}$, where $C$ represents the number of channels in the input image, $H$ represents the height of the input image, and $W$ represents the width of the input image, characterized in that the method includes the following steps: Step 1: Extract low-level and high-level semantic features of the image using the network backbone and feature pyramid structure: $F = \Phi(I)$, where $I$ represents the input RGB image, $\Phi$ represents the backbone network and the feature pyramid network (FPN), $F \in \mathbb{R}^{C_F \times \frac{H}{S} \times \frac{W}{S}}$, $C_F$ is the number of feature channels in the output, $S$ is the downsampling factor of the network, and $F$ represents the fused low-level and high-level features of the target extracted by the backbone network and the FPN; Step 2: Use the candidate network P-Net to obtain the approximate location of the target, and then use the refinement network O-Net to accurately obtain the center point position of the target: $\hat{Y} = O(P(F))$, where $\hat{Y} \in [0, 1]^{K \times \frac{H}{S} \times \frac{W}{S}}$ represents the predicted center point location of the target in the 2D image, $K$ represents the number of target categories, $P$ represents the target candidate network, and $O$ represents the refinement network; Step 3: To eliminate the positional deviation introduced during image downsampling, the position offset of the target center is also output; Step 4: To increase the receptive field of the convolutional layers, the JoCon module is used to accurately predict the depth and distance information of the target, and information from surrounding targets is incorporated into the target depth and distance prediction. In a single JoCon module, the input feature map passes through four consecutive cascaded convolutional layers to obtain feature maps with different receptive fields, and these feature maps are finally fused. This not only provides feature maps from different receptive fields but also reduces the computational cost of the model. By stacking this module, the sizes of different targets in the image can be mutually perceived, allowing the model to predict more accurate depth and distance information: $\hat{d}_i = J_n(\cdots J_2(J_1(F)))$, where $\hat{d}_i$ represents the depth and distance information of the $i$-th target and $J_n$ represents the $n$-th cascaded JoCon module.

2. The monocular vision 3D detection method according to claim 1, characterized in that: in addition to center point and depth distance prediction, the method also includes prediction of the target's length, width, and height and prediction of its heading angle, through Step 5: predict the target's length, width, and height $(\hat{l}, \hat{w}, \hat{h})$ and heading angle $\hat{\theta}$: $(\hat{l}, \hat{w}, \hat{h}) = f_{\text{dim}}(F)$, $\hat{\theta} = f_{\theta}(F)$, where $f_{\text{dim}}$ represents the module for predicting length, width, and height, $f_{\theta}$ represents the heading angle prediction module, and $F$ denotes the fused features from Step 1. During the model training phase, the target's center point position information is optimized using Focal Loss, while the center point's position deviation, depth distance, length, width, and height dimensions, and heading angle information are optimized using L1 Loss: $L_{\text{total}} = \lambda_1 L_{\text{center}} + \lambda_2 L_{\text{off}} + \lambda_3 L_{\text{dim}} + \lambda_4 L_{\theta} + \lambda_5 L_{\text{depth}}$, where $L_{\text{center}}$ represents the center point loss, $L_{\text{off}}$ represents the loss for the center point position deviation, $L_{\text{dim}}$ represents the loss for the dimension information, $L_{\theta}$ represents the loss for the heading angle deviation, $L_{\text{depth}}$ represents the loss for the target depth and distance information, $L_{\text{total}}$ represents the total loss, and each $\lambda$ represents the weight of the corresponding loss term.