Three-dimensional object detection method based on generation and refinement of occlusion representation

By generating and refining occlusion representations, the problem of object recognition in 3D target detection involving sparse point clouds and incomplete shapes is solved, improving the robustness and generalization ability of the detector and achieving high-precision object recognition.

CN117218607B · Active · Publication Date: 2026-06-12 · NANJING UNIV OF SCI & TECH

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
NANJING UNIV OF SCI & TECH
Filing Date
2023-08-27
Publication Date
2026-06-12

AI Technical Summary

Technical Problem

Existing 3D object detection methods cannot effectively identify object positions when processing sparse and incomplete point cloud data, and do not consider the alignment and domain differences between objects when generating target point clouds, resulting in poor detector generalization.

Method used

We employ a method of generating and refining occlusion representations. Initial occlusion representations are generated through a representation encoding voting strategy under candidate boxes and a centrosymmetric method in spherical space, and a shape learning network then moves them toward the shape of a prototype target. The completed representations are weighted by density and distance and refined in cylindrical space, which improves the robustness of the detection network.

Benefits of technology

The method improves the detection performance of the 3D object detector for occluded objects, enables end-to-end training, and enhances the generalization ability and detection accuracy of the detection network.

✦ Generated by Eureka AI based on patent content.

Smart Images

  • Figure CN117218607B_ABST

Abstract

The application discloses a three-dimensional target detection method based on the generation and refinement of occlusion representations, which comprises the following steps: generating initial occlusion representations by adopting a representation encoding voting strategy under candidate boxes and a centrosymmetric method based on the object centroid in spherical space; constructing a shape learning network based on representation movement to generate occlusion representations similar to a prototype target; based on the completed occlusion representations, assigning a weight to each representation according to prior knowledge of density and distance, refining the occlusion representations with higher weights in cylindrical space, and applying the weights to the feature channels generated from the occlusion representations during detection; and generating high-quality object features through iterative optimization of the detection network and the shape learning network. The application is suitable for traditional three-dimensional target detection networks, and the method can improve the detection performance of traditional three-dimensional target detectors for severely occluded objects.

Description

Technical Field

[0001] This invention belongs to the field of lidar point cloud processing in traffic scenarios, and specifically relates to a three-dimensional target detection method based on generating and refining occlusion representations.

Background Technology

[0002] With the development of autonomous driving technology, 3D object detection plays an increasingly important role in the field. As a core component, the performance of 3D object detection is closely related to the traffic safety of drivers and passengers, making the acquisition of high-performance 3D object detectors a key research focus. While 3D object detection methods based on different network structures and feature representations have achieved considerable performance in terms of speed and accuracy, these works have neglected the impact of point cloud characteristics on model performance. Therefore, improving the performance of classic 3D object detectors can be achieved by analyzing point cloud characteristics.

[0003] In recent years, some works have conducted detailed analyses from the perspectives of point cloud statistics and model performance bottlenecks. These analyses can be summarized as follows: point cloud data in traffic scenarios is sparse and incomplete, generally covers only part of an object's surface, and fails to reflect the complete shape of the object, which makes it difficult for detectors to accurately locate the complete object. These works have addressed the impact of point cloud characteristics on detectors by completing the object's shape and densifying the point cloud, by constructing auxiliary tasks, or by network distillation. Although these methods have broken through the performance bottleneck of traditional models, the following problems remain: (1) when generating target point clouds, the direct mixing method generates complete or dense point clouds without considering the alignment and domain differences between objects, and without filtering out the noise introduced during mixing; (2) the point cloud completion or densification module needs to be trained in advance, which complicates the training process of the 3D detection network, and the trained module can only be used on the given training dataset, resulting in poor generalization of the detection model. Therefore, in the field of 3D object detection, how to detect objects with sparse point clouds and incomplete shapes remains a challenge.

Summary of the Invention

[0004] The purpose of this invention is to provide a three-dimensional target detection method based on generating and refining occlusion representations.

[0005] The technical solution to achieve the purpose of this invention is as follows. In a first aspect, this invention provides a method for three-dimensional target detection in traffic scenes based on the generation and refinement of occlusion representations, comprising the following steps:

[0006] Point cloud data is acquired from radar sensors and input into the feature extraction network and region proposal network for 3D object detection to generate raw representation information and candidate boxes.

[0007] A module for generating occlusion representations is established, including a representation encoding voting strategy under candidate boxes and a centroid-based centrosymmetric method in spherical space to generate initial occlusion representations. A shape learning network based on representation movement is constructed to generate occlusion representations similar to the prototype target.

[0008] Based on the completion of the occlusion representation, each representation is assigned a weight according to prior knowledge of density and distance, and the occlusion representation with a weight higher than a set threshold is refined in cylindrical space.

[0009] Weights are added to the occlusion representation in the feature channels generated during the detection process. Based on the completed object representation and candidate boxes, the object category and location are classified and regressed in the 3D object detection network.

[0010] In a second aspect, the present invention provides an electronic device including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the program to implement the steps of the method described in the first aspect.

[0011] In a third aspect, the present invention provides a computer-readable storage medium having a computer program stored thereon, which, when executed by a processor, implements the steps of the method described in the first aspect.

[0012] Compared with the prior art, the significant advantages of this invention are: (1) This invention constructs a representation densification method. The centrosymmetric initialization of representations in spherical space, together with the constructed shape learning network, alleviates the noise that direct mixing of representations introduces into the target representation during densification. A weight allocation strategy based on density and distance further refines the generated occlusion representations, and a weight threshold filters out all but the high-quality occlusion representations, which improves the detection network's consistency learning for point clouds belonging to the same object. (2) In the three-dimensional target detection network, this invention integrates the occlusion representation generation module and the weight allocation module, so that training and inference form an end-to-end process without multi-step implementation. (3) This invention is a plug-and-play module that can be applied to general three-dimensional target detection methods, including methods based on key points, voxels, and key point-voxels, and can also be applied to different benchmark datasets.

Attached Figure Description

[0013] Figure 1 is a flowchart of the 3D target detection method in traffic scenarios based on the generation and refinement of occlusion representations according to the present invention.

[0014] Figure 2 is a schematic diagram of the occlusion representation completion process of the present invention.

[0015] Figure 3 is a schematic diagram of the density- and distance-based weight allocation strategy of the present invention.

Specific implementation methods

[0016] This invention proposes a 3D object detection method based on generating and refining occlusion representations. Within a 3D object detection model, the method adds a module that generates occlusion representations and a module that refines the generated occlusion representations. These two modules respectively generate and refine the representations of the occluded parts of the object, improving the detector's performance on occluded objects. The main steps of the method are: generating initial occlusion representations using a representation encoding voting strategy under candidate boxes and a centroid-based centrosymmetric method in spherical space; constructing a shape learning network based on representation movement to generate occlusion representations similar to the prototype target; assigning a weight to each representation of the completed occlusion representations according to prior knowledge of density and distance; refining the occlusion representations with higher weights in cylindrical space; applying these weights to the feature channels generated from the occlusion representations during detection; and generating high-quality object features through iterative optimization of the detection network and the shape learning network. This invention is applicable to traditional 3D object detection networks and can improve the detection performance of traditional 3D object detectors for severely occluded objects.

[0017] The 3D target detection method in traffic scenes based on the generation and refinement of occlusion representations of the present invention is described in detail below with reference to Figure 1. The method includes the following steps:

[0018] (1) Generate occlusion representation based on point cloud data:

[0019] Step 1: Acquire point cloud data $P$ from the radar sensor and input it into the feature extraction network and region proposal network for 3D target detection to generate the raw representation information $R_r$ and the candidate boxes $B_c$.

[0020] Step 2: Establish an occlusion representation generation module, including a representation encoding voting strategy under candidate boxes and a centroid-based centrosymmetric method in spherical space to generate initial occlusion representations. Construct a shape learning network based on representation movement to generate an occlusion representation $R_o$ similar to the prototype target.

[0021] (2) Constructing a weight allocation strategy to refine the occlusion representation:

[0022] Step 3: Based on the completed occlusion representation, assign weights to each representation according to prior knowledge of density and distance, and refine the occlusion representations with higher weights in cylindrical space.

[0023] Step 4: Apply the weights to the feature channels generated from the occlusion representation during detection. Based on the completed object representation $R_a$ and the candidate boxes $B_c$, the category and location of objects are classified and regressed in the 3D object detection network.

[0024] Furthermore, firstly, an occlusion representation generation module is constructed, generating initial occlusion representations through a representation encoding voting strategy under candidate boxes and a symmetric method based on the object's centroid in spherical space. Then, a shape learning network based on representation movement is constructed, gradually generating potential occlusion representations similar to the prototype target by moving the positions of the initial occlusion representations. Next, based on the generated occlusion representations, a weight allocation strategy is established to further refine the generated occlusion representations, using density and distance prior knowledge, from spherical space to cylindrical space. Finally, the completed object representation, candidate boxes, and weights are input into a detection network to further classify and regress the object's category and location.

[0025] The process of generating occlusion representations based on point cloud data is as follows:

[0026] Step 1: Acquire point cloud data $P$ from the radar sensor and input it into the feature extraction network and region proposal network for 3D target detection to generate the raw representation information $R_r$ and the candidate boxes $B_c$.

[0027] 3D object detection based on point clouds is an important task in the field of autonomous driving. Because high detection accuracy is required, detectors are often designed as two-stage networks. In traffic scenarios, the autonomous vehicle takes the point cloud $P$ acquired by the radar sensor as input. First, the point cloud data undergoes representation and feature extraction to obtain the quantized raw representation $R_r$ of the point cloud data. Then, in the region proposal network, height compression followed by a two-dimensional backbone network generates and regresses the coarse candidate boxes $B_c$. Here $R_r \in \mathbb{R}^{x \times y \times z \times f}$, where $x$ is the spatial length coordinate, $y$ is the spatial width coordinate, $z$ is the spatial height coordinate, and $f$ is the representation feature; and $B_c \in B^{x \times y \times z \times l \times w \times h \times \theta \times c}$, where $l$ is the candidate box length, $w$ is the candidate box width, $h$ is the candidate box height, $\theta$ is the angle between the candidate box and the spatial width axis, and $c$ is the predicted object category in the candidate box.

[0028] Step 2: Establish an occlusion representation generation module, including a representation encoding voting strategy under candidate boxes and a centroid-based centrosymmetric method in spherical space to generate initial occlusion representations. Construct a shape learning network based on representation movement to generate an occlusion representation $R_o$ similar to the prototype target. The occlusion representation $R_o$ is generated as follows:

[0029] $R_o = N\big(C(V(R_r, B_c)) + R_r\big) - R_r$,

[0030] where $V(\cdot)$ is the representation encoding voting strategy under candidate boxes, $C(\cdot)$ is the centroid-based centrosymmetric method in spherical space, and $N(\cdot)$ is the shape learning network based on representation movement.

[0031] The process of generating occlusion representations consists of three steps:

[0032] (1) Generate the object representation $R_v$:

[0033] (formula not reproduced in the source text)

[0034] In this formula, $F_c(\cdot)$ is a counting function, $T_c$ is the voting threshold, and $I_v$ is the encoding of the representations under the candidate box. The first step takes the original representation and the candidate boxes as input and keeps, as the representation of the corresponding object, the representations that occur most frequently across the different candidate boxes.
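The voting step can be illustrated with a minimal Python sketch. The voting formula itself is not reproduced in this text, so the sketch only follows the prose: each representation is encoded, occurrences are counted across the candidate boxes of one object, and encodings counted more often than the threshold are kept. The flattening scheme in `encode_voxel` and both function names are hypothetical and not taken from the patent.

```python
from collections import Counter


def encode_voxel(ix, iy, iz, ny, nz):
    """Flatten integer voxel indices into one integer code (hypothetical encoding)."""
    return (ix * ny + iy) * nz + iz


def vote_representations(codes_per_box, vote_threshold):
    """Keep the codes that occur in more than `vote_threshold` candidate boxes."""
    counts = Counter()
    for codes in codes_per_box:
        # A candidate box votes at most once for each encoded representation.
        counts.update(set(codes))
    return sorted(code for code, n in counts.items() if n > vote_threshold)


# Three overlapping candidate boxes proposed for the same object.
codes_per_box = [[3, 7, 9], [7, 9, 11], [7, 9, 9, 12]]
kept = vote_representations(codes_per_box, vote_threshold=2)
print(kept)  # [7, 9]
```

Counting encodings instead of comparing coordinates is one plausible reading of why the description credits the voting strategy with reducing computational cost.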

[0035] (2) Generate the initial occlusion representation:

[0036] (formula not reproduced in the source text)

[0037] In this formula, $L(\cdot)$ is the centrosymmetric mapping function and $C_v$ is the centroid of $R_v$ in spherical space. The second step takes the voting result, that is, the representations of the same object under different candidate boxes, together with the object centroid as input and generates the initial occlusion representation.
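A minimal sketch of a centrosymmetric mapping follows, assuming it is a plain point reflection through the centroid (each point $p$ maps to $2C_v - p$) and that the centroid is approximated by the mean of the candidate box centers, as the description states elsewhere. The exact mapping formula is not reproduced in this text, and the function names are illustrative.

```python
import numpy as np


def centroid_from_boxes(box_centers):
    """Approximate the object centroid by the mean of its candidate box centers."""
    return np.asarray(box_centers, dtype=float).mean(axis=0)


def reflect_through_centroid(points, centroid):
    """Point reflection: every point p is mapped to 2 * centroid - p."""
    return 2.0 * np.asarray(centroid, dtype=float) - np.asarray(points, dtype=float)


box_centers = np.array([[1.0, 2.0, 0.5], [1.2, 2.2, 0.5], [0.8, 1.8, 0.5]])
c_v = centroid_from_boxes(box_centers)

# Observed representations on the visible side of the object.
r_v = np.array([[0.5, 2.0, 0.5], [0.6, 1.7, 0.9]])
r_init = reflect_through_centroid(r_v, c_v)
print(r_init)
```

The reflected points land on the side of the object opposite the observed surface, which is where the occluded part is expected to lie.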

[0038] (3) Generate the occlusion representation $R_o$ through the network:

[0039] (formula not reproduced in the source text)

[0040] In this formula, $t$ is the number of iterations of the detection network and $R_s$ is the prototype target representation. The third step takes the sum of the initial occlusion representation and the original representation, together with the prototype target representation, as input. Through the shape learning network based on representation movement, the network learns the shape prior knowledge of the corresponding prototype target. After network iteration, the final occlusion representation $R_o$ is generated.

[0041] The entire generation process is shown in Figure 2. The process begins by initializing the occlusion representation list for the batch of point cloud data and the occlusion representation list for each object. Then, the occlusion representations are initialized sequentially using the representation encoding voting strategy under candidate boxes and the centroid-based centrosymmetric method in spherical space. Next, a shape learning network based on representation movement is constructed, and a prototype target is built to supervise the shape prior learned by the occlusion representations. Finally, after each candidate box is processed, the counter `i` is incremented by 1, and the inner loop exits when `i` reaches the number of candidate boxes in the batch. After each batch is processed, the counter `j` is incremented by 1, and the outer loop exits when `j` reaches the total number of batches, completing the generation of the occlusion representations.

[0042] Furthermore, the occlusion representation completion module can be divided into three parts: a representation encoding voting strategy under candidate boxes, a centroid-based centrosymmetric method in spherical space, and a shape learning network based on representation movement. The representation encoding voting strategy votes on the representation encodings under the candidate boxes to obtain the original representation of the same object under different candidate boxes, which reduces computational cost. The centroid-based centrosymmetric method in spherical space approximates the object's centroid by the mean of the centers of the candidate boxes belonging to the same object, and reflects the original representation through that centroid to initialize the occlusion representation. The shape learning network based on representation movement builds a recurrent neural network on PointNet++ and learns the positional offsets of the occlusion representations under chamfer distance and earth mover's distance constraints, thereby learning shape prior knowledge. To select the prototype target, point clouds of objects of the same category in a standard dataset are aligned and mixed, the mixed point cloud is voxelized, and the set of voxel centers carrying the most representation information is selected as the prototype target representation. The occlusion representation completion module is embedded in the detection network and iterates along with it. Through data updates and iterative network optimization, the final occlusion representation is generated.
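For reference, the chamfer distance constraint mentioned above can be sketched in a few lines of NumPy. This uses a common symmetric variant with squared Euclidean distances; the patent does not state which variant it uses, so the choice here is an assumption, and the function name is illustrative.

```python
import numpy as np


def chamfer_distance(a, b):
    """Symmetric chamfer distance between point sets a (N, 3) and b (M, 3).

    Assumed variant: mean squared nearest-neighbour distance, summed over both directions.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Pairwise squared distances, shape (N, M).
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)
    return d2.min(axis=1).mean() + d2.min(axis=0).mean()


generated = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
prototype = np.array([[0.0, 0.0, 0.0], [1.0, 0.5, 0.0]])
print(chamfer_distance(generated, prototype))  # 0.25
```

The value shrinks toward zero as the moved occlusion representations approach the prototype target, which is what makes it usable as a shape supervision signal.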

[0043] Step 3: Based on the completed occlusion representation, assign weights to each representation according to prior knowledge of density and distance, and refine the occlusion representations with higher weights in cylindrical space.

[0044] The density- and distance-based weight allocation strategy is shown in Figure 3. The strategy is divided into a density weight calculation and a distance weight calculation. The density weight is used to evaluate the quality of a representation, while the distance weight is used to distinguish between foreground and background representations of the same quality. The weight allocation is refined by shrinking from a spherical space to a cylindrical space.

[0045] Furthermore, the occlusion representation $r'_o$ is refined from spherical space to cylindrical space. The formula is:

[0046] (formula not reproduced in the source text)

[0047] In this formula, the retained representations are the occlusion representations that satisfy the distance constraint, $i$ is the index of the occlusion representation in spherical space, $l(\cdot)$ is the distance function, $d_m$ is the diameter of the cylinder, $dx_m$ is the width of the candidate box, $dy_m$ is the length of the candidate box, $dz_m$ is the height of the candidate box, $\min(\cdot)$ is the minimum function, and $\max(\cdot)$ is the maximum function. Given the candidate box as a prior, the type of object in the box can be approximately determined from its length, width, and height. For example, in a traffic scene, the cylindrical space for the pedestrian category is a cylinder perpendicular to the ground, while for the car category it is a cylinder lying parallel to the ground. The refined cylindrical space filters out some of the background representations surrounding the foreground representations.
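The cylindrical refinement can be sketched as a point-in-cylinder test. Because the refinement formula is not reproduced in this text, everything specific in the sketch is an assumption: the rule that a box taller than it is wide or long gets an upright cylinder, the way the diameter is chosen with `min` and `max`, and the `yaw` convention (heading angle measured from the X-axis). The function name `cylinder_mask` is hypothetical.

```python
import numpy as np


def cylinder_mask(points, center, dims, yaw=0.0):
    """Boolean mask of the points that fall inside a cylinder fitted to a candidate box.

    dims = (dx, dy, dz) are the box extents; all selection rules below are assumptions.
    """
    rel = np.asarray(points, dtype=float) - np.asarray(center, dtype=float)
    dx, dy, dz = dims
    if dz >= max(dx, dy):
        # Tall box (pedestrian-like): cylinder axis perpendicular to the ground.
        axis = np.array([0.0, 0.0, 1.0])
        diameter = max(dx, dy)
        half_length = dz / 2.0
    else:
        # Long box (car-like): cylinder axis parallel to the ground, along the heading.
        axis = np.array([np.cos(yaw), np.sin(yaw), 0.0])
        diameter = max(min(dx, dy), dz)
        half_length = max(dx, dy) / 2.0
    along = rel @ axis                                    # signed position along the axis
    radial = np.linalg.norm(rel - np.outer(along, axis), axis=1)  # distance to the axis
    return (radial <= diameter / 2.0) & (np.abs(along) <= half_length)


pedestrian = cylinder_mask([[0, 0, 0.5], [1, 0, 0], [0, 0, 1.0]],
                           center=[0, 0, 0], dims=(0.6, 0.8, 1.8))
car = cylinder_mask([[1.5, 0, 0], [0, 1.0, 0]],
                    center=[0, 0, 0], dims=(1.8, 4.0, 1.5), yaw=0.0)
print(pedestrian, car)
```

Points near the corners of the original spherical neighbourhood fail the radial test, which matches the stated goal of discarding background representations around the foreground.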

[0048] Based on prior knowledge of density, the density information $p_w(r'_o)$ of each occlusion representation is represented as:

[0049] (formula not reproduced in the source text)

[0050] In this formula, $K(\cdot)$ is the kernel function, $w_d$ is the bandwidth, $N$ is the number of occlusion representations, and $i$ is the index of the occlusion representation in spherical space. The kernel density function takes the 3D coordinates of the occlusion representations and the bandwidth as input and returns the density of each representation under the candidate box. However, because most points lie on the object surface, the density of some sparse points inside the object can be close to the density of points outside the object in the cylindrical space, which is undesirable when assigning weights. A distance factor is therefore introduced to alleviate this.
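A minimal kernel density sketch follows, assuming a Gaussian kernel and the standard estimator $p(r_i) = \frac{1}{N w_d^3} \sum_j K\big((r_i - r_j)/w_d\big)$. The patent's density formula is not reproduced in this text, so the kernel choice and normalisation are assumptions, and the function name is illustrative.

```python
import numpy as np


def kernel_density(points, bandwidth):
    """Gaussian kernel density estimate evaluated at every input point.

    points: (N, D) coordinates of the occlusion representations.
    bandwidth: scalar w_d shared by all dimensions (assumed isotropic).
    """
    points = np.asarray(points, dtype=float)
    n, dim = points.shape
    # Scaled pairwise differences, shape (N, N, D).
    diff = (points[:, None, :] - points[None, :, :]) / bandwidth
    kernel = np.exp(-0.5 * (diff ** 2).sum(axis=-1)) / (2.0 * np.pi) ** (dim / 2.0)
    return kernel.sum(axis=1) / (n * bandwidth ** dim)


points = np.array([[0.0, 0.0, 0.0], [0.1, 0.0, 0.0], [0.0, 0.1, 0.0], [5.0, 5.0, 5.0]])
density = kernel_density(points, bandwidth=0.5)
print(density)
```

The three clustered points receive a higher density than the isolated one, so density alone already separates well-supported representations from stray ones.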

[0051] Based on prior knowledge of distance, the distance information $l(r'_o)$ of each occlusion representation is represented as:

[0052] (formula not reproduced in the source text)

[0053] In this formula, $E(\cdot)$ is the Euclidean distance function, the coordinates of the candidate box center serve as the reference point, and ∈ is the negative of the candidate box tilt angle, where the tilt angle is the counterclockwise angle measured from the Y-axis in the plane that passes through the candidate box center and is perpendicular to the Z-axis. Based on spatial geometry, the occlusion representations in the cylindrical space and the coordinates of the object's centroid are taken as input, and the angle information is then used to calculate the distance of each representation from the perpendicular bisector of the cylinder.

[0054] After the density function and the distance function are calculated separately, a linear combination of the two yields the weight allocation function $W(r'_o)$:

[0055] $W(r'_o) = S_p \cdot \big(\mu \cdot p_w(r'_o) + (1-\mu) \cdot l(r'_o)^{-1}\big)$,

[0056] where $S_p$ is the candidate box confidence score and $\mu$ is the balancing weight coefficient, set to 0.6. The weight of each occlusion representation is calculated with this function. The formula implies that the detection network should learn points with higher density and closer proximity to the interior. By calculating a weight for each representation and setting a weight threshold, points with lower empirical confidence are filtered out, which simplifies the generated occlusion representations. The threshold is set to 0.2.
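The weight allocation function and the threshold filter translate directly into NumPy. In this sketch $\mu = 0.6$ and the threshold 0.2 come from the description; the small `eps` guard against division by zero and the sample values are additions for illustration only.

```python
import numpy as np


def allocation_weight(score, density, distance, mu=0.6, eps=1e-6):
    """W = S_p * (mu * p_w + (1 - mu) / l), evaluated per occlusion representation."""
    density = np.asarray(density, dtype=float)
    # eps is not in the patent formula; it only avoids dividing by a zero distance.
    distance = np.maximum(np.asarray(distance, dtype=float), eps)
    return score * (mu * density + (1.0 - mu) / distance)


weights = allocation_weight(score=0.9,
                            density=[0.5, 0.1, 0.05],
                            distance=[0.5, 2.0, 4.0])
keep = weights > 0.2  # weight threshold from the description
print(weights, keep)  # weights are approximately [0.99, 0.234, 0.117]
```

The third representation is both sparse and far from the axis, so it falls below the threshold and is dropped, while the first two are kept.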

[0057] Step 4: Apply the weights to the feature channels generated from the occlusion representation during detection. Based on the completed object representation $R_a$ and the candidate boxes $B_c$, the category and location of objects are classified and regressed in the 3D object detection network.

[0058] Given the refined occlusion representation $r'_o$ in cylindrical space and the density- and distance-based weights $W(r'_o)$, the threshold-filtered weights are multiplied into each channel of the occlusion representation features. This forms an attention mechanism that allows the detection network to learn the semantic information of the features surrounding the generated occlusion representations. It can be expressed by the formulas:

[0059] $F' = F \cdot W(r''_o)$,

[0060] $\{\, r''_o \mid W(r''_o) > T,\ r''_o \in r'_o \,\}$,

[0061] where $F$ represents the features aggregated at the occlusion representations, $F'$ represents the features enhanced after the weights are applied to $F$, $T$ is the weight threshold, and $r''_o$ is the occlusion representation refined from $r'_o$ according to the threshold $T$.
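The channel weighting can be sketched as a masked broadcast multiplication. The feature layout of one row per occlusion representation and one column per channel is an assumption, as is the function name; the threshold default of 0.2 is the value given in the description.

```python
import numpy as np


def weight_features(features, weights, threshold=0.2):
    """Keep representations with W > T and scale every feature channel by W.

    features: (N, C) array, one row per occlusion representation (assumed layout).
    weights:  (N,) array of allocation weights W.
    """
    features = np.asarray(features, dtype=float)
    weights = np.asarray(weights, dtype=float)
    keep = weights > threshold
    # Broadcasting multiplies each kept row by its own scalar weight.
    return features[keep] * weights[keep, None], keep


features = np.ones((3, 4))
enhanced, keep_mask = weight_features(features, [0.99, 0.234, 0.117])
print(enhanced.shape, keep_mask)  # (2, 4) [ True  True False]
```

Because one scalar multiplies all channels of a representation, the weight acts as a per-point attention score, not a per-channel gate.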

[0062] The network's loss function consists of three parts: the region proposal network loss $L_{rpn}$, the shape learning network loss $L_{shape}$, and the second-stage detection network loss $L_{rcnn}$:

[0063] $L_{rpn} = L_{cls} + \alpha \cdot L_{dir} + \beta \cdot L_{loc}$,

[0064] $L_{shape} = L_{cha} + \varepsilon \cdot L_{emd}$,

[0065] $L_{rcnn} = L_{iou} + \gamma \cdot L_{reg}$,

[0066] The total network loss is:

[0067] $L = L_{rpn} + L_{shape} + L_{rcnn}$,

[0068] where $L_{cls}$ is the region proposal network classification loss, $L_{dir}$ is the region proposal network direction loss, $L_{loc}$ is the region proposal network regression loss, $L_{cha}$ is the chamfer distance loss of the shape learning network, $L_{emd}$ is the earth mover's distance loss of the shape learning network, $L_{iou}$ is the detection network classification loss, and $L_{reg}$ is the detection network regression loss. $\alpha$, $\beta$, $\varepsilon$, and $\gamma$ are the loss balancing coefficients, set to 0.2, 2.0, 0.001, and 1.0, respectively. By iterating the network parameters, the occlusion representation learns the shape prior of the prototype target. In summary, after the point cloud data is propagated through and processed by the network, high-quality regression boxes and categories are obtained for occluded targets.
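The loss combination is simple arithmetic and can be checked with a short sketch. The coefficient defaults are the values stated in the description; the individual loss terms are passed in as plain numbers here, whereas in training they would come from the respective network heads. The function name is illustrative.

```python
def total_loss(l_cls, l_dir, l_loc, l_cha, l_emd, l_iou, l_reg,
               alpha=0.2, beta=2.0, epsilon=0.001, gamma=1.0):
    """Combine the three partial losses into the total loss L."""
    l_rpn = l_cls + alpha * l_dir + beta * l_loc   # region proposal network loss
    l_shape = l_cha + epsilon * l_emd              # shape learning network loss
    l_rcnn = l_iou + gamma * l_reg                 # second-stage detection loss
    return l_rpn + l_shape + l_rcnn


loss = total_loss(l_cls=1.0, l_dir=1.0, l_loc=1.0, l_cha=1.0,
                  l_emd=1.0, l_iou=1.0, l_reg=1.0)
print(round(loss, 3))  # 6.201
```

With every term set to 1, the result shows the relative influence of each part: the localization term is weighted most heavily and the earth mover's distance term contributes only 0.001.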

[0069] The application of this invention requires the construction of a deep learning model using point cloud datasets in traffic scenarios. Through end-to-end learning and training, the resulting model alleviates the weakening effect of point cloud sparsity and incompleteness on the performance of general detectors, and can achieve higher detection accuracy on some difficult point cloud samples.

[0070] Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present invention, and not to limit it. Parts not described in detail are common knowledge to those skilled in the art. The scope of protection of the present invention is determined by the claims, and any equivalent changes made based on the technical teachings of the present invention are also within the scope of protection of the present invention.

Claims

1. A three-dimensional target detection method based on generating and refining occlusion representations, characterized in that it includes the following steps:
Point cloud data is acquired from radar sensors and input into the feature extraction network and region proposal network for 3D object detection to generate raw representation information and candidate boxes.
An occlusion representation generation module is established, including a representation encoding voting strategy under candidate boxes and a centroid-based centrosymmetric method in spherical space to generate initial occlusion representations. A shape learning network based on representation movement is then constructed to generate occlusion representations similar to the prototype target. The occlusion representation $R_o$ is generated as follows: $R_o = N\big(C(V(R_r, B_c)) + R_r\big) - R_r$; where $V(\cdot)$ is the representation encoding voting strategy under candidate boxes, $C(\cdot)$ is the centrosymmetric method based on the object centroid in spherical space, and $N(\cdot)$ is the shape learning network based on representation movement.
The process of generating object occlusion representations consists of three steps:
(1) Generate the object representation $R_v$: (formula not reproduced in the source text); where $F_c(\cdot)$ is a counting function, $T_c$ is the voting threshold, and $I_v$ is the encoding of the representations under the candidate box; the first step takes the original representation and the candidate boxes as input and generates, as the representation of the corresponding object, the representations that appear under different candidate boxes more often than a set threshold.
(2) Generate the initial occlusion representation: (formula not reproduced in the source text); where $L(\cdot)$ is the centrosymmetric mapping function and $C_v$ is the centroid of $R_v$ in spherical space; the second step takes the representations of the same object under different candidate boxes and the coordinates of the object's centroid as input and generates the initial occlusion representation.
(3) Generate the occlusion representation $R_o$ through the network: (formula not reproduced in the source text); where $t$ is the number of iterations of the detection network and $R_s$ is the prototype target representation; the third step takes the sum of the initial occlusion representation and the original representation, along with the prototype target representation, as input; through the shape learning network based on representation movement, the network learns the shape prior knowledge of the corresponding prototype target; after network iteration, the final occlusion representation $R_o$ is generated.
Based on the completed occlusion representation, each representation is assigned a weight according to prior knowledge of density and distance, and the occlusion representations with a weight higher than a set threshold are refined in cylindrical space. Specifically, the weight allocation strategy is divided into a density weight calculation and a distance weight calculation. The density weight is used to evaluate the representation quality, and the distance weight is used to distinguish foreground and background representations of the same density quality. The weight allocation is refined by shrinking from the spherical space to the cylindrical space.
The occlusion representation $r'_o$ in cylindrical space is represented as: (formula not reproduced in the source text); where the retained representations are the occlusion representations that satisfy the distance constraint, $i$ is the index of the occlusion representation in spherical space, $l(\cdot)$ is the distance function, $d_m$ is the diameter of the cylinder, $dx_m$ is the candidate box width, $dy_m$ is the candidate box length, $dz_m$ is the candidate box height, $\min(\cdot)$ is the minimum function, and $\max(\cdot)$ is the maximum function.
Based on prior knowledge of density, the density information $p_w(r'_o)$ of each occlusion representation is represented as: (formula not reproduced in the source text); where $K(\cdot)$ is the kernel function, $w_d$ is the bandwidth, $N$ is the number of occlusion representations, and $i$ is the index of the occlusion representation in spherical space.
Based on prior knowledge of distance, the distance information $l(r'_o)$ of each occlusion representation is represented as: (formula not reproduced in the source text); where $E(\cdot)$ is the Euclidean distance function, the coordinates of the candidate box center serve as the reference point, and ∈ is the negative of the candidate box tilt angle, where the tilt angle is the counterclockwise angle measured from the Y-axis in the plane that passes through the candidate box center and is perpendicular to the Z-axis.
The density and distance information are combined to obtain the density- and distance-based weight allocation, calculated as: $W(r'_o) = S_p \cdot \big(\mu \cdot p_w(r'_o) + (1-\mu) \cdot l(r'_o)^{-1}\big)$; where $S_p$ is the confidence score of the candidate box and $\mu$ is the balancing weight coefficient; an empirical threshold is set to filter out occlusion representations with lower weights.
The weights are applied to the feature channels generated from the occlusion representation during detection. Based on the completed object representation and candidate boxes, the object category and location are classified and regressed in the 3D object detection network.

2. The three-dimensional target detection method based on generating and refining occlusion representations according to claim 1, characterized in that the point cloud data $P$ acquired from radar sensors is input into the feature extraction network and region proposal network for 3D object detection to generate the raw representation information $R_r$ and the candidate boxes $B_c$, as follows: the point cloud data $P$ acquired by the radar sensors is taken as input; first, the quantized raw representation $R_r$ of the point cloud data is obtained through a 3D backbone network, which includes a point cloud feature extractor or a voxel feature extractor; then, the candidate boxes $B_c$ are classified and regressed by the region proposal network through height compression and a two-dimensional backbone network; where $R_r \in \mathbb{R}^{x \times y \times z \times f}$, $x$ is the spatial length coordinate, $y$ is the spatial width coordinate, $z$ is the spatial height coordinate, and $f$ is the representation feature; $B_c \in B^{x \times y \times z \times l \times w \times h \times \theta \times c}$, $l$ is the candidate box length, $w$ is the candidate box width, $h$ is the candidate box height, $\theta$ is the angle between the candidate box and the spatial width axis, and $c$ is the predicted category of the object in the candidate box.

3. The three-dimensional target detection method based on generating and refining occlusion representations according to claim 1, characterized in that the weights are applied to the feature channels generated from the occlusion representation during detection and, based on the completed object representation $R_a$ and the candidate boxes $B_c$, the category and location of objects are classified and regressed in the 3D object detection network, as follows: the weights are applied to the feature channels aggregated at the occlusion representations, with the formulas: $F' = F \cdot W(r''_o)$; $\{\, r''_o \mid W(r''_o) > T,\ r''_o \in r'_o \,\}$; where $F$ represents the features aggregated at the occlusion representations, $F'$ represents the features enhanced after the weights are applied to $F$, $T$ is the weight threshold, and $r''_o$ is the occlusion representation refined from $r'_o$ according to the threshold $T$. The network's loss function consists of three parts, the region proposal network loss $L_{rpn}$, the shape learning network loss $L_{shape}$, and the second-stage detection network loss $L_{rcnn}$: $L_{rpn} = L_{cls} + \alpha \cdot L_{dir} + \beta \cdot L_{loc}$; $L_{shape} = L_{cha} + \varepsilon \cdot L_{emd}$; $L_{rcnn} = L_{iou} + \gamma \cdot L_{reg}$; the total network loss is: $L = L_{rpn} + L_{shape} + L_{rcnn}$; where $L_{cls}$ is the region proposal network classification loss, $L_{dir}$ is the region proposal network direction loss, $L_{loc}$ is the region proposal network regression loss, $L_{cha}$ is the chamfer distance loss of the shape learning network, $L_{emd}$ is the earth mover's distance loss of the shape learning network, $L_{iou}$ is the detection network classification loss, $L_{reg}$ is the detection network regression loss, and $\alpha$, $\beta$, $\varepsilon$, and $\gamma$ are the loss balancing coefficients.

4. The three-dimensional target detection method based on generating and refining occlusion representations according to claim 3, characterized in that $\alpha$, $\beta$, $\varepsilon$, and $\gamma$ are set to 0.2, 2.0, 0.001, and 1.0, respectively.

5. An electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that, when the processor executes the program, it implements the steps of the method as described in any one of claims 1-4.

6. A computer-readable storage medium having a computer program stored thereon, characterized in that, when the program is executed by a processor, it implements the steps of the method as described in any one of claims 1-4.