A two-stage 3D object detection method, system, medium and terminal

By designing multi-scale bottom-up feature maps and lateral connection layers, and combining them with pooled feature maps at specified resolutions, 3D region candidates are refined directly in the bird's-eye view (BEV) space. This addresses the high computational and storage cost of existing techniques and improves the accuracy and efficiency of small-target detection.

CN116110029B · Active · Publication Date: 2026-06-23 · SHANGHAI JIAOTONG UNIV

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
SHANGHAI JIAOTONG UNIV
Filing Date
2023-02-28
Publication Date
2026-06-23

AI Technical Summary

Technical Problem

Existing two-stage 3D object detection methods consume significant computational and storage resources, and the representation of intermediate key points and pooling points limits performance improvement.

Method used

A feature pyramid (FPN) is constructed using multi-scale bottom-up feature maps. Combined with lateral connection layers and pooled feature maps of specified resolution, 3D region candidates are refined directly in the bird's-eye view (BEV) space to generate high-quality 3D object detection results.

Benefits of technology

By reducing the number of intermediate key points and pooling points, detection performance is improved, especially the accuracy and efficiency of small target detection, while reducing the consumption of computing and storage resources.

✦ Generated by Eureka AI based on patent content.

Abstract

The application provides a two-stage 3D target detection method, comprising: extracting multi-scale bottom-up feature maps based on point cloud pillars from an original airborne LiDAR point cloud; constructing a lateral connection layer based on the multi-scale bottom-up feature maps, so that the feature pyramid (FPN) formed by these feature maps is introduced into the region candidate network (RPN) and high-quality 3D region candidates are generated; and constructing, based on the multi-scale bottom-up feature maps {C1,C2,C3,C4,C5} and the lateral connection layer, a pooling feature map with a specified resolution, which is used to refine the 3D region candidates in the bird's-eye view (BEV) space and generate the final 3D target detection result. The application fully exploits the ability of the pillar representation to capture geometric structure, improves small-target detection performance by using FPN, a basic component in computer vision, and completes target detection with bilinear-interpolation RoI pooling in the BEV space, thereby improving detection quality.

Description

Technical Field

[0001] This invention relates to the field of computer vision technology, specifically to a two-stage 3D target detection method, system, medium, and terminal.

Background Technology

[0002] Object detection is an important area in computer vision, with broad application prospects and market value. With the development of sensor and automotive technologies, the role of 3D object detection in autonomous driving is becoming increasingly apparent. One of the sensors commonly used for 3D object detection in autonomous driving is LiDAR, which produces unordered and sparse point clouds. Because LiDAR point clouds capture the geometry of the environment accurately and the cost of LiDAR is decreasing, high-performance two-stage 3D object detection methods have gradually become a research focus both domestically and internationally.

[0003] Existing two-stage 3D object detection methods use a point- / grid-based feature extraction backbone network in the first stage, and then refine 3D region proposals in 3D space. In previous two-stage detection methods, effective point representation plays a crucial role, which can be summarized into two main key steps: scene-to-keypoint feature encoding and 3D region proposal feature extraction in 3D space. 1. Scene-to-keypoint feature encoding: Encodes the entire scene features into a small number of keypoints, with the granularity of the keypoints closely related to the feature extraction backbone network; 2. 3D region proposal feature extraction in 3D space: Represents the 3D region proposal as uniformly distributed points or voxels, and then uses a point-based R-CNN process for point cloud region pooling.

[0004] Previous two-stage 3D object detection methods have relied on three key factors to improve performance: 1. Precisely located keypoints retain rich 3D structural information; 2. Each 3D region proposal is represented as pooling points distributed in 3D space; 3. Local features of pooling points are accumulated at keypoints. However, relying on intermediate keypoint representations and a large number of pooling points consumes significant computational and storage resources across the entire two-stage 3D object detection system.

Summary of the Invention

[0005] To address the shortcomings of existing technologies, the purpose of this invention is to provide a two-stage 3D target detection method, system, medium, and terminal.

[0006] According to one aspect of the present invention, a two-stage 3D target detection method is provided, comprising a first stage and a second stage;

[0007] The first stage includes:

[0008] Extract multi-scale bottom-up feature maps based on point cloud pillars from the original airborne LiDAR point cloud;

[0009] Based on the multi-scale bottom-up feature map, a lateral connection layer is constructed to guide the feature pyramid FPN formed by the multi-scale bottom-up feature map to the region candidate network RPN, generating high-quality 3D region candidates.

[0010] The second stage includes:

[0011] Based on the multi-scale bottom-up feature maps {C1,C2,C3,C4,C5}, and based on the lateral connection layer, a pooling feature map of a specified resolution is constructed to refine the 3D region candidates in the bird's-eye view BEV space, generating the final 3D object detection result.

[0012] Preferably, the extraction of multi-scale bottom-up feature maps {C1, C2, C3, C4, C5} based on the point cloud pillar from the original airborne LiDAR point cloud includes:

[0013] Obtain the raw LiDAR point cloud L, the point cloud Pillar converter $FE_L$, and the 2D-CNN-based sparse feature extractor $FE_P$;

[0014] The raw LiDAR point cloud L is input into the point cloud Pillar converter $FE_L$ to obtain the sparse Pillar feature representation P of the point cloud L;

[0015] The sparse Pillar feature representation P is input into the feature extractor $FE_P$ to generate the multi-scale bottom-up feature maps, namely the sparse and dense bottom-up backbone features {C1,C2,C3,C4,C5}, where {C1,C2,C3,C4} are low-level sparse backbone features and C5 is a dense high-level semantic backbone feature.
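
As a rough illustration of the Pillar conversion step, the sketch below groups raw points into vertical columns on a BEV grid and stores only the non-empty columns, which is what makes the representation sparse. The function name `pillarize`, the 0.5 m pillar size, the grid origin, and the use of a per-pillar mean as the feature are all assumptions made for illustration; the text does not specify how the Pillar converter encodes the points inside each pillar.

```python
from collections import defaultdict


def pillarize(points, pillar_size=0.5, x_min=0.0, y_min=0.0):
    """Group (x, y, z, intensity) points into BEV pillars.

    Returns {(ix, iy): mean feature tuple}. Only non-empty pillars are
    stored, so the result is a sparse representation of the scene.
    The mean is a stand-in for a learned per-pillar encoder (assumption).
    """
    buckets = defaultdict(list)
    for p in points:
        ix = int((p[0] - x_min) // pillar_size)  # column index along x
        iy = int((p[1] - y_min) // pillar_size)  # column index along y
        buckets[(ix, iy)].append(p)

    pillars = {}
    for key, pts in buckets.items():
        n = len(pts)
        pillars[key] = tuple(sum(p[d] for p in pts) / n
                             for d in range(len(pts[0])))
    return pillars


# Three toy points: two fall into pillar (0, 0), one into pillar (2, 0).
points = [(0.1, 0.2, 1.0, 0.5), (0.3, 0.4, 3.0, 0.7), (1.2, 0.1, 0.0, 0.1)]
P = pillarize(points)
```

Because the height axis is collapsed into the per-pillar feature, everything downstream can operate on a 2D grid, which fits the text's emphasis on a 2D-CNN-based feature extractor.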

[0016] Preferably, the step of constructing a lateral connection layer based on the multi-scale bottom-up feature map to guide the feature pyramid (FPN) formed by the multi-scale bottom-up feature map to the region candidate network (RPN) to generate high-quality 3D region candidates includes:

[0017] Design a lateral connection layer that upsamples the low-resolution feature map C5 with dense high-level semantics to obtain a high-resolution top-down feature map, and merges it with the bottom-up feature map C4 of the same spatial size to form a high-resolution feature map P4 rich in semantic information, which is used to detect large objects.

[0018] The feature map P4 is further upsampled to obtain a high-resolution top-down feature map rich in semantic information, and then fused with the bottom-up feature map C3 of the same spatial size to form a new feature map P3, which is used to detect small objects.

[0019] Using the same center-based detection head, 3D region candidates $Pro_{3D}$, that is, bounding-box regression and classification results, are generated on the respective feature maps {P4, P3} for target categories of different sizes; large objects are detected on the lower-resolution P4 and small objects are detected on the higher-resolution P3.

[0020] Preferably, a loss function is constructed from a heatmap loss and the terms $L_{off}$, $L_z$, $L_{size}$, and $L_{ori}$, which are, respectively, the center position offset, height offset, target size offset, and target orientation loss functions relative to their respective ground-truth values; λ is the weighting coefficient.
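
The formula itself is not reproduced in this text. One form consistent with the terms listed above would be the following, assuming that the heatmap loss is unweighted and that a single λ scales the sum of the four regression terms. The symbol $L_{hm}$ for the heatmap loss is chosen here for illustration, and the patent's actual formula may differ:

```latex
L = L_{hm} + \lambda \left( L_{off} + L_{z} + L_{size} + L_{ori} \right)
```

Under this assumed form, a larger λ gives the regression terms more weight relative to the heatmap term.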

[0021] Preferably, constructing a pooled feature map of a specified resolution based on the multi-scale bottom-up feature map and the lateral connection layer includes:

[0022] The feature map P4 is further upsampled to obtain a high-resolution top-down feature map rich in semantic information, and then fused with the bottom-up feature map C3 of the same spatial size to form a new feature map P3;

[0023] The feature map P3 is further upsampled to obtain a high-resolution top-down feature map rich in semantic information, and then fused with the bottom-up feature map C2 of the same spatial size to form a new feature map P2;

[0024] P2 or P3 is a pooling feature map with selectable size, used for RoI pooling of the 3D region candidates.

[0025] Preferably, the step of refining the 3D region candidates in the bird's-eye view BEV space to generate the final 3D target detection result includes:

[0026] The 3D region candidates $Pro_{3D}$ are projected onto the bird's-eye view (BEV) space to obtain rotated 2D region candidates $Pro_{2D}$;

[0027] For the 2D region candidates $Pro_{2D}$, region candidate features are extracted from the pooling feature map P2 or P3 using 2D RoI pooling. The pooling feature point size is G×G.
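
The projection and the G×G pooling points can be sketched as follows. This is a minimal illustration under stated assumptions: the box parameterization `(cx, cy, cz, length, width, height, yaw)`, the function name `bev_grid_points`, the default `G=7`, and the placement of pooling points at cell centers are not given in the text.

```python
import math


def bev_grid_points(box3d, G=7):
    """Return the G*G pooling points of a 3D box projected onto BEV.

    box3d = (cx, cy, cz, length, width, height, yaw), yaw in radians.
    Dropping cz and height is the projection onto the BEV plane; the
    result is a rotated rectangle sampled on a uniform G x G grid.
    """
    cx, cy, _, length, width, _, yaw = box3d
    cos_y, sin_y = math.cos(yaw), math.sin(yaw)
    pts = []
    for i in range(G):
        for j in range(G):
            # Cell-center offsets in the box's own (unrotated) frame.
            lx = ((i + 0.5) / G - 0.5) * length
            ly = ((j + 0.5) / G - 0.5) * width
            # Rotate by yaw and translate to the box center.
            pts.append((cx + lx * cos_y - ly * sin_y,
                        cy + lx * sin_y + ly * cos_y))
    return pts


# A 4 m x 2 m box centered at (10, 5), unrotated, with a 2 x 2 grid.
pts = bev_grid_points((10.0, 5.0, 0.0, 4.0, 2.0, 1.5, 0.0), G=2)
```

The returned coordinates are metric BEV positions. Before sampling a feature map they would still have to be converted to pixel coordinates using the pillar size and the stride of P2 or P3, which is omitted here.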

[0028] Preferably, regression and confidence estimation are performed for the 3D region candidates $Pro_{3D}$, the bounding-box regression loss $L'_{reg}$ and the category confidence loss $L'_{cls}$ are calculated, and the loss function $L_{refinement} = L'_{cls} + L'_{reg}$ is constructed.

[0029] According to a second aspect of the present invention, a two-stage 3D target detection system is provided, comprising:

[0030] The initial feature extraction module extracts multi-scale bottom-up feature maps based on the point cloud pillar from the original airborne LiDAR point cloud.

[0031] The 3D region proposal generation module constructs a lateral connection layer based on the multi-scale bottom-up feature map, guiding the feature pyramid FPN formed by the multi-scale bottom-up feature map to the region candidate network RPN to generate high-quality 3D region candidates.

[0032] The target detection module constructs a pooled feature map of a specified resolution based on the multi-scale bottom-up feature map and the lateral connection layer. This map is used to refine the 3D region candidates in the bird's-eye view BEV space and generate the final 3D target detection result.

[0033] According to a third aspect of the present invention, a terminal is provided, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the program, can be used to execute any of the two-stage 3D target detection methods described herein, or to run the two-stage 3D target detection system described herein.

[0034] According to a fourth aspect of the present invention, a computer-readable storage medium is provided having a computer program stored thereon, which, when executed by a processor, can be used to perform any of the two-stage 3D target detection methods described in the present invention, or to run the two-stage 3D target detection system described in the present invention.

[0035] Compared with the prior art, the present invention has at least one of the following beneficial effects:

[0036] The two-stage 3D object detection method and system in this invention embodiment uses only commonly used 2D CNN convolution operators to achieve high-performance 3D object detection technology. Specifically, in the first stage, a feature pyramid (FPN) composed of multi-scale bottom-up features is used to improve the detection performance of small objects. In the second stage, 3D region candidates are refined by directly extracting RoI features in the BEV space using bilinear interpolation RoI pooling. This invention fully utilizes the potential of Pillar representation during the detection process. Compared with existing technologies, it does not use intermediate keypoints, has fewer pooling points, and saves the overhead associated with intermediate keypoints and a large number of pooling points.

[0037] The two-stage 3D target detection method and system in this embodiment of the invention have a designed lateral connection layer that can effectively fuse high-level semantic features and low-level spatial features. The resulting feature maps at different scales have rich semantic information, and finally generate high-quality 3D region candidates for different size categories.

[0038] The two-stage 3D target detection method and system in this embodiment of the invention extract RoI features of 3D region candidates directly in the 2D BEV space by designing pooled feature maps with controllable size (such as P2 or P3) and using bilinear interpolation, thus achieving high-quality detection.

Description of the Drawings

[0039] Other features, objects, and advantages of the present invention will become more apparent from the following detailed description of non-limiting embodiments with reference to the accompanying drawings:

[0040] Figure 1 is a flowchart of a two-stage 3D target detection method provided in one embodiment of the present invention;

[0041] Figure 2 is a flowchart of a two-stage 3D target detection method provided in a preferred embodiment of the present invention;

[0042] Figure 3 is a schematic diagram of the structure of a two-stage 3D target detection system provided in one embodiment of the present invention.

Detailed Implementation

[0043] The present invention will now be described in detail with reference to specific embodiments. These embodiments will help those skilled in the art to further understand the present invention, but do not limit the invention in any way. It should be noted that those skilled in the art can make various modifications and improvements without departing from the concept of the present invention. These all fall within the scope of protection of the present invention.

[0044] Referring to Figure 1, this invention provides an embodiment of a two-stage 3D target detection method, comprising:

[0045] S100, extract multi-scale bottom-up feature maps based on point cloud pillar from the original airborne LiDAR point cloud;

[0046] S200, based on the multi-scale bottom-up feature map in S100, constructs a lateral connection layer to guide the feature pyramid FPN formed by the multi-scale bottom-up feature map to the region candidate network RPN, generating high-quality 3D region candidates;

[0047] S300, based on the multi-scale bottom-up feature maps {C1,C2,C3,C4,C5} in S100, and using the same lateral connection layer as S200, constructs pooling feature maps of a specified resolution. These are used to refine the 3D region candidates in S200 in the bird's-eye view BEV space, generating the final 3D object detection results.

[0048] This embodiment fully leverages Pillar representation to highlight geometric structural features, utilizes the fundamental computer vision component FPN to enhance small target detection performance, and employs RoI pooling with bilinear interpolation in the BEV space to complete target detection, thereby improving detection quality.

[0049] In a preferred embodiment of the present invention, S100 is performed, namely, extracting the hierarchical features of the LiDAR point cloud L. The specific process is as follows:

[0050] S101, obtain the raw LiDAR point cloud L, the point cloud Pillar converter $FE_L$, and the 2D-CNN-based sparse feature extractor $FE_P$;

[0051] S102, input the LiDAR point cloud L obtained in S101 into the point cloud Pillar converter $FE_L$ to obtain the sparse Pillar feature representation P of the point cloud L;

[0052] S103, input the sparse Pillar feature representation P obtained in S102 into the feature extractor $FE_P$ from S101 to generate the multi-scale bottom-up feature maps {C1,C2,C3,C4,C5}, i.e., the sparse and dense bottom-up backbone features, where {C1,C2,C3,C4} are low-level sparse backbone features and C5 is a dense high-level semantic backbone feature.

[0053] This embodiment proposes a simple and elegant process to obtain multi-scale bottom-up feature maps {C1,C2,C3,C4,C5} as feature pyramids (FPN), which improves the performance of small target detection. It can obtain accurate 3D detection results based solely on Pillar representation of Lidar point clouds.

[0054] Referring to Figure 2, in a preferred embodiment of the present invention, S200 reuses the multi-scale bottom-up feature maps {C1,C2,C3,C4,C5} extracted in S100 and designs a lateral connection to introduce the feature pyramid (FPN) into the first-stage region candidate network (RPN), obtaining multi-scale feature maps rich in semantic information and generating 3D region candidates separately for different size categories. It may include the following steps:

[0055] S201, design the lateral connection: use a transposed convolution with kernel size [2,2] to upsample the feature C5 and obtain the top-down feature; use a convolution with kernel size [3,3] to reduce the dimensionality of the corresponding bottom-up feature C4; concatenate the two, and apply a convolution with kernel size [3,3] to fuse them and obtain the feature output P4 of this layer;

[0056] S202, repeat the S201 process on the feature P4 and the bottom-up feature C3 to obtain P3, giving a set of semantically rich multi-scale feature maps {P4,P3};

[0057] S203, feed the multi-scale feature maps {P4, P3} generated in S202 into the same center-based detection head to generate 3D region candidates $Pro_{3D}$, for large-size categories (such as vehicles) on the lower-resolution feature map P4 and for small-size categories (such as pedestrians) on the higher-resolution feature map P3.

[0058] In this embodiment, the designed lateral connection layer can effectively fuse dense and low-resolution semantic features with sparse and high-resolution spatial features, resulting in feature maps of different scales with rich semantic information, and finally generating high-quality 3D region candidates for different size categories.
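
The data flow of the lateral connection in S201 and S202 can be sketched in plain Python on toy feature maps. The kernel sizes follow the text. The stride of 2 for the transposed convolution, the zero padding of the 3×3 convolutions, the function names, and the hand-picked weights are assumptions for illustration only; a real implementation would use learned weights in a deep learning framework.

```python
def conv3x3(x, w):
    """3x3 convolution, stride 1, zero padding 1.
    x: [C_in][H][W]; w: [C_out][C_in][3][3]; returns [C_out][H][W]."""
    h, wd = len(x[0]), len(x[0][0])
    out = []
    for w_out in w:
        plane = [[0.0] * wd for _ in range(h)]
        for c, kernel in enumerate(w_out):
            for i in range(h):
                for j in range(wd):
                    for a in range(3):
                        for b in range(3):
                            ii, jj = i + a - 1, j + b - 1
                            if 0 <= ii < h and 0 <= jj < wd:
                                plane[i][j] += x[c][ii][jj] * kernel[a][b]
        out.append(plane)
    return out


def deconv2x2(x, w):
    """Transposed convolution, kernel 2x2, stride 2 (stride is assumed).
    x: [C_in][H][W]; w: [C_in][C_out][2][2]; returns [C_out][2H][2W]."""
    h, wd = len(x[0]), len(x[0][0])
    out = [[[0.0] * (2 * wd) for _ in range(2 * h)] for _ in range(len(w[0]))]
    for c, per_in in enumerate(w):
        for o, kernel in enumerate(per_in):
            for i in range(h):
                for j in range(wd):
                    for a in range(2):
                        for b in range(2):
                            out[o][2 * i + a][2 * j + b] += x[c][i][j] * kernel[a][b]
    return out


def lateral_fuse(top, bottom, w_up, w_lat, w_fuse):
    """Build one pyramid level from the coarser map and a bottom-up map."""
    up = deconv2x2(top, w_up)         # top-down path: double the resolution
    lat = conv3x3(bottom, w_lat)      # lateral path: 3x3 conv on the bottom-up map
    return conv3x3(up + lat, w_fuse)  # list '+' concatenates channels, then fuse


def ones(n):
    """Single-channel n x n map filled with 1.0 (toy stand-in for a feature map)."""
    return [[[1.0] * n for _ in range(n)]]


IDENT = [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0]]
w_up = [[[[1.0, 1.0], [1.0, 1.0]]]]  # 1 input channel -> 1 output channel
w_lat = [[IDENT]]                    # 1 output channel, 1 input channel
w_fuse = [[IDENT, IDENT]]            # 1 output channel, 2 concatenated inputs

P4 = lateral_fuse(ones(2), ones(4), w_up, w_lat, w_fuse)  # C5 (2x2) + C4 (4x4)
P3 = lateral_fuse(P4, ones(8), w_up, w_lat, w_fuse)       # P4 (4x4) + C3 (8x8)
```

With these toy weights every value of P4 is 2.0 and every value of P3 is 3.0, which makes it easy to see that each level adds the upsampled coarser map to the lateral map at twice the resolution.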

[0059] In a preferred embodiment, a loss function is constructed. The network parameters are updated by minimizing this loss function until the network converges. Generally, λ is set to 2.

[0060] Referring to Figure 2, in a preferred embodiment of the present invention, S300 constructs a pooling feature map of a specified resolution based on the sparse and dense bottom-up backbone features from S100. This pooling feature map is used to refine the 3D region candidates in the bird's-eye view (BEV) space to generate the final 3D object detection result. It includes the following steps:

[0061] S301, using the method described in S201 above, construct pooling feature maps {P2,P3} of optional size for RoI pooling;

[0062] S302, for the 3D region candidates $Pro_{3D}$ obtained in S200, extract features at the G×G pooling points in the 2D BEV space using only simple and well-optimized bilinear interpolation, and then apply two fully connected layers to extract the 3D RoI features.
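
Bilinear interpolation at a pooling point can be sketched as below for a single feature channel. The function names, the zero contribution of out-of-range neighbors, and the convention that `x` indexes columns and `y` indexes rows are assumptions; the pooling points are taken to be already expressed in feature-map pixel coordinates.

```python
import math


def bilinear_sample(fmap, x, y):
    """Interpolate fmap ([H][W] floats) at continuous pixel coordinates.

    x indexes columns and y indexes rows. Neighbors that fall outside
    the map contribute zero (an assumed border policy).
    """
    h, w = len(fmap), len(fmap[0])
    x0, y0 = math.floor(x), math.floor(y)
    dx, dy = x - x0, y - y0

    def at(r, c):
        return fmap[r][c] if 0 <= r < h and 0 <= c < w else 0.0

    # Weighted sum of the four surrounding pixels.
    return (at(y0, x0) * (1 - dx) * (1 - dy)
            + at(y0, x0 + 1) * dx * (1 - dy)
            + at(y0 + 1, x0) * (1 - dx) * dy
            + at(y0 + 1, x0 + 1) * dx * dy)


def roi_pool(fmap, grid_xy):
    """Sample one channel at the pooling points of a region candidate."""
    return [bilinear_sample(fmap, x, y) for x, y in grid_xy]


fmap = [[0.0, 1.0], [2.0, 3.0]]
feats = roi_pool(fmap, [(0.5, 0.5), (1.0, 0.0), (0.0, 1.0), (0.25, 0.0)])
```

In a full pipeline the same sampling would be repeated for every channel, and the resulting G×G×C values flattened before the fully connected layers.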

[0063] In this embodiment, a pooling feature map with controllable size can be designed, and 3D RoI features can be extracted using bilinear interpolation in the 2D BEV space to generate high-quality detection.

[0064] In a preferred embodiment, regression and confidence prediction are performed for the 3D region candidates $Pro_{3D}$, the regression loss $L'_{reg}$ and the classification loss $L'_{cls}$ are calculated, and the loss function $L_{refinement} = L'_{cls} + L'_{reg}$ is constructed. The network parameters are updated by minimizing this loss function until the network converges.

[0065] Based on the same inventive concept, other embodiments of the present invention also provide a two-stage 3D target detection system. Referring to Figure 3, the system comprises an initial feature extraction module, a 3D region proposal generation module, and a target detection module.

[0066] The initial feature extraction module extracts multi-scale bottom-up feature maps based on the point cloud pillar from the original airborne LiDAR point cloud; the 3D region proposal generation module constructs a lateral connection layer based on the multi-scale bottom-up feature maps, guiding the feature pyramid (FPN) formed by the multi-scale bottom-up feature maps to the region candidate network (RPN) to generate high-quality 3D region candidates; the target detection module constructs pooled feature maps of a specified resolution based on the multi-scale bottom-up feature maps and the lateral connection layer, which are used to refine the 3D region candidates in the bird's-eye view (BEV) space to generate the final 3D target detection results.

[0067] The specific implementation techniques of each module / unit in the above examples of the present invention can be referred to the steps of a two-stage 3D target detection method in the above embodiments, and will not be repeated here.

[0068] Based on the same inventive concept, other embodiments of the present invention also provide a terminal, including a memory, a processor, and a computer program stored in the memory and executable on the processor. When the processor executes the program, it can be used to perform the method in any of the embodiments, or to run the system in the embodiments.

[0069] Optionally, the memory is used to store programs. The memory may include volatile memory, for example random-access memory (RAM) such as static random-access memory (SRAM) or double data rate synchronous dynamic random-access memory (DDR SDRAM); the memory may also include non-volatile memory, such as flash memory. The memory is used to store computer programs (such as application programs and functional modules that implement the above methods), computer instructions, and the like.

[0070] The aforementioned computer programs, computer instructions, etc., can be stored in partitions within one or more memory locations. Furthermore, the aforementioned computer programs, computer instructions, data, etc., can be accessed by a processor.

[0071] A processor is used to execute a computer program stored in memory to implement the various steps of the methods involved in the above embodiments. For details, please refer to the relevant descriptions in the preceding method embodiments.

[0072] The processor and memory can be separate structures or integrated structures. When the processor and memory are separate structures, they can be coupled together via a bus.

[0073] Based on the same inventive concept, other embodiments of the present invention also provide a computer-readable storage medium having a computer program stored thereon, which, when executed by a processor, can be used to perform the method in any of the embodiments, or to run the system in the embodiments.

[0074] Computer-readable media include computer storage media and communication media, wherein communication media include any medium that facilitates the transfer of computer programs from one place to another. Storage media can be any available medium accessible to a general-purpose or special-purpose computer. An exemplary storage medium is coupled to a processor, enabling the processor to read information from and write information to the storage medium. Of course, the storage medium can also be a component of the processor. The processor and storage medium can reside in an ASIC. Alternatively, the ASIC can reside in a user device. Of course, the processor and storage medium can also exist as separate components in a communication device.

[0075] Those skilled in the art will understand that embodiments of this application can be provided as methods, systems, or computer program products. Therefore, this application can take the form of a completely hardware embodiment, a completely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, this application can take the form of a computer program product embodied on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.

[0076] This application is described with reference to flowchart illustrations and / or block diagrams of methods, apparatus (systems), and computer program products according to this application. It should be understood that each block of the flowchart illustrations and / or block diagrams, and combinations of blocks in the flowchart illustrations and / or block diagrams, can be implemented by computer program instructions. These computer program instructions can be provided to a processor of a general-purpose computer, special-purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in one or more flows of the flowchart and / or one or more blocks of the block diagram.

[0077] These computer program instructions may also be stored in a computer-readable storage medium that can direct a computer or other programmable data processing device to function in a particular manner, such that the instructions stored in the computer-readable storage medium produce an article of manufacture including instruction means that implement the functions specified in one or more flows of the flowchart and / or one or more blocks of the block diagram.

[0078] These computer program instructions may also be loaded onto a computer or other programmable data processing equipment to cause a series of operational steps to be performed on the computer or other programmable equipment to produce a computer-implemented process, such that the instructions that execute on the computer or other programmable equipment provide steps for implementing the functions specified in one or more flows of the flowchart and / or one or more blocks of the block diagram.

[0079] Based on the same inventive concept as the above embodiments, this invention provides an application example. Testing was conducted on the Waymo Open Dataset, a publicly available large-scale 3D object detection dataset, which is divided into training, validation, and test sets. The dataset has two difficulty levels, LEVEL1 and LEVEL2, with LEVEL2 encompassing LEVEL1. The difficulty level is determined by the annotator and by the number of points on the object. Specifically, all point clouds without 3D annotations are first ignored. An annotated object is classified as LEVEL2 if it is labeled as hard by the annotator or if the number of points within its bounding box is less than 5; the remaining objects are classified as LEVEL1. Table 1 compares the performance of this invention with existing 3D object detection methods on the Waymo Open Dataset. As shown in Table 1, the method provided by the above embodiments significantly outperforms the other methods on this benchmark.
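
The difficulty rule described above can be written as a small function. This sketch encodes only the rule as stated in this paragraph; the function names and the string labels are illustrative, and the official dataset tooling may define the thresholds differently.

```python
def difficulty_level(num_points_in_box, labeled_hard):
    """Return "LEVEL2" for hard or sparsely observed objects, else "LEVEL1".

    Rule as stated in the text: an object is LEVEL2 if the annotator
    marked it hard or if fewer than 5 points fall inside its box.
    """
    if labeled_hard or num_points_in_box < 5:
        return "LEVEL2"
    return "LEVEL1"


def counted_in(eval_level, object_level):
    """LEVEL2 evaluation encompasses LEVEL1 objects, but not the reverse."""
    return eval_level == "LEVEL2" or object_level == "LEVEL1"


levels = [difficulty_level(n, hard)
          for n, hard in [(3, False), (5, False), (100, True)]]
```

Because LEVEL2 encompasses LEVEL1, metrics reported at LEVEL2 cover every annotated object, while LEVEL1 metrics cover only the easier subset.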

[0080] Table 1

[0081]

[0082] Specific embodiments of the present invention have been described above. It should be understood that the present invention is not limited to the specific embodiments described above, and those skilled in the art can make various modifications or variations within the scope of the claims, which do not affect the essence of the present invention. The above preferred features can be used in any combination without conflict.

Claims

1. A two-stage 3D target detection method, characterized in that it comprises a first stage and a second stage; the first stage includes: extracting multi-scale bottom-up feature maps based on point cloud pillars from the original airborne LiDAR point cloud, specifically: obtaining the raw LiDAR point cloud L, the point cloud Pillar converter $FE_L$, and the 2D-CNN-based sparse feature extractor $FE_P$; inputting the raw LiDAR point cloud L into the point cloud Pillar converter $FE_L$ to obtain the sparse Pillar feature representation P of the point cloud L; inputting the sparse Pillar feature representation P into the feature extractor $FE_P$ to generate the multi-scale bottom-up feature maps, i.e., the sparse and dense bottom-up backbone features {C1,C2,C3,C4,C5}, where {C1,C2,C3,C4} are low-level sparse backbone features and C5 is a dense high-level semantic backbone feature; based on the multi-scale bottom-up feature maps, constructing a lateral connection layer to guide the feature pyramid (FPN) formed by the multi-scale bottom-up feature maps to the region candidate network (RPN), generating high-quality 3D region candidates, specifically: designing a lateral connection layer that upsamples the dense high-level semantic backbone feature C5 to obtain a high-resolution top-down feature map, and fuses it with the bottom-up feature map C4 of the same spatial size to form a high-resolution feature map P4 rich in semantic information, which is used to detect large objects; further upsampling the feature map P4 to obtain a high-resolution top-down feature map rich in semantic information, and fusing it with the bottom-up feature map C3 of the same spatial size to form a new feature map P3, which is used to detect small objects; using the same center-based detection head on the feature maps {P4, P3} to generate, for target categories of different sizes, 3D region candidates $Pro_{3D}$ in their respective feature maps, that is, bounding-box regression and classification, where large objects are detected on the lower-resolution P4 and small objects are detected on the higher-resolution P3; the second stage includes: based on the multi-scale bottom-up feature maps {C1,C2,C3,C4,C5} and on the lateral connection layer, constructing a pooling feature map of a specified resolution to refine the 3D region candidates in the bird's-eye view (BEV) space, generating the final 3D object detection result.

2. The two-stage 3D target detection method according to claim 1, characterized in that a loss function is constructed from a heatmap loss and the terms $L_{off}$, $L_z$, $L_{size}$, and $L_{ori}$, which are, respectively, the center position offset, height offset, target size offset, and target orientation loss functions relative to their respective ground-truth values; λ is the weighting coefficient.

3. The two-stage 3D target detection method according to claim 1, characterized in that constructing a pooling feature map of a specified resolution based on the multi-scale bottom-up feature maps and the lateral connection layer includes: further upsampling the feature map P4 to obtain a high-resolution top-down feature map rich in semantic information, and fusing it with the bottom-up feature map C3 of the same spatial size to form a new feature map P3; further upsampling the feature map P3 to obtain a high-resolution top-down feature map rich in semantic information, and fusing it with the bottom-up feature map C2 of the same spatial size to form a new feature map P2; wherein P2 or P3 is a pooling feature map of selectable size, used for RoI pooling of the 3D region candidates.

4. The two-stage 3D target detection method according to claim 3, characterized in that refining the 3D region candidates in the bird's-eye view (BEV) space to generate the final 3D object detection result includes: projecting the 3D region candidates $Pro_{3D}$ onto the BEV space to obtain rotated 2D region candidates $Pro_{2D}$; and extracting, for the 2D region candidates $Pro_{2D}$, region candidate features from the pooling feature map P2 or P3 using 2D RoI pooling, where the pooling feature point size is G×G.

5. The two-stage 3D target detection method according to claim 4, characterized in that regression and confidence estimation are performed for the 3D region candidates $Pro_{3D}$, the bounding-box regression loss $L'_{reg}$ and the category confidence loss $L'_{cls}$ are calculated, and the loss function $L_{refinement} = L'_{cls} + L'_{reg}$ is constructed.

6. A two-stage 3D target detection system, characterized in that it comprises: an initial feature extraction module, which extracts multi-scale bottom-up feature maps based on point cloud pillars from the original airborne LiDAR point cloud, specifically: obtaining the raw LiDAR point cloud L, the point cloud Pillar converter $FE_L$, and the 2D-CNN-based sparse feature extractor $FE_P$; inputting the raw LiDAR point cloud L into the point cloud Pillar converter $FE_L$ to obtain the sparse Pillar feature representation P of the point cloud L; inputting the sparse Pillar feature representation P into the feature extractor $FE_P$ to generate the multi-scale bottom-up feature maps, i.e., the sparse and dense bottom-up backbone features {C1,C2,C3,C4,C5}, where {C1,C2,C3,C4} are low-level sparse backbone features and C5 is a dense high-level semantic backbone feature; a 3D region proposal generation module, which constructs a lateral connection layer based on the multi-scale bottom-up feature maps, guiding the feature pyramid (FPN) formed by the multi-scale bottom-up feature maps to the region candidate network (RPN) to generate high-quality 3D region candidates, specifically: designing a lateral connection layer that upsamples the low-resolution feature map C5 with dense high-level semantics to obtain a high-resolution top-down feature map, and fuses it with the bottom-up feature map C4 of the same spatial size to form a high-resolution feature map P4 rich in semantic information, which is used to detect large objects; further upsampling the feature map P4 to obtain a high-resolution top-down feature map rich in semantic information, and fusing it with the bottom-up feature map C3 of the same spatial size to form a new feature map P3, which is used to detect small objects; using the same center-based detection head on the feature maps {P4, P3} to generate, for target categories of different sizes, 3D region candidates $Pro_{3D}$ in their respective feature maps, that is, bounding-box regression and classification, where large objects are detected on the lower-resolution P4 and small objects are detected on the higher-resolution P3; and a target detection module, which constructs a pooling feature map of a specified resolution based on the multi-scale bottom-up feature maps and the lateral connection layer, the pooling feature map being used to refine the 3D region candidates in the bird's-eye view (BEV) space and generate the final 3D target detection result.

7. A terminal, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that, when the processor executes the program, it can be used to perform the method of any one of claims 1-5.

8. A computer-readable storage medium having a computer program stored thereon, characterized in that, when executed by a processor, the program can be used to perform the method of any one of claims 1-5.