A 3D target detection method and system based on environment perception

What is AI technical title?
AI technical title is built by PatSnap AI team. It summarizes the technical point description of the patent document.
By preprocessing the point cloud data of AGV vehicles and performing global map registration, combined with the joint calibration of cameras and LiDAR, and using 2D target detection to filter point clouds and establish simulation models, the problems of high computational load and high label production cost in existing 3D target detection are solved, and efficient 3D target detection is achieved.

CN116203576BActive Publication Date: 2026-06-19XI AN JIAOTONG UNIV

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents(China)
Current Assignee / Owner: XI AN JIAOTONG UNIV
Filing Date: 2023-03-07
Publication Date: 2026-06-19

AI Technical Summary

Technical Problem

Existing 3D object detection methods suffer from problems such as insufficient recognition performance, high computational load, slow processing speed, and high cost of 3D label production, making it impossible to conduct large-scale 3D object detection training.

Method used

An environment-aware approach is adopted. The point cloud data of the AGV is preprocessed to construct a static global map and register it with the point cloud data. The camera and LiDAR are used for joint calibration. The point cloud is filtered using 2D target detection results, and a simulation model is built to expand the point cloud. The data is then uploaded to the cloud via MQTT communication for perception fusion.

Benefits of technology

It reduces computational load, improves detection accuracy, and solves the problems of high computational load and high label production cost in 3D target detection, thus achieving efficient 3D target detection.

✦ Generated by Eureka AI based on patent content.

Smart Images

Figure CN116203576B_ABST

Patent Text Reader

Abstract

This invention discloses a 3D target detection method and system based on environmental perception. It constructs a static global map based on laser SLAM and registers it with the obtained point cloud data to determine the position information of the AGV (Automated Guided Vehicle) in the static global map. The registered 3D point cloud information is projected onto a 2D pixel space, and the mapping relationship between the 2D pixel coordinates and the 3D point cloud is saved to obtain fused data from the camera and LiDAR. The obstacle categories and 2D detection boxes are obtained, and the fused data is searched in reverse to filter out the point cloud information contained in the 2D detection boxes. Simulation models of different obstacles are established and converted into corresponding template point clouds. The pose transformation of the template point clouds is then used to replace and expand the real point clouds. The target point cloud in the obtained static global map is uploaded to the cloud as a single-machine perception result for perception fusion, realizing 3D target detection based on environmental perception. This invention does not require real 3D labels but relies on easily obtained 2D target detection results for training.

Need to check novelty before this filing date? Find Prior Art

Description

Technical Field

[0001] This invention belongs to the field of embedded environment perception technology, specifically relating to a 3D target detection method and system based on environment perception. Background Technology

[0002] The hardware foundation of an environmental perception system consists of various sensors and their combinations, while the core of the software lies in the perception algorithm. Environmental perception involves an autonomous intelligent agent collecting information about its surrounding environment using multiple sensors and, based on a pre-defined perception algorithm, identifying important target information in the environment, such as static and dynamic obstacles.

[0003] Different sensors possess different characteristics, each with its own advantages and disadvantages, and are therefore suitable for different tasks. Cameras are the most commonly used sensors in perception systems, excelling in extracting rich texture and color information, making them suitable for target classification. However, their disadvantages include weak distance perception and significant susceptibility to lighting conditions. LiDAR, to some extent, compensates for the shortcomings of cameras, accurately perceiving the distance and shape of objects, thus making it suitable for mid-to-close-range target detection and ranging. However, its disadvantages include high cost and difficulty in mass production. In this invention, information acquired by the camera is used for 2D target detection. The 2D target detection results are then combined with point cloud data collected by the LiDAR for 3D target detection.

[0004] Traditional 3D object detection methods are mainly based on manual feature extraction; the workflow of traditional detection algorithms is summarized as follows:

[0005] 1. Select the region of interest, which is the area that may contain objects;

[0006] 2. Extract features from regions that may contain objects;

[0007] 3. Detect and classify the extracted features.

[0008] It has the following disadvantages:

[0009] 1. The recognition effect is not good enough, and the accuracy rate is not high;

[0010] 2. The computational load is relatively large, and the processing speed is slow;

[0011] 3. Multiple correct identification results may be generated.

[0012] With the rise of convolutional neural networks (CNNs), new ideas have emerged in the field of object detection. CNN-based object detection algorithms mainly follow two technical development paths: anchor-based and anchor-free methods. Anchor-based methods include one-stage and two-stage detection algorithms (two-stage algorithms generally have higher accuracy than one-stage algorithms, but one-stage algorithms are faster). Two-stage detection algorithms mainly consist of the following two stages: Stage 1: Generating region proposals from the image; Stage 2: Generating the final object bounding box from the region proposals.

[0013] Anchor-based object detection algorithms have the following drawbacks:

[0014] 1. The size, number, and aspect ratio of anchors have a significant impact on detection performance (by changing these hyperparameters, Retinanet improved AP by 4% on the COCO benchmark). Therefore, the detection performance of anchor-based detection is very sensitive to the size, number, and aspect ratio of anchors.

[0015] 2. Fixed anchors greatly impair the universality of detectors, requiring anchors to be resized and have different aspect ratios for different tasks;

[0016] 3. In order to match the ground truth bounding boxes, a large number of anchors need to be generated, but most of the anchors are marked as negative samples during training, which causes a severe imbalance of samples.

[0017] 4. During training, the network needs to calculate the IOU between all anchors and ground truth boxes, which consumes a lot of memory and time.

[0018] Disadvantages of the anchor-free method:

[0019] 1. Extreme imbalance between positive and negative samples

[0020] 2. Semantic ambiguity (overlapping center points of two targets): Currently, these two issues are mostly mitigated using Focus Loss and FPN, but they are not truly resolved.

[0021] 3. The detection results are unstable, and more methods need to be designed for re-weighting. Summary of the Invention

[0022] The technical problem to be solved by the present invention is to provide a 3D target detection method and system based on environment perception, which addresses the shortcomings of the prior art and solves the technical problem that the production cost of real 3D labels in real scenes is high and large-scale 3D target detection training cannot be carried out when using weak supervision.

[0023] The present invention adopts the following technical solution:

[0024] A 3D target detection method based on environmental perception includes the following steps:

[0025] S1. Preprocess the point cloud data of the AGV in the current direction of travel;

[0026] S2. Construct a static global map based on laser SLAM and register it with the point cloud data obtained in step S1 to determine the position information of the AGV in the static global map.

[0027] After steps S3 and S2 are completed, the camera and LiDAR are jointly calibrated, the registered 3D point cloud information is projected onto the 2D pixel space, and a 6D data structure is used to store the mapping relationship between the 2D pixel coordinates and the 3D point cloud to obtain the fusion data of the camera and LiDAR; 2D target detection is performed based on the RGB images collected by the AGV vehicle to obtain the category of obstacles and the 2D detection box, and the fusion data is searched in reverse to filter out the point cloud information contained in the 2D detection box;

[0028] S4. Establish simulation models of different obstacles and convert them into corresponding template point clouds. Then, transform the pose of the template point cloud to replace the real point cloud and expand the target object point cloud obtained in step S3.

[0029] S5. Using MQTT communication, the target point cloud in the static global map obtained in step S4 is uploaded to the cloud as a single-machine perception result for perception fusion, thereby realizing 3D target detection based on environmental perception.

[0030] Specifically, in step S1, the preprocessing of the point cloud data of the AGV in its current direction of travel is as follows:

[0031] A pass-through filtering method is used to prune point cloud data. A threshold is set on a specified dimension of the point cloud to divide the data in the corresponding dimension into those within the threshold range and those outside the threshold range. Point cloud data outside the threshold range is then removed.

[0032] Furthermore, a threshold method is used to remove ground point cloud data.

[0033] Specifically, in step S2, the location information incorporates odometer and matching positioning information. The odometer information is integrated with the IMU to synthesize positioning information. The positioning information is then matched with a static map to maintain the deviation between the odometer positioning and the actual positioning.

[0034] Specifically, step S3 is as follows:

[0035] Two-dimensional object detection methods are used to locate 2D object boxes in RGB images and to classify the contents of the 2D object boxes.

[0036] The collected point cloud data is mapped onto a 2D plane using camera parameters. Then, the point cloud within the 2D detection box is filtered according to the bounding boxes of the 2D object detection to obtain the point cloud of the ROI region.

[0037] Furthermore, for interfering point clouds, the depth range of the target point cloud is estimated, the point clouds in the ROI region are filtered, and then the point clouds are clustered.

[0038] Further filtering includes effective depth range filtering, clustering operations, and point cloud cluster scoring mechanisms.

[0039] Furthermore, the 2D object detection results include the top-left corner information of the detected box, the size of the detection box, the confidence score, and the timestamp.

[0040] The target point cloud includes the specific position of the current AGV in the global coordinate system, the positioning corners of the 3D detection box, the number of point clouds, the confidence level, and the obstacle category.

[0041] Specifically, in step S4, simulation models of different types of obstacles are established using CAD, and then converted into point clouds. The simulated point clouds are used as template point clouds for the corresponding types of obstacles. Feature matching is performed between the detected point clouds and the template point clouds to obtain the category of the real point cloud and its pose relative to the template point cloud. The template point cloud is adjusted to the position of the real point cloud through relative pose, and then the adjusted template point cloud is used for subsequent detection work.

[0042] Secondly, embodiments of the present invention provide a target detection system based on environmental perception, comprising:

[0043] The data module preprocesses the point cloud data of the AGV in its current direction of travel;

[0044] The map module constructs a static global map based on laser SLAM and registers it with the point cloud data obtained from the data module to determine the position information of the AGV in the static global map.

[0045] The filtering module performs joint calibration of the camera and LiDAR, projects the registered 3D point cloud information onto the 2D pixel space, and uses a 6D data structure to store the mapping relationship between the 2D pixel coordinates and the 3D point cloud, thus obtaining the fused data of the camera and LiDAR; it performs 2D target detection based on the RGB images collected by the AGV, obtains the category of obstacles and 2D detection boxes, and performs reverse search of the fused data to filter out the point cloud information contained in the 2D detection boxes;

[0046] The expansion module establishes simulation models of different obstacles and converts them into corresponding template point clouds. Then, the template point clouds are transformed to replace the real point clouds, thus expanding the target object point clouds obtained by the filtering module.

[0047] The detection module uses MQTT communication to upload the target point cloud in the static global map obtained by the expansion module as a single-machine perception result to the cloud for perception fusion, thereby realizing 3D target detection based on environmental perception.

[0048] Compared with the prior art, the present invention has at least the following beneficial effects:

[0049] A 3D target detection method based on environmental perception first performs unified processing on the point cloud, removing irrelevant points to reduce computational load. By constructing a static global map and registering it with the point cloud information on a vehicle, the vehicle's position and distance relationships with other target objects in the global context can be determined. Global localization allows for more accurate cropping of the ground point cloud. After joint calibration of the LiDAR and camera, the 3D point cloud information is projected onto a 2D plane to obtain fused data from the camera and LiDAR. Then, based on the target bounding boxes obtained from 2D target detection, the true point cloud containing the target object is filtered out from the fused data. Target point clouds are further filtered using clustering and scoring methods. In the absence of true 3D labels, the 3D target detection task is completed by using 2D target detection to assist in filtering the point cloud. In practical applications, the problem of low LiDAR beamwidth may be encountered, resulting in an overly sparse true point cloud of the target. A simulation model of the target object is built using CAD and converted into point cloud data. The simulation category is determined based on the category information obtained from 2D detection. Feature matching is performed between the filtered real point cloud and the simulated point cloud to obtain relative pose information. Then, the simulated point cloud is adjusted to the position of the target point cloud to expand the real point cloud. Finally, the target obstacle information perceived by the vehicle is uploaded to a cloud server and fused with target obstacle information perceived by other vehicles to construct a global obstacle information system.

[0050] Furthermore, a pass-through filter is used to crop the point cloud data, retaining only the point cloud within the camera's field of view. After joint calibration by the camera and LiDAR, the point cloud can be projected onto the RGB image acquired by the camera to obtain fused data. The fused data only contains point cloud information within the camera's field of view, and this portion of the point cloud accounts for a small fraction of the total point cloud. By cropping the point cloud, invalid point cloud information can be avoided, reducing the computational load.

[0051] Furthermore, since the data collected by lidar includes target obstacle point clouds and ground point cloud information, and ground point cloud information can interfere with obstacle detection and occupy a large proportion, a thresholding method can be used to crop the effective point cloud within the camera's field of view, remove redundant ground point clouds, and retain only the target object point cloud information above the ground.

[0052] Furthermore, by matching this positioning information with a static map and maintaining the deviation between the odometer positioning and the actual positioning, global positioning can be obtained, which can provide support for point cloud cropping.

[0053] Furthermore, using a 2D object detection algorithm, 2D bounding boxes and categories of obstacles are detected in the RGB images captured by the camera. After the point cloud processing operations described above, we have obtained a set of point clouds of target objects on the ground in the camera's field of view. Through joint calibration using the camera and LiDAR, the point clouds and RGB images are fused to obtain fused data. Based on the 2D bounding boxes obtained from the 2D object detection, point clouds within these bounding boxes are selected from the fused data. In this way, through multi-sensor data fusion, we can use 2D object detection to guide 3D obstacle detection, solving the problem of high cost and difficulty in creating 3D labels when directly training 3D object detection in practical applications.

[0054] Furthermore, the point cloud filtered in the previous step contains both target and interfering point clouds. To perform clustering on these point clouds, a tolerance threshold needs to be set. Since the tolerance threshold is proportional to distance, point clouds at different distances require different tolerance thresholds. However, this method might lead to point clouds of the same object being clustered into two different objects. Therefore, before performing clustering, prior knowledge can be used to determine the approximate depth threshold range of the target point cloud, and interfering point clouds in the foreground and background can be removed. This allows for more accurate clustering of the target point cloud, improving detection accuracy.

[0055] Furthermore, in the point cloud filtered by the 2D bounding boxes, we only retain those point clouds whose depth values fall within the effective depth range. This further filters the point cloud, reducing the number of point clouds participating in clustering calculations and eliminating most interfering point clouds (such as foreground occlusion and background point clouds). However, some interfering point clouds may still remain in the remaining point cloud. To more accurately find the target point cloud, we can score it based on the difference between the centroid height of the point cloud cluster and the preset height of the detection category, as well as the intersection-union ratio (IUU) of the minimum bounding box and the detection box projected back into the 2D image. We then linearly combine these scores according to the importance of each prior knowledge point to obtain the final complete scoring mechanism. The point cloud with the highest final score is the target point cloud we need to find. Through this scoring mechanism, we can more accurately filter the target point cloud, thus obtaining accurate 3D object detection results.

[0056] Furthermore, the environmental perception results are aggregated to obtain 2D target detection results and corresponding 3D target detection results. The aggregated single-vehicle perception results are then uploaded to the perception fusion system, which can calculate the global perception results.

[0057] Furthermore, since clustering operations cannot obtain the pose information of the target point cloud, we create simulation models of different categories of obstacles using CAD, which are then converted into template point clouds. We perform feature matching between the clustered target point cloud and the corresponding template point cloud to obtain the relative pose of the real point cloud. By adjusting the template point cloud to the position of the real point cloud using the relative pose, we can solve the problem of sparse target object point clouds caused by the low line count of the LiDAR.

[0058] It is understandable that the beneficial effects of the second aspect mentioned above can be found in the relevant descriptions in the first aspect mentioned above, and will not be repeated here.

[0059] In summary, this invention achieves 3D object detection by using 2D object detection for point cloud filtering, thus solving the problem that 3D object detection in real-world project scenarios relies on real 3D labels.

[0060] The technical solution of the present invention will be further described in detail below with reference to the accompanying drawings and embodiments. Attached Figure Description

[0061] Figure 1 This invention relates to a stand-alone sensing system;

[0062] Figure 2 This is a schematic diagram of the communication process of the present invention;

[0063] Figure 3 This is a 2D object detection result image from Scene 1;

[0064] Figure 4 This is a 3D object detection result image from Scene 1;

[0065] Figure 5 This is a schematic diagram illustrating the annotation of 2D and 3D object detection on the fused data in Scene 2. Detailed Implementation

[0066] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.

[0067] In the description of this invention, it should be understood that the terms "comprising" and "including" indicate the presence of the described features, integrals, steps, operations, elements and / or components, but do not exclude the presence or addition of one or more other features, integrals, steps, operations, elements, components and / or collections thereof.

[0068] It should also be understood that the terminology used in this specification is for the purpose of describing particular embodiments only and is not intended to limit the invention. As used in this specification and the appended claims, the singular forms “a,” “an,” and “the” are intended to include the plural forms unless the context clearly indicates otherwise.

[0069] It should also be further understood that the term "and / or" as used in this specification and the appended claims refers to any combination and all possible combinations of one or more of the associated listed items, and includes such combinations. For example, A and / or B can represent three cases: A alone, A and B simultaneously, and B alone. Additionally, the character " / " in this document generally indicates that the preceding and following objects have an "or" relationship.

[0070] It should be understood that although terms such as first, second, third, etc., may be used in the embodiments of the present invention to describe the preset range, these preset ranges should not be limited to these terms. These terms are only used to distinguish the preset ranges from one another. For example, without departing from the scope of the embodiments of the present invention, the first preset range may also be referred to as the second preset range, and similarly, the second preset range may also be referred to as the first preset range.

[0071] Depending on the context, the word "if" as used here can be interpreted as "when," "when," "in response to determination," or "in response to detection." Similarly, depending on the context, the phrase "if determination" or "if detection (of the stated condition or event)" can be interpreted as "when determination," "in response to determination," "when detection (of the stated condition or event)," or "in response to detection (of the stated condition or event)."

[0072] The accompanying drawings illustrate various structural schematic diagrams according to embodiments disclosed in this invention. These drawings are not to scale, and some details have been enlarged for clarity, and some details may have been omitted. The shapes of the various regions and layers shown in the drawings, as well as their relative sizes and positional relationships, are merely exemplary and may deviate from reality due to manufacturing tolerances or technical limitations. Furthermore, those skilled in the art can design regions / layers with different shapes, sizes, and relative positions as needed.

[0073] This invention provides a 3D target detection method based on environmental perception. It employs a multi-sensor perception fusion approach. Based on RGB images captured by a camera, a 2D target detection algorithm accurately identifies the bounding box of an object in a two-dimensional plane. The point cloud data from the same moment is then mapped onto the two-dimensional plane to obtain the point cloud corresponding to the 2D target detection bounding box. The distance from the target object to the camera is then calculated based on the lens transmission principle. The point clouds at this distance are clustered to filter out the point clouds of the true target. Because the LiDAR beam used has a low beam, the collected point cloud is relatively small. Furthermore, in some specific scenarios, obstacle categories are clearly defined. Different types of obstacles are modeled using CAD and then converted into point cloud templates for the corresponding categories. Feature matching is performed between the clustered real point clouds and the point cloud templates to obtain the category of the real point cloud and its pose relative to the template point cloud. The template point cloud is adjusted to the position of the real point cloud using relative pose, and the adjusted template point cloud is used for subsequent detection. This invention not only obtains the orientation of the real point cloud but also solves the problem of sparse target object point clouds. Finally, the single-machine perception results are uploaded to the cloud for perception fusion.

[0074] Please see Figure 1 The software architecture of the single-machine environment perception system proposed in this invention consists of two parts: single-machine localization and environment perception running on a single AGV. Environment perception is further divided into 2D target detection and 3D target detection. Finally, the single-machine perception results are uploaded to the cloud for perception fusion via MQTT communication.

[0075] Please see Figure 2 The sensors and perception algorithms of a single AGV communicate via ros_topic. The final perception results of the single machine are uploaded to the cloud server for perception fusion via MQTT.

[0076] This invention provides a 3D target detection method based on environmental perception, applicable to different types of AGVs, comprising the following steps:

[0077] S1. Unified processing of point clouds;

[0078] First, the camera node inputs the RGB image into the 2D object detection model to obtain the 2D bounding boxes and category information of the targets in the image. Then, joint calibration of the LiDAR and camera is performed, fusing the RGB data and point cloud data obtained from the camera node and point cloud node to obtain fused data. Next, the fused data, along with the 2D object detection bounding boxes and target category information, are input into the 3D object detection module for point cloud filtering, ultimately generating 3D object detection bounding boxes.

[0079] Point clouds primarily provide the location information of a target object in a real-world scene.

[0080] Point cloud processing includes point cloud pruning and threshold setting to remove redundant and interfering point clouds, reduce computational load, and improve detection accuracy.

[0081] Point clouds are the raw input data for the 3D object detection stage. For this task, which primarily involves detecting and locating obstacles in the current direction of the vehicle's movement, only the point cloud data in that direction needs to be processed; point clouds in the opposite direction can be directly removed. Similarly, for point clouds perpendicular to the direction of movement, only those within the range (-4, 4) need to be retained. Point cloud pruning uses a pass-through filter. A threshold range is set on a specified dimension of the point cloud, dividing the data in that dimension into those within and those outside the threshold range, thus allowing for selection of whether to filter.

[0082] For 3D object detection tasks, ground point cloud segmentation is necessary. Since the AGV (Automated Guided Vehicle) travels on roads, a significant portion of the point cloud map obtained from LiDAR scanning belongs to the ground. In the task, these ground point clouds interfere with the detection process. Ground point clouds are not the targets to be detected, and their large proportion can easily lead to misidentification of 3D objects. Furthermore, in subsequent point cloud clustering, grouping multiple objects in nearby point clouds or combining ground and object point clouds into one category may increase the number of clusters for more sparse point clouds in the distance, thus increasing the difficulty of target point cloud segmentation. Equally important, the presence of ground point clouds increases the number of point clouds involved in subsequent calculations, increasing time complexity. To remove ground point clouds, a thresholding method can be used. A threshold is directly given for the LiDAR's Z-axis, and points below the threshold are considered ground points. Since this threshold is affected by the LiDAR's installation location and the undulations during vehicle movement, to improve robustness, a certain threshold error is allowed in the Z-axis direction of the planar model in practice.

[0083] S2, Positioning;

[0084] First, a static global map of laser SLAM is constructed, and then it is registered with the scanned point cloud on the AGV to determine the position information of the AGV in the scene.

[0085] In addition, it integrates odometer and matching positioning information. The odometer information is integrated with the IMU to synthesize positioning information. This positioning information is matched with a static map to maintain the deviation between the odometer positioning and the actual positioning, thereby obtaining global positioning to support point cloud clipping.

[0086] S3, 2D / 3D collaborative target detection;

[0087] Obstacle categories and 2D bounding boxes are obtained through 2D object detection, and point clouds are filtered using the 2D bounding boxes.

[0088] Because the point cloud data acquired by LiDAR is relatively sparse and has a much lower resolution than RGB images, 2D object detectors are more reliable for object detection and classification. However, there is a lack of large-scale real point cloud data to train 3D object detection. Therefore: First, mature 2D object detection methods are used to locate 2D bounding boxes in RGB images and classify their contents; then, the acquired point cloud data is mapped to a 2D plane using camera parameters, and the point clouds falling within the 2D bounding boxes are extracted and expanded into a region with a pyramidal base. Essentially, the 3D search space of this pyramidal region is much smaller than the entire visual area captured by LiDAR, resulting in significantly lower processing time complexity. Now, each pyramidal region contains the 3D point cloud of the object to be detected, and the object's category has been determined. However, the point cloud in this region includes not only the target object's point cloud but also some foreground and background interference point clouds.

[0089] For interfering point clouds, the depth range of the target point cloud is first estimated, and the point clouds in the pyramid region are filtered. Then, the point clouds are clustered. Because the target object category has already been obtained during 2D target detection, the point clouds are further filtered based on the prior information of the category height and the IOU between the 2D bounding boxes corresponding to the point cloud clusters and the 2D target detection boxes, thus obtaining the final target point cloud. In practical scenarios, for 3D target detection, due to the lack of a large-scale LiDAR point cloud training dataset, deep learning methods are not used in the entire 3D target detection process. Instead, low-computation point cloud processing and point cloud clustering operations are used. This can make full use of prior information to improve the accuracy of 3D target detection, while also using a relatively lightweight network model to shorten inference time, ensuring that high-precision results and real-time requirements can be met even on low-computing hardware.

[0090] 2D Object Detection: Since the ultimate goal of the algorithm is deployment on hardware platforms, considering the computing power limitations of the platform and the real-time requirements of the inference algorithm, a lightweight network was designed based on the existing 2D object detection model. The network parameters were quantized as the final model weights, and model inference and NMS post-processing were performed in separate threads. Ultimately, 2D object detection achieved a certain level of real-time performance, reaching the lower bound of algorithm deployment under limited computing power. To stably detect obstacles of the desired category, a dataset of fewer than 10,000 images was created and labeled with their categories. 2D object detection obtains the 2D bounding boxes and categories of objects in the scene, providing point cloud filtering conditions and prior information for 3D bounding boxes.

[0091] 3D object detection: First, the point cloud is mapped to a 2D plane according to camera parameters. Then, based on the bounding boxes of 2D object detection, the point cloud within the 2D bounding boxes is selected to obtain the point cloud of the Region of Interest (ROI). This process filters out most invalid point clouds. However, converting the 3D point cloud to a 2D coordinate system for filtering inevitably results in the initial ROI point cloud containing points outside the target object (due to foreground occlusion or background). Further fine-tuning is then performed to obtain a more accurate target point cloud.

[0092] In 3D object detection, after unified processing, ground point clouds are removed. Therefore, in the 3D cone region, there are no connections between ground point clouds, and objects are naturally separated in physical space. That is, points belonging to the same object are clustered together, while point clouds belonging to different objects are separated. Compared to clustering of 2D images, clustering of 3D point clouds is much easier. The purpose of clustering is to separate different objects in the 3D point cloud. A clustering algorithm based on Euclidean distance is used to determine whether points belong to the same object based on the distance between them.

[0093] Because the characteristics of point clouds collected by LiDAR are: denser point clouds at close range and sparser point clouds at distant range, clustering algorithms based on Euclidean distance have different tolerance thresholds for clustering point clouds at different distances from the LiDAR, and this threshold is proportional to the distance. However, direct clustering has some problems: the algorithm may apply different tolerance thresholds to the point cloud of the same object, resulting in two objects being clustered.

[0094] Therefore, before clustering, the approximate depth threshold range of the target point cloud is first determined based on prior information, and interfering point clouds from the foreground and background are removed. The depth threshold of the target object point cloud is derived from the lens transmission principle. Given the focal length f, the 2D bounding box height h, and the actual height H of the target object, the depth threshold is calculated using the lens transmission principle formula: Calculate the estimated depth value D of the car, and form an effective depth interval [Dl, D+l] around this value, where l is a relatively small depth value.

[0095] From the point cloud filtered by 2D bounding boxes, only points whose depth values fall within the valid depth range are retained. This further filters the point cloud, reducing the number of points participating in clustering calculations and eliminating most interfering points (foreground occlusion and background points), resulting in more accurate 3D target detection. By filtering the target point cloud, the 3D position and distribution information of the target are obtained. Next is the clustering operation, which uses the Euclidean distance clustering algorithm. The main idea is to use a Euclidean radius threshold r to perform clustering. By calculating the distance between two points, if the distance is less than the threshold r, the two points belong to the same cluster. All points are traversed until all points are clustered into different clusters, completing the clustering process.

[0096] The point cloud filtering operation prior to clustering determines the effective depth range of the target object based on prior information, which can eliminate most invalid point clouds. However, some interfering point clouds still exist and require further filtering. Current point cloud filtering scoring mechanisms assume that the target object is primarily in the foreground. Among the separated point clouds, those closer to the LiDAR coordinate origin are more likely to be the target point cloud cluster. However, there might be cases where two objects are very close, but the target object is farther from the LiDAR coordinate origin. Therefore, an alternative scoring mechanism is adopted. Since only obstacle categories are relatively fixed in the scene, and these object categories have different heights, the point clouds are further filtered based on the centroid height of the point clouds, based on this prior information. In 2D target detection, the target category has already been obtained. After clustering, the centroid height of the point cloud cluster can be calculated. Comparing the centroid height with a preset height threshold yields the target object category. The confidence level of different point cloud clusters is obtained by comparing the point cloud cluster category calculated based on the centroid height with the true target category. Point cloud clusters with the same category have high confidence, and vice versa. In addition, for occluded objects, the minimum 2D bounding box of the occluded object and the detection box generally do not have a large overlap, less than 50%. Therefore, the closer the intersection-union ratio of the minimum bounding box and the detection box of the separated objects projected back into the 2D image is to 100%, the more likely it is to be the target object.

[0097] Finally, taking both factors into account, a linear combination is performed based on the importance of each prior knowledge point to obtain the final complete scoring mechanism. The point cloud with the highest final score is the target point cloud that needs to be found.

[0098] The task of 3D object detection is to further filter out the point cloud of the target object from the point cloud selected from the 2D bounding box, including effective depth interval filtering, clustering operation, and point cloud cluster scoring mechanism.

[0099] S4 and 3D target detection use CAD models;

[0100] In target detection tasks, obstacle categories are relatively fixed. CAD models of different obstacles are pre-built, converted into corresponding template point clouds, and then the pose transformation of the template point clouds is used to replace the real point clouds.

[0101] In some scenarios, the categories of obstacles are relatively fixed, and corresponding CAD models generally exist. In 3D object detection, clustering is needed to separate the target point cloud from the sparse point cloud. The problem is that clustering operations cannot obtain the pose of the obstacles.

[0102] Simulation models of different obstacle categories are created in advance using CAD, and then converted into point clouds. These simulated point clouds serve as template point clouds for each obstacle category. Feature matching is performed between the detected point clouds and the template point clouds to obtain the category of the real point cloud and its pose relative to the template point cloud. The template point cloud is then adjusted to the position of the real point cloud using relative pose analysis, and the adjusted template point cloud is used for subsequent detection. This approach not only obtains the orientation of the real point cloud but also solves the problem of sparse point clouds caused by the low line count of the LiDAR.

[0103] S5, Communication Process.

[0104] The AGV system is based on the ROS environment. Communication between different modules is achieved through a publish-subscribe mechanism (ros_topic) to exchange data. The individual perception results obtained need to be reported to the cloud for perception fusion. Since the AGV and the server are not on the same local area network, the lightweight MQTT communication protocol is used to report the individual perception results.

[0105] The sensors and detection algorithms on the AGV communicate via ROS nodes. The environmental perception component uses 2D object detection to assist 3D object detection, accurately acquiring object information. The real point cloud is then expanded using a template point cloud converted from a CAD model. Finally, the single-machine perception results are uploaded to the cloud for perception fusion via MQTT communication.

[0106] The entire system is based on the ROS environment. Communication between the sensor module and the perception algorithm module, which consists of 2D and 3D object detection, is achieved through a publish-subscribe mechanism called ros_topic. Image nodes include RGB images.

[0107] Point cloud nodes include: point cloud information after unified processing.

[0108] The 2D object detection node includes: the top left corner information of the box obtained by object detection, the size of the detection box, the confidence score, and the timestamp.

[0109] The 3D object detection node includes: the specific position of the current vehicle in the global coordinate system. This includes the local corner points of the 3D detection box, the number of point clouds, the confidence score, and the obstacle category.

[0110] The perception results from a single device only include some perception information near the current vehicle. By reporting the perception results from the device to the perception fusion module, the perception information from different vehicles can be evaluated and fused to form global perception information for subsequent scheduling and path planning. Therefore, the reported result is the perception information from the single device, including:

[0111] Information about the 3D bounding box: diagonal points of the bounding box, center point of the bounding box, size of the bounding box, obstacle type, number of points in the cloud, and confidence level;

[0112] Information about the 2D bounding box: the diagonal points of the 2D bounding box, and the confidence level.

[0113] The reporting uses MQTT communication because the bicycle and the server are not on the same local area network and need to be transmitted over the network. MQTT is a lightweight publish-subscribe network protocol that can transmit messages between devices and is one of the most widely used IoT communication protocols today, enabling accurate and real-time message delivery.

[0114] In another embodiment of the present invention, an environment-aware target detection system is provided. This system can be used to implement the above-mentioned environment-aware 3D target detection method. Specifically, the environment-aware target detection system includes a data module, a map module, a filtering module, an expansion module, and a detection module.

[0115] The data module preprocesses the point cloud data of the AGV in its current direction of travel.

[0116] The map module constructs a static global map based on laser SLAM and registers it with the point cloud data obtained from the data module to determine the position information of the AGV in the static global map.

[0117] The filtering module performs joint calibration of the camera and LiDAR, projects the registered 3D point cloud information onto the 2D pixel space, and uses a 6D data structure to store the mapping relationship between the 2D pixel coordinates and the 3D point cloud, thus obtaining the fused data of the camera and LiDAR; it performs 2D target detection based on the RGB images collected by the AGV, obtains the category of obstacles and 2D detection boxes, and performs reverse search of the fused data to filter out the point cloud information contained in the 2D detection boxes;

[0118] The expansion module establishes simulation models of different obstacles and converts them into corresponding template point clouds. Then, the template point clouds are transformed to replace the real point clouds, thus expanding the target object point clouds obtained by the filtering module.

[0119] The detection module uses MQTT communication to upload the target point cloud in the static global map obtained by the expansion module as a single-machine perception result to the cloud for perception fusion, thereby realizing 3D target detection based on environmental perception.

[0120] In another embodiment of the present invention, a terminal device is provided, comprising a processor and a memory. The memory stores a computer program, which includes program instructions. The processor executes the program instructions stored in the computer storage medium. The processor may be a Central Processing Unit (CPU), or other general-purpose processors, digital signal processors (DSPs), application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc. It is the computing and control core of the terminal, suitable for implementing one or more instructions, specifically suitable for loading and executing one or more instructions to achieve a corresponding method flow or corresponding function. The processor described in this embodiment of the present invention can be used for the operation of a 3D target detection method based on environmental perception, including:

[0121] The point cloud data of the AGV's current travel direction is preprocessed; a static global map based on laser SLAM is constructed and registered with the obtained point cloud data to determine the AGV's position information in the static global map; the camera and LiDAR are jointly calibrated, the registered 3D point cloud information is projected onto 2D pixel space, and a 6D data structure is used to store the mapping relationship between 2D pixel coordinates and 3D point cloud, obtaining the fused data of the camera and LiDAR; 2D target detection is performed based on the RGB images collected by the AGV to obtain the obstacle category and 2D detection box, and the fused data is searched in reverse to filter out the point cloud information contained in the 2D detection box; simulation models of different obstacles are established and converted into corresponding template point clouds, and the pose transformation of the template point cloud is used to replace the real point cloud to expand the obtained target object point cloud; the target point cloud in the obtained static global map is uploaded to the cloud as a single-machine perception result using MQTT communication to achieve perception fusion, realizing 3D target detection based on environmental perception.

[0122] In another embodiment of the present invention, a storage medium is also provided, specifically a computer-readable storage medium (memory). This computer-readable storage medium is a memory device in a terminal device used to store programs and data. It is understood that the computer-readable storage medium here can include both the built-in storage medium in the terminal device and extended storage media supported by the terminal device. The computer-readable storage medium provides storage space that stores the terminal's operating system. Furthermore, this storage space also stores one or more instructions suitable for loading and execution by a processor. These instructions can be one or more computer programs (including program code). It should be noted that the computer-readable storage medium here can be high-speed RAM or non-volatile memory, such as at least one disk storage device.

[0123] One or more instructions stored in a computer-readable storage medium can be loaded and executed by a processor to implement the corresponding steps of the environment-aware 3D target detection method in the above embodiments; one or more instructions in the computer-readable storage medium are loaded and executed by the processor to perform the following steps:

[0124] The point cloud data of the AGV's current travel direction is preprocessed; a static global map based on laser SLAM is constructed and registered with the obtained point cloud data to determine the AGV's position information in the static global map; the camera and LiDAR are jointly calibrated, the registered 3D point cloud information is projected onto 2D pixel space, and a 6D data structure is used to store the mapping relationship between 2D pixel coordinates and 3D point cloud, obtaining the fused data of the camera and LiDAR; 2D target detection is performed based on the RGB images collected by the AGV to obtain the obstacle category and 2D detection box, and the fused data is searched in reverse to filter out the point cloud information contained in the 2D detection box; simulation models of different obstacles are established and converted into corresponding template point clouds, and the pose transformation of the template point cloud is used to replace the real point cloud to expand the obtained target object point cloud; the target point cloud in the obtained static global map is uploaded to the cloud as a single-machine perception result using MQTT communication to achieve perception fusion, realizing 3D target detection based on environmental perception.

[0125] To make the objectives, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. The components of the embodiments of the present invention described and shown in the accompanying drawings can generally be arranged and designed in various different configurations. Therefore, the following detailed description of the embodiments of the present invention provided in the accompanying drawings is not intended to limit the scope of the claimed invention, but merely to illustrate selected embodiments of the invention. All other embodiments obtained by those skilled in the art based on the embodiments of the present invention without inventive effort are within the scope of protection of the present invention.

[0126] Please see Figure 3 The result is the 2D object detection result in Scene 1, which includes people and AGV vehicles.

[0127] Please see Figure 4 The figure shows the 3D object detection results obtained by filtering the 2D object detection auxiliary point cloud in Scene 1. The gray bars in the figure represent the 3D object detection results. As can be seen from the 3D object detection results, this invention can complete the 3D object detection task even in the absence of 3D labels.

[0128] Please see Figure 5 The figure shows the 2D and 3D object detection results displayed on the fused data in scenario 2. The 2D box in the figure represents the 2D object detection result, and the 3D box represents the 3D object detection result; indicating that the present invention can complete the 3D object detection task in different scenarios.

[0129] In summary, the present invention provides a 3D target detection method and system based on environmental perception. It employs 2D target detection and combines data from camera and LiDAR fusion to achieve accurate target location by filtering real point clouds. Unlike traditional methods, the present invention does not require real 3D labels but relies on easily obtainable 2D target detection results for training.

[0130] Those skilled in the art will clearly understand that, for the sake of convenience and brevity, the above-described division of functional units and modules is merely an example. In practical applications, the above functions can be assigned to different functional units and modules as needed, that is, the internal structure of the device can be divided into different functional units or modules to complete all or part of the functions described above. The functional units and modules in the embodiments can be integrated into one processing unit, or each unit can exist physically separately, or two or more units can be integrated into one unit. The integrated unit can be implemented in hardware or as a software functional unit. Furthermore, the specific names of the functional units and modules are only for easy differentiation and are not intended to limit the scope of protection of this application. The specific working process of the units and modules in the above system can be referred to the corresponding process in the foregoing method embodiments, and will not be repeated here.

[0131] In the above embodiments, the descriptions of each embodiment have different focuses. For parts that are not described in detail or recorded in a certain embodiment, please refer to the relevant descriptions of other embodiments.

[0132] Those skilled in the art will recognize that the units and algorithm steps of the various examples described in conjunction with the embodiments disclosed herein can be implemented in electronic hardware, or a combination of computer software and electronic hardware. Whether these functions are implemented in hardware or software depends on the specific application and design constraints of the technical solution. Those skilled in the art can use different methods to implement the described functions for each specific application, but such implementations should not be considered beyond the scope of this invention.

[0133] In the embodiments provided by this invention, it should be understood that the disclosed devices / terminals and methods can be implemented in other ways. For example, the device / terminal embodiments described above are merely illustrative. For instance, the division of modules or units is only a logical functional division, and in actual implementation, there may be other division methods. For example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. Furthermore, the coupling or direct coupling or communication connection shown or discussed may be through some interfaces; the indirect coupling or communication connection between devices or units may be electrical, mechanical, or other forms.

[0134] The units described as separate components may or may not be physically separate. The components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units can be selected to achieve the purpose of this embodiment according to actual needs.

[0135] Furthermore, the functional units in the various embodiments of the present invention can be integrated into one processing unit, or each unit can exist physically separately, or two or more units can be integrated into one unit. The integrated unit can be implemented in hardware or as a software functional unit.

[0136] If the integrated module / unit is implemented as a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium. Based on this understanding, all or part of the processes in the methods of the above embodiments can also be implemented by a computer program instructing related hardware. The computer program can be stored in a computer-readable storage medium, and when executed by a processor, it can implement the steps of the various method embodiments described above. The computer program includes computer program code, which can be in the form of source code, object code, executable files, or certain intermediate forms. The computer-readable medium can include: any entity or device capable of carrying the computer program code, recording media, USB flash drives, portable hard drives, magnetic disks, optical disks, computer memory, read-only memory (ROM), random access memory (RAM), electrical carrier signals, telecommunication signals, and software distribution media, etc. It should be noted that the content included in the computer-readable medium can be appropriately added or removed according to the requirements of legislation and patent practice in the jurisdiction. For example, in some jurisdictions, according to legislation and patent practice, computer-readable media do not include electrical carrier signals and telecommunication signals.

[0137] This application is described with reference to flowchart illustrations and / or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of this application. It will be understood that each block of the flowchart illustrations and / or block diagrams, and combinations of blocks in the flowchart illustrations and / or block diagrams, can be implemented by computer program instructions. These computer program instructions can be provided to a processor of a general-purpose computer, special-purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, generate instructions for implementing the flowchart... Figure 1 One or more processes and / or boxes Figure 1 A device that provides the functions specified in one or more boxes.

[0138] These computer program instructions may also be stored in a computer-readable storage medium that can direct a computer or other programmable data processing device to function in a particular manner, such that the instructions stored in the computer-readable storage medium produce an article of manufacture including instruction means, which are implemented in a process Figure 1One or more processes and / or boxes Figure 1 The function specified in one or more boxes.

[0139] These computer program instructions may also be loaded onto a computer or other programmable data processing equipment to cause a series of operational steps to be performed on the computer or other programmable equipment to produce a computer-implemented process, thereby providing instructions that execute on the computer or other programmable equipment for implementing the process. Figure 1 One or more processes and / or boxes Figure 1 The steps of the function specified in one or more boxes.

[0140] The above content is only for illustrating the technical concept of the present invention and should not be construed as limiting the scope of protection of the present invention. Any modifications made to the technical solution based on the technical concept proposed in this invention shall fall within the scope of protection of the claims of this invention.

Claims

1. A 3D object detection method based on environmental perception, characterized in that, Includes the following steps: S1. Preprocess the point cloud data of the AGV in the current direction of travel; S2. Construct a static global map based on laser SLAM and register it with the point cloud data obtained in step S1 to determine the position information of the AGV in the static global map. S3. After step S2 is completed, the camera and LiDAR are jointly calibrated. The registered 3D point cloud information is projected onto the 2D pixel space, and a 6D data structure is used to store the mapping relationship between the 2D pixel coordinates and the 3D point cloud, thus obtaining the fused data of the camera and LiDAR. Based on the RGB images collected by the AGV, 2D target detection is performed to obtain the obstacle category and 2D detection box. Then, the fused data is searched in reverse to filter out the point cloud information contained in the 2D detection box, specifically: Two-dimensional object detection methods are used to locate 2D object boxes in RGB images and to classify the contents of the 2D object boxes. The collected point cloud data is mapped onto a two-dimensional plane through camera parameters, and then the point cloud within the 2D detection box is filtered according to the Boundingboxes of the 2D object detection to obtain the point cloud of the ROI region. Estimate the depth range of the target point cloud, filter the point cloud in the ROI region, and then perform clustering operations on the point cloud; The screening process includes effective depth range screening, clustering operations, and point cloud cluster scoring mechanisms. The 2D object detection results include the top-left corner information of the detected box, the size of the detection box, the confidence score, and the timestamp; The target point cloud includes the specific position of the current AGV in the global coordinate system, the positioning corners of the 3D detection box, the number of points, the confidence level, and the obstacle category; S4. Establish simulation models of different obstacles and convert them into corresponding template point clouds. Then, transform the pose of the template point clouds to replace the real point clouds. Expand the target object point cloud obtained in step S3. Establish simulation models of different types of obstacles using CAD and convert them into point clouds. Use the simulation point clouds as template point clouds for the corresponding types of obstacles. Perform feature matching between the detected point clouds and the template point clouds to obtain the category of the real point cloud and its pose relative to the template point cloud. Adjust the template point cloud to the position of the real point cloud through relative pose and then use the adjusted template point cloud for subsequent detection work. S5. Using MQTT communication, the target point cloud in the static global map obtained in step S4 is uploaded to the cloud as a single-machine perception result for perception fusion, thereby realizing 3D target detection based on environmental perception. 2.The environment-aware based 3D object detection method of claim 1, wherein, In step S1, the preprocessing of the point cloud data of the AGV in the current direction of travel is specifically as follows: A pass-through filtering method is used to prune point cloud data. A threshold is set on a specified dimension of the point cloud to divide the data in the corresponding dimension into those within the threshold range and those outside the threshold range. Point cloud data outside the threshold range is then removed.

3. The 3D target detection method based on environmental perception according to claim 2, characterized in that, A threshold method was used to remove ground point cloud data. 4.The environment-aware based 3D object detection method of claim 1, wherein, In step S2, the location information incorporates odometer and matching positioning information. The odometer information is integrated with the IMU to synthesize positioning information. The positioning information is then matched with a static map to maintain the deviation between the odometer positioning and the actual positioning.

5. An environment perception based target detection system, characterized in that, include: The data module preprocesses the point cloud data of the AGV in its current direction of travel; The map module constructs a static global map based on laser SLAM and registers it with the point cloud data obtained from the data module to determine the position information of the AGV in the static global map. The filtering module performs joint calibration of the camera and LiDAR, projects the registered 3D point cloud information onto 2D pixel space, and stores the mapping relationship between 2D pixel coordinates and 3D point cloud using a 6D data structure to obtain the fused data of the camera and LiDAR. Based on the RGB images acquired by the AGV, 2D target detection is performed to obtain the obstacle category and 2D detection box, and the fused data is then searched in reverse to filter out the point cloud information contained in the 2D detection box. Specifically: Two-dimensional object detection methods are used to locate 2D object boxes in RGB images and to classify the contents of the 2D object boxes. The collected point cloud data is mapped onto a two-dimensional plane through camera parameters, and then the point cloud within the 2D detection box is filtered according to the Boundingboxes of the 2D object detection to obtain the point cloud of the ROI region. Estimate the depth range of the target point cloud, filter the point cloud in the ROI region, and then perform clustering operations on the point cloud; The screening process includes effective depth range screening, clustering operations, and point cloud cluster scoring mechanisms. The 2D object detection results include the top-left corner information of the detected box, the size of the detection box, the confidence score, and the timestamp; The target point cloud includes the specific position of the current AGV in the global coordinate system, the positioning corners of the 3D detection box, the number of points, the confidence level, and the obstacle category; The expansion module establishes simulation models of different obstacles and converts them into corresponding template point clouds. The template point clouds are then transformed to replace the real point clouds. The target object point clouds obtained from the filtering module are expanded by establishing simulation models of different types of obstacles using CAD, converting them into point clouds, and using these simulation point clouds as template point clouds for the corresponding obstacle types. Feature matching is performed between the detected point clouds and the template point clouds to obtain the category of the real point cloud and its pose relative to the template point cloud. The template point cloud is then adjusted to the position of the real point cloud using relative pose, and the adjusted template point cloud is used for subsequent detection work. The detection module uses MQTT communication to upload the target point cloud in the static global map obtained by the expansion module as a single-machine perception result to the cloud for perception fusion, thereby realizing 3D target detection based on environmental perception.