A cooperative confidence fusion perception method applied to automatic driving

By employing time alignment, local and global confidence correction, and Softmax fusion methods, this study addresses the issue of decreased detection performance caused by different feature types in vehicle-road cooperative perception, achieving more efficient cooperative perception and improving the robustness and accuracy of autonomous driving systems.

CN119027934B | Active | Publication Date: 2026-06-30 | SHENZHEN AUTOMOTIVE RES INST BEIJING INST OF TECH (SHENZHEN RES INST OF NAT ENG LAB FOR ELECTRIC VEHICLES)

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
SHENZHEN AUTOMOTIVE RES INST BEIJING INST OF TECH (SHENZHEN RES INST OF NAT ENG LAB FOR ELECTRIC VEHICLES)
Filing Date
2024-08-14
Publication Date
2026-06-30

AI Technical Summary

Technical Problem

Existing vehicle-road cooperative perception solutions suffer from different feature types among different intelligent sensing agents during feature-level fusion, leading to decreased detection performance, loss of semantic features in visual images, and difficulty in assessing the confidence level of perception results from different intelligent sensing agents.

Method used

The detection results of multiple intelligent sensors are time-aligned, the environmental perception capability of each sensor is evaluated and its confidence is locally corrected, a BEV spatial confidence map is constructed for global correction, and the final detection boxes and confidences are obtained by aggregating boxes with the intersection over union (IoU) and fusing confidences with a temperature-adjusted Softmax.

Benefits of technology

It eliminates model differences between different agents, fully leverages the advantages of data features from each modality, improves the accuracy and reliability of collaborative perception, avoids the weakening of visual image features, and generates more accurate and consistent detection results.



Abstract

This invention discloses a collaborative confidence fusion perception method for autonomous driving, belonging to the field of autonomous driving. The method eliminates the differences in perception models between different intelligent agents, makes full use of the strengths of sensor data from different modalities, and avoids forcibly converting visual images, which are poorly suited to BEV feature extraction, into BEV features merely to unify feature forms. By fully leveraging the strengths of each modality's data features, the ability of each specialized model to learn the features it is good at, and the inherent advantages of collaborative perception, it achieves better collaborative perception results.

Description

Technical Field

[0001] This invention belongs to the field of autonomous driving technology, specifically relating to a collaborative confidence fusion perception method for autonomous driving.

Background Technology

[0002] With the dual circulation strategy of domestic and international markets generating new opportunities and challenges, China has become the world's largest automobile market. The continuous development of electrification and intelligentization is driving the cyclical growth of supply and demand, while the rapid progress of connectivity and sharing is gradually unfolding a grand blueprint for China's automotive industry, with autonomous driving riding this wave. Perception, as one of the most crucial upstream conditions for autonomous driving, has also developed rapidly. However, it still lags behind "unsupervised" level autonomous driving. The fact that perception's robustness, accuracy, and timeliness have not met the requirements of high-level autonomous driving is one of the key reasons for this gap.

[0003] At present, a lot of work has been done on the basis of single-vehicle perception, with a variety of autonomous driving sensors and efficient technical routes emerging one after another. However, single-vehicle perception has a fatal flaw: the perception range is limited, and there are unavoidable situations such as field of view obstruction and sensor failure in specific environments. In order to solve these problems and deal with such situations, the vehicle needs intelligent perception agents with different perspectives and in different areas, such as roadside intelligent perception agents, other vehicle intelligent perception agents, and the vehicle itself to cooperate in perception.

[0004] The essence of collaborative perception is the fusion of perception from different intelligent sensing agents. The mainstream technical solution for fusion perception is feature-level fusion. This solution suffers from the problem that different intelligent sensing agents have different feature types (visual features, BEV features, sparse queries). As a result, some feature types can be ineffective in framework models that were not trained on them, and may even reduce detection performance. To avoid this problem, the main approach in the field is to unify feature forms. Most existing vehicle-road cooperative perception solutions fuse multimodal features in the BEV feature form. However, BEV features require accurate coordinate positions, which makes it difficult for visual images, which are good at semantic understanding but not at depth understanding, to be fully effective within the framework. The result is a loss of semantic features, i.e., a weakening of the capabilities of visual images.

[0005] Current mainstream collaborative perception algorithms unify features to the BEV (bird's-eye view) feature form. This is because voxels from LiDAR point clouds can accurately detect 3D bounding boxes under the BEV feature form, but this leads to the loss of semantic features in visual images. Considering this issue, we can skip the black-box model and use specialized models trained for different data types and feature forms, achieving collaborative perception directly through result-level fusion. However, a core problem with result-level fusion is how to reasonably evaluate the confidence level of the perception results from different intelligent agents.

Summary of the Invention

[0006] In order to solve the technical problems existing in the background art, the present invention aims to provide a collaborative confidence fusion perception method for autonomous driving.

[0007] To solve the technical problem, the technical solution of the present invention is as follows:

[0008] S1: Perform time alignment processing on the three-dimensional target detection results of multiple original intelligent sensing objects to obtain the time-aligned detection results of the intelligent sensing objects;

[0009] This step involves time synchronization among multiple sensors to ensure that detection results from different sensors can be compared and fused at the same point in time, laying the foundation for subsequent processing.

[0010] S2: Evaluate the environmental perception capability of each intelligent sensor using the time-aligned 3D target detection results, and locally correct the confidence level of its detection results to obtain the corrected detection results and confidence level. In this step, by analyzing the time-aligned results, the performance of each sensor can be evaluated, and the confidence level of the detection results can be locally corrected using the evaluation results to ensure that the detection results of each intelligent sensor are more reliable.

[0011] S3: Based on the corrected detection results and confidence levels, offline maps, and coordinates of intelligent sensing objects, construct a bird's-eye view BEV spatial confidence map and generate corresponding confidence information for each region; according to the spatial confidence map and the detection results of other time-aligned intelligent sensing objects, perform global confidence correction on the detection results of individual intelligent sensing objects to obtain the globally corrected detection results and confidence levels;

[0012] In this step, a spatial confidence map is constructed by combining the detection results with static environmental information (such as an offline map). Based on this confidence map and the results from other sensors, a global confidence correction is performed, resulting in more comprehensive and reliable detection results.

[0013] S4: For overlapping detection boxes, the degree of overlap is evaluated using the intersection-union ratio (IoU), similar target boxes are aggregated, and the confidence scores of the globally corrected detection results are fused using the temperature-regulated Softmax method to obtain the final detection boxes and confidence scores.

[0014] In this step, overlapping boxes are evaluated, similar boxes are merged, and the confidence scores of the merged boxes are processed to generate the final, optimized detection boxes and their confidence scores. Temperature-adjusted Softmax helps to balance the contributions of each box, resulting in a smoother and more reasonable final confidence score value.

[0015] After obtaining the final detection frame and confidence level, the system can use this information to perform a series of subsequent steps such as target tracking, decision-making, data recording and display. These steps can continue to drive the system's target management and interaction with the environment.

[0016] Furthermore, step S1 includes:

[0017] Linear interpolation is used to align 3D detection results at different timestamps. The following is the process of time alignment using linear interpolation, which can simplify complex motion patterns:

[0018] $R_t = R_{t_1} + \dfrac{t - t_1}{t_1 - t_2}\,(R_{t_1} - R_{t_2})$

[0019] $R_{t_2}$ is the three-dimensional target detection result of the sensing agent at timestamp $t_2$;

[0020] $R_{t_1}$ is the three-dimensional target detection result of the sensing agent at timestamp $t_1$;

[0021] $R_t$ is the 3D target detection result of the sensing agent after time alignment to the timestamp $t$, with $t_2 < t_1 < t$;

[0022] The linear interpolation method multiplies the time weighting coefficient $\frac{t - t_1}{t_1 - t_2}$ by the difference between the detection results at timestamps $t_1$ and $t_2$, $R_{t_1} - R_{t_2}$, and uses the product to time-compensate the detection result, yielding the compensated detection result at timestamp $t$. This method assumes that the motion of the object over the times $t_2$, $t_1$, and $t$ is linear. The higher the update frequency of the perception results, the better the alignment effect of this method.

[0023] Furthermore, step S2 specifically includes:

[0024] Before applying confidence correction to the 3D detection results of each intelligent agent, it is necessary to evaluate the environmental perception capability of each agent:

[0025] $A_b = f(E, M, S)$

[0026] where the environmental perception capability $A_b \in [0,1)$, $E$ is the environmental parameter, $M$ is the perception model, and $S$ is the sensor configuration;

[0027] Each of the n sets of 3D bounding boxes detected by the n intelligent sensors typically contains multiple 3D bounding boxes. To weaken intelligent sensors that are overconfident but perceive poorly and to strengthen those that are relatively conservative but perceive well, it is necessary to perform local confidence correction on each of the n sets of 3D bounding boxes, so that the perception effect matches the entire collaborative system.

[0028]

[0029] where $C_i' \in [0,1)$ is the single confidence level after local correction, $C_i \in [0,1)$ is the single confidence level before local correction, and $C_{min}$ and $C_{max}$ are the minimum and maximum confidence values in the 3D target bounding box group of a single intelligent sensing agent before local correction.

[0030] Furthermore, step S3 specifically includes:

[0031] The collaborative perception system for road testing needs to construct a BEV spatial confidence map based on the offline map and the coordinates of each intelligent sensor, adjust the regional confidence of each intelligent sensor, objectively evaluate the detection results of each intelligent sensor, and globally correct or even remove the results from intelligent sensors that have difficulty detecting in a given region.

[0032] The correction principle is to determine which region of the intelligent sensing agent the detected 3D target bounding box falls within, whether the agent has any occlusion within that region, and the agent's confidence in its detection results for that region, and to adjust the confidence accordingly. In the collaborative relationship among intelligent sensing agents A, B, and C, … is the parameter used for global correction of the detection results for target C.

[0033] Furthermore, step S4 specifically includes:

[0034] To obtain more accurate and consistent detection results, the Intersection over Union (IoU) ratio is used to measure the degree of overlap between the detection results of different sensors. When the detection results of multiple intelligent sensors overlap significantly within the three-dimensional bounding box (i.e., the IoU value is high), these detection regions are considered to represent the same target object. The specific calculation formula is as follows:

[0035] $\mathrm{IoU}(A, B) = \dfrac{|A \cap B|}{|A \cup B|}$

[0036] Where A is a detection target box of a smart sensor, and B is a detection target box of another smart sensor. The aggregation of multiple detection target boxes of a single smart sensor is calculated internally by the smart sensor itself. This method calculates the aggregation of target boxes among the 3D target detection results of multiple smart sensors.

[0037] Then, these overlapping regions are aggregated, and the 3D target boxes are fused using the corrected confidence probability coefficients to obtain a more accurate detection result. The method used is a variant of the Softmax function: temperature-regulated Softmax. When the temperature parameter is 0.2, it enhances the influence of high-confidence (>0.5) results while suppressing the influence of low-confidence (<=0.5) results. The specific formula is explained below:

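[0038] $w_i = \dfrac{\exp(C_i / T)}{\sum_{j} \exp(C_j / T)}$

[0039] Where $C_i$ is the globally corrected confidence of the $i$-th detection box in an aggregated group, $w_i$ is its fusion weight, and $T$ is the temperature parameter, $T \in (0, \infty)$. By adjusting the temperature $T$, the sparsity of the weights can be controlled. Setting $T = 0.20$ strengthens the effect of high-confidence results and weakens the effect of low-confidence results.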

[0040] A computer device includes a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the program to implement any of the above-described collaborative confidence fusion perception methods for autonomous driving.

[0041] A computer-readable storage medium storing a computer program that, when executed by a processor, implements any of the above-described collaborative confidence fusion perception methods for autonomous driving.

[0042] Compared with the prior art, the advantages of the present invention are as follows:

[0043] This invention eliminates model differences between different intelligent agents and leverages the strengths of different modal sensor data. It avoids forcibly converting visual images, which are poorly suited to BEV feature extraction, into BEV features merely to unify feature forms. By fully leveraging the strengths of each modality's data features, the ability of each specialized model to learn the features suited to it, and the systematic adjustment advantages of collaborative perception, it achieves better collaborative perception results.

Attached Figure Description

[0044] Figure 1 is the main flowchart of a collaborative confidence fusion perception method for autonomous driving according to the present invention;

[0045] Figure 2 is an analysis of field-of-view occlusion and confidence level for intelligent driving vehicles;

[0046] Figure 3 is a BEV spatial confidence map.

Detailed Implementation

[0047] The specific implementation of the present invention is described below with reference to embodiments:

[0048] It should be noted that the structures, proportions, sizes, etc. shown in this specification are only used to complement the content disclosed in the specification for those skilled in the art to understand and read, and are not intended to limit the conditions under which the present invention can be implemented. Any modifications to the structure, changes in the proportions, or adjustments to the size, without affecting the effects and objectives that the present invention can produce, should still fall within the scope of the technical content disclosed in the present invention.

[0049] Furthermore, the terms such as "upper," "lower," "left," "right," "middle," and "one" used in this specification are merely for clarity of description and are not intended to limit the scope of the invention. Any changes or adjustments to their relative relationships, without substantially altering the technical content, should also be considered within the scope of the invention.

[0050] Example 1:

[0051] S1: Time alignment of intelligent sensing entity detection results

[0052] In 3D target detection, the detection results at different timestamps may have errors due to object movement, sensor delay, etc. In order to perform time alignment, this invention uses linear interpolation to align the 3D detection results at different timestamps. The following is the process of time alignment using linear interpolation, which can simplify complex motion patterns.

[0053] $R_t = R_{t_1} + \dfrac{t - t_1}{t_1 - t_2}\,(R_{t_1} - R_{t_2})$

[0054] $R_{t_2}$ is the three-dimensional target detection result of the sensing agent at timestamp $t_2$;

[0055] $R_{t_1}$ is the three-dimensional target detection result of the sensing agent at timestamp $t_1$;

[0056] $R_t$ is the 3D target detection result of the sensing agent after time alignment to the timestamp $t$, with $t_2 < t_1 < t$.

[0057] The linear interpolation method multiplies the time weighting coefficient $\frac{t - t_1}{t_1 - t_2}$ by the difference between the detection results at timestamps $t_1$ and $t_2$, $R_{t_1} - R_{t_2}$, and uses the product to time-compensate the detection result, yielding the detection result at timestamp $t$. This method assumes that the motion of the object over the times $t_2$, $t_1$, and $t$ is linear. The more frequently the perception results are updated, the better the alignment effect of this method.
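As an illustration of the time alignment in S1, the following minimal Python sketch extrapolates one detection from its two most recent timestamps to a common target time. The function name `align_to_timestamp`, the representation of a detection as a list of numeric box parameters, and the premise that the same object has already been matched across the two frames are assumptions made for this example and are not taken from the patent text. Heading angles would additionally need wrap-around handling.

```python
from typing import List, Sequence


def align_to_timestamp(r_t2: Sequence[float], r_t1: Sequence[float],
                       t2: float, t1: float, t: float) -> List[float]:
    """Linearly extrapolate a detection from t2 < t1 to the target time t.

    r_t2 and r_t1 hold the same box parameters (for example x, y, z of the
    box centre) observed at timestamps t2 and t1. This layout is an assumption.
    """
    if not t2 < t1 <= t:
        raise ValueError("expected t2 < t1 <= t")
    # Time weighting coefficient applied to the difference of the two results.
    weight = (t - t1) / (t1 - t2)
    return [new + weight * (new - old) for old, new in zip(r_t2, r_t1)]


# Box centre in metres, seen at 0.0 s and 0.1 s, aligned to 0.15 s.
aligned = align_to_timestamp([10.0, 2.0, 0.0], [11.0, 2.0, 0.0],
                             t2=0.0, t1=0.1, t=0.15)
print(aligned)  # x advances by about half of the last 1 m step
```

The smaller the gap between t and t1 relative to the spacing of the two observations, the smaller the extrapolated correction, which is one way to read the remark that faster updates give better alignment.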

[0058] S2: Local confidence correction of detection results for a single intelligent sensor

[0059] Because there is no unified standard for calculating confidence scores in the field of autonomous driving object detection, the confidence scores of 3D bounding boxes detected by one intelligent perception agent are not directly comparable with those of other agents. For example, in the same scene, there might be an aggressive but inaccurate perception agent and a conservative but accurate one. In this case, the 3D bounding boxes that these two agents each report with an 80% confidence score may differ significantly in quality. Furthermore, the inaccurate perception agent might even dominate the aggregation process due to overconfidence, reducing the accuracy of the final result. Therefore, to achieve accurate and effective collaborative perception, a unique confidence evaluation system is needed to adapt to different environments, perception models, and sensor configurations.

[0060] Before applying confidence correction to the 3D detection results of each intelligent agent, it is necessary to evaluate the environmental perception capability of each agent:

[0061] $A_b = f(E, M, S)$

[0062] where the environmental perception capability $A_b \in [0,1)$, $E$ is the environmental parameter, $M$ is the perception model, and $S$ is the sensor configuration.

[0063] Each of the n sets of 3D bounding boxes detected by the n intelligent sensors typically contains multiple 3D bounding boxes. To weaken intelligent sensors that are overconfident but perceive poorly and to strengthen those that are relatively conservative but perceive well, it is necessary to perform local confidence correction on each of the n sets of 3D bounding boxes, so that the perception effect matches the entire collaborative system.

[0064]

[0065] where $C_i' \in [0,1)$ is the single confidence level after local correction, $C_i \in [0,1)$ is the single confidence level before local correction, and $C_{min}$ and $C_{max}$ are the minimum and maximum confidence values in the 3D target bounding box group of a single intelligent sensing agent before local correction.
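The correction formula itself is not legible in this text, so the Python sketch below shows only one plausible form that uses the same named quantities: a min-max rescaling of each agent's confidences between $C_{min}$ and $C_{max}$, weighted by the perception capability $A_b$. This particular combination, the function name `local_correct`, and the example values of $A_b$ are assumptions for illustration and should not be read as the patented formula.

```python
from typing import List, Sequence


def local_correct(confidences: Sequence[float], ability: float) -> List[float]:
    """Hypothetical local correction: min-max rescale, then weight by A_b.

    confidences: raw scores C_i of one agent's 3D box group, each in [0, 1).
    ability:     assumed perception capability A_b of that agent, in [0, 1).
    """
    if not 0.0 <= ability < 1.0:
        raise ValueError("ability A_b must lie in [0, 1)")
    c_min, c_max = min(confidences), max(confidences)
    span = c_max - c_min
    if span == 0.0:
        # All boxes share one score, so only the capability weight applies.
        return [ability * c for c in confidences]
    return [ability * (c - c_min) / span for c in confidences]


# An overconfident agent with low assumed capability ...
aggressive = local_correct([0.80, 0.90, 0.95], ability=0.5)
# ... and a conservative agent with high assumed capability.
conservative = local_correct([0.40, 0.55, 0.70], ability=0.9)
print(aggressive, conservative)
```

Under these assumptions the corrected scores stay in $[0, A_b]$, so the best box of the conservative agent ends up above the best box of the overconfident one even though its raw score was lower.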

[0066] S3: BEV Spatial Confidence Map Construction and Global Confidence Correction

[0067] After the aforementioned corrections, the detection results of each intelligent sensing agent meet the requirements for participating in the collaborative sensing system in this scenario. However, relatively independent intelligent sensing agents have inherent limitations. For example, intelligent driving vehicle No. 3 in Figure 2 has many surrounding vehicles at close range, which seriously obstructs its field of view, so it is very prone to missed or false detections of pedestrians during road tests. Likewise, intelligent driving vehicle No. 2 in Figure 2 will inevitably have low confidence when detecting the area to the right of vehicle No. 3's driving direction.

[0068] Therefore, as shown in Figure 3, the collaborative perception system for road testing needs to construct a BEV spatial confidence map based on the offline map and the coordinates of each intelligent sensor. It then adjusts the regional confidence of each intelligent sensor, objectively evaluates the detection results, and globally corrects or even removes the results from intelligent sensors that may have difficulty detecting certain objects. The correction principle mainly involves determining which region the detected 3D target box falls within, whether the intelligent sensor has any occlusion within that region, and the confidence of the detection result for that region. As shown in the left figure, in the collaborative relationship between intelligent sensors A, B, and C… This refers to the global correction parameter for the detection results of target C.
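To make the idea of a regional confidence map concrete, here is a minimal Python sketch under strong simplifying assumptions: the BEV plane is a regular grid, occluding objects (which in practice might come from the offline map or from other agents' detections) are approximated as discs, and an agent's confidence in a visible cell decays linearly with range. None of these rules, names, or thresholds come from the patent text. They stand in for whatever regional confidence rule the system actually uses.

```python
import math
from typing import Dict, List, Optional, Tuple

Point = Tuple[float, float]
Occluder = Tuple[Point, float]  # (centre, radius) of a blocking object


def _distance_to_segment(p: Point, a: Point, b: Point) -> float:
    """Shortest distance from point p to the segment a-b."""
    dx, dy = b[0] - a[0], b[1] - a[1]
    length_sq = dx * dx + dy * dy
    if length_sq == 0.0:
        return math.hypot(p[0] - a[0], p[1] - a[1])
    u = ((p[0] - a[0]) * dx + (p[1] - a[1]) * dy) / length_sq
    u = max(0.0, min(1.0, u))
    return math.hypot(p[0] - (a[0] + u * dx), p[1] - (a[1] + u * dy))


def _is_blocked(agent: Point, target: Point, occluders: List[Occluder]) -> bool:
    """True if an occluder cuts the line of sight without covering the target."""
    for centre, radius in occluders:
        covers = math.hypot(target[0] - centre[0], target[1] - centre[1]) <= radius
        if not covers and _distance_to_segment(centre, agent, target) < radius:
            return True
    return False


def build_confidence_map(agent: Point, occluders: List[Occluder],
                         x_cells: int, y_cells: int, cell_size: float,
                         max_range: float = 60.0) -> Dict[Tuple[int, int], float]:
    """Per-cell confidence of one agent: 0 if hidden or out of range."""
    conf_map = {}
    for ix in range(x_cells):
        for iy in range(y_cells):
            centre = ((ix + 0.5) * cell_size, (iy + 0.5) * cell_size)
            dist = math.hypot(centre[0] - agent[0], centre[1] - agent[1])
            seen = dist < max_range and not _is_blocked(agent, centre, occluders)
            conf_map[(ix, iy)] = 1.0 - dist / max_range if seen else 0.0
    return conf_map


def global_correct(box_centre: Point, confidence: float,
                   conf_map: Dict[Tuple[int, int], float], cell_size: float,
                   drop_below: float = 0.05) -> Optional[float]:
    """Scale a box confidence by its cell's value; None means 'remove'."""
    cell = (int(box_centre[0] // cell_size), int(box_centre[1] // cell_size))
    corrected = confidence * conf_map.get(cell, 0.0)
    return corrected if corrected >= drop_below else None


agent_xy = (2.5, 2.5)
occluders = [((12.5, 2.5), 2.0)]  # e.g. a vehicle directly ahead
conf_map = build_confidence_map(agent_xy, occluders, x_cells=8, y_cells=4,
                                cell_size=5.0)
hidden = global_correct((27.0, 2.0), 0.9, conf_map, cell_size=5.0)
visible = global_correct((7.0, 12.0), 0.9, conf_map, cell_size=5.0)
print(hidden, visible)
```

In this toy scene the box reported behind the occluding vehicle is removed, while the box in an unobstructed cell keeps most of its confidence.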

[0069] S4: Box Aggregation and Confidence Fusion

[0070] For the same object to be detected, the detection model typically outputs a set of overlapping bounding box candidate objects. For example, as shown in the figure above, both A and B may detect vehicle C and assign it a corresponding 3D target box. To obtain more accurate and consistent detection results, the Intersection over Union (IoU) ratio can be used to measure the degree of overlap between the detection results of different sensors. When the detection results of multiple sensors overlap significantly (i.e., the IoU value is high), these detection regions can be considered to represent the same target object. The specific calculation formula is as follows:

[0071] $\mathrm{IoU}(A, B) = \dfrac{|A \cap B|}{|A \cup B|}$

[0072] Where A is a detection target box of a smart sensor, B is a detection target box of another smart sensor, and the aggregation of multiple detection target boxes of a single smart sensor is calculated internally by the smart sensor itself.
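The IoU test and the cross-agent grouping can be sketched as follows. For brevity this Python example assumes axis-aligned 3D boxes given as `(x_min, y_min, z_min, x_max, y_max, z_max)`. Real detection boxes usually carry a yaw angle and need a rotated-box intersection. The greedy grouping rule, the threshold of 0.5, and all names are illustrative assumptions rather than details from the patent text.

```python
from typing import List, Sequence, Tuple

Box = Sequence[float]          # (x_min, y_min, z_min, x_max, y_max, z_max)
Detection = Tuple[str, Box]    # (agent id, box)


def volume(box: Box) -> float:
    return (box[3] - box[0]) * (box[4] - box[1]) * (box[5] - box[2])


def iou_3d(a: Box, b: Box) -> float:
    """Intersection volume divided by union volume of two axis-aligned boxes."""
    inter = 1.0
    for axis in range(3):
        low = max(a[axis], b[axis])
        high = min(a[axis + 3], b[axis + 3])
        if high <= low:
            return 0.0  # no overlap along this axis
        inter *= high - low
    return inter / (volume(a) + volume(b) - inter)


def group_boxes(detections: List[Detection],
                threshold: float = 0.5) -> List[List[Detection]]:
    """Greedy grouping: a box joins the first group whose first box overlaps
    it by at least `threshold` and that holds no box from the same agent."""
    groups: List[List[Detection]] = []
    for agent_id, box in detections:
        for group in groups:
            same_agent = any(agent_id == other for other, _ in group)
            if not same_agent and iou_3d(group[0][1], box) >= threshold:
                group.append((agent_id, box))
                break
        else:
            groups.append([(agent_id, box)])
    return groups


box_a = (0.0, 0.0, 0.0, 4.0, 2.0, 2.0)
box_b = (1.0, 0.0, 0.0, 5.0, 2.0, 2.0)
overlap = iou_3d(box_a, box_b)
groups = group_boxes([("A", box_a), ("B", box_b),
                      ("B", (20.0, 0.0, 0.0, 24.0, 2.0, 2.0))])
print(overlap, [len(g) for g in groups])
```

Boxes from the same agent are never merged here, reflecting the statement that each agent aggregates its own boxes internally.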

[0073] These overlapping regions are then aggregated, and the 3D target boxes are fused using the corrected confidence probability coefficients to obtain a more accurate detection result. The method used in this invention is a variant of the Softmax function: Temperature-scaled Softmax. This method can enhance the influence of high confidence (>0.5) results while suppressing the influence of low confidence (<=0.5) results when the temperature parameter is 0.2. The specific formula is explained below:

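[0074] $w_i = \dfrac{\exp(C_i / T)}{\sum_{j} \exp(C_j / T)}$

[0075] Where $C_i$ is the globally corrected confidence of the $i$-th detection box in an aggregated group, $w_i$ is its fusion weight, and $T$ is the temperature parameter, $T \in (0, \infty)$. By adjusting the temperature $T$, the sparsity of the weights can be controlled. This invention sets $T = 0.20$ to strengthen the effect of high-confidence results and weaken the effect of low-confidence results.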
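The effect of the temperature can be seen in a small Python sketch. The weights follow the standard temperature-scaled Softmax. Fusing the boxes and the confidence as weighted averages, along with the names `softmax_weights` and `fuse_group`, is an assumption made for illustration, since the text does not spell out how the weights are applied.

```python
import math
from typing import List, Sequence, Tuple


def softmax_weights(confidences: Sequence[float],
                    temperature: float = 0.2) -> List[float]:
    """Temperature-scaled Softmax over the confidences of one box group."""
    if temperature <= 0.0:
        raise ValueError("temperature must be positive")
    peak = max(confidences)  # subtract the maximum for numerical stability
    exps = [math.exp((c - peak) / temperature) for c in confidences]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_group(boxes: Sequence[Sequence[float]], confidences: Sequence[float],
               temperature: float = 0.2) -> Tuple[List[float], float]:
    """Assumed fusion rule: weighted average of box parameters and scores."""
    weights = softmax_weights(confidences, temperature)
    fused_box = [sum(w * box[k] for w, box in zip(weights, boxes))
                 for k in range(len(boxes[0]))]
    fused_conf = sum(w * c for w, c in zip(weights, confidences))
    return fused_box, fused_conf


scores = [0.9, 0.4]
sharp = softmax_weights(scores, temperature=0.2)
flat = softmax_weights(scores, temperature=1.0)
fused_box, fused_conf = fuse_group(
    [(0.0, 0.0, 0.0, 4.0, 2.0, 2.0), (1.0, 0.0, 0.0, 5.0, 2.0, 2.0)], scores)
print(sharp, flat, fused_conf)
```

With T = 0.2 the 0.9-confidence box receives over 90% of the weight, whereas at T = 1.0 the two boxes are weighted much more evenly, which illustrates how a low temperature concentrates the fusion on high-confidence results.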

[0076] Those skilled in the art will understand that embodiments of the present invention can be provided as methods, systems, or computer program products. Therefore, the present invention can take the form of a completely hardware embodiment, a completely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present invention can take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.

[0077] This invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions can be provided to a processor of a general-purpose computer, special-purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create a device for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.

[0078] These computer program instructions may also be stored in a computer-readable storage medium that can direct a computer or other programmable data processing device to function in a particular manner, such that the instructions stored in the computer-readable storage medium produce an article of manufacture including instruction means that implement the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.

[0079] These computer program instructions may also be loaded onto a computer or other programmable data processing equipment to cause a series of operational steps to be performed on the computer or other programmable equipment to produce a computer-implemented process, such that the instructions executed on the computer or other programmable equipment provide steps for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.

[0080] The preferred embodiments of the present invention have been described in detail above. However, the present invention is not limited to the above embodiments. Within the scope of knowledge possessed by those skilled in the art, various changes can be made without departing from the spirit of the present invention.

[0081] Many other changes and modifications can be made without departing from the concept and scope of this invention. It should be understood that this invention is not limited to the specific embodiments, and the scope of this invention is defined by the appended claims.

Claims

1. A method for cooperative confidence fusion perception applied to autonomous driving, characterized in that the method includes:
S1: Perform time alignment processing on the three-dimensional target detection results of multiple original intelligent sensing objects to obtain the time-aligned detection results of the intelligent sensing objects;
S2: Evaluate the environmental perception capability of each intelligent sensor using the time-aligned 3D target detection results, and locally correct the confidence of its detection results to obtain the corrected detection results and confidence;
S3: Based on the corrected detection results and confidence levels, offline maps, and coordinates of intelligent sensing objects, construct a bird's-eye view BEV spatial confidence map and generate corresponding confidence information for each region; according to the spatial confidence map and the detection results of other time-aligned intelligent sensing objects, perform global confidence correction on the detection results of individual intelligent sensing objects to obtain the globally corrected detection results and confidence levels;
S4: For overlapping detection boxes, the degree of overlap is evaluated using the intersection-union ratio (IoU), similar target boxes are aggregated, and the confidence scores of the globally corrected detection results are fused using the temperature-regulated Softmax method to obtain the final detection boxes and confidence scores.
Step S2 specifically includes: before applying confidence correction to the 3D detection results of each intelligent agent, it is necessary to evaluate the environmental perception capability of each agent: $A_b = f(E, M, S)$, wherein the environmental perception capability $A_b \in [0,1)$, the environmental parameter is $E$, the perception model is $M$, and the sensor configuration is $S$. Each of the n sets of 3D bounding boxes detected by the n intelligent sensing agents typically contains multiple 3D bounding boxes. To weaken intelligent sensing agents that are overconfident but perceive poorly and to strengthen those that are relatively conservative but perceive well, it is necessary to perform local confidence correction on each of the n sets of 3D target bounding boxes to achieve a perception effect that matches the entire collaborative system: …, where $C_i' \in [0,1)$ is the single confidence level after local correction, $C_i \in [0,1)$ is the single confidence level before local correction, and $C_{min}$ and $C_{max}$ are the minimum and maximum confidence values in the 3D target bounding box group of a single intelligent sensing agent before local correction.

2. The collaborative confidence fusion perception method for autonomous driving according to claim 1, characterized in that step S1 includes: the 3D detection results at different timestamps are aligned using linear interpolation, as detailed below: $R_t = R_{t_1} + \dfrac{t - t_1}{t_1 - t_2}\,(R_{t_1} - R_{t_2})$, in which $R_{t_2}$ is the three-dimensional target detection result of the intelligent sensing entity at timestamp $t_2$; $R_{t_1}$ is the three-dimensional target detection result of the intelligent sensing entity at timestamp $t_1$; and $R_t$ is the three-dimensional target detection result of the intelligent sensing entity after time alignment to timestamp $t$. The linear interpolation method forms a weighted sum of $R_{t_1}$ and $R_{t_2}$ according to their respective time weighting coefficients, thereby compensating the detection result at timestamp $t$. The method assumes that the motion of the object over the times $t_2$, $t_1$, and $t$ is linear, and the alignment effect is better when the perception result is updated more frequently.

3. The collaborative confidence fusion perception method for autonomous driving according to claim 1, characterized in that step S3 specifically includes: the collaborative perception system for road testing needs to construct a BEV spatial confidence map based on the offline map and the coordinates of each intelligent sensor, adjust the regional confidence of each intelligent sensor, objectively evaluate the detection results of each intelligent sensor, and globally correct or even remove the results detected by intelligent sensors that have difficulty in detection. The correction determines in which region of the intelligent sensing object the detected 3D target bounding box is located and whether the intelligent sensing object has any occlusion in this region, and adjusts the confidence of the intelligent sensing object's detection results in this region accordingly.

4. The collaborative confidence fusion perception method for autonomous driving according to claim 1, characterized in that step S4 specifically includes: to obtain more accurate and consistent detection results, the Intersection over Union (IoU) ratio is used to measure the degree of overlap between the detection results of different sensors. When the 3D target bounding boxes detected by multiple intelligent sensors overlap significantly (i.e., the IoU value is high), these detection regions are considered to represent the same target object. The specific calculation formula is as follows: $\mathrm{IoU}(A, B) = \dfrac{|A \cap B|}{|A \cup B|}$, where A is a detection target box of a smart sensor, and B is a detection target box of another smart sensor. The aggregation of multiple detection target boxes of a single smart sensor is calculated internally by the smart sensor itself. This method calculates the aggregation of target boxes among the 3D target detection results of multiple smart sensors. Then, the overlapping regions are aggregated, and the 3D target boxes are fused using the corrected confidence probability coefficients to obtain a more accurate detection result. The method used is a variant of the Softmax function: temperature-regulated Softmax. When the temperature parameter is 0.2, it enhances the influence of high-confidence results while suppressing the influence of low-confidence results. The specific formula is as follows: $w_i = \dfrac{\exp(C_i / T)}{\sum_{j} \exp(C_j / T)}$, where $C_i$ is the globally corrected confidence of the $i$-th detection box in an aggregated group, $w_i$ is its fusion weight, and $T$ is the temperature parameter. By adjusting the temperature $T$, the sparsity of the weights can be controlled. Setting $T = 0.20$ strengthens the effect of high-confidence results and weakens the effect of low-confidence results.

5. A computer device, characterized in that, The system includes a memory, a processor, and a computer program stored in the memory and executable on the processor. When the processor executes the program, it implements a collaborative confidence fusion perception method for autonomous driving as described in any one of claims 1 to 4.

6. A computer-readable storage medium, characterized in that, The computer-readable storage medium stores a computer program that, when executed by a processor, implements a cooperative confidence fusion perception method for autonomous driving as described in any one of claims 1 to 4.