Multi-modal fusion 3D target detection method, device, equipment and storage medium
By fusing historical frames and global query vectors through self-attention and deformable attention mechanisms, the feature extraction of images and point clouds is simplified, solving the complexity and redundancy problems of existing multimodal fusion 3D object detection and achieving efficient and robust 3D object detection.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- ZHUHAI KUWA TECHNOLOGY CO LTD
- Filing Date
- 2025-09-24
- Publication Date
- 2026-06-23
AI Technical Summary
Existing multimodal fusion 3D target detection methods suffer from problems such as model complexity, redundancy, long processing time, and high resource consumption. Furthermore, the multimodal feature fusion process is repetitive, which fails to effectively improve detection performance and efficiency.
A self-attention mechanism is adopted to fuse historical frames and learnable global query vectors. Through image and point cloud feature extraction, deformable attention mechanism and multi-head attention mechanism are used for feature fusion, which is simplified into a parallel dual-branch architecture to directly generate 3D detection results.
It improves detection performance, reduces missed detections in point cloud blind spots, enhances robustness in occluded scenarios, simplifies the architecture to improve inference efficiency, and enhances system robustness and tracking capabilities, making it suitable for real-time autonomous driving scenarios.
Figure CN121033809B_ABST
Description
Technical Field
[0001] This invention belongs to the field of target detection technology, specifically relating to a multimodal fusion 3D target detection method, device, equipment, and storage medium.
Background Technology
[0002] Autonomous vehicles require continuous environmental perception to obtain the distribution of obstacles, among which 3D object detection is a crucial functional module. It can predict information such as the position and size of surrounding obstacles. Generally, autonomous vehicles are equipped with multiple sensors, including cameras and LiDAR, and each sensor has its own advantages and disadvantages. Therefore, the detection performance of a single modality is not entirely satisfactory. This has prompted vehicles to use multimodal sensors as inputs, making full use of the advantages of each modality, which has greatly improved 3D object detection.
[0003] Current multimodal fusion methods are mainly divided into feature-level fusion and proposal-level fusion. Feature-level fusion methods involve constructing a unified feature space and then fusing features from different modalities into that unified feature space.
[0004] The paper MV2DFusion: Leveraging Modality-Specific Object Semantics for Multi-Modal 3D Detection proposes a novel multi-modal detection framework that first employs object-level feature fusion, followed by cross-modal feature fusion through an attention mechanism. As shown in Figure 1, the workflow of MV2DFusion is as follows:
[0005] First, image features and point cloud features are extracted using independent image and point cloud backbone networks.
[0006] Then, using these features, two detection results, namely 2D detection vectors and 3D detection vectors, are generated using 2D detectors and 3D detectors respectively.
[0007] Then, based on the features and detection results, image query vectors and point cloud query vectors are generated. The image query vector is based on the 2D detection vector obtained by the 2D detector and the depth prediction result. The image query vector is projected onto 3D space to obtain the 3D query vector of the image. Since the point cloud itself is represented by 3D space coordinates, no other transformation is required for the point cloud query vector.
[0008] Finally, a decoder is designed to fuse the two target query vectors with the historical frame target query vectors through an attention mechanism, effectively utilizing historical information. Then, the features of the two modalities are fused separately to generate 3D prediction results.
[0009] MV2DFusion has the following problems:
[0010] 1) Existing methods require detectors for both images and point clouds, and the requirements for the detectors are high, resulting in complex models.
[0011] 2) In the process of multimodal feature fusion, two independent object detectors are required to pre-generate the object query vector. For the image branch, a 2D detector and an image query vector generator are required to jointly generate the image query vector. For the point cloud branch, a sparse detector is required. The fusion process is relatively redundant and is fused twice. First, the two object query vectors are integrated with the object query vector of the historical frame and fused once through self-attention. Then, cross-attention is used to fuse the features of the two modalities.
[0012] The image target Query (image query vector) and the point cloud target Query (point cloud query vector) already contain certain target information. Then, the two original features are fused again through cross-modal processing, which involves some repetition.
[0013] 3) The overall process of MV2DFusion is relatively complicated and redundant, takes a long time, and consumes a lot of resources.
Summary of the Invention
[0014] To address the aforementioned technical problems, this invention proposes a multimodal fusion method, apparatus, device, and storage medium for 3D target detection.
[0015] To achieve the above objectives, the technical solution of the present invention is as follows:
[0016] First, this invention discloses a multimodal fusion 3D target detection method, comprising:
[0017] Step S1: Acquire the image data and point cloud data of the current frame;
[0018] Step S2: The target query vector of the historical frame and the learnable global query vector are fused through a self-attention mechanism to form the target query vector of the current frame;
[0019] Step S3: Extract features from the acquired image data to obtain the image features of the current frame;
[0020] The target query vector of the current frame is projected onto the image, and then the deformable attention mechanism is used to obtain the offset point and weight of the projection point on the image. The image features at the offset point position are obtained, and the image features to be fused are obtained by weighting.
[0021] Feature extraction is performed on the collected point cloud data to obtain the point cloud features of the current frame, and sparse point cloud query vectors are obtained from the point cloud features.
[0022] The similarity between the target query vector and the point cloud query vector of the current frame is calculated through a multi-head attention mechanism, and the point cloud features to be fused are obtained based on the similarity.
[0023] Step S4: Concatenate the image features to be fused and the point cloud features to be fused, and input them into the feedforward neural network to obtain the updated target query vector of the current frame.
[0024] Step S6: Perform detection based on the updated target query vector of the current frame to obtain the 3D detection result of the current frame.
[0025] Based on the above technical solution, the following improvements can be made:
[0026] As a preferred embodiment, step S3 includes:
[0027] Image processing branch: Extracts features from the acquired image data to obtain the image features of the current frame;
[0028] The target query vector of the current frame is projected onto the image, and then the deformable attention mechanism is used to obtain the offset point and weight of the projection point on the image. The image features at the offset point position are obtained, and the image features to be fused are obtained by weighting.
[0029] Point cloud processing branch: Extract features from the collected point cloud data, obtain the point cloud features of the current frame, and obtain sparse point cloud query vectors from the point cloud features.
[0030] The similarity between the target query vector and the point cloud query vector of the current frame is calculated through a multi-head attention mechanism, and the point cloud features to be fused are obtained based on the similarity.
[0031] The image processing branch and the point cloud processing branch process the corresponding image data and point cloud data in parallel and independently.
[0032] As a preferred approach, the learnable global query vector is obtained through the following steps:
[0033] Step A: Preset a set of global multi-point coordinates covering the 3D scene space;
[0034] Step B: Construct a learnable encoding layer whose parameters can be dynamically updated through neural network training;
[0035] Step C: Input the global multi-point coordinates into the learnable encoding layer to generate the corresponding embedding vectors, which are the learnable global query vectors.
[0036] As a preferred option, the 3D target detection method further includes:
[0037] Step S5: Correct the updated target query vector of the current frame;
[0038] Accordingly, step S6 becomes: perform detection based on the corrected target query vector of the current frame to obtain the 3D detection result of the current frame.
[0039] Secondly, this invention discloses a multimodal fusion 3D target detection device, comprising:
[0040] The acquisition module is used to acquire image data and point cloud data of the current frame;
[0041] The fusion module is used to fuse the target query vector of historical frames and the learnable global query vector through a self-attention mechanism, and use it as the target query vector of the current frame.
[0042] The feature extraction module is used to extract features from the acquired image data and obtain the image features of the current frame;
[0043] The target query vector of the current frame is projected onto the image, and then the deformable attention mechanism is used to obtain the offset point and weight of the projection point on the image. The image features at the offset point position are obtained, and the image features to be fused are obtained by weighting.
[0044] Feature extraction is performed on the collected point cloud data to obtain the point cloud features of the current frame, and sparse point cloud query vectors are obtained from the point cloud features.
[0045] The similarity between the target query vector and the point cloud query vector of the current frame is calculated through a multi-head attention mechanism, and the point cloud features to be fused are obtained based on the similarity.
[0046] The concatenation module is used to concatenate the image features to be fused and the point cloud features to be fused, and input them into the feedforward neural network to obtain the updated target query vector of the current frame.
[0047] The detection module is used to perform detection based on the updated target query vector of the current frame and obtain the 3D detection result of the current frame.
[0048] As a preferred embodiment, the feature extraction module includes: a parallel and independently configured image processing branch and a point cloud processing branch;
[0049] The image processing branch includes:
[0050] An image feature extractor is used to extract features from acquired image data and obtain the image features of the current frame.
[0051] The image feature extractor projects the target query vector of the current frame onto the image, then uses a deformable attention mechanism to obtain the offset point and weight of the projection point on the image, obtains the image features at the offset point position, and obtains the image features to be fused by weighting.
[0052] The point cloud processing branch includes:
[0053] A point cloud feature extractor is used to extract features from the collected point cloud data and obtain the point cloud features of the current frame.
[0054] A point cloud query vector generator is used to obtain sparse point cloud query vectors from the point cloud features of the current frame.
[0055] The point cloud feature extractor is used to calculate the similarity between the target query vector of the current frame and the point cloud query vector of the current frame through a multi-head attention mechanism, and to obtain the point cloud features to be fused based on the similarity.
[0056] As a preferred approach, the learnable global query vector is obtained through a global query vector acquisition module, which includes:
[0057] The global multi-point coordinate preset unit is used to preset a set of global multi-point coordinates covering the three-dimensional scene space;
[0058] The learnable encoding layer construction unit is used to construct a learnable encoding layer whose parameters can be dynamically updated through neural network training.
[0059] The global query vector generation unit is used to input global multi-point coordinates into the learnable encoding layer and generate corresponding embedding vectors, which are the learnable global query vectors.
[0060] As a preferred embodiment, the 3D target detection device also includes:
[0061] The correction module is used to correct the updated target query vector of the current frame;
[0062] Furthermore, the detection module is used to perform detection based on the corrected target query vector of the current frame to obtain the 3D detection result of the current frame.
[0063] Thirdly, the present invention discloses a computing device, comprising:
[0064] One or more processors;
[0065] Memory;
[0066] And one or more programs, wherein the one or more programs are stored in memory and configured to be executed by one or more processors, the one or more programs including instructions for performing any of the above-described multimodal fusion 3D target detection methods.
[0067] Fourthly, the present invention discloses a storage medium storing one or more computer-readable programs, the one or more programs including instructions adapted to be loaded from memory and executed to perform any of the above-described multimodal fusion 3D target detection methods.
[0068] This invention discloses a multimodal fusion method, apparatus, device, and storage medium for 3D target detection, which has the following beneficial effects:
[0069] First, detection performance has been improved.
[0070] 1) Reduce missed detections in point cloud blind spots.
[0071] By using target query vectors from historical frames to compensate for positional changes caused by vehicle motion, and combining them with learnable global query vectors to detect new targets, the problem of blind spot detection caused by relying solely on point cloud query vectors is solved.
[0072] The target detected by the target query vector in the historical frame can be directly traced back to the previous frame, realizing the integration of detection and tracking, and avoiding the complexity of repeated association of data from multiple frames.
[0073] 2) Adapt modal characteristics to improve feature utilization.
[0074] Image processing branch: By dynamically capturing projection point offsets and weights through projection and deformable-attention mechanisms, the inaccuracy of monocular depth prediction and occlusion problems are avoided, and effective features are extracted by utilizing image continuity.
[0075] Point cloud processing branch: Based on the multi-head attention mechanism, the similarity between the target query vector and the point cloud query vector is calculated, focusing on the effective location information in the sparse point cloud, reducing invalid feature extraction, and improving computational efficiency.
[0076] 3) Enhance robustness in occlusion scenarios.
[0077] Image feature fusion does not rely on depth prediction. It uses deformable attention to adaptively adjust the sampling position, which alleviates the depth prediction bias caused by occlusion and improves the feature accuracy in complex scenes.
[0078] Second, efficiency and robustness optimization.
[0079] 1) Simplify the architecture and improve inference efficiency.
[0080] It eliminates the need for independent 2D / 3D detectors, directly constructing target representations from target query vectors and global query vectors in historical frames, thus reducing model complexity and computation time.
[0081] The parallel dual-branch architecture and single-step feature concatenation and fusion avoid the redundant "two-step fusion" process of existing methods, reduce resource consumption, and are suitable for real-time autonomous driving scenarios.
[0082] 2) Modal independence enhances system robustness.
[0083] Image and point cloud features are concatenated and fused. The two have equal status and do not affect each other. When one modality fails, the other modality can still work independently, thus improving the system's fault tolerance.
[0084] Third, the tracking and generalization capabilities are enhanced.
[0085] 1) Explicitly link historical information to simplify the tracking process.
[0086] The target query vector of the current frame directly contains the location information of the historical frames. The detection results can be traced through the target query vector of the historical frames, which naturally supports target tracking. There is no need to design an additional tracking module, thus achieving a seamless connection between detection and tracking.
[0087] 2) Learnable global query vectors enhance the ability to detect new targets.
[0088] The system presets global multi-point coordinates covering a 3D scene, generates a global query vector through a learnable encoding layer, dynamically adapts to different scenes, enhances the detection capability of newly emerging targets, and improves the system's generalization ability.
Attached Figure Description
[0089] To more clearly illustrate the technical solutions of the embodiments of the present invention, the accompanying drawings used in the embodiments will be briefly introduced below. It should be understood that the following drawings only show some embodiments of the present invention and should not be regarded as a limitation on the scope. For those skilled in the art, other related drawings can be obtained based on these drawings without creative effort.
[0090] Figure 1 is a schematic diagram of MV2DFusion.
[0091] Figure 2 is a flowchart of a 3D target detection method provided in an embodiment of the present invention.
[0092] Figure 3 is a schematic diagram illustrating the acquisition of image features at the offset point position according to an embodiment of the present invention.
Detailed Implementation
[0093] The preferred embodiments of the present invention will now be described in detail with reference to the accompanying drawings.
[0094] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.
[0095] The expression “includes” is an “open-ended” expression, which means that there is a corresponding component or step, and should not be interpreted as excluding additional components or steps.
[0096] To achieve the objectives of this invention, some embodiments provide a multimodal fusion 3D target detection method. As shown in Figure 2, the 3D target detection method includes:
[0097] Step S101: Acquire the image data and point cloud data of the current frame;
[0098] Step S102: The target query vector of the historical frame (i.e., historical frame Query) and the learnable global query vector (i.e., learnable Query) are fused through a self-attention mechanism to serve as the target query vector of the current frame (i.e., target Query).
[0099] Step S103: Extract features from the acquired image data to obtain the image features of the current frame;
[0100] The target query vector of the current frame is projected onto the image, and then the deformable attention mechanism is used to obtain the offset point and weight of the projection point on the image. The image features at the offset point position are obtained, and the image features to be fused are obtained by weighting.
[0101] Feature extraction is performed on the collected point cloud data to obtain the point cloud features of the current frame, and sparse point cloud query vectors (i.e., point cloud query) are obtained from the point cloud features.
[0102] The similarity between the target query vector and the point cloud query vector of the current frame is calculated through a multi-head attention mechanism, and the point cloud features to be fused are obtained based on the similarity.
[0103] Step S104: The image features to be fused and the point cloud features to be fused are concatenated and input into the feedforward neural network to obtain the updated target query vector of the current frame.
[0104] Step S105: Correct the updated target query vector of the current frame;
[0105] Step S106: Perform detection based on the corrected target query vector of the current frame to obtain the 3D detection result of the current frame.
[0106] Furthermore, step S103 includes:
[0107] Image processing branch: Extracts features from the acquired image data to obtain the image features of the current frame;
[0108] The target query vector of the current frame is projected onto the image, and then the deformable attention mechanism is used to obtain the offset point and weight of the projection point on the image. The image features at the offset point position are obtained, and the image features to be fused are obtained by weighting.
[0109] Point cloud processing branch: Extract features from the collected point cloud data, obtain the point cloud features of the current frame, and obtain sparse point cloud query vectors from the point cloud features.
[0110] The similarity between the target query vector and the point cloud query vector of the current frame is calculated through a multi-head attention mechanism, and the point cloud features to be fused are obtained based on the similarity.
[0111] The image processing branch and the point cloud processing branch process the corresponding image data and point cloud data in parallel and independently.
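The parallel, independent operation of the two branches can be illustrated with a minimal sketch. The branch bodies below are placeholders (a mean pixel value and a mean point range), and the names image_branch, point_cloud_branch, and run_branches are hypothetical. Threads are used here only to show that neither branch consumes the other's output; the source does not state how parallelism is realized in practice.

```python
from concurrent.futures import ThreadPoolExecutor

def image_branch(target_queries, image):
    """Placeholder image branch: one feature (mean pixel value) per target query."""
    mean = sum(map(sum, image)) / (len(image) * len(image[0]))
    return [[mean] for _ in target_queries]

def point_cloud_branch(target_queries, points):
    """Placeholder point cloud branch: one feature (mean point range) per target query."""
    mean_range = sum((x * x + y * y + z * z) ** 0.5 for x, y, z in points) / len(points)
    return [[mean_range] for _ in target_queries]

def run_branches(target_queries, image, points):
    """Run both branches concurrently, then concatenate their per-query features."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        img_future = pool.submit(image_branch, target_queries, image)
        pc_future = pool.submit(point_cloud_branch, target_queries, points)
        img_feats, pc_feats = img_future.result(), pc_future.result()
    # Each branch only reads the shared target queries and its own sensor data.
    return [i + p for i, p in zip(img_feats, pc_feats)]

fused = run_branches(
    target_queries=[[0.0], [0.0]],
    image=[[1.0, 3.0], [5.0, 7.0]],
    points=[(3.0, 4.0, 0.0)],
)
```

Because the branches share only the read-only target queries, either one could be replaced or disabled without changing the other's code path.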
[0112] Furthermore, the learnable global query vector is obtained through the following steps:
[0113] Step A: Preset a set of global multi-point coordinates covering the 3D scene space;
[0114] Step B: Construct a learnable encoding layer whose parameters can be dynamically updated through neural network training;
[0115] Step C: Input the global multi-point coordinates into the learnable encoding layer to generate the corresponding embedding vectors, which are the learnable global query vectors.
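Steps A to C can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the regular grid layout, the scene ranges, the single linear layer used as the encoding layer, and the names make_grid and EncodingLayer are all hypothetical. The weights are randomly initialized and stand in for parameters that training would update.

```python
import random

def make_grid(x_range, y_range, z_range, steps):
    """Step A: preset anchor coordinates covering the 3D scene on a regular grid."""
    def axis(lo, hi, n):
        if n == 1:
            return [(lo + hi) / 2]
        return [lo + (hi - lo) * i / (n - 1) for i in range(n)]
    return [(x, y, z)
            for x in axis(*x_range, steps[0])
            for y in axis(*y_range, steps[1])
            for z in axis(*z_range, steps[2])]

class EncodingLayer:
    """Step B: a linear layer (3 -> dim) whose weights would be updated by training."""
    def __init__(self, dim, seed=0):
        rng = random.Random(seed)
        self.weight = [[rng.uniform(-0.1, 0.1) for _ in range(3)] for _ in range(dim)]
        self.bias = [0.0] * dim

    def __call__(self, point):
        return [sum(w * p for w, p in zip(row, point)) + b
                for row, b in zip(self.weight, self.bias)]

# Step C: encode every preset coordinate into one global query vector.
anchors = make_grid((-50.0, 50.0), (-50.0, 50.0), (-2.0, 2.0), steps=(4, 4, 2))
encoder = EncodingLayer(dim=8)
global_queries = [encoder(p) for p in anchors]
```

With this layout the number of global queries equals the number of grid points (here 4 x 4 x 2 = 32), so scene coverage and query count are controlled by the same setting.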
[0116] It is worth noting that the query vector in this invention is called Query.
[0117] Since this invention abandons the generation of image query vectors, simply using point cloud query vectors would lead to missed detections in point cloud blind spots. Therefore, historical frame queries and learnable queries are fused through a self-attention mechanism to serve as the target query for the current frame.
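The fusion of historical frame queries and learnable queries can be sketched as a single self-attention pass over their union. This is a simplified illustration: it uses one head and identity query/key/value projections, whereas a trained model would use learned projections. The vectors and names are made up for the example.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(queries):
    """Single-head scaled dot-product self-attention with identity projections."""
    dim = len(queries[0])
    out = []
    for q in queries:
        # Similarity of this query to every query in the set, itself included.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in queries]
        weights = softmax(scores)
        out.append([sum(w * v[d] for w, v in zip(weights, queries)) for d in range(dim)])
    return out

history_queries = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
learnable_queries = [[0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0], [0.5, 0.5, 0.0, 0.0]]

# Historical and learnable queries attend to each other in one set;
# the result serves as the target queries of the current frame.
target_queries = self_attention(history_queries + learnable_queries)
```

Keeping the historical queries at fixed positions in the set (here the first two) preserves the index correspondence that lets a detection be traced back to the previous frame.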
[0118] The historical frame query already possesses relatively good positional information from the previous frame. In the current frame, it compensates for positional changes caused by vehicle motion by calculating the pose transformation matrix from the vehicle's pose in the previous frame to its pose in the current frame. Simultaneously, a learnable query is added as a supplement to detect newly appearing targets. This method allows targets detected by the historical frame query to be directly traced back to targets in the previous frame, thus achieving target tracking.
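The pose compensation described above can be sketched as follows, assuming each historical query carries a 3D reference point expressed in the previous ego frame and that ego poses are given as 4x4 ego-to-world matrices. Both assumptions, the planar (yaw-only) pose helper, and all function names are illustrative and not taken from the source.

```python
import math

def pose(yaw, tx, ty, tz=0.0):
    """4x4 ego-to-world pose for a planar rotation `yaw` and translation (tx, ty, tz)."""
    c, s = math.cos(yaw), math.sin(yaw)
    return [[c, -s, 0.0, tx],
            [s, c, 0.0, ty],
            [0.0, 0.0, 1.0, tz],
            [0.0, 0.0, 0.0, 1.0]]

def rigid_inverse(T):
    """Invert a rigid transform: rotation R^T, translation -R^T t."""
    R_t = [[T[j][i] for j in range(3)] for i in range(3)]
    t = [T[i][3] for i in range(3)]
    inv_t = [-sum(R_t[i][j] * t[j] for j in range(3)) for i in range(3)]
    return [R_t[i] + [inv_t[i]] for i in range(3)] + [[0.0, 0.0, 0.0, 1.0]]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(4)) for j in range(4)] for i in range(4)]

def transform(T, p):
    x, y, z = p
    return tuple(T[i][0] * x + T[i][1] * y + T[i][2] * z + T[i][3] for i in range(3))

prev_pose = pose(0.0, 0.0, 0.0)
cur_pose = pose(0.0, 2.0, 0.0)  # the vehicle has moved 2 m along its x axis

# Pose transformation matrix from the previous ego frame to the current ego frame.
prev_to_cur = matmul(rigid_inverse(cur_pose), prev_pose)

ref_prev = (10.0, 1.0, 0.0)              # reference point of a historical query
ref_cur = transform(prev_to_cur, ref_prev)  # same static point seen from the current frame
```

In this example a static target 10 m ahead in the previous frame lies about 8 m ahead after the vehicle advances 2 m. Motion of the target itself is not modeled here.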
[0119] To address the different characteristics of images and point clouds, this invention employs different fusion methods. Images have continuity but cannot provide sufficiently good depth information, while point clouds have accurate 3D positions but are sparse.
[0120] For the image processing branch, generating image query vectors relies on image depth prediction. However, current monocular image depth prediction performance is poor, and target occlusion issues can lead to inaccurate depth predictions, resulting in potentially inaccurate generated image query vectors. Therefore, this invention only extracts image features. The target query of the current frame is then projected onto the image, and a deformable attention mechanism is used to obtain the offset points and weights of the projected points on the image. Image features at the offset point positions are then obtained, and finally, the image features to be fused for each target query are obtained through weighted summation, as shown in Figure 3.
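The projection and deformable sampling in the image branch can be sketched as below. The sketch assumes a pinhole camera, a reference point already expressed in the camera frame, a single feature level, and two small linear layers that predict the offsets and weights from the query. The layer values are hand-picked for the example rather than trained, and all names are hypothetical.

```python
import math

def project(point_cam, fx, fy, cx, cy):
    """Pinhole projection of a camera-frame 3D point to pixel coordinates (u, v)."""
    x, y, z = point_cam
    return fx * x / z + cx, fy * y / z + cy

def bilinear(feat, u, v):
    """Bilinearly sample an H x W x C feature map at (u, v); outside pixels count as zero."""
    h, w, c = len(feat), len(feat[0]), len(feat[0][0])
    u0, v0 = math.floor(u), math.floor(v)
    out = [0.0] * c
    for dv in (0, 1):
        for du in (0, 1):
            uu, vv = u0 + du, v0 + dv
            if 0 <= uu < w and 0 <= vv < h:
                wgt = (1 - abs(u - uu)) * (1 - abs(v - vv))
                for ch in range(c):
                    out[ch] += wgt * feat[vv][uu][ch]
    return out

def linear(weight, bias, x):
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def deformable_sample(query, ref_uv, feat, offset_layer, weight_layer):
    """Predict K offsets and K weights from the query, sample, and take the weighted sum."""
    raw = linear(*offset_layer, query)               # 2K values: (du, dv) per sampling point
    weights = softmax(linear(*weight_layer, query))  # K weights that sum to 1
    fused = [0.0] * len(feat[0][0])
    for k, wk in enumerate(weights):
        sample = bilinear(feat, ref_uv[0] + raw[2 * k], ref_uv[1] + raw[2 * k + 1])
        for ch in range(len(fused)):
            fused[ch] += wk * sample[ch]
    return fused

# Toy 4 x 4 feature map with one channel: value = u + 10 * v.
feat = [[[float(u + 10 * v)] for u in range(4)] for v in range(4)]
query = [1.0, 0.0]
offset_layer = ([[0.5, 0.0], [0.0, 0.0], [-0.5, 0.0], [0.0, 0.0]], [0.0] * 4)  # K = 2
weight_layer = ([[0.0, 0.0], [0.0, 0.0]], [0.0, 0.0])                          # uniform weights

ref_uv = project((1.0, 0.0, 2.0), fx=1.0, fy=1.0, cx=1.0, cy=1.5)
image_feature = deformable_sample(query, ref_uv, feat, offset_layer, weight_layer)
```

Note that no depth map is predicted anywhere in this path: the depth used for projection comes from the query's own 3D reference point, and the learned offsets let the sampling locations move away from an imprecise projection.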
[0121] For the point cloud processing branch, since the point cloud location is relatively accurate, the point cloud query vector already has good positional information. However, due to the sparsity of the point cloud, directly extracting information from the point cloud features may yield some useless information, causing unnecessary waste of resources. Therefore, this invention uses a multi-head attention mechanism to calculate the similarity between the target query and the point cloud query, and obtains the point cloud features to be fused for each target query based on the similarity.
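The similarity-based gathering in the point cloud branch can be sketched as multi-head cross-attention from target queries to point cloud queries. For brevity the sketch omits the learned query/key/value and output projections of a full multi-head attention layer and simply splits channels into heads. The inputs are made-up values, and pc_features is assumed to have the same channel count as the queries.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def multi_head_cross_attention(target_queries, pc_queries, pc_features, num_heads):
    """Per head, weight point cloud features by target/point-cloud query similarity."""
    dim = len(target_queries[0])
    head_dim = dim // num_heads
    fused = []
    for q in target_queries:
        out = []
        for h in range(num_heads):
            sl = slice(h * head_dim, (h + 1) * head_dim)
            # Scaled dot-product similarity within this head's channel slice.
            scores = [sum(a * b for a, b in zip(q[sl], k[sl])) / math.sqrt(head_dim)
                      for k in pc_queries]
            weights = softmax(scores)
            out += [sum(w * f[sl][d] for w, f in zip(weights, pc_features))
                    for d in range(head_dim)]
        fused.append(out)
    return fused

target_queries = [[5.0, 0.0, 5.0, 0.0]]
pc_queries = [[5.0, 0.0, 5.0, 0.0], [0.0, 5.0, 0.0, 5.0]]
pc_features = [[1.0, 1.0, 1.0, 1.0], [3.0, 3.0, 3.0, 3.0]]

pc_feature_to_fuse = multi_head_cross_attention(
    target_queries, pc_queries, pc_features, num_heads=2)
```

Because the target query closely matches the first point cloud query, almost all attention weight goes to the first feature vector. Attention runs over the sparse set of point cloud queries rather than the dense point cloud feature map, which is where the saving described above comes from.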
[0122] Features obtained independently from images and point clouds are combined by concatenation, which has the following effects: 1) the features of both have equal status; 2) the two do not affect each other, and if one modality has a problem, the other modality can still be used normally.
[0123] The concatenated vector is input into a 3D detection head to obtain the 3D detection result.
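The concatenation, feedforward update, and detection head can be sketched as follows. Layer sizes, the two-layer ReLU feedforward network, the random weights standing in for trained parameters, and the seven-value box layout (x, y, z, length, width, height, yaw) are all assumptions made for illustration. The correction step S105 is omitted because the source does not specify its form.

```python
import random

def linear(weight, bias, x):
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]

def init_layer(out_dim, in_dim, rng):
    """Random weights standing in for trained parameters."""
    weight = [[rng.uniform(-0.5, 0.5) for _ in range(in_dim)] for _ in range(out_dim)]
    return weight, [0.0] * out_dim

def ffn(x, layer1, layer2):
    """Two-layer feedforward network with a ReLU in between."""
    hidden = [max(0.0, h) for h in linear(*layer1, x)]
    return linear(*layer2, hidden)

rng = random.Random(0)
image_feat = [0.2, -0.1, 0.4, 0.0]   # image feature to be fused for one target query
pc_feat = [1.0, 0.5, -0.3, 0.8]      # point cloud feature to be fused for the same query

# Concatenation: each modality keeps its own channels, image half first.
concat = image_feat + pc_feat

layer1 = init_layer(16, len(concat), rng)
layer2 = init_layer(4, 16, rng)
updated_query = ffn(concat, layer1, layer2)   # updated target query of the current frame

head = init_layer(7, 4, rng)
box = linear(*head, updated_query)            # assumed layout: x, y, z, l, w, h, yaw
```

Since each modality occupies a fixed slice of the concatenated vector, neither feature is rescaled or gated by the other before the feedforward network sees them.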
[0124] In other embodiments, the present invention discloses a multimodal fusion 3D target detection device, comprising:
[0125] The acquisition module is used to acquire image data and point cloud data of the current frame;
[0126] The fusion module is used to fuse the target query vector of historical frames and the learnable global query vector through a self-attention mechanism, and use it as the target query vector of the current frame.
[0127] The feature extraction module is used to extract features from the acquired image data and obtain the image features of the current frame;
[0128] The target query vector of the current frame is projected onto the image, and then the deformable attention mechanism is used to obtain the offset point and weight of the projection point on the image. The image features at the offset point position are obtained, and the image features to be fused are obtained by weighting.
[0129] Feature extraction is performed on the collected point cloud data to obtain the point cloud features of the current frame, and sparse point cloud query vectors are obtained from the point cloud features.
[0130] The similarity between the target query vector and the point cloud query vector of the current frame is calculated through a multi-head attention mechanism, and the point cloud features to be fused are obtained based on the similarity.
[0131] The concatenation module is used to concatenate the image features to be fused and the point cloud features to be fused, and input them into the feedforward neural network to obtain the updated target query vector of the current frame.
[0132] The detection module is used to perform detection based on the updated target query vector of the current frame and obtain the 3D detection result of the current frame.
[0133] Furthermore, the feature extraction module includes: a parallel and independently configured image processing branch and a point cloud processing branch;
[0134] The image processing branch includes:
[0135] An image feature extractor is used to extract features from acquired image data and obtain the image features of the current frame.
[0136] The image feature extractor projects the target query vector of the current frame onto the image, then uses a deformable attention mechanism to obtain the offset point and weight of the projection point on the image, obtains the image features at the offset point position, and obtains the image features to be fused by weighting.
[0137] The point cloud processing branch includes:
[0138] A point cloud feature extractor is used to extract features from the collected point cloud data and obtain the point cloud features of the current frame.
[0139] A point cloud query vector generator is used to obtain sparse point cloud query vectors from the point cloud features of the current frame.
[0140] The point cloud feature extractor is used to calculate the similarity between the target query vector of the current frame and the point cloud query vector of the current frame through a multi-head attention mechanism, and to obtain the point cloud features to be fused based on the similarity.
[0141] Furthermore, the learnable global query vector is obtained through a global query vector acquisition module, which includes:
[0142] The global multi-point coordinate preset unit is used to preset a set of global multi-point coordinates covering the three-dimensional scene space;
[0143] The learnable encoding layer construction unit is used to construct a learnable encoding layer whose parameters can be dynamically updated through neural network training.
[0144] The global query vector generation unit is used to input global multi-point coordinates into the learnable encoding layer and generate corresponding embedding vectors, which are the learnable global query vectors.
[0145] Furthermore, the 3D target detection device also includes:
[0146] The correction module is used to correct the updated target query vector of the current frame;
[0147] Furthermore, the detection module is used to perform detection based on the corrected target query vector of the current frame to obtain the 3D detection result of the current frame.
[0148] Furthermore, it should be noted that the multimodal fusion 3D target detection device provided in the above embodiments is illustrated only by the above division of functional modules when performing 3D target detection. In actual applications, the above functions can be assigned to different functional modules as needed; that is, the internal structure of the multimodal fusion 3D target detection device can be divided into different functional modules to complete all or part of the functions described above.
[0149] Furthermore, the embodiments of the multimodal fusion 3D target detection device and the multimodal fusion 3D target detection method provided in the above embodiments belong to the same concept, and their specific implementation process can be found in the method embodiments, which will not be repeated here.
[0150] In other embodiments, the present invention discloses a computing device comprising:
[0151] One or more processors;
[0152] Memory;
[0153] And one or more programs, wherein the one or more programs are stored in memory and configured to be executed by one or more processors, the one or more programs including instructions for performing any of the above-described multimodal fusion 3D target detection methods.
[0154] In other embodiments, the present invention discloses a storage medium storing one or more computer-readable programs, the programs including instructions adapted to be loaded from memory and executed to perform any of the above-described multimodal fusion 3D target detection methods.
[0155] This invention discloses a multimodal fusion method, apparatus, device, and storage medium for 3D target detection, which has the following beneficial effects:
[0156] First, detection performance has been improved.
[0157] 1) Reduce missed detections in point cloud blind spots.
[0158] The target query vector of the historical frame (carrying the target position information of the previous frame) is used to compensate for position changes caused by the vehicle's movement. Combined with the learnable global query vector, which detects newly appearing targets, this solves the blind-spot missed detections that arise from relying solely on point cloud query vectors.
[0159] The target detected by the target query vector in the historical frame can be directly traced back to the previous frame, realizing the integration of detection and tracking, and avoiding the complexity of repeated association of data from multiple frames.
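The fusion of historical-frame target query vectors with global query vectors can be illustrated with a minimal sketch. It assumes single-head scaled dot-product self-attention with identity projections; a trained model would use learned query, key, and value projections, and the toy vectors and function names here are hypothetical.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(queries):
    """Single-head scaled dot-product self-attention (identity projections assumed)."""
    dim = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in queries]
        weights = softmax(scores)
        out.append([sum(w * v[d] for w, v in zip(weights, queries)) for d in range(dim)])
    return out

def fuse_queries(history_queries, global_queries):
    """Attend jointly over historical-frame and global queries (sketch of step S2)."""
    return self_attention(history_queries + global_queries)

# Toy data: two targets carried over from the previous frame, three global queries.
history = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
global_q = [[0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0], [0.5, 0.5, 0.0, 0.0]]
current = fuse_queries(history, global_q)
print(len(current), len(current[0]))
```

Because every query attends to every other query, a historical query and a global query that describe the same target can exchange information before either branch processes them.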
[0160] 2) Adapt modal characteristics to improve feature utilization.
[0161] Image processing branch: Projection combined with a deformable attention mechanism dynamically captures the offsets and weights of the projection points. This avoids the inaccuracy of monocular depth prediction and occlusion problems, and exploits image continuity to extract effective features.
[0162] Point cloud processing branch: Based on the multi-head attention mechanism, the similarity between the target query vector and the point cloud query vector is calculated, focusing on the effective location information in the sparse point cloud, reducing invalid feature extraction, and improving computational efficiency.
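The similarity computation in the point cloud processing branch can be sketched as multi-head cross-attention from target query vectors to sparse point cloud query vectors. This is an illustration under assumptions, not the patented implementation: learned projections are omitted, each head simply takes a slice of the channels, and the toy vectors are hypothetical.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def multi_head_cross_attention(target_queries, point_queries, num_heads):
    """Each target query gathers point cloud features weighted by per-head similarity."""
    dim = len(target_queries[0])
    assert dim % num_heads == 0, "dim must be divisible by num_heads"
    head_dim = dim // num_heads
    fused = []
    for q in target_queries:
        row = []
        for h in range(num_heads):
            lo, hi = h * head_dim, (h + 1) * head_dim
            # Similarity between this head's slice of the query and of every point query.
            scores = [sum(a * b for a, b in zip(q[lo:hi], p[lo:hi])) / math.sqrt(head_dim)
                      for p in point_queries]
            weights = softmax(scores)
            row.extend(sum(w * p[d] for w, p in zip(weights, point_queries))
                       for d in range(lo, hi))
        fused.append(row)
    return fused

targets = [[4.0, 0.0, 0.0, 4.0], [0.0, 4.0, 4.0, 0.0]]
points = [[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0], [0.0, 0.0, 0.0, 0.0]]
point_features_to_fuse = multi_head_cross_attention(targets, points, num_heads=2)
```

In this toy input the first target query is most similar to the first point query, so its fused feature is dominated by that point's channels, while the all-zero point query contributes little.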
[0163] 3) Enhance robustness in occlusion scenarios.
[0164] Image feature fusion does not rely on depth prediction. It uses deformable attention to adaptively adjust the sampling position, which alleviates the depth prediction bias caused by occlusion and improves the feature accuracy in complex scenes.
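The sampling step of deformable attention can be sketched as follows. The sketch assumes that the target query carries a 3D reference point in the camera frame, that a pinhole camera model applies, and that the offsets and weights have already been predicted from the query by learned layers (here they are hard-coded, hypothetical values). All function names are illustrative.

```python
import math

def project_to_image(point_3d, fx, fy, cx, cy):
    """Pinhole projection of a camera-frame 3D point to pixel coordinates (u, v)."""
    x, y, z = point_3d
    return fx * x / z + cx, fy * y / z + cy

def bilinear_sample(feature_map, u, v):
    """Bilinearly interpolate an H x W x C feature map at (u, v); outside counts as zero."""
    height, width = len(feature_map), len(feature_map[0])
    channels = len(feature_map[0][0])
    x0, y0 = math.floor(u), math.floor(v)
    out = [0.0] * channels
    for yy, wy in ((y0, 1 - (v - y0)), (y0 + 1, v - y0)):
        for xx, wx in ((x0, 1 - (u - x0)), (x0 + 1, u - x0)):
            if 0 <= yy < height and 0 <= xx < width:
                for c in range(channels):
                    out[c] += wy * wx * feature_map[yy][xx][c]
    return out

def deformable_sample(feature_map, ref_uv, offsets, weights):
    """Weighted sum of features sampled at the reference point plus each offset."""
    channels = len(feature_map[0][0])
    fused = [0.0] * channels
    for (du, dv), w in zip(offsets, weights):
        sample = bilinear_sample(feature_map, ref_uv[0] + du, ref_uv[1] + dv)
        fused = [f + w * s for f, s in zip(fused, sample)]
    return fused

# Toy 4 x 4 feature map with one channel whose value equals the column index.
feature_map = [[[float(x)] for x in range(4)] for _ in range(4)]
ref_uv = project_to_image((0.0, 0.0, 10.0), fx=1.0, fy=1.0, cx=1.5, cy=1.5)
offsets = [(-0.5, 0.0), (0.5, 0.0), (1.0, 0.5)]  # hypothetical predicted offsets
weights = [0.25, 0.25, 0.5]                      # hypothetical normalized weights
image_feature_to_fuse = deformable_sample(feature_map, ref_uv, offsets, weights)
```

No depth map is predicted anywhere in this sketch: the only geometric input is the projected reference point, and the offsets let the sampling locations shift away from it.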
[0165] Second, efficiency and robustness optimization.
[0166] 1) Simplify the architecture and improve inference efficiency.
[0167] It eliminates the need for independent 2D / 3D detectors (as required by existing technologies like MV2DFusion, which requires pre-generated image query vectors and point cloud query vectors), and directly constructs target representations using target query vectors and global query vectors from historical frames, reducing model complexity and computation time.
[0168] The parallel dual-branch architecture (the image and point cloud branches are processed independently) combined with a single feature concatenation and fusion step avoids the redundant "two fusions" process of existing methods, reduces resource consumption, and is suitable for real-time autonomous driving scenarios.
[0169] 2) Modal independence enhances system robustness.
[0170] Image and point cloud features are concatenated and fused. The two modalities are treated as peers and do not depend on each other. When one modality fails, the other can still work independently, thus improving the system's fault tolerance.
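A minimal sketch of the single concatenation and feedforward update follows. The two-layer network, its sizes, and the random weights (standing in for trained parameters) are assumptions; representing a failed modality as a zero vector is likewise a hypothetical choice made only for illustration.

```python
import random

def linear(x, weight, bias):
    """Dense layer: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]

def relu(x):
    return [max(0.0, v) for v in x]

def make_layer(rng, in_dim, out_dim):
    """Random weights stand in for trained parameters in this sketch."""
    weight = [[rng.uniform(-0.5, 0.5) for _ in range(in_dim)] for _ in range(out_dim)]
    return weight, [0.0] * out_dim

def update_query(image_feat, point_feat, layer1, layer2):
    """Concatenate both modalities once, then apply a two-layer feedforward network."""
    fused_input = image_feat + point_feat  # list concatenation
    hidden = relu(linear(fused_input, *layer1))
    return linear(hidden, *layer2)

rng = random.Random(0)
dim = 4
layer1 = make_layer(rng, 2 * dim, 8)
layer2 = make_layer(rng, 8, dim)
image_feat = [0.2, -0.1, 0.4, 0.3]
point_feat = [0.5, 0.1, -0.2, 0.0]
updated_query = update_query(image_feat, point_feat, layer1, layer2)
# Hypothetical camera failure: zeroed image features still yield a query update.
lidar_only_query = update_query([0.0] * dim, point_feat, layer1, layer2)
```

Since neither branch consumes the other's output before this step, the concatenation is the only place where the two modalities meet.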
[0171] Third, the tracking and generalization capabilities are enhanced.
[0172] 1) Explicitly link historical information to simplify the tracking process.
[0173] The target query vector of the current frame directly contains the location information of the historical frames. The detection results can be traced through the target query vector of the historical frames, which naturally supports target tracking. There is no need to design an additional tracking module, thus achieving a seamless connection between detection and tracking.
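One way to picture how carried-over query vectors support tracking is to let each query hold an identifier that survives from frame to frame. The sketch below is an assumption-laden illustration: the `Query` class, the score threshold, and the ID counter are hypothetical and are not described in the source.

```python
from dataclasses import dataclass
from itertools import count
from typing import List, Optional

@dataclass
class Query:
    embedding: List[float]
    track_id: Optional[int] = None  # None until the query yields a kept detection

_next_id = count(1)

def carry_over(queries, scores, threshold=0.5):
    """Keep queries whose detection score passes the threshold; assign IDs to new ones."""
    kept = []
    for query, score in zip(queries, scores):
        if score < threshold:
            continue
        if query.track_id is None:
            query.track_id = next(_next_id)
        kept.append(query)
    return kept

frame1 = [Query([0.1, 0.2]), Query([0.3, 0.4]), Query([0.5, 0.6])]
history = carry_over(frame1, scores=[0.9, 0.2, 0.7])
# Frame 2: surviving queries come first, followed by a fresh global query.
frame2 = history + [Query([0.7, 0.8])]
history = carry_over(frame2, scores=[0.8, 0.6, 0.9])
print([q.track_id for q in history])  # [1, 2, 3]
```

A detection produced by a historical query inherits that query's identifier, so no separate association step is needed in this sketch.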
[0174] 2) Learnable global query vectors enhance the ability to detect new targets.
[0175] The system presets global multi-point coordinates covering the 3D scene and generates global query vectors through a learnable encoding layer. The global query vectors adapt dynamically to different scenes, enhance the detection of newly appearing targets (such as sudden obstacles), and improve the system's generalization ability.
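The generation of global query vectors can be sketched as a preset grid of 3D coordinates passed through an encoding layer. The grid extents, step counts, embedding size, and the use of a single linear layer with random weights are all assumptions for illustration; in training, the layer's parameters would be updated by backpropagation.

```python
import itertools
import random

def make_global_points(x_range, y_range, z_range, steps):
    """Preset an evenly spaced grid of 3D points covering the scene volume."""
    def axis(lo, hi, n):
        return [lo + (hi - lo) * i / (n - 1) for i in range(n)]
    return list(itertools.product(axis(*x_range, steps[0]),
                                  axis(*y_range, steps[1]),
                                  axis(*z_range, steps[2])))

class EncodingLayer:
    """Linear layer mapping (x, y, z) to an embedding; weights would be trained."""
    def __init__(self, embed_dim, seed=0):
        rng = random.Random(seed)
        self.weight = [[rng.uniform(-0.1, 0.1) for _ in range(3)] for _ in range(embed_dim)]
        self.bias = [0.0] * embed_dim

    def __call__(self, point):
        return [sum(w * p for w, p in zip(row, point)) + b
                for row, b in zip(self.weight, self.bias)]

# Hypothetical scene extents in meters: 100 m x 100 m footprint, 6 m height.
points = make_global_points((-50.0, 50.0), (-50.0, 50.0), (-3.0, 3.0), steps=(5, 5, 2))
encoder = EncodingLayer(embed_dim=8)
global_queries = [encoder(p) for p in points]
print(len(global_queries), len(global_queries[0]))  # 50 8
```

The coordinates themselves stay fixed; only the encoding layer is learnable, which matches the description of preset coordinates combined with a learnable encoding layer.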
[0176] In summary, this invention can effectively improve detection performance, optimize efficiency and robustness, and enhance tracking and generalization capabilities. It is particularly suitable for 3D target detection tasks in autonomous driving scenarios that require high real-time performance, accuracy, and environmental adaptability.
[0177] The foregoing has shown and described the basic principles, main features, and advantages of the present invention. Those skilled in the art should understand that the present invention is not limited to the above embodiments. The embodiments and descriptions in the specification are merely illustrative of the principles of the present invention. Various changes and modifications can be made to the present invention without departing from its spirit and scope. All such changes and modifications fall within the scope of the present invention as claimed, which is defined by the appended claims and their equivalents.
Claims
1. A multi-modal fusion 3D object detection method, characterized in that the method comprises the following steps: Step S1: collecting image data of a current frame and point cloud data of the current frame; Step S2: fusing a target query vector of a historical frame and a learnable global query vector through a self-attention mechanism as a target query vector of the current frame; Step S3: extracting features from the collected image data to obtain image features of the current frame; projecting the target query vector of the current frame onto the image, using a deformable attention mechanism to obtain offset points and weights of the projection points on the image, obtaining image features of the offset point positions, and obtaining image features to be fused through the weights; extracting features from the collected point cloud data to obtain point cloud features of the current frame, and obtaining a sparse point cloud query vector from the point cloud features; calculating the similarity between the target query vector of the current frame and the point cloud query vector of the current frame through a multi-head attention mechanism, and obtaining point cloud features to be fused according to the similarity; Step S4: concatenating the image features to be fused and the point cloud features to be fused, and inputting them into a feedforward neural network to obtain an updated target query vector of the current frame; Step S6: detecting based on the updated target query vector of the current frame to obtain a 3D detection result of the current frame.
2. The 3D object detection method of claim 1, wherein step S3 comprises: an image processing branch: extracting features from the collected image data to obtain image features of the current frame; projecting the target query vector of the current frame onto the image, using a deformable attention mechanism to obtain offset points and weights of the projection points on the image, obtaining image features of the offset point positions, and obtaining image features to be fused through the weights; a point cloud processing branch: extracting features from the collected point cloud data to obtain point cloud features of the current frame, and obtaining a sparse point cloud query vector from the point cloud features; calculating the similarity between the target query vector of the current frame and the point cloud query vector of the current frame through a multi-head attention mechanism, and obtaining point cloud features to be fused according to the similarity; wherein the image processing branch and the point cloud processing branch independently process the corresponding image data and point cloud data in parallel.
3. The 3D object detection method of claim 1, wherein the learnable global query vector is obtained through the following steps: Step A: presetting a set of global multi-point coordinates covering a three-dimensional scene space; Step B: constructing a learnable encoding layer, the parameters of which can be dynamically updated through neural network training; Step C: inputting the global multi-point coordinates into the learnable encoding layer to generate corresponding embedding vectors, the embedding vectors being the learnable global query vectors.
4. The 3D object detection method of claim 1, wherein the method further comprises: Step S5: correcting the updated target query vector of the current frame; and in step S6, detection is further performed based on the corrected target query vector of the current frame to obtain the 3D detection result of the current frame.
5. A multi-modal fusion 3D object detection apparatus, characterized in that the apparatus comprises: a collecting module for collecting image data of a current frame and point cloud data of the current frame; a fusion module for fusing a target query vector of a historical frame and a learnable global query vector through a self-attention mechanism as a target query vector of the current frame; a feature extraction module for extracting features from the collected image data to obtain image features of the current frame; projecting the target query vector of the current frame onto the image, using a deformable attention mechanism to obtain offset points and weights of the projection points on the image, obtaining image features of the offset point positions, and obtaining image features to be fused through the weights; extracting features from the collected point cloud data to obtain point cloud features of the current frame, and obtaining a sparse point cloud query vector from the point cloud features; calculating the similarity between the target query vector of the current frame and the point cloud query vector of the current frame through a multi-head attention mechanism, and obtaining point cloud features to be fused according to the similarity; a stitching module for concatenating the image features to be fused and the point cloud features to be fused, and inputting them into a feedforward neural network to obtain an updated target query vector of the current frame; and a detection module for detecting based on the updated target query vector of the current frame to obtain a 3D detection result of the current frame.
6. The 3D object detection apparatus according to claim 5, wherein the feature extraction module includes an image processing branch and a point cloud processing branch that are configured in parallel and independently of each other; the image processing branch includes: an image feature extractor, used to extract features from the collected image data to obtain the image features of the current frame, and to project the target query vector of the current frame onto the image, use a deformable attention mechanism to obtain the offset points and weights of the projection points on the image, obtain the image features at the offset point positions, and obtain the image features to be fused through the weights; the point cloud processing branch includes: a point cloud feature extractor, used to extract features from the collected point cloud data to obtain the point cloud features of the current frame; and a point cloud query vector generator, used to obtain sparse point cloud query vectors from the point cloud features of the current frame; the point cloud feature extractor is further used to calculate the similarity between the target query vector of the current frame and the point cloud query vector of the current frame through a multi-head attention mechanism, and to obtain the point cloud features to be fused based on the similarity.
7. The 3D object detection apparatus according to claim 5, wherein the learnable global query vector is obtained through a global query vector acquisition module, which includes: a global multi-point coordinate preset unit, used to preset a set of global multi-point coordinates covering the three-dimensional scene space; a learnable encoding layer construction unit, used to construct a learnable encoding layer whose parameters can be dynamically updated through neural network training; and a global query vector generation unit, used to input the global multi-point coordinates into the learnable encoding layer to generate the corresponding embedding vectors, which are the learnable global query vectors.
8. The 3D object detection apparatus according to claim 5, wherein the apparatus further includes: a correction module, used to correct the updated target query vector of the current frame; and the detection module is further configured to perform detection based on the corrected target query vector of the current frame to obtain the 3D detection result of the current frame.
9. A computing device, characterized by comprising: one or more processors; a memory; and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, and the one or more programs include instructions for performing the multimodal fusion 3D target detection method according to any one of claims 1-4.
10. A storage medium, characterized in that the storage medium stores one or more computer-readable programs, the programs including instructions adapted to be loaded into memory and executed to perform the multimodal fusion 3D target detection method according to any one of claims 1-4.