Automatic driving cooperative perception method and device based on multi-modal large model
By extracting text and image features from point cloud data using a multimodal large model and fusing them, the problem of poor semantic association representation ability of data fusion results between vehicles is solved, thereby improving the accuracy and robustness of collaborative perception in autonomous driving.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- INST OF AUTOMATION CHINESE ACAD OF SCI
- Filing Date
- 2024-06-14
- Publication Date
- 2026-06-16
AI Technical Summary
In existing technologies, the data fusion results between vehicles have poor ability to represent the semantic association between data from different sensors, resulting in insufficient perception capabilities for vehicle collaborative perception.
An autonomous driving collaborative perception method based on a multimodal large model is adopted. By processing the point cloud data of the master vehicle, text information and image features are extracted, and the Text Transformer and Swin Transformer modules are used for feature extraction and fusion. Combined with the information of the target vehicle, multi-terminal collaborative perception is achieved.
It improves the accuracy and robustness of collaborative perception among multiple terminal vehicles, enhances the representation capability of perception features, and alleviates bandwidth congestion and high latency issues in the communication process.
Figure CN118887632B_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the field of autonomous driving, and in particular to an autonomous driving cooperative perception method and apparatus based on a multimodal large model.
Background Technology
[0002] With the rapid development of technology, intelligent connected vehicles have become an important part of modern transportation. Intelligent connected vehicles have functions such as complex environment perception, intelligent decision-making, and collaborative control. By equipping themselves with advanced on-board sensors, controllers, actuators and other devices, and integrating modern communication and network technologies, they can realize intelligent information exchange and sharing between vehicles, roads, people and the cloud. Among these, collaborative perception is the key to realizing the functions of intelligent connected vehicles, and data fusion is an important part of realizing collaborative perception.
[0003] Collaborative perception technology refers to achieving a more comprehensive and accurate perception of the surrounding environment through information sharing and cooperation among vehicles. This technology can compensate for the limitations of a single vehicle's perception capabilities, improving the safety and efficiency of the overall transportation system. Data fusion, on the other hand, involves effectively integrating and processing information from different sensors and vehicles to improve the reliability and accuracy of perception results.
[0004] In related technologies, vehicle-to-vehicle (V2V) communication and vehicle-to-infrastructure (V2I) communication are commonly used to obtain information about surrounding vehicles and roads in real time. However, delays and packet loss in information transmission between vehicles degrade the real-time performance and accuracy of vehicle perception results. Moreover, during collaborative perception, the information from different vehicles may differ or conflict, and the data fusion results obtained by traditional multimodal data fusion methods represent the semantic associations between data from different sensors poorly, resulting in poor perception capabilities when the fusion results are used for collaborative perception tasks.
Summary of the Invention
[0005] This invention provides an autonomous driving cooperative perception method and device based on a multimodal large model, which solves the problem that the data fusion results obtained by the prior art have poor ability to represent the semantic association between data from different sensors, resulting in poor perception ability of vehicle cooperative perception, and improves the accuracy and robustness of cooperative perception among multiple terminal vehicles.
[0006] This invention provides an autonomous driving cooperative perception method based on a multimodal large model, applied to a master vehicle on which a multimodal large model is deployed, the method including:
[0007] The point cloud data of the main vehicle is processed by the multimodal large model to obtain text information; text features are extracted from the text information, image features are extracted from the image data of the main vehicle, and depth map features are extracted from the depth map corresponding to the point cloud data.
[0008] The depth map features and the image features are fused according to the text features to obtain a first fused feature; the first fused feature and the features of the object to be detected sent by the target end are fused to obtain a second fused feature; wherein, the target end includes at least one of the cooperating end of the master vehicle and the road end, and the features of the object to be detected are determined based on the point cloud data and image data of the master vehicle collected by the target end;
[0009] Perform multi-terminal collaborative perception visual tasks based on the second fusion feature.
[0010] According to the autonomous driving cooperative perception method based on a multimodal large model provided by the present invention, the master vehicle is also equipped with a Text Transformer module and two Swin Transformer modules.
[0011] The steps of extracting text features from the text information, extracting image features from the image data of the main vehicle, and extracting depth map features from the depth map corresponding to the point cloud data include:
[0012] The text features are extracted from the text information using the Text Transformer module; the image features are extracted from the image data using one Swin Transformer module; and the depth map features are extracted from the depth map using the other Swin Transformer module.
[0013] According to the present invention, an autonomous driving cooperative perception method based on a multimodal large model is provided, wherein the target end is equipped with the multimodal large model;
[0014] The features of the object to be detected are obtained through the following steps:
[0015] Send a text request to the target terminal based on V2X technology;
[0016] The system receives the features of the object to be detected, its location coordinates, and its timestamp sent by the target terminal; the location coordinates and timestamp are used to align the data between the master vehicle and the target terminal.
[0017] According to the autonomous driving cooperative perception method based on a multimodal large model provided by the present invention, the depth map features are obtained through the following steps:
[0018] The point cloud data is processed sequentially by voxelization, densification, smoothing, compression, and projection to obtain the depth map features.
[0019] According to the present invention, an autonomous driving cooperative perception method based on a multimodal large model is provided, wherein fusing the depth map features and the image features according to the text features to obtain a first fused feature includes:
[0020] The first fusion feature is calculated using the following formula:
[0021]
[0022] where the result of the formula is the first fused feature, $I_D$ denotes the depth map features, and $I_v$ denotes the image features; $\mu(I_D)$ is the weight corresponding to the depth map features, $\gamma(I_v)$ is the bias term corresponding to the image features, and the remaining term denotes the text features. $\mu(I_D)$ and $\gamma(I_v)$ are generated by a multilayer perceptron (MLP) structure.
[0023] According to the present invention, an autonomous driving cooperative perception method based on a multimodal large model is provided. After processing the point cloud data of the master vehicle through the multimodal large model to obtain text information, the method further includes: inputting the point cloud data into a PointCLIP V2 model to obtain the label of the object to be detected in the point cloud data. The label represents at least one of the position, range, and category of the object to be detected.
[0024] The present invention also provides an autonomous driving cooperative perception device based on a multimodal large model, comprising:
[0025] The feature extraction module is used to process the point cloud data of the master vehicle using a multimodal large model deployed on the master vehicle to obtain text information; extract text features from the text information; extract image features from the image data of the master vehicle; and extract depth map features from the depth map corresponding to the point cloud data.
[0026] The feature fusion module is used to fuse the depth map features and the image features according to the text features to obtain a first fused feature; and to fuse the first fused feature and the features of the object to be detected sent by the target end to obtain a second fused feature; wherein, the target end includes at least one of the cooperating end of the master vehicle and the road end, and the features of the object to be detected are determined based on the point cloud data and image data of the master vehicle collected by the target end;
[0027] The execution module performs a multi-terminal collaborative perception visual task based on the second fused feature.
[0028] The present invention also provides an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the program to implement the autonomous driving cooperative perception method based on a multimodal large model as described above.
[0029] The present invention also provides a non-transitory computer-readable storage medium having a computer program stored thereon, which, when executed by a processor, implements the autonomous driving cooperative perception method based on a multimodal large model as described above.
[0030] The present invention also provides a computer program product, including a computer program that, when executed by a processor, implements the autonomous driving cooperative perception method based on a multimodal large model as described above.
[0031] The autonomous driving cooperative perception method and device based on a multimodal large model provided by this invention extract text information from the point cloud data of the master vehicle through a multimodal large model, extract text features from the text information, extract image features from the image data, and extract depth map features from the depth map corresponding to the point cloud data. Then, the depth map features and image features are fused according to the text features to obtain the first fused feature. The first fused feature is fused with the features of the object to be detected sent by the target end to obtain the second fused feature, which improves the representation ability of the perception features. Finally, the multi-terminal cooperative perception vision task is performed based on the second fused feature, which improves the accuracy and robustness of cooperative perception among multiple terminal vehicles.
Attached Figure Description
[0032] To more clearly illustrate the technical solutions in this invention or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are some embodiments of this invention. For those skilled in the art, other drawings can be obtained from these drawings without creative effort.
[0033] Figure 1 is one of the flowcharts of the autonomous driving cooperative perception method based on a multimodal large model provided by the present invention.
[0034] Figure 2 is a schematic diagram of the process of fusing depth map features and image features under the guidance of text features provided by the present invention.
[0035] Figure 3 is a schematic diagram of the operation of the visual encoder provided by the present invention.
[0036] Figure 4 is a schematic diagram of the process of converting point cloud data into a depth map provided by the present invention.
[0037] Figure 5 is a schematic diagram of the operation of the fusion unit provided by the present invention.
[0038] Figure 6 is a schematic diagram of vehicle-to-vehicle and vehicle-to-infrastructure cooperative perception for autonomous driving provided by the present invention.
[0039] Figure 7 is the second flowchart of the autonomous driving cooperative perception method based on a multimodal large model provided by the present invention.
[0040] Figure 8 is a schematic diagram of the structure of the autonomous driving cooperative perception device based on a multimodal large model provided by the present invention.
[0041] Figure 9 is a schematic diagram of the structure of the electronic device provided by the present invention.
Detailed Implementation
[0042] To make the objectives, technical solutions, and advantages of this invention clearer, the technical solutions of this invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of this invention. All other embodiments obtained by those skilled in the art based on the embodiments of this invention without creative effort are within the scope of protection of this invention.
[0043] The autonomous driving cooperative perception method and apparatus based on a multimodal large model provided by this invention are described below with reference to Figures 1-8.
[0044] Figure 1 is one of the flowcharts of the autonomous driving cooperative perception method based on a multimodal large model provided by the present invention. As shown in Figure 1, the method is applied to a master vehicle on which a multimodal large model is deployed, and includes the following steps:
[0045] Step 110: Process the point cloud data of the main vehicle using a multimodal large model to obtain text information; extract text features from the text information, extract image features from the image data of the main vehicle, and extract depth map features from the depth map corresponding to the point cloud data.
[0046] In this step, the multimodal large model deployed on the master vehicle includes a PointCLIP V2 network. The master vehicle is also equipped with cameras to collect RGB image data of the surrounding environment, which includes other vehicles, traffic equipment, pedestrians, or buildings at a target distance from the master vehicle, as well as weather information, pedestrian flow information, or ground road condition information.
[0047] In this embodiment, the main vehicle is also equipped with a lidar for collecting point cloud data of the surrounding environment.
[0048] In this embodiment, the point cloud data includes objects, and text information is generated using the point cloud data. The text information includes description information and corresponding tags of the objects. For example, the text information corresponding to the traffic light in the point cloud data is "The traffic light is red, with 30 seconds remaining".
[0049] In this embodiment, corresponding text features and image features can be extracted from text information and image data by a feature encoder. For example, the text information is encoded by a text feature encoder to obtain corresponding text features, the image data is visually encoded by a visual encoder 1 to obtain corresponding image features, and depth map features are extracted from the depth map corresponding to the point cloud data by a visual encoder 2.
[0050] In this embodiment, the depth map can be obtained by performing one or more operations on the point cloud data, including voxelization, densification, smoothing, compression, and projection.
[0051] In this embodiment, after processing the point cloud data of the master vehicle using a multimodal large model to obtain text information, the method further includes: inputting the point cloud data into the PointCLIP V2 model to obtain the label of the object to be detected in the point cloud data, wherein the label represents at least one of the position, range, and category of the object to be detected.
[0052] Step 120: Fuse the depth map features and image features according to the text features to obtain the first fused feature; fuse the first fused feature and the features of the object to be detected sent by the target end to obtain the second fused feature; wherein, the target end includes at least one of the cooperating end of the master vehicle and the road end, and the features of the object to be detected are determined based on the point cloud data and image data of the master vehicle collected by the target end.
[0053] In this step, the target end can be either the coordinating end or the road end, or it can include both the coordinating end and the road end.
[0054] In this embodiment, the target end can send the features of the object to be detected to the master vehicle through vehicle-to-everything (V2X) wireless communication technology.
[0055] In this embodiment, the object to be detected may include the main vehicle information and the surrounding information of the main vehicle; the vehicle information includes the vehicle's position, passenger information, and driving status information (whether it is driving and whether it is driverless, etc.).
[0056] In this embodiment, a multimodal large model, a camera, and a lidar are also deployed on the target vehicle. The camera on the target vehicle is used to collect vehicle information of the master vehicle and RGB image data of its surrounding environment. The surrounding environment includes other vehicles, traffic equipment, pedestrians, or buildings that are at a target distance from the master vehicle, as well as weather information, pedestrian flow information, or ground road condition information.
[0057] In this embodiment, the main vehicle is also equipped with a lidar to collect point cloud data of the surrounding environment; the point cloud data is processed using a multimodal large model on the target vehicle to obtain the corresponding text information.
[0058] In this embodiment, the target vehicle extracts image features from the captured RGB image data, extracts depth map features from the depth map corresponding to the point cloud data, extracts text features from the text information, and then uses the text features to guide the fusion of image features and depth map features to obtain the corresponding features of the object to be detected.
[0059] In this embodiment, the master vehicle guides the fusion of point cloud and RGB image through text features to obtain a first fusion feature, which improves the limitations of single sensor data perception and enhances the semantic association between point cloud data and RGB image data, thus better promoting comprehensive perception in collaborative perception. After receiving the features of the object to be detected sent by the target vehicle, the master vehicle fuses the first fusion feature with the features of the object to be detected to obtain a multi-terminal fusion feature, namely the corresponding second fusion feature. The second fusion feature contains the image information collected by the master vehicle and the target vehicle respectively, and has a stronger semantic representation capability.
[0060] Step 130: Perform a multi-terminal collaborative perception visual task based on the second fusion feature.
[0061] In this step, after the master vehicle obtains the collaborative features from other terminals, it uses a feature stitching method to fuse the features of different terminals. The features of multiple terminals are simultaneously placed into a one-dimensional feature channel and added directly. The stitched features are then input into the detection head to complete the collaborative perception vision task for autonomous driving of intelligent vehicles.
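The feature stitching step can be illustrated with a minimal sketch. It assumes that "stitching" means concatenating spatially aligned per-terminal feature maps along the channel axis; the source does not fix the exact operation, and the function name and the list-of-lists layout are hypothetical.

```python
def stitch_features(feature_maps):
    """Concatenate per-terminal feature maps along the channel axis.

    feature_maps: one entry per terminal, each a list of per-position
    channel vectors. All maps must cover the same positions (assumption:
    they were aligned beforehand using coordinates and timestamps).
    """
    n_positions = len(feature_maps[0])
    if any(len(fm) != n_positions for fm in feature_maps):
        raise ValueError("feature maps must be spatially aligned before stitching")
    # For each position, append the channels of every terminal in order.
    return [
        [value for fm in feature_maps for value in fm[p]]
        for p in range(n_positions)
    ]
```

The stitched map would then be passed to whichever detection head the visual task requires.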
[0062] In this embodiment, the aforementioned visual task includes one or more of object detection, image segmentation, and image classification.
[0063] Correspondingly, the aforementioned detection head includes one or more of the following: object detection model, image segmentation model, and image classification model.
[0064] The autonomous driving cooperative perception method based on a multimodal large model provided in this invention extracts text information from the point cloud data of the master vehicle using a multimodal large model, extracts text features from the text information, extracts image features from the image data, and extracts depth map features from the depth map corresponding to the point cloud data. Then, the depth map features and image features are fused according to the text features. The first fused feature is fused with the features of the object to be detected sent by the target end to obtain a second fused feature, which improves the representation ability of the perception features. Finally, a multi-terminal cooperative perception vision task is performed based on the second fused feature, which improves the accuracy and robustness of cooperative perception among multiple terminal vehicles.
[0065] In some embodiments, the master vehicle is also deployed with a Text Transformer module and two Swin Transformer modules; extracting text features from text information, extracting image features from image data of the master vehicle, and extracting depth map features from the depth map corresponding to the point cloud data includes: extracting text features from text information using the Text Transformer module; extracting image features from image data using one Swin Transformer module; and extracting depth map features from the depth map using the other Swin Transformer module.
[0066] Figure 2 is a schematic diagram of the process for fusing depth map features and image features under the guidance of text features provided by the present invention. In the embodiment shown in Figure 2, the feature extraction network consists of three encoders: two visual encoders and one text encoder. For the text encoder, we use a model to obtain linguistic features $\psi_{text}$. Simultaneously, the visual features $\psi_D$ and $\psi_v$ are extracted using two visual encoders based on Swin Transformer Blocks. Then, a fusion unit is inserted into the data processing chain to integrate the extracted features, and the first fused feature is output after passing through the fusion unit.
[0067] Figure 3 is a schematic diagram of the operation of the visual encoder provided by the present invention. In the embodiment shown in Figure 3, the vision encoder is composed of D stacked Swin Transformer Blocks connected in sequence, where D represents the depth. Each block contains specific components to perform specific functions. The feature extraction process is as follows: First, the depth map and RGB image are input into the convolutional layer (Conv) to extract features. Through convolution operations, the model can learn and recognize specific patterns or structures in the image. Then, the corresponding visual encoding features $\psi$ are extracted through multiple Swin Transformer Blocks.
[0068] In this embodiment, LayerNorm normalizes the features of each sample across the feature dimension to zero mean and unit variance, which helps improve the training effect and generalization ability of the model. MSA (multi-head self-attention) focuses on important regions or features when processing images, which is useful for capturing key information, especially in complex or noisy images. The MLP (multilayer perceptron) further processes and transforms features during encoding and can extract higher-level features by learning the complex relationship between input and output. The Swin Transformer Block captures local and global contextual information by dividing the image into small patches (similar to the receptive field in a convolutional neural network) and applying a self-attention mechanism on these patches. The Swin Transformer Block also introduces window partitioning and shifted-window operations to learn features at different scales and locations.
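Two of the operations named in this paragraph can be sketched in plain Python: layer normalization of one feature vector, and the window partition with a cyclic shift used by shifted-window attention. This is a simplified illustration only. The learnable scale and offset of LayerNorm and the attention masking applied across wrapped borders are omitted, and all function names are hypothetical.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize one sample's feature vector to zero mean and unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def cyclic_shift(grid, shift):
    """Roll a 2D grid up and left by `shift` cells, wrapping at the borders."""
    h, w = len(grid), len(grid[0])
    return [[grid[(i + shift) % h][(j + shift) % w] for j in range(w)]
            for i in range(h)]

def window_partition(grid, window):
    """Split a 2D grid into non-overlapping window x window groups.

    Assumes the grid height and width are multiples of `window`.
    """
    h, w = len(grid), len(grid[0])
    return [
        [grid[i + a][j + b] for a in range(window) for b in range(window)]
        for i in range(0, h, window)
        for j in range(0, w, window)
    ]
```

Self-attention would be computed inside each window. Alternating between the plain partition and the shifted partition lets information cross window boundaries from one block to the next.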
[0069] The autonomous driving cooperative perception method based on a multimodal large model provided in this invention extracts text features from text information using a Text Transformer module, extracts image features from image data using one Swin Transformer module, and extracts depth map features from depth maps using another Swin Transformer module. This method can efficiently acquire text features, image features, and depth map features, and provides reliable data support for subsequent feature fusion.
[0070] In some embodiments, a multimodal large model is deployed on the target end; the features of the object to be detected are obtained through the following steps: sending a text request to the target end based on V2X technology; receiving the features of the object to be detected, the location coordinates and the timestamp sent by the target end; the location coordinates and the timestamp are used to align the data between the master vehicle and the target end.
[0071] In this embodiment, the multimodal large model deployed on the target end includes the PointCLIP V2 network.
[0072] In this embodiment, the master vehicle sends a text request to the cooperating vehicle and roadside units via V2X communication technology. The text requests the cooperating vehicle and roadside units to send information about the target features to be detected to the master vehicle. The cooperating vehicle and roadside units then transmit the target features they have collected about the master vehicle to the master vehicle via V2X communication, and also transmit the location coordinates and timestamps for data alignment.
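The exchange described above can be sketched as two message types plus an alignment helper. The field names, the shared map frame, the units, and the `max_skew` tolerance are all assumptions made for illustration; the source only states that features, location coordinates, and a timestamp are transmitted and used for data alignment.

```python
from dataclasses import dataclass

@dataclass
class FocusRequest:
    """Text request sent by the master vehicle (hypothetical layout)."""
    sender_id: str
    text: str

@dataclass
class FeaturePacket:
    """Reply from a cooperating vehicle or roadside unit (hypothetical layout)."""
    sender_id: str
    features: list
    position: tuple   # (x, y) in a shared map frame, metres (assumed)
    timestamp: float  # seconds on a common clock (assumed)

def align_packets(packets, ego_position, ego_time, max_skew=0.1):
    """Drop stale packets and express each sender's position relative to the ego."""
    aligned = []
    for p in packets:
        if abs(p.timestamp - ego_time) > max_skew:
            continue  # too far from the ego timestamp to be fused safely
        offset = (p.position[0] - ego_position[0],
                  p.position[1] - ego_position[1])
        aligned.append((p.sender_id, offset, p.features))
    return aligned
```

A real system would additionally need coordinate rotation into the ego frame and a defined V2X message format, both of which are outside this sketch.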
[0073] In the autonomous driving cooperative perception method based on a multimodal large model provided in this invention, when multiple terminals share information via V2X communication technology, the master vehicle transmits its "focus" requirements as text information, and the cooperating vehicle and road terminals transmit regional features about the "focus" to the master vehicle, which alleviates bandwidth congestion and high latency during communication.
[0074] In some embodiments, the depth map features are obtained by performing voxelization, densification, smoothing, compression, and projection on the point cloud data in sequence.
[0075] In this embodiment, the voxelization process specifically includes:
[0076] (1) For each of the M views to be projected, create a zero-initialized 3D grid, denoted as $G \in \mathbb{R}^{H \times W \times D}$, where H, W, and D denote the spatial resolution of the 3D grid, with D being the depth dimension perpendicular to the view plane.
[0077] (2) Normalize the 3D coordinates of the input point cloud to [0, 1], and project each point P = (x, y, z) onto a voxel in the grid using the following formula:
[0078] $G([sHx], [sWy], [Dz]) = z$;
[0079] where $s \in (0, 1)$ is the scaling factor used to adjust the size of the projected shape. It should be noted that when multiple points are projected onto the same voxel, the minimum depth value is assigned.
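The voxelization step can be sketched as follows. Two assumptions are made: the brackets in the projection formula are read as rounding up to an integer index (then shifted to 0-based and clamped), and a stored value of 0.0 marks an empty voxel, so a point at exactly z = 0 would be indistinguishable from an empty cell. The function name is hypothetical.

```python
import math

def voxelize(points, H, W, D, s):
    """Project normalized points (x, y, z in [0, 1]) into an H x W x D depth grid."""
    # Zero-initialized grid; 0.0 means "no point landed here" in this sketch.
    grid = [[[0.0] * D for _ in range(W)] for _ in range(H)]
    for x, y, z in points:
        # Round up, convert to a 0-based index, and clamp to the grid bounds.
        i = min(max(math.ceil(s * H * x) - 1, 0), H - 1)
        j = min(max(math.ceil(s * W * y) - 1, 0), W - 1)
        k = min(max(math.ceil(D * z) - 1, 0), D - 1)
        # Keep the minimum depth when several points share a voxel.
        if grid[i][j][k] == 0.0 or z < grid[i][j][k]:
            grid[i][j][k] = z
    return grid
```

With s below 1 the projected shape occupies only part of the H x W plane, which is how the scaling factor adjusts its apparent size.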
[0080] In this embodiment, the densification process specifically includes: densifying the grid using a local minimum pooling operation, in which each voxel in G is reassigned the minimum voxel value within its local spatial window; compared with mean and max pooling, this retains the minimum depth value.
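A minimal sketch of local minimum pooling over G follows. It assumes that empty voxels hold 0.0 and are skipped when taking the minimum; otherwise the zero initialization would dominate every window. The window size and function name are illustrative.

```python
def densify_min_pool(grid, window=3):
    """Replace each voxel by the smallest non-empty value in its local window."""
    H, W, D = len(grid), len(grid[0]), len(grid[0][0])
    r = window // 2
    out = [[[0.0] * D for _ in range(W)] for _ in range(H)]
    for i in range(H):
        for j in range(W):
            for k in range(D):
                neighbours = [
                    grid[a][b][c]
                    for a in range(max(0, i - r), min(H, i + r + 1))
                    for b in range(max(0, j - r), min(W, j + r + 1))
                    for c in range(max(0, k - r), min(D, k + r + 1))
                    if grid[a][b][c] > 0.0  # skip empty voxels (assumption)
                ]
                out[i][j][k] = min(neighbours) if neighbours else 0.0
    return out
```

Each occupied voxel spreads its depth to its neighbours, and where two surfaces compete the nearer one (smaller depth) wins.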
[0081] In this embodiment, the smoothing process specifically includes: applying a non-parametric Gaussian kernel to the 3D grid for shape smoothing and noise filtering.
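The smoothing step can be sketched as a fixed (not learned) Gaussian filter applied along each of the three axes in turn. The kernel radius, sigma, and zero padding at the borders are assumptions, and the function names are hypothetical.

```python
import math

def gaussian_kernel_1d(radius, sigma):
    """Return normalized 1D Gaussian weights of length 2 * radius + 1."""
    weights = [math.exp(-(t * t) / (2 * sigma * sigma))
               for t in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def smooth_axis(grid, kernel, axis):
    """Convolve a 3D grid with a 1D kernel along one axis (zero padding)."""
    H, W, D = len(grid), len(grid[0]), len(grid[0][0])
    size = (H, W, D)
    r = len(kernel) // 2
    out = [[[0.0] * D for _ in range(W)] for _ in range(H)]
    for i in range(H):
        for j in range(W):
            for k in range(D):
                acc = 0.0
                for t, w in enumerate(kernel):
                    idx = [i, j, k]
                    idx[axis] += t - r
                    if 0 <= idx[axis] < size[axis]:
                        acc += w * grid[idx[0]][idx[1]][idx[2]]
                out[i][j][k] = acc
    return out

def gaussian_smooth(grid, radius=1, sigma=1.0):
    """Apply the separable Gaussian filter along all three axes."""
    kernel = gaussian_kernel_1d(radius, sigma)
    for axis in range(3):
        grid = smooth_axis(grid, kernel, axis)
    return grid
```

Because the kernel has no trainable parameters, the step adds no weights to the model.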
[0082] In this embodiment, the compression process specifically includes: squeezing the depth dimension of G by taking the minimum value along the depth dimension at each pixel position as the value of that pixel, and replicating the result multiple times (e.g., three times) as the RGB intensity.
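A minimal sketch of the depth squeeze follows, assuming that empty voxels hold 0.0 and are ignored when taking the per-pixel minimum, and that the single-channel result is copied into three identical channels. The function name is hypothetical.

```python
def squeeze_depth(grid, channels=3):
    """Collapse an H x W x D grid to an H x W x channels depth image."""
    image = []
    for row in grid:
        image_row = []
        for depth_column in row:
            filled = [v for v in depth_column if v > 0.0]  # ignore empty voxels
            value = min(filled) if filled else 0.0
            # Repeat the depth value so the result looks like an RGB image.
            image_row.append([value] * channels)
        image.append(image_row)
    return image
```

The three-channel layout lets the depth map be fed to a visual encoder designed for RGB inputs without changing its first layer.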
[0083] Figure 4 is a schematic diagram of the process for converting point cloud data into depth maps provided by the present invention. In the embodiment shown in Figure 4, point cloud data is sequentially processed through voxelization, densification, smoothing, and compression (dimensionality compression) to synthesize a realistic depth map. In this process, the irregular point cloud is first voxelized to generate a sparse 3D grid. Then, pooling is used to densify the grid, followed by Gaussian smoothing to eliminate noise. Finally, compression is performed along the depth dimension to obtain a 2D image from the 3D grid. This provides reliable data support for subsequent feature extraction by a visual encoder based on Swin Transformer Blocks, while also improving visual encoding efficiency.
[0084] The autonomous driving collaborative perception method based on a multimodal large model provided in this invention obtains depth map features by sequentially processing point cloud data through voxelization, densification, smoothing, compression, and projection. This provides reliable data for subsequent fusion of depth map features and image features, and enhances the semantic association between point cloud and RGB image data.
[0085] In some embodiments, fusing depth map features and image features based on text features to obtain a first fused feature includes: calculating the first fused feature using the following formula:
[0086]
[0087] where the result of the formula is the first fused feature, $I_D$ denotes the depth map features, and $I_v$ denotes the image features; $\mu(I_D)$ is the weight corresponding to the depth map features, $\gamma(I_v)$ is the bias term corresponding to the image features, and the remaining term denotes the text features. $\mu(I_D)$ and $\gamma(I_v)$ are generated by a multilayer perceptron (MLP) structure; the purpose of using the MLP is to expand the channels of the original image features to match the channels of the text features.
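The formula itself is not reproduced in this text, so the sketch below shows only one plausible reading of the definitions: the text features are scaled elementwise by the weight μ(I_D) and shifted by the bias γ(I_v). This affine form, the two-layer ReLU MLP, and every name and parameter below are assumptions rather than the patented formula.

```python
def mlp(x, w1, b1, w2, b2):
    """Two-layer perceptron with ReLU, operating on a plain list of floats."""
    hidden = [max(0.0, sum(wi * xi for wi, xi in zip(row, x)) + b)
              for row, b in zip(w1, b1)]
    return [sum(wi * hi for wi, hi in zip(row, hidden)) + b
            for row, b in zip(w2, b2)]

def fuse(depth_feat, image_feat, text_feat, mu_params, gamma_params):
    """Assumed fusion: fused = mu(I_D) * text + gamma(I_v), elementwise."""
    mu = mlp(depth_feat, *mu_params)        # weight generated from depth features
    gamma = mlp(image_feat, *gamma_params)  # bias generated from image features
    return [m * t + g for m, t, g in zip(mu, text_feat, gamma)]
```

The output layers of both MLPs must have as many units as the text feature has channels, which matches the stated purpose of using the MLP for channel matching.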
[0088] Figure 5 is a schematic diagram of the operation of the fusion unit provided by the present invention. In the embodiment shown in Figure 5, the depth map features $\psi_D$ undergo a nonlinear MLP transformation that converts two-dimensional features into one-dimensional features, which are then mapped onto a feature space. Depth map features serve as the dominant element in the fusion process. The text features $\psi_{text}$ are expanded along the spatial dimension through a spatial duplication operation, yielding new text features $\psi'_{text}$ that maintain the same resolution as the visual features. The RGB image features $\psi_v$, after an MLP nonlinear transformation, are mapped to the feature space and serve as a supplement for reconstructing high-quality images during the fusion process. Finally, the depth map features $\psi_D$ and image features $\psi_v$ are fused, and the first fused feature $\psi_f$ is output.
[0089] The autonomous driving cooperative perception method based on multimodal large model provided in this invention enhances the channels of the original image features by MLP to match the channels of the text features, and expands the spatial dimension of the text feature vector by copying, so that the spatial dimension is consistent with the resolution of the visual features, thereby improving the reliability of the first fused feature.
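The spatial duplication of the text feature vector can be sketched as a simple tiling: the same C-channel vector is copied to every position of an H x W grid so that it matches the resolution of the visual features. The function name and the nested-list layout are illustrative assumptions.

```python
def tile_text_feature(text_vec, height, width):
    """Copy a C-channel text vector to every position of a height x width grid."""
    # list(text_vec) makes an independent copy for each spatial position.
    return [[list(text_vec) for _ in range(width)] for _ in range(height)]
```

After tiling, the text feature can be combined position by position with the depth map and image features.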
[0090] Figure 6 is a schematic diagram of vehicle-to-vehicle and vehicle-to-infrastructure cooperative perception for autonomous driving provided by the present invention. In the embodiment shown in Figure 6, the target end includes both cooperating vehicles and roadside equipment. Communication technology enables information exchange between vehicles and between vehicles and roadside equipment. Each terminal converts the collected point cloud data into a depth map through voxelization, densification, smoothing, and compression. The depth map and the RGB image are then described using a multimodal large model, TinyLlama-1.1B, with the highlighted objects in the image set as the "focus". A text encoder based on the Text Transformer model extracts the text features, and a visual encoder based on two Swin Transformer blocks extracts the depth map features and the RGB image features. The text features, depth map features, and RGB image features are input into the fusion unit, where the text guides the fusion of the depth map features and the RGB image features. The master vehicle sends a text request to the cooperating vehicles and roadside units via V2X technology, asking them to send features about the "focus" to the master vehicle. The cooperating vehicles and roadside units transmit the "focus" features they have collected about the master vehicle to the master vehicle, together with positioning coordinates and timestamps for data alignment. Features from different terminals are fused using feature stitching, and the final fused features are used to perform collaborative perception visual tasks, completing autonomous driving collaborative perception.
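As an illustration of how a depth map can be obtained from point cloud data, the following sketch covers only two stages, voxelization and projection; the intermediate processing stages are left out. The pinhole camera model, the intrinsics fx, fy, cx, cy, the voxel size, and the function names are all assumptions made for illustration, not details of the invention.

```python
import math

def voxelize(points, voxel_size):
    """Keep one averaged point per occupied voxel of a regular grid."""
    cells = {}
    for x, y, z in points:
        key = (math.floor(x / voxel_size),
               math.floor(y / voxel_size),
               math.floor(z / voxel_size))
        cells.setdefault(key, []).append((x, y, z))
    return [tuple(sum(c) / len(pts) for c in zip(*pts)) for pts in cells.values()]

def project_to_depth_map(points, fx, fy, cx, cy, width, height):
    """Pinhole projection (assumed model); each pixel keeps the nearest depth.

    Points are expected in the camera frame with z pointing forward.
    A value of 0.0 marks a pixel that no point projects onto.
    """
    depth = [[0.0] * width for _ in range(height)]
    for x, y, z in points:
        if z <= 0:  # behind the camera, cannot be projected
            continue
        u = int(round(fx * x / z + cx))
        v = int(round(fy * y / z + cy))
        if 0 <= u < width and 0 <= v < height:
            if depth[v][u] == 0.0 or z < depth[v][u]:
                depth[v][u] = z
    return depth

# Hypothetical points (metres, camera frame); the last one lies behind the camera.
cloud = [(0.0, 0.0, 2.0), (0.01, 0.0, 2.0), (0.5, 0.0, 5.0), (0.0, 0.0, -1.0)]
voxels = voxelize(cloud, voxel_size=0.5)
depth_map = project_to_depth_map(voxels, fx=10.0, fy=10.0, cx=2.0, cy=2.0,
                                 width=5, height=5)
for row in depth_map:
    print(row)
```

The two nearly coincident points fall into the same voxel and are averaged, so the depth map receives a single sample for them.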
[0091] Figure 7 is the second flowchart of the autonomous driving cooperative perception method based on a multimodal large model provided by the present invention. In the embodiment shown in Figure 7, the method for multi-terminal autonomous driving collaborative perception is implemented through the following steps:
S10, deploy a multimodal large model PointClipv2 on each vehicle and roadside unit;
S20, acquire RGB image data from the master vehicle's camera and point cloud data from its LiDAR;
S30, use the multimodal large model PointClipv2 to generate text information from the point cloud data, and label the detected objects in the point cloud with "focus" tags;
S40, convert the collected point cloud data into a depth map through voxelization, densification, smoothing, compression, and projection;
S50, extract text features using a text encoder based on the Text Transformer model, and extract depth map features and RGB image features using a visual encoder based on two Swin Transformer blocks; input the text features, depth map features, and RGB image features into the fusion unit, where the text features guide the fusion of the depth map features and the RGB image features;
S60, the master vehicle sends a text request to the cooperating vehicles and roadside units via V2X technology, asking them to send features about the "focus" to the master vehicle; the cooperating vehicles and roadside units transmit the "focus" features about the master vehicle to the master vehicle, together with positioning coordinates and timestamps for data alignment;
S70, features from different terminals are fused using feature stitching, and the final fused features are input into the detection head to perform collaborative perception visual tasks, completing autonomous driving collaborative perception.
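The request, alignment, and stitching stages can be sketched as follows. The message fields, the time-skew threshold, the position-offset helper, and the reading of "feature stitching" as plain concatenation are all assumptions made for illustration; the invention does not specify a message format or an alignment rule.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class FocusRequest:
    text: str  # natural-language request sent by the master vehicle over V2X

@dataclass
class FocusReply:
    sender: str
    features: List[float]           # "focus" features computed by the sender
    position: Tuple[float, float]   # sender's positioning coordinates (x, y)
    timestamp: float                # capture time in seconds

def select_aligned(replies, ego_time, max_skew_s):
    """Temporal alignment (assumed rule): drop replies too far from the ego frame time."""
    return [r for r in replies if abs(r.timestamp - ego_time) <= max_skew_s]

def relative_offset(reply, ego_position):
    """Spatial alignment helper: sender position relative to the master vehicle."""
    return (reply.position[0] - ego_position[0],
            reply.position[1] - ego_position[1])

def stitch(ego_features, replies):
    """Feature stitching, read here as concatenation of the feature vectors."""
    fused = list(ego_features)
    for r in replies:
        fused.extend(r.features)
    return fused

# Hypothetical exchange: one cooperating vehicle, one roadside unit, one stale reply.
request = FocusRequest(text="Send features for the focus: pedestrian ahead")
replies = [
    FocusReply("vehicle_a", [0.1, 0.2], (12.0, 3.0), 10.02),
    FocusReply("roadside_1", [0.3, 0.4], (30.0, -5.0), 10.01),
    FocusReply("vehicle_b", [0.9, 0.9], (50.0, 0.0), 10.40),
]
aligned = select_aligned(replies, ego_time=10.00, max_skew_s=0.05)
second_fused = stitch([1.0, 2.0], aligned)
print([r.sender for r in aligned], second_fused)
```

Filtering on the timestamp before stitching keeps a stale reply from being concatenated with features that describe a different moment of the scene.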
[0092] The following describes the autonomous driving cooperative perception device based on a multimodal large model provided by the present invention. The device described below and the method described above may be read with reference to each other.
[0093] Figure 8 is a schematic diagram of the structure of the autonomous driving cooperative perception device based on a multimodal large model provided by the present invention. As shown in Figure 8, the device includes: a feature extraction module 810, a feature fusion module 820, and an execution module 830.
[0094] The feature extraction module 810 is used to process the point cloud data of the master vehicle through the multimodal large model deployed on the master vehicle to obtain text information; extract text features from the text information; extract image features from the image data of the master vehicle; and extract depth map features from the depth map corresponding to the point cloud data.
[0095] The feature fusion module 820 is used to fuse depth map features and image features according to text features to obtain a first fused feature; and to fuse the first fused feature with the features of the object to be detected sent by the target end to obtain a second fused feature; wherein, the target end includes at least one of the cooperating end of the master vehicle and the road end, and the features of the object to be detected are determined based on the point cloud data and image data of the master vehicle collected by the target end.
[0096] The execution module 830 is used to perform a multi-terminal collaborative perception visual task based on the second fused feature.
[0097] The autonomous driving cooperative perception device based on a multimodal large model provided in this invention extracts text information from the point cloud data of the master vehicle through the multimodal large model, extracts text features from the text information, extracts image features from the image data, and extracts depth map features from the depth map corresponding to the point cloud data. Then, it fuses the depth map features and image features according to the text features, and fuses the first fused feature with the features of the object to be detected sent by the target terminal to obtain the second fused feature, thereby improving the representation ability of the perception features. Finally, it performs a multi-terminal cooperative perception vision task based on the second fused feature, thereby improving the accuracy and robustness of cooperative perception among multiple terminal vehicles.
[0098] Figure 9 is a schematic diagram of the physical structure of an example electronic device. As shown in Figure 9, the electronic device may include a processor 910, a communication interface 920, a memory 930, and a communication bus 940, wherein the processor 910, the communication interface 920, and the memory 930 communicate with each other via the communication bus 940. The processor 910 can call logical instructions in the memory 930 to execute an autonomous driving cooperative perception method based on a multimodal large model. The method includes: processing point cloud data of the master vehicle using a multimodal large model to obtain text information; extracting text features from the text information, extracting image features from the image data of the master vehicle, and extracting depth map features from the depth map corresponding to the point cloud data; fusing the depth map features and image features according to the text features to obtain a first fused feature; fusing the first fused feature with the features of the object to be detected sent by the target end to obtain a second fused feature, wherein the target end includes at least one of the cooperating end of the master vehicle and the road end, and the features of the object to be detected are determined based on the point cloud data and image data of the master vehicle collected by the target end; and performing a multi-terminal cooperative perception visual task based on the second fused feature.
[0099] Furthermore, the logical instructions in the aforementioned memory 930 can be implemented as software functional units and, when sold or used as independent products, can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, essentially, or the part that contributes to the prior art, or a part of the technical solution, can be embodied in the form of a software product. This computer software product is stored in a storage medium and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as USB flash drives, portable hard drives, read-only memory (ROM), random access memory (RAM), magnetic disks, or optical disks.
[0100] On the other hand, the present invention also provides a computer program product, which includes a computer program that can be stored on a non-transitory computer-readable storage medium. When the computer program is executed by a processor, the computer can execute the autonomous driving cooperative perception method based on a multimodal large model provided by the above methods. The method includes: processing point cloud data of the master vehicle through a multimodal large model to obtain text information; extracting text features from the text information, extracting image features from the image data of the master vehicle, and extracting depth map features from the depth map corresponding to the point cloud data; fusing the depth map features and image features according to the text features to obtain a first fused feature; fusing the first fused feature with the features of the object to be detected sent by the target end to obtain a second fused feature; wherein the target end includes at least one of the cooperative end of the master vehicle and the road end, and the features of the object to be detected are determined based on the point cloud data and image data of the master vehicle collected by the target end; and performing a multi-end cooperative perception visual task based on the second fused feature.
[0101] In another aspect, the present invention also provides a non-transitory computer-readable storage medium storing a computer program thereon, which, when executed by a processor, implements the autonomous driving cooperative perception method based on a multimodal large model provided by the above methods. This method includes: processing point cloud data of a master vehicle using a multimodal large model to obtain text information; extracting text features from the text information, extracting image features from image data of the master vehicle, and extracting depth map features from the depth map corresponding to the point cloud data; fusing the depth map features and image features according to the text features to obtain a first fused feature; fusing the first fused feature with the features of a target object sent by a target end to obtain a second fused feature; wherein the target end includes at least one of a cooperating end of the master vehicle and a road end, and the features of the target object are determined based on the point cloud data and image data of the master vehicle collected by the target end; and performing a multi-end cooperative perception visual task based on the second fused feature.
[0102] The device embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate. The components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the modules can be selected to achieve the purpose of this embodiment according to actual needs. Those skilled in the art can understand and implement this without any creative effort.
[0103] Through the above description of the embodiments, those skilled in the art can clearly understand that each embodiment can be implemented by means of software plus necessary general-purpose hardware platforms, and of course, it can also be implemented by hardware. Based on this understanding, the above technical solutions, in essence or the part that contributes to the prior art, can be embodied in the form of a software product. This computer software product can be stored in a computer-readable storage medium, such as ROM / RAM, magnetic disk, optical disk, etc., and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute the methods described in the various embodiments or some parts of the embodiments.
[0104] Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present invention, and not to limit them; although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some of the technical features; and these modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the spirit and scope of the technical solutions of the embodiments of the present invention.
Claims
1. An autonomous driving cooperative perception method based on a multimodal large model, applied to a master vehicle on which a multimodal large model is deployed, characterized in that the method comprises:
processing the point cloud data of the master vehicle with the multimodal large model to obtain text information;
extracting text features from the text information, extracting image features from the image data of the master vehicle, and extracting depth map features from the depth map corresponding to the point cloud data;
fusing the depth map features and the image features based on the text features to obtain a first fused feature;
fusing the first fused feature with the features of the object to be detected sent by a target end to obtain a second fused feature, wherein the target end includes at least one of a cooperating end of the master vehicle and a road end, and the features of the object to be detected are determined based on the point cloud data and image data of the master vehicle collected by the target end;
wherein fusing the depth map features and the image features based on the text features to obtain the first fused feature comprises calculating the first fused feature using a formula in which I_D denotes the depth map features, I_v denotes the image features, μ(I_D) is the weight corresponding to the depth map features, γ(I_v) is the bias term corresponding to the image features, and the remaining term denotes the text features, and in which μ(I_D) and γ(I_v) are generated by a multilayer perceptron (MLP) structure; and
performing a multi-terminal collaborative perception visual task based on the second fused feature.
2. The autonomous driving cooperative perception method based on a multimodal large model according to claim 1, characterized in that the master vehicle is also equipped with a Text Transformer module and two Swin Transformer modules; and the step of extracting text features from the text information, extracting image features from the image data of the master vehicle, and extracting depth map features from the depth map corresponding to the point cloud data comprises: extracting the text features from the text information using the Text Transformer module; extracting the image features from the image data using one Swin Transformer module; and extracting the depth map features from the depth map using the other Swin Transformer module.
3. The autonomous driving cooperative perception method based on a multimodal large model according to claim 1, characterized in that the target end is equipped with the multimodal large model; and the features of the object to be detected are obtained through the following steps: sending a text request to the target end based on V2X technology; and receiving the features of the object to be detected, the location coordinates, and the timestamp sent by the target end, wherein the location coordinates and the timestamp are used to align data between the master vehicle and the target end.
4. The autonomous driving cooperative perception method based on a multimodal large model according to claim 1, characterized in that the depth map features are obtained through the following step: processing the point cloud data sequentially by voxelization, densification, smoothing, compression, and projection to obtain the depth map features.
5. The autonomous driving cooperative perception method based on a multimodal large model according to claim 1, characterized in that, after processing the point cloud data of the master vehicle using the multimodal large model to obtain text information, the method further comprises: inputting the point cloud data into the PointClipv2 model to obtain a label of the object to be detected in the point cloud data, wherein the label represents at least one of the position, range, and category of the object to be detected.
6. An autonomous driving cooperative perception device based on a multimodal large model, characterized in that the device comprises:
a feature extraction module, used to process the point cloud data of the master vehicle using a multimodal large model deployed on the master vehicle to obtain text information, extract text features from the text information, extract image features from the image data of the master vehicle, and extract depth map features from the depth map corresponding to the point cloud data;
a feature fusion module, used to fuse the depth map features and the image features based on the text features to obtain a first fused feature, and to fuse the first fused feature with the features of the object to be detected sent by a target end to obtain a second fused feature, wherein the target end includes at least one of a cooperating end of the master vehicle and a road end, and the features of the object to be detected are determined based on the point cloud data and image data of the master vehicle collected by the target end;
wherein the feature fusion module is specifically used to calculate the first fused feature using a formula in which I_D denotes the depth map features, I_v denotes the image features, μ(I_D) is the weight corresponding to the depth map features, γ(I_v) is the bias term corresponding to the image features, and the remaining term denotes the text features, and in which μ(I_D) and γ(I_v) are generated by a multilayer perceptron (MLP) structure; and
an execution module, used to perform a multi-terminal collaborative perception visual task based on the second fused feature.
7. An electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that, when the processor executes the program, it implements the autonomous driving cooperative perception method based on a multimodal large model as described in any one of claims 1 to 5.
8. A non-transitory computer-readable storage medium having a computer program stored thereon, characterized in that, when the computer program is executed by a processor, it implements the autonomous driving cooperative perception method based on a multimodal large model as described in any one of claims 1 to 5.
9. A computer program product, comprising a computer program, characterized in that, when the computer program is executed by a processor, it implements the autonomous driving cooperative perception method based on a multimodal large model as described in any one of claims 1 to 5.