A method, device, medium and equipment for managing automatic driving data
By using a multimodal large model to label and store autonomous driving data in OneData format, the problems of redundancy and heterogeneous data in autonomous driving data management are solved, achieving efficient and reliable data management and improving the intelligence and safety of autonomous driving systems.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- DONGFENG MOTOR GRP
- Filing Date
- 2026-03-16
- Publication Date
- 2026-06-19
AI Technical Summary
The existing autonomous driving data management system suffers from storage redundancy, inefficient retrieval, data and model incompatibility, inefficient and error-prone tag management, and difficulty in unified management of multi-source heterogeneous data. It cannot meet the low-cost, high-efficiency, and high-reliability data closed-loop requirements of autonomous vehicles in the mass production iteration stage, thus affecting the intelligence of autonomous driving.
By using a pre-trained multimodal large model to perform label recognition on autonomous driving data, entity labels are generated through data alignment, format conversion and label extraction. The OneData data format is used for storage and data association is constructed to achieve standardized and normalized management of multi-source heterogeneous data.
It improves the efficiency of entity label generation, reduces labeling costs and error rates, and enables unified management and efficient storage of multi-source data. It meets the low-cost, high-efficiency, and high-reliability data closed-loop requirements of autonomous vehicles in the mass production iteration stage, and enhances the intelligence and safety of autonomous driving systems.
Figure CN122240574A_ABST
Description
Technical Field
[0001] This invention relates to the field of autonomous driving data management technology, and in particular to a method, apparatus, medium and device for managing autonomous driving data.
Background Technology
[0002] As autonomous driving enters the second half of its mass production iteration (i.e., shifting from R&D verification to large-scale mass production), addressing the long-tail problem of autonomous driving based on big data and large-scale modeling, and building a highly efficient and low-cost data acquisition and management system, have become core drivers of autonomous driving algorithm development. The long-tail problem of autonomous driving mainly manifests as insufficient coverage of low-frequency, highly complex edge scenarios (such as non-motorized vehicles going against traffic, running red lights, and sudden lane changes). Although these scenarios account for a very small percentage, they pose extremely high risks and are a significant bottleneck restricting the safe deployment of autonomous driving. Efficient data management is the fundamental support for solving this problem.
[0003] However, existing autonomous driving data management systems suffer from a combination of defects, including redundant storage, inefficient retrieval, incompatibility between data and models, inefficient and error-prone label management, and difficulty in unified management of multi-source heterogeneous data. As a result, they cannot meet the low-cost, high-efficiency, and high-reliability data closed-loop requirements of autonomous vehicles in the mass production iteration stage, which restricts the ability of autonomous driving algorithms to solve the long-tail problem and affects the intelligence of autonomous driving.
[0004] Therefore, a safe and efficient method for managing autonomous driving data is needed to solve the aforementioned technical problems.
Summary of the Invention
[0005] To address the problems existing in the prior art, embodiments of the present invention provide a method, apparatus, medium, and device for managing autonomous driving data, in order to solve or partially solve the technical problems in the prior art that make it difficult to effectively manage autonomous driving data, fail to meet the low-cost, high-efficiency, and high-reliability data closed-loop requirements of autonomous vehicles in the mass production iteration stage, thereby restricting the solution of the long-tail problem in autonomous driving algorithms and affecting the intelligence of vehicle autonomous driving.
[0006] A first aspect of the present invention provides a method for managing autonomous driving data, the method comprising: using a pre-trained multimodal large model to perform label recognition on autonomous driving data to obtain entity labels for each frame of autonomous driving data, and labeling each frame of autonomous driving data with its corresponding entity labels; performing data processing on each frame of autonomous driving data based on its entity labels and a preset data format, and storing the processed autonomous driving data; and establishing data associations for each stored frame of autonomous driving data, and accessing data based on these data associations.
[0007] In the above scheme, before performing label recognition on autonomous driving data using a pre-trained multimodal large model, the method further includes: performing a data alignment operation on each frame of autonomous driving data; and converting the format of each aligned frame of autonomous driving data based on the data input format of the multimodal large model.
[0008] In the above scheme, before performing label recognition on autonomous driving data using a pre-trained multimodal large model, the method further includes: determining the corresponding label extraction prompt words for the multimodal large model based on the label types required for training the autonomous driving algorithm.
[0009] In the above scheme, the step of using a pre-trained multimodal large model to perform label recognition on autonomous driving data includes: jointly encoding each format-converted frame of autonomous driving data to convert the different types of autonomous driving data into corresponding initial feature vectors; fusing the initial feature vectors using a cross-modal attention mechanism to obtain fused feature vectors; using the multimodal large model to perform scene understanding on the core elements of the autonomous driving scenario based on the fused feature vectors, obtaining a scene understanding result; and, under the constraints of the label extraction prompt words, outputting entity labels for each frame of autonomous driving data based on the scene understanding result.
[0010] In the above scheme, the preset data format is the OneData integrated data format, and performing data processing on each frame of autonomous driving data based on its entity labels and the preset data format includes: for each frame of autonomous driving data, integrating the entity labels with the metadata of the autonomous driving data to obtain a tag metadata file that conforms to the preset data format, the metadata including the frame's own identity identifier, storage path, collection time, and synchronization identity identifier; binding the raw data of the autonomous driving data to the tag metadata file to obtain bound data; and determining a hierarchical directory structure according to the preset data format and writing the bound data to the corresponding directory according to that structure.
[0011] In the above scheme, binding the raw data of the autonomous driving data with the tag metadata file to obtain bound data includes: generating a target file name according to a preset file name format, the target file name including data type, acquisition location, acquisition time, and frame number information; and assigning the same target file name to the raw data of the autonomous driving data and to the corresponding tag metadata file, which completes the binding of the raw data and the tag metadata file.
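The file-name binding described in [0011] can be sketched as follows. This is a minimal illustration only: the specification lists the four fields of the file name but not their order, separator, or extensions, so the layout below and the helper names `make_target_name` and `bound_names` are assumptions, as is the sample location `siteA`.

```python
from datetime import datetime


def make_target_name(data_type: str, location: str, acquired: datetime, frame_no: int) -> str:
    """Build a shared base name from data type, location, time and frame number.

    The field order and the underscore separator are assumed, not taken
    from the specification.
    """
    millis = acquired.microsecond // 1000
    stamp = acquired.strftime("%Y%m%dT%H%M%S") + f"{millis:03d}"
    return f"{data_type}_{location}_{stamp}_{frame_no:06d}"


def bound_names(base: str, raw_ext: str, meta_ext: str = ".json") -> tuple:
    """Return (raw file name, tag metadata file name) sharing the same base name."""
    if raw_ext == meta_ext:
        # Identical extensions would make the two files collide on disk.
        raise ValueError("raw data and metadata need different extensions")
    return base + raw_ext, base + meta_ext


base = make_target_name("image", "siteA", datetime(2026, 3, 6, 14, 25, 30, 123000), 42)
raw_name, meta_name = bound_names(base, ".jpg")
print(raw_name)
print(meta_name)
```

Because both files share one base name, either file can be found from the other without a separate index.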
[0012] In the above scheme, the step of constructing data associations for the stored frames of autonomous driving data includes: obtaining the lineage information of each stored frame of autonomous driving data, the lineage information including the generation link of the autonomous driving data and the identity identifier of its dependent data, where the identity identifier of the dependent data is the identity identifier of the upstream data from which the frame of autonomous driving data was generated; using the stored frames of autonomous driving data as nodes, labeling the nodes with attribute information based on the generation link of the autonomous driving data, and constructing direct lineage association edges between nodes based on the identity identifiers of the dependent data; extracting the synchronization identity identifier from each frame of autonomous driving data, grouping the frames based on the synchronization identity identifier, and constructing indirect lineage association edges between the nodes located in the same group; and determining the data lineage graph of the autonomous driving data based on the nodes, their direct lineage association edges, and their indirect lineage association edges. The data lineage graph is used to present the association relationships between the frames of autonomous driving data.
[0013] A second aspect of the present invention provides an apparatus for managing autonomous driving data, the apparatus comprising: a recognition unit, used to perform label recognition on autonomous driving data using a pre-trained multimodal large model to obtain entity labels for each frame of autonomous driving data; a processing unit, used to label each frame of autonomous driving data with its corresponding entity labels, process each frame of autonomous driving data based on its entity labels and the preset data format, and store the processed autonomous driving data; and a construction unit, used to build data associations for each stored frame of autonomous driving data and to access data based on the data associations.
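The lineage-graph construction in [0012] can be sketched as below. This is an illustrative sketch under assumptions: the `Frame` record, its field names, and the stage values such as "raw" and "labeled" are hypothetical, and the specification does not say whether indirect edges are directed, so they are stored here as unordered pairs in sorted order.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Frame:
    frame_id: str                 # the frame's own identity identifier
    stage: str                    # generation link, e.g. "raw" or "labeled" (assumed values)
    sync_id: str                  # synchronization identity identifier
    depends_on: list = field(default_factory=list)  # upstream frame identifiers


def build_lineage_graph(frames):
    """Return (nodes, direct_edges, indirect_edges) for the stored frames."""
    # Each stored frame becomes a node annotated with its generation link.
    nodes = {f.frame_id: {"stage": f.stage} for f in frames}

    # Direct lineage edges run from each upstream frame to the frame built from it.
    direct = {(up, f.frame_id) for f in frames for up in f.depends_on if up in nodes}

    # Indirect lineage edges connect frames that share a synchronization identifier.
    groups = defaultdict(list)
    for f in frames:
        groups[f.sync_id].append(f.frame_id)
    indirect = set()
    for members in groups.values():
        members = sorted(members)
        for i, a in enumerate(members):
            for b in members[i + 1:]:
                indirect.add((a, b))
    return nodes, direct, indirect


frames = [
    Frame("img_1", "raw", "s1"),
    Frame("pc_1", "raw", "s1"),
    Frame("img_1_lab", "labeled", "s1", ["img_1"]),
    Frame("img_2", "raw", "s2"),
]
nodes, direct, indirect = build_lineage_graph(frames)
print(sorted(direct))
print(sorted(indirect))
```

Data access then amounts to following edges from a node, so related frames are reached by identifier rather than by copying the underlying files.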
[0014] A third aspect of the present invention provides a computer-readable storage medium having a computer program stored thereon that, when executed by a processor, implements the steps of the method described in any of the first aspects.
[0015] A fourth aspect of the present invention provides a computer device including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the program, implements the steps of the method described in any of the first aspects.
[0016] This invention provides a method, apparatus, medium, and device for managing autonomous driving data. The method includes: using a pre-trained multimodal large model to perform label recognition on autonomous driving data, obtaining entity labels for each frame of autonomous driving data, and labeling each frame of autonomous driving data with its corresponding entity labels; processing each frame of autonomous driving data based on the entity labels and a preset data format, and storing the processed autonomous driving data; and constructing data associations for the stored frames of autonomous driving data, and accessing data based on those associations. Using a multimodal large model to identify labels automatically replaces the traditional manual labeling method, which significantly improves the efficiency of entity label generation, reduces labeling costs and error rates, and improves label management efficiency. Formatting and storing each labeled frame uniformly according to the preset data format standardizes and normalizes multi-source heterogeneous data, so that data from different sources and of different types can be managed and used under the same system. Because data associations are constructed for the stored frames and data is accessed through them, there is no need to copy, migrate, or redundantly store the original data; the target data can be located, retrieved, and integrated through the associations alone, which significantly reduces storage redundancy and improves data retrieval and access efficiency. This meets the low-cost, high-efficiency, and high-reliability data closed-loop requirements of autonomous vehicles in the mass production iteration stage, and provides stable, efficient, and high-quality data support for autonomous driving algorithms to solve long-tail scenario problems, which is conducive to improving the intelligence and safety of autonomous driving systems.
Attached Figure Description
[0017] Various other advantages and benefits will become apparent to those skilled in the art upon reading the following detailed description of preferred embodiments. The accompanying drawings are for illustrative purposes only and are not intended to limit the invention. Furthermore, the same reference numerals denote the same parts throughout the drawings. In the drawings: Figure 1 shows a schematic flowchart of a method for managing autonomous driving data according to an embodiment of the present invention; Figure 2 shows another schematic diagram of the method for managing autonomous driving data according to an embodiment of the present invention; Figure 3 shows a schematic diagram of a device for managing autonomous driving data according to an embodiment of the present invention.
Detailed Implementation
[0018] Exemplary embodiments of the present disclosure will now be described in more detail with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be implemented in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
[0019] This invention provides a method for managing autonomous driving data. As shown in Figure 1, the method mainly includes the following steps: S110: using a pre-trained multimodal large model, perform label recognition on autonomous driving data to obtain entity labels for each frame of autonomous driving data, and annotate each frame of autonomous driving data with its corresponding entity labels.
[0020] Autonomous driving data is multi-source and heterogeneous, including various types such as text data, image data, audio data, and radar point cloud data. Each type of autonomous driving data contains multiple frames. This multi-source heterogeneous data must be processed into a standard format that the multimodal large model can recognize and analyze, which eliminates data bias and redundancy, provides high-quality data for the model's subsequent input and inference, and avoids label recognition errors caused by disordered raw data. To this end, in one implementation, before performing label recognition on the autonomous driving data using a pre-trained multimodal large model, the method further includes: performing a data alignment operation on each frame of autonomous driving data; and converting the format of each aligned frame of autonomous driving data based on the data input format of the multimodal large model.
[0021] In one implementation, before performing label recognition on the autonomous driving data using a pre-trained multimodal large model, the method further includes: determining the corresponding label extraction prompt words for the multimodal large model based on the label types required for training the autonomous driving algorithm.
[0022] Specifically, because acquisition time and spatial coordinates differ between sensors, timestamp synchronization and spatial coordinate calibration are needed to accurately match data from different modalities such as images, point clouds, and text (for example, a "pedestrian" in an image can correspond to the 3D coordinates of that pedestrian in a point cloud, and also to the description "pedestrian ahead" in text), as follows: First, align the timestamps. For example, convert the timestamps of the camera, LiDAR, and vehicle bus (text) to the format "year-month-day hour:minute:second.millisecond" (e.g., 2026-03-06 14:25:30.123) to avoid time discrepancies caused by different formats. Because LiDAR has a stable acquisition frequency and high time accuracy, the LiDAR timestamps can be selected as the time reference. The timestamps of the image and text data are compared with the LiDAR timestamps. If the deviation is within a time threshold (e.g., 10 ms), the data can be treated as captured at the same time. If the deviation exceeds the threshold, interpolation can be used to generate a timestamp that matches the LiDAR time, and this timestamp is then assigned to the corresponding text and image data.
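The timestamp alignment above can be sketched as follows. This is a minimal sketch under assumptions: the function names are hypothetical, timestamps are reduced to integer milliseconds, and the interpolation step is shown as linear interpolation of a scalar signal (for example a vehicle-bus reading) at the LiDAR time, which is only one possible reading of the interpolation the text mentions.

```python
from bisect import bisect_left
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S.%f"   # year-month-day hour:minute:second.millisecond
EPOCH = datetime(1970, 1, 1)
THRESHOLD_MS = 10              # example threshold from the text


def to_ms(stamp: str) -> int:
    """Parse a unified-format timestamp into integer milliseconds."""
    delta = datetime.strptime(stamp, FMT) - EPOCH
    return delta.days * 86_400_000 + delta.seconds * 1000 + delta.microseconds // 1000


def align(ts_ms: int, lidar_ms: list, threshold_ms: int = THRESHOLD_MS):
    """Return (nearest LiDAR time, whether the deviation is within the threshold).

    lidar_ms must be sorted in ascending order.
    """
    i = bisect_left(lidar_ms, ts_ms)
    candidates = lidar_ms[max(i - 1, 0): i + 1]
    ref = min(candidates, key=lambda t: abs(t - ts_ms))
    return ref, abs(ref - ts_ms) <= threshold_ms


def interpolate_at(ref_ms: int, samples: list) -> float:
    """Linearly interpolate (time_ms, value) samples, sorted by time, at ref_ms."""
    times = [t for t, _ in samples]
    i = bisect_left(times, ref_ms)
    if i < len(times) and times[i] == ref_ms:
        return samples[i][1]
    if i == 0 or i == len(times):
        raise ValueError("reference time outside the sampled range")
    (t0, v0), (t1, v1) = samples[i - 1], samples[i]
    return v0 + (ref_ms - t0) / (t1 - t0) * (v1 - v0)


lidar = [1000, 1100, 1200]
print(align(1104, lidar))   # deviation of 4 ms, inside the threshold
print(align(1150, lidar))   # deviation of 50 ms, outside the threshold
print(interpolate_at(1100, [(1050, 10.0), (1150, 20.0)]))
```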
[0023] Then, spatial alignment is performed on the text data, image data, audio data, and radar point cloud data. The spatial alignment method is as follows: When cameras and LiDAR are installed on a vehicle, the position of each sensor relative to the vehicle is calibrated (for example, the camera directly above the front of the vehicle and the LiDAR in the middle of the front of the vehicle), and the position parameters of the sensors are obtained for each frame. Point cloud coordinate transformation: the 3D point cloud data collected by the LiDAR (expressed in the LiDAR's own coordinate system) is converted into coordinates in the vehicle coordinate system using the position parameters (for example, a point detected 10 meters in front of the LiDAR is converted to a point 10 meters in front of the vehicle). Image data coordinate transformation: the pixel position of the target in the image (e.g., x=500, y=300, measured from the top-left corner of the image) is converted into three-dimensional coordinates in the vehicle coordinate system (e.g., 8 meters in front of the vehicle and 1.5 meters to the left). Text coordinate transformation: the spatial description in the text command (such as "there is a traffic light 50 meters ahead") is combined with the vehicle's own positioning data and converted into a position in the vehicle coordinate system (such as 50 meters ahead), ensuring that the text description corresponds to the positions in the image and point cloud.
[0024] After aligning the frames of autonomous driving data, data cleaning is performed to remove blurry, damaged, and invalid data (such as images from cameras obscured by mud, or point cloud fragments with missing data) in order to obtain valid autonomous driving data.
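The point cloud part of this spatial alignment can be illustrated with a rigid transform from the LiDAR frame to the vehicle frame. This is a sketch under stated assumptions: the axis convention (x forward, y left, z up), the restriction to a yaw rotation, and the function name `lidar_to_vehicle` are not from the specification, which only says that calibrated position parameters are used.

```python
import math


def lidar_to_vehicle(point, mount_xyz, yaw_rad=0.0):
    """Convert a LiDAR-frame point to the vehicle frame.

    point     -- (x, y, z) in the LiDAR's own coordinate system
    mount_xyz -- calibrated LiDAR position in the vehicle frame (assumed known)
    yaw_rad   -- calibrated rotation about the vertical axis (assumed the only rotation)
    """
    x, y, z = point
    c, s = math.cos(yaw_rad), math.sin(yaw_rad)
    # Rotate into the vehicle's orientation, then shift by the mounting offset.
    xr = c * x - s * y
    yr = s * x + c * y
    tx, ty, tz = mount_xyz
    return (xr + tx, yr + ty, z + tz)


# With no offset and no rotation, 10 m ahead of the LiDAR is 10 m ahead of the vehicle.
print(lidar_to_vehicle((10.0, 0.0, 0.0), (0.0, 0.0, 0.0)))
# With the LiDAR mounted 1.5 m ahead of the vehicle origin, the same point shifts forward.
print(lidar_to_vehicle((10.0, 0.0, 0.0), (1.5, 0.0, 0.5)))
```

Converting an image pixel to vehicle coordinates additionally needs camera intrinsics and a depth estimate, which this sketch does not cover.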
[0025] After alignment, a unified synchronization identifier (sync_id) is assigned to all multi-source data (such as images, point clouds, and text) in the same time and scene, so that images, point clouds, text and other data in the same scene can be associated using the synchronization identifier.
[0026] Then, the valid autonomous driving data is formatted according to the input specifications of the multimodal large model, converting it into a format that the multimodal large model can directly read, recognize, and process, while retaining the core information of the data (such as the targets in the image and the coordinates of the point cloud), laying the foundation for subsequent cross-modal fusion inference.
[0027] Image data format conversion includes: The aligned image data is uniformly resized to the fixed size required by the multimodal large model (e.g., 640×480 or 1280×720, adjustable as needed), keeping the image aspect ratio unchanged during resizing. The color channels of the image data are uniformly converted to the format required by the model (e.g., RGB), and the channel values are normalized to between 0 and 1 to reduce the computational load on the multimodal large model. The raw format of the image data (such as JPG or PNG) is converted into an encoded format that the model can read directly (e.g., Tensor format).
[0028] Format conversion of point cloud data includes: The aligned point cloud data (containing hundreds of thousands or even millions of 3D points) is reduced through random downsampling or voxel downsampling (e.g., to tens of thousands of points), retaining the 3D coordinates of key targets such as pedestrians, vehicles, and roads while removing redundant points. The core features of the point cloud (such as the distance, intensity, and normal direction of the points) are extracted so that the model can recognize the 3D contour of the target. The processed point cloud data is converted from the original PCD format to a format that the model can read (such as PointCloud Tensor format), while the coordinate range is unified (the 3D coordinates are normalized to a fixed interval).
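The resize and normalization arithmetic can be sketched as below. The sketch assumes that the aspect ratio is preserved by scaling the image to fit inside the target size and padding the remainder (letterboxing); the specification does not say how the leftover area is handled, and the helper names are hypothetical.

```python
def fit_within(width: int, height: int, target_w: int, target_h: int):
    """Scale (width, height) to fit inside the target size without changing the ratio.

    Returns (new_w, new_h, pad_x, pad_y), where the padding is the total number
    of pixels left over on each axis (padding is an assumed strategy).
    """
    scale = min(target_w / width, target_h / height)
    new_w, new_h = round(width * scale), round(height * scale)
    return new_w, new_h, target_w - new_w, target_h - new_h


def normalize_pixel(rgb):
    """Map 8-bit channel values (0..255) to floats in [0, 1]."""
    return tuple(c / 255.0 for c in rgb)


# A 1920x1080 frame fitted into a 640x480 model input.
print(fit_within(1920, 1080, 640, 480))
print(normalize_pixel((255, 0, 51)))
```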
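Voxel downsampling, one of the two reduction options named above, can be sketched in plain Python. The sketch assumes that each occupied voxel is replaced by the centroid of its points, which is a common choice but is not stated in the specification; the voxel size is an illustrative parameter.

```python
import math
from collections import defaultdict


def voxel_downsample(points, voxel_size: float):
    """Replace all points falling in the same cubic voxel by their centroid."""
    buckets = defaultdict(list)
    for x, y, z in points:
        key = (
            math.floor(x / voxel_size),
            math.floor(y / voxel_size),
            math.floor(z / voxel_size),
        )
        buckets[key].append((x, y, z))
    # One centroid per occupied voxel, in order of first occurrence.
    return [
        tuple(sum(axis) / len(pts) for axis in zip(*pts))
        for pts in buckets.values()
    ]


cloud = [(0.1, 0.1, 0.1), (0.3, 0.3, 0.3), (1.2, 0.0, 0.0)]
reduced = voxel_downsample(cloud, voxel_size=1.0)
print(len(cloud), "->", len(reduced))
```

A larger voxel size removes more points, so it trades geometric detail for a smaller model input.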
[0029] Format conversion of text data, including: The aligned text instructions are parsed to remove invalid characters (such as garbled text and irrelevant symbols) and retain the core semantics (such as "ahead", "pedestrian", "crossing the road"). The parsed text is converted into a fixed-length semantic code through a text encoder, and the multimodal large model can then understand the meaning of the text through the semantic code. The generated semantic codes are converted into a Tensor format acceptable to multimodal large models to ensure compatibility with the encoding formats of images and point clouds, facilitating subsequent cross-modal fusion.
[0030] Format conversion of audio data includes: The aligned audio data is converted to a unified sampling rate, number of channels, and bit depth, so that inconsistent parameters do not prevent the model from recognizing it. The core semantic features of the audio (such as Mel-frequency cepstral coefficients) are extracted to help the multimodal large model recognize speech content (such as "watch out for pedestrians"). The extracted core semantic features are converted into a Tensor format that the model can read, consistent with the format of the other modal data.
[0031] Finally, the dedicated label extraction prompts (Prompts) for autonomous driving are determined. Based on the label requirements for training autonomous driving algorithms, the prompts guide the model by making explicit the types of labels it needs to recognize. For example, the label extraction prompts could be: identify the type, location, and motion state of obstacles in images and point clouds; parse the traffic scene descriptions in text commands; and finally output entity labels.
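Assembling such a prompt from the required label types might look like the sketch below. The wording of the prompt, the function name `build_label_prompt`, and the choice of JSON as the requested output format are illustrative assumptions rather than the prompt used in the specification.

```python
def build_label_prompt(label_types, output_format: str = "JSON") -> str:
    """Compose a label extraction prompt listing the label types to recognize."""
    lines = ["Identify the following in the supplied driving-scene data:"]
    lines += [f"- {label_type}" for label_type in label_types]
    lines.append(f"Return the entity labels as {output_format}.")
    return "\n".join(lines)


prompt = build_label_prompt([
    "type, location, and motion state of obstacles in images and point clouds",
    "traffic scene descriptions in text commands",
])
print(prompt)
```

Keeping the label types in a list lets the same template be reused when the training requirements of the autonomous driving algorithm change.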
[0032] After processing the autonomous driving data as described above, in one implementation, a pre-trained multimodal large model is used to perform label recognition on the autonomous driving data, including: Joint encoding of multiple types of autonomous driving data is performed to convert different types of autonomous driving data into corresponding initial feature vectors; A cross-modal attention mechanism is used to fuse the initial feature vectors to obtain the fused feature vectors corresponding to the initial feature vectors. Based on fused feature vectors, a multimodal large model is used to understand the core elements of autonomous driving scenarios and obtain scenario understanding results. Under the constraint of tag extraction prompts, entity labels for each frame of autonomous driving data are output based on the scene understanding results.
[0033] Specifically, the multimodal large model needs to be trained before use. The multimodal large model can include: an image encoder, a point cloud encoder, a text encoder, a speech encoder, a cross-modal fusion unit, and a decoder. The image encoder can use a lightweight convolutional neural network (CNN) to extract low-level texture features (such as road markings and vehicle outlines) and high-level semantic features (such as pedestrian poses and vehicle driving states) from images layer by layer through convolutional and pooling layers. The extracted features are then transformed into fixed-dimensional image feature vectors (the initial feature vectors corresponding to the image data), providing a foundation for subsequent cross-modal fusion.
[0034] The point cloud encoder can adopt a lightweight point cloud network (PointNet) structure to meet the feature extraction requirements of point cloud data for autonomous driving. Through point cloud sampling, feature aggregation and other operations, it extracts the spatial location features, density features and shape features of the point cloud, and transforms the point cloud data into a feature vector (the initial feature vector corresponding to the point cloud data) with the same dimension as the image feature vector.
[0035] The text encoder can adopt a lightweight Transformer network structure to adapt to text data in autonomous driving scenarios. Through word embedding and self-attention mechanisms, it extracts semantic features of text data and transforms text into a fixed-dimensional feature vector (the initial feature vector corresponding to the text).
[0036] The speech encoder can adopt lightweight Mel-Frequency Cepstral Coefficients (MFCC) and Recurrent Neural Network (RNN) structures to adapt to speech data in autonomous driving scenarios (such as driver voice commands, environmental voice prompts, etc.). The frequency features of the speech are extracted through MFCC, and the temporal features of the speech are captured through RNN, which is finally transformed into a fixed-dimensional feature vector (the initial feature vector corresponding to the audio).
[0037] The cross-modal fusion unit can use a dual attention mechanism of self-attention and cross-attention to fuse the four initial feature vectors: image, point cloud, text, and speech.
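The cross-attention half of this mechanism can be illustrated with scaled dot-product attention, in which feature vectors from one modality act as queries over the keys and values of another. This is a generic sketch, not the model described in the specification: it omits the learned projection matrices, multiple heads, and the self-attention stage, and all names and numbers are illustrative.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product attention: each query mixes the values by key similarity."""
    dim = len(keys[0])
    fused = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        fused.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return fused


# One image feature (query) attending over two point cloud features (keys and values).
image_feats = [[1.0, 0.0]]
cloud_keys = [[1.0, 1.0], [1.0, 1.0]]
cloud_values = [[0.0, 2.0], [2.0, 0.0]]
print(cross_attention(image_feats, cloud_keys, cloud_values))
```

Because the two keys are identical here, the attention weights are equal and the output is the mean of the two value vectors.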
[0038] The decoder can use a lightweight Transformer decoder to meet the needs of accurate and efficient entity label generation in autonomous driving scenarios, and adopts a beam search strategy to generate pure semantic entity labels that fit the autonomous driving scenario.
[0039] The training process for a multimodal large model is as follows: Dataset preparation: Collect massive amounts of real-world autonomous driving scenario data of various types to construct a training set. The training set includes image training sets, point cloud training sets, text training sets, and audio training sets.
[0040] The data includes: image data (images captured by front / side-view cameras), point cloud data (point clouds captured by LiDAR), text data (label text, fault prompt text), and voice data (driver commands, environmental voice). All data is converted into bound data in OneData format. Entity labels must cover the core entities and scene information corresponding to each modality (such as labeling "vehicle" for images, "distance" for point clouds, "fault type" for text, and "command content" for voice), ensuring that the labeled content is consistent with the subsequent data binding label requirements of this system.
[0041] Single-modal pre-training: The image encoder is trained separately using the image training set, the point cloud encoder is trained separately using the point cloud training set, the text encoder is trained using the text training set, and the speech encoder is trained using the audio training set. This optimizes the feature extraction accuracy of each encoder (e.g., optimizing the visual feature recognition capability of the image encoder and optimizing the spatial feature capture capability of the point cloud encoder), ensuring that core effective features can be extracted for each modality.
[0042] During the training of each of the above single-modal encoders, a corresponding loss function is introduced. After each iteration, the loss value of the loss function is used to guide the optimization of the training parameters (such as learning rate, weights, and biases) of the corresponding single-modal encoder.
Multimodal fusion training: A multimodal training set is constructed based on the feature vectors output by each pre-trained single-modal encoder. The cross-modal fusion unit is pre-trained using the multimodal training set. Through tasks such as contrastive learning and modality alignment, the weight allocation of the cross-attention mechanism is optimized to achieve accurate understanding of the scene.
[0043] Decoder training: Based on the fused multimodal features, the decoder is pre-trained to optimize the beam search strategy parameters, enabling the model to initially grasp the generation logic of entity labels in autonomous driving scenarios and output entity labels that meet the requirements of the scenario.
[0044] During the training of the cross-modal fusion unit, a corresponding loss function is introduced. After each iteration, the loss value of the loss function is used to optimize the overall parameters of the cross-modal fusion unit (such as learning rate, weights, and biases). This process is repeated until convergence is achieved, thereby reducing misjudgments in label generation and ensuring the semantic accuracy of the output entity labels.
[0045] After the multimodal large model is trained, different types of encoders in the multimodal large model can be used to process the corresponding types of autonomous driving data. The features of the corresponding types of autonomous driving data can be extracted using each type of encoder, and the features can be converted into corresponding initial feature vectors.
[0046] For example, the image encoder can be used to extract image feature vectors from image data, focusing on capturing visual information such as the outline and color of pedestrians and the lane lines of roads. The point cloud encoder is used to extract point cloud feature vectors, focusing on capturing spatial information such as the three-dimensional coordinates of pedestrians (10 meters in front of the vehicle and 0.5 meters to the left) and the three-dimensional outline of the vehicle. The text encoder is used to extract text feature vectors from the semantic codes, focusing on capturing semantic information such as "ahead", "pedestrian", and "crossing".
[0047] A cross-modal attention mechanism is used to fuse the initial feature vectors, resulting in a fused feature vector. The same process is performed on each frame of data to obtain the fused feature vector for each frame.
[0048] For example, during fusion, the pedestrian contour features of the image are associated with the pedestrian 3D coordinate features of the point cloud, then combined with the "pedestrian crossing" semantic features of the text, and finally fused into a globally unified feature vector (the fused feature vector). The fused feature vector contains visual information, spatial information, and semantic information.
[0049] Under the constraints of the label extraction prompts, the multimodal large model interprets the fused feature vectors, analyzes the core elements of the autonomous driving scene one by one, and obtains the recognition results. The recognition results are then parsed to obtain the entity labels for each frame of autonomous driving data. The recognition results can be in JSON format.
[0050] For example, for a certain driving scenario, the core elements of the scenario that the multimodal large model ultimately understands are: an urban arterial road in sunny conditions, the vehicle is driving normally, a pedestrian about 10 meters ahead is crossing the road, and there are no other obstacles or traffic lights nearby. This matches the actual data collection scenario and also meets the recognition requirements of the prompt words.
[0051] The final recognition result is:
{
  "Basic Scene Information": {
    "Road Type": "Urban Arterial Road",
    "Road Surface Condition": "Sunny, no standing water",
    "Data Collection Vehicle": "TEST008",
    "Data Collection Time": "2025-03-06 14:25:30.123"
  },
  "Obstacle Information": {
    "Obstacle Type": "Pedestrian",
    "Quantity": 1,
    "Three-dimensional Coordinates (vehicle coordinate system)": "x=10.0m, y=0.5m, z=1.2m (z is height)"
  },
  "Dynamic Behavior Information": {
    "Pedestrian Behavior": "Crossing the road",
    "Vehicle Status": "Driving normally, not changing lanes"
  }
}
Finally, the recognition result is parsed to obtain entity labels, including scene-type entity labels, target-type entity labels, and behavior-type entity labels.
Scene-type entity tags can include:
Road type label: Urban arterial road (corresponding to the "Road Type" field in the JSON);
Road surface condition label: Sunny, no standing water (corresponding to the "Road Surface Condition" field in the JSON).
Target-type entity tags include:
Obstacle type label: Pedestrian (corresponding to the "Obstacle Type" field in the JSON; the core target label);
Obstacle quantity label: 1 (corresponding to the "Quantity" field in the JSON);
Obstacle spatial coordinate label: x=10.0m, y=0.5m, z=1.2m in the vehicle coordinate system (corresponding to the "Three-dimensional Coordinates" field in the JSON, specifying the spatial location of the target).
[0052] Behavior-type entity tags include: Obstacle behavior label: Crossing the road (corresponding to the "Pedestrian Behavior" field in the JSON, describing the pedestrian's dynamics); Vehicle status label: Driving normally, not changing lanes (corresponding to the "Vehicle Status" field in the JSON, describing the status of the vehicle itself).
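The parsing of the recognition result into the three groups of entity labels can be sketched as follows. This is an illustration under stated assumptions, not the parser of this application: the JSON string is a well-formed version of the example result with normalized key names, and the function name and output keys are hypothetical.

```python
import json

# Well-formed version of the example recognition result (key names normalized here).
RECOGNITION_RESULT = """
{
  "Basic Scene Information": {
    "Road Type": "Urban Arterial Road",
    "Road Surface Condition": "Sunny, no standing water",
    "Data Collection Vehicle": "TEST008",
    "Data Collection Time": "2025-03-06 14:25:30.123"
  },
  "Obstacle Information": {
    "Obstacle Type": "Pedestrian",
    "Quantity": 1,
    "Three-dimensional Coordinates (vehicle coordinate system)": "x=10.0m, y=0.5m, z=1.2m"
  },
  "Dynamic Behavior Information": {
    "Pedestrian Behavior": "Crossing the road",
    "Vehicle Status": "Driving normally, not changing lanes"
  }
}
"""


def extract_entity_labels(result_json):
    """Map the recognition result onto scene, target, and behavior entity labels."""
    result = json.loads(result_json)
    scene = result["Basic Scene Information"]
    obstacle = result["Obstacle Information"]
    behavior = result["Dynamic Behavior Information"]
    return {
        "scene": {
            "road_type": scene["Road Type"],
            "road_surface_condition": scene["Road Surface Condition"],
        },
        "target": {
            "obstacle_type": obstacle["Obstacle Type"],
            "obstacle_quantity": obstacle["Quantity"],
            "obstacle_coordinates": obstacle[
                "Three-dimensional Coordinates (vehicle coordinate system)"
            ],
        },
        "behavior": {
            "obstacle_behavior": behavior["Pedestrian Behavior"],
            "vehicle_status": behavior["Vehicle Status"],
        },
    }


labels = extract_entity_labels(RECOGNITION_RESULT)
```

A missing field raises a KeyError in this sketch, which makes an incomplete model output visible instead of silently producing partial labels.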
[0053] This allows for the identification of corresponding entity labels for each frame of autonomous driving data. Utilizing a multimodal large-scale model to identify entity labels replaces traditional manual annotation methods, significantly improving entity label generation efficiency, reducing annotation costs and error rates, and enhancing label management efficiency.
[0054] S111, based on the entity tags and preset data format of each frame of autonomous driving data, performs data processing on each frame of autonomous driving data, and stores the processed autonomous driving data.
[0055] After the entity tags are identified, each frame of autonomous driving data needs to be processed based on its entity tags and a preset data format, and the processed autonomous driving data is then stored. This enables autonomous driving algorithms to quickly retrieve and correlate data by dimensions such as tag type, scene, time, and vehicle. For example, by specifying the "pedestrian crossing" tag, all data packets for the same scenario can be retrieved in a single operation, which improves data utilization efficiency.
[0056] In this invention, to reduce storage costs and avoid data redundancy, the preset data format is OneData. The core idea of OneData is to store only one copy of the original data and manage the data through metadata indexing. During writing, the data itself is not copied; only the data's storage path, its own identity identifier, entity tag, and synchronization identity identifier are recorded in the management system.
[0057] In one implementation, performing data processing on each frame of autonomous driving data based on the entity tags and the preset data format includes:
for each frame of autonomous driving data, integrating the entity tag with the metadata of the autonomous driving data to obtain a tag metadata file that conforms to the preset data format, where the metadata includes its own identity identifier, storage path, collection time, and synchronization identity identifier;
binding the raw data of the autonomous driving data to the tag metadata file to obtain the bound data;
determining the hierarchical directory structure according to the preset data format, and writing the bound data to the corresponding directory according to the hierarchical directory structure.
[0058] In addition, the metadata of the autonomous driving data also includes the sensor type, collection time, and vehicle number (VIN code). When the hierarchical directory structure is determined according to the preset data format, the target structure can be: R&D domain, sensor type, collection time, vehicle number.
[0059] In one implementation, binding the raw data of the autonomous driving data to the tag metadata file to obtain the bound data includes: generating the target file name according to a preset file name format, which includes the data type, acquisition location, acquisition time, and frame number information; and configuring the same target file name for the raw data and the corresponding tag metadata file to complete the binding of the raw data and the tag metadata file.
[0060] In one implementation, storing the processed autonomous driving data includes: writing the entity label, data identity identifier, storage path, and synchronization identity identifier of the autonomous driving data into the corresponding fields of the pre-built database.
[0061] Specifically, after identifying entity tags, the entity tags are integrated with the data's own identifier (data_ID), storage path, and collection time to generate a JSON file conforming to the OneData data format, which is the tag metadata file. The tag metadata file contains both entity tags and all the metadata necessary for data management, and is a standard format that the OneData architecture can directly recognize and use.
[0062] The raw autonomous driving data is bound to the tag metadata file to obtain bound data. For example, the raw autonomous driving data and the tag metadata file are given the same filename to complete the binding.
[0063] For example, suppose the file name of a piece of image data is CAM_FRONT_20260306142530_001. The original data of the image data is then CAM_FRONT_20260306142530_001.jpg, and its tag metadata file is CAM_FRONT_20260306142530_001.json.
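The file-name binding and the content of the tag metadata file can be sketched as follows. This is a minimal illustration under assumptions: the reading of "CAM" as the data type and "FRONT" as the acquisition location, the three-digit frame number, the JSON field names, and the storage path root "AD_RD" are all hypothetical and not specified in this form by the application.

```python
import json
from datetime import datetime


def build_target_name(data_type, location, collected_at, frame_no):
    """Compose '<type>_<location>_<YYYYMMDDhhmmss>_<frame>' as in the example file name."""
    return f"{data_type}_{location}_{collected_at:%Y%m%d%H%M%S}_{frame_no:03d}"


def build_tag_metadata(data_id, storage_path, collected_at, sync_id, entity_labels):
    """Integrate the entity labels with the management metadata (field names assumed)."""
    return {
        "data_id": data_id,
        "storage_path": storage_path,
        "collection_time": collected_at.isoformat(timespec="milliseconds"),
        "sync_id": sync_id,
        "entity_labels": entity_labels,
    }


collected_at = datetime(2026, 3, 6, 14, 25, 30)
target_name = build_target_name("CAM", "FRONT", collected_at, 1)

# Binding: the raw file and the tag metadata file share the same target file name.
raw_file = f"{target_name}.jpg"
tag_file = f"{target_name}.json"

metadata = build_tag_metadata(
    data_id=target_name,
    storage_path=f"AD_RD/CAM/20260306/TEST008/{raw_file}",  # hypothetical root
    collected_at=collected_at,
    sync_id="sync_id1",
    entity_labels={"obstacle_type": "Pedestrian"},
)
tag_file_content = json.dumps(metadata, ensure_ascii=False, indent=2)
```

Because the two files differ only in their extension, either one can be located from the other without consulting an additional mapping.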
[0064] The hierarchical directory structure is determined according to the OneData format, and the bound data is categorized and stored according to the hierarchical directory structure, as follows: First, the directory hierarchy is determined. A fixed four-level directory structure is adopted, from top to bottom: R&D domain root directory, data type directory, data collection date directory, and vehicle number directory. The R&D domain root directory is used to uniformly store all data related to autonomous driving R&D.
[0065] Secondly, the naming of directories at all levels follows a unified standard so that the system can automatically recognize and parse them:
(1) Data type directory: named with fixed data type abbreviations, such as "CAM" for image data, "LIDAR" for point cloud data, "TEXT" for text data, and "AUDIO" for audio data;
(2) Collection date directory: named in YYYYMMDD format, that is, year (4 digits) + month (2 digits) + day (2 digits). For example, data collected on March 6, 2025 is stored under "20250306";
(3) Vehicle number directory: consistent with the unique number of the collection vehicle. For example, if the vehicle number is TEST008, the corresponding directory name is "TEST008".
[0066] Then, the bound data is placed into the corresponding directories according to the above four-level directory hierarchy to complete the format processing of the autonomous driving data.
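The four-level directory placement can be sketched as follows. The root directory name "AD_RD" and the function name are hypothetical; the data type abbreviations, the YYYYMMDD date format, and the vehicle number come from the description above.

```python
from datetime import date
from pathlib import PurePosixPath

# Fixed data type abbreviations named in the description
DATA_TYPE_DIRS = {"CAM", "LIDAR", "TEXT", "AUDIO"}


def build_storage_dir(root, data_type, collected_on, vehicle_no):
    """Return <root>/<data type>/<YYYYMMDD>/<vehicle number>."""
    if data_type not in DATA_TYPE_DIRS:
        raise ValueError(f"unknown data type directory: {data_type}")
    return PurePosixPath(root) / data_type / collected_on.strftime("%Y%m%d") / vehicle_no


# "AD_RD" is a hypothetical name for the R&D domain root directory.
storage_dir = build_storage_dir("AD_RD", "CAM", date(2026, 3, 6), "TEST008")

# The bound pair (raw data and tag metadata file) lands in the same directory.
raw_path = storage_dir / "CAM_FRONT_20260306142530_001.jpg"
tag_path = storage_dir / "CAM_FRONT_20260306142530_001.json"
```

Rejecting unknown data type abbreviations keeps the directory names machine-parseable, which is the stated purpose of the unified naming standard.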
[0067] To improve data access efficiency, the entity tags, data identity identifiers, storage paths, and synchronization identity identifiers of the autonomous driving data will be written into the corresponding fields in the pre-built database.
[0068] For example, the original data of an image is CAM_FRONT_20260306142530_001.jpg, and its tag metadata file is CAM_FRONT_20260306142530_001.json. When this image data is stored, an index record is added to the database, as shown in Table 1. The entity tag, data identity identifier, storage path, and synchronization identity identifier of the image data are written into the corresponding fields. Table 1
[0069] In this way, the original files, including the image data and tag metadata files, remain in their original hierarchical directories, unchanged in location, and are not copied. Only the storage path and ID are stored in the database to improve subsequent retrieval efficiency.
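The index record can be illustrated with an in-memory SQLite database. This is a sketch under assumptions: the table and column names are hypothetical (the fields of Table 1 are not reproduced here), and SQLite merely stands in for whatever pre-built database an implementation uses. Only identifiers, tags, and the storage path are stored, never the file content.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE data_index (
        data_id      TEXT PRIMARY KEY,
        storage_path TEXT NOT NULL,
        sync_id      TEXT NOT NULL
    );
    CREATE TABLE entity_tags (
        data_id   TEXT NOT NULL REFERENCES data_index(data_id),
        tag_type  TEXT NOT NULL,
        tag_value TEXT NOT NULL
    );
    """
)


def add_index_record(conn, data_id, storage_path, sync_id, tags):
    """Write one index record plus its entity tags in a single transaction."""
    with conn:
        conn.execute(
            "INSERT INTO data_index VALUES (?, ?, ?)",
            (data_id, storage_path, sync_id),
        )
        conn.executemany(
            "INSERT INTO entity_tags VALUES (?, ?, ?)",
            [(data_id, tag_type, tag_value) for tag_type, tag_value in tags.items()],
        )


def find_paths_by_tag(conn, tag_type, tag_value):
    """Return the storage paths of all data carrying the given entity tag."""
    rows = conn.execute(
        "SELECT d.storage_path FROM data_index d "
        "JOIN entity_tags t ON t.data_id = d.data_id "
        "WHERE t.tag_type = ? AND t.tag_value = ? "
        "ORDER BY d.storage_path",
        (tag_type, tag_value),
    )
    return [row[0] for row in rows]


add_index_record(
    conn,
    data_id="CAM_FRONT_20260306142530_001",
    storage_path="AD_RD/CAM/20260306/TEST008/CAM_FRONT_20260306142530_001.jpg",
    sync_id="sync_id1",
    tags={"obstacle_type": "Pedestrian", "obstacle_behavior": "Crossing the road"},
)
paths = find_paths_by_tag(conn, "obstacle_behavior", "Crossing the road")
```

Retrieval by tag then returns storage paths only, so the original files stay in their hierarchical directories and are read in place.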
[0070] This step focuses on autonomous driving data, completing the OneData format organization of the data, realizing standardized binding of data and tags, hierarchical and orderly storage of data, and interconnected management of multi-source data. It solves the problems of messiness, redundancy, and inefficient retrieval in existing data management, and provides reliable support for subsequent data access and fusion analysis of the OneData system, adapting to the low-cost, high-efficiency, and high-reliability data management needs of the mass production iteration stage of autonomous driving.
[0071] S112, construct data association relationships for each frame of stored autonomous driving data, and access data based on the data association relationships.
[0072] In order to efficiently access autonomous driving data, data associations are built for each frame of stored autonomous driving data, and data access is performed based on these associations.
[0073] In one implementation, constructing data associations for the stored frames of autonomous driving data includes:
Obtaining the lineage information of each stored frame of autonomous driving data. The lineage information includes the generation link of the autonomous driving data and the identity identifier of its dependent data, that is, the identity identifier of the upstream data on which the generation of the autonomous driving data frame depends.
Using the stored frames of autonomous driving data as nodes, labeling attribute information for the nodes based on the generation link of the autonomous driving data, and constructing direct lineage association edges of the nodes based on the dependent data identity identifiers of the autonomous driving data.
Extracting the synchronization identity identifier from each frame of autonomous driving data, grouping the frames based on the synchronization identity identifier, and constructing indirect lineage association edges between the nodes located in the same group.
Determining the data lineage graph of the autonomous driving data based on the nodes, the direct lineage association edges of the nodes, and the indirect lineage association edges of the nodes. The data lineage graph presents the association relationships between the frames of autonomous driving data.
[0074] Specifically, for each frame of autonomous driving data stored in OneData format, lineage metadata is obtained one by one. At the same time, the original data and tag metadata of the autonomous driving data are obtained to ensure the integrity and relevance of the lineage metadata. The lineage metadata includes the generation link of each frame of autonomous driving data and the ID of its dependent data. Taking a specific piece of image data as an example, the implementation is as follows:
Step 1: Determine the target data.
[0075] (1) Image data: data_id is CAM_FRONT_20260306142530_001, the original data is CAM_FRONT_20260306142530_001.jpg, and the tag metadata file is CAM_FRONT_20260306142530_001.json.
Step 2: Collect lineage information.
(1) Lineage information of the image data (CAM_FRONT_20260306142530_001):
Generation link: the multimodal large model performs target detection and scene recognition on the original front view image (RAW_CAM_FRONT_20260306142530_001) to generate image data with entity labels and the corresponding tag metadata file;
Dependent data ID: None. This is because the image is raw data generated directly by the forward-facing camera and does not depend on any upstream data.
[0076] Step 3: Extract the synchronization identity identifier from the autonomous driving data frame CAM_FRONT_20260306142530_001, for example, sync_id1, and group the point cloud data, text data and audio data with the synchronization identity identifier sync_id1 into the same group.
[0077] Then, the autonomous driving data frame CAM_FRONT_20260306142530_001 is taken as an independent node (node 1). Since this node has no dependent data ID, no direct lineage association edge needs to be built for it; it is only necessary to reflect the generation dependency "raw data → tag metadata file" within the node. Therefore, in the lineage metadata of node 1, the attribute information is marked as: the raw data (CAM_FRONT_20260306142530_001.jpg) generates the tag metadata file (CAM_FRONT_20260306142530_001.json).
[0078] Using sync_id1 as the group identifier, one association group is formed for the same scenario: association group (sync_id1): node 1 (image), node 2 (point cloud). Indirect association edges are then constructed: within the same association group, indirect lineage association edges are built between all nodes and labeled "indirectly associated in the same scenario", without distinguishing data types.
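The construction of the data lineage graph can be sketched as follows. This is an illustration under assumptions: the record field names, the point cloud data ID, and the derived "FUSION" frame are hypothetical and are included only so that the sketch shows a direct lineage edge as well as indirect ones.

```python
from collections import defaultdict
from itertools import combinations


def build_lineage_graph(frames):
    """Build nodes, direct lineage edges, and indirect (same-scenario) lineage edges."""
    nodes = {}
    direct_edges = []
    groups = defaultdict(list)
    for frame in frames:
        # Node attribute information comes from the generation link.
        nodes[frame["data_id"]] = {"generation_link": frame["generation_link"]}
        # Direct edges point from each upstream (dependent) data ID to this frame.
        for upstream_id in frame.get("depends_on", []):
            direct_edges.append((upstream_id, frame["data_id"]))
        groups[frame["sync_id"]].append(frame["data_id"])

    indirect_edges = []
    for sync_id, members in groups.items():
        # Every pair of nodes in the same synchronization group is associated,
        # without distinguishing data types.
        for a, b in combinations(sorted(members), 2):
            indirect_edges.append((a, b, sync_id))
    return {"nodes": nodes, "direct_edges": direct_edges, "indirect_edges": indirect_edges}


frames = [
    {
        "data_id": "CAM_FRONT_20260306142530_001",
        "sync_id": "sync_id1",
        "generation_link": "raw data -> tag metadata file",
        "depends_on": [],  # raw camera data: no upstream dependency
    },
    {
        "data_id": "LIDAR_TOP_20260306142530_001",  # hypothetical point cloud ID
        "sync_id": "sync_id1",
        "generation_link": "raw data -> tag metadata file",
        "depends_on": [],
    },
    {
        "data_id": "FUSION_20260306142530_001",  # hypothetical derived frame
        "sync_id": "sync_id1",
        "generation_link": "image + point cloud -> fused result",
        "depends_on": ["CAM_FRONT_20260306142530_001", "LIDAR_TOP_20260306142530_001"],
    },
]
graph = build_lineage_graph(frames)
```

With three nodes in the group sync_id1, the sketch produces three indirect edges and two direct edges; a node without dependent data, such as the camera frame, contributes no incoming direct edge.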
[0079] In this way, the relationships between the various autonomous driving data are established through the data lineage graph. Subsequently, when accessing data, there is no need to traverse massive numbers of files: through the synchronization identity identifier or any data node, the complete scene data package can be obtained in a single operation. Furthermore, this access mode can provide complete multimodal input for autonomous driving algorithms, automatically supplying the model with image (visual), point cloud (spatial), and text (semantic) data from the same scene, which helps improve the performance of autonomous driving models.
[0080] When accessing data, as shown in Figure 2, the user sends an access request, which can carry standardized instructions such as the data's own ID and the synchronization ID. The complete scene data is then retrieved based on the data's own ID and the synchronization ID.
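The resolution of an access request into a complete scene data package can be sketched as follows. The in-memory index, its field names, and the point cloud and text records are hypothetical; a real system would resolve the request against the database and lineage graph.

```python
def get_scene_package(index, data_id=None, sync_id=None):
    """Return the storage paths of every record in the requested scene.

    The request may carry the data's own ID, the synchronization ID, or both.
    """
    if sync_id is None:
        if data_id is None:
            raise ValueError("the request must carry a data_id or a sync_id")
        # Resolve the scene from any single data node (KeyError if unknown).
        sync_id = index[data_id]["sync_id"]
    return sorted(
        record["storage_path"]
        for record in index.values()
        if record["sync_id"] == sync_id
    )


# Hypothetical index records: only IDs and paths are held, never file content.
index = {
    "CAM_FRONT_20260306142530_001": {
        "sync_id": "sync_id1",
        "storage_path": "AD_RD/CAM/20260306/TEST008/CAM_FRONT_20260306142530_001.jpg",
    },
    "LIDAR_TOP_20260306142530_001": {
        "sync_id": "sync_id1",
        "storage_path": "AD_RD/LIDAR/20260306/TEST008/LIDAR_TOP_20260306142530_001.bin",
    },
    "TEXT_NOTE_20260306142531_001": {
        "sync_id": "sync_id2",
        "storage_path": "AD_RD/TEXT/20260306/TEST008/TEXT_NOTE_20260306142531_001.txt",
    },
}
package = get_scene_package(index, data_id="CAM_FRONT_20260306142530_001")
```

Starting from the camera frame alone, the sketch returns both the image and the point cloud of the same scene, while data from other synchronization groups is left out.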
[0081] This invention utilizes a multimodal large model for automatic label recognition of autonomous driving data, replacing traditional manual labeling methods. This significantly improves the efficiency of entity label generation, reduces labeling costs and error rates, and enhances label management efficiency. It uniformly processes and stores each frame of autonomous driving data carrying entity labels according to a preset data format, achieving standardization and normalization of multi-source heterogeneous data. This allows data from different sources and of different types to be managed and used uniformly within the same system. Furthermore, it constructs data association relationships for each frame of stored autonomous driving data and accesses data based on these relationships. This eliminates the need for copying, migrating, or redundantly storing original data; target data can be quickly located, retrieved, and integrated solely through these relationships. This significantly reduces storage redundancy and improves data retrieval and access efficiency. Therefore, it meets the low-cost, high-efficiency, and high-reliability data closed-loop requirements of autonomous vehicles during mass production iterations. It provides stable, efficient, and high-quality data support for autonomous driving algorithms to solve long-tail scenario problems, which is beneficial for improving the intelligence and safety of autonomous driving systems.
[0082] Based on the same inventive concept as in the foregoing embodiments, this embodiment also provides an autonomous driving data management device. As shown in Figure 3, the device includes: The identification unit 31 is used to perform label recognition on autonomous driving data using a multimodal large model to obtain entity labels for each frame of autonomous driving data. The processing unit 32 is used to label each frame of driving data with corresponding entity tags, process the autonomous driving data of each frame carrying entity tags according to a preset data format, and store the processed autonomous driving data. The construction unit 33 is used to build data association relationships for each frame of stored autonomous driving data and to access data based on the data association relationships.
[0083] Since the apparatus described in the embodiments of this invention is used to implement the autonomous driving data management method of the embodiments of this invention, those skilled in the art can understand the specific structure and modifications of the apparatus based on the method described in the embodiments of this invention, and therefore will not be described in detail here. All apparatuses used in the methods of the embodiments of this invention fall within the scope of protection of this invention.
[0084] Based on the same inventive concept, this embodiment provides a computer device, including a memory, a processor, and a computer program stored in the memory and executable on the processor. When the processor executes the computer program, it implements any step of the method described above.
[0085] Based on the same inventive concept, this embodiment provides a computer-readable storage medium having a computer program stored thereon, which, when executed by a processor, implements the steps of any of the methods described above.
[0086] Through one or more embodiments of the present invention, the present invention has the following beneficial effects or advantages: This invention provides a method, apparatus, medium, and device for managing autonomous driving data. The method includes: using a pre-trained multimodal large model to perform label recognition on autonomous driving data, obtaining entity labels for each frame of autonomous driving data, and labeling each frame of driving data with the corresponding entity labels; processing each frame of autonomous driving data based on the entity labels and a preset data format, and storing the processed autonomous driving data; and constructing data association relationships for the stored frames of autonomous driving data, and accessing data based on the data association relationships. Thus, using a multimodal large model to automatically identify labels for autonomous driving data replaces the traditional manual labeling method, significantly improving the efficiency of entity label generation, reducing labeling costs and error rates, and improving label management efficiency. Each frame of autonomous driving data carrying entity tags is uniformly formatted and stored according to the preset data format, realizing the standardization and normalization of multi-source heterogeneous data, so that data from different sources and of different types can be managed and used uniformly under the same system. Data association relationships are constructed for each stored frame of autonomous driving data, and data access is performed based on the data association relationships. There is no need to copy, migrate, or redundantly store the original data. The target data can be quickly located, retrieved, and integrated simply through the association relationships, which significantly reduces storage redundancy and improves data retrieval and access efficiency.
This meets the low-cost, high-efficiency, and high-reliability data closed-loop requirements of autonomous vehicles in the mass production iteration stage, and provides stable, efficient, and high-quality data support for autonomous driving algorithms to solve long-tail scenario problems, which is conducive to improving the intelligence and safety of autonomous driving systems.
[0087] The algorithms and displays provided herein are not inherently related to any particular computer, virtual system, or other device. Various general-purpose systems can also be used in conjunction with the teachings herein. The required structure for constructing such systems is apparent from the above description. Furthermore, this invention is not directed to any particular programming language. It should be understood that the contents of the invention described herein can be implemented using various programming languages, and the above description of specific languages is for the purpose of disclosing the best mode of implementation of the invention.
[0088] Numerous specific details are set forth in the specification provided herein. However, it will be understood that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures, and techniques have not been shown in detail so as not to obscure the understanding of this specification.
[0089] Similarly, it should be understood that, in order to simplify this disclosure and aid in understanding one or more of the various aspects of the invention, in the above description of exemplary embodiments of the invention, various features of the invention are sometimes grouped together in a single embodiment, figure, or description thereof. However, this method of disclosure should not be construed as reflecting an intention that the claimed invention requires more features than are expressly recited in each claim. Rather, as reflected in the following claims, inventive aspects lie in fewer than all features of a single foregoing disclosed embodiment. Therefore, the claims following the detailed description are hereby expressly incorporated into this detailed description, wherein each claim itself is a separate embodiment of the invention.
[0090] Those skilled in the art will understand that modules in the device of the embodiments can be adaptively changed and placed in one or more devices different from that embodiment. Modules, units, or components in the embodiments can be combined into a single module, unit, or component, and further, they can be divided into multiple sub-modules, sub-units, or sub-components. Except where at least some of such features and / or processes or units are mutually exclusive, any combination can be used to combine all features disclosed in this specification (including the accompanying claims, abstract, and drawings) and all processes or units of any method or device so disclosed. Unless expressly stated otherwise, each feature disclosed in this specification (including the accompanying claims, abstract, and drawings) may be replaced by an alternative feature that serves the same, equivalent, or similar purpose.
[0091] Furthermore, those skilled in the art will understand that although some embodiments herein include certain features included in other embodiments but not others, combinations of features from different embodiments are intended to be within the scope of the invention and form different embodiments. For example, in the following claims, any of the claimed embodiments can be used in any combination.
[0092] The various component embodiments of the present invention can be implemented in hardware, or as software modules running on one or more processors, or a combination thereof. Those skilled in the art will understand that microprocessors or digital signal processors (DSPs) can be used in practice to implement some or all of the functions of some or all of the components of the gateway, proxy server, or system according to embodiments of the present invention. The present invention can also be implemented as a device or apparatus program (e.g., a computer program and computer program product) for performing part or all of the methods described herein. Such programs implementing the present invention can be stored on a computer-readable medium or can be in the form of one or more signals. Such signals can be downloaded from an Internet website, provided on a carrier signal, or provided in any other form.
[0093] It should be noted that the above embodiments are illustrative of the invention and not restrictive, and that those skilled in the art can devise alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses should not be construed as limiting the claims. The word "comprising" does not exclude the presence of elements or steps not listed in the claims. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention can be implemented by means of hardware comprising several different elements and by means of a suitably programmed computer. In the unit claims enumerating several means, several of these means may be embodied by the same item of hardware. The use of the words first, second, and third, etc., does not indicate any order. These words can be interpreted as names.
[0094] Although preferred embodiments of the invention have been described, those skilled in the art, upon learning the basic inventive concept, can make other changes and modifications to these embodiments. Therefore, the appended claims are intended to be interpreted as including both the preferred embodiments and all changes and modifications falling within the scope of the invention.
[0095] The above description is merely a preferred embodiment of the present invention and is not intended to limit the scope of protection of the present invention. Any modifications, equivalent substitutions, and improvements made within the spirit and principles of the present invention should be included within the scope of protection of the present invention.
Claims
1. A method for managing autonomous driving data, characterized in that, The method includes: Using a pre-trained multimodal large model, label recognition is performed on autonomous driving data to obtain entity labels for each frame of autonomous driving data, and corresponding entity labels are labeled for each frame of driving data. Based on the entity labels and preset data formats of each frame of autonomous driving data, data processing is performed on each frame of autonomous driving data, and the processed autonomous driving data is stored. Establish data associations for each frame of stored autonomous driving data, and access data based on these data associations.
2. The method as described in claim 1, characterized in that, Before using a pre-trained multimodal large model to perform label recognition on autonomous driving data, the method further includes: Perform data alignment operations on each frame of autonomous driving data; Based on the data input format of a multimodal large model, the format of each frame of autonomous driving data after alignment is converted.
3. The method as described in claim 2, characterized in that, Before using a pre-trained multimodal large model to perform label recognition on autonomous driving data, the method further includes: Based on the label types required for training the autonomous driving algorithm, the corresponding label extraction prompt words are determined for the multimodal large model.
4. The method as described in claim 3, characterized in that, The method of using a pre-trained multimodal large model to perform label recognition on autonomous driving data includes: Each frame of autonomous driving data after format conversion is jointly encoded to convert different types of autonomous driving data into corresponding initial feature vectors; A cross-modal attention mechanism is used to fuse the initial feature vectors to obtain the fused feature vectors corresponding to the initial feature vectors. Based on the fused feature vector, the multimodal large model is used to perform scene understanding on the core elements of the autonomous driving scenario, and the scene understanding result is obtained. Under the constraints of the tag extraction prompts, entity labels for each frame of autonomous driving data are output based on the scene understanding results.
5. The method as described in claim 1, characterized in that, The preset data format is the OneData integrated data format; the data processing of each frame of autonomous driving data based on the entity tags of each frame of autonomous driving data and the preset data format includes: For each frame of autonomous driving data, the entity tag is integrated with the metadata of the autonomous driving data to obtain a tag metadata file that conforms to the preset data format; the metadata includes: its own identity identifier, storage path, collection time, and synchronization identity identifier; The raw data of the autonomous driving data is bound to the tag metadata file to obtain the bound data; The hierarchical directory structure is determined according to the preset data format, and the bound data is written to the corresponding directory according to the hierarchical directory structure.
6. The method as described in claim 5, characterized in that, The step of binding the raw data of the autonomous driving data with the tag metadata file to obtain bound data includes: The target file name is generated according to a preset file name format, which includes data type, acquisition location, acquisition time, and frame number information. Configure the same target file name for the raw data of the autonomous driving data and the corresponding tag metadata file to complete the binding of the raw data and the tag metadata file.
7. The method as described in claim 1, characterized in that, The process of constructing data associations for each stored frame of autonomous driving data includes: Obtain the lineage information of each frame of autonomous driving data that has been stored. The lineage information includes the generation link of the autonomous driving data and the identity identifier of the dependent data of the autonomous driving data. The identity identifier of the dependent data is the identity identifier of the upstream data that the autonomous driving data frame depends on for generating the autonomous driving data frame. Using the stored frames of autonomous driving data as nodes, attribute information is labeled for the nodes based on the generation link of the autonomous driving data, and direct lineage association edges of the nodes are constructed based on the dependent data identity identifiers of the autonomous driving data. Extract the synchronization identity identifier from each frame of autonomous driving data, group each frame of autonomous driving data based on the synchronization identity identifier, and construct indirect lineage association edges for each node located in the same group; The data lineage graph of the autonomous driving data is determined based on the node, the direct lineage association edge of the node, and the indirect lineage association edge of the node. The data lineage graph is used to present the association relationship between each frame of autonomous driving data.
8. A device for managing autonomous driving data, characterized in that, The device includes: The recognition unit is used to perform label recognition on autonomous driving data using a pre-trained multimodal large model to obtain entity labels for each frame of autonomous driving data. The processing unit is used to label each frame of driving data with corresponding entity tags, process each frame of autonomous driving data based on the entity tags of various autonomous driving data and the preset data format, and store the processed autonomous driving data. The construction unit is used to build data associations for each frame of stored autonomous driving data and to access data based on the data associations.
9. A computer-readable storage medium having a computer program stored thereon, characterized in that, When executed by a processor, the program implements the steps of the method according to any one of claims 1-7.
10. A computer device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that, When the processor executes the program, it implements the steps of the method according to any one of claims 1-7.