Multimodal data retrieval method, device, equipment and computer readable storage medium
By constructing a vector database and a label database, and utilizing multilayer perceptron and contrastive learning techniques, multimodal data is mapped to a unified high-dimensional vector space, solving the problem of accurate retrieval of multimodal data in intelligent transportation and achieving efficient cross-modal data retrieval.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- TUS CLOUD CONTROL (BEIJING) TECH LTD
- Filing Date
- 2026-02-11
- Publication Date
- 2026-06-30
AI Technical Summary
Existing multimodal data retrieval methods struggle to achieve accurate retrieval across modalities and multiple conditions, especially in intelligent transportation and vehicle-road-cloud integration, where rapid retrieval of massive, multi-source data is difficult to meet industry demands.
By constructing a vector database and a label database, multilayer perceptron (MLP) is used to map data from different modalities to a unified high-dimensional vector space. Through contrastive learning and multi-task joint training, accurate retrieval of cross-modal data is achieved.
It enables efficient and accurate retrieval of cross-modal data without relying on manually labeled text information, thus improving retrieval efficiency and accuracy.
Smart Images

Figure CN122309780A_ABST
Abstract
Description
Technical Field
[0001] This disclosure relates to the field of retrieval, and more particularly to the field of multimodal data retrieval technology.
Background Technology
[0002] With the development of intelligent transportation and vehicle-road-cloud integration, cloud control platforms can collect traffic scene data from multiple sources, including roadside videos, vehicle-mounted videos, and roadside sensing terminals. This data often includes multiple modalities such as images, videos, and text descriptions, and is massive in scale. How to quickly retrieve valuable scene data from this massive amount of raw data is a core requirement in the industry.
[0003] Existing retrieval methods rely on manually annotated text information, which is insufficient for expressing complex semantics and often fails in the face of multimodal data. Therefore, how to efficiently achieve accurate retrieval across modalities and multiple conditions has become a pain point in current technological development and a problem that the industry urgently needs to solve.
Summary of the Invention
[0004] This disclosure provides a multimodal data retrieval method, apparatus, device, and storage medium.
[0005] According to a first aspect of this disclosure, a multimodal data retrieval method is provided. The method includes: acquiring key information for retrieval;
[0006] Based on the key information used for retrieval, the target feature vector and target label are retrieved from the first preset database; based on the target feature vector and the target label, the corresponding target multimodal data is retrieved from the second preset database.
[0007] In addition to the aspects and any possible implementations described above, a further implementation is provided in which the first preset database includes a vector database and a tag database. The step of retrieving the target feature vector and target label from the first preset database based on the key information used for retrieval includes: parsing the key information to obtain the corresponding user feature vector and user tag used for retrieval; searching the vector database based on the user feature vector to obtain the target feature vector; and searching the tag database based on the user tag to obtain the target tag.
[0008] In addition to the aspects and any possible implementations described above, a further implementation is provided, wherein the step of retrieving the target feature vector from the vector database based on the user feature vector and retrieving the target tag from the tag database based on the user tag includes:
first, searching the vector database based on the user feature vector to obtain the target feature vector and the corresponding candidate identifier set; then, within the candidate identifier set, querying the tag database for the target identifier set and target label that correspond to both the target feature vector and the user tag;
or
first, searching the tag database based on the user tag to obtain the target tag and its corresponding candidate identifier set; then, querying the vector database for the target identifier set and target feature vector that correspond to both the target tag and the user feature vector.
[0009] In addition to the aspects and any possible implementations described above, a further implementation is provided in which the first preset database is trained through the following steps: Collect multimodal data for training; The multimodal data to be trained is cleaned and normalized to obtain preprocessed data; The preprocessed data is mapped to a unified high-dimensional vector; The unified high-dimensional vector is stored in the vector database of the first preset database.
[0010] In addition to the aspects and any possible implementations described above, a further implementation is provided, wherein the method further includes: calculating the similarity between the unified high-dimensional vectors; based on that similarity, calculating the current contrastive loss value corresponding to each unified high-dimensional vector; and, based on the current contrastive loss value, determining whether it is necessary to adjust the similarity between the unified high-dimensional vectors.
[0011] In addition to the aspects and any possible implementations described above, a further implementation is provided, wherein multiple optimization tasks are performed on the multimodal data to be trained to obtain the loss value corresponding to each optimization task; Based on the loss value corresponding to each optimization task, the total loss value corresponding to the multiple optimization tasks is obtained; Based on the total loss value, adjust the model optimization parameters corresponding to each optimization task.
[0012] In addition to the aspects and any possible implementations described above, a further implementation is provided in which the first preset database is trained through the following steps: Collect multimodal data for training; The multimodal data to be trained is cleaned and normalized to obtain preprocessed data; The video data in the preprocessed data is subjected to VLM analysis to generate multi-dimensional labels; The multi-dimensional labels are stored in the label database of the first preset database.
[0013] According to a second aspect of this disclosure, a multimodal data retrieval apparatus is provided. The apparatus includes: an acquisition module for acquiring key information for retrieval; The first retrieval module is used to retrieve the target feature vector and target label from the first preset database based on the key information used for retrieval. The second retrieval module is used to retrieve the corresponding target multimodal data in a second preset database based on the target feature vector and the target label.
[0014] According to a third aspect of this disclosure, an electronic device is provided. The electronic device includes a memory and a processor, wherein the memory stores a computer program, and the processor executes the program to implement the method described above.
[0015] According to a fourth aspect of this disclosure, a computer-readable storage medium is provided having a computer program stored thereon that, when executed by a processor, implements the method according to a first aspect of this disclosure.
[0016] In this disclosure, by obtaining key information for retrieval, target feature vectors and target labels can be retrieved in a first preset database based on the key information for retrieval. Then, based on the target feature vectors and target labels, corresponding target multimodal data can be retrieved in a second preset database. In this way, after inputting the key information for retrieval, cross-modal accurate retrieval can be achieved efficiently, without relying on manually labeled text information for retrieval, thereby improving the accuracy and effectiveness of cross-modal data retrieval.
[0017] It should be understood that the description in the Summary of the Invention is not intended to limit the key or essential features of the embodiments of this disclosure, nor is it intended to restrict the scope of this disclosure. Other features of this disclosure will become readily apparent from the following description.
Attached Figure Description
[0018] The above and other features, advantages, and aspects of the embodiments of this disclosure will become more apparent from the accompanying drawings and the following detailed description. The drawings are provided for a better understanding of the invention and are not intended to limit the scope of this disclosure. In the drawings, the same or similar reference numerals denote the same or similar elements, wherein:
Figure 1 shows a flowchart of a multimodal data retrieval method according to an embodiment of the present disclosure;
Figure 2 shows a flowchart of another multimodal data retrieval method according to an embodiment of the present disclosure;
Figure 3 shows a block diagram of a multimodal data retrieval apparatus according to an embodiment of the present disclosure;
Figure 4 shows a block diagram of an exemplary electronic device capable of implementing embodiments of the present disclosure.
Detailed Implementation
[0019] To make the objectives, technical solutions, and advantages of the embodiments of this disclosure clearer, the technical solutions of the embodiments of this disclosure will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of this disclosure, and not all embodiments. Based on the embodiments of this disclosure, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of this disclosure.
[0020] Furthermore, the term "and / or" in this article is merely a description of the relationship between related objects, indicating that three relationships can exist. For example, A and / or B can represent: A existing alone, A and B existing simultaneously, or B existing alone. Additionally, the character " / " in this article generally indicates that the preceding and following related objects have an "or" relationship.
[0021] Figure 1 shows a flowchart of a multimodal data retrieval method 100 according to an embodiment of the present disclosure. Method 100 may include:
Step 110: Obtain key information for retrieval. The key information can include search criteria, the scenario type, and one or more words, phrases, or sentences.
[0022] Step 120: Retrieve the target feature vector and target label from the first preset database based on the key information used for retrieval.
Step 130: Based on the target feature vector and the target label, retrieve the corresponding target multimodal data from the second preset database.
[0023] By acquiring key information for retrieval, target feature vectors and target labels can be retrieved in a first preset database based on the key information for retrieval. Then, based on the target feature vectors and target labels, corresponding target multimodal data can be retrieved in a second preset database. In this way, after inputting the key information for retrieval, accurate cross-modal retrieval can be achieved efficiently, without relying on manually labeled text information for retrieval, thus improving the accuracy and effectiveness of cross-modal data retrieval.
[0024] In some embodiments, the first preset database includes a vector database and a tag database. The step of retrieving the target feature vector and target label from the first preset database based on the key information used for retrieval includes: parsing the key information to obtain the corresponding user feature vector and user tag used for retrieval; searching the vector database based on the user feature vector to obtain the target feature vector; and searching the tag database based on the user tag to obtain the target tag.
[0025] By automatically parsing the key information, the corresponding user feature vector and user tag for retrieval can be obtained. Searching the vector database based on the user feature vector then yields a target feature vector that is highly similar to the user feature vector, and searching the tag database based on the user tag yields a target tag that is highly similar to, or closely matches, the user tag.
[0026] In some embodiments, the step of retrieving the target feature vector from the vector database based on the user feature vector and retrieving the target tag from the tag database based on the user tag includes:
first, searching the vector database based on the user feature vector to obtain the target feature vector and the corresponding candidate identifier set; then, within the candidate identifier set, querying the tag database for the target identifier set and target label that correspond to both the target feature vector and the user tag;
or
first, searching the tag database based on the user tag to obtain the target tag and its corresponding candidate identifier set; then, querying the vector database for the target identifier set and target feature vector that correspond to both the target tag and the user feature vector.
[0027] Each original data sample can be assigned a unique identifier (data_id), which is used throughout the entire data flow to ensure a one-to-one correspondence between the mapped feature vector, the generated tags, and the original material. The vector database (FAISS) stores "feature vector - ID" pairs, and the tag database (Elasticsearch) stores "tag set - ID" pairs. Thus, the vector database can be searched first based on the user feature vector to obtain the target feature vectors similar to it and the candidate identifier set corresponding to them. Then, within the candidate identifier set, the tag database is queried for the target identifier set that matches both the target feature vector and the user tag, together with the target tags corresponding to that set. That is, the search is performed first by feature vector and then by tag.
Alternatively, the tag database can be searched first based on the user tag to obtain the target tag and the candidate identifier set composed of the identifiers corresponding to the target tag; then, the vector database is queried for the target identifier set that matches both the target tag and the user feature vector, together with the target feature vectors corresponding to that set.
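The two retrieval orders described above can be sketched as follows. This is a minimal illustration only: the source names FAISS and Elasticsearch, but here plain in-memory dictionaries stand in for both stores, and the sample IDs, vectors, tags, the `top_k` parameter, and the "every user tag must be present" matching rule are all hypothetical choices made for the example.

```python
import numpy as np

# Hypothetical in-memory stand-ins for the vector database and tag database.
vector_store = {            # data_id -> feature vector
    "d1": np.array([1.0, 0.0, 0.0]),
    "d2": np.array([0.9, 0.1, 0.0]),
    "d3": np.array([0.0, 1.0, 0.0]),
}
tag_store = {               # data_id -> tag set
    "d1": {"night", "pedestrian"},
    "d2": {"day", "pedestrian"},
    "d3": {"night", "pedestrian"},
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def vector_then_tag(query_vec, user_tags, top_k=2):
    """Stage 1: the top-k most similar vectors give the candidate ID set.
    Stage 2: keep only candidates whose tags contain every user tag."""
    ranked = sorted(vector_store,
                    key=lambda i: cosine(query_vec, vector_store[i]),
                    reverse=True)
    candidates = ranked[:top_k]
    return [i for i in candidates if user_tags <= tag_store[i]]

def tag_then_vector(query_vec, user_tags, top_k=2):
    """Stage 1: the tag filter gives the candidate ID set.
    Stage 2: rank those candidates by vector similarity."""
    candidates = [i for i in tag_store if user_tags <= tag_store[i]]
    ranked = sorted(candidates,
                    key=lambda i: cosine(query_vec, vector_store[i]),
                    reverse=True)
    return ranked[:top_k]

query = np.array([1.0, 0.0, 0.0])
ids_vector_first = vector_then_tag(query, {"night"})
ids_tag_first = tag_then_vector(query, {"night"})
```

The two orders can return different result sets: filtering by tag after a top-k vector search can only shrink the candidate set, while filtering by tag first ranks every tag-matching sample.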
[0028] In some embodiments, the first preset database is obtained through the following steps: collecting multimodal data for training; and cleaning and normalizing the multimodal data to be trained to obtain preprocessed data. Multimodal training data such as images, videos, and text can be collected and then cleaned and normalized.
[0029] The preprocessed data is mapped to a unified high-dimensional vector; The unified high-dimensional vector is stored in the vector database of the first preset database.
[0030] This approach maps data from different modalities (such as images, text, and videos) into a common semantic space, making comparison and matching between different modalities possible. This method introduces a mapping function (Multilayer Perceptron, MLP) to transform features from different modalities into the same space, achieving a unified semantic representation.
[0031] Suppose there is a mapping function M, a multilayer perceptron (MLP), that projects the original feature vectors v extracted by each modality encoder into a unified shared space z:
$z^{I} = M(v^{I})$ (image feature mapping)
$z^{T} = M(v^{T})$ (text feature mapping)
$z^{V} = M(v^{V})$ (video feature mapping)
In some embodiments, the method further includes: calculating the similarity between the unified high-dimensional vectors; and, based on that similarity, calculating the current contrastive loss value corresponding to each unified high-dimensional vector.
[0032] Here $z_i^{I}$ and $z_i^{T}$ are the unified high-dimensional vectors corresponding to the vector representations of the image and the text (feature vectors extracted by the encoder), and $z_j^{T}$ is the unified high-dimensional vector corresponding to the feature vector of the j-th text. The current contrastive loss value is
$$\mathcal{L}_{con} = -\log \frac{\exp\left(\mathrm{sim}(z_i^{I}, z_i^{T}) / \tau\right)}{\sum_{j=1}^{N} \exp\left(\mathrm{sim}(z_i^{I}, z_j^{T}) / \tau\right)}$$
where N is the total number of texts, τ is the temperature coefficient, and $\mathrm{sim}(z_i^{I}, z_j^{T})$ represents the similarity between the feature vector of the image and the feature vector of the text.
[0033] Based on the current contrastive loss value, it is determined whether it is necessary to adjust the similarity between the unified high-dimensional vectors.
[0034] The similarity between the unified high-dimensional vectors can be calculated using cosine similarity. Then, a contrastive learning method can be used to calculate the current contrastive loss value corresponding to each unified high-dimensional vector. Based on the current contrastive loss value, it can be determined whether the similarity between the high-dimensional vectors needs to be adjusted, which helps ensure that the similarities between them are accurate.
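As an illustration of the cosine-similarity and contrastive-loss computation described above, the following sketch uses an InfoNCE-style image-to-text loss, which is one common form of contrastive loss and is assumed here rather than confirmed by the source. The function names, the temperature default `tau=0.07`, and the convention that image i and text i form the positive pair are assumptions for the example.

```python
import numpy as np

def cosine_similarity_matrix(z_img, z_txt):
    """Entry (i, j) is the cosine similarity of image i and text j."""
    a = z_img / np.linalg.norm(z_img, axis=1, keepdims=True)
    b = z_txt / np.linalg.norm(z_txt, axis=1, keepdims=True)
    return a @ b.T

def contrastive_loss(z_img, z_txt, tau=0.07):
    """Image-to-text InfoNCE-style loss; pair (i, i) is assumed positive.
    Returns one loss value per image vector."""
    logits = cosine_similarity_matrix(z_img, z_txt) / tau
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.diag(log_prob)

z_img = np.array([[1.0, 0.0], [0.0, 1.0]])
aligned_txt = np.array([[1.0, 0.0], [0.0, 1.0]])   # each text matches its image
swapped_txt = np.array([[0.0, 1.0], [1.0, 0.0]])   # texts paired with the wrong image

loss_aligned = contrastive_loss(z_img, aligned_txt)
loss_swapped = contrastive_loss(z_img, swapped_txt)
```

With the aligned pairs the per-image loss is close to zero, while the swapped pairs produce a much larger loss, which is the signal that would trigger an adjustment.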
[0035] In some embodiments, multiple optimization tasks are performed on the multimodal data to be trained to obtain the loss value corresponding to each optimization task. The multiple optimization tasks can be as follows:
1) Image-text contrastive learning: Optimize the similarity between images and text.
[0036] 2) Video description generation: Generate natural language descriptions for videos.
[0037] 3) Visual Question Answering (VQA): Given an image and a question, generate an answer.
[0038] 4) Image generation: Generate corresponding images from text.
[0039] Based on the loss value corresponding to each optimization task, the total loss value corresponding to the multiple optimization tasks is obtained; Based on the total loss value, adjust the model optimization parameters corresponding to each optimization task.
[0040] By performing multiple optimization tasks on the multimodal data to be trained, the loss value corresponding to each optimization task can be obtained. Then, by weighted summation of the loss values corresponding to each optimization task, the total loss value corresponding to the multiple optimization tasks can be obtained. Furthermore, the model optimization parameters corresponding to each optimization task can be flexibly adjusted according to whether the total loss value is too large.
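[0041] For example, the total loss value $\mathcal{L}_{total}$ can be calculated as follows:
$$\mathcal{L}_{total} = \lambda_1 \mathcal{L}_{con} + \lambda_2 \mathcal{L}_{cap} + \lambda_3 \mathcal{L}_{vqa}$$
[0042] where $\mathcal{L}_{con}$ is the loss value obtained after performing image-text contrastive learning on the multimodal data to be trained, $\mathcal{L}_{cap}$ is the loss value obtained after performing video description generation on the multimodal data to be trained, $\mathcal{L}_{vqa}$ is the loss value obtained after performing visual question answering on the multimodal data to be trained, and $\lambda_1$, $\lambda_2$, $\lambda_3$ are the weights of the weighted summation.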
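The weighted summation of per-task losses can be sketched as below. The task names, loss values, and weights are invented for illustration; the source does not state how the weights are chosen.

```python
def total_loss(task_losses, weights):
    """Weighted sum of per-task loss values.
    Both arguments map a task name to a float."""
    missing = set(task_losses) - set(weights)
    if missing:
        raise KeyError(f"no weight given for tasks: {sorted(missing)}")
    return sum(weights[name] * value for name, value in task_losses.items())

# Hypothetical loss values from one training step.
losses = {"contrastive": 0.8, "caption": 1.5, "vqa": 1.2}
weights = {"contrastive": 1.0, "caption": 0.5, "vqa": 0.5}

total = total_loss(losses, weights)   # 0.8*1.0 + 1.5*0.5 + 1.2*0.5
```

In a real training loop the single scalar `total` would be what is back-propagated, so that all task heads and the shared encoder are updated together.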
[0043] In some embodiments, the first preset database is obtained through the following steps: Collect multimodal data for training; The multimodal data to be trained is cleaned and normalized to obtain preprocessed data; The video data in the preprocessed data is subjected to VLM analysis to generate multi-dimensional labels; The multi-dimensional labels are stored in the label database of the first preset database.
[0044] For example, the multimodal data to be trained is as follows: A vehicle is traveling at a constant speed in the right lane of an urban road. A bus is parked at a bus stop on the right, completely obstructing the right-side view. Just as the vehicle is about to be parallel to the bus, a pedestrian suddenly rushes out from in front of the bus, attempting to cross the road. The driver brakes sharply, and the vehicle stops just before hitting the pedestrian. The pedestrian, startled, quickly runs across the lane.
[0045] The semantic tag set (Chinese) generated after VLM analysis can be as follows:
Objects: car (this vehicle), bus (stationary), pedestrian (suddenly appearing), bus stop, city road, zebra crossing (far away).
Scene: city street, bus stop, limited field of view.
Actions: driving at a constant speed, stopping at a stationary position, suddenly rushing out, emergency braking, vehicle coming to a sudden stop, running across.
Attributes: blind spot, high risk, rapid reaction, extremely dangerous.
Relations: a bus obstructs the right-side view of the vehicle, and a pedestrian appears in front of the bus, resulting in a path conflict between the vehicle and the pedestrian.
Events: a near-miss incident in which a pedestrian suddenly emerged from a blind spot; a collision was avoided.
High-Level Semantics / Summary: "This vehicle encountered a typical 'ghost pedestrian' scenario while driving. A parked bus ahead created a large blind spot, and a pedestrian suddenly darted out from in front of the bus to cross the road. The driver took emergency braking measures and successfully stopped the vehicle before a collision, avoiding a serious traffic accident. The driving lesson from this scenario is to slow down and prepare to brake in advance in areas with limited visibility."
In one embodiment, if the target multimodal data retrieved from the second preset database based on the target feature vector and the target label includes multiple sets of data, then the final ranking score corresponding to each set of data is obtained based on the similarity between the target feature vector and the user feature vector corresponding to each set of data, the matching degree between the target label and the user label corresponding to each set of data, and the timeliness of each set of data.
The multiple sets of data are sorted according to the final ranking score corresponding to each set of data, and the data sets whose final ranking score is greater than the preset ranking score, or whose final ranking score is among the highest, are output.
[0046] The formula for calculating the final ranking score is as follows:
$$\mathrm{Score}_j = \alpha \cdot \mathrm{Sim}(q, v_j) + \beta \cdot \mathrm{Match}(\mathcal{T}_q, \mathcal{T}_j) + \gamma \cdot \mathrm{Time}(t_j)$$
[0047] where:
$\mathrm{Score}_j$ represents the final ranking score of data sample j (i.e., the j-th group of data); $t_j$ represents the timestamp of data sample j; $\mathcal{T}_j$ represents the labels of data sample j.
Vector similarity $\mathrm{Sim}(q, v_j)$: the cosine similarity between the user query vector $q$ and the data feature vector $v_j$ stored in the database (i.e., the feature vector corresponding to the j-th group of data), which represents their semantic similarity.
Tag matching degree $\mathrm{Match}(\mathcal{T}_q, \mathcal{T}_j)$: the degree of overlap between the tags $\mathcal{T}_q$ in the user-input business filter conditions (such as "rainy day", "accident") and the original data's built-in labels $\mathcal{T}_j$ (i.e., the target labels corresponding to the j-th group of data).
Timeliness score $\mathrm{Time}(t_j)$: based on the timestamp $t_j$; the function generally scores higher for newer data (closer to the current time). This aligns with the transportation industry's need to "focus on real-time or recent events."
[0048] $\mathrm{Score}_j$ is the final ranking score. The system sorts all candidate results in descending order of this total score, so that the top-ranked results conform to the query semantics, satisfy the business tag constraints, and are the most recent in time. $\alpha$, $\beta$, and $\gamma$ are weight parameters: $\alpha$ is the semantic similarity weight, $\beta$ is the tag matching weight, and $\gamma$ is the timeliness weight. They can be determined through business scenarios or learning mechanisms. For example, in an incident tracing scenario, $\gamma$ might be set to a higher weight to prioritize displaying recently occurred incidents; in a typical-violation research scenario, $\beta$ might be set relatively large to ensure that the retrieved samples fully conform to the defined set of violation labels.
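A minimal sketch of the weighted ranking score follows. The source specifies only that the score combines vector similarity, tag overlap, and timeliness; the particular overlap measure (fraction of user tags present), the exponential-decay timeliness function with a one-day half-life, the weight values, and all sample data are assumptions made for this example.

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def tag_match(user_tags, data_tags):
    """Fraction of user tags present on the sample (one possible overlap measure)."""
    return len(user_tags & data_tags) / len(user_tags) if user_tags else 1.0

def timeliness(timestamp, now, half_life_s=86400.0):
    """Exponential decay: 1.0 for brand-new data, 0.5 after one half-life."""
    age = max(0.0, now - timestamp)
    return 0.5 ** (age / half_life_s)

def final_score(query_vec, user_tags, sample, now,
                alpha=0.6, beta=0.3, gamma=0.1):
    """Weighted sum of semantic similarity, tag match, and timeliness."""
    return (alpha * cosine(query_vec, sample["vector"])
            + beta * tag_match(user_tags, sample["tags"])
            + gamma * timeliness(sample["timestamp"], now))

now = 1_000_000.0
samples = [
    {"id": "old_exact", "vector": np.array([1.0, 0.0]),
     "tags": {"rainy day", "accident"}, "timestamp": now - 10 * 86400},
    {"id": "new_exact", "vector": np.array([1.0, 0.0]),
     "tags": {"rainy day", "accident"}, "timestamp": now},
    {"id": "new_offtopic", "vector": np.array([0.0, 1.0]),
     "tags": {"sunny"}, "timestamp": now},
]
query_vec = np.array([1.0, 0.0])
user_tags = {"rainy day", "accident"}

ranked = sorted(samples,
                key=lambda s: final_score(query_vec, user_tags, s, now),
                reverse=True)
ranked_ids = [s["id"] for s in ranked]
```

Raising `gamma` relative to `alpha` and `beta` favors recent samples, while raising `beta` favors strict conformance to the requested tags.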
[0049] This invention proposes a multimodal data retrieval system based on a vector database and a tag set. The system architecture diagram illustrates the entire chain from "input → encoding → vector space → tag generation → vector database + tag set → retrieval re-ranking → output," as shown in Figure 2.
[0050] This multimodal data retrieval system includes the following modules: Data acquisition and preprocessing module: Acquires multimodal traffic scene data such as images, videos, and text, and performs cleaning and normalization processing.
[0051] Feature extraction and unified vector space mapping module: Through deep learning models (image encoder, video encoder, text encoder), multimodal data is mapped to a unified high-dimensional semantic space.
[0052] The label generation module (based on the VLM model) uses the Vision-Language Model to perform semantic understanding of multimodal data, aligns the label set, and generates multi-dimensional labels (light, weather, temporary roadblocks, driving area, intersection conditions, road conditions, etc.).
[0053] Vector database and tag set storage module: Stores high-dimensional vectors in the vector database, and stores tag information as a structured field to support dual-index retrieval.
[0054] Search and weighted ranking module: The user inputs text / image / conditional query, the system performs similarity search based on the vector database, filters and weights the results based on the tag conditions, and outputs the final result.
[0055] (II) Technical Process
1. Multimodal input representation
For multimodal data such as images, videos, and text, the CLIP deep learning model is used to extract semantic features; each data sample is mapped to a unified high-dimensional vector space, denoted as:
$$v = f(x), \quad v \in \mathbb{R}^{d}$$
[0056] where $x$ represents the raw data (images, videos, text), $f$ represents the feature extraction model, and $v$ is the resulting d-dimensional vector.
[0057] Accordingly:
Image I: the feature vector is obtained via the image encoder $f_{img}$: $v^{I} = f_{img}(I)$
[0058] Video V: the sampled frame sequence $\{F_1, \dots, F_K\}$ is passed through the image encoder and the temporal aggregation model g: $v^{V} = g\left(f_{img}(F_1), \dots, f_{img}(F_K)\right)$
[0059] Text T: the vector representation is obtained via the text encoder $f_{txt}$: $v^{T} = f_{txt}(T)$
[0060] where I, V, and T represent the raw data (image, video, and text), $f_{img}$ and $f_{txt}$ represent the feature extraction models, and each $v$ is a d-dimensional vector.
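The video branch (encode each sampled frame, then aggregate over time) can be sketched as follows. The source names CLIP as the encoder; to stay self-contained, this example substitutes a fixed random linear projection as a placeholder image encoder and uses mean pooling as the temporal aggregation model g. Both substitutions, the dimension `D`, and the frame shape are assumptions.

```python
import numpy as np

D = 8                                      # embedding dimension d (illustrative)
rng = np.random.default_rng(0)
W_IMG = rng.normal(size=(D, 3 * 4 * 4))    # placeholder image-encoder weights

def encode_image(image):
    """Placeholder image encoder: flatten, project, L2-normalise."""
    v = W_IMG @ image.reshape(-1)
    return v / np.linalg.norm(v)

def encode_video(frames):
    """Encode every sampled frame, then aggregate over time by mean pooling."""
    per_frame = np.stack([encode_image(f) for f in frames])
    v = per_frame.mean(axis=0)
    return v / np.linalg.norm(v)

frames = rng.random(size=(5, 3, 4, 4))     # 5 sampled frames of shape 3x4x4
v_image = encode_image(frames[0])
v_video = encode_video(frames)
```

Mean pooling ignores frame order, so it cannot distinguish a sequence from its reverse; an order-aware aggregator would be needed to capture motion such as a lane change.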
[0061] 2. Unified vector space alignment
2.1 Core Technical Challenges
Large modal differences: images are pixel matrices, videos are time-series frame sequences, and text is symbol sequences, so the feature distributions differ greatly; directly mapping them to the same space will result in "semantic misalignment".
[0062] Cross-modal semantic gap: text expresses semantic abstraction ("nighttime traffic congestion at intersections"), while images / videos express concrete visual objects (car lights, traffic flow). Aligning "abstract descriptions" and "visual scenes" in semantic space is a challenge.
[0063] Time series information modeling: videos not only have spatial characteristics but also contain temporal changes; if keyframes are simply used as replacements, dynamic semantics (such as "vehicle changing lanes") may be lost.
[0064] Search efficiency and scale: massive traffic data needs to support low-latency retrieval; the model needs to strike a balance between accuracy and efficiency.
[0065] 2.2 Implementation Method
2.2.1 Cross-modal transformation is the structural foundation: it establishes the physical mapping paths, mapping features of different modalities to the same shared space through mapping functions such as multilayer perceptrons (MLP).
[0066] 2.2.2 Contrastive learning is the core training objective: using the contrastive loss function as the objective, it ensures that the mapped features are semantically consistent in the space.
[0067] 2.2.3 Multi-task joint training is an enhancement strategy: by simultaneously training multiple tasks (such as VQA, description generation, etc.) and summing their loss functions into a total loss, the features learned by the model become more generalizable.
[0068] Examples are provided to illustrate each step:
1. Cross-modal transformation (2.2.1): "Establishing a common language"
Example: The AI is faced with two types of information: a video of "a bus blocking the view and a pedestrian rushing out", and a text describing the dangerous "ghost peeking out" situation, in which a pedestrian suddenly emerges from a blind spot.
[0069] Function: Since video consists of pixels and text consists of characters, they are inherently different. The MLP mapping function acts like a translator, converting pixels into a set of numbers (feature vectors) and text into a set of numbers, allowing them to enter the same unified semantic space.
[0070] 2. Contrastive learning (2.2.2): "Pull the right pairs together, push the wrong pairs apart"
Example: The AI is presented with a "ghost peeking out" video, a "ghost peeking out" text, and a random "smooth passage on a sunny day" text.
[0071] Function: The contrastive loss function gives commands to the AI like a coach: "Bring the positive samples, the 'ghost peeking out' video and the 'ghost peeking out' text, closer together; push the negative sample, the 'sunny day' text, further away."
[0072] Effect: Through continuous adjustments, in the future, if you search for "ghost peeking out," the AI will be able to accurately locate that thrilling video in the space.
[0073] 3. Multi-task joint training (2.2.3) – “Developing all-around athletes” Example: In order to make the AI fully understand the scene, we not only let it do matching exercises (contrastive learning), but also let it write essays (video description generation, description: a pedestrian suddenly appears) and do question-and-answer questions (VQA, Q: Who is blocking the view? A: The bus).
[0074] Function: The total loss function sums up the scores from these exercises. An AI trained in this way has a stronger "comprehension" ability: it does not just memorize videos by rote, but actually learns the logical relationship between "occlusion" and "conflict".
[0075] Step 2.2.1, as the underlying implementation, first maps the original features of different modalities to the same shared semantic space, so that the data of different modalities have a mathematical basis for comparability.
[0076] Cross-modal transformation maps data from different modalities (such as images, text, and videos) into a common semantic space, making comparison and matching between different modalities possible. This approach introduces a mapping function (Multilayer Perceptron, MLP) to transform features from different modalities into the same space, thereby achieving a unified semantic representation.
[0077] Let there be a mapping function M, a multilayer perceptron (MLP), that projects the original feature vectors v extracted by each modality encoder into a unified shared space z:
$z^{I} = M(v^{I})$ (image feature mapping)
$z^{T} = M(v^{T})$ (text feature mapping)
$z^{V} = M(v^{V})$ (video feature mapping)
Association mechanism: during the feature extraction and transformation stage, the system assigns a unique identifier (data_id) to each original data sample. This ID is used throughout the entire data flow to ensure a one-to-one correspondence between the mapped feature vector z, the generated label set, and the original material.
[0078] 2.2.2 Contrastive learning (a common alignment method): defining semantic alignment criteria
Based on the unified space z established in 2.2.1, this step uses specific algorithmic rules to force semantically consistent vectors of different modalities to move closer together in the space.
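The MLP projection and the data_id association can be sketched as below. This is an untrained two-layer perceptron with random weights, shown only to illustrate the data flow; the layer sizes, the choice of a single shared MLP for all modalities, the L2 normalisation of the output, and the sample IDs are assumptions for the example.

```python
import numpy as np

def make_mlp(in_dim, hidden_dim, out_dim, rng):
    """Return randomly initialised parameters of a two-layer perceptron."""
    return {
        "W1": rng.normal(scale=0.1, size=(in_dim, hidden_dim)),
        "b1": np.zeros(hidden_dim),
        "W2": rng.normal(scale=0.1, size=(hidden_dim, out_dim)),
        "b2": np.zeros(out_dim),
    }

def mlp_forward(params, v):
    """Two-layer MLP with ReLU; the output is L2-normalised so that
    vectors in the shared space can be compared by cosine similarity."""
    h = np.maximum(0.0, v @ params["W1"] + params["b1"])
    z = h @ params["W2"] + params["b2"]
    return z / np.linalg.norm(z)

rng = np.random.default_rng(0)
ENCODER_DIM, HIDDEN_DIM, SHARED_DIM = 32, 64, 16
M = make_mlp(ENCODER_DIM, HIDDEN_DIM, SHARED_DIM, rng)

# Hypothetical encoder outputs, keyed by data_id so that each unified
# vector stays linked to its original sample and its tag set.
raw_features = {
    "sample-001": rng.normal(size=ENCODER_DIM),   # e.g. a video feature
    "sample-002": rng.normal(size=ENCODER_DIM),   # e.g. a text feature
}
unified = {data_id: mlp_forward(M, v) for data_id, v in raw_features.items()}
```

Keying every intermediate result by `data_id` is what lets a hit in the vector store be joined back to the tag store and to the original material.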
[0079] Core logic: Use the contrastive loss function as the training driver.
[0080] To ensure cross-modal semantic consistency, a typical objective function is:
$L_{contrast} = -\log \frac{\exp(\mathrm{sim}(z_i^{img}, z_i^{text}) / \tau)}{\sum_{j} \exp(\mathrm{sim}(z_i^{img}, z_j^{text}) / \tau)}$
Here $z_i^{img}$ and $z_i^{text}$ are the unified high-dimensional vectors corresponding to the vector representations of the i-th image and its matching text (the feature vectors extracted by the encoders), and $z_j^{text}$ is the unified high-dimensional vector corresponding to the feature vector of the j-th text.
[0081] Effect: This function achieves "semantic clustering" in the space established in 2.2.1 by maximizing the cosine similarity *sim* between positive samples (such as "ghost peek" videos and "ghost peek" text descriptions) and minimizing the similarity between negative samples. Specifically, based on the computed loss value, the system checks whether the result matches the actual labeled positive and negative samples in order to decide whether to adjust the temperature coefficient τ. The temperature coefficient τ adjusts how difficult it is for the model to distinguish similarity distributions.
[0082] This is the contrastive loss function. Its core function is to measure the model's ability to distinguish between positive and negative samples in a unified semantic space. Through computation, it "brings" semantically matching image and text pairs closer together in space, while "pushing" unmatched pairs further apart.
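As a rough illustration of the pull and push behaviour described here, the sketch below computes an image-to-text contrastive loss in the common InfoNCE form. That specific form, the function names `cosine` and `contrastive_loss`, and the two-dimensional toy vectors are assumptions made for the example, not details confirmed by the disclosure.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def contrastive_loss(image_vecs, text_vecs, tau=0.07):
    """Mean image-to-text loss; text i is the positive sample for image i."""
    total = 0.0
    for i, z_img in enumerate(image_vecs):
        logits = [cosine(z_img, z_txt) / tau for z_txt in text_vecs]
        peak = max(logits)  # subtract the maximum for numerical stability
        log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_denominator - logits[i]
    return total / len(image_vecs)

images = [[1.0, 0.0], [0.0, 1.0]]
matching_texts = [[0.9, 0.1], [0.1, 0.9]]    # each text lies close to its image
mismatched_texts = [[0.1, 0.9], [0.9, 0.1]]  # the same texts, paired wrongly

loss_aligned = contrastive_loss(images, matching_texts)
loss_swapped = contrastive_loss(images, mismatched_texts)
```

With these toy vectors the loss is close to zero when each image sits next to its own text and becomes large when the pairs are swapped, which is the signal that drives the model to move matching pairs together.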
[0083] Suppose our model is processing the following massive amount of data: Text description (Query): "At night, at an intersection, pedestrians 'ghostly peeking out'."
[0084] Candidate Image Library: Image 1: At a crossroads late at night, a child suddenly runs out from behind a bus (correct answer).
[0085] Image 2: During the day, vehicles are driving in an orderly manner on the highway.
[0086] Image 3: In the evening, a row of private cars were parked along the roadside.
[0087] Working process of the contrastive loss function: 1. Pull: It calculates the cosine similarity between "Image 1" and the text description. If the similarity is low, the loss function produces a large value, forcing the model to update its parameters until the two vectors almost overlap in space.
[0088] 2. Push: It will look at both images 2 and 3 simultaneously. Although they are also traffic scenes, the meanings are incorrect. It will require their similarity scores to be as low as possible, mathematically "kicking" them out of the current semantic core.
[0089] 3. Temperature coefficient τ: It can be understood as a "magnifying glass for discerning details". Setting it very low (e.g., 0.07): The model becomes very "picky". Even if the image and text match only slightly imperfectly, it will consider them not close enough, forcing the model to focus on extremely small semantic details.
[0090] Setting it to a higher value (e.g., 1.0 or above): The model becomes more "laid-back". As long as a sample roughly represents a traffic scenario it is accepted, so the boundary between positive and negative samples becomes blurred; training becomes easier, but accuracy decreases.
[0091] Cosine similarity is used to measure the similarity between image vectors and text vectors:
[0092] $\mathrm{sim}(u, v) = \frac{u \cdot v}{\lVert u \rVert \, \lVert v \rVert}$
[0093] Temperature coefficient τ: Temperature scaling controls the "steepness" of the similarity distribution. The temperature parameter modulates the difficulty of contrastive learning.
[0094] This controls how close the similarity of positive samples (image and text pairings) should be, and how far the similarity of negative samples (mismatched images and text) should be pushed.
[0095] High temperature (greater than 1): smooths the distribution, reduces the gap between positive and negative samples, and makes training easier.
[0096] Low temperature (less than 1): makes the distribution sharper, amplifies the difference between positive and negative samples, makes training more difficult, and the model will focus more on distinguishing between positive and negative samples.
[0097] The role of temperature: It affects the speed and accuracy of model learning. If the temperature is too high, the model may have difficulty learning the correct similarity mapping; if the temperature is too low, the model may overfit to local features of the training set.
[0098] Where: $\sum_{j} \exp(\mathrm{sim}(z_i^{img}, z_j^{text}) / \tau)$ is the normalization term. The exponentiated similarities between the current image and every text in the batch (all texts other than the matching one act as negative samples) are summed, and the sum is used as the denominator.
[0099] The core of the loss function: Positive samples (images and their corresponding text) are brought closer together, that is, by maximizing the similarity between them.
[0100] Negative samples (mismatched images and text) are pushed away, i.e., by minimizing the similarity between them.
[0101] The goal of the loss function is to maximize the similarity of positive samples and minimize the similarity of negative samples.
[0102] Objective: By minimizing this loss function, learn a visual-language alignment embedding space in which images and matching text are closer together, while mismatched text is further apart.
[0103] The temperature parameter τ controls the difficulty of training. By balancing the difference between positive and negative samples, it enables the model to learn stably and effectively.
[0104] Temperature parameter adjustment techniques: Initial settings: The temperature is normally set between ... (empirical values).
[0105] Adjustment suggestions: High temperature: Suitable for processing diverse datasets or large-scale data, preventing the model from getting stuck in local optima.
[0106] Low temperature: Suitable for scenarios where the dataset is relatively simple, or where you want the model to focus more on detail discrimination.
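The effect of the temperature on the similarity distribution can be checked numerically. The sketch below applies a softmax to three similarity scores at two temperatures; the scores stand in for Images 1 to 3 of the example above and are invented for illustration, as is the helper name `match_probabilities`.

```python
import math

def match_probabilities(similarities, tau):
    """Softmax over similarities divided by the temperature tau."""
    scaled = [s / tau for s in similarities]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical cosine similarities of Image 1, 2 and 3 to the text query.
sims = [0.8, 0.5, 0.3]

sharp = match_probabilities(sims, tau=0.07)  # low temperature
smooth = match_probabilities(sims, tau=1.0)  # high temperature
```

At the low temperature almost all probability mass goes to the best match, so small similarity gaps matter a great deal; at the high temperature the three candidates receive comparable probabilities and the distinction between positive and negative samples is blurred.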
[0107] 2.2.3 Joint Multi-task Learning: Achieving Global Performance Enhancement. Building upon contrastive learning, step 2.2.3 introduces multiple auxiliary tasks (such as video description generation and visual question answering, VQA) for parallel training to further enhance the model's ability to understand complex business logic (such as traffic hazard assessment). Specifically, during the training phase, these tasks are performed on all multimodal data to be trained, and based on their results the coefficients corresponding to each task (i.e., $\lambda_1$, $\lambda_2$, etc. below) are adjusted, so that the similarity between different feature vectors can be identified more accurately.
[0108] Multi-task joint training combines the objective functions of multiple tasks, allowing the model to learn multiple tasks simultaneously during training. The outputs of each task share the same network structure and feature representations, but are optimized through different objective functions. In this context, objective tasks include image-text contrastive learning, video-text alignment, and video caption generation, with different parts of the model responsible for different tasks.
[0109] Implementation details 1) Task selection: Building on contrastive learning, multiple auxiliary tasks can be added: Image-text contrastive learning: Optimizing the similarity between images and text.
[0110] Video description generation: Generates natural language descriptions for videos.
[0111] Visual Question Answering (VQA): Given an image and a question, generate an answer.
[0112] Image generation: Generate corresponding images from text.
[0113] 2) Joint training: Optimization is performed on multiple tasks using a weighted loss function. Each task has an independent objective function, but through joint training, the model can simultaneously optimize the performance of each task.
[0114] Loss function example:
$L_{total} = \lambda_1 L_{contrast} + \lambda_2 L_{caption} + \lambda_3 L_{VQA} + \dots$
This is the total loss function. In multi-task learning, it is obtained by multiplying the loss of each sub-task (such as the contrastive learning loss $L_{contrast}$ from above, the video description loss $L_{caption}$, etc.) by its weighting coefficient and summing the results. The purpose is to enable the model to consider semantic understanding across multiple dimensions simultaneously during training.
[0115] Implementation mechanism: The contrastive learning loss is weighted and summed with the objective functions of other tasks to form the total loss function.
[0116] Where $\lambda_1, \lambda_2, \dots$ are the weighting coefficients for each task (a task that is more accurate is given a somewhat larger coefficient), used to control the contribution of different tasks to the final optimization.
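The weighted summation can be written in a few lines. The sketch below is only an illustration: the task names, loss values and weights are invented, and the helper `total_loss` is hypothetical.

```python
def total_loss(task_losses, weights):
    """Weighted sum of per-task losses; both dicts are keyed by task name."""
    missing = set(task_losses) - set(weights)
    if missing:
        raise KeyError(f"no weight given for tasks: {sorted(missing)}")
    return sum(weights[name] * loss for name, loss in task_losses.items())

# Hypothetical per-task losses from one training step.
task_losses = {"contrastive": 0.42, "caption": 1.30, "vqa": 0.85}
# Hypothetical weighting coefficients; a larger value gives a task more influence.
weights = {"contrastive": 1.0, "caption": 0.5, "vqa": 0.5}

loss = total_loss(task_losses, weights)  # 0.42 + 0.65 + 0.425
```

Because a single scalar is minimized, gradients from all tasks flow into the shared representation, and changing a weight shifts how strongly each task shapes it.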
[0117] Technical Results: Through complementary learning across multiple tasks, the model is forced to learn deeper traffic semantic features. This not only optimizes the vector arrangement in 2.2.2, but also significantly improves the accuracy and richness of VLM in subsequent label generation.
[0118] Dependencies: 2.2.1 is responsible for "building roads" (establishing space), 2.2.2 is responsible for "setting rules" (aligning semantics), and 2.2.3 is responsible for "all-round training" (global optimization).
[0119] ID relationship: data_id is the link between the three databases, which solves the practical problem of how to find the original material once a match has been retrieved.
[0120] 3. VLM-based tag generation
3.1 VLM video analytics pipeline
1) Video preprocessing
Keyframe extraction: A video is a sequence of frames, and processing every frame directly is computationally very expensive. Usually, keyframes are first extracted at a fixed frequency (such as 1 frame per second) or by detecting shot transitions.
[0121] Temporal slicing: Cutting the video into short segments (such as 2-4 second clips) to capture temporal information.
[0122] 2) VLM analysis execution
Zero-shot / few-shot label generation: Ask the VLM directly, for example: "Describe this image / video clip in detail. List the main objects and their actions." "What is the risk level of this scenario, and what are the main driving risks?" "Generate a storyboard for this video."
Targeted analysis based on prompts: Using pre-designed prompt templates, guide the model to output structured information (JSON format), for example:
[0123] "Output a JSON with fields: 'objects', 'primary_action', 'scene', 'A rainy day, daytime'."
Comparative analysis: Compares the content of different frames or different videos and generates descriptions of the differences.
[0124] 3) Post-processing and aggregation Deduplication and fusion: The analysis results of different keyframes or segments are fused to remove duplicate information and form a unified description of the entire video.
[0125] Confidence filtering: Filter out unreliable labels based on the confidence score output by the model.
[0126] Temporal modeling: Using language models or temporal models to connect the descriptions of individual segments to generate a coherent summary or storyline.
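The preprocessing and post-processing steps around the VLM call can be sketched as follows. The VLM itself is not invoked here; `segment_outputs` is invented stand-in data, and the function names and the 0.6 confidence threshold are assumptions chosen for the example.

```python
def keyframe_times(duration_s, fps=1.0):
    """Timestamps (in seconds) of keyframes sampled at a fixed rate."""
    count = int(duration_s * fps)
    return [i / fps for i in range(count)]

def slice_segments(duration_s, segment_s=3.0):
    """Split [0, duration_s) into consecutive (start, end) clips."""
    segments = []
    start = 0.0
    while start < duration_s:
        end = min(start + segment_s, duration_s)
        segments.append((start, end))
        start = end
    return segments

def fuse_labels(segment_outputs, min_confidence=0.6):
    """Drop low-confidence labels, then merge duplicates keeping the best score."""
    best = {}
    for output in segment_outputs:
        for label, confidence in output:
            if confidence >= min_confidence and confidence > best.get(label, 0.0):
                best[label] = confidence
    # Highest confidence first; ties are broken alphabetically.
    return sorted(best, key=lambda label: (-best[label], label))

# Hypothetical per-segment VLM outputs as (label, confidence) pairs.
segment_outputs = [
    [("bus", 0.95), ("pedestrian", 0.40)],
    [("bus", 0.90), ("pedestrian", 0.88), ("emergency braking", 0.75)],
    [("pedestrian", 0.92), ("dog", 0.20)],
]

frames = keyframe_times(10)     # one keyframe per second of a 10 s video
clips = slice_segments(10, 3)   # 3 s clips, the last one shorter
labels = fuse_labels(segment_outputs)
```

In this toy run "bus" and "pedestrian" each appear once in the fused result even though several segments report them, and "dog" is discarded because its confidence never reaches the threshold.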
[0127] Tags are primarily generated for images and videos to compensate for the lack of business attributes in pure vector retrieval. Text queries themselves serve as query conditions and typically do not require tag generation via VLM; instead, keyword extraction or semantic parsing is performed.
[0128] For video, instead of processing every single frame (which would be too computationally intensive), tags are generated through keyframe extraction (e.g., 1 frame per second) and temporal slicing (e.g., 2-4 second segments). This allows for the capture of both static scene attributes (weather, location) and dynamic behaviors (sudden acceleration, emergency braking).
3.2 Example
The vehicle was traveling at a normal, steady speed in the right lane of the city road. A bus was parked at the right-hand bus stop ahead, completely obstructing the right-hand view. Just as the vehicle was about to be parallel to the bus, a pedestrian suddenly darted out from in front of the bus, attempting to cross the road. The driver braked sharply, and the vehicle stopped just before hitting the pedestrian. Startled, the pedestrian quickly ran across the lane.
[0129] The generated semantic tag set:
Objects: Car (this vehicle), Bus (stationary), Pedestrian (suddenly appears), Bus stop, City road, Zebra crossing (far away)
Scene: City street, bus stop, limited field of vision
Actions: Driving at a constant speed, remaining stationary, suddenly rushing out, emergency braking, vehicle coming to a sudden stop, running across
Attributes: Blind spot, high risk, rapid response, extremely dangerous
Relations: A bus obstructs the right-side view of the vehicle; a pedestrian appears from in front of the bus, resulting in a path conflict between the vehicle and the pedestrian.
Events: A near-miss incident in which a pedestrian suddenly emerged from a blind spot; emergency avoidance was achieved, and no collision occurred.
High-Level Semantics / Summary: "This vehicle encountered a typical 'ghost peek' scenario while driving. A parked bus ahead created a large blind spot, and a pedestrian suddenly darted out from in front of the bus to cross the road. The driver took emergency braking measures and successfully stopped the vehicle before a collision, avoiding a serious traffic accident. The lesson of this scenario is the importance of slowing down and preparing to brake in advance in areas with limited visibility."
4. Vector Database and Tag Storage
All data is stored in a vector database that supports ANN (Approximate Nearest Neighbor) retrieval. This scheme uses the FAISS vector database and the HNSW (Hierarchical Navigable Small World) index.
[0130] 4.1 Storage Module
Vector database (FAISS): Stores vectors + IDs. For example: { data_id: 123, vector: [0.12, -0.33, ..., 0.87] }
Tag database (Elasticsearch): Stores tag sets + IDs. For example:
[0131] { data_id: 123, scene_type: "traffic accident", location: "G15-K23", time: "2024-09-17 19:23", weather: "rain", tags: ["rear-end collision", "night", "highway"] }
4.2 Search Module
User queries are encoded into a query vector q plus label conditions. There are two strategies for retrieval (label first, then vector / vector first, then label).
[0132] Unique Identifier (ID) When the system collects raw data (such as a piece of surveillance video), it immediately generates an "ID number" (data_id) for it. All subsequent derived information will be bound to this ID.
[0133] Feature vector database (FAISS): Stores "feature vector - ID" pairs. It holds the high-dimensional numerical features extracted by the encoder.
[0134] Example storage format: { data_id: 123, vector: [0.12, -0.33, ...]}.
[0135] Function: When you search for something "like" using images or text, it tells you which IDs are semantically closest.
[0136] Tag repository (Elasticsearch): Stores "tag sets - IDs". It stores the structured descriptions derived from the VLM model analysis.
[0137] Example storage format: { data_id: 123, scene: "ghost peek", weather: "rainy day", tags:["emergency braking", "blind spot"]}.
[0138] Function: When you set hard criteria (such as "only rainy days"), it can quickly filter out a set of IDs that meet the criteria.
[0139] Raw data storage: Stores the "raw file - ID". It typically resides in cloud object storage or file servers.
[0140] Function: It's what the user ultimately sees. When the system identifies ID 123 using vectors and tags, it retrieves the corresponding original video and plays it to the user based on that ID.
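The role of data_id as the join key can be shown with three in-memory dictionaries. They stand in for the vector database, the tag database and the raw file storage; the helper names `ingest` and `resolve` and the file path are hypothetical.

```python
vector_store = {}  # data_id -> feature vector (stands in for the vector database)
tag_store = {}     # data_id -> structured tags (stands in for the tag database)
raw_store = {}     # data_id -> location of the original file

def ingest(data_id, vector, tags, raw_path):
    """Register one sample in all three stores under the same data_id."""
    vector_store[data_id] = vector
    tag_store[data_id] = tags
    raw_store[data_id] = raw_path

def resolve(data_id):
    """Return what the user finally sees: the original file plus its tags."""
    return {
        "data_id": data_id,
        "raw": raw_store[data_id],
        "tags": tag_store[data_id],
    }

ingest(
    123,
    [0.12, -0.33, 0.87],
    {"scene": "ghost peek", "weather": "rainy day",
     "tags": ["emergency braking", "blind spot"]},
    "videos/clip_123.mp4",  # hypothetical storage path
)
result = resolve(123)
```

Whichever store produces a hit, the ID alone is enough to fetch the matching entries from the other two.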
[0141] A. Labels first, then vectors (filtering priority)
1) User input query: Condition = {Scene Type = "Accident", Time = "Night"}, plus query vector q.
2) Filter the tag database to obtain a set of candidate IDs.
3) Retrieve only the vectors corresponding to these IDs in the vector database and calculate the similarity.
[0142] 4) Sort the output.
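Strategy A can be sketched with plain dictionaries in place of the tag and vector databases. The data, the exact-match filter and the function name `search_tags_then_vectors` are assumptions made for the illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def search_tags_then_vectors(query_vec, conditions, tag_store, vector_store, top_k=10):
    """Filter by tags first, then rank only the surviving IDs by similarity."""
    candidates = [
        data_id for data_id, tags in tag_store.items()
        if all(tags.get(field) == value for field, value in conditions.items())
    ]
    scored = [(cosine(query_vec, vector_store[d]), d) for d in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [data_id for _, data_id in scored[:top_k]]

# Hypothetical stores keyed by data_id.
tag_store = {
    1: {"scene_type": "accident", "time": "night"},
    2: {"scene_type": "accident", "time": "day"},
    3: {"scene_type": "accident", "time": "night"},
}
vector_store = {1: [1.0, 0.0], 2: [0.9, 0.1], 3: [0.0, 1.0]}

hits = search_tags_then_vectors(
    [1.0, 0.1], {"scene_type": "accident", "time": "night"}, tag_store, vector_store
)
```

Sample 2 is never scored because it fails the tag filter, so the similarity computation only touches the vectors that can appear in the result.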
[0143] The system ultimately outputs to the user the raw multimodal data itself (such as specific video clips and image files) and its corresponding structured tags. Vector retrieval and tag filtering are merely intermediate selection methods, aiming to accurately locate the desired "scene" within massive amounts of data. Users see not dry, uninteresting numbers, but intuitive monitoring playback and automatically generated scene reports.
[0144] B. Vectors first, then labels (semantic priority)
1) User input query: query vector q plus tag conditions.
[0145] 2) Retrieve the Top-K from the vector database to obtain the candidate set.
[0146] 3) Query the tag database for the tags corresponding to these candidate IDs, and check whether they match the tag conditions.
[0147] 4) Filter and re-rank the output.
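Strategy B reverses the order of the two lookups. The sketch below uses brute-force ranking over a dictionary instead of an ANN index; the data, the exact-match filter and the name `search_vectors_then_tags` are again assumptions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def search_vectors_then_tags(query_vec, conditions, tag_store, vector_store, top_k=2):
    """Take the Top-K most similar vectors, then keep those whose tags match."""
    ranked = sorted(
        vector_store,
        key=lambda data_id: cosine(query_vec, vector_store[data_id]),
        reverse=True,
    )
    return [
        data_id for data_id in ranked[:top_k]
        if all(tag_store[data_id].get(f) == v for f, v in conditions.items())
    ]

# Hypothetical stores keyed by data_id.
tag_store = {
    1: {"scene_type": "accident", "time": "night"},
    2: {"scene_type": "accident", "time": "day"},
    3: {"scene_type": "accident", "time": "night"},
}
vector_store = {1: [1.0, 0.0], 2: [0.9, 0.1], 3: [0.0, 1.0]}
conditions = {"scene_type": "accident", "time": "night"}

hits = search_vectors_then_tags([1.0, 0.1], conditions, tag_store, vector_store, top_k=2)
wider = search_vectors_then_tags([1.0, 0.1], conditions, tag_store, vector_store, top_k=3)
```

With `top_k=2` only sample 1 is returned, because sample 3 satisfies the tags but falls outside the Top-K; widening K recovers it. This trade-off is one reason a system might prefer the label-first order when the tag conditions are strict.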
[0148] 5. Retrieval and Weighted Ranking
User query: Transform the input into a query vector q, and retrieve the Top-K from the vector library:
$C_K = \operatorname{TopK}_{j} \, \mathrm{sim}(q, z_j)$
[0149] Perform label filtering on the candidate set:
$C' = \{\, j \in C_K \mid T_j \text{ satisfies } T_q \,\}$
[0150] Final re-ranking score:
$\mathrm{Score}_j = \alpha \cdot \mathrm{sim}(q, z_j) + \beta \cdot \mathrm{match}(T_q, T_j) + \gamma \cdot f(t_j)$
[0151] Where: $t_j$ represents the timestamp of data sample j, and $T_j$ represents the labels of data sample j.
[0152] Vector similarity $\mathrm{sim}(q, z_j)$: The cosine similarity between the user query vector q and the data feature vector $z_j$ stored in the database, representing the degree of semantic closeness. Tag matching degree $\mathrm{match}(T_q, T_j)$: The degree of overlap between the tags $T_q$ in the user-input business filter conditions (such as "rainy day", "accident") and the original data's own labels $T_j$. Timeliness score $f(t_j)$: Based on the timestamp $t_j$; the function generally scores newer data (closer to the current time) higher. This aligns with the transportation industry's need to "focus on real-time or recent events."
[0153] $\mathrm{Score}_j$ is the final ranking score. The system sorts all candidate results in descending order of this total score, ensuring that the top-ranked results conform to the semantics, satisfy the business tag constraints, and are the most recent in time. α, β and γ are weight parameters: α is the semantic similarity weight, β is the tag matching weight, and γ is the timeliness weight; they can be determined by the business scenario or through a learning mechanism. For example: in an incident tracing scenario, γ might be set higher to prioritize displaying recently occurred incidents. In a typical violation research scenario, γ might be set lower and β relatively large, to ensure that the retrieved samples fully conform to the defined set of violation labels.
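A minimal sketch of the weighted re-ranking follows. The weight names `alpha`, `beta`, `gamma`, the overlap measure in `tag_match` and the exponential decay in `freshness` are assumed forms picked for the example; the text only requires that tag overlap and newer timestamps raise the score.

```python
def tag_match(query_tags, item_tags):
    """Fraction of the requested tags that the item carries (assumed overlap measure)."""
    wanted = set(query_tags)
    if not wanted:
        return 1.0
    return len(wanted & set(item_tags)) / len(wanted)

def freshness(timestamp_s, now_s, half_life_s=3600.0):
    """Exponential decay: 1.0 for brand-new data, 0.5 after one half-life (assumed form)."""
    age = max(0.0, now_s - timestamp_s)
    return 0.5 ** (age / half_life_s)

def rerank(candidates, query_tags, now_s, alpha=0.5, beta=0.3, gamma=0.2):
    """Sort candidates by the weighted sum of similarity, tag match and freshness."""
    scored = []
    for c in candidates:
        score = (alpha * c["similarity"]
                 + beta * tag_match(query_tags, c["tags"])
                 + gamma * freshness(c["timestamp"], now_s))
        scored.append((score, c["data_id"]))
    scored.sort(reverse=True)
    return [(data_id, round(score, 4)) for score, data_id in scored]

now = 1_000_000.0  # hypothetical current time in seconds
candidates = [
    {"data_id": 1, "similarity": 0.90, "tags": ["ghost peek", "night"],
     "timestamp": now - 3600},      # one hour old
    {"data_id": 2, "similarity": 0.92, "tags": ["ghost peek"],
     "timestamp": now - 360_000},   # about four days old
]

recent_first = rerank(candidates, ["ghost peek"], now)
semantic_only = rerank(candidates, ["ghost peek"], now, gamma=0.0)
```

With a non-zero timeliness weight the one-hour-old sample 1 ranks first despite its slightly lower similarity; setting `gamma` to zero lets the older but marginally more similar sample 2 take the lead.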
[0154] For example: Suppose the system stores 1 million traffic data entries, and a traffic police officer enters the query: "Find the most recent dangerous avoidance scenarios".
[0155] α (semantic relevance weight): Application: Although you did not search for the words "ghost peek", you searched for "dangerous avoidance". The task is to find videos in the vector library that are visually and semantically dangerous (such as sudden braking or flashing turn signals). When α is set high, the search results place greater emphasis on intuitive matches that the AI deems similar.
[0156] β (tag matching weight): Application: You checked the tag "Accident Type: Ghost Peek" in the search box. β ensures that the results include this specific business definition. If a video has a similar semantic meaning (such as sudden braking), but it is about avoiding a puppy instead of a pedestrian (label mismatch), its score will be lowered.
[0157] γ (timeliness weight): Application: You are interested in "recently occurring" events. If the system finds two perfect "ghost peek" videos, one from 2024 and the other from 1 hour ago, a higher γ will significantly boost the ranking of the video from an hour ago, placing it at the top of the search results.
It should be noted that, for the sake of simplicity, the foregoing method embodiments are described as a series of actions. However, those skilled in the art should understand that this disclosure is not limited to the described order of actions, as some steps can be performed in other orders or simultaneously according to this disclosure. Furthermore, those skilled in the art should also understand that the embodiments described in the specification are optional embodiments, and the actions and modules involved are not necessarily essential to this disclosure.
[0158] The above is an introduction to the method embodiments. The following describes the solution described in this disclosure further through device embodiments.
[0159] Figure 3 shows a block diagram of a multimodal data retrieval apparatus 300 according to an embodiment of the present disclosure. As shown in Figure 3, the device 300 includes: an acquisition module 310, used to acquire key information for retrieval; a first retrieval module 320, used to retrieve the target feature vector and target label from the first preset database based on the key information used for retrieval; and a second retrieval module 330, used to retrieve the corresponding target multimodal data in a second preset database based on the target feature vector and the target label.
[0160] Those skilled in the art will clearly understand that, for the sake of convenience and brevity, the specific working process of the described module can be referred to the corresponding process in the foregoing method embodiments, and will not be repeated here.
[0161] According to embodiments of the present disclosure, the present disclosure also provides an electronic device and a non-transitory computer-readable storage medium storing computer instructions.
[0162] Figure 4 shows a schematic block diagram of an electronic device 800 that can be used to implement embodiments of the present disclosure. The electronic device is intended to represent various forms of digital computers, such as laptop computers, desktop computers, workstations, personal digital assistants, servers, blade servers, mainframe computers, and other suitable computers. The electronic device may also represent various forms of mobile devices, such as personal digital processors, cellular phones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are merely illustrative and are not intended to limit the implementation of the present disclosure described and / or claimed herein.
[0163] Device 800 includes a computing unit 801, which can perform various appropriate actions and processes based on a computer program stored in read-only memory (ROM) 802 or a computer program loaded from storage unit 808 into random access memory (RAM) 803. RAM 803 may also store various programs and data required for the operation of device 800. The computing unit 801, ROM 802, and RAM 803 are interconnected via bus 804. Input / output (I / O) interface 805 is also connected to bus 804.
[0164] Multiple components in device 800 are connected to I / O interface 805, including: input unit 806, such as keyboard, mouse, etc.; output unit 807, such as various types of monitors, speakers, etc.; storage unit 808, such as disk, optical disk, etc.; and communication unit 809, such as network card, modem, wireless transceiver, etc. Communication unit 809 allows device 800 to exchange information / data with other devices through computer networks such as the Internet and / or various telecommunications networks.
[0165] The computing unit 801 can be a variety of general-purpose and / or special-purpose processing components with processing and computing capabilities. Some examples of the computing unit 801 include, but are not limited to, a central processing unit (CPU), a graphics processing unit (GPU), various special-purpose artificial intelligence (AI) computing chips, various computing units running machine learning model algorithms, a digital signal processor (DSP), and any suitable processor, controller, microcontroller, etc. The computing unit 801 performs the various methods and processes described above, such as method 100. For example, in some embodiments, method 100 may be implemented as a computer software program tangibly contained in a machine-readable medium, such as storage unit 808. In some embodiments, part or all of the computer program may be loaded and / or installed on device 800 via ROM 802 and / or communication unit 809. When the computer program is loaded into RAM 803 and executed by the computing unit 801, one or more steps of method 100 described above may be performed. Alternatively, in other embodiments, the computing unit 801 may be configured to perform method 100 by any other suitable means (e.g., by means of firmware).
[0166] Various embodiments of the systems and techniques described above herein can be implemented in digital electronic circuit systems, integrated circuit systems, field-programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), application-specific standard products (ASSPs), systems-on-a-chip (SoCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and / or combinations thereof. These various embodiments may include implementations in one or more computer programs that can be executed and / or interpreted on a programmable system including at least one programmable processor, which may be a dedicated or general-purpose programmable processor, capable of receiving data and instructions from a storage system, at least one input device, and at least one output device, and transmitting data and instructions to the storage system, the at least one input device, and the at least one output device.
[0167] The program code used to implement the methods of this disclosure may be written in any combination of one or more programming languages. This program code may be provided to a processor or controller of a general-purpose computer, special-purpose computer, or other programmable data processing apparatus, such that when executed by the processor or controller, the program code causes the functions / operations specified in the flowcharts and / or block diagrams to be implemented. The program code may be executed entirely on a machine, partially on a machine, as a standalone software package partially on a machine and partially on a remote machine, or entirely on a remote machine or server.
[0168] In the context of this disclosure, a machine-readable medium can be a tangible medium that may contain or store a program for use by or in conjunction with an instruction execution system, apparatus, or device. A machine-readable medium can be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium can be, but is not limited to, electronic, magnetic, optical, electromagnetic, infrared, or semiconductor systems, apparatus, or devices, or any suitable combination of the foregoing. More specific examples of machine-readable storage media include electrical connections based on one or more wires, portable computer disks, hard disks, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), optical storage devices, magnetic storage devices, or any suitable combination of the foregoing.
[0169] To provide interaction with a user, the systems and techniques described herein can be implemented on a computer having: a display device for displaying information to the user (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor); and a keyboard and pointing device (e.g., a mouse or trackball) through which the user provides input to the computer. Other types of devices can also be used to provide interaction with the user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user can be received in any form (including sound input, voice input, or tactile input).
[0170] The systems and technologies described herein can be implemented in computing systems that include backend components (e.g., as a data server), or computing systems that include middleware components (e.g., an application server), or computing systems that include frontend components (e.g., a user computer with a graphical user interface or web browser through which a user can interact with implementations of the systems and technologies described herein), or any combination of such backend, middleware, or frontend components. The components of the system can be interconnected via digital data communication of any form or medium (e.g., a communication network). Examples of communication networks include local area networks (LANs), wide area networks (WANs), and the Internet.
[0171] Computing systems can include clients and servers. Clients and servers are generally located far apart and typically interact via communication networks. Client-server relationships are created by computer programs running on the respective computers and having a client-server relationship with each other. Servers can be cloud servers, servers in distributed systems, or servers incorporating blockchain technology.
[0172] It should be understood that the various forms of processes shown above can be used to rearrange, add, or delete steps. For example, the steps described in this disclosure can be executed in parallel, sequentially, or in different orders, as long as the desired result of the technical solution disclosed in this disclosure can be achieved, and this is not limited herein.
[0173] The specific embodiments described above do not constitute a limitation on the scope of protection of this disclosure. Those skilled in the art should understand that various modifications, combinations, sub-combinations, and substitutions can be made according to design requirements and other factors. Any modifications, equivalent substitutions, and improvements made within the spirit and principles of this disclosure should be included within the scope of protection of this disclosure.
Claims
1. A multimodal data retrieval method, characterized in that it includes: obtaining key information for retrieval; retrieving, based on the key information used for retrieval, the target feature vector and target label from the first preset database; and retrieving, based on the target feature vector and the target label, the corresponding target multimodal data from the second preset database.
2. The method as described in claim 1, characterized in that, The first preset database includes a vector database and a tag database; The step of retrieving the target feature vector and target label from the first preset database based on the key information used for retrieval includes: The key information is parsed to obtain the corresponding user feature vector and user tags used for retrieval; The target feature vector is obtained by retrieving the user feature vector from the vector database. The target tag is obtained by retrieving the tag from the tag database based on the user tag.
3. The method as described in claim 2, characterized in that, The step of retrieving the target feature vector from the vector database based on the user feature vector and retrieving the target tag from the tag database based on the user tag includes: First, the target feature vector and the corresponding candidate identifier set are retrieved from the vector database based on the user feature vector; then, the target identifier set and target label corresponding to the target feature vector and the user label are queried from the tag database in the candidate identifier set. or First, the target tag and its corresponding set of candidate identifiers are obtained by searching the tag database based on the user tag; then, the target identifier set and target feature vector corresponding to the target tag and the user feature vector are queried in the vector database.
4. The method as described in claim 1, characterized in that, The first preset database is obtained through the following steps: Collect multimodal data for training; The multimodal data to be trained is cleaned and normalized to obtain preprocessed data; The preprocessed data is mapped to a unified high-dimensional vector; The unified high-dimensional vector is stored in the vector database of the first preset database.
5. The method according to claim 4, characterized in that the method further comprises: calculating the similarity between the high-dimensional vectors in the unified high-dimensional vector; calculating, based on that similarity, a current contrastive loss value corresponding to each high-dimensional vector in the unified high-dimensional vector; and determining, based on the current contrastive loss value, whether the similarity between the high-dimensional vectors in the unified high-dimensional vector needs to be adjusted.
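Claim 5 does not fix a particular contrastive loss. One commonly used form is the InfoNCE loss, sketched below under the assumption that vectors arrive as matched pairs (for example, a video embedding and the embedding of its description) and that similarity is cosine similarity. The temperature, the `tolerance` threshold and the function names are illustrative assumptions.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def info_nce_per_vector(anchors, positives, temperature=0.1):
    """Loss for each anchor; anchors[i] matches positives[i], the rest are negatives."""
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, p) / temperature for p in positives]
        # Numerically stable log-sum-exp over all candidates.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        losses.append(log_denominator - logits[i])
    return losses


def needs_adjustment(losses, tolerance=0.05):
    """Assumed decision rule: keep training while the mean loss exceeds a tolerance."""
    return sum(losses) / len(losses) > tolerance
```

With this loss, matched pairs that are already close and well separated from the other candidates yield values near zero, while mismatched pairs yield large values, so the loss can serve as the signal for deciding whether further adjustment is needed.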
6. The method according to claim 4, characterized in that the method further comprises: performing multiple optimization tasks on the multimodal data for training to obtain a loss value corresponding to each optimization task; obtaining a total loss value for the multiple optimization tasks based on the loss value corresponding to each optimization task; and adjusting, based on the total loss value, the model optimization parameters corresponding to each optimization task.
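A minimal sketch of the joint training in claim 6 follows, assuming the total loss is a weighted sum of the per-task losses and that the tasks share a single scalar parameter `theta`. The task names, targets, weights and quadratic loss shapes are all hypothetical and chosen only so that the result can be checked by hand.

```python
# Hypothetical tasks: name -> (target, weight); each loss is (theta - target) ** 2.
TASKS = {
    "contrastive": (2.0, 1.0),
    "label_prediction": (4.0, 0.5),
}


def task_losses(theta):
    """Loss value of each optimization task at the current parameter."""
    return {name: (theta - target) ** 2 for name, (target, _) in TASKS.items()}


def total_loss(losses):
    """Weighted sum of the per-task loss values."""
    return sum(TASKS[name][1] * value for name, value in losses.items())


def train(theta=0.0, lr=0.1, steps=200):
    """Plain gradient descent on the total loss."""
    for _ in range(steps):
        # d(total)/d(theta) = sum over tasks of weight * 2 * (theta - target)
        grad = sum(w * 2.0 * (theta - t) for t, w in TASKS.values())
        theta -= lr * grad
    return theta
```

Under these assumptions the parameter settles at the weight-averaged target, a compromise between the tasks rather than the optimum of any single one.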
7. The method according to any one of claims 1 to 6, characterized in that the first preset database is obtained through the following steps: collecting multimodal data for training; cleaning and normalizing the multimodal data for training to obtain preprocessed data; performing VLM analysis on the video data in the preprocessed data to generate multi-dimensional labels; and storing the multi-dimensional labels in a label database of the first preset database.
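The label database of claim 7 can be sketched with an in-memory SQLite table that stores one row per clip and label dimension. The VLM analysis itself is replaced here by a stub, `fake_vlm_labels`, that returns fixed, made-up labels. The dimension names (`weather`, `scene`, `event`), the table schema and the helper names are assumptions for illustration only.

```python
import sqlite3


def fake_vlm_labels(clip_id):
    """Stand-in for the VLM analysis step; returns fixed, made-up labels."""
    canned = {
        "clip-001": {"weather": "rain", "scene": "intersection", "event": "hard_braking"},
        "clip-002": {"weather": "sunny", "scene": "highway", "event": "lane_change"},
    }
    return canned[clip_id]


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE labels (clip_id TEXT, dimension TEXT, value TEXT, "
    "PRIMARY KEY (clip_id, dimension))"
)


def store_labels(clip_id):
    """Write one row per label dimension for the given clip."""
    rows = [(clip_id, dim, val) for dim, val in fake_vlm_labels(clip_id).items()]
    conn.executemany("INSERT OR REPLACE INTO labels VALUES (?, ?, ?)", rows)
    conn.commit()


def find_clips(**conditions):
    """Return clip ids whose labels satisfy every dimension=value condition."""
    matches = None
    for dim, val in conditions.items():
        ids = {row[0] for row in conn.execute(
            "SELECT clip_id FROM labels WHERE dimension = ? AND value = ?",
            (dim, val))}
        matches = ids if matches is None else matches & ids
    return sorted(matches or [])
```

For example, after `store_labels("clip-001")`, a call such as `find_clips(weather="rain", scene="intersection")` returns the identifiers that satisfy both conditions, which matches the multi-condition lookup the label database is meant to support.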
8. A multimodal data retrieval device, characterized in that it comprises: an acquisition module configured to obtain key information for retrieval; a first retrieval module configured to retrieve a target feature vector and a target label from a first preset database based on the key information; and a second retrieval module configured to retrieve corresponding target multimodal data from a second preset database based on the target feature vector and the target label.
9. An electronic device, characterized in that it comprises a memory and a processor, wherein the memory stores a computer program, and the processor, when executing the program, implements the method according to any one of claims 1 to 7.
10. A computer-readable storage medium, characterized in that, when instructions in the storage medium are executed by a processor of an electronic device, the electronic device is enabled to implement the multimodal data retrieval method according to any one of claims 1 to 7.