Method and apparatus for processing autonomous driving scenario data
The method of preprocessing and integrating embedding vectors from multimodal data in autonomous driving systems addresses decision-making uncertainty and environmental vulnerabilities, enhancing situational awareness and reliability in autonomous driving.
Patent Information
- Authority / Receiving Office: WO
- Patent Type: Applications
- Current Assignee / Owner: KOREA ELECTRONICS TECH INST
- Filing Date: 2024-12-24
- Publication Date: 2026-07-02
Description
Method and device for processing autonomous driving scenario data
[0001] The present invention relates to a technology for processing autonomous driving scenario data, and more specifically, to a technology for processing multimodal data necessary for the effective generation and retrieval of autonomous driving scenarios.
[0002] An autonomous driving scenario refers to a virtual situation that enables the evaluation of a vehicle's driving performance and safety by simulating various driving situations and environments that an autonomous vehicle may encounter while driving. These scenarios are designed to allow the vehicle to encounter complex road environments, diverse traffic conditions, weather conditions, pedestrians, and other obstacles, and can be utilized to evaluate and enhance responsiveness to unexpected situations that may occur during actual driving.
[0003] Autonomous driving scenarios consist of various components, and there are specific situations that represent each scenario. For example, this information can be provided in various formats, such as text, images, audio, video, and time-series sensor data. As such, autonomous driving scenarios are characterized by the ability to be expressed through the combination of complex information.
[0004] In particular, to effectively represent and process autonomous driving scenarios, it is necessary to effectively integrate and process multimodal data (e.g., text, images, audio, sensor data, video, etc.); however, since data is conventionally processed based on a single modality, the following problems arise in terms of representing and processing autonomous driving scenarios.
[0005] First, decision-making uncertainty increases. In other words, relying on a single modality leads to incomplete information about the situation, which can result in unstable and uncertain decision-making by autonomous vehicles. For example, when cameras are not functioning properly due to bad weather, it may be difficult to make a judgment based solely on video information from the camera.
[0006] Secondly, since it is very difficult to identify correlations or differences among multiple modalities using conventional methods such as statistical techniques, there are limitations to situational awareness capabilities. For example, if a vehicle is equipped with multiple sensors, there may be a specific correlation between the image of the vehicle ahead acquired by a camera and the data collected by multiple surrounding distance sensors. In other words, the distance to the captured vehicle can be estimated using the camera image, and the vehicle's position or speed can also be estimated using information from multiple surrounding distance sensors.
[0007] However, it is very difficult to identify the joint correlation of multiple modalities using conventional methods. Accordingly, a common approach has been to train or build models on a single modality and then combine them later to derive meaning. However, to utilize the correlations of multiple sensors, the inter-sensor correlations must be explicitly extracted through meticulous analysis by experts.
[0008] Thirdly, related to the aforementioned problem, it is difficult to utilize interactions between data. In other words, if interactions between modalities are not reflected, important information provided by a specific modality may be ignored or underestimated. For example, even if a warning is generated via text information and image information is provided simultaneously, if these two pieces of information are processed separately, the meaning of the warning may not be accurately conveyed.
[0009] Fourth, if driving-related data is used independently for each modality, there may be limitations in prediction when evaluating the risk of driving situations. For example, camera footage alone may not be sufficient to recognize sudden situations, or sensor data alone may have limitations in understanding complex surrounding situations.
[0010] Fifth, if a specific modality is temporarily omitted or lost, there is no alternative information available, which can lead to an incomplete overall situational awareness.
[0011] Sixth, single-modality based systems are easily affected by environmental changes, making it difficult to adapt to changing situations. For example, camera performance can vary depending on weather or lighting conditions, and LiDAR may not function well on certain surfaces.
[0012] Seventh, when training autonomous driving agents, it is difficult to adequately respond to various driving scenarios with a single modality, and the autonomous driving system may fail to learn appropriate strategies in specific scenarios.
[0013] However, the above description merely provides background information regarding the present invention and does not constitute previously disclosed technology.
[0014] In order to resolve the aforementioned problems or limitations, the present invention aims to provide a technology for processing multimodal data necessary for the effective generation and retrieval of autonomous driving scenarios.
[0015] In other words, the purpose of the present invention is to provide a technology for processing multimodal data that effectively describes various autonomous driving scenario situations.
[0016] However, the problems that the present invention aims to solve are not limited to those mentioned above, and other unmentioned problems will be clearly understood by those skilled in the art to which the present invention belongs from the description below.
[0017] A method according to an embodiment of the present invention for solving the above-mentioned problem comprises: a step of performing preprocessing on target data, which is multimodal data related to an autonomous driving scenario or driving environment; a step of generating a single embedding vector for each of a plurality of single modality data included in the preprocessed target data using a corresponding previously trained single embedding model; and a step of generating a single integrated embedding vector, which is a single multimodal vector formed by integrating the single embedding vectors, using a previously trained integrated embedding model.
[0018] The step of performing the above preprocessing may sort each single modality data in the target data according to a preset order, and add flags for missing single modality data or generate supplementary data.
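The sorting and flagging described above can be illustrated with a minimal sketch. The modality names, the order in `MODALITY_ORDER`, and the function name `preprocess` are assumptions made for illustration only; the specification states only that the order is preset and that missing single modality data is flagged or supplemented.

```python
from typing import Any, Dict, List, Optional, Tuple

# Hypothetical preset order; the specification does not fix a particular order.
MODALITY_ORDER: List[str] = [
    "text", "image", "audio", "video", "gps", "map", "sensor", "time_series",
]


def preprocess(target: Dict[str, Any]) -> List[Tuple[str, Optional[Any], bool]]:
    """Return (modality, data, is_missing) triples sorted in the preset order."""
    ordered = []
    for name in MODALITY_ORDER:
        data = target.get(name)
        # A modality that is absent (or None) is kept in place and flagged as missing.
        ordered.append((name, data, data is None))
    return ordered


# Example input containing only two of the eight modalities.
sample = {"image": "frame_0001.png", "text": "Obstacle ahead"}
result = preprocess(sample)
```

In this sketch a missing modality is only flagged; generating supplementary data in its place would be an alternative handled at the same point in the loop.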
[0019] The step of performing the above preprocessing can preprocess the target data using preprocessing information regarding the training data used during the training of the single embedding model and the integrated embedding model.
[0020] A method according to one embodiment of the present invention may further include the step of verifying consistency for a target integrated embedding vector by comparing the single integrated embedding vector with training embedding information for training data used during training of the single embedding model and the integrated embedding model.
[0021] A method according to one embodiment of the present invention may further include the step of obtaining a similarity or distance between the learning embedding information and the target integrated embedding vector in an embedding space, and verifying consistency for the target integrated embedding vector according to the obtained similarity or distance.
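As one possible illustration of this similarity-based check, the sketch below uses cosine similarity and a fixed threshold. The choice of cosine similarity, the threshold value `0.8`, and the function names are assumptions; the specification only requires that a similarity or distance in the embedding space be obtained and used for verification.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def is_consistent(
    target: Sequence[float],
    training: Sequence[Sequence[float]],
    min_similarity: float = 0.8,  # hypothetical threshold
) -> bool:
    """True if the target is similar enough to at least one training embedding."""
    if not training:
        raise ValueError("training embedding information is empty")
    best = max(cosine_similarity(target, t) for t in training)
    return best >= min_similarity


training_embeddings = [[1.0, 0.0], [0.0, 1.0]]
near = is_consistent([0.9, 0.1], training_embeddings)
far = is_consistent([-1.0, -1.0], training_embeddings)
```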
[0022] A method according to one embodiment of the present invention may further include the step of applying a clustering technique to the learning embedding information and the target integrated embedding vector to identify major clusters for the learning embedding information, and verifying the consistency of the target integrated embedding vector based on whether the target integrated embedding vector is included within the identified major clusters.
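A cluster-based check of this kind might look like the following sketch. It swaps in a small k-means with a deterministic initialization, treats a cluster as "major" when it holds at least a given fraction of the training embeddings, and treats the target as included when it lies within the cluster's radius. All of these choices, and every name and parameter here, are illustrative assumptions rather than the method of the specification.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def dist(a: Vector, b: Vector) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean(points: Sequence[Vector]) -> List[float]:
    n = len(points)
    return [sum(p[i] for p in points) / n for i in range(len(points[0]))]


def assign(points: Sequence[Vector], centroids: Sequence[Vector]) -> List[List[Vector]]:
    """Group each point with its nearest centroid."""
    groups: List[List[Vector]] = [[] for _ in centroids]
    for p in points:
        idx = min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))
        groups[idx].append(p)
    return groups


def kmeans(points: Sequence[Vector], k: int, iterations: int = 20) -> Tuple[list, list]:
    n = len(points)
    # Deterministic initialization: k evenly spaced points from the input.
    centroids = [list(points[i * n // k]) for i in range(k)]
    for _ in range(iterations):
        groups = assign(points, centroids)
        centroids = [mean(g) if g else c for g, c in zip(groups, centroids)]
    return centroids, assign(points, centroids)


def in_major_cluster(target: Vector, training: Sequence[Vector],
                     k: int = 2, min_fraction: float = 0.2) -> bool:
    centroids, groups = kmeans(training, k)
    for centroid, group in zip(centroids, groups):
        if not group or len(group) < min_fraction * len(training):
            continue  # too small to count as a major cluster
        radius = max(dist(p, centroid) for p in group)
        if dist(target, centroid) <= radius:
            return True
    return False


training_embeddings = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                       [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]]
inside = in_major_cluster([0.5, 0.5], training_embeddings)
outside = in_major_cluster([5.0, 5.0], training_embeddings)
```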
[0023] A method according to one embodiment of the present invention may further include the step of verifying the consistency of the target integrated embedding vector by detecting whether the target integrated embedding deviates from the data distribution of the learning embedding information through Out-of-Distribution (OOD) detection.
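The specification does not name a particular OOD detection technique. As a minimal sketch under that caveat, the example below flags a target embedding as out-of-distribution when its distance to the centroid of the training embeddings exceeds the mean training distance by more than a chosen number of standard deviations. The function names and the `num_std` parameter are hypothetical.

```python
import math
import statistics
from typing import List, Sequence


def centroid(points: Sequence[Sequence[float]]) -> List[float]:
    n = len(points)
    return [sum(p[i] for p in points) / n for i in range(len(points[0]))]


def is_out_of_distribution(
    target: Sequence[float],
    training: Sequence[Sequence[float]],
    num_std: float = 3.0,  # hypothetical tolerance
) -> bool:
    """Distance-to-centroid test against the training embedding distribution."""
    center = centroid(training)
    distances = [math.dist(p, center) for p in training]
    threshold = statistics.mean(distances) + num_std * statistics.pstdev(distances)
    return math.dist(target, center) > threshold


training_embeddings = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
in_dist = is_out_of_distribution([0.5, 0.5], training_embeddings)
out_dist = is_out_of_distribution([5.0, 5.0], training_embeddings)
```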
[0024] A method according to one embodiment of the present invention may further include the step of numerically evaluating the uncertainty for each single embedding vector and processing such that a lower weight is applied to the corresponding single embedding vector in the target integrated embedding as the uncertainty increases.
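One simple way to realize "lower weight as uncertainty increases" is an inverse-uncertainty weighting, sketched below. The formula `1 / (1 + u)`, the normalization, and the weighted sum used as a stand-in for the trained integrated embedding model are all assumptions for illustration; the specification does not prescribe how the uncertainty is scored or how the weight is applied.

```python
from typing import Dict, List


def uncertainty_weights(uncertainties: Dict[str, float]) -> Dict[str, float]:
    """Map each modality's uncertainty score to a normalized weight."""
    # Higher uncertainty gives a smaller raw weight; negative scores are clamped to 0.
    raw = {name: 1.0 / (1.0 + max(u, 0.0)) for name, u in uncertainties.items()}
    total = sum(raw.values())
    return {name: w / total for name, w in raw.items()}


def weighted_sum(embeddings: Dict[str, List[float]],
                 weights: Dict[str, float]) -> List[float]:
    """Stand-in fusion: weighted sum of equally sized single embedding vectors."""
    dim = len(next(iter(embeddings.values())))
    fused = [0.0] * dim
    for name, vec in embeddings.items():
        for i, value in enumerate(vec):
            fused[i] += weights[name] * value
    return fused


# The camera is assumed to be unreliable here (e.g., heavy fog).
weights = uncertainty_weights({"image": 3.0, "sensor": 0.0})
fused = weighted_sum({"image": [1.0, 0.0], "sensor": [0.0, 1.0]}, weights)
```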
[0025] A method according to one embodiment of the present invention may further include a step of processing such that, if single modality data is missing and its single embedding vector is therefore absent from the target integrated embedding vector, a higher weight is applied in the target integrated embedding to the single embedding vectors for the remaining single modality data that is not missing.
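A minimal sketch of this reweighting, assuming per-modality base weights that sum to one: the weight of each missing modality is set to zero and the remaining weights are renormalized, so every remaining modality receives a higher weight than before. The function name and the base weights are hypothetical.

```python
from typing import Dict, Set


def reweight_for_missing(base_weights: Dict[str, float],
                         missing: Set[str]) -> Dict[str, float]:
    """Zero out missing modalities and renormalize the remaining weights."""
    present = {n: w for n, w in base_weights.items() if n not in missing}
    total = sum(present.values())
    if total == 0.0:
        raise ValueError("no usable modality remains")
    weights = {name: 0.0 for name in base_weights}
    weights.update({n: w / total for n, w in present.items()})
    return weights


base = {"image": 0.5, "sensor": 0.3, "text": 0.2}
adjusted = reweight_for_missing(base, {"image"})
```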
[0026] A method according to one embodiment of the present invention may further include a step of adjusting the positions of data whose distance is less than or equal to a certain distance to become closer, based on the distance in the data distribution for a plurality of target integrated embedding vectors.
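The position adjustment could be sketched as follows: every pair of integrated embedding vectors whose distance is at most `max_distance` is moved toward its midpoint, while more distant pairs are left untouched. The `step` parameter and the midpoint rule are assumptions made for illustration; the specification does not state how far or by what rule the positions are moved.

```python
import math
from typing import List, Sequence


def pull_closer(points: Sequence[Sequence[float]],
                max_distance: float,
                step: float = 0.5) -> List[List[float]]:
    """Shrink the gap between each close pair by the fraction `step`."""
    moved = [list(p) for p in points]
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            # Closeness is judged on the original positions, not the moved ones.
            if math.dist(points[i], points[j]) <= max_distance:
                for d in range(len(points[i])):
                    delta = (points[j][d] - points[i][d]) * step / 2
                    moved[i][d] += delta
                    moved[j][d] -= delta
    return moved


embeddings = [[0.0, 0.0], [1.0, 0.0], [10.0, 0.0]]
adjusted = pull_closer(embeddings, max_distance=2.0)
```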
[0027]
[0028] An apparatus according to one embodiment of the present invention includes: a memory; and a processor that executes at least a portion of the operation stored in the memory.
[0029] The processor can perform preprocessing on target data, which is multimodal data related to an autonomous driving scenario or driving environment, generate a single embedding vector for each of a plurality of single modality data included in the preprocessed target data using a corresponding previously trained single embedding model, and generate a single integrated embedding vector, which is a single multimodal vector formed by integrating the single embedding vectors, using a previously trained integrated embedding model.
[0030] During the preprocessing, the processor may sort each single modality data in the target data according to a preset order and add a flag or generate supplementary data for missing single modality data.
[0031] The processor can preprocess the target data during the preprocessing using preprocessing information regarding the training data used during the training of the single embedding model and the integrated embedding model.
[0032] The processor can verify the consistency of the target integrated embedding vector by comparing the training embedding information for the training data used during the training of the single embedding model and the integrated embedding model with the single integrated embedding vector.
[0033] The processor can obtain a similarity or distance between the learning embedding information and the target integrated embedding vector in the embedding space, and verify the consistency of the target integrated embedding vector according to the obtained similarity or distance.
[0034] The processor can apply a clustering technique to the learning embedding information and the target integrated embedding vector to identify major clusters for the learning embedding information, and verify the consistency of the target integrated embedding vector based on whether the target integrated embedding vector is included within the identified major clusters.
[0035] The processor can verify the consistency of the target integrated embedding vector by detecting whether the target integrated embedding deviates from the data distribution of the training embedding information through Out-of-Distribution (OOD) detection.
[0036] The processor can numerically evaluate the uncertainty for each single embedding vector and process it so that the higher the uncertainty, the lower the weight applied to the corresponding single embedding vector in the target integrated embedding.
[0037] The processor can, when single modality data is missing and its single embedding vector is therefore absent from the target integrated embedding vector, apply a higher weight in the target integrated embedding to the single embedding vectors for the remaining single modality data that is not missing.
[0038] The processor can adjust the positions of data whose distance is less than or equal to a certain distance so that they become closer, based on the distance in the data distribution for a plurality of target integrated embedding vectors.
[0039] This invention was researched with the support of a national research and development project, and the specific details of the national research and development project are as follows.
[0040] [Project ID] 2710007255
[0041] [Sub-project Number] II211352
[0042] [Ministry Name] Ministry of Science and ICT
[0043] [Project Management (Specialized) Agency Name] Korea Institute of Information & Communications Technology Planning & Evaluation
[0044] [Research Project Name] Autonomous Driving Technology Development Innovation Project
[0045] [Research Project Title] Development of Technology to Verify the Effectiveness of Service Scenarios for Responding to Autonomous Driving Laws and Regulations
[0046] [Name of Project Performing Organization] Korea Electronics Technology Institute
[0047] [Project Period] 2021-04-01~2024-12-31
[0048] The present invention, configured as described above, has the advantage of being able to process multimodal data necessary for the effective generation and retrieval of autonomous driving scenarios.
[0049] In other words, the present invention has the advantage of being able to process multimodal data that effectively describes various autonomous driving scenario situations.
[0050] In addition, the present invention has the advantage of helping an autonomous driving system recognize situations more accurately, reduce uncertainty, and make safe decisions by performing multimodal embedding, that is, embedding of the multimodal data of autonomous driving scenarios.
[0051] In particular, through multimodal embedding of multimodal data of autonomous driving scenarios processed according to the present invention, flexibility to adapt to environmental changes, meaningful information combination through interaction, and real-time situational awareness and risk assessment become possible, thereby providing the advantage of significantly improving the reliability and safety of autonomous driving.
[0052] In addition, the present invention has the advantage of enhancing usability in various fields, such as the generation, management, mutual comparison, search, or real-time accident avoidance of autonomous driving scenarios, through the processing result data (i.e., representation vector) of multimodal embeddings.
[0053] The effects obtainable from the present invention are not limited to those mentioned above, and other unmentioned effects will be clearly understood by those skilled in the art from the description below.
[0054] FIG. 1 shows a schematic block diagram of a system (100) according to one embodiment of the present invention.
[0055] FIG. 2 shows a schematic block diagram of the first and second devices (10, 20).
[0056] FIG. 3 shows a flowchart of a method according to one embodiment of the present invention.
[0057] FIG. 4 shows a conceptual diagram of a method according to one embodiment of the present invention.
[0058] FIG. 5 shows a detailed flowchart for S210.
[0059] FIG. 6 shows a detailed flowchart for S220.
[0060] FIG. 7 is a diagram illustrating the computing environment of the first and second devices (10, 20).
[0061] Hereinafter, specific embodiments of the present disclosure will be described with reference to the drawings. The following detailed description is provided to facilitate a comprehensive understanding of the methods, apparatuses, and/or systems described herein. However, this is merely illustrative, and the present invention is not limited thereto.
[0062] In describing the embodiments of the present disclosure, detailed descriptions of known technologies related to the present invention are omitted if it is determined that such detailed descriptions would unnecessarily obscure the essence of the embodiments. Furthermore, terms used below are defined with consideration of their functions in the present invention, and these may vary depending on the intentions or practices of the user or operator. Therefore, such definitions should be based on the content throughout this specification. Terms used in the detailed description are intended merely to describe specific embodiments and should not be limiting. Unless explicitly stated otherwise, expressions in the singular form include the meaning of the plural form. In this description, expressions such as "include" or "comprise" are intended to refer to certain characteristics, numbers, steps, actions, elements, parts thereof, or combinations thereof, and should not be interpreted to exclude the existence or possibility of one or more other characteristics, numbers, steps, actions, elements, parts thereof, or combinations thereof other than those described. Additionally, terms such as "...part," "...unit," "module," and "block" described in the specification refer to a unit that processes at least one function or operation, and this may be implemented in hardware, software, or a combination of hardware and software.
[0063] Hereinafter, a preferred embodiment according to the present invention will be described in detail with reference to the attached drawings.
[0064] FIG. 1 shows a schematic block diagram of a system (100) according to one embodiment of the present invention.
[0065] A system (100) according to one embodiment of the present invention (hereinafter referred to as the “system”) is a system for processing multimodal data regarding autonomous driving scenarios. The system (100) can perform a function of training a plurality of embedding models (hereinafter referred to as the “first function”) and a function of generating embedding vectors of multimodal data regarding autonomous driving scenarios using the trained plurality of embedding models (hereinafter referred to as the “second function”).
[0066] In this context, an autonomous driving scenario refers to a virtual situation that enables the evaluation of a vehicle's driving performance and safety by simulating various driving situations and environments that an autonomous vehicle may encounter while driving. These scenarios are designed to allow the vehicle to encounter complex road environments, diverse traffic conditions, weather conditions, pedestrians, and other obstacles, and can be utilized to evaluate and enhance responsiveness to unexpected situations that may occur during actual driving.
[0067] These autonomous driving scenarios may include components such as the driving environment, traffic elements, weather and environmental conditions, driving objectives, and tasks. By considering each of these components, autonomous driving scenarios can be used to comprehensively evaluate the driving safety, driving capability, and response ability of an autonomous vehicle.
[0068] At this time, the driving environment may include the following:
[0069] - Road types: Includes various road types such as expressways, general roads, intersections, roundabouts, etc.
[0070] - Lane structure: Includes lane-related information on the road, such as the number of lanes, lane width, presence or absence of lanes, and whether lane changes are permitted.
[0071] - Traffic signals and signs: including traffic lights, speed limit signs, one-way signs, stop signals, etc.
[0072] Traffic elements may include the following:
[0073] - Surrounding vehicles: Includes the number of other vehicles located around the autonomous vehicle, speed, driving path, and whether they changed lanes.
[0074] - Pedestrians and Bicycles: Includes situations where pedestrians and bicycles cross the road or are located next to the road, and these factors are important for autonomous vehicles to evaluate their response to unexpected moving objects.
[0075] Weather and environmental conditions may include the following:
[0076] - Weather conditions: Includes various weather conditions such as clear, rain, snow, and fog, which are used to evaluate vehicle sensor performance and driving stability.
[0077] - Lighting conditions: Includes illumination environments such as daytime, nighttime, dawn, and dusk, which are used to test the recognition capabilities of autonomous vehicles.
[0078] The driving purpose and task may include the following:
[0079] - Driving path and speed control: Includes how the vehicle will travel on specific road sections and how to adjust the driving speed.
[0080] - Emergency Response: Includes situations that evaluate emergency response capabilities, such as the appearance of sudden obstacles or situations requiring sudden braking.
[0081] - Lane Change and Turn: Evaluates vehicle responsiveness and safety in lane changes, left turns, and right turns.
[0082] In addition, the following are examples of autonomous driving scenarios that can be used to evaluate autonomous vehicles.
[0083] 1. Highway Driving Scenario
[0084] - Situation: An autonomous vehicle on a highway is driving at a speed similar to surrounding vehicles.
[0085] - Objective: Test lane keeping, speed control, maintaining distance from other vehicles, and reaction to surrounding vehicles changing lanes.
[0086] 2. City Intersection Scenario
[0087] - Situation: Traffic lights are operating at an intersection, and pedestrians and vehicles are entering from various directions.
[0088] - Objective: Collect data on compliance with traffic signals at intersections, pedestrian detection and safe crossing, and recognition of the direction and speed of other vehicles.
[0089] 3. Pedestrian Appearance Scenario
[0090] - Situation: A pedestrian crosses at a crosswalk or suddenly enters the road from the roadside.
[0091] - Objective: Evaluate whether collisions with pedestrians can be avoided by testing pedestrian detection, emergency braking, and evasive capabilities.
[0092] 4. Adverse Weather Driving Scenario
[0093] - Situation: Applies to an environment where visibility is low and roads are slippery due to heavy rain, snow, fog, etc.
[0094] - Objective: Evaluate driving stability, deceleration response, sensor recognition performance, etc. in adverse weather conditions
[0095] 5. Night Driving Scenario
[0096] - Situation: Applies to driving on a dark road with few or no streetlights.
[0097] - Objective: Evaluate forward recognition, lane keeping, obstacle detection, etc. using headlights
[0098] Meanwhile, multimodal data can correspond to data that represents autonomous driving scenarios. In this case, multimodal data is characterized by the combination of complex data to represent autonomous driving scenarios. That is, multimodal data includes two or more data of different types (i.e., single modality data). Accordingly, multimodal data can represent autonomous driving scenarios through n different types of single modality data (where n is a natural number greater than or equal to 2).
[0099] For example, multimodal data may correspond to a combination of data of various formats including at least two from text, images, audio, video, GPS (Global Positioning System) data, map data, sensor data, and time-series sensor data. In this case, text, images, audio, video, GPS data, map data, sensor data, and time-series sensor data each correspond to different single-modality data.
[0100] Meanwhile, the embedding model may include n single embedding models and at least one integrated embedding model. In this case, the number of single embedding models is equal to the number of types (i.e., n) of single modality data included in the multimodal data. Accordingly, each single embedding model can be trained to process each single modality data exclusively to generate an embedding vector (i.e., a single embedding vector) for the corresponding single modality data. That is, each single embedding model can generate a single embedding vector for the corresponding single modality data when the dedicated single modality data is input.
[0101] Embedding vectors, also referred to as "representation vectors," represent high-dimensional source data (i.e., single-modality data, etc.) compressed into a low-dimensional space and expressed as a vector; they can be used in fields such as machine learning. In other words, an embedding vector represents high-dimensional source data as a lower-dimensional vector, containing a meaningful representation of the source data.
[0102] For example, multimodal data may include individual single-modality data for text, images, audio, video, GPS data, map data, sensor data, and time-series sensor data. In this case, a single embedding model for generating a single embedding vector for text (i.e., text embedding vector) (hereinafter referred to as the “text embedding model”), a single embedding model for generating a single embedding vector for an image (i.e., image embedding vector) (hereinafter referred to as the “image embedding model”), a single embedding model for generating a single embedding vector for audio (i.e., audio embedding vector) (hereinafter referred to as the “audio embedding model”), a single embedding model for generating a single embedding vector for video (i.e., video embedding vector) (hereinafter referred to as the “video embedding model”), a single embedding model for generating a single embedding vector for GPS data (i.e., GPS embedding vector) (hereinafter referred to as the “GPS embedding model”), a single embedding model for generating a single embedding vector for map data (i.e., map embedding vector) (hereinafter referred to as the “map embedding model”), a single embedding model for generating a single embedding vector for sensor data (i.e., sensor embedding vector) (hereinafter referred to as the “sensor embedding model”), and a single embedding model for generating a single embedding vector for time-series sensor data (i.e., time-series embedding vector) (hereinafter referred to as the “time-series embedding model”) may each be provided.
[0103] The integrated embedding model can be trained to generate an embedding vector (i.e., an integrated embedding vector) by integrating the single embedding vectors output from each single embedding model. For example, the integrated embedding model can receive at least two of the text embedding vector, image embedding vector, audio embedding vector, video embedding vector, GPS embedding vector, map embedding vector, sensor embedding vector, and time-series embedding vector as inputs and generate an integrated embedding vector for these single embedding vectors.
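The two-stage flow described above (one single embedding model per modality, followed by an integrated embedding model) can be sketched as below. The functions `embed_text` and `embed_sensor` are toy stand-ins for trained single embedding models, and the concatenation followed by a fixed linear projection is only a placeholder for a trained integrated embedding model; none of these are taken from the specification.

```python
from typing import Any, Callable, Dict, List, Sequence

Vector = List[float]


def embed_text(text: str) -> Vector:
    # Stand-in for a trained text embedding model: character and word counts.
    return [float(len(text)), float(len(text.split()))]


def embed_sensor(readings: Sequence[float]) -> Vector:
    # Stand-in for a trained sensor embedding model: mean and maximum reading.
    return [sum(readings) / len(readings), max(readings)]


# One dedicated single embedding model per modality, in a fixed order.
SINGLE_MODELS: Dict[str, Callable[[Any], Vector]] = {
    "text": embed_text,
    "sensor": embed_sensor,
}


def integrate(single_vectors: Dict[str, Vector],
              projection: List[List[float]]) -> Vector:
    """Placeholder integrated model: concatenate, then apply a linear projection."""
    concat = [v for name in SINGLE_MODELS for v in single_vectors[name]]
    return [sum(w * x for w, x in zip(row, concat)) for row in projection]


def embed_scenario(data: Dict[str, Any], projection: List[List[float]]) -> Vector:
    single = {name: model(data[name]) for name, model in SINGLE_MODELS.items()}
    return integrate(single, projection)


# Hypothetical projection weights; a trained model would learn these.
projection = [[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 1.0]]
integrated = embed_scenario({"text": "brake now", "sensor": [1.0, 3.0]}, projection)
```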
[0104] For example, in relation to autonomous driving scenarios, multimodal data from the perspective of the driving vehicle may include data containing various forms of information regarding the vehicle's current state and driving environment. That is, such multimodal data consists of different types of single modality data, such as text warning messages, camera images, GPS location information, LiDAR and radar sensor data, and time-series sensor data, with each single modality data providing a different type of information. While each single modality data contains important information individually, combining them enables a richer and more accurate perception of the situation.
[0105] For example, image data, which is single modality data acquired from camera footage, provides visual information about the vehicle's surroundings. In other words, cameras can capture various visual information necessary for driving, such as the color and shape of objects, lanes, traffic lights, and pedestrians. Additionally, sensor data, which is single modality data acquired from sources such as LiDAR and radar, can provide information on the distance and speed of objects around the vehicle. In this context, LiDAR uses lasers to capture 3D spatial information around the vehicle, while radar is useful for measuring the position and speed of objects in real time during driving. GPS data, which is single modality data acquired via GPS, and map data, which is single modality data acquired via High Definition Maps (HD Maps), respectively provide precise location and route information for the vehicle. Specifically, GPS data provides the vehicle's current location, while map data provides precise information about the vehicle's position and surrounding environment, including static elements such as lane information and traffic light locations. Text data, which is single modality data for warnings and guidance messages, conveys warnings or guidance regarding the vehicle's status or surrounding conditions while driving. This serves to enhance driving safety and guide driving strategies in specific situations. Time-series sensor data, which is single modality data regarding vehicle movement, includes dynamic data such as vehicle speed, acceleration, direction, and brake status. This time-series sensor data is necessary to understand the vehicle's status in real time during driving. Audio data, which is single modality data regarding sound, captures sounds occurring around the vehicle to provide supplementary situational information. For example, it assists in situational awareness by recognizing specific sounds, such as sirens, pedestrian warnings, and audible traffic signals.
[0106] Referring to FIG. 1, the system (100) may include first and second devices (10, 20). Various information may be transmitted and received between the first and second devices (10, 20) via wired or wireless communication. The second device (20) is a device separate from the first device (10), and conversely, the first device (10) is a device separate from the second device (20).
[0107] In the present system (100), the first device (10) is an electronic device for performing the first function. That is, the first device (10) corresponds to an electronic device for training and providing multiple embedding models. Additionally, in the present system (100), the second device (20) is an electronic device for performing the second function. That is, the second device (20) corresponds to an electronic device for generating embedding vectors of multimodal data regarding autonomous driving scenarios using multiple trained embedding models. However, both the first and second functions may be performed in either of the first and second devices (10, 20).
[0108] For example, the electronic device may be a general-purpose computing system such as a desktop PC, laptop PC, tablet PC, netbook computer, workstation, smartphone, or smartpad, a dedicated embedded system implemented based on Embedded Linux, or a cloud server system, but is not limited thereto.
[0109] FIG. 2 shows a schematic block diagram of the first and second devices (10, 20).
[0110] The first and second devices (10, 20) may include a memory (14, 24) and a control unit (15, 25), as shown in FIG. 2. Of course, the first and second devices (10, 20) may further include an input unit (11, 21), a communication unit (12, 22), or an output unit (13, 23).
[0111] The input unit (11, 21) generates input data in response to various user inputs and may include various input means. For example, the input unit (11, 21) may include a keyboard, key pad, dome switch, touch panel, touch key, touch pad, mouse, menu button, sound input device, various types of sensor devices, or a shooting device, but is not limited thereto.
[0112] The communication unit (12, 22) is configured to perform communication with other devices. That is, the first communication unit (12) of the first device (10) can transmit preprocessing information, multiple embedding models, training embedding information, etc. to the second communication unit (22) of the second device (20). In addition, the second communication unit (22) of the second device (20) can transmit the results of performing the second function, etc. to other devices (including the first device (10)).
[0113] For example, the communication unit (12, 22) may perform wireless communication such as cellular communication, LoRa communication, SigFox communication, 5G (5th generation communication), LTE-A (long term evolution-advanced), LTE (long term evolution), WiFi communication or Bluetooth, or wired communication using UTP (Unshielded Twisted Pair cable) cable, coaxial cable, optical cable or HFC (Hybrid Fiber Coax) cable, but is not limited thereto.
[0114] The output unit (13, 23) is configured to generate various outputs. For example, the output unit (13, 23) may include output devices such as a display, printer, speaker, or network card. In particular, the display displays various image data on a screen and may be composed of a non-emissive panel or an emissive panel. At this time, the display may display the process or result of performing the first or second function. For example, the display may include a liquid crystal display (LCD), a light emitting diode (LED) display, an organic light emitting diode (OLED) display, a micro electro mechanical systems (MEMS) display, or an electronic paper display, but is not limited thereto. Additionally, the display may be combined with the input unit (11, 21) and implemented as a touch screen, etc.
[0115] The memory (14, 24) stores various information necessary for the operation of the first and second devices (10, 20). The information stored in the first memory (14) of the first device (10) may include training data, preprocessing information, multiple embedding models, information received from another device, and a program (14a, 24a) related to a method to be described later, but is not limited thereto. Additionally, the information stored in the second memory (24) of the second device (20) may include preprocessing information, multiple embedding models, training embedding information, information received from another device, and a program (14a, 24a) related to a method to be described later, but is not limited thereto.
[0116] For example, memory (14, 24) may include volatile memory devices such as DRAM or SRAM, non-volatile memory such as PRAM, MRAM, ReRAM or NAND flash memory, or hard disk drives (HDD) or solid-state drives (SSD), but is not limited thereto. Additionally, memory (14, 24) may be a cache, buffer, main memory, or secondary memory, or a separately provided storage system, depending on its use / location, but is not limited thereto.
[0117] The control unit (15, 25) can perform various control operations of the first and second devices (10, 20). In particular, the first control unit (15) of the first device (10) can control the performance of the first function, etc., and the second control unit (25) of the second device (20) can control the performance of the second function, etc. Of course, if both the first and second functions are performed in either of the first and second devices (10, 20), the control unit of the device can control the performance of the first and second functions.
[0118] To this end, the control unit (15, 25) can control the execution of the method described below and can control the operation of the remaining components of the first and second devices (10, 20), namely the input unit (11, 21), communication unit (12, 22), output unit (13, 23), memory (14, 24), etc.
[0119] The control unit (15, 25) may include a processor (16, 26) which is hardware or a process which is software executed on the processor (16, 26), but is not limited thereto. For example, the processor (16, 26) may include a microprocessor, an MCU (Micro Controller Unit), a CPU (Central Processing Unit), a processor core, a multiprocessor, an ASIC (Application-Specific Integrated Circuit), or an FPGA (Field Programmable Gate Array), but is not limited thereto.
[0120] Hereinafter, a method according to one embodiment of the present invention will be described in more detail.
[0121] A method according to one embodiment of the present invention (hereinafter referred to as "the present method") is a method performed in the present system (100) and is a method for processing multimodal data for an autonomous driving scenario.
[0122] FIG. 3 shows a flowchart of a method according to one embodiment of the present invention, and FIG. 4 shows a conceptual diagram of a method according to one embodiment of the present invention.
[0123] This method may include S210 and S220 as illustrated in FIG. 3. In this case, S210 and S220 may be performed under the control of the control unit (15, 25). That is, the processor (16, 26) of the control unit (15, 25) may process the execution of S210 and S220.
[0124] First, the first device (10) can perform a first function (S210).
[0125] That is, the first control unit (15) can control the training of multiple embedding models using multimodal data for training, which is training data for an autonomous driving scenario. At this time, multiple sets of such training data are stored in memory (14).
[0126] FIG. 5 shows a detailed flowchart of S210.
[0127] Referring to FIG. 5, S210 may include S211 to S215.
[0128] In relation to S210, first, the first device (10) performs data preprocessing on the training data (S211).
[0129] That is, the first control unit (15) can control the performance of preprocessing on the training data so that each input training data (i.e., multimodal training data) is aligned and each training data is prepared in a consistent state. As a result of performing such preprocessing, preprocessing information may be generated. At this time, the preprocessing information is information about the performed preprocessing and may include information about the consistently aligned multimodal training data after preprocessing (i.e., alignment information), missing status information, and feature information.
[0130] Referring to FIG. 4, such preprocessing information can be transmitted to the second device (20) through the first communication unit (12). That is, the second device (20) can receive the preprocessing information through the second communication unit (22), and can perform the same type of preprocessing as performed in S211 in S221, which will be described later, using the received preprocessing information.
[0131] This preprocessing may include the following processing.
[0132] - Each training multimodal data set is sorted in a specific order. Accordingly, each single modality data set included in the training multimodal data set can be placed in its unique position.
[0133] - Normalization is performed to standardize the range of each single modality data point, and scaling is performed to reduce the numerical differences between identical single modality data points to a certain range. Such normalization and scaling can enhance the stability of the learning process.
[0134] - Missing single modality data in each multimodal training data set is detected (i.e., the missing status is detected), and the impact of the omission is reduced by adding a flag or generating supplementary data.
[0135] - Only the key features necessary for training are selected from each single modality data and unnecessary information is removed, which increases efficiency.
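As a non-limiting illustration of the normalization and scaling described above, a minimal Python sketch is given below. It assumes that each single modality data has already been reduced to a list of numeric feature values. The function names `min_max_normalize` and `standardize`, and the choice of min-max normalization and z-score scaling, are assumptions made for illustration and are not specified by the present description.

```python
from statistics import mean, pstdev
from typing import List


def min_max_normalize(values: List[float]) -> List[float]:
    """Rescale values into the range [0, 1] (hypothetical normalization choice)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant sequence carries no range information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def standardize(values: List[float]) -> List[float]:
    """Scale values to zero mean and unit standard deviation (z-score)."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0.0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


# Example: hypothetical distance-sensor readings in meters.
readings = [10.0, 20.0, 30.0]
normalized = min_max_normalize(readings)
scaled = standardize(readings)
```

The same functions would have to be applied with the same settings in S211 and S221 so that the training data and the target data are prepared consistently.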
[0136] In relation to S210, the first device (10) trains dedicated single embedding models to generate a single embedding vector using each single modality data included in the multimodal data for training preprocessed in S211 (S212).
[0137] That is, the first control unit (15) can control the training of each single embedding model so that dedicated single modality data included in the preprocessed multimodal data for training is input to each single embedding model, and that each single embedding model generates and outputs an embedding vector (i.e., a single embedding vector) corresponding to the dedicated single modality data. Accordingly, the single embedding model can be trained to generate an embedding vector for each single modality data in the preprocessed multimodal data for training, thereby performing the function of expressing individual characteristics. At this time, the training of each single embedding model means that the parameters included in each single embedding model (i.e., weights and biases, etc.) are set to optimal values.
[0138] For example, each of the following single embedding models can be trained to generate and output an embedding vector for its dedicated single-modality data when that data, included in the preprocessed multimodal training data, is input.
- A text embedding model is trained on the single-modality data of the text and outputs an embedding vector for the given text (i.e., a text embedding vector).
- An image embedding model is trained on the single-modality data of the image and outputs an embedding vector for the given image (i.e., an image embedding vector).
- An audio embedding model is trained on the single-modality data of the audio and outputs an embedding vector for the given audio (i.e., an audio embedding vector).
- A video embedding model is trained on the single-modality data of the video and outputs an embedding vector for the given video (i.e., a video embedding vector).
- A GPS data embedding model is trained on the single-modality data of the GPS data and outputs an embedding vector for the corresponding GPS data (i.e., a GPS data embedding vector).
- A map embedding model is trained on the single-modality data of the map data and outputs an embedding vector for the corresponding map data (i.e., a map embedding vector).
- A sensor embedding model is trained on the single-modality data of the sensor data and outputs an embedding vector for the corresponding sensor data (i.e., a sensor embedding vector).
- A time-series sensor data embedding model is trained on the single-modality data of the time-series sensor data and outputs an embedding vector for the corresponding time-series sensor data (i.e., a time-series embedding vector).
[0139] For example, text embedding models can be implemented using natural language embedding models such as BERT or GPT. Image and video embedding models can be implemented using models such as CNN, ResNet, or Vision Transformer to extract visual features of single-modality data of images and videos into image and video embedding vectors. Audio embedding models can convert single-modality data of audio into spectrograms and then use CNNs to extract frequency features into audio embedding vectors. Sensor and time-series embedding models can be implemented using models such as LSTM or GRU to extract temporal features of single-modality data of sensor and time-series data into sensor and time-series embedding vectors.
[0140] Referring to FIG. 4, each learned single embedding model can be transmitted to the second device (20) through the first communication unit (12). That is, the second device (20) can receive each single embedding model through the second communication unit (22), and accordingly, can perform S222 described later using each single embedding model having the same parameters as learned in S212.
[0141] In relation to S210, the first device (10) trains an integrated embedding model to generate a single integrated embedding vector by integrating (i.e., combining) each single embedding vector generated and output from each single embedding model learned in S212 (S213).
[0142] That is, the first control unit (15) can control the training of an integrated embedding model using each single embedding vector output from each single embedding model. At this time, the integrated embedding model can be trained to generate an integrated embedding vector by integrating each single embedding vector when each single embedding vector output from each single embedding model is input. At this time, the training of the integrated embedding model means that the parameters included in the integrated embedding model (i.e., weights and biases, etc.) are set to optimal values.
[0143] For example, methods for combining multiple single embedding vectors may include concatenation, weighted fusion, Cross-Attention, or Bilinear Pooling. In this case, the integrated embedding model corresponds to a model that performs the concatenation, weighted fusion, Cross-Attention, or Bilinear Pooling. Additionally, the integrated embedding model can be implemented using Transformer-based Cross-Attention or GNN. In this case, the integrated embedding model learns the interactions between single embedding vectors to generate an integrated embedding vector by supplementing with other single embedding vectors even when information is missing.
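As a non-limiting illustration, two of the combination methods named above (concatenation and weighted fusion) may be sketched in Python as follows. The sketch uses fixed, hand-chosen weights, whereas the integrated embedding model of the present description learns its parameters in S213 and S214. The function names and the example vectors are assumptions made for illustration.

```python
from typing import Dict, List


def concatenate(vectors: Dict[str, List[float]], order: List[str]) -> List[float]:
    """Join the single embedding vectors end to end in a fixed modality order."""
    fused: List[float] = []
    for name in order:
        fused.extend(vectors[name])
    return fused


def weighted_fusion(vectors: Dict[str, List[float]],
                    weights: Dict[str, float]) -> List[float]:
    """Weighted average of equal-length vectors; weights are renormalized to sum to 1."""
    names = list(vectors)
    dim = len(vectors[names[0]])
    if any(len(vectors[n]) != dim for n in names):
        raise ValueError("weighted fusion requires vectors of equal dimension")
    total = sum(weights[n] for n in names)
    if total == 0:
        raise ValueError("weights must not sum to zero")
    return [sum(weights[n] * vectors[n][i] for n in names) / total
            for i in range(dim)]


# Hypothetical two-dimensional single embedding vectors.
single = {"text": [1.0, 0.0], "image": [0.0, 1.0]}
concatenated = concatenate(single, ["text", "image"])
fused = weighted_fusion(single, {"text": 3.0, "image": 1.0})
```

Concatenation preserves every dimension of every modality at the cost of a longer vector, while weighted fusion keeps the dimension fixed but requires the single embedding vectors to share a common dimension.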
[0144] Referring to FIG. 4, the learned integrated embedding model can be transmitted to the second device (20) through the first communication unit (12). That is, the second device (20) can receive the integrated embedding model through the second communication unit (22), and accordingly, can perform S223 described later using the integrated embedding model having the same parameters as learned in S213.
[0145] In relation to S210, the first device (10) performs training on the integrated embedding model so that the integrated embedding model is optimized using a loss function suitable for the learning objective (S214). This S214 may correspond to a step performed in conjunction with S213.
[0146] That is, the first control unit (15) can control the learning of the integrated embedding model so that parameter values of the optimized integrated embedding model can be derived using the loss calculation result according to the loss function.
[0147] For example, a loss function based on the Contrastive Loss method can be used. In this case, similarity is learned by placing similar data in close embedding spaces and different data far apart. Alternatively, a Multi-task Loss function can be used. Accordingly, when optimizing multiple tasks simultaneously, learning can be achieved by weighting the losses of each task.
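As a non-limiting illustration, one common pairwise form of a contrastive loss and a weighted multi-task loss may be sketched in Python as follows. The margin-based formulation, the `margin` parameter, and the task names are assumptions made for illustration; the present description does not fix a particular formulation.

```python
import math
from typing import Dict, Sequence


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def contrastive_loss(a: Sequence[float], b: Sequence[float],
                     similar: bool, margin: float = 1.0) -> float:
    """Pairwise contrastive loss.

    Similar pairs are penalized by their squared distance (pulled together);
    dissimilar pairs are penalized only while closer than `margin` (pushed apart).
    """
    d = euclidean(a, b)
    if similar:
        return d ** 2
    return max(0.0, margin - d) ** 2


def multi_task_loss(losses: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted sum of per-task losses."""
    return sum(weights[task] * loss for task, loss in losses.items())


# Hypothetical per-task losses and weights.
total = multi_task_loss({"retrieval": 2.0, "risk": 4.0},
                        {"retrieval": 0.5, "risk": 0.25})
```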
[0148] In relation to S210, the first device (10) evaluates and verifies the performance of the learned integrated embedding model (S215).
[0149] That is, the first control unit (15) provides performance evaluation indicators by verifying that the learned integrated embedding model can perform in various situations. For example, the performance of the model can be evaluated and verified by checking whether the integrated embedding vector generated by the integrated embedding model is consistently and correctly mapped within the learned embedding space. In addition, the generalization performance of the integrated embedding model can be verified by testing whether the integrated embedding model operates stably even in situations that are not similar to the training data. Furthermore, it can be evaluated how well the integrated embedding model detects and responds in situations where there is unexpectedly missing single modality data or abnormal situations.
[0150] Meanwhile, when the execution of S211 to S215 is completed for a plurality of training data, each integrated embedding vector for the plurality of training data can be generated. At this time, the first control unit (15) can transmit information (i.e., training embedding information) regarding at least one integrated embedding vector selected from among the integrated embedding vectors for all training data to the second device (20).
[0151] That is, the first control unit (15) can control the transmission of the corresponding training embedding information to the second device (20) through the first communication unit (12). Accordingly, the second control unit (25) can receive the corresponding training embedding information through the second communication unit (22) and store it in the second memory (24).
[0152] Next, the second device (20) can perform a second function (S220). Of course, the first device (10) may also perform the second function, in which case the details regarding the second device (20), the second control unit (25), etc. described later may be replaced with the details regarding the first device (10), the first control unit (15), etc.
[0153] That is, the second control unit (25) can control the process of converting target multimodal data, which is target data for an autonomous driving scenario, into an embedding vector using a plurality of previously learned single embedding models and an integrated embedding model. At this time, the target data is stored in the second memory (24).
[0154] Each single embedding model and the integrated embedding model may be a machine learning model trained according to machine learning techniques, but is not limited thereto and may be a model based on various other techniques for generating embedding vectors. Each of these single embedding models and the integrated embedding model may include various layers, and each layer may include parameters for generating embedding vectors. Accordingly, training each single embedding model and the integrated embedding model may mean finding and setting optimal values for these parameters.
[0155] FIG. 6 shows a detailed flowchart of S220.
[0156] Referring to FIG. 6, S220 may include S221 to S224.
[0157] In relation to S220, first, the second device (20) performs data preprocessing on the target data (S221).
[0158] That is, the second control unit (25) aligns the input target data (i.e., multimodal data for the target) and processes omissions using the preprocessing information based on the preprocessing performed in S211, thereby converting the target data consistently. Of course, the preprocessing information has already been received from the first device (10) and stored in the second memory (24).
[0159] By sorting the target data into a specific order according to this preprocessing and then inputting it into each single embedding model in S222 described later, the consistency, prediction performance, and computational efficiency of each single embedding model can be improved. In particular, in systems where real-time data processing is critical, such as autonomous driving, consistent data alignment plays a major role in generating stable and reliable embeddings. Therefore, by inputting multimodal data into each single embedding model in the same order through the same preprocessing during training according to S210 and transformation according to S220, each single embedding model becomes capable of correctly recognizing and processing each single modality data.
[0160] In relation to S220, the second device (20) generates each single embedding vector for a plurality of single modality data included in the multimodal data for a target preprocessed in S221 using each single embedding model learned in S210 (S222).
[0161] That is, the second control unit (25) can control each single embedding model learned in S210 so that, when the dedicated single modality data included in the target multimodal data preprocessed in S221 is input to that model, the model generates and outputs an embedding vector (i.e., a single embedding vector) corresponding to that single modality data.
[0162] At this time, each single embedding model having the same parameters as those learned in S210 must be used, and only in this case can the desired performance be expected when each single embedding vector is combined (i.e., integrated) for the integrated embedding vector. Of course, each of the corresponding single embedding models has already been received from the first device (10) and stored in the second memory (24).
[0163] In relation to S220, the second device (20) uses an integrated embedding model previously learned in S210 to integrate (i.e., combine) each single embedding vector generated in S222 to generate a single integrated embedding vector (S223).
[0164] That is, the second control unit (25) can control the output of an integrated embedding vector for the target data (i.e., a target integrated embedding vector) by inputting each single embedding vector output from each single embedding model in S222 into the integrated embedding model learned in S210.
[0165] At this time, an integrated embedding model having the same parameters as those learned in S210 must be used, and only in this case can a target integrated embedding vector be expected to be generated with the desired performance. Of course, the integrated embedding model has already been received from the first device (10) and stored in the second memory (24).
[0166] In relation to S220, the second device (20) performs verification or optimization on the target integrated embedding vector generated in S223 (S224).
[0167] That is, the second control unit (25) can control the verification and optimization of the target integrated embedding vector. At this time, for the target integrated embedding vector, (1) embedding consistency verification, (2) missing flag and uncertainty processing, and (3) optimization may be optionally performed.
[0168] (1) Embedding consistency verification can be performed when training embedding information is transmitted from S210. That is, embedding consistency verification is a function that verifies the consistency of the target integrated embedding vector by comparing the target integrated embedding vector with the training embedding information.
[0169] In particular, embedding consistency verification can be performed using similarity checks, clustering-based checks, or Out-of-Distribution (OOD) detection.
[0170] First, in the case of similarity checking, the location of the newly generated target integrated embedding vector and the training embedding information within the embedding space (i.e., the embedding vector space) is checked and compared. For example, the target integrated embedding vector and the training embedding information can be compared based on their similarity or distance using cosine similarity or Euclidean distance. In this case, it may be preferable for the training embedding information used for comparison to be a single integrated embedding vector arbitrarily selected.
[0171] If the similarity is smaller than the first threshold value or the distance is larger than the second threshold value, it may be determined as an outlier, and it may be determined that the consistency of the target integrated embedding vector is not maintained. On the other hand, if the similarity is larger than the first threshold value or the distance is smaller than the second threshold value, it may be determined as a normal value, and it may be determined that the consistency of the target integrated embedding vector is maintained.
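As a non-limiting illustration, the similarity check described above may be sketched in Python as follows. The function names and the threshold values used in the example are assumptions made for illustration. The case in which the similarity or distance equals its threshold is not specified above and is treated here as a normal value.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def euclidean_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def consistent_by_similarity(target: Sequence[float], reference: Sequence[float],
                             first_threshold: float) -> bool:
    """Normal value if the similarity is not smaller than the first threshold."""
    return cosine_similarity(target, reference) >= first_threshold


def consistent_by_distance(target: Sequence[float], reference: Sequence[float],
                           second_threshold: float) -> bool:
    """Normal value if the distance is not larger than the second threshold."""
    return euclidean_distance(target, reference) <= second_threshold
```

Here `reference` stands for the single integrated embedding vector selected from the training embedding information, and `target` stands for the target integrated embedding vector.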
[0172] Next, clustering-based inspection is performed by applying a clustering technique (e.g., K-Means or DBSCAN) to the training embedding information and the target integrated embedding vector. In this case, the inspection can be conducted by verifying whether the target integrated embedding vector is included within the major clusters of the training embedding information derived according to the clustering technique. For example, a major cluster may be a cluster region in the embedding space (i.e., the embedding vector space) that is primarily occupied by multiple integrated embedding vectors. Accordingly, it may be desirable for the training embedding information used for this inspection to consist of multiple arbitrarily selected integrated embedding vectors.
[0173] If the target integrated embedding vector is located outside the corresponding major cluster, it may be judged as an outlier, and it may be determined that the consistency of the target integrated embedding vector is not maintained. Conversely, if the target integrated embedding vector is located within the corresponding major cluster, it may be judged as a normal value, and it may be determined that the consistency of the target integrated embedding vector is maintained.
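As a non-limiting illustration, the cluster-membership test may be sketched in Python as follows. The sketch assumes that cluster labels for the training embedding information have already been obtained by a clustering technique such as K-Means or DBSCAN. Treating a cluster as major when it has at least `min_size` members, and treating its region as a ball around the centroid whose radius is the largest member distance, are hypothetical criteria chosen for illustration.

```python
import math
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def major_clusters(vectors: List[Sequence[float]], labels: List[int],
                   min_size: int = 2) -> List[Tuple[List[float], float]]:
    """Return (centroid, radius) for every cluster with at least min_size members."""
    groups: Dict[int, List[Sequence[float]]] = defaultdict(list)
    for vector, label in zip(vectors, labels):
        groups[label].append(vector)
    clusters = []
    for members in groups.values():
        if len(members) < min_size:
            continue  # small clusters are not treated as major clusters
        dim = len(members[0])
        centroid = [sum(m[i] for m in members) / len(members) for i in range(dim)]
        radius = max(euclidean(m, centroid) for m in members)
        clusters.append((centroid, radius))
    return clusters


def in_major_cluster(target: Sequence[float],
                     clusters: List[Tuple[List[float], float]]) -> bool:
    """True if the target lies within the radius of any major cluster."""
    return any(euclidean(target, centroid) <= radius
               for centroid, radius in clusters)


# Hypothetical training embeddings with precomputed cluster labels.
train = [[0.0, 0.0], [2.0, 0.0], [10.0, 10.0], [12.0, 10.0], [50.0, 50.0]]
labels = [0, 0, 1, 1, 2]
clusters = major_clusters(train, labels)
```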
[0174] OOD detection is an inspection performed based on an OOD technique that detects whether the target integrated embedding vector deviates from the data distribution of the training embedding information. For example, the Mahalanobis distance, OpenMax, or Isolation Forest can be used to detect whether the target integrated embedding vector deviates from the corresponding data distribution. Accordingly, it may be desirable for the training embedding information used for this inspection to consist of multiple arbitrarily selected integrated embedding vectors.
[0175] If the target integrated embedding vector is detected to be outside the corresponding data distribution, it may be determined as an outlier, and it may be decided that the consistency of the target integrated embedding vector is not maintained. On the other hand, if the target integrated embedding vector is detected to be included within the corresponding data distribution, it may be determined as a normal value, and it may be decided that the consistency of the target integrated embedding vector is maintained.
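As a non-limiting illustration, OOD detection based on the Mahalanobis distance may be sketched in Python as follows. For brevity the sketch assumes a diagonal covariance matrix (i.e., it ignores correlations between embedding dimensions), which is a simplification of the general Mahalanobis distance. The function names, the `eps` stabilizer, and the threshold value are assumptions made for illustration.

```python
import math
from statistics import mean, pvariance
from typing import List, Sequence, Tuple


def fit_diagonal_gaussian(train: List[Sequence[float]],
                          eps: float = 1e-9) -> Tuple[List[float], List[float]]:
    """Per-dimension mean and variance of the training embeddings."""
    dim = len(train[0])
    means = [mean([v[i] for v in train]) for i in range(dim)]
    # eps avoids division by zero for dimensions with no spread.
    variances = [pvariance([v[i] for v in train]) + eps for i in range(dim)]
    return means, variances


def mahalanobis_diag(x: Sequence[float], means: List[float],
                     variances: List[float]) -> float:
    """Mahalanobis distance under a diagonal covariance assumption."""
    return math.sqrt(sum((xi - m) ** 2 / var
                         for xi, m, var in zip(x, means, variances)))


def is_out_of_distribution(x: Sequence[float], means: List[float],
                           variances: List[float], threshold: float = 3.0) -> bool:
    """Outlier if the distance exceeds the (hypothetical) threshold."""
    return mahalanobis_diag(x, means, variances) > threshold


# Hypothetical training embedding information.
means, variances = fit_diagonal_gaussian([[0.0, 0.0], [2.0, 4.0]])
```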
[0176] (2) Missing flag and uncertainty processing can be performed by [adding missing flags and generating replacement values], [assessing uncertainty], and [adjusting weights].
[0177] At this time, [Add Missing Flag and Generate Replacement Value] detects single modality data missing from the target integrated embedding vector and flags whether each single modality data is missing. If the missing flag is enabled, the value of the missing single modality data is predicted using a pre-trained replacement prediction model, and the corresponding single embedding vector is reflected in the target integrated embedding vector. Accordingly, supplementary data for the missing single modality data can be reflected in the target integrated embedding vector.
[0178] For example, if text data is missing from a target integrated embedding vector, a model can be applied to predict that text data using a single embedding vector of other single modality data included in that target integrated embedding vector.
[0179] [Uncertainty Assessment] quantifies the uncertainty of the target integrated embedding vector using Monte Carlo Dropout or Bayesian Neural Networks. It numerically evaluates the uncertainty arising during the generation process of each single embedding vector, and reduces the impact on the overall target integrated embedding vector by applying a lower weight to the single embedding vector with higher uncertainty when it is included in the target integrated embedding vector.
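As a non-limiting illustration, the weighting by uncertainty may be sketched in Python as follows. The sketch assumes that several embedding samples per modality are already available, for example from repeated stochastic forward passes as in Monte Carlo Dropout; it does not implement the dropout itself. Using the mean per-dimension variance as the uncertainty score and `1 / (1 + uncertainty)` as the weight are hypothetical choices made for illustration.

```python
from statistics import pvariance
from typing import Dict, List, Sequence


def embedding_uncertainty(samples: List[Sequence[float]]) -> float:
    """Mean per-dimension variance across repeated embedding samples."""
    dim = len(samples[0])
    return sum(pvariance([s[i] for s in samples]) for i in range(dim)) / dim


def uncertainty_weights(
        samples_by_modality: Dict[str, List[Sequence[float]]]) -> Dict[str, float]:
    """Lower weight for higher uncertainty; weights are normalized to sum to 1."""
    raw = {name: 1.0 / (1.0 + embedding_uncertainty(samples))
           for name, samples in samples_by_modality.items()}
    total = sum(raw.values())
    return {name: value / total for name, value in raw.items()}


# Hypothetical samples: the image embedding is stable, the audio embedding varies.
samples = {
    "image": [[1.0, 1.0], [1.0, 1.0]],
    "audio": [[0.0, 0.0], [2.0, 2.0]],
}
weights = uncertainty_weights(samples)
```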
[0180] [Weight Adjustment] adjusts the weights of the remaining non-missing single modality data (i.e., remaining single embedding vectors) based on the uncertainty or confidence score for the single modality data (i.e., missing single embedding vector) missing from the target integrated embedding vector. For example, if image or video data for the front camera (i.e., image or video embedding vector) is missing from the target integrated embedding vector, a higher weight is applied to the single embedding vectors for the remaining single modality data (e.g., image or video data for the rear camera, sensor data for the distance sensor, etc.) and reflected in the target integrated embedding vector.
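As a non-limiting illustration, the weight adjustment for a missing modality may be sketched in Python as follows. The sketch redistributes the weight of the missing modality to the remaining modalities in proportion to their base weights; this proportional rule, the function name, and the base weights are assumptions made for illustration, and an uncertainty or confidence score could be used instead to set the base weights.

```python
from typing import Dict, Set


def adjust_weights(base_weights: Dict[str, float],
                   missing: Set[str]) -> Dict[str, float]:
    """Set missing modalities to zero weight and renormalize the rest to sum to 1."""
    present = {name: w for name, w in base_weights.items() if name not in missing}
    total = sum(present.values())
    if total == 0:
        raise ValueError("no non-missing modality with positive weight")
    adjusted = {name: 0.0 for name in base_weights}
    adjusted.update({name: w / total for name, w in present.items()})
    return adjusted


# Hypothetical base weights; the front camera embedding is missing.
base = {"front_camera": 0.5, "rear_camera": 0.25, "distance_sensor": 0.25}
adjusted = adjust_weights(base, {"front_camera"})
```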
[0181] (3) Optimization is a post-processing step that applies normalization, scaling, contrastive embedding adjustment, or fine-tuning for a specific task to the target integrated embedding vector so that the vector yields optimized performance for a given task, thereby improving the expressiveness and stability of the target integrated embedding vector.
[0182] Normalization refers to performing normalization on the target integrated embedding vector. Depending on the situation, various methods can be used for this normalization. For instance, L2 normalization can be applied to ensure that the lengths of all embedding vectors are consistent. Using L1 normalization makes the sum of the absolute values of all elements in the target integrated embedding vector equal to 1, which is advantageous for generating sparse vectors by emphasizing specific dimensions or reducing the weights of specific elements. In addition, various other methods such as Max normalization, softmax normalization, batch normalization, and Z-score normalization can be applied.
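As a non-limiting illustration, L2 normalization and L1 normalization may be sketched in Python as follows. The function names are assumptions made for illustration, and a zero vector is rejected because it cannot be normalized.

```python
import math
from typing import List, Sequence


def l2_normalize(vector: Sequence[float]) -> List[float]:
    """Scale the vector so that its Euclidean length equals 1."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vector]


def l1_normalize(vector: Sequence[float]) -> List[float]:
    """Scale the vector so that the absolute values of its elements sum to 1."""
    norm = sum(abs(x) for x in vector)
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vector]


unit_l2 = l2_normalize([3.0, 4.0])
unit_l1 = l1_normalize([1.0, -3.0])
```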
[0183] Scaling refers to performing scaling on the target integrated embedding vector. Using this scaling allows the size of a single embedding vector corresponding to specific single modality data to be scaled and reflected in tasks where that data is critical. For example, in risk prediction, if distance sensor data is important, a higher weight is assigned to the sensor embedding vector of that distance sensor data to give it a greater influence on the overall target integrated embedding vector.
[0184] Contrastive embedding adjustment can adjust the data distribution of target integrated embedding vectors based on distances generated during contrastive learning. In this process, data points with similar distances can be adjusted to be closer together (i.e., first adjustment), while data points with dissimilar distances can be adjusted to be farther apart (i.e., second adjustment). For example, the first adjustment can be performed on data points located within a certain distance, and the second adjustment can be performed on data points located at other distances. This allows new data to be better distinguished in relation to previously learned data.
[0185] Fine-tuning for specific tasks fine-tunes the target integrated embedding vector to suit a specific purpose (e.g., risk classification, situational awareness). Through this adjustment of the integrated embedding vector, it enables performance optimized for the task.
[0186] In accordance with the execution of S220, the second device (20) can represent different types of single-modality data, such as text, images, audio, video, and sensor data, as a single integrated multimodal embedding vector (i.e., an integrated embedding vector). Each single-modality data possesses unique information and characteristics, so that rich meanings and interactions that are difficult to obtain from single-modality data alone can be synthesized into a multimodal embedding. Through this, information obtained from various single-modality data is combined into a single consistent vector to reflect correlations between data and integrate important features of all single-modality data. Accordingly, more sophisticated situational awareness becomes possible by synthesizing multiple single-modality data, enabling a deep understanding of the overall situation in autonomous driving or video analysis. Furthermore, when similar information is identified in multiple single-modality data, the reliability of decision-making increases, thereby enhancing the accuracy of predictions and judgments.
[0187] For example, image data from a camera can visually capture vehicles, pedestrians, and road conditions ahead; sensor data from a LiDAR can detect the distance and location of vehicles and pedestrians in 3D; audio data can recognize surrounding horns or emergency vehicle sirens; GPS data can verify the vehicle's location; and text warning messages or log data can identify characteristics of the driving situation. In other words, a multimodal embedding vector (i.e., an integrated embedding vector) is capable of representing such comprehensive information as a single vector.
[0188] <Example of Multimodal Embedding>
[0189] Examples of training each embedding model, and of generating and utilizing embedding vectors with the trained models, are as follows.
[0190] First, it is assumed that the multimodal data includes individual modality data such as text, audio, images, videos, and time-series sensor data, and that these must be processed together. The key is to generate individual embedding vectors by considering the characteristics of each modality data, and then effectively combine them to create an integrated embedding vector that reflects the interaction.
[0191] 1. Preprocessing of Multimodal Data
[0192] First, the input multimodal data is preprocessed. Specifically, the order of each single modality data within the multimodal data is sorted according to a predefined sequence, and if data required for a specific sequence is missing, the corresponding missing information is flagged. For example, missing information can be indicated while sorting the multimodal data in a manner such as {situation description text, supplementary audio information, related image, related video, driving time-series sensor data}.
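The sorting and flagging step can be sketched as follows. The modality keys in `MODALITY_ORDER`, the dictionary layout, and the `preprocess` function are hypothetical names chosen for illustration; the source only specifies that a predefined order is applied and that missing items are flagged.

```python
# Assumed predefined order: text, audio, image, video, time-series sensor data.
MODALITY_ORDER = ("text", "audio", "image", "video", "sensor")

def preprocess(sample):
    """Arrange a sample's modalities in the predefined order and flag gaps.

    `sample` maps a modality name to its raw data; absent keys are treated
    as missing and marked with `missing=True`.
    """
    ordered = []
    for name in MODALITY_ORDER:
        data = sample.get(name)
        ordered.append({"modality": name, "data": data, "missing": data is None})
    return ordered

sample = {"image": "front_cam.png", "text": "pedestrian crossing ahead"}
for item in preprocess(sample):
    print(item["modality"], "MISSING" if item["missing"] else "ok")
```

Keeping a slot for every modality, even when it is empty, lets later stages rely on a fixed position for each modality.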
[0193] 2. Select each single embedding model and generate each single embedding vector
[0194] For each single modality data of the preprocessed multimodal data, each single embedding vector is generated using a dedicated single embedding model.
[0195] For example, text data is converted into text embedding vectors using pre-trained NLP models such as BERT or GPT. This can be represented as $T \in \mathbb{R}^{d_T}$. Here, $T$ denotes the embedding of the text data, $\mathbb{R}^{d_T}$ the text embedding vector space, and $d_T$ the dimension of the text embedding vector.
[0196] Since audio data contains frequency features, features are extracted using CNN or Audio Transformer models after Mel-spectrogram transformation, or the data is converted into audio embedding vectors using speech recognition models (such as Wav2Vec). This can be represented as $A \in \mathbb{R}^{d_A}$. Here, $A$ denotes the embedding of the audio data, $\mathbb{R}^{d_A}$ the audio embedding vector space, and $d_A$ the dimension of the audio embedding vector.
[0197] Image data is converted into image embedding vectors by extracting visual features using CNNs (such as ResNet or EfficientNet). This can be represented as $I \in \mathbb{R}^{d_I}$. Here, $I$ denotes the embedding of the image data, $\mathbb{R}^{d_I}$ the image embedding vector space, and $d_I$ the dimension of the image embedding vector.
[0198] Video data is decomposed along the time axis to extract frame-by-frame image features, and these are then processed through LSTM, GRU, or 3D-CNN to generate video embedding vectors containing temporal patterns. This can be represented as $V \in \mathbb{R}^{d_V}$. Here, $V$ denotes the embedding of the video data, $\mathbb{R}^{d_V}$ the video embedding vector space, and $d_V$ the dimension of the video embedding vector.
[0199] The temporal features of time-series sensor data are converted into time-series embedding vectors using LSTM, GRU, Temporal CNN, or TCN (Temporal Convolutional Network). This can be represented as $S \in \mathbb{R}^{d_S}$. Here, $S$ denotes the embedding of the time-series sensor data, $\mathbb{R}^{d_S}$ the time-series embedding vector space, and $d_S$ the dimension of the time-series embedding vector.
[0200] 3. Generate integrated embedding
[0201] The single embedding vectors obtained for each single modality data are combined to generate a consistent integrated embedding vector. There are various vector combination methods, and an appropriate method reflecting the interactions between the single embedding vectors can be selected. For example, the following combination methods may be used.
[0202] 3-1) In a simple concatenation method, an integrated embedding vector is generated by simply concatenating each individual embedding vector. This method is simple to implement and can include the features of all single modality data. However, it has limitations in that it is difficult to sufficiently reflect interactions between individual embedding vectors if such interactions are required. The integrated embedding vector (M) generated according to this simple concatenation method can be represented as follows.
[0203] M = Concat(T, A, I, V, S)
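As a minimal illustration of simple concatenation, the sketch below joins five toy vectors. Plain Python lists stand in for the embedding vectors, and the values and dimensions are invented for the example. The integrated vector has one entry per input entry, so its dimension is the sum of the individual dimensions.

```python
def concat(*vectors):
    """Join single embedding vectors end to end, preserving argument order."""
    merged = []
    for vec in vectors:
        merged.extend(vec)
    return merged

# Toy embeddings with made-up values; dimensions 3, 2, 2, 1, 2.
T = [0.12, -0.40, 0.33]   # text
A = [0.05, 0.91]          # audio
I = [0.70, -0.10]         # image
V = [0.25]                # video
S = [-0.60, 0.08]         # time-series sensor
M = concat(T, A, I, V, S)
print(len(M))  # equals the sum of the individual dimensions
```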
[0204] 3-2) The Cross-Attention method utilizes Cross-Attention to reflect the interaction between each single embedding vector. For example, in an attention structure, text embedding vectors can be used as the Query, and other single embedding vectors as the Key and Value to focus on information that needs attention.
[0205] For example, the interaction between text and audio (TA) can be represented as follows.
[0206] TA = CrossAttention(query=T, key=A, value=A)
[0207] In addition, the interaction between text and images (TI) can be expressed as follows.
[0208] TI = CrossAttention(query=T, key=I, value=I)
[0209] After generating interaction embedding vectors between each single modality data through Cross-Attention, all single embedding vectors can be combined to generate a final integrated embedding vector. The integrated embedding vector (M) generated according to this Cross-Attention method can be represented as follows.
[0210] M = Concat(TA, TI, T, A, I, V, S)
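The Cross-Attention combination can be sketched with scaled dot-product attention, one common form of cross-attention; the source does not fix the exact attention variant. The sketch assumes that each modality supplies a short sequence of token-level features that already share one dimension, and it omits the learned query, key, and value projections a trained model would have. The helper names (`cross_attention`, `mean_pool`) and all values are hypothetical.

```python
import math

def softmax(xs):
    m = max(xs)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(query, key, value):
    """Scaled dot-product attention: each query row attends over key/value rows."""
    d = len(query[0])
    out = []
    for q in query:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in key]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, value))
                    for j in range(len(value[0]))])
    return out

def mean_pool(rows):
    """Collapse a sequence of feature rows into one vector."""
    return [sum(col) / len(rows) for col in zip(*rows)]

# Toy token-level features (assumed to share dimension 2).
T_tok = [[1.0, 0.0], [0.0, 1.0]]
A_tok = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
I_tok = [[0.5, 0.5]]
V, S = [0.2, 0.8], [0.9, 0.1]  # already-pooled video and sensor embeddings

TA = mean_pool(cross_attention(T_tok, A_tok, A_tok))  # text attends to audio
TI = mean_pool(cross_attention(T_tok, I_tok, I_tok))  # text attends to image
T, A, I = mean_pool(T_tok), mean_pool(A_tok), mean_pool(I_tok)
M = TA + TI + T + A + I + V + S  # Concat(TA, TI, T, A, I, V, S)
print(len(M))
```

Because the concatenation order determines the layout of M, any device that consumes M has to apply the same order.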
[0211] Of course, the combination method and order according to this Cross-Attention must be shared by the first and second devices (10, 20) so that they are performed identically in S210 and S220 and consistency is maintained.
[0212] 3-3) The multimodal transformer method considers each individual embedding vector as an individual token and learns the interactions between modalities through a transformer model. This allows all modalities to interact organically and be combined.
[0213] 4. Training the integrated embedding model
[0214] When training an integrated embedding model, an appropriate loss function can be used depending on the specific purpose in various autonomous driving scenarios. For example, contrastive learning, supervised learning, or self-supervised learning may be used for training the integrated embedding model.
[0215] Contrastive learning can train the integrated embedding model to maintain meaningful relationships in the embedding space by positioning integrated embedding vectors of similar situations close to each other and integrated embedding vectors of different situations far apart.
[0216] Supervised learning can train an integrated embedding model to perform classification or regression when specific labels are present.
[0217] Self-supervised learning can enhance the robustness of integrated embedding vectors so that they can respond even in situations where specific single modality data (i.e., specific single embedding vector) is missing.
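The contrastive and self-supervised options can be sketched together. The source does not specify a loss function; the pairwise margin-based contrastive loss below is one common formulation, used here as an assumption. Masking one modality and treating the masked and unmasked views as a similar pair is likewise only one possible self-supervised setup. The names `contrastive_loss`, `mask_modality`, and `integrate` are hypothetical, and `integrate` is a plain concatenation standing in for a trained integrated embedding model.

```python
import math
import random

def contrastive_loss(u, v, similar, margin=1.0):
    """Pairwise contrastive loss on the Euclidean distance between u and v."""
    d = math.dist(u, v)
    if similar:
        return d ** 2                   # pull similar situations together
    return max(0.0, margin - d) ** 2    # push different ones past the margin

def mask_modality(embeddings, rng):
    """Copy `embeddings`, replacing one randomly chosen modality with zeros."""
    dropped = rng.choice(sorted(embeddings))
    masked = {name: ([0.0] * len(vec) if name == dropped else list(vec))
              for name, vec in embeddings.items()}
    return masked, dropped

def integrate(embeddings):
    """Stand-in for the integrated embedding model: fixed-order concatenation."""
    return [x for name in sorted(embeddings) for x in embeddings[name]]

full = {"text": [0.2, 0.4], "audio": [0.1], "image": [0.3, 0.3]}
masked, dropped = mask_modality(full, random.Random(0))
# The masked view describes the same situation, so it is a "similar" pair.
loss = contrastive_loss(integrate(full), integrate(masked), similar=True)
print(dropped, loss)
```

Minimizing this loss over many masked views would encourage the integrated embedding to stay stable when a modality is absent.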
[0218] 5. Utilization of Target Integrated Embedding Vectors
[0219] Each learned embedding model is transmitted to the second device (20) to generate a target integrated embedding vector for the target multimodal data. This target integrated embedding vector can be used in the following situations.
[0220] With the target integrated embedding vector, an autonomous driving system can accurately identify the current situation from text, images, audio, video, and sensor data and perform the necessary responses. It can also predict whether a driving situation poses a high risk of an accident and generate warnings accordingly. When a new driving situation arises, the system can match it with similar past scenarios and act safely based on prior experience.
[0221] Furthermore, in terms of autonomous driving data augmentation and learning enhancement, new integrated embedding vectors for various autonomous driving scenarios can be generated through simulation by modifying or processing the values of pre-existing target integrated embedding vectors. These newly generated integrated embedding vectors can be utilized for supplementary learning in situations where actual data is scarce.
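Two simple ways of "modifying or processing" existing vectors are interpolation between two vectors and the addition of small random noise. The source does not name a specific technique, so both functions below (`interpolate`, `jitter`) and their parameters are illustrative assumptions.

```python
import random

def interpolate(u, v, alpha):
    """Blend two integrated embedding vectors; alpha=0 gives u, alpha=1 gives v."""
    return [(1.0 - alpha) * a + alpha * b for a, b in zip(u, v)]

def jitter(u, sigma, rng):
    """Add zero-mean Gaussian noise with standard deviation sigma to each entry."""
    return [a + rng.gauss(0.0, sigma) for a in u]

rainy_night = [0.0, 0.0]   # toy stand-ins for two scenario embeddings
clear_day = [2.0, 4.0]
halfway = interpolate(rainy_night, clear_day, 0.5)
variant = jitter(halfway, sigma=0.05, rng=random.Random(42))
print(halfway, len(variant))
```

Whether such synthetic vectors correspond to plausible driving scenarios would still have to be verified, for example with a consistency check against the training embedding distribution.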
[0222] Additionally, when searching for autonomous driving scenarios, an embedding vector for the desired scenario (i.e., a search embedding vector) can be generated. At this time, target integrated embedding vectors that match or are similar to the search embedding vector can be searched from the previously generated target integrated embedding vectors related to the autonomous driving scenario.
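A scenario search of this kind can be sketched as a nearest-neighbor lookup. The sketch assumes cosine similarity as the matching criterion and an in-memory dictionary as the store; the scenario names, vectors, and the `search` function are invented for the example.

```python
import math

def cosine_similarity(u, v):
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return sum(a * b for a, b in zip(u, v)) / norm if norm else 0.0

def search(query, index, top_k=2):
    """Return the top_k (scenario name, similarity) pairs, most similar first."""
    scored = [(cosine_similarity(query, vec), name) for name, vec in index.items()]
    scored.sort(reverse=True)
    return [(name, score) for score, name in scored[:top_k]]

# Previously generated target integrated embedding vectors (toy values).
index = {
    "cut_in": [1.0, 0.0],
    "jaywalk": [0.0, 1.0],
    "merge": [1.0, 1.0],
}
search_vector = [1.0, 0.1]  # embedding of the desired scenario
for name, score in search(search_vector, index):
    print(name, round(score, 3))
```

For large scenario collections, the exhaustive comparison would typically be replaced by an approximate nearest-neighbor index.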
[0223] In addition, the target integrated embedding vector can be utilized in various application fields, such as autonomous driving agents, autonomous driving scenario generation processes, autonomous driving data management units, autonomous vehicles, and general vehicles.
[0224] FIG. 7 is a diagram illustrating the computing environment of the first and second devices (10, 20).
[0225] In the illustrated embodiments, each component may have different functions and capabilities in addition to those described below, and additional components may be included in addition to those described below. The computing environment of the first and second devices (10, 20) may include one or more components, as illustrated in FIG. 7.
[0226] That is, for a computing environment, the first and second devices (10, 20) include at least one processor (16, 26), memory (14, 24), and communication bus (19, 29). The processor (16, 26) can enable the first and second devices (10, 20) to operate according to the exemplary embodiments mentioned above. Such a processor (16, 26) may be included in a control unit (15, 25). For example, the processor (16, 26) may execute one or more programs (14a, 24a) stored in memory (14, 24). Such one or more programs (14a, 24a) may include one or more computer-executable instructions, and the computer-executable instructions may be configured to enable the first and second devices (10, 20) to perform operations according to the exemplary embodiments when executed by the processor (16, 26).
[0227] The communication bus (19, 29) interconnects various other components of the first and second devices (10, 20) in addition to the processor (16, 26) and memory (14, 24).
[0228] The first and second devices (10, 20) may include one or more input / output interfaces (18, 28) and one or more communication interfaces (12a, 22a) that provide an interface for one or more input / output devices (17, 27). The input / output devices (17, 27) may include an input section (11, 21) and an output section (13, 23).
[0229] The input / output interface (18, 28) and the communication interface (12a, 22a) are connected to the communication bus (19, 29). The input / output device (17, 27) may be connected to other components of the first and second devices (10, 20) through the input / output interface (18, 28). The input / output device (17, 27) may be included inside the first and second devices (10, 20) as a component constituting the first and second devices (10, 20), or it may be connected to the first and second devices (10, 20) as a separate device distinct from the first and second devices (10, 20).
[0230] The present invention, configured as described above, has the advantage of being able to process multimodal data necessary for the effective generation and retrieval of autonomous driving scenarios.
[0231] In other words, the present invention has the advantage of being able to process multimodal data that effectively describes various autonomous driving scenario situations.
[0232] In addition, by performing multimodal embedding, that is, embedding of the multimodal data of autonomous driving scenarios, the present invention has the advantage of helping an autonomous driving system recognize situations more accurately, reduce uncertainty, and make safe decisions.
[0233] In particular, through multimodal embedding of multimodal data of autonomous driving scenarios processed according to the present invention, flexibility to adapt to environmental changes, meaningful information combination through interaction, and real-time situational awareness and risk assessment become possible, thereby providing the advantage of significantly improving the reliability and safety of autonomous driving.
[0234] In addition, the present invention has the advantage of enhancing usability in various fields, such as the generation, management, mutual comparison, search, or real-time accident avoidance of autonomous driving scenarios, through the processing result data (i.e., representation vector) of multimodal embeddings.
[0235] Although the present invention has been described in detail above through representative embodiments, those skilled in the art will understand that various modifications and equivalent alternative embodiments are possible therefrom. Accordingly, the true technical scope of protection of the present invention should be determined by the technical spirit of the appended claims.
[0236] The present invention relates to autonomous driving scenario data processing technology and has industrial applicability as it provides technology for processing multimodal data necessary for the effective generation and retrieval of autonomous driving scenarios.
Claims
1. A method performed by a processor, the method comprising: a step of performing preprocessing on target data, which is multimodal data related to autonomous driving scenarios or driving environments; a step of generating each single embedding vector for a plurality of single modality data included in the preprocessed target data using each previously trained single embedding model; and a step of generating a single integrated embedding vector, which is a single multimodal vector formed by integrating each single embedding vector, using a previously trained integrated embedding model.
2. The method of claim 1, wherein the step of performing the preprocessing sorts each single modality data in the target data according to a preset order, and adds a flag or generates supplementary data for missing single modality data.
3. The method of claim 2, wherein the step of performing the preprocessing preprocesses the target data using preprocessing information for training data used during training of the single embedding model and the integrated embedding model.
4. The method of claim 1, further comprising a step of verifying consistency for a target integrated embedding vector by comparing training embedding information for the training data used during training of the single embedding model and the integrated embedding model with the single integrated embedding vector.
5. The method of claim 1, further comprising a step of obtaining a similarity or distance between the training embedding information and the target integrated embedding vector in an embedding space, and verifying consistency for the target integrated embedding vector according to the obtained similarity or distance.
6. The method of claim 1, further comprising a step of applying a clustering technique to the training embedding information and the target integrated embedding vector to identify major clusters for the training embedding information, and verifying the consistency of the target integrated embedding vector based on whether the target integrated embedding vector is included within the identified major clusters.
7. The method of claim 1, further comprising a step of verifying the consistency of the target integrated embedding vector by detecting, through OOD (Out-of-Distribution) detection, whether the target integrated embedding vector deviates from the data distribution of the training embedding information.
8. The method of claim 1, further comprising a step of numerically evaluating the uncertainty for each of the single embedding vectors and processing such that the higher the uncertainty, the lower the weight applied to the corresponding single embedding vector in the target integrated embedding vector.
9. The method of claim 1, further comprising a step of processing such that, if there is a single embedding vector for single modality data that is missing from the target integrated embedding vector, a high weight is applied to the single embedding vector for the remaining single modality data that is not missing from the target integrated embedding vector.
10. The method of claim 1, further comprising a step of adjusting the positions of data points whose distance is less than a certain distance to become closer, based on the distance in the data distribution for a plurality of the target integrated embedding vectors.
11. A device comprising: a memory; and a processor that executes at least a portion of the operations stored in the memory, wherein the processor performs preprocessing on target data, which is multimodal data related to autonomous driving scenarios or driving environments, generates each single embedding vector for a plurality of single modality data included in the preprocessed target data using each previously trained single embedding model, and generates a single integrated embedding vector, which is a single multimodal vector formed by integrating each single embedding vector, using a previously trained integrated embedding model.
12. The device of claim 11, wherein the processor, during the preprocessing, sorts each single modality data in the target data according to a preset order and adds a flag or generates supplementary data for missing single modality data.
13. The device of claim 12, wherein the processor, during the preprocessing, preprocesses the target data using preprocessing information for training data used during training of the single embedding model and the integrated embedding model.
14. The device of claim 11, wherein the processor verifies consistency for a target integrated embedding vector by comparing training embedding information for the training data used during training of the single embedding model and the integrated embedding model with the single integrated embedding vector.
15. The device of claim 11, wherein the processor obtains a similarity or distance between the training embedding information and the target integrated embedding vector in an embedding space, and verifies the consistency of the target integrated embedding vector according to the obtained similarity or distance.
16. The device of claim 11, wherein the processor applies a clustering technique to the training embedding information and the target integrated embedding vector to identify major clusters for the training embedding information, and verifies the consistency of the target integrated embedding vector based on whether the target integrated embedding vector is included in the identified major clusters.
17. The device of claim 11, wherein the processor verifies the consistency of the target integrated embedding vector by detecting, through OOD (Out-of-Distribution) detection, whether the target integrated embedding vector deviates from the data distribution of the training embedding information.
18. The device of claim 11, wherein the processor numerically evaluates the uncertainty for each single embedding vector and processes such that the higher the uncertainty, the lower the weight applied to the corresponding single embedding vector in the target integrated embedding vector.
19. The device of claim 11, wherein the processor processes such that, if there is a single embedding vector for single modality data missing from the target integrated embedding vector, a high weight is applied to the single embedding vector for the remaining single modality data not missing from the target integrated embedding vector.
20. The device of claim 11, wherein the processor adjusts the positions of data whose distance is less than a certain distance to become closer, based on the distance in the data distribution for a plurality of the target integrated embedding vectors.