Method for extending a scene database
By analyzing the empty volume of existing scene databases and using machine learning techniques to generate synthetic sensor data, the problem of insufficient expansion of scene databases in existing technologies is solved, enabling more comprehensive testing and verification of autonomous driving systems and improving the robustness of the system and data collection efficiency.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- ZENSEACT AB
- Filing Date
- 2025-12-18
- Publication Date
- 2026-06-19
AI Technical Summary
Existing methods for expanding scene databases are insufficient in terms of considering scene diversity and computational efficiency, resulting in incomplete test coverage of autonomous driving systems, difficulty in effectively identifying and collecting rare scene samples, and difficulty in guaranteeing the accuracy and reliability of synthetic sensor data.
By analyzing data in existing scene databases, uncovered empty volumes are identified, and machine learning techniques such as NeRF, GAN, and diffusion models are used to generate synthetic sensor data. Combined with a query system, data collection is optimized, the level of detail is dynamically adjusted to optimize bandwidth consumption, and high-priority scene data is prioritized.
It enables more efficient expansion of the scene database, improves the robustness and reliability of the autonomous driving system, reduces data transmission requirements, ensures the credibility and accuracy of synthetic sensor data, and reduces data collection costs.
Figure CN122240699A_ABST
Abstract
Description
Technical Field
[0001] The technology disclosed herein relates to the field of autonomous driving system development. In particular, it relates to methods and apparatus for expanding a scenario database for the development of autonomous driving systems.
Background Technology
[0002] One of the core principles of verifying and validating Automated Driving Systems (ADS) is the use of scenario-based testing. This type of testing is used to systematically evaluate the system's performance across a wide range of driving scenarios, including rare and challenging situations that are statistically possible but would require an infeasible amount of real-world driving to encounter. These scenarios are important for exposing potential system weaknesses and ensuring robustness in diverse and unpredictable environments.
[0003] The core component of scenario-based testing is the database, which stores a collection of recorded driving scenarios. Building such a database presents several challenges. The database must capture not only common driving situations but also rare edge cases such as extreme weather conditions, unexpected pedestrian behavior, and unusual traffic patterns. Therefore, ensuring statistical representativeness of all possible scenarios is a challenge. To address this, efficient and intelligent scenario database population is desired. When expanding the scenario database, the key is to identify which scenarios are needed to supplement the existing ones. This may involve identifying which scenarios might be “missing” and could enrich the current database, while also determining which scenarios are already well covered. Avoiding redundant or irrelevant data helps prevent the database from becoming overwhelmed and ensures efficient use of bandwidth and cloud infrastructure.
[0004] There is also a need for systems that consider how practically these complementary scenario samples can be obtained, taking into account available resources, time, and the likelihood of scenarios actually occurring. For example, real-world data collection can be costly and may lack the diversity required for rare scenarios. As another example, synthetic sensor data generation can mitigate the problem of the low occurrence of rare scenarios. However, it may instead introduce inaccuracies or reduce its reliability in reflecting real-world driving conditions, thereby reducing its relevance and usefulness in ADS testing. In other words, its effectiveness may be hampered by uncertainties regarding the credibility of synthetic data and its ability to accurately reproduce real-world sensor stimuli if not carefully validated.
[0005] Existing methods for populating or expanding scene databases are often inadequate in one or more of these respects. Therefore, there is a need for improvements to provide comprehensive and reliable coverage of statistically probable scenarios while taking into account scene diversity and computational efficiency. Such advancements will enable more comprehensive testing of ADS, ultimately contributing to the capability, reliability, and performance of autonomous driving systems or any of their functions.
Summary of the Invention
[0006] The techniques disclosed herein aim to mitigate, alleviate, or eliminate one or more of the aforementioned identified defects and disadvantages in the prior art to address various problems related to the formation of a scenario database for scenario-based development of features or functions related to an autonomous driving system (ADS). More specifically, the disclosed techniques address problems related to how to expand a scenario database of existing scenario samples in an improved and more efficient manner. The disclosed techniques provide an improved way of consuming or exhausting the space of possible scenarios. Various aspects and implementations of the disclosed techniques are defined below and in the appended independent and dependent claims.
[0007] According to the first aspect, a computer-implemented method for expanding a scene database is provided. The scene database includes multiple existing scene samples. Each scene sample includes sensor data depicting the environment surrounding a vehicle over a period of time. Each scene sample is associated with a scene embedding representing the scene sample in a multidimensional space. The method includes: obtaining an indication of an empty volume in the multidimensional space. The empty volume is a volume not covered by the multiple existing scene samples in the scene database. The method further includes: sending a data collection request to one or more vehicles in a vehicle fleet for the empty volume. The method further includes: receiving data from the vehicles in the vehicle fleet indicating a recorded scene that matches the data collection request. The method further includes: storing the received data indicating the recorded scene into the multiple existing scene samples in the scene database. Utilizing this aspect of the disclosed technology, there are similar advantages and preferred features as other aspects.
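The four steps of the first aspect (obtain an empty volume, send a request, receive matching recordings, store them) can be illustrated with a minimal sketch. All class and function names below (EmptyVolume, SceneDatabase, FakeVehicle, expand_database) are hypothetical, and modelling the empty volume as a ball in a 2D embedding space is an assumption made only for illustration; the disclosure does not prescribe this shape.

```python
from dataclasses import dataclass, field


@dataclass
class EmptyVolume:
    center: tuple  # centre of the uncovered region in embedding space
    radius: float  # extent of the region (ball shape is an assumption)


@dataclass
class SceneDatabase:
    samples: list = field(default_factory=list)  # (embedding, sensor_data) pairs

    def store(self, embedding, sensor_data):
        self.samples.append((embedding, sensor_data))


def _dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


class FakeVehicle:
    """Stand-in for a fleet vehicle holding locally recorded scenes."""

    def __init__(self, recordings):
        self.recordings = recordings  # list of (embedding, sensor_data)

    def handle_request(self, volume):
        # Return only recordings whose embedding falls inside the requested volume.
        return [r for r in self.recordings
                if _dist(r[0], volume.center) <= volume.radius]


def expand_database(db, volume, fleet):
    """Send a request for `volume` to each vehicle and store what comes back."""
    stored = 0
    for vehicle in fleet:
        for embedding, sensor_data in vehicle.handle_request(volume):
            db.store(embedding, sensor_data)
            stored += 1
    return stored


db = SceneDatabase(samples=[((0.0, 0.0), "highway_clear")])
fleet = [FakeVehicle([((5.0, 5.0), "snow_pedestrian"),
                      ((0.1, 0.0), "highway_clear_2")])]
added = expand_database(db, EmptyVolume(center=(5.0, 5.0), radius=1.0), fleet)
```

In this toy run only the recording that lies inside the requested volume is uploaded; the near-duplicate of the existing sample stays on the vehicle.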
[0008] According to a second aspect, a computer program product including instructions is provided that, when executed by a computing device, cause the computing device to perform the method according to any embodiment of the first aspect. According to an alternative embodiment of the second aspect, a (non-transitory) computer-readable storage medium is provided. The non-transitory computer-readable storage medium stores one or more programs configured to be executed by one or more processors of a processing system, the programs including instructions for performing the method according to any embodiment of the first aspect. This aspect utilizing the disclosed technology has similar advantages and preferred features as other aspects.
[0009] As used herein, the term "non-transitory" is intended to describe computer-readable storage media (or "memory") that exclude the propagation of electromagnetic signals, but is not intended to otherwise limit the types of physical computer-readable storage devices encompassed by the phrases "computer-readable medium" or "memory." For example, the terms "non-transitory computer-readable medium" or "tangible memory" are intended to encompass types of storage devices that include, for example, random access memory (RAM) that do not necessarily permanently store information. Program instructions and data stored in a non-transitory form on a tangible computer-accessible storage medium can be further transmitted via a transmission medium or a signal such as an electrical signal, electromagnetic signal, or digital signal, which can be transmitted via a communication medium such as a network and / or a wireless link. Therefore, as used herein, the term "non-transitory" is a limitation on the medium itself (i.e., tangible, not a signal), not a limitation on the persistence of data storage (e.g., RAM and ROM).
[0010] According to a third aspect, a computing device for expanding a scene database is provided. The scene database includes multiple existing scene samples. Each scene sample includes sensor data depicting the surrounding environment of a vehicle over a period of time. Each scene sample is associated with a scene embedding representing the scene sample in a multidimensional space. The computing device includes control circuitry. The control circuitry is configured to obtain an indication of an empty volume in the multidimensional space. An empty volume is a volume not covered by the multiple existing scene samples in the scene database. The control circuitry is further configured to send a data collection request for the empty volume to one or more vehicles in a vehicle fleet. The control circuitry is further configured to receive, from the vehicles in the vehicle fleet, data indicating a recorded scene matching the data collection request. The control circuitry is further configured to store the received data indicating the recorded scene among the multiple existing scene samples in the scene database. This aspect of the disclosed technology has similar advantages and preferred features as the other aspects.
[0011] The disclosed aspects and preferred embodiments may be suitably combined with each other in any manner that is obvious to those skilled in the art, such that one or more features or embodiments relating to one aspect may also be considered to relate to embodiments of another aspect or another aspect.
[0012] One advantage of some implementations is that the scene database can be expanded in an improved and more efficient manner. By considering the transformation volume of scene samples, more efficient use of bandwidth and other computing resources can be achieved. This simultaneously considers the authenticity and reliability of synthetic sensor data generated from the transformation of existing scene samples. It can further provide more targeted data collection methods. In other words, it can provide improved ways to identify gaps in existing scene databases and understand how newly recorded / experienced scenes fill these gaps. Improvements can be found, for example, in bandwidth utilization and data constraints.
[0013] Furthermore, one advantage of some implementations is that prioritization can provide more efficient management and transmission of large amounts of data from vehicle fleets. This can be achieved in part by balancing data richness with bandwidth constraints. This can be viewed further as a dynamic, scenario-value-based hierarchical data collection system. More specifically, the hierarchical approach can optimize bandwidth consumption by adjusting the level of detail in the transmitted data based on scenario importance, ensuring that valuable data is prioritized.
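One way to picture the hierarchical, value-based collection described here is a mapping from scenario priority to a transmission tier. The sketch below is illustrative only: the three tiers, the thresholds, and the payload sizes are assumptions, not values taken from the disclosure.

```python
def select_detail_level(priority, high=0.8, low=0.3):
    """Map a scenario priority in [0, 1] to a tier (thresholds are assumptions)."""
    if priority >= high:
        return "raw_sensor_data"
    if priority >= low:
        return "object_level_summary"
    return "embedding_only"


# Hypothetical per-tier payload sizes in megabytes.
PAYLOAD_MB = {
    "raw_sensor_data": 500.0,
    "object_level_summary": 5.0,
    "embedding_only": 0.01,
}


def plan_uploads(scenarios):
    """Order scenarios by priority and return the upload plan and its total size."""
    ordered = sorted(scenarios, key=lambda s: s[1], reverse=True)
    plan = [(sid, select_detail_level(p)) for sid, p in ordered]
    total_mb = sum(PAYLOAD_MB[tier] for _, tier in plan)
    return plan, total_mb


plan, total_mb = plan_uploads([("a", 0.1), ("b", 0.95), ("c", 0.5)])
```

Only scenario "b" is sent as raw sensor data, so the total payload is dominated by a single high-priority sequence rather than by all three.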
[0014] By adaptively providing data at the appropriate level of detail, the system can scale to more data from larger fleets, ultimately improving the performance and safety of the ADS. It can further restrict the transmission of complete sensor data to high-priority scenarios, reducing overall data storage and transmission costs. Furthermore, it ensures that resources are used to capture the most relevant events when truly needed. Thus, the full raw data is transmitted only for the truly valuable sequences.
[0015] One advantage of some implementations is that they make more exhaustive scenario-based development of ADS functionality easier to use and more efficient to implement. By performing more exhaustive scenario-based testing and verification, the robustness, reliability, trustworthiness, and overall performance of the ADS can be improved across a wider range of scenarios.
[0016] One advantage of some implementations is that they reduce the need for data transmission from vehicle fleets while still enabling accurate and reliable testing of the ADS. This may be due to the need for less raw sensor data to exhaustively cover the scene space.
[0017] One advantage of some implementations is that they enable the generation of scenarios that are close to the data already collected, but which may take months or years to collect from real-world driving.
[0018] One advantage of some implementation methods is that the scene space can be exhausted with less collected data.
[0019] Further embodiments are defined in the dependent claims. It should be emphasized that when the term "comprising" and variations thereof are used in this specification, they are used to specify the presence of a described feature, integer, step, or component. They do not exclude the presence or addition of one or more other features, integers, steps, components, or groups thereof.
[0020] These and other features and advantages of the disclosed technology will be further illustrated below with reference to the embodiments described herein.
Description of the Drawings
[0021] The foregoing aspects, features, and advantages of the disclosed technology will be more fully understood when taken in conjunction with the accompanying drawings and by referring to the following illustrative and non-limiting detailed description of exemplary embodiments of the present disclosure, wherein:
[0022] Figure 1 is a schematic flowchart representing a method according to some embodiments;
[0023] Figure 2 is a schematic illustration of a computing device according to some embodiments;
[0024] Figure 3 is a schematic illustration of a vehicle according to some embodiments;
[0025] Figure 4 illustrates, by way of example, the mapping between scene samples and the multidimensional space;
[0026] Figure 5 illustrates, by way of example, scene samples in the multidimensional space;
[0027] Figure 6A and Figure 6B illustrate, by way of example, how the scene space is filled.
Detailed Implementation
[0028] This disclosure will now be described in detail with reference to the accompanying drawings, in which some exemplary embodiments of the disclosed technology are illustrated. However, the disclosed technology may be embodied in other forms and should not be construed as limited to the exemplary embodiments disclosed. The exemplary embodiments of the disclosure are provided to fully convey the scope of the disclosed technology to those skilled in the art. Those skilled in the art will understand that the steps, services, and functions explained herein can be implemented using separate hardware circuitry, using software that works in conjunction with a programmable microprocessor or general-purpose computer, using one or more application-specific integrated circuits (ASICs), using one or more field-programmable gate arrays (FPGAs), and / or using one or more digital signal processors (DSPs).
[0029] It will also be appreciated that when this disclosure is described in the form of a method, it can also be embodied in a device including one or more processors and one or more memories coupled to the one or more processors, wherein computer code is loaded to implement the method. For example, in some embodiments, the one or more memories may store one or more computer programs that, when executed by the one or more processors, cause the device to perform the steps, services, and functions disclosed herein.
[0030] It should also be understood that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting. It should be noted that the articles “a,” “an,” “the,” and “said,” as used in the specification and appended claims, are intended to mean the presence of one or more elements unless the context clearly indicates otherwise. Thus, for example, in some contexts, a reference to “unit” or “the unit” may refer to more than one unit, etc. Furthermore, the word “comprising” and its variations do not exclude other elements or steps. It should be emphasized that when the term “comprising” and its variations are used in this specification, they are used to specify the presence of a described feature, integer, step, or component. It does not exclude the presence or addition of one or more other features, integers, steps, components, or groups thereof. The term “and / or” should be interpreted as also meaning “both” and each as an alternative.
[0031] It should also be understood that although the terms first, second, etc., may be used herein to describe various elements or features, these elements should not be limited by these terms. These terms are only used to distinguish one element from another. For example, without departing from the scope of the embodiments, a first machine learning model may be referred to as a second machine learning model, and similarly, a second machine learning model may be referred to as a first machine learning model. Both the first machine learning model and the second machine learning model are machine learning models, but they are not the same machine learning model.
[0032] As used herein, the phrase "one or more" in a set of elements (such as "one or more of A, B, and C" or "at least one of A, B, and C") should be interpreted as conjunction or disjunction. In other words, it can refer to all elements, one element, or a combination of two or more elements in a set. For example, the phrase "one or more of A, B, and C" can be interpreted as A or B or C, A and B and C, A and B, B and C, or A and C.
[0033] As used herein, the term "in response to" can be interpreted, depending on the context, as meaning "when," "in the event of," or "if." Similarly, the phrase "in response to successfully [identifying at least one scene sample]" can be interpreted as "when it is determined that at least one scene sample has been identified," "in the case that at least one scene sample has been identified," "when at least one scene sample has been identified," or similar expressions.
[0034] Overview
[0035] As explained above, the disclosed technology involves expanding (or extending) a scene database that includes multiple existing scene samples. The disclosed technology is at least partially based on analyzing existing data in the scene database with respect to the extent to which existing scene samples can be transformed into modified scenes by a transformation system. More specifically, it is at least partially based on how reliably the transformation system can generate synthetic sensor data by transforming the sensor data of existing scene samples. Furthermore, this can be combined with a query system for requesting relevant sensor data. The expanded scene database can then be used for the development of autonomous driving systems (ADS). The disclosed technology is therefore at least partially based on the idea of identifying empty volumes (or gaps) in the existing scene database within the associated scene space, while considering the transformation volume of existing samples, and / or understanding how recorded scene samples fill such empty volumes given the available transformations. In other words, it involves determining how received data can be transformed to fill empty volumes and understanding the degree of overlap with existing data. More specifically, the disclosed technology can enable improved data collection by leveraging an understanding of how transformations of existing and newly collected data fill gaps within the scene space. Instead of trying to collect all the data, resources can be focused on collecting the most critical data: gaps (i.e., empty volumes) in the scene space that cannot be filled by data synthesized from existing data samples.
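The idea of finding gaps while accounting for what can be synthesized from existing samples can be sketched as a coverage test. The sketch assumes, purely for illustration, a 2D scene space scanned on an integer grid and a transformation volume modelled as a ball of fixed radius around each existing sample; the names covered, find_empty_cells, and transform_radius are hypothetical.

```python
import math


def covered(point, samples, transform_radius):
    """A point is covered if some existing sample lies within the assumed transformation radius."""
    return any(math.dist(point, s) <= transform_radius for s in samples)


def find_empty_cells(samples, transform_radius, grid_min=0, grid_max=4):
    """Scan an integer grid over a 2D scene space and return the uncovered grid points."""
    empty = []
    for x in range(grid_min, grid_max + 1):
        for y in range(grid_min, grid_max + 1):
            if not covered((x, y), samples, transform_radius):
                empty.append((x, y))
    return empty


existing = [(0.0, 0.0), (4.0, 4.0)]
gaps = find_empty_cells(existing, transform_radius=2.5)
```

The remaining grid points in gaps are the candidates for targeted data collection requests, since under the assumed radius they cannot be reached by transforming an existing sample.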
[0036] The disclosed technology further enables the scene space to be exhausted in a more efficient manner than simply relying on a fleet of vehicles to experience all possible scenarios. The technology relies on the fact that small perturbations or transformations can be made to the recorded sensor data while achieving the desired credibility of the generated synthetic sensor data. In this context, credibility means that the generated synthetic sensor data appears realistic (i.e., looks and feels like real-world sensor data) to a perception system applied to it. For example, given that real and synthetic data depict the same scene, the perception system is expected to produce similar outputs when applied to either. The credibility of synthetic sensor data can involve several different aspects, indicating the overall quality and reliability of the synthetic sensor data in representing real-world scenes and supporting robust and reliable testing and development of ADS (or any subsystem thereof). Credibility can, for example, encompass the validity, reliability, and / or accuracy of synthetic sensor data. In other words, credibility can be considered a collective term for one or more of these aspects. More specifically, validity can refer to the degree to which the synthetic sensor data reflects (or conforms to) the characteristics of real-world sensor data or scenes. In other words, validity can be a measure of how far the synthetic sensor data can be trusted for its intended purpose without introducing artifacts or inconsistencies that could lead to erroneous conclusions. Reliability reflects the level of consistency of the generated synthetic sensor data. In other words, reliability means that the generated data consistently exhibits the same characteristics and behavior under similar conditions, ensuring its repeatability and predictability when used for testing or analysis. Accuracy can refer to the degree to which synthetic sensor data matches the scenario it is meant to depict. In other words, it indicates how close the data is to the expected or requested scenario.
[0037] Recent examples of techniques for transforming (or perturbing) data (sensor data in this context) include: rendering-based methods that can shift the viewpoint of an image, or even move / offset, add, or remove objects in a scene; and generative models used to alter the texture and appearance of certain objects or parts of a scene. Examples of rendering-based methods are Neural Radiance Fields (NeRF) and Gaussian Splatting. Generative models can refer to Generative Adversarial Networks (GANs), Denoising Diffusion Probabilistic Models (DDPM), and Normalizing Flows, among others. Perturbations or transformations using such methods can render alternative scenes from the initial input of raw data. As achieved by the disclosed techniques, assuming the magnitude of the transformation is reasonable, the alternative scene can appear realistic, and the methods thus provide a means to render / generate synthetic data based on samples of the raw data.
[0038] The disclosed technology connects this transformation system with a query system for new scenarios, which further considers the space of modified scenarios accessible from existing samples through this transformation. To this end, the disclosed technology utilizes a scenario space (also known as a multidimensional space), as will be explained further below.
[0039] Definitions
[0040] Throughout this disclosure, various machine learning techniques are generally referred to as machine learning models (or simply "models"). The term covers any form of machine learning algorithm, such as deep learning models or neural networks, that learns and adapts from input data and subsequently performs predictions, decisions, classifications, or any other related tasks based on new data.
[0041] Deploying a machine learning model typically involves a training phase, during which the model learns from labeled or unlabeled training data to achieve accurate predictions during the subsequent inference phase. Training data (and input data during inference) can be, for example, images or image sequences, LiDAR data (i.e., point clouds), radar data, or any other form of data. Furthermore, training / input data can include combinations or fusions of one or more different data types. Additionally, or in combination, it can include combinations or fusions of two or more instances of the same data type, such as two or more images from different cameras.
[0042] In some implementations, the machine learning model may be implemented in any manner deemed appropriate by a person skilled in the art, using suitable publicly available machine learning software components, such as those available in PyTorch, TensorFlow, and Keras, or in any other appropriate software development platform.
[0043] An example of the machine learning techniques mentioned below is the so-called Neural Radiance Field (NeRF). NeRF is an example of a way to provide a learnable (e.g., through backpropagation) representation of a scene and is used in conjunction with a rendering process. NeRF is therefore an example of a learning-based rendered scene representation. As the name suggests, NeRF utilizes radiance fields and is therefore a radiance-based technique. Furthermore, NeRF is neural because it is (at least in part) constructed from neural networks. NeRF can, for example, enable the rendering of new viewpoints in a recorded scene, or alter the presence or position of objects in the scene.
[0044] More specifically, NeRF is a neural network capable of reconstructing a 3D scene from a partial set of 2D images (or other sensor data types). NeRF can learn the geometry, objects, and viewing angles of a specific scene, for example by modelling how light travels through the scene. It can then be used to render realistic views from different viewpoints and for different sensor data types. A view can be rendered as a 2D view or a 3D view. A view can further be generated with a temporal dimension to produce a dynamic scene, and can thus also be rendered as a 4D view. NeRF is typically built from a so-called multilayer perceptron (MLP), a fully connected neural network architecture. The network can be trained to map spatial coordinates and viewing directions (e.g., rays through the pixels of an image) to color and density values. The MLP takes inputs such as a location in 3D space and a 2D viewing direction and determines the color and density values at each point in the 3D scene.
[0045] NeRF requires training (i.e., learning) for each unique scene using sensor data (e.g., images) from different viewpoints. Furthermore, the sensor's position and orientation are needed, requiring sensor tracking. This can be accomplished, for example, through some combination of SLAM, GPS, or inertial measurement. Alternatively, it can be done after acquisition through analysis of the sensor data, for example, with the help of neural networks.
[0046] The training process for NeRF can generally be described as follows, using a camera as an example. For each of the sparse camera (and image) viewpoints provided, a set of 3D points with a given ray direction (incoming to the camera) is generated by tracing camera rays through the scene. For these points, volume density and emitted radiance are predicted using an MLP. Given the density, the colors along each ray can be weighted together in a way that accounts for occlusion (i.e., objects blocking the light). A rendered image can then be generated using classical volume rendering. The error between the rendered image and the original image can be minimized (e.g., through gradient descent) across multiple viewpoints, encouraging the MLP to develop a coherent scene model.
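The weighting of colors along a ray can be made concrete with the standard volume rendering quadrature commonly used with NeRF. The sketch below is not taken from the disclosure: the function name render_ray and the toy densities and colors are assumptions, and in a real system the densities and colors would come from the trained MLP rather than fixed lists.

```python
import math


def render_ray(densities, colors, deltas):
    """Composite samples along one ray (ordered near to far) into one colour.

    weight_i = T_i * (1 - exp(-sigma_i * delta_i)), where the transmittance
    T_i is the fraction of light not yet absorbed by samples in front of i.
    """
    transmittance = 1.0
    pixel = [0.0, 0.0, 0.0]
    weights = []
    for sigma, color, delta in zip(densities, colors, deltas):
        alpha = 1.0 - math.exp(-sigma * delta)  # opacity of this segment
        w = transmittance * alpha
        weights.append(w)
        pixel = [p + w * c for p, c in zip(pixel, color)]
        transmittance *= 1.0 - alpha
    return pixel, weights


# Empty space, then a dense red sample in front of a dense green one.
pixel, weights = render_ray(
    densities=[0.0, 50.0, 50.0],
    colors=[(0, 0, 1), (1, 0, 0), (0, 1, 0)],
    deltas=[1.0, 1.0, 1.0],
)
```

The red sample absorbs essentially all of the ray, so the green sample behind it contributes almost nothing; this is how density encodes occlusion. During training, the difference between such rendered pixels and the recorded image pixels would be the error that is minimized.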
[0047] Another example of a learning-based rendering method for representing scenes is Gaussian splatting. Similar to NeRF, Gaussian splatting is also a radiance-based technique, one that uses rasterization. More specifically, Gaussian splatting is a volume rendering technique that processes volume data directly without converting the data into surface or line primitives. This technique builds on the sparse points generated during camera calibration and represents the scene using 2D or 3D Gaussians that preserve the properties of the continuous volumetric radiance field. Sparse points (or point clouds) can be initialized, for example, randomly and / or obtained from LiDAR point clouds. Gaussians can have positions that vary over time and can therefore be used to render dynamic 4D scenes (i.e., including the time dimension).
[0048] In this case, the scene representation can include a set of Gaussians. The set of learnable parameters can then correspond to each Gaussian's position, size, rotation, and spherical harmonics. Rendering can be accomplished by projecting the 3D (or dynamic 4D) Gaussians onto the image plane. Then, for each pixel, the algorithm iterates through the splatted Gaussians in order of their distance from the current camera position and accumulates their density and color.
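The per-pixel accumulation step can be sketched as front-to-back alpha blending of Gaussians that have already been projected to the image plane. This is a simplified illustration under stated assumptions: each projected Gaussian is isotropic with a fixed color (practical implementations use anisotropic covariances and view-dependent color from spherical harmonics), and the dictionary keys and the name splat_pixel are hypothetical.

```python
import math


def splat_pixel(pixel_xy, gaussians):
    """Blend projected 2D Gaussians at one pixel, nearest first.

    Each Gaussian is a dict with hypothetical keys: mean (x, y), sigma
    (isotropic std-dev in pixels), depth, opacity, color (r, g, b).
    """
    color = [0.0, 0.0, 0.0]
    transmittance = 1.0
    for g in sorted(gaussians, key=lambda g: g["depth"]):
        dx = pixel_xy[0] - g["mean"][0]
        dy = pixel_xy[1] - g["mean"][1]
        falloff = math.exp(-(dx * dx + dy * dy) / (2.0 * g["sigma"] ** 2))
        alpha = g["opacity"] * falloff  # contribution of this Gaussian at the pixel
        color = [c + transmittance * alpha * gc for c, gc in zip(color, g["color"])]
        transmittance *= 1.0 - alpha
    return color


gaussians = [
    {"mean": (5, 5), "sigma": 2.0, "depth": 10.0, "opacity": 1.0, "color": (0, 1, 0)},
    {"mean": (5, 5), "sigma": 2.0, "depth": 2.0, "opacity": 0.5, "color": (1, 0, 0)},
]
rgb = splat_pixel((5, 5), gaussians)
```

The half-transparent red Gaussian is nearer, so it is blended first and lets half of the green one behind it show through, regardless of the order in which the list was given.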
[0049] As an alternative to rendering-based techniques, generative models can also be used to transform sensor data. Examples of this machine learning technique include Generative Adversarial Networks (GANs) and diffusion models.
[0050] Generative Adversarial Networks (GANs) are machine learning frameworks consisting of two neural networks: a generator and a discriminator, which compete against each other in a zero-sum game. The generator creates data that resembles real-world samples (e.g., images, audio, or text), while the discriminator evaluates whether a given input is real (from a dataset) or fake (generated by the generator). Through this adversarial process, the generator can improve its ability to create realistic outputs, while the discriminator can improve its ability to distinguish between real and fake data. This dynamic leads to the generation of highly realistic synthetic data.
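The opposing goals of the two networks can be illustrated through their loss functions. The sketch below assumes the discriminator outputs a probability that its input is real and uses binary cross-entropy; the generator loss shown is the commonly used non-saturating variant, which is an assumption of this example rather than something stated in the disclosure. No networks are trained here; the numbers stand in for discriminator outputs.

```python
import math


def bce(prediction, target):
    """Binary cross-entropy for a single probability, clamped away from 0 and 1."""
    eps = 1e-12
    p = min(max(prediction, eps), 1.0 - eps)
    return -(target * math.log(p) + (1 - target) * math.log(1.0 - p))


def discriminator_loss(d_real, d_fake):
    """The discriminator wants real inputs scored 1 and generated inputs scored 0."""
    return bce(d_real, 1.0) + bce(d_fake, 0.0)


def generator_loss(d_fake):
    """The generator wants its outputs scored as real."""
    return bce(d_fake, 1.0)


# A discriminator that is fooled (scores generated data at 0.9) ...
fooled = (discriminator_loss(0.9, 0.9), generator_loss(0.9))
# ... versus one that spots generated data (scores it at 0.1).
sharp = (discriminator_loss(0.9, 0.1), generator_loss(0.1))
```

When the discriminator improves, its own loss falls while the generator's loss rises, and vice versa, which is the adversarial dynamic described above.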
[0051] Diffusion models are a class of generative machine learning models used to create synthetic data that includes sensor data. They work by iteratively transforming random noise into structured data. During the training phase, the model learns to reverse the process of progressively adding noise to the real data, effectively disrupting its structure. In the generation phase, the model applies the learned reverse process to generate new data from the random noise that is similar to the original training data.
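The noise-adding (forward) process that the model learns to reverse has a closed form in the DDPM formulation: x_t = sqrt(a_bar_t) * x_0 + sqrt(1 - a_bar_t) * eps, where a_bar_t is the cumulative product of (1 - beta) over the noise schedule. The sketch below shows only this forward step; the linear schedule values, the three-element toy "sensor sample", and the function names are illustrative assumptions.

```python
import math
import random


def alpha_bar(betas, t):
    """Cumulative product of (1 - beta) up to and including step t."""
    prod = 1.0
    for beta in betas[: t + 1]:
        prod *= 1.0 - beta
    return prod


def add_noise(x0, t, betas, rng):
    """Sample x_t from x0 in one shot; also return the noise that was added."""
    a_bar = alpha_bar(betas, t)
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [math.sqrt(a_bar) * x + math.sqrt(1.0 - a_bar) * e
          for x, e in zip(x0, eps)]
    return xt, eps


# Linear noise schedule (values are illustrative).
steps = 1000
betas = [1e-4 + (0.02 - 1e-4) * i / (steps - 1) for i in range(steps)]
rng = random.Random(0)
x0 = [1.0, -1.0, 0.5]
x_early, _ = add_noise(x0, 0, betas, rng)
x_late, eps_late = add_noise(x0, steps - 1, betas, rng)
```

At the first step the sample is almost unchanged, while at the last step it is almost pure noise. A denoising network is typically trained to predict the returned noise from the noised sample, which is what allows generation to start from random noise.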
[0052] Diffusion models are well-suited for generating synthetic sensor data because they can capture fine details and complex patterns, making them useful for creating realistic representations of inputs such as images, point clouds, or time-series data.
[0053] All of the above techniques can be used as part of the aforementioned transformation system. This can be used in conjunction with a query system that is based on the use of so-called embedding networks (also known as “encoding networks,” “embedding neural networks,” or “embedding artificial neural networks”).
[0054] Embedding networks refer to computational models or a set of techniques that enable computers to generate embedded representations of input data (such as sensor data, text data, etc.), where an "embedding" is a mathematical (vector) representation of the input data. More specifically, embedding networks can be used to transform input data into a more compact representation in a multidimensional space while preserving meaningful relationships between input data points.
[0055] Embedding networks are used for tasks such as Natural Language Processing (NLP) and Computer Vision. These networks take raw input data, such as words in a sentence or pixels in an image, and transform them into fixed-size numerical vectors (embeddings) that capture the essential features or characteristics of the input data. More specifically, in NLP, embedding networks convert words into numerical vectors, where words with similar meanings or contextual usages are represented as being closer to each other in the embedding space. Similarly, in computer vision, embedding networks convert images into numerical vectors, enabling the network to understand visual similarities, such as grouping similar objects or scenes more closely together in the embedding space (a multidimensional (vector) space).
[0056] Embedding networks themselves can include layers of a neural network architecture, typically employing convolutional layers, recurrent layers, fully connected layers, attention layers, or transformer layers to learn and extract meaningful patterns from input data. Embedding networks can be trained through processes such as supervised learning, unsupervised learning, or self-supervised learning to optimize embeddings for specific downstream tasks such as classification, clustering, or recommendation.
[0057] Different data types can be embedded using different embedding networks. These different embedding networks can then be trained to generate embeddings in the same embedding space (the same multidimensional space) so that contextually, spatially, and / or temporally related embeddings (generated by different embedding networks) point to the same point within the multidimensional space. The term "pointing to the same point within the multidimensional space" should be interpreted broadly in this context and encompasses "pointing in substantially the same direction within the multidimensional space" or "pointing to substantially the same point within the multidimensional space," etc. More specifically, given two embedding vectors pointing to the same point or in the same direction, a relationship between the two underlying data samples can be inferred. For example, given two embedding vectors, how closely they point to the same point or in the same direction can be calculated to determine the relationship between the underlying data samples: the closer they are, the more likely the underlying data samples relate to the same object or the same scene.
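As a non-limiting illustration, the two notions of proximity mentioned above ("same direction" and "same point") can be computed with cosine similarity and Euclidean distance, respectively. The function names and the example vectors are assumptions made for this sketch.

```python
import math


def cosine_similarity(a, b) -> float:
    """Directional similarity in [-1, 1]; 1 means the same direction.

    Assumes both vectors are non-zero and have the same dimension.
    """
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def euclidean_distance(a, b) -> float:
    """Point-to-point distance; 0 means the same point."""
    return math.dist(a, b)


# Hypothetical embeddings produced by two different embedding networks.
image_embedding = [1.0, 0.0]
lidar_embedding = [2.0, 0.0]
same_direction = cosine_similarity(image_embedding, lidar_embedding)
```

Which of the two measures is appropriate depends on whether the embedding networks are trained to align directions (often with normalized embeddings) or absolute positions in the multidimensional space.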
[0058] This can be accomplished, for example, by training a first embedding network to generate embeddings in a multidimensional space based on input data from a first data source. Then, each of the other embedding networks can be trained "to" the first embedding network (or any other already trained embedding network), or trained in association with the first embedding network, so that embeddings of other networks contextually, spatially, and / or temporally related to the embeddings of the first embedding network point to the same point in the multidimensional space as the relevant embedding of the first embedding network. For example, if the first embedding network is trained to generate image embeddings for camera images, and the second embedding network is designed to generate embeddings for LiDAR data, the second embedding network can be trained by feeding LiDAR data of the scene to it, where the corresponding image embeddings (of the scene) will be used as the basis for forming the ground truth (desired output). By performing this process on each subsequent embedding network, a set of embedding networks capable of ingesting outputs from various data sources and outputting corresponding embeddings can be obtained, where contextual, spatial, and / or temporal relationships are represented by the proximity or directional similarity of the embeddings (vectors) in the multidimensional space.
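As a non-limiting illustration of training one embedding network "to" another, the sketch below fits a toy linear "LiDAR encoder" so that its outputs regress onto fixed image embeddings of the same scenes, which serve as ground truth. The linear model, the squared-error loss, the learning rate, and all names are assumptions made for this sketch; an actual embedding network would be a neural network trained with a framework of choice.

```python
def apply(weights, x):
    """Toy linear encoder: returns weights @ x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def alignment_loss(weights, pairs):
    """Mean squared error between encoder outputs and target embeddings."""
    total = 0.0
    for x, target in pairs:
        pred = apply(weights, x)
        total += sum((p - t) ** 2 for p, t in zip(pred, target))
    return total / len(pairs)


def train_step(weights, pairs, lr=0.1):
    """One gradient-descent step on the alignment loss."""
    grad = [[0.0] * len(weights[0]) for _ in weights]
    for x, target in pairs:
        pred = apply(weights, x)
        for i, (p, t) in enumerate(zip(pred, target)):
            for j, xj in enumerate(x):
                grad[i][j] += 2.0 * (p - t) * xj / len(pairs)
    return [[w - lr * g for w, g in zip(row, grow)]
            for row, grow in zip(weights, grad)]


# Hypothetical pairs: (LiDAR features of a scene, image embedding of the same scene).
pairs = [([1.0, 0.0], [0.0, 1.0]), ([0.0, 1.0], [1.0, 0.0])]
weights = [[0.0, 0.0], [0.0, 0.0]]
for _ in range(200):
    weights = train_step(weights, pairs)
```

After training, the toy encoder maps each LiDAR input close to the image embedding of the same scene, which is the property the shared multidimensional space relies on.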
[0059] Below, reference is made to a scene embedding network and a query embedding network. Both are embedding networks as generally described above. The different names should only be understood as indicating different functions of the embedding networks. The scene embedding network is configured to generate scene embeddings for scene samples based on sensor data recorded in the vehicle. The query embedding network is configured to generate query embeddings for query scenes. This will be further elaborated below in conjunction with Figure 1.
[0060] It should be noted that the disclosed techniques are not limited to the examples of machine learning techniques described above. For example, other machine learning techniques employing some of the aspects described above, as well as entirely different techniques implemented by those skilled in the art, may be used.
[0061] The vehicle's surrounding environment can be understood as the general area around the vehicle, where objects (such as traffic signs, other vehicles, landmarks, obstacles, etc.) can be detected and identified by the vehicle's sensors (radar, LiDAR, cameras, etc.), i.e., within the vehicle's sensor range. Sensor data can therefore depict the world around the vehicle. In other words, the surrounding environment can refer to the world around the vehicle that is relevant to the vehicle's decision-making and control.
[0062] The term "synthetic" as used herein, as in the context of synthetic sensor data, means that the data is machine-generated (or computer-generated), rather than recorded in the real world or collected in other ways. However, it should be recognized that synthetic sensor data can be generated from "real" sensor data, for example, by performing the transformations described below on the real sensor data. In this context, synthetic sensor data can be viewed as transformed sensor data, that is, the original sensor data after undergoing some transformations.
[0063] Implementation
[0064] Figure 1 is a schematic flowchart representation of a computer-implemented method 100 for expanding a scene database. More specifically, it can be a method 100 for expanding the reachable volume within the scene space of a scene database based on existing scene samples. As will be further explained below, the reachable volume can be viewed as the set of scenes that can be covered by the available scene samples through the use of associated sensor data transformation techniques.
[0065] The term "expand" in "expanding the scene database" can be interpreted herein as adding new scene samples to an existing set of scene samples. This can be accomplished, for example, by collecting, generating, providing, or otherwise obtaining said new scene samples. In other words, by performing method 100 as described herein, new scene samples (i.e., scene samples that did not previously exist in the scene database) can be added to the scene database. Method 100 can also therefore be viewed as a method for creating, populating, or otherwise generating the scene database. It should be noted that the new scene samples do not need to be added to the same database instance as the existing scene samples in the scene database. They can also be added to separate databases. Separate databases can then form part of the scene database. The (expanded) scene database can be used for scene-based testing and / or validation of autonomous driving systems. The scene database will be explained further below.
[0066] Method 100 can be executed by a general-purpose computing device such as a server (also referred to as a remote server, cloud server, central server, back-end server, or fleet server). More specifically, method 100 can be executed by a server's processing system. The processing system may, for example, include one or more processors and one or more memories coupled to the one or more processors, wherein the one or more memories store one or more programs that, when executed by the one or more processors, perform the steps, services, and functions of method 100 disclosed herein.
[0067] The different steps of method 100 are described below in more detail. Even though illustrated in a specific order, the steps of method 100 can be performed in any suitable order and multiple times. Therefore, although the accompanying drawings may show a specific order of method steps, the order of steps may differ from what is depicted. Furthermore, two or more steps may be performed simultaneously or partially simultaneously. Such variation will depend on the chosen software and hardware systems and on the designer's choice. All such variations are within the scope of this invention. Similarly, software implementations can be accomplished using standard programming techniques with rule-based logic and other logic to complete the various steps. Further variations of method 100 will become apparent from this disclosure. The embodiments mentioned and described herein are given by way of example only and should not be construed as limiting the invention. Other solutions, uses, purposes, and functions within the scope of the invention as claimed in the patent claims described below will be apparent to those skilled in the art.
[0068] It should be further recognized that method 100 in Figure 1 includes steps illustrated with solid lines and steps illustrated with dashed lines. The steps illustrated with solid lines are those included in the broadest example implementation of method 100. The steps illustrated with dashed lines are examples of multiple optional steps that may form part of multiple optional implementations. It should be understood that the optional steps do not need to be performed sequentially. Furthermore, it should be understood that not all steps need to be performed. The example steps can be performed in any order and in any combination. For example, method 100 may optionally include a step denoted as S104. Alternatively, or in combination with step S104, the method may include a set of steps denoted as S102a, S102b, and S102c and / or a set of steps denoted as S102a' and S102b'. The steps denoted as S102a, S102b, and S102c can be considered as a set of sub-steps of the step denoted as S102. Similarly, the steps denoted as S102a' and S102b' can be considered as another set of sub-steps of the step denoted as S102. These two sets of sub-steps can be executed separately as alternatives, or they can be executed in combination with each other. For example, the indication of an empty volume can be generated as a combined volume of two volumes generated by the corresponding sets of sub-steps.
[0069] As previously described, method 100 uses a scene database. The scene database can be viewed as a collection of existing scene samples. More specifically, the scene database includes multiple scene samples. Each scene sample includes sensor data depicting the vehicle's surrounding environment over a period of time. In other words, the scene samples in the scene database can include sequences of sensor data. The sensor data for each scene sample can therefore be multiple sensor records (such as image frames constituting a video sequence) over a period of time. The time period can extend over at least two subsequent time points. The sensor data depicting the scene can, for example, include two or more sensor data frames. However, the time period can also be a single time point. Therefore, the sensor data associated with a scene sample can include the sensor data frame at that time point. It should be noted that the sensor data associated with a scene sample can include sensor data of one or more sensor data modalities. In other words, the sensor data can include one or more sensor data types such as image data, LiDAR data, radar data, ultrasonic data, etc. Furthermore, the sensor data can include sensor data captured by two or more instances of the same sensor data type, such as image data from two or more cameras.
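As a non-limiting illustration, a scene sample as described above (one or more sensor data frames, one or more modalities, possibly several sensors of the same type) could be represented by a data structure such as the following. The class and field names are assumptions made for this sketch and do not reflect an actual schema of the scene database.

```python
from dataclasses import dataclass, field


@dataclass
class SensorFrame:
    timestamp_s: float   # capture time in seconds
    modality: str        # e.g. "camera", "lidar", "radar", "ultrasonic"
    sensor_id: str       # distinguishes several sensors of the same modality
    payload: bytes       # raw or encoded sensor data


@dataclass
class SceneSample:
    sample_id: str
    frames: list[SensorFrame] = field(default_factory=list)

    def modalities(self) -> set[str]:
        """Sensor data types present in this scene sample."""
        return {frame.modality for frame in self.frames}

    def duration_s(self) -> float:
        """Length of the covered time period; 0.0 for a single time point."""
        if not self.frames:
            return 0.0
        times = [frame.timestamp_s for frame in self.frames]
        return max(times) - min(times)


# Hypothetical sample: two front-camera frames and one LiDAR sweep.
sample = SceneSample("scene-001", [
    SensorFrame(0.0, "camera", "front", b""),
    SensorFrame(0.5, "camera", "front", b""),
    SensorFrame(0.0, "lidar", "roof", b""),
])
```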
[0070] It should be noted that even when a scene sample in the scene database refers to "a" or "the" vehicle, the scene sample can of course be captured by / for multiple different vehicles. The vehicle referred to in connection with a particular scene sample therefore refers to any vehicle that captured the corresponding sensor data, or a vehicle otherwise linked to said sensor data (e.g., the vehicle depicted in the sensor data). The sensor data can therefore be captured by the vehicle's onboard sensors. Alternatively, the sensor data can be captured by non-vehicle sensors, such as sensors of roadside infrastructure or sensors of other road users (such as other vehicles), within the field of view of the vehicle's surroundings.
[0071] A driving scenario (or situation) can be understood as a situation or a series of events. A scenario can also be described as a situation that evolves over a period of time. Driving scenarios can range from common situations (such as following another vehicle on a highway) to rarer edge cases (such as avoiding obstacles while entering a busy road). The purpose of ADS development is to ensure that the ADS can effectively handle both typical and special situations.
[0072] Furthermore, a scenario can be defined by a set of conditions or circumstances under which the vehicle operates during the stated time period. This can encompass a variety of environmental factors, or factors that can affect the driving experience, the vehicle's performance, and how it operates. A driving scenario can include, for example, one or more of the following: a specific route, geographical location, type of driving environment (e.g., school zone, urban environment, highway section, etc.), type of road (e.g., highway, city street, rural road, intersection, and roundabout), presence and type of other road users, time of day (e.g., morning, noon, evening, night, etc.), traffic level (e.g., peak hours, low traffic density, etc.), weather conditions, road conditions, lighting level, traffic flow level (e.g., speed, distance from other road users), etc. It should be understood that a driving scenario can be defined by any combination of the above examples. As a non-limiting example, a scenario can be defined as "driving on a city street in heavy rain, pedestrians crossing the road, and low traffic flow."
[0073] The term "sample" in "scene sample" can then be viewed as an instance of a scene within the scene database. More specifically, a scene sample can be considered as an existing scene that has been recorded and exists in the scene database. Each scene sample can be associated with a set of data, such as sensor data collected by onboard sensors on the vehicle, and other data added to the sample, which will be explained further below. Data stored in the scene database that indicates recorded scenes (as explained further below) can then constitute new scene samples in the scene database.
[0074] A query scenario can refer to a requested or desired scenario. A query scenario can be a scenario that does not currently exist in a scenario database but is desired to be obtained. A query scenario can be represented by a textual description of the query scenario, for example a sentence such as "driving on a city street, in heavy rain, pedestrians are crossing the road, and traffic is light." Alternatively or in combination, a query scenario can be represented by a computer-simulated scenario. For example, a computer-simulated scenario can be generated by computer graphics. More realistic sensor data of the simulated scenario can then be desired, which can be achieved by synthetic sensor data generated by the proposed method 100. More specifically, the computer-simulated scenario can be represented by a scenario embedding, just like "real" sensor data. Method 100 then provides means for collecting real-world sensor data of the scenario corresponding to the scenario embedding.
[0075] Each scene sample is further associated with a scene embedding that represents the scene sample in a multidimensional space. The scene embedding associated with a scene sample can be generated by processing sensor data of the scene sample through a scene embedding network trained to process data from the input sensor data and output the corresponding scene embedding in a multidimensional space. In the context of embedding being associated with a scene, the phrase "associated with" can therefore be understood as an embedding representing the scene in a multidimensional space.
[0076] The multidimensional space (also referred to as the "scene space" or "embedding space") herein refers to a mathematical space in which high-dimensional data (such as sensor data) can be transformed and represented as low-dimensional vector representations, called embeddings. The multidimensional space can be structured so that embeddings can capture meaningful patterns, relationships, or features from the original data, enabling effective processing, comparison, and analysis.
[0077] More specifically, in an embedding space, similar sensor data (e.g., frames from similar driving scenarios) can be mapped to points that are close together, while dissimilar data can be mapped to points that are farther apart. This is helpful for tasks such as classification, clustering, retrieval, and anomaly detection in autonomous driving systems. For example, the embedding space can be used (e.g., through clustering) to group sensor data from similar driving scenarios, or (e.g., through matching algorithms) to identify rare or challenging events for training and testing purposes.
[0078] Embedding spaces can further enable mapping between different data modalities. More specifically, a multidimensional space can be a common space for two or more data modalities. For example, different types of sensor data (e.g., image data, LiDAR data, radar data, etc.) and text data (e.g., scene descriptions) can be mapped to the same embedding space. Thus, a query embedding generated for text data, for example, can be used to identify, for example, image data by comparing it with an embedding associated with image data. In some examples, the multidimensional space is a common space for both image and text data. Furthermore, an embedding space can be formed from two or more subspaces. Subspaces can, for example, have different dimensions or be used for different data modalities. The techniques described herein can then be applied across different subspaces.
[0079] In this context, the multidimensional space refers to the space encompassing different possible driving scenarios, i.e., the "scenario space." Analysis of the scenario space can therefore provide information about which scenarios are covered (or not covered) by existing scenario samples in the scenario database. Furthermore, the scenario space allows for the identification of specific scenarios, such as those corresponding to a specific query scenario, thus facilitating the retrieval of relevant data in the scenario database.
[0080] The construction of data embeddings (or vector representations or encodings) typically involves machine learning models such as neural networks, which are trained to learn representations that preserve the underlying semantics of the input data. As explained earlier, such networks can be called embedding networks. Scene embeddings of scene samples in a scene database can thus be generated by processing scene samples (or more specifically, corresponding sensor data) through one or more scene embedding networks. Different scene embedding networks can be used for different types of sensor data. Furthermore, the sensor data for scene samples can include more than one sensor data type, such as two or more from image data, LiDAR data, radar data, etc. More generally, the sensor data for scene samples can include sensor data of a first sensor type and sensor data of a second sensor type. Each scene embedding can be formed by aggregating a first sensor embedding generated for sensor data of the first sensor type and a second sensor embedding generated for sensor data of the second sensor type. In other words, separate embeddings can be generated for different sensor data types. Scene embeddings can then be formed by aggregating (or combining) the embeddings of different sensor data types in any other way. However, it should be noted that a single embedding network can be trained to directly generate scene embeddings for two or more sensor data types (i.e., two or more sensor modalities).
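As a non-limiting illustration of forming a scene embedding by aggregating per-sensor embeddings, the sketch below averages L2-normalized sensor embeddings and renormalizes the result. Averaging is only one possible aggregation and is an assumption made for this sketch, as are the function names; the sensor embeddings are assumed to live in the same multidimensional space.

```python
import math


def l2_normalize(vector):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vector]


def aggregate_embeddings(sensor_embeddings):
    """Combine per-sensor embeddings into one scene embedding.

    Each sensor embedding is normalized first so that no modality dominates
    merely because of its vector magnitude.
    """
    if not sensor_embeddings:
        raise ValueError("at least one sensor embedding is required")
    unit = [l2_normalize(e) for e in sensor_embeddings]
    dim = len(unit[0])
    if any(len(e) != dim for e in unit):
        raise ValueError("all sensor embeddings must share one dimension")
    mean = [sum(e[i] for e in unit) / len(unit) for i in range(dim)]
    return l2_normalize(mean)


# Hypothetical camera and LiDAR embeddings for the same scene sample.
scene_embedding = aggregate_embeddings([[2.0, 0.0], [0.0, 3.0]])
```

Other aggregations, such as concatenation or a learned fusion layer, would fit the same description; concatenation in particular would yield an embedding space formed from modality-specific subspaces.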
[0081] Each scene embedding (or scene sample) in the scene database can be further associated with a transformation volume in the multidimensional space. A transformation volume can indicate a set of possible transformed scenes that can be generated from the corresponding scene sample. The transformation volume of a scene sample can thus be viewed as a subspace within the multidimensional space that covers a set of transformations reachable from said scene sample. In other words, the transformation volume spans a set of scene embeddings associated with synthetic sensor data that can be generated from the scene sample (through the transformations described herein). The transformation volume can be further viewed as a finite volume within the multidimensional space, or a constraint on transformations within the multidimensional space. The constraint on the transformation volume is the degree to which the original sensor data can be modified while maintaining a certain level of realism (e.g., satisfying a certain level of reliability). If the synthetic sensor data deviates too far from the original sensor data, artifacts, biases, or other errors may be introduced, reducing its usefulness for ADS development. The transformation volume can thus further indicate a set of possible transformed scenes that can be generated from the scene sample while satisfying validity thresholds, reliability thresholds, and / or accuracy thresholds. As explained above, the validity threshold can refer to the degree to which the synthetic sensor data reflects the characteristics of real-world sensor data. In other words, it can be a measure of how well the synthetic sensor data serves its intended purpose without introducing artifacts or inconsistencies that might lead to erroneous conclusions. A reliability threshold can be a measure of how consistently synthetic sensor data is generated. An accuracy threshold can be a measure of how well the synthetic sensor data matches the query scenario. In some implementations, an overall reliability threshold can be used. The overall reliability threshold can capture two or more aspects of the validity threshold, the reliability threshold, and the accuracy threshold. In other words, the overall reliability threshold can be a measure of how realistic the synthetic sensor data appears, or how representative it is of real-world sensor data. The transformation volume will be further explained below in conjunction with Figure 4 and Figure 5.
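As a non-limiting illustration, a transformation volume could be approximated by a hypersphere centered on the scene embedding, with a radius chosen so that transformed scenes inside it still satisfy the applicable thresholds. The spherical shape, the way the radius would be derived, and the names below are assumptions made for this sketch; the disclosure does not prescribe a particular shape.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class TransformationVolume:
    center: tuple[float, ...]   # scene embedding of the existing scene sample
    radius: float               # hypothetical limit beyond which realism is lost

    def contains(self, embedding) -> bool:
        """True if the embedding is reachable by transforming the scene sample."""
        return math.dist(self.center, embedding) <= self.radius


# Hypothetical scene sample whose transformations stay valid within radius 1.0.
volume = TransformationVolume(center=(0.0, 0.0), radius=1.0)
reachable = volume.contains((0.5, 0.5))
unreachable = volume.contains((1.0, 1.0))
```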
[0082] It should be noted that the scene database can be constructed in different ways depending on the specific implementation. For example, the scene database can be represented by a single database that includes all the data of the scene database described herein. In another example, the data can be distributed across several databases linked together to form the scene database. As an example, the sensor data associated with each scene sample can be stored in a first database. The associated scene embeddings (optionally also including the associated transformation volume) can be stored in a second database. The first and second databases can then be linked by scene sample identifiers. It should be further noted that the scene database can include additional data. For example, each scene sample can have an associated learning-based rendered scene representation or other means for transforming sensor data, as will be explained further below.
[0083] Method 100 includes obtaining an indication of an empty volume (S102) in the multidimensional space. An empty volume is a volume not covered by the multiple existing scene samples in the scene database. An empty volume can therefore refer to a volume or subspace within the scene space in which no existing data points are located. In other words, an empty volume can refer to a set of scenes not covered (or unreachable) by existing scene samples in the scene database. An empty volume can be defined, for example, based on distances to existing scene embeddings in the multidimensional space. More specifically, an empty volume can be determined as a volume spanning a set of data points whose distances to the nearest existing scene embedding are greater than a threshold distance. Alternatively or in combination, an empty volume can be a volume not covered by the transformation volumes of the multiple existing scene samples in the scene database. In other words, the transformation volumes of existing scene samples can be taken into account when identifying an empty volume. It should be noted that when identifying an empty volume, both the transformation volumes and the distances to existing scene samples (i.e., to the corresponding scene embeddings) can be considered. For example, collecting scene data in the outer region of a transformation volume (i.e., towards its boundary) can be relevant. In other words, there can be scenes whose scene embeddings lie within the transformation volume of an existing scene but at a certain distance from the scene embedding of that existing scene. How empty volumes are identified will be further explained below in conjunction with Figure 5.
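As a non-limiting illustration of the distance-based definition above, the sketch below treats a point of the multidimensional space as belonging to an empty volume when its distance to the nearest existing scene embedding exceeds a threshold distance. The function names and the numeric values are assumptions made for this sketch.

```python
import math


def nearest_distance(point, scene_embeddings) -> float:
    """Distance from a point to the closest existing scene embedding."""
    return min(math.dist(point, embedding) for embedding in scene_embeddings)


def in_empty_volume(point, scene_embeddings, threshold_distance: float) -> bool:
    """True if no existing scene embedding lies within the threshold distance."""
    if not scene_embeddings:
        return True  # an empty database covers nothing
    return nearest_distance(point, scene_embeddings) > threshold_distance


# Hypothetical 2-D scene space with two existing scene embeddings.
existing = [(0.0, 0.0), (10.0, 0.0)]
gap = in_empty_volume((5.0, 0.0), existing, threshold_distance=2.0)
covered = in_empty_volume((1.0, 0.0), existing, threshold_distance=2.0)
```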
[0084] The term "uncovered," as used in expressions such as "not covered by multiple existing scene samples" or "not covered by the transformation volume of multiple existing scene samples," can be understood herein as referring to data points (in this case, scenes) that are not represented or are insufficiently represented in the existing scene set and may not be reachable through transformations of existing scenes. These can include situations, events, or combinations of variables that are outside the scope or domain of existing data points in the scene database. Such uncovered scenes (or empty volumes) can therefore leave gaps in the scene database, which can affect the ability to fully train, test, or validate a system using the database. As a specific and non-limiting example, a scene database may include scene samples of highway driving under multiple different weather and traffic conditions, but lack data for nighttime highway driving under foggy conditions; such scenes would be "not covered by the database." As proposed herein, this can be determined in the scene space by analyzing the coverage of scene embeddings associated with existing scene samples and / or their corresponding transformation volumes.
[0085] The term "obtain" is to be interpreted broadly herein and encompasses the direct and / or indirect receiving, retrieving, collecting, and acquiring of information between two entities configured to communicate with each other or further with other external entities. However, in some embodiments, the term "obtain" is to be interpreted as determining, deriving, forming, calculating, etc.
[0086] In this specific case, the step of obtaining an indication of the empty volume (S102) may include multiple sub-steps. Two different sets of sub-steps are described below. The first set of sub-steps includes steps denoted as S102a, S102b, and S102c, and can be viewed as describing a situation where searching for a specific query scene in the scene database fails, which serves as a trigger for executing method 100. The second set of sub-steps includes steps denoted as S102a' and S102b', and can be viewed as describing a situation where the scene space is analyzed to identify "empty spots" or "white spots" in existing scene samples. These optional steps will be further elaborated below.
[0087] Turning first to the first set of sub-steps, obtaining the indication of the empty volume (S102) may include: obtaining a request specifying a query scenario (S102a). The query scenario may be associated with a query embedding representing the query scenario in the multidimensional space. Obtaining the indication of the empty volume (S102) may then further include: determining whether the query scenario is covered by existing scenario samples in the scenario database (S102b). In response to the existing scenario samples failing to cover the query scenario, obtaining the indication of the empty volume (S102) may then further include: generating an indication of the empty volume based on the query embedding (S102c). In other words, if searching for the query scenario in the existing scenario samples fails, an indication of the empty volume can be generated.
[0088] Obtaining the request (S102a) may include, for example, receiving a request from a developer. In another example, the requested query scenario may be determined or identified as part of method 100 based on existing scenario samples in a scenario database. For example, the query scenario may be determined by a computing device (e.g., a server) performing method 100. The query scenario may, for example, be determined as part of a search for data for development purposes. For example, the query scenario may be a scenario required for training, testing, or validation of the ADS (or any of its functionality). It should be further noted that the query embedding may be received as part of the received request. Alternatively, method 100 may include: determining the query embedding based on the received query scenario.
[0089] Query embeddings can be generated by processing query scenarios through a query embedding network. The query embedding network is trained to process data from the input query scenario and output the corresponding query embedding in a multidimensional space. The query scenario can be represented, for example, by a textual description of the query scenario or a computer-simulated scenario. As explained earlier, the query embedding network and the scenario embedding network can be trained in a correlated manner such that when the query embedding and scenario embedding are contextually, spatially, and / or temporally related, the query embedding generated by the query embedding network and the scenario embedding generated by the scenario embedding network point to the same point in the multidimensional space. In other words, they can be trained to be related to the same multidimensional space.
[0090] Determining whether the query scene is covered by existing scene samples (S102b) can be accomplished by comparing the position of the query embedding in the multidimensional space with the positions of the scene embeddings of the existing scene samples and / or their corresponding transformation volumes. For example, vector-based measures such as Euclidean distance and cosine similarity can be used to determine the similarity or distance between the query embedding and the existing scene embeddings.
[0091] An empty volume can be identified as the volume covering the query scenario. An indication of the existence of this empty volume can then be generated. Information about query embeddings that are not covered by existing scenario samples in the scenario database can constitute an indication of the existence of an empty volume in the scenario database. The empty volume can then be represented or defined in different ways. For example, the empty volume can be represented by only the query scenario or the query embedding. Optionally or in combination, further analysis can be performed to expand the volume surrounding the query scenario to represent the empty volume around it.
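As a non-limiting illustration of sub-steps S102a to S102c, the sketch below compares a query embedding with the existing scene embeddings using cosine similarity and, if no sufficiently similar scene embedding exists, generates an indication of an empty volume centered on the query embedding. The similarity threshold, the spherical representation of the empty volume, and all names are assumptions made for this sketch.

```python
import math
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class EmptyVolumeIndication:
    center: tuple      # the uncovered query embedding
    radius: float      # 0.0 represents the volume by the query embedding alone


def cosine_similarity(a, b) -> float:
    """Assumes non-zero vectors of equal dimension."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def indicate_if_uncovered(query_embedding, scene_embeddings,
                          min_similarity: float = 0.9,
                          radius: float = 0.0) -> Optional[EmptyVolumeIndication]:
    """Return None if the query is covered, else an empty-volume indication."""
    best = max((cosine_similarity(query_embedding, e) for e in scene_embeddings),
               default=-1.0)
    if best >= min_similarity:
        return None  # the search in the existing scene samples succeeded
    return EmptyVolumeIndication(tuple(query_embedding), radius)


# Hypothetical database with a single scene embedding.
database = [(1.0, 0.0)]
hit = indicate_if_uncovered((1.0, 0.1), database)
miss = indicate_if_uncovered((0.0, 1.0), database, radius=0.2)
```

A non-zero `radius` corresponds to expanding the volume surrounding the query scenario, as mentioned above; how far to expand it is left open here.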
[0092] In some implementations, a query scene can be represented by a query volume in the multidimensional space. The query volume can therefore be a volume in the multidimensional space, analogous to a transformation volume or an empty volume as previously described. The query volume can span a set of query embeddings. As an example of the use of a query volume, a scene such as "a scene with lighting" is broad enough to cover several cases and can therefore advantageously be queried as a volume in the multidimensional space. The data collection request can then include the query volume. If the scene embedding of a recorded scene lies within the query volume, the recorded scene can be matched with the data collection request.
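As a non-limiting illustration of matching recorded scenes against a query volume, the sketch below represents the query volume as an axis-aligned box and checks whether the scene embedding of each recorded scene lies inside it. The box shape and all names are assumptions made for this sketch; any other volume representation with a membership test would serve equally.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QueryVolume:
    lower: tuple[float, ...]   # per-dimension lower bound
    upper: tuple[float, ...]   # per-dimension upper bound

    def contains(self, embedding) -> bool:
        """True if the embedding lies inside the box (bounds included)."""
        if len(embedding) != len(self.lower):
            raise ValueError("embedding dimension does not match the volume")
        return all(lo <= x <= hi
                   for lo, x, hi in zip(self.lower, embedding, self.upper))


def match_recorded_scenes(volume: QueryVolume, recorded: dict) -> list:
    """Return identifiers of recorded scenes whose embedding is in the volume."""
    return [scene_id for scene_id, embedding in recorded.items()
            if volume.contains(embedding)]


# Hypothetical scene embeddings computed for two recorded scenes.
volume = QueryVolume(lower=(0.0, 0.0), upper=(1.0, 1.0))
recorded = {"rec-a": (0.5, 0.5), "rec-b": (1.5, 0.2)}
matches = match_recorded_scenes(volume, recorded)
```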
[0093] The process now moves to the second set of sub-steps. Obtaining the indication for the empty volume S102 may include: identifying the empty volume S102' as a volume in multidimensional space not covered by the aggregated transformation volume formed by the transformation volumes of existing scene samples. In other words, the empty volume can be determined as the space between the transformation volumes of existing scene samples. The aggregated transformation volume can be understood as the combined volume covered by the transformation volumes of existing scene samples. The aggregated transformation volume can also be considered as a reachable volume in multidimensional space. A reachable volume refers to the volume of scene embedding that can be reached through the transformation of sensor data of the corresponding existing scene samples. The empty volume can therefore be considered as a "gap" in the reachable volume. Obtaining the indication for the empty volume S102 may further include: generating an indication for the empty volume S102b' in response to the identification of the empty volume.
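As a non-limiting illustration of the second set of sub-steps, the sketch below models each transformation volume as a sphere (center, radius), treats their union as the reachable volume, and reports which candidate points of the multidimensional space fall into gaps of that union. The spherical volumes, the use of discrete candidate points, and all names are assumptions made for this sketch.

```python
import math


def is_reachable(point, volumes) -> bool:
    """True if the point lies in at least one transformation volume."""
    return any(math.dist(point, center) <= radius for center, radius in volumes)


def find_gaps(candidates, volumes) -> list:
    """Candidate points not covered by the aggregated transformation volume."""
    return [point for point in candidates if not is_reachable(point, volumes)]


# Hypothetical transformation volumes of two existing scene samples.
volumes = [((0.0, 0.0), 1.0), ((3.0, 0.0), 1.0)]
candidates = [(0.5, 0.0), (1.5, 0.0), (3.0, 0.5)]
gaps = find_gaps(candidates, volumes)
```

The candidate points could, for example, be taken from a grid or from sampled query embeddings; the resulting gap points can then be grouped into one or more empty volumes for which an indication is generated.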
[0094] An indication of an empty volume can act as a trigger for executing method 100. More specifically, it can serve as an indication that there is a region of the scene space that is not covered by the existing scene samples in the scene database. This can in turn prompt the generation of a data collection task.
[0095] Method 100 further includes: sending a data collection request (S106) to one or more vehicles in a vehicle fleet for an empty volume. The empty volume can be considered as defining the boundary of the data request. In other words, the data collection request can be used for any scenario within the boundary of the empty volume.
[0096] The word "for" in the context of sending a data collection request for an empty volume can be interpreted as the data collection request being "defined by [...]", "generated based on [...]", or otherwise "instructing [...]" the empty volume. The data collection request may include, for example, the empty volume itself, or any other information describing data not covered by existing scene samples. An empty volume may be represented, for example, by the boundary surrounding it in a multidimensional space.
[0097] In some implementations, a data collection request may include the query scenario and / or associated query embeddings as described above. In some implementations, a data collection request may include a set of query embeddings located within an empty volume in a multidimensional space. As described above, a set of query embeddings may be represented by a query volume.
[0098] Method 100 further includes: receiving S108, from a vehicle in the vehicle fleet, data indicating a recorded scene that matches the data collection request. In other words, in response to the sent data collection request, data indicating a recorded scene can be received S108.
[0099] Receiving data indicating a recorded scene from a vehicle can encompass receiving data directly from the vehicle that has already collected data, or receiving data from any intermediate storage or device. For example, the vehicle may have already sent data indicating a recorded scene to a data aggregation device or temporary storage. After determining that the recorded scene matches a data collection request, the data indicating the recorded scene can be forwarded to the computing device executing method 100.
[0100] In this context, "vehicle" refers to a vehicle that has recorded or collected data indicating a recorded scene, such as data recorded or collected by the vehicle's onboard sensors.
[0101] The data indicating the recorded scene can be, for example, raw sensor data depicting the vehicle's surrounding environment at the moment the scene was experienced. Optionally or in combination, the data indicating the recorded scene may include some optional representation of the raw sensor data, such as compressed sensor data, a scene representation based on learning-based rendering, or a scene embedding generated based on sensor data.
[0102] The word "match" in the context of a recorded scenario that matches a data collection request can generally be understood as the recorded scenario fulfilling the instructions of the data collection request to a certain extent. For example, the recorded scenario might belong to an empty volume (i.e., be located within an empty volume in multidimensional space) or have a transformation volume that at least partially covers the empty volume. This is elaborated upon below and further illustrated in conjunction with Figure 5.
[0103] If a recorded scene has an associated scene embedding located within an empty volume, then the recorded scene can be matched with the data collection request. In other words, it can be determined that the scene embedding associated with the recorded scene is located within the boundary of an empty volume in multidimensional space.
[0104] A recorded scene can be matched with a data collection request if it has an associated transformation volume in multidimensional space that at least partially covers the empty volume. The associated transformation volume can indicate a set of possible transformed scenes that can be generated from the recorded scene. In other words, a recorded scene that is found to be transformable into a scene falling within the empty volume can be identified as a match for the data collection request.
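The two matching criteria above (an embedding inside the empty volume, or a transformation volume that at least partially covers it) can be sketched as follows. The sketch assumes that the empty volume is represented by an axis-aligned bounding box and that the transformation volume is a ball around the scene embedding. Both shapes and all names are illustrative assumptions.

```python
import math

def embedding_in_box(embedding, lower, upper):
    """True if the scene embedding lies within the boundary of the empty volume."""
    return all(lo <= x <= hi for x, lo, hi in zip(embedding, lower, upper))

def ball_overlaps_box(center, radius, lower, upper):
    """True if an (assumed spherical) transformation volume intersects the box."""
    # The closest point of the box to the ball center is found by clamping.
    closest = [min(max(x, lo), hi) for x, lo, hi in zip(center, lower, upper)]
    return math.dist(center, closest) <= radius

def matches_empty_volume(embedding, radius, lower, upper):
    """A recorded scene matches if either criterion is fulfilled."""
    return embedding_in_box(embedding, lower, upper) or ball_overlaps_box(
        embedding, radius, lower, upper
    )
```

For example, with an empty volume spanning (0, 0) to (1, 1), a recorded scene at (1.5, 0.5) matches when its transformation volume has radius 0.6 but not when the radius is 0.4.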
[0105] In cases where a data collection request includes one or more query embeddings, a recorded scene can be matched with the data collection request if the scene embedding associated with the recorded scene is within a defined distance of one or more query embeddings. In other words, matching can be based on embedding matching, for example, by calculating Euclidean distance or cosine similarity.
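As a hedged illustration of embedding matching, the sketch below checks a scene embedding against a set of query embeddings using either a maximum Euclidean distance or a minimum cosine similarity. The function names and the threshold parameters are assumptions introduced for this example only.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two embeddings (0.0 if either is the zero vector)."""
    norm = math.hypot(*a) * math.hypot(*b)
    return sum(x * y for x, y in zip(a, b)) / norm if norm else 0.0

def matches_query(scene_embedding, query_embeddings,
                  max_distance=None, min_similarity=None):
    """Match if any query embedding is close enough under a chosen measure."""
    for query in query_embeddings:
        if max_distance is not None and math.dist(scene_embedding, query) <= max_distance:
            return True
        if (min_similarity is not None
                and cosine_similarity(scene_embedding, query) >= min_similarity):
            return True
    return False
```

Note that Euclidean distance is a distance (smaller means closer) whereas cosine similarity is a similarity (larger means closer), so the two thresholds are applied in opposite directions.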
[0106] In some implementations, method 100 further includes: determining S104 a desired scene sample based on the empty volume. The desired scene sample may be a scene sample located within the empty volume in the multidimensional space. Alternatively or in combination, the desired scene sample may have a corresponding transformation volume that at least partially overlaps with the empty volume. The data collection request may then include a scene embedding associated with the desired scene sample. The desired scene may therefore be considered a query scene, and its scene embedding may be considered a query embedding. The difference is that the desired scene is determined based on an empty volume, while a query scene may be determined independently of the current coverage of the scene space. The recorded scene may have an associated scene embedding that is within a defined distance from the scene embedding associated with the desired scene sample. In other words, if the scene embedding associated with a recorded scene is within a defined distance from the scene embedding associated with the desired scene sample, then the recorded scene may match the data collection request.
[0107] Data collection requests may further include prioritization. Prioritization can be viewed as an indication of the importance of the data collection request (or any desired scenario associated with it), or an indication of the relevance of a recorded scenario. Importance can, for example, relate to its importance / relevance to the development of the ADS. Priority can be set, for example, based on the size of the empty volume and / or the distance to the closest existing scenario sample in the multidimensional space. Therefore, priority can depend on the distance of the requested data to existing scenarios.
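One possible way to turn the two factors mentioned above into a single priority value is sketched below. The weighted sum, the box representation of the empty volume, the use of the box center as reference point, and the default weights are all assumptions made for illustration; the source does not prescribe a formula.

```python
import math

def box_volume(lower, upper):
    """Size of an empty volume represented by an axis-aligned box."""
    return math.prod(hi - lo for lo, hi in zip(lower, upper))

def nearest_sample_distance(point, existing_embeddings):
    """Distance from a point to the closest existing scene sample."""
    return min(math.dist(point, e) for e in existing_embeddings)

def request_priority(lower, upper, existing_embeddings,
                     size_weight=0.5, distance_weight=0.5):
    """Larger and more remote empty volumes yield a higher priority."""
    center = [(lo + hi) / 2 for lo, hi in zip(lower, upper)]
    return (size_weight * box_volume(lower, upper)
            + distance_weight * nearest_sample_distance(center, existing_embeddings))

# A large, remote gap compared with a small gap near an existing sample.
large_gap = request_priority((0.0, 0.0), (2.0, 2.0), [(1.0, 4.0), (5.0, 1.0)])
small_gap = request_priority((0.0, 0.0), (1.0, 1.0), [(0.5, 1.5)])
```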
[0108] Data indicating a recorded scene can have a first resolution when the priority is above a first threshold, and a second resolution when the priority is below the first threshold. The first resolution is greater than the second resolution. In other words, the type of data received in response to a data collection request can vary depending on the priority. More specifically, depending on the priority, data indicating a recorded scene can be received at different resolutions. Resolution can generally be understood as the level of detail of the data. Essentially, higher resolution usually provides more detail, but may require greater storage, processing power, or bandwidth. Resolution can, for example, refer to the pixel density of image data. Optionally or in combination, resolution can refer to the sampling frequency of a sensor data sequence. Different resolutions can therefore refer to data with different levels of compression. In other words, data indicating a recorded scene can be raw sensor data or compressed sensor data. Data indicating a recorded scene can have different resolutions because it is represented by different forms of data representation. The data can, for example, be represented as a learning-based rendered scene representation (such as NeRF) capable of reconstructing or re-rendering the raw sensor data. As another example, the data can simply be represented by a scene embedding associated with the recorded scene. A decoding network can then be used to reconstruct the raw sensor data encoded into the scene embedding.
[0109] Data indicating a recorded scene can have a third resolution if its priority is below a first threshold but above a second threshold. The third resolution is between the first and second resolutions. In the example, the first resolution may correspond to the raw sensor data. The second resolution may correspond to the scene embedding associated with the raw data. The third resolution may correspond to a scene representation rendered based on learning. Furthermore, priority above the first threshold can be understood as high priority. Priority below the first threshold but above the second threshold can be understood as medium priority. Priority below the second threshold can be understood as low priority.
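The three-tier scheme of the example above can be sketched as a simple threshold lookup. The enum, the function name, and the handling of priorities exactly equal to a threshold are assumptions of this sketch.

```python
from enum import Enum

class Payload(Enum):
    RAW_SENSOR_DATA = "first resolution: raw sensor data"
    LEARNED_SCENE_REPRESENTATION = "third resolution: learning-based rendering representation"
    SCENE_EMBEDDING = "second resolution: scene embedding"

def select_payload(priority, first_threshold, second_threshold):
    """Map a priority to the data representation sent for a recorded scene."""
    if second_threshold >= first_threshold:
        raise ValueError("the second threshold must be below the first threshold")
    if priority > first_threshold:       # high priority
        return Payload.RAW_SENSOR_DATA
    if priority > second_threshold:      # medium priority
        return Payload.LEARNED_SCENE_REPRESENTATION
    return Payload.SCENE_EMBEDDING       # low priority (equality treated as low here)
```

For instance, with thresholds 0.7 and 0.3, a priority of 0.9 selects raw sensor data, 0.5 selects the learned scene representation, and 0.1 selects the scene embedding only.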
[0110] Priorities can be defined in data collection requests. In other words, data collection requests can be associated with priorities. Then, if a recorded scenario that matches the data collection request is found, it can be sent to the computing device that executes method 100 according to the priorities described above.
[0111] As another example, the priority of recorded scenes can be evaluated within a vehicle. In other words, the priority of a recorded scene can be determined within the vehicle where the scene is recorded. This can be determined, for example, by comparing the recorded scene (or its associated scene embedding) with a data collection request (or more specifically, with an empty volume or the query embedding of the data collection request). For instance, if a recorded scene is found to be relatively far from the query scene or toward the boundary of the empty volume, it can be sent at a lower resolution. Conversely, if a recorded scene is found to be relatively close to the query scene or toward the center of the empty volume, it can be sent at a higher resolution. Thus, recorded scenes that are far from the existing scene samples in the scene database can be sent at a higher resolution, and can therefore effectively supplement the existing scene samples in the scene database.
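A minimal sketch of such an in-vehicle evaluation is given below, assuming that the empty volume is summarized by a center point and a radius and that priority falls off linearly with distance from the center. The linear falloff and the names are assumptions for illustration.

```python
import math

def centrality_priority(scene_embedding, center, radius):
    """1.0 at the center of the empty volume, falling linearly to 0.0 at its boundary."""
    distance = math.dist(scene_embedding, center)
    return max(0.0, 1.0 - distance / radius)
```

A recorded scene at the center of the empty volume would then receive priority 1.0 and could be sent at a higher resolution, while a scene at or beyond the boundary would receive priority 0.0 and could be sent at a lower resolution.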
[0112] Prioritizing data collection requests allows for more efficient management and transmission of the vast amounts of sensor data generated by vehicle fleets. This can be achieved in part by balancing data richness with bandwidth constraints. This can be further viewed as a dynamic, scenario-value-based, hierarchical data collection system.
[0113] Method 100 further includes: storing S110 the received data indicating the recorded scene among the plurality of existing scene samples in the scene database. The data indicating the recorded scene can be stored S110 for subsequent use in the development of the autonomous driving system. Method 100 may further include: training, testing, or validating the ADS (or any feature thereof) on the data indicating the recorded scene.
[0114] Expanding the scene database also expands the reachable volume within the embedding space, since it provides greater coverage of the transformed scenes that can be generated from existing scene samples. In particular, after collection, recorded scenes can be used to generate synthetic sensor data within the transformation volume of the recorded scenes. A recorded scene (or its sensor data) can, for example, be transformed into a scene whose associated scene embedding is within a threshold distance, in the multidimensional space, of the query embedding or desired scene sample (i.e., synthetic sensor data of that scene is generated). The threshold distance provides an error margin, ensuring that the synthetic sensor data is "sufficiently" similar to the query scene, although it does not need to be exact. In other words, synthetic sensor data corresponding to the query (or desired) scene can be generated based on similarity to the query embedding in the multidimensional space. The threshold distance can be compared with the Euclidean distance between the query embedding and the scene embedding of the synthetic sensor data. In another example, the comparison can be based on the cosine similarity between the query embedding and the scene embedding of the synthetic sensor data. However, it should be noted that other similarity measures can also be used.
[0115] Furthermore, the synthetic sensor data can be generated such that its associated scene embedding lies within the transformation volume of the recorded scene. This helps ensure that the synthetic sensor data meets any realism requirements.
[0116] Synthetic sensor data can include one or more sensor data frames. For example, synthetic sensor data can include multiple subsequent sensor data frames (such as image frames, LiDAR point clouds, radar data, etc.). Therefore, in some cases, synthetic sensor data can be a video stream of two or more time instances. In another example, synthetic sensor data can be a sensor data frame at a single time point. Furthermore, synthetic sensor data can include sensor data from one or more sensor modalities (i.e., one or more sensor data types).
[0117] For completeness, the process of transforming scene samples into desired scene samples (such as query scenes) will be described below. In summary, the transformation process can be based on generating synthetic sensor data whose scene embedding matches the scene embedding of the desired scene sample, hereinafter referred to as the query embedding.
[0118] Transforming sensor data from scene samples into synthetic sensor data can be performed by: (i) applying a transformation to the sensor data to generate updated sensor data; (ii) determining the location of the embedding representing the updated sensor data in a multidimensional space; repeating steps (i) and (ii) until the location of the updated sensor data is within a threshold distance of the query embedding, and providing the updated sensor data as synthetic sensor data. In other words, synthetic sensor data can be generated through an iterative process that transforms the (raw) sensor data until it (at least within a threshold distance) corresponds to the query embedding. Transformations can include: changes in viewpoint, addition / removal of objects in the depicted surrounding environment, changes in object attributes (e.g., modification of color, texture, or material), changes in weather, changes in lighting conditions, changes in road layout, changes in object trajectories, changes in sensor characteristics, changes in available sensors, traversal of previously unseen areas, etc. The above iterative transformation process can be combined with different machine learning techniques.
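The iterative process of steps (i) and (ii) can be sketched as a simple loop. The sketch below is illustrative only: the greedy choice among candidate transformations, the iteration budget, and the toy "sensor data" (a pair of brightness and rain intensity with an identity encoder) are assumptions, and a real system would use a learned encoder and the transformations listed above.

```python
import math

def transform_until_match(sensor_data, query_embedding, embed, transformations,
                          threshold, max_iterations=100):
    """Repeat steps (i) and (ii) until the embedding is within threshold of the query."""
    def distance(data):
        return math.dist(embed(data), query_embedding)

    current = sensor_data
    for _ in range(max_iterations):
        # Step (i): apply a transformation. Picking the candidate that lands
        # closest to the query embedding is an assumption of this sketch.
        current = min((t(current) for t in transformations), key=distance)
        # Step (ii): locate the updated data in the embedding space and compare.
        if distance(current) <= threshold:
            return current
    return None  # no match found within the iteration budget

# Toy stand-ins (hypothetical): "sensor data" is (brightness, rain intensity)
# and the encoder is the identity function.
def embed(data):
    return data

transformations = [
    lambda d: (d[0] - 0.1, d[1]),  # darker lighting
    lambda d: (d[0] + 0.1, d[1]),  # brighter lighting
    lambda d: (d[0], d[1] + 0.1),  # more rain
    lambda d: (d[0], d[1] - 0.1),  # less rain
]

synthetic = transform_until_match((1.0, 0.0), (0.5, 0.3), embed,
                                  transformations, threshold=0.05)
```

Returning `None` when the budget is exhausted reflects that a query embedding may lie outside the transformation volume of a given scene sample.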
[0119] In some examples, generative machine learning techniques can be used. In other words, the transformed scene can be generated by a generative machine learning model. For example, converting sensor data from scene samples into synthetic sensor data can include feeding the sensor data into a generative adversarial network (GAN) or diffusion model trained to output synthetic sensor data. The GAN and / or diffusion model can be further trained to take a query embedding as input. In another example, the GAN and / or diffusion model can be trained to take a query scene as input. The output synthetic sensor data can be viewed as the transformed sensor data.
[0120] Optionally or in combination, rendering-based techniques can be used. In other words, the transformed scene can be generated by a rendering-based machine learning model. For example, each scene sample in the database can be further associated with a learned, rendered scene representation configured for subsequent rendering of synthetic sensor data associated with the scene sample. The scene representation can thus be learned from the sensor data of the scene sample. Converting the sensor data of the scene sample into synthetic sensor data can include rendering the synthetic sensor data using the learned, rendered scene representation. The process of rendering the synthetic sensor data can therefore be rendering a transformed scene that differs from the scene from which the scene representation was learned. This can be accomplished, for example, by modifying the parameters of the scene representation.
[0121] A scene representation can be understood as a set of learnable parameters that together describe the different physical properties of a scene, such as geometry, objects, color, lighting, etc. In other words, a scene representation can be physics-based, enabling it to capture and simulate the underlying physical processes that occur in the real world, and how sensors reflect these processes in sensor data (consider, for example, projection, refraction, lenses, etc.). This can be based, for example, on material properties and simulate how light propagates in the environment. Thus, a scene representation can learn to model geometric aspects such as the position, orientation, and scale of a 3D model. It can further model lighting aspects such as color, shadows, brightness, and reflection. It can further model transparency and translucency, describing how light passes through different materials such as glass or fog. A learning-based rendering scene representation could be a neural radiance field (NeRF). In another example, a learning-based rendering scene representation could be a model based on Gaussian splatting.
[0122] It should be further noted that combinations of the above techniques are also possible. As an example, a learning-based rendering scene representation such as NeRF can be trained. Then, a diffusion model can be used to perform direct editing on the NeRF representation.
[0123] Executable instructions for performing these functions may optionally be included in a non-transitory computer-readable storage medium or other computer program product configured for execution by one or more processors.
[0124] Generally, computer-accessible media can include any tangible or non-transitory storage medium or storage media such as electrical, magnetic, or optical media, for example, a hard disk or CD / DVD-ROM coupled to a computer system via a bus. The terms “tangible” and “non-transitory” as used herein are intended to describe computer-readable storage media (or “memory”) excluding those that transmit electromagnetic signals, but are not intended to otherwise limit the types of physical computer-readable storage devices encompassed by the phrases “computer-readable medium” or “memory.” For example, the terms “non-transitory computer-readable medium” or “tangible memory” are intended to encompass types of storage devices that do not necessarily store information permanently, including, for example, random access memory (RAM). Program instructions and data stored in a non-transitory form on a tangible computer-accessible storage medium can be further transmitted via a transmission medium or a signal such as an electrical signal, electromagnetic signal, or digital signal, which can be transmitted via a communication medium such as a network and / or a wireless link.
[0125] Figure 2 is a schematic illustration of a computing device 200 according to some embodiments of the disclosed technology. The computing device 200 can be configured to perform method 100 as described in conjunction with Figure 1. Therefore, computing device 200 is configured to expand the scene database as described above.
[0126] As described herein, computing device 200 refers to a computer system, or any device or general-purpose computing system configured to perform various functions. Computing device 200 may, for example, refer to a server. Even though computing device 200 is illustrated as a single device herein, it can also be a distributed computing system comprised of multiple different devices.
[0127] The computing device 200 includes a control circuit 202. The control circuit 202 may physically comprise a single circuit device. Alternatively, the control circuit 202 may be distributed across several circuit devices.
[0128] As shown in the example of Figure 2, computing device 200 may further include transceiver 206 and memory 208. Control circuitry 202 is communicatively connected to transceiver 206 and memory 208. Control circuitry 202 may include a data bus, and control circuitry 202 may communicate with transceiver 206 and / or memory 208 via the data bus.
[0129] Control circuitry 202 can be configured to perform overall control of the functions and operations of computing device 200. Control circuitry 202 may include processor 204, such as a central processing unit (CPU), microcontroller, or microprocessor. Processor 204 can be configured to execute program code stored in memory 208 to perform the functions and operations of computing device 200. Control circuitry 202 is configured to perform the steps of method 100 described above in conjunction with Figure 1. The steps may be implemented in one or more functions stored in memory 208.
[0130] Transceiver 206 is configured to enable computing device 200 to communicate with other entities such as other devices. Transceiver 206 can both send data from and receive data at computing device 200. Computing device 200 may, for example, be part of a vehicle. Transceiver 206 can then enable computing device 200 to communicate with other systems of the vehicle or external entities such as other vehicles or remote servers.
[0131] Memory 208 may be a non-transitory computer-readable storage medium. Memory 208 may be one or more of a buffer, flash memory, hard disk drive, removable media, volatile memory, non-volatile memory, random access memory (RAM), or other suitable devices. In a typical arrangement, memory 208 may include non-volatile memory for long-term data storage and volatile memory used as system memory for computing device 200. Memory 208 may exchange data with control circuitry 202 via a data bus. Accompanying control lines and address buses may also exist between memory 208 and control circuitry 202. Memory 208 may further store a scene database, as described above in conjunction with Figure 1. Alternatively, the scene database can be provided externally to the computing device 200. The computing device 200 can then be communicatively connected to the scene database.
[0132] The functions and operations of computing device 200 can be implemented in the form of executable logic routines (e.g., lines of code, software programs, etc.) stored on a non-transitory computer-readable recording medium (e.g., memory 208) in computing device 200 and executed by control circuitry 202 (e.g., using processor 204). In other words, when control circuitry 202 is described as being configured to perform a specific function, processor 204 of control circuitry 202 can be configured to execute a portion of program code stored in memory 208, wherein the stored portion of program code corresponds to the specific function. Furthermore, the functions and operations of control circuitry 202 can be independent software applications or part of a software application that performs additional tasks related to control circuitry 202. The described functions and operations can be considered as a method that the corresponding device is configured to perform, such as method 100 discussed above in conjunction with Figure 1. Furthermore, while the described functions and operations can be implemented in software, they can also be implemented via dedicated hardware or firmware, or a combination of one or more of hardware, firmware, and software. The functions and operations of computing device 200 are described below.
[0133] Control circuit 202 is configured to obtain an indication of an empty volume in multidimensional space. An empty volume is a volume not covered by multiple existing scene samples in a scene database. This can be performed, for example, by executing the acquisition function 210.
[0134] The control circuit 202 is further configured to send a data collection request to one or more vehicles in the vehicle fleet for an empty volume. This can be performed, for example, by executing the sending function 212.
[0135] The control circuit 202 is further configured to receive, from a vehicle in the vehicle fleet, data indicating a recorded scene that matches the data collection request. This can be performed, for example, by executing the receiving function 214.
[0136] The control circuit 202 can be further configured to store the received data indicating a recorded scene among the plurality of existing scene samples in the scene database. This can be performed, for example, by executing the storage function 216.
[0137] Control circuitry 202 can be further configured to determine desired scene samples based on empty volume. This can be performed, for example, by executing determination function 218. The data collection request can then include scene embeddings associated with the desired scene samples. Recorded scenes can have associated scene embeddings within a defined distance from the scene embeddings associated with the desired scene samples.
[0138] It should be noted that the principles, features, aspects, and advantages of method 100 described above in conjunction with Figure 1 are also applicable to computing device 200 as described herein. To avoid unnecessary repetition, reference is made to the foregoing. Therefore, control circuitry 202 can be configured to perform any of the steps described as part of method 100.
[0139] According to some aspects of the disclosed technology, a system can be provided for expanding a scene database. The system can be configured to perform, for example, the techniques described in conjunction with Figure 1. The system may include the aforementioned computing device 200. The computing device 200 can be used as a server within the system. The system may further include one or more vehicles. In other words, the system may include a fleet of vehicles. One or more vehicles of the system may be configured to collect or record sensor data while driving on a road. The one or more vehicles may therefore include one or more onboard sensors, such as one or more cameras, LiDAR sensors, radar sensors, etc. The vehicles may therefore be used for data collection tasks, and may correspond to the one or more vehicles to which data collection requests are sent as described above. The one or more vehicles may be further configured to perform some of the techniques described above in conjunction with Figure 1. The one or more vehicles may be configured, for example, to receive a data collection request. The one or more vehicles may be further configured to generate a scene embedding associated with a recorded scene (i.e., a scene experienced by the vehicle). The one or more vehicles may be further configured to determine whether the recorded scene matches the data collection request. If it matches, the one or more vehicles may be further configured to transmit data indicating the recorded scene to computing device 200. A more detailed example of such a vehicle is provided below in conjunction with Figure 3.
[0140] Figure 3 is a schematic illustration of a vehicle 300 according to some implementations. The vehicle 300 may be equipped with an automated driving system (ADS) 310. As used herein, "vehicle" means any form of motorized transportation. For example, vehicle 300 can be any road vehicle such as a car (as illustrated herein), a motorcycle, a (freight) truck, a bus, a smart bicycle, etc. In this context, vehicle 300 should be understood as a vehicle in which an ADS trained using a scenario database as described herein can be deployed. Vehicle 300 may further be a vehicle that can be used to collect sensor data of different driving scenarios experienced by the vehicle. Therefore, a data collection request can be sent from computing device 200 to vehicle 300.
[0141] In this context, an Automated Driving System (ADS) refers to a complex combination of hardware and software components designed to control and operate a vehicle without direct human intervention. ADS technology aims to automate various aspects of driving, such as steering, acceleration, deceleration, and monitoring of the surrounding environment. The primary goal of ADS is to enhance safety, efficiency, and convenience in transportation. The range of ADS can vary from basic driver assistance systems to highly advanced automated driving systems, depending on their level of automation as classified by standards such as SAE J3016. These systems utilize various sensors, cameras, radar, lidar, and powerful computer algorithms to perceive the environment and make driving decisions. The specific capabilities and characteristics / functions of ADS can vary greatly, from systems providing limited assistance to systems capable of independently handling complex driving tasks under specific conditions.
[0142] Advanced Driver Assistance Systems (ADAS) are technologies that assist drivers during driving, although they do not necessarily provide complete autonomy. ADAS features are often used as building blocks for ADS. Examples include adaptive cruise control, lane keeping assist, automatic emergency braking, and parking assist. They enhance safety and convenience but typically require a certain level of human supervision and intervention. Autonomous Driving (AD), on the other hand, is a technology designed to control and navigate a vehicle without human supervision. Accordingly, the difference between ADAS and AD can be said to lie in the level of autonomy and control. ADAS systems are designed to assist and support the driver, while ADS aims for complete control of the vehicle without continuous human supervision. Accordingly, AD aims for a higher level of autonomy (such as Level 4 and Level 5 according to SAE international standards), where the vehicle can operate independently in most or all driving scenarios without human intervention. As mentioned earlier, the term "ADS" used in this article is used as a collective term encompassing both ADAS and AD. In this context, ADS functions or ADS features can be understood as specific functions or features of the entire ADS stack, such as highway navigation features, traffic congestion navigation features, route planning features, etc.
[0143] Vehicle 300 includes several components typically found in autonomous or semi-autonomous vehicles. It should be understood that vehicle 300 may have... Figure 3 Any combination of the various elements shown herein. Furthermore, vehicle 300 may include more than Figure 3 The elements shown herein are further elements. Although the various elements are shown herein as being located inside the vehicle 300, one or more of the elements may be located outside the vehicle 300. Furthermore, even though the various elements are depicted herein in certain arrangements, as will be readily understood by those skilled in the art, the various elements may also be implemented in different arrangements. It should be further noted that the various elements may be communicatively connected to each other in any suitable manner. Figure 3 Vehicle 300 should be considered only as an illustrative example, as the components of vehicle 300 can be implemented in several different ways.
[0144] Vehicle 300 includes a control system 302. The control system 302 is configured to perform overall control of the functions and operations of vehicle 300. The control system 302 includes control circuitry 304 and memory 306. Control circuitry 302 may physically comprise a single circuit device. Alternatively, control circuitry 302 may be distributed across several circuit devices. As an example, control system 302 may share its control circuitry 304 with other parts of the vehicle. Control circuitry 302 may include one or more processors such as a central processing unit (CPU), microcontroller, or microprocessor. One or more processors may be configured to execute program code stored in memory 306 to perform the functions and operations of vehicle 300. The processor may be or may include any number of hardware components for performing data or signal processing or for executing computer code stored in memory 306. In some embodiments, control circuitry 304 or some of its functions may be implemented on one or more so-called system-on-a-chip (SoC). As an example, ADS 310 may be implemented on an SoC. Memory 306 optionally includes high-speed random access memory such as DRAM, SRAM, DDR RAM, or other random access solid-state memory devices; and optionally includes non-volatile memory such as one or more disk storage devices, optical disk storage devices, flash memory devices, or other non-volatile solid-state memory devices. Memory 306 may include database components, object code components, script components, or any other type of information structure used to support the various activities of this specification.
[0145] In the illustrated example, memory 306 further stores map data 308. Map data 308 can be used, for example, by the ADS 310 of vehicle 300 to perform autonomous functions of vehicle 300. Map data 308 may include high-definition (HD) map data and / or standard-definition (SD) map data. Even though memory 306 is illustrated as a separate element from ADS 310, it is conceivable that memory 306 could be provided as an integrated element of ADS 310. In other words, according to some embodiments, any distributed or local memory device can be used to implement the inventive concept. Similarly, control circuitry 304 can be distributed, for example, such that one or more processors of control circuitry 304 are provided as integrated elements of ADS 310 or any other system of vehicle 300. In other words, according to exemplary embodiments, any distributed or local control circuitry device can be used to implement the technology of this disclosure.
[0146] Vehicle 300 further includes a sensor system 320. Sensor system 320 is configured to acquire sensing data about the vehicle itself or its surroundings. Sensor system 320 may, for example, include a Global Navigation Satellite System (GNSS) module 322 (such as GPS) configured to collect geographic location data of vehicle 300. Sensor system 320 may further include one or more sensors 324. Sensors 324 may be any type of onboard sensor such as a camera, LiDAR and RADAR, ultrasonic sensors, gyroscope, accelerometer, odometer, etc. It should be appreciated that sensor system 320 may also provide the possibility of acquiring sensing data directly or via dedicated sensor control circuitry within vehicle 300.
[0147] Vehicle 300 further includes a communication system 326. Communication system 326 is configured to communicate with external units such as other vehicles (i.e., via vehicle-to-vehicle (V2V) communication protocols), remote servers (e.g., cloud servers), databases, or other external devices (i.e., via vehicle-to-infrastructure (V2I) or vehicle-to-everything (V2X) communication protocols). Communication system 326 can communicate using one or more communication technologies. Communication system 326 may include one or more antennas. Cellular communication technologies can be used for long-range communication, such as to remote servers or cloud computing systems. Additionally, if the cellular communication technology used has low latency, it can also be used for V2V, V2I, or V2X communication. Examples of cellular radio technologies include GSM, GPRS, EDGE, LTE, 5G, 5G NR, and so on, including future cellular solutions. However, in some solutions, short-to-medium range communication technologies such as wireless local area networks (WLANs), for example, solutions based on IEEE 802.11, can be used for communication with other vehicles near vehicle 300 or with local infrastructure components. ETSI is developing cellular standards for vehicle communications, and 5G is considered a suitable solution, for example, due to its high bandwidth, low latency, and efficient processing of communication channels.
[0148] The communication system 326 can further provide the possibility of transmitting outputs (such as sensor data recorded for driving scenarios) to remote locations (e.g., remote servers, operators, or control centers) via one or more antennas. Furthermore, the communication system 326 can be further configured to enable various components of the vehicle 300 to communicate with each other. As an example, the communication system can provide a local network setup such as CAN bus, I2C, Ethernet, fiber optics, etc. Local communication within the vehicle can also be a wireless type with protocols such as WiFi, LoRa, Zigbee, Bluetooth, or similar medium / short-range technologies.
[0149] Vehicle 300 further includes a control system 328. The control system 328 is configured to control the handling of vehicle 300. The control system 328 includes a steering module 330 configured to control the heading of vehicle 300. The control system 328 further includes a throttle module 332 configured to control the actuation of the throttle of vehicle 300. The control system 328 further includes a brake module 334 configured to control the actuation of the brakes of vehicle 300. The various modules of the control system 328 can receive manual input from the driver of vehicle 300 (i.e., from the steering wheel, accelerator pedal, and brake pedal, respectively). However, the control system 328 can also be communicatively connected to the vehicle's ADS 310 to receive instructions on how the various modules should operate. Therefore, ADS 310 can control the handling of vehicle 300.
[0150] As described above, vehicle 300 includes ADS 310. ADS 310 may be part of the vehicle's control system 302. ADS 310 is configured to perform autonomous functions and operations of vehicle 300. ADS 310 may include multiple modules, each responsible for a different function of ADS 310.
[0151] ADS 310 may include a positioning module 312 or a positioning block / system. The positioning module 312 is configured to determine and / or monitor the geographic location and heading of the vehicle 300, and may utilize data from sensor system 320, such as data from GNSS module 322. Alternatively or in combination, the positioning module 312 may utilize data from one or more sensors 324. The positioning system may optionally be implemented as Real-Time Kinematic (RTK) GPS.
[0152] ADS 310 may further include a perception module 314 or a perception block / system. Perception module 314 may refer to any known module and / or function, for example, included in one or more electronic control modules and / or nodes of vehicle 300, adapted and / or configured to interpret sensing data relevant to driving of vehicle 300, to identify, for example, obstacles, lanes, relevant signs, appropriate navigation paths, etc. Perception module 314 may therefore be adapted to rely on and receive input from multiple data sources such as automotive imaging, image processing, computer vision, and / or in-vehicle networks, and to combine with sensing data, for example, from sensor system 320.
[0153] The positioning module 312 and / or the perception module 314 can be communicatively connected to the sensor system 320 to receive sensor data from the sensor system 320. The positioning module 312 and / or the perception module 314 can further send control commands to the sensor system 320.
[0154] The ADS may further include a path planning module 316. The path planning module 316 is configured to determine a planned path for the vehicle 300 based on the vehicle's perception and position, as determined by the perception module 314 and the positioning module 312, respectively. The planned path determined by the path planning module 316 can be transmitted to the control system 328 for execution. As an example, the determined current position of the vehicle on the navigation map can be sent to the path planning module 316.
[0155] The ADS may further include a decision and control module 318. The decision and control module 318 is configured to perform control of the ADS 310 and make decisions. For example, the decision and control module 318 may decide whether the planned path determined by the path planning module 316 should be executed. The decision and control module 318 may be further configured to detect any deviation behavior of the vehicle, such as deviating from the planned path or the expected trajectory of the path planning module 316. This includes evasive maneuvers performed by the ADS 310 and by the vehicle's driver.
[0156] It should be understood that portions of the described solution can be implemented in vehicle 300, in a system located outside the vehicle, or in a combination of inside and outside the vehicle; for example, in a server communicating with the vehicle, i.e., a so-called cloud solution. Different features and principles of the implementation methods can be combined in combinations other than those described herein. Furthermore, the elements (i.e., systems and modules) of vehicle 300 can be implemented in combinations different from those described herein.
[0157] Figure 4 illustrates, by way of example, the mapping between scene samples and the multidimensional space 400. Figure 4 is intended to enhance understanding of the technical aspects disclosed herein and should not be considered a limitation of scope. More specifically, Figure 4 illustrates the effect of transformations of (raw) sensor data 412 in the embedding space 400. Specifically, a transformation can be mapped to the embedding space, meaning that it can be viewed and analyzed within the embedding space.
[0158] In the upper part of Figure 4, a transformation system 410 is shown. The transformation system 410 includes a transformation module 414. The transformation module 414 herein represents a block for performing techniques related to the transformation of (raw) sensor data 412 to generate synthetic sensor data 416a, 416b, 416c. The transformation module 414 can therefore implement the aforementioned techniques, such as generative or rendering-based machine learning techniques.
[0159] The lower part of Figure 4 shows an illustration of the multidimensional space 400. For practical reasons, multidimensional space 400 is illustrated herein as a two-dimensional space spanned by two axes. However, it should be noted that multidimensional space 400 can be of any dimension. More specifically, multidimensional space 400 can be formed by two or more dimensions.
[0160] Furthermore, as illustrated herein, the transformation volume 404 can be a closed set. In other words, the transformation volume 404 can be formed by a single closed volume in a multidimensional space. However, it should be noted that even though the transformation volume 404 is described as a closed set herein, it can still be formed by multiple separate subsets. In other words, the transformation volume 404 associated with a scene sample can be formed by multiple sub-volumes separated in a multidimensional space. In other words, the transformation volume can be a disjoint volume. This may occur, for example, if (at least some) possible transformations occur in discrete steps rather than continuous steps.
[0161] As explained earlier, scene samples (or their sensor data) can be encoded into scene embeddings within the multidimensional space 400. In the middle of Figure 4, the dashed double arrow indicates how sensor data 412 is mapped to the associated scene embedding 402 in the multidimensional space 400.
[0162] When sensor data 412 is transformed into synthetic sensor data, illustrated herein by three differently transformed scene representations 416a to 416c, the scene embedding associated with each set of transformed data is offset to another point in the multidimensional space 400. More specifically, the first synthetic sensor data 416a is mapped to a first offset scene embedding 406a in the multidimensional space. Similarly, the second synthetic sensor data 416b is mapped to a second offset scene embedding 406b. Finally, the third synthetic sensor data 416c is mapped to a third offset scene embedding 406c.
[0163] Applying multiple such transformations (still of reasonable "magnitude" so as to maintain the validity / reliability / accuracy of the results, as explained below) yields the transformation volume 404 in the multidimensional space. The transformation volume 404 can be viewed as the subspace of the multidimensional space spanned by the scene embeddings that are reachable through transformations of the scene sample formed by sensor data 412 and the associated scene embedding 402.
[0164] A transformation volume (also known as a perturbation space) can be determined for each scene sample in the scene database. More specifically, the size and / or shape of the transformation volume for one scene sample can differ from the size and / or shape of the transformation volume for another scene sample. In other words, the transformation volume can be determined individually for each scene sample. The transformation volume can be determined by performing multiple transformations on the corresponding sensor data to determine the extent to which the sensor data can be transformed with sufficient confidence. This can then result in an asymmetric transformation volume as illustrated herein. In the illustrated example, the synthetic sensor data 416a to 416c can be considered as lying on the boundary up to which a transformation from the original sensor data 412 can be validly, reliably, and / or accurately realized. The boundaries of the transformation volume 404 can then be determined based on the positions of the corresponding scene embeddings 406a to 406c in the multidimensional space. As mentioned above, the transformation volume 404 can be determined using an iterative method. The iterative method may include applying a large number of transformations at different magnitudes to observe the corresponding offsets in the embedding space 400 and the volume they result in. The term "magnitude" in the context of a transformation can be understood as the degree or amount of transformation performed, or a general measure of the extent to which the original sensor data is modified.
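The iterative determination described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the method of the disclosure: the embedding space is two-dimensional, a transformation is modelled as a straight-line offset of the embedding, and the names `estimate_boundary`, `toy_transform`, and `toy_confidence`, as well as the threshold and step values, are hypothetical. For each direction, the magnitude is increased until a confidence score falls below a threshold, and the last accepted offset embedding is taken as a boundary point of the transformation volume.

```python
from typing import Callable, List, Sequence, Tuple

Vector = Tuple[float, float]


def estimate_boundary(
    base: Vector,
    directions: Sequence[Vector],
    transform: Callable[[Vector, Vector, float], Vector],
    confidence: Callable[[Vector, float], float],
    threshold: float = 0.5,   # hypothetical confidence threshold
    step: float = 0.25,       # hypothetical magnitude increment
    max_steps: int = 20,
) -> List[Vector]:
    """Return one boundary embedding per direction.

    The magnitude grows in fixed steps; the last offset embedding whose
    confidence is still at or above the threshold is kept.
    """
    boundary: List[Vector] = []
    for direction in directions:
        accepted = base  # magnitude 0 corresponds to the untransformed sample
        for k in range(1, max_steps + 1):
            magnitude = k * step
            if confidence(direction, magnitude) < threshold:
                break
            accepted = transform(base, direction, magnitude)
        boundary.append(accepted)
    return boundary


def toy_transform(base: Vector, direction: Vector, magnitude: float) -> Vector:
    # Assumption: a transformation shifts the embedding linearly along `direction`.
    return (base[0] + magnitude * direction[0], base[1] + magnitude * direction[1])


def toy_confidence(direction: Vector, magnitude: float) -> float:
    # Assumption: confidence decays linearly with magnitude, and more slowly
    # along the positive first axis, which produces an asymmetric volume.
    limit = 2.0 if direction[0] > 0 else 1.0
    return max(0.0, 1.0 - magnitude / limit)


boundary = estimate_boundary(
    (0.0, 0.0), [(1.0, 0.0), (-1.0, 0.0)], toy_transform, toy_confidence
)
```

In this toy setup the boundary reaches further along the positive first axis than along the negative one, which mirrors the asymmetric transformation volume described above. A real system would replace the two toy functions with the transformation module and a validity, reliability, or accuracy measure.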
[0165] In another example, the transformation volume can be rule-based. In other words, the transformation volume can be determined for each scene sample based on predefined rules. As a non-limiting example, the transformation volume can be defined as a circle (in a 2D embedding space) or a sphere (in a 3D embedding space) with a given radius, centered on the scene embedding of the scene sample. This is an example of a symmetric transformation volume.
[0166] The disclosed technology is then at least partially based on the following: synthetic sensor data corresponding to scene embeddings within the transformation volume can be generated using the transformation system 410. More efficient data collection methods can be achieved by applying this knowledge to recorded scenes not currently in the scene database, or to existing scene samples in the scene database.
[0167] Figure 5 illustrates, by way of example, multiple scene samples of a scene database in a multidimensional space 500. More specifically, the black circles represent the locations of scene embeddings 502a, 502b, 502c, 502d, 502e, and 502f associated with the multiple scene samples in the scene database. For scene-based development purposes, it is desirable to have the broadest possible coverage of the space of possible scenes. Figure 5 is used to illustrate the process of understanding which areas of the scene space are covered by existing scene samples, which areas are not covered, and how to fill those areas. To understand which new scenes should be collected and used to expand the scene database, it is necessary to understand the currently accessible volume within the scene space 500 (i.e., which parts of the scene space are reachable from existing scene samples) and the coverage provided by newly recorded scenes.
[0168] Figure 5 shows first to sixth scene embeddings 502a to 502f associated with corresponding first to sixth scene samples. Scene embeddings may further be shown together with associated transformation volumes. More specifically, each of the first to fifth scene embeddings 502a to 502e is shown with a corresponding first to fifth transformation volume 504a to 504e. As shown herein, transformation volumes 504a to 504e may have different shapes and / or sizes. Furthermore, transformation volumes may be symmetrical (as shown by the first and second transformation volumes 504a and 504b). However, transformation volumes may also be asymmetrical (as shown by the third, fourth, and fifth transformation volumes 504c, 504d, and 504e). It should be further noted that a scene embedding may also lack an associated transformation volume, as is the case for the sixth scene embedding 502f shown herein.
[0169] As explained earlier, empty volumes in scene space 500 can be identified based on the existing scene samples (and optionally also their respective transformation volumes). In the scenario of Figure 5, a first empty volume 508a and a second empty volume 508b can be identified. In the illustrated example, the first empty volume 508a and the second empty volume 508b are identified as regions not covered by any transformation volume, and / or located at least a defined distance D1 from any existing scene embedding. It should be noted that the relative shapes and sizes of the transformation volumes and of the first and second empty volumes 508a and 508b are only examples. Furthermore, empty volumes may in some cases partially overlap with one or more transformation volumes. Further still, empty volumes may be defined to extend to the boundaries of one or more adjacent transformation volumes.
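One way to picture the identification of empty volumes is a coarse scan of the embedding space. The sketch below is illustrative only and rests on several assumptions that are not part of the disclosure: a two-dimensional space, circular transformation volumes, a regular candidate grid, and the conjunction of both criteria (outside every transformation volume and at least D1 away from every embedding). The names `SceneSample`, `is_uncovered`, and `find_empty_points` are hypothetical.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


@dataclass(frozen=True)
class SceneSample:
    embedding: Point
    radius: float = 0.0  # assumption: circular volume; 0.0 means no volume


def is_uncovered(point: Point, samples: Sequence[SceneSample], min_distance: float) -> bool:
    """True if `point` lies outside every transformation volume and at
    least `min_distance` (D1 in the text) from every scene embedding."""
    for sample in samples:
        distance = math.dist(point, sample.embedding)
        if distance <= sample.radius or distance < min_distance:
            return False
    return True


def find_empty_points(
    samples: Sequence[SceneSample], extent: float, spacing: float, min_distance: float
) -> List[Point]:
    """Scan a square grid over [0, extent] x [0, extent] and keep uncovered points."""
    steps = int(round(extent / spacing))
    candidates = [(i * spacing, j * spacing) for i in range(steps + 1) for j in range(steps + 1)]
    return [p for p in candidates if is_uncovered(p, samples, min_distance)]


# Hypothetical database: one sample with a volume, one without.
samples = [SceneSample((0.0, 0.0), radius=1.0), SceneSample((4.0, 4.0))]
empty_points = find_empty_points(samples, extent=4.0, spacing=1.0, min_distance=0.5)
```

Neighbouring uncovered grid points could then be grouped into empty volumes. In a high-dimensional embedding space a grid scan becomes impractical, so sampling or nearest-neighbour queries would be a more realistic choice.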
[0170] Figure 5 further illustrates multiple query embeddings 506a, 506b, and 506c associated with corresponding query scenes (or desired scene samples) as described above. As an example, the first query embedding 506a may correspond to a query scene requested by a developer or determined in any other way. By analyzing the location of the first query embedding 506a against the existing scene samples, it can be found that it is not covered by the existing scene samples. For example, it can be found that the first query embedding 506a is located outside any transformation volume, or at a distance from the existing scene embeddings that makes it unobtainable through transformations of any existing scene sample. A data collection request can then be generated based on the first query embedding 506a.
[0171] The second query embedding 506b can correspond to a desired scene sample determined based on the first empty volume 508a identified in scene space 500. The second query embedding 506b can therefore be generated as a scene embedding located within the first empty volume 508a. A data collection request can then be generated based on the second query embedding 506b, as an alternative to or in combination with the first empty volume 508a.
[0172] Finally, the third query embedding 506c can correspond to another query scene. However, by comparing the third query embedding 506c with the fifth transformation volume 504e, it can be found that the third query embedding 506c is located within the transformation volume 504e. Therefore, the corresponding fifth existing scene sample can be transformed into the query scene corresponding to the third query embedding 506c. This transformation is illustrated herein by an arrow representing the offset of the fifth scene embedding 502e to the third query embedding 506c in the multidimensional space 500.
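The handling of the three query embeddings can be summarized as a simple routing decision: if a query embedding lies inside the transformation volume of an existing scene sample, that sample can be transformed; otherwise a data collection request is generated. The following sketch assumes circular transformation volumes in a two-dimensional space; the function `route_query`, the action labels, and the sample identifiers are hypothetical and serve only as an illustration.

```python
import math
from typing import Dict, Optional, Tuple

Point = Tuple[float, float]


def route_query(
    query: Point, samples: Dict[str, Tuple[Point, float]]
) -> Tuple[str, Optional[str]]:
    """Decide how to serve a query embedding.

    `samples` maps a sample identifier to (embedding, radius), where the
    radius describes an assumed circular transformation volume.
    Returns ("transform", sample_id) for the nearest covering sample,
    or ("collect", None) if no existing sample covers the query.
    """
    best_id: Optional[str] = None
    best_distance = math.inf
    for sample_id, (embedding, radius) in samples.items():
        distance = math.dist(query, embedding)
        if distance <= radius and distance < best_distance:
            best_id, best_distance = sample_id, distance
    if best_id is not None:
        return ("transform", best_id)
    return ("collect", None)


# Hypothetical database: one sample with a volume, one without.
database = {
    "sample_5": ((0.0, 0.0), 1.0),
    "sample_6": ((3.0, 0.0), 0.0),
}
covered = route_query((0.5, 0.0), database)
uncovered = route_query((2.0, 2.0), database)
```

Choosing the nearest covering sample is a design assumption: a smaller offset in the embedding space is taken to mean a smaller transformation magnitude and therefore a more trustworthy synthetic result.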
[0173] Figures 6A and 6B illustrate, by way of example, how the scene space can be filled. More specifically, the comparison between Figure 6A and Figure 6B highlights the effectiveness of the disclosed technology. That is, by taking the current coverage of the scene space into account through targeted data collection, the scene space can be exhausted using fewer scene samples.
[0174] Figure 6A shows a first scene space 600a, filled with a set of scene samples with associated scene embeddings 602. More specifically, Figure 6A illustrates the situation without the disclosed technology. In this case, to exhaust the first scene space 600a (i.e., achieve coverage of all possible scenes), some form of grid sampling technique is required. The scene associated with each scene embedding (corresponding to grid points with a certain distance (d1, d2) between adjacent points) needs to be encountered by the vehicle fleet, and the associated sensor data needs to be collected, transmitted, and ultimately stored in the scene database. Given that the distance between adjacent points must be relatively small, this means that a large number of scenes need to be collected. As explained earlier, this is not feasible given the vast number of possible scenes and the rarity of some scenes.
[0175] Figure 6B illustrates a second scene space 600b in which the disclosed technology is employed. More specifically, it utilizes a transformation volume 604 associated with each scene embedding 602 of the scene samples in the scene database. Using the transformation system, the same scene space can be exhausted with fewer data samples. This is because the distance (d3, d4) between adjacent scene samples can be greater than the distance (d1, d2) in the example of Figure 6A. In other words, the scene space can be filled with a sparser set of scene samples. Grid sampling methods can still be applied, but because sensor data between grid points can be generated, the grid can be made sparser. However, it should be noted that grid sampling methods are not necessary. For example, because the transformation volume can be asymmetric, the scene database can be filled with scene samples from any possible distribution in the scene space.
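The effect of a sparser grid on the number of scene samples that must be collected can be made concrete with a short calculation. The numbers below are purely hypothetical (a unit-length extent per axis, a spacing of 0.1 standing in for d1 and d2, and 0.5 standing in for d3 and d4), and `grid_sample_count` is an illustrative helper, not part of the disclosure.

```python
import math


def grid_sample_count(extent: float, spacing: float, dimensions: int) -> int:
    """Number of grid points needed to cover a hypercube of side `extent`
    with a regular grid of the given spacing (end points included)."""
    # The small epsilon guards against floating-point round-off in the division.
    points_per_axis = math.floor(extent / spacing + 1e-9) + 1
    return points_per_axis ** dimensions


# Hypothetical spacings: dense grid without transformations (as in Figure 6A)
# versus sparse grid with transformation volumes (as in Figure 6B).
dense = grid_sample_count(extent=1.0, spacing=0.1, dimensions=2)
sparse = grid_sample_count(extent=1.0, spacing=0.5, dimensions=2)
```

Because the count grows with the power of the number of dimensions, the saving from a larger spacing becomes more pronounced as the dimensionality of the scene space increases.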
[0176] The disclosed technology has been presented above with reference to specific embodiments. However, other embodiments besides those described above are possible and within the scope of the invention. Within the scope of the invention, method steps different from those described above, performed by hardware or software, can be provided. Thus, according to an exemplary embodiment, a non-transitory computer-readable storage medium is provided storing one or more programs configured to be executed by one or more processors of a vehicle control system, the programs including instructions for performing the method according to any of the embodiments described above. Optionally, according to another exemplary embodiment, a cloud computing system can be configured to perform any of the methods presented herein. The cloud computing system may include distributed cloud computing resources that jointly perform the methods presented herein under the control of one or more computer program products.
[0177] It should be noted that any reference numerals in the drawings do not limit the scope of the claims. The invention can be implemented, at least in part, by both hardware and software means, and several “devices” or “units” can be represented by the same item of hardware.
Claims
1. A computer-implemented method (100) for expanding a scene database, wherein the scene database includes a plurality of existing scene samples, each scene sample including sensor data depicting the surrounding environment of a vehicle over a period of time, wherein each scene sample is associated with a scene embedding representing the scene sample in a multidimensional space, and the method includes: obtaining (S102) an indication of an empty volume in the multidimensional space, wherein the empty volume is a volume not covered by the plurality of existing scene samples in the scene database; sending (S106) a data collection request for the empty volume to one or more vehicles of a vehicle fleet; receiving (S108), from the vehicles of the vehicle fleet, data indicating a recorded scene that matches the data collection request; and storing (S110) the received data indicating the recorded scene in the scene database of the plurality of existing scene samples.
2. The method (100) according to claim 1, wherein each scene embedding in the scene database is associated with a transformation volume in the multidimensional space, wherein the transformation volume indicates a set of possible transformed scenes that can be generated from the scene sample, and the empty volume is a volume that is not covered by the transformation volumes of the plurality of existing scene samples in the scene database.
3. The method (100) according to claim 1, wherein the recorded scene matches the data collection request if the recorded scene has an associated scene embedding located within the empty volume.
4. The method (100) according to claim 1, wherein the recorded scene matches the data collection request if the recorded scene has an associated transformation volume in the multidimensional space that at least partially covers the empty volume, wherein the associated transformation volume indicates a set of possible transformed scenes that can be generated from the recorded scene.
5. The method (100) according to claim 1, further comprising: determining (S104) a desired scene sample based on the empty volume; wherein the data collection request includes a scene embedding associated with the desired scene sample; and the recorded scene has an associated scene embedding within a defined distance from the scene embedding associated with the desired scene sample.
6. The method (100) according to claim 1, wherein obtaining (S102) the indication of the empty volume includes: obtaining (S102a) a request specifying a query scene, wherein the query scene is associated with a query embedding representing the query scene in the multidimensional space; determining (S102b) whether the query scene is covered by existing scene samples in the scene database; and in response to the existing scene samples failing to cover the query scene, generating (S102c) the indication of the empty volume based on the query embedding.
7. The method (100) according to claim 1, wherein obtaining (S102) the indication of the empty volume includes: identifying (S102a') the empty volume as a volume in the multidimensional space that is not covered by an aggregated transformation volume formed by the transformation volumes of the existing scene samples; and in response to the identification of the empty volume, generating (S102b') the indication of the empty volume.
8. The method (100) according to claim 1, wherein the data collection request further includes a priority, and wherein the data indicating the recorded scene has a first resolution when the priority is higher than a first threshold and a second resolution when the priority is lower than the first threshold, the first resolution being greater than the second resolution.
9. The method (100) according to claim 8, wherein the data indicating the recorded scene has a third resolution when the priority is below the first threshold and above a second threshold, the third resolution being between the first resolution and the second resolution.
10. The method (100) according to claim 8, wherein the priority is set based on the size of the empty volume and / or the distance to the nearest existing scene sample in the multidimensional space.
11. The method (100) according to claim 1, wherein the transformation volume further indicates the set of possible transformed scenes that can be generated from the scene sample while meeting a validity threshold, a reliability threshold, and / or an accuracy threshold.
12. The method (100) according to claim 1, wherein the transformed scene is generated by a generative machine learning model or a rendering-based machine learning model.
13. The method (100) according to claim 1, wherein the multidimensional space is a common space used for two or more data modalities.
14. A computer program product storing instructions that, when the program is executed by a computing device, cause the computing device to perform the method (100) according to any one of claims 1 to 13.
15. A computing device (200) for expanding a scene database, wherein the scene database includes a plurality of existing scene samples, each scene sample including sensor data depicting the surrounding environment of a vehicle over a period of time, wherein each scene sample is associated with a scene embedding representing the scene sample in a multidimensional space, and the computing device (200) includes a control circuit (202) configured to: obtain an indication of an empty volume in the multidimensional space, wherein the empty volume is a volume not covered by the plurality of existing scene samples in the scene database; send a data collection request for the empty volume to one or more vehicles of a vehicle fleet; receive, from the vehicles of the vehicle fleet, data indicating a recorded scene that matches the data collection request; and store the received data indicating the recorded scene in the scene database of the plurality of existing scene samples.