Multi-level sensor fusion architecture for autonomous driving systems
A multi-level sensor fusion architecture performs sensor data encoding and preliminary fusion independently on each hardware platform of the autonomous driving system. This addresses bandwidth limitations and insufficient redundancy, maintains safety and high performance under fault conditions, and enables efficient generation of predictive outputs.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- ZENSEACT AB
- Filing Date
- 2025-12-16
- Publication Date
- 2026-06-16
AI Technical Summary
Autonomous driving systems suffer from limitations in sensor fusion bandwidth and insufficient redundancy, which affect the performance and safety of the system's perception and planning functions.
A multi-level sensor fusion architecture is adopted, in which sensor data encoding and preliminary fusion are performed independently across different hardware platforms, sharing only the fused state to reduce data transmission volume, and providing a decoder network on each platform to generate predictive output, ensuring redundancy and high performance.
It enables the provision of safety-critical redundancy and high-performance sensing and planning capabilities even in the event of hardware or software failures, reducing bandwidth requirements and improving processing speed and the efficiency of generating predictive outputs.
Figure CN122220718A_ABST
Description
Technical Field
[0001] The disclosed technology relates to methods and systems for generating predictive outputs for autonomous driving systems of vehicles. In particular, but not exclusively, the disclosed technology relates to a sensor fusion architecture for improving redundancy while complying with bandwidth constraints within autonomous driving systems.
Background Technology
[0002] Research and development are ongoing in multiple technical areas related to both Advanced Driver Assistance Systems (ADAS) and Autonomous Driving (AD). ADAS and AD are collectively referred to herein by the term "Automated Driving System (ADS)," corresponding to all the different levels of automation as defined, for example, by the SAE J3016 levels of driving automation (0-5), with particular emphasis on Levels 4 and 5. ADS solutions are already in use in most new vehicles on the market, and their use will only expand in the near future. An ADS can be interpreted as a complex combination of various components, and can be defined as a system in which the perception, decision-making, and operation of the vehicle are performed by electronics and machinery instead of, or in tandem with, a human driver, and as the introduction of automation into road traffic. This includes the handling of the vehicle, the destination, and the perception of the surroundings. While the automated system has control over the vehicle, it allows the human operator to delegate all or at least some of the responsibilities to the system.
[0003] Autonomous driving systems rely on a combination of sensors, such as cameras, radar, lidar, and ultrasonic devices, to perceive and interpret their surroundings. These sensors capture various types of data, including object detection, distance measurement, and environmental mapping. Sensor fusion is the process of integrating data from multiple sensor types to improve the accuracy, reliability, and robustness of the system's perception capabilities. By combining information from different sources, sensor fusion helps overcome the limitations of individual sensors (such as poor visibility or occlusion) and improves decision-making in ADS-operated vehicles. This technology is crucial for achieving higher levels of autonomy, enabling safer and more efficient driving operations under a wide range of conditions.
[0004] Therefore, there is always a need for improvement in autonomous driving systems, especially in optimizing sensor fusion algorithms to enhance real-time performance and ensure safety when ADS operates the vehicle.
Summary of the Invention
[0005] The techniques disclosed herein seek to mitigate, alleviate, or eliminate one or more defects and deficiencies in the prior art to address various issues related to redundancy, performance, and bandwidth limitations for perception and / or planning functions in autonomous driving systems.
[0006] The various aspects and embodiments of the disclosed technology are defined below and in the appended independent and dependent claims.
[0007] The first aspect of the disclosed technology includes a computer-implemented method for generating predictive outputs for an autonomous driving system of a vehicle. The computer-implemented method includes: encoding, by one or more processors of a first hardware platform of the vehicle, a first sensor dataset generated by a first sensor cluster of the vehicle using one or more first encoder networks; and fusing the encoded first sensor dataset using a first fusion network to form a first set of fused encoded sensor data features. The computer-implemented method further includes: encoding, by one or more processors of a second hardware platform of the vehicle, a second sensor dataset generated by a second sensor cluster of the vehicle using one or more second encoder networks; and fusing the encoded second sensor dataset using a second fusion network to form a second set of fused encoded sensor data features. The first hardware platform and the second hardware platform are separate hardware platforms, and each sensor in the first sensor cluster is different from the sensors in the second sensor cluster. The computer-implemented method further includes, by the one or more processors of the first hardware platform or the one or more processors of the second hardware platform: fusing, using a third sensor fusion network, the first set of fused encoded sensor data features and the second set of fused encoded sensor data features to form a third set of fused encoded sensor data features; and generating, using one or more decoder networks, a predictive output based on the third set of fused encoded sensor data features.
[0008] A second aspect of the disclosed technology includes a computer program product comprising instructions that, when executed by a computing device of a vehicle, cause the computing device to perform a method according to any of the embodiments disclosed herein. This aspect of the disclosed technology possesses similar advantages and preferred features to other aspects.
[0009] A third aspect of the disclosed technology includes a (non-transitory) computer-readable storage medium comprising instructions that, when executed by a computing device of a vehicle, cause the computing device to perform a method according to any of the embodiments disclosed herein. This aspect of the disclosed technology possesses similar advantages and preferred features to other aspects.
[0010] As used herein, the term "non-transitory" is intended to describe computer-readable storage media (or "memory") that do not include the propagation of electromagnetic signals, but is not intended to otherwise limit the types of physical computer-readable storage devices covered by the term "computer-readable medium or memory." For example, the terms "non-transitory computer-readable medium" or "tangible memory" are intended to cover types of storage devices that do not necessarily store information permanently, including, for example, random access memory (RAM). Program instructions and data stored in a non-transitory form on a tangible computer-accessible storage medium can be further transmitted via transmission media or signals (such as electronic signals, electromagnetic signals, or digital signals) that can be transmitted via communication media such as networks and / or wireless links. Therefore, the term "non-transitory" as used herein is a limitation on the medium itself (i.e., tangible, not signaling), rather than a limitation on the persistence of data storage (e.g., RAM versus ROM).
[0011] A fourth aspect of the disclosed technology includes a system for generating predictive outputs for an autonomous driving system of a vehicle. The system includes: a first hardware platform comprising a first sensor cluster and one or more processors, the first hardware platform being configured to: encode a first sensor dataset generated by the first sensor cluster of the vehicle using one or more first encoder networks, and fuse the encoded first sensor dataset using a first fusion network to form a first set of fused encoded sensor data features. The system further includes: a second hardware platform comprising a second sensor cluster and one or more processors, the second hardware platform being configured to: encode a second sensor dataset generated by the second sensor cluster of the vehicle using one or more second encoder networks, and fuse the encoded second sensor dataset using a second fusion network to form a second set of fused encoded sensor data features. The first hardware platform and the second hardware platform are separate hardware platforms, and each sensor in the first sensor cluster is different from the sensors in the second sensor cluster. Furthermore, the one or more processors of the first hardware platform or the one or more processors of the second hardware platform are further configured to: use a third sensor fusion network to fuse the first set of fused encoded sensor data features and the second set of fused encoded sensor data features to form a third set of fused encoded sensor data features, and use one or more decoder networks to generate a predictive output based on the third set of fused encoded sensor data features.
[0012] The fifth aspect of the disclosed technology includes a vehicle comprising a system according to any of the embodiments of the fourth aspect disclosed herein. This aspect of the disclosed technology possesses similar advantages and preferred features as the other aspects.
[0013] The disclosed aspects and preferred embodiments can be suitably combined with each other in any manner that is obvious to those skilled in the art, such that one or more features or embodiments disclosed with respect to one aspect may also be considered as disclosed with respect to another aspect or to embodiments of another aspect.
[0014] Some embodiments have the advantage of providing safety-critical redundancy for the ADS's perception and / or planning functions without limiting the available inputs to the perception and / or planning functions, thereby avoiding performance degradation during the system's full operational time.
[0015] Some embodiments offer advantages such as improved performance and redundancy for the ADS's perception and / or planning functions, and reduced bandwidth requirements compared to conventional solutions.
[0016] One advantage of some embodiments is that perception data can be shared between different computing platforms within the vehicle in a bandwidth-efficient manner, thereby improving the processing speed when generating predictive outputs for autonomously maneuvering the vehicle in a safe manner.
[0017] Other embodiments are defined in the dependent claims. It should be emphasized that, when used in this specification, the term "comprising / including" is used to indicate the presence of a described feature, integer, step, or component. It does not exclude the presence or addition of one or more other features, integers, steps, components, or groups thereof.
[0018] These and other features and advantages of the disclosed technology will be further illustrated below with reference to the embodiments described herein.
Attached Figure Description
[0019] When used in conjunction with the accompanying drawings, the foregoing aspects, features, and advantages of the disclosed technology will be more fully appreciated by referring to the following illustrative and non-limiting detailed description of exemplary embodiments of the present disclosure, wherein:
[0020] Figure 1 is a schematic flowchart illustrating a method for generating predictive outputs for an autonomous driving system of a vehicle, according to some embodiments;
[0021] Figure 2a is a schematic block diagram representation of a system for generating predictive outputs for an autonomous driving system of a vehicle, according to some embodiments;
[0022] Figure 2b is a schematic block diagram representation of a system for generating predictive outputs for an autonomous driving system of a vehicle, according to some embodiments;
[0023] Figure 3a is a schematic block diagram representation of a system for generating predictive outputs for an autonomous driving system of a vehicle, according to some embodiments;
[0024] Figure 3b is a schematic block diagram representation of a system for generating predictive outputs for an autonomous driving system of a vehicle, according to some embodiments;
[0025] Figure 4 is a schematic block diagram representation of an end-to-end training setup for a system for generating predictive outputs for an autonomous driving system of a vehicle, according to some embodiments; and
[0026] Figure 5 shows a vehicle that includes a system for generating predictive outputs for an autonomous driving system of the vehicle, according to some embodiments.
Detailed Implementation
[0027] This disclosure will now be described in detail with reference to the accompanying drawings, in which some exemplary embodiments of the disclosed technology are illustrated. However, the disclosed technology may be embodied in other forms and should not be construed as limited to the exemplary embodiments disclosed. The exemplary embodiments disclosed are provided to fully convey the scope of the disclosed technology to those skilled in the art. Those skilled in the art will appreciate that the steps, services, and functions explained herein can be implemented using separate hardware circuitry, software that works in conjunction with a programmable microprocessor or general-purpose computer, one or more application-specific integrated circuits (ASICs), one or more field-programmable gate arrays (FPGAs), and / or one or more digital signal processors (DSPs).
[0028] It will also be appreciated that when this disclosure is described as a method, it can also be embodied in an apparatus including one or more processors and one or more memories coupled to the one or more processors, in which computer code is loaded to implement the method. For example, in some embodiments, the one or more memories may store one or more computer programs that, when executed by the one or more processors, cause the apparatus to perform the steps, services, and functions disclosed herein.
[0029] It should also be understood that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting. It should be noted that, as used in this specification and the appended claims, the words “a,” “an,” “the,” and “said” are intended to indicate the presence of one or more elements unless the context explicitly specifies otherwise. Thus, for example, in some contexts, references to “a unit” or “the unit” may refer to more than one unit, etc. Furthermore, the terms “comprising,” “including,” and “containing” do not exclude other elements or steps. It should be emphasized that, when used in this specification, the term “comprising / containing” is used to indicate the presence of the described feature, integer, step, or component. It does not exclude the presence or addition of one or more other features, integers, steps, components, or groups thereof. The term “and / or” should be interpreted as meaning “both” as well as each as an alternative.
[0030] It will also be understood that although the terms “first,” “second,” etc., may be used herein to describe various elements or features, these elements should not be limited by these terms. These terms are used only to distinguish one element from another. For example, a first signal may be referred to as a second signal, and similarly, a second signal may be referred to as a first signal, without departing from the scope of the embodiments. Both the first signal and the second signal are signals, but they are not the same signal.
[0031] Overview
[0032] Autonomous Driving Systems (ADS) are safety-critical systems, meaning they should not have a single point of failure. Regarding the embedded hardware (“HW”) on which the ADS software runs, this disclosure proposes to meet this redundancy requirement by enabling the ADS software to execute on at least two separate computing units. Regarding the sensors on the vehicle, this means the vehicle is equipped with a large number of complementary sensors arranged in separate sensor clusters, each connected to its own computing unit. With this design, if one sensor cluster and / or its corresponding computing unit experiences a hardware or software failure, another sensor cluster and its corresponding computing unit can still be used to safely operate the vehicle.
[0033] At the same time, to maximize the performance of the ADS, the system should ideally have access to information from all sensors. For an ADS using neural networks, the information output from all sensors is fused in a neural network model responsible for perceiving the environment and / or planning the path or trajectory to be executed by the ADS. The performance of such a model can itself be a safety consideration. For example, inaccurate or even erroneous outputs pose safety hazards regardless of any hardware or other software malfunctions. Therefore, from both a performance and a safety perspective, it is desirable for the neural network model to have access to information from all sensors on the vehicle to improve the accuracy of the output and thereby reduce the risk of accidents. To enable the ADS to access all sensor measurement data, the different sensor clusters and computing platforms (“computing units”) should be linked, but the data bandwidth available between two hardware clusters is typically much smaller than the internal bandwidth within each hardware cluster. Therefore, constantly feeding the output from all sensors on the vehicle to two hardware platforms is often considered impractical, and developers thus often face a trade-off between performance and redundancy.
[0034] In light of the above two considerations, the embodiments described herein introduce a solution that enables the development of neural network models that provide optimal performance by operating synchronously on all sensors in the different clusters in the absence of hardware failures, while simultaneously providing redundancy in the event of a hardware or software failure on one of the clusters. This reduces the risk of overall system failure and enhances overall system safety.
[0035] In other words, the embodiments herein propose an architecture that enables the use of neural network models in an autonomous driving system (which has access to information from all sensors, regardless of which hardware cluster these sensors are assigned to) while still providing redundancy in the event of hardware or software failure in one of the hardware clusters. This allows for perception and / or planning functions of the autonomous driving system that comply with redundancy requirements, while still achieving high performance levels during periods without hardware or software failures.
[0036] The inventors recognized a particular challenge when designing neural network models or architectures for ADS applications that operate synchronously on multiple sensors. More specifically, when each sensor outputs new data, that data can be encoded at that specific timestamp by a dedicated encoder network for each sensor. After a specific time period or time interval, the data from all sensor measurements output during that period / interval is merged into a single representation of the vehicle's surrounding environment (also referred to as "spatial fusion"). Furthermore, historical representations of the surrounding environment prior to that time interval are stored in memory. Therefore, when generating a representation for the current time interval, it can be combined with the historical representations stored in memory to obtain a final representation of the current state of the vehicle's surrounding environment (also referred to as "temporal fusion"). The final predictive output can then be provided by one or more decoder networks trained to solve various tasks (object detection, object classification, object tracking, object trajectory prediction, path planning, trajectory planning, etc.), which receive the spatially and temporally fused state of the surrounding environment at regular intervals.
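The per-interval process described above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: the functions `encode`, `spatial_fusion`, and `temporal_fusion` are hypothetical stand-ins for learned networks (a fixed scaling, an element-wise mean, and an exponential moving average, respectively), and none of these choices come from the disclosure.

```python
from typing import Dict, List

Vector = List[float]


def encode(sensor_name: str, measurement: Vector) -> Vector:
    """Hypothetical stand-in for a dedicated per-sensor encoder network."""
    scale = {"camera": 1.0, "radar": 0.5, "lidar": 2.0}[sensor_name]
    return [scale * x for x in measurement]


def spatial_fusion(encoded: List[Vector]) -> Vector:
    """Merge all encodings from one time interval (assumed: element-wise mean)."""
    count = len(encoded)
    return [sum(values) / count for values in zip(*encoded)]


def temporal_fusion(current: Vector, memory: Vector, keep: float = 0.5) -> Vector:
    """Combine the current representation with the stored history
    (assumed: exponential moving average with weight `keep` on the history)."""
    return [keep * m + (1.0 - keep) * c for c, m in zip(current, memory)]


def step(measurements: Dict[str, Vector], memory: Vector) -> Vector:
    """Process one time interval: encode, fuse spatially, then fuse temporally."""
    encoded = [encode(name, data) for name, data in measurements.items()]
    return temporal_fusion(spatial_fusion(encoded), memory)


# Two consecutive time intervals with made-up measurements.
memory = [0.0, 0.0]
intervals = [
    {"camera": [1.0, 2.0], "radar": [2.0, 4.0], "lidar": [0.5, 1.0]},
    {"camera": [3.0, 2.0], "radar": [6.0, 4.0], "lidar": [1.5, 1.0]},
]
for measurements in intervals:
    memory = step(measurements, memory)

print(memory)
```

In a real system the resulting state would be passed to the decoder networks at regular intervals; here it is simply printed.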
[0037] When designing a neural network model or architecture that (i) follows the process described above and (ii) is suitable for implementation in a vehicle (where multiple sensors are grouped into different clusters, each with its own computing hardware), the challenge lies in the very limited data bandwidth available between these clusters.
[0038] More specifically, consider a scenario in which a vehicle equipped with an ADS has two hardware platforms (“Platform A” and “Platform B”), each with its own sensor cluster and computing hardware.
[0039] In this scenario, the information output from all sensors on the vehicle needs to be fused synchronously. Therefore, to run such a solution in a vehicle, all sensor data, or more specifically, all encoded sensor data, needs to be sent from Platform A to Platform B, or vice versa. Spatial and temporal fusion can then be performed on the receiving platform. However, when new sensor measurement data is output, the representation of the surrounding environment must be "updated" (temporal fusion) in a timely manner, otherwise there is a risk that the ADS will not be able to react in time to changes in the surrounding environment, potentially leading to safety-critical issues. Therefore, sensor data or encoded sensor data must be sent from one platform to another at a sufficiently high rate to ensure timely temporal fusion. However, due to limited onboard computing power and the bandwidth limitations between hardware platforms, this may be infeasible. Furthermore, even if sensor data or encoded sensor data could be sent between platforms, enforcing this requirement to share large amounts of data between hardware platforms may constrain or otherwise hinder the performance of neural network models. This is because it may require smaller sensor data packets or lower-dimensional encoded sensor data, which reduces the amount of information available to generate predictive outputs. In general, higher-dimensional encoding is desirable because it usually translates to better performance; however, if the sensor data encodings are to be transmitted between platforms, it also translates to higher bandwidth requirements.
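The bandwidth trade-off described above can be made concrete with a small back-of-the-envelope calculation. All numbers below (feature dimension, bytes per value, update rate, number of sensors, link budget) are hypothetical and chosen only for illustration, and the sketch assumes that a single fused state has the same dimensionality as one sensor encoding, which the disclosure does not state.

```python
def bytes_per_second(feature_dim: int, bytes_per_value: int, rate_hz: int, streams: int = 1) -> int:
    """Bandwidth needed to send `streams` feature vectors at `rate_hz` updates per second."""
    return feature_dim * bytes_per_value * rate_hz * streams


def max_feature_dim(link_budget: int, bytes_per_value: int, rate_hz: int, streams: int = 1) -> int:
    """Largest feature dimension that fits in a link budget given in bytes per second."""
    return link_budget // (bytes_per_value * rate_hz * streams)


# Hypothetical setup: 6 sensors on the sending platform, 16-bit values, 10 Hz updates.
SENSORS = 6
BYTES_PER_VALUE = 2
RATE_HZ = 10
FEATURE_DIM = 100_000

# Sharing every per-sensor encoding versus sharing one fused state.
per_sensor_load = bytes_per_second(FEATURE_DIM, BYTES_PER_VALUE, RATE_HZ, streams=SENSORS)
fused_state_load = bytes_per_second(FEATURE_DIM, BYTES_PER_VALUE, RATE_HZ, streams=1)

# With a fixed inter-platform budget, how large can each transmitted vector be?
LINK_BUDGET = 4_000_000  # bytes per second, hypothetical
dim_if_sharing_encodings = max_feature_dim(LINK_BUDGET, BYTES_PER_VALUE, RATE_HZ, streams=SENSORS)
dim_if_sharing_fused_state = max_feature_dim(LINK_BUDGET, BYTES_PER_VALUE, RATE_HZ, streams=1)

print(per_sensor_load, fused_state_load)
print(dim_if_sharing_encodings, dim_if_sharing_fused_state)
```

Under these assumed numbers, sharing all encodings costs six times the bandwidth of sharing one fused state, and a fixed link budget permits a six times higher-dimensional representation when only the fused state is transmitted.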
[0040] Therefore, this disclosure proposes a multi-level sensor fusion architecture in which a complete sensor fusion step is independently performed on each hardware platform, based on encoded sensor data generated from its own sensor cluster, before any data is transmitted between platforms. Thus, instead of sending sensor data or encoded sensor data to another platform, only the fused state is shared between platforms. The platform receiving the fused state can then perform a second fusion step to fuse its own fused state (derived from sensor measurement data from its own cluster) with the received fused state, providing a final fused representation of the surrounding environment.
[0041] The first sensor fusion step (executed on each platform) combines the information contained in the output information from each sensor into a single representation required to solve the final task of the entire model or architecture. Furthermore, the inventors realized that each byte of the fused state representation can store more relevant information compared to the encoded sensor data from each sensor. This is because there is often a significant degree of redundancy among multiple sensors within a cluster, and sending sensor data or encoded sensor data inevitably leads to the transmission of redundant information. Therefore, by performing the first fusion step before sharing data between platforms, more relevant information is shared between platforms given a fixed bandwidth constraint between clusters, compared to sharing sensor data or encoded sensor data.
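The two fusion levels can be illustrated with a minimal sketch. The `Platform` class, the `mean_fuse` function, and the toy encoders are hypothetical stand-ins introduced only for this example; in the disclosed architecture these roles are played by trained encoder and fusion networks.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Vector = List[float]


def mean_fuse(features: List[Vector]) -> Vector:
    """Hypothetical stand-in for a learned fusion network (element-wise mean)."""
    return [sum(values) / len(features) for values in zip(*features)]


def identity(x: Vector) -> Vector:
    return list(x)


def double(x: Vector) -> Vector:
    return [2.0 * v for v in x]


@dataclass
class Platform:
    name: str
    encoders: Dict[str, Callable[[Vector], Vector]]  # one toy encoder per sensor

    def local_fused_state(self, sensor_data: Dict[str, Vector]) -> Vector:
        """First-level fusion: encode and fuse this platform's own sensor cluster."""
        encoded = [self.encoders[name](data) for name, data in sensor_data.items()]
        return mean_fuse(encoded)


platform_a = Platform("A", {"front_camera": identity, "lidar": double})
platform_b = Platform("B", {"rear_camera": identity, "radar": double})

state_a = platform_a.local_fused_state({"front_camera": [1.0, 3.0], "lidar": [1.0, 0.5]})
state_b = platform_b.local_fused_state({"rear_camera": [2.0, 2.0], "radar": [2.0, 1.0]})

# Only state_b crosses the inter-platform link, not the two encodings behind it.
values_sent = len(state_b)

# Second-level fusion on platform A: own fused state plus the received one.
joint_state = mean_fuse([state_a, state_b])
print(state_a, state_b, joint_state, values_sent)
```

In this toy setup, platform B transmits one vector instead of one vector per sensor, which mirrors the point that redundant per-sensor information is removed before anything is shared.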
[0042] Furthermore, to maintain safety-critical redundancy across the individual platforms, some embodiments herein propose that each platform is provided with the ability to generate predictive outputs based solely on the fused state representation generated by that platform. Therefore, in addition to one or more decoder networks that provide estimated outputs based on the fused state of all platforms (and all sensor clusters), each platform is also provided with a set of equivalent decoder networks trained to provide estimated outputs based on the fused state representation of the sensor data output from its own sensor cluster.
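The redundancy described above amounts to choosing between two sets of decoders at run time. The following sketch shows one way this selection could look; the decoder functions are toy placeholders, and the convention that a missing remote state is represented by `None` is an assumption made for this example, since the disclosure does not specify how a failure is detected or signaled.

```python
from typing import Dict, List, Optional

Vector = List[float]


def joint_decoder(state: Vector) -> Dict[str, object]:
    """Toy stand-in for decoder networks trained on the fused state of all platforms."""
    return {"source": "joint", "score": sum(state)}


def local_decoder(state: Vector) -> Dict[str, object]:
    """Toy stand-in for decoder networks trained on this platform's own fused state."""
    return {"source": "local", "score": sum(state)}


def predict(local_state: Vector, remote_state: Optional[Vector]) -> Dict[str, object]:
    """Use all sensors when the other platform is healthy, else fall back to local data."""
    if remote_state is None:  # assumed failure signal: no fused state was received
        return local_decoder(local_state)
    joint_state = [(a + b) / 2.0 for a, b in zip(local_state, remote_state)]
    return joint_decoder(joint_state)


print(predict([1.0, 2.0], [3.0, 4.0]))  # normal operation
print(predict([1.0, 2.0], None))        # other platform unavailable
```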
[0043] The entire architecture proposed herein can be trained in an end-to-end manner. More specifically, during training, gradients can be propagated from the predictive outputs generated individually by each cluster as well as from the predictive outputs based on the combination of all sensor clusters. In the event of a hardware or software failure on one platform, the predictions based on the sensor cluster of another platform can be used.
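One simple way to let gradients flow from every output head is to train on a combined objective. The sketch below only computes such a combined loss value; the use of mean squared error and of equal weights is an assumption for illustration and is not taken from the disclosure, and an actual training setup would rely on an automatic differentiation framework rather than plain Python.

```python
from typing import List, Sequence

Vector = List[float]


def mse(prediction: Vector, target: Vector) -> float:
    """Mean squared error between a prediction and its target."""
    return sum((p - t) ** 2 for p, t in zip(prediction, target)) / len(target)


def total_loss(
    pred_a: Vector,
    pred_b: Vector,
    pred_joint: Vector,
    target: Vector,
    weights: Sequence[float] = (1.0, 1.0, 1.0),  # hypothetical equal weighting
) -> float:
    """Combine the losses of the per-platform heads and the joint head."""
    w_a, w_b, w_joint = weights
    return (
        w_a * mse(pred_a, target)
        + w_b * mse(pred_b, target)
        + w_joint * mse(pred_joint, target)
    )


target = [1.0, 0.0]
loss = total_loss(pred_a=[0.0, 0.0], pred_b=[1.0, 1.0], pred_joint=[1.0, 0.0], target=target)
print(loss)
```

Because every term depends on the shared encoder and first-level fusion parameters, minimizing such a sum trains each platform to be useful both on its own and as part of the joint prediction.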
[0044] It should be noted that although this description typically illustrates various embodiments with only two platforms, those skilled in the art will readily recognize that the same teachings and principles apply to architectures with more than two platforms (each with a corresponding sensor cluster).
[0045] Definitions
[0046] In the current context, "Automated Driving System (ADS)" refers to a complex combination of hardware and software components designed to control and operate a vehicle without direct human intervention. ADS technology aims to automate various aspects of driving, such as steering, acceleration, deceleration, and monitoring of the surrounding environment. The primary goal of ADS is to enhance safety, efficiency, and convenience in transportation. The range of ADS can vary from basic driver assistance systems to highly advanced autonomous driving systems, depending on their level of automation, as categorized according to standards such as SAE J3016. These systems utilize various sensors, cameras, radar, lidar, and powerful computer algorithms to perceive the environment and make driving decisions. The specific capabilities and features / functions of ADS can vary considerably, from systems providing limited assistance to those capable of independently handling complex driving tasks under specific conditions.
[0047] Advanced Driver Assistance Systems (ADAS) are technologies that assist drivers during driving, although they do not necessarily provide complete autonomy. ADAS features are often built into ADS. Examples include adaptive cruise control, lane keeping assist, automatic emergency braking, and parking assist. They enhance safety and convenience but typically require a degree of human supervision and intervention. Autonomous Driving (AD), on the other hand, is a technology designed to control and navigate a vehicle without human supervision. Accordingly, the difference between ADAS and AD can be said to lie in the level of autonomy and control. ADAS systems are designed to assist and support the driver, while AD aims to achieve complete control of the vehicle without continuous human supervision. Accordingly, AD aims to achieve a higher level of autonomy (such as Level 4 and Level 5 according to SAE international standards), meaning the vehicle can operate independently in most or all driving scenarios without human intervention. As mentioned above, the term "ADS" is used as a general term in this document that encompasses both ADAS and AD. In the current context, ADS functions or ADS features can be understood as specific functions or features within the entire ADS technology stack, such as high-speed navigation features, traffic congestion navigation features, route planning features, etc.
[0048] In the current context, "machine learning algorithm" or "neural network" refers to a computable model or set of techniques used to enable computers to solve tasks, such as enabling a vehicle's perception system to interpret and understand its surroundings. Perception tasks in ADS involve the vehicle's ability to detect and identify objects, obstacles, road signs, lane markings, pedestrians, other vehicles, and various environmental conditions. ADS can use machine learning algorithms to process sensor data (e.g., data from cameras, LiDAR, radar, and other sensors) to make informed decisions about how to navigate safely. These algorithms use data-driven techniques to analyze and classify objects, understand road geometry, predict the movement of other road users, and / or assess potential risks in real time. Common types of machine learning algorithms used in ADS perception tasks include deep neural networks, convolutional neural networks (CNNs) (e.g., for camera image processing, LiDAR output processing, etc.), recurrent neural networks (RNNs) (e.g., for sequential data), and various other techniques such as support vector machines (SVMs) and decision trees.
[0049] In some embodiments, machine learning algorithms are implemented using suitable publicly available machine learning code elements (e.g., code elements available in PyTorch, Keras, and TensorFlow, or in any other suitable software development platform) in any suitable manner known to a person skilled in the art.
[0050] The terms "encoder" (or "encoder network") and "decoder" (or "decoder network") refer to components in a neural network architecture designed to interpret and process sensed data from the surroundings of a vehicle. The encoder is responsible for processing the raw sensed inputs (e.g., camera images, LiDAR point clouds, or radar signals) and transforming them into compact, abstract representations ("a set of encoded features"). The decoder takes the encoded representations produced by the encoder and transforms them back into a more interpretable output.
[0051] The compressed representation output from the encoder is often referred to as the "latent space" or "feature space." The latent space encodes essential information related to the input data (e.g., an input image) in a compact form. Each dimension in the latent space can represent a different feature or concept. Furthermore, this representation can capture essential features such as object boundaries, relative distances, and object classifications, while discarding irrelevant information. For example, an encoder can process camera images, identify, and represent key features such as pedestrians, lane markings, traffic signs, or other vehicles. Thus, the encoder reduces high-dimensional sensing data to a meaningful set of features that can be used to understand the driving environment. This representation can serve as the basis for various tasks such as object detection, semantic segmentation (understanding the context of objects), and depth estimation (determining distances to objects). For example, in a convolutional neural network (CNN) used for object detection, the encoder extracts hierarchical features (edges, textures, shapes) from camera images and progressively builds a detailed understanding of what is present in the scene.
[0052] As mentioned earlier, the decoder takes the encoded representation (“a set of encoded features”) produced by the encoder and transforms it back into a more interpretable output. This may involve predicting object locations, labeling parts of an image, or providing additional details (such as the orientation or movement of an object). For example, if the perception task is object detection, the decoder can predict bounding boxes around identified objects (e.g., pedestrians, cars) and assign category labels to these objects (e.g., “cars”, “parking signs”, “pedestrians”). Similarly, if the perception task is semantic segmentation, the decoder can take the encoded features and assign a category label to each pixel in the image (e.g., identifying road surfaces, pedestrian zones, or vehicles). Furthermore, if the perception task is depth estimation, the decoder will take the encoded features and predict distances to various objects or points in the environment.
[0053] The term "predicted output" (or "estimated output") refers to the final result generated by a neural network (e.g., a deep neural network) after processing the input, based on patterns learned during its training. In an encoder-decoder architecture, "estimated output" refers to the output of the decoder. Furthermore, predicted output can be interpreted as the neural network using the weights and biases learned during its training phase to attempt to predict outcomes or make decisions about new data it has never seen before. In the context of autonomous driving systems, "predicted output" can refer to the network's prediction of the driving environment or future vehicle behavior based on the sensed data it receives (e.g., from cameras, radar, and lidar). For example, in perception tasks in the form of classification tasks, predicted output could be the network's identification and classification of objects in the environment (such as pedestrians, vehicles, traffic signs, or lane markings). For instance, a neural network could predict the probability that a detected object ahead is a pedestrian or a stationary object such as a mailbox. Additionally, in perception tasks in the form of regression tasks, predicted output can be a continuous value, such as predicting the distance to the nearest obstacle, the speed of adjacent vehicles, or the time before a traffic light turns red. For example, a network can predict how far in advance a vehicle should begin braking to come to a safe stop at a red light.
[0054] In the current context, a "sensor" (or "sensor device") refers to a specialized component or system designed to capture and collect information from the vehicle's surroundings. These sensors play a crucial role in enabling Automated Driving Systems (ADSs) to perceive and understand their environment, make informed decisions, and navigate safely. Sensor devices are typically integrated into the hardware and software systems of autonomous vehicles to provide real-time data for various tasks, such as obstacle detection, localization, road model estimation, and object recognition. Common types of sensor devices used in autonomous driving include LiDAR (light detection and ranging), radar, cameras, and ultrasonic sensors. LiDAR sensors use laser beams to measure distances and create high-resolution 3D maps of the vehicle's surroundings. Radar sensors use radio waves to determine the distance and relative speed of objects around the vehicle. Camera sensors capture visual data, enabling the vehicle's computer system to identify traffic signs, lane markings, pedestrians, and other vehicles. Ultrasonic sensors use sound waves to measure proximity to objects. Various machine learning algorithms, such as artificial neural networks, can be used to process the output from sensors to understand the environment. Accordingly, a "cluster" of sensors refers to a group or set of sensors, or in other words, multiple sensors.
[0055] The "surrounding environment" of a self-driving vehicle can be understood as a certain area around the self-driving vehicle, in which objects (such as other vehicles, landmarks, obstacles, etc.) can be detected and identified by the vehicle's sensors (radar, lidar, cameras, etc.), that is, within the sensor range of the self-driving vehicle.
[0056] As used herein, the term "in response to" can be interpreted, depending on the context, as meaning "when," "after," or "if." Similarly, the phrases "if determined to be," "when determined to be," or "in the instance of," can be interpreted, depending on the context, as meaning "after determining to be," "in response to determining to be," "after detecting and identifying the occurrence of an event," or "in response to detecting the occurrence of an event." Accordingly, the phrase "if X equals Y" can be interpreted, depending on the context, as "when X equals Y," "when determined to be X equals Y," "in response to X equals Y," or "in response to detecting / determining that X equals Y."
[0057] The term "acquire" is interpreted broadly herein and includes receiving, retrieving, collecting, and acquiring, directly and / or indirectly, between two entities configured to communicate with each other or further with other external entities. However, in some embodiments, the term "acquire" should be interpreted as determining, deriving, forming, calculating, etc. In other words, acquiring the pose of a vehicle may include determining or calculating the vehicle's pose based on, for example, GNSS data and / or perception data together with map data. Therefore, as used herein, "acquire" may mean receiving parameters at a first entity / unit from a second entity / unit or determining parameters at a first entity or unit, for example, based on data received from another entity / unit.
[0058] Example
[0059] Figure 1 is an exemplary flowchart representation of a method S100 for generating predictive output for an automated driving system (ADS) of a vehicle, according to some embodiments. Method S100 is a computer-implemented method that can be executed online (e.g., by a processing system of a vehicle equipped with ADS). The processing system includes a first hardware platform and a second hardware platform. The first hardware platform has one or more processors and one or more memories coupled to the one or more processors. The second hardware platform has one or more processors and one or more memories coupled to the one or more processors. The one or more memories of each hardware platform store one or more programs that, when executed by the one or more processors of each hardware platform, perform the steps, services, and functions of the method S100 disclosed herein. Furthermore, the flowchart depicted in Figure 1 is annotated to indicate which hardware (“HW”) platform performs the various steps or processes in method S100.
[0060] Method S100 may include using one or more processors of a first hardware platform of the vehicle to obtain (S101) a first sensor dataset generated by a first sensor cluster of the vehicle. Further, method S100 includes using one or more processors of the first hardware platform of the vehicle to encode the first sensor dataset generated by the first sensor cluster of the vehicle using one or more first encoder networks (S103). In some embodiments, each sensor is provided with one encoder network, meaning the encoder network is sensor-specific. In some embodiments, the number of encoder networks provided is less than the number of sensors in the sensor cluster, meaning at least one encoder network processes sensor outputs from two or more sensors in the cluster. For example, the encoder network may be sensor modality-specific, meaning each encoder network processes all sensor outputs from the sensor cluster that have the same sensor modality.
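The two encoder-assignment options described above (one encoder per sensor, or one encoder shared by all sensors of the same modality) can be sketched as a simple grouping step. This is a minimal illustration only; the `Sensor` type, the `assign_encoders` helper, and the sensor names are hypothetical and do not appear in the disclosure.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Sensor:
    name: str       # hypothetical identifier, e.g. "front_camera"
    modality: str   # e.g. "camera", "lidar", "radar"


def assign_encoders(cluster, per_modality):
    """Map an encoder key to the names of the sensors that encoder processes.

    per_modality=False: one encoder per sensor (sensor-specific).
    per_modality=True:  one encoder per modality, shared by all sensors
                        of that modality in the cluster.
    """
    groups = defaultdict(list)
    for sensor in cluster:
        key = sensor.modality if per_modality else sensor.name
        groups[key].append(sensor.name)
    return dict(groups)


# Hypothetical first sensor cluster with mixed modalities.
cluster = [
    Sensor("front_camera", "camera"),
    Sensor("rear_camera", "camera"),
    Sensor("roof_lidar", "lidar"),
]
sensor_specific = assign_encoders(cluster, per_modality=False)
modality_specific = assign_encoders(cluster, per_modality=True)
```

With the modality-specific option, the number of encoders is smaller than the number of sensors, which matches the case where at least one encoder processes outputs from two or more sensors.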
[0061] Further, method S100 includes using one or more processors of a first hardware platform of the vehicle to fuse the encoded first sensor dataset using a first fusion network (S105) to form a first set of fused encoded sensor data features. In other words, feature vectors (“encoded sensor datasets”) from each encoder are combined (e.g., via concatenation, averaging, or learned transformations). Thus, while combining the outputs of each sensor into a unified representation, their unique properties are preserved. The first fusion network may take the form of a fusion machine learning algorithm (i.e., a neural network configured to fuse the encoded features). The term “fusion network” should be interpreted broadly in this context and may be referred to as a fusion algorithm or fusion module, etc.
[0062] In some embodiments, method S100 further includes using one or more processors of a second hardware platform of the vehicle to obtain (S102) a second sensor dataset generated by a second sensor cluster of the vehicle. Method S100 further includes using one or more processors of the second hardware platform of the vehicle to encode the second sensor dataset generated by the second sensor cluster of the vehicle using one or more second encoder networks (S104). As previously described, the encoder network may be sensor-specific, or an encoder network may process and encode sensor outputs from two or more sensors in the cluster.
[0063] Furthermore, method S100 includes using one or more processors of a second hardware platform of the vehicle to fuse the encoded second sensor dataset using a second fusion network (S106) to form a second set of fused encoded sensor data features. As previously described, feature vectors (“encoded sensor datasets”) from each encoder are combined (e.g., via concatenation, averaging, or learned transformations). Thus, while combining the outputs of each sensor into a unified representation, their unique properties are preserved. The second fusion network may take the form of a fusion machine learning algorithm (i.e., a neural network configured to fuse the encoded features).
[0064] It can be noted that the individual steps in method S100 (e.g., S101 to S106) do not necessarily have to be executed by the “same” one or more processors on a dedicated hardware platform, but can be executed by different processors in a distributed processing architecture. However, naturally, the individual steps of the method can be executed by the same one or more processors on a single hardware platform.
[0065] As mentioned above, the first and second hardware platforms are separate hardware platforms, and each sensor in the first sensor cluster is different from the sensors in the second sensor cluster. In some embodiments, the first sensor cluster includes sensors of different modalities, and the second sensor cluster includes sensors of different modalities. In other words, the sensors in the first cluster do not all have to be cameras, and the sensors in the second cluster do not all have to be LiDAR; rather, each cluster can contain sensors of mixed types / modalities.
[0066] Further, method S100 includes using one or more processors of a first hardware platform of the vehicle, or one or more processors of a second hardware platform of the vehicle, to fuse the first set of fused encoded sensor data features and the second set of fused encoded sensor data features using a third sensor fusion network (S107) to form a third set of fused encoded sensor data features. In other words, a “second-level” fusion (S107) is performed to combine the “first-level” fused representations of the vehicle’s surrounding environment. As previously described, feature vectors (“fused encoded sensor data features”) from the first and second fusion networks are combined (e.g., via concatenation, averaging, or learned transformations). The third fusion network may take the form of a fusion machine learning algorithm (i.e., a neural network configured to fuse the encoded features).
[0067] Method S100 further includes using one or more processors of a hardware platform that performs second-level fusion (S107) to generate (S108) a predictive output based on a third set of fused encoded sensor data features using one or more decoder networks.
[0068] Each fusion process, S105, S106, and S107, can be considered an "intermediate fusion," a technique used to combine feature representations from different sensor modalities after they have been encoded by their respective networks. In the Bird's-Eye View (BEV) fusion method, which is particularly suitable for autonomous driving applications, sensor data (from cameras, LiDAR, and radar) is transformed into an encoded BEV representation (e.g., a BEV grid) before being fused and decoded. The advantage of BEV lies in its ability to provide a top-down, geometrically consistent view of the environment, making it easier to perceive spatial relationships between objects around the vehicle (e.g., roads, cars, pedestrians).
[0069] For example, referring to either of the first-level fusion steps (S105, S106), if we assume the first sensor in the cluster is a camera and the second sensor in the same cluster is a LiDAR, then the image captured by the camera is passed through an encoder network (e.g., a convolutional neural network (CNN) encoder) to extract high-level image features such as edges, textures, and object categories. The output of the camera encoder is typically a set of feature maps (2D). Subsequently, for the second sensor, the point cloud captured by the LiDAR is passed through a voxel-based or point-based encoder (e.g., PointNet, VoxelNet). This transforms the sparse 3D point cloud data into a dense feature representation.
[0070] Before fusion (S105, S106), the encoded data from each sensor (S103, S104) can be projected into a bird's-eye view (BEV) format. More specifically, the encoded sensor dataset (e.g., a CNN-encoded feature map) is transformed from its 2D image plane perspective to a top-down BEV representation. This involves projecting each feature onto a common ground plane (e.g., using a learned matrix or a geometric transformation matrix) to align the features with the 3D environment. The LiDAR point cloud, which is inherently 3D, can be "voxelized" into a grid structure, simplifying the transformation to BEV. Each voxel is dimensionally reduced and mapped to a 2D plane, creating a dense BEV feature map. Projection to BEV allows all sensor modalities to be aligned in a common reference frame, simplifying the fusion (S105, S106) process.
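The LiDAR branch of this projection can be sketched as follows. This is a simplified illustration under stated assumptions: each voxel stores a plain point count and the height axis is collapsed with a maximum, whereas the described system would operate on learned encoder features. The function names, grid extent, and voxel size are hypothetical.

```python
import numpy as np


def voxelize(points, grid_min, voxel_size, grid_shape):
    """Count LiDAR points per voxel; points outside the grid are dropped."""
    idx = np.floor((points - grid_min) / voxel_size).astype(int)
    inside = np.all((idx >= 0) & (idx < np.array(grid_shape)), axis=1)
    grid = np.zeros(grid_shape, dtype=np.float32)
    # np.add.at accumulates correctly even when several points share a voxel.
    np.add.at(grid, tuple(idx[inside].T), 1.0)
    return grid


def collapse_to_bev(voxel_grid):
    """Reduce the height (z) axis so each (x, y) cell holds a single value."""
    return voxel_grid.max(axis=2)


# Hypothetical point cloud in metres: columns are x, y, z.
points = np.array([
    [0.5, 0.5, 0.2],
    [0.6, 0.4, 1.7],   # same (x, y) cell as the first point, different height
    [3.2, 1.1, 0.3],
    [9.9, 9.9, 9.9],   # outside the grid, ignored
])
grid = voxelize(points, grid_min=np.array([0.0, 0.0, 0.0]),
                voxel_size=1.0, grid_shape=(4, 4, 2))
bev = collapse_to_bev(grid)
```

The resulting `bev` array has one value per ground-plane cell, which is the form in which it can be aligned with BEV maps from other modalities.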
[0071] Once all encoded datasets are in BEV format, they can be fused (S105, S106). This fusion (S105, S106) can be performed at the feature level, combining information from different sensors to create a richer and more robust environmental representation. As mentioned above, there are several ways to achieve intermediate fusion (e.g., via concatenation, averaging, or learned transformations). More specifically, when using the concatenation method, the encoded data features (BEV feature map) from each sensor are concatenated along the channel dimension. For example, if the camera BEV map has 128 channels and the LiDAR BEV map has 64 channels, the resulting fused map will have 192 channels. This method preserves all feature information from each modality. When using the element-wise summation method, the encoded data features from each sensor are summed element-wise, fusing information directly at each spatial location. The element-wise summation method can help balance the contributions of different sensors, although it may lose some modality-specific nuances. Using the learned transformation requires training the fusion network to "learn" a set of weights or transformation functions (such as fully connected layers or convolutional layers) that combine the BEV feature maps from each sensor. This allows the fusion network to learn, context-dependently, which features from each sensor are most important and how to weight them.
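The three fusion options above can be sketched on random stand-in BEV maps. This is a minimal illustration, not the disclosed fusion network: the spatial size, the 1x1 projections, and the 96 output channels are assumptions, and the random weights stand in for parameters that would be learned during training.

```python
import numpy as np

rng = np.random.default_rng(0)
H, W = 8, 8                                    # hypothetical BEV grid size
camera_bev = rng.standard_normal((128, H, W))  # 128-channel camera BEV map
lidar_bev = rng.standard_normal((64, H, W))    # 64-channel LiDAR BEV map

# Concatenation along the channel dimension: 128 + 64 = 192 channels.
fused_concat = np.concatenate([camera_bev, lidar_bev], axis=0)

# Element-wise summation needs matching shapes, so the LiDAR map is first
# lifted to 128 channels with a 1x1 projection (assumption, random weights).
proj = rng.standard_normal((128, 64)) / np.sqrt(64)
lidar_128 = np.einsum("oc,chw->ohw", proj, lidar_bev)
fused_sum = camera_bev + lidar_128

# Learned transformation, sketched as a 1x1 convolution over the concatenated
# channels; the output width of 96 channels is an arbitrary choice.
w_fuse = rng.standard_normal((96, 192)) / np.sqrt(192)
fused_learned = np.einsum("oc,chw->ohw", w_fuse, fused_concat)
```

Concatenation keeps every input channel at the cost of a wider map, while summation and the learned transformation produce a fixed-width output regardless of how many modalities are fused.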
[0072] As will be readily understood by those skilled in the art, a similar approach can be used for second-level fusion, the difference being that the input to the third fusion network is two or more vehicle-surrounding environment representations that have already been fused.
[0073] Further, method S100 may include using the one or more processors of the hardware platform that generates (S108) the predicted output to transmit (S109) the generated predicted output to one or more downstream functions of the autonomous driving system configured to control the vehicle based on the generated predicted output. The downstream functions of the ADS may be, for example, a path planning module configured to generate candidate paths for the vehicle to execute based at least partially on the predicted output, a localization module configured to output the vehicle's position or pose (position and heading) based at least partially on the predicted output, or a decision and control module configured to output control signals to one or more actuators of the vehicle to control the vehicle's movement based at least partially on the predicted output.
[0074] Furthermore, in some embodiments, method S100 further includes, in response to both the first and second hardware platforms being operable, using one or more processors of the first hardware platform of the vehicle to receive a second set of fused coded sensor data from the second hardware platform, and using one or more processors of the first hardware platform of the vehicle to fuse the features of the first set of fused coded sensor data and the features of the second set of fused coded sensor data using a third sensor fusion network (S107) to form a third set of fused coded sensor data features. Method S100 may further include using one or more processors of the first hardware platform of the vehicle to generate (S108) a prediction output based on the third set of fused coded sensor data features using one or more decoder networks. In other words, if both hardware platforms are operable, the first hardware platform can receive first-level fused features from the second platform, perform second-level fusion, and generate a prediction output based on the second-level fused information.
[0075] Similarly, in some embodiments, method S100 further includes, in response to both the first and second hardware platforms being operable, using one or more processors of the second hardware platform of the vehicle to receive a first set of fused coded sensor data from the first hardware platform, and using one or more processors of the second hardware platform of the vehicle to fuse the features of the first set of fused coded sensor data and the features of the second set of fused coded sensor data using a third sensor fusion network (S107) to form a third set of fused coded sensor data features. Method S100 may further include using one or more processors of the second hardware platform of the vehicle to generate (S108) a prediction output based on the third set of fused coded sensor data features using one or more decoder networks. In other words, if both hardware platforms are operable, the second hardware platform can receive first-level fused features from the first platform, perform second-level fusion, and generate a prediction output based on the second-level fused information.
[0076] The choice of which of the two hardware platforms performs the second-level fusion (S107) and subsequent steps can be arbitrary, or it can be based on which of the two hardware platforms has the greatest available computing power at that moment. The report of whether a hardware platform is operable can be provided, for example, by a separate module or system configured to monitor each hardware platform to detect errors or faults, as readily understood by those skilled in the art. Furthermore, the term "operable" in relation to a hardware platform means that the hardware platform does not report or exhibit any detectable hardware or software faults or errors during operation.
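The compute-based choice can be sketched as a one-line selection. The `choose_fusion_host` helper and the `available_compute` mapping are hypothetical; the disclosure does not specify how available computing power is measured or how ties are resolved.

```python
def choose_fusion_host(available_compute):
    """Return the name of the platform with the most available compute.

    available_compute is a hypothetical mapping such as
    {"first": 0.35, "second": 0.60}; the units are arbitrary.
    Ties go to "first" because max() keeps the first maximal item.
    """
    return max(("first", "second"), key=lambda name: available_compute[name])


host = choose_fusion_host({"first": 0.35, "second": 0.60})
```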
[0077] Further, in some embodiments, method S100 includes, in response to the first hardware platform being inoperable and the second hardware platform being operable, using one or more processors of the second hardware platform of the vehicle to generate (S111) a predictive output based on a second set of fused coded sensor data features using one or more decoder networks. Additionally, method S100 may include using one or more processors of the second hardware platform of the vehicle to transmit (S113) the generated predictive output to one or more downstream functions of the autonomous driving system configured to control the vehicle based on the generated predictive output. In other words, if the first hardware platform is inoperable (i.e., encountering a hardware or software error), the second hardware platform generates (S111) a predictive output using a first-level fused (S106) representation of the vehicle's surrounding environment.
[0078] Similarly, in some embodiments, method S100 includes, in response to the first hardware platform being operational and the second hardware platform being inoperable, using one or more processors of the first hardware platform of the vehicle to generate (S110) a predictive output based on a first set of fused coded sensor data features using one or more decoder networks. Furthermore, method S100 may include using one or more processors of the first hardware platform of the vehicle to transmit (S112) the generated predictive output to one or more downstream functions of the autonomous driving system configured to control the vehicle based on the generated predictive output. As described above, if the second hardware platform is inoperable (i.e., encountering a hardware or software error), the first hardware platform generates (S110) a predictive output using a first-level fused (S105) representation of the vehicle's surrounding environment.
[0079] This provides safety-related redundancy for the vehicle's ADS, while still enabling the necessary predictive outputs to be delivered smoothly and efficiently, ensuring that the ADS can accurately and safely control the vehicle. Each hardware platform may be configured with one set of decoder networks for generating predictive outputs based on the second-level fused (S107) encoded features, and another set of decoder networks for generating predictive outputs based on the first-level fused (S105, S106) encoded features. However, in some embodiments, the same decoder networks are used to process both the second-level fused (S107) and first-level fused (S105, S106) encoded features. The specific setup will depend on how the decoder networks are trained and the provided system specifications.
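The three operating cases of method S100 can be summarized as a small dispatch function. This is a sketch only: `plan_prediction` and its return format are hypothetical, `fusion_host` stands for whichever platform has been chosen to run second-level fusion, and raising an error when neither platform is operable is an assumption, since that case is not described here.

```python
def plan_prediction(first_operable, second_operable, fusion_host="first"):
    """Return which platform decodes, which fused feature set it decodes,
    and the method steps involved, given the platform health reports."""
    if first_operable and second_operable:
        # Both platforms healthy: second-level fusion, then decoding.
        return {"platform": fusion_host, "features": "third",
                "steps": ("S107", "S108", "S109")}
    if first_operable:
        # Second platform failed: decode the first-level fused features (S105).
        return {"platform": "first", "features": "first",
                "steps": ("S110", "S112")}
    if second_operable:
        # First platform failed: decode the first-level fused features (S106).
        return {"platform": "second", "features": "second",
                "steps": ("S111", "S113")}
    # Assumption: not covered by the description above.
    raise RuntimeError("no operable hardware platform")


nominal = plan_prediction(True, True, fusion_host="second")
degraded = plan_prediction(True, False)
```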
[0080] Executable instructions for performing these functions may optionally be included in a non-transitory computer-readable storage medium or in another computer program product configured to be executed by one or more processors.
[0081] Figure 2a, Figure 2b, Figure 3a, and Figure 3b are schematic block diagram representations of a system 10 for generating predictive outputs for an autonomous driving system 310 of a vehicle 1, according to some embodiments. System 10 includes two separate hardware platforms 210, 211 having control circuitry (e.g., one or more processors) 11a, 11b, which are configured to perform the functions of the method S100 disclosed herein, wherein these functions may be included in a non-transitory computer-readable storage medium 12a, 12b or other computer program product configured to be executed by the control circuitry 11a, 11b. In other words, each hardware platform 210, 211 of system 10 includes one or more memory storage areas 12a, 12b containing program code, which, together with the one or more processors 11a, 11b, cause system 10 to perform method S100 according to any of the embodiments disclosed herein.
[0082] In more detail, Figure 2a and Figure 2b depict scenarios in which both hardware platforms 210 and 211 are operable, and schematically illustrate the information flow through the system according to some embodiments. Figure 3a depicts a scenario where the first hardware platform 210 is operable and the second hardware platform 211 is inoperable. Figure 3b depicts a scenario where the first hardware platform 210 is inoperable and the second hardware platform 211 is operable. Both of these figures schematically illustrate the information flow through the system according to some embodiments.
[0083] Therefore, system 10 includes a first hardware platform 210 comprising a first sensor cluster 324a and one or more processors 11a, the processors 11a being configured to encode a first sensor dataset generated by the first sensor cluster of the vehicle using one or more first encoder networks 201a, and to fuse the encoded first sensor dataset using a first fusion network 202a to form a first set of fused encoded sensor data features. System 10 further includes a second hardware platform 211 comprising a second sensor cluster 324b and one or more processors 11b, the processors 11b being configured to encode a second sensor dataset generated by the second sensor cluster of the vehicle using one or more second encoder networks 201b, and to fuse the encoded second sensor dataset using a second fusion network 202b to form a second set of fused encoded sensor data features. The second sensor cluster 324b comprises different sensors than the first sensor cluster 324a. In some embodiments, each sensor within a sensor cluster is a different modality than the other sensors within the cluster.
[0084] Furthermore, one or more processors 11a of the first hardware platform 210 or one or more processors 11b of the second hardware platform 211 are further configured to use a third sensor fusion network 206 to fuse the first set of fused coded sensor data features and the second set of fused coded sensor data features to form a third set of fused coded sensor data features, and to use one or more decoder networks 203a, 203b to generate a prediction output based on the third set of fused coded sensor data features.
[0085] Furthermore, one or more processors 11a of the first hardware platform 210 or one or more processors 11b of the second hardware platform 211 may be further configured to transmit the generated predictive output to one or more downstream functions of the autonomous driving system 310 that are configured to control the vehicle based on the generated predictive output.
[0086] Accordingly, as illustrated in Figure 2a, in response to both hardware platforms 210 and 211 being operable, one or more processors 11a of the first hardware platform can be configured to receive the second set of fused encoded sensor data features from the second hardware platform 211, fuse the first set of fused encoded sensor data features and the second set of fused encoded sensor data features using a third sensor fusion network 206 to form a third set of fused encoded sensor data features, and generate a prediction output based on the third set of fused encoded sensor data features using one or more decoder networks 203a. Similarly, as illustrated in Figure 2b, in response to both hardware platforms 210 and 211 being operable, one or more processors 11b of the second hardware platform can be configured to receive the first set of fused encoded sensor data features from the first hardware platform 210, fuse the first set of fused encoded sensor data features and the second set of fused encoded sensor data features using a third sensor fusion network 206 to form a third set of fused encoded sensor data features, and generate a prediction output based on the third set of fused encoded sensor data features using one or more decoder networks 203b.
[0087] Therefore, when both hardware platforms are operable, one platform transmits its first-level fused representation of the surrounding environment to the other platform. The receiving platform then performs the second-level fusion operation by combining its own first-level fused representation with the one it received. Thus, the final prediction output is based on all available sensor measurement data for the vehicle, while maintaining lower bandwidth utilization between platforms compared to directly transmitting sensor data or encoded sensor data from one platform to another.
[0088] Furthermore, as illustrated in Figure 3a, in response to the first hardware platform 210 being operable and the second hardware platform 211 being inoperable, one or more processors 11a of the first hardware platform 210 of vehicle 1 can be configured to generate a predictive output based on the first set of fused encoded sensor data features using one or more decoder networks 203a. In other words, if the second hardware platform exhibits a hardware or software failure, the first hardware platform can provide a predictive output based on its own sensor measurement data, effectively skipping the second-level fusion. This achieves the redundancy required for the safe operation of the ADS, and the system maintains sufficient functionality even if one of the hardware platforms fails.
[0089] Similarly, as illustrated in Figure 3b, in response to the first hardware platform 210 being inoperable and the second hardware platform 211 being operable, one or more processors 11b of the second hardware platform 211 of vehicle 1 may be configured to generate a predictive output based on the second set of fused encoded sensor data features using one or more decoder networks 203b.
[0090] The proposed System 10 can be implemented based on a transformer architecture and the concept of a "query." A "query" refers to a component involved in the self-attention mechanism within a transformer-type neural network, with other related components including keys and values. Here, a query is the vector representation of the current token that the model focuses on at a specific step in the transformer's attention mechanism. Essentially, a query asks "how much attention should I give to other tokens" when processing the input sequence. Each token in the input sequence also has a corresponding key, which represents a vector summarizing the importance of that token relative to other tokens based on specific features. The value is a vector containing the actual content or information to be conveyed or focused on during the attention process. Accordingly, for each token, the query vector is compared with the key vectors of all other tokens in the sequence. This comparison is typically achieved by calculating the dot product between the query and the key. The result is a set of attention scores that determine how much attention the token (represented by the query) should give to other tokens (represented by their keys). These attention scores are then used to weight the corresponding values, which represent the actual information conveyed through the attention layer. This self-attention mechanism enables the transformer to capture long-range dependencies and associations between features in a data sequence.
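The query, key, and value interaction described above can be sketched as a single attention step. This is a minimal illustration: the softmax normalization and the scaling of the dot products by the square root of the vector dimension follow common transformer practice and are assumptions here, as are the token count, dimensions, and random projection weights.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention(queries, keys, values):
    """queries: (n_q, d); keys: (n_k, d); values: (n_k, d_v)."""
    # Dot product between every query and every key gives the attention scores.
    scores = queries @ keys.T / np.sqrt(queries.shape[-1])
    weights = softmax(scores, axis=-1)
    # The scores weight the values, which carry the actual information.
    return weights @ values, weights


rng = np.random.default_rng(0)
tokens = rng.standard_normal((5, 16))   # 5 hypothetical tokens of dimension 16
w_q, w_k, w_v = (rng.standard_normal((16, 16)) / 4.0 for _ in range(3))
out, weights = attention(tokens @ w_q, tokens @ w_k, tokens @ w_v)
```

Each row of `weights` sums to one and states how much the corresponding token attends to every token in the sequence.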
[0091] Accordingly, a query can be interpreted as a high-dimensional learnable vector capable of extracting information from different data sources. According to some embodiments, a first set of queries can be used to extract information from several time steps of an encoded first sensor dataset for a first hardware platform. This enables first-level spatial and temporal fusion. Different queries in the first set can be trained to extract information related to different relevant entities in the surrounding environment (such as vehicles and pedestrians and their behavior, lane markings, traffic signs, etc.). Once this information is extracted, the set of queries corresponds to a highly compressed, low-bandwidth representation of the surrounding environment, which can be sent to other hardware platforms for further processing (second-level fusion). A similar set of queries (a second set of queries) can be used to extract temporal and spatial information from an encoded second sensor dataset on a second hardware platform. Subsequently, these queries can be sent to other hardware platforms for further processing (second-level fusion).
[0092] The second-level fusion process corresponds to the fusion between the first and second sets of queries. Because it does not involve initial sensor data encoding, it is expected to require less computation than the first-level fusion process. Several ways exist to perform the second-level fusion between the two sets of queries. For example, two different sets of queries can extract information from each other. Alternatively, there can be another set of queries (a third set of queries) used to extract information from the first and second sets of queries.
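The query-based first-level and second-level fusion can be sketched with plain cross-attention. This is a simplified illustration under stated assumptions: keys and values are both taken directly from the attended features without learned projections, the query sets are random stand-ins for trained parameters, and all sizes (time steps, token counts, number of queries) are hypothetical. It shows the variant in which a third set of queries reads the first and second sets.

```python
import numpy as np


def cross_attend(queries, memory):
    """Each query gathers a weighted mix of the memory rows."""
    scores = queries @ memory.T / np.sqrt(queries.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ memory


rng = np.random.default_rng(0)
d = 32
# Encoded sensor features per platform: (time steps, feature tokens, d).
enc_first = rng.standard_normal((3, 200, d))
enc_second = rng.standard_normal((3, 150, d))

# Three query sets; random values stand in for learned vectors.
q_first = rng.standard_normal((10, d))
q_second = rng.standard_normal((10, d))
q_third = rng.standard_normal((10, d))

# First-level fusion on each platform: the queries read all time steps at once,
# which gives spatial and temporal fusion in one step.
fused_first = cross_attend(q_first, enc_first.reshape(-1, d))
fused_second = cross_attend(q_second, enc_second.reshape(-1, d))

# Second-level fusion: the third query set reads both first-level query sets.
fused_third = cross_attend(q_third, np.concatenate([fused_first, fused_second]))

values_sent = fused_second.size      # values crossing the platform link
values_encoded = enc_second.size     # values that would cross without fusion
```

In this sketch the second platform sends 10 query vectors instead of its full set of encoded features, which illustrates why sharing only the fused queries keeps the inter-platform bandwidth low.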
[0093] Figure 4 is a schematic block diagram representation depicting an end-to-end training setup for a system used to generate predictive outputs for an autonomous driving system of a vehicle, according to some embodiments. Here, datasets are represented by elongated hexagons, while algorithms or networks are represented by rectangles. Furthermore, the downstream data flow (inference process) is represented by solid arrows, while the data flow (training signal) used for end-to-end training of the encoder networks, fusion algorithms, and decoder networks is represented by dashed lines.
[0094] As mentioned above, encoder networks 201a and 201b can be shared among different sensors in a sensor cluster, or they can be sensor-specific. Furthermore, decoder network 203 can be shared in three scenarios during application: decoding second-level fused codes, decoding first-level fused codes from a first hardware platform, and decoding first-level fused codes from a second hardware platform. However, in some embodiments, three separate decoder networks 203a, 203b, and 203c are provided to handle the fused codes in these three scenarios.
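The three decoding scenarios can be sketched as a small selection routine. This is an illustrative sketch only: the function names, the use of `None` to mark an unavailable platform, and the averaging placeholder standing in for the second-level fusion network are all assumptions, not part of the disclosure.

```python
from typing import Callable, Optional, Sequence, Tuple

Features = Sequence[float]

def select_fused_code(
    first: Optional[Features],
    second: Optional[Features],
    second_level_fusion: Callable[[Features, Features], Features],
) -> Tuple[str, Features]:
    # Returns (scenario, code to decode). None marks a platform whose
    # first-level fused code is unavailable (hypothetical convention).
    if first is not None and second is not None:
        return "second_level", second_level_fusion(first, second)
    if first is not None:
        return "first_platform_only", first
    if second is not None:
        return "second_platform_only", second
    raise RuntimeError("no fused code available from either platform")

def average_fusion(a: Features, b: Features) -> Features:
    # Placeholder for the second-level fusion network (assumption).
    return [(x + y) / 2 for x, y in zip(a, b)]
```

A shared decoder would accept whichever code is returned, whereas an embodiment with three separate decoders could dispatch on the returned scenario label.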
[0095] Furthermore, the training of decoder network 203 or, if three separate decoders are used, of all decoder networks 203a, 203b, and 203c within a single decoder group, can be performed simultaneously. For example, a first decoder network can be trained to generate predicted outputs in the form of pedestrian detection and classification, a second decoder network can be trained to generate predicted outputs in the form of lane marking detection and classification, and a third decoder network can be trained to generate expected trajectories of dynamic objects in the vehicle's surrounding environment. All three decoders (solving different tasks) can be trained synchronously. However, in some embodiments, the decoder networks can instead be trained individually.
[0096] The entire architecture, including encoder networks 201a, 201b, fusion networks 202a, 202b, 206, and decoder networks 203a, 203b, 203c, can be trained together in an end-to-end manner using supervised learning techniques. More specifically, during the training phase, there exists a training dataset comprising sensor datasets (measurement data from corresponding sensors in each sensor cluster) for forming input objects, and corresponding labeled output sets for each specific perception task for forming the desired output. Therefore, for each sensor dataset, there exists a desired output, which is the output that the entire system, including all networks, is expected to generate for a specific perception task. For example, if the system is used for an object detection task, the labeled dataset could include a 3D scene with manually added 3D bounding boxes.
[0097] Assuming encoder networks 201a, 201b, fusion networks 202a, 202b, 206, and decoder networks 203a, 203b, 203c have been initialized and their parameters set, the input object is fed through the processing chain. This step is commonly referred to as forward propagation. Next, loss calculation is performed, feeding the predicted output and the expected output (labeled task i) together as input to the loss function (also known as the cost function). The loss function represents a specific mathematical function used to quantify the difference between the predicted values (predicted output) and the actual true values (labeled task i) in the training dataset.
[0098] Depending on the specific perception task, different loss functions can be used. For example, cross-entropy loss, classification loss, regression loss, dice loss, or a combination thereof can be used. Once the difference between the predicted and actual values has been quantified, backpropagation is used to compute the gradient of the loss value relative to the model parameters of encoder networks 201a, 201b, fusion networks 202a, 202b, 206, and decoder networks 203a, 203b, 203c. Next, optimization algorithms such as Adaptive Moment Estimation (Adam), Stochastic Gradient Descent (SGD), or Root Mean Square Propagation (RMSprop) are used to update the model parameters of encoder networks 201a, 201b, fusion networks 202a, 202b, 206, and decoder networks 203a, 203b, 203c. This update aims to minimize the loss function. The process is then iterated by repeating the forward propagation, loss calculation, backpropagation, and parameter update steps over multiple rounds until the model performance converges.
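The forward propagation, loss calculation, backpropagation, and parameter update cycle can be sketched on a toy model. The sketch below makes several assumptions that are not part of the disclosure: small linear maps stand in for the encoder and decoder networks, a parameter-free sum stands in for the fusion networks, the loss is a mean squared error (a regression loss), the optimizer is plain gradient descent rather than Adam or RMSprop, and the data are synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)
N, H = 64, 5                                   # samples and feature width (assumed)
x_a = rng.normal(size=(N, 4))                  # synthetic first sensor dataset
x_b = rng.normal(size=(N, 3))                  # synthetic second sensor dataset
# Synthetic labels produced by a hidden linear rule, so the toy model can fit them.
y = x_a @ (0.5 * rng.normal(size=(4, 2))) + x_b @ (0.5 * rng.normal(size=(3, 2)))

params = {
    "enc_a": 0.3 * rng.normal(size=(4, H)),    # stand-in for encoder on platform 1
    "enc_b": 0.3 * rng.normal(size=(3, H)),    # stand-in for encoder on platform 2
    "dec": 0.3 * rng.normal(size=(H, 2)),      # stand-in for decoder
}

def train_step(p, lr=0.05):
    # Forward propagation: encode, fuse (sum), decode.
    fused = x_a @ p["enc_a"] + x_b @ p["enc_b"]
    pred = fused @ p["dec"]
    # Loss calculation: mean squared error against the labels.
    err = pred - y
    loss = float(np.mean(err ** 2))
    # Backpropagation: gradients of the loss with respect to every parameter.
    d_pred = 2.0 * err / err.size
    d_fused = d_pred @ p["dec"].T
    grads = {
        "enc_a": x_a.T @ d_fused,
        "enc_b": x_b.T @ d_fused,
        "dec": fused.T @ d_pred,
    }
    # Parameter update: one gradient descent step on all parts jointly.
    for name in p:
        p[name] -= lr * grads[name]
    return loss

losses = [train_step(params) for _ in range(300)]
```

Because the gradient flows from the decoder output back through the fusion step into both encoders, a single loss value updates every part of the chain, which is the essence of end-to-end training.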
[0099] Figure 5 is a schematic illustration of a vehicle 1 equipped with an ADS, which includes such a system 10. As used herein, "vehicle" means any form of motorized transportation. For example, vehicle 1 can be any road vehicle, such as a car (as illustrated herein), a motorcycle, a (freight) truck, a bus, etc. However, in some embodiments, the vehicle can be in the form of an autonomous aircraft or a vessel.
[0100] System 10 includes two separate hardware platforms, each including its own control circuits 11a, 11b and memories 12a, 12b. Each control circuit 11a, 11b may physically comprise a separate circuit device. Alternatively, the control circuits 11a, 11b may be distributed across several circuit devices. As an example, system 10 may share its control circuits 11a, 11b with other parts of vehicle 1 (e.g., ADS 310). Furthermore, system 10 may form part of ADS 310; that is, system 10 may be implemented as a module or feature of ADS. In other words, ADS can be executed by either of the two hardware platforms.
[0101] Control circuits 11a and 11b may include one or more processors, such as a central processing unit (CPU), a graphics processing unit (GPU), a microcontroller, or a microprocessor. These processors may be configured to execute program code stored in memories 12a and 12b to perform various functions and operations of vehicle 1 in addition to the methods disclosed herein. The one or more processors may be or include any number of hardware components for performing data or signal processing or for executing computer code stored in memories 12a and 12b. Memories 12a and 12b may optionally include high-speed random access memory, such as dynamic random access memory (DRAM), static random access memory (SRAM), double-data-rate random access memory (DDR RAM), or other random access solid-state memory devices; and may optionally include non-volatile memory, such as one or more disk storage devices, optical disk storage devices, flash memory devices, or other non-volatile solid-state storage devices. Memories 12a and 12b may include database components, object code components, script components, or any other type of information structure used to support the various activities described herein.
[0102] In the illustrated example, memories 12a and 12b further store map data 308. Map data 308 can be used, for example, by the ADS 310 of vehicle 1 to perform autonomous functions of vehicle 1. Map data 308 may include high-definition (HD) map data. It is contemplated that memories 12a and 12b, even though illustrated as separate elements from ADS 310, may also be provided as integral elements of ADS 310. In other words, according to the exemplary embodiment, any distributed or local memory device can be used to implement the concepts of the present invention. Similarly, control circuits 11a and 11b can be distributed, for example, such that one or more processors of control circuits 11a and 11b are provided as integrated elements of ADS 310 or any other system of vehicle 1. In other words, according to the exemplary embodiment, any distributed or local control circuit device can be used to implement the concepts of the present invention. ADS 310 is configured to perform autonomous or semi-autonomous functions and operations of vehicle 1. ADS 310 may include multiple modules, each assigned a different function of ADS 310.
[0103] Vehicle 1 comprises multiple components that are typically found in autonomous or semi-autonomous vehicles. It will be understood that vehicle 1 may have any combination of the various elements shown in Figure 5. Furthermore, vehicle 1 may include more elements than those shown in Figure 5. Although various elements are shown herein as being located inside vehicle 1, one or more of these elements may be located outside vehicle 1. For example, map data may be stored in a remote server and accessed by various components of vehicle 1 via communication system 326. Furthermore, even though various elements are depicted herein in a specific arrangement, as will be readily understood by those skilled in the art, various elements may be implemented in different arrangements. It should be further noted that various elements may be communicatively connected to each other in any suitable manner. Because the elements of vehicle 1 can be implemented in several different ways, the vehicle 1 of Figure 5 should be considered only as an illustrative example.
[0104] Vehicle 1 further includes a sensor system 320. Sensor system 320 is configured to acquire sensing data about the vehicle itself or its surroundings. Sensor system 320 may include, for example, a Global Navigation Satellite System (GNSS) module 322 (e.g., GPS), configured to collect geographic location data of vehicle 1. Sensor system 320 may further include one or more sensors 324. The one or more sensors 324 may be any type of on-vehicle sensor, such as cameras, lidar, radar, ultrasonic sensors, gyroscopes, accelerometers, odometers, etc. It should be understood that sensor system 320 may also provide the possibility of acquiring sensing data directly or via dedicated sensor control circuitry in vehicle 1. Here, however, the sensors of the sensor system are separated into at least two independent sensor clusters, each of which is associated with a corresponding hardware platform.
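The separation of sensors into independent clusters, one per hardware platform, can be sketched as a simple data structure with a disjointness check. The class, the function, and the sensor and platform names below are hypothetical and serve only to illustrate the idea that no sensor belongs to more than one cluster.

```python
from dataclasses import dataclass
from typing import FrozenSet

@dataclass(frozen=True)
class SensorCluster:
    platform: str                # hardware platform the cluster is attached to
    sensors: FrozenSet[str]      # identifiers of the sensors in the cluster

def clusters_are_disjoint(*clusters: SensorCluster) -> bool:
    # True when no sensor identifier appears in more than one cluster.
    seen = set()
    for cluster in clusters:
        if seen & cluster.sensors:
            return False
        seen |= cluster.sensors
    return True

# Hypothetical assignment: each cluster mixes modalities, but no sensor is shared.
cluster_a = SensorCluster("platform_1", frozenset({"front_camera", "front_radar"}))
cluster_b = SensorCluster("platform_2", frozenset({"rear_camera", "roof_lidar"}))
```

Calling `clusters_are_disjoint(cluster_a, cluster_b)` on this assignment returns `True`; adding `"front_camera"` to the second cluster would make it return `False`.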
[0105] Vehicle 1 further includes a communication system 326. The communication system 326 is configured to communicate with external units such as other vehicles (i.e., via vehicle-to-vehicle (V2V) communication protocols), remote servers (e.g., cloud servers), databases, or other external devices (i.e., via vehicle-to-infrastructure (V2I) or vehicle-to-everything (V2X) communication protocols). The communication system 326 can communicate using one or more communication technologies and may include one or more antennas (not shown). Cellular communication technologies can be used for long-range communication, such as to remote servers or cloud computing systems. Furthermore, if the cellular communication technology used has low latency, it can also be used for V2V, V2I, or V2X communication. Examples of cellular radio technologies include Global System for Mobile Communications (GSM), General Packet Radio Service (GPRS), Enhanced Data Rates for GSM Evolution (EDGE), Long Term Evolution (LTE), 5G, and 5G NR (New Radio), as well as future cellular solutions. However, in some solutions, short-to-medium range communication technologies such as wireless local area networks (WLANs) (e.g., solutions based on IEEE 802.11) can be used to communicate with other vehicles near vehicle 1 or with local infrastructure components. The European Telecommunications Standards Institute (ETSI) is developing cellular standards for vehicle communication, and 5G is considered a suitable solution, for example, due to its low latency and efficient handling of high bandwidth and communication channels.
[0106] The communication system 326 can accordingly provide the possibility of sending outputs to and / or receiving inputs from remote locations (e.g., remote operators or control centers) via one or more antennas. Furthermore, the communication system 326 can be further configured to allow various components of vehicle 1 to communicate with each other. As an example, the communication system can provide a local network setup such as CAN bus, I2C, Ethernet, and fiber optics. Local communication within the vehicle can also be wireless, using protocols such as WiFi, LoRa, Zigbee, Bluetooth, or similar medium / short-range technologies.
[0107] Vehicle 1 further includes a control system 328. The control system 328 is configured to control the handling of vehicle 1. The control system 328 includes a steering module 330 configured to control the direction of travel of vehicle 1. The control system 328 further includes a throttle module 332 configured to control the actuation of the throttle valve of vehicle 1. The control system 328 further includes a braking module 334 configured to control the actuation of the brakes of vehicle 1. The various modules of the control system 328 can also receive manual input from the driver of vehicle 1 (i.e., from the steering wheel, accelerator pedal, and brake pedal, respectively). However, the control system 328 can be communicatively connected to the vehicle's ADS 310 to receive instructions on how the various modules of the control system 328 should operate. Therefore, the ADS 310 can control the handling of vehicle 1, for example, via a decision and control module 318.
[0108] ADS 310 may include a positioning module 312 or a positioning block / system. The positioning module 312 is configured to determine and / or monitor the geographic location and direction of travel of vehicle 1, and may utilize data from sensor system 320 (e.g., data from GNSS module 322). Alternatively or in combination, the positioning module 312 may utilize data from one or more sensors 324. The positioning system may alternatively be implemented as real-time kinematic (RTK) GPS to improve accuracy.
[0109] ADS 310 may further include a perception module 314 or a perception block / system 314. The perception module 314 may refer to any known module and / or functional group, for example, included in one or more electronic control modules and / or nodes of vehicle 1, adapted and / or configured to interpret driving-related sensing data of vehicle 1 to identify, for example, obstacles, lanes, relevant signs, appropriate navigation paths, etc. Therefore, the perception module 314 may be adapted to rely on and receive input from multiple data sources (e.g., motor vehicle imaging, image processing, computer vision, and / or in-vehicle networking, etc.) and combine, for example, sensing data from sensor system 320. System 10 may, for example, be implemented in the perception module 314 of ADS 310.
[0110] The positioning module 312 and / or the perception module 314 may be communicatively connected to the sensor system 320 to receive sensing data from the sensor system 320. The positioning module 312 and / or the perception module 314 may further transmit control commands to the sensor system 320. The path planning module 316 and / or the perception module 314 may be implemented according to the various embodiments disclosed herein.
[0111] The present invention has been described above with reference to specific embodiments. However, other embodiments besides those described above are also possible and within the scope of the present invention. Within the scope of the present invention, method steps different from those described above, performed by hardware or software, can be provided. Therefore, according to an exemplary embodiment, a non-transitory computer-readable storage medium is provided that stores one or more programs configured to be executed by one or more processors of a vehicle control system, the one or more programs including instructions for performing the method according to any of the embodiments discussed above. Alternatively, according to another exemplary embodiment, a cloud computing system can be configured to perform any of the methods presented herein. The cloud computing system may include distributed cloud computing resources that jointly perform the methods presented herein under the control of one or more computer program products.
[0112] Generally, computer-accessible media can include any tangible or non-transitory storage medium or memory medium, such as electronic, magnetic, or optical media (e.g., a disk or CD / DVD-ROM coupled to a computer system via a bus). As used herein, the terms “tangible” and “non-transitory” are intended to describe computer-readable storage media (or “memory”) excluding propagating electromagnetic signals, but are not intended to otherwise limit the type of physical computer-readable storage device covered by the term “computer-readable medium or memory.” For example, the terms “non-transitory computer-readable medium” or “tangible memory” are intended to cover types of storage devices that do not necessarily store information permanently, including, for example, random access memory (RAM). Program instructions and data stored in a non-transitory form on a tangible computer-accessible storage medium can be further transmitted via transmission media or signals such as electronic, electromagnetic, or digital signals, which can be conveyed via communication media such as networks and / or wireless links.
[0113] One or more processors 11a, 11b (associated with system 10) may be, or may include, any number of hardware components for performing data or signal processing operations or for executing computer code stored in memories 12a, 12b. System 10 has associated memories 12a, 12b, and memories 12a, 12b may be one or more devices for storing data and / or computer code to perform or facilitate the various methods described herein. Memory may include volatile memory or non-volatile memory. Memories 12a, 12b may store database components, object code components, script components, or any other type of information structure for supporting the various activities described herein. According to exemplary embodiments, any distributed storage device or local storage device may be used with the systems and methods described herein. According to exemplary embodiments, memories 12a, 12b may be communicatively connected to processors 11a, 11b (e.g., via circuitry or any other wired, wireless, or network connection) and include computer code for performing one or more processes described herein.
[0114] It should be noted that no reference numerals in the drawings limit the scope of the claims. The invention can be implemented at least in part by means of both hardware and software, and several “apparatus” or “units” can be represented by the same hardware item.
[0115] Although the accompanying drawings may show a specific order of method steps, the order of these steps may differ from the depicted order. Furthermore, two or more steps may be performed simultaneously or partially simultaneously. This variation will depend on the chosen software and hardware system and the designer's choices. All these variations are within the scope of this invention. Similarly, software implementation can be accomplished using standard programming techniques with rule-based logic and other logic to perform various receiving, transmitting, encoding, fusing, and generating steps. The embodiments mentioned and described above are given by way of example only and should not be considered as limiting the invention. Other solutions, uses, purposes, and functions within the scope of the invention claimed in the claims of this patent described below will be apparent to those skilled in the art.
Claims
1. A computer-implemented method (S100) for generating predictive outputs for an autonomous driving system of a vehicle, the computer-implemented method comprising: by one or more processors of a first hardware platform of the vehicle: encoding (S103), using one or more first encoder networks, a first sensor dataset generated by a first sensor cluster of the vehicle; and fusing (S105), using a first sensor fusion network, the encoded first sensor dataset to generate a first set of fused encoded sensor data features; by one or more processors of a second hardware platform of the vehicle: encoding (S104), using one or more second encoder networks, a second sensor dataset generated by a second sensor cluster of the vehicle; and fusing (S106), using a second sensor fusion network, the encoded second sensor dataset to generate a second set of fused encoded sensor data features; wherein the first hardware platform and the second hardware platform are separate hardware platforms, and each sensor in the first sensor cluster is different from the sensors in the second sensor cluster; the computer-implemented method further comprising: by the one or more processors of the first hardware platform or the one or more processors of the second hardware platform of the vehicle: fusing (S107), using a third sensor fusion network, the first set of fused encoded sensor data features and the second set of fused encoded sensor data features to generate a third set of fused encoded sensor data features; and generating (S108), using one or more decoder networks, a predictive output based on the third set of fused encoded sensor data features.
2. The computer-implemented method (S100) according to claim 1, further comprising: by the one or more processors of the first hardware platform or the one or more processors of the second hardware platform of the vehicle: transmitting (S109) the generated predictive output to one or more downstream functions of the autonomous driving system, the downstream functions being configured to control the vehicle based on the generated predictive output.
3. The computer-implemented method (S100) according to claim 1, further comprising: in response to both the first hardware platform and the second hardware platform being operable: by the one or more processors of the first hardware platform of the vehicle: receiving the second set of fused encoded sensor data features from the second hardware platform; fusing (S107), using the third sensor fusion network, the first set of fused encoded sensor data features and the second set of fused encoded sensor data features to generate the third set of fused encoded sensor data features; and generating (S108), using one or more decoder networks, the predictive output based on the third set of fused encoded sensor data features; or by the one or more processors of the second hardware platform of the vehicle: receiving the first set of fused encoded sensor data features from the first hardware platform; fusing (S107), using the third sensor fusion network, the first set of fused encoded sensor data features and the second set of fused encoded sensor data features to generate the third set of fused encoded sensor data features; and generating (S108), using one or more decoder networks, the predictive output based on the third set of fused encoded sensor data features.
4. The computer-implemented method (S100) according to claim 3, further comprising: in response to the first hardware platform being inoperable and the second hardware platform being operable: by the one or more processors of the second hardware platform of the vehicle: generating (S111), using one or more decoder networks, the predictive output based on the second set of fused encoded sensor data features; and in response to the first hardware platform being operable and the second hardware platform being inoperable: by the one or more processors of the first hardware platform of the vehicle: generating (S110), using one or more decoder networks, the predictive output based on the first set of fused encoded sensor data features.
5. The computer-implemented method (S100) according to claim 1, wherein the one or more first encoder networks, the one or more second encoder networks, the first sensor fusion network, the second sensor fusion network, the third sensor fusion network, and the one or more decoder networks are trained end-to-end.
6. The computer-implemented method (S100) according to claim 1, wherein the first sensor cluster includes sensors of different modalities, and the second sensor cluster includes sensors of different modalities.
7. A computer program product comprising instructions that, when executed by a computing device of a vehicle, cause the computing device to perform the computer-implemented method (S100) according to any one of claims 1 to 6.
8. A non-transitory computer-readable storage medium storing instructions that, when executed by a computing device of a vehicle, cause the computing device to perform the computer-implemented method (S100) according to any one of claims 1 to 6.
9. A system (10) for generating predictive outputs for an autonomous driving system (310) of a vehicle (1), the system comprising: a first hardware platform (210) comprising a first sensor cluster (324a) and one or more processors (11a), said one or more processors (11a) being configured to: encode, using one or more first encoder networks (201a), a first sensor dataset generated by the first sensor cluster of the vehicle; and fuse, using a first sensor fusion network (202a), the encoded first sensor dataset to form a first set of fused encoded sensor data features; a second hardware platform (211) comprising a second sensor cluster (324b) and one or more processors (11b), said one or more processors (11b) being configured to: encode, using one or more second encoder networks (201b), a second sensor dataset generated by the second sensor cluster of the vehicle; and fuse, using a second sensor fusion network (202b), the encoded second sensor dataset to form a second set of fused encoded sensor data features; wherein the first hardware platform and the second hardware platform are separate hardware platforms, and each sensor in the first sensor cluster (324a) is different from the sensors in the second sensor cluster (324b); and wherein the one or more processors (11a) of the first hardware platform (210) or the one or more processors (11b) of the second hardware platform (211) are further configured to: fuse, using a third sensor fusion network (206), the first set of fused encoded sensor data features and the second set of fused encoded sensor data features to form a third set of fused encoded sensor data features; and generate, using one or more decoder networks (203a, 203b), a predictive output based on the third set of fused encoded sensor data features.
10. The system (10) according to claim 9, wherein the one or more processors (11a) of the first hardware platform or the one or more processors (11b) of the second hardware platform of the vehicle are further configured to: transmit the generated predictive output to one or more downstream functions of the autonomous driving system (310), the downstream functions being configured to control the vehicle based on the generated predictive output.
11. The system (10) according to claim 9 or 10, wherein, in response to both the first hardware platform (210) and the second hardware platform (211) being operable: the one or more processors (11a) of the first hardware platform (210) of the vehicle (1) are configured to: receive the second set of fused encoded sensor data features from the second hardware platform (211); fuse, using the third sensor fusion network (206), the first set of fused encoded sensor data features and the second set of fused encoded sensor data features to form the third set of fused encoded sensor data features; and generate, using one or more decoder networks (203a), the predictive output based on the third set of fused encoded sensor data features; or the one or more processors (11b) of the second hardware platform (211) of the vehicle (1) are configured to: receive the first set of fused encoded sensor data features from the first hardware platform (210); fuse, using the third sensor fusion network (206), the first set of fused encoded sensor data features and the second set of fused encoded sensor data features to form the third set of fused encoded sensor data features; and generate, using one or more decoder networks (203b), the predictive output based on the third set of fused encoded sensor data features.
12. The system (10) according to claim 11, wherein: in response to the first hardware platform (210) being inoperable and the second hardware platform (211) being operable, the one or more processors (11b) of the second hardware platform (211) of the vehicle (1) are configured to generate, using one or more decoder networks (203b), the predictive output based on the second set of fused encoded sensor data features; and in response to the first hardware platform (210) being operable and the second hardware platform (211) being inoperable, the one or more processors (11a) of the first hardware platform (210) of the vehicle (1) are configured to generate, using one or more decoder networks (203a), the predictive output based on the first set of fused encoded sensor data features.
13. The system (10) according to claim 9, wherein, The one or more first encoder networks (201a), the one or more second encoder networks (201b), the first sensor fusion network (202a), the second sensor fusion network (202b), the third sensor fusion network (206), and the one or more decoder networks (203a, 203b) are trained end-to-end.
14. The system (10) according to claim 9, wherein the first sensor cluster (324a) includes sensors of different modalities, and the second sensor cluster (324b) includes sensors of different modalities.
15. A vehicle (1) comprising a system (10), said system comprising: a first hardware platform (210) comprising a first sensor cluster (324a) and one or more processors (11a), said one or more processors (11a) being configured to: encode, using one or more first encoder networks (201a), a first sensor dataset generated by the first sensor cluster of the vehicle; and fuse, using a first sensor fusion network (202a), the encoded first sensor dataset to form a first set of fused encoded sensor data features; a second hardware platform (211) comprising a second sensor cluster (324b) and one or more processors (11b), said one or more processors (11b) being configured to: encode, using one or more second encoder networks (201b), a second sensor dataset generated by the second sensor cluster of the vehicle; and fuse, using a second sensor fusion network (202b), the encoded second sensor dataset to form a second set of fused encoded sensor data features; wherein the first hardware platform and the second hardware platform are separate hardware platforms, and each sensor in the first sensor cluster (324a) is different from the sensors in the second sensor cluster (324b); and wherein the one or more processors (11a) of the first hardware platform (210) or the one or more processors (11b) of the second hardware platform (211) are further configured to: fuse, using a third sensor fusion network (206), the first set of fused encoded sensor data features and the second set of fused encoded sensor data features to form a third set of fused encoded sensor data features; and generate, using one or more decoder networks (203a, 203b), a predictive output based on the third set of fused encoded sensor data features.