Hardware and software architecture for autonomous driving systems

CN122220719A · Pending · Publication Date: 2026-06-16 · ZENSEACT AB

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
ZENSEACT AB
Filing Date
2025-12-16
Publication Date
2026-06-16

Abstract

Hardware and software architectures and methods for autonomous driving systems are disclosed. The methods include encoding, by a processor of a first hardware platform, using a first encoder layer, a first sensor dataset generated by a first sensor cluster to form an encoded first sensor dataset; and encoding, by a processor of a second hardware platform, using a second encoder layer, a second sensor dataset generated by a second sensor cluster to form an encoded second sensor dataset. The methods further include, by the processor of the first or second hardware platform: fusing at least a portion of the encoded first sensor dataset and at least a portion of the encoded second sensor dataset using a sensor fusion layer to form a set of fused encoded sensor data features, and generating a prediction output based on the set of fused encoded sensor data features using a decoder layer. The first and second encoder layers, the sensor fusion layer, and the decoder layer form an end-to-end trained neural network.

Description

Technical Field

[0001] The disclosed technologies relate to methods and systems for generating predictive outputs for autonomous driving systems of vehicles. In particular, but not exclusively, the disclosed technologies relate to hardware and software architectures for perception functions within autonomous driving systems to improve availability and ensure safety through redundancy in both software and hardware.

Background Technology

[0002] Research and development is ongoing in many technical areas related to both Advanced Driver Assistance Systems (ADAS) and Automated Driving (AD). ADAS and AD are collectively referred to herein by the term Automated Driving System (ADS), which corresponds to all the different levels of automation, such as those defined by SAE J3016 (1 to 5), and particularly levels 4 and 5. ADS solutions have already entered the market in most new vehicles, and their use is expected to increase in the near future. An ADS can be interpreted as a complex combination of various components, and can be defined as a system in which the perception, decision-making, and operation of the vehicle are performed by electronics and machinery rather than by a human driver, or in collaboration with a human driver, thereby introducing automation into road traffic. This includes handling of the vehicle, the destination, and perception of the surrounding environment. While the automated system has control over the vehicle, it allows the human operator to delegate all or at least some of the responsibilities to the system.

[0003] Autonomous driving systems rely on a combination of sensors, such as cameras, radar, lidar, and ultrasonic devices, to perceive and interpret their surroundings. These sensors capture various types of data that support tasks such as object detection, distance measurement, and environmental mapping. Sensor fusion is the process of integrating data from multiple sensor types to enhance the accuracy, reliability, and robustness of the system's perception capabilities. By combining information from different sources, sensor fusion helps overcome the limitations of individual sensors, such as reduced performance under low visibility or occlusion, and improves decision-making for vehicles operating under an ADS. This technology is crucial for achieving higher levels of autonomy, enabling safer and more efficient driving operations under a variety of conditions.

[0004] Therefore, there is a continuing need for improvement in autonomous driving systems, especially in optimizing sensor fusion algorithms to enhance real-time performance and ensure the safety of vehicles operating an ADS.

Summary of the Invention

[0005] The techniques disclosed herein are intended to mitigate, alleviate, or eliminate one or more defects and disadvantages in the prior art to address various issues related to safety-critical redundancy in perception and / or planning functions used in autonomous driving systems.

[0006] The various aspects and implementations of the disclosed technology are defined below and in the appended independent and dependent claims.

[0007] A first aspect of the disclosed technology includes a computer-implemented method for generating predictive outputs for an autonomous driving system of a vehicle. The computer-implemented method includes: encoding, by one or more processors of a first hardware platform of the vehicle, a first sensor dataset generated by a first sensor cluster of the vehicle using one or more first encoder layers to form an encoded first sensor dataset. The method further includes: encoding, by one or more processors of a second hardware platform of the vehicle, a second sensor dataset generated by a second sensor cluster of the vehicle using one or more second encoder layers to form an encoded second sensor dataset. The first hardware platform and the second hardware platform are separate hardware platforms, wherein each sensor in the first sensor cluster is different from the sensors in the second sensor cluster. The computer-implemented method further includes: fusing, by one or more processors of the first hardware platform of the vehicle or by one or more processors of the second hardware platform of the vehicle, at least a portion of the encoded first sensor dataset and at least a portion of the encoded second sensor dataset using one or more sensor fusion layers to form a set of fused encoded sensor data features, and generating a predictive output based on the set of fused encoded sensor data features using one or more decoder layers. The one or more first encoder layers, the one or more second encoder layers, the one or more sensor fusion layers, and the one or more decoder layers together form an end-to-end trained neural network.
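
By way of a non-limiting illustration only, the data flow of the first aspect (encode on each platform, fuse, decode) can be sketched in Python as follows. The weight values, the layer sizes, the function names (linear, relu, encode_platform_1, encode_platform_2, fuse, decode, predict), and the choice of concatenation followed by a linear map as the fusion step are all assumptions made for this sketch and are not taken from the disclosure. In an actual implementation the layers would be learned end-to-end, for example in PyTorch or TensorFlow.

```python
from typing import List

Vector = List[float]
Matrix = List[Vector]


def linear(weights: Matrix, x: Vector) -> Vector:
    """Multiply a weight matrix (given as a list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def relu(x: Vector) -> Vector:
    """Element-wise rectified linear unit."""
    return [max(0.0, v) for v in x]


# Hypothetical toy weights standing in for learned parameters.
W_ENC1: Matrix = [[0.5, 0.5, 0.0, 0.0], [0.0, 0.0, 0.5, 0.5]]  # 4 inputs -> 2 features
W_ENC2: Matrix = [[1.0, -1.0, 0.0], [0.0, 1.0, 1.0]]           # 3 inputs -> 2 features
W_FUSE: Matrix = [[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0]]  # 4 features -> 2 fused
W_DEC: Matrix = [[1.0, 1.0]]                                   # 2 fused -> 1 output


def encode_platform_1(sensor_data_1: Vector) -> Vector:
    """First encoder layer, assumed to run on the first hardware platform."""
    return relu(linear(W_ENC1, sensor_data_1))


def encode_platform_2(sensor_data_2: Vector) -> Vector:
    """Second encoder layer, assumed to run on the second hardware platform."""
    return relu(linear(W_ENC2, sensor_data_2))


def fuse(encoded_1: Vector, encoded_2: Vector) -> Vector:
    """Sensor fusion layer: concatenate both encodings, then mix them."""
    return relu(linear(W_FUSE, encoded_1 + encoded_2))


def decode(fused: Vector) -> Vector:
    """Decoder layer producing the predictive output."""
    return linear(W_DEC, fused)


def predict(sensor_data_1: Vector, sensor_data_2: Vector) -> Vector:
    """Full forward pass: encode per platform, fuse, decode."""
    encoded_1 = encode_platform_1(sensor_data_1)
    encoded_2 = encode_platform_2(sensor_data_2)
    return decode(fuse(encoded_1, encoded_2))
```

In this sketch only the two short encoded vectors would need to cross the boundary between the platforms, rather than the raw sensor datasets.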

[0008] The second aspect of the disclosed technology includes a computer program product containing instructions that, when executed by a computing device of a vehicle, cause the computing device to perform a method according to any of the embodiments disclosed herein. Utilizing this aspect of the disclosed technology offers advantages and preferred features similar to those of the other aspects.

[0009] A third aspect of the disclosed technology includes a (non-transitory) computer-readable storage medium containing instructions that, when executed by a computing device of a vehicle, cause the computing device to perform a method according to any of the embodiments disclosed herein. Utilizing this aspect of the disclosed technology offers advantages and preferred features similar to those of the other aspects.

[0010] As used herein, the term "non-transitory" is intended to describe computer-readable storage media (or "memory") that exclude propagating electromagnetic signals, but is not intended to otherwise limit the types of physical computer-readable storage devices encompassed by the phrases "computer-readable medium" or "memory." For example, the terms "non-transitory computer-readable medium" or "tangible memory" are intended to encompass types of storage devices that do not necessarily store information permanently, including, for example, random access memory (RAM). Program instructions and data stored in a non-transitory form on a tangible computer-accessible storage medium can be further transmitted via a transmission medium or a signal such as an electrical signal, electromagnetic signal, or digital signal, which can be conveyed via a communication medium such as a network and / or a wireless link. Therefore, as used herein, the term "non-transitory" is a limitation on the medium itself (i.e., tangible, not a signal), not a limitation on the persistence of data storage (e.g., RAM versus ROM).

[0011] A fourth aspect of the disclosed technology includes a system for generating predictive outputs for an autonomous driving system of a vehicle. The system includes: a first hardware platform comprising a first sensor cluster and one or more processors configured to perform the following step: encoding a first sensor dataset generated by the first sensor cluster of the vehicle using one or more first encoder layers to form an encoded first sensor dataset. The system further includes: a second hardware platform comprising a second sensor cluster and one or more processors configured to perform the following step: encoding a second sensor dataset generated by the second sensor cluster of the vehicle using one or more second encoder layers to form an encoded second sensor dataset. The first hardware platform and the second hardware platform are separate hardware platforms, wherein each sensor in the first sensor cluster is different from the sensors in the second sensor cluster. Furthermore, the one or more processors of the first hardware platform or the one or more processors of the second hardware platform are further configured to fuse at least a portion of the encoded first sensor dataset and at least a portion of the encoded second sensor dataset using one or more sensor fusion layers to form a set of fused encoded sensor data features, and to generate a predictive output based on this set of fused encoded sensor data features using one or more decoder layers. The one or more first encoder layers, the one or more second encoder layers, the one or more sensor fusion layers, and the one or more decoder layers together form an end-to-end trained neural network. Utilizing this aspect of the disclosed technology offers advantages and preferred features similar to those of the other aspects.

[0012] The fifth aspect of the disclosed technology includes a vehicle comprising a system according to any embodiment of the fourth aspect disclosed herein. This aspect of the disclosed technology offers advantages and preferred features similar to the other aspects.

[0013] The disclosed aspects and preferred embodiments may be suitably combined with each other in any manner that is obvious to those skilled in the art, such that one or more features or embodiments relating to one aspect may also be considered to relate to another aspect or to embodiments of another aspect.

[0014] One advantage of some implementations is that they provide safety-critical redundancy for the ADS’s perception and / or planning functions, and reduce the risk of information loss when exchanging information between two separate computing platforms.

[0015] One advantage of some implementations is that they provide safety-critical redundancy for the ADS's perception and / or planning functions, enabling the ADS to still generate reliable predictive outputs for autonomous vehicle control even if one of the computing platforms or its associated software fails.

[0016] One advantage of some implementations is that they provide improved performance and redundancy for the ADS's perception and / or planning functions, with reduced bandwidth requirements compared to traditional solutions.

[0017] One advantage of some implementations is that the perception data can be shared between different computing platforms within the vehicle in a bandwidth-efficient manner, thereby increasing the processing speed for generating predictive outputs for autonomously maneuvering the vehicle in a safe manner.

[0018] One advantage of some implementations is that they provide safety-critical redundancy for the ADS’s perception and / or planning functions without limiting the available inputs to the perception and / or planning functions, thereby avoiding performance degradation during full system operation.

[0019] Further embodiments are defined in the dependent claims. It should be emphasized that when the term "comprising" and variations thereof are used in this specification, they are used to specify the presence of a described feature, integer, step, or component. They do not exclude the presence or addition of one or more other features, integers, steps, components, or groups thereof.

[0020] These and other features and advantages of the disclosed technology will be further illustrated below with reference to the embodiments described herein.

Brief Description of the Drawings

[0021] The foregoing aspects, features, and advantages of the disclosed technology will be more fully understood when taken in conjunction with the accompanying drawings and by referring to the following illustrative and non-limiting detailed description of exemplary embodiments of the present disclosure, wherein:

[0022] Figure 1 is a schematic flowchart representation of a method for generating predictive outputs for an autonomous driving system of a vehicle, according to some implementations.

[0023] Figure 2a is a schematic block diagram representation of a system for generating predictive outputs for an autonomous driving system of a vehicle, according to some implementations.

[0024] Figure 2b is a schematic block diagram representation of a system for generating predictive outputs for an autonomous driving system of a vehicle, according to some implementations.

[0025] Figure 3a is a schematic block diagram representation of a system for generating predictive outputs for an autonomous driving system of a vehicle, according to some implementations.

[0026] Figure 3b is a schematic block diagram representation of a system for generating predictive outputs for an autonomous driving system of a vehicle, according to some implementations.

[0027] Figure 4 is a schematic block diagram representation depicting an end-to-end training setup for a system used to generate predictive outputs for an autonomous driving system of a vehicle, according to some implementations.

[0028] Figure 5 is a schematic illustration of a vehicle according to some implementations, which includes a system for generating predictive outputs for the vehicle's autonomous driving system.

Detailed Description

[0029] This disclosure will now be described in detail with reference to the accompanying drawings, in which some exemplary embodiments of the disclosed technology are illustrated. However, the disclosed technology may be embodied in other forms and should not be construed as limited to the exemplary embodiments disclosed. The exemplary embodiments of the disclosure are provided to fully convey the scope of the disclosed technology to those skilled in the art. Those skilled in the art will understand that the steps, services, and functions explained herein can be implemented using separate hardware circuitry, using software that works in conjunction with a programmable microprocessor or general-purpose computer, using one or more application-specific integrated circuits (ASICs), using one or more field-programmable gate arrays (FPGAs), and / or using one or more digital signal processors (DSPs).

[0030] It will also be appreciated that when this disclosure is described in the form of a method, it can also be embodied in a device including one or more processors and one or more memories coupled to the one or more processors, wherein computer code is loaded to implement the method. For example, in some embodiments, the one or more memories may store one or more computer programs that, when executed by the one or more processors, cause the device to perform the steps, services, and functions disclosed herein.

[0031] It should also be understood that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting. It should be noted that the articles “a,” “an,” “the,” and “said,” as used in the specification and appended claims, are intended to mean the presence of one or more elements unless the context clearly indicates otherwise. Thus, for example, in some contexts, a reference to “a unit” or “the unit” may refer to more than one unit, etc. Furthermore, the word “comprising” and its variations do not exclude other elements or steps. It should be emphasized that when the term “comprising” and its variations are used in this specification, they are used to specify the presence of a described feature, integer, step, or component. They do not exclude the presence or addition of one or more other features, integers, steps, components, or groups thereof. The term “and / or” should be interpreted as meaning “both” as well as each as an alternative.

[0032] It should also be understood that although the terms first, second, etc., may be used herein to describe various elements or features, these elements should not be limited by these terms. These terms are only used to distinguish one element from another. For example, a first signal may be referred to as a second signal, and similarly, a second signal may be referred to as a first signal, without departing from the scope of the embodiments. Both the first signal and the second signal are signals, but they are not the same signal.

[0033] Overview

[0034] Automated Driving Systems (ADS) are safety-critical systems, meaning they should not have a single point of failure. Regarding the on-board embedded hardware (“HW”) that executes the ADS software, this disclosure proposes to meet this redundancy requirement by enabling the ADS software to execute on at least two separate computing units (or “hardware platforms”). Regarding the on-board sensors, this means that the vehicle is equipped with a large number of complementary sensors, and separate sensor clusters are connected to separate computing units. With this design, if one sensor cluster and / or its corresponding computing unit experiences a hardware or software failure, another sensor cluster and its corresponding computing unit can still be used to safely operate the vehicle.

[0035] Simultaneously, to maximize the performance of the ADS, the system should preferably be able to access information from all sensors. For ADS using neural networks, the information output from all sensors will be fused into a neural network model responsible for environmental perception and / or path or trajectory planning to be performed by the ADS. The optimal performance of such a model itself can be a safety consideration. For example, regardless of hardware or other software malfunctions, inaccurate or even erroneous outputs (e.g., missed detections, misclassifications, inappropriate candidate paths, etc.) will have safety consequences. Therefore, from both performance and safety perspectives, it is desirable for the neural network model to have access to information from all sensors on the vehicle to improve the accuracy of the output and thus reduce the risk of accidents.

[0036] However, providing a well-defined communication interface between two computing units (“hardware platforms”) may lead to the risk of information loss when data is exchanged between the two computing units. In other words, providing a “human-interpretable” interface between two computing units can impair the performance of perception and / or planning functions because such interfaces inherently create bottlenecks in data exchange, which could raise safety issues when the information loss involves loss of sensor data.

[0037] To address this, this disclosure proposes providing the various data processing models (e.g., encoding, fusion, and decoding) across two computing units as layers of a single neural network trained end-to-end. Thus, during the training phase of this neural network, the network itself defines the communication interface between the computing units and alleviates the bottleneck effect caused by human-interpretable interfaces. Compared to training models individually on each computing unit, the concept of training the processing models (“layers”) end-to-end across two computing units and running these models as a single neural network during inference has the potential to provide more accurate results.

[0038] The proposed architecture is trained in an end-to-end manner. More specifically, during training, gradients can propagate from the prediction outputs generated separately for each sensor cluster, and also from the prediction outputs based on the combination of all sensor clusters. Thus, during inference, if a hardware or software failure occurs on one platform, the predictions based on the sensor cluster of the other platform can still be used.
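
One possible way to obtain gradients from both the per-cluster prediction outputs and the combined prediction output is to sum a loss term for each of them into a single training objective. The following Python sketch illustrates that idea under stated assumptions: the mean squared error, the weighting factors w_fused and w_single, and the function names are hypothetical and are not specified in the disclosure.

```python
from typing import Sequence


def squared_error(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean squared error between a prediction and its target."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def total_loss(
    pred_fused: Sequence[float],
    pred_cluster_1: Sequence[float],
    pred_cluster_2: Sequence[float],
    target: Sequence[float],
    w_fused: float = 1.0,    # hypothetical weight of the fused-output loss
    w_single: float = 0.5,   # hypothetical weight of each per-cluster loss
) -> float:
    """Combine the fused-output loss with the two per-cluster losses.

    Because the terms are summed, differentiating this objective yields
    gradient contributions from every prediction output, so each encoder
    is also trained to support a prediction on its own.
    """
    return (
        w_fused * squared_error(pred_fused, target)
        + w_single * squared_error(pred_cluster_1, target)
        + w_single * squared_error(pred_cluster_2, target)
    )
```

In a framework with automatic differentiation, calling the backward pass on such a summed objective would propagate gradients through the fusion layers and through both encoders in one step.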

[0039] Accordingly, the implementations described herein introduce a solution that enables the deployment of neural network models that provide optimal performance by using all sensors across the different clusters simultaneously in the absence of hardware or software failure, while providing redundancy in the event of a hardware or software failure on one of the clusters. This reduces the risk of complete system failure and thus improves overall system safety.
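
The redundancy behaviour described above can be illustrated with a minimal selection routine. This is a sketch under stated assumptions only: the function name select_output, the boolean health flags, and the return of None when both platforms have failed are hypothetical and do not describe how the disclosure detects failures or hands over control.

```python
from typing import Optional, Sequence

Prediction = Sequence[float]


def select_output(
    fused: Optional[Prediction],
    cluster_1: Optional[Prediction],
    cluster_2: Optional[Prediction],
    platform_1_ok: bool,
    platform_2_ok: bool,
) -> Optional[Prediction]:
    """Pick the predictive output to use given the health of each platform."""
    if platform_1_ok and platform_2_ok:
        # Nominal operation: use the output based on all sensor clusters.
        return fused
    if platform_1_ok:
        # Platform 2 failed: fall back to the output based on cluster 1 only.
        return cluster_1
    if platform_2_ok:
        # Platform 1 failed: fall back to the output based on cluster 2 only.
        return cluster_2
    # Both platforms failed: no prediction is available in this sketch.
    return None
```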

[0040] In other words, the implementations described herein propose an architecture for use with neural network models in autonomous driving systems. These models can access information from all sensors, regardless of which hardware cluster they are assigned to, while still providing redundancy in the event of hardware or software failure within one of these clusters. This enables perception and / or planning functions for autonomous driving systems that meet redundancy requirements, while maintaining high performance during periods without hardware or software failures.

[0041] It should be noted that although this specification typically uses only two platforms to illustrate various implementation methods, those skilled in the art will readily recognize that the same teachings and principles apply to architectures with more than two platforms, each with a corresponding sensor cluster.

[0042] Definitions

[0043] In this context, an Automated Driving System (ADS) refers to a complex combination of hardware and software components designed to control and operate a vehicle without direct human intervention. ADS technology aims to automate various aspects of driving, such as steering, acceleration, deceleration, and monitoring of the surrounding environment. The primary goal of ADS is to enhance safety, efficiency, and convenience in transportation. The range of ADS can vary from basic driver assistance systems to highly advanced automated driving systems, depending on their level of automation as classified by standards such as SAE J3016. These systems utilize various sensors, cameras, radar, lidar, and powerful computer algorithms to perceive the environment and make driving decisions. The specific capabilities and characteristics / functions of ADS can vary greatly, from systems providing limited assistance to systems capable of independently handling complex driving tasks under specific conditions.

[0044] Advanced Driver Assistance Systems (ADAS) are technologies that assist drivers during driving, although they do not necessarily provide complete autonomy. ADAS features are often used as building blocks for ADS. Examples include adaptive cruise control, lane keeping assist, automatic emergency braking, and parking assist. They enhance safety and convenience but typically require a certain level of human supervision and intervention. Autonomous Driving (AD), on the other hand, is a technology designed to control and navigate a vehicle without human supervision. Accordingly, the difference between ADAS and AD can be said to lie in the level of autonomy and control. ADAS systems are designed to assist and support the driver, while AD aims for complete control of the vehicle without continuous human supervision. Accordingly, AD aims for a higher level of autonomy (such as Level 4 and Level 5 according to the SAE International standard), where the vehicle can operate independently in most or all driving scenarios without human intervention. As mentioned earlier, the term "ADS" is used herein as a collective term encompassing both ADAS and AD. In this context, ADS functions or ADS features can be understood as specific functions or features of the entire ADS stack, such as highway navigation features, traffic jam navigation features, route planning features, and so on.

[0045] In the current context, "machine learning algorithm" or "neural network" refers to a computational model or set of techniques used to enable computers to solve tasks, such as interpreting and understanding the surrounding environment in a vehicle's perception system. Perception tasks in ADS involve the vehicle's ability to detect and identify targets, obstacles, road signs, lane markings, pedestrians, other vehicles, and various environmental conditions. ADS can use machine learning algorithms to process sensor data, such as data from cameras, LiDAR, radar, and other sensors, to make informed decisions about how to navigate safely. These algorithms use data-driven techniques to analyze and classify targets, understand road geometry, predict the movement of other road users, and / or assess potential risks in real time. Common types of machine learning algorithms used in ADS perception tasks include deep neural networks, convolutional neural networks (CNNs) (e.g., for camera image processing, LiDAR output processing, etc.), recurrent neural networks (RNNs) (e.g., for sequential data), and various other techniques like support vector machines (SVMs) and decision trees.

[0046] In some implementations, the machine learning algorithm is implemented using suitable publicly available machine learning software components, in any manner that a person skilled in the art would deem appropriate, such as those available in PyTorch, Keras, and TensorFlow, or in any other appropriate software development platform.

[0047] The terms "encoder" (or "encoder layer") and "decoder" (or "decoder layer") refer to components of a neural network architecture designed to interpret and process sensor data from the vehicle's surrounding environment. The encoder is responsible for processing the raw sensor inputs (e.g., camera images, LiDAR point clouds, or radar signals) and transforming them into a compact, abstract representation ("a set of encoded features"). The decoder takes the encoded representation produced by the encoder and transforms it back into a more interpretable output.

[0048] The compressed representation output from the encoder is often called the "latent space" or "feature space." The latent space encodes key information about the input data (e.g., an input image) in a compact form. Each dimension in the latent space can represent a different feature or concept. Furthermore, this representation can capture, for example, key features such as object boundaries, relative distances, and object classifications, while discarding irrelevant information. For example, an encoder can process camera images and identify and represent key features such as the presence of pedestrians, lane markings, traffic signs, or other vehicles. Thus, the encoder reduces high-dimensional sensed data to a set of meaningful features that can be used to understand the driving environment. This representation can serve as the basis for tasks such as object detection, semantic segmentation (understanding the context of an object), and depth estimation (determining distances to objects). For example, in a convolutional neural network (CNN) used for object detection, the encoder extracts hierarchical features (edges, textures, shapes) from camera images and progressively builds a detailed understanding of what is present in the scene.

[0049] As mentioned, the decoder takes the encoded representation (“a set of encoded features”) produced by the encoder and transforms it back into a more interpretable output. This can involve predicting target locations, labeling portions of an image, or providing additional details such as the target’s orientation or motion. For example, if the perception task is object detection, the decoder can predict bounding boxes around identified targets (e.g., pedestrians, cars) and assign category labels (e.g., “cars”, “stop signs”, “pedestrians”) to these targets. Similarly, if the perception task is semantic segmentation, the decoder can take the encoded features and assign a category label, such as identifying a road surface, pedestrian area, or vehicle, to each pixel in the image. Furthermore, if the perception task is depth estimation, the decoder will take the encoded features and predict distances to various targets or points in the environment.
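
As a simple illustration of the final step of a semantic segmentation decoder, the following Python sketch assigns a category label to each pixel by selecting the class with the highest score. The class list, the nested-list score layout, and the function name segment are assumptions made for this sketch; the disclosure does not prescribe how per-pixel scores are produced or converted into labels.

```python
from typing import List

# Hypothetical category labels, ordered to match the score vector of each pixel.
CLASSES = ["road", "pedestrian_area", "vehicle"]


def segment(scores: List[List[List[float]]]) -> List[List[str]]:
    """Map per-pixel class scores to per-pixel category labels.

    scores[row][col] holds one score per entry in CLASSES. The label of
    a pixel is the class whose score is the largest (argmax).
    """
    labels = []
    for row in scores:
        label_row = []
        for pixel_scores in row:
            best = max(range(len(pixel_scores)), key=pixel_scores.__getitem__)
            label_row.append(CLASSES[best])
        labels.append(label_row)
    return labels
```

For a one-row image with two pixels, segment([[[0.9, 0.05, 0.05], [0.1, 0.2, 0.7]]]) labels the first pixel as "road" and the second as "vehicle".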

[0050] Furthermore, terms such as "one or more encoder layers," "one or more sensor fusion layers," or "one or more decoder layers" refer to components or parts of an end-to-end trained neural network. Therefore, the layers together form the entire end-to-end trained neural network, and can thus be understood as a single, larger neural network.

[0051] The term "predictive output" (or "prediction output") refers to the final result generated by a neural network (e.g., a deep neural network) after processing input based on patterns learned during its training. In an encoder-decoder architecture, "predictive output" refers to the output of the decoder. Furthermore, predictive output can be interpreted as the neural network's attempt to predict outcomes or make decisions based on new, previously unseen data using weights and biases learned during its training phase. In the context of autonomous driving systems, "predictive output" can refer to the network's prediction of the future behavior of the driving environment or of the vehicle based on the sensor data it receives (e.g., from cameras, radar, and lidar). For example, in perception tasks in the form of classification tasks, predictive output could be the network's identification and classification of targets in the environment, such as pedestrians, vehicles, traffic signs, or lane markings. For instance, a neural network could predict the probability that a pedestrian or a stationary object such as a mailbox is present ahead. Further, in perception tasks in the form of regression tasks, predictive output could be a continuous value such as the predicted distance to the nearest obstacle, the speed of adjacent vehicles, or the time before a traffic light turns red. For example, a network could predict how far in advance a vehicle should begin braking to safely stop at a red light.
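
For classification-type predictive outputs, raw network scores are commonly converted into class probabilities with a softmax function. The Python sketch below shows that conversion as one common option, not as the approach of the disclosure; the function name softmax and the example scores are chosen for illustration only.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Convert raw class scores into probabilities that sum to one."""
    # Subtracting the maximum keeps the exponentials numerically stable
    # without changing the result.
    largest = max(logits)
    exps = [math.exp(z - largest) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical scores for the classes "pedestrian", "mailbox", and "nothing".
probabilities = softmax([2.0, 1.0, 0.1])
```

The class with the highest score receives the highest probability, and a regression-type predictive output, such as a distance, would instead be read directly from the decoder as a continuous value.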

[0052] In this context, a "sensor" (or "sensor device") refers to a specialized component or system designed to capture and collect information from the vehicle's surroundings. These sensors play a crucial role in enabling the Automated Driving System (ADS) to perceive and understand its environment, make informed decisions, and navigate safely. Sensor devices are typically integrated into the hardware and software systems of autonomous vehicles to provide real-time data for various tasks such as obstacle detection, localization, road model estimation, and target recognition. Common types of sensor devices used in autonomous driving include LiDAR (Light Detection and Ranging), radar, cameras, and ultrasonic sensors. LiDAR sensors use laser beams to measure distances and create high-resolution 3D maps of the vehicle's surroundings. Radar sensors use radio waves to determine the distance and relative speed of targets around the vehicle. Camera sensors capture visual data, allowing the vehicle's computer system to identify traffic signs, lane markings, pedestrians, and other vehicles. Ultrasonic sensors use sound waves to measure proximity to targets. Various machine learning algorithms, such as artificial neural networks, can be used to process the output from sensors to understand the environment. Accordingly, a "cluster" of sensors refers to a group or set of sensors, or in other words, multiple sensors.

[0053] The "surrounding environment" of a vehicle can be understood as the general area surrounding the vehicle in which targets (such as other vehicles, landmarks, obstacles, etc.) can be detected and identified by the vehicle's sensors (radar, LiDAR, cameras, etc.), that is, the area within the sensor range of the vehicle.

[0054] As used herein, the term "in response to" can be interpreted from context as meaning "when," "at," or "if." Similarly, the phrases "if determined," "when determined," or "in the case of," can be interpreted from context as meaning "after determined," "in response to determination," "after detecting and identifying the occurrence of an event," or "in response to detecting the occurrence of an event." Accordingly, the phrase "if X equals Y" can be interpreted from context as "when X equals Y," "when determined to be equal to Y," "in response to X equals Y," or "in response to detecting / determining that X equals Y."

[0055] The term "acquire" is to be interpreted broadly herein and encompasses the direct and / or indirect receipt, retrieval, collection, and acquisition between two entities configured to communicate with each other or further with other external entities. However, in some embodiments, the term "acquire" is to be interpreted as determining, deriving, forming, calculating, etc. In other words, acquiring the pose of a vehicle may include determining or calculating the vehicle's pose based on, for example, GNSS data and / or perception data, together with map data. Thus, as used herein, "acquire" may indicate receiving parameters at a first entity / first unit from a second entity / second unit, or determining parameters at a first entity / first unit, for example, based on data received from another entity / another unit.

[0056] Implementation

[0057] Figure 1 is a schematic flowchart representation of a method S100 for generating predictive outputs for an automated driving system (ADS) of a vehicle, according to some embodiments. Method S100 is a computer-implemented method that can be executed online (e.g., by a processing system of a vehicle equipped with an ADS). The processing system includes: a first hardware platform having one or more processors and one or more memories coupled to the one or more processors; and a second hardware platform having one or more processors and one or more memories coupled to the one or more processors. The one or more memories of each hardware platform store one or more programs that, when executed by the one or more processors of that hardware platform, perform the steps, services, and functions of the method S100 disclosed herein. Furthermore, the flowchart depicted in Figure 1 indicates which hardware ("HW") platform executes the various steps or processes of method S100.

[0058] Method S100 may include: obtaining S101, using one or more processors of a first hardware platform of the vehicle, a first sensor dataset generated by a first sensor cluster of the vehicle. Further, method S100 includes: encoding S103, using the one or more processors of the first hardware platform and one or more first encoder layers, the first sensor dataset generated by the first sensor cluster to form an encoded first sensor dataset. In some embodiments, each sensor is provided with its own set of encoder layers, meaning that the encoder layers are sensor-specific. In some embodiments, the number of encoder layer sets is smaller than the number of sensors in the sensor cluster, meaning that at least one set of encoder layers processes sensor outputs from two or more sensors in the cluster. For example, the encoder layer sets may be sensor modality-specific, meaning that each set of encoder layers processes the outputs of all sensors in the cluster that share the same sensor modality.
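The two ways of assigning encoder layer sets to sensors described above can be sketched as follows. This is an illustrative sketch only: the `Sensor` class, the sensor names, and both helper functions are hypothetical and do not appear in the disclosure.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Sensor:
    name: str
    modality: str  # e.g. "camera", "lidar", "radar"


def sensor_specific_sets(cluster):
    """One encoder layer set per sensor."""
    return {s.name: [s.name] for s in cluster}


def modality_specific_sets(cluster):
    """One encoder layer set per modality; same-modality sensors share it."""
    sets = defaultdict(list)
    for s in cluster:
        sets[s.modality].append(s.name)
    return dict(sets)


# Hypothetical first sensor cluster with mixed modalities.
cluster_a = [
    Sensor("front_cam", "camera"),
    Sensor("rear_cam", "camera"),
    Sensor("roof_lidar", "lidar"),
]

per_sensor = sensor_specific_sets(cluster_a)      # three encoder sets
per_modality = modality_specific_sets(cluster_a)  # two encoder sets
```

With the modality-specific assignment, the two hypothetical cameras share one encoder layer set, so the number of sets is smaller than the number of sensors.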

[0059] In some embodiments, method S100 further includes: obtaining S102, using one or more processors of a second hardware platform of the vehicle, a second sensor dataset generated by a second sensor cluster of the vehicle. Method S100 further includes: encoding S104, using the one or more processors of the second hardware platform and one or more second encoder layers, the second sensor dataset generated by the second sensor cluster to form an encoded second sensor dataset. As previously stated, the one or more encoder layers may include one or more sets of encoder layers, wherein these encoder layer sets may be sensor-specific, or one set of encoder layers may process and encode sensor outputs from two or more sensors in the cluster.

[0060] As mentioned above, the first and second hardware platforms are separate hardware platforms, and each sensor in the first sensor cluster is different from the sensors in the second sensor cluster. However, in some implementations, the first and second sensor clusters may have one or more common sensors. For example, if the vehicle is equipped with only one specific sensor that is expected to be included in both clusters (e.g., a front-facing wide-angle camera), then that specific sensor may be a common sensor between the clusters.

[0061] In some implementations, the first sensor cluster includes sensors of different modalities, and the second sensor cluster includes sensors of different modalities. In other words, the sensors in the first cluster are not necessarily all cameras, and the sensors in the second cluster are not necessarily all LiDAR, but each cluster may include a mixture of sensor types / modalities.

[0062] Furthermore, method S100 includes: fusing S107, using one or more processors of the first hardware platform of the vehicle or one or more processors of the second hardware platform of the vehicle, and using one or more sensor fusion layers, at least a portion of the encoded first sensor dataset and at least a portion of the encoded second sensor dataset to form a set of fused encoded sensor data features. In other words, the feature vectors (the "encoded sensor datasets") are combined (e.g., by concatenation, averaging, or a learned transformation). The outputs of the sensors are thus combined into a unified representation while their unique properties are preserved.

[0063] Method S100 further includes: using one or more processors of the hardware platform performing fusion S107, and using one or more decoder layers, generating a prediction output S108 based on the fused encoded sensor data features. One or more first encoder layers, one or more second encoder layers, one or more sensor fusion layers, and one or more decoder layers together form an end-to-end trained neural network.

[0064] Furthermore, method S100 may include: using one or more processors of the hardware platform that generates the predicted output in S108 to transmit the generated predicted output in S109 to one or more downstream functions of the autonomous driving system configured to control the vehicle based on the generated predicted output. When the predicted output is in the form of a perception output, the downstream function of the ADS may be, for example, a path planning module configured to generate candidate paths for execution by the vehicle based at least in part on the predicted output; or a localization module configured to output the vehicle's position or pose (position and heading) based at least in part on the predicted output. However, in a more general case, the downstream function of the ADS may be in the form of a decision and control module configured to output control signals to one or more actuators of the vehicle to control the vehicle's motion based at least in part on the predicted output.

[0065] Fusion S107 can be performed on the encoded sensor dataset as the output from the encoder layer or on the fused encoded sensor dataset, the latter of which will be further explained below. However, regardless of whether fusion is performed on the encoded sensor dataset as the output from the encoder layer or on the fused encoded sensor dataset, fusion S107 is still considered as "fusion of encoded sensor data".

[0066] Accordingly, method S100 can employ a two-step fusion process, wherein the encoded sensor data generated at each individual hardware platform is "pre-fused" before being shared with another platform for a second-stage fusion of the encoded sensor data. This reduces the amount of data shared between hardware platforms, thereby reducing bandwidth requirements between platforms.

[0067] Accordingly, in some implementations, one or more sensor fusion layers (performing fusion S107) are one or more stage two sensor fusion layers, and the fused encoded sensor data features are a set of stage two fused encoded sensor data features.

[0068] Furthermore, method S100 may include: fusing S105, using one or more processors of the vehicle's first hardware platform and one or more first stage-one fusion layers, at least a portion of the encoded first sensor dataset to form a first set of fused encoded sensor data features. In other words, the feature vectors from the encoder layers (the "encoded sensor dataset") are combined (e.g., by concatenation, averaging, or a learned transformation). The outputs of the sensors are thus combined into a unified representation while their unique properties are preserved.

[0069] Furthermore, method S100 includes: fusing S106, using one or more processors of the second hardware platform of the vehicle and one or more second stage-one fusion layers, at least a portion of the encoded second sensor dataset to form a second set of fused encoded sensor data features. As before, the feature vectors from the encoder layers (the "encoded sensor dataset") are combined (e.g., by concatenation, averaging, or a learned transformation). The outputs of the sensors are thus combined into a unified representation while their unique properties are preserved.

[0070] Therefore, fusing S107 at least a portion of the encoded first sensor dataset and at least a portion of the encoded second sensor dataset may include fusing the first set of fused encoded sensor data features (formed in S105) and the second set of fused encoded sensor data features (formed in S106) using one or more stage-two sensor fusion layers to form the set of stage-two fused encoded sensor data features.

[0071] In other words, a "stage-two" fusion S107 is performed to combine the "stage-one" fused representations of the vehicle's surrounding environment. Therefore, the feature vectors (the "fused encoded sensor data features") from the first stage-one fusion layers and the second stage-one fusion layers are combined (e.g., by concatenation, averaging, or a learned transformation).

[0072] As previously described, the one or more first encoder layers, the one or more second encoder layers, the one or more first stage-one sensor fusion layers, the one or more second stage-one sensor fusion layers, the one or more stage-two sensor fusion layers, and the one or more decoder layers together form an end-to-end trained neural network.
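The data flow of the two-stage fusion (S103 to S108) can be sketched with simple stand-in functions. This is a sketch under stated assumptions, not the disclosed network: `encode`, `fuse_mean`, and `decode` are hypothetical placeholders for the trained encoder, fusion, and decoder layers, and element-wise averaging is used only because it is one of the combination methods mentioned above.

```python
def encode(reading, scale=1.0):
    """Stand-in encoder layer: maps a raw reading to a feature vector."""
    return [scale * x for x in reading]


def fuse_mean(features):
    """Stand-in fusion layer: element-wise mean of equally sized vectors."""
    n = len(features)
    return [sum(values) / n for values in zip(*features)]


def decode(fused):
    """Stand-in decoder layer: reduces fused features to one scalar."""
    return sum(fused)


# First hardware platform: encode (S103), then stage-one fusion (S105).
cluster_a = [[1.0, 2.0], [3.0, 4.0]]          # two sensors, toy readings
fused_a = fuse_mean([encode(r) for r in cluster_a])

# Second hardware platform: encode (S104), then stage-one fusion (S106).
cluster_b = [[5.0, 6.0], [7.0, 8.0]]
fused_b = fuse_mean([encode(r) for r in cluster_b])

# Only one stage-one fused feature set crosses the inter-platform link.
# Stage-two fusion (S107) and decoding (S108) run on one platform.
fused_ab = fuse_mean([fused_a, fused_b])
prediction = decode(fused_ab)
```

In this toy example each platform sends a single two-element vector to the other platform instead of the two encoded vectors of its individual sensors.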

[0073] It can be noted that the individual steps in method S100 (e.g., S101 to S106) are not necessarily executed by the “same” one or more processors on a dedicated hardware platform, but can be executed by different processors in a distributed processing architecture. However, naturally, the individual steps of the method can be executed by the same one or more processors on the hardware platform.

[0074] As mentioned, to allow the ADS to access all sensor measurements, the different sensor clusters and computing platforms ("computing units") should be connected. However, the data bandwidth available between two hardware units is generally much lower than the internal bandwidth within each hardware cluster. Therefore, feeding the outputs from all sensors on the vehicle to both hardware platforms at all times is generally considered impractical, and developers thus often face a trade-off between performance and redundancy.

[0075] The inventors recognized specific challenges when designing neural network models or architectures that operate simultaneously on multiple sensors for ADS applications. More specifically, when new data is output by a sensor, this data can be encoded, at a specific timestamp, by an encoder network specific to that sensor. After a certain time period or interval, the data measured by all sensors during this period / interval is merged into a single representation of the vehicle's surrounding environment (also known as "spatial fusion"). Furthermore, historical representations of the surrounding environment prior to this time interval are stored in memory. Therefore, the representation of the current time interval can be combined with the historical representation stored in memory to obtain a final representation of the current state of the vehicle's surrounding environment (also known as "temporal fusion"). The final predicted output can then be provided by one or more decoder networks that are trained to solve various tasks (object detection, object classification, object tracking, object trajectory prediction, path planning, trajectory planning, etc.) and that receive the spatially and temporally fused state of the surrounding environment at regular intervals.
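The sequence of spatial fusion followed by temporal fusion described above can be sketched as follows. The fixed blending weight `alpha`, the averaging rule, and all function names are assumptions made for illustration; the disclosure does not specify how the current and historical representations are combined, and trained layers would be expected to take the place of these fixed rules.

```python
def spatial_fusion(encoded_measurements):
    """Merge all encoded measurements of one interval (element-wise mean)."""
    n = len(encoded_measurements)
    return [sum(values) / n for values in zip(*encoded_measurements)]


def temporal_fusion(current, history, alpha=0.5):
    """Blend the current representation with the stored history.

    alpha is an assumed blending weight, not a value from the disclosure.
    """
    if history is None:  # first interval: nothing stored in memory yet
        return list(current)
    return [alpha * c + (1.0 - alpha) * h for c, h in zip(current, history)]


# Toy encoded measurements for two consecutive time intervals.
intervals = [
    [[0.0, 2.0], [2.0, 2.0]],
    [[4.0, 0.0], [4.0, 4.0]],
]

memory = None   # historical representation kept between intervals
states = []     # fused state handed to the decoder after each interval
for interval in intervals:
    current = spatial_fusion(interval)
    memory = temporal_fusion(current, memory)
    states.append(memory)
```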

[0076] When the design (i) follows the process described above and (ii) is suitable for implementing a neural network model or architecture in a vehicle in which multiple sensors are grouped into different clusters, each cluster having separate computing hardware, the challenge lies in the very limited available data bandwidth between these clusters.

[0077] More specifically, consider a scenario in which an ADS-equipped vehicle has two hardware platforms ("Platform A" and "Platform B"), each with its own sensor cluster and computing hardware, and in which information from all the sensors on the vehicle is to be fused simultaneously. To run this solution in the vehicle, all sensor data, or more specifically, all encoded sensor data, would need to be sent from Platform A to Platform B, or vice versa. Spatial and temporal fusion could then continue on the receiving platform. However, when new sensor measurements are output, an "update" (temporal fusion) of the representation of the surrounding environment must be performed promptly; otherwise there is a risk that the ADS might not be able to react in time as conditions evolve within the surrounding environment, potentially leading to safety-critical issues. Therefore, sensor data, or encoded sensor data, must be sent from one platform to the other at a sufficiently high rate to ensure timely temporal fusion. However, this may be impossible due to the limited computing power on the vehicle and the implied bandwidth limitations between the hardware platforms. Furthermore, even if sending sensor data or encoded sensor data between platforms is "possible," the requirement to share large amounts of data between hardware platforms may limit the performance of the neural network model. This is because it may require smaller sensor data packets or lower-dimensional encoded sensor data, which reduces the amount of information available for generating predictive outputs. Higher-dimensional encoding is therefore desirable, as it generally translates to better performance; however, if the encoded sensor data is to be transmitted between platforms, it also translates to higher bandwidth requirements.

[0078] Therefore, the present disclosure proposes a multi-stage sensor fusion architecture in which a complete sensor fusion step S105, S106 is executed separately on each hardware platform, on the encoded sensor data generated from that platform's own sensor cluster, before any data is sent between the platforms. Thus, instead of sending sensor data or encoded sensor data to the other platform, only the fused state is shared between platforms. The platform receiving the fused state can then execute a second fusion step S107 to fuse its own fused state (derived from the sensor measurements of its own cluster) with the received fused state, providing a final fused representation of the surrounding environment.

[0079] The stage-one sensor fusion steps S105 and S106 (executed at each platform) combine the information contained in the output of each sensor into a single representation required to solve the final task of the entire model or architecture. Furthermore, the inventors recognized that, per byte, more relevant information can be stored in the fused state representation than in the encoded sensor data from each sensor. This is because there is generally a high degree of redundancy among the multiple sensors within a cluster, so sending sensor data or encoded sensor data inevitably leads to the transmission of redundant information. Therefore, by performing the first fusion steps S105 and S106 before sharing data between platforms, more relevant information can be shared between platforms under a fixed bandwidth constraint between clusters, compared to sharing sensor data or encoded sensor data.
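The bandwidth argument above can be made concrete with a back-of-the-envelope calculation. All numbers below (grid size, channel counts, 2 bytes per value, 10 Hz update rate) and the sensor names are hypothetical and chosen only for illustration; the disclosure does not give any of these values.

```python
def feature_map_bytes(height, width, channels, bytes_per_value=2):
    """Size of one dense feature map, assuming 2 bytes per value."""
    return height * width * channels * bytes_per_value


# Assumed feature grid and update rate.
HEIGHT, WIDTH = 200, 200
RATE_HZ = 10

# Assumed channel counts of the encoded output of each sensor in one cluster.
encoded_channels = {"camera_front": 128, "camera_rear": 128,
                    "lidar": 64, "radar": 32}
# Assumed channel count of the single stage-one fused state.
fused_channels = 128

# Option 1: send every encoded sensor dataset to the other platform.
per_sensor_bytes = sum(feature_map_bytes(HEIGHT, WIDTH, c)
                       for c in encoded_channels.values())
# Option 2: send only the stage-one fused state.
fused_bytes = feature_map_bytes(HEIGHT, WIDTH, fused_channels)

per_sensor_rate = per_sensor_bytes * RATE_HZ   # bytes per second
fused_rate = fused_bytes * RATE_HZ             # bytes per second
ratio = per_sensor_bytes / fused_bytes         # link load reduction factor
```

Under these assumed numbers, sharing the fused state requires 2.75 times less inter-platform traffic than sharing all encoded sensor datasets.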

[0080] Furthermore, in some embodiments, method S100 further includes: in response to both the first hardware platform and the second hardware platform being operable, receiving a second set of fused encoded sensor data from the second hardware platform using one or more processors of the first hardware platform of the vehicle, and fusing the first set of fused encoded sensor data features and the second set of fused encoded sensor data features using one or more stage two sensor fusion layers, in order to form the set of stage two fused encoded sensor data features. Method S100 may further include: using one or more processors of the first hardware platform of the vehicle, and using one or more decoder layers, generating a prediction output based on the set of stage two fused encoded sensor data features, in S108. In other words, if both hardware platforms are operable, the first hardware platform can receive stage one fused features from the second platform, perform stage two fusion, and generate a prediction output based on the information from the second stage fusion.

[0081] Similarly, in some embodiments, method S100 further includes: in response to the fact that both the first hardware platform and the second hardware platform are operable, receiving a first set of fused encoded sensor data from the first hardware platform using one or more processors of the second hardware platform of the vehicle, and fusing the first set of fused encoded sensor data features and the second set of fused encoded sensor data features using one or more stage-two sensor fusion layers, in order to form the set of stage-two fused encoded sensor data features. Method S100 may further include: using one or more processors of the second hardware platform of the vehicle, and using one or more decoder networks, generating a prediction output based on the set of stage-two fused encoded sensor data features, in S108. In other words, if both hardware platforms are operable, the second hardware platform can receive the features from the first stage fusion from the first platform, perform second stage fusion, and generate a prediction output based on the information from the second stage fusion.

[0082] The choice of which hardware platform performs the stage-two fusion S107 and the subsequent steps can be arbitrary, or can be based on which of the two hardware platforms has the most available computing power at any given time. Reports on the operability of the hardware platforms can be provided, for example, by separate modules or systems configured to monitor each hardware platform in order to detect errors or faults, as readily understood by those skilled in the art. Furthermore, the term "operable" in relation to a hardware platform means that the hardware platform does not report or exhibit any detectable hardware or software faults or failures during operation.
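One possible selection rule for the platform that performs the stage-two fusion S107 is sketched below. The `PlatformStatus` record, its `free_compute` field, and the function name are hypothetical; in particular, how operability and available computing power are reported is an assumption, since the disclosure leaves this to separate monitoring modules.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PlatformStatus:
    name: str
    operable: bool        # as reported by an assumed monitoring module
    free_compute: float   # assumed fraction of idle compute, 0.0 to 1.0


def select_stage_two_platform(a: PlatformStatus,
                              b: PlatformStatus) -> Optional[str]:
    """Pick the platform that runs S107 and the subsequent steps.

    Returns None unless both platforms are operable, because the
    stage-two fusion needs stage-one features from both platforms.
    """
    if not (a.operable and b.operable):
        return None
    # Prefer the platform with more available computing power.
    return a.name if a.free_compute >= b.free_compute else b.name


platform_a = PlatformStatus("HW1", operable=True, free_compute=0.35)
platform_b = PlatformStatus("HW2", operable=True, free_compute=0.60)
chosen = select_stage_two_platform(platform_a, platform_b)
```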

[0083] Furthermore, in some embodiments, the one or more decoder layers are one or more stage-two decoder layers. Accordingly, in response to the first hardware platform being operable and the second hardware platform being inoperable, method S100 may further include: generating S110, by one or more processors of the vehicle's first hardware platform and using one or more first stage-one decoder layers, a prediction output based on the first set of fused encoded sensor data features. Additionally, method S100 may include: transmitting S112, using one or more processors of the vehicle's first hardware platform, the generated prediction output to one or more downstream functions of the autonomous driving system configured to control the vehicle based on the generated prediction output. In other words, if the second hardware platform is inoperable (e.g., because it has encountered a hardware or software error), the first hardware platform generates S110 the prediction output using the stage-one fused representation of the vehicle's surrounding environment obtained in S105.

[0084] Similarly, in response to the first hardware platform being inoperable and the second hardware platform being operable, method S100 may further include: generating S111, by one or more processors of the vehicle's second hardware platform and using one or more second stage-one decoder layers, a prediction output based on the second set of fused encoded sensor data features. Furthermore, method S100 may include: transmitting S113, using one or more processors of the vehicle's second hardware platform, the generated prediction output to one or more downstream functions of the autonomous driving system configured to control the vehicle based on the generated prediction output. In other words, if the first hardware platform is inoperable (e.g., because it has encountered a hardware or software error), the second hardware platform generates S111 the prediction output using the stage-one fused representation of the vehicle's surrounding environment obtained in S106.
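The operability cases described in the preceding paragraphs can be summarized as a small dispatch sketch. The enum and function names are hypothetical, and the `NO_OUTPUT` case (both platforms inoperable) is an assumption added so that the function is total; the disclosure does not describe that case here.

```python
from enum import Enum


class Path(Enum):
    STAGE_TWO = "fuse S107, generate S108"
    FIRST_ONLY = "generate S110, transmit S112"
    SECOND_ONLY = "generate S111, transmit S113"
    NO_OUTPUT = "no operable platform (assumed case)"


def select_path(first_operable: bool, second_operable: bool) -> Path:
    """Choose the processing path from the operability of both platforms."""
    if first_operable and second_operable:
        # Both stage-one feature sets are available for stage-two fusion.
        return Path.STAGE_TWO
    if first_operable:
        # First platform decodes its own stage-one fused features (S105).
        return Path.FIRST_ONLY
    if second_operable:
        # Second platform decodes its own stage-one fused features (S106).
        return Path.SECOND_ONLY
    return Path.NO_OUTPUT


normal = select_path(True, True)
fallback = select_path(True, False)
```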

[0085] Accordingly, to maintain safety-critical redundancy with separate platforms, some implementations of the present disclosure provide each platform with the ability to generate predictive outputs based only on the fused state representation generated by that platform. Therefore, in addition to the one or more decoder networks that provide predictive outputs based on the fused state of all platforms (and all sensor clusters), each platform can also be provided with an equivalent set of decoder networks trained to provide predictive outputs based on the fused state representation of the sensor data output from its own sensor cluster. By training the entire software architecture end to end, the ability to generate predictive outputs from a platform's own fused state representation is provided seamlessly alongside the ability to generate predictive outputs from the combined outputs of both platforms.

[0086] This provides safety-related redundancy for the vehicle's ADS while still providing the necessary predictive output in a smooth and efficient manner, ensuring that the ADS can accurately and safely control the vehicle. Each hardware platform can be configured with one set of decoder layers for generating predictive outputs based on the encoded features from the stage-two fusion S107, and another set of decoder layers for generating predictive outputs based on the encoded features from the stage-one fusion S105, S106.

[0087] However, in some implementations, the same decoder layers are used to process both the encoded features from the stage-two fusion S107 and the encoded features from the stage-one fusion S105, S106. The specific setup will depend on how the decoder layers are trained and on the specifications provided to the system. Furthermore, even when the predicted output is generated based on sensor output from only one of the clusters, the operable hardware platform can, depending on the specific setup, still perform the fusion steps S105 / S106 and S107. Even though the stage-two fusion may seem redundant in this case, the data flow can be configured such that the decoder layers receive input only from the stage-two fusion layers.

[0088] The fusion processes S105, S106, and S107 can be implemented as "intermediate fusion", a technique for combining feature representations from different sensor modalities after they have been encoded by their respective networks. In bird's-eye view (BEV) fusion methods, which are particularly suitable for autonomous driving applications, sensor data (from cameras, LiDAR, and radar) is converted into encoded BEV representations (e.g., BEV grids) before being fused and decoded. BEV is advantageous because it provides a top-down, geometrically consistent view of the environment, making it easier to perceive spatial relationships between targets around the vehicle (e.g., roads, cars, pedestrians).

[0089] For example, regarding either of the first-stage fusion steps S105 and S106, let's assume the first sensor in the cluster is a camera, and the second sensor in the same cluster is a LiDAR. The image captured by the camera is then passed through an encoder network (e.g., a Convolutional Neural Network (CNN) encoder) to extract high-level image features such as edges, textures, and object categories. The output of the camera encoder is typically a set of feature maps (two-dimensional). Then, for the second sensor, the point cloud captured by the LiDAR can be passed through a voxel-based or point-based encoder (e.g., PointNet, VoxelNet). This transforms the sparse 3D point cloud data into a dense feature representation.

[0090] Before the fusion S105, S106, the data encoded in S103, S104 from each sensor can be projected into the BEV format. More specifically, the encoded camera datasets (e.g., CNN-encoded feature maps) are transformed from their two-dimensional image-plane perspective into a top-down BEV representation. This involves projecting the features onto a common ground plane (e.g., using learned or geometric transformation matrices) to align them with the three-dimensional environment. The three-dimensional LiDAR point cloud, in turn, can be "voxelized" into a grid-like structure, which makes the transformation to BEV straightforward: the voxels are collapsed along the height axis onto a two-dimensional plane, creating a dense BEV feature map. Projecting to BEV aligns all sensor modalities on a common reference frame, which simplifies the fusion S105, S106.
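The collapse of a LiDAR point cloud onto a two-dimensional BEV grid can be sketched as follows. This is a hand-crafted simplification under stated assumptions: the function name, the grid extent, the cell size, and the two per-cell features (point count and maximum height) are illustrative choices, whereas a voxel-based or point-based encoder as named above would learn its features.

```python
def pointcloud_to_bev(points, x_range, y_range, cell_size):
    """Bin (x, y, z) points into a 2D grid of [point_count, max_z] cells.

    Height is not kept as a grid axis: all points above the same ground
    cell are collapsed into that cell. Empty cells stay [0, 0.0].
    """
    x_min, x_max = x_range
    y_min, y_max = y_range
    n_cols = int((x_max - x_min) / cell_size)
    n_rows = int((y_max - y_min) / cell_size)
    bev = [[[0, 0.0] for _ in range(n_cols)] for _ in range(n_rows)]
    for x, y, z in points:
        if not (x_min <= x < x_max and y_min <= y < y_max):
            continue  # drop points outside the grid extent
        col = int((x - x_min) / cell_size)
        row = int((y - y_min) / cell_size)
        cell = bev[row][col]
        cell[1] = z if cell[0] == 0 else max(cell[1], z)
        cell[0] += 1
    return bev


# Toy point cloud in metres; the last point lies outside the grid.
points = [(0.5, 0.5, 1.0), (0.7, 0.2, 2.5), (3.9, 3.1, -0.3), (10.0, 10.0, 0.0)]
bev_map = pointcloud_to_bev(points, x_range=(0.0, 4.0), y_range=(0.0, 4.0),
                            cell_size=1.0)
```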

[0091] Once all encoded datasets are in BEV format, they can be fused S105, S106. This fusion S105, S106 can be done at the feature level, combining information from different sensors to create a richer and more robust representation of the environment. As mentioned above, several methods exist for implementing intermediate fusion (e.g., concatenation, averaging, or learned transformations). More specifically, using a concatenation method, the encoded data features (BEV feature maps) from each sensor are concatenated along the channel dimension. For example, if the camera BEV map has 128 channels and the LiDAR BEV map has 64 channels, the resulting fused map will have 192 channels. This method preserves all feature information from each modality. Using an element-wise summation method, the encoded data features from each sensor are summed element-wise, fusing information directly at each spatial location. The element-wise summation method helps balance the contributions of the different sensors, although it may lose some modality-specific nuances. Using a learned transformation requires the fusion network to be trained to "learn" a set of weights or transformation functions (such as fully connected layers or convolutional layers) that combine the BEV feature maps from each sensor. This allows the fusion network to learn, based on the scene, which features from each sensor are most important and how to weight them.
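The three fusion methods described above can be sketched on toy BEV feature maps stored as nested lists in channel-first order `[C][H][W]`. All function names are hypothetical. The 1x1 channel-mixing function stands in for a learned convolutional layer, and its weights are fixed by hand here, whereas in a trained network they would be learned.

```python
def make_map(channels, height, width, value):
    """Constant-valued BEV feature map in [C][H][W] layout."""
    return [[[value] * width for _ in range(height)] for _ in range(channels)]


def concat_channels(a, b):
    """Concatenation along the channel dimension."""
    return a + b


def elementwise_sum(a, b):
    """Element-wise sum; both maps must have the same number of channels."""
    if len(a) != len(b):
        raise ValueError("channel counts differ")
    return [[[x + y for x, y in zip(row_a, row_b)]
             for row_a, row_b in zip(ch_a, ch_b)]
            for ch_a, ch_b in zip(a, b)]


def conv1x1(x, weights):
    """Channel mixing: out[o][h][w] = sum over c of weights[o][c] * x[c][h][w]."""
    height, width = len(x[0]), len(x[0][0])
    return [[[sum(w_c * x[c][h][w] for c, w_c in enumerate(w_row))
              for w in range(width)]
             for h in range(height)]
            for w_row in weights]


camera_bev = make_map(128, 2, 2, 1.0)   # 128-channel camera BEV map
lidar_bev = make_map(64, 2, 2, 2.0)     # 64-channel LiDAR BEV map

# Concatenation: 128 + 64 = 192 channels.
fused_concat = concat_channels(camera_bev, lidar_bev)

# Element-wise sum, shown on maps with matching channel counts.
fused_sum = elementwise_sum(camera_bev, make_map(128, 2, 2, 2.0))

# Stand-in for a learned transformation: one output channel that adds
# camera channel 0 and LiDAR channel 0 (index 128 after concatenation).
weights = [[0.0] * 192]
weights[0][0] = 1.0
weights[0][128] = 1.0
fused_learned = conv1x1(fused_concat, weights)
```

Concatenation keeps every input channel, so the map grows to 192 channels, while the sum and the 1x1 mixing keep or reduce the channel count at the cost of blending modality-specific information.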

[0092] As will be readily understood by those skilled in the art, a similar approach can be used for the stage-two fusion, the difference being that the input to the stage-two fusion network is two or more already fused representations of the vehicle's surrounding environment.

[0093] Executable instructions for performing these functions may optionally be included in a non-transitory computer-readable storage medium or other computer program product configured for execution by one or more processors.

[0094] Figures 2a, 2b, 3a and 3b are schematic block diagram representations of a system 10 for generating predictive outputs for an autonomous driving system 310 of a vehicle 1, according to some embodiments. System 10 includes two separate hardware platforms 210, 211 having control circuitry (e.g., one or more processors) 11a, 11b, the control circuitry being configured to perform the functions of the method S100 disclosed herein, wherein the functions may be included in non-transitory computer-readable storage media 12a, 12b, or included in other computer program products configured for execution by the control circuitry 11a, 11b. In other words, each hardware platform 210, 211 of system 10 includes one or more memory storage areas 12a, 12b including program code, the one or more memory storage areas 12a, 12b and the program code being configured, together with one or more processors 11a, 11b, to enable system 10 to perform method S100 according to any of the embodiments disclosed herein.

[0095] In more detail, Figures 2a and 2b depict a scenario in which both hardware platforms 210 and 211 are operable, and schematically illustrate the information flow through the system according to some embodiments. Figure 3a depicts a scenario in which the first hardware platform 210 is operable and the second hardware platform 211 is inoperable. Figure 3b depicts a scenario in which the first hardware platform 210 is inoperable and the second hardware platform 211 is operable. Both Figure 3a and Figure 3b schematically illustrate the information flow through the system according to some embodiments.

[0096] Accordingly, system 10 includes: a first hardware platform 210, which includes a first sensor cluster 324a and one or more processors 11a configured to encode a first sensor dataset generated by the first sensor cluster of the vehicle using one or more first encoder layers 201a to form an encoded first sensor dataset. System 10 further includes: a second hardware platform 211, which includes a second sensor cluster 324b and one or more processors 11b configured to encode a second sensor dataset generated by the second sensor cluster of the vehicle using one or more second encoder layers 201b to form an encoded second sensor dataset. The second sensor cluster 324b includes different sensors than the first sensor cluster 324a. In some embodiments, each sensor within a sensor cluster is of a different modality than the other sensors within that cluster. However, in some embodiments, the first sensor cluster and the second sensor cluster may have one or more common sensors. For example, if the vehicle is equipped with only one specific sensor (e.g., a front-facing wide-angle camera) that is expected to be included in both clusters, then that specific sensor may be a common sensor between the clusters.

[0097] Furthermore, one or more processors 11a of the first hardware platform 210 or one or more processors 11b of the second hardware platform 211 are further configured to fuse at least a portion of the encoded first sensor dataset and at least a portion of the encoded second sensor dataset using one or more sensor fusion layers 206 to form a set of fused encoded sensor data features. A prediction output is then generated based on this set of fused encoded sensor data features using one or more decoder layers 203c. Additionally, the one or more first encoder layers 201a, the one or more second encoder layers 201b, the one or more sensor fusion layers 206, and the one or more decoder layers 203c together form an end-to-end trained neural network.

[0098] Furthermore, one or more processors 11a of the vehicle's first hardware platform or one or more processors 11b of the vehicle's second hardware platform may be further configured to transmit the generated predictive output to the autonomous driving system 310, which is configured to control one or more downstream functions of the vehicle based on the generated predictive output.

[0099] Furthermore, in some embodiments, the one or more sensor fusion layers 206 are one or more stage-two sensor fusion layers 206, and the set of fused encoded sensor data features is a set of stage-two fused encoded sensor data features. The one or more processors 11a of the first hardware platform 210 may be further configured to fuse at least a portion of the encoded first sensor dataset using one or more first stage-one sensor fusion layers 202a to form a first set of fused encoded sensor data features. Likewise, the one or more processors 11b of the second hardware platform 211 may be further configured to fuse at least a portion of the encoded second sensor dataset using one or more second stage-one sensor fusion layers 202b to form a second set of fused encoded sensor data features.

[0100] Accordingly, the previously described fusion can be performed as a stage-two fusion. The fusion of at least a portion of the encoded first sensor dataset and at least a portion of the encoded second sensor dataset can therefore include fusing the first set of fused encoded sensor data features and the second set of fused encoded sensor data features using the one or more stage-two sensor fusion layers 206 to form the set of stage-two fused encoded sensor data features. Here, the one or more first encoder layers 201a, the one or more second encoder layers 201b, the one or more first stage-one sensor fusion layers 202a, the one or more second stage-one sensor fusion layers 202b, the one or more stage-two sensor fusion layers 206, and the one or more decoder layers 203c together form an end-to-end trained neural network.

[0101] Accordingly, as illustrated in Figure 2a, in response to both hardware platforms 210 and 211 being operable, the one or more processors 11a of the first hardware platform can be configured to receive the second set of fused encoded sensor data features from the second hardware platform 211, fuse the first set of fused encoded sensor data features and the second set of fused encoded sensor data features using the one or more stage-two sensor fusion layers 206 to form the set of stage-two fused encoded sensor data features, and generate a prediction output based on the set of stage-two fused encoded sensor data features using the one or more decoder layers 203c. Similarly, as illustrated in Figure 2b, in response to both hardware platforms 210 and 211 being operable, the one or more processors 11b of the second hardware platform can be configured to receive the first set of fused encoded sensor data features from the first hardware platform 210, fuse the first set of fused encoded sensor data features and the second set of fused encoded sensor data features using the one or more stage-two sensor fusion layers 206 to form the set of stage-two fused encoded sensor data features, and generate a prediction output based on the set of stage-two fused encoded sensor data features using the one or more decoder layers 203c.

[0102] Therefore, when both hardware platforms are operational, one platform transmits its first-stage fused representation of the surrounding environment to the other platform, which then performs the second-stage fusion by combining its own first-stage fused representation with the fused representation it received. Thus, the final prediction output is based on all available sensor measurements of the vehicle, while bandwidth usage between the platforms remains lower than if raw sensor data or encoded sensor data were transmitted directly from one platform to the other.

[0103] Furthermore, as depicted in Figure 3a, in response to the first hardware platform 210 being operable and the second hardware platform 211 being inoperable, the one or more processors 11a of the first hardware platform 210 of vehicle 1 can be configured to generate a predictive output based on the first set of fused encoded sensor data features using one or more decoder layers 203a, 203c. In this case, the one or more decoder layers can be stage-one decoder layers 203a, which are specifically configured to generate the predictive output based on the first set of fused encoded sensor data features, or the previously used decoder layers 203c.

[0104] In other words, if the second hardware platform exhibits a hardware or software failure, the first hardware platform can provide the predictive output based on its own sensor measurements and effectively skip the second-stage fusion. This achieves the redundancy required for safe operation of the ADS: the system maintains sufficient functionality even if one of the hardware platforms fails. However, as indicated in Figure 3a, the first hardware platform 210 can still perform stage-two fusion on the first set of fused encoded sensor data features and feed the output to the one or more decoder layers 203c. As mentioned, which variant is used will depend on the specific implementation of the architecture.

[0105] Similarly, as depicted in Figure 3b, in response to the first hardware platform 210 being inoperable and the second hardware platform 211 being operable, the one or more processors 11b of the second hardware platform 211 of vehicle 1 can be configured to generate a predictive output based on the second set of fused encoded sensor data features using one or more decoder layers 203b, 203c. As previously described, the one or more decoder layers can be stage-one decoder layers 203b, which are specifically configured to generate a predictive output based on the second set of fused encoded sensor data features, or the previously used decoder layers 203c.
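
The three operating cases of Figures 2a/2b, 3a, and 3b can be summarized as a small dispatch routine. The following Python sketch is illustrative only and is not the claimed implementation: the function name `select_inference_path`, the string labels, and the decision to raise an error when both platforms are inoperable are assumptions, and only the reference numerals are taken from the description. For the single-platform cases the sketch picks the dedicated decoder layers 203a and 203b, although the description also permits reusing decoder layers 203c.

```python
from typing import List


def select_inference_path(first_operable: bool, second_operable: bool) -> List[str]:
    """Return the ordered processing stages for the current platform status.

    Hypothetical helper: the stage labels reuse the reference numerals of the
    description, but the function itself is an illustrative assumption.
    """
    if first_operable and second_operable:
        # Figures 2a/2b: both stage-one fusions feed the stage-two fusion 206.
        return ["stage-one fusion 202a", "stage-one fusion 202b",
                "stage-two fusion 206", "decoder 203c"]
    if first_operable:
        # Figure 3a: skip stage-two fusion and decode the first fused set.
        return ["stage-one fusion 202a", "decoder 203a"]
    if second_operable:
        # Figure 3b: mirror case for the second hardware platform.
        return ["stage-one fusion 202b", "decoder 203b"]
    # Assumption: the description does not cover a double failure.
    raise RuntimeError("no operable hardware platform")
```

In this sketch the fallback paths are strictly shorter than the nominal path, which mirrors the statement that a single platform can skip the second-stage fusion.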

[0106] The proposed system 10 can be implemented based on a transformer architecture and the concept of a "query." A "query" refers to the component in a transformer-type neural network involved in the self-attention mechanism, with the other related components being keys and values. Here, a query refers to the vector representation of the current token that the model focuses on at a specific step of the transformer's attention mechanism. When processing the input sequence, the query essentially asks, "How much attention should I give to the other tokens?" Each token in the input sequence also has a corresponding key, which is a vector summarizing how important that token is relative to other tokens based on specific features. The value is a vector containing the actual content or information that is passed on or attended to during the attention process. Accordingly, for each token, the query vector is compared with the key vectors of all other tokens in the sequence. This comparison is typically done by computing the dot product between the query and the key. The result is a set of attention scores that determine how much attention a token (represented by the query) should pay to other tokens (represented by their keys). These attention scores are then used to weight the corresponding values, which represent the actual information passed through the attention layer. This self-attention mechanism enables the transformer to capture long-distance dependencies and relationships between features in a data sequence.
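
The query, key, and value interaction described above can be illustrated with a short, dependency-free Python sketch. It is a toy under stated assumptions, not the network of system 10: the scaling by the square root of the vector dimension and the softmax normalization follow the common transformer formulation and are not specified in this description, and the names `softmax` and `attend` as well as the example vectors are hypothetical.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: Vector) -> Vector:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query: Vector, keys: List[Vector], values: List[Vector]) -> Vector:
    """Scaled dot-product attention for a single query vector."""
    scale = math.sqrt(len(query))
    # Attention scores: dot product of the query with every key.
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    # Output: attention-weighted sum of the value vectors.
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]


# Toy example: the query matches the first key far better than the second,
# so the output lies close to the first value vector.
keys = [[10.0, 0.0], [0.0, 10.0]]
values = [[1.0, 0.0], [0.0, 1.0]]
output = attend([10.0, 0.0], keys, values)
```

Because the attention weights sum to one, the output is always a weighted average of the value vectors, which is what lets a query "extract" information from the tokens it attends to.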

[0107] Accordingly, a query can be interpreted as a high-dimensional learnable vector from which information can be extracted from different data sources. According to some implementations, a first set of queries may be intended to extract information from several time steps of an encoded first sensor dataset used on a first hardware platform. This enables the first stage of spatial and temporal fusion. Different queries in the first set can be trained to extract information about different relevant entities in the surrounding environment, such as vehicles and pedestrians and their behavior, lane markings, traffic signs, etc. Once this information is extracted, this set of queries corresponds to a highly compressed, low-bandwidth representation of the surrounding environment, which can be sent to different hardware platforms for further processing (second-stage fusion). A similar set of queries (a second set of queries) may be intended to extract temporal and spatial information from an encoded second sensor dataset on a second hardware platform. These queries can then be sent for further processing by other hardware platforms (second-stage fusion).

[0108] The second-stage fusion process corresponds to the fusion between the first and second sets of queries. Because it does not involve initial sensor data encoding, it is expected to require less computation than the first-stage fusion process. Several methods exist for performing this second-stage fusion between the two sets of queries. For example, two different sets of queries can be made to extract information from each other. Alternatively, another set of queries (a third set of queries) can be made with the aim of extracting information from the first and second sets of queries.
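
The second alternative above, in which a third set of queries extracts information from the first and second sets, can be sketched as cross-attention over the concatenated query sets. The following Python sketch is a simplified illustration and not the disclosed network: the function names, the two-dimensional example vectors, and the use of the memory vectors as both keys and values (without learned projections) are assumptions made to keep the example short.

```python
import math
from typing import List

Vector = List[float]


def cross_attend(query: Vector, memory: List[Vector]) -> Vector:
    """One query extracts a softmax-weighted mix of the memory vectors.

    Simplification: memory vectors serve as both keys and values; a trained
    network would apply learned projections first.
    """
    scale = math.sqrt(len(query))
    scores = [sum(q * m for q, m in zip(query, vec)) / scale for vec in memory]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    return [sum(w * vec[i] for w, vec in zip(weights, memory))
            for i in range(len(query))]


def stage_two_fusion(third_queries: List[Vector],
                     first_queries: List[Vector],
                     second_queries: List[Vector]) -> List[Vector]:
    """Fuse both platforms' query sets through a third set of queries."""
    memory = first_queries + second_queries  # concatenate the two sets
    return [cross_attend(q, memory) for q in third_queries]


# Hypothetical 2-dimensional queries; real queries are high-dimensional.
first_set = [[1.0, 0.0], [0.5, 0.5]]    # from the first hardware platform
second_set = [[0.0, 1.0]]               # received from the second platform
third_set = [[1.0, 1.0], [2.0, -2.0]]   # learnable in a trained network
fused = stage_two_fusion(third_set, first_set, second_set)
```

Note that this step operates only on the compact query sets and never touches raw or encoded sensor data, which is consistent with the expectation that second-stage fusion requires less computation than first-stage fusion.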

[0109] Although the software (SW) architecture described herein is applied to two separate hardware platforms ("Compute Units"), the same SW architecture can, according to some implementations, be applied on a single hardware platform ("Compute Unit"). While complete hardware redundancy then cannot be achieved in the event of a hardware failure, software redundancy can still be achieved in the event of a software failure. For autonomous driving systems with lower safety requirements (e.g., ADAS and AD), software redundancy may be sufficient.

[0110] Figure 4 is a schematic block diagram depicting an end-to-end training setup for a system used to generate predictive outputs for an autonomous driving system of a vehicle, according to some implementations. Here, datasets are indicated by thin hexagons, while algorithms or layers are indicated by rectangles. Furthermore, downstream data flows (inference) are indicated by solid arrows, while data flows (training signals) used for end-to-end training of the encoder layers, fusion layers, and decoder layers are indicated by dashed lines.

[0111] As mentioned above, encoder layers 201a and 201b can be shared among different sensors within a sensor cluster, or they can be sensor-specific. Furthermore, when one or more decoder layers 203 are applied, they can be shared in three scenarios: decoding the second-stage fused encoding, decoding the first-stage fused encoding from a first hardware platform, and decoding the first-stage fused encoding from a second hardware platform. This means using the same one or more decoder layers regardless of whether the input is a set of stage-one fused sensor data or a set of stage-two fused sensor data, and regardless of whether the input originates from a first hardware platform, a second hardware platform, or both. However, in some embodiments, three separate decoder layers 203a, 203b, and 203c are provided to process the fused encoding in all three scenarios.

[0112] Furthermore, training can be performed simultaneously on the one or more common decoder layers or, if three separate decoder groups are used, on the three separate decoder layers 203a, 203b, and 203c. Additionally, within the one or more decoder layers 203a, 203b, and 203c, there can be subsets of decoder layers configured to solve different tasks (indicated in Figure 4 as decoder 1 ... decoder K). In other words, a first subset of the decoder layers (e.g., decoder 1) can be trained to generate predicted outputs in the form of pedestrian detection and classification, a second subset of the decoder layers (e.g., decoder 2) can be trained to generate predicted outputs in the form of lane marking detection and classification, and a third subset of the decoder layers (e.g., decoder K) can be trained to generate expected trajectories of dynamic targets in the vehicle's surrounding environment. All these subsets of the decoder layers (solving different tasks) can be trained simultaneously. However, in some implementations, training of all decoder layers or all subsets of decoder layers within a set of decoder layers can be performed separately.

[0113] The entire architecture, including encoder layers 201a and 201b, fusion layers 202a, 202b, and 206, and decoder layers 203a, 203b, and 203c, is trained together end-to-end using supervised learning techniques. More specifically, during the training phase, a training dataset is provided, comprising: a sensor dataset forming the input objects (measurements from the corresponding sensors of each sensor cluster), and a set of annotated outputs for each specific perception task, forming the desired output. Therefore, for each sensor dataset, there exists a desired output, and the entire system, including all layers, is designed to generate this desired output for a specific perception task. For example, if the system is designed for an object detection task, the annotated dataset could include a 3D scene with manually added 3D bounding boxes.

[0114] Assuming encoder layers 201a, 201b, fusion layers 202a, 202b, 206, and decoder layers 203a, 203b, 203c are initialized and the parameters of each network are set, the input object is fed through the processing chain. This step is commonly referred to as forward propagation. Next, loss calculation is performed, where the predicted output and the expected output (annotation task i) are fed as input to a loss function, also known as a cost function. The loss function represents a specific mathematical function that quantifies the difference between the predicted values (predicted outputs) and the actual ground truth values (annotation task i) in the training dataset.

[0115] Depending on the specific perception task, different loss functions can be used. For example, cross-entropy loss, classification loss, regression loss, dice loss, or combinations thereof can be used. Once the difference between the predicted and actual ground truth values is quantified, backpropagation is used to compute the gradient of the loss with respect to the model parameters of encoder layers 201a, 201b, fusion layers 202a, 202b, 206, and decoder layers 203a, 203b, 203c. Next, these model parameters are updated using an optimization algorithm such as Adam, stochastic gradient descent, or RMSprop. The update aims to minimize the loss function. This process is then iterated by repeating the forward propagation, loss calculation, backpropagation, and parameter update steps until the model performance converges.
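
The four repeated steps (forward propagation, loss calculation, backpropagation, and parameter update) can be illustrated on a deliberately tiny model. The Python sketch below is a stand-in, not the training procedure of the disclosed network: it fits a one-dimensional linear model with a mean squared error loss, computes the gradients analytically in place of automatic backpropagation, uses plain gradient descent rather than Adam or RMSprop, and runs a fixed number of iterations instead of testing for convergence. The function name `train`, the learning rate, and the toy dataset are assumptions.

```python
from typing import List, Tuple


def train(data: List[Tuple[float, float]], lr: float = 0.05,
          epochs: int = 2000) -> Tuple[float, float, float]:
    """Fit y = w * x + b by repeating the four training steps."""
    w, b = 0.0, 0.0  # parameter initialization
    loss = float("inf")
    n = len(data)
    for _ in range(epochs):
        # 1. Forward propagation: predictions for all inputs.
        preds = [w * x + b for x, _ in data]
        # 2. Loss calculation: mean squared error (a regression loss).
        loss = sum((p - y) ** 2 for p, (_, y) in zip(preds, data)) / n
        # 3. Backpropagation: gradients of the loss w.r.t. w and b.
        grad_w = sum(2 * (p - y) * x for p, (x, y) in zip(preds, data)) / n
        grad_b = sum(2 * (p - y) for p, (_, y) in zip(preds, data)) / n
        # 4. Parameter update: one gradient-descent step.
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b, loss


# Annotated toy dataset generated from y = 2x + 1.
samples = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]
w, b, final_loss = train(samples)
```

In an end-to-end trained network the same loop applies, except that the gradient in step 3 flows back through the decoder, fusion, and encoder layers in one pass, so all of them are updated against the same loss.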

[0116] Figure 5 is a schematic illustration of a vehicle 1 equipped with an ADS and including such a system 10. As used herein, "vehicle" means any form of motorized transport. For example, vehicle 1 can be any road vehicle such as a car (as illustrated herein), a motorcycle, a (freight) truck, a bus, etc. However, in some embodiments, the vehicle can be in the form of an autonomous aircraft or a vessel.

[0117] System 10 includes two separate hardware platforms, each including its own control circuits 11a, 11b and memories 12a, 12b. Each control circuit 11a, 11b may physically comprise a single circuit device. Alternatively, the control circuits 11a, 11b may be distributed across several circuit devices. As an example, system 10 may share its control circuits 11a, 11b with other parts of vehicle 1 (e.g., ADS 310). Furthermore, system 10 may form part of ADS 310, i.e., system 10 may be implemented as a module or feature of ADS. In other words, ADS can be executed by either of the two hardware platforms.

[0118] Control circuits 11a and 11b may include one or more processors such as a central processing unit (CPU), a graphics processing unit (GPU), a microcontroller, or a microprocessor. One or more processors may be configured to execute program code stored in memories 12a and 12b to perform various functions and operations of vehicle 1 in addition to the methods disclosed herein. The processor may be or may include any number of hardware components for performing data or signal processing or for executing computer code stored in memories 12a and 12b. Memories 12a and 12b optionally include high-speed random access memory such as DRAM, SRAM, DDR RAM, or other random access solid-state storage devices; and optionally include non-volatile memory such as one or more disk storage devices, optical disk storage devices, flash memory devices, or other non-volatile solid-state storage devices. Memories 12a and 12b may include database components, object code components, script components, or any other type of information structure used to support various activities of this specification.

[0119] In the illustrated example, memories 12a and 12b further store map data 308. Map data 308 can be used, for example, by the ADS 310 of vehicle 1 to perform autonomous functions of vehicle 1. Map data 308 may include high-definition (HD) map data. It is conceivable that even though memories 12a and 12b are illustrated as separate elements from ADS 310, they can also be provided as integrated elements of ADS 310. In other words, according to the exemplary embodiment, any distributed or local memory device can be used to implement the inventive concept. Similarly, control circuitry 11a and 11b can be distributed, for example, such that one or more processors of control circuitry 11a and 11b are provided as integrated elements of ADS 310 or any other system of vehicle 1. In other words, according to the exemplary embodiment, any distributed or local control circuitry device can be used to implement the inventive concept. ADS 310 is configured to perform autonomous or semi-autonomous functions and operations of vehicle 1. ADS 310 may include multiple modules, each responsible for a different function of ADS 310.

[0120] Vehicle 1 includes many components typically found in autonomous or semi-autonomous vehicles. It should be understood that vehicle 1 may have any combination of the various elements shown in Figure 5. Furthermore, vehicle 1 may include further elements beyond those shown in Figure 5. Although the various elements are shown herein as being located inside vehicle 1, one or more of the elements may be located outside vehicle 1. For example, map data may be stored in a remote server and accessed by various components of vehicle 1 via communication system 326. Furthermore, although the various elements are depicted herein in certain arrangements, as will be readily understood by those skilled in the art, the various elements may be implemented in different arrangements. It should be further noted that the various elements may be communicatively connected to each other in any suitable manner. The vehicle 1 of Figure 5 should be considered only as an illustrative example, as the components of vehicle 1 can be implemented in several different ways.

[0121] Vehicle 1 further includes a sensor system 320. Sensor system 320 is configured to acquire sensing data about the vehicle itself or its surroundings. Sensor system 320 may, for example, include a Global Navigation Satellite System (GNSS) module 322 (such as GPS) configured to collect geographic location data of vehicle 1. Sensor system 320 may further include one or more sensors 324. Sensors 324 may be any type of on-board sensor such as cameras, LiDARs, RADARs, ultrasonic sensors, gyroscopes, accelerometers, odometers, etc. It should be appreciated that sensor system 320 may also provide the possibility of acquiring sensing data directly or via dedicated sensor control circuitry in vehicle 1. Here, however, the sensors of the sensor system are divided into at least two independent sensor clusters, each of which is associated with a corresponding hardware platform.

[0122] Vehicle 1 further includes a communication system 326. Communication system 326 is configured to communicate with external units such as other vehicles (i.e., via vehicle-to-vehicle (V2V) communication protocols), remote servers (e.g., cloud servers), databases, or other external devices (i.e., via vehicle-to-infrastructure (V2I) or vehicle-to-everything (V2X) communication protocols). Communication system 326 can communicate using one or more communication technologies. Communication system 326 may include one or more antennas (not shown). Cellular communication technologies can be used for remote communication, such as to remote servers or cloud computing systems. Additionally, if the cellular communication technology used has low latency, it can also be used for V2V, V2I, or V2X communication. Examples of cellular radio technologies include GSM, GPRS, EDGE, LTE, 5G, 5G NR, and so on, including future cellular solutions. However, in some solutions, short-to-medium range communication technologies such as wireless local area networks (WLANs) (e.g., solutions based on IEEE 802.11) can be used for communication with other vehicles near vehicle 1 or with local infrastructure components. ETSI is developing cellular standards for vehicle communications, and 5G is considered a suitable solution, for example, due to its high bandwidth, low latency, and efficient handling of communication channels.

[0123] The communication system 326 can accordingly provide the possibility of sending outputs to and / or receiving inputs from remote locations (e.g., remote operation or control centers) via one or more antennas. Furthermore, the communication system 326 can be further configured to allow various components of vehicle 1 to communicate with each other. As an example, the communication system can provide a local network setup such as CAN bus, I2C, Ethernet, fiber optics, etc. Local communication within the vehicle can also be a wireless type with protocols such as WiFi, LoRa, Zigbee, Bluetooth, or similar medium / short-range technologies.

[0124] Vehicle 1 further includes a control system 328. The control system 328 is configured to control the handling of vehicle 1. The control system 328 includes a steering module 330 configured to control the heading of vehicle 1. The control system 328 further includes a throttle module 332 configured to control the actuation of the throttle valve of vehicle 1. The control system 328 further includes a braking module 334 configured to control the actuation of the brakes of vehicle 1. The various modules of the control system 328 can also receive manual input from the driver of vehicle 1 (i.e., from the steering wheel, accelerator pedal, and brake pedal, respectively). However, the control system 328 can be communicatively connected to the vehicle's ADS 310 to receive instructions on how the various modules of the control system 328 should operate. Therefore, the ADS 310 can control the handling of vehicle 1, for example, via a decision and control module 318.

[0125] ADS 310 may include a positioning module 312 or a positioning block / system. The positioning module 312 is configured to determine and / or monitor the geographic location and heading of vehicle 1, and may utilize data from sensor system 320, such as data from GNSS module 322. Alternatively or in combination, the positioning module 312 may utilize data from one or more sensors 324. For improved accuracy, the positioning system may alternatively be implemented as Real-Time Kinematic (RTK) GPS.

[0126] ADS 310 may further include a perception module 314 or a perception block / system 314. The perception module 314 may refer to any known module and / or function, for example, included in one or more electronic control modules and / or nodes of vehicle 1, adapted and / or configured to interpret sensing data related to driving of vehicle 1, to identify, for example, obstacles, lanes, relevant signs, appropriate navigation paths, etc. The perception module 314 may therefore be adapted to rely on and receive input from multiple data sources such as automotive imaging, image processing, computer vision, and / or in-vehicle networks, and to combine with sensing data, for example, from sensor system 320. System 10 may be implemented, for example, in the perception module 314 of ADS 310.

[0127] The positioning module 312 and / or the perception module 314 may be communicatively connected to the sensor system 320 to receive sensing data from the sensor system 320. The positioning module 312 and / or the perception module 314 may further transmit control commands to the sensor system 320. The path planning module 316 and / or the perception module 314 may be implemented according to the embodiments disclosed herein.

[0128] The present invention has been presented above with reference to specific embodiments. However, other embodiments besides those described above are possible and within the scope of the invention. Within the scope of the invention, method steps that are different from those described above, performed by hardware or software, can be provided. Thus, according to an exemplary embodiment, a non-transitory computer-readable storage medium is provided storing one or more programs configured to be executed by one or more processors of a vehicle control system, the one or more programs including instructions for performing the method according to any of the embodiments described above. Alternatively, according to another exemplary embodiment, a cloud computing system can be configured to perform any of the methods presented herein. The cloud computing system may include distributed cloud computing resources that jointly perform the methods presented herein under the control of one or more computer program products.

[0129] Generally, computer-accessible media can include any tangible or non-transitory storage medium or storage media such as electrical, magnetic, or optical media, for example, a disk or CD / DVD-ROM coupled to a computer system via a bus. The terms “tangible” and “non-transitory” as used herein are intended to describe computer-readable storage media (or “memory”) excluding propagating electromagnetic signals, but are not intended to otherwise limit the types of physical computer-readable storage devices encompassed by the phrases “computer-readable medium” or “memory.” For example, the terms “non-transitory computer-readable medium” or “tangible memory” are intended to encompass types of storage devices that do not necessarily store information permanently, including, for example, random access memory (RAM). Program instructions and data stored in non-transitory form on tangible computer-accessible storage media can be further transmitted via transmission media or signals such as electrical, electromagnetic, or digital signals, which can be conveyed via communication media such as networks and / or wireless links.

[0130] Processors 11a and 11b (as associated with system 10) may be or may include any number of hardware components for performing data or signal processing or for executing computer code stored in memories 12a and 12b. System 10 has associated memories 12a and 12b, and memories 12a and 12b may be one or more means for storing data and / or computer code for performing or facilitating the various methods described herein. Memory may include volatile memory or non-volatile memory. Memory 12a and 12b may include database components, object code components, script components, or any other type of information structure for supporting the various activities of this specification. According to exemplary embodiments, any distributed or local storage device may be used with the systems and methods of this specification. According to exemplary embodiments, memories 12a and 12b (e.g., via circuitry or any other wired, wireless, or network connection) may be communicatively connected to processors 11a and 11b and include computer code for performing one or more processes described herein.

[0131] It should be noted that any reference numerals in the drawings do not limit the scope of the claims. The invention can be implemented, at least in part, by both hardware and software means, and several “means” or “units” can be represented by the same item of hardware.

[0132] Although the accompanying drawings may show a specific order of method steps, the order of steps may differ from what is depicted. Furthermore, two or more steps may be performed simultaneously or partially simultaneously. Such variation will depend on the chosen software and hardware systems and on the designer's choices. All such variations are within the scope of this invention. Similarly, software implementations can be accomplished using standard programming techniques with rule-based logic and other logic to carry out the various connection, processing, encoding, fusion, and generation steps. The embodiments mentioned and described above are given by way of example only and should not be construed as limiting the invention. Other solutions, uses, purposes, and functions within the scope of the invention as claimed in the patent claims below will be apparent to those skilled in the art.

Claims

1. A computer-implemented method (S100) for generating predictive outputs for an autonomous driving system of a vehicle, the computer-implemented method comprising: by one or more processors of a first hardware platform of the vehicle: encoding (S103) a first sensor dataset generated by a first sensor cluster of the vehicle using one or more first encoder layers to form an encoded first sensor dataset; by one or more processors of a second hardware platform of the vehicle: encoding (S104) a second sensor dataset generated by a second sensor cluster of the vehicle using one or more second encoder layers to form an encoded second sensor dataset; wherein the first hardware platform and the second hardware platform are separate hardware platforms, and each sensor in the first sensor cluster is different from the sensors in the second sensor cluster; the computer-implemented method further comprising: by the one or more processors of the first hardware platform of the vehicle or the one or more processors of the second hardware platform of the vehicle: fusing (S107) at least a portion of the encoded first sensor dataset and at least a portion of the encoded second sensor dataset using one or more sensor fusion layers to form a set of fused encoded sensor data features; and generating (S108) a predictive output based on the set of fused encoded sensor data features using one or more decoder layers; wherein the one or more first encoder layers, the one or more second encoder layers, the one or more sensor fusion layers, and the one or more decoder layers together form an end-to-end trained neural network.

2. The computer-implemented method (S100) according to claim 1, further comprising: by the one or more processors of the first hardware platform of the vehicle or the one or more processors of the second hardware platform of the vehicle: transmitting (S109) the generated predictive output to the autonomous driving system, which is configured to control one or more downstream functions of the vehicle based on the generated predictive output.

3. The computer-implemented method (S100) according to claim 1, wherein the one or more sensor fusion layers are one or more stage-two sensor fusion layers, and wherein the set of fused encoded sensor data features is a set of stage-two fused encoded sensor data features, the computer-implemented method further comprising: by the one or more processors of the first hardware platform of the vehicle: fusing (S105) at least a portion of the encoded first sensor dataset using one or more first stage-one sensor fusion layers to form a first set of fused encoded sensor data features; by the one or more processors of the second hardware platform of the vehicle: fusing (S106) at least a portion of the encoded second sensor dataset using one or more second stage-one sensor fusion layers to form a second set of fused encoded sensor data features; wherein fusing (S107) at least a portion of the encoded first sensor dataset and at least a portion of the encoded second sensor dataset includes: fusing the first set of fused encoded sensor data features and the second set of fused encoded sensor data features using the one or more stage-two sensor fusion layers to form the set of stage-two fused encoded sensor data features; and wherein the one or more first encoder layers, the one or more second encoder layers, the one or more first stage-one sensor fusion layers, the one or more second stage-one sensor fusion layers, the one or more stage-two sensor fusion layers, and the one or more decoder layers together form the end-to-end trained neural network.

4. The computer-implemented method (S100) according to claim 3, wherein the one or more decoder layers are one or more stage-two decoder layers, the computer-implemented method further comprising:
in response to the first hardware platform being operable and the second hardware platform being inoperable: generating (S110), by the one or more processors of the first hardware platform of the vehicle, the predictive output based on the first set of fused encoded sensor data features using one or more first-stage decoder layers;
in response to the first hardware platform being inoperable and the second hardware platform being operable: generating (S111), by the one or more processors of the second hardware platform of the vehicle, the predictive output based on the second set of fused encoded sensor data features using one or more second-stage decoder layers;
wherein the one or more first encoder layers, the one or more second encoder layers, the one or more first-stage sensor fusion layers, the one or more second-stage sensor fusion layers, the one or more stage-two sensor fusion layers, the one or more first-stage decoder layers, the one or more second-stage decoder layers, and the one or more stage-two decoder layers together form the end-to-end trained neural network.
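
For illustration only, the decoder selection recited in claim 4 can be sketched as follows. This is a minimal sketch under stated assumptions: the claims do not say how operability is determined, so the two boolean flags are assumed to be supplied by some external health monitor. The function name `generate_output`, the toy decoders, the concatenation used in place of stage-two fusion, and the `None` result when neither platform is operable are likewise hypothetical.

```python
from typing import Callable, List, Optional

Vector = List[float]
Decoder = Callable[[Vector], Vector]


def generate_output(
    first_operable: bool,
    second_operable: bool,
    fused_first: Vector,
    fused_second: Vector,
    first_stage_decoder: Decoder,
    second_stage_decoder: Decoder,
    stage_two_decoder: Decoder,
) -> Optional[Vector]:
    """Pick the decoder path that matches the operable hardware platforms."""
    if first_operable and second_operable:
        # Nominal path: toy stage-two fusion (concatenation), then stage-two decoder.
        return stage_two_decoder(fused_first + fused_second)
    if first_operable:
        return first_stage_decoder(fused_first)    # S110, first platform only
    if second_operable:
        return second_stage_decoder(fused_second)  # S111, second platform only
    return None  # Assumption: this case is not addressed by the claim.


# Toy decoders (assumptions), chosen only so the three paths give distinct results.
def total(features: Vector) -> Vector:
    return [sum(features)]


def peak(features: Vector) -> Vector:
    return [max(features)]


def mean(features: Vector) -> Vector:
    return [sum(features) / len(features)]


nominal = generate_output(True, True, [1.0, 2.0], [4.0, 0.0], total, peak, mean)
first_only = generate_output(True, False, [1.0, 2.0], [], total, peak, mean)
second_only = generate_output(False, True, [], [4.0, 0.0], total, peak, mean)
print(nominal, first_only, second_only)  # [1.75] [3.0] [4.0]
```

The fallback paths consume only the feature set produced on the surviving platform, which is what allows a predictive output to be generated without any data from the inoperable one.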

5. The computer-implemented method (S100) according to claim 1, wherein the first sensor cluster includes sensors of different modalities, and the second sensor cluster includes sensors of different modalities.

6. The computer-implemented method (S100) according to claim 1, wherein the end-to-end trained neural network is a single end-to-end trained neural network.

7. A computer program product comprising instructions that, when executed by a computing device of a vehicle, cause the computing device to perform the computer-implemented method (S100) according to any one of claims 1 to 6.

8. A non-transitory computer-readable storage medium comprising instructions that, when executed by a computing device of a vehicle, cause the computing device to perform the computer-implemented method (S100) according to any one of claims 1 to 6.

9. A system (10) for generating predictive outputs for an autonomous driving system (310) of a vehicle (1), the system comprising:
a first hardware platform (210) including a first sensor cluster (324a) and one or more processors (11a) configured to encode a first sensor dataset generated by the first sensor cluster of the vehicle using one or more first encoder layers (201a) to form an encoded first sensor dataset; and
a second hardware platform (211) including a second sensor cluster (324b) and one or more processors (11b) configured to encode a second sensor dataset generated by the second sensor cluster of the vehicle using one or more second encoder layers (201b) to form an encoded second sensor dataset;
wherein the first hardware platform and the second hardware platform are separate hardware platforms, and each sensor in the first sensor cluster (324a) is different from each sensor in the second sensor cluster (324b);
wherein the one or more processors (11a) of the first hardware platform (210) or the one or more processors (11b) of the second hardware platform (211) are further configured to:
fuse at least a portion of the encoded first sensor dataset and at least a portion of the encoded second sensor dataset using one or more sensor fusion layers (206) to form a set of fused encoded sensor data features; and
generate a prediction output based on the set of fused encoded sensor data features using one or more decoder layers; and
wherein the one or more first encoder layers (201a), the one or more second encoder layers (201b), the one or more sensor fusion layers (206), and the one or more decoder layers (203a, 203b, 203c) together form an end-to-end trained neural network.

10. The system (10) according to claim 9, wherein the one or more processors (11a) of the first hardware platform of the vehicle or the one or more processors (11b) of the second hardware platform of the vehicle are further configured to transmit the generated prediction output to the autonomous driving system (310), which is configured to control one or more downstream functions of the vehicle based on the generated prediction output.

11. The system (10) according to claim 9 or 10, wherein the one or more sensor fusion layers are one or more stage-two sensor fusion layers (206), and the set of fused encoded sensor data features is a set of stage-two fused encoded sensor data features;
wherein the one or more processors (11a) of the first hardware platform (210) are further configured to fuse at least a portion of the encoded first sensor dataset using one or more first-stage sensor fusion layers (202a) to form a first set of fused encoded sensor data features;
wherein the one or more processors (11b) of the second hardware platform (211) are further configured to fuse at least a portion of the encoded second sensor dataset using one or more second-stage sensor fusion layers (202b) to form a second set of fused encoded sensor data features;
wherein fusing at least a portion of the encoded first sensor dataset and at least a portion of the encoded second sensor dataset comprises: fusing the first set of fused encoded sensor data features and the second set of fused encoded sensor data features using the one or more stage-two sensor fusion layers (206) to form the set of stage-two fused encoded sensor data features; and
wherein the one or more first encoder layers (201a), the one or more second encoder layers (201b), the one or more first-stage sensor fusion layers (202a), the one or more second-stage sensor fusion layers (202b), the one or more stage-two sensor fusion layers (206), and the one or more decoder layers (203a, 203b, 203c) together form the end-to-end trained neural network.

12. The system (10) according to claim 11, wherein the one or more decoder layers are one or more stage-two decoder layers;
wherein the one or more processors (11a) of the first hardware platform (210) are further configured to generate the prediction output based on the first set of fused encoded sensor data features using one or more first-stage decoder layers in response to the first hardware platform being operable and the second hardware platform being inoperable;
wherein the one or more processors (11b) of the second hardware platform (211) are further configured to generate the prediction output based on the second set of fused encoded sensor data features using one or more second-stage decoder layers in response to the first hardware platform being inoperable and the second hardware platform being operable; and
wherein the one or more first encoder layers (201a), the one or more second encoder layers (201b), the one or more first-stage sensor fusion layers (202a), the one or more second-stage sensor fusion layers (202b), the one or more stage-two sensor fusion layers (206), the one or more first-stage decoder layers, the one or more second-stage decoder layers, and the one or more stage-two decoder layers together form the end-to-end trained neural network.

13. The system (10) according to claim 9, wherein the first sensor cluster (324a) includes sensors of different modalities, and the second sensor cluster (324b) includes sensors of different modalities.

14. The system (10) according to claim 9, wherein the end-to-end trained neural network is a single end-to-end trained neural network.

15. A vehicle (1) including a system (10), said system (10) comprising:
a first hardware platform (210) including a first sensor cluster (324a) and one or more processors (11a) configured to encode a first sensor dataset generated by the first sensor cluster of the vehicle using one or more first encoder layers (201a) to form an encoded first sensor dataset; and
a second hardware platform (211) including a second sensor cluster (324b) and one or more processors (11b) configured to encode a second sensor dataset generated by the second sensor cluster of the vehicle using one or more second encoder layers (201b) to form an encoded second sensor dataset;
wherein the first hardware platform and the second hardware platform are separate hardware platforms, and each sensor in the first sensor cluster (324a) is different from each sensor in the second sensor cluster (324b);
wherein the one or more processors (11a) of the first hardware platform (210) or the one or more processors (11b) of the second hardware platform (211) are further configured to:
fuse at least a portion of the encoded first sensor dataset and at least a portion of the encoded second sensor dataset using one or more sensor fusion layers (206) to form a set of fused encoded sensor data features; and
generate a prediction output based on the set of fused encoded sensor data features using one or more decoder layers; and
wherein the one or more first encoder layers (201a), the one or more second encoder layers (201b), the one or more sensor fusion layers (206), and the one or more decoder layers (203a, 203b) together form an end-to-end trained neural network.