Automatic driving fault-tolerant method and electronic device
By employing targeted fault-tolerant strategies to compensate parameters and restart functional units in autonomous driving systems that lack redundant design, the problem of low fault tolerance coverage is solved, thereby improving system reliability and safety, optimizing resource utilization, and reducing costs.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- INSPUR SUZHOU INTELLIGENT TECH CO LTD
- Filing Date
- 2026-05-26
- Publication Date
- 2026-06-30
AI Technical Summary
Many functional units in existing autonomous driving systems lack redundancy design, resulting in low fault tolerance coverage and low reliability, which affects system safety.
A targeted fault-tolerance strategy is adopted, including parameter compensation and restart processing. Fault-tolerant functional units without redundant design backup units are handled by acquiring redundant design information, execution fault characteristic information and current status information, and then implementing parameter compensation and restart processing.
It expands the fault tolerance coverage, improves the reliability and safety of autonomous driving systems, shortens fault recovery time, optimizes resource utilization, and reduces development and maintenance costs.
Figure CN122300531A_ABST
Description
Technical Field
[0001] This application relates to the field of autonomous driving technology, and in particular to an autonomous driving fault-tolerant method and electronic device.
Background Technology
[0002] With the rapid development of autonomous driving technology, high-level autonomous driving systems place extremely high demands on the reliability and fault tolerance of their software frameworks. In-vehicle heterogeneous distributed computing systems integrate various computing devices and run key functional units such as perception, decision-making, and control; any single point of failure could lead to a serious safety accident.
[0003] In related technologies, redundancy and fault tolerance designs are typically implemented for critical functional units. When the original critical functional unit fails, a backup critical functional unit based on the redundancy design takes over the operation. However, redundancy and fault tolerance designs must be developed separately for the specific function of each functional unit, which is costly and consumes significant onboard resources. As a result, many functional units in current autonomous driving systems cannot be designed with redundancy and fault tolerance. When such a functional unit fails, the safety of autonomous driving is affected, i.e., the reliability of autonomous driving is reduced.
Summary of the Invention
[0004] This application provides an autonomous driving fault-tolerant method and electronic device to at least solve the problem of low fault tolerance coverage in related technologies, which reduces the reliability of autonomous driving.
[0005] This application provides an autonomous driving fault-tolerant method, including:
obtaining redundancy design information of faulty functional units in the vehicle's autonomous driving system;
if the redundancy design information of any faulty functional unit indicates that the faulty functional unit has no backup unit, regarding the faulty functional unit as a fault-tolerant unit;
if the fault-tolerant unit is an actuator, obtaining the execution fault characteristic information of the fault-tolerant unit;
after the fault-tolerant unit determines the action to be executed in response to the action control command, performing parameter compensation on the action to be executed by the fault-tolerant unit according to the execution fault characteristic information of the fault-tolerant unit, so that the actual execution action of the fault-tolerant unit matches the action control command;
if the fault-tolerant unit belongs to the target functional unit, obtaining the current status information of the fault-tolerant unit;
when the current status information of the fault-tolerant unit meets the preset restart conditions, restarting the fault-tolerant unit to eliminate the original fault.
[0006] This application also provides an autonomous driving fault-tolerant device, including:
a first acquisition module, used to acquire redundancy design information of faulty functional units in the vehicle's autonomous driving system;
a determination module, used to identify a faulty functional unit as a fault-tolerant unit when the redundancy design information of that faulty functional unit indicates that it has no backup unit;
a second acquisition module, used to acquire the execution fault characteristic information of the fault-tolerant unit when the fault-tolerant unit is an actuator;
a compensation module, used to perform, after the fault-tolerant unit determines the action to be executed in response to the action control command, parameter compensation on the action to be executed by the fault-tolerant unit based on the execution fault characteristic information of the fault-tolerant unit, so that the actual execution action of the fault-tolerant unit matches the action control command;
a third acquisition module, used to acquire the current status information of the fault-tolerant unit when the fault-tolerant unit belongs to the target functional unit;
a restart module, used to restart the fault-tolerant unit when the current status information of the fault-tolerant unit meets the preset restart conditions, so as to eliminate the original fault.
[0007] This application also provides an electronic device, including: a memory for storing a computer program; and a processor for implementing the steps of any of the above-described autonomous driving fault-tolerant methods when executing the computer program.
[0008] This application also provides a computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the steps of any of the above-described autonomous driving fault-tolerant methods.
[0009] This application also provides a computer program product, including a computer program that, when executed by a processor, implements the steps of any of the above-described autonomous driving fault-tolerant methods.
[0010] This application implements a targeted fault-tolerance strategy for faulty functional units that lack redundant backup units, based on their type. The fault-tolerance strategy includes parameter compensation and restart processing. This achieves fault-tolerance processing for functional units that are not designed with redundancy due to cost and resource constraints. It solves the technical problem of low fault tolerance coverage and low reliability of autonomous driving systems caused by the lack of redundant design in a large number of functional units in related technologies. This achieves the technical effect of expanding the fault tolerance coverage and improving the reliability and safety of autonomous driving.
Attached Figure Description
[0011] To more clearly illustrate the embodiments of this application, the accompanying drawings used in the embodiments will be briefly introduced below. Obviously, the drawings described below are only some embodiments of this application. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.
[0012] Figure 1 is a schematic diagram of the structure of the autonomous driving fault-tolerant system on which the embodiments of this application are based;
Figure 2 is a flowchart of the autonomous driving fault-tolerant method provided in an embodiment of this application;
Figure 3 is a schematic diagram of an exemplary registration process provided in an embodiment of this application;
Figure 4 is a schematic diagram of the overall process of the autonomous driving fault-tolerant method provided in the embodiments of this application;
Figure 5 is an application architecture diagram of the autonomous driving fault-tolerant method provided in the embodiments of this application;
Figure 6 is a schematic diagram of the structure of the autonomous driving fault-tolerant device provided in the embodiments of this application;
Figure 7 is a schematic diagram of the structure of an electronic device provided in an embodiment of this application.
Detailed Implementation
[0013] The technical solutions of the embodiments of this application will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of this application, and not all embodiments. Based on the embodiments of this application, all other embodiments obtained by those of ordinary skill in the art without creative effort are within the protection scope of this application.
[0014] It should be noted that, in the description of this application, the terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or apparatus. The terms "first," "second," etc., in this application are used to distinguish similar objects and are not used to describe a specific order or sequence.
[0015] In practical applications of autonomous driving, vehicles face numerous challenges, such as unpredictable weather and complex road conditions. This necessitates extremely high stability and self-recovery capabilities after failures. Therefore, L4-level autonomous driving hardware designs often employ redundancy, while the software application layer also incorporates redundancy and fault tolerance. Currently, software fault tolerance is largely function-specific and limited; each fault tolerance solution is developed for a specific function, lacking a universal framework. Adding new fault tolerance functions requires redevelopment, making upgrades and maintenance difficult. Static redundancy configurations lead to resource waste. There is a lack of unified node identification and automatic discovery mechanisms. Fault detection and recovery processes are slow, impacting system availability.
[0016] To address the aforementioned technical problems, this application provides an autonomous driving fault-tolerant method and electronic device. The method applies targeted fault-tolerant strategies, including parameter compensation and restart processing, to faulty functional units that lack redundant backup units, according to their type. This achieves fault-tolerant processing for functional units that were not designed with redundancy due to cost and resource constraints. It solves the technical problem of low fault-tolerant coverage and low reliability in autonomous driving systems caused by the lack of redundancy in a large number of functional units in related technologies, achieving the technical effect of expanding fault-tolerant coverage and improving the reliability and safety of autonomous driving. Furthermore, multiple registration channels are used to register the unique target registration information of functional units, realizing a unified node identification and automatic discovery mechanism.
[0017] To enable those skilled in the art to better understand the present application, the present application will be further described in detail below with reference to the accompanying drawings and specific embodiments.
[0018] The specific application environment architecture or specific hardware architecture on which the execution of the autonomous driving fault-tolerant method depends is described here.
[0019] First, the structure of the autonomous driving fault-tolerant system on which this application is based will be described. The autonomous driving fault-tolerant method and electronic device provided in this application are applicable to fault-tolerant processing of functional units and hardware nodes in a vehicle autonomous driving system. Figure 1 is a structural schematic of the autonomous driving fault-tolerant system on which the embodiments of this application are based. As shown in Figure 1, the system mainly includes multiple hardware nodes and an autonomous driving fault-tolerant device. At least one functional unit is deployed on each hardware node, and each functional unit is software. The autonomous driving fault-tolerant device is used to perform fault-tolerant processing on the hardware nodes and the functional units to ensure the stable and reliable operation of the vehicle's autonomous driving system.
[0020] This application provides an autonomous driving fault-tolerant method for fault-tolerant processing of functional units and hardware nodes in a vehicle autonomous driving system. The execution subject of this application embodiment is an electronic device, such as a server, desktop computer, laptop computer, tablet computer, or other electronic devices that can be used to perform fault-tolerant processing of functional units and hardware nodes in a vehicle autonomous driving system.
[0021] Figure 2 is a flowchart of an autonomous driving fault-tolerant method provided in an embodiment of this application. As shown in Figure 2, the method includes:
Step 201: Obtain the redundancy design information of the faulty functional unit in the vehicle's autonomous driving system.
[0022] Here, the redundancy design information is a backup strategy predefined for each functional unit during the design phase of the vehicle's autonomous driving system. It records whether a functional unit has a backup unit, the number of backup units, their location, and the switching logic.
[0023] Step 202: If the redundancy design information of any faulty functional unit indicates that the faulty functional unit has no backup unit, then the faulty functional unit is regarded as a fault-tolerant unit.
[0024] Accordingly, if the redundancy design information of the faulty functional unit indicates that the faulty functional unit has a backup unit, then the system's predefined backup strategy is adopted, and the preset backup unit is used to perform fault tolerance processing on the faulty functional unit.
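The branching in Steps 201 and 202, together with the type-based strategies of the later steps, can be sketched roughly as follows. This is only an illustrative Python sketch: the names `RedundancyInfo`, `FaultyUnit`, `Strategy`, and `select_strategy` are hypothetical, and the `REPORT_ONLY` fallback for units that are neither actuators nor target functional units is an assumption, since the application does not state what happens in that case.

```python
from dataclasses import dataclass
from enum import Enum


class Strategy(Enum):
    SWITCH_TO_BACKUP = "switch_to_backup"
    PARAMETER_COMPENSATION = "parameter_compensation"
    RESTART_CHECK = "restart_check"
    REPORT_ONLY = "report_only"  # assumed fallback, not stated in the text


@dataclass
class RedundancyInfo:
    backup_units: tuple = ()  # identifiers of backup units; empty means none


@dataclass
class FaultyUnit:
    name: str
    redundancy: RedundancyInfo
    is_actuator: bool = False
    is_target_unit: bool = False  # "target functional unit" (general unit)


def select_strategy(unit: FaultyUnit) -> Strategy:
    """Pick a fault-tolerance strategy for one faulty functional unit."""
    if unit.redundancy.backup_units:
        # A backup exists: use the predefined backup strategy.
        return Strategy.SWITCH_TO_BACKUP
    # No backup: the unit is treated as a fault-tolerant unit.
    if unit.is_actuator:
        return Strategy.PARAMETER_COMPENSATION
    if unit.is_target_unit:
        return Strategy.RESTART_CHECK
    return Strategy.REPORT_ONLY
```

For instance, a steering motor without a backup unit would be routed to parameter compensation, while a general software unit without a backup would be routed to the restart check.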
[0025] Step 203: If the fault-tolerant unit is an actuator, obtain the execution fault characteristic information of the fault-tolerant unit.
[0026] The actuators include a steering motor, brake calipers, and throttle controller, and the fault characteristic information includes at least the deviation in the actuation force. For example, if the steering motor needs to control the steering wheel to rotate 30 degrees, but the steering wheel actually only rotates 20 degrees, the fault characteristic information is a rotation angle deviation of -10 degrees.
[0027] Step 204: After the fault-tolerant unit determines the action to be executed in response to the action control command, the parameter compensation of the action to be executed of the fault-tolerant unit is performed according to the execution fault characteristic information of the fault-tolerant unit, so as to match the actual execution action of the fault-tolerant unit with the action control command.
[0028] Specifically, for actuators, a direct restart may interrupt critical control actions, leading to driving hazards. Therefore, the embodiments of this application employ a safer parameter compensation strategy to compensate the parameters of the actions to be executed by the fault-tolerant unit. For example, when the steering motor, in response to the action control command, determines that the action to be executed is a 20-degree right turn of the steering wheel, and the execution fault characteristic information is a rotation angle deviation of -10 degrees, parameter compensation that offsets the -10-degree deviation is applied to the action to be executed, so that the actual execution action of the steering motor is 30 degrees.
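The arithmetic of the steering example can be written down as a small Python sketch. It assumes the deviation acts as a simple additive offset (deviation = actual value minus intended value), which the application does not state explicitly; the function names `measure_deviation` and `compensate` are hypothetical.

```python
def measure_deviation(intended_deg: float, actual_deg: float) -> float:
    """Execution fault characteristic: how far the actuator missed its target."""
    return actual_deg - intended_deg


def compensate(action_deg: float, deviation_deg: float) -> float:
    """Offset a known additive deviation from the action to be executed."""
    return action_deg - deviation_deg


# Values from the steering-motor example in the text.
deviation = measure_deviation(intended_deg=30.0, actual_deg=20.0)  # -10.0
adjusted = compensate(action_deg=20.0, deviation_deg=deviation)    # 30.0
```

A real actuator fault is unlikely to be a constant offset over the whole operating range, so an additive model should be read as the simplest possible instance of parameter compensation rather than a recommended one.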
[0029] Step 205: If the fault-tolerant unit belongs to the target functional unit, obtain the current status information of the fault-tolerant unit.
[0030] It should be noted that the functional units of a vehicle autonomous driving system are divided into two types: critical functional units and general functional units. Critical functional units generally directly affect the safety of autonomous driving, so critical functional units usually have backup units. General functional units have a smaller impact on the safety of autonomous driving in the system. The target functional unit in the embodiments of this application can specifically refer to the above-mentioned general functional unit.
[0031] Step 206: When the current status information of the fault-tolerant unit meets the preset restart conditions, the fault-tolerant unit is restarted to eliminate the original fault.
[0032] The current status information of the target functional unit includes the CPU utilization rate, memory status, and whether it responds to heartbeat detection.
[0033] Specifically, the current status information of the fault-tolerant unit is used to predict whether the unit can recover to normal after a restart, i.e., whether the restart can eliminate the original fault. If so, the current status information of the fault-tolerant unit is determined to meet the preset restart conditions, and the unit is restarted. Conversely, if it is predicted that the unit cannot recover to normal after a restart, the unit is not restarted, so that the restart does not cause secondary faults; only the fault is reported.
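A restart condition of this kind might look like the following Python sketch. The application only lists the inputs (CPU utilization, memory status, heartbeat response) and does not define the prediction rule, so the rule below, the `cpu_limit` threshold, and the names `UnitStatus`, `meets_restart_condition`, and `handle_fault` are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class UnitStatus:
    cpu_utilization: float  # fraction in the range 0.0 to 1.0
    memory_ok: bool         # False if the memory status is reported as faulty
    heartbeat_ok: bool      # True if the unit answers heartbeat detection


def meets_restart_condition(status: UnitStatus, cpu_limit: float = 0.95) -> bool:
    """Assumed heuristic: restart only when a restart is likely to help."""
    if not status.memory_ok:
        # Assumption: a memory fault would survive a restart.
        return False
    # Assumption: a hung or CPU-saturated unit is likely to recover.
    return (not status.heartbeat_ok) or status.cpu_utilization >= cpu_limit


def handle_fault(status: UnitStatus) -> str:
    """Return the action taken for a fault-tolerant target functional unit."""
    return "restart" if meets_restart_condition(status) else "report_fault"
```

The point of the sketch is the two-way outcome: either restart, or skip the restart and only report the fault.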
[0034] Based on the above embodiments, as an implementable approach, in one embodiment, before obtaining the redundancy design information of the faulty functional unit in the vehicle autonomous driving system, the method further includes:
Step 301: Obtain the unit registration information of the vehicle's autonomous driving system;
Step 302: Collect operational characteristic information from multiple registered units based on the unit registration information;
Step 303: When the operational characteristic information of any registered unit meets the corresponding fault conditions, determine the registered unit to be a faulty functional unit.
[0035] The unit registration information contains a structured information database of all successfully identified functional units in the system. Each registration record corresponds to a unique identifier and related attributes of a functional unit.
[0036] Specifically, based on the unit registration information, it can be determined which functional units need to have their operational characteristic information collected, and then the corresponding monitor can be activated to collect the operational characteristic information of multiple registered units. When the operational characteristic information of any registered unit exceeds the safe range, it is determined that it meets the corresponding fault conditions, and then it is identified as a faulty functional unit.
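The safe-range check in Step 303 can be illustrated with a short Python sketch. The metric names and numeric ranges below are invented for the example, and `find_faulty_units` is a hypothetical helper, not a function named in the application.

```python
# Hypothetical safe ranges per metric: (lower bound, upper bound), inclusive.
SAFE_RANGES = {
    "latency_ms": (0.0, 100.0),
    "frame_rate_hz": (10.0, 60.0),
}


def find_faulty_units(samples: dict, safe_ranges: dict) -> list:
    """Return ids of registered units with any metric outside its safe range.

    samples maps unit id -> {metric name: measured value}.
    Metrics without a configured range are ignored.
    """
    faulty = []
    for unit_id, metrics in samples.items():
        for name, value in metrics.items():
            if name not in safe_ranges:
                continue
            low, high = safe_ranges[name]
            if not (low <= value <= high):
                faulty.append(unit_id)
                break  # one violation is enough to flag the unit
    return faulty
```

With a camera reporting 5 Hz against a 10 to 60 Hz range, the camera unit would be flagged while units whose metrics stay in range would not.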
[0037] Specifically, in one embodiment, the target registration information of each functional unit in the vehicle autonomous driving system can be determined based on the vehicle domain, location, hardware type, function type, and instance unique identifier of that functional unit. For any functional unit, multiple parallel registration channels are used to register its target registration information to three places: the discovery caches of the hardware nodes other than the local hardware node where the functional unit is located, the central registration center, and the distributed neighbor tables of the preset monitoring nodes. Here, the other hardware nodes belong to the same discovery domain as the local hardware node, and the preset monitoring nodes belong to the same local area network as the local hardware node. The union of the target registration information currently registered in the discovery cache of each hardware node, the central registration center, and the distributed neighbor tables of the preset monitoring nodes is taken to obtain the unit registration information of the vehicle autonomous driving system.
[0038] The unit registration information includes the target registration information for each registered unit.
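Taking the union over the three registration targets can be sketched in Python as below. Registration records are represented here simply as identifier strings, and `merge_registrations` is a hypothetical name; the actual record format is not specified at this point in the application.

```python
def merge_registrations(discovery_caches, central_registry, neighbor_tables):
    """Union of registrations seen through the three channels.

    discovery_caches: iterable of per-node collections of identifiers
    central_registry: collection of identifiers held by the central registry
    neighbor_tables:  iterable of per-monitoring-node collections
    """
    merged = set(central_registry)
    for cache in discovery_caches:
        merged |= set(cache)
    for table in neighbor_tables:
        merged |= set(table)
    return merged
```

Because a union is used, a unit that is visible through only one channel (for example, only in a neighbor table while the central registry is unreachable) still appears in the unit registration information.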
[0039] For example, Figure 3 illustrates an exemplary registration process provided in an embodiment of this application. During the system initialization phase, each hardware node constructs a unique URI identifier based on its actual role and deployment location in the autonomous driving system. This process begins with the node's power-on self-test, which reads the fixed hardware configuration information, including basic attributes such as hardware type, installation location, and function definition. The URI construction adopts a five-layer structure specifically designed for autonomous driving hardware: the vehicle domain identifies the functional domain to which the hardware belongs, such as the perception domain for various environmental perception sensors, the positioning domain for GPS and IMU devices, the planning domain for decision computing units, and the control domain for vehicle actuators; the location field precisely describes the physical installation location of the hardware on the vehicle body, such as the front, rear, roof, or bumper, which is crucial for sensor coverage monitoring and fault location; the hardware type field distinguishes different categories of devices such as cameras, LiDAR, millimeter-wave radar, ultrasonic sensors, ECUs, and HPCs; the function field clarifies the specific purpose of the hardware, such as forward collision detection, surround view monitoring, path planning, or braking control; and the instance ID is used to distinguish multiple hardware of the same type in the same location.
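The five-layer identifier can be sketched as a small Python data class. The concrete URI syntax is not given in the application, so the `ad://` scheme, the slash-separated layout, and the names `NodeIdentity`, `to_uri`, and `parse_uri` are assumptions chosen only to make the structure concrete.

```python
from dataclasses import dataclass

URI_PREFIX = "ad://"  # hypothetical scheme, not taken from the application


@dataclass(frozen=True)
class NodeIdentity:
    domain: str         # vehicle domain, e.g. perception, planning, control
    location: str       # installation location, e.g. front, rear, roof
    hardware_type: str  # e.g. camera, lidar, ecu
    function: str       # e.g. forward_collision_detection
    instance_id: str    # distinguishes same-type hardware at one location

    def to_uri(self) -> str:
        layers = [self.domain, self.location, self.hardware_type,
                  self.function, self.instance_id]
        return URI_PREFIX + "/".join(layers)


def parse_uri(uri: str) -> NodeIdentity:
    """Split an identifier back into its five layers."""
    if not uri.startswith(URI_PREFIX):
        raise ValueError("unexpected scheme: " + uri)
    parts = uri[len(URI_PREFIX):].split("/")
    if len(parts) != 5:
        raise ValueError("expected five layers, got %d" % len(parts))
    return NodeIdentity(*parts)
```

Keeping the layers as separate fields lets later stages select a collection policy by domain and function, or a collection strategy by location and hardware type, without parsing strings again.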
[0040] For hardware requiring special configuration, operating parameters are specified in the parameter section, such as camera resolution and frame rate, LiDAR detection range, and radar operating frequency. This structured naming convention enables the system to automatically understand the hardware's functional characteristics, deployment context, and performance parameters based on the URI, laying the foundation for intelligent fault tolerance. After the node constructs its URI, it enters a multi-level automatic registration phase. This phase employs three parallel registration mechanisms (registration channels) to ensure reliable node discovery and management under various operating environments.
[0041] Specifically, the DDS discovery layer registration channel targets sensors and control nodes with high real-time requirements. It embeds the URI information into the user data fields of DDS participants and uses the built-in DDS discovery protocol to propagate node information automatically within the domain. This mechanism offers millisecond-level real-time performance, enabling rapid establishment of communication links between nodes, making it particularly suitable for sensing devices with stringent real-time requirements, such as cameras and radar. The centralized registration center channel sends registration requests containing detailed information such as complete URIs, capability descriptions, and service endpoints to the central management node via an HTTP REST API. The registration center persistently stores this information, providing a unified node management interface and supporting complex query and filtering operations. Nodes periodically send heartbeat messages to maintain their registration status; a node from which no heartbeat is received within the timeout is marked as abnormal. The distributed multicast discovery registration channel is based on the UDP multicast protocol. Nodes join predefined multicast groups and periodically send announcement messages to the group. Other nodes within the same vehicular network discover new nodes and maintain a dynamic neighbor table by listening to multicast messages. This mechanism does not rely on a central node, is fully decentralized, and is suitable for maintaining basic node communication capabilities during backbone network failures.
[0042] The three registration channels described above operate in parallel, providing mutual backup and jointly constructing a robust node discovery system. The system continuously evaluates the health status of each registration channel and dynamically adjusts the registration strategy to ensure effective node management under any network conditions.
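The heartbeat handling of the centralized registration center channel can be sketched in Python as follows. The class name `HeartbeatRegistry`, its methods, and the explicit `now` parameter (used instead of reading a clock, to keep the example deterministic) are illustrative assumptions; the application does not specify timeout values or an interface.

```python
class HeartbeatRegistry:
    """Tracks the last heartbeat time of each registered node."""

    def __init__(self, timeout_s: float):
        self.timeout_s = timeout_s
        self._last_seen = {}  # node identifier -> time of last heartbeat

    def heartbeat(self, node_id: str, now: float) -> None:
        """Record a registration or heartbeat received at time `now`."""
        self._last_seen[node_id] = now

    def abnormal_nodes(self, now: float) -> list:
        """Nodes whose last heartbeat is older than the timeout."""
        return sorted(
            node_id
            for node_id, seen in self._last_seen.items()
            if now - seen > self.timeout_s
        )
```

In a deployment the timestamps would typically come from a monotonic clock such as `time.monotonic()`, so that wall-clock adjustments do not mark healthy nodes as abnormal.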
[0043] Furthermore, in one embodiment, for any registered unit, the target information to be collected can be determined based on the vehicle domain and function type represented by the target registration information of the registered unit; the target collection strategy can be determined based on the location and hardware type represented by the target registration information of the registered unit; and the target information to be collected can be collected according to the target collection strategy to obtain the operating characteristic information of the registered unit.
[0044] The vehicle domain is divided into at least three types: environmental perception domain, localization domain, and planning domain.
[0045] It should be noted that the vehicle domain defines the core responsibilities of each unit. For example, the environmental perception domain is responsible for identifying the environment, the localization domain for determining its own location, and the planning domain for decision-making and planning. The function type defines the specific task. The system has a built-in policy mapping table, which assigns the most relevant performance indicators to units with different functions. For example, for a unit whose function type is traffic light recognition and belongs to the environmental perception domain, the target information to be collected may include data processing latency, recognition accuracy, and message release frequency. For a unit whose function type is path planning and belongs to the planning domain, the target information to be collected may include planning cycle, decision rationality, and computation time.
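The policy mapping table can be pictured as a lookup keyed by vehicle domain and function type. The Python sketch below uses the two examples from the text; the key spellings and the function name `metrics_to_collect` are hypothetical, and returning an empty list for unknown combinations is an assumption.

```python
# Keys are (vehicle domain, function type); values are the metrics to collect.
POLICY_TABLE = {
    ("environmental_perception", "traffic_light_recognition"): [
        "data_processing_latency",
        "recognition_accuracy",
        "message_release_frequency",
    ],
    ("planning", "path_planning"): [
        "planning_cycle",
        "decision_rationality",
        "computation_time",
    ],
}


def metrics_to_collect(domain: str, function_type: str) -> list:
    """Look up the target information to be collected for one unit."""
    # Assumption: combinations without a policy entry collect nothing.
    return list(POLICY_TABLE.get((domain, function_type), []))
```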
[0046] Specifically, the target collection strategy can be determined based on the location and the hardware type, where hardware types include GPUs, CPUs, and LiDAR, among others. The target collection strategy specifies the technical interfaces and tools needed for collection. By collecting the target information according to the target collection strategy, operational characteristic information that accurately reflects the unit's working status can be obtained, improving monitoring efficiency and accuracy and laying a reliable data foundation for subsequent precise fault diagnosis.
[0047] Based on the above embodiments, as an implementable approach, in one embodiment, the method further includes:
Step 401: Obtain the hardware characteristic information of each hardware node;
Step 402: When the hardware characteristic information of any hardware node meets the corresponding fault condition, determine the hardware node to be a faulty hardware node;
Step 403: If the redundancy design information of the faulty hardware node indicates that the faulty hardware node has no backup hardware, restart the faulty hardware node according to its hardware type.
[0048] The hardware nodes include sensors and computing devices (computing units).
[0049] Specifically, the content of hardware characteristic information is determined according to the type of hardware node. For example, the hardware characteristic information of computing devices includes CPU / GPU core utilization, memory usage and error rate, chip temperature, power consumption, fan speed, and average system load. Sensors (including cameras, LiDAR, and millimeter-wave radar, etc.) have hardware characteristic information including data output frame rate, signal strength, signal-to-noise ratio, internal temperature, calibration status, and power supply voltage.
[0050] Specifically, sensor restarts can be achieved by disconnecting and reconnecting its power (hard reset) or by sending a reset command via a dedicated control bus (soft reset). For example, for an unresponsive camera, the system might first attempt a soft reset; if that fails, it would then trigger a hard reset of the power management unit it resides in. Restarting a computing device typically means restarting the entire computing unit (such as a domain controller). This terminates all functional units running on that unit, thus requiring coordination with upper-level fault-tolerant processes. For example, it might first attempt to restart the operating system; if that fails, it might then perform a power-down and power-on of the entire hardware.
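The escalation from a gentle reset to a more drastic one can be sketched in Python as an ordered list of attempts. The helper `staged_restart` and the stage labels are hypothetical; each stage is passed in as a callable that returns True on success, so no real hardware interface is implied.

```python
from typing import Callable, Optional, Sequence, Tuple


def staged_restart(
    stages: Sequence[Tuple[str, Callable[[], bool]]]
) -> Optional[str]:
    """Try each restart stage in order and stop at the first that succeeds.

    Returns the name of the successful stage, or None if every stage failed.
    """
    for name, attempt in stages:
        if attempt():
            return name
    return None


# Example wiring for an unresponsive camera; the lambdas stand in for
# real reset routines, which are outside the scope of this sketch.
camera_stages = [
    ("soft_reset", lambda: False),  # reset command over the control bus fails
    ("hard_reset", lambda: True),   # power cycle succeeds
]
result = staged_restart(camera_stages)  # "hard_reset"
```

The same shape fits a computing device, with an operating system restart as the first stage and a full power-down and power-on as the second.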
[0051] Accordingly, in one embodiment, if the redundancy design information of the faulty hardware node indicates that the faulty hardware node has backup hardware, the current service of the faulty hardware node can be migrated to the corresponding backup hardware so that the backup hardware can act as the master node to process the current service in place of the faulty hardware node.
[0052] Specifically, the core of the fault recovery phase is a resource-aware dynamic scheduling algorithm. This algorithm intelligently selects the optimal recovery plan based on the current system resource status and hardware characteristics. When a sensor node failure occurs, the system first assesses the impact of the failure on the sensing system. For critical sensors such as the forward-facing main camera, the system prioritizes a local restart recovery strategy, checking the device's resource reserves and health status to ensure that the restart process does not trigger secondary failures. During the restart process, the system coordinates other sensors to temporarily enhance the sensing capabilities of the relevant areas, filling data gaps.
[0053] Specifically, in the event of a computing unit failure, the computing tasks (current business) running on the failed unit can be migrated to other healthy computing nodes (backup hardware). The current resource database is queried to find candidate nodes with sufficient computing power, moderate load, and high reliability. During the selection process, the matching degree of various resource types such as CPU / GPU computing power, memory capacity, storage I / O, and network bandwidth can be comprehensively considered, as well as the historical reliability record and current temperature status of the target node.
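Finding candidate nodes for migration can be sketched as a filter followed by a sort. In the Python sketch below, the field names, the `max_load` and `max_temp_c` limits, and the choice to order candidates by load are assumptions; the application lists the factors to consider but not how they are combined at this stage.

```python
from dataclasses import dataclass


@dataclass
class NodeResources:
    name: str
    free_cpu_cores: float
    free_memory_mb: int
    load: float           # current load as a fraction, 0.0 to 1.0
    temperature_c: float


def candidate_nodes(nodes, need_cpu, need_memory_mb,
                    max_load=0.7, max_temp_c=85.0):
    """Healthy nodes that can host the migrated task, least loaded first."""
    fit = [
        n for n in nodes
        if n.free_cpu_cores >= need_cpu
        and n.free_memory_mb >= need_memory_mb
        and n.load <= max_load
        and n.temperature_c <= max_temp_c
    ]
    return sorted(fit, key=lambda n: n.load)
```

Other resource types named in the text, such as GPU computing power, storage I/O, network bandwidth, and historical reliability, could be added as further fields and filter conditions in the same way.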
[0054] Specifically, in one embodiment, when there is more than one backup hardware, the static capabilities and dynamic state of each backup hardware can be obtained; based on the static capabilities and dynamic state of each backup hardware, a comprehensive weight value for each backup hardware can be determined; based on the comprehensive weight value of each backup hardware, a target backup hardware can be selected from the multiple backup hardware; and the current service of the faulty hardware node can be migrated to the target backup hardware, so that the target backup hardware acts as the master node and processes the current service in place of the faulty hardware node.
[0055] Specifically, for critical hardware nodes, an intelligent primary / backup election mechanism based on multi-dimensional weights can be used to coordinate the operation of multiple redundant nodes, ensuring seamless failover when a single node fails. Each node calculates its own comprehensive weight value upon startup, taking into account both its static capabilities and dynamic state. Static capabilities include hardware performance specifications, historical reliability records, and hardware acceleration features (such as the high efficiency of GPUs in image processing). Dynamic state includes current load levels, resource availability, temperature conditions, and communication quality. The system also supports administrators configuring priority weights to express the business importance of specific nodes.
[0056] The weighting calculation employs a weighted summation method, assigning different weight coefficients to different factors based on their importance. Node weights are periodically recalculated to reflect dynamic changes in node status. The node with the highest weight is elected as the master node, responsible for normal business processing and data output; nodes with lower weights serve as backup nodes, receiving the same input data and performing synchronous calculations, maintaining a hot standby state but not outputting results externally to avoid resource conflicts and data corruption.
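As a non-limiting illustration, the weighted-summation election described above can be sketched as follows. The factor names, the coefficient values, and the normalization of every factor to the range 0 to 1 are assumptions made for this example; the application does not specify them.

```python
from dataclasses import dataclass

# Hypothetical weight coefficients; they sum to 1 for readability only.
COEFFICIENTS = {
    "performance": 0.25,   # static: hardware performance specification
    "reliability": 0.20,   # static: historical reliability record
    "headroom": 0.25,      # dynamic: 1 - current load
    "thermal": 0.10,       # dynamic: 1 = cool, 0 = at the thermal limit
    "link_quality": 0.10,  # dynamic: communication quality
    "priority": 0.10,      # administrator-configured priority weight
}


@dataclass
class Node:
    name: str
    factors: dict  # factor name -> score normalized to [0, 1]


def comprehensive_weight(node: Node) -> float:
    """Weighted sum of the node's normalized factor scores."""
    return sum(c * node.factors.get(k, 0.0) for k, c in COEFFICIENTS.items())


def elect(nodes):
    """Return (master, backups): the highest weight wins, the rest stay hot standby."""
    ranked = sorted(nodes, key=comprehensive_weight, reverse=True)
    return ranked[0], ranked[1:]


nodes = [
    Node("gpu-a", {"performance": 0.9, "reliability": 0.8, "headroom": 0.3,
                   "thermal": 0.5, "link_quality": 0.9, "priority": 0.5}),
    Node("gpu-b", {"performance": 0.7, "reliability": 0.9, "headroom": 0.8,
                   "thermal": 0.9, "link_quality": 0.9, "priority": 0.5}),
]
master, backups = elect(nodes)
```

In this example the faster but heavily loaded node loses the election to the node with more headroom. Calling `elect` again after each periodic weight update would reproduce the recalculation described above.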
[0057] The primary and backup nodes maintain state consistency through a dedicated state synchronization channel. The primary node periodically serializes and sends key state information, processing context, configuration parameters, etc., to the backup node. State synchronization employs incremental updates and compressed transmission mechanisms, transmitting only the changed state data, significantly reducing network bandwidth consumption. For sensor nodes, the synchronization content also includes calibration parameters and environmental adaptation data; for computing nodes, it synchronizes intermediate algorithm states and model parameters; and for control nodes, it synchronizes control context and execution status.
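As a non-limiting illustration, incremental and compressed state synchronization could look like the sketch below. The JSON encoding, the use of zlib, and the `set` / `del` message layout are assumptions chosen for the example and are not prescribed by the application.

```python
import json
import zlib


def make_delta(previous: dict, current: dict) -> bytes:
    """Serialize only the keys that changed or disappeared, then compress."""
    changed = {k: v for k, v in current.items()
               if k not in previous or previous[k] != v}
    removed = [k for k in previous if k not in current]
    payload = json.dumps({"set": changed, "del": removed}, sort_keys=True)
    return zlib.compress(payload.encode("utf-8"))


def apply_delta(state: dict, delta: bytes) -> dict:
    """Rebuild the primary's state on the backup from its previous copy."""
    message = json.loads(zlib.decompress(delta).decode("utf-8"))
    updated = dict(state)
    updated.update(message["set"])
    for key in message["del"]:
        updated.pop(key, None)
    return updated


# Hypothetical state of a sensor node: only the frame counter changes.
primary_old = {"calibration": [0.1, 0.2], "frame_id": 41, "mode": "cruise"}
primary_new = {"calibration": [0.1, 0.2], "frame_id": 42, "mode": "cruise"}

backup = dict(primary_old)
delta = make_delta(primary_old, primary_new)
backup = apply_delta(backup, delta)
```

Only the changed `frame_id` entry travels over the synchronization channel; the unchanged calibration data is not retransmitted.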
[0058] When the monitoring system detects a primary node failure, it immediately triggers a primary / standby failover process. The system first suspends sending new processing requests to the failed primary node to prevent command loss or out-of-order delivery; then, it promotes the highest-weighted standby node to the new primary node and loads the latest synchronization state; next, it updates service discovery and routing information in the system to ensure subsequent requests are correctly directed to the new primary node; finally, it broadcasts the primary node change notification to relevant nodes, completing the failover process. The entire failover process is completely transparent to upper-layer applications; applications do not need to be aware of changes to the underlying nodes, ensuring business continuity.
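As a non-limiting illustration, the four failover steps described above can be sketched with a small in-memory model. The `Cluster` class, its attribute names, and the node names are hypothetical; a real system would act on the actual routing and discovery services.

```python
class Cluster:
    """Toy model of one primary node and its hot-standby nodes."""

    def __init__(self, primary, standby_weights, synced_state):
        self.primary = primary
        self.standby_weights = dict(standby_weights)  # node name -> weight
        self.synced_state = dict(synced_state)        # last state sent by primary
        self.accepting = {primary: True}              # may the node receive requests?
        self.route = primary                          # service discovery entry
        self.notifications = []                       # broadcast log

    def fail_over(self):
        failed = self.primary
        # Step 1: stop sending new requests to the failed primary.
        self.accepting[failed] = False
        # Step 2: promote the highest-weighted standby and load the synced state.
        promoted = max(self.standby_weights, key=self.standby_weights.get)
        del self.standby_weights[promoted]
        active_state = dict(self.synced_state)
        # Step 3: update service discovery and routing.
        self.route = promoted
        self.accepting[promoted] = True
        # Step 4: broadcast the change of primary to the other nodes.
        self.notifications.append(("primary_changed", failed, promoted))
        self.primary = promoted
        return promoted, active_state


cluster = Cluster("ecu-1", {"ecu-2": 0.61, "ecu-3": 0.78}, {"seq": 1042})
new_primary, state = cluster.fail_over()
```

Because callers only ever consult `cluster.route`, they never need to know which physical node is serving them, which is the transparency property described above.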
[0059] If the original master node subsequently recovers, the system reassesses its role based on its current weight value. If its weight is still the highest, it can be re-elected as the master node; if its weight is no longer the highest, it continues to operate as a backup node. This flexible role management mechanism ensures that the most suitable node always undertakes critical tasks, optimizing overall performance and reliability.
[0060] Specifically, in one embodiment, the communication link quality characteristics between each hardware node can be obtained; based on the communication link quality characteristics between each hardware node, faulty communication links and corresponding faulty network devices can be screened; and based on the device type of the faulty communication links and corresponding faulty network devices, a fault recovery strategy for the faulty communication links can be determined.
[0061] The fault recovery strategy includes at least restarting the faulty network device.
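As a non-limiting illustration, screening faulty links and choosing a recovery strategy by device type could be sketched as follows. The thresholds, the device types, and the `reroute_traffic` strategy are assumptions for the example; the application only states that the strategy includes at least restarting the faulty network device.

```python
from dataclasses import dataclass

MAX_LATENCY_MS = 20.0   # hypothetical latency threshold
MAX_LOSS_RATE = 0.02    # hypothetical packet loss threshold

# Hypothetical mapping from device type to recovery strategy.
RECOVERY_BY_DEVICE_TYPE = {
    "switch": "restart_device",
    "gateway": "restart_device",
    "harness": "reroute_traffic",
}
DEFAULT_RECOVERY = "restart_device"


@dataclass
class LinkSample:
    src: str
    dst: str
    device: str        # network device that carries the link
    device_type: str
    latency_ms: float
    loss_rate: float


def is_faulty(sample: LinkSample) -> bool:
    """A link is faulty when any quality characteristic exceeds its threshold."""
    return sample.latency_ms > MAX_LATENCY_MS or sample.loss_rate > MAX_LOSS_RATE


def recovery_plan(samples):
    """Map each faulty network device to the strategy chosen for its type."""
    plan = {}
    for sample in samples:
        if is_faulty(sample):
            plan[sample.device] = RECOVERY_BY_DEVICE_TYPE.get(
                sample.device_type, DEFAULT_RECOVERY)
    return plan


samples = [
    LinkSample("camera-front", "ecu-1", "sw-a", "switch", 4.0, 0.001),
    LinkSample("lidar-top", "ecu-1", "sw-b", "switch", 35.0, 0.000),
    LinkSample("ecu-1", "brake-ctl", "h-3", "harness", 3.0, 0.100),
]
plan = recovery_plan(samples)
```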
[0062] Specifically, Figure 4 shows the overall process of the autonomous driving fault-tolerant method provided in this application embodiment. After the system enters the normal operation phase, a comprehensive intelligent monitoring system is established to monitor and analyze the system status in real time across multiple dimensions, including hardware resources, functional performance, and communication quality. The device node monitoring module continuously collects operational status data of the hardware resources through the operating system interface and hardware sensors. For computing units, it monitors indicators such as CPU / GPU utilization, memory usage, cache hit rate, temperature, and power consumption; for sensor devices, it monitors parameters such as data acquisition frequency, signal quality, and calibration status; for actuators, it monitors information such as response latency, execution accuracy, and wear status. Together these form the multi-source monitoring data. The monitoring data is compared with preset safety thresholds so that resource anomalies and performance degradation are detected promptly; the detected anomalies are then aggregated and their fault type is analyzed. Fault types are categorized as sensor faults, computing unit faults, actuator faults, and network communication faults.
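As a non-limiting illustration, comparing monitoring data with preset thresholds, aggregating the anomalies per node, and assigning one of the four fault types could be sketched as follows. The metric names and threshold values are assumptions for the example.

```python
# Hypothetical safety thresholds: (node kind, metric) -> (minimum, maximum).
THRESHOLDS = {
    ("computing_unit", "cpu_utilization"): (0.0, 0.95),
    ("computing_unit", "temperature_c"): (-40.0, 95.0),
    ("sensor", "acquisition_hz"): (25.0, 35.0),
    ("actuator", "response_latency_ms"): (0.0, 50.0),
    ("network", "packet_loss_rate"): (0.0, 0.02),
}

# The four fault categories named in the description.
FAULT_TYPE = {
    "sensor": "sensor fault",
    "computing_unit": "computing unit fault",
    "actuator": "actuator fault",
    "network": "network communication fault",
}


def detect(readings):
    """Aggregate out-of-range readings per node and attach the fault type.

    readings: iterable of (node_id, node_kind, metric, value) tuples.
    """
    faults = {}
    for node_id, kind, metric, value in readings:
        limits = THRESHOLDS.get((kind, metric))
        if limits is None:
            continue  # no threshold configured for this metric
        low, high = limits
        if not low <= value <= high:
            entry = faults.setdefault(
                node_id, {"type": FAULT_TYPE[kind], "metrics": []})
            entry["metrics"].append(metric)
    return faults


readings = [
    ("ecu-1", "computing_unit", "temperature_c", 101.0),
    ("ecu-1", "computing_unit", "cpu_utilization", 0.50),
    ("cam-front", "sensor", "acquisition_hz", 12.0),
    ("brake", "actuator", "response_latency_ms", 20.0),
]
faults = detect(readings)
```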
[0063] With regard to the multi-source monitoring data, a functional node monitoring module can be used for application-level performance monitoring, checking business indicators of each functional node such as data processing latency, message publishing frequency, calculation accuracy, and functional integrity. The health status of functional nodes is assessed by analyzing high-level indicators such as the detection confidence of perception algorithms, the rationality of planning decisions, and the smoothness of control commands. For functional nodes configured with fault tolerance, special attention is paid to the state synchronization of the primary and backup nodes to ensure that the backup nodes can take over seamlessly at any time. A communication status monitoring module checks the quality of the communication links between nodes, monitoring indicators such as network connection status, communication latency, packet loss rate, and bandwidth utilization. Combined with the URI naming convention (target registration information), this module can pinpoint where a communication problem occurs and how far it extends, distinguishing single-node failures from network partitions and providing crucial information for fault diagnosis. Finally, a monitoring information collection module collects and preprocesses the data from the various monitoring sources, including data cleaning, format standardization, and timestamp alignment. Multi-dimensional correlation analysis of the data identifies potential fault modes and impact chains. Based on historical data and machine learning algorithms, faults can be predicted and warned of in advance, turning passive response into proactive prevention.
[0064] Furthermore, when the monitoring system detects an anomaly, it triggers an intelligent fault type analysis process. This analysis comprehensively considers factors such as the severity, scope of impact, frequency of occurrence, and difficulty of recovery to accurately classify and rate the fault. Based on the analysis results, the system automatically selects the optimal recovery strategy to ensure rapid restoration of system functionality while minimizing performance impact.
[0065] Specifically, regarding sensor failures, the impact on perception is assessed. When a sensor node failure occurs, the extent of its impact on the perception system is first evaluated. For critical sensors such as the forward-facing main camera, a local restart recovery strategy is prioritized. The device's resource reserves and health status are checked to ensure the restart process does not trigger secondary failures. During the restart process, other sensors are coordinated to temporarily enhance the perception capabilities of the relevant areas and fill data gaps.
[0066] Accordingly, in the event of a computing unit failure, the computing load is checked, and the computing tasks running on the failed unit are migrated to other healthy computing nodes. The resource scheduling module queries the current resource database to find candidate nodes with sufficient computing power, moderate load, and high reliability. During the selection process, the matching degree of various resource types such as CPU / GPU computing power, memory capacity, storage I / O, and network bandwidth is comprehensively considered, as well as the historical reliability record and current temperature status of the target node.
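As a non-limiting illustration, selecting a migration target from the resource database could be sketched as a feasibility filter followed by a ranking. The resource names, the thermal limit, and the scoring rule (reliability multiplied by free load share) are assumptions for the example.

```python
MAX_TEMPERATURE_C = 85.0  # hypothetical thermal limit for accepting new work


def pick_target(task_needs, nodes):
    """Return the best healthy node that satisfies every resource need, or None."""
    feasible = [
        n for n in nodes
        if n["temperature_c"] < MAX_TEMPERATURE_C
        and all(n["free"].get(res, 0) >= need for res, need in task_needs.items())
    ]
    if not feasible:
        return None
    # Hypothetical score: prefer reliable nodes that still have load headroom.
    return max(feasible, key=lambda n: n["reliability"] * (1.0 - n["load"]))


task_needs = {"gpu_tflops": 4, "memory_gb": 8, "bandwidth_mbps": 100}
nodes = [
    {"name": "A", "free": {"gpu_tflops": 2, "memory_gb": 16, "bandwidth_mbps": 500},
     "load": 0.10, "reliability": 0.99, "temperature_c": 50},
    {"name": "B", "free": {"gpu_tflops": 8, "memory_gb": 16, "bandwidth_mbps": 500},
     "load": 0.40, "reliability": 0.90, "temperature_c": 60},
    {"name": "C", "free": {"gpu_tflops": 6, "memory_gb": 8, "bandwidth_mbps": 200},
     "load": 0.20, "reliability": 0.60, "temperature_c": 70},
    {"name": "D", "free": {"gpu_tflops": 9, "memory_gb": 32, "bandwidth_mbps": 900},
     "load": 0.05, "reliability": 0.99, "temperature_c": 95},
]
target = pick_target(task_needs, nodes)
```

Node A is excluded for insufficient GPU capacity and node D for its temperature, so the choice falls between B and C on the score alone.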
[0067] Accordingly, for actuator failures, the execution response is tested, and different recovery strategies are adopted according to the severity of the failure. For minor failures, such as increased response delay, compensation can be made by adjusting the control algorithm parameters; for severe failures, such as complete failure, redundant actuators need to be activated to take over the work, and the vehicle's control strategy needs to be replanned.
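As a non-limiting illustration, the severity-dependent choice of actuator recovery strategy could be sketched as follows. The function name, the returned fields, and the fallback to degradation when no redundant actuator exists are assumptions for the example.

```python
def plan_actuator_recovery(nominal_delay_ms, measured_delay_ms,
                           responding, has_redundant):
    """Choose a recovery action from the observed actuator behavior."""
    if not responding:
        # Severe fault: hand over to a redundant actuator if one exists and
        # replan the vehicle's control strategy (fallback is an assumption).
        action = "activate_redundant_actuator" if has_redundant else "degrade_system"
        return {"action": action, "replan_control": True}
    extra_delay_ms = measured_delay_ms - nominal_delay_ms
    if extra_delay_ms > 0:
        # Minor fault: issue commands earlier by the additional delay.
        return {"action": "compensate_parameters", "lead_time_ms": extra_delay_ms}
    return {"action": "none"}


minor = plan_actuator_recovery(20.0, 32.0, responding=True, has_redundant=False)
severe = plan_actuator_recovery(20.0, 0.0, responding=False, has_redundant=True)
```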
[0068] Furthermore, in one embodiment, if the original fault of the faulty hardware node or the fault-tolerant unit is not eliminated after restarting the faulty hardware node or restarting the fault-tolerant unit, then the corresponding system downgrade processing is performed on the vehicle's autonomous driving system, and a system downgrade processing notification is generated for the driver.
[0069] Specifically, in extreme situations such as severe resource shortages or multiple failures, the system can activate a tiered safety degradation mode. First, it ensures the vehicle's most basic safety functions, such as emergency braking and minimum-risk maneuvering; then, based on remaining resources, it gradually restores other critical functions; simultaneously, it issues clear system status alerts and takeover requests to the driver and passengers. The system continuously monitors resource conditions, and once conditions improve, it immediately attempts to restore the degraded non-critical functions.
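As a non-limiting illustration, falling back from a failed restart to tiered degradation could be sketched as follows. The function list, tiers, resource costs, and the callback interface are assumptions for the example.

```python
# Hypothetical functions: (name, tier, resource cost); tier 0 = basic safety.
FUNCTIONS = [
    ("emergency_braking", 0, 2),
    ("minimum_risk_maneuver", 0, 3),
    ("lane_keeping", 1, 3),
    ("adaptive_cruise", 2, 4),
    ("auto_parking", 3, 5),
]


def degraded_function_set(available_units):
    """Keep tier-0 functions unconditionally, then add others while resources last."""
    active = []
    remaining = available_units
    for name, tier, cost in sorted(FUNCTIONS, key=lambda f: f[1]):
        if tier == 0 or cost <= remaining:
            active.append(name)
            remaining -= cost
    return active


def recover_or_degrade(restart, fault_cleared, available_units):
    """Restart once; if the original fault persists, degrade and notify the driver."""
    restart()
    if fault_cleared():
        return {"mode": "normal", "notify_driver": False}
    return {
        "mode": "degraded",
        "active": degraded_function_set(available_units),
        "notify_driver": True,
    }


# Example: the restart does not clear the fault and 9 resource units remain.
result = recover_or_degrade(lambda: None, lambda: False, available_units=9)
```

Re-running `degraded_function_set` with a larger budget once resources recover would bring back the functions that were dropped, matching the restoration behavior described above.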
[0070] In-situ node updates involve restarting and parameter compensation, while function migration and deployment involve using backup nodes. After the entire recovery process is complete, a comprehensive verification of the recovery effect is performed. Functional testing and performance benchmark checks confirm whether the fault has been truly resolved and whether the system functions have returned to normal. After successful verification, the system's global status information is updated to ensure that all nodes have a consistent understanding of the system's current state, thus restoring normal operation.
[0071] For example, Figure 5 shows the application architecture of the autonomous driving fault-tolerant method provided in this application embodiment. It is mainly divided into an autonomous driving application layer, a fault-tolerant architecture core module, a communication middleware layer, and an in-vehicle heterogeneous hardware platform. The autonomous driving application layer includes functional units such as perception algorithms, planning and decision-making, and control execution, achieving highly reliable autonomous driving functions. The fault-tolerant architecture core module integrates modules such as URI registration, fault-tolerant communication, configuration management, monitoring collection, device monitoring, function monitoring, decision-making, and resource scheduling. The communication middleware layer includes DDS, HTTP, and multicast discovery, achieving triple parallel registration. The in-vehicle heterogeneous hardware platform includes resources such as sensors, computing units, actuators, and network devices. Within the fault-tolerant architecture core module, the unified URI naming and automatic registration module (URI registration) is responsible for providing standardized identifiers and multi-level automatic registration for all nodes. The URI structure is: /{vehicle domain}/{location}/{hardware type}/{function}/{instance ID}. The registration mechanism is a triple registration: DDS discovery layer registration, centralized registration center, and distributed multicast discovery. The fault-tolerant communication infrastructure module is responsible for automatic primary / backup node election, state synchronization, and seamless failover, featuring intelligent election based on multi-dimensional weights, real-time state synchronization, and millisecond-level failover.
The fault-tolerant functional node configuration module (configuration management module) lets users dynamically configure the hardware nodes that require fault tolerance and their fault-tolerant policies, featuring a graphical configuration interface, hot loading of policies, and hierarchical permission management. The device node monitoring module (device monitoring module) is responsible for real-time monitoring of the operating status and health of hardware resources, with monitoring indicators including CPU / GPU utilization, memory usage, temperature, power consumption, and communication quality. The functional node monitoring module (function monitoring module) monitors the operating status and performance indicators of application functional nodes, with a monitoring scope that includes data processing latency, message frequency, calculation accuracy, and functional integrity. The monitoring information collection module (monitoring collection module) is responsible for the unified collection, preprocessing, and analysis of multi-source monitoring data, performing data cleaning, anomaly detection, correlation analysis, and trend prediction. The decision module is responsible for intelligently analyzing fault types, assessing impact, and formulating optimal recovery strategies, performing multi-dimensional fault analysis, hierarchical recovery strategy selection, and risk assessment and prediction. Finally, the resource scheduling and execution module performs recovery operations based on a resource-aware dynamic scheduling algorithm, which covers resource adequacy assessment, reliability history analysis, and performance matching optimization.
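As a non-limiting illustration, the five-segment URI naming described above could be built and parsed as follows. The class name, the field names, and the sample values are assumptions for the example; only the order of the segments comes from the description.

```python
from typing import NamedTuple


class NodeUri(NamedTuple):
    """Identifier of the form /{vehicle domain}/{location}/{hardware type}/{function}/{instance ID}."""
    vehicle_domain: str
    location: str
    hardware_type: str
    function: str
    instance_id: str

    def __str__(self) -> str:
        return "/" + "/".join(self)


def parse_uri(text: str) -> NodeUri:
    """Split a URI into its five segments, rejecting malformed input."""
    parts = text.strip("/").split("/")
    if len(parts) != 5 or not all(parts):
        raise ValueError(f"expected 5 non-empty segments, got: {text!r}")
    return NodeUri(*parts)


# Hypothetical identifier for a front camera running object detection.
uri = NodeUri("perception", "front", "camera", "object_detection", "cam01")
text = str(uri)
```

Because the domain, location, and hardware type are encoded in the identifier itself, a monitoring module can group or filter nodes by any segment without a separate lookup.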
[0072] Specifically, in one embodiment, the decision module in the current architecture primarily relies on predefined rules and thresholds for fault diagnosis and response strategy selection. In practical applications, it can be upgraded to an intelligent decision-making core with online learning and evolution capabilities. This extended module continuously receives real-time system status, fault characteristics, and historical fault-tolerance feedback from the monitoring collection, device monitoring, and function monitoring modules, and constructs a model that maps system health status and faults to fault-tolerance actions. Through reinforcement learning algorithms, the module can dynamically evaluate the long-term expected benefit of different fault-tolerance strategies (such as restart, compensation, migration, and degradation) in the current context (such as vehicle speed, road environment, task criticality, and resource bottlenecks), rather than simply following static rules. For example, when system resources are strained and multiple non-critical faults occur, it can learn that delaying restarts and prioritizing the control link yields higher overall reliability than immediately restarting all faulty components. This extension enables the system not only to handle known fault modes but also to adaptively optimize its response to unknown or complex faults, moving from rule-based fault tolerance to adaptive fault tolerance based on goal optimization and learning. This improves the level of intelligence in extreme edge scenarios and further enhances the safety and reliability of autonomous driving.
[0073] Specifically, in a distributed vehicular environment, maintaining the consistency of the state of each node is crucial for system security and performance. The system employs a multi-mechanism collaborative approach to ensure the reliable propagation and consistency maintenance of state information.
[0074] For real-time status information, such as sensor data, control commands, and vehicle attitude, DDS's real-time data distribution mechanism is used for propagation. Utilizing DDS's built-in Quality of Service (QoS) policies, such as reliability, durability, and deadlines, it ensures that critical status information reaches all relevant nodes in a timely and reliable manner. DDS's topic-based publish-subscribe model naturally supports one-to-many communication, making it suitable for broadcasting important system status.
[0075] For management-related status information, such as node configuration, resource allocation, and fault statistics, a centralized registry center is used for unified management and distribution. As the authoritative data source, the registry center maintains a consistent view of the global status and provides services such as status query, update, and subscription. Nodes obtain the latest status information through periodic polling or change notification mechanisms to ensure consistent global understanding.
[0076] In the event of network partitions, central node failures, or other anomalies, a multicast mechanism is used to maintain basic state synchronization within the local network. Upon detecting a loss of connection with the central node, a node automatically switches to distributed negotiation mode, using multicast communication to synchronize critical state information within its reach and maintain basic system functionality.
[0077] State consistency checks are performed regularly by comparing key state information across different nodes to identify potential inconsistencies. The check employs multiple methods, including sampling verification, checksum comparison, and logical consistency verification. When inconsistencies are detected, appropriate remediation procedures are initiated based on their severity and scope. Minor inconsistencies can be corrected automatically through state synchronization; severe inconsistencies involving safety-critical states may require isolating the relevant nodes and requesting manual intervention.
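As a non-limiting illustration, the checksum comparison mentioned above could be sketched as follows. The use of SHA-256 over a canonical JSON encoding and the majority rule for choosing the reference state are assumptions for the example.

```python
import hashlib
import json
from collections import Counter


def state_digest(state: dict) -> str:
    """Checksum of a canonical (key-sorted) JSON encoding of the state."""
    encoded = json.dumps(state, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


def find_inconsistent(node_states: dict) -> list:
    """Return the nodes whose checksum differs from the most common one."""
    digests = {node: state_digest(s) for node, s in node_states.items()}
    majority, _ = Counter(digests.values()).most_common(1)[0]
    return sorted(node for node, d in digests.items() if d != majority)


# Hypothetical key state held by three nodes; ecu-3 has a stale map version.
node_states = {
    "ecu-1": {"map_version": 7, "mode": "cruise"},
    "ecu-2": {"mode": "cruise", "map_version": 7},
    "ecu-3": {"map_version": 6, "mode": "cruise"},
}
outliers = find_inconsistent(node_states)
```

Only the fixed-size digests need to be exchanged between nodes, so the check stays cheap even when the state itself is large.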
[0078] Through the coordinated operation of the above mechanisms, the consistency of the state can be maintained under various operating conditions, providing a solid foundation for reliable fault-tolerant processing and ensuring that the autonomous driving system can make safe and reliable decisions based on accurate and consistent state information under any circumstances.
[0079] The autonomous driving fault-tolerant method provided in this application implements a targeted fault-tolerant strategy for faulty functional units that lack redundant backup units, based on their type. The fault-tolerant strategy includes parameter compensation and restart processing, thereby achieving fault-tolerant processing for functional units that are not designed with redundancy due to cost and resource constraints. This solves the technical problem of low fault tolerance coverage and low reliability of autonomous driving systems caused by the lack of redundant design in a large number of functional units in related technologies, and achieves the technical effect of expanding the fault tolerance coverage and improving the reliability and safety of autonomous driving.
[0080] Furthermore, the fault recovery time is significantly shortened. Traditional solutions typically require manual recovery at the minute level, while the embodiments of this application achieve automatic recovery at the second level, with critical faults switching at the millisecond level. System reliability is improved, and resource utilization efficiency is optimized. Traditional solutions use static redundancy, while the embodiments of this application use dynamic shared redundancy, resulting in higher resource utilization.
[0081] Furthermore, development costs are significantly reduced. Traditional solutions typically require developing fault-tolerant logic separately for each hardware node, while this application provides a unified fault-tolerant framework that allows hardware nodes to be plug-and-play. Maintenance costs are also significantly reduced. Traditional solutions generally involve decentralized maintenance, which is time-consuming for fault location. This application provides centralized management, intelligent diagnosis, and rapid fault location. System lifecycle costs are optimized. Traditional solutions typically require redeveloping fault-tolerant logic for hardware upgrades, while in this application, the framework supports seamless hardware upgrades and expansions.
[0082] Furthermore, it offers enhanced flexibility and excellent scalability. It supports dynamic plugging and hot-swapping of hardware nodes, enabling rapid adaptation to different vehicle models and hardware configurations, and supports online fault tolerance strategy adjustments and function upgrades. Its modular design supports rapid integration of new sensors and computing units. The embodiments in this application feature a distributed architecture, supporting large-scale system expansion from single-domain to cross-domain, and it has open interfaces, supporting collaborative fault tolerance and data analysis with the cloud. Multiple security mechanisms ensure that single-point failures do not trigger system-level failures, enabling fault isolation and impact control, and limiting the scope of fault propagation. It incorporates machine learning-based fault prediction and health management; intelligent resource scheduling that dynamically optimizes resource allocation based on business needs; and adaptive fault tolerance strategies that automatically adjust based on the operating environment and system status.
[0083] Through the above description of the embodiments, those skilled in the art can clearly understand that the methods according to the above embodiments can be implemented by means of software plus necessary general-purpose hardware platforms. Of course, they can also be implemented by hardware, but in many cases the former is a better implementation method.
[0084] The embodiments of this application also provide an autonomous driving fault-tolerant device for executing the autonomous driving fault-tolerant method provided in the above embodiments.
[0085] Figure 6 is a structural schematic of an autonomous driving fault-tolerant device provided in an embodiment of this application. The autonomous driving fault-tolerant device 60 includes: a first acquisition module 601, a determination module 602, a second acquisition module 603, a compensation module 604, a third acquisition module 605, and a restart module 606.
[0086] The device comprises the following modules: a first acquisition module for acquiring redundant design information of faulty functional units in the vehicle's autonomous driving system; a determination module for regarding a faulty functional unit as a fault-tolerant unit (that is, a unit to receive fault-tolerant processing) if its redundant design information indicates that it has no backup unit; a second acquisition module for acquiring execution fault characteristic information of the fault-tolerant unit if it is an actuator; a compensation module for performing, after the fault-tolerant unit determines the action to be executed in response to an action control command, parameter compensation on that action based on the execution fault characteristic information, so that the actual execution action of the fault-tolerant unit matches the action control command; a third acquisition module for acquiring the current state information of the fault-tolerant unit if it is a target functional unit; and a restart module for restarting the fault-tolerant unit when its current state information meets a preset restart condition, in order to eliminate the original fault.
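As a non-limiting illustration, the cooperation of the determination, compensation, and restart modules could be sketched as follows. The unit fields, the set of restartable unit kinds, the restart states, and the "delivered fraction" model of the execution fault characteristic are all assumptions for the example.

```python
RESTARTABLE_KINDS = {"sensor", "computing_unit"}   # hypothetical target functional units
RESTART_STATES = {"hung", "no_output"}             # hypothetical preset restart conditions


def select_strategy(unit):
    """Pick the fault-tolerant handling for one faulty functional unit."""
    if unit["has_backup"]:
        return "use_backup"              # redundant units are handled elsewhere
    if unit["kind"] == "actuator":
        return "parameter_compensation"
    if unit["kind"] in RESTARTABLE_KINDS and unit["state"] in RESTART_STATES:
        return "restart"
    return "no_action"


def compensate_command(commanded, delivered_fraction, limit):
    """Scale the command so the degraded actuator's actual output matches it.

    delivered_fraction is the execution fault characteristic assumed here:
    the share of the commanded value that the actuator actually delivers.
    """
    if delivered_fraction <= 0:
        raise ValueError("actuator delivers no output; compensation is impossible")
    adjusted = commanded / delivered_fraction
    return max(-limit, min(limit, adjusted))


brake = {"kind": "actuator", "has_backup": False, "state": "degraded"}
strategy = select_strategy(brake)
adjusted = compensate_command(100.0, 0.8, limit=200.0)
```

With an actuator that delivers 80% of what is requested, the adjusted command is about 125, so the actual output lands near the commanded 100; the clamp keeps the adjusted command within the actuator's limit.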
[0087] For a description of the features in the embodiments corresponding to the autonomous driving fault-tolerant device, please refer to the relevant descriptions in the embodiments corresponding to the autonomous driving fault-tolerant method, which will not be repeated here.
[0088] Embodiments of this application also provide an electronic device. Figure 7 is a schematic diagram of the structure of an electronic device provided in an embodiment of this application, including a processor 10 and a memory 20. The memory 20 stores a computer program, and the processor 10 is configured to run the computer program to execute the steps in any of the above-described embodiments of the autonomous driving fault-tolerant method.
[0089] Embodiments of this application also provide a computer-readable storage medium storing a computer program, wherein the computer program is configured to execute the steps in any of the above embodiments of the autonomous driving fault-tolerant method when running.
[0090] In one exemplary embodiment, the aforementioned computer-readable storage medium may include, but is not limited to, various media capable of storing computer programs, such as a USB flash drive, read-only memory (ROM), random access memory (RAM), portable hard disk, magnetic disk, or optical disk.
[0091] Embodiments of this application also provide a computer program product, which includes a computer program that, when executed by a processor, implements the steps in any of the above embodiments of the autonomous driving fault-tolerant method.
[0092] Embodiments of this application also provide another computer program product, including a non-volatile computer-readable storage medium storing a computer program, which, when executed by a processor, implements the steps in any of the above embodiments of the autonomous driving fault-tolerant method.
[0093] Any of the components, modules, units, parts, methods, and operations described herein can be implemented using software, firmware, hardware (e.g., fixed logic circuitry), manual processing, or any combination thereof. Alternatively or additionally, any functionality described herein can be executed at least in part by one or more hardware logic components, such as, but not limited to, a central processing unit (CPU), a field-programmable gate array (FPGA), an application-specific integrated circuit (ASIC), an application-specific standard product (ASSP), a system-on-a-chip (SoC), a complex programmable logic device (CPLD), a microcontroller unit (MCU), etc. The terms "system," "computing device," or "apparatus" as used herein encompass various means, devices, and machines for processing data, including, for example, one or more programmable processors, computers, SoCs, or combinations thereof. The apparatus may also include code that creates an execution environment for the computer program in question, such as code constituting processor firmware, a protocol stack, a database management system, an operating system, a cross-platform runtime environment, a virtual machine, or one or more combinations thereof. The aforementioned computer program (also known as a program, software, software application, app, script, or code) can be written in any form of programming language, including compiled or interpreted languages, declarative or procedural languages, and can be deployed in any form, including as a standalone program or as a module, component, subroutine, object, or other unit suitable for a computing environment.
[0094] Those skilled in the art will further recognize that the units and algorithm steps of the various examples described in conjunction with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of both. To clearly illustrate the interchangeability of hardware and software, the components and steps of the various examples have been generally described in terms of functionality in the foregoing description. Whether these functions are implemented in hardware or software depends on the specific application and design constraints of the technical solution. Those skilled in the art can use different methods to implement the described functions for each specific application, but such implementation should not be considered beyond the scope of this application.
[0095] The above provides a detailed description of an autonomous driving fault-tolerant method and electronic device provided in this application. Specific examples have been used to illustrate the principles and implementation methods of this application. The descriptions of the embodiments above are only intended to help understand the method and core ideas of this application. It should be noted that those skilled in the art can make various improvements and modifications to this application without departing from its principles, and these improvements and modifications also fall within the protection scope of the claims of this application.
Claims
1. A fault-tolerant method for autonomous driving, characterized in that it comprises: obtaining redundant design information of faulty functional units in the vehicle's autonomous driving system; if the redundant design information of any faulty functional unit indicates that the faulty functional unit has no backup unit, regarding the faulty functional unit as a fault-tolerant unit; if the fault-tolerant unit is an actuator, obtaining execution fault characteristic information of the fault-tolerant unit; after the fault-tolerant unit determines the action to be executed in response to an action control command, performing parameter compensation on the action to be executed by the fault-tolerant unit according to the execution fault characteristic information of the fault-tolerant unit, so that the actual execution action of the fault-tolerant unit matches the action control command; if the fault-tolerant unit is a target functional unit, obtaining current state information of the fault-tolerant unit; and when the current state information of the fault-tolerant unit meets a preset restart condition, restarting the fault-tolerant unit to eliminate the original fault.
2. The autonomous driving fault-tolerant method according to claim 1, characterized in that, before obtaining the redundant design information of the faulty functional units in the vehicle's autonomous driving system, the method further comprises: obtaining unit registration information of the vehicle's autonomous driving system; collecting, based on the unit registration information, operational characteristic information from multiple registered units; and when the operational characteristic information of any registered unit meets the corresponding fault condition, determining that registered unit to be a faulty functional unit.
3. The autonomous driving fault-tolerant method according to claim 2, characterized in that the step of obtaining the unit registration information of the vehicle's autonomous driving system comprises: determining target registration information of each functional unit in the vehicle's autonomous driving system based on the vehicle domain, location, hardware type, function type, and unique instance identifier of that functional unit; for any functional unit, registering, through multiple parallel registration channels, the target registration information of the functional unit in the discovery caches of the hardware nodes other than the local hardware node on which the functional unit is located, in the central registration center, and in the distributed neighbor tables of preset listening nodes, wherein the other hardware nodes and the local hardware node belong to the same discovery domain, and the preset listening nodes and the local hardware node belong to the same local area network; and obtaining the unit registration information of the vehicle's autonomous driving system by performing a union operation on the target registration information currently registered in the discovery cache of each hardware node, the central registration center, and the distributed neighbor tables of the preset listening nodes, wherein the unit registration information includes the target registration information of each registered unit.
4. The autonomous driving fault-tolerant method according to claim 3, characterized in that the step of collecting operational characteristic information from multiple registered units based on the unit registration information comprises: for any registered unit, determining target information to be collected based on the vehicle domain and function type represented by the target registration information of the registered unit; determining a target acquisition strategy based on the location and hardware type represented by the target registration information of the registered unit; and collecting the target information according to the target acquisition strategy to obtain the operational characteristic information of the registered unit; wherein the vehicle domain is divided into at least three types: an environmental perception domain, a positioning domain, and a planning domain.
5. The autonomous driving fault-tolerant method according to claim 3, characterized in that the method further includes: obtaining hardware characteristic information of each hardware node; when the hardware characteristic information of any of the hardware nodes meets the corresponding fault condition, determining that hardware node to be a faulty hardware node; and when the redundancy design information of the faulty hardware node indicates that the faulty hardware node has no backup hardware, restarting the faulty hardware node according to the hardware type of the faulty hardware node; wherein the hardware nodes include sensors and computing devices.
6. The autonomous driving fault-tolerant method according to claim 5, characterized in that the method further includes: when the redundancy design information of the faulty hardware node indicates that the faulty hardware node has backup hardware, migrating the current services of the faulty hardware node to the corresponding backup hardware, so that the backup hardware acts as the master node to handle the current services in place of the faulty hardware node.
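Claims 5 and 6 together describe a two-way decision for a faulty hardware node: restart it when no backup hardware exists, otherwise migrate its services. The following sketch shows that branching under stated assumptions; the `HardwareNode` record, the fault thresholds and the returned action strings are hypothetical and do not appear in the application.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class HardwareNode:
    name: str
    hardware_type: str                 # "sensor" or "computing_device"
    features: Dict[str, float]         # hardware characteristic information
    backups: List[str] = field(default_factory=list)  # redundancy design information


# Hypothetical fault conditions, one per hardware type; thresholds are illustrative.
FAULT_CONDITIONS: Dict[str, Callable[[Dict[str, float]], bool]] = {
    "sensor": lambda f: f.get("frame_rate_hz", 0.0) < 5.0,
    "computing_device": lambda f: f.get("temperature_c", 0.0) > 95.0,
}


def decide_action(node: HardwareNode) -> str:
    """Return 'none', 'restart:<hardware type>' or 'migrate:<backup name>'."""
    condition = FAULT_CONDITIONS.get(node.hardware_type)
    if condition is None or not condition(node.features):
        return "none"                               # no fault condition met
    if node.backups:
        return f"migrate:{node.backups[0]}"         # claim 6: backup hardware exists
    return f"restart:{node.hardware_type}"          # claim 5: no backup hardware


lidar = HardwareNode("lidar_front", "sensor", {"frame_rate_hz": 2.0})
ecu = HardwareNode("ecu_a", "computing_device", {"temperature_c": 101.0}, backups=["ecu_b"])
```

Here `decide_action(lidar)` yields a restart keyed by hardware type, while `decide_action(ecu)` yields a migration to the first listed backup. Choosing among several backups is the subject of claim 8.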
7. The autonomous driving fault-tolerant method according to claim 5, characterized in that the method further includes: obtaining the communication link quality characteristics between the hardware nodes; identifying faulty communication links and the corresponding faulty network devices based on the communication link quality characteristics between the hardware nodes; and determining a fault recovery strategy for each faulty communication link based on the faulty communication link and the type of the corresponding faulty network device; wherein the fault recovery strategy includes at least restarting the faulty network device.
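The link screening in claim 7 can be sketched as a threshold filter followed by a per-device-type strategy lookup. The quality metrics (packet loss and latency), the thresholds, the device types and the strategy names below are all assumptions made for illustration; the claim itself only requires that the strategy include restarting the faulty network device.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Link:
    src: str
    dst: str
    device: str          # network device the link traverses (assumed to be known)
    device_type: str     # e.g. "switch" or "gateway"
    loss_rate: float     # fraction of packets lost, 0.0 to 1.0
    latency_ms: float


def screen_faulty_links(links: List[Link],
                        max_loss: float = 0.05,
                        max_latency_ms: float = 50.0) -> List[Link]:
    """Keep links whose quality characteristics exceed either threshold."""
    return [l for l in links if l.loss_rate > max_loss or l.latency_ms > max_latency_ms]


# Hypothetical strategies; each one includes restarting the device, as the claim requires.
RECOVERY_BY_DEVICE_TYPE: Dict[str, str] = {
    "switch": "restart_device",
    "gateway": "reset_port_then_restart_device",
}


def recovery_plan(links: List[Link]) -> Dict[str, str]:
    """Map each faulty network device to a recovery strategy."""
    return {
        l.device: RECOVERY_BY_DEVICE_TYPE.get(l.device_type, "restart_device")
        for l in screen_faulty_links(links)
    }


links = [
    Link("node_a", "node_b", "sw0", "switch", loss_rate=0.20, latency_ms=4.0),
    Link("node_a", "node_c", "gw0", "gateway", loss_rate=0.00, latency_ms=120.0),
    Link("node_b", "node_c", "sw1", "switch", loss_rate=0.01, latency_ms=3.0),
]
plan = recovery_plan(links)
```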
8. The autonomous driving fault-tolerant method according to claim 6, characterized in that the step of migrating the current services of the faulty hardware node to the corresponding backup hardware, so that the backup hardware acts as the master node to handle the current services in place of the faulty hardware node, includes: when the backup hardware is not unique, obtaining the static capability and dynamic capability of each backup hardware; determining a comprehensive weight value for each backup hardware based on the static capability and dynamic capability of that backup hardware; selecting a target backup hardware from among the multiple backup hardware based on the comprehensive weight value of each backup hardware; and migrating the current services of the faulty hardware node to the target backup hardware, so that the target backup hardware acts as the master node to handle the current services in place of the faulty hardware node.
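Claim 8 does not state how the comprehensive weight value is computed. As one possible reading, the sketch below assumes that static and dynamic capability are each normalized to the range 0 to 1 and combined as a weighted sum; the function names, the mixing factor `alpha` and the sample scores are hypothetical.

```python
from typing import Dict, Tuple


def comprehensive_weight(static_capability: float,
                         dynamic_capability: float,
                         alpha: float = 0.4) -> float:
    """Weighted sum of the two capabilities (assumed formula, not from the claim).

    alpha is the share given to static capability (e.g. rated compute),
    1 - alpha the share given to dynamic capability (e.g. current free load).
    """
    return alpha * static_capability + (1.0 - alpha) * dynamic_capability


def select_target_backup(backups: Dict[str, Tuple[float, float]],
                         alpha: float = 0.4) -> str:
    """Pick the backup with the highest comprehensive weight value."""
    if not backups:
        raise ValueError("no backup hardware available")
    return max(backups, key=lambda name: comprehensive_weight(*backups[name], alpha=alpha))


# name -> (static capability, dynamic capability), illustrative values only
candidates = {"ecu_b": (0.9, 0.2), "ecu_c": (0.6, 0.8)}
target = select_target_backup(candidates)
```

With the default `alpha`, the lightly loaded `ecu_c` scores 0.72 against 0.48 for `ecu_b` and is selected; setting `alpha=1.0` ignores dynamic capability and selects `ecu_b` instead.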
9. The autonomous driving fault-tolerant method according to claim 5, characterized in that the method further includes: when the original fault of the faulty hardware node or the fault-tolerant unit is not eliminated after the faulty hardware node or the fault-tolerant unit is restarted, performing the corresponding system downgrade processing on the vehicle autonomous driving system, and generating a system downgrade processing notification for the driver.
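The fallback in claim 9 amounts to a restart attempt followed by a check and, if the fault persists, a downgrade with a driver notification. A minimal sketch follows; the callback-based interface, the `max_restarts` parameter and the notification text are assumptions for illustration and are not specified in the application.

```python
from typing import Callable, List


def recover_or_downgrade(restart: Callable[[], None],
                         fault_cleared: Callable[[], bool],
                         downgrade: Callable[[], None],
                         notify_driver: Callable[[str], None],
                         max_restarts: int = 1) -> str:
    """Restart up to max_restarts times; downgrade and notify if the fault remains."""
    for _ in range(max_restarts):
        restart()
        if fault_cleared():
            return "recovered"
    downgrade()
    notify_driver("Autonomous driving function downgraded; please take over.")
    return "downgraded"


# Illustrative run with a fault that never clears.
events: List[str] = []
result = recover_or_downgrade(
    restart=lambda: events.append("restart"),
    fault_cleared=lambda: False,
    downgrade=lambda: events.append("downgrade"),
    notify_driver=lambda msg: events.append("notify"),
)
```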
10. An electronic device, characterized in that it includes: a memory for storing a computer program; and a processor for implementing the steps of the autonomous driving fault-tolerant method according to any one of claims 1 to 9 when executing the computer program.