Fault-tolerant method for redundancy-free fault reconstruction of an onboard satellite computer

By reading the health status and retry count of the onboard computer's storage devices through the BIOS, and combining this with a hardware watchdog and a network boot entry, the onboard computer achieves autonomous, dynamic fault recovery. This addresses two weaknesses of existing technologies, namely that a storage failure can permanently disable the computer and that recovery cannot be performed autonomously, and it provides real-time fault recovery with a network boot fallback.

CN120929290B · Active · Publication Date: 2026-06-19 · BEIJING ZHONGKE TIANSUAN TECHNOLOGY CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
BEIJING ZHONGKE TIANSUAN TECHNOLOGY CO LTD
Filing Date
2025-07-28
Publication Date
2026-06-19

AI Technical Summary

Technical Problem

In existing technologies, the startup process of onboard computers relies heavily on a single, fixed physical storage device, which is prone to data errors or physical damage due to space radiation or device aging, leading to permanent computer failure. Furthermore, autonomous real-time fault recovery cannot be achieved in deep space exploration missions.

Method used

The system uses the BIOS to access the hardware status manager to obtain the health status data and retry count of the storage device, dynamically selects the normal device for booting, and triggers a system hardware reset after a timeout through a hardware watchdog. Combined with the network boot option, it automatically switches to network boot when the local storage medium fails.

Benefits of technology

It enables autonomous dynamic fault recovery of the onboard computer, avoids system paralysis caused by a single point of failure, recovers from faults in real time without human intervention, and provides network boot as the final fault-tolerant fallback.

✦ Generated by Eureka AI based on patent content.


Abstract

This invention relates to the field of spacecraft electronic systems technology, and discloses a fault-tolerant method for redundancy-free fault reconstruction of an onboard computer, comprising: Step 1: After system startup, the Basic Input / Output System (BIOS) accesses the Hardware Status Manager to obtain health status data recording the health status of the local storage device, the current startup target ID indicating the current startup target, and the retry count associated with the current startup target ID. By using the BIOS to access the Hardware Status Manager to obtain the health status data and retry count of the local storage device, and autonomously selecting a normal device for startup, a fault-tolerant effect is achieved by dynamically avoiding faulty startup paths. Compared to existing technologies where the computer startup process highly relies on a single, fixed physical storage device, this method solves the single-point failure risk of permanent system failure due to the failure of a single storage device.

Description

Technical Field

[0001] This invention relates to the field of spacecraft electronic systems technology, specifically to a fault-tolerant method for redundancy-free fault reconstruction of onboard computers.

Background Technology

[0002] As the core control and data processing unit of a spacecraft, the reliability and stability of the onboard computer directly determine the success or failure of the space mission. The boot process is the initial and crucial step in establishing the computer system's normal operating state. It relies on the Basic Input / Output System (BIOS) to load the operating system bootloader and core files from fixed physical storage devices in a preset order. In conventional ground-based applications, the boot mechanism is mature and reliable. However, in long-term, complex missions such as deep space exploration and on-orbit servicing, onboard computers face multiple severe challenges, including high-energy particle radiation in space, intense temperature fluctuations, and natural aging of components.

[0003] Existing technologies have the following technical shortcomings in addressing startup failures of onboard computers:

[0004] The startup process of existing spaceborne computers relies heavily on a single, fixed physical storage device. When space radiation effects or the aging of the device itself cause data errors or physical damage to the storage device, the computer will lose its boot path, resulting in the computing system being unable to start and falling into a permanent state of paralysis, exposing the inherent risk of a single point of failure.

[0005] Existing technologies employ redundancy backup schemes to address single-point failures. After the primary system fails, the switching and activation of the backup system relies entirely on remote control commands from the ground center. In deep space exploration missions with long communication link delays, the fault handling mode requiring manual intervention from the ground suffers from severe lag and cannot meet the real-time fault recovery requirements for autonomous operation of spacecraft in orbit.

[0006] In the face of extreme situations where onboard storage devices experience complete physical failure, existing technologies do not provide a mechanism that enables the computer to autonomously switch boot modes. The system cannot automatically revert to obtaining a boot image from a healthy node in the satellite network after a local boot failure, thus lacking onboard autonomous recovery capabilities and making it difficult to cope with the problem of permanent failure of storage media.

[0007] To address these issues, this invention proposes a fault-tolerant method for redundancy-free fault reconstruction of onboard computers.

Summary of the Invention

[0008] To address the shortcomings of existing technologies, this invention provides a fault-tolerant method for redundancy-free fault reconstruction of onboard computers, thereby solving the problems mentioned in the background section.

[0009] To achieve the above objectives, the present invention provides the following technical solution: a fault-tolerant method for redundancy-free fault reconstruction of a spaceborne computer, comprising:

[0010] Step 1: After the system starts up, the BIOS accesses the hardware status manager to obtain health status data that records the health status of the local storage device, the current boot target ID that indicates the current boot target, and the retry count value associated with the current boot target ID;

[0011] Step 2: The BIOS determines the device to be started based on the health status data and the current boot target ID obtained in Step 1. The determination process involves selecting a local storage device marked as normal in the health status data, or setting a preset network boot item as the device to be started when all local storage devices are marked as faulty, and updating the current boot target ID in the hardware status manager with the ID corresponding to the device to be started.

[0012] Step 3: Before the operating system is loaded on the device for this startup as determined in Step 2, the Basic Input / Output System (BIOS) determines whether the retry count value obtained in Step 1 has reached the preset fault threshold. If it has not reached the threshold, the retry count value is incremented and updated to the Hardware Status Manager, and then the hardware watchdog built into the Hardware Status Manager is activated.

[0013] Step 4: After activating the hardware watchdog in Step 3, if the operating system successfully loads within the predetermined timeout period, the operating system sends a feed command to the hardware watchdog and resets the retry count value associated with the device being started in the hardware status manager; if the hardware watchdog is triggered due to timeout, the hardware watchdog sends a hardware reset signal to the computer system, and the hardware reset signal restarts the computer system.

[0014] Preferably, in step 2, setting the preset network boot item as the current boot device specifically means that the Basic Input / Output System (BIOS) updates the current boot target ID to a specific ID pointing to the radiation-hardened network interface controller, so that network booting is performed through the preset Preboot Execution Environment (PXE) protocol.

[0015] Preferably, in step 3, if the BIOS determines that the retry count value obtained in step 1 has reached the preset fault threshold, the BIOS updates the health status data, marks the device to be started as faulty, and returns to step 2 to determine the next device to be started.

[0016] In step 3, activating the hardware watchdog includes: the BIOS writing a preset timeout value to the watchdog control register in the hardware status manager, and setting the enable bit in the watchdog control register.
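The watchdog activation described above can be sketched as follows. This is an illustrative model only: the patent states that a timeout value is written and an enable bit is set, but it does not give field positions or widths, so `WDT_TIMEOUT_MASK`, `WDT_ENABLE_BIT`, and the function name are assumptions.

```python
# Hypothetical register layout: the patent names a timeout field and an enable
# bit but does not specify their positions or widths.
WDT_TIMEOUT_MASK = 0x0000_FFFF  # assumed 16-bit timeout field in bits 0..15
WDT_ENABLE_BIT = 1 << 31        # assumed position of the enable bit


def activate_watchdog(ctrl_value: int, timeout: int) -> int:
    """Return the control-register value with the timeout written and the enable bit set."""
    if not 0 <= timeout <= WDT_TIMEOUT_MASK:
        raise ValueError("timeout does not fit in the assumed timeout field")
    # Clear the old timeout field, then write the new timeout value.
    ctrl_value = (ctrl_value & ~WDT_TIMEOUT_MASK) | timeout
    # Setting the enable bit starts the countdown.
    return ctrl_value | WDT_ENABLE_BIT
```

In real firmware the returned value would be written to the memory-mapped watchdog control register; here it is simply returned so that the bit manipulation can be inspected.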

[0017] Preferably, in step 4, the operating system accesses the hardware status manager by loading a driver, in order to send the watchdog feed command and reset the retry count value;

[0018] In step 4, after the operating system sends the feed command for the first time, it continues to send the feed command to the hardware watchdog through a system service at a fixed period that is shorter than the predetermined timeout.
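Why the feed period must be shorter than the timeout can be seen in a small simulation. The class and function names below are hypothetical, and time is modeled as integer ticks rather than real hardware time.

```python
class SimulatedWatchdog:
    """Toy model of a hardware watchdog that counts ticks since the last feed."""

    def __init__(self, timeout: int) -> None:
        self.timeout = timeout
        self.elapsed = 0
        self.reset_fired = False

    def tick(self) -> None:
        # One unit of time passes; fire the reset once the timeout is reached.
        self.elapsed += 1
        if self.elapsed >= self.timeout:
            self.reset_fired = True

    def feed(self) -> None:
        # A feed command restarts the countdown.
        self.elapsed = 0


def run_feed_service(timeout: int, feed_period: int, duration: int) -> bool:
    """Return True if the watchdog fired a reset during the simulated run."""
    wdt = SimulatedWatchdog(timeout)
    for t in range(1, duration + 1):
        wdt.tick()
        if t % feed_period == 0:
            wdt.feed()
    return wdt.reset_fired
```

With a timeout of 10 ticks, a feed period of 4 keeps the system alive, while a feed period of 12 lets the watchdog expire before the first feed arrives.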

[0019] Preferably, step 1 further includes:

[0020] Sub-step 1.1: The Basic Input / Output System (BIOS) determines, based on a preset hardware status manager base address, a first register address for obtaining the health status data, a second register address for obtaining the current boot target ID, and a third register address for obtaining the retry count value. The first register address, the second register address, and the third register address are all calculated using the following formula:

[0021] $\mathrm{Addr}_i = A_{base} + \mathrm{Offset}_i$,

[0022] where $\mathrm{Addr}_i$ is the address of the first, second, or third register, $A_{base}$ is the base address of the hardware status manager, and $\mathrm{Offset}_i$ is the preset address offset corresponding to the health status data, the current boot target ID, or the retry count value;

[0023] In sub-step 1.2, the BIOS accesses the first register address determined in sub-step 1.1, reads the original health status value, and determines whether the full failure condition is met based on the total number of local storage devices. The formula for determining the full failure condition is as follows:

[0024] $C_{AF} = (V_{HS} == 2^{N} - 1)$,

[0025] where $C_{AF}$ is the Boolean result of the full failure condition determination, $V_{HS}$ is the health status data, and $N$ is the total number of local storage devices;

[0026] In sub-step 1.3, after obtaining the health status data from sub-step 1.2, the BIOS accesses the second and third register addresses determined in sub-step 1.1 to obtain the current boot target ID and the retry count value. The BIOS then determines the validity of the current boot target ID based on the total number of local storage devices and the network boot item ID. The validity determination formula is as follows:

[0027] $C_{IDV} = (0 \le D_{ID} < N) \lor (D_{ID} == ID_{PXE})$,

[0028] where $C_{IDV}$ is the Boolean result of the validity determination for the current boot target ID, $D_{ID}$ is the current boot target ID, $N$ is the total number of local storage devices, and $ID_{PXE}$ is the preset network boot item ID.

[0029] Preferably, step 2 further includes:

[0030] Sub-step 2.1: The Basic Input / Output System (BIOS) uses the current boot target ID obtained in step 1 as the initial index to perform a cyclic search within a preset set of local device IDs, and checks whether the local storage device corresponding to the initial index is functioning correctly using the following health determination formula:

[0031] $C_{H} = (V_{HS} \,\&\, (1 \ll D_{ID})) == 0$,

[0032] where $C_{H}$ is the Boolean result of the health determination formula, $V_{HS}$ is the health status data, and $D_{ID}$ is the current boot target ID;

[0033] If the Boolean result is true, the local storage device is determined to be normal.

[0034] Sub-step 2.2: If the Boolean result of the health determination formula in sub-step 2.1 is false, the BIOS updates the current index value using an index increment formula to check the next local storage device, until the local device ID set has been traversed. The index increment formula is:

[0035] $ID_{next} = (ID_{current} + 1) \bmod N$,

[0036] where $ID_{next}$ is the next index value after the update, $ID_{current}$ is the current index value used for the health determination, and $N$ is the total number of local storage devices;

[0037] Sub-step 2.3: After the cyclic search in sub-step 2.2 is completed, if one or more local storage devices yield a true Boolean result, the first local storage device determined to be normal is identified as the device to be started this time.

[0038] If the Boolean result of all devices in the local device ID set is false, the preset network startup item is determined as the startup device for this time, and the current startup target ID in the hardware status manager is updated with the new ID of the determined startup device for this time.

[0039] Preferably, step 3 further includes:

[0040] Sub-step 3.1: The Basic Input / Output System (BIOS) first performs a fault threshold determination, which compares the retry count value obtained in step 1 with a preset fault threshold. The formula for determining the fault threshold is:

[0041] $C_{FT} = (V_{RC} \ge T_{F})$,

[0042] where $C_{FT}$ is the Boolean result of the fault threshold determination, $V_{RC}$ is the retry count value, and $T_{F}$ is the preset fault threshold;

[0043] Sub-step 3.2: If the fault threshold determination Boolean result of sub-step 3.1 is false, the BIOS updates the retry count value using a counting increment formula, writes the updated retry count value to the hardware status manager, and then activates the hardware watchdog. The counting increment formula is:

[0044] $V_{RC}^{new} = V_{RC} + 1$,

[0045] where $V_{RC}^{new}$ is the updated retry count value and $V_{RC}$ is the retry count value;

[0046] Sub-step 3.3: If the fault threshold determination Boolean result of sub-step 3.1 is true, the BIOS performs a status update operation, modifying the health status data obtained in step 1 using a bitwise OR operation formula, and marking the health status bit associated with the device being started in step 2 as faulty. The bitwise OR operation formula is as follows:

[0047] $V_{HS}^{new} = V_{HS} \mid (1 \ll D_{ID})$,

[0048] where $V_{HS}^{new}$ is the updated health status data, $V_{HS}$ is the health status data, and $D_{ID}$ is the current boot target ID.

[0049] Preferably, step 4 further includes:

[0050] Sub-step 4.1, the condition for successful loading of the operating system is determined by the fact that, within the predetermined timeout period, the dedicated driver has sent the "feed the watchdog" instruction to the hardware watchdog. The formula for determining this condition is:

[0051] $C_{S} = (T_{boot} < T_{WDT})$,

[0052] where $C_{S}$ is the Boolean result indicating successful loading of the operating system, $T_{boot}$ is the startup time consumed before the feed command is sent, and $T_{WDT}$ is the predetermined timeout set for the hardware watchdog;

[0053] Sub-step 4.2: If the Boolean decision result of sub-step 4.1 is true, the operating system immediately executes the operation of resetting the retry count value. The operation updates the retry count value associated with the device started in this step 2 to zero using a counter reset formula, which is:

[0054] $V_{RC}^{new} = 0$,

[0055] where $V_{RC}^{new}$ is the new retry count value after the reset, which replaces the original retry count value $V_{RC}$;

[0056] Sub-step 4.3: If the Boolean determination result of sub-step 4.1 is false, it indicates that the hardware watchdog timer has been triggered due to timeout. The trigger will cause the hardware watchdog timer to send the hardware reset signal to the computer system. The timeout trigger condition of the hardware watchdog timer is determined by the following formula:

[0057] $C_{TO} = (T_{elapsed} \ge T_{WDT}) \land (F_{feed} == 0)$,

[0058] where $C_{TO}$ is the Boolean result of the timeout trigger condition, $T_{elapsed}$ is the time elapsed since the hardware watchdog was activated, $T_{WDT}$ is the predetermined timeout set for the hardware watchdog, and $F_{feed}$ is a hardware flag indicating whether the feed command has been received.

[0059] Preferably, the hardware status manager is a radiation-hardened field-programmable gate array (FPGA), and the health status data, the current startup target ID, and the retry count value obtained in step 1 are all stored in the non-volatile memory unit integrated in the radiation-hardened FPGA.

[0060] Preferably, in step 4, after the computer system restarts due to the hardware reset signal issued by the hardware watchdog, the health status data and retry count value obtained by the BIOS when returning to execute step 1 are values that were updated before the hardware reset signal was triggered. These values are maintained during the hardware reset process because they are stored in the non-volatile storage medium of the hardware status manager.

[0061] This invention provides a fault-tolerant method for redundancy-free fault reconstruction of a spaceborne computer. It has the following beneficial effects:

[0062] 1. This invention employs a method where the BIOS accesses the hardware status manager to obtain health status data and retry count values of the local storage device, and then autonomously selects a normal device for startup. This achieves a fault-tolerant effect by dynamically avoiding faulty startup paths. Compared to existing technologies where the computer startup process highly depends on a single physical storage device, this invention solves the single-point failure risk of permanent system failure due to the damage of a single storage device.

[0063] 2. This invention employs an autonomous closed-loop control method in which a hardware watchdog triggers a system hardware reset after a startup timeout, and the Basic Input / Output System (BIOS) executes a new round of startup decisions based on the retry count and health status data updated before the reset. This achieves real-time fault recovery without human intervention. Compared to the redundant backup scheme in the prior art where the switching between primary and backup systems relies entirely on remote control commands from the ground center, this invention addresses the shortcomings of severely delayed fault handling response caused by long communication link delays.

[0064] 3. This invention adopts a method of automatically setting a preset network boot device as the boot device when all local storage devices are marked as faulty. This achieves the ultimate fault tolerance effect of falling back to network boot after the complete failure of local storage media. Compared with the prior art, which does not provide a mechanism to automatically switch to network to obtain boot image after local boot failure, this invention solves the problem that the system cannot cope with permanent failure of storage media due to the lack of onboard autonomous recovery capability when facing a complete physical failure of onboard storage devices.

Attached Figure Description

[0065] Figure 1 is a flowchart of the present invention;

[0066] Figure 2 is a system architecture diagram of the present invention;

[0067] Figure 3 is a schematic diagram of the bit field definition of the FPGA register group according to the present invention;

[0068] Figure 4 is a BIOS boot flowchart of the present invention;

[0069] Figure 5 is a flowchart of the FPGA startup item acquisition logic diagnosis process of the present invention.

Detailed Implementation

[0070] To enable those skilled in the art to understand the present invention, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings of the embodiments of the present invention. Obviously, the described embodiments are some, but not all, of the embodiments of the present invention. Other embodiments obtained by those skilled in the art based on the embodiments of the present invention without creative effort should fall within the scope of protection of the present invention.

[0071] The present invention will now be described in detail with reference to the accompanying drawings:

[0072] Example 1: Referring to Figure 1, this invention provides a fault-tolerant method for redundancy-free fault reconstruction of a spaceborne computer, comprising:

[0073] Step 1: After the system starts up, the BIOS accesses the hardware status manager to obtain health status data that records the health status of the local storage device, the current boot target ID that indicates the current boot target, and the retry count value associated with the current boot target ID;

[0074] Sub-step 1.1: The BIOS, based on the preset hardware status manager base address, determines the address of the first register used to obtain health status data, the address of the second register used to obtain the current boot target ID, and the address of the third register used to obtain the retry count value. The addresses of the first, second, and third registers are all calculated using the following formula:

[0075] $\mathrm{Addr}_i = A_{base} + \mathrm{Offset}_i$,

[0076] where $\mathrm{Addr}_i$ is the address of the first, second, or third register, $A_{base}$ is the base address of the hardware status manager, and $\mathrm{Offset}_i$ is the preset address offset corresponding to the health status data, the current boot target ID, or the retry count value;
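The address calculation of sub-step 1.1 can be sketched as follows. The base address `HSM_BASE` and the three offsets are hypothetical values chosen for illustration; the patent specifies neither.

```python
# Hypothetical values: the patent specifies neither the base address nor the offsets.
HSM_BASE = 0x4000_0000

OFFSETS = {
    "health_status": 0x00,   # first register
    "boot_target_id": 0x04,  # second register
    "retry_count": 0x08,     # third register
}


def register_address(base: int, offset: int) -> int:
    """Addr_i = A_base + Offset_i"""
    return base + offset


# Resolve all three register addresses from the single base address.
ADDRESSES = {name: register_address(HSM_BASE, off) for name, off in OFFSETS.items()}
```

Deriving every address from one base means that only `HSM_BASE` has to change if the hardware status manager is mapped elsewhere.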

[0077] In sub-step 1.2, the BIOS accesses the first register address determined in sub-step 1.1, reads the original health status value, and determines whether the full failure condition is met based on the total number of local storage devices. The formula for determining the full failure condition is as follows:

[0078] $C_{AF} = (V_{HS} == 2^{N} - 1)$,

[0079] where $C_{AF}$ is the Boolean result of the full failure condition determination, $V_{HS}$ is the health status data, and $N$ is the total number of local storage devices;
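As a worked instance of the full failure condition: with N = 4 devices, 2^4 - 1 = 15 = 1111b, so the condition holds only when all four health bits are 1 (faulty). A minimal sketch, with a hypothetical function name:

```python
def all_faulty(health_status: int, num_devices: int) -> bool:
    """C_AF = (V_HS == 2^N - 1): true only if every one of the N health bits is 1 (faulty)."""
    return health_status == (1 << num_devices) - 1
```

For example, `all_faulty(0b1111, 4)` is true, while `all_faulty(0b0111, 4)` is false because the device on bit 3 is still marked normal.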

[0080] In sub-step 1.3, after obtaining the health status data from sub-step 1.2, the BIOS accesses the second and third register addresses determined in sub-step 1.1 to obtain the current boot target ID and retry count. It then determines the validity of the current boot target ID based on the total number of local storage devices and the network boot item ID. The validity determination formula is as follows:

[0081] $C_{IDV} = (0 \le D_{ID} < N) \lor (D_{ID} == ID_{PXE})$,

[0082] where $C_{IDV}$ is the Boolean result of the validity determination for the current boot target ID, $D_{ID}$ is the current boot target ID, $N$ is the total number of local storage devices, and $ID_{PXE}$ is the preset network boot item ID;
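The validity check of sub-step 1.3 can be sketched as below. The value of `PXE_BOOT_ID` is an assumption, since the patent only says that a preset network boot item ID exists; the patent also does not state what the BIOS does when the check fails, so the sketch only reports the result.

```python
PXE_BOOT_ID = 0xFF  # hypothetical value for the preset network boot item ID


def boot_target_id_valid(target_id: int, num_devices: int, pxe_id: int = PXE_BOOT_ID) -> bool:
    """C_IDV = (0 <= D_ID < N) or (D_ID == ID_PXE)"""
    return (0 <= target_id < num_devices) or (target_id == pxe_id)
```

A check of this kind guards against a corrupted boot target register: with four devices, IDs 0 to 3 and the network boot ID pass, and any other value is rejected.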

[0083] Step 2: The BIOS determines the boot device based on the health status data and the current boot target ID obtained in Step 1. The determination process involves selecting the local storage device marked as normal in the health status data, or setting the preset network boot item as the boot device when all local storage devices are marked as faulty, and updating the current boot target ID in the hardware status manager with the ID corresponding to the boot device.

[0084] Sub-step 2.1: The Basic Input / Output System (BIOS) uses the current boot target ID obtained in step 1 as the initial index to perform a loop search within the preset set of local device IDs, and checks whether the local storage device corresponding to the initial index is functioning correctly using the following health judgment formula:

[0085] $C_{H} = (V_{HS} \,\&\, (1 \ll D_{ID})) == 0$,

[0086] where $C_{H}$ is the Boolean result of the health determination formula, $V_{HS}$ is the health status data, and $D_{ID}$ is the current boot target ID;

[0087] If the Boolean result is true, the local storage device is considered to be functioning correctly.

[0088] In sub-step 2.2, if the Boolean result of the health determination formula in sub-step 2.1 is false, the BIOS updates the current index value using an index increment formula to check the next local storage device, until the local device ID set has been traversed. The index increment formula is:

[0089] $ID_{next} = (ID_{current} + 1) \bmod N$,

[0090] where $ID_{next}$ is the next index value after the update, $ID_{current}$ is the current index value used for the health determination, and $N$ is the total number of local storage devices;

[0091] Sub-step 2.3: After the cyclic search in sub-step 2.2 is completed, if one or more local storage devices yield a true Boolean result, the first local storage device judged to be normal is determined as the device to be started this time.

[0092] If the Boolean result of all devices in the local device ID set is false, the preset network boot item is determined as the boot device for this time, and the current boot target ID in the hardware status manager is updated with the new ID of the determined boot device for this time.
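The selection logic of Step 2 (sub-steps 2.1 to 2.3) can be sketched as a cyclic scan over the health bitmask. The wrap-around with modulo N is an assumption consistent with the "cyclic search" wording; `PXE_BOOT_ID` and the choice to start from index 0 when the stored ID is not a local device are also assumptions.

```python
PXE_BOOT_ID = 0xFF  # hypothetical value for the preset network boot item ID


def is_healthy(health_status: int, device_id: int) -> bool:
    """C_H = (V_HS & (1 << D_ID)) == 0: a cleared bit means the device is normal."""
    return (health_status & (1 << device_id)) == 0


def select_boot_target(health_status: int, current_id: int, num_devices: int,
                       pxe_id: int = PXE_BOOT_ID) -> int:
    """Return the first healthy local device found from current_id onward, else pxe_id."""
    # Assumption: if the stored ID is not a local device, start scanning at 0.
    index = current_id if 0 <= current_id < num_devices else 0
    for _ in range(num_devices):
        if is_healthy(health_status, index):
            return index
        index = (index + 1) % num_devices  # move to the next device, wrapping around
    return pxe_id  # every local device is marked faulty: fall back to network boot
```

For instance, with health status 0001b and a stored target of 0, device 0 is skipped and device 1 is selected; with 1111b the function returns the network boot ID.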

[0093] Step 3: Before the operating system is loaded on the device determined in Step 2, the Basic Input / Output System (BIOS) determines whether the retry count value obtained in Step 1 has reached the preset fault threshold. If it has not reached the threshold, the retry count value is incremented and updated to the Hardware Status Manager, and then the hardware watchdog built into the Hardware Status Manager is activated.

[0094] Sub-step 3.1: The Basic Input / Output System (BIOS) first performs a fault threshold determination, comparing the retry count obtained in step 1 with a preset fault threshold. The formula for determining the fault threshold is:

[0095] $C_{FT} = (V_{RC} \ge T_{F})$,

[0096] where $C_{FT}$ is the Boolean result of the fault threshold determination, $V_{RC}$ is the retry count value, and $T_{F}$ is the preset fault threshold;

[0097] In sub-step 3.2, if the Boolean result of the fault threshold determination in sub-step 3.1 is false, the BIOS updates the retry count value using the incrementing formula and writes the updated retry count value to the hardware status manager. Then, the hardware watchdog is activated. The incrementing formula is:

[0098] $V_{RC}^{new} = V_{RC} + 1$,

[0099] where $V_{RC}^{new}$ is the updated retry count value and $V_{RC}$ is the retry count value;

[0100] Sub-step 3.3: If the Boolean result of the fault threshold determination in sub-step 3.1 is true, the BIOS performs a status update operation, modifying the health status data obtained in step 1 using a bitwise OR operation formula, and marking the health status bit associated with the device being booted in step 2 as faulty. The bitwise OR operation formula is as follows:

[0101] $V_{HS}^{new} = V_{HS} \mid (1 \ll D_{ID})$,

[0102] where $V_{HS}^{new}$ is the updated health status data, $V_{HS}$ is the health status data, and $D_{ID}$ is the current boot target ID;
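Step 3 as a whole can be sketched as one function over the persisted state. The dictionary keys and the function name are hypothetical; the default threshold of 3 is the value used in Example 2. The sketch covers local device IDs only, because the patent does not describe how the network boot ID is handled at this step.

```python
FAULT_THRESHOLD = 3  # T_F; 3 is the value used in Example 2


def step3(state: dict, threshold: int = FAULT_THRESHOLD) -> bool:
    """Return True if a boot attempt should proceed, False if the device was marked faulty."""
    if state["retry_count"] >= threshold:
        # Sub-step 3.3: set the health bit of the current target to 1 (faulty).
        state["health_status"] |= 1 << state["boot_target_id"]
        return False  # the caller returns to Step 2 to pick the next device
    # Sub-step 3.2: count this attempt before it starts, then arm the watchdog.
    state["retry_count"] += 1
    state["watchdog_enabled"] = True
    return True
```

Incrementing the counter before the boot attempt matters: if the attempt hangs and the watchdog resets the system, the attempt has already been recorded in the persisted state.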

[0103] Step 4: After activating the hardware watchdog in Step 3, if the operating system successfully loads within the predetermined timeout period, the operating system sends a feed command to the hardware watchdog and resets the retry count value associated with the device being started in the hardware status manager; if the hardware watchdog is triggered due to timeout, the hardware watchdog sends a hardware reset signal to the computer system, and the hardware reset signal causes the computer system to restart.

[0104] Sub-step 4.1, the condition for successful operating system loading is that the dedicated driver has sent a "feed the watchdog" command to the hardware watchdog within the predetermined timeout period. The formula for determining this condition is:

[0105] $C_{S} = (T_{boot} < T_{WDT})$,

[0106] where $C_{S}$ is the Boolean result indicating successful loading of the operating system, $T_{boot}$ is the startup time consumed before the feed command is sent, and $T_{WDT}$ is the predetermined timeout set for the hardware watchdog;

[0107] In sub-step 4.2, if the Boolean decision result of sub-step 4.1 is true, the operating system immediately executes the operation of resetting the retry count value. This operation updates the retry count value associated with the device being started in step 2 to zero using the counter reset formula:

[0108] $V_{RC}^{new} = 0$,

[0109] where $V_{RC}^{new}$ is the new retry count value after the reset, which replaces the original retry count value $V_{RC}$;

[0110] In sub-step 4.3, if the Boolean decision result of sub-step 4.1 is false, it indicates that the hardware watchdog timer has been triggered due to a timeout. The trigger will cause the hardware watchdog timer to send a hardware reset signal to the computer system. The timeout trigger condition of the hardware watchdog timer is determined by the following formula:

[0111] $C_{TO} = (T_{elapsed} \ge T_{WDT}) \land (F_{feed} == 0)$,

[0112] where $C_{TO}$ is the Boolean result of the timeout trigger condition, $T_{elapsed}$ is the time elapsed since the hardware watchdog was activated, $T_{WDT}$ is the predetermined timeout set for the hardware watchdog, and $F_{feed}$ is a hardware flag indicating whether the feed command has been received.
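The two conditions of Step 4 and their outcomes can be sketched as follows. Times are plain numbers in arbitrary units, and the function names and outcome strings are hypothetical.

```python
def os_load_succeeded(t_boot: float, t_wdt: float) -> bool:
    """C_S = (T_boot < T_WDT): the first feed command arrived before the timeout."""
    return t_boot < t_wdt


def watchdog_timed_out(t_elapsed: float, t_wdt: float, feed_flag: int) -> bool:
    """C_TO = (T_elapsed >= T_WDT) and (F_feed == 0)"""
    return t_elapsed >= t_wdt and feed_flag == 0


def step4(t_boot: float, t_wdt: float, retry_count: int):
    """Return (outcome, retry_count) for one boot attempt."""
    if os_load_succeeded(t_boot, t_wdt):
        return "running", 0  # sub-step 4.2: the retry count is reset to zero
    # Sub-step 4.3: no feed arrived in time, so the watchdog resets the system.
    # The retry count is left as it was and survives the reset.
    return "hardware_reset", retry_count
```

With a timeout of 30, a boot that feeds the watchdog after 12 time units succeeds and clears the counter; a boot that would need 45 is cut off by the reset and the counter keeps its value.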

[0113] The benefit of Step 1 is establishing a hardware-level root of trust in the system's state, providing a reliable basis for autonomous decision-making. All critical state information required for system startup, including the health status of local storage devices, the current priority startup target, and the number of startup attempts for that target, is centrally acquired through a dedicated hardware state manager. By entrusting persistent storage and management of data to hardware, the uncertainty and volatility of the software layer are fundamentally isolated. This ensures that regardless of the system's failures, the acquired state information remains accurate and consistent, providing a reliable data foundation for the correct execution of all subsequent fault-tolerant decisions.

[0114] Step 2 enables dynamic and intelligent boot path planning, proactively avoiding known failures. Its advantage lies in giving the BIOS (Basic Input / Output System) the ability to make autonomous decisions based on real-time health data. Instead of following a statically fixed boot sequence, it uses a health blueprint provided by the hardware status manager as a guide, starting with the currently recorded boot target and intelligently and cyclically searching for and selecting devices marked as "normal" as the boot path. This dynamic path selection mechanism ensures the system proactively bypasses known faulty devices, avoiding wasted time on failed paths and significantly improving boot success rate and efficiency. Importantly, in the extreme case where all local devices fail, it provides a final fallback path to network boot mode.

[0115] Step 3 introduces a timeout monitoring mechanism with limited attempts, preventing the system from entering an infinite restart loop. By combining an incremental retry count with a hardware watchdog timer, a strict execution time window with a failure limit is set for the boot process. The hardware watchdog ensures that no boot attempt is suspended indefinitely; a timeout forces a system restart, maintaining system liveness. The retry count prevents the system from endlessly retrying devices with intermittent or permanent faults. When the number of attempts reaches a preset fault threshold, the device is officially marked as faulty and completely excluded from subsequent boot decisions, which is crucial for achieving fault isolation and system "learning" capabilities.

[0116] Step 4 forms a closed-loop fault-tolerant control process, ensuring the ultimate realization of system recovery or reconstruction. This step is the endpoint and closing point of the fault-tolerant method, defining the final handling logic for both successful and failed scenarios. If the operating system loads successfully, "feeding the watchdog" removes monitoring and "clears" the retry count, representing a successful recovery and the system returning to stable operation. If the watchdog times out and triggers a hardware reset, control is returned to Step 1. However, the updated retry count or health status data in the hardware status manager will guide the next startup attempt to a different path, ensuring that each failed attempt accumulates information for the next successful attempt, until a usable startup path is found or all possibilities are exhausted before switching to network repair, guaranteeing the final convergence of the fault-tolerant process.
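The closed loop formed by Steps 1 to 4 can be illustrated with a small simulation. It is a sketch under several assumptions: retry counters are kept per device, the set `bootable` stands in for "the operating system feeds the watchdog in time", network boot is assumed to succeed, and `PXE_BOOT_ID`, `NUM_DEVICES`, and the state layout are hypothetical.

```python
PXE_BOOT_ID = 0xFF   # hypothetical network boot item ID
NUM_DEVICES = 4
FAULT_THRESHOLD = 3


def boot_until_success(bootable: set, state: dict) -> list:
    """Simulate resets until a boot succeeds; return the sequence of attempted targets."""
    attempts = []
    while True:  # each iteration models one pass after power-on or a watchdog reset
        # Step 2: cyclic search for a healthy device, starting at the stored target.
        start = state["target"] if state["target"] < NUM_DEVICES else 0
        target = PXE_BOOT_ID
        for k in range(NUM_DEVICES):
            candidate = (start + k) % NUM_DEVICES
            if state["health"] & (1 << candidate) == 0:
                target = candidate
                break
        state["target"] = target
        if target == PXE_BOOT_ID:
            attempts.append(target)
            return attempts  # fallback to network boot (assumed to succeed here)
        # Step 3: mark the device faulty once its retry budget is used up.
        if state["retry"][target] >= FAULT_THRESHOLD:
            state["health"] |= 1 << target
            continue
        state["retry"][target] += 1
        attempts.append(target)
        # Step 4: success resets the counter; failure means a watchdog reset.
        if target in bootable:
            state["retry"][target] = 0
            return attempts
```

Starting from a healthy state with only device 0 unbootable, the simulation tries device 0 three times, marks it faulty, and then boots from device 1. If no local device is bootable, it exhausts three attempts on each of the four devices and ends on the network boot ID, which shows the convergence described above.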

[0117] Example 2: Autonomous Recovery from Boot Disk Failure Caused by a Single-Event Upset

[0118] This embodiment simulates a scenario in which the main boot storage device of an on-orbit spaceborne computer suffers data corruption due to a single-event effect in space, demonstrating how the method of this invention automatically completes fault isolation and boot path switching.

[0119] In the initial state of this embodiment, the onboard computer is configured with four local storage devices, labeled Disk1, Disk2, Disk3, and Disk4. The hardware status manager is a radiation-hardened FPGA, and its internal non-volatile memory contains:

[0120] The value of the health status data register is 0000b, which means that all four groups of devices are healthy (by convention, the least significant bit corresponds to Disk1, the most significant bit corresponds to Disk4, 0 indicates normal, and 1 indicates fault).

[0121] The current value of the boot target ID register is 0, pointing to Disk1;

[0122] The retry count value associated with each device is 0;

[0123] The preset fault threshold T_F is set to 3;
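The bit convention of the health status data register can be illustrated with a short decoding helper. The function name `decode_health` and the returned dictionary layout are assumptions made for illustration; only the bit assignment (least significant bit for Disk1, 0 for normal, 1 for fault) comes from the text.

```python
def decode_health(register: int, num_devices: int = 4) -> dict:
    """Map each disk label to 'normal' or 'fault' according to its bit."""
    return {
        f"Disk{i + 1}": "fault" if (register >> i) & 1 else "normal"
        for i in range(num_devices)
    }


print(decode_health(0b0001))
# {'Disk1': 'fault', 'Disk2': 'normal', 'Disk3': 'normal', 'Disk4': 'normal'}
```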

[0124] The specific fault recovery process is as follows:

[0125] Fault occurrence and initial boot failure: During on-orbit operation of the spacecraft, a high-energy particle penetrates the shielding layer and strikes the storage controller associated with Disk1, causing a single-event upset (SEU) that corrupts critical boot data in its master boot record. The system begins executing the method of this invention after a routine reboot.

[0126] Step 1: The BIOS accesses the FPGA and obtains the health status data as 0000b, the current boot target ID is 0, and the retry count associated with ID0 is 0;

[0127] Step 2: Based on the acquired data, the BIOS determines that the boot device is Disk1.

[0128] Step 3: The BIOS determines that the retry count value of Disk1 is 0, which has not reached the preset fault threshold 3. Therefore, the BIOS increments the retry count value of Disk1 to 1 and updates it to the FPGA, then activates the hardware watchdog built into the FPGA and sets the predetermined timeout time to T_WDT;

[0129] The BIOS attempts to load the operating system from Disk1, but because the boot data is corrupted, the loading operation cannot be completed within T_WDT.

[0130] Step 4: The hardware watchdog timer triggers due to timeout, sending a hardware reset signal to the computer system, and the system restarts;

[0131] Repeated attempts and fault confirmation: Because the data in the FPGA is retained across the reset, the BIOS repeats the above process on each of the next two restarts, updating the retry count value of Disk1 to 2 and then to 3. Each of these attempts again ends in a watchdog timeout.

[0132] On the restart following the third timeout:

[0133] Step 1: The BIOS obtains from the FPGA that the retry count value of Disk1 is 3;

[0134] Step 3: The BIOS determines that the retry count value of 3 has reached the preset fault threshold of 3. At this time, the BIOS performs a status update operation, changing the value of the health status data register from 0000b to 0001b, and marking the corresponding health status bit of Disk1 as faulty through a bitwise OR operation.

[0135] Step 2: After marking Disk1 as faulty, the BIOS returns to the boot device determination step. Starting with the current boot target ID0, it checks the health status data 0001b. Finding ID0 faulty, it checks ID1 by incrementing the index. Disk2 is found to be healthy. Therefore, the BIOS determines Disk2 as the new boot device and updates the current boot target ID register in the FPGA with Disk2's ID1.

[0136] Successful switchover and system recovery:

[0137] Step 3: The BIOS begins preparing to load the operating system from Disk2. At this point, the retry count of Disk2 is 0, which has not reached the threshold. The BIOS will increment it to 1 and update it to the FPGA, and then activate the hardware watchdog.

[0138] The BIOS successfully loads the operating system from Disk2.

[0139] Step 4: After the operating system loads, the dedicated driver immediately sends a "feed the watchdog" command to the FPGA's hardware watchdog to disable timeout monitoring. The driver then performs a reset operation, clearing the retry count associated with Disk2 in the FPGA from 1 to zero.

[0140] Thus, the method of this invention autonomously diagnoses and isolates the fault in Disk1 without human intervention, and successfully switches to the healthy Disk2 for startup, ensuring the continuous availability of the onboard computer. The fault status of Disk1 is persistently recorded in the FPGA, and this device will be skipped directly in subsequent system restarts.
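The sequence in this example can be replayed with a small simulation. It is only a sketch under stated assumptions: the function name `simulate` is hypothetical, the `corrupted` set stands in for devices whose boot attempt always ends in a watchdog timeout, and the network boot path is treated as a simple hand-over without its own monitoring.

```python
NETWORK_BOOT_ID = 0xFF   # assumed ID of the network boot entry
FAULT_THRESHOLD = 3      # preset fault threshold used in this example
NUM_DEVICES = 4          # Disk1..Disk4 map to IDs 0..3


def simulate(corrupted, health=0b0000, target=0):
    """Replay boot cycles until one succeeds or network boot is selected.

    Returns (booted_id, health_register, watchdog_resets)."""
    retries = [0] * NUM_DEVICES  # models counts kept in non-volatile storage
    resets = 0
    while True:
        # Step 2: cyclic search for a device marked normal.
        start = target if 0 <= target < NUM_DEVICES else 0
        target = NETWORK_BOOT_ID
        for offset in range(NUM_DEVICES):
            candidate = (start + offset) % NUM_DEVICES
            if not (health >> candidate) & 1:
                target = candidate
                break
        if target == NETWORK_BOOT_ID:
            return target, health, resets  # hand over to network boot
        # Step 3: threshold check, then count the attempt.
        if retries[target] >= FAULT_THRESHOLD:
            health |= 1 << target          # mark faulty, return to Step 2
            continue
        retries[target] += 1
        # Step 4: a corrupted device never feeds the watchdog in time.
        if target in corrupted:
            resets += 1                    # timeout, hardware reset, Step 1
            continue
        retries[target] = 0                # driver feeds watchdog, clears count
        return target, health, resets


booted, health, resets = simulate(corrupted={0})
print(booted, format(health, "04b"), resets)  # 1 0001 3
```

With only Disk1 corrupted, the run ends on ID 1 (Disk2) after three watchdog resets and leaves the health register at 0001b, matching the walkthrough above.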

[0141] Example 3: PXE fallback and remote repair under total disk failure

[0142] This embodiment simulates an extreme failure scenario in which all local storage devices permanently fail due to long-term cumulative damage from space radiation, demonstrating how the present invention can utilize network booting capabilities to achieve space-to-ground collaborative repair.

[0143] In the initial state of this embodiment, all four local storage devices Disk1 to Disk4 of the onboard computer are physically damaged and unable to boot. In the hardware status manager:

[0144] The health status data register has a value of 1111b, indicating that all devices have been marked as faulty.

[0145] The default network boot entry is defined with a specific ID, such as 0xFF, which points to the onboard radiation-hardened network interface controller.

[0146] The specific fault recovery process is as follows:

[0147] Local boot failure and PXE activation:

[0148] Step 1: The system starts up, the BIOS accesses the FPGA, and obtains the health status data as 1111b;

[0149] Step 2: Based on health status data 1111b, the BIOS performs a loop search within the local device ID set {0,1,2,3} and finds that all local storage devices are marked as faulty. At this point, the BIOS triggers the condition that "all local storage devices are marked as faulty" and determines the preset network boot entry as the boot device for this operation. The BIOS then updates the current boot target ID register in the FPGA with the specific ID 0xFF of the network boot entry.

[0150] Network booting and remote image loading:

[0151] Based on the current boot target ID 0xFF, the BIOS switches to the Preboot Execution Environment (PXE) boot process;

[0152] The BIOS initializes the radiation-hardened network interface controller and broadcasts a DHCP discovery request via the onboard communication link.

[0153] The ground control center or another healthy node on the satellite responds to the request as a DHCP server, assigns a temporary IP address to the faulty computer, and informs it of the address of the TFTP server, which contains a dedicated network repair boot program.

[0154] The BIOS downloads the network repair bootloader from a specified address via TFTP, loads it into memory, and executes it.
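As an illustration of the download step, the TFTP read request that opens such a transfer has a simple layout defined in RFC 1350: a 2-byte opcode of 1, the file name, a zero byte, the transfer mode, and a final zero byte. The sketch below only builds that packet; the helper name `build_tftp_rrq` and the file name `repair_boot.efi` are hypothetical, and the source does not specify how the BIOS implements the transfer.

```python
import struct


def build_tftp_rrq(filename: str, mode: str = "octet") -> bytes:
    """Build a TFTP read request (RRQ, opcode 1) per RFC 1350."""
    return (
        struct.pack("!H", 1)        # opcode in network byte order
        + filename.encode("ascii")  # requested file name
        + b"\x00"
        + mode.encode("ascii")      # "octet" selects binary transfer
        + b"\x00"
    )


packet = build_tftp_rrq("repair_boot.efi")  # hypothetical file name
print(packet)  # b'\x00\x01repair_boot.efi\x00octet\x00'
```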

[0155] Remote repair and system regeneration:

[0156] After the network repair bootloader takes over control, it will establish a stable communication connection with the ground control center.

[0157] Ground maintenance personnel can use this connection to send lightweight diagnostic operating system images or toolsets for repairing storage devices to the onboard computer.

[0158] This image or toolset runs in computer memory, allowing ground personnel to perform in-depth status diagnostics on Disk1 through Disk4, attempt low-level formatting, or rewrite the operating system on disks with remaining intact areas.

[0159] Although embodiments of the invention have been shown and described, it will be understood by those skilled in the art that various changes, modifications, substitutions and alterations can be made to these embodiments without departing from the principles and spirit of the invention, the scope of which is defined by the appended claims and their equivalents.

Claims

1. A fault-tolerant method for redundancy-free fault reconstruction of a spaceborne computer, characterized in that it includes:
Step 1: After the system starts up, the Basic Input / Output System (BIOS) accesses the hardware status manager to obtain health status data that records the health status of the local storage devices, the current boot target ID that indicates the current boot target, and the retry count value associated with the current boot target ID;
Step 2: The BIOS determines the device to be started based on the health status data and the current boot target ID obtained in Step 1. The determination process involves selecting a local storage device marked as normal in the health status data, or setting a preset network boot item as the device to be started when all local storage devices are marked as faulty, and updating the current boot target ID in the hardware status manager with the ID corresponding to the device to be started;
Step 3: Before the operating system is loaded from the device to be started as determined in Step 2, the BIOS determines whether the retry count value obtained in Step 1 has reached a preset fault threshold. If the retry count value has reached the preset fault threshold, the BIOS updates the health status data, marks the device to be started as faulty, and returns to Step 2 to determine the next device to be started. If the retry count value has not reached the preset fault threshold, the retry count value is incremented and written back to the hardware status manager, and then the hardware watchdog built into the hardware status manager is activated;
Step 4: After the hardware watchdog is activated in Step 3, if the operating system loads successfully within the predetermined timeout period, the operating system sends a feed command to the hardware watchdog and resets the retry count value associated with the device to be started in the hardware status manager; if the hardware watchdog is triggered due to timeout, the hardware watchdog sends a hardware reset signal to the computer system, and the hardware reset signal restarts the computer system.

2. The fault-tolerant method for redundancy-free fault reconstruction of a spaceborne computer according to claim 1, characterized in that, in step 2, setting the preset network boot item as the device to be started specifically includes: the BIOS updates the current boot target ID to a specific ID pointing to the radiation-hardened network interface controller, so as to perform network boot through the Preboot Execution Environment (PXE) protocol.

3. The fault-tolerant method for redundancy-free fault reconstruction of a spaceborne computer according to claim 1, characterized in that, In step 3, activating the hardware watchdog includes: the BIOS writing a preset timeout value to the watchdog control register in the hardware status manager, and setting the enable bit in the watchdog control register.

4. The fault-tolerant method for redundancy-free fault reconstruction of a spaceborne computer according to claim 1, characterized in that, in step 4, the operating system accesses the hardware status manager by loading a driver in order to send the feed command and reset the retry count value; and, in step 4, after the operating system sends the feed command for the first time, it continues to send the feed command to the hardware watchdog through a system service at a fixed period that is shorter than the predetermined timeout period.

5. The fault-tolerant method for redundancy-free fault reconstruction of a spaceborne computer according to claim 1, characterized in that, Step 1 further includes: The BIOS determines, based on a preset hardware status manager base address, a first register address for obtaining the health status data, a second register address for obtaining the current boot target ID, and a third register address for obtaining the retry count value. The BIOS accesses the first register address, reads the original health status value, and determines whether the full failure condition is met based on the total number of local storage devices. After acquiring the health status data, the BIOS accesses the second register address and the third register address to obtain the current boot target ID and the retry count value, and determines the validity of the current boot target ID based on the total number of local storage devices and the network boot item ID.

6. The fault-tolerant method for redundancy-free fault reconstruction of a spaceborne computer according to claim 1, characterized in that Step 2 further includes: The BIOS uses the current boot target ID obtained in step 1 as the initial index to perform a loop search within a preset set of local device IDs, checking whether the local storage device corresponding to the initial index is normal. If the check result is abnormal, the BIOS updates the current index value to check the next local storage device, until the set of local device IDs has been traversed. After the loop search is completed, if a local storage device has been determined to be normal, the first local storage device determined to be normal is identified as the device to be started. If none of the devices in the set of local device IDs is determined to be normal, the preset network boot item is determined as the device to be started. The ID of the device to be started is then used to update the current boot target ID in the hardware status manager.

7. The fault-tolerant method for redundancy-free fault reconstruction of a spaceborne computer according to claim 1, characterized in that Step 3 further includes: The BIOS first performs a fault threshold determination, comparing the retry count value obtained in step 1 with the preset fault threshold. If the result of the fault threshold determination is that the preset fault threshold has not been reached, the BIOS updates the retry count value, writes the updated retry count value into the hardware status manager, and then activates the hardware watchdog. If the result of the fault threshold determination is that the preset fault threshold has been reached, the BIOS performs a status update operation, modifying the health status data obtained in step 1 and marking the health status bit associated with the device to be started determined in step 2 as faulty.

8. The fault-tolerant method for redundancy-free fault reconstruction of a spaceborne computer according to claim 1, characterized in that Step 4 further includes: The condition for successful loading of the operating system is that the dedicated driver has sent the feed command to the hardware watchdog within the predetermined timeout period. If the condition is met, the operating system immediately resets the retry count value, updating the retry count value associated with the device to be started determined in step 2 to zero; if the condition is not met, the hardware watchdog has been triggered due to timeout, and the hardware watchdog sends a hardware reset signal to the computer system.

9. The fault-tolerant method for redundancy-free fault reconstruction of a spaceborne computer according to claim 1, characterized in that the hardware status manager is a radiation-hardened field-programmable gate array (FPGA). The health status data, the current boot target ID, and the retry count value obtained in step 1 are all stored in the non-volatile memory unit integrated in the radiation-hardened FPGA.

10. The fault-tolerant method for redundancy-free fault reconstruction of a spaceborne computer according to claim 1, characterized in that, in step 4, after the computer system restarts due to the hardware reset signal issued by the hardware watchdog, the health status data and retry count value obtained by the BIOS when returning to execute step 1 are values that were updated before the hardware reset signal was triggered. These values are maintained during the hardware reset process because they are stored in the non-volatile storage medium of the hardware status manager.