Onboard computer fault-tolerant storage architecture based on multi-disk cross-backup
The onboard computer's fault-tolerant storage architecture, which uses multiple disks for cross-backup, enables rapid fault switching and repair in high-energy particle radiation environments. This solves the problem of easily damaged storage media in onboard computers, improves the reliability and space utilization of the storage architecture, and meets the high-efficiency and autonomous requirements of space missions.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications (China)
- Current Assignee / Owner
- BEIJING ZHONGKE TIANSUAN TECHNOLOGY CO LTD
- Filing Date
- 2026-04-03
- Publication Date
- 2026-06-30
AI Technical Summary
The storage media of spaceborne computers are easily damaged in high-energy particle radiation environments, leading to system failures. Existing technical solutions are costly, wasteful of resources, or have lengthy recovery processes, which cannot meet the reliability and efficiency requirements of space missions.
The system adopts a fault-tolerant storage architecture for onboard computers based on multi-disk cross-backup. It achieves bidirectional mutual support through a fault-tolerant management module and a switching middleware. Any physical storage module can run as the primary boot disk and also serve as the backup and guardian disk for the other, using cross-backup data for rapid switching and repair.
It enables fault switching within seconds, improves the reliability of the storage architecture and the utilization of storage space, simplifies the recovery process, and meets the requirements of high efficiency, autonomy and reliability for aerospace missions.
Figure CN122309240A_ABST
Description
Technical Field
[0001] This application generally relates to the field of spaceborne solid-state storage technology. More specifically, this application relates to a fault-tolerant storage architecture for spaceborne computers based on multi-disk cross-backup.
Background Technology
[0002] Spaceborne computers operate for extended periods in the high-energy particle radiation environment of space. Their storage media (such as Flash and SSDs) are highly susceptible to single-event faults or multiple bit faults, which can corrupt the operating system bootloader, kernel, and critical data, leading to system boot failures or runtime crashes. Given the extreme reliability requirements of space missions and the impossibility of physical repairs, designing a storage architecture capable of autonomously handling storage failures is a key challenge in the field of aerospace electronics.
[0003] Currently, the main solutions and their inherent limitations are as follows:
Option 1: Hardware-level system triple modular redundancy (TMR)
Description: Three completely independent sets of computer hardware perform the same task, and a voter decides the output.
[0004] Drawbacks: This solution increases cost, weight, and power consumption by at least 200%, severely violating the stringent requirements of modern microsatellites and payload equipment for lightweight and low power consumption. Furthermore, this solution primarily protects against computational and logical errors, offering limited protection against simultaneous, unrelated soft errors occurring within the three storage devices, and it cannot address the system boot chain breakage caused by the overall damage of local blocks in the storage medium.
[0005] Option 2: Software verification and simple backup recovery
Description: Data verification (such as CRC or ECC) is added at the operating system level, and a simple backup partition is created on the disk to store the system image. When an error is detected, the system attempts to restore from the backup partition.
[0006] Limitations: First, software verification cannot correct complete damage at the storage block level. Second, and most critically, the backup partition itself lacks high reliability. In the space radiation environment, the backup partition also faces the risk of data corruption. Once the backup data fails, the entire recovery mechanism will completely collapse, becoming useless. Furthermore, the recovery process is typically lengthy and requires ground intervention or complex scripts, failing to meet the requirements of efficiency and autonomy for spaceborne systems.
[0007] Option 3: Full-disk real-time triple-redundant storage
Description: Upon writing, all critical data is synchronously written as three copies stored in different locations, and the copies are voted on in real time upon reading.
[0008] Drawbacks: This approach introduces significant write amplification and performance overhead. Each data write becomes three writes, which not only accelerates wear on the onboard storage medium, with its limited number of erase / write cycles, but also consumes valuable system bus and processor resources. For data such as operating system partitions, which are mostly read during on-orbit operation and only occasionally updated, real-time, full-domain triple redundancy is extremely inefficient and represents over-design and a waste of resources.
Summary of the Invention
[0009] To address at least one or more of the technical problems mentioned above, this application proposes a fault-tolerant storage architecture for onboard computers based on multi-disk cross-backup, as follows: The architecture includes a fault-tolerant management module, a switching middleware, and at least two physical storage modules. The at least two physical storage modules include a first physical storage module and a second physical storage module, which have the same partition structure and functional roles. The first physical storage module stores first image data formed by combining local EFI partition data and local Root partition data, and second image data formed by combining EFI partition data and Root partition data from the second physical storage module. The second physical storage module stores both the first image data and the second image data. The switching middleware has an input terminal electrically connected to the fault-tolerant management module and an output terminal electrically connected to the at least two physical storage modules. It switches the boot channel to the corresponding physical storage module in response to an instruction issued by the fault-tolerant management module.
[0010] In this system, two or more physical storage modules have identical partition structures and functional roles, with no master-slave distinction. Any disk can serve as the primary boot disk to run the system, while also acting as a "backup and guardian disk" for the others. When any disk fails, the healthy disk can both take over the operation and serve as a recovery source to repair the failed disk. This fundamentally solves the problem of system-wide paralysis caused by the overall failure of a single-disk architecture. This bidirectional repair capability is something that traditional master-slave hot standby solutions cannot provide, thus improving the reliability of repairs.
[0011] In some examples, the fault tolerance management module is also configured to send a switching command to the switching middleware in response to detecting a fault in the first physical storage module, switching the boot channel from the first physical storage module to the second physical storage module.
[0012] In some examples, the fault tolerance management module is also used to obtain the first image data from the second physical storage module, and to repair the first physical storage module based on the first image data.
[0013] In some examples, the fault tolerance management module is also configured to, in response to detecting a fault in the second physical storage module, obtain the second image data from the first physical storage module, and repair the second physical storage module based on the second image data.
[0014] In some examples, the first image data and the second image data are backed up and stored in a triple redundancy manner, forming multiple backup data corresponding to the first image data and the second image data, respectively, thereby improving the reliability of the backup data.
[0015] In some examples, the fault tolerance management module is also used to calculate the checksums of the multiple backup data corresponding to the first image data and the second image data respectively, and to determine, based on the checksums, whether the multiple backup data corresponding to the first image data and the second image data are complete, so as to ensure the integrity of the backup data.
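The integrity check described in [0014] and [0015] can be sketched as follows. This is a minimal illustration, not the applicant's implementation: the application does not name a checksum algorithm, so CRC-32 is assumed here, the three copies are held in memory, and `first_intact_copy` is a hypothetical helper name.

```python
import zlib
from typing import Optional, Sequence


def checksum(data: bytes) -> int:
    """CRC-32 of one backup copy (assumed algorithm; the source names none)."""
    return zlib.crc32(data) & 0xFFFFFFFF


def first_intact_copy(copies: Sequence[bytes], expected: int) -> Optional[bytes]:
    """Return the first copy whose checksum matches, or None if all are damaged."""
    for copy in copies:
        if checksum(copy) == expected:
            return copy
    return None


image = b"EFI+Root image of SSD-A"
expected = checksum(image)
damaged = b"EFI+Root imagX of SSD-A"   # one corrupted byte
copies = [damaged, image, image]       # triple-redundant storage, copy 0 is bad
recovered = first_intact_copy(copies, expected)
```

With one damaged copy out of three, the helper still returns a usable image; only when all three copies fail the check does recovery from this backup become impossible.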
[0016] In some examples, the fault tolerance management module is also used to set the first physical storage module or the second physical storage module to a read-only state, in which only the fault tolerance management module itself may write to it, when the first physical storage module or the second physical storage module is working normally. During normal operation, hardware write protection is provided for the cross-backup partition, blocking all write operations at the circuit level. Main system anomalies, radiation disturbances, and program errors cannot tamper with or damage the backup image, which keeps the recovery source intact and reliable. Only the fault tolerance management module has temporary repair write permissions; write protection is restored immediately upon completion of the repair, following the principle of least privilege. This satisfies fault repair needs while eliminating the risk of unauthorized writes to backup data, balancing reliability and maintainability.
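The "temporary repair write permission" in [0016] can be pictured with a small software model. The application describes circuit-level write protection; the sketch below only mimics its observable behavior, and `BackupPartition` and `repair_window` are hypothetical names introduced for illustration.

```python
from contextlib import contextmanager


class BackupPartition:
    """Software stand-in for a hardware write-protected backup partition."""

    def __init__(self) -> None:
        self.write_protected = True
        self.data = b""

    def write(self, payload: bytes) -> None:
        if self.write_protected:
            raise PermissionError("backup partition is write-protected")
        self.data = payload


@contextmanager
def repair_window(partition: BackupPartition):
    """Lift write protection for one repair, then always restore it."""
    partition.write_protected = False
    try:
        yield partition
    finally:
        partition.write_protected = True  # restored even if the repair fails


part = BackupPartition()
with repair_window(part) as writable:
    writable.write(b"repaired image")
```

The `finally` clause mirrors the least-privilege rule in the text: protection comes back as soon as the repair ends, whether or not it succeeded.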
[0017] In some examples, the checksum is backed up and stored in a triple-modular redundancy manner, which can improve the reliability of the checksum.
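One common way to read back a value stored in triple-modular redundancy is a majority vote. The application only states that the checksum is stored redundantly, so the voting step below is an assumption, and `vote` is a hypothetical helper.

```python
from collections import Counter
from typing import Optional, Sequence


def vote(replicas: Sequence[int]) -> Optional[int]:
    """Return the value held by a strict majority of replicas, else None."""
    if not replicas:
        return None
    value, count = Counter(replicas).most_common(1)[0]
    return value if count * 2 > len(replicas) else None


stored_checksums = [0x1A2B3C4D, 0x1A2B3C4D, 0x1A2B3C5D]  # third replica has a bit flip
trusted = vote(stored_checksums)
```

A single upset in one replica is outvoted by the other two; if all three replicas disagree, the sketch returns `None` rather than guessing.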
[0018] In some examples, the switching middleware is a multiplexer that enables physical-level startup path switching without relying on software, firmware, or logic circuits. This completely avoids startup switching failures caused by single-event upsets, program crashes, or system crashes, and provides fast fault switching response and strong radiation resistance.
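As a rough behavioral model of the multiplexer in [0018], the sketch below routes exactly one disk to the boot channel according to a select value. It is a software illustration of a hardware part; `BootMux` and its fields are hypothetical names.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class BootMux:
    """N:1 multiplexer model: the select value picks the disk on the boot channel."""

    disks: Sequence[str]
    select: int = 0  # default boot disk

    def switch_to(self, index: int) -> None:
        if not 0 <= index < len(self.disks):
            raise ValueError(f"no disk at select value {index}")
        self.select = index

    @property
    def active(self) -> str:
        return self.disks[self.select]


mux = BootMux(disks=["SSD-A", "SSD-B"])
default_disk = mux.active
mux.switch_to(1)          # switching command from the fault tolerance management module
after_switch = mux.active
```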
[0019] In some examples, the fault tolerance management module exists in the form of firmware and is deployed on the motherboard, controller, or logic circuit. This pushes the fault tolerance capability down to the hardware layer, a core design concept for highly reliable, efficient, and secure hardware systems.
[0020] Compared with the prior art, the present invention has the following beneficial effects: (1) It can completely eliminate single points of failure at the equipment level. Even if one physical storage module suffers irreversible physical damage, the other physical storage module still has a complete, immediately bootable system environment and stores a system backup of the failed disk. It can switch to the healthy disk within seconds, fundamentally solving the problem of system paralysis caused by the failure of the entire device in a single-disk architecture.
[0021] (2) Reciprocal design enables two-way mutual assistance. Multiple SSDs play equal roles, with any one of them serving as the primary boot drive to run the system while also acting as a "backup guardian" for the others. When any drive fails, the healthy drive can both take over the operation and serve as a recovery source to repair the failed drive. This two-way recovery capability is something that traditional master-slave hot standby solutions cannot provide.
[0022] (3) High storage space utilization. Compared to traditional dual-disk mirroring solutions that require twice the space (50% utilization), this architecture only retains system backups of the counterpart disk on each disk, rather than mirroring the entire disk data. Data partitions do not require duplicate storage and their capacity can be independently configured according to task requirements. Under typical configurations, storage utilization can reach 70%~90%, allowing more task data to be handled with the same hardware resources.
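The utilization figure in [0022] can be checked with simple arithmetic. The sketch assumes that utilization means the share of a disk left for the data partition after the EFI, Root, and backup partitions are set aside; that definition and all partition sizes below are illustrative assumptions, not values from the application.

```python
def utilization(disk_gb: float, efi_gb: float, root_gb: float, backup_gb: float) -> float:
    """Fraction of one disk available for task data (assumed definition)."""
    data_gb = disk_gb - efi_gb - root_gb - backup_gb
    if data_gb < 0:
        raise ValueError("system partitions exceed disk capacity")
    return data_gb / disk_gb


# Hypothetical 128 GB disk: 0.5 GB EFI, 7.5 GB Root, 24 GB backup partition.
cross_backup = utilization(128, 0.5, 7.5, 24)

# Full-disk mirroring keeps every byte twice, so half of the total capacity is usable.
full_mirror = 0.5
```

Under these assumed sizes the data partition takes 96 GB of 128 GB, which is 75% and falls inside the 70%~90% range stated in the text.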
[0023] (4) The recovery path is simple and reliable. This architecture requires only two basic recovery operations: "switchover startup" and "image writeback." It does not rely on complex data verification, incremental synchronization, or multi-mode decision-making. Its logic is simple, easy to verify, and suitable for the high-reliability software engineering requirements of aerospace.
Attached Figure Description
[0024] The above and other objects, features, and advantages of exemplary embodiments of this application will become readily understood by reading the following detailed description with reference to the accompanying drawings. In the drawings, several embodiments of this application are illustrated by way of example and not limitation, and the same or corresponding reference numerals denote the same or corresponding parts, wherein:
Figure 1 is a schematic diagram of an exemplary framework structure of the spaceborne computer fault-tolerant storage architecture based on multi-disk cross-backup provided in an embodiment of this application.
Detailed Implementation
[0025] The technical solutions of the embodiments of this application will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of this application. Based on the embodiments of this application, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of this application.
[0026] It should be understood that the terms "comprising" and "including" as used in the specification and claims of this application indicate the presence of the described features, integers, steps, operations, elements and / or components, but do not exclude the presence or addition of one or more other features, integers, steps, operations, elements, components and / or collections thereof.
[0027] It should also be understood that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to limit the application. As used in this specification and claims, the singular forms “a,” “an,” and “the” are intended to include the plural forms unless the context clearly indicates otherwise. It should also be understood that the term “and / or” as used in this specification and claims refers to any combination and all possible combinations of one or more of the associated listed items, and includes such combinations.
[0028] As used in this specification and claims, the term "if" may be interpreted, depending on the context, as "when," "once," "in response to determination," or "in response to detection." Similarly, the phrase "if determined" or "if [described condition or event] is detected" may be interpreted, depending on the context, as "once determined," "in response to determination," "once [described condition or event] is detected," or "in response to detection of [described condition or event]."
[0029] As shown in Figure 1, the fault-tolerant storage architecture for onboard computers based on multi-disk cross-backup provided in this application includes: a fault-tolerant management module, a switching middleware, and at least two physical storage modules, wherein the at least two physical storage modules include a first physical storage module and a second physical storage module.
[0030] The first physical storage module and the second physical storage module have the same partition structure and functional roles.
[0031] Specifically, both the first physical storage module and the second physical storage module include:
- The primary boot partition (EFI partition), which stores the bootloader used to boot from this disk;
- The main system root partition, which stores the operating system kernel, drivers, and core files required for this disk to run;
- The backup partition, which includes three sub-storage areas and is one of the core innovations. It stores two logically independent and complete backup image files;
- The data partition, which stores task data generated by the system running on this disk.
[0032] The first physical storage module contains backup storage of first image data formed by combining local EFI partition data and local Root partition data, and second image data formed by combining EFI partition data and Root partition data from the second physical storage module.
[0033] The second physical storage module contains backup storage for both the first and second image data.
[0034] Specifically, the first and second image data are compressed and encrypted data. The backup partition of the first physical storage module SSD-A stores the EFI partition data and root partition data of the second physical storage module SSD-B, and the backup partition of the second physical storage module SSD-B stores the EFI partition data and root partition data of the first physical storage module SSD-A, forming a bidirectional cross-backup chain. When there are more than two physical storage modules, a multi-directional cross-backup chain is formed. The partition structure and functional roles of the two or more physical storage modules are completely identical, with no master-slave distinction. Any disk can serve as the primary boot disk to run the system, while also acting as a "backup guardian disk" for the others. When any disk fails, the healthy disk can both take over operation and serve as a recovery source to repair the failed disk, fundamentally solving the problem of system paralysis caused by the overall failure of a single-disk architecture. This bidirectional repair capability is unavailable in traditional master-slave hot standby solutions and improves the reliability of repair. In the storage architecture provided by this invention, each physical storage module only stores the system data (EFI partition data and root partition data) of other physical storage modules, rather than a full disk data mirror. Task data within a data partition does not need to be stored repeatedly, and its capacity can be configured independently according to task requirements. Under typical configurations, storage utilization can reach 70%~90%, and more task data can be supported with the same hardware resources.
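The application does not specify how the "multi-directional cross-backup chain" is laid out when more than two modules are present. One possible layout, shown here purely as an assumption, is a ring in which each module's system image is held by the next module; `ring_backup_plan` is a hypothetical helper.

```python
from typing import Dict, Sequence


def ring_backup_plan(modules: Sequence[str]) -> Dict[str, str]:
    """Map each module to the module that stores its system image (ring layout)."""
    if len(modules) < 2:
        raise ValueError("cross-backup needs at least two modules")
    return {
        module: modules[(index + 1) % len(modules)]
        for index, module in enumerate(modules)
    }


two_disks = ring_backup_plan(["SSD-A", "SSD-B"])
three_disks = ring_backup_plan(["SSD-A", "SSD-B", "SSD-C"])
```

For two modules the ring reduces to the bidirectional chain described above: SSD-B holds the image of SSD-A and SSD-A holds the image of SSD-B.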
[0035] The switching middleware has an input terminal electrically connected to the fault-tolerant management module and an output terminal electrically connected to at least two of the physical storage modules, and is used to switch the startup channel to the corresponding physical storage module in response to the instruction issued by the fault-tolerant management module.
[0036] This embodiment takes the onboard computer of a low-orbit remote sensing satellite as an example to illustrate the working principle of the fault-tolerant storage architecture of the onboard computer based on multi-disk cross-backup provided by the present invention. The storage architecture adopts a dual solid-state drive (SSD) cross-backup architecture and boots from SSD-A by default.
[0037] Example 1: Basic Fault Tolerance Based on Cold Backup Switching
Step 1: Ground Initial Deployment
Install two aerospace-grade SATA SSDs (labeled SSD-A and SSD-B) into the computer.
[0038] Create partitions on Disk0 (SSD-A): sda1 (A_EFI), sda2 (A_Root), sda3 (A_Backup), sda4 (A_Data). Create symmetric partitions on Disk1 (SSD-B): sdb1 (B_EFI), sdb2 (B_Root), sdb3 (B_Backup), sdb4 (B_Data).
[0039] Install the same onboard operating system (such as Linux) on partitions sda2 and sdb2.
[0040] Run the initialization tool of the fault tolerance management module: Package the data A_EFI and A_Root from partitions sda1 and sda2 into image data A_Local_Image and store it in partition sda3. At the same time, store this image data as A_Remote_Image (i.e., the system backup data of SSD-A) in a specified location in partition sdb3.
[0041] Symmetrically, the data in partitions sdb1 and sdb2 are packaged to generate image data B_Local_Image and stored in partition sdb3. At the same time, this image data is stored in sda3 as B_Remote_Image (i.e., the system backup data of SSD-B).
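The initialization in [0040] and [0041] can be sketched with in-memory stand-ins for the partitions. This is an illustration under stated assumptions: the image is compressed with zlib, the encryption mentioned elsewhere in the description is omitted, the 8-byte length prefix is an invented framing choice, and `pack_image`, `unpack_image`, and `initialize` are hypothetical names.

```python
import zlib
from typing import Tuple


def pack_image(efi: bytes, root: bytes) -> bytes:
    """Combine EFI and Root contents into one compressed image."""
    # Assumed framing: 8-byte big-endian length of the EFI part, then EFI, then Root.
    return zlib.compress(len(efi).to_bytes(8, "big") + efi + root)


def unpack_image(image: bytes) -> Tuple[bytes, bytes]:
    """Inverse of pack_image: return (efi, root)."""
    raw = zlib.decompress(image)
    efi_len = int.from_bytes(raw[:8], "big")
    return raw[8:8 + efi_len], raw[8 + efi_len:]


def initialize(disks: dict) -> None:
    """Store each disk's image locally and on its peer (two-disk case)."""
    for name, peer in (("A", "B"), ("B", "A")):
        image = pack_image(disks[name]["efi"], disks[name]["root"])
        disks[name]["backup"][f"{name}_Local_Image"] = image
        disks[peer]["backup"][f"{name}_Remote_Image"] = image


disks = {
    "A": {"efi": b"A bootloader", "root": b"A kernel and drivers", "backup": {}},
    "B": {"efi": b"B bootloader", "root": b"B kernel and drivers", "backup": {}},
}
initialize(disks)
```

After initialization, the backup area of each disk holds its own local image plus the remote image of its peer, matching the contents of sda3 and sdb3 described above.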
[0042] Configure the switching middleware to select SSD-A by default.
[0043] Step 2: Normal Operation in Orbit
After the satellite enters orbit, the system starts normally from SSD-A. The fault tolerance management module only performs simple health monitoring and does not perform any data verification or background repair.
[0044] Step 3: Fault Occurrence and Switchover Recovery
On the 200th day of operation in orbit, SSD-A suffers a single-event latch-up failure, causing the main controller chip to malfunction and the SATA link to fail.
[0045] When the satellite restarts next time, the UEFI firmware attempts to load the bootloader from SSD-A, but fails due to a link timeout.
[0046] The fault tolerance management module detects a startup failure event and immediately sends a switching command to the switching middleware.
[0047] The switching middleware switches the boot channel from SSD-A to SSD-B according to the switching instruction.
[0048] The firmware successfully loads the bootloader from SSD-B's B_EFI, then mounts the B_Root partition, and the operating system starts normally.
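The switchover sequence in [0045] to [0048] can be condensed into a short sketch: try the default disk, and on failure route the boot channel to the next disk. The `try_boot` callback stands in for the firmware's boot attempt and is a hypothetical interface, as is the function name `boot_with_failover`.

```python
from typing import Callable, Optional, Sequence


def boot_with_failover(
    disks: Sequence[str],
    try_boot: Callable[[str], bool],
    default: int = 0,
) -> Optional[str]:
    """Try the default disk first, then the others in order; return the disk that boots."""
    order = list(disks[default:]) + list(disks[:default])
    for disk in order:
        if try_boot(disk):
            return disk
    return None  # no bootable disk left


# SSD-A has lost its SATA link after a latch-up; SSD-B is healthy.
link_ok = {"SSD-A": False, "SSD-B": True}
booted_from = boot_with_failover(["SSD-A", "SSD-B"], lambda disk: link_ok[disk])
```

In the example scenario the attempt on SSD-A fails and the system comes up from SSD-B, which is the outcome the text describes.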
[0049] After the system stabilizes, the ground control center issues a recovery command based on the telemetry information that "the start disk failed and has been switched to the backup disk".
[0050] The fault tolerance management module executes the recovery task:
- Read A_Remote_Image (i.e., a complete system backup of SSD-A) from the B_Backup partition of SSD-B;
- Write the image completely and sequentially to the A_EFI and A_Root partitions of SSD-A via the SATA or NVMe interface;
- After the write operation is complete, perform a read-back comparison and verification.
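The write-back and read-back verification in [0050] can be sketched as follows, with an in-memory buffer standing in for the target partitions. The chunk size and the function name `write_back` are illustrative assumptions; a real implementation would write through the block device interface.

```python
import io

CHUNK = 4096  # hypothetical transfer size per write


def write_back(image: bytes, target: io.BytesIO) -> bool:
    """Write the image sequentially from offset 0, then read it back and compare."""
    target.seek(0)
    for offset in range(0, len(image), CHUNK):
        target.write(image[offset:offset + CHUNK])
    target.seek(0)
    return target.read(len(image)) == image


damaged_partition = io.BytesIO(b"\xff" * 10000)  # stand-in for the failed A_EFI / A_Root area
image = bytes(range(256)) * 20                   # 5120-byte stand-in for A_Remote_Image
verified = write_back(image, damaged_partition)
```

The comparison after the write is what lets the module report the disk as bootable again only when the restored contents match the backup exactly.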
[0051] After successful recovery verification, SSD-A will be bootable again. The system can continue running from SSD-B, or switch back to SSD-A during the next scheduled maintenance window.
[0052] This embodiment only uses two basic operations, "switch start" and "full image write back", and does not involve any complex mechanisms such as incremental synchronization, checksum voting, or timed inspection.
[0053] Example 2: Bidirectional Recovery Based on Peer-to-Peer Backup
This embodiment uses the payload control unit of a communication satellite as a scenario to emphasize the peer-to-peer redundancy of the two disks.
[0054] Initial configuration: Both SSD-A and SSD-B have the same operating system installed. By default, the system boots from SSD-A, but it can also boot independently from SSD-B.
[0055] Scenario 1 (SSD-A failure, SSD-B recovery): Same as Implementation Example 1, and will not be repeated.
[0056] Scenario 2 (SSD-B fails, SSD-A recovers): During the satellite's operation in orbit, SSD-B develops a large number of bad blocks due to the aging of the NAND flash memory. Although it is still readable, SMART reports a "remaining life warning".
[0057] The fault tolerance management module detects the decline in the health status of SSD-B, but the system is currently running on SSD-A and operations are not affected.
[0058] After assessment, the ground control center will send a "repair backup disk" command when the time is right.
[0059] The fault tolerance management module executes the recovery task:
- Read B_Remote_Image from SSD-A's A_Backup partition;
- Write the image to the B_EFI and B_Root partitions of SSD-B, overwriting the damaged area;
- After successful verification, SSD-B returns to a healthy state.
[0060] The entire process requires no switching of the system boot source, and the boot disk service is uninterrupted.
[0061] The above embodiments use SATA disks as an example, but they are also applicable to other types of disks (such as NVMe disks), and this invention is not limited thereto.
[0062] Through the above description of the embodiments, those skilled in the art can clearly understand that each embodiment can be implemented by means of software plus necessary general-purpose hardware platforms, and of course, it can also be implemented by hardware. Based on this understanding, the above technical solutions, in essence or the part that contributes to the prior art, can be embodied in the form of a software product. This computer software product can be stored in a computer-readable storage medium, such as ROM / RAM, magnetic disk, optical disk, etc., and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute the methods described in the various embodiments or some parts of the embodiments.
[0063] It should be noted that although the operations of the method of this application are described in a specific order in the accompanying drawings, this does not require or imply that these operations must be performed in that specific order, or that all the operations shown must be performed to achieve the desired result. On the contrary, the steps depicted in the flowchart can be performed in a different order. Additionally or alternatively, certain steps may be omitted, multiple steps may be combined into one step, and / or one step may be broken down into multiple steps.
[0064] It should be understood that when the terms "first," "second," "third," and "fourth," etc., are used in the claims, specification, and drawings of this application, they are used only to distinguish different objects and not to describe a specific order. The terms "comprising" and "including" as used in the specification and claims of this application indicate the presence of the described features, integers, steps, operations, elements, and / or components, but do not exclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and / or collections thereof.
[0065] It should also be understood that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to limit the application. As used in this specification and claims, the singular forms “a,” “an,” and “the” are intended to include the plural forms unless the context clearly indicates otherwise. It should also be understood that the term “and / or” as used in this specification and claims refers to any combination and all possible combinations of one or more of the associated listed items, and includes such combinations.
[0066] Although the embodiments of this application are described above, the content is merely an example adopted to facilitate understanding of this application and is not intended to limit the scope and application scenarios of this application. Any person skilled in the art to which this application pertains may make modifications and changes in the form and details of the implementation without departing from the spirit and scope disclosed in this application, but the scope of patent protection of this application shall still be determined by the scope defined in the appended claims.
Claims
1. A fault-tolerant storage architecture for onboard computers based on multi-disk cross-backup, comprising a fault-tolerant management module, a switching middleware, and at least two physical storage modules, wherein the at least two physical storage modules include a first physical storage module and a second physical storage module, characterized in that: the first physical storage module and the second physical storage module have the same partition structure and functional roles; the first physical storage module backs up and stores first image data formed by combining local EFI partition data and local Root partition data, as well as second image data formed by combining EFI partition data and Root partition data from the second physical storage module; the second physical storage module backs up and stores the first image data and the second image data; and the switching middleware has an input terminal electrically connected to the fault-tolerant management module and an output terminal electrically connected to the at least two physical storage modules, and is used to switch the startup channel to the corresponding physical storage module in response to an instruction issued by the fault-tolerant management module.
2. The fault-tolerant storage architecture for spaceborne computers according to claim 1, characterized in that: The fault tolerance management module is also used to send a switching command to the switching middleware in response to detecting a fault in the first physical storage module, and switch the startup channel from the first physical storage module to the second physical storage module.
3. The onboard computer fault-tolerant storage architecture according to claim 2, characterized in that: The fault tolerance management module is also used to obtain the first image data from the second physical storage module, and to repair the first physical storage module based on the first image data.
4. The fault-tolerant storage architecture for spaceborne computers according to claim 1, characterized in that: The fault tolerance management module is further configured to obtain the second image data from the first physical storage module in response to detecting a fault in the second physical storage module, and to repair the second physical storage module based on the second image data.
5. The spaceborne computer fault-tolerant storage architecture according to claim 1, characterized in that: The first image data and the second image data are backed up and stored in a triple redundancy manner, forming multiple backup data corresponding to the first image data and the second image data respectively.
6. The fault-tolerant storage architecture for spaceborne computers according to claim 5, characterized in that: The fault tolerance management module is also used to calculate the checksums of the multiple backup data corresponding to the first image data and the second image data, respectively, and to determine, based on the checksums, whether the multiple backup data corresponding to the first image data and the second image data are complete.
7. The fault-tolerant storage architecture for spaceborne computers according to claim 1, characterized in that: The fault tolerance management module is further configured to set the first physical storage module or the second physical storage module to a read-only state, in which only the fault tolerance management module itself is allowed to write to it, when the first physical storage module or the second physical storage module is working normally.
8. The fault-tolerant storage architecture for spaceborne computers according to claim 6, characterized in that: The checksum is backed up and stored in a triple-redundant manner.
9. The fault-tolerant storage architecture for spaceborne computers according to claim 1, characterized in that: The switching middleware is a multiplexer.
10. The fault-tolerant storage architecture for spaceborne computers according to claim 1, characterized in that: The fault tolerance management module exists in the form of firmware and is deployed on the motherboard, controller, or logic circuit.