Satellite-borne CAN bus dual-redundancy fast switching method
By employing a field-programmable gate array (FPGA) in the spaceborne CAN bus to implement the hardware logic for fault determination and switching execution, the problems of long switching delay and insufficient data continuity in the existing technology are solved, and fast and transparent link switching is achieved.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- SHANGHAI JINGJI COMM TECH CO LTD
- Filing Date
- 2026-04-30
- Publication Date
- 2026-06-19
AI Technical Summary
The existing dual-redundancy switching scheme for spaceborne CAN bus relies on software processes, resulting in long switching delays, failure of fault diagnosis capabilities under space irradiation environment, and inability to guarantee data continuity.
The hardware logic for fault diagnosis and switching execution is implemented using a field-programmable gate array (FPGA). By acquiring multi-dimensional error status indicators in real time and weighting them with a sliding time window, combined with the breakpoint retransmission technology of the backup control core, fast link switching is achieved.
The switching time window is significantly reduced and business-level data loss is minimized, ensuring data continuity; the switching process completes at the hardware level, making it transparent to the host computer.
Figure CN122247793A_ABST
Abstract
Description
Technical Field
[0001] This application relates to the field of spacecraft communication, and in particular to a method for fast switching of dual redundancy on a spaceborne CAN bus.
Background Technology
[0002] In spacecraft such as satellites, the onboard CAN bus carries real-time control commands and critical scientific data between the platform and various subsystems. The continuity of this link directly affects core functions such as attitude control, payload management, and telemetry and remote control. To address the reliability risks posed by long-term exposure to space radiation and significant temperature variations during on-orbit operation, the onboard CAN bus typically employs a primary-backup dual-redundancy design. In the event of a primary link failure, it switches to the backup link to maintain communication. In this redundant architecture, the time consumed from fault identification to link switching completion is defined as the switching delay. The switching delay directly determines the scale of loss of real-time control commands and critical data that fail to arrive during the moment of link interruption, and is a key parameter for evaluating the effectiveness of the redundancy scheme.
[0003] Currently, the dual redundancy switching schemes widely used in on-orbit models mostly rely on a combination architecture of an "ARM processor plus an external CAN control chip." In this architecture, the ARM processor acts as the main body for fault diagnosis and switching decisions. A software-running fault monitoring program periodically reads the status register of the external CAN control chip. When an anomaly is detected, it sequentially executes interrupt responses, register reconfiguration, protocol stack reset, and transceiver enable switching to complete the switching. This type of scheme is simple to implement and highly portable, but its fundamental structure dictates that the fault diagnosis and switching execution paths are bound to the same external processor. This means the switching response chain must go through a complete software process, leading to a series of interconnected shortcomings.
[0004] First, the complete software flow typically results in a switching latency ranging from hundreds of microseconds to milliseconds. Within this time window, real-time control commands or critical scientific data in transit on the bus may fail to reach the receiving node because the transmission channel has failed, leading to loss of service-level data. Second, because fault diagnosis is performed at the software layer, the fault diagnosis capability is lost when the processor itself experiences anomalies such as single-event upsets, single-event latch-ups, or watchdog timeouts under space irradiation conditions. This renders ineffective the redundancy mechanism that should switch immediately upon a fault. This problem is not independent of the large switching latency; both are manifestations of the same software-dependent architecture under different extreme conditions. Third, because the switching process proceeds step by step at the software layer, the software layer cannot maintain a precise end-to-end state for frames that have been sent but not yet acknowledged. At the moment of switching, conventional mechanisms cannot determine which specific frame failed to arrive, and the host computer protocol layer must compensate afterward with timeout retransmissions. This increases the complexity of the host computer software and fails to guarantee the continuity of critical frames at the moment of switching.
[0005] The above three shortcomings share the same technical root cause: both the fault diagnosis path and the switching execution path are tied to external software processes. Therefore, any local optimization targeting a single deficiency is unlikely to fundamentally reduce switching latency and improve data continuity.
Summary of the Invention
[0006] In order to separate fault diagnosis and switching execution from software dependencies, achieve ultra-fast response by relying on hardware logic, and maintain data continuity from the perspective of the host computer during switching, this application provides a dual-redundancy fast switching method for spaceborne CAN bus.
[0007] This application provides a method for fast dual-redundancy switching of a spaceborne CAN bus, which adopts the following technical solution: the method is applied to a spaceborne CAN bus node that uses a field-programmable gate array (FPGA) to carry a main control core and a backup control core. The main control core and the backup control core are connected to the main bus and the backup bus, respectively, through independent transceiver groups. The main control core receives data to be sent from the host computer and transmits it onto the main bus through the corresponding transceiver group. The method includes the following steps:
S1. Acquire in real time the multi-dimensional error status indicators generated by the protocol engine of the main control core at the physical layer and data link layer.
S2. Within a configurable sliding time window, combine the multi-dimensional error status indicators according to preset weights to obtain a communication quality score.
S3. In response to the communication quality score crossing the warning threshold or reaching the forced handover condition, switch the transmission link of the data to be sent from the main control core to the backup control core, completing the main/backup switchover.
S4. During the main/backup switchover, determine the breakpoint location based on the reception confirmation status of the frames sent by the main control core. The backup control core then retransmits the unacknowledged frames from the breakpoint location, making the main/backup switchover transparent to the host computer.
[0008] By adopting the above technical solution, the multi-dimensional error status indicators generated by the protocol engine of the main control core at the physical layer and data link layer are acquired in real time within the field-programmable gate array (FPGA), and the communication quality score is obtained by weighting them within a sliding time window. Fault determination is thus completed within the hardware logic rather than in the software process of an external processor. The transmission link switching triggered by threshold crossing is driven at the hardware level by signaling between modules, which shortens the time window from fault identification to completion of link switching. At the moment of switching, the breakpoint location is determined from the reception confirmation status of the sent frames and the backup control core resumes transmission from the breakpoint, so that frames not delivered at the moment of switching are resent and the main/backup switchover is transparent to the host computer.
[0009] Optionally, S2 includes sub-steps S21-S23:
S21. Within the sliding sub-window corresponding to each of multiple time scales, the multi-dimensional error status indicators are weighted and summed to obtain a time scale score for that time scale. The multiple time scales include at least a short-term, a medium-term, and a long-term scale.
S22. The larger of the short-term time scale score and the medium-term time scale score is taken as the short-term urgency component.
S23. The short-term urgency component is combined with the long-term time scale score according to preset coefficients to obtain the communication quality score.
[0010] By adopting the above technical solution, a single window is replaced by parallel evaluation at multiple time scales, so that short-term urgency and long-term baseline are respectively handled by sub-windows with different time granularities; the short-term urgency component dominates the switching response speed, while the long-term baseline only participates in coefficient-based correction, thereby preserving transient fault tolerance without sacrificing timely perception of continuous degradation.
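As an illustration of sub-steps S21 to S23, the following Python sketch models the scoring in software; the method itself runs in FPGA logic. The window lengths, weights, coefficients, and all names (`QualityScorer`, `k_urgent`, `k_long`) are hypothetical. One further assumption: each sub-window reports the average of its weighted samples rather than the raw sum, so that windows of different lengths remain comparable when the larger of the two is taken.

```python
from collections import deque


class QualityScorer:
    """Software model of a multi-time-scale communication quality score."""

    def __init__(self, weights, short=4, medium=16, long=64,
                 k_urgent=0.8, k_long=0.2):
        self.weights = weights      # indicator name -> preset weight
        self.k_urgent = k_urgent    # coefficient of the short-term urgency component
        self.k_long = k_long        # coefficient of the long-term baseline
        # One sliding sub-window per time scale (lengths in samples, assumed).
        self.short = deque(maxlen=short)
        self.medium = deque(maxlen=medium)
        self.long = deque(maxlen=long)

    @staticmethod
    def _mean(window):
        return sum(window) / len(window) if window else 0.0

    def update(self, indicators):
        """Push one sample of error indicators and return the new score."""
        # S21: weighted sum of the indicators, fed to every sub-window.
        sample = sum(self.weights[name] * value
                     for name, value in indicators.items())
        for window in (self.short, self.medium, self.long):
            window.append(sample)
        # S22: the larger of the short-term and medium-term scores.
        urgency = max(self._mean(self.short), self._mean(self.medium))
        # S23: combine with the long-term score using preset coefficients.
        return self.k_urgent * urgency + self.k_long * self._mean(self.long)
```

In this model a burst of errors raises the short-term average immediately, while the long-term term only shifts the result by its coefficient, which mirrors the division of roles described above.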
[0011] Optionally, the multidimensional error status index is divided into transient error class and persistent error class according to the persistence of the error. The preset weight of the transient error class is lower than the preset weight of the persistent error class. The forced handover condition includes the forced handover threshold. The warning threshold and the forced handover threshold are adjusted synchronously according to the posterior result of the effectiveness of the primary / backup handover, and the synchronous adjustment maintains the difference between the warning threshold and the forced handover threshold constant.
[0012] By adopting the above technical solutions, transient disturbances are suppressed with low weights to reduce false alarms, while continuous degradation is amplified with high weights to accelerate the response to real faults. At the same time, the dual-threshold online adaptive switching makes the switching sensitivity follow the actual operating conditions, and the constraint of constant difference protects the backup core preparation time budget during the early warning stage from being eroded by adaptive adjustment.
[0013] Optionally, the dual-redundancy fast switching method for the spaceborne CAN bus also includes: Multiple sets of threshold vectors are stored in the preset storage area of the field programmable gate array, and each set of threshold vectors corresponds to a different task stage of the on-board CAN bus node. In response to the task phase switching command from the host computer, the threshold vector corresponding to the task phase switching command is loaded as the reference value of the warning threshold and the forced switching threshold. During the loading process, the multi-dimensional error status indicators accumulated in the sliding time window remain unchanged. The adjustment amount of the synchronous adjustment is limited to a preset neighborhood range of the reference value.
[0014] By adopting the above technical solution, the mission phase threshold table provides static reference sensitivity for different flight phases, and the online adaptive system only makes dynamic fine-tuning within the neighborhood of the static reference, forming a two-layer structure of static reference and dynamic fine-tuning; and the phase switching process does not reset the window accumulation state, avoiding score jumps and false triggers during reference switching.
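The two-layer structure of static reference and dynamic fine-tuning can be sketched in Python as follows. This is a software model under stated assumptions, not the FPGA implementation: the phase names, threshold values, step size, and neighborhood width are invented for illustration, and the adjustment direction (raise after a false alarm, lower after a real fault) follows the optional rule stated elsewhere in this description. A single shared offset is used so that the difference between the two thresholds stays constant by construction.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ThresholdVector:
    warning: float
    forced: float


# Hypothetical per-phase reference values; real values are mission-specific.
PHASE_TABLE = {
    "launch": ThresholdVector(warning=6.0, forced=10.0),
    "nominal": ThresholdVector(warning=4.0, forced=8.0),
}


class AdaptiveThresholds:
    def __init__(self, phase, step=0.5, neighborhood=1.0):
        self.reference = PHASE_TABLE[phase]
        self.step = step
        self.neighborhood = neighborhood
        self.offset = 0.0  # applied to both thresholds, so their gap is constant

    def load_phase(self, phase):
        """Load a new reference vector; the sliding-window state is not touched."""
        self.reference = PHASE_TABLE[phase]

    def adjust(self, posterior):
        """Shift both thresholds according to the posterior validity result."""
        if posterior == "false_alarm":
            self.offset += self.step   # less sensitive
        elif posterior == "real_fault":
            self.offset -= self.step   # more sensitive
        # Keep the fine-tuning inside the preset neighborhood of the reference.
        self.offset = max(-self.neighborhood,
                          min(self.neighborhood, self.offset))

    def current(self):
        return ThresholdVector(self.reference.warning + self.offset,
                               self.reference.forced + self.offset)
```

Whether the accumulated offset is kept or cleared when a new phase vector is loaded is not specified in the text; the sketch keeps it, which is only one possible choice.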
[0015] Optionally, the standby control core maintains one of several states, including at least hibernation, hot standby, mirror synchronization, and active states. The standby control core responds to the communication quality score by successively crossing multiple driving thresholds, transitioning from the hibernation state to the hot standby state and then to the mirror synchronization state. In S3, the switching of the transmission link is achieved by the backup control core moving from the mirror synchronization state to the active state.
[0016] By adopting the above technical solution, the total delay budget of the backup control core from cold state to activation is decomposed into each level of state for absorption. That is, the clock stabilization time is absorbed when migrating to the hot standby state, the protocol engine initialization and transmission buffer synchronization are absorbed when migrating to the mirror synchronization state, and finally, the only action left at the moment of switching is the extremely simple action of transceiver enable toggling, so that the critical delay of master-slave switching can be compressed to the microsecond level.
[0017] Optionally, the standby control core periodically performs a self-test operation while in hot standby mode. The result of the self-test operation serves as the migration gating for the standby control core to migrate from hot standby mode to mirror synchronization mode. If the result of the self-test operation fails, the standby control core is prohibited from migrating to mirror synchronization mode and a degraded operation is triggered.
[0018] By adopting the above technical solution, the self-test result is not used as an independent alarm, but is embedded in the state machine transition decision. The passive monitoring of the backup core degradation before the switch is transformed into active gating that does not switch to the degraded backup core at all, thus avoiding secondary failures caused by switching to the already degraded backup core.
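The state progression of the backup control core and the self-test gating can be modelled with a small state machine. The Python sketch below is illustrative only: the `DEGRADED` state is an assumed stand-in for the "degraded operation" mentioned above, the threshold values are supplied by the caller, and the self-test is invoked at the gate rather than periodically, which is a simplification.

```python
from enum import Enum, auto


class State(Enum):
    HIBERNATION = auto()
    HOT_STANDBY = auto()
    MIRROR_SYNC = auto()
    ACTIVE = auto()
    DEGRADED = auto()   # assumed representation of the degraded operation


class BackupCore:
    def __init__(self, hot_threshold, mirror_threshold, self_test):
        self.hot_threshold = hot_threshold        # first driving threshold
        self.mirror_threshold = mirror_threshold  # second driving threshold
        self.self_test = self_test                # callable returning True on pass
        self.state = State.HIBERNATION

    def on_score(self, score):
        """Advance the pre-activation states as the quality score rises."""
        if self.state is State.HIBERNATION and score >= self.hot_threshold:
            self.state = State.HOT_STANDBY
        if self.state is State.HOT_STANDBY and score >= self.mirror_threshold:
            # The self-test result gates the move to mirror synchronization.
            self.state = (State.MIRROR_SYNC if self.self_test()
                          else State.DEGRADED)
        return self.state

    def activate(self):
        """S3: only a core in mirror synchronization may take over."""
        if self.state is not State.MIRROR_SYNC:
            return False
        self.state = State.ACTIVE
        return True
```

Because `activate` refuses any state other than mirror synchronization, a backup core that failed its self-test can never become the switching target in this model.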
[0019] Optionally, S1 also includes generating a fault location tag based on the monitoring results of the standby control core on the main bus, and the fault location tag can at least distinguish between bus physical level faults and internal faults of the main control core. In S3, the switching of the transmission link also determines the bus side of the switching target based on the fault location tag. In response to the fault location tag indicating a bus physical level fault, the transmission link is switched to the backup bus. In response to a fault location tag indicating an internal fault in the main control core, the transmission link remains on the main bus.
[0020] By adopting the above technical solution, the normally open characteristic of the backup control core receiving circuit is used to physically monitor the main bus. Without adding new hardware, the root cause of the fault can be distinguished as being at the node level or the bus level, so that the switching target matches the root cause of the fault. If the fault is actually at the bus level, switching to the backup core on the same bus is ineffective. Switching to the backup bus is necessary to restore the link and avoid secondary data loss caused by invalid switching.
[0021] Optionally, S4 includes sub-steps S41-S43:
S41. Each frame of the data to be sent is embedded with a transmission sequence number that increments frame by frame.
S42. A feedback frame is received from the predetermined receiving node of the data to be sent, and the maximum sequence number that the predetermined receiving node has successfully received is obtained from the feedback frame.
S43. The sequence number following the maximum sequence number is determined as the breakpoint position, and the backup control core retransmits the unacknowledged frames from the breakpoint position.
[0022] By adopting the above technical solution, and overcoming the limitation that the native bit-level ACK Slot of the CAN protocol can only confirm the receipt of at least one node, an end-to-end transmission sequence number mechanism is introduced. This enables the sender to accurately obtain the acknowledgment progress from the perspective of the predetermined receiving node, thereby accurately locating the breakpoint at the moment of switching and avoiding congestion and duplicate processing at the receiving end caused by retransmitting the entire window.
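A minimal Python sketch of the breakpoint determination in S41 to S43 is shown below. The `Frame` type and function name are hypothetical, and sequence-number wrap-around is deliberately ignored, which a real fixed-width counter would have to handle.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Frame:
    seq: int        # S41: transmission sequence number, incremented per frame
    payload: bytes


def frames_to_retransmit(sent_frames, max_acked_seq):
    """Return the breakpoint and the frames the backup core should resend.

    max_acked_seq is the maximum sequence number reported by the
    predetermined receiving node in its feedback frame (S42).
    """
    breakpoint_seq = max_acked_seq + 1          # S43: next sequence number
    pending = [f for f in sent_frames if f.seq >= breakpoint_seq]
    return breakpoint_seq, pending
```

Only frames at or beyond the breakpoint are returned, so frames the receiver has already confirmed are not sent again.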
[0023] Optionally, the method maintains a shadow buffer in a preset storage area of the field-programmable gate array. The shadow buffer retains key frames in the data to be sent according to the priority of the service semantics. The key frames in the shadow buffer are released when the maximum sequence number reaches a preset release condition. S43 also includes the backup control core reading, from the shadow buffer, key frames whose transmission sequence number is greater than the maximum sequence number and retransmitting them.
[0024] By adopting the above technical solution, additional protection beyond the normal retransmission range of unacknowledged frames is provided for high-priority key frames such as orbit control and attitude control frames. The release condition driven by the transmission sequence number ensures that key frames are not discarded prematurely, and retransmission is limited to frames whose transmission sequence number is greater than the maximum sequence number, which prevents shadow buffer retransmission from overwhelming the receiver.
[0025] Optionally, after S4, the method may further include: starting a verification timer; within the duration of the verification timer, acquiring in real time the multi-dimensional error status indicators generated by the protocol engine of the backup control core at the physical layer and data link layer, and combining them within the sliding time window in the manner of S2 to obtain the communication quality score for the verification period. Based on this score and the backup control core's monitoring results on the main bus, the validity posterior result of the main/backup switchover is determined. The validity posterior result is divided into at least three categories: false alarm, real fault, and environmental interference. In response to the validity posterior result being environmental interference, the method enters a safe mode, in which the transceiver groups corresponding to the main control core and the backup control core remain disabled for transmission and enabled for reception. The validity posterior result is used as the basis for the synchronous adjustment.
[0026] By adopting the above technical solution, a closed-loop verification is introduced after the switching action is completed. The actual health score of the backup channel itself is combined with the physical monitoring results of the main bus to perform a three-state classification. That is, if the score returns and the monitoring is normal, it is a real fault; if the score is still low but the monitoring is abnormal, it is environmental interference; if the score only briefly crossed the threshold, it is a false alarm. The three-state results are fed back to the threshold adaptation path to accommodate the diversity of fault modes in the strong radiation environment of spacecraft, and to avoid oscillation between the main and backup cores through frequent switching under environmental faults.
[0027] Optionally, in scenarios where the resource consumption of the field-programmable gate array is limited, the acquisition of multi-dimensional error status indicators can be achieved by sharing the accumulator and weight storage area, and switching the error indicator input at different sampling periods through a multiplexing structure.
[0028] By adopting the above technical solutions, the use of a shared pipeline reduces the occupancy of lookup tables and block storage by the monitoring logic within the field-programmable gate array, enabling multi-dimensional monitoring to be realized under spaceborne limited programmable resources.
[0029] Optionally, multidimensional error status indicators are obtained using differentiated sampling frequencies according to feature type; counter-type indicators are sampled at a preset lower frequency; event-triggered indicators are obtained in an event-driven manner; and physical layer signal pulse width indicators are sampled synchronously with the CAN bus bit clock.
[0030] By adopting the above technical solution, sampling resources are only occupied when the event actually occurs, avoiding high-frequency redundant sampling of counter-type indicators and reducing the power consumption and timing pressure of monitoring logic.
[0031] Optionally, an event ring buffer can be maintained within the field-programmable gate array to record, for each main/backup switchover, the timestamp, the communication quality score at the time of triggering, a snapshot of the error status indicators, the trigger source, and the bus status after the switchover.
[0032] By adopting the above technical solution, a traceability basis for each switchover is provided to ground telemetry and control, operability is maintained without introducing software processes, and the cause and handling steps of each switchover can be identified during fault review.
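The behaviour of such an event ring buffer can be sketched in Python with a bounded deque, where the oldest record is overwritten once the buffer is full. The field names and the capacity are assumptions chosen for illustration; in the FPGA the buffer would be a fixed block of memory with a wrapping write pointer.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class SwitchEvent:
    timestamp: int      # time of the switchover (unit assumed)
    score: float        # communication quality score at trigger time
    indicators: tuple   # snapshot of the error status indicators
    trigger: str        # trigger source, e.g. "warning" or "forced"
    bus_after: str      # bus status after the switchover


class EventRing:
    def __init__(self, capacity):
        # A deque with maxlen drops the oldest entry when a new one arrives.
        self._events = deque(maxlen=capacity)

    def record(self, event):
        self._events.append(event)

    def dump(self):
        """Return the retained events, oldest first, e.g. for telemetry."""
        return list(self._events)
```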
[0033] Optionally, the main control core and the backup control core use different preset weight combinations to weight the multidimensional error status indicators, and there are no common key weight items among the preset weight combinations.
[0034] By adopting the above technical solution, the primary and backup cores have different sensitivities to logical errors induced by the same source irradiation, reducing the probability that a single irradiation event will simultaneously interfere with the judgment results of the primary and backup cores.
[0035] Optionally, a health summary of the node is embedded in the data field of a preset low-priority frame of the data to be sent. The health summary includes at least the communication quality score trend and fault location tag status of the node in the most recent time, which can be used by other nodes on the same bus for multi-node joint switching decisions.
[0036] By adopting the above technical solution, the switching decision of a single node can be incorporated into the health status of neighboring nodes to form a system-level redundancy view, and an expansion interface is reserved for multi-node collaborative switching or fault location.
[0037] Optionally, the short-term timescale is on the same order as the bit time of the CAN bus, the medium-term timescale is on the same order as the CAN frame period, and the long-term timescale is on the same order as the task phase duration of the node.
[0038] By adopting the above technical solution, the three time scales are respectively bound to the three-level time scales of the CAN physical layer, frame layer and task layer, so that the health assessment can be observed at different abstraction levels and aligned with the time structure of the CAN bus itself.
[0039] Optionally, transient error classes include at least bit stuffing error events and single cyclic redundancy check error events; persistent error classes include at least continuous rise of the transmission error counter and continuous exceedance of the physical layer signal pulse width abnormality ratio.
[0040] By adopting the above technical solution, a specific observation caliber is provided for the binary division of error persistence, so that the distinction between transient and persistent errors is based on the existing standard state variables of the CAN protocol engine, without the need to introduce additional acquisition channels.
[0041] Optionally, in response to the validity posterior result indicating that the previous primary / backup switch was a false alarm, the warning threshold and the forced switch threshold are simultaneously increased; in response to the validity posterior result indicating that the previous primary / backup switch was a real fault, the warning threshold and the forced switch threshold are simultaneously decreased.
[0042] By adopting the above technical solution, the direction of adaptive adjustment is regularized. When there is a false alarm, the sensitivity is reduced to avoid false triggering again, and when there is a real fault, the sensitivity is increased to strive for an earlier response, so that the threshold converges to the working point that matches the actual failure rate during long-term operation.
[0043] Optionally, in the hibernation state, the clock of the backup control core is turned off. In the hot standby state, the clock of the backup control core is started and the transceiver group corresponding to the backup control core remains disabled. In the mirror synchronization state, the protocol engine of the backup control core is initialized, the data in the transmit buffer of the main control core is synchronized to the mirror register of the backup control core, and the transceiver group corresponding to the backup control core remains disabled. In the active state, the transceiver group corresponding to the backup control core is enabled.
[0044] By adopting the above technical solution, specific resource switch configurations are bound to the four states respectively, so that the power consumption, readiness and switching readiness of each state form a clear gradient, which makes it easier to accurately allocate the switching delay budget to each state in engineering implementation.
[0045] Optionally, the self-test operation includes configuring the protocol engine of the standby control core to internal loopback mode to send and receive test frames and verify the consistency of the protocol decoding results of the test frames, and performing write-readback verification on the mirror register of the standby control core.
[0046] By adopting the above technical solution, the self-test covers the entire switching path from the protocol engine to the mirror register. A failure in any link can be detected before the switch, rather than being passively discovered through score regression after the switch.
[0047] Optionally, in response to the standby control core failing to detect frames sent by the main control core on the main bus within a consecutive preset CAN frame interval, a fault location tag indicates a bus physical level fault. In response to a preset number of consecutive frames where the backup control core detects frames sent by the main control core on the main bus but the decoding error rate exceeds a preset rate, a fault location tag indicates an internal fault in the main control core.
[0048] By adopting the above technical solution, faults are located in parallel using two independent criteria: the missing frame interval level monitoring and the proportion of frame level decoding errors. The former captures the situation where there is no signal at all, while the latter captures the situation where there is a signal but the content is damaged. The combination of the two improves the robustness of fault location.
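The two criteria above, together with the choice of target bus that follows from the fault location tag, can be sketched in Python as follows. The observation encoding, parameter names, and the `NONE` tag are assumptions made for the example; the text only requires that bus physical level faults and internal faults of the main control core be distinguishable.

```python
from enum import Enum, auto


class FaultTag(Enum):
    NONE = auto()
    BUS_PHYSICAL = auto()
    CORE_INTERNAL = auto()


def locate_fault(observations, silent_limit, window, max_error_rate):
    """Classify what the backup core observed on the main bus.

    observations: one entry per expected frame interval, oldest first:
      None  -> no frame from the main control core was detected
      True  -> a frame was detected and decoded correctly
      False -> a frame was detected but failed decoding
    """
    # Criterion 1: no frames at all for `silent_limit` consecutive intervals.
    tail = observations[-silent_limit:]
    if len(tail) == silent_limit and all(o is None for o in tail):
        return FaultTag.BUS_PHYSICAL
    # Criterion 2: frames present for `window` consecutive intervals,
    # but the share of decoding errors exceeds the preset rate.
    recent = observations[-window:]
    if len(recent) == window and all(o is not None for o in recent):
        if recent.count(False) / window > max_error_rate:
            return FaultTag.CORE_INTERNAL
    return FaultTag.NONE


def target_bus(tag):
    """Bus-level faults move the link to the backup bus; otherwise stay on main."""
    return "backup" if tag is FaultTag.BUS_PHYSICAL else "main"
```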
[0049] Optionally, the transmission sequence number is embedded at a preset byte position in the data field of each frame of the data to be sent. The feedback frame is piggybacked on a service transmission frame of the predetermined receiving node, and the maximum sequence number is carried at a preset byte position in the data field of the feedback frame.
[0050] By adopting the above technical solution, the feedback mechanism uses service frames piggybacking instead of independent ACK frames, resulting in zero new bus bandwidth occupation and priority competition, and minimal changes to the original application layer of the receiving end.
[0051] Optionally, the transmission sequence number is only used to determine the breakpoint location during breakpoint retransmission, and does not trigger the retransmission of data to be sent during non-breakpoint retransmission.
[0052] By adopting the above technical solution, the serial number mechanism is completely transparent to normal transmission and reception, and is only activated at the moment of switching, minimizing the impact on compatibility with the existing CAN application layer.
[0053] Optionally, the business semantic priority is stored in the preset storage area of the field-programmable gate array in the form of a business semantic priority mapping table, and the business semantic priority mapping table can be updated through an external diagnostic interface.
[0054] By adopting the above technical solution, key frame classification is decoupled from CAN ID planning, and the service semantic priority can be adjusted on orbit independently of the physical ID, thereby improving parameter maintainability.
[0055] Optionally, preset release conditions include the maximum sequence number growing to exceed the keyframe's transmission sequence number plus a preset margin.
[0056] By adopting the above technical solution, a pre-set margin is used to reserve a reversal window for remote retransmission after occasional frame loss on the receiving side, thus avoiding the premature release of key frames in edge scenarios.
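The retention, release, and retransmission behaviour of the shadow buffer can be illustrated with the Python sketch below. The class and method names are hypothetical, the key-frame decision is passed in as a flag rather than looked up in a priority mapping table, and sequence-number wrap-around is ignored for simplicity.

```python
class ShadowBuffer:
    """Software model of a shadow buffer that retains key frames only."""

    def __init__(self, margin):
        self.margin = margin    # preset margin in the release condition
        self.frames = {}        # transmission sequence number -> payload

    def retain(self, seq, payload, is_key):
        # Only frames classified as key frames are kept.
        if is_key:
            self.frames[seq] = payload

    def on_feedback(self, max_acked_seq):
        """Release key frames once the acknowledged maximum exceeds seq + margin."""
        released = [s for s in self.frames if max_acked_seq > s + self.margin]
        for s in released:
            del self.frames[s]

    def frames_after(self, max_acked_seq):
        """Key frames the backup core would resend at a switchover."""
        return sorted((s, p) for s, p in self.frames.items()
                      if s > max_acked_seq)
```

With a margin of 2, a key frame with sequence number 5 is kept while the acknowledged maximum is 7 and released once it reaches 8, which leaves room for a late retransmission request from the receiver.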
[0057] Optionally, the tri-state classification can be further refined based on the monitoring results of the backup control core on the main bus: If the communication quality score during the verification period is lower than the warning threshold, the validity posterior result is a false alarm. If the communication quality score during the verification period remains above the warning threshold and the backup control core detects frames sent by the main control core on the main bus but the decoding error rate exceeds the preset rate, the validity posterior result is a real fault. If the communication quality score during the verification period remains above the warning threshold and the decoding error ratio obtained by the backup control core from monitoring the main bus also exceeds the preset ratio, the validity posterior result is environmental interference.
[0058] By adopting the above technical solution, when the frames of the main control core can still be monitored and the decoding errors are concentrated in them, a real node-level fault is indicated; when large-scale decoding errors are observed across the entire bus, an environmental fault such as an irradiation storm or a power supply anomaly is indicated. This makes the criteria for the three-state classification more specific and easier to implement in engineering as thresholds.
[0059] In summary, this application includes at least one of the following beneficial technical effects: 1. By internalizing fault characteristic acquisition, health assessment, switching decision and breakpoint retransmission in the field programmable gate array, the primary / standby switching is decoupled from the software flow of the external processor, which greatly reduces the switching time window and makes the switching transparent to the host computer, reducing the loss of business-level data during the switching moment.
[0060] 2. Multi-time-scale sliding windows and error-class weighting suppress transient false alarms; synchronous dual-threshold adjustment, layered on the task-stage threshold table, covers both long-term adaptation and stage-dependent sensitivity; the multi-level state machine of the backup control core distributes the delay budget across the pre-activation actions of each level, so that only a hardware-level enable toggle remains at the critical switching moment.
[0061] 3. The backup control core's physical monitoring of the main bus distinguishes node-level faults from bus-level faults, so that the switching target matches the root cause of the fault; end-to-end transmission sequence numbers and the shadow buffer enable precise location of breakpoints and additional protection of key frames; after switching, the three-state posterior verification classifies the switchover as a false alarm, a real fault, or environmental interference and feeds the result back into the adaptive loop.
Attached Figure Description
[0062] Figure 1 is a schematic diagram of the application environment of the dual-redundancy fast switching method of the spaceborne CAN bus in one embodiment of the present invention.
[0063] Figure 2 is a schematic diagram of the system architecture of a spaceborne CAN bus node that implements the dual-redundancy fast switching method for spaceborne CAN bus in one embodiment of the present invention.
[0064] Figure 3 is a main flowchart of a fast switching method for dual redundancy of spaceborne CAN bus in one embodiment of the present invention.
[0065] Figure 4 is a state transition diagram of the backup control core in one embodiment of the present invention.
Detailed Description
[0066] The present application will be further described in detail below with reference to the accompanying drawings. It should be understood that the specific embodiments described herein are merely illustrative of the application and are not intended to limit the scope of the application.
[0067] This application discloses a method for fast switching of dual redundancy on a spaceborne CAN bus. Referring to Figure 1, the method can be applied to, for example, the application environment shown in Figure 1.
[0068] To facilitate understanding of the embodiments of this application by those skilled in the art, the following explanations are provided for several technical terms involved in the embodiments of this application.
[0069] Controller Area Network (CAN) is a serial communication bus protocol widely used in industrial control, automotive electronics, and spacecraft subsystem communication. The CAN bus uses differential signal transmission, with each physical bus consisting of one twisted pair of two wires. Mechanisms such as bit stuffing and cyclic redundancy check (CRC) maintain the robustness of the data link layer. The CAN protocol engine is the logical entity that implements the functions of the CAN protocol's physical and data link layers and is responsible for frame assembly, bus arbitration, error detection, and error state machine management.
[0070] In the CAN protocol specification, the protocol engine maintains two error counters: the transmit error counter (TEC) and the receive error counter (REC). The counter values increase with errors detected during transmission or reception and decrease with successful transmission or reception. When the TEC value exceeds the preset upper limit, the node enters the bus disconnect state. In this state, the node no longer participates in bus transmission and reception and can only reconnect to the bus after a specific recovery process.
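The counter behaviour described above can be sketched as follows. This is a simplified Python model for illustration only. It keeps the main fault-confinement rules of the CAN specification (a transmit error adds 8 to the TEC, a receive error adds 1 to the REC, a success subtracts 1, the node is error passive above 127, and the bus disconnect or "bus-off" state is entered when the TEC exceeds 255) and omits the specification's exception rules. The class and method names are hypothetical.

```python
class ErrorCounters:
    """Simplified CAN fault-confinement counters (main rules only)."""

    def __init__(self) -> None:
        self.tec = 0  # transmit error counter
        self.rec = 0  # receive error counter

    def on_transmit(self, ok: bool) -> None:
        # A transmit error adds 8; a successful transmission subtracts 1.
        self.tec = max(self.tec - 1, 0) if ok else self.tec + 8

    def on_receive(self, ok: bool) -> None:
        # A receive error adds 1; a successful reception subtracts 1.
        self.rec = max(self.rec - 1, 0) if ok else self.rec + 1

    @property
    def state(self) -> str:
        if self.tec > 255:
            return "bus_off"
        if self.tec > 127 or self.rec > 127:
            return "error_passive"
        return "error_active"
```

In this model, 32 consecutive failed transmissions starting from zero are enough to reach the bus disconnect state, which is one reason the method also watches the counter trend rather than waiting for the limit.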
[0071] A bit stuffing error is the error event reported when the protocol engine detects more consecutive bits of the same polarity than the stuffing rule allows, that is, when an expected opposite-polarity stuff bit is missing. A cyclic redundancy check error is the error event reported when the checksum recalculated by the receiver over the received frame is inconsistent with the checksum carried in the frame's CRC field. A physical layer signal pulse width anomaly is an event in which the duration of a high or low level deviates from the nominal bit width within the bit timing sequence, usually caused by sampling jitter, electromagnetic interference, or degradation of driver characteristics.
[0072] A Field-Programmable Gate Array (FPGA) is a semiconductor device whose internal logic can be configured using hardware description language code. Through the configuration of lookup tables and flip-flop arrays, it can support a variety of logic functions, from simple combinational logic to complete systems on chip. In this embodiment, the FPGA internally houses two functionally equivalent but physically independent CAN controller logic entities: a main control core and a backup control core. Each control core contains components such as a protocol engine, a transmit buffer, a receive buffer, and an error state machine. The transmit buffer stores frames yet to be transmitted via the bus, while the receive buffer stores frames that have been received from the bus but not yet delivered to the upper layer for processing. The mirror registers are a set of registers located within the backup control core whose contents are kept synchronized with the transmit buffer of the main control core; they serve as a transitional carrier for handling data whose transmission is incomplete at the moment of switching.
[0073] A sliding time window is a statistical structure that maintains data samples within a recent time period, with older samples moving out of the window over time and new samples moving in. In this embodiment, the length of the sliding time window is configurable. Within the window, various error samples are combined according to preset weights to obtain a communication quality score, which is a scalar indicator reflecting the real-time health of the bus link. The warning threshold and the forced handover threshold are two thresholds used to trigger pre-activation and forced handover actions, respectively. Together, they determine the response behavior of the handover decision module at different score levels.
[0074] The breakpoint location is the position, in the frame sequence at the moment of the primary/backup switchover, of the earliest frame that the main control core has driven onto the bus but that has not yet been acknowledged by the intended receiving node. The backup control core retransmits the unacknowledged frames starting from the breakpoint location, so that the data flow before and after the switchover remains continuous from the perspective of the intended receiving node. The shadow buffer is an additional buffer located in a preset storage area of the FPGA. It retains a backup copy of the key frames in the data to be transmitted, outside the regular transmit buffer, and serves as the source for key frame retransmission at the moment of switchover. Key frames are instruction or data frames that significantly affect system functionality, such as orbit control instructions, attitude control instructions, and critical payload remote control frames. Business semantic priority is distinct from CAN protocol frame identifier priority: the former is assigned by the system business designer according to the importance of the business function and the scope of the instruction's impact, while the latter is determined by the value of the CAN protocol frame identifier.
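Breakpoint location can be illustrated with a short Python sketch. It assumes, hypothetically, that the mirrored transmit data is available as a mapping from frame sequence number to frame bytes and that acknowledged sequence numbers are tracked in a set. Sequence number wrap-around and the shadow buffer are not modelled.

```python
from typing import Dict, List, Optional, Set, Tuple


def frames_to_retransmit(mirror: Dict[int, bytes],
                         acked: Set[int]) -> Tuple[Optional[int], List[bytes]]:
    """Return (breakpoint sequence number, frames to resend in order)."""
    # Frames driven onto the bus but not yet acknowledged, in sequence order.
    pending = sorted((seq, frame) for seq, frame in mirror.items()
                     if seq not in acked)
    if not pending:
        return None, []
    # The breakpoint is the earliest unacknowledged sequence number.
    return pending[0][0], [frame for _, frame in pending]
```

For example, with frames 7, 8, and 9 mirrored and only frame 7 acknowledged, the breakpoint is 8 and frames 8 and 9 are resent by the backup control core.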
[0075] Mission phases refer to several stages with different operational characteristics throughout the satellite's entire lifespan. Typical mission phases include the launch phase, steady-state orbit insertion phase, attitude maneuver phase, orbit correction phase, safety mode, and end-of-life phase. Under different mission phases, the mission load, electromagnetic interference intensity, and cumulative irradiance level of the onboard subsystems vary significantly, thus requiring different bus switching sensitivity.
[0076] Referring to Figure 1, the application environment is the communication network of the payload subsystem within a satellite platform. The payload subsystem connects to the satellite platform's CAN network via an onboard CAN bus node. The onboard CAN bus node is connected to the CAN network via a main bus and a backup bus. The main bus and the backup bus each consist of independent twisted-pair cables, independent terminating resistors, and independent connectors. While physically independent in terms of wiring, connector placement, and grounding paths, they are equivalent at the CAN protocol layer. The satellite platform's mission computer or command subsystem acts as the host computer for this node and sends the data to be transmitted through the node's service interface. This data includes remote control commands, attitude parameters, mode switching commands, and critical scientific data sent from the platform to the payload. In addition to this node, other slave nodes, such as payload control units, sensors, and actuators, are connected to the main and backup buses. Each slave node is simultaneously connected to both the main and backup buses, and a frame sent by this node on a particular bus can be received by all slave nodes on that bus.
[0077] The core constraint of the application environment lies in the unique characteristics of the space environment. During on-orbit operation, the satellite is continuously exposed to radiation from sources such as Earth's radiation belts, solar particle events, and galactic cosmic rays. Radiation-induced single-event upsets and single-event latch-up can occur in any logic unit within the node. Temperature cycling during orbital operation may cause slow changes in connector contact resistance. The timescale of long-term steady-state operation ranges from several months to several years, during which the degradation of component characteristics and the cumulative radiation dose cannot be ignored. Because of the limited telemetry and control arc, the ground's maintenance window for on-orbit nodes is limited. When a fault occurs, the node itself often needs to respond at the hardware level first, and the ground then intervenes within the telemetry and control arc for review and handling. These constraints collectively determine that the dual-redundancy fast switching method of the onboard CAN bus in the embodiments of this application must have different sensitivities in different mission phases, must be able to operate autonomously without an external processor, and must maintain data continuity at the hardware level during the switching process.
[0078] Referring to Figure 3, the dual-redundancy fast switching method for spaceborne CAN bus provided in this application includes steps S1 to S4. The method is executed entirely within a field-programmable gate array (FPGA). Specifically, the monitoring logic module, health assessment module, switching decision module, and breakpoint retransmission module within the FPGA collaboratively complete all actions from fault characteristic acquisition to link switching execution, without requiring an external central processing unit to intervene in fault determination, register configuration, or protocol stack reset. Steps S1 to S4 are described in detail below.
[0079] S1. In each monitoring sampling period, the monitoring logic module acquires the multi-dimensional error status indicators generated by the main control core protocol engine at the physical layer and the data link layer.
[0080] The monitoring logic module is connected to the protocol engine of the main control core in a bypass mode. The bypass mode means that the monitoring logic module only reads and listens to the status registers and event signals inside the protocol engine, without writing any signals to the protocol engine. This bypass feature ensures that the monitoring actions do not affect the protocol engine's normal protocol response to the bus, nor does it participate in frame assembly, bus arbitration, or state transitions of the error state machine.
[0081] The multi-dimensional error status indicators are divided into categories by dimension, with each dimension corresponding to one aspect of bus link degradation. In some embodiments, the transient error category includes at least bit stuffing error events and single cyclic redundancy check error events, while the persistent error category includes at least a continuously rising transmit error counter and a physical layer signal pulse width anomaly ratio that continuously exceeds its limit.
[0082] The acquisition mechanisms for error status indicators vary depending on the data source. Bit stuffing error events originate from bit stuffing error signals detected internally by the protocol engine, while single cyclic redundancy check (CRC) error events originate from CRC error signals detected internally by the protocol engine. These two types of errors are event-based indicators. When each error event occurs, the protocol engine generates a single pulse, and the monitoring logic module counts these pulses to obtain the cumulative event count.
[0083] The continuous rise of the transmission error counter originates from changes in the value of the transmission error counter register. The monitoring logic module samples the transmission error counter register in each monitoring sampling period and compares it with the sampled value of the previous period to obtain the increment of the transmission error counter; the increment accumulates within the sliding time window to form a "continuously rising" trend observation.
[0084] The abnormal proportion of physical layer signal pulse width is determined by the duration of the bit level obtained from the sampling of the bus physical layer signal. The monitoring logic module counts the proportion of bits whose bit level duration deviates from the nominal bit width by more than a preset tolerance within the monitoring sampling period. This proportion reflects the degree of deviation of the bus physical layer signal under driver characteristics, connector contact status, or electromagnetic interference conditions.
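The pulse width anomaly ratio can be sketched in Python as below. This is an illustrative software model only; it assumes the measured durations have already been normalized to one value per bit, and the function and parameter names are hypothetical.

```python
def pulse_width_anomaly_ratio(durations_ns, nominal_ns, tolerance_ns):
    """Fraction of bits whose level duration deviates beyond the tolerance."""
    if not durations_ns:
        return 0.0  # no bits observed in this sampling period
    bad = sum(1 for d in durations_ns if abs(d - nominal_ns) > tolerance_ns)
    return bad / len(durations_ns)
```

With a nominal bit width of 1000 ns and a tolerance of 50 ns, the durations 1000, 1010, 1200, and 800 ns give a ratio of 0.5, because only the last two exceed the tolerance.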
[0085] The acquisition of the multi-dimensional error status indicators uses sampling resources in a differentiated manner. In some embodiments, the indicators are acquired at differentiated sampling frequencies according to feature type: counter-type indicators are sampled at a preset lower frequency, event-triggered indicators are acquired in an event-driven manner, and physical layer signal pulse width indicators are sampled synchronously with the CAN bus bit clock. Counter-type indicators change on a slow timescale within the protocol engine, so high-frequency sampling is unnecessary; using a preset lower sampling frequency reduces the timing pressure on the monitoring logic module. Event-triggered indicators are only meaningful when events actually occur; an event-driven approach avoids redundant sampling during periods without events. Physical layer signal pulse width indicators need to capture subtle deviations in bit timing and therefore require sampling synchronous with the bit clock to maintain accuracy. This differentiated sampling allows the monitoring logic module to reduce its actual usage of lookup tables and flip-flops without sacrificing acquisition accuracy.
[0086] In addition to driving the health assessment of S2, the monitoring and acquisition results also participate in the generation of fault location tags to support the selection of the target bus side in S3. Specifically, although the backup transceiver group corresponding to the backup control core disables bus transmission before the master-slave switchover occurs, the receiving circuit remains enabled, so the backup control core can continuously monitor the bus it is connected to.
[0087] The processing of the main bus monitoring results obtained by the backup control core is completed by the monitoring logic module. The processing result is output as a fault location tag, which distinguishes at least two types of faults: bus physical layer faults and internal faults of the main control core. In some embodiments, in response to the backup control core detecting no frame sent by the main control core on the main bus for a preset number of consecutive CAN frame intervals, the monitoring logic module determines that the main bus physical layer link has completely failed, and the fault location tag indicates a bus physical layer fault. In response to the backup control core detecting frames sent by the main control core on the main bus over a preset number of consecutive frames, with a decoding error rate that exceeds a preset rate, the monitoring logic module determines that the internal protocol engine or transmit driver of the main control core is abnormal while the bus physical layer link itself is still connected, and the fault location tag indicates an internal fault of the main control core. The two criteria run in parallel: the former captures the case of no signal at all, and the latter captures the case of a signal being present but its content being corrupted. Together they distinguish the root cause of the fault and provide a basis for selecting the target bus side in the subsequent switching.
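The two parallel criteria can be written as a small Python decision function. This is a sketch under stated assumptions: the argument names are hypothetical, the silence criterion is checked first, and a decoding error rate of None stands for "no frames were heard".

```python
from typing import Optional


def fault_location_tag(silent_intervals: int, silent_limit: int,
                       decode_error_rate: Optional[float],
                       preset_rate: float) -> Optional[str]:
    """Hypothetical fault location tag derived from listening on the main bus."""
    # Criterion 1: no signal at all for the preset number of frame intervals.
    if silent_intervals >= silent_limit:
        return "bus_physical_layer_fault"
    # Criterion 2: frames are present but their content is corrupted.
    if decode_error_rate is not None and decode_error_rate > preset_rate:
        return "main_core_internal_fault"
    return None  # neither criterion met
```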
[0088] In some embodiments, under scenarios where field-programmable gate array (FPGA) resources are limited, the acquisition of multi-dimensional error status indicators shares an accumulator and a weight storage area, and the error indicator input is switched at different sampling periods through a multiplexing structure. Specifically, the monitoring logic module instantiates only one set of accumulators and one set of weight lookup tables within the FPGA. A set of multiplexers sequentially routes error status indicators of different dimensions to the accumulator input. The accumulator accumulates the number of error samples for different dimensions of error status indicators at different sampling periods, and the weight lookup table synchronously switches weight coefficients according to the multiplexer gating signal. This shared pipeline approach replaces spatial multiplexing with time multiplexing, reducing the FPGA's monitoring logic's occupation of lookup tables and block storage without sacrificing multi-dimensional monitoring capabilities, enabling multi-dimensional monitoring to be realized under the limited programmable resources of the onboard FPGA.
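The shared pipeline can be modelled in a few lines of Python. This is a behavioural sketch only, not synthesizable logic. It assumes, hypothetically, that the multiplexer selects one dimension per sampling period in round-robin order and that the accumulator holds a weighted running total; the class and attribute names are illustrative.

```python
class SharedAccumulator:
    """One accumulator and one weight table shared by all indicator dimensions."""

    def __init__(self, weights):
        self.weights = list(weights)  # weight lookup table, one entry per dimension
        self.select = 0               # multiplexer select signal
        self.total = 0                # the single shared accumulator

    def tick(self, indicators):
        """Process one sampling period; only the selected dimension is routed in."""
        self.total += self.weights[self.select] * indicators[self.select]
        # Advance the select signal so the next period handles the next dimension.
        self.select = (self.select + 1) % len(self.weights)
        return self.total
```

The trade-off is visible in the model: each dimension is updated only once every N sampling periods, in exchange for instantiating a single accumulator instead of N.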
[0089] The above describes one method for obtaining the multi-dimensional error status indicators in S1, namely, having the monitoring logic module read, in bypass mode, the status registers and event signals of the main control core protocol engine. In other embodiments, S1 can also employ other acquisition methods. For example, a set of dedicated error monitoring bypass logic can be instantiated inside the FPGA. This dedicated logic directly samples the main bus physical layer signals and independently performs bit stuffing detection, cyclic redundancy check recalculation, and bit timing discrimination, so that the resulting error status indicators do not depend on the status registers inside the main control core protocol engine. Because the dedicated error monitoring bypass logic also extracts multi-dimensional error status indicators from bus signals, it likewise falls within the implementations of S1.
[0090] S2. The health assessment module combines multi-dimensional error status indicators according to preset weights within a configurable sliding time window to obtain a communication quality score. The health assessment module internally maintains a circular buffer, the capacity of which is equal to the length of the sliding time window.
[0091] At the start of each monitoring sampling period, the health assessment module writes the multi-dimensional error status indicators output by S1 within that sampling period to the current pointer position of the circular buffer. The current pointer then increments sequentially. When the pointer reaches the end of the circular buffer, it wraps back to the beginning. Simultaneously with writing new samples, the earliest written samples in the circular buffer that have not yet exited the window are overwritten, enabling older samples to automatically exit the window over time. The health assessment module performs a weighted summation of all samples in the circular buffer according to preset weights to obtain a communication quality score. This communication quality score is output to the handover decision module in the form of a scalar register.
[0092] The calculation of the communication quality score can be expressed as follows: $L = \sum_{i=1}^{N} w_i E_i$, where $L$ is the communication quality score, $N$ is the number of dimensions of the multi-dimensional error status indicators, $w_i$ is the preset weight corresponding to the i-th dimension error status indicator, and $E_i$ is the cumulative amount of the i-th dimension error status indicator within the current sliding time window.
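The circular buffer and weighted summation can be sketched in Python as follows. This is a minimal software model, assuming one sample vector per monitoring sampling period with one count per indicator dimension; the class and method names are hypothetical.

```python
class HealthWindow:
    """Sliding time window held in a circular buffer of per-period samples."""

    def __init__(self, window_len, weights):
        self.weights = list(weights)
        # One slot per sampling period; each slot holds one count per dimension.
        self.buf = [[0] * len(self.weights) for _ in range(window_len)]
        self.ptr = 0

    def push(self, sample):
        # Overwrite the oldest slot, then advance and wrap the pointer.
        self.buf[self.ptr] = list(sample)
        self.ptr = (self.ptr + 1) % len(self.buf)
        return self.score()

    def score(self):
        # Per-dimension cumulative amount over the window, then the weighted sum.
        totals = [sum(column) for column in zip(*self.buf)]
        return sum(w * e for w, e in zip(self.weights, totals))
```

With a window of two periods and weights (1, 3), pushing the samples (1, 0), (0, 1), and (0, 0) yields scores 1, 4, and 3: the third push overwrites the first sample, which thereby leaves the window.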
[0093] Specifically, S2 includes sub-steps S21-S23.
[0094] S21. The health assessment module weights and sums the multi-dimensional error status indicators within sliding sub-windows at multiple time scales, obtaining one time-scale score for each time scale. The multiple time scales include at least a short-term scale, a medium-term scale, and a long-term scale. The short-term score is denoted $L_s$, the medium-term score is denoted $L_m$, and the long-term score is denoted $L_l$. The scores on the three time scales are calculated independently by three sets of parallel circular buffers and weighted summation units, and the results of the parallel calculations are updated synchronously in each monitoring sampling period.
[0095] In some embodiments, the short-term time scale is on the same order as the bit time of the CAN bus, the medium-term time scale is on the same order as the CAN frame period, and the long-term time scale is on the same order as the mission phase duration of the node. Specifically, taking a typical CAN bus baud rate as an example, with an assumed bit time of 1 microsecond, a standard frame period of approximately 100 to 200 microseconds, and a mission phase duration on the order of minutes to hours, the sliding sub-window corresponding to the short-term time scale can cover a time range of several microseconds to tens of microseconds, the sliding sub-window corresponding to the medium-term time scale can cover a time range of several milliseconds to tens of milliseconds, and the sliding sub-window corresponding to the long-term time scale can cover a time range of several seconds to tens of seconds. The three time scales are bound respectively to the characteristic times of the CAN protocol physical layer, the data link layer, and the task scheduling layer, enabling the health assessment to observe the bus status at different abstraction levels simultaneously: the short-term time scale captures bit-level transient disturbances, the medium-term time scale captures frame-level error accumulation, and the long-term time scale reflects the overall health trend within the mission phase.
[0096] S22. The health assessment module takes the larger of the time-scale score of the short-term scale and the time-scale score of the medium-term scale as the short-term urgency component. The short-term urgency component is denoted $U$, that is, $U = \max(L_s, L_m)$, where $L_s$ and $L_m$ are the short-term and medium-term scores, respectively. The physical significance of the short-term urgency component is that it captures transient faults whether they appear as bit-level single-point disturbances or as frame-level clusters. The larger value is taken, rather than the smaller value or the average, so that urgency at either scale is fully reflected in the short-term urgency component and is not diluted by a low value at the other scale.
[0097] S23. The health assessment module combines the short-term urgency component with the time-scale score of the long-term scale using a preset coefficient to obtain the communication quality score, denoted $L$. The preset coefficient is the long-term baseline correction factor, a real number within a preset interval, which scales the contribution of the long-term score. When a smaller value is taken, the contribution of the long-term score to the communication quality score is suppressed and the communication quality score is mainly driven by the short-term urgency component, so that the handover response speed is not slowed down by the long-term baseline. When a larger value is taken, the long-term score contributes more to the communication quality score, and the communication quality score becomes more sensitive to long-term cumulative degradation.
[0098] In a numerical example, the three time-scale scores obtained within a certain monitoring sampling period are substituted into S22 and S23 in turn, giving first the short-term urgency component and then the communication quality score. Such values are only typical; the actual weights and coefficients can be configured according to the mission phase.
[0099] The preset weights reflect the differences in error persistence. The multi-dimensional error status indicators are divided into a transient error class and a persistent error class based on error persistence, with the preset weight of the transient error class being lower than that of the persistent error class. Specifically, the preset weight of the transient error class is denoted $w_t$ and the preset weight of the persistent error class is denoted $w_p$, with $w_t < w_p$. In engineering practice, the ratio $w_t / w_p$ usually falls between 1/4 and 1/2.
[0100] Here is a specific example: given preset values of $w_t$ and $w_p$, if the cumulative number of transient error events is 5 and the cumulative number of persistent error events is 2 within a certain monitoring sampling period, then the score contribution for that period is $5 w_t + 2 w_p$. Transient errors are assigned lower weights because events such as bit stuffing errors and single cyclic redundancy check errors may be caused by occasional electromagnetic interference on the bus. These events themselves do not indicate continuous link degradation. Assigning too high a weight would make the communication quality score overly sensitive to occasional interference, leading to unnecessary handovers. Persistent errors, on the other hand, are assigned higher weights to reflect the severity of the continuous degradation of the corresponding link. A continuously rising transmit error counter and a physical layer signal pulse width anomaly ratio that continuously exceeds its limit both reflect a trend change in link characteristics, requiring a higher weight to amplify their representation in the communication quality score.
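Sub-steps S22 and S23 can be sketched in Python with hypothetical numbers. The additive form used below (urgency plus the correction factor times the long-term score) is an assumption made for illustration; the text states only that the correction factor controls how much the long-term score contributes. The function name, parameter names, and all values are likewise hypothetical.

```python
def fuse_scores(short_score, medium_score, long_score, correction_factor):
    """Combine three time-scale scores into one communication quality score."""
    # S22: the short-term urgency component is the larger of the two fast scores.
    urgency = max(short_score, medium_score)
    # S23 (assumed additive form): the long-term score is scaled by the factor.
    return urgency + correction_factor * long_score
```

With hypothetical scores of 2 (short-term), 5 (medium-term), and 4 (long-term) and a correction factor of 0.5, the urgency component is 5 and the fused score is 7.0. Setting the factor to 0 leaves the score driven by the urgency component alone.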
[0101] In some embodiments, the main control core and the backup control core use different preset weight combinations to weight the multi-dimensional error status indicators, and the two combinations share no key weight item in common. Specifically, the health assessment module instantiates two sets of preset weight lookup tables, one for the main control core and one for the backup control core. The preset weight combination corresponding to the main control core assigns a higher weight to bit stuffing errors, while the preset weight combination corresponding to the backup control core assigns a higher weight to a continuously rising transmit error counter. Since irradiation-induced single-event upsets often hit specific logic nodes within the FPGA at random, the differentiated preset weight combinations give the main and backup control cores different sensitivities to logic errors induced by the same irradiation source, reducing the probability that a single irradiation event simultaneously disturbs the judgment results of both cores.
[0102] The above describes one method for calculating the communication quality score in S2, namely, parallel evaluation using sliding sub-windows at three time scales combined with asymmetric fusion of short-term urgency and the long-term baseline. In other embodiments, S2 can also employ other calculation methods. For example, a raw score can be obtained by weighted summation of the multi-dimensional error status indicators within a sliding window at a single time scale, and this raw score can then be corrected for the long-term baseline using an exponentially weighted moving average, specifically in the form $L_t = \alpha R_t + (1 - \alpha) L_{t-1}$, where $L_t$ is the communication quality score at the current moment, $L_{t-1}$ is the communication quality score at the previous moment, $R_t$ is the raw score at the current moment, and $\alpha$ is the exponential weighting coefficient. This method differs from the three-time-scale parallel approach, has lower implementation complexity, and is suitable for scenarios with extremely limited FPGA resources; because the exponentially weighted moving average also achieves adaptive correction of the long-term baseline, it likewise falls within the implementations of S2.
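The exponentially weighted moving average alternative can be sketched in one Python function. The convention used here, in which the coefficient multiplies the new raw score and its complement multiplies the previous score, is the common textbook form and is an assumption about the intended arrangement; the names are hypothetical.

```python
def ewma_update(prev_score, raw_score, alpha):
    """One exponentially weighted moving average step (textbook convention)."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    # The new score blends the current raw score with the previous score.
    return alpha * raw_score + (1.0 - alpha) * prev_score
```

Only the previous score needs to be stored between sampling periods, which is why this variant needs far less storage than three parallel circular buffers.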
[0103] S3. The handover decision module compares the communication quality score with the warning threshold and the forced handover threshold. In response to the communication quality score crossing the warning threshold or reaching the forced handover condition, the transmission link of the data to be transmitted is switched from the main control core to the backup control core, completing the main / backup handover. The forced handover conditions include the communication quality score reaching the forced handover threshold and the main control core protocol engine entering a bus disconnect state. The handover decision module detects two independent conditions in parallel, and a forced handover is triggered as soon as either condition is met.
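The comparison logic can be sketched in Python as below. The sketch assumes that a higher communication quality score means a worse link, since the score is a weighted sum of error indicators, and that crossing the warning threshold starts pre-activation while either forced condition triggers the actual handover. The function name and return labels are hypothetical.

```python
def handover_decision(score, warning_threshold, forced_threshold, bus_off):
    """Hypothetical decision for one evaluation of the switching decision module."""
    # Either forced condition alone is sufficient: threshold reached or bus-off.
    if bus_off or score >= forced_threshold:
        return "forced_handover"
    # Between the two thresholds the backup control core is pre-activated.
    if score >= warning_threshold:
        return "pre_activate"
    return "stay"
```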
[0104] The warning threshold and the forced handover threshold are adjusted synchronously based on the posterior result of the effectiveness of the primary / backup handover, and the synchronous adjustment maintains a constant difference between the warning threshold and the forced handover threshold.
[0105] The physical significance of maintaining a constant difference lies in the fact that the window between the warning threshold and the forced handover threshold corresponds to the pre-activation time interval of the backup control core, i.e., the interval from the moment the warning threshold triggers the pre-activation action to the moment the forced handover threshold triggers activation. The backup control core uses this interval to complete preparatory actions such as clock startup and stabilization, protocol engine initialization, and data synchronization of the transmit buffer. If the difference changed during synchronous adjustment, the duration of the pre-activation interval would change with it, and the backup control core might be forced into the active state before its preparatory actions were complete, so that the hardware-level enable flip at the moment of handover would no longer be backed by the preceding preparations. Maintaining a constant difference keeps the pre-activation time budget stable throughout adaptive adjustment.
[0106] In some embodiments, in response to a validity posterior result indicating that the previous primary / standby handover was a false alarm, the handover decision module synchronously increases the warning threshold and the forced handover threshold; in response to a validity posterior result indicating that the previous primary / standby handover was a real failure, the handover decision module synchronously decreases the warning threshold and the forced handover threshold. Synchronous increases move the thresholds away from the current typical level of communication quality score, reducing the sensitivity of subsequent handovers and preventing false alarms from recurring under similar operating conditions; synchronous decreases move the thresholds closer to the current typical level of communication quality score, increasing the sensitivity of subsequent handovers and enabling real failures to be identified at an earlier stage. During long-term operation, synchronous adjustment causes the warning threshold and the forced handover threshold to converge to an operating point that matches the actual failure rate, avoiding both excessive sensitivity leading to frequent false handovers and excessive sluggishness causing real failures to be missed.
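Synchronous adjustment can be sketched in Python as follows. This is a minimal illustration assuming a fixed step size and the posterior labels shown; both are hypothetical, and how environmental interference affects the thresholds is not modelled.

```python
def adjust_thresholds(warning, forced, posterior, step):
    """Shift both thresholds by the same amount so their difference is constant."""
    if posterior == "false_alarm":
        delta = step      # raise both thresholds: less sensitive
    elif posterior == "real_fault":
        delta = -step     # lower both thresholds: more sensitive
    else:
        delta = 0         # other outcomes leave the thresholds unchanged here
    return warning + delta, forced + delta
```

Because the same delta is applied to both values, the gap between the warning threshold and the forced handover threshold is unchanged by any sequence of adjustments.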
[0107] The switching decision module, while performing dynamic fine-tuning via synchronous adjustment, also supports static benchmark switching based on mission phases. The method further includes the following steps: storing multiple sets of threshold vectors in a preset storage area of the field-programmable gate array (FPGA), each set corresponding to a different mission phase of the onboard CAN bus node. Typical mission phases include launch, steady-state orbit insertion, attitude maneuvering, orbit correction, and safety mode. In response to mission phase switching commands from the host computer, the switching decision module loads the threshold vector corresponding to the mission phase switching command as the benchmark value for both the warning threshold and the forced switching threshold, maintaining the accumulated multidimensional error state indicators within the sliding time window unchanged during the loading process. The adjustment amount of the synchronous adjustment is limited to a preset neighborhood of the benchmark value.
[0108] The mission phase switching command is issued by the host computer via a CAN diagnostic frame or a dedicated control interface and carries the target mission phase identifier. The switching decision module decodes this identifier and loads the corresponding threshold vector from the preset storage area into the threshold register. The loading process replaces only the contents of the threshold register; it neither resets the circular buffer and read/write pointers inside the health assessment module nor clears the accumulated multi-dimensional error status indicators, thereby avoiding score jumps and false triggers during threshold switching.
[0109] The adjustment amount of the synchronous adjustment is limited to a preset neighborhood of the reference value. Specifically, let the reference value of the current mission phase be $T_{\mathrm{ref}}$ and the preset neighborhood radius be $r$. The actual threshold $T$ after online adaptive adjustment then always satisfies $T_{\mathrm{ref}} - r \le T \le T_{\mathrm{ref}} + r$. Even after multiple consecutive online adaptive adjustments in the same direction, the sensitivity therefore does not deviate from the level expected for the mission phase.
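The bounded synchronous adjustment can be illustrated with a minimal Python sketch. The names `Thresholds` and `adjust`, the step size, the radius, and the numeric threshold values are all hypothetical and chosen only for illustration; the actual logic is implemented in FPGA hardware.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    warning: float
    forced: float


def adjust(current, reference, verdict, step=1.0, radius=5.0):
    """Shift both thresholds together, clamped around the reference values."""
    # False alarm: raise both thresholds (less sensitive).
    # Real fault: lower both thresholds (more sensitive).
    # Any other verdict leaves the thresholds unchanged.
    delta = {"false_alarm": step, "real_fault": -step}.get(verdict, 0.0)

    def clamp(value, centre):
        return max(centre - radius, min(centre + radius, value))

    return Thresholds(
        warning=clamp(current.warning + delta, reference.warning),
        forced=clamp(current.forced + delta, reference.forced),
    )


reference = Thresholds(warning=40.0, forced=70.0)  # hypothetical phase values
t = reference
for _ in range(10):  # ten consecutive false alarms
    t = adjust(t, reference, "false_alarm")
print(t)
```

In this sketch, ten upward adjustments in a row leave both thresholds pinned at the reference value plus the radius, which is the behavior the neighborhood limit is intended to guarantee.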
[0110] The switching decision module implements the switching of the transmission link through a multi-level state machine of the backup control core. Referring to Figure 4, the standby control core is in one of several states, including at least the sleep, hot standby, mirror synchronization, and active states. As the communication quality score crosses successive driving thresholds, the standby control core transitions in sequence from the sleep state to the hot standby state and then to the mirror synchronization state.
[0111] Specifically, the four states are as follows. In the sleep state, the backup control core's clock is off and all flip-flops and logic units inside the backup control core are static; power consumption is at its lowest, but the core cannot respond immediately to any action. In the hot standby state, the backup control core's clock is running while the backup transceiver group remains disabled; the internal logic units begin to work, but the transmission link has not yet been connected to the bus, so the core is awake but not transmitting. In the mirror synchronization state, the backup control core's protocol engine has completed initialization and the data in the main control core's transmit buffer has been synchronized to the backup control core's mirror register; the backup transceiver group remains disabled, and the internal state of the backup control core is fully aligned with that of the main control core, with only the final transceiver enable toggle outstanding. In the active state, the backup transceiver group is enabled and the backup control core takes over driving the bus, completing the primary/backup switchover.
[0112] The standby control core transitions to hot standby state when the communication quality score crosses a first driving threshold (which is below a warning threshold); transitions to mirror synchronization state when the communication quality score crosses the warning threshold; and transitions to active state when the communication quality score reaches a forced handover threshold or the main control core enters a bus disconnect state. The handover of the transmission link in S3 is achieved by the standby control core transitioning from mirror synchronization state to active state.
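The transitions described above can be sketched as a small Python state machine. This is only an illustrative model under stated assumptions: a higher communication quality score is taken to mean a worse link, "crossing" a threshold is modeled as the score being greater than or equal to it, and only one transition is taken per evaluation. The names `State` and `next_state` and the numeric thresholds are hypothetical.

```python
from enum import Enum


class State(Enum):
    SLEEP = 0
    HOT_STANDBY = 1
    MIRROR_SYNC = 2
    ACTIVE = 3


def next_state(state, score, first_drive, warning, forced, bus_off=False):
    """One evaluation of the standby-core state machine (assumed semantics)."""
    if state is State.SLEEP and score >= first_drive:
        return State.HOT_STANDBY   # clock starts and stabilizes
    if state is State.HOT_STANDBY and score >= warning:
        return State.MIRROR_SYNC   # protocol engine init and buffer mirroring
    if state is State.MIRROR_SYNC and (score >= forced or bus_off):
        return State.ACTIVE        # transceiver enable toggle only
    return state


s = State.SLEEP
for score in (10, 25, 45, 80):     # hypothetical, steadily degrading scores
    s = next_state(s, score, first_drive=20, warning=40, forced=70)
print(s)
```

Because each stage is entered before the forced switching threshold is reached, the final step in this model is a single transition out of the mirror synchronization state.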
[0113] The allocation of the total switching delay budget across the four states is key to achieving microsecond-level switching in this method. The total delay budget for the standby control core from the sleep state to the active state comprises the clock startup stabilization time, the protocol engine initialization time, the transmit buffer data synchronization time, and the transceiver enable toggle time. In a typical engineering implementation, the clock startup stabilization time is on the order of hundreds of microseconds, the protocol engine initialization time is on the order of tens of microseconds, the transmit buffer data synchronization time ranges from several to tens of microseconds, and the transceiver enable toggle takes a single clock cycle within the FPGA, which is in the sub-microsecond range. These values are typical; actual values vary with the FPGA process, clock frequency, and buffer size. If all of these delays were incurred at the moment of switching, the total delay would reach hundreds of microseconds. By absorbing the clock startup stabilization during the transition to the hot standby state, and absorbing the protocol engine initialization and transmit buffer data synchronization during the transition to the mirror synchronization state, the final switching moment consists only of the simple transceiver enable toggle, thereby compressing the critical delay of the primary/standby switchover to the microsecond level.
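The effect of this staged absorption can be shown with a short worked calculation in Python. The individual delay values below are hypothetical figures picked from within the typical ranges stated above; they are not measurements.

```python
# Hypothetical per-stage delays in nanoseconds (illustrative values only).
DELAYS_NS = {
    "clock_startup": 200_000,        # hundreds of microseconds
    "protocol_engine_init": 30_000,  # tens of microseconds
    "tx_buffer_sync": 10_000,        # several to tens of microseconds
    "transceiver_toggle": 100,       # sub-microsecond
}

# Stages completed before the switching moment thanks to staged wake-up.
ABSORBED_BEFORE_SWITCH = {"clock_startup", "protocol_engine_init", "tx_buffer_sync"}

cold_switch_ns = sum(DELAYS_NS.values())
staged_switch_ns = sum(
    delay for stage, delay in DELAYS_NS.items()
    if stage not in ABSORBED_BEFORE_SWITCH
)

print(cold_switch_ns / 1000, "us if every stage is incurred at the switching moment")
print(staged_switch_ns / 1000, "us on the critical path with staged wake-up")
```

With these assumed figures, the critical path shrinks from about 240 microseconds to the toggle time alone.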
[0114] The resource configuration of the standby control core in hot standby mode allows it to perform periodic self-tests. During hot standby, the standby control core periodically performs a self-test operation, and the result of the self-test operation serves as the migration gating for the standby control core to migrate from hot standby mode to mirror synchronization mode. If the self-test operation fails, the standby control core is prevented from migrating to mirror synchronization mode and a degraded operation is triggered.
[0115] The self-test operation includes configuring the protocol engine of the backup control core into internal loopback mode to automatically send and receive test frames and verify the consistency of the protocol decoding results of the test frames, as well as performing write-readback verification on the mirror register of the backup control core. Specifically, the internal loopback mode is activated by the protocol engine's working mode register. After activation, the protocol engine internally short-circuits its sending and receiving paths. Any sending frame generated by the protocol engine will be immediately received by the protocol engine's receiving path. The self-test module compares the received result of the test frame with the original sending content. If the two are consistent, the protocol engine itself is deemed to be functioning normally. The write-readback verification of the mirror register involves the self-test module writing preset test data bit by bit into the mirror register and immediately reading it back. If the read-back data is consistent with the written data, the mirror register storage function is deemed to be normal.
[0116] If either of the two checks fails, the self-test module sets the result of the self-test operation to failure, and this result is transmitted to the handover decision module as a gating signal for state transition. When the communication quality score crosses the warning threshold, the handover decision module first reads the latest self-test operation result. If it is successful, it allows the transition to mirror synchronization state; if it fails, it prohibits the transition to mirror synchronization state and causes the node to enter degraded operation. In degraded operation, the handover decision module continues to maintain the main control core as the transmission link driver, while simultaneously reporting a backup control core failure alarm to the host computer via CAN diagnostic frames, awaiting ground intervention within the telemetry and control arc.
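The self-test gating can be modeled with the following Python sketch. It assumes a software stand-in for the hardware: `loopback` is a hypothetical callable that returns what the protocol engine receives for a given transmitted frame, the mirror register is modeled as a byte array, and whole-byte test patterns are used instead of bit-by-bit writes. `StuckBitRegister` is an invented fault model used only to exercise the failure path.

```python
def self_test(loopback, mirror, patterns=(0x00, 0xFF, 0xAA, 0x55)):
    """Return True only if both the loopback and write-readback checks pass."""
    # Check 1: internal loopback. Every test frame must come back unchanged.
    for p in patterns:
        frame = bytes([p] * 8)
        if loopback(frame) != frame:
            return False
    # Check 2: write-readback on every byte of the mirror register.
    for p in patterns:
        for i in range(len(mirror)):
            mirror[i] = p
            if mirror[i] != p:
                return False
    return True


def on_warning_crossed(self_test_passed):
    """Gate the hot standby -> mirror synchronization transition."""
    return "mirror_sync" if self_test_passed else "degraded"


class StuckBitRegister(bytearray):
    """Hypothetical faulty register whose bit 0 is stuck at 0."""

    def __setitem__(self, index, value):
        super().__setitem__(index, value & 0xFE)


healthy = self_test(lambda frame: frame, bytearray(16))
faulty = self_test(lambda frame: frame, StuckBitRegister(16))
print(on_warning_crossed(healthy), on_warning_crossed(faulty))
```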
[0117] The processing of fault location tags is completed by the switching decision module. When a primary/backup switchover occurs, in addition to switching the transmission link from the main control core to the backup control core, the switching decision module also determines the bus side of the switching target based on the fault location tag output by step S1. In response to a fault location tag indicating a bus physical layer fault, the switching decision module switches the transmission link to the backup bus. In response to a fault location tag indicating an internal fault in the main control core, the switching decision module keeps the transmission link on the main bus; that is, it switches only to the backup control core on the same bus rather than to the backup bus. The physical basis for this processing is as follows. If the root cause of the fault is a complete failure of the main bus physical layer link, a backup control core switched onto the same bus would still be connected to the failed link and communication could not be restored. If the root cause is an internal anomaly in the main control core, the backup control core can restore communication simply by connecting to the main bus, without giving up the physical resources of the main bus.
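The target selection rule reduces to a small lookup, sketched here in Python. The tag strings and the function name `switch_target` are hypothetical labels for the two fault location tag values described above.

```python
def switch_target(fault_location_tag):
    """Return (control core, bus) that drives the link after switchover."""
    if fault_location_tag == "bus_physical_layer":
        # The main bus itself has failed, so the backup bus must be used.
        return ("backup_core", "backup_bus")
    if fault_location_tag == "main_core_internal":
        # The main bus is intact, so only the control core is replaced.
        return ("backup_core", "main_bus")
    raise ValueError(f"unknown fault location tag: {fault_location_tag!r}")


print(switch_target("main_core_internal"))
```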
[0118] In some embodiments, an event ring buffer is maintained within the field-programmable gate array (FPGA) to record the timestamp of each master-slave switchover, the communication quality score at the time of triggering, a snapshot of the error status indicators, the trigger source, and the bus status after the switchover. The event ring buffer is located in a preset storage area of the FPGA, with a capacity equal to a preset number of events. Each event record is written by the switchover decision module at the moment of the switchover. When the event ring buffer capacity is exhausted, the newest event overwrites the oldest event. Each event record includes: a timestamp of the switchover occurrence, the communication quality score at the moment of triggering the switchover, a snapshot of the cumulative values of all dimensions of the multi-dimensional error status indicators at the moment of triggering, a trigger source identifier (including trigger types such as the communication quality score crossing the forced switchover threshold, the master control core entering a bus disconnect state, and triggering by a host computer instruction), and the bus-side identifier driven by the standby control core after the switchover. The contents of the event ring buffer can be uploaded in batches to the host computer via CAN diagnostic frames within the ground control arc, providing a basis for switchover tracing for ground control and maintenance. This preserves maintainability without introducing software processes and facilitates locating the causes and handling links of each switchover during fault review.
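The overwrite-oldest behavior of the event ring buffer can be sketched in Python as follows. The field names, the capacity, and the sample values are hypothetical; the sketch only mirrors the record contents listed above.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class SwitchEvent:
    timestamp: int
    score: float
    indicator_snapshot: tuple
    trigger_source: str
    bus_after: str


class EventRing:
    """Fixed-capacity event log in which the newest event overwrites the oldest."""

    def __init__(self, capacity):
        self._events = deque(maxlen=capacity)

    def record(self, event):
        self._events.append(event)  # when full, the oldest entry is dropped

    def dump(self):
        """Return events oldest first, e.g. for batch upload to the host."""
        return list(self._events)


ring = EventRing(capacity=3)
for ts in range(5):
    ring.record(SwitchEvent(ts, 72.0, (4, 1, 0), "forced_threshold", "main_bus"))
print([event.timestamp for event in ring.dump()])
```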
[0119] The above describes one implementation of the primary/backup switchover decision in S3, namely a pre-activation and forced switchover decision based on dual thresholds plus a four-level backup control core state machine. In other embodiments, S3 can also employ other switchover decision methods. For example, only a single switchover threshold may be set, and the backup control core may maintain only two states, sleep and active. In response to the communication quality score reaching this single switchover threshold, the backup control core transitions directly from the sleep state to the active state to complete the switchover. This simplified method also achieves the purpose of primary/backup link switchover and therefore also belongs to the implementation of S3.
[0120] S4. The breakpoint retransmission module determines the breakpoint location based on the reception acknowledgment status of the frames already sent by the main control core when the master-slave switchover occurs. The standby control core then retransmits the unacknowledged frames from the breakpoint location, making the master-slave switchover transparent to the host computer. The implementation of S4 relies on the end-to-end acknowledgment mechanism established during the normal transmission phase of the main control core.
[0121] During the normal transmission phase, in which the main control core drives the main bus, the breakpoint retransmission module embeds a transmission sequence number that increments with each frame at a preset byte position in the data field of each frame of data to be transmitted. The transmission sequence number is inserted by the breakpoint retransmission module during the frame assembly phase, that is, before the main control core's protocol engine pushes the frame to be transmitted into the transmit buffer. Each frame therefore already carries a sequence number that increments globally from this node's perspective when it leaves the main control core and is driven onto the bus.
[0122] After receiving each frame, the designated receiving node writes the maximum sequence number it has successfully received into a preset byte position in the data field of the subsequent service transmission frames that it sends onto the bus. Such a service transmission frame thus constitutes a feedback frame. The feedback is piggybacked: it is not carried in a separate acknowledgment frame but is embedded in the designated receiving node's own service transmission frames. The feedback mechanism therefore introduces neither new frame traffic nor new protocol frame identifiers onto the main bus. The breakpoint retransmission module continuously monitors the service transmission frames sent by the designated receiving node on the bus, extracts the maximum sequence number from the preset byte position in the data field, and maintains it in the "latest acknowledged sequence number" register inside the breakpoint retransmission module.
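The sequence number embedding and piggybacked feedback can be illustrated with the following Python sketch. The byte positions `SEQ_OFFSET` and `ACK_OFFSET`, the one-byte sequence number width, and the class name `AckTracker` are assumptions made for illustration; sequence number wraparound is not handled here.

```python
SEQ_OFFSET = 0  # hypothetical byte position of the sequence number in sent frames
ACK_OFFSET = 7  # hypothetical byte position of the piggybacked acknowledgment


def embed_seq(data, seq):
    """Return a copy of an 8-byte data field with the sequence number embedded."""
    out = bytearray(data)
    out[SEQ_OFFSET] = seq & 0xFF
    return bytes(out)


class AckTracker:
    """Models the 'latest acknowledged sequence number' register."""

    def __init__(self):
        self.latest_acked = None  # nothing acknowledged yet

    def on_receiver_frame(self, data):
        # The receiver's ordinary service frame doubles as the feedback frame.
        self.latest_acked = data[ACK_OFFSET]


sent = [embed_seq(bytes(8), seq) for seq in range(5)]  # sequence numbers 0..4
tracker = AckTracker()
receiver_frame = bytearray(8)
receiver_frame[ACK_OFFSET] = 3  # receiver reports it has received up to 3
tracker.on_receiver_frame(bytes(receiver_frame))
print(tracker.latest_acked)
```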
[0123] The operational characteristics of the sequence number transmission mechanism ensure its transparency to normal business operations. The sequence number is used only during breakpoint retransmission to determine the breakpoint location; it does not trigger retransmission of data during non-breakpoint retransmission periods. That is, throughout the entire time the main control core is normally driving the bus transmission, even if the breakpoint retransmission module observes that a certain sequence number has not been acknowledged by the predetermined receiving node for an extended period, the breakpoint retransmission module will not initiate a retransmission based on this unacknowledged state. The sequence number mechanism is only activated at the moment of master / slave switchover, serving as the basis for breakpoint location. This switchover-specific design ensures that the sequence number transmission mechanism does not interfere with the protocol timing of the main control core's normal transmission and reception, nor does it conflict with the native error handling mechanism of the CAN protocol, minimizing compatibility impact with existing CAN application layers.
[0124] Specifically, S4 includes sub-steps S41-S43.
[0125] S41. Each frame of data to be transmitted has been embedded with a transmission sequence number that increments by frame. This sub-step describes a state that is already established at the moment of switching: during the normal transmission phase of the main control core, the breakpoint retransmission module continuously embeds the sequence number during the frame assembly phase, as described above.
[0126] S42. The breakpoint retransmission module receives a feedback frame from the predetermined receiving node of the data to be transmitted, and obtains the maximum sequence number that the predetermined receiving node has successfully received from the feedback frame. The maximum sequence number is the value read by the breakpoint retransmission module from its internal "latest confirmed sequence number" register at the moment of master-slave handover. This value reflects the sequence number of the last frame successfully received up to the time of handover from the perspective of the predetermined receiving node.
[0127] S43. The breakpoint retransmission module determines the next sequence number after the maximum sequence number as the breakpoint position, and the backup control core retransmits the unacknowledged frame from the breakpoint position. Specifically, the breakpoint retransmission module transmits the breakpoint position to the backup control core using "maximum sequence number + 1" as the breakpoint position. The backup control core uses this breakpoint position as the starting point, searches for the corresponding frame to be sent in its internal mirror register and shadow buffer, and retransmits it sequentially.
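Sub-steps S42 and S43 can be sketched in Python as follows. The mirror register is modeled as a dictionary from sequence number to frame payload, which is an assumption for illustration only, as are the function names.

```python
def breakpoint_position(max_acked_seq):
    """The breakpoint is the sequence number right after the last acknowledged one."""
    return max_acked_seq + 1


def frames_to_retransmit(mirror, max_acked_seq):
    """Select unacknowledged frames from the mirror, in sequence order."""
    start = breakpoint_position(max_acked_seq)
    return [(seq, mirror[seq]) for seq in sorted(mirror) if seq >= start]


# Hypothetical mirror contents: frames 3..7 were pending at switchover.
mirror = {seq: f"frame-{seq}".encode() for seq in range(3, 8)}
resend = frames_to_retransmit(mirror, max_acked_seq=4)
print([seq for seq, _ in resend])
```

In this example the receiver has acknowledged up to sequence number 4, so the backup control core resumes at 5 and frames 3 and 4 are not sent again.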
[0128] The retransmission of S43 not only covers the unacknowledged frames themselves, but also the key frames that have been filtered and retained according to the service semantic priority. This method maintains a shadow buffer in the preset storage area of the field-programmable gate array. The shadow buffer retains key frames in the data to be transmitted according to the service semantic priority. The release of key frames in the shadow buffer is triggered when the maximum sequence number reaches the preset release condition.
[0129] The shadow buffer's storage area is independent of the main control core's transmit buffer, and its capacity is the preset number of key frames. During the main control core's normal transmission phase, the breakpoint retransmission module performs a service semantic priority determination on each frame. If it is determined to be a key frame, a copy is retained in the shadow buffer while the frame leaves the main control core's transmit buffer and drives the main bus to transmit.
[0130] The preset release condition includes the maximum sequence number increasing to exceed the key frame's transmission sequence number plus a preset margin. That is, assuming the transmission sequence number of a certain key frame is $s$ and the preset margin is $m$, the breakpoint retransmission module releases the key frame from the shadow buffer when the maximum sequence number grows to exceed $s + m$. The engineering significance of the preset margin is to reserve a "regret window" for remote retransmission after occasional frame loss at the receiving end: even after the key frame has been confirmed by the predetermined receiving node, it is retained in the shadow buffer for a further $m$ frame times, to handle edge scenarios in which an anomaly occurs after the predetermined receiving node has confirmed the data and a remote retransmission is required.
[0131] S43 further includes the backup control core reading key frames whose transmission sequence number is greater than the maximum sequence number from the shadow buffer and retransmitting them. Retransmission from the shadow buffer is limited to those key frames whose transmission sequence number is greater than the maximum sequence number, rather than covering the entire shadow buffer, so that the receiver is not overwhelmed by repeated transmission of already confirmed key frames.
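The retention, release, and selective retransmission of key frames can be modeled with the Python sketch below. The class and method names are hypothetical, the shadow buffer is modeled as a dictionary keyed by sequence number, and the capacity limit of the real buffer is not modeled.

```python
class ShadowBuffer:
    """Keeps copies of key frames until they are safely past the margin."""

    def __init__(self, margin):
        self.margin = margin
        self._frames = {}  # transmission sequence number -> frame payload

    def retain(self, seq, frame):
        self._frames[seq] = frame

    def release_acknowledged(self, max_acked_seq):
        # Release a key frame once max_acked_seq exceeds seq + margin.
        for seq in [s for s in self._frames if max_acked_seq > s + self.margin]:
            del self._frames[seq]

    def to_retransmit(self, max_acked_seq):
        # Only key frames beyond the last acknowledged sequence number are resent.
        return [(s, self._frames[s]) for s in sorted(self._frames) if s > max_acked_seq]

    def retained(self):
        return sorted(self._frames)


shadow = ShadowBuffer(margin=2)
for seq in (1, 4, 6, 9):  # hypothetical key-frame sequence numbers
    shadow.retain(seq, b"key")
shadow.release_acknowledged(max_acked_seq=5)
print(shadow.retained(), [s for s, _ in shadow.to_retransmit(max_acked_seq=5)])
```

Here key frame 4 is already acknowledged but stays in the buffer because it is still inside the margin, while only key frames 6 and 9 are selected for retransmission.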
[0132] The method for determining service semantic priorities is maintainable. In some embodiments, service semantic priorities are stored in a preset storage area of the field-programmable gate array (FPGA) in the form of a service semantic priority mapping table, which can be updated through an external diagnostic interface. Specifically, the service semantic priority mapping table uses the frame identifier of the frame to be transmitted as the index and the service semantic priority level as the value. During each frame assembly stage, the breakpoint retransmission module looks up the service semantic priority of the frame in the table using its frame identifier and decides accordingly whether to retain the frame in the shadow buffer. The service semantic priority mapping table is decoupled from the CAN protocol frame identifier priority: it is determined by the system service designer based on the scope of influence of each instruction and is independent of the numerical order of the CAN IDs. The table is updated by the host computer or by ground telemetry and control via CAN diagnostic frames or a dedicated control interface, allowing the key frame classification to be adjusted in orbit throughout the satellite's entire lifespan without re-planning the CAN ID allocation.
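A minimal Python sketch of the mapping table follows. The priority levels, the cut-off `KEY_FRAME_LEVEL`, the frame identifiers, and the default level for unlisted identifiers are all hypothetical. The example relies on the standard CAN arbitration rule that a numerically lower identifier wins arbitration, to show that the service semantic priority is independent of it.

```python
KEY_FRAME_LEVEL = 2  # hypothetical: levels at or above this mark a key frame


class PriorityTable:
    """Maps frame identifiers to service semantic priority levels."""

    def __init__(self, table):
        self._table = dict(table)

    def is_key_frame(self, frame_id):
        # Unlisted identifiers default to the lowest level (assumption).
        return self._table.get(frame_id, 0) >= KEY_FRAME_LEVEL

    def update(self, frame_id, level):
        """In-orbit update, e.g. driven by a diagnostic frame."""
        self._table[frame_id] = level


# 0x010 wins bus arbitration over 0x400, yet only 0x400 is a key frame here.
table = PriorityTable({0x010: 0, 0x400: 3})
print(table.is_key_frame(0x010), table.is_key_frame(0x400))
table.update(0x010, 2)  # reclassify without changing the CAN ID plan
print(table.is_key_frame(0x010))
```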
[0133] After the primary/standby switchover is completed, this method performs closed-loop verification of the switchover effectiveness. Following S4, the method further includes the following steps: starting a verification timer; within the duration of the verification timer, acquiring in real time the multi-dimensional error status indicators generated by the standby control core's protocol engine at the physical layer and data link layer, and weighting and combining them within a sliding time window as in S2 to obtain a verification period communication quality score; and determining, based on the verification period communication quality score and the standby control core's monitoring results for the main bus, a validity posterior result for the primary/standby switchover, which is categorized into at least three types: false alarm, real fault, and environmental interference. In response to the validity posterior result being environmental interference, the method enters a safe mode in which the transceiver groups corresponding to the main and standby control cores remain disabled for transmission and enabled for reception. The validity posterior result is used as the basis for the synchronous adjustment.
[0134] The verification timer duration is configurable, typically ranging from tens to hundreds of milliseconds, so as to cover the time window required for link stabilization and a small number of frame interactions after the switchover. This range is typical; the actual value can be configured according to the mission phase. Within the verification timer duration, the health regression verification module repeatedly performs the same operations as S1 and S2 on the transmission link driven by the backup control core: it reads, through a bypass path, the multi-dimensional error status indicators generated by the backup control core's protocol engine and weights them within a sliding time window to obtain a communication quality score. The resulting score is the verification period communication quality score.
[0135] The specific criteria for the three-state posterior classification are determined jointly from the verification period communication quality score and the fault location tag. In some embodiments, the three-state classification is further refined using the backup control core's monitoring results for the main bus: if the verification period communication quality score is lower than the warning threshold, the validity posterior result is a false alarm; if the verification period communication quality score remains higher than the warning threshold and the backup control core can monitor frames sent by the main control core on the main bus but the decoding error ratio of those frames exceeds a preset ratio, the validity posterior result is a real fault; if the verification period communication quality score remains higher than the warning threshold and the decoding error ratio that the backup control core obtains across the main bus as a whole also exceeds the preset ratio, the validity posterior result is environmental interference.
[0136] The physical meaning of the three-state classification is as follows. A low communication quality score during the verification period indicates that the link has recovered to a healthy state after the switchover; the earlier threshold crossing was an occasional transient disturbance, i.e., a false alarm. A high communication quality score during the verification period, coupled with the backup control core detecting decoding errors concentrated in the frames sent by the main control core on the main bus, indicates a persistent anomaly in the main control core's internal protocol engine or transmit driver, i.e., a real fault. A high communication quality score during the verification period, coupled with the backup control core detecting widespread decoding errors across the entire main bus, indicates that the root cause of the fault lies not in the main control core or the main bus physical layer but in external environmental factors such as radiation storms or power supply anomalies, i.e., environmental interference. In response to the validity posterior result indicating environmental interference, the method enters a safe mode. In this safe mode, the transceiver groups corresponding to both the main and backup control cores remain disabled for transmission and enabled for reception until the environmental interference subsides, after which the switchover decision module attempts to reconnect to the bus for transmission, or until the host computer intervenes.
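The three-state classification can be expressed as a short Python sketch. Several points are assumptions rather than part of the described method: the bus-wide error ratio is checked before the main-control-core error ratio (since bus-wide errors would also affect that core's frames), a score equal to the warning threshold is not treated as a false alarm, and the `"undetermined"` result is an invented placeholder for the case not covered by the three categories.

```python
def classify(verif_score, warning, core_err_ratio, bus_err_ratio, preset_ratio):
    """Classify a completed switchover from verification-period observations."""
    if verif_score < warning:
        return "false_alarm"                 # link looks healthy after switchover
    if bus_err_ratio > preset_ratio:
        return "environmental_interference"  # errors widespread across the bus
    if core_err_ratio > preset_ratio:
        return "real_fault"                  # errors concentrated in main-core frames
    return "undetermined"                    # not covered by the three categories


# Hypothetical observations: warning threshold 40, preset ratio 0.2.
print(classify(25, 40, 0.0, 0.0, 0.2))
print(classify(55, 40, 0.6, 0.05, 0.2))
print(classify(55, 40, 0.6, 0.7, 0.2))
```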
[0137] The validity posterior results serve as the basis for synchronous adjustment. Specifically, the health regression verification module categorizes and feeds back the validity posterior results to the handover decision module. The handover decision module then synchronously adjusts the warning threshold and the forced handover threshold based on these results: the accumulation of false alarm samples drives the threshold to be adjusted synchronously upwards, the accumulation of real fault samples drives the threshold to be adjusted synchronously downwards, and environmental interference samples do not participate in threshold adjustment because scores exceeding the threshold under environmental interference do not reflect the failure rate of the link itself. This three-state posterior classification and feeding back enables the threshold adaptive loop to have closed-loop feedback, gradually converging to the optimal operating point based on actual fault samples throughout the satellite's entire lifespan.
[0138] In some embodiments, a health summary of the local node is embedded in the data field of a preset low-priority frame of the data to be transmitted. The health summary includes at least the recent communication quality score trend and fault location tag status of the local node, which is used by other nodes on the same bus for multi-node joint handover decisions. The embedding of the health summary is performed by the breakpoint retransmission module during the frame assembly stage. A preset low-priority service frame in the data to be transmitted is selected as the carrier frame of the health summary. The summary content is written in a preset byte position in the data field of the frame. The summary content includes the average and direction of change of the communication quality score of the local node in the most recent K monitoring sampling periods, and the currently effective fault location tag value. The health summary provides other nodes on the same bus with a view of the local node's health status, enabling the handover decision of a single node to incorporate the health status of neighboring nodes to form a system-level redundant view. It also reserves an extension interface for multi-node collaborative handover or fault location, and can be used to build a cross-node fault association diagnosis mechanism in subsequent model iterations.
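One possible encoding of the health summary is sketched below in Python. The three-byte layout, the byte offset within the data field, the direction codes, and the function names are all hypothetical; the description above fixes only the content of the summary (recent score average, direction of change, and current fault location tag), not its encoding.

```python
import struct


def health_summary(scores, fault_tag, k=4):
    """Pack mean, trend direction and fault tag of the last k scores into 3 bytes."""
    recent = scores[-k:]
    mean = round(sum(recent) / len(recent))
    change = recent[-1] - recent[0]
    direction = 1 if change > 0 else 2 if change < 0 else 0  # 0 flat, 1 up, 2 down
    return struct.pack("BBB", min(max(mean, 0), 255), direction, fault_tag)


def embed_summary(data, summary, offset=5):
    """Write the summary into a low-priority frame's data field at a preset offset."""
    out = bytearray(data)
    out[offset:offset + len(summary)] = summary
    return bytes(out)


summary = health_summary([10, 12, 15, 20, 30], fault_tag=2)
frame = embed_summary(bytes(8), summary)
print(frame.hex())
```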
[0139] The above describes one implementation of the breakpoint retransmission in S4, which combines end-to-end transmission sequence numbers, piggybacked feedback frames, and a shadow buffer. In other embodiments, S4 can also employ other breakpoint retransmission methods. For example, the sequence number can be omitted during the normal transmission phase of the main control core, and the breakpoint retransmission module directly reads the tail pointer of the main control core's transmit buffer at the moment of primary/backup switchover, using the frame pointed to by the tail pointer as the breakpoint position. The limitation of this method is that the tail pointer of the transmit buffer reflects the transmission progress only from the perspective of this node, so frames that have not yet been received by the receiving node may be missed during retransmission. Nevertheless, a snapshot of the tail pointer of the main control core's transmit buffer also achieves the purpose of locating the breakpoint at the moment of switchover and therefore also belongs to the implementation of S4.
[0140] To facilitate understanding of the collaborative relationships between the steps of this method by those skilled in the art, a series of specific scenarios are used to illustrate the entire process from fault initiation to link recovery.
[0141] Assume that the satellite is in a steady-state on-orbit phase and that, due to long-term cumulative irradiation, the density of single-event upset events increases in the FPGA area where the main control core is located. This causes occasional anomalies in the frame assembly and error state machine management of the main control core's protocol engine. Specifically, the transmission error counter continues to accumulate and the density of cyclic redundancy check error events gradually increases, while the physical layer link of the main bus itself is intact.
[0142] In each monitoring sampling period, the monitoring logic module reads, through the bypass path, the transmission error counter and the cyclic redundancy check error events generated by the main control core's protocol engine, and detects that the cumulative amounts of these two indicators are continuously increasing on a long-term scale. The health assessment module weights these indicators within the long-term sliding sub-window to obtain a slowly increasing long-term score, and the long-term baseline component of the communication quality score rises accordingly.
[0143] When the protocol engine anomaly further deteriorates, triggering a concentrated outbreak of Cyclic Redundancy Check (CRC) error events, the monitoring logic module detects this increase in event density during event-based metric collection. The health assessment module then weights this metric within a mid-term sliding window, resulting in a rapid increase in the mid-term score. This increase dominates the short-term urgency component, causing the communication quality score to quickly surpass the first driving threshold.
[0144] In response to this threshold crossing, the switching decision module moves the backup control core from the sleep state to the hot standby state, and the internal clock of the backup control core starts and stabilizes. If the short-term urgency component increases further, causing the communication quality score to exceed the warning threshold, the switching decision module moves the backup control core from the hot standby state to the mirror synchronization state. The backup control core's protocol engine is initialized, and the data in the main control core's transmit buffer is synchronized to the backup control core's mirror register.
[0145] Meanwhile, the fault location tag is generated by the monitoring logic module based on the backup control core's monitoring results for the main bus. While its transmission is disabled, the backup control core continuously monitors the main bus and can detect frames transmitted by the main control core. Because the frame decoding error ratio of those frames exceeds the preset ratio, the fault location tag indicates an internal fault in the main control core. Based on this, the switchover decision module determines that the switchover target is the backup control core on the same bus, and the transmission link remains on the main bus.
[0146] When the communication quality score reaches the forced handover threshold, the handover decision module moves the backup control core from mirror synchronization to active state, completing the primary / backup handover. Simultaneously, the breakpoint retransmission module determines the breakpoint location based on its internally maintained latest confirmed sequence number. From the breakpoint location, the backup control core retransmits unconfirmed frames and key frames retained in the shadow buffer on the main bus.
[0147] After the switchover is complete, the health regression verification module starts the verification timer. During the verification period, it repeatedly performs the operations of S1 and S2 on the main bus link now driven by the backup control core to obtain the verification-period communication quality score. The main bus physical layer itself is intact, and the internal state of the backup control core is normal after it takes over transmission on the main bus, but the tail of the abnormal frame sequence sent before the main control core was disabled still remains on the main bus. In addition, after the original main control core is disabled, the monitoring side continues to observe residual signs of decoding abnormality on the original main control core's transmission path. As a result, the verification-period communication quality score remains above the warning threshold, and the decoding error ratio that the backup control core obtained for the main control core exceeds the preset ratio. The validity posterior result is therefore a real fault.
[0148] Based on this result, the switching decision module synchronously lowers the warning threshold and the forced handover threshold, so that the thresholds converge to an operating point that matches the current actual failure rate. The event ring buffer also records the complete event information for this switchover, so that ground telemetry and control can review the fault during a subsequent telemetry and control arc.
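The synchronous threshold adjustment can be sketched as follows, combining the lowering described above with the constant difference of claim 3 and the neighborhood limit of claim 4. The function name, the parameter names, the integer score scale, and the choice to clamp the warning threshold around its reference value are assumptions of this sketch.

```python
from typing import Tuple


def adjust_thresholds(warning: int, forced: int, step: int,
                      ref_warning: int, max_offset: int) -> Tuple[int, int]:
    """Shift both thresholds by the same amount, keeping their difference.

    A negative step lowers the thresholds (for example after a real fault).
    The shifted warning threshold is clamped to
    [ref_warning - max_offset, ref_warning + max_offset].
    """
    gap = forced - warning                      # constant difference to preserve
    lowest = ref_warning - max_offset
    highest = ref_warning + max_offset
    new_warning = max(lowest, min(highest, warning + step))
    return new_warning, new_warning + gap
```

For example, with a reference warning threshold of 50 and a permitted offset of 8, `adjust_thresholds(50, 80, -5, 50, 8)` returns `(45, 75)`, and a further call `adjust_thresholds(45, 75, -5, 50, 8)` is clamped to `(42, 72)`.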
[0149] Over the entire process, from the accumulation of abnormalities in the main control core's protocol engine to the backup control core taking over transmission on the main bus, the critical switching moment, which corresponds to toggling the transceiver enables, lasts only a few microseconds. From the host computer's perspective, the primary/backup switchover is transparent to the data stream to be sent, and the link degradation caused by the abnormality inside the main control core does not result in any loss of service-level data at the host computer level.
[0150] Those skilled in the art will clearly understand that, for the sake of convenience and brevity, the above-described division of functional units and modules is used as an example. In practical applications, the above functions can be assigned to different functional units and modules as needed, that is, the internal structure of the device can be divided into different functional units or modules to complete all or part of the functions described above.
[0151] The above embodiments are only used to illustrate the technical solutions of this application, and are not intended to limit them. Although this application has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some of the technical features. Such modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the spirit and scope of the technical solutions of the embodiments of this application, and should all be included within the protection scope of this application.
Claims
1. A method for fast switching of dual redundancy on a spaceborne CAN bus, characterized in that the method is applied to a spaceborne CAN bus node that uses a field-programmable gate array (FPGA) to carry a main control core and a backup control core; the main control core and the backup control core are connected to the main bus and the backup bus respectively through independent transceiver groups; the main control core receives data to be sent from the host computer and drives its transmission onto the main bus through the corresponding transceiver group; and the method includes the following steps: S1. acquiring in real time the multi-dimensional error status indicators generated by the protocol engine of the main control core at the physical layer and data link layer; S2. within a configurable sliding time window, combining the multi-dimensional error status indicators according to preset weights to obtain a communication quality score; S3. in response to the communication quality score crossing the warning threshold or reaching the forced handover condition, switching the transmission link of the data to be sent from the main control core to the backup control core, completing the primary/backup switchover; S4. during the primary/backup switchover, determining the breakpoint location based on the reception confirmation status of the frames sent by the main control core, and retransmitting the unacknowledged frames from the breakpoint location by the backup control core, so that the primary/backup switchover is transparent to the host computer.
2. The method for fast switching of dual redundancy on a spaceborne CAN bus according to claim 1, characterized in that S2 includes the following sub-steps: S21. within the sliding sub-window corresponding to each of multiple time scales, weighting and summing the multi-dimensional error status indicators to obtain a time scale score for each time scale, the multiple time scales including at least a short-term, a medium-term and a long-term scale; S22. taking the larger of the time scale score of the short-term scale and the time scale score of the medium-term scale as the short-term urgency component; S23. combining the short-term urgency component with the time scale score of the long-term scale according to a preset coefficient to obtain the communication quality score.
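By way of non-limiting illustration, sub-steps S21 to S23 can be sketched as follows. The indicator names, the weight values, and the use of a convex combination with coefficient `alpha` for S23 are assumptions; the claim only requires a combination according to a preset coefficient.

```python
from typing import Mapping

# Hypothetical per-indicator weights; names and values are illustrative only.
WEIGHTS = {"crc_error": 4, "bit_error": 2, "ack_error": 1}


def window_score(counts: Mapping[str, int],
                 weights: Mapping[str, int] = WEIGHTS) -> float:
    """S21: weighted sum of the error indicators counted in one sub-window."""
    return float(sum(w * counts.get(name, 0) for name, w in weights.items()))


def communication_quality_score(short: Mapping[str, int],
                                medium: Mapping[str, int],
                                long: Mapping[str, int],
                                alpha: float = 0.75) -> float:
    """S22 and S23: larger of short/medium scores, combined with long-term."""
    urgency = max(window_score(short), window_score(medium))   # S22
    return alpha * urgency + (1.0 - alpha) * window_score(long)  # S23 (assumed form)
```

Taking the maximum in S22 means that a burst visible only in the medium-term sub-window, such as a cluster of CRC errors, raises the urgency component even when the short-term sub-window is momentarily quiet.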
3. The method for fast switching of dual redundancy on a spaceborne CAN bus according to claim 2, characterized in that the multi-dimensional error status indicators are divided into a transient error class and a persistent error class according to the persistence of the error, and the preset weight corresponding to the transient error class is lower than the preset weight corresponding to the persistent error class; the forced handover condition includes a forced handover threshold; the warning threshold and the forced handover threshold are adjusted synchronously based on the validity posterior result of the primary/backup switchover, and the synchronous adjustment maintains a constant difference between the warning threshold and the forced handover threshold.
4. The method for fast switching of dual redundancy on a spaceborne CAN bus according to claim 3, characterized in that the method further includes: storing multiple sets of threshold vectors in a preset storage area of the field-programmable gate array, each set of threshold vectors corresponding to a different task stage of the spaceborne CAN bus node; in response to a task stage switching command from the host computer, loading the threshold vector corresponding to the task stage switching command as the reference values of the warning threshold and the forced handover threshold, the multi-dimensional error status indicators accumulated in the sliding time window remaining unchanged during the loading process; the adjustment amount of the synchronous adjustment is limited to a preset neighborhood range of the reference values.
5. The method for fast switching of dual redundancy on a spaceborne CAN bus according to claim 1, characterized in that the backup control core is maintained in one of multiple states, which include at least a hibernation state, a hot standby state, a mirror synchronization state, and an active state; in response to the communication quality score successively crossing multiple driving thresholds, the backup control core transitions in sequence from the hibernation state to the hot standby state and then to the mirror synchronization state; the switching of the transmission link in S3 is achieved by the backup control core moving from the mirror synchronization state to the active state.
6. The method for fast switching of dual redundancy on a spaceborne CAN bus according to claim 5, characterized in that the backup control core periodically performs a self-test operation while in the hot standby state, and the result of the self-test operation serves as the gating condition for the backup control core to migrate from the hot standby state to the mirror synchronization state; if the self-test operation fails, the backup control core is prevented from migrating into the mirror synchronization state and a degraded operation is triggered.
7. The method for fast switching of dual redundancy on a spaceborne CAN bus according to claim 1, characterized in that S1 further includes generating a fault location tag based on the monitoring results of the backup control core on the main bus, wherein the fault location tag at least distinguishes between bus physical-level faults and internal faults of the main control core; the switching of the transmission link in S3 further determines the bus side of the switching target based on the fault location tag; in response to the fault location tag indicating a bus physical-level fault, the transmission link is switched to the backup bus; in response to the fault location tag indicating an internal fault of the main control core, the transmission link remains on the main bus.
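The use of the fault location tag to select the bus side can be illustrated by the following sketch. The internal-fault criterion (decoding error ratio above a preset ratio) follows paragraph [0145]; treating the complete absence of observed frames as a bus physical-level fault is an assumption of this sketch, as are all names.

```python
from enum import Enum
from typing import Optional


class FaultTag(Enum):
    BUS_PHYSICAL = "bus_physical"
    CORE_INTERNAL = "core_internal"


def locate_fault(frames_seen: int, decode_errors: int,
                 preset_ratio: float) -> Optional[FaultTag]:
    """Derive a tag from the backup control core's monitoring of the main bus."""
    if frames_seen == 0:
        return FaultTag.BUS_PHYSICAL     # assumed criterion: nothing observable
    if decode_errors / frames_seen > preset_ratio:
        return FaultTag.CORE_INTERNAL    # frames arrive but decode badly
    return None                          # no fault located


def select_bus(tag: FaultTag) -> str:
    """Bus side on which the backup control core transmits after switching."""
    return "backup_bus" if tag is FaultTag.BUS_PHYSICAL else "main_bus"
```

With a preset ratio of 0.2, observing 100 frames of which 30 fail to decode yields `FaultTag.CORE_INTERNAL`, so the backup control core takes over on the main bus.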
8. The method for fast switching of dual redundancy on a spaceborne CAN bus according to claim 1, characterized in that S4 includes the following sub-steps: S41. embedding a frame-by-frame incrementing transmission sequence number in each frame of the data to be sent; S42. receiving a feedback frame from the predetermined receiving node of the data to be sent, and obtaining from the feedback frame the maximum sequence number that the predetermined receiving node has successfully received; S43. determining the sequence number following the maximum sequence number as the breakpoint location, the backup control core retransmitting the unacknowledged frames from the breakpoint location.
9. The method for fast switching of dual redundancy on a spaceborne CAN bus according to claim 8, characterized in that the method maintains a shadow buffer in a preset storage area of the field-programmable gate array, and the shadow buffer retains key frames of the data to be sent according to service semantic priority; a key frame in the shadow buffer is released when the maximum sequence number meets a preset release condition; S43 further includes the backup control core reading from the shadow buffer the key frames whose transmission sequence numbers are greater than the maximum sequence number and retransmitting them.
10. The method for fast switching of dual redundancy on a spaceborne CAN bus according to claim 3, characterized in that, after S4, the method further includes: starting a verification timer; while the verification timer is running, acquiring in real time the multi-dimensional error status indicators generated by the protocol engine of the backup control core at the physical layer and data link layer, and combining them within the sliding time window in the manner described in S2 to obtain a verification-period communication quality score; determining the validity posterior result of the primary/backup switchover based on the verification-period communication quality score and the monitoring results of the backup control core on the main bus, wherein the validity posterior result falls into at least three categories: false alarm, real fault and environmental interference; in response to the validity posterior result being environmental interference, entering a safe mode in which the transceiver groups corresponding to the main control core and the backup control core remain disabled for transmitting and enabled for receiving; the validity posterior result is used as the basis for the synchronous adjustment.