A fault-tolerant recovery system and method applied to a heterogeneous training cluster

By designing a fault-tolerant recovery system in a heterogeneous training cluster, and utilizing a unified interface and scheduling layer to monitor and recover training tasks across cloud vendors, the fault tolerance problem of heterogeneous training tasks across cloud vendors is solved, reducing costs and improving training efficiency.

CN120371575B · Active · Publication Date: 2026-06-23 · SHANGHAI ARTIFICIAL INTELLIGENCE INNOVATION CENT

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
SHANGHAI ARTIFICIAL INTELLIGENCE INNOVATION CENT
Filing Date
2025-02-27
Publication Date
2026-06-23

AI Technical Summary

Technical Problem

Existing fault tolerance solutions cannot effectively support fault tolerance for large-scale heterogeneous training tasks across cloud vendors, resulting in high costs and an inability to perceive the global information of the training task.

Method used

Design a fault-tolerant recovery system for heterogeneous training clusters, including a scheduling layer and a functional layer. The functional layer sets up a predefined unified interface for accessing different cloud vendors and chips. Log monitoring, status monitoring, node detection and alarm notification are realized through a unified task interface and cloud vendor interface. The scheduling layer executes training task monitoring, inspection and recovery processes.

Benefits of technology

It achieves fault tolerance for heterogeneous training tasks across cloud vendors, reduces costs and improves training efficiency, and supports rapid access and expansion of various heterogeneous chips.

✦ Generated by Eureka AI based on patent content.


Abstract

The application relates to a fault-tolerant recovery system and method for heterogeneous training clusters. The system comprises a scheduling layer and a functional layer. The functional layer is provided with a predefined unified interface for accessing different cloud vendors and different chips, and executes log monitoring, status monitoring, node detection and alarm notification processes. The scheduling layer executes training task monitoring, training task inspection, fault analysis and training recovery processes. The method comprises the following steps: based on the predefined unified interface, a plurality of cloud vendors and a plurality of chips are accessed; a fault-tolerant recovery process is then executed, in which the status of every training task is polled, bad node detection is performed, bad nodes are removed and the training task is restarted, and corresponding alarm prompts are given; the parameter service is also checked and its exceptions are handled. Compared with the prior art, the application can support training fault tolerance for heterogeneous training clusters across cloud vendors, facilitate expansion to a plurality of heterogeneous chips, and improve training efficiency.

Description

Technical Field

[0001] This invention relates to the field of model training technology, and in particular to a fault-tolerant recovery system and method for heterogeneous training clusters.

Background Technology

[0002] In large-scale distributed training of large models, fault-tolerant systems play a crucial role. Because the training process typically involves significant computational resources, long execution times (days or even weeks), and complex data and model distributions, any hardware failure, software error, or network problem can cause training interruptions or failures. The role of a fault-tolerant system is to ensure that the training process can continue in the face of these failures, thereby minimizing resource waste and time loss.

[0003] Fault-tolerant systems typically need to implement the following functions: ① Checkpointing: Periodically saving the training state (implemented by the training framework). ② Task Retry: Automatically retrying failed tasks. ③ Failover: Transferring tasks to backup resources. ④ Logging & Monitoring: Tracking the system status in real time.

[0004] In large-scale distributed training of large models, fault-tolerant systems play a crucial role in ensuring the stability, reliability, and efficiency of training. Through techniques such as fault detection, checkpointing, and resource reallocation, they minimize the impact of faults on training, thereby reducing resource waste and time costs, and supporting large-scale, long-term training tasks.

[0005] As the number of model parameters increases, so does the computing power required by large training clusters, and it often takes multiple cloud vendors supplying various chips to meet the demand. Because the chips within a heterogeneous cluster differ significantly, large-scale heterogeneous distributed training generally uses an asynchronous parallel scheme based on parameter servers (3DPS). As shown in Figure 1, a training task is broken down into several subtasks, which allows various chips from multiple cloud vendors to be used for training: the training subtasks can be deployed on different chips from different cloud vendors.

[0006] However, existing fault-tolerance solutions are all designed for training tasks running on homogeneous chips from a single cloud vendor, and cannot handle large-scale heterogeneous training tasks across cloud vendors. Since they support only a single cloud vendor and a single type of chip, large-scale heterogeneous training across cloud vendors often requires deploying a separate fault-tolerance service on each cloud vendor, and sometimes one for each type of chip. This results in high costs, and no individual fault-tolerance service is aware of the global information of the training task.

Summary of the Invention

[0007] The purpose of this invention is to overcome the shortcomings of the prior art by providing a fault-tolerant recovery system and method for heterogeneous training clusters, which can support cross-cloud vendor training fault tolerance for heterogeneous training clusters, facilitate the expansion of various heterogeneous chips, and improve training efficiency.

[0008] The objective of this invention can be achieved through the following technical solution: a fault-tolerant recovery system for heterogeneous training clusters, comprising a scheduling layer and a functional layer, wherein the functional layer is provided with a predefined unified interface for accessing different cloud vendors and different chips, and the functional layer is used to perform log monitoring, status monitoring, node detection and alarm notification processes.

[0009] The scheduling layer is used to perform training task monitoring, training task inspection, fault analysis, and training recovery processes.

[0010] Furthermore, the unified interface includes a data structure, a unified task interface, and a unified cloud vendor interface.

[0011] Furthermore, the data structure includes an abstract task data structure, an abstract cloud vendor data structure, a bad node data structure, and an inspection information summary data structure.
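The four data structures named above could be sketched as plain Python dataclasses. The source names the structures but does not list their fields, so every field name, the `TaskState` values, and the identifiers `vendor-a` and `chip-x` below are assumptions made for illustration only.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class TaskState(Enum):
    # Hypothetical state set; the source does not enumerate task states.
    RUNNING = "running"
    ABNORMAL = "abnormal"
    STOPPED = "stopped"


@dataclass
class CloudVendor:
    """Abstract cloud vendor: identifies a vendor and how to reach it."""
    vendor_id: str
    api_endpoint: str  # assumed field


@dataclass
class BadNode:
    """A node flagged by bad node detection."""
    node_name: str
    vendor_id: str
    reason: str = ""


@dataclass
class Task:
    """Abstract task: one training subtask or one parameter service."""
    task_id: str
    vendor_id: str
    chip_type: str
    state: TaskState = TaskState.STOPPED


@dataclass
class InspectionSummary:
    """Summary collected by one task inspection pass."""
    task_id: str
    state: TaskState
    avg_tgs: Optional[float] = None
    min_loss: Optional[float] = None
    avg_loss: Optional[float] = None
    max_loss: Optional[float] = None
    loss_curve: List[float] = field(default_factory=list)


task = Task(task_id="subtask-0", vendor_id="vendor-a", chip_type="chip-x")
summary = InspectionSummary(task_id=task.task_id, state=task.state)
```

Keeping the vendor and chip as plain identifiers on the task is one way to let the scheduling layer treat all tasks uniformly while still knowing where each one runs.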

[0012] Furthermore, the unified task interface includes:

[0013] Start the task;

[0014] Obtain task status through cloud vendors;

[0015] Delete bad nodes;

[0016] Stop the task;

[0017] Obtain the task's most recent loss (i.e., loss value) from the logs;

[0018] Determine if the task is stuck based on the logs;

[0019] Obtain the training task performance values from the logs;

[0020] Get other custom monitoring metrics;

[0021] The cloud vendor identifier used.
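As a rough illustration, the nine items of the unified task interface could map onto an abstract base class like the one below. The class name, method names, and signatures are hypothetical; the source lists only the operations, not how they are declared. `DummyTask` is a stand-in used to show that a concrete task must supply every operation.

```python
from abc import ABC, abstractmethod
from typing import Dict, List, Optional


class UnifiedTask(ABC):
    """Hypothetical Python rendering of the unified task interface."""

    def __init__(self, task_id: str, vendor_id: str) -> None:
        self.task_id = task_id
        self.vendor_id = vendor_id  # the cloud vendor identifier used

    @abstractmethod
    def start(self) -> None:
        """Start the task."""

    @abstractmethod
    def get_status(self) -> str:
        """Obtain the task status through the cloud vendor."""

    @abstractmethod
    def delete_bad_nodes(self, node_names: List[str]) -> None:
        """Delete bad nodes."""

    @abstractmethod
    def stop(self) -> None:
        """Stop the task."""

    @abstractmethod
    def latest_loss(self) -> Optional[float]:
        """Most recent loss value found in the logs, if any."""

    @abstractmethod
    def is_stuck(self) -> bool:
        """Whether the logs indicate the task is stuck."""

    @abstractmethod
    def performance(self) -> Optional[float]:
        """Training performance value found in the logs, if any."""

    def custom_metrics(self) -> Dict[str, float]:
        """Other custom monitoring metrics; empty by default (assumption)."""
        return {}


class DummyTask(UnifiedTask):
    """In-memory stand-in; a real task would call a vendor and parse logs."""

    def __init__(self, task_id: str, vendor_id: str) -> None:
        super().__init__(task_id, vendor_id)
        self._status = "stopped"

    def start(self) -> None:
        self._status = "running"

    def get_status(self) -> str:
        return self._status

    def delete_bad_nodes(self, node_names: List[str]) -> None:
        pass

    def stop(self) -> None:
        self._status = "stopped"

    def latest_loss(self) -> Optional[float]:
        return None

    def is_stuck(self) -> bool:
        return False

    def performance(self) -> Optional[float]:
        return None
```

Because `UnifiedTask` is abstract, instantiating it directly raises `TypeError`, which makes a missing adapter method visible at construction time rather than during a recovery.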

[0022] Furthermore, the unified cloud vendor interface includes:

[0023] Start the task based on the task ID;

[0024] Get the task status based on the task ID;

[0025] Delete bad nodes;

[0026] Stop the task based on the task ID.
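A minimal sketch of the unified cloud vendor interface follows, assuming one adapter object per vendor. The names `CloudVendorClient` and `InMemoryVendor`, the status strings, and the argument passed to `delete_bad_nodes` are assumptions; a real adapter would call the vendor's own service interface instead of updating a dictionary.

```python
from abc import ABC, abstractmethod
from typing import Dict, List


class CloudVendorClient(ABC):
    """Hypothetical rendering of the unified cloud vendor interface."""

    @abstractmethod
    def start_task(self, task_id: str) -> None:
        """Start the task based on the task ID."""

    @abstractmethod
    def get_task_status(self, task_id: str) -> str:
        """Get the task status based on the task ID."""

    @abstractmethod
    def delete_bad_nodes(self, node_names: List[str]) -> None:
        """Delete bad nodes."""

    @abstractmethod
    def stop_task(self, task_id: str) -> None:
        """Stop the task based on the task ID."""


class InMemoryVendor(CloudVendorClient):
    """Stand-in adapter that records calls instead of contacting a vendor."""

    def __init__(self) -> None:
        self.status: Dict[str, str] = {}
        self.removed_nodes: List[str] = []

    def start_task(self, task_id: str) -> None:
        self.status[task_id] = "running"

    def get_task_status(self, task_id: str) -> str:
        return self.status.get(task_id, "unknown")

    def delete_bad_nodes(self, node_names: List[str]) -> None:
        self.removed_nodes.extend(node_names)

    def stop_task(self, task_id: str) -> None:
        self.status[task_id] = "stopped"


# One fault-tolerance service can hold adapters for several vendors
# behind the same interface, keyed by a vendor identifier.
vendors: Dict[str, CloudVendorClient] = {
    "vendor-a": InMemoryVendor(),
    "vendor-b": InMemoryVendor(),
}
vendors["vendor-a"].start_task("job-1")
```

Looking adapters up by vendor identifier is what lets a single service see tasks on every vendor, rather than running one service per vendor.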

[0027] A fault-tolerant recovery method for heterogeneous training clusters includes the following steps:

[0028] S1. Connect to multiple cloud vendors and various chips based on a predefined unified interface;

[0029] S2. Execute the fault-tolerant recovery process: Poll the status of each training task, detect bad nodes, remove bad nodes and restart the training task, and issue corresponding alarm prompts.

[0030] In addition, check the parameter services and handle their exceptions.

[0031] Furthermore, the specific process of step S1 is as follows:

[0032] S11. Based on the predefined unified task interface, connect to each training task and parameter service respectively;

[0033] Connect to multiple cloud vendors based on a predefined unified cloud vendor interface;

[0034] S12. Adapt the corresponding training framework for each heterogeneous chip.

[0035] Furthermore, in step S11, accessing multiple cloud vendors specifically involves implementing the unified cloud vendor interface for each vendor: by calling the cloud vendor's service interface, the operations of starting the training task, obtaining the task status, deleting bad nodes, and stopping the task are implemented respectively. In addition, logs are obtained through the unified cloud vendor interface.

[0036] Furthermore, in step S2, polling the status of each training task specifically involves performing a task inspection process at a preset time interval to collect the task status; the average Tgs (performance metric) of the most recent M training iterations (iter), the minimum loss value, the average loss value, the maximum loss value; and the loss value curve.
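The statistics collected by the inspection process can be computed with a few lines of Python. This is only a sketch: the function name and dictionary keys are hypothetical, and it assumes the loss curve covers the same window of the most recent M iterations, which the source does not state.

```python
from statistics import mean
from typing import Dict, Sequence


def summarize_recent(
    tgs: Sequence[float], losses: Sequence[float], m: int
) -> Dict[str, object]:
    """Summarize the most recent m iterations of per-iteration metrics."""
    if m <= 0:
        raise ValueError("m must be positive")
    recent_tgs = list(tgs[-m:])
    recent_losses = list(losses[-m:])
    if not recent_tgs or not recent_losses:
        raise ValueError("no iterations recorded yet")
    return {
        "avg_tgs": mean(recent_tgs),
        "min_loss": min(recent_losses),
        "avg_loss": mean(recent_losses),
        "max_loss": max(recent_losses),
        "loss_curve": recent_losses,  # assumed to span the same window
    }


result = summarize_recent(
    tgs=[10, 20, 30, 40], losses=[4.0, 3.0, 2.0, 1.0], m=2
)
```

With fewer than M iterations available, the slice simply returns what exists, so early inspections still produce a summary.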

[0037] Furthermore, the specific process of step S2 is as follows:

[0038] S21. Check each training subtask sequentially and handle any exceptions:

[0039] If the task status is abnormal, trigger a task status abnormality alarm, stop the training subtask, perform bad node detection, remove the bad nodes, and then restart the subtask.

[0040] Analyze the logs for abnormal loss values and, if any are found, trigger an abnormal loss value alarm.

[0041] If the training task is stuck, trigger a task stuck alarm, stop the training subtask, perform bad node detection, remove the bad nodes, and then restart the subtask.

[0042] If the performance of the training task is lower than the preset normal value, trigger a low performance alarm, stop the training subtask, perform bad node detection, remove the bad nodes, and then restart the subtask.

[0043] If other preset or custom metrics are abnormal, trigger an alarm;

[0044] S22. Check parameter services and handle abnormal situations:

[0045] If the task status is abnormal, trigger a task status abnormality alarm, stop the task, and then restart the task.

[0046] If a preset custom monitoring metric is abnormal, trigger an alarm.
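Steps S21 and S22 can be sketched as two small check functions. Everything named here is hypothetical (`FakeSubtask`, the alarm strings, the `"running"` status, the `min_perf` threshold), and two simplifications are assumptions rather than part of the source: a loss is treated as abnormal when it is not finite, and when several conditions fire in one pass the subtask is recovered only once.

```python
import math
from typing import Callable, List, Optional


class FakeSubtask:
    """Stand-in for a training subtask; a real one would call a cloud vendor."""

    def __init__(self, status="running", loss=1.0, stuck=False, perf=100.0):
        self.status, self.loss, self.stuck, self.perf = status, loss, stuck, perf
        self.nodes = ["node-0", "node-1"]
        self.events: List[str] = []

    def get_status(self) -> str:
        return self.status

    def latest_loss(self) -> Optional[float]:
        return self.loss

    def is_stuck(self) -> bool:
        return self.stuck

    def performance(self) -> Optional[float]:
        return self.perf

    def stop(self) -> None:
        self.status = "stopped"
        self.events.append("stop")

    def delete_bad_nodes(self, names: List[str]) -> None:
        self.nodes = [n for n in self.nodes if n not in names]
        self.events.extend(f"delete:{n}" for n in names)

    def start(self) -> None:
        self.status, self.stuck = "running", False
        self.events.append("start")


def recover(task, detect_bad_nodes: Callable) -> None:
    """Stop, detect bad nodes, remove them, then restart."""
    task.stop()
    bad = detect_bad_nodes(task)
    if bad:
        task.delete_bad_nodes(bad)
    task.start()


def check_subtask(task, detect_bad_nodes: Callable, min_perf: float) -> List[str]:
    """S21: return the alarms raised for one subtask, recovering if needed."""
    alarms: List[str] = []
    needs_recovery = False
    if task.get_status() != "running":
        alarms.append("task_status_abnormal")
        needs_recovery = True
    loss = task.latest_loss()
    if loss is not None and not math.isfinite(loss):
        alarms.append("loss_abnormal")  # alarm only, no restart
    if task.is_stuck():
        alarms.append("task_stuck")
        needs_recovery = True
    perf = task.performance()
    if perf is not None and perf < min_perf:
        alarms.append("low_performance")
        needs_recovery = True
    if needs_recovery:
        recover(task, detect_bad_nodes)
    return alarms


def check_parameter_service(service) -> List[str]:
    """S22: an abnormal parameter service is stopped and restarted."""
    if service.get_status() == "running":
        return []
    service.stop()
    service.start()
    return ["task_status_abnormal"]
```

Note the asymmetry taken from the text: an abnormal loss only raises an alarm, while abnormal status, a stuck task, and low performance also trigger the stop, detect, remove, restart sequence.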

[0047] Compared with the prior art, the present invention has the following advantages:

[0048] This invention designs a fault-tolerant recovery system for heterogeneous training clusters, comprising a scheduling layer and a functional layer. The functional layer has predefined unified interfaces, including a unified task interface and a unified cloud vendor interface, for accessing different cloud vendors and different chips. The functional layer performs log monitoring, status monitoring, node detection, and alarm notification processes. The scheduling layer performs training task monitoring, training task inspection, fault analysis, and training recovery processes. This enables the fault-tolerant service to support training tasks from multiple cloud vendors and with multiple heterogeneous chips. By defining a unified interface, the differences in cloud vendor and chip architecture can be masked. After heterogeneous chips from multiple cloud vendors are adapted and accessed based on the unified interface, the fault-tolerant service can achieve a unified implementation independent of specific cloud vendors and specific chips.

[0049] This invention defines the access process for training tasks and parameter services, enabling various heterogeneous training tasks from different cloud vendors to quickly access fault-tolerant services based on a unified task interface and a unified cloud vendor interface.

[0050] This invention designs a fault-tolerant recovery process, which includes task inspection, anomaly alarm, and automatic recovery. This process can support cross-cloud vendor fault tolerance for various anomalies that occur during large-scale heterogeneous cluster distributed training, thereby reducing costs and improving training efficiency.

Attached Figure Description

[0051] Figure 1 is a schematic diagram of a large-scale heterogeneous cluster distributed training framework in the prior art;

[0052] Figure 2 is a schematic diagram of the method flow of the present invention;

[0053] Figure 3 is a schematic diagram of the application architecture in the embodiment;

[0054] Figure 4 is a schematic diagram of the process of connecting cloud vendors and chips;

[0055] Figure 5 is a schematic diagram of the fault-tolerant recovery process for the training task.

Detailed Implementation

[0056] The present invention will now be described in detail with reference to the accompanying drawings and specific embodiments.

[0057] Example

[0058] A fault-tolerant recovery system for heterogeneous training clusters includes a scheduling layer and a functional layer. The functional layer is equipped with a predefined unified interface for accessing different cloud vendors and different chips. The functional layer is used to perform log monitoring, status monitoring, node detection and alarm notification processes.

[0059] The scheduling layer is used to perform training task monitoring, training task inspection, fault analysis, and training recovery processes.

[0060] Based on the above system, a fault-tolerant recovery method for heterogeneous training clusters is implemented. As shown in Figure 2, it includes the following steps:

[0061] S1. Connect to multiple cloud vendors and various chips based on a predefined unified interface;

[0062] S2. Execute the fault-tolerant recovery process: Poll the status of each training task, detect bad nodes, remove bad nodes and restart the training task, and issue corresponding alarm prompts.

[0063] In addition, check the parameter services and handle their exceptions.

[0064] This embodiment applies the above solution to build the application architecture shown in Figure 3, which mainly includes:

[0065] I. Unified Interface

[0066] 1.1 Data Structure Definition

[0067] Abstract task data structure, abstract cloud vendor data structure, bad node data structure, and inspection information summary data structure.

[0068] 1.2 Unified Interface Definition

[0069] The unified task interface definition includes: ① starting the task; ② obtaining the task status through the cloud vendor; ③ deleting bad nodes; ④ stopping the task; ⑤ obtaining the task's most recent loss from the logs; ⑥ determining whether the task is stuck based on the logs; ⑦ obtaining the training task performance value from the logs; ⑧ obtaining other custom monitoring metrics; and ⑨ the cloud vendor identifier used.

[0070] The unified cloud vendor interface includes: ① starting a task based on the task ID; ② obtaining the task status based on the task ID; ③ deleting bad nodes; and ④ stopping a task based on the task ID.

[0071] II. Accessing multiple cloud vendors and multiple chips based on a unified interface

[0072] As shown in Figure 4, each training task and parameter service is accessed through the unified task interface. The parameter service implements the following: ① starting the task; ② obtaining the task status through the cloud vendor; ③ stopping the task; ④ the cloud vendor identifier used; and ⑤ obtaining custom monitoring metrics.

[0073] Each cloud vendor is connected by implementing the unified cloud vendor interface, which provides the following: an implementation for starting a task, which starts the training task by calling the cloud vendor's service interface;

[0074] an implementation for obtaining the task status, which calls the cloud vendor's service interface to obtain the current status of the training task;

[0075] an implementation for deleting bad nodes, which calls the cloud vendor's service interface;

[0076] an implementation for stopping a task, which calls the cloud vendor's service interface;

[0077] an implementation for retrieving logs.

[0078] Furthermore, a training framework is adapted for each heterogeneous chip, and each adapted framework, running as a training subtask, generates its own training logs. In practical applications, it is only necessary to develop a function that analyzes these training logs and obtains the following metrics:

[0079] Get the most recent loss value for the task;

[0080] Obtain the training task's tgs (performance value);

[0081] Determine if the training task is stuck.
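The log analysis function described above might look like the following sketch. The log format is entirely hypothetical (whitespace-separated `key=value` pairs with `ts`, `iter`, `loss`, and `tgs`), as is the rule that a task counts as stuck when its newest record is older than a timeout; each chip's training framework would need its own parser for its real log format.

```python
from typing import Dict, Iterable, List, Optional


def parse_line(line: str) -> Optional[Dict[str, float]]:
    """Parse one assumed-format log line; return None for unrelated lines."""
    tokens = dict(tok.split("=", 1) for tok in line.split() if "=" in tok)
    try:
        return {k: float(tokens[k]) for k in ("ts", "iter", "loss", "tgs")}
    except (KeyError, ValueError):
        return None


def parse_log(lines: Iterable[str]) -> List[Dict[str, float]]:
    """Keep only the lines that carry training metrics."""
    return [rec for rec in map(parse_line, lines) if rec is not None]


def latest_loss(records: List[Dict[str, float]]) -> Optional[float]:
    return records[-1]["loss"] if records else None


def latest_tgs(records: List[Dict[str, float]]) -> Optional[float]:
    return records[-1]["tgs"] if records else None


def is_stuck(records: List[Dict[str, float]], now: float, timeout_s: float) -> bool:
    """Stuck if no metric record exists or the newest one is too old."""
    if not records:
        return True
    return now - records[-1]["ts"] > timeout_s


log = [
    "ts=100 iter=1 loss=2.5 tgs=400",
    "WARNING: slow link",
    "ts=160 iter=2 loss=2.25 tgs=410",
]
records = parse_log(log)
```

Only this parsing layer is chip-specific; the three metric functions feed the unified task interface unchanged.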

[0082] III. Fault Recovery Process

[0083] The design includes an alarm module and a task inspection module. The alarm module is used to implement: ① task restart alarm; ② training task stuck alarm; ③ alarm for specific strings appearing in the log; ④ alarm for bad nodes detected; ⑤ abnormal task status; ⑥ abnormal loss value alarm; ⑦ alarm for training task performance below normal value.
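The seven alarm types listed above could be modeled as an enumeration with a small dispatcher. The enum member names, the message format, and the idea of pluggable "sinks" are assumptions; the source does not describe how alarms are delivered.

```python
from enum import Enum
from typing import Callable, Iterable, List


class AlarmType(Enum):
    TASK_RESTART = "task_restart"
    TASK_STUCK = "task_stuck"
    LOG_KEYWORD = "log_keyword"
    BAD_NODE = "bad_node"
    STATUS_ABNORMAL = "status_abnormal"
    LOSS_ABNORMAL = "loss_abnormal"
    LOW_PERFORMANCE = "low_performance"


class AlarmService:
    """Formats alarms and forwards them to every registered sink."""

    def __init__(self) -> None:
        self._sinks: List[Callable[[str], None]] = []

    def add_sink(self, sink: Callable[[str], None]) -> None:
        self._sinks.append(sink)

    def notify(self, alarm: AlarmType, task_id: str, detail: str = "") -> str:
        message = f"[{alarm.value}] task={task_id} {detail}".strip()
        for sink in self._sinks:
            sink(message)
        return message


def keyword_hits(lines: Iterable[str], keywords: Iterable[str]) -> List[str]:
    """Return log lines containing any of the watched strings."""
    keys = list(keywords)
    return [line for line in lines if any(k in line for k in keys)]


sent: List[str] = []
service = AlarmService()
service.add_sink(sent.append)  # a real sink might post to a chat tool
msg = service.notify(AlarmType.TASK_STUCK, "subtask-3", "no new iteration")
```

A sink is any callable that accepts a string, so a list's `append` works for testing and a messaging client could be substituted later.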

[0084] The task inspection module is used to collect the following: ① task status; ② average TGS (performance metric), minimum loss, average loss, and maximum loss for the most recent 50 training iterations (iter); ③ loss curve.

[0085] During execution, the fault-tolerant scheduler collects training task information and sends the information to the alarm service, which then determines whether an alarm is triggered.

[0086] The inspection dispatcher collects training task information and inspection information, and sends the task inspection information to the alarm tool.

[0087] The specific fault-tolerant recovery process includes the following three sub-processes:

[0088] 1. As shown in Figure 5, each training subtask is checked in turn, and abnormal situations are handled as follows:

[0089] ① The task status is abnormal, triggering a task status abnormality alarm. Stop the training subtask, perform bad node detection, remove bad nodes, and then restart the subtask;

[0090] ② Analyze loss anomalies from the logs and trigger loss anomaly alerts.

[0091] ③ The training task freezes, triggering a task freeze alarm. Stop the training subtask, perform a bad node detection, remove the bad nodes, and then restart the subtask;

[0092] ④ The training task performance is lower than normal, triggering a low performance alarm. Stop the training subtask, perform bad node detection, remove bad nodes, and then restart the subtask.

[0093] ⑤ Other custom metrics are abnormal, triggering an alarm.

[0094] 2. Check parameter services and abnormal situation handling procedures:

[0095] ① If the task status is abnormal, a task status abnormality alarm will be triggered. Stop the task and then restart the task.

[0096] ② When a custom monitoring metric is abnormal, an alarm is triggered.

[0097] 3. Task Inspection Process

[0098] Execute once at fixed time intervals to collect: ① task status; ② average TGS (performance metric), minimum loss, average loss, and maximum loss for the most recent 50 training iterations; ③ loss curve. Combine these data into a report and send it to the user's communication tools.
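The periodic inspection could be driven by a loop like the one below. This is a sketch under assumptions: the function names, the report layout, and the injectable `sleep` parameter are all hypothetical, and the loss curve is left out of the text report because the source does not say how it is rendered.

```python
import time
from typing import Callable, Dict


def format_report(task_id: str, summary: Dict[str, object]) -> str:
    """Combine one task's inspection data into a plain-text report."""
    lines = [f"Inspection report for {task_id}"]
    for key in ("status", "avg_tgs", "min_loss", "avg_loss", "max_loss"):
        lines.append(f"  {key}: {summary.get(key)}")
    return "\n".join(lines)


def run_inspections(
    collect: Callable[[], Dict[str, Dict[str, object]]],
    send: Callable[[str], None],
    interval_s: float,
    rounds: int,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Collect and send reports `rounds` times, waiting between rounds."""
    for i in range(rounds):
        for task_id, summary in collect().items():
            send(format_report(task_id, summary))
        if i < rounds - 1:
            sleep(interval_s)


def fake_collect() -> Dict[str, Dict[str, object]]:
    # Stand-in for the real inspection; values are made up.
    return {"t1": {"status": "running", "avg_tgs": 410.0,
                   "min_loss": 1.0, "avg_loss": 1.5, "max_loss": 2.0}}


sent, naps = [], []
run_inspections(fake_collect, sent.append, 60.0, 2, sleep=naps.append)
```

Passing `sleep` as a parameter lets the loop be exercised without actually waiting, as the last line does by recording the requested delays in a list.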

[0099] In summary, this solution defines a unified interface for training clusters with heterogeneous chips from multiple cloud vendors (including defining multiple unified abstract data structures, a unified task interface, and a unified cloud vendor interface), and a unified interface for adapting to heterogeneous chips from multiple cloud vendors (multiple cloud vendors implement a unified cloud vendor interface, and tasks running on different chips adapt to implement a unified task interface, thereby eliminating differences between cloud vendors and chips). This decouples the fault-tolerant process from heterogeneous chips from cloud vendors, making it easier to expand to multiple heterogeneous chips and reducing chip access costs.

[0100] On the other hand, this solution defines the fault tolerance process, including fault recovery methods that work across multiple cloud vendors and multiple heterogeneous chips, together with the handling process for each type of fault. It can therefore provide training fault tolerance for heterogeneous training clusters that span multiple cloud vendors, which reduces costs and improves training efficiency.

[0101] This solution was applied in practice and ran for several months in a training task using a variety of domestically produced chips from two cloud vendors, totaling tens of thousands of cards. The results have verified that this solution can achieve the aforementioned technical effects.

Claims

1. A fault-tolerant recovery system applied to heterogeneous training clusters, characterized in that it includes a scheduling layer and a functional layer; the functional layer is equipped with a predefined unified interface for accessing different cloud vendors and different chips, and is used to perform log monitoring, status monitoring, node detection and alarm notification processes; the scheduling layer is used to perform training task monitoring, training task inspection, fault analysis, and training recovery processes; the unified interface includes a data structure, a unified task interface, and a unified cloud vendor interface; the unified task interface includes: starting the task; obtaining the task status through cloud vendors; deleting bad nodes; stopping the task; retrieving the most recent loss value for the task from the logs; determining whether the task is stuck based on the logs; obtaining the training task performance values from the logs; getting other custom monitoring metrics; and the cloud vendor identifier used; the unified cloud vendor interface includes: starting the task based on the task ID; getting the task status based on the task ID; deleting bad nodes; and stopping the task based on the task ID.

2. The fault-tolerant recovery system for heterogeneous training clusters according to claim 1, characterized in that the data structure includes an abstract task data structure, an abstract cloud vendor data structure, a bad node data structure, and an inspection information summary data structure.

3. A fault-tolerant recovery method for heterogeneous training clusters, implemented based on the fault-tolerant recovery system for heterogeneous training clusters as described in claim 1, characterized in that it includes the following steps: S1. connect to multiple cloud vendors and various chips based on a predefined unified interface; S2. execute the fault-tolerant recovery process: poll the status of each training task, detect bad nodes, remove the bad nodes and restart the training task, and issue corresponding alarm prompts; and perform checks and exception handling on parameter services; the specific process of S1 is as follows: S11. based on the predefined unified task interface, connect to each training task and parameter service respectively; connect to multiple cloud vendors based on the predefined unified cloud vendor interface; S12. adapt the corresponding training framework for each heterogeneous chip; in S11, accessing multiple cloud vendors specifically involves using the unified cloud vendor interface: by calling the cloud vendor service interface, the functions of starting training tasks, obtaining task status, deleting bad nodes, and stopping tasks are implemented respectively, and logs are obtained using the unified cloud vendor interface; the specific process of S2 is as follows: S21. check each training subtask sequentially and handle any exceptions: if the task status is abnormal, trigger a task status abnormality alarm, stop the training subtask, perform bad node detection, remove the bad nodes, and then restart the subtask; analyze the logs to identify abnormal loss values and trigger an abnormal loss value alarm; if the training task is stuck, trigger a task stuck alarm, stop the training subtask, perform bad node detection, remove the bad nodes, and then restart the subtask; if the performance of the training task is lower than the preset normal value, trigger a low performance alarm, stop the training subtask, perform bad node detection, remove the bad nodes, and then restart the subtask; if preset custom metrics are abnormal, trigger an alarm; S22. check parameter services and handle abnormal situations: if the task status is abnormal, trigger a task status abnormality alarm, stop the task, and then restart the task; if preset custom monitoring metrics are abnormal, trigger an alarm.

4. The fault-tolerant recovery method for heterogeneous training clusters according to claim 3, characterized in that, in step S2, polling the status of each training task specifically involves performing a task inspection process at a preset time interval to collect the task status; the average performance index tgs of the most recent M training iterations, the minimum loss value, the average loss value, the maximum loss value; and the loss value curve.