A fault detection method applied to heterogeneous training clusters
By adopting a unified interface for group communication and training detection processes in heterogeneous training clusters, faulty nodes can be quickly located, solving the problems of long time consumption, high cost and low detection rate in existing technologies. This achieves efficient and accurate fault detection and is suitable for large-scale heterogeneous training clusters.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- SHANGHAI ARTIFICIAL INTELLIGENCE INNOVATION CENT
- Filing Date
- 2025-02-27
- Publication Date
- 2026-06-30
Description
Technical Field
[0001] This invention relates to the field of model training technology, and in particular to a fault detection method applied to heterogeneous training clusters.
Background Technology
[0002] Hardware failures are unavoidable during the training of large models. When a hardware failure causes a training task to malfunction, it is necessary to quickly detect and remove the faulty node, replace it with a backup node, and then resume training. As the number of model parameters increases, the computing power required for training large models also increases, and the cluster size also grows larger. In large-scale heterogeneous training clusters composed of multiple chips, it is crucial to quickly and accurately detect the faulty node when the training task is interrupted due to hardware failure (the longer the training task is interrupted, the higher the financial cost).
[0003] The commonly used solution is to use a binary search approach for communication detection and training detection. This involves dividing all nodes into two groups, detecting each group separately, and further dividing the problematic group. However, communication detection only checks communication functions between training cards (such as allreduce and allgather). While the test is quick, it cannot fully reproduce the training environment (which involves high computational load and high hardware temperature). Training detection, on the other hand, requires preparing a model with the same structure as the training task but a smaller number of parameters (training can be done with a small number of nodes). Then, a group of nodes is used to train this small-parameter model to see if it can train normally.
[0004] In summary, binary search detection tasks cannot all be run in parallel, resulting in long processing times and high costs. Each round requires submitting two detection tasks, and the problematic group is then split again. For example, for a ten-thousand-card training cluster (8 cards per node, 1250 nodes in total), 8 rounds of detection are required (group sizes of 625->312->156->78->39->20->10->5 nodes).
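The round count above can be checked with a short calculation. This is a minimal sketch, not part of the patent: it assumes the suspect group is halved each round, rounding up as the worst case, and that the search stops once the group holds 5 nodes or fewer, which is what the listed sequence suggests. The function name `halving_rounds` and the stop size are illustrative assumptions, and rounding up gives slightly different intermediate sizes from those quoted.

```python
import math
from typing import List


def halving_rounds(total_nodes: int, stop_size: int) -> List[int]:
    """Return the suspect-group size after each halving round."""
    sizes = []
    size = total_nodes
    while size > stop_size:
        # Worst case: the fault sits in the larger half.
        size = math.ceil(size / 2)
        sizes.append(size)
    return sizes


sizes = halving_rounds(total_nodes=1250, stop_size=5)
print(len(sizes), sizes)
```

Under these assumptions the loop runs 8 times for 1250 nodes, matching the 8 sequential rounds quoted in the text.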
[0005] Furthermore, communication-based detection has a low fault detection rate. The computational load on the chip is relatively low during communication detection, so the training environment (high computational load and high hardware temperature) cannot be fully reproduced, and faulty nodes may go undetected. Additionally, because the existing approach supports only homogeneous chips, adapting it to heterogeneous training clusters is prohibitively costly.
Summary of the Invention
[0006] The purpose of this invention is to overcome the shortcomings of the existing technology and provide a fault detection method for heterogeneous training clusters, which can accurately and efficiently detect faulty nodes and support fault detection of large-scale heterogeneous training clusters.
[0007] The objective of this invention can be achieved through the following technical solution: a fault detection method applied to heterogeneous training clusters, comprising the following steps:
[0008] S1. Start the detection process based on the defined unified interface;
[0009] S2. Execute the group communication detection process. If a faulty node is detected, output the detection result and end the current detection process; otherwise, execute step S3.
[0010] S3. Execute the group training and detection process, output the detection results, and end the current detection process.
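The control flow of steps S2 and S3 can be sketched as follows. This is an illustrative outline only: the function `run_detection` and the two callables it receives are hypothetical names, and each callable is assumed to return the list of faulty node IDs it found (empty if none).

```python
from typing import Callable, Dict, List

Detector = Callable[[List[int]], List[int]]


def run_detection(nodes: List[int],
                  comm_detect: Detector,
                  train_detect: Detector) -> Dict[str, object]:
    """Run S2, and fall through to S3 only if S2 finds no faulty node."""
    faulty = comm_detect(nodes)            # S2: group communication detection
    if faulty:
        return {"stage": "communication", "faulty_nodes": faulty}
    faulty = train_detect(nodes)           # S3: group training detection
    return {"stage": "training", "faulty_nodes": faulty}


# Usage with stub detectors: communication finds nothing, training finds node 4.
result = run_detection(list(range(16)), lambda n: [], lambda n: [4])
print(result)
```

The early return mirrors the text: the slower training detection runs only when the quicker communication detection finds no fault.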
[0011] Furthermore, in step S1, the unified interface is specifically defined using function declarations.
[0012] Furthermore, the unified interface in step S1 includes a data structure, a public interface, a unified interface for group communication detection, and a unified interface for group training detection.
[0013] Furthermore, the data structure includes: a group communication detection task data structure; a communication detection function script; a detection result data structure; a group training detection task data structure; a training detection function script; a node detection task data structure; and a node information data structure for the node detection script.
[0014] Furthermore, the public interface includes: grouping and starting tasks; stopping tasks; and collecting detection results;
[0015] The unified interface for group communication detection includes: a chip type identifier; and a communication detection function interface.
[0016] The unified interface for group training and detection includes: chip type identifier; training and detection function interface.
[0017] Furthermore, step S1 specifically involves using the unified interface to adapt, for each chip type, the method for starting tasks in groups, the communication detection function, the training detection function, the method for stopping tasks, and the method for collecting detection results.
[0018] Furthermore, the group communication detection process in step S2 is specifically as follows:
[0019] Round 1: Group the nodes into N groups, execute custom scripts within each group, and if the calculation fails, locate the faulty node directly; if the communication fails, locate the group containing the faulty node.
[0020] Second round: Take one node from each of the normal group and the abnormal group from the results of the first round to form a two-node group. Then, execute a custom communication script for each two-node group. If the result is normal, it means that the node in the original abnormal group is normal. If the result is abnormal, it means that the node in the original abnormal group is confirmed as a faulty node.
[0021] Furthermore, the custom script is specifically defined using the matmul, allreduce, allgather, or alltoall functions.
[0022] Furthermore, the group training and detection process in step S3 is specifically as follows:
[0023] First round: Group the nodes into M groups and perform a small parameter training task. Groups that train normally are considered normal nodes, while groups that train abnormally are marked as abnormal groups, indicating that there are faulty nodes in the abnormal groups.
[0024] Second round: Take one node from each of the (M-1) normal groups and one abnormal group from the results of the first round, and retrain with a small number of parameters using M nodes as a group. If the training is abnormal, it means that the node provided by the original abnormal group is a faulty node; otherwise, it means that the node provided by the original abnormal group is normal.
[0025] Furthermore, a training anomaly specifically corresponds to the situation where the training speed (tgs) is lower than a preset threshold or the loss function value (loss) is abnormal.
[0026] Compared with the prior art, the present invention has the following advantages:
[0027] This invention proposes a fault detection method for heterogeneous training clusters. First, communication detection is performed in groups. If a faulty node is detected, the detection process ends. If no faulty node is detected, training detection is performed in groups. Combining communication detection with training detection, and running the groups simultaneously, improves the accuracy and efficiency of fault detection. It addresses the low detection rate of communication detection alone and the long time consumption and high cost of training detection alone.
[0028] This invention defines a unified interface. When the overall training task contains multiple heterogeneous chips, the unified interface is used to adapt, for each chip, the method for starting tasks in groups, the communication detection function, the training detection function, the method for stopping tasks, and the method for collecting detection results. As a result, the chip type does not need to be known while the fault detection process is scheduled. After chip adaptation, the differences between heterogeneous chips are hidden from the fault detection process, so the unified fault detection method can support any adapted chip. This reduces the complexity and cost of fault detection in heterogeneous training clusters.
Attached Figure Description
[0029] Figure 1 is a schematic diagram of the method flow of the present invention;
[0030] Figure 2 is a schematic diagram illustrating the application process of an example;
[0031] Figure 3 is a schematic diagram of group communication detection in this invention;
[0032] Figure 4 is a schematic diagram of the group training and detection process in this invention;
[0033] Figure 5 is a schematic diagram of group training and detection in this invention.
Detailed Implementation
[0034] The present invention will now be described in detail with reference to the accompanying drawings and specific embodiments.
[0035] Example
[0036] As shown in Figure 1, a fault detection method applied to heterogeneous training clusters includes the following steps:
[0037] S1. Start the detection process based on the defined unified interface;
[0038] S2. Execute the group communication detection process. If a faulty node is detected, output the detection result and end the current detection process; otherwise, execute step S3.
[0039] S3. Execute the group training and detection process, output the detection results, and end the current detection process.
[0040] This embodiment applies the above-described solution. As shown in Figure 2, communication detection is first performed in groups. If no faulty node is detected, the nodes are regrouped for small-parameter training detection. A standard interface for fault detection is defined, and a multi-chip adaptation access process is implemented to support fault detection in large-scale heterogeneous training clusters.
[0041] The main contents include:
[0042] I. Define a unified interface
[0043] 1.1 Data Structure
[0044] This includes: data structure for group communication detection tasks; script for communication detection functions; data structure for detection results; data structure for group training and detection tasks; script for training and detection functions; data structure for node detection tasks; and node information data structure for node detection scripts.
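One possible shape for these data structures is sketched below. The patent does not give field definitions, so every class name and field here (`NodeInfo`, `GroupDetectionTask`, `DetectionResult`, and their attributes) is a hypothetical illustration rather than the actual interface.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class NodeInfo:
    """Node information passed to a node detection script (hypothetical fields)."""
    node_id: int
    hostname: str
    chip_type: str


@dataclass
class GroupDetectionTask:
    """A group communication or group training detection task (hypothetical fields)."""
    group_id: int
    kind: str                      # "communication" or "training"
    nodes: List[NodeInfo]
    script: str                    # name of the detection function script to run


@dataclass
class DetectionResult:
    """Outcome of one detection task (hypothetical fields)."""
    group_id: int
    ok: bool
    faulty_node_ids: List[int] = field(default_factory=list)
    detail: str = ""


nodes = [NodeInfo(node_id=i, hostname=f"node-{i}", chip_type="chip-a") for i in range(2)]
task = GroupDetectionTask(group_id=0, kind="communication", nodes=nodes, script="allreduce_check")
result = DetectionResult(group_id=task.group_id, ok=True)
print(task.kind, len(task.nodes), result.ok)
```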
[0045] 1.2 Public Interface:
[0046] This includes: ① starting tasks in groups; ② stopping tasks; ③ collecting test results.
[0047] Unified interface for group communication detection:
[0048] This includes: ① Chip type identifier; ② Communication detection function interface.
[0049] Unified interface for group training and detection:
[0050] This includes: ① chip type identifier; ② training and detection function interface.
[0051] 1.3 Multi-chip adaptation process:
[0052] Since the overall training task involves multiple chips, methods for starting tasks in groups, communication detection functions, training detection functions, methods for stopping tasks, and methods for collecting detection results need to be adapted for each chip. The chip type does not need to be known during the entire fault detection process scheduling.
[0053] II. Design Testing Process
[0054] 2.1 Group Communication Detection Process:
[0055] Round 1: Group the nodes into N groups and execute custom scripts (functions such as matmul, allreduce, allgather, and alltoall). If the calculation fails, the faulty node can be located directly. If the communication fails, the fault can only be narrowed down to the group containing the faulty node.
[0056] Second round: Take one node from each of the normal and abnormal groups from the results of the first round to form multiple two-node groups. Execute the custom communication scripts for each group. If the result is normal, it means that the node in the original abnormal group is normal. If the result is abnormal, it means that the node in the original abnormal group is confirmed to be abnormal, that is, a faulty node.
[0057] As shown in Figure 3, in this embodiment the nodes are divided into groups of 8. The first-round results show that all nodes in Group 0 are normal, while Group 1 contains nodes with communication failures. In the second round, Node 0 is paired with Node 8, Node 1 with Node 9, and so on, and a communication test is run on each pair. The results show abnormal communication between Node 0 and Node 8, indicating that Node 8 is a faulty node.
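The second-round pairing in this example can be sketched as follows. The helper names are hypothetical, and the communication test is replaced by a simulated check against a known set of faulty nodes, because the real test would run a collective such as allreduce on the two nodes.

```python
from typing import Callable, List, Sequence, Tuple


def split_into_groups(nodes: Sequence[int], group_size: int) -> List[List[int]]:
    """Round 1 grouping: consecutive nodes form one group."""
    return [list(nodes[i:i + group_size]) for i in range(0, len(nodes), group_size)]


def second_round_pairs(normal: List[int], abnormal: List[int]) -> List[Tuple[int, int]]:
    """Pair the i-th node of a normal group with the i-th node of the abnormal group."""
    return list(zip(normal, abnormal))


def find_faulty(normal: List[int], abnormal: List[int],
                pair_ok: Callable[[int, int], bool]) -> List[int]:
    """A failing pair implicates the node that came from the abnormal group."""
    return [suspect for ref, suspect in second_round_pairs(normal, abnormal)
            if not pair_ok(ref, suspect)]


FAULTY = {8}  # simulated ground truth, for demonstration only


def simulated_pair_ok(a: int, b: int) -> bool:
    return a not in FAULTY and b not in FAULTY


groups = split_into_groups(range(16), 8)
normal, abnormal = groups[0], groups[1]     # Group 0 normal, Group 1 abnormal
pairs = second_round_pairs(normal, abnormal)
faulty_nodes = find_faulty(normal, abnormal, simulated_pair_ok)
print(pairs[:2], faulty_nodes)
```

Because every pair is independent, all eight two-node tests can run at the same time, so the second round costs one test duration regardless of group size.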
[0058] 2.2 Group Training and Testing Process
[0059] As shown in Figure 4, in the first round of this embodiment the nodes are divided into groups of 4, and a training detection task (a small-parameter training task) is run in each group. Groups that train normally consist of normal nodes, while groups that train abnormally (low tgs, abnormal loss, etc.) are marked as abnormal groups, indicating that the group contains faulty nodes.
[0060] The second round still uses four nodes per group, but each group is formed from one node of the abnormal group and one node from each of three normal groups, and the small-parameter training task is run again. If the training is abnormal (low tgs, abnormal loss, etc.), the node provided by the original abnormal group is a faulty node. If the training is normal, the node provided by the original abnormal group is a normal node.
[0061] It should be noted that tgs is the training speed, and the larger the tgs value, the better the performance; loss is the output value of the loss function, and the loss value usually decreases as training progresses.
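A training-anomaly check along these lines might look like the sketch below. The text does not define what counts as an abnormal loss, so the rule used here (a non-finite value, or a final loss higher than the initial loss) is purely an assumption, as are the function name `is_training_abnormal` and the example threshold.

```python
import math
from typing import Sequence


def is_training_abnormal(tgs: float, losses: Sequence[float], tgs_threshold: float) -> bool:
    """Flag a run whose speed is too low or whose loss looks wrong."""
    low_speed = tgs < tgs_threshold
    # Assumed definition of "abnormal loss": NaN/inf, or loss that rose overall.
    non_finite = any(not math.isfinite(x) for x in losses)
    rising = len(losses) >= 2 and losses[-1] > losses[0]
    return low_speed or non_finite or rising


print(is_training_abnormal(1000.0, [2.0, 1.5, 1.2], tgs_threshold=800.0))
print(is_training_abnormal(500.0, [2.0, 1.5, 1.2], tgs_threshold=800.0))
```

In practice the threshold and the loss rule would be tuned for the small-parameter model used in detection.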
[0062] As shown in Figure 5, in this embodiment the first round trains in groups of four nodes. Group1 is found to have abnormal training results and is marked as an abnormal group, while Group0, Group2, and Group3 are marked as normal groups. In the second round, one node is taken from the abnormal group Group1 and one from each of the three normal groups to form a new group. If the training result of the new group G0123_0 is abnormal, Node4, the node contributed by the abnormal group marked in the first round, is a faulty node.
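The regrouping in this example can be sketched as follows. The helper names are hypothetical, and the training run is replaced by a simulated check against a known faulty set, since the real check would launch the small-parameter training task on the group.

```python
from typing import Callable, List, Tuple

TrainOk = Callable[[List[int]], bool]


def first_round(groups: List[List[int]], train_ok: TrainOk) -> Tuple[List[List[int]], List[List[int]]]:
    """Split groups into normal and abnormal according to the training result."""
    normal, abnormal = [], []
    for group in groups:
        (normal if train_ok(group) else abnormal).append(group)
    return normal, abnormal


def regroup(normal_groups: List[List[int]], abnormal_group: List[int]) -> List[List[int]]:
    """New group i = i-th node of each normal group plus i-th node of the abnormal group."""
    return [sorted([g[i] for g in normal_groups] + [suspect])
            for i, suspect in enumerate(abnormal_group)]


def second_round(normal_groups: List[List[int]], abnormal_group: List[int],
                 train_ok: TrainOk) -> List[int]:
    """A new group that trains abnormally implicates its node from the abnormal group."""
    new_groups = regroup(normal_groups, abnormal_group)
    return [suspect for suspect, members in zip(abnormal_group, new_groups)
            if not train_ok(members)]


FAULTY = {4}  # simulated ground truth, for demonstration only


def simulated_train_ok(members: List[int]) -> bool:
    return not any(m in FAULTY for m in members)


groups = [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11], [12, 13, 14, 15]]
normal_groups, abnormal_groups = first_round(groups, simulated_train_ok)
new_groups = regroup(normal_groups, abnormal_groups[0])
faulty_nodes = second_round(normal_groups, abnormal_groups[0], simulated_train_ok)
print(new_groups[0], faulty_nodes)
```

Here the first regrouped set contains Node0, Node4, Node8, and Node12, corresponding to G0123_0 in the text, and it is the only one that fails in the simulation.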
[0063] In summary, this solution combines communication detection and training detection methods, along with simultaneous grouping, to more accurately and efficiently detect faulty nodes. This addresses the issues of low detection rates with single communication detection and high time consumption and cost with single training detection. Furthermore, it defines a standard interface for fault detection, enabling multi-chip adaptation and access processes. For large-scale heterogeneous training clusters, the fault detection process is unaware of the chip hardware structure within the cluster, achieving decoupling between the detection process and chip structure. This makes it easier to expand the inspection process to various heterogeneous chips, and reduces chip access costs. Practical application of this solution, running for several months in a training task using multiple domestically produced chips totaling tens of thousands of cards, has verified that this solution can achieve the aforementioned technical effects.
Claims
1. A fault detection method applied to heterogeneous training clusters, characterized in that it includes the following steps:
S1. Start the detection process based on the defined unified interface;
S2. Execute the group communication detection process. If a faulty node is detected, output the detection result and end the current detection process; otherwise, execute step S3;
S3. Execute the group training and detection process, output the detection results, and end the current detection process;
The group communication detection process in S2 is as follows:
Round 1: Group the nodes into N groups, execute custom scripts within each group, and if the calculation fails, locate the faulty node directly; if the communication fails, locate the group containing the faulty node.
Second round: Take one node from each of the normal group and the abnormal group from the results of the first round to form a two-node group. Then, execute a custom communication script for each two-node group. If the result is normal, it means that the node in the original abnormal group is normal. If the result is abnormal, it means that the node in the original abnormal group is confirmed as a faulty node.
The specific process of group training and detection in S3 is as follows:
First round: Group the nodes into M groups and perform a small parameter training task. Groups that train normally are considered normal nodes, while groups that train abnormally are marked as abnormal groups, indicating that there are faulty nodes in the abnormal groups.
Second round: Take one node from each of the (M-1) normal groups and one abnormal group from the results of the first round, and retrain with a small number of parameters using M nodes as a group. If the training is abnormal, it means that the node provided by the original abnormal group is a faulty node; otherwise, it means that the node provided by the original abnormal group is normal.
2. The fault detection method applied to heterogeneous training clusters according to claim 1, characterized in that, in step S1, the unified interface is specifically defined using function declarations.
3. The fault detection method for heterogeneous training clusters according to claim 1, characterized in that the unified interface in step S1 includes a data structure, a public interface, a unified interface for group communication detection, and a unified interface for group training detection.
4. The fault detection method for heterogeneous training clusters according to claim 3, characterized in that the data structures include: a group communication detection task data structure; a communication detection function script; a detection result data structure; a group training detection task data structure; a training detection function script; a node detection task data structure; and a node information data structure for the node detection script.
5. The fault detection method for heterogeneous training clusters according to claim 4, characterized in that the public interface includes: grouping and starting tasks; stopping tasks; and collecting detection results. The unified interface for group communication detection includes: a chip type identifier; and a communication detection function interface. The unified interface for group training and detection includes: a chip type identifier; and a training and detection function interface.
6. The fault detection method for heterogeneous training clusters according to claim 5, characterized in that step S1 specifically involves using the unified interface to adapt, for each chip type, the method for starting tasks in groups, the communication detection function, the training detection function, the method for stopping tasks, and the method for collecting detection results.
7. The fault detection method applied to heterogeneous training clusters according to claim 1, characterized in that the custom script is specifically defined using the matmul, allreduce, allgather, or alltoall functions.
8. The fault detection method for heterogeneous training clusters according to claim 1, characterized in that a training anomaly specifically corresponds to a situation where the training speed (tgs) is lower than a preset threshold or the loss function value (loss) is abnormal.