Hardware fault handling methods, devices, equipment, storage media and programs
By analyzing the hardware status and structured fault detection results of the target computing nodes, and combining node isolation and task reconstruction strategies, the problem of inaccurate hardware-level fault detection in existing technologies is solved, achieving high reliability and fault tolerance for distributed training tasks.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- DAWNING INT INFORMATION IND CO LTD
- Filing Date
- 2026-03-16
- Publication Date
- 2026-06-30
AI Technical Summary
In existing technologies, Kubernetes-based containerized cluster scheduling systems cannot accurately detect hardware-level failures, such as GPU card loss or memory anomalies, resulting in low reliability of training tasks. Furthermore, when a single container fails, it cannot coordinate the reconstruction of other Pods, leading to training tasks being stuck or data inconsistencies.
By performing state detection on the hardware of the target computing node, structured fault detection results are generated. Based on these results, node isolation is performed and training tasks are reconstructed in the distributed cluster. Differentiated node isolation strategies and topology-driven priority reconstruction strategies are adopted to ensure accurate identification and rapid response to hardware-level faults.
It achieves accurate identification and rapid response to hardware-level faults, improves the reliability of training tasks, avoids problems such as training task freezing or data inconsistency, and enhances the fault tolerance and task scheduling accuracy of distributed training tasks.
[Figure: abstract drawing, CN122309259A_ABST]
Abstract
Description
Technical Field
[0001] This application relates to the field of computer technology, and in particular to a hardware fault handling method, apparatus, device, storage medium, and program.
Background Technology
[0002] As the model size grows exponentially, a containerized cluster scheduling system based on Kubernetes (K8S) can be used to complete distributed training tasks through multi-node collaboration.
[0003] In existing technologies, multiple containers (Pods) can work together to maintain the consistency of the training topology, and single-container-level fault tolerance can be achieved through container restart strategies (such as Always and OnFailure). However, when a single container exits due to node failure, the reconstruction of the other Pods cannot be coordinated, leading to stuck training tasks or data inconsistencies. Furthermore, because hardware-level failures (such as missing GPU cards or memory anomalies) cannot be detected accurately, faulty nodes can be misjudged as healthy and new tasks can be incorrectly scheduled to abnormal nodes, which lowers the reliability of training tasks.
Summary of the Invention
[0004] This application provides a hardware fault handling method, apparatus, device, storage medium, and program to solve the technical problem of low reliability in training tasks.
[0005] In a first aspect, this application provides a hardware fault handling method, including:
[0006] The hardware of the target computing node is subjected to status detection to obtain hardware status data;
[0007] Based on the hardware status data, generate the fault detection result corresponding to the target computing node;
[0008] Based on the fault detection results, the target computing node is isolated.
[0009] In the distributed cluster corresponding to the target computing node, at least one target subtask corresponding to the target computing node is reconstructed.
[0010] In this application, data transmission is achieved through structured fault detection results, ensuring accurate identification and rapid response to hardware-level faults. Furthermore, after a node failure is detected, other computing nodes in the cluster can be coordinated to rebuild the training task, thereby improving the reliability of the training task.
[0011] Optionally, the hardware of the target computing node is subjected to state detection to obtain hardware state data, including:
[0012] Generate a hardware status detection script for the target hardware model corresponding to the target computing node;
[0013] Based on the hardware status detection script, the hardware status of the target computing node is detected to obtain the hardware status data.
[0014] In this application, a corresponding hardware status detection script can be generated according to the target hardware model, which solves the problem that the detection script cannot be adapted to heterogeneous hardware environments, avoids detection failure due to differences in hardware models, and improves the reliability of node isolation.
[0015] Optionally, based on the hardware status detection script, the hardware of the target computing node is subjected to status detection to obtain the hardware status data, including:
[0016] Obtain historical fault data corresponding to the target computing node;
[0017] Based on the historical fault data, determine the target detection frequency corresponding to the hardware status detection script;
[0018] Based on the target detection frequency, the hardware status of the target computing node is detected using the hardware status detection script to obtain the hardware status data.
[0019] In this application, by dynamically adjusting the detection frequency of the hardware status detection script, the problem of resource waste or missed detection caused by a fixed detection frequency is solved, thereby improving the system's operating efficiency and stability.
[0020] Optionally, in the distributed cluster corresponding to the target computing node, at least one target subtask corresponding to the target computing node is reconstructed, including:
[0021] A selected computing node is determined from multiple candidate computing nodes in the distributed cluster, and the scheduling status of the candidate computing node is schedulable.
[0022] In the selected computing node, the reconstruction process is performed on the at least one target subtask.
[0023] In this application, the training task reconstruction can avoid training task freezing or data inconsistency caused by node failure, realize seamless migration of distributed training tasks, and improve the reliability of training cluster tasks.
[0024] Optionally, in the selected computing node, the reconstruction process for the at least one target subtask includes:
[0025] Based on the training task topology corresponding to the target computing node, determine the priority of each target subtask in the at least one target subtask;
[0026] Based on the priority of each target subtask, reconstruction processing is performed on the selected computing node.
[0027] In this application, a priority reconstruction strategy driven by topology resolution and data consistency guarantee are used to achieve topology consistency and seamless migration of distributed training tasks. This avoids the problems of interruption of the entire training task, data inconsistency or training progress rollback caused by the failure of some Pods, and improves the reliability of training tasks.
[0028] Optionally, based on the fault detection results, the target computing node is isolated, including:
[0029] Based on the fault detection results, the current fault level of the target computing node is determined;
[0030] Determine the node isolation strategy corresponding to the current fault level;
[0031] Based on the node isolation strategy, the target computing node is isolated.
[0032] In this application, a differentiated node isolation strategy is adopted to balance the strictness of fault response with resource utilization, reduce resource idleness caused by excessive isolation, and ensure the stability of high-priority tasks.
[0033] Optionally, based on the node isolation strategy, the target computing node is isolated, including:
[0034] If the node isolation policy is the severe isolation policy, then the scheduling status of the target computing node is updated to unschedulable.
[0035] If the node isolation strategy is the abnormal isolation strategy, then the scheduling status of the target computing node will be updated to the degraded running status.
[0036] If the node isolation strategy is the mild isolation strategy, then a fault log is recorded. After the number of faults in the fault log reaches a preset number, the scheduling status of the target computing node is updated to an unschedulable state.
[0037] In this application, a differentiated isolation strategy based on fault classification is used to achieve a balance between the strictness of fault response and the utilization rate of cluster resources, avoiding resource idleness caused by indiscriminate isolation and ensuring the operational reliability of high-priority AI training tasks.
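The three isolation strategies described above can be sketched as follows. This is a minimal illustration rather than the implementation of this application: the names NodeState, apply_isolation, and the strategy constants are hypothetical, and the preset number of faults (3 here) is an assumed value, since the text does not specify one.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical identifiers; the text names the strategies but not how they are encoded.
SEVERE, ABNORMAL, MILD = "severe", "abnormal", "mild"


@dataclass
class NodeState:
    scheduling_status: str = "schedulable"
    fault_log: List[str] = field(default_factory=list)


def apply_isolation(node: NodeState, strategy: str, fault: str,
                    preset_count: int = 3) -> str:
    """Apply one isolation strategy and return the resulting scheduling status."""
    if strategy == SEVERE:
        # Severe faults: stop scheduling new work to the node immediately.
        node.scheduling_status = "unschedulable"
    elif strategy == ABNORMAL:
        # Abnormal faults: keep the node in service in a degraded state.
        node.scheduling_status = "degraded"
    elif strategy == MILD:
        # Mild faults: only log, and isolate once the log reaches the preset count.
        node.fault_log.append(fault)
        if len(node.fault_log) >= preset_count:
            node.scheduling_status = "unschedulable"
    else:
        raise ValueError(f"unknown isolation strategy: {strategy}")
    return node.scheduling_status
```

With the assumed preset count of 3, a node stays schedulable after the first two mild faults and becomes unschedulable on the third.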
[0038] In a second aspect, this application provides a hardware fault handling device, including a status detection module, a generation module, a node isolation module, and a reconstruction processing module:
[0039] The status detection module is used to perform status detection on the hardware of the target computing node and obtain hardware status data.
[0040] The generation module is used to generate a fault detection result corresponding to the target computing node based on the hardware status data.
[0041] The node isolation module is used to isolate the target computing node based on the fault detection results.
[0042] The reconstruction processing module is used to perform reconstruction processing on at least one target subtask corresponding to the target computing node in the distributed cluster corresponding to the target computing node.
[0043] In a third aspect, embodiments of this application provide an electronic device, including: a processor, and a memory communicatively connected to the processor;
[0044] The memory stores computer-executable instructions;
[0045] The processor executes the computer-executable instructions stored in the memory to implement the method described in any implementation of the first aspect.
[0046] In a fourth aspect, embodiments of this application provide a computer-readable storage medium storing computer-executable instructions, which, when executed by a processor, are used to implement the method described in any implementation of the first aspect.
[0047] In a fifth aspect, this application provides a computer program product, including a computer program that, when executed by a processor, implements the method described in any implementation of the first aspect.
[0048] The hardware fault handling method, apparatus, device, storage medium, and program provided in this application can perform status detection on the hardware of a target computing node, obtain hardware status data, convert the hardware status data into structured fault detection results, and perform node isolation operations based on the fault detection results. Passing fault information as structured fault detection results ensures accurate identification of and rapid response to hardware-level faults, improving the accuracy of faulty-node identification. Furthermore, after a node failure is detected, other computing nodes in the cluster can be coordinated to rebuild the training task, improving the reliability of the training task.
Attached Figure Description
[0049] The accompanying drawings, which are incorporated in and form part of this specification, illustrate embodiments consistent with this application and, together with the description, serve to explain the principles of this application.
[0050] Figure 1 is a schematic diagram of an application scenario provided in an embodiment of this application;
[0051] Figure 2 is a flowchart of a hardware fault handling method provided in an embodiment of this application;
[0052] Figure 3 is a flowchart of another hardware fault handling method provided in an embodiment of this application;
[0053] Figure 4 is a timing diagram of hardware fault handling provided in an embodiment of this application;
[0054] Figure 5 is a schematic diagram of the structure of a hardware fault handling device provided in an embodiment of this application;
[0055] Figure 6 is a schematic diagram of the structure of an electronic device provided in an embodiment of this application.
[0056] The accompanying drawings illustrate specific embodiments of this application, which will be described in more detail below. These drawings and descriptions are not intended to limit the scope of the concept in any way, but rather to illustrate the concept of this application to those skilled in the art through reference to particular embodiments.
Detailed Implementation
[0057] Exemplary embodiments will now be described in detail, examples of which are illustrated in the accompanying drawings. When the following description relates to the drawings, unless otherwise indicated, the same numbers in different drawings denote the same or similar elements. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with this application. Rather, they are merely examples of apparatuses and methods consistent with some aspects of this application as detailed in the appended claims.
[0058] Figure 1 is a schematic diagram of an application scenario provided in an embodiment of this application. Referring to Figure 1, this application applies to AI training cluster scenarios based on a distributed cluster 100. The network architecture of the distributed cluster 100 may include a hardware fault monitor 101, multiple computing nodes 102, and a task migration controller 103.
[0059] The training task can be broken down into multiple subtasks, which are processed separately by the multiple computing nodes 102. A computing node 102 may experience task interruptions due to hardware failures (such as GPU loss or memory malfunctions), system crashes, or network outages. When the hardware fault monitor 101 detects a failure in a computing node 102, the task migration controller 103 can quickly migrate the subtasks corresponding to the failed node to a healthy computing node 102.
[0060] In related technologies, multiple containers can be run collaboratively to maintain the consistency of the training topology, and fault tolerance at the single container level can be achieved through container restart strategies. However, hardware-level faults (such as GPU card loss or memory abnormalities) cannot be accurately detected, which can lead to faulty nodes being misjudged as healthy and new tasks being incorrectly scheduled to abnormal nodes. At the same time, when a single container exits due to node failure, it is impossible to coordinate the reconstruction of other containers, which can cause training tasks to freeze or data to be inconsistent, resulting in low reliability of training tasks.
[0061] The hardware fault handling method provided in this application can perform status detection on the hardware of the target computing node to obtain hardware status data, convert the hardware status data into structured fault detection results, and perform node isolation operations based on the fault detection results. By using structured fault detection results for data transmission, it ensures accurate identification and rapid response to hardware-level faults, improving the accuracy of fault node identification. Furthermore, after a node failure is detected, it can coordinate with other computing nodes in the cluster to rebuild the training task, improving the reliability of the training task.
[0062] The technical solution of this application and how the technical solution of this application solves the above-mentioned technical problems are described in detail below with specific embodiments. These specific embodiments can be combined with each other, and the same or similar concepts or processes may not be described again in some embodiments. The embodiments of this application will now be described with reference to the accompanying drawings.
[0063] Figure 2 is a flowchart of a hardware fault handling method provided in an embodiment of this application. Referring to Figure 2, the method may include:
[0064] S201. Perform status detection on the hardware of the target computing node to obtain hardware status data.
[0065] The execution entity in this application embodiment can be a distributed cluster or a hardware fault handling device set in the distributed cluster. The hardware fault handling device can be implemented by software or by a combination of software and hardware.
[0066] The target computing node is any one of the multiple computing nodes in the distributed cluster.
[0067] A node hardware health monitoring device can be deployed in the compute node to perform hardware status monitoring logic. For example, this node hardware health monitoring device can be based on the core logic of the open-source Node Health Check (NHC) tool, or it can use other health monitoring tools with hardware monitoring capabilities.
[0068] Hardware status data refers to the hardware operating status information of computing nodes obtained through detection scripts. Examples include whether the GPU card exists, whether memory modules are missing, the status of the Self-Monitoring Analysis and Reporting Technology (SMART) system, and whether the GPU temperature is abnormal.
[0069] S202. Based on the hardware status data, generate the fault detection results corresponding to the target computing node.
[0070] Fault detection results can be in a structured data format that includes hardware status data and fault type identifiers. For example, a fault detection result report in JSON format may include fields such as node_ip, fault_type (e.g., "GPU card missing"), and timestamp.
[0071] Here, node_ip is the Internet Protocol (IP) address of the target computing node; fault_type is the fault type, such as "GPU card missing", "memory module missing", or "hard disk SMART abnormal"; and timestamp is the time at which the fault detection result was generated.
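As an illustration of such a report, the sketch below serializes the three fields named above into JSON. The function name build_fault_report is hypothetical, and the ISO 8601 timestamp format is an assumption; the text does not specify how the timestamp is encoded.

```python
import json
from datetime import datetime, timezone
from typing import Optional


def build_fault_report(node_ip: str, fault_type: str,
                       now: Optional[datetime] = None) -> str:
    """Serialize one fault detection result as a JSON string."""
    now = now or datetime.now(timezone.utc)
    report = {
        "node_ip": node_ip,            # IP address of the target computing node
        "fault_type": fault_type,      # e.g. "GPU card missing"
        "timestamp": now.isoformat(),  # assumed format: ISO 8601, UTC
    }
    return json.dumps(report, ensure_ascii=False)


if __name__ == "__main__":
    print(build_fault_report("10.0.0.12", "GPU card missing"))
```

Because the output is plain JSON, a downstream isolation or alarm component can read the fields without knowing which detection script produced them.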
[0072] S203. Based on the fault detection results, isolate the target computing node.
[0073] Based on the fault detection results, when it is determined that the target computing node has a preset hardware fault (such as GPU card loss, memory module loss, hard disk SMART abnormality), the node isolation device (AutoCordon) will perform node isolation operation on the target computing node.
[0074] Node isolation refers to the operation of marking a faulty compute node as unschedulable by calling the cluster management system's application programming interface (API).
[0075] For example, the scheduling status of a target compute node can be updated to unschedulable by using a node unschedulable marking command in a distributed cluster management system (such as the cordon command in Kubernetes), preventing new training task Pods from being scheduled to the faulty node.
[0076] In some embodiments, the fault type can be determined based on the fault detection results; when the fault type is a preset fault type, the scheduling status of the target computing node is updated to an unschedulable state.
[0077] Preset fault types include GPU card loss, memory module loss, and hard drive SMART malfunction.
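The preset-fault-type check and the unschedulable marking can be sketched as below. In Kubernetes, the cordon command sets the spec.unschedulable field of the Node object to true, so the sketch builds that patch body and hands it to an injected callable. The names isolate_if_needed and patch_node are hypothetical; in a real deployment the callable might wrap the patch_node method of the official Kubernetes Python client, and it would also need to resolve node_ip to a node name, which is omitted here.

```python
from typing import Callable, Dict

# Preset fault types taken from the text; the exact strings are an assumption.
PRESET_FAULT_TYPES = {
    "GPU card loss",
    "memory module loss",
    "hard drive SMART malfunction",
}


def cordon_patch() -> Dict:
    """Patch body equivalent to cordoning a Kubernetes node."""
    return {"spec": {"unschedulable": True}}


def isolate_if_needed(report: Dict[str, str],
                      patch_node: Callable[[str, Dict], None]) -> bool:
    """Mark the node unschedulable if the reported fault is a preset type.

    patch_node is a hypothetical hook that sends the patch to the cluster API.
    Returns True when an isolation request was issued.
    """
    if report["fault_type"] not in PRESET_FAULT_TYPES:
        return False
    patch_node(report["node_ip"], cordon_patch())
    return True
```

Injecting the API call keeps the decision logic testable without a live cluster.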
[0078] In this application, by actively polling the fault detection results and triggering node isolation operations, the isolation response time of faulty nodes is reduced from minutes to seconds, avoiding new tasks being incorrectly scheduled to abnormal nodes, and significantly improving the fault tolerance and task scheduling accuracy of the distributed training cluster.
[0079] In some embodiments, based on the fault detection results, the current fault level of the target computing node is determined; the node isolation strategy corresponding to the current fault level is determined; and the target computing node is isolated based on the node isolation strategy.
[0080] In this application, a differentiated node isolation strategy is adopted to balance the strictness of fault response with resource utilization, reduce resource idleness caused by excessive isolation, and ensure the stability of high-priority tasks.
[0081] In some embodiments, alarms are triggered by standard monitoring components such as alarm components, and node fault information is pushed to the administrator in a timely manner.
[0082] Specifically, alarm triggering conditions are set based on the fault_level field in the fault detection results. For example, when fault_level is "critical", a P1 level alarm is triggered; when fault_level is "warning", a P2 level alarm is triggered; and when fault_level is "info", only logs are recorded and no alarm is triggered.
[0083] The alarm information received by the alarm component is aligned with the fault detection result format, including fields such as node_ip, fault_type, fault_level, and timestamp, ensuring that the alarm information can be directly used for fault location.
[0084] The node hardware health detection device pushes the structured fault detection results to the push gateway, triggers alarms according to preset rules, and forwards the alarm information to the designated channel.
[0085] Configure notification channels such as email, enterprise software, and SMS to push alarm information of different levels to the corresponding administrators. For example, P1 level alarms (critical faults) are pushed to both the operations manager and the on-duty engineer; P2 level alarms (abnormal faults) are pushed to the operations team's email; minor faults are only displayed on the monitoring dashboard and no notifications are pushed proactively.
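The level-based alarm rules and notification routing described above could be expressed as a small lookup table. This is a sketch under assumptions: the table layout and the function name route_alarm are hypothetical, and the recipient labels simply restate the examples in the text rather than real channel identifiers.

```python
from typing import Dict, List, Optional

# fault_level -> alarm level and recipients, following the examples in the text.
ALARM_RULES: Dict[str, Dict] = {
    "critical": {"alarm": "P1",
                 "recipients": ["operations manager", "on-duty engineer"]},
    "warning": {"alarm": "P2", "recipients": ["operations team email"]},
    "info": {"alarm": None, "recipients": []},  # logged only, no notification
}

ALARM_FIELDS = ("node_ip", "fault_type", "fault_level", "timestamp")


def route_alarm(result: Dict[str, str]) -> Dict:
    """Return the alarm level, recipients, and payload for one detection result."""
    rule = ALARM_RULES[result["fault_level"]]
    alarm: Optional[str] = rule["alarm"]
    recipients: List[str] = list(rule["recipients"])
    # The payload keeps the same fields as the fault detection result so that
    # it can be used directly for fault location.
    payload = {key: result[key] for key in ALARM_FIELDS}
    return {"alarm": alarm, "recipients": recipients, "payload": payload}
```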
[0086] In this application, an alarm component is used to implement node anomaly alarms, and fault information is pushed to the administrator in a timely and hierarchical manner, which facilitates quick location and repair of problems and improves the overall system observability and maintainability.
[0087] S204. In the distributed cluster corresponding to the target computing node, at least one target subtask corresponding to the target computing node is reconstructed.
[0088] After the target computing node is marked as unschedulable, the task migration controller performs reconstruction processing on the training tasks that cannot be executed normally on the target computing node due to hardware failure.
[0089] In some embodiments, a selected computing node is determined from multiple candidate computing nodes in a distributed cluster; and on the selected computing node, at least one target subtask is reconstructed.
[0090] The candidate computing nodes are in a schedulable state, used to take over training tasks from failed nodes and ensure the hardware resource requirements for task execution. At least one target subtask is a distributed training task that was terminated and cannot continue execution on the target computing node due to hardware failure.
[0091] Select a healthy node in the cluster whose hardware is in normal condition and whose scheduling status is schedulable.
[0092] The reconstruction process refers to the task migration controller first terminating at least one target subtask Pod on the faulty node, and then cloning and reconstructing the training task Pod on the selected healthy node according to the distributed training topology priority, ensuring that the topology role of the new Pod is consistent with the original task.
[0093] For example, in a PS / Worker (parameter server / worker node) distributed training architecture, if a GPU card is lost on the node where the Worker Pod is located, the task migration controller will first confirm the running status of the PS Pod (critical Pod) and then clone a new Worker Pod on a healthy node.
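A minimal sketch of node selection and topology-ordered reconstruction follows. It assumes a two-role PS / Worker topology in which PS Pods are handled before Worker Pods; the names Subtask, Node, ROLE_PRIORITY, select_node, and plan_rebuild are hypothetical, and the sketch only produces a placement plan instead of creating Pods.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Assumed priority: lower value is rebuilt first; PS is treated as the critical role.
ROLE_PRIORITY = {"ps": 0, "worker": 1}


@dataclass(frozen=True)
class Subtask:
    name: str
    role: str  # "ps" or "worker"


@dataclass(frozen=True)
class Node:
    name: str
    schedulable: bool
    hardware_ok: bool


def select_node(candidates: List[Node]) -> Node:
    """Pick the first candidate that is schedulable and has healthy hardware."""
    for node in candidates:
        if node.schedulable and node.hardware_ok:
            return node
    raise RuntimeError("no healthy schedulable node available")


def plan_rebuild(subtasks: List[Subtask],
                 candidates: List[Node]) -> List[Tuple[str, str]]:
    """Return (subtask, node) pairs in the order they should be rebuilt."""
    target = select_node(candidates)
    ordered = sorted(subtasks, key=lambda s: ROLE_PRIORITY[s.role])
    return [(s.name, target.name) for s in ordered]
```

Each rebuilt subtask keeps its original role, which is how the sketch reflects the requirement that the new Pod's topology role matches the original task.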
[0094] In this application, the training task reconstruction can avoid problems such as training task freezing or data inconsistency caused by node failure, realize the seamless migration of distributed training tasks, and improve the task reliability of AI training clusters.
[0095] The hardware fault handling method provided in this application can perform status detection on the hardware of a target computing node to obtain hardware status data, convert the hardware status data into structured fault detection results, and perform node isolation operations based on the fault detection results. By using structured fault detection results to achieve data transmission, it ensures accurate identification and rapid response to hardware-level faults, improving the accuracy of fault node identification. Furthermore, after a node failure is detected, it can coordinate with other computing nodes in the cluster to rebuild the training task, improving the reliability of the training task.
[0096] Figure 3 is a flowchart of another hardware fault handling method provided in an embodiment of this application. Referring to Figure 3, the method may include:
[0097] S301. Generate a hardware status detection script that corresponds to the target hardware model of the target computing node.
[0098] Hardware status detection scripts refer to executable programs or scripts used to detect the hardware status of computing nodes.
[0099] It can adapt to different hardware through a configurable and scalable detection script framework, supports user-defined hardware detection items and detection logic, and has good configuration scalability.
[0100] The detection rules for each hardware model are defined using a structured configuration file. The configuration file format includes, but is not limited to, YAML, JSON, or INI formats. The configuration content includes hardware type, hardware manufacturer, hardware model, detection instructions, parameter thresholds, judgment conditions, and fault level.
[0101] When starting or initializing a detection task, the above structured configuration file is read and parsed, and the detection logic for the corresponding hardware model is automatically loaded without restarting the service or modifying the core code.
[0102] Meanwhile, it provides an interface for user-defined detection items, allowing users to extend support for new hardware and new detection indicators by adding configuration items or writing custom scripts.
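The configuration-driven framework described above might look like the following sketch, which uses JSON (one of the formats named in the text) so that only the standard library is needed. The key names, the sample values, and the functions load_rules and register_custom_rule are assumptions for illustration; the text lists the kinds of content a rule carries but not a schema.

```python
import json
from typing import Dict, Tuple

# Illustrative configuration; key names and values are assumed, not from the source.
CONFIG_TEXT = """
[
  {"hardware_type": "gpu", "manufacturer": "NVIDIA", "hardware_model": "A100",
   "detection_command": "nvidia-smi", "metric": "core_temperature_c",
   "condition": "<=", "threshold": 85, "fault_level": "warning"},
  {"hardware_type": "pcie", "manufacturer": "generic", "hardware_model": "any",
   "detection_command": "lspci", "metric": "device_present",
   "condition": "==", "threshold": 1, "fault_level": "critical"}
]
"""

RuleKey = Tuple[str, str]


def load_rules(text: str) -> Dict[RuleKey, dict]:
    """Parse the configuration and index rules by (hardware type, model)."""
    rules: Dict[RuleKey, dict] = {}
    for item in json.loads(text):
        rules[(item["hardware_type"], item["hardware_model"])] = item
    return rules


def register_custom_rule(rules: Dict[RuleKey, dict], rule: dict) -> None:
    """Add a user-defined detection item without touching the core code."""
    rules[(rule["hardware_type"], rule["hardware_model"])] = rule
```

Because the rules live in data, supporting a new hardware model means adding a configuration entry and reloading it, with no change to the detection loop.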
[0103] For example, nvidia-smi can be used to detect GPU status, lspci to detect PCIe device status, and hard drive detection commands to read the SMART status of hard drives.
[0104] The types and models of hardware on different computing nodes may differ, and the detection instructions, detection parameters and judgment logic corresponding to different hardware models are different. Therefore, it is necessary to dynamically generate or call a hardware status detection script that is compatible with the target hardware model based on a structured configuration file.
[0105] For example, the target hardware models include GPU cards from different manufacturers and of different models, memory modules of different specifications, and hard drives with different interface types. For NVIDIA series GPUs, the nvidia-smi command is used to detect whether the GPU is present, together with its driver status and temperature status; for PCIe bus devices, the lspci command is used to detect whether the device is present; for hard drive devices, hard drive testing tools are used to obtain SMART information. In this way, accurate status detection is achieved for different hardware models.
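For the NVIDIA case, a detection script could query nvidia-smi in CSV mode and compare the number of GPUs found against the number expected on the node. The sketch below is one possible shape under assumptions: the function names and the expected_count parameter are hypothetical, and any failure to run nvidia-smi is treated as all GPUs missing, which is a simplification.

```python
import subprocess
from typing import Dict, List

# Query flags supported by nvidia-smi for machine-readable output.
QUERY = ["nvidia-smi", "--query-gpu=index,name,temperature.gpu",
         "--format=csv,noheader,nounits"]


def parse_gpu_csv(text: str) -> List[Dict]:
    """Parse lines such as '0, NVIDIA A100-SXM4-40GB, 41' into dictionaries."""
    gpus = []
    for line in text.splitlines():
        if not line.strip():
            continue
        index_part, rest = line.split(",", 1)
        name, temp = rest.rsplit(",", 1)
        temp = temp.strip()
        gpus.append({
            "index": int(index_part),
            "name": name.strip(),
            # Temperature may be unavailable on some devices; keep None then.
            "temperature_c": int(temp) if temp.isdigit() else None,
        })
    return gpus


def detect_gpus(expected_count: int) -> Dict:
    """Run nvidia-smi and report how many expected GPUs are missing."""
    try:
        out = subprocess.run(QUERY, capture_output=True, text=True,
                             check=True, timeout=10).stdout
    except (OSError, subprocess.SubprocessError):
        # Simplification: tool missing, timeout, or non-zero exit => none visible.
        return {"present": 0, "missing": expected_count, "gpus": []}
    gpus = parse_gpu_csv(out)
    return {"present": len(gpus),
            "missing": max(expected_count - len(gpus), 0),
            "gpus": gpus}
```

Keeping the parser separate from the subprocess call lets the parsing logic be checked on nodes that have no GPU installed.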
[0106] S302. Based on the hardware status detection script, perform status detection on the hardware of the target computing node to obtain hardware status data.
[0107] The target hardware model refers to the model information of the hardware device used by the target computing node. For example, GPU model (such as A100, H100), NPU model (such as Atlas 300, Atlas 900), memory module model (such as DDR5 32GB 4800MHz), hard drive model (such as SSD NVMe 2TB), etc.
[0108] The dimensions and judgment criteria for hardware status data differ for different target hardware models.
[0109] For example, the hardware status data of an NVIDIA A100 GPU includes: whether the GPU is present, driver version, core temperature (normal threshold ≤ 85℃), memory usage, and power consumption; the hardware status data of an NPU includes: whether the NPU is present, chip temperature (normal threshold ≤ 90℃), computing power utilization, and device health level; the hardware status data of a DDR5 memory module includes: whether the module is present, ECC error count (normal threshold = 0), memory frequency, and channel communication status; the hardware status data of an NVMe SSD includes: whether the drive is present, SMART health status (normal threshold ≥ 80 points), bad block count, and read / write I/O response time.
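The per-model thresholds listed above can be checked with a small table of comparison rules. This sketch covers only the one thresholded metric named for each hardware type; the dictionary keys, metric names, and the function check_status are hypothetical.

```python
import operator
from typing import Dict

OPS = {"<=": operator.le, ">=": operator.ge, "==": operator.eq}

# (metric, comparison, normal threshold) per hardware type, from the text's examples.
THRESHOLDS = {
    "gpu": ("core_temperature_c", "<=", 85),
    "npu": ("chip_temperature_c", "<=", 90),
    "ddr5": ("ecc_error_count", "==", 0),
    "nvme_ssd": ("smart_health_score", ">=", 80),
}


def check_status(hardware_type: str, status: Dict) -> str:
    """Classify one device as 'missing', 'normal', or 'abnormal'."""
    if not status.get("present", False):
        return "missing"
    metric, op, limit = THRESHOLDS[hardware_type]
    return "normal" if OPS[op](status[metric], limit) else "abnormal"
```

For instance, a present GPU reporting a core temperature of 91 is classified as abnormal, while a device that is not present is reported as missing regardless of its other metrics.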
[0110] In this application, a corresponding hardware status detection script can be generated according to the target hardware model, which solves the problem that the detection script cannot be adapted to heterogeneous hardware environments, improves the universality and accuracy of hardware status detection, avoids detection failure due to differences in hardware models, and improves the reliability of node isolation.
[0111] Specifically, historical fault data corresponding to the target computing node is obtained; based on the historical fault data, the target detection frequency corresponding to the hardware status detection script is determined; based on the target detection frequency, the hardware status of the target computing node is detected through the hardware status detection script to obtain hardware status data.
[0112] Historical fault data refers to the hardware fault-related records of the target computing node within a preset statistical period (such as the past 30 days), including fault type (such as GPU card loss, memory ECC error), number of faults, fault duration, fault recovery method, etc.
[0113] For example, if the target computing node has experienced 3 GPU card loss failures and 2 memory ECC errors in the past 30 days, this is typical historical failure data.
[0114] The target detection frequency refers to the execution interval of the hardware status detection script (such as once every 1 minute or once every 5 minutes), which is used to dynamically balance detection timeliness against resource overhead according to the node risk level.
[0115] When the target computing node is identified as a high-risk node, the execution interval is shortened relative to the preset detection frequency; when the target computing node is identified as a low-risk node, the preset detection frequency can be maintained, or the execution interval can be appropriately extended.
[0116] For example, if the preset detection frequency is once every 5 minutes and the target computing node is identified as a high-risk node, the target detection frequency is adjusted to once every 1 minute to improve the timeliness of fault detection. If the target computing node is identified as a low-risk node, the preset detection frequency of once every 5 minutes is maintained, or it is adjusted to once every 10 minutes to reduce unnecessary resource consumption.
[0117] This application solves the problem of resource waste or missed detections caused by a fixed detection frequency by dynamically adjusting the detection frequency of the hardware status detection script. While ensuring timely detection, it optimizes the utilization of detection resources and improves system operating efficiency and stability.
[0118] For example, a node is identified as high-risk if it meets any of the following conditions:
[0119] Within the preset statistical period, the number of serious faults (such as GPU card loss, memory ECC error) is ≥ 2;
[0120] Within the preset statistical period, the number of abnormal faults (such as GPU overheating, hard disk SMART warning) is ≥ 5;
[0121] A single serious failure lasts for ≥10 minutes and is not fully resolved (e.g., it recurs after a reboot).
[0122] For example, a node is considered low-risk if it meets all of the following conditions:
[0123] Within the preset statistical period, the number of serious faults is 0;
[0124] Within the preset statistical period, the number of abnormal faults is ≤ 1;
[0125] All faults were automatically repaired (such as driver restart) within 5 minutes and did not recur.
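The risk rules and interval adjustment described above can be sketched as follows. This is a minimal illustration rather than the applicant's implementation: the `FaultHistory` fields, the function names, and the "medium" category for nodes that match neither rule set are assumptions introduced here, while the thresholds and the 1, 5, and 10 minute intervals are taken from the examples in the text.

```python
from dataclasses import dataclass


@dataclass
class FaultHistory:
    """Hypothetical summary of one node's faults in the statistical period."""
    serious_count: int                 # e.g. GPU card loss, memory ECC error
    abnormal_count: int                # e.g. GPU overheating, disk SMART warning
    longest_unresolved_serious_min: float  # 0 if no unresolved serious fault
    all_auto_repaired_within_5_min: bool   # every fault repaired, none recurred


def classify_risk(h: FaultHistory) -> str:
    # High risk: ANY of the three conditions holds.
    if (h.serious_count >= 2
            or h.abnormal_count >= 5
            or h.longest_unresolved_serious_min >= 10):
        return "high"
    # Low risk: ALL of the three conditions hold.
    if (h.serious_count == 0
            and h.abnormal_count <= 1
            and h.all_auto_repaired_within_5_min):
        return "low"
    # Assumption: the text does not name this case; keep the preset interval.
    return "medium"


def target_interval_minutes(risk: str, preset: float = 5.0,
                            extend_low_risk: bool = False) -> float:
    if risk == "high":
        return 1.0                     # shortened interval from the example
    if risk == "low" and extend_low_risk:
        return preset * 2              # 5 -> 10 minutes in the example
    return preset


# Node from the earlier example: 3 GPU card losses + 2 ECC errors in 30 days.
history = FaultHistory(serious_count=5, abnormal_count=0,
                       longest_unresolved_serious_min=0.0,
                       all_auto_repaired_within_5_min=False)
print(classify_risk(history), target_interval_minutes(classify_risk(history)))
```

Separating classification from interval selection keeps the thresholds in one place, so the statistical period or the limits can be tuned without touching the scheduling of the detection script.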
[0126] S303. Based on the hardware status data, generate the fault detection results corresponding to the target computing node.
[0127] Based on hardware status data, the health status of each hardware device in the target computing node is determined, and structured fault detection results are generated according to preset fault level rules.
[0128] Specifically, the hardware status data is compared with the normal threshold and anomaly judgment conditions of the corresponding hardware model to determine whether there is a hardware anomaly. If an anomaly is found, the node identifier, hardware type, hardware model, fault type, detection timestamp and other information are encapsulated into a structured fault detection result in JSON format.
[0129] If the hardware status data is normal, a health status detection result is generated without triggering node isolation and task migration; if the hardware status data is abnormal, a fault status detection result is generated, carrying information such as fault location and fault description.
[0130] In this application, fault detection results are generated through hardware status determination, fault classification, and structured encapsulation, thereby achieving refined fault classification and standardized output and providing a reliable data basis for the subsequent steps.
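As an illustration of the structured encapsulation described above, the following sketch compares status data against per-model limits and emits a JSON result. The threshold table, metric names, and JSON field names are hypothetical; the text only specifies that the result carries the node identifier, hardware type, hardware model, fault type, and detection timestamp.

```python
import json
from datetime import datetime, timezone

# Hypothetical normal-range limits keyed by hardware model.
THRESHOLDS = {
    "NVIDIA A100": {"gpu_temp_c": 85.0, "ecc_errors": 0},
}


def build_detection_result(node_id, hardware_type, hardware_model,
                           status, detected_at=None):
    """Return a JSON string describing the node's health or fault status."""
    limits = THRESHOLDS[hardware_model]
    # A metric is anomalous when it exceeds the limit for this model.
    faults = [
        {"fault_type": metric, "observed": value, "limit": limits[metric]}
        for metric, value in status.items()
        if metric in limits and value > limits[metric]
    ]
    result = {
        "node_id": node_id,
        "hardware_type": hardware_type,
        "hardware_model": hardware_model,
        "status": "fault" if faults else "healthy",
        "faults": faults,
        "detected_at": (detected_at or datetime.now(timezone.utc)).isoformat(),
    }
    return json.dumps(result, sort_keys=True)


print(build_detection_result("node-1", "GPU", "NVIDIA A100",
                             {"gpu_temp_c": 91.0, "ecc_errors": 0}))
```

A healthy result uses the same schema with an empty `faults` list, so downstream consumers can branch on `status` alone.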
[0131] S304. Based on the fault detection results, determine the current fault level of the target computing node.
[0132] Based on the fault type, abnormality level, and hardware status data carried in the fault detection results, the current fault level of the target computing node can be determined according to the preset fault level classification rules.
[0133] Preset fault levels can include serious faults (critical level), abnormal faults (warning level), and minor faults (info level).
[0134] Among them, serious faults are faults that affect the normal operation of hardware, such as GPU card loss and memory ECC errors; abnormal faults are faults in which the hardware is at risk but has not completely failed, such as GPU overheating and hard disk SMART warnings; and minor faults are faults that do not affect the basic functions of hardware, such as slight fluctuations in network latency.
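The classification rules above can be expressed as a lookup table. The sketch below is illustrative only: the fault-type identifiers are hypothetical names for the examples given in the text, and taking the most severe level when a node reports several faults is an assumption, since the text does not state how multiple faults are combined.

```python
from enum import IntEnum


class FaultLevel(IntEnum):
    MINOR = 1      # info level
    ABNORMAL = 2   # warning level
    SERIOUS = 3    # critical level


# Hypothetical identifiers for the example faults named in the text.
FAULT_LEVEL_RULES = {
    "gpu_card_loss": FaultLevel.SERIOUS,
    "memory_ecc_error": FaultLevel.SERIOUS,
    "gpu_overheating": FaultLevel.ABNORMAL,
    "disk_smart_warning": FaultLevel.ABNORMAL,
    "network_latency_fluctuation": FaultLevel.MINOR,
}


def current_fault_level(fault_types):
    """Return the most severe level among the faults, or None if there are none.

    Unknown fault types raise KeyError so that a missing rule is noticed.
    """
    levels = [FAULT_LEVEL_RULES[t] for t in fault_types]
    return max(levels) if levels else None


print(current_fault_level(["gpu_overheating", "memory_ecc_error"]).name)
```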
[0135] S305. Determine the node isolation strategy corresponding to the current fault level.
[0136] Based on the current fault level, a target strategy can be matched from a variety of preset node isolation strategies. These preset isolation strategies can include severe isolation, abnormal isolation, and minor isolation strategies.
[0137] If the current fault level is a serious fault, the node isolation policy is set to the severe isolation policy; if the current fault level is an abnormal fault, the node isolation policy is set to the abnormal isolation policy; if the current fault level is a minor fault, the node isolation policy is set to the minor isolation policy.
[0138] S306. Based on the node isolation strategy, perform node isolation on the target computing node.
[0139] Specifically, if the node isolation policy is a severe isolation policy, the scheduling status of the target computing node will be updated to an unschedulable state; if the node isolation policy is an abnormal isolation policy, the scheduling status of the target computing node will be updated to a degraded running state; if the node isolation policy is a minor isolation policy, a fault log will be recorded, and after the number of faults in the fault log reaches a preset number, the scheduling status of the target computing node will be updated to an unschedulable state.
[0140] In an unschedulable state, no new Pods may be scheduled to this node.
[0141] Degraded operating status is used to indicate that there is an anomaly in the node hardware but it has not completely failed. It allows low-priority task scheduling and prohibits high-priority task scheduling, thereby improving the utilization of cluster resources while ensuring the stability of high-priority services.
[0142] For example, nodes with abnormal (warning-level) faults still accept low-priority tasks (such as non-critical model training and offline data preprocessing tasks) to avoid resource waste, while nodes with serious (critical-level) faults are strictly isolated to prevent new tasks from being incorrectly scheduled to the faulty node.
[0143] In this application, a differentiated isolation strategy based on fault classification is used to achieve a balance between the strictness of fault response and the utilization rate of cluster resources, avoiding resource idleness caused by indiscriminate isolation, while ensuring the operational stability and reliability of high-priority AI training tasks.
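The differentiated isolation behaviour can be sketched as a small state update. This is not the applicant's code: `NodeState`, the status strings, and the value of `MINOR_FAULT_LIMIT` (the "preset number", which the text does not specify) are assumptions. The guard that keeps an already unschedulable node from being relaxed to the degraded state is also an assumption.

```python
from dataclasses import dataclass, field

MINOR_FAULT_LIMIT = 3  # hypothetical value for the "preset number"


@dataclass
class NodeState:
    name: str
    scheduling_status: str = "schedulable"
    fault_log: list = field(default_factory=list)


def isolate(node: NodeState, level: str, description: str) -> str:
    """Apply the isolation policy matching the fault level."""
    if level == "serious":
        node.scheduling_status = "unschedulable"
    elif level == "abnormal":
        # Assumption: never relax a node that is already unschedulable.
        if node.scheduling_status != "unschedulable":
            node.scheduling_status = "degraded"
    elif level == "minor":
        node.fault_log.append(description)
        if len(node.fault_log) >= MINOR_FAULT_LIMIT:
            node.scheduling_status = "unschedulable"
    else:
        raise ValueError(f"unknown fault level: {level}")
    return node.scheduling_status


def can_schedule(node: NodeState, task_priority: str) -> bool:
    """Degraded nodes accept only low-priority tasks."""
    if node.scheduling_status == "unschedulable":
        return False
    if node.scheduling_status == "degraded":
        return task_priority == "low"
    return True


node = NodeState("node-1")
isolate(node, "abnormal", "GPU overheating")
print(can_schedule(node, "low"), can_schedule(node, "high"))
```

In a Kubernetes cluster the unschedulable state would typically correspond to cordoning the node, and the degraded state could be approximated with taints and tolerations; the text does not say which mechanism is used.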
[0144] S307. Determine the selected computing node from multiple candidate computing nodes in the distributed cluster.
[0145] In addition to being schedulable and having normal hardware status, a candidate computing node can also be required to have a hardware model that matches the resource requirements of the target subtask and remaining resources that meet the operational requirements of the target subtask.
[0146] In the distributed cluster, all candidate computing nodes that meet the criteria are selected, and then the selected computing node is determined based on optimal resource utilization and topology affinity.
[0147] Among these, nodes with the same hardware configuration as the faulty node and that have already run other sub-tasks of the same training task are given priority to reduce data transmission overhead and topology adaptation costs.
[0148] For example, suppose there are 3 candidate compute nodes in the distributed cluster, namely Node A, Node B and Node C, and the target subtask is the Master Pod in the PS / Worker architecture (requires NVIDIA A100 GPU and 32GB of memory).
[0149] Among them, Node A is schedulable, with hardware of NVIDIA A100 GPU + 64GB memory, and has 2 Worker Pods running this training task (high topology affinity), with remaining resources meeting the requirements; Node B is schedulable, with hardware of NVIDIA H100 GPU + 64GB memory, and has no Pods running this training task (low topology affinity), with remaining resources meeting the requirements; Node C is schedulable, with hardware of NVIDIA A100 GPU + 16GB memory (insufficient memory, resources not met).
[0150] By identifying Node A as the selected computing node, hardware resource matching is ensured while data synchronization costs are reduced through topology affinity.
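The Node A / Node B / Node C example can be reproduced with a filter-then-rank sketch. The `Candidate` fields and the ranking key are assumptions: the listed memory is treated as available memory, memory and schedulability are hard filters, and a matching GPU model followed by the number of sibling Pods of the same training task is used as the ordering. The text names "optimal resource utilization and topology affinity" as criteria but does not define a scoring formula.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    schedulable: bool
    gpu_model: str
    memory_gb: int      # treated here as memory available to the subtask
    sibling_pods: int   # Pods of the same training task already on the node


def select_node(candidates, required_gpu, required_memory_gb):
    """Filter out nodes that cannot host the subtask, then rank the rest."""
    eligible = [
        c for c in candidates
        if c.schedulable and c.memory_gb >= required_memory_gb
    ]
    if not eligible:
        return None
    # Prefer a matching hardware model, then higher topology affinity.
    return max(eligible,
               key=lambda c: (c.gpu_model == required_gpu, c.sibling_pods))


nodes = [
    Candidate("Node A", True, "NVIDIA A100", 64, 2),
    Candidate("Node B", True, "NVIDIA H100", 64, 0),
    Candidate("Node C", True, "NVIDIA A100", 16, 0),
]
chosen = select_node(nodes, required_gpu="NVIDIA A100", required_memory_gb=32)
print(chosen.name)
```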
[0151] S308. In the selected computing node, perform reconstruction processing on at least one target subtask.
[0152] Specifically, based on the training task topology corresponding to the target computing node, the priority of each target subtask in at least one target subtask is determined; based on the priority of each target subtask, reconstruction processing is performed on the selected computing node.
[0153] During the reconstruction process, the core configuration of the original task (such as model parameters, data sharding information, and network communication configuration) is fully preserved, and the intermediate data of the original task is synchronized through distributed storage (such as NAS and GlusterFS) to ensure seamless task recovery after migration.
[0154] Training task topology refers to the role definition, dependencies, and communication architecture of each subtask (Pod) in a distributed training task. Examples include PS / Worker (parameter server / worker node) architecture, Master-Worker architecture, and AllReduce architecture, which are used to clarify the priority and data flow of subtasks.
[0155] For example, in a PS / Worker architecture, if the node where the Worker Pod resides fails, vc-controller will prioritize rebuilding the Worker Pod and ensure that its communication topology is consistent with that of the Parameter Server, so as to avoid the task getting stuck because the Worker Pod cannot synchronize data after being rebuilt.
[0156] Priorities can include high-priority subtasks, medium-priority subtasks, and low-priority subtasks.
[0157] Among them, high-priority subtasks can be responsible for global parameter storage, task scheduling, or data distribution functions, including Parameter Server, Master Pod, data synchronization node, etc.
[0158] Medium-priority subtasks can be subtasks that directly affect the computational efficiency of the training task, including core Worker Pods (such as Worker nodes that undertake key computational steps).
[0159] Low-priority subtasks can be subtasks that do not affect the core logic of the training task and only provide auxiliary functions, including auxiliary Worker Pods, log collection Pods, etc.
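The three priority tiers can be sketched as a role-to-priority table that determines the rebuild order. The role strings and Pod names below are hypothetical labels for the roles listed in the text, and the sketch only orders the subtasks; it does not perform the rebuild itself.

```python
# Lower number = rebuilt earlier. Role names are hypothetical labels.
ROLE_PRIORITY = {
    "parameter-server": 0,   # high priority
    "master": 0,
    "data-sync": 0,
    "core-worker": 1,        # medium priority
    "aux-worker": 2,         # low priority
    "log-collector": 2,
}


def rebuild_order(subtasks):
    """Order (pod_name, role) pairs by the priority of their role.

    sorted() is stable, so Pods with the same priority keep their input order.
    """
    ranked = sorted(subtasks, key=lambda item: ROLE_PRIORITY[item[1]])
    return [pod_name for pod_name, _role in ranked]


affected = [
    ("log-0", "log-collector"),
    ("worker-3", "core-worker"),
    ("ps-0", "parameter-server"),
    ("worker-7", "aux-worker"),
]
print(rebuild_order(affected))
```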
[0160] In this application, a priority reconstruction strategy driven by topology resolution and data consistency guarantee are used to achieve topology consistency and seamless migration of distributed training tasks. This avoids the problems of interruption of the entire training task, data inconsistency or training progress rollback caused by the failure of some Pods, and improves the reliability of training tasks.
[0161] The hardware fault handling method provided in this application embodiment can use a hardware status detection script adapted to the target hardware model to perform accurate status detection on the hardware of the target computing node and obtain multi-dimensional hardware status data; convert the hardware status data into a structured fault detection result containing metadata such as fault type, fault level, and node identifier; execute a differentiated node isolation strategy according to the fault level; and coordinate healthy selected computing nodes in the cluster to complete the training task reconstruction through the priority reconstruction mechanism driven by topology resolution.
[0162] By generating customized detection scripts for the target hardware model, the problem of traditional fixed scripts being unable to adapt to heterogeneous hardware environments is solved, enabling fine-grained status detection of different types of hardware such as GPUs, NPUs, memory, and hard drives, and significantly reducing the probability of missed or false detections of hardware faults.
[0163] Based on a graded mechanism of serious faults, abnormal faults, and minor faults, a differentiated isolation strategy is implemented: nodes with serious faults are immediately isolated to avoid mis-scheduling of tasks, nodes with abnormal faults are degraded to improve resource utilization, and minor faults are only logged without affecting services. This solves the problem of resource idleness caused by traditional indiscriminate isolation. At the same time, the fault isolation time is reduced from minutes to seconds through the active polling mechanism of the node isolation device, so as to achieve rapid fault response.
[0164] Reconstructing subtasks based on training task topology priority prioritizes the reconstruction and communication consistency of key components such as PS Pod and Master Pod. Combined with distributed storage to synchronize intermediate data, it enables seamless migration and recovery of training tasks, avoiding training task interruption, data inconsistency or progress rollback caused by node failure, and improving the reliability of distributed AI training tasks.
[0165] By dynamically adjusting the hardware detection frequency, resource consumption is optimized while ensuring timely detection. The structured fault detection results provide clear fault tracing basis for operation and maintenance. Combined with the hierarchical alarm mechanism, the complexity of cluster operation and maintenance is reduced, and the overall stability and fault tolerance of the AI training cluster are improved.
[0166] Figure 4 is a timing diagram illustrating a hardware fault handling method provided in an embodiment of this application. Referring to Figure 4, the hardware fault handling process in this embodiment is based on the collaborative interaction of a hardware fault monitor and a task migration controller, which includes a node hardware health detection device, a node isolation device, and a cluster management system API service.
[0167] The node isolation device initiates a timed detection request to the node hardware health detection device, triggering a hardware status detection of the target computing node; after the node hardware health detection device detects a hardware abnormality in the target computing node, it sends the node abnormality information back to the node isolation device.
[0168] Based on node anomaly information, the node isolation device sends an operation to the cluster management system API service to take the node offline, records the anomaly information, and completes the scheduling status marking and anomaly information reporting of the faulty node; the task migration controller continuously sends listening node status requests to the cluster management system API service to obtain real-time changes in the scheduling status of nodes within the cluster.
[0169] The cluster management system API service synchronizes the state changes of the faulty node to the task migration controller and sends a node anomaly notification to it. After receiving the node anomaly notification, the task migration controller initiates the migration operation of the affected container to the cluster management system API service, completing the topology consistency reconstruction and migration of the training task containers on the faulty node.
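The interaction sequence described above can be simulated in memory as follows. All class and method names are hypothetical stand-ins for the components in the timing diagram; a real deployment would use the cluster management system's own API and watch mechanism, which are not modelled here.

```python
class ApiService:
    """Stand-in for the cluster management system API service."""

    def __init__(self):
        self.node_status = {}
        self.watchers = []
        self.events = []

    def watch_nodes(self, callback):
        # The task migration controller registers to hear about status changes.
        self.watchers.append(callback)

    def set_node_status(self, node, status, reason):
        self.node_status[node] = status
        self.events.append(("node_status", node, status, reason))
        for notify in self.watchers:
            notify(node, status)

    def migrate_pods(self, node):
        self.events.append(("migrate", node))


class MigrationController:
    """Reacts to unschedulable nodes by requesting migration of their Pods."""

    def __init__(self, api):
        self.api = api
        api.watch_nodes(self.on_node_change)

    def on_node_change(self, node, status):
        if status == "unschedulable":
            self.api.migrate_pods(node)


class IsolationDevice:
    """Polls the health detector and takes faulty nodes offline."""

    def __init__(self, api, detector):
        self.api = api
        self.detector = detector  # callable: node name -> anomaly text or None

    def poll(self, node):
        anomaly = self.detector(node)
        if anomaly is not None:
            self.api.set_node_status(node, "unschedulable", anomaly)


api = ApiService()
MigrationController(api)
device = IsolationDevice(api, lambda n: "GPU card loss" if n == "node-1" else None)
device.poll("node-1")
device.poll("node-2")
print(api.events)
```

Only the faulty node produces events: one status change recorded with the anomaly information, followed by one migration request triggered through the watch callback.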
[0170] Figure 5 is a schematic diagram of a hardware fault handling device provided in an embodiment of this application. Referring to Figure 5, the hardware fault handling device 500 includes a status detection module 501, a generation module 502, a node isolation module 503, and a reconstruction processing module 504.
[0171] The status detection module 501 is used to perform status detection on the hardware of the target computing node and obtain hardware status data.
[0172] The generation module 502 is used to generate fault detection results corresponding to the target computing node based on hardware status data.
[0173] The node isolation module 503 is used to isolate the target computing node based on the fault detection results.
[0174] The reconstruction processing module 504 is used to perform reconstruction processing on at least one target subtask corresponding to the target computing node in the distributed cluster corresponding to the target computing node.
[0175] Optionally, the status detection module 501 is specifically used for:
[0176] Generate a hardware status detection script that corresponds to the target hardware model of the target computing node;
[0177] Based on the hardware status detection script, the hardware status of the target computing node is detected to obtain hardware status data.
[0178] Optionally, the status detection module 501 is specifically used for:
[0179] Obtain historical fault data corresponding to the target computing node;
[0180] Based on historical fault data, determine the target detection frequency corresponding to the hardware status detection script;
[0181] Based on the target detection frequency, the hardware status of the target computing node is detected using a hardware status detection script to obtain hardware status data.
[0182] Optionally, the reconstruction processing module 504 is specifically used for:
[0183] Determine the selected computing node from multiple candidate computing nodes in the distributed cluster, where the scheduling status of each candidate computing node is schedulable.
[0184] In the selected computing node, at least one target subtask is rebuilt.
[0185] Optionally, the reconstruction processing module 504 is specifically used for:
[0186] Based on the training task topology corresponding to the target computing node, determine the priority of each target subtask in at least one target subtask;
[0187] Based on the priority of each target subtask, reconstruction processing is performed on the selected computing node.
[0188] Optionally, the node isolation module 503 is specifically used for:
[0189] Based on the fault detection results, determine the current fault level of the target computing node;
[0190] Determine the node isolation strategy corresponding to the current fault level;
[0191] Based on the node isolation strategy, the target computing nodes are isolated.
[0192] Optionally, the node isolation module 503 is specifically used for:
[0193] If the node isolation policy is a severe isolation policy, then the scheduling status of the target computing node will be updated to an unschedulable state.
[0194] If the node isolation policy is an abnormal isolation policy, then the scheduling status of the target computing node will be updated to a degraded running status.
[0195] If the node isolation policy is a minor isolation policy, then a fault log is recorded. Once the number of faults in the fault log reaches a preset number, the scheduling status of the target computing node is updated to an unschedulable state.
[0196] The hardware fault handling device provided in this application embodiment can execute the technical solution shown in the above method embodiment. Its implementation principle and beneficial effects are similar, and will not be described again here.
[0197] Figure 6 is a schematic diagram of the structure of an electronic device provided in an embodiment of this application. Referring to Figure 6, the electronic device 600 may include a processor 601 and a memory 602 communicatively connected to the processor 601. Exemplarily, the processor 601 and the memory 602 are interconnected via a bus 603.
[0198] The memory 602 stores computer-executable instructions;
[0199] The processor 601 executes the computer-executable instructions stored in the memory 602, causing the processor 601 to perform the hardware fault handling method as described in the above method embodiment.
[0200] Accordingly, embodiments of this application provide a computer-readable storage medium storing computer-executable instructions, which, when executed by a processor, are used to implement the hardware fault handling method of the above-described method embodiments.
[0201] Accordingly, embodiments of this application may also provide a computer program product, including a computer program, which, when executed by a processor, can implement the hardware fault handling method shown in the above method embodiments.
[0202] Those skilled in the art will understand that embodiments of the present invention can be provided as methods, systems, or computer program products. Therefore, the present invention can take the form of a completely hardware embodiment, a completely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present invention can take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.
[0203] It should be noted that, for the sake of simplicity, the foregoing method embodiments are all described as a series of actions. However, those skilled in the art should understand that this application is not limited to the described order of actions, as some steps may be performed in other orders or simultaneously according to this application. Furthermore, those skilled in the art should also understand that the embodiments described in the specification are all optional embodiments, and the actions and modules involved are not necessarily essential to this application.
[0204] It should be further noted that although the steps in the flowchart are shown sequentially according to the arrows, these steps are not necessarily executed in the order indicated by the arrows. Unless explicitly stated herein, there is no strict order restriction on the execution of these steps, and they can be executed in other orders. Moreover, at least some steps in the flowchart may include multiple sub-steps or multiple stages. These sub-steps or stages are not necessarily completed at the same time, but can be executed at different times. The execution order of these sub-steps or stages is not necessarily sequential, but can be performed alternately or in turn with other steps or at least some of the sub-steps or stages of other steps.
[0205] It should be understood that the above-described device embodiments are merely illustrative, and the device of this application can also be implemented in other ways. For example, the division of units / modules in the above embodiments is only a logical functional division, and there may be other division methods in actual implementation. For example, multiple units, modules, or components may be combined, or integrated into another system, or some features may be ignored or not executed.
[0206] Furthermore, unless otherwise specified, the functional units / modules in the various embodiments of this application can be integrated into one unit / module, or each unit / module can exist physically separately, or two or more units / modules can be integrated together. The integrated units / modules described above can be implemented in hardware or as software program modules.
[0207] When integrated units / modules are implemented in hardware, the hardware can be digital circuits, analog circuits, etc. The physical implementation of the hardware structure includes, but is not limited to, transistors, memristors, etc. Unless otherwise specified, the processor can be any suitable hardware processor, such as a CPU, GPU, FPGA, DSP, or ASIC. Unless otherwise specified, the storage unit can be any suitable storage medium, including magnetic or magneto-optical storage media and memories such as Resistive Random Access Memory (RRAM), Dynamic Random Access Memory (DRAM), Static Random Access Memory (SRAM), Enhanced Dynamic Random Access Memory (EDRAM), High-Bandwidth Memory (HBM), Hybrid Memory Cube (HMC), etc.
[0208] If the integrated unit / module is implemented as a software program module and sold or used as an independent product, it can be stored in a computer-readable memory. Based on this understanding, the technical solution of this application, in essence, or the part that contributes to the prior art, or all or part of the technical solution, can be embodied in the form of a software product. This computer software product is stored in a memory and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute all or part of the steps of the methods of the various embodiments of this application. The aforementioned memory includes various media capable of storing program code, such as a USB flash drive, read-only memory (ROM), random access memory (RAM), portable hard drive, magnetic disk, or optical disk.
[0209] In the above embodiments, the descriptions of each embodiment have their own emphasis. For parts not described in detail in a certain embodiment, please refer to the relevant descriptions of other embodiments. The technical features of the above embodiments can be combined arbitrarily. For the sake of brevity, not all possible combinations of the technical features in the above embodiments are described. However, as long as the combination of these technical features does not contradict each other, it should be considered within the scope of this specification.
[0210] Other embodiments of this application will readily occur to those skilled in the art upon consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of this application that follow the general principles of this application and include common knowledge or customary techniques in the art not disclosed herein. The specification and examples are to be considered exemplary only, and the true scope and spirit of this application are indicated by the following claims.
[0211] It should be understood that this application is not limited to the precise structure described above and shown in the accompanying drawings, and various modifications and changes can be made without departing from its scope. The scope of this application is limited only by the appended claims.
Claims
1. A hardware fault handling method, characterized in that it comprises: performing status detection on the hardware of a target computing node to obtain hardware status data; generating, based on the hardware status data, a fault detection result corresponding to the target computing node; performing node isolation on the target computing node based on the fault detection result; and performing, in the distributed cluster corresponding to the target computing node, reconstruction processing on at least one target subtask corresponding to the target computing node.
2. The method according to claim 1, characterized in that performing status detection on the hardware of the target computing node to obtain hardware status data comprises: generating a hardware status detection script corresponding to the target hardware model of the target computing node; and performing, based on the hardware status detection script, status detection on the hardware of the target computing node to obtain the hardware status data.
3. The method according to claim 2, characterized in that performing, based on the hardware status detection script, status detection on the hardware of the target computing node to obtain the hardware status data comprises: obtaining historical fault data corresponding to the target computing node; determining, based on the historical fault data, the target detection frequency corresponding to the hardware status detection script; and performing, based on the target detection frequency, status detection on the hardware of the target computing node using the hardware status detection script to obtain the hardware status data.
4. The method according to claim 1, characterized in that performing, in the distributed cluster corresponding to the target computing node, reconstruction processing on at least one target subtask corresponding to the target computing node comprises: determining a selected computing node from multiple candidate computing nodes in the distributed cluster, wherein the scheduling status of each candidate computing node is schedulable; and performing, in the selected computing node, the reconstruction processing on the at least one target subtask.
5. The method according to claim 4, characterized in that performing, in the selected computing node, the reconstruction processing on the at least one target subtask comprises: determining, based on the training task topology corresponding to the target computing node, the priority of each target subtask in the at least one target subtask; and performing, based on the priority of each target subtask, the reconstruction processing in the selected computing node.
6. The method according to claim 1, characterized in that performing node isolation on the target computing node based on the fault detection result comprises: determining, based on the fault detection result, the current fault level of the target computing node; determining the node isolation strategy corresponding to the current fault level; and performing node isolation on the target computing node based on the node isolation strategy.
7. The method according to claim 6, characterized in that performing node isolation on the target computing node based on the node isolation strategy comprises: if the node isolation strategy is a severe isolation strategy, updating the scheduling status of the target computing node to an unschedulable state; if the node isolation strategy is an abnormal isolation strategy, updating the scheduling status of the target computing node to a degraded running state; and if the node isolation strategy is a minor isolation strategy, recording a fault log and, after the number of faults in the fault log reaches a preset number, updating the scheduling status of the target computing node to an unschedulable state.
8. An electronic device, characterized in that it comprises: a processor, and a memory communicatively connected to the processor; the memory stores computer-executable instructions; and the processor executes the computer-executable instructions stored in the memory to implement the method as described in any one of claims 1-7.
9. A computer-readable storage medium, characterized in that the computer-readable storage medium stores computer-executable instructions that, when executed by a processor, are used to implement the method described in any one of claims 1-7.
10. A computer program product, characterized in that it includes a computer program that, when executed by a processor, implements the method described in any one of claims 1-7.