A distributed training task monitoring method and device

By deploying multi-level liveness probes on the K8s platform, full-link health monitoring of distributed training tasks is achieved, solving the problem of hidden failures, improving task execution efficiency and resource utilization, and reducing operation and maintenance costs.

CN122220177A · Pending · Publication Date: 2026-06-16 · NEW H3C TECH CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
NEW H3C TECH CO LTD
Filing Date
2026-02-27
Publication Date
2026-06-16

AI Technical Summary

Technical Problem

In a distributed task scheduling environment based on Kubeflow, the phenomenon of implicit failures, where a task has failed internally but the container process continues to run, cannot be directly perceived, resulting in low task execution efficiency, low resource utilization, and increased operation and maintenance costs.

Method used

By deploying multi-level liveness probes on the K8s platform, monitoring is extended from the container process level to the business logic level. This provides full-link health monitoring of distributed training tasks, including fine-grained detection of the running script status, business process status, and inter-node communication status.

Benefits of technology

Timely detection of node failures, script anomalies, or communication interruptions improves the execution efficiency of distributed training tasks, avoids resource waste, and reduces operation and maintenance costs.



Abstract

The application provides a distributed training task monitoring method and device. The method comprises: obtaining key task information of a distributed training task; according to the key task information, allocating cluster resources to each task logical node and selecting a suitable cluster physical node, generating a Pod definition for each task logical node, and submitting it to the API Server for persistence; and, for each task logical node, issuing an execution instruction to the cluster physical node adapted to that task logical node, triggering the Kubelet component on the cluster physical node to start a container and deploy a multi-level liveness probe in the container. Through the multi-level liveness probe, the method achieves fine-grained, full-link health monitoring of the distributed training task, detects node faults, script abnormalities, or communication interruptions in a timely manner, and improves the stability and observability of the distributed training task.

Description

Technical Field

[0001] This application relates to the field of artificial intelligence technology, and in particular to a method and apparatus for monitoring distributed training tasks.

Background Technology

[0002] Kubeflow, an open-source machine learning platform based on Kubernetes (K8s), provides native support for the full lifecycle management of AI workloads. Its toolchain covers core aspects from data preprocessing and model training to service deployment, and has become the mainstream infrastructure in distributed machine learning scenarios.

[0003] In Kubeflow-based distributed task scheduling environments, a common phenomenon is the implicit failure where a task has internally failed, but the container process continues to run. This type of failure cannot be directly detected through the native state interfaces of Kubeflow or Kubernetes, making it difficult for operations personnel to identify and intervene in a timely manner. This implicit failure problem is particularly prominent in long-running, high-cost distributed training tasks, severely impacting task execution efficiency and resource utilization, and significantly increasing the overall operational and time costs of AI projects.

Summary of the Invention

[0004] This application provides a method and apparatus for monitoring distributed training tasks, which uses multi-level liveness probes to extend monitoring from the traditional container process level to the business logic level, thereby achieving fine-grained, end-to-end health monitoring of distributed training tasks.

[0005] Specifically, this application provides the following technical solution. In a first aspect, this application provides a distributed training task monitoring method applied to the K8s platform. The method includes:
Obtaining key task information for the distributed training task, including the number of task logical nodes participating in the training, the role of each task logical node, and the running script configuration.
Based on the key task information, allocating cluster resources to each task logical node and selecting suitable cluster physical nodes; generating a Pod definition for each task logical node and submitting it to the API Server for persistence, where the Pod definition includes a multi-level liveness probe configuration.
For each task logical node, issuing an execution command to the cluster physical node adapted to that task logical node, triggering the Kubelet component on the cluster physical node to execute the following process: according to the Pod definition of the task logical node, pull the container image, configure the container runtime environment, and start the container to execute the running script of the task logical node; according to the multi-level liveness probe configuration in the Pod definition of the task logical node, deploy multi-level liveness probes in the container. The multi-level liveness probes periodically probe multiple probe items of the current container and output the probe results. The probe items include the running script status, business process status, and inter-node communication status.

[0006] In a second aspect, this application provides a distributed training task monitoring device applied to a K8s platform. The device includes:
An information acquisition module, used to acquire key task information of the distributed training task, including the number of task logical nodes participating in the training, the role of each task logical node, and the running script configuration.
A resource allocation module, used to allocate cluster resources to each task logical node and select suitable cluster physical nodes according to the key task information, and to generate a Pod definition for each task logical node and submit it to the API Server for persistence, where the Pod definition includes a multi-level liveness probe configuration.
A deployment module, used to issue execution instructions to the cluster physical node adapted to each task logical node, triggering the Kubelet component on the cluster physical node to execute the following process: according to the Pod definition of the task logical node, pull the container image, configure the container runtime environment, and start the container to execute the running script of the task logical node; according to the multi-level liveness probe configuration in the Pod definition of the task logical node, deploy multi-level liveness probes in the container. The multi-level liveness probes periodically probe multiple probe items of the current container and output the probe results. The probe items include the running script status, business process status, and inter-node communication status.

[0007] In a third aspect, this application provides an electronic device comprising a memory and one or more processors, the memory being coupled to the processors. The memory stores computer program code that includes computer instructions; when the computer instructions are executed by the processors, the electronic device performs the method described above.

[0008] Fourthly, this application provides a computer-readable storage medium including computer instructions that, when executed on an electronic device, cause the electronic device to perform the method described above.

[0009] Fifthly, this application provides a computer program product that, when run on a computer, causes the computer to perform the method described above.

[0010] The technical solution provided in this application has the following beneficial effects: This application achieves refined, end-to-end health monitoring of distributed training tasks through multi-level survival probes, promptly detecting node failures, script anomalies, or communication interruptions, avoiding the impact of hidden failures on AI engineering, improving the execution efficiency of distributed training tasks, and preventing the waste of cluster computing resources.

[0011] It should be understood that the above general description and the following detailed description are exemplary and explanatory only, and do not limit this application.

Attached Figure Description

[0012] The accompanying drawings, which are incorporated in and form part of this application, illustrate embodiments consistent with this application and, together with the description, serve to explain the principles of this application.

[0013] Figure 1 is a flowchart of the distributed training task monitoring method provided in an embodiment of this application; Figure 2 is a schematic diagram of the distributed training task monitoring device provided in an embodiment of this application; Figure 3 is a schematic diagram of the structure of an electronic device provided in an embodiment of this application.

Detailed Implementation

[0014] The technical solution of this application will be clearly and completely described below with reference to the accompanying drawings and specific embodiments. It should be understood that the drawings and embodiments of this application are for illustrative purposes only and are not intended to limit the scope of protection of this application. This application can be implemented in many different forms and should not be construed as limited to the embodiments described herein.

[0015] In Kubeflow distributed task scheduling scenarios (such as distributed model training and inference), there is a phenomenon where a task logic has failed internally, but the container still remains running. In this state, users cannot directly perceive the task anomaly through the native interface of Kubernetes or Kubeflow. This type of hidden failure seriously affects task execution efficiency and wastes computing resources.

[0016] The core causes of the above phenomenon can be summarized into the following three levels:
1. Native Kubernetes can only monitor the lifecycle state at the container level, and can only reflect whether the container process exists or has exited. It cannot penetrate to the business logic layer and is difficult to reflect the actual execution state of the training or inference process.
2. The Kubeflow task management component can only monitor the overall running status at the Pod level and can only perceive the start and stop status of the Pod. It lacks the ability to perceive the logical status of the business process inside the container and cannot identify logical anomalies inside the process.
3. Traditional machine learning code has shortcomings in exception handling strategies and process exit mechanisms, which can easily lead to hidden failure scenarios such as task freezes, suspensions, and logic blockages, further amplifying the impact of hidden failures.

[0017] To address the aforementioned issues, this application provides a method for monitoring distributed training tasks. Its core is to deploy multi-level liveness probes within each container of the distributed training task on the Kubernetes platform. These probes extend monitoring from the traditional container process level to the business logic level, building a multi-dimensional monitoring system that covers "basic container liveness verification - business logic execution status monitoring - end-to-end communication interaction verification." This enables fine-grained, end-to-end health monitoring of distributed training tasks and makes it possible to trigger differentiated anomaly handling strategies based on probe results. Node failures, script anomalies, and communication interruptions can thus be detected in time, which addresses the various hidden failure scenarios, keeps tasks running stably, and improves resource utilization.

[0018] Before introducing the technical solution of this application, the core concepts involved in this application will be explained first.

[0019] Kubernetes (K8s): The core container orchestration platform, which is the central hub of the entire architecture. It is responsible for receiving task requests, scheduling cluster resources, managing the lifecycle of Pods and containers, receiving task instructions from the upper-layer Kubeflow, and controlling the execution behavior of the lower-layer Kubelet.

[0020] Kubeflow: An upper-layer application for AI task management, running on Kubernetes, responsible for submitting distributed training tasks related to upper-layer AI, parsing task information, and defining node roles.

[0021] Kubelet: The core node-level component of Kubernetes (K8s) and its execution component on the physical nodes of the cluster. It runs on each physical node of the cluster, receives scheduling instructions from K8s, and is responsible for specific operations such as Pod creation, container startup, probe deployment and startup, and health status feedback.

[0022] The technical solution and its beneficial effects of this application will be described in detail below through specific embodiments.

[0023] This embodiment provides a distributed training task monitoring method applied to the Kubernetes platform. As shown in Figure 1, the method may include the following steps:
Step 110: Obtain key task information for the distributed training task. The key task information includes the number of task logical nodes participating in the training, the role of each task logical node, and the running script configuration.

[0024] Specifically, the Kubernetes platform obtains the aforementioned key task information from Kubeflow, which is obtained by Kubeflow by parsing the logical resource requirements of the distributed training task.

[0025] In this embodiment, the roles of task logic nodes include master nodes and worker nodes. The runtime script configuration includes script path, startup command, and node communication parameters. The runtime script configuration of task logic nodes with the same role can be the same.

[0026] Specifically, the master node's runtime script configuration includes at least the service listening IP and port, used to establish encrypted communication links with each worker node, facilitating the issuance of training scheduling instructions to worker nodes and the receipt of task statuses submitted by worker nodes; the worker node's runtime script configuration includes at least the connection target IP and port, used to accurately locate and securely access the master node.

[0027] Step 120: Based on the key task information, allocate cluster resources for each task logical node and select suitable cluster physical nodes. Generate a Pod definition for each task logical node and submit it to the API Server for persistence. The Pod definition includes multi-level liveness probe configuration.

[0028] Specifically, the Kubernetes platform allocates corresponding computing, storage, and network resources to each task logical node based on the key task information, and selects cluster physical nodes (i.e., the hardware on which containers actually run in Kubernetes) whose capacity matches the load. During resource allocation, Kubernetes differentiates resource allocation based on the role of the task logical node; for example, the master node is given priority for higher computing and network resources, which helps keep the scheduling core of the distributed training stable.

[0029] In addition, Kubernetes generates a Pod definition for each task logical node based on the key task information. This Pod definition can include the task logical node's resource requirements (such as compute, storage, and network resource quotas), the task logical node's running script configuration, the container image configuration (such as image address, version, and pull policy), and the multi-level liveness probe configuration (such as probe mode, probe frequency, and probe script path). Kubernetes then submits the Pod definition to its API Server for persistence, making it available for subsequent calls.

[0030] The format of the Pod definition mentioned above can be a Pod configuration file (yaml file).
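As a rough illustration of such a Pod definition, the sketch below builds a manifest for one task logical node as a Python dictionary and prints it as JSON, which Kubernetes tooling accepts alongside YAML. The field names (`livenessProbe`, `exec`, `initialDelaySeconds`, `periodSeconds`, `resources`, `imagePullPolicy`) are standard Pod fields. The image address, script paths, resource figures, and the helper name `build_pod_definition` are hypothetical, and since the application does not disclose its actual multi-level probe schema, a single exec probe that points at a probe script is assumed here.

```python
import json


def build_pod_definition(name, role, image, run_command, probe_script,
                         initial_delay_s=10, period_s=10):
    """Build a Pod manifest (as a dict) for one task logical node."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name, "labels": {"role": role}},
        "spec": {
            "containers": [{
                "name": "trainer",
                "image": image,
                "imagePullPolicy": "IfNotPresent",
                # Running script of the task logical node.
                "command": run_command,
                # Hypothetical resource quota for this role.
                "resources": {"requests": {"cpu": "4", "memory": "16Gi"}},
                "livenessProbe": {
                    # Exec probe: the kubelet runs this command in the container.
                    "exec": {"command": ["/bin/sh", probe_script]},
                    "initialDelaySeconds": initial_delay_s,
                    "periodSeconds": period_s,
                },
            }],
        },
    }


pod = build_pod_definition(
    name="train-master-0",
    role="master",
    image="registry.example.com/train:latest",      # hypothetical image address
    run_command=["python", "/workspace/train.py"],  # hypothetical script path
    probe_script="/probes/multilevel_probe.sh",     # hypothetical probe script
)
print(json.dumps(pod, indent=2))
```

Note that a native exec probe treats exit status 0 as healthy and any other status as a failure, so a probe script used this way would have to report its per-item codes through some other channel (for example its output) or map them onto the exit status. How the application handles this mapping is not stated.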

[0031] Step 130: For each task logic node, issue an execution command to the cluster physical node that is adapted to the task logic node, triggering the Kubelet component on the cluster physical node to start a container on the cluster physical node and deploy multi-level liveness probes in the container.

[0032] The Kubelet component executes the following process: It pulls the container image, configures the container runtime environment, and starts the container according to the Pod definition of the task logical node to execute the task logical node's runtime script; based on the multi-level liveness probe configuration in the Pod definition of the task logical node, it deploys multi-level liveness probes within the container. These multi-level liveness probes periodically probe multiple probe items of the current container and output the probe results. These multiple probe items include the runtime script status, business process status, and inter-node communication status.

[0033] As mentioned earlier, Kubernetes allocates cluster physical nodes and generates Pod definitions for each task logical node. Kubernetes then issues execution instructions to the cluster physical node adapted to each task logical node, triggering the Kubelet component on that node to start the container according to the corresponding Pod definition, thus completing the deployment and startup of the distributed training task.

[0034] Specifically, after receiving the execution command, the Kubelet on the node first reads the corresponding Pod definition, then parses the container image configuration within it, and then pulls the corresponding container image from the image repository according to the parsed image address. Based on the Pod definition, it mounts the storage volume, sets environment variables, configures the network, and finally creates and starts the container based on the container image. After the container starts, it automatically executes the running script, thus officially starting the distributed training task.

[0035] If any container fails to start, Kubernetes will record the reason for the failure, such as insufficient resources, failure to pull the image, or incorrect script configuration, and return a failure notification containing the reason for the failure to the user; if all containers start successfully, the subsequent probe deployment phase will begin.

[0036] During the probe deployment phase, the Kubelet component deploys multi-level liveness probes within the container based on the multi-level liveness probe configuration in the Pod definition. These multi-level liveness probes work in conjunction with the running scripts within the container. At this point, multi-level liveness probes have been deployed in all containers of the distributed training task. Subsequently, the multi-level liveness probes begin periodically probing multiple probe items within the current container.

[0037] In this embodiment, the multi-level liveness probe detects the current container using the Exec Command method. Because a native Kubernetes probe supports only single-command execution, this embodiment designs a multi-command, multi-level detection mechanism to meet the fine-grained, multi-dimensional detection requirements for script status, business process status, and inter-node communication status in distributed training tasks.

[0038] Specifically, this embodiment configures multiple probe commands (command1, command2, etc.) in the probe script path, and each command can further contain sub-commands (e.g., command1 can contain the nested sub-commands command1-1, command1-2, etc.), forming a hierarchical probe command structure. Each probe command is used to probe one item. Different levels represent the order of probes, and the system executes the probe commands sequentially according to the hierarchical order.

[0039] In other words, the specific detection content and order can be defined within the detection script path. During actual detection, the multi-level survival probe sequentially probes each item according to the detection script path.
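One way to picture the hierarchical command structure is as an ordered tree. The following is a minimal sketch under the assumption that a parent command runs before its nested sub-commands (a pre-order walk); the application does not spell out the exact traversal, and the names `ProbeCommand` and `execution_order` are illustrative only.

```python
from dataclasses import dataclass, field


@dataclass
class ProbeCommand:
    """One probe command; `children` holds its nested sub-commands."""
    name: str
    children: list = field(default_factory=list)


def execution_order(commands):
    """Return command names in run order: each parent, then its children."""
    order = []
    for cmd in commands:
        order.append(cmd.name)
        order.extend(execution_order(cmd.children))
    return order


probe_tree = [
    ProbeCommand("command1", [ProbeCommand("command1-1"),
                              ProbeCommand("command1-2")]),
    ProbeCommand("command2"),
]
print(execution_order(probe_tree))
# ['command1', 'command1-1', 'command1-2', 'command2']
```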

[0040] Specifically, the detection content includes the following three preset detection items:
(1) Running script status: Check whether the running script is executing normally and whether a crash or deadlock has occurred;
(2) Business process status: Detect the CPU / memory usage of the core training process and whether it generates valid logs;
(3) Inter-node communication status: Check whether the network connection between the worker node and the master node is working and whether gradient synchronization is normal.
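The three preset items could be approximated with simple standard-library checks, as sketched below. These are deliberately coarse proxies chosen for illustration: file presence stands in for script status, log freshness for business process activity, and a TCP connection attempt for master reachability. Deadlock detection, CPU / memory inspection, and gradient synchronization checks depend on the training framework and are not covered. All function names, paths, and thresholds are assumptions.

```python
import os
import socket
import time


def script_exists(path):
    """Running-script proxy: the script file is present in the container."""
    return os.path.isfile(path)


def log_is_fresh(log_path, max_age_s=60.0):
    """Business-process proxy: the training log was written to recently."""
    try:
        return (time.time() - os.path.getmtime(log_path)) <= max_age_s
    except OSError:  # log file missing or unreadable
        return False


def tcp_reachable(host, port, timeout_s=3.0):
    """Communication proxy: a TCP connection to the master can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:  # refused, timed out, unreachable, name lookup failed
        return False
```

A worker-side probe might call `tcp_reachable` with the master's connection target IP and port taken from its running script configuration.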

[0041] In addition to the preset probes mentioned above, this embodiment also supports configuring any number of custom probes in the probe content. These custom probes can be associated with business metrics of distributed training, such as model training iteration progress, thereby further improving the business relevance of the probe process.

[0042] Optionally, in this embodiment, the probe begins periodic probing after a preset time interval following the start of the distributed training task. The preset time and probing frequency can be set according to business requirements. The default preset duration is 10 seconds to ensure the task script completes initialization.

[0043] To accommodate different detection logics, this embodiment provides two detection modes for each level of detection command: In the first mode, the AND mode, commands at the same level are executed simultaneously. A level of detection is considered successful only if all command checks pass; if any command fails, the entire level of detection fails. The second mode, OR mode, executes commands at the same level independently. A level of detection is considered successful if at least one command passes the test; a level of detection is considered unsuccessful only if all commands fail.
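The two detection modes amount to combining the results of one level with a logical AND or OR. A minimal sketch, assuming each command at a level is represented by a zero-argument callable that returns `True` on success (the function name `evaluate_level` is hypothetical):

```python
def evaluate_level(checks, mode):
    """Evaluate one probe level from a list of zero-argument bool callables."""
    results = [check() for check in checks]  # run every command at this level
    if mode == "AND":
        return all(results)  # passes only if every command passes
    if mode == "OR":
        return any(results)  # passes if at least one command passes
    raise ValueError(f"unknown detection mode: {mode!r}")


checks = [lambda: True, lambda: False]
print(evaluate_level(checks, "AND"))  # False
print(evaluate_level(checks, "OR"))   # True
```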

[0044] In practical applications, detection modes can be configured according to requirements. The master node defaults to the first mode (AND) so that the core scheduling node is verified across all dimensions; worker nodes can select either mode based on business importance.

[0045] To accurately distinguish between different detection items, this embodiment introduces a custom return code. Specifically, this embodiment supports setting different return codes for the detection results of each detection item in the multi-level liveness probe configuration. During the actual detection process, the multi-level liveness probe outputs the corresponding return code for each detection item to indicate whether the detection item has passed the test, helping maintenance personnel to locate anomalies.

[0046] For example, the return code is two digits. The first digit indicates the probe category (e.g., 6 for script status, 7 for process status, 8 for communication status). The second digit indicates the status, such as 0 for normal and 1 for abnormal. Thus 60 indicates normal script operation and 61 a script abnormality, while 62-69 can be assigned custom meanings according to actual needs.
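Decoding such a code is simple integer arithmetic. The sketch below follows the example mapping given above (6, 7, 8 for the categories; 0 and 1 for the status) and labels everything else as custom; the function name is illustrative.

```python
CATEGORIES = {6: "script status", 7: "process status", 8: "communication status"}
STATUSES = {0: "normal", 1: "abnormal"}


def decode_return_code(code):
    """Split a two-digit return code into (category, status)."""
    if not 10 <= code <= 99:
        raise ValueError("return code must have exactly two digits")
    category_digit, status_digit = divmod(code, 10)
    category = CATEGORIES.get(category_digit, "custom")
    status = STATUSES.get(status_digit, "custom")
    return category, status


print(decode_return_code(60))  # ('script status', 'normal')
print(decode_return_code(81))  # ('communication status', 'abnormal')
```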

[0047] Furthermore, Kubernetes can perform corresponding automatic recovery operations based on the probe results output by multi-level liveness probes, especially in the event of probe failure, according to preset automatic recovery strategies. These operations could include restarting containers on the node or reallocating cluster resources for distributed training tasks. Different automatic recovery strategies can be triggered for different probe results. For example, if a single probe fails slightly, the business scripts within the container will be restarted first; if multiple probes fail severely, then container restart or node reallocation will be performed, thereby minimizing task loss.
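A tiered recovery policy of this kind could be expressed as a small decision function. The sketch below is only one possible reading: the application does not define what counts as a slight or severe failure, so the rules used here (one failed item restarts the script, several failed items restart the container, and failures that persist for `realloc_after` consecutive rounds trigger reallocation) are assumptions, as are all names.

```python
from enum import Enum


class Recovery(Enum):
    NONE = "none"
    RESTART_SCRIPT = "restart business script"
    RESTART_CONTAINER = "restart container"
    REALLOCATE_NODE = "reallocate cluster resources"


def choose_recovery(failed_items, consecutive_failed_rounds, realloc_after=3):
    """Pick a recovery action from the failed probe items of the latest round."""
    if not failed_items:
        return Recovery.NONE
    if consecutive_failed_rounds >= realloc_after:
        return Recovery.REALLOCATE_NODE    # failures persist: move the task
    if len(failed_items) == 1:
        return Recovery.RESTART_SCRIPT     # single item failed: lightest action
    return Recovery.RESTART_CONTAINER      # several items failed at once


print(choose_recovery({"script"}, consecutive_failed_rounds=1))
# Recovery.RESTART_SCRIPT
```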

[0048] Finally, the Kubernetes platform can collect the detection results from multi-level liveness probes on all containers of the distributed training task, categorize and summarize them according to node roles and task stages, and generate a health status report for the distributed training task. Specifically, the report includes anomaly statistics, detection success rate, node health ranking, etc., making it easy for operations and maintenance personnel to intuitively grasp the task's running status.
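The aggregation step can be sketched as follows. It assumes that probe results arrive as `(node, role, return_code)` tuples and that a second digit of 0 means the item passed, as in the return code example earlier; grouping by task stage is omitted for brevity. The report fields and the function name are hypothetical.

```python
from collections import defaultdict


def health_report(results):
    """Summarize probe results given as (node, role, return_code) tuples."""
    per_node = defaultdict(lambda: [0, 0])  # node -> [passed, total]
    anomalies_by_role = defaultdict(int)
    for node, role, code in results:
        passed = code % 10 == 0             # second digit 0 means normal
        per_node[node][0] += int(passed)
        per_node[node][1] += 1
        if not passed:
            anomalies_by_role[role] += 1
    total = sum(t for _, t in per_node.values())
    passed_total = sum(p for p, _ in per_node.values())
    # Healthiest node first, ranked by per-node success rate.
    ranking = sorted(per_node,
                     key=lambda n: per_node[n][0] / per_node[n][1],
                     reverse=True)
    return {
        "success_rate": passed_total / total if total else 0.0,
        "anomalies_by_role": dict(anomalies_by_role),
        "node_ranking": ranking,
    }


sample = [
    ("master-0", "master", 60), ("master-0", "master", 70), ("master-0", "master", 80),
    ("worker-1", "worker", 60), ("worker-1", "worker", 71), ("worker-1", "worker", 81),
]
report = health_report(sample)
print(report)
```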

[0049] As can be seen from the above technical solutions, the method of this embodiment can achieve the following beneficial effects: First, by using multi-level survival probes to periodically detect core business layer indicators such as the status of scripts within containers, the status of associated processes, and the communication status between nodes, it breaks through the limitation of relying solely on the container's running status to judge the health of tasks. It can accurately identify hidden failure scenarios such as node failures, script anomalies, and communication interruptions, and completely solves the technical pain point of the difficulty in timely detection of anomalies in distributed training tasks.

[0050] Second, by configuring multi-level survival probes based on the running scripts of task logic nodes, the monitoring dimensions are penetrated from the Pod lifecycle to the business process logic and inter-node communication links, realizing the linkage monitoring of container running status and business execution status, enabling the platform to accurately obtain the true health status of distributed training tasks.

[0051] Third, the multi-level survival probe supports custom detection items, detection order and detection mode, which can flexibly adapt to the role differences and interaction logic of the master node and worker nodes, realize fine monitoring of the entire distributed training link, and completely solve the problem of traditional monitoring capabilities being single and unable to adapt to complex AI projects.

[0052] Fourth, based on the real-time feedback from the probe, an automatic recovery strategy is triggered, and a full task health status report is generated. This not only enables rapid handling of anomalies and ensures stable task operation, but also allows maintenance personnel to accurately locate the root cause of the fault, significantly reducing the cost of manual troubleshooting and improving the utilization rate of cluster resources and the overall execution efficiency of distributed training tasks.

[0053] The following is a specific example to illustrate the implementation process of the distributed training task monitoring method of this application.

[0054] Assuming there is a distributed training task, the complete implementation process is as follows:
Step 1: When a user submits a distributed training task through the Kubeflow platform, the Kubeflow platform first parses its logical resource requirements to obtain the following key task information:
Node configuration: This task consists of one master node and four worker nodes. The master node has an IP address of 192.168.100 and a port of 2222.
Master node script configuration: The script defines the service listening IP and port of the master node.
Worker node script configuration: The script contains the IP address and port required to connect to the master node.

[0055] Step 2: K8s obtains critical task information from the Kubeflow platform, allocates cluster resources to the master node and four worker nodes based on the critical task information, and starts containers and deploys multi-level liveness probes on the appropriate cluster physical nodes through the Kubelet component for each task logical node.

[0056] Kubernetes allocates corresponding computing, storage, and network resources to the master node and four worker nodes, and starts containers adapted to the training task requirements on each physical node of the cluster. Specifically, containers are started on each of the five nodes, and if any container fails to start, Kubernetes immediately returns a failure notification to the user.

[0057] If all containers start successfully, a multi-level liveness probe is configured on each container using the Kubelet component. This probe checks the container it resides in using the Exec Command method. The probe targets the running script of the distributed training task and uses the multi-command, multi-level probing mechanism to achieve comprehensive, fine-grained probing of the running status of each node, the script execution status, and the communication status between nodes.

[0058] Step 3: After the distributed training task enters the execution state for 10 seconds, the multi-level survival probe begins to perform detection at a 10-second interval.

[0059] This example supports flexible configuration of multi-level liveness probes. Below is a configuration example using the AND mode, meaning all checks at the same level must pass:
Step A1: Execute the first command to check if the running script exists. If it exists, return code 20 and proceed to the next step; if it does not exist, return code 21, the probe fails, and subsequent checks are terminated.
Step A2: Execute the second command to check the status of the associated process. If it runs normally, return code 30 and proceed to the next step; if it fails, return code 31, indicating that the probe has failed.
Step A3: Execute the third command to check the TCP connection status of the master node. If the connection is normal (i.e., in the ESTABLISHED state), return code 40 and proceed to the next step; otherwise, return code 41, and the probe fails.
Step A4: Execute the fourth command to check any custom detection item, such as whether the data directory is accessible. If accessible, return code 50, and this round of detection has passed; otherwise, return code 51, and the probe has failed.

[0060] Below is another configuration example for multi-level liveness probes, using OR mode, meaning the probe is considered successful as long as any one of the tests passes:
Step B1: Execute the fifth command to check the model file loading status. If loading is normal, return code 60, indicating successful probe detection, and terminate the current round of subsequent detection; otherwise, return code 61 and proceed to the next step.
Step B2: Execute the sixth command to check the service port listening status. If the listening is normal, return code 70, indicating the probe has succeeded; otherwise, return code 71, indicating both checks have failed, and the probe is considered to have failed overall.
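The AND and OR walkthroughs in this example share one control flow: run the steps in order, emit a return code per step, and stop early (on the first failure in AND mode, on the first success in OR mode). The sketch below captures that flow with stubbed checks; the return codes are the ones used in steps A1 to A4 and B1 to B2, while the function name `run_chain` and the simulated outcomes are assumptions for illustration.

```python
def run_chain(steps, mode):
    """Run probe steps in order and return (passed, emitted_codes).

    `steps` is a list of (check, pass_code, fail_code), where `check` is a
    zero-argument callable returning bool. AND stops at the first failure,
    OR stops at the first success.
    """
    if mode not in ("AND", "OR"):
        raise ValueError(f"unknown detection mode: {mode!r}")
    codes = []
    for check, pass_code, fail_code in steps:
        ok = check()
        codes.append(pass_code if ok else fail_code)
        if mode == "AND" and not ok:
            return False, codes
        if mode == "OR" and ok:
            return True, codes
    # Reaching here means every step passed (AND) or every step failed (OR).
    return mode == "AND", codes


and_steps = [
    (lambda: True, 20, 21),   # A1: running script exists
    (lambda: True, 30, 31),   # A2: associated process is healthy
    (lambda: False, 40, 41),  # A3: TCP connection to master (simulated failure)
    (lambda: True, 50, 51),   # A4: data directory accessible (never reached)
]
print(run_chain(and_steps, "AND"))  # (False, [20, 30, 41])

or_steps = [
    (lambda: False, 60, 61),  # B1: model file loaded (simulated failure)
    (lambda: True, 70, 71),   # B2: service port is listening
]
print(run_chain(or_steps, "OR"))    # (True, [61, 70])
```

Returning the list of emitted codes, rather than a single pass/fail flag, keeps the information a report would need to name the failing item, such as code 41 for a broken master TCP connection.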

[0061] Step 4: K8s collects the return codes of all multi-level survival probes and generates a health status report.

[0062] The fault report generated in this example will accurately pinpoint the node where the fault occurred, the probe item, and the return code, helping operations and maintenance personnel quickly locate the problem and ensure the normal operation of distributed training tasks.

[0063] Specifically, in AND mode, if all checks pass, the report indicates that the distributed training task is healthy. If any check fails, for example because the TCP connection of the master node is abnormal, the report states the specific anomaly based on return code 41 of the multi-level liveness probe: the master node's TCP connection is broken.
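A report entry of this kind can be produced by looking the return code up in a table. The sketch below is a hypothetical illustration: the table name `CODE_TABLE`, the function `describe`, the output format, and the node name `worker-1` are assumptions, and all message strings except the one for code 41 are paraphrased from steps A1 to A4 rather than quoted.

```python
from typing import Dict, Tuple

# return code -> (probe item, passed?, message). Codes are from steps A1 to A4;
# message wording other than code 41 is an assumption.
CODE_TABLE: Dict[int, Tuple[str, bool, str]] = {
    20: ("running script", True, "the running script exists"),
    21: ("running script", False, "the running script does not exist"),
    30: ("associated process", True, "the associated process is running normally"),
    31: ("associated process", False, "the associated process is abnormal"),
    40: ("master node TCP connection", True, "the master node's TCP connection is established"),
    41: ("master node TCP connection", False, "the master node's TCP connection is broken"),
    50: ("custom probe item", True, "the custom probe item passed"),
    51: ("custom probe item", False, "the custom probe item failed"),
}


def describe(node: str, code: int) -> str:
    """Format one report line naming the node, probe item, and return code."""
    item, passed, message = CODE_TABLE.get(
        code, ("unknown", False, "unrecognized return code")
    )
    status = "OK" if passed else "FAULT"
    return f"[{status}] node={node} item={item} code={code}: {message}"


print(describe("worker-1", 41))
# [FAULT] node=worker-1 item=master node TCP connection code=41: the master node's TCP connection is broken
```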

[0064] This example uses K8s to track key information of Kubeflow's distributed training tasks, transforming each distributed training task into a path tracking matrix monitored by the K8s platform. It tracks the status of multiple nodes, processes, and ports both vertically and horizontally in real time, promptly detects task anomalies, and intervenes accordingly, which improves the execution efficiency of distributed training tasks and prevents hidden failures from wasting cluster resources.

[0065] Based on the same inventive concept, this embodiment provides a distributed training task monitoring device. The device is applied to the K8s platform, and its structure is shown in Figure 2. Specifically, it includes:

The information acquisition module 210 is used to acquire key task information of the distributed training task, including the number of task logical nodes participating in the training, the role of each task logical node, and the configuration of the running script.

The resource allocation module 220 is used to allocate cluster resources to each task logical node and select suitable cluster physical nodes according to the key task information, generate a Pod definition for each task logical node, and submit it to the API Server for persistence. The Pod definition includes a multi-level liveness probe configuration.

The deployment module 230 is used to issue execution instructions to the cluster physical node adapted to each task logical node, triggering the Kubelet component on the cluster physical node to execute the following process: according to the Pod definition of the task logical node, pull the container image, configure the container runtime environment, and start the container to execute the running script of the task logical node; according to the multi-level liveness probe configuration in the Pod definition of the task logical node, deploy multi-level liveness probes in the container. The multi-level liveness probes are used to periodically probe multiple probe items of the current container and output the probe results. The multiple probe items include the running script status, business process status, and inter-node communication status.
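To make the Pod definition concrete, the sketch below builds a minimal Pod manifest with an exec-type `livenessProbe`, the standard Kubernetes mechanism for running a command inside a container on a fixed period. The function name `build_pod`, the image, the script paths, and the timing values are all hypothetical. Note also that stock Kubernetes treats an exec probe as successful only when the command exits with status 0; how the application's own return codes (such as 20 or 41) are surfaced to the platform is not specified in this section, so the probe script here is assumed to handle that translation.

```python
import json


def build_pod(name: str, image: str, run_script: str, probe_script: str,
              period_seconds: int = 30) -> dict:
    """Build a Pod manifest whose container carries an exec liveness probe."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "containers": [{
                "name": "trainer",
                "image": image,
                # Running script of the task logical node.
                "command": ["python", run_script],
                "livenessProbe": {
                    # Assumed wrapper script that runs the multi-level checks.
                    "exec": {"command": ["/bin/sh", probe_script]},
                    "initialDelaySeconds": 60,
                    "periodSeconds": period_seconds,
                    "failureThreshold": 3,
                },
            }],
        },
    }


pod = build_pod(
    name="train-worker-0",                      # hypothetical Pod name
    image="registry.example.com/trainer:1.0",   # hypothetical image
    run_script="/workspace/train.py",           # hypothetical path
    probe_script="/opt/probe/multi_level_probe.sh",  # hypothetical path
)
print(json.dumps(pod, indent=2))
```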

[0066] Optionally, the multi-level liveness probe configuration includes a probe script path, which includes hierarchical probe commands. Each probe command corresponds one-to-one with a probe item, and the level of a probe command indicates the order in which the corresponding probe item is probed. The multi-level liveness probe is used to probe the current container according to the probe script path and output the probe results.

[0067] Optionally, the probe script path also includes a probe mode, which indicates how probe commands at the same level are executed. The probe mode includes a first mode and a second mode. The first mode indicates that the probe commands at the current level are executed simultaneously, and the second mode indicates that the probe commands at the current level are executed independently. For each level of probe commands, the multi-level liveness probe is used to execute the probe commands at that level according to the probe mode of that level, so as to probe the current container.
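One way to picture hierarchical probe commands with a per-level mode is the following sketch. It groups commands by level, probes the levels in ascending order, and applies a mode to each level. It assumes that the two modes correspond to the AND and OR modes of the configuration examples in [0059] and [0060], which this paragraph does not state explicitly; the names `ProbeCommand` and `run_levels` are likewise hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class ProbeCommand:
    item: str                # probe item this command checks
    level: int               # lower levels are probed first
    run: Callable[[], bool]  # returns True when the item passes


def run_levels(
    commands: List[ProbeCommand], modes: Dict[int, str]
) -> Tuple[bool, Optional[int]]:
    """Probe level by level; return (passed, first failing level or None)."""
    by_level: Dict[int, List[ProbeCommand]] = {}
    for cmd in commands:
        by_level.setdefault(cmd.level, []).append(cmd)

    for level in sorted(by_level):
        results = [cmd.run() for cmd in by_level[level]]
        # Assumed semantics: "AND" needs every command to pass, "OR" needs one.
        mode = modes.get(level, "AND")
        passed = all(results) if mode == "AND" else any(results)
        if not passed:
            return False, level
    return True, None


commands = [
    ProbeCommand("running script status", 1, lambda: True),
    ProbeCommand("business process status", 2, lambda: True),
    ProbeCommand("model file loaded", 3, lambda: False),
    ProbeCommand("service port listening", 3, lambda: True),
]
print(run_levels(commands, {1: "AND", 2: "AND", 3: "OR"}))   # (True, None)
print(run_levels(commands, {1: "AND", 2: "AND", 3: "AND"}))  # (False, 3)
```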

[0068] Optionally, the multiple probe items include preset probe items and custom probe items; the multi-level liveness probe is used to probe the preset probe items and custom probe items of the current container.

[0069] Optionally, the multi-level liveness probe configuration includes return codes for the probe items, each of which indicates whether the corresponding probe item has passed; the multi-level liveness probe is used to periodically probe multiple probe items of the current container and output the corresponding return code according to the probe result of each probe item.

[0070] Optionally, the device further includes a self-recovery module, which is used to perform the corresponding automatic recovery operation according to a preset automatic recovery strategy when the probe result is a failure.
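A preset recovery strategy can be thought of as a lookup from a failing return code to a recovery action. The text does not list the actions or the mapping, so everything in the sketch below is hypothetical: the action names (`restart_container`, `reschedule_pod`, `alert_only`), the mapping from codes 31 and 41, and the function `recover`. The actions are stubs that only record what would be done.

```python
from typing import Callable, Dict, List

# Hypothetical strategy: failing return code -> name of a recovery action.
RECOVERY_STRATEGY: Dict[int, str] = {
    31: "restart_container",  # associated process is not running normally
    41: "reschedule_pod",     # TCP connection of the master node is broken
}
DEFAULT_ACTION = "alert_only"


def recover(container: str, fail_code: int,
            actions: Dict[str, Callable[[str], None]]) -> str:
    """Look up and run the recovery action for a failed probe."""
    action = RECOVERY_STRATEGY.get(fail_code, DEFAULT_ACTION)
    actions[action](container)
    return action


# Stub actions that only record what would be done.
log: List[str] = []
actions = {
    name: (lambda container, name=name: log.append(f"{name}:{container}"))
    for name in ("restart_container", "reschedule_pod", "alert_only")
}
print(recover("worker-0", 41, actions))  # reschedule_pod
print(recover("worker-1", 21, actions))  # alert_only
print(log)  # ['reschedule_pod:worker-0', 'alert_only:worker-1']
```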

[0071] Optionally, the device further includes a reporting module, which is used to generate a health status report for the distributed training task based on the probe results output by the multi-level liveness probes in all containers of the distributed training task.
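Aggregating the per-container probe results into a task-level report could look like the sketch below. The report layout, the function name `build_health_report`, and the container and task names are assumptions; the text only states that the report is generated from the probe results of all containers of the task.

```python
from typing import Dict, List, Tuple

# container name -> (probe passed?, return codes from the latest round)
Result = Tuple[bool, List[int]]


def build_health_report(task: str, results: Dict[str, Result]) -> dict:
    """Summarize the latest probe round of every container of one task."""
    failures = [
        # The last code of a failed round is the code that decided the failure.
        {"container": name, "return_code": codes[-1] if codes else None}
        for name, (passed, codes) in sorted(results.items())
        if not passed
    ]
    return {
        "task": task,
        "healthy": not failures,  # healthy only if no container failed
        "containers_checked": len(results),
        "failures": failures,
    }


report = build_health_report(
    "mnist-train",  # hypothetical task name
    {
        "master-0": (True, [20, 30, 40, 50]),
        "worker-0": (False, [20, 30, 41]),
    },
)
print(report["healthy"], report["failures"])
# False [{'container': 'worker-0', 'return_code': 41}]
```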

[0072] The apparatus provided in this embodiment is used to execute the corresponding method described above. Therefore, the beneficial effects it can achieve can be referred to the beneficial effects of the corresponding method described above, and will not be repeated here.

[0073] This embodiment provides an electronic device that may include a memory and one or more processors. The memory stores computer program code, which includes computer instructions. When the processor executes the computer instructions, the electronic device can perform various functions or steps of the method embodiments described above.

[0074] For the structure of this electronic device, refer to the electronic device 100 shown in Figure 3.

[0075] For example, the processor mentioned above can be a general-purpose processor, including a central processing unit (CPU), a network processor (NP), etc.; it can also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), or other programmable logic devices, discrete gate or transistor logic devices, or discrete hardware components.

[0076] This embodiment provides a computer-readable storage medium including computer instructions that, when executed on an electronic device, cause the electronic device to perform various functions or steps of the above method embodiments.

[0077] The aforementioned computer-readable storage media include, but are not limited to, any of the following: USB flash drive, portable hard drive, read-only memory (ROM), random access memory (RAM), magnetic disk or optical disk, and other media capable of storing program code.

[0078] This application provides a computer program product that, when run on a computer, causes the computer to perform various functions or steps of the above-described method embodiments.

[0079] The aforementioned electronic devices, computer-readable storage media, and computer program product embodiments are all used to execute the corresponding methods provided above. Therefore, the beneficial effects they can achieve can be referred to in the beneficial effects of the corresponding methods provided above, and will not be repeated here.

[0080] The above are merely specific embodiments of this application and are not intended to limit the scope of protection of this application. For those skilled in the art, the technical solutions of this application can have various variations or substitutions. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of this application should be included within the scope of the claims of this application.

Claims

1. A distributed training task monitoring method, characterized in that the method is applied to the K8s platform and includes: obtaining key task information of the distributed training task, including the number of task logical nodes participating in the training, the role of each task logical node, and the configuration of the running script; allocating, based on the key task information, cluster resources to each task logical node and selecting suitable cluster physical nodes, generating a Pod definition for each task logical node, and submitting it to the API Server for persistence, wherein the Pod definition includes a multi-level liveness probe configuration; and, for each task logical node, issuing an execution instruction to the cluster physical node adapted to that task logical node, triggering the Kubelet component on the cluster physical node to execute the following process: according to the Pod definition of the task logical node, pulling the container image, configuring the container runtime environment, and starting the container to execute the running script of the task logical node; and, according to the multi-level liveness probe configuration in the Pod definition of the task logical node, deploying multi-level liveness probes in the container, wherein the multi-level liveness probes are used to periodically probe multiple probe items of the current container and output the probe results, and the multiple probe items include the running script status, business process status, and inter-node communication status.

2. The method according to claim 1, characterized in that the multi-level liveness probe configuration includes a probe script path, which includes hierarchical probe commands; each probe command corresponds one-to-one with a probe item, and the level of a probe command indicates the order in which the corresponding probe item is probed; and the multi-level liveness probe is used to probe the current container according to the probe script path and output the probe results.

3. The method according to claim 2, characterized in that the probe script path also includes a probe mode, which indicates how probe commands at the same level are executed; the probe mode includes a first mode and a second mode, wherein the first mode indicates that the probe commands at the current level are executed simultaneously, and the second mode indicates that the probe commands at the current level are executed independently; and, for each level of probe commands, the multi-level liveness probe is used to execute the probe commands of that level according to the probe mode of that level, so as to probe the current container.

4. The method according to claim 1, characterized in that the multiple probe items include preset probe items and custom probe items; and the multi-level liveness probe is used to probe the preset probe items and custom probe items of the current container.

5. The method according to claim 1, characterized in that the multi-level liveness probe configuration includes return codes for the probe items, each of which indicates whether the corresponding probe item has passed; and the multi-level liveness probe is used to periodically probe multiple probe items of the current container and output the corresponding return code based on the probe result of each probe item.

6. The method according to claim 1, characterized in that the method further includes: when the probe result is a failure, executing the corresponding automatic recovery operation according to a preset automatic recovery strategy.

7. The method according to claim 1, characterized in that the method further includes: generating a health status report of the distributed training task based on the probe results output by the multi-level liveness probes in all containers of the distributed training task.

8. A distributed training task monitoring device, characterized in that the device is applied to the K8s platform and includes: an information acquisition module, used to acquire key task information of the distributed training task, including the number of task logical nodes participating in the training, the role of each task logical node, and the configuration of the running script; a resource allocation module, used to allocate cluster resources to each task logical node and select suitable cluster physical nodes according to the key task information, generate a Pod definition for each task logical node, and submit it to the API Server for persistence, wherein the Pod definition includes a multi-level liveness probe configuration; and a deployment module, used to issue execution instructions to the cluster physical node adapted to each task logical node, triggering the Kubelet component on the cluster physical node to execute the following process: according to the Pod definition of the task logical node, pulling the container image, configuring the container runtime environment, and starting the container to execute the running script of the task logical node; and, according to the multi-level liveness probe configuration in the Pod definition of the task logical node, deploying multi-level liveness probes in the container, wherein the multi-level liveness probes are used to periodically probe multiple probe items of the current container and output the probe results, and the multiple probe items include the running script status, business process status, and inter-node communication status.

9. An electronic device, characterized in that it includes: a memory and one or more processors, the memory being coupled to the processors; wherein the memory stores computer program code, the computer program code includes computer instructions, and when the computer instructions are executed by the processor, the electronic device performs the method as described in any one of claims 1-7.

10. A computer-readable storage medium comprising computer instructions, characterized in that, when the computer instructions are executed on an electronic device, they cause the electronic device to perform the method as described in any one of claims 1-7.

11. A computer program product, characterized in that, when the computer program product is run on a computer, it causes the computer to perform the method as described in any one of claims 1-7.