Multi-machine training task standardized deployment method and device, equipment and storage medium

By abstracting the training task requirements in a multi-machine, multi-GPU environment into standardized logical task units, generating a container configuration for the target training framework, and injecting distributed training information, the problem of complex deployment and cumbersome configuration of large-scale distributed training of models in a multi-machine, multi-GPU environment is solved, achieving efficient and stable deployment and resource utilization.

CN122195601A · Pending · Publication Date: 2026-06-12 · CHINA MERCHANTS BANK


Patent Information

Authority / Receiving Office: CN · China
Patent Type: Applications(China)
Current Assignee / Owner: CHINA MERCHANTS BANK
Filing Date: 2026-03-04
Publication Date: 2026-06-12

Application Information

IPC: G06F9/48; G06F9/50

AI Tagging

Application Domain

Program initiation/switching; Resource allocation

AI Technical Summary

⚠Technical Problem

In a multi-machine, multi-GPU environment, the distributed training and deployment of large models is complex and cumbersome to configure, resulting in low efficiency, wasted resources, poor cross-hardware platform compatibility, and the need for users to manually configure communication protocols, which can easily lead to training interruptions.

⚗Method used

The training task requirements are abstracted into standardized logical task units. Based on preset instance templates and custom resource objects, the container configuration of the target training framework is generated, and distributed training configuration information is dynamically generated and injected into the task unit instance. This achieves layer-by-layer logical abstraction and standardization of physical deployment environment, training configuration and underlying communication.

🎯Benefits of technology

It improves the deployment efficiency and resource utilization of multi-machine training tasks, eliminates the adaptation costs for users of heterogeneous hardware and multiple frameworks, and achieves environmental uniformity, framework uniformity and communication uniformity, providing an efficient, stable and scalable deployment foundation for large-scale AI training scenarios.

✦ Generated by Eureka AI based on patent content.


Abstract

The application discloses a standardized deployment method, apparatus, device and storage medium for multi-machine training tasks, and relates to the technical field of multi-machine training. The method comprises the following steps: acquiring a training task requirement in a multi-machine, multi-card environment, and abstracting the training task requirement into a standardized logical task unit; generating a container configuration of a target training framework based on a preset instance template and a custom resource object; instantiating the standardized logical task unit based on the container configuration of the target training framework to obtain a task unit instance of the target training framework; dynamically generating distributed training configuration information adapted to the target training framework based on a preset configuration template; and injecting the distributed training configuration information of the target training framework into the corresponding task unit instance. With this method, the physical deployment environment, training configuration and underlying communication are logically abstracted and standardized layer by layer. Combined with the ability to dynamically adapt to different training frameworks and hardware platforms, this improves the deployment efficiency and resource utilization of multi-machine training tasks.


Description

Technical Field

[0001] This application relates to the field of multi-machine training technology, and in particular to standardized deployment methods, devices, equipment and storage media for multi-machine training tasks.

Background Technology

[0002] Deploying large-scale models across multiple machines and GPUs in a distributed training environment is complex. Multi-machine training tasks require manual configuration of physical nodes, computing card models, and network communication protocols, relying on manual intervention and resulting in low deployment efficiency. Training framework configuration is cumbersome; different training frameworks (such as PyTorch, MindSpore, and DeepSpeed) have significantly different configuration parameters, requiring users to familiarize themselves with the characteristics of multiple frameworks, leading to a high learning curve. The underlying distributed communication is hidden, relying on low-level protocols, requiring users to manually configure communication library parameters, and exhibiting poor cross-hardware platform compatibility.

[0003] Existing multi-machine training deployment solutions are inefficient, with traditional multi-machine training tasks averaging 2 hours to deploy and requiring manual adjustments to the number of workers and computing card allocation, failing to dynamically adapt to task requirements. Furthermore, resource utilization is low, with fragmented computing power not being effectively integrated (e.g., some GPUs remaining unused), leading to wasted cluster resources. Current frameworks are highly hardware-dependent, requiring users to repeatedly configure different training frameworks and hardware platforms, lacking a unified abstraction layer and limiting system scalability. Finally, complex communication configurations, requiring manual adjustments to IP allocation and communication protocols for cross-cluster and cross-hardware distributed communication, are prone to training interruptions due to IP drift.

[0004] The above content is only used to help understand the technical solution of the present invention and does not represent an admission that the above content is prior art.

Summary of the Invention

[0005] The main purpose of this application is to provide a standardized deployment method, apparatus, device and storage medium for multi-machine training tasks, aiming to solve the technical problems of complex distributed training deployment of large models in multi-machine and multi-card environments, which are characterized by complicated configuration, hidden communication, low efficiency and waste of resources.

[0006] To achieve the above objectives, this application provides a standardized deployment method for multi-machine training tasks, the method comprising:
obtaining the training task requirements in a multi-machine, multi-GPU environment, and abstracting the training task requirements into standardized logical task units;
generating the container configuration of the target training framework based on a preset instance template and a custom resource object;
instantiating the standardized logical task unit based on the container configuration of the target training framework to obtain a task unit instance of the target training framework;
dynamically generating, based on a preset configuration template, distributed training configuration information adapted to the target training framework; and
injecting the distributed training configuration information of the target training framework into the corresponding task unit instance, so that the target training framework performs task training based on the task unit instance.

[0007] In one embodiment, the step of generating a custom resource object includes: Based on the training task requirements in a multi-machine, multi-card environment, an initial custom resource object is created. The training task requirements include at least logical resource requirements, compatibility requirements, and networking requirements. Based on physical resources, resources are allocated to the standardized logical task units to determine the optimal node combination of the standardized logical task units; The node information corresponding to the optimal node combination is added to the initial custom resource object to generate a custom resource object. The node information includes at least Internet protocol address, hardware type, video memory specification and version information.

[0008] In one embodiment, the step of allocating resources to the standardized logical task units based on physical resources and determining the optimal node combination of the standardized logical task units includes: Based on physical resources, resources are allocated to the standardized logical task units to determine the resource requirement vector of the standardized logical task units; Based on the resource requirement vector of the standardized logical task unit, the corresponding target physical node is selected from the node pool; Based on the target physical nodes of the standardized logical task units, the optimal node combination of the standardized logical task units is determined.

[0009] In one embodiment, the step of generating the container configuration of the target training framework based on a preset configuration template and a custom resource object includes: Obtain a preset configuration template, which includes at least dynamically replaceable parameter placeholders, including at least resource placeholders, environment placeholders, framework placeholders, and basic placeholders. The general resource parameters of the standardized logical task units are converted into corresponding hardware adaptation parameters. Obtain node information from the custom resource object, substitute the node information and the hardware adaptation parameters into the preset configuration template, and generate the container configuration of the target training framework.

[0010] In one embodiment, the step of dynamically generating distributed training configuration information adapted to the target training framework based on a preset configuration template includes: Convert the training task requirements into template variable values; Based on the template variable values, the preset configuration template is rendered to generate distributed training configuration information adapted to the target training framework.
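As a rough illustration of this rendering step, the sketch below converts a requirement into template variable values and renders a small configuration text. It uses Python's standard `string.Template` as a stand-in for the template engine; the template text, the variable names (`master_addr`, `world_size`, and so on) and the default port are assumptions made for the example, not values taken from the application.

```python
from string import Template

# Hypothetical configuration template for a PyTorch-style job.
# The variable names are illustrative only.
PYTORCH_TEMPLATE = Template(
    "MASTER_ADDR=$master_addr\n"
    "MASTER_PORT=$master_port\n"
    "WORLD_SIZE=$world_size\n"
    "NPROC_PER_NODE=$gpus_per_worker\n"
)


def to_template_values(requirement: dict, master_addr: str,
                       master_port: int = 29500) -> dict:
    """Convert a training task requirement into template variable values."""
    workers = requirement["workers"]
    gpus_per_worker = requirement["gpus_per_worker"]
    return {
        "master_addr": master_addr,
        "master_port": master_port,
        # Total number of training processes across all Workers.
        "world_size": workers * gpus_per_worker,
        "gpus_per_worker": gpus_per_worker,
    }


def render_config(requirement: dict, master_addr: str) -> str:
    """Render the template; substitute() raises KeyError on a missing variable."""
    values = to_template_values(requirement, master_addr)
    return PYTORCH_TEMPLATE.substitute(values)


if __name__ == "__main__":
    print(render_config({"workers": 4, "gpus_per_worker": 2}, "worker-0"))
```

Using `substitute` rather than `safe_substitute` means an incomplete requirement fails at render time instead of producing a configuration with unresolved placeholders.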

[0011] In one embodiment, the method further includes: when the number of busy units among the standardized logical task units is greater than a first quantity threshold, expanding the number of standardized logical task units, where a busy unit is one whose GPU utilization is greater than a first preset utilization threshold, whose busy duration is greater than a preset busy duration, and whose waiting queue contains tasks; and when the number of idle units among the standardized logical task units is greater than a second quantity threshold, reducing the number of standardized logical task units, where an idle unit is one whose GPU utilization is less than a second preset utilization threshold, whose idle duration is greater than a preset idle duration, and whose waiting queue contains no tasks.
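The expansion and reduction rule can be sketched as a small decision function. All threshold values, field names and the choice to let expansion take precedence when both conditions hold are assumptions for illustration; the application only states the conditions themselves.

```python
from dataclasses import dataclass


@dataclass
class UnitStatus:
    gpu_utilization: float   # fraction in [0.0, 1.0]
    busy_duration_s: float   # time spent continuously busy
    idle_duration_s: float   # time spent continuously idle
    queued_tasks: int        # tasks waiting in this unit's queue


@dataclass
class ScalingPolicy:
    # Hypothetical threshold values, chosen only for the example.
    busy_utilization: float = 0.9
    idle_utilization: float = 0.1
    busy_duration_s: float = 300.0
    idle_duration_s: float = 600.0
    busy_count_threshold: int = 2
    idle_count_threshold: int = 2


def is_busy(u: UnitStatus, p: ScalingPolicy) -> bool:
    return (u.gpu_utilization > p.busy_utilization
            and u.busy_duration_s > p.busy_duration_s
            and u.queued_tasks > 0)


def is_idle(u: UnitStatus, p: ScalingPolicy) -> bool:
    return (u.gpu_utilization < p.idle_utilization
            and u.idle_duration_s > p.idle_duration_s
            and u.queued_tasks == 0)


def scaling_decision(units: list, policy: ScalingPolicy) -> str:
    """Return 'scale_up', 'scale_down' or 'hold' for the current unit set."""
    busy = sum(is_busy(u, policy) for u in units)
    idle = sum(is_idle(u, policy) for u in units)
    if busy > policy.busy_count_threshold:
        return "scale_up"      # assumption: expansion wins if both apply
    if idle > policy.idle_count_threshold:
        return "scale_down"
    return "hold"
```

Requiring all three conditions (utilization, duration, queue state) before counting a unit avoids reacting to short utilization spikes or brief pauses between tasks.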

[0012] In one embodiment, the method further includes: Based on the standardized logical task unit, create an associated service corresponding to the task unit instance, and bind the task unit instance to the associated service corresponding to the task unit instance; When a task unit instance starts, the name of the associated service is determined based on the bound associated service; The associated service name is parsed to obtain the Internet Protocol address of the standardized logical task unit; The Internet Protocol address of the standardized logical task unit is written into the node registry in the shared storage, so that the task unit instance can determine the Internet Protocol address of the standardized logical task unit in real time based on the node registry during task training.
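A minimal sketch of this registration flow is shown below. It assumes a JSON file as the shared storage and an injectable resolver function in place of cluster DNS; the function names, the registry layout and the service names are hypothetical and chosen only to make the sequence concrete.

```python
import json
import socket
from pathlib import Path
from typing import Callable


def register_worker(worker_name: str, service_name: str, registry_path: Path,
                    resolve: Callable[[str], str] = socket.gethostbyname) -> str:
    """Resolve the associated service name and record the IP in the registry."""
    ip = resolve(service_name)
    if registry_path.exists():
        registry = json.loads(registry_path.read_text())
    else:
        registry = {}
    # Overwrite any previous entry so a changed IP is picked up on restart.
    registry[worker_name] = {"service": service_name, "ip": ip}
    registry_path.write_text(json.dumps(registry, indent=2))
    return ip


def lookup_worker_ip(worker_name: str, registry_path: Path) -> str:
    """Read the current IP of a Worker from the shared node registry."""
    registry = json.loads(registry_path.read_text())
    return registry[worker_name]["ip"]
```

Because peers look up addresses through the registry rather than caching them, a Worker that restarts with a new IP only needs to re-register under the same name. A production registry would also need locking or atomic writes, which this sketch omits.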

[0013] Furthermore, to achieve the above objectives, this application also proposes a standardized deployment device for multi-machine training tasks, which includes: The instance management module is used to obtain training task requirements in a multi-machine, multi-GPU environment and abstract the training task requirements into standardized logical task units. The instance management module is also used to generate the container configuration of the target training framework based on the preset instance template and the custom resource object. The instance management module is also used to instantiate the standardized logical task unit based on the container configuration of the target training framework to obtain a task unit instance of the target training framework. The dynamic configuration injection module is used to dynamically generate distributed training configuration information that adapts to the target training framework based on a preset configuration template. The dynamic configuration injection module is further configured to inject the distributed training configuration information of the target training framework into the corresponding task unit instance, so that the target training framework performs task training based on the task unit instance.

[0014] In addition, to achieve the above objectives, this application also proposes a multi-machine training task standardization deployment device, which includes: a memory, a processor, and a computer program stored in the memory and executable on the processor. The computer program is configured to implement the steps of the multi-machine training task standardization deployment method described above.

[0015] In addition, to achieve the above objectives, the present invention also proposes a storage medium, which is a computer-readable storage medium, on which a computer program is stored, and when the computer program is executed by a processor, it implements the steps of the standardized deployment method for multi-machine training tasks as described above.

[0016] In addition, to achieve the above objectives, this application also provides a computer program product, which includes a computer program that, when executed by a processor, implements the steps of the standardized deployment method for multi-machine training tasks as described above.

[0017] This application provides a standardized deployment method for multi-machine training tasks. It obtains the training task requirements in a multi-machine, multi-GPU environment and abstracts these requirements into standardized logical task units. Based on preset instance templates and custom resource objects, it generates a container configuration for the target training framework. Based on the container configuration of the target training framework, it instantiates the standardized logical task units to obtain task unit instances of the target training framework. Based on preset configuration templates, it dynamically generates distributed training configuration information adapted to the target training framework. It injects the distributed training configuration information of the target training framework into the corresponding task unit instances, enabling the target training framework to perform task training based on the task unit instances. This application logically abstracts and standardizes the physical deployment environment, training configuration, and underlying communication layer by layer. Combined with the ability to dynamically adapt to different training frameworks and hardware platforms, it improves the deployment efficiency and resource utilization of multi-machine training tasks. Simultaneously, it eliminates the adaptation costs for users with heterogeneous hardware and multiple frameworks, achieving environmental uniformity, framework uniformity, and communication uniformity for training tasks. This solves the technical problems of complex deployment, cumbersome configuration, and hidden communication in large-scale distributed training deployment of models in multi-machine, multi-GPU environments, leading to low efficiency and resource waste.

Attached Figure Description

[0018] The accompanying drawings, which are incorporated in and form part of this specification, illustrate embodiments consistent with this application and, together with the description, serve to explain the principles of this application.

[0019] To more clearly illustrate the technical solutions in the embodiments of this application or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, for those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0020] Figure 1 is a flowchart illustrating an embodiment of the standardized deployment method for multi-machine training tasks in this application;
Figure 2 is a schematic diagram of the overall architecture of the standardized deployment method for multi-machine training tasks provided in Embodiment 1 of this application;
Figure 3 is a schematic diagram illustrating the configuration injection for the standardized deployment method of multi-machine training tasks provided in Embodiment 1 of this application;
Figure 4 is a flowchart illustrating Embodiment 2 of the standardized deployment method for multi-machine training tasks in this application;
Figure 5 is a schematic diagram of the module structure of the standardized deployment device for multi-machine training tasks according to an embodiment of this application;
Figure 6 is a schematic diagram of the hardware operating environment involved in the standardized deployment method for multi-machine training tasks in this application embodiment.

[0021] The realization of the purpose, functional features and advantages of this application will be further explained in conjunction with the embodiments and with reference to the accompanying drawings.

Detailed Implementation

[0022] It should be understood that the specific embodiments described herein are merely illustrative of the technical solutions of this application and are not intended to limit this application.

[0023] To better understand the technical solution of this application, a detailed description will be provided below in conjunction with the accompanying drawings and specific implementation methods.

[0024] The main solution of this application embodiment is as follows: based on the training task requirements in a multi-machine, multi-card environment, physical resources are abstracted into standardized logical task units; based on the standardized logical task units, logical instance objects are generated; based on the preset configuration template, distributed training configuration information adapted to different target training frameworks is dynamically generated; and the distributed training configuration information of the target training framework is injected into the corresponding logical instance objects.

[0025] Currently, the distributed training and deployment of large models in a multi-machine, multi-GPU environment is complex, requiring manual configuration of physical nodes, computing card models, and network communication protocols, resulting in low deployment efficiency and an inability to dynamically adapt to task requirements. Distributed training frameworks are cumbersome to configure, with significant differences in configuration parameters between different training frameworks, requiring users to become familiar with the characteristics of multiple frameworks, leading to high learning costs and low configuration efficiency. The underlying distributed communication configuration is complex, relying on underlying protocols, requiring users to manually configure communication library parameters, and exhibiting poor cross-hardware platform compatibility, making training prone to interruption due to IP address drift.

[0026] This application provides a solution that abstracts and standardizes the physical deployment environment, training configuration, and underlying communication layer by layer. Combined with the ability to dynamically adapt to different training frameworks and hardware platforms, it significantly improves the deployment efficiency, resource utilization, and system scalability of multi-machine training tasks. At the same time, it eliminates the adaptation costs for users to heterogeneous hardware and multiple frameworks, and achieves environmental uniformity, framework uniformity, and communication uniformity for training tasks. It provides an efficient, stable, and scalable deployment foundation for large-scale AI training scenarios, and solves the technical problems of complex deployment, cumbersome configuration, and hidden communication of large-scale distributed training of models in multi-machine and multi-card environments, which leads to low efficiency and waste of resources.

[0027] It should be noted that the executing entity in this embodiment can be a computing service device with data processing, network communication, and program execution functions, such as a tablet computer, personal computer, or mobile phone, or an electronic device capable of performing the above functions, a standardized deployment device for multi-machine training tasks, etc. This embodiment does not specifically limit it. The following uses a standardized deployment device for multi-machine training tasks as an example to describe this embodiment and the following embodiments.

[0028] This application provides a standardized deployment method for multi-machine training tasks. Refer to Figure 1, which is a flowchart illustrating the first embodiment of the standardized deployment method for multi-machine training tasks in this application.

[0029] In this embodiment, the standardized deployment method for multi-machine training tasks includes steps S10 to S50: Step S10: Obtain the training task requirements in a multi-machine, multi-card environment, and abstract the training task requirements into standardized logical task units. It should be noted that the standardized logical task unit is the Worker. As shown in Figure 2, each Worker runs on an independent Node. The training task requirements are the core requirements for distributed training in a multi-machine, multi-GPU environment. They are usually submitted by the user, who can define them via an API (e.g., "4 Workers + 8 GPUs") without needing to be concerned with the details of the underlying hardware or training framework.

[0030] It is understandable that unified management of heterogeneous hardware environments is achieved through a multi-dimensional resource abstraction model (VModel), the core implementation mechanism of which is a multi-layered resource abstraction design. The physical layer includes node information (such as IP address and hardware model), GPU resources (such as CUDA version and memory specifications), and network bandwidth. The logical layer abstracts physical resources into standardized logical task units (Workers). Each Worker contains core parameters, network parameters, and compatibility tags. Core parameters include the number of CPU cores, memory size, number and model of GPUs; network parameters include bandwidth (such as 100Gbps RoCE) and latency (such as 0.1ms); and compatibility tags include framework compatibility (such as support for PyTorch 2.0) and communication protocol support (such as NCCL 2.18).
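One way to picture the logical layer is as a small data model in which each Worker carries its core parameters, network parameters and compatibility tags. The class and field names and the tag spelling (for example `pytorch-2.0`) are assumptions for this sketch; the application does not specify a concrete schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CoreParams:
    cpu_cores: int
    memory_gib: int
    gpu_count: int
    gpu_model: str


@dataclass(frozen=True)
class NetworkParams:
    bandwidth_gbps: float   # e.g. 100 for 100Gbps RoCE
    latency_ms: float       # e.g. 0.1


@dataclass(frozen=True)
class Worker:
    """Standardized logical task unit abstracted from physical resources."""
    name: str
    core: CoreParams
    network: NetworkParams
    compatibility: frozenset  # hypothetical tag format, e.g. "pytorch-2.0"

    def supports(self, tag: str) -> bool:
        return tag in self.compatibility


if __name__ == "__main__":
    worker = Worker(
        name="worker-0",
        core=CoreParams(cpu_cores=32, memory_gib=256, gpu_count=2,
                        gpu_model="example-gpu"),
        network=NetworkParams(bandwidth_gbps=100, latency_ms=0.1),
        compatibility=frozenset({"pytorch-2.0", "nccl-2.18"}),
    )
    print(worker.supports("pytorch-2.0"))
```

Keeping the three parameter groups separate mirrors the description: schedulers can match on core and network parameters while framework selection only inspects the compatibility tags.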

[0031] Additionally, it should be noted that the basic environment deployed in this embodiment includes: a Kubernetes cluster (supporting CRD / Operator), a CUE template engine, a hardware driver library, shared storage (such as etcd), and a node pool (including heterogeneous hardware nodes, such as Huawei 910B, NVIDIA cards, and Muxi cards).

[0032] Step S20: Generate the container configuration of the target training framework based on the preset instance template and the custom resource object; It should be noted that the preset instance template is the standardized VModel template set in this embodiment, which supports template parameterization. The CUE template engine can dynamically replace the parameters in the template (such as the number of GPUs and communication parameters) to generate a container configuration adapted to the framework (such as the nproc_per_node parameter in PyTorch).

[0033] It is understandable that the target training framework is the training framework selected by the user.

[0034] In one feasible implementation, the step of generating a custom resource object may include steps A11-A13: Step A11: Based on the training task requirements in a multi-machine, multi-card environment, create an initial custom resource object. It should be noted that, in this embodiment, the training task requirements include at least logical resource requirements, compatibility requirements, and networking requirements. Logical resource requirements include at least the number of standardized logical task units, the number of GPUs allocated to each standardized logical task unit, the number of CPU cores, and the memory requirement. The number of standardized logical task units is the number of Workers, for example, 4 Workers. The number of GPUs allocated to each standardized logical task unit is the number of GPUs allocated to each Worker, for example, 8 GPUs. The number of CPU cores is the number of physical CPU cores (more cores provide more parallel processing capacity). The memory requirement is the amount of memory needed. Compatibility requirements typically include the compatible training framework types (such as PyTorch 2.0, MindSpore, DeepSpeed) and the supported communication protocols (such as NCCL 2.18). Networking requirements typically include the network bandwidth requirement (such as 100Gbps RoCE) and the maximum latency (such as 0.1ms).

[0035] Understandably, the Kubernetes API is called to create an initial custom resource object (CRD) named "VModel". This initial CRD contains the standardized parameters of the user's requirements, such as the number of Workers, GPU configuration, and framework type, and serves as the data carrier for subsequent resource scheduling and configuration generation. The system's built-in VModel Operator (following the Kubernetes Operator pattern) listens for CRD creation events in real time and triggers the subsequent Worker instance generation process.

[0036] Step A12: Allocate resources to the standardized logical task units based on physical resources, and determine the optimal node combination of the standardized logical task units; In one feasible implementation, step A12 may include: allocating resources to the standardized logical task unit based on physical resources to determine the resource requirement vector of the standardized logical task unit; selecting a corresponding target physical node in the node pool based on the resource requirement vector of the standardized logical task unit; and determining the optimal node combination of the standardized logical task unit based on the target physical node of the standardized logical task unit.

[0037] It is understood that this embodiment employs an improved Bin Packing algorithm for resource allocation / scheduling, calculating the resource requirement vector for each Worker. This vector can consist of data such as the number of CPU cores, the number of GPUs, memory requirements, and bandwidth requirements. All physical nodes in the Kubernetes node pool are scanned, and nodes that meet the resource requirement vector and are compatible with the target training framework are selected as the target physical nodes. The optimal node combination is then chosen from among the target physical nodes. For example, 4 Workers correspond to 4 Nodes, with each Node allocated 2 GPUs to ensure maximum resource utilization and avoid fragmented computing power waste. During this process, heterogeneous hardware hybrid scheduling is supported, for example, simultaneously selecting Huawei 910B nodes and NVIDIA card nodes.
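The application does not detail its improved Bin Packing algorithm, so the following is only a simplified best-fit sketch of the same idea: filter the node pool by the resource requirement vector and framework compatibility, then prefer the nodes that leave the fewest GPUs unused. It assumes one Worker per node; all class names, fields and node names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Requirement:
    """Resource requirement vector of one Worker."""
    cpu_cores: int
    gpus: int
    memory_gib: int
    bandwidth_gbps: float
    framework: str


@dataclass(frozen=True)
class Node:
    name: str
    cpu_cores: int
    gpus: int
    memory_gib: int
    bandwidth_gbps: float
    frameworks: frozenset


def fits(node: Node, req: Requirement) -> bool:
    """True if the node satisfies every dimension of the requirement vector."""
    return (node.cpu_cores >= req.cpu_cores
            and node.gpus >= req.gpus
            and node.memory_gib >= req.memory_gib
            and node.bandwidth_gbps >= req.bandwidth_gbps
            and req.framework in node.frameworks)


def select_nodes(pool: list, req: Requirement, workers: int) -> Optional[list]:
    """Pick one node per Worker, preferring the tightest GPU fit."""
    candidates = [n for n in pool if fits(n, req)]
    if len(candidates) < workers:
        return None  # not enough suitable nodes in the pool
    # Best fit: fewest leftover GPUs first; name breaks ties deterministically.
    candidates.sort(key=lambda n: (n.gpus - req.gpus, n.name))
    return candidates[:workers]
```

Sorting by leftover GPUs keeps large nodes free for jobs that need them, which is the fragmentation concern the paragraph above describes. A real scheduler would also weigh topology and the other resource dimensions.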

[0038] Step A13: Add the node information corresponding to the optimal node combination to the initial custom resource object to generate a custom resource object. The node information includes at least Internet protocol address, hardware type, video memory specification, and version information.

[0039] Understandably, the node information recorded for the optimal node combination includes at least the Internet Protocol (IP) address, hardware type, video memory specification, and version information (CUDA version). This node information is then added to the initial CRD to obtain the custom resource object needed for subsequent container configuration.
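Steps A11 and A13 can be sketched with plain dictionaries shaped like a Kubernetes custom resource. The API group `example.com/v1` and every field name under `spec` are assumptions for illustration; the sketch builds the objects in memory and does not call the Kubernetes API.

```python
import copy

# Node fields the description lists as the minimum; key names are hypothetical.
REQUIRED_NODE_FIELDS = {"ip", "hardwareType", "videoMemory", "version"}


def create_initial_resource(name: str, workers: int, gpus_per_worker: int,
                            framework: str) -> dict:
    """Build the initial custom resource object from the user's requirements."""
    return {
        "apiVersion": "example.com/v1",   # hypothetical API group
        "kind": "VModel",
        "metadata": {"name": name},
        "spec": {
            "workers": workers,
            "gpusPerWorker": gpus_per_worker,
            "framework": framework,
        },
    }


def attach_node_info(initial: dict, nodes: list) -> dict:
    """Return a new object with the selected nodes' information added."""
    for node in nodes:
        missing = REQUIRED_NODE_FIELDS - set(node)
        if missing:
            raise ValueError(f"node info missing fields: {sorted(missing)}")
    resource = copy.deepcopy(initial)     # leave the initial object untouched
    resource["spec"]["nodes"] = [dict(n) for n in nodes]
    return resource
```

Returning a copy rather than mutating the initial object keeps the user's original request available, for example if scheduling has to be retried with a different node combination.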

[0040] In one feasible implementation, step S20 may include steps S201 to S203: Step S201: Obtain a preset configuration template. The preset configuration template includes at least dynamically replaceable parameter placeholders, which include at least resource placeholders, environment placeholders, framework placeholders, and basic placeholders. It should be noted that the preset configuration template forms the basic framework for container configuration and contains placeholders for all configuration items necessary for Worker Pods to run, that is, parameter placeholders. These placeholders can be dynamically replaced. Generally, parameter placeholders can be categorized into four types: resource placeholders, environment placeholders, framework placeholders, and basic placeholders. Resource placeholders can include the number of CPU cores, the number of GPUs, and the memory size, corresponding to the allocation results on the node. Environment placeholders can include hardware driver paths and communication protocol types, corresponding to hardware adaptation parameters. Framework placeholders can include framework startup parameters and framework version; these are framework-specific placeholders. Basic placeholders can include container images, storage mount paths, and Pod restart policies; these are fixed, standardized parameters.

[0041] Step S202: Convert the general resource parameters of the standardized logical task unit into corresponding hardware adaptation parameters. It is understood that this embodiment provides a unified hardware operation interface (such as GPU initialization and memory allocation) through a hardware abstraction interface to complete the underlying hardware adaptation and generate standardized logic layer parameters. Based on the hardware type of the selected node, the corresponding driver module is automatically loaded; for example, a Huawei 910B node loads ascend_driver.so, and an NVIDIA card node loads the CUDA driver. User-inputted general resource parameters are converted into hardware-specific parameters, i.e., hardware adaptation parameters; for example, general resource parameters are converted into vendor parameters corresponding to HCCL / XDR / NCCL.
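The conversion in Step S202 can be pictured as a lookup keyed by hardware type. The following is only a sketch: the profile keys (`huawei-910b`, `nvidia`), the input keys (`gpu`, `memory_gib`), and the output field names are hypothetical, and only the driver and communication library names come from the description.

```python
# Hypothetical hardware profiles; driver and library names follow the text.
HARDWARE_PROFILES = {
    "huawei-910b": {"driver": "ascend_driver.so", "comm_library": "HCCL"},
    "nvidia": {"driver": "CUDA driver", "comm_library": "NCCL"},
}

def to_hardware_params(general, hardware_type):
    """Convert general resource parameters into hardware adaptation parameters."""
    profile = HARDWARE_PROFILES.get(hardware_type)
    if profile is None:
        raise KeyError(f"no hardware profile for {hardware_type!r}")
    return {
        "driver": profile["driver"],
        "comm_library": profile["comm_library"],
        "accelerator_count": general["gpu"],
        "memory_gib": general["memory_gib"],
    }

params = to_hardware_params({"gpu": 2, "memory_gib": 64}, "huawei-910b")
```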

[0042] Step S203: Obtain node information from the custom resource object, substitute the node information and the hardware adaptation parameters into the preset configuration template, and generate the container configuration of the target training framework.

[0043] It should be noted that the CUE template engine is called, and the parameters in the VModel CRD are combined to generate the container configuration for each Worker's corresponding framework.

[0044] Understandably, the node information and hardware adaptation parameters from the custom resource object are substituted into the preset configuration template, replacing the corresponding parameter placeholders in the preset configuration template to generate a framework-specific configuration, such as the "--nproc_per_node" parameter in the PyTorch framework and the "--deepspeed" parameter in the DeepSpeed framework. This framework-specific configuration is then integrated with the resource, environment, and basic parameters to generate a complete container configuration file, i.e., the container configuration.
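To illustrate placeholder substitution in Step S203, the sketch below uses Python's standard `string.Template` in place of the CUE template engine named in this embodiment. The template layout, the key names, and the image name are hypothetical; only the two framework flags are taken from the description.

```python
from string import Template

# Hypothetical preset template covering the four placeholder categories.
PRESET_TEMPLATE = Template(
    "image: $image\n"                    # basic placeholder
    "restartPolicy: $restart_policy\n"   # basic placeholder
    "gpuCount: $gpu_count\n"             # resource placeholder
    "commProtocol: $comm_protocol\n"     # environment placeholder
    "nodeIP: $node_ip\n"                 # node information
    "frameworkArgs: $framework_args\n"   # framework placeholder
)

def framework_args(framework, gpu_count):
    """Return the framework-specific startup argument."""
    if framework == "pytorch":
        return f"--nproc_per_node={gpu_count}"
    if framework == "deepspeed":
        return "--deepspeed"
    raise ValueError(f"unsupported framework: {framework}")

def render_container_config(node_info, hw_params, basic, framework):
    """Fill every placeholder; substitute() raises KeyError if one is missing."""
    return PRESET_TEMPLATE.substitute(
        image=basic["image"],
        restart_policy=basic["restart_policy"],
        gpu_count=hw_params["gpu_count"],
        comm_protocol=hw_params["comm_protocol"],
        node_ip=node_info["ip"],
        framework_args=framework_args(framework, hw_params["gpu_count"]),
    )

config = render_container_config(
    node_info={"ip": "10.0.0.1"},
    hw_params={"gpu_count": 2, "comm_protocol": "NCCL"},
    basic={"image": "registry.example.com/train:latest",
           "restart_policy": "OnFailure"},
    framework="pytorch",
)
print(config)
```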

[0045] Step S30: Based on the container configuration of the target training framework, instantiate the standardized logical task unit to obtain a task unit instance of the target training framework. It should be noted that a task unit instance is a Worker Pod, and one Worker corresponds to one Worker Pod.

[0046] Understandably, the Kubernetes API is used to create a corresponding WorkerPod on each selected Node, and the generated container configuration is mounted to the WorkerPod's storage directory. This ensures that the Pod can read all the configurations when it starts, thus completing the instantiation of the Worker. The instantiated WorkerPod is then bound to the corresponding logical Worker, and the mapping relationship between "Worker-0, WorkerPod-0, Node-0" is recorded in the VModel CRD for easy monitoring and scaling later.

[0047] It should be understood that standardized tags are added to Worker Pods, for example, "Framework Compatibility: PyTorch 2.0", "Communication Protocol Support: NCCL 2.18", and "Hardware Model: Huawei 910B", for subsequent configuration injection.

[0048] Step S40: Based on the preset configuration template, dynamically generate distributed training configuration information that adapts to the target training framework. Understandably, distributed training configurations are injected through a standardized UI, and personalized configurations adapted to different training frameworks (such as PyTorch, MindSpore, and DeepSpeed) are dynamically generated based on the CUE template engine. Distributed network information (such as MA_NUM_HOSTS defining the total number of Workers, VC_TASK_INDEX assigning Worker IDs, and VC_MASTER_HOST dynamically obtaining the Master node IP) is injected. For different framework adaptations, framework-specific parameters (such as the "--nproc_per_node" parameter for PyTorch and the "--deepspeed" parameter for DeepSpeed) are dynamically injected based on the user's selected target training framework.

[0049] In one feasible implementation, step S40 may include steps S401 to S402: Step S401: Convert the training task requirements into template variable values. Understandably, users input parameters such as training framework type, number of workers, GPU allocation strategy, and data parallelism strategy through Web UI or CLI tools. These user-input parameters are then converted into variable values in the CUE template, i.e., template variable values. For example, "GPU:2" is mapped to a resource operation and maintenance feature used to manage the number of GPU cards used for tasks.

[0050] Step S402: Based on the template variable value, perform template rendering on the preset configuration template to generate distributed training configuration information adapted to the target training framework.

[0051] It should be understood that, referring to Figure 3, the parameters required by different frameworks are standardized into a unified CUE template, i.e., the preset configuration template, which is adapted to different training frameworks and hardware platforms. Based on the unified preset configuration template, different load types and operational characteristics are rendered to generate distributed training configuration information for the target training framework, so as to uniformly manage training instances of different frameworks.

[0052] In the specific implementation, the corresponding communication parameter template (such as nccl_config.cue) is loaded according to the detected hardware type, and distributed training parameters are dynamically generated, such as: "MA_NUM_HOSTS = len(Workers)", "VC_TASK_INDEX = worker_index", "VC_MASTER_HOST = master_worker_ip", thereby generating a distributed training configuration file (such as pytorch_dist.conf) that conforms to the target framework specification, which is the distributed training configuration information.
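The three assignments quoted above can be sketched as a small function that produces one set of variables per Worker. The input shape (a list of dictionaries with `name` and `ip`) and the choice of Worker 0 as the Master are assumptions made for illustration only.

```python
def build_dist_env(workers, master_index=0):
    """Build the distributed training variables for every Worker.

    `workers` is a list of dicts with hypothetical keys 'name' and 'ip'.
    """
    if not workers:
        raise ValueError("at least one Worker is required")
    master_ip = workers[master_index]["ip"]
    return {
        worker["name"]: {
            "MA_NUM_HOSTS": str(len(workers)),   # total number of Workers
            "VC_TASK_INDEX": str(index),         # this Worker's ID
            "VC_MASTER_HOST": master_ip,         # Master node IP
        }
        for index, worker in enumerate(workers)
    }

env = build_dist_env([
    {"name": "worker-0", "ip": "10.0.0.1"},
    {"name": "worker-1", "ip": "10.0.0.2"},
])
```

Values are kept as strings because environment variables are string-valued; a rendered configuration file such as pytorch_dist.conf could use the same mapping.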

[0053] Step S50: Inject the distributed training configuration information of the target training framework into the corresponding task unit instance, so that the target training framework performs task training based on the task unit instance.

[0054] It should be noted that, referring to Figure 3, distributed training configuration information is injected into each Worker Pod through Kubernetes' custom resource mechanism. Distributed training tasks are managed through Kubernetes custom objects, shielding them from the complexity of the physical environment.

[0055] It is understandable that the task lifecycle management can be achieved by monitoring the status (Pending / Running / Succeeded / Failed) of Worker / Worker Pod through the Kubernetes event mechanism.

[0056] Specifically, the core information of the VModel instance (Worker Pod name, corresponding Node IP, Worker number, and status) is written to the "node registry" in shared storage for cluster scheduling, monitoring, and querying. The VModel Operator continuously monitors the VModel instance status through a Reconcile loop. The initial status is set to "Pending (waiting for configuration injection and network formation)"; when the Worker Pod starts successfully and configuration injection is completed, the status is updated to "Running (ready)".

[0057] Furthermore, the number of standardized logical task units can be adjusted based on the GPU utilization of the standardized logical task units.

[0058] It should be noted that adjusting the number of standardized logical task units means expanding or shrinking the standardized logical task units, which is equivalent to increasing or decreasing the number of standardized logical task units. After expanding or shrinking the standardized logical task units, the task unit instances will also be adjusted accordingly.

[0059] It is understood that this embodiment supports dynamically expanding the number of Workers (e.g., from 4 Workers to 8 Workers) and dynamically reducing the number of Workers (e.g., from 8 Workers to 4 Workers) without manual intervention, and standardizes task management through Kubernetes custom objects. When a user submits a training task, a VModel instance is automatically created, and the task status is monitored cyclically using Reconcile. When a change in computing power requirements is detected, the Operator automatically triggers Worker scaling operations.

[0060] In one feasible implementation, the step of adjusting the number of standardized logical task units based on the GPU utilization of standardized logical task units may include: when the number of busy units in the standardized logical task units is greater than a first quantity threshold, expanding the number of standardized logical task units, wherein the GPU utilization of the busy unit is greater than a first preset utilization threshold, the busy duration of the busy unit is greater than a preset busy duration, and there are tasks in the waiting queue of the busy unit; and when the number of idle units in the standardized logical task units is greater than a second quantity threshold, reducing the number of standardized logical task units, wherein the GPU utilization of the idle unit is less than a second preset utilization threshold, the idle duration of the idle unit is greater than a preset idle duration, and there are no tasks in the waiting queue of the idle unit.

[0061] It should be noted that a busy unit refers to a relatively busy worker. The first preset utilization threshold is a pre-set threshold for GPU utilization, which is usually a large value. The busy duration is the duration for which GPU utilization exceeds the first preset utilization threshold. The preset busy duration is a pre-set threshold for busy duration. The waiting queue is the queue of tasks waiting to be executed. If a worker's GPU utilization exceeds the first preset utilization threshold, and the busy duration exceeds the preset busy duration, and there are tasks in the waiting queue, then this worker is considered a busy unit.

[0062] Additionally, it should be noted that an idle unit refers to a relatively idle Worker, the second preset utilization threshold is a pre-set threshold for GPU utilization (usually a small value), the idle duration is the duration during which GPU utilization is less than the second preset utilization threshold, and the preset idle duration is a pre-set threshold for the idle duration, for example, 15 minutes. If a Worker's GPU utilization is less than the second preset utilization threshold, the idle duration is greater than the preset idle duration, and there are no tasks in the waiting queue, then the Worker is considered an idle unit.

[0063] Understandably, the first threshold is for the number of busy units, and the second threshold is for the number of idle units. If the number of busy units exceeds the first threshold, it means that most workers are continuously busy and will continue to be busy, in which case expansion is triggered, increasing the number of workers. If the number of idle units exceeds the second threshold, it means that most workers are continuously idle and will continue to be idle, in which case shrinkage is triggered, reducing the number of workers.

[0064] In practice, the GPU utilization metric is monitored via HPA (Horizontal Pod Autoscaler). When the utilization consistently exceeds the maximum threshold and there are tasks in the waiting queue, the number of Workers is automatically expanded from 4 to 8. The GPU utilization of idle Workers is monitored via the Controller component, and resources are automatically reclaimed when it falls below the minimum threshold for 15 consecutive minutes.
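The busy-unit and idle-unit conditions described above can be expressed as a small decision function. This is a sketch under stated assumptions: the `WorkerStats` fields and all default threshold values are hypothetical, except the 15-minute idle duration (900 seconds), which follows the example in the description.

```python
from dataclasses import dataclass

@dataclass
class WorkerStats:
    gpu_utilization: float  # fraction between 0.0 and 1.0
    busy_seconds: float     # time spent above the first utilization threshold
    idle_seconds: float     # time spent below the second utilization threshold
    queued_tasks: int       # tasks in this Worker's waiting queue

def is_busy(w, high_util, min_busy_seconds):
    return (w.gpu_utilization > high_util
            and w.busy_seconds > min_busy_seconds
            and w.queued_tasks > 0)

def is_idle(w, low_util, min_idle_seconds):
    return (w.gpu_utilization < low_util
            and w.idle_seconds > min_idle_seconds
            and w.queued_tasks == 0)

def scaling_decision(workers, busy_count_threshold, idle_count_threshold,
                     high_util=0.9, low_util=0.1,
                     min_busy_seconds=600, min_idle_seconds=900):
    """Return 'expand', 'shrink', or 'hold' for the current Worker set."""
    busy = sum(is_busy(w, high_util, min_busy_seconds) for w in workers)
    idle = sum(is_idle(w, low_util, min_idle_seconds) for w in workers)
    if busy > busy_count_threshold:
        return "expand"
    if idle > idle_count_threshold:
        return "shrink"
    return "hold"

workers = [WorkerStats(0.95, 700, 0, 3)] * 3 + [WorkerStats(0.5, 0, 0, 0)]
decision = scaling_decision(workers, busy_count_threshold=2,
                            idle_count_threshold=2)
```

Expansion is checked before shrinkage here so that a mixed state favors keeping capacity; the embodiment does not specify a priority between the two conditions.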

[0065] This embodiment provides a standardized deployment method for multi-machine training tasks. It obtains the training task requirements in a multi-machine, multi-GPU environment and abstracts these requirements into standardized logical task units. Based on a preset instance template and custom resource objects, it generates a container configuration for the target training framework. Based on the container configuration of the target training framework, it instantiates the standardized logical task units to obtain task unit instances of the target training framework. Based on a preset configuration template, it dynamically generates distributed training configuration information adapted to the target training framework. It injects the distributed training configuration information of the target training framework into the corresponding task unit instances, enabling the target training framework to perform task training based on the task unit instances. This embodiment logically abstracts and standardizes the physical deployment environment, training configuration, and underlying communication layer by layer. Combined with the ability to dynamically adapt to different training frameworks and hardware platforms, it improves the deployment efficiency and resource utilization of multi-machine training tasks, while eliminating the adaptation costs for users with heterogeneous hardware and multiple frameworks, achieving environmental uniformity, framework uniformity, and communication uniformity for training tasks.

[0066] Based on the first embodiment of this application, in the second embodiment of this application, content that is the same as or similar to that in Embodiment 1 above can be understood with reference to the above description and will not be repeated hereafter. On this basis, referring to Figure 4, step S50 may be followed by steps S601 to S604: Step S601: Based on the standardized logical task unit, create an associated service corresponding to the task unit instance, and bind the task unit instance to the associated service corresponding to the task unit instance. It should be noted that container communication mechanisms simplify distributed network configuration. First, this embodiment creates a Headless Service, i.e., an associated service, for each Worker Pod and binds the Worker to it. Each Headless Service is named after its Worker; for example, the Headless Service corresponding to Worker-0 is named worker-0-service.

[0067] It is understood that this embodiment takes into account communication protocol adaptation and supports underlying communication libraries such as HCCL, NCCL, and XDR, so users do not need to manually configure communication parameters.

[0068] Step S602: When the task unit instance starts, determine the name of the associated service based on the bound associated service. Understandably, using Headless Service to dynamically bind the Master node IP simplifies distributed network configuration.

[0069] Step S603: Parse the associated service name to obtain the Internet Protocol address of the standardized logical task unit. Understandably, the Kubernetes DNS service enables automatic registration and resolution of Pod IPs. WorkerPod obtains a list of all Worker IP addresses by resolving the service name (such as worker-0-service.default.svc.cluster.local).
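A minimal sketch of Step S603 follows. It assumes the naming pattern `worker-<index>-service` from the example above, the `default` namespace, and the `cluster.local` cluster domain; the helper names are hypothetical. The resolver is passed in as a parameter so that the logic can be exercised outside a cluster; inside a Pod the default `socket.gethostbyname` would query the cluster DNS.

```python
import socket

def service_fqdn(worker_index, namespace="default",
                 cluster_domain="cluster.local"):
    """Build the DNS name of the Headless Service bound to one Worker."""
    return f"worker-{worker_index}-service.{namespace}.svc.{cluster_domain}"

def resolve_worker_ips(num_workers, resolve=socket.gethostbyname,
                       namespace="default"):
    """Resolve every Worker's service name to an IP address, in Worker order."""
    return [resolve(service_fqdn(index, namespace))
            for index in range(num_workers)]

# Stand-in for cluster DNS, for illustration only.
fake_dns = {
    "worker-0-service.default.svc.cluster.local": "10.0.0.1",
    "worker-1-service.default.svc.cluster.local": "10.0.0.2",
}
ips = resolve_worker_ips(2, resolve=lambda name: fake_dns[name])
```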

[0070] Step S604: Write the Internet Protocol address of the standardized logical task unit into the node registry in the shared storage, so that the task unit instance can determine the Internet Protocol address of the standardized logical task unit in real time based on the node registry during task training.

[0071] Understandably, when a Pod starts, it obtains the IP address of the current Worker through the Kubernetes API, writes the IP address to the node registry in shared storage, and periodically updates the IP information in the registry to ensure the real-time validity of the IP address, thereby avoiding training interruptions caused by IP drift.
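One way to picture the node registry is as a JSON file on the shared storage that each Worker Pod rewrites periodically. This is an assumption made for illustration: the embodiment does not specify the registry format, and the function name, file layout, and field names below are hypothetical. The sketch also performs no locking between concurrent writers.

```python
import json
import os
import tempfile
import time

def register_worker(registry_path, worker_name, ip, now=None):
    """Write or refresh one Worker's IP address in the registry file."""
    now = time.time() if now is None else now
    try:
        with open(registry_path, encoding="utf-8") as f:
            registry = json.load(f)
    except FileNotFoundError:
        registry = {}  # first registration creates the registry
    registry[worker_name] = {"ip": ip, "updated_at": now}

    # Write to a temporary file in the same directory, then rename it over
    # the registry so readers never observe a partially written file.
    directory = os.path.dirname(os.path.abspath(registry_path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        json.dump(registry, f)
    os.replace(tmp_path, registry_path)
    return registry
```

Calling `register_worker` again with a new address overwrites the earlier entry, which is how a periodic refresh would track an IP change.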

[0072] In practical implementation, a Communication Adapter Layer is constructed based on the communication protocol adaptation mechanism. This layer includes a protocol selection module, a parameter conversion module, and a fault recovery module. The protocol selection module automatically selects the communication protocol based on the detected hardware type. The parameter conversion module converts the unified configuration parameters into hardware vendor-specific parameters (e.g., converting "world_size" to "hccl_world_size" in HCCL). The fault recovery module automatically switches to a backup communication path when a communication anomaly is detected.
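The parameter conversion and fault recovery modules can be sketched as follows. Only the `world_size` to `hccl_world_size` mapping comes from the description; the function names, the path labels, and the use of `ConnectionError` to signal a communication anomaly are assumptions.

```python
# Rename table per protocol; only the HCCL entry is taken from the text.
PARAM_NAMES = {
    "HCCL": {"world_size": "hccl_world_size"},
}

def convert_params(unified, protocol):
    """Rename unified parameters to vendor-specific names where one is known."""
    renames = PARAM_NAMES.get(protocol, {})
    return {renames.get(key, key): value for key, value in unified.items()}

def call_with_fallback(paths, operation):
    """Try each communication path in order and return the first success."""
    last_error = None
    for path in paths:
        try:
            return path, operation(path)
        except ConnectionError as exc:
            last_error = exc  # remember the failure and try the next path
    raise ConnectionError("all communication paths failed") from last_error

converted = convert_params({"world_size": 8, "rank": 0}, "HCCL")
```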

[0073] For example, when a user selects the MindSpore framework and deploys it on a Huawei 910B card, the system automatically performs the following operations. It generates a configuration file containing HCCL parameters: hccl_config.json = { "world_size": 8, "rank_id": ${VC_TASK_INDEX}, "master_ip": ${VC_MASTER_HOST} }. It then automatically loads the Huawei HCCL communication library and initializes the distributed training environment. Finally, it automatically obtains the IP addresses of all workers via a headless service to build the distributed training network.
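The placeholder resolution in this example can be sketched as follows, assuming VC_TASK_INDEX and VC_MASTER_HOST are available as environment variables inside the Worker Pod. The function name is hypothetical, and the dictionary only mirrors the three fields shown in the paragraph; it is not a confirmed HCCL file format.

```python
import json
import os

def build_hccl_config(world_size, env=None):
    """Resolve the ${...} placeholders of the example from environment values."""
    env = os.environ if env is None else env
    return {
        "world_size": world_size,
        "rank_id": int(env["VC_TASK_INDEX"]),
        "master_ip": env["VC_MASTER_HOST"],
    }

# Explicit values stand in for the Pod's environment in this illustration.
hccl_config = build_hccl_config(
    8, env={"VC_TASK_INDEX": "3", "VC_MASTER_HOST": "10.0.0.1"})
print(json.dumps(hccl_config, indent=2))
```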

[0074] In addition, initiate ping tests between Worker Pods to ensure that all Pods can communicate with each other.

[0075] This embodiment provides a standardized deployment method for multi-machine training tasks. Based on standardized logical task units, it creates associated services corresponding to task unit instances and binds the task unit instances to their corresponding associated services. When a task unit instance starts, the associated service name is determined based on the bound associated service. The associated service name is parsed to obtain the Internet Protocol (IP) address of the standardized logical task unit. The IP address of the standardized logical task unit is written to a node registry in shared storage, so that the task unit instance can determine the IP address of the standardized logical task unit in real time based on the node registry during task training. This embodiment abstracts and standardizes the physical deployment environment, training configuration, and underlying communication layer by layer, and combined with the ability to dynamically adapt to different training frameworks and hardware platforms, it improves the deployment efficiency, resource utilization, and system scalability of multi-machine training tasks. At the same time, it eliminates the adaptation costs for users with heterogeneous hardware and multiple frameworks, achieving environmental uniformity, framework uniformity, and communication uniformity for training tasks, providing an efficient, stable, and scalable deployment foundation for large-scale AI training scenarios.

[0076] It should be noted that the above examples are only for understanding this application and do not constitute a limitation on the standardized deployment method of multi-machine training tasks in this application. Any simple modifications based on this technical concept are within the protection scope of this application.

[0077] This application also provides a standardized deployment device for multi-machine training tasks. Referring to Figure 5, the standardized deployment device for multi-machine training tasks includes: an instance management module 10, used to obtain training task requirements in a multi-machine, multi-card environment and abstract the training task requirements into standardized logical task units. The instance management module 10 is also used to generate the container configuration of the target training framework based on the preset instance template and the custom resource object. The instance management module 10 is also used to instantiate the standardized logical task unit based on the container configuration of the target training framework to obtain a task unit instance of the target training framework. A dynamic configuration injection module 20 is used to dynamically generate distributed training configuration information adapted to the target training framework based on a preset configuration template. The dynamic configuration injection module 20 is further configured to inject the distributed training configuration information of the target training framework into the corresponding task unit instance, so that the target training framework performs task training based on the task unit instance.

[0078] In one feasible implementation, the instance management module 10 is further configured to create an initial custom resource object based on the training task requirements in a multi-machine, multi-card environment. The training task requirements include at least logical resource requirements, compatibility requirements, and networking requirements. Based on physical resources, resources are allocated to the standardized logical task units to determine the optimal node combination of the standardized logical task units; The node information corresponding to the optimal node combination is added to the initial custom resource object to generate a custom resource object. The node information includes at least Internet protocol address, hardware type, video memory specification and version information.

[0079] In one feasible implementation, the instance management module 10 is further configured to allocate resources to the standardized logical task unit based on physical resources and determine the resource requirement vector of the standardized logical task unit. Based on the resource requirement vector of the standardized logical task unit, the corresponding target physical node is selected from the node pool; Based on the target physical nodes of the standardized logical task units, the optimal node combination of the standardized logical task units is determined.

[0080] In one feasible implementation, the instance management module 10 is further configured to obtain a preset configuration template, the preset configuration template including at least dynamically replaceable parameter placeholders, the parameter placeholders including at least resource placeholders, environment placeholders, framework placeholders and basic placeholders; The general resource parameters of the standardized logical task units are converted into corresponding hardware adaptation parameters. Obtain node information from the custom resource object, substitute the node information and the hardware adaptation parameters into the preset configuration template, and generate the container configuration of the target training framework.

[0081] In one feasible implementation, the dynamic configuration injection module 20 is further configured to convert the training task requirements into template variable values; Based on the template variable values, the preset configuration template is rendered to generate distributed training configuration information adapted to the target training framework.

[0082] In one feasible implementation, the instance management module 10 is further configured to expand the number of standardized logical task units when the number of busy units in the standardized logical task units is greater than a first quantity threshold, wherein the GPU utilization of the busy unit is greater than a first preset utilization threshold, the busy duration of the busy unit is greater than a preset busy duration, and there are tasks in the waiting queue of the busy unit. When the number of idle units in the standardized logical task unit is greater than the second quantity threshold, the number of the standardized logical task units is reduced. The GPU utilization of the idle unit is less than the second preset utilization threshold, the idle duration of the idle unit is greater than the preset idle duration, and there are no tasks in the waiting queue of the idle unit.

[0083] In one feasible implementation, the dynamic configuration injection module 20 is further configured to create an associated service corresponding to the task unit instance based on the standardized logical task unit, and bind the task unit instance to the associated service corresponding to the task unit instance. When a task unit instance starts, the name of the associated service is determined based on the bound associated service; The associated service name is parsed to obtain the Internet Protocol address of the standardized logical task unit; The Internet Protocol address of the standardized logical task unit is written into the node registry in the shared storage, so that the task unit instance can determine the Internet Protocol address of the standardized logical task unit in real time based on the node registry during task training.

[0084] The multi-machine training task standardization deployment device provided in this application, employing the multi-machine training task standardization deployment method described in the above embodiments, can solve the technical problems that distributed training deployment of large models in multi-machine, multi-GPU environments is complex, its configuration is cumbersome, and its communication setup is opaque, which lead to low efficiency and resource waste. Compared with the prior art, the beneficial effects of the multi-machine training task standardization deployment device provided in this application are the same as those of the multi-machine training task standardization deployment method provided in the above embodiments, and other technical features in the multi-machine training task standardization deployment device are the same as those disclosed in the methods of the above embodiments, and will not be repeated here.

[0085] This application provides a multi-machine training task standardization deployment device, which includes: at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to execute the multi-machine training task standardization deployment method in Embodiment 1 above.

[0086] Figure 6 illustrates a structural schematic diagram of a standardized deployment device for multi-machine training tasks suitable for implementing embodiments of this application. The standardized deployment device for multi-machine training tasks in embodiments of this application may include, but is not limited to, mobile terminals such as mobile phones, laptops, digital broadcast receivers, PDAs (Personal Digital Assistants), PADs (tablet computers), PMPs (Portable Media Players), and in-vehicle terminals (e.g., in-vehicle navigation terminals), as well as fixed terminals such as digital TVs and desktop computers. The standardized deployment equipment for multi-machine training tasks shown in Figure 6 is merely an example and should not impose any limitations on the functionality and scope of use of the embodiments of this application.

[0087] As shown in Figure 6, the standardized deployment device for multi-machine training tasks may include a processing unit 1001 (e.g., a central processing unit, a graphics processing unit, etc.), which can perform various appropriate actions and processes according to a program stored in ROM (Read Only Memory) 1002 or a program loaded from storage device 1003 into RAM (Random Access Memory) 1004. RAM 1004 also stores various programs and data required for the operation of the standardized deployment device for multi-machine training tasks. The processing unit 1001, ROM 1002, and RAM 1004 are interconnected via bus 1005. Input / output (I / O) interface 1006 is also connected to the bus. Typically, the following devices can be connected to I / O interface 1006: input devices 1007 including, for example, touchscreens, touchpads, keyboards, mice, image sensors, microphones, accelerometers, gyroscopes, etc.; output devices 1008 including, for example, liquid crystal displays (LCDs), speakers, vibrators, etc.; storage devices 1003 including, for example, magnetic tapes, hard disks, etc.; and communication devices 1009. Communication device 1009 allows the multi-machine training task standardization deployment equipment to communicate wirelessly or by wire with other devices to exchange data. Although a multi-machine training task standardization deployment equipment with various devices is shown in the figure, it should be understood that it is not required to implement or possess all the devices shown. More or fewer devices can be implemented alternatively.

[0088] Specifically, according to the embodiments disclosed in this application, the processes described above with reference to the flowcharts can be implemented as computer software programs. For example, embodiments disclosed in this application include a computer program product comprising a computer program carried on a computer-readable medium, the computer program containing program code for performing the methods shown in the flowcharts. In such embodiments, the computer program can be downloaded and installed from a network via communication device 1009, or installed from storage device 1003, or installed from ROM 1002. When the computer program is executed by processing unit 1001, it performs the functions defined in the methods of the embodiments disclosed in this application.

[0089] The multi-machine training task standardization deployment device provided in this application, employing the multi-machine training task standardization deployment method described in the above embodiments, can solve the technical problems that distributed training deployment of large models in multi-machine, multi-GPU environments is complex, its configuration is cumbersome, and its communication setup is opaque, which lead to low efficiency and resource waste. Compared with the prior art, the beneficial effects of the multi-machine training task standardization deployment device provided in this application are the same as those of the multi-machine training task standardization deployment method provided in the above embodiments, and other technical features of this multi-machine training task standardization deployment device are the same as those disclosed in the previous embodiment method, and will not be repeated here.

[0090] It should be understood that the various parts disclosed in this application can be implemented using hardware, software, firmware, or a combination thereof. In the description of the above embodiments, specific features, structures, materials, or characteristics can be combined in any suitable manner in one or more embodiments or examples.

[0091] The above are merely specific embodiments of this application, but the scope of protection of this application is not limited thereto. Any variations or substitutions that can be easily conceived by those skilled in the art within the scope of the technology disclosed in this application should be included within the scope of protection of this application. Therefore, the scope of protection of this application should be determined by the scope of the claims.

[0092] This application provides a computer-readable storage medium having computer-readable program instructions (i.e., a computer program) stored thereon, which are used to execute the multi-machine training task standardization deployment method in the above embodiments.

[0093] The computer-readable storage medium provided in this application may be, for example, but is not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system or device, or any combination thereof; a USB flash drive is one example. More specific examples of computer-readable storage media may include, but are not limited to: electrical connections having one or more wires, portable computer disks, hard disks, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), optical storage devices, magnetic storage devices, or any suitable combination thereof. In this embodiment, the computer-readable storage medium may be any tangible medium containing or storing a program that can be used by or in conjunction with an instruction execution system, apparatus, or device. The program code contained on the computer-readable storage medium may be transmitted using any suitable medium, including but not limited to: wires, optical cables, RF (Radio Frequency), etc., or any suitable combination thereof.

[0094] The aforementioned computer-readable storage medium may be included in the standardized deployment equipment for multi-machine training tasks; or it may exist independently and not be assembled into the standardized deployment equipment for multi-machine training tasks.

[0095] The aforementioned computer-readable storage medium carries one or more programs. When these programs are executed by a multi-machine training task standardization deployment device, the multi-machine training task standardization deployment device performs the following actions: acquires the training task requirements in a multi-machine, multi-GPU environment, and abstracts these requirements into standardized logical task units; generates a container configuration for the target training framework based on a preset instance template and custom resource objects; instantiates the standardized logical task units based on the container configuration of the target training framework to obtain task unit instances of the target training framework; dynamically generates distributed training configuration information adapted to the target training framework based on a preset configuration template; and injects the distributed training configuration information of the target training framework into the corresponding task unit instances, so that the target training framework performs task training based on the task unit instances.
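For illustration only, the five actions listed above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the implementation described in this application: the class names, function names, dictionary keys, and the use of plain dictionaries for the instance template, the custom resource object, and the configuration template are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class LogicalTaskUnit:
    """Hypothetical standardized logical task unit (field names are assumptions)."""
    name: str
    framework: str
    gpus: int


@dataclass
class TaskUnitInstance:
    """Hypothetical task unit instance holding its container and training configuration."""
    unit: LogicalTaskUnit
    container_config: dict
    training_config: dict = field(default_factory=dict)


def abstract_requirements(req):
    # Action 1: abstract the requirements into one logical unit per requested machine.
    return [
        LogicalTaskUnit(f"{req['job']}-{i}", req["framework"], req["gpus_per_node"])
        for i in range(req["nodes"])
    ]


def build_container_config(unit, instance_template, custom_resource):
    # Action 2: combine the preset instance template with data from the custom resource object.
    cfg = dict(instance_template)
    cfg.update(name=unit.name, gpus=unit.gpus, node_ip=custom_resource["nodes"][unit.name])
    return cfg


def render_training_config(config_template, **values):
    # Action 4: fill the preset configuration template with per-instance values.
    return {key: pattern.format(**values) for key, pattern in config_template.items()}


def deploy(req, instance_template, custom_resource, config_template):
    units = abstract_requirements(req)
    master_ip = custom_resource["nodes"][units[0].name]
    instances = []
    for rank, unit in enumerate(units):
        # Action 3: instantiate the logical unit from its container configuration.
        instance = TaskUnitInstance(
            unit, build_container_config(unit, instance_template, custom_resource)
        )
        # Action 5: inject the rendered distributed training configuration.
        instance.training_config = render_training_config(
            config_template, num_nodes=len(units), node_rank=rank, master=master_ip
        )
        instances.append(instance)
    return instances
```

A call such as `deploy(req, {"image": "trainer:latest"}, custom_resource, config_template)` would return one instance per requested machine, each carrying its own node rank and the address of the first node as the rendezvous point. Choosing the first node as the rendezvous point is an assumption of this sketch.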

[0096] Computer program code for performing the operations of this application can be written in one or more programming languages or a combination thereof, including object-oriented programming languages such as Java, Smalltalk, and C++, and conventional procedural programming languages such as the "C" language or similar programming languages. The program code can be executed entirely on the user's computer, partially on the user's computer, as a standalone software package, partially on the user's computer and partially on a remote computer, or entirely on a remote computer or server. In cases involving remote computers, the remote computer can be connected to the user's computer via any type of network—including a Local Area Network (LAN) or a Wide Area Network (WAN)—or can be connected to an external computer (e.g., via the Internet using an Internet service provider).

[0097] The flowcharts and block diagrams in the accompanying drawings illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of this application. In this regard, each block in a flowchart or block diagram may represent a module, segment, or portion of code containing one or more executable instructions for implementing a specified logical function. It should also be noted that in some alternative implementations, the functions indicated in the blocks may occur in a different order than those indicated in the drawings. For example, two consecutively indicated blocks may actually be executed substantially in parallel, and they may sometimes be executed in reverse order, depending on the functions involved. It should also be noted that each block in the block diagrams and / or flowcharts, and combinations of blocks in the block diagrams and / or flowcharts, can be implemented using a dedicated hardware-based system that performs the specified function or operation, or using a combination of dedicated hardware and computer instructions.

[0098] The modules described in the embodiments of this application can be implemented in software or hardware. The names of the modules do not limit the modules themselves.

[0099] The readable storage medium provided in this application is a computer-readable storage medium that stores computer-readable program instructions (i.e., a computer program) for executing the above-described standardized deployment method for multi-machine training tasks. This addresses the technical problem that distributed training of large models in multi-machine, multi-GPU environments is complex to deploy, where cumbersome configuration, hidden communication issues, and similar problems lead to low efficiency and wasted resources. Compared with the prior art, the beneficial effects of the computer-readable storage medium provided in this application are the same as those of the standardized deployment method for multi-machine training tasks provided in the above embodiments, and will not be elaborated upon here.

[0100] This application also provides a computer program product, including a computer program that, when executed by a processor, implements the steps of the standardized deployment method for multi-machine training tasks as described above.

[0101] The computer program product provided in this application can solve the technical problems of complex deployment, cumbersome configuration, and hidden communication in distributed training of large models in multi-machine and multi-GPU environments, resulting in low efficiency and wasted resources. Compared with the prior art, the beneficial effects of the computer program product provided in this application are the same as those of the standardized deployment method for multi-machine training tasks provided in the above embodiments, and will not be repeated here.

[0102] The above are only some embodiments of this application and do not limit the patent scope of this application. All equivalent structural transformations made under the technical concept of this application and using the contents of the specification and drawings of this application, or direct / indirect applications in other related technical fields, are included in the patent protection scope of this application.

Claims

1. A standardized deployment method for multi-machine training tasks, characterized in that the method includes: obtaining the training task requirements in a multi-machine, multi-GPU environment, and abstracting the training task requirements into standardized logical task units; generating the container configuration of the target training framework based on a preset instance template and custom resource objects; instantiating the standardized logical task unit based on the container configuration of the target training framework to obtain a task unit instance of the target training framework; dynamically generating distributed training configuration information adapted to the target training framework based on a preset configuration template; and injecting the distributed training configuration information of the target training framework into the corresponding task unit instance, so that the target training framework performs task training based on the task unit instance.

2. The method as described in claim 1, characterized in that the steps for generating the custom resource object include: creating an initial custom resource object based on the training task requirements in the multi-machine, multi-GPU environment, wherein the training task requirements include at least logical resource requirements, compatibility requirements, and networking requirements; allocating resources to the standardized logical task units based on physical resources to determine the optimal node combination of the standardized logical task units; and adding the node information corresponding to the optimal node combination to the initial custom resource object to generate the custom resource object, wherein the node information includes at least an Internet Protocol address, hardware type, video memory specification, and version information.

3. The method as described in claim 2, characterized in that the step of allocating resources to the standardized logical task units based on physical resources and determining the optimal node combination of the standardized logical task units includes: allocating resources to the standardized logical task units based on physical resources to determine the resource requirement vector of the standardized logical task units; selecting the corresponding target physical node from the node pool based on the resource requirement vector of the standardized logical task unit; and determining the optimal node combination of the standardized logical task units based on the target physical nodes of the standardized logical task units.
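For illustration only (not part of the claims), the node selection of claim 3 can be sketched as a greedy best-fit over a node pool. The claim does not state what makes a node combination "optimal"; this sketch assumes the tightest fit, meaning the least leftover capacity, and the `Node` fields, dictionary keys, and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    """Hypothetical physical node in the node pool."""
    ip: str
    gpus: int
    gpu_mem_gb: int
    cpu_cores: int


def requirement_vector(unit):
    # Resource requirement vector of a logical task unit: (GPUs, GPU memory, CPU cores).
    return (unit["gpus"], unit["gpu_mem_gb"], unit["cpu_cores"])


def capacity_vector(node):
    return (node.gpus, node.gpu_mem_gb, node.cpu_cores)


def fits(node, req):
    # A node qualifies only if it meets the requirement in every dimension.
    return all(have >= need for have, need in zip(capacity_vector(node), req))


def slack(node, req):
    # Leftover capacity per dimension; a smaller tuple means a tighter fit.
    return tuple(have - need for have, need in zip(capacity_vector(node), req))


def select_nodes(units, pool):
    """Assign each unit the tightest-fitting free node (assumed optimality criterion)."""
    free = list(pool)
    chosen = {}
    for unit in units:
        req = requirement_vector(unit)
        candidates = [node for node in free if fits(node, req)]
        if not candidates:
            raise RuntimeError(f"no free node satisfies unit {unit['name']}")
        best = min(candidates, key=lambda node: slack(node, req))
        chosen[unit["name"]] = best
        free.remove(best)  # each node hosts at most one unit in this sketch
    return chosen
```

A greedy pass like this is simple but order-dependent: it can fail to place a later unit that a global search would have placed. A scheduler aiming at a true optimum would need a different strategy.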

4. The method as described in claim 1, characterized in that the steps for generating the container configuration of the target training framework based on the preset configuration template and custom resource objects include: obtaining a preset configuration template, which includes at least dynamically replaceable parameter placeholders, the parameter placeholders including at least resource placeholders, environment placeholders, framework placeholders, and basic placeholders; converting the general resource parameters of the standardized logical task units into corresponding hardware adaptation parameters; and obtaining node information from the custom resource object, and substituting the node information and the hardware adaptation parameters into the preset configuration template to generate the container configuration of the target training framework.
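For illustration only (not part of the claims), the placeholder substitution of claim 4 can be sketched with Python's standard `string.Template`. The template layout, the placeholder names, and the mapping from hardware type to a container resource key are all assumptions of this sketch; the source does not specify them.

```python
from string import Template

# Hypothetical mapping from a node's hardware type to a container resource key.
RESOURCE_KEYS = {
    "nvidia": "nvidia.com/gpu",
    "ascend": "huawei.com/Ascend910",
}

# Hypothetical preset template with basic, resource, and framework placeholders.
CONTAINER_TEMPLATE = Template(
    "name: $name\n"
    "image: $image\n"
    "nodeIP: $node_ip\n"
    "resources:\n"
    "  $resource_key: $accelerators\n"
    "env:\n"
    "  FRAMEWORK: $framework\n"
)


def hardware_params(unit, node):
    # Convert the generic "accelerators" count into a hardware-specific resource entry.
    return {
        "resource_key": RESOURCE_KEYS[node["hardware_type"]],
        "accelerators": unit["accelerators"],
    }


def render_container_config(unit, custom_resource):
    # Node information comes from the custom resource object.
    node = custom_resource["nodes"][unit["name"]]
    values = {
        "name": unit["name"],
        "image": unit["image"],
        "framework": unit["framework"],
        "node_ip": node["ip"],
    }
    values.update(hardware_params(unit, node))
    # substitute() raises KeyError if any placeholder is left unfilled.
    return CONTAINER_TEMPLATE.substitute(values)
```

Using `substitute` rather than `safe_substitute` is a deliberate choice here: a missing value fails loudly instead of leaving a raw placeholder in the generated configuration.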

5. The method as described in claim 1, characterized in that the step of dynamically generating distributed training configuration information adapted to the target training framework based on a preset configuration template includes: converting the training task requirements into template variable values; and rendering the preset configuration template based on the template variable values to generate distributed training configuration information adapted to the target training framework.
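For illustration only (not part of the claims), the two steps of claim 5 can be sketched as a conversion function followed by template rendering. The variable names are modeled on common distributed-training conventions but are assumptions of this sketch, as are the requirement keys and the default port.

```python
from string import Template

# Hypothetical preset configuration template for one node of a distributed job.
CONFIG_TEMPLATE = Template(
    "MASTER_ADDR=$master_addr\n"
    "MASTER_PORT=$master_port\n"
    "WORLD_SIZE=$world_size\n"
    "NODE_RANK=$node_rank\n"
    "NPROC_PER_NODE=$nproc_per_node\n"
)


def to_template_values(requirements, node_rank):
    # Step 1: convert the training task requirements into template variable values.
    nodes = requirements["nodes"]
    gpus_per_node = requirements["gpus_per_node"]
    return {
        "master_addr": nodes[0],  # assumption: the first node acts as rendezvous point
        "master_port": requirements.get("port", 29500),  # assumed default port
        "world_size": len(nodes) * gpus_per_node,  # one process per GPU
        "node_rank": node_rank,
        "nproc_per_node": gpus_per_node,
    }


def render_config(requirements, node_rank):
    # Step 2: render the preset template with those values.
    return CONFIG_TEMPLATE.substitute(to_template_values(requirements, node_rank))
```

Keeping the conversion separate from the rendering means a different target framework would need only a different template, provided it can be expressed with the same variables.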

6. The method as described in claim 1, characterized in that the method further includes: when the number of busy units among the standardized logical task units is greater than a first quantity threshold, expanding the number of the standardized logical task units, wherein a busy unit is a unit whose GPU utilization is greater than a first preset utilization threshold, whose busy duration is greater than a preset busy duration, and whose waiting queue contains tasks; and when the number of idle units among the standardized logical task units is greater than a second quantity threshold, reducing the number of the standardized logical task units, wherein an idle unit is a unit whose GPU utilization is less than a second preset utilization threshold, whose idle duration is greater than a preset idle duration, and whose waiting queue contains no tasks.
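For illustration only (not part of the claims), the expansion and reduction conditions of claim 6 can be sketched as two predicates and a decision function. All threshold values and names are hypothetical. The claim does not say what happens when both conditions hold at once; this sketch assumes expansion takes precedence.

```python
from dataclasses import dataclass


@dataclass
class UnitStatus:
    """Hypothetical runtime status of one logical task unit."""
    gpu_util: float      # GPU utilization as a fraction in [0, 1]
    busy_seconds: float  # how long the unit has been continuously busy
    idle_seconds: float  # how long the unit has been continuously idle
    queued_tasks: int    # tasks in the unit's waiting queue


@dataclass(frozen=True)
class Thresholds:
    """Illustrative threshold values; the source does not give concrete numbers."""
    busy_util: float = 0.85
    idle_util: float = 0.10
    busy_seconds: float = 300.0
    idle_seconds: float = 600.0
    busy_count: int = 2
    idle_count: int = 2


def is_busy(u, t):
    # Busy: high utilization, sustained long enough, and work still waiting.
    return u.gpu_util > t.busy_util and u.busy_seconds > t.busy_seconds and u.queued_tasks > 0


def is_idle(u, t):
    # Idle: low utilization, sustained long enough, and nothing waiting.
    return u.gpu_util < t.idle_util and u.idle_seconds > t.idle_seconds and u.queued_tasks == 0


def scaling_decision(units, t=Thresholds()):
    busy = sum(is_busy(u, t) for u in units)
    idle = sum(is_idle(u, t) for u in units)
    if busy > t.busy_count:
        return "expand"  # assumption: expansion wins if both conditions hold
    if idle > t.idle_count:
        return "reduce"
    return "hold"
```

Requiring all three conditions per unit (utilization, duration, queue state) guards against reacting to short spikes or to a unit that is merely between tasks.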

7. The method as described in claim 1, characterized in that the method further includes: creating, based on the standardized logical task unit, an associated service corresponding to the task unit instance, and binding the task unit instance to the associated service corresponding to the task unit instance; when a task unit instance starts, determining the name of the associated service based on the bound associated service; resolving the associated service name to obtain the Internet Protocol address of the standardized logical task unit; and writing the Internet Protocol address of the standardized logical task unit into the node registry in the shared storage, so that the task unit instance can determine the Internet Protocol address of the standardized logical task unit in real time based on the node registry during task training.
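For illustration only (not part of the claims), the startup registration of claim 7 can be sketched as resolving a service name and recording the result in a registry file on shared storage. Storing the registry as a single JSON file, the function names, and the record layout are assumptions of this sketch. The resolver is passed in as a parameter so that the standard `socket.gethostbyname` can be swapped out, for example in tests.

```python
import json
import os
import socket
import tempfile


def register_unit(unit_name, service_name, registry_path, resolve=socket.gethostbyname):
    """Resolve the bound service name and record the IP in the node registry."""
    ip = resolve(service_name)

    # Load the existing registry, if any.
    registry = {}
    if os.path.exists(registry_path):
        with open(registry_path) as f:
            registry = json.load(f)

    registry[unit_name] = {"service": service_name, "ip": ip}

    # Write to a temporary file in the same directory, then swap it into place,
    # so a reader does not observe a partially written registry.
    directory = os.path.dirname(registry_path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(registry, f)
    os.replace(tmp_path, registry_path)
    return ip


def lookup(unit_name, registry_path):
    """Read the current IP of a unit from the node registry during training."""
    with open(registry_path) as f:
        return json.load(f)[unit_name]["ip"]
```

This read-modify-write sequence is not safe when several instances register at the same moment. A real deployment would need locking or one registry entry per file, and the atomicity of the final swap depends on the shared storage in use.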

8. A standardized deployment device for multi-machine training tasks, characterized in that the device includes: an instance management module, used to obtain training task requirements in a multi-machine, multi-GPU environment and abstract the training task requirements into standardized logical task units; the instance management module is also used to generate the container configuration of the target training framework based on the preset instance template and the custom resource object; the instance management module is also used to instantiate the standardized logical task unit based on the container configuration of the target training framework to obtain a task unit instance of the target training framework; a dynamic configuration injection module, used to dynamically generate distributed training configuration information adapted to the target training framework based on a preset configuration template; and the dynamic configuration injection module is further used to inject the distributed training configuration information of the target training framework into the corresponding task unit instance, so that the target training framework performs task training based on the task unit instance.

9. A standardized deployment device for multi-machine training tasks, characterized in that the device includes: a memory, a processor, and a computer program stored in the memory and executable on the processor, the computer program being configured to implement the steps of the multi-machine training task standardization deployment method as described in any one of claims 1 to 7.

10. A storage medium, characterized in that the storage medium is a computer-readable storage medium, and a computer program is stored on the storage medium; when the computer program is executed by a processor, it implements the steps of the multi-machine training task standardization deployment method as described in any one of claims 1 to 7.