Elastic deep learning job scheduling method and system, and computer device
By partitioning the model of a preemptible node and assigning the partitions to neighboring nodes for delayed computation, this application addresses the high overhead and cost of resource scheduling strategies in existing technologies. It achieves seamless node preemption and return, improving resource utilization and reducing job completion time.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- SHENZHEN INST OF ADVANCED TECH
- Filing Date
- 2022-11-18
- Publication Date
- 2026-06-16
Figure CN116069495B_ABST
Abstract
Description
Technical Field
[0001] This application relates to the field of information technology, and in particular to an elastic deep learning job scheduling method, system, and computer device.
Background Technology
[0002] Distributed training moves model training from a single machine with a single GPU to multi-machine, multi-GPU clusters, using the computing resources of multiple nodes to accelerate training. Data parallelism and pipeline parallelism are currently the mainstream solutions for distributed deep learning training. Data parallelism distributes the training data across nodes for simultaneous computation, pushes gradients to each node, and updates the model weights through synchronous communication. Pipeline parallelism partitions the model across nodes and divides each mini-batch into micro-batches, so that the nodes compute in parallel in a pipelined fashion. Each node stores only a portion of the model parameters, which avoids GPU memory bottlenecks. The network communication overhead of pipeline parallelism is lower than that of data parallelism and operator-based model parallelism. Mainstream distributed training solutions typically employ hybrid parallel training that combines data parallelism and pipeline parallelism.
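The idle "bubble" time that the later paragraphs rely on can be estimated with a short sketch. It assumes a GPipe-style synchronous schedule in which every stage takes the same time per micro-batch; that schedule and the function name `bubble_fraction` are assumptions for illustration and are not specified in this application.

```python
def bubble_fraction(num_stages: int, num_micro_batches: int) -> float:
    """Fraction of each stage's time that is idle in one pipeline pass.

    Assumption (not from the source): a synchronous schedule with uniform
    stage times, where filling and draining the pipeline costs
    (num_stages - 1) slots out of (num_micro_batches + num_stages - 1).
    """
    if num_stages < 1 or num_micro_batches < 1:
        raise ValueError("both arguments must be positive")
    return (num_stages - 1) / (num_micro_batches + num_stages - 1)


if __name__ == "__main__":
    # Four pipeline stages and eight micro-batches per mini-batch.
    print(bubble_fraction(4, 8))
```

Under these assumptions, four stages and eight micro-batches leave 3/11 of each stage's time idle, which is the kind of slack the method uses for delayed computation.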
[0003] In "Singularity: Planet-Scale, Preemptive and Elastic Scheduling of AI Workloads," Dharma Shukla et al. proposed preemptive job scheduling that achieves time-slice switching between jobs by swapping data between GPU memory and CPU memory, thereby realizing transparent preemption and elasticity of nodes. In "Varuna: Scalable, Low-cost Training of Massive Deep Learning Models," Sanjith Athlur et al. proposed a pipeline-parallel strategy for spot instances that reconfigures the nodes after a preemption; through pipeline scheduling and continuous checkpointing, Varuna lets a preempted job resume from a checkpoint. In "Bamboo: Making Preemptible Instances Resilient for Affordable Training of Large DNNs," John Thorpe et al. addressed the preemption of spot instances by adding redundant computation to the pipeline bubble. As long as no consecutive nodes in the cluster are preempted, any preempted node can be replaced by its predecessor node; when consecutive nodes are preempted, the system falls back to reconfiguration and recovery from checkpoints.
[0004] Existing cluster scheduling strategies typically handle the preemption of computing resources through time-slice switching, reconfiguration with recovery from checkpoints, or redundant backup on additional nodes. These techniques have the following drawbacks: time-slice switching and parallel reconfiguration both require recovery from checkpoints, which incurs significant memory-swapping overhead; redundant computation applies only to spot instances, adds substantial cost across the cluster, and cannot always be completed within the bubble time; and there is currently no general mechanism for preemption without checkpoint recovery.
Summary of the Invention
[0005] Therefore, to address the shortcomings of existing technologies, it is necessary to provide an elastic deep learning job scheduling method, system, and computer device that improve the resource utilization of the cluster and reduce the average job completion time.
[0006] To solve the above problems, this application adopts the following technical solution:
[0007] The first objective of this application is to provide an elastic deep learning job scheduling method, which includes the preemption and return of nodes.
[0008] In some embodiments, the preemption of a node includes the following steps:
[0009] Obtain partition configuration and pipeline orchestration;
[0010] The model of the preemptible node is partitioned and assigned to the bubbles of neighboring nodes, and delayed computation is performed;
[0011] Upon receiving a preemption command, calculate the preemption sequence and unload the preempted node;
[0012] Determine whether the preemption sequence involves critical nodes that cannot be preempted;
[0013] If so, reconfigure the model partitions of all nodes and restore the state from the checkpoint;
[0014] If not, the computation of the additional partitions $[L_{i-1}, L'_i]$ and $[L'_i, L_i]$ stored on the neighboring nodes of the unloaded node is switched from delayed computation to immediate computation. The synchronous communication network topology is then adjusted by adding the neighboring nodes to it, so that during the synchronization phase they synchronize in place of the unloaded node.
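The topology adjustment in the last step can be sketched as follows. The sketch assumes that nodes are identified by a `(replica, stage)` pair and that the nodes holding the same stage in different data-parallel replicas form one synchronization group; this naming scheme and the helper names are hypothetical and do not come from this application.

```python
Node = tuple[int, int]  # (replica index, stage index): a hypothetical naming scheme


def stage_sync_group(stage: int, num_replicas: int) -> list[Node]:
    """All replicas of one pipeline stage, which synchronize gradients together."""
    return [(replica, stage) for replica in range(num_replicas)]


def adjust_sync_group(group: list[Node], unloaded: Node) -> list[Node]:
    """Replace the unloaded node with its two pipeline neighbors.

    The neighbors hold the additional partitions of the unloaded node, so they
    take its place during the synchronization phase.
    """
    if unloaded not in group:
        raise ValueError("the unloaded node is not part of this group")
    replica, stage = unloaded
    stand_ins = [(replica, stage - 1), (replica, stage + 1)]
    adjusted: list[Node] = []
    for node in group:
        adjusted.extend(stand_ins if node == unloaded else [node])
    return adjusted


if __name__ == "__main__":
    group = stage_sync_group(stage=2, num_replicas=3)
    print(adjust_sync_group(group, unloaded=(1, 2)))
```

Returning a new list instead of mutating the group keeps the original topology available, which is convenient when the node is later returned and the normal topology has to be restored.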
[0015] In some embodiments, the steps of obtaining partition configuration and pipeline orchestration specifically include the following steps:
[0016] Obtain the partition configuration and pipeline orchestration from the scheduler. Let G be the total number of GPUs, and let P and D be the numbers of GPUs used for pipeline parallelism and data parallelism, respectively. The partition size is then $G = P \times D$, and the partition hyperparameters mini-batch size and micro-batch size are obtained.
[0017] Using profiling data obtained by running the model offline on GPUs, namely the forward computation time per layer $F = \{f_i\}$, the backpropagation time per layer $B = \{b_i\}$, the GPU memory usage per layer $M = \{m_i\}$, and the upper bound of GPU memory $\sup(M)$, create a partition $L = \{L_i\}$ satisfying [formula not reproduced].
[0018] The corresponding model partitions are assigned to each node of the job, establishing a network topology for pipelined parallel computing and data parallel synchronous communication.
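The relation $G = P \times D$ can be illustrated with a small sketch that lists the layouts available for a given GPU count. The function names are hypothetical, and the assumption that the number of micro-batches is the mini-batch size divided by the micro-batch size is ours rather than a statement of this application.

```python
def parallel_layouts(total_gpus: int) -> list[tuple[int, int]]:
    """Every (P, D) pair with P * D == total_gpus.

    P is the number of pipeline stages and D the number of data-parallel
    replicas, following G = P x D.
    """
    if total_gpus < 1:
        raise ValueError("total_gpus must be positive")
    return [(p, total_gpus // p)
            for p in range(1, total_gpus + 1)
            if total_gpus % p == 0]


def micro_batches_per_mini_batch(mini_batch_size: int, micro_batch_size: int) -> int:
    """Number of micro-batches a mini-batch is divided into (assumed exact division)."""
    if micro_batch_size < 1 or mini_batch_size % micro_batch_size != 0:
        raise ValueError("micro-batch size must divide the mini-batch size")
    return mini_batch_size // micro_batch_size


if __name__ == "__main__":
    print(parallel_layouts(8))
    print(micro_batches_per_mini_batch(64, 8))
```

With eight GPUs the candidate layouts are (1, 8), (2, 4), (4, 2), and (8, 1); which one the scheduler selects is outside the scope of this sketch.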
[0019] In some embodiments, the step of partitioning the model of a preemptible node and assigning it to bubbles of neighboring nodes, and performing delayed computation, specifically includes the following steps:
[0020] For a preemptible node $N_i$, its model partition is split evenly into two sub-partitions, which are stored on its two neighboring nodes in the pipeline, so that the computation of the sub-partitions $[L_{i-1}, L'_i]$ and $[L'_i, L_i]$ is performed on the neighboring nodes $N_{i-1}$ and $N_{i+1}$, respectively. This computation is deferred to the bubble time, that is, it is a delayed computation.
[0021] In some embodiments, the steps of receiving a preemption command, calculating the preemption sequence, and unloading the preempted node specifically include the following steps:
[0022] Upon receiving a preemption command, the optimal preemption sequence is calculated based on the partition configuration and pipeline schedule, i.e., the node sequence that satisfies..., and the preempted nodes are unloaded.
[0023] In some embodiments, the step of returning the node specifically includes the following steps:
[0024] Nodes that have completed their jobs are added to the standby queue, and preempted jobs obtain available nodes from the standby queue;
[0025] Calculate the optimal return sequence based on partition configuration and pipeline schedule;
[0026] The available nodes are loaded with the corresponding model partitions and pipeline configurations, and the state is restored from the checkpoint;
[0027] The returning node asynchronously pulls intermediate states and gradients from neighboring nodes and performs forced synchronization at the next synchronization barrier;
[0028] The additional partitions of neighboring nodes are converted into delayed computations and are computed within bubble time.
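The standby queue in the first return step can be sketched as a simple first-in, first-out queue. The class and method names are hypothetical, and the FIFO ordering is an assumption; this application does not state how the queue orders nodes.

```python
from collections import deque


class StandbyQueue:
    """Idle nodes waiting to be handed back to preempted jobs (FIFO is assumed)."""

    def __init__(self) -> None:
        self._idle: deque[str] = deque()

    def release(self, node_id: str) -> None:
        """Called when a node finishes the job it was preempted for."""
        self._idle.append(node_id)

    def acquire(self, count: int) -> list[str]:
        """Hand out up to `count` idle nodes to a preempted job."""
        granted: list[str] = []
        while self._idle and len(granted) < count:
            granted.append(self._idle.popleft())
        return granted


if __name__ == "__main__":
    queue = StandbyQueue()
    queue.release("node-a")
    queue.release("node-b")
    print(queue.acquire(1))
```

`acquire` returns fewer nodes than requested when the queue runs short, so a preempted job can resume partially and ask again later.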
[0029] In some embodiments, in the step of calculating the optimal return sequence based on the partition configuration and pipeline schedule, the optimal return sequence is the node sequence satisfying [formula not reproduced], and a return message is sent to the neighboring nodes $N_{i-1}$ and $N_{i+1}$ of the nodes in that sequence.
[0030] In some embodiments, the step in which the returning node asynchronously pulls intermediate states and gradients from neighboring nodes and performs forced synchronization at the next synchronization barrier specifically includes: ensuring that, after the synchronization barrier, the data of partition $[L_{i-1}, L_i]$ on node $N_i$ is identical to the data of the additional partitions $[L_{i-1}, L'_i]$ and $[L'_i, L_i]$ stored on the neighboring nodes $N_{i-1}$ and $N_{i+1}$; the normal synchronous communication network topology is restored after synchronization.
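The consistency condition at the synchronization barrier can be written as a small check. The sketch assumes each partition is represented as a mapping from layer index to a list of weights, which is a simplification chosen for illustration; the function name is hypothetical.

```python
LayerData = dict[int, list[float]]  # layer index -> weights (illustrative representation)


def consistent_after_barrier(returned: LayerData,
                             left_backup: LayerData,
                             right_backup: LayerData) -> bool:
    """Check that N_i holds exactly what its two neighbors hold together.

    `left_backup` stands for [L_{i-1}, L'_i] on N_{i-1} and `right_backup`
    for [L'_i, L_i] on N_{i+1}; the two must not overlap, and their union must
    equal the returned node's partition [L_{i-1}, L_i].
    """
    if left_backup.keys() & right_backup.keys():
        return False
    merged = {**left_backup, **right_backup}
    return merged == returned


if __name__ == "__main__":
    node = {3: [0.1], 4: [0.2], 5: [0.3]}
    print(consistent_after_barrier(node, {3: [0.1], 4: [0.2]}, {5: [0.3]}))
```

A real system would compare tensors or checksums rather than Python lists, but the condition being verified is the same.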
[0031] The second objective of this application is to provide an elastic deep learning job scheduling system, including: a processing unit for handling the preemption and return of nodes.
[0032] The third objective of this application is to provide a computer device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein when the processor executes the computer program, it implements the method described above.
[0033] The present application adopts the above technical solution, and its beneficial effects are as follows:
[0034] The elastic deep learning job scheduling method, system, and computer device provided in this application include node preemption and return, which allows nodes of pipeline-parallel jobs in a cluster to be preempted without affecting training. Preemption within a certain range does not require reconfiguration or checkpoint recovery, and preempted nodes can be returned to the job at any time, thereby improving the cluster's resource utilization and reducing the average job completion time.
Description of the Drawings
[0035] To more clearly illustrate the technical solutions of the embodiments of this application, the drawings used in the description of the embodiments of this application or the prior art will be briefly introduced below. Obviously, the drawings described below are only some embodiments of this application. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.
[0036] Figure 1 is a flowchart of the steps for obtaining partition configuration and pipeline orchestration provided in Embodiment 1 of this application.
[0037] Figure 2 is a flowchart of the steps for returning a node provided in Embodiment 1 of this application.
[0038] Figure 3 is a schematic diagram of the structure of the computer device provided in Embodiment 3 of this application.
Detailed Description
[0039] The embodiments of this application are described in detail below. Examples of these embodiments are shown in the accompanying drawings, wherein the same or similar reference numerals denote the same or similar elements or elements having the same or similar functions throughout. The embodiments described below with reference to the accompanying drawings are exemplary and intended to explain this application, and should not be construed as limiting this application.
[0040] In the description of this application, it should be understood that the terms "upper", "lower", "horizontal", "inner", "outer", etc., indicate the orientation or positional relationship based on the orientation or positional relationship shown in the accompanying drawings, and are only for the convenience of describing this application and simplifying the description, and do not indicate or imply that the device or element referred to must have a specific orientation, or be constructed and operated in a specific orientation, and therefore should not be construed as a limitation of this application.
[0041] Furthermore, the terms "first" and "second" are used for descriptive purposes only and should not be construed as indicating or implying relative importance or implicitly specifying the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more of that feature. In the description of this application, "multiple" means two or more, unless otherwise explicitly specified.
[0042] To make the objectives, technical solutions, and advantages of this application clearer, the following detailed description is provided in conjunction with the accompanying drawings and embodiments.
[0043] Embodiment 1
[0044] This embodiment provides an elastic deep learning job scheduling method, which includes the preemption and return of nodes. The implementation of each part is described in detail below.
[0045] Referring to Figure 1, the preemption of a node provided in this embodiment includes the following steps S110 to S160. The implementation of each step is described in detail below.
[0046] Step S110: Obtain partition configuration and pipeline orchestration.
[0047] In some embodiments, the steps of obtaining partition configuration and pipeline orchestration specifically include the following steps:
[0048] Step S111: Obtain the partition configuration and pipeline orchestration from the scheduler. Let G be the total number of GPUs, and let P and D be the numbers of GPUs used for pipeline parallelism and data parallelism, respectively. The partition size is then $G = P \times D$, and the partition hyperparameters mini-batch size and micro-batch size are obtained.
[0049] Step S112: Using profiling data obtained by running the model offline on GPUs, namely the forward computation time per layer $F = \{f_i\}$, the backpropagation time per layer $B = \{b_i\}$, the GPU memory usage per layer $M = \{m_i\}$, and the upper bound of GPU memory $\sup(M)$, create a partition $L = \{L_i\}$ satisfying:
[0050] [formula not reproduced]
[0051] Step S113: Assign the corresponding model partitions to each node of the job and establish a network topology for pipelined parallel computing and data parallel synchronous communication.
[0052] Through the above steps S111 to S113, the acquisition of partition configuration and pipeline orchestration can be achieved.
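Step S112 can be illustrated with a sketch that partitions the layers from the profiling data. Because the constraint formula is not reproduced in this text, the objective used here (contiguous stages, minimal slowest-stage time, per-stage memory within the GPU limit) is an assumption and may differ from the condition in the application; all function and parameter names are hypothetical.

```python
from itertools import accumulate


def partition_layers(f, b, m, num_stages, mem_cap):
    """Split layers into `num_stages` contiguous stages.

    f, b, m: per-layer forward time, backward time, and memory usage.
    Returns a list of (start, end) layer ranges, end exclusive.
    Assumed objective: minimize the slowest stage's f + b time while each
    stage's memory stays within `mem_cap`.
    """
    n = len(f)
    cost = [0.0] + list(accumulate(x + y for x, y in zip(f, b)))
    mem = [0.0] + list(accumulate(m))
    inf = float("inf")
    # best[k][j]: minimal bottleneck when the first j layers form k stages.
    best = [[inf] * (n + 1) for _ in range(num_stages + 1)]
    cut = [[0] * (n + 1) for _ in range(num_stages + 1)]
    best[0][0] = 0.0
    for k in range(1, num_stages + 1):
        for j in range(k, n + 1):
            for i in range(k - 1, j):
                if mem[j] - mem[i] > mem_cap:
                    continue  # stage [i, j) would not fit in GPU memory
                candidate = max(best[k - 1][i], cost[j] - cost[i])
                if candidate < best[k][j]:
                    best[k][j] = candidate
                    cut[k][j] = i
    if best[num_stages][n] == inf:
        raise ValueError("no partition fits within the memory cap")
    bounds, j = [], n
    for k in range(num_stages, 0, -1):
        i = cut[k][j]
        bounds.append((i, j))
        j = i
    return bounds[::-1]


if __name__ == "__main__":
    print(partition_layers([1, 1, 1, 1], [2, 2, 2, 2], [1, 1, 1, 1], 2, 2))
```

The dynamic program is cubic in the number of layers, which is acceptable for an offline planning step; a production scheduler might use a faster search.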
[0053] Step S120: Partition the model of the preemptible node and assign it to the bubbles of the neighboring nodes, and perform delayed calculation.
[0054] In some embodiments, the step of partitioning the model of a preemptible node and assigning it to bubbles of neighboring nodes, and performing delayed computation, specifically includes the following steps:
[0055] For a preemptible node $N_i$, its model partition is split evenly into two sub-partitions, which are stored on its two neighboring nodes in the pipeline, so that the computation of the sub-partitions $[L_{i-1}, L'_i]$ and $[L'_i, L_i]$ is performed on the neighboring nodes $N_{i-1}$ and $N_{i+1}$, respectively. This computation is deferred to the bubble time, that is, it is a delayed computation.
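Step S120 can be sketched as choosing a split point $L'_i$ inside the layer range of node $N_i$ and handing each half to one neighbor. The sketch interprets "split evenly" as balancing the per-layer compute time of the two halves, which is an assumption; the function names are hypothetical.

```python
def split_point(layer_times, start, end):
    """Pick L'_i in (start, end) that best balances the two halves' total time."""
    if end - start < 2:
        raise ValueError("need at least two layers to split")
    total = sum(layer_times[start:end])
    best_cut, best_gap, running = start + 1, float("inf"), 0.0
    for cut in range(start + 1, end):
        running += layer_times[cut - 1]
        gap = abs(total - 2 * running)  # |right half - left half|
        if gap < best_gap:
            best_cut, best_gap = cut, gap
    return best_cut


def backup_assignment(node_index, start, end, layer_times):
    """Map each neighbor of N_i to the layer range it stores as a backup.

    N_{i-1} receives [L_{i-1}, L'_i] and N_{i+1} receives [L'_i, L_i];
    ranges are (start, end) with end exclusive.
    """
    cut = split_point(layer_times, start, end)
    return {node_index - 1: (start, cut), node_index + 1: (cut, end)}


if __name__ == "__main__":
    times = [1, 1, 1, 1, 4, 2, 2]
    print(backup_assignment(3, 2, 7, times))
```

Balancing by time rather than by layer count keeps the extra work that each neighbor must fit into its bubble roughly equal.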
[0056] Step S130: Upon receiving the preemption instruction, calculate the preemption sequence and unload the preempted node.
[0057] In some embodiments, the steps of receiving a preemption command, calculating the preemption sequence, and unloading the preempted node specifically include the following steps:
[0058] Upon receiving a preemption command, the optimal preemption sequence is calculated based on the partition configuration and pipeline schedule, i.e., the node sequence that satisfies..., and the preempted nodes are unloaded.
[0059] Step S140: Determine whether the preemption sequence involves critical nodes that cannot be preempted;
[0060] Step S150: If yes, reconfigure the model partitions of all nodes and restore the state from the checkpoint;
[0061] Step S160: If not, switch the computation of the additional partitions $[L_{i-1}, L'_i]$ and $[L'_i, L_i]$ stored on the neighboring nodes of the unloaded node from delayed computation to immediate computation. Then adjust the synchronous communication network topology by adding the neighboring nodes to it, so that during the synchronization phase they synchronize in place of the unloaded node.
[0062] It can be understood that the above steps achieve node preemption: the model of the preemptible node is split evenly and backed up on its neighboring nodes, where its computation is completed within the bubble time; when the node is preempted, the neighboring nodes can seamlessly take over its computation.
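The branch in steps S140 to S160 can be summarized in a short sketch. How a node is classified as critical is not defined in this text, so the set of critical nodes is passed in by the caller; the class and function names are hypothetical, and stage indices are assumed to refer to interior pipeline stages that have two neighbors.

```python
from dataclasses import dataclass, field
from enum import Enum


class Mode(Enum):
    DELAYED = "delayed"      # backup partition computed in the bubble
    IMMEDIATE = "immediate"  # backup partition computed on the critical path


@dataclass
class PreemptionPlan:
    reconfigure: bool                        # True: S150, repartition and restore
    unloaded: list[int] = field(default_factory=list)
    backup_mode: dict[int, Mode] = field(default_factory=dict)


def plan_preemption(sequence: list[int], critical: set[int]) -> PreemptionPlan:
    """Decide between reconfiguration (S150) and neighbor takeover (S160)."""
    if any(node in critical for node in sequence):
        # S150: a non-preemptible node is involved, so every node's model
        # partition is reconfigured and state is restored from the checkpoint.
        return PreemptionPlan(reconfigure=True)
    plan = PreemptionPlan(reconfigure=False, unloaded=list(sequence))
    for node in sequence:
        # S160: both neighbors switch their backup partitions to immediate mode.
        for neighbor in (node - 1, node + 1):
            plan.backup_mode[neighbor] = Mode.IMMEDIATE
    return plan


if __name__ == "__main__":
    print(plan_preemption([2], critical={0, 4}))
```

Keeping the decision in a plain data object separates planning from the side effects of unloading nodes and rewiring the synchronization topology.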
[0063] Referring to Figure 2, which is a flowchart of the steps for returning a node provided in this embodiment, the return of a node includes the following steps S210 to S250. The implementation of each step is described in detail below.
[0064] Step S210: Add nodes that have completed their jobs to the standby queue; the preempted job obtains available nodes from the standby queue.
[0065] Step S220: Calculate the optimal return sequence based on the partition configuration and pipeline schedule.
[0066] In some embodiments, in the step of calculating the optimal return sequence based on the partition configuration and pipeline schedule, the optimal return sequence is the node sequence satisfying [formula not reproduced], and a return message is sent to the neighboring nodes $N_{i-1}$ and $N_{i+1}$ of the nodes in that sequence.
[0067] Step S230: Load the corresponding model partition and pipeline configuration for the obtained available nodes, and restore the state from the checkpoint.
[0068] Step S240: The returned node asynchronously pulls intermediate states and gradients from neighboring nodes and performs forced synchronization at the next synchronization barrier.
[0069] In some embodiments, the step in which the returning node asynchronously pulls intermediate states and gradients from neighboring nodes and performs forced synchronization at the next synchronization barrier specifically includes: ensuring that, after the synchronization barrier, the data of partition $[L_{i-1}, L_i]$ on node $N_i$ is identical to the data of the additional partitions $[L_{i-1}, L'_i]$ and $[L'_i, L_i]$ stored on the neighboring nodes $N_{i-1}$ and $N_{i+1}$; the normal synchronous communication network topology is restored after synchronization.
[0070] Step S250: The additional partitions of neighboring nodes are converted to delayed computation and computed within the bubble time.
[0071] It can be understood that by following the above steps, the node can be returned.
[0072] The elastic deep learning job scheduling method provided in this application includes node preemption and return, which enables nodes in pipeline parallel operations in the cluster to be preempted without affecting training. Preemption within a certain range does not require reconfiguration or checkpoint recovery, and nodes that have been preempted can be returned to the job at any time, thereby improving the cluster's resource utilization and reducing the average job completion time.
[0073] Embodiment 2
[0074] This embodiment also provides an elastic deep learning job scheduling system, including: a processing unit for handling the preemption and return of nodes.
[0075] For the detailed operation of the system provided in Embodiment 2, refer to Embodiment 1; it is not repeated here.
[0076] The elastic deep learning job scheduling system provided in this application includes node preemption and return, which enables nodes in pipeline parallel operations in the cluster to be preempted without affecting training. Preemption within a certain range does not require reconfiguration or checkpoint recovery, and nodes that have been preempted can be returned to the job at any time, thereby improving the cluster's resource utilization and reducing the average job completion time.
[0077] Embodiment 3
[0078] Referring to Figure 3, which is a schematic diagram of the structure of a computer device according to an embodiment of this application, the computer device 50 includes a processor 51 and a memory 52 coupled to the processor 51.
[0079] The memory 52 stores program instructions for implementing the elastic deep learning job scheduling method described above.
[0080] The processor 51 is used to execute program instructions stored in the memory 52 to implement the elastic deep learning job scheduling method.
[0081] The processor 51 may also be referred to as a CPU (Central Processing Unit). The processor 51 may be an integrated circuit chip with signal processing capabilities. The processor 51 may also be a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. A general-purpose processor may be a microprocessor or any conventional processor.
[0082] It is understood that the technical features of the above embodiments can be combined arbitrarily. For the sake of brevity, not all possible combinations of the technical features in the above embodiments are described. However, as long as there is no contradiction in the combination of these technical features, they should be considered to be within the scope of this specification.
[0083] The above are merely preferred embodiments of this application, and only specifically describe the technical principles of this application. These descriptions are only for explaining the principles of this application and should not be construed as limiting the scope of protection of this application in any way. Based on this explanation, any modifications, equivalent substitutions, and improvements made within the spirit and principles of this application, as well as other specific embodiments of this application that can be conceived by those skilled in the art without creative effort, should be included within the scope of protection of this application.
Claims
1. An elastic deep learning job scheduling method, characterized in that it includes the preemption and return of nodes, wherein the preemption of a node includes the following steps: obtaining partition configuration and pipeline orchestration; partitioning the model of the preemptible node, assigning the partitions to the bubbles of neighboring nodes, and performing delayed computation; upon receiving a preemption command, calculating the preemption sequence and unloading the preempted node; determining whether the preemption sequence involves critical nodes that cannot be preempted; if so, reconfiguring the model partitions of all nodes and restoring the state from the checkpoint; if not, switching the computation of the additional partitions $[L_{i-1}, L'_i]$ and $[L'_i, L_i]$ stored on the neighboring nodes of the unloaded node from delayed computation to immediate computation, adjusting the synchronous communication network topology by adding the neighboring nodes to it, and having the neighboring nodes synchronize in place of the unloaded node during the synchronization phase.
2. The elastic deep learning job scheduling method as described in claim 1, characterized in that the step of obtaining partition configuration and pipeline orchestration specifically includes the following steps: obtaining the partition configuration and pipeline orchestration from the scheduler, where G is the total number of GPUs, P and D are the numbers of GPUs used for pipeline parallelism and data parallelism, respectively, and the partition size is $G = P \times D$, and obtaining the partition hyperparameters mini-batch size and micro-batch size; using profiling data obtained by running the model offline on GPUs, namely the forward computation time per layer $F = \{f_i\}$, the backpropagation time per layer $B = \{b_i\}$, the GPU memory usage per layer $M = \{m_i\}$, and the upper bound of GPU memory $\sup(M)$, creating a partition $L = \{L_i\}$ satisfying [formula not reproduced]; and assigning the corresponding model partitions to each node of the job and establishing a network topology for pipeline-parallel computation and data-parallel synchronous communication.
3. The elastic deep learning job scheduling method as described in claim 1, characterized in that the step of partitioning the model of the preemptible node, assigning the partitions to the bubbles of neighboring nodes, and performing delayed computation specifically includes the following steps: for a preemptible node $N_i$, splitting its model partition evenly into two sub-partitions and storing them on its two neighboring nodes in the pipeline, so that the computation of the sub-partitions $[L_{i-1}, L'_i]$ and $[L'_i, L_i]$ is performed on the neighboring nodes $N_{i-1}$ and $N_{i+1}$, respectively, and is deferred to the bubble time, that is, it is a delayed computation.
4. The elastic deep learning job scheduling method as described in claim 1, characterized in that the step of receiving a preemption command, calculating the preemption sequence, and unloading the preempted node specifically includes the following steps: upon receiving a preemption command, calculating the optimal preemption sequence based on the partition configuration and pipeline schedule, i.e., the node sequence that satisfies..., and unloading the preempted nodes.
5. The elastic deep learning job scheduling method as described in claim 1, characterized in that the return of a node specifically includes the following steps: adding nodes that have completed their jobs to the standby queue, with preempted jobs obtaining available nodes from the standby queue; calculating the optimal return sequence based on the partition configuration and pipeline schedule; loading the corresponding model partitions and pipeline configurations onto the obtained available nodes and restoring the state from the checkpoint; the returning node asynchronously pulling intermediate states and gradients from neighboring nodes and performing forced synchronization at the next synchronization barrier; and converting the additional partitions of the neighboring nodes back to delayed computation, computed within the bubble time.
6. The elastic deep learning job scheduling method as described in claim 5, characterized in that, in the step of calculating the optimal return sequence based on the partition configuration and pipeline schedule, the optimal return sequence is the node sequence satisfying [formula not reproduced], and a return message is sent to the neighboring nodes $N_{i-1}$ and $N_{i+1}$ of the nodes in that sequence.
7. The elastic deep learning job scheduling method as described in claim 1, characterized in that the step in which the returning node asynchronously pulls intermediate states and gradients from neighboring nodes and performs forced synchronization at the next synchronization barrier specifically includes: ensuring that, after the synchronization barrier, the data of partition $[L_{i-1}, L_i]$ on node $N_i$ is identical to the data of the additional partitions $[L_{i-1}, L'_i]$ and $[L'_i, L_i]$ stored on the neighboring nodes $N_{i-1}$ and $N_{i+1}$, and restoring the normal synchronous communication network topology after synchronization.
8. An elastic deep learning job scheduling system, characterized in that it includes: a processing unit for handling the preemption and return of nodes, wherein the preemption of a node includes the following steps: obtaining partition configuration and pipeline orchestration; partitioning the model of the preemptible node, assigning the partitions to the bubbles of neighboring nodes, and performing delayed computation; upon receiving a preemption command, calculating the preemption sequence and unloading the preempted node; determining whether the preemption sequence involves critical nodes that cannot be preempted; if so, reconfiguring the model partitions of all nodes and restoring the state from the checkpoint; if not, switching the computation of the additional partitions $[L_{i-1}, L'_i]$ and $[L'_i, L_i]$ stored on the neighboring nodes of the unloaded node from delayed computation to immediate computation, adjusting the synchronous communication network topology by adding the neighboring nodes to it, and having the neighboring nodes synchronize in place of the unloaded node during the synchronization phase.
9. A computer device, characterized in that it includes a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the method as described in any one of claims 1-7.