A container-based parallel deep learning scheduling training method and system

A deep learning training method in the field of cloud computing and deep learning. It addresses problems such as heavy upper-layer development effort, mutual interference between tasks, and lack of scheduling capability, with the effect of improving computing-resource utilization and accelerating training iteration.

Active Publication Date: 2021-07-16
SHANDONG LANGCHAO YUNTOU INFORMATION TECH CO LTD


Problems solved by technology

[0006] The technical task of the present invention is to provide a container-based parallel deep learning scheduling training method and system that solve the following problems of TensorFlow training: Task resources cannot be isolated, tasks interfere with each other through resource preemption, scheduling capability is lacking, the amount of upper-layer development is large, and viewing each Task's training status and logs is inconvenient.



Examples


Embodiment 1

[0053] The container-based parallel deep learning scheduling training method of the present invention uses Kubernetes containers to configure and schedule computing resources for tasks, and provides multiple resource management mechanisms such as ResourceQuota and LimitRanger. Resource isolation between tasks is achieved through communication between pod nodes in the container cluster. The same training node starts a training pod and a lifecycle management pod at the same time, and the LCM performs unified resource and job scheduling. The microservice architecture itself is deployed as a pod, relying on the latest version of Kubernetes to make effective use of GPUs. When a K8S job crashes for any failure cause in the OS, in Docker, or from machine failure, the microservice architecture is restarted and its health is reported. Training jobs are scheduled in FIFO order by default, and the LCM supports job priorities. For each trai...
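The FIFO-by-default scheduling with job priorities described above can be sketched as a priority queue that breaks priority ties by submission order. This is a minimal illustration, not the patent's implementation; the class name, method names, and job names are hypothetical.

```python
import heapq
import itertools

class LCMScheduler:
    """Minimal sketch of FIFO scheduling with optional job priority.

    Jobs with equal priority run in submission (FIFO) order; jobs with a
    higher priority value are dispatched first. Names are hypothetical,
    not taken from the patent text.
    """

    def __init__(self):
        self._queue = []                # heap of (-priority, seq, job)
        self._seq = itertools.count()   # tie-breaker preserving FIFO order

    def submit(self, job_name, priority=0):
        # Negate priority so the largest priority pops first from the min-heap.
        heapq.heappush(self._queue, (-priority, next(self._seq), job_name))

    def next_job(self):
        if not self._queue:
            return None
        _, _, job = heapq.heappop(self._queue)
        return job

sched = LCMScheduler()
sched.submit("train-a")             # default priority, submitted first
sched.submit("train-b")
sched.submit("urgent-c", priority=5)
order = [sched.next_job(), sched.next_job(), sched.next_job()]
print(order)  # ['urgent-c', 'train-a', 'train-b']
```

The counter is what preserves FIFO order among equal-priority jobs; without it, heap ordering between ties would be arbitrary.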

Embodiment 2

[0057] The specific steps of the container-based parallel deep learning scheduling training method of the present invention are as follows:

[0058] S1. Pre-install Kubernetes (version 1.3 or above) on the host, designate one pod as the scheduling node, one pod as the monitoring node, and n pods as task nodes;

[0059] S2. The scheduling node is responsible for submitting job tasks and designates a task node for each round of iteration through the scheduling algorithm; the scheduling algorithm is as follows:

[0060] (1) When the threshold is exceeded, the computing task is transferred to a newly allocated computing node, and a spare task node (pod) appears at this time;

[0061] (2) A weight is set for each spare task node based on the amount of resources (GPU) it occupies;

[0062] (3) The larger the occupied resources, the greater the weight;

[0063] (4) When the threshold is exceeded again and a new node must be allocated, it is selected f...
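Steps (1) to (4) amount to weighted selection among spare task nodes, where a node's weight grows with the GPU resources it occupies. The sketch below is an assumption-laden illustration: the patent text is truncated before the exact selection rule, so the proportional weight (weight equals GPU count) and picking the highest-weight node are guesses, and the pod names are hypothetical.

```python
def assign_weights(spare_nodes):
    """Weight each spare task node by the GPU resources it occupies.

    `spare_nodes` maps a pod name to its occupied GPU count. Using the
    raw GPU count as the weight is an assumption; the patent only states
    that more occupied resources mean a greater weight.
    """
    return {pod: gpus for pod, gpus in spare_nodes.items()}

def select_node(spare_nodes):
    """Pick the spare node with the greatest weight (assumed rule)."""
    weights = assign_weights(spare_nodes)
    return max(weights, key=weights.get)

spare = {"pod-a": 1, "pod-b": 4, "pod-c": 2}   # hypothetical spare pods
print(select_node(spare))  # pod-b occupies the most GPUs
```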

Embodiment 3

[0077] As shown in attached figure 1, the container-based parallel deep learning scheduling training system of the present invention includes a microservice architecture, learning and training (DL), container cluster management, and lifecycle management (LCM);

[0078] Among them, the microservice architecture is used to reduce coupling between components, keep each component as single-purpose and stateless as possible, and isolate components from each other, allowing each component to be independently developed, tested, deployed, scaled, and upgraded; load balancing is achieved through dynamic registration of REST API service instances;
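The dynamic-registration load balancing mentioned above can be sketched as a registry that accumulates endpoints per service and hands them out in rotation. The round-robin policy, the class, and the endpoint URLs are illustrative assumptions; the patent only states that REST API service instances register dynamically to achieve load balancing.

```python
import itertools

class ServiceRegistry:
    """Minimal sketch of dynamic service registration with round-robin
    load balancing across REST API instances (assumed policy)."""

    def __init__(self):
        self._instances = {}   # service name -> list of endpoint URLs
        self._cursors = {}     # service name -> monotonically rising counter

    def register(self, service, endpoint):
        self._instances.setdefault(service, []).append(endpoint)
        self._cursors.setdefault(service, itertools.count())

    def resolve(self, service):
        # Rotate through the registered endpoints for this service.
        endpoints = self._instances[service]
        return endpoints[next(self._cursors[service]) % len(endpoints)]

reg = ServiceRegistry()
reg.register("lcm", "http://10.0.0.1:8080")   # hypothetical endpoints
reg.register("lcm", "http://10.0.0.2:8080")
picks = [reg.resolve("lcm") for _ in range(3)]
print(picks)  # alternates between the two registered instances
```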

[0079] Learning and training (DL) consists of learning nodes (Learning Pods) in the Kubernetes cluster that use GPUs, with user code instantiating the framework as a Kubernetes service. Typically, a learning and training job uses several GPUs / CPUs, or several learning nodes are synchronized over MPI through a centralized parameter service. Users submit training tasks and ma...



Abstract

The invention discloses a container-based parallel deep learning scheduling training method and system, belonging to the technical field of cloud computing and deep learning. The method uses Kubernetes containers to configure and schedule computing resources for tasks, and provides multiple resource management mechanisms such as ResourceQuota and LimitRanger. Resource isolation between tasks is achieved through communication between pod nodes in the container cluster. The same training node starts a training pod and a lifecycle management pod at the same time, and the LCM performs unified resource and job scheduling. The microservice architecture itself is deployed as a pod, relying on features of the latest version of Kubernetes to make effective use of GPUs. The invention also discloses a container-based parallel deep learning scheduling training system.

Description

Technical field

[0001] The invention relates to the technical field of cloud computing and deep learning, in particular to a container-based parallel deep learning scheduling training method and system.

Background technique

[0002] With the rapid development of machine learning and deep learning technologies, more and more individuals and enterprises prefer to use the TensorFlow framework released by Google for deep learning training. The framework is an open-source software library that uses data flow graphs for numerical computation. Sometimes a deep learning model requires too much computation for a single machine, which calls for distributed computing: a session is submitted through the client, a worker is defined, and a specific CPU / GPU is specified to run the training task. However, when running the framework's parallel computing mode, both the synchronous mode and the asynchronous mode have certain defects.

[0003] Task resources of TensorFlow cannot be isolated during training, which...

Claims


Application Information

Patent Type & Authority: Patent (China)
IPC(8): G06F9/48; G06F9/50; G06F9/455
Inventors: 窦洋, 杨继伟, 方亚东
Owner SHANDONG LANGCHAO YUNTOU INFORMATION TECH CO LTD