A container-based parallel deep learning scheduling training method and system

A deep learning training method, applied in the field of cloud computing and deep learning, that addresses problems such as the inability to isolate Task resources, the large amount of upper-layer development required, and the inconvenience of inspecting training tasks and logs, so as to improve computing resource utilization and strengthen task scheduling.

Active Publication Date: 2019-06-14
SHANDONG LANGCHAO YUNTOU INFORMATION TECH CO LTD

AI Technical Summary

Problems solved by technology

[0006] The technical task of the present invention is to provide a container-based parallel deep learning scheduling training method and system that solve the problem of being unable to isolate the various Task resources of

Examples

Embodiment 1

[0053] The container-based parallel deep learning scheduling training method of the present invention uses Kubernetes containers to configure and schedule the computing resources of tasks, provides multiple resource management mechanisms such as ResourceQuota and LimitRanger, and achieves resource isolation between tasks through communication between pod nodes in the container cluster. The same training node starts the training pod and the lifecycle management pod at the same time, and the LCM performs unified resource and job scheduling. The microservice architecture itself is deployed as a pod and relies on features of the latest version of Kubernetes to make effective use of GPUs. When a K8S job crashes because of a failure in the OS, Docker, or the machine, the microservice architecture is restarted and its health is reported. Training jobs are queued in FIFO order by default, and the LCM supports job priority. For each training task, LCM uses on-...
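The default FIFO ordering with optional job priority mentioned in [0053] can be illustrated with a small queue. This is a minimal sketch, not the LCM's actual implementation: the `JobQueue` class, its method names, and the job identifiers are hypothetical, and it assumes that a larger priority value means the job runs earlier.

```python
import heapq
import itertools


class JobQueue:
    """FIFO queue with optional priorities (hypothetical helper, not from the source)."""

    def __init__(self):
        self._heap = []
        # Monotonic submission counter: keeps FIFO order among equal priorities.
        self._counter = itertools.count()

    def submit(self, job_id, priority=0):
        # heapq is a min-heap, so the priority is negated to pop higher
        # priorities first (assumption: larger number = more urgent).
        heapq.heappush(self._heap, (-priority, next(self._counter), job_id))

    def next_job(self):
        # Return the job to run next, or None when the queue is empty.
        if not self._heap:
            return None
        _, _, job_id = heapq.heappop(self._heap)
        return job_id


queue = JobQueue()
queue.submit("train-a")
queue.submit("train-b")
queue.submit("train-urgent", priority=5)
order = [queue.next_job() for _ in range(3)]
print(order)  # ['train-urgent', 'train-a', 'train-b']
```

Jobs submitted without a priority keep their submission order, so the queue behaves as plain FIFO unless a priority is given.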

Embodiment 2

[0057] The container-based parallel deep learning scheduling training method of the present invention, the specific steps of the method are as follows:

[0058] S1. Pre-install Kubernetes (version 1.3 or above) on the host machine, and designate one pod as the scheduling node, one pod as the monitoring node, and n pods as task nodes;

[0059] S2. The scheduling node is responsible for submitting jobs and uses the scheduling algorithm to designate a task node to perform one round of iterations; the scheduling algorithm is as follows:

[0060] (1) When the threshold is exceeded, the computing task is transferred to a newly allocated computing node, which leaves a spare task node (pod);

[0061] (2) A weight is set based on the amount of resources (GPU) occupied by the spare task node;

[0062] (3) The more resources occupied, the greater the weight;

[0063] (4) When the threshold is exceeded again and a new node needs to be allocated, it is pr...
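Steps (2) and (3) above can be sketched as follows. The source does not give the weight formula, and step (4) is truncated, so this is only an illustration under stated assumptions: `SpareNode`, `assign_weights`, and `rank_by_weight` are hypothetical names, the weight is assumed to be each node's share of the total occupied GPUs, and ordering candidates from heaviest to lightest is an assumption rather than the patent's stated rule.

```python
from dataclasses import dataclass


@dataclass
class SpareNode:
    name: str
    gpus_in_use: int  # GPU resources occupied by the spare task node (pod)


def assign_weights(nodes):
    """Map each spare node to a weight that grows with its GPU usage.

    ASSUMPTION: the weight is the node's share of all occupied GPUs; the
    source only states that more occupied resources means a greater weight.
    """
    total = sum(n.gpus_in_use for n in nodes)
    if total == 0:
        return {n.name: 0.0 for n in nodes}
    return {n.name: n.gpus_in_use / total for n in nodes}


def rank_by_weight(nodes):
    """ASSUMPTION: list candidates from greatest to smallest weight.

    How the weights are consumed in step (4) is elided in the source.
    """
    weights = assign_weights(nodes)
    return sorted(nodes, key=lambda n: weights[n.name], reverse=True)


spare = [SpareNode("pod-a", 1), SpareNode("pod-b", 4), SpareNode("pod-c", 2)]
weights = assign_weights(spare)
ranked_names = [n.name for n in rank_by_weight(spare)]
print(ranked_names)  # ['pod-b', 'pod-c', 'pod-a']
```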

Embodiment 3

[0077] As shown in Figure 1, the container-based parallel deep learning scheduling training system of the present invention comprises a microservice architecture, learning training (DL), container cluster management, and lifecycle management (LCM);

[0078] The microservice architecture reduces the coupling between components, keeps each component single-purpose and as stateless as possible, isolates components from one another, and allows each component to be developed, tested, deployed, scaled, and upgraded independently; load balancing is achieved through dynamically registered REST API service instances;
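The load balancing over dynamically registered REST API service instances in [0078] can be pictured with a small in-memory registry. This is a sketch under assumptions: the source does not name a balancing policy, so round-robin is used here purely for illustration, and `ServiceRegistry`, its methods, the service name, and the URLs are all hypothetical.

```python
class ServiceRegistry:
    """In-memory registry of service instances (hypothetical, not from the source)."""

    def __init__(self):
        self._instances = {}  # service name -> list of instance URLs
        self._cursors = {}    # service name -> number of picks made so far

    def register(self, service, url):
        # Instances can be added at any time (dynamic registration).
        self._instances.setdefault(service, []).append(url)

    def deregister(self, service, url):
        # Remove an instance, for example after a failed health check.
        if url in self._instances.get(service, []):
            self._instances[service].remove(url)

    def pick(self, service):
        # ASSUMPTION: round-robin selection; the source names no policy.
        instances = self._instances.get(service, [])
        if not instances:
            raise LookupError(f"no instance registered for {service!r}")
        index = self._cursors.get(service, 0) % len(instances)
        self._cursors[service] = index + 1
        return instances[index]


registry = ServiceRegistry()
registry.register("trainer", "http://10.0.0.1:8080")
registry.register("trainer", "http://10.0.0.2:8080")
picks = [registry.pick("trainer") for _ in range(3)]
print(picks)
```

Because each component is kept as stateless as possible, any registered instance can serve any request, which is what makes a simple rotation like this workable.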

[0079] Learning training (DL) consists of a single learning node (Learning Pod) in the Kubernetes container that uses GPUs, with the user code instantiating the framework Kubernetes service; typically, a learning training job uses several GPUs / CPUs, or is synchronized across several learning nodes using a centralized parameter service over MPI; users submit training task...

Abstract

The invention discloses a container-based parallel deep learning scheduling training method and system, belonging to the technical field of cloud computing and deep learning. The technical problem to be solved is how to avoid the following shortcomings: the Task resources of TensorFlow cannot be isolated during training, tasks interfere with one another through resource preemption, scheduling capability is lacking, a large amount of upper-layer development is required, and it is inconvenient to inspect the training tasks and logs of each Task. The adopted technical scheme is as follows: Kubernetes containers are used to configure and schedule the computing resources of tasks; multiple resource management mechanisms such as ResourceQuota and LimitRanger are provided; and resource isolation between tasks is achieved through communication between pod nodes in the container cluster. The same training node starts the training pod and the lifecycle management pod at the same time, the LCM performs unified resource and job scheduling, and the microservice architecture is deployed as a pod and relies on features of the latest version of Kubernetes to make effective use of GPUs. The invention also discloses a container-based parallel deep learning scheduling training system.

Description

Technical field

[0001] The invention relates to the technical field of cloud computing and deep learning, in particular to a container-based parallel deep learning scheduling training method and system.

Background

[0002] With the rapid development of machine learning and deep learning technology, more and more individuals and enterprises prefer to use the TensorFlow framework released by Google for deep learning training. The framework is an open source software library that uses data flow graphs for numerical computation. Sometimes the amount of computation required by a deep learning model is so large that distributed computing is needed: a Session is submitted through the Client, a worker is defined, and a specific CPU / GPU is designated to run the training task. However, when the framework runs in parallel computing mode, both the synchronous mode and the asynchronous mode have certain defects.

[0003] The Task resources of TensorFlow canno...

Application Information

IPC(8): G06F9/48, G06F9/50, G06F9/455
Inventor: 窦洋, 杨继伟
Owner: SHANDONG LANGCHAO YUNTOU INFORMATION TECH CO LTD