Distributed training method and device for deep learning model

The method applies deep learning and distributed computing technology to the distributed training of deep learning models. It addresses problems such as the inability to adaptively adjust the number of worker servers, idle GPUs, and low GPU cluster utilization.

Pending Publication Date: 2020-11-27
CHINA UNIONPAY
0 Cites, 24 Cited by

AI Technical Summary

Problems solved by technology

[0003] However, when a deep learning model performs different training tasks, some tasks require more GPUs while others require fewer, and some special training tasks use GPUs with a certain periodicity, with peaks and valleys in usage, which leaves GPUs idle during some training tasks. Because the number of worker servers cannot be adaptively adjusted for different training tasks, GPU cluster utilization is low.
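The utilization problem described above can be illustrated with a small calculation. This is a minimal sketch, not taken from the patent: the periodic demand trace, the `utilization` helper, and the idealized "elastic" allocation that tracks demand exactly are all assumptions made for illustration.

```python
def utilization(demand, allocated):
    """Fraction of allocated GPU-time that is actually used."""
    used = sum(min(d, a) for d, a in zip(demand, allocated))
    total = sum(allocated)
    return used / total if total else 0.0


# Hypothetical periodic GPU demand (GPUs needed per time slot): peaks and valleys.
demand = [2, 8, 2, 8, 2, 8]

# Fixed allocation: the cluster is sized for the peak and never changes.
fixed = [max(demand)] * len(demand)

# Idealized elastic allocation: the allocation follows demand in every slot.
elastic = list(demand)

fixed_util = utilization(demand, fixed)      # 30 used out of 48 allocated
elastic_util = utilization(demand, elastic)  # every allocated GPU is used
```

With a fixed allocation, the GPUs provisioned for the peak sit idle in every valley; an allocation that follows demand removes that idle time, which is the gap the problem statement points at.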

Embodiment Construction

[0091] Exemplary embodiments of the present application are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present application to facilitate understanding, and they should be regarded as exemplary only. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the application. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.

[0092] As shown in Figure 1, this embodiment provides a distributed training method for a deep learning model, including the following steps:

[0093] Step S110: Obtain the training state data corresponding to the training task sent by the deep learning platform;

[0094] Step S120: Generate an elastic scaling policy according to the resource requirements corresp...

Abstract

The invention discloses a distributed training method and device for a deep learning model. According to the specific implementation scheme, the method comprises the steps of: obtaining training state data corresponding to a training task sent by a deep learning platform; generating an elastic scaling strategy according to the cluster resource demand corresponding to the training task; dynamically adjusting the number of training nodes corresponding to the training task by applying the elastic scaling strategy; and executing the training task according to the training state data and the adjusted training nodes. The method improves adaptability to the cluster resource demand of the training task, improves GPU or CPU resource utilization, and ensures that the training task can be executed correctly and efficiently on the adjusted training nodes even when training nodes are added or removed at any time.
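The four steps in the abstract can be sketched as a simple control loop. This is a minimal illustration under stated assumptions, not the patented implementation: the names `TrainingState`, `ScalingPolicy`, `make_scaling_policy`, `adjust_nodes`, and `run_training_step`, the rule of one node per `gpus_per_node` GPUs of demand, and the `max_nodes` cap are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class TrainingState:
    """Hypothetical stand-in for the training state data sent by the platform."""
    step: int = 0
    params: list = field(default_factory=lambda: [0.0])


@dataclass
class ScalingPolicy:
    """Hypothetical elastic scaling policy: just a target node count."""
    target_nodes: int


def make_scaling_policy(gpu_demand, gpus_per_node, max_nodes):
    # Assumed rule: enough nodes to cover demand (ceiling division),
    # clamped to the range [1, max_nodes].
    needed = -(-gpu_demand // gpus_per_node)
    return ScalingPolicy(target_nodes=max(1, min(max_nodes, needed)))


def adjust_nodes(nodes, policy):
    # Add or remove training nodes until the count matches the policy.
    nodes = list(nodes)
    while len(nodes) < policy.target_nodes:
        nodes.append(f"node-{len(nodes)}")
    del nodes[policy.target_nodes:]
    return nodes


def run_training_step(state, nodes):
    # Placeholder for real training: only advances the step counter,
    # carrying the state over so training resumes after a resize.
    return TrainingState(step=state.step + 1, params=state.params)


def train(state, demand_trace, gpus_per_node=4, max_nodes=8):
    nodes, history = [], []
    for gpu_demand in demand_trace:
        policy = make_scaling_policy(gpu_demand, gpus_per_node, max_nodes)
        nodes = adjust_nodes(nodes, policy)
        state = run_training_step(state, nodes)
        history.append(len(nodes))
    return state, history


# Hypothetical demand trace, in GPUs, for four consecutive scheduling rounds.
final_state, node_counts = train(TrainingState(), [4, 16, 40, 6])
```

The point of the sketch is the separation of concerns: the policy is derived only from resource demand, the node set is resized independently of the training logic, and the training state is passed through each resize so that work continues from where it left off.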

Description

Technical field

[0001] This application relates to the field of deep learning, and in particular to the field of distributed training.

Background

[0002] A deep learning framework/platform supports a distributed training mode: multiple devices can be used, each device can be equipped with multiple GPUs (Graphics Processing Units), and the deep learning model is trained in parallel on the GPUs of each device. Among existing deep learning frameworks/platforms, for example, the native PS (parameter server) architecture of TensorFlow (which is based on dataflow programming) supports an asynchronous training mode. At run time, the deep learning framework/platform can be deployed to a specific physical cluster. The nodes in a TensorFlow cluster are divided into two categories: parameter servers and workers. The parameter server stores the parameters of the model, and the worker is responsible for calcu...
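The division of roles between parameter servers and workers in asynchronous training can be sketched in plain Python. This is a single-process simulation under stated assumptions and does not use the TensorFlow API: the `ParameterServer` and `Worker` classes, the scalar least-squares objective, and the assumption that workers compute gradients on their own data shards are introduced here for illustration only.

```python
class ParameterServer:
    """Stores the model parameter and applies gradients as they arrive."""

    def __init__(self, w, lr):
        self.w = w
        self.lr = lr

    def pull(self):
        return self.w

    def push(self, grad):
        # Asynchronous update: applied immediately, without waiting
        # for gradients from the other workers.
        self.w -= self.lr * grad


class Worker:
    """Computes gradients on its own data shard from a local parameter copy."""

    def __init__(self, data):
        self.data = data
        self.local_w = None

    def pull_from(self, ps):
        self.local_w = ps.pull()

    def gradient(self):
        # Gradient of mean((w - x)^2) over the shard: 2 * mean(w - x).
        return 2 * sum(self.local_w - x for x in self.data) / len(self.data)


ps = ParameterServer(w=0.0, lr=0.1)
workers = [Worker([1.0, 2.0]), Worker([3.0, 4.0])]

for _ in range(200):
    # Both workers read the same snapshot of the parameter...
    for wk in workers:
        wk.pull_from(ps)
    # ...so the second push is computed from a stale copy, as can
    # happen in asynchronous training.
    for wk in workers:
        ps.push(wk.gradient())
```

In this toy setting the parameter settles at 2.5, the mean of all data across both shards, even though each push is applied without synchronization. Because workers only pull and push against the server, adding or removing a worker does not require the others to stop.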

Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F9/50, G06N3/063, G06N3/04, G06N3/08
CPC: G06F9/5027, G06F9/5066, G06N3/063, G06N3/08, G06N3/045
Inventor: 乔萧雅, 刘国宝, 周雍恺
Owner: CHINA UNIONPAY