Resource management method and system for large-scale distributed deep learning

A resource management and deep learning technology, applied in the fields of neural learning methods, resource allocation, and electrical digital data processing. It addresses problems such as low resource utilization in training systems, the inability to manage multiple resources in a coordinated way, and out-of-bounds memory access. It achieves a reasonable resource configuration, avoids missing gradient data, and reduces memory consumption.

Pending Publication Date: 2020-10-30
HUAZHONG UNIV OF SCI & TECH

AI Technical Summary

Problems solved by technology

[0006] In view of the defects of the prior art, the purpose of the present invention is to provide a resource management method and system for large-scale distributed deep learning. To manage and configure computing, memory, and bandwidth resources at the same time, the prior art directly combines memory optimization with communication optimization. That approach cannot manage multiple resources in a coordinated way, which leads to low resource utilization in the training system and easily causes data loss and out-of-bounds memory accesses. The invention aims to solve this problem.

Examples

Embodiment Construction

[0058] To make the purpose, technical solution, and advantages of the present invention clearer, the invention is described in further detail below with reference to the accompanying drawings and embodiments. The specific embodiments described here serve only to explain the present invention, not to limit it.

[0059] First, the technical terms involved in the present invention are explained:

[0060] Parameter weight (weight): Neurons in a neural network are arranged layer by layer, and neurons in adjacent layers are connected to each other. Each connection has a parameter weight that determines how strongly the input data influences the neuron. The parameter weights of each layer are arranged into a parameter matrix, so that the forward and backward computation of each layer can be expressed as matrix multiplication.
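As an illustration of the definition in [0060], the following is a minimal sketch of how one fully connected layer's forward and backward computation reduces to matrix multiplication. It is not taken from the patent: the helper names (`matmul`, `transpose`, `forward`, `backward`), the absence of a bias term and activation function, and the example values are all assumptions made for illustration.

```python
# Illustrative sketch only; names and values are hypothetical, not from the patent.

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*m)]

def forward(x, w):
    """Forward computation of one layer: y = x @ w."""
    return matmul(x, w)

def backward(x, w, grad_y):
    """Backward computation of the same layer.

    grad_w = x^T @ grad_y  (gradient with respect to the parameter matrix)
    grad_x = grad_y @ w^T  (gradient passed to the previous layer)
    """
    grad_w = matmul(transpose(x), grad_y)
    grad_x = matmul(grad_y, transpose(w))
    return grad_x, grad_w

# One sample with 2 inputs, and a 2x3 parameter matrix (3 output neurons).
x = [[1.0, 2.0]]
w = [[0.5, -1.0, 0.0],
     [1.5, 2.0, 1.0]]

y = forward(x, w)
grad_x, grad_w = backward(x, w, [[1.0, 1.0, 1.0]])
print(y, grad_x, grad_w)
```

Each row of `w` holds the weights leaving one input neuron, so the parameter matrix has one entry per connection between the two layers. The gradient `grad_w` has the same shape as `w`, so a stored parameter matrix and its gradient occupy the same amount of memory.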

[0061]...

Abstract

The invention discloses a resource management method and system for large-scale distributed deep learning. The method and system optimize the management of memory resources for parameter and gradient intermediate data during neural network training, and ensure a reasonable configuration of distributed communication bandwidth resources. Cross-layer memory reuse is realized: intermediate data required by iterative computation and sparse communication is migrated to CPU main memory and migrated back when needed, which reduces inter-layer memory consumption. On the basis of reasonable CPU-GPU data migration, intra-layer memory reuse is realized by exploiting the independence of intra-layer computation and memory access operations, which reduces intra-layer memory consumption as much as possible. Distributed parameter communication is optimized while efficient use of memory resources is ensured. Data accesses in the distributed parameter update stage are redirected so that CPU main memory serves as a mirror access area through which accesses to parameters and gradients are completed, which solves the problems of missing gradient data and out-of-bounds parameter writes.
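The migration idea in the abstract, moving intermediate data to CPU main memory and bringing it back on demand, can be pictured with a toy simulation. This is a sketch under stated assumptions and not the patented method. The class `OffloadStore`, its first-in-first-out eviction policy, the abstract size units, and the use of plain dictionaries in place of real GPU and host buffers are all hypothetical.

```python
# Toy simulation only; OffloadStore and its FIFO eviction policy are
# hypothetical and stand in for real CPU-GPU transfers.

class OffloadStore:
    """Models a small GPU memory pool that spills data to CPU main memory."""

    def __init__(self, gpu_capacity):
        self.gpu_capacity = gpu_capacity
        self.gpu = {}    # name -> (size, data) currently resident on the "GPU"
        self.cpu = {}    # name -> (size, data) migrated to "CPU main memory"
        self.peak = 0    # highest GPU usage observed so far

    def gpu_used(self):
        return sum(size for size, _ in self.gpu.values())

    def put(self, name, size, data):
        """Place data on the GPU, migrating the oldest entries out if needed."""
        if size > self.gpu_capacity:
            raise MemoryError(f"{name} does not fit in GPU memory")
        while self.gpu_used() + size > self.gpu_capacity:
            victim = next(iter(self.gpu))          # oldest resident entry
            self.cpu[victim] = self.gpu.pop(victim)
        self.gpu[name] = (size, data)
        self.peak = max(self.peak, self.gpu_used())

    def get(self, name):
        """Return data, migrating it back from CPU main memory on demand."""
        if name not in self.gpu:
            size, data = self.cpu.pop(name)
            self.put(name, size, data)
        return self.gpu[name][1]

# Forward pass: three layers each produce intermediate data of size 2,
# but the simulated GPU only holds 4 units.
store = OffloadStore(gpu_capacity=4)
for i in range(3):
    store.put(f"act{i}", 2, f"a{i}")

# Backward pass: the data is needed again in reverse order.
fetched = [store.get(f"act{i}") for i in reversed(range(3))]
print(fetched, store.peak, sorted(store.cpu))
```

In this run the total intermediate data is 6 units, while peak GPU usage stays within the 4-unit capacity because the earliest data is migrated out during the forward pass and fetched back when the backward pass reaches it. A real system would overlap such transfers with computation. The simulation ignores transfer cost entirely.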

Description

Technical field

[0001] The invention belongs to the technical field of distributed systems, and more specifically relates to a resource management method and system for large-scale distributed deep learning.

Background

[0002] Deep learning (deep neural networks) has achieved breakthroughs in fields such as computer vision, language modeling, and speech recognition. Compared with traditional artificial neural networks, deep neural networks have more hidden layers and neurons, require a huge amount of computation in the training and inference phases, and generate a large amount of intermediate data. As a result, most deep learning applications place heavy demands on storage space and computing resources during training, and current high-performance acceleration hardware (such as GPUs) cannot meet these demands well. So there has been a lot of optimization work on memory management ...

Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F9/50, G06N3/08
CPC: G06F9/5016, G06F9/5027, G06N3/084
Inventors: 王芳, 冯丹, 赵少锋, 刘博
Owner: HUAZHONG UNIV OF SCI & TECH