Deep learning-oriented multi-type GPU cluster resource management scheduling method and system

A GPU cluster resource management technology, applied in the field of multi-type GPU cluster resource management and scheduling, that addresses the single-mode allocation of job resources, the added difficulty for cluster users, and the difficulty of meeting diverse user needs. Its effects are fewer GPU clusters to maintain, improved resource utilization, and simpler management.

Active Publication Date: 2019-11-12
HANGZHOU EBOYLAMP ELECTRONICS CO LTD
Cites 6; cited by 9

AI Technical Summary

Problems solved by technology

However, current GPU cluster user management lacks support for such personalized requirements, making it difficult to meet the needs of different users.
[0005] Moreover, job resource allocation in current GPU cluster management systems is simplistic: it can only allocate a number of GPUs or designate a specific GPU. Users must estimate their own resource requirements and understand GPU parameters themselves, which makes the cluster harder to use.
[0006] In summary, the scheduling and user management of existing GPU cluster systems cannot effectively meet the following requirements: users with different priorities and different computing power requirements; support for different GPU types under unified management in a single cluster; and easy expansion.
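Paragraph [0005] notes that users must translate their needs into a GPU count on their own. A minimal sketch of the conversion a scheduler could perform on the user's behalf, assuming computing capability is a single throughput figure that scales linearly with GPU count (a simplification). The model names and TFLOPS values in `PER_GPU_TFLOPS` and the function `gpus_needed` are hypothetical, not taken from the patent.

```python
import math

# Hypothetical per-GPU throughput (TFLOPS) for illustration only; real figures
# depend on the GPU model, numeric precision and workload.
PER_GPU_TFLOPS = {
    "model-A": 15.0,
    "model-B": 8.0,
}


def gpus_needed(required_tflops: float, model: str) -> int:
    """Smallest number of `model` GPUs whose combined throughput covers the
    requirement, assuming throughput adds up linearly across GPUs."""
    per_gpu = PER_GPU_TFLOPS[model]
    return math.ceil(required_tflops / per_gpu)


if __name__ == "__main__":
    for model in PER_GPU_TFLOPS:
        print(model, gpus_needed(30.0, model))  # model-A 2, then model-B 4
```

The same 30 TFLOPS request maps to different GPU counts per model, which is the detail the user would otherwise have to work out by hand.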

Embodiment Construction

[0044] The technical solutions in the embodiments of the present application will be clearly and completely described below in conjunction with the accompanying drawings in the embodiments of the present application. Obviously, the described embodiments are only some of the embodiments of the present application, not all of them. Based on the embodiments in this application, all other embodiments obtained by persons of ordinary skill in the art without making creative efforts belong to the scope of protection of this application.

[0045] Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the technical field to which this application belongs. The terms used herein in the description of the application are only for the purpose of describing specific embodiments, and are not intended to limit the application.

[0046] In one embodiment, a multi-type GPU cluster resource management and ...

Abstract

The invention discloses a deep learning-oriented multi-type GPU cluster resource management and scheduling method and system. The method comprises the following steps: dividing a GPU cluster into a plurality of GPU groups according to GPU model, counting the idle computing capability of each GPU group, obtaining all users who access the GPU cluster, and recording the minimum computing capability requirement of each user; then periodically accessing the job queue, obtaining the highest-priority pending job in the queue, and scheduling GPU cluster resources for that job. With the invention, GPUs of different brands and models are managed uniformly as one cluster for deep learning, which reduces the number of GPU clusters to maintain and simplifies GPU cluster management, and the differing requirements of users in deep learning can be met. User attributes are set according to user requirements, so users do not need to be familiar with or concerned about the GPU cluster environment. Resources are scheduled according to each user's computing capability requirement and priority, the scheduling method automatically allocates resources that meet those requirements, and the resource utilization of the different GPU type groups is increased.
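The steps in the abstract can be sketched as a small in-memory scheduler. This is an illustrative sketch, not the patented implementation: the `GPU`, `User` and `Scheduler` names, TFLOPS as the capability unit, a job inheriting its user's priority, and trying the group with the most idle capability first are all assumptions. Each allocation is drawn from a single model group, so a multi-GPU job never mixes GPU models.

```python
import heapq
import itertools
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class GPU:
    gpu_id: str
    model: str
    tflops: float  # hypothetical per-GPU computing capability
    busy: bool = False


@dataclass
class User:
    name: str
    priority: int      # larger value = higher priority (assumption)
    min_tflops: float  # minimum computing capability the user requires


class Scheduler:
    def __init__(self, gpus, users):
        # Divide the cluster into GPU groups keyed by GPU model.
        self.groups = defaultdict(list)
        for gpu in gpus:
            self.groups[gpu.model].append(gpu)
        # Record every user and their minimum capability requirement.
        self.users = {u.name: u for u in users}
        # Max-priority job queue: heap of (-priority, seq, job_id, user_name).
        self._queue = []
        self._seq = itertools.count()  # FIFO tie-break for equal priorities

    def idle_capability(self, model):
        """Total computing capability of the idle GPUs in one model group."""
        return sum(g.tflops for g in self.groups[model] if not g.busy)

    def submit(self, job_id, user_name):
        priority = self.users[user_name].priority  # job inherits user priority
        heapq.heappush(self._queue, (-priority, next(self._seq), job_id, user_name))

    def schedule_once(self):
        """One periodic pass: try to place the highest-priority pending job."""
        if not self._queue:
            return None
        _, _, job_id, user_name = self._queue[0]
        need = self.users[user_name].min_tflops
        # Try groups with the most idle capability first (hypothetical policy).
        for model in sorted(self.groups, key=self.idle_capability, reverse=True):
            if self.idle_capability(model) < need:
                continue
            chosen, total = [], 0.0
            for g in self.groups[model]:
                if not g.busy and total < need:
                    chosen.append(g)
                    total += g.tflops
            for g in chosen:
                g.busy = True
            heapq.heappop(self._queue)
            return job_id, model, [g.gpu_id for g in chosen]
        return None  # no single group has enough idle capability; job waits


if __name__ == "__main__":
    gpus = [GPU("a0", "model-A", 15.0), GPU("a1", "model-A", 15.0),
            GPU("b0", "model-B", 8.0), GPU("b1", "model-B", 8.0),
            GPU("b2", "model-B", 8.0)]
    users = [User("alice", priority=2, min_tflops=20.0),
             User("bob", priority=1, min_tflops=8.0)]
    s = Scheduler(gpus, users)
    s.submit("job-bob", "bob")
    s.submit("job-alice", "alice")
    print(s.schedule_once())  # ('job-alice', 'model-A', ['a0', 'a1'])
    print(s.schedule_once())  # ('job-bob', 'model-B', ['b0'])
```

Here `schedule_once` stands for one periodic visit to the job queue; a real system would call it on a timer. If the highest-priority job cannot be placed, it stays at the head of the queue, which keeps priority order strict at the cost of possible head-of-line blocking.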

Description

Technical field

[0001] The present application belongs to the field of high-performance computing and specifically relates to a deep learning-oriented multi-type GPU cluster resource management and scheduling method and system.

Background

[0002] In many technical fields, such as image and speech recognition, natural language processing, and reinforcement learning, deep learning has proven very effective and has matched or even surpassed human performance on some problems. However, deep learning depends heavily on computing power, and the resources of a single GPU often cannot meet the processing requirements of large-scale data and models. Multi-GPU parallel computing can effectively shorten deep learning time. Deep learning frameworks such as TensorFlow, Caffe, and PyTorch already support multi-GPU parallel computing, but in multi-GPU parallel computing the best performance can only be achieved when the GPU models are the sa...

Application Information

Patent type & authority: Applications (China)
IPC(8): G06F9/50
CPC: G06F9/5083; Y02D10/00
Inventors: 丁钢波, 蔡晓晰, 杨杰, 高翔, 王铜铜, 韩樑
Owner: HANGZHOU EBOYLAMP ELECTRONICS CO LTD