High-performance cluster resource fair allocation method based on multi-agent reinforcement learning

The method applies reinforcement learning and multi-agent techniques to resource scheduling in high-performance clusters. It addresses the problem that limited cluster resources prevent every job from being executed immediately, and it aims for a flexible and fast adjustment process, lower time cost, and better generalization ability.

Pending Publication Date: 2022-06-17
BEIHANG UNIV

AI Technical Summary

Problems solved by technology

Every user wants the jobs they submit to run efficiently, but the cluster's limited resources make it impossible to execute every user's jobs immediately.
The problem can therefore be modeled as a multi-agent scheduling problem: each user wants to minimize the waiting time of their own jobs, but the total amount of resources is fixed, so users compete with one another for resources.
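
The competition described here can be written down as a Markov game, the model named in step S1 of the embodiment. The block below is a hedged sketch in standard textbook notation, not the patent's own formulation: N is the number of users (agents), S the state space, A_i the action set of user i, P the transition function, r_i the reward of user i, and γ the discount factor. The capacity constraint, with total capacity R and per-job demand d_j, is an assumption added for illustration.

```latex
\[
\mathcal{G} = \bigl\langle N,\ \mathcal{S},\ \{\mathcal{A}_i\}_{i=1}^{N},\ P,\ \{r_i\}_{i=1}^{N},\ \gamma \bigr\rangle
\]
\[
\max_{\pi_i}\ \mathbb{E}\Bigl[\sum_{t \ge 0} \gamma^{t}\, r_i\bigl(s_t, a_{1,t}, \dots, a_{N,t}\bigr)\Bigr]
\quad \text{for each user } i,
\qquad
\sum_{j \in \mathrm{running}(t)} d_j \le R \quad \text{for all } t .
\]
```

Under this reading, a natural but assumed choice for r_i is the negative waiting time of user i's jobs, so that each user's reward depends on the actions of all the others through the shared capacity R.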


Examples

Embodiment 1

[0024] As shown in Figure 1, an embodiment of the present invention provides a method for fair allocation of high-performance cluster resources based on multi-agent reinforcement learning, which includes the following steps:

[0025] Step S1: establish a Markov game model for high-performance cluster resource scheduling, including definitions of the job characteristic state, the cluster resource usage state, the single-user state, and the environment state of a single agent;

[0026] Step S2: collect real cluster data, use the simulation environment to perform job playback, and build a high-performance cluster simulation environment;

[0027] Step S3: train the policy and state value evaluation network in the high-performance cluster simulation environment, wherein the policy and state value evaluation network includes an action policy neural network NN_actor and a value evaluation neural network NN_critic, and a corresponding loss function is constructed for each to update its parameters.
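
This excerpt does not disclose the exact loss functions used in step S3. As a hedged illustration only, the sketch below computes the two losses of a generic one-step advantage actor-critic update for a single transition. The function names, the discrete action space, and the default discount factor are assumptions, not details taken from the patent.

```python
import math
from typing import List, Tuple


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over the actor's raw action scores."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def actor_critic_losses(
    logits: List[float],   # actor output for state s (hypothetical discrete actions)
    action: int,           # index of the action actually taken
    reward: float,         # reward observed after the action
    value: float,          # critic estimate V(s)
    next_value: float,     # critic estimate V(s')
    gamma: float = 0.99,   # discount factor (assumed value)
) -> Tuple[float, float]:
    """Return (actor_loss, critic_loss) for one transition.

    advantage   = r + gamma * V(s') - V(s)      (one-step TD error)
    actor loss  = -log pi(a|s) * advantage      (advantage treated as a constant)
    critic loss = advantage ** 2                (squared TD error)
    """
    advantage = reward + gamma * next_value - value
    log_prob = math.log(softmax(logits)[action])
    return -log_prob * advantage, advantage ** 2


actor_loss, critic_loss = actor_critic_losses(
    logits=[0.0, 0.0], action=0, reward=1.0, value=0.5, next_value=0.0, gamma=0.9
)
print(actor_loss, critic_loss)
```

In a real training loop the two losses would be minimized by gradient descent on the parameters of NN_actor and NN_critic respectively; the sketch only shows how each scalar loss is formed.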

...

Embodiment 2

[0095] As shown in Figure 6, an embodiment of the present invention provides a high-performance cluster resource fair allocation system based on multi-agent reinforcement learning, which includes the following modules:

[0096] A Markov game model establishment module 41, configured to establish a Markov game model for high-performance cluster resource scheduling, including definitions of the job characteristic state, the cluster resource usage state, the single-user state, and the environment state of a single agent;

[0097] A high-performance cluster simulation environment construction module 42, configured to collect real cluster data, perform job playback in the simulation environment, and build a high-performance cluster simulation environment;

[0098] A policy and state value evaluation network training module 43, configured to train the policy and state value evaluation network in the high-performance cluster simulation environment; wherein the policy and state value evaluation network includes: an action...
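
The job playback performed by module 42 can be pictured with a small trace replayer. The sketch below is a minimal illustration under stated assumptions: the trace record layout `(user_id, submit_time, runtime, nodes)`, the function name `replay_fcfs`, and the first-come-first-served baseline policy are all hypothetical and are not taken from the patent, which would replace the fixed policy with the learned agents.

```python
import heapq
from typing import Dict, List, Tuple

# Hypothetical trace record: (user_id, submit_time, runtime, nodes)
Job = Tuple[int, float, float, int]


def replay_fcfs(trace: List[Job], total_nodes: int) -> Dict[int, float]:
    """Replay a job trace first-come-first-served (no backfilling) and
    return the mean waiting time per user."""
    free = total_nodes
    running: List[Tuple[float, int]] = []  # min-heap of (finish_time, nodes)
    now = 0.0
    waits: Dict[int, List[float]] = {}
    for user, submit, runtime, nodes in sorted(trace, key=lambda j: j[1]):
        if nodes > total_nodes:
            raise ValueError("job requests more nodes than the cluster has")
        # A job cannot start before it is submitted or before earlier jobs start
        now = max(now, submit)
        # Release the nodes of jobs that have already finished
        while running and running[0][0] <= now:
            free += heapq.heappop(running)[1]
        # Otherwise advance time until enough nodes are free
        while free < nodes:
            finish, released = heapq.heappop(running)
            now = max(now, finish)
            free += released
        free -= nodes
        heapq.heappush(running, (now + runtime, nodes))
        waits.setdefault(user, []).append(now - submit)
    return {u: sum(w) / len(w) for u, w in waits.items()}


trace = [(0, 0.0, 10.0, 4), (1, 1.0, 5.0, 2), (0, 2.0, 5.0, 2)]
print(replay_fcfs(trace, total_nodes=4))  # → {0: 4.0, 1: 9.0}
```

In the example, user 0's first job occupies the whole 4-node cluster for 10 time units, so user 1's job waits 9 units and user 0's second job waits 8. Per-user waiting times of this kind are the quantity each agent is said to optimize.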

Abstract

The invention relates to a method and system for fair allocation of high-performance cluster resources based on multi-agent reinforcement learning. The method comprises the following steps: S1, establishing a Markov game model of high-performance cluster resource scheduling; S2, collecting real cluster data, performing job playback in a simulation environment, and constructing a high-performance cluster simulation environment; S3, training the policy and state value evaluation network in the high-performance cluster simulation environment, wherein the policy and state value evaluation network comprises an action policy neural network NN_actor and a value evaluation neural network NN_critic, and corresponding loss functions are constructed for each to update the parameters. The method maintains fairness of resource use among users without reducing the cluster resource utilization rate.
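
The abstract states that fairness among users is maintained, but this excerpt does not say how fairness is measured. Purely as an assumption for illustration, the sketch below uses Jain's fairness index, a common metric that is not stated to be the one used by the patent. The function name `jain_index` and the sample shares are hypothetical.

```python
from typing import Sequence


def jain_index(shares: Sequence[float]) -> float:
    """Jain's fairness index: (sum x)^2 / (n * sum x^2).

    Ranges from 1/n (one user holds everything) to 1.0 (all users equal).
    """
    n = len(shares)
    sum_sq = sum(x * x for x in shares)
    if n == 0 or sum_sq == 0:
        raise ValueError("need at least one non-zero share")
    return sum(shares) ** 2 / (n * sum_sq)


print(jain_index([1, 1, 1, 1]))  # → 1.0
print(jain_index([3, 1]))        # → 0.8
```

A metric of this kind could be computed over per-user resource usage or per-user waiting times from a replayed trace and tracked alongside cluster utilization.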

Description

Technical field

[0001] The invention relates to the field of resource scheduling for high-performance clusters, and in particular to a method and system for fair allocation of high-performance cluster resources based on multi-agent reinforcement learning.

Background

[0002] In recent years, deep learning has made great progress in many fields, such as computer vision, image recognition, natural language processing, and recommendation algorithms. To improve the accuracy of training results, the models in use keep growing in scale and the amount of training data keeps expanding. For example, pre-training the BERT model proposed by Google in 2018 used 16 TPU v3 chips and took 3 days to complete. Training a ResNet-50 model on 8 Tesla P100 GPUs took 29 hours. With the continuous expansion of training calculations, in order to ensure that the training t...

Application Information

IPC(8): G06F30/27, G06F9/48, G06F9/50
CPC: G06F30/27, G06F9/4806, G06F9/5005
Inventor: 李巍, 孙元昊, 李云春
Owner: BEIHANG UNIV