Multi-task reinforcement learning method for realizing parallel task scheduling

A reinforcement learning and task scheduling technology, applied in the fields of information processing and distributed and parallel computing. It can solve problems such as the difficulty of accurate modeling and the difficulty for heuristic algorithms to deliver good scheduling performance, achieving the effect of improved generalization.

Status: Inactive · Publication Date: 2019-12-17
BEIJING UNIV OF POSTS & TELECOMM

AI-Extracted Technical Summary

Problems solved by technology

However, the computing platform environment is always dynamic and large-scale, and it is very difficult to ...

Abstract

A multi-task reinforcement learning method for realizing parallel task scheduling, implemented on the basis of the Asynchronous Advantage Actor-Critic algorithm, comprises the following operation steps: (1) configuring the algorithm model to better solve the parallel multi-task scheduling problem, including setting a state space, setting an action space and setting a reward definition; (2) improving the algorithm network as follows: using deep neural networks to represent the policy function and the value function, wherein the global network is composed of an input layer, a shared sub-network and an output sub-network; (3) setting a new loss function for the algorithm; and (4) training the algorithm network with collected and observed parallel task scheduling data, and applying the network to parallel task scheduling after the algorithm converges.


Examples


Example Embodiment

[0038] In order to make the object, technical solution and advantages of the present invention clearer, the present invention will be further described in detail below in conjunction with the accompanying drawings.
[0039] Referring to Figure 1, the present invention proposes a multi-task reinforcement learning method for realizing parallel task scheduling, implemented on the basis of the Asynchronous Advantage Actor-Critic (A3C) algorithm. The method comprises the following operation steps:
[0040] (1) Perform the following setting operations on the Asynchronous Advantage Actor-Critic algorithm model to better solve the parallel multi-task scheduling problem:
[0041] (1.1) Set the state space S as the set S = {F_task, L, T, F_node}, where:
[0042] F_task = {f_1, f_2, f_3, ..., f_M} represents the numbers of CPU instructions of a job's subtasks, where M is a natural number denoting the maximum number of subtasks of a job; f_1 denotes the first subtask, f_2 the second subtask, f_3 the third subtask, and f_M the M-th subtask; a job refers to the assignment of parallel tasks to server nodes with different computing capabilities and resources;
[0043] L = {L_1, L_2, L_3, ..., L_i, ..., L_M} represents the information of the M subtasks; L_i = {l_1, l_2, l_3, ..., l_j, ..., l_N} represents the length and storage location of the data to be processed by the i-th subtask: if the data to be processed is stored on server node j, the element l_j is set to the length of that data and all other elements are set to zero; N is a natural number denoting the maximum number of server nodes;
[0044] T represents the estimated remaining execution time of the tasks waiting to be executed in each sub-thread of all server nodes; F_node represents the current CPU frequency of all server nodes.
[0045] In our experiments, the maximum number of subtasks is M = 5 and the number of computing nodes is N = 10.
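As a minimal Python sketch (not part of the patent; the helper name build_state, the use of one remaining-time estimate per node and the example numbers are illustrative assumptions), the state S = {F_task, L, T, F_node} could be flattened into a single observation vector as follows, with M = 5 and N = 10 as in the embodiment:

```python
import numpy as np

# Illustrative state construction for S = {F_task, L, T, F_node}.
# Assumption: one remaining-time estimate per node (the patent allows one
# value per sub-thread of every node).
M, N = 5, 10

def build_state(f_task, data_len, data_node, t_remaining, f_node):
    """f_task:      (M,) CPU instruction counts, zero-padded if fewer subtasks
    data_len:    (M,) length of the data each subtask must process
    data_node:   (M,) index of the server node storing that data
    t_remaining: (N,) estimated remaining execution time per node
    f_node:      (N,) current CPU frequency of each node"""
    L = np.zeros((M, N))
    for i in range(M):
        if data_len[i] > 0:                   # subtask i actually exists
            L[i, data_node[i]] = data_len[i]  # l_j = data length stored at node j
    return np.concatenate([f_task, L.ravel(), t_remaining, f_node])

# Example: a job with 3 real subtasks out of a maximum of M = 5
state = build_state(
    f_task=np.array([4e6, 2e6, 7e6, 0, 0]),
    data_len=np.array([120, 80, 200, 0, 0]),
    data_node=np.array([2, 5, 0, 0, 0]),
    t_remaining=np.random.rand(N),
    f_node=np.full(N, 2.4),
)
print(state.shape)  # (75,) = M + M*N + N + N
```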
[0046] (1.2) Set the action space: divide the overall decision for a job into M sub-decisions, corresponding to the M subtasks. For each sub-decision the action space is {1, 2, 3, ..., N}; if the action is i, the subtask is dispatched to the i-th server node. If the number of subtasks is less than M, the corresponding output actions are discarded directly. The complete action a_t of a job is expressed as a_t = {a_{t,1}, a_{t,2}, ..., a_{t,i}, ..., a_{t,M}}, where a_{t,i} denotes the number of the server node to which the i-th subtask is assigned at time t;
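A small illustrative sketch (zero-based node indices and the helper name apply_action are assumptions, not part of the patent) of how the composite action a_t is interpreted and how the sub-decisions of non-existent subtasks are discarded:

```python
import numpy as np

def apply_action(a_t, num_real_subtasks):
    """Keep only the sub-decisions that correspond to existing subtasks."""
    return [(i, int(node)) for i, node in enumerate(a_t[:num_real_subtasks])]

# A job with 3 real subtasks: the last two sub-decisions are ignored.
print(apply_action(np.array([2, 5, 0, 7, 1]), num_real_subtasks=3))
# -> [(0, 2), (1, 5), (2, 0)]  i.e. subtask 0 -> node 2, subtask 1 -> node 5, ...
```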
[0047] (1.3) Set the reward definition: the reward is designed to minimize the average job execution time, that is, the reward at each decision point t is set to r_t = T_base - T_job(s_t, a_t), where T_base is a baseline job execution time and T_job(s_t, a_t) is the actual execution time of the corresponding job under the decision made at time t; s_t denotes the state of the job scheduling problem at time t, and a_t is the decision action taken in state s_t; in our experiments T_base = 9;
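As a minimal illustration (the function name and the example execution time are assumptions), the reward at a decision point is simply the baseline minus the measured job execution time, so a job that finishes faster than the baseline earns a positive reward:

```python
T_BASE = 9.0  # baseline job execution time used in the embodiment

def reward(t_job: float) -> float:
    """r_t = T_base - T_job(s_t, a_t)."""
    return T_BASE - t_job

print(reward(6.5))  # 2.5
```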
[0048] (2) Referring to Figure 2, the Asynchronous Advantage Actor-Critic algorithm network is improved as follows:
[0049] (2.1) Use deep neural networks to represent the policy function and the value function, that is, use the Actor network to represent the policy function and the Critic network to represent the value function. Multiple Actor networks are set up to schedule the subtasks separately; the neural network therefore contains M softmax output branch sub-networks for the policies π_i(a_{t,i}|s_t; θ_i) and one linear output branch sub-network for the value function V(s_t; θ_v). Here π_i(a_{t,i}|s_t; θ_i) denotes the policy for the i-th subtask given by the i-th softmax output branch sub-network, a_{t,i} the action for the i-th subtask at time t, s_t the state of the job scheduling problem at time t, θ_i the parameters of the i-th softmax output branch sub-network, and θ_v the parameters of the linear output branch sub-network. The branch sub-networks share several non-output layers. Each softmax output branch sub-network contains N output nodes, giving the probability distribution of assigning the subtask to the server nodes;
[0050] (2.2) The global network is composed of an input layer, a shared sub-network and an output sub-network. The input of the input layer is the state of the job scheduling problem. The shared sub-network is composed of three fully connected layers. The output sub-network is composed of the aforementioned M softmax output branch sub-networks and one linear output branch sub-network; each softmax output branch sub-network consists of a fully connected layer and a softmax output layer, and the linear output branch sub-network consists of a fully connected layer and a linear output layer;
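Below is a minimal PyTorch sketch of the global network described in (2.1)-(2.2); it is an illustrative reconstruction rather than the patent's own code, assuming the 75-dimensional state of the earlier sketch and the layer widths given later in the embodiment (128-256-128 shared layers, 64-node branch layers):

```python
import torch
import torch.nn as nn

M, N, STATE_DIM = 5, 10, 75  # STATE_DIM follows the earlier state sketch

class GlobalNet(nn.Module):
    """Shared trunk + M softmax policy branches + one linear value branch."""
    def __init__(self, state_dim=STATE_DIM, m=M, n=N):
        super().__init__()
        # shared sub-network: three fully connected layers
        self.shared = nn.Sequential(
            nn.Linear(state_dim, 128), nn.ReLU(),
            nn.Linear(128, 256), nn.ReLU(),
            nn.Linear(256, 128), nn.ReLU(),
        )
        # M softmax output branch sub-networks: one policy per subtask
        self.actor_branches = nn.ModuleList([
            nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, n))
            for _ in range(m)
        ])
        # one linear output branch sub-network: the value function V(s_t)
        self.critic_branch = nn.Sequential(
            nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 1)
        )

    def forward(self, state):
        h = self.shared(state)
        logits = [branch(h) for branch in self.actor_branches]  # M tensors of shape (batch, N)
        value = self.critic_branch(h).squeeze(-1)               # (batch,)
        return logits, value

net = GlobalNet()
logits, value = net(torch.randn(1, STATE_DIM))
# sample one node per subtask from the softmax distributions
actions = [torch.distributions.Categorical(logits=l).sample() for l in logits]
print([a.item() for a in actions], value.item())
```

The softmax itself is applied implicitly by the Categorical distribution here (and by log_softmax in the loss sketch further below); returning raw logits keeps the computation numerically stable.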
[0051] (3) Set the loss function of the Asynchronous Advantage Actor-Critic algorithm as follows:
[0052] L(θ_1, ..., θ_M, θ_v) = Σ_{i=1..M} L_actor(θ_i) + L_critic(θ_v)
[0053] where L_actor(θ_i) is the loss function of the i-th branch sub-network, calculated as follows:
[0054] L_actor(θ_i) = log π_i(a_{t,i}|s_t; θ_i) · (R_t - V(s_t; θ_v)) + β·H(π_i(s_t; θ_i))
[0055] where π_i(a_{t,i}|s_t; θ_i) is the probability that the i-th sub-network outputs action a_{t,i}, π_i(s_t; θ_i) is the probability distribution over the actions output by that sub-network, i.e. the distribution over selecting one of the N nodes to perform the task, H(π_i(s_t; θ_i)) is the entropy of this probability distribution, and the parameter β controls the strength of the entropy regularization term; H(π_i(s_t; θ_i)) is calculated as:
[0056] H(π_i(s_t; θ_i)) = - Σ_{j=1..N} π_i(j|s_t; θ_i) · log π_i(j|s_t; θ_i)
[0057] where π_i(j|s_t; θ_i) is the probability that sub-network i selects action j at time t, that is, the probability of selecting node j to perform subtask i.
[0058] L_critic(θ_v) is the loss function of the shared Critic network, calculated as follows:
[0059] L_critic(θ_v) = (R_t - V(s_t; θ_v))²
[0060] where R_t represents the cumulative discounted reward, calculated as:
[0061] R_t = r_t + γ·r_{t+1} + γ²·r_{t+2} + ... = Σ_{k≥0} γ^k · r_{t+k}
[0062] The parameter γ∈[0, 1] is a discount factor; in the embodiment, β is set to 0.001, and γ is set to 0.9.
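The following sketch (continuing the PyTorch example above; the helper names and the use of a simple discounted return are assumptions) shows how the cumulative reward, the per-branch actor loss with entropy regularization and the critic loss of step (3) could be computed. Because the patent's L_actor is a quantity to be maximized, the code minimizes its negative:

```python
import torch
import torch.nn.functional as F

BETA, GAMMA = 0.001, 0.9  # entropy weight and discount factor from the embodiment

def cumulative_reward(rewards):
    """Discounted cumulative reward R_t from the rewards r_t, r_t+1, ..."""
    R = 0.0
    for r in reversed(rewards):
        R = r + GAMMA * R
    return torch.tensor([R])

def actor_loss(logits_i, action_i, R_t, value):
    """Branch-i loss: log pi_i(a_t,i|s_t)(R_t - V(s_t)) + beta * H(pi_i(s_t)),
    negated so that gradient descent maximizes it."""
    log_probs = F.log_softmax(logits_i, dim=-1)
    probs = log_probs.exp()
    advantage = (R_t - value).detach()           # the critic is not trained through this term
    entropy = -(probs * log_probs).sum(dim=-1)   # H(pi_i(s_t; theta_i))
    chosen = log_probs.gather(-1, action_i.unsqueeze(-1)).squeeze(-1)
    return -(chosen * advantage + BETA * entropy).mean()

def critic_loss(R_t, value):
    """L_critic(theta_v) = (R_t - V(s_t; theta_v))^2."""
    return ((R_t - value) ** 2).mean()
```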
[0063] (4) Using the collected and observed parallel task scheduling data, train the aforementioned Asynchronous Advantage Actor-Critic algorithm network, and use the trained network for parallel task scheduling after the algorithm converges.
[0064] In step (2.2), in the shared subnetwork, the first fully connected layer consists of 128 nodes, the second fully connected layer consists of 256 nodes, and the third fully connected layer consists of 128 nodes.
[0065] In step (2.2), the fully connected layer in the softmax output branch sub-network consists of 64 nodes; the fully connected layer in the linear output branch sub-network consists of 64 nodes.
[0066] A data tuple {s_t, a_t, r_t, s_{t+1}} is used only to train the parameters of the Critic network and of the Actor networks of the subtasks involved in this scheduling decision, rather than all the parameters of the global network, where s_t denotes the state of the job scheduling problem at time t, a_t is the decision action taken in state s_t, r_t is the reward received for action a_t, and s_{t+1} denotes the state of the job scheduling problem at time t+1.
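Continuing the sketches above (and therefore not self-contained patent code), the selective update described in [0066] can be expressed by summing the critic loss with the actor losses of only those branches whose subtasks actually exist in the current job:

```python
import torch

def update(net, optimizer, logits, value, actions, R_t, num_real_subtasks):
    """One training step: padded subtask branches receive no gradient."""
    loss = critic_loss(R_t, value)
    for i in range(num_real_subtasks):  # only the branches involved in this scheduling
        loss = loss + actor_loss(logits[i], actions[i], R_t, value)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Example wiring with the GlobalNet outputs sketched earlier; in practice r_t
# is observed from the actual execution of the scheduled job.
optimizer = torch.optim.Adam(net.parameters(), lr=1e-4)
R_t = cumulative_reward([reward(6.5)])
update(net, optimizer, logits, value, actions, R_t, num_real_subtasks=3)
```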
[0067] The same goal is set for all subtasks of a job, namely: find the most suitable server allocation scheme for a job containing multiple subtasks so that the job execution time is the shortest. Therefore no separate reward is defined for the output of each Actor network; the aforementioned reward r_t = T_base - T_job(s_t, a_t) is used to train the parameters of all Actor networks.
[0068] The inventors have carried out a large number of experiments on the proposed method. The experiments show that, in the same network environment, the method of the present invention can effectively schedule network resources, improve network utilization, reduce network congestion and achieve higher throughput.