Action set output method and system based on multi-agent reinforcement learning

A technology of reinforcement learning and action set output, applied in instruments, character and pattern recognition, computer components, etc., to achieve good scalability

Pending Publication Date: 2020-10-30
赵佳
Cites: 0; Cited by: 15

AI Technical Summary

Problems solved by technology

[0012] The technical problem to be solved by the present invention is to overcome the difficulty of outputting action sets accurately and efficiently in a large-scale...

Method used

Figure 1 is a flow chart of the action set output method based on multi-agent reinforcement learning (Embodiment 1); figure 2 is a block diagram of the action set output system (Embodiment 2); figure 3 is a schematic structural diagram of the electronic device (Embodiment 3).


Examples


Embodiment 1

[0040] This embodiment provides an action set output method based on multi-agent reinforcement learning. Through the cooperation of multiple agents organized in a tree structure, the method handles the action set output problem of a large-scale action space; specifically, it can scale to outputting a set of thousands of actions from an action space on the order of tens of millions of actions.
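As a rough, back-of-the-envelope illustration of why a tree helps (not taken from the patent): if each agent only chooses among the b children of its node, one root-to-leaf decision evaluates b·d child scores while addressing up to b^d leaves. The sketch below assumes a complete b-ary tree with d levels of branching and one action per leaf; `tree_capacity`, `scores_per_path` and `min_depth` are hypothetical helper names.

```python
def tree_capacity(branching: int, depth: int) -> int:
    """Leaves of a complete tree with `depth` levels of branching (assumed one action per leaf)."""
    return branching ** depth


def scores_per_path(branching: int, depth: int) -> int:
    """Child scores evaluated when one agent per level picks one of `branching` children."""
    return branching * depth


def min_depth(branching: int, num_actions: int) -> int:
    """Smallest depth whose complete tree has at least `num_actions` leaves."""
    if branching < 2:
        raise ValueError("branching must be at least 2")
    depth = 0
    while branching ** depth < num_actions:
        depth += 1
    return depth


if __name__ == "__main__":
    d = min_depth(12, 10_000_000)  # ten million actions, 12 children per node
    print(d, tree_capacity(12, d), scores_per_path(12, d))  # 7 35831808 84
```

Under these assumptions, a 12-ary tree needs 7 levels to cover ten million actions, and a single root-to-leaf decision scores 84 children rather than ten million leaves.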

[0041] As shown in figure 1, the action set output method based on multi-agent reinforcement learning comprises the following steps:

[0042] Step 101, building a tree-structured model architecture;

[0043] Specifically, in this embodiment a TDM (Tree-based Deep Model) architecture is constructed in the form of a 4-layer, 12-ary tree, and the balanced clustering tree is constructed using the method of TPGR (Tree-based Policy Gradient Recommendation System); the clustering method includes ...
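The clustering method itself is elided above. As a rough stand-in (an assumption, not TPGR's or the patent's procedure), the sketch below builds a balanced b-ary tree over action embeddings by recursively sorting the actions along the coordinate with the largest variance and cutting them into b groups whose sizes differ by at most one. `build_balanced_tree`, `split_evenly`, `collect_leaves` and the embedding format are hypothetical.

```python
from typing import Dict, List, Sequence, Union

Tree = Union[int, List["Tree"]]  # a leaf is an action id; an internal node is a list of subtrees


def split_evenly(items: List[int], parts: int) -> List[List[int]]:
    """Cut `items` into at most `parts` contiguous groups whose sizes differ by at most one."""
    base, extra = divmod(len(items), parts)
    groups, start = [], 0
    for i in range(parts):
        size = base + (1 if i < extra else 0)
        groups.append(items[start:start + size])
        start += size
    return [g for g in groups if g]  # drop empty groups when there are fewer items than parts


def build_balanced_tree(ids: List[int], embeddings: Dict[int, Sequence[float]], branching: int) -> Tree:
    """Recursively partition action ids into a tree with at most `branching` children per node."""
    if branching < 2:
        raise ValueError("branching must be at least 2")
    if len(ids) <= branching:
        return list(ids)  # few enough actions: attach them directly as leaves

    def variance(axis: int) -> float:
        values = [embeddings[i][axis] for i in ids]
        mean = sum(values) / len(values)
        return sum((v - mean) ** 2 for v in values) / len(values)

    # Split along the embedding coordinate that spreads the actions the most.
    axis = max(range(len(embeddings[ids[0]])), key=variance)
    ordered = sorted(ids, key=lambda i: embeddings[i][axis])
    return [build_balanced_tree(g, embeddings, branching) for g in split_evenly(ordered, branching)]


def collect_leaves(tree: Tree) -> List[int]:
    """Return the action ids stored in the leaves, left to right."""
    if isinstance(tree, int):
        return [tree]
    leaves: List[int] = []
    for child in tree:
        leaves.extend(collect_leaves(child))
    return leaves
```

For example, `build_balanced_tree(list(range(30)), {i: (float(i), float(i % 3)) for i in range(30)}, 3)` yields a root with three subtrees of ten actions each. Sorting by a single coordinate keeps the sketch dependency-free; a real balanced clustering would likely use richer splits.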

Embodiment 2

[0070] This embodiment provides an action set output system based on multi-agent reinforcement learning. As shown in figure 2, the system includes a model building module 21, an agent modeling module 22, a reinforcement learning training module 23 and a decision-making module 24.

[0071] The action set output system of this embodiment corresponds to the action set output method of Embodiment 1; accordingly, the model building module 21, the agent modeling module 22, the reinforcement learning training module 23 and the decision-making module 24 execute step 101, step 102, step 103 and step 104 of Embodiment 1, respectively.
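The module-to-step correspondence can be pictured as a thin pipeline. In the hypothetical sketch below (the patent does not specify interfaces), each module is a callable that performs one step and the system chains them; the stage descriptions in the comments assume that steps 101 to 104 match steps S1 to S4 of the abstract.

```python
from dataclasses import dataclass
from typing import Any, Callable, List


@dataclass
class ActionSetOutputSystem:
    """Hypothetical composition of modules 21-24; each field performs one step of Embodiment 1."""
    model_building: Callable[[], Any]                       # module 21, step 101: build the tree architecture
    agent_modeling: Callable[[Any], Any]                    # module 22, step 102: model child nodes as agents
    rl_training: Callable[[Any], Any]                       # module 23, step 103: train agents against the environment
    decision_making: Callable[[Any, List[Any]], List[Any]]  # module 24, step 104: score actions, output the set

    def run(self, candidate_actions: List[Any]) -> List[Any]:
        tree = self.model_building()
        agents = self.agent_modeling(tree)
        model = self.rl_training(agents)
        return self.decision_making(model, candidate_actions)
```

Keeping each module as an injected callable mirrors the one-module-per-step split and lets each stage be replaced or tested in isolation.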

[0072] Specifically, the model building module 21 is used to build a tree-structured model architecture;

[0073] In this embodiment, the model architecture of TDM (Tree-based Deep Model) is specifically co...

Embodiment 3

[0100] The present invention also provides an electronic device. As shown in figure 3, the electronic device may include a memory, a processor, and a computer program stored in the memory and executable on the processor. When the processor executes the computer program, the steps of the action set output method based on multi-agent reinforcement learning in Embodiment 1 are implemented.

[0101] It should be understood that the electronic device shown in figure 3 is merely an example and should not limit the functions or scope of use of the embodiments of the present invention.

[0102] As shown in figure 3, the electronic device 2 may take the form of a general-purpose computing device, for example a server device. Its components may include, but are not limited to, at least one processor 3, at least one memory 4, and a bus 5 connecting the different system components (including the memory 4 and the processor 3).

[0103] The bus 5 ...


Abstract

The invention discloses an action set output method and system based on multi-agent reinforcement learning. The method comprises the following steps: S1, constructing a tree-structured model architecture; S2, modeling each child node in the tree structure constructed in step S1 as an agent, and modeling the multi-agent reinforcement learning system as a hierarchical extended Markov game; S3, letting all agents interact with the environment and performing reinforcement learning training to form an action set output model; and S4, scoring each action in the action space to be processed with the multi-agent reinforcement learning action set output model, and generating a target action set for recommendation. By using multi-agent reinforcement learning to handle the action set decision problem of a large-scale action space, the method achieves good scalability together with more accurate and faster training and inference. In addition, the invention uses the MCTS (Monte Carlo Tree Search) algorithm to increase the amount of information available to the upper-layer agents for decision-making, so that the search is effective and the decisions are more accurate.
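Step S4 can be read in several ways; the sketch below assumes one: each agent assigns a probability to each child of its node, an action's score is the product of the probabilities along its root-to-leaf path, and the target action set is the top K actions by score. The tree format, `score_actions` and `target_action_set` are hypothetical, and the MCTS refinement of the upper-layer agents' decisions is not modeled.

```python
import heapq
from typing import Dict, List, Tuple

# Hypothetical format: internal node -> [(child, probability assigned by that node's agent)].
# Any child that is not itself a key is a leaf, i.e. an action.
Policy = Dict[str, List[Tuple[str, float]]]


def score_actions(policy: Policy, root: str = "root") -> Dict[str, float]:
    """Score every leaf action by the product of child probabilities along its path."""
    scores: Dict[str, float] = {}
    stack = [(root, 1.0)]
    while stack:
        node, prob = stack.pop()
        if node not in policy:  # leaf: an action
            scores[node] = prob
            continue
        for child, p in policy[node]:
            stack.append((child, prob * p))
    return scores


def target_action_set(policy: Policy, k: int, root: str = "root") -> List[str]:
    """Return the k highest-scoring actions, best first."""
    scores = score_actions(policy, root)
    return heapq.nlargest(k, scores, key=scores.get)


if __name__ == "__main__":
    policy = {
        "root": [("A", 0.7), ("B", 0.3)],
        "A": [("a1", 0.6), ("a2", 0.4)],
        "B": [("b1", 0.9), ("b2", 0.1)],
    }
    print(target_action_set(policy, 2))  # ['a1', 'a2']
```

Scoring every leaf matches the wording "scoring each action"; for very large trees, a beam search that expands only the best children at each level would be a cheaper alternative.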

Description

Technical field

[0001] The invention relates to multi-agent reinforcement learning technology, and in particular to an action set output method and system based on multi-agent reinforcement learning, an electronic device and a storage medium.

Background art

[0002] In reinforcement learning, the problem is usually modeled as a Markov decision process $\mathrm{MDP}\langle S, A, R, P, \gamma \rangle$ in which the agent interacts with the environment, where $S$ is the state space, $A$ is the action space, $R$ is the reward function, $P: S \times A \times S \to [0, 1]$ is the state transition probability, $\gamma$ is the discount factor, and $t$ denotes the time step. The policy of the agent is $\pi: S \to A$. At time step $t$ the agent receives the state $s_t$ fed back by the environment and obtains the observation $o_t$; based on $o_t$ it takes an action $a_t$ and applies it to the environment. After receiving $a_t$, the environment feeds back the next state $s_{t+1}$ and the reward $r_{t+1}$ to the agent. The agent...
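The interaction loop in [0002] can be sketched as follows, assuming full observability ($o_t = s_t$), a deterministic toy environment, and the standard discounted return $G_0 = \sum_{k \ge 0} \gamma^k r_{k+1}$, whose expectation the agent seeks to maximize. `rollout`, `discounted_return` and the toy environment are hypothetical.

```python
from typing import Any, Callable, List, Tuple

State = Any
Action = Any


def rollout(
    reset: Callable[[], State],
    step: Callable[[State, Action], Tuple[State, float, bool]],
    policy: Callable[[State], Action],
    max_steps: int = 100,
) -> List[float]:
    """Run one episode: observe s_t, act a_t = pi(s_t), receive (s_{t+1}, r_{t+1}); return the rewards."""
    state = reset()
    rewards: List[float] = []
    for _ in range(max_steps):
        action = policy(state)  # o_t = s_t is assumed (fully observed)
        state, reward, done = step(state, action)
        rewards.append(reward)
        if done:
            break
    return rewards


def discounted_return(rewards: List[float], gamma: float) -> float:
    """G_0 = r_1 + gamma * r_2 + gamma^2 * r_3 + ..., accumulated backwards."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g


if __name__ == "__main__":
    # Toy chain: start at 0, move forward by the action value, reward 1 per step, stop at 3.
    rewards = rollout(lambda: 0, lambda s, a: (s + a, 1.0, s + a >= 3), lambda s: 1)
    print(rewards, discounted_return(rewards, 0.5))  # [1.0, 1.0, 1.0] 1.75
```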


Application Information

IPC(8): G06K9/62
CPC: G06F18/214; G06F18/295
Inventor: 赵佳
Owner: 赵佳