A multi-agent reinforcement learning method and device based on skill discovery and allocation
By using a dynamic skill discovery and allocation method, the problem of behavior homogenization caused by parameter sharing in multi-agent reinforcement learning is solved, thereby improving the agents' collaborative ability and task adaptability in complex scenarios.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- NORTHWESTERN POLYTECHNICAL UNIV
- Filing Date
- 2024-01-09
- Publication Date
- 2026-06-23
AI Technical Summary
In multi-agent reinforcement learning, parameter sharing leads to homogenization of behaviors among agents, hindering adaptability in complex coordination scenarios.
We adopt a skill discovery and allocation-based approach to learn dynamic skills for multiple agents in an unsupervised manner. We use gated recurrent units and two fully connected neural networks to transform observation information and dynamically allocate skills. We combine Gumbel Softmax and hybrid networks to optimize the agents' behavioral strategies and introduce Lipschitz constraints to optimize the latent variables of observation.
It enhances the diversity and adaptability of agent behavior, improves the collaborative ability of multi-agent systems in complex scenarios, and enables them to better achieve task objectives.
Figure CN117828477B_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the field of multi-agent reinforcement learning, and more specifically to a multi-agent reinforcement learning method and apparatus based on skill discovery and allocation.
Background Technology
[0002] With the advancement of machine learning methods in promoting agent-based collaborative behavior, Multi-Agent Reinforcement Learning (MARL) has become a key technology for solving collaborative decision-making and collaborative tasks. In multi-agent systems, collaborative decision-making and task allocation are complex challenges. The rise of MARL addresses the complexities arising from the interactions between agents in practical applications, such as collaborative control and resource allocation. By learning together and optimizing rewards, multi-agents can achieve more efficient collaborative behavior, applicable to fields such as autonomous driving, collaborative robots, and distributed systems. Therefore, as a natural extension of reinforcement learning, MARL provides a powerful tool for solving practical collaborative tasks and decision-making problems.
[0003] Specifically, in single-agent reinforcement learning, each agent optimizes its own task reward, while in cooperative multi-agent reinforcement learning, agents share a common goal, and each agent needs to learn appropriate behaviors to promote effective cooperation and adaptation. Two problems arise in this setting: mutual influence among agents makes training unstable and slow to converge, and training a separate network for every agent incurs excessive computational overhead. To address both, many current methods use parameter sharing. During training, a single neural network receives observation information from multiple agents, and all agents are trained centrally with access to the global state using one of several value decomposition methods. During execution, each agent follows a decentralized policy based on its local observations. This parameter sharing among decentralized executors is considered an effective technique to promote agent cooperation and significantly improve training efficiency. However, while a shared network can effectively promote cooperative behavior, it can easily lead to homogeneous behavior among agents, which may hinder their adaptability in scenarios requiring complex coordination. For example, the scenario shown in Figure 1 is a complex one in the Google Research Football (GRF) environment, in which two players from the same team compete to chase the ball. Because parameters are shared, players on the same team may choose the same action when faced with similar observations; in other words, multiple agents may unintentionally pursue the same ball. As Figure 1 shows, this prevents an effective role distinction from forming between ball-chasing players and positional players, reducing the potential of the cooperating team.
[0004] In multi-agent reinforcement learning, parameter sharing has been widely used to significantly improve training efficiency and promote collaborative behavior among agents. However, parameter sharing can easily lead to homogeneous behavior among agents, which may hinder their adaptability in scenarios requiring complex coordination.
Summary of the Invention
[0005] This invention provides a multi-agent reinforcement learning method and apparatus based on skill discovery and allocation, which can solve the problem of homogenization of behavior among agents due to parameter sharing in the prior art, enhance the diversity of agent behavior, and thus better adapt to task scenarios that require complex coordination.
[0006] By teaching meaningful and dynamic complex skills to multiple agents in an unsupervised manner, the entire skill set encompasses diverse behavioral capabilities, enabling agents to flexibly choose skills based on the scenario. Simultaneously, in downstream multi-agent tasks, by exploring the agents' action and state spaces, agents can select appropriate skills based on observations to improve their ability to adapt to complex scenarios (such as reward-sparse environments), thereby promoting collaborative behavior among agents to maximize team rewards and providing support for agent collaboration strategies in real-world scenarios.
[0007] This invention provides a multi-agent reinforcement learning method based on skill discovery and allocation, comprising:
[0008] The intrinsic reward of the agent in the current time period is obtained based on the agent's actions, observation information, skill vector, observation information and actions of the agent in the previous time period.
[0009] The observation information of the agent in the current time period is converted into the observation latent variables of the agent in the current time period, representing the agent's behavior pattern, through a gated recurrent unit and a two-layer fully connected neural network. The skill probability of each skill included in the skill set is obtained according to the parameterized neural network and the observation latent variables of each agent. The skill to be executed by the agent in the next time period is determined according to Gumbel Softmax and the skill probability.
[0010] The total value function of the agent in the current time period is obtained based on the skills to be executed by the agent in the next time period, the observed latent variables of the agent in the current time period, and the skill policy of the agent in the current time period.
[0011] The loss function of the agent is obtained based on the agent's intrinsic reward in the current time period, the agent's total value function in the current time period, and the agent's total value function in the next time period.
[0012] Preferably, before obtaining the agent's intrinsic reward for the current time period, the method further includes:
[0013] The agent is given different potential abilities and skills. The encoding vector for each skill is obtained by one-hot encoding. The encoding vector is used to obtain a skill vector through an encoding network. A skill set is formed based on multiple different skill vectors.
[0014] The skill vector is obtained using the following formula:
[0015] $z_j = f_e(e_j; \theta_e)$
[0016] The initial weights of the encoding network are updated using the following formula:
[0017]
[0018] Where $e_j$ represents the encoding vector, $\theta_e$ represents the initial weights, $f_e(\cdot;\theta_e)$ represents the encoding network, and $z_i$ and $z_j$ represent any two distinct skill vectors. The quantity being optimized is the regularization target.
[0019] Preferably, the intrinsic reward is represented by the following formula:
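[0020] $r_d(o, u_t, o') = (\phi(o') - \phi(o))^{\top} z_j$
[0021] $\text{s.t. } \forall x, y \in O:\ \|\phi(x) - \phi(y)\| \le L\,\|x - y\|$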
[0022] Where $u_t$ represents the agent's action in time period t, o represents the agent's observation information in time period t, o′ represents the agent's observation information in time period t+1 after taking action $u_t$, φ(o) represents the representation function applied to the observation information o in time period t, φ(o′) represents the representation function applied to the observation information o′ in time period t+1, $z_j$ represents the skill vector in time period t, ||·|| represents the Euclidean distance, L represents the Lipschitz constant, x and y represent any two states in state space O, and φ(x) and φ(y) represent the representation function applied to those two states.
[0023] Preferably, the observed latent variables of the agent in the current time period are determined by the following formula:
[0024]
[0025] The skill probability of each skill included in the skill set is expressed by the following formula:
[0026]
[0027] The skill probability of each skill included in the agent is sampled using the following formula:
[0028]
[0029] Where the embedding is the output encoded by the gated recurrent unit, $\theta_{FCN}$ represents the parameters of the two-layer fully connected network, the observation latent variable of the agent in the current time period is the output of that network, $\theta_w$ represents the parameters of the neural network that produces the skill probabilities, the skill probability is the probability of each skill in the skill set given the agent's observation latent variable in the current time period, and $z_{sample}$ represents a skill obtained by sampling.
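The skill selection step described above can be illustrated with a minimal NumPy sketch. It is not the patented implementation: the linear probability head `W`, the temperature `tau`, the toy sizes, and all function names are assumptions introduced here for illustration only.

```python
import numpy as np

def softmax(x):
    x = x - x.max()                               # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum()

def skill_probs(latent, W):
    """Skill probabilities from the observation latent variable (linear head assumed)."""
    return softmax(latent @ W)

def gumbel_softmax_sample(probs, rng, tau=1.0):
    """Relaxed sample: softmax((log p + g) / tau) with g drawn from Gumbel(0, 1)."""
    g = rng.gumbel(size=probs.shape)
    y = softmax((np.log(probs + 1e-20) + g) / tau)
    return y, int(y.argmax())                     # soft sample and hard skill index

rng = np.random.default_rng(0)
latent = rng.normal(size=4)                       # stand-in for an observation latent variable
W = rng.normal(size=(4, 3))                       # 3 skills in this toy example
p = skill_probs(latent, W)
y, skill = gumbel_softmax_sample(p, rng, tau=0.5)
```

Lower values of `tau` push the soft sample `y` toward a one-hot vector, which is why the relaxation is commonly used when a discrete choice must remain differentiable.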
[0030] Preferably, the step of obtaining the total value function of the agent in the current time period based on the skill to be executed by the agent in the next time period, the observed latent variables of the agent in the current time period, and the skill policy of the agent in the current time period includes:
[0031] Based on the skills that the agent needs to perform in the next time period, the observed latent variables of the agent in the current time period and the skill strategy of the agent in the current time period are used to obtain the evaluation value of the agent in the current time period.
[0032] The evaluation value of each agent in the current time period is used to obtain the total value function of the agents in the current time period through a hybrid network;
[0033] The total value function of the agent in the current time period is expressed by the following formula:
[0034]
[0035] Where $Q_{total}$ represents the total value function of the agent in the current time period, MixNet represents the hybrid network, the per-agent inputs to MixNet are the evaluation values of the first agent through the n-th agent, and $\theta_{mix}$ represents the parameters of the hybrid network.
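As a hedged illustration of how per-agent evaluation values might be combined into a total value, the sketch below follows the QMIX style of value decomposition, in which mixing weights are generated from the global state and forced to be non-negative. The source only states that a hybrid network with parameters $\theta_{mix}$ is used; the state-conditioned hypernetwork, the ReLU activation, and every name and size here are assumptions.

```python
import numpy as np

def mix_net(agent_qs, state, theta_mix):
    """Combine per-agent values into Q_total with non-negative mixing weights."""
    n = agent_qs.shape[0]
    w1 = np.abs(state @ theta_mix["hyper_w1"]).reshape(n, -1)   # (n, embed), all >= 0
    b1 = state @ theta_mix["hyper_b1"]                          # (embed,)
    hidden = np.maximum(0.0, agent_qs @ w1 + b1)                # non-decreasing activation
    w2 = np.abs(state @ theta_mix["hyper_w2"])                  # (embed,), all >= 0
    b2 = state @ theta_mix["hyper_b2"]                          # scalar bias
    return float(hidden @ w2 + b2)

rng = np.random.default_rng(0)
n, s, embed = 3, 5, 4                              # agents, state size, mixer width
theta_mix = {
    "hyper_w1": rng.normal(size=(s, n * embed)),
    "hyper_b1": rng.normal(size=(s, embed)),
    "hyper_w2": rng.normal(size=(s, embed)),
    "hyper_b2": rng.normal(size=s),
}
state = rng.normal(size=s)
agent_qs = rng.normal(size=n)
q_total = mix_net(agent_qs, state, theta_mix)
# Raising one agent's value never lowers the total, because all weights are >= 0.
q_total_up = mix_net(agent_qs + np.array([1.0, 0.0, 0.0]), state, theta_mix)
```

The non-negative weights make the total value monotone in each agent's value, so each agent can act greedily on its own value during decentralized execution.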
[0036] Preferably, the loss function of the agent is expressed by the following formula:
[0037]
[0038] Where the formula gives the loss function, $\beta_d$ represents the weighting coefficient of the skill reward, γ represents the discount factor for future rewards, o represents the agent's observation information in time period t, o′ represents the agent's observation information in time period t+1 after taking action $u_t$, z represents the agent's skill vector for the current time period, and z′ represents the agent's skill vector for the next time period. The two total-value terms represent the total evaluation value of all agents on the training network and on the target network, respectively.
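The loss formula itself is not reproduced in this text, so the following is only a sketch of a conventional temporal-difference loss that uses the quantities defined above. The squared-error form, the optional environment reward `r_env`, the default values of `beta_d` and `gamma`, and the function name are assumptions, not the claimed formula.

```python
import numpy as np

def td_loss(q_total, q_total_next_target, r_d, beta_d=0.1, gamma=0.99, r_env=0.0):
    """Mean squared TD error over a batch of transitions (assumed form)."""
    # The target-network value is treated as a constant when forming the target.
    target = r_env + beta_d * r_d + gamma * q_total_next_target
    return float(np.mean((target - q_total) ** 2))

loss = td_loss(
    q_total=np.array([1.0, 2.0]),                 # training-network totals for (o, z)
    q_total_next_target=np.array([1.0, 1.0]),     # target-network totals for (o', z')
    r_d=np.array([0.5, -0.5]),                    # intrinsic skill rewards
    beta_d=0.2,
    gamma=0.9,
)
```

In this form, `beta_d` controls how strongly the intrinsic skill reward shapes the value estimate relative to the discounted future value.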
[0039] This invention provides a multi-agent reinforcement learning device based on skill discovery and allocation, comprising:
[0040] The skill discovery module is used to obtain the intrinsic reward of the agent in the current time period based on the agent's actions in the current time period, the agent's observation information, the agent's skill vector, the agent's observation information in the previous time period, and the agent's actions in the previous time period.
[0041] The skill allocation module is used to convert the observation information of the agent in the current time period into the observation latent variables of the agent in the current time period, representing the agent's behavior pattern, through a gated recurrent unit and a two-layer fully connected neural network. Based on the parameterized neural network and the observation latent variables of each agent, the module obtains the skill probability of each skill included in the skill set. It then determines the skill to be executed by the agent in the next time period based on Gumbel Softmax and the skill probability. Finally, it obtains the total value function of the agent in the current time period based on the skill to be executed in the next time period, the observation latent variables of the agent in the current time period, and the skill policy of the agent in the current time period.
[0042] The skill learning module is used to obtain the agent's loss function based on the agent's intrinsic reward in the current time period, the agent's total value function in the current time period, and the agent's total value function in the next time period.
[0043] Preferably, it further includes a skill coding module, the skill coding module being used for:
[0044] The agent is given different potential abilities and skills. The encoding vector for each skill is obtained by one-hot encoding. The encoding vector is used to obtain a skill vector through an encoding network. A skill set is formed based on multiple different skill vectors.
[0045] The skill vector is obtained using the following formula:
[0046] $z_j = f_e(e_j; \theta_e)$
[0047] The initial weights of the encoding network are updated using the following formula:
[0048]
[0049] Where $e_j$ represents the encoding vector, $\theta_e$ represents the initial weights, $f_e(\cdot;\theta_e)$ represents the encoding network, and $z_i$ and $z_j$ represent any two distinct skill vectors. The quantity being optimized is the regularization target.
[0050] This invention provides a computer device, which includes a memory and a processor. The memory stores a computer program, and when the computer program is executed by the processor, the processor performs any of the above-described multi-agent reinforcement learning methods based on skill discovery and allocation.
[0051] This invention provides a computer-readable storage medium storing a computer program, which, when executed by a processor, causes the processor to perform any of the above-described multi-agent reinforcement learning methods based on skill discovery and allocation.
[0052] In summary, embodiments of the present invention provide a multi-agent reinforcement learning method and apparatus based on skill discovery and allocation. The method includes: obtaining the intrinsic reward of the agent in the current time period based on the agent's actions, the agent's observation information, the agent's skill vector, and the agent's observation information and actions in the previous time period; converting the agent's observation information into latent variables representing the agent's behavioral patterns using a gated recurrent unit and a two-layer fully connected neural network; obtaining the skill probability of each skill included in the skill set based on the parameterized neural network and the latent variables of each agent; determining the skill to be executed by the agent in the next time period based on Gumbel Softmax and the skill probability; obtaining the total value function of the agent in the current time period based on the skill to be executed by the agent in the next time period, the latent variables of the agent in the current time period, and the skill policy of the agent in the current time period; and obtaining the loss function of the agent based on the intrinsic reward of the agent in the current time period, the total value function of the agent in the current time period, and the total value function of the agent in the next time period. This method optimizes latent observation variables using Lipschitz constraint techniques. To enable agents to quickly adapt to constantly changing observations and select appropriate skills in real time, suitable skills are assigned to each agent based on its current partial latent observation variables, improving the ability of multi-agent reinforcement learning models to learn diverse agent behaviors and to collaborate.
By considering the similarity of agent observations across time periods to adjust the skill switching frequency, agents can dynamically assign appropriate skill combinations based on their local observations, enabling collaborative task completion of complex objectives. After dynamically assigning skills to agents, skill latent variables obtained through skill discovery guide the optimization process of each skill strategy, allowing agents to learn diverse capabilities adaptable to various scenarios and collaborate effectively to maximize team rewards. This method addresses the problem of homogenized agent behavior caused by parameter sharing in existing technologies, enhancing agent behavioral diversity and better adapting to complex coordination-required task scenarios.
Attached Figure Description
[0053] To more clearly illustrate the technical solutions in the embodiments of the present invention or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.
[0054] Figure 1 is a schematic diagram of a multi-agent reinforcement learning method based on skill discovery and allocation provided in an embodiment of the present invention;
[0055] Figure 2 is a schematic diagram of homogeneous behavior in GRF provided in an embodiment of the present invention;
[0056] Figure 3 is a performance comparison diagram of different multi-agent reinforcement learning methods in the StarCraft II environment provided in an embodiment of the present invention;
[0057] Figure 4 is a performance comparison diagram of different multi-agent reinforcement learning methods in a GRF environment provided in an embodiment of the present invention;
[0058] Figure 5 is a schematic diagram of the results of a multi-agent reinforcement learning device based on skill discovery and allocation provided in an embodiment of the present invention.
Detailed Implementation
[0059] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.
[0060] In the embodiments of the present invention, the technical terms involved are as follows:
[0061] 1. One-Hot encoding, also known as one-bit valid encoding, mainly uses an N-bit status register to encode N states. Each state has its own independent register bit, and only one bit is valid at any given time.
[0062] 2. Skill: Refers to a specific sequence of actions or behaviors that an intelligent agent can perform. These skills are the basic units that an intelligent agent uses to interact with its environment and complete specific tasks based on actions. Examples of skills include finding a path in a game, avoiding obstacles, or making buying and selling decisions in a trading system.
[0063] 3. Skill Set: This refers to the set of all skills an agent is capable of performing. In different multi-agent systems, each agent may have a different skill set, which can be adjusted and optimized according to task requirements and environmental conditions. The concept of skill set helps in designing flexible and adaptable agents, enabling them to demonstrate efficient learning and execution capabilities in different scenarios and tasks.
[0064] 4. Encoded Vectors: In machine learning, encoded vectors typically refer to a low- or high-dimensional representation of input data. This vectorized representation helps machine learning models process and learn data more effectively. In multi-agent reinforcement learning, encoded vectors may be used to represent the agent's state, observations, or decisions.
[0065] 5. Skill Vectors: Here, skill vectors are specific encoded vectors that represent one or more specific skills that an agent can perform. These vectors are typically used to guide the agent's behavioral choices and learning process. For example, in a soccer robot's learning model, different skill vectors might represent skills such as kicking, passing, or defending.
[0066] In multi-agent reinforcement learning, parameter sharing has been widely used to significantly improve training efficiency and promote collaborative behavior among agents. However, parameter sharing can easily lead to homogeneous behavior among agents, which may hinder their adaptability in scenarios requiring complex coordination. To enhance the behavioral diversity among agents, a novel dynamic skill discovery and skill allocation method is proposed to achieve more effective adaptation and collaboration in complex tasks. By learning diverse skills without external rewards and then dynamically allocating these skills to agents, the method provided in this invention can promote collaborative behavior among agents and improve their ability to adapt to complex scenarios.
[0067] Figure 1 is a schematic flowchart of a multi-agent reinforcement learning method based on skill discovery and allocation provided in an embodiment of the present invention. As shown in Figure 1, the method provided in this embodiment mainly includes the following steps:
[0068] Step 101: Obtain the intrinsic reward of the agent in the current time period based on the agent's actions, observation information, skill vector, observation information and actions of the agent in the previous time period.
[0069] Step 102: Convert the observation information of the agent in the current time period into the observation latent variables of the agent in the current time period, which represent the agent's behavior pattern, through a gated recurrent unit and a two-layer fully connected neural network; obtain the skill probability of each skill included in the skill set based on the parameterized neural network and the observation latent variables of each agent; determine the skill to be executed by the agent in the next time period based on Gumbel Softmax and the skill probability.
[0070] Step 103: Based on the skills to be executed by the agent in the next time period, the observed latent variables of the agent in the current time period, and the skill policy of the agent in the current time period, obtain the total value function of the agent in the current time period.
[0071] Step 104: Obtain the loss function of the agent based on the agent's intrinsic reward in the current time period, the agent's total value function in the current time period, and the agent's total value function in the next time period.
[0072] It should be noted that the execution subject of the multi-agent reinforcement learning method based on skill discovery and allocation provided in this embodiment of the invention is the terminal.
[0073] Before step 101, since skills are not defined in the environment, it is necessary to first pre-define skills with different potential abilities for the agent, obtain the encoding vector corresponding to each skill through one-hot encoding, pass each encoding vector through the encoding network to obtain a skill vector, and form a skill set from the multiple different skill vectors.
[0074] Specifically, each potential ability is converted into an encoding vector in which exactly one element is 1 and all other elements are 0. Each skill has a unique encoding vector, and the position of the 1 corresponds to the index of that skill in the skill set. The skill set is represented by the following formula:
[0075] $Z = \{ s_j \mid j = 1, 2, \ldots, k \}$ (1)
[0076] Where Z represents the skill set and $s_j$ represents a skill.
[0077] In practical applications, for the skill set shown in formula (1), each skill $s_j$ corresponds to an encoding vector of length k. For the j-th skill $s_j$, the j-th position of the vector is 1 and the other positions are 0. The main advantage of this encoding is that it clearly distinguishes each skill, because the encoding of each skill is unique.
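As a minimal sketch of the one-hot scheme just described, the following Python builds the k encoding vectors with NumPy. The function name `one_hot_skill_codes` and the choice k = 4 are illustrative assumptions; note that Python indexes from 0, whereas the text counts skills from 1.

```python
import numpy as np

def one_hot_skill_codes(k: int) -> np.ndarray:
    """Return a (k, k) matrix whose j-th row is the one-hot code of skill j."""
    return np.eye(k, dtype=np.float32)

codes = one_hot_skill_codes(4)
# Each row contains a single 1, placed at the row's own index, so every code is unique.
```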
[0078] Furthermore, in order to enhance the semantic capacity of the skill vectors rather than simply representing each skill, this embodiment of the invention employs a two-layer fully connected neural network to map the variable vectors obtained through one-hot encoding to the skill vectors, as shown in formula (2):
[0079] $z_j = f_e(e_j; \theta_e)$ (2)
[0080] Where $e_j$ represents the encoding vector, $\theta_e$ represents the randomly initialized weights, $f_e(\cdot;\theta_e)$ represents the encoding network for skills, and $z_j$ represents a skill vector.
[0081] In this embodiment of the invention, the function of the encoding network is to transform the low-dimensional one-hot encoding vector $e_j$ into a higher-dimensional skill vector $z_j$. This transformation makes the representation of skills richer and more distinguishable. The encoding network implements this mapping with the function $f_e$, which takes each k-dimensional one-hot encoding vector as input and outputs the corresponding m-dimensional skill vector.
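The mapping from a k-dimensional one-hot code to an m-dimensional skill vector can be sketched as a small two-layer network. This is an illustration under assumptions: the hidden width, the ReLU activation, the initialization scale, and the names `init_encoder` and `encode_skills` do not come from the source.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_encoder(k: int, hidden: int, m: int) -> dict:
    """Randomly initialise theta_e for a two-layer fully connected network."""
    return {
        "W1": rng.normal(0.0, 0.1, size=(k, hidden)),
        "b1": np.zeros(hidden),
        "W2": rng.normal(0.0, 0.1, size=(hidden, m)),
        "b2": np.zeros(m),
    }

def encode_skills(e: np.ndarray, theta_e: dict) -> np.ndarray:
    """f_e(e; theta_e): map one-hot codes of shape (n, k) to skill vectors (n, m)."""
    h = np.maximum(0.0, e @ theta_e["W1"] + theta_e["b1"])  # ReLU (assumed activation)
    return h @ theta_e["W2"] + theta_e["b2"]

k, m = 4, 16
theta_e = init_encoder(k, hidden=32, m=m)
Z = encode_skills(np.eye(k), theta_e)             # one skill vector z_j per row
```

Because each input is one-hot, the first layer effectively selects one row of `W1` per skill, so the network acts like a learned embedding followed by a further transformation.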
[0082] In this embodiment of the invention, to ensure that the generated skill vectors $z_j$ are sufficiently distinct from one another, a regularization term is introduced. The term uses a distance measure to quantify and maximize the separation between skill vectors, encouraging the vectors generated by the encoding network to lie as far apart as possible in the multidimensional space, thereby promoting their uniqueness and distinguishability. The specific regularization term is shown in Equation (3):
[0083]
[0084] Where $z_i$ and $z_j$ represent any two skill vectors. The parameters $\theta_e$ of the skill encoding network $f_e(\cdot;\theta_e)$ in formula (2) are updated by optimizing the regularization target.
[0085] In practical applications, the update of $\theta_e$ typically uses gradient descent or another optimization algorithm to adjust $\theta_e$ toward the objective of maximizing the differences between skill vectors. The updated encoding network generates skill vectors with higher discriminative power, thereby enhancing overall system performance.
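The idea of pushing skill vectors apart can be sketched as follows. The exact distance used in Equation (3) is not recoverable from this text, so the sketch assumes the mean pairwise Euclidean distance, and for brevity it applies gradient ascent directly to the skill vectors rather than to the encoder weights $\theta_e$. All names and step sizes are hypothetical.

```python
import numpy as np

def separation(Z: np.ndarray) -> float:
    """Mean Euclidean distance over all pairs of distinct skill vectors."""
    diff = Z[:, None, :] - Z[None, :, :]          # (k, k, m) pairwise differences
    dist = np.linalg.norm(diff, axis=-1)          # (k, k) pairwise distances
    iu = np.triu_indices(len(Z), k=1)             # index pairs with i < j
    return float(dist[iu].mean())

def separation_grad(Z: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Gradient of the summed pairwise distance with respect to each z_i."""
    diff = Z[:, None, :] - Z[None, :, :]
    dist = np.linalg.norm(diff, axis=-1, keepdims=True) + eps
    return (diff / dist).sum(axis=1)              # sum over j of (z_i - z_j) / ||z_i - z_j||

rng = np.random.default_rng(0)
Z = rng.normal(size=(4, 16))
before = separation(Z)
for _ in range(10):                               # gradient ascent pushes vectors apart
    Z = Z + 0.1 * separation_grad(Z)
after = separation(Z)
```

In the described method the gradient would instead flow through the encoding network, so that updating $\theta_e$ moves the generated skill vectors apart.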
[0086] In multi-agent tasks, it is crucial that different agents possess a wide variety of skills. For example, in a game like soccer, agents should demonstrate a range of skills such as passing, dribbling, and defense. However, manually defining these skills and designing rewards to guide their acquisition can be impractical. Therefore, embodiments of this invention propose unsupervised skill discovery, enabling agents to autonomously discover these skills. Its basic principle is to explore the environment in an unsupervised manner and use intrinsic rewards to encourage the generation of meaningful skills.
[0087] In step 101, the intrinsic reward of the agent in the current time period is obtained based on the agent's actions in the current time period, the agent's observation information, the agent's skill vector, the agent's observation information in the previous time period, and the agent's actions in the previous time period.
[0088] Specifically, the intrinsic reward is formed by the inner product between the skill vector and the generated trajectory representation. This allows for the assessment of the alignment between changes in the agent's state representation and the skill vector. Here, the trajectory representation mainly refers to the trajectory over a specific time period, such as the trajectory representation between time period t and time period t+1. In this embodiment of the invention, the intrinsic reward can be represented by the following formula (4):
[0089] $r_d(o, u_t, o') = (\phi(o') - \phi(o))^{\top} z_j$ (4)
[0090] Where $u_t$ represents the agent's action in time period t, o represents the agent's observation information in time period t, o′ represents the agent's observation information in time period t+1 after taking action $u_t$, φ(o) represents the representation function applied to the observation information o in time period t, φ(o′) represents the representation function applied to the observation information o′ in time period t+1, and $z_j$ represents a skill vector.
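The description of the intrinsic reward as an inner product between the skill vector and the change in the observation representation can be sketched in a few lines. The function name and the toy two-dimensional values are assumptions for illustration.

```python
import numpy as np

def intrinsic_reward(phi_o: np.ndarray, phi_o_next: np.ndarray, z: np.ndarray) -> float:
    """Inner product of the representation change (phi(o') - phi(o)) with the skill vector z."""
    return float(np.dot(phi_o_next - phi_o, z))

z = np.array([1.0, 0.0])
# A representation change along z is rewarded; a change against z is penalised.
aligned = intrinsic_reward(np.array([0.0, 0.0]), np.array([0.5, 0.0]), z)
opposed = intrinsic_reward(np.array([0.0, 0.0]), np.array([-0.5, 0.0]), z)
```

A positive reward therefore indicates that the transition moved the representation in the direction associated with the assigned skill.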
[0091] In this embodiment of the invention, o represents the observation information of the agent in time period t. The observation information here refers to the target of the agent or the actions of other agents related to the target. φ(·) is the representation function of observation o. That is, the observation information of the agent in time period t is mapped into observation latent variables containing historical trajectory information through a layer of GRU (gated recurrent unit) and two layers of fully connected neural network.
[0092] In this embodiment of the invention, by maximizing intrinsic rewards in the optimization objective, the agent can be encouraged to perform actions that closely align changes in its state representation with its skill vector; that is, intrinsic rewards guide the agent to optimize its strategy during exploration. Based on this, to limit the distance between the state representations learned by the agent to a certain range, this embodiment of the invention introduces a Lipschitz constraint on the state representation function. Specifically, the Lipschitz constraint on the state representation function φ ensures that, for all possible observations x and y, the change in the output of φ is at most a constant multiple of the change in the input; that is, there exists a constant L such that equation (5) holds:
[0093] $\|\phi(x) - \phi(y)\| \le L\,\|x - y\|$, (5)
[0094] Where ||·|| represents the Euclidean distance, L is the Lipschitz constant, and x and y are any two states. This constraint limits the rate of change of the state representation function φ, preventing it from reacting too strongly to small changes in the input state. This constraint can be implemented using various regularization techniques; in this embodiment, it is directly enforced by imposing constraints on the network parameters. Therefore, the final intrinsic reward of this embodiment is expressed by the following formula (6):
[0095] $r_d(o, u_t, o') = (\phi(o') - \phi(o))^{\top} z_j$ (6)
[0096] $\text{s.t. } \forall x, y \in O:\ \|\phi(x) - \phi(y)\| \le L\,\|x - y\|$
[0097] Where $u_t$ represents the agent's action in time period t, o represents the agent's observation information in time period t, o′ represents the agent's observation information in time period t+1 after taking action $u_t$, φ(o) represents the representation function applied to the observation information o in time period t, φ(o′) represents the representation function applied to the observation information o′ in time period t+1, $z_j$ represents the skill vector in time period t, ||·|| represents the Euclidean distance, L represents the Lipschitz constant, x and y represent any two states in state space O, and φ(x) and φ(y) represent the representation function applied to those two states. $r_d(o, u_t, o')$ represents the intrinsic reward $r_d$ formed by the transition from observation o to observation o′ after the agent takes action $u_t$.
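The source states that the Lipschitz constraint is enforced through constraints on the network parameters but does not name the technique. One common option, assumed here purely for illustration, is spectral normalization: each weight matrix is scaled so that its largest singular value is at most 1, which bounds the Lipschitz constant of a ReLU network by L = 1. The layer sizes and function names are hypothetical.

```python
import numpy as np

def spectral_normalize(W: np.ndarray) -> np.ndarray:
    """Scale W so its largest singular value (its Lipschitz constant) is at most 1."""
    sigma = np.linalg.norm(W, ord=2)              # spectral norm of a 2-D matrix
    return W / max(sigma, 1.0)

def phi(x: np.ndarray, weights: list) -> np.ndarray:
    """Representation function: linear layers with ReLU between them."""
    h = x
    for i, W in enumerate(weights):
        h = h @ W
        if i < len(weights) - 1:
            h = np.maximum(0.0, h)                # ReLU is itself 1-Lipschitz
    return h

rng = np.random.default_rng(0)
weights = [spectral_normalize(rng.normal(size=s)) for s in [(8, 32), (32, 16)]]
x, y = rng.normal(size=8), rng.normal(size=8)
lhs = np.linalg.norm(phi(x, weights) - phi(y, weights))
rhs = np.linalg.norm(x - y)                       # bound with L = 1 for this construction
```

Because a composition of 1-Lipschitz maps is 1-Lipschitz, `lhs` cannot exceed `rhs` for any pair of inputs under this construction.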
[0098] After obtaining diverse skills through the skill encoding and skill discovery modules, the skill allocator can assign different skills to agents based on their local observations. However, relying solely on current observation features during skill allocation may lead the skill selector to assign the same skill to agents with similar observation features. In real-world scenarios, agents with similar observations often need to exhibit different behaviors. For example, as shown in Figure 2, in a football match, when two players are close to the ball, the optimal actions might involve one player dribbling the ball while the other moves without the ball to look for an opportunity. Just as individuals in real life are often categorized into different groups based on their behavior and performance, here the action-observation history of a single agent is viewed as a reflection of its behavioral patterns and potential capabilities.
[0099] In step 102, the observation information of the agent in the current time period is converted into the observation latent variables of the agent in the current time period, representing the agent's behavioral pattern, through a gated recurrent unit and a two-layer fully connected neural network. That is, in order to encode the agent's observation latent variables, a shared trajectory encoder composed of a gated recurrent unit (GRU) and two fully connected networks (FCNs) is used, as shown in formula (7):
[0100]
[0101] Where the quantities in formula (7) are: the embedding encoded by the gated recurrent unit; θ_FCN, the parameters of the two-layer fully connected network; the observation latent variable of the agent in the current time period; o_{1:t}, the observation sequence of the agent from the initial time period to the current time period t; h_{t-1}, the hidden state of the previous time period; and θ_GRU, the parameters of the gated recurrent unit.
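As a rough illustration of the shared trajectory encoder, the following sketch runs a standard GRU cell over an observation sequence and passes the final hidden state through two fully connected layers. The class name `TrajectoryEncoder`, the layer sizes, the random initialization, and the ReLU between the two layers are all assumptions made for the example; the source only specifies a gated recurrent unit followed by a two-layer fully connected network.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def affine(W, b, x):
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]

def rand_layer(rng, n_out, n_in):
    W = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return W, [0.0] * n_out

class TrajectoryEncoder:
    """GRU over o_{1:t} followed by two fully connected layers (sketch)."""

    def __init__(self, obs_dim, hidden_dim, latent_dim, seed=0):
        rng = random.Random(seed)
        cat = obs_dim + hidden_dim
        # theta_GRU: update gate, reset gate, and candidate-state weights.
        self.update = rand_layer(rng, hidden_dim, cat)
        self.reset = rand_layer(rng, hidden_dim, cat)
        self.cand = rand_layer(rng, hidden_dim, cat)
        # theta_FCN: two fully connected layers.
        self.fc1 = rand_layer(rng, hidden_dim, hidden_dim)
        self.fc2 = rand_layer(rng, latent_dim, hidden_dim)
        self.hidden_dim = hidden_dim

    def gru_step(self, o_t, h_prev):
        x = o_t + h_prev  # concatenate observation and previous hidden state
        u = [sigmoid(v) for v in affine(*self.update, x)]
        r = [sigmoid(v) for v in affine(*self.reset, x)]
        x_reset = o_t + [ri * hi for ri, hi in zip(r, h_prev)]
        c = [math.tanh(v) for v in affine(*self.cand, x_reset)]
        return [(1.0 - ui) * hi + ui * ci for ui, hi, ci in zip(u, h_prev, c)]

    def encode(self, observations):
        h = [0.0] * self.hidden_dim
        for o_t in observations:
            h = self.gru_step(list(o_t), h)
        hidden = [max(0.0, v) for v in affine(*self.fc1, h)]  # assumed ReLU
        return affine(*self.fc2, hidden), h

encoder = TrajectoryEncoder(obs_dim=3, hidden_dim=4, latent_dim=2)
latent, h_t = encoder.encode([[0.1, 0.2, 0.3], [0.0, -0.1, 0.4]])
```

Because the hidden state carries information from earlier time periods, two agents with the same current observation but different histories receive different latent variables, which is what allows the allocator to give them different skills.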
[0102] In this embodiment of the invention, the aforementioned latent variables represent the action observation history of each agent, and the goal is to assign more suitable skills to the agents based on their historical behavior. The skill probability of each skill included in the skill set can be obtained based on the parameterized neural network and the latent variables of each agent, as shown in formula (8):
[0103]
[0104] Where the quantities in formula (8) are: the skill probability of each skill included in the skill set, given the observation latent variable of the agent in the current time period; θ_w, the parameters of the neural network; and the observation latent variable of the agent in the current time period.
[0105] Furthermore, during training, the skill to be executed by the agent in the next time period is determined based on Gumbel-Softmax and the skill probability, as expressed by formula (9):
[0106]
[0107] Where z_sample represents the sampled skill, that is, the specific skill the agent selects from the skill set given its observation latent variable. Gumbel-Softmax performs this sampling through a transformation of the probability distribution that keeps the operation differentiable, so the allocator can be optimized by gradient descent during training. This is particularly useful when training neural networks, because the probabilities of discrete choices can be modeled while the gradients remain computable. During testing, a greedy method is used: the skill with the highest probability is selected each time.
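The difference between training-time sampling and test-time greedy selection can be sketched as follows. This is only the forward computation of a Gumbel-Softmax relaxation in plain Python. The temperature `tau`, the function names, and the example logits are assumptions for illustration, and a real implementation would run inside an automatic-differentiation framework so that gradients can flow through the relaxed sample.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def gumbel_softmax(logits, tau=1.0, rng=random):
    """Relaxed one-hot sample: softmax((logits + Gumbel noise) / tau)."""
    noisy = []
    for v in logits:
        # Clamp the uniform draw away from 0 and 1 so both logs are finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        g = -math.log(-math.log(u))
        noisy.append((v + g) / tau)
    return softmax(noisy)

def select_skill(logits, training, tau=1.0, rng=random):
    """Return (skill index, weights): sampled in training, greedy in testing."""
    if training:
        weights = gumbel_softmax(logits, tau, rng)
    else:
        weights = softmax(logits)
    index = max(range(len(weights)), key=weights.__getitem__)
    return index, weights

logits = [2.0, 0.5, -1.0]  # hypothetical allocator outputs for three skills
rng = random.Random(0)
train_index, train_weights = select_skill(logits, training=True, tau=0.5, rng=rng)
test_index, test_probs = select_skill(logits, training=False)
```

A lower temperature pushes the relaxed sample closer to a one-hot vector, at the cost of higher-variance gradients.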
[0108] In step 103, based on the skill to be executed by the agent in the next time period, the observed latent variables of the agent in the current time period, and the skill policy of the agent in the current time period, the evaluation value of the agent in the current time period is obtained. The evaluation value of each agent in the current time period is used to obtain the total value function of the agent in the current time period through a hybrid network.
[0109] In this embodiment of the invention, the skill policy of the agent in the current time period can be represented as π(u_i | o_i, z_j), where u_i represents the action of agent i in the current time period, o_i represents the observation information of agent i in the current time period, and z_j represents the skill vector selected by the agent in the current time period.
[0110] Furthermore, in multi-agent systems, the interaction and cooperation between agents are key factors. By taking these interactions into account, a hybrid network can better evaluate the performance of the entire system. Therefore, the individual evaluation values of the agents are combined through a hybrid network (MixNet) to generate a total value function Q_total, as shown in formula (10):
[0111]
[0112] Where the quantities in formula (10) are: Q_total, the total value function of the agent in the previous time period; MixNet, the hybrid network; the evaluation value of the first agent; the evaluation value of the n-th agent; and θ_mix, the parameters of the hybrid network.
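One common way to combine per-agent values is a monotonic mixer in the style of QMIX, in which the mixing weights are forced to be non-negative so that raising any single agent's value never lowers the total. The source does not state which mixing architecture MixNet uses, so the two-layer form, the ELU activation, and the fixed example weights below are assumptions for illustration only.

```python
import math

def elu(x):
    return x if x > 0 else math.exp(x) - 1.0

def mix(agent_qs, W1, b1, w2, b2):
    """Q_total = |w2| . elu(|W1| q + b1) + b2 (assumed monotonic mixer)."""
    hidden = [
        elu(sum(abs(w) * q for w, q in zip(row, agent_qs)) + b)
        for row, b in zip(W1, b1)
    ]
    return sum(abs(w) * h for w, h in zip(w2, hidden)) + b2

# Hypothetical mixer parameters (theta_mix) for two agents.
W1 = [[0.5, -1.0], [2.0, 0.3]]
b1 = [0.0, -1.0]
w2 = [1.0, -0.5]
b2 = 0.1

q_total = mix([1.0, 2.0], W1, b1, w2, b2)         # approximately 3.4
q_total_higher = mix([1.5, 2.0], W1, b1, w2, b2)  # agent 1 improves
```

Taking absolute values of the weights is what guarantees the monotonicity, so each agent can act greedily on its own evaluation value without contradicting the total value.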
[0113] The dynamic skill discovery mechanism ensures that the agent continuously explores and learns diverse skills, while the skill allocation controller guarantees the timely allocation of the most suitable skills in different scenarios. The ultimate optimization objective of the skill learning provided in this embodiment of the invention is the loss function.
[0114] In step 104, the agent's loss function is obtained based on the agent's intrinsic reward for the current time period, the agent's total value function for the current time period, and the agent's total value function for the next time period. That is, an expectation is taken over the difference between, on the one hand, the actual reward r, the skill reward r_d, and the discounted future Q value and, on the other hand, the current Q value. The loss function of the agent is expressed by the following formula (11):
[0115]
[0116] Where the quantities in formula (11) are: the loss function; β_d, the weighting coefficient of the skill reward; γ, the discount factor for future rewards; o, the agent's observation information in time period t; o′, the agent's observation information in time period t+1, after the agent has taken action u_t; z, the agent's skill vector for the current time period; z′, the agent's skill vector for the next time period; the total evaluation value of all agents on the training network; and the total evaluation value of all agents on the target network.
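The description of formula (11) corresponds to a temporal-difference loss in which the skill reward is added to the external reward. Since the formula itself is not reproduced here, the sketch below assumes the squared-error form (r + β_d·r_d + γ·Q_target(o′, z′) − Q(o, z))² averaged over a batch. The dictionary keys and the numerical values are hypothetical.

```python
def td_loss(batch, beta_d, gamma):
    """Mean squared TD error with an intrinsic-reward term (assumed form)."""
    total = 0.0
    for t in batch:
        # Target: external reward + weighted skill reward + discounted
        # total value of the next step from the target network.
        target = t["r"] + beta_d * t["r_d"] + gamma * t["q_target_next"]
        # Error against the current total value from the training network.
        total += (target - t["q_total"]) ** 2
    return total / len(batch)

batch = [
    {"r": 1.0, "r_d": 0.5, "q_total": 2.0, "q_target_next": 3.0},
    {"r": 0.0, "r_d": -0.2, "q_total": 1.0, "q_target_next": 1.0},
]
loss = td_loss(batch, beta_d=0.1, gamma=0.9)
```

With β_d set to zero the expression reduces to an ordinary value-based loss, which makes the role of the coefficient clear: it controls how strongly skill discovery shapes the learned values.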
[0117] In summary, this invention provides a multi-agent reinforcement learning method and apparatus based on skill discovery and allocation. This method optimizes latent observation variables using Lipschitz constraint techniques to enable agents to quickly adapt to constantly changing observations and select appropriate skills in real time. It assigns suitable skills to each agent based on current partial latent observation variables, improving the multi-agent reinforcement learning model's ability to learn and collaborate on diverse agent behaviors. By considering the similarity of agent observations across time periods to adjust the skill switching frequency, agents can dynamically allocate appropriate skill combinations based on their local observations, enabling collaborative task completion of complex objectives. After dynamically allocating skills to agents, the latent skill variables obtained through skill discovery guide the optimization process of each skill strategy, allowing agents to learn diverse capabilities adaptable to various scenarios and collaborate effectively to maximize team rewards. This solves the problem of homogenized behavior among agents due to parameter sharing in existing technologies, enhancing the diversity of agent behavior and better adapting to complex coordination-required task scenarios.
[0118] Figure 3 is a performance comparison diagram of different multi-agent reinforcement learning methods in the StarCraft II environment, as provided in embodiments of the present invention; Figure 4 is a performance comparison diagram of different multi-agent reinforcement learning methods in the GRF environment, as provided in embodiments of the present invention. As shown in Figure 3 and Figure 4, the embodiments of the present invention consistently outperform the baseline methods on six SMAC tasks and two GRF task scenarios. The effectiveness of collaboration is particularly evident in handling the complexity of soccer dynamics and constantly changing field conditions, and in executing offensive and defensive strategies in the StarCraft II environment. These results demonstrate that the long-term skills learned by the agents can effectively solve tasks with complex state and action spaces, which is especially important in reward-sparse scenarios. Successful application across various maps and scenarios indicates potential applicability in real-world multi-agent systems, such as robotics, autonomous vehicles, and complex simulation environments.
[0119] Based on the same inventive concept, this invention provides a multi-agent reinforcement learning device based on skill discovery and allocation. Since the principle by which this device solves the technical problem is similar to that of a multi-agent reinforcement learning method based on skill discovery and allocation, the implementation of this device can refer to the implementation of the method, and the repeated parts will not be described again.
[0120] As shown in Figure 5, the device includes a skill discovery module, a skill allocation module, a skill learning module, and a skill encoding module.
[0121] The skill discovery module 501 is used to obtain the intrinsic reward of the agent in the current time period based on the agent's actions in the current time period, the agent's observation information, the agent's skill vector, the agent's observation information in the previous time period, and the agent's actions in the previous time period.
[0122] The skill allocation module 502 is used to convert the observation information of the agent in the current time period into the observation latent variables of the agent in the current time period, representing the agent's behavior pattern, through a gated recurrent unit and a two-layer fully connected neural network; obtain the skill probability of each skill included in the skill set based on the parameterized neural network and the observation latent variables of each agent; determine the skill to be executed by the agent in the next time period based on Gumbel-Softmax and the skill probability; and obtain the total value function of the agent in the current time period based on the skill to be executed by the agent in the next time period, the observation latent variables of the agent in the current time period, and the skill policy of the agent in the current time period.
[0123] The skill learning module 503 is used to obtain the loss function of the agent based on the agent's intrinsic reward in the current time period, the agent's total value function in the current time period, and the agent's total value function in the next time period.
[0124] Preferably, it further includes a skill coding module 504, the skill coding module 504 being used for:
[0125] The agent is given different potential abilities and skills. The encoding vector for each skill is obtained by one-hot encoding. The encoding vector is used to obtain a skill vector through an encoding network. A skill set is formed based on multiple different skill vectors.
[0126] The skill vector is obtained using the following formula:
[0127] z_j = f_e(e_j; θ_e)
[0128] The initial weights of the encoding network are updated using the following formula:
[0129]
[0130] Where e_j represents the encoding vector; θ_e represents the initial weights; f_e(·; θ_e) represents the encoding network; z_i and z_j represent any two distinct skill vectors; and the remaining term of the update formula indicates the regularization target.
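The skill encoding step z_j = f_e(e_j; θ_e) can be sketched as follows. The encoding network is assumed here to be a single linear layer with hand-picked weights, which is a simplification for illustration. The update formula and its regularization target are not reproduced in this text, so the pairwise-distance helper below is only a hypothetical diagnostic for how well separated the skill vectors are, not the regularizer used by the invention.

```python
import math

def one_hot(j, n_skills):
    """Encoding vector e_j for skill j."""
    return [1.0 if k == j else 0.0 for k in range(n_skills)]

def encode_skill(e_j, theta_e):
    """z_j = f_e(e_j; theta_e), with f_e assumed to be one linear layer."""
    return [sum(w * x for w, x in zip(row, e_j)) for row in theta_e]

def min_pairwise_distance(skills):
    """Smallest Euclidean distance between any two distinct skill vectors."""
    best = float("inf")
    for i in range(len(skills)):
        for j in range(i + 1, len(skills)):
            d = math.sqrt(sum((a - b) ** 2 for a, b in zip(skills[i], skills[j])))
            best = min(best, d)
    return best

# Hypothetical weights: 3 skills mapped to 2-dimensional skill vectors.
theta_e = [[1.0, 0.0, 2.0],
           [0.0, 1.0, 2.0]]
skill_set = [encode_skill(one_hot(j, 3), theta_e) for j in range(3)]
separation = min_pairwise_distance(skill_set)
```

With one-hot inputs, each skill vector is simply one column of the weight matrix, so updating θ_e directly moves the skill vectors relative to one another.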
[0131] It should be understood that the units included in the above-described multi-agent reinforcement learning device based on skill discovery and allocation are merely a logical division based on the functions implemented by the device. In practical applications, the above units can be superimposed or split. Furthermore, the functions implemented by the multi-agent reinforcement learning device based on skill discovery and allocation provided in this embodiment correspond one-to-one with the multi-agent reinforcement learning method based on skill discovery and allocation provided in the above embodiment. The more detailed processing flow implemented by this device has been described in detail in the first embodiment of the method above, and will not be described in detail here.
[0132] Another embodiment of the present invention provides a computer device, the computer device including a processor and a memory; the memory is used to store computer program code, the computer program code including computer instructions; when the processor executes the computer instructions, the computer device executes each step of the multi-agent reinforcement learning method based on skill discovery and allocation in the method flow shown in the above method embodiment.
[0133] Another embodiment of the present invention provides a computer-readable storage medium storing computer instructions that, when executed on a computer device, cause the computer device to perform each step of the multi-agent reinforcement learning method based on skill discovery and allocation in the method flow shown in the above method embodiment.
[0134] Although preferred embodiments of the invention have been described, those skilled in the art, upon learning the basic inventive concept, can make other changes and modifications to these embodiments. Therefore, the appended claims are intended to be interpreted as including both the preferred embodiments and all changes and modifications falling within the scope of the invention.
[0135] Obviously, those skilled in the art can make various modifications and variations to this invention without departing from its spirit and scope. Therefore, if these modifications and variations fall within the scope of the claims of this invention and their equivalents, this invention also intends to include these modifications and variations.
Claims
1. A multi-agent reinforcement learning method based on skill discovery and allocation, characterized by comprising:
obtaining the agent's intrinsic reward in the current time period based on the agent's actions, observation information, and skill vector in the current time period and on the agent's observation information and actions in the previous time period;
converting the observation information of the agent in the current time period into the observation latent variables of the agent in the current time period, representing the agent's behavior pattern, through a gated recurrent unit and a two-layer fully connected neural network; obtaining the skill probability of each skill included in the skill set according to the parameterized neural network and the observation latent variables of each agent; determining the skill to be executed by the agent in the next time period according to Gumbel-Softmax and the skill probability;
obtaining the total value function of the agent in the current time period based on the skill to be executed by the agent in the next time period, the observation latent variables of the agent in the current time period, and the skill policy of the agent in the current time period;
obtaining the loss function of the agent based on the agent's intrinsic reward in the current time period, the agent's total value function in the current time period, and the agent's total value function in the next time period;
the loss function of the agent being expressed by the following formula:
where the quantities in the formula are: the loss function; β_d, the weighting coefficient of the skill reward; γ, the discount factor for future rewards; o, the agent's observation information in time period t; o′, the agent's observation information in time period t+1, after the agent has taken action u_t; z, the agent's skill vector for the current time period; z′, the agent's skill vector for the next time period; the total evaluation value of all agents on the training network; the total evaluation value of all agents on the target network; r, the external reward; and r_d, the intrinsic reward.
2. The method as described in claim 1, characterized in that, before obtaining the agent's intrinsic reward for the current time period, the method further comprises:
giving the agent different potential abilities and skills; obtaining the encoding vector of each skill by one-hot encoding; obtaining a skill vector from the encoding vector through an encoding network; and forming a skill set from multiple different skill vectors;
the skill vector being obtained using the following formula: z_j = f_e(e_j; θ_e);
the initial weights of the encoding network being updated using the following formula:
where e_j represents the encoding vector; θ_e represents the initial weights; f_e(·; θ_e) represents the encoding network; z_i and z_j represent any two distinct skill vectors; and the remaining term of the update formula indicates the regularization target.
3. The method as described in claim 1, characterized in that the intrinsic reward is expressed by the following formula:
where r_d(o, u_t, o′) represents the intrinsic reward; u_t represents the agent's action in time period t; o represents the agent's observation information in time period t; o′ represents the agent's observation information in time period t+1, after the agent has taken action u_t; φ(o) represents the representation function of the agent's observation information o in time period t; φ(o′) represents the representation function of the agent's observation information o′ in time period t+1; z_j represents the skill vector in time period t; ||·|| represents the Euclidean distance; L represents the Lipschitz constant; x and y represent any two states; and φ(x) and φ(y) represent the representation functions of any two states x and y in the state space O.
4. The method as described in claim 1, characterized in that the observation latent variables of the agent in the current time period are determined using the following formula:
the skill probability of each skill included in the skill set is expressed by the following formula:
and the skill probability of each skill included in the agent is sampled using the following formula:
where the quantities in the formulas are: the embedding encoded by the gated recurrent unit; θ_FCN, the parameters of the two-layer fully connected network; the observation latent variables of the agent in the current time period; θ_w, the parameters of the neural network; the skill probability of each skill included in the agent, given the observation latent variables of the agent in the current time period; and z_sample, the skill obtained by sampling.
5. The method as described in claim 1, characterized in that obtaining the total value function of the agent in the current time period based on the skill to be executed by the agent in the next time period, the observation latent variables of the agent in the current time period, and the skill policy of the agent in the current time period comprises:
obtaining the evaluation value of the agent in the current time period based on the skill to be executed by the agent in the next time period, the observation latent variables of the agent in the current time period, and the skill policy of the agent in the current time period;
obtaining the total value function of the agents in the current time period from the evaluation value of each agent in the current time period through a hybrid network;
the total value function of the agent in the current time period being expressed by the following formula:
where the quantities in the formula are: Q_total, the total value function of the agent in the previous time period; MixNet, the hybrid network; the evaluation value of the first agent; the evaluation value of the n-th agent; and θ_mix, the parameters of the hybrid network.
6. A multi-agent reinforcement learning device based on skill discovery and allocation, characterized by comprising:
a skill discovery module, used to obtain the intrinsic reward of the agent in the current time period based on the agent's actions, the agent's observation information, and the agent's skill vector in the current time period and on the agent's observation information and actions in the previous time period;
a skill allocation module, used to convert the observation information of the agent in the current time period into the observation latent variables of the agent in the current time period, representing the agent's behavior pattern, through a gated recurrent unit and a two-layer fully connected neural network; to obtain the skill probability of each skill included in the skill set based on the parameterized neural network and the observation latent variables of each agent; to determine the skill to be executed by the agent in the next time period based on Gumbel-Softmax and the skill probability; and to obtain the total value function of the agent in the current time period based on the skill to be executed by the agent in the next time period, the observation latent variables of the agent in the current time period, and the skill policy of the agent in the current time period;
a skill learning module, used to obtain the agent's loss function based on the agent's intrinsic reward in the current time period, the agent's total value function in the current time period, and the agent's total value function in the next time period;
the loss function of the agent being expressed by the following formula:
where the quantities in the formula are: the loss function; β_d, the weighting coefficient of the skill reward; γ, the discount factor for future rewards; o, the agent's observation information in time period t; o′, the agent's observation information in time period t+1, after the agent has taken action u_t; z, the agent's skill vector for the current time period; z′, the agent's skill vector for the next time period; the total evaluation value of all agents on the training network; the total evaluation value of all agents on the target network; r, the external reward; and r_d, the intrinsic reward.
7. The apparatus as claimed in claim 6, characterized in that it further includes a skill encoding module, which is used for:
giving the agent different potential abilities and skills; obtaining the encoding vector of each skill by one-hot encoding; obtaining a skill vector from the encoding vector through an encoding network; and forming a skill set from multiple different skill vectors;
the skill vector being obtained using the following formula: z_j = f_e(e_j; θ_e);
the initial weights of the encoding network being updated using the following formula:
where e_j represents the encoding vector; θ_e represents the initial weights; f_e(·; θ_e) represents the encoding network; z_i and z_j represent any two distinct skill vectors; and the remaining term of the update formula indicates the regularization target.
8. A computer device, characterized in that the computer device includes a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to perform the multi-agent reinforcement learning method based on skill discovery and allocation as described in any one of claims 1-5.
9. A computer-readable storage medium, characterized in that the storage medium stores a computer program that, when executed by a processor, causes the processor to perform the multi-agent reinforcement learning method based on skill discovery and allocation as described in any one of claims 1-5.