Multi-agent communication information prediction method based on a neural-network ACML model
By using a neural-network-based ACML model to predict multi-agent communication information, the method addresses the problem of a limited communication budget and improves both information utilization efficiency and task completion.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- NORTHEASTERN UNIV AT QINHUANGDAO
- Filing Date
- 2024-08-23
- Publication Date
- 2026-06-19
AI Technical Summary
Existing technologies in multi-agent communication suffer from low information utilization efficiency due to limited communication budgets, which affects policy learning performance.
An ACML model based on neural networks is adopted, comprising an Actor network, a Critic network, an information generation network, an information coordination network, and an information prediction network. The Actor, Critic, information generation, and information coordination networks are trained without limiting the communication budget, and an information prediction network is then established to predict communication information and achieve efficient utilization.
Without limiting the communication budget, it shortens the completion time of multi-agent tasks, increases team rewards, and achieves efficient use of information in a ball environment.
Figure CN119047509B_ABST
Abstract
Description
Technical Field
[0001] This invention belongs to the field of multi-agent communication, and specifically relates to a method for predicting multi-agent communication information in an ACML model based on neural networks.
Background Technology
[0002] Inter-agent communication information processing is an important aspect of the field of multi-agent communication, and it is of great significance for solving communication bandwidth limitations and realizing the efficient use of information.
[0003] Through communication, agents can exchange information to reduce the difficulty of finding good policies. For example, agents exploring different parts of the environment can share observations to mitigate partial observability, and share their intentions so that they can predict each other's actions and handle non-stationarity. However, in the real world, communication budgets are often limited. For example, in wired packet routing systems, links have limited transmission capacity; in wireless IoT, sensors have limited battery capacity. SchedNet considers the case of shared channels and limited bandwidth, selecting only a subset of agents to convey their information based on its importance, thus reducing the total amount of communication information by limiting the number of agents communicating. Gated-ACML uses gating units between each agent and a centralized information coordination module to reduce the amount of communication information transmitted. These methods mitigate the limitation of communication bandwidth by increasing the importance of the information that is communicated, but the insufficient amount of information still has an adverse effect on the agents' policy learning.
Summary of the Invention
[0004] To address the shortcomings of existing technologies, this invention provides a method for predicting multi-agent communication information in the ACML model based on neural networks. This method improves the efficiency of multi-agents in utilizing information obtained through communication when communication budgets are limited, and achieves efficient utilization of communication information in a small ball environment.
[0005] A method for predicting multi-agent communication information in the ACML model based on neural networks includes the following steps:
[0006] S1: Establish a ball environment; the ball environment includes several agents, several fixed obstacles, and a target landmark. The agents perform a simultaneous arrival task, that is, all agents arrive at the target landmark at the same time.
[0007] S2: Define the agent's actions, local observations, and joint rewards for the agent team in the ball environment;
[0008] The actions of each agent include two types of discrete actions. One is the displacement action $a_i \in$ {south, east, north, west, none}, where south, east, north, and west represent accelerations applied to the agent in the south, east, north, and west directions, respectively, and none represents no acceleration applied. The other is the communication action $c_i \in$ {communicate, silent}, where communicate indicates that the agent communicates at this time step, and silent indicates that the agent does not communicate at this time step. The local observation $o_i$ of the i-th agent comprises the agent position, the target landmark position, the communication budget, and the obstacle positions, where $p = [p_{i,x}, p_{i,y}]$ indicates the position of the agent, $p_{i,x}$ represents the x-coordinate of the agent's position, and $p_{i,y}$ represents the y-coordinate of the agent's position; $g = [g_x, g_y]$ indicates the location of the target landmark, $g_x$ represents the x-coordinate of the target landmark's location, and $g_y$ represents the y-coordinate of the target landmark's location; $l$ represents the communication budget; and the remaining elements give the x-coordinate and y-coordinate of each obstacle's location. The number of obstacles is less than the number of agents, and the number of obstacle elements contained in the observation depends on the number of obstacles;
[0009] The joint reward r of the agent team is shown in equations (1) to (4):
[0010] $r = -\lambda \cdot r_1 - r_2$ (1)
[0011]
[0012] Where $r_1$ represents the sum of the distances between the agents and the target landmark, $r_2$ represents the sum of pairwise distances between all agents and the target landmark, $\lambda$ represents the weight coefficient with $\lambda \in (0,1)$, $d(p_i, g)$ represents the distance between the i-th agent and the target landmark, $d(p_j, g)$ represents the distance between the j-th agent and the target landmark, $p_i$ represents the position of the i-th agent, and $p_j$ represents the position of the j-th agent;
[0013] S3: Build the ACML model;
[0014] The ACML model includes several Actor networks, several information generation networks, one Critic network, one information coordination network, and one information prediction network. One agent corresponds to one Actor network and one information generation network.
[0015] The information generation network is used to obtain the local observation $o_i$ of an agent and, based on $o_i$, generate the communication information $m_i$ of this agent. When the agent's communication action is silent, the communication information... When the agent's communication action is communicate, the communication information $m_i$ is a one-dimensional array containing two elements;
[0016] The working process of the information generation network is shown in equation (5):
[0017] $m_i = m(o_i)$ (5)
[0018] Where $m$ represents the information generation network mapping function, $m_i$ represents the communication information of the i-th agent, and $o_i$ represents the local observation of the i-th agent;
[0019] The information prediction network is used to obtain the predicted communication information of each agent based on the communication information $m_i$ generated for each agent by all information generation networks.
[0020] The working process of the information prediction network is shown in equation (6):
[0021]
[0022] Where $m^{hh}$ represents the information prediction network mapping function; its output is the predicted communication information of the i-th agent generated by the information prediction network, where i = 1, 2, ..., N, and N represents the number of agents; $m^{h}$ represents the historical communication information of the agents; and $m_{-i}$ represents the communication information of the agents other than the i-th agent;
[0023] The information coordination network is used to generate global information $M_i$ for each agent based on the predicted communication information generated by the information prediction network for each agent;
[0024] The working process of the information coordination network is shown in equation (7):
[0025]
[0026] Where $M$ represents the information coordination network mapping function, and $M_i$ represents the global information of the i-th agent generated by the information coordination network, where i = 1, 2, ..., N; the network's other input is the predicted communication information of the agents other than the i-th agent;
[0027] The Actor network is used to generate the displacement action $a_i$ of the i-th agent based on its local observation $o_i$ and global information $M_i$;
[0028] The Actor network operates as shown in equation (8):
[0029]
[0030] Where $\mu_{\theta_i}$ represents the mapping function of the Actor network for the i-th agent, and $\theta_i$ represents the parameters of the Actor network for the i-th agent;
[0031] The Critic network is used to generate the agent action value $Q_a$ based on the local observations and displacement actions of all agents;
[0032] The working principle of the Critic network is shown in equation (9):
[0033]
[0034] Where $Q$ represents the mapping function of the Critic network, $\omega$ represents the parameters of the Critic network, $o_{-i}$ is the set of local observations of the agents other than the i-th agent, and $a_{-i}$ is the set of displacement actions of the agents other than the i-th agent;
[0035] The Actor network, Critic network, information coordination network, information prediction network, and information generation network are all deep neural networks (DNNs);
[0036] S4: Without limiting the communication budget, all agents can communicate fully. Train the Actor network, Critic network, information coordination network, and information generation network in the ACML model to obtain the trained Actor network, Critic network, information coordination network, and information generation network.
[0037] S4.1: Initialize the Actor network, Critic network, information coordination network, and information generation network;
[0038] S4.2: Initialize the experience replay buffer D, and set the current iteration count to 0;
[0039] D is an experience replay buffer containing the most recent experience tuples (o, a, r, o′), where $o'_i$ is the local observation of the i-th agent at the next time step after $o_i$, and $o'_{-i}$ is the set of local observations, at the next time step, of the agents other than the i-th agent;
[0040] S4.3: Determine if the current iteration count has reached the maximum iteration count. If yes, execute S4.13; otherwise, execute S4.4.
[0041] S4.4: Initialize the ball environment;
[0042] S4.5: Determine whether the number of displacement steps taken by the agents in the current ball environment has reached the set maximum number of steps. If yes, increment the iteration count by 1 and return to S4.3; otherwise, execute S4.6.
[0043] S4.6: The information generation network corresponding to the i-th agent obtains the local observation $o_i$ of the i-th agent, generates the local information $m_i$, and transmits it to the information coordination network;
[0044] S4.7: Based on the local information $m_i$ acquired from all agents, the information coordination network generates global information $M_i$ for each agent and sends it to the Actor network of the corresponding agent;
[0045] The working process of the information coordination network is shown in equation (10):
[0046] $M_i = M(m_i, m_{-i})$ (10)
[0047] S4.8: The Actor network of the i-th agent obtains the local observation $o_i$ and the global information $M_i$ of the i-th agent, and generates the agent's displacement action $a_i$;
[0048] S4.9: All agents execute the displacement actions generated by their Actor network according to S4.8, and obtain new local observation o′ and joint agent team reward r;
[0049] S4.10: Store the experience (o,a,r,o′) into the experience replay buffer D;
[0050] S4.11: Extract experience from the experience replay buffer D and update the parameters of the Actor network, Critic network, information coordination network, and information generation network;
[0051] The update of the Critic network parameters $\omega$ is shown in equations (11)-(14):
[0052]
[0053]
[0054] Where $L(\omega)$ represents the loss function for updating the parameters of the Critic network; $\mathbb{E}$ represents the expectation; $\delta$ represents the loss term; $\gamma$ represents the discount rate; $Q(o, a; \omega)$ is the value of the agents' displacement action output by the Critic network; $a'$ represents the displacement action at the next time step after displacement action $a$; $\mu_{\theta}(o')$ represents the action function of the Actor network applied to the local observation $o'$ at the next time step after the local observation $o$; $\omega^{-}$ represents the parameters of the target network, which has the same structure as the Critic network but different parameters; $Q(o', a'; \omega^{-})$ represents the value of the displacement action at the next time step after displacement action $a$; $\omega_{new}$ represents the updated parameters of the Critic network; $\tau$ represents a hyperparameter; $\alpha$ represents the step size for gradient descent; $\nabla_{\omega}$ represents the gradient with respect to $\omega$; $\omega^{-}_{new}$ represents the updated parameters of the target network; and $a'_i$ is the displacement action of the i-th agent at the next time step after $a_i$;
[0055] The update of the Actor network parameters $\theta_i$ of the i-th agent is shown in equations (15)-(16):
[0056]
[0057] Where $\nabla_{\theta_i}$ denotes the gradient with respect to $\theta_i$; $J(\theta_i)$ represents the value function; $\mathbb{E}$ denotes the expectation, with $o_i$ sampled from D; $\mu_{\theta_i}(o_i)$ denotes the action function of the Actor network applied to the local observation $o_i$; $\nabla_{a_i}$ denotes the gradient with respect to $a_i$; $\theta_{new}$ denotes the updated Actor network parameters; and $\beta$ is the step size for gradient ascent.
[0058] The parameters $\theta_{pc}$ of the information coordination network and the information generation network are updated using the chain rule, as shown in equations (17)-(18):
[0059]
[0060] Where $\nabla_{\theta_{pc}}$ denotes the gradient with respect to $\theta_{pc}$; $J(\theta_{pc})$ represents the value function; $\mathbb{E}$ represents the expectation; $M_i(m_1, m_2, \ldots, m_N; \theta_{pc})$ represents the global information; $\nabla_{M_i}$ denotes the gradient with respect to $M_i$; $\theta_{pc}^{new}$ denotes the updated parameters; and $\eta$ is the step size of the gradient ascent;
[0061] S4.12: Increment the current time step by 1 and return to S4.5;
[0062] S4.13: Record the parameters of the current Actor network, Critic network, information coordination network, and information generation network to obtain the trained Actor network, Critic network, information coordination network, and information generation network;
[0063] S5: Without limiting the communication budget, use the trained Actor network, Critic network, information coordination network and information generation network to obtain several multi-agent communication records, and then establish an information prediction network dataset with communication budget constraints.
[0064] The specific form of the agent communication record is 2×N columns of data, where any row of data contains the communication information of all agents at the same time step;
[0065] The method for establishing the information prediction network dataset with communication budget constraints is as follows. By default, communication occurs at the first time step of each multi-agent communication record. Communication information is retained at a fixed frequency of n records, where n is the communication budget. A random number k is selected, and time step k is taken as the communication time to be predicted for that multi-agent communication record. Communication information from time k onward in this record is cleared. The processed communication records are used as the input features of the dataset, and the actual communication information at time k is used as the label. This yields the information prediction network dataset with communication budget constraints. The multi-agent communication records in the dataset are denoted $m^{h,b}$, where b is the index of the multi-agent communication record in the dataset;
[0066] S6: Train the information prediction network in the ACML model using the information prediction network dataset on communication budget constraints, and obtain the parameters of the trained information prediction network;
[0067] S7: Apply communication budget constraints, and conduct inter-agent communication at a set fixed frequency. Load the parameters of the trained information prediction network, actor network, critic network, information coordination network, and information generation network into the ACML model, and execute the ACML model to obtain the agent's running trajectory and team reward R.
[0068] The team reward R is calculated using the method shown in equation (19):
[0069]
[0070] Where $Z$ represents the number of time steps in this trajectory, $\gamma^{z-1}$ represents the discount rate $\gamma$ raised to the power of $z-1$, and $R_z$ represents the joint reward of the agent team at the z-th time step.
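The team reward described above can be illustrated with a short Python sketch. It assumes that equation (19) sums the per-step joint rewards weighted by $\gamma^{z-1}$ over z = 1, ..., Z; the function name `team_reward` and the sample values are hypothetical.

```python
def team_reward(joint_rewards, gamma):
    """Discounted sum of per-step joint rewards.

    Assumption: R = sum over z = 1..Z of gamma**(z - 1) * R_z,
    where joint_rewards[z - 1] is the joint reward at time step z.
    """
    total = 0.0
    for z, r_z in enumerate(joint_rewards, start=1):
        total += gamma ** (z - 1) * r_z
    return total


# Hypothetical three-step trajectory with discount rate 0.5.
example_reward = team_reward([-1.0, -0.5, -0.25], gamma=0.5)
```

With a discount rate below 1, rewards collected later in the trajectory contribute less, so trajectories that bring the agents to the landmark earlier score higher.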
[0071] The beneficial effects of the proposed method for predicting inter-agent communication information in an ACML model based on neural networks are as follows:
[0072] 1. This invention establishes a dataset based on the agent's real communication information without limiting the communication budget, thus ensuring the effectiveness of the information prediction network input data during training.
[0073] 2. By predicting the agents' information at the moments when communication is prohibited, this invention achieves a reduction in multi-agent task completion time and a significant increase in team rewards in the ball environment.
Attached Figure Description
[0074] Figure 1 is a flowchart of the multi-agent communication information prediction method in the ACML model based on neural networks in this embodiment;
[0075] Figure 2 is a schematic diagram of the ball environment in this embodiment;
[0076] Figure 3 is an architecture diagram of the ACML model with the added information prediction network in this embodiment;
[0077] Figure 4 is an architecture diagram of the ACML model without the information prediction network in this embodiment;
[0078] Figure 5 is a flowchart of the training process of the ACML model without the information prediction network in this embodiment;
[0079] Figure 6 is the iteration curve of the information prediction network loss in this embodiment;
[0080] Figure 7 is a trajectory diagram of agents with the information prediction network operating in the ball environment in this embodiment.
Detailed Implementation
[0081] To make the technical solution of the present invention clearer, the present invention will be described in detail below with reference to the accompanying drawings and embodiments.
[0082] A method for predicting multi-agent communication information in the ACML model based on neural networks, as shown in Figure 1, includes the following steps:
[0083] S1: Establish a ball environment; the ball environment includes several agents, several fixed obstacles, and a target landmark. The agents perform a simultaneous arrival task, that is, all agents arrive at the target landmark at the same time.
[0084] In this embodiment, taking two agents as an example, the ball environment is established as shown in Figure 2. The initial positions of the agents and the target landmark are random, obstacles are always located near the line connecting one of the agents to the target landmark, and the communication budget represents the remaining number of communications.
[0085] S2: Define the agent's actions, local observations, and joint rewards for the agent team in the ball environment;
[0086] The actions of each agent include two types of discrete actions. One is the displacement action $a_i \in$ {south, east, north, west, none}, where south, east, north, and west represent accelerations applied to the agent in the south, east, north, and west directions, respectively, and none represents no acceleration applied. The other is the communication action $c_i \in$ {communicate, silent}, where communicate indicates that the agent communicates at this time step, and silent indicates that the agent does not communicate at this time step. The local observation $o_i$ of the i-th agent comprises the agent position, the target landmark position, the communication budget, and the obstacle positions, where $p = [p_{i,x}, p_{i,y}]$ indicates the position of the agent, $p_{i,x}$ represents the x-coordinate of the agent's position, and $p_{i,y}$ represents the y-coordinate of the agent's position; $g = [g_x, g_y]$ indicates the location of the target landmark, $g_x$ represents the x-coordinate of the target landmark's location, and $g_y$ represents the y-coordinate of the target landmark's location; $l$ represents the communication budget; and the remaining elements give the x-coordinate and y-coordinate of each obstacle's location. The number of obstacles is less than the number of agents, and the number of obstacle elements contained in the observation depends on the number of obstacles.
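As an illustration of step S2, the following Python sketch encodes the two discrete action sets and assembles a flat local observation. The element order (agent position, target position, budget, then obstacle coordinates) and all names such as `build_observation` are assumptions made for this example, not details given in the text.

```python
from enum import Enum


class Displacement(Enum):
    """Displacement action: acceleration direction, or none."""
    SOUTH = 0
    EAST = 1
    NORTH = 2
    WEST = 3
    NONE = 4


class Communication(Enum):
    """Communication action at the current time step."""
    COMMUNICATE = 0
    SILENT = 1


def build_observation(agent_pos, target_pos, budget, obstacle_positions):
    """Flatten [p, g, l, obstacle positions] into one list of floats.

    The ordering of the elements is an assumption for illustration.
    """
    obs = [agent_pos[0], agent_pos[1], target_pos[0], target_pos[1], float(budget)]
    for ox, oy in obstacle_positions:
        obs.extend([ox, oy])
    return obs


# Two agents in the embodiment imply at most one obstacle.
obs = build_observation((0.1, -0.3), (0.5, 0.5), 5, [(0.3, 0.1)])
```

Under this layout the observation length is 5 plus two entries per obstacle, which matches the statement that the number of obstacle elements depends on the number of obstacles.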
[0087] In this embodiment, the simultaneous arrival task requires the two agents to arrive at the target landmark simultaneously, and the joint reward r of the agent team is shown in equations (1) to (4):
[0088] $r = -\lambda \cdot r_1 - r_2$ (1)
[0089]
[0090]
[0091] Where $r_1$ represents the sum of the distances between the agents and the target landmark, $r_2$ represents the sum of pairwise distances between all agents and the target landmark, $\lambda$ represents the weight coefficient with $\lambda \in (0,1)$, $d(p_i, g)$ represents the distance between the i-th agent and the target landmark, $d(p_j, g)$ represents the distance between the j-th agent and the target landmark, $p_i$ represents the position of the i-th agent, and $p_j$ represents the position of the j-th agent. The team joint reward encourages the agents to reach the target landmark simultaneously through the term $r_2$ and the adjustment of $\lambda$.
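A minimal Python sketch of the joint reward in equation (1) follows. Because the exact forms of $r_1$ and $r_2$ are not reproduced in the text, it assumes that $d$ is the Euclidean distance, that $r_1$ sums $d(p_i, g)$ over all agents, and that $r_2$ sums $|d(p_i, g) - d(p_j, g)|$ over all agent pairs; the pairwise term in particular is an assumption chosen to reward simultaneous arrival.

```python
import math
from itertools import combinations


def distance(p, g):
    """Euclidean distance between a position p and the landmark g (assumed metric)."""
    return math.hypot(p[0] - g[0], p[1] - g[1])


def joint_reward(positions, target, lam):
    """r = -lam * r1 - r2 under the assumed forms of r1 and r2."""
    dists = [distance(p, target) for p in positions]
    r1 = sum(dists)                                               # total distance to the landmark
    r2 = sum(abs(di - dj) for di, dj in combinations(dists, 2))   # assumed pairwise mismatch term
    return -lam * r1 - r2


# Both agents are 5 units away, so the pairwise term vanishes.
balanced = joint_reward([(3.0, 4.0), (0.0, 5.0)], (0.0, 0.0), lam=0.5)
# One agent is much closer, so the pairwise term adds a penalty.
unbalanced = joint_reward([(3.0, 4.0), (0.0, 1.0)], (0.0, 0.0), lam=0.5)
```

Under these assumptions, two agents at equal distance from the landmark are penalized only through $r_1$, while unequal distances add a further penalty through $r_2$.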
[0092] S3: Build the ACML model;
[0093] As shown in Figure 3, the ACML model includes several Actor networks, several information generation networks, one Critic network, one information coordination network, and one information prediction network. One agent corresponds to one Actor network and one information generation network.
[0094] The information generation network is used to obtain the local observation $o_i$ of an agent and, based on $o_i$, generate the communication information $m_i$ of this agent. When the agent's communication action is silent, the communication information... When the agent's communication action is communicate, the communication information $m_i$ is a one-dimensional array containing two elements;
[0095] The working process of the information generation network is shown in equation (5):
[0096] $m_i = m(o_i)$ (5)
[0097] Where $m$ represents the information generation network mapping function, $m_i$ represents the communication information of the i-th agent, and $o_i$ represents the local observation of the i-th agent;
[0098] The information prediction network is used to obtain the predicted communication information of each agent based on the communication information $m_i$ generated for each agent by all information generation networks.
[0099] The working process of the information prediction network is shown in equation (6):
[0100]
[0101] Where $m^{hh}$ represents the information prediction network mapping function; its output is the predicted communication information of the i-th agent generated by the information prediction network, where i = 1, 2, ..., N, and N represents the number of agents; $m^{h}$ represents the historical communication information of the agents; and $m_{-i}$ represents the communication information of the agents other than the i-th agent;
[0102] The information coordination network is used to generate global information $M_i$ for each agent based on the predicted communication information generated by the information prediction network for each agent;
[0103] The working process of the information coordination network is shown in equation (7):
[0104]
[0105] Where $M$ represents the information coordination network mapping function, and $M_i$ represents the global information of the i-th agent generated by the information coordination network, where i = 1, 2, ..., N; the network's other input is the predicted communication information of the agents other than the i-th agent;
[0106] The Actor network is used to generate the displacement action $a_i$ of the i-th agent based on its local observation $o_i$ and global information $M_i$;
[0107] The Actor network operates as shown in equation (8):
[0108]
[0109] Where $\mu_{\theta_i}$ represents the mapping function of the Actor network for the i-th agent, and $\theta_i$ represents the parameters of the Actor network for the i-th agent;
[0110] The Critic network is used to generate the agent action value $Q_a$ based on the local observations and displacement actions of all agents;
[0111] The working principle of the Critic network is shown in equation (9):
[0112]
[0113] Where $Q$ represents the mapping function of the Critic network, $\omega$ represents the parameters of the Critic network, $o_{-i}$ is the set of local observations of the agents other than the i-th agent, and $a_{-i}$ is the set of displacement actions of the agents other than the i-th agent;
[0114] The Actor network, Critic network, information coordination network, information prediction network, and information generation network are all deep neural networks (DNNs).
[0115] Each DNN consists of an input layer, two hidden layers, and an output layer connected in sequence, with ReLU as the activation function;
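The DNN structure described above (input layer, two hidden layers, output layer, ReLU activation) can be sketched in plain Python as follows. The hidden width of 16, the input size of 7, the weight initialization, and the absence of an activation on the output layer are all assumptions for illustration; only the two-element output mirrors the stated size of the communication information $m_i$.

```python
import random


def relu(v):
    """Element-wise ReLU."""
    return [max(0.0, x) for x in v]


def linear(weights, bias, v):
    """Dense layer: one output per weight row."""
    return [sum(w * x for w, x in zip(row, v)) + b for row, b in zip(weights, bias)]


def init_layer(n_in, n_out, rng):
    """Small random weights and zero biases (hypothetical initialization)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def init_dnn(n_in, n_hidden, n_out, seed=0):
    """Input -> hidden -> hidden -> output, as three dense layers."""
    rng = random.Random(seed)
    return [init_layer(n_in, n_hidden, rng),
            init_layer(n_hidden, n_hidden, rng),
            init_layer(n_hidden, n_out, rng)]


def forward(params, v):
    (w1, b1), (w2, b2), (w3, b3) = params
    h1 = relu(linear(w1, b1, v))       # first hidden layer
    h2 = relu(linear(w2, b2, h1))      # second hidden layer
    return linear(w3, b3, h2)          # output layer; leaving it linear is an assumption


# Hypothetical information generation network: 7 observation values in, 2-element message out.
message_net = init_dnn(n_in=7, n_hidden=16, n_out=2)
message = forward(message_net, [0.1, -0.3, 0.5, 0.5, 5.0, 0.3, 0.1])
```

The same constructor could in principle be reused for the other four network types with different input and output sizes, since the text gives all of them the same layer layout.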
[0116] S4: Without limiting the communication budget, all agents can communicate fully. Train the Actor network, Critic network, information coordination network, and information generation network in the ACML model to obtain the trained Actor network, Critic network, information coordination network, and information generation network.
[0117] In this embodiment, the ACML model excluding the information prediction network is shown in Figure 4, where the dashed lines represent the flow of communication information. The specific training process is shown in Figure 5:
[0118] S4.1: Initialize the Actor network, Critic network, information coordination network, and information generation network;
[0119] S4.2: Initialize the experience replay buffer D, and set the current iteration count to 0;
[0120] In this embodiment, D is an experience replay buffer containing the most recent experience tuples (o, a, r, o′), where $o'_i$ is the local observation of the i-th agent at the next time step after $o_i$, and $o'_{-i}$ is the set of local observations, at the next time step, of the agents other than the i-th agent;
[0121] S4.3: Determine if the current iteration count has reached the maximum iteration count. If yes, execute S4.13; otherwise, execute S4.4.
[0122] S4.4: Initialize the ball environment;
[0123] In this embodiment, the positions of the agents and the target landmarks are randomly distributed. The positions of obstacles are always located near the line connecting a certain agent to the target landmark, and the number of obstacles is less than the number of agents.
[0124] S4.5: Determine whether the agent's displacement in the current ball environment has reached the set maximum number of steps. If yes, increment the iteration count by 1 and return to S4.3; otherwise, execute S4.6.
[0125] In this embodiment, the maximum number of steps the agent can move is set to 25;
[0126] S4.6: The information generation network corresponding to the i-th agent obtains the local observation $o_i$ of the i-th agent, generates the local information $m_i$, and transmits it to the information coordination network;
[0127] S4.7: Based on the local information $m_i$ acquired from all agents, the information coordination network generates global information $M_i$ for each agent and sends it to the Actor network of the corresponding agent;
[0128] In this embodiment, the information coordination network operates as shown in equation (10):
[0129] $M_i = M(m_i, m_{-i})$ (10)
[0130] S4.8: The Actor network of the i-th agent obtains the local observation $o_i$ and the global information $M_i$ of the i-th agent, and generates the agent's displacement action $a_i$. The number of Actor networks equals the number of agents;
[0131] S4.9: All agents execute the displacement actions generated by their Actor network according to S4.8, and obtain new local observation o′ and joint agent team reward r;
[0132] S4.10: Store the experience (o,a,r,o′) into the experience replay buffer D;
[0133] S4.11: Extract experience from the experience replay buffer D and update the parameters of the Actor network, Critic network, information coordination network, and information generation network;
[0134] In this embodiment, the update of the Critic network parameters $\omega$ is shown in equations (11)-(14):
[0135]
[0136]
[0137] Where $L(\omega)$ represents the loss function for updating the parameters of the Critic network; $\mathbb{E}$ represents the expectation; $\delta$ represents the loss term; $\gamma$ represents the discount rate; $Q(o, a; \omega)$ is the value of the agents' displacement action output by the Critic network; $a'$ represents the displacement action at the next time step after displacement action $a$; $\mu_{\theta}(o')$ represents the action function of the Actor network applied to the local observation $o'$ at the next time step after the local observation $o$; $\omega^{-}$ represents the parameters of the target network. The target network has the same structure as the Critic network but different parameters; its purpose is to mitigate the bias caused by the bootstrapping of the Critic network. $Q(o', a'; \omega^{-})$ represents the value of the displacement action at the next time step after displacement action $a$; $\omega_{new}$ represents the updated parameters of the Critic network; $\tau$ represents the hyperparameter that weights the update of the target network; $\alpha$ represents the step size for gradient descent; $\nabla_{\omega}$ represents the gradient with respect to $\omega$; $\omega^{-}_{new}$ represents the updated parameters of the target network; and $a'_i$ is the displacement action of the i-th agent at the next time step after $a_i$;
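The symbols defined above are consistent with a standard actor-critic update that uses a target network. The following is a sketch under that assumption, with $\tau$ taken to weight the target network update; it is not necessarily the exact form of the Critic update equations referenced in this step.

```latex
\begin{aligned}
L(\omega) &= \mathbb{E}\left[\delta^{2}\right], \\
\delta &= r + \gamma\, Q(o', a'; \omega^{-}) - Q(o, a; \omega), \qquad a' = \mu_{\theta}(o'), \\
\omega_{new} &= \omega - \alpha\, \nabla_{\omega} L(\omega), \\
\omega^{-}_{new} &= \tau\, \omega_{new} + (1 - \tau)\, \omega^{-}.
\end{aligned}
```

In this form, a small $\tau$ makes the target parameters $\omega^{-}$ trail the Critic parameters slowly, which is one common way to stabilize the bootstrapped target.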
[0138] In this embodiment, the update of the Actor network parameters $\theta_i$ of the i-th agent is shown in equations (15)-(16):
[0139]
[0140] Where $\nabla_{\theta_i}$ denotes the gradient with respect to $\theta_i$; $J(\theta_i)$ represents the value function; $\mathbb{E}$ denotes the expectation, with $o_i$ sampled from D; $\mu_{\theta_i}(o_i)$ denotes the action function of the Actor network applied to the local observation $o_i$; $\nabla_{a_i}$ denotes the gradient with respect to $a_i$; $\theta_{new}$ denotes the updated Actor network parameters; and $\beta$ is the step size for gradient ascent.
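The quantities listed above match the ingredients of a deterministic policy gradient step. A sketch under that assumption is given below; the dependence of the Actor network on the global information $M_i$ is suppressed for brevity, and the exact form of the Actor update equations referenced in this step may differ.

```latex
\begin{aligned}
\nabla_{\theta_i} J(\theta_i) &= \mathbb{E}_{o_i \sim D}\left[ \nabla_{\theta_i}\, \mu_{\theta_i}(o_i)\; \nabla_{a_i} Q(o, a; \omega) \Big|_{a_i = \mu_{\theta_i}(o_i)} \right], \\
\theta_{new} &= \theta_i + \beta\, \nabla_{\theta_i} J(\theta_i).
\end{aligned}
```

The plus sign in the second line reflects gradient ascent on the value function, in contrast to the gradient descent used for the Critic loss.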
[0141] In this embodiment, the parameters $\theta_{pc}$ of the information coordination network and the information generation network are updated using the chain rule, as shown in equations (17)-(18):
[0142]
[0143] Where $\nabla_{\theta_{pc}}$ denotes the gradient with respect to $\theta_{pc}$; $J(\theta_{pc})$ represents the value function; $\mathbb{E}$ represents the expectation; $M_i(m_1, m_2, \ldots, m_N; \theta_{pc})$ represents the global information; $\nabla_{M_i}$ denotes the gradient with respect to $M_i$; $\theta_{pc}^{new}$ denotes the updated parameters; and $\eta$ is the step size of the gradient ascent;
[0144] S4.12: Increment the current time step by 1 and return to S4.5;
[0145] S4.13: Record the parameters of the current Actor network, Critic network, information coordination network, and information generation network to obtain the trained Actor network, Critic network, information coordination network, and information generation network;
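The control flow of steps S4.2 to S4.13 can be summarized in a short Python skeleton. The callables `reset_env`, `step_fn`, and `update_fn` are hypothetical stand-ins for the environment reset, the forward pass of S4.6 to S4.9, and the parameter updates of S4.11; the buffer is left unbounded here, although the text describes it as holding the most recent experience.

```python
def train(max_iterations, max_steps, reset_env, step_fn, update_fn):
    """Nested loop of S4.2-S4.13; all callables are hypothetical placeholders."""
    replay_buffer = []                                   # S4.2: experience replay buffer D
    iteration = 0                                        # S4.2: iteration count starts at 0
    while iteration < max_iterations:                    # S4.3: stop at the maximum iteration count
        obs = reset_env()                                # S4.4: initialize the ball environment
        step = 0
        while step < max_steps:                          # S4.5: per-episode step limit
            action, reward, next_obs = step_fn(obs)      # S4.6-S4.9: messages, actions, environment step
            replay_buffer.append((obs, action, reward, next_obs))  # S4.10: store experience
            update_fn(replay_buffer)                     # S4.11: update the networks from D
            obs = next_obs
            step += 1                                    # S4.12: next time step
        iteration += 1                                   # S4.5: step limit reached, next iteration
    return replay_buffer                                 # S4.13 would also record the trained parameters


# Toy run with stub callables: 3 iterations of 25 steps each.
updates = []
buffer = train(
    max_iterations=3,
    max_steps=25,
    reset_env=lambda: 0,
    step_fn=lambda obs: ("none", -1.0, obs + 1),
    update_fn=lambda buf: updates.append(len(buf)),
)
```

The step limit of 25 in the toy run mirrors the maximum number of steps stated for this embodiment; the iteration count of 3 is arbitrary.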
[0146] S5: Without limiting the communication budget, use the trained Actor network, Critic network, information coordination network and information generation network to obtain several multi-agent communication records, and then establish an information prediction network dataset with communication budget constraints.
[0147] In this embodiment, the specific form of the agent communication record is data in 2×N columns and no more than 25 rows. Any row of data is the communication information of all agents at the same time step. Finally, the trained Actor network, Critic network, information coordination network and information generation network are executed to obtain 10,000 multi-agent communication records.
[0148] In this embodiment, the process of executing the trained Actor network, Critic network, information coordination network, and information generation network is as follows: (1) the information generation network of the i-th agent obtains the local observation $o_i$ of the i-th agent, generates the communication information $m_i$, and transmits it to the information coordination network; (2) the information coordination network generates the global information $M_i$ from all the communication information $m_i$ produced by the information generation networks and sends it to the corresponding agent; (3) the Actor network of the i-th agent obtains the local observation $o_i$ of the i-th agent and the global information $M_i$, and generates the agent's displacement action $a_i$; (4) steps (1)-(3) are repeated until the maximum number of time steps is reached, yielding the trajectory of the agents' communication information $m_1, m_2, \ldots, m_N$;
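The four-step execution loop above can be sketched in Python as follows. This is only an illustration under stated assumptions: the trained networks and the environment are passed in as plain callables (`observe`, `generate`, `coordinate`, `act`, `step` are hypothetical names), and `step` is assumed to return `True` when the task ends early, which would explain why a record has no more than 25 rows.

```python
from typing import Callable, List, Sequence

Vector = List[float]

def collect_record(
    observe: Callable[[int], List[Vector]],
    generate: Sequence[Callable[[Vector], Vector]],
    coordinate: Callable[[List[Vector]], List[Vector]],
    act: Sequence[Callable[[Vector, Vector], int]],
    step: Callable[[List[int]], bool],
    max_steps: int = 25,
) -> List[Vector]:
    """Run one episode with full communication and return the communication record."""
    record: List[Vector] = []
    for t in range(max_steps):
        observations = observe(t)
        # (1) each agent's information generation network maps o_i to m_i
        messages = [g(o) for g, o in zip(generate, observations)]
        # (2) the information coordination network maps all m_i to one M_i per agent
        global_info = coordinate(messages)
        # (3) each Actor network maps (o_i, M_i) to a displacement action a_i
        actions = [a(o, m) for a, o, m in zip(act, observations, global_info)]
        # one row per time step: the messages of all agents, 2 * N values
        record.append([x for m in messages for x in m])
        # (4) repeat until the maximum step count or until the environment reports the end
        if step(actions):
            break
    return record
```

Each row concatenates the two-element messages of all N agents, which matches the 2×N column layout described for the communication records.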
[0149] In this embodiment, each multi-agent communication record is processed to establish the information prediction network dataset regarding communication budget constraints as follows: communication is assumed to occur at the first time step of each multi-agent communication record; $n$ entries of communication information are retained at a fixed frequency, where $n$ is the communication budget; a random number $k$ is generated, $k \in \{2, \ldots, 25\}$, and time step $k$ of the record is taken as the communication time to be predicted; the communication information at and after time $k$ in the record is cleared; the processed record is used as the input feature of the dataset, and the actual communication information at time $k$ is used as the label, yielding the information prediction network dataset regarding communication budget constraints. The multi-agent communication records in the dataset are denoted as $m_{h,b}$, where $b$ is the index of the multi-agent communication record in the dataset, $b \in \{1, \ldots, 10000\}$.
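The record processing described above can be sketched as follows. Several details are assumptions rather than statements from the text: the retained time steps are taken to be evenly spaced from step 1 with stride `max_steps // n` and capped at `n` entries, cleared entries are represented as zeros, and `k` is drawn up to the actual record length because a record may have fewer than 25 rows. The function names are hypothetical.

```python
import random
from typing import List, Tuple

Row = List[float]

def make_sample(record: List[Row], n: int, k: int, max_steps: int = 25) -> Tuple[List[Row], Row]:
    """Build one (input feature, label) pair from a full-communication record.

    k is a 1-based time step; the label is the actual communication at step k.
    """
    if not 2 <= k <= len(record):
        raise ValueError("k must lie between 2 and the record length")
    stride = max(1, max_steps // n)  # assumed meaning of "fixed frequency"
    width = len(record[0])
    features: List[Row] = []
    for t, row in enumerate(record, start=1):
        on_schedule = (t - 1) % stride == 0 and (t - 1) // stride < n
        kept = on_schedule and t < k  # entries at and after step k are cleared
        features.append(list(row) if kept else [0.0] * width)
    return features, list(record[k - 1])

def build_dataset(records: List[List[Row]], n: int, seed: int = 0) -> List[Tuple[List[Row], Row]]:
    """Draw a random k for every record and process it into a training sample."""
    rng = random.Random(seed)
    return [make_sample(r, n, rng.randint(2, len(r))) for r in records]
```

With a budget of 5 and a 25-step record, this schedule keeps steps 1, 6, 11, 16, and 21, and a sample with k = 12 retains only the first three of them.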
[0150] S6: Train the information prediction network in the ACML model using the information prediction network dataset on communication budget constraints, and obtain the parameters of the trained information prediction network;
[0151] In this embodiment, the information prediction network is a DNN comprising an input layer, two hidden layers, and an output layer, with the ReLU activation function. The Adam optimizer is used with a learning rate of 0.01, and each input batch consists of 512 samples. The mean squared error is selected as the loss function. After the maximum number of iterations is reached, the information prediction network parameters and the loss function iteration curve are output. The loss function iteration curve is shown in Figure 6.
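As a rough illustration of the network shape described above, the following Python sketch implements the forward pass of a fully connected network with two ReLU hidden layers and a linear output, plus the mean squared error. The layer widths, the weight initialization, and the flattened input layout are assumptions; the Adam training loop is omitted, and in practice a deep learning framework would be used instead of plain lists.

```python
import random
from typing import List, Tuple

Layer = Tuple[List[List[float]], List[float]]  # (weights, biases)

def build_network(sizes: List[int], seed: int = 0) -> List[Layer]:
    """Create dense layers for the given sizes, e.g. [input, hidden, hidden, output]."""
    rng = random.Random(seed)
    layers = []
    for n_in, n_out in zip(sizes[:-1], sizes[1:]):
        weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
        layers.append((weights, [0.0] * n_out))
    return layers

def dense(x: List[float], layer: Layer, relu: bool) -> List[float]:
    """Affine map followed by an optional ReLU."""
    weights, biases = layer
    out = [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]
    return [max(0.0, v) for v in out] if relu else out

def predict(x: List[float], layers: List[Layer]) -> List[float]:
    """ReLU on every hidden layer, linear output layer."""
    for layer in layers[:-1]:
        x = dense(x, layer, relu=True)
    return dense(x, layers[-1], relu=False)

def mse(pred: List[float], target: List[float]) -> float:
    """Mean squared error between a prediction and its label."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)
```

For example, with N = 3 agents a flattened 25-row record has 25 × 2 × 3 = 150 inputs and the label has 2 × 3 = 6 outputs, so `build_network([150, 64, 64, 6])` gives a network of the described depth (the hidden width 64 is an arbitrary choice).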
[0152] S7: Impose the communication budget constraint, with communication between the agents carried out at a set fixed frequency. Load the parameters of the trained information prediction network, Actor network, Critic network, information coordination network, and information generation network into the ACML model. Execute the ACML model to obtain the agents' running trajectories and the team reward R.
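One way to picture execution under the budget constraint is the sketch below: on scheduled communication steps the agents' real messages are used, and on all other steps the information prediction network supplies a prediction from the communication history. The schedule (evenly spaced from step 1 with stride `max_steps // n`, capped at `n` steps) and the function names are assumptions made for illustration; the text only states that communication occurs at a fixed frequency.

```python
from typing import Callable, List

def is_communication_step(t: int, n: int, max_steps: int = 25) -> bool:
    """True when 1-based step t is one of the n evenly spaced communication steps."""
    stride = max(1, max_steps // n)
    return (t - 1) % stride == 0 and (t - 1) // stride < n

def messages_for_step(
    t: int,
    n: int,
    real_messages: List[float],
    predict: Callable[[List[List[float]]], List[float]],
    history: List[List[float]],
    max_steps: int = 25,
) -> List[float]:
    """Use real communication on scheduled steps, otherwise the predicted communication."""
    if is_communication_step(t, n, max_steps):
        return real_messages
    return predict(history)
```

The messages returned here would then be passed to the information coordination network in place of the full communication used during training.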
[0153] In this embodiment, the team reward R is calculated as shown in equation (19):
[0154] $$R = \sum_{z=1}^{Z} \gamma^{z-1} R_z \tag{19}$$
[0155] Where $Z$ represents the number of time steps of the trajectory, $\gamma^{z-1}$ represents the discount rate $\gamma$ raised to the power $z-1$, and $R_z$ represents the joint reward of the agent team at the z-th time step;
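Read as a discounted sum of the per-step joint rewards, which is how the definitions above are interpreted here, the team reward can be computed with a few lines of Python. The function name and the list-based input are illustrative assumptions.

```python
from typing import List

def team_reward(joint_rewards: List[float], gamma: float) -> float:
    """Sum of gamma**(z-1) times the joint reward at step z, for z = 1..Z.

    joint_rewards[0] is the joint reward at the first time step, so the
    0-based index equals the exponent z - 1.
    """
    return sum(gamma ** index * reward for index, reward in enumerate(joint_rewards))
```

For instance, three steps with a joint reward of 1.0 each and a discount rate of 0.5 give 1 + 0.5 + 0.25 = 1.75.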
[0156] In this embodiment, the multi-agent running trajectories after adding the information prediction network are shown in Figure 7. When r2 is large, agents closer to the target landmark move around while avoiding obstacles, whereas agents farther from the target landmark always move in a straight line.
[0157] Table 1 compares the performance of the ACML model with and without the information prediction module on the multi-agent simultaneous arrival task in the ball environment, under communication budget constraints with communication budgets of 5 and 10, respectively. The data include the team reward and the number of steps required to complete the task, and are averages over ten experiments.
[0158] Table 1 Comparison of results between the present invention and the ACML model without the information prediction module.
[0159]
[0160] Table 1 shows that, with communication budgets of 5 and 10, the multi-agent teams under the ACML model both with and without the information prediction module complete the simultaneous arrival task before the maximum number of steps is reached. After the information prediction module is added, the multi-agent teams perform better in both team reward and the number of steps required to complete the task, and the improvement is more pronounced at the smaller communication budget. This indicates that the information prediction module allows multi-agent teams to make fuller use of the known communication information.
Claims
1. A method for predicting multi-agent communication information under an ACML model based on a neural network, characterized in that it includes the following steps:
S1: Establish a ball environment; the ball environment includes several agents, several fixed obstacles, and a target landmark; the agents perform a simultaneous arrival task, that is, all agents arrive at the target landmark at the same time;
S2: Define the agent's actions, the local observations, and the joint reward of the agent team in the ball environment;
S3: Build the ACML model;
S4: Without limiting the communication budget, so that all agents can communicate fully, train the Actor network, Critic network, information coordination network, and information generation network in the ACML model to obtain the trained Actor network, Critic network, information coordination network, and information generation network;
S5: Without limiting the communication budget, use the trained Actor network, Critic network, information coordination network, and information generation network to obtain several multi-agent communication records, and then establish an information prediction network dataset regarding communication budget constraints; the specific form of a multi-agent communication record is data with $2 \times N$ columns, and any row of data is the communication information of all agents at the same time step; the method for establishing the information prediction network dataset regarding communication budget constraints is as follows: communication is assumed to occur at the first time step of each multi-agent communication record; $n$ entries of communication information are retained at a fixed frequency, where $n$ is the communication budget; a random number $k$ is generated, and time step $k$ in the multi-agent communication record is the communication time to be predicted; the communication information at and after time $k$ is cleared; the processed communication record is used as the input feature of the dataset, and the actual communication information at time $k$ is used as the label, to obtain the information prediction network dataset regarding communication budget constraints; the multi-agent communication records in the dataset are denoted as $m_{h,b}$, where $b$ is the index of the multi-agent communication record in the dataset;
S6: Train the information prediction network in the ACML model using the information prediction network dataset regarding communication budget constraints, and obtain the parameters of the trained information prediction network;
S7: Impose the communication budget constraint and conduct inter-agent communication at a set fixed frequency; load the parameters of the trained information prediction network, Actor network, Critic network, information coordination network, and information generation network into the ACML model, and execute the ACML model to obtain the agents' running trajectories and the team reward.
2. The method for predicting multi-agent communication information under an ACML model based on a neural network according to claim 1, characterized in that the actions of the agent described in S2 include two types of discrete actions. One is the displacement action $a$, which takes one of five values: four represent accelerations applied to the agent toward the south, east, north, and west, respectively, and one indicates that no acceleration is applied. The other is the communication action, which takes one of two values: one indicates that the agent communicates at this time step, and the other indicates that the agent does not communicate at this time step. The local observation $o$ of an agent comprises the position of the agent (its abscissa and ordinate), the position of the target landmark (its abscissa and ordinate), the communication budget, and the positions of the obstacles (their abscissas and ordinates); the number of obstacles is less than the number of agents, and the number of elements the local observation contains depends on the number of obstacles.
3. The method for predicting multi-agent communication information under an ACML model based on a neural network according to claim 2, characterized in that the joint reward of the agent team described in S2 is as shown in equations (1) to (4): (1) (2) (3) (4) where the symbols denote, in order: the sum of the distances between the agents and the target landmark; a sum taken pairwise over the agents' distances to the target landmark; the weight coefficient; the distances between the individual agents and the target landmark; and the positions of the individual agents.
4. The method for predicting multi-agent communication information under an ACML model based on a neural network according to claim 3, characterized in that the ACML model described in S3 includes several Actor networks, several information generation networks, one Critic network, one information coordination network, and one information prediction network; each agent corresponds to one Actor network and one information generation network;
The information generation network is used to obtain the local observation $o_i$ of an agent and, based on the local observation $o_i$, generate the communication information $m_i$ of that agent; the value of the communication information $m_i$ depends on the agent's communication action, and in one of the two cases $m_i$ is a one-dimensional array containing two elements; the working process of the information generation network is shown in equation (5): (5) where the symbols denote, in order: the information generation network mapping function; the communication information $m_i$ of the i-th agent; and the local observation $o_i$ of the i-th agent;
The information prediction network is used to obtain the predicted communication information of each agent from the communication information $m_i$ of each agent generated by all the information generation networks; the working process of the information prediction network is shown in equation (6): (6) where the symbols denote, in order: the information prediction network mapping function; the predicted communication information of the i-th agent generated by the information prediction network; the number of agents $N$; the agents' historical communication information $m_h$; and the communication information of the agents other than the i-th agent;
The information coordination network is used to generate the global information $M_i$ of each agent from the predicted communication information of each agent produced by the information prediction network; the working process of the information coordination network is shown in equation (7): (7) where the symbols denote, in order: the information coordination network mapping function; the global information $M_i$ of the i-th agent generated by the information coordination network; and the predicted communication information of the agents other than the i-th agent;
The Actor network is used to obtain the local observation $o_i$ of the i-th agent and the global information $M_i$ and generate the displacement action $a_i$ of the agent; the working process of the Actor network is shown in equation (8): (8) where the symbols denote, in order: the mapping function of the Actor network of the i-th agent; and the parameters $\theta_i$ of the Actor network of the i-th agent;
The Critic network is used to generate the agent action value from the local observations and displacement actions of all agents; the working process of the Critic network is shown in equation (9): (9) where the symbols denote, in order: the mapping function of the Critic network; the parameters $\omega$ of the Critic network; the set of local observations of the agents other than the i-th agent; and the set of displacement actions of the agents other than the i-th agent;
The Actor network, Critic network, information coordination network, information prediction network, and information generation network are all deep neural networks (DNNs).
5. The method for predicting multi-agent communication information under an ACML model based on a neural network according to claim 4, characterized in that S4 specifically includes:
S4.1: Initialize the Actor network, Critic network, information coordination network, and information generation network;
S4.2: Initialize the experience replay buffer $D$ and set the current iteration count to 0; $D$ is the experience replay buffer containing the most recent experience tuples, in which $o_i'$ is the local observation of the i-th agent at the next time step after $o_i$, and the tuple also holds the set of next-time-step local observations of the agents other than the i-th agent;
S4.3: Determine whether the current iteration count has reached the maximum iteration count; if yes, execute S4.13; otherwise, execute S4.4;
S4.4: Initialize the ball environment;
S4.5: Determine whether the number of displacement steps of the agents in the current ball environment has reached the set maximum number of steps; if yes, increment the iteration count by 1 and return to S4.3; otherwise, execute S4.6;
S4.6: The information generation network corresponding to the i-th agent obtains the local observation $o_i$ of the i-th agent, generates the local information $m_i$, and transmits it to the information coordination network;
S4.7: The information coordination network generates the global information $M_i$ of each agent from the local information $m_i$ of all agents produced by the information generation networks, and sends it to the Actor network of the corresponding agent; the working process of the information coordination network is shown in equation (10): (10)
S4.8: The Actor network of the i-th agent obtains the local observation $o_i$ of the i-th agent and the global information $M_i$, and generates the displacement action $a_i$ of the agent;
S4.9: All agents execute the displacement actions generated by their Actor networks in S4.8, and obtain the new local observations and the joint reward of the agent team;
S4.10: Store the experience in the experience replay buffer $D$;
S4.11: Sample experience from the experience replay buffer $D$ and update the parameters of the Actor network, Critic network, information coordination network, and information generation network;
S4.12: Increment the current time step by 1 and return to S4.5;
S4.13: Record the parameters of the current Actor network, Critic network, information coordination network, and information generation network to obtain the trained Actor network, Critic network, information coordination network, and information generation network.
6. The method for predicting multi-agent communication information under an ACML model based on a neural network according to claim 5, characterized in that in S4.11 the Critic network parameters $\omega$ are updated as shown in equations (11)-(14): (11) (12) (13) (14) where $L(\omega)$ represents the loss function for updating the parameters of the Critic network; $\mathbb{E}$ represents the expectation; $\delta$ represents the loss term; $\gamma$ represents the discount rate; the Critic network outputs the value of the agent's displacement action; $a'$ represents the displacement action at the next time step after displacement action $a$; $\mu_\theta(o')$ represents the action function of the Actor network for the local observation $o'$ at the next time step after the local observation $o$; $\omega^-$ represents the parameters of the target network, which has the same structure as the Critic network but different parameters; the update also uses the value of the displacement action at the next time step after displacement action $a$; $\omega_{new}$ represents the updated parameters of the Critic network; $\tau$ represents a hyperparameter; $\alpha$ represents the step size for gradient descent; $\nabla_\omega$ represents the gradient with respect to $\omega$; $\omega^-_{new}$ represents the updated parameters of the target network; and $a_i'$ is the displacement action of the i-th agent at the next time step after $a_i$;
The Actor network parameters $\theta_i$ of the i-th agent are updated as shown in equations (15)-(16): (15) (16) where $\nabla_{\theta_i}$ denotes the gradient with respect to $\theta_i$; $J(\theta_i)$ denotes the value function; $\mathbb{E}_{o_i \sim D}$ denotes the expectation with $o_i$ sampled from $D$; $\mu_{\theta_i}(o_i)$ denotes the action function of the Actor network for the local observation $o_i$; $\nabla_{a_i}$ denotes the gradient with respect to $a_i$; $\theta_{new}$ denotes the updated Actor network parameters; and $\beta$ is the step size for gradient ascent;
The parameters $\theta_{pc}$ of the information coordination network and the information generation network are updated using the chain rule, as shown in equations (17)-(18): (17) (18) where $\nabla_{\theta_{pc}}$ denotes the gradient with respect to $\theta_{pc}$; $J(\theta_{pc})$ denotes the value function; $\mathbb{E}$ denotes the expectation, taken with respect to $M_i$; $M_i(m_1, m_2, \ldots, m_N; \theta_{pc})$ denotes the global information; the symbol on the left-hand side of the update denotes the updated parameters; and $\eta$ is the step size for gradient ascent.
7. The method for predicting multi-agent communication information under an ACML model based on a neural network according to claim 2, characterized in that the team reward $R$ described in S7 is calculated as shown in equation (19): $$R = \sum_{z=1}^{Z} \gamma^{z-1} R_z \tag{19}$$ where $Z$ denotes the number of time steps of the trajectory, $\gamma^{z-1}$ denotes the discount rate $\gamma$ raised to the power $z-1$, and $R_z$ denotes the joint reward of the agent team at the z-th time step.