A heterogeneous robot collaboration method
By training heterogeneous groups of robots to collaborate using machine learning methods and employing deep reinforcement learning and curriculum learning mechanisms, the problem of time-consuming modeling in traditional methods is solved, enabling efficient collaboration of heterogeneous robots in complex environments.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- SHENZHEN INST OF ARTIFICIAL INTELLIGENCE & ROBOTICS FOR SOC
- Filing Date
- 2023-03-07
- Publication Date
- 2026-06-12
Figure CN116408796B_ABST
Abstract
Description
Technical Field
[0001] This application relates to the field of robotics, specifically to a method for heterogeneous robot collaboration.
Background Technology
[0002] Traditional methods for solving robot planning and control problems struggle to address the challenges posed by heterogeneity and multiple agents, such as the heterogeneous multi-robot encirclement problem. This is primarily because traditional methods require independent modeling for different robots. When robots are heterogeneous, individual robot programming and modeling are necessary, a process that typically consumes a significant amount of time.
[0003] To address these challenges, a solution for rapidly modeling robot planning and control problems is needed.
Summary of the Invention
[0004] This application provides a heterogeneous robot collaboration method, which offers a solution for rapidly modeling robot planning and control problems.
[0005] The first aspect of this application provides a heterogeneous robot collaboration method, including:
[0006] Machine learning methods are used to train heterogeneous groups of robots to cooperate. During the training process, robots in the same group share machine learning training parameters and perform the same tasks. Among the heterogeneous groups of robots, at least two robots meet the following conditions: they belong to the same group of robots and are heterogeneous robots with different performance parameters.
[0007] Based on the first aspect of the embodiments of this application, in one implementation of the first aspect of the embodiments of this application, the heterogeneous robots with different performance parameters specifically include: heterogeneous robots with different structural parameters, model parameters, motion parameters, or sensor parameters.
[0008] Based on the first aspect of the embodiments of this application, in one implementation of the first aspect of the embodiments of this application, the heterogeneous robots with different structural parameters specifically include:
[0009] Heterogeneous robots with legged and wheeled structures respectively.
[0010] Based on the first aspect of the embodiments of this application, one implementation of the first aspect of the embodiments of this application involves using machine learning methods to train heterogeneous groups of robots to collaborate, specifically including:
[0011] Deep reinforcement learning is used to train multiple heterogeneous groups of robots to cooperate.
[0012] Based on the first aspect of the embodiments of this application, one implementation of the first aspect of the embodiments of this application involves training multiple groups of robots to collaborate using a deep reinforcement learning method, specifically including:
[0013] Determine the current learning stage according to the curriculum learning mechanism, and, based on the current learning stage, select at least two groups of robots from the multiple heterogeneous groups of robots as target robots;
[0014] Train the target robots to cooperate using a deep reinforcement learning method;
[0015] Determine whether all of the multiple groups of robots have been added as target robots;
[0016] If not all of the multiple groups of robots have been added as target robots, update the current learning stage according to the curriculum learning mechanism, and, based on the updated current learning stage, select at least one further group of robots from the multiple groups and add it to the target robots;
[0017] Return to the step of training the target robots to cooperate using a deep reinforcement learning method;
[0018] If all of the multiple groups of robots have been added as target robots, the training is considered complete.
[0019] Based on the first aspect of the embodiments of this application, in the first implementation of the first aspect of the embodiments of this application, the heterogeneous multiple groups of robots include a capture robot, an escape robot, an observation robot, and a communication robot.
[0020] Based on the first aspect of the embodiments of this application, in one implementation of the first aspect of the embodiments of this application, the capture robot, the escape robot, and the communication robot are all ground motion robots;
[0021] The observation robot is a flying drone;
[0022] Based on the first aspect or the first implementation of the embodiments of this application, in the second implementation of the first aspect of the embodiments of this application, determining the current learning stage according to the curriculum learning mechanism and selecting at least two groups of robots from the multiple heterogeneous groups of robots as target robots according to the current learning stage specifically includes:
[0023] Determine, according to the curriculum learning mechanism, that the current learning stage is the first learning stage;
[0024] Based on the current learning stage being the first learning stage, select the capture robot and the escape robot from the multiple groups of robots as target robots.
[0025] Based on any one of the first aspect, the first implementation, and the second implementation of the embodiments of this application, in the third implementation of the first aspect of this application, updating the current learning stage according to the curriculum learning mechanism and selecting at least one group of robots from the multiple groups to add as target robots according to the updated current learning stage specifically includes:
[0026] Update the current learning stage to the second learning stage according to the curriculum learning mechanism;
[0027] Based on the updated current learning stage being the second learning stage, select the observation robot from the multiple groups of robots and add it as a target robot.
[0028] Based on the first aspect of the embodiments of this application, and any one of the first to third implementations of the first aspect, in the fourth implementation of the first aspect of the embodiments of this application, updating the current learning stage according to the curriculum learning mechanism and selecting at least one group of robots from the multiple groups to add as target robots according to the updated current learning stage specifically includes:
[0029] Update the current learning stage to the third learning stage according to the curriculum learning mechanism;
[0030] Based on the updated current learning stage being the third learning stage, select the communication robot from the multiple groups of robots and add it as a target robot.
[0031] Based on the first aspect of the embodiments of this application, and any one of the first to fourth implementations of the first aspect, in the fifth implementation of the first aspect of the embodiments of this application, a deep reinforcement learning method is used to train the target robot to cooperate, specifically including:
[0032] Obtain the latent space state variables of the target robot;
[0033] Using a distributed policy network, control the target robot to make decisions in the asymmetric game based on the latent space state variables of that target robot;
[0034] Using a centralized critic network, estimate the value of the current system state based on the latent space state variables of all target robots, where the current system state is determined by the target robots and the environment.
[0035] Based on any one of the first to fifth implementations of the embodiments of this application, in the sixth implementation of the first aspect of the embodiments of this application, the latent space state variables include: self-state variables, perception state variables, and/or communication state variables.
[0036] Based on the first aspect and any one of the first to sixth implementations of the embodiments of this application, in the seventh implementation of the first aspect of this application, the optimization function of the critic network is:
[0037]
[0038] The formula defines the capture robots, the observation robots, and the communication robots, respectively, together with the total number of robots in each group;
[0039] The formula defines the escape robots, together with the total number of escape robots;
[0040] The formula defines a distance threshold between the communication robot and the capture robot, and a distance threshold between the observation robot and the escape robot;
[0041] The set defined in the formula includes the state of each group of robots at any time t.
[0042] Based on the first aspect of the embodiments of this application, and any one of the first to seventh implementations of the first aspect, in the eighth implementation of the first aspect of the embodiments of this application... Defined as:
[0043]
[0044] The `dist(·,·)` function calculates the Euclidean distance.
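For reference, the Euclidean distance computed by `dist(·,·)` can be written as the minimal Python sketch below. The function name mirrors the text; representing robot positions as coordinate tuples is an assumption of this sketch, not something the application specifies.

```python
import math


def dist(p, q):
    """Euclidean distance between two points given as equal-length coordinate tuples."""
    return math.dist(p, q)


# Hypothetical positions: one robot at (0, 0) and another at (3, 4).
print(dist((0.0, 0.0), (3.0, 4.0)))  # 5.0
```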
[0045] Based on the first aspect of the embodiments of this application, and any one of the first to eighth implementations of the first aspect, in the ninth implementation of the first aspect of the embodiments of this application... Discretized .
[0046] Based on any one of the first to ninth implementations of the embodiments of this application, in the tenth implementation of the first aspect of the embodiments of this application... Defined as:
[0047] .
[0048] Based on the first aspect of the embodiments of this application, and any one of the first to tenth implementations of the first aspect, in the eleventh implementation of the first aspect of the embodiments of this application... Defined as:
[0049] .
[0050] Based on the first aspect of the embodiments of this application, and any one of the first to eleventh implementations of the first aspect, in the twelfth implementation of the first aspect of the embodiments of this application... Defined as:
[0051]
[0052] in ;
[0053] Defined as a piecewise function:
[0054] .
[0055] Based on any one of the first to twelfth implementations of the embodiments of this application, in the thirteenth implementation of the first aspect of the embodiments of this application, the reward function of the policy network includes an obstacle avoidance reward for the capture robot:
[0056]
[0057] The set of obstacles is defined in the formula as follows: .
[0058] Based on any one of the first to thirteenth implementations of the embodiments of this application, in the fourteenth implementation of the first aspect of the embodiments of this application, the reward function of the policy network further includes a tracking reward for the capture robot:
[0059] .
[0060] Based on any one of the first to fourteenth implementations of the embodiments of this application, in the fifteenth implementation of the first aspect of the embodiments of this application, the reward function of the policy network further includes an observation robot tracking reward:
[0061] .
[0062] Based on any one of the first to fifteenth implementations of the embodiments of this application, in the sixteenth implementation of the first aspect of the embodiments of this application, the reward function of the policy network further includes a tracking reward for the communication robot:
[0063] .
[0064] As can be seen from the above technical solutions, the embodiments of this application have the following advantages:
[0065] In this embodiment, machine learning methods are used to train heterogeneous groups of robots, enabling them to collaborate. Compared to programming and modeling each group of robots individually, this approach simplifies the modeling process and saves significant time. Using machine learning allows for heterogeneity between different groups of robots performing different tasks, and also allows for heterogeneity within the same group of robots performing the same task, resulting in better compatibility and the ability to handle more complex tasks and environments.
[0066] A second aspect of this application provides a chip system including at least one processor and a communication interface, the communication interface and the at least one processor being interconnected via a line, the at least one processor being used to run a computer program or instructions to perform the method of the first aspect.
[0067] A third aspect of this application provides a computer device, including:
[0068] Central processing unit, memory, input / output interfaces, wired or wireless network interfaces, and power supply;
[0069] The memory can be either temporary or permanent storage.
[0070] The central processing unit is configured to communicate with the memory and execute instructions in the memory to perform the method of the first aspect.
[0071] A fourth aspect of this application provides a computer-readable storage medium including instructions that, when executed on a computer, cause the computer to perform the method of the first aspect.
[0072] A fifth aspect of this application provides a computer program product containing instructions that, when run on a computer, cause the computer to perform the method of the first aspect.
Attached Figure Description
[0073] Figures 1 to 3 are flowcharts of the heterogeneous robot collaboration method according to embodiments of this application;
[0074] Figure 4 is a schematic diagram of the input-output structure of the policy network of the heterogeneous robot collaboration method in the embodiments of this application;
[0075] Figure 5 is a schematic diagram of the encirclement scenario modeling of the heterogeneous robot collaboration method in the embodiments of this application;
[0076] Figure 6 is a schematic diagram of the training framework for the heterogeneous robot collaboration method in the embodiments of this application;
[0077] Figure 7 is a schematic diagram of the communication robot operation in the heterogeneous robot collaboration method of this application embodiment;
[0078] Figure 8 is a communication diagram of the heterogeneous robot collaboration method according to an embodiment of this application;
[0079] Figure 9 is an experimental diagram of the learning stages of the heterogeneous robot collaboration method according to an embodiment of this application;
[0080] Figure 10 is a comparison diagram between a simulation experiment and an actual experiment of the heterogeneous robot collaboration method according to the embodiments of this application;
[0081] Figure 11 is a performance comparison chart of the heterogeneous robot collaboration method and the S2M2 method in the embodiments of this application;
[0082] Figure 12 is a schematic diagram of the structure of a computer device according to an embodiment of this application.
Detailed Implementation
[0083] Compared to traditional algorithms, the advantages of machine learning include: First, learning-based methods can automatically learn and generalize, rather than building mathematical models based on ideal assumptions, to handle more dynamic situations. Second, traditional methods lack the ability to abstract in increasingly complex task environments, a phenomenon known as a lack of representation learning ability. However, machine learning also has challenges that need to be overcome. The coordinated behavior of various robotic agents is difficult to learn together, and the exploration of agents during training is also a significant challenge.
[0084] This application provides a heterogeneous robot collaboration method, including:
[0085] Machine learning methods are used to train heterogeneous groups of robots to cooperate. During the training process, robots in the same group share machine learning training parameters and perform the same tasks. Among the heterogeneous groups of robots, at least two robots meet the following conditions: they belong to the same group of robots and are heterogeneous robots with different performance parameters.
[0086] Each group of robots has a different task, so from a strategy perspective the groups are heterogeneous with respect to one another. From a performance-parameter perspective, different groups may contain robots with different performance parameters or robots with the same performance parameters. For example, two groups of robots may have the same structural and other performance parameters but perform different tasks. Robots within the same group perform the same task, so from a strategy perspective they are consistent within the group; from a performance-parameter perspective, robots within the same group may have different or identical performance parameters. For example, one group consists of three robots with the same structural and other performance parameters; another group consists of two robots with different structural and other performance parameters; and a third group consists of four robots, of which only two have the same performance parameters.
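The grouping condition described above (at least two robots in the same group with different performance parameters) can be illustrated with a small Python sketch. The `Robot` fields, the group names, and the numeric values are hypothetical choices made for illustration only; the application does not prescribe a data structure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Robot:
    name: str
    structure: str         # structural parameter, e.g. "legged" or "wheeled"
    max_speed_mps: float   # motion parameter
    sensor_range_m: float  # sensor parameter

    def performance(self):
        """Tuple of the performance parameters compared in this sketch."""
        return (self.structure, self.max_speed_mps, self.sensor_range_m)


def group_is_internally_heterogeneous(robots):
    """True if at least two robots in the group differ in performance parameters."""
    return len({robot.performance() for robot in robots}) > 1


def satisfies_condition(groups):
    """True if any group contains robots with different performance parameters."""
    return any(group_is_internally_heterogeneous(g) for g in groups.values())


# Hypothetical example: the capture group mixes a legged and a wheeled robot.
groups = {
    "capture": [
        Robot("c1", "legged", 2.0, 5.0),
        Robot("c2", "wheeled", 5.0, 2.0),
    ],
    "escape": [Robot("e1", "wheeled", 5.0, 5.0)],
}
print(satisfies_condition(groups))  # True
```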
[0087] Machine learning methods can be used to build models for simulation training or for real-world practical training. Robots in the same group can share training parameters, including the weights of the neural network, to avoid individual robots performing significantly worse than others in the same group, and to effectively improve the training speed and convergence of machine learning.
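One way to picture parameter sharing within a group is for every member to hold a reference to the same policy object, so that an update driven by any member's experience is immediately visible to all of them. The sketch below assumes a toy linear policy and a plain gradient step; the class names, the learning rate, and the update rule are illustrative assumptions rather than the method defined in this application.

```python
class SharedPolicy:
    """One set of weights used by every robot in a group (toy linear policy)."""

    def __init__(self, n_inputs):
        self.weights = [0.0] * n_inputs

    def act(self, observation):
        # Weighted sum of the observation; stands in for a neural network.
        return sum(w * x for w, x in zip(self.weights, observation))

    def update(self, gradient, lr=0.1):
        # Plain gradient step (assumed for illustration only).
        self.weights = [w - lr * g for w, g in zip(self.weights, gradient)]


class GroupMember:
    def __init__(self, name, policy):
        self.name = name
        self.policy = policy  # a reference to the shared object, not a copy


shared = SharedPolicy(n_inputs=2)
team = [GroupMember("capture_1", shared), GroupMember("capture_2", shared)]

# An update computed from one member's experience changes the policy of all members.
team[0].policy.update(gradient=[-1.0, 0.5])
print(team[1].policy.weights)
```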
[0088] In this embodiment, machine learning methods are used to train heterogeneous groups of robots, enabling them to collaborate. Compared to programming and modeling each group of robots individually, this approach simplifies the modeling process and saves significant time. Using machine learning allows for heterogeneity between different groups of robots performing different tasks, and also allows for heterogeneity within the same group of robots performing the same task, resulting in better compatibility and the ability to handle more complex tasks and environments.
[0089] In one implementation of this application, the heterogeneous robots with different performance parameters specifically include heterogeneous robots with different structural parameters, model parameters, motion parameters, or sensor parameters.
[0090] Structural parameters include the robot's dimensions, the components from which it is assembled, and the way those components are connected. Examples of robots with different structural parameters include: legged and wheeled robots; robots using stepper motors and robots using servo motors; and robots using gear drives and robots using synchronous belt drives.
[0091] Model parameters include the robot's brand, version number, etc. Robots with different model parameters can be robots of different brands, or newer and older versions from the same brand.
[0092] Motion parameters include speed and range. Examples of robots with different motion parameters include robots with maximum speeds of 5 m/s and 2 m/s, and robots with ranges of 5 km and 10 km.
[0093] Sensor parameters include sensor type and sensor detection range. Robots with different sensor parameters include, for example, robots using radar ranging and infrared ranging; and robots with sensor detection ranges of 2 meters and 5 meters.
[0094] In one implementation of this application, the heterogeneous robots with different structural parameters specifically include:
[0095] Heterogeneous robots with legged and wheeled structures, respectively.
[0096] The heterogeneous robots with different structural parameters can be legged robots and wheeled robots. Legged robots, such as robotic dogs, move by alternating their legs. Wheeled robots, such as robotic vehicles, move on four rotating wheels. Legged robots can also be six-legged, eight-legged, etc., without limitation. Wheeled robots can also be six-wheeled, eight-wheeled, etc., without limitation.
[0097] In one implementation of this application, machine learning methods are used to train heterogeneous groups of robots to collaborate, specifically including:
[0098] Deep reinforcement learning is used to train multiple heterogeneous groups of robots to cooperate.
[0099] Deep reinforcement learning enables autonomous learning; the modeling process is relatively simple, the model is highly adaptable, and the training results are good.
[0100] As shown in Figure 1, in one implementation of this application, training multiple heterogeneous groups of robots to cooperate using a deep reinforcement learning method specifically includes:
[0101] 101. Determine the current learning stage according to the curriculum learning mechanism, and, based on the current learning stage, select at least two groups of robots from the multiple heterogeneous groups of robots as target robots;
[0102] Heterogeneous groups of robots have different structures and tasks. A curriculum-based learning mechanism enables a gradual learning process, starting with simpler training and progressively increasing the difficulty. Using this mechanism improves training convergence and reduces training time and computational costs.
[0103] 102. Deep reinforcement learning methods are used to train the target robot to cooperate.
[0104] The target robot is trained using deep reinforcement learning methods. The target robot is an intelligent agent, and the heterogeneous robots, as intelligent agents, have different strategies and behaviors.
[0105] 103. Determine whether all of the multiple groups of robots have been added as target robots;
[0106] After training has progressed to a certain extent, it is determined whether all robots have been added as target robots. The extent of training can be judged by the number of training sessions, the training effect, the training time, or other indicators, without limitation.
[0107] 104. If not all of the multiple groups of robots have been added as target robots, update the current learning stage according to the curriculum learning mechanism, and, based on the updated current learning stage, select at least one further group of robots from the multiple groups and add it to the target robots;
[0108] Return to the step of training the target robots to cooperate using deep reinforcement learning methods (step 102);
[0109] If some robots among the multiple groups have not yet been added as target robots, the current learning stage is updated according to the curriculum learning mechanism, that is, training moves on to the next learning stage. After the current learning stage is updated, at least one group of robots that has not yet participated in training is selected and added to the target robots. This increases the number of target robot groups by at least one and raises the training complexity. Robots that have not participated in training are those that have not been added as target robots.
[0110] After a new learning stage has begun and new target robots have been added, step 102 is performed again to conduct deep reinforcement learning training with more target robots.
[0111] 105. If all of the multiple groups of robots have been added as target robots, the training is considered complete.
[0112] Training is considered complete once all robots in multiple groups have been added as target robots and have undergone deep reinforcement learning training. After training, the trained robots can be used to perform tasks, the trained deep reinforcement learning network can be further optimized by adjusting parameters, or the trained deep reinforcement learning network can be packaged and output as a model; the specifics are not limited. Here, a deep reinforcement learning method under the centralized training and decentralized execution (CTDE) paradigm can be used.
[0113] In this embodiment, deep reinforcement learning is used to train multiple heterogeneous robots, which can effectively cope with complex constraints and improve the computational efficiency of the algorithm. Using a curriculum learning mechanism, the heterogeneous robots are trained in stages, resulting in good training convergence and effectively saving training time and computing power.
[0114] In one implementation of this application, the heterogeneous groups of robots include a capture robot, an escape robot, an observation robot, and a communication robot.
[0115] In one implementation of this application, from a structural perspective, the capture robot, the escape robot, and the communication robot are all ground-based robots, which can adopt wheeled or legged structures, etc.; the observation robot is a flying drone. The observation robot moves through the air, so it has a higher movement speed and a wider observation range.
[0116] From a task perspective, the capture robot's task is to capture the escape robot as quickly as possible, the escape robot's task is to avoid being captured by the capture robot, the observation robot's task is to find and monitor the escape robot, and the communication robot's task is to improve communication between the capture robot and the observation robot.
[0117] As shown in Figure 5, the capture robots possess exploration, tracking, and encirclement capabilities. Each capture robot can acquire information about its surrounding 40 × 40 area (the unit can be "unit length" or decimeters), identifying which game units exist in the vicinity, their locations, and their number. A game unit refers to a robot. When an escape robot is nearby, the capture robot quickly tracks and encircles it. The observation robot, a drone, uses its high altitude and high speed to search a wide area for ground targets and then transmits the search information to the communication robot, which helps the capture robots encircle the escape robot more effectively. The escape robot has a field of view equal to or larger than that of the capture robot and moves at a higher speed.
[0118] The robot's observation range at any given moment is obtained through a grid sensor, as shown in Figure 4. The Grid Sensor combines the advantages of two-dimensional spatial representation in visual observation with the flexibility of defining detectable objects in physical raycasting. During observation, the sensor detects the presence of detectable objects in each cell and encodes them as a one-hot representation. The information collected from each cell forms a 3D tensor observation, which is fed into a Convolutional Neural Network (CNN) for agent policies, similar to visual observation. A cell is labeled whenever a detectable object collides with it. The agent scans the surrounding environment (walls, capture robots, communication robots, observation robots, escape robots, etc.) using the Grid Sensor to obtain the number, location, and function of surrounding agents. In the capture scenario, the capture team's objective is to find all escape robots and then capture them.
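The per-cell one-hot encoding described above can be sketched in plain Python as follows. The category list, the grid size, the `(row, col, category)` detection format, and the rule that a later detection overwrites an earlier one in the same cell are all assumptions made for illustration; a real grid sensor would produce the detections from the simulator or from hardware.

```python
# Hypothetical set of detectable object types, in a fixed channel order.
CATEGORIES = ["wall", "capture", "communication", "observation", "escape"]


def encode_grid(detections, height, width, categories=CATEGORIES):
    """Build a height x width x len(categories) observation as nested lists.

    detections: iterable of (row, col, category) tuples for occupied cells.
    Empty cells stay all-zero; an occupied cell gets a one-hot vector.
    """
    index = {category: i for i, category in enumerate(categories)}
    grid = [[[0] * len(categories) for _ in range(width)] for _ in range(height)]
    for row, col, category in detections:
        if 0 <= row < height and 0 <= col < width:
            one_hot = [0] * len(categories)
            one_hot[index[category]] = 1
            # Assumption of this sketch: the latest detection wins per cell.
            grid[row][col] = one_hot
    return grid


# Example: a 2 x 3 grid with one escape robot and one wall segment.
observation = encode_grid([(0, 1, "escape"), (1, 2, "wall")], height=2, width=3)
print(observation[0][1])  # [0, 0, 0, 0, 1]
```

The resulting nested list has the height × width × channels layout that a convolutional network expects after conversion to a tensor.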
[0119] As shown in Figure 2, to address the need for intelligent collaboration in heterogeneous robot systems, this embodiment proposes a framework based on multi-agent reinforcement learning to solve the problem of cooperative decision-making among heterogeneous robots performing different tasks. Its basic principle is to use an asymmetric self-play mechanism for better exploration during training, together with reward function allocation based on curriculum learning, to incentivize intelligent cooperative behavior among different robots under different conditions. Compared with traditional robot cooperation methods based on control and planning, this multi-agent reinforcement learning approach offers stronger reliability and generalization. Furthermore, its planned policies shift from sparse, short-term greedy strategies to long-term cooperative strategies based on the current situation, improving multiple metrics of the multi-robot system simultaneously.
[0120] In one implementation of this application, the curriculum learning mechanism is divided into three stages: a first learning stage, a second learning stage, and a third learning stage. It should be noted that the curriculum learning mechanism can also be divided into another number of learning stages; the number is not limited.
[0121] 1011. Determine, according to the curriculum learning mechanism, that the current learning stage is the first learning stage;
[0122] 1012. Based on the current learning stage being the first learning stage, select the capture robot and the escape robot from the multiple groups of robots as target robots.
[0123] In the first learning stage, the capture robot and the escape robot are used as target robots for deep reinforcement learning training.
[0124] 1041. Update the current learning stage to the second learning stage according to the curriculum learning mechanism;
[0125] 1042. Based on the updated current learning stage being the second learning stage, select the observation robot from the multiple groups of robots and add it as a target robot.
[0126] After the first learning stage is complete, the second learning stage begins. In the second learning stage, the observation robot is added as a target robot, so that the capture robot, observation robot, and escape robot jointly participate in deep reinforcement learning training.
[0127] 1043. Update the current learning stage to the third learning stage according to the curriculum learning mechanism;
[0128] 1044. Based on the updated current learning stage being the third learning stage, select the communication robot from the multiple groups of robots and add it as a target robot.
[0129] After the second learning stage is complete, the third learning stage begins. In the third learning stage, the communication robot is added as a target robot, so that the capture robot, observation robot, communication robot, and escape robot jointly participate in deep reinforcement learning training.
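The staged procedure (steps 101 to 105, instantiated with the three learning stages above) can be summarized as a short control loop. This is a sketch under stated assumptions: `curriculum_train`, its `stages` argument, and the `train_fn` callback are hypothetical names, and the actual deep reinforcement learning update is left to whatever `train_fn` does.

```python
def curriculum_train(stages, train_fn):
    """Run staged training and return the target roster used in each stage.

    stages: list of lists of group names; stages[k] holds the groups that
            join the target robots in learning stage k + 1.
    train_fn: callable that trains the given target groups to cooperate.
    """
    if len(stages[0]) < 2:
        raise ValueError("the first learning stage must select at least two groups")
    all_groups = {group for stage in stages for group in stage}
    targets = list(stages[0])            # step 101: initial target robots
    stage_index = 0
    history = []
    while True:
        train_fn(list(targets))          # step 102: train the current targets
        history.append(list(targets))
        if set(targets) == all_groups:   # steps 103 and 105: every group added
            return history
        stage_index += 1                 # step 104: move to the next stage
        targets.extend(stages[stage_index])


# Three-stage schedule from the text; the callback only records what was trained.
stages = [["capture", "escape"], ["observation"], ["communication"]]
trained = []
history = curriculum_train(stages, train_fn=trained.append)
for roster in history:
    print(roster)
```

Keeping the schedule as data makes it easy to use a different number of learning stages, which the text explicitly allows.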
[0130] The specific training method is shown in Figure 6. Training begins with a group of capture robots learning to surround and capture the escape robots. This stage uses the Multi-Agent Posthumous Credit Assignment (MA-POCA) algorithm and asymmetric self-play to stimulate the intelligence of both the escape and capture robot teams, and curriculum learning is employed to accelerate training and improve stability. Next, an observation robot is added, further enhancing the collaborative capture effect among the heterogeneous robots through MA-POCA and curriculum learning. Finally, a communication robot is added, again using MA-POCA and curriculum learning, to further improve how effectively the capture robots capture the escape robots. In Figure 6, from left to right, green, red, and blue represent the first, second, and third stages of capture training. In the first stage, only the capture and escape robots participate in training. In the second stage, an observation robot is added to the environment to train the model's resilience against interference. In the third stage, the observation and communication robots are added to the capture team, and collaborative capture training among the heterogeneous robots is conducted.
[0131] The curriculum learning mechanism can also be called curriculum learning. Robots can also be called nodes; a capture robot can be called a capture node, an observation robot can be called an observation node, a communication robot can be called a communication node, and an escape robot can be called an escape node.
[0132] like Figure 3 As shown, in one implementation of this application embodiment, a deep reinforcement learning method is used to train the target robot to cooperate, specifically including:
[0133] 1021. Obtain the latent space state variables of the target robots;
[0134] Obtain the latent space state variables of each target robot. These latent space state variables are derived from the position, motion, and sensor data of each target robot. The latent space state variable of a single target robot characterizes its state, and the latent space state variables of all target robots characterize the state of the system.
[0135] 1022. Use a distributed policy network to control each target robot's decisions in the asymmetric game based on that robot's latent space state variables;
[0136] In a distributed policy network, each target robot has its own policy network.
[0137] It should be noted that robots within the same group can share policy networks to avoid individual robots performing significantly worse than others. Policy network sharing can take several forms: assigning policy network weights from better-performing robots to poorer-performing robots; averaging policy network weights; or using genetic algorithms to recombine and mutate genes within the same group—the specific methods are not limited.
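Two of the sharing options named above (copying weights from the best performer, and averaging weights) can be sketched as follows. Representing a policy network as a flat list of floats, and the names `average_weights` and `copy_best`, are simplifying assumptions for illustration; the genetic-algorithm option is not shown.

```python
def average_weights(policies):
    """Element-wise mean of the flat weight lists of all robots in one group."""
    n = len(policies)
    return [sum(ws) / n for ws in zip(*policies)]


def copy_best(policies, scores):
    """Give every robot in the group a copy of the best-scoring robot's weights."""
    best = max(range(len(policies)), key=lambda i: scores[i])
    return [list(policies[best]) for _ in policies]


# Three robots in the same group, two weights each (toy values).
group = [[0.0, 2.0], [2.0, 4.0], [4.0, 0.0]]

shared = average_weights(group)
copied = copy_best(group, scores=[0.1, 0.9, 0.5])
print(shared)
print(copied)
```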
[0138] 1023. Use a centralized critic network to estimate the value of the current system state based on the latent space state variables of all target robots. The current system state is determined by the target robots and the environment.
[0139] A single centralized critic network is used for the whole system, which consists of multiple heterogeneous groups of robots. The critic network estimates the value of the current system state based on the conditions of all target robots and the environment. The value of the current system state reflects the state of the game between the capture team and the escape team; it can generally be expressed as, for example, the win rate or the remaining time to complete the capture. The capture team includes the capture robots, observation robots, and communication robots, while the escape team includes the escape robots.
[0140] In the centralized training phase, a multi-agent actor-critic algorithm is used to train the policy network of each robot in the heterogeneous multi-robot system. During training, a centralized critic network is trained to estimate the value of the current system state, which includes the latent space states of the entire system and of all agents. Note that the state of the entire system is needed only during training and can easily be obtained in a simulation environment. With such a centralized critic network, training the actor network (that is, the policy network) becomes much simpler.
[0141] In the distributed execution phase, unlike the centralized training phase, each agent only needs its own latent space state, without needing the latent space states of other agents or the system state. Specifically, each agent only needs to input its own latent space state into the corresponding policy network to output its optimal action.
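The split between centralized training and distributed execution can be illustrated with a toy example in which the critic reads the joint latent state while each actor reads only its own. The linear models, the weights, and the names `actor` and `critic` are assumptions chosen for brevity and do not represent the networks of this application.

```python
import math


def actor(own_latent, weights):
    """Decentralized policy: uses only this robot's own latent state."""
    score = sum(w * x for w, x in zip(weights, own_latent))
    return math.tanh(score)  # bounded action, e.g. a normalized speed command


def critic(all_latents, weights):
    """Centralized critic: uses the concatenated latent states of all robots."""
    joint = [x for latent in all_latents for x in latent]
    return sum(w * x for w, x in zip(weights, joint))


latents = {"capture_0": [0.5, -0.2], "observer_0": [0.1, 0.3]}
actor_w = {"capture_0": [1.0, 0.0], "observer_0": [0.0, 1.0]}
critic_w = [0.5, 0.5, 0.5, 0.5]

# Training time only: the critic needs every robot's latent state.
value = critic(list(latents.values()), critic_w)

# Execution time: each robot acts on its own latent state alone.
actions = {name: actor(latents[name], actor_w[name]) for name in latents}
print(value, actions)
```

Because the critic is discarded after training, no robot needs the other robots' states or the system state when deployed.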
[0142] In one implementation of this application, the latent space state variables include: self-state variables, perception state variables, and / or communication state variables.
[0143] For each group of robots, the observation space of each robot includes the following three types: a) self-state variables: the robot's basic state information; b) perception state variables: precise local information within its perception range; c) communication state variables: coarse global information within its communication range.
[0144] The following details each observation of the robot: its own state includes global position and local velocity, which are encoded as one-dimensional vectors; both the perception state and the communication state are encoded as image information with the object's position, and this image will change as the agent moves and rotates.
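A minimal sketch of the two encodings described above: a one-dimensional self-state vector, and an image-like grid that marks object positions in the robot's own frame, so that it changes as the robot moves and rotates. The grid size, cell size, axis convention, and function names are illustrative assumptions.

```python
import math


def self_state_vector(global_position, local_velocity):
    """Self-state: global position and local velocity as one flat vector."""
    return list(global_position) + list(local_velocity)


def egocentric_grid(agent_xy, agent_heading, objects_xy, size=5, cell=1.0):
    """Mark objects on a size x size grid centered on the agent.

    Columns follow the agent's body-frame x axis and rows its y axis
    (row index decreases as body-frame y increases).
    """
    grid = [[0] * size for _ in range(size)]
    half = size // 2
    c, s = math.cos(-agent_heading), math.sin(-agent_heading)
    for ox, oy in objects_xy:
        dx, dy = ox - agent_xy[0], oy - agent_xy[1]
        # Rotate the world-frame offset into the agent's body frame.
        bx, by = c * dx - s * dy, s * dx + c * dy
        col = half + int(round(bx / cell))
        row = half - int(round(by / cell))
        if 0 <= row < size and 0 <= col < size:
            grid[row][col] = 1  # objects outside the range are ignored
    return grid


state = self_state_vector((1.0, 2.0), (0.3, 0.0))
facing_x = egocentric_grid((0.0, 0.0), 0.0, [(1.0, 0.0), (10.0, 0.0)])
facing_y = egocentric_grid((0.0, 0.0), math.pi / 2, [(1.0, 0.0)])
```

The same object at world position (1, 0) lands in a different cell once the robot rotates by 90 degrees, which is the behavior the text describes.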
[0145] In one implementation of this application, the optimization function of the critic network is:
[0146]
[0147] These represent capture robots, observation robots, and communication robots, respectively, and are defined as follows: , , ,in The total number of robots in each group;
[0148] The formula defines an escape robot. ,in This represents the total number of escaped robots;
[0149] The formula defines a threshold distance between the communication robot and the capture robot. The formula defines a distance threshold between the observing robot and the escaping robot. ;
[0150] Sets defined in the formula This includes the state of each group of robots at any time t.
[0151] In one implementation of the embodiments of this application, Defined as:
[0152]
[0153] The `dist(·,·)` function calculates the Euclidean distance.
[0154] As the definition shows, the goal is to shorten the distance between the capture robots and the escape robots, which requires calculating the Euclidean distance between them at time t.
[0155] In one implementation of this application, in the Markov Decision Process (MDP) setting, Discretized .
[0156] In one implementation of the embodiments of this application, Defined as:
[0157]
[0158] As the definition shows, the communication robot needs to optimize communication across the entire capture team. The working process of the communication robot is shown in Figures 7 and 8. In Figure 7, the communication robot moves from a position at the edge relative to the capture team and the communication base station (bottom right corner of the left image) to a position midway between the capture team and the communication base station (middle of the right image), where it acts as a communication relay, reducing communication blind spots and improving communication between the capture team and the base station. As shown in Figure 8, information flows from right to left as follows: the observation robot informs the communication robot of the escape robot's location and of its own location, and the communication robot informs the capture robot of the escape robot's location. Information flows from left to right as follows: the capture robot informs the communication robot of its own location. The labels in Figure 8 are: Catch Robots, Communication Robots, Observer Robots (abbreviated Obs), and Target Robot; each robot's position is abbreviated Pos.
[0159] Furthermore, a constraint needs to be added to the system to indicate signal loss. This application summarizes three types of constraints widely accepted in the mobile communications community and tests the optimal constraint in an outdoor environment, demonstrating that a suitable solution can be obtained using the Friis transmission formula.
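For reference, the Friis transmission formula gives the free-space received power as P_r = P_t · G_t · G_r · (λ / (4πd))², where P_t is the transmitted power, G_t and G_r are the linear antenna gains, λ is the wavelength, and d is the distance. The sketch below shows one way it could act as a signal-loss constraint; the 2.4 GHz carrier, unit gains, transmit power, sensitivity threshold, and the name `link_ok` are illustrative assumptions, not values from this application.

```python
import math

SPEED_OF_LIGHT = 299_792_458.0  # m/s


def friis_received_power(p_t, g_t, g_r, wavelength, distance):
    """Free-space received power (same unit as p_t); gains are linear, not dB."""
    if distance <= 0:
        raise ValueError("distance must be positive")
    return p_t * g_t * g_r * (wavelength / (4.0 * math.pi * distance)) ** 2


def link_ok(distance, p_t=0.1, g_t=1.0, g_r=1.0, freq_hz=2.4e9, min_power=1e-9):
    """Hypothetical constraint: the link holds if received power >= min_power."""
    wavelength = SPEED_OF_LIGHT / freq_hz
    return friis_received_power(p_t, g_t, g_r, wavelength, distance) >= min_power


near = friis_received_power(0.1, 1.0, 1.0, 0.125, 10.0)
far = friis_received_power(0.1, 1.0, 1.0, 0.125, 20.0)
print(near / far)  # doubling the distance divides the received power by four
```

Under these assumed parameters the inverse-square falloff bounds how far a communication robot can be from the robots it relays for, which is why moving to a relay position reduces blind spots.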
[0160] In one implementation of the embodiments of this application, Defined as:
[0161]
[0162] As can be seen from the definition, this term regulates the observation robot so that it hovers above the escape robot and provides the capture robot with the escape robot's location.
[0163] In one implementation of the embodiments of this application, Defined as:
[0164]
[0165] in ;
[0166] Defined as a piecewise function:
[0167] .
[0168] Reinforcement learning algorithms use reward functions to optimize the agent's policy. When a single-agent algorithm is used, the reward is given to the individual robot. When using the MA-POCA algorithm, the reward is distributed among robots in the same group.
[0169] Figure 4 shows the input-output structure of the policy network designed in this application. For each robot agent in the system (capture robot, escape robot, communication robot, and observation robot), its perception state is encoded into a feature vector by a convolutional neural network and then combined with its own state to form the final observation vector in the latent space.
[0170] The embodiments of this application follow the centralized training and distributed execution (CTDE) paradigm, which is widely used in multi-robot (multi-agent) reinforcement learning problems modeled as partially observable Markov decision processes. Figure 4 illustrates the input-output structure of the policy network in this embodiment in detail. For each robot agent in the system (capture robot, escape robot, communication robot, and observation robot), its perception state is encoded into a feature vector by a convolutional neural network and then combined with its own state to form the final observation vector in the latent space. Each robot's policy network uses a multi-layer perceptron (MLP) module. The policy network of the i-th robot takes the final observation vector at time t, outputs the robot's action, and then converts the action into a speed command. The right side of Figure 4 shows the relationship between the policy network and the credit assignment of the reward function, including the rewards for individual agents and the intra-group rewards for teams of agents. The middle part of Figure 5 shows the policy generation process for each robot, which uses vector observations and grid sensors to obtain each robot's position and velocity.
[0171] The following describes how to define the reward for each group of robots: For the capture robot, the reward function can be decomposed into two parts. The first part is the obstacle avoidance reward for the capture robot, and the second part is the tracking reward for the capture robot.
[0172] To encourage the capture robots to move away from each other and from obstacles, an obstacle avoidance reward is set up.
[0173] In one implementation of this application, the reward function of the policy network includes an obstacle avoidance reward for the capture robot:
[0174]
[0175] The set of obstacles is defined in the formula as follows: .
[0176] In one implementation of this application, when a capture robot maintains close proximity to the escape robot, it receives a sparse reward at time t. That is, the tracking reward for the capture robots is the reward signal given when any capture robot approaches the escape robot:
[0177] .
[0178] The observation robot tracks the escape robot from the air and reports its location to the capture robots. The observation robot therefore tries to get as close as possible to the escape robot. The reward function of the policy network also includes a tracking reward for the observation robot:
[0179] .
[0180] In one implementation of this application, the main function of the communication robot is to minimize the distance between itself and the capture robot during the capture process. To ensure this, the reward function of the policy network also includes a tracking reward for the communication robot.
[0181] .
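The reward terms described above can be sketched as distance-based functions. The exact formulas are not reproduced in this text, so everything below is an assumption made for illustration: the reward magnitudes, the thresholds `safe_dist`, `capture_dist`, and `threshold`, and all function names. Only the use of Euclidean distance and the sparse, any-robot form of the capture tracking reward follow the description.

```python
import math


def dist(a, b):
    """Euclidean distance between two 2D points."""
    return math.hypot(a[0] - b[0], a[1] - b[1])


def avoidance_reward(robot, others, obstacles, safe_dist=0.5, penalty=-1.0):
    """Assumed form: penalize a capture robot that is too close to a
    teammate or an obstacle, otherwise give nothing."""
    too_close = any(dist(robot, p) < safe_dist
                    for p in list(others) + list(obstacles))
    return penalty if too_close else 0.0


def capture_tracking_reward(capture_robots, escape_robot, capture_dist=1.0):
    """Sparse reward: 1.0 if any capture robot is close to the escape robot."""
    close = any(dist(c, escape_robot) <= capture_dist for c in capture_robots)
    return 1.0 if close else 0.0


def proximity_reward(robot, targets, threshold):
    """Assumed form for the observation robot (target: escape robot) and the
    communication robot (targets: capture robots): reward staying in range."""
    return 1.0 if min(dist(robot, t) for t in targets) <= threshold else 0.0


captors = [(0.0, 0.0), (5.0, 5.0)]
r_track = capture_tracking_reward(captors, (0.5, 0.0))
r_avoid = avoidance_reward((0.0, 0.0), [(0.2, 0.0)], [])
r_comm = proximity_reward((0.0, 0.0), captors, threshold=2.0)
print(r_track, r_avoid, r_comm)
```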
[0182] The experimental data presented below demonstrate the feasibility and advantages of this application.
[0183] To develop a more robust and intelligent heterogeneous multi-robot system (HMRS) for capture tasks, we employ asymmetric self-play and curriculum learning to handle the heterogeneity and variable number of robots. Our approach, based on actor-critic multi-agent reinforcement learning, provides a framework for cooperative behavior among heterogeneous multi-robot teams, targeting complex capture scenarios that involve multiple heterogeneous robot teams and real-world physical constraints. We conducted simulation experiments to evaluate the impact of different mechanisms on the performance of our method, and real-world experiments to evaluate the performance of our system on complex real-world capture problems. We also conducted a comparative study against the Scalable and Safe Multi-Agent Motion Planning (S2M2) method on heterogeneous capture problems; our method performs better in adversarial environments. The results demonstrate that the proposed framework, which integrates asymmetric self-play and curriculum learning, can successfully complete HMRS capture tasks under realistic constraints in both simulation and the real world, thus providing direction for future large-scale intelligent capture.
[0184] To demonstrate the capabilities of our method in the real world, we first trained it in a simulated environment using curriculum learning, showing that our method can learn capture strategies by intelligently exploiting the self-play mechanism and the diverse capabilities of heterogeneous robot teams. The simulated environment is a reasonable abstraction of the real-world scenario. In the first stage, the escape robot was trained using a multi-agent reinforcement learning algorithm, endowing it with a certain level of intelligence, which in turn stimulates the capture robots to learn more intelligent strategies. In addition, the environment is composed of randomly generated obstacles, which narrows the gap between the simulated and real-world environments. Finally, we modeled the capabilities of the different heterogeneous robot teams based on their actual functions, to keep the modeling gap in our research as small as possible. More specifically, we constructed the scenario using the Machine Learning Agents Toolkit (ML-Agents) on the real-time content development platform Unity, a general-purpose platform for intelligent agents. Figure 6 defines the different development stages of the simulation experiments and illustrates the gradual evolution of our baseline system. From left to right, the cooperative strategies of the teams are developed progressively: the capture robots, the observation robots, and the communication robots. Each robot team is also trained in two or three steps: the team is first trained to exhibit cooperative behavior, then the escape robots are subjected to the asymmetric self-play mechanism, and finally a learning curriculum is designed for the different robots. The capture robot team's goal is to catch up with the escape robot and stay close to it. The observation robots' goal is to keep the escape robot under observation so that they can send its location to the capture robots. The communication robot team aims to stay close to the capture robot team and remain within communication range for as long as possible.
[0185] As shown in Figure 9, the reward curve obtained in the real-world experiment shows that the total reward increases across the three stages of learning, which correspond to the three stages of the training framework.
[0186] As shown in Figure 10, to demonstrate that our design is feasible in the real world as well as in simulation, we conducted a real-world experiment with three different robot teams, comprising three capture robots, two observation robots, and one communication robot, plus an escape robot acting as the adversary.
[0187] To measure the success rate, we conducted 18 repeated experiments. With three capture robots in each experiment, this gives 18 × 3 = 54 potential successes. The results were satisfactory in both the simulation and real-world scenarios.
[0188] As shown in Figure 11, we conducted a comparative study of the method proposed in this application embodiment and the S2M2 method in our capture scenario (in the figure, the proposed method is labeled "Proposed" and the S2M2 method is labeled "S2M2"). To our knowledge, no existing method is specifically designed to solve the intelligent capture problem with heterogeneous robot teams. S2M2 is an algorithm based on optimization and priority-based search (PBS) that enables a robot team to reach its targets while avoiding all obstacles and each other. In a simulated environment, we collected results at 100, 500, 1000, 1500, 2000, and 2500 steps, allowing us to compare S2M2 with our learning-based method.
[0189] First, a two-tailed test was performed to check whether there was a statistically significant difference between the two groups in the final 2500 steps. The comparison revealed a significant difference between the two groups, and both results passed the normality test when the number of steps exceeded 1000. Furthermore, to further analyze trends, instead of using multivariate ANOVA or a general linear model, we assumed a Gaussian-like nonlinear transformation of the input at each step and performed Gaussian Process Regression (GPR) using maximum likelihood estimation.
[0190] As shown in Figure 12, this application also provides a computer device 1200, comprising:
[0191] Central processing unit 1201, memory 1205, input / output interface 1204, wired or wireless network interface 1203, and power supply 1202;
[0192] Memory 1205 is either a short-term storage memory or a persistent storage memory;
[0193] The central processing unit 1201 is configured to communicate with the memory 1205 and to execute the instructions stored in the memory 1205 so as to perform the method of the embodiments shown in Figures 1 to 8.
[0194] This application also provides a computer-readable storage medium, which includes instructions that, when executed on a computer, cause the computer to perform the method of the embodiments shown in Figures 1 to 8.
[0195] This application also provides a computer program product containing instructions that, when run on a computer, cause the computer to perform the method of the embodiments shown in Figures 1 to 8.
[0196] This application also provides a chip system, which includes at least one processor and a communication interface. The communication interface and the at least one processor are interconnected via a circuit. The at least one processor is used to run computer programs or instructions to perform the method of the embodiments shown in Figures 1 to 8.
[0197] It should be understood that although the steps in the flowcharts of the embodiments described above are shown sequentially according to the arrows, these steps are not necessarily executed in the order indicated by the arrows. Unless explicitly stated herein, there is no strict order restriction on the execution of these steps, and they can be executed in other orders. Moreover, at least some steps in the flowcharts of the embodiments described above may include multiple steps or multiple stages. These steps or stages are not necessarily completed at the same time, but can be executed at different times. The execution order of these steps or stages is not necessarily sequential, but can be performed alternately or in turn with other steps or at least some of the steps or stages of other steps.
[0198] Those skilled in the art will clearly understand that, for the sake of convenience and brevity, the specific working processes of the systems, devices, and units described above can be referred to the corresponding processes in the foregoing method embodiments, and will not be repeated here.
[0199] The terms "first," "second," "third," "fourth," etc., used in the specification, claims, and accompanying drawings of this application are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. It should be understood that such data can be interchanged where appropriate so that the embodiments described herein can be implemented in a sequence other than that illustrated or described herein.
[0200] In the several embodiments provided in this application, it should be understood that the disclosed systems, apparatuses, and methods can be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative; for instance, the division of units is only a logical functional division, and in actual implementation, there may be other division methods. For example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. Furthermore, the coupling or direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection between apparatuses or units through some interfaces, and may be electrical, mechanical, or other forms.
[0201] The units described as separate components may or may not be physically separate. The components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units can be selected to achieve the purpose of this embodiment according to actual needs.
[0202] Furthermore, the functional units in the various embodiments of this application can be integrated into one processing unit, or each unit can exist physically separately, or two or more units can be integrated into one unit. The integrated unit can be implemented in hardware or as a software functional unit.
[0203] If the integrated unit is implemented as a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of this application, in essence, or the part that contributes to the prior art, or all or part of the technical solution, can be embodied in the form of a software product. This computer software product is stored in a storage medium and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of this application. The aforementioned storage medium includes various media capable of storing program code, such as USB flash drives, portable hard drives, read-only memory (ROM), random access memory (RAM), magnetic disks, or optical disks.
Claims
1. A heterogeneous robot collaboration method, characterized in that it comprises: using a machine learning method to train multiple heterogeneous groups of robots to cooperate, wherein during training the robots in the same group share machine learning training parameters and perform the same task, and among the multiple heterogeneous groups of robots at least two robots satisfy the following condition: they belong to the same group and are heterogeneous robots with different performance parameters; wherein using a machine learning method to train the multiple heterogeneous groups of robots to cooperate comprises: determining the current learning stage according to a curriculum learning mechanism, and selecting at least two groups of robots from the multiple heterogeneous groups of robots as target robots based on the current learning stage; training the target robots to cooperate using a deep reinforcement learning method; determining whether all of the multiple groups of robots have been added as target robots; if not all of the multiple groups of robots have been added as target robots, updating the current learning stage according to the curriculum learning mechanism, selecting, based on the updated current learning stage, at least one group of robots that has not yet participated in training from the multiple groups of robots and adding it to the target robots, so that the number of groups among the target robots increases by at least one, and then returning to the step of training the target robots to cooperate using the deep reinforcement learning method; and if all of the multiple groups of robots have been added as target robots, determining that training is complete.
2. The heterogeneous robot collaboration method according to claim 1, characterized in that the heterogeneous robots with different performance parameters include heterogeneous robots with different structural parameters, model parameters, motion parameters, or sensor parameters.
3. The heterogeneous robot collaboration method according to claim 2, characterized in that the heterogeneous robots with different structural parameters include heterogeneous robots with legged and wheeled structures respectively.
4. The heterogeneous robot collaboration method according to claim 1, characterized in that the multiple heterogeneous groups of robots include capture robots, escape robots, observation robots, and communication robots.
5. The heterogeneous robot collaboration method according to claim 4, characterized in that the capture robots, escape robots, and communication robots are all ground-based robots, and the observation robots are flying drones.
6. The heterogeneous robot collaboration method according to claim 4, characterized in that determining the current learning stage according to the curriculum learning mechanism and selecting at least two groups of robots from the multiple heterogeneous groups of robots as target robots based on the current learning stage specifically comprises: determining, according to the curriculum learning mechanism, that the current learning stage is the first learning stage; and, based on the current learning stage being the first learning stage, selecting the capture robots and the escape robots from the multiple groups of robots as target robots.
7. The heterogeneous robot collaboration method according to claim 6, characterized in that updating the current learning stage according to the curriculum learning mechanism and, based on the updated current learning stage, selecting at least one group of robots from the multiple groups of robots and adding it to the target robots specifically comprises: updating the current learning stage to the second learning stage according to the curriculum learning mechanism; and, based on the updated current learning stage being the second learning stage, selecting the observation robots from the multiple groups of robots and adding them to the target robots.
8. The heterogeneous robot collaboration method according to claim 7, characterized in that updating the current learning stage according to the curriculum learning mechanism and, based on the updated current learning stage, selecting at least one group of robots from the multiple groups of robots and adding it to the target robots specifically comprises: updating the current learning stage to the third learning stage according to the curriculum learning mechanism; and, based on the updated current learning stage being the third learning stage, selecting the communication robots from the multiple groups of robots and adding them to the target robots.
9. The heterogeneous robot collaboration method according to claim 4, characterized in that training the target robots to cooperate using a deep reinforcement learning method specifically comprises: obtaining the latent space state variables of the target robots; using a distributed policy network to control each target robot's decisions in the asymmetric game based on that robot's latent space state variables; and using a centralized critic network to estimate the value of the current system state based on the latent space state variables of all target robots, the current system state being determined by the target robots and the environment.
10. The heterogeneous robot collaboration method according to claim 9, characterized in that the latent space state variables include: self-state variables, perception state variables, and / or communication state variables.