A method for updating cooperative control strategies for heterogeneous robot formations and its application
By using a collaborative control strategy update method for heterogeneous robot formations, which utilizes interactive trajectory data for evaluation and grouping strategy parameter updates, the efficiency and consistency issues of collaborative control in heterogeneous robot formations are resolved, thereby improving the stability and reliability of path coordination and task execution.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- HUNAN UNIV
- Filing Date
- 2026-05-14
- Publication Date
- 2026-06-30
AI Technical Summary
Existing technologies struggle to achieve efficient collaborative control in heterogeneous robot formations, leading to problems such as path competition, traffic conflicts, and task response imbalances, making it difficult to balance individual robot differences with formation consistency requirements.
By acquiring interaction trajectory data for centralized value assessment, strategy groups are divided and group-level and individual-level strategy parameters are updated. Collaborative task control instructions are generated by combining shared and personalized parameters.
It improves the robot formation's path coordination, passage stability, and task connection capabilities in dynamic obstacle environments, thereby enhancing control reliability and overall efficiency.
Figure CN122308458A_ABST
Description
Technical Field
[0001] This invention belongs to the field of robot automatic control, and in particular relates to a method for updating the cooperative control strategy of a heterogeneous robot formation and its application.
Background Technology
[0002] With the continuous improvement of intelligent manufacturing, smart warehousing, and port automation, heterogeneous robot platooning systems composed of various types of robots have been widely used in industrial scenarios such as smart manufacturing workshops, warehousing and logistics centers, and port material transfer areas. In these scenarios, multiple types of robots with different functions, such as towed handling robots, forklift handling robots, inspection robots, and vision-guided robots, are typically deployed simultaneously to collaboratively complete tasks such as material handling, route yielding, task relay, area avoidance, path reconstruction, and obstacle avoidance.
[0003] Due to the characteristics of industrial sites, such as narrow passages, dynamic changes in obstacles, dense task requests, and real-time fluctuations in equipment operating status, multiple heterogeneous robots need to perform continuous collaborative perception, collaborative decision-making, and collaborative control under the conditions of sharing site resources and shared task objectives. Therefore, how to achieve stable and efficient collaborative control of heterogeneous robot formations has become an important technical problem in the field of industrial automation.
[0004] In existing industrial robot swarm control methods, one approach adopts a uniform control strategy or decision-making model for all robots. This means that it does not differentiate between robots in terms of functional roles, mobility, and perception configuration, but uses similar action selection logic and parameter update mechanisms for different types of robots. While this approach facilitates control architecture design and parameter management, in practical applications, due to significant differences in the functional configurations and task roles of different robot types, a uniform control method often struggles to simultaneously meet the control needs of different robots. This can easily lead to problems such as path competition, passage conflicts, task response imbalances, and localized congestion among different types of robots in narrow passages, intersections, or high-density work areas.
[0005] Another approach is to establish an independent control model or update mechanism for each robot to preserve the behavioral differences between different robots. This approach can adapt to the functional differences between heterogeneous robots to a certain extent, but as the number of robots increases, the scale of model parameters, training costs, and the burden of online updates will increase rapidly. Under conditions of peak tasks, frequent obstacle changes, or temporary additions and removals of robots, problems such as asynchronous updates of control parameters among multiple robots, inconsistent task handover, and decreased overall coordination of the formation are likely to occur.
[0006] Existing robot control methods therefore have difficulty establishing an efficient correlation between the overall collaborative effect of the formation and the individual control differences of different types of robots, and have difficulty simultaneously addressing the adaptation needs of heterogeneous robots and the consistency requirements of formation operation. These issues limit the application effectiveness of heterogeneous robot formations in practical scenarios such as intelligent manufacturing, warehousing and logistics, and port transshipment.
Summary of the Invention
[0007] The purpose of this application is to provide a method for updating the cooperative control strategy of heterogeneous robot formations, which aims to solve the problems in the existing heterogeneous robot control methods, such as the difficulty in establishing an efficient correlation between the overall cooperative effect of the formation and the individual control differences of different types of robots; and the difficulty in simultaneously taking into account the differences in adaptation between heterogeneous robots and the consistency requirements of formation operation.
[0008] This application provides a method for updating a cooperative control strategy for heterogeneous robot formations, the method comprising:
[0009] acquiring interaction trajectory data generated when the heterogeneous robot formation performs collaborative tasks;
performing a centralized value assessment based on the interaction trajectory data to obtain a global advantage quantity that characterizes the overall collaborative control effect of the heterogeneous robot formation;
dividing the heterogeneous robots into several strategy groups, decomposing the global advantage quantity into a conditional group advantage quantity for each strategy group based on the contribution of that strategy group to the global advantage quantity, and updating the shared strategy parameters of each strategy group at the group level according to its conditional group advantage quantity;
allocating the conditional group advantage quantity within each group, based on the local observation data and individual identification parameters of each heterogeneous robot in the strategy group, to obtain an individual advantage quantity for each heterogeneous robot, and synchronously updating, based on the individual advantage quantities, the shared strategy parameters of the strategy group and the personalized strategy parameters of each heterogeneous robot in the group;
sending the updated shared strategy parameters and personalized strategy parameters to the controller of each heterogeneous robot to generate collaborative task control instructions.
[0010] Another objective of this application is to provide a heterogeneous robot formation cooperative control strategy update device, the device comprising:
a raw data acquisition unit, used to acquire the interaction trajectory data generated when the heterogeneous robot formation performs collaborative tasks;
a global advantage quantity acquisition unit, used to perform a centralized value assessment based on the interaction trajectory data to obtain the global advantage quantity that characterizes the overall collaborative control effect of the heterogeneous robot formation;
a strategy group update unit, used to divide the heterogeneous robots into several strategy groups, decompose the global advantage quantity into a conditional group advantage quantity for each strategy group based on the contribution of that strategy group to the global advantage quantity, and update the shared strategy parameters of each strategy group at the group level according to its conditional group advantage quantity;
a personalized strategy update unit, used to allocate the conditional group advantage quantity within each group based on the local observation data and individual identification parameters of each heterogeneous robot in the strategy group, so as to obtain the individual advantage quantity for each heterogeneous robot, and, based on the individual advantage quantities, to synchronously update the shared strategy parameters of the strategy group and the personalized strategy parameters of each heterogeneous robot in the group;
an instruction issuing unit, used to issue the updated shared strategy parameters and personalized strategy parameters to the controller of each heterogeneous robot to generate collaborative task control instructions.
[0011] Another objective of this application is to provide a computer-readable storage medium storing a computer program, which, when executed by a processor, causes the processor to perform the steps of the heterogeneous robot formation cooperative control strategy update method as described above.
[0012] Another objective of this application is to provide a robot control system, including a memory and a processor, wherein the memory stores a computer program, and when the computer program is executed by the processor, the processor of the robot control system performs the steps of the heterogeneous robot formation cooperative control strategy update method as described above.
[0013] The heterogeneous robot formation cooperative control strategy update method provided in this application has the following key advantages: it can establish a hierarchical transmission from global cooperative evaluation to strategy group evaluation and then to individual evaluation based on the actual operating state of the heterogeneous robot formation. This avoids the strategy convergence caused by existing unified control methods and also avoids the problems of large parameter scale, asynchronous updates, and poor scalability caused by existing completely independent control methods. It can group robots according to their functional division and motion constraints, so that robots in the same group can share common control logic, while robots in different groups can retain differentiated strategies. It can allocate individual advantage quantities according to the actual contribution of robots in local working conditions, so that the strategy update basis is more consistent with the actual task role undertaken by the robots. This helps to improve the path coordination ability, passage stability, task connection ability, and control reliability of robot formations in dynamic obstacle environments.
Attached Figure Description
[0014] Figure 1 is a flowchart of a method for updating a cooperative control strategy for heterogeneous robot formations, provided in an embodiment of this application; Figure 2 is a structural block diagram of a heterogeneous robot formation cooperative control strategy update device provided in an embodiment of this application; Figure 3 is a block diagram of the internal structure of a computer device in one embodiment.
Detailed Implementation
[0015] To make the objectives, technical solutions, and advantages of this invention clearer, the invention will be further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative and not intended to limit the invention.
[0016] It is understood that the terms "first," "second," etc., used in this application may be used herein to describe various elements, but unless otherwise specified, these elements are not limited by these terms. These terms are used only to distinguish the first unit or module from another unit or module. For example, without departing from the scope of this application, the first unit may be referred to as the second unit, and similarly, the second module may be referred to as the first module.
[0017] As shown in Figure 1, in one embodiment, a method for updating the cooperative control strategy of heterogeneous robot formations is proposed, which may specifically include the following steps: Step S10: Obtain the interaction trajectory data generated when the heterogeneous robot formation performs collaborative tasks.
[0018] In this embodiment, heterogeneous robot formation refers to robots that can be deployed in areas such as warehousing and logistics centers or port material transfer areas to complete collaborative tasks such as material handling, path yielding, task relay, area avoidance, and dynamic path reconstruction. The heterogeneous robots may include towed handling robots, forklift handling robots, inspection robots, and vision-guided robots. The interactive trajectory data is used to characterize the real-time operating status, control behavior, and task feedback results of each heterogeneous robot during the execution of collaborative tasks. This data may include indicators such as the global coordinates, speed, orientation, load status, remaining battery power, target workstation number, channel occupancy status, obstacle distribution status, task queue status, and inter-group communication link status of each heterogeneous robot in the workshop or warehouse area. It is understood that heterogeneous robots can also be used in scenarios other than material handling tasks.
[0019] Step S20: Based on the interaction trajectory data, a centralized value assessment is performed to obtain the global advantage quantity that characterizes the overall collaborative control effect of the heterogeneous robot formation.
[0020] In this embodiment, the aim is to map the local actions, global states, and task feedback of multiple heterogeneous robots at the current moment into a unified evaluation of the overall formation's collaborative effect. The centralized value assessment evaluates, from the perspective of the entire formation, the impact of the robots' collaborative execution results at the current moment on the overall task completion effect. For example, when a towed handling robot gives priority to clearing the main aisle and a forklift handling robot successfully enters the shelf area to complete its forking operation, the global state value is high; conversely, if the main aisle is congested, there is a conflict at the target workstation, or there is a risk of close-range collision, the global state value is low. The global advantage quantity refers to the gain of the current state value over the baseline state value. The larger the value, the more the robots' joint control behavior at the current moment benefits overall task execution efficiency, reduces aisle conflicts, and improves operational continuity. By constructing the global advantage quantity, engineering information originally scattered across the local observations and actions of each robot can be aggregated into a unified basis for group-level and individual-level updates, improving the directionality and stability of subsequent collaborative control strategy updates.
[0021] Step S30: Divide the heterogeneous robot into several strategy groups. Based on the contribution of each strategy group to the global advantage, decompose the global advantage into the condition group advantage corresponding to each strategy group. Update the shared strategy parameters of each strategy group at the group level according to the condition group advantage.
[0022] In this embodiment, the overall synergistic effect of the formation is further refined to robot groups of different functional categories, in order to distinguish the actual contribution of different strategy groups in the overall task. Strategy groups can be divided based on the functional roles and motion characteristics of the robots. For example, towed transport robots can be divided into towed transport groups, and forklift transport robots can be divided into forklift operation groups.
[0023] Shared strategy parameters can represent the backbone parameters of the strategy network used by robots within a group, and are used to complete local observation feature extraction, basic motion tendency discrimination, and generation of common behaviors for tasks within the group. For example, the shared strategy parameters of the traction and handling group can mainly characterize the logic of priority passage through aisles, smooth handling along long paths, and deceleration control under load conditions, while the shared strategy parameters of the forklift operation group can mainly characterize the logic of approaching the loading position, lifting stability, and docking at the workstation.
[0024] By first converting the global advantage quantity into the conditional group advantage quantity corresponding to each strategy group, and then using the conditional group advantage quantity to update the shared strategy parameters of each strategy group, the policy convergence problem caused by robots of different functional types directly sharing the same update basis can be avoided. It can also reduce control conflicts caused by simultaneous updates between groups, making the group-level update process more consistent with the functional division of labor among heterogeneous robots in industrial settings. Its key advantage lies in maintaining consistency in the basic control behavior of robots within the same group while preserving the control differences between different strategy groups, thus improving the overall collaborative stability of the formation.
[0025] Step S40: Based on the local observation data and individual identification parameters of each heterogeneous robot in the strategy group, the condition group advantage quantity is allocated within the group to obtain the individual advantage quantity corresponding to each heterogeneous robot; based on the individual advantage quantity, the shared strategy parameters of the strategy group and the personalized strategy parameters of each heterogeneous robot in the group are updated synchronously.
[0026] In this embodiment, the actual impact of different robots within the same strategy group on the current group-level collaborative effect is further distinguished, so that the individual differences of each robot are preserved while the consistency of the basic strategy within the group is maintained. The individual identification parameter can refer to a unique personalized parameter vector or identifier embedding corresponding to each heterogeneous robot, and its function is to distinguish the differences between robots within the same group in terms of task priority, local perception conditions, actuator states, etc.
[0027] The intra-group allocation operation refers to generating allocation coefficients based on the local observation data and individual identification parameters of each robot, so that robots that contribute more or are in critical working conditions receive a higher individual advantage. For example, in a forklift operation group, a forklift robot located near the target location and currently undertaking the main handling task can have a higher intra-group allocation score than a robot in the same group that is only waiting. Synchronous updates mean updating the shared strategy parameters within the current strategy group using an intra-group average gradient to maintain consistency in the basic behavioral logic of robots in the same group; and independently updating the individual strategy parameters of each robot to preserve individual differences within the group.
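The intra-group allocation and synchronous update described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the method as claimed: the softmax normalization of allocation scores, the function names, and the plain gradient-ascent step are all hypothetical, and the per-robot gradients are assumed to have been computed elsewhere (already weighted by each robot's individual advantage quantity).

```python
import math
from typing import Dict, List, Tuple


def allocation_weights(scores: Dict[str, float]) -> Dict[str, float]:
    """Turn per-robot allocation scores into coefficients that sum to 1.

    Softmax is an assumption; the source only says that allocation
    coefficients are generated from observations and identity parameters.
    """
    top = max(scores.values())
    exps = {rid: math.exp(s - top) for rid, s in scores.items()}
    total = sum(exps.values())
    return {rid: e / total for rid, e in exps.items()}


def individual_advantages(group_adv: float, scores: Dict[str, float]) -> Dict[str, float]:
    """Split one conditional group advantage among the robots of the group."""
    weights = allocation_weights(scores)
    return {rid: w * group_adv for rid, w in weights.items()}


def synchronous_update(
    shared: List[float],
    personal: Dict[str, List[float]],
    shared_grads: Dict[str, List[float]],
    personal_grads: Dict[str, List[float]],
    lr: float,
) -> Tuple[List[float], Dict[str, List[float]]]:
    """Update shared parameters with the intra-group average gradient and
    each robot's personalized parameters with its own gradient."""
    n = len(shared_grads)
    avg = [sum(g[j] for g in shared_grads.values()) / n for j in range(len(shared))]
    new_shared = [p + lr * g for p, g in zip(shared, avg)]
    new_personal = {
        rid: [p + lr * g for p, g in zip(params, personal_grads[rid])]
        for rid, params in personal.items()
    }
    return new_shared, new_personal


# A forklift robot near the target location gets a higher (hypothetical) score
# than one that is only waiting, so it receives more of the group advantage.
adv = individual_advantages(1.0, {"fork_1": 2.0, "fork_2": 0.0})
new_shared, new_personal = synchronous_update(
    shared=[0.0, 0.0],
    personal={"fork_1": [1.0], "fork_2": [1.0]},
    shared_grads={"fork_1": [1.0, 2.0], "fork_2": [3.0, 4.0]},
    personal_grads={"fork_1": [1.0], "fork_2": [-1.0]},
    lr=0.1,
)
```

Averaging the shared gradient keeps the group's common behavior consistent, while the per-robot step is what lets two robots in the same group diverge in their personalized parameters.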
[0028] The key advantage of this step is that it avoids the problem of identical robot behavior within a group caused by relying solely on group-level unified updates. It enables robots in the same group to generate differentiated control responses based on their own roles, positions, and local working conditions, even while sharing common control logic.
[0029] Step S50: The updated shared strategy parameters and personalized strategy parameters are sent to the controllers of each heterogeneous robot to generate collaborative task control instructions.
[0030] In this embodiment, the control strategy parameters obtained through training and optimization are deployed into the actual robot control system, enabling the heterogeneous robot to perform specific control actions in the industrial field according to the updated strategy. The controller can be an edge control module such as a motion controller or path tracking controller for the robot body.
[0031] The collaborative task control instructions generated by the updated shared strategy parameters and personalized strategy parameters can include target workstation selection instructions, path switching instructions, yielding and waiting instructions, etc.
[0032] Taking a warehousing and logistics center as an example, when the system detects that a towed handling robot is obstructing the main aisle, a forklift robot is preparing to leave the shelving area, or an inspection robot is executing a fixed inspection route, the controller can generate control commands for different robots based on the updated parameters. For the towed handling robot, the controller outputs a command to keep moving straight and slow down; for the forklift robot, the controller outputs a command to temporarily give way and wait for the main aisle to open. In this way, the strategy update result is transformed into control actions that can be actually executed on the industrial site, realizing the transition from strategy optimization to control implementation.
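One way to picture how a controller might combine the two parameter sets when generating an instruction is sketched below. The linear scoring, the command names, and the use of the personalized parameters as an additive bias are illustrative assumptions only; the source does not specify the form of the strategy network.

```python
from typing import List

# Hypothetical command set, loosely following the examples in the text.
COMMANDS = ["keep_straight_and_slow", "yield_and_wait", "switch_path"]


def select_command(
    observation: List[float],
    shared_weights: List[List[float]],
    personal_bias: List[float],
) -> str:
    """Score each command with the group's shared weights, shift the scores
    by the robot's personalized bias, and return the best command."""
    scores = []
    for row, bias in zip(shared_weights, personal_bias):
        scores.append(sum(w * x for w, x in zip(row, observation)) + bias)
    best = max(range(len(COMMANDS)), key=lambda i: scores[i])
    return COMMANDS[best]


shared_weights = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]  # one row per command
observation = [1.0, 0.2]

# Same shared logic, different personalized parameters, different command.
towed_cmd = select_command(observation, shared_weights, [0.0, 0.0, 0.0])
forklift_cmd = select_command(observation, shared_weights, [0.0, 1.5, 0.0])
```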
[0033] In this embodiment, heterogeneous robot formation can represent a collaborative operation system composed of two or more robots that differ in functional roles, kinematic constraints, payload capacity, or task responsibilities; shared strategy parameters can represent the basic parameters of the strategy network used by multiple robots within the same strategy group, used to extract common features within the group and generate consistent basic action decision logic within the group; personalized strategy parameters can represent parameter vectors unique to each robot, used to express the differences between robots based on shared strategy parameters; global advantage quantity can represent the comprehensive evaluation result of the current state and joint action effect from the overall perspective of the heterogeneous robot formation; conditional group advantage quantity can represent the contribution evaluation result of the current strategy group to the overall collaborative effect given the states of the preceding updated strategy group and the subsequent unupdated strategy group; individual advantage quantity can represent the update basis for further allocation to individual robots within the strategy group.
[0034] The method provided in this application has the advantage of establishing a hierarchical transmission from global collaborative evaluation to strategy group evaluation and then to individual evaluation based on the actual operating state of heterogeneous robot formations. This avoids the strategy convergence caused by existing unified control methods and also avoids the problems of large parameter scale, asynchronous updates, and poor scalability caused by existing completely independent control methods. It can group robots according to their functional division and motion constraints, so that robots in the same group can share common control logic, while robots in different groups can retain differentiated strategies. It can allocate individual advantage quantities according to the actual contribution of robots in local working conditions, so that the basis for strategy updates is more consistent with the actual task role undertaken by the robots. This helps to improve the path coordination ability, passage stability, task connection ability, and control reliability of robot formations in dynamic obstacle environments.
[0035] As an embodiment of this application, taking the following warehousing system as an example, the system includes 4 towed handling robots, 3 forklift handling robots, 2 inspection robots, and 1 vision-guided robot. The towed handling robots are mainly responsible for long-distance material transfer, the forklift handling robots are mainly responsible for picking and placing goods in the shelving area, the inspection robots are responsible for aisle inspection and equipment status monitoring, and the vision-guided robot is responsible for target location identification and docking correction. During system operation, the system continuously collects the global position, speed, orientation, load status, battery status, local obstacle information, relative pose of neighboring robots, relative orientation of the target location, task queue status, main aisle occupancy status, and task execution feedback data of each robot. Based on the above data, the system performs global advantage assessment, strategy group contribution decomposition, intra-group advantage allocation, and parameter synchronization updates.
[0036] Based on the algorithm provided in this application, in situations of channel congestion, intersection conflicts, and temporary obstacles, the traction-type transport robot can release the main channel more smoothly, the forklift-type transport robot can reduce invalid waiting time, and the inspection robot can automatically avoid high-priority transport paths. This reduces the number of path competitions, shortens intersection waiting time, and makes task relay smoother. Compared with other general control algorithms, the method in this application can better maintain the functional division of labor among heterogeneous robots and improve the overall efficiency and stability of multi-robot collaborative operations in industrial sites.
[0037] In a preferred embodiment, the interaction trajectory data generated when the heterogeneous robot formation performs a collaborative task is obtained as follows: acquire the global state data $s_t$ at time step $t$, the global state data $s_{t+1}$ at time step $t+1$, the real-time task execution feedback data $r_t$, the local observation data $o_t^i$ of the $i$-th heterogeneous robot at time $t$, and the joint action data $a_t$ of the heterogeneous robot formation at time $t$, to obtain the interaction trajectory data $\tau_t$: $\tau_t = (s_t, \{o_t^i\}, a_t, r_t, s_{t+1})$. The local observation data includes at least the robot's own state, the state information of the interacting objects, and the results of local environmental perception. $s_t$ can represent the absolute position of all robots in the workshop map, the occupancy status of each aisle, the queuing status of workstations, the task queue status, and the distribution status of major obstacles; $s_{t+1}$ can characterize the changes in the global environment after the joint action is taken; $o_t^i$ can represent the robot's own speed, turning angle, load, local point cloud, relative position of neighboring robots, local passable area, and relative pose of the target point; $a_t$ can represent the set of actions that all robots perform together at that moment; $r_t$ can represent rewards for task completion, penalties for delays, penalties for congestion, and rewards for successful obstacle avoidance. The interacting objects can include neighboring robots, target shelves, target workstations, or dynamic obstacles that will soon cross paths with, hand over tasks to, or yield to the current robot. The interaction trajectory data acquired in this way can comprehensively reflect the perception state, action execution state, and task result state of heterogeneous robot formations in dynamic industrial scenarios.
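As an illustration of the record described above, one interaction trajectory sample could be stored as follows. The class name, field names, and robot identifiers are hypothetical; only the five components (two global states, the per-robot local observations, the joint action, and the task feedback) come from the text.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Transition:
    """One interaction-trajectory sample for the whole formation at time t."""
    state: List[float]                    # global state at time t
    observations: Dict[str, List[float]]  # local observation of each robot at time t
    joint_action: Dict[str, int]          # action taken by each robot at time t
    reward: float                         # task execution feedback
    next_state: List[float]               # global state at time t+1


def build_transition(state, observations, joint_action, reward, next_state) -> Transition:
    # Every robot that reported an observation must also have an action.
    if set(observations) != set(joint_action):
        raise ValueError("observations and joint_action must cover the same robots")
    return Transition(list(state), dict(observations), dict(joint_action),
                      float(reward), list(next_state))


sample = build_transition(
    state=[0.0, 1.0],
    observations={"tow_1": [0.5], "fork_1": [0.2]},
    joint_action={"tow_1": 0, "fork_1": 2},
    reward=1.5,
    next_state=[0.1, 1.0],
)
```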
[0038] In a preferred embodiment, the method for obtaining the global advantage quantity characterizing the overall cooperative control effect of the heterogeneous robot formation by performing centralized value assessment based on the interaction trajectory data includes: constructing a centralized value network that takes the global state data $s_t$ in the interaction trajectory data as input and outputs the global state value $V(s_t)$ corresponding to time $t$; obtaining the temporal difference component based on the task execution feedback data and the global state values: $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$, where $\gamma$ denotes the discount factor, $V(s_{t+1})$ is the global state value at time $t+1$, $V(s_t)$ is the global state value at time $t$, and $\delta_t$ denotes the temporal difference component; and obtaining the global advantage quantity based on the temporal difference components: $A_t = \sum_{l=0}^{L-1} (\gamma\lambda)^{l}\,\delta_{t+l}$, where $\lambda$ denotes the advantage smoothing parameter, $L$ denotes the cumulative length of the advantage, $l$ is the summation index, $\delta_{t+l}$ is the temporal difference component at time $t+l$, and $A_t$ denotes the global advantage quantity.
[0039] In the embodiments of this application, the global state value $V(s_t)$ can represent the expected cumulative task benefit that the formation can obtain by continuing to execute the task according to the current strategy. This benefit can be reflected through multiple quantifiable indicators, such as the amount of material handled per unit time, total waiting time, aisle congestion level, number of yielding conflicts in intersection areas, success rate of temporary obstacle avoidance, and task timeout rate. The temporal difference component $\delta_t$ characterizes the deviation between the actual feedback at the current moment and the value network's prediction. A larger value indicates that the current action had a strong positive impact; conversely, a smaller value indicates that the current action has not achieved the expected synergistic effect. The global advantage quantity $A_t$ is the comprehensive judgment of whether the joint control behavior at the current moment is better than the baseline state. For example, in a warehousing and logistics center, if a set of actions ensures timely clearing of the main aisle, smooth handover of goods handling, and unobstructed passage of inspection robots, the corresponding global advantage is high; if it leads to congestion in intersection areas, increased waiting time at workstations, or failure of local avoidance, the corresponding global advantage is low. In this way, scattered industrial operation indicators can be aggregated into a unified evaluation signal that can be used for subsequent group-level and individual-level updates.
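The two-stage computation described above (temporal difference components first, then a discounted, smoothed sum over a finite window) can be sketched as below. This assumes the standard truncated generalized-advantage form; the function names are hypothetical, and the value estimates are passed in as plain numbers instead of coming from a trained value network.

```python
from typing import List


def td_errors(rewards: List[float], values: List[float], gamma: float) -> List[float]:
    """Temporal difference components r_t + gamma * V(s_{t+1}) - V(s_t).

    `values` must hold one more entry than `rewards` (the value of the
    state reached after the last step).
    """
    if len(values) != len(rewards) + 1:
        raise ValueError("values must have exactly one more entry than rewards")
    return [r + gamma * values[t + 1] - values[t] for t, r in enumerate(rewards)]


def global_advantages(deltas: List[float], gamma: float, lam: float, horizon: int) -> List[float]:
    """Sum up to `horizon` future TD components, weighted by (gamma * lam) ** l."""
    advantages = []
    for t in range(len(deltas)):
        total, weight = 0.0, 1.0
        for l in range(horizon):
            if t + l >= len(deltas):
                break  # trajectory ended before the window was filled
            total += weight * deltas[t + l]
            weight *= gamma * lam
        advantages.append(total)
    return advantages


deltas = td_errors(rewards=[1.0, 0.0, 2.0], values=[0.5, 0.5, 0.5, 0.0], gamma=1.0)
advs = global_advantages(deltas, gamma=1.0, lam=0.5, horizon=2)
```

A smaller smoothing parameter makes the advantage depend mostly on the immediate temporal difference component; a larger one lets later feedback influence the estimate more.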
[0040] In a preferred embodiment, the method for dividing heterogeneous robots into several strategy groups is as follows: Obtain the functional and motion attributes of each heterogeneous robot; Based on preset grouping rules, heterogeneous robots whose functional attributes belong to the same category and whose motion attributes meet the same group constraints are assigned the same strategy group identifier and grouped into the same strategy group, resulting in several strategy groups.
[0041] In this embodiment, functional attributes can represent the robot's task role and business responsibilities in the industrial field, such as traction and handling, forklift operations, aisle inspection, target guidance, and docking correction. Motion attributes can represent the robot's physical constraints and execution capabilities in motion control, such as maximum speed, maximum acceleration, turning radius, load capacity, braking capacity, lifting range, chassis type, and aisle width. Preset grouping rules are used to classify heterogeneous robots with similar functional roles and compatible motion characteristics into the same strategy group, ensuring that robots within the same group have shareable control objectives and basic motion logic. For example, traction-type handling robots can be divided into a traction and handling group. Robots within the same group can share basic strategy parameters, reducing the cost of repetitive modeling; functional differences are preserved between different groups to avoid strategy convergence problems caused by all robots using the same update logic. The advantage is that it ensures the consistency of control logic within a group and preserves the adaptability between heterogeneous robots.
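A rule-based grouping of this kind could look like the following sketch. The specific rule used here (same functional attribute and same maximum-speed band) is an assumption chosen for illustration; the source leaves the preset grouping rules open, and the attribute names and group identifiers are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class RobotProfile:
    robot_id: str
    function: str     # functional attribute, e.g. "towing" or "forklift"
    max_speed: float  # one motion attribute, in m/s


def speed_band(max_speed: float, band_width: float) -> int:
    """Bucket a maximum speed so that similar robots share a band."""
    return int(max_speed // band_width)


def assign_strategy_groups(robots: List[RobotProfile], band_width: float = 1.0) -> Dict[str, List[str]]:
    """Robots with the same function and speed band get the same group identifier."""
    groups = defaultdict(list)
    for robot in robots:
        key = (robot.function, speed_band(robot.max_speed, band_width))
        groups[key].append(robot.robot_id)
    return {f"{fn}_band{band}": ids for (fn, band), ids in groups.items()}


fleet = [
    RobotProfile("tow_1", "towing", 1.2),
    RobotProfile("tow_2", "towing", 1.8),
    RobotProfile("fork_1", "forklift", 0.8),
]
strategy_groups = assign_strategy_groups(fleet)
```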
[0042] In a preferred embodiment, a clustering method can be used to group heterogeneous robots with similar functional attributes into the same category. Furthermore, because robots may leave a group due to malfunctions or other reasons, the grouping varies over time and space. Therefore, the grouping method may further include: within a preset training period, constructing behavioral feature samples from the interaction trajectory data and extracting a latent variable representation for each heterogeneous robot using a variational autoencoder (VAE); based on the latent variable representations, using a clustering algorithm to subdivide the robots within each initial strategy group into adaptive sub-strategy groups; during system operation, triggering a sub-strategy group update according to a preset regrouping period or when the change in environmental state exceeds a threshold; and using a parameter migration mechanism during the update to maintain the continuity of the shared strategy parameters.
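The regrouping trigger and the sub-grouping step can be sketched as below. The sketch assumes the latent vectors have already been produced by a trained VAE encoder (not shown) and uses plain k-means as one possible clustering algorithm; the specification does not name a particular algorithm, and all function names and values here are illustrative.

```python
import math
from typing import List, Sequence


def should_regroup(step: int, period: int, state_change: float, threshold: float) -> bool:
    """Regroup on a fixed period or when the environment changed more than the threshold."""
    return step % period == 0 or state_change > threshold


def kmeans_labels(points: Sequence[Sequence[float]], k: int, iters: int = 20) -> List[int]:
    """Plain k-means over latent vectors; the first k points seed the centres (sketch only)."""
    centres = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest centre by Euclidean distance.
        labels = [min(range(k), key=lambda c: math.dist(p, centres[c])) for p in points]
        # Update step: each centre moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centres[c] = [sum(x) / len(members) for x in zip(*members)]
    return labels


# Hypothetical 2-D latent representations of four robots in one initial strategy group.
latents = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]
sub_groups = kmeans_labels(latents, k=2)
```

In a fuller system the parameter migration step would decide how each new sub-strategy group inherits shared strategy parameters; that step is left out here because the specification does not detail it.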
[0043] In a preferred embodiment, the method for decomposing the global advantage quantity into the conditional group advantage quantity corresponding to each strategy group, based on the contribution of each strategy group to the global advantage quantity, and for updating the shared strategy parameters of each strategy group at the group level according to the conditional group advantage quantity, is as follows: obtain, from the interaction trajectory data, the joint action data $a_t$ of the heterogeneous robot formation at time $t$; randomly sort all strategy groups to obtain an update sequence that represents the update order of the strategy groups; for the $k$-th strategy group $G_k$ in the update sequence, based on the global state $s_t$ at time $t$, the joint actions $a_t^{G_{<k}}$ of all previously updated strategy groups, the joint action $a_t^{G_k}$ of the current strategy group, and the joint actions $a_t^{G_{>k}}$ of all strategy groups that have not yet been updated, obtain the contribution coefficient $c_k$ of the $k$-th strategy group to the global advantage quantity: $c_k = f\left(s_t, a_t^{G_{<k}}, a_t^{G_k}, a_t^{G_{>k}}\right)$; where $c_k$ denotes the contribution coefficient of the $k$-th strategy group to the global advantage quantity and $f(\cdot)$ denotes the contribution determination function; based on the contribution coefficient $c_k$ and the global advantage quantity $A_t$, obtain the conditional group advantage quantity $A_t^{G_k}$ corresponding to the $k$-th strategy group: $A_t^{G_k} = c_k \cdot A_t$; where $A_t$ denotes the global advantage quantity and $A_t^{G_k}$ denotes the conditional group advantage quantity corresponding to the $k$-th strategy group; based on the conditional group advantage quantity, construct the group-level update objective function of the current strategy group, and perform a gradient update on the shared strategy parameters of the current strategy group according to the group-level update objective function.
[0044] In this embodiment, decomposing the global advantage into conditional advantage values corresponding to each strategy group can further refine the overall formation evaluation result into the contribution evaluation results of different strategy groups to the overall collaborative effect at the current moment. The update sequence obtained by random sorting is used to determine the order of each strategy group in this round of parameter updates. The purpose is to avoid the bias caused by long-term use of a fixed order and reduce the mutual interference caused by simultaneous updates of different strategy groups.
[0045] The contribution coefficient $c_k$ characterizes the contribution of each group's actions to the overall cooperative effect. The contribution determination function $f(\cdot)$ comprehensively considers the relationships between the global state, the actions of previously updated strategy groups, the actions of the current strategy group, and the actions of the strategy groups not yet updated, in order to determine the weight of the current strategy group in the overall coordination. For example, when the traction and handling group is in a bottleneck area of the main channel and its actions directly affect the passage of the forklift operation group and the inspection group, the contribution coefficient of the traction and handling group can be relatively high; when a strategy group only performs low-coupling tasks in an edge area, its contribution coefficient can be relatively low. The conditional group advantage quantity, obtained by combining the contribution coefficient with the global advantage quantity, reflects more accurately the actual impact of the current strategy group in this round of updates and gives the group-level update a clear direction. Through the group-level update objective function and the gradient update, the shared strategy parameters of the current strategy group are adjusted in the direction that increases the group's positive contribution to the overall coordination effect.
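The randomly ordered decomposition can be sketched as follows. The sketch assumes that the conditional group advantage quantity is the product of the contribution coefficient and the global advantage quantity, and it uses a toy `bottleneck_contribution` rule in place of the contribution determination function, whose concrete form the specification leaves open. All names and numbers are hypothetical.

```python
import random
from typing import Any, Callable, Dict, List, Tuple

Action = Dict[str, Any]


def bottleneck_contribution(state: Dict[str, Any], updated: Dict[str, Action],
                            current: Action, pending: Dict[str, Action]) -> float:
    """Hypothetical rule: a group acting inside the bottleneck zone gets a larger weight."""
    return 0.8 if current["zone"] == state["bottleneck_zone"] else 0.2


def conditional_group_advantages(
    group_ids: List[str], global_adv: float,
    contribution_fn: Callable[[Dict[str, Any], Dict[str, Action], Action, Dict[str, Action]], float],
    state: Dict[str, Any], joint_actions: Dict[str, Action], rng: random.Random,
) -> Tuple[List[str], Dict[str, float]]:
    order = list(group_ids)
    rng.shuffle(order)  # random update sequence for this round
    result: Dict[str, float] = {}
    for pos, group in enumerate(order):
        updated = {g: joint_actions[g] for g in order[:pos]}      # already updated groups
        pending = {g: joint_actions[g] for g in order[pos + 1:]}  # groups still to update
        coeff = contribution_fn(state, updated, joint_actions[group], pending)
        result[group] = coeff * global_adv  # assumed product form
    return order, result


state = {"bottleneck_zone": "main_aisle"}
joint_actions = {
    "traction": {"zone": "main_aisle"},
    "forklift": {"zone": "dock"},
    "inspection": {"zone": "edge"},
}
order, cond_adv = conditional_group_advantages(
    list(joint_actions), 1.5, bottleneck_contribution, state, joint_actions, random.Random(0)
)
```

Passing a seeded `random.Random` keeps the update sequence reproducible during debugging while still varying the order across rounds when the seed changes.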
[0046] In a preferred embodiment, the method for constructing the group-level update objective function of the current strategy group based on the conditional group advantage quantity, and performing a gradient update on the shared strategy parameters of the current strategy group according to that function, is as follows: based on the conditional group advantage quantity $A_t^{G_k}$ corresponding to the current strategy group and the conditional probability with which the current strategy group outputs the current joint action $a_t^{G_k}$ under the shared strategy parameters $\theta_k$, construct the group-level update objective function of the current strategy group: $J(\theta_k) = \mathbb{E}_t\left[\pi_{\theta_k}\left(a_t^{G_k} \mid o_t^{G_k}\right) A_t^{G_k}\right]$; update the shared strategy parameters according to the gradient of the group-level update objective function with respect to the shared strategy parameters of the current strategy group: $\theta_k \leftarrow \theta_k + \alpha_G \nabla_{\theta_k} J(\theta_k)$; and obtain the updated shared strategy parameters of the current strategy group. Here, $J(\theta_k)$ denotes the group-level update objective function of the current strategy group $G_k$ and represents the optimization objective of its shared strategy parameters; $\mathbb{E}_t[\cdot]$ denotes the expectation over the training samples corresponding to the current strategy group, which can be computed as the average over the trajectory samples of the current strategy group at each time step within a training batch; $\pi_{\theta_k}$ denotes the shared strategy function corresponding to the current strategy group, which determines the probability distribution of the joint action from the input observations under the shared strategy parameters; $\theta_k$ denotes the shared strategy parameters of the current strategy group $G_k$; $o_t^{G_k}$ denotes the set of local observation data of the current strategy group $G_k$ at time $t$; the symbol "$\mid$" indicates a conditional relationship, so that $\pi_{\theta_k}\left(a_t^{G_k} \mid o_t^{G_k}\right)$ is the conditional probability that the current strategy group outputs the joint action $a_t^{G_k}$ given the local observation data $o_t^{G_k}$; the conditional group advantage quantity $A_t^{G_k}$ characterizes the contribution of the current strategy group's action selection to the overall cooperative control effect, given that the preceding strategy groups have been updated and the subsequent strategy groups retain their pre-update strategies; $\nabla_{\theta_k} J(\theta_k)$ denotes the gradient of the group-level update objective function with respect to the shared strategy parameters of the current strategy group and represents the direction and magnitude of their optimization; $\alpha_G$ denotes the group-level update learning rate, which controls the step size by which the shared strategy parameters of the current strategy group are adjusted in a single gradient update; and $t$ denotes time.
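A group-level update step of this kind might look like the following sketch. It assumes a linear softmax policy over a small discrete action set and an objective equal to the batch mean of the action probability multiplied by the conditional group advantage quantity; weighting the log-probability instead is a common alternative. The parameter layout, learning rate, and sample batch are illustrative assumptions.

```python
import math
from typing import List, Sequence, Tuple


def softmax(logits: Sequence[float]) -> List[float]:
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def action_probs(theta: List[List[float]], obs: Sequence[float]) -> List[float]:
    """Shared policy: one weight row per action, logits are dot products with the observation."""
    return softmax([sum(w * x for w, x in zip(row, obs)) for row in theta])


def group_level_step(theta: List[List[float]],
                     batch: List[Tuple[Sequence[float], int, float]],
                     lr: float) -> List[List[float]]:
    """One ascent step on the batch mean of pi(a|o) * A, with A the conditional group advantage."""
    grad = [[0.0] * len(theta[0]) for _ in theta]
    for obs, action, adv in batch:
        probs = action_probs(theta, obs)
        for b in range(len(theta)):
            indicator = 1.0 if b == action else 0.0
            # d pi(a|o) / d logit_b = pi(a|o) * (1[a == b] - pi(b|o))
            coeff = probs[action] * (indicator - probs[b]) * adv
            for j, x in enumerate(obs):
                grad[b][j] += coeff * x / len(batch)
    return [[w + lr * g for w, g in zip(row, g_row)] for row, g_row in zip(theta, grad)]


theta = [[0.0, 0.0], [0.0, 0.0]]
batch = [([1.0, 0.0], 1, 2.0)]  # (group observation, joint action index, advantage)
new_theta = group_level_step(theta, batch, lr=0.1)
```

With a positive conditional group advantage, the step raises the probability of the sampled joint action; with a negative one, it lowers it.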
[0047] In a preferred embodiment, the method for allocating the conditional group advantage quantity within the group to obtain the individual advantage quantity corresponding to each heterogeneous robot is as follows: construct an intra-group advantage allocation network that takes the local observation data and individual identification parameters of each heterogeneous robot within the strategy group as input and outputs the allocation coefficient $w_i^t$ corresponding to each heterogeneous robot: $w_i^t = \dfrac{\exp\left(e_i^t\right)}{\sum_{j \in G_k} \exp\left(e_j^t\right)}$, with $e_i^t = g_\phi\left(o_i^t, \eta_i\right)$; where $w_i^t$ is the allocation coefficient; $\exp(\cdot)$ denotes the exponential function; $G_k$ denotes the $k$-th strategy group participating in intra-group advantage allocation; $j$ is an index over the heterogeneous robots in the group; $e_i^t$ denotes the intra-group allocation score corresponding to the $i$-th heterogeneous robot, which characterizes the priority of that robot when the conditional group advantage quantity is allocated within the current strategy group; $g_\phi$ denotes the intra-group advantage allocation network; $\phi$ denotes the network parameters of the intra-group advantage allocation network; $o_i^t$ denotes the local observation data of the $i$-th heterogeneous robot at time $t$; $\eta_i$ denotes the individual identification parameters of the $i$-th heterogeneous robot; $i$ denotes the robot index; and $t$ denotes time; based on the allocation coefficient, allocate the conditional group advantage quantity corresponding to each strategy group into the individual advantage quantity corresponding to each heterogeneous robot within the group: $A_i^t = w_i^t \cdot A_t^{G_k}$; where $A_i^t$ is the individual advantage quantity and $A_t^{G_k}$ is the conditional group advantage quantity corresponding to $G_k$.
[0048] In the embodiments of this application, the intra-group allocation score $e_i^t$ can characterize the priority of a heterogeneous robot when the conditional group advantage quantity is allocated within the current strategy group. The intra-group advantage allocation network $g_\phi$ can be used to generate the intra-group allocation scores of the heterogeneous robots from the input local observation data and individual identification parameters. The role of the intra-group advantage allocation network is to determine the relative contribution of each robot to the current group-level cooperative result, based on the local observation state and the individual identity differences of the robots within the same strategy group. The allocation coefficient $w_i^t$ can be used to quantitatively evaluate the importance of a robot's current local working condition. It changes dynamically with the robot's location, task, surrounding obstacles, and the state of interacting objects. For example, in a forklift operation group, a forklift robot located near the target location and already aligned can receive a higher intra-group allocation score than a robot in the same group that is far from the location and waiting. In an inspection group, an inspection robot located at a main aisle intersection and performing path monitoring tasks can have a higher allocation coefficient than an inspection robot in a regular inspection section. The conditional group advantage quantity can thus be further transformed into individual advantage quantities. This ensures that each robot receives an update basis that matches its actual function, avoids the situation where all robots in the group receive exactly the same update signal, and improves the match between individual control behavior and local working conditions.
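The intra-group allocation can be sketched with a softmax over the allocation scores, which is consistent with the exponential normalisation described above. The scores below are hypothetical placeholders for the outputs of the intra-group advantage allocation network, and the product of coefficient and conditional group advantage quantity is an assumed form.

```python
import math
from typing import Dict, Tuple


def allocate_individual_advantages(
    scores: Dict[str, float], cond_group_adv: float
) -> Tuple[Dict[str, float], Dict[str, float]]:
    """Softmax the intra-group scores, then split the conditional group advantage."""
    peak = max(scores.values())  # subtract the maximum for numerical stability
    exps = {rid: math.exp(s - peak) for rid, s in scores.items()}
    total = sum(exps.values())
    coeffs = {rid: e / total for rid, e in exps.items()}
    individual = {rid: c * cond_group_adv for rid, c in coeffs.items()}
    return coeffs, individual


# Hypothetical scores: the aligned forklift near its target outranks the waiting one.
scores = {"forklift_aligned": 1.2, "forklift_waiting": -0.3}
coeffs, individual_adv = allocate_individual_advantages(scores, cond_group_adv=2.0)
```

Because the coefficients sum to one, the individual advantage quantities always add up to the conditional group advantage quantity being distributed.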
[0049] In a preferred embodiment, the method for synchronously updating the shared strategy parameters of the strategy group and the personalized strategy parameters of each heterogeneous robot within the group based on the individual advantage quantities is as follows: assign a unique identifier $\mathrm{id}_i$ to each heterogeneous robot, and read the embedding vector corresponding to each heterogeneous robot from the embedding parameter matrix $E$ to obtain the personalized strategy parameters of the $i$-th heterogeneous robot: $E \in \mathbb{R}^{N \times d}$; $z_i = E[\mathrm{id}_i]$; where $N$ denotes the number of heterogeneous robots, $d$ denotes the dimension of the personalized strategy parameters, $z_i$ denotes the personalized strategy parameters, $E[\mathrm{id}_i]$ denotes the row vector obtained by indexing the embedding parameter matrix $E$ with the unique identifier $\mathrm{id}_i$, and $\mathbb{R}$ denotes the set of real numbers; for the $i$-th heterogeneous robot in strategy group $G_k$, compute the importance sampling ratio $\rho_i^t$ from the conditional probabilities of the same control action: $\rho_i^t = \dfrac{\pi_{\theta_k, z_i}\left(a_i^t \mid o_i^t\right)}{\pi_{\theta_k^{\mathrm{old}}, z_i^{\mathrm{old}}}\left(a_i^t \mid o_i^t\right)}$; where $\pi_{\theta_k, z_i}$ denotes the current strategy jointly determined by the shared strategy parameters $\theta_k$ and the personalized strategy parameters $z_i$; $\pi_{\theta_k^{\mathrm{old}}, z_i^{\mathrm{old}}}$ denotes the pre-update strategy, jointly determined by the pre-update shared strategy parameters and the pre-update personalized strategy parameters; $t$ denotes time; $a_i^t$ denotes the control action of the $i$-th heterogeneous robot at time $t$; $o_i^t$ denotes the local observation data of the $i$-th heterogeneous robot at time $t$; and $\theta_k$ denotes the shared strategy parameters; clip the importance sampling ratio to obtain: $\bar{\rho}_i^t = \mathrm{clip}\left(\rho_i^t, 1 - \epsilon, 1 + \epsilon\right)$; where $\epsilon$ denotes the clipping factor and $\bar{\rho}_i^t$ denotes the clipped importance sampling ratio; construct the individual strategy loss function $L_i$ from the clipped importance sampling ratio, the individual advantage quantity, and the KL divergence between the old and new strategies: $L_i = -\bar{\rho}_i^t A_i^t + \beta\, D_{\mathrm{KL}}\left(\pi_{\theta_k^{\mathrm{old}}, z_i^{\mathrm{old}}}\left(\cdot \mid o_i^t\right) \,\|\, \pi_{\theta_k, z_i}\left(\cdot \mid o_i^t\right)\right)$; where $\beta$ denotes the KL divergence regularization coefficient; $\pi_{\theta_k, z_i}\left(a_i^t \mid o_i^t\right)$ denotes the conditional probability that the current strategy, jointly determined by the shared strategy parameters $\theta_k$ and the personalized strategy parameters $z_i$, outputs the control action $a_i^t$ given the local observation data $o_i^t$; $\pi_{\theta_k^{\mathrm{old}}, z_i^{\mathrm{old}}}\left(a_i^t \mid o_i^t\right)$ denotes the corresponding conditional probability under the pre-update strategy; $A_i^t$ denotes the individual advantage quantity of the $i$-th heterogeneous robot; $L_i$ denotes the individual strategy loss function of the $i$-th heterogeneous robot; $D_{\mathrm{KL}}$ denotes the Kullback-Leibler divergence between the old and new strategies; and "$\|$" indicates the comparison relationship between the two probability distributions in the divergence calculation; based on the individual strategy loss functions of the heterogeneous robots within the current strategy group, compute the intra-group average gradient with respect to the shared strategy parameters: $\bar{g}_k = \dfrac{1}{|G_k|} \sum_{i \in G_k} \nabla_{\theta_k} L_i$; where $|G_k|$ denotes the number of heterogeneous robots within the current strategy group, $\bar{g}_k$ denotes the intra-group average gradient, and $G_k$ denotes the $k$-th strategy group; update the shared strategy parameters based on the intra-group average gradient: $\theta_k \leftarrow \theta_k - \alpha_\theta\, \bar{g}_k$; where $\alpha_\theta$ denotes the learning rate of the shared strategy parameters and "$\leftarrow$" denotes assignment; update the personalized strategy parameters based on the individual strategy loss function of each heterogeneous robot: $z_i \leftarrow z_i - \alpha_z \nabla_{z_i} L_i$; where $\alpha_z$ denotes the learning rate of the personalized strategy parameters; and broadcast the updated shared strategy parameters to all heterogeneous robots within the current strategy group.
[0050] In the embodiments of this application, $\mathrm{clip}(\cdot)$ represents a clipping function used to limit the importance sampling ratio within a preset range. The embedding parameter matrix $E$ can be a lookup table or an embedding table; because the unique identifiers of different robots correspond to different row vectors, different robots within the same strategy group obtain different personalized strategy parameters. The importance sampling ratio $\rho_i^t$ measures how much the probability of the same control action changes between the updated strategy and the original strategy, and thereby determines the magnitude of the offset between the old and new strategies. The individual strategy loss function $L_i$ takes into account the effect of the individual advantage quantity on action quality and uses the KL divergence term to limit excessive differences between the distributions of the old and new strategies, thereby improving training stability. The intra-group average gradient $\bar{g}_k$ summarizes the update directions of all robots in the current strategy group for the shared strategy parameters, so that the shared strategy parameters reflect the common adjustment trend within the group; the personalized strategy parameters $z_i$ are updated independently to preserve the differences between robots within the group. The updated shared strategy parameters are broadcast to all heterogeneous robots within the current strategy group, so that robots in the same group use consistent basic strategy logic in the next round of control decisions. The advantage is that the basic control behavior of the robots in the group remains consistent while each robot keeps differentiated decision-making capabilities based on its own identity, location, and local working conditions.
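The individual-level loss and the intra-group averaging can be sketched as below. The sketch assumes discrete action distributions with strictly positive probabilities, a clipping range of one plus or minus the clipping factor, and a loss equal to the negated clipped ratio times the individual advantage quantity plus a weighted KL term. The exact loss used in the specification may differ, and every name and number here is illustrative.

```python
import math
from typing import List, Sequence


def clip_ratio(ratio: float, eps: float) -> float:
    """Limit the importance sampling ratio to [1 - eps, 1 + eps]."""
    return max(1.0 - eps, min(1.0 + eps, ratio))


def kl_divergence(old: Sequence[float], new: Sequence[float]) -> float:
    """KL(old || new) for discrete distributions; new is assumed strictly positive."""
    return sum(p * math.log(p / q) for p, q in zip(old, new) if p > 0.0)


def individual_loss(new_probs: Sequence[float], old_probs: Sequence[float], action: int,
                    advantage: float, eps: float, beta: float) -> float:
    ratio = new_probs[action] / old_probs[action]
    return -clip_ratio(ratio, eps) * advantage + beta * kl_divergence(old_probs, new_probs)


def average_gradient(per_robot_grads: List[List[float]]) -> List[float]:
    """Intra-group mean of per-robot gradients with respect to the shared parameters."""
    count = len(per_robot_grads)
    return [sum(column) / count for column in zip(*per_robot_grads)]


old_probs, new_probs = [0.5, 0.5], [0.6, 0.4]
loss_no_kl = individual_loss(new_probs, old_probs, action=0, advantage=1.0, eps=0.1, beta=0.0)
loss_with_kl = individual_loss(new_probs, old_probs, action=0, advantage=1.0, eps=0.1, beta=1.0)
mean_grad = average_gradient([[1.0, 2.0], [3.0, 4.0]])
```

In this example the raw ratio of 1.2 is clipped to 1.1, so the strategy gains nothing from moving further from the old strategy in one step, and the KL term adds a further penalty on the shift.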
[0051] In a preferred embodiment, after the shared strategy parameters and personalized strategy parameters are updated, the updated shared strategy parameters are broadcast to each heterogeneous robot in the current strategy group, and an update lock is set so that no heterogeneous robot in the current strategy group performs new environmental interactions or strategy calculations before the shared strategy parameters have been synchronized. After the shared strategy parameters are synchronized, the stored pre-update shared strategy parameters and pre-update personalized strategy parameters are overwritten with the current shared strategy parameters and the current personalized strategy parameters, respectively, for use in the next round of importance sampling ratio calculation.
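The broadcast-and-lock behaviour can be sketched with a small in-process coordinator. The class name, the acknowledgement mechanism, and the in-memory parameter lists are hypothetical; a deployed system would synchronize over its own communication layer.

```python
import threading
from typing import Iterable, List, Optional, Set


class SharedParamSync:
    """Tracks which robots have received the latest shared strategy parameters."""

    def __init__(self, robot_ids: Iterable[str], params: List[float]) -> None:
        self._guard = threading.Lock()
        self._robot_ids: Set[str] = set(robot_ids)
        self._pending: Set[str] = set()            # robots that have not acknowledged yet
        self._staged: Optional[List[float]] = None
        self.current = list(params)                # parameters used for control decisions
        self.old = list(params)                    # snapshot for the next importance ratio

    def broadcast(self, new_params: List[float]) -> None:
        with self._guard:
            self._staged = list(new_params)
            self._pending = set(self._robot_ids)   # the update lock is now set

    def acknowledge(self, robot_id: str) -> None:
        with self._guard:
            self._pending.discard(robot_id)
            if not self._pending and self._staged is not None:
                self.current = self._staged
                self.old = list(self.current)      # pre-update copy becomes the current one
                self._staged = None

    def can_act(self) -> bool:
        """Robots may interact with the environment only when no update is pending."""
        with self._guard:
            return not self._pending


sync = SharedParamSync(["r1", "r2"], [0.0])
sync.broadcast([1.0])
locked_after_broadcast = not sync.can_act()
sync.acknowledge("r1")
sync.acknowledge("r2")
```

Once every robot has acknowledged, the old and current snapshots coincide, so the importance sampling ratio of the next round starts from one.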
[0052] As shown in Figure 2, in one embodiment, a heterogeneous robot formation cooperative control strategy update device is provided, which may specifically include: a raw data acquisition unit 510, used to acquire the interaction trajectory data generated when the heterogeneous robot formation performs collaborative tasks; a global advantage quantity acquisition unit 520, used to perform centralized value assessment based on the interaction trajectory data to obtain the global advantage quantity that characterizes the overall cooperative control effect of the heterogeneous robot formation; a strategy group update unit 530, used to divide the heterogeneous robots into several strategy groups, decompose the global advantage quantity into the conditional group advantage quantity corresponding to each strategy group based on the contribution of each strategy group to the global advantage quantity, and update the shared strategy parameters of each strategy group at the group level according to the conditional group advantage quantity; a personalized strategy update unit 540, used to allocate the conditional group advantage quantity within the group based on the local observation data and individual identification parameters of each heterogeneous robot in the strategy group, so as to obtain the individual advantage quantity corresponding to each heterogeneous robot, and, based on the individual advantage quantity, to synchronously update the shared strategy parameters of the strategy group and the personalized strategy parameters of each heterogeneous robot in the group; and an instruction issuing unit 550, used to issue the updated shared strategy parameters and personalized strategy parameters to the controller of each heterogeneous robot to generate collaborative task control instructions.
[0053] In the embodiments of this application, for the explanation and description of the above heterogeneous robot formation cooperative control strategy update device, refer to the explanation and description of the corresponding method above, which will not be repeated here.
[0054] Figure 3 shows an internal structural diagram of a computer device in one embodiment. As shown in Figure 3, the computer device includes a processor, memory, a network interface, an input device, and a display screen connected via a system bus. The memory includes a non-volatile storage medium and internal memory. The non-volatile storage medium stores an operating system and may also store a computer program. When executed by the processor, this computer program enables the processor to implement the heterogeneous robot formation cooperative control strategy update method. The internal memory may also store a computer program which, when executed by the processor, enables the processor to execute the heterogeneous robot formation cooperative control strategy update method.
[0055] Those skilled in the art will understand that the structure shown in Figure 3 is merely a block diagram of the part of the structure related to the present application and does not constitute a limitation on the computer device to which the present application is applied. A specific computer device may include more or fewer components than those shown in the figure, combine certain components, or have a different arrangement of components.
[0056] In one embodiment, the heterogeneous robot formation cooperative control strategy update device provided in this application can be implemented as a computer program that can run on a computer device such as the one shown in Figure 3. The memory of the computer device can store the various program modules that make up the heterogeneous robot formation cooperative control strategy update device, for example, the raw data acquisition unit 510 and the global advantage quantity acquisition unit 520 shown in Figure 2. The computer program composed of these program modules causes the processor to execute the steps of the heterogeneous robot formation cooperative control strategy update method described in the various embodiments of this application.
[0057] In one embodiment, a computer-readable storage medium is provided, on which a computer program is stored. When the computer program is executed by a processor, the processor performs the steps of the heterogeneous robot formation cooperative control strategy update method as described above.
[0058] In the embodiments of this application, the description of the above-mentioned heterogeneous robot formation cooperative control strategy update method is as described above, and will not be repeated here.
[0059] In one embodiment, a robot control system is provided, including a memory and a processor. The memory stores a computer program that, when executed by the processor, causes the system's processor to perform the steps of the heterogeneous robot formation cooperative control strategy update method as described above.
[0060] In this embodiment, the description of the heterogeneous robot formation cooperative control strategy update method is provided above and will not be repeated here. The robot control system can thereby perform control strategy updates.
[0061] It should be understood that although the steps in the flowcharts of the various embodiments of this application are shown sequentially according to the arrows, these steps are not necessarily executed in the order indicated by the arrows. Unless explicitly stated herein, there is no strict order restriction on the execution of these steps, and they can be executed in other orders. Moreover, at least some steps in each embodiment may include multiple sub-steps or multiple stages. These sub-steps or stages are not necessarily completed at the same time, but can be executed at different times. The execution order of these sub-steps or stages is not necessarily sequential, but can be performed alternately or in turn with other steps or at least a portion of the sub-steps or stages of other steps.
[0062] Those skilled in the art will understand that all or part of the processes in the methods of the above embodiments can be implemented by a computer program instructing related hardware. The program can be stored in a non-volatile computer-readable storage medium, and when executed, it can include the processes of the embodiments of the methods described above. Furthermore, any references to memory, storage, databases, or other media used in the embodiments provided in this application can include non-volatile and / or volatile memory.
[0063] The technical features of the above embodiments can be combined in any way. For the sake of brevity, not all possible combinations of the technical features in the above embodiments are described. However, as long as there is no contradiction in the combination of these technical features, they should be considered to be within the scope of this specification.
Claims
1. A method for updating a heterogeneous robot formation cooperative control strategy, characterized in that the method includes: acquiring interaction trajectory data generated when a heterogeneous robot formation performs collaborative tasks; performing centralized value assessment based on the interaction trajectory data to obtain a global advantage quantity that characterizes the overall cooperative control effect of the heterogeneous robot formation; dividing the heterogeneous robots into several strategy groups, decomposing the global advantage quantity into the conditional group advantage quantity corresponding to each strategy group based on the contribution of each strategy group to the global advantage quantity, and updating the shared strategy parameters of each strategy group at the group level according to the conditional group advantage quantity; allocating the conditional group advantage quantity within the group, based on the local observation data and individual identification parameters of each heterogeneous robot in the strategy group, to obtain the individual advantage quantity corresponding to each heterogeneous robot, and synchronously updating the shared strategy parameters of the strategy group and the personalized strategy parameters of each heterogeneous robot in the group based on the individual advantage quantity; and sending the updated shared strategy parameters and personalized strategy parameters to the controller of each heterogeneous robot to generate collaborative task control instructions.
2. The method for updating the cooperative control strategy of heterogeneous robot formation according to claim 1, characterized in that the method for acquiring the interaction trajectory data generated when the heterogeneous robot formation performs collaborative tasks is as follows: obtain the global state data $s_t$ at time $t$, the global state data $s_{t+1}$ at time $t+1$, the real-time task execution feedback data $r_t$, the local observation data $o_i^t$ of the $i$-th heterogeneous robot at time $t$, and the joint action data $a_t$ of the heterogeneous robot formation at time $t$, to obtain the interaction trajectory data $\tau$: $\tau = \left\{s_t,\, s_{t+1},\, r_t,\, o_i^t,\, a_t\right\}$; the local observation data includes at least the robot's own state, the state information of the interacting objects, and the results of local environmental perception.
3. The method for updating the cooperative control strategy of heterogeneous robot formation according to claim 1, characterized in that the method for obtaining, through centralized value assessment based on the interaction trajectory data, the global advantage quantity characterizing the overall cooperative control effect of the heterogeneous robot formation includes: constructing a centralized value network that takes the global state data $s_t$ in the interaction trajectory data as input and outputs the global state value $V(s_t)$ corresponding to time $t$; obtaining the temporal difference component from the task execution feedback data and the global state values in the interaction trajectory data: $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$; where $\gamma$ denotes the discount factor, $V(s_{t+1})$ is the global state value at time $t+1$, $V(s_t)$ is the global state value at time $t$, and $\delta_t$ denotes the temporal difference component; and obtaining the global advantage quantity from the temporal difference components: $A_t = \sum_{l=0}^{L-1} (\gamma\lambda)^l\, \delta_{t+l}$; where $\lambda$ denotes the advantage smoothing parameter, $L$ denotes the cumulative length of the advantage, $l$ is the accumulation index, $\delta_{t+l}$ is the temporal difference component, and $A_t$ denotes the global advantage quantity.
4. The heterogeneous robot formation cooperative control strategy update method according to claim 1, characterized in that, The method for dividing heterogeneous robots into several strategy groups is as follows: Obtain the functional and motion attributes of each heterogeneous robot; Based on preset grouping rules, heterogeneous robots whose functional attributes belong to the same category and whose motion attributes meet the same group constraints are assigned the same strategy group identifier and grouped into the same strategy group, resulting in several strategy groups.
5. The heterogeneous robot formation cooperative control strategy update method according to claim 1, characterized in that the method for decomposing the global advantage quantity into the conditional group advantage quantity corresponding to each strategy group, based on the contribution of each strategy group to the global advantage quantity, and for updating the shared strategy parameters of each strategy group at the group level according to the conditional group advantage quantity, is as follows: obtain, from the interaction trajectory data, the joint action data $a_t$ of the heterogeneous robot formation at time $t$; randomly sort all strategy groups to obtain an update sequence that represents the update order of the strategy groups; for the $k$-th strategy group $G_k$ in the update sequence, based on the global state $s_t$ at time $t$, the joint actions $a_t^{G_{<k}}$ of all previously updated strategy groups, the joint action $a_t^{G_k}$ of the current strategy group, and the joint actions $a_t^{G_{>k}}$ of all strategy groups that have not yet been updated, obtain the contribution coefficient $c_k$ of the $k$-th strategy group to the global advantage quantity: $c_k = f\left(s_t, a_t^{G_{<k}}, a_t^{G_k}, a_t^{G_{>k}}\right)$; where $c_k$ denotes the contribution coefficient of the $k$-th strategy group to the global advantage quantity and $f(\cdot)$ denotes the contribution determination function; based on the contribution coefficient $c_k$ and the global advantage quantity $A_t$, obtain the conditional group advantage quantity $A_t^{G_k}$ corresponding to the $k$-th strategy group: $A_t^{G_k} = c_k \cdot A_t$; where $A_t$ denotes the global advantage quantity and $A_t^{G_k}$ denotes the conditional group advantage quantity corresponding to the $k$-th strategy group; based on the conditional group advantage quantity, construct the group-level update objective function of the current strategy group, and perform a gradient update on the shared strategy parameters of the current strategy group according to the group-level update objective function.
6. The heterogeneous robot formation cooperative control strategy update method according to claim 1, characterized in that the method for allocating the conditional group advantage quantity within the group to obtain the individual advantage quantity corresponding to each heterogeneous robot is as follows: construct an intra-group advantage allocation network that takes the local observation data and individual identification parameters of each heterogeneous robot within the strategy group as input and outputs the allocation coefficient $w_i^t$ corresponding to each heterogeneous robot: $w_i^t = \dfrac{\exp\left(e_i^t\right)}{\sum_{j \in G_k} \exp\left(e_j^t\right)}$, with $e_i^t = g_\phi\left(o_i^t, \eta_i\right)$; where $w_i^t$ is the allocation coefficient; $\exp(\cdot)$ denotes the exponential function; $G_k$ denotes the $k$-th strategy group participating in intra-group advantage allocation; $j$ is an index over the heterogeneous robots in the group; $e_i^t$ denotes the intra-group allocation score corresponding to the $i$-th heterogeneous robot, which characterizes the priority of that robot when the conditional group advantage quantity is allocated within the current strategy group; $g_\phi$ denotes the intra-group advantage allocation network; $\phi$ denotes the network parameters of the intra-group advantage allocation network; $o_i^t$ denotes the local observation data of the $i$-th heterogeneous robot at time $t$; $\eta_i$ denotes the individual identification parameters of the $i$-th heterogeneous robot; $i$ denotes the robot index; and $t$ denotes time; based on the allocation coefficient, allocate the conditional group advantage quantity corresponding to each strategy group into the individual advantage quantity corresponding to each heterogeneous robot within the group: $A_i^t = w_i^t \cdot A_t^{G_k}$; where $A_i^t$ is the individual advantage quantity and $A_t^{G_k}$ is the conditional group advantage quantity corresponding to $G_k$.
7. The heterogeneous robot formation cooperative control strategy update method according to claim 1, characterized in that the shared strategy parameters of the strategy group and the personalized strategy parameters of each heterogeneous robot within the group are synchronously updated based on the individual advantage quantities as follows:
Assign a unique identifier $u_i$ to each heterogeneous robot, and read the embedding vector corresponding to each heterogeneous robot from the embedding parameter matrix $E$ to obtain the personalized strategy parameters of the $i$-th heterogeneous robot:
$$E \in \mathbb{R}^{N \times d}; \qquad \phi_i = E[u_i];$$
where $N$ denotes the number of heterogeneous robots; $d$ denotes the dimension of the personalized strategy parameters; $\phi_i$ denotes the personalized strategy parameters; $E[u_i]$ denotes the row vector of the embedding parameter matrix $E$ indexed by the unique identifier $u_i$; $\mathbb{R}$ denotes the set of real numbers;
For the $i$-th heterogeneous robot in the $k$-th strategy group, calculate the importance sampling ratio from the conditional probabilities of the same control action:
$$r_i^t = \frac{\pi_{\theta_k,\phi_i}\left(a_i^t \mid o_i^t\right)}{\pi_{\theta_k^{\mathrm{old}},\phi_i^{\mathrm{old}}}\left(a_i^t \mid o_i^t\right)};$$
where $\pi_{\theta_k,\phi_i}$ denotes the current strategy jointly determined by the shared strategy parameters $\theta_k$ and the personalized strategy parameters $\phi_i$; $\pi_{\theta_k^{\mathrm{old}},\phi_i^{\mathrm{old}}}$ denotes the pre-update strategy, jointly determined by the pre-update shared strategy parameters and the pre-update personalized strategy parameters; $t$ denotes time; $a_i^t$ denotes the control action of the $i$-th heterogeneous robot at time $t$; $o_i^t$ denotes the local observation data of the $i$-th heterogeneous robot at time $t$; $\theta_k$ denotes the shared strategy parameters;
Clip the importance sampling ratio to obtain:
$$\bar{r}_i^t = \mathrm{clip}\left(r_i^t,\ 1-\epsilon,\ 1+\epsilon\right);$$
where $\epsilon$ denotes the clipping factor, and $\bar{r}_i^t$ denotes the clipped importance sampling ratio;
Construct the individual strategy loss function from the clipped importance sampling ratio, the individual advantage quantity, and the KL divergence between the old and new strategies:
$$L_i = -\bar{r}_i^t A_i^t + \beta\, D_{\mathrm{KL}}\left(\pi_{\theta_k^{\mathrm{old}},\phi_i^{\mathrm{old}}}\left(a_i^t \mid o_i^t\right) \,\Big\|\, \pi_{\theta_k,\phi_i}\left(a_i^t \mid o_i^t\right)\right);$$
where $\beta$ denotes the KL divergence regularization coefficient; $\pi_{\theta_k,\phi_i}\left(a_i^t \mid o_i^t\right)$ denotes the conditional probability that the current strategy, jointly determined by the shared strategy parameters $\theta_k$ and the personalized strategy parameters $\phi_i$, outputs control action $a_i^t$ given local observation data $o_i^t$; $\pi_{\theta_k^{\mathrm{old}},\phi_i^{\mathrm{old}}}\left(a_i^t \mid o_i^t\right)$ denotes the conditional probability that the pre-update strategy, jointly determined by the pre-update shared strategy parameters and the pre-update personalized strategy parameters, outputs control action $a_i^t$ given local observation data $o_i^t$; $A_i^t$ denotes the individual advantage quantity of the $i$-th heterogeneous robot; $L_i$ denotes the individual strategy loss function of the $i$-th heterogeneous robot; $D_{\mathrm{KL}}$ denotes the Kullback-Leibler divergence between the old and new strategies; "$\|$" denotes the comparison relationship between the two probability distributions in the divergence calculation;
Based on the individual strategy loss functions of the heterogeneous robots within the current strategy group, calculate the intra-group average gradient with respect to the shared strategy parameters:
$$g_k = \frac{1}{|G_k|} \sum_{i \in G_k} \nabla_{\theta_k} L_i;$$
where $|G_k|$ denotes the number of heterogeneous robots within the current strategy group; $g_k$ denotes the intra-group average gradient; $G_k$ denotes the $k$-th strategy group;
Update the shared strategy parameters based on the intra-group average gradient:
$$\theta_k \leftarrow \theta_k - \eta_\theta\, g_k;$$
where $\eta_\theta$ denotes the learning rate of the shared strategy parameters, and "$\leftarrow$" denotes assignment;
Update the personalized strategy parameters based on the individual strategy loss function of each heterogeneous robot:
$$\phi_i \leftarrow \phi_i - \eta_\phi\, \nabla_{\phi_i} L_i;$$
where $\eta_\phi$ denotes the learning rate of the personalized strategy parameters;
Broadcast the updated shared strategy parameters to all heterogeneous robots within the current strategy group.
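The update described in claim 7 can be sketched in plain Python for a discrete action space. The sketch shows the clipped importance sampling ratio, a loss that combines the clipped ratio, the individual advantage quantity and a KL penalty, and the averaged gradient step for the shared parameters. The sign convention (a loss to be minimized by gradient descent), the direction of the KL term, the default values of `eps` and `beta`, and all function names are assumptions made for illustration; per-robot gradients are taken as given rather than computed by automatic differentiation.

```python
import math
from typing import List, Sequence


def clipped_ratio(p_new: float, p_old: float, eps: float = 0.2) -> float:
    """Importance sampling ratio p_new / p_old, clipped to [1 - eps, 1 + eps]."""
    ratio = p_new / p_old
    return min(max(ratio, 1.0 - eps), 1.0 + eps)


def kl_divergence(old: Sequence[float], new: Sequence[float]) -> float:
    """KL(old || new) for discrete distributions.

    Assumes `new` is strictly positive wherever `old` is positive.
    """
    return sum(p * math.log(p / q) for p, q in zip(old, new) if p > 0.0)


def individual_loss(new_dist: Sequence[float], old_dist: Sequence[float],
                    action: int, advantage: float,
                    eps: float = 0.2, beta: float = 0.01) -> float:
    """Assumed loss form: -clipped_ratio * advantage + beta * KL(old || new)."""
    r = clipped_ratio(new_dist[action], old_dist[action], eps)
    return -r * advantage + beta * kl_divergence(old_dist, new_dist)


def group_average_gradient(grads: Sequence[Sequence[float]]) -> List[float]:
    """Average per-robot gradients of the shared parameters within one group."""
    n = len(grads)
    return [sum(column) / n for column in zip(*grads)]


def gradient_step(params: Sequence[float], grad: Sequence[float],
                  lr: float) -> List[float]:
    """One gradient-descent step: params <- params - lr * grad."""
    return [p - lr * g for p, g in zip(params, grad)]


if __name__ == "__main__":
    # Two robots in one strategy group; gradients are placeholder values.
    shared = [1.0, 1.0]
    g = group_average_gradient([[1.0, 2.0], [3.0, 4.0]])
    shared = gradient_step(shared, g, lr=0.1)  # then broadcast to the group
    print(shared)
```

The same `gradient_step` helper would apply to each robot's personalized parameters, using that robot's own gradient instead of the group average.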
8. A heterogeneous robot formation cooperative control strategy update device, characterized in that the device comprises:
a raw data acquisition unit, used to acquire the interaction trajectory data generated when the heterogeneous robot formation performs collaborative tasks;
a global advantage quantity acquisition unit, used to perform centralized value assessment based on the interaction trajectory data to obtain the global advantage quantity that characterizes the overall collaborative control effect of the heterogeneous robot formation;
a strategy group update unit, used to divide the heterogeneous robots into several strategy groups, decompose the global advantage quantity into the condition group advantage quantity corresponding to each strategy group based on the contribution of each strategy group to the global advantage quantity, and update the shared strategy parameters of each strategy group at the group level according to the condition group advantage quantity;
a personalized strategy update unit, used to allocate the condition group advantage quantity within the group based on the local observation data and individual identification parameters of each heterogeneous robot in the strategy group, so as to obtain the individual advantage quantity corresponding to each heterogeneous robot, and, based on the individual advantage quantity, to synchronously update the shared strategy parameters of the strategy group and the personalized strategy parameters of each heterogeneous robot in the group;
an instruction issuing unit, used to issue the updated shared strategy parameters and personalized strategy parameters to the controller of each heterogeneous robot so as to generate collaborative task control instructions.
9. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program that, when executed by a processor, causes the processor to perform the steps of the heterogeneous robot formation cooperative control strategy update method according to any one of claims 1 to 7.
10. A robot control system, characterized in that the system includes a memory and a processor, wherein the memory stores a computer program that, when executed by the processor, causes the processor of the robot control system to perform the steps of the heterogeneous robot formation cooperative control strategy update method according to any one of claims 1 to 7.