Optimal bipartite cooperative control method for an intelligent unmanned swarm system

The method constructs a double-Q learning network, trains it with the ALR-TQDPG algorithm on local state errors, and uses an adaptive learning rate to address the impact of cooperation-competition intensity on the performance of unmanned swarm systems. This is intended to speed up convergence, reduce resource consumption, and improve system stability and the accuracy of the learning results.

CN120065840B · Active · Publication Date: 2026-06-23 · CHONGQING UNIV OF POSTS & TELECOMM

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents (China)
Current Assignee / Owner: CHONGQING UNIV OF POSTS & TELECOMM
Filing Date: 2025-02-24
Publication Date: 2026-06-23

Abstract

The present application belongs to the field of intelligent unmanned swarm system control, and particularly relates to an optimal bipartite cooperative control method for an intelligent unmanned swarm system. The method comprises the following steps: determining the system topology according to the interconnection relationships among the unmanned aerial vehicles (UAVs) in the unmanned swarm system; based on the system topology, having each UAV agent send its own state information to its neighbor agents and calculating the local state error of each UAV; constructing a double-Q learning network and training it with the ALR-TQDPG algorithm according to the local state error until the intelligent unmanned swarm system reaches consensus, whereupon the target network outputs control information; and using the control information output by the target network to control the agents, thereby realizing cooperative control of the intelligent unmanned swarm system. The present application can cope with the influence of cooperation-competition intensity on system performance, reduces the Q-value underestimation caused by minimizing the Q-value during action selection, and has good application prospects.

Description

Technical Field

[0001] This invention belongs to the field of intelligent unmanned swarm system control, and specifically relates to an optimal bipartite cooperative control method for intelligent unmanned swarm systems.

Background Technology

[0002] In recent years, inspired by the collective behavior of organisms in nature, experts and scholars have applied the consensus problem in unmanned swarm systems to the collaborative control of complex systems. The consensus problem in unmanned swarm systems has significant application prospects in areas such as swarm behavior control, smart grids, satellite formation, and drone swarms. This problem can be viewed as a fundamental phenomenon in unmanned swarm systems, where all agents reach the same state through information exchange, guided by a consensus control protocol.

[0003] In recent years, research on consensus control in multi-agent systems has largely focused on the assumption that agents cooperate completely. However, in reality, agents often exhibit both cooperative and competitive relationships. While some studies have begun to focus on these cooperative and competitive relationships, most have only considered their existence without further exploring the specific impact of their intensity. In fact, research has found that the intensity of cooperation and competition not only affects the system's convergence speed but can even lead to system instability.

[0004] Some research has addressed the impact of cooperation-competition intensity on system performance, designing corresponding algorithms and employing techniques such as target networks and experience replay to improve convergence speed and reduce energy consumption. However, this approach requires dynamically adjusting the cooperation-competition intensity itself, which is typically complex and difficult to implement in practical applications. Therefore, finding an effective method to address the impact of cooperation-competition intensity on system performance has become an urgent problem.

[0005] Furthermore, in addressing the consensus problem of unmanned swarm systems, it is also necessary to minimize the resources consumed in the process, i.e., to achieve optimal consensus. Solving the optimal control problem relies on the Hamilton-Jacobi-Bellman (HJB) equation. The core difficulty of optimal control is that the HJB equation is hard to solve analytically and can run into the "curse of dimensionality." Reinforcement learning methods play an important role in solving the HJB equation for optimal control. The Deep Deterministic Policy Gradient (DDPG) algorithm in reinforcement learning is suitable for continuous action and state spaces. It uses experience replay to mitigate sample correlation and introduces a target network and delayed policy updates to improve stability. The algorithm exhibits good stability and performance. However, DDPG is built on Q-learning, and overestimation bias is a known characteristic of Q-learning: maximizing over noisy value estimates leads to systematic overestimation. How to resolve the overestimation bias is the second problem to be addressed.

Summary of the Invention

[0006] To address the shortcomings of existing technologies, this invention proposes an optimal bipartite cooperative control method for intelligent unmanned swarm systems, which includes:

[0007] S1: Determine the system topology based on the interconnection relationships between drones in the unmanned swarm system;

[0008] S2: Based on the system's topology, each drone agent sends its own state information to its neighboring agents and calculates the local state error of each drone;

[0009] S3: Construct a double-Q learning network and train the double-Q learning network using the ALR-TQDPG algorithm based on the local state error until the intelligent unmanned swarm system reaches consensus and the target network outputs control information;

[0010] S4: Use the control information output by the target network to control the intelligent agents and realize the cooperative control of the intelligent unmanned swarm system.

[0011] Preferably, the formula for calculating the local state error is expressed as follows:

[0012]

[0013] where e_i(k) denotes the consensus error of the i-th follower agent under its state x_i(k) at time k; x_i(k) denotes the state information of the i-th follower agent at time k; x_0(k) denotes the state information of the leader agent at time k; c_i denotes the connection weight between the leader and agent i; g_i denotes the pinning control on agent i; a_ij denotes the connection weight from agent j to agent i; x_j(k) denotes the state information of the j-th follower agent at time k; N_i denotes the set of neighboring nodes of agent i; the summation runs over the neighboring agents j of agent i; and sign() denotes the sign function.

[0014] Preferably, the process of training a double-Q learning network using the ALR-TQDPG algorithm includes:

[0015] S31: Initialize the parameters of the evaluator network, action network, and target networks; the action network is the actor network; the evaluator network includes the critic1 network and the critic2 network; the target networks are the target networks corresponding to the critic1 network, the critic2 network, and the actor network;

[0016] S32: The actor network outputs the control information at the current moment, and stores the local state error of the UAV at the current moment, the local state error at the next moment, and the control information at the current moment as experience information into the experience pool;

[0017] S33: Calculate the Q-value of the evaluator network by selecting experience information from the experience pool;

[0018] S34: The target network outputs control information and calculates the Q value of the target network;

[0019] S35: Update the weights between historical TD errors and cooperative competition intensity;

[0020] S36: Calculate the TD error of the two critic networks based on the Q value of the target network; calculate the historical TD error based on the TD error of the critic networks;

[0021] S37: Calculate the adaptive learning rate based on the historical TD error and the weights between the historical TD error and the cooperative competition intensity; update the weights of the two critic networks based on the adaptive learning rate;

[0022] S38: Update the weights of the actor network based on the Q-value of the evaluator network, and update the weights of the target network corresponding to the actor network based on the weights of the actor network.

[0023] S39: Update the weights of the target network corresponding to the critic network based on the weights of the two critic networks among the evaluators;

[0024] S310: Determine whether the intelligent unmanned swarm system has reached consensus. If consensus is reached, the target network outputs control information; otherwise, return to step S32.

[0025] Furthermore, the formula for updating the weights between historical TD errors and cooperative competition intensity is as follows:

[0026] κ=tanh(b·ln(l+1))

[0027] Where κ represents the weight between historical TD error and cooperative competition intensity, b represents a constant, and l represents the training iteration round.

[0028] Furthermore, the formula for calculating the TD error of the two critic networks is as follows:

[0029]

[0030] where the quantities in the formula are: the TD error of critic 1 at time k; the TD error of critic 2 at time k; the performance function; the control input of agent i; the Q-function of the target network corresponding to the critic 1 network; the Q-function of the target network corresponding to the critic 2 network; the Q-function of the critic 1 network; and the Q-function of the critic 2 network. e_i(k) denotes the consensus error of the i-th follower agent under its state x_i(k) at time k, and e_i(k+1) denotes the consensus error of the i-th follower agent under its state x_i(k+1) at time k+1.

[0031] Furthermore, the historical TD error is calculated using the following formula:

[0032]

[0033] where r1_i(k) denotes the historical TD error of critic 1 at time k; r2_i(k) denotes the historical TD error of critic 2 at time k; Θ denotes the decay factor; r1_i(k-1) denotes the historical TD error of critic 1 at time k-1; r2_i(k-1) denotes the historical TD error of critic 2 at time k-1; and the remaining two terms are the TD errors of critic 1 and critic 2 at time k.

[0034] Furthermore, the formula for calculating the adaptive learning rate is:

[0035]

[0036] where β1′ denotes the adaptive learning rate of the critic1 network; β2′ denotes the adaptive learning rate of the critic2 network; β_c denotes the fixed learning rate; m denotes a constant; N_i denotes the set of neighbors of agent i; a_ij denotes the connection weight from agent j to agent i; s_ij denotes the cooperation-competition strength between agents i and j; κ denotes the weight between historical TD error and cooperation-competition strength; r1_i(k) denotes the historical TD error of critic 1 at time k; and r2_i(k) denotes the historical TD error of critic 2 at time k.

[0037] Furthermore, the formula for determining whether an intelligent unmanned swarm system has reached consensus is as follows:

[0038]

[0039] where the quantities in the formula are: the weights of critic 1 in the l-th iteration; the weights of critic 1 in the (l+1)-th iteration; the weights of critic 2 in the l-th iteration; the weights of critic 2 in the (l+1)-th iteration; and ε, the stability threshold.

[0040] The beneficial effects of this invention are as follows:

[0041] 1. This invention proposes an adaptive learning rate adjustment formula, which aims to dynamically adjust the learning rate of the evaluator network based on the intensity of cooperative competition and the TD error. This adjustment aims to balance the impact of cooperative competition intensity on system performance, accelerate convergence speed, and reduce the resources required to achieve optimal control.

[0042] 2. Since the intensity of cooperation-competition affects the Q-value training of the evaluator network, the accuracy of Q-value estimation becomes crucial. To reduce the underestimation of the Q-value during action selection caused by minimizing the Q-value, this invention employs two evaluator networks, each with a corresponding target evaluator network. In each iteration of the TD error update equation, the larger of the two target Q-values is selected for training.

[0043] 3. The DDPG algorithm is suitable for continuous action spaces and can obtain more stable and higher-quality learning results. Therefore, this invention adopts an improved DDPG algorithm, namely the ALR-TQDPG algorithm, which uses experience replay to remove temporal correlations in the data and uses target networks to enhance the stability of the training process. This method allows unmanned swarm systems to explore without requiring persistent excitation conditions.

Description of the Attached Figures

[0044] Figure 1 is an overall flowchart of the present invention;

[0045] Figure 2 is a topology graph that may arise during the system convergence process of the present invention;

[0046] Figure 3 is a graph showing the evolution of the tracking error e1 of the UAV system in the comparative experiment of the present invention;

[0047] where (a) corresponds to the comparison algorithm with adaptive cooperative competition intensity and (b) to the algorithm of the present invention;

[0048] Figure 4 is a graph showing the evolution of the tracking error e2 of the UAV system in the comparative experiment of the present invention;

[0049] where (a) corresponds to the comparison algorithm with adaptive cooperative competition intensity and (b) to the algorithm of the present invention;

[0050] Figure 5 is a diagram showing the evolution of the tracking trajectory X1 of the UAV system in the comparative experiment of the present invention;

[0051] where (a) corresponds to the comparison algorithm with adaptive cooperative competition intensity and (b) to the algorithm of the present invention;

[0052] Figure 6 is a diagram showing the evolution of the tracking trajectory X2 of the UAV system in the comparative experiment of the present invention;

[0053] where (a) corresponds to the comparison algorithm with adaptive cooperative competition intensity and (b) to the algorithm of the present invention;

[0054] Figure 7 is a graph showing the evolution of the performance consumption of the UAV system in the comparative experiment of the present invention;

[0055] where (a) corresponds to the comparison algorithm with adaptive cooperative competition intensity and (b) to the algorithm of the present invention;

[0056] Figure 8 is a graph showing the evolution of the actor network weights in the present invention;

[0057] Figure 9 is a graph showing the evolution of the critic network weights in the present invention;

[0058] Figure 10 is a graph showing the evolution of the adaptive learning rate in the present invention.

Detailed Implementation

[0059] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.

[0060] This invention proposes an optimal bipartite cooperative control method for intelligent unmanned swarm systems. As shown in Figure 1, the method includes the following steps:

[0061] S1: Determine the system topology based on the interconnection relationships between drones in the unmanned swarm system.

[0062] In a specific application scenario of this invention, the drones are interconnected as illustrated in Figure 2. The system is a leader-follower drone swarm with five nodes: nodes 1, 2, 3, and 4 are follower drones, and node 0 is the leader drone. Follower drones 1 and 2 receive information from leader drone 0.

[0063] The topology of the unmanned swarm system is determined from the interconnection relationships between the drones. For the topology in Figure 2, the pinning matrix is G = diag{1,1,0,0}.

[0064] The connection matrix is represented as:

[0065] The Laplacian matrix is expressed as:
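As a rough illustration of how the pinning, connection, and Laplacian matrices relate, the sketch below builds a Laplacian for a signed graph in Python. Only G = diag{1,1,0,0} comes from the description. The adjacency matrix `A` is hypothetical and is not the connection matrix of Figure 2, and the signed-Laplacian convention L = D - A with D = diag(sum_j |a_ij|) is a common choice in bipartite consensus work that is assumed here rather than taken from the patent.

```python
# Hypothetical signed adjacency matrix for four followers (NOT the patent's matrix).
# A[i][j] > 0: cooperative link j -> i; A[i][j] < 0: competitive link; 0: no link.
A = [
    [0.0, 0.0, -1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
    [-1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
]

# Pinning matrix from the description: only followers 1 and 2 hear the leader.
G = [[1.0 if i == j and i < 2 else 0.0 for j in range(4)] for i in range(4)]


def signed_laplacian(adj):
    """Return L = D - A, where D is diagonal with D[i][i] = sum_j |adj[i][j]|."""
    n = len(adj)
    lap = [[-adj[i][j] for j in range(n)] for i in range(n)]
    for i in range(n):
        lap[i][i] += sum(abs(v) for v in adj[i])
    return lap


L = signed_laplacian(A)
```

With this convention a competitive (negative) link shows up as a positive off-diagonal entry of L, which is what distinguishes the signed case from an ordinary cooperative graph.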

[0066] S2: Based on the system's topology, each drone agent sends its own state information to its neighboring agents and calculates the local state error of each drone.

[0067] The unmanned swarm system of this invention adopts a leader-follower model in which each drone maintains its own state information: x_0(k) denotes the state of the leader drone at time k, and x_i(k) denotes the state of follower drone i. The leader drone and the follower drones update their states according to a leader dynamic model and a follower dynamic model, respectively. The dynamic equation of the leader drone is expressed as:

[0068] x_0(k+1) = A_0 x_0(k)

[0069] The dynamic equation of the follower drones is expressed as:

[0070] x_i(k+1) = A_i x_i(k) + B_i μ_i(k), i = 1, 2, ..., N

[0071] where x_0(k) and x_0(k+1) denote the state of the leader drone at times k and k+1, respectively; x_i(k) and x_i(k+1) denote the state of follower drone i at times k and k+1; μ_i(k) denotes the control input of follower drone i at time k; and A_0, A_i, and B_i are distinct unknown constant matrices of appropriate dimensions.
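The two update rules can be stepped directly in code. The sketch below is illustrative only: the patent treats the system matrices as unknown, so the numeric values of `A0`, `Ai`, and `Bi`, the two-dimensional state, and the constant control input are all assumptions chosen to keep the example small.

```python
def matvec(m, v):
    """Multiply matrix m (list of rows) by vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def leader_step(A0, x0):
    """Leader dynamics: x0(k+1) = A0 x0(k)."""
    return matvec(A0, x0)


def follower_step(Ai, Bi, xi, mu_i):
    """Follower dynamics: xi(k+1) = Ai xi(k) + Bi mu_i(k)."""
    return [a + b for a, b in zip(matvec(Ai, xi), matvec(Bi, mu_i))]


# Illustrative values only (assumed, not from the patent).
A0 = [[1.0, 0.1], [0.0, 1.0]]
Ai = [[1.0, 0.1], [0.0, 1.0]]
Bi = [[0.0], [1.0]]

x0 = [1.0, 0.0]
xi = [0.0, 0.0]
x0_next = leader_step(A0, x0)                # [1.0, 0.0]
xi_next = follower_step(Ai, Bi, xi, [0.5])   # [0.0, 0.5]
```

In the method itself the control input would come from the actor network rather than being fixed by hand.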

[0072] In the implementation of this invention, the local state error of the UAV is constructed using the dynamic equations of the leader and follower, and expressed as follows:

[0073]

[0074] where s_ij ≠ 0 denotes the intensity of cooperation and competition between agent i and agent j; e_i(k) denotes the consensus error of the i-th agent under its state x_i(k) at time k; N_i denotes the set of neighboring nodes of agent i; a_ij denotes the connection weight from agent j to agent i; and g_i denotes the pinning control on agent i. g_i = 0 means that follower i cannot receive information from the leader; a_ij > 0 indicates that follower drone i can receive the state information of follower drone j, and a_ij = 0 means that it cannot.

[0075] Based on the state equations of the drone leader and followers, the iterative form of the local state error is expressed as follows:

[0076]

[0077] The consumption performance index function is expressed as follows:

[0078]

[0079] where the quantities in the formula are: the performance function; the control input of the neighboring agents of agent i; and Q_ii and R_ii, which are respectively the weight matrix and the symmetric matrix of follower drone i. The performance index function serves as an evaluation metric for the performance of the controller μ_i(e_i(k)) in the current state e_i(k).

[0080] S3: Construct a double-Q learning network and train it using the ALR-TQDPG algorithm based on the local state errors until the intelligent unmanned swarm system reaches consensus, at which point the target network outputs control information.

[0081] The double-Q learning network comprises an evaluator network, an action network, and target networks, organized in an actor-critic structure. The action network is the actor network; the evaluator network includes the critic1 network and the critic2 network; the target networks are those corresponding to the actor network, the critic1 network, and the critic2 network.

[0082] This invention designs the ALR-TQDPG algorithm to train a double-Q learning network, and the specific process is as follows:

[0083] S31: Initialize the parameters of the evaluator network, action network, and target network.

[0084] Initialize appropriate parameter values, including the initial learning rate β_a of the actor network, the initial learning rate β_c of the two critic networks, the target-network update weight ι, and the stability threshold ε. Let l = 1 and k = 1, and set the maximum number of iterations l_max. Initialize the actor network weights, the critic 1 network weights, the critic 2 network weights, the target network weights corresponding to the actor network, and the target network weights corresponding to the critic 1 network.

[0085] Initialize the target network weights corresponding to the critic 2 network. Initialize the experience pool, whose size is M.

[0086] S32: The actor network outputs the control information at the current moment, and stores the local state error of the UAV at the current moment, the local state error at the next moment, and the control information at the current moment as experience information into the experience pool.

[0087] Initially, the actor network outputs the control information for the current moment:

[0088]

[0089] where the quantities in the formula are: the control input of agent i at time k; the transpose of the actor network weights; and the activation function of the actor network for agent i at time k.

[0090] Subsequent updates are performed using a policy gradient approach, and the update of the control policy is defined as follows:

[0091]

[0092] where β_a denotes the learning rate of the actor network, and the remaining term is a partial derivative.

[0093] Store the data set consisting of e_i(k), e_i(k+1), and the control input at time k in the experience pool of size M.
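A minimal sketch of such an experience pool is given below. The class name `ExperiencePool`, the tuple layout `(e_k, mu_k, e_next)`, the first-in-first-out eviction once M entries are stored, and uniform random sampling are all assumptions made for illustration; the text only states that the pool has size M and that experience is selected from it.

```python
import random
from collections import deque


class ExperiencePool:
    """Fixed-capacity pool of (e_k, mu_k, e_next) transitions."""

    def __init__(self, capacity):
        # A deque with maxlen discards the oldest transition once full.
        self.buffer = deque(maxlen=capacity)

    def store(self, e_k, mu_k, e_next):
        self.buffer.append((e_k, mu_k, e_next))

    def sample(self, batch_size, rng=random):
        # Uniform sampling without replacement, capped at the current pool size.
        return rng.sample(list(self.buffer), min(batch_size, len(self.buffer)))

    def __len__(self):
        return len(self.buffer)


pool = ExperiencePool(capacity=3)
for k in range(5):
    pool.store([float(k)], [0.0], [float(k + 1)])
batch = pool.sample(2)
```

Sampling at random rather than replaying transitions in order is what breaks the temporal correlation between consecutive samples.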

[0094] S33: Select experience information from the experience pool to calculate the Q value of the evaluator network.

[0095]

[0096] where the quantities in the formula are: the Q-value estimate of critic 1; the Q-value estimate of critic 2; the transpose of the critic 1 network weights; the transpose of the critic 2 network weights; and the activation function of the critic network. In these expressions, one symbol denotes the control input of agent i at time k and another denotes the control input of the neighboring agents of agent i at time k.

[0097] S34: The target network outputs control information and calculates the Q value of the target network.

[0098] First calculate the action value (i.e., the control information) of the target network, then calculate the Q-value of the target network:

[0099]

[0100] where the quantities in the formula are: the Q-value estimate of the target network for critic 1; the Q-value estimate of the target network for critic 2; the control input produced by the target network from the error at time k+1; the transpose of the target network weights for critic 1; and the transpose of the target network weights for critic 2.

[0101] In addition, one symbol denotes the control input of agent i at time k+1 and another denotes the control input of the neighboring agents of agent i at time k+1.

[0102] S35: Update the weights between historical TD errors and cooperative competition intensity.

[0103] κ=tanh(b·ln(l+1))

[0104] Where κ represents the weight between historical TD error and cooperative competition intensity, b represents a constant, and l represents the training iteration round.
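The schedule for κ is simple enough to evaluate directly. In the sketch below the function name `kappa` and the value b = 0.5 are illustrative assumptions; only the formula itself comes from the text. For b > 0 the weight is 0 at l = 0, increases with l, and approaches 1 from below.

```python
import math


def kappa(l, b):
    """Weight kappa = tanh(b * ln(l + 1)) for training round l >= 0."""
    return math.tanh(b * math.log(l + 1))


# b = 0.5 is an arbitrary illustrative constant, not a value from the patent.
schedule = [kappa(l, b=0.5) for l in (1, 10, 100, 1000)]
```

Because the logarithm grows slowly, κ changes quickly in early rounds and then saturates, so the balance between the two terms it weights settles as training proceeds.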

[0105] S36: Calculate the TD error of the two critic networks based on the Q value of the target network; calculate the historical TD error based on the TD error of the critic networks.

[0106] The formula for calculating the TD error of two critic networks is:

[0107]

[0108]

[0109] where the quantities in the formulas are: the TD error of critic 1 at time k; the TD error of critic 2 at time k; the performance function; the control input of agent i; the Q-function of the target network corresponding to the critic 1 network; the Q-function of the target network corresponding to the critic 2 network; the Q-function of the critic 1 network; and the Q-function of the critic 2 network. e_i(k) denotes the consensus error of the i-th follower agent under its state x_i(k) at time k, and e_i(k+1) denotes the consensus error of the i-th follower agent under its state x_i(k+1) at time k+1.
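The TD error formulas themselves are not legible in the text, so the following is only a sketch of one plausible form. It assumes that both critics share a bootstrap target built from the larger of the two target-network Q-values, as stated in the beneficial effects, and that the TD error is "target minus current estimate". The discount factor `gamma` (defaulting to 1) and the sign convention are assumptions, not taken from the patent.

```python
def td_errors(utility, q1, q2, q1_target_next, q2_target_next, gamma=1.0):
    """Return (delta1, delta2) using the larger target Q-value as shared bootstrap.

    utility: performance function value at time k
    q1, q2: critic 1 / critic 2 estimates at time k
    q*_target_next: target-critic estimates at time k+1
    gamma: assumed discount factor (not mentioned in the source)
    """
    bootstrap = max(q1_target_next, q2_target_next)
    target = utility + gamma * bootstrap
    return target - q1, target - q2


delta1, delta2 = td_errors(utility=0.2, q1=1.0, q2=1.5,
                           q1_target_next=0.9, q2_target_next=1.1)
```

Here the target is 0.2 + 1.1, so critic 1 sees a positive error and critic 2 a negative one.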

[0110] The formula for calculating historical TD error is as follows:

[0111]

[0112] where r1_i(k) denotes the historical TD error of critic 1 at time k; r2_i(k) denotes the historical TD error of critic 2 at time k; Θ denotes the decay factor; r1_i(k-1) denotes the historical TD error of critic 1 at time k-1; r2_i(k-1) denotes the historical TD error of critic 2 at time k-1; and the remaining two terms are the TD errors of critic 1 and critic 2 at time k.
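Since the historical TD error formula is not legible in the text, the sketch below shows just one common way to combine a previous value, a decay factor, and a new TD error: an exponentially decayed running average of the error magnitude. This specific form, the use of the absolute value, and the function name `update_history` are assumptions; the patent's actual formula may differ.

```python
def update_history(r_prev, delta, theta):
    """Assumed form: r(k) = theta * r(k-1) + (1 - theta) * |delta(k)|."""
    return theta * r_prev + (1.0 - theta) * abs(delta)


r = 0.0
for delta in (1.0, -0.5, 0.25):
    r = update_history(r, delta, theta=0.9)
```

A decay factor close to 1 makes the history change slowly, which smooths out noise in individual TD errors before they influence the learning rate.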

[0113] S37: Calculate the adaptive learning rate based on the historical TD error and the weights between the historical TD error and the cooperative competition intensity; update the weights of the two critic networks based on the adaptive learning rate.

[0114] The formula for calculating the adaptive learning rate is:

[0115]

[0116] where β1′ denotes the adaptive learning rate of the critic1 network; β2′ denotes the adaptive learning rate of the critic2 network; β_c denotes the fixed learning rate; m denotes a constant; the summation runs over the neighboring agents j of agent i; and κ = tanh(b·ln(l+1)) denotes the weight between historical TD error and cooperation-competition intensity.

[0117] Considering that the change in the learning rate should be within a reasonable range, a clipping operation is performed on β1′ and β2′. The conversion formula for the clipping operation is as follows:

[0118] where the two bounds denote suitable lower and upper limits for the adaptive learning rate.
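The clipping step amounts to clamping each adaptive learning rate into a fixed interval. In the sketch below the names `beta_min` and `beta_max` and the numeric bounds are assumptions introduced for illustration, since the original symbols for the bounds are not legible.

```python
def clip_learning_rate(beta, beta_min, beta_max):
    """Clamp an adaptive learning rate into [beta_min, beta_max]."""
    if beta_min > beta_max:
        raise ValueError("beta_min must not exceed beta_max")
    return max(beta_min, min(beta, beta_max))


# Illustrative bounds only (assumed, not from the patent).
beta1 = clip_learning_rate(0.5, beta_min=0.001, beta_max=0.1)    # too large
beta2 = clip_learning_rate(1e-5, beta_min=0.001, beta_max=0.1)   # too small
```

Clamping keeps a large TD error history from producing an update step big enough to destabilize critic training.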

[0119] Update the weights of the critic network based on the adaptive learning rate:

[0120]

[0121] where the quantities in the formula are: the critic 1 network weights in the (l+1)-th iteration; the critic 2 network weights in the (l+1)-th iteration; E1_ci(k), the loss function of critic 1 at time k; and E2_ci(k), the loss function of critic 2 at time k. The remaining symbols denote the control input of agent i at time k, the control input of the neighboring agents of agent i at time k, the TD error of critic 1 at time k, and the TD error of critic 2 at time k.

[0122] S38: Update the weights of the actor network based on the Q-value of the evaluator network, and update the weights of the target network corresponding to the actor network based on the weights of the actor network.

[0123] Update the weights of the actor network:

[0124]

[0125] where the remaining term denotes a partial derivative.

[0126] Update the weights of the target network corresponding to the actor network:

[0127]

[0128] S39: Update the weights of the target network corresponding to the critic network based on the weights of the two critic networks in the evaluator network.

[0129]
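The target-network update formulas are not legible in the text. A common choice in DDPG-style algorithms is a soft (Polyak) update that blends the online weights into the target weights using a small mixing weight, and the sketch below assumes that form with ι from the initialization step as the mixing weight. Both the form and the flat-list weight representation are assumptions.

```python
def soft_update(target_weights, online_weights, iota):
    """Return iota * online + (1 - iota) * target, element-wise."""
    return [iota * w + (1.0 - iota) * wt
            for w, wt in zip(online_weights, target_weights)]


target = [0.0, 0.0]
online = [1.0, -2.0]
target = soft_update(target, online, iota=0.1)
```

Under this assumption the same function would serve the actor target and both critic targets, each with its own pair of weight vectors.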

[0130] S310: Determine whether the intelligent unmanned swarm system has reached consensus. If consensus is reached, the target network outputs control information; otherwise, return to step S32.

[0131] The formula for determining whether an intelligent unmanned swarm system has reached consensus is:

[0132]

[0133] where the quantities in the formula are: the weights of critic 1 in the l-th iteration; the weights of critic 1 in the (l+1)-th iteration; the weights of critic 2 in the l-th iteration; the weights of critic 2 in the (l+1)-th iteration; and ε, the stability threshold.

[0134] If the above conditions are met, it is determined that the intelligent unmanned swarm system has reached consensus; otherwise, return to step S32 and continue iterating.
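The stopping test can be sketched as a check on how much the critic weights moved between iterations l and l+1. The use of the Euclidean norm and the requirement that both critics satisfy the threshold are assumptions; the text only says that the condition compares successive critic weights against the stability threshold ε.

```python
import math


def weights_converged(w_prev, w_next, epsilon):
    """True if the Euclidean norm of the weight change is at most epsilon."""
    diff = math.sqrt(sum((a - b) ** 2 for a, b in zip(w_next, w_prev)))
    return diff <= epsilon


def consensus_reached(c1_prev, c1_next, c2_prev, c2_next, epsilon):
    """Assumed rule: both critics must have stabilised for training to stop."""
    return (weights_converged(c1_prev, c1_next, epsilon)
            and weights_converged(c2_prev, c2_next, epsilon))


done = consensus_reached([1.0, 2.0], [1.0, 2.0005],
                         [0.5, 0.5], [0.5, 0.5], epsilon=1e-3)
```

If the check fails, the loop would return to step S32 and continue collecting experience.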

[0135] S4: Use the control information output by the target network to control the intelligent agent and realize the collaborative control of the intelligent unmanned swarm system.

[0136] Evaluation of the present invention:

[0137] The invention was evaluated in simulation. Figures 3 and 4 show the error convergence of the unmanned swarm system, and Figures 5 and 6 show its state (trajectory) convergence. The simulation results show that the UAV swarm system ultimately achieves consensus. To further verify the advantages of this invention, a comparative experiment used the same UAV dynamics, topology, initial system state values, critic weights, actor weights, and other relevant parameters. In each figure, (a) shows the algorithm with adaptive cooperative competition intensity and (b) shows the ALR-TQDPG algorithm proposed in this invention. The proposed algorithm converges faster.

[0138] Figure 7 compares the two algorithms in terms of performance consumption: (a) shows the performance consumption of the algorithm with adaptive cooperative competition intensity, and (b) shows that of the ALR-TQDPG algorithm proposed in this invention. The proposed algorithm has lower performance consumption.

[0139] In addition, Figures 8 and 9 show that the actor network weights and the critic network weights eventually stabilize, indicating that the neural networks have converged and that training achieved the desired result.

[0140] Previous research has shown that choosing an inappropriate cooperation-competition strength parameter can lead to instability in unmanned swarm systems. With the ALR-TQDPG algorithm, when an unmanned swarm system with cooperative-competitive interactions eventually achieves bipartite consensus, the corresponding adaptive learning rate parameter converges to its optimal value (see Figure 10). This eliminates the need for manual adjustment of the learning rate parameter, ensuring system stability and reducing system energy consumption.

[0141] In summary, this invention dynamically adjusts the learning rate of the evaluator network based on the intensity of cooperative competition and TD error, which can address the impact of cooperative competition intensity on system performance; it reduces the problem of underestimating the Q value due to minimizing the Q value during the action selection process; and it achieves superior performance compared to the comparison method, showing promising application prospects.

[0142] The above-described embodiments further illustrate the purpose, technical solution, and advantages of the present invention. It should be understood that the above-described embodiments are merely preferred embodiments of the present invention and are not intended to limit the present invention. Any modifications, equivalent substitutions, improvements, etc., made to the present invention within the spirit and principles of the present invention should be included within the protection scope of the present invention.

Claims

1. An optimal bipartite cooperative control method for an intelligent unmanned swarm system, characterized in that it comprises:
S1: Determine the system topology based on the interconnection relationships between the drones in the unmanned swarm system;
S2: Based on the system topology, each drone agent sends its own state information to its neighboring agents, and the local state error of each drone is calculated;
S3: Construct a double-Q learning network and train it with the ALR-TQDPG algorithm based on the local state error until the intelligent unmanned swarm system reaches consensus and the target network outputs control information; the process of training the double-Q learning network with the ALR-TQDPG algorithm includes:
S31: Initialize the parameters of the evaluator network, the action network, and the target networks; the action network is the actor network; the evaluator network includes the critic1 network and the critic2 network; the target networks include the target networks corresponding to the critic1 network, the critic2 network, and the actor network;
S32: The actor network outputs the control information at the current moment, and the local state error of the UAV at the current moment, the local state error at the next moment, and the control information at the current moment are stored as experience information in the experience pool;
S33: Select experience information from the experience pool and calculate the Q value of the evaluator network;
S34: The target network outputs control information, and the Q value of the target network is calculated;
S35: Update the weight between the historical TD error and the cooperation-competition intensity; the update formula is: [formula not reproduced]; where the symbols denote, in order: the weight between the historical TD error and the cooperation-competition intensity; a constant; the number of training iterations;
S36: Calculate the TD errors of the two critic networks based on the Q value of the target network; calculate the historical TD error based on the TD errors of the critic networks;
S37: Calculate the adaptive learning rate based on the historical TD error and the weight between the historical TD error and the cooperation-competition intensity; update the weights of the two critic networks based on the adaptive learning rate;
S38: Update the weights of the actor network based on the Q value of the evaluator network, and update the weights of the target network corresponding to the actor network based on the weights of the actor network;
S39: Update the weights of the target networks corresponding to the critic networks based on the weights of the two critic networks in the evaluator network;
S310: Determine whether the intelligent unmanned swarm system has reached consensus; if so, the target network outputs control information; otherwise, return to step S32;
S4: Use the control information output by the target network to control the agents, thereby realizing cooperative control of the intelligent unmanned swarm system.
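As an illustration of steps S32 and S33 only, the experience pool can be sketched in Python as a bounded buffer of (current local state error, current control information, next local state error) tuples from which a batch is drawn. The class name `ExperiencePool`, the capacity, and uniform random sampling are assumptions made for this sketch; the claim does not state how experience information is selected.

```python
import random
from collections import deque


class ExperiencePool:
    """Hypothetical bounded experience pool for steps S32 and S33."""

    def __init__(self, capacity=10000, seed=None):
        # The oldest experience is discarded once capacity is reached (assumption).
        self._buffer = deque(maxlen=capacity)
        self._rng = random.Random(seed)

    def store(self, error_k, control_k, error_next):
        """S32: store (e(k), u(k), e(k+1)) as one piece of experience information."""
        self._buffer.append((error_k, control_k, error_next))

    def sample(self, batch_size):
        """S33: select experience information; uniform sampling is an assumption."""
        size = min(batch_size, len(self._buffer))
        return self._rng.sample(list(self._buffer), size)

    def __len__(self):
        return len(self._buffer)
```

A bounded buffer keeps memory use fixed during long training runs; whether the claimed method discards old experience in this way is not stated in the source.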

2. The optimal bipartite cooperative control method for an intelligent unmanned swarm system according to claim 1, characterized in that the formula for calculating the local state error is expressed as: [formula not reproduced]; where the symbols denote, in order: the consistency error of the i-th follower agent under its state information at time k; the state information of the i-th follower agent at time k; the state information of the leader agent at time k; the connection weight between the leader and agent i; the constraint control on agent i; the connection weight from agent j to agent i; the state information of the j-th follower agent at time k; the set of neighboring nodes of agent i; the sign function; the cooperation-competition intensity between agent i and agent j.
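Because the formula of claim 2 is not reproduced in this text, the following Python sketch is only an illustration built from the listed symbols. It assumes a form commonly used for signed (cooperative-competitive) graphs: the sum over neighbors j of c_ij·|a_ij|·(x_i − sgn(a_ij)·x_j), plus b_i·(x_i − x_0). This form, the scalar states, and all names (`local_state_error`, `a`, `b`, `c`) are assumptions, not the claimed formula, and the "constraint control" term is omitted.

```python
def sign(value):
    """Sign function: returns -1, 0, or 1."""
    return (value > 0) - (value < 0)


def local_state_error(i, x, x_leader, a, b, c):
    """Assumed local state error of agent i on a signed graph.

    x        : list of follower states at time k (scalars, for simplicity)
    x_leader : leader state at time k
    a[i][j]  : signed connection weight from agent j to agent i
    b[i]     : connection weight between the leader and agent i
    c[i][j]  : cooperation-competition intensity between agents i and j
    """
    error = b[i] * (x[i] - x_leader)
    for j, weight in enumerate(a[i]):
        if j == i or weight == 0:
            continue  # skip self-loops and non-neighbors
        error += c[i][j] * abs(weight) * (x[i] - sign(weight) * x[j])
    return error
```

Under this assumed form, two competing agents (negative weight) holding opposite states produce zero error, which matches the idea of bipartite consensus.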

3. The optimal bipartite cooperative control method for an intelligent unmanned swarm system according to claim 1, characterized in that the formulas for calculating the TD errors of the two critic networks are: [formulas not reproduced]; where the symbols denote, in order: the TD error of the critic1 network training at time k; the TD error of the critic2 network training at time k; the performance function; the control input of agent i; the Q-function of the target network corresponding to the critic1 network during training; the Q-function of the target network corresponding to the critic2 network during training; the Q-function of the critic1 network during training; the Q-function of the critic2 network during training; the consistency error of the i-th follower agent under its state information at time k; the consistency error of the i-th follower agent under its state information at time k+1.
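The formulas of claim 3 are not reproduced in this text, so the Python sketch below shows only a generic one-step TD error of the form "performance value plus target Q value at time k+1, minus critic Q value at time k", computed once per critic. The undiscounted form, the pairing of each critic with its own target network, and the quadratic performance function are all assumptions for illustration; the claimed formulas may combine the two target networks differently.

```python
def performance(error, control, q_weight=1.0, r_weight=1.0):
    """Hypothetical quadratic performance function of error and control input."""
    return q_weight * error ** 2 + r_weight * control ** 2


def td_errors(error_k, control_k, q1, q2, target_q1_next, target_q2_next):
    """Return the assumed TD errors (delta1, delta2) of critic1 and critic2.

    q1, q2                         : critic Q values evaluated at e(k), u(k)
    target_q1_next, target_q2_next : target-network Q values evaluated at e(k+1)
    """
    cost = performance(error_k, control_k)
    delta1 = cost + target_q1_next - q1
    delta2 = cost + target_q2_next - q2
    return delta1, delta2
```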

4. The optimal bipartite cooperative control method for an intelligent unmanned swarm system according to claim 1, characterized in that the formulas for calculating the historical TD error are: [formulas not reproduced]; where the symbols denote, in order: the historical TD error of the critic1 network training at time k; the historical TD error of the critic2 network training at time k; the attenuation factor; the historical TD error of the critic1 network training at time [index not reproduced]; the historical TD error of the critic2 network training at time [index not reproduced]; the TD error of the critic1 network training at time k; the TD error of the critic2 network training at time k.
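Since the formulas of claim 4 are not reproduced in this text, the following Python sketch illustrates one common way to keep a "historical" TD error with an attenuation factor: an exponentially decayed average of TD error magnitudes. The recursion h(k) = λ·h(k−1) + (1 − λ)·|δ(k)|, the use of the absolute value, and the default factor are assumptions, not the claimed formulas.

```python
def update_historical_td(previous, td_error, decay=0.9):
    """Assumed recursion: decay * previous + (1 - decay) * |td_error|.

    previous : historical TD error from the preceding step
    td_error : TD error of the critic at the current time k
    decay    : attenuation factor in [0, 1) (hypothetical default)
    """
    return decay * previous + (1.0 - decay) * abs(td_error)


def historical_td_trace(td_errors, decay=0.9, initial=0.0):
    """Apply the recursion over a sequence of TD errors for one critic."""
    history = []
    value = initial
    for delta in td_errors:
        value = update_historical_td(value, delta, decay)
        history.append(value)
    return history
```

The same function would be called separately for the critic1 and critic2 networks, each with its own running value.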

5. The optimal bipartite cooperative control method for an intelligent unmanned swarm system according to claim 1, characterized in that the formulas for calculating the adaptive learning rate are: [formulas not reproduced]; where the symbols denote, in order: the adaptive learning rate of the critic1 network; the adaptive learning rate of the critic2 network; a fixed learning rate; a constant; the set of neighbors of agent i; summation over the neighboring agents j of agent i; the connection weight from agent j to agent i; the cooperation-competition intensity between agent i and agent j; the weight between the historical TD error and the cooperation-competition intensity; the historical TD error of the critic1 network training at time k; the historical TD error of the critic2 network training at time k.

6. The optimal bipartite cooperative control method for an intelligent unmanned swarm system according to claim 1, characterized in that the formulas for determining whether the intelligent unmanned swarm system has reached consensus are: [formulas not reproduced]; where the symbols denote, in order: the weights of the critic1 network of agent i at two successive iterations; the weights of the critic2 network of agent i at two successive iterations; the stability threshold.
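The consensus test of claim 6 can be illustrated with a short Python sketch that compares the critic weights of one agent across two successive iterations against a stability threshold. The formulas themselves are not reproduced in this text, so the choice of the maximum absolute change as the distance measure, the function names, and the default threshold are assumptions made for illustration.

```python
def max_weight_change(new_weights, old_weights):
    """Largest absolute change between two equal-length, non-empty weight lists."""
    return max(abs(new - old) for new, old in zip(new_weights, old_weights))


def has_converged(critic1_new, critic1_old, critic2_new, critic2_old,
                  threshold=1e-4):
    """Return True when both critics changed by no more than the threshold.

    threshold is a hypothetical stability threshold; the source gives no value.
    """
    return (max_weight_change(critic1_new, critic1_old) <= threshold
            and max_weight_change(critic2_new, critic2_old) <= threshold)
```

In a training loop following step S310, a `False` result would send the procedure back to step S32, and a `True` result would let the target network output the control information.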