Data-driven expert quadrotor unmanned aerial vehicle decision-making imitation method
By establishing a UAV model that accounts for communication packet loss and fitting it with neural networks, the method derives the optimal controller decision and the worst-case adversary decision, addressing the robustness of UAV autonomous decision-making in complex environments and enabling effective imitation of expert decision-making.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- HEFEI UNIV OF TECH
- Filing Date
- 2026-03-10
- Publication Date
- 2026-06-12
AI Technical Summary
Existing UAV control methods struggle to achieve robust autonomous decision-making in complex dynamic adversarial environments, especially when expert decisions are unknown, system models are uncertain, and communication channels suffer from random packet loss. Existing inverse reinforcement learning methods cannot effectively mimic expert control and adversarial decision-making.
Kinematic and communication models of expert quadrotor UAVs and learner quadrotor UAVs are established, considering channel packet loss. The optimal control and worst adversary decisions are derived through Bellman equations and the minimax principle. The expert decisions are fitted using neural networks, and the state penalty weight matrix is updated by combining packet loss probability and system dynamics.
The system enables autonomous decision-making simulation of drones in complex environments, enhancing the robustness and autonomous decision-making capabilities of drones in dynamic adversarial environments.
Figure CN122194643A_ABST
Description
Technical Field
[0001] This invention relates to the field of unmanned aerial vehicle (UAV) control technology, and more specifically to a data-driven expert quadrotor UAV decision-making imitation method.
Background Technology
[0002] Quadrotor drones, with their unique flight maneuverability, are increasingly widely used in complex tasks such as security patrols, disaster relief, and swarm collaboration. In these applications, drones not only need to achieve precise trajectory tracking and stable control, but also need to make robust and adaptive intelligent decisions autonomously in dynamic, uncertain environments with potential adversarial factors. How to equip drones with such decision-making capabilities has become a key challenge in the current development of autonomous unmanned systems.
[0003] Traditional model-based UAV control methods, such as classical PID control or model predictive control, typically rely on an accurate model of the controlled object. However, in real-world flight environments, UAV models often exhibit uncertainties and are susceptible to modeling errors, external disturbances, and adversarial behavior from other agents. These factors lead to performance degradation and insufficient robustness in practical applications for control methods that rely entirely on accurate models.
[0004] In recent years, data-driven methods, represented by imitation learning and reinforcement learning, have provided new solutions to complex decision-making problems. Imitation learning achieves skill transfer by replicating behavior in expert demonstration data, but it struggles to understand the cost-benefit logic behind expert behavior and has limited generalization ability when the environment changes. Reinforcement learning learns optimal decisions autonomously through interaction with the environment, but its performance is highly dependent on manually designed reward functions. In complex scenarios involving adversarial competition and multi-objective trade-offs, designing reasonable and effective reward functions often requires significant parameter tuning costs and experimental experience.
[0005] Inverse reinforcement learning can deduce the implicit cost function from expert demonstration data, thus bypassing the intent understanding dilemma of imitation learning and the reward design problem of reinforcement learning, becoming an effective framework for realizing expert decision imitation. However, existing inverse reinforcement learning methods, when applied to UAV systems, are usually based on several idealized assumptions: for example, assuming that the expert's decision is known; assuming that the dynamic model of the system is completely and accurately known; and often ignoring phenomena such as signal packet loss that are common in the wireless communication networks of actual UAVs. In addition, most existing methods are aimed at single-decision-maker scenarios, and rarely consider the complex competitive environment in which the controller and the adversary coexist and play each other.
[0006] Therefore, how to enable learners to learn inversely from limited expert demonstration data and effectively imitate the control and adversarial decision-making logic of experts in a zero-sum game framework, under conditions of unknown expert decisions, uncertain system models, and random packet loss in multiple communication channels, constitutes a critical problem that urgently needs to be solved. Solving this problem is of great importance for improving the autonomous decision-making ability and mission execution robustness of UAVs in complex dynamic adversarial environments.
Summary of the Invention
[0007] To address the aforementioned technical problems, this invention provides a data-driven expert quadrotor UAV decision-making imitation method.
[0008] To solve the above-mentioned technical problems, the present invention adopts the following technical solution:
[0009] A data-driven expert quadrotor UAV decision-making imitation method includes the following steps:
Establishing kinematic and communication models for an expert quadrotor UAV and a learner quadrotor UAV, where the communication model accounts for random packet loss in the sensor-controller channel, the controller-actuator channel, and the adversarial command injection channel, each described by a Bernoulli random variable.
Referring to the expert quadrotor UAV as the expert and the learner quadrotor UAV as the learner; defining the value functions of the expert and the learner within a zero-sum game framework; modeling the expert's behavior as the Nash equilibrium between the controller and the adversary; and formulating the expert decision imitation problem under the zero-sum game, whose goal is for the learner to imitate the expert's decisions by adjusting its own state penalty weight matrix.
Proposing a data-driven model-based expert decision imitation algorithm, in which the learner's optimal control decision and worst-case adversary decision are derived through the Bellman equation and the minimax principle, and the update rule for the state penalty weight matrix is established based on inverse optimal control and solved by value iteration.
Training neural networks on expert trajectory data to fit the decisions of the expert's controller and adversary, estimating the packet loss probabilities and system dynamics from learner trajectory data, and proposing a data-driven, model-free expert decision imitation algorithm that combines the estimated packet loss probabilities and system dynamics to update the state penalty weight matrix through value iteration and neural networks.
[0010] In one embodiment, the kinematic model of the expert quadrotor UAV is a discrete-time model indexed by the time step k. The expert's state consists of the expert's position, velocity, and attitude in three-dimensional space. The expert's control input consists of the mass-normalized thrust vector, formed from the thrust generated by each of the expert's motors and the expert's mass, together with the expert's angular velocity in three-dimensional space. The expert's adversary control input consists of a simplified constant wind disturbance and a state-based malicious attack on the sensors. The inherent dynamics of the system, the control input dynamic function, and the adversary input dynamic function are all Lipschitz continuous. The kinematic model of the expert quadrotor UAV is assumed to be controllable, and the expert's adversary control input is assumed to be measurable and square-summable. The kinematic model of the learner quadrotor UAV is defined in the same way in terms of the learner's state, the learner's control input, and the learner's adversary control input, and the learner's adversary control input is likewise assumed to be measurable and square-summable.
[0011] In one embodiment, the communication model considers random packet loss in the sensor-controller channel, the controller-actuator channel, and the adversarial instruction injection channel, and is described by Bernoulli random variables, specifically including: For experts: Expert signal transmission is accomplished via a wireless communication network employing a transmission control protocol. Both the expert controller and the adversary receive signals through a sensor-controller channel. The packet loss situation is determined by the Bernoulli random variable. The probability of successful reception is described as follows: ; Decisions made by the controlling party Data is transmitted via the controller-actuator channel; packet loss is determined by a Bernoulli random variable. Describe the probability of successful delivery as follows: ; Decisions made by the opposing side The packet loss situation is determined by a Bernoulli random variable, which is transmitted via adversarial command injection into the channel. Describe the probability of successful delivery as follows: ; Decision-making by expert controllers For state feedback decision-making, where For expert control decision-making law and abbreviation Decision-making by the expert controlling party; decision-making by the expert opposing party. It is also a state feedback decision, in which For expert adversarial decision-making law and abbreviation For expert adversarial decision-making; the actual control input to the kinematic model of the expert quadcopter UAV is... The opponent's control input is ; For learners: all learner signals are transmitted via a wireless communication network employing a transmission control protocol; both the learner controller and the adversary receive signals through a sensor-controller channel. The packet loss situation is determined by the Bernoulli random variable. 
The probability of successful reception is described as follows: ; Decisions made by the controlling party Data is transmitted via the controller-actuator channel; packet loss is determined by a Bernoulli random variable. Describe the probability of successful delivery as follows: ; Decisions made by the opposing side The packet loss situation is determined by a Bernoulli random variable, which is transmitted via adversarial command injection into the channel. Describe the probability of successful delivery as follows: , Learner-controlled decision-making For state feedback decision-making, where For learners to control decision-making laws and abbreviation Decision-making for learners in control; decision-making for learners in opposition. For state feedback decision-making, where For learners, the adversarial decision-making law and abbreviation The learner makes the adversarial decision; the actual control input for the learner, who then executes the kinematic model of the quadcopter drone, is... The opponent's control input is During the learner's process, the learner possesses their own information set. : .
[0012] In one embodiment, defining the value functions of the expert and the learner within a zero-sum game framework, and modeling expert behavior as a Nash equilibrium between the controller and the adversary, specifically includes: Define the expert's value function for: ; The expert's state penalty weight matrix; The penalty weight matrix for the expert's control side; As the attenuation factor for the adversary; based on the expert's value function, the expert controller's decision aims to minimize The opposing side's decisions aim to maximize... This constitutes a zero-sum game; experts actually use the optimal control method to make decisions. and the worst-case scenario decision , making ,in The optimal game value function for the expert; Define the learner's value function for: ; The penalty weight matrix represents the state to be determined for the learner. The learner's control penalty weight matrix, The attenuation factor for the adversary; the penalty weight matrix for the learner's control side. and the attenuation factor of the opponent Fixed and known; learners can acquire the expert's state. Controlling party decision and opposing decision The measurement information includes the expert's state trajectory, the controller's decision trajectory, and the adversary's decision trajectory; therefore, before learners begin learning, they also possess an expert information set. : .
[0013] In one embodiment, the goal of establishing the expert decision imitation problem under a zero-sum game is to enable learners to imitate expert decisions by adjusting their own state penalty weight matrix, specifically including: Without knowing the expert's value function, i.e., without knowing the expert's state penalty weight matrix. Control side penalty weight matrix and the opponent's attenuation factor In this case, the learner aims to determine its own state penalty weight matrix. Ultimately, the learner determines its own value function to mimic the decisions of both the expert controller and the adversary; the learner needs to determine its own state penalty weight matrix through learning. In order to achieve .
[0014] In one embodiment, the derivation of the learner's optimal control decision and worst adversary decision using the Bellman equation and the minimax principle, and the establishment of an update rule for the state penalty weight matrix based on inverse optimal control, specifically includes: When the learner's state penalty weight matrix Determine and establish the learner's optimal control side decision and worst-case adversary side decision; for ease of expression, [the following will be used]. Abbreviated as ; For time and known , , In the case of defining , , , ; Represents the evolution of the learner's state under different packet loss conditions; Define , , , , These represent the learner's state direction respectively. The probability of evolution; For time and known In the case of defining , , , ; This indicates that the expert's status changes under different packet loss conditions. The evolution of the state to the next time step; Will Abbreviated as ,Will Abbreviated as ; For permissible learner decisions , That is, the learner's value function is bounded in decision-making. The Bellman equation for the learner is established using dynamic programming: ; Based on the Bellman equation of the learner mentioned above, the Hamiltonian function is constructed as follows: ; In the learner's state penalty weight matrix Without considering expert decision-making, the learner's optimal control decision can be obtained through the minimax principle and the first-order necessary condition: ; (6) in The expression and The expressions are consistent, but the decision is and , This represents the learner's optimal decision in control. This represents the learner's optimal decision against the opponent. This indicates that the learner's input is and The state under the condition; the learner's optimal value function The Hamilton-Jacobi-Isaks equations are satisfied as follows: (7) Then, determine the learner's state-penalty weight matrix. 
The update rule: To solve the problem of imitating expert decisions, it is necessary to update the rule in equation (6). In time The update is based on inverse optimal control theory and matching methods, establishing the current state penalty weight matrix. The rule for updating the state penalty weight matrix is as follows: (8) This is the updated state penalty weight matrix.
[0015] In one embodiment, the solution by value iteration specifically includes the following steps:
A1. Initialize the state penalty weight matrix, the value function, the learner control decision (set to zero), and the learner adversary decision. For a given time step and a known learner state, define the learner's next-step states under the different packet loss conditions, evaluated under the learner's control decision and adversary decision of the i-th iteration.
A2. Update the learner's value function using equation (7); the index i labels the learner's value function and state penalty weight matrix at the i-th iteration.
A3. Update the learner's control decision and adversary decision using equation (6).
A4. Update the learner's state penalty weight matrix using equation (8).
Steps A2 to A4 are iterated until the stopping criterion defined by the specified error bound is met.
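The forward part of this loop (the value update and the decision update) can be illustrated on a much simpler problem. The sketch below is a stand-in and not the patent's equations (6) to (8): it assumes a scalar linear system with a quadratic zero-sum stage cost, no packet loss, and a fixed state penalty weight, and it omits the weight update of step A4 entirely. All names and numerical values are hypothetical.

```python
def game_value_iteration(a, b, d, q, r, gamma, tol=1e-12, max_iter=10000):
    """Value iteration for V(x) = p * x**2 in a scalar zero-sum game.

    Assumed dynamics:   x_next = a*x + b*u + d*w
    Assumed stage cost: q*x**2 + r*u**2 - gamma**2 * w**2
    Returns (p, k_u, k_w) with u = k_u * x (minimizer), w = k_w * x (maximizer).
    """
    p = 0.0
    k_u = k_w = 0.0
    for _ in range(max_iter):
        # The inner maximization over w is concave only if gamma**2 > p * d**2.
        if gamma ** 2 <= p * d * d:
            raise ValueError("no saddle point: attenuation factor too small")
        denom = 1.0 + p * (b * b / r - d * d / gamma ** 2)
        s = a / denom                   # closed loop: x_next = s * x
        k_u = -p * b * s / r            # decision update, controller side
        k_w = p * d * s / gamma ** 2    # decision update, adversary side
        # Value update: stage cost under the new decisions plus next-step value.
        p_next = q + r * k_u ** 2 - gamma ** 2 * k_w ** 2 + p * s ** 2
        if abs(p_next - p) < tol:       # stop once within the error bound
            return p_next, k_u, k_w
        p = p_next
    return p, k_u, k_w


p_star, gain_u, gain_w = game_value_iteration(
    a=0.9, b=1.0, d=0.5, q=1.0, r=1.0, gamma=2.0
)
print(p_star, gain_u, gain_w)
```

At convergence the returned `p_star` satisfies the scalar game Riccati equation for this assumed model. In the method described above, an additional outer update would adjust the state penalty weight at every iteration.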
[0016] In one embodiment, training a neural network using expert trajectory data to fit the expert's controller and adversary decisions specifically includes the following. When the expert's decision laws are unknown, they are obtained from expert trajectories. The expert state and controller decision data at each sampled time are extracted from the expert trajectory, and a first multilayer perceptron is introduced to fit the expert's control decisions, that is, to approximate the mapping from the expert quadrotor UAV's state to its control commands. The first multilayer perceptron consists of an input layer, a hidden layer, and an output layer. The number of neurons in the input layer equals the dimension of the expert state. The number of neurons in the hidden layer is adjusted automatically, and its activation function is the rectified linear unit (ReLU). The number of neurons in the output layer equals the dimension of the expert controller's decision, and its activation function is linear. The parameters of the first multilayer perceptron are learned by supervised learning, minimizing the mean squared error between its output and the expert controller's decision. Likewise, the expert state and adversary decision data at each sampled time are extracted from the expert trajectory, and a second multilayer perceptron, with its own parameters, is introduced to fit the expert's adversary decisions, that is, to approximate the mapping from the expert state to the expert adversary's decision.
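The supervised fitting step can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the patent's network: the state and decision are scalars, the hidden width is fixed at 8 rather than adjusted automatically, training uses full-batch gradient descent, and the "expert" data come from a hypothetical linear feedback law `u = -0.8 * x`.

```python
import random


def train_mlp(states, decisions, hidden=8, lr=0.05, epochs=500, seed=0):
    """Fit a one-hidden-layer ReLU network with a linear output by
    full-batch gradient descent on the mean squared error."""
    rng = random.Random(seed)
    w1 = [rng.uniform(-0.5, 0.5) for _ in range(hidden)]
    b1 = [0.0] * hidden
    w2 = [rng.uniform(-0.5, 0.5) for _ in range(hidden)]
    b2 = 0.0
    n = len(states)
    losses = []
    for _ in range(epochs):
        g_w1 = [0.0] * hidden
        g_b1 = [0.0] * hidden
        g_w2 = [0.0] * hidden
        g_b2 = 0.0
        loss = 0.0
        for x, y in zip(states, decisions):
            pre = [w1[j] * x + b1[j] for j in range(hidden)]
            act = [max(0.0, z) for z in pre]              # ReLU hidden layer
            out = sum(w2[j] * act[j] for j in range(hidden)) + b2
            err = out - y
            loss += err * err / n
            g_out = 2.0 * err / n                         # d(MSE)/d(out)
            g_b2 += g_out
            for j in range(hidden):
                g_w2[j] += g_out * act[j]
                if pre[j] > 0.0:                          # ReLU derivative
                    g_w1[j] += g_out * w2[j] * x
                    g_b1[j] += g_out * w2[j]
        losses.append(loss)                               # loss before this update
        for j in range(hidden):
            w1[j] -= lr * g_w1[j]
            b1[j] -= lr * g_b1[j]
            w2[j] -= lr * g_w2[j]
        b2 -= lr * g_b2

    def predict(x):
        act = [max(0.0, w1[j] * x + b1[j]) for j in range(hidden)]
        return sum(w2[j] * act[j] for j in range(hidden)) + b2

    return predict, losses


# Hypothetical expert data: states on a grid, decisions from u = -0.8 * x.
expert_states = [i / 10.0 for i in range(-10, 11)]
expert_decisions = [-0.8 * x for x in expert_states]
policy, losses = train_mlp(expert_states, expert_decisions)
print(losses[0], losses[-1])
```

The same routine could be reused for the adversary network by passing adversary decisions as targets. A practical implementation would use vector-valued states and an established machine-learning library.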
[0017] In one embodiment, estimating the packet loss probability and system dynamics using learner trajectory data specifically includes: For total duration learner trajectory a) Extraction Total duration corresponding to time Measurement data of learner states, controller decisions, and adversary decisions; obtained using the Monte Carlo method. The estimated value For estimation Introducing a third multilayer perceptron, namely , These are the parameters of the third multilayer perceptron; This represents the mapping function obtained based on the multilayer perceptron fitting, realizing the transformation from the learner state to the inherent dynamics of the system. Approximate mapping; b): Extraction Total duration corresponding to time Measurement data of learner states, controller decisions, and adversary decisions; obtained using the Monte Carlo method. The estimated value For estimation Introducing a fourth multilayer perceptron, namely , These are the parameters of the fourth multilayer perceptron; This represents the mapping function obtained based on multilayer perceptron fitting, realizing the dynamic transition from learner state to adversary input. Approximate mapping; c): Extraction Total duration corresponding to time The Monte Carlo method was used to obtain measurement data on learner states, controller decisions, and adversary decisions. The estimated value For estimation Introducing the fifth multilayer perceptron, namely , These are the parameters of the fifth multilayer perceptron; This represents the mapping function obtained based on multilayer perceptron fitting, realizing the dynamic transition from learner state to controller input. Approximate mapping; d) Extraction Total duration corresponding to time The Monte Carlo method was used to obtain measurement data on learner states, controller decisions, and adversary decisions. 
The estimated value ; Utilizing learned probabilistic information and system dynamics, in and known In the case of redefining: , , , .
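The Monte Carlo estimate of a channel's delivery probability amounts to a sample mean of the observed 0/1 delivery indicators. The sketch below assumes those indicators can be read off the learner trajectory. The function name, the true probability, and the sample size are hypothetical.

```python
import random


def estimate_delivery_probability(indicators):
    """Monte Carlo estimate: fraction of time steps with a delivered packet.

    `indicators` is a sequence of 0/1 values, one per time step
    (1 = packet delivered, 0 = packet lost).
    """
    if not indicators:
        raise ValueError("need at least one observation")
    return sum(indicators) / len(indicators)


# Synthetic check: draw Bernoulli outcomes with a known success probability.
rng = random.Random(42)
true_p = 0.85
observed = [1 if rng.random() < true_p else 0 for _ in range(50000)]
p_hat = estimate_delivery_probability(observed)
print(p_hat)
```

The estimate's standard error shrinks as one over the square root of the trajectory length, so longer learner trajectories give tighter probability estimates.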
[0018] In one embodiment, the proposed data-driven model-free expert decision imitation algorithm combines the estimated packet loss probability with system dynamics, and updates the state penalty weight matrix through value iteration and neural networks, specifically including the following steps: B1, Initialization , , , , =0, ; The initial components of the iterative state penalty weight matrix are represented. The initialization component of the learner's iterative value function. B2, using expert trajectory data to obtain an expert control party decision estimation network that can approximate expert decisions. Expert adversarial decision estimation network Probability estimation is obtained using learner trajectory data. And a dynamic estimation network that can approximate the learner's... Control input dynamic estimation network and adversary input dynamic estimation network ; B3, perform learner component value function update: extract The corresponding data at that time, and updated using the following formula. ; (12) This represents the 0th component of the value function of the learner in the i-th iteration; extract The corresponding learner state, controller decision, and adversary decision measurement data are used to update the following formula. : (13) This represents the first component of the value function of the learner in the i-th iteration; extract The corresponding learner state, controller decision, and adversary decision measurement data are used to update the following formula. : (14) This represents the second component of the value function of the learner in the i-th iteration; extract The corresponding learner state, controller decision, and adversary decision measurement data are used to update the following formula. : (15) This represents the third component of the value function of the learner in the i-th iteration; B4. 
Update the learner's value function using equations (12) to (15): ; B5, updating learner decisions: ; ; B6, Update the learner's component state penalty weight matrix: Extract The corresponding data at that time, and updated using the following formula. : (18) extract The corresponding data at that time, and updated using the following formula. : (19) extract The corresponding data at that time, and updated using the following formula. : (20) extract The corresponding data at that time, and updated using the following formula. : ;(twenty one) B7. Update the learner's state penalty weight matrix by combining equations (18) to (21): ; Iterate through steps B3 to B7 until... .
[0019] Compared with the prior art, the beneficial technical effects of the present invention are as follows. Existing UAV decision-making imitation methods generally rely on accurate models, ignore actual communication packet loss, and struggle to imitate expert decisions and cost structures from data within a zero-sum game framework in which the controller and the adversary coexist. To address these problems, this invention proposes a data-driven expert decision-making imitation method for quadrotor UAVs. The method establishes a zero-sum game inverse reinforcement learning framework under communication packet loss, enabling the learner UAV to jointly imitate the decisions of both the expert controller and the adversary, based solely on the state trajectories and decision data of the expert and the learner, even when the system dynamics and expert cost functions are unknown.
Attached Figure Description
[0020] Figure 1 is a flowchart of the method of the present invention.
Detailed Implementation
[0021] A preferred embodiment of the present invention will now be described in detail with reference to the accompanying drawings.
[0022] As shown in Figure 1, the data-driven expert quadrotor UAV decision-making imitation method of the present invention includes the following steps:
S1. Establish kinematic and communication models for the expert quadrotor UAV and the learner quadrotor UAV. The communication model accounts for random packet loss in the sensor-controller channel, the controller-actuator channel, and the adversarial command injection channel, each described by a Bernoulli random variable.
S2. Refer to the expert quadrotor UAV as the expert and the learner quadrotor UAV as the learner. Within the zero-sum game framework, define the value functions of the expert and the learner, and model the expert's behavior as the Nash equilibrium between the controller and the adversary. Formulate the expert decision imitation problem under the zero-sum game, whose goal is for the learner to imitate the expert's decisions by adjusting its own state penalty weight matrix.
S3. Propose a data-driven model-based expert decision imitation algorithm. Derive the learner's optimal control decision and worst-case adversary decision through the Bellman equation and the minimax principle, and establish an update rule for the state penalty weight matrix based on inverse optimal control, solved by value iteration.
S4. Train neural networks on expert trajectory data to fit the decisions of the expert's controller and adversary. Estimate the packet loss probabilities and system dynamics from learner trajectory data, and propose a data-driven, model-free expert decision imitation algorithm that combines these estimates to update the state penalty weight matrix through value iteration and neural networks.
[0023] The following provides a detailed explanation of each step.
[0024] Step S1 specifically includes the following. The expert quadrotor UAV is modeled by equation (1). The expert's state consists of the expert's three-dimensional position, three-dimensional linear velocity, and three-dimensional attitude angles: the position components are the expert's positions along the three coordinate axes, the velocity components are the linear velocities along those axes, and the attitude components are the roll, pitch, and yaw angles. The expert's control input consists of the mass-normalized thrust vector, formed from the thrust generated by each motor of the expert quadrotor UAV and the mass of the expert UAV, together with the expert's angular velocity in three-dimensional space. The expert's adversary control input consists of a simplified constant wind disturbance and a state-based malicious attack on the sensors. The system dynamics functions are Lipschitz continuous. The system corresponding to equation (1) is assumed to be controllable, and the expert's disturbance input is assumed to be measurable and square-summable.
[0025] The expert's signals are transmitted over a wireless communication network that uses the Transmission Control Protocol (TCP). In complex flight environments, data packets may be lost because of channel congestion, obstruction, or interference. Both the expert controller and the adversary make decisions based on the system state, but they acquire the state and transmit their decisions through different channels. Specifically, both the controller and the adversary receive the state signal through a sensor-controller channel, whose packet loss is described by a Bernoulli random variable with a given probability of successful reception. The controller's decision is transmitted through the controller-actuator channel, whose packet loss is described by a second Bernoulli random variable with a given probability of successful delivery. The adversary's decision is transmitted through the adversarial command injection channel, whose packet loss is described by a third Bernoulli random variable with a given probability of successful delivery. The controller's decision and the adversary's decision are both state feedback decisions. Consequently, the control input and the adversary input actually applied to system (1) depend on the packet loss outcomes in these channels.
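A small simulation can make the three-channel packet loss model concrete. The sketch below rests on several assumptions that are not taken from the text: a scalar state, hypothetical linear feedback gains, hypothetical delivery probabilities, and the convention that a lost packet results in a zero applied input (so a decision takes effect only when both the state packet and the decision packet are delivered).

```python
import random


def bernoulli(p_success, rng):
    """Return 1 if a packet is delivered (probability p_success), else 0."""
    return 1 if rng.random() < p_success else 0


def applied_inputs(x, p_sensor, p_actuator, p_injection, rng,
                   control_gain=-0.8, adversary_gain=0.3):
    """Return (u_applied, w_applied) for one time step.

    Assumption of this sketch: a lost packet yields a zero applied input.
    """
    alpha = bernoulli(p_sensor, rng)      # sensor-controller channel
    beta = bernoulli(p_actuator, rng)     # controller-actuator channel
    delta = bernoulli(p_injection, rng)   # adversarial command injection channel
    u_decision = control_gain * x         # hypothetical state feedback law
    w_decision = adversary_gain * x       # hypothetical adversary feedback law
    return alpha * beta * u_decision, alpha * delta * w_decision


rng = random.Random(0)
samples = [applied_inputs(1.0, 0.9, 0.8, 0.7, rng) for _ in range(20000)]
rate_u = sum(1 for u, _ in samples if u != 0.0) / len(samples)
rate_w = sum(1 for _, w in samples if w != 0.0) / len(samples)
print(rate_u, rate_w)
```

Under these assumptions the control input reaches the plant with probability equal to the product of the sensor and actuator delivery probabilities (0.9 × 0.8 here), and the adversary input with the product of the sensor and injection probabilities (0.9 × 0.7 here).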
[0026] Similar to expert quadcopter drones, modeling is performed on learning quadcopter drones: (2) in For the learner's state, The learner controls the input. The learner's adversary control input has the same physical meaning as that of the expert quadcopter drone, and will not be repeated here. Assume the learner's perturbation input... It is measurable and square-summable. Similar to experts, the learner's signal transmission is also accomplished via a TCP-based communication network, and signal loss is possible. Both the controller and the adversary receive signals through a sensor-controller channel. The packet loss situation is determined by the Bernoulli random variable. Describe, ; Decisions made by the controlling party The data is transmitted via its controller-actuator channel, and packet loss is determined by a Bernoulli random variable. Describe, ; Decisions made by the opposing side The packet loss is described by Bernoulli random variables through adversarial command injection into the channel. The controlling party's decision-making For state feedback decision-making, i.e. And satisfy ; the decision-making of the opposing side This is also known as state feedback decision-making, i.e. And satisfy Therefore, the control input of the system that actually executes equation (2) is... The opponent's input is During the learner's process, the learner possesses their own information set: .
[0027] In tasks such as UAV trajectory tracking, obstacle avoidance, and formation control, the learner must recover the expert's single-step cost function from expert trajectories in order to imitate the expert's decision-making.
[0028] Step S2 specifically includes the following. The expert's value function is defined by equation (3). In it, the expert's state penalty weight matrix reflects the importance attached to the UAV's position error; the expert's control penalty weight matrix reflects the emphasis on the energy consumption of the four rotors; and the attenuation factor adjusts the expert's robustness to interference signals. Based on this value function, the expert controller's decision aims to minimize the value function, while the adversary's decision aims to maximize it, which constitutes a zero-sum game. The expert actually applies the optimal control decision and the worst-case adversary decision, which together attain the expert's optimal game value function.
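The roles of the three weights can be seen in a single-step cost. The sketch below assumes the quadratic form commonly used in zero-sum (H-infinity style) formulations: a state penalty plus a control penalty minus the squared attenuation factor times the adversary's energy. Whether equation (3) has exactly this form, and how packet loss enters it, is not recoverable from this text, so the function and all values are illustrative only.

```python
def quad_form(v, m):
    """Return v^T m v for a vector v (list) and a square matrix m (list of lists)."""
    n = len(v)
    return sum(v[i] * m[i][j] * v[j] for i in range(n) for j in range(n))


def stage_cost(x, u, w, Q, R, gamma):
    """Assumed single-step cost: x^T Q x + u^T R u - gamma^2 * w^T w.

    The controller wants this small; the adversary wants it large.
    """
    return quad_form(x, Q) + quad_form(u, R) - gamma ** 2 * sum(wi * wi for wi in w)


def trajectory_cost(xs, us, ws, Q, R, gamma):
    """Sum the assumed stage cost along a finite recorded trajectory."""
    return sum(stage_cost(x, u, w, Q, R, gamma) for x, u, w in zip(xs, us, ws))


Q = [[1.0, 0.0], [0.0, 2.0]]   # hypothetical state penalty weight matrix
R = [[0.5]]                    # hypothetical control penalty weight matrix
cost = stage_cost([1.0, 2.0], [1.0], [2.0], Q, R, gamma=1.5)
print(cost)
```

A larger attenuation factor makes adversary inputs more costly for the adversary itself, which corresponds to a less conservative controller. A smaller one does the opposite.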
[0029] Define the learner's value function as: .
[0030] in The penalty weight matrix represents the state to be determined for the learner. The learner's control penalty weight matrix. The attenuation factor is the adversary's factor. The learner's control-side penalty weight matrix. and the attenuation factor of the opponent Fixed and known. Learners can obtain the expert's state trajectory. Decision-making trajectory of the control party and the decision-making trajectory of the opposing side Therefore, before learners engage in learning, they also possess a set of expert information. .
[0031] The expert decision imitation problem is established as follows. The expert's value function (3) is unknown; that is, the expert's state penalty weight matrix, control penalty weight matrix, and adversary attenuation factor are unknown. Under this condition, learner (2) aims to determine its own state penalty weight matrix, and thereby its own value function, so that its decisions imitate those of both the expert controller and the expert adversary. In other words, the learner must learn a state penalty weight matrix under which its optimal controller decision and worst-case adversary decision match those of the expert.
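[0032] Step S3 specifically includes the following. First, with the learner's state penalty weight matrix fixed, the learner's optimal controller decision and worst-case adversary decision are established. For ease of expression, abbreviated notation is used in the following text. Given the learner's state, controller decision, and adversary decision at the current time, quantities are defined that represent the evolution of the learner's state under the different packet loss conditions, together with the probabilities of those evolutions. Likewise, given the expert's state at the current time, quantities are defined that represent the evolution of the expert's state to the next time step under the different packet loss conditions.
[0033] Abbreviated forms of these quantities are used below.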
[0034] For admissible learner decisions, that is, decisions under which the learner's value function is bounded, the learner's Bellman equation is established by dynamic programming.
[0035] Its Hamiltonian function is then established.
[0036] With the learner's state penalty weight matrix fixed and without considering the expert's decisions, the learner's optimal controller decision and worst-case adversary decision are obtained from the minimax principle and the first-order necessary conditions, giving equation (6). The expressions in (6) have the same form as those defined above, but are evaluated at the learner's optimal controller and adversary decisions.
[0037] The learner's optimal value function satisfies the Hamilton-Jacobi-Isaacs equation (7). Next, the update rule for the learner's state penalty weight matrix is determined. To solve the expert decision imitation problem, the state penalty weight matrix used in equation (6) must be updated in a timely manner.
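As a worked illustration of how first-order necessary conditions yield a minimax decision pair, the following sketch treats the simplified case without packet loss. The input-affine dynamics and the quadratic single-step cost are assumptions (the patent's own equations are not reproduced in this text), and the symbols f, g, h, Q, R, γ, and V are introduced here for illustration only.

```latex
% Assumed dynamics and Hamiltonian (no packet loss):
%   x_{k+1} = f(x_k) + g(x_k) u_k + h(x_k) w_k
\begin{aligned}
H(x_k,u_k,w_k) &= x_k^{\top} Q x_k + u_k^{\top} R u_k - \gamma^{2} w_k^{\top} w_k
                  + V(x_{k+1}) - V(x_k), \\[4pt]
\frac{\partial H}{\partial u_k} &= 2 R u_k + g(x_k)^{\top} \nabla V(x_{k+1}) = 0
  \;\Longrightarrow\;
  u_k^{*} = -\tfrac{1}{2} R^{-1} g(x_k)^{\top} \nabla V(x_{k+1}), \\[4pt]
\frac{\partial H}{\partial w_k} &= -2 \gamma^{2} w_k + h(x_k)^{\top} \nabla V(x_{k+1}) = 0
  \;\Longrightarrow\;
  w_k^{*} = \tfrac{1}{2\gamma^{2}} h(x_k)^{\top} \nabla V(x_{k+1}).
\end{aligned}
```

With packet loss, the term V(x_{k+1}) would be replaced by an expectation over the packet loss conditions, which rescales these expressions by the corresponding probabilities.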
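[0038] Based on inverse optimal control theory and a matching method, the rule for updating the current state penalty weight matrix to the updated one is established as equation (8). Finally, a value iteration framework is introduced to establish a data-driven, model-based decision imitation value iteration algorithm, which includes the following steps.
A1: Initialize the state penalty weight matrix and the value function, set the learner's initial controller decision to zero, and initialize the learner's adversary decision. Given the learner's state and the decisions of the current iteration, define the quantities that represent the evolution of the learner's state under the different packet loss conditions; abbreviated forms of these quantities are used below.
[0039] A2: Update the learner's value function using equation (7), giving equation (9).
A3: Update the learner's decisions using equation (6), giving equation (10).
A4: Update the learner's state penalty weight matrix using equation (8), giving equation (11).
Steps A2 to A4 are iterated until the stopping condition with the specified error bound is met.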
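The value-update and decision-update steps of such a loop can be sketched on a scalar linear-quadratic zero-sum game. This is a simplified stand-in, not the patented algorithm: it assumes no packet loss, linear dynamics x' = a x + b u + d w, the single-step cost q x^2 + r u^2 - gamma^2 w^2, and a fixed state weight `q` (the inverse-optimal-control update of step A4 is omitted because equation (8) is not reproduced here). All names and numbers are hypothetical.

```python
def game_value_iteration(a, b, d, q, r, gamma, tol=1e-12, max_iter=10_000):
    """Value iteration for V_i(x) = p_i * x^2 in a scalar zero-sum LQ game.

    Each pass applies V_{i+1}(x) = min_u max_w [cost + V_i(a x + b u + d w)],
    which for this problem reduces to a scalar Riccati-type recursion.
    """
    s = b * b / r - d * d / gamma ** 2
    p = 0.0                                # initial value function V_0 = 0
    for _ in range(max_iter):
        p_next = q + a * a * p / (1.0 + p * s)
        converged = abs(p_next - p) < tol
        p = p_next
        if converged:
            break
    lam = 1.0 + p * s
    k_u = -(b / r) * p * a / lam           # controller decision u = k_u * x
    k_w = (d / gamma ** 2) * p * a / lam   # adversary decision  w = k_w * x
    return p, k_u, k_w

a, b, d, q, r, gamma = 0.9, 1.0, 0.5, 1.0, 1.0, 2.0
p, k_u, k_w = game_value_iteration(a, b, d, q, r, gamma)
closed_loop = a + b * k_u + d * k_w
print(p, k_u, k_w, closed_loop)
```

For these values the recursion is a contraction, so the loop stops well before `max_iter`; a smaller `gamma` can make the game ill-posed, which the sketch does not guard against.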
[0040] Although this algorithm uses expert and learner data, it relies on complete probability information and system dynamics, and is therefore model-based. In tasks such as UAV trajectory tracking, obstacle avoidance, and formation control, however, the learner typically needs to imitate expert decisions in a model-free manner.
[0041] S4: A data-driven, model-free expert decision imitation algorithm is proposed. To imitate expert decisions in a model-free manner, identification must be carried out using the expert trajectories and the learner's own trajectories. First, because the expert's controller decision law is unknown, expert trajectories are used to estimate it. The expert's state and controller decision data at each time are extracted from the expert trajectories. To realize the mapping from the expert quadrotor UAV's state to its control commands, a multilayer perceptron with its own network parameters is introduced to fit the expert's controller decisions. The perceptron has one input layer (number of neurons equal to the dimension of the expert state), one hidden layer (number of neurons adjusted automatically, with a rectified linear unit (ReLU) activation function), and one output layer (number of neurons equal to the dimension of the expert controller's decision, with a linear activation function). The network parameters are learned by supervised learning, minimizing the mean squared error between the network output and the expert's controller decision.
[0042] Similarly, because the expert's adversary decision law is unknown, expert trajectories are used to estimate it. The expert's state and adversary decision data at each time are extracted from the expert trajectories. To realize the mapping from the expert quadrotor UAV's state to the adversary's commands, a multilayer perceptron with its own network parameters is introduced to fit the expert's adversary decisions. Its structure and parameter learning are the same as for the perceptron fitting the expert controller's decisions, except that its output dimension and output correspond to the expert adversary's decision, so the details are not repeated.
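The supervised fitting step described above can be sketched in pure Python for a scalar state and a scalar decision. The network shape follows the text (one ReLU hidden layer, linear output, squared-error loss), but the hidden width, learning rate, epoch count, and the synthetic "expert" law u = -0.5 x are all assumptions made for illustration.

```python
import random

def train_mlp(data, hidden=8, lr=0.05, epochs=300, seed=0):
    """Fit y ~ w2 . relu(w1 * x + b1) + b2 to (x, y) pairs by stochastic gradient descent."""
    rng = random.Random(seed)
    w1 = [rng.uniform(-1, 1) for _ in range(hidden)]
    b1 = [0.0] * hidden
    w2 = [rng.uniform(-1, 1) for _ in range(hidden)]
    b2 = [0.0]  # one-element list so the nested functions can update it in place

    def forward(x):
        h = [max(0.0, w1[j] * x + b1[j]) for j in range(hidden)]
        return h, sum(w2[j] * h[j] for j in range(hidden)) + b2[0]

    def mse():
        return sum((forward(x)[1] - y) ** 2 for x, y in data) / len(data)

    history = [mse()]
    for _ in range(epochs):
        for x, y in data:
            h, out = forward(x)
            err = out - y  # gradient of 0.5 * (out - y)^2 with respect to out
            for j in range(hidden):
                # Backpropagate through the ReLU using the pre-update output weight.
                grad_h = err * w2[j] if h[j] > 0 else 0.0
                w2[j] -= lr * err * h[j]
                w1[j] -= lr * grad_h * x
                b1[j] -= lr * grad_h
            b2[0] -= lr * err
        history.append(mse())
    return (lambda x: forward(x)[1]), history

# Hypothetical expert trajectory: states on a grid, decisions from u = -0.5 * x.
data = [(i / 10.0, -0.5 * i / 10.0) for i in range(-10, 11)]
policy, history = train_mlp(data)
print(history[0], history[-1])
```

A second network of the same shape, trained on (state, adversary decision) pairs, would play the role of the adversary decision estimator.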
[0043] Then, learner trajectories are used to estimate the probability information and the system dynamics. A learner trajectory of a given total duration is divided into four cases, a) to d), and each case is handled as follows.
a) The times belonging to this case, their total duration, and the corresponding data are extracted. The Monte Carlo method gives an estimate of the probability of this case. To estimate the system's inherent dynamics, a multilayer perceptron with its own network parameters is introduced; its structure and parameter learning are the same as for the perceptron fitting the expert controller's decisions, except that its output dimension and output correspond to the learner's dynamic function.
b) The times belonging to this case, their total duration, and the corresponding data are extracted. The Monte Carlo method gives an estimate of the probability of this case. To estimate the adversary input dynamics, a multilayer perceptron with its own network parameters is introduced in the same way.
c) The times belonging to this case, their total duration, and the corresponding data are extracted. The Monte Carlo method gives an estimate of the probability of this case. To estimate the control input dynamics, a multilayer perceptron with its own network parameters is introduced in the same way.
d) The times belonging to this case, their total duration, and the corresponding data are extracted. The Monte Carlo method gives an estimate of the probability of this case.
[0044] Next, using the learned probability information and system dynamics, the state evolution quantities are redefined, for a given learner state and given decisions, in terms of these estimates.
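The Monte Carlo estimate in each case amounts to an empirical frequency: the share of time steps that fall into the case. The sketch below assumes that each time step of the learner trajectory carries a case label 0 to 3; the labels, the true probabilities, and the function name are hypothetical.

```python
import random
from collections import Counter

def estimate_case_probabilities(labels, num_cases=4):
    """Empirical frequency of each case: steps spent in the case / total steps."""
    counts = Counter(labels)
    total = len(labels)
    return [counts.get(c, 0) / total for c in range(num_cases)]

rng = random.Random(1)
true_probs = [0.5, 0.2, 0.2, 0.1]   # hypothetical probabilities of the four cases
labels = rng.choices(range(4), weights=true_probs, k=20000)

p_hat = estimate_case_probabilities(labels)
print(p_hat)
```

The estimates sum to one by construction, and their error shrinks roughly with the square root of the trajectory length.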
[0045] Finally, building on the model-based algorithm, a data-driven, model-free decision imitation algorithm is established, including the following steps.
B1: Initialize the components of the state penalty weight matrix, the components of the value function, and the learner's decisions.
[0046] B2: Use expert trajectories to obtain the expert decision estimation networks, which approximate the expert controller's and the expert adversary's decisions; use learner trajectories to obtain the probability estimates and the dynamics estimation networks, which approximate the learner's inherent dynamics, control input dynamics, and adversary input dynamics.
[0047] B3: Update the components of the learner's value function. The data for each of the four cases are extracted in turn and used to update the corresponding component:
[0048] the 0th component, by equation (12);
[0049] the first component, by equation (13);
[0050] the second component, by equation (14);
[0051] the third component, by equation (15).
[0052] B4: Update the learner's value function by combining equations (12) to (15), giving equation (16).
B5: Update the learner's decisions, giving equation (17).
B6: Update the components of the learner's state penalty weight matrix. The data for each of the four cases are extracted in turn and used to update the corresponding component by equations (18), (19), (20), and (21), respectively.
B7: Update the learner's state penalty weight matrix by combining equations (18) to (21), giving equation (22).
Steps B3 to B7 are iterated until the stopping condition with the specified error bound is met.
[0053] It should be noted that, in the data-driven model-free decision imitation algorithm, although equations (12) to (22) provide the update formulas, these formulas involve the outputs of the estimation networks and gradient computations of the value function and the decisions. Furthermore, the updated value function and decisions typically have no analytical solution. In practical implementation, therefore, the value function and decisions in the proposed model-free algorithm are usually parameterized by neural networks, and their parameters are updated by methods such as gradient descent. This invention does not limit the specific network structure or optimization method used; it only provides a data-driven expert decision imitation framework. The framework can be implemented in a model-based manner (as in step S3) or in a model-free manner (as in step S4). In the model-free mode, the algorithm supports an offline learning mode, in which pre-collected expert trajectory and learner trajectory data are used to train and update the system dynamics networks, the expert decision networks, and the learner's decision and value function networks. It also supports an online learning mode, in which data are collected dynamically during the learner's real-time interaction with the environment and the network parameters are updated incrementally, so that decisions are adjusted in real time and imitated adaptively.
[0054] The terminology used herein is for the purpose of describing particular embodiments only and is not intended to limit the invention. The terms “comprising,” “including,” etc., as used herein indicate the presence of the stated features, steps, operations, and / or components, but do not exclude the presence or addition of one or more other features, steps, operations, or components.
[0055] It should be understood that although the steps in the flowcharts of the accompanying drawings are shown sequentially as indicated by the arrows, these steps are not necessarily executed in the order indicated by the arrows. Unless explicitly stated herein, there is no strict order restriction on the execution of these steps, and they can be executed in other orders. Moreover, at least some of the steps in the flowcharts of the accompanying drawings may include multiple steps or stages, which are not necessarily completed at the same time, but may be executed at different times, and the execution order of these steps or stages is not necessarily sequential, but may be performed alternately or in turn with other steps or at least some of the steps or stages of other steps.
[0056] The technical features of the above embodiments can be combined in any way. For the sake of brevity, not all possible combinations of the technical features in the above embodiments are described. However, as long as there is no contradiction in the combination of these technical features, they should be considered to be within the scope of this specification.
[0057] It will be apparent to those skilled in the art that the present invention is not limited to the details of the exemplary embodiments described above, and that the invention can be implemented in other specific forms without departing from its spirit or essential characteristics. Therefore, the embodiments should be considered in all respects as exemplary and non-limiting, and the scope of the invention is defined by the appended claims rather than the foregoing description. Thus, all variations falling within the meaning and scope of equivalents of the claims are intended to be included within the present invention, and no reference numerals in the claims should be construed as limiting the scope of the claims.
[0058] Furthermore, it should be understood that although this specification describes embodiments, not every embodiment contains only one independent technical solution. This narrative style is merely for clarity. Those skilled in the art should consider the specification as a whole, and the technical solutions in each embodiment can also be appropriately combined to form other embodiments that can be understood by those skilled in the art.
Claims
1. A data-driven expert quadrotor UAV decision-making simulation method, characterized in that it comprises: establishing kinematic and communication models for an expert quadrotor UAV and a learner quadrotor UAV, wherein the communication model considers random packet loss in the sensor-controller channel, the controller-actuator channel, and the adversarial command injection channel, each described by a Bernoulli random variable, and wherein the expert quadrotor UAV is referred to as the expert and the learner quadrotor UAV is referred to as the learner; defining, within a zero-sum game framework, the value functions of the expert and the learner, and modeling the expert's behavior as a Nash equilibrium between the controller and the adversary; establishing the expert decision imitation problem under the zero-sum game, with the goal of enabling the learner to imitate the expert's decisions by adjusting its own state penalty weight matrix; proposing a data-driven, model-based expert decision imitation algorithm, in which the learner's optimal controller decision and worst-case adversary decision are derived through the Bellman equation and the minimax principle, and the update rule of the state penalty weight matrix is established based on inverse optimal control and solved by value iteration; and using expert trajectory data to train neural networks that fit the decisions of the expert controller and the expert adversary, using learner trajectory data to estimate the packet loss probabilities and the system dynamics, and proposing a data-driven, model-free expert decision imitation algorithm in which, combining the estimated packet loss probabilities and system dynamics, the state penalty weight matrix is updated through value iteration and neural networks.
2. The data-driven expert quadrotor UAV decision-making simulation method according to claim 1, characterized in that the kinematic model of the expert quadrotor UAV is defined in terms of the following quantities, each with its stated dimension: the expert's state at the k-th time step, which comprises the expert's position, velocity, and attitude in three-dimensional space; the expert's control input, which comprises the mass-normalized thrust vector, formed from the thrust generated by each of the expert's motors and the expert's mass, and the expert's angular velocity in three-dimensional space; and the expert's adversary input, which comprises a simplified constant wind disturbance and a state-based malicious attack on the sensors; the system's inherent dynamics, the control input dynamic function, and the adversary input dynamic function are all Lipschitz continuous; the kinematic model of the expert quadrotor UAV is assumed to be controllable, and the expert's adversary input is assumed to be measurable and square-summable; the kinematic model of the learner quadrotor UAV is defined in the same form in terms of the learner's state, the learner's control input, and the learner's adversary input, and the learner's adversary input is assumed to be measurable and square-summable.
3. The data-driven expert quadrotor UAV decision-making simulation method according to claim 2, characterized in that the communication model, which considers random packet loss in the sensor-controller channel, the controller-actuator channel, and the adversarial command injection channel, each described by a Bernoulli random variable, specifically includes: for the expert: all expert signals are transmitted over a wireless communication network using the Transmission Control Protocol; both the expert controller and the expert adversary receive signals through the sensor-controller channel, whose packet loss is described by a Bernoulli random variable with a specified probability of successful reception; the controller's decision is transmitted through the controller-actuator channel, whose packet loss is described by a Bernoulli random variable with a specified probability of successful delivery; the adversary's decision is transmitted through the adversarial command injection channel, whose packet loss is described by a Bernoulli random variable with a specified probability of successful delivery; the expert controller's decision is a state-feedback decision generated by the expert control decision law, and the expert adversary's decision is likewise a state-feedback decision generated by the expert adversarial decision law; the control input and the adversary input actually applied to the kinematic model of the expert quadrotor UAV are the decisions as received through these channels; for the learner: all learner signals are transmitted over a wireless communication network using the Transmission Control Protocol; both the learner controller and the learner adversary receive signals through the sensor-controller channel, whose packet loss is described by a Bernoulli random variable with a specified probability of successful reception; the controller's decision is transmitted through the controller-actuator channel, whose packet loss is described by a Bernoulli random variable with a specified probability of successful delivery; the adversary's decision is transmitted through the adversarial command injection channel, whose packet loss is described by a Bernoulli random variable with a specified probability of successful delivery; the learner controller's decision is a state-feedback decision generated by the learner control decision law, and the learner adversary's decision is a state-feedback decision generated by the learner adversarial decision law; the control input and the adversary input actually applied to the kinematic model of the learner quadrotor UAV are the decisions as received through these channels; and during this process, the learner possesses its own information set.
4. The data-driven expert quadrotor UAV decision-making simulation method according to claim 3, characterized in that defining the value functions of the expert and the learner within the zero-sum game framework, and modeling the expert's behavior as a Nash equilibrium between the controller and the adversary, specifically includes: defining the expert's value function, which involves the expert's state penalty weight matrix, the expert's control penalty weight matrix, and the adversary's attenuation factor; based on the expert's value function, the expert controller's decision aims to minimize the value function and the adversary's decision aims to maximize it, which constitutes a zero-sum game; the expert actually applies the optimal controller decision and the worst-case adversary decision, which attain the expert's optimal game value function; defining the learner's value function, which involves the learner's state penalty weight matrix to be determined, the learner's control penalty weight matrix, and the adversary's attenuation factor, the latter two being fixed and known; the learner can obtain measurements of the expert's state, controller decision, and adversary decision, comprising the expert's state trajectory, controller decision trajectory, and adversary decision trajectory; therefore, before learning begins, the learner also possesses an expert information set.
5. The data-driven expert quadrotor UAV decision-making simulation method according to claim 4, characterized in that establishing the expert decision imitation problem under the zero-sum game, with the goal of enabling the learner to imitate the expert's decisions by adjusting its own state penalty weight matrix, specifically includes: without knowledge of the expert's value function, that is, without knowledge of the expert's state penalty weight matrix, control penalty weight matrix, and adversary attenuation factor, the learner aims to determine its own state penalty weight matrix, and thereby its own value function, so as to imitate the decisions of both the expert controller and the expert adversary; the learner must determine, through learning, a state penalty weight matrix under which its optimal controller decision and worst-case adversary decision match those of the expert.
6. The data-driven expert quadrotor UAV decision-making simulation method according to claim 5, characterized in that deriving the learner's optimal controller decision and worst-case adversary decision through the Bellman equation and the minimax principle, and establishing the update rule of the state penalty weight matrix based on inverse optimal control, specifically includes: with the learner's state penalty weight matrix fixed, establishing the learner's optimal controller decision and worst-case adversary decision, using abbreviated notation for ease of expression; given the learner's state, controller decision, and adversary decision at the current time, defining the quantities that represent the evolution of the learner's state under the different packet loss conditions, and defining the probabilities with which the learner's state evolves to each of them; given the expert's state at the current time, defining the quantities that represent the evolution of the expert's state to the next time step under the different packet loss conditions, again using abbreviated notation; for admissible learner decisions, that is, decisions under which the learner's value function is bounded, establishing the learner's Bellman equation by dynamic programming; constructing the Hamiltonian function from the learner's Bellman equation; with the learner's state penalty weight matrix fixed and without considering the expert's decisions, obtaining the learner's optimal controller decision and worst-case adversary decision from the minimax principle and the first-order necessary conditions, as equation (6), in which the expressions have the same form as those defined above but are evaluated at the learner's optimal controller decision and optimal adversary decision, with the state taken under those inputs; the learner's optimal value function satisfies the Hamilton-Jacobi-Isaacs equation (7); then determining the update rule of the learner's state penalty weight matrix: to solve the expert decision imitation problem, the state penalty weight matrix used in equation (6) must be updated in a timely manner, and, based on inverse optimal control theory and a matching method, the rule for updating the current state penalty weight matrix to the updated state penalty weight matrix is established as equation (8).
7. The data-driven expert quadrotor UAV decision-making simulation method according to claim 6, characterized in that solving by value iteration specifically includes the following steps: A1: initialize the state penalty weight matrix and the value function, set the learner's initial controller decision to zero, and initialize the learner's adversary decision; given the learner's state and the learner's controller decision and adversary decision in the i-th iteration, define the quantities that represent the evolution of the learner's state to the next time step under the different packet loss conditions, using abbreviated notation; A2: update the learner's value function using equation (7), giving the learner's value function in the i-th iteration under the state penalty weight matrix of the i-th iteration; A3: update the learner's decisions using equation (6); A4: update the learner's state penalty weight matrix using equation (8); steps A2 to A4 are iterated until the stopping condition with the specified error bound is met.
8. The data-driven expert quadrotor UAV decision-making simulation method according to claim 7, characterized in that using expert trajectory data to train neural networks that fit the decisions of the expert controller and the expert adversary specifically includes: because the expert's controller decision law is unknown, using expert trajectories to estimate it; extracting the expert's state and controller decision data at each time from the expert trajectories; to realize the mapping from the expert quadrotor UAV's state to its control commands, introducing a first multilayer perceptron to fit the expert's controller decisions, the fitted mapping function providing an approximate mapping from expert states to expert controller decisions; the first multilayer perceptron has its own parameters and consists of an input layer, a hidden layer, and an output layer; the number of neurons in the input layer equals the dimension of the expert state; the number of neurons in the hidden layer is adjusted automatically, and its activation function is a rectified linear unit (ReLU); the number of neurons in the output layer equals the dimension of the expert controller's decision, and its activation function is linear; the parameters of the first multilayer perceptron are learned by supervised learning, minimizing the mean squared error between the output of the first multilayer perceptron and the expert controller's decision; because the expert's adversary decision law is unknown, using expert trajectories to estimate it; extracting the expert's state and adversary decision data at each time from the expert trajectories; to realize the mapping from the expert quadrotor UAV's state to the adversary's commands, introducing a second multilayer perceptron, with its own parameters, to fit the expert's adversary decisions, the fitted mapping function providing an approximate mapping from expert states to expert adversary decisions.
9. The data-driven expert quadrotor UAV decision-making simulation method according to claim 8, characterized in that using learner trajectory data to estimate the packet loss probabilities and the system dynamics specifically includes, for a learner trajectory of a given total duration: a) extracting the times belonging to the first case, their total duration, and the corresponding measurement data of learner states, controller decisions, and adversary decisions; obtaining the estimate of the corresponding probability by the Monte Carlo method; and introducing a third multilayer perceptron, with its own parameters, whose fitted mapping function approximates the mapping from the learner state to the system's inherent dynamics; b) extracting the times belonging to the second case, their total duration, and the corresponding measurement data of learner states, controller decisions, and adversary decisions; obtaining the estimate of the corresponding probability by the Monte Carlo method; and introducing a fourth multilayer perceptron, with its own parameters, whose fitted mapping function approximates the mapping from the learner state to the adversary input dynamics; c) extracting the times belonging to the third case, their total duration, and the corresponding measurement data of learner states, controller decisions, and adversary decisions; obtaining the estimate of the corresponding probability by the Monte Carlo method; and introducing a fifth multilayer perceptron, with its own parameters, whose fitted mapping function approximates the mapping from the learner state to the control input dynamics; d) extracting the times belonging to the fourth case, their total duration, and the corresponding measurement data of learner states, controller decisions, and adversary decisions; and obtaining the estimate of the corresponding probability by the Monte Carlo method; and, using the learned probability information and system dynamics, redefining the state evolution quantities, for a given learner state and given decisions, in terms of these estimates.
10. The data-driven expert quadrotor UAV decision-making simulation method according to claim 9, characterized in that the proposed data-driven, model-free expert decision imitation algorithm, which combines the estimated packet loss probabilities and system dynamics and updates the state penalty weight matrix through value iteration and neural networks, specifically includes the following steps: B1: initialize the components of the iterative state penalty weight matrix, the components of the learner's iterative value function, and the learner's decisions, the learner's initial controller decision being set to zero; B2: use expert trajectory data to obtain the expert controller decision estimation network and the expert adversary decision estimation network, which approximate the expert's decisions; use learner trajectory data to obtain the probability estimates, the system inherent dynamics estimation network, the control input dynamics estimation network, and the adversary input dynamics estimation network, which approximate the learner's dynamics; B3: update the components of the learner's value function: extract the data for the first case and update the 0th component of the learner's value function in the i-th iteration by equation (12); extract the measurement data of learner states, controller decisions, and adversary decisions for the second case and update the first component by equation (13); extract the measurement data for the third case and update the second component by equation (14); extract the measurement data for the fourth case and update the third component by equation (15); B4: update the learner's value function using equations (12) to (15); B5: update the learner's decisions; B6: update the components of the learner's state penalty weight matrix: extract the data for each of the four cases in turn and update the corresponding component by equations (18), (19), (20), and (21), respectively; B7: update the learner's state penalty weight matrix by combining equations (18) to (21); steps B3 to B7 are iterated until the stopping condition with the specified error bound is met.