A radar-constrained adversarial decision-making method and system for fixed-wing unmanned aerial vehicles based on hybrid action PPO

By adopting a decision-making method based on hybrid action PPO and combining UAV kinematics with a radar threat field model, efficient autonomous decision-making and trajectory control of fixed-wing UAVs under radar constraints are achieved. This addresses the poor real-time computation, poor adaptability, and difficulty in balancing multiple objectives found in existing technologies, and improves decision-making efficiency and maneuverability.

CN122239764A · Pending · Publication Date: 2026-06-19 · GUANGDONG UNIV OF TECH

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Applications (China)
Current Assignee / Owner: GUANGDONG UNIV OF TECH
Filing Date: 2026-03-27
Publication Date: 2026-06-19

AI Technical Summary

Technical Problem

Existing technologies for fixed-wing UAV air combat suffer from poor real-time performance and adaptability, difficulty in balancing multiple objectives, and a limited action space, making it difficult to achieve efficient autonomous decision-making and trajectory control under radar constraints.

Method used

A decision-making method based on hybrid action PPO is adopted to construct an environmental model that includes UAV kinematics and radar threat field. Decisions are made using the hybrid action space, and a safety masking mechanism and Lagrange multiplier optimization are introduced to achieve a combination of discrete tactics and continuous control.

Benefits of technology

It significantly improves the decision-making efficiency and maneuverability of UAVs in complex combat missions, ensures safety and tactical intelligence under radar constraints, and achieves efficient radar avoidance and target tracking.

✦ Generated by Eureka AI based on patent content.


Abstract

This invention discloses a radar-constrained adversarial decision-making method and system for fixed-wing UAVs based on hybrid action proximal policy optimization (PPO). The method includes: constructing an environment model, which includes a kinematic model of the UAV and a multi-radar joint threat field model based on radar cross section (RCS); constructing and training a decision model based on proximal policy optimization (PPO), wherein the decision model receives enhanced state observations including the multi-radar joint threat field model and outputs decision actions based on a hybrid action space, wherein the hybrid action space includes a discrete action branch for tactical waypoint selection and a continuous action branch for flight attitude control; and outputting hybrid actions based on the converged decision model and real-time state observations to control the UAV to perform adversarial tasks under radar constraints. By constructing a hybrid action space and an improved reinforcement learning framework, this invention avoids the computational difficulties that traditional methods face under non-convex constraints.

Description

Technical Field

[0001] This invention relates to the field of autonomous decision-making and control technology for unmanned aerial vehicles (UAVs), and in particular to a radar-constrained adversarial decision-making method and system for fixed-wing UAVs based on hybrid action PPO.

Background Technology

[0002] When performing air combat missions, fixed-wing unmanned aerial vehicles (UAVs) must achieve efficient autonomous tactical decision-making and trajectory control while satisfying their own kinematic constraints and radar threat avoidance constraints. However, the radar exposure field in air combat environments is typically non-convex, and the discrete tactical maneuvers and continuous dynamic control of UAVs are highly coupled, which poses a significant challenge to autonomous decision-making. To address these issues, researchers in China and abroad have conducted relevant work, and the main representative technologies are as follows. For the UAV avoidance problem, path planning techniques based on threat fields or optimal control have been proposed, including model-based optimization, potential-field path planning, and similar traditional approaches. These methods typically model radar detection risk as a cost or as a repulsive force within a potential field. Under given flight envelope constraints, an optimization algorithm then searches for a geometric path that avoids the radar detection area.

[0003] However, this type of technology has the following drawbacks:
(1) Poor real-time computation: The radar exposure field in the air combat environment is usually non-convex, and the discrete maneuvers of the UAV are highly coupled with its continuous dynamics. This makes traditional optimal control or model predictive control (MPC) computationally expensive and unable to meet the millisecond-level response requirements of actual combat.
(2) Poor model adaptability: Such methods often rely on simplified environmental assumptions or static optimization landscapes, so their adaptability and robustness degrade when facing dynamic cooperative detection by multiple radars or unstructured, complex electromagnetic countermeasure environments.

[0004] For fixed-wing UAVs in adversarial environments, a deep reinforcement learning-based decision-making method is proposed. This method utilizes neural networks to interact with the environment, attempting to learn the UAV's maneuvering strategies to enhance its autonomy in adversarial scenarios.

[0005] However, this type of technology still has significant shortcomings when applied to radar-constrained adversarial scenarios:
(1) Difficulty in balancing multiple objectives: Conventional reinforcement learning methods lack explicit constraint mechanisms for handling the dual objectives of completing the attack mission and surviving radar detection. Under radar constraints, aggressive attack maneuvers often lead to high radar exposure, and conventional algorithms struggle to balance the two, which easily leads to mission failure.
(2) Limited action space and slow convergence: Existing standard algorithms (such as standard PPO) typically handle only a single type of action space (purely discrete or purely continuous), making it difficult to meet the mixed requirements of "discrete tactical decision-making" and "continuous flight control" in air combat. Simulations show that the standard PPO algorithm converges slowly and yields a low average reward for the final policy, making it difficult to generate trajectories that both obey physical dynamics and exhibit advanced tactical intelligence.

[0006] In summary, there is an urgent need for a radar-constrained adversarial decision-making method and system for fixed-wing UAVs based on hybrid action PPO, in order to overcome the large computational load, poor adaptability, difficulty in balancing multiple objectives, and limited action space of traditional technologies.

Summary of the Invention

[0007] To address the complex challenges faced by fixed-wing unmanned aerial vehicles (UAVs) in complex electromagnetic environments, such as radar threat avoidance, target tracking conflicts, coupling of discrete decision-making and continuous control, and the non-convexity of the radar exposure landscape, as well as the shortcomings of existing control methods in terms of real-time adversarial adaptability, survivability balance, and decision-making efficiency, this invention proposes a radar-constrained adversarial decision-making method and system for fixed-wing UAVs based on hybrid action PPO.

[0008] In one aspect, to achieve the above objectives, this invention provides a radar-constrained adversarial decision-making method for fixed-wing UAVs based on hybrid action PPO, comprising:
Constructing an environment model, which includes a kinematic model of the UAV and a multi-radar joint threat field model based on radar cross section (RCS).
Constructing and training a decision model based on proximal policy optimization (PPO). The decision model receives enhanced state observations including the multi-radar joint threat field model and outputs decision actions based on a hybrid action space, wherein the hybrid action space includes a discrete action branch for tactical waypoint selection and a continuous action branch for flight attitude control.
Based on the decision model that has converged during training, outputting hybrid actions according to real-time state observations to control the UAV to perform adversarial missions under radar constraints.

[0009] Preferably, the multi-radar joint threat field model is constructed based on the detection probability of a single radar. The instantaneous detection probability $P_i$ of the aircraft under the $i$-th radar is determined by $R_i$, the Euclidean distance between the UAV and the center of the $i$-th radar; by $\sigma$, the current radar cross section of the UAV; and by coefficients that are all inherent constants of the radar system. For a network of $N$ radars, the joint threat field strength is:
$$P_{\text{joint}}(\mathbf{p}) = 1 - \prod_{i=1}^{N} \left(1 - P_i\right)$$
In the formula, $P_{\text{joint}}$ is the joint threat field strength; $\mathbf{p}$ is the current spatial position vector of the UAV; $N$ is the number of radars; and $P_i$ is the instantaneous detection probability of the aircraft under the $i$-th radar.

[0010] Preferably, the hybrid action space includes: a discrete action branch, used to select one grid as a tactical waypoint from a set of velocity-adaptive 3D spatial grids; and a continuous action branch, used to directly output the normalized control vector for low-level attitude control.

[0011] Preferably, the process of constructing and training the decision model based on proximal policy optimization (PPO) also includes introducing a safety masking mechanism into the discrete action branch. The safety masking mechanism includes: calculating the radar exposure cost of each candidate grid; and, based on the radar exposure cost, either correcting the original action logit of the candidate grid or hard-masking grids whose cost exceeds a preset threshold.

[0012] Preferably, the radar exposure cost corresponding to each candidate grid is calculated as follows:
$$C_k = \int_{\tau_k} P_{\text{joint}}(t)\,\mathrm{d}t$$
In the formula, $C_k$ is the radar exposure cost; $P_{\text{joint}}(t)$ is the joint threat field strength at time $t$; $t$ is the time variable along the flight path; and $\tau_k$ is the flight trajectory of the UAV to the $k$-th candidate grid.

[0013] Preferably, the decision model based on proximal policy optimization (PPO) is constructed and trained using a constraint-aware policy optimization mechanism. This mechanism includes constructing a comprehensive reward function, specifically:
$$r_t = r_{\text{game}} - w_1\,P_{\text{joint}}(t) - w_2\,C_{\text{RCS}} - w_3\,r_{\text{zone}}$$
In the formula, $r_{\text{zone}}$ is the penalty for staying in the danger zone; $r_{\text{game}}$ is the original game reward; $w_1$, $w_2$, and $w_3$ are all weighting coefficients; $C_{\text{RCS}}$ is the RCS cost; $r_t$ is the comprehensive reward function; and $P_{\text{joint}}(t)$ is the joint threat field strength at time $t$.

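[0014] Preferably, the constraint-aware policy optimization mechanism further includes constructing a total loss function containing a Lagrange multiplier, specifically:
$$L_{\text{total}} = -L^{\text{CLIP}} + c_v\,L^{V} - c_e\,H(\pi_\theta) + \lambda\left(\bar{P}_d - \delta\right), \qquad L^{V} = \left(V_\phi(s_t) - \hat{R}_t\right)^2$$
In the formula, $L_{\text{total}}$ is the total loss function containing the Lagrange multiplier; $L^{\text{CLIP}}$ is the policy objective; $L^{V}$ is the value function loss and $c_v$ is the value loss coefficient; $V_\phi(s_t)$ is the value predicted by the Critic network for the current state; $\hat{R}_t$ is the actual return; $c_e\,H(\pi_\theta)$ is the entropy bonus, $c_e$ is the entropy coefficient, and $H(\pi_\theta)$ is the entropy of the policy, which comprises the discrete policy $\pi_\theta^{d}$ and the continuous policy $\pi_\theta^{c}$; $\lambda(\bar{P}_d - \delta)$ is the radar safety constraint term and $\lambda$ is the Lagrange multiplier; $\bar{P}_d$ is the average probability that the UAV is detected by radar; and $\delta$ is the set safety threshold.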

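[0015] Preferably, the training process further includes iterative optimization that performs dual updates on the Lagrange multiplier, specifically:
$$\theta_{\text{new}} \leftarrow \theta_{\text{old}} - \eta\,\nabla_{\theta} L_{\text{total}}, \qquad \lambda_{\text{new}} \leftarrow \lambda_{\text{old}} + \eta\,\nabla_{\lambda} L_{\text{total}}$$
In the formula, the subscripts "new" and "old" denote the parameters after and before the update, respectively; $\leftarrow$ indicates assignment; $\eta$ is the learning rate; $\nabla$ denotes the partial derivative with respect to the subscripted parameter; and $L_{\text{total}}$ is the total loss function.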

[0016] In another aspect, to achieve the above objectives, the present invention also provides a radar-constrained adversarial decision-making system for fixed-wing UAVs based on hybrid action PPO, used to implement the aforementioned method, comprising:
An environment model module, used to construct an environment model, which includes a kinematic model of the UAV and a multi-radar joint threat field model based on radar cross section (RCS).
A decision control module, used to construct and train a decision model based on proximal policy optimization (PPO). The decision model receives enhanced state observations including the multi-radar joint threat field model and outputs decision actions based on a hybrid action space, wherein the hybrid action space includes a discrete action branch for tactical waypoint selection and a continuous action branch for flight attitude control.
An execution module, used to output hybrid actions according to real-time state observations, based on the decision model that has converged during training, to control the UAV to perform adversarial tasks under radar constraints.

[0017] The present invention also provides an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the computer program to implement the aforementioned radar-constrained adversarial decision-making method for fixed-wing UAVs based on hybrid action PPO.

[0018] Compared with the prior art, the present invention has the following advantages and technical effects: (1) This invention proposes a probabilistic detection model based on physical characteristics, which tightly couples the instantaneous detection probability with the real-time position of the unmanned aerial vehicle (UAV), its dynamic radar cross section (RCS), and the radar system constants. By establishing a single-radar detection probability model and constructing a three-dimensional joint threat field for a multi-radar network under the assumption of independent detection events, this invention can more realistically simulate the non-convex exposure landscape of complex electromagnetic environments, providing the agent with continuous gradient information.

[0019] (2) This invention divides the action space into a discrete branch (a velocity-adaptive 3D grid) responsible for mid-range waypoint selection and a continuous branch (three-degree-of-freedom control variables) responsible for close-range motion execution. It uses a shared feature layer to extract environmental features and outputs hybrid decisions through a branched network. This overcomes the insufficient control accuracy of a purely discrete action space and the low search efficiency of a purely continuous action space in a high-dimensional non-convex setting, significantly improving the decision-making efficiency and maneuverability of UAVs in complex combat missions.

[0020] (3) This invention first introduces safety masking into the discrete branch, correcting the action logits according to the expected radar exposure cost and hard-masking actions that exceed the limit. Second, it introduces a Lagrange multiplier into the PPO loss function to update the constrained policy, enforcing long-term radar safety constraints through iterative optimization. This enables UAVs to learn advanced countermeasures such as "High Yo-Yo" and "Flanking Interception" autonomously, without manual pre-setting of complex evasion trajectories, thus significantly reducing the risk of detection while ensuring the effectiveness of the interception mission.

Brief Description of the Drawings

[0021] The accompanying drawings, which form part of this application, provide a further understanding of this application. The illustrative embodiments and their descriptions explain this application and do not constitute an undue limitation of it. In the drawings:
Figure 1 is a flowchart of a radar-constrained adversarial decision-making method for fixed-wing UAVs based on hybrid action PPO according to an embodiment of the present invention;
Figure 2 is a schematic diagram of an adversarial scenario according to an embodiment of the present invention;
Figure 3 is a schematic diagram of the joint radar threat value according to an embodiment of the present invention;
Figure 4 is a schematic diagram of the trajectory according to an embodiment of the present invention;
Figure 5 is a schematic diagram of the real-time distance to each radar according to an embodiment of the present invention;
Figure 6 is a schematic diagram of the single-radar detection probability according to an embodiment of the present invention;
Figure 7 shows the reward function curve according to an embodiment of the present invention;
Figure 8 shows the four combat scenarios according to an embodiment of the present invention: (a) flanking interception, (b) head-on evasive maneuver, (c) pursuit interception, and (d) circling and waiting.

Detailed Implementation

[0022] It should be noted that, unless otherwise specified, the embodiments and features described in this application can be combined with each other. This application will now be described in detail with reference to the accompanying drawings and embodiments.

[0023] It should be noted that the steps shown in the flowchart in the accompanying drawings can be executed in a computer system such as a set of computer-executable instructions, and although a logical order is shown in the flowchart, in some cases the steps shown or described may be executed in a different order than that shown here.

[0024] To address the complex challenges faced by fixed-wing unmanned aerial vehicles (UAVs) in complex electromagnetic environments, including radar threat avoidance, target tracking conflicts, coupling of discrete decision-making and continuous control, and the non-convexity of the radar exposure landscape, as well as the shortcomings of existing control methods in terms of real-time adversarial adaptability, survivability balance, and decision-making efficiency, this embodiment proposes a radar-constrained adversarial decision-making method for fixed-wing UAVs based on hybrid action PPO, aiming to achieve the following objectives: (1) Breaking through the bottleneck of coordinated control between target tracking and survivability assurance: Based on a physically realistic three-degree-of-freedom (3-DOF) kinematic model, a physics-driven probabilistic radar detection model and a multi-radar joint threat field are constructed. Continuous tracking of dynamic targets by the UAV and zero intrusion into the ground radar "no-fly zone" are both strictly guaranteed, overcoming the difficulty that traditional methods have in balancing tactical advantage and radar evasion in complex dynamic confrontations.

[0025] (2) Significantly reduce the computation and optimization difficulty in complex action space: Abandon the limitations of single discrete or continuous action space in existing technologies, design a hybrid action space architecture that combines discrete tactical decision-making and continuous flight control, and use multi-head Actor-Critic network to process waypoint selection and attitude control in parallel to meet the stringent requirements for real-time autonomous decision-making and high-dimensional action optimization in air combat environment.

[0026] (3) Enhancing robustness and safety under radar constraints: A multi-level radar constraint mechanism is adopted, including discrete action screening based on a safety mask, a joint penalty function based on radar cross section (RCS) and exposure time, and an expectation-based constraint on long-term exposure risk. By introducing a constraint-aware loss function through the Lagrange multiplier method, the UAV gains advanced tactical maneuverability and robust three-dimensional obstacle avoidance capability while maintaining a high convergence speed.

[0027] A radar-constrained adversarial decision-making method for fixed-wing UAVs based on hybrid action PPO, as shown in Figure 1, includes:
Step A: Construct an environment model, which includes the kinematic model of the UAV and a multi-radar joint threat field model based on radar cross section (RCS).
Step B: Construct and train a decision model based on proximal policy optimization (PPO). The decision model receives enhanced state observations including the multi-radar joint threat field model and outputs decision actions based on a hybrid action space, wherein the hybrid action space includes a discrete action branch for tactical waypoint selection and a continuous action branch for flight attitude control.
Step C: Based on the converged decision model, output hybrid actions according to real-time state observations to control the UAV to perform adversarial tasks under radar constraints.

[0028] Specifically, this embodiment employs a closed-loop control framework based on reinforcement learning. Step A constructs an "interactive environment" encompassing UAV dynamics and radar electromagnetic environment, providing state feedback and physical constraint benchmarks for the agent. Step B constructs an "intelligent decision agent," responsible for outputting control actions based on the environmental state. The two form a logical closed loop through "state observation - action execution - reward feedback."

[0029] Furthermore, step A includes problem modeling and environment description: A1: UAV kinematic modeling, used to calculate the state update of the UAV after performing actions.

[0030] The motion of the UAV is described using a three-degree-of-freedom (3-DOF) point-mass model:
$$\dot{x} = V\cos\gamma\cos\psi, \qquad \dot{y} = V\cos\gamma\sin\psi, \qquad \dot{z} = V\sin\gamma$$
$$\dot{V} = g\,(n_x - \sin\gamma), \qquad \dot{\gamma} = \frac{g}{V}\,(n\cos\mu - \cos\gamma), \qquad \dot{\psi} = \frac{g\,n\sin\mu}{V\cos\gamma}$$
Symbol definitions: $(x, y, z)$ is the three-dimensional position of the UAV in the inertial coordinate system; $V$ is the actual flight speed of the UAV; $\gamma$ is the flight path angle; $\psi$ is the heading angle; the control inputs are the tangential overload $n_x$, the roll angle $\mu$, and the normal load factor $n$; $g$ is the acceleration due to gravity; and $\dot{x}$, $\dot{y}$, and $\dot{z}$ are the velocity components of the UAV along the x-, y-, and z-axes of the inertial coordinate system.
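As an illustration of how such a point-mass model produces the state update after an action, the following is a minimal sketch, assuming the standard 3-DOF equations (position rates from speed, flight path angle, and heading; speed and angle rates from the load factors and roll angle) and a simple explicit Euler integrator. The names `UavState` and `step_3dof` and the step size are hypothetical and do not come from the patent text.

```python
import math
from dataclasses import dataclass

G = 9.81  # gravitational acceleration, m/s^2


@dataclass
class UavState:
    x: float      # inertial position, m
    y: float
    z: float
    v: float      # flight speed, m/s
    gamma: float  # flight path angle, rad
    psi: float    # heading angle, rad


def step_3dof(s: UavState, n_x: float, n: float, mu: float, dt: float) -> UavState:
    """Advance the point-mass model by one explicit Euler step of length dt.

    n_x: tangential overload, n: normal load factor, mu: roll angle (rad).
    Assumes v > 0 and |gamma| < pi/2 so the divisions are well defined.
    """
    dx = s.v * math.cos(s.gamma) * math.cos(s.psi)
    dy = s.v * math.cos(s.gamma) * math.sin(s.psi)
    dz = s.v * math.sin(s.gamma)
    dv = G * (n_x - math.sin(s.gamma))
    dgamma = (G / s.v) * (n * math.cos(mu) - math.cos(s.gamma))
    dpsi = G * n * math.sin(mu) / (s.v * math.cos(s.gamma))
    return UavState(
        s.x + dx * dt, s.y + dy * dt, s.z + dz * dt,
        s.v + dv * dt, s.gamma + dgamma * dt, s.psi + dpsi * dt,
    )


# Straight and level flight: n_x = 0, n = 1, mu = 0 leaves speed and angles unchanged.
level = step_3dof(UavState(0.0, 0.0, 1000.0, 100.0, 0.0, 0.0), 0.0, 1.0, 0.0, 0.1)
```

A higher-order integrator (for example Runge-Kutta) could replace the Euler step without changing the interface; the patent text does not state which integrator is used.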

[0031] A2: Radar detection probability and threat field model, used to calculate the degree of threat posed by the environment to UAVs, as a negative feedback term in the reward function.

[0032] Based on the radar equation, define $P_i$ as the instantaneous detection probability of the aircraft under the $i$-th radar. It is determined by $R_i$, the Euclidean distance between the UAV and the center of the $i$-th radar; by $\sigma$, the current radar cross section of the UAV, a value related to the UAV's attitude angle; and by coefficients that are all inherent constants of the radar system, related to the radar transmit power, gain, and receiver sensitivity. For a network of $N$ radars, the comprehensive detection probability (joint threat field strength) is:
$$P_{\text{joint}}(\mathbf{p}) = 1 - \prod_{i=1}^{N} \left(1 - P_i\right)$$
In the formula, $P_{\text{joint}}$ is the joint threat field strength; $\mathbf{p}$ is the current spatial position vector of the UAV; $N$ is the number of radars; and $P_i$ is the instantaneous detection probability of the aircraft under the $i$-th radar.

[0033] This is used to construct a three-dimensional radar joint threat field to quantify electromagnetic environment risks.
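The construction of the joint threat field from single-radar probabilities can be sketched as follows. The combination rule follows the independent-detection assumption stated in this document. The single-radar form `1 / (1 + (c2 * R^4 / sigma)^c1)` is an ASSUMPTION: it is one commonly used radar-equation-based model, swapped in here for illustration, and is not confirmed as the patent's formula. The constants `c1` and `c2`, and all function names, are hypothetical.

```python
import math


def detection_probability(distance, rcs, c1=1.0, c2=1.0e-9):
    """ASSUMED single-radar model: P = 1 / (1 + (c2 * R^4 / sigma)^c1).

    distance: UAV-to-radar Euclidean distance R (m); rcs: radar cross section
    sigma (m^2, must be > 0); c1, c2: hypothetical radar system constants.
    """
    return 1.0 / (1.0 + (c2 * distance ** 4 / rcs) ** c1)


def joint_threat(position, radars, rcs):
    """Joint threat P_joint = 1 - prod_i (1 - P_i), treating detections as independent."""
    miss = 1.0  # probability that no radar detects the UAV
    for radar in radars:
        p_i = detection_probability(math.dist(position, radar), rcs)
        miss *= 1.0 - p_i
    return 1.0 - miss


# Two radars placed symmetrically around the origin (illustrative coordinates).
radars = [(100.0, 0.0, 0.0), (-100.0, 0.0, 0.0)]
near = joint_threat((0.0, 0.0, 0.0), radars, rcs=1.0)
far = joint_threat((0.0, 500.0, 0.0), radars, rcs=1.0)
```

Because the product multiplies miss probabilities, adding a radar can only raise the joint threat, and overlapping coverage produces the non-convex landscape the text describes.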

[0034] Furthermore, step B includes the design of a controller based on hybrid action PPO: B1: Enhanced state observation design. The agent acquires state information from the environment of step A as the original observation vector $o_t = [\mathbf{p}_t, V_t, \gamma_t, \psi_t, d_t]$, where $\mathbf{p}_t$, $V_t$, $\gamma_t$, $\psi_t$, and $d_t$ represent, respectively, the position of the UAV, the flight speed of the UAV, the flight path angle, the heading angle, and the distance between the UAVs at time $t$.

[0035] To enable the agent to perceive environmental threats, the observation explicitly embeds, in addition to the basic kinematic states, the radar detection probability, the threat field intensity $P_{\text{joint}}$, and the distance to the nearest radar $R_{\min}$, yielding the enhanced observation vector $\tilde{o}_t$.

[0036] B2: Hybrid action space decision architecture. Based on the enhanced observation vector $\tilde{o}_t$, the agent outputs a hybrid action, which is fed back into the model of step A1 to drive the UAV. The hybrid action space includes: a discrete action branch, used to select one grid as a tactical waypoint from a set of velocity-adaptive 3D spatial grids; and a continuous action branch, used to directly output the normalized control vector for low-level attitude control.

[0037] Specifically, the discrete action branch selects a discrete tactical action $a^{d}$ from a velocity-adaptive 3D grid, whose five candidates represent, respectively, staying at the origin, selecting the upper grid, selecting the left grid, selecting the lower grid, and selecting the right grid, in order to execute waypoint guidance.

[0038] Furthermore, the process of constructing and training the decision model based on proximal policy optimization (PPO) also includes introducing a safety masking mechanism into the discrete action branch. The safety masking mechanism includes: calculating the radar exposure cost of each candidate grid; and, based on the radar exposure cost, either correcting the original action logit of the candidate grid or hard-masking grids whose cost exceeds a preset threshold.

[0039] Specifically, a safety masking mechanism is introduced that calculates the radar exposure cost of each candidate grid:
$$C_k = \int_{\tau_k} P_{\text{joint}}(t)\,\mathrm{d}t$$
where $P_{\text{joint}}(t)$ is the joint threat field strength at time $t$, $t$ is the time variable along the flight path, and $\tau_k$ is the flight trajectory of the UAV to the $k$-th candidate grid. The action logits are then adjusted as
$$\tilde{z}_k = z_k - \beta\,C_k$$
where $\beta$ is the penalty coefficient and $z_k$ is the original logit, so that high-risk choices are penalized in proportion to their radar exposure, and actions whose cost exceeds a threshold are hard-masked.
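The safety masking step described above can be sketched as follows, assuming the exposure cost is approximated by trapezoidal integration of sampled joint-threat values along each candidate path, the soft correction subtracts a penalty proportional to that cost, and hard masking sets the logit to negative infinity so the action receives zero probability. The function names, the sampling scheme, and the numeric values are hypothetical.

```python
import math


def exposure_cost(threat_samples, dt):
    """Trapezoidal approximation of the integral of joint threat along a candidate path."""
    total = 0.0
    for a, b in zip(threat_samples, threat_samples[1:]):
        total += 0.5 * (a + b) * dt
    return total


def mask_logits(logits, costs, beta, cost_limit):
    """Subtract beta * cost from each logit; hard-mask candidates whose cost exceeds cost_limit."""
    return [
        -math.inf if c > cost_limit else z - beta * c
        for z, c in zip(logits, costs)
    ]


def softmax(logits):
    """Softmax that tolerates -inf entries (assumes at least one finite logit remains)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]


# Three candidate grids with equal raw logits but different exposure costs.
masked = mask_logits([1.0, 1.0, 1.0], [0.0, 0.5, 2.0], beta=2.0, cost_limit=1.0)
probs = softmax(masked)  # the third candidate gets probability 0
```

If every candidate exceeded the limit, the softmax would be undefined; a practical implementation would need a fallback (for example keeping the lowest-cost candidate), which the patent text does not specify.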

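[0040] The continuous action branch directly outputs the normalized control vector $a^{c} = [n, \mu, \Delta V]$, where $n$ is the normal overload, $\mu$ is the roll angle, and $\Delta V$ is the change in the UAV's flight speed. The control vector $a^{c}$ is used for low-level attitude control and precision maneuvering. The RCS cost $C_{\text{RCS}}$ is calculated dynamically from the angle $\theta$ between the velocity vector and the radar line of sight, where
$$\theta = \arccos\frac{\mathbf{v}\cdot\mathbf{l}}{\lVert\mathbf{v}\rVert\,\lVert\mathbf{l}\rVert},$$
$\mathbf{v}$ is the velocity vector, and $\mathbf{l}$ is the radar line of sight.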
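The angle between the velocity vector and the radar line of sight can be computed as sketched below. The angle formula is the standard normalized dot product. The mapping from angle to cost in `rcs_cost` is a HYPOTHETICAL placeholder (zero when the UAV points along the line of sight, maximal when broadside); the patent text as available here does not give the actual cost expression.

```python
import math


def aspect_angle(velocity, line_of_sight):
    """Angle (rad) between the velocity vector and the radar line of sight."""
    dot = sum(v * l for v, l in zip(velocity, line_of_sight))
    norm = math.hypot(*velocity) * math.hypot(*line_of_sight)
    # Clamp to [-1, 1] so rounding error cannot push acos out of its domain.
    return math.acos(max(-1.0, min(1.0, dot / norm)))


def rcs_cost(theta):
    """HYPOTHETICAL cost shape: 0 when nose-on or tail-on, 1 when broadside."""
    return math.sin(theta) ** 2


# Flying along +x while the radar line of sight points along +y: broadside exposure.
theta = aspect_angle((1.0, 0.0, 0.0), (0.0, 1.0, 0.0))
cost = rcs_cost(theta)
```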

[0041] Furthermore, a constraint-aware policy optimization mechanism is employed in constructing and training the decision model based on proximal policy optimization (PPO), specifically including: B3: Constraint-aware policy optimization.

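[0042] Construct a comprehensive reward function $r_t$ that integrates task achievement, survivability, and physical constraints:
$$r_t = r_{\text{game}} - w_1\,P_{\text{joint}}(t) - w_2\,C_{\text{RCS}} - w_3\,r_{\text{zone}}$$
In the formula, $r_{\text{zone}}$ is the penalty for staying in the danger zone; $r_{\text{game}}$ is the original game reward; $w_1$, $w_2$, and $w_3$ are all weighting coefficients; $C_{\text{RCS}}$ is the RCS cost; $r_t$ is the comprehensive reward function; and $P_{\text{joint}}(t)$ is the joint threat field strength at time $t$.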
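A reward of this kind can be sketched as a weighted combination of the listed terms. The linear form, the signs (penalties subtracted from the game reward), the indicator-style danger-zone penalty, and the default weights below are all ASSUMPTIONS made for illustration; only the list of ingredients comes from the text.

```python
def comprehensive_reward(r_game, p_joint, rcs_cost, in_danger_zone,
                         w_threat=1.0, w_rcs=0.5, w_zone=10.0):
    """Game reward minus weighted threat, RCS, and danger-zone penalties.

    r_game: original game (task) reward; p_joint: joint threat field strength
    at the current time; rcs_cost: RCS cost; in_danger_zone: whether the UAV
    is inside the danger zone. The weights are hypothetical defaults.
    """
    zone_penalty = 1.0 if in_danger_zone else 0.0
    return r_game - w_threat * p_joint - w_rcs * rcs_cost - w_zone * zone_penalty


safe_step = comprehensive_reward(1.0, 0.2, 0.4, in_danger_zone=False)
risky_step = comprehensive_reward(1.0, 0.2, 0.4, in_danger_zone=True)
```

With these illustrative weights, entering the danger zone dominates every other term, which is one way to make survival take priority over the task reward.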

[0043] Define $\pi_\theta^{d}$ as the discrete output head, which outputs a categorical distribution over the 5 grid directions, and $\pi_\theta^{c}$ as the continuous output head, which outputs the mean and variance of the Beta distribution of the control vector $a^{c}$.
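Sampling one hybrid action from the two heads can be sketched as follows, assuming the continuous head's mean and variance are converted to Beta shape parameters by moment matching (which requires the variance to be smaller than mean × (1 − mean)) and that each control component is sampled independently on [0, 1]. The function names and example values are hypothetical; in a trained model the probabilities, means, and variances would come from the network outputs.

```python
import random


def beta_params(mean, var):
    """Moment-match (mean, variance) to Beta(alpha, beta).

    Valid only for 0 < mean < 1 and 0 < var < mean * (1 - mean).
    """
    if not 0.0 < mean < 1.0 or not 0.0 < var < mean * (1.0 - mean):
        raise ValueError("mean/variance pair is not realizable by a Beta distribution")
    k = mean * (1.0 - mean) / var - 1.0
    return mean * k, (1.0 - mean) * k


def sample_hybrid_action(grid_probs, means, variances, rng):
    """Draw a grid index from the categorical head and a control vector from the Beta head."""
    a_d = rng.choices(range(len(grid_probs)), weights=grid_probs, k=1)[0]
    a_c = [rng.betavariate(*beta_params(m, v)) for m, v in zip(means, variances)]
    return a_d, a_c


rng = random.Random(0)  # fixed seed for reproducibility of the example
a_d, a_c = sample_hybrid_action([0.2] * 5, [0.5, 0.5, 0.5], [0.05, 0.05, 0.05], rng)
```

A bounded distribution such as Beta keeps sampled controls inside the normalized range without clipping; the samples would then be rescaled to the physical limits of each control input.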

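[0044] The probability ratio of the discrete head, which measures the deviation between the current policy and the old policy, is
$$\rho_t^{d} = \frac{\pi_\theta^{d}(a_t^{d} \mid s_t)}{\pi_{\theta_{\text{old}}}^{d}(a_t^{d} \mid s_t)},$$
and the probability ratio of the continuous head is
$$\rho_t^{c} = \frac{\pi_\theta^{c}(a_t^{c} \mid s_t)}{\pi_{\theta_{\text{old}}}^{c}(a_t^{c} \mid s_t)}.$$
In the formula, $\pi_\theta^{d}(a_t^{d} \mid s_t)$ is the probability of taking discrete action $a_t^{d}$ in state $s_t$ under the current policy parameters $\theta$; $\pi_{\theta_{\text{old}}}^{d}(a_t^{d} \mid s_t)$ is the probability of taking discrete action $a_t^{d}$ in state $s_t$ under the old policy parameters $\theta_{\text{old}}$; $\pi_\theta^{c}(a_t^{c} \mid s_t)$ is the probability density of taking continuous action $a_t^{c}$ in state $s_t$ under the current policy parameters $\theta$; and $\pi_{\theta_{\text{old}}}^{c}(a_t^{c} \mid s_t)$ is the probability density of taking continuous action $a_t^{c}$ in state $s_t$ under the old policy parameters $\theta_{\text{old}}$.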

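[0045] Define the clipped objective function in the hybrid action space:
$$L^{\text{CLIP}}(\theta) = \hat{\mathbb{E}}_t\Big[\min\big(\rho_t^{d}\hat{A}_t,\ \mathrm{clip}(\rho_t^{d},\, 1-\epsilon,\, 1+\epsilon)\,\hat{A}_t\big) + \alpha\,\min\big(\rho_t^{c}\hat{A}_t,\ \mathrm{clip}(\rho_t^{c},\, 1-\epsilon,\, 1+\epsilon)\,\hat{A}_t\big)\Big]$$
In the formula, $\hat{A}_t$ is the generalized advantage estimate; $\epsilon$ is the clipping hyperparameter; $\mathrm{clip}(\cdot)$ is the clipping function; $\alpha$ is used to balance the contribution of the continuous control task; $\hat{\mathbb{E}}_t$ denotes the expectation over time step $t$; $\min(\cdot)$ takes the minimum value; $L^{\text{CLIP}}$ is the clipped objective function in the hybrid action space; $\rho_t^{d}$ is the probability ratio of discrete actions; and $\rho_t^{c}$ is the probability ratio of continuous actions.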
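A clipped surrogate over two action heads can be sketched as below, assuming the discrete and continuous terms share one advantage estimate per time step and are summed with a balancing weight on the continuous term. That additive structure is an assumption consistent with the symbols listed above; the names `eps` and `alpha` and their default values are hypothetical.

```python
def clip(x, lo, hi):
    """Restrict x to the closed interval [lo, hi]."""
    return max(lo, min(hi, x))


def clipped_term(ratio, advantage, eps):
    """Standard PPO clipped surrogate for a single probability ratio."""
    return min(ratio * advantage, clip(ratio, 1.0 - eps, 1.0 + eps) * advantage)


def hybrid_clip_objective(ratios_d, ratios_c, advantages, eps=0.2, alpha=0.5):
    """Batch mean of the discrete clipped term plus alpha times the continuous clipped term."""
    total = 0.0
    for rho_d, rho_c, adv in zip(ratios_d, ratios_c, advantages):
        total += clipped_term(rho_d, adv, eps) + alpha * clipped_term(rho_c, adv, eps)
    return total / len(advantages)


# One sample: the discrete ratio 1.5 is clipped to 1.2, the continuous ratio 1.0 is untouched.
objective = hybrid_clip_objective([1.5], [1.0], [1.0])
```

Clipping each head separately means a large policy change in one branch cannot be hidden by a small change in the other, which would happen if the two ratios were multiplied before clipping.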

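[0046] Using the Lagrange multiplier $\lambda$, the loss function is optimized over multiple objectives:
$$L_{\text{total}} = -L^{\text{CLIP}} + c_v\,L^{V} - c_e\,H(\pi_\theta) + \lambda\left(\bar{P}_d - \delta\right), \qquad L^{V} = \left(V_\phi(s_t) - \hat{R}_t\right)^2$$
In the formula, $L_{\text{total}}$ is the total loss function containing the Lagrange multiplier; $L^{\text{CLIP}}$ is the policy objective; $L^{V}$ is the value function loss and $c_v$ is the value loss coefficient; $V_\phi(s_t)$ is the value predicted by the Critic network for the current state; $\hat{R}_t$ is the actual return; $c_e\,H(\pi_\theta)$ is the entropy bonus, $c_e$ is the entropy coefficient, and $H(\pi_\theta)$ is the entropy of the policy, which comprises the discrete policy $\pi_\theta^{d}$ and the continuous policy $\pi_\theta^{c}$; $\lambda(\bar{P}_d - \delta)$ is the radar safety constraint term and $\lambda$ is the Lagrange multiplier; $\bar{P}_d$ is the average probability that the UAV is detected by radar; and $\delta$ is the set safety threshold.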

[0047] Furthermore, the training process also includes iterative optimization that performs dual updates on the Lagrange multiplier, including: B4: Dual update law. Network parameter updates follow the rules below, and the discrete logits must pass through the safety mask before gradient calculation to prevent backpropagation through unsafe actions:
$$\theta_{\text{new}} \leftarrow \theta_{\text{old}} - \eta\,\nabla_{\theta} L_{\text{total}}, \qquad \lambda_{\text{new}} \leftarrow \lambda_{\text{old}} + \eta\,\nabla_{\lambda} L_{\text{total}}$$
In the formula, the subscripts "new" and "old" denote the parameters after and before the update, respectively; $\leftarrow$ indicates assignment; $\eta$ is the learning rate; $\nabla$ denotes the partial derivative with respect to the subscripted parameter; and $L_{\text{total}}$ is the total loss function.
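The interplay between the constrained loss and the multiplier update can be sketched as follows, assuming a Lagrangian-style loss (negated policy objective, value loss, entropy bonus, and a multiplier times the constraint violation) and gradient ascent on the multiplier. The projection of the multiplier onto non-negative values is a standard addition in Lagrangian methods and is NOT stated in the patent text; the coefficient defaults and function names are hypothetical.

```python
def total_loss(l_clip, value_loss, entropy, mean_detect_prob, lam, threshold,
               c_v=0.5, c_e=0.01):
    """Loss to minimize: -L_clip + c_v * L_V - c_e * H + lam * (mean_detect_prob - threshold)."""
    return (-l_clip + c_v * value_loss - c_e * entropy
            + lam * (mean_detect_prob - threshold))


def dual_update(lam, mean_detect_prob, threshold, lr):
    """Gradient ascent on the multiplier; the gradient is the constraint violation.

    The max(0, ...) projection keeps the multiplier non-negative (an assumption
    added here, not taken from the source).
    """
    return max(0.0, lam + lr * (mean_detect_prob - threshold))


# The multiplier grows while the mean detection probability exceeds the threshold ...
lam = dual_update(0.0, mean_detect_prob=0.3, threshold=0.1, lr=0.5)
# ... and shrinks back toward zero once the constraint is satisfied.
lam_relaxed = dual_update(lam, mean_detect_prob=0.0, threshold=0.1, lr=0.5)
```

The effect is an automatically tuned penalty weight: persistent violations make radar exposure increasingly expensive in the policy loss, while a satisfied constraint lets the policy focus on the task reward again.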

[0048] Furthermore, the final results are as follows. A real-time autonomous control method based on environmental interaction and hybrid decision-making (combined execution result of steps A and B): The core technological product of this embodiment is a closed-loop control method that can be directly applied to UAV airborne platforms. This method is based on the interactive environment constructed in step A, which includes UAV kinematics (A1) and a radar threat field (A2), and uses the converged hybrid action policy network trained in step B as the decision core. In actual combat missions, the method acquires, through airborne sensors, a real-time state vector containing three-dimensional position, velocity vector, and radar threat intensity (corresponding to step B1). The hybrid action space decision architecture (corresponding to step B2) then performs millisecond-level inference and directly outputs the optimal control command vector containing normal overload, roll angle, and velocity increment. This enables UAVs to perform real-time autonomous maneuvers that satisfy physical and dynamic constraints in complex environments.

[0049] The radar-constrained optimal adversarial trajectory generated by policy iteration (physical mapping of steps A and B): In physical space, the execution result of this technical solution is a three-dimensional flight trajectory that balances "mission attack" and "survival obstacle avoidance" (the green trajectory in Figure 4). This trajectory is a direct product of the continuous interaction between the intelligent decision agent of step B and the radar electromagnetic environment of step A. At each time step, the control algorithm calculates the instantaneous detection probability based on the joint threat field model established in step A2, and dynamically adjusts the heading and attitude through the safety masking and constraint-aware optimization mechanism in step B3. Through this continuous closed loop of "state awareness - action execution - physical update", a safe penetration trajectory is ultimately generated and executed in three-dimensional space that actively exploits threat field gradient information and always satisfies the radar detection probability constraint.

[0050] Algorithm convergence speed and final performance improvement: Experimental comparison: In a typical UAV adversarial simulation environment, the hybrid action PPO framework proposed in this embodiment converges markedly faster in training than the standard PPO algorithm. Experimental data show that the reward curve of this method rises steeply within approximately 200 to 500 training episodes, whereas the curve of the standard PPO algorithm is relatively flat and its final return is lower.

[0051] Theoretical support: The hybrid action architecture mitigates the low search efficiency of high-dimensional continuous spaces, so that the average reward stabilizes at about 2,800, which is about 33% higher than that of the standard PPO algorithm (about 2,100).

[0052] Enhanced radar evasion capabilities and survivability: Experimental comparison: In a multi-radar joint threat field test, traditional path planning methods (such as the shortest straight-line path) pass directly through the red area of high detection probability, causing the mission to fail immediately. The "green safe trajectory" generated in this embodiment identifies the threat field gradient and autonomously bypasses the dangerous area covered by the radar detection radius (200 m).

[0053] Theoretical support: By introducing a safety masking mechanism and Lagrange constraint optimization, the expected detection probability is constrained to remain within the set safety threshold. This ensures that the UAV remains in a low-risk detection state throughout the mission.
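The Lagrange constraint optimization mentioned above can be illustrated with a minimal sketch of a projected dual-ascent step, a common way to handle such constraints in constrained policy optimization. This is an assumption for illustration only: the function names, the learning rate, the threshold of 0.10, and the sample detection probabilities are all hypothetical and are not taken from the patent, whose loss and update formulas are not reproduced in this text.

```python
def dual_update(lam, mean_detect_prob, threshold, lr=0.05):
    """One projected dual-ascent step on the Lagrange multiplier.

    The multiplier grows while the average detection probability exceeds the
    threshold and shrinks (never below zero) when there is slack.
    All parameter values here are hypothetical.
    """
    violation = mean_detect_prob - threshold
    return max(0.0, lam + lr * violation)


def penalized_objective(policy_objective, lam, mean_detect_prob, threshold):
    """Lagrangian-style objective: policy term minus the weighted violation."""
    return policy_objective - lam * (mean_detect_prob - threshold)


# Hypothetical per-batch average detection probabilities during training.
lam = 0.0
for p in [0.30, 0.25, 0.12, 0.05]:
    lam = dual_update(lam, p, threshold=0.10)
```

In this sketch the multiplier rises over the first three batches, which violate the threshold, and drops slightly on the last one, so the penalty weight tracks how badly the constraint is being violated.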

[0054] Learning complex tactical maneuvers and improving decision-making robustness: Experimental comparison: In 1v1 combat tests, this embodiment enabled the UAV to successfully learn a variety of advanced tactical behaviors, such as "High Yo-Yo" maneuvers, large turns to flank, flanking interception, and ambushes in blind spots.

[0055] Theoretical support: The decoupled design of the 3-DOF dynamics model and the hybrid action head enables the UAV to perform both macroscopic discrete tactical selection (waypoint planning) and microscopic continuous attitude control (overload and roll), thereby achieving precise execution of tactical actions while meeting strict physical constraints.
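As a rough illustration of how a decoupled hybrid action head might produce one discrete waypoint choice and one continuous control vector from a single state, the following sketch samples both branches and sums their log-probabilities. It assumes a softmax over candidate waypoints and a tanh-squashed diagonal Gaussian for the three control channels; these distributional choices, the function name, and the sizes used are assumptions, not details confirmed by the patent.

```python
import numpy as np


def sample_hybrid_action(logits, mean, log_std, rng):
    """Sample (waypoint_index, control_vector, joint_log_prob).

    logits  : scores for the candidate tactical waypoints (discrete branch)
    mean    : Gaussian mean for the control channels (continuous branch)
    log_std : log standard deviation for the control channels
    """
    # Discrete branch: numerically stabilized softmax, then categorical sample.
    z = logits - np.max(logits)
    probs = np.exp(z) / np.sum(np.exp(z))
    k = int(rng.choice(len(probs), p=probs))

    # Continuous branch: diagonal Gaussian squashed to [-1, 1] with tanh.
    std = np.exp(log_std)
    u = mean + std * rng.standard_normal(mean.shape)
    control = np.tanh(u)

    # Joint log-probability, treating the two branches as independent
    # given the state (an assumption of this sketch).
    logp_disc = np.log(probs[k])
    logp_gauss = np.sum(-0.5 * ((u - mean) / std) ** 2
                        - log_std - 0.5 * np.log(2.0 * np.pi))
    logp_squash = -np.sum(np.log(1.0 - control ** 2 + 1e-8))  # tanh Jacobian
    return k, control, float(logp_disc + logp_gauss + logp_squash)


rng = np.random.default_rng(0)
# Five hypothetical candidate waypoints; three normalized control channels
# standing in for normal overload, roll angle, and velocity increment.
k, control, logp = sample_hybrid_action(
    np.zeros(5), np.zeros(3), np.full(3, -0.5), rng)
```

The normalized control vector would still need to be rescaled to the physical limits of the airframe before use.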

[0056] This embodiment also provides a radar-constrained adversarial decision-making system for fixed-wing UAVs based on hybrid action PPO, including: an environment model module, used to construct an environment model that includes a kinematic model of the UAV and a multi-radar joint threat field model based on radar cross section (RCS); a decision control module, used to construct and train a decision model based on proximal policy optimization (PPO), wherein the decision model receives enhanced state observations including the multi-radar joint threat field model and outputs decision actions based on a hybrid action space, and the hybrid action space includes a discrete action branch for tactical waypoint selection and a continuous action branch for flight attitude control; and an execution module, used to control the UAV to perform adversarial tasks under radar constraints by outputting hybrid actions from the converged decision model according to real-time state observations.

[0057] This technical solution aims to provide a closed-loop control decision-making scheme for fixed-wing unmanned aerial vehicle (UAV) systems that balances tactical effectiveness and stealth survivability. While strictly meeting radar constraints and flight physics constraints, it significantly improves the overall reward performance of the algorithm in adversarial scenarios, and promotes the engineering application of intelligent decision-making technology for autonomous UAVs.

[0058] To more clearly illustrate the technical solution of the present invention, specific embodiments are provided below for description: To verify the effectiveness of this technical solution, this example builds a joint simulation experimental platform based on Python and MATLAB R2021b.

[0059] Experimental scenario initialization and parameter configuration: The simulation experiment is set in a high-fidelity 3D air combat scenario in which the airspace is a boundary-constrained cubic space, as shown in Figure 2.

[0060] Radar threat deployment: As shown in Figure 3, three ground radar stations are deployed within the airspace, with their center coordinates set as follows: … The detection failure threshold radius for a single radar is set at 200 m, meaning the detection probability within this radius is … As shown in Figures 5 and 6, any intrusion into this area is considered a mission failure.
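To make the idea of a multi-radar joint threat field concrete, the sketch below uses one commonly seen RCS-based detection model together with an independence assumption across radars. Everything specific here is hypothetical: the functional form, the constants `c1` and `c2` (chosen only so that the probability is 0.5 at 200 m for an RCS of 1 m²), and the radar coordinates are placeholders, since the patent's own formula, constants, and coordinates are not reproduced in this text. The Y axis is taken as altitude.

```python
import numpy as np


def detection_probability(d, rcs, c1=1.0, c2=6.25e-10):
    """Hypothetical single-radar model: probability decays with d^4 / RCS."""
    return 1.0 / (1.0 + (c2 * d ** 4 / rcs) ** c1)


def joint_threat(pos, radars, rcs):
    """Probability of detection by at least one radar (independence assumed)."""
    d = np.linalg.norm(radars - pos, axis=1)   # distance to each radar center
    p = detection_probability(d, rcs)
    return float(1.0 - np.prod(1.0 - p))


# Placeholder ground-station coordinates (x, y = altitude, z) in meters.
RADARS = np.array([[300.0, 0.0, 300.0],
                   [700.0, 0.0, 500.0],
                   [500.0, 0.0, 800.0]])

near = joint_threat(np.array([300.0, 150.0, 300.0]), RADARS, rcs=1.0)
far = joint_threat(np.array([5000.0, 300.0, 5000.0]), RADARS, rcs=1.0)
```

A smooth model of this kind gives the policy a continuous gradient to follow, in contrast to a hard inside/outside test on the 200 m radius.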

[0061] Physical constant setting: The radar system constants are derived through numerical simulation to ensure that the simulation environment is physically realistic and provides continuous gradient information.

[0062] UAV actuator performance parameters: The fixed-wing UAV used in this example follows a 3-DOF point mass dynamics model, and its state variables are constrained by the following physical safety envelope: Speed constraint: The true airspeed V must be maintained within the range of [25, 38] m/s.

[0063] Maneuvering constraints: The roll angle is limited to within …, and the maximum normal load factor n is limited to ….
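A minimal sketch of one integration step of a 3-DOF point-mass model under such an envelope is given below, using the standard point-mass equations with normal load factor, roll angle, and a velocity increment as controls. The speed range [25, 38] m/s comes from the text; the roll and load-factor limits, the time step, and the axis convention (Y as altitude) are hypothetical placeholders, because the corresponding values are not reproduced in this text.

```python
import math

G = 9.81                        # gravitational acceleration, m/s^2
V_MIN, V_MAX = 25.0, 38.0       # speed envelope from the text, m/s
MU_MAX = math.radians(60.0)     # hypothetical roll-angle limit
N_MAX = 3.0                     # hypothetical normal load-factor limit


def clip(value, lo, hi):
    return max(lo, min(hi, value))


def step(state, n, mu, dv, dt=0.1):
    """Advance (x, y, z, v, gamma, psi) by one explicit Euler step.

    n  : normal load factor      mu : roll angle (rad)
    dv : velocity increment applied over this step (m/s)
    Singular when the flight path angle gamma approaches +/-90 degrees.
    """
    x, y, z, v, gamma, psi = state
    n = clip(n, -N_MAX, N_MAX)
    mu = clip(mu, -MU_MAX, MU_MAX)

    gamma_dot = G / v * (n * math.cos(mu) - math.cos(gamma))
    psi_dot = G * n * math.sin(mu) / (v * math.cos(gamma))

    x += v * math.cos(gamma) * math.cos(psi) * dt
    y += v * math.sin(gamma) * dt                      # altitude
    z += v * math.cos(gamma) * math.sin(psi) * dt
    v_new = clip(v + dv, V_MIN, V_MAX)
    return (x, y, z, v_new, gamma + gamma_dot * dt, psi + psi_dot * dt)


# Straight and level flight: n = 1, wings level, no speed change.
level = step((0.0, 100.0, 0.0, 30.0, 0.0, 0.0), n=1.0, mu=0.0, dv=0.0)
```

Clipping the commands before integration is one simple way to keep every generated trajectory inside the envelope regardless of what the policy outputs.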

[0064] Perception Constraints: The blue attacking UAV has a specific sensor detection cone (green) and an attack engagement zone cone (orange). It needs to maintain the target within the cone through precise maneuvering to meet the locking conditions.

[0065] Algorithm training performance and convergence analysis: This example conducts a comparative training experiment of 1,000 episodes between the proposed hybrid action PPO algorithm and the standard PPO algorithm: Convergence efficiency: As shown in Figure 7, the algorithm of this invention explores efficiently within 200 to 500 training episodes, and the total reward rises rapidly.

[0066] Final reward performance: After 800 episodes, the average reward of this algorithm stabilized at approximately 2,800, a gain of about 33% over the standard PPO's 2,100. This indicates that the hybrid action space converges to a better solution when handling coupled decision-making tasks.

[0067] This example reconstructs four typical scenarios in MATLAB to validate the model's decision-making ability, as shown in Figure 8. Figure 8(a) shows flanking interception: the UAV identifies the radar threat gradient, executes a wide detour, and intercepts the target from behind by exploiting radar blind spots.

[0068] Figure 8(b) shows a head-on engagement: when facing an oncoming threat, the UAV performs a "high yo-yo" maneuver to quickly gain dominance in the rear hemisphere of the target while avoiding radar coverage.

[0069] Figure 8(c) shows tail-chase pursuit: during long-range pursuit, the UAV dynamically adjusts its altitude (Y-axis) and heading to keep itself in a "safe corridor" with a very low probability of radar detection.

[0070] Figure 8(d) shows loitering and turning: when the target trajectory is complex, the UAV loiters at the safety boundary to find the best interception window while keeping the real-time detection probability within the safety constraint.

[0071] As shown in Figure 4, monitoring the real-time Euclidean distance between the UAV and each radar station shows that, throughout the entire flight trajectory, the UAV remained outside the 200 m safety threshold of all three radar centers. This result quantitatively demonstrates that this technical solution can strictly meet the preset radar survivability constraints while completing a highly challenging interception mission.
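The distance check described above can be sketched as follows: for a sampled trajectory, compute the minimum Euclidean distance to each radar center and compare it with the 200 m threshold. The trajectory and radar coordinates in this example are hypothetical placeholders, and the function names are chosen here for illustration.

```python
import numpy as np


def min_radar_distances(trajectory, radars):
    """Minimum Euclidean distance from a (T, 3) trajectory to each radar."""
    diff = trajectory[:, None, :] - radars[None, :, :]    # shape (T, R, 3)
    return np.linalg.norm(diff, axis=2).min(axis=0)        # shape (R,)


def satisfies_constraint(trajectory, radars, radius=200.0):
    """True if every trajectory sample stays outside every radar radius."""
    return bool(np.all(min_radar_distances(trajectory, radars) > radius))


# Hypothetical straight flight at 300 m altitude (Y axis) along X.
xs = np.linspace(0.0, 1000.0, 101)
trajectory = np.stack([xs, np.full_like(xs, 300.0), np.zeros_like(xs)], axis=1)

radars_ok = np.array([[500.0, 0.0, 0.0], [500.0, 0.0, 100.0]])
radars_bad = np.array([[500.0, 150.0, 0.0]])

safe = satisfies_constraint(trajectory, radars_ok)      # closest approach 300 m
unsafe = satisfies_constraint(trajectory, radars_bad)   # closest approach 150 m
```

Because the check only sees sampled points, the sampling interval should be small relative to the 200 m radius so that a brief intrusion between samples is not missed.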

[0072] The above are merely preferred embodiments of this application, but the scope of protection of this application is not limited thereto. Any variations or substitutions that can be easily conceived by those skilled in the art within the scope of the technology disclosed in this application should be included within the scope of protection of this application. Therefore, the scope of protection of this application should be determined by the scope of the claims.

Claims

1. A radar-constrained adversarial decision-making method for fixed-wing UAVs based on hybrid action PPO, characterized in that it includes: constructing an environment model, which includes a kinematic model of the UAV and a multi-radar joint threat field model based on radar cross section (RCS); constructing and training a decision model based on proximal policy optimization (PPO), wherein the decision model receives enhanced state observations including the multi-radar joint threat field model and outputs decision actions based on a hybrid action space, and the hybrid action space includes a discrete action branch for tactical waypoint selection and a continuous action branch for flight attitude control; and, based on the converged decision model, outputting hybrid actions according to real-time state observations to control the UAV to perform adversarial missions under radar constraints.

2. The radar-constrained adversarial decision-making method for fixed-wing UAVs based on hybrid action PPO according to claim 1, characterized in that the multi-radar joint threat field model is constructed from the detection probability of a single radar, wherein the instantaneous detection probability of the aircraft under the i-th radar is: …; in the formula, the symbols denote, respectively, the Euclidean distance between the UAV and the center of the i-th radar, the current radar cross section of the UAV, the inherent constants of the radar system, and the instantaneous detection probability of the aircraft under the i-th radar; and wherein the joint threat field strength of a network of multiple radars is: …; in the formula, the symbols denote, respectively, the joint threat field strength, the current spatial position vector of the UAV, the number of radars, and the instantaneous detection probability of the aircraft under the i-th radar.

3. The radar-constrained adversarial decision-making method for fixed-wing UAVs based on hybrid action PPO according to claim 1, characterized in that the hybrid action space includes: a discrete action branch, used to select one grid as a tactical waypoint from a set of velocity-adaptive 3D space grids; and a continuous action branch, used to directly output the normalized control vector for low-level attitude control.

4. The radar-constrained adversarial decision-making method for fixed-wing UAVs based on hybrid action PPO according to claim 3, characterized in that constructing and training the decision model based on proximal policy optimization (PPO) further includes introducing a safety masking mechanism into the discrete action branch, wherein the safety masking mechanism includes: calculating the radar exposure cost for each candidate grid; and, based on the radar exposure cost, modifying the original action logit of the candidate grid or hard-masking grids that exceed a preset threshold.

5. The radar-constrained adversarial decision-making method for fixed-wing UAVs based on hybrid action PPO according to claim 4, characterized in that the radar exposure cost for each candidate grid is calculated as follows: …; in the formula, the symbols denote, respectively, the radar exposure cost, the joint threat field intensity at a given moment, the time variable along the flight path, and the flight trajectory of the UAV to the given candidate grid.
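For illustration, the exposure cost of claim 5 and the masking step of claim 4 might be approximated as in the sketch below: the threat intensity is integrated over time along the path to each candidate grid with the trapezoidal rule, and the resulting costs either shift the logits or hard-mask them. The straight-line path, the constant speed, the sample count, and the parameters `soft_weight` and `hard_threshold` are assumptions of this sketch, not details given in the claims.

```python
import numpy as np


def exposure_cost(start, target, threat_fn, speed, n_samples=50):
    """Trapezoidal estimate of the time integral of threat along a segment.

    Assumes a straight-line path flown at constant speed (hypothetical).
    threat_fn maps a 3D position to a joint threat intensity.
    """
    pts = np.linspace(start, target, n_samples)          # (n_samples, 3)
    threat = np.array([threat_fn(p) for p in pts])
    duration = np.linalg.norm(target - start) / speed
    dt = duration / (n_samples - 1)
    return float(dt * (threat.sum() - 0.5 * (threat[0] + threat[-1])))


def mask_logits(logits, costs, soft_weight=1.0, hard_threshold=None):
    """Penalize logits by exposure cost; optionally hard-mask costly grids."""
    masked = logits - soft_weight * costs
    if hard_threshold is not None:
        masked = np.where(costs > hard_threshold, -np.inf, masked)
    return masked


# Constant threat of 0.2 over a 100 m segment flown at 25 m/s (4 s of exposure).
cost = exposure_cost(np.zeros(3), np.array([100.0, 0.0, 0.0]),
                     lambda p: 0.2, speed=25.0)

masked = mask_logits(np.zeros(3), np.array([0.1, 0.5, 2.0]), hard_threshold=1.0)
```

If every candidate were hard-masked, a softmax over the result would be undefined, so a practical implementation would need a fallback such as keeping the lowest-cost grid available.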

6. The radar-constrained adversarial decision-making method for fixed-wing UAVs based on hybrid action PPO according to claim 1, characterized in that the decision model based on proximal policy optimization (PPO) is constructed and trained using a constraint-aware policy optimization mechanism, wherein the constraint-aware policy optimization mechanism includes constructing a comprehensive reward function, specifically: …; in the formula, the symbols denote, respectively, the penalty for staying in the danger zone, the original game reward, the weighting coefficients, the RCS cost, the comprehensive reward function, and the joint threat field intensity at a given moment.

7. The radar-constrained adversarial decision-making method for fixed-wing UAVs based on hybrid action PPO according to claim 6, characterized in that the constraint-aware policy optimization mechanism further includes constructing a total loss function containing a Lagrange multiplier, specifically: …; in the formula, the symbols denote, respectively, the total loss function containing the Lagrange multiplier, the policy objective, the value function loss, the value loss coefficient, the Critic network's predicted value of the current state, the actual return, the entropy bonus, the entropy coefficient, the entropy of the policy, the discrete policy, the continuous policy, the radar safety constraint, the Lagrange multiplier, the average probability that the UAV is detected by radar, and the set safety threshold.

8. The radar-constrained adversarial decision-making method for fixed-wing UAVs based on hybrid action PPO according to claim 7, characterized in that the training process further includes iterative optimization that performs dual updates on the Lagrange multiplier, specifically: …; in the formula, the symbols denote, respectively, the new parameters after the update and the old parameters before the update, the assignment operator, the learning rate, the partial derivative with respect to the parameters, and the total loss function.

9. A radar-constrained adversarial decision-making system for fixed-wing UAVs based on hybrid action PPO, used to implement the radar-constrained adversarial decision-making method for fixed-wing UAVs based on hybrid action PPO according to any one of claims 1-8, characterized in that it includes: an environment model module, used to construct an environment model that includes a kinematic model of the UAV and a multi-radar joint threat field model based on radar cross section (RCS); a decision control module, used to construct and train a decision model based on proximal policy optimization (PPO), wherein the decision model receives enhanced state observations including the multi-radar joint threat field model and outputs decision actions based on a hybrid action space, and the hybrid action space includes a discrete action branch for tactical waypoint selection and a continuous action branch for flight attitude control; and an execution module, used to control the UAV to perform adversarial tasks under radar constraints by outputting hybrid actions from the converged decision model according to real-time state observations.

10. An electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that, when the processor executes the computer program, it implements the radar-constrained adversarial decision-making method for fixed-wing UAVs based on hybrid action PPO according to any one of claims 1-8.