A Waveform Scheduling Method for Group Target Tracking Based on Deep Reinforcement Learning
By constructing a state space and optimizing radar waveform parameters through deep reinforcement learning, the problem of insufficient adaptability of traditional radar in group target tracking is solved, achieving accurate tracking and improved stability while reducing computation time.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- NANJING RES INST OF ELECTRONICS TECH
- Filing Date
- 2026-03-31
- Publication Date
- 2026-06-30
AI Technical Summary
Traditional radar waveform scheduling methods struggle to adapt to high-density, dynamically changing target groups, leading to oscillations in tracking accuracy and unstable correlation, thus failing to effectively improve the overall performance and stability of target group tracking.
A deep reinforcement learning-based approach is adopted to construct a state space that integrates group motion characteristics and spatial distribution, design a composite reward function, and optimize radar transmission waveform parameters through Actor and Critic neural networks to achieve adaptive scheduling.
It achieves accurate tracking and target resolution during dynamic formation changes, reduces track miscorrelation, improves tracking accuracy and stability, reduces computation time, and meets the real-time radar scheduling requirements.
Figure CN122308398A_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the field of next-generation information technology, and more specifically to a waveform scheduling method for group target tracking based on deep reinforcement learning.
Background Technology
[0002] Group target tracking aims to estimate and predict the position, velocity, trajectory, and structural information of a group of targets using sensor data. Unlike single-target or multi-target tracking, group targets are characterized by high-density aggregation, dynamic formation, and cooperative movement. The distance between group members is often at the radar resolution limit, which can easily cause problems such as target echo aliasing and track miscorrelation, affecting the stability of target tracking.
[0003] Waveform scheduling, as a core component of group target tracking, directly determines the quality of measurement information and the accuracy of state estimation. Traditional radar waveform scheduling relies on a fixed parameter set and a limited rule base, lacking the ability to adaptively perceive the dynamic formation characteristics of targets. It is difficult to balance accuracy and resolution requirements in real time, resulting in oscillations in track accuracy and instability in correlation during the group target merging / splitting phases.
[0004] Some literature proposes a constrained deep reinforcement learning (CDRL) algorithm based on the deep Q-network (DQN) to solve the cooperative optimization of scanning and multi-maneuvering-target tracking under resource constraints. Other literature proposes dual-agent cooperative decision-making with an LSTM self-attention mechanism to optimize the waveform parameters for multi-target tracking. Still other literature proposes a data-driven integrated transmit resource management scheme to improve the multi-target tracking performance of multifunction radar in dynamic electromagnetic environments.
[0005] The above technologies focus on maximizing the number of targets tracked under limited radar resources, or on minimizing transmit energy under specified tracking conditions. They do not study in depth the tracking performance for dense formation targets.
[0006] Existing technologies primarily address macroscopic optimization problems, such as maximizing the number of targets tracked or minimizing transmit energy under resource constraints. They lack targeted research on the inherent performance challenges of tracking densely packed target formations.
[0007] Specifically, the real-time trade-off between tracking accuracy and target resolution caused by dynamic group formations (such as merging and splitting) has not been fully considered. For high-density, dynamically changing target groups, it remains difficult to adjust waveform parameters adaptively and finely to the formation situation, and track accuracy oscillations and correlation instability during the merging / splitting phases cannot be effectively suppressed. Technical gaps therefore remain in improving the overall performance and stability of group target tracking.
Summary of the Invention
[0008] Traditional radar cannot stably track swarm targets, because the spatial geometry of the swarm changes dynamically during flight as flight intentions change. To address this technical problem, this invention provides an adaptive waveform scheduling method for swarm target tracking based on deep reinforcement learning. The method constructs a state space that integrates swarm motion characteristics and spatial distribution, and designs a composite reward function that combines a tracking accuracy reward, a target resolution reward, and a miscorrelation penalty. This enables the radar to adjust its transmitted waveform parameters adaptively according to the real-time situation of the swarm targets. The invention thereby responds adaptively to changes in swarm target formations, improving target resolution performance and overall tracking stability while maintaining tracking accuracy.
[0009] To achieve the above objectives, the technical solution adopted by this invention is as follows: a waveform scheduling method for group target tracking based on deep reinforcement learning, comprising the following steps:
[0010] Step 1: Construct the state space and action space of the group target tracking waveform scheduling agent. The state space is used to characterize the observation state of the group targets, and the action space consists of a set of adjustable radar transmission waveform parameters.
[0011] Step 2: Construct the neural network of the agent, including an Actor neural network for outputting waveform parameters based on the observed state and a Critic neural network for evaluating the value of the state after the action is performed.
[0012] Step 3: Construct the reward and punishment mechanism for the agent;
[0013] Step 4: Input the current group target observation state into the Actor neural network to generate an action that determines the radar transmission waveform parameters at the next moment;
[0014] Step 5: Perform the action, transmit the corresponding radar waveform, and receive the target echo to obtain the new observation state;
[0015] Step 6: Calculate the reward value based on the new observation state and the reward / penalty mechanism, and update the parameters of the Actor neural network and the Critic neural network using the reward value.
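The loop formed by Steps 4 to 6 can be sketched as follows. This is only an illustrative skeleton: `ToyEnvironment`, its placeholder reward, and the callables `actor` and `update` are hypothetical stand-ins, not the radar environment, composite reward, or network update described in this document.

```python
class ToyEnvironment:
    """Hypothetical stand-in for the radar/target environment (not from the source)."""

    def reset(self):
        self.k = 0
        return float(self.k)

    def step(self, action):
        # Advance one time step and return (new observation state, reward).
        self.k += 1
        reward = -abs(action - 0.5)  # placeholder reward, NOT the composite reward
        return float(self.k), reward


def run_episode(actor, environment, update, num_steps):
    """Run Steps 4-6 repeatedly and return the accumulated reward."""
    state = environment.reset()
    total_reward = 0.0
    for _ in range(num_steps):
        action = actor(state)                          # Step 4: state -> action
        next_state, reward = environment.step(action)  # Step 5: transmit, observe
        update(state, action, reward, next_state)      # Step 6: learn from reward
        state = next_state
        total_reward += reward
    return total_reward


transitions = []
total = run_episode(
    actor=lambda s: 0.5,
    environment=ToyEnvironment(),
    update=lambda *t: transitions.append(t),
    num_steps=5,
)
```

Passing the actor and the update rule as callables keeps the loop independent of how the networks are implemented.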
[0016] In a preferred embodiment of the present invention, Step 1 constructs the state space from raw radar observation data, which specifically includes two types of temporal features:
[0017] Track state sequence: the tracked track measurement points of N consecutive frames, describing the distance of the target from the radar, in meters (m).
[0018] Intra-group relative position sequence: the minimum radial spacing of the points within the gate, describing the evolution of the formation's tightness, in meters (m).
[0019] The state $s_k$ at time k is defined as the combination of these two sequences;
[0020] The action $a_k$ at time k is generated from the state $s_k$. The action space is defined as the set of transmitted waveform parameters at the current moment, and the agent's action is composed of the pulse width $\tau_k$, the bandwidth $B_k$, and the pulse repetition period $T_k$ at time k:
[0021] $a_k = [\tau_k, B_k, T_k]$.
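A minimal sketch of how such a sliding-window state could be assembled is given below. The class name `StateWindow`, the per-frame `[range, spacing]` layout, and the handling of gates with fewer than two points are assumptions made for illustration; the source does not specify them.

```python
from collections import deque


def min_radial_spacing(gate_ranges_m):
    """Smallest gap between neighbouring range measurements inside the gate."""
    ordered = sorted(gate_ranges_m)
    if len(ordered) < 2:
        return float("inf")  # assumption: a single point has no neighbour
    return min(b - a for a, b in zip(ordered, ordered[1:]))


class StateWindow:
    """Keeps the last N frames of track range and minimum radial spacing."""

    def __init__(self, num_frames):
        self.ranges = deque(maxlen=num_frames)
        self.spacings = deque(maxlen=num_frames)

    def push(self, track_range_m, gate_ranges_m):
        self.ranges.append(track_range_m)
        self.spacings.append(min_radial_spacing(gate_ranges_m))

    def ready(self):
        return len(self.ranges) == self.ranges.maxlen

    def state(self):
        # One [range, spacing] row per frame, oldest first (assumed layout).
        return [[r, d] for r, d in zip(self.ranges, self.spacings)]


window = StateWindow(num_frames=3)
frames = [
    (10000.0, [10000.0, 10050.0, 10120.0]),
    (9900.0, [9900.0, 9940.0, 10010.0]),
    (9800.0, [9800.0, 9830.0, 9890.0]),
    (9700.0, [9700.0, 9720.0, 9770.0]),
]
for track_range, gate in frames:
    window.push(track_range, gate)
state = window.state()
```

The bounded `deque` discards the oldest frame automatically, so the state always covers frames k-N+1 to k.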
[0022] In a preferred embodiment of the present invention, in step two:
[0023] The Actor neural network outputs waveform parameters based on the observed state. Its input is the state $s_k$ at time k. The network consists of a long short-term memory (LSTM) layer and a fully connected (FCC) layer. After a ReLU activation function, three prediction heads branch off. The three heads share the same structure, each containing an FCC layer and a sigmoid layer, and output the three waveform actions: pulse width $\tau_k$, bandwidth $B_k$, and pulse repetition period $T_k$.
[0024] The Critic neural network evaluates the Q-value of the state-action pair formed by the observed state and the action output by the Actor. It consists of an LSTM layer and a fully connected (FCC) layer. After a ReLU activation function, the Q-value is output by an FCC layer.
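Because each Actor head ends in a sigmoid, its raw output lies in (0, 1) and has to be mapped to a physical waveform parameter. One plausible way to do this is a linear rescaling, sketched below. The bounds in `BOUNDS` and the function `heads_to_waveform` are hypothetical and chosen only for illustration; the source does not state the parameter ranges or the mapping.

```python
import math

# Hypothetical parameter limits, not taken from the source.
BOUNDS = {
    "pulse_width_s": (1e-6, 100e-6),
    "bandwidth_hz": (1e6, 50e6),
    "pulse_repetition_period_s": (0.1e-3, 2e-3),
}


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def heads_to_waveform(logits):
    """Map three pre-sigmoid head outputs to bounded waveform parameters."""
    waveform = {}
    for (name, (low, high)), z in zip(BOUNDS.items(), logits):
        waveform[name] = low + sigmoid(z) * (high - low)
    return waveform


midpoint = heads_to_waveform([0.0, 0.0, 0.0])
extreme = heads_to_waveform([50.0, -50.0, 50.0])
```

A bounded output keeps every action inside the hardware limits without needing to clip after the fact.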
[0025] In a preferred embodiment of the present invention, in Step 3, the reward and punishment mechanism evaluates an action with the following terms:
[0026] The tracking accuracy reward evaluates how accurately the scheduling waveform generated by the Actor neural network tracks the target. The random measurement error of the waveform is described using the Cramér-Rao lower bound of the waveform:
[0027] ;
[0028] where sign is the sign function, and the waveform entropy at time k is:
[0029] ;
[0030] where the waveform transmitted at time k determines the random measurement error, and the covariance matrix at time k is estimated by Kalman filtering using that measurement error;
[0031] The target discrimination reward describes whether the scheduling waveform generated by the Actor neural network can resolve the targets:
[0032] ;
[0033] ;
[0034] where d is a random variable describing the spacing between the two closest targets in the group, $\Phi$ is the cumulative distribution function of the standard normal distribution, and the resolvable range cell is determined by the bandwidth $B_k$:
[0035] where c is the speed of light in a vacuum;
[0036] The target misassociation penalty penalizes the mixing of target tracks that occurs when the scheduling waveform generated by the Actor neural network only just reaches the critical resolution:
[0037] ;
[0038] The conditions for triggering this penalty are:
[0039] .
[0040] As a preferred embodiment of the present invention, step four specifically includes:
[0041] The current observation state of the group targets is input into the Actor neural network and propagated forward according to the following two formulas:
[0042] ;
[0043] ;
[0044] This generates the action $a_k$ that determines the radar transmit waveform parameters at the next moment, composed of the pulse width $\tau_k$, bandwidth $B_k$, and pulse repetition period $T_k$. Here $h_k$ denotes the temporal features that the Actor network extracts with the LSTM; these are fed into a fully connected network to generate the corresponding action $a_k$. The two sets of parameters in the formulas are the trained LSTM network parameters and the trained FCC network parameters, respectively.
[0045] As a preferred embodiment of the present invention, step five specifically includes:
[0046] The action generated by the Actor neural network is executed to produce the corresponding radar waveform. The random ranging error of this waveform is described as follows:
[0047] ;
[0048] where c is the speed of light, Snr is the signal-to-noise ratio of a single pulse echo, $\tau$ is the pulse width, $k_b$ is the frequency modulation slope, $f_c$ is the carrier frequency, T is the pulse repetition period, and $N_p$ is the number of accumulated pulses;
[0049] After the action is performed and the waveform transmitted, the reward $r_k$ received is:
[0050] .
[0051] In a preferred embodiment of the present invention, step six specifically includes:
[0052] Based on the new observation state and the reward / penalty mechanism, update the parameters of the Actor neural network and the Critic neural network, and simultaneously update the target network parameters using the following formula:
[0053] ;
[0054] where the soft-update hyperparameter controls how far the target Actor and Critic network parameters move toward the online Actor and Critic network parameters at each update.
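The soft update can be illustrated with the form conventionally used in DDPG, in which each target parameter is replaced by a weighted blend of the online parameter and its own previous value. The sketch below assumes that this conventional form applies here; the function name `soft_update`, the argument name `blend`, and the flat-list parameter representation are illustrative choices, not taken from the source.

```python
def soft_update(target_params, online_params, blend):
    """Return target parameters moved a fraction `blend` toward the online ones.

    Assumed conventional DDPG rule: new_target = blend * online + (1 - blend) * target.
    """
    if not 0.0 <= blend <= 1.0:
        raise ValueError("blend must lie in [0, 1]")
    return [
        blend * online + (1.0 - blend) * target
        for target, online in zip(target_params, online_params)
    ]


target = [0.0, 0.0]
online = [1.0, 2.0]
updated = soft_update(target, online, blend=0.1)
copied = soft_update(target, online, blend=1.0)
```

A small blend value makes the target networks change slowly, which is the usual reason for keeping them separate from the online networks.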
[0055] As described above, the technical solution adopted in this invention has the following beneficial effects:
[0056] 1. Outstanding dynamic adaptation capability: By constructing a temporal state space that integrates the motion characteristics and spatial distribution of the swarm, and combining it with the LSTM-enhanced DDPG framework, the method achieves real-time dynamic optimization of waveform parameters such as pulse width, bandwidth, and pulse repetition period. This matches the needs of different formation evolution stages such as target merging, splitting, and critical resolution, solving the problem that traditional fixed parameters and limited rule bases cannot cope with dynamic formation changes.
[0057] 2. Tracking performance optimization: A composite reward function integrates the accuracy reward, resolution reward, and miscorrelation penalty. During the merging phase, the bandwidth is actively reduced to suppress track miscorrelation, and during the splitting phase, the resolution is adaptively improved. Compared with the criterion method, the root mean square errors of distance and velocity are reduced by 9.51% and 5.08%, respectively. The error reduction is even larger compared with the segmented fixed waveform, achieving a global balance between tracking accuracy, resolution, and stability.
[0058] 3. Strong engineering practicality: On the same computing platform, the algorithm computation time is reduced by 79.7% compared with the criterion method, which meets the real-time scheduling requirements of radar while ensuring high performance and gives the method wide applicability and promotion value.
Description of the Drawings
[0059] Figure 1 is a schematic diagram of the waveform scheduling method for group target tracking based on deep reinforcement learning according to the present invention.
[0060] Figure 2 is a schematic diagram of the network structure of the waveform scheduling agent for group target tracking in this invention.
[0061] Figure 3 is a flight path diagram of the tracked target group in this invention.
[0062] Figure 4 is the first diagram of the waveform scheduling decision-making process under continuous interaction between the agent and the targets.
[0063] Figure 5 is the second diagram of the waveform scheduling decision-making process under continuous interaction between the agent and the targets.
[0064] Figure 6 is the third diagram of the waveform scheduling decision-making process under continuous interaction between the agent and the targets.
Detailed Implementation
[0065] The technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings of the embodiments of the present invention. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. All other embodiments obtained by those skilled in the art based on the embodiments of the present invention without creative effort are within the scope of protection of the present invention.
[0066] The embodiments provided by the present invention will be described in detail below:
[0067] As shown in the flowchart of Figure 1, the present invention provides a waveform scheduling method for swarm target tracking based on deep reinforcement learning, comprising four implementation steps:
[0068] (1) Definition of the agent's state space and action space.
[0069] This algorithm constructs a state space based on raw radar observation data, which specifically includes two types of time-series features.
[0070] 1.1 Track state sequence: the tracked track measurement points of N consecutive frames, describing the distance of the target from the radar, in meters (m).
[0071] 1.2 Intra-group relative position sequence: the minimum radial spacing of the points within the gate, describing the evolution of the tightness of the formation structure, in meters (m).
[0072] The state $s_k$ at time k is defined as the combination of these two sequences.
[0073] The action $a_k$ at time k is generated from the state $s_k$. The action space is defined as the set of transmitted waveform parameters at the current moment, and the agent's action is composed of the pulse width $\tau_k$, the bandwidth $B_k$, and the pulse repetition period $T_k$ at time k:
[0074] Formula (1)
[0075] (2) Define the interaction process between radar and target environment
[0076] Consider two target members within a group and the waveform transmitted at time k. The range measurement errors of both group members then follow a zero-mean normal distribution, whose variance is the Cramér-Rao lower bound of the waveform:
[0077] Formula (2)
[0078] where c is the speed of light, Snr is the signal-to-noise ratio of a single pulse echo, $\tau$ is the pulse width, $k_b$ is the frequency modulation slope, $f_c$ is the carrier frequency, T is the pulse repetition period, and $N_p$ is the number of accumulated pulses.
[0079] Take the range measurements of the two targets, the measured spacing between them, and the corresponding statistic of that spacing. The resolvable range cell is related to the radar waveform bandwidth B. When the absolute value of the spacing between the two targets is less than the resolvable range cell, the targets are indistinguishable. Transforming to a standard normal variable yields the probability that the two targets are indistinguishable:
[0080] Formula (3)
[0081] where $\Phi$ is the cumulative distribution function of the standard normal distribution. Because target echoes spread in range, consecutive range cells are often agglomerated in engineering practice. Therefore, the condition for the two targets to be completely distinguishable, and its corresponding probability, is:
[0082] Formula (4)
[0083] Furthermore, the condition for the two targets to be critically distinguishable, and its corresponding probability, is:
[0084] Formula (5)
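The reasoning in step (2) can be illustrated numerically. The sketch below is not a reconstruction of Formulas (3) to (5): it assumes the conventional range cell $c/(2B)$, independent zero-mean normal range errors with equal standard deviation for the two targets (so that the measured spacing has variance equal to twice the single-target variance), and a known true spacing. The function names are hypothetical.

```python
import math

SPEED_OF_LIGHT = 299_792_458.0  # m/s


def range_cell(bandwidth_hz):
    """Conventional range resolution cell c / (2B), assumed here."""
    return SPEED_OF_LIGHT / (2.0 * bandwidth_hz)


def normal_cdf(x):
    """Cumulative distribution function of the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def prob_unresolved(true_spacing_m, sigma_range_m, bandwidth_hz):
    """P(|d| < range cell) for measured spacing d ~ N(true spacing, 2 * sigma^2)."""
    sigma_d = math.sqrt(2.0) * sigma_range_m  # two independent range errors
    cell = range_cell(bandwidth_hz)
    upper = (cell - true_spacing_m) / sigma_d
    lower = (-cell - true_spacing_m) / sigma_d
    return normal_cdf(upper) - normal_cdf(lower)


# Two targets 20 m apart, 2 m range error each (illustrative numbers only).
p_narrow = prob_unresolved(20.0, 2.0, bandwidth_hz=5e6)   # cell of about 30 m
p_wide = prob_unresolved(20.0, 2.0, bandwidth_hz=50e6)    # cell of about 3 m
```

With the narrow bandwidth the two targets almost certainly fall in one range cell, while the wide bandwidth separates them, which is the trade-off the agent's bandwidth choice acts on.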
[0085] (3) Neural network initialization.
[0086] 3.1 Randomly initialize the online Actor network and the online Critic network;
[0087] 3.2 Initialize the target Actor and target Critic network parameters;
[0088] 3.3 Create an experience replay buffer D with capacity $N_{\text{buffer}}$;
[0089] where the Actor network, with learnable parameters, takes the state $s_k$ as input and generates the action $a_k$, and the Critic network, with learnable parameters, evaluates the action generated in state $s_k$ to obtain its Q-value. The structures of the two networks are shown in Figure 2. Both the Actor and Critic networks consist of a long short-term memory (LSTM) network, a fully connected network (FCC), a ReLU activation function, and a sigmoid activation function. The Actor has three heads that control the three waveform actions, while the Critic network has one head that outputs the Q-value of the action, taking the state and the action as input.
[0090] (4) Model training.
[0091] 4.1 The Actor network extracts the temporal features $h_k$ using the LSTM, then generates the corresponding action $a_k$ through an activation function and a fully connected network:
[0092] Formula (6)
[0093] Formula (7)
[0094] where the two sets of parameters are the LSTM network parameters and the FCC network parameters, respectively.
[0095] 4.2 Execute the action $a_k$, calculate the observation reward $r_k$ according to the waveform, and obtain the next state $s_{k+1}$:
[0096] Formula (8)
[0097] where the waveform accuracy reward is:
[0098] Formula (9)
[0099] sign is the sign function, and the waveform entropy at time k is:
[0100] Formula (10)
[0101] where the covariance matrix at time k is estimated by Kalman filtering, using the random measurement error of the transmitted waveform.
[0102] The group target discrimination reward is:
[0103] Formula (11)
[0104] The penalty for miscorrelation once the target spacing reaches the waveform resolution limit is:
[0105] Formula (12)
[0106] 4.3 Store the transition sample in buffer D, randomly sample a mini-batch of samples from D, perform network training, and update the Actor network parameters and the Critic network parameters;
[0107] 4.4 Perform a soft update on the target network.
[0108] Formula (13)
[0109] where the soft-update hyperparameter controls how far the target network parameters move toward the online network parameters at each update.
[0110] 4.5 Once the policy converges (the average reward stabilizes) or the time-step condition is met, return the trained Actor network parameters. The target network is updated at the same time.
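The store-and-sample behaviour of the experience replay buffer used in step 4.3 can be sketched as below. The class name `ReplayBuffer`, the tuple layout of a transition, and the uniform sampling are assumptions for illustration; the source only states that samples are stored in buffer D and that a mini-batch is drawn at random.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-capacity store of (state, action, reward, next_state) transitions."""

    def __init__(self, capacity):
        # A bounded deque drops the oldest transition once capacity is reached.
        self._items = deque(maxlen=capacity)

    def store(self, state, action, reward, next_state):
        self._items.append((state, action, reward, next_state))

    def sample(self, batch_size, rng=random):
        """Draw a uniform random mini-batch without replacement."""
        return rng.sample(list(self._items), batch_size)

    def __len__(self):
        return len(self._items)


buffer = ReplayBuffer(capacity=3)
for k in range(5):
    # Dummy transitions: the values carry no physical meaning.
    buffer.store(state=k, action=0.5, reward=-1.0, next_state=k + 1)

batch = buffer.sample(batch_size=2, rng=random.Random(0))
stored_states = sorted(item[0] for item in buffer.sample(3, rng=random.Random(1)))
```

Sampling at random from past transitions breaks the temporal correlation between consecutive radar frames, which is the usual motivation for a replay buffer in DDPG-style training.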
[0111] (5) Inference application example
[0112] Initialize a segment of the group's target trajectory, as shown in Figure 3: suppose the group of targets enters the radar observation airspace from due north; after it passes the nearest observation point (the waypoint), the radial distance gradually increases. The total flight time is 40 s.
[0113] The formation is a longitudinal line: A is the leading target, B is in the center, and C is the trailing target. Based on the dynamic characteristics of the formation spacing, the flight is divided into four typical stages, during which a trained agent interacts with the group targets: the fully resolved observation period (0-10 s), the group transition period (10-20 s), the critical maintenance period (20-30 s), and the group reconstruction period (30-40 s). Figures 4-6 show the waveform scheduling decision process under continuous interaction between the agent and the targets, comparing the traditional selective segmented fixed waveform method (SFixP), the criterion methods (Min-MSE, Max-MI), and the method proposed in this invention (DRL).
[0114] Table 1 compares the average root mean square error of distance ($\mathrm{ARMSE}_{loc}$), the average root mean square error of velocity ($\mathrm{ARMSE}_{vol}$), and the decision-making efficiency of the methods. In typical scenarios, the proposed method has significant advantages over the criterion method in distance tracking error, velocity tracking error, and inference time; compared with the segmented fixed waveform method, it has advantages in distance and velocity tracking error, but its inference time is longer.
[0115] Table 1: Comparison of Tracking Accuracy and Computation Time
[0116]
[0117] This invention addresses the technical bottlenecks of traditional scheduling methods in group target tracking, namely poor waveform dynamic adaptability and difficulty in balancing tracking accuracy and resolution. It proposes a waveform adaptive scheduling scheme based on deep reinforcement learning, which possesses significant technical advantages and practical value.
[0118] Outstanding dynamic adaptation capability: By constructing a temporal state space that integrates the motion characteristics and spatial distribution of the swarm, and combining it with the LSTM-enhanced DDPG framework, the method achieves real-time dynamic optimization of waveform parameters such as pulse width, bandwidth, and pulse repetition period. This matches the needs of different formation evolution stages such as target merging, splitting, and critical resolution, solving the problem that traditional fixed parameters and limited rule bases cannot cope with dynamic formation changes.
[0119] Tracking performance optimization: A composite reward function integrates the accuracy reward, resolution reward, and miscorrelation penalty. During the merging phase, the bandwidth is actively reduced to suppress track miscorrelation, and during the splitting phase, the resolution is adaptively improved. Compared with the criterion method, the root mean square errors of distance and velocity are reduced by 9.51% and 5.08%, respectively. The error reduction is even larger compared with the segmented fixed waveform, achieving a global balance between tracking accuracy, resolution, and stability.
[0120] Highly practical for engineering applications: On the same computing platform, the algorithm computation time is reduced by 79.7% compared with the criterion method, which meets the real-time scheduling requirements of radar while ensuring high performance and gives the method wide applicability and promotion value.
[0121] The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of the present invention should be included within the protection scope of the present invention.
Claims
1. A waveform scheduling method for swarm target tracking based on deep reinforcement learning, characterized in that it comprises the following steps: Step 1: Construct the state space and action space of the group target tracking waveform scheduling agent, where the state space characterizes the observation state of the group targets and the action space consists of a set of adjustable radar transmission waveform parameters; Step 2: Construct the neural network of the agent, including an Actor neural network for outputting waveform parameters based on the observed state and a Critic neural network for evaluating the value of the state after the action is performed; Step 3: Construct the reward and punishment mechanism for the agent; Step 4: Input the current group target observation state into the Actor neural network to generate an action that determines the radar transmission waveform parameters at the next moment; Step 5: Perform the action, transmit the corresponding radar waveform, and receive the target echo to obtain the new observation state; Step 6: Calculate the reward value based on the new observation state and the reward / penalty mechanism, and use the reward value to update the parameters of the Actor neural network and the Critic neural network.
2. The method according to claim 1, characterized in that, in Step 1: the state space is constructed from the raw radar observation data and specifically includes two types of temporal features: Track state sequence: the tracked track measurement points of N consecutive frames (frames k-N+1 to k), describing the spatial distance between the target and the radar, in meters (m). Intra-group relative position sequence: the minimum radial spacing between points within the gate in N consecutive frames (frames k-N+1 to k), describing the evolution of the formation's tightness, in meters (m). The state $s_k$ of frame k (time k) is defined as the sequence over these N historical frames; the action $a_k$ at time k is generated from the state $s_k$; the action space is defined as the set of transmitted waveform parameters at the current moment, and the agent's action is composed of the pulse width $\tau_k$, bandwidth $B_k$, and pulse repetition period $T_k$ at time k: $a_k = [\tau_k, B_k, T_k]$.
3. The method according to claim 1, characterized in that, in Step 2: the Actor neural network outputs waveform parameters based on the observed state. Its input is the state $s_k$ at time k. The network consists of a long short-term memory (LSTM) layer and a fully connected (FCC) layer. After a ReLU activation function, three prediction heads branch off. The three heads share the same structure, each containing an FCC layer and a sigmoid layer, and output the three waveform actions: pulse width $\tau_k$, bandwidth $B_k$, and pulse repetition period $T_k$.
4. The method according to claim 1, characterized in that, in Step 2: the Critic neural network evaluates the Q-value of the state-action pair formed by the observed state and the action output by the Actor. It consists of a long short-term memory (LSTM) layer and a fully connected (FCC) layer. After a ReLU activation function, the Q-value is output by an FCC layer.
5. The method according to claim 1, characterized in that, in Step 3, the reward and punishment mechanism evaluates an action with the following terms: The tracking accuracy reward evaluates how accurately the scheduling waveform generated by the Actor neural network tracks the target; the random measurement error of the waveform is described using the Cramér-Rao lower bound of the waveform: ; where sign is the sign function, and the waveform entropy at time k is: ; where the waveform transmitted at time k determines the random measurement error, and the covariance matrix at time k is estimated by Kalman filtering using that measurement error; The target discrimination reward describes whether the scheduling waveform generated by the Actor neural network can resolve the targets: ; ; where d is a random variable describing the spacing between the two closest targets in the group, $\Phi$ is the cumulative distribution function of the standard normal distribution, the resolvable range cell is determined by the bandwidth $B_k$, and c is the speed of light in a vacuum; The target misassociation penalty penalizes the mixing of target tracks that occurs when the scheduling waveform generated by the Actor neural network only just reaches the critical resolution: ; The conditions for triggering this penalty are: .
6. The method according to claim 1, characterized in that Step 4 specifically includes: the current observation state of the group targets is input into the Actor neural network and propagated forward according to the following two formulas: ; ; This generates the action $a_k$ that determines the radar transmit waveform parameters at the next moment, composed of the pulse width $\tau_k$, bandwidth $B_k$, and pulse repetition period $T_k$; where $h_k$ denotes the temporal features that the Actor network extracts with the LSTM, which are fed into a fully connected network to generate the corresponding action $a_k$, and the two sets of parameters in the formulas are the trained LSTM network parameters and the trained FCC network parameters, respectively.
7. The method according to claim 1, characterized in that Step 5 specifically includes: the action generated by the Actor neural network is executed to produce the corresponding radar waveform. The random ranging error of this waveform is described as follows: ; where c is the speed of light, Snr is the signal-to-noise ratio of a single pulse echo, $\tau$ is the pulse width, $k_b$ is the frequency modulation slope, $f_c$ is the carrier frequency, T is the pulse repetition period, and $N_p$ is the number of accumulated pulses; After the action is performed and the waveform transmitted, the reward $r_k$ received is: .
8. The method according to claim 1, characterized in that Step 6 specifically includes: based on the new observation state and the reward / penalty mechanism, update the parameters of the Actor neural network and the Critic neural network, and simultaneously update the target network parameters using the following formula: ; where the soft-update hyperparameter controls how far the target Actor and Critic network parameters move toward the online Actor and Critic network parameters at each update.