A TD3-based optimization method for extended target tracking in non-stationary noise environments

By combining a star-convex stochastic hypersurface model and an unscented Kalman filter with a TD3-based method, the noise switching probability is dynamically optimized, addressing both the accuracy and the real-time performance of extended target tracking in non-stationary noise environments.

CN120448828B · Active · Publication Date: 2026-06-19 · LANZHOU UNIVERSITY OF TECHNOLOGY

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
LANZHOU UNIVERSITY OF TECHNOLOGY
Filing Date
2025-04-25
Publication Date
2026-06-19

AI Technical Summary

Technical Problem

Existing technologies struggle to achieve high-precision, real-time tracking of extended targets in non-stationary heavy-tailed noise environments. Traditional methods are ill-suited to complex dynamic environments, especially when noise characteristics change drastically.

Method used

By employing a TD3-based approach, combining a star-convex stochastic hypersurface model and an unscented Kalman filter, and dynamically optimizing the noise switching probability, the state and shape estimation of the extended target are optimized using a Markov decision process and a reward function, thereby achieving accurate tracking of the extended target.

Benefits of technology

It significantly improves the tracking accuracy and real-time performance of extended targets in non-stationary noise environments, can adaptively switch noise distribution, capture the global and local characteristics of the target, and provide an efficient and reliable tracking solution.

✦ Generated by Eureka AI based on patent content.


Abstract

This invention relates to a TD3-based optimization method for extended target tracking in non-stationary noise environments, belonging to the fields of target tracking and noise modeling technology. The noise switching probability is modeled as the action variable of the reinforcement learning agent, and a Gaussian-Student's t mixture distribution model is used to switch dynamically between Gaussian noise and heavy-tailed noise, thereby enhancing the adaptability of the target tracking system to complex noise scenarios. For geometric modeling, a star-convex stochastic hypersurface model is employed to describe the motion state and geometry of the extended target, and it is combined with an unscented Kalman filter whose key filtering parameters are adjusted dynamically, significantly improving tracking accuracy and robustness. This invention can significantly reduce target state estimation errors in non-stationary noise environments while maintaining real-time performance and stability, making it suitable for fields such as autonomous driving, intelligent monitoring, and military defense.

Description

Technical Field

[0001] This invention relates to the fields of target tracking and noise modeling technology, and in particular to an extended target tracking optimization method based on TD3 in non-stationary noise environments.

Background Technology

[0002] Extended target tracking is a core task in modern sensor data processing, widely used in fields such as autonomous driving, intelligent surveillance, and military defense. Unlike traditional point target tracking methods, extended target tracking requires not only estimating the target's motion state but also modeling and tracking its geometry. However, traditional methods face significant challenges when dealing with complex dynamic environments, especially in non-stationary, heavy-tailed noise environments where noise characteristics change drastically and exhibit heavy-tailed distributions. This places higher demands on the robustness and real-time performance of the tracking system.

[0003] In existing technologies, stochastic hypersurface models, as a classic geometric modeling method, can describe target boundaries through parameterization. However, this method typically assumes that noise follows a Gaussian distribution, making it unsuitable for non-stationary noise environments. In recent years, researchers have proposed Gaussian-Student's t mixture distribution models to combine the accuracy of the Gaussian distribution with the robustness of the Student's t distribution; however, their optimization methods often rely on variational Bayesian derivations, resulting in high computational complexity and difficulty in meeting real-time requirements.

[0004] Meanwhile, deep reinforcement learning, as an efficient optimization method, has been widely applied to dynamic decision-making problems. Among them, the TD3 algorithm significantly improves the performance of reinforcement learning in continuous action spaces by introducing a dual-value network, delayed policy updates, and action smoothing mechanisms. However, current research mostly focuses on point target tracking tasks and has not fully incorporated the geometric characteristics and dynamic noise modeling requirements of extended target tracking. Therefore, this invention proposes an optimization method for extended target tracking in non-stationary noise environments based on TD3.

Summary of the Invention

[0005] The purpose of this invention is to provide an extended target tracking optimization method based on TD3 in non-stationary noise environments. By dynamically optimizing the noise switching probability through reinforcement learning, and combining a star-convex stochastic hypersurface model and an unscented Kalman filter, the extended target's state and shape can be accurately estimated and tracked.

[0006] To achieve the above objectives, the present invention provides the following solution:

[0007] An extended target tracking optimization method based on TD3 in non-stationary noise environments includes:

[0008] S1. Based on the Gaussian-Student t mixture distribution model, the noise switching probability optimization problem is constructed as a Markov decision process, and the measurement information of the extended target is obtained.

[0009] S2. Based on the measurement information, perform star-convex random hypersurface modeling on the extended target, and combine it with an unscented Kalman filter to estimate the state of the extended target, obtain the estimated position and covariance matrix, and calculate the trace, centroid position error and shape matching error of the covariance matrix.

[0010] S3. Input the trace, centroid position error, and shape matching error of the covariance matrix into the pre-constructed reward function to obtain the reward value;

[0011] S4. Update the policy network and value network in the TD3 algorithm based on the reward value, and dynamically optimize the noise switching probability in combination with the policy action generated in the current state;

[0012] S5. Feed the noise switching probability back to the extended target state estimation in S2, iterate S2-S4 until the preset iteration target is reached, obtain the optimal noise switching probability learning strategy, and based on the optimal noise switching probability learning strategy, track the extended target in a non-stationary noise environment by optimizing the extended target state estimation in real time.

[0013] Optionally, in S1, the noise switching probability optimization problem is constructed as a Markov decision process <S,A,P,R> based on the Gaussian-Student t mixture distribution model, where S is the state space, consisting of the trace of the noise covariance and the noise switching probability; A is the action space, consisting of the decision set; P is the state transition probability, consisting of the dynamic update of the noise switching probability and the state evolution of the filter; and R is the reward function, consisting of the trace of the noise covariance matrix, the shape matching error, and the centroid position error.

[0014] Optionally, the reward function is:

[0015] $r_t = \alpha\left(\mathrm{Tr}(C_k) - \mathrm{Tr}(C_{k+1})\right) + \beta\,(1 - \mathrm{IoU\ Loss}) + \gamma\,(1 - \mathrm{RMSE})$;

[0016] Where $\mathrm{Tr}(C_k)$ represents the trace of the covariance matrix at time k; IoU Loss represents the shape matching error; RMSE represents the centroid position error; α, β, and γ are the weights of the reward terms; and $r_t$ is the reward value.

[0017] Optionally, in step S2, modeling the extended target using a star-convex random hypersurface based on the measurement information includes:

[0018] A measurement source model is established based on the spatial distribution assumption, and several measurement sources for the extended target are generated using the measurement source model.

[0019] Several measurements of the extended target are generated by mixing the measurement sources with sensor noise;

[0020] Based on the measurement sources and the measurements, star-convex random hypersurface modeling is used to obtain the observation model of the star-convex extended target.

[0021] Optionally, the observation model for the star-convex extended target is:

[0022]

[0023] Where h(·) is the pseudo-measurement equation; $x_k$ is the state of the extended target at time k; $\upsilon_{k,l}$ is the measurement noise of the l-th sampling point at time k; $s_{k,l}$ is the scale factor of the l-th sampling point at time k; and $z_{k,l}$ is the measurement of the l-th sampling point at time k. The remaining terms are the Fourier coefficients, the shape parameter vector of the extended target at time k, the direction vector, the centroid $m_k$ of the extended target at time k, and the angle between the x-axis and the vector from the centroid to the measurement source at time k.

[0024] Optionally, in step S4, updating the policy network and value network in the TD3 algorithm based on the reward value, and dynamically optimizing the noise switching probability in conjunction with the policy action generated from the current state, includes:

[0025] Initialize the target state and the network parameters of the TD3 algorithm;

[0026] Based on the current policy, the switching probability at time k is input into the policy network to obtain the action at time k+1;

[0027] Based on the switching probability at time k and the action at time k+1, the switching probability at time k+1 is obtained.

[0028] Substitute the action and switching probability at time k+1 into the filter, update the state vector and covariance matrix, calculate the reward value and execute the action to obtain the noise switching probability at time k+1.

[0029] Optionally, updating the state vector and covariance matrix includes:

[0030]

[0031] Where $a_{k+1}$ is the action at time k+1; the predicted state vector is the one obtained based on the action $a_{k+1}$ at time k+1; and $\omega_{k+1,j}$ is the weight coefficient of the j-th predicted state at time k+1, that is, the weight of each sigma point in the UKF. The other quantities in the update are the state prediction of the j-th sigma point at time k+1, the error between the predicted state at the j-th sigma point and the weighted average predicted state, the covariance matrix of the state and measurement at time k+1, and the error between the predicted state vector and the actual target state. $x_{k+1|k}$ is the prior state estimate at time k+1; $P_{k+1|k}$ is the prior covariance matrix at time k+1; $x_{k+1|k+1}(a_{k+1})$ is the posterior state estimate at time k+1; K is the Kalman gain; $P_{k+1|k+1}(a_{k+1})$ is the posterior covariance matrix at time k+1; and the final term is the prediction covariance matrix at time k+1.
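The quantities listed above match the usual unscented Kalman filter measurement update. The following is a minimal sketch in plain Python, not the patent's own formula (which is not reproduced in this text). It assumes a scalar pseudo-measurement and sigma-point weights that sum to 1; the function name `ukf_update` and all argument names are hypothetical.

```python
def ukf_update(sigma_states, sigma_meas, weights, prior_cov, z, meas_var):
    """One UKF measurement update for a scalar (pseudo-)measurement.

    sigma_states: state sigma points, each a list of floats
    sigma_meas:   predicted scalar measurement for each sigma point
    weights:      sigma-point weights (assumed to sum to 1)
    prior_cov:    prior state covariance P_{k+1|k} as a list of lists
    z, meas_var:  actual measurement and assumed measurement noise variance
    """
    n = len(sigma_states[0])
    # Weighted means of the predicted states and predicted measurements.
    x_mean = [sum(w * x[i] for w, x in zip(weights, sigma_states)) for i in range(n)]
    z_mean = sum(w * zj for w, zj in zip(weights, sigma_meas))
    # Innovation variance and state-measurement cross-covariance.
    p_zz = sum(w * (zj - z_mean) ** 2 for w, zj in zip(weights, sigma_meas)) + meas_var
    p_xz = [sum(w * (x[i] - x_mean[i]) * (zj - z_mean)
                for w, x, zj in zip(weights, sigma_states, sigma_meas))
            for i in range(n)]
    # Kalman gain, posterior state and posterior covariance.
    gain = [c / p_zz for c in p_xz]
    x_post = [x_mean[i] + gain[i] * (z - z_mean) for i in range(n)]
    p_post = [[prior_cov[i][j] - gain[i] * p_zz * gain[j] for j in range(n)]
              for i in range(n)]
    return x_post, p_post
```

In the method described here, the sigma points and the measurement noise statistics would depend on the switching probability chosen by the agent; that dependence is left outside this sketch.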

[0032] Optionally, the updates to the value network and policy network during the iterative process include:

[0033] Calculate the target value y i :

[0034] $y_i = R_j + \gamma \min\left(Q_1(s_{k+1,j}, a'_{k+1,j} \mid \theta'_1),\ Q_2(s_{k+1,j}, a'_{k+1,j} \mid \theta'_2)\right)$;

[0035] Updating the value network:

[0036]

[0037] Update policy network:

[0038]

[0039] Soft update:

[0040]

[0041] Where $y_i$ is the target action-value; $R_j$ is the immediate reward at the current moment; γ is the discount factor; Q1 and Q2 are the estimates obtained from the dual value networks; $s_{k+1,j}$ and $a'_{k+1,j}$ are the state and action at the next moment; θ1 and θ2 are the current value network parameters; and φ is the current policy network parameter. Each of the two value networks has its own loss function, evaluated at the current state $s_{k,j}$ and action $a_{k,j}$. The policy update uses the gradient of the policy objective, which is formed from the gradient of the Q1 network and the gradient of the policy network, where $a = \mu(s_{k,j} \mid \phi)$ is the action output by the policy network in state $s_{k,j}$. θ'1 and θ'2 are the parameters of the target value networks, φ' is the parameter of the target policy network, and τ is the update coefficient.
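The target value in [0034] and the soft update with coefficient τ can be illustrated with scalar stand-ins for the network outputs. This is a sketch only: the mean-squared-error critic loss and the Polyak form of the soft update are the standard TD3 choices and are assumed here, because the patent's own update formulas are not reproduced in this text. The function names are hypothetical.

```python
def td3_target(reward, q1_next, q2_next, gamma=0.99):
    """Clipped double-Q target: y = R + gamma * min(Q1', Q2')."""
    return reward + gamma * min(q1_next, q2_next)


def critic_loss(targets, q_values):
    """Mean squared error between targets y and one critic's estimates
    (assumed loss form)."""
    return sum((y - q) ** 2 for y, q in zip(targets, q_values)) / len(targets)


def soft_update(target_params, online_params, tau=0.005):
    """Polyak averaging (assumed form): theta' <- tau*theta + (1 - tau)*theta'."""
    return [tau * p + (1.0 - tau) * tp
            for tp, p in zip(target_params, online_params)]
```

Taking the minimum of the two target critics is what limits value overestimation; the small τ keeps the target networks changing slowly between updates.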

[0042] The beneficial effects of this invention are as follows:

[0043] This invention proposes an optimization method for extended target tracking in non-stationary noise environments based on TD3, which can significantly improve the tracking accuracy and real-time performance of extended targets in such environments. This invention dynamically optimizes the noise switching probability, adaptively switching between Gaussian and heavy-tailed distributions. It combines this with a star-convex stochastic hypersurface model to capture the global and local characteristics of the target, and uses a reward function to optimize the trace value of the covariance matrix, thereby achieving joint optimization of the extended target's state and shape. This invention fully leverages the advantages of the TD3 algorithm and the stochastic hypersurface model, providing an efficient and reliable solution for complex dynamic target tracking in fields such as autonomous driving and intelligent surveillance.

Attached Figure Description

[0044] To more clearly illustrate the technical solutions in the embodiments of the present invention or the prior art, the drawings used in the embodiments will be briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0045] Figure 1 is a flowchart of an extended target tracking optimization method based on TD3 in a non-stationary noise environment according to an embodiment of the present invention;

[0046] Figure 2 is a structure and flow diagram of the TD3-based star-convex extended target tracking algorithm according to an embodiment of the present invention;

[0047] Figure 3 is a schematic diagram of the TD3 neural network structure according to an embodiment of the present invention;

[0048] Figure 4 shows target trajectory tracking diagrams according to an embodiment of the present invention, wherein (a) is a single extended target tracking diagram, (b) is a complete tracking result diagram of the target trajectory by the present method and the comparison method, and (c) is a magnified view of the target trajectory tracking details by the present method and the comparison method;

[0049] Figure 5 shows enlarged views of the target trajectory tracking in three stages according to an embodiment of the present invention, wherein (a) is the start stage, (b) is the middle stage, and (c) is the end stage;

[0050] Figure 6 shows the IoU curves of the present method and the comparison method according to an embodiment of the present invention;

[0051] Figure 7 is a comparison chart of the centroid RMSE of the target estimates of the present method and the comparison method according to an embodiment of the present invention;

[0052] Figure 8 is a reward curve diagram of the present method according to an embodiment of the present invention.

Detailed Implementation

[0053] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.

[0054] To make the above-mentioned objects, features and advantages of the present invention more apparent and understandable, the present invention will be further described in detail below with reference to the accompanying drawings and specific embodiments.

[0055] This embodiment provides an extended target tracking optimization method based on TD3 in non-stationary noise environments. As shown in Figure 1, the method includes:

[0056] S1. Based on the Gaussian-Student t mixture distribution model, the noise switching probability optimization problem is constructed as a Markov decision process, and the measurement information of the extended target is obtained. The constructed Markov decision process is <S,A,P,R>, where S is the state space, consisting of the trace of the noise covariance and the noise switching probability; A is the action space, consisting of the decision set; P is the state transition probability, consisting of the dynamic update of the noise switching probability and the state evolution of the filter; and R is the reward function, consisting of the trace of the noise covariance matrix, the shape matching error, and the centroid position error.

[0057] S2. Based on the measurement information, perform star-convex random hypersurface modeling on the extended target, and combine it with an unscented Kalman filter to estimate the state of the extended target, obtain the estimated position and covariance matrix, and calculate the trace, centroid position error and shape matching error of the covariance matrix.

[0058] Specifically, modeling the extended target using a star-convex random hypersurface based on the measurement information includes:

[0059] A measurement source model is established based on the spatial distribution assumption, and several measurement sources for the extended target are generated using the measurement source model.

[0060] Several measurements of the extended target are generated by mixing the measurement sources with sensor noise;

[0061] Based on the measurement sources and the measurements, star-convex random hypersurface modeling is used to obtain the observation model of the star-convex extended target.

[0062] S3. Input the trace, centroid position error, and shape matching error of the covariance matrix into the pre-constructed reward function to obtain the reward value;

[0063] S4. Update the policy network and value network in the TD3 algorithm based on the reward value, and dynamically optimize the noise switching probability in combination with the policy action generated by the current state, specifically including:

[0064] Initialize the target state and the network parameters of the TD3 algorithm;

[0065] Based on the current policy, the switching probability at time k is input into the policy network to obtain the action at time k+1;

[0066] Based on the switching probability at time k and the action at time k+1, the switching probability at time k+1 is obtained.

[0067] Substitute the action and switching probability at time k+1 into the filter, update the state vector and covariance matrix, calculate the reward value and execute the action to obtain the noise switching probability at time k+1.

[0068] S5. Feed the noise switching probability back to the extended target state estimation in S2, iterate S2-S4 until the preset iteration target is reached, obtain the optimal noise switching probability learning strategy, and based on the optimal noise switching probability learning strategy, track the extended target in a non-stationary noise environment by optimizing the extended target state estimation in real time.

[0069] Specifically, this embodiment dynamically optimizes the noise switching probability, adaptively switching between Gaussian and heavy-tailed distributions. It combines this with a star-convex stochastic hypersurface model to capture the global and local characteristics of the target, and uses a reward function to optimize the trace value of the covariance matrix, thereby achieving joint optimization of the extended target state and shape. In summary, this invention fully leverages the advantages of the TD3 algorithm and the stochastic hypersurface model, providing an efficient and reliable solution for complex dynamic target tracking in fields such as autonomous driving and intelligent surveillance.

[0070] The measurement information used for extended target tracking optimization needs to be obtained from sensors. First, a sensor-based extended target tracking database is established, storing the measurement data corresponding to extended targets detected by the sensors at different time points. Based on this measurement information, the geometry of the extended target is modeled using a star-convex stochastic hypersurface model, and the noise characteristics are described using a Gaussian-Student t mixture distribution model. Then, the TD3 algorithm is used to dynamically optimize the noise switching probability, obtaining the optimal noise distribution parameters at each time point, as well as estimates of the extended target's state and shape.

[0071] During the optimization process, the TD3 algorithm's agent determines whether the current policy is better than the previous policy based on the current reward value, and dynamically updates the agent's policy and value function accordingly. After multiple iterations, the reinforcement learning agent learns the optimal noise switching policy, thereby achieving high-precision tracking and estimation of extended targets in complex dynamic environments. Finally, using optimized noise and geometric modeling parameters, the contour estimation accuracy, centroid estimation accuracy, and tracking stability of the extended target tracking are optimized.

[0072] The multi-feature state of the extended target at time k, comprising motion and shape parameters, can be represented by a state vector in which $m_k$ is the centroid of the extended target at time k, $M_k$ represents the motion parameters, and $p_k$ is the shape contour parameter vector of the target. The equation of motion of the extended target is then:

[0073] $x_{k+1} = f_k(x_k) + w_k \quad (1)$;

[0074] Where $f_k(\cdot)$ is the system state evolution mapping and $w_k$ is the process noise.

[0075] As one implementation method, the measurement modeling process for the extended target can be broken down into two steps. First, a measurement source model is established based on reasonable spatial distribution assumptions, and this model is used to generate multiple measurement sources for the extended target. Second, the measurement sources are mixed with sensor noise to generate multiple measurements of the target. The sensor measurement model can be described as follows:

[0076] $z_{k,l} = y_{k,l} + \upsilon_{k,l} \quad (2)$;

[0077] Where $y_{k,l}$ is the measurement source location of the l-th sampling point of the extended target at time k, $z_{k,l}$ is the measurement of the extended target, and $\upsilon_{k,l}$ is the measurement noise (l = 1, ..., N).

[0078] This embodiment uses the star-convex extended target model algorithm as the target tracking algorithm. A star-convex shape is defined as follows: if every point on the line segment from any point in a set S to the centroid still belongs to the set, then the shape formed by the set S is a star-convex shape. The shape is assumed to be expressed in parametric form by the set:

[0079]

[0080] Where the scale factor $s_{k,l} \in [0,1]$ can be regarded as multiplicative noise, assumed to be Gaussian in this embodiment, which reflects the relative distance from the measurement source to the centroid; $\phi_k \in [0, 2\pi]$ is the angle between the x-axis and the vector from the centroid $m_k$ to the measurement source; and $e(\phi_k) = [\cos(\phi_k), \sin(\phi_k)]^T$ is the direction vector. The shape parameter vector can be described by the coefficients of an $N_f$-order Fourier series expansion of the radial function: the low-frequency harmonic components describe the basic overall outline of the star-convex shape, while the high-frequency harmonic components represent its local contour features.

[0081]

[0082] Where $R(\phi_k)$ is the Fourier coefficient, and $a_k$ and $b_k$ are the parameters that make up the target shape parameter vector.
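The Fourier description of the radial function can be sketched as follows. The sketch assumes the common truncated series $r(\phi) = a_0/2 + \sum_{j=1}^{N_f} (a_j \cos j\phi + b_j \sin j\phi)$ and a parameter ordering $[a_0, a_1, b_1, \ldots]$; both are assumptions, since the patent's own expansion is not reproduced in this text, and the function names are hypothetical.

```python
import math


def fourier_row(phi, n_f):
    """Row R(phi) = [1/2, cos(phi), sin(phi), ..., cos(n_f*phi), sin(n_f*phi)]."""
    row = [0.5]
    for j in range(1, n_f + 1):
        row += [math.cos(j * phi), math.sin(j * phi)]
    return row


def radius(phi, shape_params):
    """Radial extent r(phi) = R(phi) . p for p = [a0, a1, b1, ..., a_nf, b_nf]."""
    n_f = (len(shape_params) - 1) // 2
    return sum(c * p for c, p in zip(fourier_row(phi, n_f), shape_params))


def boundary_point(centroid, phi, shape_params, scale=1.0):
    """Point m + s * r(phi) * e(phi): scale=1 is on the boundary, scale<1 inside."""
    r = scale * radius(phi, shape_params)
    return (centroid[0] + r * math.cos(phi), centroid[1] + r * math.sin(phi))
```

With only $a_0$ non-zero the shape is a circle of radius $a_0/2$; adding higher-order coefficients deforms it, which corresponds to the low-frequency and high-frequency roles described above.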

[0083] If $m_k$ denotes the centroid of the extended target at time k, then the measurement source $y_{k,l}$ at time k can be expressed as:

[0084]

[0085] The angle between the x-axis and the vector from the target centroid to the measurement source is $\phi_{k,l} = \angle(y_{k,l} - m_k)$. The boundary of the shape satisfies:

[0086]

[0087] From the implicit description of the star-convex curve, we can obtain:

[0088] $g(y_{k,l}, x_k) = \|y_{k,l} - m_k\|^2 - r^2 \quad (9)$;

[0089] From this, the scaled boundary is obtained:

[0090]

[0091] Combining this with the measurement equation gives the measurement equation for the star-convex extended target:

[0092]

[0093] After rearranging, $\phi_{k,l}$ remains unknown, so it is generally replaced by the angle between the x-axis and the vector formed by a point estimate of the target's centroid position at the current moment and the measurement $z_{k,l}$. Combining these gives:

[0094]

[0095] The pseudo-measurement equation is then:

[0096]

[0097] The observation model of the star-convex extended target can be described using formula (13); that is, the pseudo-measurement equation h(·) establishes the relationship among the extended target state $x_k$, the measurement noise $\upsilon_{k,l}$, the scale factor $s_{k,l}$, and the measurement $z_{k,l}$.

[0098] As one implementation method, the TD3 (twin delayed deep deterministic policy gradient) algorithm is used to optimize the noise switching probability. Through a continuous feedback process, it gradually learns an optimal noise switching strategy, thereby achieving accurate tracking and estimation of the extended target state and shape.

[0099] Based on the Markov decision process framework, this embodiment formulates the noise modeling and parameter optimization problem in extended target tracking as a Markov decision process, represented by the quadruple <S,A,P,R>. The relevant elements are defined first:

[0100] 1) The state space S consists of two main variables: the trace of the noise covariance, reflecting the non-stationary nature of the current noise; and the switching probability, which is the switching weight between the Gaussian and Student's t-distributions. Therefore, the state is defined as:

[0101]

[0102] 2) The action space A represents the set of decisions that the reinforcement learning agent can make. The agent's actions represent adjustments to the switching probabilities to adapt to the current noise characteristics, and are defined as:

[0103] $a_k \in [-\Delta\pi_{\max}, \Delta\pi_{\max}] \quad (15)$;

[0104] The action range is clipped to ensure that the switching probability always stays within [0,1].

[0105] 3) P represents the state transition probability. The system state transition at time step k describes the influence of the current state and action on the state at the next time step. State transition includes the dynamic update of the switching probability and the evolution of the filter state:

[0106] $\pi_{k+1} = \pi_k + a_k \quad (16)$;

[0107]

[0108] Where $F_k$ is the state transition matrix and $Q_k$ is the process noise covariance matrix.
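The switching-probability part of the transition, formulas (15) and (16) together with the clipping note in [0104], can be sketched as a single function. The bound `delta_max` stands for $\Delta\pi_{\max}$; its default value of 0.1 is an assumption chosen only for illustration, and the function name is hypothetical.

```python
def step_switching_probability(pi_k, action, delta_max=0.1):
    """Apply pi_{k+1} = pi_k + a_k with the action limited to
    [-delta_max, delta_max] and the result kept inside [0, 1]."""
    a = max(-delta_max, min(delta_max, action))
    return max(0.0, min(1.0, pi_k + a))
```

For example, an action of 0.3 applied at $\pi_k = 0.5$ is first limited to 0.1, so the probability moves to 0.6 rather than 0.8. The filter-state part of the transition is not covered by this sketch.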

[0109] 4) R represents the reward function that maps parameter states and agent action choices to rewards. It is used to measure the quality of actions and aims to guide the agent to learn action strategies that improve target tracking accuracy. This embodiment designs a composite reward function, considering the following three factors:

[0110] ① The trace of the noise covariance matrix: reflects the uncertainty of the target state. A decrease in covariance indicates a reduction in estimation uncertainty, thus giving a positive reward;

[0111] ② IoU loss: the matching error between the true target shape and the estimated shape. The higher the IoU, the greater the reward;

[0112] ③ RMSE of the centroid: The error between the estimated centroid of the target and the true centroid. It is used to measure the accuracy of the target state estimation. The smaller the error, the greater the reward.

[0113] After normalizing each reward item, the mathematical form of the reward function is as follows:

[0114] $r_t = \alpha\left(\mathrm{Tr}(C_k) - \mathrm{Tr}(C_{k+1})\right) + \beta\,(1 - \mathrm{IoU\ Loss}) + \gamma\,(1 - \mathrm{RMSE}) \quad (18)$;

[0115] Where $\mathrm{Tr}(C_k)$ is the trace of the covariance matrix at time k; IoU Loss represents the shape matching loss; RMSE represents the centroid matching error; and α, β, and γ are the weights of the reward terms.
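Formula (18) translates directly into code. The sketch below assumes that the three inputs have already been normalized, as [0113] states, and uses unit weights as placeholder defaults; the weight values and the function name are assumptions, not values given in the source.

```python
def reward(trace_prev, trace_next, iou_loss, rmse, alpha=1.0, beta=1.0, gamma=1.0):
    """r = alpha*(Tr(C_k) - Tr(C_{k+1})) + beta*(1 - IoU loss) + gamma*(1 - RMSE).

    All inputs are assumed to be normalized before the call.
    """
    return (alpha * (trace_prev - trace_next)
            + beta * (1.0 - iou_loss)
            + gamma * (1.0 - rmse))
```

A drop in the covariance trace, a smaller IoU loss, and a smaller centroid RMSE each raise the reward, which matches the three factors listed above.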

[0116] To address the optimization problem in the continuous action space within the TD3 algorithm, a deep reinforcement learning network based on the Actor-Critic architecture was designed, comprising a policy network and a value network. To improve training stability, TD3 introduces a dual-value network mechanism, soft updates to the target network, and an action noise smoothing strategy. This embodiment implements the following optimization design for non-stationary noise scenarios.

[0117] The policy network generates the action $a_k$ from the current state $s_k$; the action is used to interact with the environment, that is:

[0118] $a_k = \mu(s_k \mid \theta^{\mu}) \quad (19)$;

[0119] Where μ(·) represents the function mapping of the policy network and $\theta^{\mu}$ denotes the network parameters. The optimization objective of the policy network is to learn the optimal action by maximizing the Q-value of the Critic network:

[0120] $L_a = -\mathbb{E}[Q_1(s,a)] \quad (20)$;

[0121] The action-value network evaluates the value Q(s,a) of the state-action pair (s,a).

[0122] TD3 employs dual value networks Q1 and Q2 to mitigate the problem of overestimation of value:

[0123] $Q(s_k, a_k \mid \theta^{Q}) = f(s_k, a_k \mid \theta^{Q}) \quad (21)$;

[0124] The TD3 network adopts a three-layer fully connected architecture, as shown in Figure 3.
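As an illustration of a three-layer fully connected policy network, the sketch below runs a forward pass in plain Python. The hidden width of 16, the ReLU and tanh activations, the random initialization, and the output scaling by `delta_max` are all assumptions made for this example; Figure 3 is not reproduced in this text, so the actual layer sizes may differ.

```python
import math
import random


def init_layer(n_in, n_out, rng):
    """Random weights and zero biases for one fully connected layer."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def dense(x, layer, activation):
    """Apply one fully connected layer followed by an activation."""
    weights, biases = layer
    return [activation(sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(weights, biases)]


def relu(v):
    return max(0.0, v)


def actor_forward(state, layers, delta_max=0.1):
    """Map the state [trace, pi] to an action in [-delta_max, delta_max]."""
    h = dense(state, layers[0], relu)
    h = dense(h, layers[1], relu)
    out = dense(h, layers[2], math.tanh)
    return delta_max * out[0]


rng = random.Random(0)
layers = [init_layer(2, 16, rng), init_layer(16, 16, rng), init_layer(16, 1, rng)]
action = actor_forward([0.8, 0.5], layers)
```

Bounding the output with tanh and then scaling it is one simple way to keep the action inside the range in formula (15) without a separate clipping step.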

[0125] As one implementation method, this embodiment uses a dynamic optimization filter based on the TD3 algorithm to jointly estimate the state and shape of the star-convex extended target, as shown in Figure 2. The control process of the extended target tracking sensor based on the TD3 algorithm is described below.

[0126] This embodiment combines a reinforcement learning agent and a star-convex stochastic hypersurface model, utilizing the TD3 algorithm to dynamically adjust the noise switching probability, thereby improving the accuracy of extended target state estimation. The sensor control process includes sensor acquisition of extended target measurement information, recursive estimation of the target state based on pseudo-measurement equations, and dynamic optimization of the Gaussian-Student t mixture distribution through reinforcement learning. The agent dynamically adjusts the system noise modeling parameters and filter parameters based on reward function feedback, thereby enhancing the robustness and real-time performance of target tracking.

[0127] In complex dynamic environments, non-stationary thick-tailed noise poses a severe challenge to the tracking accuracy of extended targets. This embodiment combines a stochastic hypersurface model and unscented Kalman filtering, and uses the TD3 algorithm to dynamically optimize the switching probability of the mixed noise distribution, thereby achieving accurate estimation of the motion state and geometry of the extended target. The following describes the optimization of extended target tracking under non-stationary noise based on TD3.

[0128] Initialize the target state $(x_k, P_{k|k})$ and the TD3 network parameters $(\theta_1, \theta_2, \theta'_1, \theta'_2)$, then perform a one-step prediction using formula (1) to obtain $(x_{k+1|k}, P_{k+1|k})$.

[0129] The process noise and measurement noise of the extended target may be non-stationary, so a noise model is built to describe these characteristics. The switching probability π is dynamically optimized through reinforcement learning to adapt to changes in the non-stationary noise. During filtering, π tends to 1 when the noise is approximately Gaussian and tends to 0 when the noise is heavy-tailed.
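The switching behaviour described in this paragraph can be illustrated with a minimal sketch. It assumes scalar noise, a common scale `sigma` for both components, and the hypothetical function name `sample_mixture_noise`; none of these come from the embodiment, which only states that π selects between Gaussian and Student t noise.

```python
import numpy as np

def sample_mixture_noise(pi, size, sigma=1.0, nu=5.0, rng=None):
    """Draw scalar noise from a Gaussian-Student t mixture.

    With probability pi a sample comes from N(0, sigma^2); otherwise it
    comes from a Student t distribution with nu degrees of freedom,
    scaled by sigma (the shared scale is an assumption of this sketch).
    """
    rng = np.random.default_rng() if rng is None else rng
    use_gaussian = rng.random(size) < pi            # component indicator
    gaussian = rng.normal(0.0, sigma, size)         # light-tailed component
    heavy_tail = sigma * rng.standard_t(nu, size)   # heavy-tailed component
    return np.where(use_gaussian, gaussian, heavy_tail)
```

With `pi` close to 1 the samples are almost entirely Gaussian; with `pi` close to 0 they are almost entirely heavy-tailed, which matches the limiting behaviour stated above.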

[0130] Based on the current policy μ, the switching probability $s_k$ at time k is input into the policy network to obtain the action at time k+1:

[0131] $a_{k+1} = \mu_\theta(s_k) + N$ (22);

[0132] where N is the introduced Gaussian noise. In this embodiment, the output action is defined as an increment that adjusts the switching probability, which gives the switching probability $s_{k+1}(a_{k+1})$ at time k+1:

[0133] $s_{k+1}(a_{k+1}) = s_k + a_{k+1}$ (23);

[0134] where $s_{k+1}(a_{k+1}) = [G_k, \pi_k(a_{k+1})]^T$. The first component is the trace of the noise covariance during filtering, and the second component is the switching probability obtained after executing action $a_{k+1}$ at time k+1.
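A minimal sketch of equations (22) and (23) follows. It assumes the increment is applied only to the switching-probability component and that the result is clipped to [0, 1]; the clipping, the noise scale `noise_std`, and the function name `step_switching_probability` are illustrative assumptions, not part of the embodiment.

```python
import numpy as np

def step_switching_probability(pi_k, policy_output, noise_std=0.05, rng=None):
    """Apply eq. (22) and eq. (23) to the switching probability.

    policy_output stands in for the policy network output mu_theta(s_k).
    """
    rng = np.random.default_rng() if rng is None else rng
    # Eq. (22): action = policy output + Gaussian exploration noise N
    action = policy_output + rng.normal(0.0, noise_std)
    # Eq. (23): the action is an increment on the current switching probability
    pi_next = pi_k + action
    # Assumption: keep the result a valid probability
    return action, float(np.clip(pi_next, 0.0, 1.0))
```

For example, with `pi_k = 0.7`, a policy output of `0.1`, and `noise_std=0.0`, the next switching probability is approximately 0.8.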

[0135] From the switching probability $s_k$ at the current time k and the action $a_{k+1}$ selected by the policy network, the switching probability $s_{k+1}(a_{k+1})$ at time k+1 is obtained and substituted into the filter calculation. Suppose a set of sampling points, with corresponding weights $\omega_{k+1,j}$, is obtained from [x...] k+1|k, $a_k$, and $s_k$. The propagated sampling points are then defined as:

[0136]

[0137] Here, the pseudo-measurement is the one corresponding to the target sampling point under action $a_{k+1}$. The state vector and covariance matrix are then updated:

[0138]

[0139] The reward value $R_k$ is obtained from the reward function. The output action is then executed to obtain the switching probability $s_{k+1}$ at the next time step k+1. This completes one iteration. The resulting tuple of states and actions, $\{s_{k+1}(a_{k+1}), a_{k+1}, R_k, s_{k+1}\}$, is stored in the experience replay pool H, and the scenario is repeated; the number of loops is set to T. After the loop ends, n tuples are randomly sampled from the experience replay pool H to calculate the target value.
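The experience replay pool H described above can be sketched as a fixed-capacity buffer with uniform random sampling. The class name `ReplayPool`, the capacity, and the tuple layout `(state, action, reward, next_state)` are assumptions made for illustration.

```python
import random
from collections import deque

class ReplayPool:
    """Fixed-capacity experience replay pool with uniform sampling."""

    def __init__(self, capacity=10000):
        # The oldest tuples are discarded automatically once capacity is reached
        self.buffer = deque(maxlen=capacity)

    def store(self, state, action, reward, next_state):
        """Store one transition tuple."""
        self.buffer.append((state, action, reward, next_state))

    def sample(self, n):
        """Randomly select up to n stored tuples without replacement."""
        return random.sample(list(self.buffer), min(n, len(self.buffer)))

    def __len__(self):
        return len(self.buffer)
```

Sampling uniformly at random, rather than replaying transitions in order, reduces the correlation between consecutive training samples.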

[0140] The target value $y_i$ is calculated using the minimum of the dual value networks Q1 and Q2:

[0141] $y_i = R_j + \gamma \min\big(Q_1(s_{k+1,j}, a'_{k+1,j} \mid \theta'_1),\ Q_2(s_{k+1,j}, a'_{k+1,j} \mid \theta'_2)\big)$ (31);

[0142] The target action is:

[0143] $a'_{k+1,j} = \mu'(s_{k+1} \mid \varphi') + \varepsilon,\quad \varepsilon \sim \mathrm{clip}(\mathcal{N}(0, \sigma^2), -c, c)$ (32);
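Equations (31) and (32) can be combined into one small function. In this sketch the target policy and the two target value networks are passed in as plain Python callables, and the default values of `gamma`, `sigma`, and `c` are placeholders; both choices are assumptions made so the example stays self-contained.

```python
import numpy as np

def td3_target(reward, next_state, target_policy, target_q1, target_q2,
               gamma=0.99, sigma=0.2, c=0.5, rng=None):
    """Compute the clipped double-Q target value of eq. (31)."""
    rng = np.random.default_rng() if rng is None else rng
    # Eq. (32): smooth the target action with clipped Gaussian noise
    eps = np.clip(rng.normal(0.0, sigma), -c, c)
    next_action = target_policy(next_state) + eps
    # Eq. (31): take the smaller of the two target value estimates
    q_min = min(target_q1(next_state, next_action),
                target_q2(next_state, next_action))
    return reward + gamma * q_min
```

Taking the minimum of the two estimates is what limits the value overestimation mentioned for the dual value networks.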

[0144] The value network update adjusts the parameters of the two value networks Q1 and Q2 using their respective loss functions:

[0145]

[0146] The policy network is updated every d iterations, using the following gradient formula to update the parameter φ:

[0147]

[0148] Finally, the target network parameters θ'1, θ'2 and the policy network target parameter φ' are updated using a soft update strategy:

[0149]

[0150] where τ is the update coefficient. To reduce the correlation of the recursive data, the update coefficient is generally set to a small value.
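A soft update with coefficient τ can be sketched as follows, assuming the usual Polyak-averaging form used with TD3, in which each target parameter moves a fraction τ toward its online counterpart. The function name `soft_update`, the list-of-arrays parameter layout, and the default `tau` are illustrative assumptions.

```python
import numpy as np

def soft_update(target_params, online_params, tau=0.005):
    """Return target parameters moved a fraction tau toward the online ones.

    Assumed form: theta_target <- tau * theta + (1 - tau) * theta_target.
    """
    return [tau * w + (1.0 - tau) * w_t
            for w_t, w in zip(target_params, online_params)]
```

With a small `tau` the target networks change slowly, which keeps the target value in the previous step stable between updates.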

[0151] In a star-convex stochastic hypersurface model, the noise switching probability and geometric modeling parameters directly affect the accuracy of the measurement equation and state estimation. Inappropriate noise model parameters or geometric model settings can lead to significant estimation errors and instability in the tracking process. Therefore, it is necessary to find an optimal dynamic optimization scheme to reduce the adverse effects of noise and modeling parameters on target state estimation and enhance its adaptability. Iterative optimization algorithms in reinforcement learning provide effective solutions to the above problems, especially for optimization problems in non-stationary noise scenarios, where related research is limited. Therefore, this embodiment proposes a star-convex extended target tracking optimization method based on the TD3 algorithm, which dynamically optimizes the state and shape estimation process of the extended target, thereby significantly improving the accuracy and robustness of extended target tracking.

[0152] This embodiment employs a star-convex random hypersurface model to model the geometric contour of the extended target. This model can flexibly adapt to the complex geometry of the target and still achieve good estimation results even when there are significant differences between the actual target shape and the modeled shape. Therefore, this method has significant practical application value in solving target recognition, detection, and tracking problems. Based on this, a parameter optimization method based on TD3 reinforcement learning is combined to significantly improve the estimation results of the target's centroid motion state and geometry by dynamically adjusting the noise switching probability and key model parameters, thereby providing higher tracking accuracy and system stability in non-stationary noise environments.

[0153] Simulation analysis:

[0154] The system sampling period is set to $T_s = 1$ s, and the tracking time is N = 200 sampling periods. The number of measurements $n_{mea}$ obtained from the extended target follows a Poisson distribution with intensity $\beta_D = 40$. The initial state of the extended target is defined as a circle with a radius of 8 cm, the Fourier expansion order of the shape parameters is $N_f = 16$, and the initial switching probability is set to 0.7. The number of Monte Carlo runs is T = 200. The shape of the extended target is set to a cross. The initial target state parameters include the position and velocity along the x and y axes and the shape parameters. The initial target state and state covariance are as follows:

[0155] $x_0 = [10, 10, 10, 10, 8, 0, \ldots, 0]_{1 \times 20}$ (36);

[0156] $C_0 = \mathrm{diag}([0.3, 0.3, 0.3, 0.3, 0.02, \ldots, 0.02])_{20 \times 20}$ (37);
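The initial values in equations (36) and (37) can be written out as arrays, as a sketch of how the simulation might be set up. The variable names and the fixed random seed are assumptions; the numeric values are those stated above.

```python
import numpy as np

STATE_DIM = 20   # dimension of x0 in eq. (36)
N_STEPS = 200    # tracking time in sampling periods
BETA_D = 40      # Poisson intensity of the measurement count

# Eq. (36): four kinematic entries of 10, a first shape entry of 8, zeros elsewhere
x0 = np.zeros(STATE_DIM)
x0[:4] = 10.0
x0[4] = 8.0

# Eq. (37): diagonal covariance, 0.3 for the first four entries and 0.02 for the rest
C0 = np.diag(np.concatenate([np.full(4, 0.3), np.full(STATE_DIM - 4, 0.02)]))

# Number of measurements per time step, drawn from a Poisson distribution
rng = np.random.default_rng(0)  # seed chosen arbitrarily for repeatability
n_mea = rng.poisson(BETA_D, size=N_STEPS)
```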

[0157] The covariance matrix of the system measurement noise is R, the system noise is zero-mean Gaussian noise with covariance matrix Q, and the corresponding state transition matrix is F:

[0158]

[0159]

[0160] where $q_1 = 0.01^2$, $q_2 = 0.03^2$, and $\nu = 5$.

[0161] The hyperparameters of the TD3 deep reinforcement learning algorithm are set as shown in Table 1.

[0162] Table 1

[0163]

[0164] In the above simulation scenario, the star-convex extended target tracking optimization method based on the TD3 algorithm is used to track the extended target, as shown in Figure 4. Figure 4(a) shows the tracking result for a single extended target, including the estimated centroid, the tracking curve, and the measurement points. The estimates of the target's shape and position are relatively accurate, with most measurement points falling within the estimated boundary. The method can effectively track the motion and shape state of a single extended target, and the model is robust to noise. The estimated shape fits the target's uncertainty region well, and the measurement points are mainly distributed within the estimated target region, which further supports the accuracy of the method. Figure 4(b) shows the complete trajectory tracking results of the two compared methods. The overall trajectory is not a straight line because the measurement noise and process noise are non-stationary, but both methods track it well. Figure 4(c) is a magnified view of the trajectory tracking details of the compared methods, showing how the reinforcement learning method and the traditional method differ in tracking shape details. The reinforcement learning method fits the target boundary more accurately, while the traditional method deviates at the target boundary. In shape tracking, the reinforcement learning method better captures subtle changes in the target, which demonstrates its advantage in complex, noisy environments.

[0165] Figure 5 shows magnified outlines of three stages of the target's motion, where (a) is the initial stage, (b) the intermediate stage, and (c) the final stage. The figures show that the TD3 reinforcement learning algorithm produces a more accurate target trajectory and fits the target center better than the traditional variational Bayesian method. Under complex non-Gaussian noise conditions, the TD3 algorithm maintains low bias, demonstrating its ability to estimate the target's motion state accurately. The TD3 algorithm estimates both the extended state and the motion state of the target more accurately, closely approximating the true target state. In the complex noise environment of the experimental setup, the TD3 algorithm avoids the overfitting or divergence problems that may occur with traditional methods. In contrast, the traditional method shows larger trajectory deviations at the different stages, especially when the target position changes significantly, which weakens its trajectory tracking. This indicates that the TD3 algorithm has stronger adaptability and robustness, making it suitable for dynamically changing and complex scenarios.

[0166] Figure 6 shows the IoU curves of the two methods. The TD3 reinforcement learning algorithm has a higher IoU value throughout the time steps, with its curve stabilizing around 0.8, indicating more accurate prediction of the target region. The traditional variational Bayesian (VB) method improves gradually in the early stages but fluctuates significantly in the later stages, dropping below 0.6 in most time steps, which indicates poor stability in complex scenes. During time steps 50-100, which simulate a sudden, unknown thick-tailed noise disturbance, TD3 reacts quickly: it maintains tracking and its IoU curve rapidly returns to around 0.8. In contrast, VB fluctuates significantly overall, and its tracking performance is far inferior to that of TD3. TD3's IoU curve converges rapidly and reaches a stable state at approximately 20-30 time steps, demonstrating its ability to adapt quickly to the target characteristics and generate stable predictions. VB improves more slowly at first and does not become fully stable over the whole time period, reflecting its weak adaptability to complex scenes. In terms of fluctuation amplitude, the TD3 curve fluctuates less, indicating greater robustness in dynamic environments, whereas the VB curve fluctuates more sharply, suggesting higher sensitivity to noise, occlusion, or complex target changes. Overall, the TD3 algorithm has a higher and more stable IoU value, indicating better prediction accuracy and reliability in target tracking tasks. The VB method performs reasonably well at certain time steps, but its overall performance is limited, particularly with non-Gaussian noise and complex scenes.
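For reference, IoU (intersection over union) between an estimated and a true target region can be computed as below. This sketch assumes both regions have been rasterized to boolean masks on a common grid; the embodiment does not state how the regions are represented, so the mask representation and the function name `iou` are assumptions.

```python
import numpy as np

def iou(mask_est, mask_true):
    """Intersection over union of two boolean occupancy masks."""
    mask_est = np.asarray(mask_est, dtype=bool)
    mask_true = np.asarray(mask_true, dtype=bool)
    union = np.logical_or(mask_est, mask_true).sum()
    if union == 0:
        # Assumption: two empty regions are treated as a perfect match
        return 1.0
    intersection = np.logical_and(mask_est, mask_true).sum()
    return float(intersection / union)
```

An IoU of 1 means the estimated and true regions coincide exactly, and a value of 0 means they do not overlap.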

[0167] Figure 7 compares the RMSE of the centroid estimates for the tracked target. The RMSE of the TD3 algorithm remains consistently low throughout the time period, mainly between 0.2 and 0.4, with minimal fluctuation, indicating higher tracking accuracy and stability. The RMSE of the VB algorithm fluctuates significantly, reaching 0.5 and even 1.0 multiple times, which shows large prediction errors at some time steps. During time steps 50-100, the simulation introduces a sudden, unknown thick-tailed noise disturbance. As shown in the figure, the TD3 method fluctuates less than the VB method and returns to normal faster. The VB algorithm's RMSE curve fluctuates significantly, especially during time steps 20-60, where several large errors appear, indicating instability in complex scenes. The trend over the first 20 time steps shows that the TD3 algorithm converges quickly to a stable error level, indicating higher learning efficiency. In contrast, the VB algorithm shows significant error fluctuations in the early stages and does not settle into a stable trend afterwards, indicating poor adaptability to complex dynamic environments. The TD3 algorithm is robust, with smaller errors and limited fluctuation across time steps, whereas the VB algorithm is highly sensitive to noise or changes in target motion and readily produces large errors, indicating insufficient robustness.
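A per-time-step centroid RMSE over Monte Carlo runs can be computed as in the sketch below. The array layout (runs, time steps, 2D position) and the function name `centroid_rmse` are assumptions; the embodiment reports the RMSE curve but does not spell out the computation.

```python
import numpy as np

def centroid_rmse(est, truth):
    """RMSE of the centroid position at each time step.

    est:   array of shape (runs, steps, 2), estimated centroids per run
    truth: array of shape (steps, 2), true centroid trajectory
    Returns an array of shape (steps,).
    """
    err = np.asarray(est, dtype=float) - np.asarray(truth, dtype=float)[None]
    squared_distance = np.sum(err ** 2, axis=-1)       # (runs, steps)
    return np.sqrt(np.mean(squared_distance, axis=0))  # average over runs
```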

[0168] Figure 8 shows the reward curve obtained by the TD3 reinforcement learning algorithm during training. As the time steps increase, the reward value gradually increases, indicating that the algorithm improves progressively during learning. The reward value rises from an initially low level and stabilizes at around -50, showing that the algorithm has achieved its learning objective to a certain extent and obtained a stable policy. The reward value is essentially stable in the latter half of the steps, indicating that most of the learning is complete at this stage and the algorithm is converging. After convergence, the fluctuation range of the reward value is generally small, indicating that the learned policy is stable. However, the reward value still fluctuates to some extent throughout the learning process, with occasional large drops even after convergence, so the robustness of the algorithm in handling complex or abnormal states needs further improvement. Judging from the stability and trend of the reward value, the algorithm adapts well to dynamic environments and gradually learns effective policies, but additional optimization may be needed in some extreme cases.

[0169] This embodiment proposes an optimization method for extended target tracking in non-stationary noise environments based on the TD3 algorithm. By modeling the noise switching probability as the action of a reinforcement learning agent and combining it with a Gaussian-Student t mixture distribution model, this embodiment achieves dynamic switching between Gaussian and thick-tailed noise. Geometric modeling of the extended target is performed based on a star-convex stochastic hypersurface model, and key filter parameters are dynamically adjusted within the TD3 framework, effectively improving the accuracy and robustness of state estimation. Simulation experiments verify that, compared with traditional variational Bayesian methods, the method in this embodiment significantly reduces estimation errors in complex noise scenarios and exhibits stronger robustness and real-time performance. Experimental results show that the adaptive optimization capability of the reinforcement learning algorithm TD3 in dynamic noise environments enables accurate tracking of the motion state and geometry of extended targets, providing a new approach to solving the problem of extended target tracking under non-Gaussian noise.

[0170] The embodiments described above are merely preferred embodiments of the present invention and are not intended to limit the scope of the present invention. Various modifications and improvements made to the technical solutions of the present invention by those skilled in the art without departing from the spirit of the present invention should fall within the protection scope defined by the claims of the present invention.

Claims

1. An extended target tracking optimization method based on TD3 in a non-stationary noise environment, characterized in that the method is used in the fields of autonomous driving and intelligent monitoring and uses sensors to acquire measurement information for extended target tracking optimization, the method comprising:
S1. Based on the Gaussian-Student t mixture distribution model, constructing the noise switching probability optimization problem as a Markov decision process, and obtaining the measurement information of the extended target;
S2. Based on the measurement information, performing star-convex random hypersurface modeling on the extended target, substituting the current noise switching probability into the unscented Kalman filter, jointly estimating the motion state and geometry of the extended target to obtain the state estimation result and covariance matrix, calculating the trace of the covariance matrix, and calculating the centroid position error and shape matching error of the extended target based on the estimation result;
S3. Inputting the trace of the covariance matrix, the centroid position error, and the shape matching error into the pre-constructed reward function to obtain the reward value used to evaluate the accuracy of the state estimation;
S4. Updating the policy network and value network in the TD3 algorithm based on the reward value, generating policy actions in combination with the current system state, and dynamically optimizing to obtain the noise switching probability at the next moment;
S5. Feeding back the optimized noise switching probability for the next time step in real time and substituting it into the unscented Kalman filter state estimation process in S2, forming a closed-loop iteration in which noise switching probability optimization drives state estimation and state estimation error guides noise switching probability optimization;
S2-S4 are executed repeatedly until the preset iteration target is reached and the optimal noise switching probability learning strategy is obtained; based on this strategy, the state estimation of the extended target is continuously optimized by adaptively adjusting the noise distribution, so as to achieve high-precision tracking of the extended target in a non-stationary noise environment.

2. The extended target tracking optimization method based on TD3 in a non-stationary noise environment according to claim 1, characterized in that, in S1, the noise switching probability optimization problem is constructed as a Markov decision process based on the Gaussian-Student t mixture distribution model, wherein: the state space consists of the trace of the noise covariance and the noise switching probability; the action space consists of a set of decisions; the state transition probability consists of the dynamic update of the noise switching probability and the state evolution of the filter; and the reward function consists of the trace of the noise covariance matrix, the shape matching error, and the centroid position error.

3. The extended target tracking optimization method based on TD3 in a non-stationary noise environment according to claim 2, characterized in that the reward function is: ; wherein the symbols denote, in order: the trace of the covariance matrix at the given time step; the shape matching error; the centroid position error; the weights of the reward terms; and the reward value.

4. The extended target tracking optimization method based on TD3 in a non-stationary noise environment according to claim 1, characterized in that, in step S2, performing star-convex random hypersurface modeling on the extended target based on the measurement information includes: establishing a measurement source model based on a spatial distribution assumption, and generating several measurement sources of the extended target using the measurement source model; generating several measurements of the extended target based on the measurement sources and the mixed sensor noise; and, based on the measurement sources and measurements, modeling with a star-convex random hypersurface to obtain the observation model of the star-convex extended target.

5. The extended target tracking optimization method based on TD3 in a non-stationary noise environment according to claim 4, characterized in that the observation model of the star-convex extended target is: ; wherein the symbols denote, in order: the pseudo-measurement equation; the state of the extended target at the given time; the measurement noise of each sampling point of the extended target at that time; the scale factor of each sampling point of the extended target at that time; the measurement of each sampling point of the extended target at that time; the Fourier coefficients; the shape parameter vector of the extended target at that time; the direction vector; the centroid of the extended target at that time; and the angle between the vector from the centroid to the measurement source and the coordinate axis.

6. The extended target tracking optimization method based on TD3 in a non-stationary noise environment according to claim 1, characterized in that, in step S4, updating the policy network and value network in the TD3 algorithm based on the reward value, and dynamically optimizing the noise switching probability in conjunction with the policy action generated from the current state, includes: initializing the target state and the network parameters of the TD3 algorithm; based on the current policy, inputting the switching probability at time k into the policy network to obtain the action at time k+1; based on the switching probability at time k and the action at time k+1, obtaining the switching probability at time k+1; and substituting the action and switching probability at time k+1 into the filter to update the state vector and covariance matrix, calculating the reward value, and executing the action to obtain the noise switching probability at time k+1.

7. The extended target tracking optimization method based on TD3 in a non-stationary noise environment according to claim 6, characterized in that updating the state vector and covariance matrix includes: ; ; ; ; ; ; wherein the symbols denote, in order: the action at the given time; the predicted state vector obtained based on the action; the weight coefficient of each predicted state, that is, the weight of each sigma point in the UKF; the predicted state of each sigma point; the error between the predicted state of each sigma point and the weighted mean predicted state; the covariance matrix of the state and the measurement; the error between the predicted state vector and the actual target state; the prior state estimate; the prior covariance matrix; the posterior state estimate; the Kalman gain at time k+1; the posterior covariance matrix; and the predicted covariance matrix.

8. The extended target tracking optimization method based on TD3 in a non-stationary noise environment according to claim 1, characterized in that, in the cyclic iteration, the updates to the value network and the policy network include: calculating the target value: ; updating the value network: ; updating the policy network: ; and performing the soft update: ; wherein the symbols denote, in order: the target action-value function; the immediate reward at the current moment; the discount factor; the estimates of the dual value networks; the state and action at the next moment; the current value network parameters; the current policy network parameters; the loss functions of the two value networks; the current state and action; the gradient of the objective function corresponding to the policy network; the gradient of the value network; the action output by the policy network for the given state; the gradient of the policy network; the target value network parameters; the target policy network parameters; and the update coefficient.