Multi-sensor intelligent fusion multi-target tracking method based on transformer architecture

The invention adopts a multi-sensor intelligent fusion method based on the Transformer architecture that combines feature-level and decision-level fusion. It addresses the weak anti-interference capability and robustness of existing multi-sensor tracking methods and achieves high-precision, continuous multi-target tracking suitable for autonomous driving and traffic management.

CN118823065B · Active · Publication Date: 2026-06-30 · UNIV OF ELECTRONICS SCI & TECH OF CHINA

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
UNIV OF ELECTRONICS SCI & TECH OF CHINA
Filing Date
2024-07-05
Publication Date
2026-06-30

AI Technical Summary

Technical Problem

Existing multi-sensor target tracking methods have poor anti-interference ability and robustness, and existing deep learning methods struggle to achieve continuous multi-sensor fusion tracking.

Method used

A multi-sensor intelligent fusion method based on the Transformer architecture is adopted, which combines feature-level and decision-level fusion in a hybrid fusion architecture. A multi-target distributed covariance intersection (MDCI) fusion algorithm and a hybrid fusion network (HFN) together achieve high-precision continuous tracking of multiple targets.

Benefits of technology

It achieves high-precision, continuous and robust multi-target tracking fusion, has good scalability and low communication requirements, and is suitable for fields such as autonomous driving and traffic control.

✦ Generated by Eureka AI based on patent content.

Figure CN118823065B_ABST

Abstract

This invention discloses a multi-sensor intelligent fusion multi-target tracking method based on the Transformer architecture. First, radar simulation data is designed and acquired to generate two-dimensional measurement data from multiple sensors. Then, the measurement data is preprocessed, and high-dimensional vector information is extracted and input into a pre-constructed TMSHF network model for training. Finally, test data is input into the trained network model to obtain the multi-target intelligent tracking fusion result. The invention combines the optimal-fusion idea of feature-level fusion with the scalability and stability of decision-level fusion. Taking a data-driven perspective, it overcomes the limitation of existing methods that rely solely on feature-level or decision-level fusion and makes deep use of the information from multiple local sensors. The result is a hybrid feature-level and decision-level multi-sensor fusion architecture that enables high-precision continuous tracking fusion of multiple targets and can be applied in fields such as autonomous driving and traffic control.

Description

Technical Field

[0001] This invention belongs to the fields of radar technology and deep learning technology, specifically relating to a multi-sensor intelligent fusion multi-target tracking method based on the Transformer architecture.

Background Technology

[0002] Multi-target tracking (MTT) refers to the continuous estimation of the position and motion state of multiple dynamic targets over time using sensor measurement data that may contain noise. The methods for solving the MTT problem depend on whether they operate in a model-based or model-free environment. Model-based Bayesian methods achieve state-of-the-art (SOTA) performance when an accurate multi-target model is known and the observations consist of low-dimensional, single-target detections. There are three main Bayesian filtering-based solutions for MTT: Joint Probabilistic Data Association Filter (JPDAF), Multiple Hypothesis Tracking (MHT), and Random Finite Set (RFS) methods. In JPDAF and MHT, the core idea is to first solve the data association problem between sensor measurement data and target tracks, and then apply existing filtering methods for tracking. In contrast, RFS methods offer a holistic approach that simultaneously solves the tracking and association problems. Representative examples include the PMBM filter and the GLMB filter, both of which utilize the RFS framework to solve the tracking problem and achieve excellent performance.

[0003] While a single sensor can provide optimal or near-optimal estimates for dynamic targets, its weak anti-interference capability and poor robustness limit its effectiveness. Fusing the state estimates of multiple sensors for the same target has therefore become a research hotspot. Existing multi-sensor target fusion tracking methods can be divided into two categories according to the type of information shared between sensors: feature-level and decision-level. Feature-level fusion focuses on computing accurate multi-sensor likelihoods or posterior probabilities. It typically employs a centralized sensor network in which the measurements from all sensors are sent to a fusion center for joint analysis to achieve optimal fusion. This requires every sensor to communicate with a central node, demands high computational, storage, and communication capabilities, and offers weak fault tolerance. Decision-level fusion, in contrast, uses the posterior estimates generated by local filters. It is usually implemented in a distributed manner and offers better computational scalability and reliability. Covariance intersection (CI) and arithmetic average (AA) fusion are two widely used rules in distributed fusion. For example, the CI rule can fuse estimates with unknown correlation that are generated by different, not necessarily independent sensors, and so avoids double counting of information during fusion. In practical applications, a fusion node often cannot determine the correlation between its own information and the information received from other sensor nodes, and the assumption of independence between sensor nodes does not hold in most real-world problems. The CI fusion rule is therefore used to fuse the first-order and second-order statistics provided by local sensor nodes to obtain fused MTT results.
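
As a hedged illustration of the CI rule discussed in this paragraph, the standard two-estimate form from the covariance intersection literature is given here. The symbols $\hat{x}_1, P_1$ and $\hat{x}_2, P_2$ (local estimates and their covariances) and the weight $\omega$ are notation assumed for this sketch, not taken from the patent text.

```latex
P_{CI}^{-1} = \omega P_1^{-1} + (1-\omega)\, P_2^{-1}, \qquad
\hat{x}_{CI} = P_{CI}\left( \omega P_1^{-1}\hat{x}_1 + (1-\omega)\, P_2^{-1}\hat{x}_2 \right), \qquad
\omega \in [0,1]
```

The weight $\omega$ is commonly chosen to minimize the trace or determinant of $P_{CI}$. The rule needs no knowledge of the cross-correlation between the two estimates, which is why it suits distributed sensor networks.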

[0004] In recent years, deep learning-based MTT algorithms have become an attractive alternative to traditional Bayesian methods, typically optimizing models with a large number of parameters by minimizing empirical risk on labeled datasets. The paper "Next generation multitarget trackers: Random finite set methods vs transformer-based deep learning," in Proc. Int. Conf. Inform. Fusion. IEEE, 2021, pp. 1-8, proposes a high-performance, type-specific MTT neural network based on the Transformer architecture, called the MultiTarget Tracking Transformer (MT3). The paper "Deep learning for model-based multi-object tracking," in IEEE Transactions on Aerospace and Electronic Systems, 2023, proposes the MT3 v2 algorithm, which adds an uncertainty output to the original MT3 output and thereby opens up opportunities for multi-sensor fusion using deep learning methods. However, the MT3 v2 algorithm cannot achieve continuous prediction; it can only fuse single-frame target state estimation results. The paper "Transformer based online continuous multi-target tracking with state regression," in 2023 12th International Conference on Control, Automation and Information Sciences (ICCAIS). IEEE, 2023, pp. 393-398, addresses the problem of continuous target tracking and proposes the SR-MT3 method, which achieves online continuous target tracking through recursive state autoregressive queries. While single-sensor state estimation can be optimal or suboptimal for dynamic targets, it lacks strong anti-interference capability and robustness. To date, no research has used deep learning methods to achieve continuous and robust multi-sensor fusion target tracking, although the advances above make intelligent multi-sensor fusion multi-target tracking feasible.

Summary of the Invention

[0005] To address the aforementioned technical issues, this invention provides a multi-sensor intelligent fusion multi-target tracking method (TMSHF) based on the Transformer architecture. It combines the optimal-fusion concept of feature-level fusion with the scalability and stability of decision-level fusion. Taking a data-driven approach, it overcomes the limitation of traditional fusion methods that rely solely on feature-level or decision-level fusion, makes in-depth use of the information from different sensors, forms a hybrid feature-level and decision-level multi-sensor fusion architecture, and realizes high-precision continuous tracking fusion of multiple targets.

[0006] The technical solution adopted in this invention is: a multi-sensor intelligent fusion multi-target tracking method based on the Transformer architecture, the specific steps of which are as follows:

[0007] S1. Design radar simulation data to generate multi-sensor two-dimensional measurements, perform data preprocessing, and divide the data into training and testing sets for the network model.

[0008] S2. Construct the TMSHF network model, including: a single-sensor tracking module and a multi-sensor hybrid fusion module;

[0009] S3. Design a loss function and input the multi-sensor two-dimensional measurement training set obtained in step S1 into the TMSHF network model constructed in step S2 to train the network model.

[0010] S4. Input the multi-sensor two-dimensional measurement test set obtained in step S1 into the TMSHF network model trained in step S3, output the deep fusion test tracking results of the multi-target and evaluate them.

[0011] Furthermore, step S1 is specifically as follows:

[0012] First, a large amount of radar simulation data is generated using standard state and measurement equations as the dataset for the deep learning network. The target motion is approximated as uniform motion. Let $x_t^i$ denote the state vector of target i at frame t and $x_{t-1}^i$ its state vector at frame t-1. The equation of motion is expressed as follows:

[0013] $x_t^i = F_t x_{t-1}^i + W_{t-1}$

[0014] where $F_t$ represents the state transition matrix and $W_{t-1}$ represents the process noise, which is Gaussian with zero mean and covariance $Q_{t-1}$.

[0015] In a two-dimensional scene, the covariance matrix $Q_{t-1}$ is expressed as follows:

[0016]

[0017] where x and y represent the x-axis and y-axis positions of target i at frame t, and the corresponding velocity components represent its x-direction and y-direction velocities at frame t; $q_s$ represents the process noise variance, T represents the sensor sampling period, $I_2$ represents the second-order identity matrix, and $\otimes$ represents the Kronecker product.

[0018] New targets arrive according to a Poisson point process with birth intensity $\lambda_b$. The states of all targets present in frame t form the target state set of that frame.

[0019] A group of sensors with the same field of view (FOV) is assumed to detect the same set of targets. Simulated sensors generate measurements in a two-dimensional Cartesian coordinate system, with each existing target generating at most one actual measurement. The observation equation for the actual measurements captured by sensor s is expressed as follows:

[0020]

[0021] where the left-hand side is the measurement of target i obtained by sensor s at frame t, the measurement noise of sensor s has zero mean and a sensor-specific variance, and $H_s$ is the measurement matrix of sensor s, expressed as follows:

[0022]

[0023] Clutter measurements arrive according to a Poisson point process with a given intensity, independent of the existing targets and of the actual measurements. The set of all measurements in frame t is expressed as follows:

[0024]

[0025] where the additional term represents the set of clutter generated by sensor s in frame t.
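
The data-generation model described above can be sketched in Python as follows. This is a minimal illustration, not the patent's simulator. The equations for the covariance matrix and the measurement matrix are not reproduced in this text, so the sketch assumes a state ordering [x, y, vx, vy], the common white-noise-acceleration form $Q = q_s \begin{bmatrix} T^3/3 & T^2/2 \\ T^2/2 & T \end{bmatrix} \otimes I_2$, position-only measurements, uniformly distributed clutter, and no target deaths. The detection probability `p_d`, the clutter rate `lambda_c`, and all default values are hypothetical.

```python
import math
import random


def poisson(rng, lam):
    """Draw a Poisson-distributed count (Knuth's method; fine for small lam)."""
    limit, k, prod = math.exp(-lam), 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k


def simulate(frames=20, T=0.1, q_s=0.5, sigma_z=0.1, lambda_b=0.3,
             lambda_c=2.0, p_d=0.9, fov=10.0, n_sensors=2, seed=0):
    """Simulate near-constant-velocity targets and per-sensor 2D measurements."""
    rng = random.Random(seed)
    # Cholesky factor of the assumed 2x2 block q_s * [[T^3/3, T^2/2], [T^2/2, T]].
    l11 = math.sqrt(q_s * T ** 3 / 3)
    l21 = (q_s * T ** 2 / 2) / l11
    l22 = math.sqrt(q_s * T - l21 ** 2)
    targets, out = [], []
    for _ in range(frames):
        moved = []
        for x, y, vx, vy in targets:
            e = [rng.gauss(0, 1) for _ in range(4)]
            # State equation: x_t = F x_{t-1} + w, with correlated position/velocity noise.
            moved.append((x + T * vx + l11 * e[0], y + T * vy + l11 * e[1],
                          vx + l21 * e[0] + l22 * e[2], vy + l21 * e[1] + l22 * e[3]))
        # Poisson births with intensity lambda_b, placed uniformly in the FOV.
        for _ in range(poisson(rng, lambda_b)):
            moved.append((rng.uniform(-fov, fov), rng.uniform(-fov, fov),
                          rng.gauss(0, 1), rng.gauss(0, 1)))
        targets = moved
        per_sensor = []
        for _ in range(n_sensors):
            # At most one noisy position measurement per target (detected with p_d).
            z = [(x + rng.gauss(0, sigma_z), y + rng.gauss(0, sigma_z))
                 for x, y, _, _ in targets if rng.random() < p_d]
            # Poisson clutter, independent of targets and of actual measurements.
            z += [(rng.uniform(-fov, fov), rng.uniform(-fov, fov))
                  for _ in range(poisson(rng, lambda_c))]
            rng.shuffle(z)  # measurements are collected in random order
            per_sensor.append(z)
        out.append({"states": list(targets), "measurements": per_sensor})
    return out
```

Calling `simulate(frames=50, n_sensors=3)` returns one dictionary per frame holding the ground-truth states and one shuffled measurement list per sensor, which could then be split into training and testing sets.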

[0026] Finally, data preprocessing is performed. First, the two-dimensional measurements from multiple sensors are normalized within the field of view. Then, a linear layer is used to extract high-dimensional feature vectors from the Cartesian coordinate system measurements. Measurement data is generated through simulation based on actual conditions and divided into training and testing sets. The training set is used to train the network model, and the testing set is used to test the trained TMSHF model to obtain the tracking fusion results.
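
The two preprocessing steps can be illustrated with a short sketch. The target range [0, 1] for the normalization, the embedding size, and the random initialization are assumptions made for illustration; in the actual network the linear layer's weights would be learned during training.

```python
import random


def normalize_fov(z, fov_min, fov_max):
    """Map each coordinate of a measurement from [fov_min, fov_max] to [0, 1]."""
    span = fov_max - fov_min
    return [(c - fov_min) / span for c in z]


def make_linear(d_in, d_out, seed=0):
    """Return a linear layer v -> W v + b with randomly initialized weights."""
    rng = random.Random(seed)
    W = [[rng.uniform(-1, 1) / d_in ** 0.5 for _ in range(d_in)]
         for _ in range(d_out)]
    b = [0.0] * d_out

    def linear(v):
        return [sum(w * x for w, x in zip(row, v)) + bi
                for row, bi in zip(W, b)]

    return linear


# Lift a normalized 2D measurement to a higher-dimensional feature vector.
embed = make_linear(d_in=2, d_out=8)
feature = embed(normalize_fov((0.0, 10.0), -10.0, 10.0))
```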

[0027] Furthermore, step S2 is specifically as follows:

[0028] The TMSHF network model includes: a single-sensor tracking module and a multi-sensor hybrid fusion module.

[0029] The single-sensor tracking module includes: an encoder for analyzing target motion state information; a decoder for parsing and predicting target state; and a state regression mechanism.

[0030] First, the encoder analyzes the target's motion state information, encoding the measurements into a high-dimensional feature vector. The state regression mechanism obtains a newborn query, an autoregressive query, and a mask from the output predicted in the previous frame. Finally, the newborn query, autoregressive query, mask, and the high-dimensional feature vector obtained from the encoder are input into the decoder to predict the target state in the current frame, resulting in a predicted state output.

[0031] The multi-sensor hybrid fusion module includes: a multi-target distributed CI fusion algorithm module (MDCI) and a hybrid fusion network module (HFN).

[0032] The predicted state outputs of the decoders of the multiple local sensors are first fused with the multi-target distributed CI fusion rule to form the decision-level information. The decision-level information and the feature-level information are then input into the fusion decoder as queries and embeddings, respectively, to obtain the deeply fused target predicted states and their uncertainty.

[0033] In single-sensor tracking tasks, the focus is on the state estimation problem of multiple targets. For the state estimation at the current frame T, the inputs are the state estimate of the previous frame and the measurement sequence of sensor s over the past τ time steps up to frame T. Each measurement in the sequence is detected by sensor s, and the measurements are collected in random order. The sequence is expressed as follows:

[0034]

[0035] where

[0036]

[0037] where each element is a measurement obtained by sensor s at frame t, augmented with the time information t; [·]′ represents the matrix transpose operation; and the remaining symbols denote the number of measurements detected by sensor s in frame t and the x and y coordinates of measurement k.

[0038] The target states estimated by sensor s in the previous frame are expressed as follows:

[0039]

[0040] where the symbols denote, respectively, the number of targets estimated by sensor s in the previous frame, the estimated coordinates of each target, and the standard deviation of each target's estimated covariance.

[0041] The result of each sensor performing a single-sensor tracking task is a set of estimated states and standard deviations of the target in the current frame, expressed as follows:

[0042]

[0043] A hybrid fusion method combining feature-level fusion and decision-level fusion is adopted. The measurements from all sensors used for feature-level fusion are expressed as follows:

[0044]

[0045] Where S represents the number of sensors involved in feature-level fusion.

[0046] For decision-level fusion, all estimated states from single-sensor tracking are used as input, expressed as follows:

[0047]

[0048] where the count symbol represents the number of targets estimated by sensor s in frame T.

[0049] The result of a multi-sensor fusion task is a set of fused estimated states and standard deviations, expressed as follows:

[0050]

[0051] where the count symbol represents the estimated number of targets after fusion, and the superscript F indicates fusion.

[0052] For each sensor, the measurement sequence is input to the Transformer encoder and converted into an embedding. The embedding, together with the query, is then fed into the Transformer decoder to generate the estimated state.

[0053] One copy of the estimated state is processed by the state regression mechanism and recursively fed back to the decoder as a query in the next frame. Another copy, together with the outputs of the other sensors, is collected into a set that gathers the outputs of the multiple sensors.

[0054] The multi-target distributed covariance intersection (MDCI) fusion algorithm performs preliminary fusion on the estimated state set to generate a query sequence. These queries and the concatenated embeddings of the S sensors are then input together into the fusion decoder, which transforms them into the final fusion estimate.

[0055] where $n_C$ represents the number of targets after MDCI and $n_F$ represents the number of targets after hybrid fusion.

[0056] Furthermore, in step S2, the single-sensor tracking module is specifically as follows:

[0057] The single-sensor tracking module, namely the single-sensor SR-MT3 tracking module, uses the SR-MT3 network framework for single-sensor tracking.

[0058] The output of the single-sensor tracking module is derived from two types of queries: k newborn queries, which allow the model to initialize trajectories for targets that did not exist in the previous frame; and k autoregressive queries, which are responsible for tracking trajectories that existed in the previous frame.

[0059] where k is a predetermined value equal to the maximum number of targets. The decoder query $o_{1:2k}$ is obtained by concatenating the newborn queries $n_{1:k}$ and the autoregressive queries $r_{1:k}$ along the k dimension, and is used to jointly predict the target states of the current frame.

[0060] When the number of targets does not reach the number of queries, a masking mechanism based on existence probability is used to build the state regression module. The module generates the newborn queries $n_{1:k}$, the autoregressive queries $r_{1:k}$, and the mask $m_{1:2k}$, which together form the new state regression query, as detailed below:

[0061] (1) Tracking initialization is achieved through newborn query;

[0062] Each embedding is initialized using static and learned target encodings, and new targets appearing in the current frame are detected by a fixed number of k output embeddings. During iteration, the newborn queries input to the decoder are denoted $n_{1:k}$.

[0063] (2) The autoregressive query iterates during the tracking process;

[0064] First, from the existence probabilities $P_{t-1,1:2k}$ of the previous frame, the k items with the highest existence probability are chosen. The states corresponding to these top-k existence probabilities are then selected as query candidates, expressed as follows:

[0065]

[0066] where the candidate sequence has length k, a non-negative integer; $r = \mathrm{argsort}(P_{t-1,1:2k})$, where argsort is a function that returns the indices of the input array sorted by existence probability; $r_l$ is the index that gives the l-th of the top-k existence probabilities; and the selected query candidates are the states picked according to index $r_l$.

[0067] After normalization based on the field of view, the selected sequence is fed into a feedforward neural network layer for nonlinear mapping and feature extraction, generating the queries $r_{1:k}$ for the decoder, expressed as follows:

[0068]

[0069] where each element is a real-valued vector of dimension d′, $\mathbb{R}$ denotes the field of real numbers, $d' > d_z$ is a hyperparameter, and $d_z$ denotes the measurement dimension.
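
The top-k selection that produces the autoregressive query candidates can be sketched as below. The function name and the plain-list data layout are hypothetical, and the subsequent feedforward mapping to dimension d′ is omitted.

```python
def select_autoregressive(states, probs, k):
    """Pick the k previous-frame states with the highest existence probability.

    states and probs are aligned sequences of length 2k (one entry per decoder
    output of the previous frame). Returns the selected states and their indices.
    """
    # argsort in descending order of existence probability
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    top = order[:k]
    return [states[i] for i in top], top


prev_states = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0), (3.0, 3.0)]
prev_probs = [0.1, 0.9, 0.4, 0.8]
candidates, indices = select_autoregressive(prev_states, prev_probs, k=2)
```

With these inputs the two candidates are the states at indices 1 and 3, which carry the existence probabilities 0.9 and 0.8.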

[0070] (3) Tracking termination is achieved based on the existence probability of the mask;

[0071] A masking mechanism based on existence probability is adopted. The k highest existence probabilities are selected according to the top-k mechanism and fed into a linear layer to calculate the existence threshold $g_t$, expressed as follows:

[0072]

[0073] where the coefficients of the linear layer are learnable parameters. Because the selected existence probabilities and the autoregressive queries correspond one to one, the mask $m_i$ is calculated according to $g_t$, expressed as follows:

[0074]

[0075] where the probability term is the existence probability of the l-th element in the sequence; the mask sequence has length 2k, a non-negative integer; and because newborn queries have no corresponding existence probability, their masks are always False.
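
A possible reading of the masking step is sketched here. The exact threshold and mask equations are not reproduced in this text, so two assumptions are made: the threshold is a sigmoid of a linear combination of the top-k probabilities, and a mask value of True means the query is ignored because its existence probability falls below $g_t$. The function names and the weight values are hypothetical.

```python
import math


def existence_threshold(top_probs, weights, bias):
    """Assumed form of g_t: sigmoid of a learned linear map of the top-k probabilities."""
    s = sum(w * p for w, p in zip(weights, top_probs)) + bias
    return 1.0 / (1.0 + math.exp(-s))


def build_mask(top_probs, g_t):
    """Build m_{1:2k}: newborn queries first (never masked), then autoregressive ones."""
    newborn = [False] * len(top_probs)
    autoregressive = [p < g_t for p in top_probs]  # True = track is terminated
    return newborn + autoregressive


g_t = existence_threshold([0.9, 0.2], weights=[0.0, 0.0], bias=0.0)
mask = build_mask([0.9, 0.2], g_t)
```

Masking an autoregressive query removes its track from the next decoding step, which is how tracking termination is realized in this sketch.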

[0076] Furthermore, in step S2, the multi-sensor hybrid fusion module is specifically as follows:

[0077] The multi-sensor fusion module is composed of a multi-target distributed CI fusion algorithm (MDCI) and a hybrid fusion network (HFN).

[0078] (1) Design MDCI;

[0079] MDCI is used for preliminary decision-level fusion and for generating the queries of the fusion decoder. The collected decoder outputs contain clutter as well as measurements from different sensors of different targets. By comparing the distances between measurements with a distance threshold δ, measurements from different sensors of the same target are grouped: if the distance is greater than the threshold, the two measurements are classified as different targets; if it is less than the threshold, they are classified as one target.

[0080] After classifying the measurement values, existing CI fusion rules are used to fuse the estimates of the same target. Finally, the fused result is expressed as follows:

[0081]

[0082] where the two terms represent, respectively, the fused xy-axis position of the i-th target after MDCI and the uncertainty of that position.
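
The grouping-then-fusion idea of MDCI can be sketched as follows. This is an illustration under stated assumptions rather than the patented algorithm: grouping is greedy and compares each estimate with the first member of a group, a group holds at most one estimate per sensor, covariances are diagonal (built from the per-axis standard deviations), and the CI weights are equal. The function names and the dictionary layout are hypothetical.

```python
import math


def ci_fuse(estimates):
    """Covariance intersection with equal weights for diagonal covariances."""
    w = 1.0 / len(estimates)
    mean, std = [], []
    for d in range(2):
        info = sum(w / e["std"][d] ** 2 for e in estimates)  # fused inverse variance
        mean.append(sum(w * e["mean"][d] / e["std"][d] ** 2 for e in estimates) / info)
        std.append(math.sqrt(1.0 / info))
    return {"mean": tuple(mean), "std": tuple(std)}


def mdci(per_sensor, delta):
    """Group estimates closer than delta across sensors, then fuse each group."""
    groups = []
    for s, estimates in enumerate(per_sensor):
        for e in estimates:
            for g in groups:
                ref = g["members"][0]["mean"]
                if s not in g["sensors"] and math.dist(e["mean"], ref) < delta:
                    g["members"].append(e)
                    g["sensors"].add(s)
                    break
            else:  # no group matched: treat the estimate as a different target
                groups.append({"members": [e], "sensors": {s}})
    return [ci_fuse(g["members"]) for g in groups]


sensor_a = [{"mean": (1.0, 1.0), "std": (0.2, 0.2)}]
sensor_b = [{"mean": (1.1, 0.9), "std": (0.2, 0.2)},
            {"mean": (8.0, 8.0), "std": (0.3, 0.3)}]
fused = mdci([sensor_a, sensor_b], delta=0.5)
```

Here the two nearby estimates are merged into one target while the distant estimate stays separate, so `fused` holds two entries that would serve as fusion-decoder queries.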

[0083] (2) Design HFN;

[0084] HFN is used to perform hybrid fusion tasks. During single-sensor tracking, the features of the measurements detected by each sensor are extracted by the encoder and stored in embeddings. HFN performs feature fusion by concatenating these embeddings, as shown in the following expression:

[0085]

[0086] where $m_s$ represents the number of measurements of sensor s, concat denotes the list concatenation operation, the result is the embedding sequence of the fusion decoder, and $m_C$ is the length of that embedding sequence.

[0087] Decision fusion first generates preliminary fusion results through MDCI processing. The states in these results are normalized using the FOV parameters and then fed, together with the covariances, into a feedforward neural network (FFN) to form the query sequence of the fusion decoder, expressed as follows:

[0088]

[0089] where $n_C$ represents the number of existing targets after MDCI processing.

[0090] Finally, the features of the measurements from all sensors and the preliminary estimates are stored in the embedding sequence and the query sequence, respectively. The fusion decoder performs hybrid fusion on them, transforms them into fusion estimates, and outputs a corresponding series of existence probabilities that determine whether each estimate is valid. Estimates whose existence probability is greater than the probability threshold g are selected for the final fusion estimate, expressed as follows:

[0091]

[0092] Where ε(·) represents the step function.
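
The embedding concatenation and the step-function selection of the HFN can be illustrated with a short sketch. The fusion decoder itself is not modeled; the function names and the strict-inequality convention ε(0) = 0 are assumptions.

```python
def concat_embeddings(per_sensor_embeddings):
    """Concatenate the per-sensor embedding lists into one fusion-decoder sequence."""
    return [e for embeddings in per_sensor_embeddings for e in embeddings]


def step(x):
    """Unit step function: 1 for positive arguments, 0 otherwise (assumed convention)."""
    return 1 if x > 0 else 0


def select_fused(estimates, probs, g):
    """Keep the fused estimates whose existence probability exceeds threshold g."""
    return [e for e, p in zip(estimates, probs) if step(p - g) == 1]


sequence = concat_embeddings([["a1", "a2"], ["b1"], ["c1", "c2"]])
kept = select_fused([(1.0, 1.0), (4.0, 2.0), (7.0, 7.0)], [0.95, 0.30, 0.80], g=0.5)
```

The length of `sequence` equals the sum of the per-sensor measurement counts, matching the role of $m_C$ in the text.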

[0093] Furthermore, step S3 is specifically as follows:

[0094] The final fusion estimate is designated as the output of virtual sensor 0, so that the sensor label becomes s∈[0,S].

[0095] The estimates are interpreted as the parameters of a multi-Bernoulli (MB) density with $n_s$ components. Each estimate consists of the mean and covariance parameters of a Gaussian distribution and the existence probability associated with the Bernoulli component. The predictions from all sensors are compared with the ground truth and supervised with the negative log-likelihood (NLL) loss, expressed as follows:

[0096]

[0097] where the density term denotes the multi-Bernoulli density defined by the estimates, evaluated at the ground truth.

[0098] The ground-truth sequence is extended by appending padding elements to generate a new sequence with the same number of elements as the predictions. The negative log-likelihood (25) is then approximated as:

[0099]

[0100] where σ represents the permutation function; $n_l$ represents the length of the new sequence; the indexed term is the σ(i)-th element of the new sequence; and the loss term is the negative logarithm of the specified Bernoulli density evaluated at the σ(i)-th element, calculated as follows:

[0101]

[0102] Finally, $\sigma_s$, the most probable association between the targets and the Bernoulli components predicted for sensor s, is approximated as follows:

[0103]

[0104] where the cost term represents an association score between a target and a Bernoulli component predicted by the sensor. The association is computed efficiently with the Hungarian algorithm, as shown below:

[0105]
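
The association step can be illustrated with a small sketch. The per-element loss used here is the usual Bernoulli form, $-\log p - \log \mathcal{N}(x; \mu, \Sigma)$ for a real target and $-\log(1-p)$ for a padding element, which is an assumption because the patent's equations are not reproduced in this text. For brevity the search is exhaustive over permutations instead of using the Hungarian algorithm; both return a minimum-cost assignment, but the exhaustive search is only practical for a handful of components. All names are hypothetical.

```python
import itertools
import math


def gauss_logpdf(x, mean, std):
    """Log-density of a Gaussian with diagonal covariance."""
    return sum(-0.5 * math.log(2 * math.pi * s * s) - (xi - m) ** 2 / (2 * s * s)
               for xi, m, s in zip(x, mean, std))


def element_nll(target, comp, eps=1e-9):
    """Negative log Bernoulli density of one (possibly padded) target element."""
    p = min(max(comp["p"], eps), 1.0 - eps)
    if target is None:  # padding element: the component should not exist
        return -math.log(1.0 - p)
    return -math.log(p) - gauss_logpdf(target, comp["mean"], comp["std"])


def best_association(targets, comps):
    """Return (cost, perm); perm[i] is the component assigned to padded target i.

    Assumes there are at least as many components as targets.
    """
    padded = list(targets) + [None] * (len(comps) - len(targets))
    best = None
    for perm in itertools.permutations(range(len(comps))):
        cost = sum(element_nll(padded[i], comps[j]) for i, j in enumerate(perm))
        if best is None or cost < best[0]:
            best = (cost, perm)
    return best


truth = [(0.0, 0.0), (5.0, 5.0)]
predicted = [{"p": 0.90, "mean": (5.1, 5.0), "std": (0.5, 0.5)},
             {"p": 0.10, "mean": (9.0, 9.0), "std": (0.5, 0.5)},
             {"p": 0.95, "mean": (0.0, 0.1), "std": (0.5, 0.5)}]
loss, assignment = best_association(truth, predicted)
```

The returned `loss` is the approximate NLL for one sensor; summing it over the sensor labels, including the virtual fusion sensor, would give a training loss in the spirit of the text.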

[0106] The entire training process of the network model is divided into forward propagation and backward propagation. A large amount of training data is obtained through step S1 and input into the network model. At the same time, the weights of each unit are continuously adjusted using the loss function.

[0107] Furthermore, step S4 is specifically as follows:

[0108] After step S3, the training of the TMSHF network is complete. Then, the test dataset generated in step S1 is input into the trained network model to obtain the prediction results. The optimal sub-pattern assignment (OSPA) metric is used to evaluate the model's performance, as follows:

[0109]

[0110] $d_c(x,y)=\min\{c,d(x,y)\}$ (32)

[0111] where $D_{p,c}$ denotes the OSPA error; its two arguments are the prediction result and the true value Γ, both sets of values as used in equation (31); p represents the distance sensitivity parameter; c represents the cut-off distance, i.e., the target state estimation error threshold, which adjusts the balance between the cardinality error of the sets and the position error; and d(·) represents the function for calculating the distance.
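
A compact sketch of the metric follows. Equation (31) is not reproduced in this text, so the standard OSPA definition is assumed: the per-point distance is cut off at c as in equation (32), the best assignment between the smaller and the larger set is found, and each unmatched point is penalized with c. The Euclidean base distance, the default parameter values, and the exhaustive assignment search are choices made for illustration.

```python
import itertools
import math


def ospa(X, Y, c=10.0, p=1):
    """OSPA distance between two finite sets of 2D points (standard definition)."""
    if not X and not Y:
        return 0.0
    if len(X) > len(Y):
        X, Y = Y, X  # make X the smaller set
    m, n = len(X), len(Y)
    # Minimum total cut-off distance over all assignments of X into Y.
    best = min(
        sum(min(c, math.dist(x, Y[j])) ** p for x, j in zip(X, perm))
        for perm in itertools.permutations(range(n), m)
    )
    # Localization part plus cardinality penalty c^p for each unmatched point.
    return ((best + c ** p * (n - m)) / n) ** (1.0 / p)


error = ospa([(0.0, 0.0)], [(3.0, 4.0)], c=10.0, p=1)
```

For the single pair above the error is the Euclidean distance 5. With one point missing from the prediction, for example `ospa([(0.0, 0.0)], [(0.0, 0.0), (100.0, 100.0)])`, the cardinality penalty alone gives (0 + 10) / 2 = 5.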

[0112] The beneficial effects of this invention are as follows: The method first designs and acquires radar simulation data to generate two-dimensional measurement data from multiple sensors. Then, the measurement data is preprocessed, and high-dimensional vector information is extracted and input into a pre-constructed TMSHF network model for training. Finally, test data is input into the trained network model to obtain the multi-target, multi-sensor intelligent tracking fusion result. The invention combines the optimal-fusion concept of feature-level fusion with the scalability and stability of decision-level fusion. Taking a data-driven perspective, it overcomes the limitation of existing fusion methods that rely solely on feature-level or decision-level fusion and makes deep use of the information from multiple local sensors. This forms a hybrid feature-level and decision-level multi-sensor fusion architecture that realizes high-precision continuous tracking fusion of multiple targets. The method offers high precision, continuity, robustness, good scalability, and low communication requirements, and can be applied in fields such as autonomous driving and traffic management.

Attached Figure Description

[0113] Figure 1 is a flowchart of the multi-sensor intelligent fusion multi-target tracking method based on the Transformer architecture according to the present invention.

[0114] Figure 2 is a schematic diagram of the overall architecture of the TMSHF network model in an embodiment of the present invention.

[0115] Figure 3 is a schematic diagram illustrating the state changes during continuous tracking by the single-sensor tracking module in an embodiment of the present invention.

[0116] Figure 4 is a schematic diagram of the state regression mechanism in the single-sensor tracking module in an embodiment of the present invention.

[0117] Figure 5 is a schematic diagram of the multi-sensor hybrid fusion module in an embodiment of the present invention.

[0118] Figure 6 is a comparison curve of the OSPA index for Task 1 in an embodiment of the present invention.

[0119] Figure 7 is an example diagram of the evaluation of Task 1 in an embodiment of the present invention.

[0120] Figure 8 is a comparison curve of the OSPA index for Task 2 in an embodiment of the present invention.

[0121] Figure 9 is an example diagram of the evaluation of Task 2 in an embodiment of the present invention.

Detailed Implementation

[0122] This embodiment primarily employs simulation experiments for verification. All steps and conclusions have been verified correctly using Matlab 2021b and Python 3.9. The method of this invention will be further described below with reference to the accompanying drawings and embodiments.

[0123] Figure 1 shows the flowchart of the multi-sensor intelligent fusion multi-target tracking method based on the Transformer architecture of the present invention. The specific steps are as follows:

[0124] S1. Design radar simulation data to generate multi-sensor two-dimensional measurements, perform data preprocessing, and divide the data into training and testing sets for the network model.

[0125] S2. Construct the TMSHF network model, including: a single-sensor tracking module and a multi-sensor hybrid fusion module;

[0126] S3. Design a loss function and input the multi-sensor two-dimensional measurement training set obtained in step S1 into the TMSHF network model constructed in step S2 to train the network model.

[0127] S4. Input the multi-sensor two-dimensional measurement test set obtained in step S1 into the TMSHF network model trained in step S3, output the deep fusion test tracking results of the multi-target and evaluate them.

[0128] In this embodiment, step S1 is specifically as follows:

[0129] First, a large amount of radar simulation data is generated using standard state and measurement equations as the dataset for the deep learning network. The target motion is approximated as uniform motion. Let $x_t^i$ denote the state vector of target i at frame t and $x_{t-1}^i$ its state vector at frame t-1. The equation of motion is expressed as follows:

[0130] $x_t^i = F_t x_{t-1}^i + W_{t-1}$

[0131] where $F_t$ represents the state transition matrix and $W_{t-1}$ represents the process noise, which is Gaussian with zero mean and covariance $Q_{t-1}$.

[0132] In a two-dimensional scene, the covariance matrix $Q_{t-1}$ is expressed as follows:

[0133]

[0134]

[0135] where x and y represent the x-axis and y-axis positions of target i at frame t, and the corresponding velocity components represent its x-direction and y-direction velocities at frame t; $q_s$ represents the process noise variance, T represents the sensor sampling period, $I_2$ represents the second-order identity matrix, and $\otimes$ represents the Kronecker product.

[0136] New targets arrive according to a Poisson point process with birth intensity $\lambda_b$. The states of all targets present in frame t form the target state set of that frame.

[0137] Set up a group of sensors with the same field of view (FOV) to detect the same set of targets. Then, use a simulated sensor to generate measurements in a two-dimensional Cartesian coordinate system, where each existing target can generate at most one actual measurement. The observation equation for the actual measurements captured by sensor s is as follows:

[0138]

[0139] where the left-hand side is the measurement of target i obtained by sensor s at frame t, the measurement noise of sensor s has zero mean and a sensor-specific variance, and $H_s$ is the measurement matrix of sensor s, expressed as follows:

[0140]

[0141] Clutter measurements arrive according to a Poisson point process with a given intensity, independently of the existing targets and of the actual measurements. The set of all measurements in frame t is expressed as follows:

[0142]

[0143] Here, the clutter term denotes the set of clutter generated by sensor s in frame t.
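As a rough illustration of the measurement model described above, the sketch below generates one scan for a single sensor. Several details are assumptions rather than statements from the patent: the measurement matrix is taken to select the position components, measurement noise is drawn from a Gaussian, clutter is placed uniformly over a square field of view, and the detection probability `p_detect` is a hypothetical parameter that realizes "at most one measurement per target".

```python
import math
import random


def sample_poisson(lam, rng):
    """Draw a Poisson-distributed count (Knuth's method, fine for small lam)."""
    limit = math.exp(-lam)
    count, prod = 0, rng.random()
    while prod > limit:
        count += 1
        prod *= rng.random()
    return count


def simulate_scan(states, sigma, p_detect, clutter_rate, fov, rng):
    """Return (shuffled scan, target measurements, clutter) for one frame.

    states: list of [x, y, vx, vy]; fov: (low, high) bounds on both axes.
    """
    lo, hi = fov
    target_meas = []
    for x, y, _vx, _vy in states:
        # Each existing target yields at most one noisy position measurement.
        if rng.random() < p_detect:
            target_meas.append((x + rng.gauss(0.0, sigma),
                                y + rng.gauss(0.0, sigma)))
    # Clutter count is Poisson; clutter positions are assumed uniform in the FOV.
    n_clutter = sample_poisson(clutter_rate, rng)
    clutter = [(rng.uniform(lo, hi), rng.uniform(lo, hi))
               for _ in range(n_clutter)]
    scan = target_meas + clutter
    rng.shuffle(scan)  # measurements carry no target labels or ordering
    return scan, target_meas, clutter


rng = random.Random(0)
states = [[1.0, 2.0, 0.5, 0.0], [-3.0, 4.0, 0.0, -0.5]]
scan, target_meas, clutter = simulate_scan(
    states, sigma=0.1, p_detect=0.95, clutter_rate=3.0, fov=(-10.0, 10.0), rng=rng
)
print(len(target_meas), len(clutter))
```

Running the same function once per sensor, each with its own noise level, gives the multi-sensor measurement sets for a frame.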

[0144] Based on the standard state and measurement equations described above, a large amount of multi-sensor radar simulation data is generated as the dataset for the deep learning network. Finally, data preprocessing is performed. First, the two-dimensional measurements from the multiple sensors are normalized within the field of view. Then, a linear layer is used to extract high-dimensional feature vectors from the Cartesian-coordinate measurements. The measurement data are generated by simulation according to actual conditions and divided into a training set and a test set. The training set is used to train the network model, and the test set is used to test the trained TMSHF model to obtain the tracking fusion results.
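The two preprocessing steps can be sketched as follows. This is only an illustration: the patent does not state the normalization range or the layer size, so mapping the field of view to [0, 1], the feature width of 8, and the randomly initialized (untrained) weights are all assumptions, as are the function names.

```python
import random


def normalize_to_fov(point, fov):
    """Map each coordinate from the FOV interval [lo, hi] to [0, 1] (assumed range)."""
    lo, hi = fov
    return tuple((c - lo) / (hi - lo) for c in point)


def make_linear(in_dim, out_dim, rng):
    """Create hypothetical weights and a zero bias for a linear layer."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)]
               for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weights, bias


def linear(x, weights, bias):
    """Apply y = W x + b to a single input vector."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


rng = random.Random(0)
weights, bias = make_linear(in_dim=2, out_dim=8, rng=rng)
normalized = normalize_to_fov((5.0, -10.0), fov=(-10.0, 10.0))
feature = linear(normalized, weights, bias)
print(normalized, len(feature))
```

In a trained model the weights would be learned jointly with the rest of the network rather than drawn at random.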

[0145] This embodiment sets up two tasks. Task 1 compares the tracking accuracy of TMSHF with that of Kalman filter (KF) tracking fused by covariance intersection (CI). Task 2 compares the fusion accuracy and uncertainty of TMSHF with those of CI fusion. The parameters for the two tasks are shown in Table 1. The multi-sensor two-dimensional measurements are then normalized within the field of view, and a linear layer maps the x-y coordinate measurements to high-dimensional feature vectors.

[0146] Table 1

[0147]

[0148] The TMSHF network is built upon SR-MT3, a high-performance deep learning method for online continuous tracking based on the Transformer architecture. TMSHF uses a parallel encoder-decoder architecture for each sensor to perform MTT independently, and uses a single decoder for fusion to integrate the results from all sensors.

[0149] In this embodiment, step S2 is specifically as follows:

[0150] As shown in Figure 2, the TMSHF network model includes a single-sensor tracking module and a multi-sensor hybrid fusion module.

[0151] The single-sensor tracking module includes: an encoder for analyzing target motion state information; a decoder for parsing and predicting target state; and a state regression mechanism.

[0152] First, the encoder analyzes the target's motion state information, encoding the measurements into a high-dimensional feature vector. The state regression mechanism obtains a newborn query, an autoregressive query, and a mask from the output predicted in the previous frame. Finally, the newborn query, autoregressive query, mask, and the high-dimensional feature vector obtained from the encoder are input into the decoder to predict the target state in the current frame, resulting in a predicted state output.

[0153] The multi-sensor hybrid fusion module includes: a multi-target distributed CI fusion algorithm module (MDCI) and a hybrid fusion network module (HFN).

[0154] The predicted state outputs of the decoders of the multiple local sensors are initially fused using the multi-target distributed CI fusion rule to form the decision-level information. Finally, the feature-level information and the decision-level information are input into the fusion decoder as embeddings and queries, respectively, to obtain the deeply fused target predicted state and uncertainty.

[0155] To provide a clearer overview of the TMSHF algorithm architecture, we will first introduce the formula symbols involved below.

[0156] The single-sensor tracking task addresses the state estimation of multiple targets. The state estimate for the current frame T uses the state estimate from the previous frame together with the measurement sequence of sensor s over the past τ time steps up to the current frame T. The measurements detected by sensor s within this window are collected in random order. The measurement sequence is expressed as follows:

[0157]

[0158] where

[0159]

[0160] Here, each element denotes the measurement of target i obtained by sensor s at frame t, together with the time information t; [·]′ denotes the matrix transpose; a count symbol denotes the number of measurements detected by sensor s in frame t; and the coordinate pair denotes the x-axis and y-axis coordinates of measurement k.

[0161] The target state estimated by sensor s in the previous frame is then expressed as follows:

[0162]

[0163] Here, the count symbol denotes the number of targets estimated by sensor s in the previous frame, and the remaining two symbols denote the estimated coordinates of a target and the standard deviation of its estimated covariance.

[0164] The result of each sensor performing a single-sensor tracking task is a set of estimated states and standard deviations of the target in the current frame, expressed as follows:

[0165]

[0166] To improve the performance of multi-sensor fusion tasks, a hybrid fusion method combining feature-level fusion and decision-level fusion is adopted. The measurements from all sensors used for feature-level fusion are expressed as follows:

[0167]

[0168] Where S represents the number of sensors involved in feature-level fusion.

[0169] For decision-level fusion, all states estimated by single-sensor tracking are used as input, expressed as follows:

[0170]

[0171] Here, the count symbol denotes the number of targets estimated by sensor s in frame T.

[0172] The result of a multi-sensor fusion task is a set of fused estimated states and standard deviations, expressed as follows:

[0173]

[0174] Here, the count symbol denotes the estimated number of targets after fusion, and the superscript F indicates fusion.

[0175] For each sensor, the measurement sequence is input to the Transformer encoder and converted into an embedding. The embedding is then fed, together with the queries, into the Transformer decoder to generate the estimated state.

[0176] One copy of the estimated state is processed by the state regression mechanism and recursively returned to the decoder as a query in the next frame. Another copy is collected, together with the outputs of the other sensors, into a set that holds the outputs of all sensors.

[0177] The multi-target distributed covariance intersection (MDCI) fusion algorithm performs a preliminary fusion of this estimated state set to generate a query sequence. These queries and the concatenation of the embeddings from the S sensors are input together into the fusion decoder, which transforms them into the final fusion estimate.

[0178] Here, $n_C$ denotes the number of targets after MDCI, and $n_F$ denotes the number of targets after hybrid fusion.

[0179] Figure 3 illustrates the state changes during continuous tracking by the single-sensor tracking module. In this embodiment, the single-sensor tracking module of step S2 is specifically as follows:

[0180] The single-sensor tracking module, namely the single-sensor SR-MT3 tracking module, uses the SR-MT3 network framework for single-sensor tracking.

[0181] The output of the single-sensor tracking module is derived from two types of queries: k newborn queries, which allow the model to initialize trajectories for targets that did not exist in the previous frame; and k autoregressive queries, which are responsible for tracking trajectories that existed in the previous frame.

[0182] Here, k is a preset value equal to the maximum number of targets. The decoder query $o_{1:2k}$ is obtained by concatenating the newborn queries $n_{1:k}$ and the autoregressive queries $r_{1:k}$ along the k dimension, and it is used to jointly predict the target states of the current frame.

[0183] Figure 4 is a schematic diagram of the state regression mechanism in the single-sensor tracking module of this embodiment. When the number of targets has not reached the number of queries, a masking mechanism based on the existence probability is used to construct the state regression module, which generates the newborn queries $n_{1:k}$, the autoregressive queries $r_{1:k}$, and the mask $m_{1:2k}$, forming a new state regression query, as detailed below:

[0184] (1) Tracking initialization is achieved through newborn query;

[0185] Each newborn query is initialized from a static, learned target encoding, and new targets appearing in the current frame are detected by a fixed number k of output embeddings. During iteration, the newborn queries input to the decoder are denoted $n_{1:k}$.

[0186] (2) The autoregressive query iterates during the tracking process;

[0187] First, from the existence probabilities $P_{t-1,1:2k}$ of the previous frame, the k highest existence probabilities are selected. The states corresponding to these top-k existence probabilities are then selected as query candidates, expressed as follows:

[0188]

[0189] Here, the sequence length is a non-negative integer k; $r = \mathrm{argsort}(P_{t-1,1:2k})$, where argsort is a function that returns the indices of the input array sorted by existence probability; $r_l$ is an index that gives one of the top-k existence probabilities; and the selected element denotes the query candidate chosen according to index $r_l$.
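The top-k selection described above can be sketched as follows. The function name `select_topk` and the list-based data layout are illustrative assumptions; only the idea of sorting the 2k existence probabilities and keeping the k largest comes from the text.

```python
def select_topk(exist_probs, states, k):
    """Pick the k states with the highest previous-frame existence probability.

    exist_probs: existence probabilities of the 2k previous outputs.
    states:      the corresponding 2k estimated states.
    Returns (indices, selected states, selected probabilities).
    """
    # argsort in descending order of existence probability
    order = sorted(range(len(exist_probs)),
                   key=lambda i: exist_probs[i], reverse=True)
    top = order[:k]
    return top, [states[i] for i in top], [exist_probs[i] for i in top]


probs = [0.1, 0.9, 0.4, 0.8]          # 2k = 4 outputs from the previous frame
prev_states = ["a", "b", "c", "d"]    # placeholders for estimated states
indices, candidates, top_probs = select_topk(probs, prev_states, k=2)
print(indices, candidates, top_probs)
```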

[0190] After normalization based on the field of view, the selected sequence is fed into a feedforward neural network layer for nonlinear mapping and feature extraction, generating the decoder queries $r_{1:k}$, expressed as follows:

[0191]

[0192] Each element belongs to $\mathbb{R}^{d'}$, where $\mathbb{R}$ denotes the field of real numbers, $d' > d_z$ is a hyperparameter, and $d_z$ denotes the measurement dimension.

[0193] (3) Tracking termination is achieved through the mask based on the existence probability;

[0194] A masking mechanism based on existence probability is adopted: the k highest existence probabilities are selected by the top-k mechanism and fed into a linear layer to calculate the existence threshold $g_t$, expressed as follows:

[0195]

[0196] Here, the parameter symbol denotes the learnable parameters. Because the autoregressive queries and the selected existence probabilities are in one-to-one correspondence, the mask $m_i$ is calculated from $g_t$ as follows:

[0197]

[0198] Here, the probability term denotes the existence probability of the l-th element in the sequence, whose length is a non-negative integer 2k. Because newborn queries have no corresponding existence probability, their masks are always False.
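A minimal sketch of the threshold and mask computation follows. The mask equation itself is not reproduced in the text, so the convention used here (True means the autoregressive query is masked out because its existence probability falls below the threshold) is an assumption, as are the example weights of the linear layer and the function names.

```python
def existence_threshold(top_probs, weights, bias):
    """Linear layer mapping the top-k existence probabilities to a threshold g_t."""
    return sum(w * p for w, p in zip(weights, top_probs)) + bias


def build_mask(top_probs, g_t):
    """Build the 2k-long mask for [newborn queries, autoregressive queries].

    Newborn queries have no existence probability, so their mask is always False.
    Assumed convention: an autoregressive query is masked (True) when p < g_t.
    """
    k = len(top_probs)
    newborn_mask = [False] * k
    autoregressive_mask = [p < g_t for p in top_probs]
    return newborn_mask + autoregressive_mask


top_probs = [0.9, 0.2]                       # top-k existence probabilities, k = 2
g_t = existence_threshold(top_probs, weights=[0.5, 0.5], bias=0.0)
mask = build_mask(top_probs, g_t)
print(g_t, mask)
```

With these hypothetical weights the threshold is the mean of the two probabilities, so the low-probability track is masked and effectively terminated.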

[0199] Figure 5 shows the multi-sensor hybrid fusion module of this embodiment. In step S2, the multi-sensor hybrid fusion module is specifically as follows:

[0200] The multi-sensor fusion module consists of a multi-target distributed CI fusion algorithm (MDCI) and a hybrid fusion network (HFN). The target state estimates from the sensors, after coordinate system calibration, are processed sequentially by MDCI and HFN to form the final fused estimate. Since the fusion process only involves the current frame, the time step T is omitted in this part.

[0201] (1) Design MDCI;

[0202] MDCI performs the initial decision-level fusion and generates the queries for the fusion decoder. The total output of the single-sensor decoders contains clutter as well as measurements from different sensors for different targets. To group the measurements from different sensors that refer to the same target, the distance between each pair of measurements is compared with a distance threshold δ. If the distance is greater than the threshold, the two measurements are classified as different targets. If the distance is less than the threshold, the two measurements are classified as one target.

[0203] After the measurements are grouped, the existing CI fusion rule is used to fuse the estimates of the same target. The fused result is expressed as follows:

[0204]

[0205] Here, the two symbols denote, for the i-th target after MDCI, the fusion result of the x-y position and the uncertainty of the x-y position, respectively.
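The grouping and fusion steps can be illustrated for two sensors as below. This is a sketch under assumptions, not the patented algorithm: covariances are taken as diagonal (per-axis variances), the CI weight is chosen by a simple grid search that minimizes the trace of the fused covariance, and association is a greedy nearest-neighbour match within the threshold `delta`. The names `ci_fuse` and `mdci` are hypothetical.

```python
import math


def ci_fuse(x1, var1, x2, var2, steps=100):
    """Covariance intersection of two estimates with diagonal covariances.

    Fused information = w * P1^-1 + (1 - w) * P2^-1, with w picked on a grid
    to minimize the trace of the fused covariance.
    """
    best = None
    for i in range(steps + 1):
        w = i / steps
        info = [w / a + (1.0 - w) / b for a, b in zip(var1, var2)]
        fused_var = [1.0 / v for v in info]
        trace = sum(fused_var)
        if best is None or trace < best[0]:
            fused_x = [fv * (w * p / a + (1.0 - w) * q / b)
                       for fv, p, a, q, b in zip(fused_var, x1, var1, x2, var2)]
            best = (trace, fused_x, fused_var)
    return best[1], best[2]


def mdci(est_a, est_b, delta):
    """Group estimates of two sensors by distance, then fuse each group with CI.

    est_a, est_b: lists of (position, variance) pairs; unmatched estimates pass through.
    """
    fused, used_b = [], set()
    for pos_a, var_a in est_a:
        best_j, best_d = None, delta
        for j, (pos_b, _var_b) in enumerate(est_b):
            d = math.dist(pos_a, pos_b)
            if j not in used_b and d < best_d:
                best_j, best_d = j, d
        if best_j is None:
            fused.append((list(pos_a), list(var_a)))
        else:
            used_b.add(best_j)
            fused.append(ci_fuse(pos_a, var_a, *est_b[best_j]))
    fused += [(list(p), list(v))
              for j, (p, v) in enumerate(est_b) if j not in used_b]
    return fused


sensor_1 = [((0.0, 0.0), (1.0, 4.0))]
sensor_2 = [((1.0, 1.0), (4.0, 1.0)), ((10.0, 10.0), (1.0, 1.0))]
result = mdci(sensor_1, sensor_2, delta=2.0)
print(result)
```

In this example the first pair lies within the threshold and is fused, while the distant estimate from the second sensor is kept as a separate target.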

[0206] (2) Design HFN;

[0207] HFN performs the hybrid fusion task. During single-sensor tracking, the features of the measurements detected by each sensor are extracted by the encoder and stored in an embedding. HFN performs feature fusion by concatenating these embeddings, as shown in the following expression:

[0208]

[0209] Here, $m_s$ denotes the number of measurements of sensor s; concat denotes the end-to-end concatenation of a list of sequences; the result is the embedding sequence of the fusion decoder; and $m_C$ denotes the length of that embedding sequence.

[0210] Decision fusion first generates the preliminary fusion result through MDCI processing. The states in this result are normalized using the FOV parameters and are then fed, together with the covariances, into a feedforward neural network (FFN) to form the fusion decoder query sequence, expressed as follows:

[0211]

[0212] Here, $n_C$ denotes the number of targets present after MDCI processing.

[0213] Finally, the features of the measurements from all sensors and the estimated results are stored in the embedding sequence and the query sequence, respectively. The fusion decoder performs hybrid fusion and transforms them into fusion estimates. To determine whether each estimate is valid, a corresponding sequence of existence probabilities is also output. Estimates whose existence probability is greater than the probability threshold g are selected for the final fusion estimate, expressed as follows:

[0214]

[0215] Where ε(·) represents the step function.
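The selection of valid fusion estimates can be sketched as below. The exact form of the expression is not reproduced in the text, so applying the step function to the difference between the existence probability and the threshold g is an assumption, as are the function names.

```python
def step(value):
    """Unit step function: 1 for positive input, 0 otherwise."""
    return 1 if value > 0 else 0


def select_valid(estimates, exist_probs, g):
    """Keep the fusion estimates whose existence probability exceeds g."""
    return [est for est, p in zip(estimates, exist_probs)
            if step(p - g) == 1]


fused_estimates = ["target_a", "target_b", "target_c"]  # placeholder estimates
kept = select_valid(fused_estimates, [0.9, 0.3, 0.6], g=0.5)
print(kept)
```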

[0216] In this embodiment, step S3 is specifically as follows:

[0217] The final fusion estimate is designated as the output of virtual sensor 0, so that the sensor label becomes s ∈ [0, S]. Since only the loss of the current frame is calculated, the time step T is omitted.

[0218] The estimate of sensor s consists of the parameters of a multi-Bernoulli (MB) density with $n_s$ components. Each estimate comprises the mean and covariance parameters of a Gaussian distribution and the existence probability associated with the Bernoulli component. The predictions from all sensors (including the fused result) are compared with the ground truth and supervised with the negative log-likelihood (NLL) loss, expressed as follows:

[0219]

[0220] Here, the likelihood term denotes the multi-Bernoulli density defined by the estimate of sensor s, evaluated at the ground-truth set.

[0221] Because directly calculating $f_s(\cdot)$ is intractable, the ground-truth sequence is extended with additional padding elements to produce a new sequence with the same number of elements as the estimate. The negative log-likelihood (25) is then approximated as:

[0222]

[0223] Here, σ denotes a permutation function over the indices of the new sequence; $n_l$ denotes the length of the new sequence; and the indexed term denotes the σ(i)-th element of the new sequence. The permutation represents a potential association between the MB components and the ground-truth target states. The remaining term denotes the negative logarithm of the Bernoulli density specified by the i-th component, evaluated at the σ(i)-th element, and is calculated as follows:

[0224]

[0225] Finally, $\sigma_s$, the most probable association between the targets and the Bernoulli components predicted for sensor s, is approximated as follows:

[0226]

[0227] Here, the score term denotes an association score between a target and a Bernoulli component predicted by the sensor. The optimal association can be computed efficiently using the Hungarian algorithm, as shown below:

[0228]
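The association step can be illustrated with the sketch below. To stay short and dependency-free it enumerates all permutations instead of running the Hungarian algorithm, which gives the same optimum for small problems but does not scale. The cost terms are assumptions in the style of multi-Bernoulli losses: a matched component costs the negative log of its existence probability times a diagonal Gaussian density, and a component matched to a padding element (`None`) costs the negative log of its non-existence probability.

```python
import math
from itertools import permutations


def gaussian_nll(x, mean, std):
    """Negative log-density of a diagonal Gaussian evaluated at x."""
    return sum(0.5 * ((xi - m) / s) ** 2 + math.log(s) + 0.5 * math.log(2 * math.pi)
               for xi, m, s in zip(x, mean, std))


def bernoulli_cost(pred, target):
    """Negative log Bernoulli density of one predicted component at a target."""
    mean, std, p_exist = pred
    if target is None:                      # padding element: target absent
        return -math.log(1.0 - p_exist)
    return -math.log(p_exist) + gaussian_nll(target, mean, std)


def best_association(preds, targets):
    """Brute-force search for the permutation with the lowest total cost.

    Assumes len(preds) >= len(targets) and existence probabilities in (0, 1).
    """
    padded = list(targets) + [None] * (len(preds) - len(targets))
    best_cost, best_perm = math.inf, None
    for perm in permutations(range(len(padded))):
        cost = sum(bernoulli_cost(preds[i], padded[j]) for i, j in enumerate(perm))
        if cost < best_cost:
            best_cost, best_perm = cost, perm
    return best_perm, best_cost


preds = [((0.0, 0.0), (1.0, 1.0), 0.9),   # (mean, std, existence probability)
         ((5.0, 5.0), (1.0, 1.0), 0.9),
         ((9.0, 9.0), (1.0, 1.0), 0.1)]
targets = [(5.1, 5.0), (0.0, 0.1)]
perm, cost = best_association(preds, targets)
print(perm, round(cost, 3))
```

Here `perm[i]` is the index of the padded ground-truth element assigned to prediction `i`; summing the matched costs over all sensors, including the virtual fusion sensor, would give a training loss of the kind described above.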

[0229] The entire training process of the network model is divided into forward propagation and backward propagation. A large amount of training data is obtained through step S1 and input into the network model. At the same time, the weights of each unit are continuously adjusted using the loss function.

[0230] The design of the intelligent multi-sensor multi-target tracking network is completed based on step S2, and the design of the loss function is completed according to step S3. In this embodiment, the key parameter information of the TMSHF network model is shown in Table 1.

[0231] Table 1

[0232]
Parameter | Value
Encoder layers | 6
Decoder layers | 6
Number of attention heads | 8
Number of hidden units in FFN | 2048
State dimension | 256
Batch size | 16

[0233] In this embodiment, step S4 specifically includes the following:

[0234] After step S3, the training of the TMSHF network is complete. The test dataset generated in step S1 is then input into the trained network model to obtain the prediction results. The optimal subpattern assignment (OSPA) metric is used to evaluate the model's performance, as follows:

[0235]

[0236] $d_c(x, y) = \min\{c, d(x, y)\}$ (32)

[0237] Here, $D_{p,c}$ denotes the OSPA error between the prediction result and the ground truth Γ, both of which are the sets appearing in equation (31); p denotes the distance sensitivity parameter; c denotes the cutoff distance parameter, i.e., the target state estimation error threshold, which adjusts the balance between the cardinality estimation error and the position error; and d(·) denotes the distance function. In this embodiment, the cutoff distance parameter c = 1 and the distance sensitivity parameter p = 1 are selected, and 1000 Monte Carlo runs are performed for each experiment. Two tasks are set: Task 1 compares the tracking accuracy of TMSHF with that of KF tracking followed by CI fusion, and Task 2 compares the fusion accuracy and uncertainty of TMSHF with those of CI fusion. The evaluation results are shown in Figures 6, 7, 8, and 9.
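For reference, the OSPA metric with a cutoff c and order p can be computed as in the sketch below. It uses the standard OSPA definition with a Euclidean base distance and a brute-force search over assignments in place of an optimal assignment solver, so it is suitable only for small sets. The function name and the point-tuple representation are illustrative assumptions.

```python
import math
from itertools import permutations


def ospa(X, Y, c=1.0, p=1.0):
    """OSPA distance between two finite sets of points (tuples of coordinates)."""
    if len(X) > len(Y):
        X, Y = Y, X                      # ensure |X| <= |Y|
    m, n = len(X), len(Y)
    if n == 0:
        return 0.0                       # both sets empty
    # Best assignment of every point in X to a distinct point in Y,
    # using the cutoff distance d_c(x, y) = min(c, d(x, y)).
    best = min(
        sum(min(c, math.dist(x, Y[j])) ** p for x, j in zip(X, perm))
        for perm in permutations(range(n), m)
    )
    # Unassigned points in Y each contribute the cardinality penalty c^p.
    return ((best + (c ** p) * (n - m)) / n) ** (1.0 / p)


truth = [(0.0, 0.0), (5.0, 5.0)]
estimate = [(0.0, 0.0)]
print(ospa(estimate, truth, c=1.0, p=1.0))
```

With c = 1 and p = 1, as chosen in this embodiment, a missed or false target costs exactly 1 before averaging, and localization errors are capped at 1.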

[0238] Figure 6 shows the comparison curves of the OSPA metric for Task 1 in this embodiment of the invention. The dashed lines with triangle and star markers represent the KF tracking results using the measurement data from sensor 1 and sensor 2, respectively. The solid lines with triangle and star markers represent the TMSHF tracking results using the measurement data from sensor 1 and sensor 2, respectively. The dashed line with circle markers represents the CI fusion of the KF predictions from the two sensors, and the solid line with circle markers represents the deep fusion of TMSHF. The TMSHF algorithm undergoes an initialization process in the first two frames; therefore, the average OSPA curves are compared starting from the third frame. Figure 6 shows that the accuracy of TMSHF fusion is significantly higher than that obtained from the measurements of a single sensor. Overall, the TMSHF algorithm exhibits high tracking accuracy, whether tracking with a single sensor or fusing the tracking results from different sensors.

[0239] Figure 7 is an evaluation example of Task 1 in this embodiment of the invention. The sample has three trajectories, and the figure is a magnified view of one of them, showing the tracking results of the different algorithms across four frames. The smaller plus (+) and cross (×) markers represent the KF tracking results using the measurements of sensor 1 and sensor 2, respectively. The larger plus and cross markers represent the TMSHF tracking results using the measurements of sensor 1 and sensor 2, respectively. The smaller circles represent the CI fusion results of the KF predictions, and the larger circles represent the deep fusion results of TMSHF. The line marked with squares represents the target's true trajectory. For most points in the figure, the target position estimated by TMSHF fusion is closer to the true position than the one estimated by CI fusion, demonstrating the superior fusion performance of TMSHF. From another perspective, the position estimated by CI fusion tends to be closer to the position predicted by the more accurate sensor (sensor 1). However, this strategy is not always correct, which limits CI fusion. The TMSHF algorithm can effectively overcome this problem and accurately predict the target position.

[0240] Figure 8 shows the comparison curves of the OSPA metric for Task 2 of this embodiment. The upward and downward triangles represent the TMSHF tracking performance using the measurements of sensor 1 and sensor 2, respectively. The circles represent the CI fusion results using all local predictions of TMSHF, and the cross symbols represent the deep fusion results of TMSHF. In Task 2, both the CI algorithm and the TMSHF algorithm fuse the local predictions of TMSHF, which eliminates the influence of different single-sensor trackers (KF and SR-MT3) and presents the fusion effect more clearly. As can be seen from Figure 8, the performance of TMSHF deep fusion is significantly higher than that of single-sensor prediction, and direct CI fusion using only deep learning results is unreliable.

[0241] Figure 9 is an evaluation example of Task 2 in this embodiment of the invention. The sample has four trajectories, and the figure is a magnified view showing the predictions of the different algorithms in frame 12. The thick dashed and thick solid circles represent the estimation uncertainties of sensor 1 and sensor 2, and the plus (+) and cross (×) markers represent their estimated positions. The larger thin solid circles and their central dots represent the uncertainty and predicted position after CI fusion, while the smaller thin solid circles and their central dots represent the uncertainty and predicted position after TMSHF deep fusion. The results show that when the true target position lies within the uncertainty ranges of sensor 1 and sensor 2, both CI and TMSHF can correctly predict the target position, but TMSHF predicts a lower uncertainty. When the true target position lies outside the uncertainty range of one or even both sensors, CI fusion still biases towards the local prediction with lower uncertainty, leading to incorrect predictions. Despite this erroneous interference, TMSHF can still perform correct fusion.

[0242] In summary, the method of this invention first uses the SR-MT3 tracking encoder modules of multiple local sensors to extract high-dimensional feature information from the multi-sensor measurement data, and merges it as feature-level information. Then, at the initial decision layer, the covariance intersection (CI) fusion rule is used to initially fuse the predicted state outputs of the decoders of the multiple local sensors as decision-level information. Finally, the feature-level information and the decision-level information are input into the fusion decoder as embeddings and queries, respectively, to obtain the deeply fused target predicted state and uncertainty. As can be seen from the embodiments, the method of this invention completes the multi-sensor multi-target tracking fusion task well. It integrates the intelligent multi-target tracking task and the multi-sensor fusion task into a unified algorithm, combining the optimal fusion idea of feature-level fusion with the easy scalability and stable reliability of decision-level fusion, and thereby realizes an intelligent tracking fusion method that combines feature-level and decision-level fusion. Compared with the KF tracking results and the fusion results obtained by the existing CI method, TMSHF has significant advantages, especially in complex tasks, where it shows high accuracy and robust fusion capability.

[0243] Those skilled in the art will recognize that the embodiments described herein are intended to help the reader understand the principles of the invention, and it should be understood that the scope of protection of the invention is not limited to such specific statements and embodiments. Those skilled in the art can make various other specific modifications and combinations based on the technical teachings disclosed in this invention without departing from the spirit of the invention, and these modifications and combinations remain within the scope of protection of this invention.

Claims

1. A multi-sensor intelligent fusion multi-target tracking method based on the Transformer architecture, the specific steps of which are as follows:

S1. Design radar simulation data to generate multi-sensor two-dimensional measurements, perform data preprocessing, and divide the data into a training set and a test set for the network model.

S2. Construct the TMSHF network model, which includes a single-sensor tracking module and a multi-sensor hybrid fusion module;

The single-sensor tracking module includes: an encoder for analyzing target motion state information; a decoder for parsing and predicting the target state; and a state regression mechanism. First, the encoder analyzes the target's motion state information and encodes the measurements into a high-dimensional feature vector. The state regression mechanism obtains the newborn query, the autoregressive query, and the mask from the output predicted in the previous frame. Finally, the newborn query, the autoregressive query, the mask, and the high-dimensional feature vector obtained by the encoder are input into the decoder to predict the target state in the current frame and obtain the predicted state output;

The multi-sensor hybrid fusion module includes: a multi-target distributed CI fusion algorithm module (MDCI) and a hybrid fusion network module (HFN). The multi-target distributed CI fusion rule is used to initially fuse the predicted state outputs of the decoders of multiple local sensors as decision-level information. Finally, the feature-level information and the decision-level information are input into the fusion decoder as embeddings and queries, respectively, to obtain the deeply fused target predicted state and uncertainty;

The single-sensor tracking task addresses the state estimation of multiple targets. The state estimate for the current frame uses the state estimate from the previous frame together with the measurement sequence of the sensor over the past time steps up to the current frame; the measurements detected by the sensor within this window are collected in random order, and the measurement sequence is expressed as in (1), with its terms given by (2) and (3); where each element denotes the measurement of a target obtained by the sensor at a given frame together with its time information, [·]′ denotes the matrix transpose, a count symbol denotes the number of measurements detected by the sensor in that frame, and the coordinate pair denotes the x-axis and y-axis coordinates of a measurement;

The target state estimated by sensor s in the previous frame is expressed as in (4) and (5); where the count symbol denotes the number of targets estimated by the sensor in the previous frame, and the remaining symbols denote the estimated target coordinates and the standard deviation of the estimated target covariance;

The result of each sensor performing the single-sensor tracking task is a set of estimated states and standard deviations of the targets in the current frame, expressed as in (6);

A hybrid fusion method combining feature-level fusion and decision-level fusion is adopted. The measurements from all sensors used for feature-level fusion are expressed as in (7), where S denotes the number of sensors involved in feature-level fusion;

For decision-level fusion, all states estimated by single-sensor tracking are used as input, expressed as in (8), where the count symbol denotes the number of targets estimated by the sensor in the current frame;

The result of the multi-sensor fusion task is a set of fused estimated states and standard deviations, expressed as in (9), where the count symbol denotes the estimated number of targets after fusion and the superscript F indicates the result of fusion;

For each sensor, the measurement sequence is input to the Transformer encoder and converted into an embedding; the embedding is then fed, together with the queries, into the Transformer decoder to generate the estimated state; one copy of the estimated state is processed by the state regression mechanism and recursively returned to the decoder as a query in the next frame; another copy is collected, together with the outputs of the other sensors, into a set that holds the outputs of all sensors;

The multi-target distributed covariance intersection (MDCI) fusion algorithm performs a preliminary fusion of the estimated state set to generate a query sequence; these queries and the concatenation of the embeddings from the S sensors are input together into the fusion decoder, which transforms them into the final fusion estimate; where $n_C$ denotes the number of targets after MDCI and $n_F$ denotes the number of targets after fusion;

The multi-sensor hybrid fusion module is specifically as follows: the multi-sensor fusion module consists of the multi-target distributed CI fusion algorithm (MDCI) and the hybrid fusion network (HFN);

(1) Design MDCI;

MDCI performs the initial decision-level fusion and generates the queries for the fusion decoder; the total output of the decoders contains clutter as well as measurements from different sensors for different targets; the distances between these measurements are compared with a distance threshold δ in order to group the measurements from different sensors that refer to the same target; if the distance is greater than the threshold, the two measurements are classified as different targets; if the distance is less than the threshold, the two measurements are classified as one target;

After the measurements are grouped, the estimates of the same target are fused using the existing CI fusion rule; the fused result is expressed as in (10), where the two symbols denote, for the i-th target after MDCI, the fusion result of the x-y position and the uncertainty of the x-y position, respectively;

(2) Design HFN;

HFN performs the hybrid fusion task; during single-sensor tracking, the features of the measurements detected by each sensor are extracted by the encoder and stored in an embedding; HFN performs feature fusion by concatenating these embeddings, as expressed in (11), where $m_s$ denotes the number of measurements of sensor s, concat denotes the end-to-end concatenation of a list of sequences, the result is the embedding sequence of the fusion decoder, and $m_C$ denotes the length of that embedding sequence;

Decision fusion first generates the preliminary fusion result through MDCI processing; the states in this result are normalized using the FOV parameters and are then fed, together with the covariances, into a feedforward neural network (FFN) to form the fusion decoder query sequence, expressed as in (12), where $n_C$ denotes the number of targets present after MDCI processing;

Finally, the features of the measurements from all sensors and the estimated results are stored in the embedding sequence and the query sequence, respectively; the fusion decoder performs hybrid fusion and transforms them into fusion estimates, and also outputs a corresponding sequence of existence probabilities to determine whether each estimate is valid; estimates whose existence probability is greater than the probability threshold g are selected for the final fusion estimate, expressed as in (13) and (14), where ε(·) denotes the step function;

S3. Design a loss function and input the multi-sensor two-dimensional measurement training set obtained in step S1 into the TMSHF network model constructed in step S2 to train the network model.

S4. Input the multi-sensor two-dimensional measurement test set obtained in step S1 into the TMSHF network model trained in step S3, then output and evaluate the deep-fusion tracking results for the multiple targets.

2. The multi-sensor intelligent fusion multi-target tracking method based on Transformer architecture according to claim 1, characterized in that the specific steps of S1 are as follows:

First, a large amount of radar simulation data is generated using standard state equations and measurement equations as the dataset for the deep learning network. The target motion is approximated as uniform (constant-velocity) motion. Target $i$ has state vector $\mathbf{x}_{k-1}^{i}$ at frame $k-1$ and state vector $\mathbf{x}_{k}^{i}$ at frame $k$, and the equation of motion is expressed as follows:

$$\mathbf{x}_{k}^{i} = F\,\mathbf{x}_{k-1}^{i} + \mathbf{q}_{k-1} \qquad (15)$$

where $F$ represents the state transition matrix and $\mathbf{q}_{k-1}$ represents the process noise, which is Gaussian with zero mean and covariance $Q$. In a two-dimensional scene, the covariance matrix $Q$ is given by expressions (16) and (17). In these expressions, the state of target $i$ at frame $k$ consists of its x-axis position, its y-axis position, its x-direction velocity and its y-direction velocity; $\sigma_q^2$ represents the process noise variance; $T$ indicates the sensor sampling period; $I_2$ represents the second-order identity matrix; and $\otimes$ indicates the Kronecker product.

New targets arrive according to a Poisson point process with birth intensity [missing information], and the set of all target states in frame $k$ is denoted $X_k$. A set of sensors with the same field of view (FOV) detects the same group of targets. Each simulated sensor generates measurements in a two-dimensional Cartesian coordinate system, and each existing target generates at most one true measurement. The observation equation for the true measurement obtained by sensor $s$ is expressed as follows:

$$\mathbf{z}_{k}^{i,s} = H^{s}\,\mathbf{x}_{k}^{i} + \mathbf{r}_{k}^{s} \qquad (18)$$

where $\mathbf{z}_{k}^{i,s}$ indicates the measurement of target $i$ obtained by sensor $s$ in frame $k$, $\mathbf{r}_{k}^{s}$ indicates the measurement noise of sensor $s$, which has zero mean and covariance $R^{s}$, and $H^{s}$ indicates the measurement matrix of sensor $s$, given by expression (19).

Clutter measurements arrive according to a Poisson point process with a given intensity, independently of the existing targets and of the true measurements. The set of all measurements in frame $k$ is given by expression (20), in which the clutter term indicates the set of clutter generated by sensor $s$ in frame $k$.

Finally, data preprocessing is performed. The two-dimensional measurements from the multiple sensors are first normalized within the field of view, and a linear layer is then used to extract high-dimensional feature vectors from the Cartesian measurements. The simulated measurement data are divided into a training set and a test set: the training set is used to train the network model, and the test set is used to test the trained TMSHF model to obtain the tracking fusion results.
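
The simulation in step S1 can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: the transition matrix and process-noise covariance are the standard constant-velocity forms for the state ordering [x, y, vx, vy], used here as stand-ins for expressions (16), (17) and (19), which are not reproduced in the text. The constants, the detection probability, the uniform clutter distribution and the [0, 1] normalization range are all hypothetical choices.

```python
import math
import random

# All values below are illustrative assumptions, not taken from the patent.
T = 0.1               # sensor sampling period
SIGMA_Q = 0.5         # process-noise intensity
SIGMA_R = 0.1         # measurement-noise standard deviation
FOV = (-10.0, 10.0)   # square field of view, same range on both axes
P_DETECT = 0.95       # probability that a target yields a true measurement
CLUTTER_RATE = 3.0    # mean number of clutter points per frame


def kron_with_i2(a):
    """Kronecker product of a 2x2 matrix with the 2x2 identity."""
    out = [[0.0] * 4 for _ in range(4)]
    for i in range(2):
        for j in range(2):
            for d in range(2):
                out[2 * i + d][2 * j + d] = float(a[i][j])
    return out


# Assumed constant-velocity transition for the state [x, y, vx, vy].
F = kron_with_i2([[1.0, T], [0.0, 1.0]])


def step(state, rng):
    """Propagate one frame: x_k = F x_{k-1} + q, with q ~ N(0, Q)."""
    pred = [sum(F[r][c] * state[c] for c in range(4)) for r in range(4)]
    # Cholesky factor of SIGMA_Q^2 * [[T^3/3, T^2/2], [T^2/2, T]], per axis.
    l11 = SIGMA_Q * math.sqrt(T ** 3 / 3)
    l21 = SIGMA_Q * math.sqrt(3 * T) / 2
    l22 = SIGMA_Q * math.sqrt(T) / 2
    pos_noise, vel_noise = [], []
    for _ in range(2):
        u1, u2 = rng.gauss(0, 1), rng.gauss(0, 1)
        pos_noise.append(l11 * u1)
        vel_noise.append(l21 * u1 + l22 * u2)
    return [p + q for p, q in zip(pred, pos_noise + vel_noise)]


def poisson(lam, rng):
    """Draw a Poisson-distributed count using Knuth's method."""
    limit, count, prod = math.exp(-lam), 0, 1.0
    while True:
        prod *= rng.random()
        if prod <= limit:
            return count
        count += 1


def measure(states, rng):
    """One sensor, one frame: noisy target positions plus uniform clutter."""
    z = []
    for s in states:
        if rng.random() < P_DETECT:  # at most one true measurement per target
            z.append([s[0] + rng.gauss(0, SIGMA_R), s[1] + rng.gauss(0, SIGMA_R)])
    for _ in range(poisson(CLUTTER_RATE, rng)):
        z.append([rng.uniform(*FOV), rng.uniform(*FOV)])
    rng.shuffle(z)  # hide which measurements are true and which are clutter
    return z


def normalize(z):
    """Scale field-of-view coordinates to [0, 1] (assumed convention)."""
    lo, hi = FOV
    return [[(v - lo) / (hi - lo) for v in m] for m in z]
```

Calling `step` repeatedly on each target and `measure` once per sensor per frame yields the multi-sensor measurement sequences; `normalize` corresponds to the field-of-view normalization applied before the linear embedding layer.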

3. The multi-sensor intelligent fusion multi-target tracking method based on Transformer architecture according to claim 2, characterized in that, in step S2, the single-sensor tracking module is specifically as follows:

The single-sensor tracking module, namely the single-sensor SR-MT3 tracking module, uses the SR-MT3 network framework for single-sensor tracking. The output of the single-sensor tracking module is derived from two types of query: newborn queries, which allow the model to initialize a trajectory for a target that did not exist in the previous frame, and autoregressive queries, each of which is responsible for tracking a trajectory that existed in the previous frame. The number of queries is a predetermined value equal to the maximum number of targets. The decoder query is formed by combining the newborn queries and the autoregressive queries along one dimension, so that they jointly predict the target states of the current frame. When the target number of queries has not been reached, a masking mechanism based on the existence probability is used to construct a state regression module that generates new queries; the autoregressive queries and the mask together form a completely new state regression query, as detailed below:

(1) Tracking initialization is achieved through the newborn queries. Each embedding is initialized using static, learned target encodings, and new targets appearing in the current frame are detected by a fixed number of output embeddings. During iteration, the newborn queries are the ones input to the decoder.

(2) The autoregressive queries iterate during the tracking process. First, from the existence probabilities of the last frame, the $k$ highest existence probabilities are selected, and the states corresponding to these top-k existence probabilities are taken as the candidates for the query, as given by expression (21). In that expression, the sequence length $k$ is a non-negative integer; the sorting function returns the indices of an input array sorted according to existence probability, which yields the indices of the top-k existence probabilities; and the query candidates are the states selected by those indices. After normalization based on the field of view, the selected sequence is fed into a feedforward neural network layer for nonlinear mapping and feature extraction, generating the query for the decoder as given by expression (22). In expression (22), $\mathbb{R}$ represents the field of real numbers, and the dimensions involved are a hyperparameter and the measurement dimension.

(3) Tracking termination is achieved through a mask based on the existence probability. The $k$ highest existence probabilities selected by the top-k mechanism are fed into a linear layer to calculate the existence threshold, as given by expression (23), whose parameters are learnable. Because the existence probabilities and the selected queries are in one-to-one correspondence, the mask is calculated from the existence probabilities according to expression (24), in which each entry is determined by the corresponding probability in the sequence and the sequence length is a non-negative integer. Newborn queries have no corresponding existence probability, so their masks are always False.
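
The top-k selection in (2) and the existence-probability mask in (3) can be sketched as below. This is only an illustration under assumptions: the patent computes the threshold with a learnable linear layer (expression (23)), whereas this sketch takes a fixed scalar `threshold`; the convention that `True` marks a query to be ignored, and the ordering of newborn queries before autoregressive ones, are likewise assumed. The function names are hypothetical.

```python
def select_topk(states, probs, k):
    """Return the k states with the highest existence probability.

    The index list plays the role of the sorting function in the text:
    indices ordered by descending existence probability, cut to length k.
    """
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    return [states[i] for i in order], [probs[i] for i in order]


def build_mask(top_probs, threshold, n_newborn):
    """Build the query mask (assumed convention: True means 'ignore').

    Newborn queries have no existence probability, so they are never
    masked; an autoregressive query is masked when its probability falls
    below the threshold, which terminates that track.
    """
    return [False] * n_newborn + [p < threshold for p in top_probs]


# Usage: four tracked states from the last frame, keep the best three.
states = [[0, 0], [1, 1], [2, 2], [3, 3]]
probs = [0.2, 0.9, 0.05, 0.6]
candidates, top_probs = select_topk(states, probs, k=3)
mask = build_mask(top_probs, threshold=0.5, n_newborn=2)
```

In the network itself the selected candidates would additionally be normalized by the field of view and passed through the feedforward layer before reaching the decoder.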

4. The multi-sensor intelligent fusion multi-target tracking method based on Transformer architecture according to claim 3, characterized in that step S3 is as follows:

The final fusion estimate is designated as the output of virtual sensor 0, and the set of sensor labels is extended accordingly. The estimate of each sensor represents the parameters of a multi-Bernoulli (MB) density composed of Bernoulli components. Each component consists of a mean, a covariance and an existence probability, where the mean and covariance are the parameters of a Gaussian distribution and the existence probability is the one associated with the Bernoulli component.

The predictions from all sensors are compared with the ground-truth values, and the negative log-likelihood (NLL) loss is used for supervision, as given by expression (25), in which each density term is the Bernoulli density defined by the predicted parameters. The ground-truth sequence is expanded by appending padding elements, generating a new sequence with the same number of elements as the predictions. The negative log-likelihood (25) is then approximated by expression (26), which involves a permutation function over the elements of the new sequence, the length of the new sequence, and the individual elements of the new sequence. The negative logarithm of the Bernoulli density specified by each predicted component is evaluated at each element as given by expression (27).

Finally, the most probable association between the targets and the Bernoulli components predicted by a given sensor is approximated by expression (28), in which the score term indicates the association score between a target and a Bernoulli component predicted by that sensor. This association is computed efficiently with the Hungarian algorithm, as shown in expression (29).

The entire training process of the network model is divided into forward propagation and backward propagation. A large amount of training data is obtained through step S1 and input into the network model, and the weights of each unit are continuously adjusted using the loss function.
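
The matched negative log-likelihood described above can be illustrated with the sketch below. Several points are assumptions rather than the patent's definitions: the per-element term follows the usual Bernoulli form (existence probability times a Gaussian density for a real target, one minus the existence probability for a padding element), the covariance is taken as diagonal, and an exhaustive search over permutations stands in for the Hungarian algorithm, which is only practical for a handful of components. All names are hypothetical.

```python
import itertools
import math


def bernoulli_nll(pred, target):
    """Negative log of an assumed Bernoulli density with a Gaussian part.

    pred is (mean, var, r): a 2-D mean, per-axis variances (diagonal
    covariance) and an existence probability r strictly between 0 and 1.
    target is a 2-D point, or None for a padding element.
    """
    mean, var, r = pred
    if target is None:
        return -math.log(1.0 - r)
    log_gauss = sum(
        -0.5 * math.log(2 * math.pi * v) - 0.5 * (t - m) ** 2 / v
        for m, v, t in zip(mean, var, target)
    )
    return -(math.log(r) + log_gauss)


def matched_nll(preds, targets):
    """Pad targets to len(preds), then minimise the summed NLL.

    Returns the lowest cost and the permutation achieving it, where
    perm[i] is the index of the padded target assigned to prediction i.
    Brute force is used here in place of the Hungarian algorithm.
    """
    padded = list(targets) + [None] * (len(preds) - len(targets))
    best_cost, best_perm = math.inf, None
    for perm in itertools.permutations(range(len(preds))):
        cost = sum(bernoulli_nll(preds[i], padded[j]) for i, j in enumerate(perm))
        if cost < best_cost:
            best_cost, best_perm = cost, perm
    return best_cost, best_perm


# Usage: two predicted components, one true target near the first.
preds = [((0.0, 0.0), (1.0, 1.0), 0.9), ((5.0, 5.0), (1.0, 1.0), 0.1)]
cost, perm = matched_nll(preds, [(0.1, 0.0)])
```

Here the confident component near the origin is matched to the target and the low-probability component is matched to the padding element. For realistic numbers of components, a polynomial-time assignment solver would replace the permutation loop.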

5. The multi-sensor intelligent fusion multi-target tracking method based on Transformer architecture according to claim 4, characterized in that step S4 is as follows:

After step S3, the training of the TMSHF network is completed. The test dataset generated in step S1 is then input into the trained network model to obtain the prediction results, and the optimal subpattern assignment (OSPA) metric is used to evaluate the model's performance, as given by expressions (30), (31) and (32). The quantities appearing in these expressions are: the OSPA error; the prediction result; the true value; the set of assignments corresponding to expression (31); the distance sensitivity parameter; the cutoff distance, i.e., the target state estimation error threshold, which adjusts the weighting between the cardinality error and the position error; and the function for calculating distance.
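
For reference, the OSPA evaluation can be sketched as follows. The code uses the standard OSPA definition with Euclidean base distance, which is assumed to match expressions (30) to (32); the function name and the default values of the cutoff `c` and order `p` are hypothetical. The optimal assignment is found by exhaustive search, which is adequate only for small sets.

```python
import itertools
import math


def ospa(estimates, truths, c=2.0, p=2):
    """Standard OSPA distance between two finite sets of points.

    c is the cutoff distance and p the distance sensitivity (order)
    parameter. Each unmatched element is charged the full cutoff c.
    """
    # Make `estimates` the smaller set; OSPA is symmetric in its arguments.
    if len(estimates) > len(truths):
        estimates, truths = truths, estimates
    m, n = len(estimates), len(truths)
    if n == 0:
        return 0.0  # both sets are empty

    def cut(a, b):
        # Base distance clipped at the cutoff.
        return min(c, math.dist(a, b))

    # Best assignment of the m smaller-set points to distinct larger-set points.
    best = min(
        sum(cut(estimates[i], truths[j]) ** p for i, j in enumerate(perm))
        for perm in itertools.permutations(range(n), m)
    )
    return ((best + c ** p * (n - m)) / n) ** (1.0 / p)
```

A larger `c` penalizes missed or false targets more heavily relative to position error, which is the weighting role the claim attributes to the cutoff distance.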