A radar target detection, tracking and identification multi-modal model pre-training method
By using a pre-training method for a multimodal model of radar target detection, tracking, and recognition, the problem of insufficient versatility and generalization of radar target detection, tracking, and recognition methods is solved. This method enables integrated processing of multiple tasks and information fusion, thereby improving the performance of radar target detection, tracking, and recognition.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- NANJING RES INST OF ELECTRONICS TECH
- Filing Date
- 2026-03-16
- Publication Date
- 2026-06-19
AI Technical Summary
Existing radar target detection, tracking and identification methods suffer from insufficient model universality and generalization, making them difficult to apply to various radar models. Furthermore, traditional serial processing leads to error cascading and difficulty in information reuse.
A multi-modal model pre-training method for radar target detection, tracking, and recognition is adopted. Through a unified model training framework, multi-task learning is achieved. Large-scale radar multi-modal data is used for pre-training, a multi-modal data pre-training framework is constructed, and a multi-task joint learning optimization algorithm is designed to perform parallel training of target detection, tracking, and recognition.
It improves the model's versatility and generalization, realizes integrated processing of radar target detection, tracking and identification tasks, reduces error propagation, supports zero-sample or few-sample target type identification, and is applicable to various radar models.
Figure CN121834480B_ABST
Description
Technical Field
[0001] This invention relates to the field of radar information processing technology, and in particular to a pre-training method for a multimodal model of radar target detection, tracking and recognition.
Background Technology
[0002] New-generation artificial intelligence technologies, represented by deep learning, have been widely applied in the radar field and have effectively improved the performance of radar detection, tracking, and recognition tasks. Current algorithms are mainly based on supervised learning: training data must be labeled before the corresponding neural network models are trained. Consequently, the more high-quality labeled data is accumulated, the further deep learning methods can improve the performance of radar detection, tracking, and recognition tasks. With the emergence of semi-automatic and even fully automatic labeling tools, the problem of continuously accumulating labeled data in the radar field has been largely solved, making data-driven machine learning a development trend.
[0003] In traditional radar information processing, target detection, target tracking, and target recognition are three relatively independent sequential tasks. First, target detection generates a point track; then, tracking is performed based on the point track to generate a target trajectory; and finally, target recognition is achieved based on the trajectory. However, in real-world scenarios, radar target detection, tracking, and recognition tasks are closely interrelated. The traditional sequential processing flow has two main shortcomings: first, it easily leads to cascading errors that reduce the performance of subsequent tasks; for example, false alarms generated by target detection can affect the performance of subsequent tracking tasks. Second, while the target features and results used in detection, tracking, and recognition can improve the performance of the other two tasks, the independent nature of these three tasks makes it difficult to reuse each other's information during sequential processing. To address these issues, integrated target detection and tracking, and integrated target detection, tracking, and recognition methods have emerged. These methods primarily use supervised deep learning to handle target detection, tracking, and recognition tasks in specific radar systems or scenarios, resulting in insufficient model versatility and generalization.
[0004] In recent years, large model technologies, exemplified by ChatGPT, have demonstrated unprecedented artificial intelligence capabilities across various industries. Their versatility, generalization, and emergent abilities have been highly praised, and they enable zero-shot or few-shot reasoning. The construction of a large model mainly comprises two stages: pre-training and post-training. Post-training can be further divided into supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF). Pre-training uses large-scale data to obtain a base model, while post-training fine-tunes and optimizes this base model to better adapt it to new scenarios and tasks. The advantages exhibited by large models have inspired the radar field: by pre-training on massive amounts of radar data, a base model can be obtained that enables integrated processing of radar target detection, tracking, and recognition tasks while also improving the model's versatility and generalization for intelligent processing tasks across various radar types.
Summary of the Invention
[0005] Current integrated target detection, tracking, and recognition methods in the radar field typically rely on supervised deep learning based on data from a specific type of radar to meet the needs of particular radar models or application scenarios. These methods suffer from two main shortcomings: first, relying solely on a single radar makes it difficult to accumulate large-scale, high-quality training data, hindering the full utilization of deep learning's advantages; second, the models are only applicable to specific radar models or application scenarios, making it difficult to guarantee their versatility and generalization. This invention aims to propose a pre-training method for a multi-modal model of radar target detection, tracking, and recognition, establishing a multi-task learning mechanism within a unified model training framework. This enables integrated processing of radar target detection, tracking, and recognition tasks, while simultaneously improving versatility and generalization to make the model suitable for intelligent processing tasks across various radar models.
[0006] To achieve the above objectives, this invention provides a pre-training method for a multimodal model of radar target detection, tracking, and recognition, comprising the following steps:
[0007] Step 1: Collect model pre-training data: Collect information processing data of various radar types, including radar detection data, radar tracking data, and radar identification data, and realize the semantic association between radar targets and their original images and track data through data governance and data annotation;
[0008] Step 2: Training data preprocessing: The collected radar information data of various models is processed and converted into a unified data format to serve as input data for the pre-trained model;
[0009] Step 3: Construct a multimodal model pre-training framework: Through a unified pre-training framework, multimodal data including radar target images, tracks, and category text are input simultaneously, and the pre-training framework completes the joint representation learning of the input data;
[0010] Step 4: Perform pre-training of the radar target detection, tracking and recognition multimodal model: Design a multi-task joint learning optimization algorithm to achieve parallel training of radar target detection, tracking and recognition tasks.
[0011] Further, step 1 involves collecting pre-training data for the model, including the following steps:
[0012] Step 1.1: Collect radar detection data: Collect radar detection data for each available radar. The radar detection data is the actual detection result data generated by the radar.
[0013] Step 1.2: Collect radar tracking data and radar identification data: Collect radar tracking data and radar identification data separately for the available radars. The radar tracking data is the actual tracking result data generated by the radar, and the radar identification data is the actual identification result data generated by the radar.
[0014] Step 1.3: Construct image and track data pairs: For each radar, the collected radar detection data, radar tracking data, and radar identification data are correlated and processed to construct an (image, track) data pair set. Each (image, track) data pair in the data pair set is directly associated with each target point in the track. The track is a real track generated by the radar, and the image is a detected image obtained by transforming radar echo data.
[0015] Furthermore, step 2.4 is included: randomly shuffling and storing the preprocessed (image, track) data pair set to achieve uniform distribution of data pairs, ensuring that during the pre-training of the multimodal model, the data pairs selected in each training batch will cover the information processing data of each type of radar. Among them, each image and each track has a unique number in the (image, track) data pair set.
[0016] Furthermore, step 2.2 involves preprocessing work including attribute completion and unified data format conversion, including the following steps:
[0017] Step 2.2.1: Perform semantic alignment on the target point attribute information in different radar tracks. After semantic alignment, perform a union operation on the specific attributes of different radar target points to finally form a unified data format for track target point attributes consisting of "common attribute features + specific attribute features".
[0018] Step 2.2.2: Convert the collected (image, track) data of various radar types into a unified data format for each track in the set, where each target track contains two attribute features: target type and target model.
[0019] Furthermore, step 3 constructs a multimodal model pre-training framework, including the following steps:
[0020] Step 3.1: Unify radar target detection and radar target recognition into the same training task to achieve semantic matching between the radar target location area in the image and the target category name in the text. For a (image, track) data pair, represent all radar target types and models contained therein in a natural language text format.
[0021] Step 3.2: Perform joint representation learning on the input radar target image, track, and category text and extract the corresponding features. The image, the category text, and the track are each passed through their own encoder (the image encoder, the text encoder, and the track encoder), yielding the features of the input radar target image, the radar target category text, and the radar target track, respectively.
[0022] Step 3.3: Construct the deep fusion module. The fused features of the three modal inputs are denoted $O$, $L$, and $P$, which represent the features of the input radar target image, radar target category text, and radar target track after deep fusion, respectively;
[0023] Step 3.4: Calculate the alignment score between image features and category text features and the alignment score between image features and track features. The image-text alignment score matrix is computed as $O L^{\top}$ and the image-track alignment score matrix as $O P^{\top}$, where $O \in \mathbb{R}^{N \times d}$ holds the target / image-patch / bounding-box features of the input image, $L \in \mathbb{R}^{M \times d}$ holds the target category name features obtained by encoding the input text with the language model, and $P \in \mathbb{R}^{K \times d}$ holds the input target track features. Both alignment score matrices are normalized 0-1 matrices. $\top$ denotes the matrix transpose operation, $\mathbb{R}$ denotes the set of real numbers, $N$, $M$, and $K$ denote the number of input image, text, and track samples, respectively, and $d$ denotes the dimension of the feature vector;
[0024] Step 3.5: Output the radar target detection, tracking, and identification results. In the image-text alignment score matrix, an entry that indicates a match between the i-th image region and a category text means that the i-th region in the input image is a target and that this category is its target category. All target regions in the output image are therefore marked with target location boxes, which carry attribute information such as target type and model.
[0025] In the image-track alignment score matrix, an entry that indicates a match between the i-th image region and a track means that the i-th region in the input image is a target and is a target point of that track. The target track at time t is output accordingly, together with the target category corresponding to the track;
[0026] where $O_i$ denotes the feature of the i-th region in an input image.
[0027] Furthermore, step 4 involves pre-training the radar target detection, tracking, and recognition multimodal model, including the following steps:
[0028] Step 4.1: Define the target localization loss function. Supervised localization pre-training is performed on the labeled target bounding boxes in the images, and this loss function is used for the supervised localization pre-training;
[0029] Contrastive pre-training is performed between image region features and target category text features. Its loss is obtained by applying a loss function, typically cross-entropy or focal loss, to the image-text alignment scores $O L^{\top}$ against the ground-truth labels, where $O$ and $L$ represent the features of the input target image and target category text after deep fusion, respectively, $\top$ denotes the matrix transpose operation, and $N$ and $M$ denote the number of input image and text samples, respectively.
[0031] Step 4.2: Define the track association loss function. The track association task is pre-trained by contrasting image region features with target track features. Its loss is obtained by applying a loss function, typically cross-entropy or focal loss, to the image-track alignment scores $O P^{\top}$ against the ground-truth labels, where $O$ and $P$ represent the features of the input target image and target track after deep fusion, respectively, $\top$ denotes the matrix transpose operation, and $M$ and $K$ denote the number of text and track samples, respectively.
[0032] Step 4.3: Define the intra-batch contrastive loss function. For a training batch of (image, track) data pairs, all target types and models contained in the batch are expressed as one natural language text description, and all images in the training batch are contrasted against all target category texts and track features. Two contrastive learning loss functions are used: one between images and category text, and one between images and tracks.
[0033] Step 4.4: Construct the pre-training loss function of the multimodal model from the loss functions defined above. The multimodal radar detection, tracking, and recognition model is pre-trained batch by batch, and the loss function is optimized continuously until model pre-training is complete.
[0034] Furthermore, step 4.3 includes the following steps:
[0035] Step 4.3.1: Take the set of (image, track) data pairs in a training batch, where B is the number of (image, track) data pairs in this training batch;
[0036] Step 4.3.2: The input images, tracks, and category texts are encoded by their respective encoders and the deep fusion module to obtain their features. From these features, an alignment similarity matrix between image features and category text features is formed together with its truth label matrix, and an alignment similarity matrix between image features and track features is formed together with its truth label matrix. The similarity matrix between the i-th (image, track) data pair and the j-th (image, track) data pair is calculated in the same way. In the truth label matrices, if an image region and a category text feature from different data pairs in the training batch match, the truth value is still 1.
[0037] Furthermore, in step 4.3, the contrastive learning loss function between images and category text is calculated by applying the cross-entropy loss function to the image-text alignment similarity matrix and its truth label matrix;
[0038] The contrastive learning loss function between images and tracks is calculated in the same way from the image-track alignment similarity matrix and its truth label matrix;
[0039] where the loss is the cross-entropy loss function, with softmax applied to each row of the similarity matrix.
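The intra-batch contrastive loss of step 4.3 can be sketched as follows. This is a minimal illustration, not the patent's exact formula: the function names `batch_truth` and `row_softmax_cross_entropy` are hypothetical, and averaging over multiple positives in a row is an assumption made here so that rows with more than one matching column stay comparable.

```python
import math


def batch_truth(region_categories, text_categories):
    """Truth matrix: entry is 1 whenever a region's category equals a category
    text, even if the two come from different (image, track) data pairs."""
    return [[1 if r == t else 0 for t in text_categories] for r in region_categories]


def row_softmax_cross_entropy(similarity, truth):
    """Cross-entropy with softmax applied to each row of the similarity matrix.

    Rows without any positive label are skipped (assumption of this sketch).
    """
    total = 0.0
    for row, labels in zip(similarity, truth):
        positives = sum(labels)
        if positives == 0:
            continue
        m = max(row)
        log_z = m + math.log(sum(math.exp(v - m) for v in row))
        # Negative log-probability of the positive columns, averaged.
        total -= sum(t * (v - log_z) for v, t in zip(row, labels)) / positives
    return total / len(similarity)


# Three image regions (the third comes from another data pair) and two category texts.
truth = batch_truth(["A", "B", "A"], ["A", "B"])
good = row_softmax_cross_entropy([[5.0, 0.0], [0.0, 5.0], [5.0, 0.0]], truth)
bad = row_softmax_cross_entropy([[0.0, 5.0], [5.0, 0.0], [0.0, 5.0]], truth)
```

The same function can be applied to the image-track similarity matrix with its own truth matrix.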
[0040] Furthermore, in step 4.4, the multimodal model pre-training adopts a "teacher-student" training architecture. The teacher model performs knowledge transfer and data augmentation on the training samples, and completes the parameter update of the student model. The pseudo-labeled data generated by the teacher model has consistency in time, space, and semantics. Specifically, it includes the following steps:
[0041] Step 4.4.1: Automated production of large-scale pseudo-labeled data: Input several frames of unlabeled measured radar range-Doppler (RD) maps into the teacher model and set a confidence threshold. Only when the teacher model's object detection confidence is greater than the threshold are the output bounding box, text, and track recorded in the pseudo-label library. For each generated pseudo track, Kalman filtering is used to smooth the trajectory, correct jumps in the teacher model's single-frame inference, and improve the logical consistency of the pseudo-labeled data.
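A minimal sketch of the two operations in step 4.4.1, under stated assumptions: detections are plain dictionaries with a `confidence` key (a hypothetical layout), and the smoother is a scalar Kalman filter with a random-walk state model applied to one track coordinate at a time. The noise variances are illustrative values, not taken from the source.

```python
def filter_pseudo_labels(detections, threshold):
    """Keep only teacher outputs whose confidence is greater than the threshold."""
    return [d for d in detections if d["confidence"] > threshold]


def kalman_smooth(values, process_var=1e-2, meas_var=1.0):
    """Scalar Kalman filter (forward pass) with a random-walk state model."""
    estimate, variance = values[0], meas_var
    smoothed = [estimate]
    for z in values[1:]:
        variance += process_var                 # predict: state may drift slightly
        gain = variance / (variance + meas_var) # how much to trust the new measurement
        estimate += gain * (z - estimate)       # update with measurement z
        variance *= 1.0 - gain
        smoothed.append(estimate)
    return smoothed


detections = [
    {"confidence": 0.95, "box": (10, 12, 4, 4)},
    {"confidence": 0.40, "box": (50, 7, 3, 3)},
]
kept = filter_pseudo_labels(detections, threshold=0.8)

# One coordinate of a pseudo track with a single-frame jump at index 2.
raw = [10.0, 10.2, 14.0, 10.4, 10.6]
smoothed = kalman_smooth(raw)
```

A constant-velocity model with a multi-dimensional state would be the more usual choice for moving targets; the scalar version is used here only to keep the sketch short.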
[0042] Step 4.4.2: Distillation update and augmentation training of the student model: During pre-training, the student model adopts a random sampling strategy that mixes real and pseudo-labeled data, combining real labeled data and pseudo-labeled data in a 1:3 ratio in each training batch. For real labeled data, the conventional cross-entropy loss is calculated; for pseudo-labeled data, a temperature coefficient T is introduced for softening, and the distillation loss between the student model and the teacher model is calculated. During the student model training phase, Doppler frequency jitter and random noise injection are additionally applied to the pseudo-labeled samples generated by the teacher model to simulate real measurement environments of different levels;
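The 1:3 batch mixing and the temperature-softened distillation loss of step 4.4.2 might look like the sketch below. The source does not give the form of the distillation loss; the KL divergence between softened teacher and student distributions, scaled by T squared, is a common convention and is assumed here. All function names are hypothetical.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Softmax over logits divided by the temperature coefficient."""
    scaled = [v / temperature for v in logits]
    m = max(scaled)
    exps = [math.exp(v - m) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature):
    """KL(teacher || student) on softened distributions, scaled by T^2 (assumed form)."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2


def sample_mixed_batch(real, pseudo, batch_size, rng):
    """Draw a batch with real and pseudo-labeled samples in a 1:3 ratio."""
    n_real = batch_size // 4
    n_pseudo = batch_size - n_real
    return rng.sample(real, n_real) + rng.sample(pseudo, n_pseudo)


real = [{"source": "real", "id": i} for i in range(10)]
pseudo = [{"source": "pseudo", "id": i} for i in range(30)]
batch = sample_mixed_batch(real, pseudo, batch_size=8, rng=random.Random(0))

same = distillation_loss([2.0, 0.5, -1.0], [2.0, 0.5, -1.0], temperature=4.0)
different = distillation_loss([-1.0, 0.5, 2.0], [2.0, 0.5, -1.0], temperature=4.0)
```

The Doppler frequency jitter and noise injection mentioned in the step would be applied to the pseudo-labeled inputs before they enter the student model and are not shown here.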
[0043] Step 4.4.3: Model Deployment and Performance Validation: After training is complete, export the student model as a lightweight inference format, including ONNX or TensorRT.
[0044] Beneficial Effects: This invention provides a pre-training method for a multimodal model of radar target detection, tracking and recognition. It adopts pre-training technology and makes full use of large-scale radar multimodal data to carry out model pre-training. On the basis of a unified model training framework, the joint learning optimization of target detection, tracking and recognition tasks is realized during the training process, which further improves the model performance and enhances the model's versatility and generalization. The specific beneficial effects of this invention are as follows:
(1) Establishing a data fusion mechanism to give full play to the deep learning effect driven by radar data. Unlike the current method which is limited to single radar data, this invention uses a data preprocessing method to convert multiple heterogeneous radar data into a standardized data format acceptable to the model, thereby realizing the large-scale accumulation of radar data.
(2) Constructing a unified model and establishing a multi-task learning mechanism to realize the integration of radar target detection, tracking and recognition tasks. Through task joint training optimization, parameter sharing between multiple training tasks is realized, error propagation is reduced and task performance is improved.
(3) Designing a multimodal data pre-training framework and realizing joint inference prediction through feature-level modal fusion, thereby further improving the performance of radar target detection, tracking and recognition.
(4) Integrating target detection and recognition tasks. From the semantic matching level, the two tasks of radar target detection and target recognition are effectively integrated, which effectively supports the recognition of new target types with zero or few samples.
(5) Improving the model's versatility and generalization. The method proposed in this invention is used for model pre-training on large-scale radar data. The trained model has the characteristics of a large model, namely versatility and generalization, and can be directly applied to multiple downstream tasks, which can better solve the current problems faced by radar target detection, tracking and recognition.
Attached Figure Description
[0045] Figure 1 is a flowchart of the pre-training method for a multimodal radar target detection, tracking, and recognition model according to an embodiment of the present invention;
[0046] Figure 2 is a pre-training framework diagram of a multimodal radar target detection, tracking, and recognition model involved in an embodiment of the present invention;
[0047] Figure 3 is a schematic diagram illustrating the principle of deep fusion of multimodal features for radar target detection, tracking, and recognition, as described in an embodiment of the present invention;
[0048] Figure 4 is a schematic diagram illustrating the pre-training principle of a multimodal radar target detection, tracking, and recognition model involved in an embodiment of the present invention.
Detailed Implementation
[0049] As shown in Figures 1 to 4, the present invention provides a pre-training method for a multimodal model of radar target detection, tracking and recognition.
[0050] Example 1: A pre-training method for a multi-modal model of radar target detection, tracking, and recognition, comprising the following steps:
[0051] Step 1: Collect model pre-training data: Collect information processing data of various radar types, including radar detection data, radar tracking data, and radar identification data, and realize the semantic association between radar targets and their original images and track data through data governance and data annotation;
[0052] Step 2: Training data preprocessing: The collected radar information data of various models is processed and converted into a unified data format to serve as input data for the pre-trained model;
[0053] Step 3: Construct a multimodal model pre-training framework: Through a unified pre-training framework, multimodal data including radar target images, tracks, and category text are input simultaneously, and the pre-training framework completes the joint representation learning of the input data;
[0054] Step 4: Perform pre-training of the radar target detection, tracking and recognition multimodal model: Design a multi-task joint learning optimization algorithm to achieve parallel training of radar target detection, tracking and recognition tasks.
[0055] Step 1 involves collecting pre-training data for the model, including the following steps:
[0056] Step 1.1: Collect radar detection data: Collect radar detection data for each available radar. The radar detection data is the actual detection result data generated by the radar.
[0057] Step 1.2: Collect radar tracking data and radar identification data: Collect radar tracking data and radar identification data separately for the available radars. The radar tracking data is the actual tracking result data generated by the radar, and the radar identification data is the actual identification result data generated by the radar.
[0058] Step 1.3: Constructing Image and Track Data Pairs: For each radar, the collected radar detection data, radar tracking data, and radar identification data are correlated to construct an (image, track) data pair set. Each (image, track) data pair in the set is directly associated with each target point in the track. The track is a real track generated by that radar, and the image is a detection image obtained by transforming radar echo data. The original image can include both the detection image and the identification image, and the radar echo data is the initial input.
[0059] In step 1.3, each (image, track) data pair represents a complete detection, tracking, and identification result for a target. A track contains several target points, each corresponding to an image; therefore, the target points and images in each (image, track) data pair have a many-to-many relationship. Step 1.3 processes the target type and model identification results into attributes of the (image, track) data pair.
[0060] Step 2, training data preprocessing, includes the following steps:
[0061] Step 2.1: Preprocess image data: Based on the collected radar (image, track) data sets of various types, analyze and check whether the number of target points and location boxes in the image are consistent with the corresponding track. If there is an error, supplement the missing target point location boxes in the image, or delete the redundant target point location boxes in the image.
[0062] Step 2.2: Preprocess track data: Based on the collected radar (image, track) data sets of various types, analyze and check the target point attribute information in the track one by one, and perform preprocessing work including attribute completion and unified data format conversion;
[0063] Step 2.2 involves preprocessing steps including attribute completion and unified data format conversion, including the following steps:
[0064] Step 2.2.1: Perform semantic alignment on the target point attribute information in different radar tracks. After semantic alignment, perform a union operation on the specific attributes of different radar target points to finally form a unified data format for track target point attributes consisting of "common attribute features + specific attribute features".
[0065] Step 2.2.2: Convert the collected (image, track) data of various radar types into a unified data format for each track in the set, where each target track contains two attribute features: target type and target model;
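The "common attribute features + specific attribute features" format of step 2.2.1 can be illustrated with a short sketch. The attribute names, the alias table used for semantic alignment, and the choice to fill attributes a radar does not provide with `None` are all assumptions made for illustration.

```python
def build_unified_schema(radar_schemas, aliases):
    """Return (common, specific) attribute names after semantic alignment."""
    aligned = {
        radar: {aliases.get(name, name) for name in names}
        for radar, names in radar_schemas.items()
    }
    common = set.intersection(*aligned.values())
    # Union of the attributes that only some radars provide.
    specific = set.union(*aligned.values()) - common
    return sorted(common), sorted(specific)


def to_unified_point(point, aliases, common, specific):
    """Convert one track target point; attributes the radar lacks become None."""
    renamed = {aliases.get(key, key): value for key, value in point.items()}
    return {name: renamed.get(name) for name in common + specific}


# Hypothetical attribute names for two radar types.
radar_schemas = {
    "radar_a": ["rng", "azimuth", "doppler"],
    "radar_b": ["range", "az", "rcs"],
}
aliases = {"rng": "range", "az": "azimuth"}

common, specific = build_unified_schema(radar_schemas, aliases)
unified = to_unified_point(
    {"rng": 1200.0, "azimuth": 31.5, "doppler": -4.2}, aliases, common, specific
)
```

Every converted target point then has the same keys regardless of the radar it came from, which is what lets tracks from different radars share one input format.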
[0066] Step 2.3: Constructing the information association between (image, track) data pairs: After the preprocessing operations in Steps 2.1 and 2.2, each (image, track) data pair has been converted into a unified data format. There are several target points in a track, and each target point corresponds to an image. The target points and images in each (image, track) data pair have a many-to-many relationship.
[0067] It also includes step 2.4: randomly shuffling and storing the preprocessed (image, track) data pair set to achieve uniform distribution of data pairs, ensuring that during the pre-training of the multimodal model, the data pairs selected in each training batch will cover the information processing data of each type of radar. Each image and each track has a unique number in the (image, track) data pair set.
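A plain random shuffle makes coverage of every radar type in each batch likely but does not guarantee it. One way to enforce it, sketched below under assumptions, is to shuffle within each radar type and then interleave the types round-robin. The field names `radar`, `image_id`, and `track_id` are hypothetical.

```python
import random
from collections import defaultdict


def make_covering_batches(pairs, batch_size, seed=0):
    """Shuffle within each radar type, then interleave the types round-robin."""
    rng = random.Random(seed)
    by_radar = defaultdict(list)
    for pair in pairs:
        by_radar[pair["radar"]].append(pair)
    groups = list(by_radar.values())
    for group in groups:
        rng.shuffle(group)
    interleaved = []
    while any(groups):  # continue while at least one radar type has pairs left
        for group in groups:
            if group:
                interleaved.append(group.pop())
    return [interleaved[i:i + batch_size] for i in range(0, len(interleaved), batch_size)]


# Each image and track carries a unique number within the data pair set.
pairs = [
    {"radar": radar, "image_id": f"{radar}-img-{k}", "track_id": f"{radar}-trk-{k}"}
    for radar in ("a", "b", "c")
    for k in range(4)
]
batches = make_covering_batches(pairs, batch_size=3)
```

Coverage holds for every batch only when the batch size is at least the number of radar types and the types have similar sample counts; with strongly unbalanced data the last batches contain only the larger types.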
[0068] Step 3: Construct a multimodal model pre-training framework, as shown in Figure 2, including the following steps:
[0069] Step 3.1: Unify target detection and target recognition into the same training task, namely classification and matching task, to achieve semantic matching between the target localization region in the image and the target category (type and model combination) name in the text. For a (image, track) data pair, represent all the target types and models contained therein in a natural language text format, such as: type 1-model 1, type 1-model 2, etc.
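As an illustration of step 3.1, the category text for one (image, track) data pair could be assembled as below. The dictionary keys `target_type` and `target_model` are hypothetical, and mapping a target without both attributes to "unknown" follows the rule the description gives for targets lacking type and model attributes.

```python
def build_category_text(tracks):
    """Collect unique 'type-model' category names and join them into one text."""
    names = []
    for track in tracks:
        target_type = track.get("target_type")
        target_model = track.get("target_model")
        if target_type and target_model:
            name = f"{target_type}-{target_model}"
        else:
            name = "unknown"
        if name not in names:  # keep first-seen order, drop repeats
            names.append(name)
    return names, ", ".join(names)


names, text = build_category_text([
    {"target_type": "type 1", "target_model": "model 1"},
    {"target_type": "type 1", "target_model": "model 2"},
    {"target_type": "type 1", "target_model": "model 1"},
    {},
])
```

The list `names` gives the column order of the image-text alignment matrix, while `text` is the string passed to the text encoder.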
[0070] Step 3.2: Perform joint representation learning on the input target image, track, and category text and extract the corresponding features. The image, the category text, and the track are each passed through their own encoder (the image encoder, the text encoder, and the track encoder), yielding the features of the input target image, the target category text, and the target track, respectively.
[0071] Step 3.3: Construct the deep fusion module. The fused features of the three modal inputs are denoted $O$, $L$, and $P$, which represent the features of the input target image, target category text, and target track after deep fusion, respectively.
[0072] The implementation principle is shown in Figure 3: RT-DETR is used as the image encoder, BERT as the text encoder, and the original Transformer as the track encoder.
[0073] The deep fusion of image features and text features is implemented as follows:
[0074] X-MHA is a cross-modal multi-head attention network; DyHeadModule is a detection head of the RT-DETR image encoder, and the formula is indexed by the number of such heads; BERT Layer is a layer of the pre-trained language model based on the Transformer network. The initial image features are those extracted by the image base model, and the initial text features are those obtained from the pre-trained language model. The features of the i-th input image and text sample are fused in both directions: the new i-th input image feature is obtained by fusing the i-th input text feature with the i-th input image feature, and the new i-th input text feature is obtained by fusing the i-th input image feature with the i-th input text feature. The final results are the output features of the image encoder and the output features of the text encoder.
[0075] Similarly, the deep fusion of image features and track features is implemented as follows:
[0076] The initial track features are those extracted by the track base model. The features of the i-th input image and track sample are fused in both directions: the new i-th input image feature is obtained by fusing the i-th input track feature with the i-th input image feature, and the new i-th input track feature is obtained by fusing the i-th input image feature with the i-th input track feature. The final results are the output features of the image encoder and the output features of the track encoder.
[0077] In the cross-modal multi-head attention network, each head computes the feature vector of the current modality by attending to the other modality. The multi-head attention between images and text is calculated as follows:
[0078] A query (Q) matrix and a value (V) matrix are formed for the input image and for the input text sample. From them, new image sample features are obtained by fusing the text sample features of a training batch with the image sample features, and new text sample features are obtained by fusing the image sample features with the text sample features. The weights involved are trainable parameters.
[0079] The multi-head attention between images and tracks is calculated as follows:
[0080] A query (Q) matrix and a value (V) matrix are formed for the input image and for the input track sample. From them, new image sample features are obtained by fusing the track sample features of a training batch with the image sample features, and new track sample features are obtained by fusing the image sample features with the track sample features. The weights involved are trainable parameters.
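A minimal single-head sketch of bidirectional cross-modal attention, assuming standard scaled dot-product attention. To keep it short, the trainable Q and V projections are replaced by identity mappings, so the code shows only the data flow between two modalities, not the trained module. The function `cross_attention` and its helpers are hypothetical names.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def softmax_rows(a):
    out = []
    for row in a:
        m = max(row)
        exps = [math.exp(v - m) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def cross_attention(image_feats, other_feats):
    """Return (image_update, other_update) for image and text/track features."""
    d = len(image_feats[0])
    # Shared attention logits between every image region and every other-modality sample.
    logits = matmul(image_feats, transpose(other_feats))
    logits = [[v / math.sqrt(d) for v in row] for row in logits]
    # Other modality -> image: each image region gathers text/track values.
    image_update = matmul(softmax_rows(logits), other_feats)
    # Image -> other modality: each text/track sample gathers image values.
    other_update = matmul(softmax_rows(transpose(logits)), image_feats)
    return image_update, other_update


image_feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # N = 3 regions, d = 2
text_feats = [[1.0, 0.0], [0.0, 1.0]]                # M = 2 category texts
image_update, text_update = cross_attention(image_feats, text_feats)
```

The same call with track features in place of `text_feats` gives the image-track direction. In a deep fusion stack these updates would be combined with the original features before the next encoder layer.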
[0081] Step 3.4: Calculate the alignment score between image features and category text features. Alignment score between image features and track features Its calculation formula is ,in, The input image contains target / image patch / boundary features. The target category name features are derived from the input text after it has been encoded by the language model. The input is the target track feature. , All are normalized 0-1 matrices. This represents the matrix transpose operation. Let represent the set of real numbers, where N, M, and K represent the number of input image, text, and track samples, respectively, and d represents the dimension of the feature vector.
[0082] Step 3.5: Output the target detection, tracking, and recognition results. In the alignment score matrix between image features and category text features, an entry in row i that satisfies the alignment condition indicates that the i-th region in the input image is a target and identifies its corresponding target category. Accordingly, every target region in the output image is marked with a target location box that carries attribute information such as target type and model.
[0083] In the alignment score matrix between image features and track features, an entry in row i that satisfies the alignment condition indicates that the i-th region in the input image is a target and is a target point belonging to the corresponding track. Accordingly, the target track at time t is output, together with the target category corresponding to that track;
[0084] where i indexes the regions of an input image, each of which is represented by its region feature.
[0085] If the target does not have corresponding type and model attributes, the target category is "unknown".
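The output logic of Step 3.5 can be sketched as follows. The exact alignment condition is not reproduced in this text, so the sketch assumes a simple score threshold (the hypothetical `threshold=0.5`) and picks the best-scoring category and track per region. The function name, the result fields, and the example category names are illustrative only.

```python
def decode_regions(s_txt, s_trk, categories, threshold=0.5):
    """Turn alignment scores into per-region detection/tracking/recognition results.

    A region counts as a target if it aligns with a category or a track
    (assumed rule). Targets without an aligned category are labeled "unknown".
    """
    results = []
    for i, (txt_row, trk_row) in enumerate(zip(s_txt, s_trk)):
        best_cat = max(range(len(txt_row)), key=txt_row.__getitem__)
        best_trk = max(range(len(trk_row)), key=trk_row.__getitem__)
        has_cat = txt_row[best_cat] >= threshold
        has_trk = trk_row[best_trk] >= threshold
        if not (has_cat or has_trk):
            continue  # treated as background, not a target
        results.append({
            "region": i,
            "category": categories[best_cat] if has_cat else "unknown",
            "track": best_trk if has_trk else None,
        })
    return results

categories = ["fixed-wing aircraft", "helicopter"]  # hypothetical names
s_txt = [[0.9, 0.1], [0.2, 0.3], [0.1, 0.2]]
s_trk = [[0.8, 0.1], [0.1, 0.7], [0.2, 0.1]]
results = decode_regions(s_txt, s_trk, categories)
```

In this toy input, region 1 is associated with a track but with no category, so it is reported with the category "unknown", while region 2 matches nothing and is dropped.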
[0086] Step 4 involves pre-training the radar target detection, tracking, and recognition multimodal model, including the following steps:
[0087] Step 4.1: Define the target localization loss function $\mathcal{L}_{\mathrm{loc}}$. Supervised localization pre-training is performed on the labeled target bounding boxes in the images and uses $\mathcal{L}_{\mathrm{loc}}$ as its loss function.
[0088] Contrastive pre-training is performed between image region features and target category text features, with the loss function $\mathcal{L}_{\mathrm{txt}} = \mathrm{loss}(O L^{\top}; T_{\mathrm{txt}})$, where $T_{\mathrm{txt}}$ is the ground-truth label matrix, $\mathrm{loss}$ is a typical cross-entropy or focal loss function, O and L are the deeply fused features of the input target image and target category text, respectively, $\top$ denotes the matrix transpose operation, and N and M are the numbers of input image and text samples, respectively.
[0089] .
[0090] Step 4.2: Define the track association loss function $\mathcal{L}_{\mathrm{trk}}$. The track association task is pre-trained by contrasting image region features with target track features, with the loss function $\mathcal{L}_{\mathrm{trk}} = \mathrm{loss}(O P^{\top}; T_{\mathrm{trk}})$, where $T_{\mathrm{trk}}$ is the ground-truth label matrix, $\mathrm{loss}$ is a typical cross-entropy or focal loss function, O and P are the deeply fused features of the input target image and target track, respectively, $\top$ denotes the matrix transpose operation, and M and K are the numbers of text and track samples, respectively.
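The losses in Steps 4.1 and 4.2 are described as a typical cross-entropy or focal loss between an alignment score matrix and a ground-truth label matrix. A minimal sketch of such a loss is given below, assuming element-wise binary targets and scores already in the 0-1 range. The function name and the default `gamma` and `alpha` values are assumptions; setting `gamma=0.0` and `alpha=None` reduces the focal loss to plain binary cross-entropy.

```python
import math

def alignment_loss(scores, targets, gamma=2.0, alpha=0.25, eps=1e-12):
    """Mean binary focal loss between a score matrix and a 0/1 label matrix."""
    total, count = 0.0, 0
    for s_row, t_row in zip(scores, targets):
        for p, t in zip(s_row, t_row):
            p_t = p if t == 1 else 1.0 - p  # probability of the true label
            if alpha is None:
                a_t = 1.0
            else:
                a_t = alpha if t == 1 else 1.0 - alpha
            total += -a_t * (1.0 - p_t) ** gamma * math.log(max(p_t, eps))
            count += 1
    return total / count

truth = [[1, 0], [0, 1]]
good = [[0.9, 0.1], [0.1, 0.9]]  # scores that agree with the labels
bad = [[0.1, 0.9], [0.9, 0.1]]   # scores that contradict the labels
loss_good = alignment_loss(good, truth)
loss_bad = alignment_loss(bad, truth)
```

The same function can be applied to the image-text scores with the text label matrix and to the image-track scores with the track label matrix.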
[0091] Step 4.3: Define the intra-batch contrastive loss function. For a training batch of (image, track) data pairs, all target types and models contained in the batch are expressed as a natural language text description. All images in the training batch are then contrasted against all target category texts and all track features. The intra-batch contrastive loss function is composed of two terms, which are the contrastive learning loss functions between images and category text and between images and tracks, respectively.
[0092] Step 4.3 includes the following steps:
[0093] Step 4.3.1: Take the set of (image, track) data pairs in a given training batch, where B is the number of (image, track) data pairs in the batch;
[0094] Step 4.3.2: The input images, tracks, and category texts are encoded by their respective encoders and the deep fusion module to obtain the corresponding features. From these features, the alignment similarity matrix between image features and category text features is formed together with its ground-truth label matrix, and the alignment similarity matrix between image features and track features is formed together with its ground-truth label matrix. The similarity matrix is calculated between the i-th (image, track) data pair and the j-th (image, track) data pair. In the ground-truth label matrix, if an image region and a category text feature are successfully matched across data pairs within the training batch, the corresponding value is still 1.
[0095] In Step 4.3, the contrastive learning loss function between images and category text is calculated from the image-text alignment similarity matrix and its ground-truth label matrix;
[0096] the contrastive learning loss function between images and tracks is calculated from the image-track alignment similarity matrix and its ground-truth label matrix;
[0097] where the loss is the cross-entropy loss function and softmax is applied to each row of the similarity matrix.
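The intra-batch contrastive loss can be sketched as a cross-entropy over the row-wise softmax of a similarity matrix, as stated above. Because the ground-truth label matrix may contain several 1s in a row (cross-pair matches keep the value 1), this sketch averages the log-probabilities over all positives in a row; that averaging, the skipping of rows without positives, and the function name are assumptions.

```python
import math

def row_softmax_cross_entropy(sim, truth):
    """Cross-entropy between row-wise softmax(sim) and a 0/1 label matrix."""
    total, rows = 0.0, 0
    for s_row, t_row in zip(sim, truth):
        positives = sum(t_row)
        if positives == 0:
            continue  # no positive in this row, nothing to contrast against
        m = max(s_row)
        log_z = m + math.log(sum(math.exp(v - m) for v in s_row))
        # Average negative log-probability over the positives of this row.
        total += -sum(t * (v - log_z) for v, t in zip(s_row, t_row)) / positives
        rows += 1
    return total / rows if rows else 0.0

truth = [[1, 0], [0, 1]]
loss_uniform = row_softmax_cross_entropy([[0.0, 0.0], [0.0, 0.0]], truth)
loss_aligned = row_softmax_cross_entropy([[10.0, 0.0], [0.0, 10.0]], truth)
```

The same function would be called once with the image-text similarity matrix and once with the image-track similarity matrix, giving the two contrastive terms.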
[0098] The implementation principle of the three contrastive learning losses defined in Steps 4.1, 4.2, and 4.3 is shown in Figure 4. The text sequence "Category 1, Category 2, ..." represents the set of target categories in the training batch. Each target "category" consists of a target type and an optional target model. Track 1, Track 2, and so on represent the track of the corresponding target in the training batch at a given moment. Because some scenarios do not require target recognition, "Unknown" is added to the category text sequence to correspond to the relevant target regions in the image.
[0099] Step 4.4: Construct the pre-training loss function of the multimodal model from the loss functions defined in Steps 4.1 to 4.3. The multimodal radar detection, tracking, and recognition model is pre-trained batch by batch, and this loss function is optimized continuously until model pre-training is complete.
[0100] To obtain more pre-training data, this embodiment employs a "teacher-student" model training method for data augmentation. The pre-trained teacher model performs knowledge transfer and data augmentation on the training samples, updating the parameters of the student model. The pseudo-labels generated by the teacher model possess temporal, spatial, and semantic consistency. First, based on a pre-processed set of high-quality labeled (image, track) data pairs, a teacher model is trained using the multimodal model pre-training method proposed in this invention. Then, this teacher model is used for target image detection, tracking, and recognition, predicting target bounding boxes, target category text, and target tracks in a large-scale unlabeled image dataset, forming a set of pseudo-labeled (image, track) data pairs. Finally, based on a set of mixed real and pseudo-labeled (image, track) data pairs, a student model is pre-trained, which is the radar target detection, tracking, and recognition multimodal pre-trained model. Specifically, the following steps are included:
[0101] Step 4.4.1: Automated production of large-scale pseudo-label data: A large number (e.g., 100,000 frames) of unlabeled measured radar range-Doppler (RD) maps are fed into the teacher model. To ensure data quality, a confidence threshold (e.g., 0.7) is set, and the output bounding box, text, and track are recorded into the pseudo-label library only when the teacher model's target detection confidence is greater than this threshold. For each generated pseudo track, Kalman filtering is used to smooth the trajectory and correct jumps that the teacher model may produce in single-frame inference, thereby improving the logical consistency of the pseudo-label data.
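A minimal sketch of the two operations in Step 4.4.1 is shown below: confidence filtering of teacher outputs and Kalman filtering of a pseudo track. The record fields and function names are hypothetical, and the filter is a scalar random-walk Kalman filter applied to one coordinate with assumed noise variances; a practical tracker would more likely use a constant-velocity state model.

```python
def filter_pseudo_labels(predictions, threshold=0.7):
    """Keep only teacher outputs whose confidence is greater than the threshold."""
    return [p for p in predictions if p["confidence"] > threshold]

def kalman_smooth(measurements, process_var=1e-2, meas_var=1.0):
    """Scalar random-walk Kalman filter over one track coordinate."""
    x, p = measurements[0], 1.0  # initial state estimate and its variance
    out = [x]
    for z in measurements[1:]:
        p = p + process_var      # predict: uncertainty grows
        k = p / (p + meas_var)   # Kalman gain
        x = x + k * (z - x)      # update with the new measurement
        p = (1.0 - k) * p
        out.append(x)
    return out

# Hypothetical teacher outputs for three detections.
predictions = [
    {"box": (10, 20, 30, 40), "text": "unknown", "track_id": 1, "confidence": 0.9},
    {"box": (12, 22, 32, 42), "text": "unknown", "track_id": 1, "confidence": 0.7},
    {"box": (50, 60, 70, 80), "text": "unknown", "track_id": 2, "confidence": 0.4},
]
kept = filter_pseudo_labels(predictions)

# A range coordinate with a single-frame jump at index 3.
raw_track = [10.0, 10.0, 10.0, 50.0, 10.0, 10.0]
smooth_track = kalman_smooth(raw_track)
```

With these settings the jump to 50 is pulled most of the way back toward the neighbouring values, which is the kind of single-frame correction the step describes.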
[0102] Step 4.4.2: Distillation update and augmentation training of the student model: During pre-training, the student model adopts a random sampling strategy that mixes real and pseudo-labeled data. In each training batch, real labeled data and pseudo-labeled data are mixed in a 1:3 ratio. For real labeled data, the conventional cross-entropy loss is calculated. For pseudo-labeled data, a temperature coefficient T (ranging from 1.5 to 3.0) is introduced for softening, and the distillation loss between the student model and the teacher model is calculated. During the training phase of the student model, Doppler frequency jitter and random noise injection are additionally applied to the pseudo-labeled samples generated by the teacher model to simulate test environments of different levels of complexity.
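The sampling and distillation parts of Step 4.4.2 can be sketched as follows. The 1:3 real-to-pseudo ratio and the temperature range come from the text; the KL-divergence form of the distillation loss and its scaling by the squared temperature are a common convention assumed here, and all function names are hypothetical.

```python
import math
import random

def sample_mixed_batch(real, pseudo, batch_size, rng):
    """Draw a batch with real and pseudo-labeled samples in a 1:3 ratio."""
    n_real = batch_size // 4
    n_pseudo = batch_size - n_real
    batch = [(x, "real") for x in rng.sample(real, n_real)]
    batch += [(x, "pseudo") for x in rng.sample(pseudo, n_pseudo)]
    rng.shuffle(batch)
    return batch

def softmax(logits, temperature=1.0):
    """Temperature-softened softmax over a list of logits."""
    scaled = [v / temperature for v in logits]
    m = max(scaled)
    e = [math.exp(v - m) for v in scaled]
    s = sum(e)
    return [v / s for v in e]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    p_t = softmax(teacher_logits, temperature)
    p_s = softmax(student_logits, temperature)
    kl = sum(t * math.log(t / max(s, 1e-12)) for t, s in zip(p_t, p_s) if t > 0)
    return temperature ** 2 * kl

rng = random.Random(0)
real_ids = list(range(10))          # stand-ins for real labeled samples
pseudo_ids = list(range(100, 130))  # stand-ins for pseudo-labeled samples
batch = sample_mixed_batch(real_ids, pseudo_ids, 8, rng)

loss_same = distillation_loss([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
loss_diff = distillation_loss([3.0, 2.0, 1.0], [1.0, 2.0, 3.0])
```

In a training loop, samples tagged "real" would contribute the conventional cross-entropy loss and samples tagged "pseudo" would contribute the distillation loss.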
[0103] Step 4.4.3: Model Deployment and Performance Validation: After training, export the student model as a lightweight inference format (such as ONNX or TensorRT). Because the student model has been exposed to a large amount of real-world noise background through pseudo-labeled datasets during the pre-training phase, its target detection, tracking, and recognition accuracy in complex electromagnetic environments is significantly improved compared to the unenhanced model.
[0104] This invention proposes a pre-training method for a multimodal model of radar target detection, tracking and recognition, establishes a multi-task learning mechanism under a unified model training framework, realizes integrated processing of radar target detection, tracking and recognition tasks, and improves the versatility and generalization of the model to make it applicable to intelligent processing tasks of various types of radar.
[0105] Finally, it should be noted that the above are merely preferred embodiments of the present invention and are not intended to limit the present invention. Although the present invention has been described in detail with reference to the embodiments, those skilled in the art can still modify the technical solutions described in the foregoing embodiments or make equivalent substitutions for some of the technical features. However, any modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of the present invention should be included within the protection scope of the present invention.
Claims
1. A pre-training method for a multimodal model of radar target detection, tracking, and recognition, characterized in that it includes the following steps: Step 1: Collect model pre-training data: collect information processing data of various radar types, including radar detection data, radar tracking data, and radar identification data, and establish the semantic association between radar targets and their original images and track data through data governance and data annotation; Step 2: Training data preprocessing: process the collected radar information data of various models and convert it into a unified data format to serve as input data for the pre-trained model; Step 3: Construct a multimodal model pre-training framework: through a unified pre-training framework, multimodal data including radar target images, tracks, and category text are input simultaneously, and the pre-training framework completes the joint representation learning of the input data; Step 4: Pre-train the radar target detection, tracking, and recognition multimodal model: design a multi-task joint learning optimization algorithm to achieve parallel training of the radar target detection, tracking, and recognition tasks; Step 4 includes the following steps: Step 4.1: Define the target localization loss function $\mathcal{L}_{\mathrm{loc}}$; supervised localization pre-training is performed on the labeled target bounding boxes in the images and uses $\mathcal{L}_{\mathrm{loc}}$ as its loss function; contrastive pre-training is performed between image region features and target category text features, with the loss function $\mathcal{L}_{\mathrm{txt}} = \mathrm{loss}(O L^{\top}; T_{\mathrm{txt}})$, where $T_{\mathrm{txt}}$ is the ground-truth label matrix, $\mathrm{loss}$ is a typical cross-entropy or focal loss function, O and L are the deeply fused features of the input target image and target category text, respectively, $\top$ denotes the matrix transpose operation, and N and M are the numbers of input image and text samples, respectively; Step 4.2: Define the track association loss function $\mathcal{L}_{\mathrm{trk}}$; the track association task is pre-trained by contrasting image region features with target track features, with the loss function $\mathcal{L}_{\mathrm{trk}} = \mathrm{loss}(O P^{\top}; T_{\mathrm{trk}})$, where $T_{\mathrm{trk}}$ is the ground-truth label matrix, $\mathrm{loss}$ is a typical cross-entropy or focal loss function, O and P are the deeply fused features of the input target image and target track, respectively, $\top$ denotes the matrix transpose operation, and M and K are the numbers of text and track samples, respectively; Step 4.3: Define the intra-batch contrastive loss function; for a training batch of (image, track) data pairs, all target types and models contained in the batch are expressed as a natural language text description, and all images in the training batch are contrasted against all target category texts and all track features; the intra-batch contrastive loss function is composed of two terms, which are the contrastive learning loss functions between images and category text and between images and tracks, respectively; Step 4.4: Construct the pre-training loss function of the multimodal model from the loss functions defined in Steps 4.1 to 4.3; the multimodal radar detection, tracking, and recognition model is pre-trained batch by batch, and this loss function is optimized continuously until model pre-training is complete.
2. The pre-training method for a multi-modal model of radar target detection, tracking, and recognition according to claim 1, characterized in that, Step 1 involves collecting pre-training data for the model, including the following steps: Step 1.1: Collect radar detection data: Collect radar detection data for each available radar, wherein the radar detection data is the actual detection result data generated by the radar; Step 1.2: Collect radar tracking data and radar identification data: Collect radar tracking data and radar identification data separately according to the available radars. The radar tracking data is the actual tracking result data generated by the radar, and the radar identification data is the actual identification result data generated by the radar. Step 1.3: Construct image and track data pairs: For each radar, the collected radar detection data, radar tracking data, and radar identification data are correlated to construct an (image, track) data pair set. Each (image, track) data pair in the data pair set is directly associated with each target point in the track. The track is a real track generated by the radar, and the image is a detection image obtained by transforming radar echo data.
3. The pre-training method for a multi-modal model of radar target detection, tracking, and recognition according to claim 2, characterized in that, Step 2, training data preprocessing, includes the following steps: Step 2.1: Preprocess image data: Based on the collected radar (image, track) data sets of various types, analyze and check whether the number of target points and location boxes in the image are consistent with the corresponding track. If there is an error, supplement the missing target point location boxes in the image, or delete the redundant target point location boxes in the image. Step 2.2: Preprocess track data: Based on the collected radar (image, track) data sets of various types, analyze and check the target point attribute information in the track one by one, and perform preprocessing work including attribute completion and unified data format conversion; Step 2.3: Constructing the information association between (image, track) data pairs: After the preprocessing operations in Steps 2.1 and 2.2, each (image, track) data pair has been converted into a unified data format. There are several target points in a track, and each target point corresponds to an image. The target points and images in each (image, track) data pair have a many-to-many relationship.
4. The pre-training method for a multimodal model of radar target detection, tracking, and recognition according to claim 3, characterized in that, It also includes step 2.4: randomly shuffling and storing the preprocessed (image, track) data pair set to achieve uniform distribution of data pairs, ensuring that during the pre-training of the multimodal model, the data pairs selected in each training batch will cover the information processing data of each type of radar. Each image and each track has a unique number in the (image, track) data pair set.
5. The pre-training method for a multimodal model of radar target detection, tracking, and recognition according to claim 3, characterized in that, Step 2.2 involves preprocessing steps including attribute completion and unified data format conversion, including the following steps: Step 2.2.1: Perform semantic alignment on the target point attribute information in different radar tracks. After semantic alignment, perform a union operation on the specific attributes of different radar target points to finally form a unified data format for track target point attributes consisting of "common attribute features + specific attribute features". Step 2.2.2: Convert the collected (image, track) data of various radar types into a unified data format for each track in the set, where each target track contains two attribute features: target type and target model.
6. The pre-training method for a multimodal model of radar target detection, tracking, and recognition according to claim 2, characterized in that Step 3, constructing a multimodal model pre-training framework, includes the following steps: Step 3.1: Unify radar target detection and radar target recognition into the same training task to achieve semantic matching between the radar target location region in the image and the target category name in the text; for an (image, track) data pair, express all radar target types and models contained in it as natural language text; Step 3.2: Perform joint representation learning on the input radar target image, track, and category text and extract the corresponding features: the image encoder, the text encoder, and the track encoder each encode their own input, giving the features of the input radar target image, radar target category text, and radar target track, respectively; Step 3.3: Construct the deep fusion module, which takes the features of the three modal inputs and outputs the fused features O, L, and P, where O, L, and P represent the features of the input radar target image, radar target category text, and radar target track after deep fusion, respectively; Step 3.4: Calculate the alignment score $S_{\mathrm{txt}}$ between image features and category text features and the alignment score $S_{\mathrm{trk}}$ between image features and track features, with the calculation formulas $S_{\mathrm{txt}} = O L^{\top}$ and $S_{\mathrm{trk}} = O P^{\top}$, where $O \in \mathbb{R}^{N \times d}$ is the target / image patch / bounding box feature of the input image, $L \in \mathbb{R}^{M \times d}$ is the target category name feature obtained by encoding the input text with the language model, $P \in \mathbb{R}^{K \times d}$ is the input target track feature, $S_{\mathrm{txt}} \in \mathbb{R}^{N \times M}$ and $S_{\mathrm{trk}} \in \mathbb{R}^{N \times K}$ are both matrices normalized to the range 0 to 1, $\top$ denotes the matrix transpose operation, $\mathbb{R}$ denotes the set of real numbers, N, M, and K denote the numbers of input image, text, and track samples, respectively, and d denotes the dimension of the feature vector; Step 3.5: Output the radar target detection, tracking, and identification results: in the alignment score matrix between image features and category text features, an entry in row i that satisfies the alignment condition indicates that the i-th region in the input image is a target and identifies its corresponding target category, so that every target region in the output image is marked with a target location box that carries attribute information such as target type and model; in the alignment score matrix between image features and track features, an entry in row i that satisfies the alignment condition indicates that the i-th region in the input image is a target and is a target point belonging to the corresponding track, so that the target track at time t is output together with the target category corresponding to that track; where i indexes the regions of an input image, each of which is represented by its region feature.
7. The pre-training method for a multimodal model of radar target detection, tracking, and recognition according to claim 1, characterized in that Step 4.3 includes the following steps: Step 4.3.1: Take the set of (image, track) data pairs in a training batch, where B is the number of (image, track) data pairs in the batch; Step 4.3.2: The input images, tracks, and category texts are encoded by their respective encoders and the deep fusion module to obtain the corresponding features; from these features, the alignment similarity matrix between image features and category text features is formed together with its ground-truth label matrix, and the alignment similarity matrix between image features and track features is formed together with its ground-truth label matrix; the similarity matrix is calculated between the i-th (image, track) data pair and the j-th (image, track) data pair; in the ground-truth label matrix, if an image region and a category text feature are successfully matched across data pairs within the training batch, the corresponding value is still 1.
8. The pre-training method for a multimodal model of radar target detection, tracking, and recognition according to claim 7, characterized in that, in Step 4.3, the contrastive learning loss function between images and category text is calculated from the image-text alignment similarity matrix and its ground-truth label matrix; the contrastive learning loss function between images and tracks is calculated from the image-track alignment similarity matrix and its ground-truth label matrix; where the loss is the cross-entropy loss function and softmax is applied to each row of the similarity matrix.
9. The pre-training method for a multimodal model of radar target detection, tracking, and recognition according to claim 1, characterized in that the multimodal model pre-training of Step 4.4 adopts a "teacher-student" training architecture, in which the teacher model performs knowledge transfer and data augmentation on the training samples and updates the parameters of the student model, and the pseudo-labeled data generated by the teacher model is consistent in time, space, and semantics; specifically, it includes the following steps: Step 4.4.1: Automated production of large-scale pseudo-labeled data: input several frames of unlabeled measured radar range-Doppler (RD) maps into the teacher model and set a confidence threshold; the output bounding box, text, and track are recorded into the pseudo-label library only when the teacher model's target detection confidence is greater than this threshold; for each generated pseudo track, Kalman filtering is used to smooth the trajectory, correct jumps produced by the teacher model in single-frame inference, and improve the logical consistency of the pseudo-labeled data; Step 4.4.2: Distillation update and augmentation training of the student model: during pre-training, the student model adopts a random sampling strategy that mixes real and pseudo-labeled data, mixing real labeled data and pseudo-labeled data in a 1:3 ratio in each training batch; for real labeled data, the conventional cross-entropy loss is calculated; for pseudo-labeled data, a temperature coefficient T is introduced for softening, and the distillation loss between the student model and the teacher model is calculated; during the student model training phase, Doppler frequency jitter and random noise injection are additionally applied to the pseudo-labeled samples generated by the teacher model to simulate real-world test environments of different levels of complexity; Step 4.4.3: Model deployment and performance validation: after training is complete, export the student model to a lightweight inference format, including ONNX or TensorRT.