A low, small and slow target detection and tracking method and system based on multi-modal fusion

By employing a multimodal fusion-based method for detecting and tracking small, slow targets, this approach utilizes preprocessing of visible light, infrared, and radar data, feature extraction, cross-modal Transformer fusion, and asynchronous communication to address the issues of detection accuracy and tracking stability for small, slow targets in complex environments, achieving efficient real-time target detection and tracking.

CN121884015B · Active · Publication Date: 2026-06-30 · CHONGQING UNIV +1

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
CHONGQING UNIV
Filing Date
2026-03-23
Publication Date
2026-06-30

AI Technical Summary

Technical Problem

Existing methods for detecting and tracking small, slow targets suffer from low detection accuracy, poor tracking stability, and insufficient processing efficiency in complex environments. They also lack radar data fusion and have imperfect real-time communication mechanisms, making it difficult to meet real-time requirements.

Method used

A multimodal fusion-based approach is adopted, which uses visible light images, infrared images and radar data for preprocessing, extracts features through a two-stream structure network, performs information fusion by combining a cross-modal Transformer feature fusion module, performs spatiotemporal alignment and information fusion through a radar-visual fusion processing module, and pushes results by combining asynchronous task processing and MQTT real-time communication mechanism.

Benefits of technology

It improves the detection accuracy and tracking stability of small, slow targets in complex environments, meets real-time requirements, and realizes effective complementarity and collaborative processing of multimodal information.

✦ Generated by Eureka AI based on patent content.

Smart Images

  • Figure CN121884015B_ABST
Patent Text Reader

Abstract

This invention relates to the field of target detection and tracking technology, specifically to a method and system for detecting and tracking small, slow targets based on multimodal fusion. The method preprocesses visible light images, infrared images, and radar data; detects visible light and infrared images separately using a two-stream network; inputs the detection results into a cross-modal Transformer fusion module for feature fusion to obtain multimodal target detection results; inputs the fused detection results and radar data into a radar-vision fusion processing module to complete radar clutter suppression, spatiotemporal alignment, and information fusion; maintains the target's tracking ID and historical trajectory through target tracking and trajectory prediction; and pushes the results to external systems in real time through asynchronous task processing and MQTT real-time communication. This invention achieves efficient fusion of visible light and infrared information, improves detection stability in complex environments, meets real-time requirements, and enhances the detection accuracy and tracking stability of small, slow targets.

Description

Technical Field

[0001] This invention relates to the field of target detection and tracking technology, and more specifically to a method and system for detecting and tracking small, slow targets based on multimodal fusion.

Background Technology

[0002] Low-altitude, small, and slow-moving targets (referred to as "low-small-slow targets") are characterized by their small size, slow motion, and susceptibility to occlusion, making their detection difficult and tracking stability poor, thus posing significant challenges to low-small-slow target detection and tracking tasks. Existing target detection and tracking methods can be mainly divided into two directions: one is detection and tracking models based on single-modal vision, and the other is detection and tracking models based on multimodal fusion. Among them, multimodal fusion methods can make full use of the complementary information of visible light images and infrared images, thereby achieving a more comprehensive feature representation of the target and providing cross-modal correlation, which helps to understand the target features of interest in different modalities.

[0003] However, some existing multimodal fusion-based models neglect the fusion and utilization of radar data in their construction, and also lack efficient real-time communication and asynchronous task processing mechanisms. Furthermore, because single-modal detection is easily disturbed and loses key target information in complex environments (such as changing lighting conditions and severe weather), some methods attempt to address this problem by fusing multimodal information, but their designs fail to adequately consider the collaborative processing of radar and visual data. Although some systems combine visible light and infrared modalities for detection, they often lack a complete system architecture and real-time communication mechanisms, resulting in low overall processing efficiency and difficulty in meeting the real-time requirements of practical applications.

[0004] Given the current state of the art, the urgent technical problem to be solved is how to provide a small, slow target detection and tracking system that effectively integrates multimodal information fusion, collaborative processing of radar and visual data, and real-time communication, and that supports efficient asynchronous processing, so as to improve the detection accuracy and tracking stability of small, slow targets in complex environments.

Summary of the Invention

[0005] In view of this, the present invention provides a method and system for detecting and tracking small, slow targets based on multimodal fusion, aiming to solve the problems of low detection accuracy, poor tracking stability and insufficient processing efficiency of small, slow targets caused by the lack of radar data fusion, imperfect real-time communication mechanism and incomplete system architecture in the prior art, thereby improving the detection and tracking performance of small, slow targets in complex environments.

[0006] In a first aspect, the present invention provides a method for detecting and tracking small, slow targets based on multimodal fusion, comprising:

[0007] Preprocessing of visible light images, infrared images, and radar data;

[0008] Based on a two-stream structure network, feature extraction and initial target detection are performed on visible light images and infrared images respectively to obtain visible light detection results and infrared detection results.

[0009] The visible light detection results and the infrared detection results are input into the cross-modal Transformer fusion module, and the fused multimodal target detection results are obtained by using the cross-modal Transformer feature fusion mechanism.

[0010] The fused multimodal target detection results and radar data are input into the radar-vision fusion processing module to complete radar clutter suppression, spatiotemporal alignment and information fusion;

[0011] The target tracking ID and historical trajectory are maintained through the target tracking and trajectory prediction steps, and future trajectory prediction is performed by combining radar ranging information.

[0012] The detection and tracking results are pushed to external systems in real time through asynchronous task processing and MQTT real-time communication mechanism.

[0013] In one specific implementation, the feature extraction and initial target detection based on the dual-stream network for visible light and infrared images respectively includes:

[0014] The target features in the visible light image are extracted by the visible light image detection submodule, and the target features in the infrared image are extracted by the infrared image detection submodule.

[0015] The visible light image detection submodule and the infrared image detection submodule together form a two-stream structure network.

[0016] In one specific implementation, obtaining the fused multimodal target detection result using the cross-modal Transformer feature fusion mechanism includes:

[0017] The target feature representations output by different modal detection networks are uniformly encoded into sequence form. Through a Transformer structure that includes multi-head attention and feedforward networks, the spatial positional relationship, category consistency and confidence distribution between targets of different modalities are modeled to obtain the attention weight matrix that represents cross-modal correlation.

[0018] Based on the attention weight matrix, the target features of different modalities are weighted, fused, and redundancy is suppressed, and the fused multimodal target detection results are output.

[0019] In one specific implementation, before inputting the visible light detection results and the infrared detection results into the cross-modal Transformer fusion module, the method further includes:

[0020] Calculate the intersection over union (IoU) between each pair of visible light candidate boxes and infrared candidate boxes. A pair of candidate boxes is considered matchable only when its IoU is greater than the first threshold and, in addition, the two predicted categories are the same or the difference in category confidence is less than the second threshold.

[0021] For each matchable pair, a cross-modal matching score is calculated from the IoU weighted by the category confidence levels of the visible light and infrared detections, using weighting coefficients;

[0022] Sort the candidate pairs by matching score from largest to smallest and assign them one-to-one in sequence.
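
The gating, scoring, and one-to-one assignment described in [0020] to [0022] can be sketched as follows. This is only an illustrative sketch: the exact score expression is not preserved in the text, so the form `IoU * (alpha * c_vis + beta * c_ir)` is an assumption, as are the function names, the threshold defaults, and the (box, class_id, confidence) tuple layout.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]      # (x1, y1, x2, y2)
Detection = Tuple[Box, int, float]           # (box, class_id, confidence)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def match_detections(vis: List[Detection], ir: List[Detection],
                     iou_thr: float = 0.3, conf_gap_thr: float = 0.2,
                     alpha: float = 0.5, beta: float = 0.5):
    """Return [(vis_index, ir_index, score)] with each detection used at most once."""
    candidates = []
    for i, (v_box, v_cls, v_conf) in enumerate(vis):
        for j, (r_box, r_cls, r_conf) in enumerate(ir):
            overlap = iou(v_box, r_box)
            if overlap <= iou_thr:                       # first threshold
                continue
            if v_cls != r_cls and abs(v_conf - r_conf) >= conf_gap_thr:
                continue                                 # second threshold
            # ASSUMED score form: IoU weighted by the two confidences.
            score = overlap * (alpha * v_conf + beta * r_conf)
            candidates.append((score, i, j))

    candidates.sort(reverse=True)                        # largest score first
    used_vis, used_ir, pairs = set(), set(), []
    for score, i, j in candidates:
        if i in used_vis or j in used_ir:
            continue
        used_vis.add(i)
        used_ir.add(j)
        pairs.append((i, j, score))
    return pairs
```

Detections left unpaired by this step are simply not returned; how they are handled afterwards is outside this sketch.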

[0023] In one specific implementation, the step of maintaining the target's tracking ID and historical trajectory through target tracking and trajectory prediction includes:

[0024] For each existing trajectory, maintain a state vector and covariance matrix, and use a Kalman filter to predict the target state;

[0025] Calculate the association cost between each predicted state and the detection result of the current frame;

[0026] The Hungarian algorithm or a greedy matching algorithm is used to find the minimum cost matching between all predicted trajectories and the current detection results, so as to obtain a one-to-one correspondence between the trajectory and the detected target, thereby updating the trajectory state and maintaining the tracking ID of each target.
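
As a minimal sketch of the association step in [0025] and [0026], the greedy variant can be written as below. The cost (Euclidean distance between predicted and detected centers), the gating value `max_cost`, and the function name are assumptions for illustration; greedy matching is not guaranteed to find the global minimum cost, which is what the Hungarian algorithm would provide.

```python
import math
from typing import Dict, List, Tuple

Point = Tuple[float, float]


def associate(predicted: Dict[int, Point], detections: List[Point],
              max_cost: float = 50.0):
    """Greedy one-to-one association of predicted tracks to detections.

    predicted: {track_id: (x, y)} predicted centers for the current frame.
    detections: [(x, y)] detected centers in the current frame.
    Returns (matches, unmatched_track_ids, unmatched_detection_indices).
    """
    costs = []
    for track_id, (px, py) in predicted.items():
        for det_idx, (dx, dy) in enumerate(detections):
            cost = math.hypot(px - dx, py - dy)      # assumed cost: center distance
            if cost <= max_cost:                     # gate out implausible pairs
                costs.append((cost, track_id, det_idx))

    costs.sort()                                     # cheapest pairs first
    matches: Dict[int, int] = {}
    used_dets = set()
    for cost, track_id, det_idx in costs:
        if track_id in matches or det_idx in used_dets:
            continue
        matches[track_id] = det_idx                  # track keeps its ID
        used_dets.add(det_idx)

    unmatched_tracks = [t for t in predicted if t not in matches]
    unmatched_dets = [d for d in range(len(detections)) if d not in used_dets]
    return matches, unmatched_tracks, unmatched_dets
```

Unmatched detections would typically start new tracks and unmatched tracks would be aged out, but those policies are not specified in the text.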

[0027] In a second aspect, the present invention provides a system for detecting and tracking small, slow targets based on multimodal fusion, comprising:

[0028] The multimodal data acquisition and preprocessing module is used to acquire and preprocess visible light images, infrared images, and radar data.

[0029] The multimodal fusion detection module is used to extract features and detect targets in visible light and infrared images based on a two-stream structure network, and to complete multimodal feature fusion and result reweighting through a cross-modal Transformer fusion module;

[0030] The target tracking and trajectory prediction module is used to perform multimodal target tracking based on the fused detection results and to perform trajectory prediction in combination with radar data;

[0031] The radar-visual fusion processing module is used to perform clutter suppression, threshold detection, and spatiotemporal alignment and fusion of radar data with visual targets;

[0032] The asynchronous task processing and real-time communication module is used to achieve asynchronous task scheduling and load balancing through message queues, and to push detection and tracking results frame by frame through the MQTT protocol.

[0033] In one specific implementation, the multimodal fusion detection module includes a visible light image detection submodule, an infrared image detection submodule, and a multimodal fusion submodule;

[0034] The visible light image detection submodule is used to extract target features from visible light images and output visible light detection results;

[0035] The infrared image detection submodule is used to extract target features from infrared images and output infrared detection results;

[0036] The multimodal fusion submodule is used to match, weight, and filter the visible light detection results and the infrared detection results, and to execute the cross-modal Transformer feature fusion mechanism to output the fused target detection results.

[0037] In one specific implementation, the cross-modal Transformer fusion module is integrated inside the multimodal fusion submodule and consists of four stacked Transformer encoders, each of which includes a multi-head self-attention sublayer and a feedforward network sublayer.

[0038] The multimodal fusion submodule first encodes the target feature representations output from the visible light and infrared streams into feature sequences, and superimposes a two-dimensional sinusoidal position code constructed based on the center coordinates of the bounding box. The feature sequences are then fed into the cross-modal Transformer fusion module for cross-modal correlation modeling to obtain the fused feature vector and corresponding attention weight matrix for each target.
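
One possible form of the two-dimensional sinusoidal position code built from the bounding box center is sketched below. The split of the 256 dimensions into an x half and a y half, the `temperature` constant, and the `scale` applied to normalized coordinates are assumptions borrowed from common Transformer practice, not values stated in the text.

```python
import math
from typing import List


def sincos_1d(pos: float, dim: int, temperature: float = 10000.0) -> List[float]:
    """Sinusoidal code of length `dim` (must be even) for a scalar position."""
    code = []
    for k in range(dim // 2):
        freq = 1.0 / (temperature ** (2 * k / dim))
        code.append(math.sin(pos * freq))
        code.append(math.cos(pos * freq))
    return code


def box_center_encoding(cx: float, cy: float, d_model: int = 256,
                        scale: float = 100.0) -> List[float]:
    """2D code: the first half encodes cx, the second half encodes cy.

    cx, cy are assumed to be normalized to [0, 1]; `scale` (assumed) spreads
    them over a wider range so nearby centers get distinguishable codes.
    """
    half = d_model // 2
    return sincos_1d(cx * scale, half) + sincos_1d(cy * scale, half)
```

The resulting vector has the same length as the token feature and is added to it element-wise before the sequence enters the encoder.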

[0039] In one specific implementation scheme, the asynchronous task processing and real-time communication module includes a message queue mechanism based on RabbitMQ and a real-time communication bridging system based on MQTT;

[0040] The message queue mechanism is used to encapsulate image detection tasks and video stream tracking tasks into task message units of a unified format, thereby achieving persistent storage of tasks, at least-once delivery, and load balancing between different algorithm services.

[0041] The MQTT real-time communication bridging system is used to receive JSON command messages issued by the front end through the command topic, and push detection results, tracking results and trajectory prediction results frame by frame through the result topic.
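
To make the command topic and result topic concrete, the sketch below builds a per-frame JSON result message and decodes a JSON command. The topic strings, the field names, and the `publish` callable are all hypothetical; the patent text does not define a message schema.

```python
import json
from typing import Callable, Dict, List, Tuple

RESULT_TOPIC = "lss/result"     # hypothetical topic name
COMMAND_TOPIC = "lss/command"   # hypothetical topic name


def build_result_message(frame_id: int, timestamp: float,
                         tracks: List[Dict]) -> str:
    """Serialize one frame of detection/tracking/prediction results."""
    payload = {"frame_id": frame_id, "timestamp": timestamp, "tracks": tracks}
    return json.dumps(payload, separators=(",", ":"))


def parse_command(raw) -> Tuple[str, Dict]:
    """Decode a JSON command message into (action, params).

    Raises ValueError if the message is not a JSON object with an 'action'.
    """
    msg = json.loads(raw)
    if not isinstance(msg, dict) or "action" not in msg:
        raise ValueError("command must be a JSON object with an 'action' field")
    return msg["action"], msg.get("params", {})


def push_frame(publish: Callable[[str, str], None], frame_id: int,
               timestamp: float, tracks: List[Dict]) -> None:
    """Push one frame of results through any publish(topic, payload) callable."""
    publish(RESULT_TOPIC, build_result_message(frame_id, timestamp, tracks))
```

Keeping the transport behind a `publish(topic, payload)` callable means the same code can be exercised with a plain list in tests and wired to a real MQTT client's publish method in deployment.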

[0042] In a specific implementation scheme, the radar-visual fusion processing module is used to perform clutter suppression and constant false alarm rate detection or threshold decision on radar echo data, align the target information detected by radar with the visual fusion detection results on the time axis and spatial coordinate system, establish the correspondence between radar targets and visual targets according to a preset matching strategy, and perform joint estimation and confidence update of target status.
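
The time-axis part of this alignment can be illustrated with a nearest-timestamp lookup. This is a sketch under assumptions: timestamps are in seconds, the radar timestamps are already sorted, and `max_gap` is a hypothetical tolerance beyond which no radar frame is associated with the image frame.

```python
import bisect
from typing import List, Optional


def nearest_radar_frame(radar_ts: List[float], image_t: float,
                        max_gap: float = 0.05) -> Optional[int]:
    """Index of the radar frame closest in time to an image frame.

    radar_ts must be sorted in ascending order. Returns None when there is
    no radar frame within `max_gap` seconds of `image_t`.
    """
    if not radar_ts:
        return None
    k = bisect.bisect_left(radar_ts, image_t)
    # Only the neighbors on either side of the insertion point can be nearest.
    candidates = [i for i in (k - 1, k) if 0 <= i < len(radar_ts)]
    best = min(candidates, key=lambda i: abs(radar_ts[i] - image_t))
    return best if abs(radar_ts[best] - image_t) <= max_gap else None
```

Spatial alignment (projecting radar range and azimuth into the image coordinate system) depends on sensor calibration parameters and is not shown here.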

[0043] Compared with existing technologies, the present invention provides a method and system for detecting and tracking low-altitude, small, and slow-moving targets based on multimodal fusion. This method and system is used for continuous perception and monitoring of low-altitude, small, and slow-moving targets. By constructing a multimodal fusion detection module that includes a cross-modal Transformer feature fusion mechanism, and combining a radar-vision fusion processing mechanism with asynchronous task processing and an MQTT real-time communication bridging system, it achieves effective complementarity and collaborative processing of visible light, infrared, and radar information. This significantly improves the overall detection and tracking performance of targets in complex environments, and has the following beneficial effects:

[0044] 1. By employing a dual-stream network to extract multi-scale features from visible light and infrared images respectively, and using a cross-modal Transformer fusion module and an adaptive feature aggregation module for fine-grained enhancement, the fused multi-modal feature representation can more comprehensively capture the complementary information of the target, thereby improving the detection accuracy and model robustness for small, slow targets.

[0045] 2. By designing a radar-vision fusion processing architecture, the fused visual detection results are spatiotemporally aligned and information-fused with radar data. A target trajectory prediction algorithm is then used to estimate the future motion state of the target, thereby enhancing the continuity of target tracking and the accuracy of trajectory prediction.

3. By introducing asynchronous task processing and real-time communication modules, a message queue mechanism is used to achieve parallel task processing and load balancing. Frame-by-frame tracking results are pushed using an MQTT real-time communication bridge system, thus improving the overall system's processing efficiency and the real-time performance of output, meeting the high real-time requirements of practical applications.

Attached Figure Description

[0046] To more clearly illustrate the technical solutions in the embodiments of the present invention or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are only embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on the provided drawings without creative effort.

[0047] Figure 1 is a flowchart of the method for detecting and tracking small, slow targets based on multimodal fusion described in this invention.

[0048] Figure 2 is a schematic diagram of the architecture of the system for detecting and tracking small, slow targets based on multimodal fusion described in this invention.

Detailed Implementation

[0049] The technical solutions in the embodiments of the present invention will be clearly and completely described below. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.

[0050] The following technical terms are used in this invention:

[0051] The cross-modal Transformer feature fusion mechanism refers to uniformly encoding the target feature representations (including target bounding boxes, categories, and confidence scores) output by different modalities (visible light and infrared) detection networks into a sequence form. Through a Transformer structure that includes multi-head attention and feedforward networks, it models the spatial relationships, category consistency, and confidence score distributions between targets of different modalities, obtaining an attention weight matrix that characterizes cross-modal correlation. Based on this matrix, the target features of different modalities are weighted, fused, and redundancy is suppressed, thereby outputting the fused multimodal target detection results.

[0052] The cross-modal Transformer fusion module is a functional sub-module integrated within the multimodal fusion detection module. This sub-module takes a target list or feature tensor in memory as input. First, it performs unified dimension mapping and position encoding on the multi-scale features output from the visible light and infrared streams. Then, it models the correlation between different modalities through the cross-modal Transformer feature fusion mechanism. Finally, it outputs the fused target features or fused detection results and provides a unified format of data input for subsequent radar-visual fusion processing and target tracking.

[0053] The radar-visual fusion processing mechanism refers to the process in which, after performing clutter suppression, constant false alarm rate (CFAR) detection, or threshold decision on radar echo data, the radar-detected target range, azimuth, and other information are aligned with the visual fusion detection results on the time axis and spatial coordinate system. Based on a preset matching strategy (e.g., based on spatial proximity and motion consistency), a correspondence between radar targets and visual targets is established. On this basis, the target state is jointly estimated and its credibility is updated, thereby improving the overall detection and tracking stability under complex backgrounds and adverse weather conditions.
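
For readers unfamiliar with constant false alarm rate detection, a one-dimensional cell-averaging CFAR can be sketched as follows. The patent text does not say which CFAR variant is used, so the cell-averaging form, the guard/training window sizes, and the fixed `scale` factor are assumptions chosen only for illustration.

```python
from typing import List, Sequence


def ca_cfar(power: Sequence[float], guard: int = 1, train: int = 4,
            scale: float = 3.0) -> List[int]:
    """Cell-averaging CFAR over a 1D range profile.

    A cell is declared a detection when its power exceeds `scale` times the
    mean power of the training cells on both sides, skipping `guard` cells
    next to the cell under test.
    """
    n = len(power)
    hits = []
    for i in range(n):
        training = [
            power[j]
            for j in range(i - guard - train, i + guard + train + 1)
            if 0 <= j < n and abs(j - i) > guard
        ]
        if not training:
            continue
        noise = sum(training) / len(training)    # local noise estimate
        if power[i] > scale * noise:             # adaptive threshold
            hits.append(i)
    return hits
```

Because the threshold follows the local noise estimate, a strong echo raises the threshold for its neighbors, which is what suppresses clutter-induced false alarms around it.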

[0054] A dual-stream network refers to a two-branch network structure used within a multimodal fusion detection module. One branch takes a visible light image as input, uses a deep convolutional network or a Transformer network to extract multi-scale features of the visible light modality, and outputs the visible light detection result. The other branch takes an infrared image as input, uses a network with similar structure or shared parameters to extract multi-scale features of the infrared modality, and outputs the infrared detection result. The two branches are relatively independent in the feature extraction and detection stages. In the multimodal fusion stage, the detection results of the two modalities are unified and fused through a cross-modal Transformer fusion module and an adaptive feature aggregation module.

[0055] The adaptive feature aggregation module is a functional unit that performs weighted fusion of multi-source features from the visible light stream, the infrared stream, and (optionally) radar-aided features. Based on attention weights or learnable fusion coefficients, this module adaptively weights and aggregates features from different modalities, scales, and spatial locations to highlight feature regions that contribute more to the discrimination of small, slow-moving targets, while suppressing background noise and redundant information. The output of the adaptive feature aggregation module is used to generate the final multimodal fused feature map or target list, and directly affects the input quality of the subsequent target tracking and trajectory prediction modules.

[0056] This invention discloses a method and system for detecting and tracking small, slow targets based on multimodal fusion. It fully leverages the advantages of multimodal fusion by constructing a multimodal fusion detection and tracking system. Utilizing the system's multimodal fusion detection module, target tracking and trajectory prediction module, radar-vision fusion processing module, and asynchronous task processing and real-time communication module, it handles multimodal data fusion, target tracking, radar-vision fusion, and real-time communication tasks respectively. This effectively captures and fuses visible light, infrared, and radar information, achieving real-time communication and asynchronous processing, thereby improving the detection accuracy and tracking stability of small, slow targets. Furthermore, the system employs a modular architecture to support flexible configuration and expansion in different scenarios.

[0057] As shown in Figure 1, the present invention provides a method for detecting and tracking small, slow targets based on multimodal fusion, comprising the following steps:

[0058] S1. Preprocess visible light images, infrared images, and radar data to construct a multimodal target detection and tracking dataset;

[0059] S2. Construct a multimodal target detection and tracking system, the system including a multimodal fusion detection module, a target tracking and trajectory prediction module, a radar-vision fusion processing module, and an asynchronous task processing and real-time communication module;

[0060] S3. Perform multimodal target detection and tracking tasks. Input the preprocessed multimodal dataset into the multimodal fusion detection module. The multimodal fusion detection module consists of a visible light image detection submodule, an infrared image detection submodule, and a multimodal fusion submodule. The visible light image detection submodule extracts target features from the visible light image, and the infrared image detection submodule extracts target features from the infrared image. The obtained visible light detection results and infrared detection results are input into the multimodal fusion submodule to perform cross-modal feature fusion and obtain the fused target detection result.

[0061] S4. Input the generated fusion detection results into the target tracking and trajectory prediction module to obtain the target's tracking ID, motion trajectory and predicted trajectory, and perform target trajectory prediction in combination with radar data;

[0062] S5. The asynchronous task processing and real-time communication module processes tasks asynchronously, the message queue mechanism enables parallel processing of tasks, and the MQTT real-time communication bridging system enables real-time push of frame-by-frame tracking results, thus completing the output of detection and tracking results.

[0063] To facilitate the implementation of steps S1 to S5 by those skilled in the art, the present invention configures the training dataset and core network parameters of the multimodal fusion detection and tracking system as follows. First, in step S1, a training set, a validation set, and a test set are constructed. Visible light images and infrared images in the training set and validation set are acquired synchronously in time and organized into image pairs. Radar data is stored frame by frame as text or binary files. Each frame contains at least the fields distance, azimuth, (optional) elevation angle, and echo intensity. Image annotation uses YOLO-format bounding box annotation files. Each image corresponds to a .txt annotation file in which each line contains five fields: "class_id x_center y_center width height". All coordinates and sizes are normalized ratios within the range [0, 1]. class_id distinguishes low, small, and slow target categories (such as small drones and birds). x_center and y_center are the normalized coordinates of the center point of the target box, and width and height are its normalized width and height. During training, visible light and infrared images are uniformly scaled to a fixed resolution (e.g., 800×800 pixels), and color normalization and data augmentation (including random flipping, random cropping, random scaling, and brightness/contrast perturbation) are applied to the images to improve the robustness of the model to different scales and scenes. The radar data is aligned with the image frames on the time axis by timestamps and transformed into a spatial coordinate system consistent with the images using sensor calibration parameters.
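
The annotation format described above can be read with a few lines of code. The sketch below converts one annotation line into a pixel-space corner box; the function name and the choice of (x1, y1, x2, y2) output are illustrative assumptions.

```python
from typing import Tuple


def parse_yolo_line(line: str, img_w: int, img_h: int) -> Tuple[int, Tuple[float, float, float, float]]:
    """Convert 'class_id x_center y_center width height' to pixel corners.

    All four geometry fields are normalized to [0, 1] relative to the image
    size, as in the annotation files described in the text.
    """
    fields = line.split()
    if len(fields) != 5:
        raise ValueError(f"expected 5 fields, got {len(fields)}")
    class_id = int(fields[0])
    xc, yc, w, h = (float(v) for v in fields[1:])
    if not all(0.0 <= v <= 1.0 for v in (xc, yc, w, h)):
        raise ValueError("coordinates must be normalized to [0, 1]")
    x1 = (xc - w / 2) * img_w
    y1 = (yc - h / 2) * img_h
    x2 = (xc + w / 2) * img_w
    y2 = (yc + h / 2) * img_h
    return class_id, (x1, y1, x2, y2)
```

For an 800×800 image, the line `0 0.5 0.5 0.25 0.5` describes a class-0 box 200 pixels wide and 400 pixels high centered in the frame.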

[0064] In the multimodal fusion detection module of step S2, the visible light image detection submodule and the infrared image detection submodule use a convolutional or residual network with 5 backbone stages as the feature extraction backbone. The backbone network outputs multi-scale feature maps at downsampling scales such as 1/4, 1/8, 1/16, and 1/32, with the number of channels in the five stages set to 64, 128, 256, 512, and 1024 respectively. A feature pyramid network is placed on top of the backbone to upsample and fuse the feature maps at different scales to obtain multi-scale fusion features. The detection head then simultaneously predicts the target category and bounding box regression parameters at three scales. When training the detection network, the loss function consists of a classification loss (such as cross-entropy or focal loss), a bounding box regression loss (such as L1 or IoU loss), and a regularization term for multi-scale fusion stability. Training hyperparameters such as learning rate and batch size start from conventional configurations (e.g., a batch size of 8 to 32 and an initial learning rate of 1e-4 to 1e-3), and are tuned according to the aforementioned principle of "using detection accuracy, recall, tracking and association accuracy, or a comprehensive evaluation of the above indicators as evaluation criteria on the validation set".

[0065] In the target tracking and trajectory prediction module of step S4, the tracking ID and state vector of each target are maintained based on the fusion detection results. The state vector includes at least the target center position and velocity components. The state transition matrix and observation matrix of the filter are configured using a common uniform motion model in the art. The process noise covariance and observation noise covariance are set within an empirical range (e.g., standard deviation of a few pixels to tens of pixels) based on the magnitude of radar ranging accuracy and visual detection error, and are tuned on the validation set using tracking ID retention rate, trajectory continuity, and trajectory prediction error as evaluation indicators. The trajectory prediction submodule uses radar measurements as observation inputs to the Kalman filter or other state estimation methods at the same timestamp, updates the target state together with the visually estimated target position, and extrapolates based on the current velocity vector within a short time window to obtain the target prediction position at several future times. Through the above training dataset construction method and core network parameter configuration, those skilled in the art can implement and train the multimodal fusion detection and tracking system described in this invention on existing deep learning platforms (such as PyTorch).
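
A uniform-motion (constant-velocity) Kalman filter of the kind described here can be sketched for a single coordinate axis as below. This is an assumption-laden illustration: the state is [position, velocity] on one axis, the process noise is added as a simple diagonal term, and the visual and radar observations are assumed to be already expressed as positions on the same axis so they can be applied as two sequential updates with different variances.

```python
from typing import List


class ConstantVelocityKF:
    """Single-axis Kalman filter with state [position, velocity]."""

    def __init__(self, pos: float, vel: float = 0.0,
                 p_var: float = 100.0, q: float = 1.0):
        self.x = [pos, vel]
        self.P = [[p_var, 0.0], [0.0, p_var]]
        self.q = q  # simplified process noise added to each diagonal term

    def predict(self, dt: float = 1.0) -> None:
        p, v = self.x
        self.x = [p + v * dt, v]
        (a, b), (c, d) = self.P
        # P <- F P F^T + Q with F = [[1, dt], [0, 1]]
        self.P = [[a + dt * (b + c) + dt * dt * d + self.q, b + dt * d],
                  [c + dt * d, d + self.q]]

    def update(self, z: float, r: float) -> None:
        """Fuse a position measurement z with variance r (H = [1, 0])."""
        (a, b), (c, d) = self.P
        s = a + r                      # innovation variance
        k0, k1 = a / s, c / s          # Kalman gain
        y = z - self.x[0]              # innovation
        self.x = [self.x[0] + k0 * y, self.x[1] + k1 * y]
        # P <- (I - K H) P
        self.P = [[(1 - k0) * a, (1 - k0) * b],
                  [c - k1 * a, d - k1 * b]]

    def extrapolate(self, steps: int, dt: float = 1.0) -> List[float]:
        """Predicted positions at the next `steps` times using current velocity."""
        p, v = self.x
        return [p + v * dt * (k + 1) for k in range(steps)]
```

A hypothetical per-frame loop would call `predict()`, then `update(visual_pos, r=visual_var)` followed by `update(radar_pos, r=radar_var)` when both observations share a timestamp, and finally `extrapolate(n)` for the short-horizon trajectory prediction.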

[0066] This invention fully leverages the advantages of multimodal fusion. By constructing a multimodal fusion detection and tracking system, it utilizes the system's multimodal fusion detection module, target tracking and trajectory prediction module, radar-vision fusion processing module, and asynchronous task processing and real-time communication module to effectively capture and fuse visible light, infrared, and radar information, achieving real-time communication and asynchronous processing, thereby improving the detection accuracy and tracking stability of small, slow targets.

[0067] In this embodiment, the multimodal fusion detection module described in S2 introduces a cross-modal feature fusion mechanism. The multimodal fusion detection module includes a visible light image detection submodule, an infrared image detection submodule, and a multimodal fusion submodule. The visible light image detection submodule extracts target features from visible light images, specifically using a single-stage target detection network with five backbone stages and a feature pyramid structure as its backbone. It outputs candidate target boxes and their category confidence scores at multiple scales for each frame of the visible light image. The infrared image detection submodule has the same structure as the visible light image detection submodule, outputting corresponding candidate target boxes and category confidence scores for each frame of the infrared image. The multimodal fusion submodule matches, weights, and filters the visible light and infrared detection results, and then executes a cross-modal Transformer feature fusion mechanism to provide basic data for the subsequent step S3, which outputs the fused target detection results.

[0068] The "cross-modal Transformer fusion module" consists of four stacked Transformer encoder layers, each comprising a multi-head self-attention sub-layer and a feedforward network sub-layer. The multi-head self-attention sub-layer has 8 attention heads and an attention hidden dimension of 256. The feedforward network is a two-layer fully connected network with an intermediate layer dimension of 1024 and GELU as the activation function. LayerNorm and residual connections are applied around each sub-layer. The multimodal fusion submodule first encodes the target feature representations output by the visible light and infrared streams into a feature sequence of length $N_v + N_{ir}$, where $N_v$ is the number of visible light targets in the current frame's visible light detection results that participate in cross-modal fusion after filtering by confidence threshold and non-maximum suppression, and $N_{ir}$ is the number of infrared targets that participate in cross-modal fusion after the same screening of the current frame's infrared detection results; both are non-negative integers and vary from frame to frame. Each target's feature vector is obtained by linearly mapping and concatenating fields such as the bounding box center coordinates, width and height, class embedding, and confidence score, and is then superimposed with a two-dimensional sinusoidal position code constructed from the bounding box center coordinates. This feature sequence is fed into the aforementioned four-layer Transformer encoder for cross-modal correlation modeling, yielding the fused feature vector and the corresponding attention weight matrix for each target. For each visible/infrared target pair, the multimodal fusion submodule calculates the cross-modal correlation weight from the attention weights; according to these correlation weights, the features are weighted, summed, and redundancy-suppressed to output fused features with a dimension of 256. The fused features are then restored to the channel dimension required by the detection head through a linear mapping layer. The result is used to generate the fused multimodal target detection results and a unified-format input for the radar-vision fusion processing module and the target tracking and trajectory prediction module.
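As a non-limiting illustration, the construction of the feature sequence and its two-dimensional sinusoidal position code can be sketched as follows. The frequency base of 10000, the split of channels between the two coordinates, the use of normalized center coordinates, and the function names are assumptions made for the sketch; the linear mapping that produces each target's feature vector is taken as given.

```python
import math

def sinusoidal_1d(pos, dim):
    """Standard sinusoidal code: sin/cos pairs at geometrically spaced frequencies."""
    code = []
    for k in range(dim // 2):
        freq = 1.0 / (10000.0 ** (2 * k / dim))  # assumed frequency base
        code.append(math.sin(pos * freq))
        code.append(math.cos(pos * freq))
    return code

def position_code_2d(cx, cy, dim=256):
    """Assumed layout: first half of the channels encodes cx, second half cy."""
    half = dim // 2
    return sinusoidal_1d(cx, half) + sinusoidal_1d(cy, half)

def build_sequence(visible, infrared, dim=256):
    """Concatenate both modalities into one sequence and add the position code.

    Each entry is (feature_vector, (cx, cy)); the feature vector is assumed to
    be the already linearly mapped target feature of length `dim`.
    """
    sequence = []
    for feat, (cx, cy) in visible + infrared:
        pe = position_code_2d(cx, cy, dim)
        sequence.append([f + p for f, p in zip(feat, pe)])
    return sequence  # length = number of visible targets + number of infrared targets
```

The resulting sequence is what would be passed to the four-layer encoder; the encoder itself is omitted here.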

[0069] The "adaptive feature aggregation module" performs weighted aggregation of multi-source features from the visible light stream, the infrared stream, and radar-aided features in both the channel and spatial dimensions. Internally, it includes a channel attention branch and a spatial attention branch: the channel attention branch generates channel weights by passing the global average pooling result of each feature channel through two fully connected layers with ReLU activation; the spatial attention branch performs max pooling and average pooling on the feature map along the channel dimension, followed by a 3×3 convolutional layer and Sigmoid activation, to generate spatial weights. The feature map is then multiplied element-wise by the channel weights and the spatial weights, achieving adaptive enhancement and suppression of features from different modalities, scales, and spatial locations. This highlights feature regions that contribute more to the discrimination of low, small, and slow targets while suppressing background noise and redundant information. The output of the adaptive feature aggregation module, as the final multimodal fusion feature map or target feature list, is fed directly into the target tracking and trajectory prediction module. Regarding network training and update rules, this invention trains the cross-modal Transformer fusion module, the adaptive feature aggregation module, and the aforementioned detection backbone network jointly end to end: during the training phase, all network parameters, including those of the cross-modal Transformer fusion module and the adaptive feature aggregation module, are updated through backpropagation of the detection, fusion, and tracking losses. The optimizer is AdamW, with an initial learning rate of 1×10^-4 and a weight decay coefficient of 1×10^-2, and training runs for no fewer than 100 epochs. The learning rate decays gradually under a cosine annealing strategy, or stepwise if the validation set metrics do not improve over a long period. The total loss function consists of three parts: the detection loss $L_{det}$, the fusion loss $L_{fus}$, and the tracking loss $L_{trk}$. The detection loss $L_{det}$ includes the classification cross-entropy loss and the bounding box regression IoU loss. The fusion loss $L_{fus}$ constrains the consistency between the fused features output by the cross-modal Transformer and the single-modal features in class prediction and bounding box regression. The tracking loss $L_{trk}$ constrains the consistency of the same target between adjacent frames in center position, velocity, and ID preservation. The total loss is the weighted sum $L = L_{det} + \lambda_1 L_{fus} + \lambda_2 L_{trk}$, where the weight coefficients $\lambda_1$ and $\lambda_2$ are set to 0.5 and 1.0 respectively; the learning rate and the loss weights are jointly tuned according to the aforementioned validation set evaluation criteria.
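By way of example only, the cosine annealing schedule and the weighted total loss can be sketched as below. The sketch assumes that the learning rate decays to a floor of zero over the full training run and that the two stated weights, 0.5 and 1.0, apply to the fusion and tracking terms in that order with the detection term unweighted; the function and parameter names are illustrative.

```python
import math

def cosine_annealing_lr(epoch, total_epochs=100, lr0=1e-4, lr_min=0.0):
    """Learning rate at a given epoch under cosine annealing.

    lr0 is the stated initial learning rate; lr_min is an assumed floor.
    """
    cos_term = 1.0 + math.cos(math.pi * epoch / total_epochs)
    return lr_min + 0.5 * (lr0 - lr_min) * cos_term

def total_loss(l_det, l_fus, l_trk, w_fus=0.5, w_trk=1.0):
    """Weighted sum of detection, fusion, and tracking losses.

    Assumption: 0.5 weights the fusion loss and 1.0 weights the tracking loss.
    """
    return l_det + w_fus * l_fus + w_trk * l_trk
```

Under these assumptions the rate starts at 1×10^-4, passes through half that value at the midpoint, and reaches the floor at the final epoch.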

[0070] In this embodiment, the S2 process specifically includes: inputting the preprocessed multimodal dataset into the multimodal fusion detection module; extracting target features from the visible light and infrared images through the visible light image detection submodule and the infrared image detection submodule, respectively; and pairing and weighting candidate targets of the two modalities according to a preset matching rule in the multimodal fusion submodule. The adjustable parameters (such as thresholds and weight coefficients) involved in cross-modal fusion and target tracking below are all tuned with the overall performance of multimodal detection and tracking as the optimization objective; on the validation set, the evaluation criterion is detection precision, recall, tracking association accuracy, or a combination of these indicators (such as a weighted sum). The matching rule is as follows. First, the intersection-over-union (IoU) between each pair of visible light and infrared candidate boxes is calculated as $\mathrm{IoU}(A, B) = |A \cap B| / |A \cup B|$, where $A$ and $B$ are the two candidate boxes, and $|A \cap B|$ and $|A \cup B|$ are the areas of their intersection and union, respectively. A pair of candidate boxes is considered matchable only if $\mathrm{IoU}$ is greater than the first threshold $T_1$ and the two predicted categories are the same or the difference in category confidence is less than the second threshold $T_2$. The first threshold $T_1$ is the minimum IoU for cross-modal candidate box matching; it ranges from 0.3 to 0.7, preferably 0.5, to ensure sufficient spatial overlap between matched pairs. The second threshold $T_2$ is the upper limit of the allowable difference between the category confidences predicted by the two modalities; it ranges from 0.1 to 0.3, preferably 0.2, and excludes pairs with low confidence or inconsistent categories. Then, for all candidate pairs that meet these conditions, the cross-modal matching score is calculated as a weighted sum of IoU and confidence, $S = \alpha \cdot \mathrm{IoU} + \beta \cdot (p_v + p_{ir})/2$, where $p_v$ and $p_{ir}$ are the category confidences of the visible light and infrared detections, respectively, and $\alpha$ and $\beta$ are weighting coefficients that balance the contributions of the IoU and the average confidence to the matching score. They satisfy $\alpha + \beta = 1$; $\alpha$ ranges from 0.5 to 0.8 and $\beta$ ranges from 0.2 to 0.5, preferably ... Within this range, grid search or stepwise adjustment is used to optimize the detection precision, recall, and tracking association accuracy on the validation set. Finally, candidate pairs are sorted by $S$ in descending order and assigned one-to-one in sequence, so that no candidate box of either modality is matched more than once. For successfully matched target pairs, feature-level and result-level fusion is performed according to a weighted fusion strategy of visible light and infrared features. For example, a weighted average is used for bounding box coordinates, and a maximum value or weighted sum strategy is used for class confidence, resulting in multi-scale, multi-level fused candidate results for the detection and recognition of small, slow-moving targets of different sizes.
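The matching rule described above can be sketched, purely for illustration, as follows. The box format (x1, y1, x2, y2), the dictionary keys, and the IoU and confidence weights 0.6 and 0.4 are assumptions chosen inside the stated ranges; the thresholds default to the preferred values 0.5 and 0.2.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def match_candidates(visible, infrared, t1=0.5, t2=0.2, alpha=0.6, beta=0.4):
    """Pair visible and infrared detections one-to-one by descending score.

    Each detection is a dict with keys 'box', 'cls', 'conf' (assumed layout).
    Returns a list of (visible_index, infrared_index, score).
    """
    pairs = []
    for vi, v in enumerate(visible):
        for ii, r in enumerate(infrared):
            overlap = iou(v["box"], r["box"])
            same_cls = v["cls"] == r["cls"]
            close_conf = abs(v["conf"] - r["conf"]) < t2
            if overlap > t1 and (same_cls or close_conf):
                score = alpha * overlap + beta * (v["conf"] + r["conf"]) / 2
                pairs.append((score, vi, ii))
    pairs.sort(reverse=True)  # largest score first
    used_v, used_i, matches = set(), set(), []
    for score, vi, ii in pairs:
        if vi not in used_v and ii not in used_i:
            used_v.add(vi)
            used_i.add(ii)
            matches.append((vi, ii, score))
    return matches
```

Sorting once and consuming pairs greedily keeps the assignment one-to-one without a separate optimization step, at the cost of not guaranteeing a globally optimal total score.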

[0071] Preferably, in one embodiment, step S3 can be further subdivided into the following sub-steps:

[0072] S31. Input the preprocessed visible light image and infrared image into the multimodal fusion detection module, and perform feature extraction through the visible light image detection submodule and the infrared image detection submodule respectively to obtain target detection results in the two modalities. Each detection result can be represented as a set of "target feature representations", and a single target feature representation includes the target bounding box coordinates, the category label c, and its confidence score.

[0073] S32. For the visible light image detection results, extract each target's bounding box, category information, and confidence score, and define this combined information as the "target feature representation" of the visible light modality. Then filter out low-confidence candidate boxes using the confidence threshold $\tau_v$, where $\tau_v$ is the lower bound of the visible light detection confidence: candidate boxes below this value are discarded (the specific value can be determined on the validation set or by a general setting). Optionally, non-maximum suppression is applied to the target bounding boxes to eliminate duplicate boxes near the same target.
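A minimal sketch of this screening step, assuming a confidence lower bound of 0.3 and a suppression IoU of 0.5 (neither value is stated in this embodiment), is given below. The suppression is class-agnostic for brevity, and the dictionary keys are illustrative.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def filter_and_nms(dets, conf_thresh=0.3, nms_iou=0.5):
    """Drop low-confidence boxes, then apply greedy non-maximum suppression.

    conf_thresh and nms_iou are assumed values for illustration only.
    """
    candidates = sorted(
        (d for d in dets if d["conf"] >= conf_thresh),
        key=lambda d: d["conf"],
        reverse=True,
    )
    kept = []
    for d in candidates:
        # Keep a box only if it does not overlap strongly with a stronger kept box.
        if all(iou(d["box"], k["box"]) <= nms_iou for k in kept):
            kept.append(d)
    return kept
```

The same routine can be reused unchanged for the infrared stream with its own confidence lower bound.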

[0074] S33. For the infrared image detection results, extract each target's bounding box, category information, and confidence score, and define this combined information as the "target feature representation" of the infrared modality. Screening likewise uses a confidence threshold $\tau_{ir}$ (the lower bound of the infrared detection confidence, determined in the same way) together with non-maximum suppression. The visible light and infrared detection results are then input into the multimodal fusion submodule, and the cross-modal correlation weight is calculated from the positional overlap and category consistency between targets. It can be expressed as $w_{ij} = \lambda \cdot \mathrm{IoU}_{ij} + (1 - \lambda) \cdot s_{ij}$, where $i$ and $j$ are the indices of the visible light target and the infrared target, respectively; $\mathrm{IoU}_{ij}$ is the intersection-over-union of the bounding boxes of the $i$-th visible light target and the $j$-th infrared target; and $s_{ij}$ is the category consistency score, which takes values within the range [0,1]. The score $s_{ij}$ is defined piecewise: one expression applies when the two category labels are the same and another when they differ, written in terms of $p_i^{v}$ and $p_j^{ir}$, the category confidences of the $i$-th visible light target and the $j$-th infrared target. $\lambda$ is the balancing coefficient, which balances the contributions of positional overlap and category consistency in the correlation weight; it ranges from 0.3 to 0.7, preferably 0.5, and can be adjusted within this range to optimize the overall detection precision and recall on the validation set.

[0075] S34. Based on the cross-modal correlation weights, perform weighted fusion and redundancy suppression on the visible light detection results and infrared detection results. First, for each visible light target $i$, the infrared target $j$ with the largest weight $w_{ij}$ is selected as the candidate match; when $w_{ij}$ is greater than the preset threshold $T_w$, the pair is determined to be a valid match. $T_w$ is the lower bound of the correlation weight for a valid cross-modal match; it ranges from 0.3 to 0.7, preferably 0.4, and filters out weakly correlated pairs. Within this range, the specific value is selected according to the cross-modal matching accuracy and overall detection precision on the validation set. For successfully matched target pairs, the bounding box coordinates and class confidence scores are averaged or reweighted according to the weights; for example, the fused confidence is $p_f = \gamma \cdot p_i^{v} + (1 - \gamma) \cdot p_j^{ir}$, where $\gamma$ is the fusion weight that balances the visible light and infrared confidences, is monotonically positively correlated with $w_{ij}$, and preferably takes values from 0.3 to 0.7. Single-modal targets for which no valid match is found can be retained or discarded depending on the task requirements. The fused target list is converted into a unified data structure containing fields such as target ID placeholders, bounding boxes, categories, and confidence levels, and is output to the target tracking and trajectory prediction module in this list format.
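Steps S33 and S34 can be illustrated with the sketch below. Several parts are assumptions made only so that the sketch runs: the correlation weight is taken as a convex combination of IoU and the category consistency score with balancing coefficient 0.5; the consistency score is taken as the mean of the two confidences when the labels agree and 0 otherwise; the fusion weight is a linear map of the correlation weight onto 0.3 to 0.7; and unmatched single-modal targets are discarded.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def consistency(v, r):
    """ASSUMED form: mean confidence if the labels agree, otherwise 0."""
    return (v["conf"] + r["conf"]) / 2 if v["cls"] == r["cls"] else 0.0

def fuse(visible, infrared, lam=0.5, t_w=0.4):
    """For each visible target, fuse with its best infrared match above t_w."""
    fused = []
    for v in visible:
        best_w, best_r = 0.0, None
        for r in infrared:
            w = lam * iou(v["box"], r["box"]) + (1 - lam) * consistency(v, r)
            if w > best_w:
                best_w, best_r = w, r
        if best_r is None or best_w <= t_w:
            continue  # unmatched target: discarded in this sketch
        gamma = 0.3 + 0.4 * best_w  # ASSUMED monotonic map of w onto [0.3, 0.7]
        box = tuple(gamma * a + (1 - gamma) * b for a, b in zip(v["box"], best_r["box"]))
        conf = gamma * v["conf"] + (1 - gamma) * best_r["conf"]
        cls = v["cls"] if v["conf"] >= best_r["conf"] else best_r["cls"]
        fused.append({"id": None, "box": box, "cls": cls, "conf": conf})
    return fused
```

The `id` field is left as a placeholder for the tracker to fill. This sketch does not prevent one infrared target from being selected by two visible targets; a deployment would add that redundancy check.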

[0076] Preferably, in one embodiment, step S4 can be further subdivided into the following sub-steps:

[0077] S41. The fused detection results are input into the target tracking and trajectory prediction module, and the multimodal tracking submodule tracks targets in the visible light video and infrared video synchronously. For each target detected in the current frame, the multimodal tracking submodule attempts to find the corresponding trajectory in the trajectory set of the previous frame. Specifically, this includes: maintaining a state vector (e.g., containing the target center position and velocity components) and a covariance matrix for each existing trajectory; using a Kalman filter to predict the target state; and calculating the association cost between each predicted state and each detection result of the current frame. The association cost can jointly consider the bounding box IoU, the center point distance, and (optionally) the radar ranging difference, and is calculated as $C_{ij} = w_1 (1 - \mathrm{IoU}_{ij}) + w_2 \, d_{ij}/L + w_3 \, r_{ij}/L$, where $i$ is the trajectory index and $j$ is the detection index; $\mathrm{IoU}_{ij}$ is the intersection-over-union of the $i$-th trajectory's predicted box and the $j$-th detection box; $d_{ij}$ is the Euclidean distance (in pixels) between the centers of the two boxes on the image plane; and $r_{ij}$ is the Euclidean distance (in pixels) between the radar measurement corresponding to detection $j$, after calibration into the image coordinate system, and the center of the $i$-th trajectory's predicted box. The radar measurement for detection $j$ is the radar point closest to the center of the detection box within a threshold; if there is no radar or no measurement within the threshold, $r_{ij}$ is set to 0 and $w_3$ is set to 0. $L$ is a reference length related to the image scale, namely the diagonal length of the current frame image, $L = \sqrt{W^2 + H^2}$, where $W$ and $H$ are the image width and height (in pixels), so that the three terms $1 - \mathrm{IoU}_{ij}$, $d_{ij}/L$, and $r_{ij}/L$ are dimensionless or comparable in magnitude, avoiding the dimensional inconsistency caused by directly adding IoU and pixel distance. $w_1$, $w_2$, and $w_3$ are weighting coefficients that balance the relative contributions of the IoU term, the center point distance term, and the radar range error term to the association cost; each ranges from 0.3 to 1.5, preferably ... In actual deployment, the three cost terms are first normalized to a consistent scale and then fine-tuned within the above range to optimize the association matching accuracy and tracking ID retention rate on the validation set. Then the Hungarian algorithm or a greedy matching algorithm is used to solve the minimum-cost matching between all predicted trajectories and the current detection results, yielding a one-to-one correspondence between trajectories and detections. The trajectory state is then updated and the tracking ID of each target is maintained. Trajectories that have not been matched with a detection for a long time are terminated according to a lost-frame-count threshold.
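The association step can be sketched as follows, as one possible reading rather than the definitive formula. The sketch assumes the IoU term enters the cost as one minus the IoU, that both distance terms are divided by the image diagonal, that all three weights equal 1.0 (inside the stated 0.3 to 1.5 range), and that a hypothetical gate `max_cost` rejects implausible pairs. Greedy matching is shown; the Hungarian algorithm could be substituted.

```python
import math

def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def center(box):
    return ((box[0] + box[2]) / 2, (box[1] + box[3]) / 2)

def association_cost(pred_box, det_box, img_w, img_h, radar_xy=None,
                     w1=1.0, w2=1.0, w3=1.0):
    """Cost between a predicted track box and a detection box.

    radar_xy is the radar point already projected into image coordinates,
    or None when no radar measurement falls inside the gate.
    """
    diag = math.hypot(img_w, img_h)  # reference length: image diagonal
    pc, dc = center(pred_box), center(det_box)
    d = math.dist(pc, dc)
    if radar_xy is None:
        r, w3 = 0.0, 0.0  # radar term switched off
    else:
        r = math.dist(radar_xy, pc)
    return w1 * (1 - iou(pred_box, det_box)) + w2 * d / diag + w3 * r / diag

def greedy_match(cost, max_cost=1.0):
    """Greedy one-to-one matching on a cost matrix (rows: tracks, cols: detections)."""
    entries = sorted((c, i, j) for i, row in enumerate(cost) for j, c in enumerate(row))
    used_t, used_d, matches = set(), set(), []
    for c, i, j in entries:
        if c <= max_cost and i not in used_t and j not in used_d:
            used_t.add(i)
            used_d.add(j)
            matches.append((i, j))
    return matches
```

Tracks left unmatched by `greedy_match` would have their lost-frame counter incremented, and unmatched detections would start new trajectories.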

[0078] S42. Input the historical tracking trajectory and radar data into the trajectory prediction submodule. By filtering and smoothing the trajectory sequence, and combining the distance and azimuth information measured by the radar for spatiotemporal alignment and information fusion, predict the future motion trajectory of the target. Specifically, at the same timestamp, determine the radar measurement corresponding to each trajectory based on the correspondence between radar and vision (e.g., the matching of radar target and visual trajectory established by the aforementioned radar-vision fusion processing mechanism according to time alignment and spatial proximity). Use the radar measurement as the observation input for Kalman filtering or other state estimation methods, and update the state of the trajectory together with the target position estimated by vision. After obtaining the filtered state sequence, short-time extrapolation can be performed based on the current velocity vector to obtain the target prediction position at several future times, thereby realizing the estimation and prediction of the target's motion state.
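As a simplified illustration of S42, the sketch below runs a one-dimensional constant-velocity Kalman filter that takes a vision-derived position and a radar-derived position as two sequential observations and then extrapolates a few steps ahead. The one-axis state, the noise values, and the class and method names are assumptions; a full implementation would track both image axes and use calibrated measurement covariances.

```python
class ConstantVelocityKF:
    """1-D constant-velocity Kalman filter; state = [position, velocity]."""

    def __init__(self, pos, vel=0.0, p0=10.0, q=0.01):
        self.x = [pos, vel]
        self.P = [[p0, 0.0], [0.0, p0]]  # assumed initial uncertainty
        self.q = q                        # assumed process noise

    def predict(self, dt=1.0):
        p, v = self.x
        self.x = [p + v * dt, v]
        (a, b), (c, d) = self.P
        # P <- F P F^T + Q with F = [[1, dt], [0, 1]] and Q = q * I
        self.P = [[a + dt * (b + c) + dt * dt * d + self.q, b + dt * d],
                  [c + dt * d, d + self.q]]

    def update(self, z, r):
        """Fuse a position measurement z with variance r (H = [1, 0])."""
        (a, b), (c, d) = self.P
        s = a + r                 # innovation variance
        k0, k1 = a / s, c / s     # Kalman gain
        y = z - self.x[0]         # innovation
        self.x = [self.x[0] + k0 * y, self.x[1] + k1 * y]
        # P <- (I - K H) P
        self.P = [[(1 - k0) * a, (1 - k0) * b],
                  [c - k1 * a, d - k1 * b]]

    def extrapolate(self, steps, dt=1.0):
        """Short-time prediction from the current position and velocity."""
        p, v = self.x
        return [p + v * dt * k for k in range(1, steps + 1)]
```

A typical per-frame cycle would call `predict()`, then `update(vision_position, vision_variance)`, then `update(radar_position, radar_variance)` when a matched radar measurement exists, and finally `extrapolate(n)` to obtain the predicted positions.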

[0079] The radar-visual fusion processing module implements a "radar-visual fusion processing mechanism", which specifically includes: suppressing clutter in the radar echo data (for example by moving target indication or filtering), and then obtaining the range, azimuth, and optional elevation of the radar-detected targets through constant false alarm rate detection or threshold decision; and aligning the radar targets with the visual fusion detection results on the time axis and in the spatial coordinate system. In time, they are aligned to the same frame according to their timestamps; in space, the radar measurements are converted into a coordinate system consistent with vision through sensor calibration. A correspondence between radar targets and visual targets is then established according to a preset matching strategy. The matching strategy can be implemented using any existing radar-visual target association algorithm in the field, such as nearest neighbor association based on spatial proximity and velocity consistency, gated filtering, or joint probabilistic data association; this invention does not limit it, as long as one-to-one or one-to-many association is completed within a given time difference tolerance and spatial threshold. Joint state estimation and confidence updates are performed on successfully matched radar-visual target pairs, thereby improving the overall detection and tracking stability under complex backgrounds and adverse weather conditions.
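Constant false alarm rate detection admits many variants; the sketch below shows a one-dimensional cell-averaging CFAR as one common example, not necessarily the variant used in this embodiment. The numbers of guard and training cells and the threshold scale factor are assumed values; in practice the scale factor is derived from the desired false alarm probability.

```python
def ca_cfar(power, guard=2, train=8, scale=5.0):
    """Cell-averaging CFAR over a 1-D power profile.

    For each cell under test, the noise level is estimated as the mean of the
    training cells on both sides, skipping the guard cells next to the cell.
    Returns the indices whose power exceeds scale * noise estimate.
    """
    hits = []
    for i in range(len(power)):
        left = power[max(0, i - guard - train): max(0, i - guard)]
        right = power[i + guard + 1: i + guard + 1 + train]
        reference = left + right
        if not reference:
            continue  # profile too short to estimate noise here
        noise = sum(reference) / len(reference)
        if power[i] > scale * noise:
            hits.append(i)
    return hits
```

Because the threshold follows the local noise estimate, a strong return raises the threshold only for its neighbors and is not included in its own reference window.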

[0080] In this embodiment, in S4: the generated fusion detection result is input into the target tracking and trajectory prediction module, and the target tracking ID, motion trajectory and predicted trajectory information are obtained through the above target association and state estimation algorithm; the radar-vision fusion processing module fuses the radar and vision results according to the above radar-vision fusion processing mechanism, and works with the target tracking and trajectory prediction module to complete trajectory prediction and state update.

[0081] In this embodiment, in step S5, the asynchronous task processing and real-time communication module uses a RabbitMQ-based message queue mechanism to perform asynchronous processing and load balancing of tasks. Specifically, after receiving a user request, the front-end or upper-layer scheduling module encapsulates the image detection task and video stream tracking task into a unified task message unit. The task message is written in JSON format to a RabbitMQ queue named "/jobcommand". The message structure includes: a header field (containing timestamp and version), a data field (containing command type, the path1/path2 input paths, and the pathtype path type identifier), and a commandid field (a unique task identifier generated using UUID). The queue is declared as a durable queue, and the producer sets delivery_mode=2 when sending messages so that messages are not lost when the broker restarts. The backend detection and tracking service, acting as a RabbitMQ consumer, sets `basic_qos(prefetch_count=1)` on the `/jobcommand` queue after establishing a connection. It then pulls task messages one at a time, parses the JSON, and calls the corresponding target detection, camouflage target recognition, or multimodal tracking algorithm to complete the processing. Upon successful task processing, it calls `basic_ack` to acknowledge the message. On unrecoverable errors such as JSON parsing failure, it calls `basic_nack` without re-queuing the message; on temporary errors, it calls `basic_nack` and re-queues the message, thus achieving reliable task delivery and error recovery. For the execution results of detection and tracking tasks, the asynchronous task processing module uses a result builder to encapsulate the results into a result message containing a header, data, and commandid. The data field includes at least `resultpath` (result file path), `pathtype`, `isfinish` (whether it is the last result of this task), `tempresult` (frame-by-frame temporary statistics), and `result` (overall statistics or evaluation metrics). The result message can be pushed via the result queue or published directly by the algorithm service as an MQTT message. For MQTT real-time communication bridging, this embodiment uses Eclipse Mosquitto as the MQTT server. The backend algorithm program subscribes to the command topic "/cqu/cmdop" (QoS 1) through the paho-mqtt client and receives JSON-formatted command messages from the Web platform. The command message structure follows the system MQTT interface specification and includes at least the fields command, id, arguments (file path/type/filecategory), and timespan. After processing each image frame or each video segment, the algorithm organizes the infrared target detection results, multimodal tracking results, and trajectory prediction results into JSON-formatted result messages, and publishes them to result topics such as "/cqu/itdr", "/cqu/ittr", and "/cqu/ttpr" with QoS 1 and retain=False. The result message contains at least the following fields: commandid (the original command ID, returned unchanged), timespan, origin_image_url, result_visible_image_url, result_infrared_image_url, currentframe, totalframe, and the detection/tracking/trajectory fields. These fields describe the target detection bounding box, category, confidence level, IoU, position, and size for each frame. The Web platform, as a subscriber, receives these results by topic and updates the interface display, thereby achieving decoupled integration with the external command and control platform and real-time push of frame-by-frame results.
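The task message and the acknowledgement policy can be illustrated with the sketch below, which uses only the standard library so that it runs without a broker. The JSON key `command` for the command type, the `TransientError` class, and the function names are hypothetical; the callback follows the (channel, method, properties, body) signature used by pika consumers, and the comments indicate where the pika calls named in this embodiment would go.

```python
import json
import time
import uuid

def build_task_message(command, path1, path2, pathtype, version="1.0"):
    """Serialize a task message with header, data, and commandid fields."""
    return json.dumps({
        "header": {"timestamp": time.time(), "version": version},
        "data": {"command": command, "path1": path1,
                 "path2": path2, "pathtype": pathtype},
        "commandid": str(uuid.uuid4()),
    })

class TransientError(Exception):
    """Hypothetical marker for temporary failures that are worth retrying."""

def make_consumer(process):
    """Build a consumer callback that applies the ack / nack policy."""
    def on_message(channel, method, properties, body):
        try:
            task = json.loads(body)
        except (ValueError, TypeError):
            # Unrecoverable: malformed JSON is rejected without re-queuing.
            channel.basic_nack(delivery_tag=method.delivery_tag, requeue=False)
            return
        try:
            process(task)
        except TransientError:
            # Temporary failure: put the message back on the queue.
            channel.basic_nack(delivery_tag=method.delivery_tag, requeue=True)
            return
        channel.basic_ack(delivery_tag=method.delivery_tag)
    return on_message

# With pika, the wiring would look roughly like this (not executed here):
#   channel.queue_declare(queue="/jobcommand", durable=True)
#   channel.basic_qos(prefetch_count=1)
#   channel.basic_consume(queue="/jobcommand",
#                         on_message_callback=make_consumer(run_task))
# and the producer would publish with
#   properties=pika.BasicProperties(delivery_mode=2)
```

Keeping the policy in a plain callback makes it testable with a stub channel before it is attached to a real connection.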

[0082] In this embodiment, in S1: a multimodal target detection and tracking dataset is constructed, the visible light image and infrared image are unified in size and standardized in format, and the radar data is preprocessed and converted in format.

[0083] As shown in Figure 2, the present invention discloses a low-speed, small-target detection and tracking system based on multimodal fusion. The system includes a multimodal data acquisition and preprocessing module; an infrared image detection submodule, used to extract target feature information from infrared images; a multimodal fusion detection module, used to fuse visible light and infrared detection results and extract cross-modal correlation information; and a target tracking and trajectory prediction module, which includes a multimodal tracking submodule and a trajectory prediction submodule and integrates detection results and radar data to achieve continuous target tracking and trajectory prediction.

[0084] The various embodiments described in this specification are presented in a progressive manner, with each embodiment focusing on its differences from other embodiments. Similar or identical parts between embodiments can be referred to interchangeably. The above description of the disclosed embodiments enables those skilled in the art to make or use the invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of the invention. Therefore, the invention is not to be limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims

1. A method for detecting and tracking small, slow targets based on multimodal fusion, characterized in that it includes: preprocessing visible light images, infrared images, and radar data; performing feature extraction and initial target detection on the visible light and infrared images, respectively, using a two-stream network, to obtain visible light and infrared detection results; pairing and weighting target features from the visible light and infrared images according to a preset matching rule, wherein the intersection-over-union $\mathrm{IoU}$ between each pair of visible light and infrared candidate boxes is calculated; only when $\mathrm{IoU}$ is greater than the first threshold $T_1$, and the two predicted categories are the same or the difference in category confidence is less than the second threshold $T_2$, is the pair of candidate boxes considered matchable; the cross-modal matching score is calculated as a weighted sum of $\mathrm{IoU}$ and confidence, $S = \alpha \cdot \mathrm{IoU} + \beta \cdot (p_v + p_{ir})/2$, where $p_v$ and $p_{ir}$ are the category confidences of the visible light and infrared detections, respectively, and $\alpha$ and $\beta$ are the weighting coefficients; and the candidate pairs are sorted by $S$ in descending order and assigned one-to-one in sequence; inputting the visible light detection results and the infrared detection results into the cross-modal Transformer fusion module, and obtaining the fused multimodal target detection results using the cross-modal Transformer feature fusion mechanism; inputting the fused multimodal target detection results and the radar data into the radar-vision fusion processing module to complete radar clutter suppression, spatiotemporal alignment, and information fusion; maintaining the target tracking ID and historical trajectory through the target tracking and trajectory prediction steps, and predicting the future trajectory in combination with radar ranging information; wherein the generated fusion detection results are input into the target tracking and trajectory prediction module, and the target tracking ID, motion trajectory, and predicted trajectory information are obtained through target association and state estimation algorithms; the radar-vision fusion processing module fuses radar and vision results according to the radar-vision fusion processing mechanism, and works with the target tracking and trajectory prediction module to complete trajectory prediction and state update; the step of maintaining the target tracking ID and historical trajectory through target tracking and trajectory prediction includes: maintaining, for each existing trajectory, a state vector and covariance matrix, and using a Kalman filter to predict the target state; calculating the association cost between each predicted state and each detection result of the current frame; and using the Hungarian algorithm or a greedy matching algorithm to find the minimum-cost matching between all predicted trajectories and the current detection results, so as to obtain a one-to-one correspondence between trajectories and detected targets, thereby updating the trajectory state and maintaining the tracking ID of each target; and pushing the detection and tracking results to external systems in real time through asynchronous task processing and the MQTT real-time communication mechanism.

2. The method for detecting and tracking small, slow targets based on multimodal fusion according to claim 1, characterized in that, The method of performing feature extraction and initial target detection on visible light and infrared images based on a two-stream network includes: The target features in the visible light image are extracted by the visible light image detection submodule, and the target features in the infrared image are extracted by the infrared image detection submodule. The visible light image detection submodule and the infrared image detection submodule together form a two-stream structure network.

3. The method for detecting and tracking small, slow targets based on multimodal fusion according to claim 1, characterized in that, The method of obtaining the fused multimodal target detection result using the cross-modal Transformer feature fusion mechanism includes: The target feature representations output by different modal detection networks are uniformly encoded into sequence form. Through a Transformer structure that includes multi-head attention and feedforward networks, the spatial positional relationship, category consistency and confidence distribution between targets of different modalities are modeled to obtain the attention weight matrix that represents cross-modal correlation. Based on the attention weight matrix, the target features of different modalities are weighted, fused, and redundancy is suppressed, and the fused multimodal target detection results are output.

4. A low-speed, small target detection and tracking system based on multimodal fusion, characterized in that it applies the method of any one of claims 1-3 and includes: a multimodal data acquisition and preprocessing module, used to acquire and preprocess visible light images, infrared images, and radar data; a multimodal fusion detection module, used to extract features and detect targets in visible light and infrared images based on a two-stream structure network, and to complete multimodal feature fusion and result reweighting through a cross-modal Transformer fusion module; a target tracking and trajectory prediction module, used to perform multimodal target tracking based on the fused detection results and to perform trajectory prediction in combination with radar data; a radar-visual fusion processing module, used to perform clutter suppression, threshold detection, and spatiotemporal alignment and fusion of radar data with visual targets; and an asynchronous task processing and real-time communication module, used to achieve asynchronous task scheduling and load balancing through message queues, and to push detection and tracking results frame by frame through the MQTT protocol.

5. The low-speed, small-target detection and tracking system based on multimodal fusion according to claim 4, characterized in that, The multimodal fusion detection module includes a visible light image detection submodule, an infrared image detection submodule, and a multimodal fusion submodule; The visible light image detection submodule is used to extract target features from visible light images and output visible light detection results; The infrared image detection submodule is used to extract target features from infrared images and output infrared detection results; The multimodal fusion submodule is used to match, weight, and filter the visible light detection results and the infrared detection results, and to execute the cross-modal Transformer feature fusion mechanism to output the fused target detection results.

6. The low-speed, small-target detection and tracking system based on multimodal fusion according to claim 5, characterized in that, The cross-modal Transformer fusion module is integrated inside the multimodal fusion submodule and consists of four stacked Transformer encoders, each of which includes a multi-head self-attention sublayer and a feedforward network sublayer. The multimodal fusion submodule first encodes the target feature representations output from the visible light and infrared streams into feature sequences, and superimposes a two-dimensional sinusoidal position code constructed based on the center coordinates of the bounding box. The feature sequences are then fed into the cross-modal Transformer fusion module for cross-modal correlation modeling to obtain the fused feature vector and corresponding attention weight matrix for each target.

7. The low-speed, small-target detection and tracking system based on multimodal fusion according to claim 4, characterized in that the asynchronous task processing and real-time communication module includes a message queue mechanism based on RabbitMQ and a real-time communication bridging system based on MQTT. The message queue mechanism is used to encapsulate image detection tasks and video stream tracking tasks into task message units of a unified format, thereby achieving persistent storage of tasks, at-least-once delivery, and load balancing between different algorithm services. The MQTT real-time communication bridging system is used to receive JSON command messages issued by the front end through the command topic, and to push detection results, tracking results, and trajectory prediction results frame by frame through the result topics.

8. The low-speed, small-target detection and tracking system based on multimodal fusion according to claim 4, characterized in that, The radar-visual fusion processing module is used to suppress clutter and perform constant false alarm rate detection or threshold decision on radar echo data, align the target information detected by radar with the visual fusion detection results on the time axis and spatial coordinate system, establish the correspondence between radar targets and visual targets according to the preset matching strategy, and perform joint estimation and confidence update of target status.