Unmanned aerial vehicle low-altitude target detection and tracking method based on multi-modal fusion
By employing a multimodal fusion method for low-altitude UAV target detection, and utilizing spatiotemporal alignment and dynamic adaptive gating fusion modules, combined with Kalman filtering and nonlinear maneuver compensation, the accuracy and stability issues in low-altitude target detection and tracking are resolved, achieving high-precision target tracking.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- HUNAN YOUJIA INTELLIGENT TECH CO LTD
- Filing Date
- 2026-03-31
- Publication Date
- 2026-06-26
AI Technical Summary
Existing UAV low-altitude target detection and tracking methods suffer from low detection accuracy and poor tracking stability in complex environments. They also have insufficient spatiotemporal alignment accuracy in multi-sensor data fusion, fixed feature fusion methods that are difficult to adapt to environmental changes, and weak trajectory prediction capabilities.
A multimodal fusion method is adopted, which constructs a dynamic adaptive gating fusion module by using spatiotemporally aligned visible light and infrared images. Combined with Kalman filtering and nonlinear maneuver compensation mechanism, target detection and trajectory prediction are performed to achieve multi-scale feature extraction and cross-modal feature interaction.
It significantly improves the detection accuracy and robustness of low-altitude targets in complex environments, enhances the continuity and stability of target tracking, and is suitable for highly maneuverable targets.
Figure CN122289779A_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the field of unmanned aerial vehicle (UAV) target detection and tracking technology, and more specifically, to a method for detecting and tracking low-altitude UAV targets based on multimodal fusion.
Background Technology
[0002] In recent years, unmanned aerial vehicles (UAVs) have been increasingly used in military reconnaissance, security patrols, and agricultural monitoring, making low-altitude target detection and tracking a key technology. However, the low-altitude environment is complex and variable, and its targets are small and highly maneuverable. Single sensors (such as visible light or infrared sensors) are easily affected by lighting, occlusion, and weather conditions, resulting in low detection accuracy and poor tracking stability. Existing methods mostly employ multi-sensor data fusion, but they suffer from insufficient spatiotemporal alignment accuracy, fixed feature fusion schemes that adapt poorly to environmental changes, and weak trajectory prediction capabilities. Therefore, there is an urgent need for a method that can adapt to the environment, fuse multimodal features, and achieve high-precision and stable tracking.
Summary of the Invention
[0003] To overcome the aforementioned deficiencies of the prior art, embodiments of the present invention provide a method for detecting and tracking low-altitude targets of unmanned aerial vehicles based on multimodal fusion, in order to solve the problems mentioned in the background art.
[0004] To achieve the above objectives, the present invention provides the following technical solution:
A1: Acquire spatiotemporally aligned visible light and infrared images;
A2: Extract multi-scale feature maps from the visible light image and the infrared image respectively;
A3: Construct a dynamic adaptive gating fusion module that calculates modal confidence based on current environmental information and performs channel-level dynamic weighting and feature interaction in the ROI region to generate fused features;
A4: Perform target detection based on the fused features and extract deep appearance features;
A5: Construct a trajectory prediction and calibration model based on the temporal co-occurrence matrix, combined with Kalman filtering and a nonlinear maneuver compensation mechanism, to associate detected targets with historical trajectories and output a continuous and stable tracking trajectory.
[0005] Preferably, in step A1, a multimodal synchronized acquisition trigger is activated first. The UAV flight control system issues a synchronized acquisition command so that the visible light camera and the infrared thermal imager start acquiring data at the same time. The visible light image is output at a resolution of 1920×1080 and 30 FPS, and the infrared image at 640×512 and 25 FPS, which ensures initial synchronization of the raw data. During acquisition, the timestamps of both sensors (microsecond-level accuracy) and the attitude data output by the UAV IMU (MPU9250; roll, pitch, and yaw updated at 100 Hz) are recorded in real time as the basis for subsequent spatiotemporal calibration.

Next, time synchronization calibration is performed. To handle the sampling time difference between the visible light and infrared sensors caused by hardware response delays and differing data transmission links, a time correlation model is built from the sensor timestamps and IMU data, and linear interpolation compensates for the asynchronous frames. For example, when the infrared image lags the visible light image by 0.8 ms, the infrared timestamp is corrected according to the IMU attitude change trend over that interval, so that frames from both sensors are aligned in time. The time synchronization error is ultimately held to ≤1 ms, avoiding target position deviations caused by temporal misalignment.

Spatial coordinate calibration follows. Sensor extrinsic calibration is first completed with the hand-eye calibration method to obtain the position offset and attitude deflection angle of the visible light camera and the infrared thermal imager relative to the UAV body coordinate system, from which an extrinsic parameter matrix is established. During acquisition, the attitude angles output by the UAV IMU are read in real time and combined with the extrinsic parameter matrix to dynamically correct the acquisition angles of the two sensors. For example, when the UAV pitch angle changes by ±5°, a spatial coordinate transformation maps the pixel coordinates of the infrared image into the visible light image coordinate system, correcting the spatial offset caused by attitude jitter so that the pixel position deviation of the same target between the two images is ≤0.5 pixels. Because the infrared resolution is lower than the visible light resolution, bilinear interpolation is applied during spatial calibration to match the infrared image to the pixel dimensions of the visible light image, laying the spatial foundation for subsequent data-level fusion.
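The time synchronization step above relies on linear interpolation over IMU samples. The following is a minimal sketch of that idea, not the patented implementation: the log layout, the function names `interpolate_attitude` and `sync_error_ok`, and the sample values are all hypothetical, and angle wrap-around is ignored for brevity.

```python
from bisect import bisect_right

# Hypothetical IMU log: (timestamp_us, roll_deg, pitch_deg, yaw_deg) at 100 Hz,
# i.e. one sample every 10,000 microseconds.
IMU_LOG = [
    (0, 0.0, 1.0, 90.0),
    (10_000, 0.0, 2.0, 90.0),
    (20_000, 0.0, 2.5, 90.0),
]

def interpolate_attitude(imu_log, t_us):
    """Linearly interpolate (roll, pitch, yaw) at time t_us from a sorted IMU log."""
    times = [sample[0] for sample in imu_log]
    if not times[0] <= t_us <= times[-1]:
        raise ValueError("timestamp outside the recorded IMU window")
    hi = bisect_right(times, t_us)
    if hi == len(times):  # t_us coincides with the last sample
        return tuple(imu_log[-1][1:])
    lo = hi - 1
    weight = (t_us - times[lo]) / (times[hi] - times[lo])
    return tuple(a + weight * (b - a)
                 for a, b in zip(imu_log[lo][1:], imu_log[hi][1:]))

def sync_error_ok(t_vis_us, t_ir_us, limit_us=1_000):
    """Check the <=1 ms synchronization bound stated in the text."""
    return abs(t_ir_us - t_vis_us) <= limit_us

# Infrared frame sampled 0.8 ms after the visible light frame.
t_vis, t_ir = 4_000, 4_800
att_vis = interpolate_attitude(IMU_LOG, t_vis)
att_ir = interpolate_attitude(IMU_LOG, t_ir)
pitch_drift = att_ir[1] - att_vis[1]  # attitude change accumulated during the delay
print(f"pitch drift over the delay: {pitch_drift:.3f} deg")
```

The interpolated attitude at each sensor timestamp is what a downstream correction would consume; how the patent turns that drift into a timestamp or pose correction is not specified in the text.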
[0006] Preferably, in A2, a lightweight multi-scale feature extraction network, combined with a cross-modal feature enhancement mechanism, performs hierarchical feature extraction on the spatiotemporally aligned dual-modal images. Specifically, a dual-branch parallel feature extraction architecture is constructed first, in which the visible light branch and the infrared branch share the network structure but have independent training parameters. The network is based on an improved MobileNetV3 architecture and balances feature extraction capability against computational cost through depthwise separable convolutions and inverted residual structures. Because visible light images carry rich texture and color information, a channel attention module (CBAM) is added to the shallow convolutional layers (Conv1-Conv3) of the visible light branch to enhance detail features such as target edges and contours. Because infrared images are less affected by illumination and show a marked temperature difference between target and background, a spatial attention module (SAM) is introduced into the middle convolutional layers (Conv4-Conv6) of the infrared branch to strengthen the feature response of the target region.

Multi-scale feature extraction then outputs feature maps at four scales (denoted S1, S2, S3, and S4, corresponding to downsampling factors of 4×, 8×, 16×, and 32×, respectively) from different stages of the network. For a visible light input of 1920×1080, S1 is 480×270 with 64 channels, S2 is 240×135 with 128 channels, S3 is 120×68 with 256 channels, and S4 is 60×34 with 512 channels. For infrared images adapted to 1920×1080 by bilinear interpolation, the feature map sizes and channel counts at each scale match those of the visible light branch.

During feature extraction, a weighted feature fusion (WFF) mechanism fuses adjacent-scale features of the same modality. The fused feature map of modality m (m ∈ {Vis, IR}, for visible light and infrared respectively) at the k-th scale is computed from the original feature map at scale k and a bilinearly upsampled feature map of the adjacent scale, combined through adaptive weight coefficients.

Feature enhancement is then applied to the multi-scale feature maps of each modality. For visible light features, contrast-limited adaptive histogram equalization (CLAHE) enhances the contrast of the shallow feature maps (S1, S2) and suppresses background noise. For infrared features, the target edge gradient information of the middle feature maps (S2, S3) is enhanced by fusing Gaussian filtering with the Laplacian operator: the enhanced infrared feature map combines a Gaussian filtering operation of a given standard deviation with Laplacian edge detection, and the fusion weight is set to 0.6.
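The feature map sizes quoted above can be checked with a short calculation. This sketch assumes the four scales use the 4×, 8×, 16×, and 32× downsampling factors named in step A4 and that fractional sizes round up, as padded strided convolutions typically produce; the function name `feature_map_sizes` is illustrative only.

```python
import math

def feature_map_sizes(width, height, strides=(4, 8, 16, 32)):
    """Spatial size of each scale, assuming ceiling rounding at every stride."""
    return [(math.ceil(width / s), math.ceil(height / s)) for s in strides]

sizes = feature_map_sizes(1920, 1080)
channels = (64, 128, 256, 512)  # channel counts stated in the text

for name, (w, h), c in zip(("S1", "S2", "S3", "S4"), sizes, channels):
    print(f"{name}: {w}x{h}, {c} channels")
```

Under these assumptions the heights 68 and 34 come from rounding 67.5 and 33.75 upward, which matches the 120×68 and 60×34 figures in the text.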
[0007] Preferably, in A3, a dynamic adaptive gating fusion module is constructed to fuse features according to environmental perception and modal reliability assessment. The overall module architecture is built first. It comprises an environmental information perception unit, a modal confidence calculation unit, a precise ROI positioning unit, a channel-level dynamic weighting unit, and a cross-modal feature interaction unit, forming a closed-loop "perception-assessment-fusion-enhancement" process.

The environmental information perception unit extracts an environmental feature vector by analyzing key parameters of the spatiotemporally aligned dual-modal images. The vector consists of the average illumination intensity of the visible light image (normalized to [0,1]), the temperature contrast between target and background in the infrared image, the image occlusion rate (computed through edge detection and connected component analysis), and the modal noise intensity (estimated as Gaussian noise for visible light images and as salt-and-pepper noise density for infrared images). It provides the environmental basis for modal confidence calculation. The temperature contrast is computed from the average temperature of the target area and the average temperature of the background area.

Next, the modal confidence calculation unit quantifies the reliability of the two modalities, with a separate evaluation model designed for each according to the characteristics of visible light and infrared images. For visible light images, the confidence is positively correlated with illumination intensity and negatively correlated with occlusion rate and noise intensity. The calculation includes an adjustment coefficient and applies the Sigmoid function to strengthen the influence of illumination intensity, so that the confidence approaches 1 under sufficient lighting and drops below 0.3 under dim lighting.
[0008] Preferably, in A4, a lightweight multi-scale detection network is built on the multi-scale fused feature maps generated in the preceding step. The network takes the fused features at 4×, 8×, 16×, and 32× downsampling as input and adopts an improved YOLOv8-nano architecture. A Feature Pyramid Network (FPN) and a Path Aggregation Network (PAN) provide cross-scale feature fusion, improving adaptability to targets of different sizes. Because low-altitude scenes contain a high proportion of small, slow-moving targets, a small-target enhancement branch is added to the network neck: a 1×1 convolution raises the channel count of the 480×270 feature map to 128, and the result is fused with the 240×135 feature map by element-wise addition to strengthen small-target features. The detection head uses a decoupled design that separates the classification branch from the regression branch. The classification branch strengthens semantic feature extraction through global average pooling, and the regression branch introduces a coordinate attention module to improve localization accuracy.

Target detection proceeds as follows. The fused feature maps at the four scales are input to the detection network, and the detection head outputs the class probability, bounding box coordinates, and confidence score of each target. Bounding box regression uses the CIoU loss to optimize localization accuracy:

$$ L_{CIoU} = 1 - IoU + \frac{\rho^{2}(b, b^{gt})}{c^{2}} + \alpha v $$

where $IoU$ is the intersection over union of the predicted and ground-truth bounding boxes, $\rho^{2}(b, b^{gt})$ is the squared Euclidean distance between the center points of the two boxes, $c$ is the diagonal length of the smallest rectangle enclosing both boxes, $\alpha$ is the balance coefficient, and $v$ measures the consistency of the bounding box aspect ratios. This loss drives the regressed bounding box toward the actual target location.

After detection, a modified non-maximum suppression algorithm (Soft-NMS) filters the results. For candidate boxes whose overlap exceeds 0.5, low-confidence boxes are not eliminated directly. Instead, a decay function reduces the confidence of each candidate box according to its overlap with the high-confidence box and an attenuation coefficient. The final output is the set of detections with confidence ≥0.7, including the target category (such as drone, bird, or balloon), precise bounding box coordinates, and detection confidence.
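The Soft-NMS filtering described above can be sketched as follows. The text does not preserve the decay function, so this example assumes a Gaussian decay `exp(-IoU^2 / sigma)` applied only when the overlap exceeds 0.5; the value `sigma=0.5`, the corner-format boxes, and the function names are hypothetical choices for illustration.

```python
import math

def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def soft_nms(boxes, scores, overlap_thresh=0.5, sigma=0.5, keep_thresh=0.7):
    """Decay (rather than drop) overlapping boxes, then keep scores >= keep_thresh."""
    remaining = list(zip(boxes, scores))
    kept = []
    while remaining:
        best = max(range(len(remaining)), key=lambda i: remaining[i][1])
        box, score = remaining.pop(best)
        if score < keep_thresh:
            break  # every remaining score is lower still
        kept.append((box, score))
        decayed = []
        for other, s in remaining:
            overlap = iou(box, other)
            if overlap > overlap_thresh:
                s *= math.exp(-(overlap * overlap) / sigma)  # assumed Gaussian decay
            decayed.append((other, s))
        remaining = decayed
    return kept

boxes = [(0, 0, 10, 10), (1, 0, 11, 10), (20, 20, 30, 30)]
scores = [0.9, 0.85, 0.8]
kept = soft_nms(boxes, scores)
print([b for b, _ in kept])
```

In this toy input the second box overlaps the first heavily, so its score decays below the 0.7 output threshold, while the distant third box is untouched.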
[0009] Preferably, in A5, a temporal co-occurrence matrix is constructed to mine target motion correlations, and a target temporal association model is built from previous detection results and historical trajectory data. The target detection sets of the historical frames are defined first: the set for each frame contains the targets detected in that frame, and each target comprises bounding box coordinates, a deep appearance feature, and a confidence score, where z denotes the bounding box width. The temporal co-occurrence matrix is then constructed, where U denotes the total number of targets in the historical frames and each matrix element gives the relevance between the m-th historical target and a current target. The relevance is a combination, through weighting coefficients, of the intersection over union (IoU) of the two target bounding boxes, the cosine similarity of their appearance features, and a term governed by the frame interval between the two targets and a time decay coefficient. Through the temporal co-occurrence matrix, the spatial overlap, appearance similarity, and temporal continuity between targets are quantified, providing a multi-dimensional basis for trajectory association.
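A rough sketch of how such a relevance matrix might be assembled is given below. The exact formula is not preserved in the text, so this example assumes a weighted sum of IoU, cosine similarity, and an exponential time decay `exp(-decay * frame_gap)`; the weights `(0.4, 0.4, 0.2)`, the decay value, the dictionary layout, and all function names are hypothetical.

```python
import math

def iou_center(a, b):
    """IoU of two boxes given as (cx, cy, z, h): center, width z, height h."""
    ax1, ay1, ax2, ay2 = a[0] - a[2] / 2, a[1] - a[3] / 2, a[0] + a[2] / 2, a[1] + a[3] / 2
    bx1, by1, bx2, by2 = b[0] - b[2] / 2, b[1] - b[3] / 2, b[0] + b[2] / 2, b[1] + b[3] / 2
    inter = max(0.0, min(ax2, bx2) - max(ax1, bx1)) * max(0.0, min(ay2, by2) - max(ay1, by1))
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

def cosine(u, v):
    """Cosine similarity of two appearance feature vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(y * y for y in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0

def cooccurrence_matrix(history, current, weights=(0.4, 0.4, 0.2), decay=0.1):
    """Rows: historical targets; columns: current targets; entries: relevance."""
    w_iou, w_app, w_time = weights
    matrix = []
    for h in history:
        row = []
        for c in current:
            frame_gap = c["frame"] - h["frame"]
            row.append(w_iou * iou_center(h["box"], c["box"])
                       + w_app * cosine(h["feat"], c["feat"])
                       + w_time * math.exp(-decay * frame_gap))
        matrix.append(row)
    return matrix

history = [{"frame": 0, "box": (10, 10, 4, 4), "feat": (1.0, 0.0)}]
current = [
    {"frame": 1, "box": (10, 10, 4, 4), "feat": (1.0, 0.0)},    # same target
    {"frame": 1, "box": (100, 100, 4, 4), "feat": (0.0, 1.0)},  # unrelated target
]
M = cooccurrence_matrix(history, current)
print([round(v, 3) for v in M[0]])
```

A tracker would typically feed such a matrix to an assignment step; the text does not say which assignment method the invention uses.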
[0010] Next, a Kalman filter (KF) is introduced to build the basic trajectory prediction model. Given the motion characteristics of low-altitude targets, a hybrid constant-velocity and constant-acceleration motion model describes the target's motion state. The target state vector comprises the bounding box center coordinates, the bounding box dimensions, the velocity of the center coordinates, and the acceleration. The state transition matrix and the process noise matrix are set with the frame interval as a parameter. The observation matrix retains only the position and size components of the state vector, and an observation noise matrix is specified. The prediction step of the Kalman filter yields the prior estimate of the target state and the prior covariance matrix, which provide predicted positions for trajectory association.

The technical effects and advantages of this invention are as follows: by combining spatiotemporal alignment with multimodal feature fusion, the invention significantly improves the detection accuracy and robustness of low-altitude targets in complex environments; by constructing a dynamic adaptive gating fusion module that adjusts modal weights according to environmental information, it enhances the expressive power of the fused features; and by introducing a nonlinear maneuver compensation mechanism and a temporal co-occurrence matrix, it improves the continuity and stability of target tracking, making it particularly suitable for highly maneuverable targets.
Attached Figure Description
[0011] Figure 1 is a schematic diagram of the method flow of the present invention.
[0012] Figure 2 is a schematic diagram of the detection-tracking bidirectional closed-loop process of the present invention.
Detailed Implementation
[0013] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.
[0014] As shown in Figure 1, this invention provides a method for detecting and tracking low-altitude targets of unmanned aerial vehicles (UAVs) based on multimodal fusion, including:
A1: Acquire spatiotemporally aligned visible light and infrared images;
In A1, a multimodal synchronized acquisition trigger is activated first. The UAV flight control system issues a synchronized acquisition command so that the visible light camera and the infrared thermal imager start acquiring data at the same time. The visible light image is output at a resolution of 1920×1080 and 30 FPS, and the infrared image at 640×512 and 25 FPS, which ensures initial synchronization of the raw data. During acquisition, the timestamps of both sensors (microsecond-level accuracy) and the attitude data output by the UAV IMU (MPU9250; roll, pitch, and yaw updated at 100 Hz) are recorded in real time as the basis for subsequent spatiotemporal calibration.

Next, time synchronization calibration is performed. To handle the sampling time difference between the visible light and infrared sensors caused by hardware response delays and differing data transmission links, a time correlation model is built from the sensor timestamps and IMU data, and linear interpolation compensates for the asynchronous frames. For example, when the infrared image lags the visible light image by 0.8 ms, the infrared timestamp is corrected according to the IMU attitude change trend over that interval, so that frames from both sensors are aligned in time. The time synchronization error is ultimately held to ≤1 ms, avoiding target position deviations caused by temporal misalignment.

Spatial coordinate calibration follows. Sensor extrinsic calibration is first completed with the hand-eye calibration method to obtain the position offset and attitude deflection angle of the visible light camera and the infrared thermal imager relative to the UAV body coordinate system, from which an extrinsic parameter matrix is established. During acquisition, the attitude angles output by the UAV IMU are read in real time and combined with the extrinsic parameter matrix to dynamically correct the acquisition angles of the two sensors. For example, when the UAV pitch angle changes by ±5°, a spatial coordinate transformation maps the pixel coordinates of the infrared image into the visible light image coordinate system, correcting the spatial offset caused by attitude jitter so that the pixel position deviation of the same target between the two images is ≤0.5 pixels. Because the infrared resolution is lower than the visible light resolution, bilinear interpolation is applied during spatial calibration to match the infrared image to the pixel dimensions of the visible light image, laying the spatial foundation for subsequent data-level fusion.

Finally, a calibration quality verification unit checks the spatiotemporal alignment in real time by computing the mutual information of the same target region in the aligned visible light and infrared images:

$$ MI = \sum_{i=1}^{N_{V}} \sum_{j=1}^{N_{I}} p_{VI}(i,j) \log \frac{p_{VI}(i,j)}{p_{V}(i)\, p_{I}(j)} $$

where $N_{V}$ and $N_{I}$ are the numbers of gray levels in the visible light and infrared images, respectively; $p_{VI}(i,j)$ is the joint probability distribution function, the probability that a pixel has gray value $i$ in the visible light image and gray value $j$ in the infrared image; and $p_{V}(i)$ and $p_{I}(j)$ are the marginal probability distribution functions, the probabilities that a pixel has gray value $i$ in the visible light image and gray value $j$ in the infrared image, respectively. When the mutual information value is greater than 0.7, the calibration is considered valid and the spatiotemporally aligned dual-modal images are output. When it is less than 0.7, extrinsic calibration and attitude correction are repeated until the calibration accuracy requirement is met, ensuring consistent and accurate target features in the subsequent multimodal fusion.
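The mutual information check described above can be illustrated with a small histogram-based sketch. It assumes the two regions are supplied as equally long flat sequences of gray values and uses a base-2 logarithm; the text does not state the log base or any normalization, so the 0.7 threshold here is applied to the raw value purely for illustration, and the function names are hypothetical.

```python
import math
from collections import Counter

def mutual_information(vis_pixels, ir_pixels):
    """Mutual information (in bits) between two aligned gray-value sequences."""
    if len(vis_pixels) != len(ir_pixels) or not vis_pixels:
        raise ValueError("regions must be non-empty and the same size")
    n = len(vis_pixels)
    joint = Counter(zip(vis_pixels, ir_pixels))  # joint histogram
    hist_vis = Counter(vis_pixels)               # marginal histograms
    hist_ir = Counter(ir_pixels)
    mi = 0.0
    for (i, j), count in joint.items():
        p_ij = count / n
        mi += p_ij * math.log2(p_ij / ((hist_vis[i] / n) * (hist_ir[j] / n)))
    return mi

def calibration_valid(vis_pixels, ir_pixels, threshold=0.7):
    """Accept the alignment when the mutual information exceeds the threshold."""
    return mutual_information(vis_pixels, ir_pixels) > threshold

# Aligned regions: the infrared values follow the visible light pattern.
aligned = mutual_information([0, 0, 255, 255], [10, 10, 200, 200])
# Misaligned regions: the two patterns are statistically independent.
misaligned = mutual_information([0, 0, 255, 255], [10, 200, 10, 200])
print(aligned, misaligned)
```

The aligned pair shares all of its structure and scores one bit, while the independent pair scores zero, which is the behavior the verification step depends on.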
[0015] A2: Extract multi-scale feature maps from the visible light image and the infrared image respectively;
In A2, a lightweight multi-scale feature extraction network, combined with a cross-modal feature enhancement mechanism, performs hierarchical feature extraction on the spatiotemporally aligned dual-modal images. Specifically, a dual-branch parallel feature extraction architecture is constructed first, in which the visible light branch and the infrared branch share the network structure but have independent training parameters. The network is based on an improved MobileNetV3 architecture and balances feature extraction capability against computational cost through depthwise separable convolutions and inverted residual structures. Because visible light images carry rich texture and color information, a channel attention module (CBAM) is added to the shallow convolutional layers (Conv1-Conv3) of the visible light branch to enhance detail features such as target edges and contours. Because infrared images are less affected by illumination and show a marked temperature difference between target and background, a spatial attention module (SAM) is introduced into the middle convolutional layers (Conv4-Conv6) of the infrared branch to strengthen the feature response of the target region.

Multi-scale feature extraction then outputs feature maps at four scales (denoted S1, S2, S3, and S4, corresponding to downsampling factors of 4×, 8×, 16×, and 32×, respectively) from different stages of the network. For a visible light input of 1920×1080, S1 is 480×270 with 64 channels, S2 is 240×135 with 128 channels, S3 is 120×68 with 256 channels, and S4 is 60×34 with 512 channels. For infrared images adapted to 1920×1080 by bilinear interpolation, the feature map sizes and channel counts at each scale match those of the visible light branch.

During feature extraction, a weighted feature fusion (WFF) mechanism fuses adjacent-scale features of the same modality. The fused feature map of each modality (visible light and infrared) at the k-th scale is computed from the original feature map at scale k and a bilinearly upsampled feature map of the adjacent scale, combined through adaptive weight coefficients.

Feature enhancement is then applied to the multi-scale feature maps of each modality. For visible light features, contrast-limited adaptive histogram equalization (CLAHE) enhances the contrast of the shallow feature maps (S1, S2) and suppresses background noise. For infrared features, the target edge gradient information of the middle feature maps (S2, S3) is enhanced by fusing Gaussian filtering with the Laplacian operator: the enhanced infrared feature map combines a Gaussian filtering operation of a given standard deviation with Laplacian edge detection, and the fusion weight is set to 0.6.

Finally, the extracted multi-scale feature maps are normalized with batch normalization to eliminate distribution differences between modal features:

$$ \hat{F} = \gamma \, \frac{F - \mu_{B}}{\sqrt{\sigma_{B}^{2} + \epsilon}} + \beta $$

where $\mu_{B}$ is the batch mean of the feature map, $\sigma_{B}^{2}$ is the batch variance, $\epsilon$ is a small value that prevents a zero denominator, and $\gamma$ and $\beta$ are the learnable scaling and offset coefficients. Normalization places the features of every scale and modality in the same numerical range, providing a unified feature basis for subsequent cross-modal feature fusion.

After extraction, enhanced feature maps at four scales each for visible light and infrared are output for subsequent feature-level fusion.
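To show why batch normalization removes the distribution gap between modalities, the sketch below normalizes two toy activation lists with very different ranges. It is a plain-Python illustration over a flat list, with `gamma`, `beta`, and `eps` set to common default values rather than learned ones; the sample numbers are made up.

```python
import math

def batch_normalize(values, gamma=1.0, beta=0.0, eps=1e-5):
    """Standard batch normalization over a flat batch of activations."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / n
    return [gamma * (v - mean) / math.sqrt(variance + eps) + beta for v in values]

# Hypothetical activations: visible light features in [0, 1], infrared in [10, 40].
vis_features = [0.1, 0.2, 0.3, 0.4]
ir_features = [10.0, 20.0, 30.0, 40.0]

vis_norm = batch_normalize(vis_features)
ir_norm = batch_normalize(ir_features)
print([round(v, 3) for v in vis_norm])
print([round(v, 3) for v in ir_norm])
```

After normalization both lists have near-zero mean and near-unit variance, so a later fusion step is not dominated by whichever modality happens to produce larger raw values.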
[0016] A3: Construct a dynamic adaptive gating fusion module to calculate modal confidence based on the current environmental information, and perform channel-level dynamic weighting and feature interaction in the ROI region to generate fused features;
In A3, a dynamic adaptive gating fusion module is constructed to fuse features according to environmental perception and modal reliability assessment. The overall module architecture is built first. It comprises an environmental information perception unit, a modal confidence calculation unit, a precise ROI positioning unit, a channel-level dynamic weighting unit, and a cross-modal feature interaction unit, forming a closed-loop "perception-assessment-fusion-enhancement" process.

The environmental information perception unit extracts an environmental feature vector by analyzing key parameters of the spatiotemporally aligned dual-modal images. The vector consists of the average illumination intensity of the visible light image (normalized to [0,1]), the temperature contrast between target and background in the infrared image, the image occlusion rate (computed through edge detection and connected component analysis), and the modal noise intensity (estimated as Gaussian noise for visible light images and as salt-and-pepper noise density for infrared images). It provides the environmental basis for modal confidence calculation. The temperature contrast is computed from the average temperature of the target area and the average temperature of the background area.

Next, the modal confidence calculation unit quantifies the reliability of the two modalities, with a separate evaluation model designed for each according to the characteristics of visible light and infrared images. For visible light images, the confidence is positively correlated with illumination intensity and negatively correlated with occlusion rate and noise intensity. The calculation includes an adjustment coefficient and applies the Sigmoid function to strengthen the influence of illumination intensity, so that the confidence approaches 1 under sufficient lighting and drops below 0.3 under dim lighting. For infrared images, the confidence is positively correlated with the temperature contrast and negatively correlated with occlusion rate and noise intensity. The calculation includes a correction term that prevents a zero denominator; the confidence is approximately 1 when the temperature difference is significant and falls below 0.2 when it is not.

To ensure weight normalization, the confidences of the two modalities are normalized to give the normalized visible light modality weight $w_{Vis}$ and the normalized infrared modality weight $w_{IR}$, such that $w_{Vis} + w_{IR} = 1$, which keeps the weighted fusion well defined.

Subsequently, the ROI (Region of Interest) positioning unit locates target regions precisely to reduce background interference. Based on the preceding multi-scale feature extraction, the S3-scale feature map (120×68) is used to generate candidate target regions, and non-maximum suppression (NMS) selects the candidate boxes whose confidence exceeds a threshold. The candidate bounding boxes are mapped onto the feature maps at each scale to obtain the corresponding ROIs. The ROIs are cropped and resized to a uniform size so that subsequent fusion operations are consistent, and a masking mechanism suppresses background features outside the ROIs to improve fusion efficiency.

Channel-level dynamic weighted fusion is then performed. For the visible light multi-scale feature maps (four scales) and the infrared multi-scale feature maps, the features of each channel are dynamically weighted according to the normalized confidences, yielding the channel-weighted feature map at the k-th scale.
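The confidence normalization and weighted fusion above can be sketched as follows. The text only requires that the two normalized weights sum to 1, so the ratio form used here is an assumption, as is applying one scalar weight per modality to every channel (the invention's channel-level weighting may be finer-grained). The function names and sample confidences are hypothetical.

```python
def normalize_confidences(c_vis, c_ir):
    """Scale two modal confidences so the resulting weights sum to 1 (assumed ratio form)."""
    total = c_vis + c_ir
    if total <= 0:
        raise ValueError("at least one modality must have positive confidence")
    return c_vis / total, c_ir / total

def weighted_fusion(feat_vis, feat_ir, w_vis, w_ir):
    """Blend two feature maps channel by channel.

    feat_vis, feat_ir: lists of channels, each channel a flat list of activations.
    """
    return [[w_vis * v + w_ir * r for v, r in zip(ch_vis, ch_ir)]
            for ch_vis, ch_ir in zip(feat_vis, feat_ir)]

# Night scene example: visible light is unreliable, infrared is reliable.
w_vis, w_ir = normalize_confidences(c_vis=0.3, c_ir=0.9)

feat_vis = [[1.0, 1.0], [0.0, 4.0]]  # two channels, two activations each
feat_ir = [[5.0, 1.0], [4.0, 0.0]]
fused = weighted_fusion(feat_vis, feat_ir, w_vis, w_ir)
print(w_vis, w_ir, fused)
```

With these inputs the infrared branch contributes three quarters of each fused activation, which is the gating behavior the module is meant to provide when lighting is poor.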
[0017] A4: Target detection is performed based on the fused features, and deep appearance features are extracted. In A4, a lightweight multi-scale detection network is built on the multi-scale fused feature maps generated in the preceding step. The network takes the fused feature maps at 4×, 8×, 16×, and 32× downsampling as input and adopts an improved YOLOv8-nano architecture. Cross-scale feature fusion is achieved through a Feature Pyramid Network (FPN) and a Path Aggregation Network (PAN), which improves adaptability to targets of different sizes. Because small, slow-moving targets make up a high proportion of low-altitude scenes, a small-target enhancement branch is added to the network neck: a 1×1 convolution raises the channel count of the 480×270 feature map to 128, and the result is fused with the 240×135 feature map by element-wise addition to strengthen the representation of small targets. At the same time, the detection head uses a decoupled design that separates the classification branch from the regression branch. The classification branch strengthens semantic feature extraction through global average pooling, and the regression branch introduces a coordinate attention module to improve localization accuracy.

Target detection proceeds as follows: the fused feature maps at the 4 scales are fed into the detection network, and the detection head outputs the class probabilities, bounding box coordinates, and confidence score of each target. Bounding box regression uses the CIoU loss function to optimize localization accuracy:

$$L_{CIoU} = 1 - IoU + \frac{\rho^{2}(b, b^{gt})}{c^{2}} + \alpha v$$

where $IoU$ is the intersection over union of the predicted bounding box and the ground-truth bounding box, $\rho^{2}(b, b^{gt})$ is the squared Euclidean distance between the center points of the two boxes, $c$ is the diagonal length of the smallest rectangle enclosing the two boxes, $\alpha$ is the balance coefficient, and $v$ measures the consistency of the aspect ratios of the two boxes. This loss drives the regressed bounding box closer to the actual location of the target. The balance coefficient is calculated as:

$$\alpha = \frac{v}{(1 - IoU) + v}$$

and the aspect-ratio consistency term is calculated as:

$$v = \frac{4}{\pi^{2}}\left(\arctan\frac{w^{gt}}{h^{gt}} - \arctan\frac{w}{h}\right)^{2}$$

where $w$ and $h$ are the width and height of the predicted bounding box, and $w^{gt}$ and $h^{gt}$ are the width and height of the ground-truth bounding box.

After detection, an improved non-maximum suppression algorithm (Soft-NMS) filters the detection results. For candidate boxes whose overlap exceeds 0.5, low-confidence boxes are not eliminated outright; instead, their confidence is reduced through a decay function of the candidate box's confidence, its overlap with the high-confidence bounding box, and an attenuation coefficient. The final output consists of the detections with confidence ≥0.7, including the target category (such as drone, bird, or balloon), precise bounding box coordinates, and detection confidence.

Subsequently, deep appearance features of the target are extracted from the detection results. First, using the bounding box coordinates output by the detector, the corresponding target feature regions are cropped from the fused feature maps at each scale and processed differently according to the characteristics of each scale. For the shallow fused feature maps, the focus is on detailed appearance features such as target edges and textures, which are enhanced with 3×3 convolutions. For the deep fused feature maps, the focus is on semantic-level appearance features, and a dual-pooling operation combining global average pooling and global max pooling captures both the global distribution and the key local information of the features:

$$f_{avg} = \frac{1}{HW}\sum_{i=1}^{H}\sum_{j=1}^{W} F(i, j), \qquad f_{max} = \max_{i, j} F(i, j)$$

where $H$ and $W$ are the height and width of the target feature region $F$, $f_{avg}$ is the average-pooled feature, and $f_{max}$ is the max-pooled feature. The dual-pooled features at each scale are concatenated and fed into a feature refinement network consisting of two fully connected layers, a Batch Normalization layer, and a Dropout layer (dropout rate = 0.2). L2 normalization then projects the feature vector onto the unit sphere: the vector obtained by concatenating the features from the different scales is divided by its L2 norm over the feature dimension (set to 256), with a small correction term in the denominator to prevent division by zero. The result is the deep appearance feature vector.

Finally, the quality of the extracted deep appearance features is verified and optimized. The variance and information entropy of the feature vector are calculated. When the variance is less than 0.1 or the information entropy is less than 5, the feature is judged to be low quality and the feature re-extraction process is triggered: the feature is re-extracted by expanding the feature cropping region and adjusting the pooling window size.
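The CIoU loss used for bounding box regression in A4 can be sketched as follows. This is a minimal illustration of the standard CIoU definition, assuming boxes are given as (center x, center y, width, height); the function name `ciou_loss` and the small stabilizing constant `eps` are assumptions introduced for the example.

```python
import math


def ciou_loss(pred, gt, eps=1e-9):
    """CIoU loss between two boxes in (cx, cy, w, h) format."""
    px, py, pw, ph = pred
    gx, gy, gw, gh = gt
    # Corner coordinates of both boxes.
    p_x1, p_y1, p_x2, p_y2 = px - pw / 2, py - ph / 2, px + pw / 2, py + ph / 2
    g_x1, g_y1, g_x2, g_y2 = gx - gw / 2, gy - gh / 2, gx + gw / 2, gy + gh / 2
    # Intersection over union.
    inter_w = max(0.0, min(p_x2, g_x2) - max(p_x1, g_x1))
    inter_h = max(0.0, min(p_y2, g_y2) - max(p_y1, g_y1))
    inter = inter_w * inter_h
    union = pw * ph + gw * gh - inter
    iou = inter / (union + eps)
    # Squared distance between the two box centers.
    rho2 = (px - gx) ** 2 + (py - gy) ** 2
    # Squared diagonal of the smallest enclosing rectangle.
    enc_w = max(p_x2, g_x2) - min(p_x1, g_x1)
    enc_h = max(p_y2, g_y2) - min(p_y1, g_y1)
    c2 = enc_w ** 2 + enc_h ** 2 + eps
    # Aspect-ratio consistency term and its balance coefficient.
    v = (4 / math.pi ** 2) * (math.atan(gw / gh) - math.atan(pw / ph)) ** 2
    alpha = v / ((1 - iou) + v + eps)
    return 1 - iou + rho2 / c2 + alpha * v


print(ciou_loss((5.0, 5.0, 4.0, 2.0), (5.0, 5.0, 4.0, 2.0)))   # close to 0
print(ciou_loss((0.0, 0.0, 2.0, 2.0), (10.0, 0.0, 2.0, 2.0)))  # greater than 1
```

Unlike a plain IoU loss, the center-distance term still gives a useful gradient signal when the two boxes do not overlap, which is the second case printed above.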
[0018] A5: Construct a trajectory prediction and calibration model based on the temporal co-occurrence matrix, and combine Kalman filtering with a nonlinear maneuver compensation mechanism to associate the detected targets with historical trajectories and output a continuous and stable tracking trajectory.
[0019] In A5, a temporal co-occurrence matrix is constructed to mine the motion correlation of targets. Based on the preceding detection results and historical trajectory data, a temporal correlation model of the targets is established. The target detection sets of the historical frames are defined, where each set contains the targets detected in one frame and each target carries its bounding box coordinates (including the bounding box width), deep appearance feature, and confidence. The temporal co-occurrence matrix is then constructed, where U is the total number of targets in the historical frames and each matrix element gives the correlation between the m-th historical target and a current target. The correlation is a weighted combination, with three weighting coefficients, of the intersection over union (IoU) of the two target bounding boxes, the cosine similarity of their appearance features, and a time-decay term determined by the frame interval between the two targets and a time decay coefficient. The temporal co-occurrence matrix thus quantifies the spatial overlap, appearance similarity, and temporal continuity between targets, providing a multi-dimensional basis for trajectory association.
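One way to picture the temporal co-occurrence matrix is the sketch below. The exact formula is not reproduced in this text, so the example assumes a weighted sum of IoU, cosine similarity, and an exponential time decay; the weights `alpha`, `beta`, `gamma`, the decay rate `lam`, and the tuple layouts are all hypothetical choices made for illustration.

```python
import math


def iou(a, b):
    """IoU of two boxes in (cx, cy, w, h) format."""
    ax1, ay1, ax2, ay2 = a[0] - a[2] / 2, a[1] - a[3] / 2, a[0] + a[2] / 2, a[1] + a[3] / 2
    bx1, by1, bx2, by2 = b[0] - b[2] / 2, b[1] - b[3] / 2, b[0] + b[2] / 2, b[1] + b[3] / 2
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0


def cosine(u, v):
    """Cosine similarity of two appearance feature vectors."""
    dot = sum(p * q for p, q in zip(u, v))
    nu = math.sqrt(sum(p * p for p in u))
    nv = math.sqrt(sum(q * q for q in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0


def cooccurrence_matrix(history, current, current_frame,
                        alpha=0.4, beta=0.4, gamma=0.2, lam=0.1):
    """Rows: historical targets (box, feature, frame index).
    Columns: current targets (box, feature)."""
    matrix = []
    for box_h, feat_h, frame_h in history:
        dt = current_frame - frame_h  # frame interval between the two targets
        row = [
            alpha * iou(box_h, box_c)
            + beta * cosine(feat_h, feat_c)
            + gamma * math.exp(-lam * dt)  # assumed exponential time decay
            for box_c, feat_c in current
        ]
        matrix.append(row)
    return matrix


history = [((0.0, 0.0, 2.0, 2.0), [1.0, 0.0], 5)]
current = [((0.0, 0.0, 2.0, 2.0), [1.0, 0.0]), ((100.0, 100.0, 2.0, 2.0), [0.0, 1.0])]
print(cooccurrence_matrix(history, current, current_frame=6))
```

A recently seen target with the same box and appearance scores close to the sum of the weights, while a distant, dissimilar detection keeps only the small time-decay contribution.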
[0020] Next, a Kalman filter (KF) is introduced to build the basic trajectory prediction model. Considering the motion characteristics of low-altitude targets, a hybrid uniform-velocity/uniform-acceleration motion model describes the target's motion state. The target state vector comprises the bounding box center coordinates, the bounding box dimensions, the velocity of the center coordinates, and the acceleration. The state transition matrix and the process noise matrix are set in terms of the frame interval; the observation matrix retains only the position and size components of the state vector, and an observation noise matrix is specified. The prediction step of the Kalman filter yields the prior estimate of the target state and the prior covariance matrix, which provide predicted positions for trajectory association.

To address the loss of KF prediction accuracy caused by nonlinear target maneuvers (such as sudden acceleration and turning), a nonlinear maneuver compensation mechanism is introduced. The variance of the target's motion acceleration over the historical frames is computed about the mean acceleration; note that the acceleration and its variance are both computed in a coordinate system normalized by the image width and height. When the variance exceeds a threshold (that is, the target is in a maneuvering state), a compensation strategy is triggered. On the one hand, the process noise matrix is adjusted dynamically by enlarging the noise variance corresponding to acceleration to a multiple of its original value, which improves the filter's adaptability to maneuvers. On the other hand, the interacting multiple model (IMM) concept is introduced: the predicted states of the uniform velocity model and the uniform acceleration model are fused as a weighted sum, with the fusion weights allocated dynamically according to the acceleration variance, which preserves prediction accuracy under maneuvering conditions.

Subsequently, trajectory association is performed based on the temporal co-occurrence matrix and the prediction results, and the Hungarian algorithm solves for the optimal matching. The elements of the temporal co-occurrence matrix serve as the association cost and are combined with the Euclidean distance cost between the position predicted by the Kalman filter and the current detection position:

$$d = \sqrt{(x_{p} - x_{d})^{2} + (y_{p} - y_{d})^{2}}$$

where $x_{p}$ and $y_{p}$ are the abscissa and ordinate of the predicted position, and $x_{d}$ and $y_{d}$ are the abscissa and ordinate of the current detection position. A comprehensive association cost matrix is then constructed from the co-occurrence term and the distance term scaled by a distance normalization constant. The Hungarian algorithm finds the minimum-cost matching pairs to associate the currently detected targets with the historical trajectories. For a trajectory that matches no detection, if the number of consecutive unmatched frames exceeds 3, the target is judged to have disappeared and the trajectory is terminated.

Finally, the associated trajectories are calibrated and optimized, with drift correction that combines the detection results and appearance features. When the IoU between the detected target and the predicted trajectory position is ≥0.6, the Kalman filter observation is updated with the detection result to correct the trajectory position. When the IoU is <0.6 but the cosine similarity of the appearance features is ≥0.8, the state vector is updated by weighted fusion:

$$\hat{\mathbf{x}}_{t} = \mu\,\hat{\mathbf{x}}_{t}^{-} + (1 - \mu)\,\mathbf{z}_{t}$$

where $\hat{\mathbf{x}}_{t}$ is the posterior estimate of the target state vector in frame t, $\hat{\mathbf{x}}_{t}^{-}$ is the prior estimate of the target state vector in frame t, $\mathbf{z}_{t}$ is the detected target state vector, and $\mu$ is the prediction weight ($\mu = 0.3$). When the appearance feature similarity is <0.8, the trajectory re-initialization process is triggered and tracking restarts from the current detection result. Meanwhile, the trajectory over l consecutive frames (l = 8) is smoothed with a moving average filter to remove high-frequency jitter. The final output is a continuous and stable tracking trajectory that includes the target's precise bounding box, motion speed, and confidence for each frame.
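The calibration rule and the trajectory smoothing described above can be sketched as follows. The thresholds (IoU ≥0.6, cosine similarity ≥0.8), the prediction weight μ = 0.3, and the 8-frame window follow the values stated in the claims; the function names, the returned action labels, and the assumption that the prior and the detection share the same vector layout are hypothetical simplifications.

```python
def calibrate(prior, detection, iou, cos_sim, mu=0.3, iou_thr=0.6, sim_thr=0.8):
    """Choose the drift-correction action for one matched trajectory.

    prior / detection: state vectors as equal-length lists of floats.
    Returns (action, state).
    """
    if iou >= iou_thr:
        # Good spatial agreement: the caller feeds the detection to the
        # standard Kalman filter update step as the observation.
        return "kalman_update", list(detection)
    if cos_sim >= sim_thr:
        # Poor overlap but same appearance: blend the prior and the detection.
        fused = [mu * p + (1 - mu) * d for p, d in zip(prior, detection)]
        return "weighted_fusion", fused
    # Neither position nor appearance agrees: restart from the detection.
    return "reinitialize", list(detection)


def moving_average(track, window=8):
    """Trailing moving average over a list of state vectors."""
    smoothed = []
    for t in range(len(track)):
        segment = track[max(0, t - window + 1): t + 1]
        smoothed.append([sum(col) / len(segment) for col in zip(*segment)])
    return smoothed


print(calibrate([0.0, 0.0], [10.0, 10.0], iou=0.3, cos_sim=0.9))
print(moving_average([[0.0], [2.0], [4.0]], window=2))
```

With μ = 0.3 the fused state leans toward the detection, so a drifting prediction is pulled back quickly while still damping detection noise; the trailing window means the first few frames are averaged over fewer samples.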
[0021] In conclusion, the above description is only a preferred embodiment of the present invention and is not intended to limit the present invention. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of the present invention should be included within the protection scope of the present invention.
Claims
1. A method for detecting and tracking low-altitude targets of unmanned aerial vehicles (UAVs) based on multimodal fusion, characterized in that it comprises: A1: Acquire spatiotemporally aligned visible light and infrared images; A2: Extract multi-scale feature maps from the visible light image and the infrared image respectively; A3: Construct a dynamic adaptive gating fusion module to calculate modal confidence based on the current environmental information, and perform channel-level dynamic weighting and feature interaction in the ROI region to generate fused features; A4: Perform target detection based on the fused features, and extract deep appearance features; A5: Construct a trajectory prediction and calibration model based on the temporal co-occurrence matrix, and combine Kalman filtering with a nonlinear maneuver compensation mechanism to associate the detected targets with historical trajectories and output a continuous and stable tracking trajectory.
2. The method for detecting and tracking low-altitude targets of unmanned aerial vehicles based on multimodal fusion according to claim 1, characterized in that: In step A1, spatiotemporal alignment of visible light and infrared images is achieved through a multimodal data acquisition synchronization triggering mechanism, time synchronization calibration, spatial coordinate calibration, and a calibration quality verification unit. Specifically, time synchronization calibration constructs a time correlation model based on sensor timestamps and UAV attitude data, using linear interpolation to compensate for asynchronous data frames and control time synchronization errors. Spatial coordinate calibration obtains the sensor extrinsic parameter matrix using hand-eye calibration, dynamically corrects the acquisition viewpoint using UAV attitude data, and adapts the infrared image resolution using bilinear interpolation. The calibration quality verification unit performs real-time detection by calculating the mutual information value of the same target region in the aligned dual-modal images. If the mutual information value meets a preset threshold, the calibration is deemed valid; otherwise, the extrinsic parameter calibration and attitude correction process is retried.
3. The method for detecting and tracking low-altitude targets of unmanned aerial vehicles based on multimodal fusion according to claim 1, characterized in that: In step A2, a lightweight multi-scale feature extraction network is used, combined with a cross-modal feature enhancement mechanism, to perform hierarchical feature extraction on the spatiotemporally aligned dual-modal images. A dual-branch parallel feature extraction architecture is constructed, in which the visible light image branch and the infrared image branch share the same network structure but are trained with independent parameters. A channel attention module is added to the shallow convolutional layer of the visible light branch, and a spatial attention module is introduced into the middle convolutional layer of the infrared branch. Feature maps of multiple scales are output through different stages of the network, and adjacent scale features are fused through a weighted feature fusion mechanism. Subsequently, feature enhancement processing is performed on the multi-scale feature maps of each modality, and the extracted multi-scale feature maps are normalized to output enhanced multi-scale feature maps.
4. The method for detecting and tracking low-altitude targets of unmanned aerial vehicles based on multimodal fusion according to claim 1, characterized in that: In step A3, a dynamic adaptive gating fusion module is constructed. This module includes an environmental information perception unit, a modal confidence calculation unit, an ROI region precise positioning unit, a channel-level dynamic weighting unit, and a cross-modal feature interaction unit. The environmental information perception unit extracts environmental feature vectors by analyzing key parameters of dual-modal images; the modal confidence calculation unit designs differentiated evaluation models for visible light images and infrared images respectively, dynamically calculates modal confidence based on light intensity, temperature difference contrast, occlusion rate and noise intensity, and normalizes the two modal confidences. The ROI region precise localization unit generates target candidate regions based on multi-scale feature maps, filters candidate boxes through non-maximum suppression, and maps the ROI region to feature maps of various scales for feature cropping and size unification; the channel-level dynamic weighting unit performs channel-level weighted fusion of feature maps of various scales based on normalized modal confidence to generate weighted fused features.
5. The method for detecting and tracking low-altitude targets of unmanned aerial vehicles based on multimodal fusion according to claim 1, characterized in that: In step A4, a lightweight multi-scale detection network is constructed based on the generated multi-scale fused feature maps. This network takes the multi-scale fused features as input and uses a feature pyramid network and a path aggregation network to perform cross-scale feature fusion. For small targets in low-altitude scenarios, a small-target enhancement branch is added to the network neck; the detection results are filtered by an improved non-maximum suppression algorithm, and the target category, bounding box coordinates and detection confidence are output. Then, the deep appearance features of the target are extracted according to the detection results. Differentiated processing is performed according to the characteristics of the features at different scales. The features at each scale are input into the feature refinement network after dual pooling, and the deep appearance feature vector is obtained after L2 normalization.
6. The method for detecting and tracking low-altitude targets of unmanned aerial vehicles based on multimodal fusion according to claim 1, characterized in that: In step A5, a temporal co-occurrence matrix is constructed to mine the motion correlation of targets. A correlation matrix is constructed based on the detection results of historical frames and the current detection results to quantify the spatial overlap, appearance similarity and temporal continuity between targets. A basic trajectory prediction model is constructed by introducing Kalman filtering. A hybrid uniform-velocity/uniform-acceleration motion model is used to describe the target motion state. The prior estimate of the target state is obtained through the prediction step of Kalman filtering. A nonlinear maneuver compensation mechanism is introduced. The target motion acceleration variance in historical frames is calculated to determine whether the target is in a maneuvering state. When it is determined to be in a maneuvering state, the process noise matrix is dynamically adjusted, and the prediction results of the uniform velocity model and the uniform acceleration model are fused.
7. The method for detecting and tracking low-altitude targets of unmanned aerial vehicles based on multimodal fusion according to claim 6, characterized in that: A comprehensive association cost matrix is constructed based on the temporal co-occurrence matrix and Kalman filter prediction results. The association between the current detected target and historical trajectories is calculated using the Hungarian algorithm. The associated trajectories are calibrated and optimized, and drift correction is performed by combining detection results and appearance features. The trajectories of multiple consecutive frames are smoothed, and the tracking trajectory is output.
8. The method for detecting and tracking low-altitude targets of unmanned aerial vehicles based on multimodal fusion according to claim 6, characterized in that: In A5, during trajectory calibration, the Kalman filter observations are updated when the IoU between the detected target and the predicted position is ≥0.6, and the state vector is updated by weighted fusion when the IoU is <0.6 but the cosine similarity of the appearance features is ≥0.8, with a prediction weight μ=0.3; the trajectory smoothing adopts a moving average filter, with the number of consecutive smoothing frames l=8.