Infrared ship small target detection method in video stream mode based on multi-frame fusion

By employing a multi-frame fusion detection method, utilizing multi-scale adaptive high-frequency boosting filtering, saliency-guided dual-path attention feature enhancement, and a temporal fusion module, the problems of low signal-to-noise ratio, missing morphological and texture features, and unstable performance in infrared ship small target detection are solved, achieving efficient and accurate target detection.

CN122244819A · Pending · Publication Date: 2026-06-19 · NAVAL UNIV OF ENG PLA

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
NAVAL UNIV OF ENG PLA
Filing Date
2026-05-21
Publication Date
2026-06-19

AI Technical Summary

Technical Problem

Existing infrared small target detection methods for ships suffer from problems such as low signal-to-noise ratio, lack of morphological and texture features, and unstable performance in long-distance detection scenarios, resulting in high false alarm and false negative rates. Furthermore, traditional methods have high computational complexity, poor robustness and adaptability, making it difficult to meet real-time processing requirements.

Method used

A multi-frame fusion detection method is adopted, which optimizes feature extraction and temporal modeling by using a multi-scale adaptive high-frequency enhancement filtering module, a saliency-guided dual-path attention feature enhancement module, and a saliency-guided temporal fusion module, combined with a multi-task loss function and progressive adaptive training, thereby improving the accuracy and stability of detection.

Benefits of technology

It achieves efficient and accurate detection of small infrared targets on ships, reduces computational complexity, improves robustness and adaptability, meets the requirements of real-time processing, and enhances the recall and accuracy of target detection.

✦ Generated by Eureka AI based on patent content.


Abstract

This invention belongs to the field of infrared ship small target detection technology, and discloses an infrared ship small target detection method based on multi-frame fusion video stream mode. This invention has high detection accuracy and is particularly sensitive to small targets. By introducing a saliency-guided dual-path attention feature enhancement module to globally enhance single-frame features from a spatial perspective, and combining it with a saliency-guided temporal fusion module to fuse multi-frame features from a temporal perspective, it fully mines the potential information of weak targets from both spatial and temporal dimensions. This spatiotemporal dual-domain collaborative enhancement mechanism effectively overcomes the problems of low target signal-to-noise ratio and weak features in single-frame images, significantly improving the detection rate of long-range infrared ship small targets and greatly reducing the risk of missed detection. The algorithm is robust and adaptable to various environments.

Description

Technical Field

[0001] This invention belongs to the field of infrared ship small target detection technology, and particularly relates to an infrared ship small target detection method in video stream mode based on multi-frame fusion.

Background Technology

[0002] With the rapid growth of global trade and maritime transport, maritime navigation safety has become increasingly important. When ships navigate in low-visibility conditions such as nighttime, rain, snow, and fog, visual observation based on visible light is significantly degraded, greatly increasing the risk of collisions between vessels. Infrared thermal imaging, with its all-weather operation and its ability to penetrate certain levels of rain, snow, and fog, has become a crucial means of nighttime situational awareness and safe navigation for ships. Real-time, rapid, and accurate type identification and position detection of ship targets in images acquired by infrared sensors are key technologies for automatic collision avoidance and intelligent navigation of vessels.

[0003] However, ship target detection based on infrared imagery, especially in long-range detection scenarios, still faces significant challenges. Targets in this scenario typically appear as "small targets" in the image, and their main characteristics and impacts can be summarized in the following three points:

[0004] ① Low signal-to-noise ratio: The infrared signal radiated by distant ship targets is weak and occupies only a few pixels in the image (e.g., 3×3 pixels or even smaller). Its energy characteristics are easily submerged by complex sea surface background noise (such as sea wave clutter, solar glare, and cloud edges).

[0005] ② Lack of morphological and textural features: Due to resolution limitations, small targets lack clear structural outlines, geometric shapes, and texture information, making it difficult for traditional shape-based feature extraction methods and general object detection models based on deep learning to learn their effective representations.

[0006] ③ Unstable performance: Affected by atmospheric disturbances, field of view jitter, and changes in the target's own attitude, small targets may experience brightness fluctuations, size changes, or even temporary disappearance in the image sequence, resulting in poor stability.

[0007] Currently, the mainstream infrared ship small target detection methods that can be used in video mode are mainly divided into the following two categories, but both have obvious limitations.

[0008] The first category is detection methods based on single-frame images. These methods utilize only a single infrared image at the current moment for detection, ignoring the continuous information of the target in the time dimension. The implementation methods mainly include the following two major directions:

[0009] ① Background modeling-based filtering method: By modeling the sea and sky background, areas deviating from the background model are considered as targets. However, dynamically changing waves and cloud edges are easily misjudged as false targets, while weak targets with very low contrast to the background are prone to over-filtering and missed detection.

[0010] ② Deep learning-based detection models, such as YOLO, SSD, and Faster R-CNN, perform well in conventional target detection. However, when applied to small infrared targets, due to the extremely limited contour and texture features of the target, weak feature information is easily lost during convolution and downsampling, resulting in insufficient sensitivity of the model to small targets and low recall.

[0011] The second category is traditional detection methods based on multi-frame information. These methods attempt to utilize temporal information, such as optical flow and frame difference methods. While they can theoretically utilize motion consistency, they still face many problems in practical applications.

[0012] ① High computational complexity: Optical flow methods require intensive pixel-level calculations, which are difficult to meet the requirements of real-time video stream processing.

[0013] ② Sensitive to noise: In image sequences with low signal-to-noise ratio, reliable optical flow fields are difficult to calculate, which can easily lead to erroneous motion trajectories.

[0014] ③ Poor adaptability: The frame difference method fails for stationary or slow-moving targets; for scenes with field-of-view jitter, traditional methods require complex global motion compensation and have poor robustness.

[0015] Based on the above analysis, the problems and shortcomings of the existing technology are as follows:

[0016] In existing technologies, both advanced single-frame deep learning models and traditional multi-frame detection methods have limitations when dealing with infrared ship small target detection in video stream mode. Single-frame detection methods are limited by their single information dimension, resulting in high false positive and false negative rates, while traditional multi-frame detection methods are difficult to deploy in practice because of problems with computational efficiency, robustness, and adaptability.

Summary of the Invention

[0017] To address the problems existing in the prior art, this invention provides an infrared small target detection method for ships based on a multi-frame fusion video stream mode.

[0018] This invention is implemented as follows: A method for detecting small infrared targets on ships in a video stream mode based on multi-frame fusion includes:

[0019] Step 1: A multi-scale adaptive high-frequency boosting filter module uses filters of different scales to separate the internal texture details and the contour edge information of infrared small targets; learnable weight parameters are introduced so that the multi-scale features are fused adaptively across different scenarios; because the module is built on standard convolution operations, it can be integrated efficiently into a real-time detection system.

[0020] Step 2: The saliency-guided dual-path attention feature enhancement module utilizes local windows and sparse sampling mechanisms to reduce computational complexity; the guided fusion mechanism ensures that small target features are enhanced in a focused manner; and the contribution ratio of local and global attention is automatically adjusted according to the input features.

[0021] Step 3: The saliency-guided temporal fusion module applies different pooling strategies to high and low saliency regions using "saliency-weighted feature compression," compressing background information while preserving target details, thus optimizing the computational cost and feature quality of temporal modeling. By encoding multi-dimensional saliency information into attention biases and using them as dynamic adjustment factors for temporal position encoding and attention computation, the fusion process can automatically focus on key frames and key information. The frame-level importance weights are derived in reverse using the bias matrix naturally generated in multi-head attention computation.

[0022] Step 4, design the multi-task loss function;

[0023] Step 5, progressive adaptive training.

[0024] Furthermore, the multi-scale adaptive high-frequency boosting filter module:

[0025] Step 1: Multi-scale feature separation

[0026] Small-scale detail component extraction (for internal texture of the target):

[0027] ;

[0028] Mesoscale edge component extraction (for target contour boundaries):

[0029] ;

[0030] where,

[0031] D_small(x,y): the small-scale detail component extracted at coordinates (x,y);

[0032] D_edge(x,y): the mid-scale edge component extracted at coordinates (x,y);

[0033] the remaining symbols denote, in order: the input infrared image at coordinates (x,y); the 3×3 and 5×5 mean filter kernels; and the convolution operator;

[0034] Step 2: Adaptive Feature Fusion

[0035] Establish a weighted fusion model:

[0036] ;

[0037] where the two weights, which control the enhancement intensity of the small-scale details and of the mesoscale edges respectively, are learnable parameters satisfying the adaptive optimization conditions:

[0038] ; ;

[0039] where the loss term in these conditions is the loss function of the downstream detection task;

[0040] Step 3: High-frequency enhancement;

[0041] Construct the complete high-frequency boosting formula:

[0042] ;

[0043] where the gain applied to the original input infrared image is a learnable parameter that controls its overall contrast enhancement intensity and satisfies the adaptive optimization condition:

[0044] ;

[0045] The complete expression after expansion is:

[0046] ;

[0047] The three parameters are learned adaptively through end-to-end training, and the initialization strategy is designed as follows:

[0048] ; ; .
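The filtering steps above can be sketched in NumPy. This is a minimal illustration, not the patented implementation: the exact formulas are not legible in this text, so the sketch assumes the small-scale detail is the image minus its 3×3 mean, the mid-scale edge is the 3×3 mean minus the 5×5 mean, and the output is a weighted sum of the image and both components. The names `mean_filter`, `high_frequency_boost`, `alpha`, `beta`, `gamma` and their default values are hypothetical.

```python
import numpy as np

def mean_filter(img: np.ndarray, k: int) -> np.ndarray:
    """k x k mean filter (k odd) with edge replication at the borders."""
    r = k // 2
    padded = np.pad(img, r, mode="edge")
    h, w = img.shape
    out = np.zeros((h, w), dtype=np.float64)
    for dy in range(k):
        for dx in range(k):
            out += padded[dy:dy + h, dx:dx + w]
    return out / (k * k)

def high_frequency_boost(img, alpha=1.0, beta=1.0, gamma=1.0):
    """Assumed form: gamma * I + alpha * D_small + beta * D_edge."""
    img = np.asarray(img, dtype=np.float64)
    blur3 = mean_filter(img, 3)
    blur5 = mean_filter(img, 5)
    d_small = img - blur3    # assumed small-scale detail component
    d_edge = blur3 - blur5   # assumed mid-scale edge component
    return gamma * img + alpha * d_small + beta * d_edge
```

On a flat background both detail components vanish, so only the gain acts; an isolated bright pixel is amplified relative to its surroundings. In a trainable version the three scalars would be learnable parameters rather than fixed arguments.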

[0049] Furthermore, the dual-path attention feature enhancement module:

[0050] Step 1: Local window attention calculation:

[0051] The input feature map is divided into non-overlapping windows, and self-attention is calculated within each window:

[0052] (1) Window division operation:

[0053]

[0054] ;

[0055] where the two symbols denote the window size and the number of windows, respectively;

[0056] (2) Calculation of self-attention within the window:

[0057] Each window is flattened from a 3D feature map into a 2D matrix, from which the query, key, and value matrices are calculated and then split into multiple heads:

[0058] , , ;

[0059] where the three projection weights are learnable, and the remaining symbol is the attention feature dimension;

[0060] The query, key, and value matrices are split into multiple heads:

[0061] , ;

[0062] , ;

[0063] , ;

[0064] where the remaining symbol is the dimension of each attention head;

[0065] Relative position offset calculation:

[0066] First, construct a relative position offset table:

[0067] Define a relative position offset table that covers all possible relative position offsets within the window; its size is determined by the window size and the number of attention heads;

[0068] Relative position index:

[0069] For any two positions within the window, the relative position offset is:

[0070] , ;

[0071] where the two coordinate pairs are the coordinates of the pixels within the window (each within the window's coordinate range);

[0072] Mapping two-dimensional offsets to one-dimensional indices:

[0073] ;

[0074] This mapping yields the relative position index matrix for all position pairs within the window;

[0075] Window relative position offset matrix generation:

[0076] First, the bias table is reshaped into a two-dimensional matrix:

[0077] ;

[0078] Then, based on the relative position index matrix, the corresponding bias values are extracted from the bias table to construct the window relative position bias matrix:

[0079] ;

[0080] For each window, the attention weight of each head is calculated as follows:

[0081] ;

[0082] where the bias term is the relative position offset of the given attention head, and the scaling term is the attention head dimension;

[0083] Feature aggregation:

[0084] ;

[0085] Multi-head fusion and output projection:

[0086] ;

[0087] where the remaining symbol is the output projection matrix;

[0088] (3) Window reorganization:

[0089] Reconstruct the window attention results into a complete feature map:

[0090] ;
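The window division, window reorganization, and relative position indexing described in this step can be sketched as follows. This is an illustrative sketch in the style of common windowed-attention implementations. The function names and the channels-last layout `(H, W, C)` are assumptions, and the index mapping shown (shift each offset to be non-negative, then combine row-major) is one standard choice that may differ from the patent's exact formula.

```python
import numpy as np

def window_partition(x: np.ndarray, m: int) -> np.ndarray:
    """(H, W, C) -> (num_windows, m, m, C); H and W must be multiples of m."""
    h, w, c = x.shape
    x = x.reshape(h // m, m, w // m, m, c)
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, m, m, c)

def window_reverse(windows: np.ndarray, m: int, h: int, w: int) -> np.ndarray:
    """Inverse of window_partition: (num_windows, m, m, C) -> (H, W, C)."""
    c = windows.shape[-1]
    x = windows.reshape(h // m, w // m, m, m, c)
    return x.transpose(0, 2, 1, 3, 4).reshape(h, w, c)

def relative_position_index(m: int) -> np.ndarray:
    """(m*m, m*m) indices into a bias table with (2m-1)**2 rows."""
    ys, xs = np.meshgrid(np.arange(m), np.arange(m), indexing="ij")
    coords = np.stack([ys.ravel(), xs.ravel()])    # (2, m*m)
    rel = coords[:, :, None] - coords[:, None, :]  # offsets in [-(m-1), m-1]
    rel = rel + (m - 1)                            # shift to [0, 2m-2]
    return rel[0] * (2 * m - 1) + rel[1]           # row-major 1D index
```

Given a bias table of shape `((2m-1)**2, heads)`, indexing it with `relative_position_index(m)` yields one `(m*m, m*m)` bias matrix per head, which is added to the scaled query-key scores before the softmax.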

[0091] Step 2: Global Sparse Attention Calculation

[0092] (1) Generation of the saliency map:

[0093] Feature compression and conversion:

[0094] ;

[0095] where the symbols denote, in order: the learnable 1×1 convolution kernel weights; the learnable bias; the number of intermediate channels, determined by a compression ratio (which can be set between 4 and 8); and the activation function;

[0096] Channel attention calculation:

[0097] ;

[0098] ;

[0099] ;

[0100] where the symbols denote, in order: the channel descriptor vector; the learnable weights of the fully connected layers; the learnable biases of the fully connected layers; the gating function; the channel attention weights; broadcast multiplication along the channel dimension; and the attention-weighted feature map;

[0101] Spatial saliency generation:

[0102] ;

[0103] ;

[0104] where the symbols denote, in order: the learnable 1×1 convolution kernel weights; the learnable bias term; the raw saliency score; and the final saliency map;

[0105] (2) Key point selection:

[0106] Based on the saliency map, the top K most salient positions are selected:

[0107] ;

[0108] where the threshold symbol is a saliency threshold used to screen the top K most salient points;
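Key point selection can be illustrated with a top-K lookup on the saliency map. This is a minimal sketch that assumes selection is by the K largest saliency values; the function name `select_keypoints` and the return layout are hypothetical.

```python
import numpy as np

def select_keypoints(saliency: np.ndarray, k: int):
    """Return the flat indices and (row, col) coordinates of the k most
    salient positions, ordered from most to least salient."""
    flat = saliency.ravel()
    k = min(k, flat.size)
    idx = np.argpartition(flat, -k)[-k:]    # unordered top-k
    idx = idx[np.argsort(flat[idx])[::-1]]  # sort by descending saliency
    rows, cols = np.unravel_index(idx, saliency.shape)
    return idx, np.stack([rows, cols], axis=1)
```

The flat indices can be used to gather the key-point feature vectors from the flattened feature map, and later to scatter the attended features back to their spatial locations.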

[0109] (3) Sparse attention computation:

[0110] Attention is calculated only between the key points and all positions in the image:

[0111] ;

[0112] ;

[0113] ;

[0114] ;

[0115] ;

[0116] where the symbols denote, in order: the flattening of the 3D feature map into a 2D matrix; the feature vectors of the key points extracted from the flattened features; the three learnable projection matrices of the global attention; and the feature dimension of the global attention;

[0117] The query, key, and value matrices are split into multiple heads:

[0118] , ;

[0119] , ;

[0120] , ;

[0121] where the remaining symbol is the dimension of each attention head;

[0122] For each attention head, the sparse attention weights are calculated as follows:

[0123] ;

[0124] Feature aggregation:

[0125] ;

[0126] Multi-head fusion and output projection:

[0127] ;

[0128] where the remaining symbol is the output projection matrix;

[0129] (4) Sparse attention output:

[0130] Global feature reconstruction:

[0131] The key-point features are placed back at their corresponding spatial locations according to the key-point indices, producing the global sparse enhancement feature:

[0132] ;

[0133] Shape restoration:

[0134] Restore the flattened global features to a 3D feature map with the original spatial dimensions:

[0135] ;

[0136] Step 3: Saliency-guided feature fusion

[0137] Adaptively fuse local attention results and global sparse attention results based on a saliency map:

[0138] ;

[0139] where the operator denotes element-wise multiplication (Hadamard product).

[0140] Step 4: Channel Attention Recalibration

[0141] Channel recalibration of the fused features is performed using a squeeze-and-excitation mechanism:

[0142] ;

[0143] ;

[0144] where the two weight matrices perform dimensionality reduction and dimensionality expansion, respectively, according to a compression ratio, and the remaining symbol is the activation function;

[0145] ;

[0146] where the operator denotes broadcast multiplication along the channel dimension;

[0147] To ensure training stability, a residual connection is added, and the final output is:

[0148] ;

[0149] Furthermore, the saliency-guided time-series fusion module:

[0150] Saliency-weighted feature compression;

[0151] Salient region delineation;

[0152] For each input frame's feature map and its corresponding saliency map, the feature map is divided into high-saliency and low-saliency regions based on a learnable saliency threshold:

[0153] ;

[0154] ;

[0155] where the indicator function outputs 1 if the condition is met and 0 otherwise; the threshold is learnable, with its initial value set to 0.3; and the two resulting masks are the high-saliency mask and the low-saliency mask, respectively.

[0156] Region-specific feature extraction;

[0157] Different feature extraction strategies are used for different salient regions:

[0158] Feature extraction of high-saliency regions:

[0159] ;

[0160] ;

[0161] Feature extraction of low saliency regions:

[0162] ;

[0163] ;

[0164] where the symbols denote, in order: element-wise multiplication (Hadamard product); global max pooling, which retains the most salient feature responses; and global average pooling, which captures the overall contextual information of the background.
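The region split and region-specific pooling can be sketched as below. This is a minimal illustration that assumes a strict `>` comparison against the threshold and a channels-first layout `(C, H, W)`; the threshold default of 0.3 is the initial value stated in the text, while the function name and argument names are hypothetical.

```python
import numpy as np

def region_pooled_features(feat: np.ndarray, sal: np.ndarray, tau: float = 0.3):
    """feat: (C, H, W) feature map; sal: (H, W) saliency map in [0, 1].
    Returns one vector per region type, each of shape (C,)."""
    m_high = (sal > tau).astype(feat.dtype)    # high-saliency mask
    m_low = 1.0 - m_high                       # low-saliency mask
    f_high = (feat * m_high).max(axis=(1, 2))  # global max pooling keeps target peaks
    f_low = (feat * m_low).mean(axis=(1, 2))   # global average pooling summarizes background
    return f_high, f_low
```

Max pooling on the masked target region preserves the strongest responses of a target that may span only a few pixels, whereas averaging the background compresses it into a cheap context summary.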

[0165] Enhancement in locally significant regions;

[0166] To preserve the spatial details of the target, local pooling is performed at the locations with the highest saliency:

[0167] Local area positioning:

[0168] ;

[0169] Local feature extraction:

[0170] ;

[0171] ;

[0172] The formula for calculating the boundary of a local region is as follows. The window size for the local pooling region is set to 1 / 8 of the feature map size.

[0173] ;

[0174] ;

[0175] ;

[0176] ;

[0177] Adaptive weight fusion;

[0178] The fusion weights of features in each region are dynamically adjusted based on their saliency intensity.

[0179] First, the average saliency score is calculated based on the saliency feature map of each frame:

[0180] ;

[0181] Then, based on the average saliency score, the final fusion of the high-saliency, low-saliency, and locally salient region features is calculated:

[0182] ;

[0183] where the weighting coefficient for the local features is set to 0.2;

[0184] Saliency-weighted temporal position encoding;

[0185] Basic position encoding generation;

[0186] For a time sequence of length $T$, the basic sinusoidal position encoding matrix $PE$ is generated:

[0187] $PE(t, 2i) = \sin\left(t / 10000^{2i/d}\right)$;

[0188] $PE(t, 2i+1) = \cos\left(t / 10000^{2i/d}\right)$;

[0189] where $t$ is the time step index; $i$ is the dimension index; and $d$ is the feature dimension;
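A basic sinusoidal position encoding of the kind described here can be generated as follows. The sketch assumes the widely used formulation with base 10000 and an even feature dimension; the function name is hypothetical.

```python
import numpy as np

def sinusoidal_position_encoding(seq_len: int, dim: int) -> np.ndarray:
    """Return a (seq_len, dim) matrix; dim is assumed to be even.
    Even columns hold sines, odd columns hold cosines."""
    pos = np.arange(seq_len)[:, None]           # time step index
    two_i = np.arange(0, dim, 2)[None, :]       # 0, 2, 4, ...
    angle = pos / np.power(10000.0, two_i / dim)
    pe = np.zeros((seq_len, dim))
    pe[:, 0::2] = np.sin(angle)
    pe[:, 1::2] = np.cos(angle)
    return pe
```

In the scheme described in the text, a saliency-dependent offset would then be added to this matrix with a small learnable weight before the result is summed with the compressed frame features.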

[0190] Saliency encoding mapping;

[0191] The saliency intensity is mapped to an offset in the encoding space using a neural network, and the mapping function is as follows:

[0192] ;

[0193] where the symbols denote, in order: the first-layer weight matrix; the second-layer weight matrix; the first-layer bias; the second-layer bias; the average saliency score of the given frame; and the rectified linear unit (ReLU) activation function;

[0194] Adaptive fusion;

[0195] The basic positional encoding and the saliency encoding are weighted and fused to obtain the final positional encoding vector:

[0196] ;

[0197] where the fusion weight is a learnable parameter with an initial value of 0.1;

[0198] Temporal feature enhancement;

[0199] The fused positional encoding is added to the compressed temporal features:

[0200] ;

[0201] where the remaining symbol is the compressed feature vector of the given frame;

[0202] Attention computation based on bias generation using multi-dimensional saliency features;

[0203] Multidimensional salient feature extraction;

[0204] For an input infrared image sequence of a given length, each frame has a corresponding saliency map, from which the following four features are extracted:

[0205] The saliency intensity feature, representing the overall saliency level of each frame, is calculated using the following formula:

[0206] ;

[0207] The saliency consistency feature represents the similarity of the saliency pattern between the current frame and its adjacent frames, calculated using cosine similarity:

[0208] ;

[0209] where the two maps are the saliency maps of the previous and next frames; for two saliency maps $A$ and $B$, the cosine similarity is calculated as:

[0210] $\cos(A, B) = \dfrac{\sum_{x,y} A(x,y)\,B(x,y)}{\sqrt{\sum_{x,y} A(x,y)^2}\,\sqrt{\sum_{x,y} B(x,y)^2}}$;

[0211] The saliency stability feature represents the similarity between the current frame and the average saliency pattern of the entire sequence;

[0212] First, the average saliency map of the entire sequence is calculated:

[0213] ;

[0214] Then, the cosine similarity between each frame's saliency map and the average saliency map is calculated:

[0215] ;

[0216] The time decay feature assigns higher weights to more recent frames based on each frame's position in the sequence; the time decay feature of a frame is calculated as:

[0217] ;

[0218] where the two symbols denote the index of the current frame and the total length of the sequence, respectively;
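Three of the four per-frame saliency features can be sketched directly from their descriptions. This illustration makes several assumptions: intensity is the mean of the saliency map, consistency is the average cosine similarity to whichever neighbouring frames exist, and a small `eps` guards the division. The time decay feature is left out because its formula is not legible in this text. All names are hypothetical.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray, eps: float = 1e-8) -> float:
    """Cosine similarity between two maps, treated as flat vectors."""
    a, b = a.ravel(), b.ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + eps))

def saliency_features(sal_maps: np.ndarray) -> np.ndarray:
    """sal_maps: (T, H, W). Returns (T, 3): intensity, consistency, stability."""
    t_len = sal_maps.shape[0]
    mean_map = sal_maps.mean(axis=0)  # average saliency pattern of the sequence
    feats = np.zeros((t_len, 3))
    for t in range(t_len):
        neighbours = [n for n in (t - 1, t + 1) if 0 <= n < t_len]
        feats[t, 0] = sal_maps[t].mean()
        feats[t, 1] = (
            np.mean([cosine_similarity(sal_maps[t], sal_maps[n]) for n in neighbours])
            if neighbours else 1.0
        )
        feats[t, 2] = cosine_similarity(sal_maps[t], mean_map)
    return feats
```

A frame whose saliency pattern suddenly differs from its neighbours (for example, due to field-of-view jitter) receives low consistency and stability scores, which the bias network can use to down-weight it.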

[0219] Attention bias matrix generation: the four features are concatenated into a feature vector, which is then mapped to an attention bias matrix by a neural network.

[0220] Feature concatenation: for each pair of frame positions, the concatenated 8-dimensional feature vector is:

[0221] ;

[0222] Bias matrix calculation:

[0223] A multilayer perceptron maps the feature vector of each pair of frames to a bias value; specifically, each element of the attention bias matrix is calculated as follows:

[0224] ;

[0225] where the result is the attention bias value from one frame to the other; the three weight matrices belong to the first layer, the second layer, and the output layer, respectively; and the three biases belong to the first layer, the second layer, and the output layer, respectively.

[0226] Bias-enhanced multi-head attention calculation:

[0227] First, a bias scaling mechanism is designed; for each attention head, the scaled bias is:

[0228] ;

[0229] The scaling factor is calculated as follows:

[0230] ;

[0231] where the first symbol is a learnable parameter with an initial value of 0, and the second is a hyperparameter set to 2;

[0232] Then, obtain the query matrix, key matrix, and value matrix, and perform multi-head splitting:

[0233] For the feature sequence after fusion and position encoding, a query matrix, key matrix, and value matrix are generated through linear transformation:

[0234] , , ;

[0235] where the three projection weight matrices are learnable;

[0236] The query, key, and value matrices are split into multiple heads:

[0237] , ;

[0238] , ;

[0239] , ;

[0240] where the two symbols denote the dimension of each attention head and the attention feature dimension, respectively;

[0241] For each attention head, the attention weight is calculated using the following formula:

[0242] ;

[0243] Weighted aggregation:

[0244] ;

[0245] Finally, multi-head fusion and output projection are performed:

[0246] ;

[0247] where the remaining symbol is the output projection weight matrix;

[0248] To maintain gradient flow and training stability, a residual connection and layer normalization are applied:

[0249] ;

[0250] ;

[0251] The layer normalization function is defined as follows:

[0252] $\mathrm{LayerNorm}(x) = \gamma \odot \dfrac{x - \mu}{\sigma} + \beta$;

[0253] where $\mu$ and $\sigma$ are the mean and standard deviation of the input $x$, respectively, and $\gamma$ and $\beta$ are the learnable scaling and offset parameters, respectively;
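The bias-enhanced attention followed by a residual connection and layer normalization can be sketched for a single head. This is a simplified illustration: it assumes the (already scaled) saliency bias is added to the scaled dot-product scores before the softmax, omits the multi-head split and output projection, and adds a small `eps` to the normalization for numerical safety. All function and argument names are hypothetical.

```python
import numpy as np

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, gamma, beta, eps: float = 1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    sigma = x.std(axis=-1, keepdims=True)
    return gamma * (x - mu) / (sigma + eps) + beta

def biased_attention(z, w_q, w_k, w_v, bias):
    """z: (T, d) frame features; bias: (T, T) additive attention bias.
    Returns the (T, T) attention weights and the (T, d_v) aggregated output."""
    q, k, v = z @ w_q, z @ w_k, z @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1]) + bias
    attn = softmax(scores, axis=-1)
    return attn, attn @ v
```

A typical use would be `attn, out = biased_attention(z, w_q, w_k, w_v, bias)` followed by `layer_norm(z + out, gamma, beta)`. A large positive bias in one column steers every query frame toward that key frame, which is how the saliency information can direct the fusion toward informative frames.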

[0254] To further improve model performance and enhance its nonlinear expressive power, a feedforward network and corresponding residual connections and layer normalization designs are added based on the above:

[0255] ;

[0256] where the first linear layer expands the feature dimension and the second compresses it back to the original dimension;

[0257] Residual connection and layer normalization:

[0258] ;

[0259] Temporal importance pooling guided by multi-head attention bias;

[0260] Generation of temporal importance weights;

[0261] Bias aggregation generates initial frame-level scores:

[0262] The bias information from multiple heads and multiple query perspectives is aggregated into a single initial importance score vector that represents the global importance of each frame;

[0263] The aggregation function is as follows:

[0264] ;

[0265] where the symbols denote, in order: the number of attention heads; the total number of frames in the sequence; the multi-head attention bias value; and the initial importance score of the given frame;

[0266] Nonlinear transformation and weight recalibration:

[0267] Based on the initial importance score vector, a lightweight, learnable transformation network is introduced to perform a nonlinear transformation and weight recalibration, thereby increasing the model's expressive power and adapting it better to the downstream detection task. The calculation formula is as follows:

[0268] ;

[0269] where the two weight matrices are the linear transformation weights of the first and second layers, respectively; the two biases belong to the first and second layers, respectively; the dimension reduction ratio is set to 4; the activation function is nonlinear; and the result is the transformed score vector;

[0270] Normalization yields the final fusion weights:

[0271] The softmax function is applied to the transformed scores to normalize them along the time dimension, giving the final temporal fusion weight vector:

[0272] , ;

[0273] which satisfies the constraints:

[0274] , ;

[0275] Time-weighted fusion:

[0276] The obtained weight vector is used to compute a weighted sum over the sequence of temporally enhanced feature vectors, generating the final fused feature vector used for the classification task. The calculation formula is as follows:

[0277] ;

[0278] The result is the fused feature vector that the saliency-guided temporal fusion module ultimately outputs for the classification task;
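The temporal importance pooling can be sketched as follows. This is a reduced illustration: it assumes the aggregation is a plain mean of the bias values over heads and query frames, skips the small recalibration network, and normalizes with a softmax. The function names and the `(heads, T, T)` layout are hypothetical.

```python
import numpy as np

def temporal_fusion_weights(bias: np.ndarray) -> np.ndarray:
    """bias: (heads, T, T), where bias[h, i, j] is the bias from query frame i
    to key frame j. Returns (T,) weights that are positive and sum to 1."""
    scores = bias.mean(axis=(0, 1))  # assumed aggregation: mean over heads and queries
    scores = scores - scores.max()   # stabilize the exponentials
    w = np.exp(scores)
    return w / w.sum()

def fuse(features: np.ndarray, weights: np.ndarray) -> np.ndarray:
    """features: (T, D) frame vectors -> (D,) weighted sum over time."""
    return weights @ features
```

With a uniform bias the fusion reduces to a plain temporal average; a frame that consistently attracts a high bias from all heads and queries dominates the fused vector.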

[0279] Meanwhile, the deep temporal context information extracted by the saliency-guided temporal fusion module (SG-TFM) is injected, at low computational cost, into the feature stream of the regression task, enabling the regression task to share the temporal dependencies and contextual understanding learned by the classification task, thereby improving the accuracy and stability of target localization.

[0280] Global temporal context vector extraction:

[0281] The temporally enhanced feature sequence is aggregated along the time dimension to extract a compact global temporal context vector. Global average pooling is used:

[0282] ;

[0283] Modulation parameter generation:

[0284] The global temporal context vector is passed through a lightweight parameter generation network to produce the scale and offset parameters used for feature modulation; the network consists of two fully connected layers with a rectified linear unit (ReLU) between them as the activation function:

[0285] ;

[0286] ;

[0287] where the first pair of symbols are the first-layer weights and bias, with the intermediate layer dimension chosen to reduce computational load; the second pair are the second-layer weights and bias; and the split operation divides the output vector into two parts, the first part serving as the scale parameters and the second part as the offset parameters;

[0288] Temporal context-guided feature modulation:

[0289] The generated scale and offset parameters are used to perform an affine transformation on the spatially enhanced features of each frame, generating the modulated features:

[0290] ;

[0291] where the modulation parameters are broadcast to the dimensions of the feature map, and the operator denotes channel-wise multiplication;

[0292] Temporal importance weighted fusion:

[0293] The modulated feature sequence is weighted and summed using the temporal importance weight vector to obtain the final fused feature vector used for the regression task:

[0294] ;

[0295] The result is the final fused feature vector output for the regression task.
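The regression branch described above amounts to a channel-wise affine modulation driven by a pooled temporal context. The sketch below follows that description under stated assumptions: layouts are `(T, C, H, W)` for the spatial features and `(T, D)` for the temporal features, the two-layer network uses ReLU, and its output is split evenly into scale and offset. All names and shapes are hypothetical.

```python
import numpy as np

def film_modulate(spatial_feats, temporal_feats, w1, b1, w2, b2, weights):
    """spatial_feats: (T, C, H, W); temporal_feats: (T, D); weights: (T,).
    w1: (D, hidden), w2: (hidden, 2*C). Returns the fused (C, H, W) feature."""
    ctx = temporal_feats.mean(axis=0)        # global temporal context, (D,)
    hidden = np.maximum(0.0, ctx @ w1 + b1)  # ReLU
    out = hidden @ w2 + b2                   # (2*C,)
    c = spatial_feats.shape[1]
    gamma, beta = out[:c], out[c:]           # split into scale and offset
    modulated = (gamma[None, :, None, None] * spatial_feats
                 + beta[None, :, None, None])
    return np.tensordot(weights, modulated, axes=(0, 0))  # weighted sum over T
```

Because the scale and offset are shared across frames and spatial positions, the extra cost over the classification branch is only the small two-layer network and one broadcast multiply-add.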

[0296] Furthermore, the design of the multi-task loss function;

[0297] The first term is the spectrum-guided perceptual loss:

[0298] To guide the multi-scale adaptive high-frequency enhancement filter module toward the expected image enhancement characteristics, and to keep it from getting stuck in a suboptimal solution when gradients are backpropagated through a lengthy detection network, a corresponding spectrum-guided perceptual loss is designed.

[0299] Step 1: High-frequency energy enhancement loss, designed to maximize the energy of the specific high-frequency components in the output image that are relevant to small targets;

[0300] A high-frequency component extraction operator is defined, using a Laplacian convolution kernel as a high-pass filter to approximate the second-order gradient of the image; for any image, its high-frequency component is calculated as follows:

[0301] ;

[0302] where the two operators denote the convolution operation and the absolute value, the latter used to obtain the intensity of the high-frequency response;

[0303] The high-frequency component contrast loss is defined as the difference in high-frequency energy between the output image and the input image across the entire image range, and its calculation formula is as follows:

[0304] ;

[0305] where the two images are the original input infrared image and the output image of the module, respectively;

[0306] Step 2: Multi-scale structural fidelity loss

[0307] To prevent over-processing of modules from causing image structure damage, contrast distortion, or the introduction of unnatural artifacts, the structural similarity between the output and input images is constrained at multiple scales. A multi-scale structural similarity index is used as a metric for fidelity loss. The formula for calculating the multi-scale structural fidelity loss is as follows:

[0308] ;

[0309] where the scale index runs over the scale levels and the total number of scales is set to 5; the corresponding subscript indicates the coarsest scale; and the three exponents adjust the relative importance of each component. The luminance, contrast, and structure comparison functions are calculated as follows, where $x$ and $y$ are the two images being compared, $\mu_x$ and $\mu_y$ are their local means, $\sigma_x$ and $\sigma_y$ are their standard deviations, $\sigma_{xy}$ is their covariance, and $C_1$, $C_2$, $C_3$ are small constants that avoid division by zero and keep the calculation stable:

[0310] $l(x, y) = \dfrac{2\mu_x\mu_y + C_1}{\mu_x^2 + \mu_y^2 + C_1}$;

[0311] $c(x, y) = \dfrac{2\sigma_x\sigma_y + C_2}{\sigma_x^2 + \sigma_y^2 + C_2}$;

[0312] $s(x, y) = \dfrac{\sigma_{xy} + C_3}{\sigma_x\sigma_y + C_3}$;
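The three comparison functions can be evaluated numerically as in the sketch below. It is a simplification: statistics are computed over the whole image instead of local windows, only a single scale is shown, and the constants are illustrative values for images scaled to [0, 1]. The function name and defaults are hypothetical.

```python
import numpy as np

def ssim_components(x: np.ndarray, y: np.ndarray, c1: float = 1e-4, c2: float = 9e-4):
    """Return (luminance, contrast, structure) comparisons from global statistics."""
    c3 = c2 / 2.0
    mu_x, mu_y = x.mean(), y.mean()
    sd_x, sd_y = x.std(), y.std()
    cov = ((x - mu_x) * (y - mu_y)).mean()
    lum = (2 * mu_x * mu_y + c1) / (mu_x ** 2 + mu_y ** 2 + c1)
    con = (2 * sd_x * sd_y + c2) / (sd_x ** 2 + sd_y ** 2 + c2)
    struct = (cov + c3) / (sd_x * sd_y + c3)
    return lum, con, struct
```

Identical images give a value of 1 for all three terms, while a contrast-inverted copy drives the structure term negative. A multi-scale variant would repeat the contrast and structure terms on successively downsampled images and apply the luminance term at the coarsest scale only.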

[0313] Step 3: Loss Integration

[0314] The spectrum-guided perceptual loss is a weighted sum of the two sub-losses mentioned above:

[0315] ;

[0316] in, and As adjustable hyperparameters, the balance between high-frequency enhancement strength and structural fidelity strength is controlled separately, with initial values ​​set to 0.7 and 0.3 respectively;

[0317] The second term is the small-target-sensitive detection loss:

[0318] To optimize the infrared ship small target detection task accurately and to address the core problems of traditional detection losses on small-scale targets, namely gradient instability, imbalance between positive and negative samples, and inaccurate localization, a small-target-sensitive detection loss is designed.

[0319] Step 1: Classification loss based on dynamic focal modulation and class balancing:

[0320] To address the imbalance between background pixels (negative samples) and target pixels (positive samples) in images, a classification loss based on dynamic focus modulation and class balancing is proposed. The optimization process is rebalanced through two levels of weight adjustment.

[0321] Dynamic category weight calculation:

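[0322] Let the total number of samples in the batch be $N$. Count the number of positive and negative samples in the current batch:

[0323] $$N_{\mathrm{pos}} = \sum_{i=1}^{N} \mathbb{1}(y_i = 1)$$, $$N_{\mathrm{neg}} = \sum_{i=1}^{N} \mathbb{1}(y_i = 0)$$;

[0324] where $y_i$ is the true label of the $i$-th sample, and $\mathbb{1}(\cdot)$ is the indicator function;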

[0325] The class balance weights are calculated dynamically:

[0326] ;

[0327] ;

[0328] where the added term is a very small constant that maintains numerical stability;

[0329] Focus modulation factor design:

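[0330] Define the probability $p_{t,i}$ that sample $i$ is predicted correctly:

[0331] $$p_{t,i} = \begin{cases} p_i, & y_i = 1 \\ 1 - p_i, & y_i = 0 \end{cases}$$;

[0332] where $p_i$ is the probability, predicted by the model, that sample $i$ belongs to the positive class;

[0333] The focal modulation factor is defined as $(1 - p_{t,i})^{\gamma}$, where $\gamma$ is an adjustable focusing parameter that can be set to 2. The modulated, weighted cross-entropy loss is: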

[0334] ;

[0335] Complete loss function and numerical stability handling:

[0336] To keep the numerical calculation stable and to avoid logarithm overflow when a probability approaches zero, the predicted probabilities are truncated (clipped) in the actual calculation:

[0337] ;

[0338] in, ;

[0339] When the final loss is computed, the truncated probability is used in place of the raw predicted probability. The batch-averaged dynamic focal modulation classification loss is:

[0340] ;
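The classification loss described in Step 1 can be sketched as follows. This is an illustrative NumPy version under stated assumptions, not the patent's exact formula: the inverse-frequency form of the class balance weights, the clipping bound `eps=1e-7`, and the function name `dynamic_focal_loss` are all hypothetical, because the corresponding equations are not legible in the source. Only the focusing parameter value 2 is taken from the text.

```python
import numpy as np

def dynamic_focal_loss(p, y, gamma=2.0, eps=1e-7):
    """Batch-averaged focal loss with per-batch class weights (illustrative sketch)."""
    # Clip probabilities so that log() never receives exactly 0.
    p = np.clip(np.asarray(p, dtype=np.float64), eps, 1.0 - eps)
    y = np.asarray(y)
    n = y.size
    n_pos = int(np.sum(y == 1))
    n_neg = int(np.sum(y == 0))
    # ASSUMPTION: inverse-frequency weights; the source formula is not legible.
    w_pos = n / (2.0 * (n_pos + eps))
    w_neg = n / (2.0 * (n_neg + eps))
    weights = np.where(y == 1, w_pos, w_neg)
    # Probability assigned to the correct class.
    p_t = np.where(y == 1, p, 1.0 - p)
    per_sample = -weights * (1.0 - p_t) ** gamma * np.log(p_t)
    return float(per_sample.mean())
```

The factor `(1 - p_t) ** gamma` shrinks the contribution of samples that are already classified confidently, so the abundant, easy background samples do not dominate the gradient.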

[0341] Step 2: Regression loss based on normalized Wasserstein distance :

[0342] To address the fundamental problems of traditional IoU and its variants in small target detection, such as gradient instability, oversensitivity to small offsets, and excessive scale dependence, we adopt Normalized Wasserstein Distance (NWD) as the core metric and loss function for bounding box regression. This redefines bounding box matching from the perspective of probability distribution similarity, providing a smooth, stable, and scale-invariant optimized gradient for infrared ship small target detection.

[0343] Gaussian distribution modeling of the bounding box:

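[0344] Any bounding box $R = (c_x, c_y, w, h)$, where $(c_x, c_y)$ are the center coordinates and $w$ and $h$ are the width and height, respectively, is modeled as a two-dimensional Gaussian distribution $\mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma})$:

[0345] The mean vector represents the central location of the distribution:

[0346] $$\boldsymbol{\mu} = \begin{bmatrix} c_x \\ c_y \end{bmatrix}$$;

[0347] The covariance matrix represents the extent to which the distribution is spread in the horizontal and vertical directions:

[0348] $$\boldsymbol{\Sigma} = \begin{bmatrix} \dfrac{w^2}{4} & 0 \\ 0 & \dfrac{h^2}{4} \end{bmatrix}$$;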

[0349] Wasserstein distance calculation:

[0350] For two Gaussian distributions, the second-order Wasserstein distance has a closed-form solution.

[0351] Wasserstein distance squared decomposition:

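[0352] $$W_2^2(\mathcal{N}_1, \mathcal{N}_2) = \left\| \boldsymbol{\mu}_1 - \boldsymbol{\mu}_2 \right\|_2^2 + \mathrm{Tr}\!\left( \boldsymbol{\Sigma}_1 + \boldsymbol{\Sigma}_2 - 2 \left( \boldsymbol{\Sigma}_1^{1/2} \boldsymbol{\Sigma}_2 \boldsymbol{\Sigma}_1^{1/2} \right)^{1/2} \right)$$;

[0353] where $\|\cdot\|_2$ denotes the Euclidean norm and $\mathrm{Tr}(\cdot)$ represents the trace of a matrix;

[0354] Simplification for diagonal covariance matrices:

[0355] Since both $\boldsymbol{\Sigma}_1$ and $\boldsymbol{\Sigma}_2$ are diagonal matrices, the above equation can be simplified. Let:

[0356] $$\boldsymbol{\Sigma}_1 = \mathrm{diag}(\sigma_{x1}^2, \sigma_{y1}^2)$$, $$\boldsymbol{\Sigma}_2 = \mathrm{diag}(\sigma_{x2}^2, \sigma_{y2}^2)$$;

[0357] where $\sigma_{x1} = w_1/2$, $\sigma_{y1} = h_1/2$, $\sigma_{x2} = w_2/2$, $\sigma_{y2} = h_2/2$;

[0358] then:

[0359] $$W_2^2(\mathcal{N}_1, \mathcal{N}_2) = \left\| \boldsymbol{\mu}_1 - \boldsymbol{\mu}_2 \right\|_2^2 + (\sigma_{x1} - \sigma_{x2})^2 + (\sigma_{y1} - \sigma_{y2})^2$$;

[0360] Substituting the specific parameters yields the final expression:

[0361] $$W_2^2(\mathcal{N}_1, \mathcal{N}_2) = (c_{x1} - c_{x2})^2 + (c_{y1} - c_{y2})^2 + \frac{(w_1 - w_2)^2}{4} + \frac{(h_1 - h_2)^2}{4}$$;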

[0362] Normalized Wasserstein distance calculation:

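[0363] The Wasserstein distance is mapped to the [0,1] interval by exponential normalization, with a small constant $\epsilon$ added to avoid numerical issues:

[0364] $$\mathrm{NWD}(\mathcal{N}_1, \mathcal{N}_2) = \exp\!\left( -\frac{\sqrt{W_2^2(\mathcal{N}_1, \mathcal{N}_2) + \epsilon}}{C} \right)$$;

[0365] where $C$ is a normalization constant, which can be taken as the average diagonal length of the target boxes in the dataset;

[0366] Constructing the regression loss function:

[0367] $$\mathcal{L}_{\mathrm{reg}} = 1 - \mathrm{NWD}(\mathcal{N}_{\mathrm{pred}}, \mathcal{N}_{\mathrm{gt}})$$;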
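The regression loss of Step 2 reduces to a few arithmetic operations on the box parameters. The sketch below follows the closed form for diagonal covariances; the default `c=12.0`, the placement of `eps` inside the square root, and the function name `nwd_loss` are assumptions chosen for illustration.

```python
import math

def nwd_loss(box_a, box_b, c=12.0, eps=1e-7):
    """1 - NWD for two boxes given as (cx, cy, w, h).

    c is the normalization constant (hypothetical default; the text suggests
    the average diagonal length of target boxes in the dataset).
    """
    cxa, cya, wa, ha = box_a
    cxb, cyb, wb, hb = box_b
    # Squared 2-Wasserstein distance between the two box Gaussians.
    w2_sq = ((cxa - cxb) ** 2 + (cya - cyb) ** 2
             + ((wa - wb) / 2.0) ** 2 + ((ha - hb) / 2.0) ** 2)
    # eps keeps the square root differentiable at zero distance (assumed placement).
    nwd = math.exp(-math.sqrt(w2_sq + eps) / c)
    return 1.0 - nwd
```

Unlike IoU, the value keeps changing smoothly when two small boxes no longer overlap, so a predicted box that misses a tiny target by a few pixels still receives a useful gradient.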

[0368] Step 3: Loss Integration

[0369] The detection loss for small targets is a weighted sum of the two sub-losses mentioned above:

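[0370] $$\mathcal{L}_{\mathrm{det}} = \lambda_{\mathrm{cls}} \mathcal{L}_{\mathrm{cls}} + \lambda_{\mathrm{reg}} \mathcal{L}_{\mathrm{reg}}$$;

[0371] where $\mathcal{L}_{\mathrm{cls}}$ is the classification loss, $\mathcal{L}_{\mathrm{reg}}$ is the regression loss, and $\lambda_{\mathrm{cls}}$ and $\lambda_{\mathrm{reg}}$ are adjustable hyperparameters that balance the classification and regression tasks, with initial values set to 1.0 and 2.0, respectively.

[0372] The third term is the saliency contrast guidance loss:

[0373] This loss ensures that the saliency map generated by the SG-DAFE module focuses accurately on the real target area, increases the separability of target and background, and keeps attention from being distracted by background noise;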

[0374] Step 1: Region Mask Generation

[0375] First, binary region masks are generated from the ground-truth annotations of the current training sample; let the bounding box of the $k$-th ground-truth target be $B_k$; its corresponding target region mask $M_k^{\mathrm{tgt}}$ is:

[0376] $$M_k^{\mathrm{tgt}}(x, y) = \begin{cases} 1, & (x, y) \in B_k \\ 0, & \text{otherwise} \end{cases}$$;

[0377] The background sampling region consists of two parts: the first is a ring-shaped region around each target bounding box, ranging from 0 to 2 times the target size; the second is the part of the whole image that does not overlap any target bounding box. From the total background sampling region, background blocks, each with an area equal to that of the target box, are sampled at random. The mask of each background block is:

[0378] ;

[0379] The background mask for the entire image is:

[0380] ;

[0381] Step 2: Calculation of regional saliency statistics:

[0382] Let the saliency map generated by the SG-DAFE module take values in the range [range missing];

[0383] For each target, the mean saliency of the target region is:

[0384] ;

[0385] For the current image, the mean saliency of the background region is:

[0386] ;

[0387] where the two normalizers are the total numbers of non-zero pixels within the target mask and the background mask, respectively;

[0388] Step 3: Saliency contrast guidance loss formula:

[0389] A margin-based contrastive loss is used; for an image containing several ground-truth targets, the loss is calculated as follows:

[0390] ;

[0391] where the margin is a preset hyperparameter, initially set to 0.4, meaning that the mean saliency of the target region should exceed that of the background region by at least the margin; the difference term for each target is the saliency difference between that target and the background;
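A margin-based contrast between regional saliency means can be sketched as below. The hinge form `max(0, margin - difference)` averaged over targets is an assumed reading of the margin description, and the function and argument names are hypothetical; only the margin value 0.4 comes from the text.

```python
import numpy as np

def saliency_contrast_loss(saliency, target_masks, background_mask, margin=0.4):
    """Hinge-style loss on (target mean saliency - background mean saliency)."""
    # Mean saliency over the non-zero pixels of the background mask.
    bg_mean = saliency[background_mask > 0].mean()
    losses = []
    for mask in target_masks:
        tgt_mean = saliency[mask > 0].mean()
        # No penalty once the target exceeds the background by at least the margin.
        losses.append(max(0.0, margin - (tgt_mean - bg_mean)))
    return float(np.mean(losses))
```

For instance, a saliency map that is uniformly 0.5 gives a loss equal to the margin, while a map that is 1 on the target and 0 on the sampled background gives zero loss.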

[0392] The fourth term is time-aware adaptive loss ( ):

[0393] To strengthen the temporal modeling capability of the SG-TFM module in dynamically changing environments, loss terms are constructed along three dimensions: scale change coherence, motion-invariant features, and temporal saliency focus. Together they keep detection stable and reliable in complex sequences in which the target approaches the sensor.

[0394] Step 1: Scale change coherence loss:

[0395] For the $t$-th frame, let the predicted target box size be $(w_t, h_t)$; the pixel area $A_t$ of the target is:

[0396] $$A_t = w_t \cdot h_t$$;

[0397] For uniform or uniformly accelerated approach, the target area change rate should change smoothly, constraining the difference in area change rates between adjacent frames. The loss function is:

[0398] ;

[0399] where the added term is a numerical stability constant;

[0400] Step 2: Motion-invariant feature loss:

[0401] Calculation of the feature relation matrix:

[0402] Given the feature sequence $\{F_1, \dots, F_T\}$ enhanced by the SG-TFM module, construct a relation matrix $R \in \mathbb{R}^{T \times T}$ whose element $R_{ij}$ is the cosine similarity between the features of the $i$-th frame and the $j$-th frame:

[0403] $$R_{ij} = \frac{\langle F_i, F_j \rangle}{\|F_i\|_2 \, \|F_j\|_2 + \epsilon}$$;

[0404] where $\epsilon$ is a numerical stability constant;

[0405] Parameterization of affine transformation:

[0406] Using affine transformations to simulate global motion:

[0407] ;

[0408] The parameterization form is as follows:

[0409] , ;

[0410] where the two scaling factors represent field-of-view scaling in the horizontal and vertical directions; the rotation parameter is the angle by which the image rotates counterclockwise around the origin; the two shear parameters represent horizontal and vertical shear deformation; and the two translation parameters represent image translation. Each parameter is drawn from its own preset range;

[0411] Loss function calculation:

[0412] A random affine transformation is applied to the original feature sequence to obtain an enhanced feature sequence. The relation matrices of the original and the enhanced feature sequences are then computed, and the Frobenius norm of their difference is used as the motion-invariant feature loss. The calculation is as follows:

[0413] ;
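The relation matrix and its Frobenius-norm comparison can be sketched as follows. The array layout `(T, C, H, W)`, the absence of any further normalization of the norm, and both function names are assumptions for illustration; the exact loss expression is not legible in the source.

```python
import numpy as np

def relation_matrix(features, eps=1e-8):
    """Cosine similarity between every pair of frames; features has shape (T, C, H, W)."""
    flat = features.reshape(features.shape[0], -1).astype(np.float64)
    norms = np.linalg.norm(flat, axis=1)
    return (flat @ flat.T) / (np.outer(norms, norms) + eps)

def motion_invariance_loss(features, transformed, eps=1e-8):
    """Frobenius norm of the difference between the two relation matrices."""
    diff = relation_matrix(features, eps) - relation_matrix(transformed, eps)
    return float(np.linalg.norm(diff, ord="fro"))
```

Because the matrix only stores similarities between frames, a global motion applied identically to every frame (for example a circular shift of all feature maps) leaves it unchanged, so the loss penalizes only changes in inter-frame relationships.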

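[0414] Step 3: Temporal saliency focusing loss:

[0415] Saliency centroid alignment loss:

[0416] Let the saliency map of the $t$-th frame be $S_t$, the size of the target's ground-truth bounding box be $(w_t, h_t)$, and its center coordinates be $(c_{x,t}, c_{y,t})$;

[0417] The centroid coordinates $(\bar{x}_t, \bar{y}_t)$ of the saliency map are calculated as follows:

[0418] $$\bar{x}_t = \frac{\sum_{x, y} x \, S_t(x, y)}{\sum_{x, y} S_t(x, y) + \epsilon}$$;

[0419] $$\bar{y}_t = \frac{\sum_{x, y} y \, S_t(x, y)}{\sum_{x, y} S_t(x, y) + \epsilon}$$;

[0420] where $\epsilon$ is a small constant that keeps the numerical calculation stable;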
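The saliency centroid is a saliency-weighted mean of pixel coordinates. A minimal NumPy sketch, with the function name `saliency_centroid` and the `(x, y)` return order chosen for illustration:

```python
import numpy as np

def saliency_centroid(saliency, eps=1e-8):
    """Return the (x, y) centroid of a 2D saliency map of shape (H, W)."""
    h, w = saliency.shape
    # ys holds the row index and xs the column index of every pixel.
    ys, xs = np.mgrid[0:h, 0:w]
    total = saliency.sum() + eps  # eps guards against an all-zero map
    cx = (xs * saliency).sum() / total
    cy = (ys * saliency).sum() / total
    return float(cx), float(cy)
```

The centroid alignment loss then compares this point with the center of the ground-truth box, so a saliency map whose mass drifts away from the target is penalized.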

[0421] The centroid alignment loss is:

[0422] ;

[0423] Saliency distribution compactness loss:

[0424] Ideally, the saliency distribution should be concentrated compactly in the key region rather than spread uniformly; constraining the second moment (variance) of the saliency values forces the salient response to concentrate, which improves the signal-to-noise ratio;

[0425] The second central moment (variance) of the saliency distribution is calculated as follows:

[0426] ;

[0427] The diagonal length of the target's ground-truth bounding box determines the desired upper bound on compactness, which is:

[0428] ;

[0429] The compactness loss is:

[0430] ;

[0431] where the relaxation coefficient can be set to 0.1, which allows the compactness to slightly exceed the upper bound; the normalization factor is:

[0432] ;

[0433] The temporal saliency focusing loss is:

[0434] ;

[0435] where the weighting coefficients of the centroid alignment loss and the compactness loss are set to 0.6 and 0.4, respectively;

[0436] Step 4: Time-Aware Adaptive Loss Integration

[0437] The total loss is calculated as follows:

[0438] ;

[0439] where the three weighting coefficients are set to 0.4, 0.4, and 0.2, respectively.

[0440] The fifth term is the spatiotemporal feature alignment loss:

[0441] The SG-DAFE module and the SG-TFM module extract and enhance target features in the spatial and temporal dimensions, respectively. However, the two modules differ in architecture, optimization objective, and feature processing procedure. A spatiotemporal feature alignment loss is therefore designed to keep their features consistent.

[0442] Let the length of the video sequence be $T$. For the $t$-th frame, the SG-DAFE module outputs a feature map and a saliency map, and the SG-TFM module outputs a regression feature map;

[0443] First, L2 normalization is performed on the features along the channel dimension to ensure scale invariance for feature comparison:

[0444] , ;

[0445] where the first vector is the feature vector output by the SG-DAFE module at a given spatial position and the second is the feature vector output by the SG-TFM module at the same position; the added term is a small constant that keeps the numerical calculation stable;

[0446] For each spatial position, the cosine similarity of the normalized feature vectors is calculated and weighted by the saliency value at that position:

[0447] ;

[0448] where the product is the dot product of the two vectors;

[0449] The saliency-weighted average similarity of the $t$-th frame is calculated as:

[0450] ;

[0451] The spatiotemporal feature alignment loss is obtained by averaging over all frames:

[0452] ;
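The alignment computation above can be sketched in NumPy. The array layout, the final form `1 - similarity` averaged over frames, and the function name `alignment_loss` are assumptions for illustration, because the source equations are not legible.

```python
import numpy as np

def alignment_loss(f_spatial, f_temporal, saliency, eps=1e-8):
    """Saliency-weighted cosine alignment between two feature sequences.

    f_spatial, f_temporal: arrays of shape (T, C, H, W); saliency: (T, H, W).
    """
    # L2-normalize every feature vector along the channel dimension.
    a = f_spatial / (np.linalg.norm(f_spatial, axis=1, keepdims=True) + eps)
    b = f_temporal / (np.linalg.norm(f_temporal, axis=1, keepdims=True) + eps)
    cos = (a * b).sum(axis=1)  # per-position cosine similarity, shape (T, H, W)
    # Saliency-weighted average similarity for each frame.
    per_frame = (saliency * cos).sum(axis=(1, 2)) / (saliency.sum(axis=(1, 2)) + eps)
    # ASSUMPTION: the loss is one minus the similarity, averaged over frames.
    return float(np.mean(1.0 - per_frame))
```

Weighting by saliency means that disagreement between the two modules matters most at likely target positions and is largely ignored over sea clutter.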

[0453] Finally, the total loss function:

[0454] The total loss function is designed as follows:

[0455] ;

[0456] where the five coefficients are the dynamic weight coefficients of the respective loss terms, each a function of the number of training steps;

[0457] Given the total number of training steps, the dynamic weight formula is:

[0458] ;

[0459] where the weights are learnable parameters, whose initial values are set to 1.0, 2.0, 0.5, 0.8, and 0.3, respectively.

[0460] Furthermore, the progressive adaptive training method:

[0461] To address the inherent challenges of scarce and difficult-to-annotate infrared small target data for ships, and to ensure that the multi-scale adaptive high-frequency boosting filter module (MAHF), saliency-guided dual-path attention feature enhancement module (SG-DAFE), and saliency-guided temporal fusion module (SG-TFM) described in this invention can be sufficiently and effectively trained, this invention proposes a three-stage progressive training paradigm. This paradigm achieves a stable and efficient conversion from a general visual base model to a dedicated infrared small target detector by combining domain-specific knowledge transfer and hierarchical optimization scheduling.

[0462] Phase One: Foundation of General Visual Representation

[0463] Objective: To acquire basic perception capabilities of object edges, shapes, textures, and contextual relationships, providing a high-quality input feature base for the subsequent dedicated processing modules of this invention;

[0464] Operation: The backbone network is pre-trained on a large-scale visible light image classification dataset; then, the complete detection architecture is trained on a general object detection dataset to learn preliminary object localization and classification knowledge.

[0465] Output: Obtain a generalized visual detector with strong generalization capabilities, whose parameters will serve as the starting point for initialization of all subsequent stages;

[0466] Phase Two: Infrared Imaging Domain Adaptation and Dedicated Module Initialization

[0467] Objective: To bridge the domain distribution differences between visible light and infrared imaging, and to initialize and warm up the core enhancement module of this invention so that it can initially adapt to the statistical characteristics of infrared data;

[0468] Operation:

[0469] ①Data: Large-scale publicly available infrared thermal imaging scene datasets;

[0470] ② Network preparation: Load the model parameters from Phase 1; introduce and initialize the MAHF, SG-DAFE, and SG-TFM modules, and integrate them into the network front-end and feature paths;

[0471] ③ Differentiation optimization strategy:

[0472] Low-level general feature protection: An extremely low learning rate (≤1e-5), or one allowed to vary only slightly, is applied to the front-end layers of the backbone network (responsible for extracting low-level features). This essentially freezes the general filters learned from visible light and shields them from infrared-domain noise.

[0473] Mid-level semantic feature adaptation: A moderate learning rate (approximately 1e-4) is applied to the back end of the backbone network so that it can adjust how features are combined to better represent the contrast between thermal radiation targets and the background in infrared images;

[0474] Rapid warm-up of dedicated modules: The newly added modules of this invention (MAHF, SG-DAFE, and SG-TFM) use the highest learning rate of all parameter groups (approximately 1e-3). Using the relatively abundant infrared data available at this stage, these modules quickly learn their core functions: MAHF learns to enhance the weak high-frequency signals of small infrared targets; SG-DAFE learns to focus on potential target areas in infrared scenes; and SG-TFM learns the temporal correlation patterns between infrared frames.
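The differentiated optimization strategy of Phase Two amounts to assigning each parameter to a learning-rate group. The sketch below uses plain Python; the parameter-name prefixes (`backbone.stage0.`, `mahf.`, and so on) and the choice to give all remaining parameters the moderate rate are hypothetical, since the source does not name the layers or state a rate for the detection head in this phase.

```python
from collections import defaultdict

# Hypothetical name prefixes; a real model would use its own module names.
NEW_MODULE_PREFIXES = ("mahf.", "sg_dafe.", "sg_tfm.")
FRONT_END_PREFIXES = ("backbone.stage0.", "backbone.stage1.")

def stage2_learning_rate(param_name: str) -> float:
    """Map a parameter name to its Phase Two learning rate."""
    if param_name.startswith(NEW_MODULE_PREFIXES):
        return 1e-3  # rapid warm-up of the newly added modules
    if param_name.startswith(FRONT_END_PREFIXES):
        return 1e-5  # near-frozen low-level filters
    return 1e-4      # ASSUMPTION: everything else gets the moderate rate

def group_by_learning_rate(param_names):
    """Return optimizer-style groups: [{'lr': ..., 'params': [names]}, ...]."""
    groups = defaultdict(list)
    for name in param_names:
        groups[stage2_learning_rate(name)].append(name)
    return [{"lr": lr, "params": names} for lr, names in sorted(groups.items())]
```

The returned list has the same shape as the per-group options accepted by common deep learning optimizers, with parameter names standing in for the tensors themselves.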

[0475] Phase Three: Specialized Fine-tuning of Small Ship Targets and Multi-Task Collaboration

[0476] Objective: To complete the final optimization of the model based on scarce real infrared ship small target data, with a focus on enhancing the collaborative enhancement capabilities of each patented module for small target features;

[0477] Operation:

[0478] ① Data: Annotated real infrared ship sequence data (including a large number of small target samples);

[0479] ② Full unfreezing and fine-tuning: All network parameters are now trainable; more refined hierarchical learning rate scheduling is implemented.

[0480] Base layer: Maintain a near-frozen learning rate to solidify the visual foundation;

[0481] Semantic layer: Applying a targeted learning rate, focusing on modeling the specific semantic structure of "sea surface-sky-ship" and multi-scale representation of small targets;

[0482] Task head and patented modules: These receive significantly higher learning rates, especially the SG-DAFE and SG-TFM modules. The high learning rate drives their rapid optimization so that their attention mechanisms lock accurately onto small ship targets and track them robustly over time.

[0483] ③ Multi-task loss joint optimization: Fully activate and apply the multi-task loss function designed in this invention to guide the MAHF module, SG-DAFE module, SG-TFM module and detection task head to perform deep collaboration, ensuring the consistency of feature enhancement, attention focus, temporal fusion and the final detection target.

[0484] Another objective of this invention is to provide an infrared ship small target detection system based on multi-frame fusion video stream mode, comprising:

[0485] The multi-scale adaptive high-frequency boosting filter module is used to separate the internal texture details and contour edge information of infrared small targets through filters of different scales; learnable weight parameters are introduced to achieve intelligent fusion of multi-scale features under different scenarios; based on standard convolution operations, the module can be efficiently integrated into real-time detection systems.

[0486] A dual-path attention feature enhancement module is used to reduce computational complexity by utilizing local windows and sparse sampling mechanisms; a guided fusion mechanism ensures that small target features are given priority enhancement; and the contribution ratio of local and global attention is automatically adjusted according to the input features.

[0487] The saliency-guided temporal fusion module applies different pooling strategies to high and low saliency regions using "saliency-weighted feature compression," compressing background information while preserving target details, thus optimizing the computational cost and feature quality of temporal modeling. By encoding multi-dimensional saliency information into attention biases and using them as dynamic adjustment factors for temporal position encoding and attention computation, the fusion process can automatically focus on key frames and key information. Frame-level importance weights are derived in reverse using the bias matrix naturally generated in multi-head attention computation, replacing the traditional fixed pooling strategy and achieving dynamic temporal fusion that is highly relevant to the content.

[0488] The function design module is used for designing multi-task loss functions;

[0489] The adaptive training module is used for progressive adaptive training.

[0490] Another object of the present invention is to provide a computer device including a memory and a processor, the memory storing a computer program, which, when executed by the processor, causes the processor to perform the steps of the infrared ship small target detection method based on multi-frame fusion video stream mode.

[0491] Another object of the present invention is to provide a computer-readable storage medium storing a computer program, which, when executed by a processor, causes the processor to perform the steps of the infrared ship small target detection method based on multi-frame fusion video stream mode.

[0492] Another objective of this invention is to provide an information data processing terminal for implementing the infrared ship small target detection system based on multi-frame fusion video stream mode.

[0493] In view of the above technical solutions and the technical problems they solve, the advantages and positive effects of the technical solution to be protected by this invention are analyzed from the following aspects:

[0494] 1. High detection accuracy, especially sensitive to small targets.

[0495] Multi-dimensional feature enhancement: By introducing a saliency-guided dual-path attention feature enhancement module, global feature enhancement of single-frame features is performed from the spatial domain perspective. Combined with a saliency-guided temporal fusion module, multi-frame features are fused from the temporal domain perspective. This enables the full mining of potential information of weak targets from both spatial and temporal dimensions. This spatiotemporal dual-domain collaborative enhancement mechanism effectively overcomes the problems of low signal-to-noise ratio and weak features of targets in single-frame images, significantly improves the detection rate of small infrared targets of ships at long distances, and greatly reduces the risk of missed detection.

[0496] 2. The algorithm is robust and adaptable to various environments.

[0497] The carefully designed loss function, especially the multi-scale consistency loss, ensures that the network's perception of the same target is consistent across different processing stages and scales, further improving the model's stability and reliability in the face of complex and ever-changing sea environments.

[0498] 3. Overall performance is balanced, combining high efficiency and real-time performance.

[0499] By employing a lightweight image preprocessing module and a high-efficiency Transformer encoder, the computational complexity of the model was strictly controlled while ensuring performance improvement.

[0500] Compared to traditional dense computation methods such as optical flow, the multi-frame fusion strategy of this invention has higher computational efficiency, is easier to deploy on hardware platforms with limited resources, and meets the stringent requirements of real-time video stream processing.

[0501] 4. The training process is scientific, and the model has excellent generalization ability.

[0502] The innovative multi-task loss function dynamically adjusts the weights of each loss term, making the model training process smoother and more efficient, avoiding the tedious manual parameter tuning, and guiding the model to converge to a better performance balance point.

[0503] A progressive adaptive training method is employed, which fully leverages the knowledge from large-scale public datasets and effectively bridges the domain differences between public data and scarce real-world infrared ship data. By fine-tuning the learning rate differentially, lower learning rates are set for primary features and higher learning rates for advanced features, enabling the model to quickly and efficiently adapt to specific, data-scarce infrared ship small-target scenarios. This addresses the core pain point of data scarcity in this field and significantly improves the model's practicality and generalization ability.

[0504] 5. Systematically integrate innovation to create technological synergy.

[0505] This invention is not a simple aggregation of multiple technologies, but rather an organic and systematic integration of image preprocessing, spatial domain enhancement, temporal domain fusion, loss functions, and training strategies. The various modules support and collaborate with each other, forming a complete and efficient solution for infrared ship small target detection, resulting in a synergistic effect of "1+1>2".

Brief Description of the Drawings

[0506] Figure 1 is a flowchart of an infrared ship small target detection method based on multi-frame fusion video stream mode provided by an embodiment of the present invention.

[0507] Figure 2 is a flowchart of the multi-scale adaptive high-frequency boosting filter module method provided in the embodiment of the present invention.

[0508] Figure 3 is a structural block diagram of an infrared ship small target detection system based on multi-frame fusion video stream mode provided in an embodiment of the present invention.

Detailed Description of the Embodiments

[0509] To make the objectives, technical solutions, and advantages of this invention clearer, the invention will be further described in detail below with reference to embodiments. It should be understood that the specific embodiments described herein are merely illustrative and not intended to limit the invention.

[0510] As shown in Figure 1, an infrared ship small target detection method based on multi-frame fusion video stream mode provided by an embodiment of the present invention includes the following steps:

[0511] S101 separates the internal texture details and contour edge information of infrared small targets through a multi-scale adaptive high-frequency boosting filter module using filters of different scales; it introduces learnable weight parameters to intelligently fuse multi-scale features in different scenarios; and it is based on standard convolution operations to ensure that the module can be efficiently integrated into a real-time detection system.

[0512] S102 utilizes a saliency-guided dual-path attention feature enhancement module with local windows and sparse sampling mechanisms to reduce computational complexity; the guided fusion mechanism ensures that small target features are enhanced in a focused manner; and the contribution ratio of local and global attention is automatically adjusted according to the input features.

[0513] S103 utilizes a saliency-guided temporal fusion module to apply different pooling strategies to high and low saliency regions using "saliency-weighted feature compression." This preserves target details while compressing background information, optimizing the computational cost and feature quality of temporal modeling. By encoding multi-dimensional saliency information into attention biases and using them as dynamic adjustment factors for temporal position encoding and attention computation, the fusion process can automatically focus on key frames and key information. Frame-level importance weights are derived in reverse using the bias matrix naturally generated in multi-head attention computation.

[0514] S104, Design of Multi-Task Loss Function;

[0515] S105, progressive adaptive training.

[0516] As shown in Figure 2, the multi-scale adaptive high-frequency boosting filter module provided in this embodiment of the invention operates as follows:

[0517] S201 Multi-scale Feature Separation:

[0518] Small-scale detail component extraction (for internal texture of the target):

[0519] ;

[0520] Mesoscale edge component extraction (for target contour boundaries):

[0521] ;

[0522] where the input is the infrared image, the two kernels are mean filter kernels of size 3×3 and 5×5, and the operator between image and kernel is convolution;

[0524] S202 Adaptive Feature Fusion:

[0525] Establish a weighted fusion model:

[0526] ;

[0527] where the two fusion weights are learnable parameters that control the enhancement intensity of the small-scale details and the mesoscale edges, respectively, and satisfy the adaptive optimization conditions:

[0528] ; ;

[0529] in which the objective is the loss function of the downstream detection task;

[0530] S203 High Frequency Enhancement:

[0531] Construct the complete high-frequency boosting formula:

[0532] ;

[0533] where the boosting coefficient is a learnable parameter that controls the overall contrast enhancement intensity applied to the original input infrared image and satisfies the adaptive optimization condition:

[0534] ;

[0535] The complete expression after expansion is:

[0536] ;

[0537] The three parameters are learned adaptively through end-to-end training, and their initialization strategy is as follows:

[0538] ; ; .

[0539] The dual-path attention feature enhancement module provided in this embodiment of the invention:

[0540] Step 1: Local window attention calculation:

[0541] The input feature map is divided into non-overlapping windows, and self-attention is computed within each window:

[0542] (1) Window division operation:

[0543] ;

[0544] ;

[0545] where the first quantity is the window size and the second is the number of windows;

[0546] (2) Calculation of self-attention within the window:

[0547] Each window is flattened from a 3D feature map into a 2D matrix; the query, key, and value matrices are then computed and split into multiple heads:

[0548] , , ;

[0549] where the three matrices are learnable projection weights, and the shared output size is the attention dimension;

[0550] The query, key, and value matrices are each divided into one part per attention head:

[0551] , ;

[0552] , ;

[0553] , ;

[0554] where the per-head size is the dimension of each attention head;

[0555] Relative position bias calculation:

[0556] First, construct a relative position bias table:

[0557] Define a relative position bias table $\hat{B} \in \mathbb{R}^{(2M-1) \times (2M-1) \times N_h}$ that covers all possible relative position offsets within the window, where $M$ is the window size and $N_h$ is the number of attention heads;

[0558] Relative position index:

[0559] For any two positions $(x_i, y_i)$ and $(x_j, y_j)$ within the window, the relative position offset is:

[0560] $$\Delta x = x_i - x_j$$, $$\Delta y = y_i - y_j$$;

[0561] where $x$, $y$ are the coordinates of a pixel within the window (ranging from $0$ to $M-1$);

[0562] Mapping two-dimensional offsets to one-dimensional indices:

[0563] $$\mathrm{idx} = (\Delta y + M - 1)(2M - 1) + (\Delta x + M - 1)$$;

[0564] This mapping yields the relative position index matrix, of size $M^2 \times M^2$, for all position pairs within the window;
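The index mapping above can be checked with a short NumPy sketch. The row-major ordering (vertical offset first) follows the common window-attention convention and is an assumption here, as is the function name `relative_position_index`.

```python
import numpy as np

def relative_position_index(m):
    """Return an (m*m, m*m) matrix of indices into a (2m-1)*(2m-1) bias table."""
    # Row and column coordinate of every position in the window, flattened.
    coords = np.stack(np.meshgrid(np.arange(m), np.arange(m), indexing="ij"))
    coords = coords.reshape(2, -1)                     # (2, m*m)
    rel = coords[:, :, None] - coords[:, None, :]      # pairwise offsets, (2, m*m, m*m)
    dy = rel[0] + (m - 1)   # shift vertical offsets into the range 0 .. 2m-2
    dx = rel[1] + (m - 1)   # shift horizontal offsets into the range 0 .. 2m-2
    return dy * (2 * m - 1) + dx
```

Every pair of positions with the same relative offset receives the same index, so one learned bias value per offset and per head is shared across the whole window.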

[0565] Window relative position offset matrix generation:

[0566] First, the bias table is reshaped into a two-dimensional matrix:

[0567] ;

[0568] Then, according to the relative position index matrix, the corresponding bias values are extracted from the bias table to construct the window relative position bias matrix:

[0569] ;

[0570] For each window, the attention weights of each head are calculated as follows:

[0571] ;

[0572] where the additive term is the relative position bias of the corresponding attention head, and the scaling term is the attention head dimension;

[0573] Feature aggregation:

[0574] ;

[0575] Multi-head fusion and output projection:

[0576] ;

[0577] where the final matrix is the output projection matrix;

[0578] (3) Window reorganization:

[0579] Reconstruct the window attention results into a complete feature map:

[0580] ;
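Window division and window reorganization are inverse reshaping operations. The sketch below assumes a channel-last layout `(H, W, C)` and spatial sizes that are exact multiples of the window size; both function names are illustrative.

```python
import numpy as np

def window_partition(x, m):
    """Split an (H, W, C) feature map into (num_windows, m, m, C) windows."""
    h, w, c = x.shape
    x = x.reshape(h // m, m, w // m, m, c)
    # Bring the two window-grid axes together, then merge them.
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, m, m, c)

def window_reverse(windows, m, h, w):
    """Reassemble (num_windows, m, m, C) windows into an (H, W, C) feature map."""
    c = windows.shape[-1]
    x = windows.reshape(h // m, w // m, m, m, c)
    return x.transpose(0, 2, 1, 3, 4).reshape(h, w, c)
```

Since attention is computed inside each window independently, the cost grows linearly with the number of windows rather than quadratically with the number of pixels.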

[0581] Step 2: Global Sparse Attention Calculation

[0582] (1) Generation of the saliency map:

[0583] Feature compression and conversion:

[0584] ;

[0585] where the weights are those of a learnable 1×1 convolution kernel, with a learnable bias; the number of intermediate channels is determined by the compression ratio (which can be set between 4 and 8); and the nonlinearity is an activation function;

[0586] Channel attention calculation:

[0587] ;

[0588] ;

[0589] ;

[0590] where the pooled vector is the channel descriptor vector; the fully connected layers have learnable weights and learnable biases; the gating function produces the channel attention weights; the product is a broadcast multiplication along the channel dimension; and the result is the attention-weighted feature map;

[0591] Spatial saliency generation:

[0592] ;

[0593] ;

[0594] where the weights are those of a learnable 1×1 convolution kernel, with a learnable bias term; the intermediate result is the raw saliency score, and the final result is the saliency map;

[0595] (2) Key point selection:

[0596] Based on the saliency map, the top-ranked most salient positions are selected:

[0597] ;

[0598] where the threshold is a saliency threshold used to screen out the top-ranked most salient points;

[0599] (3) Sparse attention computation:

[0600] Attention is computed only between the key points and all positions of the entire image:
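Key point selection can be sketched as a top-K search over the flattened saliency map. The function name `select_keypoints`, the parameter `k`, and the choice to return flat indices in descending saliency order are assumptions for illustration.

```python
import numpy as np

def select_keypoints(saliency, k):
    """Return flat indices of the k most salient positions, most salient first."""
    flat = saliency.ravel()
    k = min(k, flat.size)
    # argpartition finds the k largest entries without sorting the whole map.
    idx = np.argpartition(flat, -k)[-k:]
    # Order the selected entries by decreasing saliency.
    return idx[np.argsort(flat[idx])[::-1]]
```

A flat index `i` corresponds to row `i // W` and column `i % W`, which is the same flattening used when the feature map is reshaped into a 2D matrix for attention.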

[0601] ;

[0602] ;

[0603] ;

[0604] ;

[0605] ;

[0606] where the flattening operation converts a 3D feature map into a 2D matrix; the gathering operation extracts from the flattened features the feature vectors that correspond to the key points; the three projection matrices are learnable parameters of the global attention; and the shared output size is the feature dimension of the global attention;

[0607] The query, key, and value matrices are each divided into one part per attention head:

[0608] , ;

[0609] , ;

[0610] , ;

[0611] where the per-head size is the dimension of each attention head;

[0612] For each head, the sparse attention weights are calculated as follows:

[0613] ;

[0614] Feature aggregation:

[0615] ;

[0616] Multi-head fusion and output projection:

[0617] ;

[0618] where the remaining symbol is the output projection matrix;

[0619] (4) Sparse attention output:

[0620] Global feature reconstruction:

[0621] The key point features are placed back at their corresponding spatial locations according to their indices, producing a globally sparse enhanced feature:

[0622] ;

[0623] Shape restoration:

[0624] Restore the flattened global features to a 3D feature map with the original spatial dimensions:

[0625] ;

[0626] Step 3: Saliency-guided feature fusion

[0627] Adaptively fuse local attention results and global sparse attention results based on a saliency map:

[0628] ;

[0629] where the operator denotes element-wise multiplication (Hadamard product).

[0630] Step 4: Channel Attention Recalibration

[0631] Channel recalibration of the fused features is performed using a squeeze-and-excitation mechanism:

[0632] ;

[0633] ;

[0634] where the two weight matrices perform dimensionality reduction and dimensionality expansion, respectively, according to the compression ratio, and the remaining symbol is the activation function;

[0635] ;

[0636] where the operator denotes broadcast multiplication along the channel dimension;

[0637] To ensure training stability, a residual connection is added, and the final output is:

[0638] ;

[0639] The saliency-guided temporal fusion module (SG-TFM) provided in this embodiment of the invention:

[0640] Saliency-weighted feature compression;

[0641] Salient region partition;

[0642] For each input frame's feature map and its corresponding saliency map, the feature map is divided into high-saliency and low-saliency regions based on a learnable saliency threshold:

[0643] ;

[0644] ;

[0645] where the indicator function outputs 1 if the condition is met and 0 otherwise; the saliency threshold is learnable, with an initial value of 0.3; and the two resulting masks are the high-saliency mask and the low-saliency mask, respectively.

[0646] Region-specific feature extraction;

[0647] Different feature extraction strategies are used for different salient regions:

[0648] Feature extraction of high-saliency regions:

[0649] ;

[0650] ;

[0651] Feature extraction of low saliency regions:

[0652] ;

[0653] ;

[0654] where the operators denote, in order: element-wise multiplication (Hadamard product); global max pooling, which retains the most salient feature responses; and global average pooling, which captures the overall contextual information of the background.

[0655] Locally salient region enhancement;
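As a rough illustration of the region partition and region-specific pooling described above, the following sketch works on a single-channel feature map stored as a 2D list. The threshold default of 0.3 comes from the text; the use of `>=` in the indicator function, the fixed (non-learned) threshold, and the function name `region_pooling` are assumptions made for illustration.

```python
def region_pooling(feature, saliency, tau=0.3):
    """Split one channel by saliency and pool each region separately."""
    # High-saliency mask: 1 where saliency reaches the threshold (assumed >=).
    high_mask = [[1.0 if s >= tau else 0.0 for s in row] for row in saliency]
    # Low-saliency mask is the complement of the high-saliency mask.
    low_mask = [[1.0 - m for m in row] for row in high_mask]
    high_vals = [f * m for f_row, m_row in zip(feature, high_mask)
                 for f, m in zip(f_row, m_row)]
    low_vals = [f * m for f_row, m_row in zip(feature, low_mask)
                for f, m in zip(f_row, m_row)]
    # Global max pooling keeps the strongest target response.
    f_high = max(high_vals)
    # Global average pooling summarizes the background context.
    f_low = sum(low_vals) / len(low_vals)
    return f_high, f_low
```

In a real network the same operation would be applied per channel, and the threshold would be a trainable parameter.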

[0656] To preserve the spatial details of the target, local pooling is performed at the locations with the highest saliency:

[0657] Local area positioning:

[0658] ;

[0659] Local feature extraction:

[0660] ;

[0661] ;

[0662] The boundaries of the local region are calculated as follows, with the window size of the local pooling region set to 1/8 of the feature map size:

[0663] ;

[0664] ;

[0665] ;

[0666] ;

[0667] Adaptive weight fusion;

[0668] The fusion weights of features in each region are dynamically adjusted based on their saliency intensity.

[0669] First, the average saliency score is calculated from the saliency map of each frame:

[0670] ;

[0671] Then, based on the average saliency score, the final fusion of the high-saliency, low-saliency, and locally salient region features is calculated:

[0672] ;

[0673] where the weighting coefficient for the local features is set to 0.2;

[0674] Saliency-weighted temporal position encoding;

[0675] Basic position encoding generation;

[0676] For a time sequence of given length, the basic sinusoidal position encoding matrix is generated:

[0677] ;

[0678] ;

[0679] where the symbols denote, in order: the time step index; the dimension index; and the feature dimension;
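The equations for the basic sinusoidal position encoding did not survive extraction. The sketch below uses the widely known sine and cosine formulation with a frequency base of 10000; that base and the function name `sinusoidal_position_encoding` are assumptions, not values confirmed by the patent.

```python
import math


def sinusoidal_position_encoding(seq_len, d_model, base=10000.0):
    """Build a seq_len x d_model sinusoidal position encoding matrix."""
    pe = [[0.0] * d_model for _ in range(seq_len)]
    for t in range(seq_len):              # time step index
        for i in range(0, d_model, 2):    # even dimension index
            angle = t / (base ** (i / d_model))
            pe[t][i] = math.sin(angle)    # even dimensions use sine
            if i + 1 < d_model:
                pe[t][i + 1] = math.cos(angle)  # odd dimensions use cosine
    return pe
```

The saliency-dependent offset described next would then be added on top of this fixed matrix.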

[0680] Saliency encoding mapping;

[0681] The saliency intensity is mapped to an offset in the encoding space using a neural network, and the mapping function is as follows:

[0682] ;

[0683] where the symbols denote, in order: the first-layer weight matrix; the second-layer weight matrix; the first-layer bias; the second-layer bias; the average saliency score of the given frame; and the rectified linear unit (ReLU) activation function;

[0684] Adaptive fusion;

[0685] The basic positional encoding and saliency encoding are weighted and fused to obtain the final positional encoding vector:

[0686] ;

[0687] where the learnable fusion weight parameter has an initial value of 0.1;

[0688] Temporal feature enhancement;

[0689] The fused positional encoding is added to the compressed temporal features:

[0690] ;

[0691] where the remaining symbol is the compressed feature vector of the given frame;

[0692] Attention computation based on bias generation using multi-dimensional saliency features;

[0693] Multidimensional salient feature extraction;

[0694] For an input infrared image sequence of given length, each frame has a corresponding saliency map, from which the following four features are extracted:

[0695] The saliency intensity feature represents the overall saliency level of a given frame and is calculated using the following formula:

[0696] ;

[0697] The saliency consistency feature represents the saliency pattern similarity between the current frame and adjacent frames, calculated using cosine similarity.

[0698] ;

[0699] where the two saliency maps are those of the previous and next frames; for two saliency maps, the cosine similarity is calculated as:

[0700] ;

[0701] The saliency stability feature represents the similarity between the current frame and the average saliency pattern of the entire sequence;

[0702] First, calculate the average saliency map for the entire sequence:

[0703] ;

[0704] Then, calculate the cosine similarity between each frame's saliency map and the average saliency map:

[0705] ;
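A minimal sketch of the saliency stability feature follows, assuming each saliency map has been flattened into a list of floats. The stabilizing constant `eps` and the function names are hypothetical additions for illustration.

```python
import math


def cosine_similarity(a, b, eps=1e-8):
    """Cosine similarity between two flattened saliency maps."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + eps)  # eps guards against all-zero maps


def stability_features(saliency_maps):
    """Similarity of every frame to the sequence-average saliency map."""
    n = len(saliency_maps)
    size = len(saliency_maps[0])
    mean_map = [sum(m[i] for m in saliency_maps) / n for i in range(size)]
    return [cosine_similarity(m, mean_map) for m in saliency_maps]
```

The same `cosine_similarity` helper could serve the consistency feature, which compares a frame with its neighbors instead of with the sequence average.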

[0706] The time decay feature assigns higher weights to recent frames based on the frame's position in the sequence; for a given frame, the time decay feature is calculated as:

[0707] ;

[0708] where the symbols denote the index of the current frame and the total length of the sequence;

[0709] Attention bias matrix generation involves concatenating the four features into a feature vector, which is then mapped to an attention bias matrix via a neural network.

[0710] Feature concatenation: for each pair of frame positions, the concatenated 8-dimensional feature vector is:

[0711] ;

[0712] Bias matrix calculation:

[0713] Using a multilayer perceptron (MLP), the feature vector of each pair of frames is mapped to a bias value; specifically, each element of the attention bias matrix is calculated as follows:

[0714] ;

[0715] where the symbols denote, in order: the attention bias value from one frame to another; the weight matrices of the first layer, the second layer, and the output layer; and the biases of the first layer, the second layer, and the output layer.

[0716] Bias-enhanced multi-head attention calculation:

[0717] First, a bias scaling mechanism is designed; for each attention head, the scaled bias is:

[0718] ;

[0719] The scaling factor is calculated as follows:

[0720] ;

[0721] where the learnable parameter has an initial value of 0, and the hyperparameter is set to 2;

[0722] Then, obtain the query matrix, key matrix, and value matrix, and perform multi-head splitting:

[0723] For the feature sequence after fusion and position encoding, a query matrix, key matrix, and value matrix are generated through linear transformation:

[0724] , , ;

[0725] where the three projection matrices are learnable weight matrices;

[0726] The query, key, and value matrices are split into multiple attention heads:

[0727] , ;

[0728] , ;

[0729] , ;

[0730] where the remaining symbol is the dimension of each attention head;

[0731] For each attention head, the attention weights are calculated using the following formula:

[0732] ;

[0733] Weighted aggregation:

[0734] ;

[0735] Finally, multi-head fusion and output projection are performed:

[0736] ;

[0737] where the remaining symbol is the output projection weight matrix;

[0738] To maintain gradient flow and training stability, residual connections and layer normalization operations are performed:

[0739] ;

[0740] ;

[0741] The formula for the layer normalization function is as follows:

[0742] ;

[0743] where the first two symbols are the mean and standard deviation of the input, and the last two are the learnable scaling and offset parameters;
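The layer normalization step can be sketched for a single feature vector as follows. The small constant `eps` inside the square root is a common implementation detail assumed here; the source text does not state it.

```python
import math


def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize one feature vector, then apply learnable scale and offset."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    std = math.sqrt(var + eps)  # eps keeps the division stable (assumed)
    return [g * (v - mean) / std + b for v, g, b in zip(x, gamma, beta)]
```

With unit scale and zero offset the output has approximately zero mean and unit variance, which is what stabilizes the residual branches around the attention and feedforward blocks.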

[0744] To further improve model performance and enhance its nonlinear expressive power, a feedforward network and corresponding residual connections and layer normalization designs are added based on the above:

[0745] ;

[0746] where the first layer expands the feature dimension and the second layer compresses it back;

[0747] Residual connection and layer normalization:

[0748] ;

[0749] Temporal importance pooling guided by multi-head attention bias

[0750] Generation of temporal importance weights;

[0751] Bias aggregation generates initial frame-level scores:

[0752] The bias information from multiple heads and multiple query perspectives is aggregated into a single initial importance score vector representing the global importance of each frame;

[0753] The aggregate function formula is as follows:

[0754] ;

[0755] where the symbols denote, in order: the number of attention heads; the total number of frames in the sequence; the multi-head attention bias values; and the initial importance score of the given frame;

[0756] Nonlinear transformation and weight recalibration:

[0757] Based on the initial importance score vector, a lightweight, learnable transformation network is introduced to perform nonlinear transformations and weight recalibration, thereby increasing the model's expressive power and better adapting it to downstream detection tasks. The calculation formula is as follows:

[0758] ;

[0759] where the symbols denote, in order: the linear transformation weights of the first and second layers; the biases of the first and second layers; the dimension reduction ratio, set to 4; the non-linear activation function; and the transformed score vector;

[0760] Normalization yields the final fusion weights:

[0761] A normalization function is applied to the transformed scores along the time dimension, resulting in the final temporal fusion weight vector. The calculation formula is:

[0762] , ;

[0763] The weights satisfy the following constraints:

[0764] , ;

[0765] Time-weighted fusion:

[0766] The obtained weight vector is used to perform a weighted summation over the temporally enhanced feature vector sequence, generating the final fused feature vector used for the classification task. The calculation formula is as follows:

[0767] ;

[0768] This is the fusion feature vector that the saliency-guided temporal fusion module ultimately outputs for the classification task;
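The normalization and time-weighted fusion steps can be sketched as below. The name of the normalization function was lost in extraction; softmax is assumed here because it yields non-negative weights that sum to one, matching the stated constraints. Function names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over the time dimension (assumed choice)."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def temporal_fusion(features, scores):
    """Weighted sum of per-frame feature vectors using normalized scores."""
    weights = softmax(scores)
    dim = len(features[0])
    fused = [sum(w * f[d] for w, f in zip(weights, features))
             for d in range(dim)]
    return fused, weights
```

Frames with higher transformed scores contribute more to the fused vector, which replaces a fixed mean or max pooling over time.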

[0769] Meanwhile, the deep temporal context information extracted by the SG-TFM module is injected into the feature stream of the regression task efficiently and with little overhead, enabling the regression task to share and utilize the temporal dependencies and contextual understanding learned by the classification task, thereby improving the accuracy and stability of target localization.

[0770] Global temporal context vector extraction:

[0771] The temporally enhanced feature sequence is aggregated along the time dimension using global average pooling to extract a compact global temporal context vector:

[0772] ;

[0773] Modulation parameter generation:

[0774] The global temporal context vector is passed through a lightweight parameter generation network to generate the scale and offset parameters for feature modulation; the network consists of two fully connected layers with a rectified linear unit (ReLU) in between as the activation function:

[0775] ;

[0776] ;

[0777] where the symbols denote, in order: the first-layer weights and biases; the intermediate layer dimension (chosen to reduce the computational load); and the second-layer weights and biases. The output vector is split into two parts: the first part is used as the scale parameters and the second part as the offset parameters;

[0778] Temporal context-guided feature modulation:

[0779] Using the generated scale and offset parameters, an affine transformation is applied to the spatially enhanced features of each frame to generate the modulated features:

[0780] ;

[0781] where the modulation parameters are broadcast to the dimensions of the feature map, and the operator denotes channel-wise multiplication;

[0782] Temporal importance weighted fusion:

[0783] The modulated feature sequence is weighted and summed using the temporal importance weight vector to obtain the final fused feature vector used for the regression task:
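The parameter generation and channel-wise affine modulation can be sketched as follows. The layout of the feature map as nested lists `[channel][row][col]`, the row-major weight matrices, and all function names are assumptions for illustration only.

```python
def linear(x, weights, bias):
    """Fully connected layer: one weight row per output unit."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def modulation_params(context, w1, b1, w2, b2):
    """Two FC layers with ReLU in between, then split into scale and offset."""
    hidden = [max(0.0, h) for h in linear(context, w1, b1)]  # ReLU
    out = linear(hidden, w2, b2)
    channels = len(out) // 2
    return out[:channels], out[channels:]  # (scale, offset)


def modulate(feature, scale, offset):
    """Channel-wise affine transform of a [C][H][W] feature map."""
    return [[[scale[c] * v + offset[c] for v in row] for row in feature[c]]
            for c in range(len(feature))]
```

Because only two small vectors are produced per sequence, the temporal context reaches the regression branch at low computational cost.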

[0784] ;

[0785] This is the final output fused feature vector used for the regression task.

[0786] The multi-task loss function design provided in this embodiment of the invention;

[0787] The first term, spectrum-guided perceptual loss ( ):

[0788] To accurately guide the multi-scale adaptive high-frequency enhancement filter module to learn the expected image enhancement characteristics, and to prevent it from getting stuck in a suboptimal solution when gradients backpropagate through the lengthy detection network, a corresponding spectrum-guided perceptual loss is designed.

[0789] Step 1: High-frequency energy enhancement loss, designed to maximize the energy of specific high-frequency components in the output image that are relevant to small targets;

[0790] A high-frequency component extraction operator is defined, using a Laplacian convolution kernel as a high-pass filter to approximately extract the second-order gradient of the image; for any image, its high-frequency component is calculated as follows:

[0791] ;

[0792] where the symbols denote the convolution operation and the absolute value, which is used to obtain the intensity of the high-frequency response;

[0793] The high-frequency component contrast loss is defined as the difference in high-frequency energy between the output image and the input image across the entire image range, and its calculation formula is as follows:

[0794] ;

[0795] where the two images are the original input infrared image and the output image of the module;
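A minimal sketch of the high-frequency extraction and the enhancement loss follows. The specific 4-neighbor Laplacian kernel, the restriction to interior pixels (no padding), and the sign convention of the loss (input energy minus output energy, so that minimizing it raises output high-frequency energy) are all assumptions, since the original formulas were lost in extraction.

```python
# 4-neighbor Laplacian kernel (an assumed choice of high-pass filter).
LAPLACIAN = [[0, 1, 0],
             [1, -4, 1],
             [0, 1, 0]]


def high_frequency(image):
    """Absolute Laplacian response on interior pixels (no padding)."""
    height, width = len(image), len(image[0])
    response = []
    for r in range(1, height - 1):
        row = []
        for c in range(1, width - 1):
            acc = sum(LAPLACIAN[i][j] * image[r - 1 + i][c - 1 + j]
                      for i in range(3) for j in range(3))
            row.append(abs(acc))
        response.append(row)
    return response


def mean_energy(image):
    """Average high-frequency intensity over the image."""
    values = [v for row in high_frequency(image) for v in row]
    return sum(values) / len(values)


def hf_enhancement_loss(input_image, output_image):
    """Lower when the output carries more high-frequency energy (assumed sign)."""
    return mean_energy(input_image) - mean_energy(output_image)
```

Taken alone this term would reward arbitrary sharpening, which is why the text pairs it with a structural fidelity loss.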

[0796] Step 2: Multi-scale structural fidelity loss

[0797] To prevent the module's over-processing from damaging the image structure, distorting contrast, or introducing unnatural artifacts, the structural similarity between the output and input images is constrained at multiple scales. A multi-scale structural similarity index is used as the metric for the fidelity loss. The formula for calculating the multi-scale structural fidelity loss is as follows:

[0798] ;

[0799] where the scale level is indexed over a total of 5 scales, with a dedicated subscript indicating the coarsest scale; three exponents adjust the relative importance of each component. The luminance, contrast, and structure comparison functions are calculated as follows for the two images being compared, using their local means, standard deviations, and covariance, together with three small constants that avoid division by zero and keep the calculation stable;

[0800] ;

[0801] ;

[0802] ;

[0803] Step 3: Loss Integration

[0804] The spectrum-guided perceptual loss is a weighted sum of the two sub-losses mentioned above:

[0805] ;

[0806] where the two weights are adjustable hyperparameters that control the balance between high-frequency enhancement strength and structural fidelity strength, with initial values set to 0.7 and 0.3, respectively;

[0807] The second term, small-target-sensitive detection loss ( ):

[0808] To accurately optimize the infrared ship small target detection task and solve the core problems of traditional detection loss when dealing with small-scale targets, such as gradient instability, positive and negative sample imbalance, and inaccurate positioning, a corresponding small target sensitive detection loss is designed.

[0809] Step 1: Classification loss based on dynamic focal modulation and class balancing:

[0810] To address the imbalance between background pixels (negative samples) and target pixels (positive samples) in images, a classification loss based on dynamic focus modulation and class balancing is proposed. The optimization process is rebalanced through two levels of weight adjustment.

[0811] Dynamic category weight calculation:

[0812] Given the total number of samples in the batch, count the numbers of positive and negative samples in the current batch:

[0813] , ;

[0814] where the symbols denote the true label of each sample and the indicator function;

[0815] Dynamically calculate the class balance weights:

[0816] ;

[0817] ;

[0818] where the remaining symbol is a very small constant that maintains numerical stability;

[0819] Focal modulation factor design:

[0820] Define the probability that a sample is predicted correctly:

[0821] ;

[0822] where the remaining symbol is the model's predicted probability that the sample belongs to the positive class;

[0823] The focal modulation factor is defined in terms of an adjustable focusing parameter, which can be set to 2. The corrected weighted cross-entropy loss is:

[0824] ;

[0825] Complete loss function and numerical stability handling:

[0826] To maintain the stability of numerical calculations and avoid logarithmic overflow caused by probabilities approaching zero, the predicted probabilities are truncated in actual calculations:

[0827] ;

[0828] in, ;

[0829] When calculating the final loss, the truncated probability is used in place of the original one. The batch-averaged dynamic focal modulation classification loss is:

[0830] ;
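The classification loss above can be sketched as a class-balanced focal loss. The focusing parameter of 2 and the probability clipping come from the text; the exact class weight formula was lost in extraction, so the inverse-frequency weights used here (positive weight equal to the negative fraction, and vice versa) are an assumption, as is the function name.

```python
import math


def dynamic_focal_loss(probs, labels, gamma=2.0, eps=1e-7):
    """Batch-averaged focal loss with per-batch class balance weights."""
    n = len(labels)
    n_pos = sum(labels)
    n_neg = n - n_pos
    # Assumed weighting: the rarer class receives the larger weight.
    alpha_pos = n_neg / (n + eps)
    alpha_neg = n_pos / (n + eps)
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)    # clip to avoid log(0)
        p_t = p if y == 1 else 1.0 - p     # probability of the true class
        alpha = alpha_pos if y == 1 else alpha_neg
        total += -alpha * (1.0 - p_t) ** gamma * math.log(p_t)
    return total / n
```

The factor `(1 - p_t) ** gamma` shrinks the contribution of easy samples, so the many well-classified background pixels do not dominate the gradient.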

[0831] Step 2: Regression loss based on normalized Wasserstein distance :

[0832] To address the fundamental problems of traditional IoU and its variants in small target detection, such as gradient instability, oversensitivity to small offsets, and excessive scale dependence, we adopt Normalized Wasserstein Distance (NWD) as the core metric and loss function for bounding box regression. This redefines bounding box matching from the perspective of probability distribution similarity, providing a smooth, stable, and scale-invariant optimized gradient for infrared ship small target detection.

[0833] Gaussian distribution modeling of the bounding box:

[0834] Any bounding box, described by its center coordinates together with its width and height, is modeled as a two-dimensional Gaussian distribution:

[0835] The mean vector represents the central location of the distribution:

[0836] ;

[0837] The covariance matrix represents the extent to which the distribution is spread in the horizontal and vertical directions:

[0838] ;

[0839] Wasserstein distance calculation:

[0840] For two Gaussian distributions, the second-order Wasserstein distance has a closed-form solution.

[0841] Wasserstein distance squared decomposition:

[0842] ;

[0843] where the symbols denote the Euclidean norm and the trace of a matrix;

[0844] Simplification of the diagonal covariance matrix:

[0845] Since both covariance matrices are diagonal, the above equation can be simplified. Let:

[0846] , ;

[0847] in, , , , ;

[0848] then:

[0849] ;

[0850] Substituting the specific parameters yields the final expression:

[0851] ;

[0852] Normalized Wasserstein distance calculation:

[0853] The Wasserstein distance is mapped to the [0, 1] interval by exponential normalization, with a small constant added to avoid numerical issues:

[0854] ;

[0855] where the normalization constant can be taken as the average diagonal length of the target boxes in the dataset;

[0856] Constructing the regression loss function:

[0857] ;
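The regression loss can be sketched as below, using the commonly published closed form for axis-aligned boxes modeled as Gaussians with mean at the box center and covariance diag(w²/4, h²/4). Placing the small constant inside the square root and defining the loss as one minus the normalized distance are assumptions; boxes are `(cx, cy, w, h)` tuples and the function names are hypothetical.

```python
import math


def wasserstein_sq(box_a, box_b):
    """Squared 2-Wasserstein distance between two box Gaussians."""
    cxa, cya, wa, ha = box_a
    cxb, cyb, wb, hb = box_b
    return ((cxa - cxb) ** 2 + (cya - cyb) ** 2
            + ((wa - wb) / 2.0) ** 2 + ((ha - hb) / 2.0) ** 2)


def nwd(box_a, box_b, norm_const, eps=1e-7):
    """Normalized Wasserstein distance mapped into (0, 1]."""
    distance = math.sqrt(wasserstein_sq(box_a, box_b) + eps)  # eps placement assumed
    return math.exp(-distance / norm_const)


def nwd_loss(pred_box, true_box, norm_const):
    """Regression loss: small when the two boxes match closely (assumed form)."""
    return 1.0 - nwd(pred_box, true_box, norm_const)
```

Unlike IoU, this measure changes smoothly even when two tiny boxes do not overlap at all, which is the property the text relies on for small targets.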

[0858] Step 3: Loss Integration

[0859] The detection loss for small targets is a weighted sum of the two sub-losses mentioned above:

[0860] ;

[0861] where the two weights are adjustable hyperparameters that control the balance between the classification and regression tasks, with initial values set to 1.0 and 2.0, respectively.

[0862] The third term, saliency contrast guidance loss ( ):

[0863] This loss ensures that the saliency map generated by the SG-DAFE module accurately focuses on the real target area, enhances the distinguishability between the target and the background, and prevents attention from being distracted by background noise;

[0864] Step 1: Region Mask Generation

[0865] First, a binary region mask is generated based on the true annotations of the current training sample; for the bounding box of each real target, the corresponding target region mask is:

[0866] ;

[0867] The background sampling region consists of two parts: the first is a ring-shaped region around each target bounding box, ranging from 0 to 2 times the target size; the second is the region of the entire image that does not overlap with any target bounding box. Background region blocks, each with an area equivalent to the target box, are randomly sampled from the total background sampling region. The mask for each background region is:

[0868] ;

[0869] The background mask for the entire image is:

[0870] ;

[0871] Step 2: Calculation of regional significance statistic:

[0872] The saliency map generated by the SG-DAFE module has a value range of [range missing];

[0873] For each target, the mean saliency of the target region is:

[0874] ;

[0875] For the current image, the mean saliency of the background region is:

[0876] ;

[0877] where the two normalizing terms are the total numbers of non-zero pixels within the respective masks;

[0878] Step 3: Saliency contrast guidance loss formula:

[0879] A margin-based contrastive loss form is employed; for an image containing a number of real targets, the loss is calculated as follows:

[0880] ;

[0881] where the preset margin hyperparameter is initially set to 0.4, indicating that the mean saliency of the target region should exceed that of the background region by at least this margin; the remaining symbol denotes the saliency difference between a given target and the background;
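A minimal sketch of the margin-based contrast loss for one target follows. The margin of 0.4 comes from the text; the hinge form `max(0, margin - difference)`, the binary 2D list masks, and the function names are assumptions for illustration.

```python
def masked_mean(saliency, mask):
    """Mean saliency over the non-zero pixels of a binary mask."""
    total = sum(s * m for s_row, m_row in zip(saliency, mask)
                for s, m in zip(s_row, m_row))
    count = sum(m for row in mask for m in row)
    return total / max(count, 1)  # guard against an empty mask


def saliency_contrast_loss(saliency, target_mask, background_mask, margin=0.4):
    """Hinge loss: zero once target saliency beats background by the margin."""
    difference = (masked_mean(saliency, target_mask)
                  - masked_mean(saliency, background_mask))
    return max(0.0, margin - difference)
```

For an image with several targets, the per-target values would be averaged; that aggregation is not shown here.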

[0882] The fourth term is time-aware adaptive loss ( ):

[0883] To enhance the temporal modeling capability of the SG-TFM module in dynamically changing environments, loss terms are constructed from three dimensions: scale change coherence, motion invariance features, and temporal saliency focus, to ensure stable and reliable detection performance in complex sequences in which the sensor approaches the target.

[0884] Step 1: Scale change coherence loss ( ):

[0885] For each frame, given the predicted target box size, the pixel area of the target is calculated as:

[0886] ;

[0887] For a uniform or uniformly accelerated approach, the rate of change of the target area should vary smoothly; the loss therefore constrains the difference in area change rates between adjacent frames. The loss function is:

[0888] ;

[0889] where the remaining symbol is a numerical stability constant;

[0890] Step 2: Motion Invariance Feature Loss ( );

[0891] Calculation of the feature relation matrix:

[0892] Given the feature sequence enhanced by the SG-TFM module, a relation matrix is constructed whose elements are the cosine similarities between the features of each pair of frames:

[0893] ;

[0894] where the remaining symbol is a numerical stability constant;

[0895] Parameterization of affine transformation:

[0896] Using affine transformations to simulate global motion:

[0897] ;

[0898] The parameterization form is as follows:

[0899] , ;

[0900] where the symbols denote, in order: the field-of-view scaling factors in the two axis directions; the counterclockwise rotation angle of the image around the origin; the horizontal and vertical shear deformation parameters; and the horizontal and vertical image translations; each parameter has its own preset value range;

[0901] Loss function calculation:

[0902] A random affine transformation is applied to the original feature sequence to obtain an augmented feature sequence; relation matrices are computed for both the original and the augmented feature sequences, and the Frobenius norm of their difference gives the motion invariance feature loss, calculated as follows:

[0903] ;

[0904] Step 3: Temporal saliency focusing loss ( );

[0905] Saliency centroid alignment loss:

[0906] For each frame, consider its saliency map together with the size and center coordinates of the target's true bounding box;

[0907] The centroid coordinates of the saliency map are calculated as follows:

[0908] ;

[0909] ;

[0910] where the remaining symbol is a small constant that ensures stability in numerical calculations;

[0911] The centroid alignment loss is:

[0912] ;
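The saliency centroid and an alignment loss can be sketched as below. The centroid is the saliency-weighted mean of pixel coordinates with a small stabilizing constant, as described in the text; normalizing the centroid-to-center distance by the box diagonal is an assumption, since the original loss formula was lost in extraction.

```python
import math


def saliency_centroid(saliency, eps=1e-8):
    """Saliency-weighted mean (x, y) position of a 2D saliency map."""
    total = sum(s for row in saliency for s in row) + eps
    cx = sum(c * s for row in saliency for c, s in enumerate(row)) / total
    cy = sum(r * s for r, row in enumerate(saliency) for s in row) / total
    return cx, cy


def centroid_alignment_loss(saliency, box_center, box_size):
    """Distance from saliency centroid to box center, scaled by box diagonal."""
    cx, cy = saliency_centroid(saliency)
    diagonal = math.hypot(box_size[0], box_size[1])  # assumed normalizer
    return math.hypot(cx - box_center[0], cy - box_center[1]) / diagonal
```

Scaling by the box diagonal keeps the penalty comparable across targets of different sizes.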

[0913] Saliency distribution compactness loss:

[0914] Ideally, the saliency distribution should be compactly concentrated in the key region rather than uniformly dispersed; constraining the second moment (variance) of the saliency values forces the salient response to concentrate, thus improving the signal-to-noise ratio;

[0915] The second central moment (variance) of the saliency distribution is calculated as follows:

[0916] ;

[0917] Given the diagonal length of the target's true bounding box, the upper bound of the desired compactness is:

[0918] ;

[0919] The compactness loss is:

[0920] ;

[0921] where the relaxation coefficient can be set to 0.1, indicating that the compactness is allowed to slightly exceed the upper bound; the normalization factor is:

[0922] ;

[0923] The temporal saliency focusing loss is:

[0924] ;

[0925] where the two weighting coefficients are set to 0.6 and 0.4, respectively;

[0926] Step 4: Time-Aware Adaptive Loss Integration

[0927] The total loss is calculated as follows:

[0928] ;

[0929] where the three weighting coefficients are set to 0.4, 0.4, and 0.2, respectively;

[0930] The fifth term, spatiotemporal feature alignment loss ( );

[0931] The SG-DAFE module and the SG-TFM module extract and enhance target features from the spatial and temporal dimensions, respectively. However, the two modules differ in their architecture design, optimization objectives, and feature processing procedures. Therefore, a spatiotemporal feature alignment loss is designed.

[0932] For a video sequence of given length and for each frame, consider the output feature map and saliency map of the SG-DAFE module and the output regression feature map of the SG-TFM module;

[0933] First, L2 normalization is performed on the features along the channel dimension to ensure scale invariance for feature comparison:

[0934] , ;

[0935] where the symbols denote, in order: the feature vector output by the SG-DAFE module at a given position; the feature vector output by the SG-TFM module at that position; and a small constant that ensures stability in numerical calculations;

[0936] For each spatial location, the cosine similarity of the normalized feature vectors is calculated and weighted by the saliency value:

[0937] ;

[0938] where the operator denotes the dot product of vectors;

[0939] Calculate the saliency-weighted average similarity of each frame:

[0940] ;

[0941] The spatiotemporal feature alignment loss is obtained by averaging over all frames:

[0942] ;

[0943] The sixth term, total loss function ( ):

[0944] The total loss function is designed as follows:

[0945] ;

[0946] where the coefficients are the dynamic weight coefficients of the respective loss terms, each a function of the number of training steps;

[0947] Given the total number of training steps, the dynamic weight formula is:

[0948] ;

[0949] where the base weights are learnable parameters, with initial values set to 1.0, 2.0, 0.5, 0.8, and 0.3, respectively.

[0950] The progressive adaptive training method provided in this embodiment of the invention:

[0951] To address the inherent challenges of scarce and difficult-to-annotate infrared small target data for ships, and to ensure that the multi-scale adaptive high-frequency boosting filter module (MAHF), saliency-guided dual-path attention feature enhancement module (SG-DAFE), and saliency-guided temporal fusion module (SG-TFM) described in this invention can be sufficiently and effectively trained, this invention proposes a three-stage progressive training paradigm. This paradigm achieves a stable and efficient conversion from a general visual base model to a dedicated infrared small target detector by combining domain-specific knowledge transfer and hierarchical optimization scheduling.

[0952] Phase One: Building the General Visual Representation Foundation

[0953] Objective: To acquire basic perception capabilities of object edges, shapes, textures, and contextual relationships, providing a high-quality input feature base for the subsequent dedicated processing modules of this invention;

[0954] Operation: The backbone network is pre-trained on a large-scale visible light image classification dataset; then, the complete detection architecture is trained on a general object detection dataset to learn preliminary object localization and classification knowledge.

[0955] Output: Obtain a generalized visual detector with strong generalization capabilities, whose parameters will serve as the starting point for initialization of all subsequent stages;

[0956] Phase Two: Infrared Imaging Domain Adaptation and Dedicated Module Initialization

[0957] Objective: To bridge the domain distribution differences between visible light and infrared imaging, and to initialize and warm up the core enhancement module of this invention so that it can initially adapt to the statistical characteristics of infrared data;

[0958] Operation:

[0959] ①Data: Large-scale publicly available infrared thermal imaging scene datasets;

[0960] ② Network preparation: Load the model parameters from Phase 1; introduce and initialize the MAHF, SG-DAFE, and SG-TFM modules, and integrate them into the network front-end and feature paths;

[0961] ③ Differentiation optimization strategy:

[0962] Low-level general feature protection: An extremely low learning rate (≤1e-5), or only a small learning rate fluctuation, is applied to the front-end layers of the backbone network (responsible for extracting low-level features) to essentially freeze the general filters learned from visible light and avoid interference from infrared domain noise.

[0963] Mid-level semantic feature adaptation: A moderate learning rate (approximately 1e-4) is applied to the backend of the backbone network to allow it to adjust the feature combination method to better represent the contrast relationship between thermal radiation targets and the background in infrared images;

[0964] Rapid warm-up of dedicated modules: For the newly added modules of this invention, such as MAHF, SG-DAFE, and SG-TFM, the highest global learning rate (approximately 1e-3) is adopted. Utilizing the relatively abundant infrared data at this stage, these modules are driven to quickly learn their core functions—MAHF learns how to enhance the weak high-frequency signals of small infrared targets; SG-DAFE learns how to focus on potential target areas in infrared scenes; and SG-TFM learns the temporal correlation patterns between infrared sequences.
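The differentiated optimization strategy of Phase Two can be sketched as a simple mapping from parameter names to learning rates. The three rate levels (1e-5, 1e-4, 1e-3) come from the text; the parameter name prefixes (`backbone.early.`, `backbone.late.`, `mahf.`, `sg_dafe.`, `sg_tfm.`) and the fallback rate for any other parameter are hypothetical and would depend on how the actual network is organized.

```python
# Prefixes of the newly added modules (hypothetical naming scheme).
NEW_MODULE_PREFIXES = ("mahf.", "sg_dafe.", "sg_tfm.")


def phase_two_learning_rate(param_name):
    """Pick a learning rate for one parameter according to its network role."""
    if param_name.startswith(NEW_MODULE_PREFIXES):
        return 1e-3   # rapid warm-up of the dedicated modules
    if param_name.startswith("backbone.early."):
        return 1e-5   # near-frozen low-level filters
    if param_name.startswith("backbone.late."):
        return 1e-4   # moderate adaptation of mid-level semantics
    return 1e-4       # assumed default for anything else, e.g. task heads


def build_param_groups(param_names):
    """Group parameter names by learning rate, as an optimizer would expect."""
    groups = {}
    for name in param_names:
        groups.setdefault(phase_two_learning_rate(name), []).append(name)
    return groups
```

Most deep learning frameworks accept per-group learning rates, so a grouping like this could be passed to the optimizer after replacing names with the actual parameter tensors.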

[0965] Phase Three: Specialized Fine-tuning of Small Ship Targets and Multi-Task Collaboration

[0966] Objective: To complete the final optimization of the model based on scarce real infrared ship small target data, with a focus on enhancing the collaborative enhancement capabilities of each patented module for small target features;

[0967] Operation:

[0968] ① Data: Annotated real infrared ship sequence data (including a large number of small target samples);

[0969] ② Full unfreezing and fine-tuning: All network parameters are now trainable; more refined hierarchical learning rate scheduling is implemented.

[0970] Base layer: Maintain a near-frozen learning rate to solidify the visual foundation;

[0971] Semantic layer: Applying a targeted learning rate, focusing on modeling the specific semantic structure of "sea surface-sky-ship" and multi-scale representation of small targets;

[0972] Task head and patented modules: Apply significantly higher learning rates, especially to the SG-DAFE and SG-TFM modules. The high learning rate drives their rapid optimization, enabling their attention mechanisms to accurately lock onto small ship targets and achieve robust temporal tracking.

[0973] ③ Multi-task loss joint optimization: Fully activate and apply the multi-task loss function designed in this invention to guide the MAHF module, SG-DAFE module, SG-TFM module and detection task head to perform deep collaboration, ensuring the consistency of feature enhancement, attention focus, temporal fusion and the final detection target.

[0974] As shown in Figure 3, an infrared ship small target detection system in video stream mode based on multi-frame fusion, provided by an embodiment of the present invention, includes:

[0975] The multi-scale adaptive high-frequency boosting filter module is used to separate the internal texture details and contour edge information of infrared small targets through filters of different scales; learnable weight parameters are introduced to achieve intelligent fusion of multi-scale features under different scenarios; based on standard convolution operations, the module can be efficiently integrated into real-time detection systems.
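As a rough illustration of the filtering idea in this module, the sketch below subtracts 3×3 and 5×5 mean-filtered copies from the input to obtain small-scale and mesoscale high-frequency components and adds them back with weights. The exact formulas are not reproduced in this section, so the form `gamma*I + alpha*(I - I*K3) + beta*(I - I*K5)`, the edge-replication padding, the default weight values, and the function names are assumptions made for illustration; in the described method the weights would be learnable parameters trained end to end rather than fixed constants.

```python
def mean_filter(img, k):
    """k x k mean filter with edge replication; img is a list of rows."""
    h, w, r = len(img), len(img[0]), k // 2
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            total = 0.0
            for dy in range(-r, r + 1):
                for dx in range(-r, r + 1):
                    yy = min(max(y + dy, 0), h - 1)  # clamp to image border
                    xx = min(max(x + dx, 0), w - 1)
                    total += img[yy][xx]
            out[y][x] = total / (k * k)
    return out


def high_freq_boost(img, alpha=1.0, beta=0.5, gamma=1.0):
    """Assumed form: gamma*I + alpha*(I - I*K3) + beta*(I - I*K5)."""
    m3, m5 = mean_filter(img, 3), mean_filter(img, 5)
    h, w = len(img), len(img[0])
    return [
        [
            gamma * img[y][x]
            + alpha * (img[y][x] - m3[y][x])  # small-scale detail component
            + beta * (img[y][x] - m5[y][x])   # mesoscale edge component
            for x in range(w)
        ]
        for y in range(h)
    ]
```

Under these assumptions a uniform background passes through unchanged when `gamma` is 1, while an isolated bright pixel is amplified, which is the behavior wanted for dim point-like targets.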

[0976] The saliency-guided dual-path attention feature enhancement module uses local windows and sparse sampling mechanisms to reduce computational complexity; a guided fusion mechanism ensures that small target features are given priority enhancement; and the contribution ratio of local and global attention is automatically adjusted according to the input features.
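The sparse sampling idea can be sketched as follows: only the K most salient positions act as queries, while every position serves as key and value, so the cost scales with K·N rather than N². This is a simplified, single-head illustration under stated assumptions; the learnable query/key/value projections, the multi-head split, and the saliency network itself are omitted, and the function names are hypothetical.

```python
import math


def top_k_indices(saliency, k):
    """Indices of the k largest saliency scores over flattened positions."""
    order = sorted(range(len(saliency)), key=lambda i: saliency[i], reverse=True)
    return sorted(order[:k])


def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sparse_attention(features, saliency, k):
    """Attend from the k most salient positions to all positions.

    features: list of N feature vectors (each a list of d floats).
    Returns {position index: attended feature vector}.
    """
    d = len(features[0])
    scale = 1.0 / math.sqrt(d)
    out = {}
    for i in top_k_indices(saliency, k):
        q = features[i]
        # Scaled dot-product score between the keypoint and every position.
        scores = [scale * sum(a * b for a, b in zip(q, kv)) for kv in features]
        weights = softmax(scores)
        out[i] = [sum(w * kv[c] for w, kv in zip(weights, features)) for c in range(d)]
    return out
```

The returned dictionary holds updated features only at the selected keypoints; writing them back to their spatial locations and fusing them with the local-window branch are separate steps not shown here.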

[0977] The saliency-guided temporal fusion module applies different pooling strategies to high and low saliency regions using "saliency-weighted feature compression," compressing background information while preserving target details, thus optimizing the computational cost and feature quality of temporal modeling. By encoding multi-dimensional saliency information into attention biases and using them as dynamic adjustment factors for temporal position encoding and attention computation, the fusion process can automatically focus on key frames and key information. Frame-level importance weights are derived in reverse using the bias matrix naturally generated in multi-head attention computation, replacing the traditional fixed pooling strategy and achieving dynamic temporal fusion that is highly relevant to the content.
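Two of the ideas in this module can be sketched in a few lines. The first function switches the pooling strategy per 2×2 block according to mean saliency; the second turns per-head, per-frame attention biases into frame-level fusion weights. The specific rule (max pooling for salient blocks, average pooling for background), the 2×2 block size, the threshold value, the head-averaged softmax, and all function names are assumptions chosen for illustration, since this section does not give the exact operations.

```python
import math


def saliency_weighted_pool(feat, sal, threshold=0.5):
    """2x2 pooling that switches strategy per block (an assumed rule).

    feat, sal: 2D lists of equal shape with even height and width.
    Blocks with mean saliency >= threshold use max pooling (keeps target
    peaks); the remaining blocks use average pooling (smooths background).
    """
    h, w = len(feat), len(feat[0])
    pooled = []
    for y in range(0, h, 2):
        row = []
        for x in range(0, w, 2):
            block = [feat[y + dy][x + dx] for dy in (0, 1) for dx in (0, 1)]
            s_mean = sum(sal[y + dy][x + dx] for dy in (0, 1) for dx in (0, 1)) / 4.0
            row.append(max(block) if s_mean >= threshold else sum(block) / 4.0)
        pooled.append(row)
    return pooled


def frame_weights(bias):
    """bias[h][t]: attention bias of head h toward frame t.

    Averages over heads and applies a softmax over frames (assumed rule).
    """
    n_heads, n_frames = len(bias), len(bias[0])
    mean_bias = [sum(bias[h][t] for h in range(n_heads)) / n_heads for t in range(n_frames)]
    m = max(mean_bias)
    exps = [math.exp(b - m) for b in mean_bias]
    total = sum(exps)
    return [e / total for e in exps]
```

The resulting frame weights sum to one and could replace a fixed temporal average when combining per-frame features, so that frames with stronger saliency-derived bias contribute more to the fused result.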

[0978] The function design module is used for designing multi-task loss functions;

[0979] The adaptive training module is used for progressive adaptive training.

[0980] Another object of the present invention is to provide a computer device including a memory and a processor, the memory storing a computer program, which, when executed by the processor, causes the processor to perform the steps of the infrared ship small target detection method based on multi-frame fusion video stream mode.

[0981] Another object of the present invention is to provide a computer-readable storage medium storing a computer program, which, when executed by a processor, causes the processor to perform the steps of the infrared ship small target detection method based on multi-frame fusion video stream mode.

[0982] Another objective of this invention is to provide an information data processing terminal for implementing the infrared ship small target detection system based on multi-frame fusion video stream mode.

[0983] It should be noted that embodiments of the present invention can be implemented in hardware, software, or a combination of both. The hardware portion can be implemented using dedicated logic; the software portion can be stored in memory and executed by a suitable instruction execution system, such as a microprocessor or dedicated-design hardware. Those skilled in the art will understand that the above-described devices and methods can be implemented using computer-executable instructions and / or included in processor control code, for example, such code provided on a carrier medium such as a disk, CD, or DVD-ROM, a programmable memory such as read-only memory (firmware), or a data carrier such as an optical or electronic signal carrier. The devices and modules of the present invention can be implemented by hardware circuitry such as very large-scale integrated circuits or gate arrays, semiconductors such as logic chips, transistors, or programmable hardware devices such as field-programmable gate arrays, programmable logic devices, etc., or by software executed by various types of processors, or by a combination of the above-described hardware circuitry and software, such as firmware.

[0984] The above description is merely a specific embodiment of the present invention, but the scope of protection of the present invention is not limited thereto. Any modifications, equivalent substitutions, and improvements made by those skilled in the art within the scope of the technology disclosed in the present invention, and within the spirit and principles of the present invention, should be covered within the scope of protection of the present invention.

Claims

1. A method for detecting small infrared targets on ships in a video stream mode based on multi-frame fusion, characterized in that it includes the following steps:

Step 1: A multi-scale adaptive high-frequency boosting filter module separates the internal texture details and contour edge information of infrared small targets through filters of different scales; learnable weight parameters are introduced to intelligently fuse multi-scale features in different scenarios; being based on standard convolution operations, the module can be efficiently integrated into a real-time detection system.

Step 2: A saliency-guided dual-path attention feature enhancement module uses local windows and sparse sampling mechanisms to reduce computational complexity; the guided fusion mechanism ensures that small target features are enhanced in a focused manner; and the contribution ratio of local and global attention is automatically adjusted according to the input features.

Step 3: A saliency-guided temporal fusion module uses "saliency-weighted feature compression" to apply different pooling strategies to high and low saliency regions, compressing background information while preserving target details, thereby optimizing the computational cost and feature quality of temporal modeling. By encoding multi-dimensional saliency information into attention biases and using them as dynamic adjustment factors for temporal position encoding and attention computation, the fusion process can automatically focus on key frames and key information. Frame-level importance weights are derived in reverse from the bias matrix naturally generated in multi-head attention computation.

Step 4: Design the multi-task loss function.

Step 5: Perform progressive adaptive training.

2. The infrared ship small target detection method based on multi-frame fusion video stream mode as described in claim 1, characterized in that the multi-scale adaptive high-frequency boosting filter module operates as follows:

Step 1: Multi-scale feature separation. Small-scale detail component extraction: […]; mesoscale edge component extraction: […]; where […] represents the input infrared image; […] and […] represent the 3×3 and 5×5 mean filter kernels; and […] represents the convolution operator.

Step 2: Adaptive feature fusion. Establish a weighted fusion model: […]; where […] and […] control the enhancement intensities of the small-scale details and the mesoscale edges respectively and are learnable parameters satisfying the adaptive optimization conditions: […]; […]; where […] is the loss function of the downstream detection task.

Step 3: High-frequency enhancement. Construct the complete high-frequency boosting formula: […]; where […] is a learnable parameter that controls the overall contrast enhancement intensity of the original input infrared image, satisfying the adaptive optimization condition: […]. The complete expression after expansion is: […]. The parameters […], […], and […] are learned adaptively through end-to-end training, and the initialization strategy is designed as follows: […]; […]; […].

3. The infrared ship small target detection method based on multi-frame fusion video stream mode as described in claim 1, characterized in that the dual-path attention feature enhancement module operates as follows: Step 1: local window attention calculation; Step 2: global sparse attention calculation; Step 3: saliency-guided feature fusion.

4. The infrared ship small target detection method based on multi-frame fusion video stream mode as described in claim 3, characterized in that Step 1, local window attention calculation, is as follows: the input feature map […] is divided into non-overlapping […] windows, and self-attention is calculated within each window:

(1) Window division operation: […]; […]; where […] is the window size and […] is the number of windows.

(2) Calculation of self-attention within the window: each window […] is flattened from a 3D feature map into a 2D matrix […]. The query, key, and value matrices are calculated and partitioned into multiple heads: […], […], […]; where […], […], […] are learnable projection weights and […] is the attention head dimension. […], […], […] are split into […] heads: […], […]; […], […]; […], […]; where […] is the dimension of each attention head.

Relative position bias calculation: first, construct a relative position bias table. Define a relative position bias table […] that covers all possible relative position offsets within the window; where […] is the window size and […] is the number of attention heads.

Relative position index: for any two positions […] and […] within the window, the relative position offset is: […], […]; where […], […] are the coordinates of a pixel within the window, ranging from […] to […]. The two-dimensional offsets are mapped to a one-dimensional index: […]. This mapping yields the relative position index matrix for all position pairs within the window: […].

Window relative position bias matrix generation: first, the bias table […] is reshaped into a two-dimensional matrix […]: […]. Then, based on the relative position index matrix […], the corresponding bias values are extracted from the bias table […] to construct the window relative position bias matrix: […].

For the […]-th attention head, the attention weight within a window is calculated as follows: […]; where […] is the relative position bias of the […]-th attention head and […] is the attention head dimension. Feature aggregation: […]. Multi-head fusion and output projection: […]; where […] is the output projection matrix.

(3) Window reorganization: the window attention results are reconstructed into a complete feature map: […].

5. The infrared ship small target detection method based on multi-frame fusion video stream mode as described in claim 3, characterized in that Step 2, global sparse attention calculation, is as follows:

(1) Saliency map generation. Feature compression and conversion: […]; where […] are learnable 1×1 convolution kernel weights; […] is a learnable bias; […] is the number of intermediate channels; […] is the compression ratio; and […] denotes the activation function. Channel attention calculation: […]; […]; […]; where […] is the channel descriptor vector; […] are the learnable weights of the fully connected layers; […] are the learnable biases of the fully connected layers; […] is the […] function; […] are the channel attention weights; […] denotes broadcast multiplication along the channel dimension; and […] denotes the attention-weighted feature map. Spatial saliency generation: […]; […]; where […] are learnable 1×1 convolution kernel weights; […] is a learnable bias term; […] is the raw saliency score; and […] is the final saliency map.

(2) Keypoint selection: based on the saliency map […], select the top […] most salient positions: […]; where […] is the saliency threshold used to screen the top […] most salient points.

(3) Sparse attention computation: attention is calculated only between the keypoints and all positions across the entire image: […]; […]; […]; […]; […]; where […] denotes flattening a 3D feature map into a 2D matrix; […] denotes the feature vectors of the keypoints extracted from the flattened features; […], […], […] are the learnable projection matrices for global attention; and […] is the feature dimension of global attention. […], […], […] are split into […] heads: […], […]; […], […]; […], […]; where […] is the dimension of each attention head. For the […]-th attention head, the sparse attention weights are calculated as follows: […]. Feature aggregation: […]. Multi-head fusion and output projection: […]; where […] is the output projection matrix.

(4) Sparse attention output. Global feature reconstruction: the keypoint features […] are placed back at their corresponding spatial locations according to the index […], forming a global sparse enhancement feature: […]. Shape restoration: the flattened global features are restored to a 3D feature map with the original spatial dimensions: […].

6. The infrared ship small target detection method based on multi-frame fusion video stream mode as described in claim 3, characterized in that:

Step 3, saliency-guided feature fusion: the local attention results and the global sparse attention results are adaptively fused based on the saliency map: […]; where […] denotes element-wise multiplication.

Step 4, channel attention recalibration: channel recalibration of the fused features is performed using a squeeze-and-excitation mechanism: […]; […]; where […] and […] are the weight matrices for dimensionality reduction and dimensionality increase, respectively; […] is the compression ratio; and […] is the activation function; […]; where […] denotes broadcast multiplication along the channel dimension. To ensure training stability, a residual connection is added, and the final output […] is: […].

7. An infrared ship small target detection system based on multi-frame fusion in a video stream mode, implementing the method as described in any one of claims 1-6, characterized in that the system includes:

The multi-scale adaptive high-frequency boosting filter module, which is used to separate the internal texture details and contour edge information of infrared small targets through filters of different scales; learnable weight parameters are introduced to achieve intelligent fusion of multi-scale features under different scenarios; being based on standard convolution operations, the module can be efficiently integrated into real-time detection systems.

The dual-path attention feature enhancement module, which uses local windows and sparse sampling mechanisms to reduce computational complexity; a guided fusion mechanism ensures that small target features are given priority enhancement; and the contribution ratio of local and global attention is automatically adjusted according to the input features.

The saliency-guided temporal fusion module, which uses "saliency-weighted feature compression" to apply different pooling strategies to high and low saliency regions, compressing background information while preserving target details, thus optimizing the computational overhead and feature quality of temporal modeling. By encoding multi-dimensional saliency information into attention biases and using them as dynamic adjustment factors for temporal position encoding and attention computation, the fusion process can automatically focus on key frames and key information. The module also uses the bias matrix naturally generated in multi-head attention computation to derive frame-level importance weights, replacing the traditional fixed pooling strategy and achieving dynamic temporal fusion that is highly relevant to the content.

The function design module, which is used for designing multi-task loss functions.

The adaptive training module, which is used for progressive adaptive training.

8. A computer device, characterized in that the computer device includes a memory and a processor. The memory stores a computer program, which, when executed by the processor, causes the processor to perform the steps of the infrared ship small target detection method based on multi-frame fusion video stream mode as described in any one of claims 1-6.

9. A computer-readable storage medium storing a computer program, which, when executed by a processor, causes the processor to perform the steps of the infrared ship small target detection method based on multi-frame fusion in a video stream mode as described in any one of claims 1-6.

10. An information data processing terminal, characterized in that the information data processing terminal is used to implement the infrared ship small target detection system based on multi-frame fusion video stream mode as described in claim 7.