Multi-frame infrared small target detection method based on attention fusion encoder-decoder network

By designing a multi-frame infrared small target detection method based on attention fusion encoder-decoder network, the problems of insufficient utilization of temporal features and excessive computational resources are solved, achieving efficient infrared small target detection and improving detection accuracy and detailed feature extraction capabilities.

CN122244526A · Pending · Publication Date: 2026-06-19 · NANJING UNIV OF SCI & TECH +1

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
NANJING UNIV OF SCI & TECH
Filing Date
2026-03-23
Publication Date
2026-06-19

AI Technical Summary

Technical Problem

Existing multi-frame infrared small target detection algorithms suffer from insufficient utilization of temporal features, inadequate extraction of target detail features, and high computational resource requirements, making it difficult to effectively suppress background interference and improve detection performance.

Method used

The design incorporates an attention-based fusion encoder-decoder network, including a motion enhancement module based on inter-frame difference, an attention fusion module based on convolutional long short-term memory network, and a spatial feature reconstruction module based on gradient information. Through multi-scale feature fusion and gradient information constraints, the target feature extraction capability is improved.

Benefits of technology

It significantly improves the performance of infrared small target detection, balances detection accuracy with computational resource consumption, and achieves accurate identification of moving targets and preservation of detailed features.

✦ Generated by Eureka AI based on patent content.

Smart Images

  • Figure CN122244526A_ABST
Patent Text Reader

Abstract

This invention discloses a multi-frame infrared small target detection method based on an attention fusion encoder-decoder network, comprising the following steps: acquiring paired infrared small target images and ground truth image datasets; constructing a U-shaped encoder-decoder network model based on attention fusion, including designing a motion enhancement module based on inter-frame difference to explicitly express the local motion intensity of the target, designing an attention fusion module based on a convolutional long short-term memory network to enable the network to focus on key information related to the small target in three dimensions: time, channel, and space, and designing a spatial feature reconstruction module based on gradient information to enhance the prediction of the target edge contour; determining a structure-aware hybrid loss function to update the network parameters; and training and testing the designed network using the dataset. This invention was tested on a publicly available multi-frame infrared small target dataset. Compared to advanced comparative algorithms, this invention achieved the best overall detection performance, with an IoU index improvement of 3.16% compared to the second-best algorithm, and achieved a balance between detection performance and speed.

Description

Technical Field

[0001] This invention relates to the field of multi-frame infrared small target detection, specifically a multi-frame infrared small target detection method based on an attention fusion encoder-decoder network.

Background Technology

[0002] Multi-frame infrared small target detection technology, with its advantages of all-weather operation, anti-interference, and strong concealment, has broad application prospects in both military and civilian fields. In the civilian field, it can serve scenarios such as urban security monitoring, forest fire prevention, power line inspection, and maritime rescue, enabling early fire identification, high-voltage line fault detection, and search and rescue of people who have fallen into the water under low visibility conditions.

[0003] In most infrared small target detection applications, the target is usually far from the detection sensor and its grayscale value is typically low, so it lacks shape and texture features, which makes detection challenging. Furthermore, detection is often affected by complex backgrounds and noise, which makes it difficult to extract the characteristic information of infrared small targets and easily causes false alarms. Suppressing the background while accurately detecting infrared small targets therefore remains a challenging problem. Currently, deep learning-based multi-frame infrared small target detection can learn the features of infrared small targets more effectively: it captures the subtle grayscale differences and local spatial distribution features between targets and the background, and combines inter-frame temporal correlation to enhance the dynamic feature representation of the target, which helps distinguish moving targets from background clutter and thus improves detection performance.

[0004] In recent years, deep learning methods have been widely applied to infrared small target detection because of their strong generalization ability and efficiency. These data-driven methods cope effectively with a variety of complex environments. Wu et al. proposed a learnable local saliency kernel network, L2SKNet (WU FY, LIU AR, ZHANG TF, et al. Saliency at the helm: steering infrared small target detection with learnable kernels[J]. IEEE Transactions on Geoscience and Remote Sensing, 2025, 63: 1-14.). This method designs a learnable local saliency kernel module that uses the idea of center-neighbor subtraction to guide the network to capture target features, and enhances the ability to capture multi-scale infrared features through a hierarchical structure. However, because it does not consider temporal information, the network is prone to false alarms when the background has grayscale distribution characteristics similar to those of the target. To balance detection performance against computational resources, Yu et al. proposed a lightweight and robust network, LR-Net (YU C, LIU YP, ZHAO JM, et al. LR-Net: a lightweight and robust network for infrared small target detection[C]//Lecture Notes in Computer Science. 2025: 19-33.). This method constructs a refined feature transfer module which, compared to direct cross-layer connections, improves the network's ability to extract detailed features while reducing computational resource consumption. However, its ability to extract target features still needs improvement, and it performs poorly on targets with rich detailed features. As practical applications demand higher detection performance, single-frame algorithms show their limitations in handling small targets accompanied by strong clutter interference.
Multi-frame detection algorithms, because they can correlate information from multiple frames, can significantly suppress noise and clutter interference and thus achieve more accurate and robust detection. They have therefore become increasingly popular and are now a major research direction in the field of infrared small target detection. He et al. proposed a dual encoder-decoder multi-frame infrared small target detection network called DEMNet (HE F, ZHANG QR, LI YC, et al. DEMNet: dual encoder–decoder multi-frame infrared small target detection network with motion encoding[J]. Remote Sensing, 2025, 17(17): 4106.). This method designs two encoder-decoder modules. The first fuses features from multiple levels through spatial and channel attention mechanisms to extract spatial feature information from each frame. The second simultaneously extracts intra-frame target position information and inter-frame target motion information. However, this network design is relatively complex and requires a large amount of computational resources. To address relative motion between the detection platform and the background, Huang et al. proposed the LMAFormer network (HUANG YX, ZHI XY, HU JM, et al. LMAFormer: Local motion aware transformer for small moving infrared target detection[J]. IEEE Transactions on Geoscience and Remote Sensing, 2024, 62: 1-17.). This method introduces a local motion-aware spatiotemporal attention mechanism, which can align and enhance features across multiple frames and avoid interference from moving backgrounds. However, the network still lacks the ability to extract detailed features of the target.

Summary of the Invention

[0005] To overcome the shortcomings of the prior art, this invention provides a multi-frame infrared small target detection method based on an attention fusion encoder-decoder network, which addresses the insufficient utilization of temporal features, insufficient extraction of target detail features, and large computational resource requirements of existing multi-frame infrared small target detection algorithms.

[0006] The technical solution to achieve the purpose of this invention is: a multi-frame infrared small target detection method based on attention fusion encoding and decoding network, comprising the following steps:

[0007] Step 1: Obtain a multi-frame infrared small target dataset, which consists of a continuous sequence of infrared small target images and corresponding ground truth images;

[0008] Step 2: Construct a U-shaped encoder-decoder network model based on attention fusion, including a motion enhancement module based on inter-frame difference, an attention fusion module based on convolutional long short-term memory network, and a spatial feature reconstruction module based on gradient information;

[0009] Step 3: Determine the structure-aware hybrid loss function to update the network parameters;

[0010] Step 4: Train the designed network structure using a multi-frame infrared small target dataset until a network model with good validation results is obtained.

[0011] Step 5: Input the multi-frame infrared small target test set into the trained network model to obtain the detection result image of the network model.

[0012] Compared with the prior art, the present invention has the following features: (1) The present invention designs a motion enhancement module based on inter-frame difference. This module calculates the absolute difference between two adjacent frames, explicitly models the temporal motion weights, and applies them to the input image sequence to effectively highlight the moving target and suppress static or slowly changing background interference. (2) The present invention designs an attention fusion module based on convolutional long short-term memory network. This module introduces channel attention and spatial attention to perform weighted modulation of temporal gate features, so that the network can focus on the feature information related to small targets in the three dimensions of time, channel and space, and fully extract the target features. (3) The present invention designs a spatial feature reconstruction module based on gradient information. This module first performs spatial scale changes and multi-scale feature fusion on the input feature map through multi-scale feature encoding and hierarchical decoding structure, and then fuses the feature information extracted by encoding and decoding with the image edge information to further improve the network's ability to extract image detail information.

[0013] The present invention will now be further described with reference to the accompanying drawings.

Attached Figure Description

[0014] Figure 1 is a flowchart of the multi-frame infrared small target detection method based on an attention fusion encoder-decoder network according to the present invention.

[0015] Figure 2 is a network structure diagram of the present invention.

[0016] Figure 3 is a structural diagram of the attention fusion module based on the convolutional long short-term memory network of the present invention.

[0017] Figure 4 shows partial detection results of the algorithm of this invention and the comparison algorithms on the test set: (a) the original image, (b) the ground truth image, (c) the L2SKNet detection result, (d) the LR-Net detection result, (e) the DEMNet detection result, (f) the LMAFormer detection result, and (g) the detection result of this invention.

Detailed Implementation Plan

[0018] In the field of multi-frame infrared small target detection, infrared small targets are typically characterized by their small size, weak contrast, and susceptibility to interference from complex backgrounds. This makes it difficult for existing algorithms to effectively extract feature information from infrared small targets, resulting in problems such as false alarms, missed detections, and incomplete target detection. To address this challenge, this invention proposes a multi-frame infrared small target detection method based on an attention fusion encoder-decoder network. By designing the network structure, this invention constructs a neural network model with a U-shaped encoder-decoder network as the main architecture, combined with a multi-attention fusion module based on a convolutional long short-term memory network. A motion enhancement module is designed for initial weighting of moving targets, and gradient features are introduced at the end to refine the target prediction results. In the motion enhancement module, this invention constructs temporal feature weights through inter-frame difference and weights the input sequence frame by frame. This design achieves initial enhancement of moving target features with low computational requirements. In the attention fusion module, this invention uses a convolutional long short-term memory structure combined with channel attention and spatial attention mechanisms to complete the extraction and fusion of infrared small target features in three dimensions: time, channel, and space. In the spatial feature reconstruction module, this invention effectively fuses feature information from different spatial scales through a multi-scale encoding and decoding structure, and introduces gradient information to constrain the reconstruction results, thereby better enhancing the network's ability to extract detailed target features. 
Finally, the detection results of this invention are qualitatively and quantitatively compared and analyzed with those obtained by other comparative algorithms. The analysis results show that this invention has significant performance advantages in infrared small target detection, effectively balancing the relationship between detection performance and time consumption, and achieving the expected research objectives.

[0019] This invention proposes a multi-frame infrared small target detection method based on an attention fusion encoder-decoder network, comprising the following steps:

[0020] Step 1: To verify the detection performance of the designed network, this invention selected the publicly available multi-frame infrared small target detection dataset NUDT-MIRSDT for network training and testing. This dataset contains 120 image sequences, each consisting of 100 consecutive frames. Background types include diverse scenes such as sky, ocean, and land. The infrared small targets exhibit dynamic morphological changes within the sequences. The dataset provides pixel-level ground truth annotations for the targets. This invention divides 84 image sequences from the 120 image sequences into a training set and the remaining 36 image sequences into a test set. The designed network is trained using the training set, and then tested using the test set.

[0021] A multi-frame infrared small target dataset is obtained, which consists of a continuous sequence of infrared small target images and corresponding ground truth images;

[0022] Step 2: Construct a U-shaped encoder-decoder network model based on attention fusion, including a motion enhancement module based on inter-frame difference, an attention fusion module based on convolutional long short-term memory network, and a spatial feature reconstruction module based on gradient information;

[0023] Step 2.1: Construct the motion enhancement module. The motion enhancement module characterizes local motion intensity by calculating the absolute difference between adjacent frames. This design requires no additional learnable parameters, effectively highlighting moving targets and suppressing static or slowly changing background interference, thus providing more discriminative input features for subsequent feature extraction and significantly improving the distinguishability of small infrared targets. The motion enhancement module includes inter-frame difference blocks, temporal alignment blocks, and feature weighting blocks, used to weight the input temporal features. The specific structure of the motion enhancement module is as follows:

[0024] Inter-frame difference block: Input a feature map of size T×M×N×1; perform element-wise difference operation on two adjacent frames in the time dimension T, take the absolute value of the difference result, and output a difference feature map of size (T−1)×M×N×1.

[0025] Temporal alignment block: The difference feature map output by the inter-frame difference block is used as input and padded at the front of the temporal dimension so that it aligns with the original input feature map in the temporal dimension. The output temporally aligned feature map has a size of T×M×N×1.

[0026] Feature weighting block: The feature map output from the temporal alignment block is taken as input and, after a linear mapping operation, is multiplied element-wise with the original input feature map to output a feature map of size T×M×N×1.
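The three blocks of the motion enhancement module can be sketched in a few lines of NumPy. This is a minimal illustration, not the patented implementation: the function name `motion_enhance` is hypothetical, the front padding is assumed to repeat the first difference map, and the "linear mapping" is assumed to be `1 + d / max(d)`, since the text specifies neither.

```python
import numpy as np

def motion_enhance(frames, eps=1e-6):
    """Weight an infrared sequence by local motion intensity.

    frames: float array of shape (T, M, N, 1).
    Returns an array of the same shape.
    """
    # Inter-frame difference block: |x[t] - x[t-1]| -> (T-1, M, N, 1).
    diff = np.abs(frames[1:] - frames[:-1])

    # Temporal alignment block: pad one map at the front of the time axis.
    # Assumption: the pad repeats the first difference map.
    aligned = np.concatenate([diff[:1], diff], axis=0)  # (T, M, N, 1)

    # Feature weighting block.
    # Assumption: the linear mapping is w = 1 + d / max(d), so static
    # pixels keep weight 1 and moving pixels are amplified up to about 2x.
    weights = 1.0 + aligned / (aligned.max() + eps)
    return frames * weights
```

With this mapping the module has no learnable parameters, which matches the description above; any other monotone linear mapping of the difference map would fit the text equally well.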

[0027] Step 2.2: Construct the attention fusion module. This module adds channel attention and spatial attention mechanisms to the traditional gating mechanism of the convolutional long short-term memory (ConvLSTM) network. Channel attention compresses the spatial size and focuses on the channels that contain important information, while spatial attention focuses on the spatial location of the target. The ConvLSTM gate mechanism retains key temporal information relevant to the target while ignoring unimportant background information. Combining these three mechanisms allows the network to focus on key information related to the small target in three dimensions: time, channel, and space. The attention fusion module includes a feature concatenation block, a gated convolutional block, a channel attention block, a spatial attention block, an input gate, a forget gate, an output gate, a candidate state calculation block, a memory state update block, a hidden state update block, and an output layer. The specific structure of the attention fusion module is as follows:

[0028] Feature concatenation block: The obtained T×M×N×1 feature map is used as input. The input features at the current time step are concatenated with the hidden state of the previous time step along the channel dimension. The output is a concatenated feature map of size M×N×(1+H), where H represents the number of hidden state channels.

[0029] Gated convolutional block: The concatenated feature map output by the feature concatenation block is used as input and passed through a two-dimensional convolutional layer (3×3 kernel size, 32 kernels, stride 1, padding 1), which outputs an M×N gated feature map with 4H channels (with 32 kernels, 4H = 32 and H = 8).

[0030] Channel Attention Block: The channel attention block first performs pooling operations in the spatial dimension to compress the spatial size, facilitating the subsequent learning of channel features. Then, it learns the features and importance of each channel dimension, finally obtaining the channel attention weights. The channel attention block takes the feature map output from the gated convolutional block as input, passes it through a global average pooling layer (compressing the spatial size to 1), a two-dimensional convolutional layer (1×1 kernel size, 4 kernels), a non-linear activation function layer (ReLU activation function), then another two-dimensional convolutional layer (1×1 kernel size, 32 kernels), and a non-linear activation function layer (Sigmoid activation function), generating a channel weight feature map with the same number of channels as the gated feature map. A channel-by-channel weighting operation is then performed on the gated feature map, outputting an M×N feature map with 4H channels after channel attention weighting.

[0031] Spatial attention block: The concatenated feature map output by the feature concatenation block is taken as input and passed through a two-dimensional convolutional layer (7×7 kernel size, 1 kernel, stride 1, padding 3) and a non-linear activation function layer (Sigmoid activation function) to generate a spatial weight feature map. This is then multiplied element-wise with the feature map after channel attention weighting to output an M×N feature map with 4H channels after spatial attention weighting.

[0032] Input gate: The first set of M×N feature maps with H channels from the spatial attention block output is taken as input, passed through an element-wise Sigmoid activation layer, and outputs an input gate weight feature map of size M×N×H. The Sigmoid activation function maps the weights into the range [0,1], thereby determining which information is important and how important it is before adding it to the memory state. 0 means all information is discarded and the memory state is not updated at all, 1 means all information is retained and added to the memory state, and values between 0 and 1 indicate that some information is added to the memory state.

[0033] Forget gate: The second set of M×N feature maps with H channels from the spatial attention block output is taken as input, passed through an element-wise Sigmoid activation layer, and outputs a forget gate weight feature map of size M×N×H. The Sigmoid activation function maps the weights into the range [0,1], thereby determining which information is important and to what extent, and thus deciding whether to forget or retain certain information in the memory state. 0 indicates that all information is forgotten, 1 indicates that all information is retained, and values between 0 and 1 indicate that some information is retained.

[0034] Output gate: The third set of M×N feature maps with H channels from the spatial attention block output is taken as input, passed through an element-wise Sigmoid activation layer, and outputs an output gate weight feature map with size M×N×H. The Sigmoid activation function transforms the weight value range into [0,1], thereby determining which information is important and to what extent, and thus deciding which information to select as the hidden state output.

[0035] Candidate state calculation block: The fourth set of M×N feature maps with H channels from the spatial attention block output is taken as input, passed through an element-wise hyperbolic tangent activation layer, and outputs a candidate state feature map of size M×N×H. The candidate state feature map is added to the memory state after being modulated by the input gate weights.

[0036] Memory state update block: The input gate weight feature map, forget gate weight feature map, and candidate state feature map output by the input gate, forget gate, and candidate state calculation block are taken as input. The memory state feature map of the previous time step is multiplied element-wise with the forget gate weight feature map, the candidate state feature map is multiplied element-wise with the input gate weight feature map, and the two results are added element-wise, outputting a memory state feature map of size M×N×H for the current time step.

[0037] Hidden State Update Block: The memory state feature map of the current time step output by the memory state update block is taken as input, passed through an element-wise hyperbolic tangent activation layer, and then multiplied element-wise with the output gate weight feature map to output a hidden state feature map of size M×N×H at the current time step.

[0038] Output layer: The feature map output by the hidden state update block is used as input, and the hidden state feature maps output at all time steps are stacked sequentially in the time dimension to output a feature map of size T×M×N×H.
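The data flow through the attention fusion module can be traced with a small NumPy sketch. It is illustrative only: the helper names (`conv2d_same`, `init_params`, `attention_convlstm`) are hypothetical, the weights are random rather than trained, biases are omitted, and the zero initial hidden and memory states are an assumption. Layer sizes follow the text (3×3 gate convolution with 32 kernels, 1×1 channel-attention convolutions with 4 and 32 kernels, 7×7 spatial-attention convolution), which implies H = 8.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def conv2d_same(x, w):
    """Stride-1, zero-padded 2D convolution (cross-correlation form).
    x: (M, N, Cin), w: (k, k, Cin, Cout) -> (M, N, Cout)."""
    k = w.shape[0]
    p = k // 2
    xp = np.pad(x, ((p, p), (p, p), (0, 0)))
    # win has shape (M, N, Cin, k, k)
    win = np.lib.stride_tricks.sliding_window_view(xp, (k, k), axis=(0, 1))
    return np.einsum("mncij,ijco->mno", win, w)

def init_params(hidden=8, reduced=4, seed=0):
    """Random, untrained weights (assumption: no bias terms)."""
    rng = np.random.default_rng(seed)
    return {
        "w_gate": 0.1 * rng.standard_normal((3, 3, 1 + hidden, 4 * hidden)),
        "w_c1": 0.1 * rng.standard_normal((4 * hidden, reduced)),
        "w_c2": 0.1 * rng.standard_normal((reduced, 4 * hidden)),
        "w_sp": 0.1 * rng.standard_normal((7, 7, 1 + hidden, 1)),
    }

def attention_convlstm(frames, params, hidden=8):
    """frames: (T, M, N, 1) -> stacked hidden states (T, M, N, hidden)."""
    T, M, N, _ = frames.shape
    h = np.zeros((M, N, hidden))  # assumed zero initial hidden state
    c = np.zeros((M, N, hidden))  # assumed zero initial memory state
    outputs = []
    for t in range(T):
        # Feature concatenation block: (M, N, 1 + H)
        z = np.concatenate([frames[t], h], axis=-1)
        # Gated convolutional block: (M, N, 4H)
        g = conv2d_same(z, params["w_gate"])
        # Channel attention: global average pool -> 1x1 conv -> ReLU
        # -> 1x1 conv -> Sigmoid, then channel-wise weighting.
        pooled = g.mean(axis=(0, 1))
        ca = sigmoid(np.maximum(pooled @ params["w_c1"], 0.0) @ params["w_c2"])
        g = g * ca
        # Spatial attention: 7x7 conv on the concatenated map -> Sigmoid.
        sa = sigmoid(conv2d_same(z, params["w_sp"]))
        g = g * sa
        # Split into input, forget, output, and candidate groups (H each).
        i, f, o, cand = np.split(g, 4, axis=-1)
        # Memory state update and hidden state update.
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(cand)
        h = sigmoid(o) * np.tanh(c)
        outputs.append(h)
    # Output layer: stack hidden states along the time dimension.
    return np.stack(outputs, axis=0)
```

For a 4-frame 6×6 sequence, `attention_convlstm(frames, init_params())` returns an array of shape `(4, 6, 6, 8)`, matching the T×M×N×H output described for the output layer.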

[0039] Step 2.3: Construct the spatial feature reconstruction module. This module first performs spatial scale transformation through an encoder-decoder structure and introduces cross-scale feature concatenation to achieve multi-scale fusion, combining high-level semantic information with low-level detail information. In the output stage, it calculates spatial gradient features and fuses them with the features reconstructed by the encoder-decoder structure, thereby further enhancing the model's ability to detect the details of small infrared targets. The spatial feature reconstruction module includes three encoding layers, two decoding layers, and one output layer, which operate on feature maps at different resolutions, as well as a temporal aggregation block for compressing the temporal dimension and a gradient-based edge refinement block. The edge refinement block mainly includes a gradient extraction layer, a feature fusion layer, an attention modulation layer, and a refined output layer. The specific structure of the spatial feature reconstruction module is as follows:

[0040] Encoding layer 1: The obtained feature map of size T×M×N×H is taken as input, and after passing through a three-dimensional convolutional layer (convolution kernel size 3×3×3, number of convolution kernels 32, convolution stride 1×1×1, padding 1), a batch normalization layer, and a non-linear activation function layer (activation function is ReLU), the output feature map of size T×M×N×32 is obtained.

[0041] Encoding Layer 2: The feature map output from Encoding Layer 1 is used as input and passed through a 3D convolutional layer (3×3×3 kernel size, 48 kernels, 1×2×2 stride, padding 1) to perform downsampling in the spatial dimension, outputting a feature map with 48 channels. After passing through a batch normalization layer and a non-linear activation function layer (ReLU activation function), the output feature map has a size of T×(M / 2)×(N / 2)×48.

[0042] Encoding layer 3: The feature map output from encoding layer 2 is used as input, and then passed through a three-dimensional convolutional layer (3×3×3 kernel size, 64 kernels, 1×2×2 stride, padding 1). Further downsampling is performed in the spatial dimension to output a feature map with 64 channels. After passing through a batch normalization layer and a non-linear activation function layer (ReLU activation function), a feature map with size T×(M / 4)×(N / 4)×64 is output.

[0043] Temporal aggregation block: The feature maps output from encoding layer 1, encoding layer 2, and encoding layer 3 are taken as input and aggregated by weighting along the temporal dimension (the weights are learnable parameters that are normalized by a Softmax activation function before being applied along the temporal dimension). The temporal dimension T is compressed to 1, and three feature maps with sizes of 1×M×N×32, 1×(M / 2)×(N / 2)×48, and 1×(M / 4)×(N / 4)×64 are output respectively.
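A minimal NumPy sketch of the Softmax-weighted temporal aggregation is given below. The function name `temporal_aggregate` and the choice of one learnable score per time step (shared across all pixels and channels) are assumptions; the text does not state the shape of the weight parameter.

```python
import numpy as np

def temporal_aggregate(feat, logits):
    """Compress the time axis with Softmax-normalized weights.

    feat:   (T, M, N, C) feature map from one encoding layer.
    logits: (T,) learnable scores, one per time step (assumed shape).
    Returns an array of shape (1, M, N, C).
    """
    # Numerically stable Softmax over the time axis.
    w = np.exp(logits - logits.max())
    w = w / w.sum()
    # Weighted sum over time: contracts axis 0 of w with axis 0 of feat.
    out = np.tensordot(w, feat, axes=(0, 0))  # (M, N, C)
    return out[None]                          # restore a length-1 time axis
```

With all-zero scores the block reduces to a plain temporal average; during training the scores can shift the weight toward the frames that carry the most target information.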

[0044] Decoding Layer 1: The feature map from Encoding Layer 3, output from the temporal aggregation block, is taken as input. After passing through a bilinear upsampling layer, the spatial size is restored to (M / 2)×(N / 2). Then, it is concatenated with the feature map from Encoding Layer 2, output from the temporal aggregation block, in the channel dimension. The concatenated feature map has 112 channels. Then, it passes through a deconvolution layer (2×2 kernel size, 48 kernels, stride of 2) and a non-linear activation function layer (ReLU activation function). Then, it passes through a two-dimensional convolution layer (3×3 kernel size, 48 kernels, stride of 1, padding of 1) and a non-linear activation function layer (ReLU activation function). The output feature map has a size of (M / 2)×(N / 2)×48.

[0045] Decoding Layer 2: The feature map output from Decoding Layer 1 is used as input and concatenated with the feature map from Encoding Layer 1 output from the temporal aggregation block in the channel dimension. The concatenated feature map has 80 channels. Then, it passes through a deconvolution layer (2×2 kernel size, 32 kernels, stride of 2) and a non-linear activation function layer (ReLU activation function), then through a 2D convolution layer (3×3 kernel size, 32 kernels, stride of 1, padding of 1) and a non-linear activation function layer (ReLU activation function), and finally through a bilinear upsampling layer to restore the spatial size to M×N, outputting a feature map with a size of M×N×32.

[0046] Output layer: The feature map output from decoding layer 2 is used as input, passed through a two-dimensional convolutional layer (3×3 kernel size, 16 kernels, 1 stride, 1 padding) and a channel compression convolutional layer (1×1 kernel size, 1 kernel, 1 stride), and outputs a feature map of size M×N×1.

[0047] Gradient extraction layer: The feature map output by the output layer is taken as input, and the differences between adjacent pixels in the horizontal and vertical directions are calculated to obtain horizontal and vertical gradient feature maps. The horizontal and vertical gradient feature maps are then first averaged, and then averaged again along the channel dimension, resulting in a feature map of size M×N×1.

[0048] Feature fusion layer: The feature map output by the gradient extraction layer is taken as input and concatenated with the feature map output by the output layer in the channel dimension, and the output is a fused feature map with a size of M×N×2.

[0049] Attention Modulation Layer: The feature map output from the feature fusion layer is used as input. It first passes through a two-dimensional convolutional layer (3×3 kernel size, 16 kernels, 1 stride, 1 padding) and a non-linear activation function layer (ReLU activation function). Then it passes through another two-dimensional convolutional layer (1×1 kernel size, 1 kernel, 1 stride) and a non-linear activation function layer (Sigmoid activation function) to generate corresponding weighted feature maps. The fused feature map is then weighted element-wise using the weighted feature maps, and a feature map of size M×N×2 is output.

[0050] Refined output layer: The feature map output by the attention modulation layer is taken as input, passed through a two-dimensional convolutional layer (3×3 kernel size, 1 kernel, 1 stride, and 1 padding), and the output is a refined feature map of size M×N×1.
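The gradient extraction and feature fusion layers of the edge refinement block can be sketched as follows. The function names are hypothetical, and two details are assumptions because the text leaves them open: the border row and column get a zero gradient so the output keeps the M×N size, and the horizontal and vertical maps are combined by averaging their absolute values. The convolutional attention modulation and refined output layers are not shown.

```python
import numpy as np

def gradient_map(x):
    """Gradient extraction layer. x: (M, N, C) -> (M, N, 1)."""
    gx = np.zeros_like(x)
    gy = np.zeros_like(x)
    # Differences between horizontally and vertically adjacent pixels.
    # Assumption: the first column/row keeps a zero gradient.
    gx[:, 1:] = x[:, 1:] - x[:, :-1]
    gy[1:, :] = x[1:, :] - x[:-1, :]
    # Assumption: combine the two directions by averaging absolute values,
    # then average along the channel dimension.
    g = 0.5 * (np.abs(gx) + np.abs(gy))
    return g.mean(axis=-1, keepdims=True)

def fuse_with_gradient(x):
    """Feature fusion layer. x: (M, N, 1) -> (M, N, 2)."""
    return np.concatenate([gradient_map(x), x], axis=-1)
```

On a map that is zero except for a small bright block, the gradient channel is non-zero only along the block's edges, which is the edge cue the subsequent attention modulation layer uses to reweight the fused features.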

[0051] Step 3: Determine the structure-aware hybrid loss function used to update network parameters. The loss function designed in this invention is a weighted combination of multiple losses, used to impose multi-dimensional constraints on the network model's output during training. This combined loss function constrains and optimizes four aspects: pixel-level error, region overlap, false positive to false negative ratio, and structural similarity. The formula for the loss function used in this invention is expressed as follows:

[0052]

[0053] where one term is the weighted binary cross-entropy loss, which measures the difference between the probability distribution predicted by the model and the probability distribution of the true labels. To address the imbalance between the foreground and background of infrared small targets, this invention introduces a weighting coefficient for positive samples. The formula is as follows:

[0054]

[0055] where the variables of the formula are the predicted image and the ground truth image, the pixel index of the image, and the weighting coefficient for positive samples. Another term of the loss function is the Dice loss, which evaluates the overlap between the model's predictions and the true labels and forms a constraint term based on the degree of region matching to optimize model performance. The formula is as follows:

[0056]

[0057] where a very small constant is included to prevent the denominator from being 0. A further term of the loss function is the Focal Tversky loss, which introduces weighting coefficients for false positives and false negatives and applies exponential modulation. By reducing the weight of easily classified samples and emphasizing difficult-to-classify samples, it increases the model's focus on the target. The formula is as follows:

[0058]

[0059] where the two weighting parameters apply to false positives and false negatives, respectively, and the exponent is the exponential modulation coefficient. The remaining term of the loss function is the structural similarity loss, which uses local-window statistics to compare the mean, variance, and covariance of the model's prediction and the ground truth label within a local region. It is a constraint term based on local structural consistency. The formula is as follows:

[0060]

[0061] where the two means are those of the predicted image and the ground truth image within the local window, respectively, the two standard deviations are the corresponding standard deviations, and the two constants, each set to a fixed value, are used to avoid the denominator being zero.
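The four loss terms described above can be sketched in NumPy using their commonly used textbook forms. This is an illustration, not the patent's exact formulas. Every symbol and default value is an assumption: `pos_weight`, `alpha`, `beta`, `gamma`, the equal combination weights, the non-overlapping 4×4 windows, and the conventional SSIM constants 0.01² and 0.03². The Focal Tversky exponent is applied directly as `gamma`, and the SSIM loss is taken as one minus the mean SSIM.

```python
import numpy as np

def weighted_bce(pred, target, pos_weight=2.0, eps=1e-7):
    p = np.clip(pred, eps, 1.0 - eps)
    # Positive (target) pixels are up-weighted to counter foreground/background imbalance.
    loss = -(pos_weight * target * np.log(p) + (1.0 - target) * np.log(1.0 - p))
    return float(loss.mean())

def dice_loss(pred, target, eps=1e-6):
    inter = np.sum(pred * target)
    # eps keeps the denominator non-zero when both images are empty.
    return float(1.0 - (2.0 * inter + eps) / (np.sum(pred) + np.sum(target) + eps))

def focal_tversky_loss(pred, target, alpha=0.3, beta=0.7, gamma=0.75, eps=1e-6):
    tp = np.sum(pred * target)
    fp = np.sum(pred * (1.0 - target))      # false positives, weighted by alpha
    fn = np.sum((1.0 - pred) * target)      # false negatives, weighted by beta
    tversky = (tp + eps) / (tp + alpha * fp + beta * fn + eps)
    return float((1.0 - tversky) ** gamma)

def ssim_loss(pred, target, win=4, c1=0.01 ** 2, c2=0.03 ** 2):
    m = pred.shape[0] - pred.shape[0] % win
    n = pred.shape[1] - pred.shape[1] % win

    def windows(img):
        # Split the image into non-overlapping win x win windows, one window per row.
        blocks = img[:m, :n].reshape(m // win, win, n // win, win)
        return blocks.transpose(0, 2, 1, 3).reshape(-1, win * win)

    x, y = windows(pred), windows(target)
    mu_x, mu_y = x.mean(axis=1), y.mean(axis=1)
    var_x, var_y = x.var(axis=1), y.var(axis=1)
    cov = ((x - mu_x[:, None]) * (y - mu_y[:, None])).mean(axis=1)
    ssim = ((2 * mu_x * mu_y + c1) * (2 * cov + c2)) / (
        (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2))
    return float(1.0 - ssim.mean())

def hybrid_loss(pred, target, weights=(1.0, 1.0, 1.0, 1.0)):
    terms = (weighted_bce(pred, target), dice_loss(pred, target),
             focal_tversky_loss(pred, target), ssim_loss(pred, target))
    return sum(w * t for w, t in zip(weights, terms))

target = np.zeros((8, 8))
target[2:4, 2:4] = 1.0                                  # a 2x2 "small target"
perfect = hybrid_loss(target, target)                   # prediction equals ground truth
missed = hybrid_loss(np.zeros((8, 8)), target)          # target completely missed
```

With these assumed settings, a perfect prediction gives a loss close to zero, while a completely missed target is penalized by all four terms at once.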

[0062] Step 4: Train the designed network structure using a multi-frame infrared small target dataset until a network model with good validation results is obtained. The training process is as follows:

[0063] During the training phase, mini-batch sampling is used to iteratively update the network parameters, with a batch size of 4. The input image sequences are uniformly scaled to a resolution of 256×256. The AdamW optimization algorithm is used for network parameter updates, with an initial learning rate of 0.001 and a weight decay coefficient of 1×10⁻⁴. During training, a cosine annealing learning rate scheduling strategy is used to dynamically adjust the learning rate, with the minimum learning rate set to 1×10⁻⁵ and the scheduling period parameter set to 150. The total number of training epochs is set to 150. The model is validated and saved in each epoch until the maximum number of epochs is reached, and the optimal model is then selected as the final model.
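The cosine annealing schedule can be illustrated with its standard closed form. The sketch below uses the initial learning rate, minimum learning rate, and period stated above. The function name `cosine_annealing_lr` is hypothetical, and the closed form is the usual one for this schedule rather than a formula quoted from the patent.

```python
import math

def cosine_annealing_lr(epoch, base_lr=1e-3, min_lr=1e-5, period=150):
    """Learning rate at a given epoch under cosine annealing."""
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * epoch / period))

# Learning rate at the start, middle, and end of the 150-epoch run.
schedule = [cosine_annealing_lr(e) for e in (0, 75, 150)]
```

The rate starts at 0.001, passes through the midpoint value 5.05×10⁻⁴ at epoch 75, and reaches the minimum of 1×10⁻⁵ at epoch 150.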

[0064] Step 5: Input the multi-frame infrared small target test set into the trained network model to obtain the detection result image of the network model, and then compare and evaluate it with the ground truth image.

[0065] Example

[0066] This experimental example compares the proposed method with two deep learning-based single-frame infrared small target detection algorithms, L2SKNet and LRNet, and two deep learning-based multi-frame infrared small target detection algorithms, DEMNet and LMAFormer, using the public dataset NUDT-MIRSDT mentioned in step 1 for qualitative and quantitative analysis. L2SKNet is an infrared small target detection method based on local saliency, guiding the network to capture salient features through the idea of center-neighbor subtraction. LRNet employs a low-level feature distribution strategy, using low-level features to supplement high-level feature information to address the problem of small target loss in high-level feature maps. DEMNet uses two encoder-decoder modules to extract spatial features and motion information respectively, and finally fuses the two types of feature information. LMAFormer introduces a local motion-aware spatiotemporal attention mechanism to align and enhance multi-frame features, thereby extracting the local spatiotemporal salient features of the target while avoiding interference from moving backgrounds.

[0067] The test results for each method are shown in Figure 4. In order, the images are (a) the original image, (b) the ground truth image, (c) the L2SKNet detection result, (d) the LRNet detection result, (e) the DEMNet detection result, (f) the LMAFormer detection result, and (g) the detection result of this invention. L2SKNet detects targets well when the local contrast is high, but it often misses targets that are submerged in the background because of low grayscale values. LRNet detects targets in almost all images, but it is prone to generating many false alarms, and the completeness of the detected targets needs improvement. DEMNet achieves excellent detection performance in most images, but in some images the shape of the detected target still differs significantly from the ground truth image. LMAFormer accurately detects targets in most images, but when the target is weak it is affected by background clutter, resulting in false alarms. The network designed in this invention accurately detects targets while preserving target details to the maximum extent, and it does not generate false alarms in any of the images, achieving the best visual results.

[0068] In addition, commonly used objective evaluation metrics for infrared small target detection include precision, recall, and intersection-over-union (IoU). The test results are shown in Table 1.

[0069] Table 1 Quantitative Analysis Results

[0070]
Method | Precision | Recall | IoU
L2SKNet | 0.5584 | 0.5407 | 0.4412
LRNet | 0.5676 | 0.5176 | 0.4662
DEMNet | 0.7696 | 0.7958 | 0.6593
LMAFormer | 0.8907 | 0.8244 | 0.7608
This invention | 0.8830 | 0.8854 | 0.7924

[0071] Precision evaluates the accuracy of the positive samples predicted by the model; a higher value indicates fewer false alarms. Recall evaluates the model's ability to detect targets; a higher value indicates fewer false negatives. IoU evaluates the degree of overlap between the target region predicted by the model and the target region in the ground truth image; a higher value indicates a closer match between the two regions. As shown in Table 1, the model of this invention performs best in both Recall (0.8854 versus 0.8244 for the second-best, LMAFormer) and IoU (0.7924 versus 0.7608 for LMAFormer). It ranks second in Precision (0.8830), only slightly lower than LMAFormer (0.8907). In summary, this invention achieves the best overall detection performance.
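The three metrics can be computed from a pair of binary masks as in the following sketch. It uses the standard pixel-level definitions. The function name `detection_metrics`, the toy masks, and the convention of returning 0 when a denominator is empty are assumptions, and the patent does not state how the metrics are aggregated over the test set.

```python
import numpy as np

def detection_metrics(pred_mask, gt_mask):
    """Pixel-level precision, recall, and IoU for two binary masks."""
    pred = np.asarray(pred_mask, dtype=bool)
    gt = np.asarray(gt_mask, dtype=bool)
    tp = int(np.sum(pred & gt))     # target pixels correctly detected
    fp = int(np.sum(pred & ~gt))    # false alarms
    fn = int(np.sum(~pred & gt))    # missed target pixels
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    iou = tp / (tp + fp + fn) if tp + fp + fn else 0.0
    return precision, recall, iou

precision, recall, iou = detection_metrics([1, 1, 1, 0, 0], [0, 1, 1, 1, 0])
```

In this toy case two pixels overlap, one is a false alarm, and one is missed, so precision and recall are both 2/3 and IoU is 0.5.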

[0072] Finally, the time consumption of the model and the comparison algorithm in a single inference was tested on the test set, and the results are shown in Table 2. Table 2 shows that the model of this invention has the fastest detection speed among all comparison algorithms.

[0073] Table 2 Time consumption of a single inference

[0074]
Algorithm | Time taken (ms)
L2SKNet | 13.36
LRNet | 29.00
DEMNet | 85.75
LMAFormer | 4656.06
This invention | 10.92
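A per-inference time of this kind is typically obtained by averaging repeated runs after a short warm-up. The sketch below shows one such measurement using only the standard library. The patent does not describe its timing protocol, so the warm-up count, the number of runs, and the stand-in callable are all assumptions.

```python
import time

def mean_inference_ms(infer, sample, warmup=3, runs=20):
    """Average wall-clock time of infer(sample) in milliseconds."""
    for _ in range(warmup):
        infer(sample)                      # warm-up runs are not timed
    start = time.perf_counter()
    for _ in range(runs):
        infer(sample)
    return (time.perf_counter() - start) * 1000.0 / runs

# Stand-in for a trained model: any callable that takes one input.
elapsed_ms = mean_inference_ms(lambda x: sum(v * v for v in x), list(range(1000)))
```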

Claims

1. A multi-frame infrared small target detection method based on an attention fusion encoder-decoder network, characterized in that it includes the following steps: Step 1: Obtain a multi-frame infrared small target dataset, which consists of continuous sequences of infrared small target images and the corresponding ground truth images; Step 2: Construct a U-shaped encoder-decoder network model based on attention fusion, including a motion enhancement module based on inter-frame difference, an attention fusion module based on a convolutional long short-term memory network, and a spatial feature reconstruction module based on gradient information, wherein the motion enhancement module is used to perform weighted processing on the input temporal features; Step 3: Determine the structure-aware hybrid loss function used to update the network parameters; Step 4: Train the designed network structure using the multi-frame infrared small target dataset until a network model that meets the set conditions is obtained; Step 5: Input multi-frame infrared small target images into the trained network model to obtain the detection result images of the network model.

2. The multi-frame infrared small target detection method based on attention fusion encoder-decoder network according to claim 1, characterized in that the motion enhancement module includes an inter-frame difference block, a temporal alignment block, and a feature weighting block. The input of the inter-frame difference block is a feature map of size T×M×N×1. The inter-frame difference block performs an element-wise difference operation on each pair of adjacent frames in the temporal dimension T and takes the absolute value of the result, outputting a difference feature map of size (T−1)×M×N×1. The input of the temporal alignment block is the difference feature map output by the inter-frame difference block. The temporal alignment block pads the difference feature map at the front end of the temporal dimension and outputs a temporally aligned feature map of size T×M×N×1. The feature weighting block takes the feature map output by the temporal alignment block as input, performs a linear mapping operation, and then multiplies the result element-wise with the original input feature map to output a feature map of size T×M×N×1.
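The three blocks of the motion enhancement module can be sketched in NumPy as follows. The padding value (a zero frame) and the form of the linear mapping (`scale * d + bias`, with hypothetical defaults of 1.0) are assumptions, since the claim does not specify them.

```python
import numpy as np

def motion_enhance(x, scale=1.0, bias=1.0):
    """x: (T, M, N, 1) frame sequence. Returns the motion-weighted sequence."""
    # Inter-frame difference block: absolute difference of adjacent frames.
    diff = np.abs(x[1:] - x[:-1])                       # (T-1, M, N, 1)
    # Temporal alignment block: pad one frame at the front of the time axis.
    aligned = np.concatenate([np.zeros_like(x[:1]), diff], axis=0)   # (T, M, N, 1)
    # Feature weighting block: linear mapping, then element-wise product with the input.
    weights = scale * aligned + bias
    return weights * x

x = np.zeros((3, 2, 2, 1))
x[1, 0, 0, 0] = 1.0      # a pixel that appears in frame 1
x[2, 0, 0, 0] = 1.0      # and stays unchanged in frame 2
x[2, 1, 1, 0] = 0.5      # a second pixel that appears in frame 2
out = motion_enhance(x)
print(out.shape)         # (3, 2, 2, 1)
```

With the assumed mapping, a pixel that changes between frames is amplified (the first pixel becomes 2.0 in frame 1), while a pixel that stays constant keeps its original value (1.0 in frame 2).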

3. The multi-frame infrared small target detection method based on attention fusion encoder-decoder network according to claim 1, characterized in that the attention fusion module includes a feature concatenation block, a gated convolution block, a channel attention block, a spatial attention block, an input gate, a forget gate, an output gate, a candidate state calculation block, a memory state update block, a hidden state update block, and an output layer. The specific processing procedure is as follows:
Feature concatenation block: The T×M×N×1 feature map obtained by the motion enhancement module is taken as input. The input features at the current time step are concatenated with the hidden state of the previous time step in the channel dimension, and a concatenated feature map of size M×N×(1+H) is output, where H represents the number of hidden state channels.
Gated convolution block: The concatenated feature map output by the feature concatenation block is taken as input and passed through a two-dimensional convolutional layer, outputting an M×N gated feature map with 4H channels.
Channel attention block: The feature map output by the gated convolution block is taken as input and passed through a global average pooling layer, a two-dimensional convolutional layer, and a non-linear activation function layer, and then through another two-dimensional convolutional layer and a non-linear activation function layer, to generate a channel weight feature map with the same number of channels as the gated feature map. A channel-by-channel weighting operation is then performed on the gated feature map, outputting an M×N channel-attention-weighted feature map with 4H channels.
Spatial attention block: The concatenated feature map output by the feature concatenation block is taken as input and passed through a two-dimensional convolutional layer and a non-linear activation function layer to generate a spatial weight feature map. This is then multiplied element-wise with the channel-attention-weighted feature map, outputting an M×N spatial-attention-weighted feature map with 4H channels.
Input gate: The first group of M×N feature maps with H channels output by the spatial attention block is taken as input and passed through an element-wise sigmoid activation layer, outputting an input gate weight feature map of size M×N×H.
Forget gate: The second group of M×N feature maps with H channels output by the spatial attention block is taken as input and passed through an element-wise sigmoid activation layer, outputting a forget gate weight feature map of size M×N×H.
Output gate: The third group of M×N feature maps with H channels output by the spatial attention block is taken as input and passed through an element-wise sigmoid activation layer, outputting an output gate weight feature map of size M×N×H.
Candidate state calculation block: The fourth group of M×N feature maps with H channels output by the spatial attention block is taken as input and passed through an element-wise hyperbolic tangent activation layer, outputting a candidate state feature map of size M×N×H.
Memory state update block: The input gate weight feature map, the forget gate weight feature map, and the candidate state feature map output by the input gate, the forget gate, and the candidate state calculation block are taken as input. The memory state feature map of the previous time step is multiplied element-wise with the forget gate weight feature map, the candidate state feature map is multiplied element-wise with the input gate weight feature map, and the two results are added element-wise, outputting the memory state feature map of the current time step with a size of M×N×H.
Hidden state update block: The memory state feature map of the current time step output by the memory state update block is taken as input, passed through an element-wise hyperbolic tangent activation layer, and then multiplied element-wise with the output gate weight feature map, outputting the hidden state feature map of the current time step with a size of M×N×H.
Output layer: The feature map output by the hidden state update block is taken as input, and the hidden state feature maps output at all time steps are stacked sequentially in the time dimension, outputting a feature map of size T×M×N×H.
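The gate and state updates of this claim follow the usual convolutional LSTM recurrence and can be sketched in NumPy. The sketch starts from an already attention-weighted gated feature map with 4H channels. The convolution and the two attention blocks are omitted, and the function names are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_state_update(gated, c_prev):
    """gated: (M, N, 4H) attention-weighted map; c_prev: (M, N, H) previous memory state."""
    # Split the 4H channels into the four groups in the order given by the claim:
    # input gate, forget gate, output gate, candidate state.
    i_raw, f_raw, o_raw, g_raw = np.split(gated, 4, axis=-1)
    i, f, o = sigmoid(i_raw), sigmoid(f_raw), sigmoid(o_raw)
    g = np.tanh(g_raw)
    c = f * c_prev + i * g          # memory state update
    h = o * np.tanh(c)              # hidden state update
    return h, c

H = 2
gated = np.zeros((2, 2, 4 * H))     # all-zero gate inputs: every gate equals 0.5
c_prev = np.ones((2, 2, H))
h, c = lstm_state_update(gated, c_prev)
print(h.shape, c.shape)             # (2, 2, 2) (2, 2, 2)
```

Stacking the hidden state `h` from every time step along a new leading axis would give the T×M×N×H output described for the output layer.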

4. The multi-frame infrared small target detection method based on attention fusion encoder-decoder network according to claim 1, characterized in that the spatial feature reconstruction module includes three encoding layers, two decoding layers, and one output layer, which operate on feature maps at different resolutions. It also includes a temporal aggregation block for compressing the temporal dimension and an edge refinement block based on gradient information. The edge refinement block includes a gradient extraction layer, a feature fusion layer, an attention modulation layer, and a refined output layer. The specific processing steps of the spatial feature reconstruction module are as follows:
Encoding layer 1: The feature map of size T×M×N×H obtained by the attention fusion module is taken as input and passed through a three-dimensional convolutional layer, a batch normalization layer, and a non-linear activation function layer, outputting a feature map of size T×M×N×32.
Encoding layer 2: The feature map output by encoding layer 1 is taken as input and passed through a three-dimensional convolutional layer, with downsampling performed in the spatial dimension, to output a feature map with 48 channels. After a batch normalization layer and a non-linear activation function layer, a feature map of size T×(M / 2)×(N / 2)×48 is output.
Encoding layer 3: The feature map output by encoding layer 2 is taken as input and passed through a three-dimensional convolutional layer, with downsampling performed in the spatial dimension, to output a feature map with 64 channels. After a batch normalization layer and a non-linear activation function layer, a feature map of size T×(M / 4)×(N / 4)×64 is output.
Temporal aggregation block: The feature maps output by encoding layer 1, encoding layer 2, and encoding layer 3 are taken as input, and weighted aggregation is performed in the temporal dimension to compress the temporal dimension T to 1. The output consists of three feature maps of sizes 1×M×N×32, 1×(M / 2)×(N / 2)×48, and 1×(M / 4)×(N / 4)×64, respectively.
Decoding layer 1: The encoding layer 3 feature map output by the temporal aggregation block is taken as input. After a bilinear upsampling layer, its spatial size is restored to (M / 2)×(N / 2). It is then concatenated in the channel dimension with the encoding layer 2 feature map output by the temporal aggregation block, giving a concatenated feature map with 112 channels. This passes through a deconvolution layer and a non-linear activation function layer, followed by a two-dimensional convolutional layer and a non-linear activation function layer, outputting a feature map of size (M / 2)×(N / 2)×48.
Decoding layer 2: The feature map output by decoding layer 1 is taken as input and concatenated in the channel dimension with the encoding layer 1 feature map output by the temporal aggregation block, giving a concatenated feature map with 80 channels. This passes through a deconvolution layer and a non-linear activation function layer, then a two-dimensional convolutional layer and a non-linear activation function layer, and finally a bilinear upsampling layer that restores the spatial size to M×N. The output feature map has a size of M×N×32.
Output layer: The feature map output by decoding layer 2 is taken as input and passed through a two-dimensional convolutional layer and a channel compression convolutional layer, outputting a feature map of size M×N×1.
Gradient extraction layer: The feature map output by the output layer is taken as input. The difference between adjacent pixels in the horizontal and vertical directions is calculated to obtain two sets of gradient feature maps, which are then fused to output a feature map of size M×N×1.
Feature fusion layer: The feature map output by the gradient extraction layer is taken as input and concatenated in the channel dimension with the feature map output by the output layer, outputting a fused feature map of size M×N×2.
Attention modulation layer: The feature map output by the feature fusion layer is taken as input and passed first through a two-dimensional convolutional layer and a non-linear activation function layer, then through another two-dimensional convolutional layer and a non-linear activation function layer, to generate the corresponding weight feature map. The fused feature map is then weighted element-wise by the weight feature map, outputting a feature map of size M×N×2.
Refined output layer: The feature map output by the attention modulation layer is taken as input and passed through a two-dimensional convolutional layer, outputting a refined feature map of size M×N×1.
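The feature-map sizes stated in this claim can be checked with a small bookkeeping sketch. It only records the sizes given in the text and verifies the channel arithmetic of the two skip connections. The function name `trace_shapes` and the example value T=5 are assumptions, and no layer is actually computed.

```python
def trace_shapes(T, M, N):
    """Record the feature-map sizes stated in the text (channels last)."""
    enc = {
        "enc1": (T, M, N, 32),
        "enc2": (T, M // 2, N // 2, 48),
        "enc3": (T, M // 4, N // 4, 64),
    }
    # Temporal aggregation compresses T to 1 at every scale.
    agg = {name: (1,) + shape[1:] for name, shape in enc.items()}
    return {
        **enc,
        "dec1_concat_channels": agg["enc3"][3] + agg["enc2"][3],   # 64 + 48
        "dec1": (M // 2, N // 2, 48),
        "dec2_concat_channels": 48 + agg["enc1"][3],               # 48 + 32
        "dec2": (M, N, 32),
        "output": (M, N, 1),
        "fused": (M, N, 2),
        "refined": (M, N, 1),
    }

shapes = trace_shapes(T=5, M=256, N=256)
print(shapes["dec1_concat_channels"], shapes["dec2_concat_channels"])   # 112 80
```

For a 256×256 input, the deepest encoder output has a spatial size of 64×64, and the concatenated channel counts match the 112 and 80 channels given for decoding layers 1 and 2.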

5. The multi-frame infrared small target detection method based on attention fusion encoder-decoder network according to claim 1, characterized in that the structure-aware hybrid loss function is a weighted combination of the weighted binary cross-entropy loss, the Dice loss, the Focal Tversky loss, and the structural similarity loss.

6. The multi-frame infrared small target detection method based on attention fusion encoder-decoder network according to claim 5, characterized in that the weighted binary cross-entropy loss is defined in terms of the predicted image and the ground truth image, the pixel index of the image, and the weighting coefficient for positive samples.

7. The multi-frame infrared small target detection method based on attention fusion encoder-decoder network according to claim 5, characterized in that the Dice loss includes a very small constant to prevent the denominator from being 0.

8. The multi-frame infrared small target detection method based on attention fusion encoder-decoder network according to claim 5, characterized in that the Focal Tversky loss is defined in terms of two weighting parameters, for false positives and false negatives respectively, and an exponential modulation coefficient.

9. The multi-frame infrared small target detection method based on attention fusion encoder-decoder network according to claim 5, characterized in that the structural similarity loss is defined in terms of the mean values of the predicted image and the ground truth image within the local window, the corresponding standard deviations, and two constants, each set to a fixed value, that are used to avoid the denominator being zero.