A method for detecting faint targets based on non-steady-state clutter suppression
By using a frequency domain decoupling method with learnable filters and bidirectional attention modules, the problem of background clutter interference in infrared small target detection is solved, achieving higher detection accuracy and better retention of target information.
Patent Information
- Authority / Receiving Office: CN · China
- Patent Type: Patents (China)
- Current Assignee / Owner: HARBIN INST OF TECH
- Filing Date: 2024-03-29
- Publication Date: 2026-06-30
AI Technical Summary
In infrared small target detection, existing methods struggle to effectively distinguish targets from complex backgrounds, especially since infrared images often contain background clutter with similar energy and structure to the target, leading to decreased detection accuracy.
Two learnable filters are used to extract the target-specific spectrum and the target-background consistency spectrum. Frequency domain decoupling is achieved through a bidirectional interactive attention module and a contrast loss function. Features are extracted by combining a densely connected U-Net network, and target detection is performed using the branch structure of the contrast head and the detection head.
It effectively suppresses background clutter, improves the accuracy and performance of infrared small target detection, and is especially able to better preserve target information in complex backgrounds.
[Figure: abstract drawing, CN118212401B_ABST]
Abstract
Description
Technical Field
[0001] This invention belongs to the field of target detection and recognition technology, and specifically relates to a method for detecting faint targets based on non-steady-state clutter suppression.
Background Technology
[0002] Infrared imaging is less affected by lighting and weather conditions, allowing infrared payloads to operate continuously around the clock and in all weather conditions. It has been widely applied in remote sensing mapping, disaster detection, and small target detection. Infrared small target detection plays a crucial role in imaging guidance, target surveillance, and weather forecasting. Due to the random distribution of relative motion between the detection platform and the target, single-frame-based infrared small target detection has broader application prospects, but it still faces challenges such as complex real-world application scenarios and abundant interference clutter.
[0003] Compared with surface targets in natural visible light scenes, infrared small target detection has the following difficulties: (1) Few target features: Infrared small targets are small (the number of target pixels is several to dozens), weak (the signal-to-clutter ratio of the target is low), and feature-scarce (lacking shape and texture features). Therefore, the target does not have the same visual salience as visible light surface targets; (2) Complex imaging scene: Infrared imaging relies only on the radiation and reflection of the target and background in the scene. The image lacks color channels, and the imaging scene has a large number of structural backgrounds that are similar to the target's radiation energy and apparent scale, making it difficult to effectively distinguish the target from the complex background.
[0004] Currently, infrared small target detection methods are mainly divided into traditional methods based on manually designed features and data-driven deep learning methods. Traditional methods make assumptions about the target's energy and scale, and distinguish the target from the background based on these assumptions. However, in actual application scenarios, the shape, size, and energy of the target are dynamically changing, and various backgrounds also have different characteristics. Therefore, it is difficult to effectively distinguish the target from the background with limited prior information.
[0005] Deep learning methods focus on the high-level abstract semantic features of infrared small targets, enhancing target information at a deeper semantic level. This allows for the extraction of deep features that are difficult to extract using traditional methods, thereby improving detection probability. Previous deep learning methods focused on the inherent characteristics of targets, primarily considering target enhancement. However, infrared images contain background clutter with similar energy and structure to the target. During target enhancement, clutter signals are easily amplified and propagated to deeper features during multi-layer feature fusion, so that background clutter with characteristics similar to the target is detected as a target. Furthermore, it is difficult to balance the shallow local details of infrared small targets with the deep global semantic information of the background during feature fusion.
Summary of the Invention
[0006] To address the challenge of distinguishing between small targets and complex backgrounds in infrared images due to their high spatial similarity, this invention provides a method for detecting faint targets based on non-stationary clutter suppression. This method uses two learnable filter templates to extract the unique spectrum of the infrared small target and the target-background consistency spectrum by filtering the Fourier domain of the original image. The latter represents interference noise, primarily non-stationary clutter, present in most small target extraction methods. Building upon this, a bidirectional interactive module is used to modulate deep global features and shallow local features to extract both target-specific and target-background consistency features. Then, a dual-branch network structure—a contrast head and a detection head—is used to train the network. The former uses the difference between the spectra as a contrast loss function to fully decouple the two spectra, while the latter enables target detection within the fused feature module, improving the accuracy of infrared small target detection.
[0007] To achieve the above objectives, the technical solution adopted by the present invention is as follows:
[0008] A method for detecting faint targets based on non-steady-state clutter suppression, the method being:
[0009] Step 1: Use two learnable filters to extract the target-specific spectrum and the target-background consistency spectrum of the input infrared image;
[0010] Step 2: A densely connected U-Net network with residual connection modules is used as the backbone network. This network uses a bidirectional interactive attention mechanism module to extract features from the spectral components obtained in Step 1, and obtain target-specific features and target-background consistency features.
[0011] Step 3: Apply an instance-level contrastive loss to the two features obtained in Step 2 to assist the learning of the filter, thereby decoupling the target-specific features from the target-background consistency features;
[0012] Step 4: Supervise the target-specific features obtained in Step 2 using ground-truth annotations, and perform the faint target detection task.
[0013] Furthermore, step one specifically includes:
[0014] The input infrared image $I \in \mathbb{R}^{H \times W \times C}$ is transformed to the frequency domain using a fast Fourier transform:
[0015] $$I^F = F(I)(u,v) = \sum_{x=0}^{H-1} \sum_{y=0}^{W-1} I(x,y)\, e^{-j 2\pi \left( \frac{ux}{H} + \frac{vy}{W} \right)}$$
[0016] where $I$ represents the input infrared image, $\mathbb{R}$ represents the real number field, $H$ represents the height of the image, $W$ represents the width of the image, $C$ represents the number of channels of the image, the superscript $F$ represents the Fourier transform, $x$ and $y$ represent the coordinate positions in the image, $e$ represents the base of the natural logarithm, $j$ represents the imaginary unit, and $u$ and $v$ represent the coordinate positions of the image after the Fourier transform.
[0017] The above formula can be written as:
[0018] $$I^F = F(I)(u,v) = T(I)(u,v) + jC(I)(u,v)$$
[0019] where $F(I)(u,v)$ represents the result of the Fourier transform of image $I$, and $T(I)(u,v)$ and $C(I)(u,v)$ represent the real and imaginary parts of $F(I)(u,v)$ respectively. Therefore, the amplitude spectrum $I^A \in \mathbb{R}^{H \times W \times C}$ and the phase spectrum $I^P \in \mathbb{R}^{H \times W \times C}$ of infrared image $I$ are expressed as follows:
[0020] $$I^A = \left[ T^2(I)(u,v) + C^2(I)(u,v) \right]^{1/2}$$
[0021] $$I^P = \arctan\left[ \frac{C(I)(u,v)}{T(I)(u,v)} \right]$$
[0022] The amplitude spectrum describes how much signal is present at each frequency, and infrared small targets differ from the image background in their frequency content. Therefore, this invention designs two learnable filters $\xi_{ts}$ and $\xi_{tg}$ to extract, from the amplitude spectrum of the original image, the target-specific component that is beneficial for the infrared small target detection task and the target-background consistency component that is detrimental to it. The values in these two filters are continuous variables from 0 to 1. After filtering, the amplitude spectrum $I^A$ is decomposed into the following two parts:
[0023] $$I^A_{ts} = \xi_{ts} \odot I^A$$
[0024] $$I^A_{tg} = \xi_{tg} \odot I^A$$
[0025] where $\odot$ denotes element-wise multiplication. After the target-specific amplitude spectrum $I^A_{ts}$ and the target-background consistency amplitude spectrum $I^A_{tg}$ are obtained, each is combined with the phase spectrum $I^P$ and passed through an inverse Fourier transform to obtain the target-specific component $I_{ts}$ and the target-background consistency component $I_{tg}$:
[0026] $$I_{ts} = F^{-1}\left( I^A_{ts} \odot e^{1j \cdot I^P} \right)$$
[0027] $$I_{tg} = F^{-1}\left( I^A_{tg} \odot e^{1j \cdot I^P} \right)$$
[0028] where $F^{-1}$ denotes the inverse Fourier transform and $1j$ denotes the imaginary unit.
[0029] Furthermore, step two specifically involves:
[0030] (1) A densely connected U-Net network with residual connection modules is used as the backbone network (UDN) for feature extraction;
[0031] (2) The specific structure of this backbone network UDN is as follows:
[0032] For a given node $(i,j)$ ($i,j = 1,2,3$), its features originate from skip connection layers and adjacent layers. The skip connection layers directly fuse features using a concatenation operation, while adjacent layers modulate each other, and the features are finally concatenated to form the feature $X^{(i,j)}$ of node $(i,j)$;
[0033] (3) The input target-specific spectral component $I_{ts} \in \mathbb{R}^{H \times W \times C}$ and target-background consistency spectral component $I_{tg} \in \mathbb{R}^{H \times W \times C}$ are passed through the backbone network to extract the features $PT_i \in \mathbb{R}^{h \times w \times c}$ and $PB_i \in \mathbb{R}^{h \times w \times c}$ ($i = 0, 1, 2, 3, 4$), where $h$, $w$, and $c$ represent the height, width, and number of channels, respectively:
[0034] $$PT_i = \mathrm{UDN}(I_{ts})_i, \quad PB_i = \mathrm{UDN}(I_{tg})_i$$
[0035] (4) The features extracted by the backbone network are fed into two branches, namely the contrast head and the detection head. The contrast head maximizes the difference between the infrared small target features and the background features at each layer, thereby guiding the learning of the two filter templates. The detection head is supervised by ground-truth annotations to perform the infrared small target detection task.
[0036] Furthermore, in (2),
[0037] ① Skip-layer feature fusion: Features from skip connections contain semantic information at different granularities. Directly concatenating these features loses no information and provides more global contextual information. The result of skip-layer fusion is expressed as follows:
[0038]
[0039] In the above formula, cat represents the concatenation operation, and $X^{(i,j-2)}$, $X^{(i,j-3)}$, and $X^{(i,0)}$ are features from different layers. Concatenating these three features yields the skip-connection feature.
[0040] ② Feature fusion between adjacent layers: The adjacent layers of node $(i,j)$ provide the same-level feature $X^{(i,j-1)}$, the deep feature $X^{(i+1,j-1)}$, and the shallow feature $X^{(i-1,j)}$. Shallow features modulate deep features through a local channel attention mechanism, enhancing the details of infrared small targets in the deep semantic information. Deep features, in turn, modulate shallow features through a global self-attention mechanism, enhancing the high-level semantics in the shallow information. The result of fusing features from adjacent layers is composed of three parts, which are:
[0041]
[0042] where the three parts denote, respectively, the features originating from the shallow layer, the features originating from the deep layer, and the features originating from the same layer. conv with different subscripts represents 1×1 convolutions with different weights, cat represents the concatenation operation, Trans represents global attention encoding, M represents max pooling with a stride of 2, σ is a function, B represents batch normalization, δ represents a non-linear activation function, U represents upsampling, and the product symbol denotes point-by-point multiplication. Therefore, the adjacent-layer fusion result can be represented as:
[0043]
[0044] The output features of nodes $(i,0)$ and $(0,j)$ are represented as follows:
[0045]
[0046] In the formula, R represents the residual connection, and the final output feature of node $X^{(i,j)}$ is:
[0047]
[0048] Furthermore, step three specifically includes:
[0049] After the target-specific features $PT_i$ and the target-background consistency features $PB_i$ of the i-th layer are obtained, they are upsampled to the same height and width as the original input. Then, based on the ground-truth annotations, the instance-level features of each target and the background features are extracted from $PT_i$ and $PB_i$, giving target features and background features for each of the two frequency-domain-decoupled components, where n represents the number of targets in the image. Since different targets have different spatial scales, their corresponding feature dimensions differ. Therefore, a feature alignment operation is used to align all instance-level features from the target and the background:
[0050]
[0051]
[0052] In the formula, RoIAlign represents feature alignment, d is the output dimension of the feature alignment operation, and c is the number of channels. Then, a multilayer perceptron projection function g(·) containing two hidden layers is used to map the features to the space for calculating the contrastive loss:
[0053]
[0054]
[0055] The projected features are, respectively, the target features and the background features of the target-specific component, and the target features and the background features of the target-background consistency component, all of which are used to calculate the contrastive loss. The left-hand sides of the above formulas belong to the feature space used to calculate the contrastive loss: the left-hand side of the first formula holds the instance-level features of the target-specific features, and the left-hand side of the second formula holds the instance-level features of the target-background consistency features.
[0056] In practice, we aim to achieve a high similarity between target features and dissimilarity between the target and other features within the target-specific component, thereby decoupling the target from the background in the frequency domain. Therefore, for the feature vector in the above equation, we rearrange it into the target feature T and other features O within the target-specific component:
[0057]
[0058]
[0059] To bring features within T closer together and push features between T and O further apart, the following contrast loss is designed:
[0060]
[0061] where n represents the number of targets, and the three vector symbols in the formula denote, respectively, a feature vector in T, a different feature vector in T, and a feature vector in O. sim represents the cosine similarity metric: for feature vectors a and b, sim(a,b) = (a·b) / (||a||·||b||), where a and b are the two inputs to the sim function. τ is the temperature hyperparameter. Any two features in T form a positive sample pair, and a feature in T paired with a feature in O forms a negative sample pair. The contrastive loss is calculated for each level of features output by the U-shaped dense network, and the final total contrastive loss is as follows:
[0062]
[0063] The above contrastive loss guides the learning of the two learnable filter templates, thereby decoupling the target-specific features from the target-background consistency features.
[0064] Furthermore, step four specifically involves:
[0065] For the detection head, to address the imbalance between positive and negative samples between small infrared targets and the background, soft-IOU loss is used as the model's loss function:
[0066]
[0067] where G(x,y) represents the confidence map of infrared small targets inferred by the model, and P(x,y) represents the ground-truth label. Combining this with the contrastive loss $L_{con}$ gives the final overall loss function:
[0068] $$L_{det} = \lambda_1 L_{soft\text{-}IOU} + \lambda_2 L_{con}$$
[0069] Wherein, λ1 and λ2 are two coefficients used to balance the detection and decoupling tasks;
[0070] The network is trained end to end. During inference, the contrast head branch is removed and only the detection head branch is retained, which performs the infrared small target detection task on the target-specific components.
[0071] Compared with the prior art, the present invention has the following advantages:
[0072] (1) To address the problem that background clutter similar to the target shape and structure is difficult to suppress, this invention proposes a frequency domain decoupling method. This method uses two learnable filters to extract the target-specific spectrum that is beneficial to target detection and the target background consistency spectrum that is not beneficial to target detection. Since different spectra have different effects on the infrared small target detection results, retaining the effective spectrum can improve the target detection performance and achieve effective removal of complex background.
[0073] (2) To address the problem of uncertain learning of filter templates, this invention constructs an instance-level contrastive loss function. This loss function maximizes the instance-level feature difference between the outputs of the two spectra, thereby guiding the learning direction of the two filter templates and obtaining high-precision frequency domain decoupling results, making the decoupling more complete and accurate.
[0074] (3) To address the loss of target detail information in the deep features of infrared small target detection networks, this invention proposes a bidirectional attention mechanism module. It uses the mutual modulation of deep and shallow features to fuse deep global semantic information with shallow local detail information, balancing the differentiated semantic information between the target and the background while retaining the infrared small target information in the deep features. Infrared small target information is thus retained at every feature level, which improves target detection performance.
Attached Figure Description
[0075] Figure 1 is a flowchart of the method for detecting faint targets based on non-steady-state clutter suppression;
[0076] Figure 2 is a detailed structural diagram of the U-shaped dense network;
[0077] Figure 3 is a diagram of the bidirectional feature fusion module.
Detailed Implementation
[0078] The technical solution of the present invention will be further described below with reference to the accompanying drawings and embodiments, but it is not limited thereto. Any modifications or equivalent substitutions to the technical solution of the present invention that do not depart from the spirit and scope of the technical solution of the present invention should be covered within the protection scope of the present invention.
[0079] Example 1:
[0080] This invention provides a method for detecting faint targets based on non-steady-state clutter suppression. As shown in Figure 1, the specific implementation steps of the method are as follows:
[0081] Step 1: Extract the target-specific spectrum and target-background consistency spectrum from the input infrared image using two learnable filters. The specific steps are as follows:
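[0082] The input infrared image $I \in \mathbb{R}^{H \times W \times C}$ is transformed to the frequency domain using a fast Fourier transform:
[0083] $$I^F = F(I)(u,v) = \sum_{x=0}^{H-1} \sum_{y=0}^{W-1} I(x,y)\, e^{-j 2\pi \left( \frac{ux}{H} + \frac{vy}{W} \right)}$$
[0084] The above formula can be written as:
[0085] $$I^F = F(I)(u,v) = T(I)(u,v) + jC(I)(u,v)$$
[0086] where $T(I)(u,v)$ and $C(I)(u,v)$ represent the real and imaginary parts of $F(I)(u,v)$ respectively. Therefore, the amplitude spectrum $I^A \in \mathbb{R}^{H \times W \times C}$ and the phase spectrum $I^P \in \mathbb{R}^{H \times W \times C}$ of infrared image $I$ can be expressed as follows:
[0087] $$I^A = \left[ T^2(I)(u,v) + C^2(I)(u,v) \right]^{1/2}$$
[0088] $$I^P = \arctan\left[ \frac{C(I)(u,v)}{T(I)(u,v)} \right]$$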
[0089] The amplitude spectrum describes how much signal is present at each frequency, and infrared small targets differ from the image background in their frequency content. Therefore, this invention designs two learnable filters $\xi_{ts}$ and $\xi_{tg}$ to extract, from the amplitude spectrum of the original image, the target-specific component that is beneficial for the infrared small target detection task and the target-background consistency component that is detrimental to it. The values in these two filters are continuous variables from 0 to 1. After filtering, the amplitude spectrum $I^A$ is decomposed into the following two parts:
[0090] $$I^A_{ts} = \xi_{ts} \odot I^A$$
[0091] $$I^A_{tg} = \xi_{tg} \odot I^A$$
[0092] In the formula, $\odot$ denotes element-wise multiplication. After the target-specific amplitude spectrum $I^A_{ts}$ and the target-background consistency amplitude spectrum $I^A_{tg}$ are obtained, each is combined with the phase spectrum $I^P$ and passed through an inverse Fourier transform to obtain the target-specific component $I_{ts}$ and the target-background consistency component $I_{tg}$:
[0093] $$I_{ts} = F^{-1}\left( I^A_{ts} \odot e^{1j \cdot I^P} \right)$$
[0094] $$I_{tg} = F^{-1}\left( I^A_{tg} \odot e^{1j \cdot I^P} \right)$$
[0095] In the formula, $F^{-1}$ denotes the inverse Fourier transform and $1j$ denotes the imaginary unit.
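The frequency-domain decoupling of Step 1 can be illustrated with a short Python sketch. It is a minimal illustration under stated assumptions, not the patented implementation: it uses a naive discrete Fourier transform on a tiny single-channel image, fixed filter values instead of learned ones, and the hypothetical choice of making the second filter the complement of the first (the text only requires both filters to take values from 0 to 1). The names `dft2` and `decouple` are illustrative.

```python
import cmath
import math

def dft2(grid, inverse=False):
    """Naive 2-D DFT of an H x W list of lists (only suitable for tiny demos)."""
    H, W = len(grid), len(grid[0])
    sign = 1 if inverse else -1
    out = [[0j] * W for _ in range(H)]
    for u in range(H):
        for v in range(W):
            acc = 0j
            for x in range(H):
                for y in range(W):
                    angle = sign * 2 * math.pi * (u * x / H + v * y / W)
                    acc += grid[x][y] * cmath.exp(1j * angle)
            # The inverse transform carries the 1/(H*W) normalization.
            out[u][v] = acc / (H * W) if inverse else acc
    return out

def decouple(img, xi_ts, xi_tg):
    """Filter the amplitude spectrum with two masks and return both spatial components."""
    H, W = len(img), len(img[0])
    spec = dft2(img)
    components = []
    for xi in (xi_ts, xi_tg):
        # Filtered amplitude recombined with the unchanged phase spectrum.
        filtered = [[xi[u][v] * abs(spec[u][v]) * cmath.exp(1j * cmath.phase(spec[u][v]))
                     for v in range(W)] for u in range(H)]
        # Keep the real part; it is the whole signal when the filter is symmetric.
        components.append([[z.real for z in row] for row in dft2(filtered, inverse=True)])
    return components[0], components[1]

# Toy image: flat background with one bright pixel standing in for a small target.
img = [[1.0, 1.0, 1.0, 1.0],
       [1.0, 5.0, 1.0, 1.0],
       [1.0, 1.0, 1.0, 1.0],
       [1.0, 1.0, 1.0, 1.0]]
H, W = 4, 4
# Assumed filter values: damp the zero-frequency (flat background) term in the target branch.
xi_ts = [[0.1 if (u, v) == (0, 0) else 0.8 for v in range(W)] for u in range(H)]
# Assumption for this demo only: complementary second filter.
xi_tg = [[1.0 - xi_ts[u][v] for v in range(W)] for u in range(H)]

i_ts, i_tg = decouple(img, xi_ts, xi_tg)
```

With complementary filters the two components add back up to the input image, which makes the decomposition easy to check. In the method described above the filter values are learned, guided by the contrastive loss of Step 3, rather than fixed by hand.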
[0096] Step 2: A densely connected U-Net network with residual connection modules is used as the backbone network. This network employs a bidirectional interactive attention mechanism to extract features from the spectral components obtained in Step 1, resulting in target-specific features and target-background consistency features. The specific steps are as follows:
[0097] (1) As shown in Figure 2, a densely connected U-Net network with residual connection modules is first used as the backbone network (UDN) for feature extraction.
[0098] (2) As shown in Figure 3, the specific structure of the network is as follows:
[0099] For a given node $(i,j)$ ($i,j = 1,2,3$), its features originate from skip connection layers and adjacent layers. The skip connection layers directly fuse features using a concatenation operation, while adjacent layers modulate each other, and the features are finally concatenated to form the feature $X^{(i,j)}$ of node $(i,j)$. The specific description is as follows:
[0100] ① Skip-layer feature fusion: Features from skip connections contain semantic information at different granularities. Directly concatenating these features loses no information and provides more global contextual information. The result of skip-layer fusion can be expressed as follows:
[0101]
[0102] In the above formula, cat represents the concatenation operation.
[0103] ② Feature fusion between adjacent layers: The adjacent layers of node $(i,j)$ provide the same-level feature $X^{(i,j-1)}$, the deep feature $X^{(i+1,j-1)}$, and the shallow feature $X^{(i-1,j)}$. Shallow features modulate deep features through a local channel attention mechanism, enhancing the details of infrared small targets in the deep semantic information. Deep features, in turn, modulate shallow features through a global self-attention mechanism, enhancing the high-level semantics in the shallow information. The result of fusing features from adjacent layers is obtained by concatenating three parts, as shown in Figure 2. These three parts are:
[0104]
[0105] In the formula, cat represents the concatenation operation, conv represents the 1×1 convolution, Trans represents the global attention encoding, M represents the max pooling operation with a stride of 2, σ is a function, B represents batch normalization, δ represents the non-linear activation function, U represents the upsampling operation, and the product symbol denotes point-by-point multiplication. Therefore, the adjacent-layer fusion result can be represented as:
[0106]
[0107] For the nodes marked with red and purple dashed lines in Figure 3, fewer feature sources are fused. The features in the red dashed lines originate from the downsampling operation of the previous layer, while the features in the purple dashed lines originate from skip connections and upsampling of deep features. Therefore, the output features of nodes $(i,0)$ and $(0,j)$ can be represented as:
[0108]
[0109] In the formula, R represents the residual connection, and the final output feature of node $X^{(i,j)}$ is:
[0110]
[0111] (3) The input target-specific spectral component $I_{ts} \in \mathbb{R}^{H \times W \times C}$ and target-background consistency spectral component $I_{tg} \in \mathbb{R}^{H \times W \times C}$ are passed through the backbone network to extract the features $PT_i \in \mathbb{R}^{h \times w \times c}$ and $PB_i \in \mathbb{R}^{h \times w \times c}$ ($i = 0, 1, 2, 3, 4$), where $h$, $w$, and $c$ represent the height, width, and number of channels, respectively:
[0112] $$PT_i = \mathrm{UDN}(I_{ts})_i, \quad PB_i = \mathrm{UDN}(I_{tg})_i$$
[0113] (4) The features extracted by the backbone network are fed into two branches, namely the contrast head and the detection head. The contrast head maximizes the difference between the infrared small target features and the background features at each layer, thereby guiding the learning of the two filter templates. The detection head is supervised by ground-truth annotations to perform the infrared small target detection task.
[0114] Step 3: Apply an instance-level contrastive loss to the two features obtained in Step 2 to assist in filter learning, thereby decoupling the target-specific features from the target-background consistency features. The specific steps are as follows:
[0115] After the target-specific features $PT_i$ and the target-background consistency features $PB_i$ of the i-th layer are obtained, they are upsampled to the same height and width as the original input. Then, based on the ground-truth annotations, the instance-level features of each target and the background features are extracted from $PT_i$ and $PB_i$, giving target features and background features for each of the two frequency-domain-decoupled components, where n represents the number of targets in the image. Since different targets have different spatial scales, their corresponding feature dimensions differ. Therefore, a feature alignment operation is used to align all instance-level features from the target and the background:
[0116]
[0117]
[0118] In the formula, RoIAlign represents feature alignment, d is the output dimension of the feature alignment operation, and c is the number of channels. Then, a multilayer perceptron projection function g(·) with two hidden layers is used to map the features to the space for calculating the contrastive loss:
[0119]
[0120]
[0121] In practice, we aim to achieve a high similarity between target features and dissimilarity between the target and other features within the target-specific component, thereby decoupling the target from the background in the frequency domain. Therefore, for the feature vector in the above equation, we rearrange it into the target feature T and other features O within the target-specific component:
[0122]
[0123]
[0124] To bring features within T closer together and push features between T and O further apart, we designed the following contrast loss:
[0125]
[0126] In the formula, sim represents the cosine similarity measure: for feature vectors a and b, sim(a,b) = (a·b) / (||a||·||b||). τ is the temperature hyperparameter. Any two features in T form a positive sample pair, and a feature in T paired with a feature in O forms a negative sample pair. The contrastive loss is calculated for each level of features output by the U-shaped dense network. The final total contrastive loss is as follows:
[0127]
[0128] The above contrastive loss guides the learning of the two learnable filter templates, thereby decoupling the target-specific features from the target-background consistency features.
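The pull-together and push-apart behavior described for Step 3 can be sketched in Python. The exact loss formula is not reproduced in this text, so the sketch assumes a common InfoNCE-style form built from the cosine similarity and temperature defined above; the function names, the averaging over pairs, and the default value of `tau` are illustrative assumptions rather than the patented formula.

```python
import math

def sim(a, b):
    """Cosine similarity (a.b) / (||a|| * ||b||); both vectors are assumed non-zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def contrastive_loss(T, O, tau=0.1):
    """Assumed InfoNCE-style loss: pairs inside T are positives, (T, O) pairs are negatives."""
    total, pairs = 0.0, 0
    for k, anchor in enumerate(T):
        # Negative term: similarity of the anchor to every feature in O.
        neg = sum(math.exp(sim(anchor, o) / tau) for o in O)
        for m, other in enumerate(T):
            if m == k:
                continue  # an anchor is not paired with itself
            pos = math.exp(sim(anchor, other) / tau)
            total += -math.log(pos / (pos + neg))
            pairs += 1
    return total / pairs if pairs else 0.0

# Hypothetical projected features: two target vectors and two "other" vectors.
T = [[1.0, 0.0], [0.9, 0.1]]
O_far = [[0.0, 1.0], [-1.0, 0.0]]     # well separated from the targets
O_near = [[1.0, 0.05], [0.95, 0.1]]   # almost identical to the targets

loss_far = contrastive_loss(T, O_far)
loss_near = contrastive_loss(T, O_near)
```

The loss is small when the features in O are far from the target features and larger when they resemble them, which is the signal that steers the two filters toward separating target and background. A per-level loss of this kind would be computed for each feature level and combined into the total contrastive loss.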
[0129] Step 4: Supervise the target-specific features obtained in Step 2 using real-world annotations, and perform the infrared small target detection task. The specific steps are as follows:
[0130] For the detection head, to address the imbalance between positive and negative samples between small infrared targets and the background, we use soft-IOU loss as the model's loss function:
[0131]
[0132] In the formula, G(x,y) represents the confidence map of the infrared small target inferred by the model, and P(x,y) represents the ground-truth label. Combining this with the contrastive loss $L_{con}$ gives the final overall loss function:
[0133] $$L_{det} = \lambda_1 L_{soft\text{-}IOU} + \lambda_2 L_{con}$$
[0134] In the formula, $\lambda_1$ and $\lambda_2$ are two coefficients used to balance the detection and decoupling tasks.
[0135] The network is trained end to end. During inference, the contrast head branch is removed and only the detection head branch is retained, which performs the infrared small target detection task on the target-specific components.
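The detection loss and the weighted total loss of Step 4 can be sketched as follows. The soft-IOU formula itself is not reproduced in this text, so the sketch assumes the widely used form, one minus the ratio of the soft intersection to the soft union of the confidence map G and the label P. The smoothing term `eps` and the example weights are hypothetical.

```python
def soft_iou_loss(G, P, eps=1e-6):
    """Assumed soft-IOU form: 1 - sum(G*P) / sum(G + P - G*P) over all pixels."""
    inter = 0.0
    union = 0.0
    for g_row, p_row in zip(G, P):
        for g, p in zip(g_row, p_row):
            inter += g * p
            union += g + p - g * p
    # eps keeps the ratio defined when both maps are empty.
    return 1.0 - (inter + eps) / (union + eps)

def total_loss(l_soft_iou, l_con, lam1=1.0, lam2=0.1):
    """Weighted sum L_det = lam1 * L_soft-IOU + lam2 * L_con (weights are placeholders)."""
    return lam1 * l_soft_iou + lam2 * l_con

# Ground truth with a single target pixel in a 3 x 3 image.
P = [[0.0, 0.0, 0.0],
     [0.0, 1.0, 0.0],
     [0.0, 0.0, 0.0]]
G_perfect = [row[:] for row in P]            # prediction equal to the label
G_empty = [[0.0] * 3 for _ in range(3)]      # prediction that misses the target

loss_perfect = soft_iou_loss(G_perfect, P)
loss_empty = soft_iou_loss(G_empty, P)
combined = total_loss(0.5, 2.0, lam1=1.0, lam2=0.1)
```

Because the ratio depends only on overlap, the many background pixels that are correctly predicted as zero do not dilute the loss, which is why this kind of loss suits the strong imbalance between small targets and background.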
Claims
1. A method for detecting faint targets based on non-steady-state clutter suppression, characterized in that: The method is as follows: Step 1: Use two learnable filters to extract the target-specific spectrum and the target-background consistency spectrum of the input infrared image; Step Two: A densely connected U-Net network with residual connection modules is used as the backbone network. This network employs a bidirectional interactive attention mechanism module to extract features from the spectral components obtained in Step One, and obtains target-specific features and target-background consistency features. Specifically, Step Two involves: (1) A densely connected U-Net network with residual connection modules is used as the backbone network for feature extraction; (2) The specific structure of this backbone network UDN is as follows: For a certain node , =1,2,3, whose feature sources include skip connection layers and adjacent layers. The skip connection layers directly fuse features through a splicing operation, while adjacent layers modulate each other to ultimately complete feature splicing and form nodes. Features ; ① Skip-layer feature fusion: Features from skip connections contain semantic information at different granularities. By directly concatenating features through connection operations, information is not lost, and more global contextual information is provided. The result of skip-layer fusion is... The expression is as follows: In the above formula, cat represents the concatenation operation; , , These refer to features from different layers. After concatenating these three features, the output is a feature with skip connections. ②Feature fusion between adjacent layers: adjacent layers contain nodes Same level features Deep features and shallow features Shallow features modulate deep features through a local channel attention mechanism, enhancing the details of small infrared targets in the deep semantic information. 
Deep features, in turn, modulate shallow features through a global self-attention mechanism, enhancing the high-level semantics in the shallow information. The result of fusing features from adjacent layers... It is composed of 3 parts, which are: in, express Features originating from shallow layers express Features originating from deep within express Features from the same layer are represented by conv with different subscripts, indicating 1×1 convolutions with different weights. cat indicates a concatenation operation, and Trans represents global attention encoding. This is a max pooling operation with a step size of 2. For functions, normalization Indicates batch normalization, Represents a non-linear activation function. For upsampling operation, This indicates point-by-point multiplication; therefore, It can be represented as: For nodes Output characteristics , , They are represented as follows: In the formula Represents a residual connection, the final node. The output features are: (3) For the input target-specific spectral components and target background consistency spectral components Features are extracted after passing through the backbone network. and , =0,1,2,3,4, where , 'c' represents the height, width, and number of channels, respectively. (4) The features extracted by the backbone network are divided into two branches, namely the comparison head and the detection head. The comparison head is used to ensure that the difference between the features of each layer of infrared small targets and the background features is maximized, thereby guiding the learning of the two filter templates. The detection head is supervised by real annotations and is used to perform the infrared small target detection task. 
Step 3: Apply an instance-level contrastive loss to the two kinds of features obtained in Step 2 to assist the learning of the filters, thereby decoupling the target-specific features from the target-background consistency features;
Step 4: Supervise the target-specific features obtained in Step 2 with ground-truth annotations, and perform the faint target detection task.
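The shallow-to-deep modulation and the concatenation described in Step 2 can be illustrated with a deliberately simplified NumPy sketch. It is not the claimed module: the sigmoid gate, the function names, the channel counts, and the omission of batch normalization and of the global self-attention path are all assumptions made to keep the example short.

```python
import numpy as np

def max_pool_2x2(x):
    """Max pooling with window 2 and stride 2 on a (C, H, W) array; H and W must be even."""
    c, h, w = x.shape
    return x.reshape(c, h // 2, 2, w // 2, 2).max(axis=(2, 4))

def conv1x1(x, weight):
    """1x1 convolution: mixes channels at every pixel. weight has shape (C_out, C_in)."""
    return np.einsum("oc,chw->ohw", weight, x)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def shallow_to_deep_gate(shallow, deep, weight):
    """Hypothetical gate: shallow features rescale deep features channel by channel."""
    pooled = max_pool_2x2(shallow)            # bring shallow map to the deep resolution
    gate = sigmoid(conv1x1(pooled, weight))   # assumed gating function, values in (0, 1)
    return gate * deep                        # point-by-point multiplication

rng = np.random.default_rng(0)
shallow = rng.standard_normal((2, 8, 8))   # shallow layer: fewer channels, higher resolution
deep = rng.standard_normal((4, 4, 4))      # deep layer: more channels, lower resolution
weight = rng.standard_normal((4, 2))       # illustrative 1x1 convolution weights
modulated = shallow_to_deep_gate(shallow, deep, weight)

# Skip-layer fusion: concatenate features along the channel axis.
skip_a = rng.standard_normal((3, 4, 4))
skip_b = rng.standard_normal((3, 4, 4))
fused = np.concatenate([skip_a, skip_b, modulated], axis=0)
```

Because the gate stays between 0 and 1, it can only attenuate deep activations in this sketch; a trained module with batch normalization and residual connections would behave differently.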
2. The method for detecting faint targets based on non-steady-state clutter suppression according to claim 1, characterized in that Step 1 specifically comprises:
The input infrared image $I \in \mathbb{R}^{H \times W \times C}$ is transformed to the frequency domain using a fast Fourier transform:
$$I^{F}(u, v) = \sum_{x=0}^{H-1} \sum_{y=0}^{W-1} I(x, y)\, e^{-j 2\pi \left( \frac{ux}{H} + \frac{vy}{W} \right)}$$
where $I$ is the input infrared image, $\mathbb{R}$ is the real number field, $H$ is the height of the image, $W$ is the width of the image, $C$ is the number of channels of the image, the superscript $F$ denotes the Fourier transform, $(x, y)$ is the coordinate position in the image, $e$ is the base of the natural logarithm, $j$ is the imaginary unit, and $(u, v)$ is the coordinate position after the Fourier transform.
The above formula can be written as:
$$I^{F}(u, v) = \operatorname{Re}\!\left(I^{F}(u, v)\right) + j \operatorname{Im}\!\left(I^{F}(u, v)\right)$$
where $I^{F}$ is the result of the Fourier transform of image $I$, and $\operatorname{Re}(I^{F})$ and $\operatorname{Im}(I^{F})$ are the real and imaginary parts of $I^{F}$. The amplitude spectrum $A$ and the phase spectrum $P$ of the infrared image are therefore:
$$A(u, v) = \sqrt{\operatorname{Re}\!\left(I^{F}(u, v)\right)^{2} + \operatorname{Im}\!\left(I^{F}(u, v)\right)^{2}}, \qquad P(u, v) = \arctan\frac{\operatorname{Im}\!\left(I^{F}(u, v)\right)}{\operatorname{Re}\!\left(I^{F}(u, v)\right)}$$
Two learnable filters $M_{t}$ and $M_{c}$ are designed to extract, from the amplitude spectrum of the original image, the target-specific component $A_{t}$, which benefits the infrared small target detection task, and the target-background consistency component $A_{c}$, which is detrimental to it. The values in both filters are continuous variables between 0 and 1. After filtering, the amplitude spectrum $A$ is decomposed into the following two parts:
$$A_{t} = M_{t} \odot A, \qquad A_{c} = M_{c} \odot A$$
where $\odot$ denotes element-wise multiplication. The target-specific amplitude spectrum and the target-background consistency amplitude spectrum are each combined with the phase spectrum $P$ and passed through an inverse Fourier transform to obtain the target-specific component $I_{t}$ and the target-background consistency component $I_{c}$:
$$I_{t} = \mathcal{F}^{-1}\!\left(A_{t}\, e^{jP}\right), \qquad I_{c} = \mathcal{F}^{-1}\!\left(A_{c}\, e^{jP}\right)$$
where $\mathcal{F}^{-1}$ denotes the inverse Fourier transform and $j$ is the imaginary unit.
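The frequency-domain split in this claim can be sketched in a few lines of NumPy. This is a minimal illustration, not the patented implementation: the function name `decouple_spectrum`, the random stand-in filters, and the choice of complementary filters for the check are assumptions; in the claimed method the two filters are learned.

```python
import numpy as np

def decouple_spectrum(image, mask_t, mask_c):
    """Split a single-channel image by filtering its amplitude spectrum.

    image  : (H, W) real array
    mask_t : (H, W) values in [0, 1], stand-in for the target-specific filter
    mask_c : (H, W) values in [0, 1], stand-in for the consistency filter
    """
    spectrum = np.fft.fft2(image)        # 2-D discrete Fourier transform
    amplitude = np.abs(spectrum)         # amplitude spectrum
    phase = np.angle(spectrum)           # phase spectrum
    amp_t = mask_t * amplitude           # element-wise filtering
    amp_c = mask_c * amplitude
    # Recombine each filtered amplitude with the shared phase and invert.
    # The real part is kept because an arbitrary filter need not preserve
    # the conjugate symmetry of a real image's spectrum.
    comp_t = np.fft.ifft2(amp_t * np.exp(1j * phase)).real
    comp_c = np.fft.ifft2(amp_c * np.exp(1j * phase)).real
    return comp_t, comp_c

rng = np.random.default_rng(0)
image = rng.random((8, 8))
mask_t = rng.random((8, 8))              # random values stand in for learned weights
mask_c = 1.0 - mask_t                    # complementary filters, assumed for this check only
comp_t, comp_c = decouple_spectrum(image, mask_t, mask_c)
```

With complementary filters the two components add back up to the input image, which is a convenient sanity check; the claim itself does not require the filters to sum to one.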
3. The method for detecting faint targets based on non-steady-state clutter suppression according to claim 1, characterized in that Step 3 specifically comprises:
After the target-specific features and the target-background consistency features of a given layer are obtained, both are upsampled to the height and width of the original input. Based on the ground-truth annotations, instance-level target features and background features are then extracted from each of the two feature maps, where the number of instances is the number of targets in the image. This yields, after frequency-domain decoupling, the features of each target and the features of each background region, for both the target-specific component and the target-background consistency component.
Because different targets have different spatial scales, their feature dimensions differ. A feature alignment operation is therefore applied to bring all instance-level target and background features to a common size, determined by the output dimension of the alignment operation and the number of channels. A projection function implemented as a multilayer perceptron with two hidden layers then maps the aligned features to the space in which the contrastive loss is computed. This yields four groups of feature vectors: the target features in the target-specific component, the background features in the target-specific component, the target features in the target-background consistency component, and the background features in the target-background consistency component. These feature vectors are rearranged into two sets: the target features in the target-specific component, and all other features.
To pull the features within the first set closer together and push the features of the two sets further apart, a contrastive loss is designed over these two sets. In its expression, the number of targets sets the number of anchors; each anchor is a feature vector from the first set, compared against the different feature vectors of the same set and against the feature vectors of the other set; sim denotes the cosine similarity measure, which for two input feature vectors $a$ and $b$ satisfies:
$$\operatorname{sim}(a, b) = \frac{a^{\top} b}{\lVert a \rVert \, \lVert b \rVert}$$
and a temperature hyperparameter scales the similarities. Every feature in the first set forms a positive sample pair with each of the others in that set, while features from the first set and features from the second set form negative sample pairs. The contrastive loss is computed for the features at every level output by the U-shaped dense network, and the per-level losses are combined into the final total contrastive loss. This contrastive loss guides the learning of the two learnable filter templates, thereby decoupling the target-specific features from the target-background consistency features.
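The positive and negative pairing described in this claim can be illustrated with a generic InfoNCE-style loss. The exact expression of the claimed loss is not recoverable from the text, so the formula below is an assumption, as are the function names and the default temperature; only the cosine similarity and the pairing rule come from the claim.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity between two 1-D feature vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def contrastive_loss(positives, negatives, tau=0.1):
    """Assumed InfoNCE-style loss, not the patent's exact formula.

    positives : (N, D) target features of the target-specific component, N >= 2
    negatives : (M, D) all other features
    tau       : temperature hyperparameter (value chosen for illustration)
    """
    n = len(positives)
    total = 0.0
    for i in range(n):
        # Positive pairs: the anchor against every other feature in the same set.
        pos = sum(np.exp(cosine_sim(positives[i], positives[j]) / tau)
                  for j in range(n) if j != i)
        # Negative pairs: the anchor against every feature in the other set.
        neg = sum(np.exp(cosine_sim(positives[i], q) / tau) for q in negatives)
        total += -np.log(pos / (pos + neg))
    return total / n

targets = np.array([[1.0, 0.0], [1.0, 0.01]])   # two nearly identical target features
far_others = np.array([[-1.0, 0.0]])            # well separated from the targets
near_others = np.array([[1.0, 0.02]])           # almost indistinguishable from the targets
loss_separated = contrastive_loss(targets, far_others)
loss_mixed = contrastive_loss(targets, near_others)
```

The loss is close to zero when the other features point away from the targets and rises when they overlap, which is the behavior needed to drive the two filters apart.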
4. The method for detecting faint targets based on non-steady-state clutter suppression according to claim 1, characterized in that Step 4 specifically comprises:
For the detection head, the soft-IoU loss is used as the loss function of the model:
$$\mathcal{L}_{siou} = 1 - \frac{\sum_{x, y} \hat{Y}(x, y)\, Y(x, y)}{\sum_{x, y} \left[ \hat{Y}(x, y) + Y(x, y) - \hat{Y}(x, y)\, Y(x, y) \right]}$$
where $\hat{Y}$ is the confidence map of infrared small targets inferred by the model and $Y$ is the ground-truth annotation. Combined with the contrastive loss $\mathcal{L}_{con}$, the final overall loss function is:
$$\mathcal{L} = \lambda_{1} \mathcal{L}_{siou} + \lambda_{2} \mathcal{L}_{con}$$
where $\lambda_{1}$ and $\lambda_{2}$ are two coefficients that balance the detection task and the decoupling task. Training is performed end to end. During inference the contrast head branch is removed and only the detection head branch is retained, which performs the infrared small target detection task on the target-specific component.
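A minimal NumPy sketch of the detection loss and the weighted total follows. It uses the common form of the soft-IoU loss; the small `eps` stabilizer, the function names, and the coefficient values are assumptions added for illustration and are not stated in the claim.

```python
import numpy as np

def soft_iou_loss(pred, target, eps=1e-6):
    """Soft-IoU loss between a confidence map and a binary annotation.

    pred   : array of confidences in [0, 1]
    target : array of 0/1 ground-truth labels, same shape as pred
    eps    : stabilizer for empty masks (an assumption, not from the claim)
    """
    intersection = np.sum(pred * target)
    union = np.sum(pred + target - pred * target)
    return 1.0 - (intersection + eps) / (union + eps)

def total_loss(detection_loss, contrast_loss, lam_det=1.0, lam_con=1.0):
    """Weighted sum of the detection and decoupling terms; weights are placeholders."""
    return lam_det * detection_loss + lam_con * contrast_loss

target = np.zeros((8, 8))
target[3:5, 3:5] = 1.0                               # a 2x2 "small target"
loss_perfect = soft_iou_loss(target, target)         # prediction equals annotation
loss_disjoint = soft_iou_loss(1.0 - target, target)  # prediction misses the target entirely
combined = total_loss(0.5, 0.2, lam_det=1.0, lam_con=0.5)
```

Since the contrast head is dropped at inference, only the confidence map that feeds `soft_iou_loss` during training has a counterpart in the deployed model.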