A method and apparatus for single-frame infrared small target detection

By introducing a multi-scale local contrast enhancement module and a pyramid max pooling module into the Unet network, the feature extraction capability is enhanced, solving the problems of high false alarm rate and low detection rate in infrared small target detection, and achieving high accuracy and robust single-frame infrared small target detection.

CN115690536B · Active · Publication Date: 2026-06-30 · TSINGHUA SHENZHEN INTERNATIONAL GRADUATE SCHOOL

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
TSINGHUA SHENZHEN INTERNATIONAL GRADUATE SCHOOL
Filing Date
2022-10-26
Publication Date
2026-06-30

AI Technical Summary

Technical Problem

Existing technologies for infrared small target detection suffer from high false alarm rates, low detection rates, and weak robustness, especially in complex backgrounds where it is difficult to achieve high-precision single-frame detection.

Method used

We employ a deep learning network model based on a multi-scale local contrast enhancement module and a pyramid max pooling module to enhance feature extraction capabilities. Leveraging the generalization and robustness of deep learning methods, we construct a Unet network and design a multi-scale local contrast enhancement module at skip connections and a pyramid max pooling module at the deepest feature layer.

Benefits of technology

It improves the accuracy and robustness of single-frame infrared small target detection, overcomes the problems of high false alarm rate and low detection rate in complex backgrounds, and achieves high-precision target detection.



Abstract

This invention provides a method for detecting small infrared targets in a single frame, comprising the following steps: acquiring a dataset of single-frame infrared small target images and preprocessing the dataset to obtain a training set; constructing a deep learning network model based on a multi-scale local contrast enhancement module and a pyramid max pooling module; training the deep learning model based on the training set of single-frame infrared small targets to obtain a single-frame infrared small target detection model; acquiring the single-frame infrared small target image to be detected; inputting the image into the single-frame infrared small target detection model and outputting the detection result image. This invention utilizes the good generalization and robustness of deep learning methods, applying a deep learning network model to single-frame infrared small target detection, and using a multi-scale local contrast enhancement module and a pyramid max pooling module to enhance the feature extraction capability of the deep learning network for single-frame infrared small targets, thereby achieving high accuracy and high robustness in single-frame infrared small target detection.

Description

Technical Field

[0001] This invention relates to the field of digital image processing, and in particular to a method and apparatus for single-frame infrared small target detection.

Background Technology

[0002] Infrared detection systems have long been widely used due to their advantages such as long imaging distance and minimal susceptibility to weather conditions. With the rapid development of image processing technology, infrared small target detection technology based on infrared detection systems is increasingly being applied in military fields such as infrared early warning and infrared guidance, as well as in civilian fields such as security detection.

[0003] Unlike larger, more textured targets in natural optical images, infrared small targets generally have the following characteristics: (1) Small size and weak features: because of the long imaging distance and low image resolution, most infrared small targets lack color and texture features, and their shapes and edges are poorly defined. (2) Complex and variable background: the background in small target detection is often very complex and may include ground, sky, and ocean scenes, which places high demands on the robustness of the algorithm. (3) Low signal-to-noise ratio: small targets resemble noise and image clutter, and some imaging devices are unreliable and introduce considerable noise, so small targets are easily submerged in clutter and noise and are difficult to distinguish and detect. These characteristics make infrared small target detection significantly different from target detection in ordinary optical images and place higher demands on the algorithm. In addition, because many small targets are highly sensitive, few open-source image datasets are available, which is a further disadvantage.

[0004] Existing infrared small target detection methods are divided into single-frame-based and sequence-based methods. Because sequence-based detection assumes a static background, only single-frame-based infrared small target detection techniques can be used on platforms with rapidly changing backgrounds. The patent application "Infrared Small Target Detection Method Based on Local Contrast and Gradient" (application number CN202210545219.4) introduces an infrared small target detection method based on local contrast and gradient. While this method effectively suppresses background clutter and preserves small targets, it still suffers from high false alarm rates and low detection rates in rapidly changing backgrounds and with multi-scale targets.

[0005] It should be noted that the information disclosed in the background section above is only for understanding the background of this application, and therefore may include information that does not constitute prior art known to those skilled in the art.

Summary of the Invention

[0006] The purpose of this invention is to improve the accuracy and robustness of single-frame infrared small target detection by providing a single-frame infrared small target detection method and apparatus.

[0007] To achieve the above objectives, the present invention adopts the following technical solution:

[0008] This invention provides a single-frame infrared small target detection method, comprising the following steps:

[0009] S1: Obtain a dataset of single-frame infrared small target images and preprocess the dataset to obtain a training set;

[0010] S2: Construct a deep learning network model based on a multi-scale local contrast enhancement module and a pyramid max pooling module;

[0011] S3: Based on the training set of single-frame infrared small targets, the deep learning model is trained to obtain a single-frame infrared small target detection model;

[0012] S4: Acquire a single-frame infrared image of the small target to be detected;

[0013] S5: Input the image into the single-frame infrared small target detection model and output the detection result image.

[0014] In some embodiments, the multi-scale local contrast enhancement module computes local contrast enhancement maps at different scales on the feature map by applying the contrast operator with different dilation coefficients, and then concatenates them along the channel dimension.

[0015] In some embodiments, the multi-scale local contrast enhancement module enhances or suppresses the corresponding contrast features through 1×1 convolution.

[0016] In some embodiments, the structure of the multi-scale local contrast enhancement module is expressed by the following formula:

[0017] $\mathrm{MLCE}(F) = \delta\big(\mathcal{B}\big(\mathrm{conv}\big(\mathrm{concat}(F_{d_1}, F_{d_2}, \dots, F_{d_n})\big)\big)\big)$

[0018] Wherein, $\mathrm{MLCE}(\cdot)$ represents the multi-scale local contrast enhancement module, $F_{d_n}$ is the feature map obtained after computing local contrast with dilation coefficient $d_n$, $F \in \mathbb{R}^{C \times H \times W}$ is the original feature map, $\mathcal{B}$ is the batch normalization operation, concat is the concatenation operation in the channel dimension, $\delta$ is the ReLU activation function, and conv is a 1×1 convolution.

[0019] In some embodiments, the local contrast enhancement operator employed by the multi-scale local contrast enhancement module includes a depthwise-separable, fixed, dilated eight-direction Laplacian operator.
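As an illustration only, a dilated eight-direction Laplacian of the kind named above can be sketched in plain Python for a single channel. The kernel weights (8 at the centre, -1 at each of the eight neighbours), the zero padding at the borders, and the function name `dilated_laplacian8` are assumptions for this sketch; the patent text does not state them. "Depthwise separable" is taken here to mean that the same operator is applied to each channel independently.

```python
def dilated_laplacian8(image, dilation):
    """Eight-direction Laplacian response of one channel (zero padding assumed)."""
    height, width = len(image), len(image[0])
    # Offsets of the eight neighbours of a 3x3 grid, spread apart by `dilation`.
    offsets = [(dy * dilation, dx * dilation)
               for dy in (-1, 0, 1) for dx in (-1, 0, 1)
               if (dy, dx) != (0, 0)]
    response = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            neighbour_sum = 0.0
            for oy, ox in offsets:
                ny, nx = y + oy, x + ox
                if 0 <= ny < height and 0 <= nx < width:
                    neighbour_sum += image[ny][nx]
            # Assumed weights: centre 8, each neighbour -1 (kernel sums to zero).
            response[y][x] = 8.0 * image[y][x] - neighbour_sum
    return response


# A flat 9x9 background with a single bright pixel in the middle.
frame = [[1.0] * 9 for _ in range(9)]
frame[4][4] = 5.0
contrast = dilated_laplacian8(frame, dilation=2)
```

In this toy frame the response peaks at the bright pixel and is zero where the dilated neighbourhood covers only flat background, which is the behaviour a local contrast operator is expected to show for a small bright target.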

[0020] In some embodiments, the pyramid max pooling module in step S2 first uses the pyramid pooling module to expand the receptive field and aggregate the contextual information of the input feature map. After the pooled features are obtained, a 1×1 convolution is performed and the size of the feature map is adjusted, with the number of output channels set equal to the number of input channels. Then, the input feature map and the aggregated feature maps are concatenated along the channel dimension, and a 3×3 convolution is applied to generate the refined features.

[0021] In some embodiments, the deep learning network model described in step S2 is a Unet network.

[0022] In some embodiments, a multi-scale local contrast enhancement module is placed at each skip connection of the Unet network, and a pyramid max pooling module is placed at the deepest feature layer.

[0023] The present invention also provides a single-frame infrared small target detection device, which stores a computer program that, when executed by a processor, implements the steps of the above-described method.

[0024] The present invention also provides a computer-readable medium having a computer program stored thereon, which, when executed by a processor, implements the steps of the method described above.

[0025] The present invention has the following beneficial effects:

[0026] This invention applies a deep learning network model to the detection of small infrared targets in a single frame. It uses a multi-scale local contrast enhancement module and a pyramid max pooling module to enhance the feature extraction capability of the deep learning network for small infrared targets in a single frame. It leverages the good generalization and robustness of deep learning methods to overcome the problems of high false alarm rate, low detection rate and weak robustness of traditional detection methods in complex backgrounds, thereby achieving high accuracy and high robustness in the detection of small infrared targets in a single frame.

[0027] Other beneficial effects of the embodiments of the present invention will be further described below.

Attached Figure Description

[0028] Figure 1 is a flowchart of a single-frame infrared small target detection method according to an embodiment of the present invention;

[0029] Figure 2 is a schematic diagram of the overall architecture of the deep network in an embodiment of the present invention;

[0030] Figure 3 is a schematic diagram of the overall architecture of the multi-scale local contrast enhancement module in an embodiment of the present invention;

[0031] Figure 4 is a schematic diagram of the pyramid max pooling module in an embodiment of the present invention.

Detailed Implementation

[0032] The embodiments of the present invention will be described in detail below. It should be emphasized that the following description is merely exemplary and not intended to limit the scope and application of the present invention.

[0033] Furthermore, in the description of the embodiments of the present invention, "a plurality of" means two or more, unless otherwise explicitly specified.

[0034] Deep learning methods exhibit good generalization and robustness, and their application in infrared small target detection can achieve higher accuracy. To overcome the problems of high false alarm rate, low detection rate, and weak robustness of traditional detection methods in complex backgrounds, this invention proposes a lightweight deep learning-based single-frame infrared small target detection method to improve the accuracy and robustness of single-frame infrared small target detection.

[0035] The flowchart of the single-frame infrared small target detection method in this embodiment of the invention is shown in Figure 1. The method includes the following steps:

[0036] S1: Obtain a dataset of single-frame infrared small target images and preprocess the dataset to obtain a training set;

[0037] S2: Construct a deep learning network model based on a multi-scale local contrast enhancement module and a pyramid max pooling module;

[0038] Preferably, the multi-scale local contrast enhancement module computes local contrast enhancement maps at different scales on the feature map by applying the contrast operator with different dilation coefficients, and then concatenates them along the channel dimension.

[0039] Preferably, the multi-scale local contrast enhancement module enhances or suppresses the corresponding contrast features through a 1×1 convolution. The local contrast enhancement operator used by the multi-scale local contrast enhancement module includes a depthwise-separable, fixed, dilated eight-direction Laplacian operator.

[0040] Preferably, the structure of the multi-scale local contrast enhancement module is expressed by the following formula:

[0041] $\mathrm{MLCE}(F) = \delta\big(\mathcal{B}\big(\mathrm{conv}\big(\mathrm{concat}(F_{d_1}, F_{d_2}, \dots, F_{d_n})\big)\big)\big)$

[0042] Wherein, $\mathrm{MLCE}(\cdot)$ represents the multi-scale local contrast enhancement module, $F_{d_n}$ is the feature map obtained after computing local contrast with dilation coefficient $d_n$, $F \in \mathbb{R}^{C \times H \times W}$ is the original feature map, $\mathcal{B}$ is the batch normalization operation, concat is the concatenation operation in the channel dimension, $\delta$ is the ReLU activation function, and conv is a 1×1 convolution.

[0043] Preferably, the pyramid max pooling module first uses the pyramid pooling module to expand the receptive field and aggregate the contextual information of the input feature map. After the pooled features are obtained, a 1×1 convolution is performed and the size of the feature map is adjusted, with the number of output channels set equal to the number of input channels. Then, the input feature map and the aggregated feature maps are concatenated along the channel dimension, and a 3×3 convolution is applied to generate the refined features.

[0044] Preferably, the deep learning network model is a Unet network. Multi-scale local contrast enhancement modules are designed at the skip connections of the Unet network, and pyramid max-pooling modules are designed at the deepest feature layer.

[0045] S3: Based on the training set of single-frame infrared small targets, the deep learning model is trained to obtain a single-frame infrared small target detection model;

[0046] S4: Acquire a single-frame infrared image of the small target to be detected;

[0047] S5: Input the image into the single-frame infrared small target detection model and output the detection result image.

[0048] Steps S1, S2, and S3 are the processes for training the network model.

[0049] This invention also provides a single-frame infrared small target detection device, which stores a computer program that, when executed by a processor, implements the steps of the above-described method.

[0050] This invention also provides a computer-readable medium having a computer program stored thereon, which, when executed by a processor, implements the steps of the method described above.

[0051] This invention uses UNet (an image segmentation network) as the backbone network of a deep learning method. Multi-scale local contrast enhancement modules are placed at the skip connections of the UNet network and a pyramid max-pooling module is placed at its deepest feature layer to enhance the feature extraction capability of the deep learning network. The overall structure of the deep network in this invention is shown in Figure 2.

[0052] The structure of the Unet backbone network can be summarized as follows: downsampling for encoding, then upsampling for decoding, with skip connections added during the encoding and decoding process. Skip connections are designed to accelerate network convergence and supplement lost details; these connections directly connect encoded feature maps of the same resolution to the decoding process. The left side of this structure represents the encoding structure, employing a typical combination of convolutional networks: including repeated use of two 3x3 convolutions (without padding), each followed by a non-linear activation function (ReLU) and a 2x2 max-pooling layer with a stride of 2 for downsampling. In each downsampling step, the number of channels in the feature map doubles. The right side represents the decoding structure, where each step includes a 2x2 transposed convolution, halving the number of channels in the feature map and concatenating it with the encoded feature map along the channel dimension, followed by two 3x3 convolutions, each also followed by a ReLU activation function. In the final layer, 1x1 convolutions map each 64-channel feature map to the required number of classes.
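The encoder arithmetic described above (two unpadded 3x3 convolutions per level, 2x2 max pooling with stride 2, channel doubling at each downsampling step) can be traced with a short sketch. The function name `unet_encoder_trace` and the 572-pixel input used in the usage line are illustrative assumptions and do not come from the patent text; only the layer rules are taken from the paragraph above.

```python
def unet_encoder_trace(input_size, base_channels, num_downsamples):
    """Return (channels, spatial_size) after the conv block of each encoder level."""
    trace = []
    size, channels = input_size, base_channels
    for level in range(num_downsamples + 1):
        # Two unpadded 3x3 convolutions each remove one pixel per border: -4 in total.
        size -= 4
        trace.append((channels, size))
        if level < num_downsamples:
            # 2x2 max pooling with stride 2 halves the spatial size,
            # and the next level doubles the channel count.
            size //= 2
            channels *= 2
    return trace


# Illustrative input: 572x572 image, 64 base channels, four downsampling steps.
generic_trace = unet_encoder_trace(572, 64, 4)
# Channel schedule when the base width is 16 and the crop is 256x256.
embodiment_channels = [c for c, _ in unet_encoder_trace(256, 16, 4)]
```

With a base width of 64 the shallowest level has 64 channels, which is consistent with the statement that the final 1x1 convolution acts on a 64-channel feature map.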

[0053] The overall structure of the multi-scale local contrast enhancement module in this embodiment of the invention is shown in Figure 3. The local contrast enhancement operator used in this module is implemented as a depthwise-separable, fixed, dilated eight-direction Laplacian operator. Contrast enhancement maps at different scales are calculated on the feature map using different dilation coefficients and then concatenated along the channel dimension. Finally, an appropriate local contrast enhancement value is adaptively adjusted. Given an intermediate feature map $F \in \mathbb{R}^{C \times H \times W}$, the module can be defined as:

[0054] $\mathrm{MLCE}(F) = \delta\big(\mathcal{B}\big(\mathrm{conv}\big(\mathrm{concat}(F_{d_1}, F_{d_2}, \dots, F_{d_n})\big)\big)\big)$

[0055] In the formula, $\mathrm{MLCE}(\cdot)$ defines the overall expression of the multi-scale local contrast enhancement module, $F_{d_n}$ is the feature map obtained after computing local contrast with dilation coefficient $d_n$, $F$ is the original feature map, $\mathcal{B}$ is the batch normalization operation, concat is the concatenation operation in the channel dimension, $\delta$ is the ReLU activation function, and conv is the 1×1 convolution.
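The module structure defined above can be sketched end to end in plain Python under several stated assumptions: the Laplacian weights (8 at the centre, -1 at the neighbours) and zero padding are assumed, batch normalization is reduced to a per-channel standardisation over a single sample without learned scale or shift, and the 1×1 convolution weights are hand-picked rather than trained. All function names are hypothetical.

```python
import math


def local_contrast(channel, dilation):
    """Dilated eight-direction Laplacian of one channel (assumed weights, zero padding)."""
    h, w = len(channel), len(channel[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            total = 8.0 * channel[y][x]
            for dy in (-dilation, 0, dilation):
                for dx in (-dilation, 0, dilation):
                    ny, nx = y + dy, x + dx
                    if (dy or dx) and 0 <= ny < h and 0 <= nx < w:
                        total -= channel[ny][nx]
            out[y][x] = total
    return out


def conv1x1(channels, weights):
    """Pixel-wise channel mixing: one output channel per row of `weights`."""
    h, w = len(channels[0]), len(channels[0][0])
    return [[[sum(wt * ch[y][x] for wt, ch in zip(row, channels))
              for x in range(w)] for y in range(h)] for row in weights]


def batch_norm(channel, eps=1e-5):
    """Simplified normalisation: zero mean, unit variance, no affine parameters."""
    values = [v for row in channel for v in row]
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [[(v - mean) / math.sqrt(var + eps) for v in row] for row in channel]


def relu(channel):
    return [[max(0.0, v) for v in row] for row in channel]


def mlce(feature_map, dilations, weights):
    """feature_map: list of C channels; weights: C rows of C * len(dilations) values."""
    # concat: every channel contributes one contrast map per dilation coefficient.
    stacked = [local_contrast(ch, d) for d in dilations for ch in feature_map]
    # conv -> batch normalization -> ReLU, in the order given by the formula.
    return [relu(batch_norm(ch)) for ch in conv1x1(stacked, weights)]


# One-channel 13x13 feature map with a single bright pixel at the centre.
feature = [[[1.0] * 13 for _ in range(13)]]
feature[0][6][6] = 5.0
enhanced = mlce(feature, dilations=(2, 4, 6), weights=[[1 / 3, 1 / 3, 1 / 3]])
```

With one input channel and three dilation coefficients the concatenated tensor has three channels, and the 1×1 convolution brings it back to one, mirroring the channel bookkeeping described for the module.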

[0056] The structure of the pyramid max pooling module in this embodiment of the invention is shown in Figure 4. This module performs four max pooling operations, resulting in feature map sizes of 1×1, 4×4, 8×8, and 32×32. After the pooled features are obtained, convolution and upsampling operations are performed. The convolution kernel size is 1×1, and to reduce feature loss, the number of output channels is set equal to the number of input channels. Then, the module concatenates the original feature map and the pyramid-pooled feature maps along the channel dimension and applies a 3×3 convolution to produce the refined features.
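The pooling and concatenation steps of this module can be sketched as follows for a single square channel. This is a minimal sketch under assumptions: the bin boundaries follow the usual adaptive-pooling rule, upsampling uses nearest-neighbour replication (the text does not name the upsampling method), and the 1×1 and 3×3 convolutions are omitted. The function names are hypothetical.

```python
import math


def adaptive_max_pool(channel, bins):
    """Max-pool a square 2D map down to bins x bins cells."""
    size = len(channel)
    pooled = []
    for i in range(bins):
        r0, r1 = (i * size) // bins, math.ceil((i + 1) * size / bins)
        row = []
        for j in range(bins):
            c0, c1 = (j * size) // bins, math.ceil((j + 1) * size / bins)
            row.append(max(channel[r][c]
                           for r in range(r0, r1) for c in range(c0, c1)))
        pooled.append(row)
    return pooled


def upsample_nearest(channel, size):
    """Resize a bins x bins map back to size x size by nearest-neighbour replication."""
    bins = len(channel)
    return [[channel[(r * bins) // size][(c * bins) // size]
             for c in range(size)] for r in range(size)]


def pyramid_max_pool(channel, bin_sizes=(1, 4, 8, 32)):
    """Return the input map plus one upsampled max-pooled map per pyramid level."""
    size = len(channel)
    branches = [channel]
    for bins in bin_sizes:
        pooled = adaptive_max_pool(channel, bins)
        # The 1x1 convolution applied at this point in the text is omitted here.
        branches.append(upsample_nearest(pooled, size))
    # The concatenated branches would then be fused by a 3x3 convolution.
    return branches


# A 32x32 ramp image so that every pixel value is distinct.
ramp = [[r * 32 + c for c in range(32)] for r in range(32)]
branches = pyramid_max_pool(ramp)
```

The 1×1 branch carries the global maximum to every position, while the 32×32 branch reproduces the input at this resolution, so the concatenation mixes global and local views of the same feature map.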

[0057] The advantages of the embodiments of the present invention are as follows:

[0058] The designed multi-scale local contrast enhancement module utilizes the prior knowledge that infrared small targets are small and bright to improve the network's ability to extract features from small targets, help locate small targets, and make the network more robust when detecting targets in complex backgrounds.

[0059] The designed multi-scale local contrast enhancement module also accounts for the scale variation of small targets. It aggregates local contrast maps at multiple scales and can enhance or suppress the corresponding contrast features through a 1×1 convolution, thereby extracting the local features of small targets more effectively. This avoids the mismatch between the contrast extraction range and the target extent that a single scale can cause, along with the resulting loss of contrast, and greatly improves the detection accuracy for small targets.

[0060] The designed pyramid max pooling module aggregates contextual information from different regions at multiple scales, enhancing the network's ability to use global information and thus significantly improving the accuracy of infrared target detection and segmentation.

[0061] Example:

[0062] The model in this embodiment of the invention has only 1.1M parameters and can be deployed on embedded devices, personal computers, and servers.

[0063] S1. Preprocessing the infrared small target image: First, obtain the dataset of infrared small target images, which is divided into training set and test set. Then, perform vertical and horizontal flipping, random cropping to 256×256 size, and Gaussian blur on the images to obtain the preprocessed images.
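The geometric part of this preprocessing can be sketched in plain Python. Several details here are assumptions not stated in the text: each flip is applied with probability 0.5, the same transform is applied to a label mask so that it stays aligned with the image, and the Gaussian blur step is left out. The function names are hypothetical.

```python
import random


def horizontal_flip(image):
    return [row[::-1] for row in image]


def vertical_flip(image):
    return image[::-1]


def random_crop(image, mask, size, rng):
    """Cut the same size x size window out of the image and its mask."""
    top = rng.randint(0, len(image) - size)
    left = rng.randint(0, len(image[0]) - size)

    def crop(array):
        return [row[left:left + size] for row in array[top:top + size]]

    return crop(image), crop(mask)


def augment(image, mask, rng, crop_size=256):
    # Identical geometric transforms keep the mask aligned with the image.
    if rng.random() < 0.5:  # assumed flip probability
        image, mask = horizontal_flip(image), horizontal_flip(mask)
    if rng.random() < 0.5:  # assumed flip probability
        image, mask = vertical_flip(image), vertical_flip(mask)
    return random_crop(image, mask, crop_size, rng)


# Synthetic 300x320 frame with one target pixel and its label.
rng = random.Random(0)
image = [[0] * 320 for _ in range(300)]
mask = [[0] * 320 for _ in range(300)]
image[150][160] = 9
mask[150][160] = 1
aug_image, aug_mask = augment(image, mask, rng)
```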

[0064] S2. Construct a deep learning network model based on multi-scale local contrast enhancement and pyramid max pooling, the structure of which is shown in Figure 2;

[0065] First, the downsampling encoding structure of the Unet network is constructed, adopting a typical convolutional network structure: two repeated 3x3 convolutions (without padding), each followed by a non-linear activation function (ReLU), and then a 2x2 max-pooling layer with a stride of 2 for downsampling. A total of four complete downsampling encoding structures are constructed. In each downsampling step, the number of channels in the feature map doubles, giving 16, 32, 64, 128, and 256 channels from the shallowest to the deepest layer. The decoding structure on the right includes a 2x2 transposed convolution at each step, which halves the number of channels in the feature map; the result is concatenated along the channel dimension with the feature map from the encoded portion after multi-scale local contrast enhancement and pyramid max-pooling. This is followed by two 3x3 convolutions, each also followed by a ReLU activation function. In the final layer, a 1x1 convolution maps each 64-channel feature map to a single output channel.

[0066] Then, a multi-scale local contrast enhancement module and a pyramid max-pooling module are constructed, as shown in Figure 3 and Figure 4 respectively. The multi-scale local contrast enhancement module first applies a depthwise-separable, dilated eight-direction Laplacian operator with dilation coefficients of 2, 4, and 6 to enhance the bright features of small targets. The enhanced feature maps are then concatenated along the channel dimension, so that the channel count is three times that of the input feature map. Next, a 1x1 convolution refines the feature map and reduces the channel count to the number of input channels, followed by batch normalization and ReLU activation. The pyramid max-pooling module first uses pyramid pooling to expand the receptive field and aggregate contextual information of the input features; the pooled feature maps have sizes of 1×1, 4×4, 8×8, and 32×32. After the pooled features are obtained, a 1×1 convolution is performed and the feature map size is adjusted. To reduce feature loss, the number of output channels is set equal to the number of input channels. Then, the input feature map is concatenated with the aggregated feature maps along the channel dimension through a skip connection, and a 3×3 convolution is applied to produce the refined features.

[0067] S3. Based on the training set of infrared small targets, the deep learning model is trained using gradient descent: the overall loss function L is calculated and the network parameters are updated. Training stops when the maximum number of iterations is reached, resulting in a trained infrared small target detection model. In this embodiment, the number of epochs (training rounds) is set to 500, the batch size is set to 4, and the Adam (adaptive moment estimation) optimizer is used to train the designed model, with an initial learning rate of $10^{-3}$ and a final learning rate of $10^{-5}$, decayed using the cosine descent method.
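One common closed form for a cosine learning-rate decay between the two stated rates is sketched below. The exact schedule used in the embodiment is not given, so the half-cosine formula, the per-epoch update, and the function name `cosine_lr` are assumptions.

```python
import math


def cosine_lr(epoch, total_epochs=500, lr_start=1e-3, lr_end=1e-5):
    """Half-cosine decay from lr_start at epoch 0 to lr_end at total_epochs."""
    progress = epoch / total_epochs
    return lr_end + 0.5 * (lr_start - lr_end) * (1.0 + math.cos(math.pi * progress))


# Learning rate at the start, midpoint, and end of a 500-epoch run.
schedule = [cosine_lr(e) for e in (0, 250, 500)]
```

Under this form the rate falls slowly at first, fastest around the midpoint, and flattens out as it approaches the final value.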

[0068] S4. Acquire a single-frame infrared image of the small target to be detected.

[0069] S5. Input the tested infrared small target image into the deep network and output the detection result image.

[0070] Experimental example:

[0071] The technical effects of the embodiments of the present invention can be further illustrated by the following simulation experiments:

[0072] The experimental examples of this invention use the PyTorch framework and an NVIDIA GeForce GTX1080Ti GPU to implement network training and inference. The Python version is 3.7 and the PyTorch version is 1.7.1. The training and testing dataset is the SIRST dataset, which contains 427 infrared small target images.

[0073] The existing technology achieves an Intersection over Union (IoU) of 75.0% and a normalized IoU of 73.1% on the SIRST dataset, while the present invention achieves an IoU of 79.3% and a normalized IoU of 76.6% on the SIRST dataset, as shown in Table 1. Compared to the existing technology, the detection accuracy of the embodiments of the present invention is significantly improved.

[0074] Table 1

[0075]

| Evaluation metric | Existing technology | Single-frame infrared small target detection method |
|---|---|---|
| Intersection over Union (IoU) | 75.0% | 79.3% |
| Normalized IoU | 73.1% | 76.6% |
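For reference, the two metrics in Table 1 can be computed from binary masks as sketched below. The definitions used here are the ones commonly applied to infrared small target segmentation (IoU pooled over the whole test set, normalized IoU as the mean of per-image IoU values); the patent text does not spell them out, so they are assumptions, as are the function names. Every image is assumed to have a non-empty union of prediction and label.

```python
def intersection_and_union(pred, target):
    """Count intersecting and united pixels of two binary (0/1) masks."""
    inter = union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            inter += p & t
            union += p | t
    return inter, union


def iou(preds, targets):
    """Dataset-level IoU: pool intersections and unions over all images."""
    pairs = [intersection_and_union(p, t) for p, t in zip(preds, targets)]
    return sum(i for i, _ in pairs) / sum(u for _, u in pairs)


def niou(preds, targets):
    """Normalized IoU: average of the per-image IoU values."""
    pairs = [intersection_and_union(p, t) for p, t in zip(preds, targets)]
    return sum(i / u for i, u in pairs) / len(pairs)


# Two toy 2x2 predictions and labels.
preds = [[[1, 1], [0, 0]], [[1, 1], [1, 1]]]
targets = [[[1, 0], [0, 0]], [[1, 1], [1, 1]]]
dataset_iou = iou(preds, targets)
dataset_niou = niou(preds, targets)
```

The pooled metric is dominated by images with large targets, whereas the normalized metric weights every image equally, which is why the two values in a results table generally differ.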

[0076] Those skilled in the art will understand that embodiments of the present invention can be provided as methods, systems, or computer program products. Therefore, the present invention can take the form of a completely hardware embodiment, a completely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present invention can take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.

[0077] This invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions can be provided to a processor of a general-purpose computer, special-purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, produce a device for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.

[0078] These computer program instructions may also be stored in a computer-readable storage medium that can direct a computer or other programmable data processing device to function in a particular manner, such that the instructions stored in the computer-readable storage medium produce an article of manufacture including instruction means that implement the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.

[0079] These computer program instructions may also be loaded onto a computer or other programmable data processing equipment to cause a series of operational steps to be performed on the computer or other programmable equipment to produce a computer-implemented process, such that the instructions that execute on the computer or other programmable equipment provide steps for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.

[0080] The above description provides a further detailed explanation of the present invention in conjunction with specific / preferred embodiments, and it should not be construed that the specific implementation of the present invention is limited to these descriptions. For those skilled in the art, various substitutions or modifications can be made to these described embodiments without departing from the concept of the present invention, and all such substitutions or modifications should be considered within the scope of protection of the present invention. In the description of this specification, the reference to terms such as "an embodiment," "some embodiments," "preferred embodiment," "example," "specific example," or "some examples," etc., indicates that the specific features, structures, materials, or characteristics described in connection with that embodiment or example are included in at least one embodiment or example of the present invention. In this specification, the illustrative expressions of the above terms do not necessarily refer to the same embodiment or example. Moreover, the specific features, structures, materials, or characteristics described can be combined in any suitable manner in one or more embodiments or examples. Without contradiction, those skilled in the art can combine and integrate the different embodiments or examples described in this specification and the features of different embodiments or examples. Although the embodiments of the present invention and their advantages have been described in detail, it should be understood that various changes, substitutions, and modifications can be made herein without departing from the scope of protection of the patent application.

Claims

1. A single-frame infrared small target detection method, characterized in that it includes the following steps:
S1: Obtain a dataset of single-frame infrared small target images and preprocess the dataset to obtain a training set;
S2: Construct a deep learning network model based on a multi-scale local contrast enhancement module and a pyramid max pooling module; the multi-scale local contrast enhancement module uses a depthwise-separable, fixed, dilated eight-direction Laplacian operator to enhance the bright features of small targets; the pyramid max pooling module uses max pooling operations to expand the receptive field and aggregate contextual information of input features to enhance the network's ability to utilize global information; the deep learning network model is a Unet network, with the multi-scale local contrast enhancement module placed at the skip connections of the Unet network and the pyramid max pooling module placed at the deepest feature layer;
S3: Based on the training set of single-frame infrared small targets, the deep learning model is trained to obtain a single-frame infrared small target detection model;
S4: Acquire a single-frame infrared image of the small target to be detected;
S5: Input the image into the single-frame infrared small target detection model and output the detection result image;
wherein the multi-scale local contrast enhancement module uses an eight-direction Laplacian operator with dilation coefficients of 2, 4, and 6 and concatenates the enhanced feature maps along the channel dimension; the multi-scale local contrast enhancement module enhances or suppresses the corresponding contrast features through a 1×1 convolution;
the pyramid max pooling module has four max pooling operations, resulting in feature map sizes of 1×1, 4×4, 8×8, and 32×32, respectively; the pyramid max pooling module first expands the receptive field using the pyramid pooling module, aggregating the contextual information of the input feature map; after the pooled features are obtained, a 1×1 convolution is performed and the feature map size is adjusted, with the number of output channels set equal to the number of input channels; then, the input feature map and the aggregated feature maps are concatenated along the channel dimension, and a 3×3 convolution is applied to generate the refined features.

2. The method as described in claim 1, characterized in that the structure of the multi-scale local contrast enhancement module is expressed by the following formula: $\mathrm{MLCE}(F) = \delta\big(\mathcal{B}\big(\mathrm{conv}\big(\mathrm{concat}(F_{d_1}, F_{d_2}, \dots, F_{d_n})\big)\big)\big)$, where $\mathrm{MLCE}(\cdot)$ denotes the multi-scale local contrast enhancement module, $F_{d_n}$ is the feature map obtained after computing local contrast with dilation coefficient $d_n$, $F \in \mathbb{R}^{C \times H \times W}$ is the original feature map, $\mathcal{B}$ is the batch normalization operation, concat is the concatenation operation in the channel dimension, $\delta$ is the ReLU activation function, and conv is a 1×1 convolution.

3. A single-frame infrared small target detection device having a computer program stored thereon, characterized in that, when the program is executed by a processor, it implements the steps of the method as described in any one of claims 1-2.

4. A computer-readable medium having a computer program stored thereon, characterized in that, when the program is executed by a processor, it implements the steps of the method as described in any one of claims 1-2.