A weak and small target intelligent detection system based on salient feature information fusion
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- SHANGHAI AEROSPACE CONTROL TECH INST
- Filing Date
- 2022-12-26
- Publication Date
- 2026-06-26
AI Technical Summary
Existing technologies are insufficient in the detection of small infrared targets, and intelligent detection systems suffer from issues with confidence level and bounding box instability.
A weak target intelligent detection system based on salient feature information fusion is adopted, including a multi-scale salient feature extraction module, a low-resolution feature extraction module, and a multi-feature fusion module. Feature information is extracted and fused through dilated convolution, spatial attention module, and channel attention module to predict targets.
It improves the accuracy of weak target detection, outperforming traditional and other intelligent processing systems, and enhances stability and confidence.
Figure CN116342988B_ABST
Abstract
Description
Technical Field
[0001] This invention relates to target detection using infrared guided imaging, specifically an intelligent detection system for weak and small targets based on salient feature information fusion.
Background Technology
[0002] The purpose of weak target detection is to identify the location and category of weak targets in images or videos. Its main applications include security monitoring, aerial photography, medical cytology, and industrial flaw detection, where it has achieved significant success. In these fields, high detection rates and low false alarm rates are essential requirements for weak target detection. Although the rise of intelligent technologies such as deep learning has led to the rapid development of conventional target detection systems, these systems cannot be directly applied to the detection of weak targets due to their inherent characteristics.
[0003] Small targets often occupy few pixels and have low contrast in images. Combined with complex and varied backgrounds, this makes detection difficult. Most systems can only use features such as grayscale distribution, motion characteristics, and direction of motion. According to the definition of the International Society for Optics and Photonics (SPIE), small targets generally occupy no more than 9×9 pixels, that is, at most 81 pixels. In some cases with extremely long detection distances, the target may even appear as a single pixel in the image; such targets are called "point targets." Furthermore, small targets have low signal-to-noise ratios, with local signal-to-noise ratios of less than 5 dB, which further increases the difficulty of detecting them.
[0004] To date, many papers have studied weak target detection. Among traditional methods, Paper 1 (Marvasti FS, Mosavi MR, Nasiri M. 2018. Flying small target detection in IR images based on adaptive toggle operator. IET Computer Vision, 12(4): 527-534) defines a new opening operation to obtain a new Top-Hat transform. This algorithm addresses the inability of the traditional Top-Hat transform to distinguish real targets in weak target detection, but its detection performance is still poor. Paper 2 (Hou XD and Zhang LQ. 2007. Saliency detection: a spectral residual approach // Proceedings of 2007 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Minneapolis: IEEE: 1-8) achieves detection by preserving the characteristics of the canonical model and suppressing the other characteristics. This algorithm requires no prior information and performs well on small infrared targets whose texture, shape, and other features are not obvious. It is simple and easy to implement, but it cannot effectively suppress background clutter and has a low detection rate.
[0005] In the field of deep learning, Reference 3 (Shi, M., Wang, H. Infrared Dim and Small Target Detection Based on Denoising Autoencoder Network. Mobile Netw Appl 25, 1469–1483 (2020)) treats small targets as noise and uses a denoising autoencoder for small target detection. First, simulated infrared small targets are superimposed on the infrared background as network input. Then, the network is used for denoising to obtain a clean background image. Finally, the clean background image is subtracted from the output image to obtain the final small target detection result. Reference 4 (B. Zhao, C. Wang, Q. Fu and Z. Han, "A Novel Pattern for Infrared Small Target Detection With Generative Adversarial Network," in IEEE Transactions on Geoscience and Remote Sensing, vol. 59, no. 5, pp. 4481-4492, May 2021, doi: 10.1109/TGRS.2020.3012981) argues that infrared small targets are a special type of noise and can be predicted from input images based on the data distribution and hierarchical features learned by a GAN. Using the idea of image-to-image style transfer based on generative adversarial networks, and with the help of the U-Net network, fake images containing only the target are generated and compared with real target images, improving the network's detection capability. Compared with traditional algorithms, deep learning methods significantly improve detection performance, but they also suffer from problems such as unstable confidence of detection results and unstable bounding boxes.
Summary of the Invention
[0006] The technical problem to be solved by this invention is to address the insufficient performance of infrared weak target detection in existing technologies and the instability of confidence and bounding boxes in intelligent detection systems. This invention proposes a weak target intelligent detection system based on saliency feature information fusion.
[0007] This invention discloses an intelligent detection system for weak targets based on salient feature information fusion, comprising: a multi-scale salient feature extraction module, a low-resolution feature extraction module, and a multi-feature fusion module; wherein:
[0008] Multi-scale salient feature extraction module: Extracts feature information from the input image and sends it to the multi-feature fusion module;
[0009] Low-resolution feature extraction module: Extracts auxiliary features from the input image and sends them to the multi-feature fusion module;
[0010] Multi-feature fusion module: After fusing the feature information sent by the multi-scale saliency feature extraction module and the auxiliary features sent by the low-resolution feature extraction module, the fused feature information is output; the fused feature information is decoded to obtain the target prediction result.
[0011] In the aforementioned intelligent detection system, the specific method for decoding the fused feature information to obtain the target prediction result is as follows:
[0012]
[0013] Here, b_x and b_y are the center coordinates of the predicted result box; b_w and b_h are the width and height of the predicted result box; t_x and t_y are the bounding box center coordinates included in each target result; t_w and t_h are the bounding box width and height included in each target result; c_x and c_y are the coordinates of the top-left corner of the feature pixel in which the center point is located; and p_w and p_h are the width and height of the current prior bounding box, respectively.
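The decoding formula itself is not reproduced in this text, so the following is only a minimal sketch under an explicit assumption: that the decoding follows the widely used YOLO-style scheme, in which a sigmoid places the center inside its feature pixel and an exponential scales the prior box. The function names `sigmoid` and `decode_box` are hypothetical; the symbols match the definitions above.

```python
import math


def sigmoid(v: float) -> float:
    """Logistic function, mapping any real value into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-v))


def decode_box(t_x, t_y, t_w, t_h, c_x, c_y, p_w, p_h):
    """Decode one raw prediction into a box (ASSUMED YOLO-style formula).

    (t_x, t_y, t_w, t_h): raw network outputs for one target result.
    (c_x, c_y): top-left corner of the feature pixel holding the center.
    (p_w, p_h): width and height of the current prior bounding box.
    """
    b_x = sigmoid(t_x) + c_x   # center x, offset within the feature pixel
    b_y = sigmoid(t_y) + c_y   # center y, offset within the feature pixel
    b_w = p_w * math.exp(t_w)  # width as a scaling of the prior width
    b_h = p_h * math.exp(t_h)  # height as a scaling of the prior height
    return b_x, b_y, b_w, b_h


# With all raw outputs at zero, the box sits at the pixel center
# and has exactly the prior's size.
box = decode_box(0.0, 0.0, 0.0, 0.0, c_x=4, c_y=7, p_w=3.0, p_h=5.0)
print(box)  # (4.5, 7.5, 3.0, 5.0)
```

Under this assumption the predicted center can never leave the feature pixel that produced it, which is one common way to keep box regression bounded.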
[0014] In the aforementioned intelligent detection system, the multi-scale saliency feature extraction module includes dilated convolution, a spatial attention module, and a channel attention module. The dilated convolution performs multi-scale feature extraction on the input features, yielding features at several different scales. Each scale feature serves as the input to the corresponding spatial attention module, and through an attention propagation mechanism, feature maps and attention weights with added attention at different scales are calculated. The channel attention module concatenates the feature maps to obtain the output features.
[0015] In the aforementioned intelligent detection system, the spatial attention module includes a first spatial attention module and a second spatial attention module. The first spatial attention module includes one input item, and the second spatial attention module includes two input items.
[0016] In the aforementioned intelligent detection system, the first spatial attention module takes the features of size H_sa1 × W_sa1 × C_sa1 obtained from the dilated convolution and applies convolution operations with kernel sizes of 3×3 and 1×1 to obtain a new feature map. Max pooling or average pooling is then performed along the channel dimension, with a pooling size of 1 × C_sa1. After the Sigmoid activation function, spatial attention weights of size H_sa1 × W_sa1 × 1 are obtained, and these weights are multiplied element-wise with the input to obtain the output features; where H_sa1 is the height of the feature map of the first spatial attention module, W_sa1 is its width, and C_sa1 is its number of channels.
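The data flow of the first spatial attention module can be sketched in NumPy as follows. This is an illustration only: the helper names (`conv2d`, `spatial_attention_1`), the random weights, zero padding, the absence of bias and intermediate activations, and the reading that the 3×3 convolution precedes the 1×1 convolution are all assumptions not stated in the text.

```python
import numpy as np


def conv2d(x, w):
    """Zero-padded, stride-1 convolution. x: (H, W, C_in); w: (k, k, C_in, C_out)."""
    k = w.shape[0]
    pad = (k - 1) // 2
    xp = np.pad(x, ((pad, pad), (pad, pad), (0, 0)))
    H, W = x.shape[:2]
    out = np.zeros((H, W, w.shape[3]))
    for i in range(k):
        for j in range(k):
            # Shifted view of the input times the kernel slice at tap (i, j).
            out += xp[i:i + H, j:j + W, :] @ w[i, j]
    return out


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def spatial_attention_1(x, w3, w1, pool="max"):
    """Return (output features, spatial attention weights of shape (H, W, 1))."""
    f = conv2d(conv2d(x, w3), w1)            # 3x3 conv, then 1x1 conv (assumed order)
    if pool == "max":
        pooled = f.max(axis=2, keepdims=True)   # pool across all C channels
    else:
        pooled = f.mean(axis=2, keepdims=True)
    weights = sigmoid(pooled)                # one weight per spatial position
    return x * weights, weights              # broadcast over the channel axis


rng = np.random.default_rng(0)
H, W, C = 8, 8, 4
x = rng.standard_normal((H, W, C))
w3 = rng.standard_normal((3, 3, C, C)) * 0.1
w1 = rng.standard_normal((1, 1, C, C)) * 0.1
out, weights = spatial_attention_1(x, w3, w1)
print(out.shape, weights.shape)  # (8, 8, 4) (8, 8, 1)
```

Because the weights have a single channel, every channel at a given pixel is scaled by the same factor, which is what makes this attention spatial rather than channel-wise.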
[0017] In the aforementioned intelligent detection system, the second spatial attention module takes two inputs: the features of size H_sa2 × W_sa2 × C_sa2 obtained from the dilated convolution, and the attention weights (SAM) generated by another spatial attention module. Each input is subjected to a 3×3 convolution, the results are summed element-wise, and a 1×1 convolution is then applied to obtain a new feature map. Max pooling or average pooling is then performed along the channel dimension, with a pooling size of 1 × C_sa2. After the Sigmoid activation function, spatial attention weights of size H_sa2 × W_sa2 × 1 are obtained, and finally these weights are multiplied element-wise with the input to obtain the output features; where H_sa2 is the height of the feature map of the second spatial attention module, W_sa2 is its width, and C_sa2 is its number of channels.
[0018] In the aforementioned intelligent detection system, the attention propagation mechanism is as follows: the attention weights output by the spatial attention module corresponding to the dilated convolution with a dilation rate of d=1 serve as an input to the spatial attention module corresponding to the dilated convolution with a dilation rate of d=2, and so on; the spatial attention module corresponding to the dilated convolution with the largest dilation rate outputs only a feature map.
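The two-input module and the propagation of attention weights from one dilation rate to the next can be sketched together. This is a hedged illustration of the wiring only: all function and parameter names (`sa1`, `sa2`, `multi_scale_attention`, `w3a`), the random weights, max pooling as the channel pooling, and multiplying the weights with the dilated-convolution features are assumptions.

```python
import numpy as np


def conv2d(x, w, dilation=1):
    """Zero-padded, stride-1 (optionally dilated) conv. x: (H, W, C_in); w: (k, k, C_in, C_out)."""
    k = w.shape[0]
    pad = dilation * (k - 1) // 2
    xp = np.pad(x, ((pad, pad), (pad, pad), (0, 0)))
    H, W = x.shape[:2]
    out = np.zeros((H, W, w.shape[3]))
    for i in range(k):
        for j in range(k):
            r, c = i * dilation, j * dilation
            out += xp[r:r + H, c:c + W, :] @ w[i, j]
    return out


def pool_sigmoid(f):
    """Channel max pooling followed by Sigmoid, giving (H, W, 1) weights."""
    return 1.0 / (1.0 + np.exp(-f.max(axis=2, keepdims=True)))


def sa1(x, p):
    """First spatial attention module: one input."""
    a = pool_sigmoid(conv2d(conv2d(x, p["w3"]), p["w1"]))
    return x * a, a


def sa2(x, a_prev, p):
    """Second spatial attention module: features plus propagated weights."""
    s = conv2d(x, p["w3"]) + conv2d(a_prev, p["w3a"])  # 3x3 conv on each input, then sum
    a = pool_sigmoid(conv2d(s, p["w1"]))               # 1x1 conv, pool, Sigmoid
    return x * a, a


def multi_scale_attention(x, dconv_weights, sa_params):
    """Run d = 1, 2, ... branches, handing each branch's weights to the next."""
    outputs, a = [], None
    for d, (wd, p) in enumerate(zip(dconv_weights, sa_params), start=1):
        f = conv2d(x, wd, dilation=d)                  # features at this scale
        y, a = sa1(f, p) if d == 1 else sa2(f, a, p)
        outputs.append(y)
    return outputs  # the last branch's weights are simply not used further


rng = np.random.default_rng(0)
C = 4
rand = lambda *shape: rng.standard_normal(shape) * 0.1
x = rand(8, 8, C)
dconv_weights = [rand(3, 3, C, C) for _ in range(4)]
sa_params = [{"w3": rand(3, 3, C, C), "w3a": rand(3, 3, 1, C), "w1": rand(1, 1, C, C)}
             for _ in range(4)]
outs = multi_scale_attention(x, dconv_weights, sa_params)
print([o.shape for o in outs])
```

The four outputs all keep the input's spatial size, so they can be concatenated along the channel axis before the channel attention module.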
[0019] In the aforementioned intelligent detection system, the channel attention module concatenates the feature maps to obtain the output features. Specifically, the concatenated features of size H_ca × W_ca × C_ca are subjected to convolution operations with kernel sizes of 3×3 and 1×1 to obtain a new feature map. Max pooling or average pooling is then performed on the new feature map in the spatial dimension, with a pooling size of H_ca × W_ca. After the Sigmoid activation function, channel attention weights of size 1 × 1 × C_ca are obtained, which are multiplied channel by channel with the input to obtain the output features; where H_ca is the height of the feature map of the channel attention module, W_ca is its width, and C_ca is its number of channels.
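A minimal NumPy sketch of the channel attention step described above is given here for illustration. The names `conv2d` and `channel_attention`, the random weights, zero padding, and the choice of applying the 3×3 convolution before the 1×1 convolution are assumptions; the text does not fix these details.

```python
import numpy as np


def conv2d(x, w):
    """Zero-padded, stride-1 convolution. x: (H, W, C_in); w: (k, k, C_in, C_out)."""
    k = w.shape[0]
    pad = (k - 1) // 2
    xp = np.pad(x, ((pad, pad), (pad, pad), (0, 0)))
    H, W = x.shape[:2]
    out = np.zeros((H, W, w.shape[3]))
    for i in range(k):
        for j in range(k):
            out += xp[i:i + H, j:j + W, :] @ w[i, j]
    return out


def channel_attention(x, w3, w1, pool="avg"):
    """Return (output features, channel attention weights of shape (1, 1, C))."""
    f = conv2d(conv2d(x, w3), w1)                 # 3x3 conv, then 1x1 conv (assumed order)
    if pool == "avg":
        pooled = f.mean(axis=(0, 1), keepdims=True)   # pool over the whole H x W plane
    else:
        pooled = f.max(axis=(0, 1), keepdims=True)
    weights = 1.0 / (1.0 + np.exp(-pooled))       # Sigmoid, one weight per channel
    return x * weights, weights                   # broadcast over both spatial axes


rng = np.random.default_rng(0)
H, W, C = 8, 8, 4
x = rng.standard_normal((H, W, C))
w3 = rng.standard_normal((3, 3, C, C)) * 0.1
w1 = rng.standard_normal((1, 1, C, C)) * 0.1
out, weights = channel_attention(x, w3, w1)
print(out.shape, weights.shape)  # (8, 8, 4) (1, 1, 4)
```

Here each channel is scaled by one scalar, so channels whose pooled response is larger are emphasized across the entire feature map.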
[0020] In the aforementioned intelligent detection system, the specific method for extracting auxiliary features from the input image is as follows:
[0021] Max pooling with a pooling size of 2×2 is performed on the input features, followed by a convolution with a kernel size of 3×3, to obtain the first feature map;
[0022] A convolution with a kernel size of 3×3 and a stride of 2 is performed on the input features to obtain the second feature map;
[0023] The first feature map and the second feature map are concatenated by channel to obtain the concatenated feature map;
[0024] A convolution with a kernel size of 3×3 is performed on the concatenated feature map to obtain the auxiliary features of the input image.
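The four steps above can be sketched as a two-branch downsampling block. The sketch assumes zero padding of 1 for every 3×3 convolution, even input height and width, random weights, and a hypothetical output channel count `C_lr = 6`; none of these values come from the text.

```python
import numpy as np


def conv2d(x, w, stride=1):
    """Zero-padded convolution. x: (H, W, C_in); w: (k, k, C_in, C_out)."""
    k = w.shape[0]
    pad = (k - 1) // 2
    xp = np.pad(x, ((pad, pad), (pad, pad), (0, 0)))
    H, W = x.shape[:2]
    out = np.zeros((H, W, w.shape[3]))
    for i in range(k):
        for j in range(k):
            out += xp[i:i + H, j:j + W, :] @ w[i, j]
    return out[::stride, ::stride]   # keep every stride-th output position


def max_pool_2x2(x):
    """Non-overlapping 2x2 max pooling (odd trailing rows/columns are dropped)."""
    H, W, C = x.shape
    x = x[:H - H % 2, :W - W % 2]
    return x.reshape(H // 2, 2, W // 2, 2, C).max(axis=(1, 3))


def low_res_features(x, w_a, w_b, w_out):
    first = conv2d(max_pool_2x2(x), w_a)            # 2x2 max pooling, then 3x3 conv
    second = conv2d(x, w_b, stride=2)               # 3x3 conv with stride 2
    cat = np.concatenate([first, second], axis=2)   # concatenate by channel
    return conv2d(cat, w_out)                       # final 3x3 conv


rng = np.random.default_rng(0)
C, C_lr = 4, 6                                      # C_lr is a hypothetical value
x = rng.standard_normal((8, 8, C))
w_a = rng.standard_normal((3, 3, C, C)) * 0.1
w_b = rng.standard_normal((3, 3, C, C)) * 0.1
w_out = rng.standard_normal((3, 3, 2 * C, C_lr)) * 0.1
aux = low_res_features(x, w_a, w_b, w_out)
print(aux.shape)  # (4, 4, 6)
```

Both branches halve the spatial size, one by pooling and one by a learned strided convolution, so their outputs can be concatenated without any resizing.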
[0025] In the aforementioned intelligent detection system, the method for fusing the feature information sent by the multi-scale saliency feature extraction module and the auxiliary features sent by the low-resolution feature extraction module to output fused feature information is as follows:
[0026] Three features of adjacent sizes are taken from the feature information and the auxiliary features as input and are denoted M_low, M_mid, and M_high, respectively;
[0027] M_low is upsampled by deconvolution and then passed through a convolution with a kernel size of 1×1 to obtain the first fused feature map;
[0028] M_mid is passed through convolutions with kernel sizes of 3×3 and 1×1 consecutively to obtain the second fused feature map;
[0029] M_high is passed through a convolution with a stride of 2 and a kernel size of 3×3, followed by a convolution with a kernel size of 1×1, to obtain the third fused feature map;
[0030] The first, second, and third fused feature maps are added element by element to obtain the summed feature map M_add;
[0031] M_mid is passed through a convolution with a kernel size of 3×3, and the result is concatenated with M_add along the channel dimension to obtain the fused feature information.
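The fusion steps above can be illustrated with the following NumPy sketch. It assumes that adjacent features differ in spatial size by a factor of 2, that the deconvolution uses a 2×2 kernel with stride 2, and that all three branches are projected to the same channel count; the names `deconv2x`, `fuse`, and the parameter keys are hypothetical.

```python
import numpy as np


def conv2d(x, w, stride=1):
    """Zero-padded convolution. x: (H, W, C_in); w: (k, k, C_in, C_out)."""
    k = w.shape[0]
    pad = (k - 1) // 2
    xp = np.pad(x, ((pad, pad), (pad, pad), (0, 0)))
    H, W = x.shape[:2]
    out = np.zeros((H, W, w.shape[3]))
    for i in range(k):
        for j in range(k):
            out += xp[i:i + H, j:j + W, :] @ w[i, j]
    return out[::stride, ::stride]


def deconv2x(x, w):
    """Transposed conv, 2x2 kernel, stride 2 (assumed). w: (2, 2, C_in, C_out)."""
    H, W, _ = x.shape
    out = np.zeros((2 * H, 2 * W, w.shape[3]))
    for a in range(2):
        for b in range(2):
            out[a::2, b::2] = x @ w[a, b]   # each input pixel fills a 2x2 output patch
    return out


def fuse(m_low, m_mid, m_high, p):
    f1 = conv2d(deconv2x(m_low, p["low_up"]), p["low_1x1"])               # upsample, 1x1
    f2 = conv2d(conv2d(m_mid, p["mid_3x3"]), p["mid_1x1"])                # 3x3, 1x1
    f3 = conv2d(conv2d(m_high, p["high_3x3"], stride=2), p["high_1x1"])   # strided 3x3, 1x1
    m_add = f1 + f2 + f3                                                  # element-wise sum
    skip = conv2d(m_mid, p["skip_3x3"])                                   # 3x3 on M_mid
    return np.concatenate([skip, m_add], axis=2)                          # concat by channel


rng = np.random.default_rng(0)
rand = lambda *shape: rng.standard_normal(shape) * 0.1
m_low, m_mid, m_high = rand(4, 4, 8), rand(8, 8, 4), rand(16, 16, 2)
p = {"low_up": rand(2, 2, 8, 8), "low_1x1": rand(1, 1, 8, 4),
     "mid_3x3": rand(3, 3, 4, 4), "mid_1x1": rand(1, 1, 4, 4),
     "high_3x3": rand(3, 3, 2, 2), "high_1x1": rand(1, 1, 2, 4),
     "skip_3x3": rand(3, 3, 4, 4)}
fused = fuse(m_low, m_mid, m_high, p)
print(fused.shape)  # (8, 8, 8)
```

The 1×1 convolutions are what allow the element-wise sum: after them, all three branches share both the spatial size of M_mid and a common channel count.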
[0032] The advantages of this invention compared to the prior art are:
[0033] This invention extracts feature information from the input image through a multi-scale salient feature extraction module, adds a low-resolution feature extraction module to the deepest layer of the intelligent detection system to extract even lower-resolution auxiliary features, uses a multi-feature fusion module to fuse feature information from different levels as output, and finally decodes the output of the intelligent detection system to obtain the target prediction result. The target detection accuracy of the weak target intelligent detection system based on salient feature information fusion is generally better than that of traditional and other intelligent processing systems.
Attached Figure Description
[0034] Figure 1 is a structural diagram of the multi-scale salient feature extraction module of a weak target intelligent detection system based on salient feature information fusion according to an embodiment of the present invention;
[0035] Figure 2 is a structural diagram of the first spatial attention module of a weak target intelligent detection system based on salient feature information fusion according to an embodiment of the present invention;
[0036] Figure 3 is a structural diagram of the second spatial attention module of a weak target intelligent detection system based on salient feature information fusion according to an embodiment of the present invention;
[0037] Figure 4 is a structural diagram of the channel attention module of a weak target intelligent detection system based on salient feature information fusion according to an embodiment of the present invention;
[0038] Figure 5 is a structural diagram of the low-resolution auxiliary feature extraction module of a weak target intelligent detection system based on salient feature information fusion according to an embodiment of the present invention;
[0039] Figure 6 is a structural diagram of the multi-feature fusion module of a weak target intelligent detection system based on salient feature information fusion according to an embodiment of the present invention;
[0040] Figure 7 is a structural diagram of a weak target intelligent detection system based on salient feature information fusion according to an embodiment of the present invention.
Detailed Implementation
[0041] The working principle and process of the present invention will be further explained and described below with reference to the accompanying drawings.
[0042] This invention discloses an intelligent detection system for weak targets based on salient feature information fusion, comprising: a multi-scale salient feature extraction module, a low-resolution feature extraction module, and a multi-feature fusion module; wherein:
[0043] Multi-scale salient feature extraction module: Extracts feature information from the input image and sends it to the multi-feature fusion module;
[0044] Low-resolution feature extraction module: Extracts auxiliary features from the input image and sends them to the multi-feature fusion module;
[0045] Multi-feature fusion module: After fusing the feature information sent by the multi-scale saliency feature extraction module and the auxiliary features sent by the low-resolution feature extraction module, the fused feature information is output; the fused feature information is decoded to obtain the target prediction result.
[0046] In the multi-feature fusion module, the fused feature information is decoded to obtain the target prediction result. The specific method is as follows:
[0047]
[0048] Here, b_x and b_y are the center coordinates of the predicted result box; b_w and b_h are the width and height of the predicted result box; t_x and t_y are the bounding box center coordinates included in each target result; t_w and t_h are the bounding box width and height included in each target result; c_x and c_y are the coordinates of the top-left corner of the feature pixel in which the center point is located; and p_w and p_h are the width and height of the current prior bounding box, respectively.
[0049] The multi-scale salient feature extraction module includes dilated convolution, a spatial attention module, and a channel attention module. Dilated convolution extracts features from the input features at multiple scales, resulting in features at several different scales. Each scale feature serves as input to the corresponding spatial attention module, and through an attention propagation mechanism, feature maps and attention weights with attention added at different scales are calculated. The channel attention module concatenates the feature maps to obtain the output features. The spatial attention module includes a first spatial attention module and a second spatial attention module. The first spatial attention module has one input term, and the second spatial attention module has two input terms.
[0050] The first spatial attention module takes the features of size H_sa1 × W_sa1 × C_sa1 obtained from the dilated convolution and applies convolution operations with kernel sizes of 3×3 and 1×1 to obtain a new feature map. Max pooling or average pooling is then performed along the channel dimension, with a pooling size of 1 × C_sa1. After the Sigmoid activation function, spatial attention weights of size H_sa1 × W_sa1 × 1 are obtained, and finally these weights are multiplied element-wise with the input to obtain the output features; where H_sa1 is the height of the feature map of the first spatial attention module, W_sa1 is its width, and C_sa1 is its number of channels.
[0051] The second spatial attention module takes two inputs: the features of size H_sa2 × W_sa2 × C_sa2 obtained from the dilated convolution, and the attention weights (SAM) generated by another spatial attention module. Each input is subjected to a 3×3 convolution, the results are summed element-wise, and a 1×1 convolution is then applied to obtain a new feature map. Max pooling or average pooling is then performed along the channel dimension, with a pooling size of 1 × C_sa2. After the Sigmoid activation function, spatial attention weights of size H_sa2 × W_sa2 × 1 are obtained, and finally these weights are multiplied element-wise with the input to obtain the output features; where H_sa2 is the height of the feature map of the second spatial attention module, W_sa2 is its width, and C_sa2 is its number of channels.
[0052] The attention propagation mechanism is as follows: the attention weights output by the spatial attention module corresponding to the dilated convolution with a dilation rate of d=1 serve as an input to the spatial attention module corresponding to the dilated convolution with a dilation rate of d=2, and so on; the spatial attention module corresponding to the dilated convolution with the largest dilation rate outputs only the feature map.
[0053] The channel attention module in the multi-scale saliency feature extraction module concatenates the feature maps to obtain the output features. Specifically, the features of size H_ca × W_ca × C_ca are subjected to convolution operations with kernel sizes of 3×3 and 1×1 to obtain a new feature map. Max pooling or average pooling is then performed on the new feature map in the spatial dimension, with a pooling size of H_ca × W_ca. After the Sigmoid activation function, channel attention weights of size 1 × 1 × C_ca are obtained, which are multiplied channel by channel with the input to obtain the output features; where H_ca is the height of the feature map of the channel attention module, W_ca is its width, and C_ca is its number of channels.
[0054] In the low-resolution feature extraction module, auxiliary features of the input image are extracted using the following method: Max pooling is performed on the input features with a pooling size of 2×2, followed by a 3×3 convolution operation to obtain the first feature map; a 3×3 convolution operation with a stride of 2 is performed on the input features to obtain the second feature map; the first and second feature maps are concatenated by channel to obtain the concatenated feature map; and a 3×3 convolution operation is performed on the concatenated feature map to obtain the auxiliary features of the input image.
[0055] In the multi-feature fusion module, the feature information sent by the multi-scale saliency feature extraction module and the auxiliary features sent by the low-resolution feature extraction module are fused together to output fused feature information. The specific method is as follows:
[0056] Three features of adjacent sizes are taken from the feature information and the auxiliary features as input and are denoted M_low, M_mid, and M_high, respectively;
[0057] M_low is upsampled by deconvolution and then passed through a convolution with a kernel size of 1×1 to obtain the first fused feature map;
[0058] M_mid is passed through convolutions with kernel sizes of 3×3 and 1×1 consecutively to obtain the second fused feature map;
[0059] M_high is passed through a convolution with a stride of 2 and a kernel size of 3×3, followed by a convolution with a kernel size of 1×1, to obtain the third fused feature map;
[0060] The first, second, and third fused feature maps are added element by element to obtain the summed feature map M_add;
[0061] M_mid is passed through a convolution with a kernel size of 3×3, and the result is concatenated with M_add along the channel dimension to obtain the fused feature information.
[0062] Example
[0063] This invention proposes an intelligent target detection system based on salient feature information fusion. It extracts feature information from the input image through a designed multi-scale salient feature extraction module, adds a low-resolution feature extraction module at the deepest layer of the intelligent detection system to extract even lower-resolution auxiliary features, and uses a multi-feature fusion module to fuse feature information from different levels as output. Finally, the output of the intelligent detection system is decoded to obtain the target prediction result. The target detection accuracy of the intelligent target detection system based on salient feature information fusion is generally superior to traditional and other intelligent processing systems.
[0064] This embodiment is implemented based on the technical solution of the present invention, and provides detailed implementation methods and specific operation processes. However, the scope of protection of the present invention is not limited to the following embodiment.
[0065] First, features are extracted from an input image of size H×W×1 using the multi-scale salient feature extraction module. As shown in Figure 1, this module includes dilated convolution, spatial attention modules, an attention propagation mechanism, and a channel attention module. Dilated convolution is denoted by dConv, where dConv3,1 represents a dilated convolution with a kernel size of 3×3, a stride of 1, and a dilation rate of 1. SA1 and SA2 represent the two types of spatial attention modules, and CA represents the channel attention module. For the input of the multi-scale saliency feature extraction module, dilated convolutions with different dilation rates are first used to extract multi-scale features from the input features. For example, dilated convolutions with dilation rates of d=1, d=2, d=3, and d=4 are each applied to the input features, resulting in four features at different scales, each with the same size as the input features.
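To make the "different scales, same size" property concrete, the short sketch below uses the standard relation for dilated convolutions: a k×k kernel with dilation rate d covers an effective window of k + (k − 1)(d − 1) pixels per side. The padding values are an assumption (the text does not state them); they are simply the zero padding needed to keep the output the same size as the input at stride 1. The function names and the side length 64 are illustrative.

```python
def effective_kernel(k: int, d: int) -> int:
    """Side length of the window covered by a k x k kernel with dilation d."""
    return k + (k - 1) * (d - 1)


def same_padding(k: int, d: int) -> int:
    """Zero padding per side that preserves the input size at stride 1."""
    return (effective_kernel(k, d) - 1) // 2


def conv_output_size(n: int, k: int, d: int, pad: int, stride: int = 1) -> int:
    """Output side length for an input of side n."""
    return (n + 2 * pad - effective_kernel(k, d)) // stride + 1


for d in (1, 2, 3, 4):
    k_eff = effective_kernel(3, d)
    pad = same_padding(3, d)
    print(d, k_eff, pad, conv_output_size(64, 3, d, pad))
# d=1 -> window 3, padding 1, output 64
# d=2 -> window 5, padding 2, output 64
# d=3 -> window 7, padding 3, output 64
# d=4 -> window 9, padding 4, output 64
```

With a 3×3 kernel, the four dilation rates therefore look at 3×3, 5×5, 7×7, and 9×9 neighborhoods while using the same number of weights.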
[0066] The features at each scale serve as input to the corresponding spatial attention module. Through the attention propagation mechanism, attention-weighted features at different scales are computed to extract salient features. As shown in Figure 2, the first type of spatial attention module includes convolution, channel pooling, and Sigmoid activation. Convolution is denoted by Conv, where Conv3,1 represents a convolution operation with a kernel size of 3×3 and a stride of 1. Channel pooling represents global pooling along the channel direction of the features. The features of size H_sa1 × W_sa1 × C_sa1 obtained by dilated convolution are subjected to convolution operations with kernel sizes of 3×3 and 1×1 to obtain a new feature map. Max pooling or average pooling is then performed along the channel dimension, with a pooling size of 1 × C_sa1. After the Sigmoid activation function, spatial attention weights of size H_sa1 × W_sa1 × 1 are obtained, and finally these weights are multiplied element-wise with the input to obtain the output features.
[0067] The second type of spatial attention module is shown in Figure 3. Like the first spatial attention module, it includes convolution, channel pooling, and Sigmoid activation. This module has two inputs: the features of size H_sa2 × W_sa2 × C_sa2 obtained by dilated convolution, and the attention weights (SAM) generated by another spatial attention module. Each input is subjected to a 3×3 convolution, the results are summed element-wise, and a 1×1 convolution is then applied to obtain a new feature map. Max pooling or average pooling is then performed along the channel dimension, with a pooling size of 1 × C_sa2. After the Sigmoid activation function, spatial attention weights of size H_sa2 × W_sa2 × 1 are obtained, and finally these weights are multiplied element-wise with the input to obtain the output features. The attention weights generated by spatial attention are output as required by the attention propagation mechanism.
[0068] For features obtained from dilated convolutions with a dilation rate of d=1, the first type of spatial attention is used; for dilated convolution features with other dilation rates, the second type of spatial attention is used. The output attention weights of the spatial attention module corresponding to the dilated convolution with a dilation rate of d=1 are used as the input of the spatial attention module corresponding to the dilated convolution with a dilation rate of d=2, and so on. The spatial attention module corresponding to the dilated convolution with the largest dilation rate only outputs the feature map and does not output the attention weights.
[0069] After attention has been applied to the features at each scale, they are concatenated along the channel axis, and the number of channels is reduced using a 1×1 convolution kernel so that it matches the number of input feature channels of the multi-scale salient feature extraction module. A channel attention module is then added, with the structure shown in Figure 4. The features of size H_ca × W_ca × C_ca are subjected to convolution operations with kernel sizes of 3×3 and 1×1 to obtain a new feature map. Max pooling or average pooling is then performed in the spatial dimension, with a pooling size of H_ca × W_ca. After the Sigmoid activation function, channel attention weights of size 1 × 1 × C_ca are obtained, and finally these weights are multiplied with the input channel by channel to obtain the output features, highlighting the feature channels that contain more information.
[0070] Next, auxiliary features are extracted using the low-resolution auxiliary feature extraction module, which is shown in Figure 5. Taking the features extracted by the multi-scale saliency feature extraction module as input, one branch performs max pooling with a pooling size of 2×2 followed by a convolution with a kernel size of 3×3, and the other branch performs a convolution with a kernel size of 3×3 and a stride of 2. The two feature maps are then concatenated by channel, and a convolution with a kernel size of 3×3 is performed on the concatenated feature map to obtain the output of the low-resolution feature extraction module, with an output size of [missing information], where C_lr is the number of channels of the auxiliary features extracted by the low-resolution feature extraction module.
[0071] The multi-feature fusion module takes three features of adjacent sizes from the features obtained by the multi-scale salient feature extraction module and the low-resolution auxiliary feature extraction module as input, denoted M_low, M_mid, and M_high, respectively. M_low is upsampled by deconvolution and then passed through a convolution with a kernel size of 1×1; M_mid is passed through convolutions with kernel sizes of 3×3 and 1×1 consecutively; and M_high is passed through a convolution with a stride of 2 and a kernel size of 3×3, followed by a convolution with a kernel size of 1×1. The resulting three feature maps of the same size are then summed element-wise, and the sum is denoted M_add. In addition, M_mid is passed through a convolution with a kernel size of 3×3, and the result is concatenated with M_add along the channel dimension to obtain the output features. The structure of the multi-feature fusion module is shown in Figure 6.
[0072] The output features are then processed by convolution to obtain the output of the intelligent detection system. The overall structure of the intelligent detection system is shown in Figure 7. The intelligent detection system outputs three feature maps with dimensions of [sizes to be filled in].
[0073] Finally, the output of the intelligent detection system is decoded to obtain the target prediction results. The output is represented as three-dimensional feature maps, in which each output feature pixel predicts three target results, and each target result includes the bounding box center coordinates and width and height (t_x, t_y, t_w, t_h), a confidence level (conf), and class probabilities (p_cls1, p_cls2). Decoding uses the following formula:
[0074]
[0075] $c_x$ and $c_y$ represent the coordinates of the top-left corner of the feature pixel where the center point is located. $p_w$ and $p_h$ represent the width and height of the current prior bounding box.
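The decoding formula of paragraph [0074] is not reproduced in this text. The symbols that are defined (raw outputs $t_x, t_y, t_w, t_h$, cell corner $c_x, c_y$, prior size $p_w, p_h$) match the widely used YOLOv3-style decoding, in which the center offset is squashed by a sigmoid and the prior size is scaled exponentially. The sketch below assumes that form, that is $b_x = \sigma(t_x) + c_x$, $b_y = \sigma(t_y) + c_y$, $b_w = p_w e^{t_w}$, $b_h = p_h e^{t_h}$. This is an assumption made for illustration and may differ from the formula actually used in the patent.

```python
import math


def sigmoid(x):
    """Logistic function, mapping any real value into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def decode_box(t_x, t_y, t_w, t_h, c_x, c_y, p_w, p_h):
    """Assumed YOLOv3-style decoding; not confirmed by the source text.

    Returns the predicted box center (b_x, b_y) in feature-map cells and
    the predicted width and height (b_w, b_h) in the units of the prior.
    """
    b_x = sigmoid(t_x) + c_x   # center stays inside the cell at (c_x, c_y)
    b_y = sigmoid(t_y) + c_y
    b_w = p_w * math.exp(t_w)  # prior width scaled by exp(t_w)
    b_h = p_h * math.exp(t_h)  # prior height scaled by exp(t_h)
    return b_x, b_y, b_w, b_h


# Zero raw outputs place the center in the middle of the cell and keep
# the prior box size unchanged.
print(decode_box(0.0, 0.0, 0.0, 0.0, c_x=3, c_y=4, p_w=10, p_h=20))
# (3.5, 4.5, 10.0, 20.0)
```

Under this assumed form the sigmoid keeps each predicted center within its own feature pixel, and the exponential keeps the predicted width and height positive.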
[0076] Existing technologies have room for improvement in detecting small targets, and intelligent detection systems suffer from instability in confidence levels and bounding boxes. To address these issues, this invention proposes an intelligent detection system for small targets based on salient feature information fusion. First, a multi-scale salient feature extraction module extracts feature information from the input image. Then, a low-resolution feature extraction module is added to the deepest layer of the intelligent detection system to extract even lower-resolution auxiliary features. A multi-feature fusion module then fuses feature information from different levels as the output. Finally, the output of the intelligent detection system is decoded to obtain the target prediction result.
[0077] Although the present invention has been disclosed above with reference to preferred embodiments, it is not intended to limit the present invention. Any person skilled in the art can make possible changes and modifications to the technical solutions of the present invention by utilizing the disclosed system and technical content without departing from the spirit and scope of the present invention. Therefore, any simple modifications, equivalent changes and alterations made to the above embodiments based on the technical essence of the present invention without departing from the content of the technical solutions of the present invention shall fall within the protection scope of the technical solutions of the present invention.
Claims
1. A weak target intelligent detection system based on saliency feature information fusion, characterized in that it comprises a multi-scale saliency feature extraction module, a low-resolution feature extraction module, and a multi-feature fusion module, wherein:
the multi-scale saliency feature extraction module extracts feature information from the input image and sends it to the multi-feature fusion module;
the low-resolution feature extraction module extracts auxiliary features from the input image and sends them to the multi-feature fusion module;
the multi-feature fusion module fuses the feature information sent by the multi-scale saliency feature extraction module and the auxiliary features sent by the low-resolution feature extraction module, and outputs fused feature information; the fused feature information is decoded to obtain the target prediction result;
the multi-scale saliency feature extraction module includes dilated convolutions, spatial attention modules, and a channel attention module; the dilated convolutions perform multi-scale feature extraction on the input features, yielding features at several different scales; the feature at each scale serves as the input to the corresponding spatial attention module, and, through an attention propagation mechanism, the attention-weighted feature maps and the attention weights at the different scales are calculated; the channel attention module concatenates the feature maps to obtain the output features;
the spatial attention modules include a first spatial attention module, which has one input, and a second spatial attention module, which has two inputs;
the first spatial attention module takes the features of size [size missing] obtained from the dilated convolution and convolves them sequentially with kernels of size [size missing] and [size missing] to obtain a new feature map; max pooling or average pooling with a pooling size of [size missing] is then applied to the new feature map along the channel dimension; after the Sigmoid activation function, spatial attention weights of size [size missing] are obtained; finally, the spatial attention weights are multiplied element-wise with the input to obtain the output features; where $H_{sa1}$ is the height, $W_{sa1}$ is the width, and $C_{sa1}$ is the number of channels of the feature map of the first spatial attention module;
the second spatial attention module takes as input the features of size [size missing] obtained from the dilated convolution and the attention weights generated by another spatial attention module; a convolution of size [size missing] is applied to each of the two, and the results are added element-wise; the sum is then convolved sequentially with kernels of size [size missing] and [size missing] to obtain a new feature map; max pooling or average pooling with a pooling size of [size missing] is then applied to the new feature map along the channel dimension; after the Sigmoid activation function, spatial attention weights of size [size missing] are obtained; the spatial attention weights are then multiplied element-wise with the input to obtain the output features; where $H_{sa2}$ is the height, $W_{sa2}$ is the width, and $C_{sa2}$ is the number of channels of the feature map of the second spatial attention module.
2. The intelligent detection system for weak targets based on saliency feature information fusion as described in claim 1, characterized in that the fused feature information is decoded to obtain the target prediction result as follows: [formula missing]; where $b_x$ and $b_y$ are the center coordinates of the predicted result box; $b_w$ and $b_h$ are the width and height of the predicted result box; $t_x$ and $t_y$ are the bounding-box center coordinates included in each target result; $t_w$ and $t_h$ are the bounding-box width and height included in each target result; $c_x$ and $c_y$ are the coordinates of the top-left corner of the feature pixel where the center point is located; and $p_w$ and $p_h$ are the width and height of the current prior bounding box, respectively.
3. The intelligent detection system for weak targets based on saliency feature information fusion as described in claim 1, characterized in that the attention propagation mechanism is specifically as follows: the spatial attention module corresponding to the dilated convolution with dilation rate [value missing] supplies its output attention weights as an input to the spatial attention module corresponding to the dilated convolution with dilation rate [value missing], and so on up to the spatial attention module corresponding to the dilated convolution with the largest dilation rate; each spatial attention module outputs the feature map for its corresponding dilated convolution.
4. The intelligent detection system for weak targets based on saliency feature information fusion as described in claim 1, characterized in that the channel attention module concatenates the feature maps to obtain the output features as follows: the concatenated features, of size [size missing], are convolved sequentially with kernels of size [size missing] and [size missing] to obtain a new feature map; max pooling or average pooling with a pooling size of [size missing] is then applied to the new feature map over its spatial dimensions; after the Sigmoid activation function, channel attention weights of size [size missing] are obtained; the channel attention weights are multiplied channel by channel with the input to obtain the output features; where $H_{ca}$ is the height, $W_{ca}$ is the width, and $C_{ca}$ is the number of channels of the feature map of the channel attention module.
5. The intelligent detection system for weak targets based on saliency feature information fusion as described in claim 1, characterized in that the auxiliary features are extracted from the input image as follows: max pooling with a pooling size of 2×2 is performed on the input features, followed by a convolution with a kernel size of 3×3, to obtain the first feature map; a convolution with a kernel size of 3×3 and a stride of 2 is performed on the input features to obtain the second feature map; the first feature map and the second feature map are concatenated by channel to obtain the concatenated feature map; a convolution with a kernel size of 3×3 is performed on the concatenated feature map to obtain the auxiliary features of the input image.
6. The intelligent detection system for weak targets based on saliency feature information fusion as described in claim 1, characterized in that the feature information sent by the multi-scale saliency feature extraction module and the auxiliary features sent by the low-resolution feature extraction module are fused to output fused feature information as follows: three features of adjacent sizes are taken from the feature information and the auxiliary features as input and defined as $M_{low}$, $M_{mid}$, and $M_{high}$; $M_{low}$ is upsampled by deconvolution and then convolved with a kernel size of 1×1 to obtain the first fused feature map; $M_{mid}$ is convolved consecutively with kernel sizes of 3×3 and 1×1 to obtain the second fused feature map; $M_{high}$ is convolved with a stride of 2 and a kernel size of 3×3 and then convolved with a kernel size of 1×1 to obtain the third fused feature map; the first, second, and third fused feature maps are added element by element to obtain the summed feature map $M_{add}$; $M_{mid}$ is convolved with a kernel size of 3×3 and then concatenated with $M_{add}$ along the channel dimension to obtain the fused feature information.
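The weighting step shared by the spatial attention modules of claim 1 and the channel attention module of claim 4 (pool, apply Sigmoid, multiply with the input) can be sketched in plain Python on a small nested-list tensor. This is a simplified illustration under stated assumptions: the convolutions that precede the pooling are omitted, the pooling is assumed to span the whole channel axis (spatial attention) or the whole spatial plane (channel attention) because the pooling sizes are missing from the text, and the function names are hypothetical.

```python
import math


def sigmoid(x):
    """Logistic function, mapping any real value into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def mean(values):
    """Average of an iterable, used for the average-pooling variant."""
    values = list(values)
    return sum(values) / len(values)


def spatial_attention(features, pool=max):
    """features[c][h][w] -> (weighted features, weights[h][w]).

    Pools across all channels (assumed) so each pixel gets one weight.
    """
    channels = len(features)
    height, width = len(features[0]), len(features[0][0])
    weights = [[sigmoid(pool(features[c][i][j] for c in range(channels)))
                for j in range(width)] for i in range(height)]
    # Element-wise multiplication of every channel with the weight map.
    weighted = [[[features[c][i][j] * weights[i][j] for j in range(width)]
                 for i in range(height)] for c in range(channels)]
    return weighted, weights


def channel_attention(features, pool=max):
    """features[c][h][w] -> (weighted features, weights[c]).

    Pools across the whole spatial plane (assumed) so each channel gets
    one weight.
    """
    weights = [sigmoid(pool(v for row in plane for v in row))
               for plane in features]
    # Channel-by-channel multiplication with the scalar weight.
    weighted = [[[v * w for v in row] for row in plane]
                for plane, w in zip(features, weights)]
    return weighted, weights


# Two channels, one row, two columns.
feats = [[[0.0, 1.0]], [[2.0, -1.0]]]
_, spatial_w = spatial_attention(feats)            # max pooling variant
_, channel_w = channel_attention(feats, pool=mean)  # average pooling variant
print(len(spatial_w), len(spatial_w[0]), len(channel_w))  # 1 2 2
```

Passing `pool=max` or `pool=mean` selects between the max pooling and average pooling alternatives named in the claims; the spatial variant yields one weight per pixel and the channel variant one weight per channel.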