Target detection method based on multi-scale attention

By constructing a multi-scale semantic feature fusion detection network that combines multi-scale attention generation and feature fusion modules, the method addresses the poor detection performance on small-sized targets and the slow computation speed in infrared ship target detection, achieving efficient and accurate target detection.

CN115620127B · Active · Publication Date: 2026-06-19 · HEBEI HANGUANG HEAVY IND

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents (China)
Current Assignee / Owner: HEBEI HANGUANG HEAVY IND
Filing Date: 2022-09-16
Publication Date: 2026-06-19

AI Technical Summary

Technical Problem

Existing infrared ship target detection methods based on deep neural networks perform poorly on small targets and targets with indistinct features, while target segmentation-based methods are slow to compute and require high-quality data labeling.

Method used

A multi-scale semantic feature fusion detection network is constructed, including an object detection module, a multi-scale attention generation module, and a multi-scale feature fusion module. The multi-scale attention generation module generates attention maps for each layer and multiplies them with the basic feature maps to suppress background regions. Combined with the multi-scale feature fusion module, attention to small-sized targets is improved. The network is trained using pixel-level mask labels to optimize the object detection network.

Benefits of technology

It improves the detection performance of small targets, simplifies the data annotation process, and enhances detection speed and accuracy. The algorithm is simple and computationally efficient.

✦ Generated by Eureka AI based on patent content.


Abstract

This disclosure provides a multi-scale attention-based object detection method, comprising: constructing a multi-scale semantic feature fusion detection network, including an object detection module, a multi-scale attention generation module, and a multi-scale feature fusion module; acquiring a target dataset containing rectangular bounding boxes for training; calculating multiple pixel-level mask labels for each training image based on the rectangular bounding boxes; training the multi-scale semantic feature fusion detection network using the dataset containing rectangular bounding boxes and pixel-level mask labels; acquiring the image to be detected and inputting it into the trained detection network to obtain the result. This disclosure, by fusing a multi-scale attention module incorporating image segmentation features with the original convolutional features in the object detection network, can suppress background regions, improve attention to small-sized targets, and enhance detection performance.

Description

Technical Field

[0001] This invention relates to the field of computer vision, and in particular to a target detection method based on multi-scale attention.

Background Technology

[0002] Traditional infrared ship target detection algorithms are mostly based on constant false alarm rate (CFAR) models under non-stationary Gaussian noise. These models integrate adaptive thresholding strategies and sea clutter statistical models, achieving high accuracy for high-resolution, large-scale targets, but lacking the ability to detect low-contrast targets in complex backgrounds. In recent years, deep neural networks, with their unique feature representation capabilities, have significantly improved the accuracy of target detection tasks, achieving excellent performance in natural images. However, they are difficult to directly transfer to infrared ship detection tasks. Long-range imaging of ship targets lacks features such as size, shape, and texture, and is easily affected by sea clutter and cloud cover, making target detection very difficult. Currently, there are two main methods for infrared ship target detection based on deep neural networks: one uses target detection networks such as YOLO as the backbone, and the other uses target segmentation networks such as FCN as the backbone. Method one has low requirements for dataset annotation and fast computation speed, but low detection accuracy and poor performance for small targets and targets with indistinct features; method two has high detection accuracy, but requires more complex datasets with pixel-level annotation and is slower in computation.

Summary of the Invention

[0003] To address the issues of poor detection performance for small-sized targets and targets with indistinct features in existing target detection methods, and the slow computation speed and high data annotation requirements of target segmentation methods, this disclosure proposes a target detection method based on image segmentation. This method improves the attention to small-sized targets and enhances the detection performance, making it applicable to the detection of infrared ship targets.

[0004] The object detection method based on image segmentation disclosed herein includes:

[0005] A multi-scale semantic feature fusion detection network is constructed, comprising: an object detection module, a multi-scale attention generation module, and a multi-scale feature fusion module, wherein:

[0006] The target detection module extracts feature maps of the target at different levels;

[0007] The multi-scale attention generation module generates an attention map for each layer based on the basic feature map, and multiplies it with the basic feature map of each layer to generate a feature map with attention added to each layer.

[0008] The multi-scale feature fusion module fuses the feature maps after attention is added to each layer to generate the final feature;

[0009] The target detection module performs target detection based on the final features;

[0010] Obtain the target training dataset containing rectangular annotation information, and calculate multiple pixel-level mask label images corresponding to each training image based on the rectangular annotation boxes;

[0011] The multi-scale semantic feature fusion detection network is trained using a target training dataset that includes the rectangular annotation information and pixel-level mask labels.

[0012] The image to be detected is obtained and input into the trained detection network to obtain the detection result.

[0013] Furthermore, the method for obtaining the pixel-level mask label includes the following steps:

[0014] Obtain a target training dataset containing rectangular annotation information. The rectangular annotation information refers to the position and size information of the target in the image, represented by a quadruple Box = (x, y, w, h), where (x, y) refers to the coordinates of the upper left corner of the target's bounding rectangle, w refers to the width of the target's bounding rectangle, and h refers to the height of the target's bounding rectangle.

[0015] For a single training image Image, generate N images of the same size, denoted as M = {m_1, m_2, ..., m_N};

[0016] Across all training set images, take the largest target bounding-box size as the maximum value and the smallest target bounding-box size as the minimum value. Divide this size range evenly into N intervals and set the division thresholds.

[0017] Determine the size interval to which each rectangular bounding box in Image belongs. If a rectangular bounding box Box = (x, y, w, h) belongs to interval i (i ∈ N), then m_i can be represented as:

[0018]

[0019] Where p_{k,j} is the pixel value of m_i at coordinate position (k, j), and M is the set of N pixel mask label images generated for Image for targets of different sizes.

[0020] Generate corresponding pixel mask label images for all training set images using the method described above.
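The mask-label steps above can be sketched in NumPy as follows. This is a minimal illustration rather than the patented implementation: the function names are hypothetical, the "size" of a box is assumed to be its longer side (the text does not define it), and pixels inside a rectangle are assumed to be labeled 1 with all other pixels 0.

```python
import numpy as np

def size_thresholds(all_sizes, n_intervals):
    """Evenly split [min size, max size]; returns n_intervals + 1 edges."""
    return np.linspace(min(all_sizes), max(all_sizes), n_intervals + 1)

def box_size(box):
    # Assumption: "size" means the longer side of the rectangle.
    _, _, w, h = box
    return max(w, h)

def make_mask_labels(image_shape, boxes, edges):
    """Return masks of shape (N, H, W), one binary mask per size interval."""
    n_intervals = len(edges) - 1
    height, width = image_shape
    masks = np.zeros((n_intervals, height, width), dtype=np.uint8)
    for box in boxes:
        x, y, w, h = box  # (x, y) is the upper-left corner
        # Compare against interior edges only, so the maximum size
        # still falls into the last interval.
        i = int(np.searchsorted(edges[1:-1], box_size(box), side="right"))
        # Assumption: pixels inside the rectangle are 1, all others stay 0.
        masks[i, y:y + h, x:x + w] = 1
    return masks
```

With N = 4 and box sizes ranging from 10 to 90, the edges are 10, 30, 50, 70, 90, so a box whose longer side is 10 is drawn into the first mask and a box whose longer side is 90 into the last one.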

[0021] Furthermore, the target detection module uses a two-stage target detection network Faster-RCNN as the baseline network, wherein VGG16 is used as the target basic feature extraction network, and the rectangular annotation information is used as the training label in this module.

[0022] Furthermore, the target detection network is optimized as follows:

[0023] The fourth pooling operation in VGG-16 is removed, reducing the feature resolution scaling factor from 16 to 8.
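The effect of removing one pooling stage can be checked with simple arithmetic. The sketch below assumes each pooling layer halves the resolution (stride 2), as in the standard VGG-16 layout; the function names and the 512×640 input size are hypothetical.

```python
def downsample_factor(num_pooling_layers, pool_stride=2):
    """Total resolution reduction after a stack of stride-2 pooling layers."""
    return pool_stride ** num_pooling_layers

def feature_map_size(input_size, num_pooling_layers):
    """Spatial size of the feature map for a given (height, width) input."""
    factor = downsample_factor(num_pooling_layers)
    height, width = input_size
    return height // factor, width // factor

# Four pooling stages give a factor of 16; three give a factor of 8.
print(downsample_factor(4), downsample_factor(3))
# A hypothetical 512x640 input: 32x40 before the change, 64x80 after it.
print(feature_map_size((512, 640), 4), feature_map_size((512, 640), 3))
```

Under these assumptions a 16×16-pixel target covers a single feature cell at a factor of 16 but a 2×2 patch at a factor of 8, which is why the change helps small targets.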

[0024] Furthermore, the multi-scale attention generation module generates feature maps for each layer after attention has been added, including the following steps:

[0025] Max pooling is performed on the low-level feature maps, and deconvolution is performed on the high-level feature maps to make the feature map sizes uniform across all layers.

[0026] For each layer, perform the following calculations:

[0027] After further feature extraction through 3 layers of 3×3 convolution, features with 2 channels are obtained, and normalization processing is performed on them in the (0,1) interval to obtain a probability map representing the positive and negative categories of each pixel.

[0028] The probability map representing the probability that a pixel is a positive sample is used as attention. It is then multiplied pixel-wise with the feature map obtained after unifying the size of the feature map of this layer to obtain the feature map after adding attention.
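The normalization and weighting steps can be illustrated with a short NumPy sketch. It starts from the 2-channel output of the 3×3 convolution stack (the convolutions themselves are omitted), uses softmax for the (0,1) normalization, and assumes that channel index 1 holds the positive class; the function names are hypothetical.

```python
import numpy as np

def softmax(logits, axis=0):
    """Numerically stable softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def apply_attention(feature_map, logits, positive_channel=1):
    """Weight a (C, H, W) feature map by the positive-class probability.

    logits has shape (2, H, W): negative/positive scores per pixel.
    Returns the attended feature map and the (H, W) attention map.
    """
    probs = softmax(logits, axis=0)
    attention = probs[positive_channel]  # assumption: channel 1 is "positive"
    attended = feature_map * attention[None, :, :]  # broadcast over channels
    return attended, attention
```

Pixels the module scores as background receive a weight near 0 and are suppressed, while pixels scored as target keep most of their original feature values.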

[0029] Furthermore, the backbone network of the multi-scale feature fusion module is a VGG-16 network, and four feature layers—conv2-2, conv3-3, conv4-3, and conv5-3—are used as feature fusion layers; the method for generating the final features includes:

[0030] The number of channels in the feature maps of each layer is unified by 1×1 convolution;

[0031] For each layer, pooling or deconvolution is used to unify it to the same scale as conv3-3;

[0032] The final fused feature is obtained by adding the feature maps of all layers pixel by pixel.
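A rough NumPy sketch of the three fusion steps is given below. It is illustrative only: the 1×1 convolution is written as a per-pixel matrix product with caller-supplied weights, max pooling is assumed for downscaling, and nearest-neighbour upsampling stands in for the learned deconvolution. All function names are hypothetical.

```python
import numpy as np

def conv1x1(feature_map, weight):
    """1x1 convolution: (C_in, H, W) with weight (C_out, C_in) -> (C_out, H, W)."""
    return np.einsum("oc,chw->ohw", weight, feature_map)

def max_pool(feature_map, factor):
    """Non-overlapping max pooling; H and W must be divisible by factor."""
    c, h, w = feature_map.shape
    blocks = feature_map.reshape(c, h // factor, factor, w // factor, factor)
    return blocks.max(axis=(2, 4))

def upsample_nearest(feature_map, factor):
    # Stand-in for deconvolution, which would use learned weights.
    return feature_map.repeat(factor, axis=1).repeat(factor, axis=2)

def fuse(features, weights, target_hw):
    """Unify channels, unify scale to target_hw, then add pixel by pixel."""
    fused = None
    for feature_map, weight in zip(features, weights):
        x = conv1x1(feature_map, weight)
        h = x.shape[1]
        if h > target_hw[0]:
            x = max_pool(x, h // target_hw[0])
        elif h < target_hw[0]:
            x = upsample_nearest(x, target_hw[0] // h)
        fused = x if fused is None else fused + x
    return fused
```

Here `target_hw` would be the spatial size of the conv3-3 feature map, so larger maps are pooled down and smaller maps are scaled up before the sum.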

[0033] Furthermore, in the step of training the multi-scale semantic feature fusion detection network, the pixel mask label map is used as the training label for the multi-scale attention generation module. The specific method includes:

[0034] Small-sized target mask labels are used for low-level features, and large-sized target mask labels are used for high-level features. Each mask label has the same size as the output attention map, with pixels in one-to-one correspondence.

[0035] The pixel-level cross-entropy loss is used as the loss function, and the calculation formula is as follows:

[0036]

[0037] Where i refers to the i-th pixel in the image, n refers to the total number of pixels in the image, y_i represents the pixel value at that location in the pixel-level mask image, and y′_i refers to the pixel value at that location in the network's output attention map.
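A pixel-level cross-entropy between a mask label and an attention map can be sketched as below. The binary form averaged over the n pixels is an assumption, since the formula itself is not reproduced in this text; y is the mask value, y′ the predicted attention value, and the function name and `eps` clipping are illustrative.

```python
import numpy as np

def pixel_cross_entropy(mask, attention, eps=1e-7):
    """Mean binary cross-entropy over all n pixels.

    mask:      array of 0/1 labels (y_i)
    attention: array of predicted probabilities in (0, 1) (y'_i)
    """
    y = np.asarray(mask, dtype=np.float64).ravel()
    y_pred = np.asarray(attention, dtype=np.float64).ravel()
    y_pred = np.clip(y_pred, eps, 1.0 - eps)  # avoid log(0)
    per_pixel = -(y * np.log(y_pred) + (1.0 - y) * np.log(1.0 - y_pred))
    return per_pixel.mean()
```

An attention map that is 0.5 everywhere gives a loss of ln 2 regardless of the mask, and the loss approaches 0 as the attention map approaches the mask.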

[0038] The detection method disclosed herein addresses the problems of poor detection performance for small-sized targets and targets with indistinct features in existing target detection methods, as well as the problems of slow computation speed and high data annotation requirements in target segmentation methods. By designing a multi-scale attention module that incorporates image segmentation features into the target detection network and fusing it with the original convolutional features, background regions are suppressed, attention to small-sized targets is improved, and detection performance is enhanced.

[0039] Compared with existing technologies, the beneficial effects of this disclosure are: ① By fusing multi-scale image segmentation features into the target detection network, background features are suppressed and small target features are strengthened, effectively improving the detection effect of small targets; ② Through supervision, feature maps of different scales can focus on targets of different scales during fusion, achieving the goal of retaining the advantages of each layer of feature maps in the fusion; ③ The features of each layer are finally unified to the same scale as conv3-3, which not only retains spatial information of a certain scale, but also has certain channel information, while not slowing down the training speed by adding too many parameters; ④ The algorithm is simple and computationally efficient.

Attached Figure Description

[0040] The above and other objects, features and advantages of this disclosure will become more apparent from the more detailed description of exemplary embodiments of this disclosure taken in conjunction with the accompanying drawings, in which the same reference numerals generally represent the same components.

[0041] Figure 1 shows an overall flowchart according to an exemplary embodiment;

[0042] Figure 2 shows an example flowchart of the attention generation module;

[0043] Figure 3 shows an example of the process from basic feature extraction to attention generation to multi-scale feature fusion.

Detailed Implementation

[0044] Preferred embodiments of the present disclosure will now be described in more detail with reference to the accompanying drawings. While preferred embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be implemented in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that the present disclosure will be thorough and complete, and will fully convey the scope of the present disclosure to those skilled in the art.

[0045] This invention provides a target detection method based on image segmentation; see the exemplary flowchart in Figure 1. Taking the detection of infrared ship targets as an example, the method is further explained through the following steps:

[0046] Step S101: Construct a multi-scale semantic feature fusion detection network, mainly including: an object detection module, a multi-scale attention generation module, and a feature fusion module. Among them:

[0047] The object detection module uses the two-stage object detection network Faster-RCNN as the base network, with VGG16 as the basic feature extraction network. Because this network performs four pooling operations, the image resolution is reduced by a factor of 16 after feature extraction, leading to the loss of small-scale object features. Therefore, as a preferred solution, the fourth pooling operation in VGG-16 is removed, reducing the feature resolution reduction ratio from 16 to 8. Simultaneously, the input image size is increased through sampling, and the shortest side length of the input image is specified to further improve the detection performance of small objects.

[0048] For the multi-scale attention generation module, see Figure 2. The input is the feature map F_i extracted by the target detection module at each layer, and the output is the corresponding feature map of each layer with attention added, where i denotes the i-th layer feature. First, max pooling is performed on the lower-level feature maps, and deconvolution is performed on the higher-level feature maps, to unify the size of the feature maps. Then, for each layer's feature map, three 3×3 convolutions are performed to further extract features, resulting in features with two channels. These features are normalized to the (0, 1) interval using softmax, yielding a probability map representing the positive or negative class of each pixel. The probability map giving the probability that a pixel is a positive sample is used as attention and is multiplied pixel by pixel with the size-unified feature map of that layer. This gives higher weight to pixel regions classified as positive, thus suppressing background regions.

[0049] In the multi-scale feature fusion module, the backbone network is a VGG-16 network (see Figure 3). In this example, conv1 only extracts superficial features such as edges and has virtually no representational power. Therefore, four feature layers (conv2-2, conv3-3, conv4-3, and conv5-3) are selected as feature fusion layers. Then, a 1×1 convolution is used to unify the number of channels in each feature map, and pooling or deconvolution is used to unify the scale. The final fused feature is obtained by adding the layers pixel by pixel. Preferably, the features of each layer are ultimately unified to the same scale as conv3-3, which preserves spatial information at a certain scale together with a certain amount of channel information, without adding so many parameters that training slows down. The scale changes of each feature fusion layer in this example are shown in the table below:

[0050] Table 1. Variation of Feature Map Size in Feature Fusion Network

[0051]

[0052] Step S102: Obtain the ship dataset containing rectangular annotation information. Rectangular annotation information refers to the position and size information of the target in the image. It is generally represented by a quadruple Box = (x, y, w, h), where (x, y) refers to the coordinates of the top-left corner of the target's bounding rectangle, w refers to the width of the target's bounding rectangle, and h refers to the height of the target's bounding rectangle.

[0053] Calculate multiple pixel-level mask labels for each training image based on the rectangular bounding boxes. The pixel-level mask labels are generated as follows: for a single training image Image, N images of the same size are generated, denoted as M = {m_1, m_2, ..., m_N}; across all training set images, the largest target bounding-box size is taken as the maximum value and the smallest as the minimum value, the size range is divided evenly into N intervals, and the division thresholds are set; the size interval to which each rectangular bounding box in Image belongs is then determined. If a rectangular bounding box Box = (x, y, w, h) belongs to interval i (i ∈ N), then m_i can be represented as:

[0054]

[0055] Where p_{k,j} is the pixel value of m_i at coordinate position (k, j), and M is the set of N pixel mask label images generated for Image for targets of different sizes.

[0056] Generate corresponding pixel mask label images for all training set images using the method described above.

[0057] In this embodiment, N=4 is preferred.

[0058] Step S103: Train the multi-scale semantic feature fusion detection network using a training dataset containing rectangular bounding boxes and pixel-level mask labels:

[0059] The rectangular annotation information is used as training labels in the object detection module. The loss function is calculated as follows:

[0060]

[0061] Where i is the index of the candidate bounding box to be detected; t_i is a quadruple (tx, ty, tw, th) representing the coordinates and size of the candidate bounding box output by the object detection module; t_i* is a quadruple (tx*, ty*, tw*, th*) representing the coordinates and size of the ground truth label corresponding to the candidate box; p_i is the probability that the candidate box is predicted to be a positive sample; p_i* is the true category label corresponding to the candidate box; N_cls and N_reg are normalization constants; the classification loss L_cls uses cross-entropy loss, and the regression loss L_reg uses Smooth L1 loss.
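The terms listed here match the multi-task loss commonly used to train Faster R-CNN. For orientation, that loss is usually written as below; this is the standard form rather than a confirmed copy of the formula in this disclosure. The notation t_i^* and p_i^* for the ground-truth box and label, and the balancing weight λ (which the text does not mention), are assumptions.

```latex
L(\{p_i\}, \{t_i\}) =
  \frac{1}{N_{cls}} \sum_i L_{cls}(p_i, p_i^*)
  + \lambda \, \frac{1}{N_{reg}} \sum_i p_i^* \, L_{reg}(t_i, t_i^*),
\qquad
\mathrm{smooth}_{L1}(x) =
  \begin{cases}
    0.5\,x^2 & \text{if } |x| < 1 \\
    |x| - 0.5 & \text{otherwise}
  \end{cases}
```

Because the regression term is multiplied by p_i^*, only candidate boxes labeled as positive samples contribute to the box regression loss.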

[0062] In the multi-scale attention generation module, pixel mask label maps of different sizes obtained in step S102 are used as training labels. Low-level features use small-sized target mask labels, and high-level features use large-sized target mask labels. These mask labels are aligned with the size of the network output attention map, with each pixel corresponding to a specific pixel. Pixel-level cross-entropy loss is used as the loss function. Supervision allows feature maps of different scales to focus on infrared ship targets of different scales during fusion, achieving the goal of preserving the advantages of each layer's feature maps during feature fusion. In this embodiment, the loss function calculation formula is as follows:

[0063]

[0064] Where i refers to the i-th pixel in the image, n refers to the total number of pixels in the image, y_i represents the pixel value at that location in the pixel-level mask image, and y′_i refers to the pixel value at that location in the network's output attention map.

[0065] Step S104: Obtain the image to be detected and input it into the trained detection network for detection.

[0066] The above technical solutions are merely exemplary embodiments of the present invention. For those skilled in the art, based on the application methods and principles disclosed in the present invention, it is easy to make various types of improvements or modifications, and not limited to the methods described in the specific embodiments of the present invention. Therefore, the methods described above are merely preferred and not restrictive.

Claims

1. A target detection method based on multi-scale attention, comprising the following steps:
A multi-scale semantic feature fusion detection network is constructed, comprising: an object detection module, a multi-scale attention generation module, and a multi-scale feature fusion module, wherein:
The target detection module extracts feature maps of the target at different levels;
The multi-scale attention generation module generates attention maps for each layer based on the basic feature maps, and multiplies them with the basic feature maps of each layer to generate feature maps with attention added to each layer;
The multi-scale feature fusion module fuses the feature maps after attention is added to each layer to generate the final feature;
The target detection module performs target detection based on the final features;
Obtain the target training dataset containing rectangular annotation information, and calculate multiple pixel-level mask label images corresponding to each training image based on the rectangular annotation boxes;
The multi-scale semantic feature fusion detection network is trained using a target training dataset that includes the rectangular annotation information and pixel-level mask labels;
The image to be detected is obtained and input into the trained detection network to obtain the detection result;
The method for obtaining the pixel-level mask label includes the following steps:
Obtain a target training dataset containing rectangular annotation information. The rectangular annotation information refers to the position and size information of the target in the image, represented by a quadruple Box = (x, y, w, h), where (x, y) refers to the coordinates of the upper left corner of the target's bounding rectangle, w refers to the width of the target's bounding rectangle, and h refers to the height of the target's bounding rectangle;
For a single training picture Image, N pictures with the same size as Image are generated, denoted as M = {m_1, m_2, ..., m_N};
Set the largest target rectangle bounding box size to the maximum value and the smallest target rectangle bounding box size to the minimum value in all training set images. Divide the size range into N intervals by average distribution and set the division threshold;
Determine the size range of all rectangular bounding boxes in the Image. If a rectangular bounding box Box = (x, y, w, h) belongs to interval i (i ∈ N), then m_i is represented as:
wherein p_{k,j} represents the pixel value of m_i at coordinate position (k, j), and M is the N pixel mask label images generated for targets of different sizes corresponding to Image;
Generate corresponding pixel mask label images for all training set images in the manner described above;
The multi-scale attention generation module generates feature maps for each layer after attention has been added, including the following steps:
Max pooling is performed on the low-level feature maps, and deconvolution is performed on the high-level feature maps to make the feature map sizes uniform across all layers;
For each layer, perform the following calculations:
After further feature extraction through 3 layers of 3×3 convolution, features with 2 channels are obtained, and normalization processing is performed on them in the (0,1) interval to obtain a probability map representing the positive and negative categories of each pixel;
The probability map representing the probability that a pixel is a positive sample is used as attention. It is then multiplied at the pixel level with the feature map obtained after unifying the size of the feature map of this layer to obtain the feature map after adding attention;
In the step of training the multi-scale semantic feature fusion detection network, the pixel mask label map is used as the training label for the multi-scale attention generation module. The specific method includes:
Small-sized target mask labels are used for low-level features, and large-sized target mask labels are used for high-level features. The size of the mask label is consistent with that of the output attention map, with pixels in one-to-one correspondence;
The pixel-level cross-entropy loss is used as the loss function, and the calculation formula is as follows:
Where i refers to the i-th pixel in the image, n refers to the total number of pixels in the image, y_i represents the pixel value at that location in the pixel-level mask image, and y′_i refers to the pixel value at that location in the network's output attention map.

2. The detection method according to claim 1, characterized in that the target detection module uses the two-stage target detection network Faster-RCNN as the base network, in which VGG16 is used as the target basic feature extraction network, and the rectangular annotation information is used as the training label in this module.

3. The detection method according to claim 2, characterized in that the target detection network is optimized as follows: the fourth pooling operation in VGG-16 is removed, reducing the feature resolution scaling factor from 16 to 8.

4. The method of claim 1, wherein the backbone network of the multi-scale feature fusion module is the VGG-16 network, and four feature layers, conv2-2, conv3-3, conv4-3 and conv5-3, are selected as feature fusion layers. The method for generating the final features includes: the number of channels in the feature maps of each layer is unified by 1×1 convolution; for each layer, pooling or deconvolution is used to unify it to the same scale as conv3-3; the final fused feature is obtained by adding the feature maps of all layers pixel by pixel.