Image target detection method and system

A target detection and image technology, applied in image analysis, image enhancement, and image data processing, which addresses the problems of low algorithm AP (average precision) and low detection accuracy for small targets, and achieves the effect of improving performance.

Pending Publication Date: 2022-05-20
FUJIAN YIRONG INFORMATION TECH +2

AI-Extracted Technical Summary

Problems solved by technology

[0005] At present, although there are some very mature algorithms for image target detection, so...

Method used

1. A multi-branch target detection network is adopted, which effectively improves the detection of small objects.
2. A multi-stage attention module is adopted, which improves target detection accuracy.
One-stage target detection algorithms suffer from insufficient precision, mainly because bottom-up deep convolutional neural networks treat all features equally, so shallow-layer information is often lost after multiple convolution layers. In the solution of the present invention, a spatial network module is added and deep and shallow features are combined in the fusion network module, which solves the loss of shallow information after multi-layer convolution in traditional convolutional neural networks and yields a measurable improvement in small-object detection.

Abstract

The invention discloses an image target detection method and system. The method comprises the steps of: extracting image features with a network structure based on an attention mechanism, generating candidate regions with a region proposal network, and removing redundant candidate boxes with non-maximum suppression to obtain the final detection result. The target detection process comprises the following steps: a picture is input, and its features are extracted through a backbone network and a spatial network; the backbone network feeds the convolved image feature data to the spatial network module and the fusion network module. The output features of the spatial network module, the backbone network module, and the fusion network module, together with the output features of the subsequently added convolutional layers, serve as the basis for subsequent regression and classification. A plurality of prior boxes are then generated by the RPN algorithm and filtered by NMS to obtain the final target detection result. According to the embodiments of the invention, the attention-based image target detection method further improves the detection and recognition of small target objects while improving overall recognition accuracy.

Application Domain

Image enhancement, Image analysis +2

Technology Topic

Space Network, Network generation +10


Examples

  • Experimental program (3)

Example Embodiment

[0046] Embodiment 1
[0047] This embodiment provides an image target detection method, as shown in Figure 1, comprising:
[0048] Obtaining an image to be detected and inputting it into a backbone network module and a spatial network module to extract features respectively, the backbone network module inputting the image feature data extracted by a plurality of its convolutional layers into the spatial network module and the fusion network module respectively;
[0049] The spatial network module obtaining high-resolution spatial features from the input image to be detected and from the image feature data of multiple convolutional layers supplied by the backbone network module, a spatial attention mechanism being added after each convolutional layer to enhance spatial information, and finally integrating and outputting multiple spatial features of different scales;
[0050] The fusion network module combining the image feature data of multiple convolutional layers input by the backbone network module into deep features and shallow features, and then combining the deep and shallow features and outputting the result;
[0051] The output features of the spatial network module and the fusion network module, together with the output features extracted by the convolution module added after the backbone network module, being used as the basis for regression and classification; multiple candidate regions being generated by the detection and recognition module, redundant candidate regions then being removed by the filtering module, and finally the target detection result being output.
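As a rough illustration of steps [0048]–[0051], the following minimal PyTorch sketch traces the three-branch forward pass. Every module here is a single-convolution stand-in invented for the sketch, not the patent's implementation; the actual branches are described in detail below.

```python
# Minimal sketch of the three-branch forward pass in [0048]-[0051];
# each branch is reduced to one convolution so the sketch runs.
import torch
import torch.nn as nn

class ThreeBranchDetector(nn.Module):
    def __init__(self, channels=64):
        super().__init__()
        self.backbone = nn.Conv2d(3, channels, 3, stride=2, padding=1)
        self.spatial = nn.Conv2d(3, channels, 3, stride=2, padding=1)
        self.fusion = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, image):
        # The backbone and spatial branches both see the input image.
        deep = self.backbone(image)
        spatial = self.spatial(image)
        # Backbone features also feed the fusion branch.
        fused = self.fusion(deep)
        # The three outputs form the basis for regression/classification;
        # RPN-style candidate generation and NMS follow downstream.
        return deep, spatial, fused

outs = ThreeBranchDetector()(torch.randn(1, 3, 224, 224))
print([o.shape for o in outs])
```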

Example Embodiment

[0052] A specific implementation of the method of the present invention is as follows:
[0053] The image feature data extracted by the multiple convolutional layers of the backbone network module are input to the spatial attention mechanism SAM of the spatial network module through the corresponding multi-level attention modules MAM. The multi-level attention module MAM is used to obtain multi-level feature information and then combine that information to achieve feature enhancement.
[0054] The multi-level attention module MAM is specifically configured to upsample Fn to obtain Fn-1, where Fn denotes the feature after the n-th convolutional layer and Fn-1 denotes the feature obtained by upsampling Fn. The two-level features Fn and Fn-1 are concatenated to obtain the combined feature F, and F is then transformed by a 3×3 convolution with batch normalization and a nonlinear unit to obtain the feature F′. The MAM has two hyperparameters, the dilation rate d and the compression ratio r; with d=4 and r=16, the final feature M is given by the following formula:
[0055]
[0056] where F′ denotes the feature obtained by convolving the combined feature F, σ is the sigmoid function, BN is batch normalization, f3×3 denotes a 3×3 atrous convolution, f7×7 denotes a 7×7 atrous convolution, Con denotes convolution, and Maxpool denotes max pooling.
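Since the formula for M in [0055] is not reproduced in the extracted text, the following PyTorch sketch of the MAM is an assumption built only from the operations named above (upsample, concatenate, 3×3 conv with BN and a nonlinear unit, 1×1 channel compression with r=16, parallel 3×3 and 7×7 atrous convolutions with d=4, and a sigmoid gate); the placement of the Maxpool mentioned in [0056] is not recoverable and is omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MAM(nn.Module):
    """Sketch of the Multi-level Attention Module (d=4, r=16). The way
    the branches combine into M is assumed, as the source formula is
    missing."""
    def __init__(self, channels, d=4, r=16):
        super().__init__()
        mid = max(channels // r, 1)
        self.combine = nn.Sequential(            # 3x3 conv + BN + ReLU -> F'
            nn.Conv2d(2 * channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
        )
        self.reduce = nn.Conv2d(channels, mid, 1)  # 1x1 channel compression
        # Parallel atrous branches enlarge the receptive field.
        self.atrous3 = nn.Conv2d(mid, channels, 3, padding=d, dilation=d)
        self.atrous7 = nn.Conv2d(mid, channels, 7, padding=3 * d, dilation=d)
        self.bn = nn.BatchNorm2d(channels)

    def forward(self, f_n, f_prev):
        # Upsample the deeper feature Fn to the size of Fn-1.
        up = F.interpolate(f_n, size=f_prev.shape[-2:], mode="nearest")
        f = torch.cat([up, f_prev], dim=1)        # combined feature F
        f_prime = self.combine(f)                 # feature F'
        z = self.reduce(f_prime)
        gate = torch.sigmoid(self.bn(self.atrous3(z) + self.atrous7(z)))
        return f_prime * gate                     # assumed form of M

mam = MAM(64)
m = mam(torch.randn(1, 64, 16, 16), torch.randn(1, 64, 32, 32))
print(m.shape)  # torch.Size([1, 64, 32, 32])
```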
[0057] The backbone network module includes a first, second, third, fourth, and fifth convolutional layer. The fusion network module combines the features of the fifth and fourth convolutional layers of the backbone network as deep features and the features of the third and second convolutional layers as shallow features, and then combines the deep and shallow features to form the input features of the fusion attention module. Deconvolution is used to unify the feature sizes coming from different residual network blocks, so that the input W×H features have D channels. The input features are then pooled along two paths, with pooling kernel sizes of 2×2 and 4×4 respectively; the smaller features are upsampled to the size of the original feature map; finally, the concatenation of the two pooled features is added to the input features to obtain output features of spatial size W×H×2D.
[0058] The detection and recognition module generates a plurality of prior boxes as candidate regions through a candidate-box extraction algorithm, and the filtering module uses non-maximum suppression to eliminate redundant candidate regions.
[0059] As shown in Figure 2, a specific implementation is as follows:
[0060] 1. Design of network model
[0061] 1. Spatial network
[0062] Objective: To solve the problem of lack of spatial information in the bottom-up CNN structure, and to improve the network's ability to process the spatial structure information of objects.
[0063] The spatial network module of this embodiment can use three 3×3 convolution kernels with a stride of 2 to obtain high-resolution spatial features, add a spatial attention mechanism (Spatial Attention Module, SAM) between the convolutional layers to enhance the spatial information, then process the spatial information at different positions in the features with Feature Pyramid Networks (FPN), and finally output the image features.
[0064] The spatial network module is composed of three 3×3 convolutional layers with stride 2 (conv1, conv2, and conv3), which retain rich spatial detail with little increase in parameter count. After each convolution, Batch Normalization (BN) and a Rectified Linear Unit (ReLU) are used for normalization and activation. A spatial attention module is attached after each convolutional layer and is designed to adjust the output features to adaptively capture the spatial regions of interest.
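A minimal sketch of one conv1/conv2/conv3 stage under the stated design (3×3 kernel, stride 2, BN + ReLU). The channel widths are illustrative assumptions, and the SAM that follows each stage is sketched after the next paragraph.

```python
import torch
import torch.nn as nn

def conv_bn_relu(in_ch, out_ch):
    # One spatial-network stage: 3x3 convolution with stride 2,
    # followed by batch normalization and ReLU, per [0064].
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, stride=2, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

# conv1 -> conv2 -> conv3; the SAM after each stage is omitted here.
stem = nn.Sequential(conv_bn_relu(3, 64), conv_bn_relu(64, 128),
                     conv_bn_relu(128, 256))
print(stem(torch.randn(1, 3, 224, 224)).shape)  # 1/8 resolution
```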
[0065] The structure of the spatial network module is shown in Figure 3. The Spatial Attention Module (SAM) is used to learn the weights of multi-level features; its input comes from the output features of the backbone network's MAM (Multi-level Attention Module) and from the output features of the preceding convolutional layer in this module. Smaller feature maps are upsampled to match the sizes of the different input features. These features are concatenated to form a combined feature, which is then convolved with a 3×3 convolution with batch normalization and a rectified linear unit to form feature M. Feature M is subjected to global average pooling, and the pooled output is multiplied by feature M to generate global spatial information; the purpose of the pooling is to identify more discriminative regions and infer finer-grained attention. Finally, the convolved feature M is added to the global spatial information to obtain the output features of the spatial attention module. The spatial attention module not only combines the features of the spatial network and the backbone network, but also refines features by learning feature weights, improving network performance.
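The SAM description above maps fairly directly to code. The following sketch assumes both inputs carry the same channel count; the ordering of operations follows [0065].

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SAM(nn.Module):
    """Sketch of the Spatial Attention Module: concatenate the MAM
    feature with the preceding conv feature, 3x3 conv+BN+ReLU -> M,
    gate M with its global-average-pooled statistics, add back."""
    def __init__(self, channels):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(2 * channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, conv_feat, mam_feat):
        # Upsample the smaller map so both inputs align spatially.
        if mam_feat.shape[-2:] != conv_feat.shape[-2:]:
            mam_feat = F.interpolate(mam_feat, size=conv_feat.shape[-2:],
                                     mode="nearest")
        m = self.conv(torch.cat([conv_feat, mam_feat], dim=1))
        g = F.adaptive_avg_pool2d(m, 1)   # global average pooling
        global_info = m * g               # "global spatial information"
        return m + global_info            # M added to the gated map

sam = SAM(64)
out = sam(torch.randn(1, 64, 32, 32), torch.randn(1, 64, 16, 16))
print(out.shape)  # torch.Size([1, 64, 32, 32])
```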
[0066] These layers are followed by a Spatial Pyramid Pooling (SPP) module that integrates spatial information at multiple scales. Its purpose is to convert multi-scale feature data to a fixed dimension through the spatial pyramid pooling layer and feed it to the fully connected layer, enabling feature maps of any size to be converted into fixed-size feature vectors.
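A minimal sketch of spatial pyramid pooling as described; the pyramid levels (1, 2, 4) and the use of adaptive max pooling are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def spatial_pyramid_pool(x, levels=(1, 2, 4)):
    # Pool the feature map at several grid sizes and concatenate the
    # results, so any H x W input yields a fixed-length vector per image.
    n, c = x.shape[:2]
    parts = [F.adaptive_max_pool2d(x, k).reshape(n, -1) for k in levels]
    return torch.cat(parts, dim=1)  # length = c * sum(k*k for k in levels)

feat = torch.randn(1, 64, 13, 9)          # arbitrary spatial size
print(spatial_pyramid_pool(feat).shape)   # torch.Size([1, 1344])
```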
[0067] 2. Backbone network
[0068] Purpose: To introduce a novel multi-level attention module that combines multi-level features and provides enhanced features to the spatial network; in addition, feature maps from different blocks are concatenated to generate the deep and shallow feature blocks used by the fusion network module.
[0069] The complete structure of the backbone network is shown in Figure 4. In the algorithm proposed in this embodiment, ResNet101, pre-trained on multiple public datasets, is used as the base of the designed backbone network. The public datasets used in this scheme include VOC-2012, ILSVRC-2017, MS-COCO-2018, OID-2018, etc.
[0070] The Multi-level Attention Module (MAM) in this branch is designed to utilize the multi-level features possessed by a CNN and to combine them. MAM captures this multi-level feature information, optimizes it, and feeds the processed features into the spatial network module for further processing. The output features of the backbone's convolutional layers are used for the subsequent regression and classification.
[0071] MAM combines multi-level features in a cascade. Fn is upsampled to obtain Fn-1 (Fn denotes the feature after the n-th convolutional layer, Fn-1 the feature obtained by upsampling Fn), and the two-level features Fn and Fn-1 are concatenated to obtain the combined feature F. The feature F′ is obtained by transforming F with a 3×3 convolution with batch normalization and a nonlinear unit.
[0072] After a 1×1 convolution reduces the feature channels, a 3×3 atrous convolution and a 7×7 atrous convolution are applied in parallel to enlarge the receptive field. The 1×1 convolution not only raises or lowers dimensionality but also integrates the various feature information. This module has two hyperparameters: the dilation rate (d) and the compression ratio (r). The dilation rate determines the size of the receptive field and helps aggregate contextual information; the compression ratio changes the number of channels and thus determines the computational overhead. Comparative experiments show the best performance with {d=4, r=16}. The final feature M can be formulated as follows:
[0073]
[0074] where σ is the sigmoid function, BN is batch normalization, and f3×3 and f7×7 denote the 3×3 and 7×7 atrous convolutions, respectively.
[0075] 3. Fusion network module
[0076] Objective: To further improve the performance of the target detection algorithm, this fusion network module is designed to combine feature information at different scales and ensure accurate localization of targets against complex backgrounds.
[0077] The network structure of the fusion network module is shown in Figure 5. The features of conv5 and conv4 are combined as deep features, the features of conv2 and conv3 are combined as shallow features, and the deep and shallow block features are then combined to form the input features of the fusion attention module. We use deconvolution to unify the feature sizes from the different residual network blocks, so that the input W×H features have D channels.
[0078] The input features are then pooled along two paths, with pooling kernel sizes of 2×2 and 4×4 respectively. The smaller features are upsampled to the size of the original feature map. Finally, the concatenation of the two paths is added to the input features to obtain output features of spatial size W×H×2D. This output feature is used as one of the multi-scale prediction features to make the final prediction.
[0079] In this module, each spatial location of the input feature map is enhanced using the local context of features at different locations. Combining features from different pooling layers not only enlarges the receptive field but also makes better use of multi-scale contextual information.
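The fusion attention module's two pooling paths can be sketched as follows. The source is ambiguous about how a D-channel input plus the pooled paths yields 2D channels, so in this sketch each upsampled path is added to the input and the two sums are concatenated, which matches the stated W×H×2D output; the choice of average pooling is also an assumption.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusionModule(nn.Module):
    """Sketch of the fusion attention module: two pooling paths (2x2
    and 4x4), upsample back to the input size, and recombine with the
    input to give an output of spatial size W x H with 2D channels."""
    def forward(self, x):                     # x: N x D x H x W
        size = x.shape[-2:]
        paths = []
        for k in (2, 4):                      # the two pooling kernels
            p = F.avg_pool2d(x, kernel_size=k)
            p = F.interpolate(p, size=size, mode="nearest")
            paths.append(x + p)               # add back to the input
        return torch.cat(paths, dim=1)        # N x 2D x H x W

out = FusionModule()(torch.randn(1, 64, 32, 32))
print(out.shape)  # torch.Size([1, 128, 32, 32])
```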
[0080] 2. Generation of candidate regions
[0081] In this embodiment, FPN (Feature Pyramid Networks) is used to generate candidate regions; other existing techniques may also be used.
[0082] 3. Screening of candidate regions
[0083] In this embodiment, Non-Maximum Suppression (NMS) is used to screen the candidate regions generated in the previous step; the purpose is to retain target detection boxes with high confidence and suppress false detection boxes with low confidence. The steps are: sort all boxes by score and select the box with the highest score; traverse the remaining boxes and delete any box whose overlap (IoU) with the current highest-scoring box exceeds a given threshold; then select the highest-scoring box among the unprocessed boxes and repeat the process until screening is complete.
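A direct transcription of the NMS steps in [0083] into PyTorch (in practice, torchvision.ops.nms implements the same procedure):

```python
import torch

def nms(boxes, scores, iou_threshold=0.5):
    """Plain non-maximum suppression following [0083]: pick the
    highest-scoring box, drop boxes overlapping it beyond the IoU
    threshold, and repeat on what remains."""
    order = scores.argsort(descending=True)
    keep = []
    while order.numel() > 0:
        i = order[0].item()
        keep.append(i)
        if order.numel() == 1:
            break
        rest = order[1:]
        # Intersection of the best box with the remaining boxes.
        x1 = torch.maximum(boxes[i, 0], boxes[rest, 0])
        y1 = torch.maximum(boxes[i, 1], boxes[rest, 1])
        x2 = torch.minimum(boxes[i, 2], boxes[rest, 2])
        y2 = torch.minimum(boxes[i, 3], boxes[rest, 3])
        inter = (x2 - x1).clamp(min=0) * (y2 - y1).clamp(min=0)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_r = (boxes[rest, 2] - boxes[rest, 0]) * \
                 (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_i + area_r - inter)
        order = rest[iou <= iou_threshold]
    return keep

boxes = torch.tensor([[0., 0., 10., 10.], [1., 1., 10., 10.],
                      [20., 20., 30., 30.]])
scores = torch.tensor([0.9, 0.8, 0.7])
print(nms(boxes, scores))  # [0, 2]: the near-duplicate box is suppressed
```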
[0084] The embodiment of the present invention designs a new three-branch attention target detection model composed of a spatial network, a backbone network, and a fusion network module. Subsequent experiments demonstrate that it achieves very competitive object detection performance. The novel spatial attention module, multi-level attention module, contextual attention module, and pyramid pooling module can be adapted to network architectures for other vision tasks with minor modifications.

Example Embodiment

[0086] Embodiment 2
[0087] This embodiment provides an image target detection system, as shown in Figure 6, comprising: a backbone network module, a spatial network module, and a fusion network module;
[0088] The backbone network module and the spatial network module are each connected to the input terminal; the multiple convolutional layers in the backbone network module are connected to the spatial network module and the fusion network module respectively; the output terminal of the backbone network module is connected to the detection and recognition module through the convolution module; the output terminals of the spatial network module and the fusion network module are each connected to the detection and recognition module; and the detection and recognition module is connected to the filtering module.
[0089] In a specific implementation of the image target detection system, the backbone network module includes a first, second, third, fourth, and fifth convolutional layer. The second convolutional layer is connected to a first multi-level attention module MAM, the third convolutional layer to a second multi-level attention module MAM, and the fourth convolutional layer to a third multi-level attention module MAM. The third, second, and first multi-level attention modules are connected in sequence, and each of them is connected to the spatial network module. The second, third, fourth, and fifth convolutional layers are each connected to the fusion network module.
[0090] The spatial network module includes a first convolutional layer, a second convolutional layer, a third convolutional layer, and a PPM module. The first convolutional layer is connected to the second convolutional layer through a first spatial attention mechanism SAM, the second convolutional layer is connected to the third convolutional layer through a second spatial attention mechanism SAM, and the third convolutional layer is connected to the PPM module through a third spatial attention mechanism SAM. The input of the first SAM is connected to the first multi-level attention module MAM of the backbone network module, the input of the second SAM to the second MAM, and the input of the third SAM to the third MAM.
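To trace the wiring of [0089]–[0090], the following sketch replaces every real layer with an identity stand-in; the "+" used where a module takes two inputs is a stand-in for the real two-input MAM/SAM interfaces sketched earlier, so only the connection pattern is being illustrated.

```python
import torch
import torch.nn as nn

I = nn.Identity()               # stand-in for every real layer

x = torch.randn(1, 8, 32, 32)
f2, f3, f4 = I(x), I(x), I(x)   # backbone conv2 / conv3 / conv4
m3 = I(f4)                      # conv4 -> MAM3
m2 = I(f3) + m3                 # conv3 -> MAM2 (chained from MAM3)
m1 = I(f2) + m2                 # conv2 -> MAM1 (chained from MAM2)
s1 = I(I(x) + m1)               # spatial conv1 -> SAM1, fed by MAM1
s2 = I(I(s1) + m2)              # spatial conv2 -> SAM2, fed by MAM2
s3 = I(I(s2) + m3)              # spatial conv3 -> SAM3, fed by MAM3
out = I(s3)                     # SAM3 -> PPM
print(out.shape)
```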
[0091] The fusion network module takes the combined features of the fifth and fourth convolutional layers of the backbone network module as deep features and the combined features of the third and second convolutional layers as shallow features, then combines the deep and shallow features to form the input features of the fusion attention module. Deconvolution is used to unify the feature sizes from different residual network blocks, so that the input W×H features have D channels; the input features are then pooled along two paths with kernel sizes of 2×2 and 4×4 respectively; the smaller features are upsampled to the size of the original feature map; finally, the concatenation of the two paths is added to the input features to obtain output features of spatial size W×H×2D.
[0092] The image feature data extracted by the multiple convolutional layers of the backbone network module are input to the spatial attention mechanism SAM of the spatial network module through the corresponding multi-level attention modules MAM; the multi-level attention module MAM is used to obtain multi-level feature information and then combine it to achieve feature enhancement;
[0093] The multi-level attention module MAM is specifically configured to upsample Fn to obtain Fn-1; the two-level features Fn and Fn-1 are concatenated to obtain the combined feature F, which is then transformed by a 3×3 convolution with batch normalization and a nonlinear unit to obtain the feature F′. The MAM has two hyperparameters, the dilation rate d and the compression ratio r; with d=4 and r=16, the final feature M is given by the following formula:
[0094]
[0095] where F′ denotes the feature obtained by convolving the combined feature F, σ is the sigmoid function, BN is batch normalization, f3×3 denotes a 3×3 atrous convolution, f7×7 denotes a 7×7 atrous convolution, Con denotes convolution, and Maxpool denotes max pooling.
[0096] The technical solution provided in the embodiments of the present application adds a spatial network module and combines deep and shallow features in the fusion network module, which solves the loss of shallow information after multi-layer convolution in traditional convolutional neural networks and yields a measurable improvement in small-object detection. Using the MAM (multi-level attention) module in the backbone network to integrate multi-level features significantly improves performance: compared with a baseline network without the multi-level attention module, the mAP (mean Average Precision) metric is greatly improved.
