[0052] A specific embodiment of the method of the present invention:
[0053] The image feature data extracted by the multiple convolutional layers of the backbone network module are respectively input to the spatial attention mechanism SAM of the spatial network module through the corresponding multi-level attention modules MAM. The multi-level attention module MAM is used to obtain multi-level feature information and then combine the multi-level feature information to achieve feature enhancement.
[0054] The multi-level attention module MAM is specifically used as follows: F_n is upsampled to obtain F_{n-1} (where F_n denotes the feature after the n-th convolutional layer and F_{n-1} denotes the feature obtained by upsampling F_n); the two levels of features F_n and F_{n-1} are concatenated to obtain the combined feature F; the feature F is then transformed by a 3×3 convolution with batch normalization and a nonlinear unit to obtain the feature F′. The multi-level attention module MAM has two hyperparameters: the expansion rate d and the compression rate r. With d=4 and r=16, the formula for the final feature M is as follows:
[0055]
[0056] Here, F′ denotes the feature obtained by convolving the combined feature F, σ is the sigmoid function, BN is batch normalization, f_3×3 denotes a 3×3 atrous convolution, f_7×7 denotes a 7×7 atrous convolution, Con denotes convolution, and Maxpool denotes max pooling.
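The expression for M is not reproduced in this text. For illustration only, one composition that is consistent with the symbols defined above (an assumption, not necessarily the exact expression used in this embodiment) is:

```latex
% Illustrative sketch only: one way to combine the symbols defined in [0056]
M = \sigma\Big( \mathrm{BN}\big(f_{3\times 3}(\mathrm{Con}(\mathrm{Maxpool}(F')))\big)
  + \mathrm{BN}\big(f_{7\times 7}(\mathrm{Con}(\mathrm{Maxpool}(F')))\big) \Big) \otimes F'
```

Here Con would be the 1×1 channel-reduction convolution described later, and ⊗ element-wise multiplication.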
[0057] The backbone network module includes a first convolutional layer, a second convolutional layer, a third convolutional layer, a fourth convolutional layer and a fifth convolutional layer. The fusion network module combines the features of the fifth and fourth convolutional layers of the backbone network module as deep features, and combines the features of the third and second convolutional layers of the backbone network module as shallow features; the deep features and shallow features are then combined to form the input features of the fusion attention module. Deconvolution is used to unify the feature sizes from the different residual network blocks, so that the input W×H features have D channels. The input features are then pooled along two paths, with pooling kernel sizes of 2×2 and 4×4, respectively; the smaller-sized features are then upsampled to the same size as the original feature map; finally, the concatenation of the two features is added to the input features to obtain output features of spatial size W×H×2D.
[0058] The detection and identification module generates a plurality of a priori frames as candidate regions through a candidate frame extraction algorithm, and the filtering module uses a non-maximum suppression method to eliminate redundant candidate regions.
[0059] As shown in Figure 2, a specific implementation is as follows:
[0060] 1. Design of network model
[0061] 1. Spatial network
[0062] Objective: To solve the problem of lack of spatial information in the bottom-up CNN structure, and to improve the network's ability to process the spatial structure information of objects.
[0063] The spatial network module in this embodiment can use three 3×3 convolution kernels with a stride of 2 to obtain high-resolution spatial features, add a spatial attention module (Spatial Attention Module, SAM) between the convolutional layers to enhance the spatial information, process the spatial information at different positions in the features with a Feature Pyramid Network (FPN), and finally output the features of the image.
[0064] The spatial network module is composed of three 3×3 convolutional layers with stride 2 (conv1, conv2, and conv3), which retain rich spatial details with little increase in the number of parameters. After each convolution operation, Batch Normalization (BN) and a Rectified Linear Unit (ReLU) are used for normalization and activation. A spatial attention module is attached after each convolutional layer and is designed to adjust the output features so as to adaptively capture the spatial regions of interest; this is how the spatial attention mechanism is applied.
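For concreteness, a minimal PyTorch sketch of this conv–BN–ReLU stem follows; the channel widths (64, 128, 256) are illustrative assumptions, since the embodiment does not specify them.

```python
import torch.nn as nn

def conv_bn_relu(in_ch, out_ch):
    # 3x3 convolution with stride 2, followed by batch normalization and ReLU
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=2, padding=1, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

class SpatialNetworkStem(nn.Module):
    """conv1 -> conv2 -> conv3, each halving the spatial resolution."""
    def __init__(self, in_ch=3, widths=(64, 128, 256)):  # widths are assumed values
        super().__init__()
        self.conv1 = conv_bn_relu(in_ch, widths[0])
        self.conv2 = conv_bn_relu(widths[0], widths[1])
        self.conv3 = conv_bn_relu(widths[1], widths[2])

    def forward(self, x):
        return self.conv3(self.conv2(self.conv1(x)))
```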
[0065] The structure of the spatial network module is shown in Figure 3. The Spatial Attention Module (SAM) is used to learn the weights of multi-level features; its inputs are the output features of the backbone network's MAM (Multi-level Attention Module) and the output features of the preceding convolutional layer in this module. Feature maps of smaller size are upsampled to match the dimensions of the other input features. These features are concatenated to form a combined feature, which is then transformed by a 3×3 convolution with batch normalization and a rectified linear unit to form feature M. Feature M is then subjected to global average pooling, and the global average pooling output is multiplied by feature M to generate global spatial information; the purpose of the pooling is to identify more discriminative regions and infer more nuanced attention. Finally, the convolved feature M is added to the global spatial information to obtain the output features of the spatial attention module. The spatial attention module not only combines the features of the spatial network and the backbone network, but also performs feature refinement by learning feature weights to improve the performance of the network.
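A possible PyTorch sketch of the SAM dataflow described above; the channel counts and the choice of bilinear upsampling are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialAttentionModule(nn.Module):
    """Sketch of the SAM: concatenate the two inputs, convolve to M,
    mix M with its global-average-pooled statistics, and add back to M."""
    def __init__(self, spatial_ch, backbone_ch, out_ch):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(spatial_ch + backbone_ch, out_ch, kernel_size=3,
                      padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, spatial_feat, backbone_feat):
        # Upsample the smaller feature map so the two inputs can be concatenated
        if backbone_feat.shape[-2:] != spatial_feat.shape[-2:]:
            backbone_feat = F.interpolate(backbone_feat, size=spatial_feat.shape[-2:],
                                          mode='bilinear', align_corners=False)
        m = self.conv(torch.cat([spatial_feat, backbone_feat], dim=1))  # combined feature -> M
        gap = F.adaptive_avg_pool2d(m, 1)   # global average pooling over M
        global_info = m * gap               # GAP output multiplied by M
        return m + global_info              # add M to the global spatial information
```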
[0066] These layers are followed by a Spatial Pyramid Pooling (SPP) module to integrate spatial information at multiple scales. Its purpose is to convert the multi-scale feature data to a fixed dimension through the spatial pyramid pooling layer and feed it to the fully connected layer as input, enabling feature maps of any size to be converted into fixed-size feature vectors.
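A minimal PyTorch sketch of such an SPP layer; the pyramid levels (1, 2, 4) and the use of max pooling are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialPyramidPooling(nn.Module):
    """Pools an arbitrary-size feature map into a fixed-length vector."""
    def __init__(self, levels=(1, 2, 4)):
        super().__init__()
        self.levels = levels

    def forward(self, x):
        n, c = x.shape[:2]
        # Pool the map into 1x1, 2x2 and 4x4 grids, then flatten each grid
        pooled = [F.adaptive_max_pool2d(x, output_size=lvl).view(n, -1)
                  for lvl in self.levels]
        # Output length is c * (1 + 4 + 16) regardless of the input H x W
        return torch.cat(pooled, dim=1)
```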
[0067] 2. Backbone network
[0068] Purpose: first, to introduce a novel multi-level attention module that combines multi-level features and provides enhanced features to the spatial network; second, to concatenate feature maps between different blocks to generate the deep and shallow feature blocks used by the fusion network module.
[0069] The complete structure of the backbone network is shown in Figure 4. In the algorithm proposed in this embodiment, ResNet101, pre-trained on multiple public datasets, is used as the backbone of the designed backbone network module. The public datasets used in this scheme include VOC-2012, ILSVRC-2017, MS-COCO-2018, OID-2018, etc.
[0070] The Multi-level Attention Module (MAM) in this module is designed to exploit and combine the multi-level features produced by the CNN. MAM captures this multi-level feature information, refines it, and feeds the processed features into the spatial network module for further processing. The output features of the convolutional layers of this module's backbone network are used for the subsequent regression and classification steps.
[0071] MAM combines multi-level features in a cascaded manner. F_n is upsampled to obtain F_{n-1} (where F_n denotes the feature after the n-th convolutional layer and F_{n-1} denotes the feature obtained by upsampling F_n); the two levels of features F_n and F_{n-1} are concatenated to obtain the combined feature F. The feature F′ is obtained by transforming the feature F with a 3×3 convolution with batch normalization and a nonlinear unit.
[0072] After a 1×1 convolution is used to reduce the feature channels, a 3×3 atrous convolution and a 7×7 atrous convolution are applied in parallel to enlarge the receptive field. The role of the 1×1 convolution is not only to increase or reduce the dimensionality, but also to integrate various feature information. This module has two hyperparameters: the expansion rate (d) and the compression rate (r). The expansion rate determines the size of the receptive field and helps aggregate contextual information; the compression rate changes the number of channels and thus determines the computational overhead. Comparative experiments show that the best performance is obtained with {d=4, r=16}. The final feature M can be formulated as follows:
[0073]
[0074] where σ is the sigmoid function, BN is batch normalization, and f_3×3 and f_7×7 denote the 3×3 atrous convolution and the 7×7 atrous convolution, respectively.
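A PyTorch sketch of the MAM as described above. The channel counts, and the way the two atrous branches are combined into an attention map and applied to F′, are assumptions, since the exact formula is not reproduced in this text.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiLevelAttentionModule(nn.Module):
    """Sketch of the MAM: fuse two feature levels, reduce channels by r,
    run parallel 3x3/7x7 atrous convolutions with dilation d, and gate F'."""
    def __init__(self, channels, d=4, r=16):
        super().__init__()
        mid = max(channels // r, 1)  # compression rate r reduces the channel count
        self.fuse = nn.Sequential(   # 3x3 conv + BN + ReLU on the combined feature F -> F'
            nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
        )
        self.reduce = nn.Conv2d(channels, mid, kernel_size=1)  # 1x1 conv to reduce channels
        self.branch3 = nn.Sequential(                          # 3x3 atrous conv, dilation d
            nn.Conv2d(mid, channels, kernel_size=3, padding=d, dilation=d, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.branch7 = nn.Sequential(                          # 7x7 atrous conv, dilation d
            nn.Conv2d(mid, channels, kernel_size=7, padding=3 * d, dilation=d, bias=False),
            nn.BatchNorm2d(channels),
        )

    def forward(self, f_n, f_prev):
        # Upsample the deeper feature F_n to the previous level's size, then concatenate
        f_up = F.interpolate(f_n, size=f_prev.shape[-2:], mode='bilinear', align_corners=False)
        f_prime = self.fuse(torch.cat([f_up, f_prev], dim=1))
        reduced = self.reduce(f_prime)
        attn = torch.sigmoid(self.branch3(reduced) + self.branch7(reduced))  # assumed combination
        return f_prime * attn                                                # enhanced feature M
```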
[0075] 3. Fusion network module
[0076] Purpose: to further improve the performance of the target detection algorithm, this fusion network module is designed to combine feature information at different scales and ensure accurate localization of the target against a complex background.
[0077] The network structure of the fusion network module is shown in Figure 5. The features of conv5 and conv4 are combined as deep features, the features of conv2 and conv3 are combined as shallow features, and the deep block features and shallow block features are then combined to form the input features of the fusion attention module. Deconvolution is used to unify the feature sizes from the different residual network blocks, so that the input W×H features have D channels.
[0078] The input features are then pooled along two paths, with pooling kernel sizes of 2×2 and 4×4, respectively. The smaller-sized features are then upsampled to the same size as the original feature map. Finally, the concatenation of the two features is added to the input features to obtain output features of spatial size W×H×2D. This output feature serves as one of the multi-scale prediction features for the final prediction.
[0079] In this module, each spatial location of the input feature map is enhanced by adopting the local context of different location features. The combination of features from different pooling layers not only enlarges the receptive field, but also makes better use of multi-scale contextual information.
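A PyTorch sketch of the two-path pooling stage described above. How the two upsampled pooled paths are recombined with the input so that the output has 2D channels is one plausible reading of the description, not a confirmed implementation detail.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusionModule(nn.Module):
    """Sketch of the fusion attention module's pooling stage."""
    def __init__(self):
        super().__init__()
        self.pool2 = nn.MaxPool2d(kernel_size=2, stride=2)  # first pooling path
        self.pool4 = nn.MaxPool2d(kernel_size=4, stride=4)  # second pooling path

    def forward(self, x):
        # x: input features of shape (N, D, H, W), sizes already unified by deconvolution
        size = x.shape[-2:]
        p2 = F.interpolate(self.pool2(x), size=size, mode='bilinear', align_corners=False)
        p4 = F.interpolate(self.pool4(x), size=size, mode='bilinear', align_corners=False)
        # Add each upsampled pooled path to the input, then concatenate -> W x H x 2D output
        return torch.cat([x + p2, x + p4], dim=1)
```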
[0080] 2. Generation of candidate regions
[0081] In this embodiment, a Feature Pyramid Network (FPN) is used to generate candidate regions; other existing techniques may also be used to generate candidate regions.
[0082] 3. Screening of candidate regions
[0083] In this embodiment, Non-Maximum Suppression (NMS) is used to screen the candidate regions generated in the previous step. Its purpose is to retain target detection boxes with high confidence and suppress false detection boxes with low confidence. The execution steps are: sort all boxes by score and select the box with the highest score; traverse the remaining boxes and delete any box whose overlap (IoU) with the current highest-scoring box exceeds a given threshold; then pick the highest-scoring box from the unprocessed boxes and repeat the above process until the screening is complete.
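A plain NumPy sketch of this screening procedure; the IoU threshold value of 0.5 is an assumption.

```python
import numpy as np

def nms(boxes, scores, iou_threshold=0.5):
    """boxes: (N, 4) array of [x1, y1, x2, y2]; scores: (N,) confidences.
    Returns the indices of the boxes kept after suppression."""
    x1, y1, x2, y2 = boxes[:, 0], boxes[:, 1], boxes[:, 2], boxes[:, 3]
    areas = (x2 - x1) * (y2 - y1)
    order = scores.argsort()[::-1]            # sort boxes by score, highest first
    keep = []
    while order.size > 0:
        i = order[0]                           # pick the highest-scoring remaining box
        keep.append(i)
        # Intersection of box i with every other remaining box
        xx1 = np.maximum(x1[i], x1[order[1:]])
        yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])
        inter = np.maximum(0.0, xx2 - xx1) * np.maximum(0.0, yy2 - yy1)
        iou = inter / (areas[i] + areas[order[1:]] - inter)
        # Discard boxes whose overlap with the current box exceeds the threshold
        order = order[1:][iou <= iou_threshold]
    return keep
```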
[0084] The embodiment of the present invention designs a new three-branch attention target detection model composed of three branches: a spatial network, a backbone network and a fusion network module. Subsequent experiments demonstrate that it achieves very competitive object detection performance. The novel spatial attention module, multi-level attention module, contextual attention module and pyramid pooling module can be adapted to network architectures for other vision tasks with minor modifications.