A sonar image target detection method based on multi-scale feature fusion

By employing a multi-scale feature fusion method, combined with convolutional layers, stacking modules, and attention mechanisms, the problems of poor foreground-background distinction and neglect of global contextual information in sonar image target detection are solved, achieving higher recognition accuracy and robustness.

CN117218502B · Active · Publication Date: 2026-06-19 · GUILIN UNIV OF ELECTRONIC TECH

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents (China)
Current Assignee / Owner: GUILIN UNIV OF ELECTRONIC TECH
Filing Date: 2023-09-11
Publication Date: 2026-06-19

AI Technical Summary

Technical Problem

Existing sonar image target detection models have difficulty effectively distinguishing between foreground and background during feature extraction, resulting in low recognition accuracy. Furthermore, they ignore global contextual information, which affects the model's recognition performance.

Method used

A multi-scale feature fusion method is adopted. It extracts multi-dimensional features through four convolutional layers, multiple stacking modules, downsampling operations, an attention mechanism, and spatial pyramid pooling, thereby enhancing the feature extraction capability and globality of the model. A general-purpose convolution module that adds no convolutional computation increment is used to improve recognition accuracy.

Benefits of technology

It improves the recognition accuracy of target detection in sonar images, reduces false detections and missed detections, enhances the robustness and adaptability of the model, and improves the richness and accuracy of feature extraction.



Abstract

This invention discloses a sonar image target detection method based on multi-scale feature fusion and relates to the field of underwater target recognition technology. The method comprises the following steps: S1, feature extraction; S2, feature fusion; S3, target recognition. By incorporating a dynamic convolutional layer design, the method extracts multi-dimensional features, better captures the global contextual information of the features, and improves the representational ability of the extracted deep abstract features, thereby improving the model's recognition accuracy. Features are extracted with stacked convolutions to which an attention mechanism is added; this enhances the model, improves the globality of the feature extraction process, better distinguishes foreground from background in sonar images where the two are highly similar, avoids false detections and missed detections, and further improves recognition accuracy.

Description

Technical Field

[0001] This invention relates to the field of underwater target recognition technology, specifically to a method for sonar image target detection using multi-scale feature fusion.

Background Technology

[0002] With the rapid development of deep learning, the field of underwater target recognition in sonar images has broad development prospects. Emerging deep learning methods can effectively improve the generalization ability of recognition models, realize end-to-end information transmission and autonomous recognition. However, these tasks face some challenges, such as improving the accuracy and speed of detection and recognition, minimizing the cost of computation and communication, and reducing the complexity of models. Solving these problems is the key to improving the overall performance of underwater target detection systems.

[0003] Target detection in sonar images is traditionally performed by manually extracting features, primarily using pixel-based, feature-based, and echo-based methods. These methods largely rely on pixel value features, grayscale thresholds, or prior information about the target. However, the complex underwater environment and the influence of self-noise, reverberation noise, and environmental noise on echoes result in low resolution, blurred edge details, and severe speckle noise in sonar images, making it difficult to find suitable pixel features and grayscale thresholds. Furthermore, the diverse range of underwater targets makes obtaining prior information too costly. Therefore, current traditional algorithms have low accuracy in detecting underwater targets. With the development of deep learning, current deep learning-based detection methods can extract multi-layered abstract features, solving the problem of manual feature extraction in traditional methods. Simultaneously, data augmentation addresses the issue of limited sample sizes. Current sonar image detection methods demonstrate high detection speed and accuracy, but further research is needed to address the challenge of merging noise and target information during sonar image feature extraction, a challenge that currently hinders improvements in recognition accuracy.

[0004] Jin et al. (“A Sonar Image Recognition Method for Underwater Targets Based on Convolutional Neural Networks”, Journal of Northwestern Polytechnical University, Vol. 39, No. 2: pp. 285-291 (2021)) used a convolutional neural network to identify underwater targets. This network combines image prominence region segmentation and pyramid pooling based on sonar image features to prevent model overfitting, but does not consider the problems of complex background feature extraction and target feature loss in sonar image recognition.

[0005] Wang et al. (X. Wang, Z. Guan, J. Wang, and Y. Wang, “Target detection of color image sonar based on convolutional neural network”, J. Comput. Appl., vol. 39, no. S1, pp. 187-191, Jul. 2019) successfully achieved target recognition of sonar images by using bilinear interpolation preprocessing technology combined with a target detection algorithm network. However, the experimental target was limited to wooden stakes, and the problem of insufficient experimental samples was not solved.

[0006] Sheng et al. (“Sonar Image Mine Detection Based on Sample Simulation Combined with Transfer Learning”, Journal of Intelligent Systems, Vol. 16, No. 2: pp. 385-392 (2021)) achieved sonar image simulation of mine targets, expanding the dataset. At the same time, combining sample simulation and transfer learning improved the accuracy of deep learning and avoided the problem of insufficient learning of deep learning with small samples. However, they ignored the global nature of the feature extraction process and the problem of the convolutional layer’s ability to learn contextual features.

[0007] Based on the search of the above information, the following shortcomings can be identified in the existing technology:

[0008] Existing sonar image target detection models mainly use stacked convolution for feature extraction. However, they have poor differentiation between foreground and background in sonar images with high similarity, which can easily lead to false detections or missed detections.

[0009] Existing sonar image target detection models ignore the global contextual information of features, resulting in poor representational ability of the extracted deep abstract features, thereby reducing the model's recognition accuracy.

Summary of the Invention

[0010] (I) Technical Problems to Be Solved

[0011] To address the shortcomings of existing technologies, this invention provides a multi-scale feature fusion method for sonar image target detection, which solves the aforementioned problems.

[0012] (II) Technical Solution

[0013] To achieve the above objectives, the present invention provides the following technical solution: a multi-scale feature fusion method for sonar image target detection, specifically comprising the following steps:

[0014] S1. Feature extraction: The original sonar image is input into the feature extraction module. After partial feature extraction through four convolutional layers, feature fusion is performed to obtain a four-fold downsampled feature map. The four-fold downsampled feature map is then alternately subjected to four stacking modules and three downsampling operations to obtain the first and second effective feature layers. Then, an attention mechanism is introduced to perform spatial pyramid pooling to obtain the third effective feature layer.

[0015] S2. Feature fusion: The first and second effective feature layers are each convolved using an attention mechanism to obtain the convolved first and second effective feature layers. The third effective feature layer undergoes dynamic convolution and upsampling and is then merged with the convolved second effective feature layer through a stacking module to obtain the first fused feature layer. The first fused feature layer undergoes dynamic convolution and upsampling and is then merged with the convolved first effective feature layer through a stacking module to obtain the second fused feature layer. The second fused feature layer is downsampled and then merged with the convolved second effective feature layer and the first fused feature layer through a stacking module to obtain the third fused feature layer. Finally, the third fused feature layer is merged with the third effective feature layer through a stacking module to obtain the fourth fused feature layer.

[0016] S3. Target recognition: The second, third, and fourth fused feature layers are each input into a novel general-purpose convolution module that improves accuracy without adding any convolutional computation increment. The corresponding bounding box, category, and confidence information are output to predict the center point. The center point offset is calculated from the regression prediction results, and the width and height of the prediction box are predicted. Non-maximum suppression is then applied.

[0017] The present invention is further configured such that the stacking modules in S1 and S2 include two first convolutional branches:

[0018] One of the first convolutional branches undergoes a 1x1 convolution to change the number of channels;

[0019] Another first convolutional branch first passes through a 1x1 convolutional module to change the number of channels, then passes through four 3x3 convolutional modules for feature extraction, and finally superimposes the four features with the output of the first convolutional branch to obtain the feature fusion result.

[0020] The present invention is further configured such that the downsampling operations in S1 and S2 include two downsampling branches:

[0021] After passing through a max pooling layer, one of the downsampling branches passes through a 1x1 convolutional module to change the number of channels;

[0022] Another downsampling branch undergoes a 1x1 convolution module to change the number of channels, followed by a 3x3 convolution with a stride of 2. The output of this convolution is then superimposed on the output of one of the downsampling branches to obtain the downsampling result.
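
As a rough illustration of why the two downsampling branches can be combined, the sketch below tracks feature-map shapes only. The 2x2 stride-2 max pooling window, the padding of 1 on the 3x3 convolution, and the reading of "superimposed" as channel concatenation are assumptions that the text does not state; the function names are hypothetical.

```python
def window_out(size: int, kernel: int, stride: int, padding: int = 0) -> int:
    """Output length of a convolution or pooling window along one axis."""
    return (size + 2 * padding - kernel) // stride + 1


def downsample_shape(height: int, width: int, branch_channels: int):
    """Return (height, width, channels) after the two-branch downsampling.

    Assumptions (not given in the text): 2x2 max pooling with stride 2,
    padding 1 on the 3x3 stride-2 convolution, and channel concatenation.
    """
    # Branch 1: max pooling halves the spatial size; the 1x1 conv only sets channels.
    pool_h = window_out(height, kernel=2, stride=2)
    pool_w = window_out(width, kernel=2, stride=2)

    # Branch 2: the 1x1 conv sets channels; the 3x3 stride-2 conv halves the size.
    conv_h = window_out(height, kernel=3, stride=2, padding=1)
    conv_w = window_out(width, kernel=3, stride=2, padding=1)

    # Both branches must agree spatially before their outputs can be combined.
    if (pool_h, pool_w) != (conv_h, conv_w):
        raise ValueError("branch outputs differ in spatial size")
    return pool_h, pool_w, 2 * branch_channels


print(downsample_shape(160, 160, branch_channels=64))  # (80, 80, 128)
```

Under these assumptions both branches halve an even-sized input, which is what allows their outputs to be combined into a single downsampling result.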

[0023] The present invention is further configured such that: the spatial pyramid pooling operation in S1 includes two second convolutional branches:

[0024] One of the second convolutional branches is processed by four different max pooling operations and then outputs a convolution result.

[0025] The other second convolutional branch performs convolution processing directly. Its convolution result is then merged with the convolution result output by the first of the second convolutional branches to output the pooling result.
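
The multi-window pooling in the first branch can be sketched in plain Python as follows. This is only an illustration for a single channel: the kernel sizes (3, 5, 7, 9), the stride of 1, and the border handling are assumptions, since the text specifies only that four different max pooling operations are used.

```python
def max_pool_same(grid, kernel):
    """Stride-1 max pooling that keeps the input size (odd kernel, clipped border)."""
    if kernel % 2 == 0:
        raise ValueError("kernel must be odd to keep the spatial size")
    rows, cols = len(grid), len(grid[0])
    r = kernel // 2
    return [
        [
            max(
                grid[y][x]
                for y in range(max(0, i - r), min(rows, i + r + 1))
                for x in range(max(0, j - r), min(cols, j + r + 1))
            )
            for j in range(cols)
        ]
        for i in range(rows)
    ]


def pyramid_pool(grid, kernels=(3, 5, 7, 9)):
    """Pool one channel with several window sizes (kernel sizes are hypothetical)."""
    return [max_pool_same(grid, k) for k in kernels]


# A single bright pixel covers a larger area as the pooling window grows.
grid = [[0] * 7 for _ in range(7)]
grid[3][3] = 1
for k, pooled in zip((3, 5, 7, 9), pyramid_pool(grid)):
    print(k, sum(map(sum, pooled)))
```

Because every pooled map keeps the input size, the four results can be stacked with the output of the direct convolution branch; each window size corresponds to a different receptive field.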

[0026] The present invention is further configured such that: the dynamic convolution in S2 uses a multidimensional attention mechanism to learn a four-dimensional convolution kernel space through a parallel strategy.

[0027] The present invention is further configured such that: the novel general convolution module in S3, which has no convolutional computation increment but improves accuracy, is used to modify the number of image channels of output features of different sizes, for feature extraction and feature smoothing.

[0028] The present invention is further configured such that: in S3, non-maximum suppression sorts the candidate boxes of each category by classification probability, then iteratively computes the overlap between the highest-scoring box and the remaining boxes, filters out boxes with large overlap, removes redundant candidates, and retains the optimal prediction for each target object.
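
A minimal sketch of greedy per-category non-maximum suppression is shown below. It assumes that the "largest" box means the box with the highest classification probability, that overlap is measured by intersection over union (IoU), and that the threshold is 0.5; none of these details is stated in the text, and the function names are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(detections, iou_threshold=0.5):
    """Greedy NMS per category. Each detection is (box, score, class_id)."""
    kept = []
    for cls in sorted({d[2] for d in detections}):
        # Sort this category's boxes by score, highest first.
        candidates = sorted(
            (d for d in detections if d[2] == cls),
            key=lambda d: d[1],
            reverse=True,
        )
        while candidates:
            best = candidates.pop(0)
            kept.append(best)
            # Drop remaining boxes that overlap the kept box too much.
            candidates = [
                d for d in candidates if iou(best[0], d[0]) <= iou_threshold
            ]
    return kept


detections = [
    ((0, 0, 10, 10), 0.9, 0),
    ((1, 1, 11, 11), 0.8, 0),    # overlaps the first box heavily
    ((20, 20, 30, 30), 0.7, 0),
    ((0, 0, 10, 10), 0.6, 1),    # different category, so it is kept
]
print([score for _, score, _ in nms(detections)])  # [0.9, 0.7, 0.6]
```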

[0029] The present invention is further configured such that the attention mechanism in S1 and S2 includes a squeeze step, an excitation step, and a scaling step;

[0030] The squeeze step is as follows: through global average pooling, the two-dimensional feature (height * width) of each channel is compressed into one real number, so that the feature map changes from [height, width, number of channels] ==> [1, 1, number of channels], yielding the global feature of each channel;

[0031] The excitation step is as follows: a weight value is generated for each feature channel, and the correlation between channels is modeled through two fully connected layers. The number of output weight values is the same as the number of channels in the input feature map, [1, 1, number of channels] ==> [1, 1, number of channels]. In this way the relationship between channels is learned and the weights of the different channels are obtained.

[0032] The scaling step is as follows: the normalized weights obtained in the excitation step are applied to the features of each channel, i.e., each channel is multiplied by its weight coefficient, [height, width, number of channels] * [1, 1, number of channels] ==> [height, width, number of channels]. The attention weights are calculated by comparing the input with the current state of the model to obtain an attention score for each input. These attention scores reflect the importance of each input to the current task. Each input is multiplied by its corresponding attention score, so that the inputs are weighted according to their importance. Finally, the weighted inputs are summed or concatenated to obtain the final output, thereby capturing the information relevant to the task.
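
The three steps can be sketched in plain Python on a feature map stored as nested lists of shape [height][width][channels]. The ReLU after the first fully connected layer, the sigmoid used to normalize the weights, the omission of biases, and the weight matrices `w1` and `w2` are assumptions for illustration; the text specifies only two fully connected layers and normalized weights.

```python
import math


def channel_attention(x, w1, w2):
    """Squeeze, excitation, and scaling on x[height][width][channels].

    w1 has shape [channels][hidden] and w2 has shape [hidden][channels];
    both are hypothetical learned weights (biases omitted for brevity).
    """
    h, w, c = len(x), len(x[0]), len(x[0][0])

    # Squeeze: global average pooling, [h, w, c] ==> [1, 1, c].
    z = [sum(x[i][j][k] for i in range(h) for j in range(w)) / (h * w)
         for k in range(c)]

    # Excitation: two fully connected layers, [1, 1, c] ==> [1, 1, c].
    hidden = [max(0.0, sum(z[k] * w1[k][m] for k in range(c)))  # ReLU (assumed)
              for m in range(len(w1[0]))]
    scale = [1.0 / (1.0 + math.exp(-sum(hidden[m] * w2[m][k]
                                        for m in range(len(hidden)))))
             for k in range(c)]  # sigmoid keeps each weight in (0, 1) (assumed)

    # Scaling: multiply every channel by its weight, shape stays [h, w, c].
    out = [[[x[i][j][k] * scale[k] for k in range(c)] for j in range(w)]
           for i in range(h)]
    return out, scale


x = [[[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0], [7.0, 8.0]]]  # 2 x 2 x 2 feature map
out, scale = channel_attention(x, w1=[[0.5], [0.5]], w2=[[0.1, -0.1]])
print(scale)
```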

[0033] (III) Beneficial Effects

[0034] This invention provides a multi-scale feature fusion method for target detection in sonar images. It has the following beneficial effects:

[0035] (1) By combining the design of dynamic convolutional layers, this invention extracts multi-dimensional features and better extracts the global context information of the features, thereby improving the representation ability of the extracted deep abstract features and thus improving the recognition accuracy of the model. The feature extraction is performed by stacked convolution. An attention mechanism is added to the stacked convolution to enhance the model and improve the globality of the feature extraction process. It can better distinguish the difference between the foreground and background of sonar images with high similarity, avoid the model from false detection or missed detection, and further improve the recognition accuracy.

[0036] (2) This invention extracts and integrates a portion of features through four convolutional layers, and then obtains three effective feature layers through multiple stacking modules and downsampling operations. The effectively extracted multi-dimensional mixed feature information enables the network to learn more features and has stronger robustness. It overcomes the possibility that information may disappear or expand after passing through multiple layers, making feature extraction more accurate and efficient.

[0037] (3) By setting the downsampling operation, the present invention enables the max pooling layer to only consider the maximum value information of local small regions and realize spatial downsampling, while the convolutional layer can consider all value information of local small regions and help to smoothly extract features and avoid losing more feature information during downsampling. Finally, the two are combined so that the information extracted by the feature has the advantages of both.

[0038] (4) By introducing the attention mechanism, the present invention improves the performance and accuracy of the model by selecting and focusing on the important features in the data. Feeding the result into spatial pyramid pooling then enlarges the receptive field so that the algorithm can adapt to different resolutions, while reducing the amount of computation so that detection is both fast and accurate.

[0039] (5) The present invention uses a spatial pyramid pooling design. On the one hand, it applies four different max pooling operations to handle different objects, providing four receptive fields to distinguish between large and small targets. On the other hand, it uses convolutional layers to reduce the number of parameters. Combining the two allows feature layers of different shapes to be fused, which helps extract better features and increases the richness of the extracted features.

Description of the Drawings

[0040] Figure 1 is a schematic diagram of the process of the present invention;

[0041] Figure 2 is a schematic diagram of the system architecture for feature extraction according to the present invention;

[0042] Figure 3 is a schematic diagram of the system architecture for feature fusion and target recognition of the present invention;

[0043] Figure 4 is a schematic diagram of the stacking module process of the present invention;

[0044] Figure 5 is a schematic diagram of the downsampling operation process of the present invention;

[0045] Figure 6 is a schematic diagram of the pyramid pooling process of the present invention;

[0046] Figure 7 is a schematic diagram of the attention mechanism of the present invention;

[0047] Figure 8 is a schematic diagram of a sonar image dataset in an embodiment of the present invention.

Detailed Implementation

[0048] The technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention.

[0049] Referring to Figures 1-8, the present invention provides the following technical solution: a sonar image target detection method with multi-scale feature fusion, as detailed below:

[0050] Dataset preparation:

[0051] The STD-A dataset, which consists of 1620 sonar images, is used. Each image is manually annotated with one of three underwater target categories: ship, human, or aircraft. These images serve as input to the deep learning model. The dataset is split into training and validation sets at a ratio of 9:1, as shown in Figure 8.
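
For concreteness, a 9:1 split of 1620 images gives 1458 training images and 162 validation images. The sketch below shows one way such a split could be made; the shuffling, the seed, and the file names are hypothetical, since the text states only the dataset size and the ratio.

```python
import random


def split_dataset(items, train_ratio=0.9, seed=0):
    """Shuffle the items reproducibly and split them into (train, validation)."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical file names standing in for the 1620 annotated sonar images.
images = [f"sonar_{i:04d}.png" for i in range(1620)]
train, val = split_dataset(images)
print(len(train), len(val))  # 1458 162
```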

[0052] Model preparation:

[0053] The network is trained for 300 epochs with a batch size of 4. A stochastic gradient descent optimizer is used with an initial learning rate of 0.001, and the learning rate is adjusted adaptively to avoid overfitting.

[0054] Model training:

[0055] S1. Feature extraction: The training set is input into the feature extraction module. After partial feature extraction through four convolutional layers, feature fusion is performed to obtain a four-fold downsampled feature map. This four-fold downsampled feature map is then alternately subjected to four stacking modules and three downsampling operations to obtain the first and second effective feature layers. Finally, an attention mechanism is introduced to perform spatial pyramid pooling to obtain the third effective feature layer, as shown in Figures 2 and 3. When the input is 640x640, the sizes of the first, second, and third effective feature layers are 80x80, 40x40, and 20x20, respectively.
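
The stated layer sizes follow from simple arithmetic, sketched below under the assumption that each of the three downsampling operations halves the spatial size (the stacking modules, attention mechanism, and spatial pyramid pooling are taken to preserve it). The function name is hypothetical.

```python
def effective_layer_sizes(input_size=640, stem_factor=4, num_downsamples=3):
    """Side lengths of the effective feature layers for a square input."""
    size = input_size // stem_factor   # four-fold downsampled feature map: 160
    sizes = []
    for _ in range(num_downsamples):
        size //= 2                     # each downsampling halves the size (assumed)
        sizes.append(size)
    return sizes


print(effective_layer_sizes())  # [80, 40, 20]
```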

[0056] S2. Feature fusion: The first and second effective feature layers are each convolved using an attention mechanism to obtain the convolved first and second effective feature layers. The third effective feature layer undergoes dynamic convolution and upsampling and is then merged with the convolved second effective feature layer through a stacking module to obtain the first fused feature layer. The first fused feature layer undergoes dynamic convolution and upsampling and is then merged with the convolved first effective feature layer through a stacking module to obtain the second fused feature layer. The second fused feature layer is downsampled and then merged with the convolved second effective feature layer and the first fused feature layer through a stacking module to obtain the third fused feature layer. Finally, the third fused feature layer is merged with the third effective feature layer through a stacking module to obtain the fourth fused feature layer.
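
The order of the merges in S2 can be checked by tracking spatial sizes only, as in the sketch below. It assumes 2x upsampling, halving downsampling, and effective feature layers of 80, 40, and 20 for a 640x640 input. It also assumes that the third fused feature layer is downsampled before being merged with the third effective feature layer, which the text does not state explicitly. All names are hypothetical.

```python
def merge(*sizes):
    """Feature layers can only be merged when their spatial sizes agree."""
    if len(set(sizes)) != 1:
        raise ValueError(f"cannot merge layers of sizes {sizes}")
    return sizes[0]


def fusion_sizes(first=80, second=40, third=20):
    """Spatial side lengths of the four fused feature layers."""
    def up(size):
        return size * 2    # 2x upsampling (assumed)

    def down(size):
        return size // 2   # halving downsampling (assumed)

    fused_1 = merge(up(third), second)             # 20 -> 40, merged with 40
    fused_2 = merge(up(fused_1), first)            # 40 -> 80, merged with 80
    fused_3 = merge(down(fused_2), second, fused_1)  # 80 -> 40, merged with 40 and 40
    fused_4 = merge(down(fused_3), third)          # assumed downsampling: 40 -> 20
    return {"fused_1": fused_1, "fused_2": fused_2,
            "fused_3": fused_3, "fused_4": fused_4}


print(fusion_sizes())
# {'fused_1': 40, 'fused_2': 80, 'fused_3': 40, 'fused_4': 20}
```

Under these assumptions, the second, third, and fourth fused feature layers passed on to S3 have sizes 80, 40, and 20, giving three detection scales.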

[0057] S3. Target recognition: The second, third, and fourth fused feature layers are each input into a novel general-purpose convolution module that improves accuracy without adding any convolutional computation increment. The corresponding bounding box, category, and confidence information are output to predict the center point. The center point offset is calculated from the regression prediction results, and the width and height of the prediction box are predicted. Non-maximum suppression is then applied.

[0058] During model training, the model is iterated continuously to obtain an improved model, and the optimal model is obtained when the loss function converges.

[0059] Specifically, the loss function is calculated as follows: the value of the loss function = loss of the probability of the target's existence * 0.1 + loss of the category prediction part * 0.125 + loss of the bounding box regression part * 0.0;

[0060] Specifically, for the bounding box regression part, the prior box corresponding to each ground truth box is identified, the predicted box corresponding to that prior box is extracted, and the complete threshold loss is calculated from the ground truth box and the predicted box. This loss serves as the bounding box regression component.

[0061] For the target existence probability part, the prior box corresponding to each ground truth box is identified. All prior boxes that correspond to ground truth boxes are positive samples, and the remaining prior boxes are negative samples. The cross-entropy loss is calculated from the positive and negative samples and the prediction of whether each feature point contains an object. This loss serves as the target existence probability component.

[0062] For the category prediction part, the prior box corresponding to each ground truth box is identified and its category prediction result is extracted. The cross-entropy loss is calculated from the category of the ground truth box and the category prediction result of the prior box. This loss serves as the category prediction component.
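
A minimal sketch of how the three components could be combined is given below. The coefficients are copied as printed in paragraph [0059]. The binary form of the cross-entropy, the clamping constant `eps`, and all function names are assumptions; the bounding box regression loss is left as an input because the text does not give its formula.

```python
import math

# Coefficients as printed in paragraph [0059].
PRINTED_WEIGHTS = {"objectness": 0.1, "category": 0.125, "box": 0.0}


def binary_cross_entropy(targets, probs, eps=1e-7):
    """Mean binary cross-entropy; targets are 1 (positive) or 0 (negative)."""
    total = 0.0
    for t, p in zip(targets, probs):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return total / len(targets)


def total_loss(objectness_loss, category_loss, box_loss, weights=PRINTED_WEIGHTS):
    """Weighted sum of the three loss components."""
    return (weights["objectness"] * objectness_loss
            + weights["category"] * category_loss
            + weights["box"] * box_loss)


# Two prior boxes: one positive sample and one negative sample.
obj = binary_cross_entropy([1, 0], [0.8, 0.3])
print(total_loss(objectness_loss=obj, category_loss=0.5, box_loss=0.2))
```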

[0063] The experimental tables and graphs are as follows:

[0064]

[0065] As a detailed explanation, the problem of model overfitting caused by small sample sizes in sonar image datasets is addressed by data augmentation. The dataset was increased to a certain extent, avoiding overfitting during training, improving model robustness, and reducing the model's sensitivity to images. In sonar images, the foreground and background of targets have high similarity, and the feature extraction process is prone to losing target feature information. By adding an attention mechanism to stacked convolution, the model is enhanced, improving the globality of the feature extraction process, better distinguishing between foreground and background in sonar images with high similarity, and avoiding false detections or missed detections. Dynamic convolution can better preserve the feature information of small targets in sonar images, thereby improving detection performance.

[0066] Experiments show that the target detection method described in this invention can avoid false positives and false negatives to a certain extent, and has better performance in marine image recognition with improved recognition accuracy.

Claims

1. A sonar image target detection method with multi-scale feature fusion, characterized by comprising the following steps: S1. Feature extraction: The original sonar image is input into the feature extraction module. After partial feature extraction through four convolutional layers, feature fusion is performed to obtain a four-fold downsampled feature map. The four-fold downsampled feature map is then alternately subjected to four stacking modules and three downsampling operations to obtain the first and second effective feature layers. Then, an attention mechanism is introduced to perform spatial pyramid pooling to obtain the third effective feature layer. S2. Feature fusion: The first and second effective feature layers are each convolved using an attention mechanism to obtain the convolved first and second effective feature layers. The third effective feature layer undergoes dynamic convolution and upsampling and is then merged with the convolved second effective feature layer through a stacking module to obtain the first fused feature layer. The first fused feature layer undergoes dynamic convolution and upsampling and is then merged with the convolved first effective feature layer through a stacking module to obtain the second fused feature layer. The second fused feature layer is downsampled and then merged with the convolved second effective feature layer and the first fused feature layer through a stacking module to obtain the third fused feature layer. Finally, the third fused feature layer is merged with the third effective feature layer through a stacking module to obtain the fourth fused feature layer. S3. Target recognition: The second, third, and fourth fused feature layers are each input into a novel general-purpose convolution module that improves accuracy without adding any convolutional computation increment. The corresponding bounding box, category, and confidence information are output to predict the center point. The center point offset is calculated from the regression prediction results, and the width and height of the prediction box are predicted. Non-maximum suppression is then applied.

2. The multi-scale feature fusion sonar image target detection method of claim 1, wherein: The stacking modules in S1 and S2 include two first convolutional branches: One of the first convolutional branches undergoes a 1x1 convolution to change the number of channels; Another first convolutional branch first passes through a 1x1 convolutional module to change the number of channels, then passes through four 3x3 convolutional modules for feature extraction, and finally superimposes the four features with the output of the first convolutional branch to obtain the feature fusion result.

3. The multi-scale feature fusion sonar image target detection method of claim 1, wherein: The downsampling operations in S1 and S2 include two downsampling branches: After passing through a max pooling layer, one of the downsampling branches passes through a 1x1 convolutional module to change the number of channels; Another downsampling branch undergoes a 1x1 convolution module to change the number of channels, followed by a 3x3 convolution with a stride of 2. The output of this convolution is then superimposed on the output of the other downsampling branch to obtain the downsampling result.

4. The multi-scale feature fusion sonar image target detection method according to claim 1, characterized in that: The spatial pyramid pooling operation in S1 includes two second convolutional branches: One of the second convolutional branches is processed by four different max pooling operations and then outputs a convolution result. The other second convolutional branch performs convolution processing directly. Its convolution result is then merged with the convolution result output by the first of the second convolutional branches to output the pooling result.

5. The method of claim 1, wherein: The dynamic convolution in S2 uses a multidimensional attention mechanism to learn a four-dimensional convolution kernel space through a parallel strategy.

6. The multi-scale feature fusion sonar image target detection method according to claim 1, characterized in that: The novel general-purpose convolution module in S3, which improves accuracy without increasing the convolutional computation increment, is used to modify the number of image channels for output features of different sizes, and is used for feature extraction and feature smoothing.

7. The sonar image target detection method based on multi-scale feature fusion according to claim 1, characterized in that: In S3, non-maximum suppression sorts the candidate boxes of each category by classification probability, then iteratively computes the overlap between the highest-scoring bounding box and the remaining bounding boxes, filters out boxes with large overlap, removes redundant candidates, and retains the optimal prediction for each target object.

8. The sonar image target detection method based on multi-scale feature fusion according to claim 1, characterized in that: The attention mechanisms in S1 and S2 include a squeeze step, an excitation step, and a scaling step. The squeeze step is as follows: through global average pooling, the two-dimensional feature (height * width) of each channel is compressed into one real number, so that the feature map changes from [height, width, number of channels] ==> [1, 1, number of channels], yielding the global feature of each channel; The excitation step is as follows: a weight value is generated for each feature channel, the correlation between channels is modeled through two fully connected layers, and the number of output weight values is the same as the number of channels in the input feature map, [1, 1, number of channels] ==> [1, 1, number of channels]; The scaling step is as follows: the normalized weights obtained in the excitation step are applied to the features of each channel, that is, each channel is multiplied by its weight coefficient, [height, width, number of channels] * [1, 1, number of channels] ==> [height, width, number of channels].