A sonar image target detection method and system based on frequency-space cooperative perception

By employing a frequency-space collaborative sensing method, combined with adaptive equalization preprocessing and a multi-scale feature fusion network, the problems of missed detection and false detection of small targets in sonar images are solved, achieving high-precision target detection.

CN122244658A · Pending · Publication Date: 2026-06-19 · ANHUI UNIV

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
ANHUI UNIV
Filing Date
2026-03-25
Publication Date
2026-06-19

AI Technical Summary

Technical Problem

Existing sonar image target detection technologies struggle to effectively extract local boundary texture information of small targets, leading to missed detections and false detections, especially with insufficient detection capabilities against dark backgrounds.

Method used

A frequency-space collaborative sensing method is adopted, which enhances the extraction and fusion of target boundary texture information by combining spatial and frequency domain features through contrast-limited adaptive equalization preprocessing, cross-stage partially connected feature fusion module, frequency domain feature extraction and multi-branch downsampling fusion network.

Benefits of technology

It improves the accuracy and discriminative power of target detection in sonar images, enhances contrast and texture stability, and improves the detection accuracy and recall of the model.

✦ Generated by Eureka AI based on patent content.


Abstract

This invention discloses a frequency-space collaborative sensing method and system for target detection in sonar images. The method includes the following steps: acquiring sonar images, preprocessing them using contrast-limited adaptive equalization, and dividing them into training and validation sets; constructing an initial target detection network and training it based on the training and validation sets to obtain the target detection network; acquiring the sonar image to be detected and inputting it into the target detection network for target detection to obtain the detection result. The contrast-limited adaptive equalization grayscale transformation preprocessing of this invention enhances the contrast of sonar images, making target boundaries clearer, textures more stable, and background stripes and speckles more distinguishable, effectively improving the model's detection accuracy.

Description

Technical Field

[0001] This invention belongs to the field of computer vision technology and specifically relates to a sonar image target detection method and system based on frequency-space collaborative sensing.

Background Technology

[0002] Sonar target detection is crucial in marine engineering (underwater equipment maintenance, marine resource exploration), in military applications (underwater target identification, anti-submarine warfare), and in scientific research (marine biology, underwater archaeology). Extracting target features from sonar images in the spatial domain alone is problematic for two reasons. First, the grayscale difference between target and background in sonar images is very small and edges are blurred; because spatial-domain convolution relies mainly on local brightness, contrast, and edges, it struggles to learn stable boundary patterns. The background is then easily mistaken for the target, which significantly reduces detection capability for small or weak targets. Second, spatial-domain features contain a large amount of redundant information, which greatly increases computational complexity. Sonar targets such as ship hulls, schools of fish, and terrain structures often appear as large, blurry shapes with almost no clear texture, so a detector that uses only the spatial domain has difficulty distinguishing a blurry target from blurry background clutter. How to extract sonar target features more effectively is therefore key to accurate sonar target detection.

[0003] To improve detection accuracy for targets in sonar images, researchers have proposed networks with multiple feature extraction and enhancement stages that extract features at successive depths and use a multi-scale feature fusion module to better capture small targets. However, sonar image target detection that relies on spatial-domain features alone cannot capture the detailed texture of a target's local boundaries, which leads to missed or false detections of small targets against dark backgrounds. Extracting local texture features of target boundaries in sonar images therefore remains a pressing problem.

Summary of the Invention

[0004] This invention aims to address the shortcomings of existing technologies and provides the following solution. A frequency-space collaborative sensing method for target detection in sonar images includes the following steps: acquiring sonar images, preprocessing them using contrast-limited adaptive equalization, and dividing them into training and validation sets; constructing an initial target detection network and training it on the training and validation sets to obtain the target detection network; and acquiring the sonar image to be detected and inputting it into the target detection network to obtain the detection result.

[0005] Preferably, the target detection network includes: a spatial domain feature extraction network, which introduces a cross-stage partially connected feature fusion module to perform feature fusion and enhancement in four feature extraction stages and obtain a spatial domain feature map; a frequency domain feature extraction network, which uses a two-dimensional discrete cosine transform frequency domain feature extraction module to extract frequency domain feature maps from the preprocessed image; a multi-branch downsampling fusion network, which performs multi-branch downsampling and feature fusion on the frequency domain feature map to obtain a multi-scale frequency domain feature map; and a feature fusion network, which adaptively fuses the spatial domain feature map and the multi-scale frequency domain feature map into a final fused feature map on which target detection is performed.

[0006] Preferably, the preprocessing method includes: for the input sonar image $X \in \mathbb{R}^{C \times H \times W}$, obtaining the channels $X_c \in \mathbb{R}^{H \times W}$, where $C$ denotes the number of input channels, $H$ the height, $W$ the width, and $c \in \{0, \dots, C-1\}$ the channel index; performing a grayscale transformation on each channel $X_c$ to obtain the transformed data $\tilde{X}_c = \mathcal{T}(X_c; \alpha, \Omega)$, where $\mathcal{T}$ denotes the grayscale transformation operator, $\alpha$ the contrast limiting parameter, and $\Omega$ the block configuration; fusing the transformed data $\tilde{X}_c$ along the channel dimension to obtain the processed dataset $\tilde{X} = \mathrm{Concat}(\tilde{X}_0, \dots, \tilde{X}_{C-1})$; and dividing the processed dataset $\tilde{X}$ into training and validation sets, $\tilde{X} \to \{\mathcal{D}_{\mathrm{train}}, \mathcal{D}_{\mathrm{val}}\}$, where $\mathcal{D}_{\mathrm{train}}$ denotes the training set and $\mathcal{D}_{\mathrm{val}}$ the validation set.

[0007] Preferably, the cross-stage partially connected feature fusion module includes: passing the sonar image to be detected through two convolutional layers to obtain the feature map $F \in \mathbb{R}^{C \times H \times W}$, where $C$ denotes the number of input channels, $H$ the height, and $W$ the width; passing $F$ through a convolutional layer with a 1×1 kernel to obtain the convolutional feature map $F_1 = W_1 * F + b$, where $W_1$ denotes the convolution kernel, $C_1$ the number of output channels, and $b$ the bias term; dividing $F_1$ into two equal parts to obtain the feature maps $F_a$ and $F_b$, $[F_a, F_b] = \mathrm{Split}(F_1)$; for the feature map $F_a$, taking the output $O_a = F_a$; feeding the feature map $F_b$ into a backbone consisting of two Bottleneck modules connected in series, each containing two branches, a main path and a residual link, where the main path outputs the feature map $F_{\mathrm{main}}$ and the residual link outputs the feature map $F_{\mathrm{res}}$; adding $F_{\mathrm{main}}$ and $F_{\mathrm{res}}$ element by element to obtain the feature map $O_b = F_{\mathrm{main}} + F_{\mathrm{res}}$; concatenating $O_a$ and $O_b$ along the channel dimension to obtain the spliced feature map $F_{\mathrm{cat}} = \mathrm{Concat}(O_a, O_b)$; and passing $F_{\mathrm{cat}}$ through a convolutional layer with a 1×1 kernel, a stride of 1, and padding of 0 to obtain the spatial domain feature map $F_s = W_s * F_{\mathrm{cat}}$, where $W_s$ denotes the convolution kernel.

[0008] Preferably, the two-dimensional discrete cosine transform frequency domain feature extraction module includes: dividing the input sonar image to be detected $X \in \mathbb{R}^{C \times H \times W}$ into blocks with an 8×8 sliding window and a stride of 8, each block being 8×8 in size, to obtain the block-wise feature map $B_{c,i,j}(u,v) = X(c,\, 8i+u,\, 8j+v)$, where $C$ denotes the number of input channels, $H$ the height, $W$ the width, $B_{c,i,j}(u,v)$ an element of each block's feature map, $X(c, 8i+u, 8j+v)$ the corresponding element of the original feature map, $c$ the channel index, $i$ and $j$ the block coordinates, and $u$ and $v$ the coordinates within the block; performing a two-dimensional discrete cosine transform on the block-wise feature map to obtain the discrete cosine transform coefficient matrix $Y_{c,i,j} = D\,B_{c,i,j}\,D^{T}$, where $D$ denotes the basis matrix of the one-dimensional discrete cosine transform and $T$ the transpose of a matrix; zigzag-scanning the coefficient matrix $Y_{c,i,j}$ of each block and rearranging it into a one-dimensional sequence of length 64, $z_{c,i,j}(m) = Y_{c,i,j}(p_m, q_m)$, $m = 0, 1, \dots, 63$, where $z_{c,i,j}(m)$ denotes the $m$-th element of the one-dimensional sequence and $(p_m, q_m)$ the position coordinates of the zigzag sequence; reshaping the channels by arranging the 64 discrete cosine transform coefficients of each block row by row to obtain the new total channel index $k = 64c + m$; and, based on the new total channel index $k$, obtaining the frequency domain feature map $F_f(k, i, j) = z_{c,i,j}(m)$, where $F_f \in \mathbb{R}^{64C \times \frac{H}{8} \times \frac{W}{8}}$ and $z_{c,i,j}(m)$ denotes the $m$-th element of the one-dimensional sequence.

[0009] Preferably, the multi-branch downsampling fusion network includes: performing 3×3 average pooling on the frequency domain feature map $F_f$ to obtain the feature map $F_p$, $F_p(c,h,w) = \frac{1}{9}\sum_{\Delta h=-1}^{1}\sum_{\Delta w=-1}^{1} F_f(c,\, h+\Delta h,\, w+\Delta w)$, where $F_p \in \mathbb{R}^{P_1 \times H \times W}$, $P_1$ denotes the number of input channels, $H$ the height, $W$ the width, $c$ the channel index, and $h$ and $w$ the position indices; dividing $F_p$ into two parts along the channel dimension to obtain the feature maps $F_{p1}$ and $F_{p2}$, $[F_{p1}, F_{p2}] = \mathrm{Split}(F_p)$, where $F_{p1} \in \mathbb{R}^{C_{in1} \times H \times W}$ and $F_{p2} \in \mathbb{R}^{C_{in2} \times H \times W}$; downsampling $F_{p1}$ by convolution to obtain the downsampled feature map $F_d = \mathrm{Conv}_{\downarrow}(F_{p1})$, where $C_{in1}$ denotes the number of input channels; max pooling $F_{p2}$ to obtain the pooled feature map $F_m = \mathrm{MaxPool}(F_{p2})$; convolving $F_m$ to obtain the convolutional feature map $F_c = W_2 * F_m + b_2$, where $C_{in2}$ denotes the number of input channels, $W_2$ the convolution kernel, and $b_2$ the bias term; concatenating $F_d$ and $F_c$ along the channel dimension to obtain the spliced feature map $F_{mc} = \mathrm{Concat}(F_d, F_c)$; and integrating the channels of $F_{mc}$ to obtain the multi-scale frequency domain feature map $F_{ms}(o,h,w) = \sum_{i} W_d(o,i)\,F_{mc}(i,h,w) + b_d(o)$, where $o \in \{0, \dots, C_2-1\}$ denotes the output channel index, $C_2$ the target number of channels, $i$ the input channel index, $W_d$ the convolution kernel, and $b_d$ the bias term.

[0010] The present invention also provides a frequency-space collaborative sensing sonar image target detection system that applies the above method and includes a data partitioning module, a network training module, and a detection module. The data partitioning module acquires sonar images, preprocesses them using contrast-limited adaptive equalization, and divides them into training and validation sets. The network training module constructs an initial target detection network and trains it on the training and validation sets to obtain the target detection network. The detection module acquires the sonar image to be detected and inputs it into the target detection network to obtain the detection result.

[0011] Compared with the prior art, the beneficial effects of the present invention are as follows: (1) The contrast-limited adaptive equalization grayscale transformation preprocessing of the present invention can enhance the contrast of sonar images, making the target boundaries of sonar images clearer, the texture more stable, and the background stripes and speckles more distinguishable, effectively improving the detection accuracy of the model.

[0012] (2) The two-dimensional discrete cosine transform frequency domain feature extraction module designed in this invention performs block-based two-dimensional discrete cosine transform on the input image to capture the frequency domain features of the target, thus solving the problem that a single spatial domain feature extraction network cannot clearly extract the target texture and boundary information.

[0013] (3) The multi-branch downsampling fusion module designed in this invention adopts a dual-branch parallel structure that extracts salient features by two methods, convolutional downsampling and pooling downsampling, and integrates multi-scale features effectively through concatenation and fusion. Compared with traditional downsampling, which is simple but crude, it retains more information and maintains the ability to detect small targets.

Attached Figure Description

[0014] To more clearly illustrate the technical solution of the present invention, the drawings used in the embodiments are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0015] Figure 1 is a schematic diagram of the method flow according to an embodiment of the present invention;
Figure 2 is a schematic diagram of sonar image target detection using frequency-space collaborative sensing according to an embodiment of the present invention;
Figure 3 is a comparison of sonar images before and after preprocessing in an embodiment of the present invention;
Figure 4 is a schematic diagram of the target detection results in a sonar image according to an embodiment of the present invention;
Figure 5 is a schematic diagram of the confusion matrix of the classification results for 10 sonar target types in an embodiment of the present invention.

Detailed Implementation

[0016] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.

[0017] To make the above-mentioned objects, features and advantages of the present invention more apparent and understandable, the present invention will be further described in detail below with reference to the accompanying drawings and specific embodiments.

[0018] Example 1. In this embodiment, as shown in Figure 1, a frequency-space collaborative sensing method for target detection in sonar images includes the following steps. S1. Acquire sonar images, preprocess them using contrast-limited adaptive equalization, and divide them into training and validation sets.

[0019] In this embodiment, the preprocessing method includes: for the input sonar image $X \in \mathbb{R}^{C \times H \times W}$, obtaining the channels $X_c \in \mathbb{R}^{H \times W}$, where $C$ denotes the number of input channels, $H$ the height, $W$ the width, and $c \in \{0, \dots, C-1\}$ the channel index; performing a grayscale transformation on each channel $X_c$ to obtain the transformed data $\tilde{X}_c = \mathcal{T}(X_c; \alpha, \Omega)$, where $\mathcal{T}$ denotes the grayscale transformation operator, $\alpha$ the contrast limiting parameter, and $\Omega$ the block configuration; fusing the transformed data $\tilde{X}_c$ along the channel dimension to obtain the processed dataset $\tilde{X} = \mathrm{Concat}(\tilde{X}_0, \dots, \tilde{X}_{C-1})$; and dividing the processed dataset $\tilde{X}$ into training and validation sets, $\tilde{X} \to \{\mathcal{D}_{\mathrm{train}}, \mathcal{D}_{\mathrm{val}}\}$, where $\mathcal{D}_{\mathrm{train}}$ denotes the training set and $\mathcal{D}_{\mathrm{val}}$ the validation set.

[0020] Figure 3 compares sonar images before and after preprocessing. The left side is the original sonar image, and the right side is the sonar image after contrast-limited adaptive equalization preprocessing. Compared with the original, the brightness and contrast of small targets against a dark background (the aircraft target in the example image) are enhanced, which lays the foundation for improved detection accuracy in subsequent steps.
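The contrast-limiting idea behind this preprocessing can be illustrated with a minimal Python sketch. It is a simplified, single-tile version: it clips the histogram, redistributes the clipped excess evenly, and maps gray levels through the cumulative distribution. The function name `clip_limited_equalize` and the default `clip_limit` are illustrative assumptions; the tile grid and the interpolation between tiles used by full contrast-limited adaptive equalization are omitted, and nothing here is taken from the patent's own implementation.

```python
def clip_limited_equalize(tile, clip_limit=4, levels=256):
    """Equalize one tile of integer gray levels using a clipped histogram.

    tile: 2-D list of ints in [0, levels). clip_limit is a hypothetical
    stand-in for the contrast limiting parameter (alpha in the text).
    """
    hist = [0] * levels
    for row in tile:
        for pixel in row:
            hist[pixel] += 1

    # Clip every bin at clip_limit and spread the excess over all bins.
    excess = sum(max(count - clip_limit, 0) for count in hist)
    clipped = [min(count, clip_limit) + excess / levels for count in hist]

    # The normalized cumulative histogram is the gray-level mapping.
    total = sum(clipped)
    lookup, running = [], 0.0
    for count in clipped:
        running += count
        lookup.append(round(running / total * (levels - 1)))

    return [[lookup[pixel] for pixel in row] for row in tile]


# A low-contrast 2x2 tile is stretched over a much wider gray range.
stretched = clip_limited_equalize([[100, 101], [102, 103]])
```

A smaller `clip_limit` flattens the clipped histogram and pushes the mapping toward the identity, which is how the limit restrains noise amplification in nearly uniform regions.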

[0021] S2. Construct an initial object detection network and train it based on the training and validation sets to obtain the object detection network.

[0022] As shown in Figure 2, the target detection network includes the following components. The spatial domain feature extraction network introduces a cross-stage partially connected feature fusion module to perform feature fusion and enhancement in four feature extraction stages and obtain a spatial domain feature map.

[0023] In this embodiment, the cross-stage partially connected feature fusion module operates as follows. The sonar image to be detected passes through two convolutional layers to give the feature map $F \in \mathbb{R}^{C \times H \times W}$, where $C$ denotes the number of input channels, $H$ the height, and $W$ the width. $F$ then passes through a convolutional layer with a 1×1 kernel to give the convolutional feature map $F_1 = W_1 * F + b$, where $W_1 \in \mathbb{R}^{C_1 \times C \times 1 \times 1}$ denotes the convolution kernel, $C_1$ the number of output channels, 1×1 the size of the convolution kernel, and $b$ the bias term.

[0024] The convolutional feature map $F_1$ is divided into two equal parts to obtain the feature maps $F_a$ and $F_b$, $[F_a, F_b] = \mathrm{Split}(F_1)$. Each part has half the original number of channels and an unchanged spatial size: $F_a, F_b \in \mathbb{R}^{\frac{C_1}{2} \times H \times W}$.

[0025] For the feature map $F_a$, the output is $O_a = F_a$. Passing this half through unchanged reduces the computational cost of the subsequent backbone network and preserves the original features for the final fusion.

[0026] The feature map $F_b$ is fed into a backbone network consisting of two Bottleneck modules connected in series. Each Bottleneck module contains two branches: a main path and a residual link. In the main path, the input first passes through a standard convolutional layer with a 3×3 kernel, a stride of 1, and padding of 1 that compresses the number of channels by half while leaving the spatial size unchanged. It then passes through a second convolutional layer of the same configuration that restores the original number of channels, again leaving the spatial size unchanged. The result is the main-path output feature map $F_{\mathrm{main}}$ of the Bottleneck module.

[0027] The residual link of the Bottleneck module outputs the feature map $F_{\mathrm{res}}$.

[0028] The feature maps $F_{\mathrm{main}}$ and $F_{\mathrm{res}}$ are added element by element to obtain the feature map $O_b = F_{\mathrm{main}} + F_{\mathrm{res}}$.

[0029] $O_a$ and $O_b$ are concatenated along the channel dimension to obtain the spliced feature map $F_{\mathrm{cat}} = \mathrm{Concat}(O_a, O_b)$. $F_{\mathrm{cat}}$ then passes through a convolutional layer with a 1×1 kernel, a stride of 1, and padding of 0 to obtain the spatial domain feature map $F_s = W_s * F_{\mathrm{cat}}$, where $W_s$ denotes the convolution kernel and the number of input channels equals the number of output channels.
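The channel and size bookkeeping of this module can be traced with a short Python sketch. It computes only tensor shapes, not feature values, using the standard convolution output-size formula. The function names `conv_out` and `csp_shapes` and the dictionary keys are illustrative assumptions, and the trace covers a single Bottleneck for brevity (the text chains two, each of which preserves the shape).

```python
def conv_out(size, kernel, stride, padding):
    """Spatial output size of a convolution along one axis."""
    return (size + 2 * padding - kernel) // stride + 1


def csp_shapes(c1, height, width):
    """Trace (channels, height, width) after the first 1x1 convolution.

    c1 is the number of output channels of that convolution.
    """
    if c1 % 4:
        raise ValueError("c1 must be divisible by 4 for the two halvings")
    half = c1 // 2
    # 3x3 kernel, stride 1, padding 1 keeps the spatial size.
    h3, w3 = conv_out(height, 3, 1, 1), conv_out(width, 3, 1, 1)
    # 1x1 kernel, stride 1, padding 0 also keeps the spatial size.
    h1, w1 = conv_out(h3, 1, 1, 0), conv_out(w3, 1, 1, 0)
    return {
        "conv1x1": (c1, height, width),
        "shortcut_half": (half, height, width),      # passed through unchanged
        "bottleneck_squeezed": (half // 2, h3, w3),  # channels compressed by half
        "bottleneck_restored": (half, h3, w3),       # channels restored
        "concatenated": (2 * half, h3, w3),          # both halves joined
        "output": (c1, h1, w1),                      # same channel count in and out
    }


shapes = csp_shapes(128, 80, 64)
```

Because every layer here uses stride 1 with size-preserving padding, only the channel count changes along the way, which is why the two halves can be concatenated without resizing.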

[0030] The frequency domain feature extraction network uses a two-dimensional discrete cosine transform frequency domain feature extraction module to extract frequency domain feature maps from the preprocessed image.

[0031] In this embodiment, the two-dimensional discrete cosine transform frequency domain feature extraction module operates as follows. The input sonar image to be detected $X \in \mathbb{R}^{C \times H \times W}$ is divided into blocks with an 8×8 sliding window and a stride of 8, each block being 8×8 in size, to obtain the block-wise feature map $B_{c,i,j}(u,v) = X(c,\, 8i+u,\, 8j+v)$, where $C$ denotes the number of input channels, $H$ the height, and $W$ the width, with $H$ and $W$ both multiples of 8; $B_{c,i,j}(u,v)$ denotes an element of each block's feature map, $X(c, 8i+u, 8j+v)$ the corresponding element of the original feature map, $c$ the channel index, $i$ and $j$ the block coordinates, and $u$ and $v$ the coordinates within the block.
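The non-overlapping 8×8 blocking can be sketched in a few lines of Python for a single channel stored as a nested list. The function name `split_into_blocks` is an illustrative assumption; the indexing follows the description above, with block coordinates `(i, j)` and in-block coordinates `(u, v)`.

```python
def split_into_blocks(channel, block=8):
    """Return blocks[i][j][u][v] == channel[block * i + u][block * j + v]."""
    height, width = len(channel), len(channel[0])
    if height % block or width % block:
        raise ValueError("height and width must be multiples of the block size")
    return [
        [
            [row[j * block:(j + 1) * block]
             for row in channel[i * block:(i + 1) * block]]
            for j in range(width // block)
        ]
        for i in range(height // block)
    ]


# A 16x24 channel yields a 2x3 grid of 8x8 blocks.
channel = [[100 * r + c for c in range(24)] for r in range(16)]
blocks = split_into_blocks(channel)
```

With a stride equal to the window size, every pixel belongs to exactly one block, so the blocking loses no information and adds no overlap.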

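[0032] To perform a two-dimensional discrete cosine transform on the block-wise feature map, first define the one-dimensional discrete cosine transform basis matrix $D \in \mathbb{R}^{N \times N}$: $D(\gamma, n) = a(\gamma)\cos\left(\frac{(2n+1)\gamma\pi}{2N}\right)$, with $a(0) = \sqrt{1/N}$ and $a(\gamma) = \sqrt{2/N}$ for $\gamma > 0$, where $N = 8$ is the side length of the 8×8 blocks, $n$ denotes the spatial domain sampling point index, and $\gamma$ denotes the frequency component index.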

[0033] Based on this, for each 8×8 local block $B_{c,i,j}$, the coefficient matrix of its two-dimensional discrete cosine transform is defined as $Y_{c,i,j} \in \mathbb{R}^{8 \times 8}$, and the transform is given element-wise by $Y_{c,i,j}(\gamma_1, \gamma_2) = \sum_{u=0}^{7}\sum_{v=0}^{7} D(\gamma_1, u)\,B_{c,i,j}(u,v)\,D(\gamma_2, v)$, where $\gamma_1, \gamma_2 \in \{0, \dots, 7\}$. In matrix form, the two-dimensional discrete cosine transform is $Y_{c,i,j} = D\,B_{c,i,j}\,D^{T}$, where $D$ denotes the basis matrix of the one-dimensional discrete cosine transform and $T$ the transpose of a matrix. The coefficient matrix $Y_{c,i,j}$ of each block is then zigzag-scanned and rearranged into a one-dimensional sequence of length 64.
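The matrix form of the block transform can be checked numerically with a small pure-Python sketch. It assumes the orthonormal DCT-II basis (the normalization is an assumption, since the original formula is not legible here), and the helper names `dct_basis`, `matmul`, `transpose`, and `dct2` are illustrative.

```python
import math


def dct_basis(n_points=8):
    """Orthonormal DCT-II basis: rows are frequencies, columns are samples."""
    basis = []
    for gamma in range(n_points):
        scale = math.sqrt((1 if gamma == 0 else 2) / n_points)
        basis.append([
            scale * math.cos((2 * n + 1) * gamma * math.pi / (2 * n_points))
            for n in range(n_points)
        ])
    return basis


def matmul(a, b):
    """Plain nested-list matrix product."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(matrix):
    return [list(column) for column in zip(*matrix)]


def dct2(block):
    """Two-dimensional DCT of a square block, computed as D * B * D^T."""
    d = dct_basis(len(block))
    return matmul(matmul(d, block), transpose(d))


# A constant block has all of its energy in the DC coefficient.
coefficients = dct2([[1.0] * 8 for _ in range(8)])
```

With this normalization the basis is orthogonal, so the transform is invertible as $B = D^{T} Y D$ and a constant 8×8 block of ones maps to a single DC coefficient of 8.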

[0034] The zigzag arrangement is defined as follows. A fixed zigzag scanning order is predefined on the 8×8 coordinate grid, giving 64 coordinate pairs $(p_m, q_m)$, $m = 0, 1, \dots, 63$. These 64 coordinate pairs cover every position in the 8×8 matrix and trace a zigzag path from the top-left corner to the bottom-right corner, alternating direction on successive anti-diagonals.
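One common way to generate such a scanning order is sketched below in Python. It assumes the JPEG-style convention, in which the scan starts at the top-left corner and moves right first; the patent does not state which convention it uses, and the function name `zigzag_order` is illustrative.

```python
def zigzag_order(n=8):
    """List the (row, col) pairs of an n x n grid in zigzag order."""
    order = []
    for diagonal in range(2 * n - 1):
        # Rows that exist on the anti-diagonal where row + col == diagonal.
        rows = range(max(0, diagonal - n + 1), min(diagonal, n - 1) + 1)
        if diagonal % 2 == 0:
            rows = reversed(rows)  # even diagonals run bottom-left to top-right
        order.extend((row, diagonal - row) for row in rows)
    return order


def zigzag_flatten(matrix):
    """Rearrange a square matrix into a one-dimensional zigzag sequence."""
    return [matrix[row][col] for row, col in zigzag_order(len(matrix))]


order = zigzag_order()
```

Low-frequency coefficients sit near the top-left corner, so this ordering places them at the start of the length-64 sequence and the highest frequencies at the end.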

[0035] For each channel $c$ and each block position $(i, j)$, define the one-dimensional coefficient vector $z_{c,i,j}$ obtained after zigzag rearrangement. Its $m$-th element is given by $z_{c,i,j}(m) = Y_{c,i,j}(p_m, q_m)$, $m = 0, 1, \dots, 63$; that is, the $m$-th element of the one-dimensional sequence equals the element of the discrete cosine transform coefficient matrix $Y_{c,i,j}$ at the $m$-th position of the zigzag sequence, where $(p_m, q_m)$ denotes the position coordinates of the zigzag sequence. The channels are then reshaped: denoting the channel index of the final output feature map as $k$, the 64 discrete cosine transform coefficients of each block are arranged row by row to obtain the new total channel index $k = 64c + m$. Based on $k$, the frequency domain feature map is $F_f(k, i, j) = z_{c,i,j}(m)$, where $F_f \in \mathbb{R}^{64C \times \frac{H}{8} \times \frac{W}{8}}$ and $z_{c,i,j}(m)$ denotes the $m$-th element of the one-dimensional sequence.
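The channel reshaping step can be sketched as follows in Python. The sketch assumes zero-based indices and the mapping `k = 64 * c + m`, which is one natural reading of the description rather than a formula confirmed by the source; the function name `to_frequency_map` and the input layout are illustrative.

```python
def to_frequency_map(zigzag_vectors):
    """Stack per-block zigzag vectors into a frequency-domain feature map.

    zigzag_vectors[c][i][j] is the length-64 sequence for channel c and
    block (i, j). The result is indexed as freq[k][i][j], with the assumed
    channel mapping k = 64 * c + m.
    """
    channels = len(zigzag_vectors)
    rows, cols = len(zigzag_vectors[0]), len(zigzag_vectors[0][0])
    length = len(zigzag_vectors[0][0][0])
    freq = [[[0.0] * cols for _ in range(rows)]
            for _ in range(channels * length)]
    for c in range(channels):
        for i in range(rows):
            for j in range(cols):
                for m, value in enumerate(zigzag_vectors[c][i][j]):
                    freq[c * length + m][i][j] = value
    return freq


# Two input channels on a 2x3 block grid give 128 frequency channels.
vectors = [[[[float(m) for m in range(64)] for _ in range(3)]
            for _ in range(2)] for _ in range(2)]
freq = to_frequency_map(vectors)
```

Each output channel then holds one frequency component of one input channel, sampled once per block, so the spatial resolution drops by a factor of 8 while the channel count grows by a factor of 64.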

[0036] A multi-branch downsampling fusion network performs multi-branch downsampling and feature fusion on the obtained frequency domain feature map to obtain a multi-scale frequency domain feature map.

[0037] In this embodiment, the multi-branch downsampling fusion network operates as follows. 3×3 average pooling is applied to the frequency domain feature map $F_f$ while keeping the spatial size constant to obtain the feature map $F_p$: $F_p(c,h,w) = \frac{1}{9}\sum_{\Delta h=-1}^{1}\sum_{\Delta w=-1}^{1} F_f(c,\, h+\Delta h,\, w+\Delta w)$, where $F_p \in \mathbb{R}^{P_1 \times H \times W}$, $P_1$ denotes the number of input channels, $H$ the height, $W$ the width, $c$ the channel index, and $h$ and $w$ the position indices.
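A size-preserving 3×3 average pool can be sketched in pure Python for one channel. The border handling is an assumption: positions outside the map are treated as zeros and the divisor stays at 9, which matches a common default in deep learning libraries but is not specified in the text. The function name `avg_pool3x3_same` is illustrative.

```python
def avg_pool3x3_same(channel):
    """3x3 average pooling with stride 1 and zero padding of 1."""
    height, width = len(channel), len(channel[0])
    pooled = [[0.0] * width for _ in range(height)]
    for h in range(height):
        for w in range(width):
            total = 0.0
            for dh in (-1, 0, 1):
                for dw in (-1, 0, 1):
                    hh, ww = h + dh, w + dw
                    # Out-of-range neighbours contribute zero (assumed padding).
                    if 0 <= hh < height and 0 <= ww < width:
                        total += channel[hh][ww]
            pooled[h][w] = total / 9.0
    return pooled


smoothed = avg_pool3x3_same([[1.0] * 3 for _ in range(3)])
```

Under this padding assumption, border values are attenuated (a corner of an all-ones map becomes 4/9) while interior values are plain local means.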

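[0038] The feature map $F_p$ is divided into two parts along the channel dimension to obtain the feature maps $F_{p1}$ and $F_{p2}$, $[F_{p1}, F_{p2}] = \mathrm{Split}(F_p)$, where $F_{p1} \in \mathbb{R}^{C_{in1} \times H \times W}$ and $F_{p2} \in \mathbb{R}^{C_{in2} \times H \times W}$.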

[0039] The feature map $F_{p1}$ is downsampled by convolution to obtain the downsampled feature map $F_d = \mathrm{Conv}_{\downarrow}(F_{p1})$, where the convolution uses a 3×3 kernel, a stride of 2, and padding of 1, $C_{in1}$ denotes the number of input channels, and $C_2$ denotes the target number of channels.

[0040] The feature map $F_{p2}$ is max pooled to obtain the pooled feature map $F_m = \mathrm{MaxPool}(F_{p2})$, where the pooling kernel size is 3×3, the stride is 2, and the padding is 1.
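The pooling branch with a 3×3 kernel, a stride of 2, and padding of 1 can be sketched in pure Python for one channel. Padded positions are simply ignored when taking the maximum, which is equivalent to padding with negative infinity; that choice and the function name `max_pool3x3_stride2` are assumptions for illustration.

```python
def max_pool3x3_stride2(channel):
    """3x3 max pooling with stride 2 and padding 1 on a 2-D list."""
    height, width = len(channel), len(channel[0])
    out_h = (height + 2 * 1 - 3) // 2 + 1
    out_w = (width + 2 * 1 - 3) // 2 + 1
    pooled = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # Window centred at (2i, 2j); skip positions outside the map.
            window = [
                channel[h][w]
                for h in range(2 * i - 1, 2 * i + 2)
                for w in range(2 * j - 1, 2 * j + 2)
                if 0 <= h < height and 0 <= w < width
            ]
            row.append(max(window))
        pooled.append(row)
    return pooled


pooled = max_pool3x3_stride2([[4 * r + c for c in range(4)] for r in range(4)])
```

The output size formula is the same as for a 3×3, stride-2, padding-1 convolution, so the convolutional branch and the pooling branch produce maps of equal spatial size that can be concatenated channel-wise.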

[0041] The pooled feature map $F_m$ is convolved to obtain the convolutional feature map $F_c = W_2 * F_m + b_2$, where $C_{in2}$ denotes the number of input channels, $W_2$ the convolution kernel, and $b_2$ the bias term.

[0042] The downsampled feature map $F_d$ and the convolutional feature map $F_c$ are concatenated along the channel dimension to obtain the spliced feature map $F_{mc} = \mathrm{Concat}(F_d, F_c)$.

[0043] Channel integration is applied to the spliced feature map $F_{mc}$ to obtain the multi-scale frequency domain feature map $F_{ms}(o,h,w) = \sum_{i} W_d(o,i)\,F_{mc}(i,h,w) + b_d(o)$, where $o$ denotes the output channel index, $i$ the input channel index, $W_d$ the convolution kernel, and $b_d$ the bias term.

[0044] For each spatial location $(h, w)$, the 1×1 convolution linearly maps the input channel vector $F_{mc}(:, h, w)$ to the output channel vector $F_{ms}(:, h, w)$, which yields the multi-scale frequency domain feature map $F_{ms}$.
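That a 1×1 convolution is a per-pixel linear map across channels can be made concrete with a short Python sketch. The function name `conv1x1` and the nested-list layout (`feature[i][h][w]`, `weight[o][i]`, `bias[o]`) are illustrative assumptions.

```python
def conv1x1(feature, weight, bias):
    """Apply a 1x1 convolution: out[o][h][w] = bias[o] + sum_i weight[o][i] * feature[i][h][w]."""
    in_channels = len(feature)
    height, width = len(feature[0]), len(feature[0][0])
    return [
        [
            [
                bias[o] + sum(weight[o][i] * feature[i][h][w]
                              for i in range(in_channels))
                for w in range(width)
            ]
            for h in range(height)
        ]
        for o in range(len(weight))
    ]


# Two input channels on a 1x2 map, mixed into three output channels.
feature = [[[1.0, 2.0]], [[3.0, 4.0]]]
weight = [[1.0, 1.0], [2.0, 0.0], [0.0, -1.0]]
bias = [0.0, 1.0, 0.5]
mixed = conv1x1(feature, weight, bias)
```

No neighbouring pixels are involved, so the operation changes only the channel count and leaves the spatial size untouched.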

[0045] The feature fusion network adaptively fuses spatial domain feature maps and multi-scale frequency domain feature maps to obtain a final fused feature map, and then performs target detection based on the final fused feature map.

[0046] S3. Acquire the sonar image to be detected and input it into the target detection network for target detection to obtain the detection result.

[0047] In this embodiment, sonar images of 10 target types were acquired and preprocessed: sphere, circle cage, cube, cylinder, human body, metal bucket, plane, ROV, square cage, and tire. The image data were divided proportionally into a training set of 7200 images and a validation set of 1800 images. The initial learning rate was set to 0.001 and the number of epochs to 200, and the sonar target classification model was trained iteratively on the training set.
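The 7200 / 1800 division corresponds to an 80 / 20 split of 9000 images. A minimal Python sketch of such a split is shown below; the function name `split_dataset`, the fixed seed, and the plain random shuffle are assumptions, and a per-class (stratified) split, which the word "proportionally" may imply, is not shown.

```python
import random


def split_dataset(items, train_fraction=0.8, seed=0):
    """Shuffle items reproducibly and split them into training and validation lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# 9000 image identifiers split into 7200 training and 1800 validation items.
train_ids, val_ids = split_dataset(range(9000))
```

Fixing the seed keeps the partition reproducible across runs, which matters when validation metrics from different training runs are compared.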

[0048] As shown in Figure 4, the network performance was verified on the validation set. The network achieved a detection accuracy of 90.7% and a recall of 81.4%, demonstrating that the embodiments of this invention detect the 10 types of sonar targets with high accuracy.

[0049] As shown in Figure 5, the confusion matrix of the classification results for the 10 types of sonar targets, the average classification accuracy over the 10 types reaches 82.7%, indicating that this embodiment discriminates well between sonar target types.
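An average classification accuracy of this kind is commonly computed from a confusion matrix as the mean of the per-class accuracies. The Python sketch below assumes rows are true classes and columns are predicted classes; whether the reported 82.7% was computed exactly this way is not stated, and the function names and the toy matrix are illustrative.

```python
def per_class_accuracy(confusion):
    """Fraction of each true class (row) that was predicted correctly."""
    return [
        row[k] / sum(row) if sum(row) else 0.0
        for k, row in enumerate(confusion)
    ]


def mean_class_accuracy(confusion):
    """Unweighted mean of the per-class accuracies."""
    accuracies = per_class_accuracy(confusion)
    return sum(accuracies) / len(accuracies)


# Toy two-class example, not data from the patent.
toy_confusion = [[8, 2], [1, 9]]
toy_mean = mean_class_accuracy(toy_confusion)
```

Averaging per class rather than per sample keeps rare target types from being drowned out by frequent ones.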

[0050] Example 2. In this embodiment, a frequency-space collaborative sensing sonar image target detection system includes a data partitioning module, a network training module, and a detection module.

[0051] The data partitioning module acquires sonar images and preprocesses them using contrast-limited adaptive equalization, dividing them into training and validation sets. The network training module constructs an initial target detection network and trains it based on the training and validation sets to obtain the final target detection network. The detection module acquires the sonar images to be detected and inputs them into the target detection network for target detection, obtaining the detection results.

[0052] The embodiments described above are merely preferred embodiments of the present invention and are not intended to limit the scope of the present invention. Various modifications and improvements made to the technical solutions of the present invention by those skilled in the art without departing from the spirit of the present invention should fall within the protection scope defined by the claims of the present invention.

Claims

1. A method for target detection in sonar images using frequency-space collaborative sensing, characterized in that it includes the following steps: acquiring sonar images, preprocessing them using contrast-limited adaptive equalization, and dividing them into training and validation sets; constructing an initial target detection network and training it on the training and validation sets to obtain the target detection network; and acquiring the sonar image to be detected and inputting it into the target detection network to obtain the detection result.

2. The sonar image target detection method based on frequency-space collaborative sensing according to claim 1, characterized in that the target detection network includes: a spatial domain feature extraction network, which introduces a cross-stage partially connected feature fusion module to perform feature fusion and enhancement in four feature extraction stages and obtain a spatial domain feature map; a frequency domain feature extraction network, which uses a two-dimensional discrete cosine transform frequency domain feature extraction module to extract frequency domain feature maps from the preprocessed image; a multi-branch downsampling fusion network, which performs multi-branch downsampling and feature fusion on the frequency domain feature map to obtain a multi-scale frequency domain feature map; and a feature fusion network, which adaptively fuses the spatial domain feature map and the multi-scale frequency domain feature map into a final fused feature map on which target detection is performed.

3. The sonar image target detection method based on frequency-space collaborative sensing according to claim 1, characterized in that the preprocessing method includes: for the input sonar image $X \in \mathbb{R}^{C \times H \times W}$, obtaining the channels $X_c \in \mathbb{R}^{H \times W}$, where $C$ denotes the number of input channels, $H$ the height, $W$ the width, and $c \in \{0, \dots, C-1\}$ the channel index; performing a grayscale transformation on each channel $X_c$ to obtain the transformed data $\tilde{X}_c = \mathcal{T}(X_c; \alpha, \Omega)$, where $\mathcal{T}$ denotes the grayscale transformation operator, $\alpha$ the contrast limiting parameter, and $\Omega$ the block configuration; fusing the transformed data $\tilde{X}_c$ along the channel dimension to obtain the processed dataset $\tilde{X} = \mathrm{Concat}(\tilde{X}_0, \dots, \tilde{X}_{C-1})$; and dividing the processed dataset $\tilde{X}$ into training and validation sets, $\tilde{X} \to \{\mathcal{D}_{\mathrm{train}}, \mathcal{D}_{\mathrm{val}}\}$, where $\mathcal{D}_{\mathrm{train}}$ denotes the training set and $\mathcal{D}_{\mathrm{val}}$ the validation set.

4. The sonar image target detection method based on frequency-space collaborative sensing according to claim 2, characterized in that the cross-stage partially connected feature fusion module includes: passing the sonar image to be detected through two convolutional layers to obtain the feature map $F \in \mathbb{R}^{C \times H \times W}$, where $C$ denotes the number of input channels, $H$ the height, and $W$ the width; passing $F$ through a convolutional layer with a 1×1 kernel to obtain the convolutional feature map $F_1 = W_1 * F + b$, where $W_1$ denotes the convolution kernel, $C_1$ the number of output channels, and $b$ the bias term; dividing $F_1$ into two equal parts to obtain the feature maps $F_a$ and $F_b$, $[F_a, F_b] = \mathrm{Split}(F_1)$; for the feature map $F_a$, taking the output $O_a = F_a$; feeding the feature map $F_b$ into a backbone consisting of two Bottleneck modules connected in series, each having two branches, a main path and a residual link, where the main path outputs the feature map $F_{\mathrm{main}}$ and the residual link outputs the feature map $F_{\mathrm{res}}$; adding $F_{\mathrm{main}}$ and $F_{\mathrm{res}}$ element by element to obtain the feature map $O_b = F_{\mathrm{main}} + F_{\mathrm{res}}$; concatenating $O_a$ and $O_b$ along the channel dimension to obtain the spliced feature map $F_{\mathrm{cat}} = \mathrm{Concat}(O_a, O_b)$; and passing $F_{\mathrm{cat}}$ through a convolutional layer with a 1×1 kernel, a stride of 1, and padding of 0 to obtain the spatial domain feature map $F_s = W_s * F_{\mathrm{cat}}$, where $W_s$ denotes the convolution kernel.

5. The sonar image target detection method based on frequency-space collaborative sensing according to claim 2, characterized in that the two-dimensional discrete cosine transform frequency domain feature extraction module includes: dividing the input sonar image to be detected $X \in \mathbb{R}^{C \times H \times W}$ into blocks with an 8×8 sliding window and a stride of 8, each block being 8×8 in size, to obtain the block-wise feature map $B_{c,i,j}(u,v) = X(c,\, 8i+u,\, 8j+v)$, where $C$ denotes the number of input channels, $H$ the height, $W$ the width, $B_{c,i,j}(u,v)$ an element of each block's feature map, $X(c, 8i+u, 8j+v)$ the corresponding element of the original feature map, $c$ the channel index, $i$ and $j$ the block coordinates, and $u$ and $v$ the coordinates within the block; performing a two-dimensional discrete cosine transform on the block-wise feature map to obtain the discrete cosine transform coefficient matrix $Y_{c,i,j} = D\,B_{c,i,j}\,D^{T}$, where $D$ denotes the basis matrix of the one-dimensional discrete cosine transform and $T$ the transpose of a matrix; zigzag-scanning the coefficient matrix $Y_{c,i,j}$ of each block and rearranging it into a one-dimensional sequence of length 64, $z_{c,i,j}(m) = Y_{c,i,j}(p_m, q_m)$, $m = 0, 1, \dots, 63$, where $z_{c,i,j}(m)$ denotes the $m$-th element of the one-dimensional sequence and $(p_m, q_m)$ the position coordinates of the zigzag sequence; reshaping the channels by arranging the 64 discrete cosine transform coefficients of each block row by row to obtain the new total channel index $k = 64c + m$; and, based on the new total channel index $k$, obtaining the frequency domain feature map $F_f(k, i, j) = z_{c,i,j}(m)$, where $F_f \in \mathbb{R}^{64C \times \frac{H}{8} \times \frac{W}{8}}$ and $z_{c,i,j}(m)$ denotes the $m$-th element of the one-dimensional sequence.

6. The sonar image target detection method based on frequency-space collaborative sensing according to claim 2, characterized in that the multi-branch downsampling fusion network includes: performing 3×3 average pooling on the frequency domain feature map $F_f$ to obtain the feature map $F_p$, $F_p(c,h,w) = \frac{1}{9}\sum_{\Delta h=-1}^{1}\sum_{\Delta w=-1}^{1} F_f(c,\, h+\Delta h,\, w+\Delta w)$, where $F_p \in \mathbb{R}^{P_1 \times H \times W}$, $P_1$ denotes the number of input channels, $H$ the height, $W$ the width, $c$ the channel index, and $h$ and $w$ the position indices; dividing $F_p$ into two parts along the channel dimension to obtain the feature maps $F_{p1}$ and $F_{p2}$, $[F_{p1}, F_{p2}] = \mathrm{Split}(F_p)$, where $F_{p1} \in \mathbb{R}^{C_{in1} \times H \times W}$ and $F_{p2} \in \mathbb{R}^{C_{in2} \times H \times W}$; downsampling $F_{p1}$ by convolution to obtain the downsampled feature map $F_d = \mathrm{Conv}_{\downarrow}(F_{p1})$, where $C_{in1}$ denotes the number of input channels; max pooling $F_{p2}$ to obtain the pooled feature map $F_m = \mathrm{MaxPool}(F_{p2})$; convolving $F_m$ to obtain the convolutional feature map $F_c = W_2 * F_m + b_2$, where $C_{in2}$ denotes the number of input channels, $W_2$ the convolution kernel, and $b_2$ the bias term; concatenating $F_d$ and $F_c$ along the channel dimension to obtain the spliced feature map $F_{mc} = \mathrm{Concat}(F_d, F_c)$; and integrating the channels of $F_{mc}$ to obtain the multi-scale frequency domain feature map $F_{ms}(o,h,w) = \sum_{i} W_d(o,i)\,F_{mc}(i,h,w) + b_d(o)$, where $o \in \{0, \dots, C_2-1\}$ denotes the output channel index, $C_2$ the target number of channels, $i$ the input channel index, $W_d$ the convolution kernel, and $b_d$ the bias term.

7. A frequency-space collaborative sensing sonar image target detection system, wherein the system applies the method described in any one of claims 1-6, characterized in that the system comprises a data partitioning module, a network training module, and a detection module; the data partitioning module is used to acquire sonar images, preprocess them using contrast-limited adaptive equalization, and divide them into training and validation sets; the network training module is used to construct an initial target detection network and train it on the training and validation sets to obtain the target detection network; and the detection module is used to acquire the sonar image to be detected and input it into the target detection network to obtain the detection result.