A side-scan sonar target detection method and apparatus based on adaptive frequency domain sanitization and multi-scale context fusion

By improving the YOLOv11 model and combining adaptive frequency domain sanitization with multi-scale context fusion, the invention addresses missed detections of weak targets and high false alarm rates in side-scan sonar imagery of complex underwater environments, achieving efficient and accurate target detection.

CN122307528A · Pending · Publication Date: 2026-06-30 · GUANGDONG UNIV OF TECH

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
GUANGDONG UNIV OF TECH
Filing Date
2026-04-27
Publication Date
2026-06-30

Abstract

This invention discloses a side-scan sonar target detection method and apparatus based on adaptive frequency domain sanitization and multi-scale context fusion, belonging to the field of side-scan sonar target detection technology. The method proposes an improved seabed detection model based on YOLOv11. An efficient unified perception module and a dynamic multi-scale frequency domain fusion module are introduced into the backbone network; they enhance context awareness through multi-scale dilated convolution and adaptively suppress high-frequency noise through frequency domain transformation. The neck network employs a coordinate attention mechanism to strengthen spatial location information and combines a dynamic upsampler with a cross-feature guided fusion strategy to improve multi-scale feature fusion. Experimental results show that the method improves target detection accuracy and robustness in complex seabed environments, with particularly strong recognition of small targets.

Description

Technical Field

[0001] This invention relates to the field of side-scan sonar target detection technology, and in particular to a side-scan sonar target detection method and apparatus based on adaptive frequency domain sanitization and multi-scale context fusion.

Background Technology

[0002] With the increasing global demand for marine resource development, environmental protection, and national security, efficient and accurate underwater sensing and detection technologies have become indispensable in the field of marine science and engineering. Among the many underwater detection technologies, acoustic detection dominates due to its unique advantages of long-distance underwater propagation and minimal impact from light and water turbidity.

[0003] Side-scan sonar, as a core device for underwater imaging, is widely used in shipwreck search and rescue, subsea pipeline inspection, and marine topographic mapping, among other fields. Its real-time and high-resolution characteristics have significant practical value. However, in the complex and ever-changing underwater environment, side-scan sonar images are inevitably severely affected by speckle noise, reverberation, and multipath effects, and targets exhibit significant scale differences and background texture blurring. Existing technologies mostly rely on traditional spatial domain filtering or direct transfer of general optical target detection models. The former struggles to preserve target edge details while suppressing acoustic noise, while the latter lacks targeted optimization for the spectral characteristics and multi-scale contextual information of sonar images, leading to frequent missed detections of weak targets and high false alarm rates against complex backgrounds in actual detection missions.

Summary of the Invention

[0004] The purpose of this invention is to provide a side-scan sonar target detection method and apparatus based on adaptive frequency domain sanitization and multi-scale context fusion, aiming to solve or improve at least one of the above-mentioned technical problems.

[0005] To achieve the above objective, the present invention provides the following solution. A side-scan sonar target detection method based on adaptive frequency domain sanitization and multi-scale context fusion includes:
obtaining an image dataset of seabed targets scanned by side-scan sonar and dividing it into a training set and a validation set;
constructing an improved seabed detection model based on the YOLOv11 model, wherein, in the backbone network, an efficient unified perception module replaces the original C3k2 module and a dynamic multi-scale frequency domain fusion module replaces the original C2PSA module; in the neck network, input features are processed by a coordinate attention (CA) module and convolutional layers; in the upsampling path, after upsampling by the dynamic upsampler DySample, the features are split so that one path goes through the CA module and the other retains the original upsampled features; in layers P3 and P4, the processed input features are weighted and fused with the upsampled features processed by the CA module, and then fused with the original upsampled features from DySample through a cross-feature guided fusion module; finally, the outputs of each layer are input into the C3k2 module;
training the seabed detection model using the training set and the validation set to generate the final seabed detection model;
inputting the image to be tested into the final seabed detection model and outputting the predicted target bounding boxes of the side-scan sonar image.

[0006] Furthermore, the efficient unified perception module operates as follows: the input features are first processed by a CBS module and split into two paths by a Split operation. One path retains the original features, while the other is processed according to the state of the use_csp parameter. When use_csp=True, the features are fed into two cascaded cross-stage context feature aggregation modules, and the outputs of the two modules are concatenated with the original features by a Concat operation; when use_csp=False, the features are fed into two cascaded pyramid dilated context-aware attention modules, and the outputs of the two modules are concatenated by a Concat operation. The final features are output through a CBS module.

[0007] Furthermore, the execution flow of the cross-stage context feature aggregation module is as follows: the input features are first processed by a CBS module and split into two paths by a Split operation. One path retains the original features, while the other passes through a CBS module and then a pyramid dilated context-aware attention module. The two outputs are fused by a Concat operation and output as the final features through a CBS module.

[0008] Furthermore, the execution flow of the pyramid dilated context-aware attention module is as follows. The input feature map first passes through a convolution that reduces the channel dimension, yielding intermediate features. The intermediate features are fed into a multi-branch dilated convolution structure that gradually expands the receptive field, expressed as: [equation not reproduced], whose terms are the intermediate feature maps in the expansion path, the rectified linear unit (ReLU) activation function, the dilated convolution operation of each branch, the kernel size, and the dilation rate of the dilated convolution. A symmetric reverse contraction path then fuses features from the large receptive field with features from the small receptive field to generate aggregated features, expressed as: [equation not reproduced], whose terms are the aggregated features and the intermediate feature maps within the contraction path. After a convolution restores the channel dimension, the aggregated features are fed into the efficient multi-scale attention (EMA) module, and the result is combined with the original input features through a residual connection to generate the final output features.

[0009] Furthermore, the execution flow of the dynamic multi-scale frequency domain fusion module is as follows: The input features are initially processed by the CBS module and split into two paths through a Split operation. One path retains the original features, while the other path is processed through two cascaded spatial-frequency attention refinement blocks. The two output features are then fused through a Concat operation and output as the final features by the CBS module.

[0010] Furthermore, the execution flow of the spatial-frequency attention refinement block is as follows: the input features are processed by the Attention module and combined with the original input features through a residual connection to generate intermediate features. The intermediate features are then processed by the efficient cross-spectral feature aggregator and combined with the intermediate features through a second residual connection to output the final feature map. The execution flow of the efficient cross-spectral feature aggregator is as follows: a feedforward network performs a feature transformation along the channel dimension to generate intermediate features, expressed as: [equation not reproduced], whose terms are the intermediate features, the input features, the instance normalization operation, and the GELU activation function. The intermediate features are fed into the dynamic window frequency modulation unit, which outputs frequency domain enhanced features. The original input features are then fused with the frequency domain enhanced features using learnable residual weights to generate the final output of the aggregator, expressed as: [equation not reproduced], whose terms are the final output of the efficient cross-spectral feature aggregator, the learnable residual weights, the original input features, and the frequency domain enhanced features.

[0011] Furthermore, the execution flow of the dynamic window frequency modulation unit is as follows: based on the input features, weights are adaptively assigned to K predefined windows of different sizes, expressed as: [equation not reproduced], whose terms are the weight allocation vector, the Softmax activation function, and the global average pooling operation. The input features are simultaneously fed into K parallel window frequency domain modulation branches; each branch processes them independently and outputs reconstructed features. The reconstructed features of all branches are aggregated by an element-wise sum weighted by the weight allocation vector to output the frequency domain enhanced features, expressed as: [equation not reproduced], whose terms are the frequency domain enhanced features output by the dynamic window frequency modulation unit, the weight of the k-th window, the number of windows K, and the reconstructed features of the k-th window frequency domain modulation branch.

[0012] Furthermore, the execution flow of the window frequency domain modulation branch is as follows: the input features are converted to the frequency domain by a multi-scale window spectral transform to generate a spectral representation, expressed as: [equation not reproduced], whose terms are the spectral representation, the two-dimensional real fast Fourier transform, the operation that partitions the padded feature map into non-overlapping windows, the window size of the k-th branch, and the reflection padding operation. The spectral representation is multiplied element-wise by a learnable complex frequency domain filter, and the result is transformed back to the spatial domain by an inverse Fourier transform. After the padding is removed, the reconstructed feature map is obtained, expressed as: [equation not reproduced], whose terms are the reconstructed feature map, the inverse Fourier transform, the cropping operation, and the learnable complex frequency domain filter.

[0013] Furthermore, the execution flow of the cross-feature guided fusion module is as follows: the two input feature maps are concatenated along the channel dimension to generate a wide feature map containing joint information. This wide feature map is input into the CA module to generate a joint attention map, expressed as: [equation not reproduced], whose terms are the joint attention map, the CA module, the high-level feature map after upsampling and alignment, and the low-level features after preprocessing. The joint attention map is split along the channel dimension to obtain guided weight maps for the high-level and low-level features respectively, expressed as: [equation not reproduced], whose terms are the high-level guided weight map, the low-level guided weight map, and the splitting operation. Based on the guided weight maps, cross-guided fusion is performed to generate the final fused feature map, expressed as: [equation not reproduced], whose result is the fused feature map output by the cross-feature guided fusion module.

[0014] According to the specific embodiments provided by the present invention, the invention achieves the following technical effects. The disclosed side-scan sonar target detection method and apparatus, based on adaptive frequency domain sanitization and multi-scale context fusion, use an efficient unified perception module that employs pyramid dilated convolution and multi-scale attention to enhance context awareness and adapt to targets of varying sizes. A dynamic multi-scale frequency domain fusion module transforms features to the frequency domain and adaptively suppresses high-frequency noise components, sanitizing the deep feature maps. In addition, a coordinate attention mechanism addresses the problem of weak target features being submerged in deep networks.

Attached Figure Description

[0015] To more clearly illustrate the technical solutions in the embodiments of the present invention or the prior art, the drawings used in the embodiments will be briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0016] Figure 1 is a schematic flowchart of the method of the present invention;
Figure 2 is a schematic diagram of the structure of the seabed detection model in this embodiment;
Figure 3 is a schematic diagram of the structure of the efficient unified perception module in this embodiment;
Figure 4 is a schematic diagram of the cross-stage context feature aggregation module in this embodiment;
Figure 5 is a schematic diagram of the pyramid dilated context-aware attention module in this embodiment;
Figure 6 is a schematic diagram of the structure of the efficient multi-scale attention (EMA) module in this embodiment;
Figure 7 is a schematic diagram of the dynamic multi-scale frequency domain fusion module in this embodiment;
Figure 8 is a schematic diagram of the spatial-frequency attention refinement block in this embodiment;
Figure 9 is a schematic diagram of the structure of the efficient cross-spectral feature aggregator in this embodiment;
Figure 10 is a schematic diagram of the dynamic window frequency modulation unit in this embodiment;
Figure 11 is a schematic diagram of the window frequency domain modulation branch in this embodiment;
Figure 12 is a schematic diagram of the cross-feature guided fusion module in this embodiment;
Figure 13 is a schematic diagram of the coordinate attention (CA) module in this embodiment;
Figure 14 is a schematic diagram of the side-scan sonar target detection apparatus in this embodiment.
In the figures, 101 is the mother ship; 102 is the winch; 103 is the tow cable; 104 is the side-scan sonar towfish; 105 is the seabed target; 106 is the deck unit; 107 is the data processing terminal; 108 is the seabed; 109 is the high-frequency acoustic pulse; and 110 is the acoustic echo signal.

Detailed Implementation

[0017] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.

[0018] The purpose of this invention is to provide a side-scan sonar target detection method and apparatus based on adaptive frequency domain sanitization and multi-scale context fusion, aiming to solve or improve at least one of the above-mentioned technical problems.

[0019] To make the above-mentioned objects, features and advantages of the present invention more apparent and understandable, the present invention will be further described in detail below with reference to the accompanying drawings and specific embodiments.

[0020] As shown in Figure 1, this invention provides a side-scan sonar target detection method based on adaptive frequency domain sanitization and multi-scale context fusion, comprising: Step 1: Obtain an image dataset of seabed targets scanned by side-scan sonar, and divide it into a training set and a validation set. In this embodiment, images are annotated with the open-source tool labelImg; each annotation records the category of the seabed target and the coordinates of the top-left and bottom-right corners of its bounding box. The annotation files are in XML format.
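
As an illustration of Step 1, the following minimal sketch reads one annotation file and splits a dataset. It assumes that labelImg wrote its usual Pascal VOC style XML (object/name plus bndbox with xmin, ymin, xmax, ymax) and uses the 0.9:0.1 split given later in the embodiment. The sample annotation, the class name "shipwreck", and the helper names are hypothetical.

```python
import random
import xml.etree.ElementTree as ET


def parse_voc_xml(xml_text):
    """Return a list of (category, xmin, ymin, xmax, ymax) from one annotation."""
    root = ET.fromstring(xml_text)
    boxes = []
    for obj in root.iter("object"):
        name = obj.findtext("name")
        bb = obj.find("bndbox")
        # (xmin, ymin) is the top-left corner, (xmax, ymax) the bottom-right.
        coords = [int(float(bb.findtext(tag)))
                  for tag in ("xmin", "ymin", "xmax", "ymax")]
        boxes.append((name, *coords))
    return boxes


def split_dataset(items, train_ratio=0.9, seed=0):
    """Shuffle deterministically and return (training set, validation set)."""
    items = list(items)
    random.Random(seed).shuffle(items)
    cut = round(len(items) * train_ratio)
    return items[:cut], items[cut:]


# Hypothetical annotation, for illustration only (not data from the source).
SAMPLE = """<annotation>
  <filename>sonar_0001.png</filename>
  <object>
    <name>shipwreck</name>
    <bndbox><xmin>34</xmin><ymin>50</ymin><xmax>120</xmax><ymax>98</ymax></bndbox>
  </object>
</annotation>"""

print(parse_voc_xml(SAMPLE))
train_ids, val_ids = split_dataset(range(100))
print(len(train_ids), len(val_ids))
```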

[0021] Step 2: Based on the YOLOv11 model, an improved seabed detection model is constructed. Specifically, in the backbone network, an efficient unified perception module replaces the original C3k2 module, and a dynamic multi-scale frequency domain fusion module replaces the original C2PSA module. In the neck network, input features are processed by a coordinate attention (CA) module and convolutional layers. In the upsampling path, after upsampling by the dynamic upsampler DySample, the features are split: one path goes through the CA module, and the other retains the original upsampled features. In layers P3 and P4, the processed input features are weighted and fused with the upsampled features processed by the CA module, and then fused with the original upsampled features from DySample through a cross-feature guided fusion module. Finally, the outputs of each layer are input into the C3k2 module. As shown in Figure 2, the execution flow of the seabed detection model is as follows: a side-scan sonar image of fixed size [size not reproduced] is input into the backbone network. The image data stream first passes through a series of convolutional layers and efficient unified perception modules, which perform progressive downsampling and deep feature extraction. The backbone network generates feature maps in multiple stages, of which two intermediate feature maps (the P3 and P4 levels) are extracted [scales not reproduced]. At the end of the backbone network, the highest-level features are processed by the spatial pyramid pooling (SPPF) module and then fed into the dynamic multi-scale frequency domain fusion module for deep fusion, ultimately generating a high-level semantic feature map (the P5 level) [scale not reproduced].

[0022] The three feature maps at different scales are fed into the neck network and fused along a top-down feature fusion path. First, each feature map undergoes channel optimization by a coordinate attention (CA) module and feature transformation by convolutional layers. In the top-down fusion process, the high-level P5 feature map is first upsampled by the lightweight dynamic upsampler (DySample) and then fused with the P4 feature map through element-wise multiplication and the cross-feature guided fusion module. The same fusion process is applied on the path from P4 to P3. This design integrates and transfers high-level semantic information and low-level spatial information. After each fusion stage, the feature map passes through a C3k2 module that further refines the feature representation.

[0023] Finally, the adaptive multi-scale context fusion network outputs enhanced feature maps at three scales [symbols and scales not reproduced]. These three enhanced feature maps are fed simultaneously into the detection head, which uses them to predict the bounding boxes and categories of targets of different sizes in parallel.
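
The concrete scales in paragraphs [0021] to [0023] did not survive in this text. As a rough illustration only, the sketch below computes the spatial sizes of the P3, P4, and P5 feature maps under the assumption of a 640×640 input and the strides 8, 16, and 32 used by stock YOLO-family detectors. Both the input size and the function name are assumptions, not values taken from the source.

```python
def feature_map_sizes(input_size=640, strides=(8, 16, 32)):
    """Map each pyramid level to its (height, width) for a square input.

    input_size and strides are assumed defaults, not values from the source.
    """
    sizes = {}
    for level, stride in zip(("P3", "P4", "P5"), strides):
        sizes[level] = (input_size // stride, input_size // stride)
    return sizes


print(feature_map_sizes())
```

Under these assumptions the finest level (P3) has the most cells and therefore carries most of the small-target predictions.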

[0024] As shown in Figure 3, the efficient unified perception module first processes the input features with a CBS (convolution, batch normalization, activation) module and then splits them into two paths by a Split operation. One path retains the original features, while the other is processed according to the state of the use_csp parameter. When use_csp=True, the features are fed into two cascaded cross-stage context feature aggregation modules, and the outputs of the two modules are concatenated with the original features by a Concat operation; when use_csp=False, the features are fed into two cascaded pyramid dilated context-aware attention modules, and the outputs of the two modules are concatenated by a Concat operation. The final features are output through a CBS module.

[0025] In this embodiment, the use_csp parameters of the four efficient unified perception modules in the backbone network are set, in order, to False, False, True, and True.

[0026] As shown in Figure 4, the execution flow of the cross-stage context feature aggregation module is as follows: the input features are first processed by a CBS module and split into two paths by a Split operation. One path retains the original features, while the other passes through a CBS module and then a pyramid dilated context-aware attention module. The two outputs are fused by a Concat operation and output as the final features through a CBS module.

[0027] As shown in Figure 5, the execution flow of the pyramid dilated context-aware attention module is as follows. The input feature map first passes through a convolution for channel reduction, governed by a dimensionality reduction factor, to obtain intermediate features. The intermediate features are fed into a multi-branch dilated convolution structure that gradually expands the receptive field, expressed as: [equation not reproduced]. The terms of this expression are: the intermediate feature maps in the expansion path; the rectified linear unit (ReLU) activation function, which introduces nonlinearity and enhances the expressive power of the model; the dilated convolution operation of each branch, which extracts features without increasing the number of parameters or the computational cost; the kernel size; and the dilation rate of the dilated convolution, which controls the size of the receptive field (a dilation rate of 1 is equivalent to standard convolution). The residual connections and progressively increasing dilation rates aggregate contextual information from the local to the mid-range. By covering the receptive field more completely, they mitigate the gridding effect commonly found in standard dilated convolutions.
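
The effect of a progressively increasing dilation rate can be checked with a short calculation. The sketch below assumes stride-1 dilated convolutions applied in series with a 3×3 kernel and dilation rates 1, 2, 3; these values and the function names are illustrative assumptions, since the actual kernel size and rates are not legible in this text.

```python
def effective_kernel(kernel_size, dilation):
    """Span covered by one dilated kernel; dilation 1 gives the plain kernel."""
    return dilation * (kernel_size - 1) + 1


def receptive_field(kernel_size, dilation_rates):
    """Receptive field of stride-1 dilated convolutions applied in series."""
    rf = 1
    for d in dilation_rates:
        # Each layer widens the field by (effective kernel - 1) pixels.
        rf += effective_kernel(kernel_size, d) - 1
    return rf


for rates in ([1], [1, 2], [1, 2, 3]):
    print(rates, receptive_field(3, rates))
```

Each kernel still holds the same number of weights regardless of its dilation rate, which is why the receptive field grows without additional parameters.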

[0028] A symmetric reverse contraction path fuses features from the large receptive field with features from the small receptive field to generate aggregated features, expressed as: [equation not reproduced], whose terms are the aggregated features and the intermediate feature maps within the contraction path. This fusion of large and small receptive fields ensures that information from different receptive fields can interact and complement each other.

[0029] After a convolution restores the channel dimension, the aggregated features are processed by the efficient multi-scale attention (EMA) module and combined with the original input features through a residual connection to generate the final output features, expressed as: [equation not reproduced], whose terms are the final output features and the original input features.

[0030] As shown in Figure 6, the processing flow of the EMA module is as follows: the input feature map is first divided into non-overlapping sub-feature groups, each holding an equal share of the channels.

[0031] This grouping decomposes the global attention computation into multiple local subtasks, which reduces computational complexity.

[0032] Each sub-feature group undergoes independent spatial feature extraction: a 3×3 convolution captures local spatial dependencies and produces the output features of that group.

[0033] Cross-group information fusion is achieved through a dual-path mechanism. Spatial attention path: for each group of features, global average pooling (YAvgPool) is performed along the channel dimension to obtain a spatial description vector; the vectors of all groups are concatenated and fused by a convolution to generate the spatial attention map, expressed as: [equation not reproduced], whose terms are the spatial attention map and the Sigmoid function. Channel attention path: for each group of features, global average pooling (XAvgPool) is performed along the spatial dimension to obtain a channel description vector; after normalization with GroupNorm, the inter-group correlation weights are computed with Softmax, expressed as: [equation not reproduced], whose term is the inter-group correlation weight. The Softmax function normalizes the input vector into a probability distribution so that the inter-group correlation weights sum to 1, thereby quantifying the importance of the different feature groups. Group normalization normalizes the channel dimension of each feature group, which reduces internal covariate shift and improves the stability and convergence speed of model training. Feature enhancement is achieved through weighted fusion, expressed as: [equation not reproduced], whose terms are the output features, the input feature map, the inter-group correlation weight of the i-th feature group, and the output of the i-th sub-feature group after the 3×3 convolution.

[0034] As shown in Figure 7, the execution flow of the dynamic multi-scale frequency domain fusion module is as follows: the input features are first processed by a CBS module and split into two paths by a Split operation. One path retains the original features, while the other is processed by two cascaded spatial-frequency attention refinement blocks. The two outputs are fused by a Concat operation and output as the final features through a CBS module.

[0035] As shown in Figure 8, the execution flow of the spatial-frequency attention refinement block is as follows: the input features are processed by the Attention module and combined with the original input features through a residual connection to generate intermediate features. These intermediate features are then processed by the efficient cross-spectral feature aggregator and combined with the intermediate features through a second residual connection to output the final feature map, expressed as: [equation not reproduced], whose terms are the final feature map output by the spatial-frequency attention refinement block, the efficient cross-spectral feature aggregator, the input feature map, and the Attention operation.
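
The two residual connections described in paragraph [0035] can be written compactly. The following is a reconstruction from the prose, not the original formula; the symbols X (input feature map), X' (intermediate features), Y (final feature map), and ECFA (efficient cross-spectral feature aggregator) are assumed names.

```latex
\begin{aligned}
X' &= X + \operatorname{Attention}(X), \\
Y  &= X' + \operatorname{ECFA}(X').
\end{aligned}
```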

[0036] As shown in Figure 9, the execution flow of the efficient cross-spectral feature aggregator is as follows: a feedforward network (FFN) performs a feature transformation along the channel dimension to enhance representational power and generate intermediate features, expressed as: [equation not reproduced], whose terms are the intermediate features, the input features, the instance normalization operation, and the GELU activation function. The intermediate features are fed into the dynamic window frequency modulation unit, which outputs frequency domain enhanced features. The original input features are then fused with the frequency domain enhanced features using learnable residual weights to generate the final output of the aggregator, expressed as: [equation not reproduced], whose terms are the final output of the efficient cross-spectral feature aggregator, the learnable residual weights, the original input features, and the frequency domain enhanced features.

[0037] The adaptive residual connection with learnable residual weights lets the model dynamically balance the original spatial information against the frequency-domain sanitized information, suppressing noise while preserving the structural information that is crucial for target detection.

[0038] As shown in Figure 10, the execution flow of the dynamic window frequency modulation unit is as follows: based on the input features, weights are adaptively assigned to K predefined windows of different sizes, expressed as: [equation not reproduced], whose terms are the weight allocation vector, the Softmax activation function, and the global average pooling operation. The input features are simultaneously fed into K parallel window frequency domain modulation branches; each branch processes them independently and outputs reconstructed features. The reconstructed features of all branches are aggregated by an element-wise sum weighted by the weight allocation vector to output the frequency domain enhanced features, expressed as: [equation not reproduced], whose terms are the frequency domain enhanced features output by the dynamic window frequency modulation unit, the weight of the k-th window, the number of windows K, and the reconstructed features of the k-th window frequency domain modulation branch.
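
A minimal sketch of the weighting step follows. It reads the text as "pool the input, turn the pooled vector into K logits, apply Softmax, then take the weighted sum of the K branch outputs". The projection matrix `proj` that maps pooled channels to K logits is a hypothetical stand-in for whatever learned layer the model uses, and features are flattened to one list of values per channel for brevity.

```python
import math


def softmax(logits):
    """Numerically stable softmax; the result sums to 1."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def global_average_pool(x):
    """x: list of channels, each a flat list of values -> one mean per channel."""
    return [sum(ch) / len(ch) for ch in x]


def window_weights(x, proj):
    """proj: hypothetical K x C matrix turning pooled channels into K logits."""
    pooled = global_average_pool(x)
    logits = [sum(p * w for p, w in zip(pooled, row)) for row in proj]
    return softmax(logits)


def fuse_branches(weights, branch_outputs):
    """Weighted element-wise sum of the K branch outputs (each shaped like x)."""
    n_ch = len(branch_outputs[0])
    n_px = len(branch_outputs[0][0])
    return [[sum(w * y[c][i] for w, y in zip(weights, branch_outputs))
             for i in range(n_px)]
            for c in range(n_ch)]


x = [[1.0, 3.0], [2.0, 2.0]]            # 2 channels, 2 pixels each
proj = [[0.0, 0.0], [0.0, 0.0]]         # K = 2 windows, untrained (all zeros)
w = window_weights(x, proj)
branches = [x, [[3.0, 1.0], [0.0, 0.0]]]  # stand-ins for two branch outputs
print(w, fuse_branches(w, branches))
```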

[0039] As shown in Figure 11, the execution flow of the window frequency domain modulation branch is as follows: the input features are converted to the frequency domain by a multi-scale window spectral transform to generate a spectral representation, expressed as: [equation not reproduced], whose terms are the spectral representation, the two-dimensional real fast Fourier transform, the operation that partitions the padded feature map into non-overlapping windows, the window size of the k-th branch, and the reflection padding operation, which is used to reduce boundary effects.

[0040] The spectral representation is multiplied element-wise by a learnable complex frequency domain filter, and the result is transformed back to the spatial domain by an inverse Fourier transform. After the padding is removed, the reconstructed feature map is obtained, expressed as: [equation not reproduced], whose terms are the reconstructed feature map, the inverse Fourier transform, the cropping operation, and the learnable complex frequency domain filter.
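
The pad, partition, transform, filter, invert, and crop sequence of one branch can be sketched with NumPy on a single-channel map. This is an illustration under stated assumptions rather than the patented implementation: padding is applied only to the bottom and right edges, the filter is a fixed array instead of a learned parameter, and the function name is hypothetical.

```python
import numpy as np


def window_frequency_branch(x, window, filt=None):
    """Filter a (H, W) real map window by window in the frequency domain.

    filt: complex array of shape (window, window // 2 + 1); defaults to an
    all-ones (identity) filter. In the model this filter would be learnable.
    """
    h, w = x.shape
    pad_h, pad_w = (-h) % window, (-w) % window
    # Reflection padding so both sides become multiples of the window size.
    xp = np.pad(x, ((0, pad_h), (0, pad_w)), mode="reflect")
    hp, wp = xp.shape
    # Partition into non-overlapping windows: (rows, cols, window, window).
    wins = xp.reshape(hp // window, window, wp // window, window)
    wins = wins.transpose(0, 2, 1, 3)
    spec = np.fft.rfft2(wins, axes=(-2, -1))      # 2-D real FFT per window
    if filt is None:
        filt = np.ones((window, window // 2 + 1), dtype=complex)
    spec = spec * filt                            # element-wise modulation
    rec = np.fft.irfft2(spec, s=(window, window), axes=(-2, -1))
    rec = rec.transpose(0, 2, 1, 3).reshape(hp, wp)
    return rec[:h, :w]                            # crop the padding away


x = np.arange(30, dtype=float).reshape(5, 6)
y = window_frequency_branch(x, window=4)
print(np.allclose(y, x))
```

With the identity filter the branch reproduces its input, which is a convenient sanity check before any filter coefficients are trained.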

[0041] As shown in Figure 12, the execution flow of the cross-feature guided fusion module is as follows: the two input feature maps are concatenated along the channel dimension to generate a wide feature map containing joint information. This wide feature map is input into the CA module to generate a joint attention map, expressed as: [equation not reproduced], whose terms are the joint attention map, the CA module, the high-level feature map after upsampling and alignment, and the low-level features after preprocessing. The joint attention map is split along the channel dimension to obtain guided weight maps for the high-level and low-level features respectively, expressed as: [equation not reproduced], whose terms are the high-level guided weight map, which contains contextual information derived from high-level semantics and can be used to guide the low-level features; the low-level guided weight map, which contains information derived from low-level details and can be used to optimize the high-level features; and the splitting operation. Based on the guided weight maps, cross-guided fusion is performed to generate the final fused feature map, expressed as: [equation not reproduced], whose result is the fused feature map output by the cross-feature guided fusion module.
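
One plausible reading of the three steps in paragraph [0041] is given below. The first two lines follow directly from the prose; the last line is an assumption about what "cross-guided" means (each feature map is weighted by the map derived from the other level), since the original expression is not legible. The symbols F_h, F_l, A, W_h, W_l, and F_out are assumed names.

```latex
\begin{aligned}
A &= \operatorname{CA}\bigl(\operatorname{Concat}(F_h, F_l)\bigr), \\
(W_h, W_l) &= \operatorname{Split}(A), \\
F_{out} &= F_l \odot W_h + F_h \odot W_l .
\end{aligned}
```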

[0042] As shown in Figure 13, the CA module executes as follows. The input feature map $X$ is global-average-pooled along the horizontal direction (X axis) and the vertical direction (Y axis) respectively, and the two pooled results are concatenated to give the concatenated features $z$. The concatenated features pass through a convolution, batch normalization, and an activation function to generate the intermediate features, $f = \delta\big(\mathrm{BN}(\mathrm{Conv}(z))\big)$, where $f$ is the intermediate feature; $\delta$ is the activation function; $\mathrm{BN}$ is batch normalization; and $\mathrm{Conv}$ is the convolution. The intermediate features are then split into $f^h$ and $f^w$, each of which passes through a convolution and the Sigmoid function $\sigma$ to generate attention weights; these weights are combined with the original input features $X$ through a residual connection to obtain the final output features $Y$ of the CA module.
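
The CA module flow can be approximated by the NumPy sketch below. Several points are assumptions rather than statements of the source: the convolutions are treated as 1×1 convolutions (channel-wise matrix products), batch normalization is omitted, ReLU is used as the activation function, and the attention weights are applied multiplicatively to the input, as in the commonly used coordinate attention formulation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def coordinate_attention(x, w_reduce, w_h, w_w):
    """Sketch of a coordinate attention (CA) block.

    x:        input feature map, shape (C, H, W)
    w_reduce: weights of the shared convolution, shape (M, C)   (assumed 1x1)
    w_h, w_w: weights of the two branch convolutions, shape (C, M)
    """
    c, h, w = x.shape
    pooled_h = x.mean(axis=2)   # average over width  -> (C, H)
    pooled_w = x.mean(axis=1)   # average over height -> (C, W)
    # Concatenate the two pooled descriptors along the spatial axis.
    z = np.concatenate([pooled_h, pooled_w], axis=1)      # (C, H + W)
    # Shared convolution + activation (batch norm omitted, ReLU assumed).
    f = np.maximum(w_reduce @ z, 0.0)                     # (M, H + W)
    # Split back into the two directional parts.
    f_h, f_w = f[:, :h], f[:, h:]
    # Branch convolution + Sigmoid give the directional attention weights.
    a_h = sigmoid(w_h @ f_h)[:, :, None]                  # (C, H, 1)
    a_w = sigmoid(w_w @ f_w)[:, None, :]                  # (C, 1, W)
    # ASSUMED combination: multiplicative re-weighting of the input.
    return x * a_h * a_w
```

With all weights set to zero both Sigmoid outputs equal 0.5, so the block scales its input by 0.25; this gives a quick check that the two directional weights broadcast over the correct axes.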

[0043] Step 3: Train the seabed detection model using the training and validation sets to generate the final seabed detection model. The model is trained from scratch for 100 epochs without pre-trained weights, and the loss function is the same as that of the original YOLOv11. The network training parameters are: learning rate lr = 0.01, batch size = 64, a training-set to validation-set split of 0.9:0.1, and the SGD optimizer.
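
The training setup of Step 3 can be summarized in a short sketch. The hyperparameter values come from the paragraph above; the dictionary key names (chosen to resemble Ultralytics-style training arguments), the helper `split_dataset`, and the fixed shuffle seed are assumptions for illustration only.

```python
import random

# Hyperparameters stated in Step 3; the key names are an assumption.
TRAIN_CONFIG = {
    "epochs": 100,        # trained from scratch for 100 epochs
    "batch": 64,          # batch size
    "lr0": 0.01,          # initial learning rate
    "optimizer": "SGD",
    "pretrained": False,  # no pre-trained weights
}

def split_dataset(items, train_ratio=0.9, seed=0):
    """Shuffle the samples and split them 0.9:0.1 into train and validation."""
    items = list(items)
    random.Random(seed).shuffle(items)   # seeded for reproducibility (assumption)
    n_train = round(len(items) * train_ratio)
    return items[:n_train], items[n_train:]
```

For a dataset of 1000 images this yields 900 training samples and 100 validation samples, with no sample appearing in both sets.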

[0044] Step 4: Using the final seabed detection model, input the test image and output the predicted target bounding box of the side-scan sonar image.

[0045] The image to be tested is resized to the network input size and fed into the trained seabed detection model for forward inference. The backbone and neck networks of the model extract multi-scale features, and the detection head finally outputs three feature maps at different scales.

[0046] Subsequently, the output of the detection head is decoded and post-processed. Feature reorganization: the classification and regression prediction tensors at each scale are extracted, concatenated, and permuted so that the channel dimension is moved to the end, giving a unified prediction format for both the category prediction branch and the bounding box (BBox) prediction branch.
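
The feature reorganization step can be sketched with NumPy as below. The function name `reorganize` and the small grid sizes in the usage note are hypothetical; the sketch only shows the flatten, concatenate and permute pattern described above, applied to one branch (classification or regression) at a time.

```python
import numpy as np

def reorganize(per_scale):
    """Merge per-scale predictions into one channel-last tensor.

    per_scale: list of arrays of shape (B, C, H_i, W_i), one per scale
    returns:   array of shape (B, sum_i H_i * W_i, C)
    """
    # Flatten the spatial grid of every scale: (B, C, H_i * W_i).
    flat = [p.reshape(p.shape[0], p.shape[1], -1) for p in per_scale]
    # Concatenate all scales along the flattened spatial axis.
    merged = np.concatenate(flat, axis=2)
    # Move the channel dimension to the end.
    return merged.transpose(0, 2, 1)
```

For example, three hypothetical grids of 8×8, 4×4 and 2×2 cells with 3 channels each produce a tensor of shape (1, 84, 3), one row per candidate location.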

[0047] Confidence filtering and preliminary screening: the target confidence of each prediction box is calculated, predictions whose confidence falls below the preset threshold (conf=0.001) are removed, and the remaining predictions are sorted in descending order of confidence.

[0048] Non-maximum suppression: a non-maximum suppression algorithm based on intersection over union (IoU) (threshold iou=0.6) is used to remove redundant overlapping boxes, and the maximum number of detections per image is limited (max_per_img=300).
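
The confidence filtering and non-maximum suppression steps can be sketched in plain Python as follows. The thresholds are the ones given above; the function names, the class-agnostic suppression, and the choice to keep scores equal to the threshold are assumptions of this sketch.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def postprocess(boxes, scores, conf=0.001, iou_thr=0.6, max_per_img=300):
    """Return indices of the boxes kept after filtering and NMS."""
    # Confidence filtering, then sorting in descending order of confidence.
    order = sorted(
        (i for i, s in enumerate(scores) if s >= conf),
        key=lambda i: scores[i],
        reverse=True,
    )
    keep = []
    for i in order:
        # Keep a box only if it does not overlap a kept box too strongly.
        if all(iou(boxes[i], boxes[j]) <= iou_thr for j in keep):
            keep.append(i)
            if len(keep) == max_per_img:   # cap on detections per image
                break
    return keep
```

In this sketch a box is suppressed only when its IoU with an already kept, higher-scoring box exceeds 0.6, so partially overlapping targets below that threshold are both retained.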

[0049] Coordinate restoration and visualization: the coordinates of the retained bounding boxes are mapped back from the network input scale to the original image scale to obtain normalized coordinates, and the bounding boxes are then drawn on the image. If the final output contains at least one detection box, it is determined that a seabed target exists in the image; otherwise, it is determined that no target was detected.
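
The coordinate restoration and the final decision can be sketched as below. The sketch assumes the test image was resized directly to the network input size (no letterbox padding), and the helper names and the 640×640 input size used in the usage note are hypothetical.

```python
def restore_box(box, input_size, orig_size):
    """Map a box from network-input pixels back to the original image.

    box:        (x1, y1, x2, y2) at the network input scale
    input_size: (width, height) of the network input
    orig_size:  (width, height) of the original image
    returns:    (pixel_box, normalized_box) in the original image
    """
    in_w, in_h = input_size
    orig_w, orig_h = orig_size
    sx, sy = orig_w / in_w, orig_h / in_h   # plain-resize scale factors (assumption)
    x1, y1, x2, y2 = box
    # Scale to the original image and clamp to its borders.
    pixel = (
        min(max(x1 * sx, 0.0), orig_w),
        min(max(y1 * sy, 0.0), orig_h),
        min(max(x2 * sx, 0.0), orig_w),
        min(max(y2 * sy, 0.0), orig_h),
    )
    normalized = (
        pixel[0] / orig_w, pixel[1] / orig_h,
        pixel[2] / orig_w, pixel[3] / orig_h,
    )
    return pixel, normalized

def has_seabed_target(detections):
    """A target is reported when at least one detection box remains."""
    return len(detections) > 0
```

As a usage example, a box (64, 64, 128, 192) predicted on a hypothetical 640×640 input maps to (128, 96, 256, 288) in a 1280×960 original image, with normalized coordinates of approximately (0.1, 0.1, 0.2, 0.3).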

[0050] As shown in Figure 14, in one embodiment, the present invention provides a side-scan sonar target detection device, including: a mother ship 101, a winch 102, a tow cable 103, a side-scan sonar towed fish 104, a seabed target 105, a deck unit 106, and a data processing terminal 107.

[0051] The execution process is as follows: The mother ship 101 uses the winch 102 on its deck to control the length of the tow cable 103, towing the side-scan sonar towed fish 104 so that it travels at a predetermined depth underwater. The side-scan sonar towed fish 104 emits high-frequency acoustic pulses 109 towards both sides of the seabed 108. The acoustic beams cover the seabed area and illuminate the seabed target 105, generating an acoustic echo signal 110 containing information on the target's scattering intensity and acoustic characteristics. The acoustic echo signal 110 is transmitted back in real time via the tow cable 103 to the deck unit 106 located on the deck of the mother ship 101. The deck unit 106 acts as a signal relay and conversion interface: it demodulates the received signal, performs analog-to-digital conversion, and sends the digitized sonar image data to the data processing terminal 107 through a data transmission interface.

[0052] The various embodiments in this specification are described in a progressive manner, with each embodiment focusing on the differences from other embodiments. The same or similar parts between the various embodiments can be referred to each other.

[0053] This document uses specific examples to illustrate the principles and implementation methods of the present invention. The descriptions of the above embodiments are only for the purpose of helping to understand the core ideas of the present invention. Furthermore, those skilled in the art will recognize that, based on the ideas of the present invention, there will be changes in the specific implementation methods and application scope. Therefore, the content of this specification should not be construed as a limitation of the present invention.

Claims

1. A side-scan sonar target detection method based on adaptive frequency domain sanitization and multi-scale context fusion, characterized in that it comprises: obtaining a dataset of side-scan sonar images of seabed targets and dividing it into a training set and a validation set; constructing an improved seabed detection model based on the YOLOv11 model, wherein, in the backbone network, an efficient unified perception module replaces the original C3K2 module and a dynamic multi-scale frequency domain fusion module replaces the original C2PSA module; in the neck network, the input features are processed by a coordinate attention (CA) module and convolutional layers; in the upsampling path, after upsampling by the dynamic upsampler DySample, the features are split into two paths, one passing through the CA module and the other retaining the original upsampled features; at layers P3 and P4, the processed input features are weighted and fused with the upsampled features processed by the CA module, and the result is then fused with the original upsampled features from DySample through a cross-feature guided fusion module; and finally the outputs of each layer are input into the C3K2 module; training the seabed detection model using the training set and the validation set to generate the final seabed detection model; and inputting the image to be tested into the final seabed detection model and outputting the predicted target bounding boxes of the side-scan sonar image.

2. The side-scan sonar target detection method based on adaptive frequency domain sanitization and multi-scale context fusion according to claim 1, characterized in that the efficient unified perception module operates as follows: the input features are initially processed by the CBS module and split into two paths through the Split operation; one path retains the original features, while the other path follows a processing route selected by the state of the use_csp parameter; when use_csp=True, the features are input into two cascaded cross-stage context feature aggregation modules, and the outputs of the two modules are concatenated with the original features through the Concat operation; when use_csp=False, the features are input into two cascaded pyramid hole context-aware attention modules, and the outputs of the two modules are concatenated through the Concat operation; the final features are output through the CBS module.

3. The side-scan sonar target detection method based on adaptive frequency domain sanitization and multi-scale context fusion according to claim 2, characterized in that the execution flow of the cross-stage context feature aggregation module is as follows: the input features are initially processed by the CBS module and split into two paths through a Split operation; one path retains the original features, while the other path is input into the CBS module and the pyramid hole context-aware attention module in sequence; the two output features are then fused through a Concat operation and output as the final features through the CBS module.

4. The side-scan sonar target detection method based on adaptive frequency domain sanitization and multi-scale context fusion according to claim 2, characterized in that the execution flow of the pyramid hole context-aware attention module includes: the input feature map $X$ first passes through a convolution that reduces the channel dimension to obtain the intermediate features $F_0$; the intermediate features $F_0$ are fed into a multi-branch dilated convolution structure that gradually expands the receptive field, in which the i-th dilated convolution operation, with kernel size $k$ and dilation rate $d_i$, followed by the linear rectification (ReLU) activation function, produces the intermediate feature maps $E_i$ of the expansion path; a symmetrical reverse contraction path then fuses the features from the large receptive field with the features from the small receptive field to generate the aggregated features $F_{agg}$, with $C_i$ denoting the intermediate feature maps of the contraction path; the aggregated features pass through a convolution that restores the channel dimension and are then input into the efficient multi-scale attention module EMA, whose output is combined with the original input features through a residual connection to generate the final output features.

5. The side-scan sonar target detection method based on adaptive frequency domain sanitization and multi-scale context fusion according to claim 1, characterized in that the execution flow of the dynamic multi-scale frequency domain fusion module is as follows: the input features are initially processed by the CBS module and split into two paths through a Split operation; one path retains the original features, while the other path is processed through two cascaded spatial-frequency attention refinement blocks; the two output features are then fused through a Concat operation and output as the final features by the CBS module.

6. The side-scan sonar target detection method based on adaptive frequency domain sanitization and multi-scale context fusion according to claim 5, characterized in that the execution flow of the spatial-frequency attention refinement block is as follows: the input features are processed by the Attention module and combined with the original input features through a residual connection to generate intermediate features; the intermediate features are then processed by the efficient cross-spectral feature aggregator and combined with the intermediate features through a residual connection to output the final feature map. The execution flow of the efficient cross-spectral feature aggregator is as follows: a feedforward network performs a channel-dimension feature transformation on the input features $X$, using the instance normalization operation and the GELU activation function, to generate the intermediate features $F_{mid}$; the intermediate features $F_{mid}$ are input into the dynamic window frequency modulation unit, which outputs the frequency domain enhancement features $F_{freq}$; the original input features $X$ are then fused with the frequency domain enhancement features $F_{freq}$ through the learnable residual weights $\gamma$ to generate the final output $Y$ of the efficient cross-spectral feature aggregator.

7. The side-scan sonar target detection method based on adaptive frequency domain sanitization and multi-scale context fusion according to claim 6, characterized in that the execution flow of the dynamic window frequency modulation unit is as follows: based on the input features $X$, weights are adaptively assigned to $K$ predefined windows of different sizes, the weight assignment vector $\alpha$ being obtained from the input features using the global average pooling operation and the Softmax activation function; the input features $X$ are simultaneously fed into $K$ parallel window frequency domain modulation branches, each of which processes them independently and outputs reconstructed features $\hat{X}_k$; all reconstructed features are summed element-wise to generate aggregated features, which are then multiplied element-wise with the weight assignment vector to output the frequency domain enhancement features $F_{freq}$ of the dynamic window frequency modulation unit, where $\alpha_k$ is the weight of the k-th window; $K$ is the number of windows; and $\hat{X}_k$ is the reconstructed feature of the k-th window frequency domain modulation branch.

8. The side-scan sonar target detection method based on adaptive frequency domain sanitization and multi-scale context fusion according to claim 7, characterized in that the execution flow of the window frequency domain modulation branch is as follows: the input features $X$ are converted to the frequency domain by a multi-scale windowed spectral transform, producing the spectral representation $S_k = \mathcal{F}\big(\mathcal{P}(\mathrm{Pad}(X),\, w_k)\big)$, where $S_k$ is the spectral representation; $\mathcal{F}$ is the two-dimensional real fast Fourier transform; $\mathcal{P}(\cdot,\, w_k)$ divides the padded feature map into multiple non-overlapping $w_k \times w_k$ windows; $w_k$ is the window size of the k-th branch; and $\mathrm{Pad}$ is the reflection padding operation; the spectral representation $S_k$ is multiplied element-wise with a learnable complex frequency domain filter $G_k$, the result is transformed back to the spatial domain by an inverse Fourier transform, and, after the padding is removed, the reconstructed feature map is obtained as $\hat{X}_k = \mathrm{Crop}\big(\mathcal{F}^{-1}(S_k \odot G_k)\big)$, where $\hat{X}_k$ is the reconstructed feature map; $\mathcal{F}^{-1}$ is the inverse Fourier transform; $\mathrm{Crop}$ is the cropping operation; and $G_k$ is the learnable complex frequency domain filter.

9. The side-scan sonar target detection method based on adaptive frequency domain sanitization and multi-scale context fusion according to claim 1, characterized in that the execution flow of the cross-feature guided fusion module is as follows: the two input feature maps are concatenated along the channel dimension to generate a wide feature map containing joint information, and this wide feature map is input into the CA module to generate a joint attention map, $A = \mathrm{CA}\big(\mathrm{Concat}(F_h,\, F_l)\big)$, where $A$ is the joint attention map; $\mathrm{CA}$ is the CA module; $F_h$ is the high-level feature map after upsampling and alignment; and $F_l$ is the low-level feature map after preprocessing; the joint attention map is split along the channel dimension to obtain the guided weight maps corresponding to the high-level and low-level features, $(W_h,\, W_l) = \mathrm{Split}(A)$, where $W_h$ is the high-level guided weight map; $W_l$ is the low-level guided weight map; and $\mathrm{Split}$ is the splitting operation; based on the guided weight maps, cross-guided fusion is performed to generate the final fused feature map $F_{out}$, which is the output of the cross-feature guided fusion module.

10. A side-scan sonar target detection device applying the detection method according to any one of claims 1-9, characterized in that it comprises: a mother ship (101), a winch (102), a tow cable (103), a side-scan sonar towed fish (104), a seabed target (105), a deck unit (106) and a data processing terminal (107). The execution process is as follows: the mother ship (101) uses the winch (102) on its deck to control the length of the tow cable (103), towing the side-scan sonar towed fish (104) so that it travels at a predetermined depth underwater. The side-scan sonar towed fish (104) emits high-frequency acoustic pulses (109) towards both sides of the seabed (108). The acoustic beams cover the seabed area and illuminate the seabed target (105), generating an acoustic echo signal (110) containing target scattering intensity information and acoustic shadow characteristics. The acoustic echo signal (110) is transmitted back in real time via the tow cable (103) to the deck unit (106) located on the deck of the mother ship (101). The deck unit (106) serves as a signal relay and conversion interface: it demodulates the received signal, performs analog-to-digital conversion, and sends the digitized sonar image data to the data processing terminal (107) through the data transmission interface.