A SAR ship image target detection method based on double-path cooperative improved YOLOv11
By improving the YOLOv11 model and combining it with the iSimAM module, NA-ASFFHead, and SpdBlock, the problems of local and global feature co-modeling and noise suppression in SAR ship detection were solved, achieving more efficient multi-scale feature fusion and noise suppression, and improving the robustness and accuracy of detection.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- CHONGQING COLLEGE OF FINANCE ECONOMICS
- Filing Date
- 2026-05-13
- Publication Date
- 2026-06-19
AI Technical Summary
Existing SAR ship detection methods have shortcomings in co-modeling of local and global features, noise suppression, and multi-scale feature fusion, which reduces the effectiveness and reliability of detection tasks, especially in complex sea conditions and multi-scale scenarios where false negatives and false positives are prone to occur.
We adopt a YOLOv11 model based on dual-path collaborative improvement, combining the iSimAM module, NA-ASFFHead module and SpdBlock. By introducing learnable gating units to dynamically balance the roles of iRMB and SimAM, we achieve collaborative enhancement of local details and global context. Furthermore, we suppress noise interference through adaptive spatial feature fusion and noise estimation subnetworks, thereby optimizing feature representation and detection performance.
It significantly improves the robustness and accuracy of SAR ship detection, effectively suppresses speckle noise, enhances the detection capability of small targets, reduces false alarms, and improves detection consistency and accuracy in multi-scale scenarios.
Figure CN122244637A_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the field of SAR ship image target detection, and specifically to a SAR ship image target detection method based on a dual-path collaboratively improved YOLOv11.
Background Technology
[0002] Synthetic Aperture Radar (SAR) provides all-weather, day-and-night Earth observation and can acquire high-resolution remote sensing images under complex weather conditions. Unlike optical imaging, SAR signals penetrate clouds and fog, which makes SAR valuable for marine monitoring, disaster response, and target reconnaissance. However, in SAR ship detection, speckle noise produced by the imaging mechanism blurs target edges and degrades features. In near-shore scenes in particular, land, islands, and ships often exhibit similar strong scattering characteristics, which easily causes false detections. Ship targets also vary widely in scale, from small fishing boats occupying a single pixel to large cargo ships occupying tens of pixels, which makes multi-scale detection more challenging. Traditional detection techniques based on Constant False Alarm Rate (CFAR) rely heavily on expert experience and hand-crafted features, so the target features cannot be guaranteed to remain effective and robustness is poor. These techniques suit simple, single-scene SAR images but perform poorly in complex maritime conditions. Deep learning networks, by contrast, can learn invariant features from large datasets quickly and accurately, removing the time-consuming step of manual feature extraction; the resulting models are robust and less affected by human factors. CNN-based target detection algorithms therefore have inherent advantages over traditional methods for ship detection in SAR images. By detection paradigm, they are divided into two-stage detectors and single-stage detectors. Single-stage detectors use a fully convolutional network to perform classification and anchor-box regression in a single pass to generate detection results.
Tang et al., building on YOLOv7-tiny, introduced deformable convolution and the BiFormer attention mechanism, proposing an adaptive feature recognition method based on BiFormer attention, and designed a loss function with a dynamic non-monotonic focusing mechanism to improve the detection accuracy of small near-shore vessels. Sun et al. added an angle classification structure to the head network of BiFA-YOLO and embedded bidirectional information interaction to aggregate multi-scale features efficiently, strengthening multi-scale representation. Guan et al. designed a shuffle reparameterization module based on YOLOv8 and combined it with a hybrid attention module, improving feature extraction for small targets. These studies laid the foundation for enhancing backbone networks with attention mechanisms and showed the key role of attention modules in improving SAR ship detection performance.

In recent years, attention mechanisms have evolved from channel attention to spatial attention and then to parameter-free attention, with designs moving toward efficient, lightweight implementations. SimAM (Simple, Parameter-Free Attention Module) models the attention mechanism described in neuroscience through an energy function and assigns a weight to each neuron without additional parameters, performing well in multiple visual tasks. Ren et al., addressing complex models and heavy computational demands, proposed an efficient lightweight network, YOLO-Lite, for SAR ship detection. They designed a lightweight feature enhancement backbone network, embedded channel and position enhancement attention modules, and customized an enhanced spatial pyramid pooling module to counter the loss of positional information for small SAR ships in high-level features. Xu et al. proposed an attention-based YOLOv8 algorithm with a simple parameter-free attention module, enabling the network to emphasize key features in the image automatically, strengthening the representation of target regions and suppressing background interference, thereby improving detection accuracy. Ning et al. adopted the SimAM attention mechanism to address the limitations and added computational cost of embedding attention mechanisms into convolutional neural networks, enabling efficient extraction of ship positions from high-resolution SAR images. However, SimAM's focus on global context tends to overlook local textures and details, which reduces the accuracy of small object detection.

To improve the accuracy of small object detection, many scholars have introduced the Inverted Residual Mobile Block (iRMB) and designed new attention mechanisms around it. Yu et al. used a lightweight iRMB in the backbone network to reduce network parameters and shorten detection time, constructing feature maps with multi-scale pyramids to detect targets of different sizes. Pan H et al. introduced iRMB into EMO, reducing the number of model parameters, and introduced an enhanced feature extraction module, SCConv-C3, which uses spatial and channel reconstruction convolutions to eliminate channel and spatial redundancy in the image while enhancing feature representation. Wang J et al. fused iRMB with EMA to create an iEMA module that enhances feature representation and context modeling, improving the accuracy of detecting small boats obscured by speckle noise and feature blurring [13].

The aforementioned research laid the foundation for introducing an attention mechanism into iRMB to improve detection performance. However, the local modeling characteristics of iRMB make it difficult to capture the long-range spatial dependencies required for large targets, and iRMB lacks a dedicated suppression mechanism for speckle noise. Both are significant limitations in SAR images, which are noisy and multi-scale. This invention therefore combines the parameter-free global attention of SimAM with the local convolutional features of iRMB so that the two complement each other.

For multi-scale feature fusion, adaptive spatial feature fusion (ASFF) alleviates semantic conflicts and gradient inconsistencies in feature pyramids by learning spatial weights for feature maps at different levels. Xu et al. applied the ASFF module to the YOLO detection head, proposing MC-ASFF-ShipYOLO, which introduced a Monte Carlo attention module and combined it with ASFF to achieve dynamic fusion of cross-scale features, improving the consistency of multi-scale ship detection [14]. Hong et al. replaced the YOLOv8 backbone network with HGNetV2 and introduced ASFF into the YOLOv8 detection head, which improved mAP50 while keeping the number of parameters low [15].

Introducing adaptive spatial feature fusion into SAR ship detection therefore helps to enhance the consistency of multi-scale feature representation while maintaining computational efficiency, improving overall detection performance. However, the shortcomings of existing methods in co-modeling of local and global features, noise suppression, and multi-scale feature fusion reduce the effectiveness and reliability of SAR ship detection tasks.
Summary of the Invention
[0003] This invention proposes a SAR ship image target detection method based on a dual-path collaboratively improved YOLOv11. First, ship images are acquired and an iSimAM-YOLO model is constructed. The iSimAM-YOLO model includes an iSimAM module, an NA-ASFFHead module, and a SpdBlock module.

The iSimAM module includes an iRMB module, a SimAM module, and a gating unit. The inverted residual iRMB module is a hybrid network module that combines depthwise separable convolution with a self-attention mechanism. For the input image, the iRMB module first expands the channel dimension with a multilayer perceptron (MLP) whose output-to-input channel ratio is $\lambda$:

$$X_e = \mathrm{MLP}_e(X) \in \mathbb{R}^{\lambda C \times H \times W}$$

where $X \in \mathbb{R}^{C \times H \times W}$ is the input feature, $X_e$ is the output, and $C$, $H$, and $W$ are the channels, height, and width of the image. $\mathrm{MLP}_e$ is the expansion MLP, implemented as a single $1 \times 1$ convolution. An efficient operator $F$, here MHSA, is then applied to the output to further enhance the image features:

$$X_f = F(X_e) \in \mathbb{R}^{\lambda C \times H \times W}$$

Next, another MLP reduces the dimension to obtain $X_s = \mathrm{MLP}_s(X_f) \in \mathbb{R}^{C \times H \times W}$, which brings the number of channels back down to $C$. The final residual connection output is:

$$Y = X + X_s \in \mathbb{R}^{C \times H \times W}$$

The core of the SimAM module is its 3D attention weight generation mechanism, which directly assigns independent 3D weights to neurons across the channel, height, and width dimensions of the feature map.

The iSimAM module dynamically fuses iRMB and SimAM through the gating unit, combining the local multi-scale extraction of iRMB with the global noise suppression of SimAM to achieve synergistic enhancement of local details and global context. The gating fusion mechanism is designed as follows. Let the input features be $X \in \mathbb{R}^{C \times H \times W}$. GAP denotes global average pooling, which aggregates global spatial information from the entire feature map, and FC denotes a fully connected layer. Features $F_{\mathrm{iRMB}}$ and $F_{\mathrm{SimAM}}$ are obtained via the iRMB path and the SimAM path, respectively. The gating unit first performs global average pooling on the input and then generates the fusion weight $\alpha$ for the two paths through a lightweight fully connected layer:

$$\alpha = \sigma(\mathrm{FC}(\mathrm{GAP}(X)))$$

where $\sigma$ is the Sigmoid function, so that $\alpha \in (0, 1)$. The final output is:

$$Y = \alpha \cdot F_{\mathrm{iRMB}} + (1 - \alpha) \cdot F_{\mathrm{SimAM}}$$

The NA-ASFFHead module uses adaptive spatial feature fusion (ASFF) to resize multi-scale features to the same size, so that each feature layer serves target detection at its corresponding scale. The fused feature is computed by multiplying the features from different layers by the weight parameters $\alpha$, $\beta$, and $\gamma$ and adding the results:

$$y_{ij}^{l} = \alpha_{ij}^{l} \cdot x_{ij}^{1 \to l} + \beta_{ij}^{l} \cdot x_{ij}^{2 \to l} + \gamma_{ij}^{l} \cdot x_{ij}^{3 \to l}$$

where $y_{ij}^{l}$ is the new fused feature, $(i, j)$ are the pixel coordinates, $l$ is the ASFF feature layer, $x_{ij}^{n \to l}$ are the feature values of the different layers resized to layer $l$, and $\alpha_{ij}^{l}$, $\beta_{ij}^{l}$, and $\gamma_{ij}^{l}$ are the weight parameters of the different layers. The weight parameters are obtained by passing the resized feature maps through a $1 \times 1$ convolution to produce $\lambda_{\alpha}$, $\lambda_{\beta}$, and $\lambda_{\gamma}$, concatenating the results, and normalizing them to values between 0 and 1 with softmax, so that the three weight parameters sum to 1:

$$\alpha_{ij}^{l} = \frac{e^{\lambda_{\alpha,ij}^{l}}}{e^{\lambda_{\alpha,ij}^{l}} + e^{\lambda_{\beta,ij}^{l}} + e^{\lambda_{\gamma,ij}^{l}}}$$

$$\beta_{ij}^{l} = \frac{e^{\lambda_{\beta,ij}^{l}}}{e^{\lambda_{\alpha,ij}^{l}} + e^{\lambda_{\beta,ij}^{l}} + e^{\lambda_{\gamma,ij}^{l}}}$$

$$\gamma_{ij}^{l} = \frac{e^{\lambda_{\gamma,ij}^{l}}}{e^{\lambda_{\alpha,ij}^{l}} + e^{\lambda_{\beta,ij}^{l}} + e^{\lambda_{\gamma,ij}^{l}}}$$

$$\alpha_{ij}^{l} + \beta_{ij}^{l} + \gamma_{ij}^{l} = 1, \quad \alpha_{ij}^{l}, \beta_{ij}^{l}, \gamma_{ij}^{l} \in [0, 1]$$

The method further includes connecting a noise estimation subnetwork in parallel for each scale feature in ASFF. The subnetwork outputs a single-channel confidence map with the same spatial size as the feature map, representing the confidence of the feature at each location. This confidence map is multiplied element-wise with the initial weight map to obtain the adjusted weights, thereby dynamically gating the weights before fusion. The noise estimation subnetwork consists of two convolutional layers; as each convolutional kernel slides across the feature map, it computes a weighted sum of the pixels in the neighborhood of the current location.
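The confidence-gated weighting described above can be illustrated with a small NumPy sketch. It is not the patented implementation: the function name `noise_aware_fusion` is hypothetical, the per-level logits stand in for the outputs of the weight-generating convolutions, the confidence maps stand in for the outputs of the noise estimation subnetwork, and a second softmax after the multiplication is assumed as the final normalization step.

```python
import numpy as np


def softmax(z, axis=0):
    """Numerically stable softmax along `axis`."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def noise_aware_fusion(features, weight_logits, confidences):
    """Fuse three same-sized feature maps with confidence-gated weights.

    features:      (3, C, H, W) feature maps already resized to one scale
    weight_logits: (3, H, W) per-level logits (assumed stand-in for conv outputs)
    confidences:   (3, H, W) confidence maps in [0, 1] (assumed stand-in for
                   the noise estimation subnetwork outputs)
    """
    initial = softmax(weight_logits, axis=0)   # initial weights, sum to 1 per pixel
    modulated = initial * confidences          # element-wise confidence gating
    final = softmax(modulated, axis=0)         # renormalise so weights sum to 1 again
    fused = (final[:, None, :, :] * features).sum(axis=0)
    return fused, final


rng = np.random.default_rng(0)
feats = rng.normal(size=(3, 4, 6, 6))
logits = rng.normal(size=(3, 6, 6))
conf = np.ones((3, 6, 6))
conf[0] = 0.0                                  # treat level 0 as noisy everywhere
fused, weights = noise_aware_fusion(feats, logits, conf)
```

In this sketch the gated level always ends up with the smallest weight at each pixel. Because the modulated weights lie in [0, 1], the second softmax preserves their ordering but does not drive a suppressed level to exactly zero.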
The SpdBlock module consists of a space-to-depth (SPD) layer and a convolutional layer with a stride of 1, preserving fine-grained feature information during feature map downsampling. The SPD layer downsamples the feature map as follows: given an intermediate feature map $X$ of size $S \times S \times C_1$, the SPD layer divides it into a series of sub-feature maps through a slicing operation:

$$f_{x,y} = X[x : S : scale,\; y : S : scale], \quad x, y \in \{0, 1, \dots, scale - 1\}$$

where $(x, y)$ is the index of the sub-feature map.

To address the shortcomings of existing methods in collaborative modeling of local and global features, noise suppression, and multi-scale feature fusion, this invention proposes a SAR ship image target detection model (iSimAM-YOLO) based on a dual-path collaboratively improved YOLOv11. First, the model adopts a gated dual-path collaborative enhancement architecture: learnable gating units dynamically balance the roles of iRMB and SimAM, achieving adaptive fusion of local details and global context and enhancing target features while suppressing speckle noise. Second, SpdBlock replaces traditional strided-convolution downsampling to better preserve detail and reduce pixel loss for small targets. Third, an adaptive spatial feature fusion detection head dynamically learns the spatial weights of multi-scale features to alleviate semantic conflicts between levels of the feature pyramid. To improve noise resistance, a parallel noise estimation subnetwork is added whose output is multiplied with the initial weights, suppressing noise interference and improving the model's robustness to multi-scale variations and noise. Finally, the algorithm was trained and validated on the public HRSID dataset and compared with YOLOv8-SimAM, YOLOv8-BiFPN, YOLOv11, ACYOLO, YOLOv11-SimAM, and others, which verified the effectiveness and robustness of the proposed algorithm.
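Two of the mechanisms summarized above, the gated dual-path fusion and the space-to-depth slicing, can be sketched in NumPy as follows. This is an illustrative sketch under stated assumptions, not the patented implementation. The gate is assumed to be a single scalar of the form sigmoid(FC(GAP(X))), with the two paths weighted by the gate and its complement. `simam` follows the commonly published SimAM energy formulation with an assumed regularizer `lam`. The local path is a caller-supplied stand-in for iRMB. All function names are hypothetical.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def simam(x, lam=1e-4):
    """Parameter-free SimAM-style weighting for x of shape (C, H, W)."""
    n = x.shape[1] * x.shape[2] - 1                 # needs H * W > 1
    d = (x - x.mean(axis=(1, 2), keepdims=True)) ** 2
    var = d.sum(axis=(1, 2), keepdims=True) / n     # per-channel variance
    inv_energy = d / (4.0 * (var + lam)) + 0.5      # one value per neuron
    return x * sigmoid(inv_energy)


def gated_fusion(x, local_branch, fc_weight, fc_bias):
    """Assumed gate: alpha = sigmoid(FC(GAP(x))); blend local and SimAM paths."""
    pooled = x.mean(axis=(1, 2))                    # GAP -> (C,)
    alpha = sigmoid(fc_weight @ pooled + fc_bias)   # scalar gate in (0, 1)
    out = alpha * local_branch(x) + (1.0 - alpha) * simam(x)
    return out, alpha


def space_to_depth(x, scale=2):
    """Slice (C, H, W) into scale**2 sub-maps and stack them on the channel axis."""
    c, h, w = x.shape
    assert h % scale == 0 and w % scale == 0
    subs = [x[:, i::scale, j::scale] for i in range(scale) for j in range(scale)]
    return np.concatenate(subs, axis=0)             # (scale**2 * C, H/scale, W/scale)


x = np.arange(2 * 4 * 4, dtype=float).reshape(2, 4, 4)
y, alpha = gated_fusion(x, lambda t: t, np.zeros(2), 0.0)  # identity stands in for iRMB
z = space_to_depth(x, scale=2)                             # shape (8, 2, 2)
```

With zero gate parameters the two paths are blended equally. The space-to-depth output contains every input value exactly once, which is the sense in which this form of downsampling discards no pixels before the following stride-1 convolution.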
Attached Figure Description
Figure 1 shows the iSimAM-YOLOv11 model;
Figure 2 shows the iRMB network structure;
Figure 3 shows the SimAM attention mechanism structure;
Figure 4 shows the iSimAM module;
Figure 5 shows the ASFF module structure;
Figure 6 shows the NA-ASFF module structure;
Figure 7 shows the noise estimation network structure;
Figure 8 shows the SPDBlock transformation process;
Figure 9 shows the SPDBlock framework;
Figure 10 shows the comparison of visual analysis results.
Detailed Implementation
The technical solutions of the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. The described embodiments are only some of the embodiments of the present invention, not all of them. All other embodiments obtained by those skilled in the art on the basis of these embodiments without creative effort fall within the scope of protection of the present invention.

YOLOv11 is designed as an efficient, accurate, and robust detection system and performs well in complex environments, dynamic scenes, and small object detection tasks [16]. For SAR image target detection, YOLOv11 integrates modules such as C3k2 and spatial pyramid pooling fusion, which enhance feature extraction and multi-scale target recognition and balance detection accuracy against computational efficiency [17]. Applying it directly to SAR ship detection, however, raises three problems. It lacks a dedicated suppression mechanism for speckle noise, which leaves weak target features susceptible to interference. Its basic attention mechanism adapts poorly to the scattering characteristics and complex backgrounds specific to SAR images. Its multi-scale fusion strategy is insufficient for large changes in target scale, so missed detections of small targets and false detections in the background remain prominent. To meet these challenges, this paper designs a dual-path iSimAM attention mechanism based on YOLOv11 and combines SPDBlock with an adaptive spatial feature fusion detection head containing a noise estimation branch to construct the iSimAM-YOLO model, whose overall structure is shown in Figure 1. This scheme alleviates speckle noise interference, weak target detection, and class imbalance in SAR images, and is especially suitable for fields such as national defense security, emergency response, and environmental monitoring, where high accuracy and efficiency are required.
1.1 iSimAM Module
1.1.1 iRMB Module
To improve the detection accuracy and generalization ability for small ship targets, the inverted residual structure is introduced and the iSimAM attention mechanism is designed. In the design of iRMB, the inverted residual block concept is incorporated into the attention mechanism. By capturing global dependencies between input regions through self-attention, iRMB models the overall context during feature extraction, improving its handling of complex data patterns. The inverted residual block (IRB) structure is designed to achieve high performance in resource-constrained environments [13].

The inverted residual iRMB is a hybrid network module that combines depthwise separable convolution with a self-attention mechanism. For the input image, iRMB first expands the channel dimension through a multilayer perceptron (MLP) whose output-to-input channel ratio is $\lambda$:

$$X_e = \mathrm{MLP}_e(X) \in \mathbb{R}^{\lambda C \times H \times W} \tag{1}$$

where $X \in \mathbb{R}^{C \times H \times W}$ is the input feature, $X_e$ is the output, and $C$, $H$, and $W$ are the channels, height, and width of the image. $\mathrm{MLP}_e$ is the expansion MLP, usually composed of a $1 \times 1$ convolution. Expanding the channel dimension of the input features gives the subsequent depthwise separable convolution a richer feature representation, which helps the model capture multi-scale features. An efficient operator $F$ is then applied to the output; MHSA is often used as this operator to further enhance the image features:

$$X_f = F(X_e) \in \mathbb{R}^{\lambda C \times H \times W} \tag{2}$$

Next, another MLP reduces the dimension to obtain $X_s = \mathrm{MLP}_s(X_f) \in \mathbb{R}^{C \times H \times W}$, which brings the number of channels back down to $C$. The final residual connection output is:

$$Y = X + X_s \in \mathbb{R}^{C \times H \times W} \tag{3}$$

As shown in Figure 1, the iRMB module contains $1 \times 1$ convolutions, depthwise convolution, and a self-attention mechanism. The $1 \times 1$ convolutions compress and expand the number of channels to optimize computational efficiency, the depthwise separable convolution captures spatial features, and the attention mechanism captures global dependencies between features.

In SAR image detection tasks, the inherent speckle noise of SAR images interferes with local feature extraction, and iRMB lacks a targeted noise suppression mechanism, so weak target features are easily submerged by noise. In addition, in complex sea conditions or near-shore scenes, iRMB's ability to distinguish low-contrast targets is insufficient, which easily results in missed detections and false alarms.
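The expand, operate, reduce, and residual steps described for iRMB can be traced with a short NumPy sketch. It only illustrates the data flow and tensor shapes under simplifying assumptions. A depthwise 3x3 mean filter stands in for the efficient operator (the text names MHSA and depthwise separable convolution), the channel-mixing weights are random rather than learned, and the names `pointwise_conv`, `depthwise_box3`, and `irmb_flow` are hypothetical.

```python
import numpy as np


def pointwise_conv(x, weight):
    """1x1 convolution: mix channels at every pixel.

    x: (C_in, H, W), weight: (C_out, C_in) -> (C_out, H, W)
    """
    return np.einsum("oc,chw->ohw", weight, x)


def depthwise_box3(x):
    """Depthwise 3x3 mean filter with zero padding (stand-in operator)."""
    p = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    h, w = x.shape[1:]
    return sum(p[:, i:i + h, j:j + w] for i in range(3) for j in range(3)) / 9.0


def irmb_flow(x, w_expand, w_shrink, operator=lambda t: t):
    """Expand channels, apply an operator, shrink back, add the residual."""
    x_e = pointwise_conv(x, w_expand)    # (ratio * C, H, W)
    x_f = operator(x_e)                  # operator keeps the expanded shape
    x_s = pointwise_conv(x_f, w_shrink)  # back to (C, H, W)
    return x + x_s                       # residual connection


rng = np.random.default_rng(0)
channels, ratio = 4, 2                   # ratio is an assumed expansion factor
x = rng.normal(size=(channels, 5, 5))
w_e = rng.normal(size=(ratio * channels, channels))
w_s = rng.normal(size=(channels, ratio * channels))
y = irmb_flow(x, w_e, w_s, operator=depthwise_box3)
```

The output has the same shape as the input, which is what allows the residual addition. Setting the reduction weights to zero returns the input unchanged.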
1.1.2 SimAM Module
In SAR image target detection, traditional attention modules are limited by fixed local receptive fields, which hinders the extraction of subtle features of weak targets, and they struggle to distinguish low-contrast targets in complex scenes [18]. These problems reduce detection performance for small targets and low-contrast features in SAR images. To address them, the SimAM module is introduced. Its parameter-free attention reweighting does not rely on traditional weight injection fusion and enhances local features without adding learnable parameters. The core of SimAM is its three-dimensional attention weight generation mechanism, which directly assigns independent three-dimensional weights to neurons across the channel, height, and width dimensions of the feature map, without weight copying or expansion operations [19]. Each neuron therefore has its own weight, so the importance of channels and spatial positions is evaluated simultaneously. This avoids the redundant step in traditional methods of generating one-dimensional weights and then expanding them to three dimensions, and yields more accurate and efficient feature weighting. Its structure is shown in Figure 3.
1.1.3 iSimAM Module
To suit SAR ship image detection, this paper designs a gated dual-path collaborative architecture. The architecture retains the strengths of the inverted residual mobile block in local feature perception and multi-scale transformation, and adds a pluggable SimAM parameter-free attention branch that performs adaptive feature calibration without increasing network parameters. The two paths are balanced dynamically by a learnable gating unit, as shown in Figure 4. Dynamically fusing iRMB and SimAM through the gating unit combines iRMB's local multi-scale extraction with SimAM's global noise suppression, giving synergistic enhancement of local details and global context. This improves the robustness and accuracy of SAR ship detection while maintaining computational efficiency.

The gating fusion mechanism is designed as follows. Let the input features be $X \in \mathbb{R}^{C \times H \times W}$. GAP denotes global average pooling, which aggregates the global spatial information of the entire feature map and provides the statistical basis for generating the gating weight. FC denotes a fully connected layer. Features $F_{\mathrm{iRMB}}$ and $F_{\mathrm{SimAM}}$ are obtained via the iRMB path and the SimAM path, respectively. The gating unit first performs global average pooling on the input and then generates the fusion weight $\alpha$ for the two paths through a lightweight fully connected layer:

$$\alpha = \sigma(\mathrm{FC}(\mathrm{GAP}(X))) \tag{4}$$

where $\sigma$ is the Sigmoid function, so that $\alpha \in (0, 1)$. The final output is:

$$Y = \alpha \cdot F_{\mathrm{iRMB}} + (1 - \alpha) \cdot F_{\mathrm{SimAM}} \tag{5}$$

The gating mechanism dynamically adjusts the contribution of the local iRMB path and the global SimAM path according to the statistics of the input features. Against a strong-noise background it suppresses the interference response of the SimAM branch, and in weak-texture regions it strengthens the detail extraction of the iRMB branch. The gating parameters are optimized jointly with the network and need no additional supervision signal.

The iSimAM attention mechanism has three advantages:
(1) iRMB achieves efficient multi-scale local feature extraction through the inverted residual structure and depthwise separable convolution, compensating for the insufficient capture of local SAR image features by traditional convolution;
(2) SimAM generates parameter-free three-dimensional attention weights from the statistics of the feature maps, suppressing speckle noise and enhancing low-contrast targets;
(3) the deep integration of the two makes feature extraction and attention weighting complementary, improving feature discrimination while maintaining computational efficiency.
1.2 NA-ASFFHead Module
In the YOLO detection framework, common multi-scale feature fusion methods often cause conflicts between feature layers. The same target may be identified as a positive sample in one layer but incorrectly identified as background in others, which deepens the semantic differences between scales [20]. Traditional FPN fuses feature maps of different levels directly. Because the layers differ in spatial features and semantic information, gradient propagation easily becomes inconsistent. High-level features are usually better suited to detecting large targets, while low-level features are more sensitive to small targets. When a SAR image contains objects of different sizes at the same time, simple fusion of features from different layers can cause interference, disturb the stable propagation of gradients, and reduce detection accuracy. To remedy these shortcomings, this paper introduces an adaptive spatial feature fusion mechanism to strengthen the feature fusion of FPN. The mechanism adjusts the contribution of each feature layer automatically by learning spatial weights, which helps to alleviate semantic conflicts between layers, suppress noise that is detrimental to gradient propagation, and improve consistency between scales [21]. ASFF resizes multi-scale features to the same size so that each feature layer better serves object detection at its corresponding scale, improving detection accuracy in complex scenes and for multi-scale objects. The architecture of the ASFF module is shown in Figure 5. The fused feature is formed by multiplying the features from different layers by the weight parameters $\alpha$, $\beta$, and $\gamma$ and adding the results [22]:

$$y_{ij}^{l} = \alpha_{ij}^{l} \cdot x_{ij}^{1 \to l} + \beta_{ij}^{l} \cdot x_{ij}^{2 \to l} + \gamma_{ij}^{l} \cdot x_{ij}^{3 \to l} \tag{6}$$

where $y_{ij}^{l}$ is the new fused feature, $(i, j)$ are the pixel coordinates, $l$ is the ASFF feature layer, $x_{ij}^{n \to l}$ are the feature values of the different layers resized to layer $l$, and $\alpha_{ij}^{l}$, $\beta_{ij}^{l}$, and $\gamma_{ij}^{l}$ are the weight parameters of the different layers. The weight parameters are obtained by passing the resized feature maps through a $1 \times 1$ convolution to produce $\lambda_{\alpha}$, $\lambda_{\beta}$, and $\lambda_{\gamma}$, concatenating the results, and normalizing them to values between 0 and 1 with softmax, so that the three weight parameters sum to 1. The weight parameters are calculated as shown in (7)-(10):

$$\alpha_{ij}^{l} = \frac{e^{\lambda_{\alpha,ij}^{l}}}{e^{\lambda_{\alpha,ij}^{l}} + e^{\lambda_{\beta,ij}^{l}} + e^{\lambda_{\gamma,ij}^{l}}} \tag{7}$$

$$\beta_{ij}^{l} = \frac{e^{\lambda_{\beta,ij}^{l}}}{e^{\lambda_{\alpha,ij}^{l}} + e^{\lambda_{\beta,ij}^{l}} + e^{\lambda_{\gamma,ij}^{l}}} \tag{8}$$

$$\gamma_{ij}^{l} = \frac{e^{\lambda_{\gamma,ij}^{l}}}{e^{\lambda_{\alpha,ij}^{l}} + e^{\lambda_{\beta,ij}^{l}} + e^{\lambda_{\gamma,ij}^{l}}} \tag{9}$$

$$\alpha_{ij}^{l} + \beta_{ij}^{l} + \gamma_{ij}^{l} = 1, \quad \alpha_{ij}^{l}, \beta_{ij}^{l}, \gamma_{ij}^{l} \in [0, 1] \tag{10}$$

Adaptive spatial feature fusion (ASFF) learns weight maps over spatial locations to perform weighted fusion of feature maps at different scales. In SAR image detection, however, feature maps lose significant detail to speckle noise. The initial weights of traditional ASFF are generated solely from the features themselves via 1×1 convolutions, so they cannot distinguish feature types and may assign inappropriate fusion weights to noisy regions, contaminating the final features. To address this, NA-ASFF connects a noise estimation subnetwork in parallel for each scale feature. The subnetwork outputs a single-channel confidence map with the same spatial size as the feature map, representing the confidence at each location. This confidence map is multiplied element-wise with the initial weight map to obtain adjusted weights, dynamically gating the weights before fusion. High-confidence locations thus retain their original weights, while the weights of low-confidence locations are suppressed. The noise estimation subnetwork, shown in Figure 7, consists of two convolutional layers. As each convolutional kernel slides across the feature map, it computes a weighted sum of the pixels in its neighborhood at that location.
These kernels can thus extract the mean, variance, and even higher-order texture information of local regions, mapping high-noise regions to low confidence and low-noise regions to high confidence, which amounts to an implicit estimate of the signal-to-noise ratio at each location. The initial weights are then multiplied element-wise with the confidence map to obtain the modulated weights:

$$\tilde{w}^{n} = w^{n} \odot c^{n}, \quad n \in \{1, 2, 3\} \tag{11}$$

where $\tilde{w}^{n}$ is the modulated weight map, $n$ indexes the scales, $w^{n}$ is the initial weight map, and $c^{n}$ is the confidence map, with values from 0 to 1, output by the parallel noise estimation subnetwork. This operation adaptively suppresses noisy locations: when a pixel is identified as strong noise its fusion weight is reduced, while the original weight is retained in confident regions. To prevent scale imbalance in the modulated weights, the three modulated weight maps are concatenated along the channel dimension and Softmax normalization is applied again, ensuring that the weights of the three scales sum to 1. Compared with traditional ASFF, NA-ASFFHead dynamically suppresses speckle noise through a learnable confidence map, preventing noisy features from contaminating the fusion result. The noise estimation subnetwork is optimized jointly with the detection head, so no separate noise annotation is required.
1.3 SpdBlock
The detection performance of convolutional neural networks drops markedly on SAR images containing small-scale targets. These networks generally downsample with strided convolution and pooling layers, which loses spatial detail and restricts the model's ability to extract subtle features. To alleviate this problem, a space-to-depth module (SpdBlock) is used to reduce the performance degradation caused by strided convolution and pooling in low-resolution and small-target scenes [23]. SpdBlock consists of a Space-to-Depth (SPD) layer and a convolutional layer with a stride of 1, preserving fine-grained feature information during feature map downsampling [24]. The SPD layer is essentially an extension of an image transformation technique to convolutional neural networks and is mainly used for downsampling feature maps. Its transformation process is as follows: given an intermediate feature map $X$ of size $S \times S \times C_1$, the SPD layer divides it into a series of sub-feature maps through a slicing operation:

$$f_{x,y} = X[x : S : scale,\; y : S : scale], \quad x, y \in \{0, 1, \dots, scale - 1\} \tag{12}$$

where $(x, y)$ is the index of the sub-feature map. In this way the spatial dimension of the original feature map is compressed while the channel dimension is expanded correspondingly, achieving downsampling while preserving spatial information. When the scale factor $scale = 2$, the feature map $X$ is divided by pixel location into four sub-feature maps $f_{0,0}$, $f_{1,0}$, $f_{0,1}$, and $f_{1,1}$, each of size $(S/2, S/2, C_1)$, which halves the spatial resolution of the original feature map. These sub-feature maps are then concatenated along the channel dimension to obtain the intermediate feature map $X'$. Its spatial dimensions are reduced to $1/scale$ of the original and its channel dimension is increased to $scale^2$ times the original, giving the transformation from $X(S, S, C_1)$ to $X'(S/scale, S/scale, scale^2 C_1)$. By rearranging spatial information, the transformation retains all information while downsampling and provides richer feature representations for subsequent convolution operations [24].

After the SPD transformation, a convolutional layer with a stride of 1 and $C_2$ filters converts the intermediate feature map $X'(S/scale, S/scale, scale^2 C_1)$ into the output feature map $X''(S/scale, S/scale, C_2)$. The stride setting matters for feature extraction. Convolutions with a stride greater than 1 sharply reduce the spatial dimensions of the feature map and introduce sampling bias. Taking a $3 \times 3$ convolution with a stride of 3 as an example, each pixel in the feature map is sampled only once, which directly loses a large amount of information. A stride of 2 produces asymmetric sampling, so pixels are sampled at inconsistent frequencies. This imbalance damages the integrity of the features and weakens the model's discriminative ability. A stride of 1 avoids these problems, preserving the spatial structure of the feature map while strengthening its discriminative power. SpdBlock borrows the idea of sparse feature aggregation and balances computational efficiency against information preservation through channel expansion combined with spatial compression, providing a high-quality feature foundation for subsequent tasks such as object detection and image classification.
2. Analysis of Experimental Results
2.1 Ablation Experiment
This experiment uses the HRSID dataset for model training and validation. The dataset contains 5604 high-resolution 800×800 pixel marine images covering typical and complex marine scenes such as the high seas, coastlines, ports, and docks. A total of 16591 ship targets are labeled in the dataset, with small, medium, and large ships accounting for 54.5%, 43.5%, and 2%, respectively.
This multi-scale and imbalanced target distribution closely resembles the real marine monitoring environment and effectively evaluates the algorithm's ability to detect ships of different sizes, especially small and medium-sized targets. The dataset is divided into training and validation sets in a 7:3 ratio so that the model can learn from sufficient data and be reliably evaluated on independent samples. The experimental environment consists of an Intel i5-12400F CPU, an NVIDIA GeForce RTX 4060 GPU, and CUDA 11.8.

To verify the effectiveness of the proposed modules, systematic ablation experiments were conducted on the HRSID dataset. Using YOLOv11 as the baseline model, the independent contributions and combined effects of the proposed iSimAM module and the introduced Detect_ASFF detection head were evaluated. The proposed modules improved multiple key metrics, including mAP, precision (P), and recall (R). The experimental results are shown in Table 1 below.

Table 1 Comparative Results of Ablation Experiments

| Model | mAP@0.5 | mAP@0.5:0.95 | Precision | Recall |
|---|---|---|---|---|
| YOLOv11 | 0.8859 | 0.6348 | 0.896 | 0.8099 |
| YOLOv11-SimAM | 0.903 | 0.656 | 0.9046 | 0.8180 |
| YOLOv11-ASFFHead | 0.9091 | 0.6712 | 0.9075 | 0.8257 |
| YOLOv11-iSimAM | 0.903 | 0.655 | 0.9118 | 0.8173 |
| YOLOv11-Spdblock | 0.9036 | 0.6590 | 0.9109 | 0.8182 |
| YOLOv11-Spd-iSimAM | 0.9111 | 0.6722 | 0.9166 | 0.8266 |
| YOLOv11-Spd-NA-ASFFHead | 0.9116 | 0.6802 | 0.9165 | 0.8252 |
| iSimAM-YOLO | 0.9224 | 0.6960 | 0.9122 | 0.8440 |

As shown in Table 1, each improved module enhances SAR ship detection performance to a different degree. Introducing the SimAM attention mechanism increases the model's precision to 90.46%, indicating that this mechanism effectively suppresses background clutter interference and reduces false detections. With the ASFFHead module, mAP@0.5 reaches 90.91%, verifying the advantage of adaptive spatial feature fusion in enhancing multi-scale feature representation.
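The gain of each variant over the YOLOv11 baseline can be recomputed directly from the values in Table 1. The following is a minimal sketch in Python; the names `TABLE_1` and `gain_in_points` are illustrative and are not part of the described method.

```python
# Values copied from Table 1: (mAP@0.5, mAP@0.5:0.95, precision, recall).
TABLE_1 = {
    "YOLOv11": (0.8859, 0.6348, 0.8960, 0.8099),
    "YOLOv11-SimAM": (0.9030, 0.6560, 0.9046, 0.8180),
    "YOLOv11-ASFFHead": (0.9091, 0.6712, 0.9075, 0.8257),
    "YOLOv11-iSimAM": (0.9030, 0.6550, 0.9118, 0.8173),
    "YOLOv11-Spdblock": (0.9036, 0.6590, 0.9109, 0.8182),
    "YOLOv11-Spd-iSimAM": (0.9111, 0.6722, 0.9166, 0.8266),
    "YOLOv11-Spd-NA-ASFFHead": (0.9116, 0.6802, 0.9165, 0.8252),
    "iSimAM-YOLO": (0.9224, 0.6960, 0.9122, 0.8440),
}


def gain_in_points(model, baseline="YOLOv11"):
    """Return the per-metric difference to the baseline in percentage points."""
    return tuple(
        round((m - b) * 100, 2)
        for m, b in zip(TABLE_1[model], TABLE_1[baseline])
    )


if __name__ == "__main__":
    for name in TABLE_1:
        print(name, gain_in_points(name))
```

Expressing the differences in percentage points (rather than relative percent) keeps them directly comparable with the metric values in the table.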
The improved iSimAM module achieves a precision of 91.18%, outperforming SimAM (90.46%) and highlighting the role of the improved attention mechanism in enhancing feature representation. Using Spdblock raises mAP@0.5 from 88.59% to 90.36%, demonstrating its ability to retain more detailed features. Combining Spdblock with iSimAM raises mAP@0.5 to 91.11% and mAP@0.5:0.95 to 67.22%, both higher than with either module used separately, and precision and recall improve as well. This indicates that combining space-to-depth downsampling with the improved attention mechanism better preserves the details of small targets and suppresses false alarms. Combining Spdblock with NA-ASFFHead yields 91.16% mAP@0.5 and 68.02% mAP@0.5:0.95, both higher than with either module alone, demonstrating that detail preservation and adaptive multi-scale feature fusion reinforce each other. Finally, the iSimAM-YOLO model, which integrates iSimAM, NA-ASFFHead, and Spdblock, achieves the best mAP@0.5, mAP@0.5:0.95, and recall of all configurations. Compared to the baseline YOLOv11, it improves mAP@0.5, mAP@0.5:0.95, precision, and recall by 3.65, 6.12, 1.62, and 3.41 percentage points, respectively. It also outperforms the pairwise combination models on mAP@0.5, mAP@0.5:0.95, and recall, while its precision (91.22%) is slightly below that of the two pairwise combinations (91.66% and 91.65%). These results demonstrate that the proposed method addresses feature ambiguity and missed detection of small targets in SAR ship detection through the combined enhancement of feature representation, multi-scale fusion, and spatial information preservation, giving better detection performance and robustness in complex scenarios.

2.2 Comparison Results Analysis

To verify the advantages of the proposed modules, a systematic comparative experiment was conducted on the HRSID dataset. The proposed method was compared with common YOLOv8, YOLOv10s, YOLOv11, YOLOv8-SimAM
[25], ACYOLOv11
[26], YOLOv11-SimAM, YOLOv8-BiFPN, and other models. The experimental results are shown in Table 2 below.

Table 2 Comparison Results

| Model | mAP@0.5 | mAP@0.5:0.95 | Precision | Recall |
|---|---|---|---|---|
| YOLOv8 | 0.8989 | 0.6489 | 0.9091 | 0.8185 |
| YOLOv11-SimAM | 0.903 | 0.656 | 0.9046 | 0.8180 |
| YOLOv10s | 0.8932 | 0.6577 | 0.9107 | 0.8043 |
| YOLOv11 | 0.8859 | 0.6348 | 0.8960 | 0.8099 |
| YOLOv8-SimAM | 0.9106 | 0.6650 | 0.9146 | 0.8291 |
| YOLOv8-BiFPN | 0.9059 | 0.6648 | 0.9080 | 0.8232 |
| ACYOLOv11 | 0.8860 | 0.63 | 0.8966 | 0.8020 |
| iSimAM-YOLO | 0.9224 | 0.6960 | 0.9122 | 0.8440 |

As shown in Table 2, compared with the base versions YOLOv8, YOLOv10s, and YOLOv11, the proposed model improves mAP@0.5 by 2.35, 2.92, and 3.65 percentage points and mAP@0.5:0.95 by 4.71, 3.83, and 6.12 percentage points, respectively; precision increases by 0.31, 0.15, and 1.62 percentage points, and recall also improves. This indicates that introducing the iSimAM and NA-ASFF modules overcomes the shortcomings of the original models in feature extraction under complex sea clutter backgrounds and improves detection performance. Among the models that incorporate SimAM attention, the proposed model improves mAP@0.5 by 1.18 percentage points and mAP@0.5:0.95 by 3.10 percentage points over YOLOv8-SimAM, with precision 0.24 percentage points lower (0.9122 versus 0.9146). Compared with YOLOv11-SimAM, it improves mAP@0.5 by 1.94, mAP@0.5:0.95 by 4.00, and precision by 0.76 percentage points. This demonstrates that the proposed iSimAM focuses more accurately on key ship regions, and that adding the NA-ASFF module further optimizes multi-scale feature fusion, surpassing the performance of a simple attention mechanism. YOLOv8-BiFPN enhances multi-scale representation by introducing a weighted bidirectional feature pyramid, but the proposed model still outperforms it on all metrics. ACYOLOv11 improves only marginally over YOLOv11, and all of its metrics are lower than those of the proposed model, highlighting the targeted and efficient design of the proposed modules. In actual SAR ship detection missions, the false alarm rate is often more critical than the recall rate.
The proposed model suppresses sea-clutter false alarms with higher precision than the baseline models while maintaining high mean average precision, demonstrating a better balance between precision and recall, and is therefore more suitable for application scenarios that are sensitive to false alarms.

To demonstrate the performance improvement more intuitively, the detection results of the models on the HRSID test set were visualized and analyzed. Figure 10 presents the results of each method. In Figure 10, correct detections are marked with yellow boxes and false positives with red boxes. In port areas with densely packed ships and partial occlusion (as shown in the red boxes in Figure 10), the baseline models from YOLOv8 to YOLOv11 and the other improved algorithms miss small, distant targets. The proposed algorithm exhibits the strongest anti-interference capability and detects the targets clearly, indicating that the iSimAM attention mechanism dynamically enhances the feature response of ship targets while suppressing background noise, allowing the model to focus on key areas and thereby reducing false alarms. Its predicted bounding boxes not only fit the true contour of the target more closely but also clearly distinguish adjacent or occluded individuals, reducing false detections and missed detections.

The embodiments or examples described above further illustrate the purpose, technical solutions, and advantages of the present invention. It should be understood that the embodiments or examples described above are merely preferred embodiments of the present invention and are not intended to limit the present invention. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of the present invention should be included within the protection scope of the present invention.
Claims
1. A SAR ship image target detection method based on dual-path collaborative improved YOLOv11, characterized in that: first, ship images are acquired and an iSimAM-YOLO model is constructed; the iSimAM-YOLO model includes an iSimAM module, an NA-ASFFHead module, and a SpdBlock module; the iSimAM module includes an iRMB module, a SimAM module, and an iSimAM module. The inverted residual iRMB module is a hybrid network module that combines depthwise separable convolution and a self-attention mechanism. For the input image, the iRMB module first performs dimensionality expansion using a multilayer perceptron (MLP) with an output/input ratio of $\lambda$: $X_e = \mathrm{MLP}_e(X) \in \mathbb{R}^{\lambda C \times H \times W}$; in the formula, $X \in \mathbb{R}^{C \times H \times W}$ is the input feature, $X_e$ is the output result, and $C$, $H$, $W$ are the channels, height, and width of the image, respectively; the expansion multilayer perceptron $\mathrm{MLP}_e$ is implemented as a single $1 \times 1$ convolution; then, an efficient operator is applied to the output, with MHSA used as the efficient operator $F$ to further enhance image features; next, another multilayer perceptron $\mathrm{MLP}_s$ is used for dimensionality reduction to obtain $X_s = \mathrm{MLP}_s(F(X_e)) \in \mathbb{R}^{C \times H \times W}$, so that the number of channels is reduced back to $C$; the final residual connection output is obtained: $Y = X + X_s$. The core of the SimAM module lies in its 3D attention weight generation mechanism, which directly assigns independent 3D weights to neurons across the channel, height, and width dimensions of the feature map. The iSimAM module dynamically fuses iRMB and SimAM through a gating unit, leveraging the local multi-scale extraction advantages of iRMB while introducing the global noise suppression capability of SimAM, achieving synergistic enhancement of local details and global context. The gating fusion mechanism is designed as follows: let the input features be $X$; GAP denotes global average pooling, which aggregates global spatial information from the entire feature map; FC denotes a fully connected layer.
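Features $F_{\mathrm{iRMB}}$ and $F_{\mathrm{SimAM}}$ are obtained via the iRMB path and the SimAM path, respectively. The gating unit first performs global average pooling on the input features $X$ and then generates the fusion weight $\alpha$ for the two paths through a lightweight fully connected layer: $\alpha = \sigma(\mathrm{FC}(\mathrm{GAP}(X)))$; where $\sigma$ is the Sigmoid function, satisfying $\alpha \in (0, 1)$. The final output is: $F_{\mathrm{out}} = \alpha \cdot F_{\mathrm{iRMB}} + (1 - \alpha) \cdot F_{\mathrm{SimAM}}$.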
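As a non-limiting illustration of the gated fusion recited in claim 1, the following minimal Python sketch combines two path outputs with a weight produced by global average pooling, a fully connected layer, and a Sigmoid. Several points are assumptions made only for this sketch: the gate is a single scalar $\alpha$ with complementary weight $1 - \alpha$, the fully connected layer is a single linear map given by the hypothetical parameters `fc_weights` and `fc_bias`, and the iRMB and SimAM path outputs are passed in as precomputed arrays rather than implemented.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def global_average_pool(x):
    """x is a [C][H][W] nested list; returns one mean value per channel."""
    return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in x]


def gate_weight(x, fc_weights, fc_bias):
    """alpha = sigmoid(FC(GAP(x))), with FC assumed to be one linear unit."""
    pooled = global_average_pool(x)
    logit = sum(w * p for w, p in zip(fc_weights, pooled)) + fc_bias
    return sigmoid(logit)


def gated_fusion(x, f_irmb, f_simam, fc_weights, fc_bias):
    """Blend the two path outputs as alpha * f_irmb + (1 - alpha) * f_simam."""
    alpha = gate_weight(x, fc_weights, fc_bias)
    return [
        [
            [alpha * a + (1.0 - alpha) * b for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(ch_a, ch_b)
        ]
        for ch_a, ch_b in zip(f_irmb, f_simam)
    ]


if __name__ == "__main__":
    x = [[[0.0, 0.0], [0.0, 0.0]]]          # one channel, 2x2
    f_irmb = [[[2.0, 2.0], [2.0, 2.0]]]     # stand-in for the iRMB path output
    f_simam = [[[4.0, 4.0], [4.0, 4.0]]]    # stand-in for the SimAM path output
    print(gated_fusion(x, f_irmb, f_simam, fc_weights=[1.0], fc_bias=0.0))
```

Because the weight is derived from the pooled input, the balance between the two paths changes per input image rather than being fixed after training.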
2. The SAR ship image target detection method based on dual-path collaborative improved YOLOv11 as described in claim 1, characterized in that: the NA-ASFFHead module uses adaptive spatial feature fusion (ASFF) to adjust the multi-scale features to the same size, so that the features of each layer serve target detection at the corresponding scale. The adaptively fused ASFF feature is obtained by multiplying the features from different layers by the weight parameters $\alpha$, $\beta$, and $\gamma$ and adding them together, giving the following formula: $y_{ij}^{l} = \alpha_{ij}^{l} \cdot x_{ij}^{1 \to l} + \beta_{ij}^{l} \cdot x_{ij}^{2 \to l} + \gamma_{ij}^{l} \cdot x_{ij}^{3 \to l}$. In the formula, $y_{ij}^{l}$ is the new fused feature, $(i, j)$ are the pixel coordinates, $l$ is the feature layer of ASFF, $x_{ij}^{1 \to l}$, $x_{ij}^{2 \to l}$, and $x_{ij}^{3 \to l}$ are the feature values of the different ASFF layers, and $\alpha_{ij}^{l}$, $\beta_{ij}^{l}$, $\gamma_{ij}^{l}$ are the weight parameters for the different layers. The weight parameters are obtained by passing the feature maps through a $1 \times 1$ convolution, concatenating the results, and normalizing them with softmax to values between 0 and 1, ensuring that the three weight parameters sum to 1.
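3. The SAR ship image target detection method based on dual-path collaborative improved YOLOv11 according to claim 2, characterized in that: the weight parameters are calculated as $\alpha_{ij}^{l} = \dfrac{e^{\lambda_{\alpha_{ij}}^{l}}}{e^{\lambda_{\alpha_{ij}}^{l}} + e^{\lambda_{\beta_{ij}}^{l}} + e^{\lambda_{\gamma_{ij}}^{l}}}$; $\beta_{ij}^{l} = \dfrac{e^{\lambda_{\beta_{ij}}^{l}}}{e^{\lambda_{\alpha_{ij}}^{l}} + e^{\lambda_{\beta_{ij}}^{l}} + e^{\lambda_{\gamma_{ij}}^{l}}}$; $\gamma_{ij}^{l} = \dfrac{e^{\lambda_{\gamma_{ij}}^{l}}}{e^{\lambda_{\alpha_{ij}}^{l}} + e^{\lambda_{\beta_{ij}}^{l}} + e^{\lambda_{\gamma_{ij}}^{l}}}$; $\alpha_{ij}^{l} + \beta_{ij}^{l} + \gamma_{ij}^{l} = 1$.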
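As a non-limiting illustration of the adaptive fusion in claims 2 and 3, the sketch below normalizes three per-pixel weight logits with softmax and forms the weighted sum of three feature maps. It assumes that the three maps have already been resized to a common size and that each is a single-channel `[H][W]` nested list; the function names `softmax3` and `asff_fuse` and the logit maps `lam_a`, `lam_b`, `lam_g` are illustrative names for this sketch only.

```python
import math


def softmax3(a, b, c):
    """Softmax over three logits; the maximum is subtracted for stability."""
    m = max(a, b, c)
    ea, eb, ec = math.exp(a - m), math.exp(b - m), math.exp(c - m)
    total = ea + eb + ec
    return ea / total, eb / total, ec / total


def asff_fuse(x1, x2, x3, lam_a, lam_b, lam_g):
    """Per-pixel fusion y = alpha * x1 + beta * x2 + gamma * x3.

    x1, x2, x3 are [H][W] feature maps already resized to the same level.
    lam_a, lam_b, lam_g are [H][W] weight logits (for example, the outputs
    of the weight-generating convolutions) before softmax normalization.
    """
    fused = []
    for i in range(len(x1)):
        row = []
        for j in range(len(x1[0])):
            alpha, beta, gamma = softmax3(lam_a[i][j], lam_b[i][j], lam_g[i][j])
            row.append(alpha * x1[i][j] + beta * x2[i][j] + gamma * x3[i][j])
        fused.append(row)
    return fused


if __name__ == "__main__":
    # Equal logits give equal weights, so the result is the plain average.
    print(asff_fuse([[3.0]], [[6.0]], [[9.0]], [[0.0]], [[0.0]], [[0.0]]))
```

Since the softmax is applied independently at every pixel, each location can favor a different scale, which is what distinguishes this fusion from a fixed per-layer weighting.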
4. The SAR ship image target detection method based on dual-path collaborative improved YOLOv11 according to claim 3, characterized in that it further includes: in ASFF, a noise estimation subnetwork is connected in parallel to the features of each scale, and its output is a single-channel confidence map with the same spatial size as the feature map, representing the confidence of the feature at each location. This confidence map is multiplied element-wise with the initial weight map to obtain the adjusted weights, thereby dynamically gating the weights before fusion; the noise estimation subnetwork consists of two ... convolutional layers, in which each convolution kernel computes a weighted sum of the pixels in the neighborhood of each location as it slides across the feature map.
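As a non-limiting illustration of the noise-aware gating in claim 4, the sketch below builds a confidence map from two stacked convolutions and multiplies it element-wise with an initial weight map. The claim text does not give the kernel size, the activations, or whether the adjusted weights are renormalized, so the following are assumptions made only for this sketch: $3 \times 3$ kernels with zero padding, a ReLU between the two layers, a Sigmoid on the output so that confidences lie in $(0, 1)$, a single-channel input, and no renormalization. All function names are illustrative.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def conv2d_same(fmap, kernel):
    """Stride-1 2D convolution with zero padding; output size equals input size."""
    h, w = len(fmap), len(fmap[0])
    k = len(kernel)
    pad = k // 2
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            acc = 0.0
            for di in range(k):
                for dj in range(k):
                    y, x = i + di - pad, j + dj - pad
                    if 0 <= y < h and 0 <= x < w:
                        acc += kernel[di][dj] * fmap[y][x]
            out[i][j] = acc
    return out


def confidence_map(fmap, kernel1, kernel2):
    """Two stacked convolutions producing a single-channel confidence map."""
    hidden = [[max(0.0, v) for v in row] for row in conv2d_same(fmap, kernel1)]
    logits = conv2d_same(hidden, kernel2)
    return [[sigmoid(v) for v in row] for row in logits]


def gate_weights(weight_map, conf_map):
    """Element-wise product of the initial weight map and the confidence map."""
    return [[w * c for w, c in zip(wr, cr)] for wr, cr in zip(weight_map, conf_map)]


if __name__ == "__main__":
    fmap = [[1.0, 2.0], [3.0, 4.0]]
    mean3 = [[1.0 / 9.0] * 3 for _ in range(3)]   # hypothetical first kernel
    zeros3 = [[0.0] * 3 for _ in range(3)]        # hypothetical second kernel
    conf = confidence_map(fmap, mean3, zeros3)    # all 0.5 with a zero kernel
    print(gate_weights([[0.4, 0.4], [0.4, 0.4]], conf))
```

Locations where the subnetwork reports low confidence contribute less to the fused feature, which is how the gating reduces the influence of speckle-dominated regions.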
5. The SAR ship image target detection method based on dual-path collaborative improved YOLOv11 according to claim 1, characterized in that: the SpdBlock module consists of a space-to-depth (SPD) layer and a convolutional layer with a stride of 1, preserving fine-grained feature information during feature map downsampling; the SPD layer is used to downsample the feature map, and its transformation process is as follows: given an intermediate feature map $X$ of size $S \times S \times C_1$, the SPD layer divides the feature map into a series of sub-feature maps through a slicing operation, as shown in the following equation: $f_{x,y} = X[x:S:\mathrm{scale},\; y:S:\mathrm{scale}]$; where $x, y \in \{0, 1, \dots, \mathrm{scale}-1\}$ are the indices of the sub-feature maps.
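As a non-limiting illustration of the SPD slicing in claim 5, the following Python sketch rearranges a `[C][H][W]` feature map into `scale * scale * C` channels of size `H/scale` by `W/scale`. The order in which the sub-feature maps are concatenated along the channel dimension is an assumption of this sketch, the function name `space_to_depth` is illustrative, and the subsequent stride-1 convolution is omitted.

```python
def space_to_depth(x, scale=2):
    """Slice a [C][H][W] feature map into strided sub-maps and stack them.

    For every offset pair (dy, dx) with 0 <= dy, dx < scale, the sub-map
    keeps rows dy, dy + scale, ... and columns dx, dx + scale, ... of each
    channel. All sub-maps are concatenated along the channel axis, so no
    input value is discarded.
    """
    height, width = len(x[0]), len(x[0][0])
    if height % scale or width % scale:
        raise ValueError("height and width must be divisible by scale")
    out = []
    for dy in range(scale):
        for dx in range(scale):
            for channel in x:
                out.append([row[dx::scale] for row in channel[dy::scale]])
    return out


if __name__ == "__main__":
    # One channel, 4x4, filled with 0..15 so the rearrangement is easy to follow.
    fmap = [[[r * 4 + c for c in range(4)] for r in range(4)]]
    for sub_map in space_to_depth(fmap, scale=2):
        print(sub_map)
```

With `scale=2`, a single 4×4 channel becomes four 2×2 channels holding the same sixteen values, which is the property that distinguishes this downsampling from a strided convolution or pooling.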