Multi-scale ship target detection method based on heterogeneous backbone fusion and noise suppression
By constructing a dual heterogeneous backbone network that runs in parallel with CSPDarknet53 and ShuffleNetV2, and combining dynamic feature fusion and adaptive noise suppression, the problems of insufficient generalization ability and difficulty in detecting small targets in ship target detection are solved, achieving higher detection accuracy and stability.
Patent Information
- Authority / Receiving Office: CN · China
- Patent Type: Patents (China)
- Current Assignee / Owner: HUNAN INST OF ADVANCED TECH
- Filing Date: 2026-04-16
- Publication Date: 2026-06-23
Figure CN122049720B_ABST
Abstract
Description
Technical Field
[0001] This application relates to the field of image target detection technology, and in particular to a multi-scale ship target detection method based on heterogeneous backbone fusion and noise suppression.
Background Technology
[0002] Ship target detection is one of the main methods for maritime safety assessment. With the widespread application of optical satellite imagery in maritime safety assessment, research on ship target detection algorithms has undergone a significant evolution from traditional methods to deep learning paradigms. Early methods mostly followed a sequential process of "sea-land separation - environmental effect suppression - candidate target extraction - false alarm filtering - type classification," with the core challenge being target separation and discrimination in complex sea conditions. Although traditional methods have advantages in speed, such as those based on threshold segmentation, visual saliency, or shape and texture features, they rely on manually designed features. When faced with real-world scenarios involving densely packed targets, varying scales, and adverse lighting and weather conditions, their generalization ability is limited, resulting in a high false alarm rate.
[0003] To overcome the aforementioned limitations, deep learning-based methods have become the mainstream of current research. Their core advantage lies in the ability to automatically learn discriminative feature representations from massive amounts of data. Depending on whether rotated candidate regions are generated, they fall into two categories. The first comprises two-stage rotated object detection methods, which achieve high-precision orientation awareness through carefully designed rotated candidate region generation and feature alignment mechanisms, resulting in leading detection accuracy. However, because of their relatively complex multi-stage process, their detection speed is usually slower than that of single-stage methods. The second category comprises end-to-end single-stage rotated object detection methods, anchor-free methods, and DETR-based methods. These have significant advantages in model efficiency and detection speed, and their performance is continuously improved through techniques such as feature alignment modules, Gaussian distribution loss, and polar coordinate representation. However, their generalization and stability still vary across scenarios and datasets, and DETR-based methods often suffer from slow training convergence. To address the unique challenges of ship target detection, some studies have developed designs specific to ship targets in optical remote sensing images. For example, some have enhanced multi-scale ship feature extraction and high-level semantic feature representation by applying attention mechanisms, but this also increases model complexity and training cost. Others have improved detection performance by exploiting the aspect ratio and angle attributes of the ship target itself; this approach, however, relies on accurate aspect ratio and angle information, which may be affected by annotation accuracy in practical applications. Still others have improved bounding box regression accuracy by modifying the loss function, but the improvement in accuracy is limited.
Summary of the Invention
[0004] Therefore, it is necessary to provide a multi-scale ship target detection method based on heterogeneous backbone fusion and noise suppression that can significantly improve the model's ability to distinguish and locate ship targets, especially small targets, in order to address the above-mentioned technical problems.
[0005] A multi-scale ship target detection method based on heterogeneous backbone fusion and noise suppression, the method comprising:
[0006] Acquire optical satellite remote sensing images; construct a multi-scale ship target detection model; the multi-scale ship target detection model includes a first backbone network, a second backbone network, a dynamic feature fusion module, an adaptive noise suppression module, a PAN neck network, and a multi-scale detection head;
[0007] Optical satellite remote sensing images are input in parallel into the first backbone network and the second backbone network to obtain multi-level, multi-scale first feature maps and second feature maps, respectively.
[0008] The feature maps of the same semantic level output by the first backbone network and the second backbone network are input into the dynamic feature fusion module. By calculating the feature differences between the two feature maps, difference-guided fusion weights are generated, and the two feature maps are adaptively fused based on the fusion weights to obtain multi-scale fused feature maps at the semantic level.
[0009] The fused feature map is input into the adaptive noise suppression module. Using the original image as a guide, the fused feature map is subjected to guided filtering. The filtering intensity is dynamically adjusted according to the local image content to obtain the purified feature map.
[0010] The purified feature maps from multiple semantic levels are input into the neck network and enhanced through a bottom-up path to generate a multi-scale feature pyramid.
[0011] The multi-scale feature pyramid is detected using a multi-scale detection head, and the rotated detection bounding box and category information of the ship target are output.
[0012] The aforementioned multi-scale ship target detection method based on heterogeneous backbone fusion and noise suppression constructs a dual heterogeneous backbone network structure that includes CSPDarknet53 and ShuffleNetV2 in parallel. CSPDarknet53 extracts semantically rich high-level features through dense convolutions and residual connections, providing strong semantic support for target discrimination in complex scenes. ShuffleNetV2 uses channel shuffling and grouped convolutions to preserve image details and low-level texture features with extremely low computational cost. Both methods capture information at different semantic levels simultaneously, enabling the model to perceive the global semantic associations of large-scale targets and accurately capture the edge and texture details of small-sized targets, thereby fundamentally enhancing the model's adaptability to multi-scale targets. Secondly, addressing the issue of significant differences in generalization performance of single-stage methods across different scenarios, this application designs a dynamic feature fusion module based on feature difference modeling. By calculating the feature differences between two backbone networks at the same spatial location, a difference attention map is generated. When the feature differences are large, more complementary information is retained; when the differences are small, global priors are fused to suppress redundancy. This difference-aware fusion strategy effectively solves the scale and semantic gap between heterogeneous features, avoids feature redundancy and information loss caused by simple splicing or addition, and significantly improves the representation ability of targets in complex backgrounds. An adaptive noise suppression module is proposed, using the original image as a structural guide to perform guided filtering on the fused features. A lightweight noise estimation network dynamically predicts the smoothing intensity of each pixel. 
Strong smoothing is applied in low-frequency regions covered by clouds to eliminate interference, moderate smoothing is applied in wave texture regions to retain potential targets, and details are preserved at the edges of ships for accurate positioning. Compared with the fixed feature enhancement or simple attention mechanisms of the background art, this adaptive filtering method, which transfers the spatial structure prior of the original image to the feature space, suppresses complex background noise more accurately while retaining key target features. This application constructs a multi-scale feature pyramid by enhancing the bottom-up information path through a PAN neck network, and combines it with an OBB detection head that supports rotated detection boxes, enabling the model to accurately detect ship targets with arbitrary orientations. In the feature extraction stage, this application achieves multi-scale feature complementarity through heterogeneous dual backbones; in the feature fusion stage, it achieves effective integration of heterogeneous features through a difference-aware strategy; and in the feature purification stage, it achieves precise suppression of background noise through adaptive guided filtering. These measures significantly improve the model's ability to discriminate and locate ship targets, especially small targets, and solve technical problems of the background art such as limited generalization ability, high false alarm rate, and difficulty in detecting small targets.
Attached Figure Description
[0013] Figure 1 is a flowchart of a multi-scale ship target detection method based on heterogeneous backbone fusion and noise suppression in one embodiment;
[0014] Figure 2 is a schematic diagram of a dual heterogeneous backbone fusion network structure in one embodiment;
[0015] Figure 3 is a schematic diagram of a basic unit of ShuffleNetV2 in one embodiment;
[0016] Figure 4 is a schematic diagram of the ShuffleNetV2 downsampling module in another embodiment;
[0017] Figure 5 is a structural diagram of the DFF module in one embodiment.
Detailed Implementation
[0018] To make the objectives, technical solutions, and advantages of this application clearer, the following detailed description is provided in conjunction with the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative and not intended to limit the scope of this application.
[0019] In one embodiment, as shown in Figure 1, a multi-scale ship target detection method based on heterogeneous backbone fusion and noise suppression is provided, including the following steps:
[0020] Step 102: Acquire optical satellite remote sensing images; construct a multi-scale ship target detection model; the multi-scale ship target detection model includes a first backbone network, a second backbone network, a dynamic feature fusion module, an adaptive noise suppression module, a PAN neck network, and a multi-scale detection head.
[0021] The acquired optical satellite remote sensing images are used as input images to be detected. These images may contain complex background interference such as cloud cover, sea surface flares, and wave textures. The overall architecture of the constructed multi-scale ship target detection model is shown in Figure 2: the first backbone network is a high-precision backbone network using the CSPDarknet53 architecture; the second backbone network is a lightweight backbone network built on ShuffleNetV2; the PAN (Path Aggregation Network) neck network is used to enhance multi-scale feature representation; and the multi-scale detection head uses the OBB (Oriented Bounding Box) module to support rotated detection box output.
[0022] Specifically, Backbone1 (high-precision backbone) adopts the CSPDarknet architecture, consisting of a CBS module (Conv+BN+SiLU), a C2f module (containing a residual structure with cross-layer connections), and SPPF (spatial pyramid pooling). This backbone, through dense convolutions and residual connections, can extract semantically rich and expressive high-level features, providing high-precision feature support for the detection of small targets and complex scenes. Backbone2 (lightweight backbone) is built on ShuffleNetV2, consisting of a Conv_bn module (Conv+BN), a MaxPool2d downsampling module, multiple stages (including ChannelShuffle and PointwiseGroupConvolution of ShuffleNetV2), and SPPF. This backbone, through channel shuffling and grouped convolutions, significantly reduces computation and parameter count while preserving the ability to capture image details and low-level textures, providing lightweight low-level and mid-level features for the network.
[0023] The specific structure of Backbone1 is as follows:
[0024] Layer 0: Conv, [64,3,2];
[0025] Input: Original image;
[0026] Operation: Convolution kernel 3×3, stride 2, number of channels 64, to achieve the first downsampling;
[0027] Output: Feature map at resolution P1 / 2;
[0028] Layer 1: Conv, [128,3,2];
[0029] Operation: Further downsampling, expanding the number of channels to 128;
[0030] Output: P2 / 4 resolution feature map;
[0031] Layer 2: C2f module, repeated 3 times, 128 channels;
[0032] Operation: A cross-stage feature fusion structure is introduced, which incorporates residual connections and grouped convolutions to enhance feature reuse and gradient flow;
[0033] Output: Maintain P2 / 4 resolution, with further refinement of features;
[0034] Layer 3: Conv, [256,3,2];
[0035] Operation: Third downsampling, increasing the number of channels to 256;
[0036] Output: P3 / 8 resolution feature map;
[0037] Layer 4: C2f module, repeated 6 times, 256 channels;
[0038] Output: P3 / 8 deep features, suitable for medium target detection;
[0039] Layer 5: Conv, [512,3,2];
[0040] Operation: Fourth downsampling, 512 channels;
[0041] Output: P4 / 16 resolution feature map;
[0042] Layer 6: C2f module, repeated 6 times, 512 channels;
[0043] Output: P4 / 16 deep features;
[0044] Layer 7: Conv, [1024,3,2];
[0045] Operation: Fifth downsampling, 1024 channels;
[0046] Output: P5 / 32 resolution feature map;
[0047] Layer 8: C2f module, repeated 3 times, 1024 channels;
[0048] Output: P5 / 32 deep semantic features, suitable for large target detection;
[0049] Layer 9: SPPF module, output channel 1024;
[0050] Operation: Spatial pyramid pooling enhances the global information representation capability of high-level features;
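The resolution labels P1/2 to P5/32 in the Backbone1 listing follow directly from the five stride-2 convolutions. The sketch below reproduces that arithmetic. The padding of kernel // 2 and the 256×256 input size are assumptions chosen for illustration; the source states neither.

```python
# Strided Conv layers from the Backbone1 listing: (layer index, out_channels, kernel, stride).
# Only these layers change the resolution; the C2f and SPPF layers keep it.
DOWNSAMPLING_CONVS = [
    (0, 64, 3, 2),
    (1, 128, 3, 2),
    (3, 256, 3, 2),
    (5, 512, 3, 2),
    (7, 1024, 3, 2),
]


def conv_output_size(size, kernel, stride):
    """Spatial size after a convolution, assuming padding = kernel // 2."""
    padding = kernel // 2
    return (size + 2 * padding - kernel) // stride + 1


def backbone1_pyramid(input_size):
    """Return (level name, channels, spatial size) for P1/2 through P5/32."""
    levels = []
    size = input_size
    for level, (_, channels, kernel, stride) in enumerate(DOWNSAMPLING_CONVS, start=1):
        size = conv_output_size(size, kernel, stride)
        levels.append((f"P{level}/{2 ** level}", channels, size))
    return levels


# 256 is a hypothetical input size, used only to make the numbers concrete.
for name, channels, size in backbone1_pyramid(256):
    print(f"{name}: {channels} x {size} x {size}")
```

Each level halves the spatial size of the previous one, so level Pk has a total stride of 2^k relative to the input.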
[0051] The specific structure of Backbone2 is as follows:
[0052] Layers 10-12: ShuffleNetV2 module;
[0053] Input: Original image;
[0054] Operation: Channel shuffle and grouped convolution are used to achieve efficient feature extraction and channel interaction;
[0055] Output: Three sets of multi-scale feature maps, with the following dimensions:
[0056] 1×192×32×32 (Shallow high-resolution features);
[0057] 1×384×16×16 (Middle layer features);
[0058] 1×768×8×8 (deep semantic features);
[0059] Function: Extracts high-resolution features with rich details, suitable for small target detection;
[0060] Layer 13: SPPF module;
[0061] Input: Feature map from the last layer of ShuffleNetV2 (768 channels, 8×8 resolution);
[0062] Operation: Spatial pyramid pooling is achieved through multiple parallel max pooling layers, fusing multi-scale contextual information;
[0063] Output: 1024-channel feature map, enhanced receptive field;
[0064] Function: To improve the global perception capability of features without significantly increasing the amount of computation.
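The SPPF step can be illustrated on a one-dimensional signal. The text describes parallel max pooling layers; SPPF blocks in YOLO-style networks commonly compute the same result as a cascade of identical small pools, which is cheaper. The sketch below, assuming a kernel of 5 and stride 1 (neither is stated in the source), checks that the cascade matches parallel pools with kernels 9 and 13.

```python
def max_pool_1d(values, kernel):
    """Stride-1 max pooling with 'same' padding; padded cells never win the max."""
    radius = kernel // 2
    n = len(values)
    return [max(values[max(0, i - radius): min(n, i + radius + 1)]) for i in range(n)]


def sppf_branches(values, kernel=5):
    """Return the identity branch plus three cascaded pooling branches.

    In a real SPPF block the four branches are concatenated along the
    channel dimension and mixed by a convolution; that part is omitted here.
    """
    p1 = max_pool_1d(values, kernel)
    p2 = max_pool_1d(p1, kernel)
    p3 = max_pool_1d(p2, kernel)
    return [list(values), p1, p2, p3]


signal = [0, 3, 1, 0, 0, 7, 0, 0, 2, 0, 0, 0, 5, 0]
branches = sppf_branches(signal)

# Two cascaded k=5 pools cover a window of 9, three cover a window of 13.
print(branches[2] == max_pool_1d(signal, 9))
print(branches[3] == max_pool_1d(signal, 13))
```

The growing windows are what enlarge the receptive field without adding learnable parameters.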
[0065] Step 104: Input the optical satellite remote sensing images in parallel into the first backbone network and the second backbone network to obtain the first feature map and the second feature map with multiple levels and scales, respectively.
[0066] By using parallel input, two feature extraction paths simultaneously capture information from different semantic levels. The first backbone network (CSPDarknet53) consists of a CBS module (Conv+BN+SiLU), a C2f module (containing residual structures with cross-layer connections), and SPPF (spatial pyramid pooling), which can extract high-level features with rich semantic information and strong expressive power. The second backbone network (ShuffleNetV2) consists of a Conv_bn module (Conv+BN), a MaxPool2d downsampling module, and multiple stages (including ChannelShuffle and PointwiseGroupConvolution of ShuffleNetV2) and SPPF. Through channel shuffling and grouped convolution, it reduces the amount of computation while retaining the ability to capture image details and low-level textures.
[0067] These two heterogeneous backbone networks enrich the feature extraction paths, enabling comprehensive capture of information at different semantic levels simultaneously. With two parallel feature extraction paths, the model interprets the same image at different scales and from different perspectives, which greatly enhances its ability to discriminate targets and yields more accurate judgments for both subtle differences in detail and complex semantic relationships. Furthermore, feature fusion from the two paths allows the model to better adapt to targets of different scales, preventing detection bias caused by targets being too large or too small. Moreover, during training and deployment, the shared input image leads to more stable parameter optimization, reducing the risk of unstable training states. ShuffleNetV2, whose basic unit is shown in Figure 3, is a high-efficiency, lightweight convolutional neural network for mobile and embedded devices; it can achieve the highest possible detection accuracy with minimal computational resources and memory usage, facilitating subsequent embedded deployment of the network. Meanwhile, as shown in Figure 4, in units requiring spatial downsampling the plain identity mapping is removed: the stride of the 3×3 depthwise convolution is set to 2, and a 3×3 average pooling with a stride of 2 is applied on the shortcut branch before concatenation. This reduces resolution while preserving as much information as possible. Compared with simple pooling or large-stride convolutional downsampling, this combination of pooling and depthwise convolutional downsampling loses less information when reducing resolution, and is particularly beneficial for preserving subtle clues and small object information in the background in the early stages. By stacking these basic units and downsampling units, ShuffleNetV2 naturally generates feature maps with different spatial resolutions.
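The channel shuffle operation that gives ShuffleNetV2 its name is a fixed permutation of channels. A minimal sketch follows, operating on a list of channel indices rather than on tensors; the function name and the group count of 2 are illustrative choices.

```python
def channel_shuffle(channels, groups):
    """Interleave channels across groups.

    Equivalent to viewing the channel axis as (groups, channels_per_group),
    transposing to (channels_per_group, groups), and flattening again.
    """
    if len(channels) % groups != 0:
        raise ValueError("number of channels must be divisible by groups")
    per_group = len(channels) // groups
    return [channels[g * per_group + c] for c in range(per_group) for g in range(groups)]


# Channels 0-2 belong to the first group, channels 3-5 to the second.
print(channel_shuffle([0, 1, 2, 3, 4, 5], groups=2))  # [0, 3, 1, 4, 2, 5]
```

After the shuffle, each consecutive pair of channels holds one channel from each group, so the next grouped convolution sees information from both branches.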
[0068] Step 106: Input the feature maps of the same semantic level output by the first backbone network and the second backbone network into the dynamic feature fusion module. By calculating the feature differences between the two feature maps, generate difference-guided fusion weights, and adaptively fuse the two feature maps based on the fusion weights to obtain multi-scale fused feature maps at the semantic level.
[0069] The structure of the Dynamic Feature Fusion (DFF) module is shown in Figure 5. Because the feature maps output by the two backbone networks are heterogeneous in scale and semantics, direct fusion may lead to information loss. The DFF module introduces a complementary differential attention mechanism, which generates a differential attention map by calculating the feature differences between the two backbone networks at the same spatial location, guiding the adaptive adjustment of the fusion weights. Specifically, when the feature differences are large, the module tends to retain more complementary information; when the feature differences are small, it fuses global priors to suppress redundancy, thereby effectively addressing the scale and semantic gap between heterogeneous features.
[0070] Specifically, the DFF fusion module based on feature difference modeling operates as follows:
[0071] Fusion Point 1 (Layer 14): Input: 192×32×32 features from Backbone2 + 128×32×32 features from Backbone1; Output channel configuration: [48,128,64]; Function: Fusion of shallow high-resolution features to enhance detail preservation.
[0072] Fusion Point 2 (Layer 15): Input: 384×16×16 features from Backbone2 + 256×16×16 features from Backbone1; Output channel configuration: [96,256,128]; Function: Fusion of mid-layer features to balance semantics and details.
[0073] Fusion Point 3 (Layer 16): Input: 1024×8×8 features from Backbone2 + 1024×8×8 features from Backbone1; Output channel configuration: [256,512,256]; Function: Fusion of deep semantic features to enhance classification and localization capabilities.
[0074] For feature fusion of dual-backbone networks, a DFF module based on feature difference modeling is adopted. This module introduces a complementary difference attention mechanism, which generates a difference attention map by calculating the feature differences between the two backbone networks at the same spatial location, guiding the adaptive adjustment of fusion weights. When the feature differences are large, the module tends to retain more complementary information, thereby enhancing the discriminative representation in multi-scale features; when the feature differences are small, global priors are fused to suppress redundancy. This difference-aware fusion strategy effectively solves the scale and semantic gap between heterogeneous features, significantly improves the representation ability of targets in complex backgrounds, while maintaining lightweight cascading operations and controllable additional computational overhead, making it suitable for edge deployment.
[0075] Step 108: Input the fused feature map into the adaptive noise suppression module. Using the original image as a guide, perform guided filtering on the fused feature map and dynamically adjust the filtering intensity according to the local image content to obtain the purified feature map.
[0076] The Adaptive Noise Suppression (ANSM) module uses the original optical satellite imagery as a structure guide to perform guided filtering on the fused multi-scale feature map, and dynamically predicts the smoothing intensity for each pixel through a lightweight noise estimation network. Specifically, ANSM adjusts the filtering parameters based on local image content: strong smoothing is applied to low-frequency regions covered by clouds to eliminate interference, moderate smoothing is applied to wave texture regions to preserve potential targets, and details are preserved at ship edges for accurate localization. By transferring the spatial structure prior of the original image to the feature space, this module effectively cleanses features while maximizing the preservation of target edge and contour information.
[0077] Specifically, the adaptive noise suppression module operates as follows:
[0078] Layer 17: The feature map of fusion point 1 is input into the adaptive noise suppression module for noise filtering;
[0079] Layer 18: The feature map of fusion point 2 is input into the adaptive noise suppression module for noise filtering;
[0080] Layer 19: The feature map of fusion point 3 is input into the adaptive noise suppression module for noise filtering;
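The idea of guided filtering with a per-pixel smoothing strength can be sketched in one dimension. The code below is a standard guided filter in which the regularization term eps is allowed to vary per position. Treating the noise estimation network's output as this eps map is an assumption for illustration; the source does not specify how the predicted smoothing intensity enters the filter, and the function names are hypothetical.

```python
def box_mean(values, radius):
    """Mean over a sliding window of the given radius, clipped at the borders."""
    n = len(values)
    out = []
    for i in range(n):
        window = values[max(0, i - radius): min(n, i + radius + 1)]
        out.append(sum(window) / len(window))
    return out


def adaptive_guided_filter_1d(guide, feature, eps_map, radius=1):
    """Guided filter where eps_map[i] > 0 sets the smoothing strength at position i.

    Small eps keeps structure that is present in the guide (edges);
    large eps pushes the output toward a local average (strong smoothing).
    """
    mean_g = box_mean(guide, radius)
    mean_f = box_mean(feature, radius)
    mean_gf = box_mean([g * f for g, f in zip(guide, feature)], radius)
    mean_gg = box_mean([g * g for g in guide], radius)

    a, b = [], []
    for i in range(len(guide)):
        var_g = max(mean_gg[i] - mean_g[i] * mean_g[i], 0.0)
        cov_gf = mean_gf[i] - mean_g[i] * mean_f[i]
        a_i = cov_gf / (var_g + eps_map[i])  # local linear coefficient
        a.append(a_i)
        b.append(mean_f[i] - a_i * mean_g[i])

    mean_a = box_mean(a, radius)
    mean_b = box_mean(b, radius)
    return [ma * g + mb for ma, g, mb in zip(mean_a, guide, mean_b)]


# Hypothetical data: a clean edge in the guide, the same edge plus noise in the feature.
guide = [0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0]
feature = [0.1, -0.1, 0.05, 0.0, 0.9, 1.1, 1.0, 0.95]
# Stronger smoothing on the left half than on the right half.
eps_map = [1.0, 1.0, 1.0, 1.0, 1e-3, 1e-3, 1e-3, 1e-3]
print(adaptive_guided_filter_1d(guide, feature, eps_map))
```

In the method described above, the guide would be the original image resized to the feature resolution and the filter would run in two dimensions per channel; the one-dimensional form is used here only to keep the arithmetic visible.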
[0081] After fusion, the features are upsampled, concatenated, and refined using C2f to construct a three-scale detection head:
[0082] P3 / 8 Detection Head (Layer 25): Suitable for small target detection;
[0083] P4 / 16 Detection Head (Layer 28): Suitable for medium-sized target detection;
[0084] P5 / 32 Detection Head (Layer 31): Suitable for large target detection;
[0085] Final output (Layer 32): OBB module, which supports rotated detection box output; the number of categories is not specified.
[0086] Step 110: Input the purified feature maps from multiple semantic levels into the neck network and generate a multi-scale feature pyramid through bottom-up path enhancement.
[0087] The neck network adopts a PAN (Path Aggregation Network) structure, which enhances the information flow of the bottom-up path to supplement the information flow of the top-down path, thereby constructing a richer multi-scale feature pyramid and providing feature maps containing information of different scales for the subsequent detection head.
[0088] Step 112: Detect the multi-scale feature pyramid using the multi-scale detection head, and output the rotated detection box and category information of the ship target.
[0089] The detection head uses an OBB (Oriented Bounding Box) module, which supports rotated detection box output and can accurately detect ship targets in any orientation, with an unspecified number of categories. The final detection result includes the target's category information and the position information of the rotated detection box.
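A rotated detection box is often stored as a center, a size, and an angle, and converted to four corner points for drawing or overlap computation. The sketch below assumes the parameterization (cx, cy, w, h, theta) with theta in radians measured counter-clockwise in a standard x-right, y-up frame. The source does not state which parameterization the OBB module uses, so this is only one common convention.

```python
import math


def obb_to_corners(cx, cy, w, h, theta):
    """Return the four corners of a rotated box as (x, y) tuples.

    The corners of the axis-aligned box are rotated by theta around the center.
    """
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    half_extents = [(-w / 2, -h / 2), (w / 2, -h / 2), (w / 2, h / 2), (-w / 2, h / 2)]
    return [
        (cx + dx * cos_t - dy * sin_t, cy + dx * sin_t + dy * cos_t)
        for dx, dy in half_extents
    ]


# A hypothetical 40 x 10 ship-shaped box centered at (100, 50), rotated by 30 degrees.
for x, y in obb_to_corners(100.0, 50.0, 40.0, 10.0, math.radians(30)):
    print(f"({x:.2f}, {y:.2f})")
```

Because ships are long and thin, an axis-aligned box around a diagonal ship would be mostly water; the angle term is what lets the box stay tight.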
[0090] The aforementioned multi-scale ship target detection method based on heterogeneous backbone fusion and noise suppression constructs a dual heterogeneous backbone network structure to extract multi-scale and multi-level features in parallel, enriching the feature extraction path. A dynamic feature fusion module enables adaptive fusion of the differential features between the two backbones, bridging the scale and semantic gaps between heterogeneous features. An adaptive noise suppression module utilizes the original image to guide filtering and dynamically adjusts the filtering intensity, effectively suppressing interference from complex backgrounds such as clouds and waves. This method significantly improves the model's ability to distinguish and locate ship targets, especially small targets.
[0091] In one embodiment, the first backbone network consists of a CBS module, a C2f module, and an SPPF module in sequence, outputting first feature maps at five scales: P1 / 2, P2 / 4, P3 / 8, P4 / 16, and P5 / 32; the second backbone network consists of a Conv_bn module, a max pooling downsampling module, a channel shuffling module, a grouped convolution module, and an SPPF module, outputting second feature maps at three scales: 32×32, 16×16, and 8×8.
[0092] Specifically, the structure of the first backbone network (CSPDarknet53) is as follows: Layer 0, Conv, [64,3,2], takes the original image as input, with a 3×3 kernel, a stride of 2, and 64 channels, performing the first downsampling and outputting a P1 / 2 resolution feature map; Layer 1, Conv, [128,3,2], performs further downsampling, expanding the number of channels to 128 and outputting a P2 / 4 resolution feature map; Layer 2, C2f module, repeated 3 times, with 128 channels, is a cross-stage feature fusion structure that introduces residual connections and grouped convolutions and outputs a P2 / 4 resolution feature map; Layer 3, Conv, [256,3,2], performs the third downsampling, increasing the number of channels to 256 and outputting a P3 / 8 resolution feature map; Layer 4, C2f module, repeated 6 times, 256 channels, outputs P3 / 8 deep features; Layer 5, Conv, [512,3,2], performs the fourth downsampling, with 512 channels, outputting a P4 / 16 resolution feature map; Layer 6, C2f module, repeated 6 times, 512 channels, outputs P4 / 16 deep features; Layer 7, Conv, [1024,3,2], performs the fifth downsampling, with 1024 channels, outputting a P5 / 32 resolution feature map; Layer 8, C2f module, repeated 3 times, 1024 channels, outputs P5 / 32 deep semantic features; Layer 9, SPPF module, outputs 1024 channels, with spatial pyramid pooling enhancing the global information representation capability of high-level features.
The specific structure of the second backbone network (ShuffleNetV2) is as follows: Layers 10-12 of the ShuffleNetV2 module employ channel shuffling and grouped convolutions to output three sets of multi-scale feature maps with sizes of 1×192×32×32 (shallow high-resolution features), 1×384×16×16 (mid-level features), and 1×768×8×8 (deep semantic features); Layer 13, the SPPF module, takes the feature map (768 channels, 8×8 resolution) from the last layer of ShuffleNetV2 as input and performs spatial pyramid pooling through multiple parallel max-pooling layers, outputting a 1024-channel feature map to enhance the receptive field. Through this heterogeneous dual-backbone structure, CSPDarknet53 provides rich high-level semantic features, while ShuffleNetV2 preserves details and texture features with lower computational cost. The two complement each other, enhancing the model's ability to discriminate multi-scale objects and facilitating subsequent embedded deployment applications.
[0093] In one embodiment, feature maps of the same semantic level output by the first backbone network and the second backbone network are input to the dynamic feature fusion module. By calculating the feature differences between the two feature maps, difference-guided fusion weights are generated, including:
[0094] The feature differences between the first backbone network and the second backbone network at the same spatial location are calculated to obtain the feature difference metric matrix; the feature difference metric matrix is mapped to a difference attention map; the feature maps of the first backbone network and the second backbone network are concatenated along the channel dimension and then passed through a convolution to generate initial fusion weights; the difference attention map is combined with the initial fusion weights to obtain difference-guided fusion weights.
[0095] Specifically, the structure of the Dynamic Feature Fusion (DFF) module in this embodiment is shown in Figure 5. Its core idea is to adaptively fuse multi-scale local features based on feature differences and global information. First, the degree of difference between the two backbone networks at the same location is quantified using a feature difference metric. Then, a lightweight convolutional network maps the difference metric to a difference attention map. Simultaneously, the two feature maps are concatenated along channels and then convolved to generate initial fusion weights. Finally, the difference attention map is combined with the initial fusion weights to obtain difference-guided fusion weights. This difference-aware fusion strategy effectively addresses the scale and semantic gap between heterogeneous features, significantly improving the representation ability of targets in complex backgrounds, while maintaining lightweight cascaded operations and controllable additional computational overhead.
[0096] In one embodiment, the feature differences between the first backbone network and the second backbone network at the same spatial location are calculated to obtain a feature difference metric matrix, including:
[0097] The feature differences between the first and second backbone networks at the same spatial location are calculated, resulting in the feature difference metric matrix:
[0098] $D(i,j) = \lVert F_1(i,j) - F_2(i,j) \rVert$;
[0099] where $F_1(i,j)$ and $F_2(i,j)$ are the feature vectors of the first and second backbone networks at position $(i,j)$, respectively.
[0100] Specifically, the difference between the feature vectors of the two backbone networks is calculated at each spatial location $(i,j)$. The larger the value of $D(i,j)$, the more significant the feature difference between the two backbone networks at that location, and the stronger their complementarity. This metric matrix can quantify the degree of difference between heterogeneous features, providing a basis for subsequent dynamic fusion.
[0101] In one embodiment, mapping the feature dissimilarity metric matrix to a dissimilarity attention map includes:
[0102] The feature difference metric matrix is mapped to a difference attention map as follows:
[0103] $A = \sigma(\mathrm{Conv}_2(\mathrm{Conv}_1(D)))$;
[0104] where $D$ is the feature difference metric matrix, $\mathrm{Conv}_1$ denotes the first convolution operation, $\mathrm{Conv}_2$ denotes the second convolution operation, and $\sigma$ denotes the Sigmoid activation function.
[0105] Specifically, the feature difference metric matrix is mapped to a difference attention map with values ranging from (0,1) using two convolutional layers and a sigmoid activation function. The values of the difference attention map represent the degree of feature difference between the two backbone networks at that location, and are subsequently used to guide the allocation of fusion weights.
[0106] In one embodiment, the feature maps of the first backbone network and the second backbone network are concatenated along the channel dimension and then convolved to generate initial fusion weights; the difference attention map is combined with the initial fusion weights to obtain difference-guided fusion weights, including:
[0107] The feature maps of the first and second backbone networks are concatenated along the channel dimension and then convolved to generate the initial fusion weights:
[0108] $W = \sigma(\mathrm{Conv}_1(\mathrm{Concat}(F_1, F_2)))$;
[0109] where $F_1$ and $F_2$ are the feature maps of the first and second backbone networks, respectively, $\mathrm{Concat}$ denotes concatenation along the channel dimension, $\mathrm{Conv}_1$ denotes the first convolution operation, and $\sigma$ denotes the Sigmoid activation function.
[0110] Combining the difference attention map with the initial fusion weights, the difference-guided fusion weights are obtained as follows:
[0111] $W_d(i,j) = A(i,j)\,W(i,j) + \left(1 - A(i,j)\right)\bar{W}$;
[0112] where $\bar{W}$ is the average of $W$ over the spatial dimensions, $A$ is the difference attention map, and $W(i,j)$ is the value of the initial fusion weights at spatial position $(i,j)$.
[0113] Finally, the feature maps of the first and second backbone networks, after the feature difference modeling described above, yield the new features as follows:
[0114] ;
[0115] ;
[0116] In this way, regions with large differences will rely more on dynamic weights, while regions with small differences will tend to use global average weights, thus achieving adaptive fusion that is difference-aware.
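The behaviour described above can be sketched as follows, assuming the blend takes the form `A * W + (1 - A) * mean(W)`, which matches the stated behaviour but is an assumption about the exact formula. `guided_weights` is a hypothetical name, and both inputs are single-channel H x W maps stored as nested lists.

```python
def guided_weights(attention, weights):
    """Blend per-pixel fusion weights with their spatial mean.

    Assumed form: Wd = A * W + (1 - A) * mean(W), where A is the
    difference attention map with values in (0, 1).
    """
    height, width = len(weights), len(weights[0])
    mean_w = sum(sum(row) for row in weights) / (height * width)
    return [
        [
            attention[i][j] * weights[i][j] + (1.0 - attention[i][j]) * mean_w
            for j in range(width)
        ]
        for i in range(height)
    ]

# Column 0: large difference (A = 1) keeps the dynamic weight.
# Column 1: no difference (A = 0) falls back to the global average weight.
attention = [[1.0, 0.0]]
weights = [[0.75, 0.25]]
wd = guided_weights(attention, weights)
```

With these toy values the first position keeps its own weight of 0.75, while the second is replaced by the spatial mean of 0.5.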
[0117] The new features obtained after modeling the aforementioned differences are concatenated along the channel dimension to form the concatenated feature $F'$:
[0118] ;
[0119] To ensure that subsequent modules can use the fused features, the number of channels must be restored to the original number C through a channel reduction mechanism. Channel reduction in DFF does not simply apply a 1×1×1 convolution; it is guided by global channel information $w_{ch}$. This information is extracted from the concatenated feature $F'$ through cascaded average pooling (AVGPool), a convolutional layer (Conv1), and the Sigmoid activation function $\sigma$, and describes feature importance:
[0120] $w_{ch} = \sigma(\mathrm{Conv}_1(\mathrm{AVGPool}(F')))$;
[0121] After the fused features are calibrated using the global channel information, a 1×1×1 convolutional layer (Conv1) selects feature maps based on importance. The channel information guides the convolutional layer to retain important features while discarding features with low information content:
[0122] $F_{ch} = \mathrm{Conv}_1(w_{ch} \odot F')$;
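The channel-guided reduction described above can be sketched in a few lines. This is an illustration under stated assumptions rather than the patented implementation: feature maps are 2D nested lists indexed `[channel][row][col]`, a 1×1 convolution is modelled as a channel-mixing matrix without bias, and `gate_kernel` and `reduce_kernel` are hypothetical weights.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def conv1x1(x, kernel):
    """1x1 convolution without bias: kernel[o][c] mixes input channels per position."""
    height, width = len(x[0]), len(x[0][0])
    return [
        [
            [sum(k * x[c][i][j] for c, k in enumerate(row)) for j in range(width)]
            for i in range(height)
        ]
        for row in kernel
    ]

def channel_reduce(x, gate_kernel, reduce_kernel):
    """Gate channels by global importance, then reduce the channel count."""
    height, width = len(x[0]), len(x[0][0])
    # Global average pooling: one value per channel, kept as a C x 1 x 1 map.
    pooled = [[[sum(sum(r) for r in ch) / (height * width)]] for ch in x]
    # Channel importance in (0, 1) from a 1x1 convolution and a sigmoid.
    gate = [sigmoid(g[0][0]) for g in conv1x1(pooled, gate_kernel)]
    # Calibrate every channel by its importance, then mix down with a 1x1 conv.
    calibrated = [[[gate[c] * v for v in r] for r in ch] for c, ch in enumerate(x)]
    return conv1x1(calibrated, reduce_kernel)

# Two input channels reduced to one; a zero gate kernel gives a gate of 0.5.
x = [[[2.0]], [[4.0]]]
reduced = channel_reduce(x, gate_kernel=[[0.0, 0.0], [0.0, 0.0]], reduce_kernel=[[1.0, 1.0]])
```

Here each channel is scaled by 0.5 before the reducing convolution sums them, so the single output value is 3.0.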
[0123] To model the spatial dependencies between local feature maps, global spatial information is extracted from the two feature maps through a 1×1×1 convolutional layer (Conv1) and a Sigmoid activation function. It is used to calibrate the feature maps and enhance the weights of salient spatial regions:
[0124] ;
[0125] .
[0126] In one embodiment, the fused feature map is input to an adaptive noise suppression module. Using the original image as a guide, the fused feature map undergoes guided filtering, and the filtering intensity is dynamically adjusted based on local image content to obtain a purified feature map, including:
[0127] The original image is adjusted to the same resolution as the fused feature map and mapped to a single-channel guide map through convolution. The regularization parameter for each spatial location is predicted by a noise estimation network.
[0128] The purified feature map is obtained by independently performing guided filtering on each channel of the fused feature map using the single-channel guide map and the regularization parameters, and introducing a residual connection.
[0129] Specifically, the adaptive noise suppression module uses the original image as a guide to perform guided filtering on the fused multi-scale feature map, and dynamically adjusts the filtering intensity through a noise estimation network to achieve adaptive feature purification against cloud and wave interference.
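[0130] Guided filtering requires that the output image $q$ satisfy a linear relationship with the guide image $G$ within a local window, so that the structural information of $G$ is transferred to $q$ while the noise in the input $p$ is smoothed. Mathematically, for the local window $\omega_k$ centered at pixel $k$ (with radius $r$), it is assumed that:
[0131] $q_i = a_k G_i + b_k, \quad \forall i \in \omega_k$;
[0132] The coefficients $a_k$ and $b_k$ are obtained by minimizing the reconstruction error within the window:
[0133] $E(a_k, b_k) = \sum_{i \in \omega_k} \left( (a_k G_i + b_k - p_i)^2 + \epsilon a_k^2 \right)$;
[0134] where $\epsilon$ is the regularization parameter, which controls the degree of smoothing. Solving gives:
[0135] $a_k = \dfrac{\frac{1}{|\omega|} \sum_{i \in \omega_k} G_i p_i - \mu_k \bar{p}_k}{\sigma_k^2 + \epsilon}, \quad b_k = \bar{p}_k - a_k \mu_k$;
[0136] where $\mu_k$ and $\sigma_k^2$ are the mean and variance of $G$ within the window $\omega_k$, $|\omega|$ is the number of pixels in the window, and $\bar{p}_k$ is the mean of $p$ within the window. The final output is:
[0137] $q_i = \bar{a}_i G_i + \bar{b}_i$;
[0138] where $\bar{a}_i$ and $\bar{b}_i$ are the averages of $a_k$ and $b_k$ over all windows containing pixel $i$.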
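The guided filtering steps above can be sketched for a single-channel image with a fixed regularization parameter. The sketch uses only the standard library and assumes that windows are clipped at the image border (a choice made here for simplicity); `box_mean` and `guided_filter` are hypothetical helper names.

```python
def box_mean(img, r):
    """Mean over a (2r+1) x (2r+1) window, clipped at the image border."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            rows = range(max(0, i - r), min(h, i + r + 1))
            cols = range(max(0, j - r), min(w, j + r + 1))
            total = sum(img[y][x] for y in rows for x in cols)
            out[i][j] = total / (len(rows) * len(cols))
    return out

def mul(a, b):
    """Element-wise product of two equally sized 2D lists."""
    return [[x * y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def guided_filter(guide, p, r, eps):
    """Filter p using guide as the structure prior; eps is a global scalar."""
    h, w = len(p), len(p[0])
    mean_g, mean_p = box_mean(guide, r), box_mean(p, r)
    mean_gg, mean_gp = box_mean(mul(guide, guide), r), box_mean(mul(guide, p), r)
    a = [[0.0] * w for _ in range(h)]
    b = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            var_g = mean_gg[i][j] - mean_g[i][j] ** 2
            cov_gp = mean_gp[i][j] - mean_g[i][j] * mean_p[i][j]
            a[i][j] = cov_gp / (var_g + eps)           # slope of the local linear model
            b[i][j] = mean_p[i][j] - a[i][j] * mean_g[i][j]  # offset of the local model
    # Average the coefficients of all windows that cover each pixel.
    mean_a, mean_b = box_mean(a, r), box_mean(b, r)
    return [[mean_a[i][j] * guide[i][j] + mean_b[i][j] for j in range(w)] for i in range(h)]

guide = [[0.0, 1.0, 2.0], [3.0, 4.0, 5.0], [6.0, 7.0, 8.0]]
flat = [[5.0] * 3 for _ in range(3)]
out_flat = guided_filter(guide, flat, r=1, eps=0.1)     # constant input stays constant
out_same = guided_filter(guide, guide, r=1, eps=1e-12)  # tiny eps preserves structure
```

A constant input has zero covariance with the guide, so the slope vanishes and the output equals the input mean. Filtering the guide by itself with a very small regularization parameter returns it almost unchanged, which is the edge-preserving behaviour the module relies on.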
[0139] Noise intensity varies across regions: areas covered by clouds require strong smoothing (large $\epsilon$), while details must be preserved in the edge areas of ships (small $\epsilon$). The global parameter $\epsilon$ above therefore needs to be made adaptive to achieve adaptive guided filtering. A lightweight noise estimation network predicts an $\epsilon$ value for each pixel, which allows the filter strength to be adjusted locally.
[0140] The Adaptive Noise Suppression (ANSM) module in this embodiment uses the original image as a guide to perform guided filtering on the fused multi-scale feature map, and dynamically adjusts the filtering intensity through a noise estimation network to achieve adaptive feature purification against cloud and wave interference. The core advantage of this module is that by transferring the spatial structure prior of the original image to the feature space, it can effectively purify features while preserving the edge and contour information of the target to the maximum extent.
[0141] In one embodiment, the original image is resized to the same resolution as the fused feature map and mapped to a single-channel guide map via convolution. A noise estimation network predicts regularization parameters for each spatial location, including:
[0142] The original image is adjusted to the same resolution as the fused feature map, and then mapped to a single-channel guide map through convolution:
[0143] $G = \mathrm{Conv}_{1\times1}(I_r)$;
[0144] where $I_r$ is the original image after resolution adjustment, and $\mathrm{Conv}_{1\times1}$ is a 1×1 convolution operation;
[0145] To predict the local noise intensity at each spatial location, a noise estimation network is constructed. Its input is the concatenation of the guide map $G$ and the channel aggregation of the feature map $F$ (average pooling along the channel dimension). After two 3×3 convolutional layers with activation, followed by a function that guarantees a positive output, the noise estimation network predicts the regularization parameter for each spatial location as follows:
[0146] ;
[0147] where the feature map is averaged along the channel dimension, the final function guarantees that the regularization parameter is greater than 0, the convolutions are 3×3 convolution operations, and the convolution is followed by an activation function. This network can infer the required smoothing intensity for a region based on local image content and feature responses.
[0148] Specifically, the guide map $G$ is obtained by mapping the resolution-adjusted original image to a single channel using a 1×1 convolution, which lets the network adaptively learn the structural information in the original image that is most useful for filtering, such as brightness and gradient. The regularization parameter is predicted by a noise estimation network whose input is the concatenation of the guide map $G$ and the channel average of the feature map. The input passes through two layers of 3×3 convolution with activation, and a final function outputs a positive value to ensure that the regularization parameter is greater than 0. Based on local image content and feature responses, the network infers the smoothing intensity required for a region: it outputs a larger regularization parameter in low-frequency regions covered by clouds to strengthen smoothing, and a smaller one in the edge regions of ships to retain details.
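The effect of a per-pixel regularization parameter can be illustrated with a short sketch. The positivity function is not named in the text, so Softplus is used here purely as an assumption; the raw scores stand in for the output of the noise estimation network, and `eps_map` and `linear_coefficient` are hypothetical names.

```python
import math

def softplus(x):
    """Smooth positive mapping log(1 + e^x); the result is always > 0."""
    return math.log1p(math.exp(x))

def eps_map(raw_scores):
    """Turn unconstrained network outputs into positive per-pixel eps values."""
    return [[softplus(v) for v in row] for row in raw_scores]

def linear_coefficient(cov, var, eps):
    """Guided-filter slope a = cov / (var + eps); a larger eps means more smoothing."""
    return cov / (var + eps)

# Hypothetical raw scores: high for a cloud-covered pixel, low for a ship edge.
raw = [[4.0, -4.0]]
eps = eps_map(raw)
a_cloud = linear_coefficient(cov=1.0, var=1.0, eps=eps[0][0])
a_edge = linear_coefficient(cov=1.0, var=1.0, eps=eps[0][1])
```

With identical local statistics, the cloud pixel receives a much smaller slope than the edge pixel, so its output is pulled toward the local mean while the edge pixel keeps following the guide.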
[0149] In one embodiment, guided filtering is performed independently on each channel of the fused feature map, including:
[0150] For each channel of the fused feature map, calculate the local mean, variance, and covariance; calculate the linear coefficients based on the regularization parameter, perform average pooling on the linear coefficients, calculate the filtered features based on the obtained smoothed linear coefficients, and introduce a learnable scalar for residual connection.
[0151] Specifically, guided filtering is performed independently on each channel $c$ of the feature map $F$ to preserve cross-channel diversity. Linear coefficients are obtained by calculating local statistics, the coefficients are then smoothed by average pooling, and finally the filtered features are calculated. To prevent information loss caused by over-smoothing, a residual connection is introduced, allowing the network to weigh the contributions of the original features and the filtered features.
[0152] In one embodiment, local mean, variance, and covariance are calculated for each channel of the fused feature map; linear coefficients are calculated based on regularization parameters, and average pooling is performed on the linear coefficients; filtered features are calculated based on the obtained smoothed linear coefficients; and a learnable scalar is introduced for residual connection, including:
[0153] For each channel of the fused feature map, the local mean, variance, and covariance are calculated as follows:
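[0154] $\mu_G = \mathrm{AvgPool}_r(G), \quad \mu_{F_c} = \mathrm{AvgPool}_r(F_c)$;
[0155] $\sigma_G^2 = \mathrm{AvgPool}_r(G \odot G) - \mu_G \odot \mu_G$;
[0156] $\mathrm{cov}(G, F_c) = \mathrm{AvgPool}_r(G \odot F_c) - \mu_G \odot \mu_{F_c}$;
[0157] where $\mu_G$ represents the mean of the guide map $G$ within a local window, $\mu_{F_c}$ represents the mean of the feature map channel $F_c$ within a local window, $\sigma_G^2$ represents the variance of the guide map $G$ within a local window, $\mathrm{cov}(G, F_c)$ represents the covariance of the guide map $G$ and the feature map channel $F_c$ within a local window, $\mathrm{AvgPool}_r$ represents average pooling with radius $r$, $\odot$ represents element-wise multiplication, $G$ represents the single-channel guide map, and $F_c$ represents the current channel of the fused feature map;
[0158] The linear coefficients are calculated based on the regularization parameter as follows:
[0159] $a_c = \dfrac{\mathrm{cov}(G, F_c)}{\sigma_G^2 + \epsilon}, \quad b_c = \mu_{F_c} - a_c \odot \mu_G$;
[0160] where $\epsilon$ represents the regularization parameter;
[0161] The linear coefficients are then subjected to average pooling to obtain the smoothed coefficients:
[0162] $\bar{a}_c = \mathrm{AvgPool}_r(a_c), \quad \bar{b}_c = \mathrm{AvgPool}_r(b_c)$;
[0163] The filtered features are calculated from the smoothed linear coefficients:
[0164] $\hat{F}_c = \bar{a}_c \odot G + \bar{b}_c$;
[0165] The above operation is repeated for all channels to obtain the filtered feature map $\hat{F}$;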
[0166] The purified feature map obtained by introducing residual connections is as follows:
[0167] ;
[0168] where the residual connection uses a learnable scalar and the fused feature map.
[0169] This module effectively suppresses background noise such as clouds and waves while preserving the edge and contour information of ship targets to the maximum extent. Simulation experiments have verified that the false positives and false negatives of this application are significantly reduced on targets of different scales. This application can better adapt to changes in target scale and has a stronger ability to distinguish targets. In addition, this application can still accurately detect different types of targets under interfering background conditions.
[0170] In summary, this application can accurately detect different types of targets even in complex backgrounds, and performs well when the water surface shows fish-scale glint or when the imaging conditions are cloudy or foggy.
[0171] It should be understood that, although the steps in the flowchart of Figure 1 are shown sequentially as indicated by the arrows, these steps are not necessarily executed in that order. Unless otherwise specified herein, there is no strict order of execution, and the steps can be performed in other orders. In addition, at least some of the steps in Figure 1 may include multiple sub-steps or stages. These sub-steps or stages are not necessarily completed at the same time and can be executed at different times. Nor are they necessarily executed sequentially; they can be executed in turn or alternately with other steps, or with at least some of the sub-steps or stages of other steps.
[0172] The technical features of the above embodiments can be combined in any way. For the sake of brevity, not all possible combinations of the technical features in the above embodiments are described. However, as long as there is no contradiction in the combination of these technical features, they should be considered to be within the scope of this specification.
[0173] The embodiments described above are merely illustrative of several implementation methods of this application, and while the descriptions are specific and detailed, they should not be construed as limiting the scope of this application. It should be noted that those skilled in the art can make various modifications and improvements without departing from the concept of this application, and these modifications and improvements all fall within the protection scope of this application. Therefore, the protection scope of this application should be determined by the appended claims.
Claims
1. A multi-scale ship target detection method based on heterogeneous backbone fusion and noise suppression, characterized in that, The method includes: Acquire optical satellite remote sensing images; construct a multi-scale ship target detection model; the multi-scale ship target detection model includes a first backbone network, a second backbone network, a dynamic feature fusion module, an adaptive noise suppression module, a PAN neck network, and a multi-scale detection head; Optical satellite remote sensing images are input in parallel into the first backbone network and the second backbone network to obtain multi-level, multi-scale first feature maps and second feature maps, respectively. The feature maps of the same semantic level output by the first backbone network and the second backbone network are input into the dynamic feature fusion module. By calculating the feature differences between the two feature maps, difference-guided fusion weights are generated, and the two feature maps are adaptively fused based on the fusion weights to obtain the multi-scale fused feature map of the semantic level. The fused feature map is input into the adaptive noise suppression module. Using the original image as a guide, the fused feature map is subjected to guided filtering. The filtering intensity is dynamically adjusted according to the local image content to obtain the purified feature map. The purified feature maps from multiple semantic levels are input into the neck network and enhanced through a bottom-up path to generate a multi-scale feature pyramid. The multi-scale feature pyramid is detected by the multi-scale detection head, and the rotation detection box result and category information of the ship target are output. The feature maps of the same semantic level output by the first backbone network and the second backbone network are input into the dynamic feature fusion module. 
By calculating the feature differences between the two feature maps, difference-guided fusion weights are generated, including: calculating the feature differences between the first backbone network and the second backbone network at the same spatial location to obtain a feature difference metric matrix; mapping the feature difference metric matrix to a difference attention map; concatenating the feature maps of the first backbone network and the second backbone network along the channel dimension and generating initial fusion weights through convolution; and combining the difference attention map with the initial fusion weights to obtain the difference-guided fusion weights; Calculating the feature differences between the first backbone network and the second backbone network at the same spatial location to obtain a feature difference metric matrix includes: obtaining the feature difference metric matrix as $D(i,j) = \lVert f_1(i,j) - f_2(i,j) \rVert$, where $f_1(i,j)$ and $f_2(i,j)$ are the feature vectors of the first backbone network and the second backbone network at the corresponding position, respectively; Mapping the feature difference metric matrix to a difference attention map includes: mapping the feature difference metric matrix to the difference attention map as $A = \sigma(\mathrm{Conv}_2(\mathrm{Conv}_1(D)))$, where $D$ is the feature difference metric matrix, $\mathrm{Conv}_1$ denotes the first convolution operation, $\mathrm{Conv}_2$ denotes the second convolution operation, and $\sigma$ denotes the Sigmoid activation function; Concatenating the feature maps of the first backbone network and the second backbone network along the channel dimension and generating initial fusion weights through convolution, and combining the difference attention map with the initial fusion weights to obtain the difference-guided fusion weights, includes: generating the initial fusion weights as $W = \sigma(\mathrm{Conv}_1(\mathrm{Concat}(F_1, F_2)))$, where $F_1$ and $F_2$ are the feature maps of the first backbone network and the second backbone network, respectively, $\mathrm{Concat}$ denotes concatenation along the channel dimension, $\mathrm{Conv}_1$ denotes the first convolution operation, and $\sigma$ denotes the Sigmoid activation function; and combining the difference attention map with the initial fusion weights to obtain the difference-guided fusion weights as $W_d(i,j) = A(i,j)\,W(i,j) + \left(1 - A(i,j)\right)\bar{W}$, where $\bar{W}$ is the average of $W$ over the spatial dimensions, $A$ is the difference attention map, and $W(i,j)$ is the value of the initial fusion weights at spatial position $(i,j)$.
2. The method according to claim 1, characterized in that, The first backbone network consists of a CBS module, a C2f module, and an SPPF module, which output first feature maps at five scales: P1 / 2, P2 / 4, P3 / 8, P4 / 16, and P5 / 32. The second backbone network consists of a Conv_bn module, a max pooling downsampling module, a channel shuffling module, a grouped convolution module, and an SPPF module, which output second feature maps at three scales: 32×32, 16×16, and 8×8.
3. The method according to claim 1, characterized in that, The fused feature map is input into the adaptive noise suppression module. Using the original image as a guide, guided filtering is applied to the fused feature map, and the filtering intensity is dynamically adjusted based on local image content to obtain a purified feature map, including: The original image is adjusted to the same resolution as the fused feature map, and then mapped to a single-channel guide map through convolution. The regularization parameter for each spatial location is predicted by a noise estimation network. The purified feature map is obtained by independently performing guided filtering on each channel of the fused feature map using the single-channel guide map and the regularization parameter, and introducing residual connections.
4. The method according to claim 3, characterized in that, The original image is adjusted to the same resolution as the fused feature map, and then mapped to a single-channel guide map via convolution. A noise estimation network is used to predict the regularization parameters for each spatial location, including: The original image is adjusted to the same resolution as the fused feature map, and then mapped to a single-channel guide map through convolution as $G = \mathrm{Conv}_{1\times1}(I_r)$, where $I_r$ is the original image after resolution adjustment and $\mathrm{Conv}_{1\times1}$ is a 1×1 convolution operation; The regularization parameter for each spatial location is predicted by the noise estimation network from the concatenation of the single-channel guide map and the average of the fused feature map along the channel dimension, through 3×3 convolution operations with an activation function, followed by a function that guarantees that the regularization parameter is greater than 0.
5. The method according to claim 3, characterized in that, Each channel of the fused feature map is independently subjected to guided filtering, including: For each channel of the fused feature map, calculate the local mean, variance, and covariance; calculate the linear coefficients according to the regularization parameter, perform average pooling on the linear coefficients, calculate the filtered features based on the obtained smoothed linear coefficients, and introduce a learnable scalar for residual connection.
6. The method according to claim 5, characterized in that, For each channel of the fused feature map, calculate the local mean, variance, and covariance; calculate linear coefficients based on the regularization parameter, perform average pooling on the linear coefficients, calculate the filtered features based on the obtained smoothed linear coefficients, and introduce a learnable scalar for residual connection, including: For each channel of the fused feature map, the local mean, variance, and covariance are calculated as $\mu_G = \mathrm{AvgPool}_r(G)$, $\mu_{F_c} = \mathrm{AvgPool}_r(F_c)$, $\sigma_G^2 = \mathrm{AvgPool}_r(G \odot G) - \mu_G \odot \mu_G$, and $\mathrm{cov}(G, F_c) = \mathrm{AvgPool}_r(G \odot F_c) - \mu_G \odot \mu_{F_c}$, where $\mu_G$ represents the mean of the guide map $G$ within a local window, $\mu_{F_c}$ represents the mean of the feature map channel $F_c$ within a local window, $\sigma_G^2$ represents the variance of the guide map $G$ within a local window, $\mathrm{cov}(G, F_c)$ represents the covariance of the guide map $G$ and the feature map channel $F_c$ within a local window, $\mathrm{AvgPool}_r$ represents average pooling with radius $r$, $\odot$ represents element-wise multiplication, $G$ represents the single-channel guide map, and $F_c$ represents the current channel of the fused feature map; The linear coefficients are calculated based on the regularization parameter as $a_c = \mathrm{cov}(G, F_c) / (\sigma_G^2 + \epsilon)$ and $b_c = \mu_{F_c} - a_c \odot \mu_G$, where $\epsilon$ represents the regularization parameter; The linear coefficients are then subjected to average pooling to obtain the smoothed coefficients $\bar{a}_c = \mathrm{AvgPool}_r(a_c)$ and $\bar{b}_c = \mathrm{AvgPool}_r(b_c)$; The filtered features are calculated from the smoothed linear coefficients as $\hat{F}_c = \bar{a}_c \odot G + \bar{b}_c$; The filtering operation is repeated for all channels to obtain the filtered feature map $\hat{F}$; The purified feature map is obtained by introducing a residual connection that uses a learnable scalar and the fused feature map.