A deep learning-based instance segmentation method for multi-scale occluded ships at sea
Using the MAF-ORDNet model, which combines multi-branch adaptive feature fusion with occlusion region decoupling, the method addresses the accuracy and robustness problems of ship instance segmentation in complex maritime scenarios and achieves efficient multi-scale ship instance segmentation suitable for real-time monitoring on edge computing devices.
Patent Information
- Authority / Receiving Office: CN · China
- Patent Type: Applications (China)
- Current Assignee / Owner: JIANGSU UNIV OF SCI & TECH
- Filing Date: 2026-04-13
- Publication Date: 2026-06-26
AI Technical Summary
Existing technologies struggle to achieve high-precision and robust ship instance segmentation in complex maritime scenarios, especially under multi-scale occlusion conditions, where model detection accuracy is low, occlusion region boundaries are blurred, instance separation is difficult, and mask integrity and continuity are insufficient.
The MAF-ORDNet model based on deep learning is adopted, including a multi-branch adaptive feature fusion module, an occlusion region decoupling module, and a spatial adaptive detection head. Through the collaborative design of stepwise pooling and multi-branch dilated convolution, dynamic weighting and global channel attention mechanisms are introduced. Combined with the occlusion region decoupling and edge enhancement modules, a complete instance mask is generated.
The method significantly improves the accuracy of multi-scale ship segmentation while remaining lightweight and efficient. The model has only 12.4M parameters and an inference speed of 110.2 FPS, enabling real-time ship monitoring on edge computing devices.
Figure CN122289691A_ABST
Description
Technical Field
[0001] This invention relates to the field of ship image segmentation technology, and in particular to a method for segmenting occluded ship instances at sea based on multi-branch adaptive feature fusion and decoupling of occlusion regions.
Background Technology
[0002] With increasingly busy maritime traffic, ship monitoring and management are demanding higher levels of intelligence. Traditional monitoring relies on manual operation, which makes it difficult to achieve real-time and accurate detection, and prolonged operation can easily lead to fatigue and missed detections. To improve monitoring efficiency, intelligent systems based on instance segmentation have gradually become a research hotspot. This type of method can not only identify ship outlines but also distinguish adjacent or overlapping vessels, providing more refined perception information for port management and maritime traffic scheduling.
[0003] Currently, research on ship instance segmentation has accumulated to some extent. Chinese patent CN114219989A proposes a ship instance segmentation method for foggy scenes based on interference suppression and dynamic contours. It improves segmentation accuracy under foggy conditions by constructing a dedicated network, effectively reducing the false negative rate. However, its design is mainly geared towards a single weather scenario and does not consider the impact of mutual occlusion and scale changes among ships. Chinese patent CN115797626A introduces the global and local attention mechanism GALA into the SOLOv2 network, utilizing 1D strip pooling and 2D global pooling to preserve the global position and semantic information of ships, achieving improved accuracy on general datasets. However, this method has limited ability to model edge details in densely occluded scenes and struggles to address the problem of blurred boundaries between instances. Chinese patent CN117218377A proposes the MSDYOLOv4 real-time ship detection network, improving detection speed and accuracy through depthwise separable convolution and channel attention. However, this method focuses on real-time detection and does not optimize segmentation for occluded scenes, making it difficult to output detailed instance masks. Chinese patent CN118262299A enhances the detection capability of small ships by improving the neck network and loss function, but its focus is on the localization of small targets and it lacks effective treatment for the problems of target incompleteness and edge breakage caused by occlusion.
[0004] In summary, existing technologies are mainly designed for routine scenarios or specific weather conditions. In complex maritime scenarios with dense ship traffic, significant scale differences, and intertwined occlusion and interference, they still have the following shortcomings: first, their ability to fuse multi-scale features is limited, making it difficult to simultaneously consider the details of small targets and the structure of large targets; second, they lack refined modeling of occlusion region boundaries, leading to difficulties in instance separation and blurred segmentation boundaries; and third, under conditions of adverse weather and coupled occlusion, the integrity and continuity of the mask are insufficient. Therefore, there is an urgent need for a method that can achieve high-precision and robust ship instance segmentation in complex maritime scenarios.
Summary of the Invention
[0005] To address the shortcomings of existing technologies, this invention provides a deep learning-based method for instance segmentation of multi-scale occluded ships at sea, thereby solving the technical problem of low model detection accuracy caused by occlusion due to the complex environment in which ships are located.
[0006] This invention provides a deep learning-based method for segmenting instances of multi-scale occluded ships at sea, comprising the following steps:
[0007] Step 1: Obtain a dataset of multi-scale ship occlusion;
[0008] Step 2: Preprocess the images in the dataset and divide them into training and test sets proportionally;
[0009] Step 3: Construct the MAF-ORDNet model, which includes: a backbone network, a multi-branch adaptive feature fusion module, an occlusion region decoupling module, and a spatial adaptive detection head;
[0010] Specifically, the SPPF module is replaced by the multi-branch adaptive feature fusion module; the occlusion region decoupling module is added to the neck network; and instance masks are generated using an RFAConv-based spatial adaptive detection head and mask branches.
[0011] Step 4: Train the MAF-ORDNet model based on the training set to obtain the ship instance segmentation model;
[0012] Step 5: Input the test set into the ship instance segmentation model and output the instance segmentation results of the ship target under occlusion conditions.
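Step 2 leaves the split ratio open. A minimal sketch of a proportional, reproducible split is given below; the 80/20 ratio, the seed, and the file names are assumptions for illustration and are not taken from the disclosure.

```python
import random


def split_dataset(items, train_ratio=0.8, seed=0):
    """Shuffle a copy of `items` and split it into (train, test) lists."""
    if not 0.0 < train_ratio < 1.0:
        raise ValueError("train_ratio must be strictly between 0 and 1")
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical file names; the real dataset layout is not given in the text.
images = [f"ship_{i:04d}.jpg" for i in range(10)]
train_set, test_set = split_dataset(images)
```

Fixing the seed keeps the test set stable across runs, which matters when segmentation accuracy is compared between model variants.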
[0013] Furthermore, in step 3, the multi-branch adaptive feature fusion module gradually expands the receptive field through progressive pooling branches. Specifically, it uses progressive pooling and multi-branch dilated convolution to collaboratively handle the multi-scale feature fusion problem; it introduces a dynamic weighting mechanism to adaptively adjust the fusion process, restore the semantics of the occluded ships, and enhance the model's ability to perceive multi-scale features; and it uses a global channel attention mechanism to suppress background noise interference and improve the robustness of the model under complex occlusion.
[0014] Furthermore, the specific processing procedure of the multi-branch adaptive feature fusion module includes the following steps:
[0015] Step A1: By setting different dilation rates for each dilated convolution, each branch obtains an effective receptive field of 5×5, 7×7, and 9×9 respectively. Based on the effective receptive fields, feature maps of ships at multiple scales are extracted. The specific formula is as follows:
[0016] $F_i = C_{1\times 1}\big(\mathrm{cat}\big(\mathrm{DC}_r(N_i)\big)_{r=2,3,4}\big),\quad i = 1, 2, 3$;
[0017] In the formula, $F_i$ denotes the feature map generated by concatenating the three dilated convolution branches; $C_{1\times 1}(\cdot)$ denotes a 1×1 convolution operation; $\mathrm{DC}_r(\cdot)$ denotes a dilated convolution operation with dilation rate r; $N_i$ denotes the feature map obtained after the input feature F has undergone i rounds of max pooling and dimensionality reduction; and cat(·) denotes the concatenation operation.
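The receptive fields quoted in Step A1 follow from the usual relation between kernel size k and dilation rate r, namely k + (k - 1)(r - 1). The short check below assumes a 3×3 kernel in every branch; the text gives only the dilation rates and the resulting fields, so the kernel size is an assumption.

```python
def effective_kernel_size(kernel_size, dilation):
    """Effective extent of a dilated kernel: k + (k - 1) * (r - 1)."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)


# Assumption: each branch uses a 3x3 kernel.
fields = {r: effective_kernel_size(3, r) for r in (2, 3, 4)}
print(fields)  # {2: 5, 3: 7, 4: 9}
```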
[0018] Step A2: Set learnable parameters w1, w2, and w3 to adaptively integrate features at different scales. The specific formula is as follows:
[0019] $F_{c1} = \mathrm{cat}\big(w_1 F_1,\ w_2 F_2,\ w_3 F_3,\ \mathrm{MP}^{(3)}(F)\big)$;
[0020] In the formula, $F_{c1}$ denotes the feature map after concatenating the four branches; cat(·) denotes the concatenation operation; $\mathrm{MP}^{(3)}(\cdot)$ denotes the feature map after three max pooling operations; F denotes the feature map input to the module; and $F_1$, $F_2$, and $F_3$ are the feature maps generated by concatenating the three dilated convolution branches;
[0021] Step A3: Suppress background noise interference through a global channel attention mechanism. The specific formula for feature map output is as follows:
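[0022] $F^{*} = \mathrm{cat}\big(\sigma(\mathrm{PW}(\mathrm{PC}(\mathrm{Avg}(F)))) \odot F,\ F_{c1}\big)$;
[0023] In the formula, $F^{*}$ denotes the output feature map of the module; Avg(·) denotes the average pooling operation that gathers global context; PC(·) denotes the PConv operation; PW(·) denotes the PWConv operation; σ(·) denotes the sigmoid activation function; ⊙ denotes element-wise multiplication; F denotes the feature map input to the module; $F_{c1}$ denotes the feature map after concatenating the four branches; and cat(·) denotes the concatenation operation.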
[0024] Furthermore, in step 3, the occlusion area decoupling module includes:
[0025] The occlusion region localization unit is used to perform cross-normalization of the feature map in the spatial dimension and adaptively enhance the feature difference between the occluded and unoccluded regions through learnable parameters. The occlusion region localization unit independently calculates statistics for each channel of the feature map at different locations.
[0026] The edge enhancement unit introduces learnable offsets to the sampling points through deformable convolution, thereby enhancing the feature response of the ship's edges.
[0027] Furthermore, in the occlusion area localization unit, the specific method for enhancing the feature difference between the occluded area and the unoccluded area is as follows: the position features are adjusted by scaling factors and offsets, and learnable parameters are introduced into the residual connection to adaptively fuse the original features and the enhanced features, thereby increasing the feature difference between the occluded area and the unoccluded area.
[0028] Furthermore, the processing formula for the occlusion area positioning unit is as follows:
[0029] X1 = W2 * (ReLU(W1 * X));
[0030] Y = X + α*XNorm(X,X1,X1);
[0031] In the formula, X is the input feature of the occlusion region localization unit, with a size of 2C×H×W; W1 is the first 1×1 convolution kernel, with a size of (2C/4)×2C×1×1; ReLU(·) denotes the ReLU activation function; W2 is the second 1×1 convolution kernel, with a size of 2C×(2C/4)×1×1; X1 is the feature map output after the 1×1 convolution, ReLU layer, and 1×1 convolution, with a size of 2C×H×W; XNorm(·) denotes cross-normalization; Y denotes the feature map output by the ship occlusion region localization unit; α is a learnable parameter; H is the height of the feature map; W is the width of the feature map; and C is the number of channels of the feature map.
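The kernel sizes above imply a channel bottleneck: 2C channels are compressed to 2C/4 and then restored. The arithmetic below counts the weights of W1 and W2 (biases ignored) and compares them with a single 2C-to-2C 1×1 convolution; C = 64 is an illustrative value, not one stated in the text.

```python
def bottleneck_weights(c):
    """Weight counts (no biases) for the two 1x1 convolutions W1 and W2."""
    channels = 2 * c            # input and output channels, 2C
    hidden = channels // 4      # compressed channels, 2C / 4
    w1 = hidden * channels      # W1: (2C/4) x 2C x 1 x 1
    w2 = channels * hidden      # W2: 2C x (2C/4) x 1 x 1
    return w1, w2


w1, w2 = bottleneck_weights(64)   # C = 64 is an illustrative value
full = (2 * 64) ** 2              # a single 2C -> 2C 1x1 convolution
print(w1, w2, w1 + w2, full)      # 4096 4096 8192 16384
```

Under this count the two-layer bottleneck uses 2C² weights, half of the 4C² needed by one full-width 1×1 convolution, while adding a nonlinearity in between.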
[0032] Furthermore, the specific processing procedure of the edge enhancement unit is as follows:
[0033] A learnable offset is introduced for each sampling point, enabling the convolution kernel to adaptively adjust the sampling position according to the actual contour of the ship. Within the sampling area, the response intensity of edge features is enhanced by weighted summation of input features and convolution weights, so that the ship can maintain clear feature representation even under occlusion conditions.
[0034] Furthermore, the processing formula for the edge enhancement unit is as follows:
[0035] $Y_1 = \mathrm{DCN}\big(\sigma(C_{3\times 3}(\sigma(C_{1\times 1}(x))))\big) + x$;
[0036] In the formula, x denotes the input feature map; $C_{1\times 1}(\cdot)$ is a 1×1 convolution; $C_{3\times 3}(\cdot)$ is a 3×3 convolution; σ(·) denotes the ReLU activation function; DCN(·) denotes a deformable convolution; and $Y_1$ is the output feature map.
[0037] Furthermore, in step 3, the spatial adaptive detection head includes:
[0038] The RFAConv module is used to output the mask coefficients that characterize the morphological restoration.
[0039] The ProtoNet module is used to generate a denoised prototype mask.
[0040] The mask synthesis unit is used to linearly combine mask coefficients with the prototype mask to repair the texture features of ship fractures and output a complete instance mask.
[0041] Furthermore, the RFAConv module contains two cascaded blocks, each consisting of an RFAConv layer, a batch normalization layer, and an activation layer connected in series; their output is then fed to a two-dimensional convolutional layer for integration;
[0042] The ProtoNet module consists of four depthwise separable convolutional layers connected in series, which convolve each channel of the input feature map independently, preventing background noise from interfering across the channel dimension.
[0043] The beneficial effects of this invention are:
[0044] This invention employs a multi-branch adaptive feature fusion module, which uses a collaborative design of progressive pooling and multi-branch dilated convolution. It also introduces dynamic weighting and global channel attention mechanisms to adaptively fuse multi-scale features. This not only enhances the detailed features of small targets but also ensures the semantic integrity of the global structure of large targets, thus significantly improving the segmentation accuracy of ships at multiple scales.
[0045] This invention uses an occlusion region decoupling module to inject shallow texture and contour information into deep semantic features, and combines deformable convolution in the edge enhancement module to shift sampling points toward the contour direction, effectively solving the problem of blurred boundaries and difficulty in separating mutually occluded ship instances.
[0046] This invention utilizes a spatially adaptive detection head and the spatial feature transformation mechanism of the RFAConv module to generate mask coefficients with spatial representation capabilities. Combined with depthwise separable convolutional ProtoNet, it generates a denoised prototype mask, effectively alleviating the problem of incomplete segmentation caused by texture damage and edge breakage of ships under dense occlusion and severe weather conditions, and generating a more complete and continuous ship instance mask.
[0047] This invention achieves improved accuracy while maintaining a lightweight design and high efficiency. The model has only 12.4M parameters and an inference speed of 110.2 FPS. Compared with mainstream methods such as Mask R-CNN and Cascade R-CNN, it achieves higher segmentation accuracy while significantly reducing computational complexity. It can be deployed on edge computing devices to meet the application needs of real-time monitoring of ships at sea.
Attached Figure Description
[0048] The features and advantages of the invention will be more clearly understood by referring to the accompanying drawings, which are schematic and should not be construed as limiting the invention in any way. In the drawings:
[0049] Figure 1 is a flowchart of a specific embodiment of the present invention;
[0050] Figure 2 shows ship occlusion images generated by the occlusion simulation method in a specific embodiment of the present invention;
[0051] Figure 3 shows images under severe weather conditions generated by a generative adversarial network in a specific embodiment of the present invention;
[0052] Figure 4 is a diagram of the overall framework of the algorithm in a specific embodiment of the present invention;
[0053] Figure 5 is a structural diagram of the multi-branch adaptive feature fusion module in a specific embodiment of the present invention;
[0054] Figure 6 is a diagram of the dilated convolution branch structure in a specific embodiment of the present invention;
[0055] Figure 7 is a network structure diagram of the occlusion region decoupling module in a specific embodiment of the present invention;
[0056] Figure 8 is a structural diagram of the occlusion region localization module in a specific embodiment of the present invention;
[0057] Figure 9 is a structural diagram of the edge enhancement module in a specific embodiment of the present invention;
[0058] Figure 10 is a structural diagram of the spatial adaptive detection head in a specific embodiment of the present invention;
[0059] Figure 11 is a diagram of the improved ProtoNet structure in a specific embodiment of the present invention;
[0060] Figure 12 shows a comparative experiment between MAF-ORDNet and mainstream segmentation models on occluded images in a specific embodiment of the present invention.
Detailed Implementation
[0061] To make the objectives, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.
[0062] The present invention will be further illustrated below with reference to specific embodiments. Those skilled in the art should understand that these embodiments are for illustrative purposes only and are not intended to limit the scope of the invention. Modifications to the present invention in various equivalent forms all fall within the scope defined by the appended claims.
[0063] As shown in Figure 1, a specific embodiment of the present invention provides a deep learning-based method for segmenting multiple target ship instances at sea, including the following steps:
[0064] Step 1: Obtain a dataset of multi-scale ship occlusion;
[0065] Since real data on mutual occlusion of ships in occlusion scenarios is relatively scarce, in order to expand the training samples and improve the generalization ability of the model, this invention adopts an occlusion simulation method based on Mixup to generate ship occlusion images:
[0066] ;
[0067] ;
[0068] In the formula, the region of interest (ROI) of the target ship is cropped from the source image. Its pixel range in the horizontal and vertical directions is defined by x1, x2 and y1, y2, respectively. In the original annotation, the bounding box of this region is centered at (x_c, y_c), with width w and height h. A scaling factor α ∈ (0,1] is then introduced to scale the region, simulating occlusion targets of different sizes. The transformed ROI is pasted into other background images, thus constructing a synthetic sample that closely resembles a real occlusion relationship. During this process, the new bounding box corresponding to the pasting location is recorded, and a corresponding annotation file is generated accordingly. The final synthesized ship occlusion images are shown in Figure 2, where the first and second columns are the original images and the third column is the generated occluded image.
[0069] To improve the model's generalization ability on low-quality images and alleviate overfitting, a Generative Adversarial Network (GAN) is introduced to simulate severe weather conditions such as fog, rain, and snow, automatically generating ship images with diverse weather effects and corresponding annotations. By introducing illumination variations and simulated occlusion into the images, this method enhances the model's perception of local textures and geometric edges, thereby improving the accuracy of distinguishing ship targets from complex backgrounds. The final weather-enhanced images are shown in Figure 3, where the first column is the original image, the second column is the image with fog added, the third column is the image with snow added, and the fourth column is the image with rain added.
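The crop, scale, and paste procedure described above can be sketched in a few lines. This is a minimal illustration under stated assumptions: images are plain 2D lists of pixel values, scaling uses nearest-neighbour sampling, and the ROI simply overwrites the background. Any pixel blending the authors may apply is not described, and all function names here are hypothetical.

```python
def crop(image, x1, y1, x2, y2):
    """Return rows y1..y2-1 and columns x1..x2-1 of a 2D list image."""
    return [row[x1:x2] for row in image[y1:y2]]


def resize_nearest(patch, alpha):
    """Scale a patch by alpha in (0, 1] using nearest-neighbour sampling."""
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must be in (0, 1]")
    h, w = len(patch), len(patch[0])
    new_h, new_w = max(1, round(h * alpha)), max(1, round(w * alpha))
    return [[patch[min(h - 1, int(r / alpha))][min(w - 1, int(c / alpha))]
             for c in range(new_w)] for r in range(new_h)]


def paste(background, patch, left, top):
    """Paste patch onto a copy of background; return (image, new_bbox)."""
    out = [row[:] for row in background]
    h, w = len(patch), len(patch[0])
    if top < 0 or left < 0 or top + h > len(out) or left + w > len(out[0]):
        raise ValueError("patch does not fit inside the background")
    for r in range(h):
        out[top + r][left:left + w] = patch[r]
    # New bounding box (x1, y1, x2, y2) to be written to the annotation file.
    return out, (left, top, left + w, top + h)


source = [[10 * r + c for c in range(8)] for r in range(8)]
background = [[0] * 8 for _ in range(6)]
roi = crop(source, 2, 2, 6, 6)            # 4x4 region of the source image
small = resize_nearest(roi, 0.5)          # alpha = 0.5 gives a 2x2 patch
synthetic, bbox = paste(background, small, 3, 1)
```

Returning the new bounding box together with the image keeps the synthetic annotation in step with the pasted pixels.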
[0070] Step 2: Preprocess the images in the dataset and divide them into training and test sets proportionally;
[0071] Step 3: Construct the MAF-ORDNet model, which includes: a backbone network, a multi-branch adaptive feature fusion module, an occlusion region decoupling module, and a spatial adaptive detection head;
[0072] As shown in Figure 4, MAF-ORDNet is built around three modules: a multi-branch adaptive feature fusion module, an occlusion region decoupling module, and a spatial adaptive detection head.
[0073] The MAF-ORDNet network mainly consists of four parts: a backbone network, a multi-branch adaptive feature fusion module, an occlusion region decoupling module, and a spatially adaptive detection head. Taking an image of size H×W×C as input, the image is first fed into the Darknet53 backbone network, resulting in feature maps at five levels: C1, C2, C3, C4, and C5. Subsequently, these feature maps (C3, C4, and C5) are fed into the neck network, where cross-layer feature fusion is performed through methods such as concatenation to construct deep fused features containing rich semantic and spatial details. In this stage, the C5 feature map is processed by the multi-branch adaptive feature fusion module and then concatenated with the shallow feature maps C4 and C3 from bottom to top, finally generating three feature maps P3, P4, and P5 at different scales. Before entering the detection head, P3, P4, and P5 are processed by the occlusion region decoupling module to generate occlusion localization and edge enhancement feature maps O1, O2, and O3, respectively. Finally, these three feature maps are input into the spatial adaptive detection head in parallel. In the head network, the features are decoupled through a series of convolutional layers to predict the target category, bounding box coordinates, and instance mask, and output the final result map.
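To make the three-scale output concrete, the sketch below computes the spatial size of each pyramid level. The disclosure does not state the input resolution or the downsampling factors; the 640×640 input and the strides 8, 16, and 32 for the C3, C4, and C5 levels are assumptions commonly seen in Darknet-style backbones.

```python
def pyramid_shapes(height, width, strides=(8, 16, 32)):
    """Spatial size (rows, cols) of each pyramid level for the given strides."""
    return [(height // s, width // s) for s in strides]


# Assumptions: a 640x640 input and strides 8/16/32 for the three levels.
levels = dict(zip(("P3", "P4", "P5"), pyramid_shapes(640, 640)))
print(levels)  # {'P3': (80, 80), 'P4': (40, 40), 'P5': (20, 20)}
```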
[0074] As shown in Figure 5, the multi-branch adaptive feature fusion module processes feature maps as follows. First, the module pre-expands the basic receptive field through progressive pooling. Then, multi-branch dilated convolution, acting as a parallel feature extractor, extracts information at different scales using dilated convolutions with three different receptive fields. Finally, a dynamic weighting mechanism acts as a selection gate, adaptively adjusting the fusion ratio of each branch feature based on the actual scale of the input target. The synergistic effect of these three mechanisms means that the multi-branch adaptive feature fusion module is no longer limited by a fixed-size convolution kernel, but instead obtains an adaptive receptive field that changes with the target size.
[0075] Let the input feature be F. For the feature map input to the multi-branch adaptive feature fusion module, three max-pooling operations are first applied in series and then concatenated with the original feature to form four branches. This design uses multiple small pooling operations in series to progressively expand the receptive field, replacing a single large-kernel pooling. This step-by-step expansion significantly reduces the number of parameters and improves computational efficiency while achieving a similar receptive field. The formulas for the three max-pooling operations can be expressed as:
[0076] $X_i = \mathrm{MP}^{(i)}(F),\quad i = 1, 2, 3$;
[0077] $N_i = C_{1\times 1}(X_i)$;
[0078] In the formula, $X_i$ denotes the feature map obtained by applying i max pooling operations to the input feature F; MP denotes the max pooling operation; $C_{1\times 1}$ denotes a 1×1 convolution operation; and $N_i$ is the feature map after dimensionality reduction via the 1×1 convolution.
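The claim that small poolings in series reach the receptive field of one large pooling can be checked directly. The one-dimensional sketch below assumes stride 1, a kernel size of 5 (not stated in the text), and border windows clipped to the signal.

```python
def max_pool_1d(values, kernel):
    """Stride-1 max pooling with windows clipped at the borders (odd kernels)."""
    half = kernel // 2
    n = len(values)
    return [max(values[max(0, i - half):min(n, i + half + 1)]) for i in range(n)]


signal = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5, 8]
k = 5  # assumed kernel size; the text does not state it
once = max_pool_1d(signal, k)
twice = max_pool_1d(once, k)
thrice = max_pool_1d(twice, k)

# Two passes cover 2k - 1 = 9 samples and three passes cover 3k - 2 = 13.
same_as_9 = twice == max_pool_1d(signal, 9)
same_as_13 = thrice == max_pool_1d(signal, 13)
```

Because each pass reuses the previous result, the intermediate maps for the smaller windows come for free, which is what makes the four-branch layout cheap.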
[0079] To extract features of ships of different scales in the occluded area, this invention designs the dilated convolution structure shown in Figure 6. It is composed of three convolutional branches with different dilation rates (r = 2, 3, 4), where each branch processes only 1/12 of the input channel dimension, thus significantly reducing the number of parameters and the computational overhead. By setting different dilation rates, these branches obtain effective receptive fields of 5×5, 7×7, and 9×9 respectively, enabling them to capture local texture information of small targets and global contour features of large targets in parallel, thereby improving the model's ability to represent features of ships at multiple scales. The specific processing is shown in the following formula:
[0080] $F_i = C_{1\times 1}\big(\mathrm{cat}\big(\mathrm{DC}_r(N_i)\big)_{r=2,3,4}\big),\quad i = 1, 2, 3$;
[0081] In the formula, $F_i$ denotes the feature map generated by concatenating the three dilated convolution branches; $C_{1\times 1}(\cdot)$ denotes a 1×1 convolution operation; $\mathrm{DC}_r(\cdot)$ denotes a dilated convolution operation with dilation rate r; $N_i$ denotes the feature map obtained after the input feature F has undergone i rounds of max pooling and dimensionality reduction; and cat(·) denotes the concatenation operation.
[0082] Subsequently, by introducing learnable parameters w1, w2, and w3, a dynamic weighted fusion mechanism is constructed to adaptively integrate features from different scales. This mechanism generates feature maps rich in contextual semantics at a lower computational cost, avoiding redundant feature stacking or complex attention operations. The specific processing procedure is as follows:
[0083] $F_{c1} = \mathrm{cat}\big(w_1 F_1,\ w_2 F_2,\ w_3 F_3,\ \mathrm{MP}^{(3)}(F)\big)$;
[0084] In the formula, $F_{c1}$ denotes the feature map after concatenating the four branches; cat(·) denotes the concatenation operation; $\mathrm{MP}^{(3)}(\cdot)$ denotes the feature map after three max pooling operations; F denotes the feature map input to the module; and $F_1$, $F_2$, and $F_3$ are the feature maps generated by concatenating the three dilated convolution branches;
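The weighted concatenation can be illustrated with toy data. In the sketch below a feature map is a list of channels and each channel is a flat list of values; the weights are fixed illustrative numbers, whereas in the described method they are learned during training. The helper names are hypothetical.

```python
def scale(feature, weight):
    """Multiply every value of a feature map (list of channels) by weight."""
    return [[weight * v for v in channel] for channel in feature]


def weighted_concat(branches, weights, pooled):
    """cat(w1*F1, w2*F2, w3*F3, pooled) along the channel axis."""
    if len(branches) != len(weights):
        raise ValueError("one weight per branch is required")
    fused = []
    for feature, weight in zip(branches, weights):
        fused.extend(scale(feature, weight))
    fused.extend(pooled)  # the pooled branch is concatenated unweighted
    return fused


# Toy inputs: one channel with two values per branch.
f1, f2, f3 = [[1.0, 2.0]], [[3.0, 4.0]], [[5.0, 6.0]]
pooled = [[7.0, 8.0]]
w = [0.5, 1.0, 2.0]  # illustrative values; in training these are learned
f_c1 = weighted_concat([f1, f2, f3], w, pooled)
```

Because the branches are concatenated rather than summed, the weights rescale each branch's contribution without changing the channel count of any branch.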
[0085] Complex maritime weather conditions can cause decreased contrast and blurred details in ship images. To prevent feature loss and missegmentation, this invention introduces PConv and PWConv to model channel dependencies. This method generates an adaptive channel weight matrix by statistically analyzing global context information and then uses sigmoid activation to suppress redundant channels, thereby improving the signal-to-noise ratio of foreground features. While enhancing key details such as contours and textures, it effectively suppresses interference from background and environmental noise. The feature map output of the module can be represented as:
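[0086] $F^{*} = \mathrm{cat}\big(\sigma(\mathrm{PW}(\mathrm{PC}(\mathrm{Avg}(F)))) \odot F,\ F_{c1}\big)$;
[0087] In the formula, $F^{*}$ denotes the output feature map of the module; Avg(·) denotes the average pooling operation that gathers global context; PC(·) denotes the PConv operation; PW(·) denotes the PWConv operation; σ(·) denotes the sigmoid activation function; ⊙ denotes element-wise multiplication; F denotes the feature map input to the module; $F_{c1}$ denotes the feature map after concatenating the four branches; and cat(·) denotes the concatenation operation.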
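The channel gating step can be sketched as follows. This is a simplification under stated assumptions: global context is taken as the per-channel mean, and the PConv and PWConv stages are stood in for by a single hypothetical channel-mixing matrix `mix`, since their exact configuration is not given in the text.

```python
import math


def channel_attention(feature, mix):
    """Reweight each channel by sigmoid(mix @ channel_means)."""
    means = [sum(ch) / len(ch) for ch in feature]            # global average
    logits = [sum(m * w for m, w in zip(means, row))         # stand-in for
              for row in mix]                                # PW(PC(.))
    gates = [1.0 / (1.0 + math.exp(-z)) for z in logits]     # sigmoid
    attended = [[g * v for v in ch] for g, ch in zip(gates, feature)]
    return attended, gates


feature = [[1.0, 3.0], [2.0, 2.0]]   # two channels, two positions each
mix = [[0.0, 0.0], [1.0, -1.0]]      # hypothetical mixing weights
attended, gates = channel_attention(feature, mix)
f_c1 = [[9.0, 9.0]]                  # placeholder for the fused branch output
f_star = attended + f_c1             # concatenation along the channel axis
```

A gate near 0 suppresses a channel and a gate near 1 passes it through, which is how redundant background channels are attenuated.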
[0088] As shown in Figure 7, the occlusion region decoupling module mainly consists of a ship occlusion region localization module and a ship edge enhancement module. It integrates the texture, structure, and contour information of shallow features into deep semantic features, computes independent statistics in the spatial dimension to locate the occlusion region, and strengthens the extraction of contour features through an edge perception mechanism, thereby achieving accurate separation and boundary refinement of mutually occluding ship instances.
[0089] The occlusion region localization module borrows from the Bottleneck structure, as shown in Figure 8. This structure enhances the expressive power of features through channel compression and nonlinear mapping, helping to highlight the structural information of ships. At the same time, its dimensionality reduction design keeps computational complexity under control while deepening the network. To enhance the model's ability to identify occlusion boundaries, the occlusion region localization module uses cross-normalization (XNorm) for dynamic feature calibration in the spatial dimension. This method independently calculates statistics for each channel of the feature map at different locations to capture the spatial distribution of ships more precisely. Furthermore, through a learnable scaling factor γ and offset β, the features at each location are adaptively adjusted, significantly increasing the feature difference between occluded and unoccluded regions, improving the distinguishability of overlapping target boundaries, and achieving effective separation of mutually occluded ship instances. In addition, to avoid loss of structural information due to over-fusion, this module introduces a learnable parameter α in the residual connection to adaptively fuse the enhanced features with the original input. The input feature of the occlusion region localization module is X, with dimensions of 2C×H×W. The processing formula is as follows:
[0090] X1 = W2 * (ReLU(W1 * X));
[0091] Y = X + α*XNorm(X,X1,X1);
[0092] In the formula, X is the input feature of the occlusion region localization unit, with a size of 2C×H×W; W1 is the first 1×1 convolution kernel, with a size of (2C/4)×2C×1×1; ReLU(·) denotes the ReLU activation function; W2 is the second 1×1 convolution kernel, with a size of 2C×(2C/4)×1×1; X1 is the feature map output after the 1×1 convolution, ReLU layer, and 1×1 convolution, with a size of 2C×H×W; XNorm(·) denotes cross-normalization; Y denotes the feature map output by the ship occlusion region localization module; α is a learnable parameter; H is the height of the feature map; W is the width of the feature map; and C is the number of channels of the feature map.
[0093] The ship edge enhancement module predicts sampling-point offsets using deformable convolution and shifts each fixed sampling position $p_k$ to $p_k + \Delta p_k$. Where the pixel gradient changes abruptly at the edge of a ship, this mechanism adaptively shifts the sampling points toward the contour so that they fit the real contour better, thereby enhancing the response to edge features. Its structure is shown in Figure 9. By introducing a learnable offset $\Delta p_n$ for each sampling point $p_n$, the convolution kernel adaptively adjusts its sampling positions according to the actual contour of the ship, thereby obtaining higher-quality contour information. Within the sampling region, the input features $x(p_0 + p_n + \Delta p_n)$ are weighted by the convolution weights $w(p_n)$ and summed, which significantly enhances the response intensity of edge features and enables ships to maintain a clear feature representation even under occlusion. The specific formulas for the ship edge enhancement module are as follows:
[0094] $Y_1(p_0) = \sum_{n} w(p_n)\, x(p_0 + p_n + \Delta p_n)$;
[0095] $Y_1 = \mathrm{DCN}\big(\sigma(C_{3\times 3}(\sigma(C_{1\times 1}(x))))\big) + x$;
[0096] In the formula, $Y_1(p_0)$ is the value at position $p_0$ in the output feature map; $w(p_n)$ is the weight of the convolution kernel at position $p_n$; x is the input feature map; $\Delta p_n$ is the learnable offset of the n-th sampling point, represented as a two-dimensional vector (Δx, Δy); $p_0 + p_n + \Delta p_n$ is the actual read position of the n-th offset sampling point for the current output position $p_0$; $x(p_0 + p_n + \Delta p_n)$ is the feature value at that position; $C_{1\times 1}(\cdot)$ is a 1×1 convolution; $C_{3\times 3}(\cdot)$ is a 3×3 convolution; σ(·) denotes the ReLU activation function; DCN denotes a deformable convolution; and $Y_1$ is the output feature map.
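The offset sampling behind deformable convolution can be illustrated for a single output position. The sketch assumes a 3×3 kernel, bilinear interpolation at fractional positions, and zero values outside the image; the offsets are passed in directly here, whereas in the described module they are predicted by the network.

```python
import math


def bilinear(image, y, x):
    """Bilinearly sample a 2D list at fractional (y, x); outside counts as 0."""
    h, w = len(image), len(image[0])
    y0, x0 = math.floor(y), math.floor(x)
    total = 0.0
    for yy in (y0, y0 + 1):
        for xx in (x0, x0 + 1):
            if 0 <= yy < h and 0 <= xx < w:
                weight = (1 - abs(y - yy)) * (1 - abs(x - xx))
                total += weight * image[yy][xx]
    return total


def deform_conv_at(image, p0, kernel, offsets):
    """Weighted sum of w(p_n) * x(p0 + p_n + dp_n) over a 3x3 grid."""
    grid = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]  # fixed p_n
    out = 0.0
    for (dy, dx), w, (oy, ox) in zip(grid, kernel, offsets):
        out += w * bilinear(image, p0[0] + dy + oy, p0[1] + dx + ox)
    return out


image = [[float(4 * r + c) for c in range(4)] for r in range(4)]
kernel = [1.0 / 9] * 9                 # averaging kernel for illustration
zero = [(0.0, 0.0)] * 9                # no offsets: an ordinary convolution
shifted = [(0.0, 0.5)] * 9             # every sample moved half a pixel right
plain = deform_conv_at(image, (1, 1), kernel, zero)
moved = deform_conv_at(image, (1, 1), kernel, shifted)
```

With zero offsets the result equals a standard 3×3 convolution, so the offsets only change where the kernel reads, not how the samples are weighted.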
[0097] To avoid over-reliance on a single path leading to missing edge information, one feature channel has a weight of w, and the other has a weight of 1-w. The feature fusion process of the occlusion region decoupling module can be represented as:
[0098] ;
[0099] To output a more complete instance mask, the output of the occlusion region decoupling module is connected to the spatial adaptive detection head.
[0100] As shown in Figure 10, the spatial adaptive detection head employs a decoupled, multi-task parallel structure in which the category classification, bounding box regression, and mask coefficient generation tasks each have an independent prediction branch. Unlike the traditional design of stacked standard convolutions, it places two cascaded RFAConvModule blocks at the front end of each branch. First, the feature map is decoupled into independent local receptive-field structures through spatial unfolding. Then, the local structural features are dynamically reweighted to adaptively enhance the features at break points, so that the output mask coefficient mask_coef represents the semantic features after morphological restoration.
[0101] In the prototype mask generation stage, to reduce the computational load and noise introduced by RFAConv, this invention uses four depthwise separable convolutional layers in ProtoNet in place of the standard 3×3 convolutional layers. Convolving each channel of the input feature map independently prevents background noise from interfering across the channel dimension. As shown in Figure 11, this generates the denoised prototype mask.
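The computational saving from this substitution can be estimated with a short parameter count. The sketch below assumes the common form of a depthwise separable layer (a per-channel k×k depthwise convolution followed by a 1×1 pointwise convolution), ignores bias terms, and uses a hypothetical channel width of 256; none of these values are taken from this document.

```python
def standard_conv_params(c_in, c_out, k=3):
    """Weights of a standard k x k convolution (bias ignored)."""
    return c_in * c_out * k * k


def depthwise_separable_params(c_in, c_out, k=3):
    """Weights of a depthwise k x k conv plus a 1x1 pointwise conv (bias ignored)."""
    depthwise = c_in * k * k   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1x1 convolution that mixes channels
    return depthwise + pointwise


if __name__ == "__main__":
    c = 256  # hypothetical channel width, for illustration only
    std = standard_conv_params(c, c)
    sep = depthwise_separable_params(c, c)
    print(std, sep, round(std / sep, 2))
```

Under these assumptions a single layer needs 589,824 weights as a standard convolution and 67,840 as a depthwise separable one, roughly an 8.7-fold reduction.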
[0102] Finally, the mask_coef output by the spatial adaptive detection head is multiplied by the prototype mask to obtain the instance mask, and its calculation formula is as follows:
[0103] ;
[0104] In the formula, M represents the instance mask after morphological restoration; P_i is the prototype mask generated by ProtoNet; α_i is the corresponding mask coefficient predicted by the spatial adaptive detection head.
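A minimal sketch of how prototype masks P_i and mask coefficients α_i can be combined is given below. It assumes a plain linear combination followed by a sigmoid and a 0.5 threshold, as in YOLACT-style mask assembly; the sigmoid, the threshold, and the function name `assemble_mask` are assumptions of this sketch rather than details stated in this document.

```python
import math


def assemble_mask(prototypes, coeffs, threshold=0.5):
    """Combine prototype masks P_i with coefficients alpha_i into a binary mask.

    prototypes -- list of H x W prototype masks (lists of rows)
    coeffs     -- one coefficient alpha_i per prototype
    """
    h, w = len(prototypes[0]), len(prototypes[0][0])
    mask = []
    for r in range(h):
        row = []
        for c in range(w):
            # Linear combination sum_i alpha_i * P_i at this pixel.
            s = sum(a * p[r][c] for a, p in zip(coeffs, prototypes))
            prob = 1.0 / (1.0 + math.exp(-s))  # assumed sigmoid squashing
            row.append(1 if prob > threshold else 0)
        mask.append(row)
    return mask


if __name__ == "__main__":
    p1 = [[4.0, 4.0], [-4.0, -4.0]]   # responds to the top half
    p2 = [[4.0, -4.0], [4.0, -4.0]]   # responds to the left half
    print(assemble_mask([p1, p2], [1.0, 0.5]))
    print(assemble_mask([p1, p2], [0.5, 1.0]))
```

The same two prototypes yield different instance masks depending on the coefficients, which is why the quality of the predicted coefficients directly affects mask completeness.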
[0105] The prototype mask coefficients also utilize a spatial attention mechanism, ensuring that each generated P_i more accurately reflects the local semantic structure of the ship. This effectively alleviates the discontinuity of ship segmentation under dense occlusion and yields a more complete ship mask.
[0106] Step 4: Train the MAF-ORDNet model based on the training set to obtain the ship instance segmentation model;
[0107] Step 5: Input the test set into the ship instance segmentation model and output the instance segmentation results of the ship target under occlusion conditions.
[0108] The present invention will be demonstrated through experiments below.
[0109] The experiments used the OcclusionShipSeg dataset for training and evaluation. This dataset contains 1969 Real-Occlusion images, 2592 Occlusion-Train images, and 690 Occlusion-Test images. The statistical information of the sample dataset is shown in Table 1 below:
[0110] Table 1 Statistical information of the sample dataset
[0111]
[0112] The experiments of this invention were conducted on a Windows platform, using a GPU (GTX4060) for computation. Model training used the PyTorch 1.12.1 framework, accelerated with CUDA 11.6 and CUDNN 11.0. No pre-trained weights were loaded during training; the Adam optimizer was used, with a batch size of 16, an initial learning rate of 0.001, and a weight decay of 5 × 10⁻⁵. The input images were uniformly resized to 640×640×3, training ran for 600 epochs, and automatic mixed precision was enabled to improve training efficiency.
[0113] The evaluation metrics used in this invention include the average precision (AP) metrics of the COCO evaluation system: AP, AP_50, AP_75, AP_S (area < 32² pixels), AP_M (32² < area < 96² pixels), and AP_L (area > 96² pixels). In addition, the model parameter count Para(M), the number of images processed per second (FPS), the single-frame latency T(ms), and the floating-point operation count FLOPs(G) were also calculated to comprehensively evaluate the algorithm's performance and efficiency.
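The size thresholds and the relationship between FPS and single-frame latency can be illustrated as follows. The sketch assumes that an area exactly on a threshold falls into the larger bucket and that latency is the reciprocal of throughput (one image processed at a time); both are assumptions of this sketch, and the function names are hypothetical.

```python
def coco_size_bucket(area):
    """Assign an instance to the AP_S / AP_M / AP_L bucket by pixel area."""
    if area < 32 ** 2:
        return "small"
    if area < 96 ** 2:
        return "medium"
    return "large"


def latency_ms(fps):
    """Single-frame latency in milliseconds, assuming T = 1000 / FPS."""
    return 1000.0 / fps


if __name__ == "__main__":
    for side in (30, 50, 100):
        print(side, coco_size_bucket(side * side))
    print(round(latency_ms(110.2), 2))
```

For example, a throughput of 110.2 FPS corresponds to roughly 9.07 ms per frame under the stated assumption.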
[0114] Analysis of experimental results:
[0115] A. Ablation test
[0116] To verify the effectiveness of the proposed method, this invention conducted ablation experiments on the OcclusionShipSeg dataset using YOLOv8s (with CSP-C2f as the backbone network) as the baseline model, focusing on the multi-branch adaptive feature fusion module and the occlusion region decoupling module. The ablation experiment results are shown in Table 2 below:
[0117] Table 2 Ablation Experiment Results
[0118]
[0119] Compared to the baseline, adding only the multi-branch adaptive feature fusion module improves AP_S and AP_M by 2.5% and 2.4% respectively, indicating that this module extracts multi-scale features through dilated convolutions with multiple dilation rates and retains more ship details by combining a global attention mechanism, effectively improving the segmentation accuracy of small and medium-sized targets. However, its AP_L decreases by 1.6%. The analysis shows that, because the multi-branch adaptive feature fusion module does not introduce an edge enhancement mechanism, the contours of large ships are not modeled finely enough, which lowers the edge quality of large-target masks.
[0120] After adding only the occlusion region decoupling module, the model's AP, AP_50, and AP_75 improve by 3.3%, 4.2%, and 6.3% respectively, while FPS decreases only slightly, from 118.2 to 111.7. This demonstrates that the module effectively enhances the model's ability to segment densely occluded ships. Although deformable convolution requires additional offset prediction, it reduces invalid computation through accurate sampling and improves feature extraction efficiency, so the speed loss is minimal.
[0121] Compared to the baseline, introducing the multi-branch adaptive feature fusion module and the occlusion region decoupling module together improves the overall segmentation accuracy of the model, with AP_S, AP_M, and AP_L increasing by 4.1%, 3.4%, and 2.5% respectively. This indicates that the two modules work together to extract multi-scale features and contextual information effectively, thereby enhancing the model's ability to handle mutual occlusion between ships of different scales.
[0122] B. Comparative experiments with current mainstream models:
[0123] This invention conducted comparative experiments on the OcclusionShipSeg dataset to compare with current mainstream instance segmentation methods. The results of the comparative experiments with current mainstream models are shown in Table 3 below:
[0124] Table 3 Results of Comparative Experiments with Current Mainstream Models
[0125]
[0126] First, after training on a regular dataset, Mask R-CNN achieved an AP of 35.7% on the occlusion test set, while training on the occlusion dataset improved its performance to 39.1%, demonstrating the effectiveness of occlusion-specific training. Further comparisons show that, under the same occlusion training settings, Yolact and Mask2former achieved APs of 41.9% and 41.1% respectively, both outperforming Mask R-CNN, indicating that simulated occlusion and weather enhancement can improve the model's ability to recognize ships at multiple scales.
[0127] Under the same experimental settings, the improved MAF-ORDNet of this invention achieves an AP of 43.5%, a 3.8% improvement over the baseline YOLOv8s. In terms of multi-scale performance, MAF-ORDNet's AP_S, AP_M, and AP_L are higher than those of YOLOv8s by 4.1%, 3.4%, and 2.5% respectively, and higher than those of Mask2former by 4.6%, 2.8%, and 1.5% respectively. This indicates that the MAF module effectively captures the key characteristics of ships at different scales. It is worth noting that MAF-ORDNet's AP_L is 1.0% lower than Yolact's, but its AP_S is 5.5% higher. This is because Yolact relies on deep, low-resolution features, which favors semantic extraction for large targets but does not preserve the details of small targets, whereas MAF-ORDNet achieves more balanced multi-scale performance through weighted fusion of deep and shallow features, which enhances the representation of small targets while maintaining stable segmentation of large targets.
[0128] Table 4 Comparison of Time Complexity and Inference Efficiency of Each Model
[0129]
[0130] As shown in Table 4, MAF-ORDNet contains only 15.4M parameters, significantly fewer than the Mask R-CNN and Mask2former models; its FLOPs are only 35.9G, about 1/5 of those of Mask R-CNN and about 1/5 of those of Mask2former, so its computational complexity is significantly lower.
[0131] Specifically, while two-stage methods like Mask R-CNN offer high accuracy, their parameters are as high as 44M, FLOPs are 196G, and inference speed is only 14FPS, making it difficult to meet real-time requirements. Yolact has a faster inference speed, but it relies solely on deep, low-resolution features to generate masks, making it difficult to capture edge details and resulting in insufficient accuracy in occluded scenes. Mask2former achieves an accuracy of 41.1% AP, but its Transformer-based global attention computation is quite intensive, resulting in a speed of only 10.2FPS. YOLOv11s has a lightweight architecture and a speed of 121.6FPS, but due to limited model capacity, its accuracy is about 3.2% lower than MAF-ORDNet.
[0132] In contrast, MAF-ORDNet achieved an AP of 43.5% while maintaining a high inference speed of 110.2 FPS, achieving a good balance between accuracy and efficiency, and demonstrating its comprehensive advantages in real-time, multi-scale occluded ship segmentation tasks.
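The relative cost and speed figures quoted above can be checked with simple arithmetic. The sketch below uses only the FLOPs and FPS values stated in the text for Mask R-CNN and MAF-ORDNet; the dictionary layout and the function name `relative_to` are hypothetical.

```python
# FLOPs (G) and FPS as quoted in the text above.
MODELS = {
    "Mask R-CNN": {"gflops": 196.0, "fps": 14.0},
    "MAF-ORDNet": {"gflops": 35.9, "fps": 110.2},
}


def relative_to(models, name, reference):
    """Return the FLOPs ratio and speed-up of `name` relative to `reference`."""
    m, ref = models[name], models[reference]
    return {
        "flops_ratio": m["gflops"] / ref["gflops"],
        "speedup": m["fps"] / ref["fps"],
    }


if __name__ == "__main__":
    result = relative_to(MODELS, "MAF-ORDNet", "Mask R-CNN")
    print(round(result["flops_ratio"], 3), round(result["speedup"], 2))
```

On these figures MAF-ORDNet uses about 18% of the FLOPs of Mask R-CNN, consistent with the "about 1/5" statement, and runs roughly 7.9 times faster.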
[0133] Figure 12 compares the segmentation performance of Mask2former, Mask R-CNN, Yolact, YOLOv8s, YOLOv11s, and MAF-ORDNet in occluded scenes. The first and second rows show the input image and the annotated instance mask, respectively. From the third row to the last, the segmentation results of Mask2former, Mask R-CNN, Yolact, YOLOv8s, YOLOv11s, and MAF-ORDNet are shown in order. Yellow dashed lines mark incorrect segmentations, and red solid lines mark missed detections.
[0134] In the first column of samples, Mask2former misidentified the background as two ships, reflecting insufficient suppression of background noise. Mask R-CNN failed to segment the large target ship completely, indicating its limited ability to model the contours of large-scale targets. Yolact misidentified a background region as a small ship, indicating a bias in its feature extraction for small targets. YOLOv8s missed two small boats because small and large targets overlapped, indicating insufficient receptive field adaptation when handling scale overlap. YOLOv11s also missed one small target. In contrast, MAF-ORDNet correctly identified one large ship and four small boats, achieving accurate segmentation. This is attributed to the MAF module, which extracts multi-scale features through dilated convolutions with multiple dilation rates and combines a global attention mechanism to suppress background interference, enhancing the model's adaptability to scale differences. In the second column of samples, Mask2former missed one occluded ship; Mask R-CNN incorrectly split a complete ship into two instances and segmented three mutually occluded ships as four, indicating its difficulty in distinguishing instance boundaries in occluded scenes; Yolact also exhibited instance segmentation errors and failed to detect the occluded small boat; YOLOv8s and YOLOv11s both failed to identify occluded small targets, reflecting their insufficient feature response to occluded regions. MAF-ORDNet, however, accurately segmented all instances, which is attributed to the ORD module's edge enhancement mechanism strengthening the contour information of occluded regions and effectively guiding the model to distinguish the boundaries between different ships.
In the fourth column of samples, Mask2former missed the occluded small boat; Mask R-CNN failed to identify the occluded ship in the middle; Yolact missed both the middle ship and the occluded small targets; YOLOv8s incorrectly segmented the mast and hull of the same ship into two independent instances; and YOLOv11s split the occluded small target into two instances. In contrast, MAF-ORDNet exhibited no missed detections or incorrect segmentations, achieving complete segmentation. In summary, MAF-ORDNet maintains high segmentation accuracy under complex occlusion conditions and outperforms all the compared methods in these samples. This advantage mainly stems from the MAF module's enhancement of multi-scale features and the ORD module's refined modeling of edge information; the two work together to improve the model's segmentation performance for multi-scale occluded ships.
[0135] Although embodiments of the invention have been described in conjunction with the accompanying drawings, those skilled in the art can make various modifications and variations without departing from the spirit and scope of the invention, and such modifications and variations all fall within the scope defined by the appended claims.
Claims
1. A deep learning-based method for instance segmentation of multi-scale occluded ships at sea, characterized in that it comprises the following steps: Step 1: Obtain a multi-scale ship occlusion dataset; Step 2: Preprocess the images in the dataset and divide them proportionally into a training set and a test set; Step 3: Construct the MAF-ORDNet model, which includes a backbone network, a multi-branch adaptive feature fusion module, an occlusion region decoupling module, and a spatial adaptive detection head; specifically, the SPPF module is replaced by the multi-branch adaptive feature fusion module, the occlusion region decoupling module is added to the neck network, and instance masks are generated using the RFAConv-based spatial adaptive detection head and mask branches; Step 4: Train the MAF-ORDNet model based on the training set to obtain the ship instance segmentation model; Step 5: Input the test set into the ship instance segmentation model and output the instance segmentation results of the ship targets under occlusion conditions.
2. The deep learning-based method for instance segmentation of multi-scale occluded ships at sea as described in claim 1, characterized in that, in step 3, the multi-branch adaptive feature fusion module gradually expands the receptive field through progressive pooling branches; specifically, it uses progressive pooling and multi-branch dilated convolution to jointly handle multi-scale feature fusion; it introduces a dynamic weighting mechanism to adaptively adjust the fusion process, restore the semantics of occluded ships, and enhance the model's ability to perceive multi-scale features; and it uses a global channel attention mechanism to suppress background noise interference and improve the robustness of the model under complex occlusion.
3. The deep learning-based method for instance segmentation of multi-scale occluded ships at sea as described in claim 2, characterized in that the specific processing steps of the multi-branch adaptive feature fusion module are as follows: Step A1: By setting a different dilation rate for each dilated convolution, the branches obtain effective receptive fields of 5×5, 7×7, and 9×9 respectively, and feature maps of ships at multiple scales are extracted based on these effective receptive fields. The specific formula is as follows: F_i = C_1×1(cat(DC_r(N_i))_{r=2,3,4}), i = 1, 2, 3; in the formula, F_i represents the feature map generated by concatenating the three dilated convolution branches; C_1×1(·) represents a 1×1 convolution operation; DC_r(·) represents a dilated convolution operation, where r is the dilation rate; N_i represents the feature map obtained after the input feature F has undergone i rounds of max pooling and dimensionality reduction; cat(·) represents the concatenation operation. Step A2: Set learnable parameters w1, w2, and w3 to adaptively integrate features at different scales. The specific formula is as follows: F_c1 = cat(w1*F1, w2*F2, w3*F3, MP^(3)(F)); in the formula, F_c1 represents the feature map after concatenating the four branches; cat(·) represents the concatenation operation; MP^(3)(F) represents the feature map after three max pooling operations; F represents the feature map input to the module; F1, F2, and F3 are the feature maps generated by concatenating the three dilated convolution branches. Step A3: Suppress background noise interference through a global channel attention mechanism. The specific formula for the output feature map is as follows: F* = cat(σ(PW(PC(Avg(F)))) ⊙ F, F_c1); in the formula, F* represents the output feature map of the module; Avg(·), PC(·), PW(·), and σ(·) are the operations applied in sequence in the global channel attention branch; F represents the feature map input to the module; F_c1 represents the feature map after concatenating the four branches; cat(·) represents the concatenation operation.
4. The deep learning-based method for instance segmentation of multi-scale occluded ships at sea as described in claim 1, characterized in that, in step 3, the occlusion region decoupling module includes: an occlusion region localization unit, used to perform cross-normalization of the feature map in the spatial dimension and to adaptively enhance the feature difference between occluded and unoccluded regions through learnable parameters, wherein the occlusion region localization unit independently calculates statistics for each channel of the feature map at different locations; and an edge enhancement unit, which introduces learnable offsets to the sampling points through deformable convolution, thereby enhancing the feature response of ship edges.
5. The deep learning-based method for instance segmentation of multi-scale occluded ships at sea as described in claim 4, characterized in that, in the occlusion region localization unit, the feature difference between the occluded region and the unoccluded region is enhanced as follows: the position features are adjusted by scaling factors and offsets, and learnable parameters are introduced into the residual connection to adaptively fuse the original features and the enhanced features, thereby increasing the feature difference between the occluded region and the unoccluded region.
6. The deep learning-based method for instance segmentation of multi-scale occluded ships at sea as described in claim 5, characterized in that the processing formula for the occlusion region localization unit is as follows: X1 = W2 * (ReLU(W1 * X)); Y = X + α*XNorm(X, X1, X1); in the formula, X is the input feature of the occlusion region localization unit, with a size of 2C×H×W; W1 is the first 1×1 convolution kernel, with a size of (2C/4)×2C×1×1; ReLU(·) represents the ReLU activation function; W2 is the second 1×1 convolution kernel, with a size of 2C×(2C/4)×1×1; X1 is the feature map output after the 1×1 convolution, ReLU layer, and 1×1 convolution, with a size of 2C×H×W; XNorm(·) represents cross-normalization; Y represents the feature map output by the occlusion region localization unit; α is a learnable parameter; H is the height of the feature map; W is the width of the feature map; and C is the number of channels of the feature map.
7. The deep learning-based method for instance segmentation of multi-scale occluded ships at sea as described in claim 4, characterized in that the specific processing procedure of the edge enhancement unit is as follows: a learnable offset is introduced for each sampling point, enabling the convolution kernel to adaptively adjust its sampling positions according to the actual contour of the ship; within the sampling region, the response intensity of edge features is enhanced by the weighted summation of input features and convolution weights, so that the ship maintains a clear feature representation even under occlusion conditions.
8. The deep learning-based method for instance segmentation of multi-scale occluded ships at sea as described in claim 7, characterized in that the processing formula for the edge enhancement unit is as follows: Y1 = DCN(σ(C_3×3(σ(C_1×1(x))))) + x; in the formula, x represents the input feature map; C_1×1(·) is a 1×1 convolution; C_3×3(·) is a 3×3 convolution; σ(·) represents the ReLU activation function; DCN(·) represents a deformable convolution; Y1 is the output feature map.
9. The deep learning-based method for instance segmentation of multi-scale occluded ships at sea as described in claim 1, characterized in that, in step 3, the spatial adaptive detection head includes: an RFAConv module, used to output the mask coefficients that characterize the morphological restoration; a ProtoNet module, used to generate a denoised prototype mask; and a mask synthesis unit, used to linearly combine the mask coefficients with the prototype mask to repair the texture features at ship break points and output a complete instance mask.
10. The deep learning-based method for instance segmentation of multi-scale occluded ships at sea as described in claim 9, characterized in that the RFAConv module contains two layers, each of which includes an RFAConv layer, a batch normalization layer, and an activation layer connected in series, and whose output is then fed into a two-dimensional convolutional layer for integration; and the ProtoNet module consists of four depthwise separable convolutional layers connected in series, which convolve each channel of the input feature map independently, preventing background noise from interfering across the channel dimension.