Method for detecting ship targets

By improving the feature extraction and fusion modules of the YOLOv5 network and combining channel attention and Anchor-Free detection methods, the accuracy problem of small target detection in remote sensing images is solved, achieving efficient ship target recognition that is suitable for lightweight hardware.

CN119963874B · Active · Publication Date: 2026-06-23 · INST OF SEMICONDUCTORS - CHINESE ACAD OF SCI

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
INST OF SEMICONDUCTORS - CHINESE ACAD OF SCI
Filing Date
2024-12-16
Publication Date
2026-06-23

AI Technical Summary

Technical Problem

Existing remote sensing image target detection methods suffer from high rates of feature loss, missed detection, and false detection for small targets, making it difficult to accurately identify ship targets. In particular, remote sensing images often have complex backgrounds, small targets, multiple scales, and diverse features, resulting in insufficient detection accuracy and precision.

Method used

We adopt an improved ship target detection method based on the YOLOv5 network. By replacing the two-dimensional convolutional layer with a depthwise separable large kernel convolution, adding a small target extraction layer and a channel attention module, configuring an anchor-free target detection head, using depthwise separable convolution and residual structure, and combining channel attention mechanism and anchor-free detection method, we optimize the feature extraction and fusion process.

Benefits of technology

It improves the detection accuracy and precision of ship targets in remote sensing images, enhances feature extraction capabilities, reduces computational load, and is suitable for lightweight hardware implementation.

✦ Generated by Eureka AI based on patent content.


Abstract

The application provides a ship target detection method, and relates to the fields of target detection and remote sensing image processing. The method comprises the following steps: acquiring a remote sensing image containing a ship target, and pre-processing the remote sensing image; inputting the pre-processed remote sensing image into a trained target detection model to obtain the category and position information of the ship target; wherein the target detection model is improved based on a YOLOv5 network and comprises a feature extraction module, a feature fusion module and a feature detection module; in the feature extraction module, a two-dimensional convolution layer of the YOLOv5 network is replaced by a depthwise separable large kernel convolution; in the feature fusion module, an additional small target extraction layer is added, and a channel attention module is added before each fusion for the fusion of multiple feature maps after up-sampling and down-sampling; and in the feature detection module, an anchor-free target detection head is configured for each input feature map. The application can improve the accuracy and precision of lightweight remote sensing image target detection.

Description

Technical Field

[0001] This invention relates to the fields of target detection and remote sensing image processing, and in particular to a method for detecting ship targets.

Background Technology

[0002] With significant advancements in imaging technology, the ability to acquire optical remote sensing data has been greatly enhanced. High-resolution optical remote sensing images, rich in detail and texture, can now be obtained through sensors on satellites or aerial cameras on drones. Ships, as a primary mode of maritime transport, play a crucial role in fisheries management, waterway monitoring, and safety rescue. Therefore, real-time detection and identification of ships is of significant strategic importance, enabling rapid decision-making based on actual circumstances.

[0003] In the current development of image detection, deep learning is widely used for ship detection and classification tasks due to its advantages such as strong robustness and fewer prior conditions. Depending on the detection method, object detection can be divided into one-stage and two-stage algorithms. Compared to two-stage algorithms, one-stage algorithms directly extract features from the image and then use a classifier to predict the object's location and category, completing classification and regression simultaneously in one stage. This approach has the advantages of lower computational cost and faster detection speed, making it easier to meet the real-time requirements of practical applications. Among existing technologies, YOLO, compared to other mainstream algorithms, offers high speed, low parameter count, and high accuracy in one-stage algorithms.

[0004] The accuracy of target detection is strongly correlated with the clarity of extracted features. Remote sensing images are characterized by large image size and complex backgrounds, and the targets within them are characterized by small size, multiple scales, and diverse features. YOLO algorithms, however, are suitable for detecting natural images with simple backgrounds and large targets, which differ significantly from remote sensing images. Therefore, current remote sensing image target detection suffers from problems such as loss of small target features, missed detections, high false detection rates, and inability to accurately identify the category of small targets.

Summary of the Invention

[0005] In view of the above problems, the present invention provides a method for detecting ship targets, which improves the accuracy and precision of target detection in lightweight remote sensing images.

[0006] This invention provides a method for detecting ship targets, comprising: acquiring a remote sensing image containing ship targets; preprocessing the remote sensing image; inputting the preprocessed remote sensing image into a trained target detection model to obtain the category and location information of the ship targets; wherein the target detection model is based on an improved YOLOv5 network and includes a feature extraction module, a feature fusion module, and a feature detection module; in the feature extraction module, the two-dimensional convolutional layers of the YOLOv5 network are replaced with depthwise separable large kernel convolutions; in the feature fusion module, an additional small target extraction layer is added, and for multiple feature map fusions after upsampling and downsampling, a channel attention module is added before each fusion; in the feature detection module, an anchor-free target detection head is configured for each input feature map.

[0007] According to an embodiment of the present invention, the remote sensing image is a visible light grayscale remote sensing image; preprocessing the remote sensing image includes: segmenting the visible light grayscale remote sensing image into multiple image blocks; and filling the edges of each of the multiple image blocks.

[0008] According to an embodiment of the present invention, the step of inputting the preprocessed remote sensing image into a trained target detection model to obtain the category and location information of the ship target includes: inputting the preprocessed remote sensing image into the feature extraction module to output multiple intermediate features at different depths; inputting the multiple intermediate features at different depths into the feature fusion module to output multiple scale feature maps with fused attention; and inputting the multiple scale feature maps into the feature detection module to output the category and location information of the ship target.

[0009] According to an embodiment of the present invention, the feature extraction module has five layers from shallow to deep, namely layer C1, layer C2, layer C3, layer C4, and layer C5. The C1 layer contains a CBS module; the C2, C3, and C4 layers have the same structure, each consisting of a CBS module and an improved CSP module; the C5 layer consists of a CBS module, an SPFF module, and an improved CSP module; the features output by the C2, C3, C4, and C5 layers are used as multiple intermediate features of different depths output by the feature extraction module.

[0010] According to an embodiment of the present invention, the CBS module includes a two-dimensional convolutional layer, a batch normalization layer, and a SiLU activation function; the two-dimensional convolutional layer is a convolution with a kernel size of 3 and a stride of 1; the improved CSP module includes a first branch and a second branch; the first branch first performs a 1*1 convolution and then performs multiple iterations of the BottleNeck module, with layer-specific iteration counts for the C2, C3, C4, and C5 layers; the second branch performs a 1*1 convolution, and the convolution result is concatenated with the result of the first branch; wherein the improved CSP module adjusts the number of BottleNeck iterations used in the YOLOv5 network; in the two-dimensional convolutional layer of the CBS module in the improved CSP module, the 1*1 convolution and 3*3 convolution are replaced with a 3*3 convolution and a 5*5 depthwise separable convolution.

[0011] According to an embodiment of the present invention, the feature fusion module includes multiple improved CSP modules, CBS modules, and ECA modules, wherein the ECA module is used to implement the channel attention mechanism; inputting the multiple intermediate features of different depths into the feature fusion module and outputting multiple scale feature maps with fused attention includes: processing the output features of the C5 layer through the first CBS module to obtain the fifth-level intermediate features; upsampling the fifth-level intermediate features and concatenating them with the output features of the C4 layer, then processing them through the first improved CSP module and the first ECA module, and processing the resulting features through the second CBS module to obtain the fourth-level intermediate features with fused attention; upsampling the fourth-level intermediate features and concatenating them with the output features of the C3 layer, then processing them through the second improved CSP module and the second ECA module, and processing the resulting features through the third CBS module to obtain the third-level intermediate features with fused attention; upsampling the third-level intermediate features and concatenating them with the output features of the C2 layer, then processing them through the third improved CSP module and the third ECA module, and processing the resulting features through the fourth CBS module to obtain the second-level intermediate features with fused attention; processing the second-level intermediate features through the fifth CBS module to obtain the P2 feature map; downsampling the P2 feature map and concatenating it with the third-level intermediate features, then processing them through the fourth improved CSP module and the fourth ECA module, and processing the resulting features through the sixth CBS module to obtain the P3 feature map with fused attention; downsampling the P3 feature map and concatenating it with the fourth-level intermediate features, then processing them through the fifth improved CSP module and the fifth ECA module, and processing the resulting features through the seventh CBS module to obtain the P4 feature map with fused attention; downsampling the P4 feature map and concatenating it with the fifth-level intermediate features, then processing them through the sixth improved CSP module and the sixth ECA module, and processing the resulting features through the eighth CBS module to obtain the P5 feature map with fused attention.

[0012] According to an embodiment of the present invention, the ECA module processes the input feature map in the following manner: adaptively calculating the kernel size of the one-dimensional convolution based on the number of channels of the input feature map; performing global pooling on the input feature map; performing one-dimensional convolution on the result of global pooling based on the kernel size of the one-dimensional convolution; calculating the result of the one-dimensional convolution using the sigmoid activation function; and concatenating the calculation result with the input feature map to obtain the output feature map corresponding to the input feature map.
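The ECA steps listed above can be sketched in plain Python as follows. This is an illustrative sketch under stated assumptions, not the patented implementation: the kernel-size mapping (with `gamma=2`, `b=1`) follows the commonly published ECA-Net formula, the uniform 1-D kernel stands in for learned weights, and the function names are hypothetical. The final step is shown as channel-wise scaling, which is how the published ECA design applies the weights; the paragraph above describes that step as concatenation, so the scaling is an assumption of this sketch.

```python
import math

def eca_kernel_size(channels, gamma=2, b=1):
    """Odd 1-D kernel size derived from the channel count (assumed ECA-Net mapping)."""
    t = int(abs(math.log2(channels) / gamma + b / gamma))
    return t if t % 2 == 1 else t + 1

def eca_weights(feature_map):
    """feature_map: list of C channels, each an H x W nested list of floats.
    Returns one attention weight in (0, 1) per channel."""
    channels = len(feature_map)
    k = eca_kernel_size(channels)
    # Global average pooling: one scalar per channel.
    pooled = [sum(map(sum, ch)) / (len(ch) * len(ch[0])) for ch in feature_map]
    # 1-D convolution across neighbouring channels with zero padding.
    # A uniform kernel is a placeholder for learned weights (assumption).
    kernel = [1.0 / k] * k
    pad = k // 2
    padded = [0.0] * pad + pooled + [0.0] * pad
    conv = [sum(kernel[j] * padded[i + j] for j in range(k)) for i in range(channels)]
    # Sigmoid maps each response to a weight between 0 and 1.
    return [1.0 / (1.0 + math.exp(-v)) for v in conv]

def eca_apply(feature_map):
    """Scale every channel by its attention weight (assumed combination step)."""
    weights = eca_weights(feature_map)
    return [[[v * w for v in row] for row in ch]
            for ch, w in zip(feature_map, weights)]
```

Under the assumed mapping, a 64-channel map gets a kernel of size 3 and a 256-channel map a kernel of size 5, so wider layers exchange information across more neighbouring channels.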

[0013] According to an embodiment of the present invention, inputting the plurality of scale feature maps into the feature detection module and outputting the category and location information of the ship target includes: configuring a target detection head for each feature map in the plurality of scale feature maps, the target detection head comprising two detection sub-networks; wherein the two detection sub-networks are two parallel convolutions, namely a classification convolution and a localization convolution; the label allocation strategy adopted by the two detection sub-networks during training is the minimum loss allocation.
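One plausible reading of minimum-loss label allocation is that each ground-truth ship takes the single prediction whose loss against it is lowest, and every other prediction is treated as a negative. The sketch below illustrates only that reading; the text does not spell out the details, and the function name and cost-matrix layout are hypothetical.

```python
def assign_min_cost(cost):
    """cost[i][j]: loss of prediction i measured against ground-truth target j.
    Returns {target index: index of the prediction with the lowest loss}."""
    if not cost:
        return {}
    n_targets = len(cost[0])
    return {
        j: min(range(len(cost)), key=lambda i: cost[i][j])
        for j in range(n_targets)
    }
```

For a cost matrix `[[0.9, 0.2], [0.1, 0.8], [0.5, 0.5]]`, target 0 is matched to prediction 1 and target 1 to prediction 0; prediction 2 would be a negative sample.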

[0014] According to an embodiment of the present invention, the loss function used during training of the two detection sub-networks is the sum of classification loss and localization loss, wherein the classification loss function is Focal Loss, and the localization loss function includes L1 loss function and GIOU loss function; during inference, the output of the two parallel convolutions of the two detection sub-networks is the top-k scoring bounding boxes, which give the category and location information of the ship target, where k is a positive integer.
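A minimal sketch of the loss terms and the top-k selection named above, assuming corner-format boxes `(x1, y1, x2, y2)` with positive area. The focal-loss defaults (`alpha=0.25`, `gamma=2.0`) are the commonly cited ones, and the loss weights `w_l1` and `w_giou` are hypothetical; none of these values are given in the text.

```python
import math

def giou(a, b):
    """Generalized IoU of two boxes (x1, y1, x2, y2); value lies in (-1, 1]."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    # Smallest axis-aligned box enclosing both inputs.
    enclose = (max(a[2], b[2]) - min(a[0], b[0])) * (max(a[3], b[3]) - min(a[1], b[1]))
    return inter / union - (enclose - union) / enclose

def focal_loss(p, target, alpha=0.25, gamma=2.0):
    """Binary focal loss for one predicted probability p and a 0/1 target."""
    p_t = p if target == 1 else 1.0 - p
    p_t = min(max(p_t, 1e-12), 1.0)  # avoid log(0)
    alpha_t = alpha if target == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

def detection_loss(p, target, pred_box, gt_box, w_l1=1.0, w_giou=1.0):
    """Classification loss plus localization loss (L1 + GIoU), weights assumed."""
    l1 = sum(abs(x - y) for x, y in zip(pred_box, gt_box))
    return focal_loss(p, target) + w_l1 * l1 + w_giou * (1.0 - giou(pred_box, gt_box))

def top_k(scores, boxes, k):
    """Keep the k highest-scoring boxes, highest score first."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return [(scores[i], boxes[i]) for i in order]
```

Selecting the top-k scores directly is what lets inference skip a separate suppression pass over overlapping boxes.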

[0015] According to an embodiment of the present invention, the target detection model is trained in the following manner: constructing a target detection model and defining a loss function; acquiring a historical remote sensing image set, labeling and preprocessing the historical remote sensing image set to obtain training samples; inputting the training samples into the constructed target detection model to obtain the predicted target location and predicted target category for each historical remote sensing image; and training the constructed target detection model by backpropagation based on the loss function, the predicted target location, the predicted target category, and the labeling information to obtain a trained target detection model.

[0016] As can be seen from the above technical solution, the ship target detection method provided by the present invention has at least the following beneficial effects:

[0017] (1) The feature extraction module of the present invention increases the receptive field and improves the feature extraction capability without increasing the number of parameters by introducing depthwise separable large kernel convolution and residual structure.

[0018] (2) The feature fusion module of this invention, for the fusion of multiple feature maps after upsampling and downsampling of the path aggregation network, adds an efficient channel attention mechanism before each fusion to enhance information exchange between different layers, enhance effective information, and improve the accuracy and efficiency of the network. Adding additional feature extraction branches in the shallow layers yields a new framework for extracting features for small targets.

[0019] (3) The feature detection module of the present invention adds an additional small target detection head and uses an anchor-free target detection head. By introducing a classification branch, the target is located directly, avoiding the use of non-maximum suppression, which is hardware-friendly.

Attached Figure Description

[0020] The above and other objects, features and advantages of the present invention will become more apparent from the following description of embodiments of the invention with reference to the accompanying drawings, in which:

[0021] Figure 1 schematically illustrates the structure of a prior-art YOLOv5 network;

[0022] Figure 2 schematically illustrates a flowchart of a method for detecting ship targets according to an embodiment of the present invention;

[0023] Figure 3 schematically illustrates a flowchart of the processing procedure of the target detection model according to an embodiment of the present invention;

[0024] Figure 4 schematically illustrates the network structure of the target detection model according to an embodiment of the present invention;

[0025] Figure 5 schematically illustrates the network structure of the CSP module in a prior-art YOLOv5 network;

[0026] Figure 6 schematically illustrates the network structure of the improved CSP module according to an embodiment of the present invention;

[0027] Figure 7 schematically illustrates a simplified network structure of a feature fusion module according to an embodiment of the present invention;

[0028] Figure 8 schematically illustrates the network structure of a feature detection module according to an embodiment of the present invention.

Detailed Implementation

[0029] To make the objectives, technical solutions, and advantages of the present invention clearer, the present invention will be further described in detail below with reference to specific embodiments and accompanying drawings.

[0030] The terminology used herein is for the purpose of describing particular embodiments only and is not intended to limit the invention. The terms “comprising,” “including,” etc., as used herein indicate the presence of the stated features, steps, operations, and / or components, but do not exclude the presence or addition of one or more other features, steps, operations, or components.

[0031] All terms used herein, including technical and scientific terms, have the meanings commonly understood by those skilled in the art, unless otherwise defined. It should be noted that the terms used herein are to be interpreted in a manner consistent with the context of this specification, and not in an idealized or overly rigid way.

[0032] This invention provides a method for detecting ship targets, used for real-time detection of ships in remote sensing images captured by UAV aerial photography. While keeping the model small, it improves on conventional target detection algorithms by enhancing target feature extraction during remote sensing target detection, improving the accuracy of target detection in remote sensing images, and keeping the model lightweight.

[0033] Figure 1 schematically illustrates the structure of a prior-art YOLOv5 network.

[0034] As shown in Figure 1, in the prior art, the YOLOv5 network is a lightweight small object detection algorithm based on an improved feature fusion mode. In the YOLOv5 network, the image is processed through the Input layer and sent to the Backbone network for feature extraction. The Backbone network obtains feature maps of different sizes {C3, C4, C5}, and then these features are fused through the neck feature fusion network to finally generate three feature maps {P3, P4, P5}, which are used to detect small, medium, and large objects in the image, respectively.

[0035] Then, after sending the three feature maps to the prediction head, the confidence score and bounding box are calculated for each pixel in the feature maps using preset prior anchors, thereby obtaining a multidimensional array BBoxes, which includes object category, category confidence, box coordinates, width and height information.

[0036] By setting appropriate thresholds (confthreshold, objthreshold) to filter out useless information in the multidimensional array BBoxes and performing non-maximum suppression (NMS) processing, the final detection information can be output.
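The prior-art post-processing described above (threshold filtering followed by non-maximum suppression) can be sketched as below. This is a generic, class-agnostic greedy NMS under stated assumptions: a single score threshold stands in for the two thresholds mentioned in the text, and the default values `0.25` and `0.45` are illustrative rather than taken from the source.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def nms(detections, conf_threshold=0.25, iou_threshold=0.45):
    """detections: list of (score, box). Returns the surviving detections."""
    # Step 1: drop low-confidence boxes, then sort by descending score.
    candidates = sorted(
        (d for d in detections if d[0] >= conf_threshold),
        key=lambda d: d[0],
        reverse=True,
    )
    kept = []
    # Step 2: keep a box only if it does not overlap a higher-scoring kept box.
    for score, box in candidates:
        if all(iou(box, kept_box) <= iou_threshold for _, kept_box in kept):
            kept.append((score, box))
    return kept
```

The pairwise overlap checks in the second step are sequential and data-dependent, which is part of why removing this stage can simplify deployment on constrained hardware.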

[0037] Figure 2 schematically illustrates a flowchart of a method for detecting ship targets according to an embodiment of the present invention.

[0038] As shown in Figure 2, the ship target detection method according to this embodiment may include operations S1 to S2.

[0039] In operation S1, a remote sensing image containing the ship target is acquired, and the remote sensing image is preprocessed.

[0040] In operation S2, the preprocessed remote sensing image is input into the trained target detection model to obtain the category and location information of the ship target.

[0041] The object detection model is an improvement on the YOLOv5 network, consisting of a feature extraction module, a feature fusion module, and a feature detection module. In the feature extraction module, the two-dimensional convolutional layers of the YOLOv5 network are replaced with depthwise separable large kernel convolutions. In the feature fusion module, an additional small object extraction layer is added, and a channel attention module is added before each fusion for multiple feature map fusions after upsampling and downsampling. In the feature detection module, an anchor-free object detection head is configured for each input feature map.

[0042] In this embodiment, the remote sensing image is a visible light grayscale remote sensing image; the above operation S1 preprocesses the remote sensing image, including: dividing the visible light grayscale remote sensing image into blocks and cropping them to form multiple image blocks; and filling the edges of each image block in the multiple image blocks.

[0043] For example, a high-resolution visible light grayscale remote sensing image with 3 channels and 1208*1024 pixels is cropped into blocks, each measuring 640*512 pixels. After cropping, edge padding is applied to each image block that needs it. The image size after edge padding could be, for example, 640*640 pixels, and the edge padding process could involve filling the missing pixels at the image edges with zero-valued pixels.

[0044] At this point, the preprocessed remote sensing image can be 640*640*3, where 3 represents the number of channels.
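The block cropping and edge padding described above can be sketched for a single channel as follows. The sketch assumes non-overlapping tiles and padding on the right and bottom edges only; the text does not say where the padding is placed or whether tiles overlap, and the function name is hypothetical.

```python
def tile_and_pad(image, tile_w=640, tile_h=512, out_size=640, fill=0):
    """image: H x W nested list of pixel values for one channel.
    Returns a list of out_size x out_size blocks in row-major tile order."""
    height, width = len(image), len(image[0])
    blocks = []
    for top in range(0, height, tile_h):
        for left in range(0, width, tile_w):
            # Crop one tile; tiles at the border may be smaller than requested.
            rows = [row[left:left + tile_w] for row in image[top:top + tile_h]]
            # Pad each row to out_size columns, then append rows up to out_size.
            rows = [r + [fill] * (out_size - len(r)) for r in rows]
            rows += [[fill] * out_size for _ in range(out_size - len(rows))]
            blocks.append(rows)
    return blocks
```

With the example dimensions from the text (1208 pixels wide, 1024 pixels high), this yields four blocks, each padded to 640*640; applying it to each of the 3 channels gives the 640*640*3 input.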

[0045] Before applying the object detection model, the pre-built object detection model needs to be trained.

[0046] In this embodiment, the object detection model is trained in the following way:

[0047] Construct an object detection model and define a loss function;

[0048] Acquire a set of historical remote sensing images, annotate and preprocess the historical remote sensing images to obtain training samples;

[0049] The training samples are input into the constructed target detection model to obtain the predicted target location and predicted target category for each historical remote sensing image;

[0050] Based on the loss function, predicted target location, predicted target category, and annotation information, the constructed target detection model is trained by backpropagation to obtain a trained target detection model.

[0051] The historical remote sensing image set is a collection of multiple visible light grayscale remote sensing images acquired before applying the target detection model. The preprocessing method for the historical remote sensing image set is the same as that for operation S1 described above. Labeling the historical remote sensing image set involves marking the ground truth location and category of the ship targets contained in each visible light grayscale remote sensing image within the set.

[0052] This invention does not specifically limit the backpropagation training process of the object detection model; all existing model training methods are applicable to this invention. For example, a loss function is used to measure the error between the predicted target location and predicted target category on the one hand and the ground truth values in the labeled information on the other, guiding the object detection model to optimize its parameters through backpropagation. Common loss functions include mean squared error and cross-entropy error.

[0053] During the training process of an object detection model, the weight parameters of the model can be adjusted by continuously minimizing the loss function value, aiming to achieve an output that closely approximates the ground truth. After training is complete, the optimal weight parameters are obtained, thus determining the trained object detection model.

[0054] Figure 3 schematically illustrates a flowchart of the processing procedure of the target detection model according to an embodiment of the present invention.

[0055] As shown in Figure 3, in this embodiment, the above-described operation S2 of inputting the preprocessed remote sensing image into the trained target detection model to obtain the category and location information of the ship target includes:

[0056] In operation S21, the preprocessed remote sensing image is input into the feature extraction module, which outputs multiple intermediate features at different depths.

[0057] In operation S22, multiple intermediate features of different depths are input into the feature fusion module, which outputs feature maps of multiple scales with fused attention.

[0058] In operation S23, feature maps of multiple scales are input into the feature detection module, which outputs the category and location information of the ship target.

[0059] Figure 4 schematically illustrates the network structure of a target detection model according to an embodiment of the present invention.

[0060] With reference to Figure 4, the feature extraction module, feature fusion module, and feature detection module in the target detection network are explained in detail below in turn.

[0061] I. Feature Extraction Module

[0062] As shown in Figure 4, in this embodiment, the feature extraction module has five layers from shallow to deep: C1, C2, C3, C4, and C5. Layer C1 contains a CBS module; layers C2, C3, and C4 have the same structure, each consisting of a CBS module and an improved CSP module; layer C5 consists of a CBS module, an SPFF module, and an improved CSP module. Then, the features output from layers C2, C3, C4, and C5 are used as multiple intermediate features of different depths output by the feature extraction module in operation S21. For ease of explanation, the improved CSP module can be denoted as CSP_L, and the output intermediate features can be denoted as {C2, C3, C4, C5}.

[0063] It can be seen that, compared to Figure 1, the present invention extracts additional output features from the C2 layer of the existing YOLOv5 network. These output features can detect ship targets that are smaller than those detectable from the C3 and C4 layers, thus improving the feature extraction capability.

[0064] In this embodiment, the CBS module includes a two-dimensional convolutional layer (Conv2D), a batch normalization layer (BN), and a SiLU activation function, which can be denoted as Conv2D_BN_SiLU. The two-dimensional convolutional layer (Conv2D) is a convolution with a size k of 3 and a stride s of 1.

[0065] Figure 5 schematically illustrates the network structure of the CSP module in a prior-art YOLOv5 network.

[0066] As shown in Figure 5, in the prior art, the CSP module in the YOLOv5 network includes a first branch and a second branch. The first branch first performs a 1*1 convolution, and then performs N iterations of the BottleNeck module, where N is an integer greater than 1, representing the number of iterations of the BottleNeck module. The second branch performs a 1*1 convolution, and the convolution result is concatenated with the result of the first branch.

[0067] It should be noted that, in this embodiment of the invention, the concatenation operation includes concatenating two results and then passing the concatenated result to a CBS module. For simplicity, in some embodiments, the subsequent CBS module step can be omitted.

[0068] In the existing technology, the BottleNeck module is a residual structure containing a 1*1 convolution and a 3*3 convolution. The loop of the BottleNeck module is as follows: C2 layer is performed 3 times, C3 layer is performed 9 times, C4 layer is performed 9 times, and C5 layer is performed 3 times.

[0069] Figure 6 schematically illustrates the network structure of the improved CSP module according to an embodiment of the present invention.

[0070] As shown in Figure 5 and Figure 6, in this embodiment, the first improvement to the CSP module is that the number of intermediate iterations of the BottleNeck module in the existing YOLOv5 network's CSP module has been adjusted. For example, the iteration of the BottleNeck module in the improved CSP module can be: C2 layer 3 times, C3 layer 4 times, C4 layer 4 times, and C5 layer 3 times.

[0071] In this embodiment, the second improvement of the CSP module is that, for the BottleNeck module of the existing YOLOv5 network's CSP module, the improved CSP module's BottleNeck module is a residual structure containing a 3*3 convolution and a 5*5 depthwise separable convolution.

[0072] As can be seen, in the prior art, the BottleNeck module is a residual structure containing a 1x1 convolution and a 3x3 convolution. In this embodiment, the BottleNeck module of the improved CSP module is a residual structure containing a 3x3 convolution and a 5x5 depthwise separable convolution. In this way, in the two-dimensional convolutional layer of the BottleNeck module in the improved CSP module, the 1x1 convolution and 3x3 convolution are replaced with a 3x3 convolution and a 5x5 depthwise separable convolution.

[0073] In this embodiment of the invention, the improved CSP module in the feature extraction module replaces the two-dimensional convolutional layer with a depthwise separable large kernel convolution. Since the introduction of depthwise separable convolution increases the number of model layers, this is balanced by reducing the number of BottleNeck iterations. The preprocessed remote sensing image is input into the feature extraction module, and after passing through this module, intermediate features of different depths are output.
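The parameter saving behind the depthwise separable large kernel can be checked with simple arithmetic. The sketch below counts weights only (biases and normalization parameters are ignored), and the channel width of 128 used in the usage note is a hypothetical value, not one stated in the text.

```python
def conv_params(c_in, c_out, k):
    """Weights in a standard k x k convolution: every output channel
    has its own k x k filter over every input channel."""
    return k * k * c_in * c_out

def depthwise_separable_params(c_in, c_out, k):
    """Weights in a depthwise k x k convolution (one filter per input channel)
    followed by a 1 x 1 pointwise convolution that mixes channels."""
    return k * k * c_in + c_in * c_out
```

With 128 input and 128 output channels, a standard 3*3 convolution has 147,456 weights and a standard 5*5 convolution has 409,600, whereas a 5*5 depthwise separable convolution has 19,584. The larger receptive field of the 5*5 kernel therefore comes at a fraction of the weight count of even the smaller standard kernel.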

[0074] Furthermore, the SPFF module in this embodiment of the invention enhances the feature representation capability of the feature extraction module, and its network structure is the same as that of the SPFF module in the YOLOv5 network, which will not be described in detail here.

[0075] At this point, the feature extraction module performs additional feature extraction on layer C2, extracting multiple intermediate features {C2, C3, C4, C5} at different depths from the preprocessed remote sensing image.

[0076] II. Feature Fusion Module

[0077] Figure 7 schematically illustrates a simplified network structure of the feature fusion module according to an embodiment of the present invention.

[0078] With reference to Figure 4 and Figure 7, in this embodiment, the feature fusion module includes multiple improved CSP modules, CBS modules, and ECA modules. The ECA (Efficient Channel Attention) module is used to implement the channel attention mechanism. In the above operation S22, multiple intermediate features of different depths are input into the feature fusion module, and multiple scale feature maps with fused attention are output, as follows:

[0079] The output features of the C5 layer are processed by the first CBS module to obtain the fifth-level intermediate features.

[0080] The fifth-level intermediate features are upsampled and concatenated with the output features of the C4 layer. Then, they are processed by the first improved CSP module and the first ECA module. The resulting features are then processed by the second CBS module to obtain the fourth-level intermediate features with fused attention.

[0081] The fourth-level intermediate features are upsampled and concatenated with the output features of the C3 layer. Then, they are processed by the second improved CSP module and the second ECA module. The resulting features are then processed by the third CBS module to obtain the third-level intermediate features with fused attention.

[0082] The third-level intermediate features are upsampled and concatenated with the output features of the C2 layer. Then, they are processed by the third improved CSP module and the third ECA module. The resulting features are then processed by the fourth CBS module to obtain the second-level intermediate features with fused attention.

[0083] The second-level intermediate features are processed by the fifth CBS module to obtain the P2 feature map.

[0084] After downsampling the P2 feature map, it is concatenated with the third-level intermediate features, and then processed by the fourth improved CSP module and the fourth ECA module. The resulting features are then processed by the sixth CBS module to obtain the P3 feature map with fused attention.

[0085] After downsampling the P3 feature map, it is concatenated with the fourth-level intermediate features, and then processed by the fifth improved CSP module and the fifth ECA module. The resulting features are then processed by the seventh CBS module to obtain the P4 feature map with fused attention.

[0086] After downsampling the P4 feature map, it is concatenated with the fifth-level intermediate features, and then processed by the sixth improved CSP module and the sixth ECA module. The resulting features are then processed by the eighth CBS module to obtain the P5 feature map with fused attention.

[0087] In this way, the additionally extracted shallow feature map C2 is integrated into the path aggregation framework in the feature fusion module. During path aggregation, the feature maps of different depths output by the feature extraction module are fused through upsampling and downsampling of deep features, which enhances information exchange between different layers. Furthermore, an ECA module is added before each feature map fusion to strengthen the information in the effective channels, resulting in feature maps of different scales.
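
As a non-limiting illustration, the top-down and bottom-up paths described above can be traced by tracking only the stride of each feature map relative to the input image. The strides 4, 8, 16, and 32 for C2 to C5 are an assumption based on common YOLOv5 practice, and the helper names are hypothetical. The improved CSP, ECA, and CBS modules are collapsed into one stand-in that is assumed to preserve resolution.

```python
def upsample(stride: int) -> int:
    """2x upsampling halves the stride relative to the input image."""
    return stride // 2


def downsample(stride: int) -> int:
    """2x downsampling doubles the stride relative to the input image."""
    return stride * 2


def concat(a: int, b: int) -> int:
    """Channel concatenation is only valid when both maps share a resolution."""
    if a != b:
        raise ValueError(f"cannot concatenate strides {a} and {b}")
    return a


def csp_eca_cbs(stride: int) -> int:
    """Stand-in for improved CSP -> ECA -> CBS; assumed to keep the resolution."""
    return stride


def fuse(backbone: dict) -> dict:
    """Return the stride of each output map P2..P5 given backbone strides."""
    # Top-down path: deep features are upsampled and merged with shallower ones.
    m5 = backbone["C5"]  # fifth-level intermediate features (after the first CBS)
    m4 = csp_eca_cbs(concat(upsample(m5), backbone["C4"]))
    m3 = csp_eca_cbs(concat(upsample(m4), backbone["C3"]))
    m2 = csp_eca_cbs(concat(upsample(m3), backbone["C2"]))
    # Bottom-up path: shallow outputs are downsampled and merged with deeper ones.
    p2 = m2  # the fifth CBS module keeps the resolution
    p3 = csp_eca_cbs(concat(downsample(p2), m3))
    p4 = csp_eca_cbs(concat(downsample(p3), m4))
    p5 = csp_eca_cbs(concat(downsample(p4), m5))
    return {"P2": p2, "P3": p3, "P4": p4, "P5": p5}


# Assumed backbone strides; the text does not state them.
ASSUMED_STRIDES = {"C2": 4, "C3": 8, "C4": 16, "C5": 32}

if __name__ == "__main__":
    print(fuse(ASSUMED_STRIDES))
```

Under these assumed strides, a hypothetical 640 x 640 image block would yield P2 to P5 maps of 160, 80, 40, and 20 pixels per side.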

[0088] In this embodiment, the ECA module processes the input feature map in the following way:

[0089] The kernel size of the one-dimensional convolution is adaptively calculated based on the number of channels in the input feature map.

[0090] Global pooling is performed on the input feature map.

[0091] Using the calculated kernel size, a one-dimensional convolution is performed on the result of global pooling.

[0092] The result of the one-dimensional convolution is passed through the sigmoid activation function. The activated result is then concatenated with the input feature map to obtain the output feature map corresponding to the input feature map.

[0093] In this way, the ECA module sequentially calculates the one-dimensional convolutional kernel size, performs global pooling, convolution, and sigmoid activation. Based on the relationships between different channels, it automatically adjusts the weights of each channel. This process helps highlight important features and remove redundant information. Adding an ECA channel attention module during path aggregation and adding an extra feature extraction branch to the shallow layers of the network results in a new framework.
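
As a non-limiting illustration, the ECA steps above can be sketched in plain Python. Several points are assumptions not stated in the text: the adaptive kernel size formula (with gamma = 2 and b = 1) is taken from the published ECA-Net design, global pooling is taken to be average pooling, the convolution weights are uniform placeholders for learned values, and the final step rescales each channel by multiplication, which is the usual ECA formulation and matches the stated goal of adjusting channel weights, whereas the text describes that step as concatenation.

```python
import math


def eca_kernel_size(channels: int, gamma: int = 2, b: int = 1) -> int:
    """Adaptive odd kernel size; formula assumed from the ECA-Net design."""
    t = int(abs((math.log2(channels) + b) / gamma))
    return t if t % 2 == 1 else t + 1


def eca(feature_map, conv_weights=None):
    """Apply ECA to a feature map given as C channels of H x W nested lists."""
    channels = len(feature_map)
    k = eca_kernel_size(channels)
    if conv_weights is None:
        conv_weights = [1.0 / k] * k  # placeholder; learned in a real network
    # Step 1: global average pooling gives one descriptor per channel.
    pooled = [
        sum(sum(row) for row in ch) / (len(ch) * len(ch[0]))
        for ch in feature_map
    ]
    # Step 2: 1D convolution across the channel axis, zero padded.
    pad = k // 2
    padded = [0.0] * pad + pooled + [0.0] * pad
    conv = [
        sum(w * padded[i + j] for j, w in enumerate(conv_weights))
        for i in range(channels)
    ]
    # Step 3: sigmoid turns each response into a weight in (0, 1).
    weights = [1.0 / (1.0 + math.exp(-v)) for v in conv]
    # Step 4 (assumed): rescale every channel by its weight.
    return [
        [[value * w for value in row] for row in ch]
        for ch, w in zip(feature_map, weights)
    ]


if __name__ == "__main__":
    fm = [[[0.0, 0.0], [0.0, 0.0]], [[2.0, 2.0], [2.0, 2.0]],
          [[1.0, 1.0], [1.0, 1.0]], [[4.0, 4.0], [4.0, 4.0]]]
    print(eca_kernel_size(len(fm)), eca(fm)[1][0][0])
```

Because the one-dimensional convolution only spans k neighbouring channels, the attention adds very few parameters, which is consistent with the lightweight goal of the embodiment.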

[0094] Thus, as shown in Figures 4 and 7, the feature fusion module fuses the multiple intermediate features {C2, C3, C4, C5} of different depths to obtain multiple scale feature maps with fused attention, namely feature maps P2, P3, P4, and P5, which can be denoted as {P2, P3, P4, P5}.

[0095] Through the embodiments of the present invention, conventional target detection model algorithms are optimized and improved while ensuring a small model size, thereby enhancing the feature extraction of the target during remote sensing target detection, improving the accuracy of target detection in remote sensing images, and enhancing the lightweight performance of the model algorithm.

[0096] III. Feature Detection Module

[0097] Figure 8 schematically illustrates the network structure of the feature detection module according to an embodiment of the present invention.

[0098] As shown in Figures 4 and 8, in this embodiment, the above-described operation S23, in which the multiple scale feature maps are input into the feature detection module and the category and location information of the ship target is output, includes:

[0099] For each feature map across multiple scales, an AnchorFree_Head object detection head is configured for that feature map. The object detection head contains two detection sub-networks.

[0100] The two detection subnetworks are two parallel convolutions: a classification convolution and a localization convolution. The classification convolution is used to predict the probability that a ship target appears at each grid point. There are K target categories, where K is a positive integer. The localization convolution is used to predict the offset of each grid point from the four sides of the ground-truth bounding box.
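
As a non-limiting illustration, the four offsets predicted by the localization convolution can be converted into a bounding box as sketched below. Placing each grid point at the centre of its cell and scaling by the feature map stride is an assumption borrowed from common anchor-free detectors, and the function names are hypothetical.

```python
def grid_point_center(col: int, row: int, stride: int):
    """Image coordinates of a grid point, assumed to sit at the cell centre."""
    return ((col + 0.5) * stride, (row + 0.5) * stride)


def decode_box(cx: float, cy: float, left: float, top: float,
               right: float, bottom: float):
    """Turn offsets to the four box sides into (x1, y1, x2, y2)."""
    return (cx - left, cy - top, cx + right, cy + bottom)


if __name__ == "__main__":
    cx, cy = grid_point_center(col=2, row=3, stride=4)
    print(decode_box(cx, cy, left=3.0, top=2.0, right=5.0, bottom=6.0))
```

No predefined anchor boxes are needed in this scheme: each grid point predicts its own box directly from the four offsets.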

[0101] The label assignment strategy used during training for the two detection sub-networks is minimum loss assignment. The loss function is the sum of the classification loss and the localization loss; the classification loss function is Focal Loss, and the localization loss function includes the L1 loss function and the GIoU loss function.
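
As a non-limiting illustration, the combined loss for a single prediction can be sketched as follows. The Focal Loss defaults (alpha = 0.25, gamma = 2), the unit loss weights, and the (x1, y1, x2, y2) box format are assumptions, since the text does not specify them.

```python
import math


def focal_loss(p: float, target: int, alpha: float = 0.25, gamma: float = 2.0) -> float:
    """Binary focal loss for one predicted probability p and a 0/1 target."""
    p_t = p if target == 1 else 1.0 - p
    alpha_t = alpha if target == 1 else 1.0 - alpha
    p_t = min(max(p_t, 1e-12), 1.0)  # avoid log(0)
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


def l1_loss(pred, gt) -> float:
    """Sum of absolute coordinate differences between two boxes."""
    return sum(abs(a - b) for a, b in zip(pred, gt))


def _area(box) -> float:
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])


def giou_loss(pred, gt) -> float:
    """1 - GIoU, where GIoU = IoU - (enclosing area - union) / enclosing area."""
    inter_w = min(pred[2], gt[2]) - max(pred[0], gt[0])
    inter_h = min(pred[3], gt[3]) - max(pred[1], gt[1])
    inter = max(0.0, inter_w) * max(0.0, inter_h)
    union = _area(pred) + _area(gt) - inter
    enclose = ((max(pred[2], gt[2]) - min(pred[0], gt[0]))
               * (max(pred[3], gt[3]) - min(pred[1], gt[1])))
    if union <= 0.0 or enclose <= 0.0:
        return 1.0  # degenerate boxes: fall back to a neutral value
    giou = inter / union - (enclose - union) / enclose
    return 1.0 - giou


def total_loss(p, target, pred_box, gt_box,
               w_cls=1.0, w_l1=1.0, w_giou=1.0) -> float:
    """Classification loss plus localization loss (L1 + GIoU); weights assumed."""
    return (w_cls * focal_loss(p, target)
            + w_l1 * l1_loss(pred_box, gt_box)
            + w_giou * giou_loss(pred_box, gt_box))


if __name__ == "__main__":
    print(total_loss(0.5, 1, (0.0, 0.0, 1.0, 1.0), (0.0, 0.0, 2.0, 3.0)))
```

The GIoU term stays informative even when the predicted and ground-truth boxes do not overlap, which the plain IoU does not.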

[0102] In this way, in the feature detection module, the target detection head uses an anchor-free detection method. Each feature map in the multiple scale feature maps {P2, P3, P4, P5} is input into a classification branch and a localization branch, and the label assignment strategy is changed to minimum loss assignment. The classification branch uses the classification convolution to predict the probability of each target candidate over the K target categories; the localization branch uses the localization convolution to predict the offsets from each target candidate to the four boundaries of the target's ground-truth box. This invention locates the target directly, which avoids non-maximum suppression (NMS) and is hardware-friendly.

[0103] This invention introduces a minimum loss assignment strategy, which allows better selection of positive and negative samples for target detection during training.
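
As a non-limiting illustration, one possible reading of minimum loss assignment is sketched below: for each ground-truth box, the single candidate with the lowest loss becomes its positive sample, and all remaining candidates are negatives. The cost matrix layout and the function name are hypothetical, and the handling of two ground-truth boxes choosing the same candidate is a simplification not addressed by the text.

```python
def min_loss_assign(cost):
    """Assign one positive candidate per ground-truth box.

    cost[i][j] is the loss candidate i would incur if matched to ground-truth
    box j. Returns a list with, for every candidate, the index of its assigned
    ground-truth box, or -1 for a negative sample.
    """
    num_candidates = len(cost)
    labels = [-1] * num_candidates
    if num_candidates == 0:
        return labels
    num_gt = len(cost[0])
    for j in range(num_gt):
        # The candidate with the smallest loss for this box is the positive.
        best = min(range(num_candidates), key=lambda i: cost[i][j])
        # Simplification: a later box overwrites an earlier one on collision.
        labels[best] = j
    return labels


if __name__ == "__main__":
    cost = [[0.9, 0.2], [0.1, 0.8], [0.5, 0.5]]
    print(min_loss_assign(cost))  # [1, 0, -1]
```

Because each ground-truth box receives only one positive sample, the trained head tends to output one high-scoring box per target, which is what allows duplicate removal to be skipped at inference.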

[0104] Thus, during inference, the two detection sub-networks of the target detection head output the top-k scoring bounding boxes from the results of the two parallel convolutions; these boxes represent the category and location information of the ship target.

[0105] It should be noted that in this embodiment of the invention, an AnchorFree_Head is configured for each feature map. Each target detection head generates its own top-k scoring boxes, and each box has its own position and category. Thus, the four target detection heads produce four sets of top-k scoring boxes. The maximum union of all results can be used to obtain the category and position information of the ship target.
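
As a non-limiting illustration, the per-head top-k selection and the combination of results across heads can be sketched as follows. Each detection is represented by a hypothetical tuple (score, category, box), and the combination is assumed to be a simple union of the per-head results, since the text does not define it further.

```python
def top_k(detections, k):
    """Keep the k highest-scoring detections of one head."""
    return sorted(detections, key=lambda d: d[0], reverse=True)[:k]


def merge_heads(per_head_detections, k):
    """Union of the top-k detections of every head; no NMS step is applied."""
    merged = []
    for detections in per_head_detections:
        merged.extend(top_k(detections, k))
    return merged


if __name__ == "__main__":
    head_p2 = [(0.9, 0, (1, 1, 5, 5)), (0.2, 0, (2, 2, 6, 6))]
    head_p3 = [(0.4, 0, (10, 10, 30, 30)), (0.7, 0, (12, 12, 28, 28))]
    print(merge_heads([head_p2, head_p3], k=1))
```

Sorting and truncating are simple operations to implement on constrained hardware, which fits the hardware-friendly aim stated above.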

[0106] In summary, this invention provides a method for detecting ship targets in real time in remote sensing images captured by UAV aerial photography. The method includes preprocessing large-format remote sensing images. In the feature extraction module, depthwise separable large kernel convolution and residual structures are introduced. In the feature fusion module, for the multiple feature map fusions after upsampling and downsampling in the path aggregation network, an efficient channel attention mechanism is added before each fusion to enhance information exchange between different layers and strengthen effective information, and an additional feature extraction branch is added to the shallow layers to obtain a new framework. In the feature detection module, an additional small target detection head is added, and an anchor-free target detection head is used. By introducing a classification branch, the target is located directly, which avoids non-maximum suppression (NMS) and makes the method hardware-friendly.

[0107] As can be seen from the above description, the embodiments of the present invention achieve at least the following technical effects:

[0108] (1) The feature extraction module of the present invention introduces depthwise separable large kernel convolution and a residual structure, which increases the receptive field and improves the feature extraction capability without increasing the number of parameters.

[0109] (2) The feature fusion module of this invention, for the fusion of multiple feature maps after upsampling and downsampling of the path aggregation network, adds an efficient channel attention mechanism before each fusion to enhance information exchange between different layers, enhance effective information, and improve the accuracy and efficiency of the network. Adding additional feature extraction branches in the shallow layers yields a new framework for extracting features for small targets.

[0110] (3) The feature detection module of the present invention adds an additional small target detection head and uses an anchor-free target detection head. By introducing a classification branch, the target is located directly, which avoids non-maximum suppression (NMS) and is hardware-friendly.

[0111] The flowcharts and block diagrams in the accompanying drawings illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of the present invention. In this regard, each block in a flowchart or block diagram may represent a module, segment, or portion of code containing one or more executable instructions for implementing a specified logical function. It should also be noted that in some alternative implementations, the functions indicated in the blocks may occur in a different order than those indicated in the drawings. For example, two consecutively indicated blocks may actually be executed substantially in parallel, and they may sometimes be executed in reverse order, depending on the functions involved. It should also be noted that each block in a block diagram or flowchart, and combinations of blocks in a block diagram or flowchart, may be implemented using a dedicated hardware-based system that performs the specified function or operation, or using a combination of dedicated hardware and computer instructions.

[0112] Furthermore, the terms "first" and "second" are used for descriptive purposes only and should not be construed as indicating or implying relative importance or implicitly specifying the number of indicated technical features. Therefore, a feature defined as "first" or "second" may explicitly or implicitly include one or more of that feature. In the description of this invention, "a plurality of" means at least two, such as two, three, etc., unless otherwise explicitly specified. Furthermore, the word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements.

[0113] The embodiments of the present invention have been described above. However, these embodiments are merely illustrative and not intended to limit the scope of the invention. Although various embodiments have been described above, this does not mean that the measures in the various embodiments cannot be used advantageously in combination. Various substitutions and modifications can be made by those skilled in the art without departing from the scope of the invention, and all such substitutions and modifications should fall within the scope of the invention.

Claims

1. A method for detecting ship targets, characterized in that the method comprises: acquiring remote sensing images containing ship targets, and preprocessing the remote sensing images; and inputting the preprocessed remote sensing image into a trained target detection model to obtain the category and location information of the ship target; wherein the target detection model is based on an improved YOLOv5 network and includes a feature extraction module, a feature fusion module, and a feature detection module; in the feature extraction module, the two-dimensional convolutional layer of the YOLOv5 network is replaced with a depthwise separable large kernel convolution; in the feature fusion module, an additional small target extraction layer is added, and, for the multiple feature map fusions after upsampling and downsampling, a channel attention module is added before each fusion; in the feature detection module, an anchor-free target detection head is configured for each input feature map; the feature extraction module is used to output the output features of layers C2, C3, C4, and C5 from shallow to deep; the feature fusion module includes multiple improved CSP modules, CBS modules, and ECA modules, and the ECA module is used to implement the channel attention mechanism; and in the feature fusion module: the output features of the C5 layer are processed by the CBS module to obtain the fifth-level intermediate features; the fifth-level intermediate features are upsampled and then concatenated with the output features of the C4 layer, and after processing by the improved CSP module and the ECA module, the resulting features are processed by the CBS module to obtain the fourth-level intermediate features; the fourth-level intermediate features are upsampled and then concatenated with the output features of the C3 layer, and after processing by the improved CSP module and the ECA module, the resulting features are processed by the CBS module to obtain the third-level intermediate features; the third-level intermediate features are upsampled and then concatenated with the output features of the C2 layer, and after processing by the improved CSP module and the ECA module, the resulting features are processed by the CBS module to obtain the second-level intermediate features; the second-level intermediate features are processed by the CBS module to obtain the P2 feature map; the P2 feature map is downsampled and then concatenated with the third-level intermediate features, and after processing by the improved CSP module and the ECA module, the resulting features are processed by the CBS module to obtain the P3 feature map; the P3 feature map is downsampled and then concatenated with the fourth-level intermediate features, and after processing by the improved CSP module and the ECA module, the resulting features are processed by the CBS module to obtain the P4 feature map; and the P4 feature map is downsampled and then concatenated with the fifth-level intermediate features, and after processing by the improved CSP module and the ECA module, the resulting features are processed by the CBS module to obtain the P5 feature map.

2. The method according to claim 1, characterized in that the remote sensing image is a visible light grayscale remote sensing image; and preprocessing the remote sensing image includes: dividing and cropping the visible light grayscale remote sensing image into multiple image blocks; and applying edge padding to each of the multiple image blocks.

3. The method according to claim 1, characterized in that inputting the preprocessed remote sensing image into the trained target detection model to obtain the category and location information of the ship target includes: inputting the preprocessed remote sensing image into the feature extraction module, which outputs multiple intermediate features at different depths; inputting the multiple intermediate features at different depths into the feature fusion module, which outputs multiple scale feature maps with fused attention; and inputting the multiple scale feature maps into the feature detection module, which outputs the category and location information of the ship target.

4. The method according to claim 3, characterized in that the feature extraction module consists of five layers from shallow to deep: C1, C2, C3, C4, and C5; layer C1 contains a CBS module; layers C2, C3, and C4 have the same structure, each consisting of a CBS module and an improved CSP module; layer C5 consists of a CBS module, an SPPF module, and an improved CSP module; and the features output by layers C2, C3, C4, and C5 are used as the multiple intermediate features of different depths output by the feature extraction module.

5. The method according to claim 4, characterized in that the CBS module includes a two-dimensional convolutional layer, a batch normalization layer, and a SiLU activation function, the two-dimensional convolutional layer being a convolution with a kernel size of 3 and a stride of 1; the improved CSP module includes a first branch and a second branch; the first branch first performs a 1*1 convolution and then passes through a repeated BottleNeck module, with a different number of repetitions for each of the C2, C3, C4, and C5 layers; the second branch performs a 1*1 convolution, and the convolution result is concatenated with the result of the first branch; specifically, the improved CSP module adjusts the number of BottleNeck repetitions used in the YOLOv5 network; and in the two-dimensional convolutional layer of the BottleNeck module in the improved CSP module, 1*1 convolutions and 3*3 convolutions are replaced with 3*3 convolutions and 5*5 depthwise separable convolutions.

6. The method according to claim 1, characterized in that the ECA module processes the input feature map in the following way: adaptively calculating the kernel size of the one-dimensional convolution based on the number of channels in the input feature map; performing global pooling on the input feature map; performing, using the calculated kernel size, a one-dimensional convolution on the result of global pooling; and passing the result of the one-dimensional convolution through the sigmoid activation function, and concatenating the activated result with the input feature map to obtain the output feature map corresponding to the input feature map.

7. The method according to claim 3, characterized in that inputting the multiple scale feature maps into the feature detection module and outputting the category and location information of the ship target includes: configuring, for each feature map among the multiple scale feature maps, a target detection head for that feature map, the target detection head containing two detection sub-networks; wherein the two detection sub-networks are two parallel convolutions, namely a classification convolution and a localization convolution, and the label assignment strategy used during training of the two detection sub-networks is minimum loss assignment.

8. The method according to claim 7, characterized in that, during training, the two detection sub-networks use a loss function that is the sum of a classification loss and a localization loss, the classification loss function being Focal Loss and the localization loss function including the L1 loss function and the GIoU loss function; and during inference, the two detection sub-networks generate the top-k scoring bounding boxes from the results of the two parallel convolutions, these bounding boxes representing the category and location information of the ship targets, where k is a positive integer.

9. The method according to claim 1, characterized in that the target detection model is trained in the following way: constructing a target detection model and defining a loss function; acquiring a set of historical remote sensing images, and labeling and preprocessing the set of historical remote sensing images to obtain training samples; inputting the training samples into the constructed target detection model to obtain the predicted target location and predicted target category for each historical remote sensing image; and training the constructed target detection model by backpropagation, based on the loss function, the predicted target location, the predicted target category, and the annotation information, to obtain the trained target detection model.