A method for detecting small maritime targets by unmanned aerial vehicles (UAVs) based on YOLOv13s.
By introducing a lossless downsampling module, a hybrid sensing feature extraction module, and an adaptive gating attention module into the YOLOv13s model, the method addresses feature information loss and noise interference in the detection of extremely small targets at sea, achieving high-precision and stable target detection.
Patent Information
- Authority / Receiving Office: CN · China
- Patent Type: Patents (China)
- Current Assignee / Owner: QUANZHOU INST OF EQUIP MFG
- Filing Date: 2026-04-15
- Publication Date: 2026-06-30
AI Technical Summary
Existing target detection models suffer from high false negative rates, loss of feature information, and insufficient detection stability in complex sea environments when detecting extremely small targets at sea. In particular, in UAV search and rescue scenarios, the geometric and texture information of extremely small targets is easily lost in deep feature maps, and it is difficult to suppress sea surface noise interference.
An improved YOLOv13s model is adopted, which combines a lossless downsampling module, a hybrid sensing feature extraction module, and an adaptive gated attention module with spatial attention masking and frequency domain global self-attention to achieve lossless preservation of extremely small targets and target reconstruction against complex sea surface backgrounds, thereby improving detection accuracy.
The method significantly improves the accuracy and robustness of small target detection at sea, reduces the false negative rate, and effectively suppresses background noise interference in complex sea environments, thereby improving the model's detection stability and parameter efficiency.
Figure CN122067147B_ABST
Description
Technical Field
[0001] This invention relates to the field of target detection, and specifically to a method for detecting small maritime targets by unmanned aerial vehicles (UAVs) based on YOLOv13s.
Background Technology
[0002] Existing target detection models represent extremely small targets at sea poorly. Existing single-stage detection models such as YOLO are mainly designed for medium and large-scale targets. They apply repeated high-ratio downsampling in the backbone network to expand the receptive field. However, in maritime UAV search and rescue scenarios, people in distress or rescue equipment often occupy only a few pixels in the image. After multi-layer feature compression, their weak geometric and texture information is easily lost in the deep feature map, resulting in a high rate of missed detection for distant small targets.
[0003] Traditional downsampling methods are prone to losing feature information of small targets and struggle to suppress sea surface noise interference. Existing detection networks generally use strided convolution or pooling operations to downsample feature maps. While compressing spatial resolution, this process is prone to feature aliasing, which couples the fine-grained information of extremely small targets with background textures such as sea surface waves and reflections. This makes it difficult for the model to extract the salient features of the target in the shallow stage, which reduces detection accuracy.
[0004] Existing models lack global perception of complex sea environments, and their feature adjustment mechanisms are not flexible enough. Limited by the local receptive field of convolution operations, the models cannot make full use of global context information to distinguish targets from dynamic interference such as sea surface reflections and ripples. At the same time, traditional fixed-weight feature processing methods struggle to adapt to changes in sea lighting and complex background conditions, resulting in insufficient detection stability and robustness in extreme environments.
Summary of the Invention
[0005] The purpose of this invention is to provide an improved method for detecting small maritime targets by unmanned aerial vehicles (UAVs) based on YOLOv13s, which enhances detection accuracy.
[0006] To achieve the above objectives, the present invention adopts the following technical solution:
[0007] An improved method for detecting small targets at sea using unmanned aerial vehicles (UAVs) based on YOLOv13s is proposed. The method constructs a YOLOv13s-Sea model for detecting small targets at sea. The YOLOv13s-Sea model is an improved YOLOv13s model. The YOLOv13s-Sea model includes a backbone network for extracting features from the input image, a neck network for extracting and fusing features from the feature map, and a head network for detecting and classifying the fused feature map output by the neck network.
[0008] The two A2C2f modules at the end of the original backbone network, and the DSConv module between the two A2C2f modules, are removed. The DS-C3k2 module in the original backbone network is replaced with a lossless downsampling module and a hybrid sensing feature extraction module connected in sequence. The third Conv module in the original backbone network is replaced with a DSConv module. A lossless downsampling module and a hybrid sensing feature extraction module are then connected in sequence to the end of the original backbone network.
[0009] The first DS-C3k2 module of the original neck network is replaced with a hybrid sensing feature extraction module, and the other DS-C3k2 modules of the original neck network are replaced with an adaptive gating attention module and a hybrid sensing feature extraction module connected in sequence.
[0010] The marine images captured by the UAV are input into the YOLOv13s-Sea model, which outputs the detection results.
[0011] Preferably, the lossless downsampling module includes two branches. One branch performs a slicing operation with a stride of 2 on the input feature map, dividing the feature map into four complementary sub-blocks in the spatial dimension. The four sub-blocks are then recombined by channel concatenation to obtain a rearranged feature map.
[0012] Another branch sequentially performs channel-dimensional average pooling, convolution, and activation function processing on the input feature map to generate a spatial weight mask. The spatial weight mask is then subjected to average pooling and cross-channel broadcasting to generate an attention map.
[0013] The rearranged feature map is multiplied element-wise with the attention map to obtain the sampled feature map.
[0014] Preferably, the hybrid sensing feature extraction module is an improvement on the original DS-C3k2 module, replacing the DS-C3k module in the original DS-C3k2 module with the DS-C3k-FSAS module;
[0015] The DS-C3k-FSAS module includes two branches. One branch uses a first convolutional module to compress the input features, and uses multiple DS-Bottleneck-FSAS units to extract features from the compressed features in sequence to obtain the first feature map.
[0016] Another branch uses a second convolutional module to compress the input features through channels, and then concatenates and fuses the compressed features with the first feature map, and integrates them through the convolutional module to output the second feature map;
[0017] The DS-Bottleneck-FSAS unit sequentially applies a first depthwise separable convolutional module, a third convolutional module, and a second depthwise separable convolutional module to the input features for channel adjustment and local feature extraction. The extracted features are mapped into three branches: Query, Key, and Value. The Query and Key branches are projected to the frequency domain using a Fast Fourier Transform, and the feature correlation between the two branches is calculated in the frequency domain. An Inverse Fourier Transform is used to map the calculation results back to the spatial domain. A fourth convolutional module is used to integrate the mapped features with the Value branch. The integrated features are then combined with the input features through a residual connection to output the enhanced features.
[0018] Preferably, the adaptive gating attention module performs global average pooling, global max pooling, power mean pooling, and low-pass filtering pooling on the input features, concatenates the pooled features along the channel dimension, and uses one-dimensional convolution to fuse the concatenated features to obtain the channel description vector.
[0019] The channel description vector is input into the kernel selector to generate weight coefficients for different kernels, and the channel attention weight map is obtained by weighted aggregation.
[0020] A gated modulation signal $g$ is generated based on the attention weight map: $g = \sigma(w + b)$;
[0021] where $w$ represents the channel attention weight map, $b$ represents the learnable channel bias parameters, and $\sigma$ represents the Sigmoid activation function;
[0022] The residual calibration function is defined as follows:
[0023] ;
[0024] where $\gamma$ represents the learnable scaling factor and $\tanh$ represents the hyperbolic tangent activation function;
[0025] The residual calibration function is used to process the input feature map and output an attention feature map.
[0026] By adopting the aforementioned design scheme, the beneficial effects of the present invention are as follows: This application uses a lossless downsampling module for feature extraction. This lossless downsampling module combines spatial attention masking with pixel slice rearrangement to suppress sea surface background noise while achieving lossless preservation of the original pixel features of extremely small targets.
[0027] A hybrid sensing feature extraction module is used for feature extraction. This module integrates local depthwise convolution and a frequency domain global self-attention operator in parallel. It captures edges in the spatial domain and handles ripples in the frequency domain, thereby improving the target reconstruction capability against complex sea surface backgrounds.
[0028] The adaptive gating attention module uses global statistics aggregation and adaptive receptive field convolution to generate modulation signals, thereby achieving dynamic calibration of the channel feature response, reducing the interference of extreme light and shadow at sea on the detection results, and improving detection accuracy.
Attached Figure Description
[0029] Figure 1 is a diagram of the YOLOv13s-Sea model architecture of the present invention;
[0030] Figure 2 is a flowchart of the lossless downsampling module of the present invention;
[0031] Figure 3 is a schematic diagram of the hybrid sensing feature extraction module of the present invention;
[0032] Figure 4 is a flowchart of the adaptive gating attention module of the present invention;
[0033] Figure 5 shows class activation mapping heatmaps comparing the present invention with the baseline model;
[0034] Figure 6 visualizes the detection results of the present invention in long-distance and water surface reflection scenarios.
Detailed Implementation
[0035] To make the objectives, technical solutions, and advantages of this invention clearer, the invention will be further described in detail below with reference to the accompanying drawings. Obviously, the described embodiments are merely some embodiments of this invention, and not all embodiments. Based on the embodiments of this invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of this invention.
[0036] The terms "first," "second," "third," etc., used in the specification, claims, and accompanying drawings of this invention are used to distinguish different objects, not to describe a specific order. Furthermore, the terms "comprising" and "having," and any variations thereof, are intended to cover non-exclusive inclusion. For example, a process, method, system, product, or apparatus that includes a series of steps or units is not limited to the listed steps or units, but may optionally include steps or units not listed, or may optionally include other steps or units inherent to these processes, methods, products, or apparatuses.
[0037] A method for detecting small maritime targets by UAVs based on an improved version of YOLOv13s constructs the YOLOv13s-Sea model shown in Figure 1. The YOLOv13s-Sea model is an improved YOLOv13s model. It includes a backbone network for feature extraction from the input image, a neck network for feature extraction and fusion of the feature map, and a head network for detection and classification of the fused feature map output by the neck network.
[0038] The two A2C2f modules at the end of the original backbone network and the DSConv module between the two A2C2f modules are removed, thereby reducing background interference caused by high-rate downsampling, reducing the number of network parameters and computational complexity, and avoiding deep semantic redundancy.
[0039] The DS-C3k2 module in the original backbone network is replaced with a lossless downsampling module and a hybrid sensing feature extraction module connected in sequence, and the third Conv module in the original backbone network is replaced with a DSConv module. A lossless downsampling module and a hybrid sensing feature extraction module are then connected in sequence to the end of the original backbone network.
[0040] The first DS-C3k2 module of the original neck network is replaced with a hybrid sensing feature extraction module, and the other DS-C3k2 modules of the original neck network are replaced with an adaptive gating attention module and a hybrid sensing feature extraction module connected in sequence.
[0041] The marine images captured by the UAV are input into the YOLOv13s-Sea model, which outputs the detection results.
[0042] As shown in Figure 2, to address the loss of small target features caused by traditional strided convolution downsampling, this application proposes a spatial attention modulation lossless downsampling module (SAM-S2D). The lossless downsampling module includes two branches. One branch performs a stride-2 slicing operation on the input feature map, dividing the feature map into four complementary sub-blocks in the spatial dimension, denoted $X_1$, $X_2$, $X_3$, and $X_4$. The four sub-blocks are then recombined by channel concatenation to obtain a rearranged feature map. This process is represented by the following formula:
[0043] $X_1(i,j) = X(2i,\,2j),\quad X_2(i,j) = X(2i+1,\,2j),\quad X_3(i,j) = X(2i,\,2j+1),\quad X_4(i,j) = X(2i+1,\,2j+1)$;
[0044] $X_{\mathrm{s2d}} = \mathrm{Concat}(X_1, X_2, X_3, X_4)$;
[0045] where $X \in \mathbb{R}^{B \times C \times H \times W}$ represents the feature map input to the lossless downsampling module, $B$ represents the batch size, $C$ represents the number of channels, and $H$ and $W$ represent the height and width of the feature map, respectively; $\mathrm{Concat}(\cdot)$ indicates channel concatenation;
[0046] To achieve noise suppression, another branch sequentially performs channel-dimensional average pooling, 3×3 convolution, and sigmoid activation on the input feature map to generate a spatial weight mask at the original resolution. This spatial weight mask is then subjected to average pooling with a stride of 2 and broadcast across channels to generate an attention map $A$;
[0047] The rearranged feature map $X_{\mathrm{s2d}}$ is multiplied element-wise with the attention map $A$ to obtain the sampled feature map $Y$, a process represented by the following formula:
[0048] $Y = X_{\mathrm{s2d}} \odot A$;
[0049] where $\odot$ indicates element-wise multiplication.
[0050] This modulation process gives the downsampling operation adaptive selectivity, significantly improving the signal-to-noise ratio and discrimination sensitivity of the model in complex marine environments by enhancing the response of the target area and suppressing dynamic background noise such as wave reflection and shadows.
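The two branches described above can be illustrated with a minimal NumPy sketch. It is not the patented implementation: the 2×2 pooling window, the zero padding, the ordering of the four sub-blocks, and the random `kernel` standing in for learned 3×3 weights are all assumptions made for illustration, and the helper names are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def space_to_depth(x):
    """Slice a (B, C, H, W) map with stride 2 into four complementary
    sub-blocks and concatenate them along the channel axis.
    Assumes even H and W; returns (B, 4C, H/2, W/2)."""
    sub_blocks = [
        x[:, :, 0::2, 0::2],
        x[:, :, 1::2, 0::2],
        x[:, :, 0::2, 1::2],
        x[:, :, 1::2, 1::2],
    ]
    return np.concatenate(sub_blocks, axis=1)

def conv3x3(m, kernel):
    """Zero-padded 3x3 cross-correlation on a single-channel (B, 1, H, W) map."""
    _, _, h, w = m.shape
    padded = np.pad(m, ((0, 0), (0, 0), (1, 1), (1, 1)))
    out = np.zeros(m.shape, dtype=float)
    for di in range(3):
        for dj in range(3):
            out += kernel[di, dj] * padded[:, :, di:di + h, dj:dj + w]
    return out

def avg_pool_stride2(m):
    """2x2 average pooling with stride 2 (the window size is an assumption)."""
    b, c, h, w = m.shape
    return m.reshape(b, c, h // 2, 2, w // 2, 2).mean(axis=(3, 5))

def sam_s2d(x, kernel):
    rearranged = space_to_depth(x)                  # (B, 4C, H/2, W/2)
    channel_mean = x.mean(axis=1, keepdims=True)    # (B, 1, H, W)
    mask = sigmoid(conv3x3(channel_mean, kernel))   # spatial weight mask
    attention = avg_pool_stride2(mask)              # (B, 1, H/2, W/2)
    return rearranged * attention                   # broadcast over channels

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 3, 8, 8))
kernel = rng.standard_normal((3, 3))  # stands in for learned weights
y = sam_s2d(x, kernel)
print(y.shape)
```

The slicing branch only rearranges values, so every input pixel survives into the 4C channels; the attenuation comes entirely from the mask branch.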
[0051] To address the issue of limited local receptive fields in convolutional networks, this application proposes a hybrid sensing feature extraction module, the structure of which is shown in Figure 3. The hybrid sensing feature extraction module is an improvement on the original DS-C3k2 module, replacing the DS-C3k module in the original DS-C3k2 module with the DS-C3k-FSAS module.
[0052] The DS-C3k-FSAS module includes two branches. One branch uses a 1×1 first convolution module to compress the input features through channels, and uses multiple DS-Bottleneck-FSAS units to extract features from the compressed features in sequence to obtain the first feature map. DS-Bottleneck-FSAS is the core computing unit of the entire structure. Its main function is to establish a feature interaction mechanism between the frequency domain and the spatial domain, thereby achieving coordinated optimization of global context modeling and local detail preservation.
[0053] Another branch uses a 1×1 second convolution module to compress the input features through channels, and then concatenates and fuses the compressed features with the first feature map and integrates them through a 1×1 convolution module to output a second feature map, thereby improving the feature representation capability and reducing information loss.
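The two-branch layout of the DS-C3k-FSAS module can be sketched as follows. This is only a structural illustration under assumptions: the 1×1 convolutions are written as plain channel mixing without normalization or activation, the channel counts are arbitrary, and `placeholder_unit` is a hypothetical stand-in for a DS-Bottleneck-FSAS unit rather than the unit itself.

```python
import numpy as np

def conv1x1(x, weight):
    """1x1 convolution as per-pixel channel mixing.
    x: (B, C_in, H, W), weight: (C_out, C_in) -> (B, C_out, H, W)."""
    return np.einsum("oc,bchw->bohw", weight, x)

def ds_c3k_fsas(x, w_first, w_second, w_fuse, units):
    """Compressed features pass through the units in sequence, are
    concatenated with a second compressed copy of the input, and are
    fused by a final 1x1 convolution."""
    first = conv1x1(x, w_first)
    for unit in units:  # each unit maps (B, C', H, W) -> (B, C', H, W)
        first = unit(first)
    second = conv1x1(x, w_second)
    merged = np.concatenate([first, second], axis=1)
    return conv1x1(merged, w_fuse)

def placeholder_unit(features):
    """Hypothetical stand-in for a DS-Bottleneck-FSAS unit."""
    return features + np.tanh(features)

rng = np.random.default_rng(0)
c_in, c_mid, c_out = 8, 4, 8
x = rng.standard_normal((1, c_in, 6, 6))
out = ds_c3k_fsas(
    x,
    rng.standard_normal((c_mid, c_in)),
    rng.standard_normal((c_mid, c_in)),
    rng.standard_normal((c_out, 2 * c_mid)),
    [placeholder_unit, placeholder_unit],
)
print(out.shape)
```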
[0054] The DS-Bottleneck-FSAS unit sequentially employs a 3×3 first depthwise separable convolutional module, a 1×1 third convolutional module, and a 3×3 second depthwise separable convolutional module to perform channel adjustment and local feature extraction on the input features. The extracted features are mapped to three branches: Query, Key, and Value. The Query and Key branches are projected to the frequency domain using a Fast Fourier Transform (FFT), and the feature correlation between the two branches is calculated in the frequency domain. An Inverse Fourier Transform (IFT) is then used to map the calculation results back to the spatial domain. This process models global spatial dependencies while maintaining computational efficiency. A fourth convolutional module integrates the mapped features with the Value branch, and the integrated features are combined with the input features through a residual connection to output enhanced features.
[0055] Through this design, the DS-C3k2-FSAS module can achieve collaborative modeling of frequency domain global information and spatial local details while maintaining computational efficiency, thereby effectively improving the model's ability to detect small targets in complex sea environments.
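The frequency-domain step can be made concrete with a short NumPy sketch. It rests on several assumptions not fixed by the text: Query, Key, and Value are produced by hypothetical 1×1 projections, "feature correlation" is taken to be circular cross-correlation (product of one spectrum with the conjugate of the other), and the mapped result is integrated with Value by element-wise multiplication before the output projection.

```python
import numpy as np

def conv1x1(x, weight):
    """Per-pixel channel mixing. x: (B, C_in, H, W), weight: (C_out, C_in)."""
    return np.einsum("oc,bchw->bohw", weight, x)

def frequency_correlation(q, k):
    """Circular cross-correlation of q and k over the spatial axes,
    computed as IFFT(FFT(q) * conj(FFT(k)))."""
    q_f = np.fft.fft2(q, axes=(-2, -1))
    k_f = np.fft.fft2(k, axes=(-2, -1))
    return np.fft.ifft2(q_f * np.conj(k_f), axes=(-2, -1)).real

def fsas_attention(x, w_q, w_k, w_v, w_out):
    q, k, v = conv1x1(x, w_q), conv1x1(x, w_k), conv1x1(x, w_v)
    corr = frequency_correlation(q, k)    # global interaction in one pass
    fused = conv1x1(corr * v, w_out)      # assumed integration with Value
    return x + fused                      # residual connection

rng = np.random.default_rng(0)
c = 4
x = rng.standard_normal((1, c, 8, 8))
weights = [rng.standard_normal((c, c)) * 0.1 for _ in range(4)]
y = fsas_attention(x, *weights)
print(y.shape)
```

Computing the correlation through the FFT costs O(HW log HW) per channel, whereas evaluating every spatial shift directly costs O((HW)^2); this is the efficiency argument behind working in the frequency domain.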
[0056] To address the weak signals of extremely small targets and the high-frequency noise interference from strong light and waves on the sea surface during detection by maritime unmanned aerial vehicles (UAVs), this application designs the adaptive gated attention module (SAGA) shown in Figure 4. To overcome the loss of spatial structure information caused by relying on a single global average pooling (GAP) in traditional channel attention, the module performs global average pooling, global max pooling, power-mean pooling, and low-pass filtering pooling on the input features, respectively. This process is expressed by the following formula:
[0057] ;
[0058] where $x_{b,c}(i,j)$ represents the feature value of the c-th channel in the b-th sample at spatial location (i, j); $K$ represents a low-pass filter convolution kernel; the symbol * indicates the convolution operation; $\varepsilon$ is a very small positive number, used to prevent zero values or instability during numerical calculations; and $p$ represents the power exponent parameter in power-mean pooling, used to adjust the pooling's sensitivity to large response values. When $p$ is greater than 1, the larger $p$ is, the closer the pooling result is to the max pooling result; when $p = 1$, power-mean pooling degenerates into average pooling.
[0059] The pooled features are concatenated along the channel dimension, and one-dimensional convolution (Conv1D) is used to fuse the concatenated features to obtain the channel description vector, so as to comprehensively reflect the saliency features of sea surface targets and background statistical distribution information.
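A minimal NumPy sketch of the four pooled descriptors and their fusion is given below. Several details are assumptions, since the text does not fix them: the power mean is taken over magnitudes with `eps` added inside the power, the low-pass kernel is a 3×3 box filter whose output is reduced by a spatial mean, and the four descriptors are stacked as four input channels of a one-dimensional convolution that runs along the channel axis. All function names are hypothetical.

```python
import numpy as np

def smooth(x, kernel):
    """Apply one 3x3 low-pass kernel to every channel (edge-replicated padding)."""
    _, _, h, w = x.shape
    padded = np.pad(x, ((0, 0), (0, 0), (1, 1), (1, 1)), mode="edge")
    out = np.zeros(x.shape, dtype=float)
    for di in range(3):
        for dj in range(3):
            out += kernel[di, dj] * padded[:, :, di:di + h, dj:dj + w]
    return out

def pooled_descriptors(x, p=3.0, eps=1e-6, kernel=None):
    """x: (B, C, H, W) -> (B, 4, C): average, max, power-mean, low-pass."""
    if kernel is None:
        kernel = np.full((3, 3), 1.0 / 9.0)   # assumed box filter
    avg = x.mean(axis=(2, 3))
    mx = x.max(axis=(2, 3))
    power = np.mean((np.abs(x) + eps) ** p, axis=(2, 3)) ** (1.0 / p)
    low = smooth(x, kernel).mean(axis=(2, 3))  # assumed reduction: spatial mean
    return np.stack([avg, mx, power, low], axis=1)

def fuse_conv1d(desc, weight):
    """Fuse (B, 4, C) descriptors into a (B, C) channel description vector.
    weight: (4, k) with odd k; zero 'same' padding along the channel axis."""
    batch, _, c = desc.shape
    k = weight.shape[1]
    padded = np.pad(desc, ((0, 0), (0, 0), (k // 2, k // 2)))
    out = np.zeros((batch, c))
    for t in range(k):
        out += np.einsum("d,bdc->bc", weight[:, t], padded[:, :, t:t + c])
    return out

rng = np.random.default_rng(0)
x = rng.uniform(0.1, 1.0, size=(2, 5, 8, 8))
z = fuse_conv1d(pooled_descriptors(x), rng.standard_normal((4, 3)))
print(z.shape)
```

With `p=1.0` the power-mean descriptor matches the average descriptor up to `eps` for non-negative inputs, and as `p` grows it approaches the max descriptor, which is the behavior described for the power exponent.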
[0060] To adapt to targets of different scales and complex sea surface backgrounds, this adaptive gated attention module constructs a set of one-dimensional convolutional kernels with different receptive fields as a kernel selector to achieve multi-scale channel modeling. The channel description vectors are input into the kernel selector to generate weight coefficients for different kernels, and the channel attention weight map is obtained through weighted aggregation. It is expressed by the following formula:
[0061] $w = \sum_{k} \alpha_k \, u_k$;
[0062] where $w$ represents the channel attention weight map; $\alpha = (\alpha_1, \alpha_2, \ldots)$ represents the kernel selection weight vector output by the kernel weight generator, used to measure the importance of convolutional kernels with different receptive fields; and $u_k$ indicates the feature response obtained after applying the $k$-th one-dimensional convolutional kernel to the fused descriptor;
[0063] Through the aforementioned adaptive kernel selection mechanism, the model can dynamically adjust the convolutional receptive field according to the statistical distribution of input features, thereby improving its ability to model targets of different scales and complex background interference.
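The kernel selection step can be sketched as a softmax-weighted sum over one-dimensional convolutions of different sizes. The form of the kernel weight generator is not given in the text, so the linear projection followed by softmax used here (`w_gen`) is purely an assumption, as are the kernel sizes and the zero padding.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def conv1d_same(v, kernel):
    """Zero-padded 1D cross-correlation along the channel axis.
    v: (B, C), kernel: (k,) with odd k -> (B, C)."""
    k = kernel.shape[0]
    c = v.shape[1]
    padded = np.pad(v, ((0, 0), (k // 2, k // 2)))
    return sum(kernel[t] * padded[:, t:t + c] for t in range(k))

def kernel_select(z, kernels, w_gen):
    """z: (B, C) channel description vector; kernels: list of 1D kernels;
    w_gen: (C, K) hypothetical generator weights, K = len(kernels)."""
    alpha = softmax(z @ w_gen, axis=1)                                  # (B, K)
    responses = np.stack([conv1d_same(z, k) for k in kernels], axis=1)  # (B, K, C)
    weights = np.einsum("bk,bkc->bc", alpha, responses)                 # weighted aggregation
    return weights, alpha

rng = np.random.default_rng(0)
z = rng.standard_normal((2, 16))
kernels = [rng.standard_normal(k) for k in (3, 5, 7)]   # three receptive fields
w, alpha = kernel_select(z, kernels, rng.standard_normal((16, 3)))
print(w.shape, alpha.sum(axis=1))
```

Because the selection weights depend on the descriptor itself, two inputs with different statistics can emphasize different receptive fields without changing any learned kernel.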
[0064] To avoid excessive suppression of original features by attention weights and to improve network training stability, this module introduces a residual gating calibration mechanism. This mechanism generates a gated modulation signal $g$ based on the attention weight map: $g = \sigma(w + b)$;
[0065] where $w$ represents the channel attention weight map, $b$ represents the learnable channel bias parameter used to adjust the response strength of the gating signal, and $\sigma$ represents the Sigmoid activation function;
[0066] The residual calibration function is defined as follows:
[0067] ;
[0068] where $\gamma$ represents the learnable scaling factor, used to control the modulation amplitude of the residual calibration term on the original features, and $\tanh$ represents the hyperbolic tangent activation function;
[0069] The residual calibration function is used to process the input feature map and output an attention feature map, which is represented by the following formula:
[0070] ;
[0071] This design maintains a near-identity mapping in the early stages of training, thus ensuring stable network training. As training progresses, the attention mechanism gradually enhances key channels, thereby improving the model's robustness in complex environments such as strong light reflection and wave interference.
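The near-identity behavior can be illustrated with a small NumPy sketch. The exact residual calibration function is not reproduced in this text, so the form used here, a per-channel scale of `1 + gamma * tanh(gate)` with `gamma` starting at zero, is an assumption chosen only because it is consistent with the description (a learnable scaling factor, a hyperbolic tangent, and an identity mapping at the start of training).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def residual_gate(x, w, b, gamma):
    """x: (B, C, H, W) input features; w: (B, C) channel attention weights;
    b: (C,) channel bias; gamma: scalar scaling factor."""
    gate = sigmoid(w + b)                   # gated modulation signal in (0, 1)
    scale = 1.0 + gamma * np.tanh(gate)     # assumed residual calibration form
    return x * scale[:, :, None, None]      # broadcast over spatial positions

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 4, 5, 5))
w = rng.standard_normal((2, 4))
b = np.zeros(4)

start = residual_gate(x, w, b, gamma=0.0)   # start of training: identity
later = residual_gate(x, w, b, gamma=0.5)   # later: channels rescaled
print(np.array_equal(start, x), np.abs(later / x).max())
```

Under this assumed form the scale never drops below 1 for a non-negative `gamma`, so the gate can emphasize channels but cannot zero out the original features, which matches the stated aim of avoiding excessive suppression.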
[0072] The experimental data in Table 1 below show that the YOLOv13s-Sea model of this application achieved 44.8% and 77.2% AP and AP50 scores on the SeaDronesSee dataset, respectively, which are 8.3% and 14.0% higher than the YOLOv13s model, and the number of model parameters decreased from 9.0M to 2.5M.
[0073] Table 1 shows the comparative experiments of the method proposed in this application (SeaDronesSee Object Detection v2 dataset):
[0074]
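As a quick arithmetic check of the figures quoted for the SeaDronesSee comparison, the sketch below derives the implied baseline scores and the relative parameter reduction. It assumes the stated gains of 8.3% and 14.0% are absolute percentage points; if they were meant as relative improvements, the implied baselines would differ.

```python
# Reported YOLOv13s-Sea results and gains over YOLOv13s (from the text).
ap, ap50 = 44.8, 77.2
gain_ap, gain_ap50 = 8.3, 14.0
params_before_m, params_after_m = 9.0, 2.5

# Assumption: gains are absolute percentage points.
baseline_ap = round(ap - gain_ap, 1)
baseline_ap50 = round(ap50 - gain_ap50, 1)
param_reduction_pct = round(100.0 * (params_before_m - params_after_m) / params_before_m, 1)

print(baseline_ap, baseline_ap50, param_reduction_pct)  # 36.5 63.2 72.2
```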
[0075] The experimental data in Table 2 below show that the YOLOv13s-Sea model of this application improves the AP and AP50 by 2.6% and 8.2% respectively on the TinyPerson dataset.
[0076] Table 2 shows the generalization experiments for the TinyPerson dataset of marine unmanned aerial vehicles:
[0077]
[0078] Figure 5 compares the original image with the feature responses of the baseline method YOLOv13s and the proposed method YOLOv13s-Sea, illustrating how the two methods differ in their focus on small target areas at sea. Compared to the baseline method YOLOv13s, the response area of the proposed method is more concentrated on the location of small targets at sea, with fewer invalid responses to background areas such as sea surface textures and reflections. This indicates that the proposed method can more effectively highlight the features of small targets and suppress interference from complex sea surface backgrounds.
[0079] Figure 6 compares the ground truth with the detection results of the baseline method YOLOv13s and the proposed method YOLOv13s-Sea. Compared to the baseline method YOLOv13s, the proposed method's detection results are closer to the ground truth, the target localization is more accurate, and false detections against complex sea surface backgrounds are reduced. This indicates that the proposed method has better detection performance and localization capability in small target detection tasks at sea.
[0080] In summary, this application uses a lossless downsampling module for feature extraction. This lossless downsampling module combines spatial attention masking with pixel slice rearrangement to suppress background noise on the sea surface while achieving lossless preservation of the original pixel features of extremely small targets.
[0081] A hybrid sensing feature extraction module is used for feature extraction. This module integrates local depthwise convolution and a frequency domain global self-attention operator in parallel. It captures edges in the spatial domain and handles ripples in the frequency domain, thereby improving the target reconstruction capability against complex sea surface backgrounds.
[0082] The adaptive gating attention module uses global statistics aggregation and adaptive receptive field convolution to generate modulated signals, thereby achieving dynamic calibration of the channel feature response to eliminate the interference of extreme light and shadow at sea on the detection results and improve detection accuracy.
[0083] The specific embodiments described above further illustrate the purpose, technical solution, and beneficial effects of the present invention. It should be understood that the above description is only a specific embodiment of the present invention and is not intended to limit the scope of protection of the present invention. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of the present invention should be included within the scope of protection of the present invention.
Claims
1. A method for detecting small maritime targets by unmanned aerial vehicles (UAVs) based on an improved version of YOLOv13s, characterized in that: A YOLOv13s-Sea model for small target detection at sea is constructed. This YOLOv13s-Sea model is an improved YOLOv13s model. The YOLOv13s-Sea model includes a backbone network for feature extraction of the input image, a neck network for feature extraction and fusion of the feature map, and a head network for detection and classification of the fused feature map output by the neck network. Remove the two A2C2f modules at the end of the original backbone network and the DSConv module between the two A2C2f modules. Replace the DS-C3k2 module in the original backbone network with a lossless downsampling module and a hybrid sensing feature extraction module connected in sequence. Replace the third Conv module in the original backbone network with a DSConv module. The end of the original backbone network is connected with a lossless downsampling module and a hybrid sensing feature extraction module in sequence. The first DS-C3k2 module of the original neck network is replaced with a hybrid sensing feature extraction module, and the other DS-C3k2 modules of the original neck network are replaced with an adaptive gating attention module and a hybrid sensing feature extraction module connected in sequence. Input the marine images captured by the drone into the YOLOv13s-Sea model, and output the detection results; The lossless downsampling module includes two branches. One branch performs a slicing operation with a stride of 2 on the input feature map, dividing the feature map into four complementary sub-blocks in the spatial dimension. The four sub-blocks are then recombined using channel splicing to obtain a rearranged feature map. Another branch sequentially performs channel-dimensional average pooling, convolution, and activation function processing on the input feature map to generate a spatial weight mask. 
The spatial weight mask is then subjected to average pooling and cross-channel broadcasting to generate an attention map. The rearranged feature map is multiplied element-wise with the attention map to obtain the sampled feature map; The hybrid sensing feature extraction module is an improvement on the original DS-C3k2 module, replacing the DS-C3k module in the original DS-C3k2 module with the DS-C3k-FSAS module; The DS-C3k-FSAS module includes two branches. One branch uses a first convolutional module to compress the input features, and uses multiple DS-Bottleneck-FSAS units to extract features from the compressed features in sequence to obtain the first feature map. Another branch uses a second convolutional module to compress the input features through channels, and then concatenates and fuses the compressed features with the first feature map, and integrates them through the convolutional module to output the second feature map; The DS-Bottleneck-FSAS unit sequentially applies a first depthwise separable convolutional module, a third convolutional module, and a second depthwise separable convolutional module to the input features for channel adjustment and local feature extraction. The extracted features are mapped into three branches: Query, Key, and Value. The Query and Key branches are projected to the frequency domain using a Fast Fourier Transform, and the feature correlation between the two branches is calculated in the frequency domain. The calculation result is mapped back to the spatial domain using an Inverse Fourier Transform. The fourth convolutional module integrates the mapped features with the Value branch. The integrated features are then residually concatenated with the input features to output the enhanced features. 
The adaptive gated attention module performs global average pooling, global max pooling, power mean pooling, and low-pass filtering pooling on the input features, concatenates the pooled features along the channel dimension, and uses one-dimensional convolution to fuse the concatenated features to obtain the channel description vector. The channel description vector is input into the kernel selector to generate weight coefficients for different kernels, and the channel attention weight map is obtained by weighted aggregation. A gated modulation signal $g$ is generated based on the attention weight map: $g = \sigma(w + b)$; where $w$ represents the channel attention weight map, $b$ represents the learnable channel bias parameters, and $\sigma$ represents the Sigmoid activation function; The residual calibration function is defined using a learnable scaling factor and the hyperbolic tangent activation function; The residual calibration function is used to process the input feature map and output an attention feature map.