An unmanned aerial vehicle aerial image small target detection method based on self-modulation dynamic modeling and semantic detail collaborative enhancement

By employing a method of self-modulation dynamic modeling and semantic detail co-enhancement, a small target detection network for UAV aerial images is constructed. This addresses the issues of low detection accuracy and poor scale adaptability for small targets in UAV aerial images, achieving high-precision and robust detection results.

CN122200446A · Pending · Publication Date: 2026-06-12 · ANHUI UNIV OF SCI & TECH

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Application (China)
Current Assignee / Owner: ANHUI UNIV OF SCI & TECH
Filing Date: 2026-03-20
Publication Date: 2026-06-12

AI Technical Summary

Technical Problem

Small target detection in UAV aerial images suffers from low accuracy, poor scale adaptability, and interference from complex backgrounds, leading to the easy loss of small target information. Traditional methods struggle to effectively preserve edge and fine-grained features, affecting detection accuracy and robustness.

Method used

By employing a method of self-modulation dynamic modeling and semantic detail co-enhancement, a target detection network is constructed through a multi-scale self-modulation dynamic modeling module, an attention-guided high-low layer feature fusion module, and a high-resolution small target detection head, thereby achieving adaptive modeling and high-precision detection of targets at different scales.

Benefits of technology

It improves the accuracy and robustness of small target detection in UAV aerial photography scenarios, can adapt to complex backgrounds and scale changes, and significantly enhances the detection performance and stability of micro-targets.



Abstract

The application discloses a method for detecting small targets in unmanned aerial vehicle (UAV) aerial images based on self-modulation dynamic modeling and semantic detail collaborative enhancement. The method includes the following steps. Step S1: prepare a UAV image dataset and divide it into a training set and a test set. Step S2: design a multi-scale self-modulation dynamic modeling module. Step S3: design an attention-guided high-low layer feature fusion module. Step S4: design a high-resolution small target detection head. Step S5: construct and train the target detection network model. Step S6: select evaluation metrics and obtain the detection results. Compared with the prior art, the application achieves higher detection accuracy and robustness in UAV scenes with large variations in target scale, complex backgrounds, and densely distributed small targets.

Description

Technical Field

[0001] This invention relates to the field of UAV image processing and target detection technology, and in particular to a method for small target detection in UAV aerial images based on self-modulation dynamic modeling and semantic detail enhancement.

Background Technology

[0002] Existing technologies for small target detection in drone aerial photography still have significant shortcomings: due to the randomness of drone shooting altitude and angle, the scale differences of similar targets in the image are significant, and large targets can easily obscure the fine-grained information of small targets; drone images usually contain high-density small targets and complex and variable backgrounds, and traditional convolutional operations can easily dilute small target information and lose detailed features during feature extraction; existing feature fusion methods are mostly fixed-weight top-down or lateral connections, which cannot adaptively distinguish the importance of different scales or channels, resulting in shallow texture information being suppressed by deep semantics, affecting the accuracy of small target detection; for tiny targets, the spatial resolution of traditional detection heads is limited, which easily leads to missed detections and false detections.

[0003] Therefore, in drone aerial photography scenarios, there is an urgent need for a detection method that can effectively address issues such as large target scale variations, high target density, complex backgrounds, and the easy loss of information about small targets. This method should simultaneously preserve the edge and fine-grained features of small targets, achieve adaptive fusion of shallow texture and deep semantic information, and accurately detect tiny targets on high-resolution features, thereby improving the accuracy and robustness of drone small target detection. This invention is proposed against this backdrop, aiming to solve the problems of low accuracy, poor scale adaptability, and interference from complex backgrounds in existing drone aerial small target detection technologies. It provides a drone aerial small target detection method based on self-modulation dynamic modeling and semantic detail co-enhancement, offering an effective solution for small target detection in complex drone scenarios.

Summary of the Invention

[0004] In existing UAV aerial image target detection technology, small targets are prone to feature weakening, edge information loss, and insufficient fusion of semantics and detail under complex backgrounds, drastic scale changes, and repeated downsampling, which results in low detection accuracy and robustness. In view of this, the main purpose of this invention is to provide a UAV aerial image small target detection method based on self-modulation dynamic modeling and semantic detail co-enhancement, so as to improve the detection accuracy and stability of small targets in complex scenes.

[0005] To achieve the above objectives, the present invention adopts the following technical solution: a method for small target detection in UAV aerial images based on self-modulation dynamic modeling and semantic detail co-enhancement, characterized in that the method includes the following steps:

Step S1, dataset preprocessing: Based on the drone aerial image dataset, the images are cleaned, labeled, and organized, and divided into a training set and a test set.

Step S2, design a multi-scale self-modulation dynamic modeling module: The module is placed at the successive downsampling points of the backbone network. It includes multi-scale parallel convolution, a local feature enhancement branch, a global feature attention branch, and normalization and activation operations, and is used to adaptively model small target structural features at different scales during the downsampling process.

Step S3, design an attention-guided high-low feature fusion module: The module is placed in the feature fusion stage of the network neck. It includes channel mapping of high-level semantic features and low-level detail features, channel and spatial feature modeling based on an attention mechanism, multi-branch candidate feature generation, and an adaptive weighted fusion structure, and it sets a residual branch to retain low-level details. The residual connection realizes the synergistic enhancement of semantic information and detail information.

Step S4, design a high-resolution small target detection head: A high-resolution small target detection branch is added to the detection head. The branch performs multi-scale interaction by upsampling deep features and feeding them into the attention-guided high-low feature fusion module, refines the result with the C3k2 module, and finally performs classification and bounding box regression for small targets using a multi-scale prediction structure.

Step S5, construct and train the target detection network: The multi-scale self-modulation dynamic modeling module of step S2 is integrated into the backbone of the target detection network, embedded at the downsampling points of the backbone to adaptively model the structural features of small targets at different scales. The attention-guided high-low layer feature fusion module of step S3 is integrated into the network neck to achieve synergistic enhancement of high-level semantic information and low-level detail information during the feature fusion stage. The high-resolution small target detection head of step S4 is integrated as a new branch into the detection head. The resulting small target detection model for UAV aerial images is trained on a public UAV aerial image dataset.

Step S6, select evaluation metrics and obtain detection results: Inference is performed on the test set using the trained model to output the small target detection results for the drone aerial images, including target categories and bounding boxes. The average precision metric is then used to evaluate the detection results and verify the model's detection performance.
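The evaluation in step S6 can be illustrated with a small sketch. It assumes the metric is the usual detection average precision computed per class at an IoU threshold of 0.5 (the document later refers to mAP50), with greedy matching of score-ranked detections and all-point interpolation of the precision-recall curve. The function names `iou` and `average_precision` and the box format `(x1, y1, x2, y2)` are illustrative assumptions, not taken from the patent.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def average_precision(detections, ground_truths, iou_threshold=0.5):
    """AP for one class; detections is a list of (score, box)."""
    detections = sorted(detections, key=lambda d: d[0], reverse=True)
    matched = [False] * len(ground_truths)
    hits = []
    for _, box in detections:
        # Find the ground-truth box with the highest overlap.
        best_iou, best_idx = 0.0, -1
        for idx, gt in enumerate(ground_truths):
            overlap = iou(box, gt)
            if overlap > best_iou:
                best_iou, best_idx = overlap, idx
        # A detection is a true positive only if its best match is unused.
        if best_iou >= iou_threshold and not matched[best_idx]:
            matched[best_idx] = True
            hits.append(1)
        else:
            hits.append(0)
    precisions, recalls, true_pos = [], [], 0
    for rank, hit in enumerate(hits, start=1):
        true_pos += hit
        precisions.append(true_pos / rank)
        recalls.append(true_pos / len(ground_truths) if ground_truths else 0.0)
    # Make precision monotonically non-increasing from right to left.
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    # Integrate precision over recall.
    ap, prev_recall = 0.0, 0.0
    for precision, recall in zip(precisions, recalls):
        ap += (recall - prev_recall) * precision
        prev_recall = recall
    return ap
```

Averaging this value over all classes gives a mean average precision; the exact protocol used by the applicants is not stated in the text.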

[0006] Further, step S2 specifically includes: the multi-scale self-modulation dynamic modeling module is used to obtain features of different scales through multi-scale parallel convolution during the downsampling process of the backbone network, and to achieve adaptive modeling of local spatial structure and global semantic information through local feature enhancement branches and global feature attention branches.

[0007] Furthermore, in step S2, the multi-scale self-modulation dynamic modeling module specifically operates as follows. The feature map output by the first-level downsampling of the backbone network is used as the input feature map $X$. $X$ is fed separately into parallel depthwise separable convolutional layers to obtain multi-scale features, and these multi-scale features are added element-wise to give the multi-scale fused features:

$$F_{ms} = \sum_{k} \mathrm{DWConv}_{k \times k}(X)$$

where $\mathrm{DWConv}_{k \times k}$ indicates a depthwise separable convolution operation with kernel size $k \times k$.

The multi-scale fused features are input into the local feature enhancement branch, where the sampling offsets are learned through the offset convolution layer. The multi-scale fused features and the sampling offsets are then input into the convolution layer to obtain the response value of the output feature map at the center position $p_0$:

$$y(p_0) = \sum_{p_n \in \mathcal{R}} w(p_n)\, x(p_0 + p_n + \Delta p_n)\, \Delta m_n$$

where $w(p_n)$ refers to the kernel weights; $p_n$ is the $n$-th predefined sampling point; $\Delta p_n$ is the offset of the sampling point; $\mathcal{R}$ represents the predefined set of sampling locations for the convolution kernel; and $\Delta m_n$ represents the learnable modulation weight at the $n$-th sampling point (with a bounded range of values), which is used to suppress the sampling response in invalid interference regions, allowing the model to focus more on the target itself and complete adaptive sampling. Batch normalization and nonlinear activation are applied in sequence to the sampled features to output the locally enhanced features.

The multi-scale fused features are also input into the global feature attention branch, where global average pooling produces the channel aggregated features:

$$z_c = \frac{1}{H W} \sum_{i=1}^{H} \sum_{j=1}^{W} x_c(i, j)$$

where $z_c$ represents the global statistic of channel $c$; $H$ and $W$ indicate the height and width of the feature map; and $x_c(i, j)$ indicates the feature response value of the $c$-th channel at spatial location $(i, j)$. The channel aggregated features are then passed through a 1×1 convolution and a nonlinear activation function to generate the channel attention weights:

$$s = \sigma(W z + b)$$

where $W$ and $b$ are learnable parameters and $\sigma$ is a nonlinear activation function. The channel attention weights are used to weight the locally enhanced features channel by channel, giving the global attention enhancement features. These features are the final output of the module: on the one hand they are fed into the next downsampling layer of the backbone network, and on the other hand they provide the semantic feature input to the attention-guided high-low layer feature fusion module of step S3.

[0008] Furthermore, in step S3, the attention-guided high-low feature fusion module performs channel mapping and standardization on high-level features H and low-level features L, and then generates channel and spatial attention response maps using a spatial channel collaborative attention mechanism. Combined with residual branching and three-way information fusion strategies, it achieves adaptive fusion of high-low features, enhancing the ability to express small targets and edge details.

[0009] Further, in step S3, the channel mapping process is represented as follows:

$$L' = \delta(\mathrm{BN}(\mathrm{Conv}(L))), \qquad H' = \delta(\mathrm{BN}(\mathrm{Conv}(H)))$$

where $L$ and $H$ represent the input low-level and high-level feature maps, respectively; $L'$ and $H'$ are their feature representations after convolution, batch normalization, and nonlinear activation; $\mathrm{Conv}(\cdot)$ is used for channel mapping and dimension alignment; $\mathrm{BN}(\cdot)$ is used to stabilize the feature distribution; and $\delta(\cdot)$ is used to enhance the nonlinear expressive power of the features.

The channel-mapped low-level features are input into the channel attention module, where global average pooling, two stages of 1×1 convolution, and Sigmoid activation generate the channel attention weights:

$$M_c(F) = \sigma\big(W_2\, \delta(W_1\, \mathrm{GAP}(F))\big)$$

where $F$ indicates the input feature map; $\mathrm{GAP}(\cdot)$ indicates a global average pooling operation over the spatial dimensions that extracts global statistical information for each channel; $W_1$ and $W_2$ are learnable channel mapping weight matrices used to model the correlation between channels; $\delta$ represents a nonlinear activation function; and $\sigma$ maps the output to the (0,1) interval to generate the channel attention weights.

The channel-weighted low-level features are subjected to average pooling and max pooling along the channel dimension and then concatenated, followed by a 3×3 convolution and Sigmoid activation to generate the spatial attention weights:

$$M_s(F) = \sigma\big(\mathrm{Conv}_{3 \times 3}([\mathrm{AvgPool}(F);\ \mathrm{MaxPool}(F)])\big)$$

where $\mathrm{AvgPool}(\cdot)$ and $\mathrm{MaxPool}(\cdot)$ represent the average pooling and max pooling operations along the channel dimension, respectively, used to extract the statistical and salient responses at different spatial locations; $[\cdot\,;\cdot]$ indicates the feature concatenation operation; and $\mathrm{Conv}_{3 \times 3}$ is a convolution used to model the relationships between spatial locations. The generated spatial attention weights reflect the contribution of different spatial locations to the target detection task.

Combining the two makes it possible to attend to both the importance of each channel and the criticality of each spatial location, effectively preserving the fine-grained information of small targets while suppressing background noise, which refines the feature modeling and improves its discriminative ability:

$$F_c = M_c(F) \odot F, \qquad F' = M_s(F_c) \odot F_c$$

where $\odot$ represents the element-wise multiplication operation and $F'$ is the output feature map after attention enhancement. For the low-level input, the output is the attention-enhanced low-level features. The channel-mapped high-level features are input into an attention module with the same structure, and the output is the attention-enhanced high-level features.

Based on a learnable weight vector, Softmax normalization yields the fusion weights:

$$\alpha_i = \frac{\exp(w_i)}{\sum_{j=1}^{3} \exp(w_j)}, \qquad i = 1, 2, 3$$

The fused low-level features are obtained as the weighted combination of three parts: the channel-mapped low-level features; the element-wise product of the channel-mapped low-level features and the low-level attention response; and the element-wise product of one minus the low-level attention response with the upsampled, attention-enhanced high-level features computed from the channel-mapped high-level features. The fused high-level features are obtained in the same way with the roles exchanged: the channel-mapped high-level features; the element-wise product of the channel-mapped high-level features and the high-level attention response; and the element-wise product of one minus the high-level attention response with the resampled, attention-enhanced low-level features computed from the channel-mapped low-level features.

The fused high-level features are upsampled to the spatial size of the fused low-level features and concatenated with them, and a 3×3 convolution then produces the concatenated fused features. The residual features are upsampled to the spatial size of the concatenated fused features and added element by element to them to output the high-low level fused features. These features are fed into the subsequent modules of the neck network to provide the feature input for the high-resolution small target detection head of step S4.

[0010] Further, in step S4, the high-resolution small target detection head takes the deep features output by the backbone network as input and proceeds as follows. First, the deep features are upsampled to increase the spatial resolution of the feature map to a preset size. The upsampled deep features are input into the attention-guided high-low feature fusion module of step S3, where multi-scale feature concatenation and weighted fusion with the low-level features input to the module produce the fused features. The fused features are input into the C3k2 module, where stacked convolution operations and residual connections are applied in sequence to extract deep features. The resulting feature map is input into the small target detection head: a convolutional layer first adjusts the feature dimension, and the result is then fed into the classification branch and the regression branch. The classification branch outputs the target category probability distribution through convolution, and the regression branch outputs the bounding box coordinate parameters through convolution, giving the final category classification and bounding box regression results.
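The last stage of this head (dimension adjustment followed by parallel classification and regression branches) can be sketched as below. This is a minimal illustration only: it assumes 1×1 convolutions for every layer, ReLU after the dimension-adjusting layer, per-class Sigmoid probabilities, four box parameters per location, and nearest-neighbour upsampling. The fusion module and the C3k2 refinement are omitted, and all function names are hypothetical.

```python
import numpy as np


def conv1x1(x, weight, bias):
    """1x1 convolution on a (C_in, H, W) map; weight has shape (C_out, C_in)."""
    return np.einsum("oc,chw->ohw", weight, x) + bias[:, None, None]


def upsample_nearest(x, factor):
    """Nearest-neighbour upsampling of a (C, H, W) map by an integer factor."""
    return x.repeat(factor, axis=1).repeat(factor, axis=2)


def small_target_head(features, stem, cls_branch, reg_branch):
    """Return per-location class probabilities and box parameters.

    Each of stem, cls_branch and reg_branch is a (weight, bias) pair.
    """
    hidden = np.maximum(conv1x1(features, *stem), 0.0)  # dimension adjustment
    class_probs = 1.0 / (1.0 + np.exp(-conv1x1(hidden, *cls_branch)))
    box_params = conv1x1(hidden, *reg_branch)
    return class_probs, box_params
```

Because both branches are applied at every spatial location, predicting on an upsampled map gives proportionally more candidate locations, which is the stated motivation for a high-resolution branch.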

[0011] Furthermore, in step S5, the target detection network includes a backbone network, a neck feature fusion network, and a multi-scale target detection head. The backbone network embeds the multi-scale self-modulation dynamic modeling module at its successive downsampling stages to extract multi-scale offset features and retain fine-grained small target information. The neck network performs top-down and bottom-up high-low layer feature fusion through the attention-guided high-low layer feature fusion module to achieve cross-layer information interaction and multi-scale semantic enhancement. The detection stage includes the high-resolution small target detection head, which performs category classification and bounding box regression for small targets on high-resolution feature maps, thereby completing the accurate detection of multi-scale targets in complex UAV scenarios.

[0012] Compared with existing technologies, this invention provides a method for small target detection in UAV aerial photography based on self-modulation dynamic modeling and semantic detail co-enhancement. The beneficial effects of this invention include at least the following: (1) This invention proposes a small target detection network structure for UAV aerial photography scenarios. By introducing a multi-scale self-modulation feature modeling and semantic detail collaborative enhancement mechanism into the target detection network, a detection framework including a backbone network, a neck feature fusion network and a high-resolution small target detection head is constructed. This effectively alleviates the problem of decreased detection accuracy caused by small target size, complex background and drastic scale changes in UAV aerial photography scenarios, and improves the overall detection performance.

[0013] (2) The present invention designs a multi-scale self-modulation convolution module in the backbone network. By combining the multi-scale parallel convolution structure with the local feature enhancement branch and the global feature attention branch for modeling, it realizes the adaptive enhancement of the edge structure and fine-grained features of small targets. In the process of step-by-step downsampling, it effectively preserves the key information of small targets, overcomes the problem of easy loss of small target features in the traditional convolution downsampling process, and thus improves the ability to represent small targets in complex backgrounds.

[0014] (3) In the neck feature fusion stage, the present invention introduces an attention-guided high-low feature fusion module. By performing multi-scale interaction and adaptive weighted fusion of high-level semantic features and low-level detail features, the semantic expression capability is enhanced while the edge and structural details are effectively preserved. This enables the collaborative modeling of semantic information and spatial details, significantly improving the detection accuracy of small targets and their boundary areas in UAV aerial photography scenarios.

[0015] (4) In the detection stage, the present invention designs a high-resolution small target detection head. By upsampling the deep features and combining the high and low layer feature fusion module and feature refinement module, the classification of small targets and bounding box regression prediction are completed in the high-resolution feature space. This effectively improves the stability and accuracy of small targets in the multi-scale prediction process and enhances the detection capability of densely distributed targets with significant scale differences.

[0016] (5) The UAV aerial photography small target detection method proposed in this invention has shown good detection performance and generalization ability in different complex scenarios. It can adapt to different shooting heights, angle changes and background interference conditions. It has strong robustness and adaptability to multi-scale and small-sized targets and has high engineering application value.

Attached Figure Description

[0017] The present invention will be further described below with reference to the accompanying drawings and embodiments.
Figure 1 is a flowchart of the method for small target detection in UAV aerial images based on self-modulation dynamic modeling and semantic detail co-enhancement;
Figure 2 shows the class distribution of the training and test sets of the VisDrone2019 dataset;
Figure 3 is a schematic diagram of the target detection network structure and the high-resolution small target detection head structure constructed in this invention;
Figure 4 is a schematic diagram of the multi-scale self-modulation convolution module structure constructed in this invention;
Figure 5 is a schematic diagram of the attention-guided high-low layer feature fusion module structure constructed in this invention;
Figure 6 compares the mAP50 of the proposed model with that of YOLO11n across the categories of the VisDrone2019 dataset.

Detailed Implementation

[0018] The technical solutions in the embodiments of the present invention will be clearly and completely described below. Obviously, the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative effort are within the scope of protection of the present invention.

[0019] Furthermore, the technical solutions of the various embodiments of the present invention can be combined with each other, but only if they are feasible for those skilled in the art. If the combination of technical solutions is contradictory or cannot be implemented, it should be considered that such combination of technical solutions does not exist and is not within the scope of protection claimed by the present invention.

[0020] The following specific embodiments illustrate the solution proposed in this invention. As shown in Figure 1, the embodiment provides a flowchart of a method for small target detection in UAV aerial images based on self-modulation dynamic modeling and semantic detail co-enhancement. The method specifically includes the following steps:

Step S1, dataset preprocessing: Based on the VisDrone2019 drone aerial image dataset, the images are cleaned, labeled, and organized, and divided into training and test sets. In step S1, each image is cleaned, its size is normalized, and it is paired and stored with the original annotation information to form training samples adapted to the network input. The dataset is divided into a training set and a validation set so that the target scale and scene distributions are consistent across the sets. Figure 2 shows the class distribution of the training and test sets in the VisDrone2019 dataset.

Step S2, design a multi-scale self-modulation dynamic modeling module: As shown in Figure 4, this module first expands the receptive field through parallel multi-scale convolution, extracting and fusing initial features at different scales. It then performs refined feature modeling through the local feature enhancement branch and the global feature attention branch, realizing dynamic spatial modeling with offset convolution and attention-guided enhancement of key features. The multi-scale self-modulation dynamic modeling module includes the following four specific sub-steps:

(1) Parallel multi-scale convolution: the input feature map $X$ is fed separately into depthwise separable convolutional layers to obtain multi-scale features, which are added element-wise to give the multi-scale fused features:

$$F_{ms} = \sum_{k} \mathrm{DWConv}_{k \times k}(X)$$

where $\mathrm{DWConv}_{k \times k}$ indicates a depthwise separable convolution operation with kernel size $k \times k$, used to extract multi-scale contextual features while reducing computational complexity.
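The parallel multi-scale convolution of sub-step (1) can be sketched in NumPy as follows. The sketch assumes stride 1 with zero padding and treats each depthwise separable layer as a depthwise convolution followed by a 1×1 pointwise convolution. The kernel sizes are not given in the text, so any sizes passed in (for example 3, 5 and 7) are assumptions, as are the function names.

```python
import numpy as np


def depthwise_conv2d(x, kernels):
    """Depthwise 'same' cross-correlation, stride 1.

    x has shape (C, H, W); kernels has shape (C, k, k) with odd k.
    """
    _, h, w = x.shape
    k = kernels.shape[-1]
    pad = k // 2
    padded = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    out = np.zeros_like(x, dtype=float)
    for dy in range(k):
        for dx in range(k):
            tap = kernels[:, dy, dx][:, None, None]
            out += tap * padded[:, dy:dy + h, dx:dx + w]
    return out


def pointwise_conv2d(x, weight):
    """1x1 convolution mixing channels; weight has shape (C_out, C_in)."""
    return np.einsum("oc,chw->ohw", weight, x)


def multi_scale_fusion(x, branches):
    """Element-wise sum of parallel depthwise separable branches.

    branches is a list of (depthwise_kernels, pointwise_weight) pairs,
    one pair per kernel size.
    """
    return sum(pointwise_conv2d(depthwise_conv2d(x, dw), pw) for dw, pw in branches)
```

A depthwise layer applies one k×k filter per channel, so its cost grows with C·k² rather than C²·k², which is why larger kernels remain affordable in the parallel branches.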

[0021] (2) The local feature enhancement unit enhances the model’s spatial modeling capability in non-rigid target and viewpoint change scenarios, and uses offset convolution to dynamically adjust the sampling position to achieve spatial adaptive modeling.

[0022] The formula for offset convolution is:

$$y(p_0) = \sum_{p_n \in \mathcal{R}} w(p_n)\, x(p_0 + p_n + \Delta p_n)\, \Delta m_n$$

where $w(p_n)$ refers to the kernel weights; $p_n$ is the $n$-th predefined sampling point; $\Delta p_n$ is the offset of the sampling point; $\mathcal{R}$ represents the predefined set of sampling locations for the convolution kernel; and $\Delta m_n$ represents the learnable modulation weight at the $n$-th sampling point (with a bounded range of values), which is used to suppress the sampling response in invalid interference areas, allowing the model to focus more on the target itself.
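To make the offset convolution of [0022] concrete, the following sketch evaluates the response at a single center position for a single-channel map. It assumes a 3×3 sampling grid, bilinear interpolation at fractional positions, and zero values outside the map, as in common modulated deformable convolution implementations. In the module the offsets and modulation weights are predicted by a convolution layer; here they are simply passed in. The function names are illustrative.

```python
import numpy as np


def bilinear_sample(feature, y, x):
    """Bilinearly interpolate a 2D map at fractional (y, x); outside is zero."""
    h, w = feature.shape
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    value = 0.0
    for yy in (y0, y0 + 1):
        for xx in (x0, x0 + 1):
            if 0 <= yy < h and 0 <= xx < w:
                weight = (1 - abs(y - yy)) * (1 - abs(x - xx))
                value += weight * feature[yy, xx]
    return value


def modulated_offset_response(feature, center, kernel, offsets, modulation):
    """Sum over the 3x3 grid of w_n * x(p0 + p_n + dp_n) * dm_n.

    kernel has shape (3, 3); offsets has shape (9, 2) as (dy, dx);
    modulation has shape (9,).
    """
    grid = [(gy, gx) for gy in (-1, 0, 1) for gx in (-1, 0, 1)]
    p0y, p0x = center
    total = 0.0
    for n, (gy, gx) in enumerate(grid):
        sample_y = p0y + gy + offsets[n, 0]
        sample_x = p0x + gx + offsets[n, 1]
        sampled = bilinear_sample(feature, sample_y, sample_x)
        total += kernel[gy + 1, gx + 1] * sampled * modulation[n]
    return total
```

With zero offsets and unit modulation the function reduces to an ordinary 3×3 convolution response; setting a modulation weight to zero removes that sampling point from the sum, which is how interfering regions can be suppressed.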

[0023] (3) The global feature attention branch performs channel-adaptive weighting. Global average pooling is applied to the multi-scale features to obtain channel aggregated features. Then, 1×1 convolution is used to compress and map the channel features, introducing richer nonlinear representations, and a Sigmoid function generates the channel attention weights.

[0024] The formula for global feature attention is:

$$z_c = \frac{1}{H W} \sum_{i=1}^{H} \sum_{j=1}^{W} x_c(i, j)$$

where $z_c$ represents the global statistic of channel $c$; $H$ indicates the height of the feature map; $W$ indicates the width of the feature map; and $x_c(i, j)$ indicates the feature response value of the $c$-th channel at spatial location $(i, j)$.

(4) The channel statistical vectors are mapped by a 1×1 convolution and a nonlinear transformation is introduced, after which a Sigmoid function generates the channel weights.

[0025] The formula for calculating the channel weights is:

$$s = \sigma(W z + b)$$

where $W$ and $b$ are learnable parameters and $\sigma$ is a nonlinear activation function.
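The global feature attention branch of sub-steps (3) and (4) amounts to pooling each channel to a scalar, mapping the resulting vector, and rescaling the channels. A minimal sketch follows, assuming the nonlinear function that produces the weights is a Sigmoid and that the 1×1 convolution keeps the channel count unchanged. On a pooled vector a 1×1 convolution is a matrix product, which is what the code uses. The names are illustrative.

```python
import numpy as np


def channel_weights(x, weight, bias):
    """Channel attention weights from a (C, H, W) feature map.

    weight has shape (C, C) and bias has shape (C,).
    """
    pooled = x.mean(axis=(1, 2))            # global average pooling
    logits = weight @ pooled + bias         # 1x1 convolution on the pooled vector
    return 1.0 / (1.0 + np.exp(-logits))    # Sigmoid, assumed


def apply_channel_weights(features, weights):
    """Scale every channel of a (C, H, W) map by its attention weight."""
    return features * weights[:, None, None]
```

In the module described above, the weights are computed from the multi-scale fused features and applied to the locally enhanced features from the offset convolution branch.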

[0026] Step S3, design an attention-guided high- and low-level feature fusion module: As shown in Figure 5, channel mapping is first performed on the input high- and low-level features to unify their channel dimensions. Batch normalization and activation functions are then applied to improve the stability of the feature distribution and the nonlinear expressive power. For low-level features, residual connections are introduced to ensure the fidelity of the fine-grained features of small targets. Second, the channel-mapped high- and low-level features are processed through channel attention and spatial attention to generate the corresponding attention response maps. Subsequently, 1×1 convolutions are used to adjust dimensions and model the spatial saliency distribution, highlighting key regions and suppressing redundant background. Next, the high- and low-level features after convolutional transformation undergo a three-way information fusion operation. This fusion strategy integrates original feature information, attention-enhanced feature information, and cross-scale supplementary feature information to effectively preserve the fine-grained details of small targets and enhance key feature regions. The attention-guided high- and low-level feature fusion module includes the following three specific sub-steps:

(1) The channel mapping process is expressed as follows:

$$L' = \delta(\mathrm{BN}(\mathrm{Conv}(L))), \qquad H' = \delta(\mathrm{BN}(\mathrm{Conv}(H)))$$

where $L$ and $H$ represent the input low-level and high-level feature maps, respectively; $L'$ and $H'$ are their feature representations after convolution, batch normalization, and nonlinear activation; $\mathrm{Conv}(\cdot)$ is used for channel mapping and dimension alignment; $\mathrm{BN}(\cdot)$ is used to stabilize the feature distribution; and $\delta(\cdot)$ is used to enhance the nonlinear expressive power of the features.
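The channel mapping of sub-step (1) can be sketched as a convolution followed by normalization and activation. The sketch assumes a 1×1 convolution and ReLU, neither of which is specified in the text, and it normalizes each channel over the spatial positions of a single sample as a stand-in for batch normalization. The function name `channel_map` is hypothetical.

```python
import numpy as np


def channel_map(x, weight, gamma, beta, eps=1e-5):
    """Map a (C_in, H, W) feature map to C_out channels.

    weight has shape (C_out, C_in); gamma and beta have shape (C_out,).
    """
    y = np.einsum("oc,chw->ohw", weight, x)               # 1x1 convolution
    mean = y.mean(axis=(1, 2), keepdims=True)
    var = y.var(axis=(1, 2), keepdims=True)
    y = (y - mean) / np.sqrt(var + eps)                   # per-channel normalization
    y = gamma[:, None, None] * y + beta[:, None, None]    # learnable scale and shift
    return np.maximum(y, 0.0)                             # ReLU, assumed
```

Mapping both inputs to the same number of channels is what allows the later element-wise products and sums between low-level and high-level features.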

[0027] (2) In the synergistic effect of channel attention and spatial attention, attention response maps are generated for high-level semantic features and low-level detail features based on the joint modeling mechanism of channel attention and spatial attention, so as to highlight the key areas that have discriminative power for small target detection, while suppressing background noise and redundant information; among them, channel attention is used to model the importance of different semantic channels, and spatial attention is used to enhance the spatial response of the target area.

[0028] The channel attention formula is:

$$M_c(F) = \sigma\big(W_2\, \delta(W_1\, \mathrm{GAP}(F))\big)$$

where $F$ indicates the input feature map; $\mathrm{GAP}(\cdot)$ indicates a global average pooling operation on the feature map in the spatial dimension to extract global statistical information for each channel; $W_1$ and $W_2$ are learnable channel mapping weight matrices used to model the correlation between channels; $\delta$ represents a nonlinear activation function; and $\sigma$ is used to map the output to the (0,1) interval and generate the channel attention weights, which characterize the relative importance of each channel in target discrimination.

[0029] The formula for spatial attention is:

$$M_s(F) = \sigma\big(\mathrm{Conv}([\mathrm{AvgPool}(F);\ \mathrm{MaxPool}(F)])\big)$$

where $\mathrm{AvgPool}(\cdot)$ and $\mathrm{MaxPool}(\cdot)$ represent the average pooling and max pooling operations along the channel dimension, respectively, used to extract the statistical and salient responses at different spatial locations; $[\cdot\,;\cdot]$ indicates the feature concatenation operation; and $\mathrm{Conv}(\cdot)$ is a convolution operation used to model the relationships between spatial locations. The generated spatial attention weights reflect the degree of contribution of different spatial locations to the target detection task.

[0030] Combining the two approaches allows for simultaneous attention to the importance of channels and the criticality of spatial location. While suppressing background noise, it effectively preserves fine-grained information of small targets, enabling refined feature modeling and enhancing their discriminative capabilities.

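[0031] The attention-enhanced output is: $F' = M_c \otimes M_s \otimes F$, where $\otimes$ denotes the element-wise multiplication operation, $M_c$ and $M_s$ are the channel and spatial attention weights, and $F'$ is the output feature map after attention enhancement.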
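The channel and spatial attention described in paragraphs [0028] to [0031] can be illustrated with a minimal NumPy sketch. Several details are assumptions made only for illustration and are not stated in the text: ReLU as the nonlinear activation, a sigmoid as the (0,1) mapping, a zero-padded 3×3 kernel for the spatial convolution, the spatial attention being computed on the channel-weighted features, and all function and argument names.

```python
import numpy as np

def sigmoid(x):
    # Maps any real value into the (0, 1) interval.
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(feat, w1, w2):
    """feat: (C, H, W); w1: (C_r, C); w2: (C, C_r). Returns weights of shape (C, 1, 1)."""
    pooled = feat.mean(axis=(1, 2))          # global average pooling over H and W
    hidden = np.maximum(w1 @ pooled, 0.0)    # ReLU is an assumed activation choice
    return sigmoid(w2 @ hidden)[:, None, None]

def spatial_attention(feat, kernel):
    """feat: (C, H, W); kernel: (2, k, k). Returns weights of shape (1, H, W)."""
    # Average and max pooling along the channel dimension, then concatenation.
    stacked = np.stack([feat.mean(axis=0), feat.max(axis=0)])
    k = kernel.shape[-1]
    pad = k // 2
    padded = np.pad(stacked, ((0, 0), (pad, pad), (pad, pad)))  # zero padding
    h, w = feat.shape[1:]
    out = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            # One k x k convolution window over the two pooled maps.
            out[i, j] = np.sum(padded[:, i:i + k, j:j + k] * kernel)
    return sigmoid(out)[None]

def attention_enhance(feat, w1, w2, kernel):
    """Returns F' = M_c * M_s * F with broadcasting over channels and positions."""
    m_c = channel_attention(feat, w1, w2)
    m_s = spatial_attention(feat * m_c, kernel)  # computed on channel-weighted features (assumption)
    return feat * m_c * m_s
```

Because both weight maps lie in (0,1), the enhanced feature map never exceeds the input in magnitude; the attention only rescales responses per channel and per location.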

[0032] (3) The module constructs a three-way information fusion representation, including an information path that retains the original features, a feature path that introduces attention enhancement, and a cross-scale semantic interaction path. It also performs normalized weighted fusion of the multi-way features through learnable weight parameters, thereby achieving adaptive complementarity and collaborative expression between high- and low-level features.

[0033] Let the low-level high-resolution feature map be $F_L$, the high-level low-resolution feature map be $F_H$, and the corresponding spatial-channel attention weights be $A_L$ and $A_H$, respectively. The three-way information fusion is formulated as follows. The first branch is $P_1 = F_L$. This branch directly preserves the original representation of the low-level features to maintain the fine-grained texture and edge information of small targets to the greatest extent possible, avoiding the loss of details during the fusion process.

[0034] The second branch is $P_2 = A_L \otimes F_L$. This branch uses attention weights to weight the low-level features, highlighting key regions and salient targets while suppressing background noise, making the features more focused on potential target regions.

[0035] The third branch is $P_3 = (1 - A_L) \otimes \mathrm{Up}(A_H \otimes F_H)$. This branch utilizes high-level semantic features to supplement the low-level features, where $\mathrm{Up}(\cdot)$ denotes an upsampling operation used to align the high-level features to the low-level spatial resolution, and the difference $1 - A_L$ emphasizes regions where the low-level feature response is weak and semantic support may be lacking, so that cross-scale semantic compensation is achieved by introducing higher-level semantic information. The adaptive weight normalization formula is: $\alpha_i = \exp(w_i) / \sum_{j=1}^{3} \exp(w_j)$, $i = 1, 2, 3$, where $w_i$ are the learnable weight parameters. The normalization constrains the weights to sum to 1, which enables the network to adaptively adjust the relative importance of the three features in the fusion process according to the data distribution.

[0036] The fusion formula is: $\hat{F}_L = \alpha_1 P_1 + \alpha_2 P_2 + \alpha_3 P_3$, where $P_1$, $P_2$ and $P_3$ are the outputs of the three branches and $\alpha_1$, $\alpha_2$ and $\alpha_3$ are the normalized fusion weights. This process preserves low-level details while introducing attention enhancement and high-level semantic supplementation, so that the fused features possess both refined spatial representation and discriminative semantic information. Symmetrically, the high-level features maintain their original semantic expression, have their key channels further enhanced through the attention mechanism, and receive high-resolution detail information from the low-level features, achieving bidirectional supplementation of semantic and spatial information.
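The three-way fusion of paragraphs [0033] to [0036] can be sketched in NumPy as follows. This is only an illustrative sketch under stated assumptions: nearest-neighbour upsampling, softmax as the normalization that makes the weights sum to 1, the third branch taken as the complement of the low-level attention multiplied by the upsampled attention-weighted high-level features, and all function names.

```python
import numpy as np

def upsample_nearest(x, factor):
    """Nearest-neighbour upsampling of a (C, H, W) map by an integer factor (assumed scheme)."""
    return x.repeat(factor, axis=1).repeat(factor, axis=2)

def softmax(w):
    # Subtracting the maximum keeps the exponentials numerically stable.
    e = np.exp(w - np.max(w))
    return e / e.sum()

def three_way_fusion(low, high, att_low, att_high, raw_weights):
    """low: (C, H, W); high: (C, H/s, W/s); att_*: attention weights in (0, 1)
    that broadcast against the matching feature map; raw_weights: 3 learnable scalars."""
    factor = low.shape[1] // high.shape[1]
    p1 = low                                    # branch 1: original low-level features
    p2 = att_low * low                          # branch 2: attention-weighted low-level features
    # Branch 3 (assumed form): high-level semantics injected where low-level attention is weak.
    p3 = (1.0 - att_low) * upsample_nearest(att_high * high, factor)
    alpha = softmax(np.asarray(raw_weights, dtype=float))
    fused = alpha[0] * p1 + alpha[1] * p2 + alpha[2] * p3
    return fused, alpha
```

With equal raw weights each branch contributes one third; during training the three scalars would be learned so that the balance between detail preservation, attention enhancement, and semantic compensation adapts to the data.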

[0037] Step S4, construct a high-resolution small target detection head: as shown in Figure 3, the deep features output by the backbone are first upsampled to improve the spatial resolution of small targets in the feature map, preserving more detailed information for subsequent detection. The upsampled features are then input into the attention-guided high-low feature fusion module, which integrates multi-scale information by building an interaction channel between deep and shallow features; this improves the overall expressive power of the features and enhances the network's sensitivity to small targets. The fused features then pass through the C3k2 module for deep feature extraction; this module relies on stacked convolutions and residual structures to further extract semantic information and local details, making the feature representation of small targets more prominent. Finally, the refined high-resolution features are input into the small target detection head, which performs category classification and bounding box regression simultaneously, achieving accurate detection of high-resolution small targets while reducing missed detections and false detections.

[0038] Step S5, construct and train the object detection network: as shown in Figure 4, the multi-scale self-modulation dynamic modeling module described in step S2 is first embedded into the backbone of the target detection network, replacing the original conventional feature extraction module, so that the network possesses both multi-receptive-field feature modeling capability and dynamic spatial adaptation capability during step-by-step downsampling. Specifically, the module first models the input features through parallel multi-scale convolutions to obtain contextual information at different scales. On this basis, offset convolution is introduced to dynamically adjust the sampling positions, enhancing the network's spatial adaptability to targets with large scale variations and irregular shapes in aerial photography scenes. After multi-scale feature fusion, global average pooling compresses the feature map in the spatial dimension to extract channel-level global statistical information; channel self-modulation weights are then generated by combining channel mapping and normalization operations and are used to adaptively weight the multi-scale features, suppressing redundant background responses, highlighting key semantic channels, and achieving synergistic enhancement of global semantic information and local structural features. Next, the attention-guided high-low layer feature fusion module described in step S3 is introduced into the network neck to construct an adaptive interaction channel between features of different scales. This module dynamically weights high-level semantic features and low-level fine-grained features through the synergy of channel attention and spatial attention, and combines cross-scale information supplementation with a learnable weight fusion mechanism to integrate multi-layer features effectively, enhancing the saliency of small targets in complex backgrounds. Finally, the high-resolution small target detection head described in step S4 is integrated into the detection head part to perform target category prediction and bounding box regression on shallow high-resolution feature maps. By preserving richer spatial details and edge information, the positioning accuracy and detection recall rate of small aerial targets are improved, thus forming a small target detection network for UAV aerial images based on self-modulation dynamic modeling and semantic detail synergistic enhancement.
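The offset convolution with dynamically adjusted sampling positions can be illustrated for a single output position. The sketch below assumes the mechanism behaves like a modulated deformable convolution on a single-channel map, with a 3×3 sampling grid, bilinear interpolation at fractional positions, and zero values outside the map; these choices and all names are assumptions for illustration, not details given in the text.

```python
import numpy as np

# Predefined 3x3 sampling grid around the center position (assumed kernel size).
GRID = [(dr, dc) for dr in (-1, 0, 1) for dc in (-1, 0, 1)]

def bilinear_sample(x, r, c):
    """Sample the 2D map x at a fractional (row, col); positions outside the map read as 0."""
    h, w = x.shape
    r0, c0 = int(np.floor(r)), int(np.floor(c))
    value = 0.0
    for rr, cc in ((r0, c0), (r0, c0 + 1), (r0 + 1, c0), (r0 + 1, c0 + 1)):
        if 0 <= rr < h and 0 <= cc < w:
            # Bilinear weight falls off linearly with distance to each neighbour.
            weight = (1.0 - abs(r - rr)) * (1.0 - abs(c - cc))
            value += weight * x[rr, cc]
    return value

def modulated_deform_response(x, center, kernel, offsets, modulation):
    """Response at one center position: the sum over the sampling points of
    kernel weight * sampled value at (center + grid point + learned offset) * modulation."""
    r0, c0 = center
    total = 0.0
    for n, (dr, dc) in enumerate(GRID):
        off_r, off_c = offsets[n]
        sampled = bilinear_sample(x, r0 + dr + off_r, c0 + dc + off_c)
        total += kernel[n] * sampled * modulation[n]
    return total
```

With all offsets set to zero and all modulation weights set to one, the response reduces to an ordinary 3×3 convolution at that position, which is a convenient sanity check; in the network, the offsets and modulation weights would be predicted per position by the offset convolution layer.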

[0039] Step S6, select evaluation metrics and obtain detection results: inference is performed on the test set using the trained model, the detection results are evaluated using the mean average precision at an IoU threshold of 0.5 (mAP50), and the final UAV small target detection results are obtained, as shown in Figure 6.
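To make the mAP50 metric concrete, the following pure-Python sketch computes the average precision of one class at an IoU threshold of 0.5 and then averages over classes. It is a simplified illustration, not the evaluation protocol used in the application: boxes are assumed to be (x1, y1, x2, y2) tuples, each prediction is greedily matched to the best unmatched ground-truth box, all boxes of a class are pooled as if they came from one image, and the function names are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def average_precision_50(predictions, ground_truths):
    """predictions: list of (score, box); ground_truths: list of boxes (one class)."""
    if not ground_truths:
        return 0.0
    matched, hits = set(), []
    # Visit predictions from the most to the least confident.
    for _, box in sorted(predictions, key=lambda p: p[0], reverse=True):
        best_iou, best_idx = 0.0, None
        for idx, gt in enumerate(ground_truths):
            if idx in matched:
                continue
            value = iou(box, gt)
            if value > best_iou:
                best_iou, best_idx = value, idx
        if best_idx is not None and best_iou >= 0.5:
            matched.add(best_idx)   # true positive: each ground truth is used once
            hits.append(1)
        else:
            hits.append(0)          # false positive
    precisions, recalls, tp = [], [], 0
    for rank, hit in enumerate(hits, start=1):
        tp += hit
        precisions.append(tp / rank)
        recalls.append(tp / len(ground_truths))
    # Make precision non-increasing from right to left (precision envelope).
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    ap, prev_recall = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += (r - prev_recall) * p   # area under the precision-recall curve
        prev_recall = r
    return ap

def mean_ap_50(per_class):
    """per_class: list of (predictions, ground_truths) pairs, one pair per category."""
    return sum(average_precision_50(p, g) for p, g in per_class) / len(per_class)
```

For small targets the 0.5 IoU threshold is demanding, since a shift of a few pixels can already drop a tiny box below it, which is why localization accuracy on high-resolution feature maps directly affects this metric.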

[0040] It should be noted that, in this document, the terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or apparatus. Unless otherwise specified, an element defined by the phrase "comprising one..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that includes that element.

[0041] The above embodiments are only used to illustrate the technical solutions of the present invention and are not intended to limit it. Although the present invention has been described in detail with reference to preferred embodiments, those skilled in the art should understand that modifications or equivalent substitutions can be made to the technical solutions of the present invention without departing from the spirit and scope of the technical solutions of the present invention, and all such modifications or substitutions should be covered within the scope of the claims of the present invention.

Claims

1. A method for small target detection in UAV aerial images based on self-modulation dynamic modeling and semantic detail co-enhancement, characterized in that the method includes the following steps:
Step S1, dataset preprocessing: based on the UAV aerial image dataset, the images are cleaned, labeled and organized, and divided into a training set and a test set;
Step S2, design a multi-scale self-modulation dynamic modeling module: a multi-scale self-modulation dynamic modeling module is designed at the step-by-step downsampling points of the backbone network; this module includes multi-scale parallel convolution, local feature enhancement branches, global feature attention branches, and ... normalization and activation operations, and is used to adaptively model the structural features of small targets at different scales during downsampling;
Step S3, design an attention-guided high-low feature fusion module: in the network neck feature fusion stage, an attention-guided high-low feature fusion module is designed; this module includes channel mapping of high-level semantic features and low-level detail features, channel and spatial feature modeling based on an attention mechanism, multi-branch candidate feature generation, and an adaptive weighted fusion structure, and sets a residual branch to retain low-level details; the residual connection realizes the synergistic enhancement of semantic information and detail information;
Step S4, design a high-resolution small target detection head: a high-resolution small target detection branch is added to the detection head; the detection head performs multi-scale interaction by upsampling deep features and inputting them into the attention-guided high-low feature fusion module, refines the result with the C3k2 module, and finally performs classification and bounding box regression for small targets using a multi-scale prediction structure;
Step S5, construct and train the target detection network: the multi-scale self-modulation dynamic modeling module described in step S2 is integrated into the backbone of the target detection network and embedded at the downsampling points of the backbone network to adaptively model the structural features of small targets at different scales; the attention-guided high-low layer feature fusion module described in step S3 is integrated into the network neck to achieve synergistic enhancement of high-level semantic information and low-level detail information during the feature fusion stage; the high-resolution small target detection head described in step S4 is integrated as a new branch into the detection head part; and a small target detection model for UAV aerial images based on self-modulation dynamic modeling and semantic detail synergistic enhancement is trained using a public UAV aerial image dataset;
Step S6, select evaluation metrics and obtain detection results: inference is performed on the test set using the trained model to output the small target detection results for the UAV aerial images, including target categories and bounding boxes; the average precision metric is then used to evaluate the detection results and verify the model's detection performance.

2. The method for small target detection in UAV aerial images based on self-modulation dynamic modeling and semantic detail co-enhancement as described in claim 1, characterized in that: Step S2 specifically includes: the multi-scale self-modulation dynamic modeling module is used to obtain features of different scales through multi-scale parallel convolution during the downsampling process of the backbone network, and to achieve adaptive modeling of local spatial structure and global semantic information through local feature enhancement branches and global feature attention branches.

3. The method for small target detection in UAV aerial photography based on self-modulation dynamic modeling and semantic detail co-enhancement as described in claim 2, characterized in that, in step S21, the multi-scale self-modulation dynamic modeling module specifically includes: taking the feature map output by the first-level downsampling of the backbone network as the input feature map $X$; inputting $X$ separately into depthwise separable convolutional layers to obtain multi-scale features, and fusing these multi-scale features by element-wise addition to obtain the multi-scale fused features $F_{ms}$, as shown in the formula below: $F_{ms} = \sum_{k} \mathrm{DWConv}_{k}(X)$, where $\mathrm{DWConv}_{k}(\cdot)$ denotes a depthwise separable convolution operation with kernel size $k \times k$; inputting the multi-scale fused features into the local feature enhancement branch, where the sampling offsets are learned through the offset convolution layer, and the multi-scale fused features and the sampling offsets are then input into the convolution layer to obtain the response value $y(p_0)$ of the output feature map at the center position $p_0$, as shown in the formula below: $y(p_0) = \sum_{p_n \in \mathcal{R}} w(p_n) \cdot F_{ms}(p_0 + p_n + \Delta p_n) \cdot \Delta m_n$, where $w(p_n)$ denotes the kernel weights; $p_n$ is the $n$-th predefined sampling point; $\Delta p_n$ is the offset of the sampling point; $\Delta m_n$ is the learnable modulation weight at the $n$-th sampling point; and $\mathcal{R}$ is the predefined set of sampling locations of the convolution kernel; batch normalization and nonlinear activation operations are sequentially performed on the sampled features to output the locally enhanced features; inputting the multi-scale fused features into the global feature attention branch, as shown in the following formula: $z_c = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} F_{ms}^{c}(i, j)$, where $z_c$ denotes the global statistical characteristic of the $c$-th channel; $H$ denotes the height of the feature map; $W$ denotes the width of the feature map; and $F_{ms}^{c}(i, j)$ denotes the feature response value of the $c$-th channel at spatial location $(i, j)$; that is, global average pooling is performed on the multi-scale fused features to obtain the channel aggregated features $z$, which are then subjected to 1×1 convolution and nonlinear activation and passed through the $\sigma$ function, as shown in the following formula: $s = \sigma\big(\delta(\mathrm{Conv}_{1 \times 1}(z))\big)$, where the parameters of the 1×1 convolution are learnable parameters and $\delta(\cdot)$ is a nonlinear activation function, thereby generating the channel attention weights $s$; the channel attention weights are used to perform channel weighting on the locally enhanced features, and the global attention-enhanced features are output; the global attention-enhanced features are used as the final output of the module, which on the one hand is fed into the next-level downsampling layer of the backbone network and on the other hand provides semantic feature input to the attention-guided high-low layer feature fusion module in step S2.

4. The method for small target detection in UAV aerial images based on self-modulation dynamic modeling and semantic detail co-enhancement as described in claim 1, characterized in that, in step S3, the attention-guided high-low feature fusion module performs channel mapping and standardization on the high-level features H and the low-level features L, and then uses a spatial-channel collaborative attention mechanism to generate channel and spatial attention response maps; combined with a residual branch and a three-way information fusion strategy, it achieves adaptive fusion of high-level and low-level features and enhances the representation of small targets and edge details.

5. The UAV aerial photography small target detection network according to claim 3, characterized in that, in S2, the channel mapping process is represented as follows: $L' = \delta(\mathrm{BN}(\mathrm{Conv}(L)))$, $H' = \delta(\mathrm{BN}(\mathrm{Conv}(H)))$, where $L$ and $H$ denote the input low-level and high-level feature maps, respectively; $L'$ and $H'$ are their feature representations after convolution, batch normalization and nonlinear activation; $\mathrm{Conv}(\cdot)$ is used for channel mapping and dimension alignment; $\mathrm{BN}(\cdot)$ is used to stabilize the feature distribution; and $\delta(\cdot)$ is used to enhance the nonlinear expressive power of the features. The channel-mapped low-level features are input into the channel attention module, where global average pooling, two stages of 1×1 convolution, and $\sigma$ activation generate the channel attention weights, as shown in the following formula: $M_c(F) = \sigma\big(W_2\,\delta(W_1\,\mathrm{GAP}(F))\big)$, where $F$ denotes the input feature map; $\mathrm{GAP}(\cdot)$ denotes a global average pooling operation performed on the feature map in the spatial dimension to extract global statistical information for each channel; $W_1$ and $W_2$ are learnable channel mapping weight matrices used to model the correlation between channels; $\delta(\cdot)$ denotes a nonlinear activation function; and $\sigma(\cdot)$ maps the output to the (0,1) interval to generate the channel attention weights. The channel-weighted low-level features $F_c = M_c \otimes F$ are subjected to average pooling and max pooling along the channel dimension and then concatenated, followed by 3×3 convolution and $\sigma$ activation to generate the spatial attention weights, as shown in the following formula: $M_s = \sigma\big(\mathrm{Conv}_{3 \times 3}([\mathrm{AvgPool}(F_c);\,\mathrm{MaxPool}(F_c)])\big)$, where $\mathrm{AvgPool}(\cdot)$ and $\mathrm{MaxPool}(\cdot)$ denote the average pooling and max pooling operations along the channel dimension, respectively, used to extract the average and the most salient responses at different spatial locations; $[\cdot\,;\cdot]$ denotes the feature concatenation operation; $\mathrm{Conv}_{3 \times 3}(\cdot)$ is a convolution operation used to model the relationships between spatial locations; and $M_s$ denotes the generated spatial attention weights. The output feature map after attention enhancement is calculated as follows: $F' = M_c \otimes M_s \otimes F$, where $\otimes$ denotes the element-wise multiplication operation and $F'$ is the output feature map after attention enhancement; the attention-enhanced low-level features are thus output. The channel-mapped high-level features are input into an attention module with the same structure, which outputs the attention-enhanced high-level features. Based on a learnable weight vector $w = (w_1, w_2, w_3)$, the fusion weights are obtained through normalization, as shown in the following formula: $\alpha_i = \exp(w_i) / \sum_{j=1}^{3} \exp(w_j)$. The fused low-level features are obtained as the combination, weighted by the fusion weights, of three parts: the channel-mapped low-level features; the element-wise product of the channel-mapped low-level features and the attention-enhanced low-level features; and the element-wise product of one minus the attention-enhanced low-level features, the upsampled attention-enhanced high-level features, and the channel-mapped high-level features. The fused high-level features are obtained in the same way from three parts: the channel-mapped high-level features; the element-wise product of the channel-mapped high-level features and the attention-enhanced high-level features; and the element-wise product of one minus the attention-enhanced high-level features, the upsampled attention-enhanced low-level features, and the channel-mapped low-level features. The fused high-level features are upsampled to the spatial size of the fused low-level features, concatenated with the fused low-level features, and then passed through a 3×3 convolution to obtain the concatenated fused features; the residual features are upsampled to the spatial size of the concatenated fused features and added element by element to the concatenated fused features to output the high-low level fused features. These features are fed into the subsequent modules of the neck network to provide feature input for the high-resolution small target detection head in step S4.

6. The method for small target detection in UAV aerial images based on self-modulation dynamic modeling and semantic detail co-enhancement as described in claim 1, characterized in that, in step S4, the high-resolution small target detection head takes the deep features output by the backbone network as input and first performs an upsampling operation on the deep features to increase the spatial resolution of the feature map to a preset size; the upsampled deep features are then input into the attention-guided high-low feature fusion module of step S3, where multi-scale feature concatenation and weighted fusion are performed with the low-level features input to this module to output fused features; the fused features are then input into the C3k2 module, where stacked convolution operations and residual connection operations are performed sequentially to complete deep feature extraction and output the extracted feature map; the extracted feature map is input into the small target detection head, where it first undergoes feature dimension adjustment through convolutional layers and is then input into the classification branch and the regression branch, respectively. The classification branch outputs the target category probability distribution through convolution operations, and the regression branch outputs the target bounding box coordinate parameters through convolution operations, finally yielding the category classification result and the bounding box regression result.

7. The method for small target detection in UAV aerial images based on self-modulation dynamic modeling and semantic detail co-enhancement as described in claim 1, characterized in that, in step S5, the target detection network includes a backbone network, a neck feature fusion network, and a multi-scale target detection head. The backbone network includes the multi-scale self-modulation dynamic modeling module along its stepwise downsampling stages to extract multi-scale offset features and retain fine-grained small target information. The neck network performs top-down and bottom-up high-low layer feature fusion through the attention-guided high-low layer feature fusion module to achieve cross-layer information interaction and multi-scale semantic enhancement. The detection stage includes the high-resolution small target detection head, which classifies small targets and predicts their bounding boxes on high-resolution feature maps, thereby completing the accurate detection of multi-scale targets in complex UAV scenarios.