Unmanned aerial vehicle lightweight target detection method and device based on adaptive sparse attention
By adopting a lightweight target detection method with adaptive sparse attention, the problems of low accuracy in small target detection and poor generalization under complex lighting conditions on UAV embedded platforms are solved, achieving efficient and real-time target detection results.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- 无锡先进内燃动力技术创新中心
- Filing Date
- 2026-03-20
- Publication Date
- 2026-06-19
AI Technical Summary
Existing target detection technologies on UAV embedded platforms suffer from low accuracy in detecting small targets, poor generalization under complex lighting conditions, and excessive computational overhead, making it difficult to achieve efficient target detection in complex scenarios.
A lightweight target detection method with adaptive sparse attention is adopted. This method constructs a lightweight feature extraction backbone network, a scale-adaptive feature fusion module, a feature attention encoder module, and a sparse attention decoder module. It combines adaptive pseudo-sample enhancement, multi-scale feature processing, and dynamic sparse decoding to optimize the loss function and improve detection performance.
It improves the target detection accuracy and real-time performance of UAVs in complex scenarios, reduces the computational load of the model, adapts to the computing power limitations of UAV embedded platforms, and achieves efficient recognition of small targets and camouflaged targets.
Figure CN122244729A_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the field of computer vision target detection, specifically to low-altitude remote sensing image analysis based on deep learning, and is particularly suitable for real-time target detection on UAV embedded platforms. The invention relates to a lightweight target detection method that integrates cross-sample dual-branch cross-attention encoding and dynamic sparse decoding, which can effectively address missed detection of small targets, difficulty in identifying camouflaged targets, poor generalization under complex lighting conditions, and low deployment efficiency on embedded platforms in UAV scenarios.
Background Technology
[0002] The rapid development of drone technology has made it a core data source for low-altitude remote sensing, with increasingly urgent needs in applications such as agricultural plant protection, urban security, and emergency rescue. Currently, most mainstream target detection technologies rely on deep learning algorithms, such as Faster R-CNN and YOLO, which are based on convolutional neural networks. While these models can achieve good detection accuracy in environments with ample computing power, they suffer from three major drawbacks in embedded drone platform applications:
[0003] First, there is a contradiction between feature extraction and scale adaptability: Although the traditional lightweight model reduces the computation to 6.4 GFLOPs, it has poor adaptability to the target scale difference of 10-500 pixels in the drone scene, the average accuracy value of small targets (≤32 pixels) is low, and the ability to capture the features of camouflaged targets covered by vegetation is insufficient.
[0004] Second, the computational overhead of the attention mechanism is out of control: existing Vision Transformer derivative models use full attention, whose computational cost grows quadratically with the feature scale. On 640×480 resolution UAV images, the computational load exceeds the capacity of the embedded platform.
[0005] Third, the data augmentation scenario adaptability is poor: traditional cropping and flipping methods tend to cause the model to learn naive decision rules, resulting in poor generalization of targets in drone scenarios with dynamic lighting (backlighting in the morning and evening) and complex backgrounds (densely built areas, farmland vegetation), and a significant decrease in cross-scenario detection accuracy.
[0006] Therefore, there is an urgent need for a dedicated UAV target detection method that balances lightweight design, multi-scale adaptability, and generalization ability in complex scenarios.
Summary of the Invention
[0007] This invention provides a lightweight target detection method and apparatus for UAVs based on adaptive sparse attention. This invention solves the problems of low accuracy, poor real-time performance, and difficult model deployment in UAV image detection. Details are described below:
[0008] A lightweight target detection method for UAVs based on adaptive sparse attention, the method comprising:
[0009] An image-level object detection model is constructed, consisting of a lightweight feature extraction backbone network, a scale-adaptive feature fusion module, a feature attention encoder module, and a sparse attention decoder module;
[0010] Multi-scale features of the image samples in the training dataset are extracted using the lightweight feature extraction backbone network; the pseudo-anomaly samples are likewise input into the lightweight feature extraction backbone network, which outputs the multi-scale features of the pseudo-anomaly samples;
[0011] The scale-adaptive feature fusion module processes features at different scales differently during the fusion process, ultimately obtaining the feature maps $f_n$ and $f_s$ corresponding to normal samples and pseudo-abnormal samples, respectively;
[0012] The feature attention encoder module uses a cross-attention mechanism to cross-fuse the information from the two feature maps $f_n$ and $f_s$, obtaining the cross-fused features;
[0013] The sparse attention decoder module decodes the cross-fused features, generates reconstructed features, calculates the loss on the reconstructed features and the input of the feature attention encoder module, and obtains the final image-level object detection result.
[0014] The method further includes: constructing an image-level target detection model using multiple loss function constraints, and detecting UAV images based on the constrained image-level target detection model.
[0015] The loss is calculated on the output of the sparse attention decoder module and the input of the feature attention encoder to obtain the final image-level object detection result.
[0016] The method generates pseudo-abnormal samples through adaptive patch generation, adaptive pixel jitter and rotation, and adaptive patch pasting.
[0017] The scale-adaptive feature fusion module includes:
[0018] Differential feature processing: Adaptive dilation rate dilated convolution is used, and the dilation rate is adjusted according to the average scale of the target.
[0019] An enhanced CBS module is adopted, where CBS denotes a convolutional module consisting of a 3×3 convolution, batch normalization, and the SiLU activation function. An additional 1×1 channel adjustment convolution is added on top of CBS.
[0020] Constraint layers are used to restrict intermediate network layers, forcing the model to learn a deep representation of the data. A "channel compression-restoration-residual fusion" structure is employed, as shown in the following formula:
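[0021] $F_{out} = F_{in} + \mathrm{CBS}_2(\mathrm{CBS}_1(F_{in}))$
[0022] where $F_{in}$ and $F_{out}$ are the input and output feature maps of the constraint layer. CBS1 compresses the input channels to half through a 1×1 convolution, while CBS2 restores the original number of channels through a 1×1 convolution. The residual connection preserves the original features.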
[0023] The feature attention encoder module has a dual-branch structure, which cross-integrates feature information from normal samples and pseudo-abnormal samples.
[0024] After sparse neighborhood attention is obtained, the feature vector passes sequentially through a residual connection, normalization, a feedforward neural network and a multilayer perceptron to complete one decoding layer; the decoder finally outputs the decoded reconstructed feature vector, which is compared with the input of the feature attention encoder to compute the reconstruction error that serves as the loss value for supervising training.
[0025] Furthermore, the calculated loss includes: localization loss and reconstruction loss.
[0026] The localization loss is used to measure the spatial difference between the predicted bounding box and the ground truth bounding box, and it employs a combination of four IoU variants:
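[0027] $L_{loc} = \lambda_1 L_{IoU} + \lambda_2 L_{GIoU} + \lambda_3 L_{DIoU} + \lambda_4 L_{CIoU}$
[0028] where $\lambda_1$ is 0.5 and $L_{IoU}$ is the intersection over union (IoU) term; $\lambda_2$ is 0.3 and $L_{GIoU}$ is the generalized IoU term; $\lambda_3$ is 0.3 and $L_{DIoU}$ is the distance IoU term; $\lambda_4$ is 0.1 and $L_{CIoU}$ is the complete IoU term;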
[0029] The reconstruction loss measures the difference between the final reconstructed feature sequence and the input of the feature attention encoder, using the mean squared logarithmic error as the loss function:
[0030] $L_{rec} = \frac{1}{N} \sum_{i=1}^{N} \left( \log(1 + S_r^{(i)}) - \log(1 + S_n^{(i)}) \right)^2$
[0031] where $S_r$ represents the final reconstructed feature sequence, $S_n$ represents the input of the feature attention encoder for normal samples, and N is the number of samples.
[0032] In a second aspect, a lightweight target detection device for unmanned aerial vehicles based on adaptive sparse attention, the device comprising: a processor and a memory, the memory storing program instructions, the processor calling the program instructions stored in the memory to cause the device to perform the method described in any one of the first aspects.
[0033] In a third aspect, a computer-readable storage medium storing a computer program, the computer program including program instructions that, when executed by a processor, cause the processor to perform the method described in any one of the first aspects.
[0034] The beneficial effects of the technical solution provided by this invention are:
[0035] 1. Enhance generalization ability in complex scenarios: The adaptive pseudo-sample enhancement module abandons the traditional fixed parameter enhancement mode and dynamically generates pseudo-samples based on the characteristics of UAV targets; it solves the problem of data scarcity in special scenarios through scale adaptive patching, illumination-level pixel jitter and semantic matching strategies.
[0036] 2. Balanced feature representation and lightweight design: The multi-scale lightweight backbone network adopts a "depthwise separable convolution + dynamic channel adjustment" structure, which configures channels differently for targets of different pixel sizes, reducing the model's computation to 3.2 GFLOPs (50% less than ShuffleNet V2) while improving the feature representation capability for small targets;
[0037] 3. Controlling attention computing cost and enhancing features: The cross-sample dual-branch attention encoder fuses normal and pseudo-sample features to mine latent features of small targets; the dynamic sparse decoder uses sparse masking to reduce attention computation to linear complexity, matching the computing power available on UAVs.
[0038] 4. Improve training stability and deployment value: The adaptive weighted loss function, combining the four-variant IoU localization loss with the mean squared logarithmic error reconstruction loss, improves performance by 15%-22% compared to mainstream lightweight models, with a detection speed above 30 FPS, and can be deployed in scenarios such as plant protection and security.
Attached Figure Description
[0039] Figure 1 is a flowchart of a lightweight target detection method for UAVs based on adaptive sparse attention;
[0040] Figure 2 is a structural diagram of the scale-adaptive feature fusion module;
[0041] Figure 3 is a structural diagram of the dual-branch feature attention encoder;
[0042] Figure 4 is a structural diagram of the feature attention fusion layer.
Detailed Implementation
[0043] To make the objectives, technical solutions, and advantages of the present invention clearer, the embodiments of the present invention will be described in further detail below.
[0044] Example 1
[0045] A lightweight target detection method for UAVs based on adaptive sparse attention is provided (see Figure 1). The method achieves efficient target detection in UAV scenarios through the cooperation of six modules: adaptive pseudo-sample enhancement, multi-scale lightweight feature extraction, scale-adaptive feature fusion, cross-sample dual-branch attention encoding, dynamic sparse decoding, and adaptive weighted loss optimization. The specific steps are as follows:
[0046] Step 101: Construct a training dataset for UAV image target detection and label the UAV images in the dataset; generate adaptive pseudo-anomaly samples based on the target distribution characteristics of the UAV images;
[0047] The steps include: constructing a training dataset for drone image target detection, labeling the drone images in the training dataset, and calculating the loss of the model output using the labels of normal samples; generating adaptive pseudo-abnormal samples based on the distribution characteristics of drone image targets, and using them as input for subsequent training in step 102, thereby solving the problem of insufficient scene coverage in the training data.
[0048] Step 102: Construct a lightweight feature extraction backbone network for the UAV image target detection model, and use it to extract multi-scale features of the image samples in the training dataset; input the pseudo-anomaly samples into the lightweight feature extraction backbone network, and output the multi-scale features of the pseudo-anomaly samples;
[0049] The lightweight feature extraction backbone network of the UAV image target detection model adopts a "depthwise separable convolution + dynamic channel adjustment" structure.
[0050] Step 103: Construct a scale-adaptive feature fusion module to fuse the features at different scales for normal samples and for pseudo-anomaly samples. Features at different scales are processed differently during fusion, ultimately obtaining the feature maps $f_n$ and $f_s$ corresponding to normal samples and pseudo-anomaly samples, respectively;
[0051] Step 104: Construct a feature attention encoder module, and use a cross-attention mechanism to cross-fuse the information from the two feature maps $f_n$ and $f_s$, obtaining the cross-fused features;
[0052] This step helps the model learn features such as small, camouflaged targets that are not easily detected in drone images.
[0053] Step 105: Construct a sparse attention decoder module to decode the cross-fused features, generate reconstructed features, calculate the loss on the reconstructed features and the input of the feature attention encoder module, and obtain the final image-level object detection result;
[0054] This step reduces the number of parameters and the computational burden on the model, while also decreasing the model's dependence on information within the feature domain, thus generating reconstructed features. The loss is calculated on the output of the sparse attention decoder module and the input of the feature attention encoder to obtain the final image-level object detection result.
[0055] Step 106: Use multiple loss functions to constrain the image-level target detection model constructed in steps 102-105, and use the constrained image-level target detection model to detect UAV images.
[0056] This step dynamically adjusts the loss weights based on the training stage and target scale, taking into account both localization accuracy and feature reconstruction quality.
[0057] In summary, the embodiments of the present invention solve the problems of low accuracy, poor real-time performance, and difficulty in model deployment of UAV image detection through the above steps 101-106.
[0058] Example 2
[0059] The following further explains the scheme of Example 1 using specific calculation formulas and examples:
I. Adaptive Pseudo-Sample Enhancement Module
[0060] Unlike traditional fixed-parameter enhancement, this module generates pseudo-anomaly samples through adaptive patch generation, adaptive pixel jitter and rotation, and adaptive patch pasting. Specific steps are as follows:
[0061] Adaptive Patch Generation: Based on the statistical distribution of UAV target scales in the training set (small targets 10-32 pixels, medium targets 32-64 pixels, large targets 64-500 pixels), the size range of rectangular patches is determined: small target patches 16-32 pixels, medium target patches 32-64 pixels, and large target patches 64-128 pixels; based on the statistical distribution of target aspect ratios (vehicles 1:2, pedestrians 1:3, buildings 2:1), the aspect ratio range of patches is set from 1:3 to 3:1 to ensure that the patches match the shape of the real targets.
[0062] Adaptive pixel jitter and rotation: The mean light intensity I is calculated through the local grayscale histogram of the image: when I < 50 (dark area), the pixel value jitter amplitude is 10-20; when 50 ≤ I ≤ 200 (normal area), the jitter amplitude is 5-10; when I > 200 (bright area), the jitter amplitude is 10-15; based on the target orientation statistics (such as vehicles being mostly distributed along the flight direction), the rotation angle is adaptively selected, and the rotation probability is consistent with the proportion of the main orientation of the target in the sample.
[0063] Adaptive patch pasting: A pre-trained lightweight candidate box generator (computation cost ≤ 0.2G FLOPs) is used to obtain background regions in the image without real targets; patches are preferentially pasted to background regions with similar patch semantics (e.g., vehicle patches are pasted to road backgrounds, pedestrian patches are pasted to green backgrounds) to avoid background semantic conflicts, and the pseudo sample generation efficiency is improved by 40% compared with traditional methods.
[0064] This method has relatively low computational cost, is simple and intuitive, and is easy to implement. It prevents the model from learning naive decision rules that simply identify image enhancements (cropping, rotation, and flipping, etc.) and encourages the model to learn to detect irregularities in images.
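The thresholds in [0061] and [0062] can be sketched as a small lookup. The following is a minimal illustration only, not the implementation of the invention: the function names, the use of Python's `random` module, and the reading that both patch sides fall in the stated size range are assumptions made for the example.

```python
import random

# Patch size ranges in pixels per target class, taken from paragraph [0061].
PATCH_SIZE_RANGES = {"small": (16, 32), "medium": (32, 64), "large": (64, 128)}


def jitter_amplitude_range(mean_intensity):
    """Return the (min, max) pixel jitter amplitude for mean intensity I ([0062])."""
    if mean_intensity < 50:       # dark area
        return (10, 20)
    if mean_intensity <= 200:     # normal area
        return (5, 10)
    return (10, 15)               # bright area


def sample_patch_shape(target_class, rng):
    """Sample (width, height) with aspect ratio kept between 1:3 and 3:1.

    Assumption: both sides are drawn from the size range of the class.
    """
    lo, hi = PATCH_SIZE_RANGES[target_class]
    width = rng.randint(lo, hi)
    h_min = max(lo, -(-width // 3))   # ceil(width / 3), keeps width:height <= 3:1
    h_max = min(hi, width * 3)        # keeps width:height >= 1:3
    height = rng.randint(h_min, h_max)
    return width, height


rng = random.Random(0)
width, height = sample_patch_shape("small", rng)
dark_range = jitter_amplitude_range(35)
```

Here a region with mean intensity 35 falls in the dark band and receives the 10-20 jitter range, while the sampled patch stays inside the 16-32 pixel range for small targets.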
[0065] II. Multi-scale adaptive lightweight backbone network
[0066] As the core module of the network, its main function is to transform the input image into multi-scale feature maps. These feature maps capture different levels of features, from low-level edge and texture information to high-level semantic information. This module balances computational cost and feature representation through "dynamic channel adjustment + lightweight convolutional structure". The input consists of normal sample images and pseudo-abnormal samples, both of which are RGB images with 3 channels. After multi-stage downsampling, feature maps $F_k$ (k = 1, 2, 3, 4, 5) of 5 sizes are obtained; the specific structure is shown in Table 1. A lower number of channels (32-56) is set for small targets (corresponding to F1-F2) to avoid redundant computation; a higher number of channels (272-448) is set for the semantic features of large targets (corresponding to F4-F5) to improve semantic expression; the entire network adopts depthwise separable convolution (reducing parameters by 80%) and channel shuffle (reducing channel redundancy), reducing the total computation to 3.2 GFLOPs (a 50% reduction compared to ShuffleNet V2), and improving the channel expressive power of small target features by 30%.
[0067] Table 1
[0068]
Feature map | Resolution | Number of channels | Convolutional structure
F1 | 112×112 | 32 | 3×3 depthwise separable convolution (2 layers) + channel shuffle
F2 | 56×56 | 56 | 1×1 pointwise convolution + 3×3 depthwise separable convolution (2 layers)
F3 | 28×28 | 112 | 1×1 pointwise convolution + 3×3 depthwise separable convolution (2 layers)
F4 | 14×14 | 272 | 1×1 pointwise convolution + 3×3 depthwise separable convolution (2 layers)
F5 | 7×7 | 448 | 1×1 pointwise convolution + 3×3 depthwise separable convolution (2 layers)
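To illustrate why depthwise separable convolution reduces the parameter count, the following sketch compares a standard 3×3 convolution with its depthwise separable counterpart for one channel pair from Table 1 (56 to 112 channels). Bias terms are ignored and the function names are hypothetical; the figure is for a single layer and is not meant to reproduce the network-wide 80% stated above.

```python
def standard_conv_params(c_in, c_out, k=3):
    """Weights of a standard k×k convolution (bias ignored)."""
    return k * k * c_in * c_out


def depthwise_separable_params(c_in, c_out, k=3):
    """Weights of a k×k depthwise convolution followed by a 1×1 pointwise one."""
    return k * k * c_in + c_in * c_out


c_in, c_out = 56, 112  # channel counts of F2 and F3 in Table 1
std = standard_conv_params(c_in, c_out)
dws = depthwise_separable_params(c_in, c_out)
reduction = 1 - dws / std  # fraction of weights saved for this single layer
```

For this layer the standard convolution has 56448 weights and the depthwise separable version 6776, a reduction of roughly 88%.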
[0069] III. Scale-Adaptive Feature Fusion Module
[0070] To address the scale differences between F1 and F5, a strategy of "differentiated processing + cross-scale alignment" is adopted, as shown in Figure 2.
[0071] 1. Differentiated feature processing:
[0072] F1 (small target), F2 (medium-small target), F3 (medium target): Adaptive dilation rate dilated convolution is used, adjusting the dilation rate according to the average scale of the target. Specifically, F1, F2, and F3 use dilated convolutional layers with dilation rates of 4, 3, and 2, respectively, to expand the receptive field while avoiding the grid effect.
[0073] F4 (medium-large targets) and F5 (large targets): Because these feature maps already have a low spatial resolution, dilated convolution is no longer used. Instead, an enhanced CBS module (ECBS) is adopted, where CBS denotes a convolution module consisting of a 3×3 convolution, batch normalization, and the SiLU activation function. An additional 1×1 channel adjustment convolution is added to the CBS, compressing the F4 channels from 272 to 112 and the F5 channels from 448 to 112, thereby reducing the computational cost of fusion.
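The effect of the dilation rates on the receptive field can be checked with the standard relation k_eff = k + (k - 1)(d - 1) for a kernel of size k and dilation rate d. The sketch below assumes 3×3 kernels for the dilated layers, which the text does not state explicitly; the names are illustrative.

```python
def effective_kernel_size(kernel_size, dilation):
    """Span covered by one dilated kernel along one axis."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)


# Dilation rates for F1, F2 and F3 as given in paragraph [0072].
DILATION_RATES = {"F1": 4, "F2": 3, "F3": 2}

# Assumption: the dilated layers use 3×3 kernels.
effective = {name: effective_kernel_size(3, d) for name, d in DILATION_RATES.items()}
```

Under this assumption the kernels on F1, F2 and F3 span 9, 7 and 5 positions respectively, so the highest-resolution map receives the largest expansion of its receptive field.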
[0074] 2. Constraint Layer Design:
[0075] Constraint layers are used to restrict intermediate network layers, forcing the model to learn deeper representations of the data. A "channel compression-restoration-residual fusion" structure is employed, as shown in the following formula:
[0076] $F_{out} = F_{in} + \mathrm{CBS}_2(\mathrm{CBS}_1(F_{in}))$
[0077] where $F_{in}$ and $F_{out}$ are the input and output feature maps of the constraint layer. CBS1 compresses the input channels to half through a 1×1 convolution (e.g., F1 changes from 32 to 16), while CBS2 restores the original number of channels through a 1×1 convolution (16 becomes 32). The residual connection preserves the original features and avoids gradient vanishing.
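A minimal sketch of the compression-restoration-residual structure at a single pixel is given below. It treats each 1×1 convolution as a linear map across channels, omits batch normalization for brevity, and uses random weights; all names and weight values are assumptions made for illustration, not the trained layers of the invention.

```python
import math
import random


def silu(x):
    """SiLU activation: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


def pointwise_conv(vec, weights):
    """1×1 convolution at one pixel: a linear map across channels."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weights]


def constraint_layer(vec, w_compress, w_restore):
    """Compress channels (CBS1), restore them (CBS2), then add the input back."""
    compressed = [silu(x) for x in pointwise_conv(vec, w_compress)]
    restored = [silu(x) for x in pointwise_conv(compressed, w_restore)]
    return [a + b for a, b in zip(vec, restored)]


rng = random.Random(0)
channels = 32  # F1 channel count; compressed to 16 and restored to 32
w_compress = [[rng.uniform(-0.1, 0.1) for _ in range(channels)]
              for _ in range(channels // 2)]
w_restore = [[rng.uniform(-0.1, 0.1) for _ in range(channels // 2)]
             for _ in range(channels)]
pixel = [rng.uniform(-1.0, 1.0) for _ in range(channels)]
out = constraint_layer(pixel, w_compress, w_restore)
```

The output keeps the 32 input channels, and with all-zero weights the layer reduces to the identity, which is the property the residual connection is meant to preserve.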
[0078] 3. Cross-scale alignment and splicing
[0079] Feature maps of different sizes are upsampled or downsampled to a common resolution and then concatenated. After concatenation, the SENet attention layer adaptively allocates the feature weights of each scale (the feature weights of small targets are increased by 20%), and the module finally outputs the fused features of normal samples and the fused features of pseudo-abnormal samples.
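The alignment step can be illustrated with simple arithmetic. The sketch below assumes a common target resolution of 28×28 (the resolution of F3), which the text does not specify, and assumes that F1 to F3 keep their channel counts while F4 and F5 are compressed to 112 channels as described in [0073].

```python
# name: (spatial resolution, channels after differentiated processing)
FEATURES = {
    "F1": (112, 32),
    "F2": (56, 56),
    "F3": (28, 112),
    "F4": (14, 112),  # compressed from 272 by the ECBS module
    "F5": (7, 112),   # compressed from 448 by the ECBS module
}
TARGET_RESOLUTION = 28  # hypothetical choice of common resolution


def resize_factor(resolution, target=TARGET_RESOLUTION):
    """Factor below 1 means downsampling, above 1 means upsampling."""
    return target / resolution


factors = {name: resize_factor(res) for name, (res, _) in FEATURES.items()}
concat_channels = sum(ch for _, ch in FEATURES.values())
```

Under these assumptions F1 and F2 are downsampled by factors of 4 and 2, F4 and F5 are upsampled by factors of 2 and 4, and the concatenated map has 424 channels before the SENet layer reweights them.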
[0080] IV. Cross-sample dual-branch cross-attention encoder
[0081] During the encoding phase, the feature attention encoder is responsible for combining the node sequence information corresponding to normal samples and pseudo-abnormal samples, and generating comprehensive feature vectors for subsequent modules. The structure of this module is similar to the encoding layer of the Vision Transformer, but it improves upon the original multi-head attention module. The two input feature sequences first enter the feature attention fusion layer, and the merged feature vector then passes through a feedforward neural network, residual connections and normalization, and a multilayer perceptron to complete an encoder layer. Before entering subsequent encoder layers, the feature vector is concatenated with the input feature sequences to form the input for the next layer.
[0082] The feature attention fusion module in the encoder layer has a dual-branch structure, as shown in Figure 4. The input feature sequences are processed separately. After passing through the multi-head attention module, the feature vectors are cross-concatenated with the inputs of the other branch, and then concatenated again after passing through a multilayer perceptron to obtain the final output feature vector. The multi-head attention module enhances the model's ability to focus on different aspects of the input data, and the multilayer perceptron performs non-linear transformations on the extracted information, improving the model's expressive power.
[0083] The feature attention encoder cross-fuses feature information from normal samples and pseudo-abnormal samples, enhancing the model's ability to understand and express images, thus laying the foundation for accurate target detection in the future.
[0084] V. Dynamic Sparse Attention Decoder
[0085] In the decoding stage, the decoder transforms the feature representation generated by the encoder into the final output. By combining the input embedding vector with the currently generated output features, the model can better understand the global context of the image. The classic Transformer decoder for natural language processing uses a masked attention module, in which a lower triangular matrix serves as a mask so that the current word is associated only with the word at the current position and those preceding it. This ensures that the model does not "see" future information when generating sequences, thus avoiding information leakage and maintaining the correctness of autoregressive generation. Building on this idea, this invention designs a sparse attention decoder that masks certain regions when calculating the attention map. This reduces the model's reliance on information within the feature neighborhood during prediction and pushes the model to learn deeper information from the samples.
[0086] In the self-attention module, when feature vectors are used to calculate the attention map, each pixel is associated with all locations via the "query-key" relationship, resulting in a full self-attention map. By restricting each pixel's association to locations whose distance from it is greater than n / 2 (where n is an adjustable hyperparameter), that is, by masking out its local neighborhood, a sparse neighborhood attention map is obtained. Accordingly, a sparse neighborhood mask can be designed to mask out part of the full self-attention map, achieving sparse attention. The calculation formula for sparse attention is as follows:
[0087]
[0088] Where Q, K, and V are the query, key, and value matrices of the input vector, respectively; M is the sparse mask; and ⊙ represents the Hadamard product (element-wise multiplication). After obtaining sparse neighborhood attention, the feature vector is sequentially passed through residual connections, normalization, a feedforward neural network, and a multilayer perceptron to complete a decoding layer. The decoder finally outputs the reconstructed feature vector, which is compared with the input of the feature attention encoder to calculate the reconstruction error and determine the loss value for supervised training.
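The masking idea can be sketched in a few lines. Because the formula is not reproduced in this text, the sketch makes several assumptions that should be read as illustrative only: tokens are arranged in a 1-D sequence and distance is the index difference, the scores are scaled by the square root of the feature dimension, and the mask M is applied to the softmax output by element-wise multiplication without renormalization.

```python
import math


def sparse_mask(length, n):
    """M[i][j] = 1 if tokens i and j are more than n / 2 apart, else 0."""
    return [[1.0 if abs(i - j) > n / 2 else 0.0 for j in range(length)]
            for i in range(length)]


def softmax(row):
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def sparse_attention(q, k, v, mask):
    """Scaled dot-product attention with the map masked element-wise by M."""
    d = len(q[0])
    out = []
    for i, q_i in enumerate(q):
        scores = [sum(a * b for a, b in zip(q_i, k_j)) / math.sqrt(d) for k_j in k]
        weights = [w * m for w, m in zip(softmax(scores), mask[i])]
        out.append([sum(w * v_j[c] for w, v_j in zip(weights, v))
                    for c in range(len(v[0]))])
    return out


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
mask = sparse_mask(len(tokens), n=2)
out = sparse_attention(tokens, tokens, tokens, mask)
```

With n = 2 each token ignores itself and its immediate neighbors, so the first token attends only to the third and fourth. A practical implementation would skip the masked entries entirely to obtain the computational savings, rather than computing and then zeroing them as this sketch does.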
[0089] VI. Adaptive Weighted Hybrid Loss
[0090] In deep learning, a loss function is used to evaluate the difference between the model's predicted value and the actual target value. It guides the model during training on how to adjust its internal parameters through backpropagation to minimize this difference, gradually bringing the predicted value closer to the actual target value, thereby improving model performance. This invention primarily includes two loss functions: localization loss and reconstruction loss.
[0091] Intersection over Union (IoU) is a metric that measures the degree of overlap between two regions, such as a predicted box and a ground truth box, and is frequently used in object detection and semantic segmentation tasks in deep learning. This invention combines IoU with its variants GIoU, DIoU, and CIoU. A brief introduction to these variants is provided below:
[0092] GIoU (Generalized IoU) introduces the area of the smallest enclosing box on top of IoU and penalizes the non-overlapping case, so that the optimization process can still generate effective gradients when the boxes do not overlap.
[0093] DIoU (Distance IoU) adds a center point Euclidean distance penalty to the IoU to accelerate the convergence of the bounding box position and suppress center offset.
[0094] CIoU (Complete IoU) combines the overlap area, center point distance, and aspect ratio differences to impose more comprehensive constraints on the shape and position of the bounding box, thereby further improving regression accuracy.
[0095] This loss function is used to train the regression accuracy of the detection boxes, addressing issues such as inaccurate localization, box offset, and low overlap in UAV detection. The localization loss measures the spatial difference between the predicted and ground truth boxes, employing a combination of four IoU variants:
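[0096] $L_{loc} = \lambda_1 L_{IoU} + \lambda_2 L_{GIoU} + \lambda_3 L_{DIoU} + \lambda_4 L_{CIoU}$, where the weights $\lambda_1$ to $\lambda_4$ of the IoU, GIoU, DIoU and CIoU terms are 0.5, 0.3, 0.3 and 0.1, respectively.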
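The four overlap measures and their weighted combination can be sketched as follows, using the standard published definitions of GIoU, DIoU and CIoU and the weights 0.5, 0.3, 0.3 and 0.1 from [0028]. Two points are assumptions made for this example rather than statements of the invention: each loss term is taken as one minus the corresponding measure, and boxes are given as (x1, y1, x2, y2) with positive width and height.

```python
import math

WEIGHTS = (0.5, 0.3, 0.3, 0.1)  # IoU, GIoU, DIoU, CIoU weights from [0028]


def iou_variants(a, b, eps=1e-9):
    """Return (IoU, GIoU, DIoU, CIoU) for boxes given as (x1, y1, x2, y2)."""
    wa, ha = a[2] - a[0], a[3] - a[1]
    wb, hb = b[2] - b[0], b[3] - b[1]
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = wa * ha + wb * hb - inter
    iou = inter / (union + eps)

    # Smallest box enclosing both inputs.
    enc_w = max(a[2], b[2]) - min(a[0], b[0])
    enc_h = max(a[3], b[3]) - min(a[1], b[1])
    enc_area = enc_w * enc_h
    giou = iou - (enc_area - union) / (enc_area + eps)

    # Squared center distance over squared enclosing-box diagonal.
    rho2 = ((a[0] + a[2] - b[0] - b[2]) ** 2
            + (a[1] + a[3] - b[1] - b[3]) ** 2) / 4.0
    diou = iou - rho2 / (enc_w ** 2 + enc_h ** 2 + eps)

    # Aspect-ratio consistency term of CIoU.
    v = (4.0 / math.pi ** 2) * (math.atan(wb / hb) - math.atan(wa / ha)) ** 2
    alpha = v / (1.0 - iou + v + eps)
    ciou = diou - alpha * v
    return iou, giou, diou, ciou


def localization_loss(pred, truth):
    """Weighted sum of (1 - measure) over the four IoU variants (assumed form)."""
    return sum(w * (1.0 - m) for w, m in zip(WEIGHTS, iou_variants(pred, truth)))


perfect = localization_loss((0, 0, 2, 2), (0, 0, 2, 2))
disjoint = localization_loss((0, 0, 1, 1), (2, 0, 3, 1))
```

Identical boxes give a loss close to zero, while the two disjoint unit boxes still produce distinct GIoU and DIoU values, which is what keeps the gradient informative when the plain IoU is zero.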
[0097] The reconstruction loss measures the difference between the final reconstructed feature sequence and the input of the feature attention encoder. The mean squared logarithmic error is used as the loss function, and the calculation formula is as follows:
[0098] $L_{rec} = \frac{1}{N} \sum_{i=1}^{N} \left( \log(1 + S_r^{(i)}) - \log(1 + S_n^{(i)}) \right)^2$
[0099] where $S_r$ represents the final reconstructed feature sequence, $S_n$ represents the input of the feature attention encoder for normal samples, and N is the number of samples. The mean squared logarithmic error (MSLE) applies a logarithm to the values before the squared error is computed, which yields more stable gradients during optimization. This helps improve the training stability of the model, especially when dealing with complex data.
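The mean squared logarithmic error can be written in a few lines using the standard definition with log(1 + x). This sketch assumes flattened, non-negative feature values (for example after an activation that outputs values above -1); the function name and sample values are illustrative.

```python
import math


def msle(reconstructed, reference):
    """Mean squared logarithmic error between two equal-length sequences."""
    n = len(reconstructed)
    return sum((math.log1p(r) - math.log1p(s)) ** 2
               for r, s in zip(reconstructed, reference)) / n


s_r = [0.0, 1.0, 3.0]  # hypothetical reconstructed features
s_n = [0.0, 1.0, 1.0]  # hypothetical encoder input features
loss = msle(s_r, s_n)
```

Only the third element differs, and its contribution is (log 4 - log 2)^2, so the logarithm compresses the effect of large deviations compared with a plain squared error of (3 - 1)^2.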
[0100] The final loss function $L_{total}$ is the weighted sum of the two loss functions above, calculated as follows:
[0101] $L_{total} = \alpha L_{loc} + \beta L_{rec}$
[0102] where $\alpha$ and $\beta$ are the balancing weights of the localization loss and the reconstruction loss, set to 1 and 0.8 respectively in this embodiment of the invention.
[0103] VII. Conclusion
[0104] This invention focuses on the core technical pain points of lightweight target detection in UAV scenarios. Addressing three key shortcomings of existing technologies—poor feature extraction adaptability (low AP for small targets, weak detection of camouflaged targets), computational overload of attention mechanisms, and insufficient generalization of data augmentation—this invention proposes a lightweight target detection scheme that integrates adaptive sparse attention and cross-sample feature interaction. This forms a complete optimization system from data construction to model decoding. The specific technical contributions and core values are as follows:
[0105] Data layer innovation: Through a lightweight enhancement strategy of variable-size patch cropping, scene-based pixel rotation / jitter, and patch pasting, training samples containing pseudo-abnormal targets are generated. This effectively avoids the naive decision rule learning problem caused by traditional cropping and flipping methods, injects into the model the dynamic lighting and complex background interference features unique to UAV scenes, and improves cross-scene generalization ability. The enhancement method also has low computational cost and is easy to embed into the training process of embedded platforms.
[0106] Core innovations in the encoding-decoding layer: The encoding end adopts a dual-branch cross-attention structure, which initially extracts the feature sequences of normal samples and pseudo-abnormal samples through multi-head attention, cross-branch feature cross-concatenation, and nonlinear enhancement by multilayer perceptron, so as to achieve complementary information between the two types of samples, enhance the feature discrimination of small targets and disguised targets, and overcome the limitation of traditional single-branch encoding in understanding the features of complex scenes. The decoding end is designed with an adaptive sparse attention mechanism, which reduces the model's over-reliance on feature neighborhood information by masking local information through sparse neighborhood masking, dynamic adjustment of distance threshold (based on target scale), and sparse attention calculation, and forces the model to learn deep features at long distances, while reducing the computational cost of full attention and adapting to the computational constraints of UAV embedded platforms.
[0107] Loss optimization layer guarantee: Construct a weighted hybrid loss of localization loss and reconstruction loss: The localization loss integrates four variants of IoU, GIoU, DIoU and CIoU to solve the problems of inaccurate target localization and bounding box offset of UAVs; the reconstruction loss adopts the mean square logarithmic error to improve training stability; the two types of loss are optimized together to ensure that the model can balance detection accuracy and real-time performance under the premise of lightweight design.
[0108] Example 3
[0109] A lightweight target detection device for unmanned aerial vehicles (UAVs) based on adaptive sparse attention is disclosed. The device includes a processor and a memory, the memory storing program instructions. The processor invokes the program instructions stored in the memory to cause the device to execute the following method steps in Embodiment 1:
[0110] An image-level object detection model is constructed, consisting of a lightweight feature extraction backbone network, a scale-adaptive feature fusion module, a feature attention encoder module, and a sparse attention decoder module;
[0111] Features of the image samples in the training dataset are extracted using the lightweight feature extraction backbone network; pseudo-abnormal samples are input into the lightweight feature extraction backbone network, which outputs the multi-scale features of the pseudo-abnormal samples;
[0112] The scale-adaptive feature fusion module processes features at different scales differently during the fusion process, ultimately obtaining the feature maps $f_n$ and $f_s$ corresponding to normal samples and pseudo-abnormal samples, respectively;
[0113] The feature attention encoder module utilizes a cross-attention mechanism to cross-fuse the information from the two feature maps $f_n$ and $f_s$, obtaining the cross-fused features;
[0114] The sparse attention decoder module decodes the cross-fused features, generates reconstructed features, calculates the loss between the reconstructed features and the input of the feature attention encoder module, and obtains the final image-level object detection result.
[0115] The device further constructs the image-level target detection model under the constraints of multiple loss functions, and detects UAV images using the constrained image-level target detection model.
[0116] The loss is calculated between the output of the sparse attention decoder module and the input of the feature attention encoder module to obtain the final image-level object detection result.
[0117] The device generates pseudo-abnormal samples through adaptive patch generation, adaptive pixel jitter and rotation, and adaptive patch pasting.
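The three steps of pseudo-abnormal sample generation can be sketched as follows. This is only an illustration under stated assumptions: the image is a 2-D list of grayscale values in 0..255, rotation is restricted to multiples of 90 degrees, and uniform random sampling stands in for the adaptive rules, which are not specified in this passage. The function names and the parameters `min_size`, `max_size` and `jitter` are hypothetical.

```python
import random

def rotate90(patch):
    """Rotate a 2-D list clockwise by 90 degrees."""
    return [list(row) for row in zip(*patch[::-1])]

def make_pseudo_abnormal(image, rng, min_size=2, max_size=4, jitter=30):
    """Return a copy of `image` with one jittered, rotated patch pasted in."""
    height, width = len(image), len(image[0])
    limit = min(max_size, height, width)  # keeps a rotated patch inside the image

    # Step 1: patch generation with a variable size and a random source position.
    ph = rng.randint(min_size, limit)
    pw = rng.randint(min_size, limit)
    top = rng.randint(0, height - ph)
    left = rng.randint(0, width - pw)
    patch = [row[left:left + pw] for row in image[top:top + ph]]

    # Step 2: pixel jitter (clamped to 0..255) and rotation.
    patch = [[min(255, max(0, p + rng.randint(-jitter, jitter))) for p in row]
             for row in patch]
    for _ in range(rng.randint(0, 3)):
        patch = rotate90(patch)

    # Step 3: paste the patch at a random target position.
    ph, pw = len(patch), len(patch[0])
    top = rng.randint(0, height - ph)
    left = rng.randint(0, width - pw)
    result = [row[:] for row in image]
    for r in range(ph):
        result[top + r][left:left + pw] = patch[r]
    return result

normal = [[(r * 8 + c) * 3 for c in range(8)] for r in range(8)]
pseudo_abnormal = make_pseudo_abnormal(normal, random.Random(0))
```

The normal image is left unmodified, so the same source can supply both the normal branch and the pseudo-abnormal branch during training.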
[0118] The scale-adaptive feature fusion module includes:
[0119] Differentiated feature processing: dilated convolution with an adaptive dilation rate is used, and the dilation rate is adjusted according to the average scale of the targets.
[0120] An enhanced CBS module is adopted, where CBS denotes a convolutional module consisting of a 3×3 convolutional layer, batch normalization, and the SiLU activation function. The enhanced module adds a 1×1 channel-adjustment convolution on top of CBS.
[0121] Constraint layers are used to restrict intermediate network layers, forcing the model to learn a deep representation of the data. A "channel compression-restoration-residual fusion" structure is employed, as shown in the following formula:
[0122]
[0123] CBS1 halves the number of input channels through a 1×1 convolution, while CBS2 restores the original number of channels through a 1×1 convolution. The residual connection preserves the original features.
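One reading of the "channel compression-restoration-residual fusion" structure is output = input + CBS2(CBS1(input)). The sketch below follows that reading and is an assumption rather than the patent's formula. A feature map is represented as a list of pixels, each a list of channel values, so a 1×1 convolution becomes a per-pixel matrix product; batch normalization is omitted for brevity, and all function names are hypothetical.

```python
import math

def silu(x):
    """SiLU activation: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))

def pointwise_conv(pixels, weight):
    """1x1 convolution: apply the same (out x in) matrix to every pixel."""
    return [[sum(w * x for w, x in zip(row, pixel)) for row in weight]
            for pixel in pixels]

def cbs_1x1(pixels, weight):
    """1x1 convolution followed by SiLU (batch normalization omitted)."""
    return [[silu(v) for v in pixel] for pixel in pointwise_conv(pixels, weight)]

def compress_restore_residual(pixels, w_compress, w_restore):
    """Compress C -> C/2, restore C/2 -> C, then add the input back."""
    compressed = cbs_1x1(pixels, w_compress)
    restored = cbs_1x1(compressed, w_restore)
    return [[x + r for x, r in zip(pixel, rest)]
            for pixel, rest in zip(pixels, restored)]

features = [[1.0, 2.0, 3.0, 4.0], [0.5, -1.0, 0.0, 2.0]]    # 2 pixels, C = 4
w1 = [[0.1, 0.2, 0.3, 0.4], [0.0, 1.0, 0.0, -1.0]]          # 4 -> 2 channels
w2 = [[0.5, 0.0], [0.0, 0.5], [0.5, 0.5], [-0.5, 0.5]]      # 2 -> 4 channels
fused = compress_restore_residual(features, w1, w2)
```

The half-width bottleneck is what constrains the intermediate representation, while the residual term keeps the original features available to later layers.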
[0124] The feature attention encoder module has a dual-branch structure, which cross-integrates feature information from normal samples and pseudo-abnormal samples.
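A minimal sketch of the dual-branch cross-fusion follows. The exact wiring of the encoder is not given in this passage, so the sketch assumes single-head scaled dot-product attention without learned projections, in which each branch queries the other branch and then concatenates its own features with the attended features. The names `attend` and `cross_fuse` are hypothetical, and the multilayer perceptron stage is left out.

```python
import math

def attend(queries, keys, values):
    """Scaled dot-product attention for lists of feature vectors."""
    dim = len(queries[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        outputs.append([sum(e / total * v[d] for e, v in zip(exps, values))
                        for d in range(len(values[0]))])
    return outputs

def cross_fuse(f_n, f_s):
    """Each branch attends to the other, then concatenates own and attended features."""
    n_from_s = attend(f_n, f_s, f_s)  # normal branch queries pseudo-abnormal branch
    s_from_n = attend(f_s, f_n, f_n)  # pseudo-abnormal branch queries normal branch
    fused_n = [own + other for own, other in zip(f_n, n_from_s)]
    fused_s = [own + other for own, other in zip(f_s, s_from_n)]
    return fused_n, fused_s

f_n = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # normal-sample tokens
f_s = [[1.0, 2.0], [0.0, 1.0]]              # pseudo-abnormal-sample tokens
fused_n, fused_s = cross_fuse(f_n, f_s)
```

Concatenation doubles the feature width of each token, so a following projection or multilayer perceptron would normally map the fused tokens back to the working dimension.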
[0125] After the sparse neighborhood attention is obtained, the feature vector passes sequentially through a residual connection, normalization, a feedforward neural network and a multilayer perceptron, which completes one decoding layer. The decoder finally outputs the decoded reconstructed feature vector, which is compared with the input of the feature attention encoder to compute the reconstruction error; the resulting loss value supervises training.
[0126] Furthermore, the calculated loss includes: localization loss and reconstruction loss.
[0127] The localization loss is used to measure the spatial difference between the predicted bounding box and the ground truth bounding box, and it employs a combination of four IoU variants:
[0128]
[0129] where the weight of the intersection over union (IoU) term is 0.5; the weight of the generalized intersection over union (GIoU) term is 0.3; the weight of the distance intersection over union (DIoU) term is 0.3; and the weight of the complete intersection over union (CIoU) term is 0.1;
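The weighted combination can be sketched with the standard definitions of the four measures. The weights are those stated above; taking each term as one minus the corresponding measure, the corner format (x1, y1, x2, y2) for boxes, and the function names are assumptions made for illustration.

```python
import math

WEIGHTS = {"iou": 0.5, "giou": 0.3, "diou": 0.3, "ciou": 0.1}  # as stated in the text

def iou_variants(pred, truth, eps=1e-9):
    """Return IoU, GIoU, DIoU and CIoU for two boxes given as (x1, y1, x2, y2)."""
    px1, py1, px2, py2 = pred
    tx1, ty1, tx2, ty2 = truth
    inter_w = max(0.0, min(px2, tx2) - max(px1, tx1))
    inter_h = max(0.0, min(py2, ty2) - max(py1, ty1))
    inter = inter_w * inter_h
    union = (px2 - px1) * (py2 - py1) + (tx2 - tx1) * (ty2 - ty1) - inter
    iou = inter / (union + eps)

    # Smallest box enclosing both boxes.
    cw = max(px2, tx2) - min(px1, tx1)
    ch = max(py2, ty2) - min(py1, ty1)
    enclose = cw * ch
    giou = iou - (enclose - union) / (enclose + eps)

    # Squared center distance over squared enclosing-box diagonal.
    center2 = ((px1 + px2) / 2 - (tx1 + tx2) / 2) ** 2 \
        + ((py1 + py2) / 2 - (ty1 + ty2) / 2) ** 2
    diou = iou - center2 / (cw ** 2 + ch ** 2 + eps)

    # Aspect-ratio consistency term used by CIoU.
    v = (4 / math.pi ** 2) * (math.atan((tx2 - tx1) / (ty2 - ty1 + eps))
                              - math.atan((px2 - px1) / (py2 - py1 + eps))) ** 2
    alpha = v / (1 - iou + v + eps)
    ciou = diou - alpha * v
    return {"iou": iou, "giou": giou, "diou": diou, "ciou": ciou}

def localization_loss(pred, truth):
    """Weighted sum of (1 - measure) over the four IoU variants."""
    scores = iou_variants(pred, truth)
    return sum(WEIGHTS[name] * (1.0 - scores[name]) for name in WEIGHTS)

loss = localization_loss((0, 0, 2, 2), (1, 1, 3, 3))
```

Unlike plain IoU, the GIoU, DIoU and CIoU terms still change with box position when the boxes do not overlap, which is why they help with the bounding box offset problem for small targets.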
[0130] The reconstruction loss measures the difference between the final reconstructed feature sequence and the input of the feature attention encoder, using the mean squared logarithmic error as the loss function:
[0131]
[0132] where $S_r$ represents the final reconstructed feature sequence, $S_n$ represents the input of the feature attention encoder for normal samples, and N is the number of samples.
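The mean squared logarithmic error can be sketched with its standard definition, the average of (log(1 + S_r) - log(1 + S_n))² over N values. The sketch assumes flattened, non-negative feature values, and the function name `msle` is hypothetical.

```python
import math

def msle(reconstructed, reference):
    """Mean squared logarithmic error between two equal-length sequences.

    Values are assumed to be greater than -1 so that log(1 + x) is defined.
    """
    if len(reconstructed) != len(reference):
        raise ValueError("sequences must have the same length")
    n = len(reconstructed)
    return sum((math.log1p(r) - math.log1p(s)) ** 2
               for r, s in zip(reconstructed, reference)) / n

s_r = [0.9, 2.1, 0.0, 4.8]  # reconstructed features (illustrative values)
s_n = [1.0, 2.0, 0.0, 5.0]  # encoder input for normal samples (illustrative values)
reconstruction_loss = msle(s_r, s_n)
```

Because the error is taken on a logarithmic scale, large feature values contribute less than they would under a plain mean squared error, which is consistent with the stated aim of improving training stability.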
[0133] It should be noted that the device description in the above embodiment corresponds to the method description in the embodiments and is not repeated here.
[0134] The processor and memory may be implemented by any device with computing capability, such as a computer, a microcontroller, or a single-chip microcomputer. The embodiments of the present invention do not limit the executing device, which can be selected according to the needs of the actual application.
[0135] Data signals are transmitted between the memory and the processor via a bus, which will not be elaborated upon in this embodiment of the invention.
[0136] Based on the same inventive concept, embodiments of the present invention also provide a computer-readable storage medium that includes a stored program; when the program runs, it controls the device in which the storage medium is located to execute the method steps in the above embodiments.
[0137] The computer-readable storage medium includes, but is not limited to, flash memory, hard disk, solid-state drive, etc.
[0138] It should be noted that the description of the readable storage medium in the above embodiments corresponds to the description of the method in the embodiments and is not repeated here.
[0139] In the above embodiments, implementation can be achieved, in whole or in part, through software, hardware, firmware, or any combination thereof. When implemented in software, it can be implemented, in whole or in part, as a computer program product. A computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, all or part of the flow or function according to the embodiments of the present invention is generated.
[0140] A computer can be a general-purpose computer, a special-purpose computer, a computer network, or another programmable device. Computer instructions can be stored in or transmitted through a computer-readable storage medium. A computer-readable storage medium can be any available medium accessible to a computer, or a data storage device such as a server or data center that integrates one or more available media. The available medium can be a magnetic medium, a semiconductor medium, or the like.
[0141] Unless otherwise specified, the model numbers of the various devices in this embodiment of the invention are not limited, and any device that can perform the above functions is acceptable.
[0142] Those skilled in the art will understand that the accompanying drawings are merely schematic diagrams of a preferred embodiment, and the sequence numbers of the above embodiments of the present invention are for descriptive purposes only and do not represent the superiority or inferiority of the embodiments.
[0143] The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of the present invention should be included within the protection scope of the present invention.
Claims
1. A lightweight target detection method for UAVs based on adaptive sparse attention, characterized in that the method includes: an image-level object detection model is constructed, consisting of a lightweight feature extraction backbone network, a scale-adaptive feature fusion module, a feature attention encoder module, and a sparse attention decoder module; features of the image samples in the training dataset are extracted using the lightweight feature extraction backbone network; pseudo-abnormal samples are input into the lightweight feature extraction backbone network, which outputs the multi-scale features of the pseudo-abnormal samples; the scale-adaptive feature fusion module processes features at different scales differently during the fusion process, ultimately obtaining the feature maps $f_n$ and $f_s$ corresponding to normal samples and pseudo-abnormal samples, respectively; the feature attention encoder module utilizes a cross-attention mechanism to cross-fuse the information from the two feature maps $f_n$ and $f_s$, obtaining the cross-fused features; the sparse attention decoder module decodes the cross-fused features, generates reconstructed features, calculates the loss between the reconstructed features and the input of the feature attention encoder module, and obtains the final image-level object detection result.
2. The lightweight target detection method for UAVs based on adaptive sparse attention according to claim 1, characterized in that the method further includes: constructing an image-level target detection model using multiple loss function constraints, and detecting UAV images based on the constrained image-level target detection model.
3. The lightweight target detection method for UAVs based on adaptive sparse attention according to claim 1, characterized in that the loss is calculated between the output of the sparse attention decoder module and the input of the feature attention encoder module to obtain the final image-level object detection result.
4. The lightweight target detection method for UAVs based on adaptive sparse attention according to claim 1, characterized in that the method generates pseudo-abnormal samples through adaptive patch generation, adaptive pixel jitter and rotation, and adaptive patch pasting.
5. The lightweight target detection method for UAVs based on adaptive sparse attention according to claim 1, characterized in that the scale-adaptive feature fusion module includes: differentiated feature processing, in which dilated convolution with an adaptive dilation rate is used and the dilation rate is adjusted according to the average scale of the targets; an enhanced CBS module, where CBS denotes a convolutional module consisting of a 3×3 convolutional layer, batch normalization, and the SiLU activation function, and the enhanced module adds a 1×1 channel-adjustment convolution on top of CBS; and constraint layers that restrict intermediate network layers, forcing the model to learn deep representations of the data. A "channel compression-restoration-residual fusion" structure is employed, as shown in the following formula: ; CBS1 halves the number of input channels through a 1×1 convolution, while CBS2 restores the original number of channels through a 1×1 convolution. The residual connection preserves the original features.
6. The lightweight target detection method for UAVs based on adaptive sparse attention according to claim 1, characterized in that the feature attention encoder module has a dual-branch structure, which cross-integrates feature information from normal samples and pseudo-abnormal samples.
7. The lightweight target detection method for UAVs based on adaptive sparse attention according to claim 1, characterized in that, after the sparse neighborhood attention is obtained, the feature vector passes sequentially through a residual connection, normalization, a feedforward neural network and a multilayer perceptron, which completes one decoding layer; the decoder finally outputs the decoded reconstructed feature vector, which is compared with the input of the feature attention encoder to compute the reconstruction error, and the resulting loss value supervises training.
8. The lightweight target detection method for UAVs based on adaptive sparse attention according to claim 1, characterized in that the calculated loss includes a localization loss and a reconstruction loss. The localization loss is used to measure the spatial difference between the predicted bounding box and the ground truth bounding box, and it employs a combination of four IoU variants: ; where the weight of the intersection over union (IoU) term is 0.5; the weight of the generalized intersection over union (GIoU) term is 0.3; the weight of the distance intersection over union (DIoU) term is 0.3; and the weight of the complete intersection over union (CIoU) term is 0.1. The reconstruction loss measures the difference between the final reconstructed feature sequence and the input of the feature attention encoder, using the mean squared logarithmic error as the loss function: ; where $S_r$ represents the final reconstructed feature sequence, $S_n$ represents the input of the feature attention encoder for normal samples, and N is the number of samples.
9. A lightweight target detection device for unmanned aerial vehicles based on adaptive sparse attention, characterized in that the device includes a processor and a memory, the memory storing program instructions, the processor invoking the program instructions stored in the memory to cause the device to perform the method according to any one of claims 1-8.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program, the computer program including program instructions that, when executed by a processor, cause the processor to perform the method described in any one of claims 1-8.