A spatial perception and component attention fusion unmanned aerial vehicle and bird identification method

CN122244592A · Pending · Publication Date: 2026-06-19 · CIVIL AVIATION FLIGHT UNIV OF CHINA

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Application (China)
Current Assignee / Owner: CIVIL AVIATION FLIGHT UNIV OF CHINA
Filing Date: 2026-03-26
Publication Date: 2026-06-19

Application Information

IPC: G06V10/774; G06V10/764; G06V10/82; G06V10/80; G06V10/44; G06V20/70; G06V10/72; G06V10/70; G06V10/52; G06V10/766; G06N3/045; G06N3/048

Abstract

This invention relates to the field of infrared target detection and discloses a method for identifying drones and birds by fusing spatial perception and component attention. By introducing spatial perception, the global contextual information of infrared images can be fully utilized, enabling the model to stably capture the thermal radiation features of drones and birds even in complex backgrounds, significantly improving target positioning accuracy. By introducing component spatial attention, the thermal radiation distribution features of key differentiated components such as drone propellers and bird wings are specifically focused, automatically enhancing the unique feature expression of the two types of targets and suppressing interference from similar backgrounds and redundant features. This fundamentally reduces the feature coupling between drones and birds, greatly improving the model's ability to distinguish between the two and effectively reducing the false recognition rate.

Description

Technical Field

[0001] This invention relates to the field of infrared target detection technology, and in particular to a method for identifying drones and birds by fusing spatial perception and component attention.

Background Technology

[0002] With the rapid popularization of drone technology and the increasing demand for low-altitude flight control, unauthorized drone intrusions have posed a significant threat to airport airspace security. Drones and birds are both typical low-altitude, slow-moving, and small flying targets, making them easily confused in airspace monitoring. Failure to accurately distinguish between the two can easily lead to multiple safety hazards, including flight misjudgments, excessively high false alarm rates, and even inadequate emergency response. Infrared thermal imaging technology does not rely on visible light environments and can stably capture the thermal radiation characteristics of targets under low-light conditions such as nighttime, fog, haze, and backlighting, effectively compensating for the limitations of visible light sensors. It has now become a core technology for all-weather detection of drones and birds in airport airspace.

[0003] Currently, deep learning-based target detection algorithms are the mainstream technology for infrared image target recognition. Among them, the YOLO series models are widely used in real-time low-altitude target detection scenarios at airports due to their advantages such as fast detection speed, high recognition accuracy, and lightweight design. Existing research often combines the YOLO model with multi-scale feature fusion, target tracking, and filtering / denoising techniques to improve target detection performance in infrared scenes.

[0004] However, despite the progress made in infrared target detection, existing technologies still have significant shortcomings in drone and bird recognition tasks in airport scenarios. First, infrared images generally suffer from low contrast, a lack of texture information, and strong interference from background heat sources. As small-pixel heat source targets, drones and birds are easily lost during the repeated downsampling in the backbone network, resulting in insufficient target localization accuracy and low overall detection accuracy across multiple intersection-over-union (IoU) thresholds. Second, both drones and birds appear as bright heat sources in infrared images, and their overall thermal radiation characteristics are highly coupled. Existing attention mechanisms have no dedicated perception structure for the differing thermal distributions of key components such as propellers and wings, so models discriminate weakly between the two types of targets and produce a high false positive rate. Third, existing improvement methods struggle to balance detection accuracy and model lightweighting. High-precision schemes have too many parameters and too heavy a computational load for edge deployment, while lightweight schemes suffer from high false negative rates and insufficient recall in small target detection.

Summary of the Invention

[0005] The purpose of this invention is to improve the existing identification methods, which suffer from problems such as the easy loss of fine features of small infrared targets leading to insufficient positioning accuracy, weak discrimination ability due to the coupling of thermal radiation features of UAVs and birds, and difficulty in balancing detection accuracy and model lightweighting. The invention provides a UAV and bird identification method that integrates spatial perception and component attention.

[0006] To achieve the above-mentioned objectives, the embodiments of the present invention provide the following technical solutions:

[0007] A method for identifying drones and birds by fusing spatial perception and component attention includes the following steps:

[0008] Collect bird and drone data, place the data in a variety of environmental scenarios, label the suspected drone and bird regions in those scenarios, build a collected dataset from the labeled scenarios, and apply preprocessing and augmentation to the collected dataset to obtain the training dataset;

[0009] Based on the improved backbone network, neck network, and detection head, and combined with spatial perception and component spatial attention, a drone and bird recognition model is constructed.

[0010] The drone and bird recognition model is trained by inputting the training dataset, and the detection model parameters of the drone and bird recognition model are iteratively adjusted by using the loss function and training strategy.

[0011] The trained drone and bird recognition model is used to detect drones and birds in real-time images, and the detection results of drone and bird recognition are finally output.

[0012] To address the technical problems of low contrast and lack of texture in infrared images, and the easy loss of fine features of drones and birds as small-pixel heat source targets, resulting in insufficient positioning accuracy, this invention optimizes the backbone network structure to enhance the extraction of thermal radiation features of small-pixel targets. At the same time, it introduces a spatial attention mechanism to focus on the target area, reduce feature loss, accurately capture the thermal radiation details of small targets, improve detection accuracy, solve the problems of feature loss and inaccurate positioning of small targets, and meet high positioning requirements.

[0013] To address the technical problem that existing attention mechanisms lack specificity and are difficult to accurately distinguish between the two types of targets due to the easily confused thermal radiation characteristics of drones and birds, this invention optimizes the feature extraction logic, focuses on the core feature differences between drones and birds, strengthens the recognition capability of exclusive features, simplifies redundant calculations, and achieves accurate matching of features and category labels. This effectively solves the problems of confusion and high misjudgment rate between the two types of targets, and improves the distinguishability.

[0014] To address the technical challenges of large model parameters, difficult edge deployment, and low recall and high false negative rates in small target detection, this invention incorporates spatial perception to create a lightweight backbone network design. This simplifies redundant parameters, optimizes the feature extraction process, and reduces model complexity while maintaining detection accuracy. Furthermore, it embeds component spatial attention to enhance the model's ability to distinguish between drones and birds, reducing the false positive rate. The spatial perception and component spatial attention achieve a balance between lightweight design and high accuracy, meeting edge deployment requirements while simultaneously improving recall and reducing false negatives in small target detection.

[0015] Compared with existing technologies, the beneficial effects of this invention are as follows: The introduction of spatial awareness into the network structure can fully utilize the global contextual information of infrared images, enabling the model to stably capture the thermal radiation features of drones and birds even in complex backgrounds, significantly improving target positioning accuracy; The introduction of component spatial attention into the network structure focuses on the thermal radiation distribution features of key differentiated components of drone propellers and bird wings, automatically strengthening the unique feature expression of the two types of targets and suppressing interference from similar backgrounds and redundant features, thereby reducing the feature coupling between drones and birds from the root, greatly improving the model's ability to distinguish between the two, and effectively reducing the false recognition rate.

[0016] Furthermore, a method for identifying drones and birds by fusing spatial perception and component attention, wherein the improved backbone network includes a Conv layer, an SP module, a CSA module, and an SPPF layer; the Conv layer includes a Conv1 layer, a Conv2 layer, a Conv3 layer, a Conv4 layer, and a Conv5 layer; the SP module includes an SP1 module, an SP2 module, an SP3 module, and an SP4 module; the Conv1 layer, Conv2 layer, SP1 module, Conv3 layer, SP2 module, CSA module, Conv4 layer, SP3 module, Conv5 layer, SP4 module, and SPPF layer are connected sequentially.
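
The sequential connection described in [0016] can be written out as an ordered list. The following is only an illustrative sketch: the names `BACKBONE_ORDER` and `stage_index` are hypothetical, and no channel widths, strides, or kernel sizes are implied, because the text does not state them.

```python
# Layer order of the improved backbone as listed in [0016].
# Layer names come from the text; the list and helper are illustrative only.
BACKBONE_ORDER = [
    "Conv1", "Conv2", "SP1",
    "Conv3", "SP2", "CSA",
    "Conv4", "SP3",
    "Conv5", "SP4",
    "SPPF",
]


def stage_index(name: str) -> int:
    """Return the zero-based position of a layer in the backbone."""
    return BACKBONE_ORDER.index(name)


if __name__ == "__main__":
    # The single CSA module sits directly after SP2 and before Conv4.
    print(" -> ".join(BACKBONE_ORDER))
    print("CSA position:", stage_index("CSA"))
```

Written this way, it is easy to see that the CSA module is applied once, at an intermediate stage, while an SP module follows every Conv layer from Conv2 onward.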

[0017] In the aforementioned scheme, traditional backbone networks suffer from technical problems such as easy feature loss, insufficient extraction of component-specific features, and high computational redundancy in the detection of small targets like infrared UAVs and birds. This invention addresses these issues by constructing a lightweight backbone network architecture that combines spatial perception and component attention. It employs multi-stage Conv layers to progressively extract infrared thermal radiation features and compress channel dimensions. A layered SP module is introduced to achieve complementary fusion of spatial and semantic information, enhancing the feature retention capability for small-pixel heat source targets and alleviating the problem of fine feature loss during downsampling. A CSA module is embedded to adaptively focus on key differentiated thermal radiation regions of UAV propellers and bird wings, strengthening the feature discrimination between the two types of targets. In conjunction with the SPPF layer, multi-scale feature pooling and fusion are completed, ensuring the ability to express the features of small infrared targets while simplifying redundant computations, achieving an effective balance between detection accuracy and model lightweighting.

[0018] Furthermore, a method for identifying drones and birds by fusing spatial perception and component attention, wherein the SP module includes a Split module, a ConvS layer, a DWConv layer, an AdaptiveAvgPool layer, and a BN layer; the ConvS layer includes ConvS1 layer, ConvS2 layer, ConvS3 layer, ConvS4 layer, ConvS5 layer, and ConvS6 layer; the processing procedure of the SP module is as follows:

[0019] The feature map X is input to the Split module, which splits it into a left branch and a right branch;

[0020] The left branch passes sequentially through the ConvS1 and ConvS2 layers to extract the detailed spatial features of small targets, and these features are then convolved by the ConvS3 layer to obtain the left-branch intermediate feature map;

[0021] The left-branch intermediate feature map undergoes depthwise separable convolution in the DWConv layer; the result is adaptively average-pooled by the AdaptiveAvgPool layer to aggregate global contextual information, and the pooled features are passed through an activation function to obtain the semantic information weights;

[0022] The right branch undergoes preliminary channel-dimension mapping in the ConvS4 layer to obtain the right-branch intermediate feature map;

[0023] The right-branch intermediate feature map is refined by the ConvS5 layer, and the refined features are batch-normalized by the BN layer to obtain the spatial information weights;

[0024] One intermediate feature map is multiplied element-wise by the spatial information weights and the other by the semantic information weights; the two products are concatenated, and the concatenated features are convolved by the ConvS6 layer to obtain the fused features.
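
The SP module procedure in [0019] to [0024] can be sketched in NumPy as follows. This is a rough illustration under several assumptions that the text does not confirm: the split is an even split along the channel axis, every ConvS layer is reduced to a bare 1x1 convolution, DWConv is a 3x3 depthwise convolution, the pooling target is 1x1, the activation is a sigmoid, and the left-branch map is paired with the spatial weights while the right-branch map is paired with the semantic weights. All function names and the parameter dictionary are hypothetical.

```python
import numpy as np


def conv1x1(x, w):
    """Pointwise convolution; x is (B, C_in, H, W), w is (C_out, C_in)."""
    return np.einsum("oc,bchw->bohw", w, x)


def depthwise3x3(x, k):
    """3x3 depthwise convolution with zero padding; k is (C, 3, 3)."""
    _, c, h, w = x.shape
    xp = np.pad(x, ((0, 0), (0, 0), (1, 1), (1, 1)))
    out = np.zeros_like(x)
    for i in range(3):
        for j in range(3):
            out += xp[:, :, i:i + h, j:j + w] * k[:, i, j].reshape(1, c, 1, 1)
    return out


def batch_norm(x, eps=1e-5):
    """Per-channel normalisation with batch statistics (no learned scale/shift)."""
    mean = x.mean(axis=(0, 2, 3), keepdims=True)
    var = x.var(axis=(0, 2, 3), keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def init_params(c, c_out, rng):
    """Random weights for a split width c (hypothetical helper)."""
    p = {f"ConvS{i}": rng.standard_normal((c, c)) * 0.1 for i in range(1, 6)}
    p["DWConv"] = rng.standard_normal((c, 3, 3)) * 0.1
    p["ConvS6"] = rng.standard_normal((c_out, 2 * c)) * 0.1
    return p


def sp_module(x, p):
    """Forward pass of the SP sketch; p maps layer names to weights."""
    half = x.shape[1] // 2
    left, right = x[:, :half], x[:, half:]                     # Split (assumed even)
    m_left = conv1x1(conv1x1(conv1x1(left, p["ConvS1"]), p["ConvS2"]), p["ConvS3"])
    pooled = depthwise3x3(m_left, p["DWConv"]).mean(axis=(2, 3), keepdims=True)
    w_sem = sigmoid(pooled)                                    # semantic weights (B, c, 1, 1)
    m_right = conv1x1(right, p["ConvS4"])
    w_spa = batch_norm(conv1x1(m_right, p["ConvS5"]))          # spatial weights (B, c, H, W)
    # Assumed pairing: left map with spatial weights, right map with semantic weights.
    fused = np.concatenate([m_left * w_spa, m_right * w_sem], axis=1)
    return conv1x1(fused, p["ConvS6"])                         # fused features


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.standard_normal((2, 8, 6, 6))       # illustrative sizes only
    y = sp_module(x, init_params(c=4, c_out=8, rng=rng))
    print(y.shape)
```

The sketch keeps the spatial resolution unchanged, which matches the stated goal of retaining fine features of small targets; a real implementation would add the normalisation and activation normally bundled into each Conv layer.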

[0025] In the above scheme, the traditional feature extraction process suffers from technical problems such as unclear feature division, inaccurate fusion of spatial details and global semantics, and small target features being easily obscured by redundant information. This invention achieves efficient extraction and accurate fusion of infrared UAV and bird features through a standardized and streamlined SP module processing procedure: First, the input feature map is split into left and right branches using the Split module, clarifying the core division of labor: the left branch focuses on the spatial features of small target details, and the right branch focuses on spatial information mapping, avoiding mutual interference in feature extraction; the left branch gradually mines the thermal radiation features of small targets such as UAV propellers and bird wings through ConvS1 and ConvS2 layers, and after optimizing the feature dimension through ConvS3 layer, the feature discrimination capability is enhanced by DWConv layer and the global context information is aggregated by AdaptiveAvgPool layer. Then, precise semantic information weights are generated through activation functions to achieve effective quantification of global semantic features. The right branch completes the initial mapping of feature channels through ConvS4 layer, optimizes feature expression through ConvS5 layer, and eliminates gradient offset through BN layer to generate targeted spatial information weights, accurately capturing the spatial location features of the target. Through element-wise multiplication of the features of the two branches with the corresponding weights, the precise matching of detailed features with spatial and semantic weights is achieved. After feature concatenation and dimensional integration through ConvS6 layer, a fused feature with both spatial details and global semantics is finally obtained. 
This effectively preserves the fine thermal radiation features of small targets, alleviates the problem of easy loss of infrared small target features, and improves the targeting and efficiency of feature fusion. This lays a solid foundation for the subsequent component attention module to accurately distinguish between drones and birds, and further improves the recognition accuracy and positioning stability of the model.

[0026] Furthermore, a method for identifying drones and birds by fusing spatial perception and component attention, wherein the CSA module includes a drone propeller branch, an aspect ratio attention weight branch, and a bird wing detection branch; the processing procedure of the CSA module is as follows:

[0027] Input the feature map Y to the CSA module, and the CSA module will input the feature map Y to the UAV propeller branch, the aspect ratio attention weight branch, and the bird wing detection branch respectively;

[0028] In the UAV propeller branch, feature map Y is used to generate a propeller attention map by focusing on the low-brightness, symmetrical, elliptical thermal distribution pattern;

[0029] In the bird wing detection branch, feature map Y is used to generate a wing attention map by focusing on the asymmetric, strip-shaped thermal distribution pattern;

[0030] In the aspect ratio attention weight branch, feature map Y passes through a chain of global feature aggregation, dimension transformation, and category score normalization to generate the aspect ratio attention weights, which consist of a drone tendency weight and a bird tendency weight;

[0031] The propeller attention map is multiplied element-wise by the drone tendency weight, and the wing attention map is multiplied element-wise by the bird tendency weight. The two products are then summed to obtain the enhanced features.
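
The weighted sum in [0031] can be illustrated with a few lines of NumPy. This is a minimal sketch: the function name `fuse_component_attention`, the array shapes, and the column order of the weights are assumptions made for illustration.

```python
import numpy as np


def fuse_component_attention(a_prop, a_wing, weights):
    """Combine the two attention maps with per-sample tendency weights.

    a_prop, a_wing: attention maps of shape (B, C, H, W).
    weights: array of shape (B, 2); column 0 is taken as the drone tendency
             weight and column 1 as the bird tendency weight (assumed order).
    """
    w_uav = weights[:, 0].reshape(-1, 1, 1, 1)    # broadcast over C, H, W
    w_bird = weights[:, 1].reshape(-1, 1, 1, 1)
    return a_prop * w_uav + a_wing * w_bird


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    a_prop = rng.random((2, 1, 4, 4))
    a_wing = rng.random((2, 1, 4, 4))
    # Sample 0 leans towards "drone", sample 1 towards "bird".
    weights = np.array([[0.9, 0.1], [0.2, 0.8]])
    enhanced = fuse_component_attention(a_prop, a_wing, weights)
    print(enhanced.shape)
```

Because the two weights of each sample sum to one after Softmax, the result is a convex combination of the two attention maps: a sample scored as drone-like is dominated by the propeller map, and a bird-like sample by the wing map.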

[0032] Furthermore, a method for identifying drones and birds by fusing spatial perception and component attention is provided. The drone propeller branch includes ConvC1, BN1, and ConvC2 layers; the aspect ratio attention weight branch includes AdaptiveAvgPool, Flatten, Linear1, and Linear2 layers; and the bird wing detection branch includes ConvC3, BN2, and ConvC4 layers. The ConvC1, BN1, and ConvC2 layers are connected sequentially, the AdaptiveAvgPool, Flatten, Linear1, and Linear2 layers are connected sequentially, and the ConvC3, BN2, and ConvC4 layers are connected sequentially.

[0033] In the above solutions, traditional attention modules applied to drone and bird identification capture component-level differences inaccurately, lack targeted attention guidance, and cannot effectively distinguish the core components of the two types of targets. This invention addresses these issues with a multi-branch collaborative CSA module that extracts and weights the key component features of both drones and birds. A dedicated drone propeller branch extracts, normalizes, and refines the thermal radiation features of the propeller through the ConvC1, BN1, and ConvC2 layers in sequence, capturing the rigid, symmetrical thermal distribution of the drone propeller. An aspect ratio attention weight branch aggregates global context information through the AdaptiveAvgPool layer, flattens the features through the Flatten layer, reduces their dimension through the Linear1 layer, and maps them through the Linear2 layer to generate aspect ratio attention weights, which assign targeted attention weights to the component features of the two types of targets and strengthen the expression of key regions. A bird wing detection branch progressively extracts the flexible, asymmetric thermal radiation features of bird wings through the ConvC3, BN2, and ConvC4 layers. The three branches work together: the two component branches specialize in drone propellers and bird wings, and the aspect ratio attention weights enhance the corresponding component features. This reduces the coupling of thermal radiation features between the two types of targets, improves the model's ability to distinguish drones from birds, lowers the false positive rate, avoids redundant computation, balances feature extraction accuracy against model efficiency, and suits infrared small target detection scenarios.

[0034] Furthermore, in a method for identifying drones and birds by fusing spatial perception and component attention, generating the propeller attention map from feature map Y in the UAV propeller branch, by focusing on the low-brightness, symmetrical, elliptical thermal distribution pattern, includes the following sub-steps:

[0035] The feature map Y is input to the ConvC1 layer to extract the attention features of the thermal radiation region. The extracted features are batch-normalized by the BN1 layer and activated by the ReLU function. The activated features are then passed through the ConvC2 layer to strengthen the low-brightness, symmetrical, elliptical thermal distribution, and the result is activated by the Sigmoid function to obtain the propeller attention map.
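
The chain ConvC1, BN1, ReLU, ConvC2, Sigmoid from [0035] can be sketched as below. Kernel sizes and channel widths are not given in the text, so this sketch assumes 1x1 convolutions, a single-channel output map, and batch normalisation without learned scale or shift; all function names are hypothetical.

```python
import numpy as np


def conv1x1(x, weight, bias):
    """Pointwise convolution: x is (B, C_in, H, W), weight is (C_out, C_in)."""
    return np.einsum("oc,bchw->bohw", weight, x) + bias.reshape(1, -1, 1, 1)


def batch_norm(x, eps=1e-5):
    """Normalise each channel with the statistics of the current batch."""
    mean = x.mean(axis=(0, 2, 3), keepdims=True)
    var = x.var(axis=(0, 2, 3), keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def propeller_attention(y, w_c1, b_c1, w_c2, b_c2):
    """ConvC1 -> BN1 -> ReLU -> ConvC2 -> Sigmoid, as listed in [0035]."""
    h = conv1x1(y, w_c1, b_c1)          # ConvC1: extract thermal-region features
    h = batch_norm(h)                   # BN1
    h = np.maximum(h, 0.0)              # ReLU
    h = conv1x1(h, w_c2, b_c2)          # ConvC2: refine the response
    return 1.0 / (1.0 + np.exp(-h))     # Sigmoid squashes the map into (0, 1)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    y = rng.standard_normal((2, 8, 5, 5))             # illustrative sizes only
    w_c1, b_c1 = rng.standard_normal((4, 8)) * 0.1, np.zeros(4)
    w_c2, b_c2 = rng.standard_normal((1, 4)) * 0.1, np.zeros(1)
    attn = propeller_attention(y, w_c1, b_c1, w_c2, b_c2)
    print(attn.shape, float(attn.min()), float(attn.max()))
```

The bird wing detection branch (ConvC3, BN2, ConvC4) has the same layer pattern according to [0032], so the same function could be reused with a separate set of weights.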

[0036] Furthermore, in a method for identifying drones and birds by fusing spatial perception and component attention, generating the aspect ratio attention weights from feature map Y in the aspect ratio attention weight branch, through a chain of global feature aggregation, dimension transformation, and category score normalization, includes the following sub-steps:

[0037] Feature map Y undergoes adaptive average pooling in the AdaptiveAvgPool layer to aggregate global context information. The pooled features are flattened into a one-dimensional vector by the Flatten layer. The one-dimensional vector is reduced in dimension by the Linear1 layer and activated by the ReLU function. The activated vector is then mapped to a two-dimensional vector by the Linear2 layer, and the Softmax function converts this two-dimensional vector into the aspect ratio attention weights.

[0038] Furthermore, in a method for identifying drones and birds by fusing spatial perception and component attention, the processing formula for the aspect ratio attention weight branch is as follows:

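[0039] $$S = \mathcal{L}_2\big(\mathrm{ReLU}\big(\mathcal{L}_1\big(\mathcal{F}(\mathrm{AdaptiveAvgPool}(Y))\big)\big)\big) = [s_{\mathrm{uav}},\ s_{\mathrm{bird}}] \in \mathbb{R}^{B \times 2}, \qquad [w_{\mathrm{uav}},\ w_{\mathrm{bird}}] = \sigma(S);$$

[0040] where $\mathcal{F}$ is the flattening operation of the Flatten layer, $\mathcal{L}_1$ is the dimensionality reduction operation of the Linear1 layer, $\mathcal{L}_2$ is the mapping operation of the Linear2 layer, $\sigma$ is the Softmax function, $w_{\mathrm{uav}}$ is the drone tendency weight, $w_{\mathrm{bird}}$ is the bird tendency weight, $S$ is the raw score vector output by the Linear2 layer, $B$ is the batch size, $s_{\mathrm{uav}}$ is the raw drone tendency score, $s_{\mathrm{bird}}$ is the raw bird tendency score, and $\mathbb{R}$ is the set of real numbers.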
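
The weight branch of [0037] (pooling, flattening, Linear1 with ReLU, Linear2, Softmax) can be sketched in NumPy as follows. The sketch assumes a 1x1 pooling target, a hidden width chosen arbitrarily, and that the first output column is the drone tendency weight; the function name and parameter names are hypothetical.

```python
import numpy as np


def aspect_ratio_weights(y, w1, b1, w2, b2):
    """Map feature maps y of shape (B, C, H, W) to per-sample weights (B, 2).

    Column 0 is read as the drone tendency weight and column 1 as the bird
    tendency weight (the column order is an assumption).
    """
    pooled = y.mean(axis=(2, 3), keepdims=True)     # AdaptiveAvgPool to 1x1 (assumed size)
    flat = pooled.reshape(pooled.shape[0], -1)      # Flatten -> (B, C)
    hidden = np.maximum(flat @ w1 + b1, 0.0)        # Linear1 (reduction) + ReLU
    scores = hidden @ w2 + b2                       # Linear2 -> raw scores (B, 2)
    scores = scores - scores.max(axis=1, keepdims=True)   # numerical stability
    exp = np.exp(scores)
    return exp / exp.sum(axis=1, keepdims=True)     # Softmax over the two classes


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    B, C, H, W, hidden_dim = 4, 16, 8, 8, 4         # illustrative sizes, not from the text
    y = rng.standard_normal((B, C, H, W))
    w1 = rng.standard_normal((C, hidden_dim)) * 0.1
    b1 = np.zeros(hidden_dim)
    w2 = rng.standard_normal((hidden_dim, 2)) * 0.1
    b2 = np.zeros(2)
    weights = aspect_ratio_weights(y, w1, b1, w2, b2)
    print(weights.shape)          # (4, 2)
    print(weights.sum(axis=1))    # every row sums to 1
```

Softmax guarantees that the two weights of each sample are positive and sum to one, so they can be read as a soft vote between the drone and bird hypotheses.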

[0041] In the aforementioned schemes, the traditional aspect ratio attention weight generation process suffers from problems such as chaotic steps, inaccurate feature extraction, and a lack of targeted weight generation, making it difficult to accurately adapt to the feature differentiation requirements of drones and birds in infrared scenes. This invention generates aspect ratio attention weights through a standardized process, specifically as follows: First, the feature map is input into the AdaptiveAvgPool layer, where adaptive average pooling fully aggregates the global contextual information of the target, effectively filtering background interference and preserving the target's core thermal radiation features; then, the average-pooled features are input into the Flatten layer, flattening the multi-dimensional features into a one-dimensional vector, facilitating subsequent feature processing and dimensionality compression; subsequently, the one-dimensional vector enters the Linear1 layer for feature dimensionality reduction, removing redundant feature information and reducing model computation; the dimensionality-reduced vector is then activated using the ReLU function. The non-linear expressive power of the features is enhanced to highlight the difference between the target and the background. The activated vector is then transformed into a two-dimensional vector through the Linear2 layer to meet the generation requirements of attention weights. Finally, the two-dimensional vector is normalized by the Softmax function to generate accurate aspect ratio attention weights. These weights can accurately focus on the core area of the target, strengthen the target features, and suppress background interference, providing strong support for subsequent target recognition and differentiation. 
This effectively solves the problems of weak targeting and low feature utilization in the traditional weight generation process, and further improves the model's differentiation accuracy and localization accuracy between drones and birds.

[0042] Furthermore, in a method for identifying drones and birds by fusing spatial perception and component attention, outputting the final drone and bird detection results includes the following sub-steps:

[0043] The improved backbone network extracts features from the input drone and bird images and outputs multi-scale feature maps {C3,C4,C5}.

[0044] The neck network performs top-down and bottom-up multi-scale feature fusion on the input multi-scale feature maps {C3,C4,C5} and outputs the target features {P3,P4,P5}.

[0045] The detection head trains and optimizes the input target features {P3,P4,P5} to build classification and regression branches. The classification branch outputs the probabilities of birds and drones, and the regression branch outputs the bounding box coordinates and confidence scores.

[0046] In the aforementioned scheme, traditional target detection models suffer from inaccurate feature fusion, poor multi-scale feature integration, and insufficient detection accuracy when handling drone and bird identification tasks. This invention addresses these issues by constructing a multi-scale feature extraction and fusion system to achieve accurate identification and localization of drones and birds. First, the improved backbone network extracts multi-dimensional features from the input drone and bird images, capturing the thermal radiation characteristics of the targets and avoiding feature loss. Second, the neck network performs multi-scale fusion of the extracted features, integrating features at different levels so that information is passed efficiently from basic features to core features. Finally, through training and optimization of the detection head, accurate target identification and localization are achieved. In addition, the aspect ratio attention mechanism enhances the feature representation of the target region, addressing the poor feature integration and insufficient recognition accuracy of traditional detection. This supports accurate differentiation between drones and birds, improves detection stability and reliability, and meets the recognition needs of practical applications.

Attached Figure Description

[0047] To more clearly illustrate the technical solutions of the embodiments of the present invention, the accompanying drawings used in the embodiments will be briefly introduced below. It should be understood that the following drawings only show some embodiments of the present invention and should not be regarded as a limitation on the scope. For those skilled in the art, other related drawings can be obtained based on these drawings without creative effort.

[0048] Figure 1 is a flowchart of the drone and bird recognition method that integrates spatial perception and component attention.

[0049] Figure 2 is a schematic diagram of the structure of the drone and bird recognition model.

[0050] Figure 3 is a schematic diagram of the SP module.

[0051] Figure 4 is a schematic diagram of the CSA module.

Detailed Implementation

[0052] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. The components of the embodiments of the present invention described and shown in the accompanying drawings can generally be arranged and designed in various different configurations. Therefore, the following detailed description of the embodiments of the present invention provided in the accompanying drawings is not intended to limit the scope of the claimed invention, but merely to illustrate selected embodiments of the invention. All other embodiments obtained by those skilled in the art based on the embodiments of the present invention without inventive effort are within the scope of protection of the present invention.

[0053] It should be noted that similar reference numerals and letters in the following figures indicate similar items; therefore, once an item is defined in one figure, it does not need to be further defined and explained in subsequent figures. Furthermore, in the description of this invention, the terms "first," "second," etc., are used only for distinguishing descriptions and should not be construed as indicating or implying relative importance, or suggesting any such actual relationship or order between these entities or operations. Additionally, the terms "connected," "linked," etc., can refer to a direct connection between elements or an indirect connection via other elements.

[0054] It should be noted that those prefixed with Conv are all Conv layers with the same structure, those prefixed with SP are all SP modules with the same structure, those prefixed with Upsample are all Upsample modules with the same structure, those prefixed with Linear are all Linear layers with the same structure, those prefixed with BN are all BN layers with the same structure, those prefixed with Detect are all Detect modules with the same structure, and those prefixed with C2f are all C2f layers with the same structure. Suffixes such as "1, 2, S1, S2, C1, C2, S" are added after the prefix only to distinguish the connection relationships.

[0055] Example 1: A method for identifying drones and birds by fusing spatial perception and component attention.

[0056] This invention is achieved through the following technical solution. As shown in Figure 1, a method for identifying drones and birds by fusing spatial perception and component attention includes the following steps:

[0057] S1: Collect data on birds and drones, introduce the data on birds and drones into various environmental scenarios, label areas in the environmental scenarios that are suspected to be drones and birds, construct a collection dataset through the labeled environmental scenarios, perform preprocessing and enhancement operations on the collection dataset, and obtain a training dataset.

[0058] Specifically, S1 includes the following sub-steps:

[0059] S11: Construct a multi-camera acquisition array to simultaneously acquire bird and drone scenes from multiple angles, including eye-level, top-down, and oblique views, capturing the spatial distribution characteristics of drone propellers and fuselage, and the dynamic flight attitude characteristics of bird wings and torso.

[0060] S12: Set a continuous time window for the data collection points to cover the entire process of birds gliding, flapping their wings, turning, and drones hovering, moving horizontally, and accelerating, ensuring coverage of the temporal characteristics of different flight states;

[0061] S13: Repeatedly collect bird and drone scenes by adjusting different observation distances, angles, and times to establish a variability dataset;

[0062] S14: Apply the variability dataset to diverse environmental scenarios such as forests, urban areas, industrial areas, and open airspace, so that the dataset reflects the visual appearance of birds and drones under different geographical features, meteorological conditions (such as sunny days, cloudy days, and haze), and levels of background complexity (such as complex buildings, dense vegetation, and solid-color skies).

[0063] S15: For typical scenarios that are prone to false alarms in practical applications, difficult negative samples such as kites, balloons, fallen leaves, ribbons, and insect swarms are collected, and visually similar but essentially different interference scenarios are introduced to effectively improve the model's discrimination ability.

[0064] S16: Perform fine-grained bounding box annotations on drone and bird areas in normal environmental scenes, distinguish foreground targets from background environment, and accurately label interference samples in interference environmental scenes.

[0065] S17: Integrate and construct the collection dataset by combining the labeled normal environment scenarios and interference environment scenarios;

[0066] S18: Expand the collected dataset by outward scaling of images, random cropping, horizontal mirroring, and Gaussian noise injection to obtain the training dataset.
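The expansion in S18 can be illustrated with a minimal sketch. It assumes a grayscale image stored as a nested list of floats; the function names (`mirror_horizontal`, `random_crop`, `add_gaussian_noise`, `augment`), the crop size, and the noise standard deviation are hypothetical choices and do not come from this disclosure. Image scaling is omitted for brevity.

```python
import random

def mirror_horizontal(img):
    """Flip each row left-to-right (horizontal mirroring)."""
    return [row[::-1] for row in img]

def random_crop(img, crop_h, crop_w, rng):
    """Cut a crop_h x crop_w window at a random position."""
    h, w = len(img), len(img[0])
    top = rng.randint(0, h - crop_h)
    left = rng.randint(0, w - crop_w)
    return [row[left:left + crop_w] for row in img[top:top + crop_h]]

def add_gaussian_noise(img, sigma, rng):
    """Add zero-mean Gaussian noise and clamp each pixel to [0, 255]."""
    return [[min(255.0, max(0.0, p + rng.gauss(0.0, sigma))) for p in row]
            for row in img]

def augment(img, rng, crop_h=3, crop_w=3, sigma=2.0):
    """Return the original image followed by three augmented variants."""
    return [
        img,
        mirror_horizontal(img),
        random_crop(img, crop_h, crop_w, rng),
        add_gaussian_noise(img, sigma, rng),
    ]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
image = [[float(10 * r + c) for c in range(4)] for r in range(4)]
samples = augment(image, rng)
```

In a detection setting, the bounding-box labels would need the same geometric transform as the image; that bookkeeping is left out of this sketch.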

[0067] S2: Based on the improved backbone network, neck network, and detection head, combined with spatial perception and component spatial attention, a drone and bird recognition model is constructed.

[0068] Specifically, as shown in Figure 2, the drone and bird recognition model includes an improved backbone network, a neck network, and a detection head.

[0069] The improved backbone network includes a Conv layer, a spatial perception module (SP module), a component spatial attention module (CSA module), and an SPPF layer; the Conv layer includes Conv1 layer, Conv2 layer, Conv3 layer, Conv4 layer, and Conv5 layer; the SP module includes SP1 module, SP2 module, SP3 module, and SP4 module.

[0070] It should be noted that the CSA module here is not Coherent Semantic Attention, but rather a component spatial attention layer unique to this invention.

[0071] Specifically, the improved backbone network connection structure is as follows: the Conv1 layer, Conv2 layer, SP1 module, Conv3 layer, SP2 module, CSA module, Conv4 layer, SP3 module, Conv5 layer, SP4 module, and SPPF layer are connected in sequence; the SP2 module outputs a multi-scale feature map C3, the SP3 module outputs a multi-scale feature map C4, and the SPPF layer outputs a multi-scale feature map C5.

[0072] More specifically, the improved backbone network connection structure is as follows: the output of Conv1 layer is connected to the input of Conv2 layer, the output of Conv2 layer is connected to the input of SP1 module, the output of SP1 module is connected to the input of Conv3 layer, the output of Conv3 layer is connected to the input of SP2 module, the first output of SP2 module is connected to the input of CSA module, the output of CSA module is connected to the input of Conv4 layer, the output of Conv4 layer is connected to the input of SP3 module, the first output of SP3 module is connected to the input of Conv5 layer, the output of Conv5 layer is connected to the input of SP4 module, and the output of SP4 module is connected to the input of SPPF layer.

[0073] It is important to note that the input of the CSA module is connected to the output of the SP2 module, rather than other SP modules, because the multi-scale feature map C3 output by the SP2 module has a higher resolution than the multi-scale feature maps C4 and C5, and can more accurately capture the fine-grained thermal radiation features of drone propellers and bird wings.

[0074] As shown in Figure 3, the SP module includes a Split module, a ConvS layer, a DWConv layer, an AdaptiveAvgPool layer, and a BN layer; the ConvS layer includes ConvS1, ConvS2, ConvS3, ConvS4, ConvS5, and ConvS6 layers.

[0075] In existing schemes, traditional feature extraction modules suffer from several technical problems in infrared UAV and bird detection, including insufficient fusion of spatial and semantic information, weak extraction of the detail features of small targets, and low computational efficiency caused by feature redundancy. This invention addresses these issues by constructing a compact and functionally synergistic SP module architecture. The Split module splits the input feature map into two branches, decoupling feature extraction from weight generation. The left branch, composed of the ConvS1, ConvS2, and ConvS3 layers, progressively extracts the spatial details of small targets (UAVs and birds); these are further enhanced by the DWConv layer, and the AdaptiveAvgPool layer then aggregates global semantic information to generate precise semantic information weights. The right branch, consisting of the ConvS4, ConvS5, and BN layers, completes feature channel mapping and optimization to generate targeted spatial information weights. The features of the two branches are multiplied element-wise with their corresponding weights and then concatenated and fused, and the feature dimensions are integrated through the ConvS6 layer. This achieves complementary enhancement of spatial detail features and global semantic features, effectively preserving the fine thermal radiation features of small targets while reducing redundant computation and improving feature extraction efficiency. It also lays the foundation for the subsequent CSA module to accurately capture the differentiated features of components, further improving the recognition accuracy and positioning accuracy for UAVs and birds.

[0076] More specifically, the connection structure of the SP module is as follows:

[0077] The first output of the Split module is connected to the input of the ConvS1 layer. The output of the ConvS1 layer is connected to the input of the ConvS2 layer. The output of the ConvS2 layer is connected to the input of the ConvS3 layer. The first output of the ConvS3 layer is connected to the input of the DWConv layer. The output of the DWConv layer is connected to the input of the AdaptiveAvgPool layer. The second output of the Split module is connected to the input of the ConvS4 layer. The first output of the ConvS4 layer is connected to the input of the ConvS5 layer. The output of the ConvS5 layer is connected to the input of the BN layer. The second output of the ConvS3 layer is connected to the output of the BN layer as branch one. The second output of the ConvS4 layer is connected to the output of the AdaptiveAvgPool layer as branch two. The outputs of branch one and branch two are then connected to the input of the ConvS6 layer.

[0078] As shown in Figure 4, the CSA module includes a drone propeller branch, an aspect ratio attention weight branch, and a bird wing detection branch; the drone propeller branch includes ConvC1 layer, BN1 layer, and ConvC2 layer; the aspect ratio attention weight branch includes AdaptiveAvgPool layer, Flatten layer, Linear1 layer, and Linear2 layer; the bird wing detection branch includes ConvC3 layer, BN2 layer, and ConvC4 layer.

[0079] The neck network includes an Upsample module, a C2f layer, and a Conv layer; the Upsample module includes an Upsample1 module and an Upsample2 module; the C2f layer includes a C2f1 layer, a C2f2 layer, a C2f3 layer, and a C2f4 layer; and the Conv layer includes a Conv6 layer and a Conv7 layer.

[0080] Specifically, the connection structure of the neck network is as follows:

[0081] The output of Upsample1 module and the second output of SP3 module are spliced together and then connected to the input of C2f1 layer. The first output of C2f1 layer is connected to the input of Upsample2 module. The output of Upsample2 module is spliced together with the second output of SP2 module and then connected to the input of C2f2 layer. The first output of C2f2 layer is connected to the detection head.

[0082] The second output of layer C2f2 is connected to the input of layer Conv6. The output of layer Conv6 and the second output of layer C2f1 are spliced together and then connected to the input of layer C2f3. The first output of layer C2f3 is connected to the detection head.

[0083] The second output of layer C2f3 is connected to the input of layer Conv7. The output of layer Conv7 and the output of SPPF are spliced together and then connected to the input of layer C2f4. The output of layer C2f4 is connected to the detection head.

[0084] The detection head includes a Detect1 module, a Detect2 module, and a Detect3 module.

[0085] The first output of layer C2f2 is connected to the input of module Detect1, the first output of layer C2f3 is connected to the input of module Detect2, and the output of layer C2f4 is connected to the input of module Detect3.

[0086] S3: Input the training dataset to train the drone and bird recognition model, and iteratively adjust the detection model parameters of the drone and bird recognition model through loss function and training strategy.

[0087] Specifically, S3 includes the following sub-steps:

[0088] S31: Input the training dataset into the UAV and bird recognition model to perform multi-level feature extraction and complex feature reasoning to obtain detection results;

[0089] S32: By comparing the detection results with the real label data through the loss function, the gradient descent algorithm is used for backpropagation to dynamically adjust the weight parameters of each layer of the drone and bird recognition model.

[0090] S33: Repeatedly train to continuously optimize the parameters of the drone and bird recognition model until the training loss converges to a stable state or reaches the preset maximum number of iterations.

[0091] It should be noted that throughout the training process, overfitting is prevented by monitoring performance on the validation set, ensuring that the final drone and bird recognition model has excellent generalization ability and robustness.
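The stopping logic of S31 to S33, together with the validation monitoring noted above, can be sketched on a toy problem. The one-parameter model y = w·x, the learning rate, the tolerance, and the function name `train` are all assumptions made for illustration; the actual model, loss function, and optimizer settings of the recognition model are not represented here.

```python
def train(train_data, val_data, lr=0.01, max_iters=1000, tol=1e-9):
    """Fit y = w * x by gradient descent; stop on convergence or max_iters."""
    def mse(w, data):
        return sum((w * x - y) ** 2 for x, y in data) / len(data)

    w, prev_loss = 0.0, float("inf")
    best_w, best_val = w, mse(w, val_data)
    iters = 0
    for iters in range(1, max_iters + 1):
        # For this toy model the backpropagated gradient is the analytic MSE gradient.
        grad = sum(2 * (w * x - y) * x for x, y in train_data) / len(train_data)
        w -= lr * grad
        loss = mse(w, train_data)
        val = mse(w, val_data)
        if val < best_val:
            # Keep the weights that perform best on the validation set.
            best_w, best_val = w, val
        if abs(prev_loss - loss) < tol:
            # The training loss has converged to a stable state.
            break
        prev_loss = loss
    return best_w, best_val, iters

train_set = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
val_set = [(4.0, 8.0), (5.0, 10.0)]
w, val_loss, n_iters = train(train_set, val_set)
```

Returning the weights with the lowest validation loss, rather than the final weights, is one common way to act on validation monitoring; the text does not say which mechanism is used.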

[0092] S4: Use the trained drone and bird recognition model to perform drone and bird recognition and detection on the real-time acquired images, and finally output the detection results of drone and bird recognition.

[0093] Specifically, S4 includes the following sub-steps:

[0094] S41: The improved backbone network extracts features from the input drone and bird images and outputs multi-scale feature maps {C3,C4,C5}.

[0095] More specifically, S41 includes the following sub-steps:

[0096] As shown in Figure 2, the input drone and bird image (640×640×3) is fed to the input of the Conv1 layer. The Conv1 layer outputs a feature map (320×320×16) to the input of the Conv2 layer. The Conv2 layer outputs a feature map (160×160×32) to the input of the SP1 module. The SP1 module outputs a feature map (160×160×32) to the input of the Conv3 layer. The Conv3 layer outputs a feature map (80×80×64) to the input of the SP2 module. The first output of the SP2 module delivers the multi-scale feature map C3 (80×80×64) to the input of the CSA module. The CSA module outputs a feature map (80×80×64) to the input of the Conv4 layer. The Conv4 layer outputs a feature map (40×40×128) to the input of the SP3 module. The first output of the SP3 module delivers the multi-scale feature map C4 (40×40×128) to the input of the Conv5 layer. The Conv5 layer outputs a feature map (20×20×256) to the input of the SP4 module. The SP4 module outputs a feature map (20×20×256) to the input of the SPPF layer. The first and second outputs of the SPPF layer deliver the multi-scale feature map C5 (20×20×128) to the neck network.
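The feature-map sizes listed for the backbone can be checked with a small shape trace. The per-stage strides below are inferred from the stated sizes (a halving of height and width is taken as stride 2) and are an assumption, as are the names `BACKBONE` and `trace_shapes`; the channel counts are those given in the text.

```python
# (stage name, assumed stride, output channels as stated in the text)
BACKBONE = [
    ("Conv1", 2, 16),
    ("Conv2", 2, 32),
    ("SP1", 1, 32),
    ("Conv3", 2, 64),
    ("SP2", 1, 64),    # outputs C3
    ("CSA", 1, 64),
    ("Conv4", 2, 128),
    ("SP3", 1, 128),   # outputs C4
    ("Conv5", 2, 256),
    ("SP4", 1, 256),
    ("SPPF", 1, 128),  # outputs C5
]

def trace_shapes(size, stages):
    """Return {stage: (H, W, C)} for a square input of side `size`."""
    shapes = {}
    for name, stride, channels in stages:
        size //= stride
        shapes[name] = (size, size, channels)
    return shapes

shapes = trace_shapes(640, BACKBONE)
c3, c4, c5 = shapes["SP2"], shapes["SP3"], shapes["SPPF"]
```

The trace reproduces the 80×80, 40×40, and 20×20 resolutions of C3, C4, and C5 for a 640×640 input.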

[0097] S411: The SP module enhances both the spatial details and the semantic information of the input feature map X through a chain of branch splitting, dual-weight extraction, and weighted feature fusion, generating a fused feature that carries both spatial and semantic mappings.

[0098] More specifically, as shown in Figure 3, the processing procedure of the SP module is as follows:

[0099] S4111: Input the feature map X to the Split module, which splits the feature map X into a left branch and a right branch. The formula is:

[0100] ;

[0101] where H is the height of the feature map, W is the width of the feature map, C is the number of channels, R is the set of real numbers, and the splitting operation is that performed by the Split module.

[0102] S4112: The left branch extracts the detailed spatial features of small targets sequentially through the ConvS1 and ConvS2 layers. These detailed spatial features are then convolved through the ConvS3 layer to obtain the left-branch intermediate feature map. The formula is:

[0103] ;

[0104] where the ConvS1 layer and the ConvS2 layer each perform a 3×3 convolution operation, and the ConvS3 layer performs a 1×1 convolution operation;

[0105] S4113: The left-branch intermediate feature map undergoes depthwise separable convolution through the DWConv layer. The convolved features are then subjected to adaptive average pooling through the AdaptiveAvgPool layer to aggregate global contextual information. The average-pooled features are then activated by an activation function to obtain the semantic information weights. The formula is:

[0106] ;

[0107] where the DWConv layer performs a 3×3 depthwise separable convolution operation, i=0,1,2...H and j=0,1,2...W are the row and column indices over which the pooling averages, and the activation function is applied to the pooled result;

[0108] S4114: The right branch undergoes preliminary channel-dimension mapping through the ConvS4 layer to obtain the right-branch intermediate feature map. The formula is:

[0109] ;

[0110] where the ConvS4 layer performs a 1×1 convolution operation;

[0111] S4115: The feature representation of the right-branch intermediate feature map is optimized through the ConvS5 layer, and the optimized features are then batch-normalized through the BN layer to obtain the spatial information weights. The formula is:

[0112] ;

[0113] where the ConvS5 layer performs a 1×1 convolution operation and the BN layer performs a batch normalization operation;

[0114] S4116: The left-branch intermediate feature map and the spatial information weights are multiplied element-wise, the right-branch intermediate feature map and the semantic information weights are multiplied element-wise, and the two resulting features are concatenated. The concatenated features are then convolved through the ConvS6 layer to obtain the fused feature. The formula is:

[0115] ;

[0116] where the formula uses an element-wise multiplication operator and an element-wise addition operator.

[0117] It is important to note that the fused feature is, specifically, a fusion feature with both spatial and semantic mappings.
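The formula images for the SP module steps above are not reproduced in this text. A reconstruction that is consistent with the step descriptions and with the connection structure in [0077] could read as follows. All symbol names ($X_L$, $X_R$, $F_L$, $F_R$, $W_{sem}$, $W_{spa}$, $F_{out}$, $\delta$) are assumptions introduced here; the activation function $\delta$ is left generic because the text does not name it, and the kernel size of the ConvS6 layer is not stated.

```latex
\begin{aligned}
X_L,\ X_R &= \mathrm{Split}(X), \qquad X \in \mathbb{R}^{H \times W \times C} \\
F_L &= \mathrm{Conv}^{S3}_{1\times1}\Big(\mathrm{Conv}^{S2}_{3\times3}\big(\mathrm{Conv}^{S1}_{3\times3}(X_L)\big)\Big) \\
W_{sem} &= \delta\Big(\frac{1}{HW}\sum_{i}\sum_{j}\mathrm{DWConv}_{3\times3}(F_L)(i,j)\Big) \\
F_R &= \mathrm{Conv}^{S4}_{1\times1}(X_R) \\
W_{spa} &= \mathrm{BN}\big(\mathrm{Conv}^{S5}_{1\times1}(F_R)\big) \\
F_{out} &= \mathrm{Conv}^{S6}\Big(\mathrm{Concat}\big(F_L \otimes W_{spa},\ F_R \otimes W_{sem}\big)\Big)
\end{aligned}
```

The pairing of $F_L$ with $W_{spa}$ and of $F_R$ with $W_{sem}$ follows the connections described in [0077]. Paragraph [0116] also mentions an element-wise addition whose place in the final formula cannot be recovered from the description, so it is not shown.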

[0118] More specifically, as shown in Figure 4, the processing procedure of the CSA module is as follows:

[0119] S412: The CSA module learns the infrared thermal radiation characteristics of UAV propellers and bird wings in parallel, and performs dynamic weighted fusion in combination with aspect ratio priors to output the enhanced features.

[0120] S4121: Input feature map Y to the CSA module. The CSA module inputs feature map Y to the UAV propeller branch, aspect ratio attention weight branch and bird wing detection branch respectively.

[0121] S4122: Through the UAV propeller branch, the feature map Y is focused on the feature pattern of a low-brightness, symmetrical, elliptical thermal distribution to generate the propeller attention map;

[0122] Specifically, the feature map Y is input to the ConvC1 layer to extract the attention map of the thermal radiation region. The extracted features are then batch-normalized through the BN1 layer. The batch-normalized features are activated by the ReLU function. The activated features are then enhanced by the ConvC2 layer to strengthen the low-brightness symmetrical elliptical thermal distribution. The enhanced features are then activated by the Sigmoid function to obtain the propeller attention map. The formula is:

[0123] ;

[0124] where the ConvC1 layer and the ConvC2 layer each perform a 3×3 convolution operation, the BN1 layer performs a batch normalization operation, and the ReLU and Sigmoid functions are the two activation functions;

[0125] S4123: Through the bird wing detection branch, the feature map Y is focused on the feature pattern of an asymmetric strip-shaped thermal distribution to generate the wing attention map;

[0126] Specifically, the feature map Y is input to the ConvC3 layer to extract the attention of the thermal radiation region. The extracted features are then batch-normalized through the BN2 layer. The batch-normalized features are activated by the ReLU function. The activated features are then enhanced by the ConvC4 layer to strengthen the asymmetric strip-shaped thermal distribution. The enhanced features are then activated by the Sigmoid function to obtain the wing attention map. The formula is:

[0127] ;

[0128] where the ConvC3 layer and the ConvC4 layer each perform a 3×3 convolution operation, and the BN2 layer performs a batch normalization operation;
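The formula images for the two component branches are not reproduced in this text. Following the layer order described in the steps above and the layer assignment in [0078], they could be reconstructed as below. The symbols $A_p$ (propeller attention map), $A_w$ (wing attention map), and $\sigma$ (Sigmoid function) are names assumed here, not taken from the original formulas.

```latex
\begin{aligned}
A_p &= \sigma\Big(\mathrm{Conv}^{C2}_{3\times3}\big(\mathrm{ReLU}\big(\mathrm{BN}_1\big(\mathrm{Conv}^{C1}_{3\times3}(Y)\big)\big)\big)\Big) \\
A_w &= \sigma\Big(\mathrm{Conv}^{C4}_{3\times3}\big(\mathrm{ReLU}\big(\mathrm{BN}_2\big(\mathrm{Conv}^{C3}_{3\times3}(Y)\big)\big)\big)\Big)
\end{aligned}
```

The two expressions have the same form and differ only in their layer parameters, which matches the statement that the branches share a structure but learn independent convolution kernels.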

[0129] It is worth noting that the drone propeller branch and the bird wing detection branch have the same structure, but the two branches learn completely independent learnable convolutional kernel parameters. Preliminary experiments show that the similarity of the convolutional kernel parameters of the two branches is only 12.7%, proving that the two branches have learned differentiated infrared feature extraction rules.

[0130] It is important to note that the aspect ratio was chosen as an auxiliary distinguishing feature based on the inherent morphological differences between drone propeller targets and bird wing targets. Statistical data shows that the aspect ratio of drones is concentrated between 1.2 and 2.0 (accounting for 89.7%), exhibiting a near-circular shape, while the aspect ratio of birds is concentrated between 0.8 and 1.4 (accounting for 82.3%), exhibiting a long, narrow shape. This difference is not affected by thermal radiation blurring in infrared images and serves as a stable basis for class distinction. The role of aspect ratio attention weight is to dynamically adjust the contribution of component attention when the thermal features of a single component are ambiguous (such as when the hot spot on a bird's wing is close to an ellipse), thereby avoiding misjudgment.

[0131] S4124: Through the aspect ratio attention weight branch, the feature map Y passes through a chain of global feature aggregation, dimension transformation, and category score normalization to generate the aspect ratio attention weights;

[0132] Specifically, the feature map Y undergoes adaptive average pooling through the AdaptiveAvgPool layer to aggregate global context information. The average-pooled features are flattened into a one-dimensional vector through the Flatten layer. This one-dimensional vector is then reduced in dimension through the Linear1 layer. The dimension-reduced vector is activated by the ReLU function. The activated vector is then mapped to a two-dimensional vector through the Linear2 layer. Finally, the two-dimensional vector is passed through the Softmax function to generate the aspect ratio attention weights, as shown in the formula:

[0133] ;

[0134] where the operations are, in order, the flattening operation of the Flatten layer, the dimensionality-reduction operation of the Linear1 layer, the mapping operation of the Linear2 layer, and the Softmax function; the two outputs are the drone tendency weight and the bird tendency weight; the raw score vector output by the Linear2 layer consists of the raw drone tendency score and the raw bird tendency score; and B represents the batch size;

[0135] S4125: The propeller attention map and the drone tendency weights are multiplied element-wise, and the wing attention map and the bird tendency weights are multiplied element-wise. The results of the element-wise multiplications are then added together to obtain the enhanced features. The formula is:

[0136] ;

[0137] where the result is the enhanced feature.
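The weighting and fusion described in S4124 and S4125 can be sketched in a few lines. The sketch assumes single-channel attention maps stored as nested lists and two raw scores standing in for the Linear2 output; the function names `softmax` and `fuse_attention` and all numeric values are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of raw scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def fuse_attention(prop_map, wing_map, raw_scores):
    """Weight the two attention maps by the tendency weights and add them."""
    w_uav, w_bird = softmax(raw_scores)
    fused = [
        [w_uav * p + w_bird * q for p, q in zip(p_row, q_row)]
        for p_row, q_row in zip(prop_map, wing_map)
    ]
    return fused, (w_uav, w_bird)

prop = [[0.9, 0.1], [0.2, 0.8]]   # stand-in propeller attention map
wing = [[0.1, 0.7], [0.6, 0.3]]   # stand-in wing attention map
enhanced, (w_uav, w_bird) = fuse_attention(prop, wing, raw_scores=[2.0, 0.0])
```

Because the Softmax output sums to one, the fused map is a convex combination of the two attention maps: equal raw scores give a plain average, while a higher drone score shifts the result toward the propeller map.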

[0138] S42: The neck network performs top-down and bottom-up multi-scale feature fusion on the input multi-scale feature maps {C3,C4,C5} and outputs the target features {P3,P4,P5}.

[0139] Specifically, the processing structure of the neck network is as follows:

[0140] The output feature map (40×40×128) from the Upsample1 module and the multi-scale feature map C4 (40×40×128) from the second output of the SP3 module are spliced together to obtain a spliced feature (40×40×256). The spliced feature (40×40×256) is input to the input of the C2f1 layer. The first output feature map (40×40×128) from the C2f1 layer is input to the Upsample2 module. The output feature map (80×80×128) from the Upsample2 module and the multi-scale feature map C3 (80×80×64) from the second output of the SP2 module are spliced together to obtain a spliced feature (80×80×192). The spliced feature (80×80×192) is input to the input of the C2f2 layer. The first output of the C2f2 layer outputs a small target feature P3 (80×80×64) to the detection head.

[0141] The second output of the C2f2 layer delivers the small target feature P3 (80×80×64) to the input of the Conv6 layer. The output feature map (40×40×128) of the Conv6 layer and the feature map (40×40×128) from the second output of the C2f1 layer are spliced together to obtain a spliced feature (40×40×256). The spliced feature (40×40×256) is input to the C2f3 layer. The first output of the C2f3 layer delivers the medium target feature P4 (40×40×128) to the detection head.

[0142] The second output of the C2f3 layer delivers the medium target feature P4 (40×40×128) to the input of the Conv7 layer. The output feature map (20×20×256) of the Conv7 layer and the multi-scale feature map C5 (20×20×128) output by the SPPF layer are spliced together to obtain a spliced feature (20×20×384). The spliced feature (20×20×384) is input to the C2f4 layer. The C2f4 layer outputs the large target feature P5 (20×20×128) to the detection head.
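The channel counts of the spliced features in the neck follow from concatenation along the channel axis, which can be verified with a short check. The helper name `concat_channels` is hypothetical; the shapes are those stated in the text.

```python
def concat_channels(a, b):
    """Concatenate two (H, W, C) shapes along the channel axis."""
    if a[:2] != b[:2]:
        raise ValueError(f"spatial sizes differ: {a} vs {b}")
    return (a[0], a[1], a[2] + b[2])

# Inputs to each splice, as (H, W, C) tuples taken from the description.
cat_c2f1 = concat_channels((40, 40, 128), (40, 40, 128))  # Upsample1 + C4
cat_c2f2 = concat_channels((80, 80, 128), (80, 80, 64))   # Upsample2 + C3
cat_c2f3 = concat_channels((40, 40, 128), (40, 40, 128))  # Conv6 + C2f1
cat_c2f4 = concat_channels((20, 20, 256), (20, 20, 128))  # Conv7 + C5
```

Splicing is only defined when the two inputs share height and width, which is why each upsampling or strided convolution is placed before the corresponding splice.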

[0143] S43: The detection head trains and optimizes the input target features {P3,P4,P5} to build classification and regression branches. The classification branch outputs the probabilities of birds and drones, and the regression branch outputs the bounding box coordinates and confidence scores.

[0144] More specifically, the target features {P3, P4, P5} are input into the detection head module, which is divided into a classification branch and a regression branch. The classification branch outputs the probabilities of birds and drones, while the regression branch outputs the bounding box coordinates and confidence scores. The probabilities of birds and drones are optimized using the classification loss function, while the bounding box coordinates are optimized using the bounding box regression loss function and the distribution focal loss function.

[0145] The hyperparameters were configured as follows: 100 training epochs, batch size 16, image input size 640×640, initial learning rate 0.01, momentum 0.937, weight decay 0.0005, bounding box regression loss function weight 7.5, classification loss function weight 0.5, and distribution focal loss function weight 1.5.
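One way the three loss weights listed above could enter training is as coefficients of a weighted sum. This combination rule is an assumption, since the text gives the weights but not how the terms are combined; the names `LOSS_WEIGHTS` and `total_loss` and the example loss values are hypothetical.

```python
# Weights as stated in the hyperparameter configuration.
LOSS_WEIGHTS = {"box": 7.5, "cls": 0.5, "dfl": 1.5}

def total_loss(box_loss, cls_loss, dfl_loss, weights=LOSS_WEIGHTS):
    """Weighted sum of the three loss terms (assumed combination rule)."""
    return (weights["box"] * box_loss
            + weights["cls"] * cls_loss
            + weights["dfl"] * dfl_loss)

# Made-up per-batch loss values, for illustration only.
loss = total_loss(box_loss=0.2, cls_loss=1.0, dfl_loss=0.4)
```

With these weights, a unit change in the bounding box regression term moves the total fifteen times as much as a unit change in the classification term, which puts the emphasis on localization.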

[0146] The bounding box coordinates are {x,y,w,h}, which are the horizontal and vertical coordinates of the bounding box center and the width and height of the bounding box, respectively.
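For overlap computations it is often convenient to convert {x,y,w,h} to corner coordinates. The sketch below assumes that (x, y) denotes the box center, which is the usual convention in YOLO-style detectors but is not stated unambiguously in the text; the function name `xywh_to_xyxy` is hypothetical.

```python
def xywh_to_xyxy(x, y, w, h):
    """Convert a center-format box to (x1, y1, x2, y2) corner format."""
    half_w, half_h = w / 2.0, h / 2.0
    return (x - half_w, y - half_h, x + half_w, y + half_h)

corners = xywh_to_xyxy(100.0, 80.0, 40.0, 20.0)
```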

[0147] Specific Implementation Example 2: Verification of the UAV and Bird Recognition Model.

[0148] To verify the effectiveness of the SP and CSA modules, they were embedded into the YOLOv8n backbone network. The ablation experiment results are shown in Table 1.

[0149] Table 1: Ablation test results.

[0150] ;

[0151] The evaluation metrics are precision (P), recall (R), mean average precision at an intersection-over-union (IoU) threshold of 0.5 (mAP@0.5), mean average precision averaged over IoU thresholds from 0.5 to 0.95 (mAP@0.5:0.95), number of parameters (Param / M), and computational cost (GFLOPs).
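The quantities underlying these metrics can be illustrated with a minimal sketch, assuming boxes in corner format (x1, y1, x2, y2). The function names `iou` and `precision_recall` and the example counts are hypothetical, and the averaging over classes and thresholds that turns these quantities into mAP is not shown.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def precision_recall(tp, fp, fn):
    """Precision and recall from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

pred, truth = (0.0, 0.0, 10.0, 10.0), (5.0, 0.0, 15.0, 10.0)
overlap = iou(pred, truth)      # intersection 50, union 150
is_match = overlap >= 0.5       # a detection counts as correct at IoU >= 0.5
p, r = precision_recall(tp=8, fp=2, fn=4)
```

In this example the prediction overlaps the ground truth by one third, so it would not count as a true positive at the 0.5 threshold used for mAP@0.5.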

[0152] YOLOv8n+SP: Compared to the YOLOv8n model, the number of parameters is reduced from 3.01M to 2.60M, and the computational cost is reduced from 8.1 to 7.6, achieving lightweighting. At the same time, the accuracy and average precision are slightly improved, proving that the SP module retains better features while reducing the number of parameters.

[0153] YOLOv8n+CSA: Compared to the YOLOv8n model, mAP@0.5 improved by 0.5% and mAP@0.5:0.95 improved significantly by 2.5%. Precision and recall were also improved, proving that the CSA module significantly enhanced the model's detection capability and localization accuracy through the component attention mechanism.

[0154] YOLOv8n+SP+CSA: With only a slight increase in the number of parameters to 3.21M, mAP@0.5 reached 96.5% and mAP@0.5:0.95 reached 52.3%, improvements of 0.7% and 3.0% respectively over the YOLOv8n model, demonstrating the synergistic effect of using the SP and CSA modules together.

[0155] It should be noted that the SP and CSA modules can also be migrated to other similar single-stage object detection networks, such as YOLOv5 and YOLOv11.

[0156] To verify the advantages of the drone and bird recognition models, YOLO series models (YOLOv3-tiny, YOLOv5n, YOLOv6n, YOLOv8n, YOLOv9-tiny, YOLOv10n) of the same scale as the drone and bird recognition models were compared under the same dataset and environment. The results of the comparison experiment are shown in Table 2.

[0157] ;

[0158] The mAP@0.5:0.95 of the UAV and bird recognition model of this invention reaches 52.3%, which is higher than that of all the comparison models and 1.1% higher than that of the second-best model, YOLOv9-tiny. This shows that the UAV and bird recognition model has the strongest comprehensive positioning capability across the multiple IoU thresholds. While achieving the best positioning accuracy, the number of model parameters and the computational cost remain within a reasonable range, which verifies that the UAV and bird recognition model has achieved an effective balance between detection accuracy and model complexity.

[0159] The above description is merely a specific embodiment of the present invention, but the scope of protection of the present invention is not limited thereto. Any variations or substitutions that can be easily conceived by those skilled in the art within the technical scope disclosed in the present invention should be included within the scope of protection of the present invention. Therefore, the scope of protection of the present invention should be determined by the scope of the claims.

Claims

1. A method for identifying drones and birds by fusing spatial perception and component attention, characterized in that, Includes the following steps: Collect bird and drone data, introduce bird and drone data into various environmental scenarios, label suspected drone and bird areas in the environmental scenarios, construct the collected dataset through the labeled environmental scenarios, perform preprocessing and enhancement operations on the collected dataset to obtain the training dataset; Based on the improved backbone network, neck network, and detection head, and combined with spatial perception and component spatial attention, a drone and bird recognition model is constructed. The drone and bird recognition model is trained by inputting the training dataset, and the detection model parameters of the drone and bird recognition model are iteratively adjusted by using the loss function and training strategy. The trained drone and bird recognition model is used to detect drones and birds in real-time images, and the detection results of drone and bird recognition are finally output.

2. The method for identifying UAVs and birds by spatial perception and component attention fusion according to claim 1, characterized in that, The improved backbone network includes a Conv layer, SP modules, CSA modules, and an SPPF layer; the Conv layer includes Conv1, Conv2, Conv3, Conv4, and Conv5 layers; the SP modules include SP1, SP2, SP3, and SP4 modules; the Conv1, Conv2, SP1, Conv3, SP2, CSA, Conv4, SP3, Conv5, SP4 modules, and SPPF layers are connected sequentially.

3. The method for identifying UAVs and birds by spatial perception and component attention fusion according to claim 2, characterized in that, the SP module includes a Split module, a ConvS layer, a DWConv layer, an AdaptiveAvgPool layer, and a BN layer; the ConvS layer includes ConvS1, ConvS2, ConvS3, ConvS4, ConvS5, and ConvS6 layers; the processing procedure of the SP module is as follows: the feature map X is input to the Split module, which splits the feature map X into a left branch and a right branch; the left branch extracts the detailed spatial features of small targets sequentially through the ConvS1 and ConvS2 layers, and these detailed spatial features are then convolved through the ConvS3 layer to obtain a left-branch intermediate feature map; the left-branch intermediate feature map undergoes depthwise separable convolution through the DWConv layer, the convolved features are subjected to adaptive average pooling through the AdaptiveAvgPool layer to aggregate global contextual information, and the average-pooled features are activated by an activation function to obtain semantic information weights; the right branch undergoes preliminary channel-dimension mapping through the ConvS4 layer to obtain a right-branch intermediate feature map; the feature representation of the right-branch intermediate feature map is optimized through the ConvS5 layer, and the optimized features are batch-normalized through the BN layer to obtain spatial information weights; the left-branch intermediate feature map and the spatial information weights are multiplied element-wise, the right-branch intermediate feature map and the semantic information weights are multiplied element-wise, and the resulting features are concatenated; the concatenated features are then convolved through the ConvS6 layer to obtain fused features.

4. The method for UAV and bird recognition based on spatial perception and component attention fusion according to claim 2, characterized in that, the CSA module includes a UAV propeller branch, an aspect ratio attention weight branch, and a bird wing detection branch; the processing procedure of the CSA module is as follows: the feature map Y is input to the CSA module, and the CSA module inputs the feature map Y to the UAV propeller branch, the aspect ratio attention weight branch, and the bird wing detection branch respectively; through the UAV propeller branch, the feature map Y is focused on the feature pattern of a low-brightness symmetrical elliptical thermal distribution to generate a propeller attention map; through the bird wing detection branch, the feature map Y is focused on the feature pattern of an asymmetric strip-shaped thermal distribution to generate a wing attention map; through the aspect ratio attention weight branch, the feature map Y passes through a chain of global feature aggregation, dimension transformation, and category score normalization to generate aspect ratio attention weights; the propeller attention map and the drone tendency weight are multiplied element-wise, the wing attention map and the bird tendency weight are multiplied element-wise, and the results of the element-wise multiplications are then summed to obtain the enhanced features.

5. The method for UAV and bird recognition based on spatial perception and component attention fusion according to claim 4, characterized in that, The drone propeller branch includes ConvC1 layer, BN1 layer, and ConvC2 layer; the aspect ratio attention weight branch includes AdaptiveAvgPool layer, Flatten layer, Linear1 layer, and Linear2 layer; the bird wing detection branch includes ConvC3 layer, BN2 layer, and ConvC4 layer; the ConvC1 layer, BN1 layer, and ConvC2 layer are connected in sequence, the AdaptiveAvgPool layer, Flatten layer, Linear1 layer, and Linear2 layer are connected in sequence, and the ConvC3 layer, BN2 layer, and ConvC4 layer are connected in sequence.

6. The method for identifying UAVs and birds by spatial perception and component attention fusion according to claim 5, characterized in that generating the propeller attention map from the feature map Y through the UAV propeller branch, by focusing on the feature pattern of a low-brightness, symmetrical, elliptical thermal distribution, includes the following sub-steps: the feature map Y is input to the ConvC1 layer to extract features of the thermal radiation region; the extracted features are batch-normalized by the BN1 layer; the batch-normalized features are activated by the ReLU function; the activated features are passed through the ConvC2 layer to strengthen the response to the low-brightness, symmetrical, elliptical thermal distribution; and the resulting features are activated by the Sigmoid function to obtain the propeller attention map.
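
The sub-steps of claim 6 (ConvC1, BN1, ReLU, ConvC2, Sigmoid) can be sketched in plain Python as follows. This is only an illustration under several assumptions that the claim does not state: both convolutions are taken to be 1x1 (pure channel mixing), batch normalization is applied in inference mode with fixed statistics, ConvC2 outputs a single channel, and all parameter values and names (`conv1x1`, `batch_norm_eval`, `propeller_branch`, `params`) are hypothetical.

```python
import math
from typing import Dict, List

Tensor3D = List[List[List[float]]]  # [channels][height][width]


def conv1x1(x: Tensor3D, weight: List[List[float]], bias: List[float]) -> Tensor3D:
    """1x1 convolution: mixes channels independently at every pixel (assumed kernel size)."""
    h, w = len(x[0]), len(x[0][0])
    return [
        [[bias[o] + sum(weight[o][c] * x[c][i][j] for c in range(len(x)))
          for j in range(w)] for i in range(h)]
        for o in range(len(weight))
    ]


def batch_norm_eval(x: Tensor3D, mean, var, gamma, beta, eps: float = 1e-5) -> Tensor3D:
    """Per-channel batch normalization using fixed (inference-mode) statistics."""
    return [
        [[gamma[c] * (v - mean[c]) / math.sqrt(var[c] + eps) + beta[c] for v in row]
         for row in x[c]]
        for c in range(len(x))
    ]


def relu(x: Tensor3D) -> Tensor3D:
    return [[[max(0.0, v) for v in row] for row in ch] for ch in x]


def sigmoid(x: Tensor3D) -> Tensor3D:
    return [[[1.0 / (1.0 + math.exp(-v)) for v in row] for row in ch] for ch in x]


def propeller_branch(y: Tensor3D, p: Dict[str, list]) -> List[List[float]]:
    """ConvC1 -> BN1 -> ReLU -> ConvC2 -> Sigmoid, returning a one-channel attention map."""
    h = conv1x1(y, p["w1"], p["b1"])                                    # ConvC1
    h = batch_norm_eval(h, p["mean"], p["var"], p["gamma"], p["beta"])  # BN1
    h = relu(h)                                                         # ReLU
    h = conv1x1(h, p["w2"], p["b2"])                                    # ConvC2
    return sigmoid(h)[0]                                                # Sigmoid


# Hypothetical two-channel 2x2 feature map Y and hand-picked parameters.
y = [[[1.0, 2.0], [3.0, 4.0]],
     [[1.0, 0.0], [5.0, 4.0]]]
params = {
    "w1": [[1.0, 0.0], [0.0, 1.0]], "b1": [0.0, 0.0],
    "mean": [0.0, 0.0], "var": [1.0, 1.0], "gamma": [1.0, 1.0], "beta": [0.0, 0.0],
    "w2": [[1.0, -1.0]], "b2": [0.0],
}
attention = propeller_branch(y, params)
```

Because the last step is a Sigmoid, every entry of `attention` lies strictly between 0 and 1, so the map can be used directly as a multiplicative spatial weight. The bird wing detection branch in claim 5 has the same layer pattern (ConvC3, BN2, ConvC4) and could reuse the same sketch with its own parameters.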

7. The method for identifying UAVs and birds by spatial perception and component attention fusion according to claim 5, characterized in that generating the aspect ratio attention weights from the feature map Y through the aspect ratio attention weight branch, via a chain of global feature aggregation, dimension transformation, and category score normalization, includes the following sub-steps: the feature map Y undergoes adaptive average pooling in the AdaptiveAvgPool layer to aggregate global context information; the pooled features are flattened into a one-dimensional vector by the Flatten layer; the one-dimensional vector is reduced in dimension by the Linear1 layer; the dimension-reduced vector is activated by the ReLU function; the activated vector is mapped by the Linear2 layer to a two-dimensional (two-element) score vector; and the two-dimensional score vector is normalized by the Softmax function to generate the aspect ratio attention weights.
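
The chain in claim 7 can be sketched in plain Python as follows. The sketch assumes a single sample, adaptive average pooling to a 1x1 output per channel (so that pooling followed by Flatten yields one value per channel), and that the first Softmax output is the drone tendency weight and the second the bird tendency weight. The function names and all parameter values are hypothetical.

```python
import math
from typing import List, Tuple

Tensor3D = List[List[List[float]]]  # [channels][height][width]


def global_avg_pool(y: Tensor3D) -> List[float]:
    """AdaptiveAvgPool to 1x1 followed by Flatten: one mean value per channel."""
    return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in y]


def linear(x: List[float], weight: List[List[float]], bias: List[float]) -> List[float]:
    """Fully connected layer: weight has shape [out_features][in_features]."""
    return [b + sum(w_i * x_i for w_i, x_i in zip(w_row, x))
            for w_row, b in zip(weight, bias)]


def softmax(z: List[float]) -> List[float]:
    """Numerically stable Softmax over a score vector."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [v / total for v in exps]


def aspect_ratio_weights(y: Tensor3D, w1, b1, w2, b2) -> Tuple[float, float]:
    pooled = global_avg_pool(y)                              # AdaptiveAvgPool + Flatten
    hidden = [max(0.0, v) for v in linear(pooled, w1, b1)]   # Linear1 + ReLU
    scores = linear(hidden, w2, b2)                          # Linear2: two raw scores
    w_uav, w_bird = softmax(scores)                          # Softmax normalization
    return w_uav, w_bird


# Hypothetical four-channel 2x2 feature map whose channel means are 1, 2, 3, 4.
y = [[[c, c], [c, c]] for c in (1.0, 2.0, 3.0, 4.0)]
w1 = [[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 1.0]]  # reduces 4 features to 2
b1 = [0.0, 0.0]
w2 = [[1.0, 0.0], [0.0, 1.0]]                      # maps 2 features to 2 scores
b2 = [0.0, 0.0]
w_uav, w_bird = aspect_ratio_weights(y, w1, b1, w2, b2)
```

Since the Softmax output sums to 1, the two weights act as complementary tendencies: raising the drone tendency weight necessarily lowers the bird tendency weight.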

8. The method for identifying UAVs and birds by spatial perception and component attention fusion according to claim 4, characterized in that the processing formula for the aspect ratio attention weight branch is as follows: [formula]; where the symbols denote, respectively: the flattening operation of the Flatten layer; the dimensionality reduction operation of the Linear1 layer; the mapping operation of the Linear2 layer; the Softmax function; the drone tendency weight; the bird tendency weight; the raw score vector output by the Linear2 layer; the raw drone tendency score; and the raw bird tendency score; B denotes the batch size, and R denotes the set of real numbers.
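
One plausible form of the aspect ratio attention weight formula, consistent with the sub-steps of claim 7 and the quantities defined in claim 8, is sketched below. Every symbol name here is an assumption, because the text does not give the patent's own notation: $f_{\mathrm{flat}}$, $f_{L1}$, and $f_{L2}$ stand for the Flatten, Linear1, and Linear2 operations, $z$ for the raw score vector, $z_{uav}$ and $z_{bird}$ for the raw tendency scores, and $w_{uav}$ and $w_{bird}$ for the tendency weights.

```latex
\begin{aligned}
z &= f_{L2}\Big(\mathrm{ReLU}\big(f_{L1}\big(f_{\mathrm{flat}}(\mathrm{AdaptiveAvgPool}(Y))\big)\big)\Big)
   = [\,z_{uav},\; z_{bird}\,] \in \mathbb{R}^{B \times 2}, \\
[\,w_{uav},\; w_{bird}\,] &= \mathrm{Softmax}(z), \qquad
w_{uav} = \frac{e^{z_{uav}}}{e^{z_{uav}} + e^{z_{bird}}}, \qquad
w_{bird} = \frac{e^{z_{bird}}}{e^{z_{uav}} + e^{z_{bird}}}.
\end{aligned}
```

Under this reading, $w_{uav} + w_{bird} = 1$ for every sample in the batch.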

9. The method for identifying UAVs and birds by spatial perception and component attention fusion according to claim 1, characterized in that outputting the final detection results for drone and bird identification includes the following sub-steps: the improved backbone network extracts features from the input drone and bird images and outputs multi-scale feature maps {C3,C4,C5}; the neck network performs top-down and bottom-up multi-scale feature fusion on the input multi-scale feature maps {C3,C4,C5} and outputs the target features {P3,P4,P5}; the detection head builds classification and regression branches on the input target features {P3,P4,P5} and is trained and optimized on them; the classification branch outputs the probabilities of birds and drones, and the regression branch outputs the bounding box coordinates and confidence scores.