Structure enhancement method and system for anti-drone infrared imaging
By introducing an infrared single-channel adaptive projection module, a local thermal entropy-driven ultra-small target hotspot enhancement module, and an infrared feature dynamic recalibration network, infrared feature extraction is optimized, solving the problems of insufficient texture edge information and background interference in infrared imaging, and achieving efficient detection of small infrared targets.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- HARBIN INST OF TECH
- Filing Date
- 2026-03-19
- Publication Date
- 2026-06-19
Description
Technical Field
[0001] This invention belongs to the field of target detection technology, and in particular relates to a structural enhancement method for infrared imaging against unmanned aerial vehicles.
Background Technology
[0002] With the continuous breakthroughs of deep learning in the field of visible light target detection, researchers have begun to introduce mature detection frameworks into thermal infrared imaging scenarios to leverage the more stable imaging capabilities of infrared imaging in low light, nighttime, and low visibility conditions, providing all-weather supplementary perception for anti-drone systems.
[0003] However, infrared images lack sufficient texture and edge information compared to visible light images. Targets appear as small-scale local hot spots, and strong background interference is caused by cloud thermal radiation, building and ground surface thermal scattering, and sensor noise. This makes it difficult to directly transfer many feature extraction and optimization strategies designed for RGB texture priors. In addition, many engineering implementations still use single-channel to three-channel hard copying to reuse RGB pre-trained detectors. While this simplifies the input interface, it introduces channel redundancy without information gain, which may weaken the network's sensitivity to differences in thermal radiation intensity, thus forming a bottleneck in the underlying input adaptation of infrared detection.
Summary of the Invention
[0004] The purpose of this invention is to provide a structural enhancement method for infrared imaging against unmanned aerial vehicles, aiming to solve the above-mentioned technical problems.
[0005] This invention is implemented as follows: a structural enhancement method for anti-drone infrared imaging, comprising the following steps:
[0006] The original infrared three-channel image is input into the infrared single-channel adaptive projection module IR-Adapter. Through adaptive fusion of learnable projection branch, fidelity pass-through branch and global statistical gating, infrared adaptation features are output to align with the input distribution of the pre-trained backbone network.
[0007] The IR-SA-SFEM ultra-small target hotspot enhancement module, driven by local thermal entropy, takes infrared adaptation features as input, and outputs enhanced features through local thermal entropy saliency adjustment and thermal radiation contrast enhancement.
[0008] Based on the infrared feature dynamic recalibration network IR-RFNet, the input is enhanced features, and through multi-scale thermal context modeling, branch weight adaptive fusion, spatial high-frequency suppression gating and residual steady-state fusion, the output is a dynamic recalibration feature with suppressed background thermal interference.
[0009] Based on dynamic recalibration features, the detection results of UAV targets in infrared images are determined.
[0010] Furthermore, the learnable projection branch performs cross-channel linear mixing through a 1×1 convolution followed by BN and a SiLU activation function, and outputs the projection feature $F_0$:
[0011] $F_0 = \mathrm{SiLU}(\mathrm{BN}(\mathrm{Conv}_{1\times 1}(X)))$, with $F_0 \in \mathbb{R}^{C_a \times H \times W}$;
[0012] where $X$ is the input; $\mathrm{SiLU}(\cdot)$ denotes the SiLU nonlinear activation; $C_a$ is the number of channels after projection; and $H$ and $W$ are the height and width of the original infrared three-channel image, respectively.
[0013] The fidelity pass-through branch directly takes the original infrared three-channel image as $I_{id}$, i.e., $I_{id} = X$.
[0014] Furthermore, the global statistical gating extracts a channel-level intensity summary of the input image through global average pooling, and outputs a scalar coefficient $g$ via a lightweight two-layer MLP and a Sigmoid function:
[0015] $g = \sigma(\mathrm{MLP}(\mathrm{GAP}(X)))$;
[0016] where $\mathrm{GAP}(\cdot)$ denotes global average pooling, $\mathrm{MLP}(\cdot)$ is a lightweight two-layer perceptron, and $\sigma(\cdot)$ is the Sigmoid function;
[0017] The infrared adaptation feature $Y$ output by the infrared single-channel adaptive projection module IR-Adapter is defined as:
[0018] $Y = g \cdot F_0 + (1 - g) \cdot I_{id}$.
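[0019] Furthermore, the local thermal entropy saliency adjustment specifically includes:
[0020] Let the input shallow features be $F_2 \in \mathbb{R}^{C \times H \times W}$. Channel aggregation is performed on the input shallow features to obtain the heat response map $S$:
[0021] $S(i,j) = \frac{1}{C}\sum_{c=1}^{C} F_2(c,i,j)$;
[0022] where $C$ is the number of channels; a 9×9 neighborhood $\mathcal{N}(i,j)$ is taken centered at pixel position $(i,j)$, and the neighborhood response is normalized into a probability distribution using Softmax:
[0023] $p_{i,j}(u,v) = \frac{\exp(S(u,v))}{\sum_{(u',v') \in \mathcal{N}(i,j)} \exp(S(u',v'))}$, $(u,v) \in \mathcal{N}(i,j)$;
[0024] The local thermal entropy map $E$ at that location is defined as:
[0025] $E(i,j) = -\sum_{(u,v) \in \mathcal{N}(i,j)} p_{i,j}(u,v)\,\log\big(p_{i,j}(u,v) + \epsilon\big)$;
[0026] where $\epsilon$ is a small constant; based on the local thermal entropy map $E$, the saliency weight $A$ is generated through a lightweight mapping:
[0027] $A = \sigma(\mathrm{Conv}_{1\times 1}(E))$;
[0028] where $\mathrm{Conv}_{1\times 1}$ is a 1×1 convolution and $\sigma$ is the Sigmoid function; the input shallow features are recalibrated to obtain the local thermal entropy enhanced features $F_E$:
[0029] $F_E = F_2 \odot A$.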
[0030] Furthermore, the thermal radiation contrast enhancement specifically includes:
[0031] For the input shallow features $F_2$, center-background difference modeling is used to enhance the relative contrast of the hot spot:
[0032] $T = \psi\big(\mathrm{AvgPool}_{k_1}(F_2) - \mathrm{AvgPool}_{k_2}(F_2)\big)$;
[0033] where $T$ is the contrast enhanced feature; $\mathrm{AvgPool}_{k}(\cdot)$ represents the average pooling operation with a kernel size of $k$; $k_1 < k_2$, so that the two pooled terms correspond to the mean responses of the local center and of a larger background range, respectively; and $\psi(\cdot)$ represents a lightweight non-linear mapping;
[0034] A residual form is adopted: the input shallow features are fused with the local thermal entropy enhanced features and the contrast enhanced features, and the result is output.
[0035] Furthermore, the multi-scale thermal context modeling and branch weight adaptive fusion specifically include:
[0036] Let the input feature be $F \in \mathbb{R}^{C \times H \times W}$; the output feature has the same size and number of channels as the input feature;
[0037] $K$ depthwise separable dilated convolution branches are constructed in parallel on the same input feature $F$ to perceive the relative differences in the thermal distributions of the target hot spot and the surrounding background at different spatial scales, generating $K$ branch features $Y_i$:
[0038] $Y_i = \mathrm{SiLU}\big(\mathrm{GN}(\mathrm{DWConv}_{d_i}(F))\big)$, $i = 1, \dots, K$;
[0039] where $\mathrm{DWConv}_{d_i}$ represents the depthwise convolution with a dilation rate of $d_i$, $\mathrm{GN}$ is group normalization, and $\mathrm{SiLU}$ is the SiLU non-linear mapping;
[0040] A lightweight weight generator adaptively generates the branch fusion weights according to the current feature statistics; first, channel statistical aggregation of the features is performed to obtain the global description vector $s$:
[0041] $s = \mathrm{GAP}(F)$;
[0042] The branch weights (logits) are then obtained through a two-layer 1×1 convolution mapping and normalized into the fusion coefficients $\alpha$ using Softmax:
[0043] $\alpha = \mathrm{Softmax}(\phi(s))$, $\alpha \in \mathbb{R}^{K}$;
[0044] where $\phi(\cdot)$ represents the lightweight mapping; finally, the branch outputs are fused according to the dynamic weights and compressed through a 1×1 convolution to obtain the recalibrated residual feature $R$:
[0045] $R = \mathrm{Conv}_{1\times 1}\Big(\sum_{i=1}^{K} \alpha_i Y_i\Big)$.
[0046] Furthermore, the spatial high-frequency suppression gating and residual steady-state fusion specifically include:
[0047] Local mean filtering is applied to the input feature $F$ to obtain the low-pass term $L$, and the high-pass amplitude $H$ is calculated:
[0048] $L = \mathrm{AvgPool}(F)$, $H = |F - L|$;
[0049] Channel aggregation is performed on $H$ and in-image mean normalization is applied to obtain the normalized high-pass map $H_n$:
[0050] $H_n = \frac{\mathrm{Mean}_{c}(H)}{\mathrm{Mean}_{hw}\big(\mathrm{Mean}_{c}(H)\big) + \epsilon}$;
[0051] where $\mathrm{Mean}_{c}(\cdot)$ denotes the mean along the channel dimension; $\mathrm{Mean}_{hw}(\cdot)$ denotes the mean over the two spatial dimensions; and $\epsilon$ is a very small constant that prevents a zero denominator;
[0052] Then, a gate map $g$ is generated using a Sigmoid mapping, where $g_{\min}$ is the gate floor term:
[0053] ;
[0054] where the threshold is a preset value, and $t$ is a temperature coefficient that controls how soft or hard the gating is.
[0055] The recalibrated residual feature $R$ is gated to obtain the gated residual feature $R_g$:
[0056] $R_g = g \odot R$;
[0057] The output is in residual form, and a learnable scaling factor $\gamma$ is introduced to control the amplification amplitude, giving the output $F_{out}$:
[0058] $F_{out} = F + \gamma \cdot \mathrm{Clip}(R_g)$;
[0059] where $\mathrm{Clip}(\cdot)$ denotes the residual limiting operation.
[0060] Another object of the present invention is to provide a structure enhancement system for anti-drone infrared imaging, for implementing the above-mentioned structure enhancement method for anti-drone infrared imaging, specifically including:
[0061] The adaptive fusion module is used to input the original infrared three-channel image into the infrared single-channel adaptive projection module IR-Adapter. Through adaptive fusion with learnable projection branch, fidelity pass-through branch and global statistical gating, it outputs infrared adaptation features to align with the input distribution of the pre-trained backbone network.
[0062] The feature enhancement module is used to feed the infrared adaptation features into the ultra-small target hotspot enhancement module IR-SA-SFEM driven by local thermal entropy, and to output enhanced features through local thermal entropy saliency adjustment and thermal radiation contrast enhancement.
[0063] The feature dynamic recalibration module is used to take the enhanced features as input based on the infrared feature dynamic recalibration network IR-RFNet, and output the dynamic recalibrated features with suppressed background thermal interference through multi-scale thermal context modeling, branch weight adaptive fusion, spatial high-frequency suppression gating and residual steady-state fusion.
[0064] The results output module is used to determine the detection results of UAV targets in infrared images based on dynamic recalibration features.
[0065] The structural enhancement method for anti-UAV infrared imaging provided by this invention introduces an infrared single-channel adaptive projection module (IR-Adapter), a local thermal entropy-driven ultra-small target hotspot enhancement module (IR-SA-SFEM), and an infrared feature dynamic recalibration network (IR-RFNet) on the backbone network. These modules optimize infrared features at the input end, the shallow layer, and the mid-layer feature extraction stage, respectively, improving the separability and robustness of small targets while remaining lightweight and preserving the end-to-end training paradigm. Experiments demonstrate that this method can significantly improve the performance of infrared small target detection and reduce false positives and false negatives caused by complex background interference.
Attached Figure Description
[0066] Figure 1 is a schematic diagram of the structure of the IR-Adapter module provided in an embodiment of the present invention.
[0067] Figure 2 is a schematic diagram of the structure of the IR-RFNet network provided in an embodiment of the present invention.
[0068] Figure 3 is a schematic diagram of the PCTDetect target detection network provided in an embodiment of the present invention.
[0069] Figure 4 is a schematic diagram of the overall structure of the structural enhancement network provided in an embodiment of the present invention.
Detailed Implementation
[0070] To make the objectives, technical solutions, and advantages of this invention clearer, the invention will be further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative and not intended to limit the invention.
[0071] In infrared anti-drone scenarios, targets are usually small hot spots with blurred boundaries and are easily interfered with by complex backgrounds. This invention provides task-driven structural enhancement for infrared targets. While keeping the single-stage end-to-end training and detection head output of the YOLOv8n backbone network unchanged, three types of dedicated modules are introduced to achieve task-driven enhancement of input, shallow and medium-level features.
[0072] Specifically, in one embodiment of the present invention, a structural enhancement method for anti-drone infrared imaging is provided, comprising the following steps:
[0073] S1. Input the original infrared three-channel image into the infrared single-channel adaptive projection module IR-Adapter. Through adaptive fusion of learnable projection branch, fidelity pass-through branch and global statistical gating, output infrared adaptation features to align with the input distribution of the pre-trained backbone network.
[0074] S2. The ultra-small target hot spot enhancement module IR-SA-SFEM driven by local thermal entropy takes infrared adaptation features as input, and outputs enhanced features through local thermal entropy saliency adjustment and thermal radiation contrast enhancement.
[0075] S3. Based on the infrared feature dynamic recalibration network IR-RFNet, the enhanced features are input, and through multi-scale thermal context modeling, branch weight adaptive fusion, spatial high-frequency suppression gating and residual steady-state fusion, the dynamic recalibration features with suppressed background thermal interference are output.
[0076] S4. Determine the detection result of the UAV target in the infrared image based on the dynamic recalibration features.
[0077] In practical applications, the backbone network adopts YOLOv8n. Using YOLOv8n as the baseline, this embodiment adds plug-in enhancements to the infrared image feature stream while keeping the overall network's single-stage detection paradigm and output interface (Backbone-Neck-Head) unchanged. The enhancement strategy consists of three parts:
[0078] (1) Input: An IR-Adapter is introduced to perform learnable projection on the three-channel pseudo-infrared input, and adaptively fuse it with the fidelity pass-through branch to align the input distribution of the RGB pre-trained backbone and suppress early training drift.
[0079] (2) Shallow high-resolution path: IR-SA-SFEM is introduced, and saliency recalibration driven by local thermal entropy together with thermal radiation contrast enhancement is used to improve the distinguishability of ultra-small hot spot targets and reduce downsampling damage.
[0080] (3) Mid-layer feature extraction stage: IR-RFNet is introduced, and the complex thermal background is suppressed and the proportion of effective target signal is increased through the synergy of thermal channel recalibration and spatial focusing dual branches.
[0081] In a preferred embodiment of the present invention, YOLOv8n natively targets RGB three-channel input and relies on visible light data such as COCO for pre-training, so the input-side convolution and normalization statistics depend on the visible light distribution. During infrared detection, although the image is read in as a three-channel .jpg image, the three channels are highly correlated and are essentially a three-channel encoding of a single thermal intensity signal. Repeated grayscale conversion during training may introduce additional interpolation and quantization errors and unnecessary distribution perturbations. Therefore, the embodiments of the present invention keep the original three-channel infrared input during both the training and inference phases. Furthermore, an IR-Adapter module is introduced at the input end to achieve robust alignment with the pre-trained backbone input distribution through adaptive fusion. The overall network structure is shown in Figure 1; it mainly consists of three parts: a learnable projection branch, a fidelity pass-through branch, and global statistical gating.
[0082] In a preferred embodiment of the invention, the learnable projection branch performs cross-channel linear mixing through a 1×1 convolution $\mathrm{Conv}_{1\times 1}$, combined with BN and a SiLU activation function to improve training stability and non-linear expressive power, and outputs the projected feature $F_0$:
[0083] $F_0 = \mathrm{SiLU}(\mathrm{BN}(\mathrm{Conv}_{1\times 1}(X)))$, with $F_0 \in \mathbb{R}^{C_a \times H \times W}$;
[0084] where $X$ is the input; $\mathrm{SiLU}(\cdot)$ denotes the SiLU nonlinear activation; $C_a$ is the number of channels after projection; and $H$ and $W$ are the height and width of the original infrared three-channel image, respectively. To match the native input dimension of YOLOv8n, this embodiment of the invention sets $C_a$ to 3, allowing $F_0$ to interface directly with the YOLOv8n backbone and thus inherit the COCO pre-trained weights without modifying the main structure. The core function of this branch is to let the network learn a set of infrared-specific channel response combinations at the input, making it easier for the early convolutions of the subsequent backbone to obtain discriminative features.
[0085] To avoid distribution drift or information corruption in the early stages of training when relying solely on the learnable projection, the module retains a pass-through branch that maintains a fidelity baseline for the input thermal intensity structure. In the three-channel pseudo-infrared setting of the embodiment, the fidelity pass-through branch directly takes the original three-channel infrared image as its input $I_{id}$, i.e., $I_{id} = X$. This branch does not change the spatial structure or the thermal intensity distribution, which preserves the integrity of the input information and helps stabilize the gradients and the convergence process.
[0086] To achieve an adaptive balance between the learnable projection and the fidelity pass-through, the IR-Adapter introduces a scalar coefficient $g$ that controls the fusion weights of the two branches; the global statistical gating extracts a channel-level intensity summary of the input image through global average pooling, and outputs the scalar coefficient $g$ through a lightweight two-layer MLP and a Sigmoid function:
[0087] $g = \sigma(\mathrm{MLP}(\mathrm{GAP}(X)))$;
[0088] where $\mathrm{GAP}(\cdot)$ denotes global average pooling, $\mathrm{MLP}(\cdot)$ is a lightweight two-layer perceptron, and $\sigma(\cdot)$ is the Sigmoid function; this design follows the global statistical attention approach, but simplifies the output from a channel vector to a scalar, which better reflects the goal of aligning the overall input distribution and reduces the risk of noise amplification caused by introducing complex gating in a shallow layer; finally, the infrared adaptation feature $Y$ output by the infrared single-channel adaptive projection module IR-Adapter is defined as:
[0089] $Y = g \cdot F_0 + (1 - g) \cdot I_{id}$;
[0090] When the gating network determines that the infrared statistics of the current frame are better suited to the learnable mapping, $g$ is increased to raise the proportion of $F_0$; when the scene noise is high or the thermal response distribution is abnormal, $g$ is reduced to retain more of the original thermal intensity structure $I_{id}$, thereby improving training stability and robustness.
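The fusion performed by the IR-Adapter can be illustrated with a minimal sketch in plain Python. It is not the embodiment's implementation: BN is treated as identity, the hidden activation of the two-layer MLP is assumed to be SiLU, and all weights (`proj_w`, `w1`, `w2`, and so on) are hypothetical values chosen only to make the arithmetic visible.

```python
import math


def silu(x):
    """SiLU activation: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def conv1x1(image, weight, bias):
    """Per-pixel channel mixing. image: [C_in][H][W], weight: [C_out][C_in]."""
    c_in = len(image)
    h, w = len(image[0]), len(image[0][0])
    return [
        [[bias[o] + sum(weight[o][c] * image[c][i][j] for c in range(c_in))
          for j in range(w)] for i in range(h)]
        for o in range(len(weight))
    ]


def gate_scalar(image, w1, b1, w2, b2):
    """g = sigmoid(MLP(GAP(X))); the SiLU hidden activation is an assumption."""
    pooled = [sum(map(sum, ch)) / (len(ch) * len(ch[0])) for ch in image]
    hidden = [silu(b + sum(wk * p for wk, p in zip(row, pooled)))
              for row, b in zip(w1, b1)]
    return sigmoid(b2 + sum(wk * hk for wk, hk in zip(w2, hidden)))


def ir_adapter(image, proj_w, proj_b, w1, b1, w2, b2):
    """Y = g * F0 + (1 - g) * X, with BN treated as identity in this sketch."""
    f0 = [[[silu(v) for v in row] for row in ch]
          for ch in conv1x1(image, proj_w, proj_b)]
    g = gate_scalar(image, w1, b1, w2, b2)
    fused = [[[g * p + (1.0 - g) * x for p, x in zip(prow, xrow)]
              for prow, xrow in zip(pch, xch)]
             for pch, xch in zip(f0, image)]
    return fused, g


# Toy 3-channel 2x2 pseudo-infrared frame: one intensity map copied three times.
frame = [[[0.2, 0.8], [0.5, 0.1]] for _ in range(3)]
proj_w = [[0.5, 0.3, 0.2], [0.1, 0.8, 0.1], [0.3, 0.3, 0.4]]
proj_b = [0.0, 0.0, 0.0]
w1, b1 = [[0.4, 0.4, 0.4], [-0.2, 0.1, 0.3]], [0.0, 0.0]
w2, b2 = [1.0, -1.0], 0.0
y, g = ir_adapter(frame, proj_w, proj_b, w1, b1, w2, b2)
```

Because `g` is a single scalar per image, driving it toward 0 makes the adapter behave as the pass-through branch alone, which is the fallback behavior the text describes for noisy or abnormal frames.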
[0091] In a preferred embodiment of the present invention, infrared UAV targets often appear as locally bright hot spots with weak texture and edges, and are easily confused with high-frequency thermal noise in the background. Over-reliance on deep semantics leads to rapid decay of fine-grained information during continuous downsampling, resulting in missed detections and localization errors. Therefore, this embodiment inserts IR-SA-SFEM into the P2 (1/4) high-resolution feature path to preserve the local response of the hot spot as much as possible and reduce downsampling loss. Let the input shallow features be $F_2 \in \mathbb{R}^{C \times H \times W}$; the IR-SA-SFEM output keeps the same size and number of channels, facilitating seamless integration with the subsequent Neck structure. As shown in Figure 2, IR-SA-SFEM consists of two complementary paths, a local thermal entropy saliency adjustment branch and a thermal radiation contrast enhancement branch, and outputs the enhanced features in residual form.
[0092] Specifically, the saliency of visible light images can be measured by global texture complexity, while the effective information in infrared images is often highly concentrated in local hotspot regions. Therefore, this embodiment of the invention uses a 9×9 local window to calculate thermal entropy, characterizing the uncertainty and information density of local thermal response, and generating an adaptive weight map accordingly to achieve feature recalibration. The method for adjusting the saliency of local thermal entropy specifically includes:
[0093] First, channel aggregation is performed on the input shallow features $F_2$ to obtain the heat response map $S$:
[0094] $S(i,j) = \frac{1}{C}\sum_{c=1}^{C} F_2(c,i,j)$;
[0095] where $C$ is the number of channels; a 9×9 neighborhood $\mathcal{N}(i,j)$ is taken centered at pixel position $(i,j)$, and the neighborhood response is normalized into a probability distribution using Softmax:
[0096] $p_{i,j}(u,v) = \frac{\exp(S(u,v))}{\sum_{(u',v') \in \mathcal{N}(i,j)} \exp(S(u',v'))}$, $(u,v) \in \mathcal{N}(i,j)$;
[0097] The local thermal entropy map $E$ at that location is defined as:
[0098] $E(i,j) = -\sum_{(u,v) \in \mathcal{N}(i,j)} p_{i,j}(u,v)\,\log\big(p_{i,j}(u,v) + \epsilon\big)$;
[0099] where $\epsilon$ is a small constant that prevents numerical instability; based on the local thermal entropy map $E$, the saliency weight $A$ is generated through a lightweight mapping:
[0100] $A = \sigma(\mathrm{Conv}_{1\times 1}(E))$;
[0101] where $\mathrm{Conv}_{1\times 1}$ is a 1×1 convolution and $\sigma$ is the Sigmoid function; finally, the input shallow features are recalibrated to obtain the local thermal entropy enhanced features $F_E$:
[0102] $F_E = F_2 \odot A$;
[0103] This process enables the network to adaptively emphasize, at the shallow stage, the areas where the local thermal response is more concentrated, thereby enhancing the saliency of the hotspot target.
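As an illustration of the windowed entropy computation, the following sketch evaluates the channel-mean response map, the Softmax distribution in each 9×9 window, and the resulting entropy on two toy inputs. It is written in plain Python under stated assumptions: windows are clipped at the image border, and the 1×1 convolution on the single-channel entropy map is reduced to a hypothetical scalar `scale` and `shift`.

```python
import math


def heat_response(features):
    """Channel-mean heat response map S from features shaped [C][H][W]."""
    c = len(features)
    h, w = len(features[0]), len(features[0][0])
    return [[sum(features[k][i][j] for k in range(c)) / c for j in range(w)]
            for i in range(h)]


def local_thermal_entropy(s, window=9, eps=1e-8):
    """Entropy of the softmax distribution inside each window of the map s.

    Windows are clipped at the image border (an assumption of this sketch).
    """
    h, w = len(s), len(s[0])
    r = window // 2
    entropy = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            vals = [s[u][v]
                    for u in range(max(0, i - r), min(h, i + r + 1))
                    for v in range(max(0, j - r), min(w, j + r + 1))]
            peak = max(vals)  # subtract the max for a numerically stable softmax
            exps = [math.exp(v - peak) for v in vals]
            total = sum(exps)
            entropy[i][j] = -sum((e / total) * math.log(e / total + eps)
                                 for e in exps)
    return entropy


def saliency_weight(entropy, scale, shift):
    """Sigmoid of a 1x1 convolution; on a one-channel map that is scale * E + shift."""
    return [[1.0 / (1.0 + math.exp(-(scale * e + shift))) for e in row]
            for row in entropy]


flat = [[[0.0] * 9 for _ in range(9)] for _ in range(2)]   # uniform background
spot = [[[0.0] * 9 for _ in range(9)] for _ in range(2)]
for ch in spot:
    ch[4][4] = 20.0                                        # one strong hot spot
e_flat = local_thermal_entropy(heat_response(flat))
e_spot = local_thermal_entropy(heat_response(spot))
w_flat = saliency_weight(e_flat, scale=-1.0, shift=2.0)
w_spot = saliency_weight(e_spot, scale=-1.0, shift=2.0)
```

On the uniform map the center entropy approaches log(81), the maximum for a 9×9 window, while a concentrated hot spot drives it toward zero. With a negative `scale`, as assumed here, low entropy maps to a higher saliency weight, matching the stated aim of emphasizing concentrated thermal responses; in the trained network the sign and magnitude would be learned.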
[0104] The method for enhancing the thermal radiation contrast specifically includes:
[0105] The separability of infrared small targets mainly comes from the relative difference between the target hotspots and the background thermal field, rather than from clear edge contours. To enhance this difference, the embodiments of the present invention introduce a thermal radiation contrast operator that applies center-background difference modeling to the input shallow features $F_2$ to enhance the relative contrast of the hotspots:
[0106] $T = \psi\big(\mathrm{AvgPool}_{k_1}(F_2) - \mathrm{AvgPool}_{k_2}(F_2)\big)$;
[0107] where $T$ is the contrast-enhanced feature; $\mathrm{AvgPool}_{k}(\cdot)$ represents the average pooling operation with a kernel size of $k$; $k_1 < k_2$ (for example $k_1 = 5$, $k_2 = 15$), so that the two pooled terms correspond to the mean responses of the local center and of a larger background range, respectively; and $\psi(\cdot)$ represents a lightweight non-linear mapping (1×1 Conv + BN + SiLU); this difference form highlights the thermal contrast of local hotspots relative to the surrounding background thermal field, thereby enhancing the detectability and localization stability of small target hotspots.
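The center-background difference can be sketched on a single-channel map as follows. This is a simplified illustration rather than the embodiment's operator: $\psi$ is taken as identity, pooling uses stride 1 with windows clipped at the border (an assumption), and the 21×21 toy map is hypothetical.

```python
def mean_filter(x, k):
    """Stride-1 k x k mean filter; windows are clipped at the border (assumption)."""
    h, w = len(x), len(x[0])
    r = k // 2
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            rows = range(max(0, i - r), min(h, i + r + 1))
            cols = range(max(0, j - r), min(w, j + r + 1))
            out[i][j] = (sum(x[u][v] for u in rows for v in cols)
                         / (len(rows) * len(cols)))
    return out


def thermal_contrast(x, k1=5, k2=15):
    """Center mean minus background mean; psi is taken as identity here."""
    center, background = mean_filter(x, k1), mean_filter(x, k2)
    return [[c - b for c, b in zip(crow, brow)]
            for crow, brow in zip(center, background)]


heat = [[0.0] * 21 for _ in range(21)]
for u in range(9, 12):
    for v in range(9, 12):
        heat[u][v] = 1.0      # 3x3 hot spot in the middle of a cold background
contrast = thermal_contrast(heat)
uniform = thermal_contrast([[5.0] * 21 for _ in range(21)])
```

At the hot spot center the 5×5 mean is 9/25 and the 15×15 mean is 9/225, so the contrast is 0.32, whereas a uniformly warm map yields zero everywhere. This is why the operator responds to relative hot spots rather than to absolute temperature.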
[0108] Considering that shallow features are sensitive to training stability, IR-SA-SFEM adopts a residual form: the input shallow features are fused with the local thermal entropy enhanced feature $F_E$ and the contrast enhanced feature $T$, and the enhanced feature $\hat{F}_2$ is output:
[0109] $\hat{F}_2 = F_2 + \gamma\,(F_E + T)$;
[0110] where $\gamma$ is a learnable scale coefficient used to balance the enhancement amplitude and avoid instability caused by over-strong modulation in the early training stage; since the output dimension is the same as that of $F_2$, IR-SA-SFEM can be embedded into the existing YOLOv8n feature stream as a plug-in enhancement unit without changing the interfaces of the subsequent Neck-Head or the end-to-end training paradigm.
[0111] In a preferred embodiment of the present invention, in an infrared anti-drone scenario, the background heat source distribution is complex and changes significantly with the environment. Building roofs, exposed ground areas, and lights can all generate continuous or instantaneous strong thermal responses, leading to abnormal activation of non-target areas in shallow high-resolution features. Meanwhile, drone targets often appear as small-scale local hotspots with blurred boundaries and weak texture and shape cues, so they are easily overwhelmed by background thermal noise, resulting in both false positives and false negatives. Traditional single-scale convolution or general attention recalibration strategies often struggle to suppress background noise without harming the target under infrared conditions, especially in high-resolution shallow feature paths where local high-frequency thermal noise significantly amplifies the risk of false positives. Therefore, this embodiment of the present invention proposes an infrared feature dynamic recalibration network IR-RFNet. Through a collaborative mechanism of multi-scale thermal context modeling, adaptive branch weight fusion (α), spatial high-frequency suppression gating (g), and residual steady-state output (γ), it increases the proportion of hotspot target signals and suppresses background thermal noise interference. The overall network structure is shown in Figure 3.
[0112] Let the input features be $F \in \mathbb{R}^{C \times H \times W}$. The output features keep the same size and number of channels as the input features, facilitating seamless integration with the subsequent Neck-Head. IR-RFNet consists of three parts: a multi-scale thermal context modeling branch, a branch weight adaptive fusion branch, and a spatial high-frequency suppression gating branch based on local high-pass energy normalization; it uses a residual output to ensure training stability.
[0113] Specifically, the method of multi-scale thermal context modeling and adaptive fusion of branch weights includes:
[0114] The separability of infrared small target hotspots comes not only from local brightness peaks but also from their relationship to the neighboring background thermal field. To use thermal background context at different scales simultaneously, IR-RFNet constructs $K$ depthwise separable dilated convolution branches in parallel on the same input feature $F$ ($K$ equals the number of elements in the dilation rate set, e.g., for dilation rates $d = \{1, 2, 3\}$, $K = 3$). Different dilation rates correspond to different receptive fields, which are used to perceive the relative differences between the target hotspot and the surrounding background thermal distribution at different spatial scales. This generates $K$ branch features, where a single branch feature $Y_i$ is computed as follows:
[0115] $Y_i = \mathrm{SiLU}\big(\mathrm{GN}(\mathrm{DWConv}_{d_i}(F))\big)$, $i = 1, \dots, K$;
[0116] where $\mathrm{DWConv}_{d_i}$ is the depthwise convolution with a dilation rate of $d_i$, $\mathrm{GN}$ is group normalization, and $\mathrm{SiLU}$ is the SiLU non-linear mapping;
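How the dilation rate widens the receptive field without adding weights can be seen in a one-dimensional sketch. The kernel size of 3 is an assumption (the text does not state it), and the helper names are hypothetical; the relation used, effective size = k + (k - 1)(d - 1), is the standard one for dilated convolution.

```python
def effective_kernel_size(k, dilation):
    """Span covered by a k-tap kernel whose taps are `dilation` samples apart."""
    return k + (k - 1) * (dilation - 1)


def dilated_conv1d(signal, kernel, dilation):
    """Same-size 1-D dilated correlation with zero padding."""
    half = len(kernel) // 2
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for t, wgt in enumerate(kernel):
            idx = i + (t - half) * dilation
            if 0 <= idx < len(signal):
                acc += wgt * signal[idx]
        out.append(acc)
    return out


impulse = [0.0] * 11
impulse[5] = 1.0          # a single hot sample
responses = {d: dilated_conv1d(impulse, [1.0, 1.0, 1.0], d) for d in (1, 2, 3)}
footprints = {d: [i for i, v in enumerate(r) if v != 0.0]
              for d, r in responses.items()}
spans = {d: effective_kernel_size(3, d) for d in (1, 2, 3)}
```

With three taps, dilation rates 1, 2 and 3 cover spans of 3, 5 and 7 samples, so each branch compares the hot spot with background at a different distance while keeping the same parameter count.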
[0117] Since the optimal receptive field scale varies between scenarios, IR-RFNet introduces a lightweight weight generator that adaptively generates the branch fusion weights from the current feature statistics. First, channel statistical aggregation of the features is performed to obtain the global description vector $s$:
[0118] $s = \mathrm{GAP}(F)$;
[0119] The branch weights (logits) are then obtained through a two-layer 1×1 convolution mapping and normalized into the fusion coefficients $\alpha$ using Softmax:
[0120] $\alpha = \mathrm{Softmax}(\phi(s))$, $\alpha \in \mathbb{R}^{K}$;
[0121] where $\phi(\cdot)$ represents a lightweight mapping (1×1 Conv + SiLU + 1×1 Conv); finally, the branch outputs $Y_i$ are fused according to the dynamic weights and compressed by a 1×1 convolution to obtain the recalibrated residual feature $R$:
[0122] $R = \mathrm{Conv}_{1\times 1}\Big(\sum_{i=1}^{K} \alpha_i Y_i\Big)$;
[0123] This strategy enables the network to adaptively select appropriate thermal context scales for different infrared background thermal fields, thereby improving the response consistency and localization stability of weak hot spots.
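The Softmax-weighted branch fusion can be sketched as below. To keep the example short, the weight generator is omitted and the logits are passed in directly, the final 1×1 convolution is left out, and each branch is a small single-channel map; all of these are simplifications for illustration only.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_branches(branches, logits):
    """Weighted sum of K same-sized 2-D branch maps with softmax weights alpha."""
    alpha = softmax(logits)
    h, w = len(branches[0]), len(branches[0][0])
    fused = [[sum(a * b[i][j] for a, b in zip(alpha, branches)) for j in range(w)]
             for i in range(h)]
    return fused, alpha


# Three toy branch outputs (K = 3), each a 1x2 map.
branches = [[[1.0, 2.0]], [[3.0, 4.0]], [[5.0, 6.0]]]
even, alpha_even = fuse_branches(branches, [0.0, 0.0, 0.0])
peaked, alpha_peaked = fuse_branches(branches, [50.0, 0.0, 0.0])
```

Equal logits average the branches, while one dominant logit effectively selects a single dilation scale. The weights always sum to 1, so the fused response stays on the same scale as the individual branches.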
[0124] The spatial high-frequency suppression gating and residual steady-state fusion specifically include:
[0125] Many false detections in infrared images originate from local high-frequency thermal noise or non-target texture hotspots, which often appear in shallow features as local responses with abnormally high high-pass energy. To suppress such non-target activations, IR-RFNet introduces a high-pass energy gating branch, which applies local mean filtering to the input feature $F$ to obtain a low-pass term $L$ and calculates the high-pass amplitude $H$:
[0126] $L = \mathrm{AvgPool}(F)$, $H = |F - L|$;
[0127] Channel aggregation is performed on $H$ and in-image mean normalization is applied to obtain the normalized high-pass map $H_n$:
[0128] $H_n = \frac{\mathrm{Mean}_{c}(H)}{\mathrm{Mean}_{hw}\big(\mathrm{Mean}_{c}(H)\big) + \epsilon}$;
[0129] where $\mathrm{Mean}_{c}(\cdot)$ denotes the mean along the channel dimension; $\mathrm{Mean}_{hw}(\cdot)$ denotes the mean over the two spatial dimensions; and $\epsilon$ is a very small constant that prevents a zero denominator;
[0130] Then, a gate map $g$ is generated using a Sigmoid mapping, where $g_{\min}$ is a gate floor term that ensures some residual signal is retained even against a high-noise background, so that small targets are not completely suppressed:
[0131] ;
[0132] where the threshold is a preset value (slightly greater than 1), and $t$ is a temperature coefficient that controls how soft or hard the gating is. This gating mechanism can effectively suppress high-frequency noise false alarms while maintaining the hot spot target response.
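A possible realization of the high-pass gate is sketched below. The text names the ingredients (a Sigmoid, a threshold slightly above 1, a temperature, and a floor) but the exact gate expression is not reproduced, so the form `g_min + (1 - g_min) * sigmoid((tau - H_n) / t)` used here is an assumption, as are the 3×3 mean filter and the parameter values `tau=1.2`, `t=0.25`, and `g_min=0.1`.

```python
import math


def sigmoid(x):
    """Sigmoid that avoids overflow for large negative arguments."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def mean_filter(x, k=3):
    """Stride-1 k x k mean filter; windows are clipped at the border (assumption)."""
    h, w = len(x), len(x[0])
    r = k // 2
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            rows = range(max(0, i - r), min(h, i + r + 1))
            cols = range(max(0, j - r), min(w, j + r + 1))
            out[i][j] = (sum(x[u][v] for u in rows for v in cols)
                         / (len(rows) * len(cols)))
    return out


def high_pass_gate(features, k=3, tau=1.2, t=0.25, g_min=0.1, eps=1e-6):
    """Gate map from normalized high-pass energy; features are [C][H][W]."""
    c = len(features)
    h, w = len(features[0]), len(features[0][0])
    hp = [[0.0] * w for _ in range(h)]
    for ch in features:
        low = mean_filter(ch, k)                      # low-pass term L
        for i in range(h):
            for j in range(w):
                hp[i][j] += abs(ch[i][j] - low[i][j]) / c   # channel mean of |F - L|
    mean_hp = sum(map(sum, hp)) / (h * w)             # in-image spatial mean
    # Assumed gate form: floor g_min plus a sigmoid that closes above the threshold.
    return [[g_min + (1.0 - g_min) * sigmoid((tau - v / (mean_hp + eps)) / t)
             for v in row] for row in hp]


noisy = [[[0.0] * 9 for _ in range(9)]]
noisy[0][4][4] = 1.0                                  # isolated high-frequency spike
gate = high_pass_gate(noisy)
```

In this toy case the spike's normalized high-pass energy is far above the threshold, so the gate there falls to the floor `g_min` rather than to zero, while smooth regions keep a gate close to 1. The floor is what keeps a genuine small target from being erased when it happens to look like high-frequency energy.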
[0133] Furthermore, IR-RFNet can additionally apply a spatial focusing gate (1×1 Conv + Sigmoid) to assign higher weights to potential target regions in the spatial dimension, reducing the interference of background heat sources with the spatial response; finally, the recalibrated residual feature $R$ is gated to obtain the gated feature $R_g$:
[0134] $R_g = g \odot R$;
[0135] Considering that shallow and mid-layer features are sensitive to training stability, IR-RFNet outputs in residual form and introduces a learnable scaling factor $\gamma$ to control the enhancement magnitude, which avoids the loss fluctuations that over-emphasis in early training can cause; the output $F_{out}$ is:
[0136] $F_{out} = F + \gamma \cdot \mathrm{Clip}(R_g)$;
[0137] where $\mathrm{Clip}(\cdot)$ denotes the residual limiting operation that ensures numerical stability; $\gamma$ is a learnable scaling factor, initialized to a small value close to 0, and can be combined with a training warmup strategy to maintain an identity mapping in the first few iterations, thereby achieving a training paradigm of first converging stably and then gradually strengthening the enhancement.
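The effect of the scaling factor and the limiting operation can be checked with a few lines of plain Python. The symmetric clip bound `limit` is a hypothetical choice; the text only states that the residual is limited.

```python
def residual_output(features, gated_residual, gamma, limit=1.0):
    """F_out = F + gamma * clip(R_g, -limit, +limit), element-wise on 2-D maps."""
    return [[f + gamma * max(-limit, min(limit, r)) for f, r in zip(frow, rrow)]
            for frow, rrow in zip(features, gated_residual)]


f = [[0.5, -0.2], [1.0, 0.0]]
r_g = [[10.0, -10.0], [0.3, 0.0]]          # two entries exceed the clip bound
start = residual_output(f, r_g, gamma=0.0)  # gamma at initialization
later = residual_output(f, r_g, gamma=0.5)  # gamma after some training
```

With `gamma = 0` the module is an exact identity mapping, so inserting it does not disturb the pre-trained feature stream at the start of training. Once `gamma` grows, the clip bounds how far any single position can move.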
[0138] The IR-Adapter, IR-SA-SFEM, and IR-RFNet introduced in this invention all adhere to the principles of lightweight and pluggable design: without changing the YOLOv8n Backbone-Neck-Head single-stage paradigm and the detection head output interface, the size and number of channels of the output features remain consistent with the original feature flow. Therefore, they can be directly embedded into existing networks while maintaining the end-to-end training and deployment process unchanged. The increase in complexity mainly comes from a small number of low-parameter operators such as 1×1 convolutions / lightweight normalization and depthwise separable convolutions. Among them, the IR-Adapter only performs a learnable projection at the input end and adaptively fuses using scalar gating, and the parameters and computational overhead are negligible. The multi-scale branches of IR-RFNet use depthwise dilated convolutions and lightweight weight fusion, and the computational load increases linearly with the number of branches but remains controllable overall. IR-SA-SFEM mainly introduces local statistics and contrast enhancement in the P2 high-resolution path to improve the separability of ultra-small hot spots, and its additional overhead is mainly operator computation rather than parameter growth. Overall, the embodiments of the present invention, without changing the YOLOv8n detection head and training paradigm, achieve improvements in the saliency of infrared ultra-small targets and robustness against complex thermal backgrounds with a small parameter increment, thus meeting the requirements for real-time deployment.
[0139] Application Example: To meet the real-time deployment requirements of anti-drone scenarios, this embodiment of the invention selects YOLOv8n as the basic detection framework in practical engineering applications, keeping its single-stage end-to-end Backbone-Neck (FPN+PAN)-Head training paradigm and its three-scale detection head output interface unchanged (Detect outputs are P3, PAN-P4, and PAN-P5). On this basis, and considering the imaging characteristics of weak infrared hotspots, blurred boundaries, and strong background thermal interference, this embodiment of the invention embeds three modules (IR-Adapter, IR-SA-SFEM, and IR-RFNet) in a plug-in manner at the input end, the shallow high-resolution feature extraction stage, and the mid-level feature extraction stage, respectively, forming a structural enhancement network for infrared anti-drone tasks. The overall structure is shown in Figure 4.
[0140] In engineering implementations, infrared images are typically read in a three-channel format. However, the three channels are highly correlated and essentially form a redundant encoding of a single thermal intensity signal. To achieve infrared distribution alignment without modifying the input interface of the YOLOv8n pre-trained backbone, this embodiment places an IR-Adapter between the network input and the Stem; it applies a learnable projection to the input and adaptively fuses the result with a high-fidelity pass-through. Because the IR-Adapter outputs the same number of channels as the original network input (3 channels), the subsequent Stem, backbone structure, and pre-trained weights can be reused directly, which avoids the input adaptation bottleneck caused by hard-copying a single channel into three channels and improves stability in the early stages of training.
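The scalar-gated fusion described above can be illustrated with a minimal pure-Python sketch. It is not the patented implementation: the convex blend `g * projection + (1 - g) * input`, the SiLU activation inside the gate MLP, the omission of BN (which is affine at inference time), and all names (`ir_adapter`, `proj_w`, `gate_w1`, `gate_w2`) are assumptions made for illustration.

```python
import math

def sigmoid(x):
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def silu(x):
    return x * sigmoid(x)

def ir_adapter(image, proj_w, gate_w1, gate_w2):
    """image: C x H x W nested lists (C = 3 for the three-channel infrared input).
    proj_w: C x C weights of a 1x1 convolution (per-pixel channel mixing).
    gate_w1 / gate_w2: weights of a two-layer MLP that yields one scalar gate.
    Returns (output with the same shape as image, gate value g)."""
    c, h, w = len(image), len(image[0]), len(image[0][0])
    # Global average pooling: one intensity summary per channel.
    gap = [sum(map(sum, ch)) / (h * w) for ch in image]
    hidden = [silu(sum(wi * gi for wi, gi in zip(row, gap))) for row in gate_w1]
    g = sigmoid(sum(wi * hi for wi, hi in zip(gate_w2, hidden)))
    out = []
    for o in range(c):
        plane = []
        for y in range(h):
            row = []
            for x in range(w):
                # Learnable projection branch: 1x1 conv followed by SiLU.
                proj = silu(sum(proj_w[o][k] * image[k][y][x] for k in range(c)))
                # Assumed fusion: blend projection and pass-through with scalar g.
                row.append(g * proj + (1.0 - g) * image[o][y][x])
            plane.append(row)
        out.append(plane)
    return out, g
```

Under this assumed form, a gate close to 0 returns the input almost unchanged, which is one way to read the claim that the pre-trained Stem can be reused with stable early training.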
[0141] Infrared UAV targets often appear as extremely small-scale localized hot spots, whose fine-grained responses exist primarily in shallow, high-resolution features. To reduce the attenuation of hot spot information caused by downsampling, this embodiment of the invention inserts IR-SA-SFEM at the C2 output of the backbone (stride=4, 1/4 scale) to perform local thermal entropy saliency recalibration on the shallow features and to enhance thermal radiation contrast. The feature size and number of channels output by this module are consistent with C2, so the output can be fed directly into the subsequent backbone downsampling and C3 generation process, enhancing the separability of ultra-small hot spots at the source without introducing additional detection branches or interface modifications.
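To make the notion of a local thermal entropy map concrete, the following sketch follows the steps named in the claims (channel aggregation, a 9×9 neighborhood, Softmax normalization, entropy with a small constant). The channel mean as the aggregation, the truncated windows at image borders, and the function name `local_entropy_map` are assumptions; the learned 1×1 convolution that maps entropy to saliency weights is omitted.

```python
import math

def local_entropy_map(features, radius=4, eps=1e-6):
    """features: C x H x W nested lists. Returns an H x W entropy map.
    radius=4 gives the 9x9 neighborhood mentioned in the claims."""
    c, h, w = len(features), len(features[0]), len(features[0][0])
    # Channel aggregation (assumed: mean over channels) -> heat response map S.
    s = [[sum(features[k][y][x] for k in range(c)) / c for x in range(w)]
         for y in range(h)]
    e_map = []
    for y in range(h):
        row = []
        for x in range(w):
            # Neighborhood responses; windows are truncated at the borders.
            window = [s[v][u]
                      for v in range(max(0, y - radius), min(h, y + radius + 1))
                      for u in range(max(0, x - radius), min(w, x + radius + 1))]
            # Softmax over the window (shifted by the max for stability).
            m = max(window)
            exps = [math.exp(val - m) for val in window]
            z = sum(exps)
            # Shannon entropy of the local distribution.
            row.append(-sum((e / z) * math.log(e / z + eps) for e in exps))
        e_map.append(row)
    return e_map
```

A uniform neighborhood yields entropy close to log(81), while a neighborhood dominated by one hot pixel yields entropy close to 0, so the map separates concentrated hot spots from flat background before any learned weighting is applied.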
[0142] Complex thermal backgrounds (ground surface, buildings, lights, etc.) tend to produce non-target anomalous activations in mid-level features, leading to both false positives and false negatives. To increase the proportion of effective target signal, this embodiment of the invention embeds IR-RFNet at the C3 output of the Backbone (stride=8, 1/8 scale) to perform multi-scale thermal context modeling and high-frequency suppression gating on mid-level features. The output features of IR-RFNet are consistent with C3 in size and number of channels. They serve as the high-resolution input to the top-down FPN fusion of the Neck and also continue along the Backbone through subsequent downsampling to generate C4 and C5, improving the robustness of the entire network to complex thermal backgrounds without changing the Neck-Head structure.
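The high-frequency suppression gating can be sketched as follows. The low-pass term, the high-pass amplitude, and the in-image mean normalization follow the claim wording; the specific gate form `g_min + (1 - g_min) * sigmoid((tau - H_n) / temp)`, the 3×3 mean filter, and the default values of `tau`, `temp`, and `g_min` are assumptions chosen only to illustrate a gate with a floor, a threshold, and a temperature.

```python
import math

def sigmoid(x):
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def box_mean(plane, radius=1):
    """Local mean filter on one H x W plane (window truncated at borders)."""
    h, w = len(plane), len(plane[0])
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            vals = [plane[v][u]
                    for v in range(max(0, y - radius), min(h, y + radius + 1))
                    for u in range(max(0, x - radius), min(w, x + radius + 1))]
            row.append(sum(vals) / len(vals))
        out.append(row)
    return out

def high_freq_gate(features, tau=1.5, temp=0.25, g_min=0.2, eps=1e-6):
    """features: C x H x W nested lists. Returns an H x W gate in [g_min, 1]."""
    c, h, w = len(features), len(features[0]), len(features[0][0])
    lows = [box_mean(p) for p in features]            # low-pass term L
    # High-pass amplitude |F - L|, aggregated by the channel mean.
    hp = [[sum(abs(features[k][y][x] - lows[k][y][x]) for k in range(c)) / c
           for x in range(w)] for y in range(h)]
    mean_hp = sum(map(sum, hp)) / (h * w)             # in-image spatial mean
    gate = []
    for y in range(h):
        row = []
        for x in range(w):
            h_n = hp[y][x] / (mean_hp + eps)          # normalized high-pass value
            # Assumed gate: large h_n pushes the gate toward the floor g_min.
            row.append(g_min + (1.0 - g_min) * sigmoid((tau - h_n) / temp))
        gate.append(row)
    return gate
```

In the method described here the gate multiplies the recalibrated residual rather than the feature itself, so the floor `g_min` bounds how much of the residual correction can be removed at any position.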
[0143] In summary, since each module maintains the same output dimension as the original feature nodes, the Neck (FPN+PAN) and the three-scale Detect head do not need to be modified. The original YOLOv8n training script and deployment chain can be directly used to achieve a balance between accuracy improvement and real-time requirements.
[0144] To keep the optimization objective consistent with the baseline model, so that performance differences can be attributed primarily to the structural improvements, infrared training still uses YOLOv8's default multi-task joint loss. The total loss is a weighted sum of the bounding box regression loss $L_{box}$, the classification loss $L_{cls}$, and the distribution regression loss $L_{dfl}$ (the distribution focal loss of YOLOv8), as shown in the following formula:
[0145] $L = \lambda_{box} L_{box} + \lambda_{cls} L_{cls} + \lambda_{dfl} L_{dfl}$;
[0146] Here, the bounding box regression loss $L_{box}$ constrains the consistency between the predicted bounding box and the ground truth bounding box in position and scale. YOLOv8 typically uses an IoU-type regression loss to improve the stability and accuracy of localization convergence. For long-distance, small-scale UAVs, the target occupies few pixels and has weak boundaries, so even a slight box offset causes a significant change in overlap. The regression term is therefore more sensitive to localization accuracy in this task.
[0147] The classification loss $L_{cls}$ constrains the target discrimination capability of the predictions. In anti-drone scenarios, interference such as lights can easily induce high-confidence false alarms, and the classification term directly constrains such false alarms.
[0148] $L_{dfl}$ is the distribution regression loss. It extends boundary regression from single-point prediction to discrete distribution modeling and improves boundary localization quality through finer-grained regression supervision. This mechanism usually provides more stable training signals when the target scale is small, the boundary is blurred, or the feature representation is insufficient, which benefits localization precision and convergence quality.
[0149] The loss weight coefficients $\lambda_{box}$, $\lambda_{cls}$, and $\lambda_{dfl}$ in the embodiments of the present invention use the default YOLOv8 configuration and are kept consistent across all comparison and ablation experiments to ensure comparability and reproducibility of results.
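As a small worked illustration of the weighted composition, the sketch below combines three loss terms with fixed gains. The gain values 7.5, 0.5, and 1.5 are the defaults commonly shipped in the Ultralytics configuration and are stated here as an assumption to be checked against the installed version; `total_loss` and `DEFAULT_GAINS` are hypothetical names.

```python
# Assumed default gains (box / cls / dfl); verify against the Ultralytics
# configuration that is actually installed.
DEFAULT_GAINS = {"box": 7.5, "cls": 0.5, "dfl": 1.5}

def total_loss(l_box, l_cls, l_dfl, gains=DEFAULT_GAINS):
    """Weighted sum of the three loss terms of the joint detection loss."""
    return (gains["box"] * l_box
            + gains["cls"] * l_cls
            + gains["dfl"] * l_dfl)
```

Because the gains are held fixed across the baseline and all ablations, any change in the total loss between models comes from the individual terms and not from reweighting.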
[0150] The infrared detection training process is implemented based on Ultralytics YOLOv8, with a uniform input size of 640×640 and a training epoch count of 100. COCO pre-trained weights are used to initialize YOLOv8n as the starting point for transfer learning. At the implementation level, this embodiment does not modify the dataloader's image reading logic; therefore, infrared images enter the network in their original three-channel JPG format. The experimental platform uses a server environment, as shown in Table 1. To ensure that training actually runs for 100 epochs, this embodiment disables EarlyStopping (setting patience to 0) in the training configuration to avoid premature termination of training due to short-term fluctuations in the validation set.
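The training settings in this paragraph could be expressed roughly as below. This is a sketch, not the embodiment's script: the dataset file name `anti_uav_ir.yaml` is hypothetical, and the Ultralytics call is wrapped in a function so that the settings can be inspected without the package or the data being present.

```python
# Settings taken from the paragraph above; the dataset YAML name is hypothetical.
TRAIN_ARGS = {
    "data": "anti_uav_ir.yaml",  # hypothetical dataset description file
    "imgsz": 640,                # uniform 640x640 input size
    "epochs": 100,               # fixed number of training epochs
    "patience": 0,               # 0 disables early stopping, per the text
}

def run_training(weights="yolov8n.pt", args=TRAIN_ARGS):
    """Start training from COCO pre-trained YOLOv8n weights.
    Requires the ultralytics package and the dataset to be available."""
    from ultralytics import YOLO  # imported lazily so the settings load without it
    model = YOLO(weights)
    return model.train(**args)
```

Keeping the arguments in one dictionary makes it straightforward to reuse the identical configuration for the baseline and for each ablation run.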
[0151] Table 1 Infrared branch experimental platform and training parameter settings
[0152]
| Item | Configuration |
| --- | --- |
| Operating system / image | Ubuntu 22.04 (PyTorch image) |
| Python / PyTorch / CUDA | Python 3.12; PyTorch 2.8.0; CUDA 12.8 |
| Detection framework | Ultralytics YOLOv8 (v8.0.49) |
| GPU | NVIDIA RTX 4090 ×1 (24 GB VRAM) |
| CPU | Intel Xeon Platinum 8481C (25 vCPU) |
| Memory | 90 GB |
| Input image size | 640 × 640 |
[0153] The infrared detection experiments in this embodiment of the invention are based on the thermal infrared modality of the Anti-UAV (UAV310) dataset. For data partitioning, the infrared branch strictly follows the train / val / test partition files (train_ir.txt, val_ir.txt, test_ir.txt) provided with the official dataset and keeps the sets mutually exclusive at the video sequence level, which avoids overly optimistic evaluations caused by adjacent frames of one video falling into different sets. All samples are labeled in YOLO format, with the number of categories set to 1 (UAV). The sample size, the composition of positive and negative sample frames, and their proportions for each partition are shown in Table 2. Although the proportion of negative sample frames in the infrared data is low overall, false alarms accumulate over time in long-term online monitoring and create an alarm burden. Therefore, in addition to the overall detection metrics, this embodiment of the invention introduces a false alarm metric based on negative sample frames and a threshold scanning curve to evaluate the trade-off between false alarms and missed detections from an engineering deployment perspective.
[0154] Table 2 Infrared Data Division and Sample Size Statistics
[0155]
| Division | Number of images | Negative sample frames | Negative sample proportion | Positive sample frames | Number of annotation boxes |
| --- | --- | --- | --- | --- | --- |
| Train (train_ir.txt) | 74004 | 1010 | 1.36% | 72994 | 72994 |
| Val (val_ir.txt) | 9441 | 38 | 0.40% | 9403 | 9403 |
| Test (test_ir.txt) | 9802 | 36 | 0.37% | 9766 | 9766 |
[0156] To characterize the differences in the scale composition of infrared targets and the difficulty of different partitions, this embodiment of the invention further statistically analyzes the pixel scale (max-side) and relative area (relarea) of the target bounding boxes in the training / val / test sets, as shown in Table 3. Overall, infrared targets are predominantly small to medium scale, but there are significant differences in the proportion of small-scale, low-proportion targets across different partitions. The validation set contains a higher proportion of extremely small targets, placing stricter demands on the model's robustness under conditions of weaker target signals and more easily interfered by background thermal noise. The test set has relatively more moderate target scales, making it suitable for a comprehensive test of model generalization and stability. Based on these data characteristics, subsequent experiments in this chapter, in addition to reporting overall Precision, Recall, and mAP, will further combine subset metrics based on scale partitioning and false alarm related measures to conduct more targeted verification and analysis of the roles of IR-Adapter, IR-SA-SFEM, and IR-RFNet.
[0157] Table 3 Infrared branch target scale statistics
[0158]
| Division | relarea mean | max-side<32px | max-side<64px | relarea<0.002 |
| --- | --- | --- | --- | --- |
| Train (train_ir.txt) | 0.005 | 11.24% | 72.90% | 7.11% |
| Val (val_ir.txt) | 0.004 | 32.81% | 78.97% | 31.99% |
| Test (test_ir.txt) | 0.005 | 25.11% | 80.29% | 9.92% |
[0159] To comprehensively characterize the behavior of infrared detection models in anti-drone scenarios, this invention employs three groups of metrics: overall detection performance, performance on scale-based and hard subsets, and negative sample false alarm control. It also uses confidence threshold scanning curves to depict the trade-off between false alarms and missed detections at the deployment operating point. In the infrared modality, drone targets often appear as weakly textured, low-contrast local hotspots, and background heat sources and thermal noise easily induce false detections. Under these data characteristics, overall mAP alone may not adequately reflect how much the structural improvements contribute to ultra-small hotspot detection and false alarm suppression. Therefore, this invention introduces scale subset metrics and negative sample false alarm metrics for supplementary evaluation, as detailed below:
[0160] I. Overall Detection Performance: Overall performance is calculated on the validation set using Precision (P), Recall (R), F1-score (F1), and mAP (mAP@0.5 and mAP@0.5:0.95). mAP@0.5 emphasizes detection capability, while mAP@0.5:0.95 is more sensitive to localization quality and regression stability. Since this task involves single-class UAV detection, mAP is equivalent to the average AP result for that class. To analyze the changes in false positives under different confidence thresholds, this embodiment of the invention simultaneously reports the PR curve as well as the Precision-Confidence, Recall-Confidence, and F1-Confidence curves, using the confidence threshold corresponding to the F1 peak as a reference operating point to aid in explaining the basis for selecting deployment thresholds.
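Selecting the reference operating point at the F1 peak can be sketched as follows. The function name `f1_operating_point` and the sample curve values in the usage note are hypothetical; in practice the precision and recall values would come from the Precision-Confidence and Recall-Confidence curves.

```python
def f1_operating_point(thresholds, precisions, recalls, eps=1e-16):
    """Return (threshold, f1, precision, recall) at the F1 peak.
    The three inputs are parallel sequences sampled over confidence thresholds."""
    best = None
    for t, p, r in zip(thresholds, precisions, recalls):
        f1 = 2.0 * p * r / (p + r + eps)  # eps guards against p = r = 0
        if best is None or f1 > best[1]:
            best = (t, f1, p, r)
    return best
```

For instance, with made-up samples `thresholds=[0.1, 0.3, 0.5, 0.7]`, `precisions=[0.6, 0.8, 0.9, 0.95]`, and `recalls=[0.95, 0.9, 0.7, 0.5]`, the peak falls at the threshold 0.3.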
[0161] II. Performance on the Small and Hard Subsets: According to the infrared target scale statistics, the infrared validation set contains a high proportion of small-scale and low-relative-area targets, which better represent the difficulty of detecting weak hotspots in real-world deployments. To evaluate the model's detection capability for ultra-small targets and hard samples more effectively, this embodiment constructs two subsets. The main threshold of the Small subset, which corresponds to the max-side<32px statistic in Table 3, is shown in the following formula:
[0162] $\max(w, h) < 32\ \text{px}$;
[0163] where w and h are the pixel width and height of the target bounding box at the original image scale, respectively. This threshold corresponds to minimal infrared hotspot targets and examines more sensitively the detection gains contributed by shallow enhancement modules such as IR-SA-SFEM. The looser max-side<64px statistic in Table 3 is used to align with a broader definition of small targets, which facilitates cross-comparison and verification of the robustness of the conclusions. The main threshold of the Hard subset is shown in the following formula:
[0164] $rel_{area} = \dfrac{box_{area}}{image_{area}} < 0.002$;
[0165] where $rel_{area}$, $box_{area}$, and $image_{area}$ denote the relative area of the target, the area of the target bounding box, and the area of the original image, respectively. The Hard subset emphasizes the difficult case in which the target occupies a very small proportion of the image, making it suitable for testing whether the background suppression recalibration module (IR-RFNet) reduces background interference without damaging the target.
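The subset membership test can be sketched as a small helper. The default thresholds of 32 px and 0.002 are taken from the column headers of Table 3 and are an assumption about the subset definitions; the function name `subset_flags` and the 640×512 image size in the usage note are hypothetical.

```python
def subset_flags(w, h, img_w, img_h, small_px=32, hard_relarea=0.002):
    """Classify one ground-truth box (pixel width w, height h at original scale).
    Thresholds default to the Table 3 columns (assumed subset definitions)."""
    max_side = max(w, h)
    rel_area = (w * h) / (img_w * img_h)  # box area relative to image area
    return {
        "small": max_side < small_px,
        "hard": rel_area < hard_relarea,
        "rel_area": rel_area,
    }
```

For example, a 20×16 box in a 640×512 image has a relative area of about 0.00098 and falls into both subsets, while a 60×40 box in the same image falls into neither.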
[0166] In the Small and Hard subsets, this embodiment of the invention calculates AP50 (and adds Recall@conf if necessary) and analyzes it together with the overall index to determine whether the structural improvement truly enhances the separability of weak targets and the localization quality of difficult samples, rather than just producing limited changes in the overall index.
[0167] III. Negative Sample False Alarm Measurement and Deployment Trade-offs ($FPPI_{neg}$ and operating curve): False alarms accumulate and create an alarm burden during long-term operation of anti-drone systems. Therefore, this embodiment explicitly evaluates the false detection level in target-free scenes. Let $N_{neg}$ be the number of negative sample frames in the validation set, and let $FP_{neg}(conf)$ be the total number of predicted bounding boxes produced on all negative sample frames at the confidence threshold conf. The negative sample FPPI is defined as follows:
[0168] $FPPI_{neg}(conf) = \dfrac{FP_{neg}(conf)}{N_{neg}}$;
[0169] This embodiment of the invention reports the single-point $FPPI_{neg}$ at a uniform threshold conf=0.25, which, together with Recall@conf, gives an intuitive view of the operating point between missed detections and false alarms. It also scans conf and plots the FPPI-Recall operating curve to compare the recall achievable by different models under the same false alarm constraint, which better serves threshold selection on the deployment side.
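The negative sample FPPI and the threshold scan can be computed as in the sketch below. It assumes that prediction confidences have already been collected per negative frame and that true-positive matching against ground truth was done beforehand; all function names are hypothetical.

```python
def fppi_neg(neg_frame_scores, conf):
    """False positives per negative frame at a confidence threshold.
    neg_frame_scores: one list of prediction confidences per negative frame;
    every prediction on a negative frame is a false positive."""
    n_neg = len(neg_frame_scores)
    if n_neg == 0:
        raise ValueError("at least one negative frame is required")
    fp = sum(1 for scores in neg_frame_scores for s in scores if s >= conf)
    return fp / n_neg

def recall_at(tp_scores, n_gt, conf):
    """Recall at a threshold, given confidences of predictions already
    matched to ground-truth boxes and the total number of ground truths."""
    return sum(1 for s in tp_scores if s >= conf) / n_gt

def operating_curve(neg_frame_scores, tp_scores, n_gt, confs):
    """Scan thresholds and return (conf, FPPI_neg, recall) triples."""
    return [(c, fppi_neg(neg_frame_scores, c), recall_at(tp_scores, n_gt, c))
            for c in confs]
```

Comparing two models then amounts to reading the recall of each curve at the same FPPI value, which matches the deployment question of how many targets are kept for a given alarm budget.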
[0170] In summary, this invention addresses the challenges of single-channel information, weak texture, ultra-small targets appearing as hot spots, and strong background thermal interference in infrared imaging for anti-drone scenarios. Without altering the YOLOv8n baseline or the end-to-end training paradigm, it proposes three lightweight infrared-specific enhancement modules: IR-Adapter, for input distribution alignment and stable transfer of the pre-trained backbone; IR-SA-SFEM, which strengthens the representation of ultra-small hot spots through local thermal entropy saliency recalibration and thermal radiation contrast enhancement in the P2 high-resolution path; and IR-RFNet, which suppresses complex thermal background interference through dynamic recalibration with channel and spatial branches in the middle layer. Experimental results demonstrate that the method provided in this invention significantly improves the performance and robustness of infrared ultra-small target detection while keeping parameter and computational overhead controllable, and that it is feasible for real-time deployment.
[0171] In another embodiment of the present invention, a structure enhancement system for anti-drone infrared imaging is also provided, for implementing the above-described structure enhancement method for anti-drone infrared imaging, specifically including:
[0172] The adaptive fusion module is configured to input the original infrared three-channel image into the infrared single-channel adaptive projection module IR-Adapter and, through adaptive fusion of the learnable projection branch, the fidelity pass-through branch, and global statistical gating, output infrared adaptation features aligned with the input distribution of the pre-trained backbone network.
[0173] The feature enhancement module is configured to use the local-thermal-entropy-driven ultra-small target hotspot enhancement module IR-SA-SFEM, which takes the infrared adaptation features as input and outputs enhanced features through local thermal entropy saliency adjustment and thermal radiation contrast enhancement.
[0174] The feature dynamic recalibration module is configured to use the infrared feature dynamic recalibration network IR-RFNet, which takes the enhanced features as input and, through multi-scale thermal context modeling, branch weight adaptive fusion, spatial high-frequency suppression gating, and residual steady-state fusion, outputs dynamically recalibrated features with suppressed background thermal interference.
[0175] The result output module is configured to determine the detection results of UAV targets in the infrared image based on the dynamically recalibrated features.
[0176] It should be noted that each of the above modules can be implemented as a computer program, which can run on a computer device. The computer device's memory can store the computer program that makes up each module, enabling the processor to execute each step of the above method.
[0177] It should be understood that although the steps in the flowcharts of the embodiments of the present invention are shown sequentially according to the arrows, these steps are not necessarily executed in the order indicated by the arrows. Unless explicitly stated herein, there is no strict order restriction on the execution of these steps, and they can be executed in other orders. Moreover, at least some steps in each embodiment may include multiple sub-steps or multiple stages. These sub-steps or stages are not necessarily completed at the same time, but can be executed at different times. The execution order of these sub-steps or stages is not necessarily sequential, but can be performed alternately or in turn with other steps or at least a portion of the sub-steps or stages of other steps.
[0178] Those skilled in the art will understand that all or part of the processes in the above embodiments can be implemented by a computer program instructing related hardware. The program can be stored in a non-volatile computer-readable storage medium, and when executed, it can include the processes of the embodiments of the above methods.
[0179] The above embodiments merely illustrate several implementation methods of the present invention, and their descriptions are relatively specific and detailed, but they should not be construed as limiting the scope of the present invention. It should be noted that those skilled in the art can make various modifications and improvements without departing from the concept of the present invention, and these all fall within the protection scope of the present invention. Therefore, the protection scope of this patent should be determined by the appended claims.
Claims
1. A structural enhancement method for anti-UAV infrared imaging, characterized in that it includes the following steps: inputting the original infrared three-channel image into the infrared single-channel adaptive projection module IR-Adapter and, through adaptive fusion of a learnable projection branch, a fidelity pass-through branch, and global statistical gating, outputting infrared adaptation features aligned with the input distribution of the pre-trained backbone network; using the local-thermal-entropy-driven ultra-small target hotspot enhancement module IR-SA-SFEM, taking the infrared adaptation features as input and outputting enhanced features through local thermal entropy saliency adjustment and thermal radiation contrast enhancement; using the infrared feature dynamic recalibration network IR-RFNet, taking the enhanced features as input and, through multi-scale thermal context modeling, branch weight adaptive fusion, spatial high-frequency suppression gating, and residual steady-state fusion, outputting dynamically recalibrated features with suppressed background thermal interference; and determining the detection results of UAV targets in the infrared image based on the dynamically recalibrated features.
2. The structural enhancement method for anti-UAV infrared imaging according to claim 1, characterized in that the learnable projection branch achieves cross-channel linear mixing through a 1×1 convolution and, together with BN and the SiLU activation function, outputs the projection feature $F_0$: $F_0 = \mathrm{SiLU}(\mathrm{BN}(\mathrm{Conv}_{1\times1}(X)))$; where X is the input; $\mathrm{SiLU}(\cdot)$ denotes the SiLU nonlinear activation; $F_0 \in \mathbb{R}^{C_h \times H \times W}$, where $C_h$ is the number of channels after projection, and H and W are the height and width of the original infrared three-channel image, respectively. The fidelity pass-through branch directly takes the original infrared three-channel image: $I_{id} = X$.
3. The structural enhancement method for anti-UAV infrared imaging according to claim 2, characterized in that the global statistical gating extracts a channel-level intensity summary of the input image through global average pooling and outputs a scalar coefficient g via a lightweight two-layer MLP and a Sigmoid function: $g = \sigma(\mathrm{MLP}(\mathrm{GAP}(X)))$; where $\mathrm{GAP}(\cdot)$ denotes global average pooling, $\mathrm{MLP}(\cdot)$ is a lightweight two-layer perceptron, and $\sigma(\cdot)$ is the Sigmoid function. The infrared adaptation feature Y output by the infrared single-channel adaptive projection module IR-Adapter is obtained by adaptively fusing the projection feature $F_0$ and the pass-through feature $I_{id}$ under the scalar coefficient g.
4. The structural enhancement method for anti-UAV infrared imaging according to claim 2, characterized in that the local thermal entropy saliency adjustment specifically includes: let the input shallow feature be $F \in \mathbb{R}^{C \times H \times W}$; channel aggregation is performed on the input shallow feature to obtain the heat response map S, where C is the number of channels; a 9×9 neighborhood $\Omega(i,j)$ is taken centered at pixel position (i,j), and the neighborhood responses are normalized into a probability distribution using Softmax: $p_{u,v} = \dfrac{\exp(S_{u,v})}{\sum_{(m,n)\in\Omega(i,j)} \exp(S_{m,n})}$, $(u,v)\in\Omega(i,j)$; the local thermal entropy map E at that position is defined as: $E(i,j) = -\sum_{(u,v)\in\Omega(i,j)} p_{u,v}\log(p_{u,v}+\varepsilon)$; where ε is a constant; based on the local thermal entropy map E, saliency weights are generated through a lightweight mapping: $A = \sigma(\phi(E))$; where $\phi(\cdot)$ is a 1×1 convolution and $\sigma(\cdot)$ is the Sigmoid function; the saliency weights A are used to recalibrate the input shallow feature, yielding the local thermal entropy enhanced feature.
5. The structural enhancement method for anti-UAV infrared imaging according to claim 4, characterized in that the thermal radiation contrast enhancement specifically includes: for the input shallow feature, center-background difference modeling is used to enhance the relative contrast of hotspots: $T = \psi(\mathrm{AvgPool}_{k_1}(F) - \mathrm{AvgPool}_{k_2}(F))$; where T is the contrast enhancement feature; $\mathrm{AvgPool}_{k}(\cdot)$ denotes the average pooling operation with kernel size k; $k_1 < k_2$, corresponding to the mean response of the local center and of a larger background range, respectively; and ψ(·) is a lightweight nonlinear mapping; the input shallow feature is fused with the local thermal entropy enhanced feature and the contrast enhancement feature in residual form and then output.
6. The structural enhancement method for anti-UAV infrared imaging according to claim 5, characterized in that the multi-scale thermal context modeling and branch weight adaptive fusion specifically include: let the input feature be $F \in \mathbb{R}^{C \times H \times W}$, with the output feature keeping the same size and number of channels as the input feature; K depthwise separable dilated convolution branches are constructed in parallel on the same input feature F to perceive the relative differences between the target hotspot and the surrounding background thermal distribution at different spatial scales, generating K branch features $Y_i$: $Y_i = \mathrm{SiLU}(\mathrm{GN}(\mathrm{DWConv}_{d_i}(F)))$, $i = 1, \dots, K$; where $\mathrm{DWConv}_{d_i}$ is a depthwise convolution with dilation rate $d_i$, $\mathrm{GN}$ is group normalization, and $\mathrm{SiLU}$ is the SiLU nonlinear mapping; based on a lightweight weight generator, the branch fusion weights are generated adaptively from the current feature statistics: first, channel statistical aggregation of the feature yields the global description vector s; the branch weights (logits) are then obtained through two layers of 1×1 convolution mapping and normalized into the fusion coefficients $\alpha_i$ using Softmax: $\alpha = \mathrm{Softmax}(\phi_w(s))$; where $\phi_w(\cdot)$ denotes the lightweight mapping; finally, the branch outputs are fused according to the dynamic weights and compressed through a 1×1 convolution to obtain the recalibrated residual feature R: $R = \mathrm{Conv}_{1\times1}\left(\sum_{i=1}^{K} \alpha_i Y_i\right)$.
7. The structural enhancement method for anti-UAV infrared imaging according to claim 6, characterized in that the spatial high-frequency suppression gating and residual steady-state fusion specifically include: local mean filtering is applied to the input feature F to obtain the low-pass term L, and the high-pass amplitude H is calculated: $L = \mathrm{AvgPool}(F)$, $H = |F - L|$; channel aggregation is performed on H and in-image mean normalization is applied to obtain the normalized high-pass map $H_n$: $H_n = \dfrac{\mathrm{Mean}_c(H)}{\mathrm{Mean}_{H,W}(\mathrm{Mean}_c(H)) + \varepsilon}$; where $\mathrm{Mean}_c(\cdot)$ denotes mean aggregation along the channel dimension, $\mathrm{Mean}_{H,W}(\cdot)$ denotes the mean over the spatial dimensions H and W, and ε is a very small constant that prevents a zero denominator; a gating map G is then generated from $H_n$ using a Sigmoid mapping, where $g_{min}$ is the gate floor term, τ is a preset threshold, and t is a temperature coefficient that controls the degree of gating; the gated recalibrated residual feature $\tilde{R}$ is obtained as: $\tilde{R} = G \odot R$; the output takes residual form, with a learnable scaling factor controlling the amplification amplitude and a residual limiting operation applied to the residual term.
8. A structure enhancement system for anti-drone infrared imaging, used to implement the structure enhancement method for anti-drone infrared imaging as described in any one of claims 1-7, characterized in that it includes: an adaptive fusion module, configured to input the original infrared three-channel image into the infrared single-channel adaptive projection module IR-Adapter and, through adaptive fusion of a learnable projection branch, a fidelity pass-through branch, and global statistical gating, output infrared adaptation features aligned with the input distribution of the pre-trained backbone network; a feature enhancement module, configured to use the local-thermal-entropy-driven ultra-small target hotspot enhancement module IR-SA-SFEM, which takes the infrared adaptation features as input and outputs enhanced features through local thermal entropy saliency adjustment and thermal radiation contrast enhancement; a feature dynamic recalibration module, configured to use the infrared feature dynamic recalibration network IR-RFNet, which takes the enhanced features as input and, through multi-scale thermal context modeling, branch weight adaptive fusion, spatial high-frequency suppression gating, and residual steady-state fusion, outputs dynamically recalibrated features with suppressed background thermal interference; and a result output module, configured to determine the detection results of UAV targets in the infrared image based on the dynamically recalibrated features.