Lightweight target detection method and network for low-contrast dim and weak extended target

Through the collaborative design of LECA, DSFPN, and MAIM modules, the problem of insufficient feature extraction in low-contrast, weak extended target detection is solved, achieving efficient and stable detection results, suitable for scenarios such as infrared sensing and nighttime surveillance.

CN122243845A · Pending · Publication Date: 2026-06-19 · CHONGQING UNIV OF TECH

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Applications (China)
Current Assignee / Owner: CHONGQING UNIV OF TECH
Filing Date: 2026-04-16
Publication Date: 2026-06-19

AI Technical Summary

Technical Problem

Existing technologies struggle to effectively extract features in low-contrast, dimly lit extended target detection, resulting in high false negative rates and inaccurate localization. In particular, they fail to adequately protect low-frequency features of extended target contours under lightweight constraints and are difficult to adapt to edge deployments.

Method used

The system employs the LECA dim enhancement channel attention module for channel and spatial co-enhancement, the DSFPN sparse dilated feature pyramid module for multi-level receptive field expansion, and the MAIM multi-scale adaptive fusion module for frequency decomposition and differentiated weighted fusion. Target detection is then performed with a three-scale decoupled head trained using the CIoU loss function.

Benefits of technology

It improves the ability to perceive features of low-contrast targets, enhances the purity and anti-interference ability of feature extraction, optimizes detection accuracy and generalization, and adapts to the deployment requirements of edge computing devices.

✦ Generated by Eureka AI based on patent content.


Abstract

This invention discloses a lightweight target detection method and network for low-contrast, dim, and extended targets, belonging to the field of computer vision and target detection technology. Based on the YOLO11n architecture, a YOLO11-DET detection model is constructed. A LECA dim enhancement channel attention module is integrated at the backbone network entry point, performing dual-path parallel weighting of global brightness perception and local contrast perception on the input features to achieve channel and spatial collaborative enhancement. The neck network uses a DSFPN sparse dilated feature pyramid module, expanding the receptive field through multi-level dilated convolutions and filtering invalid background gradients with sparse gating. The cross-scale fusion node uses a MAIM multi-scale adaptive fusion module, decomposing features into high- and low-frequency components and implementing differentiated weighted fusion. The detection head adopts a three-scale decoupled structure, using a CIoU loss function with an added dim target perception regularization term to complete classification and regression. This invention can significantly improve the detection accuracy of low-contrast, dim, and extended targets, and can be widely applied in scenarios such as infrared early warning and nighttime surveillance.

Description

Technical Field

[0001] This invention relates to the field of computer vision and target detection, specifically to a lightweight target detection method and network for low-contrast, dim, extended targets.

Background Technology

[0002] Low-contrast, dimly lit extended target detection is a key technology in fields such as infrared early warning, nighttime surveillance, astronomical imaging, and UAV reconnaissance. These targets are characterized by extremely small grayscale differences, low signal-to-noise ratios, and planar / strip-like distributions. They are easily affected by insufficient lighting and atmospheric scattering, resulting in weak target saliency and high detection difficulty. Conventional deep learning detection algorithms struggle to reliably extract effective features, often exhibiting high false negative rates and inaccurate localization. YOLO series single-stage detection models are widely used due to their speed and accuracy advantages, but existing lightweight variants are mostly designed for conventional lighting scenarios. In dim environments, shallow feature activation is weak, and the receptive field of extended targets is insufficient, making it difficult to balance detection performance with the lightweight constraints of edge deployment.

[0003] To address the challenge of detecting targets in low light and dim conditions, academic research has proposed various improvement approaches. For example, NID-DETR combines nighttime enhancement with RTDETR to improve low-light detection accuracy; the FEMR algorithm uses feature enhancement and multi-scale receptive fields to optimize low-light feature representation; and AirFormer utilizes deformable attention to construct an extremely lightweight low-light target detection network. These findings have made some progress in weak feature extraction and noise suppression. However, these methods primarily focus on low-light enhancement or single-point weak targets, neglecting channel-space collaborative enhancement and customized receptive field design for low-contrast and extended, large-span targets. Furthermore, under lightweight constraints, they fail to adequately protect the low-frequency features of extended target contours, resulting in significant shortcomings in detection accuracy and generalization.

[0004] Chinese patent CN119625259A proposes an infrared weak target detection method, system, electronic device, and storage medium. Based on an improved YOLOv8 infrared weak target detection method, it enhances the weak target feature extraction capability through network structure optimization. Chinese patent CN113935984B discloses an infrared weak target detection method and system in complex backgrounds using multi-feature fusion, proposing a multi-feature fusion and adaptive detection strategy to suppress false alarms in complex backgrounds. Although these patents represent breakthroughs in weak target detection, they still have significant shortcomings: they lack an adaptive dual-path attention mechanism for dark areas, making it impossible to dynamically amplify features in low-contrast regions; they do not employ multi-level dilated convolution and sparse gating to collaboratively expand the receptive field, resulting in insufficient coverage of targets with large spans; they do not achieve contour feature differentiation protection through frequency decomposition, and their lightweight design fails to meet the stringent deployment requirements at edge devices.

Summary of the Invention

[0005] To address the aforementioned technical problems, this application discloses a lightweight target detection method and network for low-contrast, weak, extended targets; the lightweight target detection method for low-contrast, weak, extended targets includes:

[0006] Acquire a low-contrast image to be detected, and input the low-contrast image to be detected into a YOLO11-DET detection model based on the improved YOLO11n;

[0007] The LECA dim enhancement channel attention module at the entrance of the backbone network of the detection model performs dual-path parallel weighting, combining global brightness perception and local contrast perception, on the input feature map to obtain a feature map with channel and spatial co-enhancement.

[0008] The DSFPN sparse dilated feature pyramid module in the neck network of the detection model expands the receptive field of the co-enhanced feature map through multi-level dilated convolution and filters gradients through sparse gating, yielding a large-receptive-field anti-interference feature map.

[0009] The MAIM multi-scale adaptive fusion module of the cross-scale fusion node of the detection model performs frequency decomposition and component differential weighted fusion on the large receptive field anti-interference feature map to obtain multi-scale adaptive representation features.

[0010] The detection model uses a three-scale decoupled detection head and a CIoU loss function with an added dim target perception regularization term to perform target classification and bounding box regression on multi-scale adaptive representation features, outputting dim extended target detection results.

[0011] Preferably, the LECA dim enhancement channel attention module performs dual-path parallel processing on the input feature map, including:

[0012] Global average pooling is performed on the input feature map to generate channel descriptors. The channel descriptors are then input into a two-layer MLP structure with a preset dimensionality reduction ratio, and the channel gain coefficients are output after being constrained by the ReLU6 activation function.

[0013] A spatial local standard deviation is calculated on the input feature map using a fixed-size neighborhood window, and a spatially adaptive dark area enhancement mask is constructed based on the local standard deviation values.

[0014] The channel gain coefficients and the spatial adaptive dark area enhancement mask are broadcast dimension aligned and fused. Then, the fusion result is weighted element-wise with the original input feature map to output a feature map with channel and spatial co-enhancement.
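The channel branch described in these steps can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the patented implementation: the weight matrices `w1` and `w2`, the ReLU between the two layers, and the toy reduction ratio of 2 are hypothetical. The source specifies only global average pooling, a two-layer MLP with a preset reduction ratio, and a ReLU6 constraint on the output.

```python
def global_avg_pool(x):
    """x is a C x H x W nested list; returns one spatial mean per channel."""
    return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in x]

def relu(v):
    return [max(0.0, a) for a in v]

def relu6(v):
    # Clamp every coefficient to the interval [0, 6].
    return [min(6.0, max(0.0, a)) for a in v]

def matvec(w, v):
    return [sum(wij * vj for wij, vj in zip(row, v)) for row in w]

def channel_gains(x, w1, w2):
    z = global_avg_pool(x)              # channel descriptor
    hidden = relu(matvec(w1, z))        # assumed intermediate activation
    return relu6(matvec(w2, hidden))    # gains constrained to [0, 6]

# Toy input: C = 4 channels of size 2 x 2, reduced to 2 hidden units.
x = [
    [[0.1, 0.1], [0.1, 0.1]],
    [[0.5, 0.5], [0.5, 0.5]],
    [[1.0, 3.0], [1.0, 3.0]],
    [[4.0, 4.0], [4.0, 4.0]],
]
w1 = [[1.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 1.0]]       # hypothetical weights
w2 = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, 2.0]]  # hypothetical weights
gains = channel_gains(x, w1, w2)
```

With these toy weights the third gain is clamped to 0 and the fourth to 6, which shows how ReLU6 bounds the amplification applied to any single channel.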

[0015] Preferably, the DSFPN sparse dilated feature pyramid module performs receptive field expansion processing on the collaborative enhancement feature map, including:

[0016] Three parallel dilated convolutional branches are constructed in the neck network structure, and each branch is configured with an independent dilation rate parameter to form receptive field extraction capabilities at different scales.

[0017] Align the feature maps output from each dilated convolution branch by channel dimension, and weight the three feature maps using learnable gating coefficients to obtain an intermediate feature map that fuses multi-scale receptive fields.
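The three-branch structure and gated fusion can be illustrated with a one-dimensional simplification. The sketch below is an assumption-laden toy: the all-ones kernel, the softmax normalisation of the gating coefficients, and the function names are hypothetical, and real branches would be 2-D convolutions with learned weights. It shows only how a larger dilation rate widens the span that a three-tap kernel covers.

```python
import math

def dilated_conv1d(signal, kernel, dilation):
    """Same-size 1-D dilated correlation with zero padding at the borders."""
    centre = (len(kernel) - 1) // 2
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for t, w in enumerate(kernel):
            j = i + (t - centre) * dilation   # taps are spaced `dilation` apart
            if 0 <= j < len(signal):
                acc += w * signal[j]
        out.append(acc)
    return out

def softmax(v):
    m = max(v)
    e = [math.exp(a - m) for a in v]
    s = sum(e)
    return [a / s for a in e]

def gated_fusion(branches, gate_logits):
    """Weighted sum of branch outputs; softmax gating is an assumption."""
    gates = softmax(gate_logits)
    n = len(branches[0])
    return [sum(g * b[i] for g, b in zip(gates, branches)) for i in range(n)]

signal = [0.0] * 4 + [1.0] + [0.0] * 4      # unit impulse, length 9
kernel = [1.0, 1.0, 1.0]                    # hypothetical 3-tap kernel
branches = [dilated_conv1d(signal, kernel, d) for d in (1, 2, 4)]
fused = gated_fusion(branches, [0.0, 0.0, 0.0])

# Positions that respond to the impulse in each branch.
support = [[i for i, v in enumerate(b) if v != 0.0] for b in branches]
```

Each branch responds at three positions, but those positions span 3, 5, and 9 samples for dilation rates 1, 2, and 4, while the number of kernel weights stays the same.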

[0018] Preferably, the DSFPN module performs sparse gated gradient filtering on the fused intermediate feature map, including:

[0019] Calculate the feature response intensity for each spatial location in the intermediate feature map, and calculate the statistical threshold based on the global feature response distribution;

[0020] The feature response intensity at each location is compared with a statistical threshold, and the gradient backpropagation path is retained only for locations where the feature response intensity is higher than the statistical threshold.

[0021] Gradient backpropagation is shielded at locations where the feature response intensity is below the statistical threshold, thereby achieving invalid gradient filtering in low signal-to-noise ratio background regions.
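The thresholding logic of these steps can be sketched as follows. This is a hedged illustration only: the threshold form `mu + k * sigma` and the factor `k` are assumptions, since the text here states only that the threshold is derived from the global response distribution. The functions compute the mask that would decide where gradients are kept. They do not perform backpropagation themselves.

```python
import math

def response_intensity(fmap):
    """fmap: C x H x W nested lists -> H x W map of per-location L2 norms."""
    c, h, w = len(fmap), len(fmap[0]), len(fmap[0][0])
    return [[math.sqrt(sum(fmap[ch][i][j] ** 2 for ch in range(c)))
             for j in range(w)] for i in range(h)]

def sparse_gate_mask(intensity, k=0.0):
    """Return (mask, threshold); mask is 1 where the gradient path is kept."""
    flat = [v for row in intensity for v in row]
    mu = sum(flat) / len(flat)
    sigma = math.sqrt(sum((v - mu) ** 2 for v in flat) / len(flat))
    threshold = mu + k * sigma      # assumed form of the statistical threshold
    mask = [[1 if v > threshold else 0 for v in row] for row in intensity]
    return mask, threshold

# Toy feature map: 2 channels, 2 x 2 spatial grid, one strong location.
fmap = [
    [[3.0, 0.0], [0.0, 0.0]],
    [[4.0, 0.0], [1.0, 0.0]],
]
intensity = response_intensity(fmap)
mask, threshold = sparse_gate_mask(intensity, k=0.5)
```

In an autograd framework, such a mask would typically be applied so that masked-out locations still contribute to the forward pass but are detached from the gradient computation.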

[0022] Preferably, the MAIM multi-scale adaptive fusion module performs frequency decomposition processing on the large receptive field anti-interference feature map, including:

[0023] Fixed-size average pooling is performed on the input adjacent scale feature maps to extract the low-frequency components that characterize the overall contour of the extended target.

[0024] The original input feature map is compared with the corresponding low-frequency component to separate the high-frequency component that represents the target edge and local details.

[0025] The low-frequency components and high-frequency components are treated as independent feature branches and then fed into the subsequent differential weighting process.

[0026] Preferably, the MAIM module performs differential weighting processing on low-frequency components and high-frequency components, including:

[0027] Construct two independent lightweight attention branches, corresponding to the low-frequency component branch and the high-frequency component branch, respectively;

[0028] An activation threshold filtering mechanism is set for high-frequency component branches, and a suppression operation is performed on weak activation features below the threshold to filter out noise-dominated feature responses;

[0029] Expanding the weight control range for low-frequency component branches enhances the ability to represent the extended target contour features.

[0030] Preferably, the MAIM module performs feature recombination and channel compression after completing the differentiated weighting, including:

[0031] The low-frequency components and high-frequency components, after weighted modulation, are superimposed and fused to form a comprehensive feature map containing contour and detail information.

[0032] The comprehensive feature map is input into a 1×1 convolutional layer to perform channel dimension normalization, compressing the number of channels to a target dimension that matches the subsequent network structure;

[0033] The compressed feature map is used as a cross-scale fusion output and transmitted to the detection head to complete the classification and regression tasks.

[0034] A lightweight target detection network for low-contrast, dimly lit extended targets, based on an improved three-segment architecture of YOLO11n (backbone-neck-detection head), including:

[0035] The backbone network integrates the LECA dim enhancement channel attention module at the feature extraction entry point, which adaptively enhances the input image features in both the channel and spatial dimensions.

[0036] The neck network replaces the standard FPN structure with the DSFPN sparse dilated feature pyramid module, which performs multi-level receptive field expansion and background noise gradient suppression on the backbone output features.

[0037] The cross-scale fusion module replaces part of the original C3k2 structure with the MAIM multi-scale adaptive fusion module, which is used to perform frequency decomposition and differential component fusion on the neck output features.

[0038] The detection head adopts the YOLO11n three-scale decoupled detection head structure and is configured with a CIoU loss function with an added dim target perception regularization term.

[0039] Preferably, the LECA dim enhancement channel attention module is composed of a global brightness perception branch and a local contrast perception branch in parallel. The global brightness perception branch is used to generate channel dimension gain coefficients, and the local contrast perception branch is used to generate spatial dimension adaptive masks. The two outputs are broadcast fused and then used to perform weighted modulation on the input feature map.

[0040] Preferably, the DSFPN sparse dilated feature pyramid module is composed of multi-level dilated convolutional branches and sparse gating units cascaded together, and the MAIM multi-scale adaptive fusion module is composed of frequency decomposition unit, dual-branch independent attention unit and channel compression unit in sequence. Each module is connected to the network with a non-redundant structure and maintains the continuity and integrity of the forward propagation link.

[0041] Compared with the prior art, the technical solution of this application has the following technical effects:

[0042] This invention employs a dual-path parallel adaptive weighting mechanism to enhance the features of dark and weak regions in both channel and spatial dimensions during the initial feature extraction stage. This enables the model to stably capture effective representations of low-contrast targets, improves the perception and representation capabilities of weak signals, and ensures that features are not buried by background noise during network transmission.

[0043] This invention leverages the collaborative operation of multi-level dilated convolution and sparse gating to effectively expand the feature receptive field, fully covering the overall contour of targets with large spans. At the same time, it filters out invalid gradient interference in low signal-to-noise ratio regions, improving the purity and anti-interference ability of feature extraction and optimizing the model's perception completeness of targets in complex backgrounds.

[0044] This invention employs a frequency decomposition and differentiated weight allocation strategy to independently regulate and protect the low-frequency contour features and high-frequency detail features of the extended target, thereby enhancing the overall structural representation of the target and suppressing noise interference. This allows multi-scale feature fusion to better fit the characteristics of the dim extended target, improving the specificity and robustness of the feature representation.

[0045] This invention achieves overall network optimization through a lightweight modular design, improving detection accuracy and inference stability while maintaining the lightweight constraints of the model. It can adapt to the deployment requirements of edge computing devices and provide efficient and reliable extended target detection capabilities in dark and weak light for scenarios such as infrared sensing and night monitoring.

[0046] The above description is only an overview of the technical solution of this application. In order to better understand the technical means of this application and implement it in accordance with the contents of the specification, and to make the above and other objects, features and advantages of this application more obvious and understandable, the preferred embodiments of this application are described in detail below with reference to the accompanying drawings.

[0047] The above and other objects, advantages and features of this application will become more apparent to those skilled in the art from the following detailed description of specific embodiments in conjunction with the accompanying drawings.

Attached Figure Description

[0048] To more clearly illustrate the technical solutions in the embodiments of this application or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are some embodiments of this application. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort. In all drawings, similar elements or parts are generally identified by similar reference numerals. In the drawings, the elements or parts are not necessarily drawn to scale.

[0049] The figures are titled as follows:

[0050] Figure 1: Schematic diagram of the lightweight detection method for low-contrast, dim extended targets;

[0051] Figure 2: Schematic diagram of the structure and working principle of the LECA dim enhancement channel attention module;

[0052] Figure 3: Overall architecture of the improved YOLO11-DET lightweight target detection network;

[0053] Figure 4: Comparison of the detection performance of different models in dim extended target scenes.

Detailed Implementation

[0054] To make the objectives, technical solutions, and advantages of the embodiments of this application clearer, the technical solutions of the embodiments of this application will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of this application, not all embodiments. In the following description, specific details such as specific configurations and components are provided merely to help fully understand the embodiments of this application. Therefore, those skilled in the art should understand that various changes and modifications can be made to the embodiments described herein without departing from the scope and spirit of this application. In addition, for clarity and brevity, descriptions of known functions and structures are omitted in the embodiments.

[0055] It should be understood that the phrase "an embodiment" or "this embodiment" throughout the specification means that a specific feature, structure, or characteristic related to the embodiment is included in at least one embodiment of this application. Therefore, "an embodiment" or "this embodiment" appearing throughout the specification does not necessarily refer to the same embodiment. Furthermore, these specific features, structures, or characteristics can be combined in any suitable manner in one or more embodiments.

[0056] Furthermore, reference numerals and / or letters may be repeated in different examples within this application. Such repetition is for the purpose of simplification and clarity and does not in itself indicate a relationship between the various embodiments and / or settings discussed.

[0057] In this document, the term "and / or" merely describes the relationship between related objects and indicates that three relationships can exist. For example, A and / or B can mean: A exists alone, B exists alone, or A and B exist simultaneously. The term " / and" describes another type of relationship between related objects and indicates that two relationships can exist. For example, A / and B can mean: A exists alone, or A and B exist simultaneously. In addition, the character " / " generally indicates that the related objects before and after it are in an "or" relationship.

[0058] In this article, the term "at least one" is merely a description of the relationship between related objects, indicating that there can be three relationships. For example, "at least one of A and B" can mean: A exists alone, A and B exist simultaneously, or B exists alone.

[0059] It should also be noted that, in this document, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any such actual relationship or order between these entities or operations. Furthermore, the terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion.

[0060] Embodiment 1

[0061] This embodiment describes a lightweight target detection method for low-contrast, dim, extended targets. As shown in Figure 1 and Figure 2, it specifically includes:

[0062] Acquire a low-contrast image to be detected, and input the low-contrast image to be detected into a YOLO11-DET detection model based on the improved YOLO11n;

[0063] The LECA dim enhancement channel attention module at the entrance of the backbone network of the detection model performs dual-path parallel weighting, combining global brightness perception and local contrast perception, on the input feature map to obtain a feature map with channel and spatial co-enhancement.

[0064] The DSFPN sparse dilated feature pyramid module in the neck network of the detection model expands the receptive field of the co-enhanced feature map through multi-level dilated convolution and filters gradients through sparse gating, yielding a large-receptive-field anti-interference feature map.

[0065] The MAIM multi-scale adaptive fusion module of the cross-scale fusion node of the detection model performs frequency decomposition and component differential weighted fusion on the large receptive field anti-interference feature map to obtain multi-scale adaptive representation features.

[0066] The detection model uses a three-scale decoupled detection head and a CIoU loss function with an added dim target perception regularization term to perform target classification and bounding box regression on multi-scale adaptive representation features, outputting dim extended target detection results.

[0067] Before entering the model, each low-contrast image to be detected is normalized to a uniform size: the resolution is fixed to 640×640 and the image is converted to a standard tensor format. The image is then fed into the backbone network of the YOLO11-DET detection model. The backbone network retains the original multi-layer convolution, downsampling, and feature extraction structure of YOLO11n. To prevent low-contrast, dim targets from suffering excessively low activation values and signal annihilation during shallow feature extraction, the LECA dim enhancement channel attention module is integrated directly at the first-layer feature output of the backbone network. Shallow features are therefore enhanced in both the channel and spatial dimensions before they enter the deeper network, which raises the feature response of weak-signal regions and gives subsequent feature extraction, fusion, and detection a more stable foundation.
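The size normalization and tensor conversion described above can be sketched in plain Python. The nearest-neighbour interpolation, the division by 255, and the function names are assumptions made for illustration. The source states only that the image is fixed to 640×640 and converted to a standard tensor format.

```python
def resize_nearest(img, out_h, out_w):
    """img: H x W list of (R, G, B) tuples; nearest-neighbour resize."""
    in_h, in_w = len(img), len(img[0])
    return [[img[min(in_h - 1, i * in_h // out_h)][min(in_w - 1, j * in_w // out_w)]
             for j in range(out_w)] for i in range(out_h)]

def to_chw_float(img):
    """Convert H x W x 3 pixels in 0..255 to a 3 x H x W float layout in 0..1."""
    h, w = len(img), len(img[0])
    return [[[img[i][j][c] / 255.0 for j in range(w)] for i in range(h)]
            for c in range(3)]

def preprocess(img, size=640):
    return to_chw_float(resize_nearest(img, size, size))

# A hypothetical 2 x 2 image stands in for a real low-contrast frame.
tiny = [[(0, 0, 0), (255, 255, 255)],
        [(51, 102, 153), (255, 0, 0)]]
tensor = preprocess(tiny)
shape = (len(tensor), len(tensor[0]), len(tensor[0][0]))
```

A production pipeline would normally use an image library and may preserve the aspect ratio with padding. This sketch only fixes the output layout to 3×640×640.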

[0068] The input feature map of the LECA module is denoted $X \in \mathbb{R}^{C \times H \times W}$, where C is the number of channels in the input feature map, H is the feature map height, and W is the feature map width.

The global brightness perception branch first applies global average pooling to the input feature map, aggregating all feature information in the spatial dimension into a channel-wise descriptor vector:

$$z_c = \frac{1}{HW} \sum_{i=1}^{H} \sum_{j=1}^{W} X_c(i, j)$$

The channel descriptor $z$ is then passed through a two-layer MLP with a dimension reduction ratio of r = 16, and the ReLU6 activation function constrains the resulting channel gain coefficients to the interval [0, 6], which allows weakly activated channels to be enhanced significantly:

$$g = \mathrm{ReLU6}\big(\mathrm{MLP}_{r}(z)\big) \in [0, 6]^{C}$$

The local contrast perception branch uses a fixed-size 5×5 neighborhood window $\Omega(i, j)$ that slides over the feature map row by row and column by column. The standard deviation of the feature values within each window characterizes the local contrast and signal-to-noise ratio at the current location:

$$\sigma_c(i, j) = \sqrt{\frac{1}{|\Omega|} \sum_{(u, v) \in \Omega(i, j)} \big(X_c(u, v) - \mu_c(i, j)\big)^2}$$

In the formula, $\mu_c(i, j)$ is the mean of all features within the current 5×5 window and $|\Omega| = 25$. Based on the local standard deviation, a spatially adaptive dark area enhancement mask $M$ is generated whose spatial size matches that of the input feature map. The channel gain coefficients are broadcast along the spatial dimensions to match the dimensions of the mask tensor, and the two are multiplied to obtain the joint modulation weights. Finally, the original input feature map is weighted element-wise by the joint weights to obtain the output features $Y$:

[0069] $$Y = X \odot (g \otimes M)$$

[0070] where $\odot$ denotes element-wise multiplication and $\otimes$ denotes broadcast weighting. The total number of parameters in the LECA module is kept to 0.04M. The number of channels, height, and width of the feature map do not change at any point in the computation, so the output dimensions match those expected by the backbone network.
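The local contrast branch and the joint modulation can be sketched as follows. Several details are assumptions, because the text does not fix them: border windows are clipped to the valid region, the mapping from local standard deviation to mask (`enhancement_mask`) is a hypothetical placeholder, and the channel gains are passed in as plain numbers.

```python
import math

def local_std(channel, window=5):
    """Per-pixel standard deviation over a window x window neighbourhood.
    Border windows are clipped to the valid region (an assumption)."""
    h, w = len(channel), len(channel[0])
    r = window // 2
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            vals = [channel[u][v]
                    for u in range(max(0, i - r), min(h, i + r + 1))
                    for v in range(max(0, j - r), min(w, j + r + 1))]
            mu = sum(vals) / len(vals)
            out[i][j] = math.sqrt(sum((a - mu) ** 2 for a in vals) / len(vals))
    return out

def enhancement_mask(sigma, eps=1e-6):
    """Hypothetical mapping: 1 + sigma normalised by its maximum."""
    peak = max(max(row) for row in sigma)
    return [[1.0 + s / (peak + eps) for s in row] for row in sigma]

def leca_modulate(x, gains):
    """Y = X * (gain broadcast over space) * mask, channel by channel."""
    out = []
    for channel, g in zip(x, gains):
        mask = enhancement_mask(local_std(channel))
        out.append([[v * g * m for v, m in zip(row, mrow)]
                    for row, mrow in zip(channel, mask)])
    return out

flat = [[2.0] * 6 for _ in range(6)]                  # no local contrast
edge = [[0.0] * 3 + [1.0] * 3 for _ in range(6)]      # vertical step edge
y = leca_modulate([flat, edge], gains=[1.5, 2.0])
```

The output keeps the C×H×W shape of the input. In the flat channel only the channel gain acts, while in the edge channel locations near the step receive a larger weight than locations far from it under this placeholder mapping.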

[0071] The feature maps enhanced by the LECA module are passed along the original network path to the DSFPN sparse dilated feature pyramid module of the neck network. The neck network removes the original standard FPN structure and replaces it entirely with the DSFPN module.

The module expands the receptive field hierarchically through three parallel dilated convolution branches. All branches use 3×3 convolution kernels and differ only in their dilation rate, which is set to d = 1, d = 2, and d = 4, respectively. For a kernel size k and dilation rate d, the equivalent receptive field satisfies:

$$k_{\mathrm{eff}} = k + (k - 1)(d - 1)$$

The three branches therefore obtain equivalent receptive fields of 3×3, 5×5, and 9×9, expanding the receptive field progressively without increasing the number of convolution parameters. The three dilated convolutions process the input features independently and output three feature maps $F_1, F_2, F_3$ with identical channel counts, heights, and widths. These are weighted and fused using learnable gating coefficients $\alpha_n$:

$$F = \sum_{n=1}^{3} \alpha_n F_n$$

After fusion, the process proceeds to the sparse gated gradient filtering stage. For each spatial location $(i, j)$ in the fused feature map, the feature response intensity is calculated as an L2 norm over the channels:

$$s(i, j) = \lVert F(:, i, j) \rVert_2 = \sqrt{\sum_{c=1}^{C} F_c(i, j)^2}$$

The global mean μ and global standard deviation σ of the response intensity over the entire feature map are then calculated to construct a dynamic filtering threshold $T(\mu, \sigma)$, and gradient backpropagation paths are retained only for spatial locations where $s(i, j) > T(\mu, \sigma)$. For low signal-to-noise ratio background regions where the response intensity is below the threshold, gradient backpropagation is blocked directly to filter out invalid gradient interference. During the inference phase, the sparse gating mechanism degenerates into a static threshold mask and introduces no additional computation or inference delay.
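The receptive field arithmetic above can be checked with a few lines of Python. The relation used here is the standard one for dilated convolutions, k_eff = k + (k - 1)(d - 1). The helper names and the channel count of 64 are hypothetical.

```python
def effective_kernel(k, d):
    """Equivalent receptive field of a k x k kernel with dilation rate d."""
    return k + (k - 1) * (d - 1)

def conv_weight_count(k, c_in, c_out):
    """Weights of a k x k convolution (bias excluded); independent of dilation."""
    return k * k * c_in * c_out

sizes = [effective_kernel(3, d) for d in (1, 2, 4)]
weights_per_branch = conv_weight_count(3, 64, 64)   # 64 channels is an assumption
```

The three dilation rates give receptive fields of 3, 5, and 9, and the weight count depends only on the kernel size and channel counts, which is why dilation widens the receptive field without adding parameters.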

[0072] The large-receptive-field anti-interference features output by the DSFPN module are passed to the cross-scale fusion node, where the MAIM multi-scale adaptive fusion module replaces part of the original C3k2 module. The module simultaneously receives low-resolution, high-semantic features from the upper layer and high-resolution, high-detail features from the lower layer, and performs frequency decomposition on each feature individually. First, a 5×5 average pooling layer extracts the low-frequency contour component $F_{\mathrm{low}}$ from an input feature $F_{\mathrm{in}}$:

$$F_{\mathrm{low}} = \mathrm{AvgPool}_{5 \times 5}(F_{\mathrm{in}})$$

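[0073] Next, the low-frequency component is subtracted element-wise from the original feature to separate the high-frequency component containing edge and detail information:

$$F_{\mathrm{high}} = F_{\mathrm{in}} - F_{\mathrm{low}}$$

where $F_{\mathrm{in}}$ is the original input feature and $F_{\mathrm{low}}$ is its low-frequency component.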
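The decomposition into low- and high-frequency components can be sketched on a single channel. The stride of 1 and the clipping of border windows to the valid region are assumptions chosen so that the pooled map keeps the input size and the subtraction is well defined. The function names are hypothetical.

```python
def avg_pool_same(channel, window=5):
    """Stride-1 average pooling that keeps the H x W size.
    Border windows average only the valid pixels (an assumption)."""
    h, w = len(channel), len(channel[0])
    r = window // 2
    out = []
    for i in range(h):
        row = []
        for j in range(w):
            vals = [channel[u][v]
                    for u in range(max(0, i - r), min(h, i + r + 1))
                    for v in range(max(0, j - r), min(w, j + r + 1))]
            row.append(sum(vals) / len(vals))
        out.append(row)
    return out

def decompose(channel):
    """Split a channel into a smooth (low) part and a residual (high) part."""
    low = avg_pool_same(channel)
    high = [[x - l for x, l in zip(xr, lr)] for xr, lr in zip(channel, low)]
    return low, high

# 7 x 7 uniform background with one bright spike in the centre.
channel = [[1.0] * 7 for _ in range(7)]
channel[3][3] = 26.0
low, high = decompose(channel)
recon = [[l + hgh for l, hgh in zip(lr, hr)] for lr, hr in zip(low, high)]
```

The spike is spread thinly into the low-frequency map and concentrated in the high-frequency residual, and adding the two components recovers the input, so no information is lost by the split.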

[0074] The low-frequency and high-frequency components are fed into independent, lightweight attention branches that do not share parameters. The high-frequency branch has a fixed activation threshold $\tau$, and weak activation noise is filtered through an indicator function $\mathbb{1}(a > \tau)$, which equals 1 where the activation $a$ exceeds the threshold and 0 otherwise.

[0075] The low-frequency component branch expands the weight adjustment range and strengthens the representation of the extended target contour features. After the two feature paths have been differentially weighted, they are fused element-wise, and a 1×1 convolutional layer then compresses the number of channels to the standard dimension required by the subsequent network:

$$F_{\mathrm{out}} = \mathrm{Conv}_{1 \times 1}\big(\tilde{F}_{\mathrm{low}} + \tilde{F}_{\mathrm{high}}\big)$$

where $\tilde{F}_{\mathrm{low}}$ and $\tilde{F}_{\mathrm{high}}$ denote the low-frequency and high-frequency components after differential weighting.
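The differential weighting, fusion, and channel compression can be sketched as follows. The attention branches are replaced here by fixed scalar weights `w_low` and `w_high`, the suppression compares the absolute value of each high-frequency response with `tau`, and the 1×1 convolution is a hypothetical weight matrix applied per pixel. All of these are assumptions for illustration rather than details from the source.

```python
def suppress_weak(high, tau):
    """Zero out high-frequency responses whose magnitude does not exceed tau."""
    return [[[v if abs(v) > tau else 0.0 for v in row] for row in ch]
            for ch in high]

def conv1x1(fmap, weight):
    """weight: C_out x C_in matrix applied independently at every pixel."""
    c_in, h, w = len(fmap), len(fmap[0]), len(fmap[0][0])
    return [[[sum(weight[o][c] * fmap[c][i][j] for c in range(c_in))
              for j in range(w)] for i in range(h)]
            for o in range(len(weight))]

def maim_fuse(low, high, w_low, w_high, tau, weight):
    high = suppress_weak(high, tau)
    fused = [[[w_low * l + w_high * hv for l, hv in zip(lr, hr)]
              for lr, hr in zip(lc, hc)]
             for lc, hc in zip(low, high)]
    return conv1x1(fused, weight)

# Two channels of size 1 x 2, compressed to one output channel.
low = [[[1.0, 1.0]], [[2.0, 2.0]]]
high = [[[0.05, 0.8]], [[-0.6, 0.01]]]
out = maim_fuse(low, high, w_low=1.5, w_high=1.0, tau=0.1,
                weight=[[0.5, 0.5]])
```

Choosing `w_low` above 1 mimics the wider weight range given to the contour branch, while the two small high-frequency responses (0.05 and 0.01) are removed before fusion.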

[0076] The overall parameter count of the MAIM module is only 1/3 of that of a C3k2 module of the same specification, approximately 0.08M, which significantly improves the feature fusion effect while maintaining lightweight constraints.

[0077] The multi-scale adaptive representation features output by the MAIM module are aligned in the channel and spatial dimensions and then fed into a three-scale decoupled detection head. The detection head contains three independent feature layers, P3, P4, and P5, corresponding to the detection of small-scale, medium-scale, and large-scale extended targets, respectively. The classification branch and the bounding box regression branch are independent and do not share parameters. The overall loss function is based on the CIoU bounding box regression loss, with an added dim target perception regularization term that increases the model's attention to weak-signal regions:

$$L = L_{\mathrm{CIoU}} + \lambda R_{\mathrm{dim}}$$

[0078] In the formula, $\lambda$ is the regularization coefficient and $R_{\mathrm{dim}}$ is the dim target perception regularization term. The term is negatively correlated with the confidence of the predicted bounding box, so it applies an additional loss penalty to low-confidence regions. This forces the model to keep focusing on low-contrast, weak-response target regions during training, thereby improving target recall and localization accuracy.
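The loss structure can be sketched for a single predicted box. The CIoU part follows the commonly used definition (IoU minus a centre-distance penalty and an aspect-ratio penalty). The regularizer `1 - confidence` and the coefficient `lam=0.1` are purely hypothetical stand-ins, since the text states only that the term is negatively correlated with the predicted confidence.

```python
import math

def ciou_loss(pred, gt, eps=1e-9):
    """Boxes are (x1, y1, x2, y2). Returns 1 - CIoU."""
    px1, py1, px2, py2 = pred
    gx1, gy1, gx2, gy2 = gt
    pw, ph = px2 - px1, py2 - py1
    gw, gh = gx2 - gx1, gy2 - gy1
    iw = max(0.0, min(px2, gx2) - max(px1, gx1))
    ih = max(0.0, min(py2, gy2) - max(py1, gy1))
    inter = iw * ih
    iou = inter / (pw * ph + gw * gh - inter + eps)
    # Squared centre distance over squared diagonal of the enclosing box.
    cw = max(px2, gx2) - min(px1, gx1)
    ch = max(py2, gy2) - min(py1, gy1)
    c2 = cw ** 2 + ch ** 2 + eps
    rho2 = ((px1 + px2 - gx1 - gx2) ** 2 + (py1 + py2 - gy1 - gy2) ** 2) / 4.0
    # Aspect-ratio consistency term and its trade-off weight.
    v = (4.0 / math.pi ** 2) * (math.atan(gw / (gh + eps))
                                - math.atan(pw / (ph + eps))) ** 2
    alpha = v / (1.0 - iou + v + eps)
    return 1.0 - (iou - rho2 / c2 - alpha * v)

def dim_regularizer(confidence):
    """Hypothetical term that grows as predicted confidence falls."""
    return 1.0 - confidence

def total_loss(pred, gt, confidence, lam=0.1):
    return ciou_loss(pred, gt) + lam * dim_regularizer(confidence)

perfect = ciou_loss((0.0, 0.0, 2.0, 2.0), (0.0, 0.0, 2.0, 2.0))
shifted = ciou_loss((0.0, 0.0, 2.0, 2.0), (1.0, 1.0, 3.0, 3.0))
low_conf = total_loss((0.0, 0.0, 2.0, 2.0), (1.0, 1.0, 3.0, 3.0), confidence=0.2)
high_conf = total_loss((0.0, 0.0, 2.0, 2.0), (1.0, 1.0, 3.0, 3.0), confidence=0.9)
```

For the same box pair, the low-confidence prediction incurs the larger total loss, which is the behaviour the regularization term is described as providing.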

[0079] The lightweight target detection network for low-contrast, dim and weak extended targets improves on the three-stage backbone, neck, and detection head architecture of YOLO11n. The backbone network retains the original multi-level feature extraction structure and only attaches the LECA module after the first feature extraction layer; the input feature dimension is C×H×W, and the output feature dimension is identical. The neck network replaces the standard FPN structure entirely with the DSFPN module, keeping the input and output channel numbers and feature map sizes matched. The cross-scale fusion node replaces part of the original feature fusion structure with the MAIM module to achieve frequency-domain adaptive fusion of high-resolution and low-resolution features. The detection head keeps the three-scale decoupled structure, with independent classification and regression branches trained with the improved loss function. The forward propagation path of the network is continuous and complete: all modules are connected directly through tensors with strictly matched dimensions, so there are no feature size conflicts, no computational redundancy, and no structural breaks.

[0080] This implementation processes features progressively through dual-path adaptive weighting, multi-level dilated convolution, and frequency-decomposition-based differential fusion. It enhances weak activation features in low-contrast regions at the initial feature extraction stage, covers the overall contour of large-span targets, separates and preserves target contours and detail features, and filters out background noise and invalid gradients. Without relying on complex preprocessing, it stably improves the feature representation quality and detection completeness for dim targets, while keeping the computation lightweight throughout so that the method runs stably in low-computing-power environments.

[0081] Example 2 details a lightweight target detection network for low-contrast, dim extended targets. Its key feature is an improved three-stage architecture based on the YOLO11n backbone, neck, and detection head. As shown in Figure 3, it includes:

[0082] The backbone network integrates the LECA dim enhancement channel attention module at the feature extraction entry point, which adaptively enhances the input image features in both the channel and spatial dimensions.

[0083] The neck network uses the DSFPN sparse dilated feature pyramid module to replace the standard FPN structure, performing multi-level receptive field expansion and background noise gradient suppression on the backbone output features.

[0084] The cross-scale fusion module replaces part of the original C3k2 structure with the MAIM multi-scale adaptive fusion module, which performs frequency decomposition and differential component fusion on the neck output features.

[0085] The detection head adopts the YOLO11n three-scale decoupled detection head structure and is configured with a CIoU loss function that adds a dim target perception regularization term.

[0086] This network is based on the YOLO11n architecture and keeps the three-stage forward propagation structure of backbone, neck, and detection head, with feature dimensions matched between all modules. The backbone network is at the front of the architecture, and the LECA dim enhancement channel attention module is attached at the initial extraction position of the input features. This module consists of a global brightness perception branch and a local contrast perception branch in parallel. The global brightness perception branch computes global statistics of the feature responses along the channel dimension and allocates gain, generating channel-dimension gain coefficients. The local contrast perception branch computes the local contrast at each spatial location, generating a spatially adaptive mask. The two outputs are broadcast to align their dimensions and then jointly modulate the input feature map by weighting, completing weak signal enhancement without changing the feature map size.
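
The dual-path structure of LECA can be sketched as follows. The sketch draws on the details given in claim 2 (global average pooling, a two-layer MLP with a reduction ratio, ReLU6, and a local standard deviation over a fixed window). The ReLU between the MLP layers, the mask formula `1 / (1 + std)`, the averaging of the contrast map over channels, and the names `leca` and `local_std` are assumptions made for illustration only.

```python
import numpy as np

def local_std(x, k=3):
    """Standard deviation over a k x k window around every pixel. x: (C, H, W)."""
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)), mode="edge")
    win = np.lib.stride_tricks.sliding_window_view(xp, (k, k), axis=(1, 2))
    return win.std(axis=(-2, -1))  # (C, H, W)

def leca(x, w1, w2, k=3):
    """Hypothetical LECA sketch. x: (C, H, W); w1: (C // r, C); w2: (C, C // r)."""
    # Global brightness branch: channel descriptor -> two-layer MLP -> ReLU6 gain.
    desc = x.mean(axis=(1, 2))                 # global average pooling, (C,)
    hidden = np.maximum(w1 @ desc, 0.0)        # assumed ReLU in the hidden layer
    gain = np.clip(w2 @ hidden, 0.0, 6.0)      # ReLU6-constrained channel gain
    # Local contrast branch: low local contrast gives a mask value close to 1.
    contrast = local_std(x, k).mean(axis=0)    # (H, W), averaged over channels
    mask = 1.0 / (1.0 + contrast)              # assumed mask formula
    # Broadcast (C, 1, 1) * (1, H, W) -> (C, H, W), then modulate the input.
    weight = gain[:, None, None] * mask[None, :, :]
    return x * weight
```

The output has the same shape as the input, consistent with the statement that enhancement does not change the feature map size.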

[0087] The neck network, located after the backbone network, replaces the original standard FPN structure entirely with the DSFPN sparse dilated feature pyramid module. This module consists of multi-level dilated convolutional branches cascaded with sparse gating units. The dilated convolutional branches use different dilation rates to expand the receptive field level by level, widening the coverage of large-span extended targets without increasing the number of parameters. The sparse gating units, placed after the dilated convolutional branches, filter the feature response intensity statistically and block gradient propagation in low signal-to-noise ratio regions, achieving noise suppression and feature purification.
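
The two DSFPN mechanisms can be illustrated with a single-channel NumPy sketch. A k×k kernel with dilation rate d covers k + (k - 1)(d - 1) pixels per side while keeping the same k×k weights, which is why the receptive field grows without extra parameters. The dilation rates (1, 2, 3), the fixed gate values, the mean-plus-k-standard-deviations threshold, and all function names are assumptions; the text only states that three branches, learnable gates, and a statistical threshold are used.

```python
import numpy as np

def effective_kernel(k, d):
    """Side length covered by a k x k kernel with dilation rate d."""
    return k + (k - 1) * (d - 1)

def dilated_conv2d(x, kernel, d):
    """Single-channel dilated cross-correlation, zero-padded to keep the size."""
    k = kernel.shape[0]
    pad = d * (k // 2)
    xp = np.pad(x, pad)
    h, w = x.shape
    out = np.zeros((h, w), dtype=float)
    for i in range(k):
        for j in range(k):
            # Tap (i, j) reads the input shifted by (i - k // 2) * d rows and columns.
            out += kernel[i, j] * xp[i * d:i * d + h, j * d:j * d + w]
    return out

def dsfpn_branches(x, kernel, rates=(1, 2, 3), gates=(1 / 3, 1 / 3, 1 / 3)):
    """Gate-weighted sum of parallel dilated branches (rates and gates assumed)."""
    return sum(g * dilated_conv2d(x, kernel, d) for g, d in zip(gates, rates))

def sparse_gate_mask(feat, k=0.0):
    """1 where response intensity exceeds mean + k * std of the whole map, else 0."""
    intensity = np.abs(feat)
    thresh = intensity.mean() + k * intensity.std()
    return (intensity > thresh).astype(float)

def gated_backward(grad_out, mask):
    """Backward pass of the gate: gradients flow only through retained positions."""
    return grad_out * mask
```

In an autograd framework the same effect is usually obtained by detaching the masked-out positions; the explicit `gated_backward` here only makes the gradient shielding visible.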

[0088] The cross-scale fusion module is located between the neck network and the detection head. It replaces part of the C3k2 module in the original network with the MAIM multi-scale adaptive fusion module. This module consists of a frequency decomposition unit, a dual-branch independent attention unit, and a channel compression unit in sequence. The frequency decomposition unit is used to decompose the input features into low-frequency contour components and high-frequency detail components. The dual-branch independent attention unit applies independent weight control strategies to the low-frequency and high-frequency components respectively. The channel compression unit is used to compress the fused features to the standard number of channels required by the detection head, maintaining the computational compatibility of the subsequent network structure.
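
The frequency decomposition unit can be sketched as follows, using the fixed-size average pooling named in claim 5 to obtain the low-frequency component. Two points are assumptions: the pooling is taken with stride 1 and edge padding so that the size is preserved, and the high-frequency component is taken as the residual (input minus low-frequency component). The function name `frequency_split` is hypothetical.

```python
import numpy as np

def frequency_split(x, k=3):
    """Split x of shape (C, H, W) into low- and high-frequency components."""
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)), mode="edge")
    win = np.lib.stride_tricks.sliding_window_view(xp, (k, k), axis=(1, 2))
    low = win.mean(axis=(-2, -1))  # k x k average pooling, stride 1: contour component
    high = x - low                 # residual: edges and local detail
    return low, high
```

Because the high-frequency part is a residual, the two components sum back to the input exactly, so no information is lost before the two branches are weighted separately.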

[0089] The detection head is located at the end of the network and adopts the three-scale decoupled structure of YOLO11n, with detection branches at three scales, P3, P4, and P5, to handle extended targets of different spatial extent. The classification branch and the regression branch are independent of each other. The head is trained with a loss function that adds a dim target perception regularization term to CIoU, so that the network focuses on low-confidence weak target regions during training, which improves the detection stability for dim targets.
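
As a rough illustration of the three scales, the sketch below computes the grid size of each detection layer. It assumes the strides 8, 16, and 32 conventionally used for P3, P4, and P5 in YOLO-family detectors; the text does not state the strides or the training input size, so both are assumptions here, and `head_grids` is a hypothetical helper.

```python
def head_grids(img_size, strides=(8, 16, 32)):
    """Grid size of each detection layer for a square input (strides assumed)."""
    return {f"P{i + 3}": (img_size // s, img_size // s) for i, s in enumerate(strides)}

# For a 256 x 256 input such as a NUDT-SIRST image at native resolution:
grids = head_grids(256)
# grids == {"P3": (32, 32), "P4": (16, 16), "P5": (8, 8)}
```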

[0090] The entire network uses lightweight embedding of three types of modules: LECA, DSFPN, and MAIM. This ensures the integrity of the forward propagation link, eliminates redundant connections between modules, and prevents feature size conflicts, thus forming a complete detection architecture specifically designed for low-contrast, dim, and extended targets.

[0091] This embodiment details the construction of a dedicated detection network based on the improved YOLO11n architecture. Through modular embedding, it achieves weak feature enhancement, large receptive field extraction, and accurate fusion of multi-scale features. The overall structure is compact and compatible with edge deployment requirements. It can significantly improve the robustness of detecting low-contrast, dim, and large-span targets while keeping parameter and computational costs low. The forward propagation path is complete, and dimensions match between modules without redundancy. The network can stably output high-precision detection results and is suitable for real-world application scenarios such as infrared sensing and night monitoring.

[0092] Based on Example 1 or 2, this example describes in detail the experiments conducted on the following two public datasets.

[0093] Public Dataset 1: The NUDT-SIRST dataset contains 1,327 single-frame infrared images with pixel-by-pixel annotations, covering typical infrared weak targets such as drones, aircraft, and vehicles. Target sizes range from a few pixels to tens of pixels, and the grayscale contrast between the target and the background is generally less than 5%. The image resolution is 256×256, and both bounding box and segmentation mask annotations are provided. The small, weak targets in this dataset closely match the scenario studied here, and the dataset is one of the standard benchmarks in the field of infrared weak target detection.

[0094] Public Dataset 2: The ExDark dataset, compiled and released by Loh and Chan in 2018, contains 7,363 RGB images under low-light conditions, covering 10 lighting types (low light, ambient light, single point light source, diffused light, etc.), and includes 12 object categories, providing bounding box-level annotations. This dataset systematically covers a variety of low-light scenes from indoor dim light to outdoor nighttime, and is a core evaluation set in the field of low-light object detection.

[0095] Evaluation metrics adopted the standard COCO protocol: mAP@0.5, mAP@0.5:0.95, and Recall. Model efficiency metrics include parameter count, floating-point operations (GFLOPs), and inference frame rate (FPS).
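
The IoU thresholds behind these metrics can be made concrete with a short sketch. mAP@0.5:0.95 averages AP over ten IoU thresholds from 0.50 to 0.95 in steps of 0.05. The recall function below is a simplified illustration with greedy one-to-one matching; it is not the evaluation code used in the experiments, and its names are hypothetical.

```python
def iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def recall(preds, gts, thr=0.5):
    """Fraction of ground-truth boxes matched by a prediction with IoU >= thr."""
    matched = set()
    for p in preds:
        best, best_j = 0.0, None
        for j, g in enumerate(gts):
            if j in matched:
                continue  # each ground truth can be claimed only once
            v = iou(p, g)
            if v > best:
                best, best_j = v, j
        if best_j is not None and best >= thr:
            matched.add(best_j)
    return len(matched) / len(gts) if gts else 0.0

# The ten thresholds averaged by mAP@0.5:0.95.
COCO_IOU_THRESHOLDS = [round(0.5 + 0.05 * i, 2) for i in range(10)]
```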

[0096] To verify the performance of the proposed model, all experiments were conducted on a workstation equipped with an NVIDIA RTX 4090 GPU, using the PyTorch 2.0 deep learning framework with CUDA 11.6 acceleration. Hyperparameter settings are shown in Table 1. Data augmentation included Mosaic stitching, random horizontal flipping, HSV jitter (brightness V∈[0.3,1.0], a brightness range widened for low-light scenes), and random affine transformation. Each comparative experiment was trained independently three times under the same hyperparameters, and the average was taken as the final result.
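
The brightness part of the HSV augmentation in [0096] can be sketched as follows. The sketch reads V∈[0.3,1.0] as the range of a random multiplicative gain applied to the V channel, which is an interpretation rather than something the text states; the function name `jitter_value` is hypothetical.

```python
import numpy as np

def jitter_value(hsv, rng, v_range=(0.3, 1.0)):
    """Scale the V channel of an HSV image (H, W, 3, values in [0, 1]) by a random gain."""
    gain = rng.uniform(*v_range)   # assumed meaning of V in [0.3, 1.0]
    out = hsv.copy()
    out[..., 2] = np.clip(out[..., 2] * gain, 0.0, 1.0)
    return out, gain

rng = np.random.default_rng(0)
image = np.full((2, 2, 3), 0.8)
darker, gain = jitter_value(image, rng)  # hue and saturation are left unchanged
```

A gain below 1 darkens the image, which exposes the model to lower-contrast versions of the training scenes.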

[0097] Table 1 Hyperparameter Settings

[0098] Table 2 presents the performance comparison results of YOLO11n-DET and the control methods in the NUDT-SIRST and ExDark comprehensive evaluation.

[0099] Table 2 Comparative Experiment Results (NUDT-SIRST + ExDark Comprehensive Evaluation)

[0100] As shown in Table 2, compared with the baseline YOLO11n, YOLO11n-DET improves mAP@0.5 from 63.1% to 71.2% (+8.1 percentage points), mAP@0.5:0.95 from 39.4% to 42.8% (+3.4 percentage points), and recall from 62.0% to 70.5% (+8.5 percentage points). Compared with YOLOv8s, which has 11.2M parameters, the proposed method achieves similar detection accuracy with 27.7% of the parameters (about 3.1M), and its FPS is about 86% higher, a significant efficiency advantage. Compared with YOLOv8n, which has the same parameter scale, the proposed method is 5.9 percentage points higher in mAP@0.5, which supports the need for a design dedicated to dim targets.
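
The figures quoted in [0100] can be cross-checked with simple arithmetic. The snippet below uses only numbers stated in that paragraph; the parameter count and the YOLOv8n mAP@0.5 are values implied by those numbers, not values read from Table 2.

```python
baseline = {"mAP@0.5": 63.1, "mAP@0.5:0.95": 39.4, "Recall": 62.0}
improved = {"mAP@0.5": 71.2, "mAP@0.5:0.95": 42.8, "Recall": 70.5}

# Gains in percentage points over the YOLO11n baseline.
gains = {k: round(improved[k] - baseline[k], 1) for k in baseline}

# 27.7% of the 11.2M parameters of YOLOv8s, in millions.
params_m = round(0.277 * 11.2, 2)

# mAP@0.5 of YOLOv8n implied by the stated 5.9-point margin.
yolov8n_map50 = round(improved["mAP@0.5"] - 5.9, 1)
```

This gives gains of 8.1, 3.4, and 8.5 percentage points, about 3.1M parameters, and an implied YOLOv8n mAP@0.5 of 65.3%.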

[0101] To verify the independent contribution and synergistic effect of each module, this paper designed six ablation experiments, and the results are shown in Table 3.

[0102] Table 3 Ablation Experiment Results

[0103] The experimental results show that all three modules make stable, independent positive contributions. DSFPN yields the largest accuracy gain, raising mAP@0.5 by 4.4 percentage points, which indicates that effective expansion of the receptive field plays a crucial role in small target detection. LECA yields the largest recall gain, 3.7 percentage points, which indicates that its enhancement of weak features helps the model find more targets. When DSFPN is combined with LECA, model performance improves further, showing a synergistic gain; although this gain is smaller than the sum of the two individual gains, feature reconstruction and feature enhancement complement each other. Adding the MAIM module raises mAP@0.5 to 71.2%, validating the effectiveness of multi-module joint optimization in improving overall detection performance.

[0104] To evaluate the detection performance of the proposed method in complex environments intuitively, typical scenarios were selected to compare the detection results of the baseline model YOLO11n and the improved model YOLO11-DET, as shown in Figure 4. The comparison scenarios cover distant small targets, complex background occlusion, strong light interference, and low-contrast environments, and so reflect the robustness and generalization ability of the model.

[0105] The visualization in Figure 4 shows that the original YOLO11n produces a certain number of false detections and missed detections in scenarios with distant small targets, complex backgrounds, and significant lighting variations. For instance, when the target is small or obscured by the background, the model's response to the target is weak, which easily leads to unstable detection or localization errors. Under strong light or low contrast, the model's ability to extract target features decreases, resulting in lower detection confidence or even missed targets. In contrast, the improved YOLO11-DET detects more stably across these scenarios and mitigates the false detections and missed detections of the original model. Its response to target regions is more concentrated, and its detection results are more accurate and reliable. Robustness and detection stability in complex environments are significantly improved, validating the effectiveness of the proposed method.

[0106] The above are merely preferred embodiments of the present invention and are not intended to limit its scope of protection. Those skilled in the art may make various modifications and variations to the present invention. Any change, modification, substitution, integration, or parameter change made to these embodiments through conventional substitution or to achieve the same function, without departing from the spirit and principles of the present invention, falls within the scope of protection of the present invention.

Claims

1. A lightweight target detection method for low-contrast, dim and weak extended targets, characterized in that it includes: acquiring a low-contrast image to be detected and inputting it into a YOLO11-DET detection model based on an improved YOLO11n; performing, by the LECA dim enhancement channel attention module at the entry of the backbone network of the detection model, dual-path parallel weighting of global brightness perception and local contrast perception on the input feature map to obtain a channel and spatial co-enhanced feature map; performing, by the DSFPN sparse dilated feature pyramid module of the neck network of the detection model, multi-level dilated convolution receptive field expansion and sparse gating gradient filtering on the co-enhanced feature map to obtain a large receptive field anti-interference feature map; performing, by the MAIM multi-scale adaptive fusion module at the cross-scale fusion node of the detection model, frequency decomposition and differential weighted fusion of the components on the large receptive field anti-interference feature map to obtain multi-scale adaptive representation features; and performing, by the detection model using a three-scale decoupled detection head and a CIoU loss function with an added dim target perception regularization term, target classification and bounding box regression on the multi-scale adaptive representation features, and outputting the dim extended target detection results.

2. The method according to claim 1, characterized in that the dual-path parallel processing performed on the input feature map by the LECA dim enhancement channel attention module includes: performing global average pooling on the input feature map to generate channel descriptors; inputting the channel descriptors into a two-layer MLP structure with a preset dimensionality reduction ratio and outputting the channel gain coefficients after they are constrained by the ReLU6 activation function; calculating a spatial local standard deviation on the input feature map using a fixed-size neighborhood window, and constructing a spatially adaptive dark area enhancement mask based on the local standard deviation values; aligning the channel gain coefficients and the spatially adaptive dark area enhancement mask in dimension by broadcasting and fusing them; and weighting the original input feature map element-wise with the fusion result to output the channel and spatial co-enhanced feature map.

3. The method according to claim 2, characterized in that the receptive field expansion performed on the co-enhanced feature map by the DSFPN sparse dilated feature pyramid module includes: constructing three parallel dilated convolutional branches in the neck network structure, each branch being configured with an independent dilation rate parameter so as to extract features with receptive fields of different scales; and aligning the feature maps output by the dilated convolutional branches along the channel dimension and weighting the three feature maps with learnable gating coefficients to obtain an intermediate feature map that fuses multi-scale receptive fields.

4. The method according to claim 3, characterized in that, The DSFPN module performs sparse gated gradient filtering on the fused intermediate feature maps, including: Calculate the feature response intensity for each spatial location in the intermediate feature map, and calculate the statistical threshold based on the global feature response distribution; The feature response intensity at each location is compared with a statistical threshold, and the gradient backpropagation path is retained only for locations where the feature response intensity is higher than the statistical threshold. Gradient backpropagation is shielded at locations where the feature response intensity is below the statistical threshold, thereby achieving invalid gradient filtering in low signal-to-noise ratio background regions.

5. The method according to claim 4, characterized in that, The MAIM multi-scale adaptive fusion module performs frequency decomposition processing on the large receptive field anti-interference feature map, including: Fixed-size average pooling is performed on the input adjacent scale feature maps to extract the low-frequency components that characterize the overall contour of the extended target. The original input feature map is compared with the corresponding low-frequency component to separate the high-frequency component that represents the target edge and local details. The low-frequency components and high-frequency components are treated as independent feature branches and then fed into the subsequent differential weighting process.

6. The method according to claim 5, characterized in that, The MAIM module performs differential weighting processing on low-frequency and high-frequency components, including: Construct two independent lightweight attention branches, corresponding to the low-frequency component branch and the high-frequency component branch, respectively; An activation threshold filtering mechanism is set for high-frequency component branches, and a suppression operation is performed on weak activation features below the threshold to filter out noise-dominated feature responses; Expanding the weight control range for low-frequency component branches enhances the ability to represent the extended target contour features.

7. The method according to claim 6, characterized in that, The MAIM module performs feature reconstruction and channel compression after completing the differentiated weighting, including: The low-frequency components and high-frequency components, after weighted modulation, are superimposed and fused to form a comprehensive feature map containing contour and detail information. The comprehensive feature map is input into a 1×1 convolutional layer to perform channel dimension normalization, compressing the number of channels to a target dimension that matches the subsequent network structure; The compressed feature map is used as a cross-scale fusion output and transmitted to the detection head to complete the classification and regression tasks.

8. A lightweight target detection network for low-contrast, dim extended targets, characterized in that it is an improvement based on the YOLO11n three-stage architecture of backbone, neck, and detection head, and includes: a backbone network that integrates the LECA dim enhancement channel attention module at the feature extraction entry point, which adaptively enhances the input image features in both the channel and spatial dimensions; a neck network that uses the DSFPN sparse dilated feature pyramid module to replace the standard FPN structure, performing multi-level receptive field expansion and background noise gradient suppression on the backbone output features; a cross-scale fusion module that replaces part of the original C3k2 structure with the MAIM multi-scale adaptive fusion module, which performs frequency decomposition and differential component fusion on the neck output features; and a detection head that adopts the YOLO11n three-scale decoupled detection head structure and is configured with a CIoU loss function that adds a dim target perception regularization term.

9. The network according to claim 8, characterized in that, The LECA dim enhancement channel attention module is composed of a global brightness perception branch and a local contrast perception branch in parallel. The global brightness perception branch is used to generate channel dimension gain coefficients, and the local contrast perception branch is used to generate spatial dimension adaptive masks. The two outputs are broadcast fused and then used to perform weighted modulation on the input feature map.

10. The network according to claim 9, characterized in that the DSFPN sparse dilated feature pyramid module is composed of multi-level dilated convolutional branches cascaded with sparse gating units; the MAIM multi-scale adaptive fusion module is composed of a frequency decomposition unit, a dual-branch independent attention unit, and a channel compression unit in sequence; and each module is connected to the network with a non-redundant structure and maintains the continuity and integrity of the forward propagation path.

Citation Information

Patent Citations

  • Mid-infrared Dim and Small Target Detection Method and System with Multi-Feature Fusion in Complex Background (CN113935984B)

  • Infrared weak and small target detection method and system, electronic equipment and storage medium (CN119625259A)