A low-light raw image target detection method based on hierarchical recursion

By employing a hierarchical low-light RAW image target detection method, and utilizing a hierarchical connection-driven ISP intermediate feature-detection network framework, the problem of the separation between image processing and detection tasks is solved. This achieves deep collaborative enhancement between ISP intermediate features and the detection network, thereby improving target detection performance in low-light environments.

CN122265638A | Pending | Publication Date: 2026-06-23 | ZHONGBEI UNIV +2

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
ZHONGBEI UNIV
Filing Date
2026-04-28
Publication Date
2026-06-23

AI Technical Summary

Technical Problem

Existing low-light image target detection methods have failed to completely solve the problem of the separation between image processing and detection tasks, and have not made full use of ISP intermediate features, resulting in limited detection performance.

Method used

A hierarchical target detection method for low-light RAW images is designed. The method utilizes a hierarchical connection-driven ISP intermediate feature-detection network framework, which includes a task-oriented ISP submodule, an edge enhancement module, a multi-scale contrast adaptive adjustment module, and a feature fusion module. This framework accurately matches the key ISP processing steps with the detection task requirements and makes full use of ISP intermediate features.

Benefits of technology

It significantly improves target detection performance in low-light environments, solves the challenges of global brightness balance and local detail preservation, and achieves deep collaborative enhancement of ISP intermediate features and detection network, thereby improving detection accuracy and reliability.

✦ Generated by Eureka AI based on patent content.


Abstract

The present application belongs to the field of low-light image target detection, and particularly relates to a hierarchical low-light RAW image target detection method. Image degradation in low-light environments seriously restricts target detection performance. Traditional methods rely on RGB images that have already suffered irreversible information loss during image signal processing (ISP). Existing RAW data-driven methods face two bottlenecks: preprocessing is separated from the detection task, and the intermediate information in the ISP is insufficiently used. The present application designs a task-oriented ISP submodule, constructs an "EEM+MSYA+FM" collaborative enhancement architecture, fuses ISP intermediate features with each stage of the backbone network through hierarchical connections, and constructs a lossless information link from RAW data to detection.

Description

Technical Field

[0001] This invention pertains to low-light image target detection, specifically relating to a hierarchical low-light RAW image target detection method.

Background Technology

[0002] Low-light environments are prevalent in critical areas of modern society, such as autonomous driving, security monitoring, and industrial inspection, posing a severe challenge to image acquisition quality and downstream visual processing tasks. In such scenarios, the photon energy received by image sensors is insufficient, resulting in images generally exhibiting degradation characteristics such as low brightness, weak contrast, and severe noise interference. This directly causes blurring or even loss of target details, significantly increasing the false positive and false negative rates of detection algorithms, and severely restricting the reliability and security of related systems.

[0003] Traditional low-light target detection generally adopts a sequential approach of "image enhancement → target detection," that is, first improving image quality through illumination enhancement, noise suppression, and other means, and then inputting it into the detection network. Specifically, it can be divided into three categories: image-level, feature-level, and hybrid-level methods, all of which highly rely on the sRGB image processed by image signal processing (ISP). However, these methods have two core contradictions: First, low-light image enhancement uses visual quality (such as brightness equalization and color fidelity) as the evaluation standard, while target detection needs to ensure feature discrimination and geometric structural integrity. These two requirements are significantly misaligned, meaning that even if the image appearance is optimized, detection performance may be reduced due to feature distortion and noise amplification. Second, the original design intention of traditional ISPs is to optimize the subjective perception of the human eye. Their embedded operations such as de-mosaicing, noise suppression, and non-linear tone mapping, while improving appearance, cause irreversible information loss (such as smoothing target edges, weakening textures, and compressing dynamic range). Furthermore, the RGB image processed by ISP, upon which existing methods rely, itself suffers from information loss and noise superposition, limiting the improvement of detection performance from the source.

[0004] In order to fundamentally avoid the information loss caused by traditional ISP, in recent years the academic community has begun to explore a new paradigm of target detection based directly on RAW format sensor data. RAW data retains the original linear response characteristics of the sensor, has higher bit depth and wider dynamic range, and provides a basis for accurately recovering target details in low light scenes. A series of studies have confirmed the feasibility and superiority of this direction: for example, Ljungbergh et al. introduced lightweight learnable transformations (such as Yeo-Johnson transformation) to make the RAW data-driven detector outperform the traditional RGB baseline model; Xu et al.'s research showed that dynamic range adjustment is a key link affecting detection performance in the ISP process, and proposed an adaptive adjustment strategy; Cui et al. [11] proposed RAW-Adapter, Guo et al. proposed Dark-ISP, and other works further focused on designing task-oriented ISP alternatives; Wu et al.'s VisionISP modified the ISP module to improve the detection performance of autonomous driving scenes; Hou et al.'s PhyDiISP fused physical priors to enhance the low light detection accuracy.

[0005] Despite the progress made in the above research, detection networks based directly on RAW data still face two major challenges. First, most existing methods have failed to completely solve the core problem of "deep separation between image processing workflow and detection task requirements", and have not fully considered the impact of ISP processing on the detection task. Second, they have not effectively used the intermediate features of each ISP stage and have not fully explored the complete process from RAW data to final features, which limits the improvement of detection performance.

Summary of the Invention

[0006] Image degradation in low-light environments severely restricts target detection performance. Traditional methods rely on image signal processing (ISP), which results in irreversible information loss in the RGB image. Existing RAW data-driven methods face bottlenecks such as the separation of preprocessing and detection tasks and insufficient utilization of intermediate information from ISP. This invention addresses these issues by providing a hierarchical target detection method for low-light RAW images.

[0007] Traditional image signal processing (ISP) steps, such as nonlinear correction and excessive denoising, can impair the edge and texture features required for detection. Targeted ISP optimizations (e.g., dynamic denoising and adaptive white balance) can effectively improve the quality of low-light features. Based on this, a hierarchical connection-driven ISP intermediate feature-detection network collaborative framework is proposed. To address the core challenge of balancing global brightness equalization against local detail preservation in low-light brightness adjustment, a multi-scale contrast adaptive adjustment module (MSYA) is designed to precisely optimize the luminance component (Y channel). This module captures global-local differences in brightness distribution through multi-scale feature extraction, uses learnable polynomial basis functions to achieve flexible nonlinear brightness transformation, and then outputs the brightness adjustment result through cross-scale attention fusion. Furthermore, the framework integrates a parameter prediction module (PPM), an edge enhancement module (EEM), and a feature fusion module (FM). By addressing key steps of the ISP workflow, such as denoising, white balance, and color conversion, and by customizing the intermediate ISP enhancement process according to the impact of ISP on detection, a deep adaptation between preprocessing and detection tasks is achieved.

[0008] To achieve the above objectives, the present invention employs the following technical solution:

[0009] This invention provides a hierarchical low-light RAW image target detection method, comprising the following steps:

[0010] The low-light RAW image is input into the constructed hierarchical connection-driven ISP intermediate feature detection network, and the detection result is output. The hierarchical connection-driven ISP intermediate feature detection network includes a task-oriented ISP sub-module, an edge enhancement module, a multi-scale contrast adaptive adjustment module, and a feature fusion module. The hierarchical connection realizes the accurate fusion of ISP intermediate features with each stage of the backbone network, and then the detection result is output after passing through the neck network and the prediction head.

[0011] The task-oriented ISP submodule includes three stages: adaptive denoising, task-oriented white balance, and color conversion. Their parameters are dynamically optimized by the parameter prediction module, so that each stage is guided by the object detection task while the linear characteristics of the RAW data are preserved.

[0012] The edge enhancement module relies on a gradient-aware edge enhancement mechanism to achieve refined reconstruction and enhancement of the edge contours of degraded targets in low-light scenes;

[0013] The multi-scale contrast adaptive adjustment module completes the contrast differentiation optimization of target regions at different scales through a dynamic calibration strategy of multi-scale features.

[0014] The feature fusion module complementarily fuses multi-source enhanced features to output high-quality features with sharp edges and balanced contrast.

[0015] Furthermore, the task-oriented ISP submodule specifically comprises:

[0016] Adaptive denoising: Denoising is achieved through Gaussian filtering, and the parameters of the Gaussian convolution kernel are predicted by the parameter prediction module. The parameter set comprises the major and minor axes of the elliptical Gaussian kernel, a sharpening weight, and a gain. The formulas are:

[0017] ;

[0018] ;

[0019] where the kernel is Gaussian, * denotes the convolution operation, and the input is a RAW image;

[0020] Task-oriented white balance: White balance parameters are predicted by the parameter prediction module. The parameter set consists of one parameter each for the red, green, and blue channels, and the formula is:

[0021] ;

[0022] where the index denotes the channels of the image;

[0023] Color Conversion: The color conversion matrix (CCM) is predicted synchronously by the parameter prediction module. The formula for linear color space mapping is:

[0024] ;

[0025] where the matrix is the color conversion matrix (CCM) and the output is the color-converted image.
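The three stages can be illustrated with a minimal NumPy sketch. The formulas themselves are not reproduced in this text, so the forms below are assumptions: the input is taken to be a linear 3-channel image of shape (H, W, 3), the denoising step is assumed to blend the blurred image with a sharpening residual and then apply the gain, and the function names `elliptical_gaussian_kernel`, `filter2d`, and `task_oriented_isp` are hypothetical.

```python
import numpy as np

def elliptical_gaussian_kernel(r_major, r_minor, size=5):
    """Axis-aligned elliptical Gaussian kernel, normalized to sum to 1."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1].astype(np.float64)
    k = np.exp(-(x ** 2 / (2 * r_major ** 2) + y ** 2 / (2 * r_minor ** 2)))
    return k / k.sum()

def filter2d(img, kernel):
    """Same-size filtering of one channel with edge-replicated padding.
    The kernel is symmetric, so correlation and convolution coincide."""
    kh, kw = kernel.shape
    padded = np.pad(img, ((kh // 2, kh // 2), (kw // 2, kw // 2)), mode="edge")
    out = np.zeros(img.shape, dtype=np.float64)
    for i in range(kh):
        for j in range(kw):
            out += kernel[i, j] * padded[i:i + img.shape[0], j:j + img.shape[1]]
    return out

def task_oriented_isp(raw_rgb, r_major, r_minor, sharpen, gain, wb, ccm):
    """Return the three intermediate features: denoised, white-balanced, color-converted."""
    raw_rgb = np.asarray(raw_rgb, dtype=np.float64)
    kernel = elliptical_gaussian_kernel(r_major, r_minor)
    blurred = np.stack([filter2d(raw_rgb[..., c], kernel) for c in range(3)], axis=-1)
    # ASSUMED form: blurred image plus a weighted sharpening residual, scaled by the gain.
    denoised = gain * (blurred + sharpen * (raw_rgb - blurred))
    # Per-channel white balance gains for R, G, B.
    white_balanced = denoised * np.asarray(wb, dtype=np.float64)
    # Linear color space mapping with a 3x3 color conversion matrix.
    color_converted = white_balanced @ np.asarray(ccm, dtype=np.float64).T
    return denoised, white_balanced, color_converted
```

All three operations are linear in the pixel values, which is in keeping with the stated goal of preserving the linear characteristics of RAW data. In the described method the seven scalar parameters and the matrix would come from the parameter prediction module rather than being passed by hand.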

[0026] Furthermore, the parameter prediction module employs a self-attention mechanism to predict parameters, specifically as follows:

[0027]

[0028] where the query is a set of learnable dynamic parameters; the key and value are generated from the input image through convolution and linear layers; the denominator is the scaling term; FFN denotes a feedforward network consisting of linear layers and activation layers; and parameters denotes the parameters required for the three stages of the task-oriented ISP submodule.
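As a rough illustration of the parameter prediction step, the sketch below assumes the standard scaled dot-product form, parameters = FFN(softmax(QK^T / sqrt(d)) V), with a learnable query and image-derived key and value. The formula is not reproduced in this text, so this reading, the two-layer ReLU feedforward network, and the name `predict_isp_parameters` are assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def predict_isp_parameters(query, key, value, w1, b1, w2, b2):
    """query: (n_q, d) learnable; key, value: (n_tokens, d) derived from the image.
    Returns (n_q, n_params) raw parameter predictions."""
    d = query.shape[-1]
    attention = softmax(query @ key.T / np.sqrt(d))   # (n_q, n_tokens)
    context = attention @ value                        # (n_q, d)
    hidden = np.maximum(context @ w1 + b1, 0.0)        # linear layer + ReLU (ASSUMED activation)
    return hidden @ w2 + b2                            # linear layer to parameter vector
```

With an output width of 16, one query could cover the 4 denoising parameters, 3 white balance parameters, and 9 CCM entries; that packing is an illustrative choice, not something stated in the source.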

[0029] Furthermore, the edge enhancement module specifically comprises:

[0030] Dual-path input definition: Two types of input features are processed in parallel to form dual-path enhancement branches that balance edge integrity against low noise. One input preserves the original edge clues; the other has the advantage of low noise. Together they can be adapted to the fusion requirements of different ResNet stages.

[0031] Four-directional edge extraction: For the input features of the two paths, edge features are extracted using improved Sobel convolution kernels in four directions. The convolution operation is defined as follows:

[0032] ;

[0033] where the left-hand side is the edge response feature of a given path in a given direction; the path index distinguishes the two paths; the direction index covers the 4 edge directions; the kernel is the improved Sobel convolution kernel for the corresponding direction; and the operator denotes convolution;

[0034] Multi-directional edge feature fusion and dimensionality adaptation: The edge features of a single path in four directions are concatenated by channel, and then a 1×1 convolution is used to complete feature fusion and channel dimension adjustment, as detailed below:

[0035] Channel concatenation, ;

[0036] 1×1 convolution fusion, ; where the weight and bias are the learnable parameters of the 1×1 convolution;

[0037] BN normalization suppresses noise, while ReLU activation enhances effective edge response.

[0038] ;

[0039] Staged output and fusion localization: The edge enhancement features after dual-path processing are output to the specified stage of ResNet, defined as follows:

[0040] , ;

[0041] where one output, derived from its corresponding input path, provides the original, subtle edge clues for stage 4, and the other, derived from the other path, provides clean edge features for stage 3.
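The edge enhancement steps for one path can be sketched as follows. The "improved" Sobel kernels are not given in this text, so standard Sobel kernels and their two diagonal variants are used as stand-ins; the per-map normalization imitates batch normalization without learned scale and shift, and the names `directional_edges` and `edge_enhance` are hypothetical.

```python
import numpy as np

# Stand-in kernels (ASSUMED): standard Sobel plus two diagonal variants.
SOBEL_4DIR = np.array([
    [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]],    # gradient along the horizontal axis
    [[-1, -2, -1], [0, 0, 0], [1, 2, 1]],    # gradient along the vertical axis
    [[0, 1, 2], [-1, 0, 1], [-2, -1, 0]],    # first diagonal
    [[-2, -1, 0], [-1, 0, 1], [0, 1, 2]],    # second diagonal
], dtype=np.float64)

def filter2d(x, kernel):
    """Same-size cross-correlation with edge-replicated padding.
    For these antisymmetric kernels, true convolution differs only by sign."""
    kh, kw = kernel.shape
    padded = np.pad(x, ((kh // 2, kh // 2), (kw // 2, kw // 2)), mode="edge")
    out = np.zeros(x.shape, dtype=np.float64)
    for i in range(kh):
        for j in range(kw):
            out += kernel[i, j] * padded[i:i + x.shape[0], j:j + x.shape[1]]
    return out

def directional_edges(feature):
    """Four directional edge responses stacked by channel: (4, H, W)."""
    return np.stack([filter2d(feature, k) for k in SOBEL_4DIR])

def edge_enhance(feature, weight, bias, eps=1e-5):
    """weight: (C_out, 4) and bias: (C_out,) play the role of the 1x1 convolution."""
    edges = directional_edges(feature)
    fused = np.tensordot(weight, edges, axes=([1], [0])) + bias[:, None, None]
    mean = fused.mean(axis=(1, 2), keepdims=True)
    var = fused.var(axis=(1, 2), keepdims=True)
    normalized = (fused - mean) / np.sqrt(var + eps)   # normalization step
    return np.maximum(normalized, 0.0)                 # ReLU keeps positive edge responses
```

Running `edge_enhance` once on the pre-denoising feature and once on the denoised feature would give the two outputs that the text routes to different ResNet stages.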

[0042] Furthermore, the multi-scale contrast adaptive adjustment module specifically comprises:

[0043] Multi-scale brightness adjustment residual calculation: The RGB features are separated into the luminance channel Y and the chrominance channels Cb and Cr. Then, based on the preset multi-scale branch set S = {1,2,4,8}, the luminance channel Y is downsampled by average pooling at each scale s to obtain brightness features with different receptive fields, adapting to global and local adjustment needs. The formula is as follows:

[0044] ;

[0045] where the kernel-size symbol denotes the pooling kernel size and the left-hand side is the brightness feature at the s-th scale;

[0046] Basis function weighted response: Each scale branch generates a personalized brightness adjustment response by learning the adaptive weights of eight preset polynomial basis functions: for underexposed dark areas, signal gain is enhanced through low-order basis functions to improve local brightness; for overexposed bright areas, signal amplitude is suppressed through high-order basis functions to avoid loss of detail, while the nonlinear combination of basis functions adaptively cancels the interference of low-light noise on the brightness distribution.

[0047] ;

[0048] where the weights are the learnable weights of the i-th basis function at the s-th scale, the basis functions are predefined polynomial basis functions, and the left-hand side is the adjustment response at the s-th scale;

[0049] Cross-scale attention fusion: The adjustment responses at the different scales are attention-weighted, which automatically assigns the contribution of each scale, and are then fused to generate a global-local collaborative brightness adjustment residual. The formula is:

[0050] ;

[0051] ;

[0052] where the attention weight is that of scale s; the upsampling operation restores every scale to the original luminance channel size; and the result is the final fused brightness adjustment residual;

[0053] Feature Output and Fusion Localization: The adjusted residual is fused with the original luminance channel, and after range limitation, it is reconstructed into RGB features:

[0054] ;

[0055] where the clamp is the range-limiting function and the result is the optimized luminance channel;

[0056] The optimized luminance channel and the chroma channels Cb and Cr are converted back to RGB space to obtain the contrast-enhanced features, which are then fused in after stage 1 of ResNet, that is:

[0057] .
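The basis-function response, cross-scale fusion, and range limiting can be sketched as below. Several details are assumptions because the formulas are not reproduced in this text: the eight basis functions are taken to be the monomials y^1 to y^8, the attention weights are a softmax over one learned logit per scale, upsampling is nearest-neighbor, and the valid luminance range is [0, 1]. The names `basis_response`, `upsample_nearest`, and `fuse_scales` are hypothetical.

```python
import numpy as np

def basis_response(y_s, weights):
    """Weighted sum of polynomial basis functions at one scale.
    ASSUMED basis: phi_i(y) = y ** i for i = 1..len(weights)."""
    return sum(w * y_s ** (i + 1) for i, w in enumerate(weights))

def upsample_nearest(x, factor):
    """Nearest-neighbor upsampling back to the original luminance size."""
    return np.repeat(np.repeat(x, factor, axis=0), factor, axis=1)

def fuse_scales(y, pooled, basis_weights, scale_logits):
    """y: (H, W) luminance; pooled[s]: (H/s, W/s) pooled luminance;
    basis_weights[s]: 8 weights; scale_logits[s]: scalar attention logit (ASSUMED)."""
    scales = sorted(pooled)
    logits = np.array([scale_logits[s] for s in scales], dtype=np.float64)
    alpha = np.exp(logits - logits.max())
    alpha /= alpha.sum()                      # attention weights over scales sum to 1
    residual = np.zeros(y.shape, dtype=np.float64)
    for a, s in zip(alpha, scales):
        response = basis_response(pooled[s], basis_weights[s])
        residual += a * upsample_nearest(response, s)
    # Add the fused residual to the original luminance and limit the range.
    return np.clip(y + residual, 0.0, 1.0)
```

For example, with a flat luminance of 0.25, a unit weight on the first basis function at every scale, and equal logits, the residual is 0.25 everywhere and the optimized luminance is 0.5.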

[0058] Furthermore, the hierarchical connection enhances the output features of the task-oriented ISP submodule through the edge enhancement module and the multi-scale contrast adaptive adjustment module, and then injects them layer by layer according to the functions of each ResNet stage, specifically as follows:

[0059] The first feature is injected before stage 1. After stage 1, the contrast_feat generated by the multi-scale contrast adaptive adjustment module is injected. After stage 2, the feature that the edge enhancement module generates from the denoised features is injected. After stage 3, the feature that the edge enhancement module generates from the features before denoising is injected. The result passes through stage 4 into the neck network. The fusions at stage 1, stage 2, and stage 3 are performed by the feature fusion module.

[0060] Furthermore, the feature fusion module specifically comprises:

[0061] Spatial attention weighting: Attention is applied to both the backbone features and the enhancement features, using the following formula:

[0062] ;

[0063] where the input features are either backbone features or enhancement features; a 7×7 convolution generates the attention weight map; the operator denotes element-wise multiplication; and the output is the attention-weighted feature;

[0064] Scale alignment: Enhanced features are adapted to the backbone feature size through multiple downsampling steps, as shown in the formula:

[0065] ;

[0066] where the input is the enhancement feature; the exponent is the number of downsampling steps; the convolution stride is...; and the output is the enhancement feature after alignment with the scale of the backbone features;

[0067] Channel adaptation and normalization: Eliminating channel dimensional differences and stabilizing feature distribution, the formula is:

[0068] ;

[0069] where the channel count is the number of backbone feature channels, the convolution achieves the channel dimension conversion, the first term is the backbone feature, and the output is the final fused enhancement feature.
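A compact sketch of the three fusion steps is given below. The formulas are not reproduced in this text, so the details are assumptions: the 7×7 attention convolution is applied to the channel mean, scale alignment uses repeated stride-2 3×3 filtering, normalization is per channel without learned scale and shift, and the final fusion is element-wise addition. The function names are hypothetical.

```python
import numpy as np

def filter2d(x, kernel):
    """Same-size cross-correlation of one (H, W) map with edge-replicated padding."""
    kh, kw = kernel.shape
    padded = np.pad(x, ((kh // 2, kh // 2), (kw // 2, kw // 2)), mode="edge")
    out = np.zeros(x.shape, dtype=np.float64)
    for i in range(kh):
        for j in range(kw):
            out += kernel[i, j] * padded[i:i + x.shape[0], j:j + x.shape[1]]
    return out

def spatial_attention(feat, kernel7):
    """feat: (C, H, W). Attention map = sigmoid(7x7 filter over the channel mean)."""
    attention = 1.0 / (1.0 + np.exp(-filter2d(feat.mean(axis=0), kernel7)))
    return feat * attention                      # element-wise weighting

def downsample(feat, kernel3, times):
    """Apply a stride-2 3x3 filter `times` times to every channel."""
    for _ in range(times):
        feat = np.stack([filter2d(c, kernel3)[::2, ::2] for c in feat])
    return feat

def adapt_channels(feat, weight, eps=1e-5):
    """1x1 convolution (weight: (C_out, C_in)) followed by per-channel normalization."""
    out = np.tensordot(weight, feat, axes=([1], [0]))
    mean = out.mean(axis=(1, 2), keepdims=True)
    var = out.var(axis=(1, 2), keepdims=True)
    return (out - mean) / np.sqrt(var + eps)

def fuse(backbone, enhanced, kernel7_b, kernel7_e, kernel3, weight):
    """backbone: (C_b, h, w); enhanced: (C_e, H, W) with H = h * 2**n."""
    times = int(round(np.log2(enhanced.shape[1] // backbone.shape[1])))
    aligned = downsample(spatial_attention(enhanced, kernel7_e), kernel3, times)
    # ASSUMED fusion: add the adapted enhancement feature to the weighted backbone feature.
    return spatial_attention(backbone, kernel7_b) + adapt_channels(aligned, weight)
```

In a trained network the kernels and the 1×1 weights would be learned; here they are plain arrays so that the shape bookkeeping is easy to follow.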

[0070] Compared with the prior art, the present invention has the following advantages:

[0071] 1. The impact of low-light image degradation factors on target detection performance is quantitatively analyzed, the key ISP processing steps are accurately matched with the requirements of the detection task, and the advantages of RAW data in low-light detection scenarios are fully exploited through deep interaction and fusion of ISP and detection.

[0072] 2. A hierarchical connection-driven ISP intermediate feature-detection network framework is proposed. The framework achieves synergistic enhancement between ISP intermediate features and the detection network through "dual-path EEM + multi-scale MSYA + FM". The EEM module uses four-directional convolution and staged fusion to achieve refined edge enhancement of low-light targets, while the MSYA module solves the balance problem between global brightness equalization and contrast improvement of small targets in local dark areas through cross-scale attention fusion. Both modules precisely address the core bottlenecks of low-light detection.

[0073] 3. By fully utilizing the details and gradient information of intermediate images in each stage of ISP preprocessing, a lossless complementary link from RAW data to detection features is constructed through a feature fusion module. This solves the problems of existing methods neglecting intermediate ISP information and insufficient information utilization. Experimental verification on real and synthetic low-light RAW datasets shows that the detection performance of the proposed framework is significantly better than that of existing mainstream methods, providing an efficient and reliable solution for target detection in low-light environments.

Attached Figure Description

[0074] Figure 1 is a framework diagram of the hierarchical connection-driven ISP intermediate feature detection network.

[0075] Figure 2 shows the image processing procedure.

[0076] Figure 3 is a line graph showing the impact of blur, noise, contrast, brightness, and hue on the accuracy of the detection model.

[0077] Figure 4 shows the curves of the eight basis functions.

[0078] Figure 5 is a comparison chart of the detection accuracy of different methods for different categories in the LOD dataset.

[0079] Figure 6 is a visualization of the MSYA module.

Detailed Implementation

[0080] To further illustrate the technical solution of the present invention, the present invention will be further described below through embodiments.

[0081] There are two major bottlenecks in the field of target detection in low-light RAW images: First, existing methods have not solved the problem of the separation between image processing and detection tasks, and have not fully considered the impact of the ISP process on detection performance; second, there is insufficient utilization of intermediate features in the ISP process, and the representation link from RAW data to the final detection features has not been fully explored.

[0082] To overcome the aforementioned bottlenecks, this embodiment presents a hierarchical low-light RAW image target detection method, comprising the following steps:

[0083] The low-light RAW image is input into the constructed hierarchical connection-driven ISP intermediate feature detection network, and the detection result is output. As shown in Figure 1, the network includes a task-oriented ISP submodule, an edge enhancement module, a multi-scale contrast adaptive adjustment module, and a feature fusion module. Through hierarchical connections, it fuses the intermediate features of the ISP with each stage of the backbone network, and then outputs the detection results through the neck network and prediction head.

[0084] 1) Task-oriented ISP submodule

[0085] The core contradiction between traditional image signal processors (ISPs) and detection tasks lies in the "inconsistency of design goals": traditional ISPs are geared towards optimizing subjective human vision, while detection tasks require preserving the objective characteristics of "distinguishability and localization." This contradiction is the reason ISP processing affects detection performance. For example, as shown in Figure 2, the nonlinear correction (Gamma), excessive denoising, and compression processes in traditional ISPs smooth target edges, compress dynamic range, and amplify noise, resulting in irreversible loss of detection features.

[0086] To investigate the impact of the ISP process on object detection performance, experiments were designed based on the YOLOv11 model and tested using the COCO dataset. Noise, blur, brightness, contrast, and hue were each modified, and the effect on the detection results was measured. Furthermore, the mechanism by which each modification affects image edges was explored using indicators such as edge density, edge sharpness, and number of edges. The experimental results, shown in Figure 3 and Table 1, indicate that noise, blur, and contrast are the core factors affecting the detection mAP and edge features (density, sharpness, and quantity), while brightness and hue have minimal impact. This conclusion directly guides the design of subsequent modules: specifically addressing noise, contrast imbalance, and edge blurring can significantly improve detection performance.

[0087] Table 1. The impact of blur / noise / contrast / brightness / hue on edges.

[0088]
Degradation   Edge density (↑)   Edge sharpness (↑)   Number of edges (↑)
Noise         0.22               39.88                791.33
Blur          0.09               141.50               391.04
Contrast      0.08               123.28               377.76
Hue           0.00               1.19                 10.97
Brightness    0.01               12.41                57.87

[0089] This method designs a lightweight task-oriented ISP submodule, abandoning the nonlinear correction and compression features of traditional ISPs, and retaining only three key steps: adaptive denoising, task-oriented white balance, and color conversion. The parameters are dynamically optimized through the parameter prediction module (PPM) to complete these three steps respectively with the target detection as the guide while preserving the linear characteristics of RAW data.

[0090] Adaptive denoising: Denoising is achieved through Gaussian filtering, and the parameters of the Gaussian convolution kernel are predicted by the parameter prediction module. The parameter set comprises the major and minor axes of the elliptical Gaussian kernel, a sharpening weight, and a gain. The formulas are:

[0091] ;

[0092] ;

[0093] where the kernel is Gaussian, * denotes the convolution operation, and the input is a RAW image;

[0094] Task-oriented white balance: White balance parameters are predicted by the parameter prediction module. The parameter set consists of one parameter each for the red, green, and blue channels, and the formula is:

[0095] ;

[0096] where the index denotes the channels of the image;

[0097] Color Conversion: The color conversion matrix CCM (ccm_matrix) is predicted synchronously by the parameter prediction module. The formula for linear color space mapping is:

[0098] ;

[0099] where the matrix is the color conversion matrix (CCM) and the output is the color-converted image.

[0100] In summary, the task-oriented ISP submodule generates multi-stage features, one per step, through its three-step process. This not only corrects core degradation issues such as noise and color shift, but also matches the dual-branch data flow of the subsequent "feature-level enhancement" and "main feature extraction", laying the foundation for end-to-end optimization of the overall framework.

[0101] The parameter prediction module uses a self-attention mechanism to predict parameters, specifically:

[0102]

[0103] where the query is a set of learnable dynamic parameters; the key and value are generated from the input image through convolution and linear layers; the denominator is the scaling term; FFN denotes a feedforward network consisting of linear layers and activation layers; and parameters denotes the parameters required for the three stages (adaptive denoising, task-oriented white balance, and CCM color conversion) of the task-oriented ISP submodule.

[0104] 2) ISP-Detection Collaborative Architecture Driven by Core Feature Enhancement

[0105] Addressing the core challenges of low-light target detection performance, namely noise interference, blurred target edges, and multi-scale contrast imbalance, this method employs a collaborative enhancement architecture centered on an edge enhancement module (EEM) and a multi-scale contrast adaptive adjustment module, supplemented by a feature fusion module (FM). In this architecture, the EEM module leverages a gradient-aware edge enhancement mechanism to achieve refined reconstruction and enhancement of degraded target edge contours in low-light scenes. The multi-scale contrast adaptive adjustment module utilizes a dynamic calibration strategy for multi-scale features to optimize the contrast differences of target regions at different scales, precisely addressing the issue of insufficient contrast for small targets in dark areas. The FM module performs complementary fusion of multi-source enhancement features, ultimately outputting high-quality feature representations with sharp edges and balanced contrast to meet the requirements of the detection task. The ISP preprocessing output features are precisely transformed and enhanced into high-quality features with "sharp edges and balanced contrast" suitable for target detection tasks, thereby achieving deep coupling with the detection network and completing end-to-end joint optimization.

[0106] 2.1) Edge Enhancement Module (EEM)

[0107] In low-light scenes, target edges are easily obscured by noise and blurred, leading to decreased positioning accuracy, which is one of the core bottlenecks in low-light detection. Existing edge enhancement methods mostly use unidirectional gradient kernels, which are difficult to fully capture the contour information of complex targets and are not adapted to the characteristics of low-light noise, easily resulting in incomplete edge extraction and noise amplification. To address this, an innovative EEM edge enhancement module is designed to overcome the limitations of existing methods. It specifically enhances the multi-directional, high-purity edge features required for detection (such as vehicle contours in dark areas and corners of small targets), providing accurate positioning basis for the detection network.

[0108] Meanwhile, to adapt to the functional differences between ResNet18 stage3 semantic extraction and stage4 localization regression, and to balance the integrity of edge features with low noise characteristics, the EEM module adopts a dual-path input-output design to avoid the dilution and weakening of edge features during layer-by-layer propagation in low-light scenes, thereby improving the confidence of target classification and the accuracy of candidate box localization. This specifically addresses the localization and classification biases caused by blurred edges in low light. The details are as follows:

[0109] Dual-path input definition: Two types of input features are processed in parallel to form dual-path enhancement branches that balance edge integrity against low noise. One input preserves the original edge clues; the other has the advantage of low noise. Together they can be adapted to the fusion requirements of different ResNet stages.

[0110] Four-directional edge extraction: For the input features of the two paths, edge features are extracted using improved Sobel convolution kernels in four directions. The convolution operation is defined as follows:

[0111] ;

[0112] where the left-hand side is the edge response feature of a given path in a given direction, with dimensions (B, 1, H, W); the path index distinguishes the two paths; the direction index covers the 4 edge directions (horizontal, vertical, 45° direction, 135° direction); the kernel is the improved Sobel convolution kernel for the corresponding direction (kernel size is...); and the operator denotes convolution;

[0113] Multi-directional edge feature fusion and dimensionality adaptation: The edge features of a single path in four directions are concatenated by channel, and then a 1×1 convolution is used to complete feature fusion and channel dimension adjustment, as detailed below:

[0114] Channel concatenation: the concatenated dimensions are (B, 4, H, W).

[0115] 1×1 convolution fusion, ; where the weight and bias are the learnable parameters of the 1×1 convolution;

[0116] BN normalization suppresses noise, while ReLU activation enhances effective edge response.

[0117] ;

[0118] Staged output and fusion localization: The edge enhancement features after dual-path processing are output to the specified stage of ResNet18, defined as follows:

[0119] , ;

[0120] where one output, derived from its corresponding input path, provides the original, subtle edge clues for stage 4, and the other, derived from the other path, provides clean edge features for stage 3, enhancing the semantic-contour binding.

[0121] 2.2) Multi-scale contrast adaptive adjustment module (MSYA)

[0122] The coexistence of underexposure of small targets in dark areas and overexposure of the background in bright areas in low-light scenes leads to a severe imbalance in the contrast between the target and the background, which is a key factor affecting the discriminative power of detection features. Traditional contrast adjustment methods (such as Gamma correction) use a single, fixed mapping relationship, which cannot adapt to complex differences in brightness distribution. Existing RAW enhancement methods mostly use single-scale adjustment strategies, which are difficult to balance global brightness balance with contrast enhancement of small targets in local dark areas. To address this, we have innovatively designed a multi-scale adaptive contrast adjustment module. Through multi-scale progressive adjustment and adaptive weight allocation, we achieve synergistic optimization of "global brightness balance + precise contrast enhancement of small targets in local dark areas," overcoming the limitation of traditional methods that "pay attention to one aspect but lose another," and providing the detection network with highly discriminative feature input.

[0123] The core logic of the module is "white-balanced feature input → Y channel optimization → multi-scale adjustment → RGB reconstruction → fusion after stage 1", and its process can be described by the formulas below. Symbol definitions: the input is the feature after white balance, with dimensions (B, 3, H, W); B is the batch size, 3 is the number of RGB channels, and H and W are the height and width of the feature map; S = {1,2,4,8} is the preset multi-scale branch set.

[0124] Multi-scale brightness adjustment residual calculation: The RGB features are separated into a luminance channel Y and chrominance channels Cb and Cr using the BT.601 standard, so that chrominance does not interfere with contrast adjustment; subsequent optimization is performed only on the luminance channel. Then, for each scale s in the preset multi-scale branch set S = {1,2,4,8}, the luminance channel Y is downsampled by average pooling to obtain brightness features with different receptive fields, adapting to global and local adjustment needs. The formula is as follows:

[0125] $Y_s = \mathrm{AvgPool}(Y;\, k_s)$, $k_s = s$;

[0126] where $k_s$ is the pooling kernel size (equal to the scale s to ensure an exact downsampling size), and $Y_s$ is the brightness feature at the s-th scale, with dimensions (B, 1, H / s, W / s).
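The luminance extraction and multi-scale pooling described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, the image is a single H×W luminance map stored as nested lists, and H and W are assumed divisible by every scale in S.

```python
SCALES = (1, 2, 4, 8)  # preset multi-scale branch set S from the text

def luma_bt601(r, g, b):
    """BT.601 luma from RGB values (assumed normalized to [0, 1])."""
    return 0.299 * r + 0.587 * g + 0.114 * b

def avg_pool(y_map, s):
    """Non-overlapping s x s average pooling (kernel size = stride = s)."""
    h, w = len(y_map), len(y_map[0])
    return [[sum(y_map[i * s + di][j * s + dj]
                 for di in range(s) for dj in range(s)) / (s * s)
             for j in range(w // s)]
            for i in range(h // s)]

def multi_scale_luma(y_map):
    """Return {s: Y_s}, where Y_s has size (H / s) x (W / s)."""
    return {s: avg_pool(y_map, s) for s in SCALES}
```

Scale 1 leaves the map unchanged, while scale 8 summarizes each 8×8 block into one value, which is what gives the coarser branches their larger receptive field.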

[0127] Basis function weighted response: Combining the characteristics of low-light images—"concentrated pixels in dark areas, sparse signals in bright areas, and uneven global brightness distribution"—each scale branch generates a personalized brightness adjustment response by learning the adaptive weights of eight preset polynomial basis functions. For underexposed dark areas, low-order basis functions are used to enhance signal gain to improve local brightness; for overexposed bright areas, high-order basis functions are used to suppress signal amplitude to avoid loss of detail. Simultaneously, the nonlinear combination of basis functions adaptively cancels the interference of low-light noise on brightness distribution, overcoming the limitation of traditional fixed mapping in adapting to complex low-light brightness differences.

[0128] $R_s = \sum_{i=1}^{8} w_{s,i}\,\phi_i(Y_s)$; where $Y_s$ is the brightness feature at the s-th scale; $w_{s,i}$ is the learnable weight of the i-th basis function at the s-th scale, used to dynamically weight the different basis functions in low-light scenes; $\phi_i$ are the predefined polynomial basis functions (as shown in Figure 4); and $R_s$ is the adjustment response at the s-th scale.
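A minimal sketch of the basis-function weighted response is given below. The actual basis functions are defined in Figure 4 and are not reproduced here; the monomials y, y², ..., y⁸ used in this sketch are an assumption chosen only to make the example concrete, and `basis_response` is a hypothetical name.

```python
def basis_response(y_map, weights):
    """Weighted sum of eight polynomial basis functions per pixel.

    Assumption: the i-th basis function is y**i (i = 1..8); the patent's
    own basis functions may differ.
    """
    if len(weights) != 8:
        raise ValueError("expected one weight per basis function (8)")
    return [[sum(w * (y ** (i + 1)) for i, w in enumerate(weights))
             for y in row]
            for row in y_map]
```

With luminance in [0, 1], low-order terms dominate for dark pixels and high-order terms only become significant for bright pixels, which matches the division of labor between low-order and high-order basis functions described in the text.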

[0129] Cross-scale attention fusion: The adjustment responses at the different scales are attention-weighted, so that the contribution of each scale is assigned automatically, and are then fused into a global-local collaborative brightness adjustment residual. The formulas are:

[0130] $\alpha_s = \mathrm{Softmax}_{s \in S}\big(\mathrm{Conv}_{1\times1}(\mathrm{Up}(R_s))\big)$;

[0131] $\Delta Y = \sum_{s \in S} \alpha_s \odot \mathrm{Up}(R_s)$;

[0132] where $R_s$ is the adjustment response at the s-th scale; $\alpha_s$ is the attention weight at the s-th scale, obtained by 1×1 convolution dimensionality reduction followed by Softmax normalization; $\mathrm{Up}(\cdot)$ is upsampling, which restores every scale to the original luminance channel size; and $\Delta Y$ is the final fused brightness adjustment residual.
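The cross-scale fusion can be sketched as below. This is an illustration under assumptions, not the patented implementation: nearest-neighbor upsampling is assumed, the attention logits are passed in directly rather than produced by a learned 1×1 convolution, and the function names are hypothetical.

```python
import math

def upsample_nearest(m, s):
    """Nearest-neighbor upsampling of a 2D map by an integer factor s."""
    return [[m[i // s][j // s] for j in range(len(m[0]) * s)]
            for i in range(len(m) * s)]

def fuse_scales(responses, logits):
    """Softmax-weighted sum over scales, computed per pixel.

    responses, logits: {s: map of size (H / s) x (W / s)}.
    Returns the fused residual at full resolution H x W.
    """
    scales = sorted(responses)
    up_r = {s: upsample_nearest(responses[s], s) for s in scales}
    up_l = {s: upsample_nearest(logits[s], s) for s in scales}
    h, w = len(up_r[scales[0]]), len(up_r[scales[0]][0])
    delta = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            # Numerically stable softmax across the scale axis
            mx = max(up_l[s][i][j] for s in scales)
            exps = {s: math.exp(up_l[s][i][j] - mx) for s in scales}
            z = sum(exps.values())
            delta[i][j] = sum(exps[s] / z * up_r[s][i][j] for s in scales)
    return delta
```

Because the softmax is taken across scales at each pixel, the weights at every location sum to 1, so the fused residual is a convex combination of the per-scale responses.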

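[0133] Feature output and fusion localization: The adjustment residual is fused with the original luminance channel and, after range limiting, reconstructed into RGB features:

[0134] $Y' = \mathrm{Clip}(Y + \Delta Y)$;

[0135] where $\mathrm{Clip}(\cdot)$ is the range-limiting function, $\Delta Y$ is the brightness adjustment residual, and $Y'$ is the optimized luminance channel;

[0136] $Y'$ and the chrominance channels Cb and Cr are converted back to RGB space to obtain the contrast enhancement feature contrast_feat, which is fused in after stage 1 of ResNet18, that is:

[0137] $F_1' = \mathrm{Fuse}(F_1, \mathrm{contrast\_feat})$, where $F_1$ is the output of stage 1 and $F_1'$ is the fused feature.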
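The luminance update and RGB reconstruction can be sketched per pixel as follows. Assumptions: values are normalized to [0, 1], the range-limiting function is a simple clamp to that interval, and the analog BT.601 YCbCr form (zero-centered chroma) is used; all function names are hypothetical.

```python
def rgb_to_ycbcr(r, g, b):
    """BT.601 RGB -> YCbCr with zero-centered chroma."""
    y = 0.299 * r + 0.587 * g + 0.114 * b
    return y, (b - y) / 1.772, (r - y) / 1.402

def ycbcr_to_rgb(y, cb, cr):
    """Exact inverse of rgb_to_ycbcr."""
    r = y + 1.402 * cr
    b = y + 1.772 * cb
    g = (y - 0.299 * r - 0.114 * b) / 0.587
    return r, g, b

def clamp01(v):
    """Assumed range-limiting function: clamp to [0, 1]."""
    return min(1.0, max(0.0, v))

def enhance_pixel(rgb, delta_y):
    """Add the luminance residual, clamp, and rebuild RGB; chroma is untouched."""
    y, cb, cr = rgb_to_ycbcr(*rgb)
    return ycbcr_to_rgb(clamp01(y + delta_y), cb, cr)
```

Only Y is modified, so the hue information carried by Cb and Cr passes through unchanged, which is the reason the text gives for operating in a luminance/chrominance space.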

[0138] 3) Hierarchical connection integration

[0139] Addressing the core issues of poor compatibility between RAW enhancement features and the detection backbone, disconnect between ISP and feature extraction, and scattered enhancement information from multiple sources, this module constructs a bridge connecting the ISP and the ResNet backbone. Task-oriented features output from the ISP submodule are enhanced by EEM and MSYA, and then injected layer by layer according to the functional requirements of each ResNet stage, achieving deep collaboration between the two. This breaks the sequential disconnect, ensuring that ISP features adapt to detection needs throughout the entire process; the layered injection balances backbone capabilities and enhancement value, addressing low-light pain points in stages.

[0140] The output of the task-oriented ISP submodule is injected before stage 1. After stage 1, the contrast_feat generated by the multi-scale contrast adaptive adjustment module is injected. After stage 2, the edge feature that the edge enhancement module generates from the denoised features is injected, so that semantic information is bound to high-purity, low-noise edge details without interfering with semantic extraction, achieving layered edge enhancement. After stage 3, the edge feature that the edge enhancement module generates from the features before denoising is injected; using the complete original edge details to optimize candidate box prediction achieves a dual enhancement of "semantics + contour". The output of stage 4 is passed to the neck network. The features injected after stage 1, stage 2, and stage 3 are fused with the backbone features by the feature fusion module.
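The layer-by-layer injection order can be sketched as a small control-flow example. The helper `run_backbone`, the callables standing in for the ResNet stages, and the `fuse` callable standing in for the feature fusion module are all hypothetical; the sketch only shows where each injected feature enters.

```python
def run_backbone(x, stages, injections, fuse):
    """Run the stages in order, fusing an injected feature after a stage.

    stages: list of callables (stage 1..N).
    injections: {stage_index: feature} fused AFTER that stage's output.
    fuse: callable(backbone_feature, injected_feature) -> fused feature.
    Returns the list of per-stage outputs (after any fusion).
    """
    outputs = []
    for idx, stage in enumerate(stages, start=1):
        x = stage(x)
        if idx in injections:
            x = fuse(x, injections[idx])
        outputs.append(x)
    return outputs
```

In terms of the text, the mapping would be `{1: contrast_feat, 2: edge feature from the denoised path, 3: edge feature from the pre-denoising path}`, with no injection after stage 4, whose output goes to the neck network.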

[0141] The feature fusion module works as follows. Spatial attention weighting: attention enhancement is applied to both the backbone features and the enhancement features, using the following formula:

[0142] $X' = X \odot \mathrm{Conv}_{7\times7}(X)$;

[0143] where $X$ is the input feature (either a backbone feature or an enhancement feature); $\mathrm{Conv}_{7\times7}(\cdot)$ is a 7×7 convolution that generates the attention weight map; $\odot$ is element-wise multiplication, which enhances target regions and suppresses the background; and $X'$ is the attention-weighted feature;

[0144] Scale alignment: The enhancement features are adapted to the backbone feature size through multiple downsampling steps, as shown in the formula:

[0145] $F_{\mathrm{align}} = \mathrm{Down}^{n}(F_{\mathrm{enh}})$;

[0146] where $F_{\mathrm{enh}}$ is the enhancement feature; $n$ is the number of downsampling steps; the convolution stride is ...; and $F_{\mathrm{align}}$ is the enhancement feature after alignment to the scale of the backbone feature;

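[0147] Channel adaptation and normalization: Channel dimension differences are eliminated and the feature distribution is stabilized. The formula is:

[0148] $F_{\mathrm{enh}}^{\mathrm{out}} = \mathrm{Norm}\big(\mathrm{Conv}_{1\times1}(F_{\mathrm{align}})\big)$;

[0149] where the 1×1 convolution converts the channel dimension to $C_{\mathrm{bb}}$, the number of channels of the backbone feature $F_{\mathrm{bb}}$; $F_{\mathrm{align}}$ is the scale-aligned enhancement feature; and $F_{\mathrm{enh}}^{\mathrm{out}}$ is the enhancement feature used for the final fusion.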
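Two pieces of the fusion module lend themselves to a short sketch: counting the downsampling steps needed for scale alignment, and applying a spatial attention map by element-wise multiplication. Both helpers are hypothetical. The stride of 2 is an assumption (the source does not legibly state the stride), as is squashing the attention scores with a sigmoid, which the text does not specify.

```python
import math

def num_downsamples(enh_size, backbone_size, stride=2):
    """Number of strided steps that shrink enh_size to backbone_size.

    Assumption: each step divides the spatial size by `stride`
    (rounding up), as a strided convolution with padding would.
    """
    n, size = 0, enh_size
    while size > backbone_size:
        size = math.ceil(size / stride)
        n += 1
    if size != backbone_size:
        raise ValueError("sizes cannot be aligned with this stride")
    return n

def spatial_gate(feature, score):
    """Element-wise product of a feature map with sigmoid(score).

    Assumption: `score` plays the role of the 7x7 convolution output.
    """
    return [[f / (1.0 + math.exp(-s)) for f, s in zip(frow, srow)]
            for frow, srow in zip(feature, score)]
```

For example, a full-resolution enhancement feature with height 400 would need two assumed stride-2 steps to match a backbone feature of height 100.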

[0150] Example 2

[0151] For the object detection task, a RetinaNet detector with a ResNet backbone was used, and ImageNet pre-trained weights were used for initialization. The model was trained using a stochastic gradient descent (SGD) optimizer with an initial learning rate of 0.001, momentum of 0.9, and weight decay coefficient of 0.0001; the total number of training epochs across all datasets was 15. Data augmentation strategies included random horizontal flipping, and all images were resized to 400×600 pixels.
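The training settings above can be collected in a small configuration object, shown here together with one SGD update step. This is a sketch only: `TrainConfig` and `sgd_step` are hypothetical names, and the update rule follows the common convention of adding weight decay to the gradient before the momentum update, which the text does not specify.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TrainConfig:
    lr: float = 0.001             # initial learning rate
    momentum: float = 0.9
    weight_decay: float = 0.0001
    epochs: int = 15
    image_size: tuple = (400, 600)  # as stated in the text; axis order not specified
    random_hflip: bool = True

def sgd_step(w, grad, velocity, cfg):
    """One scalar SGD-with-momentum step (assumed update convention)."""
    g = grad + cfg.weight_decay * w           # L2 weight decay folded into the gradient
    velocity = cfg.momentum * velocity + g    # momentum buffer update
    return w - cfg.lr * velocity, velocity
```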

[0152] To comprehensively verify the performance and efficiency of the proposed low-light RAW image target detection method, seven mainstream baseline methods were selected for comparative experiments, including traditional image signal processing methods (Default ISP, Demosaic), low-light enhancement-driven detection methods (LIS, FeatEnHancer), and advanced RAW data adaptation methods (SID, RAW-Adapter, Dark-ISP). ResNet-18 was used as the backbone network in the experiments, and the detection accuracy (mAP) was evaluated on both a real low-light RAW dataset (LOD) and a synthetic low-light RAW dataset (PASCALRAW). The lightweight nature of the model was verified through parameter statistics. The experimental results are shown in Tables 2, 3, and 4.

[0153] Table 2 presents the comparative results on the LOD dataset with the ResNet18 backbone. The detection performance of the methods differs clearly from one group to the next. Among the traditional methods, Default ISP (55.1%), which relies on a fixed image processing workflow, outperforms Demosaic (52.4%) and SID (49.1%), confirming the necessity of basic ISP preprocessing for low-light RAW data detection. However, because of its "human vision-oriented" design, its performance lags clearly behind the task-driven dedicated methods. Among the low-light enhancement methods, FeatEnHancer achieves 57.9% mAP through a hierarchical feature fusion strategy, 2.9 percentage points above LIS (55.0%), highlighting the positive role of multi-scale feature enhancement in low-light target detection. Among the RAW data adaptation methods, RAW-Adapter uses a lightweight adapter to reduce the distribution difference between RAW data and the pre-trained model, achieving an mAP of 56.7%, and Dark-ISP specifically optimizes the ISP workflow for low-light scenes, raising the mAP to 59.7%. The proposed method achieves 60.1% mAP through deep fusion of multi-scale contrast adaptive adjustment and edge enhancement, exceeding Dark-ISP, RAW-Adapter, and Default ISP by 0.4, 3.4, and 5.0 percentage points, respectively, which validates its detection performance on real low-light RAW data. To further clarify the adaptability of the proposed method to different types of targets, a category-level performance analysis is given next with reference to Figure 5.

[0154] Table 2. Comparative Experiments (LOD dataset, ResNet18)

[0155]
Methods         mAP (%)
Default ISP     55.1
Demosaic        52.4
SID             49.1
LIS             55.0
FeatEnHancer    57.9
RAW-Adapter     56.7
Dark-ISP        59.7
Ours            60.1

[0156] Figure 5 further shows the category-level detection performance of each method on the LOD dataset (covering traffic targets such as bicycle, car, motorbike, and bus, and indoor targets such as chair, dining table, bottle, and TV). For traffic targets, which are easily blurred by uneven low-light illumination, the proposed method performs markedly better than the compared methods in the car and motorbike categories (e.g., mAP close to 0.9 for motorbike), and is comparable to or slightly better than FeatEnHancer and RAW-Adapter in the bicycle and bus categories, demonstrating strong feature extraction and edge preservation. For indoor targets, which are small-scale, low-texture, and easily affected by background interference, the proposed method outperforms RAW-Adapter and LIS in the dining table, bottle, and TV categories (e.g., about 0.2 higher than RAW-Adapter in the dining table category), and is on par with FeatEnHancer in the chair category, demonstrating good scene generalization. Overall, the proposed method outperforms or matches the compared methods in most target categories, with a particularly strong advantage for target types that are dimly lit and easily disturbed. This agrees with the overall mAP improvement in Table 2 and validates the effectiveness of the proposed modules across different target types. The improved category-level performance stems from the targeted design of the core modules. The following section further examines the working mechanism of the MSYA module through visualization experiments.

[0157] The PASCALRAW-low light dataset was synthesized from sRGB images using the InvISP inverse transform. Its low-light scenes are controllable and consistent, making it suitable for verifying the method's adaptability to standardized low-light environments. Dark-ISP relies on the physical characteristics of real camera RAW data (such as sensor noise models and light response functions), which do not match the distribution of synthesized RAW data, and was therefore excluded from the comparison. Table 3 shows that FeatEnHancer achieves the highest mAP of 85.8% with its complex feature enhancement structure, but at the cost of a large number of parameters; RAW-Adapter achieves 82.5% mAP through a dual-adapter design, demonstrating the effectiveness of the pre-trained model adaptation strategy; Demosaic and LIS achieve mAPs of 80.3% and 78.0%, respectively, reflecting the limitations of directly processing RAW data or using a single enhancement strategy in synthesized low-light scenes. The proposed method (Ours) achieves 83.4% mAP without introducing a complex network structure, which is 0.9 percentage points higher than RAW-Adapter. Although this is slightly lower than FeatEnHancer, the parameter analysis in Table 4 shows that the proposed method achieves a better balance between performance and efficiency, and it also adapts well to synthesized low-light RAW data.

[0158] Table 3 Comparative experiments (PASCALRAW-low light dataset, ResNet18)

[0159]
Methods         mAP (%)
Default ISP     -
Demosaic        80.3
SID             78.2
LIS             78.0
FeatEnHancer    85.8
RAW-Adapter     82.5
Dark-ISP        -
Ours            83.9

[0160] The number of parameters is a core quantitative indicator of how feasible a model is to deploy, and its size directly determines the model's adaptability and inference efficiency on devices with limited computing resources. Table 4 presents the parameter statistics of the proposed method and existing mainstream methods. The proposed method (Ours) has a parameter size of 11.310 MB, the smallest among all compared methods: a 1.18% reduction compared to Dark-ISP (11.445 MB), a 2.67% reduction compared to RAW-Adapter (11.620 MB), a 0.05% reduction compared to FeatEnHancer (11.316 MB), and a 6.08% reduction compared to LIS (12.042 MB). There are two reasons for this. First, both the Multi-Scale Contrast Adaptive Adjustment (MSYA) module and the Edge Enhancement (EEM) module employ lightweight convolutional structures and efficient spatial attention mechanisms, avoiding the parameter redundancy caused by deep network stacking in traditional methods, so that features are enhanced while the number of parameters is reduced. Second, the deep collaborative fusion mechanism between ISP preprocessing and the detection network omits the redundant feature transformation and intermediate adaptation steps of the traditional serial architecture, achieving end-to-end collaborative optimization of preprocessing and detection.

[0161] Table 4 Comparison of Parameter Quantities

[0162]
Methods         Parameters↓ (MB)
LIS             12.042
FeatEnHancer    11.316
RAW-Adapter     11.620
Dark-ISP        11.445
Ours            11.310
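The relative reductions quoted for Table 4 follow from a one-line formula, (baseline − ours) / baseline × 100. The short sketch below recomputes them from the table values; the dictionary and function names are illustrative only.

```python
# Parameter sizes in MB, copied from Table 4
PARAMS_MB = {
    "LIS": 12.042,
    "FeatEnHancer": 11.316,
    "RAW-Adapter": 11.620,
    "Dark-ISP": 11.445,
    "Ours": 11.310,
}

def reduction_pct(baseline_mb, ours_mb):
    """Relative parameter reduction of `ours` versus `baseline`, in percent."""
    return (baseline_mb - ours_mb) / baseline_mb * 100.0

def all_reductions(table=PARAMS_MB, ours_key="Ours"):
    """Reduction of the proposed method against every other entry, rounded to 2 decimals."""
    ours = table[ours_key]
    return {name: round(reduction_pct(mb, ours), 2)
            for name, mb in table.items() if name != ours_key}
```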

[0163] In summary, the proposed method demonstrates superior detection performance on both real and synthetic low-light RAW datasets. Its core advantages lie in two aspects: first, dedicated enhancement modules are designed to address the core challenges of low-light scenes (edge loss and brightness imbalance), achieving deep collaboration between ISP processing and detection tasks; second, a lightweight structural design is adopted, reducing the number of parameters while maintaining performance. Compared to Dark-ISP, which relies on real RAW data, and RAW-Adapter, which focuses on pre-trained model adaptation, the proposed method offers stronger scene adaptability and deployment flexibility. Compared to the parameter-intensive FeatEnHancer, the proposed method, while maintaining similar performance, better meets the needs of engineering applications, providing an efficient and reliable solution for target detection in low-light RAW images.

[0164] On the LOD dataset, ablation experiments were conducted using ResNet18 as the backbone network, as shown in Table 5. Removing the skip connections (No-skip) resulted in an mAP of 55.1%. Adding convolutional layers and connecting them to the subsequent model (Conv-skip) improved the mAP by 0.3 percentage points, indicating that introducing the intermediate image from the ISP into the feature extraction backbone improves detection performance. Passing the intermediate image from the ISP through the Edge Enhancement (EEM) module improved the mAP by a further 2.2 percentage points over Conv-skip. To overcome the influence of uneven illumination on target edges, the Multi-Scale Contrast Adaptive Adjustment (MSYA) module was introduced, achieving a 1.7 percentage point improvement over the method using only convolutional processing (Conv-skip). Finally, the mAP after adding all modules to the model was 60.1%, demonstrating that each module effectively improves detection accuracy.

[0165] Table 5 Ablation experiments on the LOD dataset

[0166]
Methods                    mAP (%)
No-skip                    55.1
Conv-skip                  55.4
Edge-Enhance-Module        57.6
Multi-Scales-Y-Adjuster    57.1
Ours                       60.1
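The gains discussed for the ablation study are plain differences between rows of Table 5, expressed in percentage points. The sketch below recomputes them; the names `ABLATION_MAP` and `gain` are illustrative only.

```python
# mAP values (%) copied from Table 5
ABLATION_MAP = {
    "No-skip": 55.1,
    "Conv-skip": 55.4,
    "Edge-Enhance-Module": 57.6,
    "Multi-Scales-Y-Adjuster": 57.1,
    "Ours": 60.1,
}

def gain(variant, baseline="Conv-skip"):
    """mAP gain of `variant` over `baseline`, in percentage points."""
    return round(ABLATION_MAP[variant] - ABLATION_MAP[baseline], 1)
```

Note that the EEM and MSYA gains are measured against Conv-skip, while the full model's total gain is most naturally read against No-skip.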

[0167] To verify the core function of the MSYA module and the emphasis of each of the four adjustment scales, a visual comparison experiment was designed. The results are shown in Figure 6 (from left to right: the default ISP result, the before-and-after enhancement comparison, and the four-scale adjustment heatmaps). The experiment compares "traditional benchmark - module effect - scale mechanism", which not only demonstrates the advantages of the module over the traditional ISP, but also visualizes the weight distribution of each scale through heatmaps, supporting the rationality of the module design.

[0168] The results in Figure 6 can be interpreted layer by layer. Comparing the default ISP result with the images before and after enhancement, the low-light images processed by the default ISP suffer from brightness imbalance and blurred edges, whereas after module enhancement the brightness is more balanced and the target details are clearer. This is one of the core reasons for the higher overall mAP of the proposed method in Table 2 and its strong category-level performance in Figure 5. The four-scale heatmaps confirm the progressive optimization strategy of "global - local - pixel level": scale0 applies globally uniform weighting to balance brightness and suppress overexposure; scale1 and scale2 place high weights on the target subject and weaken the background, matching the detection needs of traffic targets such as cars and motorbikes as well as small indoor targets such as dining tables and bottles; scale3 places high weights on the edge details of the target, achieving precise calibration and ensuring accurate target localization. The results show that the module addresses the brightness imbalance and detail loss of low-light RAW images, providing high-quality feature input for subsequent detection. The heatmaps also provide evidence for the rationality of the adjustment strategy, supporting the accuracy improvement of the proposed method.

Claims

1. A hierarchical target detection method for low-light RAW images, characterized in that it includes the following steps: the low-light RAW image is input into the constructed hierarchical connection-driven ISP intermediate feature detection network, and the detection result is output; the hierarchical connection-driven ISP intermediate feature detection network includes a task-oriented ISP submodule, an edge enhancement module, a multi-scale contrast adaptive adjustment module, and a feature fusion module; the hierarchical connection realizes the accurate fusion of ISP intermediate features with each stage of the backbone network, and the detection result is then output after passing through the neck network and the prediction head; the task-oriented ISP submodule includes three stages: adaptive denoising, task-oriented white balance, and color conversion, whose parameters are dynamically optimized by the parameter prediction module, and the three stages are each completed with object detection as the guide while preserving the linear characteristics of the RAW data; the edge enhancement module relies on a gradient-aware edge enhancement mechanism to achieve refined reconstruction and enhancement of the edge contours of degraded targets in low-light scenes; the multi-scale contrast adaptive adjustment module completes the differentiated contrast optimization of target regions at different scales through a dynamic calibration strategy for multi-scale features; the feature fusion module complementarily fuses the multi-source enhanced features to output high-quality features with sharp edges and balanced contrast.

2. The method for target detection in low-light RAW images based on hierarchical layering according to claim 1, characterized in that the task-oriented ISP submodule is specifically as follows: Adaptive denoising: denoising is achieved through Gaussian filtering, and the parameters of the Gaussian convolution kernel are predicted by the parameter prediction module; the parameter set comprises the major and minor axes of the elliptical Gaussian kernel, the sharpening weight, and the gain; in the denoising formulas, the input is the RAW image, the kernel is the Gaussian kernel, and * denotes the convolution operation; Task-oriented white balance: the white balance parameters are predicted by the parameter prediction module, and the parameter set corresponds to the red, green, and blue channels, respectively, each parameter being applied to the corresponding channel of the image; Color conversion: the color conversion matrix (CCM) is predicted synchronously by the parameter prediction module and is applied as a linear color space mapping, which outputs the color-converted image.

3. The method for target detection in low-light RAW images based on hierarchical layering according to claim 2, characterized in that the parameter prediction module uses a self-attention mechanism to predict the parameters, specifically: $\mathrm{parameters} = \mathrm{FFN}\big(\mathrm{Softmax}(Q K^{\top} / \sqrt{d_k})\, V\big)$; where $Q$ is a set of learnable dynamic parameters; $K$ and $V$ are generated from the input image through convolution and linear layers; $\sqrt{d_k}$ is the scaling term; FFN denotes a feedforward network consisting of linear layers and activation layers; and parameters denotes the parameters required by the three stages of the task-oriented ISP submodule.

4. The method for target detection in low-light RAW images based on hierarchical layering according to claim 3, characterized in that the edge enhancement module is specifically: Dual-path input definition: two types of input features are processed in parallel to form dual-path enhancement branches that balance edge integrity and low noise, where the input taken before denoising, $X_{\mathrm{raw}}$, preserves the original edge cues, and the denoised input, $X_{\mathrm{dn}}$, has the advantage of low noise, so that the two can be adapted to the fusion requirements of different ResNet stages; Four-directional edge extraction: for the input features of the two paths, edge features are extracted using improved Sobel convolution kernels in four directions, with the convolution operation defined as $E_{p,d} = K_d * X_p$; where $E_{p,d}$ is the edge response of path $p$ in direction $d$, $p$ is the dual-path index, $d$ indexes the 4 edge directions, $K_d$ is the improved Sobel convolution kernel for the corresponding direction, and * denotes the convolution operation; Multi-directional edge feature fusion and dimensionality adaptation: the edge features of a single path in four directions are concatenated by channel, and a 1×1 convolution is then used to complete feature fusion and channel dimension adjustment, as detailed below: channel concatenation, $E_p^{\mathrm{cat}} = \mathrm{Concat}(E_{p,1}, E_{p,2}, E_{p,3}, E_{p,4})$; 1×1 convolution fusion, $E_p^{\mathrm{fuse}} = W_{1\times1} * E_p^{\mathrm{cat}} + b_{1\times1}$, where $W_{1\times1}$ and $b_{1\times1}$ are the learnable parameters of the 1×1 convolution; BN normalization suppresses noise, while ReLU activation enhances the effective edge response, $F_p = \mathrm{ReLU}(\mathrm{BN}(E_p^{\mathrm{fuse}}))$; Staged output and fusion localization: the edge enhancement features after dual-path processing are output to the specified stages of ResNet, defined as $F_{\mathrm{edge}}^{(4)} = F_{\mathrm{raw}}$, $F_{\mathrm{edge}}^{(3)} = F_{\mathrm{dn}}$; where $F_{\mathrm{edge}}^{(4)}$, generated from $F_{\mathrm{raw}}$, provides the original, subtle edge cues for stage 4, and $F_{\mathrm{edge}}^{(3)}$, generated from $F_{\mathrm{dn}}$, provides clean edge features for stage 3.

5. The method for target detection in low-light RAW images based on hierarchical layering according to claim 4, characterized in that the multi-scale contrast adaptive adjustment module is specifically as follows: Multi-scale brightness adjustment residual calculation: the RGB features are separated into the luminance channel Y and the chrominance channels Cb and Cr; then, for each scale s in the preset multi-scale branch set S = {1,2,4,8}, the luminance channel Y is downsampled by average pooling to obtain brightness features with different receptive fields, adapting to global and local adjustment needs, with the formula $Y_s = \mathrm{AvgPool}(Y;\, k_s)$; where $k_s$ is the pooling kernel size and $Y_s$ is the brightness feature at the s-th scale; Basis function weighted response: each scale branch generates a personalized brightness adjustment response by learning the adaptive weights of eight preset polynomial basis functions: for underexposed dark areas, signal gain is enhanced through low-order basis functions to improve local brightness; for overexposed bright areas, signal amplitude is suppressed through high-order basis functions to avoid loss of detail, while the nonlinear combination of basis functions adaptively cancels the interference of low-light noise on the brightness distribution, $R_s = \sum_{i=1}^{8} w_{s,i}\,\phi_i(Y_s)$; where $w_{s,i}$ is the learnable weight of the i-th basis function at the s-th scale, $\phi_i$ are the predefined polynomial basis functions, and $R_s$ is the adjustment response at the s-th scale; Cross-scale attention fusion: the adjustment responses $R_s$ at the different scales are attention-weighted, so that the contribution of each scale is assigned automatically, and are fused into a global-local collaborative brightness adjustment residual, with the formulas $\alpha_s = \mathrm{Softmax}_{s \in S}\big(\mathrm{Conv}_{1\times1}(\mathrm{Up}(R_s))\big)$; $\Delta Y = \sum_{s \in S} \alpha_s \odot \mathrm{Up}(R_s)$; where $\alpha_s$ is the attention weight at scale s, $\mathrm{Up}(\cdot)$ is upsampling, which restores every scale to the original luminance channel size, and $\Delta Y$ is the final fused brightness adjustment residual; Feature output and fusion localization: the adjustment residual is fused with the original luminance channel and, after range limiting, reconstructed into RGB features, $Y' = \mathrm{Clip}(Y + \Delta Y)$; where $\mathrm{Clip}(\cdot)$ is the range-limiting function and $Y'$ is the optimized luminance channel; $Y'$ and the chrominance channels Cb and Cr are converted back to RGB space to obtain the contrast enhancement feature contrast_feat, which is fused in after stage 1 of ResNet, that is: $F_1' = \mathrm{Fuse}(F_1, \mathrm{contrast\_feat})$, where $F_1$ is the output of stage 1.

6. The method for target detection in low-light RAW images based on hierarchical layering according to claim 5, characterized in that the hierarchical connection enhances the output features of the task-oriented ISP submodule through the edge enhancement module and the multi-scale contrast adaptive adjustment module, and then injects them layer by layer according to the function of each ResNet stage, specifically: the output of the task-oriented ISP submodule is injected before stage 1; after stage 1, the contrast_feat generated by the multi-scale contrast adaptive adjustment module is injected; after stage 2, the edge feature that the edge enhancement module generates from the denoised features is injected; after stage 3, the edge feature that the edge enhancement module generates from the features before denoising is injected; the output of stage 4 is passed to the neck network; and the features injected after stage 1, stage 2, and stage 3 are fused with the backbone features by the feature fusion module.

7. The method for target detection in low-light RAW images based on hierarchical layering according to claim 6, characterized in that the feature fusion module is specifically as follows: Spatial attention weighting: attention enhancement is applied to both the backbone features and the enhancement features, using the formula $X' = X \odot \mathrm{Conv}_{7\times7}(X)$; where $X$ is the input feature (either a backbone feature or an enhancement feature), $\mathrm{Conv}_{7\times7}(\cdot)$ is a 7×7 convolution that generates the attention weight map, $\odot$ is element-wise multiplication, and $X'$ is the attention-weighted feature; Scale alignment: the enhancement features are adapted to the backbone feature size through multiple downsampling steps, as shown in the formula $F_{\mathrm{align}} = \mathrm{Down}^{n}(F_{\mathrm{enh}})$; where $F_{\mathrm{enh}}$ is the enhancement feature, $n$ is the number of downsampling steps, the convolution stride is ..., and $F_{\mathrm{align}}$ is the enhancement feature after alignment to the scale of the backbone feature; Channel adaptation and normalization: channel dimension differences are eliminated and the feature distribution is stabilized, with the formula $F_{\mathrm{enh}}^{\mathrm{out}} = \mathrm{Norm}\big(\mathrm{Conv}_{1\times1}(F_{\mathrm{align}})\big)$; where the 1×1 convolution converts the channel dimension to $C_{\mathrm{bb}}$, the number of channels of the backbone feature $F_{\mathrm{bb}}$, and $F_{\mathrm{enh}}^{\mathrm{out}}$ is the enhancement feature used for the final fusion.