A Low-Light Image Enhancement Method Based on Random Region Hidden Reconstruction
By generating low-light noisy images under good lighting conditions and performing self-supervised learning, combined with multi-scale feature fusion guided by reconstructed features, the problems of noise amplification and high computational cost in existing low-light enhancement methods are solved, achieving fast and effective low-light image enhancement that can adapt to complex lighting conditions.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- BEIHANG UNIV
- Filing Date
- 2024-07-08
- Publication Date
- 2026-06-30
AI Technical Summary
Existing low-light enhancement methods tend to amplify noise when processing low-light images, resulting in unclear edge details in the enhanced image. They are also computationally intensive, requiring a large number of real image pairs for training, which is labor-intensive and time-consuming. Furthermore, they perform poorly under complex lighting conditions.
A two-stage training method based on hidden reconstruction of random regions is adopted. First, a low-light noisy image is generated under good lighting conditions, and a feature extractor is trained through self-supervised learning. Then, a multi-scale feature fusion module guided by the reconstructed features is used to fuse image features, which reduces the amount of computation and improves the network's ability to process low-light images.
While reducing the computational load, the method improves noise reduction for low-light images, preserves edge details, adapts to a variety of complex lighting scenarios, and achieves fast and effective low-light image enhancement.
Figure CN118505540B_ABST
Abstract
Description
Technical Field
[0001] This invention belongs to the field of low-light image enhancement, specifically relating to a low-light image enhancement method based on random region hidden reconstruction.
Background Technology
[0002] With the advancement of technology, images are being used more and more widely in social production and daily life, playing a vital role in fields such as medicine, education, entertainment, scientific research, industry, and agriculture. For example, medical images help doctors make diagnoses, and satellite imagery is used for climate prediction and geographical research. High-quality images not only provide a better visual experience but also convey information more accurately and support decision-making more effectively. However, in practical applications, environmental factors and equipment often result in low-quality images that are too dark, contain noise, or have color deviations. Therefore, post-processing using image enhancement techniques is necessary to improve image quality and usability.
[0003] The main purpose of low-light enhancement is to improve the brightness and clarity of images captured in low-light environments. Traditional spatial-domain algorithms, such as histogram equalization and gamma correction, improve image brightness and contrast by directly processing pixels, but they often amplify noise and introduce artifacts and color distortion. Among frequency-domain algorithms, low-pass filtering smooths the image, high-pass filtering sharpens details at edge boundaries, and homomorphic filtering enhances details in dark areas. However, these are generally ineffective for low-light images, often require parameter tuning for different lighting conditions, and perform poorly in complex urban lighting environments. Retinex theory is also widely used for low-light enhancement. Based on trichromatic theory and color constancy, it posits that an object's color is determined by its ability to reflect long, medium, and short wavelengths, and is unaffected by the intensity of the reflected light. It therefore decomposes the original low-light image into illumination and reflectance components and obtains the enhanced image by separating the reflectance component from the original image. However, it is ineffective in extremely dark environments and complex lighting conditions, often requires parameter tuning, and generalizes poorly.
[0004] With the development of neural networks and the improvement of computing power, neural network-based image processing methods have been widely used in the field of vision. However, convolutional neural networks are limited by the size of the convolutional kernel and have a small receptive field, making them unsuitable for capturing global long-range relationships in images. In low-light enhancement tasks, they are prone to uneven illumination in the enhancement results and struggle to utilize useful information from non-neighboring blocks. Transformer networks are good at capturing global dependencies in sequence data. ViT applied the Transformer architecture to computer vision tasks and achieved performance equal to or even surpassing that of convolutional networks. However, the Transformer is a sequence-to-sequence model: taking pixels as the input sequence leads to excessively long sequences, much longer than those in natural language processing (e.g., 224 × 224 = 50176 pixels for a 224×224 image), so the computational cost is enormous. In addition, existing low-light enhancement methods require a large number of real-world paired low-light and well-lit images for supervised network training. Capturing many image pairs of the same scene in both low-light and well-lit conditions is extremely time-consuming and labor-intensive, and such datasets are currently very limited, which hinders neural network training. Furthermore, when dealing with high-resolution images, the pixel-level input sequence of the Transformer results in a massive computational burden and high computing power requirements. Finally, existing low-light enhancement methods have limited noise reduction capability, leading to unclear edge details in the enhanced image. Therefore, there is an urgent need for a method that performs fast and effective low-light image enhancement.
Summary of the Invention
[0005] To overcome the shortcomings of existing technologies, this invention provides a low-light image enhancement method based on random region hidden reconstruction. First, it draws on the more abundant supply of images captured under good lighting conditions: the well-lit input images are darkened, corrupted with noise, and randomly masked. Then, through the self-supervised task of reconstructing the masked pixels, it greatly reduces the computational load of the network, saving computing power and time, while improving the network's ability to learn global information from low-light noisy images. This allows edge details to be better preserved while low-light noisy images are denoised.
[0006] In the second stage, a multi-scale feature fusion module guided by reconstructed features enables better fusion of the features extracted by the different feature encoders. Compared with the U-Net convolutional denoising network, which is constrained by the regular rectangular sliding window and limited receptive field of convolutional neural networks, the feature fusion module allows a more flexible and broader selection of sampling points for interaction, thereby extracting feature information better suited to low-light noisy images. At the same time, fine-tuning the network parameters with relatively few real low-light images allows faster adaptation to various real and complex low-light scenes, achieving better low-light enhancement and noise reduction than existing methods and producing clearer images. By combining a self-supervised task (randomly masking the original image and then reconstructing it) with a multi-scale feature fusion module guided by reconstructed features that fuses information from the different features, this invention reduces data volume and saves computing power while improving the network's ability to flexibly process information in low-light noisy images.
[0007] To achieve the above objectives, the present invention adopts the following technical solution:
[0008] This invention discloses a low-light image enhancement method based on random region hidden reconstruction. The network is trained in two stages that share the same feature extraction module (comprising a denoising feature encoder and a reconstruction feature encoder). The first stage trains on a more abundant dataset of images captured under good lighting conditions to obtain a better feature extractor for low-light noisy images. The second stage inherits the feature extractor from the first stage and then fuses the image features through a multi-scale feature fusion module guided by reconstruction features. By adjusting the network weights using a small number of real low-light noisy images, a more generalizable low-light enhancement network is obtained. Specifically, the method includes the following steps:
[0009] The specific steps for the first stage are as follows:
[0010] Step (1): First, the image dataset under good lighting conditions is preprocessed by randomly adding noise and reducing brightness to obtain a synthesized low-light noisy image.
[0011] Step (2): After dividing the low-light noisy image obtained in step (1) into multiple image blocks of size 16×16, a random sequence is generated based on a uniform distribution. The random values are sorted and mapped to the original image blocks. The image blocks are masked at a ratio of 75%.
[0012] Step (3): Input the two images, before and after masking, into a convolutional neural network with a kernel size of 16×16 and a fully connected layer, respectively, to obtain an image encoding of size 196×768. Apply sine-cosine positional encoding to each image block, and add the positional encoding to the image encoding to obtain the initial encoding information of each image block.
[0013] Step (4): Input the image encoding information into the denoising feature encoder and the reconstruction feature encoder to obtain high-dimensional feature maps.
[0014] Step (5): Input the feature maps output from the two-branch feature encoders into the multi-layer noise decoder and the multi-layer restoration decoder, respectively.
[0015] Step (6): Output the decoded noise distribution map and reconstructed map and compare them with the input image to calculate the reconstruction loss and perceptual loss.
[0016] The specific steps for the second stage are as follows:
[0017] Step (7): Inherit the denoising feature encoder and the reconstructed feature encoder obtained in the first stage task.
[0018] Step (8): Input the real-world low-light and noisy image dataset into the two-branch feature encoder obtained in the first stage task for feature extraction.
[0019] Step (9): Input the feature maps obtained from the two branches into a multi-scale feature fusion module guided by reconstructed features for feature fusion.
[0020] Step (10): Input the feature map obtained in step (9) into the low-light noise reduction decoder to obtain the result map after low-light noise reduction enhancement processing. Calculate the loss function with the ground truth and fine-tune the network weight parameters to enable it to quickly adapt to various complex real noise images.
[0021] The advantages of this invention compared to the prior art are as follows:
[0022] 1. The first phase of the task utilizes a dataset with richer data under good lighting conditions for network training to obtain a feature extractor capable of processing low-light noisy images. Compared with existing enhancement methods that amplify noise after low-light enhancement, this invention reduces noise after enhancement while avoiding the problem of overfitting due to the small number of real low-light image datasets.
[0023] 2. Applying a 75% masking rate to the original image reduces the amount of data input to the network for feature extraction to only 1/4 of the original image's size (25% of the image blocks remain visible).
[0024] The computational load is reduced accordingly, significantly saving computing power and accelerating network training.
[0025] 3. Fine-tuning the network weight parameters using real low-light images with relatively small amounts of data allows the network to better adapt to real complex low-light scenes, resulting in better robustness and generalization.
[0026] 4. Compared to the regular rectangular sliding window in convolutional neural networks, the feature fusion module of this invention generates the offsets of the sampling points relative to each feature point from the reconstruction features, thereby obtaining a more flexible interaction range.
[0027] 5. Compared to the U-Net denoising network, which loses texture detail information due to multi-layer downsampling, this invention uses sampling points at multiple scales when performing feature fusion. This preserves the details in noisy images, makes the method better suited to noisy images, and reduces the image blurring caused by denoising.
Attached Figure Description
[0028] Figure 1 is a flowchart of the two-stage task in this invention;
[0029] Figure 2 is a schematic diagram of the feature fusion module guided by reconstruction features in the second-stage task of the present invention;
[0030] Figure 3 is a schematic diagram of the fusion module within the feature fusion module guided by reconstruction features in the second-stage task of the present invention.
Detailed Implementation
[0031] To make the objectives, technical solutions, and advantages of this invention clearer, the invention will be further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative and not intended to limit the invention. Furthermore, the technical features involved in the various embodiments of this invention described below can be combined with each other as long as they do not conflict with each other.
[0032] This invention first trains a neural network to reconstruct images after random regions are masked, enabling the neural network to learn detailed deep semantic information about low-light noisy images. Then, it uses real low-light image pairs to fine-tune the feature extraction network inherited from the first stage, and achieves the fusion of different types of features through multi-scale feature fusion guided by reconstructed features. This results in low-light enhancement and noise reduction with stronger generalization under various real low-light conditions, suitable for low-visibility scenarios, and providing clear images for systems such as autonomous driving, security monitoring, and target capture.
[0033] As shown in Figure 1, the low-light image enhancement method based on random region hidden reconstruction of the present invention trains the network in two stages to obtain the low-light enhancement model, and the two stages share a denoising feature encoder and a reconstruction feature encoder.
[0034] The first stage includes the following steps:
[0035] Step (1): Apply Gaussian noise and Poisson noise to the image under good lighting conditions, and randomly reduce the image brightness to generate a low-light noisy image with 3 channels, a height of 224 and a width of 224.
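Step (1) can be illustrated with a minimal NumPy sketch. The brightness range, Gaussian standard deviation, and photon scale below are hypothetical values chosen only for illustration, and the order of operations (darken, then Poisson noise, then Gaussian noise) is an assumption; the description does not specify them.

```python
import numpy as np

def synthesize_low_light(image, rng, brightness_range=(0.1, 0.5),
                         gaussian_sigma=0.02, photon_scale=200.0):
    """Turn a well-lit image into a synthetic low-light noisy image.

    image: float array in [0, 1] with shape (3, H, W).
    All numeric defaults are illustrative assumptions, not values from the patent.
    """
    # Randomly reduce brightness with one global factor per image.
    factor = rng.uniform(*brightness_range)
    dark = image * factor
    # Poisson noise: sample counts whose mean is proportional to intensity.
    shot = rng.poisson(dark * photon_scale) / photon_scale
    # Additive zero-mean Gaussian noise.
    noisy = shot + rng.normal(0.0, gaussian_sigma, size=image.shape)
    return np.clip(noisy, 0.0, 1.0).astype(np.float32)

rng = np.random.default_rng(0)
clean = rng.uniform(0.0, 1.0, size=(3, 224, 224)).astype(np.float32)
low = synthesize_low_light(clean, rng)  # 3 channels, height 224, width 224
```

Because the darkening factor is drawn per call, repeated calls on the same well-lit image yield different low-light variants, which is one simple way to diversify the synthetic training set.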
[0036] Step (2): Divide the 224×224 image into 14 × 14 = 196 image blocks, each of size 16×16. After dividing the low-light noisy image into image blocks, randomly select 75% of the image blocks for masking to obtain the masked low-light noisy image.
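The block division and random masking of Step (2) can be sketched as follows. The selection follows the summary's description (draw one uniform random value per block, sort the values, and map the resulting order back to the blocks); the function names and the flattening order of each block are assumptions made for this example.

```python
import numpy as np

def split_into_blocks(image, block=16):
    """Split a (C, H, W) image into flattened, non-overlapping block x block tiles."""
    c, h, w = image.shape
    gh, gw = h // block, w // block
    x = image.reshape(c, gh, block, gw, block)
    x = x.transpose(1, 3, 0, 2, 4)               # (gh, gw, C, block, block)
    return x.reshape(gh * gw, c * block * block)

def random_block_mask(num_blocks, mask_ratio, rng):
    """Return indices of visible blocks and a boolean mask (True = masked)."""
    num_keep = int(num_blocks * (1 - mask_ratio))
    scores = rng.uniform(size=num_blocks)         # one uniform value per block
    order = np.argsort(scores)                    # sorting yields a random permutation
    keep_idx = np.sort(order[:num_keep])          # blocks that stay visible
    mask = np.ones(num_blocks, dtype=bool)
    mask[keep_idx] = False
    return keep_idx, mask

rng = np.random.default_rng(0)
image = rng.uniform(size=(3, 224, 224))
blocks = split_into_blocks(image)                 # 196 blocks of 3*16*16 = 768 values
keep_idx, mask = random_block_mask(len(blocks), 0.75, rng)
visible = blocks[keep_idx]                        # 49 visible blocks
```

With a 75% ratio, 147 of the 196 blocks are masked and 49 remain visible, compared with 224 × 224 = 50176 positions if individual pixels were used as the sequence.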
[0037] Step (3): Input the two images, before and after masking, into a convolutional neural network with a kernel size of 16×16 and a fully connected layer, respectively, to obtain an image encoding of size 196×768. Apply sine-cosine positional encoding to each image block, and add the positional encoding to the image encoding to obtain the initial encoding information of each image block.
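A small sketch of Step (3) shows where the 196×768 shape comes from. It assumes the 16×16 convolution uses a stride of 16 (so it is equivalent to a linear projection of each flattened block) and that the positional encoding is the standard one-dimensional sine-cosine formulation; the description states neither detail, and the random weights are placeholders.

```python
import numpy as np

def sincos_position_encoding(num_positions=196, dim=768):
    """1-D sine-cosine encoding: even channels use sin, odd channels use cos."""
    positions = np.arange(num_positions)[:, None]                   # (196, 1)
    freqs = np.exp(-np.log(10000.0) * np.arange(0, dim, 2) / dim)   # (384,)
    angles = positions * freqs[None, :]                             # (196, 384)
    pe = np.zeros((num_positions, dim))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

def embed_blocks(blocks, weight, bias):
    """Project flattened blocks; equivalent to a 16x16 conv with stride 16 (assumed)."""
    return blocks @ weight + bias

rng = np.random.default_rng(0)
blocks = rng.uniform(size=(196, 3 * 16 * 16))      # flattened 16x16 RGB blocks
weight = rng.normal(0.0, 0.02, size=(768, 768))    # placeholder projection weights
pe = sincos_position_encoding()
tokens = embed_blocks(blocks, weight, np.zeros(768)) + pe   # initial encoding
```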
[0038] Step (4): Input the image encoding information into the denoising feature encoder and the reconstruction feature encoder to obtain high-dimensional feature maps. Each feature encoder uses the classic Transformer encoder structure and consists of the following modules and processing layers in sequence: ① normalization layer; ② multi-head self-attention layer; ③ residual layer; ④ normalization layer; ⑤ multilayer perceptron layer.
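The layer order listed in Step (4) can be sketched as a single encoder block in NumPy. The number of heads (12), the hidden width (3072), the ReLU activation, the absence of learned normalization parameters, and the second residual connection around the multilayer perceptron are assumptions borrowed from common Transformer practice; the random weights are placeholders.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_self_attention(x, wq, wk, wv, wo, num_heads):
    n, d = x.shape
    hd = d // num_heads
    def split(t):                                  # (n, d) -> (heads, n, hd)
        return t.reshape(n, num_heads, hd).transpose(1, 0, 2)
    q, k, v = split(x @ wq), split(x @ wk), split(x @ wv)
    attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(hd))   # (heads, n, n)
    out = (attn @ v).transpose(1, 0, 2).reshape(n, d)        # merge heads
    return out @ wo

def encoder_block(x, p, num_heads=12):
    # (1) normalization -> (2) multi-head self-attention -> (3) residual
    x = x + multi_head_self_attention(layer_norm(x), p["wq"], p["wk"],
                                      p["wv"], p["wo"], num_heads)
    # (4) normalization -> (5) multilayer perceptron (second residual is assumed)
    hidden = np.maximum(layer_norm(x) @ p["w1"], 0.0)
    return x + hidden @ p["w2"]

rng = np.random.default_rng(0)
dim, hidden_dim = 768, 3072
shapes = {"wq": (dim, dim), "wk": (dim, dim), "wv": (dim, dim),
          "wo": (dim, dim), "w1": (dim, hidden_dim), "w2": (hidden_dim, dim)}
params = {name: rng.normal(0.0, 0.02, size=shape) for name, shape in shapes.items()}
tokens = rng.normal(size=(49, dim))    # e.g. the 49 visible blocks of a masked image
out = encoder_block(tokens, params)
```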
[0039] Step (5): Input the feature maps output from the two branch feature encoders into the multilayer noise decoder and the multilayer restoration decoder, respectively. Both the multilayer noise decoder and the multilayer restoration decoder consist of 6 feature decoder layers. Each feature decoder layer consists of the following modules and processing layers in sequence: multi-head self-attention layer, residual layer, and normalization layer.
[0040] Step (6): Output the decoded noise distribution map and reconstructed map and compare them with the input image to calculate the reconstruction loss and perceptual loss.
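As a rough illustration of the two loss terms in Step (6), the sketch below uses a mean absolute error for the reconstruction loss and a mean squared error between feature maps for the perceptual loss, in line with the L1 and L2 norms named in the preferred embodiment. The feature extractor `toy_features` (simple image gradients) and the weighting factor `perceptual_weight` are hypothetical stand-ins; the description does not say which features or weights are used.

```python
import numpy as np

def l1_reconstruction_loss(pred, target):
    """Mean absolute error between the reconstruction and the reference image."""
    return float(np.mean(np.abs(pred - target)))

def l2_perceptual_loss(pred, target, feature_fn):
    """Mean squared error between features of the two images."""
    return float(np.mean((feature_fn(pred) - feature_fn(target)) ** 2))

def toy_features(img):
    """Hypothetical feature extractor: horizontal and vertical gradients."""
    gx = img[..., :, 1:] - img[..., :, :-1]
    gy = img[..., 1:, :] - img[..., :-1, :]
    return np.concatenate([gx.ravel(), gy.ravel()])

def total_loss(pred, target, perceptual_weight=0.1):
    # perceptual_weight is an illustrative value, not taken from the patent.
    return (l1_reconstruction_loss(pred, target)
            + perceptual_weight * l2_perceptual_loss(pred, target, toy_features))

rng = np.random.default_rng(0)
target = rng.uniform(size=(3, 32, 32))
pred = target + 0.1                       # constant brightness error
loss = total_loss(pred, target)
```

In this toy setup a constant brightness offset is penalized only by the L1 term, because the gradient features are unchanged; a real perceptual loss would depend on the chosen feature network.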
[0041] In the second stage, the corresponding modules of the second-stage network are initialized using the feature encoder weights obtained in the first stage, and the low-light enhancement network is trained using the following steps:
[0042] Step (7): Input the real low-light image into the second-stage network, repeat step (2) for encoding, then inherit the pre-trained denoising feature encoder and reconstruction feature encoder of the first stage for feature extraction, and input the features finally extracted from the upper and lower branches into the multi-scale feature fusion module guided by reconstruction features.
[0043] Step (8): As shown in Figure 2, the features from the two branches are input into a multi-scale feature fusion module guided by reconstructed features. Within this module, the features first pass through a fusion module. As shown in Figure 3, within the fusion module, the denoised image features and reconstructed image features are sequentially input into convolutional neural networks with a kernel size of 3×3 and strides of 1, 2, and 2 for downsampling, resulting in denoised image features and reconstructed image features at three scales. Then, the multi-scale reconstructed image features are input into convolutional neural networks with a kernel size of 3×3 and a stride of 1 at each scale, generating the offsets of 9 irregular sampling points relative to each feature point in the corresponding denoised image feature map at each scale. Next, guided by the multi-scale offset maps obtained from the reconstructed features, each feature point in the denoised image feature map at each scale is processed with its 27 irregular sampling points (9 for each of the three scales), generating an updated multi-scale denoised image feature map, which is then input into the next feature fusion layer. After the final fusion layer, the smaller-scale denoised image feature maps are upsampled and concatenated to obtain an updated denoised image feature map of the original size.
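The offset-guided sampling in Step (8) resembles deformable sampling, and a single-scale version can be sketched as below. This is an illustration under several assumptions: the tokens are arranged on a 14×14 grid, the 3×3 convolutions use a padding of 1, fractional positions are read with bilinear interpolation, and the 9 samples are combined with fixed weights. In the described module the offsets would be predicted from the reconstruction features, and the procedure would be repeated over the three scales to reach 27 sampling points per feature point.

```python
import numpy as np

def conv_out_size(size, kernel=3, stride=1, padding=1):
    """Spatial size after a convolution (padding of 1 is an assumption)."""
    return (size + 2 * padding - kernel) // stride + 1

def bilinear_sample(feat, y, x):
    """Interpolate feat (C, H, W) at fractional (y, x); zero outside the map."""
    c, h, w = feat.shape
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    out = np.zeros(c)
    for yy in (y0, y0 + 1):
        for xx in (x0, x0 + 1):
            if 0 <= yy < h and 0 <= xx < w:
                out += (1 - abs(y - yy)) * (1 - abs(x - xx)) * feat[:, yy, xx]
    return out

def guided_sampling(denoise_feat, offsets, weights):
    """Aggregate 9 irregular sampling points per feature point.

    denoise_feat: (C, H, W); offsets: (H, W, 9, 2) as (dy, dx) relative to each
    feature point; weights: (9,) aggregation weights (hypothetical).
    """
    c, h, w = denoise_feat.shape
    out = np.zeros((c, h, w))
    for i in range(h):
        for j in range(w):
            for k in range(9):
                dy, dx = offsets[i, j, k]
                out[:, i, j] += weights[k] * bilinear_sample(denoise_feat, i + dy, j + dx)
    return out

# Feature-map sizes after the three 3x3 convolutions with strides 1, 2, 2.
size, scales = 14, []
for stride in (1, 2, 2):
    size = conv_out_size(size, stride=stride)
    scales.append(size)

# Example at the middle scale: a horizontal ramp shifted by half a position.
feat = np.tile(np.arange(7, dtype=float), (4, 7, 1))     # feat[c, y, x] = x
offsets = np.zeros((7, 7, 9, 2))
offsets[..., 1] = 0.5                                    # dx = 0.5 for every point
shifted = guided_sampling(feat, offsets, np.full(9, 1 / 9))
```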
[0044] Step (9): Multiply the output of the fusion module in step (8) with the parameter matrices Wq and Wk respectively to obtain feature maps Q and K. Multiply the input denoised image features with the parameter matrix Wv to obtain feature map V. Perform cross-attention calculation on Q, K, and V, and output the resulting feature map as a new denoised image feature.
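The cross-attention of Step (9) can be written compactly. The sketch below follows the description in taking Q and K from the fusion module output and V from the input denoised image features; the single attention head, the scaling by the square root of the feature dimension, and the small random matrices are assumptions for illustration.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def fusion_cross_attention(fused, denoise, wq, wk, wv):
    """fused: (N, D) fusion-module output; denoise: (N, D) input denoised features."""
    q = fused @ wq                      # queries from the fusion output
    k = fused @ wk                      # keys from the fusion output
    v = denoise @ wv                    # values from the input denoised features
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v          # new denoised image features, (N, D)

rng = np.random.default_rng(0)
n, d = 196, 64                          # small D keeps the example light
fused = rng.normal(size=(n, d))
denoise = rng.normal(size=(n, d))
wq, wk, wv = (rng.normal(0.0, 0.1, size=(d, d)) for _ in range(3))
new_denoise = fusion_cross_attention(fused, denoise, wq, wk, wv)
```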
[0045] Step (10): After repeating steps (8) and (9) N times, input the denoised image features from the last layer into a multi-layer decoder consisting of 6 low-light denoising decoder layers. Each decoder layer consists of a multi-head self-attention layer, a residual layer, and a normalization layer. The final enhanced output image is compared with the ground truth, and the L2 loss and perceptual loss are calculated. The overall network is then fine-tuned so that it adapts well to real-world low-light scenes and produces the enhanced image.
[0046] Preferably, the reconstruction loss and perceptual loss functions in steps (6) and (10) are the L1 and L2 norms, respectively. During the first stage of training, the Adam optimizer is used, where the exponential decay rate β1 of the first moment estimate is 0.9 and the exponential decay rate β2 of the second moment estimate is 0.999. The low-light denoising reconstruction model of the first stage is trained for 100 epochs with an initial learning rate of 0.0006, which linearly decays to 0.0003 after 50 epochs. After the first stage of training, the low-light enhancement denoising model of the second stage is fine-tuned for 30 epochs on a real low-light dataset with a learning rate of 0.0002. To make fuller use of the data, the 500 real low-light image pairs are randomly cropped and horizontally flipped during the second stage of training for data augmentation, improving the low-light enhancement and denoising performance of the second-stage task.
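For the second-stage data augmentation, the low-light image and its well-lit reference must receive the same crop window and the same flip, otherwise the pair no longer shows the same scene content. A minimal sketch follows; the crop size of 224 and the 50% flip probability are assumptions, since the description gives neither.

```python
import numpy as np

def paired_random_crop_flip(low, ref, crop, rng):
    """Apply one random crop and one random horizontal flip to both images.

    low, ref: (C, H, W) arrays of the same scene with identical shapes.
    """
    _, h, w = low.shape
    top = rng.integers(0, h - crop + 1)        # upper bound is exclusive
    left = rng.integers(0, w - crop + 1)
    low_c = low[:, top:top + crop, left:left + crop]
    ref_c = ref[:, top:top + crop, left:left + crop]
    if rng.random() < 0.5:                     # flip both or neither
        low_c, ref_c = low_c[:, :, ::-1], ref_c[:, :, ::-1]
    return np.ascontiguousarray(low_c), np.ascontiguousarray(ref_c)

rng = np.random.default_rng(0)
low = rng.uniform(size=(3, 400, 600))
ref = 2.0 * low + 1.0                          # synthetic stand-in for the paired reference
low_aug, ref_aug = paired_random_crop_flip(low, ref, 224, rng)
```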
[0047] Those skilled in the art will readily understand that the above description is merely a preferred embodiment of the present invention and is not intended to limit the present invention. Any modifications, equivalent substitutions, and improvements made within the spirit and principles of the present invention should be included within the scope of protection of the present invention.
Claims
1. A low-light image enhancement method based on random region hidden reconstruction, characterized in that it includes the following steps:
Step 1: Obtain a dataset of 10,000 images captured under good lighting conditions, the data type being color visible-light images, and divide it into training and testing sets; obtain a dataset containing 500 pairs of low-light noisy images and well-lit images of the same real scenes, the data type being color visible-light images, and divide it into training and testing sets.
Step 2: Use the dataset of 10,000 images under good lighting conditions for the first-stage network. First, preprocess the dataset by adding noise and reducing brightness. Then, after random regions of the images are hidden, input them into the first-stage network. Train the network to restore and reconstruct the randomly hidden low-light noisy images, obtaining a denoising feature encoder and a reconstruction feature encoder with good information extraction ability for low-light noisy images.
Step 3: Use the 500 pairs of low-light noisy images of real scenes for the second-stage network. Input them into the feature encoders obtained in Step 2 for feature extraction. Then, input the denoised image features and reconstructed image features into the multi-scale feature fusion module guided by the reconstructed features for feature fusion. Then, input the fused features into the low-light noise reduction decoder. Finally, use the paired well-lit images of the same scenes as the ground truth for network training, so that the network parameters can quickly adapt to real low-light scenes. After enhancement by the network, a low-light enhanced image with improved brightness and better quality is obtained.
The input to the multi-scale feature fusion module guided by reconstruction features is the denoised image features and reconstructed image features obtained through the two feature encoders. First, the denoised and reconstructed image features are input into the fusion module. Within the fusion module, the two features are sequentially input into convolutional neural networks with a kernel size of 3×3 and strides of 1, 2, and 2 for downsampling, resulting in denoised and reconstructed image features at three different scales. Then, the multi-scale reconstructed image features are input into convolutional neural networks with a kernel size of 3×3 and a stride of 1, generating, at each scale, the offsets of 9 irregular sampling points relative to each feature point in the denoised image feature map. Guided by the multi-scale offset maps obtained from the reconstructed features, each feature point in the denoised image feature map at each scale is convolved with 27 sampling points (9 for each of the three scales) to generate a new multi-scale denoised image feature map, which is then input into the next feature fusion layer. After the final fusion layer, the small-scale denoised image feature maps are sequentially upsampled and concatenated to obtain the denoised image feature map of the original size. This feature map is then input into a low-light noise reduction decoder for feature decoding to obtain the enhanced image. The decoder uses the Transformer decoder structure.
2. The low-light image enhancement method based on random region hidden reconstruction according to claim 1, characterized in that the preprocessing in step 2 includes applying Gaussian noise and Poisson noise and uniformly reducing the image brightness; the hiding of the random regions involves dividing the low-light noisy image into multiple image blocks of size 16×16, generating a random sequence based on a uniform distribution, sorting the random values and mapping them to the original image blocks, and masking the image blocks at a ratio of 75%.
3. The low-light image enhancement method based on random region hidden reconstruction according to claim 1, characterized in that, in step 2, the feature encoder of the image restoration and reconstruction branch only processes visible pixels that are not covered by the mask, while the decoder is responsible for using the image features extracted by the reconstruction feature encoder and the information of the covered pixels to perform the image reconstruction subtask.
4. The low-light image enhancement method based on random region hidden reconstruction according to claim 1, characterized in that the feature encoder in step 2 uses the encoder structure of the classic Transformer; finally, the feature maps of the two branches are input into the feature decoders for the noise localization task and the image restoration and reconstruction task, and the decoders use the decoder structure of the Transformer.
5. The low-light image enhancement method based on random region hidden reconstruction according to claim 1, characterized in that the reconstruction loss and perceptual loss functions in steps 2 and 3 are the L1 and L2 norms, respectively. The Adam optimizer is used during the first stage of training, with an exponential decay rate β1 = 0.9 for the first moment estimate and β2 = 0.999 for the second moment estimate. The low-light denoising reconstruction model of the first stage is trained for 100 epochs with an initial learning rate of 0.0006, which linearly decays to 0.0003 after 50 epochs. After the first stage of training, the low-light enhancement denoising model of the second stage is trained for 30 epochs on a real low-light dataset with a learning rate of 0.0002. To make fuller use of the data, the 500 real low-light image pairs are randomly cropped and horizontally flipped during training to augment the data and improve the low-light enhancement performance of the second-stage task.