A fixed pattern noise removal method based on a coding-decoding network
By building a lightweight codec network and training a deep learning model using a dataset specific to image sensors, the problem of eliminating fixed-pattern noise in image sensors is solved. This achieves unified elimination and improved computational efficiency across different devices, making it suitable for resource-constrained edge devices.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- HEFEI JUNZHENG TECH CO LTD
- Filing Date
- 2024-12-17
- Publication Date
- 2026-06-19
Figure CN122243784A_ABST
Abstract
Description
Technical Field
[0001] This invention belongs to the field of image noise reduction technology, and specifically relates to a fixed-pattern noise elimination method based on an encoding / decoding network.
Background Technology
[0002] Fixed Pattern Noise (FPN) is a common type of noise in image sensors. During the manufacturing process of an image sensor, the response characteristics of each pixel may differ slightly. These minute manufacturing differences cause each pixel to produce different output signals under the same conditions. This difference in response characteristics constitutes fixed pattern noise. FPN exists in a fixed spatial pattern, typically manifesting as independent point noise and blocky brightness inhomogeneities in the image. Fixed pattern noise significantly reduces the signal-to-noise ratio of an image, especially in low-light conditions, where the noise becomes more pronounced and severely degrades image quality. Therefore, eliminating FPN is crucial for improving image quality in low-light environments.
[0003] Fixed-pattern noise can be reduced or eliminated in various ways. For example, during the manufacturing of image sensors, precise processes and calibration steps can be used to reduce the response differences between pixels, or high-precision manufacturing processes and materials can be used to reduce manufacturing inconsistencies. Alternatively, the fixed spatial pattern of the FPN can be exploited: the FPN is calibrated and stored in a preset storage medium, and is then removed through digital signal processing while the electronic imaging device is in operation.
[0004] However, the shortcomings of existing technology are:
[0005] Optimizing the manufacturing process of image sensors and using higher purity materials significantly increases hardware costs, which is not conducive to the cost reduction and popularization of electronic imaging devices. Calibration can basically eliminate FPN, but the FPN spatial patterns of different individual image sensors of the same model are different. Each imaging device needs to calibrate the FPN in a dark environment, which adds an extra production step.
[0006] Furthermore, the terminology used in this art includes:
[0007] Encoder-decoder network: A neural network architecture consisting of an encoder and a decoder, often used for image-to-image tasks. The encoder is used to extract features while gradually reducing the spatial dimensions of the image, and the decoder is used to restore the low-resolution feature maps generated by the encoder to the original resolution.
[0008] Fixed Pattern Noise (FPN): Fixed pattern noise is a common type of noise in image sensors. During the manufacturing process of an image sensor, the response characteristics of each pixel may be slightly different. These small manufacturing differences can cause each pixel to produce different output signals under the same conditions. This difference in response characteristics constitutes fixed pattern noise.
[0009] RAW images refer to raw image data that has not undergone any processing or decoding, existing in the form of uncompressed, raw pixel information.
[0010] Black frame: An image output by an image sensor in the absence of light.
[0011] Bayer format: A method of representing color images based on the arrangement of a color filter array (CFA). The most common form is the Bayer RGGB pattern, in which each 2×2 unit of the image sensor carries four color filters (one red (R), two green (G), and one blue (B)) arranged in a fixed order to capture color image information.
Summary of the Invention
[0012] To address the aforementioned issues, the purpose of this application is to: train a deep learning model for eliminating fixed-pattern noise by creating an image sensor-specific dataset and building a general encoding / decoding network, thereby mitigating the negative impact of fixed-pattern noise on image quality in low-light environments.
[0013] Specifically, the present invention provides a fixed-pattern noise cancellation method based on a codec network, the method comprising the following steps:
[0014] S1. Acquire RAW images using an electronic imaging device equipped with a target image sensor, and create a paired dataset for training a denoising model;
[0015] S2. Perform necessary preprocessing on the dataset, and expand the dataset size and increase its diversity through data augmentation;
[0016] S3. Design a lightweight codec network consisting of an encoder, a connection layer, a decoder, and an output layer;
[0017] S4. Train the encoder-decoder network based on the preprocessed paired dataset. The preprocessed and data-augmented noisy image $I_{noisy}$ is fed into the encoding / decoding network, where it undergoes encoding and decoding sequentially to obtain the predicted noise map. The MAE function is used to calculate the error between the denoised image $\hat{I}$ and the reference image $I_{ref}$. The error is calculated as follows:
[0018] $$L_{MAE} = \frac{1}{K}\sum_{k=1}^{K}\left|\hat{I}_{k} - I_{ref,k}\right|$$ where $k$ indexes the $K$ pixel values of the image.
[0019] The gradient is calculated through backpropagation of the error, the network weight parameters are updated using the Adam optimizer, and a cosine annealing learning rate reduction strategy is used during training, with the half-cycle of the cosine function set as the maximum number of iterations.
[0020] S5. After the model training reaches convergence, the preprocessed, uncropped complete image is loaded into the model, and the denoised image is obtained after post-processing.
[0021] Step S1 further includes:
[0022] S1.1 In a low-light environment, N original noisy images are acquired. The illumination range of the low-light environment is consistent with the illumination range in which electronic imaging devices such as cameras and webcams actually work. The number of noisy images N is determined by the theoretical upper limit of the temporal denoising strength. Temporal denoising essentially removes noise by weighted averaging of the noisy images in a time series; its theoretical upper limit is expressed as the number of noisy frames whose average has the same noise level as the denoised image. ISO values are selected in sequence within the ISO setting range: assuming that the ISO range of the imaging device is 1000 to 16000, ISO is set to 1000, 2000, 4000, 8000 and 16000 in sequence. M black frames are acquired in a dark environment under each ISO setting. Assuming that all noise other than FPN is zero-mean, the approximate FPN is obtained by superimposing and averaging the M black frames, as shown in Equation (1). Theoretically, the larger the value of M, the higher the accuracy of the FPN.
[0023] $$FPN = \frac{1}{M}\sum_{i=1}^{M} I_{dark(i)} \quad (1)$$
[0024] In the above formula, $I_{dark(i)}$ represents the i-th frame in the set of M black frames;
[0025] S1.2 Based on the acquired original noisy images and the FPNs obtained under different ISOs, a paired dataset is created. Two types of paired datasets are constructed according to whether temporal denoising is applied: one without temporal denoising and one with temporal denoising. If the image before FPN removal has not undergone temporal denoising, the input is any one of the N original noisy images, denoted $I_{noisy}$, and the output $I_{ref}$ is the result of subtracting the FPN from that noisy image. If the image before FPN removal has undergone temporal denoising, the input is the image $\bar{I}_{noisy}$ obtained by averaging n original noisy images, and the output $\bar{I}_{ref}$ is the result of subtracting the FPN from the averaged image. The calculations are as follows:
[0026] $$\bar{I}_{noisy} = \frac{1}{n}\sum_{i=1}^{n} I_{noisy(i)} \quad (2)$$
[0027] $$I_{ref} = \mathrm{clip}(I_{noisy} - FPN,\ 0,\ wp - bl) \quad (3)$$ $$\bar{I}_{ref} = \mathrm{clip}(\bar{I}_{noisy} - FPN,\ 0,\ wp - bl) \quad (4)$$
[0028] In the above formulas, $I_{noisy(i)}$ represents the i-th image among the N original noisy images; clip is a value-range truncation operation, where clip(x, a, b) limits the value of x to the range a to b; bl represents the black level and wp represents the white point value.
[0029] In step S1.1, the illuminance range is assumed to be 0.01 to 0.1 lux; N = 100; M = 500; in step S1.2, in order to adapt to different temporal denoising intensities and increase the diversity of noise, n is set to a random value between 30 and 50.
[0030] Step S2 further includes:
[0031] S2.1 Black level correction and normalization are performed on the noisy image $I_{noisy}$ containing FPN and the reference image $I_{ref}$ without FPN. Black level correction corrects the reference-level offset that the image sensor produces when there is no light input. Normalization improves the efficiency and effectiveness of model training. The calculation process is as follows:
[0032]
[0033] S2.2 The acquired RAW image is in Bayer format. The computational cost of the encoding / decoding network depends strongly on the resolution of its input image, so converting the image to four channels before feeding it into the network reduces the computational cost. The pixels of the R, Gr, Gb and B channels of the Bayer-format image are extracted to obtain four single-channel images with half the width and half the height, which are then stacked along the channel dimension to obtain a four-channel image.
[0034] S2.3 The number of noisy images that can be captured is limited. To expand the dataset, non-overlapping cropping and random horizontal and vertical flipping are applied to the noisy image $I_{noisy}$. Converting the Bayer format to four channels before cropping avoids destroying the Bayer pattern. Non-overlapping cropping samples the complete four-channel image with a sliding window of fixed size 256×256 whose stride equals the window size.
[0035] In step S2.2, the Bayer format includes RGGB and BGGR pixel arrangement methods.
[0036] Step S3 further includes:
[0037] S3.1 Design the encoder; the encoder consists of three consecutive "convolutional layers + downsampling layers", which gradually reduce the spatial resolution while extracting the features of the input image. The feature map output by each convolutional layer is passed to the decoder through cross-layer connections to realize feature reuse.
[0038] S3.2 Design the intermediate layer; the intermediate layer connects the encoder and decoder and consists of n consecutive "convolution + activation function" structures. The role of the intermediate layer is to further expand the network's receptive field beyond that of the encoder, enabling the network to acquire higher-level semantic information. The number n of "convolution + activation function" structures controls the network's maximum receptive field, which is determined by the spatial resolution of the input image.
S3.3 Design the decoder; the decoder consists of three consecutive "transposed convolution + convolutional layers", which progressively restore the low-resolution feature maps to the original spatial resolution of the input. During image restoration, the decoder uses the multi-scale features transmitted by the encoder to compensate for the information loss caused by downsampling and upsampling operations, enhancing the network's representational and detail restoration capabilities. After the feature map concatenation operation, learnable channel weights are inserted to recalibrate the importance of each channel. The calculation method is as follows:
[0039] $$F_{recal} = F * weight \quad (7)$$
[0040] In the above formula, F represents the concatenated feature map, weight is the learnable channel weights, * indicates element-wise multiplication, and $F_{recal}$ is the feature map after channel importance recalibration;
[0041] S3.4 Design the output layer; the input feature map of the output layer has the original resolution of the image. In order to reduce the amount of computation, the output layer is set to a 1×1 convolution.
[0042] In step S3.2, n = 3 is set.
[0043] Therefore, this application has the following advantages:
[0044] (1) The design utilizes the powerful fitting ability of the encoding and decoding network to eliminate prominent fixed-pattern noise in low-light images, and is suitable for different types of image sensors;
[0045] (2) The designed encoding and decoding network has low computational cost and is suitable for resource-constrained edge devices.
Attached Figure Description
[0046] The accompanying drawings, which are provided to further illustrate the invention and form part of this application, are not intended to limit the scope of the invention.
[0047] Figure 1 is a flowchart illustrating the method.
[0048] Figure 2 is a schematic diagram of the overall structure of the lightweight codec network designed in step S3 of this method.
Detailed Implementation
[0049] To better understand the technical content and advantages of the present invention, the present invention will now be described in further detail with reference to the accompanying drawings.
[0050] This invention provides a fixed-pattern noise cancellation method based on a codec network, the overall process of which is shown in Figure 1. The method includes the following steps:
[0051] Step S1. Acquire RAW images using an electronic imaging device equipped with a target image sensor, and create a paired dataset for training the denoising model;
[0052] Step S1.1 Acquire N original noisy images in a low-light environment. The illumination range of the low-light environment is consistent with the illumination range in which electronic imaging devices such as cameras and webcams actually operate, for example, 0.01 to 0.1 lux. The number of noisy images N is determined by the theoretical upper limit of the temporal denoising strength. Temporal denoising essentially removes noise by weighted averaging of the noisy images in a time series; its theoretical upper limit is expressed as the number of noisy frames whose average has the same noise level as the denoised image. In this example, N = 100 is set. ISO values are selected sequentially within the ISO setting range. For example, if the ISO range of the imaging device is 1000 to 16000, ISO is set to 1000, 2000, 4000, 8000 and 16000 sequentially. M black frames are acquired in a dark environment under each ISO setting. Assuming that all noise other than FPN is zero-mean, the approximate FPN is obtained by superimposing and averaging the M black frames, as shown in Equation (1). Theoretically, the larger the value of M, the higher the accuracy of the FPN. In this example, M is set to 500.
[0053] $$FPN = \frac{1}{M}\sum_{i=1}^{M} I_{dark(i)} \quad (1)$$
[0054] In the above formula, $I_{dark(i)}$ represents the i-th frame in the set of M black frames.
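The averaging in step S1.1 can be illustrated with a minimal NumPy sketch. The function name `estimate_fpn`, the 8×8 frame size, and the synthetic noise levels are assumptions chosen for illustration only; they do not come from the method itself.

```python
import numpy as np

def estimate_fpn(black_frames: np.ndarray) -> np.ndarray:
    """Average M black frames (shape M x H x W) into one FPN estimate (H x W)."""
    # Accumulate in float64 so integer RAW data neither overflows nor rounds.
    return black_frames.astype(np.float64).mean(axis=0)

# Synthetic check: a fixed spatial pattern plus zero-mean temporal noise.
rng = np.random.default_rng(0)
pattern = rng.normal(64.0, 2.0, size=(8, 8))                # hypothetical FPN
frames = pattern + rng.normal(0.0, 4.0, size=(500, 8, 8))   # M = 500 black frames
fpn = estimate_fpn(frames)
residual = np.abs(fpn - pattern).max()                      # shrinks as M grows
```

Because the temporal noise is zero-mean, its contribution to the average falls roughly as 1/sqrt(M), which is why a larger M gives a more accurate FPN estimate.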
[0055] Step S1.2: Based on the acquired original noisy images and the FPNs obtained under different ISOs, a paired dataset is created. Two types of paired datasets are constructed according to whether temporal denoising is applied: one without temporal denoising and one with temporal denoising. If the image before FPN removal has not undergone temporal denoising, the input is any one of the N original noisy images, denoted $I_{noisy}$, and the output $I_{ref}$ is the result of subtracting the FPN from that noisy image. If the image before FPN removal has undergone temporal denoising, the input is the image $\bar{I}_{noisy}$ obtained by averaging n original noisy images, and the output $\bar{I}_{ref}$ is the result of subtracting the FPN from the averaged image. The calculations are as follows:
[0056] $$\bar{I}_{noisy} = \frac{1}{n}\sum_{i=1}^{n} I_{noisy(i)} \quad (2)$$
[0057] $$I_{ref} = \mathrm{clip}(I_{noisy} - FPN,\ 0,\ wp - bl) \quad (3)$$
[0058] $$\bar{I}_{ref} = \mathrm{clip}(\bar{I}_{noisy} - FPN,\ 0,\ wp - bl) \quad (4)$$
[0059] In the above formulas, $I_{noisy(i)}$ represents the i-th image among the N original noisy images; clip is a value-range truncation operation, where clip(x, a, b) limits the value of x to the range a to b; bl represents the black level and wp represents the white point value. To adapt to different temporal denoising strengths and increase the diversity of the noise, n is set to a random value between 30 and 50 in this embodiment.
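The two kinds of training pairs in step S1.2 can be sketched as follows. This is an illustrative sketch only: the function names, the 10-bit white point, and the choice to draw the n frames at random without replacement are assumptions, since the text does not say which n frames are averaged.

```python
import numpy as np

def make_pair(noisy, fpn, bl, wp):
    """Single-frame pair: subtract the FPN and clip to [0, wp - bl]."""
    ref = np.clip(noisy.astype(np.float64) - fpn, 0.0, wp - bl)
    return noisy, ref

def make_averaged_pair(noisy_stack, fpn, bl, wp, rng):
    """Temporally averaged pair built from n frames, with n drawn from 30..50."""
    n = int(rng.integers(30, 51))  # upper bound is exclusive, so 50 is included
    # Assumption: the n frames are picked at random without replacement.
    idx = rng.choice(noisy_stack.shape[0], size=n, replace=False)
    avg = noisy_stack[idx].astype(np.float64).mean(axis=0)
    ref = np.clip(avg - fpn, 0.0, wp - bl)
    return avg, ref

rng = np.random.default_rng(0)
stack = rng.integers(60, 80, size=(100, 4, 4))   # N = 100 hypothetical noisy frames
fpn_map = np.full((4, 4), 64.0)                  # hypothetical FPN estimate
single_in, single_ref = make_pair(stack[0], fpn_map, bl=64, wp=1023)
avg_in, avg_ref = make_averaged_pair(stack, fpn_map, bl=64, wp=1023, rng=rng)
```

Drawing a different n for each pair exposes the network to several residual noise levels, which matches the stated goal of covering different temporal denoising strengths.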
[0060] Step S2. Perform necessary preprocessing on the dataset, expand the dataset size and increase dataset diversity through data augmentation;
[0061] Step S2.1 Black level correction and normalization are performed on the noisy image $I_{noisy}$ containing FPN and the reference image $I_{ref}$ without FPN. Black level correction corrects the reference-level offset that the image sensor produces when there is no light input. Normalization improves the efficiency and effectiveness of model training. The calculation process is as follows:
[0062]
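A minimal sketch of black level correction followed by normalization, assuming the common form (I - bl) / (wp - bl). This exact expression, the function name, and the example black level and white point are assumptions and are not taken from the text above.

```python
import numpy as np

def black_level_normalize(raw, bl, wp):
    """Subtract the black level, then scale so that the white point maps to 1."""
    # Assumed form; the sensor-specific bl and wp must be supplied by the caller.
    return (raw.astype(np.float32) - bl) / float(wp - bl)

example = np.array([[64, 1023], [543, 64]], dtype=np.uint16)  # hypothetical 10-bit data
normalized = black_level_normalize(example, bl=64, wp=1023)
```

Under this assumed form, values at the black level map to 0 and values at the white point map to 1, so the network sees inputs in a fixed range regardless of sensor bit depth.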
[0063] Step S2.2 The acquired RAW image is generally in Bayer format, for example with an RGGB or BGGR pixel arrangement. The computational cost of the encoding / decoding network depends strongly on the resolution of its input image, so converting the image to four channels before feeding it into the network significantly reduces the computational cost. The pixels of the R, Gr, Gb and B channels of the Bayer-format image are extracted to obtain four single-channel images with half the width and half the height, which are then stacked along the channel dimension to obtain a four-channel image.
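The Bayer-to-four-channel conversion can be sketched with strided slicing. The function name `pack_bayer` and the plane order (top-left, top-right, bottom-left, bottom-right of each 2×2 cell) are illustrative assumptions.

```python
import numpy as np

def pack_bayer(raw: np.ndarray) -> np.ndarray:
    """Pack an H x W Bayer mosaic into a 4 x H/2 x W/2 array."""
    h, w = raw.shape
    if h % 2 or w % 2:
        raise ValueError("Bayer mosaic must have even height and width")
    # For RGGB the planes are R, Gr, Gb, B; for BGGR they are B, Gb, Gr, R.
    return np.stack(
        [raw[0::2, 0::2], raw[0::2, 1::2], raw[1::2, 0::2], raw[1::2, 1::2]],
        axis=0,
    )

mosaic = np.arange(16).reshape(4, 4)   # tiny stand-in for a RAW frame
packed = pack_bayer(mosaic)            # shape (4, 2, 2)
```

Each convolution then runs over a quarter as many spatial positions as it would on the full mosaic, which is where the saving in computation comes from.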
[0064] Step S2.3 The number of noisy images that can be captured is limited. To expand the dataset, non-overlapping cropping and random horizontal and vertical flipping are applied to the noisy image $I_{noisy}$. Converting the Bayer format to four channels before cropping avoids destroying the Bayer pattern. Non-overlapping cropping samples the complete four-channel image with a sliding window of fixed size 256×256 whose stride equals the window size.
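The cropping and flipping in step S2.3 might look like the sketch below. Dropping partial windows at the image border, applying identical flips to both images of a pair, and the 0.5 flip probability are assumptions; the function names are hypothetical.

```python
import numpy as np

def tile_crops(img, size=256):
    """Yield non-overlapping size x size windows from a C x H x W array."""
    _, h, w = img.shape
    # Stride equals the window size; incomplete border windows are skipped.
    for top in range(0, h - size + 1, size):
        for left in range(0, w - size + 1, size):
            yield img[:, top:top + size, left:left + size]

def random_flip(noisy, ref, rng):
    """Apply the same random horizontal and vertical flips to a training pair."""
    if rng.random() < 0.5:
        noisy, ref = noisy[:, :, ::-1], ref[:, :, ::-1]   # horizontal flip
    if rng.random() < 0.5:
        noisy, ref = noisy[:, ::-1, :], ref[:, ::-1, :]   # vertical flip
    return noisy, ref

rng = np.random.default_rng(0)
full = rng.random((4, 600, 520))            # hypothetical packed four-channel image
crops = list(tile_crops(full))              # 2 x 2 = 4 windows of 256 x 256
flipped_a, flipped_b = random_flip(crops[0], crops[0], rng)
```

Flipping both images of a pair with the same random draw keeps the noisy input and its reference spatially aligned.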
[0065] Step S3. Design a lightweight codec network, which consists of an encoder, a connection layer, a decoder, and an output layer. The overall structure is shown in Figure 2;
[0066] Step S3.1 Design the encoder. The encoder consists of three consecutive "convolutional layers + downsampling layers". While extracting features from the input image, it gradually reduces the spatial resolution. The feature map output by each convolutional layer is passed to the decoder through cross-layer connections to achieve feature reuse.
[0067] Step S3.2 Design the intermediate layer. The intermediate layer is used to connect the encoder and decoder, and consists of n consecutive "convolution + activation function" structures. The role of the intermediate layer is to further expand the receptive field of the network based on the encoder, so that the network can acquire higher-level semantic information. The number n of "convolution + activation function" structures is used to control the maximum receptive field of the network, which is determined by the spatial resolution of the input image. In this embodiment, n = 3 is set.
[0068] Step S3.3: Design the decoder. The decoder consists of three consecutive "transposed convolution + convolutional layers," which progressively restore the low-resolution feature maps to the original spatial resolution of the input. During image restoration, the multi-scale features transmitted by the encoder compensate for information loss caused by downsampling and upsampling operations, enhancing the network's representational and detail restoration capabilities. Learnable channel weights are inserted after the feature map concatenation operation to recalibrate the importance of each channel. The calculation method is as follows:
[0069] $$F_{recal} = F * weight \quad (7)$$
[0070] In the above formula, F represents the concatenated feature map, weight is the learnable channel weights, * indicates element-wise multiplication, and $F_{recal}$ is the feature map after channel importance recalibration.
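Equation (7) can be illustrated with NumPy broadcasting. The sketch assumes one scalar weight per channel, which is one reading of "learnable channel weights"; in training the weights would be parameters updated by the optimizer, whereas here they are fixed numbers for illustration.

```python
import numpy as np

def recalibrate(features, weight):
    """Scale each channel of a C x H x W feature map by its own weight."""
    # weight has shape (C,); the added axes broadcast it over H and W.
    return features * weight[:, None, None]

concat = np.ones((3, 2, 2))                    # hypothetical concatenated features
channel_weight = np.array([0.5, 1.0, 2.0])     # hypothetical learned values
recal = recalibrate(concat, channel_weight)
```

Compared with an attention block that computes weights from the input, a fixed per-channel scale adds only C multiplications per pixel, which fits the lightweight design goal.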
[0071] Step S3.4 Design the output layer. The input feature map of the output layer has the same resolution as the original image. To reduce computation, the output layer is set to a 1×1 convolution.
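To make the data flow of steps S3.1 to S3.4 concrete, the sketch below traces feature-map sizes through three downsampling stages, the intermediate layer, three upsampling stages, and the 1×1 output convolution. The channel widths (`base_channels`, doubling per stage), the factor-of-2 resampling, and the four-channel output are assumptions made for illustration; the text does not specify them.

```python
def trace_shapes(height, width, base_channels=16, levels=3):
    """Return (stage, channels, height, width) tuples through the network."""
    shapes = [("input", 4, height, width)]       # packed four-channel image
    c, h, w = base_channels, height, width
    skips = []
    for i in range(levels):                      # encoder: conv, then downsample by 2
        skips.append((c, h, w))                  # conv output kept for the skip link
        h, w = h // 2, w // 2
        shapes.append((f"enc{i + 1}", c, h, w))
        c *= 2
    shapes.append(("middle", c, h, w))           # conv + activation blocks keep the size
    for i in range(levels):                      # decoder: upsample by 2, concat, conv
        skip_c, skip_h, skip_w = skips.pop()
        h, w = h * 2, w * 2
        if (h, w) != (skip_h, skip_w):
            raise ValueError("input size must be divisible by 2 ** levels")
        c = skip_c
        shapes.append((f"dec{i + 1}", c, h, w))
    shapes.append(("output", 4, h, w))           # 1x1 convolution at full resolution
    return shapes

stages = trace_shapes(256, 256)
```

For a 256×256 training crop the intermediate layer works on 32×32 feature maps, so most of the receptive-field growth happens where each convolution is cheapest.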
[0072] Step S4. Train the encoder-decoder network based on the preprocessed paired dataset. The preprocessed and data-augmented noisy image $I_{noisy}$ is fed into the encoding / decoding network, where it undergoes encoding and decoding sequentially to obtain the predicted noise map. The MAE function is used to calculate the error between the denoised image $\hat{I}$ and the reference image $I_{ref}$. The error is calculated as follows:
[0073] $$L_{MAE} = \frac{1}{K}\sum_{k=1}^{K}\left|\hat{I}_{k} - I_{ref,k}\right|$$ where $k$ indexes the $K$ pixel values of the image.
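A minimal sketch of the MAE loss in step S4. It assumes that the denoised image is formed by subtracting the predicted noise map from the noisy input (a residual formulation); the text says the network predicts a noise map but does not spell out this subtraction, so treat it as an assumption.

```python
import numpy as np

def mae_loss(noisy, predicted_noise, ref):
    """Mean absolute error between the denoised image and the reference."""
    denoised = noisy - predicted_noise     # assumption: residual formulation
    return float(np.mean(np.abs(denoised - ref)))

noisy_img = np.array([[1.0, 2.0], [3.0, 4.0]])
noise_map = np.full((2, 2), 0.5)           # hypothetical network output
ref_img = noisy_img - 1.0                  # hypothetical reference image
loss = mae_loss(noisy_img, noise_map, ref_img)
```

MAE penalizes every residual linearly, so a few large errors influence the gradient less than they would under a squared-error loss.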
[0074] The gradient is calculated through backpropagation of the error, and the network weight parameters are updated using the Adam optimizer. During training, a cosine annealing learning rate descent strategy is used, with the half-cycle of the cosine function set as the maximum number of iterations. The cosine annealing learning rate descent strategy is a learning rate scheduling method used in optimization algorithms in machine learning and deep learning. It mimics the annealing process in physics, where materials are heated to a high temperature and then gradually cooled to achieve a more stable structure. In the context of machine learning, this process helps the model escape local optima by periodically adjusting the learning rate, potentially leading to a better global optimum.
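With the half-cycle of the cosine set to the maximum number of iterations, the learning rate falls monotonically from its initial value to its minimum exactly once over training. The sketch below shows this schedule in closed form; `lr_max` and `lr_min` are hypothetical values, not taken from the text.

```python
import math

def cosine_lr(step, max_steps, lr_max=1e-3, lr_min=0.0):
    """Cosine-annealed learning rate with a half-cycle of max_steps."""
    cosine = math.cos(math.pi * step / max_steps)   # goes from 1 to -1
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + cosine)

schedule = [cosine_lr(s, 100) for s in range(101)]  # one value per iteration
```

The rate changes slowly at the start and end of training and fastest in the middle, so early iterations take large steps while the final iterations fine-tune with very small ones.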
[0075] Step S5. After the model training reaches convergence, the preprocessed, uncropped complete image is loaded into the model, and the denoised image is obtained after post-processing.
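Step S5 does not detail the post-processing, so the sketch below assumes it is simply the inverse of the preprocessing: the predicted noise is subtracted, the four planes are interleaved back into a Bayer mosaic, and the normalization is undone. The function names, the residual subtraction, and the stand-in model are all assumptions.

```python
import numpy as np

def unpack_bayer(packed):
    """Inverse of 2x2 packing: 4 x H/2 x W/2 planes back to an H x W mosaic."""
    _, h2, w2 = packed.shape
    raw = np.empty((h2 * 2, w2 * 2), dtype=packed.dtype)
    raw[0::2, 0::2] = packed[0]
    raw[0::2, 1::2] = packed[1]
    raw[1::2, 0::2] = packed[2]
    raw[1::2, 1::2] = packed[3]
    return raw

def denoise_full_image(packed_noisy, model, bl, wp):
    """Run a model on a full normalized packed image and undo the preprocessing."""
    noise = model(packed_noisy)                      # predicted noise map
    denoised = np.clip(packed_noisy - noise, 0.0, 1.0)
    return unpack_bayer(denoised) * (wp - bl) + bl   # back to sensor code values

planes = np.arange(16, dtype=np.float64).reshape(4, 2, 2)
mosaic_back = unpack_bayer(planes)
# A stand-in "model" that predicts zero noise, used only to exercise the pipeline.
result = denoise_full_image(np.full((4, 2, 2), 0.5), lambda x: np.zeros_like(x), bl=64, wp=1023)
```

Because the network is fully convolutional, the uncropped image can be processed in one pass even though training used 256×256 crops, provided the packed height and width are compatible with the three downsampling stages.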
[0076] The above description is merely a preferred embodiment of the present invention and is not intended to limit the present invention. For those skilled in the art, various modifications and variations can be made to the embodiments of the present invention. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of the present invention should be included within the protection scope of the present invention.
Claims
1. A fixed-pattern noise cancellation method based on an encoding / decoding network, characterized in that the method includes the following steps: S1. Acquire RAW images using an electronic imaging device equipped with a target image sensor, and create a paired dataset for training a denoising model; S2. Perform necessary preprocessing on the dataset, and expand the dataset size and increase its diversity through data augmentation; S3. Design a lightweight codec network consisting of an encoder, a connection layer, a decoder, and an output layer; S4. Train the encoder-decoder network based on the preprocessed paired dataset; the preprocessed and data-augmented noisy image $I_{noisy}$ is fed into the encoding / decoding network, where it undergoes encoding and decoding sequentially to obtain the predicted noise map; the MAE function is used to calculate the error between the denoised image $\hat{I}$ and the reference image $I_{ref}$, calculated as follows: $L_{MAE} = \frac{1}{K}\sum_{k=1}^{K}\left|\hat{I}_{k} - I_{ref,k}\right|$, where $k$ indexes the $K$ pixel values of the image; the gradient is calculated through backpropagation of the error, the network weight parameters are updated using the Adam optimizer, and a cosine annealing learning rate reduction strategy is used during training, with the half-cycle of the cosine function set as the maximum number of iterations; S5. After the model training reaches convergence, the preprocessed, uncropped complete image is loaded into the model, and the denoised image is obtained after post-processing.
2. The fixed-pattern noise cancellation method based on a codec network according to claim 1, characterized in that step S1 further includes: S1.1 In a low-light environment, N original noisy images are acquired. The illumination range of the low-light environment is consistent with the illumination range in which electronic imaging devices such as cameras and webcams actually work. The number of noisy images N is determined by the theoretical upper limit of the temporal denoising strength. Temporal denoising essentially removes noise by weighted averaging of the noisy images in a time series; its theoretical upper limit is expressed as the number of noisy frames whose average has the same noise level as the denoised image. ISO values are selected in sequence within the ISO setting range: assuming that the ISO range of the imaging device is 1000 to 16000, ISO is set to 1000, 2000, 4000, 8000 and 16000 in sequence. M black frames are acquired in a dark environment under each ISO setting. Assuming that all noise other than FPN is zero-mean, the approximate FPN is obtained by superimposing and averaging the M black frames, as shown in Equation (1): $FPN = \frac{1}{M}\sum_{i=1}^{M} I_{dark(i)}$ (1). Theoretically, the larger the value of M, the higher the accuracy of the FPN. In the above formula, $I_{dark(i)}$ represents the i-th frame in the set of M black frames; S1.2 Based on the acquired original noisy images and the FPNs obtained under different ISOs, a paired dataset is created. Two types of paired datasets are constructed according to whether temporal denoising is applied: one without temporal denoising and one with temporal denoising. If the image before FPN removal has not undergone temporal denoising, the input is any one of the N original noisy images, denoted $I_{noisy}$, and the output $I_{ref}$ is the result of subtracting the FPN from that noisy image; if the image before FPN removal has undergone temporal denoising, the input is the image $\bar{I}_{noisy}$ obtained by averaging n original noisy images, and the output $\bar{I}_{ref}$ is the result of subtracting the FPN from the averaged image. The calculations are as follows: $\bar{I}_{noisy} = \frac{1}{n}\sum_{i=1}^{n} I_{noisy(i)}$ (2); $I_{ref} = \mathrm{clip}(I_{noisy} - FPN,\ 0,\ wp - bl)$ (3); $\bar{I}_{ref} = \mathrm{clip}(\bar{I}_{noisy} - FPN,\ 0,\ wp - bl)$ (4). In the above formulas, $I_{noisy(i)}$ represents the i-th image among the N original noisy images; clip is a value-range truncation operation, where clip(x, a, b) limits the value of x to the range a to b; bl represents the black level and wp represents the white point value.
3. The fixed-pattern noise cancellation method based on a codec network according to claim 2, characterized in that, In step S1.1, the illuminance range is assumed to be 0.01 to 0.1 lux; N = 100; M = 500; in step S1.2, in order to adapt to different temporal denoising intensities and increase the diversity of noise, n is set to a random value between 30 and 50.
4. The fixed-pattern noise cancellation method based on a codec network according to claim 1, characterized in that step S2 further includes: S2.1 Black level correction and normalization are performed on the noisy image $I_{noisy}$ containing FPN and the reference image $I_{ref}$ without FPN. Black level correction corrects the reference-level offset that the image sensor produces when there is no light input. Normalization improves the efficiency and effectiveness of model training. The calculation process is as follows: S2.2 The acquired RAW image is in Bayer format. The computational cost of the encoding / decoding network depends strongly on the resolution of its input image, so converting the image to four channels before feeding it into the network reduces the computational cost. The pixels of the R, Gr, Gb and B channels of the Bayer-format image are extracted to obtain four single-channel images with half the width and half the height, which are then stacked along the channel dimension to obtain a four-channel image. S2.3 The number of noisy images that can be captured is limited. To expand the dataset, non-overlapping cropping and random horizontal and vertical flipping are applied to the noisy image $I_{noisy}$. Converting the Bayer format to four channels before cropping avoids destroying the Bayer pattern. Non-overlapping cropping samples the complete four-channel image with a sliding window of fixed size 256×256 whose stride equals the window size.
5. The fixed-pattern noise cancellation method based on a codec network according to claim 4, characterized in that, in step S2.2, the Bayer format includes RGGB and BGGR pixel arrangements.
6. The fixed-pattern noise cancellation method based on a codec network according to claim 1, characterized in that step S3 further includes: S3.1 Design the encoder; the encoder consists of three consecutive "convolutional layers + downsampling layers", which gradually reduce the spatial resolution while extracting the features of the input image. The feature map output by each convolutional layer is passed to the decoder through cross-layer connections to realize feature reuse. S3.2 Design the intermediate layer; the intermediate layer connects the encoder and decoder and consists of n consecutive "convolution + activation function" structures. The role of the intermediate layer is to further expand the receptive field of the network beyond that of the encoder, enabling the network to acquire higher-level semantic information. The number n of "convolution + activation function" structures controls the maximum receptive field of the network, which is determined by the spatial resolution of the input image. S3.3 Design the decoder; the decoder consists of three consecutive "transposed convolution + convolutional layers", which gradually restore the low-resolution feature maps to the original spatial resolution of the input. During the image restoration process, the multi-scale features transmitted by the encoder are used to compensate for the information loss caused by downsampling and upsampling operations, enhancing the network's representational ability and detail restoration ability. After the feature map concatenation operation, learnable channel weights are inserted to recalibrate the importance of each channel. The calculation method is as follows: $F_{recal} = F * weight$ (7). In the above formula, F represents the concatenated feature map, weight is the learnable channel weights, * indicates element-wise multiplication, and $F_{recal}$ is the feature map after channel importance recalibration; S3.4 Design the output layer; the input feature map of the output layer has the original resolution of the image. In order to reduce the amount of computation, the output layer is set to a 1×1 convolution.
7. The fixed-pattern noise cancellation method based on a codec network according to claim 1, characterized in that, in step S3.2, n = 3 is set.