A method and apparatus for denoising an image

By using cascaded network modules and a random blind spot strategy, combined with frequency domain enhanced dilated convolution and multi-scale feature fusion, the bottleneck of existing self-supervised image denoising methods in real noise processing is solved, achieving efficient denoising and high-quality image reconstruction in complex noise environments.

CN120047341B · Active · Publication Date: 2026-06-30 · UNIT 32002 OF THE CHINESE PEOPLES LIBERATION ARMY

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents (China)
Current Assignee / Owner: UNIT 32002 OF THE CHINESE PEOPLES LIBERATION ARMY
Filing Date: 2024-12-31
Publication Date: 2026-06-30

AI Technical Summary

Technical Problem

Existing self-supervised image denoising methods typically assume that noise is independent and identically distributed with zero mean when dealing with real noise, resulting in poor performance in complex real noise environments. Furthermore, existing methods have limited effectiveness in dealing with large-area spatially correlated noise and struggle to recover high-frequency details.

Method used

A cascaded network module is adopted, including a pixel recombination and downsampling module, a random blind spot module, a deep feature extraction and reconstruction module. Through the random blind spot strategy and incremental blind spot training, combined with frequency domain enhanced dilated convolution and multi-scale feature fusion, noise correlation is broken, and the model's generalization ability and detail recovery effect are improved.

Benefits of technology

Without increasing the number of parameters, it significantly improves image denoising performance, maintains high-quality image reconstruction in complex and realistic noise environments, has strong adaptability and excellent generalization ability, and is suitable for special fields such as medical and military applications.

✦ Generated by Eureka AI based on patent content.


Abstract

This invention discloses a method and apparatus for image denoising, belonging to the field of image processing technology. The method includes acquiring an image to be denoised and inputting the image into a trained cascaded network module; the cascaded network module performs denoising processing on the image and outputs a denoised image; wherein the cascaded network module includes a pixel reconstruction downsampling module, a shallow feature extraction module, a first branch and a second branch connected in sequence to the shallow feature extraction module, the output of the first branch and the output of the second branch are fused to obtain a fused feature, the fused feature is input into a reconstruction module, the reconstruction module maps the fused feature to a three-dimensional RGB space, the resulting reconstructed image is input into a pixel reconstruction upsampling module, and the pixel reconstruction upsampling module outputs a denoised image. This invention can achieve efficient denoising without the need for paired data support, and has strong adaptability and generalization ability.

Description

Technical Field

[0001] This invention belongs to the field of image processing technology, and in particular relates to a method and apparatus for image denoising.

Background Technology

[0002] Noise is a significant factor affecting image quality. Although camera resolution has improved greatly as imaging technology has advanced, noise is still unavoidably introduced by factors such as low-light conditions, sensor heating, and channel interference. Noise disrupts the signal distribution and texture features of image pixels, degrading the performance of tasks such as object detection, semantic segmentation, and object tracking. Images are widely used in modern society, serving not only aesthetics and clarity but also fields such as industry, the military, and medicine, where image quality directly affects the success or failure of tasks such as defect detection, medical diagnosis, deployment planning, and information gathering. Therefore, recovering a clear image from a degraded one is crucial.

[0003] Image denoising is a fundamental technique in low-level vision tasks, aiming to recover clean images from noisy observations. It is widely used in camera signal processing pipelines, as well as in industrial, medical, and military fields. With the development of deep neural networks, learning-based image denoising methods have made significant progress, offering clear advantages over traditional denoising methods in terms of denoising speed, performance, and large-scale denoising capabilities.

[0004] Currently, most deep learning-based denoising algorithms are supervised learning methods, requiring training on a large number of noisy-clean image pairs. The most common method for constructing training datasets is to add additive white Gaussian noise (AWGN) to clean images, artificially synthesizing noisy images to obtain a large number of clean-noisy image pairs. However, compared to synthetic noise, noise distribution in real-world scenes is often more complex, exhibiting signal dependence and spatial correlation. This leads to poor performance or even failure of the trained denoiser when applied to real-world noisy images. To address this issue, some researchers have attempted to capture clean-noisy image pairs in real-world scenes to form training datasets, such as the SIDD dataset. However, collecting real-world datasets requires long exposures or multiple shots, which is impractical in some specialized fields such as medicine and the military.

[0005] To overcome the dependence on large-scale paired datasets, self-supervised learning denoising methods that do not require clean images have attracted increasing attention. In 2019, Alexander Krull et al. proposed the Noise2Void method. This method is based on the assumption that pixel signals are spatially correlated, while noise signals are spatially independent and have zero mean. Building upon this, they proposed the Blind Spot Network (BSN) denoising method. Since then, many researchers have improved upon the Blind Spot Network proposed in Noise2Void, achieving significant results in handling synthetic noise (such as AWGN). The Blind Spot Network has become one of the most representative methods in the field of self-supervised image denoising.

[0006] Each pixel generated by the blind spot network is estimated from the noisy pixels within its receptive field, excluding the pixel itself. This design allows the network to be trained with a self-supervised loss function using noisy images alone, that is, the network's input and target are the same noisy image. The blind spot design also prevents the network from converging to an identity mapping. Under the assumption of zero-mean, independent and identically distributed (IID) noise, the BSN output theoretically converges to a clean, noise-free image. To blind the network, researchers typically employ one of two strategies: masking a large number of pixels in the image, or masking the central pixel through the design of the convolution kernel. However, the idealized noise assumptions limit the practical performance of these methods. Since real-world noise usually satisfies neither the IID assumption nor the zero-mean assumption, these methods perform poorly on real-world image noise.

[0007] To address the aforementioned issues, researchers have proposed several self-supervised real-image denoising methods. Among them, Lee et al. proposed the APBSN method, which introduces pixel shuffling down-sampling (PD) into the blind spot network. Specifically, this method first downsamples the input noisy image to break the correlation between noisy pixels, thus meeting the blind spot network's requirement for noise signal independence. Subsequently, the reconstructed downsampled image is input into the blind spot network for denoising. This method successfully achieves self-supervised real-image denoising and achieves good results.

[0008] Building upon this foundation, numerous subsequent studies have further improved upon these methods. However, the masking mechanisms of most methods only block the central pixels within the receptive field, offering limited effectiveness in handling large-area spatially correlated noise. Some methods, such as LG-BPN, MM-BSN, and AT-BSN, address the high correlation of noise by increasing the number of blind spots. However, the blind spots in these methods are typically set to fixed locations, which can easily lead to overfitting of the network to specific types of noise, resulting in poor model generalization.

[0009] Furthermore, self-supervised real-image denoising methods primarily break noise correlation through downsampling or neighborhood masking. However, according to Nyquist-Shannon sampling theory, downsampling disrupts the spatial structure of the image, reduces sampling density, and leads to the loss of high-frequency details. Neighborhood masking, in turn, may remove crucial information from the network input, further degrading denoising performance. Therefore, these methods often struggle to recover high-frequency details in denoised images, limiting their performance in high-precision applications.

Summary of the Invention

[0010] The present invention provides a method and apparatus for denoising images, which solves or at least partially solves the above-mentioned problems.

[0011] In a first aspect, a method for image denoising is disclosed, the method comprising:

[0012] Step S1: Obtain the image to be denoised and input the image into the trained cascaded network module;

[0013] Step S2: The cascaded network module performs denoising processing on the image and outputs the denoised image;

[0014] The cascaded network module includes a pixel reconstruction downsampling module, a shallow feature extraction module, a first branch and a second branch connected in sequence to the shallow feature extraction module. The output of the first branch and the output of the second branch are fused to obtain a fused feature. The fused feature is input to a reconstruction module. The reconstruction module maps the fused feature to a three-dimensional RGB space to obtain a reconstructed image. The reconstructed image is input to a pixel reconstruction upsampling module. The pixel reconstruction upsampling module outputs a denoised image.

[0015] The first branch includes a first random blind spot module and a first deep feature extraction module connected in sequence, and the second branch includes a second random blind spot module and a second deep feature extraction module connected in sequence. The first random blind spot module generates a first blinded feature, uses the first blinded feature as input to the first deep feature extraction module, and outputs a first feature. The second random blind spot module generates a second blinded feature, uses the second blinded feature as input to the second deep feature extraction module, and outputs a second feature.

[0016] Preferably, the first random blind spot module receives the shallow features of the image and convolves them with a first random blind spot convolution kernel to generate the first blinded feature. The first random blind spot convolution kernel is obtained by element-wise multiplication of a convolution kernel with a first noise mask matrix. The first noise mask matrix is a randomly generated matrix of 0s and 1s with the same dimensions as the convolution kernel, and the elements equal to 0 in the noise mask matrix represent blind spots. The convolution kernel corresponding to the first random blind spot convolution kernel has a size of 3×3. That is:

[0017] $k^{1}_{\mathrm{blind}} = k^{1}_{\mathrm{conv}} \odot k^{1}_{\mathrm{mask}}$

[0018] where $k^{1}_{\mathrm{blind}}$ is the first random blind spot convolution kernel, $k^{1}_{\mathrm{conv}}$ is the convolution kernel corresponding to the first random blind spot convolution kernel, and $k^{1}_{\mathrm{mask}}$ is the first noise mask matrix;

[0019] The second random blind spot module receives the shallow features of the image and convolves them with a second random blind spot convolution kernel to generate the second blinded feature. The second random blind spot convolution kernel is obtained by element-wise multiplication of a convolution kernel with a second noise mask matrix. The second noise mask matrix is a randomly generated matrix of 0s and 1s with the same dimensions as the convolution kernel, and the elements equal to 0 in the noise mask matrix represent blind spots. The convolution kernel corresponding to the second random blind spot convolution kernel has a size of 5×5. That is:

[0020] $k^{2}_{\mathrm{blind}} = k^{2}_{\mathrm{conv}} \odot k^{2}_{\mathrm{mask}}$

[0021] where $k^{2}_{\mathrm{blind}}$ is the second random blind spot convolution kernel, $k^{2}_{\mathrm{conv}}$ is the convolution kernel corresponding to the second random blind spot convolution kernel, and $k^{2}_{\mathrm{mask}}$ is the second noise mask matrix.

[0022] Preferably, during the training process, the number of 0 elements in the first noise mask matrix and the second noise mask matrix corresponding to the first random blind spot module and the second random blind spot module gradually increases.

[0023] Preferably, the first deep feature extraction module and the second deep feature extraction module have the same structure, both including 8 frequency domain enhancement dilated convolution sub-modules connected in sequence. Each frequency domain enhancement dilated convolution sub-module includes a first sub-branch and a second sub-branch connected in parallel, as well as a multi-scale feature fusion sub-module that fuses the output of the first sub-branch and the output of the second sub-branch.

[0024] The first sub-branch performs a discrete wavelet transform on the input to obtain a frequency feature map. Then, it performs a 3×3 dilated convolution on the frequency feature map. The first ReLU layer processes the high-frequency components in the first feature map obtained by the dilated convolution to obtain a frequency domain feature map. The frequency domain feature map is then subjected to an inverse discrete wavelet transform to obtain frequency enhancement features. The high-frequency components refer to features whose frequencies exceed a preset threshold in the frequency domain after the input is transformed from the spatial domain to the frequency domain by the discrete wavelet transform.

[0025] The second sub-branch performs a 3×3 dilated convolution on the input, and the second feature map obtained by the dilated convolution is processed by the second ReLU layer. Then, a 1×1 convolution is performed on the second feature map to obtain a pointwise convolutional feature map. The pointwise convolutional feature map is processed by the third ReLU layer to obtain the second feature.

[0026] The frequency enhancement feature and the second feature are respectively input into the multi-scale feature fusion submodule;

[0027] The multi-scale feature fusion submodule adds the frequency enhancement feature and the second feature point by point to obtain a first fused feature. The first fused feature is then input into the parallel third and fourth sub-branches. The third sub-branch compresses the first fused feature to one dimension through a 1×1 convolution to obtain a global feature. The global feature is then input into a fourth ReLU layer, and the processed global feature is further processed by a 1×1 convolution to obtain a first global feature. The fourth sub-branch extracts local features from the first fused feature through a 1×1 convolution. The local features are then input into a fifth ReLU layer, and the processed local features are further processed by a 1×1 convolution to obtain a first local feature. The first global feature and the first local feature are added together, and the result is activated by a Sigmoid activation function to obtain an attention feature map. Finally, the attention feature map, the frequency enhancement feature, and the second feature are fused to obtain a second fused feature.

[0028] Preferably, the reconstruction module maps the fused features to a three-dimensional RGB space to obtain a reconstructed image, which is then input into a pixel reconstruction upsampling module. The pixel reconstruction upsampling module outputs a denoised image, wherein:

[0029] The reconstruction module consists of five sequentially connected convolutional layers with a kernel size of 1×1, which map the fused features to a three-dimensional RGB space to obtain the reconstructed image.

[0030] In a second aspect, an apparatus for denoising images is disclosed, the apparatus comprising:

[0031] Feature acquisition module: configured to acquire the image to be denoised and input the image into the trained cascaded network module;

[0032] Denoising module: configured to perform denoising processing on the image by the cascaded network module and output the denoised image;

[0033] The cascaded network module includes a pixel reconstruction downsampling module, a shallow feature extraction module, a first branch and a second branch connected in sequence to the shallow feature extraction module. The output of the first branch and the output of the second branch are fused to obtain a fused feature. The fused feature is input to a reconstruction module. The reconstruction module maps the fused feature to a three-dimensional RGB space to obtain a reconstructed image. The reconstructed image is input to a pixel reconstruction upsampling module. The pixel reconstruction upsampling module outputs a denoised image.

[0034] The first branch includes a first random blind spot module and a first deep feature extraction module connected in sequence, and the second branch includes a second random blind spot module and a second deep feature extraction module connected in sequence. The first random blind spot module generates a first blinded feature, uses the first blinded feature as input to the first deep feature extraction module, and outputs a first feature. The second random blind spot module generates a second blinded feature, uses the second blinded feature as input to the second deep feature extraction module, and outputs a second feature.

[0035] In a third aspect, an electronic device is disclosed, the electronic device comprising:

[0036] At least one processor; and

[0037] A memory communicatively connected to the at least one processor; wherein,

[0038] The memory stores instructions that can be executed by the at least one processor to enable the at least one processor to perform the method as described above.

[0039] In a fourth aspect, a non-transitory computer-readable storage medium is disclosed, which stores computer instructions for causing a computer to perform the method described above.

[0040] The present invention has the following technical effects:

[0041] This invention aims to address the problem that real noise distribution is more complex and spatially correlated than synthetic noise, overcoming the limitations of existing blind spot networks that rely on the assumption of independent and identically distributed (IID) noise. For situations in medical, military, and other fields where paired data of noise and clean images are difficult to obtain, this invention proposes a self-supervised denoising framework based on training with noisy data. Simultaneously, a random blind spot strategy and an incremental blind spot training strategy are designed to solve the overfitting problem caused by fixed blind spot positions and improve the model's generalization ability in noisy environments. Furthermore, this invention introduces a frequency-domain enhanced dilated convolution module, which reduces information loss caused by downsampling and the blind spot mechanism while enhancing detail recovery, achieving the goal of significantly improving denoising performance without increasing the number of parameters.

[0042] This invention achieves improved denoising performance without increasing the number of parameters by employing a self-supervised learning framework, a random blind spot strategy, and multi-scale feature fusion, among other modules working collaboratively. It maintains high-quality image reconstruction even in complex, real-world noisy environments and achieves efficient denoising without the need for paired data support, demonstrating strong adaptability and excellent generalization ability.

[0043] The technical solution of this invention breaks through the bottleneck of existing self-supervised denoising methods in real noise processing, and improves the robustness, generalization ability and detail recovery of the model through a series of innovative designs, making it suitable for image processing tasks in various complex noise scenarios.

Description of the Drawings

[0044] Figure 1 is a schematic flowchart of the image denoising method of the present invention;

[0045] Figure 2 is a schematic diagram of the architecture of the cascaded network module of the present invention;

[0046] Figure 3 is a schematic diagram of the downsampling operation with a stride of 2 in the present invention;

[0047] Figure 4 is a schematic diagram illustrating random blind spot processing of the 3×3 convolution kernel of the present invention;

[0048] Figure 5 is a schematic diagram illustrating random blind spot processing of the 5×5 convolution kernel of the present invention;

[0049] Figure 6 is a schematic diagram of the architecture of the frequency domain enhanced dilated convolutional submodule of the present invention;

[0050] Figure 7 is a schematic diagram illustrating the principle of the multi-scale feature fusion submodule of the present invention;

[0051] Figure 8 is a schematic diagram of the structure of the image denoising apparatus of the present invention.

Detailed Implementation

[0052] The embodiments of the present invention will be described in detail below with reference to the accompanying drawings.

[0053] As shown in Figure 1, the present invention provides an image denoising method, the method comprising:

[0054] Step S1: Obtain the image to be denoised and input the image into the trained cascaded network module;

[0055] Step S2: The cascaded network module performs denoising processing on the image and outputs the denoised image;

[0056] The cascaded network module includes a pixel reconstruction downsampling module, a shallow feature extraction module, a first branch and a second branch connected in sequence to the shallow feature extraction module. The output of the first branch and the output of the second branch are fused to obtain a fused feature. The fused feature is input to a reconstruction module. The reconstruction module maps the fused feature to a three-dimensional RGB space to obtain a reconstructed image. The reconstructed image is input to a pixel reconstruction upsampling module. The pixel reconstruction upsampling module outputs a denoised image.

[0057] The first branch includes a first random blind spot module and a first deep feature extraction module connected in sequence, and the second branch includes a second random blind spot module and a second deep feature extraction module connected in sequence. The first random blind spot module generates a first blinded feature, uses the first blinded feature as input to the first deep feature extraction module, and outputs a first feature. The second random blind spot module generates a second blinded feature, uses the second blinded feature as input to the second deep feature extraction module, and outputs a second feature.

[0058] In this invention, firstly, a pixel recombination downsampling module breaks the spatial correlation of noise, ensuring it conforms to the independence assumption of the blind spot network. Then, a shallow feature extraction module maps the downsampled image to a high-dimensional feature map, extracting shallow features. Next, a random blind spot module achieves blinding by dynamically masking pixels, allowing the network to learn within a self-supervised learning framework and further breaking the correlation of noise. A deep feature extraction module further extracts global and complex feature information. Finally, a reconstruction module restores the high-dimensional feature map to an RGB image and uses an upsampling module to recover the original resolution, outputting the final denoising result.

[0059] As shown in Figure 3, the pixel reconstruction downsampling module applies the pixel reconstruction downsampling operation, namely the PD operation (which has been disclosed in APBSN). This operation rearranges the pixels to break the spatial correlation of real noise, making the noise more consistent with the blind spot network's assumption of spatially independent noise and thus laying the foundation for subsequent blind spot network denoising.
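The PD operation with a stride of 2 can be illustrated with a minimal sketch. It assumes a single-channel image stored as a list of rows and returns the sub-images as a list; how the sub-images are subsequently arranged for the network is not specified here, and the function names `pd_down` and `pd_up` are hypothetical.

```python
def pd_down(img, stride=2):
    """Split an H x W image (list of rows) into stride*stride sub-images.

    Sub-image (i, j) keeps the pixels at rows i, i+stride, ... and columns
    j, j+stride, ..., so pixels that were adjacent in the input end up in
    different sub-images.
    """
    h, w = len(img), len(img[0])
    if h % stride or w % stride:
        raise ValueError("height and width must be divisible by stride")
    return [[row[j::stride] for row in img[i::stride]]
            for i in range(stride) for j in range(stride)]


def pd_up(subs, stride=2):
    """Inverse of pd_down: interleave the sub-images back into one image."""
    sh, sw = len(subs[0]), len(subs[0][0])
    out = [[None] * (sw * stride) for _ in range(sh * stride)]
    for idx, sub in enumerate(subs):
        i, j = divmod(idx, stride)  # row and column offset of this sub-image
        for r in range(sh):
            for c in range(sw):
                out[r * stride + i][c * stride + j] = sub[r][c]
    return out


image = [[r * 4 + c for c in range(4)] for r in range(4)]
subs = pd_down(image, stride=2)
restored = pd_up(subs, stride=2)
```

In each sub-image, horizontally or vertically neighbouring pixels were two pixels apart in the original image, which is why short-range noise correlation is weakened. `pd_up` shows that the rearrangement itself loses no pixels.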

[0060] The shallow feature extraction module is a convolutional layer with a kernel size of 1×1, used to extract shallow features of the image, that is, to map each pixel of the image to be denoised to a high-dimensional feature space. For example, given a mixed degraded image X of size 3×H×W, where H is the height of the image and W is the width of the image, inputting X into the 1×1 convolutional layer yields a high-dimensional shallow feature map of size 128×H×W.
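The 3×H×W to 128×H×W mapping can be sketched as follows. A 1×1 convolution is a per-pixel linear map over channels; the random weights, the zero bias, and the small image size are illustrative assumptions, and `conv1x1` is a hypothetical helper rather than the patented implementation.

```python
import random


def conv1x1(x, weight, bias):
    """Apply a 1x1 convolution to x, given as C_in x H x W nested lists.

    weight has shape C_out x C_in and bias has length C_out. Each output
    pixel depends only on the input pixel at the same position.
    """
    c_in, h, w = len(x), len(x[0]), len(x[0][0])
    return [
        [[bias[o] + sum(weight[o][i] * x[i][r][c] for i in range(c_in))
          for c in range(w)]
         for r in range(h)]
        for o in range(len(weight))
    ]


rng = random.Random(0)
C_IN, C_OUT, H, W = 3, 128, 4, 5  # 3 -> 128 channels as in the text; H, W arbitrary
image = [[[rng.random() for _ in range(W)] for _ in range(H)] for _ in range(C_IN)]
weight = [[rng.uniform(-0.1, 0.1) for _ in range(C_IN)] for _ in range(C_OUT)]
bias = [0.0] * C_OUT

shallow = conv1x1(image, weight, bias)
shape = (len(shallow), len(shallow[0]), len(shallow[0][0]))
```

Because the kernel is 1×1, this layer mixes channels without mixing neighbouring pixels, so it does not reintroduce spatial dependence before the blind spot convolution.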

[0061] Further, the first random blind spot module receives the shallow features of the image and convolves them with a first random blind spot convolution kernel to generate the first blinded feature. The first random blind spot convolution kernel is obtained by element-wise multiplication of a convolution kernel with a first noise mask matrix. The first noise mask matrix is a randomly generated matrix of 0s and 1s with the same dimensions as the convolution kernel, and the elements equal to 0 in the noise mask matrix represent blind spots. The convolution kernel corresponding to the first random blind spot convolution kernel has a size of 3×3. That is:

[0062] $k^{1}_{\mathrm{blind}} = k^{1}_{\mathrm{conv}} \odot k^{1}_{\mathrm{mask}}$

[0063] where $k^{1}_{\mathrm{blind}}$ is the first random blind spot convolution kernel, $k^{1}_{\mathrm{conv}}$ is the convolution kernel corresponding to the first random blind spot convolution kernel, and $k^{1}_{\mathrm{mask}}$ is the first noise mask matrix.

[0064] The second random blind spot module receives the shallow features of the image and convolves them with a second random blind spot convolution kernel to generate the second blinded feature. The second random blind spot convolution kernel is obtained by element-wise multiplication of a convolution kernel with a second noise mask matrix. The second noise mask matrix is a randomly generated matrix of 0s and 1s with the same dimensions as the convolution kernel, and the elements equal to 0 in the noise mask matrix represent blind spots. The convolution kernel corresponding to the second random blind spot convolution kernel has a size of 5×5. That is:

[0065] $k^{2}_{\mathrm{blind}} = k^{2}_{\mathrm{conv}} \odot k^{2}_{\mathrm{mask}}$

[0066] where $k^{2}_{\mathrm{blind}}$ is the second random blind spot convolution kernel, $k^{2}_{\mathrm{conv}}$ is the convolution kernel corresponding to the second random blind spot convolution kernel, and $k^{2}_{\mathrm{mask}}$ is the second noise mask matrix.

[0067] Furthermore, during the training process, the number of 0 elements in the first noise mask matrix and the second noise mask matrix corresponding to the first random blind spot module and the second random blind spot module gradually increases.
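The incremental strategy can be pictured as a schedule for the number of zero entries in the mask. The text only states that this number increases gradually, so the linear ramp below, the function name `blind_spot_count`, and the start and end values are all assumptions made for illustration.

```python
def blind_spot_count(step, total_steps, start, end):
    """Number of zero entries to place in the mask at a given training step.

    Assumption: a linear ramp from start to end; the source only states that
    the count increases gradually during training.
    """
    if total_steps <= 1:
        return end
    frac = min(max(step / (total_steps - 1), 0.0), 1.0)
    return start + round(frac * (end - start))


# Hypothetical settings for a 3x3 kernel: ramp from 1 to 4 blind spots.
schedule = [blind_spot_count(s, total_steps=10, start=1, end=4) for s in range(10)]
```

Any other non-decreasing schedule (for example, stepwise increases at fixed epochs) would fit the description equally well.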

[0068] As shown in Figures 4-5, the key to the random blind spot module lies in dynamically masking some pixels during convolution, enabling the network to handle noise in the input data robustly within a self-supervised learning framework. The masked pixel locations are treated as "blind spots", meaning that their values are ignored or randomly replaced during training. This prevents the model from relying on noisy information and allows it to learn to infer features from context.

[0069] The mathematical expression and operational formula of the random blind spot module are shown below:

[0070] $k_{\mathrm{blind}} = k_{\mathrm{conv}} \odot k_{\mathrm{mask}}$

[0071] where $k_{\mathrm{blind}}$ is the convolution kernel of the random blind spot module, $k_{\mathrm{conv}}$ is a regular 3×3 or 5×5 convolution kernel, $k_{\mathrm{mask}}$ is the noise mask matrix that marks the blind spot locations, and the symbol "⊙" denotes element-wise multiplication. The element-wise product of the convolution kernel $k_{\mathrm{conv}}$ and the mask $k_{\mathrm{mask}}$ forms the random blind spot convolution kernel $k_{\mathrm{blind}}$. During training, the mask matrix $k_{\mathrm{mask}}$ is generated randomly, subject to the constraints that the center point is a blind spot and that the number of blind spots is fixed, so that pixel positions are blinded at random.
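A minimal sketch of the masked kernel follows, assuming a single-channel kernel stored as nested lists. The helper names `random_blind_mask` and `apply_mask`, the seed, and the choice of three blind spots are hypothetical; only the element-wise product and the two constraints (center always blind, fixed count) come from the description above.

```python
import random


def random_blind_mask(size, num_blind, rng):
    """Return a size x size 0/1 mask with exactly num_blind zeros.

    The centre entry is always zero; the remaining zeros are placed at
    positions drawn uniformly at random without replacement.
    """
    if size % 2 == 0:
        raise ValueError("kernel size must be odd so that a centre exists")
    if not 1 <= num_blind <= size * size:
        raise ValueError("num_blind out of range")
    centre = (size // 2, size // 2)
    others = [(r, c) for r in range(size) for c in range(size) if (r, c) != centre]
    blind = {centre, *rng.sample(others, num_blind - 1)}
    return [[0 if (r, c) in blind else 1 for c in range(size)] for r in range(size)]


def apply_mask(k_conv, k_mask):
    """Element-wise product of a kernel and a mask of the same shape."""
    return [[w * m for w, m in zip(w_row, m_row)]
            for w_row, m_row in zip(k_conv, k_mask)]


rng = random.Random(0)
k_conv = [[rng.uniform(-1.0, 1.0) for _ in range(3)] for _ in range(3)]
k_mask = random_blind_mask(3, num_blind=3, rng=rng)
k_blind = apply_mask(k_conv, k_mask)
```

Drawing a fresh `k_mask` at each training iteration changes which neighbours are hidden, while the zeroed center keeps the output pixel from seeing its own noisy input.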

[0072] The function of the random blind spot module is as follows:

[0073] Breaking the correlation of noise: During each training session, the mask matrix changes randomly, ensuring that the model does not memorize specific spatial patterns. This random blind spot mechanism effectively breaks the spatial dependence of noise, enabling the network to better handle complex noise.

[0074] Preventing overfitting: By performing random blinding, the network cannot rely on input noise features, thereby improving the model's generalization ability and avoiding overfitting on the training data.

[0075] Self-supervised optimization: The network fills in the masked blind spots using contextual information, restoring the true values of these pixels in the reconstruction module. Finally, the reconstruction loss is calculated against the original noisy image to optimize network performance.
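The self-supervised objective can be sketched as a reconstruction loss whose target is the noisy input itself. The text does not name the loss, so the mean absolute error used here is an assumption, and the `denoised` values are a made-up stand-in for a network output.

```python
def l1_loss(pred, target):
    """Mean absolute difference between two images given as lists of rows."""
    count = sum(len(row) for row in pred)
    total = sum(abs(p - t)
                for p_row, t_row in zip(pred, target)
                for p, t in zip(p_row, t_row))
    return total / count


noisy = [[0.2, 0.8], [0.5, 0.1]]
denoised = [[0.25, 0.7], [0.5, 0.2]]  # stand-in for the network output

loss = l1_loss(denoised, noisy)          # target is the noisy input itself
identity_loss = l1_loss(noisy, noisy)    # zero: what an unblinded network could reach
```

The zero value of `identity_loss` shows why blinding matters: without blind spots, copying the input would minimize this loss without removing any noise.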

[0076] Furthermore, the first deep feature extraction module and the second deep feature extraction module have the same structure, both including 8 frequency domain enhancement dilated convolution sub-modules connected in sequence. Each frequency domain enhancement dilated convolution sub-module includes a first sub-branch and a second sub-branch connected in parallel, as well as a multi-scale feature fusion sub-module that fuses the output of the first sub-branch and the output of the second sub-branch.

[0077] The first sub-branch performs a discrete wavelet transform on the input to obtain a frequency feature map and then performs a 3×3 dilated convolution on the frequency feature map. The first ReLU layer processes the high-frequency components in the first feature map obtained by the dilated convolution to obtain a frequency domain feature map. The frequency domain feature map is then subjected to an inverse discrete wavelet transform to obtain the frequency enhancement features. The high-frequency components refer to features whose frequencies exceed a preset threshold in the frequency domain after the input is transformed from the spatial domain to the frequency domain by the discrete wavelet transform.

[0078] The second sub-branch performs a 3×3 dilated convolution on the input, and the second feature map obtained by the dilated convolution is processed by the second ReLU layer. Then, a 1×1 convolution is performed on the second feature map to obtain a pointwise convolutional feature map. The pointwise convolutional feature map is processed by the third ReLU layer to obtain the second feature.

[0079] The frequency enhancement feature and the second feature are respectively input into the multi-scale feature fusion submodule;

[0080] The multi-scale feature fusion submodule adds the frequency enhancement feature and the second feature point by point to obtain a first fused feature. The first fused feature is then input into the parallel third and fourth sub-branches. The third sub-branch compresses the first fused feature to one dimension through a 1×1 convolution to obtain a global feature. The global feature is then input into a fourth ReLU layer, and the processed global feature is further processed by a 1×1 convolution to obtain a first global feature. The fourth sub-branch extracts local features from the first fused feature through a 1×1 convolution. The local features are then input into a fifth ReLU layer, and the processed local features are further processed by a 1×1 convolution to obtain a first local feature. The first global feature and the first local feature are added together, and the result is activated by a Sigmoid activation function to obtain an attention feature map. Finally, the attention feature map, the frequency enhancement feature, and the second feature are fused to obtain a second fused feature.
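The fusion steps above can be sketched for a single pixel, since every 1×1 convolution acts on each pixel's channel vector independently. Several details are assumptions: "one dimension" is read as one channel, the local branch keeps the channel count, the weights are random, and the final fusion is written as an attention-weighted blend of the two inputs because the text does not give its formula.

```python
import math
import random


def pointwise(weight, x):
    """A 1x1 convolution at one pixel: a matrix-vector product over channels."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


def relu(x):
    return [max(v, 0.0) for v in x]


def sigmoid(x):
    return [1.0 / (1.0 + math.exp(-v)) for v in x]


def fuse_pixel(f1, f2, params):
    """Fuse two channel vectors with a global + local attention map."""
    fused = [a + b for a, b in zip(f1, f2)]                  # first fused feature
    global_feat = pointwise(params["g2"], relu(pointwise(params["g1"], fused)))
    local_feat = pointwise(params["l2"], relu(pointwise(params["l1"], fused)))
    attn = sigmoid([g + l for g, l in zip(global_feat, local_feat)])
    # Assumed final fusion: attention-weighted blend of the two inputs.
    return [a * u + (1.0 - a) * v for a, u, v in zip(attn, f1, f2)]


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


channels = 4  # illustrative; not the channel count of the actual network
rng = random.Random(0)
params = {
    "g1": random_matrix(1, channels, rng),         # C -> 1 (assumed compression)
    "g2": random_matrix(channels, 1, rng),         # 1 -> C
    "l1": random_matrix(channels, channels, rng),  # C -> C
    "l2": random_matrix(channels, channels, rng),  # C -> C
}
f1 = [rng.uniform(-1.0, 1.0) for _ in range(channels)]  # frequency enhancement feature
f2 = [rng.uniform(-1.0, 1.0) for _ in range(channels)]  # second feature
out = fuse_pixel(f1, f2, params)
```

Under the assumed blend, each output channel lies between the corresponding values of the two inputs, with the Sigmoid output deciding how much each branch contributes.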

[0081] The second feature map is a large receptive field feature map.

[0082] Furthermore, the reconstruction module maps the fused features to a three-dimensional RGB space to obtain a reconstructed image. This reconstructed image is then input into a pixel reconstruction upsampling module, which outputs a denoised image, wherein:

[0083] The reconstruction module consists of five sequentially connected convolutional layers with a kernel size of 1×1, which map the fused features to a three-dimensional RGB space to obtain the reconstructed image.

[0084] The pixel reconstruction upsampling module restores the reconstructed image to the same resolution as the image to be denoised, thus obtaining the denoised image.

[0085] As shown in Figures 6-7, this invention designs a deep feature extraction module to address the loss of detail information caused by PD operations and blind spot settings. The core design of this module combines frequency enhancement and dilated convolution techniques to highlight high-frequency details in the image. The module consists of eight frequency-domain enhanced dilated convolution modules, each including a frequency branch and a dilated convolution branch, whose outputs are fused by a multi-scale feature fusion module.

[0086] Specifically, the frequency branch first uses the Discrete Wavelet Transform (DWT) to transform the features from the spatial domain to the frequency domain, then processes the high-frequency components through a 3×3 dilated convolution and a ReLU layer, and finally uses the Inverse Discrete Wavelet Transform (IDWT) to transform the processed features back to the spatial domain. The dilated convolution branch uses a 3×3 dilated convolution, the ReLU activation function, and pointwise convolution operations.

[0087] (1) Frequency branch design: As shown in the upper branch of Figure 6, the input shallow feature F is transformed to the frequency domain using the Discrete Wavelet Transform (DWT) to obtain the frequency feature map F_freq. The high-frequency components are then processed sequentially by a 3×3 dilated convolution and a ReLU layer to highlight and enhance high-frequency details in the image. The resulting frequency domain feature map F_h is then inverse-transformed back to the spatial domain to obtain the final frequency enhancement feature F1. These operations are represented as:

[0088] F1 = IDWT(ReLU(Conv(DWT(F))))
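The formula F1 = IDWT(ReLU(Conv(DWT(F)))) can be illustrated with a minimal NumPy sketch. It is not the patented implementation: it assumes a single-level Haar wavelet, a single-channel (H, W) input with even H and W, a dilation rate of 2, and that only the three detail sub-bands are treated as "high-frequency components" while the approximation sub-band passes through unchanged. The names `haar_dwt2`, `haar_idwt2`, `dilated_conv3x3`, and `frequency_branch` are hypothetical.

```python
import numpy as np

def haar_dwt2(x):
    """Single-level 2D Haar DWT of an (H, W) array with even H and W.
    Returns the approximation band and three detail bands, each (H/2, W/2)."""
    a, b = x[0::2, 0::2], x[0::2, 1::2]
    c, d = x[1::2, 0::2], x[1::2, 1::2]
    ll = (a + b + c + d) / 2.0   # approximation (low frequency)
    lh = (a + b - c - d) / 2.0   # detail bands (high frequency)
    hl = (a - b + c - d) / 2.0
    hh = (a - b - c + d) / 2.0
    return ll, lh, hl, hh

def haar_idwt2(ll, lh, hl, hh):
    """Exact inverse of haar_dwt2."""
    h, w = ll.shape
    x = np.empty((2 * h, 2 * w), dtype=float)
    x[0::2, 0::2] = (ll + lh + hl + hh) / 2.0
    x[0::2, 1::2] = (ll + lh - hl - hh) / 2.0
    x[1::2, 0::2] = (ll - lh + hl - hh) / 2.0
    x[1::2, 1::2] = (ll - lh - hl + hh) / 2.0
    return x

def dilated_conv3x3(x, kernel, dilation=2):
    """3x3 dilated convolution (cross-correlation) with zero padding,
    so the output has the same (H, W) shape as the input."""
    h, w = x.shape
    xp = np.pad(x, dilation)
    out = np.zeros((h, w), dtype=float)
    for i in range(3):
        for j in range(3):
            out += kernel[i, j] * xp[i * dilation:i * dilation + h,
                                     j * dilation:j * dilation + w]
    return out

def frequency_branch(f, kernel, dilation=2):
    """F1 = IDWT(ReLU(Conv(DWT(F)))), applied to the detail bands only (assumption)."""
    ll, lh, hl, hh = haar_dwt2(f)
    enhanced = [np.maximum(dilated_conv3x3(band, kernel, dilation), 0.0)
                for band in (lh, hl, hh)]
    return haar_idwt2(ll, *enhanced)

rng = np.random.default_rng(0)
f1 = frequency_branch(rng.standard_normal((8, 8)), rng.standard_normal((3, 3)))
```

Because the inverse transform restores the input resolution, `f1` has the same shape as the shallow feature it was computed from, which is what allows the later point-by-point addition with F2.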

[0089] (2) Dilated convolution branch design: As shown in the lower branch of Figure 6, the input shallow feature F is processed through a 3×3 dilated convolution, ReLU, a 1×1 convolution, and ReLU to obtain the extracted feature F2. These operations are represented as follows:

[0090] F2 = ReLU(Conv1(ReLU(Conv3(F))))
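As a rough illustration of the dilated convolution branch, the following NumPy sketch applies the layers in the order stated in the text: 3×3 dilated convolution, ReLU, 1×1 convolution, ReLU. The dilation rate of 2, the channel counts, and the function names are assumptions made for the example; the source does not fix them in this passage.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def dilated_conv3x3(x, weight, dilation=2):
    """x: (C_in, H, W); weight: (C_out, C_in, 3, 3).
    Zero padding of `dilation` pixels keeps H and W unchanged."""
    _, h, w = x.shape
    xp = np.pad(x, ((0, 0), (dilation, dilation), (dilation, dilation)))
    out = np.zeros((weight.shape[0], h, w))
    for i in range(3):
        for j in range(3):
            patch = xp[:, i * dilation:i * dilation + h,
                          j * dilation:j * dilation + w]
            out += np.einsum("oc,chw->ohw", weight[:, :, i, j], patch)
    return out

def pointwise_conv(x, weight):
    """1x1 convolution: weight (C_out, C_in) mixes channels at each pixel."""
    return np.einsum("oc,chw->ohw", weight, x)

def dilated_branch(f, w3, w1, dilation=2):
    """3x3 dilated conv -> ReLU -> 1x1 conv -> ReLU."""
    return relu(pointwise_conv(relu(dilated_conv3x3(f, w3, dilation)), w1))

# Receptive-field check: one impulse spreads to 9 taps spanning a 5x5 window,
# although the kernel still has only 3x3 = 9 weights.
impulse = np.zeros((1, 7, 7))
impulse[0, 3, 3] = 1.0
response = dilated_branch(impulse, np.ones((1, 1, 3, 3)), np.ones((1, 1)))
```

With a dilation rate of 2, the 3×3 kernel covers a 5×5 neighbourhood without extra weights, which is the sense in which the branch enlarges the receptive field without adding parameters.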

[0091] (3) Multi-scale feature fusion module: As shown in Figure 7, the frequency enhancement feature F1 and the extracted feature F2 are first added point by point to obtain a fused feature. The fused feature is then input into two branches. The first branch compresses the fused feature from C channels to a single channel using a 1×1 convolution, yielding a 1×H×W feature map, which is further processed by ReLU and another 1×1 convolution to obtain the final 1×H×W global feature. In the second branch, the dimension of the fused feature remains unchanged, and local features are extracted through a 1×1 convolution, ReLU, and another 1×1 convolution, outputting a C×H×W local feature. After the 1×H×W global feature is added to the C×H×W local feature, the Sigmoid activation function generates an attention feature map F_att (C×H×W). Finally, the attention feature map F_att is combined with the features F1 and F2 to obtain the fused feature F_fuse. These operations are represented as:

[0092]
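The attention computation described above can be sketched in NumPy as follows. The global and local branches follow the text. The final combination rule is an assumption: the text only says that F_att is combined with F1 and F2, and the sketch uses the weighted sum F_att · F1 + (1 − F_att) · F2 as one plausible choice. All function and parameter names are hypothetical.

```python
import numpy as np

def conv1x1(x, weight):
    """1x1 convolution: x (C_in, H, W), weight (C_out, C_in)."""
    return np.einsum("oc,chw->ohw", weight, x)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def fuse_features(f1, f2, wg1, wg2, wl1, wl2):
    """f1, f2: (C, H, W). wg1: (1, C), wg2: (1, 1), wl1 and wl2: (C, C)."""
    s = f1 + f2                                              # point-by-point sum
    g = conv1x1(np.maximum(conv1x1(s, wg1), 0.0), wg2)       # global: (1, H, W)
    l = conv1x1(np.maximum(conv1x1(s, wl1), 0.0), wl2)       # local:  (C, H, W)
    att = sigmoid(g + l)                                     # broadcasts to (C, H, W)
    # ASSUMPTION: attention-weighted blend of the two inputs.
    return att * f1 + (1.0 - att) * f2, att

rng = np.random.default_rng(0)
C, H, W = 4, 6, 6
f1, f2 = rng.standard_normal((C, H, W)), rng.standard_normal((C, H, W))
fused, att = fuse_features(
    f1, f2,
    rng.standard_normal((1, C)), rng.standard_normal((1, 1)),
    rng.standard_normal((C, C)), rng.standard_normal((C, C)),
)
```

Adding the 1×H×W global feature to the C×H×W local feature relies on broadcasting along the channel axis, so every channel shares the same global term while keeping its own local term.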

[0093] In this invention, the high-dimensional feature map extracted from deep features is passed to the reconstruction module. This module maps the deep features back to the three-dimensional RGB space, generating a preliminarily denoised image. The reconstruction module consists of five 1×1 convolutions, which gradually map the high-dimensional feature map to the three-dimensional RGB image.
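A minimal sketch of such a reconstruction head is given below. The source specifies five 1×1 convolutions ending in three RGB channels. The intermediate channel widths (64, 32, 16, 8, 4) and the ReLU between layers are assumptions added for illustration, as is the function name `reconstruct`.

```python
import numpy as np

def reconstruct(features, weights):
    """Apply a stack of 1x1 convolutions to a (C, H, W) feature map.
    ASSUMPTION: ReLU between layers, no activation after the last one."""
    x = features
    for k, w in enumerate(weights):
        x = np.einsum("oc,chw->ohw", w, x)      # 1x1 conv = per-pixel channel mixing
        if k < len(weights) - 1:
            x = np.maximum(x, 0.0)
    return x

rng = np.random.default_rng(0)
channels = [64, 32, 16, 8, 4, 3]                # hypothetical widths, ending in RGB
weights = [rng.standard_normal((c_out, c_in)) * 0.1
           for c_in, c_out in zip(channels[:-1], channels[1:])]
rgb = reconstruct(rng.standard_normal((64, 16, 16)), weights)
```

Since every layer is 1×1, the spatial size is untouched and only the channel dimension shrinks step by step toward three.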

[0094] Finally, the reconstructed image is restored to its original resolution by the pixel reconstruction upsampling module, and the denoising result is output. This module ensures the restoration of image details while preserving the denoising effect of the network during multi-scale processing.
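The pairing of a downsampling step at the input and an upsampling step at the output can be illustrated with a standard pixel-unshuffle and pixel-shuffle pair. This is a sketch under the assumption that the PD operation is the usual stride-s rearrangement of pixels into channels; the stride of 2 and the function names are hypothetical.

```python
import numpy as np

def pd_down(x, s):
    """Pixel-shuffle downsampling: (C, H, W) -> (C*s*s, H/s, W/s).
    Each output channel is a sub-image sampled with stride s."""
    c, h, w = x.shape
    x = x.reshape(c, h // s, s, w // s, s)
    return x.transpose(0, 2, 4, 1, 3).reshape(c * s * s, h // s, w // s)

def pd_up(x, s):
    """Inverse rearrangement: (C*s*s, H/s, W/s) -> (C, H, W)."""
    cs, h, w = x.shape
    c = cs // (s * s)
    x = x.reshape(c, s, s, h, w)
    return x.transpose(0, 3, 1, 4, 2).reshape(c, h * s, w * s)

image = np.arange(3 * 4 * 4, dtype=float).reshape(3, 4, 4)
down = pd_down(image, 2)        # shape (12, 2, 2)
restored = pd_up(down, 2)       # shape (3, 4, 4), identical to `image`
```

Neighbouring pixels of the original image land in different sub-images after `pd_down`, which is why this kind of rearrangement weakens the spatial correlation of noise, and `pd_up` restores the original resolution exactly.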

[0095] To address the challenges posed by the random blind spot module during training, this invention employs an incremental blind spot training strategy. As training progresses, the number of blind spots gradually increases, correspondingly reducing the amount of effective information the network can directly utilize in each convolutional operation. This makes it more difficult for the network to handle complex tasks. To avoid the network encountering excessively high learning difficulty in the initial stages, this strategy adopts a progressive approach, ensuring that the network can fully learn the basic features and global information of the image in the early stages.

[0096] In the early stages of training, the number of blind spots is small, allowing the network to access more pixel information and quickly learn basic patterns and local features in the data. This stage of training is relatively simple, helping the network establish a good foundation and ensuring relatively stable convergence in the early stages. As training progresses, the number of blind spots gradually increases, and the local information that the network can directly utilize becomes sparser. This forces the model to rely more on global contextual information for prediction, thereby improving its adaptability to complex scenes. This progressive training process effectively increases the difficulty of the task, enabling the network to gradually learn to handle more challenging scenarios and improving the model's generalization ability on unknown data.

[0097] Furthermore, the incremental blind spot strategy helps prevent the network from overfitting simple data in the early stages, avoiding excessive reliance on local features while ignoring global structural information. By gradually increasing the number of blind spots, the network benefits from a more hierarchical feature learning approach, enabling it to extract features at different scales. This multi-scale feature extraction capability makes the model more robust to image denoising tasks facing various noise patterns and complex environments. The specific training arrangements are shown in Table 1.

[0098] Table 1: Incremental Blind Spot Training Plan During Training

[0099]
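The idea of an incremental schedule can be sketched as a function that maps the training epoch to the number of blind spots in a mask. The values below are purely hypothetical and are not the values of Table 1: the sketch assumes four equal training stages and a cap of half of the kernel elements, and the name `blind_spot_count` is made up for the example.

```python
def blind_spot_count(epoch, total_epochs, kernel_size, max_fraction=0.5, stages=4):
    """Number of zero entries (blind spots) in the mask at a given epoch.
    The count is constant within a stage and never decreases across stages."""
    n = kernel_size * kernel_size
    stage = min(epoch * stages // total_epochs, stages - 1)
    max_blind = int(n * max_fraction)
    return max(1, round(max_blind * (stage + 1) / stages))

# Hypothetical 100-epoch run for the 3x3 and 5x5 kernels.
schedule_3x3 = [blind_spot_count(e, 100, 3) for e in (0, 25, 50, 99)]
schedule_5x5 = [blind_spot_count(e, 100, 5) for e in (0, 25, 50, 99)]
```

Under these assumed settings the 3×3 kernel goes from 1 to 4 blind spots and the 5×5 kernel from 3 to 12, so the early stages leave most of the neighbourhood visible and the later stages force the network to rely on wider context.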

[0100] This invention has the following characteristics:

[0101] 1. Self-supervised learning noise reduction method

[0102] This invention designs a self-supervised learning denoising framework that only requires training the model on a noisy dataset, eliminating the need for paired clean and noisy image data. This solves the problem of obtaining large-scale paired data in specialized fields such as medicine and the military. This method reduces data collection costs while improving the model's applicability in practical applications.

[0103] 2. Random blind spot module

[0104] To address the issue that real-world noise distributions are more complex and spatially correlated, failing to meet the assumption of independent and identically distributed (IID) noise in blind spot networks, this invention proposes a randomized blind spot module. By dynamically and randomly masking pixel locations, this strategy further reduces noise correlation and improves the model's robustness and denoising performance in complex, realistic noise environments.
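According to the claims, a random blind spot convolution kernel is the element-wise product of a convolution kernel and a randomly generated 0/1 noise mask of the same size, where 0 marks a blind spot. A minimal NumPy sketch follows. Drawing a fixed number of zeros at uniformly random positions is an assumption about how the mask is sampled, and the function name is hypothetical.

```python
import numpy as np

def random_blind_spot_kernel(kernel, num_blind, rng):
    """Return (masked_kernel, mask). `mask` has the kernel's shape, with exactly
    `num_blind` entries set to 0 (blind spots) and the rest set to 1."""
    mask = np.ones(kernel.size)
    blind_idx = rng.choice(kernel.size, size=num_blind, replace=False)
    mask[blind_idx] = 0.0
    mask = mask.reshape(kernel.shape)
    return kernel * mask, mask      # element-wise product

rng = np.random.default_rng(0)
k3 = rng.standard_normal((3, 3))    # kernel size used by the first branch
k5 = rng.standard_normal((5, 5))    # kernel size used by the second branch
blind_k3, m3 = random_blind_spot_kernel(k3, 2, rng)
blind_k5, m5 = random_blind_spot_kernel(k5, 6, rng)
```

Calling the function again at each training step produces a different mask, so the positions hidden from the network change dynamically instead of staying fixed at the kernel centre.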

[0105] 3. Incremental blind spot training strategy

[0106] To address the problem of fixed blind spot locations in existing blind spot networks, which easily leads to overfitting to specific noise types, this invention designs an incremental blind spot training strategy. By gradually increasing the number of blind spots during training, the model can progressively adapt to more complex noise patterns, improving the network's generalization ability. Furthermore, this strategy eases the training process, allowing the model to converge more readily.

[0107] 4. Frequency Domain Enhanced Dilated Convolution Module

[0108] This invention designs a frequency-domain enhanced dilated convolution module to address the information loss caused by downsampling and blind spot mechanisms during denoising. This module enhances feature extraction capabilities in the frequency domain by expanding the receptive field of the convolution kernel, thereby improving the model's detail recovery performance without increasing the number of parameters, especially excelling in the reconstruction of high-frequency information.

[0109] 5. Multi-scale feature fusion module

[0110] To further enhance the model's denoising capabilities, this invention introduces a multi-scale feature fusion module. This module integrates image features at different scales to achieve efficient fusion of global information and local details, thereby improving the ability to handle complex image structures and multi-level noise during the denoising process. The fusion of multi-scale features ensures that the model has higher robustness and adaptability when dealing with noise of different sizes and shapes.

[0111] 6. Achieving high-efficiency noise reduction performance

[0112] This invention achieves improved denoising performance without increasing the number of parameters by employing a self-supervised learning framework, a random blind spot strategy, and multi-scale feature fusion, among other modules working collaboratively. The system maintains high-quality image reconstruction in complex, real-world noisy environments and achieves efficient denoising without the need for paired data support, demonstrating strong adaptability and excellent generalization ability.

[0113] The technical solution of this invention breaks through the bottleneck of existing self-supervised denoising methods in real noise processing, and improves the robustness, generalization ability and detail recovery effect of the model through a series of innovative designs, making it suitable for image processing tasks in various complex noise scenarios.

[0114] As shown in Figure 8, the present invention provides an image denoising apparatus, the apparatus comprising:

[0115] Feature acquisition module: configured to acquire the image to be denoised and input the image into the trained cascaded network module;

[0116] Denoising module: configured to perform denoising processing on the image by the cascaded network module and output the denoised image;

[0117] The cascaded network module includes a pixel reconstruction downsampling module, a shallow feature extraction module, a first branch and a second branch connected in sequence to the shallow feature extraction module. The output of the first branch and the output of the second branch are fused to obtain a fused feature. The fused feature is input to a reconstruction module. The reconstruction module maps the fused feature to a three-dimensional RGB space to obtain a reconstructed image. The reconstructed image is input to a pixel reconstruction upsampling module. The pixel reconstruction upsampling module outputs a denoised image.

[0118] The first branch includes a first random blind spot module and a first deep feature extraction module connected in sequence, and the second branch includes a second random blind spot module and a second deep feature extraction module connected in sequence. The first random blind spot module generates a first blinded feature, uses the first blinded feature as input to the first deep feature extraction module, and outputs a first feature. The second random blind spot module generates a second blinded feature, uses the second blinded feature as input to the second deep feature extraction module, and outputs a second feature.
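The data flow of the cascaded network module can be summarized by a small wiring sketch. Every stage is passed in as a callable, so the sketch makes no claim about how any stage is implemented. In particular, how the two branch outputs are fused is left to the caller, and all parameter names are hypothetical.

```python
def cascade_forward(image, pd_down, shallow, blind1, deep1,
                    blind2, deep2, fuse, reconstruct, pd_up):
    """Wire the stages in the order described for the cascaded network module."""
    x = pd_down(image)                   # pixel reconstruction downsampling
    f = shallow(x)                       # shallow feature extraction
    first = deep1(blind1(f))             # first branch: blind spot, then deep features
    second = deep2(blind2(f))            # second branch: blind spot, then deep features
    fused = fuse(first, second)          # fuse the two branch outputs
    return pd_up(reconstruct(fused))     # map to RGB, then restore resolution

# Trace the call order with string-building stand-ins for the real modules.
trace = cascade_forward(
    "img",
    pd_down=lambda x: f"pd({x})",
    shallow=lambda x: f"sh({x})",
    blind1=lambda x: f"b1({x})",
    deep1=lambda x: f"d1({x})",
    blind2=lambda x: f"b2({x})",
    deep2=lambda x: f"d2({x})",
    fuse=lambda a, b: f"fuse({a},{b})",
    reconstruct=lambda x: f"rec({x})",
    pd_up=lambda x: f"up({x})",
)
```

Both branches read the same shallow feature and differ only in their blind spot and deep feature modules, which matches the parallel structure described for the apparatus.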

[0119] The above embodiments are only used to illustrate the technical solutions of the present invention, and are not intended to limit it. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some or all of the technical features therein, and such modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the scope of the technical solutions of the embodiments of the present invention.

Claims

1. A method for denoising images, characterized in that the method includes the following steps: Step S1: Obtain the image to be denoised and input the image into the trained cascaded network module; Step S2: The cascaded network module performs denoising processing on the image and outputs the denoised image. The cascaded network module includes a pixel reconstruction downsampling module, a shallow feature extraction module, and a first branch and a second branch connected in sequence to the shallow feature extraction module. The output of the first branch and the output of the second branch are fused to obtain a fused feature. The fused feature is input to a reconstruction module. The reconstruction module maps the fused feature to a three-dimensional RGB space to obtain a reconstructed image. The reconstructed image is input to a pixel reconstruction upsampling module. The pixel reconstruction upsampling module outputs a denoised image. The first branch includes a first random blind spot module and a first deep feature extraction module connected in sequence, and the second branch includes a second random blind spot module and a second deep feature extraction module connected in sequence. The first random blind spot module generates a first blinded feature, uses the first blinded feature as input to the first deep feature extraction module, and outputs a first feature. The first random blind spot module receives the shallow features of the image and performs a convolution operation on the shallow features of the image using a first random blind spot convolution kernel to generate the first blinded feature; wherein the first random blind spot convolution kernel is obtained by multiplying the convolution kernel and the first noise mask matrix element by element. The second random blind spot module generates a second blinded feature, uses the second blinded feature as input to the second deep feature extraction module, and outputs a second feature. The second random blind spot module receives the shallow features of the image and performs a convolution operation on the shallow features of the image using a second random blind spot convolution kernel to generate the second blinded feature; wherein the second random blind spot convolution kernel is obtained by multiplying the convolution kernel and the second noise mask matrix element by element. The first deep feature extraction module and the second deep feature extraction module have the same structure, both including 8 frequency domain enhanced dilated convolution sub-modules connected in sequence.

2. The method as described in claim 1, characterized in that the first noise mask matrix is a randomly generated matrix of 0s and 1s with the same dimensions as the convolution kernel, in which elements with a value of 0 represent blind spots; the convolution kernel corresponding to the first random blind spot convolution kernel has a size of 3×3; that is, the first random blind spot convolution kernel is the element-wise product of its corresponding convolution kernel and the first noise mask matrix. The second noise mask matrix is a randomly generated matrix of 0s and 1s with the same dimensions as the convolution kernel, in which elements with a value of 0 represent blind spots; the convolution kernel corresponding to the second random blind spot convolution kernel has a size of 5×5; that is, the second random blind spot convolution kernel is the element-wise product of its corresponding convolution kernel and the second noise mask matrix.

3. The method as described in claim 2, characterized in that, during the training process, the number of 0 elements in the first noise mask matrix of the first random blind spot module and in the second noise mask matrix of the second random blind spot module gradually increases.

4. The method as described in claim 2, characterized in that each frequency domain enhanced dilated convolution sub-module includes a first sub-branch and a second sub-branch connected in parallel, as well as a multi-scale feature fusion submodule that fuses the outputs of the first sub-branch and the second sub-branch. The first sub-branch performs a discrete wavelet transform on the input to obtain a frequency feature map, and then performs a 3×3 dilated convolution on the frequency feature map; the high-frequency components in the resulting first feature map are processed by the first ReLU layer to obtain a frequency domain feature map. The frequency domain feature map is then subjected to an inverse discrete wavelet transform to obtain the frequency enhancement feature. The high-frequency components refer to features whose frequencies exceed a preset threshold in the frequency domain after the input is transformed from the spatial domain to the frequency domain by the discrete wavelet transform. The second sub-branch performs a 3×3 dilated convolution on the input; the second feature map obtained by the dilated convolution is processed by the second ReLU layer, and a 1×1 convolution is then performed on the second feature map to obtain a pointwise convolutional feature map. The third ReLU layer then processes the pointwise convolutional feature map to obtain the second feature. The frequency enhancement feature and the second feature are respectively input into the multi-scale feature fusion submodule. The multi-scale feature fusion submodule adds the frequency enhancement feature and the second feature point by point to obtain a first fused feature, and inputs the first fused feature into the parallel third sub-branch and fourth sub-branch respectively; the third sub-branch compresses the first fused feature to one dimension through a 1×1 convolution, yielding a global feature; the global feature is input into a fourth ReLU layer, and the processed global feature is further processed by a 1×1 convolution to obtain the first global feature; the fourth sub-branch extracts local features from the first fused feature through a 1×1 convolution, the local features are input into the fifth ReLU layer, and the processed local features are further processed by a 1×1 convolution to obtain the first local feature; the first global feature and the first local feature are added, and the result of the addition is activated by the Sigmoid activation function to obtain the attention feature map; the attention feature map, the frequency enhancement feature, and the second feature are then fused to obtain the second fused feature.

5. The method as described in claim 2, characterized in that the reconstruction module maps the fused features to a three-dimensional RGB space to obtain a reconstructed image. This reconstructed image is then input into a pixel reconstruction upsampling module, which outputs a denoised image, wherein: the reconstruction module consists of five sequentially connected convolutional layers with a kernel size of 1×1, which map the fused features to a three-dimensional RGB space to obtain the reconstructed image.

6. An apparatus for denoising an image, used to perform the method of any one of claims 1-5, characterized in that, The device includes: Feature acquisition module: configured to acquire the image to be denoised and input the image into the trained cascaded network module; Denoising module: configured to perform denoising processing on the image through the cascaded network module and output the denoised image; The cascaded network module includes a pixel reconstruction downsampling module, a shallow feature extraction module, a first branch and a second branch connected in sequence to the shallow feature extraction module. The output of the first branch and the output of the second branch are fused to obtain a fused feature. The fused feature is input to a reconstruction module. The reconstruction module maps the fused feature to a three-dimensional RGB space to obtain a reconstructed image. The reconstructed image is input to a pixel reconstruction upsampling module. The pixel reconstruction upsampling module outputs a denoised image. The first branch includes a first random blind spot module and a first deep feature extraction module connected in sequence, and the second branch includes a second random blind spot module and a second deep feature extraction module connected in sequence. The first random blind spot module generates a first blinded feature, uses the first blinded feature as the input of the first deep feature extraction module, and outputs a first feature. The second random blind spot module generates a second blinded feature, uses the second blinded feature as the input of the second deep feature extraction module, and outputs a second feature.

7. A computer-readable storage medium storing a plurality of instructions, the plurality of instructions being configured to be loaded by a processor to execute the method as claimed in any one of claims 1-5.

8. An electronic device, characterized in that the electronic device includes: a processor, used to execute a plurality of instructions; and a memory, used to store the plurality of instructions; wherein the plurality of instructions are stored in the memory and are loaded by the processor to execute the method as described in any one of claims 1-5.