Method for implementing a lightweight denoising network based on raw domain knowledge distillation
By using knowledge distillation technology, the denoising capabilities of large models are transferred to small models, solving the problems of high network complexity and large memory requirements in existing technologies. This achieves efficient denoising under extremely low light conditions and is suitable for image processing in smart devices.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- HEFEI JUNZHENG TECH CO LTD
- Filing Date
- 2024-12-20
- Publication Date
- 2026-06-23
AI Technical Summary
Existing neural network denoising techniques suffer from high network complexity, large memory requirements, and poor practicality when processing images under extremely low light conditions. It is difficult for them to deliver good denoising while also meeting lightweight and speed requirements.
The knowledge distillation method is adopted, which uses the pre-trained large model as the teacher model and transfers its knowledge to the small model (student model) through feature distillation. The appropriate convolutional layer output features are selected as the guide layer by using feature distillation, the loss is calculated and the student model is trained iteratively, which reduces the network complexity while maintaining the denoising effect.
It achieves the goal of maintaining denoising effect while reducing network complexity and computational load, effectively removing noise from various cameras in extremely dark scenes, and is suitable for practical applications.
Description
Technical Field
[0001] This invention belongs to the field of image processing technology, and specifically relates to a method for implementing a lightweight denoising network based on raw domain knowledge distillation.
Background Technology
[0002] With the widespread use of smartphones, smart cameras, and autonomous driving, people have increasingly stringent requirements for the imaging capabilities of these devices. Especially at night or in environments with insufficient light, the images captured by the cameras suffer from significant noise. When the camera's light-sensing device has poor light sensitivity or insufficient light, the noise intensity in the image is greater, severely affecting people's visual experience.
[0003] However, traditional techniques such as bilateral filtering and Gaussian filtering can only handle relatively weak noise. For high-noise images, especially those from night vision, denoising can result in significant loss of detail or persistent noise. With advancements in denoising technology and a more comprehensive understanding of the sources of camera noise, the advantages of neural network denoising techniques have become apparent. The English paper "Physics-based Noise Modeling for Extreme Low-light Photography" (by Kaixuan Wei) addresses the difficulty of obtaining raw camera images by analyzing noise patterns during ISP imaging and using statistical methods to model noise at different ISO levels. Deep learning is then used to achieve good denoising results. To further simulate noise more accurately, the English paper "Lighting Every Darkness in Two Pairs: A Calibration-Free Pipeline for RAW Denoising" (by Xin Jin) proposes that combining a small amount of real paired data can effectively improve denoising performance.
[0004] Existing neural networks all use the UNet structure for denoising, and they increase network complexity mainly in two ways: (1) by increasing the number of channels or the depth of the network, and (2) by using an effective attention mechanism module. To increase the number of channels or the depth, the model proposed in the paper "Physics-based Noise Modeling for Extreme Low-light Photography" uses multiple autoencoder structures to extract abstract information about image diversity. The paper "Lighting Every Darkness in Two Pairs: A Calibration-Free Pipeline for RAW Denoising" uses a sampling reparameterization method based on UNet, together with a camera-specific feature alignment module that adjusts the input distribution so that the network adapts to different cameras. Because channel attention and spatial attention mechanisms can capture global information and discard redundant information, they are widely used in low-level tasks such as image denoising and dehazing. The paper "Efficient Denoising of Extremely Low-Light Original Images Based on Reparameterized Multi-Scale Fusion Network" (authors Wei Kaixuan and Fu Ying) uses a reparameterization method to reduce network complexity, reduce computational memory and improve speed, and adds an attention mechanism module to the network to improve its denoising effect. To improve the practicality of a model without compromising denoising quality, knowledge distillation technology has been widely used. In the paper "A Lightweight Image Denoising Model Based on Feature Knowledge Distillation" (authors: Shen Yu, Zhang Dingni), a sufficiently large network is first constructed and trained to obtain a teacher model with good denoising effect. Then a relatively small model, called the student model, is constructed. By guiding layers of the student model to learn from the corresponding layers of the teacher model, the denoising effect of the student model can approach that of the teacher model.
[0005] However, most existing neural network denoising techniques aim to improve denoising quality, neglecting the importance of memory and speed in practical applications. Considering issues such as cost and deployment, chip manufacturers want chips to have as little memory as possible, which has led to an increasing emphasis on lightweight networks.
[0006] Furthermore, the terminology used in this art includes:
[0007] Knowledge distillation: a model compression technique in deep learning that transfers rich knowledge from a large deep neural network to a relatively small network, thereby reducing computational cost without affecting performance.
[0008] ISO: The camera's sensitivity to light.
[0009] Reparameterization: A technique in deep learning that expresses one random variable as a function of another random variable.
Summary of the Invention
[0010] The purpose of this application is to address the following issues: To overcome the problems of large size and poor practicality of existing complex denoising networks, a knowledge distillation method is proposed. First, a large model with good denoising performance is constructed as the teacher model and trained. Then, a suitable small model is constructed as the student model. The features of the student model are distilled using the intermediate layer features of the teacher model to achieve a lightweight model with denoising performance close to that of the teacher model.
[0011] Specifically, the present invention provides a method for implementing a lightweight denoising network using raw domain knowledge distillation. The method first selects appropriate convolutional layer output features from the trained and frozen teacher network layer as a guide layer through feature distillation. Then, it selects the convolutional output of the corresponding layer from the constructed student model. Finally, it calculates the total loss of the corresponding layer and the loss between the student model's own output and the clean image, and applies it to the student model, causing the student model to iterate continuously until a stable result is achieved based on this loss.
[0012] The method specifically includes the following steps:
[0013] S1: First, a suitable large model is built as the teacher model. The network consists of two encoders and two decoders. Each encoder contains a convolution, a ReLU activation function, and an attention mechanism module. The convolution kernel of the convolution operator is 3x3. The encoder of this module is described by Equation (1).
[0014] $f_{\mathrm{encoder}} = \mathrm{att\_block}(\mathrm{conv}(\mathrm{feature}_{B \times C \times H \times W}))$ (1)
[0015] In the formula, $f_{\mathrm{encoder}}$ represents the encoder output, att_block is the attention mechanism module, conv is the convolutional layer, and $\mathrm{feature}_{B \times C \times H \times W}$ is the feature map output by the previous module, with dimensions B×C×H×W.
[0016] S2: For the attention mechanism in step S1, the spatial attention and channel attention mechanisms are combined to extract the effective information of the corresponding layer. The corresponding layer refers to the output feature map of the convolutional structure before the attention module.
[0017] S3: After building the network structure of the teacher model through steps S1 and S2, the next step is to train the teacher model.
[0018] S4: Following step S3, the trained teacher model is obtained. All layer parameters of the teacher network are frozen. Next, the student network structure is built.
[0019] S5: Construct the student model; use feature distillation to select appropriate feature maps for corresponding layers in the teacher and student models. The output at the minimum scale and the output of the last attention module in the teacher model are selected as guiding features, and the features at the corresponding positions of the student model are extracted accordingly. The difference between the two feature maps is calculated by the following formula (6). A smaller loss in formula (6) indicates that the student model has learned the knowledge of the teacher model better, which helps improve the generalization of the student model.
[0020]
[0021] In the formula, w1 and w2 are the weight coefficients of the corresponding layers, which are used to select the importance of feature guidance;
[0022] S6: After constructing the student model, train it. The parameters involved in training the student model are the same as those used in training the teacher model. If the input noisy image is `input` and the student model is `model_student`, then the noise output by the model is denoted `noise_out`, as shown in the following formulas:
[0023] noise_out = model_student(input) (7)
[0024] pre_denoise_img = input - α * noise_out (8)
[0025] loss_total = |pre_denoise_img - target_img| * (1 - w1 - w2) + loss1 (9)
In the formulas, `pre_denoise_img` represents the final predicted clean image, and `α` controls the denoising intensity. A larger value results in cleaner denoising but may damage image details, leading to blurring. A smaller value of `α` leaves more residual noise in the denoised image. loss_total is the total loss function for training the student model.
[0026] Step S1 further includes:
[0027] A new network structure based on the basic UNet structure is designed as the teacher model. The overall framework of the distillation model is as follows:
[0028] The noisy image `input` is fed to both the teacher model and the student model. The teacher model passes its features to the student model through feature distillation, and the student model finally outputs the predicted denoised image `pre_denoise_img`.
[0029] Step S2 further includes:
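[0030] The specific implementation process is described in equations (2), (3), and (4), where $f_{\mathrm{spatial}}$ and $f_{\mathrm{channel}}$ are the output feature maps of the spatial attention module and the channel attention module, respectively.
[0031] $f_{\mathrm{spatial}} = \mathrm{sigmoid}(\mathrm{conv2}(\mathrm{conv1}(\mathrm{feature}_{B \times C \times H \times W}))) \cdot \mathrm{feature}_{B \times C \times H \times W}$ (2)
$f_{\mathrm{channel}} = \mathrm{conv2}(\mathrm{conv1}(\mathrm{maxpool}(\mathrm{feature}_{B \times C \times H \times W}) + \mathrm{avgpool}(\mathrm{feature}_{B \times C \times H \times W})))$ (3)
$f_{\mathrm{channel}} = \mathrm{sigmoid}(f_{\mathrm{channel}}) \cdot f_{\mathrm{channel}}$ (4)
[0032] To characterize the strength of the extracted effective feature information, the two attention features obtained above are represented jointly, where $f_{\mathrm{out}}$ is the final output feature after joint attention. As the model iterates, the features obtained after spatial attention are weighted once more by the sigmoid(·) function according to the importance of the channels, which weakens the information on unnecessary channels;
[0033] $f_{\mathrm{out}} = \mathrm{sigmoid}(f_{\mathrm{spatial}}) \cdot f_{\mathrm{spatial}}$ (5).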
[0034] Step S3 further includes:
[0035] The optimizer selected in this method is the Adam optimizer, with an initial learning rate of 3e-4 and an epoch number of 110. After every 20 epochs, the learning rate is reduced by half, and training is stopped when the learning rate drops to 1e-6.
[0036] Step S4 further includes:
[0037] For the student network, all attention mechanisms in the teacher model are removed, and the remaining network structure is used as the student model network architecture.
[0038] The method is applied to the field of low-light denoising. To reduce the computational load of the denoising network, the knowledge distillation method is used to compress the model, yielding a small model that still denoises well. Using the pre-trained neural network, it can remove noise from various cameras in extremely dark scenes.
[0039] Therefore, this application has the following advantages:
[0040] By employing the feature distillation compression strategy described above, the feature extraction capabilities of the large model can be transferred to the student model, ensuring that the denoising performance of the small model is not affected while reducing network complexity. This method effectively removes noise from various cameras in extremely dark scenes using a pre-trained neural network.
Attached Figure Description
[0041] The accompanying drawings, which are provided to further illustrate the invention and form part of this application, are not intended to limit the scope of the invention.
[0042] Figure 1 is a schematic diagram of the overall framework of the distillation model.
[0043] Figure 2 is a schematic diagram of the training process for the lightweight distillation model.
[0044] Figure 3 is a schematic diagram of the specific student model structure in this method.
Detailed Implementation
[0045] To better understand the technical content and advantages of the present invention, the present invention will now be described in further detail with reference to the accompanying drawings.
[0046] This application applies to the field of low-light denoising. To reduce the computational cost of the denoising network, a knowledge distillation method is used to compress the model, resulting in a smaller model while maintaining good denoising performance. The method mentioned in this application can effectively remove noise from various cameras in extremely dark scenes using a pre-trained neural network.
[0047] This application proposes a lightweight denoising network implementation method based on raw domain knowledge distillation. Extensive experimental verification demonstrates that this method achieves lightweight denoising networks with superior performance compared to non-distilled small networks. The method employs feature distillation. First, it selects appropriate convolutional layer output features from the trained and frozen teacher network layers as the guiding layer. Then, it selects the convolutional output of the corresponding layer from the constructed student model. Finally, it calculates the total loss of the corresponding layer and the loss between the student model's output and the clean image, applying this total loss to the student model. This causes the student model to iterate continuously until a stable result is achieved.
[0048] The specific implementation steps of this method are as follows:
[0049] Step S1: First, a suitable large-scale model is constructed as the teacher model. This invention designs a new network structure as the teacher model based on the basic UNet structure. Figure 1 illustrates the overall framework of the distillation model: the noisy image `input` is fed into both the teacher and student models. The teacher model distills its features and feeds them to the student model, which ultimately outputs the predicted denoised image `pre_denoise_img`. Since the teacher model's parameters are pre-trained, they do not participate in the training process, while the student model requires training. Therefore, given the same input image, the teacher model can guide the student model to learn the teacher model's feature extraction capabilities. The network consists of two encoders and two decoders. Each encoder contains a convolutional layer, a ReLU activation function, and an attention mechanism module, where the convolution kernel of the convolution operator is 3x3. The encoder of this module can be described simply by formula (1) below.
[0050] $f_{\mathrm{encoder}} = \mathrm{att\_block}(\mathrm{conv}(\mathrm{feature}_{B \times C \times H \times W}))$ (1)
[0051] In the formula, $f_{\mathrm{encoder}}$ represents the encoder output, att_block is the attention mechanism module, conv is the convolutional layer, and $\mathrm{feature}_{B \times C \times H \times W}$ is the feature map output by the previous module, with dimensions B×C×H×W.
Step S2: Regarding the attention mechanism in step S1, this method combines spatial attention and channel attention mechanisms to extract effective information from the corresponding layer. The corresponding layer refers to the output feature map of the convolutional structure preceding the attention module. The specific implementation process is described in equations (2), (3), and (4), where $f_{\mathrm{spatial}}$ and $f_{\mathrm{channel}}$ are the output feature maps of the spatial attention module and the channel attention module, respectively.
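[0052] $f_{\mathrm{spatial}} = \mathrm{sigmoid}(\mathrm{conv2}(\mathrm{conv1}(\mathrm{feature}_{B \times C \times H \times W}))) \cdot \mathrm{feature}_{B \times C \times H \times W}$ (2)
[0053] $f_{\mathrm{channel}} = \mathrm{conv2}(\mathrm{conv1}(\mathrm{maxpool}(\mathrm{feature}_{B \times C \times H \times W}) + \mathrm{avgpool}(\mathrm{feature}_{B \times C \times H \times W})))$ (3)
$f_{\mathrm{channel}} = \mathrm{sigmoid}(f_{\mathrm{channel}}) \cdot f_{\mathrm{channel}}$ (4)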
[0054] To characterize the strength of the extracted effective feature information, the two attention features obtained above are represented jointly, where $f_{\mathrm{out}}$ is the final output feature after joint attention. As the model iterates, the features obtained after spatial attention are weighted once more by the sigmoid(·) function according to the importance of the channels, which weakens the information on unnecessary channels;
[0055] $f_{\mathrm{out}} = \mathrm{sigmoid}(f_{\mathrm{spatial}}) \cdot f_{\mathrm{spatial}}$ (5)
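As a rough illustration of equations (2) to (5), the following pure-Python sketch applies the gating operations to a single small feature map. It is only a sketch under stated assumptions: the text does not give the kernel sizes of conv1 and conv2 or the pooling window, so 1x1 (pointwise) convolutions and global pooling over H×W are assumed here, the batch dimension is omitted, and the helper names (`conv1x1`, `gate`, and so on) and the identity weights are hypothetical. Equation (5) is implemented exactly as printed, with f_spatial on both sides of the product.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def conv1x1(feature, weight):
    """Pointwise convolution (assumed). feature: [C_in][H][W]; weight: [C_out][C_in]."""
    c_in, h, w = len(feature), len(feature[0]), len(feature[0][0])
    return [[[sum(weight[o][c] * feature[c][i][j] for c in range(c_in))
              for j in range(w)] for i in range(h)]
            for o in range(len(weight))]


def gate(gate_map, feature):
    """Element-wise sigmoid(gate_map) * feature; a 1x1 gate plane is broadcast over H and W."""
    out = []
    for c, plane in enumerate(feature):
        gh, gw = len(gate_map[c]), len(gate_map[c][0])
        out.append([[sigmoid(gate_map[c][i % gh][j % gw]) * v
                     for j, v in enumerate(row)]
                    for i, row in enumerate(plane)])
    return out


def spatial_attention(feature, w1, w2):
    # Eq. (2): f_spatial = sigmoid(conv2(conv1(feature))) * feature
    return gate(conv1x1(conv1x1(feature, w1), w2), feature)


def channel_attention(feature, w1, w2):
    # Eq. (3): f_channel = conv2(conv1(maxpool(feature) + avgpool(feature)))
    # Global pooling per channel is an assumption; the result has shape [C][1][1].
    pooled = []
    for plane in feature:
        values = [v for row in plane for v in row]
        pooled.append([[max(values) + sum(values) / len(values)]])
    f_channel = conv1x1(conv1x1(pooled, w1), w2)
    # Eq. (4): f_channel = sigmoid(f_channel) * f_channel
    return gate(f_channel, f_channel)


def joint_attention(f_spatial):
    # Eq. (5) as printed: f_out = sigmoid(f_spatial) * f_spatial
    return gate(f_spatial, f_spatial)


# Toy input with C=2, H=2, W=2 and identity weights (illustrative values only).
feature = [[[1.0, 2.0], [3.0, 4.0]], [[0.0, -1.0], [1.0, 0.0]]]
identity = [[1.0, 0.0], [0.0, 1.0]]
f_spatial = spatial_attention(feature, identity, identity)
f_channel = channel_attention(feature, identity, identity)
f_out = joint_attention(f_spatial)
```

In a real network the weights would be learned and the operations would run on batched tensors; the sketch only shows how each equation maps onto a sigmoid gate applied to a feature map.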
[0056] Step S3: After constructing the network structure of the teacher model through steps S1 and S2, the next step is to train the teacher model. The optimizer selected in this method is the Adam optimizer, with an initial learning rate of 3e-4 and an iteration count of 110 epochs. Every 20 epochs, the learning rate is reduced by half, and training stops when the learning rate drops to 1e-6.
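The learning-rate schedule in step S3 can be worked through numerically. The sketch below is a minimal illustration, assuming that "every 20 epochs" means the rate is halved at epochs 20, 40, 60 and so on, and that the stopping rule is checked before each epoch; the function name `lr_schedule` and its parameter names are hypothetical. It does not touch the Adam optimizer itself, only the rate that would be handed to it.

```python
def lr_schedule(initial_lr=3e-4, epochs=110, step=20, factor=0.5, min_lr=1e-6):
    """Return the learning rate used in each epoch under a step-halving schedule."""
    rates = []
    lr = initial_lr
    for epoch in range(epochs):
        if epoch > 0 and epoch % step == 0:
            lr *= factor          # halve the rate every `step` epochs
        if lr <= min_lr:          # stop once the rate has dropped to the floor
            break
        rates.append(lr)
    return rates


rates = lr_schedule()
final_lr = rates[-1]              # rate in the last of the 110 epochs
```

Under this reading the rate is halved five times within 110 epochs, so the last epochs run at 3e-4 / 32 = 9.375e-6. The 1e-6 threshold is therefore not reached in a 110-epoch run and would only take effect in a longer one.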
[0057] Step S4: Following the previous step, the trained teacher model is obtained. All layer parameters of the teacher network are frozen. Next, the student network structure is built. For the student network, this application removes all attention mechanisms from the teacher model and uses the remaining network structure as the student model's architecture. The specific student model structure is shown in Figure 3: it consists of encoding modules and decoding modules, each of which contains a convolution and a ReLU activation function.
[0058] Step S5: Train the student model; use feature distillation to select appropriate feature maps of corresponding layers in the teacher and student models. This application selects the output at the minimum scale and the output of the last attention module in the teacher model as guiding features, and the features at the corresponding positions of the student model are extracted accordingly. The difference between the two feature maps is calculated by the following formula (6). A smaller loss in formula (6) indicates that the student model has learned the knowledge of the teacher model better, which helps improve the generalization of the student model.
[0059]
[0060] In the formula, w1 and w2 are the weight coefficients of the corresponding layers, which are used to select the importance of feature guidance.
[0061] Step S6: After constructing the student model, train the student model.
[0062] The parameters involved in training the student model are the same as those used in training the teacher model. If the input noisy image is `input` and the student model is `model_student`, then the model's output noise is represented as `noise_out`, as shown in the following equation:
[0063] noise_out = model_student(input) (7)
[0064] pre_denoise_img = input - α * noise_out (8)
[0065] loss_total = |pre_denoise_img - target_img| * (1 - w1 - w2) + loss1 (9)
[0066] In the formulas, `pre_denoise_img` represents the final predicted clean image, and `α` controls the denoising intensity: the larger the value, the cleaner the denoised image, but image details are damaged and the image becomes blurry; the smaller the value of `α`, the more residual noise remains in the denoised image. loss_total is the total loss function for training the student model.
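Equations (8) and (9) can be checked with a small numeric example. The sketch below is illustrative only: flat Python lists stand in for raw image pixels, `noise_out` is a hand-written stand-in for the output of the student model in equation (7), the absolute difference is averaged over pixels (the reduction is not stated in the text, so the mean is an assumption), and the values of `alpha`, `w1`, `w2` and `loss1` are hypothetical. The feature loss `loss1` is passed in as a number because its formula is not reproduced here.

```python
def denoise(noisy, noise_out, alpha):
    # Eq. (8): pre_denoise_img = input - alpha * noise_out
    return [x - alpha * n for x, n in zip(noisy, noise_out)]


def total_loss(pre_denoise_img, target_img, loss1, w1, w2):
    # Eq. (9): |pre_denoise_img - target_img| * (1 - w1 - w2) + loss1
    # Mean reduction over pixels is an assumption.
    l1 = sum(abs(p - t) for p, t in zip(pre_denoise_img, target_img)) / len(target_img)
    return l1 * (1.0 - w1 - w2) + loss1


noisy = [0.5, 0.25, 0.75, 1.0]
noise_out = [0.25, 0.0, 0.25, 0.5]      # stand-in for model_student(input), Eq. (7)
target = [0.25, 0.25, 0.5, 0.5]

full = denoise(noisy, noise_out, alpha=1.0)
half = denoise(noisy, noise_out, alpha=0.5)

loss_full = total_loss(full, target, loss1=0.125, w1=0.25, w2=0.25)
loss_half = total_loss(half, target, loss1=0.125, w1=0.25, w2=0.25)
```

With these toy numbers, subtracting the full predicted noise recovers the target exactly, so only the feature term remains in the loss, while `alpha=0.5` leaves residual noise and a larger reconstruction term. Because the reconstruction term is scaled by (1 - w1 - w2), raising the feature weights shifts emphasis from matching the clean image toward matching the teacher's features.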
[0067] Specifically, the complete training process of the implementation example is shown in Figure 2. It is the overall training process of this method and includes:
[0068] Sa: obtain paired noisy-clean raw images; Sb and Sd are then performed separately;
[0069] Sb: determine whether this is teacher model training; if yes, proceed to step Sc; otherwise, end;
[0070] Sc: train the teacher model;
[0071] Sd: freeze the teacher model;
[0072] Se: extract feature maps of the corresponding layers from the teacher model and the student model, respectively;
[0073] Sf: train the student model; the loss function combines the difference between the student model's output and the clean image with the loss between the feature maps;
[0074] Sg: determine whether the number of iterations is less than epoch (epoch must be specified manually); if not, return to step S6; if yes, end.
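The control flow of steps Sa to Sg can be sketched as a small driver function. This is a hedged outline rather than the applicant's implementation: the callables (`train_teacher`, `freeze_teacher`, `extract_features`, `student_step`) are hypothetical placeholders supplied by the caller, and step Sg is read here as "repeat Se and Sf until the manually specified number of epochs has been run", which is an assumption about the intended loop condition.

```python
def run_training(pairs, teacher_needs_training, epochs,
                 train_teacher, freeze_teacher, extract_features, student_step):
    """Two-stage flow: train and freeze the teacher, then distill into the student."""
    steps = []
    if teacher_needs_training:            # Sb: is this teacher model training?
        train_teacher(pairs)              # Sc: train the teacher model
        steps.append("Sc")
    freeze_teacher()                      # Sd: freeze the teacher model
    steps.append("Sd")
    for _ in range(epochs):               # Sg: epoch count is specified manually
        for noisy, clean in pairs:        # Sa: paired noisy-clean raw images
            teacher_feat, student_feat = extract_features(noisy)    # Se
            student_step(noisy, clean, teacher_feat, student_feat)  # Sf
        steps.append("Se+Sf")
    return steps


# Stub callables that only record what was called (for illustration).
calls = []
steps = run_training(
    pairs=[("noisy_0", "clean_0"), ("noisy_1", "clean_1")],
    teacher_needs_training=True,
    epochs=3,
    train_teacher=lambda pairs: calls.append("teacher"),
    freeze_teacher=lambda: calls.append("freeze"),
    extract_features=lambda noisy: ("t_" + noisy, "s_" + noisy),
    student_step=lambda noisy, clean, t, s: calls.append((noisy, t, s)),
)
```

Passing the stages in as callables keeps the ordering constraint visible: the teacher is trained and frozen before any student update, so the guiding features stay fixed throughout distillation.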
[0075] In summary, the key technical solution of this application lies in:
[0076] 1. In the raw image domain, feature distillation is used to transfer the denoising effect of a large model to a small model;
[0077] 2. Construct a teacher model by combining spatial and channel attention modules.
[0078] The above description is merely a preferred embodiment of the present invention and is not intended to limit the present invention. For those skilled in the art, various modifications and variations can be made to the embodiments of the present invention. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of the present invention should be included within the protection scope of the present invention.
Claims
1. A method for implementing a lightweight denoising network for raw domain knowledge distillation, characterized in that, The method employs feature distillation. First, it selects appropriate convolutional layer output features from the trained and frozen teacher network layer as a guide layer. Then, it selects the convolutional output of the corresponding layer from the constructed student model. Finally, it calculates the total loss of the corresponding layer and the loss between the student model's own output and the clean image, and applies it to the student model, causing the student model to iterate continuously until a stable result is achieved based on this loss.
2. The implementation method of a lightweight denoising network for raw domain knowledge distillation according to claim 1, characterized in that, The method specifically includes the following steps: S1: First, a suitable large model is built as the teacher model. The network consists of two encoders and two decoders. Each encoder contains a convolution, a ReLU activation function, and an attention mechanism module. The convolution kernel of the convolution operator is 3x3. The encoder of this module is described by Equation (1). $f_{\mathrm{encoder}} = \mathrm{att\_block}(\mathrm{conv}(\mathrm{feature}_{B \times C \times H \times W}))$ (1) In the formula, $f_{\mathrm{encoder}}$ represents the output of the encoder, att_block is the attention mechanism module, conv is the convolution layer, and $\mathrm{feature}_{B \times C \times H \times W}$ is the feature map output by the previous module, with dimensions B×C×H×W; S2: For the attention mechanism in step S1, the spatial attention and channel attention mechanisms are combined to extract the effective information of the corresponding layer. The corresponding layer refers to the output feature map of the convolutional structure before the attention module. S3: After building the network structure of the teacher model through steps S1 and S2, the next step is to train the teacher model. S4: Following step S3, the trained teacher model is obtained. All layer parameters of the teacher network are frozen. Next, the student network structure is built. S5: Construct the student model; use feature distillation to select appropriate feature maps for corresponding layers in the teacher and student models. The output at the minimum scale and the output of the last attention module in the teacher model are selected as guiding features, and the features at the corresponding positions of the student model are extracted accordingly. The difference between the two feature maps is calculated by the following formula (6). A smaller loss in formula (6) indicates that the student model has learned the knowledge of the teacher model better, which helps improve the generalization of the student model. In the formula, w1 and w2 are the weight coefficients of the corresponding layers, which are used to select the importance of feature guidance; S6: After constructing the student model, train it. The parameters involved in training the student model are the same as those used in training the teacher model. If the input noisy image is `input` and the student model is `model_student`, then the noise output by the model is denoted `noise_out`, as shown in the following formulas: noise_out = model_student(input) (7) pre_denoise_img = input - α * noise_out (8) loss_total = |pre_denoise_img - target_img| * (1 - w1 - w2) + loss1 (9) In the formulas, `pre_denoise_img` represents the final predicted clean image, and `α` controls the denoising intensity: the larger the value, the cleaner the denoised image, but image details are damaged and the image becomes blurry; the smaller the value of `α`, the more residual noise remains in the denoised image. loss_total is the total loss function for training the student model.
3. The implementation method of a lightweight denoising network for raw domain knowledge distillation according to claim 2, characterized in that, Step S1 further includes: A new network structure based on the basic UNet structure is designed as the teacher model. The overall framework of the distillation model is as follows: The noisy image `input` is fed to both the teacher model and the student model. The teacher model passes its features to the student model through feature distillation, and the student model finally outputs the predicted denoised image `pre_denoise_img`.
4. The implementation method of a lightweight denoising network for raw domain knowledge distillation according to claim 2, characterized in that, Step S2 further includes: The specific implementation process is described in equations (2), (3), and (4), where $f_{\mathrm{spatial}}$ and $f_{\mathrm{channel}}$ are the output feature maps of the spatial attention module and the channel attention module, respectively. $f_{\mathrm{spatial}} = \mathrm{sigmoid}(\mathrm{conv2}(\mathrm{conv1}(\mathrm{feature}_{B \times C \times H \times W}))) \cdot \mathrm{feature}_{B \times C \times H \times W}$ (2) $f_{\mathrm{channel}} = \mathrm{conv2}(\mathrm{conv1}(\mathrm{maxpool}(\mathrm{feature}_{B \times C \times H \times W}) + \mathrm{avgpool}(\mathrm{feature}_{B \times C \times H \times W})))$ (3) $f_{\mathrm{channel}} = \mathrm{sigmoid}(f_{\mathrm{channel}}) \cdot f_{\mathrm{channel}}$ (4) To characterize the strength of the extracted effective feature information, the two attention features obtained above are represented jointly, where $f_{\mathrm{out}}$ is the final output feature after joint attention. As the model iterates, the features obtained after spatial attention are weighted once more by the sigmoid(·) function according to the importance of the channels, which weakens the information on unnecessary channels; $f_{\mathrm{out}} = \mathrm{sigmoid}(f_{\mathrm{spatial}}) \cdot f_{\mathrm{spatial}}$ (5).
5. The implementation method of a lightweight denoising network for raw domain knowledge distillation according to claim 4, characterized in that, Step S3 further includes: The optimizer selected in this method is the Adam optimizer, with an initial learning rate of 3e-4 and an epoch number of 110. After every 20 epochs, the learning rate is reduced by half, and training is stopped when the learning rate drops to 1e-6.
6. The implementation method of a lightweight denoising network for raw domain knowledge distillation according to claim 5, characterized in that, Step S4 further includes: For the student network, all attention mechanisms in the teacher model are removed, and the remaining network structure is used as the student model network architecture.
7. The implementation method of a lightweight denoising network for raw domain knowledge distillation according to claim 1, characterized in that, The method is applied to the field of low-light denoising. In order to reduce the computational load of the denoising network, the knowledge distillation method is used to compress the model, and denoising is carried out at the same time as obtaining a small model. It can use the pre-trained neural network to remove noise from various cameras in extremely dark scenes.