An adjustable video noise reduction method based on deep learning

By introducing a non-end-to-end spatial domain denoising model and a method of converting motion masks into denoising intensity maps during video denoising, the problem of the inability to adjust denoising intensity in existing technologies is solved, enabling flexible control of video clarity and moving target recognition, and improving video quality.

CN122243783A · Pending · Publication Date: 2026-06-19 · HEFEI JUNZHENG TECH CO LTD

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Applications (China)
Current Assignee / Owner: HEFEI JUNZHENG TECH CO LTD
Filing Date: 2024-12-17
Publication Date: 2026-06-19


Abstract

This invention provides an adjustable video denoising method based on deep learning, comprising: S1. Extracting the current frame image from the video stream to be processed, and obtaining the previous fused frame image and the previous motion mask from a preset buffer; S2. Inputting the current frame image, the previous fused frame image, and the previous motion mask into a temporal denoising model, and outputting the current frame fused image and the current motion mask; S3. Inputting the current frame fused image into a spatial denoising model, and outputting a corresponding noise map; S4. Converting the current motion mask into a denoising intensity map according to a preset mapping rule; S5. Calculating the spatial denoising result of the current fused frame based on the current frame fused image, the noise map output by the spatial denoising model, and the denoising intensity map converted by the current motion mask. This method solves the problem that when using end-to-end denoising networks for video denoising, the denoising intensity cannot be adjusted, making it difficult to adapt to users' different preferences for video clarity and the recognizability of moving targets.

Description

Technical Field

[0001] This invention belongs to the field of video image noise reduction technology, and specifically relates to an adjustable video noise reduction method based on deep learning.

Background Technology

[0002] In low-light environments, the image sensor receives fewer photons, resulting in a low signal-to-noise ratio. The effective information in the video is often overwhelmed by noise. To improve video quality, noise reduction processing is required. Temporal noise reduction can only remove noise with zero or near-zero mean characteristics in time series, while spatial noise reduction can remove the noise remaining after temporal noise reduction. Usually, temporal noise reduction and spatial noise reduction are used in combination to achieve the best noise reduction effect.

[0003] With the rapid development of deep learning technology, existing image denoising methods can be divided into non-deep learning methods and deep learning-based methods.

[0004] Non-deep learning-based temporal denoising methods smooth noise by stacking consecutive frames of a video and taking the per-pixel mean or median. They also mitigate the motion blur caused by misalignment in motion regions through motion estimation and compensation. Deep learning-based temporal denoising methods are data-driven: multiple consecutive frames of a video are input into a network, which directly outputs the denoised image. The denoising model is trained through backpropagation and achieves better denoising performance than non-deep learning-based methods. End-to-end spatial denoising models based on CNN and Transformer architectures also generally outperform traditional spatial denoising algorithms such as bilateral filtering and BM3D.
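The classical frame-stacking approach described above can be illustrated with a minimal sketch. It assumes the frames are already aligned (no motion estimation or compensation is performed), represents each frame as a nested list of pixel values, and uses the hypothetical helper name `temporal_denoise`; none of this comes from the application itself.

```python
from statistics import mean, median

def temporal_denoise(frames, reducer=mean):
    """Combine co-located pixels of pre-aligned frames with `reducer` (mean or median)."""
    height, width = len(frames[0]), len(frames[0][0])
    return [
        [reducer(frame[y][x] for frame in frames) for x in range(width)]
        for y in range(height)
    ]

# Three noisy observations of a static 1x2 scene (illustrative values only).
frames = [
    [[98, 53]],
    [[103, 49]],
    [[99, 48]],
]
averaged = temporal_denoise(frames, mean)      # per-pixel mean over time
medianed = temporal_denoise(frames, median)    # per-pixel median over time
```

If the scene moves between frames, the same averaging mixes different objects into one pixel, which is the ghosting and blur that motion compensation is meant to reduce.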

[0005] However, the shortcomings of existing technology are:

[0006] Although non-deep learning-based denoising methods are usually designed in multiple stages and have a high degree of denoising intensity adjustability, their denoising effect is generally worse than that of deep learning-based denoising methods. Deep learning-based end-to-end denoising methods directly output the denoised image and cannot adjust the denoising intensity, thus failing to meet users' different preferences for video clarity and the recognizability of moving targets.

[0007] Furthermore, the terminology used in this art includes:

[0008] Temporal denoising: In video or dynamic image sequences, the average or median of multiple consecutive frames over time is calculated to reduce random noise. Advanced temporal denoising methods often incorporate motion estimation and motion compensation to reduce ghosting and blurring issues.

[0009] Spatial domain denoising: This involves processing an image within its spatial domain to reduce or eliminate noise. Here, "spatial domain" refers to the two-dimensional space of the image, that is, each pixel and its surrounding pixels. Spatial domain denoising techniques primarily focus on local features of the image, analyzing and processing each pixel and its neighborhood to achieve noise reduction.

[0010] CNN: Convolutional Neural Networks (CNNs) are a type of deep learning model that excels in tasks such as image and video recognition, classification, and segmentation. By mimicking the workings of the human visual system, CNNs can automatically learn useful features from images.

[0011] Transformer: The Transformer model is a deep learning model that uses an attention mechanism. This mechanism can assign different weights according to the different importance of each part of the input data. Thanks to its powerful ability to capture long-distance dependencies, it is widely used in many fields such as computer vision and natural language processing.

[0012] BM3D: BM3D (Block-Matching and 3D filtering) is a block-based image denoising algorithm proposed by Dabov et al. in 2007.

Summary of the Invention

[0013] To address the aforementioned issues, the purpose of this application is to provide a deep learning-based adjustable video denoising method. This method addresses the problem that when using end-to-end denoising networks for video denoising, the denoising intensity cannot be adjusted, making it difficult to adapt to users' different preferences regarding video clarity and the recognizability of moving targets.

[0014] Specifically, the present invention provides an adjustable video noise reduction method based on deep learning, the method comprising the following steps:

[0015] S1. Extract the current frame image from the video stream to be processed, and obtain the previous fused frame image and the previous motion mask from the preset buffer;

[0016] The video stream to be processed refers to the RAW images continuously output by the image sensor over a continuous period of time. The current frame image is the RAW image output at the current moment. The preset buffer is a memory space reserved for storing the preceding fused frame image and the preceding motion mask. The preceding fused frame and the preceding motion mask refer to the output of the temporal denoising model at the previous moment. The preceding motion mask reflects the motion confidence of all pixels in the preceding fused frame image.

[0017] S2. Input the current frame image, the previous fused frame image, and the previous motion mask into the temporal denoising model, and output the current frame fused image and the current motion mask; the current motion mask contains the motion confidence of all pixels in the current fused frame image;

[0018] S3. Input the fused image of the current frame into the spatial domain noise reduction model and output the corresponding noise map;

[0019] The spatial domain denoising model is a non-end-to-end model, which differs from end-to-end networks that directly output denoised images from input fused frame images. The provided spatial domain denoising model outputs a noise map of the input fused frame image, allowing for free adjustment of the denoising intensity. The noise map refers to the noise information in the fused frame image calculated by the spatial domain denoising model.

[0020] S4. Convert the current motion mask into a noise reduction intensity map according to a preset mapping rule. Noise levels differ between regions of the video, which makes the image inconsistent after noise reduction; to solve this problem, a spatial domain noise reduction method with region-wise adjustment of the noise reduction intensity is adopted. The current motion mask output by the temporal domain noise reduction model contains the motion confidence of all pixels in the current fused frame image, and the current fused frame has high noise intensity in motion areas and low noise intensity in static areas. The motion mask can therefore be converted into a noise reduction intensity map corresponding to the current fused frame, so that the noise reduction intensity of each pixel in the current fused frame is determined by its motion state;

[0021] S5. Calculate the spatial denoising result of the current fused frame based on the current frame fused image, the noise map output by the spatial denoising model, and the denoising intensity map of the current motion mask transformation.
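To make the data flow of steps S1 to S5 concrete, the following is a minimal sketch of the recurrent loop. The three model functions are toy stand-ins invented for illustration (a difference-based mask, a local-mean noise estimate, and a linear mapping); they are not the networks or the mapping rule of this application. Frames are simplified to flat lists of pixel values.

```python
def temporal_model(cur, prev_fused, prev_mask):
    """Toy stand-in for the temporal model (S2); prev_mask is accepted but unused here."""
    if prev_fused is None:                       # first frame: nothing to fuse with
        return list(cur), [1.0] * len(cur)
    # Hypothetical motion confidence: scaled absolute frame difference, clipped to [0, 1].
    mask = [min(abs(c - p) / 64.0, 1.0) for c, p in zip(cur, prev_fused)]
    # Moving pixels follow the current frame; static pixels are averaged with the past.
    fused = [m * c + (1.0 - m) * (0.5 * c + 0.5 * p)
             for m, c, p in zip(mask, cur, prev_fused)]
    return fused, mask

def spatial_model(fused):
    """Toy stand-in for the spatial model (S3): returns a noise map, not a clean image."""
    n = len(fused)
    noise = []
    for i in range(n):
        window = fused[max(0, i - 1):min(n, i + 2)]
        noise.append(fused[i] - sum(window) / len(window))
    return noise

def mask_to_strength(mask, low=0.3, high=0.9):
    """Toy stand-in for the preset mapping rule (S4): linear ramp from low to high."""
    return [low + (high - low) * m for m in mask]

def denoise_stream(frames):
    buffer = {"fused": None, "mask": None}       # the preset buffer read in S1
    for cur in frames:
        fused, mask = temporal_model(cur, buffer["fused"], buffer["mask"])   # S2
        noise = spatial_model(fused)                                          # S3
        strength = mask_to_strength(mask)                                     # S4
        result = [f - s * n for f, s, n in zip(fused, strength, noise)]       # S5
        # Store the temporal output (not the final result) for the next frame.
        buffer["fused"], buffer["mask"] = fused, mask
        yield result

outputs = list(denoise_stream([[10.0, 10.0, 10.0], [10.0, 10.0, 10.0]]))
```

The buffer holds the temporal model's output rather than the spatially denoised result, matching the description of the preceding fused frame as the output of the temporal denoising model.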

[0022] In step S2, the type of the temporal denoising model is not limited; it includes, but is not limited to, network models of different forms such as convolutional neural networks and recurrent neural networks, which perform implicit motion detection and can output a motion mask together with the fused image. The value range of the motion mask is [0,1], where 0 indicates that the pixel is completely static, 1 indicates that the pixel is in motion, and values between 0 and 1 indicate that the pixel is in motion to varying degrees.

[0023] Step S3 further includes:

[0024] S3.1 Constructing a non-end-to-end spatial domain denoising network; the non-end-to-end spatial domain denoising model is a convolutional neural network with an encoder-decoder architecture, mainly composed of an input layer, encoder, decoder, cross-layer connections, and output layer; wherein, the encoder consists of consecutive convolutional layers, activation layers, and downsampling layers, with the downsampling layers being convolutional layers with a stride of 2; the decoder consists of consecutive convolutional layers, activation layers, and upsampling layers, with the upsampling layers being transposed convolutions with a stride of 2; the input layer consists of convolutional layers and activation layers, and the output layer is a single convolutional layer; the current fused frame is input into this model, the encoder progressively extracts feature information at different scales, and the feature information extracted by the encoder is passed to the decoder through cross-layer connections, and then the decoder progressively fits the noise distribution in the current fused frame, distinguishes between image details and noise, and finally outputs a noise map containing noise information;
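Because the cross-layer connections pass encoder features to the decoder, the feature maps at matching levels need the same spatial size. The sketch below traces sizes through stride-2 convolutions and stride-2 transposed convolutions using the standard output-size formulas for those layers. The kernel sizes, padding values, and number of levels are assumptions chosen for illustration; the application does not specify them.

```python
def conv_out(size, kernel=3, stride=2, padding=1):
    """Output length of a strided convolution along one spatial axis."""
    return (size + 2 * padding - kernel) // stride + 1

def deconv_out(size, kernel=2, stride=2, padding=0):
    """Output length of a transposed convolution along one spatial axis."""
    return (size - 1) * stride - 2 * padding + kernel

def trace_sizes(size, levels=3):
    """Return the spatial sizes seen along the encoder and then the decoder."""
    encoder = [size]
    for _ in range(levels):
        encoder.append(conv_out(encoder[-1]))
    decoder = [encoder[-1]]
    for _ in range(levels):
        decoder.append(deconv_out(decoder[-1]))
    return encoder, decoder

enc, dec = trace_sizes(64)                     # enc shrinks by 2 per level, dec mirrors it
odd_enc, odd_dec = trace_sizes(45, levels=1)   # odd input: sizes no longer mirror
```

With these assumed settings a 64-pixel axis round-trips exactly, while a 45-pixel axis comes back as 46, so an implementation would likely need to pad or crop inputs whose dimensions are not divisible by the total downsampling factor.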

[0025] S3.2 Noise reduction method of the non-end-to-end spatial domain noise reduction network: assuming the input fused frame image is $I_{fused}$, the noise map output by the spatial domain noise reduction model is $N$, and the noise reduction intensity is coeff, the noise-reduced image $I_{denoised}$ is calculated as shown in formula (1).

[0026] $$I_{denoised} = I_{fused} - coeff \cdot N \qquad (1)$$

[0027] Assuming the noise reduction intensity coeff is set to a value between 0 and 1, a value of 0 indicates that no noise reduction is performed on the fused frame image, and a value of 1 indicates that the noise reduction is performed on the fused frame image at the maximum intensity. By adjusting the noise reduction intensity coeff, the noise and sharpness of the fused frame image can be controlled.
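A short worked example may help show the effect of coeff. It assumes formula (1) has the subtractive form $I_{denoised} = I_{fused} - coeff \cdot N$, which is consistent with the statement that coeff = 0 leaves the fused frame unchanged; the pixel values are invented for illustration.

```latex
% One pixel with fused value 104 and estimated noise N = 4 (illustrative numbers).
\begin{align*}
coeff = 0:   \quad & I_{denoised} = 104 - 0 \cdot 4 = 104 \quad \text{(no noise reduction)} \\
coeff = 0.5: \quad & I_{denoised} = 104 - 0.5 \cdot 4 = 102 \quad \text{(half of the estimated noise removed)} \\
coeff = 1:   \quad & I_{denoised} = 104 - 1 \cdot 4 = 100 \quad \text{(maximum noise reduction)}
\end{align*}
```

Intermediate values trade residual noise against the risk of removing fine detail that the model has mistaken for noise.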

[0028] In step S4, since the spatial position of the moving target in the video stream is constantly changing, the current frame image can reflect the true position of the moving target at the current moment. When fusing the current frame image and the previous fused frame image, in order to reduce the motion blur of the moving target, the motion region of the current fused frame mainly refers to the current frame image. At the same time, in order to reduce the noise intensity and jitter of the motion region, it is necessary to refer to a part of the previous fused frame image. In the process of cyclic denoising, motion blur is inevitably introduced. Therefore, the current fused frame image output by the temporal denoising model can be divided into three regions: motion region, blur region, and static region. The noise intensity is arranged in ascending order as static region, blur region, and motion region. Since the noise level in the image is uneven, denoising with a uniform spatial domain denoising intensity will result in an inconsistent image style.

[0029] In step S4, the noise reduction intensity of each pixel in the current fusion frame is determined by its motion state. Assuming that the motion confidence of the pixel at coordinates (x,y) in the current fusion frame is 1 in the corresponding motion mask, it can be considered that the noise intensity of the pixel is large and a large noise reduction coefficient needs to be assigned. The large noise reduction coefficient is 0.8 to 1.0.

[0030] Step S4 further includes:

[0031] S4.1, Align the current motion mask with the image resolution of the current fused frame;

[0032] The image resolution of the motion mask output by the temporal denoising model is not limited, including the same resolution as the current fused frame image or a smaller resolution. Since the motion mask needs to be stored in a preset cache, storing it at a smaller resolution can save memory space.

[0033] When the image resolution of the current motion mask is lower than that of the current fusion frame, the current motion mask should be interpolated to the same image resolution as the current fusion frame. There are no restrictions on the interpolation method; it can be selected according to the requirements of motion mask accuracy and computational efficiency.
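As one possible realization of this alignment step, the sketch below upsamples a low-resolution motion mask with nearest neighbor interpolation, the cheapest of the commonly used options. The function name `upsample_nearest` and the nested-list representation are assumptions made for illustration.

```python
def upsample_nearest(mask, out_h, out_w):
    """Resize a 2-D mask to (out_h, out_w) by copying the nearest source value."""
    in_h, in_w = len(mask), len(mask[0])
    return [
        [mask[y * in_h // out_h][x * in_w // out_w] for x in range(out_w)]
        for y in range(out_h)
    ]

# A 2x2 mask stored in the buffer, expanded to a 4x4 fused-frame resolution.
small = [[0.0, 1.0],
         [0.5, 0.2]]
full = upsample_nearest(small, 4, 4)
```

Storing the mask at half resolution in each dimension keeps only a quarter of the values in the buffer. Nearest neighbor interpolation produces blocky region edges, so a smoother method may be preferable when mask accuracy matters more than speed.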

[0034] S4.2, Design the mapping rules for converting the motion mask into a noise reduction intensity map, and calculate the noise reduction intensity map of the current fused frame based on the current motion mask;

[0035] Theoretically, a motion mask can be converted into a noise reduction intensity map through a simple linear transformation. However, considering that the relationship between motion confidence and noise intensity in the motion mask may be non-linear, motion state judgment is introduced in the conversion process: based on the motion mask, it is determined whether the pixel value is less than the motion threshold; if so, the noise reduction intensity is calculated according to the mapping rules of the non-motion area; if not, the noise reduction intensity is calculated according to the mapping rules of the motion area; finally, the noise reduction intensity map is obtained.

[0036] In step S4.1, the interpolation method includes nearest neighbor interpolation, bilinear interpolation, and bicubic interpolation.

[0037] In step S4.2, the motion state judgment introduced during the conversion process further includes:

[0038] When the motion confidence α of any pixel in the motion mask is less than the preset motion threshold THR, the pixel is determined to be in a non-motion state and converted according to the mapping rules of the non-motion region; when the motion confidence α of any pixel in the motion mask is greater than or equal to the preset motion threshold THR, the pixel is determined to be in a motion state and converted according to the mapping rules of the motion region. The mapping rules for converting the motion confidence of the non-motion region and the motion region into the noise reduction intensity are shown in formula (2):

[0039]

[0040] In formula (2), S0 represents the noise reduction intensity in the static region, K1 controls the noise reduction intensity in the non-motion region (such as the motion blur region), and K2 controls the noise reduction intensity in the motion region. For all pixels in the current fused frame image, they are converted into noise reduction intensity according to the mapping rule to obtain a noise reduction intensity map S with the same image resolution as the current fused frame;

[0041] The four parameters—static area noise reduction intensity S0, non-motion area noise reduction intensity K1, motion area noise reduction intensity K2, and motion threshold THR—are all adjustable and can be freely adjusted according to user preferences in different scenarios to obtain the best video quality.
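The expression of formula (2) is not reproduced in this text, so the following sketch uses one plausible piecewise-linear form purely as an illustration of how S0, K1, K2, and THR could interact. It is an assumption, not the mapping rule of the application: below THR the intensity grows from S0 with slope K1, at or above THR it continues with slope K2, and the result is clipped to [0, 1]. The default parameter values are also invented.

```python
def strength_from_confidence(alpha, s0=0.2, k1=0.5, k2=1.0, thr=0.4):
    """Hypothetical mapping from motion confidence alpha to noise reduction intensity."""
    if alpha < thr:
        s = s0 + k1 * alpha                        # non-motion branch (static / blur)
    else:
        s = s0 + k1 * thr + k2 * (alpha - thr)     # motion branch, continuous at thr
    return min(max(s, 0.0), 1.0)                   # keep the intensity inside [0, 1]

def strength_map(mask, **params):
    """Apply the per-pixel mapping to a whole 2-D motion mask."""
    return [[strength_from_confidence(a, **params) for a in row] for row in mask]

mask = [[0.0, 0.2],
        [0.6, 1.0]]
s_map = strength_map(mask)
```

With these assumed defaults a fully static pixel receives an intensity of 0.2 and a fully moving pixel receives 1.0, which falls in the 0.8 to 1.0 range the description suggests for pixels with motion confidence 1.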

[0042] Step S5 further includes:

[0043] Based on the current frame fused image $I_{fused}$, the noise map $N$ output by the spatial domain noise reduction model, and the noise reduction intensity map $S$ converted from the current motion mask, the spatial noise reduction result $I_{denoised}$ of the current fused frame is calculated as shown in formula (3).

[0044] $$I_{denoised} = I_{fused} - S \odot N \qquad (3)$$

[0045] In formula (3), the symbol ⊙ represents element-wise multiplication. The element-wise product of the noise reduction intensity map and the noise map is subtracted from the current fused frame, and the result is output as the current frame image after temporal and spatial denoising.
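The per-pixel computation described for formula (3) can be sketched as follows, assuming the fused frame, noise map, and intensity map are nested lists of identical shape. The function name `spatial_denoise` and the sample values are illustrative only.

```python
def spatial_denoise(fused, noise, strength):
    """Subtract the element-wise product strength * noise from the fused frame."""
    if not (len(fused) == len(noise) == len(strength)):
        raise ValueError("fused, noise and strength must have the same height")
    return [
        [f - s * n for f, s, n in zip(frow, srow, nrow)]
        for frow, srow, nrow in zip(fused, strength, noise)
    ]

# Left pixel: static area with a low intensity; right pixel: motion area with full intensity.
fused = [[100.0, 110.0]]
noise = [[4.0, 10.0]]
strength = [[0.25, 1.0]]
result = spatial_denoise(fused, noise, strength)
```

When every entry of the intensity map has the same value, this reduces to applying a single global noise reduction intensity to the whole frame.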

[0046] Therefore, this application has the following advantages:

[0047] 1. The provided non-end-to-end spatial noise reduction model directly outputs the noise map, allowing for free adjustment of the noise reduction intensity;

[0048] 2. A method for regional adjustment of noise reduction intensity is provided, which converts the motion mask output by the temporal denoising module into a noise reduction intensity map, thereby controlling the noise reduction intensity of static areas, motion blur areas, and motion areas in the video. It has high adjustability and can meet the different needs of users for video clarity and the recognizability of moving targets.

Attached Figure Description

[0049] The accompanying drawings, which are provided to further illustrate the invention and form part of this application, are not intended to limit the scope of the invention.

[0050] Figure 1 is a flowchart illustrating the adjustable video denoising method based on deep learning proposed in this application.

[0051] Figure 2 is a schematic diagram of the calculation process for converting a motion mask into a denoising intensity map, as provided in the embodiments of this application.

[0052] Figure 3 is a schematic diagram of the non-end-to-end spatial domain noise reduction model architecture in the embodiments of this application.

Detailed Implementation

[0053] To better understand the technical content and advantages of the present invention, the present invention will now be described in further detail with reference to the accompanying drawings.

[0054] This invention provides an adjustable video denoising method based on deep learning. The method extracts the current frame image from the video stream to be processed and obtains the preceding fused frame image and the preceding motion mask from a preset buffer. The current frame image, the preceding fused frame image, and the preceding motion mask are input into a temporal denoising model to calculate the current motion mask and the current fused frame image. The current motion mask contains the motion confidence of all pixels in the current fused frame image. The current frame fused image is input into a spatial denoising model to calculate a noise map containing noise information. Combined with the denoising intensity map transformed by the current motion mask according to a preset mapping rule, the current fused frame image is denoised to obtain the final denoised current frame image.

[0055] As shown in Figure 1, the overall process of the method includes the following steps:

[0056] Step S1. Extract the current frame image from the video stream to be processed, and obtain the preceding fused frame image and preceding motion mask from the preset buffer. It can be understood that the video stream to be processed refers to the RAW images continuously output by the image sensor within a continuous time period; the current frame image is the RAW image output at the current moment; the preset buffer is memory space reserved specifically for storing the preceding fused frame image and preceding motion mask; the preceding fused frame and preceding motion mask refer to the output of the temporal denoising model at the previous moment; and the preceding motion mask reflects the motion confidence of all pixels in the preceding fused frame image.

[0057] Step S2. Input the current frame image, the previous fused frame image, and the previous motion mask into the temporal denoising model, and output the current frame fused image and the current motion mask. In this embodiment, the type of the temporal denoising model is not limited, and includes, but is not limited to, different forms of network models such as convolutional neural networks and recurrent neural networks, which have implicit motion detection and can output the motion mask simultaneously with the fused image. The current motion mask contains the motion confidence of all pixels in the current fused frame image. In this embodiment, the value range of the motion mask is [0,1], where 0 indicates that the pixel is completely stationary, 1 indicates that the pixel is in motion, and values between 0 and 1 indicate that the pixel is in different degrees of motion.

[0058] Step S3. Input the current frame fused image into the spatial domain denoising model and output the corresponding noise map. In this embodiment, a non-end-to-end spatial domain denoising model is provided, which differs from an end-to-end network that directly outputs the denoised image from the input fused frame image. The provided spatial domain denoising model outputs the noise map of the fused frame image input to the model, so that the denoising intensity can be freely adjusted. The noise map refers to the noise information in the fused frame image calculated by the spatial domain denoising model.

Step S3.1 Build a non-end-to-end spatial domain denoising network. The non-end-to-end spatial domain denoising model is a convolutional neural network with an encoder-decoder architecture, mainly composed of an input layer, encoder, decoder, cross-layer connections, and output layer. The model architecture is shown in Figure 3. The encoder consists of consecutive convolutional layers, activation layers, and downsampling layers, with the downsampling layers being convolutional layers with a stride of 2. The decoder consists of consecutive convolutional layers, activation layers, and upsampling layers, with the upsampling layers being transposed convolutions with a stride of 2. The input layer consists of convolutional layers and activation layers, and the output layer is a single convolutional layer. The current fused frame is input into this model, where the encoder progressively extracts feature information at different scales. This extracted feature information is then passed to the decoder through cross-layer connections. The decoder then progressively fits the noise distribution in the current fused frame and distinguishes between image details and noise, ultimately outputting a noise map containing noise information. This ability to distinguish is data-driven: the neural network denoising model acquires it by learning from large-scale training data.

[0059] Step S3.2 Noise reduction method of the non-end-to-end spatial domain noise reduction network. Assume the input fused frame image is $I_{fused}$, the noise map output by the spatial domain noise reduction model is $N$, and the noise reduction intensity is coeff; the noise-reduced image $I_{denoised}$ is then calculated as shown in formula (1).

[0060] $$I_{denoised} = I_{fused} - coeff \cdot N \qquad (1)$$

[0061] In this embodiment, the noise reduction intensity coeff is set to a value between 0 and 1. A value of 0 indicates that no noise reduction is performed on the fused frame image, and a value of 1 indicates that the noise reduction is performed on the fused frame image at the maximum intensity. By adjusting the noise reduction intensity coeff, the noise and sharpness of the fused frame image can be controlled.

[0062] Step S4. Convert the current motion mask into a noise reduction intensity map according to a preset mapping rule. Since the spatial position of moving targets in the video stream is constantly changing, the current frame image can reflect the true position of the moving target at the current moment. When fusing the current frame image and the preceding fused frame image, to reduce motion blur, the motion region of the current fused frame mainly references the current frame image; simultaneously, to reduce the noise intensity and jitter in the motion region, it is necessary to reference a portion of the preceding fused frame image, inevitably introducing motion blur during the cyclic noise reduction process. Therefore, the current fused frame image output by the temporal denoising model can be divided into three regions: motion region, blur region, and static region. The noise intensity is arranged in ascending order as static region, blur region, and motion region. Due to the uneven noise level in the image, denoising with a uniform spatial denoising intensity will result in an inconsistent image style.

[0063] To address the issue of inconsistent noise levels across different regions of a video, leading to inconsistencies in the denoised image, this application further provides a spatial denoising method with zoned adjustment of denoising intensity. Considering that the current motion mask output by the temporal denoising model contains the motion confidence of all pixels in the current fused frame image, and that the current fused frame has characteristics of high noise intensity in moving areas and low noise intensity in static areas, the motion mask can be converted into a denoising intensity map corresponding to the current fused frame. The denoising intensity of each pixel in the current fused frame is determined by its motion state. For example, if the pixel at coordinates (x, y) in the current fused frame has a motion confidence of 1 in the corresponding motion mask, it can be considered to have high noise intensity and requires a larger denoising coefficient, typically between 0.8 and 1.0.

[0064] Step S4.1 Align the image resolutions of the current motion mask and the current fused frame. This embodiment does not limit the image resolution of the motion mask output by the temporal denoising model, including the same resolution as the current fused frame image or a smaller resolution. Since the motion mask needs to be stored in a preset cache, storing it at a smaller resolution saves memory space. When the image resolution of the current motion mask is smaller than that of the current fused frame, the current motion mask needs to be interpolated to the same image resolution as the current fused frame. This embodiment does not limit the interpolation method, including but not limited to nearest neighbor interpolation, bilinear interpolation, and bicubic interpolation, which can be flexibly selected according to the requirements of motion mask accuracy and computational efficiency.

[0065] Step S4.2 Design the mapping rules for converting the motion mask into a denoising intensity map, and calculate the denoising intensity map of the current fused frame based on the current motion mask. Theoretically, the motion mask can be converted into a denoising intensity map through a simple linear transformation. However, considering that the relationship between motion confidence and noise intensity in the motion mask may be non-linear, motion state judgment is introduced during the conversion process. The calculation process is shown in Figure 2:

[0066] Based on the motion mask, determine whether the pixel value is less than the motion threshold; if so, calculate the denoising intensity according to the mapping rules of the non-motion area; if not, calculate the denoising intensity according to the mapping rules of the motion area; finally, obtain the denoising intensity map.

[0067] Specifically, when the motion confidence α of any pixel in the motion mask is less than the preset motion threshold THR, the pixel is determined to be in a non-motion state and converted according to the mapping rules of the non-motion region; when the motion confidence α of any pixel in the motion mask is greater than or equal to the preset motion threshold THR, the pixel is determined to be in a motion state and converted according to the mapping rules of the motion region. The mapping rules for converting the motion confidence of the non-motion region and the motion region into the noise reduction intensity are shown in formula (2).

[0068]

[0069] In formula (2), S0 represents the noise reduction intensity in the static region, K1 controls the noise reduction intensity in the non-motion region (such as the motion blur region), and K2 controls the noise reduction intensity in the motion region. For all pixels in the current fused frame image, they are converted into noise reduction intensity according to the mapping rule to obtain a noise reduction intensity map S with the same image resolution as the current fused frame.

[0070] The four parameters—static area noise reduction intensity S0, non-motion area noise reduction intensity K1, motion area noise reduction intensity K2, and motion threshold THR—are all adjustable and can be freely adjusted according to user preferences in different scenarios to obtain the best video quality.

[0071] Step S5. Based on the current frame fused image $I_{fused}$, the noise map $N$ output by the spatial domain noise reduction model, and the noise reduction intensity map $S$ converted from the current motion mask, calculate the spatial noise reduction result $I_{denoised}$ of the current fused frame as shown in formula (3).

[0072] $$I_{denoised} = I_{fused} - S \odot N \qquad (3)$$

[0073] In formula (3), the symbol ⊙ represents element-wise multiplication. The element-wise product of the noise reduction intensity map and the noise map is subtracted from the current fused frame, and the result is output as the current frame image after temporal and spatial denoising.

[0074] The above description is merely a preferred embodiment of the present invention and is not intended to limit the present invention. For those skilled in the art, various modifications and variations can be made to the embodiments of the present invention. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of the present invention should be included within the protection scope of the present invention.

Claims

1. A deep learning-based adjustable video noise reduction method, characterized in that the method includes the following steps: S1. Extract the current frame image from the video stream to be processed, and obtain the previous fused frame image and the previous motion mask from a preset buffer; the video stream to be processed refers to the RAW images continuously output by the image sensor over a continuous time period, the current frame image is the RAW image output at the current moment, the preset buffer is the memory space reserved for storing the previous fused frame image and the previous motion mask, the previous fused frame image and the previous motion mask are the outputs of the temporal denoising model at the previous moment, and the previous motion mask reflects the motion confidence of all pixels in the previous fused frame image; S2. Input the current frame image, the previous fused frame image, and the previous motion mask into the temporal denoising model, and output the current frame fused image and the current motion mask; the current motion mask contains the motion confidence of all pixels in the current frame fused image; S3. Input the current frame fused image into the spatial domain denoising model and output the corresponding noise map; the spatial domain denoising model is a non-end-to-end model: unlike an end-to-end network, which directly outputs the denoised image from the input fused frame image, it outputs the noise map of the fused frame image input to it, so that the denoising intensity can be freely adjusted; the noise map refers to the noise information in the fused frame image as calculated by the spatial domain denoising model; S4. Convert the current motion mask into a noise reduction intensity map according to a preset mapping rule; to solve the problem that noise levels differ between regions of the video, which leaves the image inconsistent after noise reduction, a spatial domain noise reduction method with region-wise adjustment of the noise reduction intensity is adopted; because the current motion mask output by the temporal denoising model contains the motion confidence of all pixels in the current frame fused image, and the current fused frame has high noise intensity in motion areas and low noise intensity in static areas, the motion mask can be converted into a noise reduction intensity map corresponding to the current fused frame, in which the noise reduction intensity of each pixel is determined by its motion state; S5. Calculate the spatial denoising result of the current fused frame based on the current frame fused image, the noise map output by the spatial domain denoising model, and the noise reduction intensity map converted from the current motion mask.
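The data flow of steps S1 to S5 can be sketched as a recurrent loop over frames. This is only an illustrative outline: the three callables `temporal_model`, `spatial_model`, and `mask_to_intensity` are hypothetical stand-ins for the networks and mapping rule named in claim 1, and the way the buffer is initialized on the first frame is an assumption, since the claim does not specify it.

```python
import numpy as np

def denoise_stream(frames, temporal_model, spatial_model, mask_to_intensity):
    """Yield one denoised frame per input frame, following steps S1 to S5.

    Hypothetical callables (not defined by the patent text):
      temporal_model(cur, prev_fused, prev_mask) -> (fused, mask)
      spatial_model(fused) -> noise_map
      mask_to_intensity(mask) -> per-pixel strength in [0, 1]
    """
    prev_fused, prev_mask = None, None  # plays the role of the preset buffer
    for cur in frames:
        cur = np.asarray(cur, dtype=np.float64)            # S1: current frame
        if prev_fused is None:
            # Assumption: seed the buffer with the first frame and an
            # all-static mask; the claim does not describe initialization.
            prev_fused, prev_mask = cur, np.zeros_like(cur)
        fused, mask = temporal_model(cur, prev_fused, prev_mask)  # S2
        noise_map = spatial_model(fused)                          # S3
        strength = mask_to_intensity(mask)                        # S4
        result = fused - strength * noise_map                     # S5
        prev_fused, prev_mask = fused, mask                # refresh the buffer
        yield result
```

Note that the buffer stores the temporally fused frame and its mask, not the spatially denoised result, which matches the claim's statement that the buffered items are outputs of the temporal denoising model.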

2. The adjustable video denoising method based on deep learning according to claim 1, characterized in that, in step S2, the type of the temporal denoising model is not limited and includes, without limitation, network models of different forms, such as convolutional neural networks and recurrent neural networks, which have implicit motion detection and can output a motion mask together with the fused image; the value range of the motion mask is [0,1], where 0 indicates that the pixel is completely static, 1 indicates that the pixel is in motion, and values between 0 and 1 indicate different degrees of motion.

3. The adjustable video noise reduction method based on deep learning according to claim 1, characterized in that step S3 further includes: S3.1 Constructing a non-end-to-end spatial domain denoising network; the non-end-to-end spatial domain denoising model is a convolutional neural network with an encoder-decoder architecture, mainly composed of an input layer, an encoder, a decoder, cross-layer connections, and an output layer; the encoder consists of consecutive convolutional layers, activation layers, and downsampling layers, the downsampling layers being convolutional layers with a stride of 2; the decoder consists of consecutive convolutional layers, activation layers, and upsampling layers, the upsampling layers being transposed convolutions with a stride of 2; the input layer consists of convolutional layers and activation layers, and the output layer is a single convolutional layer; the current fused frame is input into this model, the encoder progressively extracts feature information at different scales, the feature information extracted by the encoder is passed to the decoder through the cross-layer connections, and the decoder then progressively fits the noise distribution in the current fused frame, distinguishes image details from noise, and finally outputs a noise map containing the noise information; S3.2 Denoising with the non-end-to-end spatial domain denoising network; let the input fused frame image be $I_{fused}$, the noise map output by the spatial domain denoising model be $I_{noise}$, and the noise reduction intensity be coeff; the denoised image $I_{denoised}$ is then calculated as shown in formula (1):

$$I_{denoised} = I_{fused} - \mathrm{coeff} \cdot I_{noise} \quad (1)$$

The noise reduction intensity coeff is set to a value between 0 and 1: a value of 0 means that no noise reduction is applied to the fused frame image, and a value of 1 means that noise reduction is applied to the fused frame image at maximum intensity. By adjusting the noise reduction intensity coeff, the noise and sharpness of the fused frame image can be controlled.
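A minimal sketch of the adjustable subtraction in S3.2, assuming the scaled noise map is subtracted from the fused frame, which is what the stated endpoints (0 leaves the frame unchanged, 1 removes the full estimated noise) imply. The function name `apply_noise_map` and the sample values are illustrative only.

```python
import numpy as np

def apply_noise_map(i_fused, noise_map, coeff):
    """Subtract a scaled noise estimate from the fused frame.

    coeff = 0 returns the fused frame unchanged; coeff = 1 subtracts the
    whole noise map. Values in between trade residual noise for sharpness.
    """
    if not 0.0 <= coeff <= 1.0:
        raise ValueError("coeff must lie in [0, 1]")
    i_fused = np.asarray(i_fused, dtype=np.float64)
    noise_map = np.asarray(noise_map, dtype=np.float64)
    return i_fused - coeff * noise_map

# Illustrative values: two pixels with opposite-sign noise estimates.
fused = np.array([[10.0, 12.0]])
noise = np.array([[2.0, -2.0]])
half_strength = apply_noise_map(fused, noise, 0.5)
```

Because the network predicts the noise rather than the clean image, the strength can be changed after inference without retraining or rerunning the model.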

4. The adjustable video noise reduction method based on deep learning according to claim 1, characterized in that, in step S4, since the spatial position of a moving target in the video stream changes constantly, the current frame image reflects the true position of the moving target at the current moment. When the current frame image and the previous fused frame image are fused, the motion region of the current fused frame mainly draws on the current frame image in order to reduce motion blur of the moving target. At the same time, in order to reduce the noise intensity and jitter of the motion region, it must also draw partly on the previous fused frame image. In the process of cyclic denoising, motion blur is therefore inevitably introduced. The current fused frame image output by the temporal denoising model can thus be divided into three regions: motion region, blur region, and static region. In ascending order of noise intensity, these are the static region, the blur region, and the motion region. Since the noise level in the image is uneven, denoising with a uniform spatial domain denoising intensity would result in an inconsistent image style.

5. The adjustable video noise reduction method based on deep learning according to claim 1, characterized in that, in step S4, the noise reduction intensity of each pixel in the current fused frame is determined by its motion state. If the motion confidence of the pixel at coordinates (x,y) in the current fused frame is 1 in the corresponding motion mask, the noise intensity of that pixel is considered high and a large noise reduction coefficient needs to be assigned. The large noise reduction coefficient is 0.8 to 1.0.

6. The adjustable video denoising method based on deep learning according to claim 1, characterized in that step S4 further includes: S4.1 Align the current motion mask with the image resolution of the current fused frame; the image resolution of the motion mask output by the temporal denoising model is not limited and may be the same as that of the current fused frame image or smaller. Since the motion mask needs to be stored in the preset buffer, storing it at a smaller resolution saves memory space. When the image resolution of the current motion mask is lower than that of the current fused frame, the current motion mask should be interpolated to the same image resolution as the current fused frame. The interpolation method is not restricted; it can be selected according to the requirements on motion mask accuracy and computational efficiency. S4.2 Design the mapping rule for converting the motion mask into a noise reduction intensity map, and calculate the noise reduction intensity map of the current fused frame based on the current motion mask; in theory, a motion mask can be converted into a noise reduction intensity map through a simple linear transformation. However, considering that the relationship between the motion confidence in the motion mask and the noise intensity may be non-linear, a motion state judgment is introduced into the conversion: for each pixel of the motion mask, it is determined whether the pixel value is less than the motion threshold; if so, the noise reduction intensity is calculated according to the mapping rule of the non-motion area; if not, the noise reduction intensity is calculated according to the mapping rule of the motion area; the noise reduction intensity map is thereby obtained.

7. The adjustable video denoising method based on deep learning according to claim 5, characterized in that, in step S4.1, the interpolation method includes nearest neighbor interpolation, bilinear interpolation, and bicubic interpolation.
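As an illustration of the alignment in step S4.1, the following sketch upsamples a low-resolution motion mask with nearest neighbor interpolation, the cheapest of the listed options. The function name `align_mask` is hypothetical, and a 2-D single-channel mask is assumed; bilinear or bicubic interpolation would give smoother region boundaries at higher cost.

```python
import numpy as np

def align_mask(mask, target_shape):
    """Resize a 2-D motion mask to target_shape (height, width)
    using nearest neighbor interpolation via integer index mapping."""
    mask = np.asarray(mask, dtype=np.float64)
    src_h, src_w = mask.shape
    dst_h, dst_w = target_shape
    # For every output row/column, pick the source row/column it falls into.
    rows = (np.arange(dst_h) * src_h) // dst_h
    cols = (np.arange(dst_w) * src_w) // dst_w
    return mask[np.ix_(rows, cols)]

# A 2x2 mask stored at reduced resolution, expanded to a 4x4 frame size.
small_mask = np.array([[0.0, 1.0],
                       [0.5, 0.2]])
full_mask = align_mask(small_mask, (4, 4))
```

Each source value is replicated into a 2x2 block here, so the values stay inside [0,1] without any clipping.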

8. The adjustable video denoising method based on deep learning according to claim 5, characterized in that, in step S4.2, the motion state judgment introduced during the conversion further includes: when the motion confidence α of a pixel in the motion mask is less than the preset motion threshold THR, the pixel is determined to be in a non-motion state and is converted according to the mapping rule of the non-motion area; when the motion confidence α of a pixel in the motion mask is greater than or equal to the preset motion threshold THR, the pixel is determined to be in a motion state and is converted according to the mapping rule of the motion area; the mapping rules for converting the motion confidence of the non-motion area and of the motion area into the noise reduction intensity are shown in formula (2). In formula (2), S0 represents the noise reduction intensity in the static region, K1 is used to control the noise reduction intensity in the non-motion region, and K2 is used to control the noise reduction intensity in the motion region. All pixels in the current fused frame image are converted into noise reduction intensities according to the mapping rule, yielding a noise reduction intensity map S with the same image resolution as the current fused frame. The four parameters (static area noise reduction intensity S0, non-motion area control parameter K1, motion area control parameter K2, and motion threshold THR) are all adjustable and can be freely tuned according to user preferences in different scenarios to obtain the best video quality.
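The thresholded conversion can be sketched as follows. Formula (2) is not reproduced in this text, so the piecewise-linear rule below is purely an assumption chosen for illustration: it starts at S0 for a fully static pixel, rises with slope K1 below THR and with slope K2 at or above THR, and is clipped to [0,1]. The function name and the default parameter values are likewise hypothetical.

```python
import numpy as np

def mask_to_intensity(mask, s0=0.3, k1=0.5, k2=1.0, thr=0.4):
    """Map motion confidence in [0, 1] to a noise reduction intensity map.

    ASSUMED rule (not the patent's formula (2)):
      alpha <  thr : s0 + k1 * alpha
      alpha >= thr : s0 + k1 * thr + k2 * (alpha - thr)
    The two branches meet at alpha == thr, so the map has no jump there.
    """
    alpha = np.asarray(mask, dtype=np.float64)
    non_motion = s0 + k1 * alpha
    motion = s0 + k1 * thr + k2 * (alpha - thr)
    strength = np.where(alpha < thr, non_motion, motion)
    return np.clip(strength, 0.0, 1.0)

# Static, slightly moving, threshold, and fully moving pixels.
strength_map = mask_to_intensity(np.array([0.0, 0.2, 0.4, 1.0]))
```

With these illustrative defaults a fully moving pixel receives the maximum intensity of 1.0, which falls in the 0.8 to 1.0 range given for pixels with motion confidence 1.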

9. The adjustable video denoising method based on deep learning according to claim 1, characterized in that step S5 further includes: based on the current frame fused image $I_{fused}$, the noise map $I_{noise}$ output by the spatial domain noise reduction model, and the noise reduction intensity map S converted from the current motion mask, the spatial noise reduction result of the current fused frame is calculated as shown in formula (3):

$$I_{denoised} = I_{fused} - S \odot I_{noise} \quad (3)$$

In formula (3), the symbol ⊙ represents element-wise multiplication. The element-wise product of the noise reduction intensity map and the noise map is subtracted from the current fused frame, and the result is output as the current frame image after temporal and spatial denoising.
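The per-pixel combination in step S5 can be illustrated with a small worked example. The sketch assumes single-channel arrays of identical shape and uses NumPy's `*` operator for the element-wise product; the function name `spatial_denoise` and the sample values are illustrative only.

```python
import numpy as np

def spatial_denoise(i_fused, noise_map, strength_map):
    """Subtract the element-wise product of the intensity map and the
    noise map from the fused frame."""
    i_fused = np.asarray(i_fused, dtype=np.float64)
    noise_map = np.asarray(noise_map, dtype=np.float64)
    strength_map = np.asarray(strength_map, dtype=np.float64)
    if not (i_fused.shape == noise_map.shape == strength_map.shape):
        raise ValueError("all three inputs must share one shape")
    return i_fused - strength_map * noise_map

# Left pixel: static region, weak denoising. Right pixel: motion region,
# full denoising. Both carry the same estimated noise of +2.
frame = np.array([[10.0, 10.0]])
est_noise = np.array([[2.0, 2.0]])
per_pixel_strength = np.array([[0.25, 1.0]])
denoised = spatial_denoise(frame, est_noise, per_pixel_strength)
```

The static pixel keeps most of its original value (9.5) while the moving pixel has the full noise estimate removed (8.0), which is the region-dependent behavior the intensity map is meant to provide.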