Electronic apparatus and controlling method thereof
The neural network training method addresses the challenge of balancing accuracy and perceived quality in image processing by using a statistics estimator and adaptive blur correction, enhancing detailing and maintaining naturalness in HDR images.
Patent Information
- Authority / Receiving Office: WO
- Patent Type: Applications
- Current Assignee / Owner: SAMSUNG ELECTRONICS CO LTD
- Filing Date: 2025-12-16
- Publication Date: 2026-06-25
AI Technical Summary
Existing image and video processing methods struggle to balance the accuracy of reproducing the original scene with perceived image quality, often resulting in noise reduction artifacts, uneven noise levels, and loss of naturalness, especially when dealing with high dynamic range (HDR) images.
A neural network training method that uses a statistics estimator unit and an adaptive blur correction unit to calculate an unbiased mathematical expectation estimation and to account for different image degradations, ensuring accurate reproduction of the original scene while maintaining a noise level comfortable for human perception.
The method enhances image and video quality by providing high detailing without noise reduction artifacts, ensuring naturalness for human perception, and achieving consistent noise levels across different image areas.
Description
Title of Invention: ELECTRONIC APPARATUS AND CONTROLLING METHOD THEREOF
Technical Field
[0001] The present invention relates to the field of image and video processing, in particular to methods for temporal noise suppression in images and video and to methods for obtaining images and video with high dynamic range (HDR).
Background Art
[0002] Modern neural network methods for enhancing image / video quality (in particular, for enhancing the quality of noisy and / or blurry image / video details) and for generating HDR content, i.e. images / video with a high dynamic range of brightness that provide high-quality details in both dark and bright areas, typically follow one of two main approaches: extracting a useful (meaningful) signal from a mixture of useful signal and noise, for example by combining several frames into one, or generating ("finishing") details that take the content of the original image into account using generative models.
[0003] Modern methods for processing images or video are expected to meet the following requirements:
[0004] - ensuring high detailing of an image or a video frame;
[0005] - noise reduction in an image or a video frame without occurrence of noise reduction artifacts;
[0006] - blur removal;
[0007] - generation of content with high dynamic range (HDR).
[0008] A high dynamic range image is understood as an image whose dynamic range (i.e. the ratio between the luminance of the brightest displayable pixels and the luminance of the pixels with the minimum resolvable non-zero luminance) exceeds the dynamic range of the device with which the image was captured. This is typically achieved by capturing a series of images sequentially at different exposures, so that different parts of the dynamic range are represented in different images. To obtain an output high dynamic range image from this input set of lower dynamic range images, processing is performed using special algorithms known from the prior art.
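As a worked illustration of the definition above, the ratio can be computed directly from pixel luminances. The sketch is a simplification under stated assumptions: it treats the smallest non-zero pixel value as the "minimum resolvable" luminance and the largest value as the brightest displayable one, and it also reports the ratio in photographic stops (log2), a convention the document does not use. The function name `dynamic_range` and the sample values are hypothetical.

```python
import math
from typing import Sequence, Tuple


def dynamic_range(luminance: Sequence[float]) -> Tuple[float, float]:
    """Return (ratio, stops) for a sequence of pixel luminances.

    ratio = brightest luminance / smallest non-zero luminance (assumed to be
    the minimum resolvable level); stops = log2(ratio).
    """
    nonzero = [v for v in luminance if v > 0]
    if not nonzero:
        raise ValueError("image has no non-zero luminance")
    ratio = max(nonzero) / min(nonzero)
    return ratio, math.log2(ratio)


# Hypothetical 12-bit sensor values: darkest non-zero level 1, brightest 4095.
ratio, stops = dynamic_range([0, 1, 37, 900, 4095])
print(ratio, round(stops, 2))
```

A scene whose ratio exceeds what a single capture of the device can record is what the exposure series described above is meant to cover.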
[0009] Known image processing methods that use, e.g., convolutional neural networks (CNNs) trained to recover a useful signal from a blurry and noisy signal by pixel-by-pixel regression are fast both to train and to run. However, such networks trained with a regression loss function (i.e. trained to minimize the difference between an output image and a reference image) often fail to achieve a sufficient level of detail. The obtained image also often loses its naturalness for human perception; in particular, faces in it have a "plastic" or "cartoonish" appearance.
[0010] Known image processing methods using diffusion neural networks or generative adversarial networks (GANs) compensate for the deficiencies of the simple methods described above by generating ("inventing") new image details that were not present in the original scene but are consistent with it. Such solutions have a complex architecture and require long training on large image sets. In contrast to models trained with a regression loss function, such methods produce images of subjectively higher quality; however, they are not designed to reproduce the details and content of the original image accurately. This approach may be unacceptable for smartphone and digital camera users who expect an output image that accurately reproduces the original scene.
[0011] The trade-off between the accuracy of reproducing the original scene and perceived image quality has been described in the literature and is an important issue in image processing.
[0012] Document US 20190096046 A1 (publication date 28.03.2019) is known from the prior art. It discloses a device, system and method for generating HDR images from an initial set of low dynamic range (LDR) images using convolutional neural networks (CNNs). The images are aligned using optical flow: the low- and high-exposure LDR images are aligned with a medium-exposure image. The drawback of the known solution is that if the original images contain highlights or motion with partial overlaps (when one object partially overlaps another), the resulting HDR image has poor detailing.
[0013] The document Q. Yan et al., "Towards High-quality HDR Deghosting with Conditional Diffusion Models" (Northwestern Polytechnical University; DOI: 10.1109/TCSVT.2023.3326293; 20.10.2023) is known from the prior art. It discloses a method for generating HDR images from a set of low dynamic range (LDR) images using a conditional diffusion model. For conditioning, domain features (here, a domain is formed by features of images with different exposures) extracted from the LDR images in intermediate layers of a neural network are used. To avoid ghosting artifacts when the HDR images are generated, these domain features are aligned (combined) using an affine transformation. A sliding-window noise estimator unit is also used to sample (select) smooth global noise patch by patch, to avoid patching artifacts in the final image. The neural network is trained using a combination of loss functions, including the L2 loss function. The known method provides good image detailing even if the LDR images contain object motion. However, since the method assumes that only global affine motion is present in the scene, the quality of the output directly depends on the validity of this assumption: if there is local motion in the scene (for example, moving people or cars), the output quality decreases. Further drawbacks of the known solution are the long training time of the neural network and the possibility that details absent from the original image appear in the resulting image.
[0014] The document Z. Luo et al., "EBSR: Feature Enhanced Burst Super-Resolution with Deformable Alignment" (Megvii Technology / University of Electronic Science and Technology of China; DOI: 10.1109/CVPRW53098.2021.00058; 2021.06) is known from the prior art. It discloses merging a RAW image burst into one output RGB image with enhanced detailing (increased resolution) using a neural network: features are extracted from the low-resolution images, denoised and enhanced, aligned (combined) using a pyramidal architecture block and deformable convolutions, then merged, upscaled and used to reconstruct a high-resolution image. The neural network is trained using an L1 loss function. The drawbacks of the known approach are the complexity of the model architecture and the unavoidable loss of subjective quality when a regression loss function is used.
[0015] Document US 11107205 B2 (publication date 31.08.2021) is known from the prior art. It discloses a method for reducing image blur by combining multiple frames with different exposures using a spatial-temporal recurrent convolutional neural network. The convolutional neural network is used to generate blending maps associated with the image frames. The blending maps are based on both a measure of motion in the image frames and a measure of how well exposed different regions of the image frames are. A final image of a scene may be generated by blending at least some of the original frames using at least some of the blending maps, and the final image may include details that were lost in at least one original frame due to under-exposure or over-exposure. A drawback of the known approach is the complexity of the model architecture. In addition, the use of blending maps may cause uneven noise reduction across different areas of the image as a side effect: in areas where motion is not estimated accurately or where objects partially overlap, the resulting pixels are obtained by blending fewer input pixels, which leads to higher noise levels in these areas. For the end user, an image with spatially uneven quality (low noise interspersed with small areas of high noise) may be even more unpleasant than an image with a higher but more uniform noise level.
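The uneven-noise argument can be made concrete with a small simulation. Assuming, as a simplification, that every frame carries independent zero-mean Gaussian noise of the same standard deviation, a pixel blended from n frames keeps a residual standard deviation of about sigma / sqrt(n), so regions where only a few frames could be blended stay visibly noisier. The function `residual_noise_std` and its parameters are hypothetical and only illustrate this scaling.

```python
import random
from statistics import fmean, pstdev


def residual_noise_std(n_frames: int, sigma: float = 1.0,
                       trials: int = 20000, seed: int = 0) -> float:
    """Empirical std of one pixel formed by averaging n_frames noisy samples.

    Assumes independent zero-mean Gaussian noise of std sigma per frame,
    for which theory predicts sigma / sqrt(n_frames).
    """
    rng = random.Random(seed)
    blended = [fmean([rng.gauss(0.0, sigma) for _ in range(n_frames)])
               for _ in range(trials)]
    return pstdev(blended)


# Columns: frames blended, simulated residual std, theoretical 1/sqrt(n).
for n in (1, 2, 4, 8):
    print(n, round(residual_noise_std(n), 3), round(n ** -0.5, 3))
```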
[0016] In addition, images that retain some high-frequency component look more acceptable to the human eye, especially when they depict objects that, from everyday experience, are rough, such as stones, grass, earth, or skin. Metal objects, by contrast, are expected not to contain this high-frequency component and to look perfectly smooth.
[0017] To improve human perception of an image, noise should not be suppressed completely, as is done in approaches known from the prior art; instead, the noise level in the image should be kept at a level at which the image looks natural to human visual perception. It is also necessary to preserve image details that conventional approaches known from the prior art suppress together with the noise.
Solution to Problem
In an embodiment, an electronic apparatus comprises memory storing instructions and at least one processor including processing circuitry. The instructions, when executed by the at least one processor individually or collectively, may cause the electronic apparatus to obtain a reference image from a training image dataset, obtain a plurality of degradation images corresponding to the reference image, obtain a plurality of output images by inputting the plurality of degradation images into a neural network model, obtain a statistic value based on the plurality of output images, obtain a loss function value based on the reference image and the statistic value, and update at least one weight of the neural network model based on the loss function value.
The instructions, when executed by the at least one processor individually or collectively, may cause the electronic apparatus to obtain the reference image from among a plurality of reference images included in the training image dataset.
The instructions, when executed by the at least one processor individually or collectively, may cause the electronic apparatus to obtain the plurality of degradation images corresponding to the reference image by at least one of adding noise to the reference image, adding blurring to the reference image, shifting the reference image, rotating the reference image, or inserting a random frame into the reference image.
The instructions, when executed by the at least one processor individually or collectively, may cause the electronic apparatus, based on the reference image being shifted or rotated, to obtain the plurality of degradation images corresponding to the reference image by compensating the shifted reference image or the rotated reference image.
The instructions, when executed by the at least one processor individually or collectively, may cause the electronic apparatus to obtain a first set of a plurality of degradation images corresponding to the reference image, obtain a second set of a plurality of degradation images corresponding to the reference image, obtain a first output image by inputting the first set into the neural network model, obtain a second output image by inputting the second set into the neural network model, and obtain the statistic value based on the first output image and the second output image.
The plurality of degradation images included in the second set may be different from the plurality of degradation images included in the first set.
The statistic value may include an unbiased mathematical expectation estimation of the plurality of output images.
The plurality of degradation images may simulate processes that occur when shooting with a real camera.
The instructions, when executed by the at least one processor individually or collectively, may cause the electronic apparatus to update the at least one weight of the neural network model based on the loss function value by using a backpropagation technique.
The instructions, when executed by the at least one processor individually or collectively, may cause the electronic apparatus to store the neural network model for enhancing the quality of an image in the memory and, based on an input image being received, obtain a result image corresponding to the input image by inputting the input image into the neural network model.
In an embodiment, a controlling method of an electronic apparatus comprises obtaining a reference image from a training image dataset, obtaining a plurality of degradation images corresponding to the reference image, obtaining a plurality of output images by inputting the plurality of degradation images into a neural network model, obtaining a statistic value based on the plurality of output images, obtaining a loss function value based on the reference image and the statistic value, and updating at least one weight of the neural network model based on the loss function value.
Obtaining the reference image may include obtaining the reference image from among a plurality of reference images included in the training image dataset.
Obtaining the plurality of degradation images may include obtaining the plurality of degradation images corresponding to the reference image by at least one of adding noise to the reference image, adding blurring to the reference image, shifting the reference image, rotating the reference image, or inserting a random frame into the reference image.
Obtaining the plurality of degradation images may include, based on the reference image being shifted or rotated, obtaining the plurality of degradation images corresponding to the reference image by compensating the shifted reference image or the rotated reference image.
Obtaining the plurality of degradation images may include obtaining a first set of a plurality of degradation images corresponding to the reference image, and obtaining a second set of a plurality of degradation images corresponding to the reference image. Obtaining the plurality of output images may include obtaining a first output image by inputting the first set into the neural network model, and obtaining a second output image by inputting the second set into the neural network model. Obtaining the statistic value may include obtaining the statistic value based on the first output image and the second output image.
Brief Description of Drawings
[0018] The above and other features and advantages of the present invention are explained in the following description, illustrated by the drawings, which show the following:
[0019] Fig. 1 schematically illustrates a method for training a neural network with a statistics estimator unit.
[0020] Fig. 2 schematically illustrates a method for training a neural network with an adaptive blur correction unit.
[0021] Fig. 3 schematically illustrates a method for training a neural network with a statistics estimator unit and an adaptive blur correction unit, wherein the adaptive blur correction unit is located after the statistics estimator unit.
[0022] Fig. 4 schematically illustrates a method for training a neural network with an adaptive blur correction unit and a statistics estimator unit, wherein the statistics estimator unit is located after the adaptive blur correction unit.
[0023] Fig. 5 schematically illustrates an embodiment of the invention with a unit for weighting outputs of a trainable neural network.
[0024] Fig. 6 schematically illustrates an embodiment of the invention with a unit for weighting outputs of a trainable neural network and with an adaptive blur correction unit.
[0025] Fig. 7 illustrates an embodiment of the controlling method of the electronic apparatus.
[0026] Fig. 8 illustrates an embodiment of the controlling method of the electronic apparatus.
Description of Embodiments
[0027] Devices and methods for training a neural network to enhance image / video quality are proposed. The proposed group of inventions enhances image / video quality: high detailing of an image or a video frame is provided, blurring is eliminated, and noise is suppressed without noise reduction artifacts. At the same time, the resulting image / video frame remains natural for human perception.
[0028] The proposed group of inventions can be applied to neural network architectures intended for image reconstruction, such as convolutional neural networks, transformers, recurrent neural networks, etc. Such neural networks are used, in particular, in image / video processing pipelines (ISP).
[0029] In the proposed group of inventions, when the loss function is calculated during training, the output image is not used directly, as it is in known training approaches. Instead, one of the following, or a combination of them, is used:
- statistics calculated from several output images obtained by applying the trainable neural network to several sets of input images with different degradations;
- several images obtained by applying different blur operators to the output image that the neural network forms from a set of input images with degradations, taken into account with weights.
As a result, the unbiased mathematical expectation estimation of the output image is brought as close as possible to the reference image, while no requirements are imposed on each individual output image.
[0030] Training uses a training set (dataset) of high-quality images. Such training datasets are known from the prior art and can be stored in the memory of the electronic device on which the proposed invention is implemented, or in a data storage, on a remote server, in cloud storage, etc.
[0031] In one embodiment, the proposed method for training a neural network with a statistics estimator unit to enhance image / video quality is performed on a device for training a neural network for enhancing image / video quality. As shown in Fig. 1, the device comprises the following operatively coupled units:
[0032] a degradation modeling unit 1,
[0033] a trainable neural network 2,
[0034] a statistics estimator unit 3,
[0035] a loss function calculation unit 4,
[0036] an optimizer unit 5.
[0037] The proposed method for training a neural network is schematically illustrated in Fig. 1 and comprises the following steps.
[0038] A) A reference image x from a training dataset of images is supplied to an input of the degradation modeling unit 1.
[0039] As shown in Fig. 1, at least two sets of images $X_k$ ($k = 1, \dots, K$, $K \ge 2$, where $k$ is the set index and $K$ is the number of sets) are generated from one reference image $x$ in the degradation modeling unit 1. Each set $X_k$ consists of at least one image $x_{kn}$ ($n = 1, \dots, N$, $N \ge 1$, where $N$ is the number of images in the set) with introduced degradations, corresponding to the selected reference image $x$, and all images in each set have different degradations (quality deteriorations):

$$X_1 = \{x_{11}, x_{12}, \dots, x_{1N}\},\; X_2 = \{x_{21}, x_{22}, \dots, x_{2N}\},\; \dots,\; X_k = \{x_{k1}, x_{k2}, \dots, x_{kN}\},\; \dots,\; X_K = \{x_{K1}, x_{K2}, \dots, x_{KN}\}.$$

In other words, quality deteriorations (degradations) are introduced into the reference image in the degradation modeling unit 1.
[0040] Degraded images that have the same index n in all sets have the same shooting settings (for example, exposure and ISO), and the degradations of each image $x_{kn}$ are modeled taking these shooting parameters into account. The degradation modeling unit 1 implements functions describing image transformations and is intended for modeling the processes that occur when shooting with a real photo / video camera, which, as a rule, generates an image based on several captured frames (for example, in burst or bracketing photography modes). The degradations introduced into the image by the degradation modeling unit at this step may include, for example: adding noise to the reference image; adding blurring to the reference image to simulate camera motion, intrinsic motion of scene objects, or blurring that occurs because some elements of the scene are out of focus; adding linear shifts and rotations to the images in the set of images $X_k$ and then compensating for them; reducing the resolution of the reference image; reducing the resolution of the reference image by Bayer sampling; modeling, on the reference image, artifacts that arise from insufficiently accurate motion estimation and compensation; inserting random frames (i.e. frames different from the reference image) at an arbitrary position in the set of images $X_k$, etc. The degradation parameters are set depending on the problem to be solved: for example, ISO / exposure combinations suitable for day or night shooting determine the noise level to be imposed and the blurring degree to be applied. The noise level for a particular camera model can be measured experimentally.
[0041] The set of functions of the degradation modeling unit 1 is determined by the problem to be solved. In particular, for a noise reduction problem, the degradation may include noise superposition; for a resolution enhancement (super-resolution) problem, resolution reduction and noise superposition; for a color interpolation (demosaicing) problem, Bayer resolution reduction and noise addition; for the problem of obtaining HDR images, all of the above, as well as modeling of smoothing and of artifacts associated with motion estimation. If the trainable neural network is used to process data from a camera sensor, the quantization effect is also modeled within the degradation modeling unit.
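The degradation modeling described in [0040] and [0041] can be sketched as follows. This is a minimal pure-Python illustration, not the document's implementation: images are small nested lists, only noise, a crude horizontal blur and quantization are modeled, and the helper names (`add_noise`, `blur_rows`, `quantize`, `make_degraded_sets`) as well as the per-frame `sigmas` standing in for ISO / exposure settings are assumptions. It does keep one property stated in [0040]: frame n of every set shares the same shooting-derived noise level.

```python
import random
from typing import List

Image = List[List[float]]  # H x W grayscale image, values nominally in [0, 1]


def add_noise(img: Image, sigma: float, rng: random.Random) -> Image:
    # Additive Gaussian noise; a calibrated camera model would make the
    # noise signal-dependent and measure it per sensor.
    return [[p + rng.gauss(0.0, sigma) for p in row] for row in img]


def blur_rows(img: Image) -> Image:
    # 3-tap horizontal box blur with edge clamping: a crude motion-blur stand-in.
    out = []
    for row in img:
        w = len(row)
        out.append([(row[max(i - 1, 0)] + row[i] + row[min(i + 1, w - 1)]) / 3.0
                    for i in range(w)])
    return out


def quantize(img: Image, levels: int = 1023) -> Image:
    # Clips to [0, 1] and simulates sensor quantization (10-bit is arbitrary).
    return [[round(min(max(p, 0.0), 1.0) * levels) / levels for p in row]
            for row in img]


def make_degraded_sets(ref: Image, K: int, sigmas: List[float],
                       blur_prob: float = 0.5, seed: int = 0) -> List[List[Image]]:
    """Return K sets X_1..X_K, each holding N = len(sigmas) degraded copies of ref.

    sigmas[n] is the (hypothetical) noise level implied by the shooting
    settings of frame n; frame n of every set uses the same value.
    """
    rng = random.Random(seed)
    sets = []
    for _ in range(K):
        frames = []
        for sigma in sigmas:
            img = blur_rows(ref) if rng.random() < blur_prob else ref
            frames.append(quantize(add_noise(img, sigma, rng)))
        sets.append(frames)
    return sets


ref = [[0.2, 0.4, 0.6, 0.8]] * 3
sets = make_degraded_sets(ref, K=2, sigmas=[0.02, 0.05, 0.1])
```

Shifts, rotations with compensation, Bayer sampling and random-frame insertion would be further functions applied inside the same loop.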
[0042] B) The obtained sets of images with different degradations are supplied to the input of the trainable neural network 2, which generates one output image $y_k$ for each set of images $X_k$, forming a set of output images $Y = \{y_1, y_2, \dots, y_K\}$ (see Fig. 1). The sets $X_k$ are supplied to the input of the trainable neural network 2 sequentially and independently of each other (see unit 2a in Fig. 1 for details).
[0043] C) The obtained set $Y = \{y_1, y_2, \dots, y_K\}$ of output images is supplied to the statistics estimator unit 3, which calculates a plurality of statistics of the set of output images; one of these statistics is the point-by-point (pixel-by-pixel) unbiased mathematical expectation estimation $\bar{y}$ of the output image.
[0044] A statistic in this case is a measurable numerical function (functional dependence) of the combination of output images Y. The plurality of statistics is denoted as f(Y). The statistics estimator unit 3 comprises a set of known functions for calculating statistics of the set of output images.
[0045] Statistics, in the context of the proposed invention, may be, for example, the unbiased mathematical expectation estimation of an output image, the standard deviation, and other statistics known from the prior art:
[0046] $$f(Y) = \{\bar{y}, \sigma, \dots\} \quad (1)$$
[0047] where $\bar{y}$ is the unbiased mathematical expectation estimation of the output image, which has the same dimension H × W × C as each of the output images $y_k$ (H is the height of the image in points (pixels), W is the width of the image in points (pixels), C is the number of color channels of the image),
[0048] $\sigma$ is the point-by-point (pixel-by-pixel) standard deviation, which has the same dimension as each of the output images $y_k$,
[0049] "..." denotes other statistics of the set of output images.
[0050] The unbiased mathematical expectation estimation $\bar{y}$ can be calculated by the formula:
[0051] $$\bar{y} = \frac{1}{K}\sum_{k=1}^{K} y_k \quad (2)$$
[0052] Thus, the unbiased mathematical expectation estimation $\bar{y}$ acts as a "mean" output image.
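The statistics estimator unit 3 can be illustrated with a small pure-Python function that computes, pixel by pixel, the sample mean over the K output images (the unbiased estimate of the expectation, (1/K) times the sum of the y_k) and a standard deviation. Images are flattened to lists for brevity; the function name and the choice of the sample (K - 1 denominator) standard deviation are assumptions, since the document does not specify the estimator for sigma.

```python
from statistics import fmean, stdev
from typing import Dict, List

Image = List[float]  # flattened H*W*C image; flattening is only for brevity


def estimate_statistics(outputs: List[Image]) -> Dict[str, Image]:
    """Pixel-wise statistics f(Y) of K >= 2 output images y_1..y_K.

    'mean' is the sample mean, i.e. the unbiased estimate of the expectation;
    'std' is the sample standard deviation (K - 1 denominator), an assumed choice.
    """
    if len(outputs) < 2:
        raise ValueError("need at least two output images")
    per_pixel = list(zip(*outputs))  # one tuple of K values per pixel
    return {
        "mean": [fmean(v) for v in per_pixel],
        "std": [stdev(v) for v in per_pixel],
    }


# Three hypothetical outputs (K = 3) of a three-pixel image.
Y = [[0.0, 1.0, 0.5], [0.2, 1.0, 0.7], [0.4, 1.0, 0.3]]
stats = estimate_statistics(Y)
```

Further entries of f(Y) (the "..." in (1)) would be added as extra keys computed over the same per-pixel tuples.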
[0053] D) The calculated statistics f(Y) of the combination of output images are supplied to the loss function calculation unit 4.
[0054] In the loss function calculation unit 4, a loss function is calculated based on the reference image x and the calculated statistics.
[0055] The loss function in this case may have the form:
[0056] $$\mathrm{Loss} = \lVert x - \bar{y} \rVert_p + R(f(Y)) \quad (3)$$
[0057] where $\lVert x - \bar{y} \rVert_p$, $p \in \mathbb{R}$, $p \ge 1$, is the p-norm of the difference between the reference image x and the unbiased mathematical expectation estimation $\bar{y}$ of the output image, and $\mathbb{R}$ is the set of real numbers,
[0058] R(f(Y)) is the regularization term.
[0059] The term $\lVert x - \bar{y} \rVert_p$ of the loss function ensures that the unbiased mathematical expectation estimation of the output image corresponds to the reference image x. The p-norm is calculated by the known formula:
[0060] $$\lVert x - \bar{y} \rVert_p = \Big(\sum_i \lvert \varepsilon_i \rvert^p\Big)^{1/p},$$
[0061] where the elements of the difference $x - \bar{y}$ are denoted by $\varepsilon_i = x_i - \bar{y}_i$.
[0062] In particular, for images defined by arrays of dimension H × W × C, the formula takes the form:
[0063] $$\lVert x - \bar{y} \rVert_p = \Big(\sum_{i=1}^{H}\sum_{j=1}^{W}\sum_{c=1}^{C} \lvert \varepsilon_{ijc} \rvert^p\Big)^{1/p},$$
[0064] where the elements of the difference $x - \bar{y}$ are denoted by $\varepsilon_{ijc} = x_{ijc} - \bar{y}_{ijc}$, H is the height of the image in points (pixels), W is the width of the image in points (pixels), and C is the number of color channels of the image.
[0065] The regularization term R(f(Y)) represents additional requirements (constraints) introduced into the formulation of the ill-posed problem of image / video quality enhancement (in particular, enhancement of the quality of noisy and / or blurry image / video details, and HDR content generation) in order to find an approximate solution. This known technique consists in adding further information to an ill-posed problem (for example, a problem with many solutions or with no solution) so as to reduce it to a similar well-posed problem. Here the problem is ill-posed with many solutions, since a plurality of output images can be obtained from one input image. In the context of the present invention, the term R(f(Y)) makes it possible to control the output images of the trainable neural network and to give the solution properties set in accordance with a priori expectations; for example, this term can control the dispersion of the output images of the trainable neural network and thus yield the required noise level.
[0066] Additional terms can also be introduced into the loss function, for example perceptual terms such as LPIPS (Learned Perceptual Image Patch Similarity) or SSIM (Structural Similarity Index Measure), as well as components calculated by a specially trained discriminator within the GAN approach, etc.
[0067] Because several output images of the trainable neural network, obtained for different degradations, are used when calculating the loss function value, the trainable neural network learns to improve details while not suppressing noise completely: it generates output images that jointly maintain a small noise level in the image that is comfortable for the human eye. The level of this noise depends on the degradation parameters of the degradation modeling unit, as well as on the parameter K (the number of image sets); the notion of "comfortable" is subjective and can be established based on the opinion of a group of experts who evaluate the image.
[0068] On the one hand, such a loss function keeps in force the requirement that the reference image be reproduced accurately; on the other hand, it imposes this requirement only "on average" over a set of different degradations. This allows the network to generate somewhat different images when the input data are slightly different, thereby improving the visual quality of the output image generated by the trained neural network.
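To make formula (3) concrete, the sketch below computes the p-norm data term and adds a regularization term. The document leaves R(f(Y)) open, so `noise_level_regularizer` is purely hypothetical: it penalizes the average pixel-wise standard deviation of the outputs for departing from a chosen target, which is one possible way of controlling the dispersion of the outputs as described above. All function names and default values are illustrative.

```python
from statistics import fmean
from typing import List

Image = List[float]  # flattened H*W*C image


def p_norm(diff: Image, p: float) -> float:
    # ||e||_p = (sum_i |e_i|^p)^(1/p); this is a norm only for p >= 1.
    if p < 1:
        raise ValueError("p must be >= 1")
    return sum(abs(e) ** p for e in diff) ** (1.0 / p)


def noise_level_regularizer(std: Image, target_std: float, weight: float) -> float:
    # HYPOTHETICAL R(f(Y)): squared gap between the mean pixel-wise std of the
    # outputs and a target noise level. The document does not define R.
    return weight * (fmean(std) - target_std) ** 2


def loss(x: Image, y_bar: Image, std: Image, p: float = 1.0,
         target_std: float = 0.01, weight: float = 1.0) -> float:
    # Loss = ||x - y_bar||_p + R(f(Y)), following formula (3).
    data_term = p_norm([xi - yi for xi, yi in zip(x, y_bar)], p)
    return data_term + noise_level_regularizer(std, target_std, weight)
```

Here `y_bar` and `std` would come from the statistics estimator; perceptual terms such as LPIPS or SSIM would simply be further summands.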
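The role of K can be seen in a short calculation that rests on assumptions the document does not make: take p = 2 with the norm squared, and treat the K outputs $y_k$ as independent and identically distributed given x, with pixel-wise mean $\mu_i$ and variance $s_i^2$. Since $\bar{y}_i$ then has mean $\mu_i$ and variance $s_i^2 / K$, the bias-variance decomposition of the data term gives

$$\mathbb{E}\,\lVert x - \bar{y} \rVert_2^2 = \sum_i \big(x_i - \mu_i\big)^2 + \frac{1}{K}\sum_i s_i^2 .$$

The first sum drives the average output towards the reference image, while the spread of the individual outputs, where residual noise lives, is penalized only with weight 1/K. Under these assumptions, per-output noise is discouraged less strongly as K grows, which is consistent with the statement that the retained noise level depends on K.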
[0069] E) The loss function value is supplied to the optimizer unit 5, which, based on the loss function value, updates the weights of the trainable neural network 2 by backpropagation (denoted as "updated weights" in Fig. 1).
[0070] Steps (A) through (E) are repeated with different reference images from the training dataset until either the loss function value stops decreasing or the number of iterations set by the specialist conducting the training is reached.
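Steps (A) through (E) can be traced end to end in a deliberately tiny, pure-Python toy. Everything here is an assumption made for illustration: the "network" is a single scalar weight w applied to the per-pixel average of the frames in a set, degradation is additive Gaussian noise only, the data term uses the squared 2-norm so that its gradient can be written by hand, R(f(Y)) is omitted, and a hand-written gradient step stands in for backpropagation in a real deep learning framework.

```python
import random
from statistics import fmean
from typing import List

Image = List[float]  # flattened image


def degrade(ref: Image, sigma: float, rng: random.Random) -> Image:
    # A) simplified degradation: additive Gaussian noise only
    return [p + rng.gauss(0.0, sigma) for p in ref]


def train(dataset: List[Image], K: int = 4, N: int = 3, sigma: float = 0.1,
          lr: float = 0.05, iters: int = 200, seed: int = 0) -> float:
    rng = random.Random(seed)
    w = 0.5  # the toy network's only weight, deliberately initialised away from 1
    for it in range(iters):
        x = dataset[it % len(dataset)]                        # A) reference image
        sets = [[degrade(x, sigma, rng) for _ in range(N)]    # K sets X_k of N frames
                for _ in range(K)]
        feats = [[fmean(px) for px in zip(*frames)] for frames in sets]
        ys = [[w * f for f in feat] for feat in feats]        # B) one output y_k per set
        y_bar = [fmean(px) for px in zip(*ys)]                # C) pixel-wise mean y_bar
        # D) data term ||x - y_bar||_2^2. Because y_bar_i = w * m_i with
        #    m_i = mean_k feats[k][i], its derivative w.r.t. w is written by hand.
        m = [fmean(px) for px in zip(*feats)]
        grad = sum(-2.0 * (xi - yi) * mi for xi, yi, mi in zip(x, y_bar, m))
        w -= lr * grad                                        # E) weight update
    return w


dataset = [[i / 8 for i in range(8)], [1 - i / 8 for i in range(8)]]
print(train(dataset))  # expected to end close to 1.0, i.e. y_bar close to x
```

With a real network, only steps B and E change: the K forward passes share weights, and an autodiff framework computes the gradient of the full loss (3).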
[0071] In one embodiment of the invention, the proposed method for training a neural network to enhance image / video quality with an adaptive blur correction unit is performed on a device schematically illustrated in Fig. 2. The device comprises the following operatively coupled units:
[0072] a degradation modeling unit 1;
[0073] a trainable neural network 2;
[0074] an adaptive blur correction unit 6, which includes:
[0075] - at least two blur units 6a,
[0076] - at least two attention units 6b,
[0077] - a probability calculation unit 6c;
[0078] a loss function calculation unit 4;
[0079] an optimizer unit 5.
[0080] In this embodiment of the invention, the following steps of training a neural network to enhance images / video quality are performed:
[0081] A) A reference image x selected from a training dataset of reference images is supplied to the input of a degradation modeling unit 1.
[0082] As shown in Fig. 2, one set of images with introduced degradations $X = \{x_1, x_2, \dots, x_N\}$ is generated from one reference image x in the degradation modeling unit 1, wherein all images in the set have different introduced degradations.
[0083] B) The obtained set of images with different introduced degradations is supplied to an input of a trainable neural network 2, which generates one output image y.
[0084] C) As shown in Fig. 2, an adaptive blur correction unit 6 comprises:
[0085] - M blur units 6a $B_1, \dots, B_M$, described by operators $B_m$ ($m = 1, \dots, M$, $M \ge 2$), which apply different blurring degrees (levels) to an image and form:
[0086] M reference images with different blurring degrees from the reference image,
[0087] and M output images with different blurring degrees from the output image obtained from the trainable neural network,
[0088] - M attention units 6b, each of which is configured to generate an attention map reflecting the degree of proximity of each pixel of the reference image with the corresponding blurring degree to each pixel of the output image y generated by the trainable neural network (that is, without blurring), all attention units being the same and implementing the same attention mechanism;
[0089] - a probability calculation unit 6c, which generates, from the attention maps, corresponding weights used for components of the loss function corresponding to different blurring levels.
[0090] In the adaptive blur correction unit 6:
[0091] - the reference image x is supplied to M blur units 6a to form M reference images having different blurring degrees Bm(x) (m = 1... M);
[0092] - the output image y of the neural network is supplied to M blur units 6a to form M output images with different blurring degrees Bm(y) (m = 1... M), which have the same dimension as the output image;
[0093] (wherein the reference image and the output image y are received by M blur units 6a sequentially and independently of each other);
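The M blur units 6a can be sketched as a bank of blur operators applied independently to x and to y. The operators are not specified at this point (the text later derives them from data in [0119] to [0121]), so here they are assumed to be Gaussian blurs with hypothetical standard deviations `SIGMAS`, where sigma = 0 acts as the identity operator. Only NumPy is used.

```python
import numpy as np

def gaussian_kernel_1d(sigma):
    """1-D Gaussian kernel truncated at 3 sigma; sigma = 0 gives the identity."""
    if sigma == 0:
        return np.array([1.0])
    radius = int(np.ceil(3 * sigma))
    t = np.arange(-radius, radius + 1, dtype=float)
    k = np.exp(-0.5 * (t / sigma) ** 2)
    return k / k.sum()

def blur(img, sigma):
    """Separable Gaussian blur of an (H, W, C) image with edge padding."""
    k = gaussian_kernel_1d(sigma)
    r = len(k) // 2
    out = img.astype(float)
    for axis in (0, 1):
        n = out.shape[axis]
        pad = [(0, 0)] * out.ndim
        pad[axis] = (r, r)
        padded = np.pad(out, pad, mode="edge")
        # weighted sum of shifted copies = 1-D convolution along this axis
        out = sum(k[j] * np.take(padded, np.arange(j, j + n), axis=axis)
                  for j in range(len(k)))
    return out

def blur_units(img, sigmas):
    """Apply the M blur operators B_1..B_M; returns an (M, H, W, C) stack."""
    return np.stack([blur(img, s) for s in sigmas])

SIGMAS = (0.0, 1.0, 2.0)  # hypothetical blurring degrees, M = 3
rng = np.random.default_rng(1)
x = rng.random((16, 16, 3))
y = rng.random((16, 16, 3))
Bx = blur_units(x, SIGMAS)  # M reference images B_m(x)
By = blur_units(y, SIGMAS)  # M output images B_m(y), same size as y
```

Because each kernel sums to one, a constant image passes through every operator unchanged. This is one way of keeping the blurring degree separate from overall brightness.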
[0094] - each of the M reference images Bm(x) (m = 1, ..., M) with different blurring degrees is supplied to the corresponding attention unit 6b together with the output image y, and an attention map zm between Bm(x) and y is calculated. This map is a matrix of coefficients of dimension H × W and reflects the degree of proximity (similarity) of each pixel of the reference image with the corresponding blurring degree Bm(x) to each pixel of the output image y.
[0095] The attention units can be implemented by known methods, for example, by calculating the scalar (dot) product zm = Bm(x) · y, or the p-th root of the sum, over the image channels, of the p-th powers of the absolute differences between Bm(x) and y, taken with the opposite sign:
[0096] $$z_{m,i} = -\left(\sum_{c=1}^{C} \left| B_m(x)_{i,c} - y_{i,c} \right|^{p}\right)^{1/p},$$
[0097] where p ∈ ℝ, p ≥ 1, and C is the number of color channels of the image. Either option (or both) can also be computed not on the images themselves but on features extracted from Bm(x) and y by a known neural network (for example, VGG-19 [Simonyan, Karen and Andrew Zisserman, “Very Deep Convolutional Networks for Large-Scale Image Recognition,” CoRR abs/1409.1556 (2014)]). A method for extracting such features is also known and is described, for example, in: Tammina, Srikanth (2019), “Transfer Learning Using VGG-16 with Deep Convolutional Neural Network for Classifying Images,” International Journal of Scientific and Research Publications (IJSRP), 9, p9420, doi:10.29322/IJSRP.9.10.2019.p9420.
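Both pixel-domain attention variants from [0095] can be written per pixel in a few lines. This is a minimal sketch. It assumes (H, W, C) float arrays and computes the maps on the images themselves rather than on VGG features. The function names are hypothetical. For the negative p-norm, a larger z means greater similarity. The dot product is left unnormalized, as in the text.

```python
import numpy as np

def attention_dot(bx, y):
    """Per-pixel dot product over colour channels: z_m = B_m(x) . y -> (H, W)."""
    return np.sum(bx * y, axis=-1)

def attention_neg_pnorm(bx, y, p=2.0):
    """Negative per-pixel p-norm over channels: z_m = -(sum_c |B_m(x) - y|^p)^(1/p)."""
    if p < 1:
        raise ValueError("p must be >= 1")
    return -np.sum(np.abs(bx - y) ** p, axis=-1) ** (1.0 / p)

rng = np.random.default_rng(2)
y = rng.random((8, 8, 3))
z_same = attention_neg_pnorm(y, y)        # identical images -> 0 everywhere
z_near = attention_neg_pnorm(y + 0.1, y)  # small difference -> close to 0
z_far = attention_neg_pnorm(y + 0.5, y)   # large difference -> more negative
```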
[0098] The attention maps zm calculated in the attention units 6b are transmitted to the probability calculation unit 6c, which transforms them into weights Wm. The weights Wm are two-dimensional arrays of numerical coefficients whose width and height equal those of y and Bm(y). They define the significance of each point (pixel) of the output image with a given blurring degree and are used for the components of the loss function that correspond to different blurring levels. In each pixel i of the output image, the weights sum to unity:
[0099] $$\sum_{m=1}^{M} W_{m,i} = 1, \qquad W_{m,i} > 0 \qquad (7).$$
[0100] The probability calculation unit 6c can be implemented, for example, on the basis of the known Softmax function [Goodfellow I., Bengio Y., Courville A., “6.2.2.3 Softmax Units for Multinoulli Output Distributions,” Deep Learning, MIT Press, pp. 180-184, 2016]:
[0101] $$W_{m,i} = \frac{e^{z_{m,i}}}{\sum_{u=1}^{M} e^{z_{u,i}}}, \qquad e = 2.718\ldots \qquad (8),$$
[0102] where z_{m,i} is the value of the attention map between Bm(x) and y (m = 1, ..., M) in the i-th pixel;
[0103] z_{u,i} is the value of the attention map between Bu(x) and y (u = 1, ..., M) in the i-th pixel;
[0104] i is the index of the pixel (point) of the image.
[0105] Because of the weights Wm, the blurring degree in each pixel of the image is taken into account separately, and this is how adaptive blur correction is achieved.
[0106] The function used in the probability calculation unit should make the weights vary smoothly across the image. Transitions between in-focus and out-of-focus areas, and between sharp and blurry areas, are then smooth as well.
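Equation (8) is a per-pixel softmax over the M attention maps. The sketch below is a minimal NumPy version. Subtracting the per-pixel maximum before exponentiation is a standard numerical-stability step and does not change the result of (8). The function name is hypothetical.

```python
import numpy as np

def softmax_weights(z):
    """Eq. (8): z has shape (M, H, W); returns W with W[:, i, j] summing to 1 (eq. (7))."""
    z_shift = z - z.max(axis=0, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z_shift)
    return e / e.sum(axis=0, keepdims=True)

z = np.stack([np.zeros((2, 2)), np.full((2, 2), np.log(3.0))])  # M = 2 maps
W = softmax_weights(z)  # 0.25 for m = 1 and 0.75 for m = 2 in every pixel
```

Softmax varies smoothly with z, so smoothly varying attention maps yield smoothly varying weights. That matches the requirement in [0106].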
[0107] D) The weights Wm, the output images with different blurring degrees Bm(y) and the reference image x are supplied to the loss function calculation unit 4, and a loss function value is calculated.
[0108] In this case, the loss function is set
[0109] either as:
[0110] $$\mathrm{Loss} = \sum_{m} w_m \left\| B_m(y) - x \right\|_p + R(y) \qquad (9),$$
[0111] where $w_m = \frac{1}{U}\sum_{i=1}^{U} W_{m,i}$ is the mean value of the matrix of weights Wm,
[0112] U is the number of pixels of the matrix of weights Wm,
[0113] R(y) is the regularization term;
[0114] or as:
[0115] $$\mathrm{Loss} = \left\| \sum_{m} W_m \odot B_m(y) - x \right\|_p + R(y) \qquad (10).$$
[0116] In the first case (equation (9)), the loss function is a weighted sum of the loss functions between each blurred image Bm(y) and the reference image. In the second case (equation (10)), the loss function is computed between a locally (per-pixel) weighted sum of the outputs of the blur operators and the reference image. This form also lets the choice of norm control how strongly the weights Wm act.
[0117] In deblurring problems, loss functions of the form ||Bm(y) − x||p are generally used. They cause the model to deblur (that is, to remove blurring that has occurred in an image) uniformly across the entire image and to generate an output image whose blurring degree is close to that of the reference image.
[0118] Thus, in in-focus portions with sharp details, the blur operators Bm that differ strongly from the identity operator receive low weights Wm. As a result, the deblurring effect is small and fine details are preserved. In defocused or blurry areas of the image, all the blur operators Bm receive weights Wm of similar value. Together they form an average blur operator with a sufficiently strong effect, the model learns to invert it, and the deblurring effect is correspondingly strong.
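Equations (9) and (10) can be sketched directly in NumPy. R(y) is not specified in this passage, so it is passed in as a precomputed number `reg`; this is an assumption. The default p = 1 is also illustrative, and the function names are hypothetical. The example reproduces the behaviour described in [0118]: when all the weight falls on the operator whose output already matches the reference, both losses vanish.

```python
import numpy as np

def pnorm(a, p):
    """p-norm of all entries of an array."""
    return float(np.sum(np.abs(a) ** p) ** (1.0 / p))

def loss_eq9(By, W, x, p=1.0, reg=0.0):
    """Eq. (9): sum_m w_m ||B_m(y) - x||_p + R(y), with w_m the mean of W_m.
    By: (M, H, W, C) blurred outputs; W: (M, H, W) weights; reg stands in for R(y)."""
    w_mean = W.reshape(W.shape[0], -1).mean(axis=1)  # w_m = (1/U) sum_i W_{m,i}
    return sum(w_mean[m] * pnorm(By[m] - x, p) for m in range(By.shape[0])) + reg

def loss_eq10(By, W, x, p=1.0, reg=0.0):
    """Eq. (10): || sum_m W_m * B_m(y) - x ||_p + R(y), weights applied per pixel."""
    combined = np.sum(W[..., None] * By, axis=0)  # broadcast (M, H, W, 1) over channels
    return pnorm(combined - x, p) + reg

x = np.zeros((2, 2, 1))
By = np.stack([x, x + 1.0])                            # B_1(y) = x exactly, B_2(y) off by 1
W_half = np.full((2, 2, 2), 0.5)                       # equal weights in every pixel
W_id = np.stack([np.ones((2, 2)), np.zeros((2, 2))])   # all weight on B_1
l9 = loss_eq9(By, W_half, x)                           # 0.5 * 0 + 0.5 * 4 = 2.0
l10 = loss_eq10(By, W_half, x)                         # ||0.5 * ones||_1 = 2.0
```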
[0119] The blurring functions and the optimal number of blur operators can be determined on the basis of the following considerations. Typically, when neural networks are trained for image reconstruction using regression loss functions, the output image is blurrier than the reference one. This phenomenon is known as the regression-to-the-mean problem. It arises in ill-posed inverse problems, that is, problems in which an image close to a reference one is obtained from input images and whose formulation admits many solutions [Mauricio Delbracio, Hossein Talebi, and Peyman Milanfar, “Projected Distribution Loss for Image Enhancement,” 2021 IEEE International Conference on Computational Photography (ICCP), pp. 1-12, IEEE Computer Society, 2021]. To estimate the blurring introduced by a neural network model, a test set of T images x is supplied to the basic neural network model (without the adaptive blur correction unit). This yields T output images y that are blurrier than the corresponding images x. For each obtained pair xt, yt, the parameters of the blur operator Bt are found by minimizing the difference below. These parameters are the weights of the blur kernel, i.e. the numerical coefficients of a matrix.
[0120] $$\left\| B_t(x_t) - y_t \right\|_p \qquad (11).$$
[0121] The obtained T blur operators Bt are clustered into M clusters, for example on the basis of the L2 norm (p-norm with p = 2) between the blur kernels Bt. The center of each cluster is calculated and serves as a blur operator Bm. The identity operator can be added to the set of operators constructed in this way. The number M is set by the specialist conducting the training of the neural network. The choice is based on an analysis of the distribution of the kernels Bt and on additional information about the problem to be solved.
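The calibration described in [0119] to [0121] can be sketched with NumPy alone. The sketch rests on several assumptions: grayscale images, square kernels of a fixed hypothetical radius, and p = 2 in (11), which makes each kernel the solution of a linear least-squares problem. For clustering it uses plain k-means with farthest-point initialisation; the text only asks for clustering under the L2 norm. The synthetic blurs `box` and `mild`, and all function names, are hypothetical.

```python
import numpy as np

def apply_kernel(x, k):
    """Blur a 2-D image with kernel k (correlation, edge padding), same size."""
    r = k.shape[0] // 2
    xp = np.pad(x, r, mode="edge")
    H, W = x.shape
    out = np.zeros((H, W))
    for dy in range(k.shape[0]):
        for dx in range(k.shape[1]):
            out += k[dy, dx] * xp[dy:dy + H, dx:dx + W]
    return out

def estimate_kernel(x, y, radius=1):
    """Eq. (11) with p = 2: least-squares kernel B_t with B_t(x) ~ y,
    fitted on interior pixels where no padding is involved."""
    size = 2 * radius + 1
    H, W = x.shape
    cols = [x[dy:H - size + 1 + dy, dx:W - size + 1 + dx].ravel()
            for dy in range(size) for dx in range(size)]
    A = np.stack(cols, axis=1)                       # one column per kernel tap
    b = y[radius:H - radius, radius:W - radius].ravel()
    k, *_ = np.linalg.lstsq(A, b, rcond=None)
    return k.reshape(size, size)

def cluster_kernels(kernels, M, iters=50):
    """k-means (L2 distance) on flattened kernels; returns M cluster centres."""
    data = kernels.reshape(len(kernels), -1)
    centres = [data[0]]
    for _ in range(1, M):                            # farthest-point initialisation
        d = np.min([((data - c) ** 2).sum(axis=1) for c in centres], axis=0)
        centres.append(data[d.argmax()])
    centres = np.array(centres, dtype=float)
    for _ in range(iters):
        dist = ((data[:, None, :] - centres[None, :, :]) ** 2).sum(axis=-1)
        labels = dist.argmin(axis=1)
        for m in range(M):
            if np.any(labels == m):
                centres[m] = data[labels == m].mean(axis=0)
    return centres.reshape((M,) + kernels.shape[1:])

rng = np.random.default_rng(3)
box = np.full((3, 3), 1 / 9)                                   # strong blur
mild = np.array([[0, 0.1, 0], [0.1, 0.6, 0.1], [0, 0.1, 0]])   # mild blur
kernels = []
for t in range(6):                                   # T = 6 synthetic (x_t, y_t) pairs
    x_t = rng.random((24, 24))
    y_t = apply_kernel(x_t, box if t % 2 == 0 else mild)
    kernels.append(estimate_kernel(x_t, y_t))
centres = cluster_kernels(np.array(kernels), M=2)
identity = np.zeros((1, 3, 3))
identity[0, 1, 1] = 1.0
operators = np.concatenate([centres, identity])      # optional identity operator added
```

With real network outputs the fitted kernels would be noisy and the clusters less clean. In that case the choice of M is left to the specialist, as [0121] states.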
[0122] E) The loss function value is supplied to the optimizer unit 5. The optimizer unit 5, based on the loss function value, updates the weights of the trainable neural network 2 by a backpropagation technique (it is denoted as “updated weights” in Fig. 2).
[0123] Steps (A) through (E) are repeated with different reference images from the training dataset until the loss function value stops decreasing or the number of iterations determined by the specialist conducting the training is completed.
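The stopping rule of [0123] can be sketched as a small driver loop: training repeats until the loss stops decreasing or an iteration budget runs out. The text does not say how "stops decreasing" is detected. The `patience` and `min_delta` parameters below are therefore assumptions. `step` is a hypothetical callable that performs one pass of steps A) to E) and returns the loss value.

```python
from typing import Callable, List

def train_until_plateau(step: Callable[[int], float], max_iters: int = 1000,
                        patience: int = 5, min_delta: float = 0.0) -> List[float]:
    """Call step(it) repeatedly; stop after `patience` consecutive iterations
    without an improvement larger than min_delta, or after max_iters calls."""
    best = float("inf")
    stale = 0
    history: List[float] = []
    for it in range(max_iters):
        loss = step(it)
        history.append(loss)
        if loss < best - min_delta:
            best, stale = loss, 0
        else:
            stale += 1
            if stale >= patience:
                break
    return history

# Toy stand-in for one training step: the loss falls from 10 to 3, then plateaus.
history = train_until_plateau(lambda it: float(max(10 - it, 3)))
```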
[0124] In another embodiment of the invention, the proposed method for training a neural network to enhance images / video quality with an adaptive blur correction unit is performed on a device for training a neural network to enhance images / video quality, which is schematically illustrated in Fig. 3. In this device, the adaptive blur correction unit is located after the statistics estimator unit, and the device comprises, being operatively coupled:
[0125] a degradation modeling unit 1;
[0126] a trainable neural network 2;
[0127] a statistics estimator unit 3;
[0128] an adaptive blur correction unit 6, which includes:
[0129] - at least two blur units 6a,
[0130] - at least two attention units 6b,
[0131] - a probability calculation unit 6c;
[0132] a loss function calculation unit 4,
[0133] an optimizer unit 5.
[0134] In this embodiment of the proposed invention, the steps (A), (B) are performed in the same manner as in the embodiment describing Fig. 1, i.e.
[0135] A) A reference image x from a training dataset of images is supplied to an input of the degradation modeling unit 1. At least two sets of images are generated from the supplied reference image by means of the degradation modeling unit 1, each set of images consisting of at least one image corresponding to said reference image, all images in each set having different degradations.
[0136] B) The obtained sets of images with introduced degradations are supplied to the trainable neural network 2, by means of which one output image is generated for each set of images with introduced degradations, wherein the output images obtained for all the sets of images with introduced degradations form a set of output images.
[0137] C) The set of output images is supplied to the statistics estimator unit 3, which calculates a plurality of statistics (see equation (1)). One of these statistics is the unbiased mathematical expectation estimation ȳ of the output image.
[0138] D) This step is performed in the adaptive blur correction unit 6 in the same manner as step (C) of the embodiment illustrated in Fig. 2. The only difference is that the unbiased mathematical expectation estimation ȳ of the output image, obtained in step (C) of this embodiment, is used instead of the output image y. That is, the unbiased mathematical expectation estimations of the output image with different blurring degrees Bm(ȳ) and the weights Wm are obtained at the output of the adaptive blur correction unit 6.
[0139] The rest of the plurality of statistics f(Y) calculated in the statistics estimator unit 3 is supplied to the loss function calculation unit 4.
[0140] E) The reference image x, the weights Wm, the unbiased mathematical expectation estimations of the output image with different blurring degrees Bm(ȳ), and the rest of the plurality of statistics f(Y) (except for the unbiased mathematical expectation estimation of the output image) are supplied to the loss function calculation unit 4, and a loss function value is calculated.
[0141] The loss function in this case can be set
[0142] either as:
[0144] where $w_m = \frac{1}{U}\sum_{i=1}^{U} W_{m,i}$ is the mean value of the matrix of weights Wm,
[0145] U is the number of pixels of the matrix of weights Wm,
[0146] or as:
[0148] As described above, due to the use of weights, the blurring degree in each pixel of the image is taken into account separately, and thus the adaptive blur correction is achieved.
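Steps C) and D) of the Fig. 3 arrangement can be sketched as follows. Equation (1) is not reproduced in this passage, so the statistics are assumed to be the per-pixel sample mean ȳ and the unbiased sample variance; the sample mean is an unbiased estimate of the expectation. The blur operators are assumed to be box blurs with hypothetical radii, where radius 0 is the identity. The attention is the negative p-norm of [0095], followed by the softmax of (8). All names are hypothetical.

```python
import numpy as np

def estimate_statistics(Y):
    """Y: (K, H, W, C) output images, K >= 2 -> (mean, unbiased variance)."""
    if Y.shape[0] < 2:
        raise ValueError("need at least two output images")
    return Y.mean(axis=0), Y.var(axis=0, ddof=1)

def box_blur(img, r):
    """Mean filter of radius r on an (H, W, C) image; r = 0 is the identity."""
    if r == 0:
        return img.astype(float)
    n = 2 * r + 1
    p = np.pad(img, ((r, r), (r, r), (0, 0)), mode="edge")
    H, W = img.shape[:2]
    return sum(p[dy:dy + H, dx:dx + W] for dy in range(n) for dx in range(n)) / n**2

def blur_correction_on_mean(x, y_mean, radii=(0, 1, 2), p=2.0):
    """Step D) of Fig. 3: blur x and y_mean, build attention maps, softmax weights."""
    Bx = np.stack([box_blur(x, r) for r in radii])
    By = np.stack([box_blur(y_mean, r) for r in radii])
    z = -np.sum(np.abs(Bx - y_mean) ** p, axis=-1) ** (1.0 / p)  # (M, H, W)
    z = z - z.max(axis=0, keepdims=True)
    Wm = np.exp(z) / np.exp(z).sum(axis=0, keepdims=True)         # eq. (8)
    return By, Wm

rng = np.random.default_rng(4)
x = rng.random((16, 16, 3))
Y = x + 0.05 * rng.standard_normal((5, 16, 16, 3))  # K = 5 noisy outputs
y_mean, y_var = estimate_statistics(Y)
By, Wm = blur_correction_on_mean(x, y_mean)
```

`ddof=1` divides by K − 1, which is what makes the variance estimate unbiased. The remaining statistic, `y_var`, would go straight to the loss function calculation unit, as in [0139].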
[0149] F) The loss function value is supplied to the optimizer unit 5 which, based on the loss function value, updates the weights of the trainable neural network 2 by a backpropagation technique (it is denoted as “updated weights” in Fig. 3).
[0150] Steps (A) through (F) are repeated with different reference images from the training dataset until the loss function value stops decreasing or the number of iterations determined by the specialist conducting the training is completed.
[0151] In another embodiment of the invention, the proposed method for training a neural network to enhance images / video quality with an adaptive blur correction unit is performed on a device for training a neural network to enhance images / video quality, which is schematically illustrated in Fig. 4. In this device, the adaptive blur correction unit is located before the statistics estimator unit, and the device comprises, being operatively coupled:
[0152] a degradation modeling unit 1;
[0153] a trainable neural network 2;
[0154] an adaptive blur correction unit 6, which includes:
[0155] - at least two blur units 6a,
[0156] - at least two attention units 6b,
[0157] - a probability calculation unit 6c;
[0158] a statistics estimator unit 3;
[0159] a loss function calculation unit 4,
[0160] an optimizer unit 5.
[0161] In the embodiment of the proposed invention illustrated in Fig. 4, the steps (A), (B) are performed in the same manner as when describing the embodiment of the invention illustrated in Fig. 1, that is:
[0162] A) A reference image x from a training dataset of images is supplied to an input of the degradation modeling unit 1. At least two sets of images are generated from the supplied reference image by means of the degradation modeling unit 1, each set of images consisting of at least one image corresponding to said reference image, all images in each set having different degradations.
[0163] B) The obtained sets of images with introduced degradations are supplied to the trainable neural network 2, by means of which one output image is generated for each set of images with introduced degradations; the output images obtained for all the sets of images with introduced degradations form a set of output images.
[0164] C) The set of output images Y = {y1, y2, ..., yK} and the reference image x are supplied to the adaptive blur correction unit 6.
[0165] In the adaptive blur correction unit 6:
[0166] - a reference image x is supplied to M blur units 6a to form M reference images having different blurring degrees Bm(x) (m = 1... M);
[0167] - each of the output images yk of the set of output images Y = {y1, y2, ..., yK} is supplied, sequentially and independently of the others, to the M blur units 6a Bm (m = 1, ..., M). This forms M sets of output images with different blurring degrees, each set containing K output images {Bm(y1), Bm(y2), ..., Bm(yK)} (m = 1, ..., M; k = 1, ..., K). The blurring degree is the same within a set and differs between sets;
[0168] - each of the M reference images with different blurring degrees Bm(x) (m = 1, ..., M) is supplied to the corresponding attention unit 6b together with the set of output images Y = {y1, y2, ..., yK}. For each output image yk (k = 1, ..., K) from this set, an attention map zm(yk) between Bm(x) and yk is calculated. This map is a matrix of coefficients of dimension H × W and reflects the degree of proximity (similarity) of each pixel of the reference image with the corresponding blurring degree Bm(x) to each pixel of yk. Thus, sets of attention maps {zm(yk)} are obtained, each set corresponding to one blurring degree.
[0169] The mechanism of operation of the attention units can be implemented by the known methods mentioned above.
[0170] - the sets of attention maps {zm(yk)} calculated in the attention units 6b are transmitted to the probability calculation unit 6c. The unit transforms each attention map zm(yk) in each set into weights Wm(yk). These are two-dimensional arrays of numerical coefficients whose width and height equal those of yk and Bm(yk). They define the significance of each point (pixel) of the image with a given blurring degree, and in each pixel i of the output image the weights sum to unity (as in equation (7) above). Thus, sets of weights {Wm(yk)} (m = 1, ..., M; k = 1, ..., K) are obtained, each set corresponding to one blurring degree.
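Step C) of the Fig. 4 arrangement produces M blurred copies and M weight maps for every output yk. The sketch below makes three assumptions. First, the blur operators are box blurs with hypothetical radii, with radius 0 as the identity. Second, the attention is the negative p-norm of [0095] between Bm(x) and the unblurred yk. Third, the weights are a softmax over m as in (8). It stops at the two sets {Bm(yk)} and {Wm(yk)}, because the formula by which step D) combines them is not reproduced in this passage. Function names are hypothetical.

```python
import numpy as np

def box_blur(img, r):
    """Mean filter of radius r on an (H, W, C) image; r = 0 is the identity."""
    if r == 0:
        return img.astype(float)
    n = 2 * r + 1
    p = np.pad(img, ((r, r), (r, r), (0, 0)), mode="edge")
    H, W = img.shape[:2]
    return sum(p[dy:dy + H, dx:dx + W] for dy in range(n) for dx in range(n)) / n**2

def per_output_blur_and_weights(x, Y, radii=(0, 1, 2), p=2.0):
    """x: (H, W, C); Y: (K, H, W, C).
    Returns B with B[m, k] = B_m(y_k) and Wt with Wt[m, k] = W_m(y_k)."""
    Bx = np.stack([box_blur(x, r) for r in radii])                       # (M, H, W, C)
    B = np.array([[box_blur(y_k, r) for y_k in Y] for r in radii])       # (M, K, H, W, C)
    # z[m, k] compares B_m(x) with the unblurred y_k (negative p-norm)
    z = -np.sum(np.abs(Bx[:, None] - Y[None]) ** p, axis=-1) ** (1.0 / p)  # (M, K, H, W)
    z = z - z.max(axis=0, keepdims=True)
    Wt = np.exp(z) / np.exp(z).sum(axis=0, keepdims=True)                # softmax over m
    return B, Wt

rng = np.random.default_rng(5)
x = rng.random((12, 12, 3))
Y = x + 0.05 * rng.standard_normal((4, 12, 12, 3))  # K = 4 outputs
B, Wt = per_output_blur_and_weights(x, Y)
```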
[0171] D) The obtained sets {Bm(yk)} and {Wm(yk)} are supplied to the statistics estimator unit 3, which calculates a plurality of statistics over all the sets (see equation (1)). One of these statistics is the unbiased mathematical expectation estimation ȳ of the output image. In this embodiment of the invention, the unbiased mathematical expectation estimation ȳ of the output image is:
[0174] E) In the loss function calculation unit 4, a loss function value is calculated on the basis of the reference image x and the calculated statistics, as described above.
[0175] F) The loss function value is supplied to the optimizer unit 5. The optimizer unit 5, based on the loss function value, updates the weights of the trainable neural network 2 by a backpropagation technique (it is indicated as “updated weights” in Fig. 4).
[0176] Steps (A) through (F) are repeated with various images from the training dataset until the loss function value stops decreasing or the number of iterations determined by the specialist conducting the training is completed.
[0177] Fig. 5 schematically illustrates an embodiment of the invention in which the outputs of the neural network are weighted using an attention mechanism before the statistics are calculated. In this embodiment, the proposed method for training a neural network is performed in a device for training a neural network to improve images / video quality, the device comprising, being operatively coupled:
[0178] a degradation modeling unit 1,
[0179] a trainable neural network 2,
[0180] a unit 7 of weighting outputs of the trainable neural network,
[0181] a statistics estimator unit 3,
[0182] a loss function calculation unit 4,
[0183] an optimizer unit 5.
[0184] The proposed method for training a neural network is schematically illustrated in Fig. 5 and is as follows.
[0185] A) As shown in Fig. 5 and described above, at least two sets of images Xk corresponding to the selected reference image x are generated from one reference image x in the degradation modelling unit 1, all images in each set having different degradations.
[0186] B) As in the previous embodiments, the obtained sets of images with different introduced degradations are supplied to an input of the trainable neural network 2, which generates one output image yk for each set of images Xk with different introduced degradations to form a set of output images Y = {y1, y2, ..., yK} (see Fig. 5). Meanwhile, the sets Xk are supplied to the input of the trainable neural network 2 sequentially and independently of each other.
[0187] C) The obtained set of output images Y = {y1, y2, ..., yK} is supplied to the unit 7 of weighting outputs of the trainable neural network.
[0188] The unit 7 of weighting outputs of the trainable neural network includes:
[0189] - at least two attention units 6b, each of which is configured to form an attention map zk reflecting the degree of proximity of each pixel of the reference image x to each pixel of the output image yk generated by the trainable neural network, wherein the attention mechanism can be implemented as in the cases described above, for example, through the dot product zk = x · yk,
[0190] - a probability calculation unit 6c configured to form weights Wk from the attention maps as described above, with which the output images yk will be taken into account when calculating statistics;
[0191] D) The obtained weights are used when the statistics f(Y) are calculated in the statistics estimator unit 3. In this case, the unbiased mathematical expectation estimation of the output image is calculated
[0192] either by the formula:
[0194] in order to make the trainable neural network 2 pay more attention to correcting those output images yk that turn out to be less similar to the reference image x;
[0195] or by the formula:
[0196] $$\bar{y} = \sum_{k=1}^{K} W_k \odot y_k, \qquad K \ge 2 \qquad (16)$$
[0197] in order to make the trainable neural network 2 generate more diverse output images yk.
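Unit 7 and formula (16) can be sketched as follows. The sketch assumes (H, W, C) images, computes the dot-product attention of [0189] per pixel over the colour channels, and takes Softmax (as in (8)) over the K outputs. Formula (15) is not reproduced in this passage, so only (16) is shown. Function names are hypothetical.

```python
import numpy as np

def weight_outputs(x, Y):
    """Unit 7: z_k = per-pixel dot product x . y_k, then W_k = softmax over k."""
    z = np.sum(x[None] * Y, axis=-1)            # (K, H, W)
    z = z - z.max(axis=0, keepdims=True)        # stability shift, result unchanged
    e = np.exp(z)
    return e / e.sum(axis=0, keepdims=True)     # (K, H, W), sums to 1 over k

def weighted_mean(Y, Wk):
    """Eq. (16): y_bar = sum_k W_k * y_k, with weights applied per pixel."""
    return np.sum(Wk[..., None] * Y, axis=0)

rng = np.random.default_rng(6)
x = rng.random((8, 8, 3))
Y = x + 0.1 * rng.standard_normal((3, 8, 8, 3))  # K = 3 outputs
Wk = weight_outputs(x, Y)
y_bar = weighted_mean(Y, Wk)
```

Since the weights in each pixel are non-negative and sum to one, ȳ is a per-pixel convex combination of the outputs. Each of its values therefore lies between the smallest and largest yk at that pixel.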
[0198] E) The calculated statistics f(Y) of the set of output images are supplied to the loss function calculation unit 4.
[0199] In the loss function calculation unit 4, a loss function value is calculated on the basis of the reference image x and the calculated statistics, as described above.
[0200] F) The loss function value is supplied to the optimizer unit 5. The optimizer unit 5, based on the loss function value, updates the weights of the trainable neural network 2 by a backpropagation technique.
[0201] Steps (A) through (F) are repeated with various images from the training dataset until the loss function value stops decreasing or the number of iterations determined by the specialist conducting the training is completed.
[0202] Fig. 6 schematically illustrates an embodiment of the invention in which the outputs of the neural network are weighted using an attention mechanism before the statistics are calculated, and which also includes an adaptive blur correction unit. In this embodiment, the proposed method for training a neural network is performed in a device for training a neural network to improve images / video quality, the device comprising, being operatively coupled:
[0203] a degradation modeling unit 1,
[0204] a trainable neural network 2,
[0205] a unit 7 of weighting outputs of the trainable neural network, which includes:
[0206] - at least two attention units 6b,
[0207] - a probability calculation unit 6c;
[0208] a statistics estimator unit 3,
[0209] an adaptive blur correction unit 6, which includes:
[0210] - at least two blur units 6a,
[0211] - at least two attention units 6b,
[0212] - a probability calculation unit 6c;
[0213] a loss function calculation unit 4,
[0214] an optimizer unit 5.
[0215] In this embodiment of the proposed invention, the steps (A), (B), (C), (D) are performed in the same manner as in the embodiment describing Fig. 5, and the steps (E), (F) are performed in the same manner as (D), (E) in the embodiment describing Fig. 3, that is:
[0216] A) At least two sets of images Xk corresponding to the selected reference image x are generated from one reference image x in the degradation modeling unit 1, all images in each set having different degradations.
[0217] B) The obtained sets of images with different introduced degradations are supplied to an input of the trainable neural network 2, which generates one output image yk for each set of images Xk with different introduced degradations to form a set of output images Y = {y1, y2, ..., yK}. Meanwhile, the sets Xk are supplied to the input of the trainable neural network 2 sequentially and independently of each other.
[0218] C) The obtained set of output images Y = {y1, y2, ..., yK} is supplied to the unit 7 of weighting outputs of the trainable neural network, which forms weights Wk with which the output images yk are taken into account when calculating statistics.
[0219] D) The obtained weights are used when the statistics f(Y) are calculated in the statistics estimator unit 3. In this case, the unbiased mathematical expectation estimation of the output image is calculated either by formula (15) or by formula (16).
[0220] E) This step is performed in the adaptive blur correction unit 6 in the same manner as step (D) described for the embodiment illustrated in Fig. 3. At the output of the adaptive blur correction unit 6, the unbiased mathematical expectation estimations of the output image with different blurring degrees Bm(ȳ) and the weights Wm (m = 1, ..., M) are obtained.
[0221] The rest of the plurality of statistics f(Y) calculated in the statistics estimator unit 3 is supplied to the loss function calculation unit 4.
[0222] F) The reference image x, the weights Wm, the unbiased mathematical expectation estimations of the output image with different blurring degrees Bm(ȳ), and the rest of the plurality of statistics f(Y) (except for the unbiased mathematical expectation estimation of the output image) are supplied to the loss function calculation unit 4, and a loss function value is calculated either by formula (12) or by formula (13).
[0223] G) The loss function value is supplied to the optimizer unit 5, which, based on the loss function value, updates the weights of the trainable neural network 2 by a backpropagation technique (it is denoted as “updated weights” in Fig. 6).
[0224] Steps (A) through (G) are repeated with various images from the training dataset until the loss function value stops decreasing or the number of iterations determined by the specialist conducting the training is completed.
[0225] Once trained, the neural network can be used for image processing. In particular, it can be adapted to run on a smartphone. When the user presses the shutter button, the smartphone captures a series of consecutive frames, which are transmitted to the input of the trained neural network. The trained neural network combines them into one frame of enhanced quality, namely:
[0226] when the statistics estimator unit is present in the scheme:
[0227] - image detailing is enhanced, while a low noise level that is comfortable for the human eye is preserved;
[0228] when the adaptive blur correction unit is present in the scheme:
[0229] - in the in-focus area of the final image, fine details are sharp, and a noise level that is comfortable for the human eye, corresponding to the level set during training, is preserved (i.e., the image does not look “cartoonish”);
[0230] - in areas that are out of focus or motion-blurred, the sharpness of blurred details is enhanced by more aggressive deblurring, because in these areas the details are already strongly blurred and there is no risk of suppressing the noise along with them.
[0231] A similar approach can be used to train neural networks for video processing.
[0232] In one embodiment of the invention, a system of coaxial cameras can be used to collect a training dataset of images; such a system allows series of frames with natural degradations to be shot. The system of coaxial cameras described in patent RU 2797757 C1 (published 08.06.2023) consists of two (or more) digital cameras optically coupled by a beam splitter (or beam splitters). One of the cameras is the reference one (shooting high-quality images), and the other (remaining) cameras are target ones (shooting low-quality images). This makes it possible to obtain pairs of image series with good spatial and temporal alignment. One series has high quality and is subsequently used to form a reference image. The other series contains natural degradations and is used as input data for the neural network. Here, natural degradations mean the real noise in the frames shot by the target camera. The level of this noise is determined by the characteristics of the radiation receiver (sensor) and by camera settings such as ISO and exposure time.
[0233] Thus, when using two (or more) coaxial cameras, “natural” degradations will be reproduced on the images.
[0234] There are two possible options for shooting a training dataset of images with the system of coaxial cameras: shooting (capturing) images shown on a digital display, or shooting a natural scene. When shooting images shown on the digital display, an original dataset of high-quality images is taken. From each such image, L (L > N, N ≥ 1) images with various synthetic motions (shifts, rotation, blurring, etc.) are generated to simulate camera motion when shooting a series of frames. The images with synthetic motions (L images) are displayed on the digital display sequentially. For each synthetic motion, the reference camera shoots a series of Kref images (Kref > 1) with the same or various exposure time and ISO settings. The target camera shoots a series of K images (K ≥ 2) with the same or various exposure time and ISO settings for each synthetic motion.
[0235] From the L × K images shot by the target camera, K sets Xk (K ≥ 2) of N (N ≥ 1) images are formed such that at least one common (i.e. the same for all) motion is present in all K sets. The obtained sets Xk are supplied to the input of the neural network during training.
[0236] From the L × Kref images (Kref > 1) shot by the reference camera, L reference frames are formed, one for each of the L synthetic motions (for example, the Kref images may be combined into one frame, or one frame may be selected from them).
[0237] The sets Xk have common motion by construction. A reference frame with the same motion is selected as the reference x for the set Xk during neural network training and is supplied to the loss function calculation unit. Motion compensation for the image sets Xk is performed in relation to the common motion. Thus, for each of the images of the original set of images, K series of N images with arbitrary motions and natural degradations, as well as the reference frame x corresponding thereto are obtained.
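The set-forming rule of [0235] to [0237] leaves the exact selection open. The sketch below is one possible reading under stated assumptions. The target-camera frames are indexed as (motion, shot) on an L × K grid, and set Xk takes shot k of each of its motions. Every set contains the common motion, which also selects the reference frame x. The remaining N − 1 motions are drawn at random. Motion compensation is not modelled. All names and the stand-in arrays are hypothetical.

```python
import numpy as np

def form_sets(L, K, N, rng, common=None):
    """Return the common motion and K index sets of N (motion, shot) pairs."""
    if not (L > N >= 1 and K >= 2):
        raise ValueError("expected L > N >= 1 and K >= 2")
    common = int(rng.integers(L)) if common is None else common
    others = [motion for motion in range(L) if motion != common]
    sets = []
    for k in range(K):
        extra = rng.choice(others, size=N - 1, replace=False)  # distinct extra motions
        sets.append([(common, k)] + [(int(motion), k) for motion in extra])
    return common, sets

rng = np.random.default_rng(7)
L, K, N = 6, 3, 4
frames = rng.random((L, K, 4, 4))     # stand-in target-camera frames
ref_frames = rng.random((L, 4, 4))    # stand-in reference frames, one per motion
common, sets = form_sets(L, K, N, rng)
X_sets = [frames[[m for m, _ in s], [k for _, k in s]] for s in sets]  # K arrays (N, 4, 4)
x_ref = ref_frames[common]            # reference x shared by all K sets
```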
[0238] The protocol for shooting a natural scene with the system of coaxial cameras is similar to the above protocol for shooting images from the digital display, except that the L motions arise naturally from motion in the scene.
[0239] The proposed invention can be used in smartphone cameras, photo cameras, 3D scanners and virtual reality glasses, to enhance the quality of text images for subsequent recognition, and in software products designed to improve and process images.
[0240] Fig. 7 illustrates an embodiment of the controlling method of the electronic apparatus.
[0241] In one of embodiments, a device for training a neural network to enhance images / video quality is proposed, the device comprising, being operatively coupled:
[0242] a degradation modeling unit configured to generate, for each reference image from a training dataset of images, at least two sets of images with introduced degradations, wherein each set consists of at least one image corresponding to said reference image, wherein the introduced degradations are different for each image from the set (S710);
[0243] a trainable neural network configured to generate one output image for each set of images with introduced degradations to form a set of output images (S720);
[0244] a statistics estimator unit configured to calculate a plurality of statistics based on said set of output images (S730);
[0245] a loss function calculation unit configured to calculate a loss function value based on the reference image and said plurality of statistics of the set of output images (S740);
[0246] an optimizer unit configured to update weights of the trainable neural network by a backpropagation technique (S750).
[0247] The introduced degradations can simulate processes that occur when shooting with a real camera, and their nature may depend on the image quality enhancement problem being solved. One of the statistics is an unbiased mathematical expectation estimation of the output image.
[0248] Meanwhile, the proposed device may further comprise:
[0249] an adaptive blur correction unit located after (downstream of) the statistics estimator unit, including:
[0250] - at least two blur units, each configured to blur the reference image and to blur the unbiased mathematical expectation estimation of the output image, wherein different blur units produce different blurring degrees,
[0251] - at least two attention units, each of which is configured to form an attention map based on the supplied reference image with blurring and the unbiased mathematical expectation estimation of the output image,
[0252] - a probability calculation unit configured to calculate weights defining the significance of each point of the output image with a given blurring degree from the attention maps;
[0253] wherein the loss function calculation unit is configured to calculate a loss function value further taking into account the weights and the unbiased mathematical expectation estimations of the output image with different blurring degrees.
[0254] Meanwhile, the proposed device may further comprise: a unit for weighting outputs of the trainable neural network, located before (upstream of) the statistics estimator unit and comprising:
[0255] - at least two attention units, each of which is configured to form an attention map based on the reference image and one image from the set of output images, wherein each attention map reflects the degree of proximity of each pixel of the reference image to each pixel of the output image from the set of output images,
[0256] - a probability calculation unit configured to calculate weights based on the attention maps, wherein the weights define the significance of each point of this output image from the set of output images;
[0257] wherein the statistics estimator unit is configured to determine statistics further taking into account said weights.
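A minimal sketch of the weighting unit, under hypothetical assumptions (flat 1-D images, Gaussian-similarity attention with a parameter `tau`, per-pixel normalisation as the probability calculation): each output image gets its own attention map against the reference, the maps are normalised per pixel across the set of output images, and the statistics estimator then forms a weighted point-wise mean instead of a plain one. The function names are illustrative only.

```python
import math
from typing import List, Sequence


def attention_map(reference: Sequence[float], output: Sequence[float], tau: float) -> List[float]:
    """Per-pixel proximity of an output image to the reference, in (0, 1]."""
    return [math.exp(-((r - o) ** 2) / tau) for r, o in zip(reference, output)]


def output_weights(reference, outputs, tau=0.01) -> List[List[float]]:
    """weights[i][p]: significance of pixel p of output image i (sums to 1 over i)."""
    maps = [attention_map(reference, o, tau) for o in outputs]
    weights = [[0.0] * len(reference) for _ in outputs]
    for p in range(len(reference)):
        total = sum(m[p] for m in maps)
        for i, m in enumerate(maps):
            weights[i][p] = m[p] / total if total > 0.0 else 1.0 / len(outputs)
    return weights


def weighted_mean(outputs, weights) -> List[float]:
    """Statistics estimator using the weights: per-pixel weighted average of the outputs."""
    return [sum(w[p] * o[p] for o, w in zip(outputs, weights)) for p in range(len(outputs[0]))]


reference = [0.5, 0.5]
outputs = [[0.5, 0.5], [0.9, 0.5]]  # the second output has an artifact at pixel 0
stat = weighted_mean(outputs, output_weights(reference, outputs))
```

At pixel 0 the artifact receives a weight of roughly 1e-7, so the weighted mean stays close to 0.5, whereas a plain mean would give 0.7.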
[0258] Meanwhile, the proposed device may further comprise:
[0259] a unit for weighting outputs of the trainable neural network, located before the statistics estimator unit and comprising:
[0260] - at least two attention units configured to form attention maps based on the reference image and one image from the set of output images, wherein each attention map reflects the degree of proximity of each pixel of the reference image to each pixel of this output image from the set of output images;
[0261] - a probability calculation unit configured to calculate weights based on the attention maps, wherein the weights define the significance of each point of this output image from the set of output images;
[0262] wherein the statistics estimator unit is configured to determine statistics further taking into account said weights.
[0263] In one embodiment, a method for training a neural network to enhance image / video quality is proposed, the method being performed by means of the device described above and comprising the steps of:
[0264] supplying, to an input of a trainable neural network, at least two sets of images with introduced degradations obtained in a degradation modeling unit, each of which consists of at least one image corresponding to said reference image, wherein all images in each set have different introduced degradations;
[0265] generating, by the trainable neural network, one output image for each set to form a set of output images;
[0266] supplying the obtained set of output images to the statistics estimator unit where a plurality of statistics of the set of output images is calculated;
[0267] calculating a loss function value based on the reference image and the plurality of statistics of the set of output images in a loss function calculation unit;
[0268] supplying the loss function value into the optimizer unit which, based on the loss function value, updates the weights of the trainable neural network by a backpropagation technique.
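The five training steps above can be traced end to end in a deliberately small sketch. The assumptions are substantial: the trainable neural network is replaced by a hypothetical two-parameter model (average the input set, then apply a gain `a` and a bias `b`), the degradation modeling unit applies under-exposure plus Gaussian noise, the statistic is the point-wise unbiased mean, the loss is the mean squared error against the reference, and the optimizer is plain gradient descent with hand-derived gradients standing in for backpropagation. All names and hyperparameters are illustrative.

```python
import random
from typing import List, Sequence, Tuple


def degrade(reference: Sequence[float], rng: random.Random,
            exposure: float = 0.5, noise_sigma: float = 0.05) -> List[float]:
    """Hypothetical degradation model: under-exposure plus additive Gaussian noise."""
    return [exposure * v + rng.gauss(0.0, noise_sigma) for v in reference]


def toy_network(image_set, a: float, b: float) -> Tuple[List[float], List[float]]:
    """Hypothetical stand-in for the network: fuse the set by averaging, then a * x + b.

    Returns the output image and the fused input (needed for the gradient).
    """
    n = len(image_set)
    fused = [sum(img[p] for img in image_set) / n for p in range(len(image_set[0]))]
    return [a * x + b for x in fused], fused


def train(reference, num_sets=4, set_size=2, steps=500, lr=0.4, seed=0):
    rng = random.Random(seed)
    a, b = 1.0, 0.0
    num_pixels = len(reference)
    history = []
    for _ in range(steps):
        # Degradation modeling unit: num_sets sets, each of set_size degraded images.
        sets = [[degrade(reference, rng) for _ in range(set_size)] for _ in range(num_sets)]
        # Trainable network: one output image per set -> set of output images.
        results = [toy_network(s, a, b) for s in sets]
        outputs = [out for out, _ in results]
        # Statistics estimator: point-wise unbiased mean over the output images.
        mean_est = [sum(o[p] for o in outputs) / num_sets for p in range(num_pixels)]
        # Loss function calculation unit: MSE between the statistic and the reference.
        residual = [m - r for m, r in zip(mean_est, reference)]
        history.append(sum(e * e for e in residual) / num_pixels)
        # Optimizer unit: since mean_est[p] = a * xbar[p] + b, the gradients follow directly.
        xbar = [sum(f[p] for _, f in results) / num_sets for p in range(num_pixels)]
        grad_a = 2.0 / num_pixels * sum(e * x for e, x in zip(residual, xbar))
        grad_b = 2.0 / num_pixels * sum(residual)
        a -= lr * grad_a
        b -= lr * grad_b
    return a, b, history


reference = [i / 15 for i in range(16)]
a, b, history = train(reference)
```

In this toy setting the gain settles near 2, undoing the simulated under-exposure. Because the loss is applied to the mean rather than to each output, systematic error in the estimate is penalised, while output-to-output variation that averages out is not penalised directly.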
[0269] Meanwhile, the method may further comprise steps performed in an additional adaptive blur correction unit located after the statistics estimator unit and comprising at least two blur units, at least two attention units, and a probability calculation unit, which are operatively coupled,
[0270] wherein the further steps are as follows:
[0271] - forming a reference image with blurring in each of the at least two blur units, wherein the blurring degrees obtained in the reference images are different;
[0272] - blurring the unbiased mathematical expectation estimation of the output image in each of the at least two blur units, wherein the blurring degrees are different;
[0273] - calculating attention maps, in each of the at least two attention units, based on the supplied reference image with blurring and the unbiased mathematical expectation estimation of the output image with blurring;
[0274] - calculating, in the probability calculation unit, the weights based on the attention maps, wherein the weights define the significance of each point of the output image with a given blurring degree;
[0275] wherein the loss function value is calculated further taking into account the weights and the unbiased mathematical expectation estimations of the output image, which have different blurring degrees.
[0276] In another embodiment, the method may further comprise steps performed in an additional unit for weighting outputs of the trainable neural network, located before the statistics estimator unit and comprising at least two attention units and a probability calculation unit,
[0277] wherein the further steps are as follows:
[0278] forming attention maps based on the reference image and one image from the set of output images in each of the attention units, wherein each attention map reflects the degree of proximity of each pixel of the reference image to each pixel of the output image from the set of output images;
[0279] calculating, in the probability calculation unit, the weights based on the attention maps, wherein the weights define the significance of each point of this output image from the set of output images;
[0280] wherein the statistics estimator unit determines statistics further taking into account said weights.
[0281] In another embodiment, the method may further comprise steps performed in an additional unit for weighting outputs of the trainable neural network, located before the statistics estimator unit and comprising at least two attention units and a probability calculation unit,
[0282] wherein the further steps are as follows:
[0283] forming attention maps based on the reference image and one image from the set of output images in each of the at least two attention units, wherein each attention map reflects the degree of proximity of each pixel of the reference image to each pixel of this output image from the set of output images;
[0284] calculating, in the probability calculation unit, the weights based on the attention maps, wherein the weights define the significance of each point of this output image from the set of output images;
[0285] wherein the statistics estimator unit determines the statistics further taking into account said weights.
[0286] In one embodiment, a device for training a neural network to enhance image / video quality is proposed, the device comprising the following operatively coupled units:
[0287] a degradation modeling unit configured to generate, for each reference image from a training dataset of images, one set of images with introduced degradations, wherein said set consists of at least one image corresponding to said reference image, wherein the introduced degradations are different for each image from the set;
[0288] a trainable neural network configured to generate one output image for said set;
[0289] an adaptive blur correction unit, which includes:
[0290] - at least two blur units,
[0291] wherein each blur unit is configured to:
[0292] form a reference image with blurring, wherein the blurring degrees obtained in each of the at least two blur units are different,
[0293] form an output image with blurring, wherein the blurring degrees obtained in each of the at least two blur units are different;
[0294] - at least two attention units, each of which is configured to form an attention map reflecting the degree of proximity of each pixel of the reference image with blurring to each pixel of the output image formed by the trainable neural network;
[0295] - a probability calculation unit configured to form weights from the attention maps, wherein the weights define the significance of each point of the output image with a given blurring degree;
[0296] a loss function calculation unit configured to calculate a loss function value based on the reference image, the weights, and the output image with different blurring degrees;
[0297] an optimizer unit configured to update the weights of the trainable neural network by a backpropagation technique.
[0298] In one embodiment, a method for training a neural network to enhance image / video quality is also proposed, the method being performed by means of the device described above and comprising the steps of:
[0299] supplying, to an input of a trainable neural network, one set of images with introduced degradations obtained in a degradation modeling unit, wherein said set consists of at least one image corresponding to said reference image, wherein the introduced degradations are different for each image from the set;
[0300] generating, by the trainable neural network, one output image based on said set;
[0301] in an adaptive blur correction unit:
[0302] - forming a reference image with blurring in each of at least two blur units, wherein the blurring degrees obtained in the reference images are different;
[0303] - forming an output image with blurring in each of the at least two blur units, wherein the blurring degrees are different;
[0304] - calculating, in each of at least two attention units, an attention map based on the supplied reference image with blurring and the output image;
[0305] - calculating, in a probability calculation unit, weights based on the attention maps, wherein the weights define the significance of each point of the output image with a given blurring degree;
[0306] calculating, in the loss function calculation unit, a loss function value based on the reference image, the weights, and the output images with different blurring degrees;
[0307] supplying the loss function value into the optimizer unit which, based on the loss function value, updates the weights of the trainable neural network by a backpropagation technique.
[0308] In one embodiment, a device for training a neural network to enhance image / video quality is also proposed, the device comprising the following operatively coupled units:
[0309] a degradation modeling unit configured to generate, for each reference image from a training dataset of images, at least two sets of images with introduced degradations, wherein each set consists of at least one image corresponding to said reference image, wherein the introduced degradations are different for each image from the set;
[0310] a trainable neural network configured to generate one output image for each set of images with introduced degradations to form a set of output images;
[0311] an adaptive blur correction unit, which includes:
[0312] - at least two blur units, each configured to:
[0313] form a reference image with blurring, wherein the blurring degrees obtained on the reference image in each of the at least two blur units are different;
[0314] form a set of output images with blurring from each of the output images of the set of output images, wherein the blurring degree is the same within one set of output images with blurring, wherein the blurring degrees are different between the sets of output images with blurring;
[0315] - at least two attention units, each of which is configured to calculate an attention map based on the supplied reference image with blurring and each of the output images from the set of output images to obtain sets of attention maps, wherein each of the sets of attention maps corresponds to one of the blurring degrees;
[0316] - a probability calculation unit configured to calculate sets of weights from each of the attention maps in each set of attention maps, wherein the weights define the significance of each point of the output image with a given blurring degree, wherein each of the sets of weights corresponds to one of the blurring degrees;
[0317] a statistics estimator unit configured to calculate a plurality of statistics based on the obtained sets of output images with different blurring degrees and the obtained sets of weights;
[0318] a loss function calculation unit configured to calculate a loss function value based on the reference image and said plurality of statistics;
[0319] an optimizer unit configured to update the weights of the trainable neural network by a backpropagation technique.
[0320] In one embodiment, a method for training a neural network to enhance image / video quality is also proposed, the method being performed by means of the device described above and comprising the steps of:
[0321] supplying, to an input of a trainable neural network, at least two sets of images with introduced degradations obtained in a degradation modeling unit, each of which consists of at least one image corresponding to said reference image, wherein all images in each set have different introduced degradations;
[0322] generating, by the trainable neural network, one output image for each set to form a set of output images;
[0323] in an adaptive blur correction unit:
[0324] - forming a reference image with blurring in each of at least two blur units, wherein the blurring degrees obtained in the reference images are different;
[0325] - forming a set of output images with blurring from each of the output images of the set of output images, wherein the blurring degree is the same within the set of output images with blurring, wherein the blurring degrees are different between the sets of output images with blurring;
[0326] - calculating attention maps based on the supplied reference image with blurring and each of the output images from the set of output images to obtain sets of attention maps, wherein each of the sets of attention maps corresponds to one of the blurring degrees different from the others;
[0327] - calculating, in a probability calculation unit, sets of weights from each of the attention maps in each set of attention maps, wherein the weights define the significance of each point of the output image with a given blurring degree, wherein each of the sets of weights corresponds to one of the blurring degrees different from the others;
[0328] - calculating, in a statistics estimator unit, a plurality of statistics based on the obtained sets of output images with different blurring degrees and the obtained sets of weights;
[0329] calculating, in a loss function calculation unit, a loss function value based on the reference image and said plurality of statistics;
[0330] updating the weights of the trainable neural network by a backpropagation technique by means of the optimizer unit.
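For the variant in which every output image is blurred at several degrees, a sketch under hypothetical assumptions (1-D images, a box blur per degree, Gaussian-similarity attention with `tau`) might compute, for each blurring degree, per-pixel weights across the set of output images and a weighted point-wise mean, then average the squared error against the equally blurred reference over all degrees. How the weights are normalised and how the per-degree errors are combined are assumptions, not details given in the text; the function names are illustrative.

```python
import math
from typing import List, Sequence


def box_blur(img: Sequence[float], radius: int) -> List[float]:
    """1-D box blur with clamped edges; radius 0 returns an unblurred copy."""
    n = len(img)
    return [
        sum(img[min(max(j, 0), n - 1)] for j in range(i - radius, i + radius + 1)) / (2 * radius + 1)
        for i in range(n)
    ]


def weighted_stats_per_degree(reference, outputs, radii=(0, 1, 2), tau=0.01):
    """For each blur degree, return (blurred reference, weighted mean of blurred outputs)."""
    stats = []
    for r in radii:
        ref_k = box_blur(reference, r)
        outs_k = [box_blur(o, r) for o in outputs]  # one set of blurred outputs per degree
        # Attention maps for this degree: proximity of each blurred output to ref_k.
        maps = [[math.exp(-((x - y) ** 2) / tau) for x, y in zip(ref_k, o)] for o in outs_k]
        mean_k = []
        for p in range(len(reference)):
            total = sum(m[p] for m in maps)
            # Weights for this degree: normalised across the output images at pixel p.
            w = [m[p] / total if total > 0.0 else 1.0 / len(outputs) for m in maps]
            mean_k.append(sum(wi * o[p] for wi, o in zip(w, outs_k)))
        stats.append((ref_k, mean_k))
    return stats


def multi_blur_loss(reference, outputs, radii=(0, 1, 2), tau=0.01) -> float:
    """Average over blur degrees of the MSE between blurred reference and weighted mean."""
    stats = weighted_stats_per_degree(reference, outputs, radii, tau)
    per_degree = [
        sum((m - r) ** 2 for r, m in zip(ref_k, mean_k)) / len(reference)
        for ref_k, mean_k in stats
    ]
    return sum(per_degree) / len(per_degree)


reference = [0.5] * 6
outputs = [list(reference), list(reference), [1.0] * 6]  # the third output is an outlier
loss = multi_blur_loss(reference, outputs)
```

Here the outlier receives a near-zero weight at every blurring degree, so the loss stays near zero, whereas the squared error of the plain mean would be about 0.028.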
[0331] At least one of the plurality of modules (units) may be implemented through an artificial intelligence (AI) model. A function associated with AI may be performed by means of a non-volatile memory, a volatile memory, and a processor.
[0332] The processor may include one or more processors. The one or more processors may be a general-purpose processor such as a central processing unit (CPU) or an application processor (AP), a graphics-only processing unit such as a graphics processing unit (GPU) or a visual processing unit (VPU), and / or a dedicated AI processor such as a neural processing unit (NPU).
[0333] The one or more processors control processing of input data in accordance with a predefined operation rule or an artificial intelligence (AI) model stored in a non-volatile memory and a volatile memory. The predefined operation rule or artificial intelligence model is provided by means of training or learning. Here, “provided through training” means that a predefined operation rule or an AI model with a desired characteristic is created by applying a learning algorithm to a plurality of training data.
[0334] Fig. 8 illustrates an embodiment of the controlling method of the electronic apparatus.
[0335] The electronic apparatus may include memory storing instructions and at least one processor including processing circuitry.
[0336] The at least one processor may obtain a reference image from a training image dataset, obtain a plurality of degradation images corresponding to the reference image, obtain a plurality of output images by inputting the plurality of degradation images into a neural network model, obtain a statistic value based on the plurality of output images, obtain a loss function value based on the reference image and the statistic value, and update at least one weight of the neural network model based on the loss function value.
[0337] The reference image may refer to an original, non-degraded image included in the training image dataset and used as a ground-truth target for training the neural network model (i.e., the image processing model).
[0338] The reference image may be described as a first image, a baseline image, a target image, or a ground-truth image.
[0339] A degradation image may refer to an image generated from a reference image by applying one or more degradation operations. A degradation image may represent a modified version of the reference image that includes artificially introduced quality deterioration.
[0340] The degradation image may be described as a degraded image, a deterioration image, a distortion-applied image, or a degradation-processed image.
[0341] The output image may refer to an image generated by a neural network model in response to one or more degradation images. The output image may be described as a reconstructed image, an enhanced image, a generated image, or a network-produced image.
[0342] The statistic value may refer to a numerical measure calculated from a plurality of output images generated by a neural network model. The statistic value may represent a mathematical characteristic such as an unbiased expectation estimation, a standard deviation, or another functional metric derived from the output images. The statistic value may be a parameter used to constrain a loss function during model training.
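One hypothetical way a loss could be constrained by more than one statistic is sketched below, assuming flat 1-D images: the unbiased mean is compared with the reference, and the per-pixel standard deviation across output images (the square root of the N - 1 sample variance) is added as a penalty scaled by a hypothetical weight `spread_weight`. The penalty form is an assumption; the text only states that such statistics may constrain the loss.

```python
import math
from typing import List, Sequence


def pointwise_mean(outputs: Sequence[Sequence[float]]) -> List[float]:
    """Unbiased mathematical expectation estimation: per-pixel sample mean."""
    n = len(outputs)
    return [sum(o[p] for o in outputs) / n for p in range(len(outputs[0]))]


def pointwise_std(outputs: Sequence[Sequence[float]]) -> List[float]:
    """Square root of the Bessel-corrected (N - 1) sample variance at each pixel.

    Note: the variance estimate is unbiased, but its square root is not an unbiased
    estimate of the standard deviation.
    """
    n = len(outputs)
    if n < 2:
        raise ValueError("at least two output images are required")
    mean = pointwise_mean(outputs)
    return [
        math.sqrt(sum((o[p] - mean[p]) ** 2 for o in outputs) / (n - 1))
        for p in range(len(mean))
    ]


def statistics_loss(reference, outputs, spread_weight=0.1) -> float:
    """Hypothetical loss: MSE of the mean estimate plus a penalty on output spread."""
    mean = pointwise_mean(outputs)
    std = pointwise_std(outputs)
    num_pixels = len(reference)
    fidelity = sum((m - r) ** 2 for m, r in zip(mean, reference)) / num_pixels
    spread = sum(std) / num_pixels
    return fidelity + spread_weight * spread


print(statistics_loss([0.5], [[0.6], [0.4]]))  # mean matches; only the spread term remains
```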
[0343] The statistic value may be described as a statistical measure, a numerical characteristic, a statistical parameter, or a functional metric.
[0344] The loss function value may refer to a numerical value calculated based on a reference image and one or more statistics derived from output images. The loss function value may represent an evaluation measure indicating how closely a neural network model reconstructs or approximates the reference image.
[0345] The loss function value may be described as a training loss, an error value, a reconstruction loss, or a model-update metric.
[0346] The at least one processor may adjust one or more weights of the neural network model based on the calculated loss function value. The at least one processor may modify the weights so that the neural network model produces output images with a reduced loss in subsequent training iterations.
[0347] The at least one processor may obtain the reference image among a plurality of reference images included in the training image dataset.
[0348] The at least one processor may obtain the plurality of degradation images corresponding to the reference image by at least one of adding noise to the reference image, adding blurring to the reference image, shifting the reference image, rotating the reference image, or inserting a random frame into the reference image.
[0349] The at least one processor may obtain a degradation image by inserting noise into the image frame of the reference image. The at least one processor may thereby reduce the visual quality of the reference image by introducing sensor-like noise that deteriorates pixel-level clarity.
[0350] The at least one processor may obtain a degradation image by adding blurring to the reference image. The at least one processor may thereby degrade the original sharpness of the reference image to simulate motion blur or defocus blur.
[0351] The at least one processor may obtain a degradation image by shifting the reference image. The at least one processor may thereby generate a lower-quality image by misaligning the frame relative to the original reference image.
[0352] The at least one processor may obtain a degradation image by rotating the reference image. The at least one processor may thereby degrade the image quality by introducing rotational offsets that distort the original spatial arrangement.
[0353] The at least one processor may obtain a degradation image by inserting a random frame into the reference image. The at least one processor may thereby lower the quality of the resulting image by mixing content that does not match the original reference frame.
[0354] Based on the reference image being shifted or rotated, the at least one processor may obtain the plurality of degradation images corresponding to the reference image by compensating the shifted reference image or the rotated reference image.
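Part of a degradation modeling unit can be sketched as follows, assuming small grayscale images stored as lists of rows. The sketch covers only noise and integer shifts: each image is shifted to simulate camera shake, compensated by shifting it back (standing in for an alignment step), and then corrupted with Gaussian noise. Pixels shifted out of the frame are lost during compensation, so a residual border degradation remains. Blurring, rotation, and random-frame insertion are omitted, and all names and parameters are hypothetical.

```python
import random
from typing import List

Image = List[List[float]]


def add_noise(img: Image, sigma: float, rng: random.Random) -> Image:
    """Additive Gaussian noise, a rough stand-in for sensor noise."""
    return [[v + rng.gauss(0.0, sigma) for v in row] for row in img]


def shift(img: Image, dy: int, dx: int, fill: float = 0.0) -> Image:
    """Integer translation by (dy, dx); pixels entering from outside are set to `fill`."""
    h, w = len(img), len(img[0])
    out = [[fill] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            sy, sx = y - dy, x - dx
            if 0 <= sy < h and 0 <= sx < w:
                out[y][x] = img[sy][sx]
    return out


def compensate_shift(img: Image, dy: int, dx: int, fill: float = 0.0) -> Image:
    """Undo a known shift; content that left the frame cannot be recovered."""
    return shift(img, -dy, -dx, fill)


def make_degraded_set(reference: Image, set_size: int, rng: random.Random,
                      sigma: float = 0.02, max_shift: int = 1) -> List[Image]:
    """Hypothetical set generator: shift, compensate, then add noise for each image."""
    degraded = []
    for _ in range(set_size):
        dy = rng.randint(-max_shift, max_shift)
        dx = rng.randint(-max_shift, max_shift)
        moved = shift(reference, dy, dx)            # simulated camera shake
        aligned = compensate_shift(moved, dy, dx)   # compensation step
        degraded.append(add_noise(aligned, sigma, rng))
    return degraded


reference = [[1.0] * 4 for _ in range(4)]
restored = compensate_shift(shift(reference, 1, 0), 1, 0)
print(restored[-1])  # [0.0, 0.0, 0.0, 0.0]: the bottom row was lost during the shift
```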
[0355] The at least one processor may obtain a first set of a plurality of degradation images corresponding to the reference image, obtain a second set of a plurality of degradation images corresponding to the reference image, obtain a first output image by inputting the first set into the neural network model, obtain a second output image by inputting the second set into the neural network model, and obtain the statistic value based on the first output image and the second output image.
[0356] The plurality of degradation images included in the second set may be different from the plurality of degradation images included in the first set.
[0357] The statistic value may include an unbiased mathematical expectation estimation of the plurality of output images. The unbiased mathematical expectation estimation may represent a point-by-point mean value computed across multiple output images generated under different degradation conditions.
[0358] The plurality of degradation images may simulate processes that occur when shooting with a real camera. The plurality of degradation images may reproduce real-world artifacts such as noise, motion blur, misalignment, exposure variations, or unintended frame mixing that naturally arise during actual image capture.
[0359] The at least one processor may update the at least one weight of the neural network model based on the loss function value by using a backpropagation technique. The backpropagation technique may refer to a training algorithm that adjusts weights of a neural network model by propagating an error signal backward through the network.
[0360] The backpropagation technique may be described as a gradient-based training method, an error-propagation algorithm, a backward-gradient update method, or a gradient-descent-driven training technique.
[0361] The at least one processor may store the neural network model for enhancing a quality of an image in the memory. Based on an input image being received, the at least one processor may obtain a result image corresponding to the input image by inputting the input image to the neural network model.
[0362] In an embodiment, a controlling method of an electronic apparatus comprises: obtaining (S810) a reference image from a training image dataset; obtaining (S820) a plurality of degradation images corresponding to the reference image; obtaining (S830) a plurality of output images by inputting the plurality of degradation images into a neural network model; obtaining (S840) a statistic value based on the plurality of output images; obtaining (S850) a loss function value based on the reference image and the statistic value; and updating (S860) at least one weight of the neural network model based on the loss function value.
[0363] The obtaining the reference image may include obtaining the reference image among a plurality of reference images included in the training image dataset.
[0364] The obtaining the plurality of degradation images may include obtaining the plurality of degradation images corresponding to the reference image by at least one of adding noise to the reference image, adding blurring to the reference image, shifting the reference image, rotating the reference image, or inserting a random frame into the reference image.
[0365] The obtaining the plurality of degradation images may include, based on the reference image being shifted or rotated, obtaining the plurality of degradation images corresponding to the reference image by compensating the shifted reference image or the rotated reference image.
[0366] The obtaining the plurality of degradation images may include obtaining a first set of a plurality of degradation images corresponding to the reference image, and obtaining a second set of a plurality of degradation images corresponding to the reference image.
[0367] The obtaining the plurality of output images may include obtaining a first output image by inputting the first set into the neural network model, and obtaining a second output image by inputting the second set into the neural network model.
[0368] The obtaining the statistic value may include obtaining the statistic value based on the first output image and the second output image.
[0369] Although the invention is described in connection with some illustrative embodiments, it should be understood that the essence of the invention is not limited to these specific embodiments. On the contrary, the invention is intended to include all alternatives, modifications, and equivalents that can be included within the spirit and scope of the claims.
[0370] In addition, the invention covers all equivalents of the claimed invention, even if the claims are amended during examination.
[0371] Meanwhile, according to an embodiment of the disclosure, the aforementioned various embodiments may be implemented as software including instructions stored in machine-readable storage media, which can be read by machines (e.g., computers). The machines refer to apparatuses that call instructions stored in a storage medium and can operate according to the called instructions, and the apparatuses may include an electronic apparatus according to the aforementioned embodiments. When an instruction is executed by a processor, the processor may perform a function corresponding to the instruction by itself, or by using other components under its control. An instruction may include code that is generated or executed by a compiler or an interpreter. A machine-readable storage medium may be provided in the form of a non-transitory storage medium. Here, the term ‘non-transitory’ only means that a storage medium does not include signals and is tangible; the term does not distinguish between a case where data is stored in the storage medium semi-permanently and a case where data is stored temporarily.
[0372] Also, according to an embodiment of the disclosure, the methods according to the aforementioned various embodiments may be provided while being included in a computer program product. A computer program product refers to a product that can be traded between a seller and a buyer. A computer program product can be distributed in the form of a machine-readable storage medium (e.g., compact disc read only memory (CD-ROM)), or distributed online through an application store. In the case of online distribution, at least a portion of a computer program product may be at least temporarily stored in, or temporarily generated in, a storage medium such as the server of the manufacturer, the server of the application store, or the memory of a relay server.
[0373] In addition, each of the components (e.g., a module or a program) according to the aforementioned various embodiments may consist of a single object or a plurality of objects. Also, among the aforementioned sub-components, some sub-components may be omitted, or other sub-components may be further included in the various embodiments. Alternatively or additionally, some components (e.g., a module or a program) may be integrated into one object and perform the functions that were performed by each of the components before integration identically or in a similar manner. Further, operations performed by a module, a program, or other components according to the various embodiments may be executed sequentially, in parallel, repetitively, or heuristically. Alternatively, at least some of the operations may be executed in a different order or omitted, or other operations may be added.
[0374] Also, while preferred embodiments of the disclosure have been shown and described, the disclosure is not limited to the aforementioned specific embodiments, and it is apparent that various modifications may be made by those having ordinary skill in the technical field to which the disclosure belongs, without departing from the gist of the disclosure as claimed by the appended claims. Further, it is intended that such modifications are not to be interpreted independently from the technical idea or prospect of the disclosure.
Claims
1. An electronic apparatus comprising: memory storing instructions, and at least one processor including processing circuitry, wherein the instructions, when executed by the at least one processor individually or collectively, cause the electronic apparatus to: obtain a reference image from a training image dataset, obtain a plurality of degradation images corresponding to the reference image, obtain a plurality of output images by inputting the plurality of degradation images into a neural network model, obtain a statistic value based on the plurality of output images, obtain a loss function value based on the reference image and the statistic value, and update at least one weight of the neural network model based on the loss function value.
2. The electronic apparatus as claimed in claim 1, wherein the instructions, when executed by the at least one processor individually or collectively, cause the electronic apparatus to: obtain the reference image among a plurality of reference images included in the training image dataset.
3. The electronic apparatus as claimed in claim 1, wherein the instructions, when executed by the at least one processor individually or collectively, cause the electronic apparatus to: obtain the plurality of degradation images corresponding to the reference image by at least one of adding noise to the reference image, adding blurring to the reference image, shifting the reference image, rotating the reference image, or inserting a random frame into the reference image.
4. The electronic apparatus as claimed in claim 3, wherein the instructions, when executed by the at least one processor individually or collectively, cause the electronic apparatus to: based on the reference image being shifted or rotated, obtain the plurality of degradation images corresponding to the reference image by compensating the shifted reference image or the rotated reference image.
5. The electronic apparatus as claimed in claim 1, wherein the instructions, when executed by the at least one processor individually or collectively, cause the electronic apparatus to: obtain a first set of a plurality of degradation images corresponding to the reference image, obtain a second set of a plurality of degradation images corresponding to the reference image, obtain a first output image by inputting the first set into the neural network model, obtain a second output image by inputting the second set into the neural network model, and obtain the statistic value based on the first output image and the second output image.
6. The electronic apparatus as claimed in claim 5, wherein the plurality of degradation images included in the second set are different from the plurality of degradation images included in the first set.
7. The electronic apparatus as claimed in claim 1, wherein the statistic value includes an unbiased mathematical expectation estimation of the plurality of output images.
8. The electronic apparatus as claimed in claim 1, wherein the plurality of degradation images simulate processes that occur when shooting with a real camera.
9. The electronic apparatus as claimed in claim 1, wherein the instructions, when executed by the at least one processor individually or collectively, cause the electronic apparatus to: update the at least one weight of the neural network model based on the loss function value by using a backpropagation technique.
10. The electronic apparatus as claimed in claim 1, wherein the instructions, when executed by the at least one processor individually or collectively, cause the electronic apparatus to: store the neural network model for enhancing a quality of an image in the memory, and based on an input image being received, obtain a result image corresponding to the input image by inputting the input image to the neural network model.
11. A controlling method of an electronic apparatus, the controlling method comprising: obtaining a reference image from a training image dataset, obtaining a plurality of degradation images corresponding to the reference image, obtaining a plurality of output images by inputting the plurality of degradation images into a neural network model, obtaining a statistic value based on the plurality of output images, obtaining a loss function value based on the reference image and the statistic value, and updating at least one weight of the neural network model based on the loss function value.
12. The controlling method of claim 11, wherein the obtaining the reference image comprises: obtaining the reference image among a plurality of reference images included in the training image dataset.
13. The controlling method of claim 11, wherein the obtaining the plurality of degradation images comprises: obtaining the plurality of degradation images corresponding to the reference image by at least one of adding noise to the reference image, adding blurring to the reference image, shifting the reference image, rotating the reference image, or inserting a random frame into the reference image.
14. The controlling method of claim 13, wherein the obtaining the plurality of degradation images comprises: based on the reference image being shifted or rotated, obtaining the plurality of degradation images corresponding to the reference image by compensating the shifted reference image or the rotated reference image.
15. The controlling method of claim 11, wherein the obtaining the plurality of degradation images comprises: obtaining a first set of a plurality of degradation images corresponding to the reference image, and obtaining a second set of a plurality of degradation images corresponding to the reference image, wherein the obtaining the plurality of output images comprises: obtaining a first output image by inputting the first set into the neural network model, and obtaining a second output image by inputting the second set into the neural network model, and wherein the obtaining the statistic value comprises: obtaining the statistic value based on the first output image and the second output image.