A single-image super-resolution reconstruction method with hair-level detail

The method combines a non-local attention mechanism and depthwise separable convolution in a dual-branch network structure. This reduces the artifacts and blur that arise in image super-resolution reconstruction and restores detail down to the level of individual hairs, improving image quality and detail restoration.

CN116362969B · Active · Publication Date: 2026-06-26 · GUANGZHOU UNIVERSITY

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents (China)
Current Assignee / Owner
GUANGZHOU UNIVERSITY
Filing Date
2023-03-02
Publication Date
2026-06-26

AI Technical Summary

Technical Problem

Existing super-resolution reconstruction techniques suffer from artifacts and blurring, and they fail to make effective use of global information within the image. The results have low fidelity or are overly smooth.

Method used

We combine a non-local attention mechanism with depthwise separable convolution in a dual-branch network structure and add a channel attention mechanism after the two branches. We remove blur kernel estimation, add bicubic interpolation upsampling, and simplify reconstruction to the direct mapping $I_{HR} = F(I_{LR})$. The network is trained with L1 loss, perceptual loss, and adversarial loss.

Benefits of technology

It achieves super-resolution reconstruction with hair-level detail. It restores local fragments and texture details and improves image quality. The reported results show clear gains over existing techniques, especially on textures and edges.

✦ Generated by Eureka AI based on patent content.

Figure CN116362969B_ABST

Abstract

The application relates to the technical field of single-image super-resolution. It discloses a single-image super-resolution reconstruction method with hair-level detail. The method first introduces a non-local attention mechanism that restores local fragments by learning from the whole image region. The inventors then found that existing methods do not need to use or estimate a blur kernel. Based on this finding, a dual-branch network structure is created that combines the non-local attention mechanism with depthwise separable convolution. Neither component considers information interaction between channels, so a channel attention mechanism is connected after the two branches to provide it.

Description

Technical Field

[0001] This invention relates to the field of single-image super-resolution technology, specifically to a single-image super-resolution reconstruction method with hair-level detail.

Background Technology

[0002] With the rapid development of mobile devices such as smartphones and tablets, people are no longer satisfied with browsing low-quality images. They prefer high-resolution images with better visual quality. At the same time, with the development of network technology, most electronic devices can now receive and display ultra-high-resolution images. However, images are easily distorted during acquisition, transmission, and storage. Single-image super-resolution reconstruction has therefore become an important research topic in computer vision. Furthermore, images viewed on the internet in practical applications have no original image available as a reference. This has driven the development of blind image super-resolution reconstruction.

[0003] Distortion generally falls into two kinds: natural (real) distortion and artificial (synthetic) distortion. Real distortion is a mixture of natural distortions that arise during shooting, such as underexposure, overexposure, blur caused by photographer movement, and compression errors. Synthetic distortion is introduced artificially. Examples include white noise, Gaussian blur, JPEG2000 compression, salt-and-pepper noise, and reduced global contrast. Both kinds are widespread in images. A method that can perform super-resolution reconstruction on images with any type of distortion is therefore of great significance.

[0004] Existing technical solution 1: Liu Ruopeng, A super-resolution reconstruction method, 2020.

[0005] This invention provides a super-resolution reconstruction method, comprising: establishing an image dataset; constructing a neural network structure, wherein the neural network structure is used to extract features from the image dataset during neural network training; establishing a loss function for the neural network structure, wherein the loss function is used to guide neural network training; training the image dataset to obtain a neural network model; and reconstructing images using the neural network model, taking a low-resolution image as input and outputting a high-resolution image. The method improves upon SRGAN (Super-Resolution Generative Adversarial Network) by modifying the network structure of the generative network G-NET and improving the loss function. Because the improved G-NET extracts more accurate features, the super-resolution reconstruction effect is superior, resulting in better performance in detection, recognition, and semantic segmentation.

[0006] Existing technical solution 2: Li Jixiang, A method for super-resolution image reconstruction, 2020.

[0007] This invention discloses an image super-resolution reconstruction method comprising the following steps:
- dividing a low-resolution feature space and a high-resolution feature space into multiple paired low-resolution and high-resolution feature subspaces;
- establishing a linear mapping relationship between each paired low-resolution and high-resolution feature subspace;
- reconstructing a high-resolution image from a low-resolution image according to the linear mapping relationship.

The method can quickly obtain high-quality, high-resolution images.

[0008] Existing technical solution 3: Chang Kan, Image super-resolution method based on reconstruction, 2019.

[0009] This application discloses a reconstruction-based image super-resolution method comprising:
a) generating an initial HR image X0 from a low-resolution (LR) image Y, and using the initial HR image as the most recently reconstructed HR image X';
b) calculating a guiding kernel for each image block of X', and thereby establishing a corresponding homogeneous pixel extraction matrix;
c) using the homogeneous pixel extraction matrix to perform adaptive-shape block matching for each image block of X', and calculating the predicted value of the similar image block group of the i-th image block;
d) calculating the gradient prediction value of the HR image using X' and a pre-trained denoiser;
e) determining the current reconstructed HR image X. If the iteration limit has not been reached, X is used as the new X' and the process returns to step b); otherwise, X is saved or output.

This application can improve super-resolution performance.

[0010] The drawback of existing technical solution 1 is that GANs, although they can give images good visual quality, often introduce artifacts that lower the fidelity of the generated image.

[0011] The drawback of existing technical solutions 2 and 3 is that neither considers the global information inside the image. Moreover, images produced by methods that do not use GANs are often too blurry and smooth.

[0012] In summary, we propose a single-image super-resolution reconstruction method with hair-level detail.

Summary of the Invention

[0013] (I) Technical problems to be solved

[0014] To address the shortcomings of existing technologies, this invention provides a single-image super-resolution reconstruction method with hair-level detail. First, a non-local attention mechanism is introduced to recover local fragments by learning from the entire image region. We then found that existing methods do not need to use or estimate a blur kernel. Based on this finding, we create a dual-branch network structure that combines the non-local attention mechanism with depthwise separable convolution. Neither of these considers inter-channel information interaction, so we append a channel attention mechanism after the two branches to provide it.

[0015] (II) Technical Solution

[0016] To achieve the above-mentioned objectives, the present invention provides the following technical solution: a single-image super-resolution reconstruction method with hair-level detail, comprising the following steps:

[0017] Step 1: Apply a LayerNorm operation to the channel feature maps;

[0018] Step 2: Add a gated Dconv feed-forward network (GDFN) module, remove the blur kernel estimation module, and add bicubic interpolation upsampling;

[0019] Step 3: Introduce a non-local attention mechanism to recover local fragments by learning from the entire image region;

[0020] Step 4: Create a dual-branch network structure that combines the non-local attention mechanism with depthwise separable convolutions;

[0021] Step 5: Use the above network as the generator and a U-Net discriminator with spectral normalization as the discriminator, and first train a PSNR-oriented model with L1 loss. Then use the trained PSNR-oriented model as the generator and train a GAN-oriented model with a combination of L1 loss, perceptual loss, and adversarial loss. The total network loss is expressed as follows:

[0022] $L_{total} = L_{1} + L_{perc} + 0.1 \times L_{adv}$;

[0023] where $L_{1}$, $L_{perc}$, and $L_{adv}$ denote the L1 loss, perceptual loss, and adversarial loss, respectively;

[0024] Step 6: Set the loss weights to 1, 1, and 0.1, respectively. Compute the perceptual loss on the conv1, ..., conv5 feature maps of a pre-trained VGG19, taken before the activation function;

[0025] Step 7: The image degradation process is described as follows:

[0026] $I_{LR} = (I_{HR} \otimes K)\downarrow_{bic} + N$

[0027] where $I_{LR}$ denotes the low-resolution image and $I_{HR}$ the high-resolution image. $\otimes$ denotes the convolution operation, $K$ the blur kernel, $\downarrow_{bic}$ bicubic downsampling, and $N$ additive white Gaussian noise.

[0028] Step 8: Simplify the super-resolution reconstruction process into the following formula:

[0029] $I_{HR} = F(I_{LR})$;

[0030] where F denotes the single-image super-resolution reconstruction method with hair-level detail, that is, the mapping from low-resolution to high-resolution images learned by the network.
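The degradation model in Step 7 can be illustrated with a minimal pure-Python sketch. The helper names (`convolve2d`, `downsample`, `degrade`) are hypothetical. To keep the example short, plain s-fold subsampling stands in for the bicubic downsampling $\downarrow_{bic}$. This is an illustration of the formula, not the inventors' data pipeline.

```python
import random

Image = list[list[float]]  # grayscale image stored as rows of pixel values


def convolve2d(img: Image, kernel: Image) -> Image:
    """'Same'-size 2D convolution (flipped kernel) with replicated borders; odd kernel sizes assumed."""
    h, w = len(img), len(img[0])
    kh, kw = len(kernel), len(kernel[0])
    ph, pw = kh // 2, kw // 2
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for i in range(kh):
                for j in range(kw):
                    yy = min(max(y + i - ph, 0), h - 1)  # clamp to the image border
                    xx = min(max(x + j - pw, 0), w - 1)
                    acc += img[yy][xx] * kernel[kh - 1 - i][kw - 1 - j]  # flipped kernel
            out[y][x] = acc
    return out


def downsample(img: Image, s: int) -> Image:
    """Keep every s-th pixel: a simplified stand-in for bicubic downsampling."""
    return [row[::s] for row in img[::s]]


def degrade(hr: Image, kernel: Image, s: int, noise_sigma: float = 0.0, seed: int = 0) -> Image:
    """I_LR = (I_HR conv K) downsampled by s, plus additive white Gaussian noise N."""
    rng = random.Random(seed)
    lr = downsample(convolve2d(hr, kernel), s)
    return [[v + rng.gauss(0.0, noise_sigma) for v in row] for row in lr]


# Usage: an 8x8 ramp image, a toy 3x3 averaging kernel in place of K, scale factor 2.
hr_demo = [[float(x + y) for x in range(8)] for y in range(8)]
box = [[1.0 / 9.0] * 3 for _ in range(3)]
lr_demo = degrade(hr_demo, box, s=2, noise_sigma=0.01)
print(len(lr_demo), len(lr_demo[0]))  # 4 4
```

Under the simplification in [0030], the network never sees $K$ explicitly. It only sees pairs produced by a function like `degrade` and learns $F$ directly.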

[0031] Preferably, the dual-branch network structure comprises two branches. One branch passes through a Non-Local Sparse Attention (NLSA) block. The other passes through two layers of 3×3 depthwise separable convolutions. The outputs of the two branches are concatenated along the channel dimension.

[0032] Preferably, a 1×1 convolution is used to fuse the two branches, and channel attention (CA) is then used to handle information exchange between channels.
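A quick parameter count helps show why the local branch uses depthwise separable convolutions. This sketch rests on assumptions the text does not state:
- 64 feature channels;
- a bias on every layer;
- a depthwise separable convolution defined in the usual way, as a 3×3 depthwise convolution followed by a 1×1 pointwise convolution.

The function names are hypothetical.

```python
def conv_params(c_in: int, c_out: int, k: int) -> int:
    """Standard k x k convolution: one k x k filter per (input, output) channel pair, plus biases."""
    return c_in * c_out * k * k + c_out


def dsconv_params(c_in: int, c_out: int, k: int) -> int:
    """Depthwise k x k conv (one filter per input channel) followed by a 1x1 pointwise conv."""
    depthwise = c_in * k * k + c_in
    pointwise = c_in * c_out + c_out
    return depthwise + pointwise


def fusion_params(c: int) -> int:
    """1x1 conv mapping the 2C concatenated channels of both branches back to C channels."""
    return conv_params(2 * c, c, 1)


C = 64  # hypothetical channel width
print("two standard 3x3 convs:", 2 * conv_params(C, C, 3))  # 73856
print("two depthwise separable 3x3 convs:", 2 * dsconv_params(C, C, 3))  # 9600
print("1x1 fusion after concatenation:", fusion_params(C))  # 8256
```

Under these assumptions, the local branch needs about 7.7 times fewer parameters than two standard 3×3 convolutions while keeping a convolutional inductive bias.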

[0033] (III) Beneficial Effects

[0034] Compared with existing technologies, this invention provides a single-image super-resolution reconstruction method with hair-level detail, which has the following beneficial effects:

[0035] 1. This single-image super-resolution reconstruction method with hair-level detail uses non-local information within an image to perform super-resolution reconstruction, giving the result hair-level detail. As shown in Figure 3, it reconstructs animal hair, repetitive structures in buildings, and repetitive stripes very well.

[0036] 2. In the third row of Figure 4, the compared methods produce slanted stripes (red box). Our method preserves straight stripes almost identical to those of the original image. Existing techniques do not attend to information in similar regions elsewhere in the image, such as the horizontal stripes in the blue area. Our method exploits this non-local information. It therefore reproduces the same stripes as the original image, achieving super-resolution reconstruction with hair-level detail.

Attached Figure Description

[0037] Figure 1 is a diagram illustrating the network setup process;

[0038] Figure 2 is a schematic diagram showing the details of the final network;

[0039] Figure 3 shows visual results on the Set5, Set14, BSD100, Urban100, and Manga109 test sets;

[0040] Figure 4 shows visual results on the DIV2KRK test set;

[0041] Figure 5 is a schematic diagram of the ablation experiment.

Detailed Implementation

[0042] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.

[0043] Referring to Figures 1-5, a single-image super-resolution reconstruction method with hair-level detail includes the following:

[0044] The process of image degradation is described as follows:

[0045] $I_{LR} = (I_{HR} \otimes K)\downarrow_{bic} + N$

[0046] where $I_{LR}$ denotes the low-resolution image and $I_{HR}$ the high-resolution image. $\otimes$ denotes the convolution operation, $K$ the blur kernel, $\downarrow_{bic}$ bicubic downsampling, and $N$ additive white Gaussian noise. Our experiments show that good performance can be achieved without initializing the blur kernel. We therefore simplify the super-resolution reconstruction process to the following formula:

[0047] $I_{HR} = F(I_{LR})$

[0048] F is the method proposed in this invention. Through training, F learns the mapping from low-resolution images to high-resolution images directly, without additional blur kernel estimation.

[0049] Research shows that adding a channel attention (CA) module to a network can easily cause gradient explosion during training. As shown in Figure 1 (I), we solve this problem by applying a LayerNorm operation to the channel feature maps. As shown in Figure 1 (II), building on (I), we add a gated Dconv feed-forward network (GDFN) module, which acts as a nonlinear activation function. Finally, as shown in Figure 1 (III), we remove the blur kernel estimation module and add a bicubic interpolation upsampling operation.
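The LayerNorm and gating ideas of Figure 1 (I) and (II) can be sketched for a single pixel. This is a simplified illustration, not the patented module:
- LayerNorm is applied across one pixel's channel vector, without learnable affine parameters.
- The 3×3 depthwise convolutions of the GDFN are omitted. Only the 1×1 projections, the GELU gate, and the residual connection remain.
- The weight matrices are hypothetical inputs.

```python
import math

Vector = list[float]
Matrix = list[list[float]]


def layer_norm(x: Vector, eps: float = 1e-5) -> Vector:
    """Normalize one pixel's channel vector to zero mean and (nearly) unit variance."""
    mu = sum(x) / len(x)
    var = sum((v - mu) ** 2 for v in x) / len(x)
    return [(v - mu) / math.sqrt(var + eps) for v in x]


def gelu(v: float) -> float:
    """Exact GELU using the Gaussian CDF."""
    return 0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0)))


def matvec(w: Matrix, x: Vector) -> Vector:
    """Matrix-vector product; on a channel vector this acts like a 1x1 convolution."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def gated_ffn(x: Vector, w_gate: Matrix, w_value: Matrix, w_out: Matrix) -> Vector:
    """x + W_out(GELU(W_gate LN(x)) * W_value LN(x)); GDFN depthwise convs omitted."""
    z = layer_norm(x)
    gate = [gelu(v) for v in matvec(w_gate, z)]
    value = matvec(w_value, z)
    hidden = [g * v for g, v in zip(gate, value)]  # element-wise gating
    return [xi + yi for xi, yi in zip(x, matvec(w_out, hidden))]  # residual connection


# Usage: one pixel with C = 4 channels and a hidden width of 2.
x_demo = [1.0, 2.0, 3.0, 4.0]
w_gate_demo = [[0.5, -0.5, 0.5, -0.5], [0.1, 0.2, 0.3, 0.4]]
w_value_demo = [[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 1.0]]
w_out_demo = [[0.1, 0.0], [0.0, 0.1], [0.1, 0.1], [0.0, 0.0]]
print(gated_ffn(x_demo, w_gate_demo, w_value_demo, w_out_demo))
```

Normalizing before the gated projections keeps activations in a bounded range. This is consistent with the text's use of LayerNorm to counter the gradient explosion attributed to CA.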

[0050] Research shows that deepening a neural network by stacking 3×3 convolutions increases only the theoretical receptive field, not the effective receptive field. To address this issue, as shown in Figure 1, our network employs a Non-Local Attention (NLA) module to capture long-range dependencies. NLA ignores inter-channel information interaction, so we add a channel attention (CA) module after NLA to capture it. The NLA block improves performance but significantly increases the model's computational complexity. We therefore replace it with a Non-Local Sparse Attention (NLSA) module to reduce that complexity.
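The cost difference between dense non-local attention and a sparse variant can be sketched on a small list of feature vectors, one per spatial position. This sketch uses a simplified form of NLSA with these assumptions:
- Learned query, key, and value projections are omitted.
- Positions are grouped by a spherical locality-sensitive hash (the argmax over a random projection and its negation).
- Each position attends only within its own bucket.

The published NLSA design includes further details, such as chunking and multiple hashing rounds, that are not reproduced here. All function names are hypothetical.

```python
import math
import random

Vector = list[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def softmax(scores: list[float]) -> list[float]:
    m = max(scores)  # subtract the max for numerical stability
    e = [math.exp(s - m) for s in scores]
    total = sum(e)
    return [v / total for v in e]


def nonlocal_attention(feats: list[Vector]) -> list[Vector]:
    """Dense non-local attention: every position attends to every position (O(n^2) pairs)."""
    dim = len(feats[0])
    out = []
    for q in feats:
        w = softmax([dot(q, k) for k in feats])
        out.append([sum(wi * f[c] for wi, f in zip(w, feats)) for c in range(dim)])
    return out


def lsh_buckets(feats: list[Vector], n_buckets: int, seed: int = 0) -> list[int]:
    """Spherical LSH: bucket = argmax over [Rx, -Rx] for a random Gaussian projection R."""
    assert n_buckets % 2 == 0, "n_buckets must be even"
    rng = random.Random(seed)
    proj_rows = [[rng.gauss(0.0, 1.0) for _ in feats[0]] for _ in range(n_buckets // 2)]
    buckets = []
    for f in feats:
        proj = [dot(row, f) for row in proj_rows]
        scores = proj + [-p for p in proj]
        buckets.append(max(range(n_buckets), key=scores.__getitem__))
    return buckets


def sparse_nonlocal_attention(feats: list[Vector], n_buckets: int = 4, seed: int = 0) -> list[Vector]:
    """Attend only among positions sharing a hash bucket, so cost scales with bucket size."""
    buckets = lsh_buckets(feats, n_buckets, seed)
    out: list[Vector] = [[] for _ in feats]
    for b in set(buckets):
        idx = [i for i, bi in enumerate(buckets) if bi == b]
        for i, o in zip(idx, nonlocal_attention([feats[i] for i in idx])):
            out[i] = o
    return out


# Usage: 32 positions with 8 channels each.
rng_demo = random.Random(1)
feats_demo = [[rng_demo.uniform(-1.0, 1.0) for _ in range(8)] for _ in range(32)]
dense_demo = nonlocal_attention(feats_demo)
sparse_demo = sparse_nonlocal_attention(feats_demo, n_buckets=4)
```

Dense attention evaluates all 32 × 32 pairs. The bucketed version evaluates only pairs inside each bucket, which is the kind of saving that motivates replacing NLA with NLSA.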

[0051] Furthermore, to preserve the inductive bias of convolution, we propose a two-branch structure. One branch passes through an NLSA block, and the other passes through two 3×3 depthwise separable convolutions. The outputs of the two branches are then concatenated along the channel dimension. Neither the NLSA block nor the depthwise separable convolutions account for inter-channel information exchange. We therefore use a 1×1 convolution to fuse the two branches and then use CA to handle inter-channel information exchange. As shown in Figure 2, our network uses a U-Net structure. This limits the complexity added by increasing network depth and lets the network learn feature-map information at different scales. In this invention, we name the proposed network NLCUnet (Non-local&Local&Channel Unet).
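The fusion step (channel-wise concatenation, 1×1 convolution, then channel attention) can be sketched on tiny feature maps. The following are assumptions not given in the text:
- CA is taken to be a squeeze-and-excitation style module, as popularized by RCAN: global average pooling, a two-layer bottleneck with ReLU, and a sigmoid gate per channel.
- Feature maps are stored as `[channel][pixel]` lists.
- All weights are hypothetical inputs.
- Residual connections around the block are not shown.

```python
import math

FeatureMap = list[list[float]]  # [channel][flattened pixel]
Matrix = list[list[float]]


def conv1x1(fmap: FeatureMap, weight: Matrix) -> FeatureMap:
    """1x1 convolution: each output channel is a weighted sum of input channels at each pixel."""
    n_pix = len(fmap[0])
    return [[sum(w * fmap[c][p] for c, w in enumerate(row)) for p in range(n_pix)]
            for row in weight]


def channel_attention(fmap: FeatureMap, w_down: Matrix, w_up: Matrix) -> FeatureMap:
    """Squeeze-and-excitation style CA: pool, bottleneck MLP with ReLU, sigmoid gate per channel."""
    pooled = [sum(ch) / len(ch) for ch in fmap]  # squeeze: one mean per channel
    hidden = [max(0.0, sum(w * p for w, p in zip(row, pooled))) for row in w_down]  # reduce + ReLU
    gates = [1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden)))) for row in w_up]
    return [[g * v for v in ch] for g, ch in zip(gates, fmap)]  # rescale each channel


def fuse_branches(nonlocal_out: FeatureMap, local_out: FeatureMap,
                  w_fuse: Matrix, w_down: Matrix, w_up: Matrix) -> FeatureMap:
    """Concatenate along channels (C + C -> 2C), fuse with a 1x1 conv (2C -> C), then apply CA."""
    concat = nonlocal_out + local_out
    return channel_attention(conv1x1(concat, w_fuse), w_down, w_up)


# Usage: C = 2 channels, 3 pixels per branch, bottleneck width 1.
nl_demo = [[1.0, 1.0, 1.0], [2.0, 2.0, 2.0]]  # NLSA branch output
loc_demo = [[3.0, 3.0, 3.0], [4.0, 4.0, 4.0]]  # depthwise separable branch output
w_fuse_demo = [[0.25, 0.25, 0.25, 0.25], [0.5, 0.0, 0.5, 0.0]]
print(fuse_branches(nl_demo, loc_demo, w_fuse_demo, [[0.1, 0.1]], [[1.0], [-1.0]]))
```

The 1×1 convolution is the only step that mixes the two branches. CA then reweights whole channels, which neither branch does on its own.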

[0052] Previous research has shown that GANs can produce images with good visual quality. We therefore use NLCUnet as the generator and a U-Net discriminator with spectral normalization as the discriminator, and call the combination NLCUnetGAN. Training has two stages. First, we train a PSNR-oriented model with L1 loss. Then, we use the trained PSNR-oriented model as the generator and train a GAN-oriented model with a combination of L1 loss, perceptual loss, and adversarial loss. The total loss of our network is expressed as follows:

[0053] $L_{total} = L_{1} + L_{perc} + 0.1 \times L_{adv}$

[0054] where $L_{1}$, $L_{perc}$, and $L_{adv}$ denote the L1 loss, perceptual loss, and adversarial loss, respectively. Based on experiments, we set their weights to 1, 1, and 0.1, respectively. The perceptual loss uses the conv1, ..., conv5 feature maps of a pre-trained VGG19, taken before the activation function and weighted 0.1, 0.1, 1, 1, and 1, respectively.
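The loss weighting and the two training stages can be written as a short sketch. The following are assumptions:
- VGG19 feature extraction is not shown; per-layer feature vectors are passed in directly.
- The per-layer perceptual distance is an L1 distance.
- The adversarial term is a precomputed scalar.

The function names are hypothetical.

```python
Vector = list[float]

# Per-layer weights for the VGG19 conv1..conv5 pre-activation features, as stated in the text.
VGG_LAYER_WEIGHTS = (0.1, 0.1, 1.0, 1.0, 1.0)


def l1_loss(pred: Vector, target: Vector) -> float:
    """Mean absolute error."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


def perceptual_loss(feats_pred: list[Vector], feats_target: list[Vector],
                    layer_weights: tuple[float, ...] = VGG_LAYER_WEIGHTS) -> float:
    """Weighted sum of per-layer L1 distances between feature maps (distance choice assumed)."""
    return sum(w * l1_loss(fp, ft) for w, fp, ft in zip(layer_weights, feats_pred, feats_target))


def total_loss(l1: float, l_perc: float, l_adv: float, stage: str = "gan") -> float:
    """Stage 1 (PSNR-oriented): L1 only. Stage 2 (GAN-oriented): L1 + L_perc + 0.1 * L_adv."""
    if stage == "psnr":
        return l1
    return 1.0 * l1 + 1.0 * l_perc + 0.1 * l_adv


# Usage: five hypothetical feature vectors, one per VGG19 layer.
pred_feats = [[0.0, 1.0]] * 5
target_feats = [[1.0, 0.0]] * 5
l_perc_demo = perceptual_loss(pred_feats, target_feats)
print(total_loss(0.2, l_perc_demo, 0.5))
```

The small weights on conv1 and conv2 mean the deeper, more semantic VGG layers dominate the perceptual term.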

[0055] We used PSNR and SSIM as evaluation metrics and DF2K (DIV2K and Flickr2K) as the training set. We tested our method on six mainstream test sets: Set5, Set14, BSD100, Urban100, Manga109, and DIV2KRK. In these tests, our method outperformed the existing methods it was compared against. We set up two experimental configurations, Configuration 1 and Configuration 2. Configuration 1 mainly considers isotropic Gaussian blur kernels. Configuration 2 mainly considers irregular (anisotropic) blur kernels.
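The two evaluation metrics can be computed as follows. The PSNR function follows the standard definition. The SSIM function is a simplified single-window (global) version that uses the usual constants $C_1 = (0.01L)^2$ and $C_2 = (0.03L)^2$. The standard metric instead averages SSIM over local Gaussian windows. The text does not specify details such as the color channel evaluated or border cropping, so treat this only as an illustration.

```python
import math

Vector = list[float]


def psnr(pred: Vector, target: Vector, max_val: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB: 10 * log10(MAX^2 / MSE)."""
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * math.log10(max_val ** 2 / mse)


def ssim_global(x: Vector, y: Vector, max_val: float = 255.0) -> float:
    """SSIM computed once over the whole image (no sliding window)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    vx = sum((a - mx) ** 2 for a in x) / n
    vy = sum((b - my) ** 2 for b in y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    c1, c2 = (0.01 * max_val) ** 2, (0.03 * max_val) ** 2
    return ((2 * mx * my + c1) * (2 * cov + c2)) / ((mx ** 2 + my ** 2 + c1) * (vx + vy + c2))


# Usage on flattened pixel lists.
print(psnr([52.0, 55.0, 61.0], [50.0, 55.0, 60.0]))
print(ssim_global([52.0, 55.0, 61.0], [50.0, 55.0, 60.0]))
```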

[0056] Configuration 1: We set the kernel size to 21×21. During training, the kernel width is uniformly sampled from [0.2, 2.0], [0.2, 3.0], and [0.2, 4.0] for ×2, ×3, and ×4, respectively. For quantitative evaluation, we collect HR images from the widely used benchmark datasets Set5, Set14, BSD100, Urban100, and Manga109. We select 8 kernels from the ranges [0.80, 1.60], [1.35, 2.40], and [1.8, 3.2] for ×2, ×3, and ×4, respectively. To create the synthetic test images, the HR images are first blurred by the selected kernels and then downsampled.
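The kernel settings of Configuration 1 can be sketched as follows. The following are assumptions:
- The kernel "width" is the standard deviation σ of the Gaussian.
- The 8 test kernels use evenly spaced widths across each range.

The helper names are hypothetical.

```python
import math
import random

Kernel = list[list[float]]

# Width ranges per scale factor, taken from the text (width interpreted as sigma).
TRAIN_SIGMA_RANGE = {2: (0.2, 2.0), 3: (0.2, 3.0), 4: (0.2, 4.0)}
TEST_SIGMA_RANGE = {2: (0.80, 1.60), 3: (1.35, 2.40), 4: (1.8, 3.2)}


def isotropic_gaussian_kernel(size: int, sigma: float) -> Kernel:
    """size x size Gaussian centred in the grid, normalized to sum to 1."""
    c = size // 2
    k = [[math.exp(-((x - c) ** 2 + (y - c) ** 2) / (2.0 * sigma ** 2)) for x in range(size)]
         for y in range(size)]
    total = sum(map(sum, k))
    return [[v / total for v in row] for row in k]


def sample_train_kernel(scale: int, rng: random.Random, size: int = 21) -> Kernel:
    """Training: draw the width uniformly from the range for this scale factor."""
    lo, hi = TRAIN_SIGMA_RANGE[scale]
    return isotropic_gaussian_kernel(size, rng.uniform(lo, hi))


def eval_sigmas(scale: int, n: int = 8) -> list[float]:
    """Testing: n widths spaced evenly over the test range (even spacing is assumed)."""
    lo, hi = TEST_SIGMA_RANGE[scale]
    return [lo + i * (hi - lo) / (n - 1) for i in range(n)]


# Usage: one random training kernel for x4 and the 8 assumed test widths for x4.
k_demo = sample_train_kernel(4, random.Random(0))
print([round(s, 2) for s in eval_sigmas(4)])
```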

[0057] Configuration 2: For ×2 and ×4, the blur kernel sizes are set to 11×11 and 31×31, respectively. During training, anisotropic Gaussian kernels are generated by randomly selecting kernel widths from the range (0.6, 5) and a rotation angle from the range [-π, π]. For testing, we first center-crop the DF2K images to 512×512. We then randomly crop 64×64 patches from each 512×512 block and degrade them with anisotropic Gaussian kernels. Note that the variance of the Gaussian kernel is sampled between 0.175 and 2.5 for ×2 and ×4.
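An anisotropic Gaussian kernel like those in Configuration 2 can be built from two axis widths and a rotation angle via the covariance $\Sigma = R(\theta)\,\mathrm{diag}(\sigma_1^2, \sigma_2^2)\,R(\theta)^\top$. Sampling both axis widths independently from (0.6, 5) is an assumption, as are the helper names.

```python
import math
import random

Kernel = list[list[float]]


def anisotropic_gaussian_kernel(size: int, sigma1: float, sigma2: float, theta: float) -> Kernel:
    """Gaussian with covariance R(theta) diag(sigma1^2, sigma2^2) R(theta)^T, normalized to sum 1."""
    c, s = math.cos(theta), math.sin(theta)
    a = c * c * sigma1 ** 2 + s * s * sigma2 ** 2  # Sigma[0][0]
    b = c * s * (sigma1 ** 2 - sigma2 ** 2)  # Sigma[0][1] == Sigma[1][0]
    d = s * s * sigma1 ** 2 + c * c * sigma2 ** 2  # Sigma[1][1]
    det = a * d - b * b
    inv_a, inv_b, inv_d = d / det, -b / det, a / det  # entries of Sigma^-1
    r = size // 2
    k = [[math.exp(-0.5 * (inv_a * x * x + 2.0 * inv_b * x * y + inv_d * y * y))
          for x in range(-r, r + 1)] for y in range(-r, r + 1)]
    total = sum(map(sum, k))
    return [[v / total for v in row] for row in k]


KERNEL_SIZE = {2: 11, 4: 31}  # from the text


def sample_aniso_kernel(scale: int, rng: random.Random) -> Kernel:
    """Random widths in (0.6, 5) and rotation in [-pi, pi]; independent axis widths assumed."""
    sigma1 = rng.uniform(0.6, 5.0)
    sigma2 = rng.uniform(0.6, 5.0)
    theta = rng.uniform(-math.pi, math.pi)
    return anisotropic_gaussian_kernel(KERNEL_SIZE[scale], sigma1, sigma2, theta)


# Usage: a random 31x31 kernel for the x4 setting.
k4_demo = sample_aniso_kernel(4, random.Random(0))
```

When $\sigma_1 = \sigma_2$ the rotation has no effect, so the isotropic kernels of Configuration 1 are a special case of this construction.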

[0058] Experimental Results and Analysis:

[0059] For the ×2 scaling factor, the PSNRs for Set5, Set14, BSD100, Urban100, Manga109, and DIV2KRK are 37.75, 33.59, 32.35, 31.10, 37.55, and 28.38, respectively, and the SSIMs are 0.9551, 0.9091, 0.8985, 0.9105, 0.9727, and 0.8354, respectively.

[0060] For the ×3 scaling factor, the PSNRs for Set5, Set14, BSD100, Urban100, and Manga109 are 34.27, 30.37, 29.15, 28.77, and 34.09, respectively, and the SSIMs are 0.9236, 0.8338, 0.7987, 0.8596, and 0.9453, respectively. In particular, on Set5 and Urban100 the PSNR is 0.318 dB higher than that of the best existing technology.

[0061] For the ×4 scaling factor, the PSNRs for Set5, Set14, BSD100, Urban100, Manga109, and DIV2KRK are 32.29, 28.66, 27.96, 25.69, 30.16, and 27.13, respectively, and the SSIMs are 0.8931, 0.7751, 0.7518, 0.7597, 0.9004, and 0.7595, respectively. In particular, the PSNR on Urban100 is 0.35 dB higher than that of the best existing technology, and the PSNR on Manga109 is 0.467 dB higher than that of DCLS.

[0062] Ablation experiment:

[0063] Figure 5 shows the ablation results:
- LN: the LayerNorm operation is added to the model, which solves the gradient explosion caused by CA.
- No_Ker: the blur kernel estimation operation is removed, with almost no negative impact on the model.
- Bic: a bicubic upsampling operation is added, which shortens the training cycle and improves performance.

Using NLC as the basic building block and U-Net as the model framework gives the best performance.
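The bicubic upsampling operation in the Bic ablation can be illustrated with a separable cubic-convolution sketch. The following are assumptions:
- The Keys kernel uses $a = -0.5$, a common bicubic choice.
- Sample centers are at half-pixel positions, and borders are clamped.
- A global skip connection is used: the network predicts only a residual, which is added to the bicubic-upsampled input.

The text does not say exactly where the bicubic output enters the network, and `residual_net` is a hypothetical placeholder.

```python
import math
from typing import Callable

Image = list[list[float]]


def cubic_weight(t: float, a: float = -0.5) -> float:
    """Keys cubic-convolution kernel; weights at the four nearest integer offsets sum to 1."""
    t = abs(t)
    if t <= 1.0:
        return (a + 2.0) * t ** 3 - (a + 3.0) * t ** 2 + 1.0
    if t < 2.0:
        return a * t ** 3 - 5.0 * a * t ** 2 + 8.0 * a * t - 4.0 * a
    return 0.0


def upsample_1d(row: list[float], s: int) -> list[float]:
    """Upsample one row by an integer factor s using half-pixel centres and clamped borders."""
    n = len(row)
    out = []
    for i in range(n * s):
        src = (i + 0.5) / s - 0.5  # output pixel centre in input coordinates
        base = math.floor(src)
        acc = 0.0
        for m in range(base - 1, base + 3):  # four nearest input samples
            acc += row[min(max(m, 0), n - 1)] * cubic_weight(src - m)
        out.append(acc)
    return out


def bicubic_upsample(img: Image, s: int) -> Image:
    """Separable 2D upsampling: horizontal pass, then vertical pass."""
    rows = [upsample_1d(r, s) for r in img]
    cols = [upsample_1d(list(c), s) for c in zip(*rows)]
    return [list(r) for r in zip(*cols)]


def reconstruct(lr: Image, residual_net: Callable[[Image], Image], s: int) -> Image:
    """Hypothetical global skip: SR = bicubic(LR) + residual predicted by the network."""
    up = bicubic_upsample(lr, s)
    res = residual_net(lr)
    return [[u + r for u, r in zip(ur, rr)] for ur, rr in zip(up, res)]


# Usage: a 2x2 LR image and a placeholder network that predicts a zero residual.
lr_img = [[0.0, 1.0], [1.0, 0.0]]
sr = reconstruct(lr_img, lambda x: [[0.0] * 4 for _ in range(4)], 2)
```

With a design of this kind, the network starts from a bicubic estimate and only learns the missing high-frequency detail. That is one plausible reading of the shorter training cycle reported for Bic.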

[0064] Figures 3 and 4 show the visual performance of our model on the test sets. Compared with other techniques, our invention is markedly more effective at super-resolving textures, edges, and similar content within images.

[0065] Although embodiments of the invention have been shown and described, it will be understood by those skilled in the art that various changes, modifications, substitutions and alterations can be made to these embodiments without departing from the principles and spirit of the invention, the scope of which is defined by the appended claims and their equivalents.

Claims

1. A single-image super-resolution reconstruction method with hair-level detail, characterized in that it includes the following steps:

Step 1: Apply a LayerNorm operation to the channel feature maps;

Step 2: Add a gated Dconv feed-forward network module, remove the blur kernel estimation module, and add bicubic interpolation upsampling;

Step 3: Introduce a non-local attention mechanism to recover local fragments by learning from the entire image region;

Step 4: Create a dual-branch network structure that combines the non-local attention mechanism with depthwise separable convolutions;

Step 5: With the above network as the generator and a U-Net discriminator with spectral normalization as the discriminator, train a PSNR-oriented model using L1 loss. Then use the trained PSNR-oriented model as the generator and train a GAN-oriented model using a combination of L1 loss, perceptual loss, and adversarial loss. The total network loss is $L_{total} = L_{1} + L_{perc} + 0.1 \times L_{adv}$, where $L_{1}$, $L_{perc}$, and $L_{adv}$ denote the L1 loss, perceptual loss, and adversarial loss, respectively;

Step 6: Set the weights to 1, 1, and 0.1, respectively, and use the conv1, ..., conv5 feature maps of a pre-trained VGG19, taken before the activation function, as the perceptual loss;

Step 7: Describe the image degradation process as $I_{LR} = (I_{HR} \otimes K)\downarrow_{bic} + N$, where $I_{LR}$ denotes the low-resolution image, $I_{HR}$ the high-resolution image, $\otimes$ the convolution operation, $K$ the blur kernel, $\downarrow_{bic}$ bicubic downsampling, and $N$ additive white Gaussian noise;

Step 8: Simplify the super-resolution reconstruction process to $I_{HR} = F(I_{LR})$, where F denotes the single-image super-resolution reconstruction method with hair-level detail.

2. The single-image super-resolution reconstruction method with hair-level detail according to claim 1, characterized in that the dual-branch network structure consists of two branches. One branch passes through an NLSA block, and the other passes through two layers of 3×3 depthwise separable convolutions. The outputs of the two branches are concatenated along the channel dimension.

3. The single-image super-resolution reconstruction method with hair-level detail according to claim 2, characterized in that: Use 1×1 convolutions to fuse the two branches, and then use CA to handle the information exchange between channels.