A defocus deblurring method based on a hierarchical model
By using a hierarchical model to decompose the defocus deblurring task into basic feature encoding and context encoding, and by extracting high-level semantic information with the ViT architecture, the method addresses the failure of existing methods to fully utilize context information and significantly improves image restoration quality.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- HARBIN INST OF TECH
- Filing Date
- 2024-08-05
- Publication Date
- 2026-06-23
AI Technical Summary
Existing defocus deblurring methods fail to fully utilize high-level semantic context information. The resulting poor restoration quality has a significant negative impact, especially on computer vision applications.
A hierarchical model decomposes the deblurring task into two independent subtasks: basic feature encoding and context encoding. The basic feature encoder extracts primary representations, and the context encoder derives sharp, abstract representations from them. Finally, the decoder fuses the high-level semantic context information and reconstructs the sharp image.
The method significantly improves image restoration quality, raises scores on metrics related to human visual perception, reduces artifacts, restores details accurately, and achieves an effective balance between high-level vision tasks and image restoration tasks.
Figure CN119151825B_ABST
Abstract
Description
Technical Field
[0001] This invention relates to a defocus deblurring method and belongs to the field of image processing technology.
Background Technology
[0002] Defocus deblurring has always been a challenging problem because the blur varies spatially with scene depth. Although recent work has made significant progress in network architecture design, these advances have mainly focused on processing high-frequency details, and the importance of scene understanding in the deblurring process has not received sufficient attention. The core of scene understanding lies in the use of contextual information, which provides key support for capturing high-level semantic cues about the environment and object contours. By identifying and effectively using these cues, the quality of image restoration can be significantly improved. Based on this, we propose a novel method that integrates spatial details and contextual information, providing an effective solution to the defocus deblurring problem. Specifically, we introduce a novel hierarchical model based on the Vision Transformer (ViT), which can seamlessly integrate spatial details and contextual information. Our method decomposes the deblurring task into two independent subtasks: the first is handled by the basic feature encoder, which transforms the blurred image into a detailed representation of basic features; the second is handled by the context encoder, which extracts an abstract and sharp representation from the basic feature representation. The outputs of the two encoders are merged and reconstructed into a sharp target image by a decoder.
[0003] Defocus blur occurs when the depth range of a scene exceeds the depth of field (DoF) limit of a given camera. In this case, objects in non-focal areas appear blurred, leading to the loss of crucial details. This image degradation has a significant negative impact on various computer vision applications, including object detection, image super-resolution, and text recognition. Therefore, restoring a fully focused image to reveal its high-resolution details is of significant research importance and practical application value in the field of computer vision. To address the defocus blur problem, researchers have developed various algorithms that can transform blurred images into sharp ones. These methods typically involve using specific network modules, such as dynamic kernels and multi-scale modules, which have demonstrated promising performance in standard defocus deblurring benchmarks. The current mainstream research trend is to combine advanced network architectures with accurately modeled defocus priors to achieve optimal deblurring results. Representative models such as NRKNet and KPAC demonstrate the practical effectiveness of this trend. The aforementioned deblurring algorithms mainly focus on reconstructing high-frequency detail features of images, but often fail to fully capture the so-called "contextual information," i.e., high-level semantic cues. This type of information is closely related to the semantics or context of the input image, and is particularly important for defocusing and deblurring, as deblurring requires separate processing for different regions of the image. Currently, deblurring networks attempt to achieve a broad receptive field by employing large-scale kernels and self-attention mechanisms; however, these networks often focus only on capturing local spatial information and fail to deeply understand the overall semantics of the scene. 
In the field of computer vision, significant progress has recently been made in exploring the semantic context of images. This trend aligns with the need for semantic information in defocus deblurring and offers new possibilities for overcoming the limitations of existing methods. In this area, the Vision Transformer (ViT) architecture stands out. ViT-based models such as MAE and CLIP have demonstrated excellent image understanding capabilities even when dealing with degraded inputs. These advances highlight the enormous potential of incorporating high-level semantic contextual information into defocus deblurring. However, the incompatibility between high-level vision requirements and image restoration tasks remains a significant challenge.
Summary of the Invention
[0004] To address the incompatibility between high-level vision requirements and image restoration tasks, this invention proposes a defocusing and deblurring method based on a hierarchical model.
[0005] The technical solution adopted by the present invention to solve the above problems is as follows: The present invention specifically includes:
[0006] Step 1: A tokenizer divides the blurred image into patches. The basic feature encoder takes the patches as input and aims to learn a primary representation that retains as much detail as possible. Its goal is to map the blurred image into a latent space and extract primary representations that serve as high-fidelity inputs for the context encoder; the basic feature encoder comprises four ViT layers.
[0007] Step 2: The primary representation is fed into the three ViT layers of the context encoder, and the result is fed into the blur-to-sharp converter module to obtain a clearer and more abstract representation. Specifically, this includes:
[0008] A linear layer is used to predict the sharp representation from the blur-related representation:
[0009] (1),
[0010] In formula (1), the weight term denotes the weights of the linear layer;
[0011] The reference features are obtained by applying the same encoder to the reference image; the dimension of the reference image is [missing information];
[0012] A classification loss function based on binary cross-entropy is introduced to encourage the two representations to learn features that distinguish blurry images from sharp ones:
[0013] (2),
[0014] ,
[0015] In formula (2), the label takes the value 0 for blurry and 1 for sharp; the normalizing count is the total number of elements in the tensor; and the sigmoid function normalizes the feature values to the interval (0,1).
[0016] The overall loss of the blur-to-sharp converter module is defined as:
[0017] (3),
[0018] In formula (3), a distance function measures the distance between representations, and a hyperparameter balances the importance of the classification loss against the other loss components;
[0019] The context encoder is trained with this loss function to acquire blur-related representations and to predict the sharp representation from the blurred context;
[0020] A joint embedding architecture is introduced to enhance the representation in the context encoder. The architecture includes an information content preservation module, which uses the regularization criterion defined by VICReg to remove details that are irrelevant to blur. The module maps the representation to an embedding space through a multilayer perceptron and derives the regularization loss from it.
[0021] Step 3: The decoder takes the outputs of the basic feature encoder and the context encoder as input and reconstructs the deblurred image.
[0022] Furthermore, regularization losses are used in the information content preservation module. These losses are based on three principles: variance, invariance, and covariance. When computing them, the module receives two sets of inputs: a batch of reference representations and a batch of representations derived from the blur-to-sharp converter. Both are embedded into a low-dimensional space through a mapping, which strengthens the capacity for abstraction:
[0023] (4),
[0024] In formula (4), the feature-averaging operator averages over tokens and excludes the [CLS] token used for classification in the Transformer architecture; the dimension of the embedding is [missing information]. The mapping yields two sets of batch embeddings, one for each set of inputs, each containing as many embeddings as the batch size;
[0025] The regularization loss function of the information content preservation module is defined as:
[0026] (5),
[0027] In formula (5), the batch argument denotes either of the two sets of batch embeddings, two further symbols denote hyperparameters, and one symbol denotes the mean of the batch embeddings. The total loss function of the information content preservation module includes three regularization terms and is defined as follows:
[0028] (6),
[0029] In formula (6), the two coefficients are hyperparameters used to balance the effects of the loss terms.
[0030] Furthermore, the decoder uses the features extracted by the basic feature encoder and the context encoder to generate the deblurred image. The decoder is trained with a conventional loss function:
[0031] (7),
[0032] For the entire model, the final loss is defined as:
[0033] (8),
[0034] In formula (8), the two coefficients are parameters that balance the different loss terms.
[0035] Furthermore, the training strategy for the decoder is as follows: the basic feature encoder, parameterized by the pre-trained MAE, is frozen during the training of the context encoder, and the model is trained on the large-scale ImageNet dataset; widely accepted data augmentation techniques are employed, including random cropping to a size of 224×224, random scaling from 0.2 to 1.0, and random horizontal flipping; the loss of the entire model is then minimized according to the loss function defined by formula (8), with the context encoder initialized from the parameters of the pre-training stage.
[0036] The beneficial effects of this invention are: on metrics directly related to human visual perception, such as MUSIQ, this invention improves the score by 1.65 points compared to the state-of-the-art NRKNet; compared to methods with specific auxiliary input structures, such as KPAC, this invention also achieves a significant performance improvement of 0.042 on FSIM; the metrics of this invention better reflect human visual perception; and this invention has significant potential in single-image defocusing and deblurring tasks.
[0037] This invention effectively restores the content structure of an image, while other methods often leave significant background blur. This invention successfully reconstructs the image structure, sharpens edges, and produces a clear and visually pleasing image. This invention can more accurately restore details, reduce artifacts, and precisely handle defocused areas.
[0038] This invention integrates high-level contextual information into a single-image defocusing and deblurring task. It fully leverages the capabilities of pre-trained ViT in extracting high-level contextual representations. This invention effectively addresses the issues of extracting abstract features from semantic understanding and preserving high-frequency details. It also effectively mitigates the potential conflict between high-level visual tasks and image restoration tasks. Furthermore, this invention decomposes the deblurring task into two independent sub-tasks and uses two dedicated encoders to handle complex blurred scenes, thereby effectively eliminating defocusing blur caused by spatial variations.
[0039] This invention integrates high-level semantic context information with high-frequency details in the defocus deblurring task, achieving an effective balance between these two key aspects. It also decomposes the image deblurring task into two parts: abstract context feature extraction and specific detail extraction. This approach effectively compensates for the shortcomings of existing defocus deblurring methods in context feature extraction.
Attached Figure Description
[0040] Figure 1 is a schematic diagram of the hierarchical model;
[0041] Figure 2 is a visual comparison of single-image defocus deblurring performed by this invention on the DPDD dataset.
Detailed Implementation
[0042] Specific implementation method one: as shown in Figure 1 and Figure 2, a defocus deblurring method based on a hierarchical model specifically includes:
[0043] Step 1: A tokenizer divides the blurred image into patches. The basic feature encoder takes the patches as input and aims to learn a primary representation that retains as much detail as possible. Its goal is to map the blurred image into a latent space and extract primary representations that serve as high-fidelity inputs for the context encoder; the basic feature encoder comprises four ViT layers.
[0044] Step 2: The primary representation is fed into the three ViT layers of the context encoder, and the result is fed into the blur-to-sharp converter module to obtain a clearer and more abstract representation. Specifically, this includes:
[0045] A linear layer is used to predict the sharp representation from the blur-related representation:
[0046] (1),
[0047] In formula (1), the weight term denotes the weights of the linear layer;
[0048] The reference features are obtained by applying the same encoder to the reference image; the dimension of the reference image is [missing information];
[0049] A classification loss function based on binary cross-entropy is introduced to encourage the two representations to learn features that distinguish blurry images from sharp ones:
[0050] (2),
[0051] ,
[0052] In formula (2), the label takes the value 0 for blurry and 1 for sharp; the normalizing count is the total number of elements in the tensor; and the sigmoid function normalizes the feature values to the interval (0,1).
[0053] The overall loss of the blur-to-sharp converter module is defined as:
[0054] (3),
[0055] In formula (3), a distance function measures the distance between representations, and a hyperparameter balances the importance of the classification loss against the other loss components;
[0056] The context encoder is trained with this loss function to acquire blur-related representations and to predict the sharp representation from the blurred context;
[0057] A joint embedding architecture is introduced to enhance the representation in the context encoder. The architecture includes an information content preservation module, which uses the regularization criterion defined by VICReg to remove details that are irrelevant to blur. The module maps the representation to an embedding space through a multilayer perceptron and derives the regularization loss from it.
[0058] Step 3: The decoder takes the outputs of the basic feature encoder and the context encoder as input and reconstructs the deblurred image.
[0059] The defocus deblurring method based on a hierarchical model builds on the standard ViT architecture and includes a basic feature encoder, a context encoder, and a decoder. The basic feature encoder takes the image patches as input and aims to learn a primary representation that retains as much detail as possible. This representation not only preserves the integrity of the input but also distinguishes between different degrees of blur. The context encoder then uses the primary representation to learn a clearer and more abstract representation, which effectively eliminates irrelevant blur features. Finally, the decoder takes both representations as input and reconstructs the deblurred image.
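The tokenization in Step 1 can be illustrated with a minimal sketch in plain Python. It assumes the standard ViT convention of non-overlapping square patches; the function name `patchify` and the 16-pixel default are illustrative choices, not taken from the patent text, which leaves the patch size of the method unspecified at this point.

```python
def patchify(image, patch_size=16):
    """Split an H x W x C image (nested lists) into flattened,
    non-overlapping patches in row-major order."""
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image size must be a multiple of patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            flat = []
            for row in image[top:top + patch_size]:
                for pixel in row[left:left + patch_size]:
                    flat.extend(pixel)  # append the C channel values
            patches.append(flat)
    return patches


# A 224 x 224 RGB image gives (224 / 16) ** 2 = 196 patches,
# each holding 16 * 16 * 3 = 768 values.
image = [[[0.0, 0.0, 0.0] for _ in range(224)] for _ in range(224)]
patches = patchify(image)
print(len(patches), len(patches[0]))
```

In a real ViT each flattened patch would then pass through a learned linear projection before entering the encoder layers; that projection is omitted here.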
[0060] The goal of the basic feature encoder is to map the blurred image into the latent space and extract primary representations that serve as high-fidelity inputs for the context encoder. MAE is used as the framework of the basic feature encoder; its architecture is the same as that of the standard ViT. Several early layers of MAE are used as the basic feature encoder to keep the model simple and feasible to optimize, and they are initialized with parameters obtained from MIM training.
[0061] The context encoder is trained with the loss function of formula (3) to acquire blur-related representations and to predict the sharp representation from the blurred context.
[0062] Specific implementation method two: as shown in Figure 1 and Figure 2, in step 2 the context encoder uses the primary representation to learn a clearer and more abstract representation. Specifically, this includes:
[0063] A linear layer is used to predict the sharp representation from the blur-related representation:
[0064] (1),
[0065] In formula (1), the weight term denotes the weights of the linear layer;
[0066] The reference features are obtained by applying the same encoder to the reference image; the dimension of the reference image is [missing information];
[0067] A classification loss function based on binary cross-entropy is introduced to encourage the two representations to learn features that distinguish blurry images from sharp ones:
[0068] (2),
[0069] ,
[0070] In formula (2), the label takes the value 0 for blurry and 1 for sharp; the normalizing count is the total number of elements in the tensor; and the sigmoid function normalizes the feature values to the interval (0,1).
[0071] The overall loss of the blur-to-sharp converter module is defined as:
[0072] (3),
[0073] In formula (3), a distance function measures the distance between representations, and a hyperparameter balances the importance of the classification loss against the other loss components;
[0074] The context encoder is trained with this loss function to acquire blur-related representations and to predict the sharp representation from the blurred context.
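The losses of formulas (2) and (3) can be sketched as follows. Because the formulas themselves are not reproduced in this text, several details are assumptions and are marked as such in the code: the distance is taken to be a mean squared error, the classification loss is applied to the blurred features (label 0) and the reference features (label 1), and the default weight of 0.25 is borrowed from the implementation details. All function names are hypothetical.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function mapping x into (0, 1)."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def bce_loss(features, label):
    """Mean binary cross-entropy over all elements of a flattened tensor.
    label is 0 for blurry and 1 for sharp."""
    eps = 1e-12  # guards against log(0)
    total = 0.0
    for x in features:
        p = sigmoid(x)
        total -= label * math.log(p + eps) + (1 - label) * math.log(1.0 - p + eps)
    return total / len(features)


def mse_distance(predicted, reference):
    """Assumed distance between representations: mean squared error."""
    return sum((p - r) ** 2 for p, r in zip(predicted, reference)) / len(predicted)


def blur_to_sharp_loss(blur_feat, pred_feat, ref_feat, cls_weight=0.25):
    """Distance term plus weighted classification term.
    Assumption: blurred features carry label 0, reference features label 1."""
    cls = bce_loss(blur_feat, 0) + bce_loss(ref_feat, 1)
    return mse_distance(pred_feat, ref_feat) + cls_weight * cls


loss = blur_to_sharp_loss([-2.0, -1.0], [0.9, 1.1], [1.0, 1.0])
print(round(loss, 4))
```

The classification term is small when blurred features have negative logits and reference features have positive ones, which is one way the two kinds of features could be pushed apart.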
[0075] Specific implementation method three: as shown in Figure 1 and Figure 2, a joint embedding architecture is introduced to enhance the representation in the context encoder. The architecture includes an information content preservation module, which uses the regularization criterion defined by VICReg to remove details that are irrelevant to blur. The module maps the representation to an embedding space through a multilayer perceptron and derives the regularization loss from it.
[0076] Regularization losses are used in the information content preservation module. These losses are based on three principles: variance, invariance, and covariance. When computing them, the module receives two sets of inputs: a batch of reference representations and a batch of representations derived from the blur-to-sharp converter. Both are embedded into a low-dimensional space through a mapping, which strengthens the capacity for abstraction:
[0077] (4),
[0078] In formula (4), the feature-averaging operator averages over tokens and excludes the [CLS] token used for classification in the Transformer architecture; the dimension of the embedding is [missing information]. The mapping yields two sets of batch embeddings, one for each set of inputs, each containing as many embeddings as the batch size;
[0079] The regularization loss function of the information content preservation module is defined as:
[0080] (5),
[0081] In formula (5), the batch argument denotes either of the two sets of batch embeddings, two further symbols denote hyperparameters, and one symbol denotes the mean of the batch embeddings. The total loss function of the information content preservation module includes three regularization terms and is defined as follows:
[0082] (6),
[0083] In formula (6), the two coefficients are hyperparameters used to balance the effects of the loss terms.
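The three regularization terms can be sketched in plain Python following the publicly described VICReg criterion. This is an illustration rather than the patent's exact formulas (5) and (6), which are not reproduced here: the hinge target `gamma`, the stabilizer `eps`, the unit default weights, and the assumption that the [CLS] token sits at index 0 are all illustrative.

```python
import math


def mean_pool_without_cls(tokens):
    """Average the patch tokens, skipping the [CLS] token (assumed at index 0)."""
    patch_tokens = tokens[1:]
    dim = len(patch_tokens[0])
    return [sum(t[j] for t in patch_tokens) / len(patch_tokens) for j in range(dim)]


def column_means(batch):
    """Mean of the batch embeddings, one value per embedding dimension."""
    return [sum(row[j] for row in batch) / len(batch) for j in range(len(batch[0]))]


def variance_term(batch, gamma=1.0, eps=1e-4):
    """Hinge loss keeping the per-dimension standard deviation above gamma."""
    n, d = len(batch), len(batch[0])
    means = column_means(batch)
    total = 0.0
    for j in range(d):
        var = sum((row[j] - means[j]) ** 2 for row in batch) / (n - 1)
        total += max(0.0, gamma - math.sqrt(var + eps))
    return total / d


def covariance_term(batch):
    """Sum of squared off-diagonal covariance entries, divided by d."""
    n, d = len(batch), len(batch[0])
    means = column_means(batch)
    total = 0.0
    for i in range(d):
        for j in range(d):
            if i != j:
                cov = sum((r[i] - means[i]) * (r[j] - means[j]) for r in batch) / (n - 1)
                total += cov ** 2
    return total / d


def invariance_term(batch_a, batch_b):
    """Mean squared Euclidean distance between paired embeddings."""
    pairs = zip(batch_a, batch_b)
    return sum(sum((a - b) ** 2 for a, b in zip(ra, rb)) for ra, rb in pairs) / len(batch_a)


def vicreg_loss(z_ref, z_pred, inv_w=1.0, var_w=1.0, cov_w=1.0):
    """Weighted sum of the invariance, variance, and covariance terms."""
    inv = invariance_term(z_ref, z_pred)
    var = variance_term(z_ref) + variance_term(z_pred)
    cov = covariance_term(z_ref) + covariance_term(z_pred)
    return inv_w * inv + var_w * var + cov_w * cov


z_ref = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
z_pred = [[0.9, 0.1], [0.1, 0.9], [1.0, 1.0], [0.0, 0.1]]
print(round(vicreg_loss(z_ref, z_pred), 4))
```

The variance term penalizes embeddings that collapse to a constant, the covariance term decorrelates embedding dimensions, and the invariance term pulls the predicted embeddings toward the reference embeddings.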
[0084] Specific implementation method four: as shown in Figure 1 and Figure 2, the decoder uses the features extracted by the basic feature encoder and the context encoder to generate the deblurred image. The decoder is trained with a conventional loss function:
[0085] (7),
[0086] For the entire model, the final loss is defined as:
[0087] (8),
[0088] In formula (8), the two coefficients are parameters that balance the different loss terms.
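A minimal sketch of how a final loss of this shape could be combined is given below. Formulas (7) and (8) are not reproduced in the text, so the following are assumptions: the reconstruction loss is taken to be a mean absolute error, and the two balancing parameters (defaults 0.25 and 0.01, borrowed from the implementation details) are assigned to the blur-to-sharp loss and the regularization loss in that order. The names `l1_loss` and `total_loss` are hypothetical.

```python
def l1_loss(predicted, target):
    """Assumed reconstruction loss: mean absolute error over all pixels."""
    return sum(abs(p - t) for p, t in zip(predicted, target)) / len(target)


def total_loss(reconstruction, blur_to_sharp, regularization,
               b2s_weight=0.25, reg_weight=0.01):
    """Reconstruction loss plus two weighted auxiliary losses.
    Which weight belongs to which term is an assumption."""
    return reconstruction + b2s_weight * blur_to_sharp + reg_weight * regularization


recon = l1_loss([0.2, 0.4, 0.9], [0.0, 0.5, 1.0])
print(round(total_loss(recon, blur_to_sharp=0.8, regularization=3.0), 4))
```

With weights of this size, the reconstruction term dominates while the auxiliary terms act as comparatively mild regularizers on the context encoder.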
[0089] Specific implementation method five: as shown in Figure 1 and Figure 2, the training strategy for the decoder is as follows: the basic feature encoder, parameterized by the pre-trained MAE, is frozen during the training of the context encoder, and the model is trained on the large-scale ImageNet dataset; widely accepted data augmentation techniques are employed, including random cropping to a size of 224×224, random scaling from 0.2 to 1.0, and random horizontal flipping; the loss of the entire model is then minimized according to the loss function defined by formula (8), with the context encoder initialized from the parameters of the pre-training stage.
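The augmentation described above can be sketched as a parameter sampler. The sketch assumes that "random scaling from 0.2 to 1.0" means the crop covers a random fraction of the image area in that range before being resized to 224×224, as in commonly used random-resized-crop transforms; the square crop and the function names are simplifications introduced here, and the actual pixel resampling is left out.

```python
import random


def sample_crop_box(height, width, scale=(0.2, 1.0), rng=random):
    """Pick a square crop covering a random fraction of the image area."""
    area_fraction = rng.uniform(*scale)
    side = int(round((area_fraction * height * width) ** 0.5))
    side = max(1, min(side, height, width))  # keep the crop inside the image
    top = rng.randint(0, height - side)
    left = rng.randint(0, width - side)
    return top, left, side


def sample_augmentation(height, width, out_size=224, rng=random):
    """Sample crop, resize target, and horizontal-flip decision for one image."""
    top, left, side = sample_crop_box(height, width, rng=rng)
    return {
        "top": top,
        "left": left,
        "side": side,
        "resize_to": out_size,
        "flip": rng.random() < 0.5,
    }


rng = random.Random(0)
print(sample_augmentation(300, 400, rng=rng))
```

Passing an explicit `random.Random` instance keeps the sampled parameters reproducible across runs.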
[0090] Dataset and Implementation Details
[0091] The DPDD dataset was acquired using the dual-pixel technology of the Canon EOS 5D Mark IV digital SLR camera. It contains 500 different indoor and outdoor scenes. For each scene, two out-of-focus sub-views were captured using a wide aperture, and a corresponding sharp ground-truth image was captured using a narrow aperture. The dataset consists of 350 pairs of training samples, 74 pairs of validation samples, and 76 pairs of test samples.
[0092] Implementation details: The basic feature encoder and the decoder each contain four ViT layers, the context encoder consists of three ViT layers, and the Information Content Preservation (ICR) module consists of three fully connected layers with batch normalization, following the VICReg configuration. One loss coefficient is set to 0.25, and the weights of the two loss terms in the loss function of the overall model are set to 0.25 and 0.01, respectively; two further hyperparameters are both set to 1. The experiments were conducted with the PyTorch framework on a platform equipped with an RTX 3090 GPU. During the pre-training phase on the ImageNet dataset, the LARS optimizer was used for 100 epochs. For the training phase on the defocus dataset, the AdamW optimizer was used with β1 = 0.9 and β2 = 0.99; the initial learning rate was set to 2e-4 and gradually decreased to 2e-5 over 200 epochs. The batch size was set to 8. When ViT-B is used as the backbone model, the same tokenization as the original ViT-B is employed, with a patch size of [missing information]. During testing, to handle high-resolution blurry images, each image is divided into blocks and processed with a sliding-window method to keep computation efficient.
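The learning-rate decay from 2e-4 to 2e-5 over 200 epochs can be illustrated with a short sketch. The text only says the rate decreases gradually, so the cosine shape used here is an assumption, as is the function name `cosine_lr`.

```python
import math


def cosine_lr(epoch, total_epochs=200, lr_max=2e-4, lr_min=2e-5):
    """Cosine decay from lr_max at epoch 0 to lr_min at total_epochs."""
    progress = min(max(epoch, 0), total_epochs) / total_epochs
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))


for epoch in (0, 50, 100, 150, 200):
    print(epoch, f"{cosine_lr(epoch):.2e}")
```

A linear or step schedule between the same two endpoints would fit the description equally well; only the endpoints and the epoch count come from the text.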
[0093] Defocused dataset performance evaluation
[0094] Six standard evaluation metrics were selected: Peak Signal-to-Noise Ratio (PSNR), Structural Similarity (SSIM), Learned Perceptual Image Patch Similarity (LPIPS), Multi-scale Image Quality Transformer (MUSIQ), Feature Similarity (FSIM), and Conditional Knowledge Distillation Network (CKDN). Among them, LPIPS, MUSIQ, and FSIM are closely related to human visual perception.
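Of these metrics, PSNR has a simple closed form, 10·log10(MAX² / MSE), and can be sketched directly. The sketch assumes pixel values normalized to [0, 1] and flattened into lists; the other five metrics require structural models or learned networks and are not shown.

```python
import math


def psnr(restored, reference, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two flattened images."""
    mse = sum((a - b) ** 2 for a, b in zip(restored, reference)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


# A uniform error of 0.1 gives an MSE of 0.01, i.e. about 20 dB.
print(round(psnr([0.5, 0.5], [0.6, 0.4]), 2))
```

Because PSNR depends only on the pixel-wise error, two restorations with the same score can look very different, which is why perceptual metrics are reported alongside it.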
[0095] Table 1: Defocusing and Deblurring Performance Comparison; This table summarizes the defocusing and deblurring performance comparisons on the DPDD test dataset, RealDOF dataset, and LFDOF dataset. The best and second-best results for each metric are indicated by bold and underline, respectively, to facilitate quick identification of model performance.
[0096] Table 1
[0097]
[0098] Table 1 presents detailed quantitative comparisons of this invention with all current state-of-the-art (SOTA) methods on the DPDD dataset. Although this invention employs a ViT-based architecture typically used for advanced vision tasks—an architecture not specifically designed for image restoration—it demonstrates superior performance across various evaluation metrics. Particularly in metrics directly related to human visual perception, such as MUSIQ, this invention outperforms the state-of-the-art NRKNet by 1.65 points. Compared to methods with specific auxiliary input structures, such as KPAC, this invention also achieves a significant performance improvement of 0.042 on FSIM. Notably, this invention remains competitive on metrics such as LPIPS, MUSIQ, FSIM, and CKDN. These metrics better reflect human visual perception, indicating that the deblurred images generated by this invention are visually more appealing. The PSNR and SSIM metrics, specifically designed to quantify pixel-level differences in the image domain, further highlight the significant potential of this invention in single-image defocusing deblurring tasks.
[0099] Furthermore, to explore the impact of different parameter configurations, different types of ViT architectures were integrated as the core of the model. The main difference between ViT-B and ViT-T lies in the dimension of the hidden layers, a difference that is crucial for tokenization. ViT-B transforms 16×16×3 image patches into 768-dimensional vectors and thus encapsulates the information more effectively. In contrast, ViT-T transforms the same image patches into 192-dimensional vectors, potentially losing information. According to the quantitative analysis, ViT-B, with its larger hidden dimension, retains more information and performs significantly better than ViT-T. However, even with ViT-T as the backbone network, this invention remains competitive with other state-of-the-art methods in defocus deblurring tasks.
[0100] Figure 2 presents three case studies comparing five other methods with this invention. In the first case, this invention effectively restores the content structure of the image, while other methods typically leave significant background blur. KPAC and DPDNet introduce artifacts, particularly noticeable in the image edge regions, as is evident in the second case. In contrast, this invention successfully reconstructs the image structure, sharpens edges, and produces a clear and visually pleasing image. These results highlight the advantage of this invention in understanding image context, which enables more accurate detail recovery, fewer artifacts, and precise handling of defocused areas.
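The dimensional argument above can be checked with a few lines of arithmetic. The sketch assumes that the patch embedding is a single linear layer with a bias, as in the standard ViT; the helper name and the returned field names are illustrative.

```python
def patch_embedding_shape(patch_size=16, channels=3, hidden_dim=768):
    """Input size, compression ratio, and parameter count of a linear patch embedding."""
    in_features = patch_size * patch_size * channels
    return {
        "in_features": in_features,
        "hidden_dim": hidden_dim,
        "compression": in_features / hidden_dim,
        "params": in_features * hidden_dim + hidden_dim,  # weights plus bias
    }


vit_b = patch_embedding_shape(hidden_dim=768)
vit_t = patch_embedding_shape(hidden_dim=192)
print(vit_b)
print(vit_t)
```

A 16×16×3 patch holds 768 values, so the 768-dimensional embedding of ViT-B can in principle preserve every input value, while the 192-dimensional embedding of ViT-T must compress each patch by a factor of four.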
[0101] The above description is merely a preferred embodiment of the present invention and is not intended to limit the present invention in any way. Although the present invention has been disclosed above with reference to preferred embodiments, it is not intended to limit the present invention. Any person skilled in the art can make some modifications or alterations to the above-disclosed technical content to create equivalent embodiments without departing from the scope of the present invention. Any simple modifications, equivalent substitutions, and improvements made to the above embodiments without departing from the scope of the present invention, based on the technical essence of the present invention and within the spirit and principles of the present invention, shall still fall within the protection scope of the present invention.
Claims
1. A defocus deblurring method based on a hierarchical model, characterized in that it specifically includes: Step 1: A tokenizer divides the blurred image into patches. The basic feature encoder takes the patches as input and aims to learn a primary representation that retains as much detail as possible. Its goal is to map the blurred image into a latent space and extract primary representations that serve as high-fidelity inputs for the context encoder; the basic feature encoder comprises four ViT layers. Step 2: The primary representation is fed into the three ViT layers of the context encoder, and the result is fed into the blur-to-sharp converter module to obtain a clearer and more abstract representation. Specifically, this includes: a linear layer is used to predict the sharp representation from the blur-related representation: (1), In formula (1), the weight term denotes the weights of the linear layer; the reference features are obtained by applying the same encoder to the reference image; the dimension of the reference image is [missing information]; a classification loss function based on binary cross-entropy is introduced to encourage the two representations to learn features that distinguish blurry images from sharp ones: (2), , In formula (2), the label takes the value 0 for blurry and 1 for sharp; the normalizing count is the total number of elements in the tensor; and the sigmoid function normalizes the feature values to the interval (0,1). The overall loss of the blur-to-sharp converter module is defined as: (3), In formula (3), a distance function measures the distance between representations, and a hyperparameter balances the importance of the classification loss against the other loss components; the context encoder is trained with this loss function to acquire blur-related representations and to predict the sharp representation from the blurred context; a joint embedding architecture is introduced to enhance the representation in the context encoder. The architecture includes an information content preservation module, which uses the regularization criterion defined by VICReg to remove details that are irrelevant to blur. The module maps the representation to an embedding space through a multilayer perceptron and derives the regularization loss from it. Step 3: The decoder takes the outputs of the basic feature encoder and the context encoder as input and reconstructs the deblurred image.
2. The defocus deblurring method based on a hierarchical model according to claim 1, characterized in that regularization losses are used in the information content preservation module. These losses are based on three principles: variance, invariance, and covariance. When computing them, the module receives two sets of inputs: a batch of reference representations and a batch of representations derived from the blur-to-sharp converter. Both are embedded into a low-dimensional space through a mapping, which strengthens the capacity for abstraction: (4), In formula (4), the feature-averaging operator averages over tokens and excludes the [CLS] token used for classification in the Transformer architecture; the dimension of the embedding is [missing information]. The mapping yields two sets of batch embeddings, one for each set of inputs, each containing as many embeddings as the batch size; the regularization loss function of the information content preservation module is defined as: (5), In formula (5), the batch argument denotes either of the two sets of batch embeddings, two further symbols denote hyperparameters, and one symbol denotes the mean of the batch embeddings. The total loss function of the information content preservation module includes three regularization terms and is defined as follows: (6), In formula (6), the two coefficients are hyperparameters used to balance the effects of the loss terms.
3. The defocus deblurring method based on a hierarchical model according to claim 1, characterized in that the decoder uses the features extracted by the basic feature encoder and the context encoder to generate the deblurred image, and the decoder is trained with a conventional loss function: (7), For the entire model, the final loss is defined as: (8), In formula (8), the two coefficients are parameters that balance the different loss terms.
4. The defocus deblurring method based on a hierarchical model according to claim 3, characterized in that the training strategy for the decoder is as follows: the basic feature encoder, parameterized by the pre-trained MAE, is frozen during the training of the context encoder, and the model is trained on the large-scale ImageNet dataset; widely accepted data augmentation techniques are employed, including random cropping to a size of 224×224, random scaling from 0.2 to 1.0, and random horizontal flipping; the loss of the entire model is then minimized according to the loss function defined by formula (8), with the context encoder initialized from the parameters of the pre-training stage.