Image inpainting method based on text-guided and detail-preserving diffusion model

An image inpainting method based on text guidance and a detail-preserving diffusion model addresses the poor restoration efficiency and adaptability of existing low-quality image restoration. The method generates high-quality, detail-preserving images under various degradation conditions and lets users precisely control the restoration process.

CN120451010B · Active · Publication Date: 2026-06-26 · HUNAN UNIV

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
HUNAN UNIV
Filing Date
2025-04-25
Publication Date
2026-06-26

AI Technical Summary

Technical Problem

Existing technologies suffer from poor restoration efficiency and adaptability in low-quality image restoration, especially in the restoration of old photos. They struggle to generate high-fidelity images with vibrant colors, rich details, and high realism, and lack sufficient control over color details at the target level.

Method used

An image inpainting method based on text-guided and detail-preserving diffusion model is adopted. The encoder encodes the clear image into the latent space and adds noise. The conditional diffusion model is used to predict the noise. The decoder generates a new degraded image. The model is trained by hybrid loss to ensure image detail preservation and semantic consistency.

Benefits of technology

It improves the quality and adaptability of image restoration, enabling the generation of high-fidelity, detailed images under various degradation conditions, and lets users precisely control the restoration process through text conditions, thereby enhancing the model's generalization ability and restoration efficiency.

✦ Generated by Eureka AI based on patent content.


Abstract

The embodiment of the present disclosure provides an image restoration method based on a text-guided and detail-preserving diffusion model, which belongs to the technical field of image processing, and specifically comprises the following steps: obtaining a preliminary degraded image according to a clear image, encoding the clear image into a latent space through an encoder and adding noise to obtain a noise latent variable; inputting the preliminary degraded image and the noise latent variable into a conditional diffusion model as conditions to predict noise added in the t-th step, calculating a latent variable that preserves details and clarity, and decoding a new degraded image through a decoder according to the latent variable; inputting the new degraded image and the noise latent variable into the conditional diffusion model as conditions again to obtain a final predicted noise, and calculating a loss according to the final predicted noise and a preliminary predicted noise to train the conditional diffusion model; saving the model weight of the trained conditional diffusion model, and completing the image restoration process of a target image by using the inference process of DDIM. Through the scheme of the present disclosure, the restoration efficiency and adaptability are improved.

Description

Technical Field

[0001] This disclosure relates to the field of image processing technology, and in particular to an image restoration method based on a text-guided and detail-preserving diffusion model.

Background Technology

[0002] Currently, visual artifact restoration of low-quality images, such as old photographs affected by various distortions, remains a core challenge in computer vision and has not yet been fully resolved. Despite significant progress in data-driven methods in recent years, the field still faces two key issues. First, generating high-fidelity restored images that are vibrant in color, rich in detail, and highly realistic remains a challenge. Second, target-level color detail control in old photograph restoration tasks has not yet been effectively addressed.

[0003] Traditional methods typically train specialized image inpainting models from scratch to preserve image details as much as possible. These methods often use low-quality images as additional input to constrain the output space. While these methods have achieved some success in tasks such as super-resolution and deblurring, their generalization ability is limited because they are often designed for specific types of image degradation and require training from scratch. This makes them difficult to adapt to complex scenarios with multiple unknown degradations.

[0004] Meanwhile, large-scale diffusion models have demonstrated superior performance in image generation and text-to-image generation. Therefore, some research has attempted to leverage the generative priors of diffusion models for image inpainting, introducing constraints during the reverse diffusion process to improve restoration quality. However, these methods typically rely on accurate prior knowledge to describe the image degradation process and require individual optimization for each image, significantly limiting their practicality. Consequently, current research based on diffusion models rarely addresses image inpainting tasks that simultaneously handle multiple unknown degradation patterns, especially in the specific application scenario of old photo restoration with complex degradation modes, where efficient and universal solutions are still lacking.

[0005] It is evident that there is an urgent need for an image inpainting method based on a text-guided and detail-preserving diffusion model that offers both high efficiency and adaptability.

Summary of the Invention

[0006] In view of this, the present disclosure provides an image restoration method based on a text-guided and detail-preserving diffusion model, which at least partially solves the problems of poor restoration efficiency and adaptability in the prior art.

[0007] This disclosure provides an image inpainting method based on a text-guided and detail-preserving diffusion model, including:

[0008] Step 1: Obtain a preliminary degraded image based on the clear image; and encode the clear image into the latent space using an encoder and add noise to obtain the noise latent variable.

[0009] Step 2: Using the initial degraded image and noise latent variables as conditions, the noise added in step t is predicted in the conditional diffusion model. Then, the latent variables that retain clear details are obtained through calculation, and the new degraded image is decoded by the decoder accordingly.

[0010] Step 3: Input the new degraded image and noise latent variables as conditions back into the conditional diffusion model to obtain the final predicted noise. Calculate the loss based on the final predicted noise and the initial predicted noise to train the conditional diffusion model.

[0011] Step 4: Save the model weights of the trained conditional diffusion model and use the inference process of DDIM to complete the target image restoration process.

[0012] According to a specific implementation of an embodiment of this disclosure, step 1 specifically includes:

[0013] Step 1.1, define a degradation model that degrades the clear image $I_{hq}$ to obtain a preliminary degraded image $I_{lq}$. The expression for the degradation modeling is:

[0014] $I_{lq} = \varphi_\omega(I_{hq})$

[0015] Where $\varphi_\omega(\cdot)$ denotes a variety of preset degradation modeling operations;

[0016] Step 1.2, using a preset autoencoder, map the clear image $I_{hq}$ to a latent space variable $z_0$, and then gradually add noise to $z_0$. During the noise addition process, the hyperparameters time step $t$ and $\alpha_t$ are set, and Gaussian noise $\epsilon \sim N(0, I)$ with variance $\beta_t$ is added to the encoded latent space variable $z_0$ to generate the noise latent variable $z_t$. The expression for the latent space variable is:

[0017] $z_0 = E(I_{hq})$;

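[0018] The noise addition formula at time t is as follows:

[0019] $z_t = \sqrt{\bar{\alpha}_t}\, z_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon$

[0020] Where $E(\cdot)$ represents the autoencoder, $t \in \{1, \dots, T\}$, $\epsilon \sim N(0, I)$ represents noise drawn from the standard Gaussian distribution, and $\alpha_t = 1 - \beta_t$, $\bar{\alpha}_t = \prod_{i=1}^{t} \alpha_i$.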
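The noise addition of Step 1.2 can be sketched in a few lines of plain Python. This is a minimal illustration, not the patented implementation: it assumes the standard closed-form forward process of a DDPM-style model, a linear variance schedule (the source does not specify one), and a short list of floats standing in for the encoded latent $z_0$. The names `linear_betas`, `alpha_bars`, and `add_noise` are hypothetical.

```python
import math
import random

def linear_betas(T, beta_start=1e-4, beta_end=2e-2):
    """Linearly spaced variances beta_1..beta_T (an assumed schedule)."""
    if T == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (T - 1)
    return [beta_start + i * step for i in range(T)]

def alpha_bars(betas):
    """Cumulative products of alpha_t = 1 - beta_t."""
    out, prod = [], 1.0
    for beta in betas:
        prod *= 1.0 - beta
        out.append(prod)
    return out

def add_noise(z0, t, abar, rng):
    """Return (z_t, eps) for a 1-indexed step t, with eps drawn from N(0, I)."""
    a = abar[t - 1]
    eps = [rng.gauss(0.0, 1.0) for _ in z0]
    zt = [math.sqrt(a) * z + math.sqrt(1.0 - a) * e for z, e in zip(z0, eps)]
    return zt, eps

rng = random.Random(0)
abar = alpha_bars(linear_betas(T=1000))
z0 = [0.5, -1.0, 2.0, 0.0]  # toy stand-in for the encoded latent E(I_hq)
zt, eps = add_noise(z0, t=500, abar=abar, rng=rng)
```

Because the noise is added in closed form, any step $t$ can be sampled directly from $z_0$ without iterating through the earlier steps, which is what makes drawing a random $t$ per training sample cheap.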

[0021] According to a specific implementation of an embodiment of this disclosure, step 2 specifically includes:

[0022] Step 2.1, the preliminary degraded image $I_{lq}$ and the noise latent variable $z_t$ are input as conditions into the conditional diffusion model to predict the noise added at step t:

[0023] $\epsilon_\theta(z_t, t, I_{lq}, P)$

[0024] Where $\epsilon_\theta(\cdot)$ represents the conditional diffusion model, $z_t$ is the noise latent variable at step t, $t$ is the time step for denoising, the preliminary degraded image $I_{lq}$ serves as a condition, and $P$ is the text condition corresponding to the clear image;

[0025] Step 2.2, using the noise $\epsilon_\theta(z_t, t, I_{lq}, P)$ predicted by the model, the preliminarily predicted latent variable at step t-1 and the latent variable that retains clear details are obtained by calculation:

[0026]

[0027] Step 2.3, the latent variable that retains clear details is decoded by the decoder into the new degraded image:

[0028]

[0029] Where $D(\cdot)$ is the decoder, which decodes the latent variable that retains clear details into pixel space to yield the new degraded image.

[0030] According to a specific implementation of this disclosure, before step 2.1, the method further includes:

[0031] Using the BLIP model to obtain the text condition corresponding to the clear image:

[0032] $P = \mathrm{BLIP}(I_{hq})$

[0033] Where $\mathrm{BLIP}(\cdot)$ represents the BLIP model.

[0034] According to a specific implementation of an embodiment of this disclosure, step 3 specifically includes:

[0035] Step 3.1, the new degraded image and the noise latent variable $z_t$ are re-input into the conditional diffusion model as conditions, yielding the final predicted noise $\epsilon_t$;

[0036] Step 3.2, using the noise $\epsilon_t$ predicted by the model, the finally predicted latent variable $z_{t-1}$ at step t-1 is obtained by calculation:

[0037]

[0038] Step 3.3, the mean square error between the final predicted noise $\epsilon_t$ and the actual noise $\epsilon$ is calculated as the first loss:

[0039]

[0040] Where $\mathbb{E}$ represents the expectation over the sampled variables, i.e., the average over multiple samples; $U(1,T)$ represents the uniform distribution on the set $\{1, \dots, T\}$; $N(0,I)$ represents a multidimensional standard Gaussian distribution with mean 0 and identity covariance matrix $I$; and $\|\cdot\|_2^2$ represents the squared L2 norm, i.e., the squared Euclidean distance, used to measure the difference between the predicted noise and the actual noise.

[0041] Step 3.4, the mean square error between the preliminarily predicted noise $\epsilon_\theta(z_t, t, I_{lq}, P)$ and the actual noise $\epsilon$ is calculated as the second loss:

[0042]

[0043] Step 3.5: Train the conditional diffusion model using the first loss and the second loss.
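The loss formulas of Steps 3.3 and 3.4 are not reproduced in this text. Based on the symbol definitions given for them (expectation, uniform time-step sampling, standard Gaussian noise, squared L2 norm), they plausibly take the following form. The labels $\mathcal{L}_1$ and $\mathcal{L}_2$ and the exact set of variables under the expectation are assumptions made here for illustration, not the patent's notation.

```latex
\mathcal{L}_{1} = \mathbb{E}_{z_0,\; t \sim U(1,T),\; \epsilon \sim N(0,I)}
  \left[ \left\| \epsilon - \epsilon_t \right\|_2^2 \right],
\qquad
\mathcal{L}_{2} = \mathbb{E}_{z_0,\; t \sim U(1,T),\; \epsilon \sim N(0,I)}
  \left[ \left\| \epsilon - \epsilon_\theta(z_t, t, I_{lq}, P) \right\|_2^2 \right]
```

Both terms compare a noise prediction with the same ground-truth noise $\epsilon$; they differ only in whether the model was conditioned on the new degraded image (first loss) or on the preliminary degraded image $I_{lq}$ (second loss).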

[0044] According to a specific implementation of an embodiment of this disclosure, step 4 specifically includes:

[0045] A noise latent variable $z_t$ is sampled from the standard Gaussian distribution $N(0, I)$ and paired with the corresponding preliminary degraded image $I_{lq}$. For the different time steps sampled by DDIM, the weights of the trained conditional diffusion model $\epsilon_\theta$ are used to predict the noise $\epsilon_\theta(z_t, t, I_{lq}, P)$ at each time step t. The clear latent variable is then calculated and passed through the decoder to obtain the final clear image, thus completing the image restoration task.

[0046] The image inpainting scheme based on a text-guided and detail-preserving diffusion model in this embodiment includes: Step 1, obtaining a preliminary degraded image based on a clear image, and encoding the clear image into a latent space using an encoder and adding noise to obtain a noise latent variable; Step 2, using the preliminary degraded image and the noise latent variable as conditions, inputting them into a conditional diffusion model to predict the noise added in step t, and then calculating the latent variable that preserves clear details, thereby decoding a new degraded image using a decoder; Step 3, re-inputting the new degraded image and the noise latent variable as conditions into the conditional diffusion model to obtain the final predicted noise, and calculating the loss based on the final predicted noise and the preliminary predicted noise to train the conditional diffusion model; Step 4, saving the model weights of the trained conditional diffusion model, and using the inference process of DDIM to complete the target image inpainting process.

[0047] The beneficial effects of the embodiments disclosed herein are as follows:

[0048] 1. The method of the present invention improves the automatic generation of high-quality images from low-quality degraded images. This method avoids the limitations of manually designing degradation models and ensures the preservation of image details, thereby improving the overall quality of the restored image.

[0049] 2. This invention improves the learning objective of the diffusion model. During the training process, two different loss conditions are randomly selected to train the model, which improves the model's generalization ability to accept different degradation conditions, while preserving image details to the greatest extent.

[0050] 3. This invention allows users to precisely control the repair of low-quality images through text conditions, ensuring the semantic consistency of the generated high-quality images.

Attached Figure Description

[0051] To more clearly illustrate the technical solutions of the embodiments of this disclosure, the drawings used in the embodiments will be briefly introduced below. Obviously, the drawings described below are only some embodiments of this disclosure. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0052] Figure 1 is a flowchart of an image restoration method based on a text-guided and detail-preserving diffusion model, provided in an embodiment of this disclosure.

Detailed Implementation

[0053] The embodiments of this disclosure will now be described in detail with reference to the accompanying drawings.

[0054] The following specific examples illustrate the implementation of this disclosure. Those skilled in the art can easily understand other advantages and effects of this disclosure from the content disclosed in this specification. Obviously, the described embodiments are only a part of the embodiments of this disclosure, and not all of them. This disclosure can also be implemented or applied through other different specific embodiments, and the details in this specification can also be modified or changed based on different viewpoints and applications without departing from the spirit of this disclosure. It should be noted that, in the absence of conflict, the following embodiments and features in the embodiments can be combined with each other. Based on the embodiments in this disclosure, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of this disclosure.

[0055] It should be noted that various aspects of embodiments within the scope of the appended claims are described below. It will be apparent that the aspects described herein can be embodied in a wide variety of forms, and any particular structure and / or function described herein is merely illustrative. Based on this disclosure, those skilled in the art will understand that one aspect described herein can be implemented independently of any other aspect, and two or more of these aspects can be combined in various ways. For example, any number of aspects set forth herein can be used to implement the device and / or practice the method. Additionally, this device and / or method can be implemented using structures and / or functionalities other than one or more of the aspects set forth herein.

[0056] It should also be noted that the illustrations provided in the following embodiments are only schematic representations of the basic concept of this disclosure. The illustrations only show the components related to this disclosure and are not drawn according to the number, shape and size of the components in actual implementation. In actual implementation, the form, quantity and proportion of each component can be arbitrarily changed, and the layout of the components may also be more complex.

[0057] Furthermore, specific details are provided in the following description to facilitate a thorough understanding of the examples. However, those skilled in the art will understand that the described aspects can be practiced without these specific details.

[0058] This disclosure provides an image restoration method based on a text-guided and detail-preserving diffusion model, which can be applied to image restoration processes in image processing scenarios.

[0059] Figure 1 is a flowchart of an image inpainting method based on a text-guided and detail-preserving diffusion model, provided in an embodiment of this disclosure. As shown in Figure 1, the method mainly includes the following steps:

[0060] Step 1: Obtain a preliminary degraded image based on the clear image; and encode the clear image into the latent space using an encoder and add noise to obtain the noise latent variable.

[0061] In practice, a preliminary degraded image is obtained based on a high-quality image. The clear image is then encoded into the latent space by an encoder, and noise is added to obtain the noise latent variable $z_t$. The specific process mainly includes:

[0062] Step 1.1: to obtain the preliminary degraded image $I_{lq}$, a degradation model is defined that degrades the high-quality image $I_{hq}$ into $I_{lq}$. The degradation model is as follows:

[0063] $I_{lq} = \varphi_\omega(I_{hq})$

[0064] Where $\varphi_\omega(\cdot)$ represents existing artificial degradation modeling operations, which model the various factors and processes that may affect image quality, such as noise, blur, compression, and other forms of distortion. The resulting low-quality image $I_{lq}$ exhibits degradation ranging from slight to severe and, together with the high-quality image $I_{hq}$, forms a training data pair.
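As a rough illustration of what a degradation operator of this kind might look like, the sketch below chains blur, additive Gaussian noise, and coarse quantization (a crude proxy for compression loss) on a small grayscale image stored as nested lists with values in [0, 1]. The specific operations, parameter ranges, and the names `box_blur`, `add_gaussian_noise`, `quantize`, and `degrade` are assumptions chosen for the example; the source only says that noise, blur, compression, and other distortions are modeled.

```python
import random

def box_blur(img, radius=1):
    """Mean filter with edge clamping; a simple stand-in for blur."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            vals = [
                img[min(max(y + dy, 0), h - 1)][min(max(x + dx, 0), w - 1)]
                for dy in range(-radius, radius + 1)
                for dx in range(-radius, radius + 1)
            ]
            out[y][x] = sum(vals) / len(vals)
    return out

def add_gaussian_noise(img, sigma, rng):
    """Add zero-mean Gaussian noise and clip the result to [0, 1]."""
    return [[min(1.0, max(0.0, p + rng.gauss(0.0, sigma))) for p in row] for row in img]

def quantize(img, levels):
    """Coarse intensity quantization, a crude proxy for compression loss."""
    return [[round(p * (levels - 1)) / (levels - 1) for p in row] for row in img]

def degrade(img_hq, rng):
    """Sample degradation parameters omega, then apply the operations in sequence."""
    omega = {
        "radius": rng.choice([1, 2]),
        "sigma": rng.uniform(0.01, 0.1),
        "levels": rng.choice([8, 16, 32]),
    }
    img = box_blur(img_hq, omega["radius"])
    img = add_gaussian_noise(img, omega["sigma"], rng)
    img = quantize(img, omega["levels"])
    return img, omega

rng = random.Random(0)
img_hq = [[((x + y) % 8) / 7.0 for x in range(8)] for y in range(8)]  # toy "clear" image
img_lq, omega = degrade(img_hq, rng)  # (img_hq, img_lq) forms one training pair
```

Sampling the parameters per image is what gives the training pairs their spread from slight to severe degradation.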

[0065] Step 1.2: to obtain the noise latent variable $z_t$, the autoencoder in Stable Diffusion maps the high-quality image $I_{hq}$ to a latent space variable $z_0$, and noise is then gradually added to $z_0$. During the noise addition process, the hyperparameters time step $t$ and $\alpha_t$ are set, and Gaussian noise $\epsilon \sim N(0, I)$ with variance $\beta_t$ is added to the encoded latent variable $z_0$ to generate the noise latent variable $z_t$. The encoding and the formula for adding noise at time t are:

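[0066] $z_0 = E(I_{hq})$,

[0067] $z_t = \sqrt{\bar{\alpha}_t}\, z_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon$

[0068] Where $E(\cdot)$ represents the autoencoder, $t \in \{1, \dots, T\}$, $\epsilon \sim N(0, I)$ represents noise drawn from the standard Gaussian distribution, and $\alpha_t = 1 - \beta_t$, $\bar{\alpha}_t = \prod_{i=1}^{t} \alpha_i$.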

[0069] Step 2: Using the initial degraded image and noise latent variables as conditions, the noise added in step t is predicted in the conditional diffusion model. Then, the latent variables that retain clear details are obtained through calculation, and the new degraded image is decoded by the decoder accordingly.

[0070] In practice, the preliminary degraded image and the noise latent variable $z_t$ are input as conditions into the conditional diffusion model to predict the noise added at step t. The latent variable that retains clear details is then obtained by calculation, and the new degraded image is decoded from it by the decoder. The specific process is as follows:

[0071] Step 2.1: The preliminary degraded image $I_{lq}$ and the noise latent variable $z_t$ are input as conditions into the conditional diffusion model to predict the noise added at step t:

[0072] $\epsilon_\theta(z_t, t, I_{lq}, P)$

[0073] Where $\epsilon_\theta(\cdot)$ represents the conditional diffusion model, which takes as input the latent variable $z_t$ at step t, the time step $t$ for denoising, the preliminary degraded image $I_{lq}$ as a condition, and the text condition $P$; the output of the conditional diffusion model is the predicted noise.

[0074] This step uses a conditional diffusion model that can accept both images and text as conditions, such as the ControlNet model. During the training phase, the text condition $P$ is obtained by using the BLIP model to perform the image-to-text task:

[0075] $P = \mathrm{BLIP}(I_{hq})$,

[0076] Where $\mathrm{BLIP}(\cdot)$ represents the BLIP model, used to obtain a text description of the clear image $I_{hq}$. The text condition $P$ ensures the semantic consistency of the restored image and also gives the user control over the restoration.

[0077] Step 2.2: Using the noise $\epsilon_\theta(z_t, t, I_{lq}, P)$ predicted by the model, the preliminarily predicted latent variable at step t-1 and the latent variable that retains clear details are obtained by calculation. The calculation formula is as follows:

[0078]

[0079] Since the latent variable obtained by calculation in this step does not conform to the denoising process of DDIM or DDPM, it is not the final clear result that is ultimately desired, but rather a latent variable that still carries some degradation while retaining many details.

[0080] Step 2.3: The latent variable that retains clear details is decoded by the decoder into the new degraded image:

[0081]

[0082] Where $D(\cdot)$ is the decoder, which decodes the latent variable that retains clear details into pixel space to yield a new degraded image. The image obtained in this way retains both some degradation and a great deal of image detail. Importantly, the new degraded image is generated automatically by the diffusion model at different time steps t, without the need to manually design complex degradation processes.

[0083] Step 3: Input the new degraded image and noise latent variables as conditions back into the conditional diffusion model to obtain the final predicted noise. Calculate the loss based on the final predicted noise and the initial predicted noise to train the conditional diffusion model.

[0084] In practice, the new degraded image and the noise latent variable $z_t$ are fed back into the conditional diffusion model as conditions to obtain the final predicted noise, and the loss is calculated from the final predicted noise and the preliminary predicted noise to train the diffusion model. The specific process can include:

[0085] Step 3.1: The new degraded image and the noise latent variable $z_t$ are re-input into the conditional diffusion model as conditions, yielding the final predicted noise $\epsilon_t$.

[0086] In this process, the model's input condition changes from the original preliminary degraded image $I_{lq}$ to the new degraded image. With this explicit constraint on the diffusion iteration process, the noise latent variable at step t-1 can be predicted using the following operation:

[0087]

[0088] The noise latent variable $z_{t-1}$ obtained through this process retains more of the image details from $I_{lq}$, which effectively addresses the instability problem of the diffusion model.

[0089] Step 3.2: Calculate the loss by mixing the final predicted noise and the initial predicted noise to train the diffusion model.

[0090] To improve the generalization ability of the trained model, hybrid training is used so that the model fits well on the preliminary degraded images $I_{lq}$, on the new degraded images, and on more types of image degradation. Two losses are set, and they are selected with probabilities $p_{iide}$ and $1 - p_{iide}$ respectively. During training, with probability $p_{iide}$ the loss is defined as the mean square error between the final predicted noise $\epsilon_t$ and the actual noise $\epsilon$:

[0091]

[0092] With probability $1 - p_{iide}$, the loss is defined as the mean square error between the predicted noise $\epsilon_\theta(z_t, t, I_{lq}, P)$ and the actual noise $\epsilon$:

[0093]

[0094] Where $p_{iide} \in (0, 1)$, $U(1,T)$ represents the uniform distribution on the set $\{1, \dots, T\}$, and $N(0,I)$ represents a multidimensional standard Gaussian distribution with mean 0 and covariance matrix $I$. These two losses constrain the preliminarily predicted latent variable at step t-1 and $z_{t-1}$ to remain as similar as possible, thereby ensuring that the final predicted $z_{t-1}$ retains as much image detail from $I_{lq}$ as possible.
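The random choice between the two losses can be sketched as follows. This is a minimal illustration under stated assumptions: `toy_predictor` is a hypothetical stand-in for the conditional diffusion model (it simply scales its condition), the conditions and latents are short lists of floats rather than images, and `hybrid_loss` is a name introduced here. Only the selection logic, one condition per training sample chosen with probability $p_{iide}$, reflects the text.

```python
import random

def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def hybrid_loss(predict_noise, z_t, t, cond_lq, cond_new, text, eps, p_iide, rng):
    """Pick one of the two conditions at random and score the noise prediction.

    With probability p_iide the new degraded image is the condition;
    otherwise the preliminary degraded image I_lq is used.
    """
    use_new = rng.random() < p_iide
    cond = cond_new if use_new else cond_lq
    eps_pred = predict_noise(z_t, t, cond, text)
    return mse(eps_pred, eps), use_new

def toy_predictor(z_t, t, cond, text):
    """Hypothetical stand-in for the conditional diffusion model."""
    return [0.1 * c for c in cond]

rng = random.Random(0)
z_t = [0.3, -0.2, 0.9, 0.0]
eps = [0.1, 0.1, 0.1, 0.1]        # the noise that was actually added
cond_lq = [0.0, 0.0, 0.0, 0.0]    # toy preliminary degraded image
cond_new = [1.0, 1.0, 1.0, 1.0]   # toy new degraded image

results = [
    hybrid_loss(toy_predictor, z_t, 10, cond_lq, cond_new, "a caption", eps, 0.5, rng)
    for _ in range(2000)
]
frac_new = sum(used for _, used in results) / len(results)
```

Over many samples, roughly a fraction $p_{iide}$ of the updates are driven by the model-generated degradation and the remainder by the hand-designed one, so the model sees both kinds of condition during training.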

[0095] Step 4: Save the model weights of the trained conditional diffusion model, and use the inference process of DDIM to complete the target image restoration process.

[0096] In practice, a noise latent variable $z_t$ is first sampled from $N(0, I)$ and paired with the corresponding low-quality image $I_{lq}$. For the different time steps sampled by DDIM, the weights of the trained conditional diffusion model $\epsilon_\theta$ are used to predict the noise $\epsilon_\theta(z_t, t, I_{lq}, P)$ at each step, so that the latent variable is gradually denoised. The clear latent variable $z_0$ is thus predicted and then decoded to obtain the final clear image, completing the image restoration task.
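The inference loop can be sketched with the standard deterministic DDIM update (eta = 0) over an evenly spaced subset of time steps. This is an illustration under assumptions, not the patent's code: the linear variance schedule, the step spacing, and the names `alpha_bars`, `ddim_sample`, and `oracle` are all hypothetical. The `oracle` predictor stands in for the trained model and knows the clean latent, which makes it easy to check that the sampler returns it; decoding the result to pixel space is omitted.

```python
import math
import random

def alpha_bars(T, beta_start=1e-4, beta_end=2e-2):
    """Cumulative products of alpha_t = 1 - beta_t for an assumed linear schedule."""
    out, prod = [], 1.0
    for i in range(T):
        beta = beta_start + (beta_end - beta_start) * i / (T - 1)
        prod *= 1.0 - beta
        out.append(prod)
    return out

def ddim_sample(predict_noise, z_T, cond, text, abar, num_steps):
    """Deterministic DDIM (eta = 0) over an evenly spaced subset of time steps."""
    T = len(abar)
    steps = sorted(
        {round(1 + i * (T - 1) / (num_steps - 1)) for i in range(num_steps)},
        reverse=True,
    )
    z = list(z_T)
    for i, t in enumerate(steps):
        a_t = abar[t - 1]
        # After the last sampled step, jump to the clean latent (alpha-bar = 1).
        a_prev = abar[steps[i + 1] - 1] if i + 1 < len(steps) else 1.0
        eps = predict_noise(z, t, cond, text)
        # Estimate the clean latent, then re-noise it to the previous step.
        z0_hat = [(x - math.sqrt(1.0 - a_t) * e) / math.sqrt(a_t) for x, e in zip(z, eps)]
        z = [math.sqrt(a_prev) * x0 + math.sqrt(1.0 - a_prev) * e for x0, e in zip(z0_hat, eps)]
    return z

abar = alpha_bars(1000)
z0_true = [0.5, -1.0, 2.0, 0.0]

def oracle(z, t, cond, text):
    """Hypothetical stand-in for the trained model: returns the exact noise in z."""
    a = abar[t - 1]
    return [(x - math.sqrt(a) * x0) / math.sqrt(1.0 - a) for x, x0 in zip(z, z0_true)]

rng = random.Random(0)
z_T = [rng.gauss(0.0, 1.0) for _ in z0_true]
z0 = ddim_sample(oracle, z_T, cond=None, text="an old family photo", abar=abar, num_steps=50)
```

With a real model, `cond` would carry the low-quality image $I_{lq}$ and `text` the user's prompt; sampling 50 of 1000 steps is one common way DDIM trades a small amount of quality for much faster inference.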

[0097] This embodiment provides an image inpainting method based on text guidance and a detail-preserving diffusion model. By leveraging the generative priors of the diffusion model, it designs a method that guides the generation process from a given low-quality image to a high-quality image, ensuring the restored high-quality image faithfully reproduces the original content. Compared to methods that merely use low-quality images obtained through artificial degradation modeling as training conditions, our method can successfully generate high-fidelity restored images even when the low-quality input image has multiple unknown degradations, presenting vibrant colors and highly realistic details. Furthermore, our method allows users to precisely control the restored image through text conditions, ensuring semantic consistency. This disclosed method improves the model's generalization ability to accept different degradation conditions, maximizing the preservation of image details, while also allowing users to precisely control the restoration of low-quality images through text conditions.

[0098] It should be understood that the various parts of this disclosure can be implemented in hardware, software, firmware, or a combination thereof.

[0099] The above description is merely a specific embodiment of this disclosure, but the scope of protection of this disclosure is not limited thereto. Any variations or substitutions that can be easily conceived by those skilled in the art within the scope of the technology disclosed in this disclosure should be included within the scope of protection of this disclosure. Therefore, the scope of protection of this disclosure should be determined by the scope of the claims.

Claims

1. An image inpainting method based on a text-guided and detail-preserving diffusion model, characterized in that it includes:
Step 1: Obtain a preliminary degraded image based on the clear image; and encode the clear image into the latent space using an encoder and add noise to obtain the noise latent variable. Step 1 specifically includes:
Step 1.1, define a degradation model that degrades the clear image $I_{hq}$ to obtain a preliminary degraded image $I_{lq}$. The expression for the degradation modeling is $I_{lq} = \varphi_\omega(I_{hq})$, where $\varphi_\omega(\cdot)$ denotes a variety of preset degradation modeling operations;
Step 1.2, using the preset autoencoder, map the clear image $I_{hq}$ to a latent space variable $z_0$, and then gradually add noise to $z_0$; during the noise addition process, the hyperparameters time step $t$ and $\alpha_t$ are set, and Gaussian noise $\epsilon$ with variance $\beta_t$ is added to the encoded latent space variable $z_0$ to generate the noise latent variable $z_t$. The expression for the latent space variable is $z_0 = E(I_{hq})$; the noise addition formula at time t is $z_t = \sqrt{\bar{\alpha}_t}\, z_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon$, where $E(\cdot)$ represents the autoencoder, $t \in \{1, \dots, T\}$, $\epsilon \sim N(0, I)$ represents noise drawn from the standard Gaussian distribution, and $\alpha_t = 1 - \beta_t$, $\bar{\alpha}_t = \prod_{i=1}^{t} \alpha_i$;
Step 2: Using the preliminary degraded image and the noise latent variable as conditions, the noise added in step t is predicted by the conditional diffusion model. Then, the latent variable that retains clear details is obtained through calculation, and the new degraded image is decoded from it by the decoder.
Step 3: Input the new degraded image and the noise latent variable as conditions back into the conditional diffusion model to obtain the final predicted noise. Calculate the loss based on the final predicted noise and the preliminary predicted noise to train the conditional diffusion model.
Step 4: Save the model weights of the trained conditional diffusion model, and use the inference process of DDIM to complete the target image restoration process.

2. The method according to claim 1, characterized in that Step 2 specifically includes:
Step 2.1, the preliminary degraded image $I_{lq}$ and the noise latent variable $z_t$ are input as conditions into the conditional diffusion model to predict the noise added at step t, $\epsilon_\theta(z_t, t, I_{lq}, P)$, where $\epsilon_\theta(\cdot)$ represents the conditional diffusion model, $z_t$ is the noise latent variable at step t, $t$ is the time step for denoising, the preliminary degraded image $I_{lq}$ serves as a condition, and $P$ is the text condition corresponding to the clear image;
Step 2.2, using the noise $\epsilon_\theta(z_t, t, I_{lq}, P)$ predicted by the model, the preliminarily predicted latent variable at step t-1 and the latent variable that retains clear details are obtained by calculation;
Step 2.3, the latent variable that retains clear details is decoded by the decoder into the new degraded image, where $D(\cdot)$ is the decoder, which decodes the latent variable that retains clear details into pixel space to yield the new degraded image.

3. The method according to claim 2, characterized in that, prior to step 2.1, the method further includes: using the BLIP model to obtain the text condition corresponding to the clear image, $P = \mathrm{BLIP}(I_{hq})$, where $\mathrm{BLIP}(\cdot)$ represents the BLIP model.

4. The method according to claim 3, characterized in that Step 3 specifically includes:
Step 3.1, the new degraded image and the noise latent variable $z_t$ are re-input into the conditional diffusion model as conditions to obtain the final predicted noise $\epsilon_t$;
Step 3.2, using the noise $\epsilon_t$ predicted by the model, the finally predicted latent variable $z_{t-1}$ at step t-1 is obtained by calculation;
Step 3.3, the mean square error between the final predicted noise $\epsilon_t$ and the actual noise $\epsilon$ is calculated as the first loss, where $\mathbb{E}$ represents the mathematical expectation over the sampled variables, that is, the average taken over multiple samples; $U(1,T)$ represents the uniform distribution on the set $\{1, \dots, T\}$; $N(0,I)$ represents a multidimensional standard Gaussian distribution with mean 0 and identity covariance matrix $I$; and $\|\cdot\|_2^2$ represents the squared L2 norm, which is the squared Euclidean distance, used to measure the difference between the predicted noise and the actual noise;
Step 3.4, the mean square error between the preliminarily predicted noise $\epsilon_\theta(z_t, t, I_{lq}, P)$ and the actual noise $\epsilon$ is calculated as the second loss;
Step 3.5, train the conditional diffusion model using the first loss and the second loss.

5. The method according to claim 4, characterized in that Step 4 specifically includes: a noise latent variable $z_t$ is sampled from the standard Gaussian distribution $N(0, I)$ and paired with the corresponding preliminary degraded image $I_{lq}$; for the different time steps sampled by DDIM, the weights of the trained conditional diffusion model $\epsilon_\theta$ are used to predict the noise $\epsilon_\theta(z_t, t, I_{lq}, P)$ at each time step t; the clear latent variable is then calculated and passed through the decoder to obtain the final clear image, thus completing the image restoration task.