Face sketch photograph synthesis method based on dual-domain guided denoising diffusion model

A dual-domain guided denoising diffusion model, combined with a frequency-assisted guided fine module and a residual network attention feature module, addresses the insufficient use of spatial and frequency domain information in existing technologies. It achieves high-quality face sketch-photo synthesis and improves identity preservation and detail restoration.

CN122265059A · Pending · Publication Date: 2026-06-23 · CHONGQING UNIV OF TECH

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
CHONGQING UNIV OF TECH
Filing Date
2026-01-26
Publication Date
2026-06-23


Abstract

The application provides a face sketch-photo synthesis method based on a dual-domain guided denoising diffusion model, which realizes the collaborative optimization of spatial domain and frequency domain information and the identity consistency maintenance in the denoising diffusion process by introducing a frequency auxiliary guided fine module and a residual network attention feature module. The frequency auxiliary guided fine module effectively reduces the modal gap between the sketch and the photo by independent modulation of the amplitude spectrum and the phase spectrum, and strengthens the identity key feature maintenance; the residual network attention feature module enhances the structure perception and detail restoration ability of the face key area through residual learning and attention mechanism. The two modules work together, so that the model achieves better visual quality and identity recognition accuracy than existing methods on multiple public data sets, significantly improving the realism and practicality of sketch-photo synthesis.

Description

Technical Field

[0001] This invention relates to the field of image processing technology, and in particular to a face sketch-photo synthesis method based on a dual-domain guided denoising diffusion model.

Background Technology

[0002] Face sketch-photo synthesis is an important research direction in the field of computer vision. Its core goal is to convert hand-drawn or generated sketches into realistic, identity-consistent facial photographs. This technology has a wide range of applications.

[0003] Early sketch-photo synthesis methods primarily relied on regression-based or example-based models, such as locally linear embeddings and Bayesian frameworks. While these methods achieved cross-modal mapping to some extent, they often suffered from high computational costs, severe structural distortion, and limited generalization to real-world scenes, making it difficult to capture subtle texture features such as facial skin and hair. With the development of deep learning, generative adversarial networks (GANs) have gradually become the mainstream framework in this field. For example, some studies have attempted to improve the quality and identity preservation of synthesized images by introducing facial component priors, bidirectional joint training strategies, and hierarchical GAN structures. Nevertheless, GAN-based methods still generally face problems such as training instability, mode collapse, and insufficient preservation of identity structure consistency, and they easily produce blurry or anatomically inaccurate outputs.

[0004] In recent years, denoising diffusion probabilistic models have demonstrated superior performance in image generation and translation tasks due to their strong training stability and wide distribution coverage. These models gradually reconstruct the target image from noise through a stochastic process of forward noising and reverse denoising, exhibiting stronger detail restoration and distribution fitting capabilities. Meanwhile, frequency domain analysis, which provides complementary representations of the global structure and local details of an image, has also begun to be introduced into image synthesis tasks. For example, the Fast Fourier Transform can decompose an image into amplitude and phase spectra, which can be used to preserve style and geometric identity information, respectively. However, most existing methods are still limited to a single spatial domain or to simple frequency domain operations, and fail to mine and guide spatial and frequency domain information jointly. This results in significant modal gaps in cross-modal synthesis, making it difficult to achieve high-fidelity texture and structure reconstruction while maintaining identity consistency.

[0005] Therefore, how to design a face sketch-photo synthesis method that can simultaneously utilize spatial and frequency domain information and achieve cross-modal alignment and identity preservation during the diffusion process has become a technical problem that urgently needs to be solved in this field.

Summary of the Invention

[0006] The purpose of this invention is to provide a method for synthesizing facial sketch photos based on a dual-domain guided denoising diffusion model, so as to solve the problems existing in the prior art.

[0007] To achieve the above objectives, the present invention provides the following solution: a method for synthesizing face sketch photos based on a dual-domain guided denoising diffusion model, comprising: obtaining the sketch image to be synthesized; and inputting the sketch image into a trained sketch-photo synthesis network to generate a corresponding face photo. The sketch-photo synthesis network is built on a denoising diffusion probabilistic model and includes an encoder, a decoder, and a frequency-assisted guidance fine module and a residual network attention feature module embedded in the connection path between the encoder and the decoder. The trained sketch-photo synthesis network is obtained by training on a sketch-photo training set.

[0008] Preferably, the frequency-assisted guidance fine module includes: a frequency-gated mining submodule, used to perform spatial-spectral decoupling and selective enhancement on input features; and a frequency-assisted fine-tuning submodule, used to independently modulate, selectively learn, and recombine the amplitude spectrum and phase spectrum obtained by fast Fourier transform decomposition.

[0009] Preferably, the frequency-gated mining submodule achieves pixel-level feature selection through a dual-gated mechanism, including the following steps: the input feature map is split into two paths using a 1×1 convolution; one path is processed by depthwise separable convolution and channel attention, while the other path generates a gated mask through linear projection and GELU activation; and the two outputs are multiplied element by element to obtain the preliminary modulation features.
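
The dual-gated steps above can be illustrated with a minimal NumPy sketch. It is not the patented implementation: the weight shapes, the simplified channel attention (global average pooling followed by a sigmoid, with no learned layers), and all function names are assumptions made for illustration only.

```python
import numpy as np

def gelu(x):
    # tanh approximation of the GELU activation
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def pointwise_conv(x, w):
    # 1x1 convolution: x is (C_in, H, W), w is (C_out, C_in)
    return np.einsum("oc,chw->ohw", w, x)

def depthwise_conv3x3(x, k):
    # one 3x3 kernel per channel: x is (C, H, W), k is (C, 3, 3); zero padding
    c, h, w = x.shape
    xp = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros_like(x)
    for dy in range(3):
        for dx in range(3):
            out += k[:, dy, dx, None, None] * xp[:, dy:dy + h, dx:dx + w]
    return out

def channel_attention(x):
    # simplified stand-in (assumption): per-channel weight from global average pooling
    return x * sigmoid(x.mean(axis=(1, 2), keepdims=True))

def frequency_gated_mining(x, w_split, k_dw, w_pw, w_gate):
    c = x.shape[0]
    x12 = pointwise_conv(x, w_split)            # 1x1 conv producing 2C channels
    x1, x2 = x12[:c], x12[c:]                   # split into the two paths
    # path 1: depthwise separable conv (depthwise + pointwise), then channel attention
    feat = channel_attention(pointwise_conv(depthwise_conv3x3(x1, k_dw), w_pw))
    # path 2: linear projection over channels, then GELU, giving the gate mask
    gate = gelu(pointwise_conv(x2, w_gate))
    return feat * gate                          # element-wise multiplication

rng = np.random.default_rng(0)
C, H, W = 4, 8, 8
x = rng.standard_normal((C, H, W))
out = frequency_gated_mining(
    x,
    w_split=rng.standard_normal((2 * C, C)),
    k_dw=rng.standard_normal((C, 3, 3)),
    w_pw=rng.standard_normal((C, C)),
    w_gate=rng.standard_normal((C, C)),
)
print(out.shape)  # (4, 8, 8)
```

Because the gate multiplies the feature path element by element, a gate value near zero suppresses the corresponding pixel and channel, which is one way to read the "pixel-level feature selection" described above.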

[0010] Preferably, the frequency-assisted fine-tuning submodule performs the following steps: the input features are subjected to a Fast Fourier Transform, which decomposes them into an amplitude spectrum and a phase spectrum; global context enhancement of the amplitude spectrum is achieved through a sliding window mechanism and dual pooling operations; phase spectrum modulation is guided by the correlation difference information between the amplitude spectrum features and the original phase spectrum features; and the modulated amplitude spectrum and phase spectrum are recombined into spatial domain features through the inverse fast Fourier transform.

[0011] Preferably, the residual network attention feature module includes: residual blocks, used to embed time-step information and perform feature enhancement; and a cross-attention mechanism, used to capture long-range dependencies and reinforce identity-critical regions at the bottleneck resolution.

[0012] Preferably, the query, key, and value matrices in the cross-attention mechanism are generated by sequentially applying 1×1 pointwise convolutions and 3×3 depthwise convolutions.

[0013] Preferably, the training process of the sketch-photo synthesis network includes: applying forward diffusion to the target images in the training set, gradually adding noise; in the reverse denoising process, using the corresponding sketch image as a conditional auxiliary input, with the frequency-assisted guidance fine module and the residual network attention feature module working together to guide noise prediction; and optimizing the network with a multi-task loss function until the model converges.

[0014] Preferably, the multi-task loss function includes: a noise reconstruction loss, used to minimize the difference between predicted noise and true noise; a frequency consistency loss, used to constrain the structural consistency of the amplitude spectrum and phase spectrum in the frequency domain; a Charbonnier loss, used for robust reconstruction that enhances texture and edge details; and a perceptual loss, used to ensure consistency between the generated image and the real image at the semantic feature level.

[0015] Preferably, during the training and inference phases, the input image undergoes spectrum enhancement based on the Fast Fourier Transform without altering the image structure information, in order to balance the distribution of high-frequency components among heterogeneous datasets and suppress random noise.

[0016] The present invention also provides an electronic device, including a processor, a memory and a communication interface, wherein the memory stores a computer program, and when the computer program is executed by the processor, it implements a method for synthesizing a human face sketch photo based on a dual-domain guided denoising diffusion model.

[0017] The present invention achieves the following beneficial technical effects compared to the prior art: this invention provides a face sketch-photo synthesis method based on a dual-domain guided denoising diffusion model. By introducing a frequency-assisted guided fine-tuning module and a residual network attention feature module, it achieves synergistic optimization of spatial and frequency domain information and preservation of identity consistency during the denoising diffusion process. Specifically, the frequency-assisted guided fine-tuning module effectively reduces the modal gap between the sketch and the photograph by independently modulating the amplitude and phase spectra, and strengthens the preservation of key identity features. The residual network attention feature module enhances the ability to perceive the structure and restore details of key facial regions through residual learning and attention mechanisms. The two modules work together, enabling the model to achieve superior visual quality and identity recognition accuracy compared to existing methods on multiple public datasets, significantly improving the realism and practicality of sketch-photo synthesis.

Brief Description of the Drawings

[0018] To more clearly illustrate the technical solutions in the embodiments of the present invention or the prior art, the drawings used in the embodiments will be briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0019] Figure 1 is a schematic diagram of the overall process of the face sketch-photo synthesis method based on the dual-domain guided denoising diffusion model provided in an embodiment of the present invention;
Figure 2 is a schematic diagram of the overall architecture of the sketch-photo synthesis network provided in an embodiment of the present invention;
Figure 3 is a schematic diagram comparing sketch-photo synthesis results on the CUHK Student dataset provided in an embodiment of the present invention;
Figure 4 is a schematic diagram comparing sketch-photo synthesis results on the AR dataset provided in an embodiment of the present invention;
Figure 5 is a schematic diagram comparing sketch-photo synthesis results on the XM2VTS dataset provided in an embodiment of the present invention;
Figure 6 is a schematic diagram comparing sketch-photo synthesis results on the CUGSF dataset provided in an embodiment of the present invention;
Figure 7 is a comparative diagram of the various methods on multiple image quality evaluation metrics provided in an embodiment of the present invention;
Figure 8 is a schematic diagram comparing cross-dataset synthesis results provided in an embodiment of the present invention;
Figure 9 is a visual comparison diagram of the ablation experiments provided in an embodiment of the present invention;
Figure 10 is a schematic diagram of the feature distribution and color histogram statistical analysis provided in an embodiment of the present invention.

Detailed Implementation

[0020] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.

[0021] The purpose of this invention is to provide a method for synthesizing facial sketch photos based on a dual-domain guided denoising diffusion model, so as to solve the problems existing in the prior art.

[0022] To make the above-mentioned objects, features and advantages of the present invention more apparent and understandable, the present invention will be further described in detail below with reference to the accompanying drawings and specific embodiments.

[0023] Example 1: As shown in Figure 1, the method includes the following steps: acquiring a sketch image to be synthesized; and inputting the sketch image into a trained sketch-photo synthesis network to generate a corresponding face photo. The sketch-photo synthesis network is built on a denoising diffusion probabilistic model. Its core architecture, shown in Figure 2, includes an encoder, a decoder, and a frequency-assisted guidance fine-tuning module and a residual network attention feature module embedded between the encoder and decoder. The trained sketch-photo synthesis network is obtained by training on a sketch-photo training set.

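[0024] In terms of overall architecture, this invention uses U-Net as the base denoising network and embeds the proposed frequency-assisted guided fine-tuning module and residual network attention feature module into its encoder and decoder paths. During the training phase, a forward diffusion process is applied to the target image, gradually adding Gaussian noise according to $q(x_{1:T} \mid x_0) = \prod_{t=1}^{T} q(x_t \mid x_{t-1})$ with $q(x_t \mid x_{t-1}) = \mathcal{N}(x_t; \sqrt{1-\beta_t}\,x_{t-1}, \beta_t \mathbf{I})$, where $x_0$ is the target photo, $\beta_t$ is the predefined variance schedule, and $T$ is the total number of time steps. At any time step $t$, the noisy image can be expressed as $x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon$, where $\alpha_t = 1-\beta_t$, $\bar{\alpha}_t = \prod_{s=1}^{t}\alpha_s$, and $\epsilon \sim \mathcal{N}(0, \mathbf{I})$.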
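
As a minimal sketch of the forward diffusion described above, the following NumPy code draws a noisy image directly from the target photo using the standard closed form x_t = sqrt(ᾱ_t)·x_0 + sqrt(1 − ᾱ_t)·ε. The linear variance schedule, its endpoints, and the function names are assumptions for illustration; the source only states that the schedule is predefined.

```python
import numpy as np

def make_schedule(T=1000, beta_start=1e-4, beta_end=2e-2):
    # Linear variance schedule (assumed; the source only says "predefined")
    betas = np.linspace(beta_start, beta_end, T)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)     # cumulative product over time steps
    return betas, alphas, alpha_bars

def q_sample(x0, t, alpha_bars, rng):
    # Draw x_t given x_0 in one shot; t is a 1-indexed time step in [1, T]
    eps = rng.standard_normal(x0.shape)
    a_bar = alpha_bars[t - 1]
    xt = np.sqrt(a_bar) * x0 + np.sqrt(1.0 - a_bar) * eps
    return xt, eps

rng = np.random.default_rng(0)
betas, alphas, alpha_bars = make_schedule(T=1000)
x0 = rng.uniform(-1.0, 1.0, size=(3, 32, 32))   # stand-in for a target photo
xt, eps = q_sample(x0, t=500, alpha_bars=alpha_bars, rng=rng)
print(xt.shape)  # (3, 32, 32)
```

The returned `eps` is the noise that the denoising network is trained to predict at time step `t`.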

[0025] In the reverse denoising process, the network takes the corresponding sketch image as a conditional input and progressively predicts and removes noise to reconstruct the target image. The network's objective is to predict the added noise, and it is trained by optimizing the variational lower bound.
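
The reverse process can be sketched with the standard DDPM ancestral sampling step, shown below under stated assumptions: the posterior variance is taken as β_t, the schedule is linear, and `predict_noise(x, sketch, t)` is a hypothetical callable standing in for the sketch-conditioned network. None of these details are confirmed by the source.

```python
import numpy as np

def make_schedule(T, beta_start=1e-4, beta_end=2e-2):
    # Linear variance schedule (assumed)
    betas = np.linspace(beta_start, beta_end, T)
    alphas = 1.0 - betas
    return betas, alphas, np.cumprod(alphas)

def p_sample_step(xt, t, eps_pred, betas, alphas, alpha_bars, rng):
    # One reverse step x_t -> x_{t-1}; t is 1-indexed
    b, a, a_bar = betas[t - 1], alphas[t - 1], alpha_bars[t - 1]
    mean = (xt - b / np.sqrt(1.0 - a_bar) * eps_pred) / np.sqrt(a)
    if t == 1:
        return mean                      # no noise is added at the final step
    return mean + np.sqrt(b) * rng.standard_normal(xt.shape)

def sample(sketch, predict_noise, shape, T, rng):
    # Start from pure Gaussian noise and denoise step by step,
    # passing the sketch to the (hypothetical) noise predictor as the condition
    betas, alphas, alpha_bars = make_schedule(T)
    x = rng.standard_normal(shape)
    for t in range(T, 0, -1):
        eps_pred = predict_noise(x, sketch, t)
        x = p_sample_step(x, t, eps_pred, betas, alphas, alpha_bars, rng)
    return x

# Usage with a dummy predictor that always returns zeros
rng = np.random.default_rng(0)
sketch = np.zeros((1, 16, 16))
photo = sample(sketch, lambda x, s, t: np.zeros_like(x), (3, 16, 16), T=20, rng=rng)
print(photo.shape)  # (3, 16, 16)
```

In practice the dummy predictor would be replaced by the trained U-Net, and the sketch would typically be concatenated with the noisy image or injected through the guidance modules.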

[0026] The structure of the frequency-assisted guidance fine module proposed in this invention is shown in Figure 2. It consists of a frequency-gated mining submodule and a frequency-assisted refinement submodule. The frequency-gated mining submodule first performs spatial-spectral decoupling and selective enhancement on the input features $X$. Specifically, $X$ is split into two paths, $X_1$ and $X_2$, using a 1×1 convolution. One path is processed by depthwise separable convolution and channel attention, while the other path generates a gated mask through linear projection and GELU activation. Finally, the preliminary modulation features $X_f$ are obtained by element-wise multiplication: $X_f = \mathrm{CA}(\mathrm{DW}(X_1)) \odot \mathrm{GELU}(\mathrm{Linear}(X_2))$, where $\mathrm{CA}(\cdot)$ denotes the channel attention function, $\mathrm{DW}(\cdot)$ denotes depthwise separable convolution, and $\odot$ denotes element-wise multiplication. Next, $X_f$ is mapped to the frequency domain using the Fast Fourier Transform (FFT), the frequency response is dynamically adjusted through a series of lightweight convolutions and activation functions, and the result is mapped back to the spatial domain using the Inverse Fast Fourier Transform (IFFT), giving $Y$. The frequency-assisted fine-tuning submodule then independently modulates the amplitude spectrum and phase spectrum of $Y$. First, $Y$ is transformed by FFT and decomposed into an amplitude spectrum $\mathcal{A}$ and a phase spectrum $\mathcal{P}$: $\mathcal{A}(i,j) = \sqrt{\mathcal{R}^2(i,j) + \mathcal{I}^2(i,j)}$ and $\mathcal{P}(i,j) = \arctan\left(\mathcal{I}(i,j)/\mathcal{R}(i,j)\right)$, where $i$, $j$ denote frequency domain coordinates, and $\mathcal{R}$ and $\mathcal{I}$ are the real and imaginary parts, respectively. The amplitude spectrum is globally enhanced using a sliding window mechanism and dual pooling, where GMP and GAP denote global max pooling and global average pooling, respectively, and Sigmoid is the activation function. Phase spectrum modulation is guided by the residual between the amplitude spectrum and the original phase spectrum, enhancing key identity-related geometric information through frequency-division processing, with $\oplus$ denoting element-wise addition. Finally, the modulated amplitude and phase spectra are reconstructed into spatial domain features via the inverse fast Fourier transform.
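
The amplitude and phase handling described above can be sketched in NumPy as follows. The decomposition and recombination follow directly from the definitions of the amplitude and phase spectra. The amplitude gate, however, is a hypothetical stand-in: it combines global max pooling and global average pooling through a sigmoid, omits the sliding window and any learned layers, and leaves the phase unmodified, since the source does not give the exact formulas.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def decompose(y):
    # 2-D FFT over the spatial axes, then split into amplitude and phase
    spec = np.fft.fft2(y, axes=(-2, -1))
    amplitude = np.abs(spec)      # sqrt(R^2 + I^2)
    phase = np.angle(spec)        # atan2(I, R)
    return amplitude, phase

def recombine(amplitude, phase):
    # Rebuild the complex spectrum and return to the spatial domain
    spec = amplitude * np.exp(1j * phase)
    return np.fft.ifft2(spec, axes=(-2, -1)).real

def enhance_amplitude(amplitude, eps=1e-8):
    # Hypothetical dual-pooling gate: one weight per channel from GMP and GAP
    gmp = amplitude.max(axis=(-2, -1), keepdims=True)
    gap = amplitude.mean(axis=(-2, -1), keepdims=True)
    gate = sigmoid(gap / (gmp + eps))
    return amplitude * gate

rng = np.random.default_rng(0)
y = rng.standard_normal((4, 16, 16))            # (channels, height, width)
amp, pha = decompose(y)
y_roundtrip = recombine(amp, pha)               # recovers y up to rounding error
y_modulated = recombine(enhance_amplitude(amp), pha)
print(np.allclose(y_roundtrip, y), y_modulated.shape)
```

Keeping the phase fixed while rescaling the amplitude changes contrast and texture energy but leaves the spatial layout of edges largely intact, which is the usual motivation for treating the two spectra separately.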

[0027] The structure of the residual network attention feature module proposed in this invention is shown in Figure 2. It includes residual blocks and a cross-attention mechanism. The residual blocks embed the time-step information $x_{emb}^{t}$ during convolution so that the network is aware of the current denoising stage, where $x_{emb}^{t}$ is obtained by sinusoidal positional encoding of the time step $t$. When the feature map reaches the bottleneck resolution, a cross-attention mechanism is introduced to capture long-range dependencies. The query $Q$, key $K$, and value $V$ are generated by sequentially applying 1×1 pointwise convolutions and 3×3 depthwise convolutions. Attention weights are computed by scaled dot-product attention, in which $\alpha$ is a learnable scaling factor used to prevent the distribution of attention scores from becoming too concentrated.
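
The two ingredients of this module, the sinusoidal time-step embedding and the scaled attention, can be sketched as below. The embedding follows the common Transformer-style formulation, and the attention is written as Softmax(QKᵀ/α)·V with α as a plain scalar. Both forms are assumptions, since the source formulas are not available, and the Q, K, V projections (1×1 pointwise and 3×3 depthwise convolutions) are replaced by random matrices for brevity.

```python
import numpy as np

def sinusoidal_embedding(t, dim, max_period=10000.0):
    # Encode an integer time step t as a vector of length dim (dim assumed even)
    half = dim // 2
    freqs = np.exp(-np.log(max_period) * np.arange(half) / half)
    args = t * freqs
    return np.concatenate([np.sin(args), np.cos(args)])

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)     # subtract max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_attention(q, k, v, alpha):
    # q: (N, d) queries, k: (M, d) keys, v: (M, d_v) values
    scores = q @ k.T / alpha                    # alpha acts as a temperature
    weights = softmax(scores, axis=-1)
    return weights @ v, weights

rng = np.random.default_rng(0)
emb = sinusoidal_embedding(t=10, dim=8)
q = rng.standard_normal((6, 16))
k = rng.standard_normal((9, 16))
v = rng.standard_normal((9, 32))
out, w = scaled_attention(q, k, v, alpha=4.0)
print(emb.shape, out.shape)  # (8,) (6, 32)
```

A larger `alpha` flattens each row of the attention weights, which matches the stated purpose of keeping the score distribution from becoming too concentrated.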

[0028] During training, a multi-task loss function is used to jointly optimize the network. The total loss function combines four terms, weighted by balancing coefficients: the noise reconstruction loss, the frequency consistency loss, the Charbonnier loss, and the perceptual loss. In the perceptual loss, $\Phi_j(\cdot)$ denotes the feature map of the $j$-th layer of the pre-trained VGG-19 network, and $\Omega_j = H_j \times W_j \times C_j$ denotes the feature dimension of that layer.
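
The four loss terms can be sketched in NumPy as shown below. The exact norms, the Charbonnier constant, and the balancing coefficients are not given in the source, so the choices here (squared error for the noise term, L1 on amplitude and wrapped phase differences, ε = 1e-3, and the example weights) are assumptions. The perceptual term takes precomputed feature maps, standing in for the VGG-19 layers, and normalizes each by its feature dimension.

```python
import numpy as np

def noise_loss(eps_true, eps_pred):
    # Mean squared error between true and predicted noise
    return np.mean((eps_true - eps_pred) ** 2)

def charbonnier_loss(pred, target, eps=1e-3):
    # Smooth L1-like penalty: sqrt(diff^2 + eps^2)
    return np.mean(np.sqrt((pred - target) ** 2 + eps ** 2))

def frequency_consistency_loss(pred, target):
    fp = np.fft.fft2(pred, axes=(-2, -1))
    ft = np.fft.fft2(target, axes=(-2, -1))
    amp_term = np.mean(np.abs(np.abs(fp) - np.abs(ft)))
    # Phase difference wrapped to (-pi, pi] via the angle of fp * conj(ft)
    pha_term = np.mean(np.abs(np.angle(fp * np.conj(ft))))
    return amp_term + pha_term

def perceptual_loss(feats_pred, feats_target):
    # feats_*: lists of feature maps from a (hypothetical) pretrained extractor;
    # each layer is normalized by its feature dimension H_j * W_j * C_j
    return sum(np.sum((a - b) ** 2) / a.size for a, b in zip(feats_pred, feats_target))

def total_loss(terms, weights):
    # Weighted sum of the individual terms (weights are illustrative)
    return sum(weights[name] * value for name, value in terms.items())

rng = np.random.default_rng(0)
pred = rng.standard_normal((3, 16, 16))
target = rng.standard_normal((3, 16, 16))
terms = {
    "noise": noise_loss(target, pred),
    "freq": frequency_consistency_loss(pred, target),
    "charbonnier": charbonnier_loss(pred, target),
    "perceptual": perceptual_loss([pred], [target]),
}
weights = {"noise": 1.0, "freq": 0.1, "charbonnier": 1.0, "perceptual": 0.1}
print(total_loss(terms, weights) > 0)  # True
```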

[0029] To enhance the model's adaptability to heterogeneous data during the training and inference phases, this invention also introduces a spectral enhancement strategy based on Fast Fourier Transform (FFT). After performing FFT on the input image, selective enhancement is applied to high-frequency components to balance the spectral distribution across different datasets, suppress random noise, and improve the model's robustness under complex lighting and style variations.
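
One possible reading of this spectral enhancement strategy is sketched below: the amplitude of frequencies above a radial cutoff is scaled by a gain while the phase is left untouched, so that the spatial structure encoded in the phase is preserved. The cutoff, the gain, the hard radial mask, and the function name are all assumptions; the source does not specify how the high-frequency components are selected or rescaled.

```python
import numpy as np

def enhance_high_frequencies(img, cutoff=0.25, gain=1.2):
    # img: 2-D grayscale array; cutoff is in cycles per sample (0 to about 0.7)
    h, w = img.shape
    spec = np.fft.fft2(img)
    fy = np.fft.fftfreq(h)[:, None]
    fx = np.fft.fftfreq(w)[None, :]
    radius = np.sqrt(fx ** 2 + fy ** 2)          # radial frequency of each bin
    scale = np.where(radius > cutoff, gain, 1.0)  # symmetric mask keeps output real
    amplitude = np.abs(spec) * scale              # rescale amplitude only
    phase = np.angle(spec)                        # phase is left unchanged
    return np.fft.ifft2(amplitude * np.exp(1j * phase)).real

rng = np.random.default_rng(0)
img = rng.uniform(0.0, 1.0, size=(32, 32))
out = enhance_high_frequencies(img, cutoff=0.25, gain=1.2)
print(out.shape, np.isclose(out.mean(), img.mean()))
```

Because the zero-frequency bin lies below the cutoff, the mean brightness of the image is unchanged. A gain below 1 would instead attenuate high frequencies, which is the direction usually associated with noise suppression.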

[0030] In the experimental section, this invention was validated on four benchmark datasets for face sketch-photo synthesis: CUHK Student, AR, XM2VTS, and CUGSF. The dataset partitioning settings used in the experiments are shown in Table 1.

Table 1. Dataset partitioning for the experiments.

[0031] Figures 3 to 6 compare the synthesis results of this invention with those of several existing mainstream methods. The present invention performs better in terms of detail restoration, identity preservation, and noise suppression. Figure 7 and Table 2 present comparative results on various image quality evaluation metrics, including LPIPS, FSIM, MS-SSIM, IS, SR-SIM, and ID. The present invention achieves the best or near-best performance on most metrics. On the LPIPS and FID metrics, which measure perceptual quality, the present invention achieves the lowest values on all four datasets, indicating that the synthesized result is closest to the real photograph in deep feature space and in distribution. On the MS-SSIM metric, which measures structural fidelity, the present invention also achieves the highest score, demonstrating its superior ability to maintain the multi-scale structure and brightness consistency of the face. Particularly noteworthy is its performance on the identity preservation metric, which shows that the proposed method can effectively anchor the identity features of the input sketch and avoid identity drift during synthesis.

[0032] Table 2 compares the quantitative performance of existing methods on four benchmark datasets (best and second-best results are marked in bold and underlined, respectively).

[0033] Figure 8 shows the results of cross-dataset synthesis, demonstrating that the present invention generalizes well. Figure 9 verifies the effectiveness of each module through ablation experiments. Figure 10 provides further analysis using feature distributions and color histograms, showing that the present invention can effectively reduce the modal gap and improve identity discrimination and color authenticity.

[0034] In summary, this invention achieves dual-domain guidance in both the spatial and frequency domains within a denoising diffusion framework through the collaborative design of a frequency-assisted guidance fine module and a residual network attention feature module. This significantly improves the visual quality and identity preservation capabilities of face sketch-photo synthesis, demonstrating strong practical value and promising prospects for wider application.

[0035] The technical features of the above embodiments can be combined in any way. For the sake of brevity, not all possible combinations of the technical features in the above embodiments are described. However, as long as there is no contradiction in the combination of these technical features, they should be considered to be within the scope of this specification.

[0036] It should be noted that the components mentioned in the above embodiments are all general standard parts or components known to those skilled in the art. Their structures and principles can be learned by those skilled in the art through technical manuals or conventional experimental methods.

[0037] This invention has illustrated its principles and implementation methods using specific examples. The descriptions of these embodiments are merely illustrative of the method and its core ideas; furthermore, those skilled in the art will recognize that modifications may be made to the specific implementation methods and application scope based on the principles of this invention. Therefore, the content of this specification should not be construed as limiting the invention.

Claims

1. A method for synthesizing face sketch photos based on a dual-domain guided denoising diffusion model, characterized in that it comprises: obtaining the sketch image to be synthesized; and inputting the sketch image into a trained sketch-photo synthesis network to generate a corresponding face photo; wherein the sketch-photo synthesis network is built on a denoising diffusion probabilistic model and includes an encoder, a decoder, and a frequency-assisted guidance fine module and a residual network attention feature module embedded in the connection path between the encoder and the decoder; and the trained sketch-photo synthesis network is obtained by training on a sketch-photo training set.

2. The method for synthesizing face sketch photos based on a dual-domain guided denoising diffusion model according to claim 1, characterized in that the frequency-assisted guidance fine module includes: a frequency-gated mining submodule, used to perform spatial-spectral decoupling and selective enhancement on input features; and a frequency-assisted fine-tuning submodule, used to independently modulate, selectively learn, and recombine the amplitude spectrum and phase spectrum obtained by fast Fourier transform decomposition.

3. The method for synthesizing face sketch photos based on a dual-domain guided denoising diffusion model according to claim 2, characterized in that the frequency-gated mining submodule achieves pixel-level feature selection through a dual-gated mechanism, including the following steps: the input feature map is split into two paths using a 1×1 convolution; one path is processed by depthwise separable convolution and channel attention, while the other path generates a gated mask through linear projection and GELU activation; and the two outputs are multiplied element by element to obtain the preliminary modulation features.

4. The method for synthesizing face sketch photos based on a dual-domain guided denoising diffusion model according to claim 2, characterized in that the frequency-assisted fine-tuning submodule performs the following steps: the input features are subjected to a Fast Fourier Transform, which decomposes them into an amplitude spectrum and a phase spectrum; global context enhancement of the amplitude spectrum is achieved through a sliding window mechanism and dual pooling operations; phase spectrum modulation is guided by the correlation difference information between the amplitude spectrum features and the original phase spectrum features; and the modulated amplitude spectrum and phase spectrum are recombined into spatial domain features through the inverse fast Fourier transform.

5. The method for synthesizing face sketch photos based on a dual-domain guided denoising diffusion model according to claim 1, characterized in that the residual network attention feature module includes: residual blocks, used to embed time-step information and perform feature enhancement; and a cross-attention mechanism, used to capture long-range dependencies and reinforce identity-critical regions at the bottleneck resolution.

6. The method for synthesizing face sketch photographs based on a dual-domain guided denoising diffusion model according to claim 5, characterized in that the query, key, and value matrices in the cross-attention mechanism are generated by sequentially applying 1×1 pointwise convolutions and 3×3 depthwise convolutions.

7. The method for synthesizing face sketch photographs based on a dual-domain guided denoising diffusion model according to claim 1, characterized in that the training process of the sketch-photo synthesis network includes: applying forward diffusion to the target images in the training set, gradually adding noise; in the reverse denoising process, using the corresponding sketch image as a conditional auxiliary input, with the frequency-assisted guidance fine module and the residual network attention feature module working together to guide noise prediction; and optimizing the network with a multi-task loss function until the model converges.

8. The method for synthesizing face sketch photographs based on a dual-domain guided denoising diffusion model according to claim 7, characterized in that the multi-task loss function includes: a noise reconstruction loss, used to minimize the difference between predicted noise and true noise; a frequency consistency loss, used to constrain the structural consistency of the amplitude spectrum and phase spectrum in the frequency domain; a Charbonnier loss, used for robust reconstruction that enhances texture and edge details; and a perceptual loss, used to ensure consistency between the generated image and the real image at the semantic feature level.

9. The method for synthesizing face sketch photographs based on a dual-domain guided denoising diffusion model according to claim 1, characterized in that, during the training and inference phases, the input images undergo spectrum enhancement based on the Fast Fourier Transform without altering the image structure information, in order to balance the distribution of high-frequency components among heterogeneous datasets and suppress random noise.

10. An electronic device, comprising a processor, a memory, and a communication interface, characterized in that the memory stores a computer program that, when executed by the processor, implements the face sketch photograph synthesis method based on a dual-domain guided denoising diffusion model as described in any one of claims 1 to 9.