Adaptive conditional enhancement method, device and equipment for face generation and storage medium

By using an adaptive conditional enhancement method, combined with semantic masking and text feature optimization of the diffusion model, the problem of lack of spatial control in face generation in existing technologies is solved, and high-quality face generation is achieved.

CN121661435B · Active · Publication Date: 2026-06-19 · BEIJING HISIGN TECH

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
BEIJING HISIGN TECH
Filing Date
2025-08-01
Publication Date
2026-06-19

AI Technical Summary

Technical Problem

Existing face generation methods based on text input lack spatial control capabilities, making it difficult to guarantee the structural accuracy of the generated face and prone to semantic deviations.

Method used

An adaptive conditional augmentation method is adopted. By inputting real face images into the diffusion model, noise is added and combined with semantic masks and text features. Adaptive region attention normalization and face attention adapter are used to optimize the diffusion model to accurately control the spatial structure and details of the face.

Benefits of technology

It achieves precise spatial structure control and rich semantic description of face generation, solves the ambiguity problem when the text description is vague, and improves the generation quality.

✦ Generated by Eureka AI based on patent content.


Abstract

This invention provides an adaptive conditional enhancement method, apparatus, device, and storage medium for face generation, belonging to the field of computer vision technology. The method includes: inputting a real face image into a diffusion model to obtain a latent feature map output by a VAE encoder in the diffusion model; adding actual noise to the latent feature map at time step t to obtain a latent noise map; inputting text prompts into a CLIP text encoder to obtain text features output by the CLIP text encoder; inputting a semantic mask map into a VAE encoder that replicates a U-Net encoding module to obtain a latent mask map output by the VAE encoder; inputting the latent mask map, latent noise map, time step t, and text features into a denoising U-Net to obtain predicted noise output by the U-Net; and training the system with the goal of minimizing the difference between predicted noise and actual noise. This invention combines semantic mask conditions with text conditions to improve the quality of face generation.

Description

Technical Field

[0001] This invention relates to the field of computer vision technology, and in particular to an adaptive conditional enhancement method, apparatus, device, and storage medium for face generation.

Background Technology

[0002] Face generation relies on deep learning algorithms to mine potential feature distributions from massive amounts of face data, and through complex mathematical modeling and network architecture design, it achieves high-fidelity generation and diverse expression of virtual faces.

[0003] Currently, face generation is typically controlled based on text input. This method relies solely on text descriptions for face generation control, lacking precise spatial control capabilities, making it difficult to guarantee the structural accuracy of the generated face, and is prone to semantic bias, resulting in inconsistencies between the text description and the generated result.

[0004] Therefore, there is an urgent need for a face generation solution that achieves spatial structure control while preserving detailed expression.

Summary of the Invention

[0005] This invention provides an adaptive conditional enhancement method, apparatus, device, and storage medium for face generation, which addresses the shortcoming of existing text-driven face generation technologies, namely their lack of spatial control capability.

[0006] This invention provides an adaptive conditional enhancement method for face generation, comprising the following steps:

[0007] A real face image is input into the diffusion model to obtain the latent feature map output by the VAE encoder in the diffusion model;

[0008] At time step t, actual noise is added to the latent feature map to obtain a latent noise map;

[0009] The text prompt is input into the CLIP text encoder to obtain the text features output by the CLIP text encoder;

[0010] The semantic mask image is input into the VAE encoder that replicates the U-Net encoding module to obtain the latent mask image output by the VAE encoder; the pixel values in the semantic mask image represent the face part categories;

[0011] The latent mask image, the latent noise image, the time step t, and the text features are input into the denoising U-Net in the diffusion model to obtain the predicted noise output by the U-Net;

[0012] The model is trained with the goal of minimizing the difference between the predicted noise and the actual noise.

[0013] According to an adaptive conditional enhancement method for face generation provided by the present invention, the diffusion model is optimized as follows:

[0014] A normalization block is inserted between each layer of the copied U-Net encoding module; the normalization block includes an Adaptive Region Attention Normalization (ARAN) module; the ARAN module processes the input feature map based on the following steps:

[0015] The input feature map is input into the self-attention layer of the ARAN module to obtain the enhanced conditional feature map output by the self-attention layer;

[0016] The enhanced conditional feature map is input into the first convolutional layer and the first activation layer connected in sequence in the ARAN module to obtain the scaling factor output by the first activation layer;

[0017] The enhanced conditional feature map is input into the second convolutional layer and the second activation layer connected in sequence in the ARAN module to obtain the offset factor output by the second activation layer;

[0018] After instance normalization of the input feature map, the scaling factor and the offset factor are used to transform it to obtain the first output feature map.

[0019] According to the adaptive conditional enhancement method for face generation provided by the present invention, the normalization block processes the input feature map based on the following steps:

[0020] The input feature map is input to the first ARAN module to obtain the first output feature map output by the first ARAN module;

[0021] The first output feature map is input into the first SiLU layer to obtain the first activation feature map output by the first SiLU layer;

[0022] The first activation feature map is input into the first convolutional layer to obtain the first convolutional feature map output by the first convolutional layer.

[0023] The first convolutional feature map is input into the second ARAN module to obtain the second output feature map output by the second ARAN module;

[0024] The second output feature map is input into the second SiLU layer to obtain the second activation feature map output by the second SiLU layer;

[0025] The second activation feature map is input into the second convolutional layer to obtain the second convolutional feature map output by the second convolutional layer;

[0026] The input feature map and the second convolutional feature map are fused to obtain the output feature map.

[0027] According to the adaptive conditional enhancement method for face generation provided by the present invention, a face attention adapter is further included after the CLIP text encoder;

[0028] The face attention adapter includes two adaptation sub-modules, each with a residual connection;

[0029] Each of the adaptation sub-modules includes a feedforward network and a multi-head attention layer connected in sequence.

[0030] According to the adaptive conditional enhancement method for face generation provided by the present invention, the feedforward network includes a layer normalization layer, a GEGLU activation layer, a dropout layer and a linear transformation layer connected in sequence.

[0031] According to an adaptive conditional enhancement method for face generation provided by the present invention, each downsampling block of the copied U-Net encoding module includes a conditional fusion module, which determines the weights of the latent mask image and the text features based on the following steps:

[0032] The time step t is input into the first linear layer of the conditional fusion module to obtain the first transformation feature output by the first linear layer;

[0033] The first transformation feature is input into the ReLU activation layer of the conditional fusion module to obtain the second transformation feature output by the ReLU activation layer;

[0034] The second transformation feature is input into the second linear layer of the conditional fusion module to obtain the third transformation feature output by the second linear layer;

[0035] The third transformation feature is input into the Sigmoid activation layer of the conditional fusion module to obtain the prediction result output by the Sigmoid activation layer;

[0036] The prediction result is used as the weight of the latent mask image.
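The gating steps above (linear layer, ReLU, linear layer, Sigmoid) can be sketched as follows. This is a minimal illustration rather than the patented implementation: the sinusoidal time-step embedding, the layer sizes, the random initial weights, and the names `sinusoidal_embedding` and `ConditionalFusionGate` are all assumptions introduced for the example. How the text-feature weight is derived from the gate is not stated in this passage, so the sketch only produces the mask weight.

```python
import numpy as np

def sinusoidal_embedding(t, dim=8):
    """Map an integer time step to a vector (assumed; the text only says t is input)."""
    half = dim // 2
    freqs = np.exp(-np.log(10000.0) * np.arange(half) / half)
    angles = t * freqs
    return np.concatenate([np.sin(angles), np.cos(angles)])

class ConditionalFusionGate:
    """Linear -> ReLU -> Linear -> Sigmoid, producing one weight in (0, 1)."""

    def __init__(self, in_dim=8, hidden_dim=16, seed=0):
        rng = np.random.default_rng(seed)
        # Untrained random weights, for illustration only.
        self.w1 = rng.normal(0.0, 0.1, (in_dim, hidden_dim))
        self.b1 = np.zeros(hidden_dim)
        self.w2 = rng.normal(0.0, 0.1, (hidden_dim, 1))
        self.b2 = np.zeros(1)

    def __call__(self, t_embed):
        h = t_embed @ self.w1 + self.b1      # first linear layer
        h = np.maximum(h, 0.0)               # ReLU activation layer
        z = h @ self.w2 + self.b2            # second linear layer
        return 1.0 / (1.0 + np.exp(-z))      # Sigmoid activation layer

gate = ConditionalFusionGate()
w_mask = gate(sinusoidal_embedding(500))     # weight applied to the latent mask image
```

Because the gate depends only on the time step, the influence of the mask condition can vary across the denoising trajectory while remaining bounded between 0 and 1.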

[0037] The present invention also provides an adaptive condition enhancement device for face generation, comprising the following modules:

[0038] The image encoding module is used to: input a real human face image into the diffusion model to obtain the latent feature map output by the VAE encoder in the diffusion model;

[0039] The noise addition module is used to: add actual noise to the latent feature map at time step t to obtain a latent noise map;

[0040] The text encoding module is used to: input text prompts into the CLIP text encoder to obtain the text features output by the CLIP text encoder;

[0041] The mask encoding module is used to: input a semantic mask image into the VAE encoder that replicates the U-Net encoding module to obtain a latent mask image output by the VAE encoder; the pixel values in the semantic mask image represent the face part categories;

[0042] The noise prediction module is used to: input the latent mask image, the latent noise image, the time step t, and the text features into the denoising U-Net in the diffusion model to obtain the predicted noise output by the U-Net;

[0043] The model training module is used to train the model with the goal of minimizing the difference between the predicted noise and the actual noise.

[0044] The present invention also provides an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the computer program to implement the adaptive conditional enhancement method for face generation as described above.

[0045] The present invention also provides a non-transitory computer-readable storage medium having a computer program stored thereon, which, when executed by a processor, implements the adaptive conditional enhancement method for face generation as described above.

[0046] The present invention also provides a computer program product, including a computer program that, when executed by a processor, implements the adaptive conditional enhancement method for face generation as described above.

[0047] The present invention provides an adaptive conditional enhancement method, apparatus, device, and storage medium for face generation. The method involves inputting a real face image into a diffusion model to obtain a latent feature map output by a VAE encoder; adding actual noise to the latent feature map at time step t to obtain a latent noise map; inputting text prompts into a CLIP text encoder to obtain text features output by the CLIP text encoder; inputting a semantic mask map into a VAE encoder that replicates a U-Net encoding module to obtain a latent mask map output by the VAE encoder, where the pixel values in the semantic mask map represent face part categories; inputting the latent mask map, the latent noise map, the time step t, and the text features into a denoising U-Net in the diffusion model to obtain predicted noise output by the U-Net; and training the model with the goal of minimizing the difference between the predicted noise and the actual noise. This invention introduces semantic mask conditions, which provide precise spatial structural information such as the position and contour of facial features and accurately control the spatial layout of the generated object. Text description conditions provide rich semantic descriptions, allowing flexible specification of visual attributes of the generated object, such as skin color, hairstyle, and facial expression details. Combining the two ensures both structural accuracy and richness of detail, achieving pixel-level positioning and semantic-level description. When the text description is ambiguous, the semantic mask provides structural constraints that resist noise in the text description, resolve its potential ambiguities, and ensure the quality of face generation.

Attached Figure Description

[0048] To more clearly illustrate the technical solutions in this invention or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are some embodiments of this invention. For those skilled in the art, other drawings can be obtained from these drawings without creative effort.

[0049] Figure 1 is a flowchart of the adaptive conditional enhancement method for face generation provided by the present invention;

[0050] Figure 2 is a schematic diagram of the neural network framework provided by the present invention;

[0051] Figure 3 is a schematic structural diagram of the Adaptive Region Attention Normalization (ARAN) module provided by the present invention;

[0052] Figure 4 is a schematic structural diagram of the face attention adapter provided by the present invention;

[0053] Figure 5 is a schematic diagram of the conditional fusion module provided by the present invention;

[0054] Figure 6 is a schematic diagram of the simulation experiment results provided by the present invention;

[0055] Figure 7 is a schematic diagram of the adaptive conditional enhancement device for face generation provided by the present invention;

[0056] Figure 8 is a schematic structural diagram of the electronic device provided by the present invention.

Detailed Implementation

[0057] To make the objectives, technical solutions, and advantages of this invention clearer, the technical solutions of this invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of this invention. All other embodiments obtained by those skilled in the art based on the embodiments of this invention without creative effort are within the scope of protection of this invention.

[0058] It should be noted that in the description of the embodiments of the present invention, the terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or apparatus. Without further limitations, an element defined by the phrase "comprising one..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that includes said element. The terms "upper," "lower," etc., indicating orientation or positional relationships are based on the orientation or positional relationships shown in the accompanying drawings and are only for the convenience of describing the present invention and simplifying the description, and do not indicate or imply that the device or element referred to must have a specific orientation, or be constructed and operated in a specific orientation, and therefore should not be construed as a limitation of the present invention. Unless otherwise expressly specified and limited, the terms "installed," "connected," and "linked" should be interpreted broadly, for example, as a fixed connection, a detachable connection, or an integral connection; a mechanical connection or an electrical connection; a direct connection or an indirect connection through an intermediate medium; or a connection within two elements. Those skilled in the art can understand the specific meaning of the above terms in this invention according to the specific circumstances.

[0059] The terms "first," "second," etc., used in this application are used to distinguish similar objects and not to describe a specific order or sequence. It should be understood that such data can be interchanged where appropriate so that embodiments of this application can be implemented in orders other than those illustrated or described herein, and the objects distinguished by "first," "second," etc., are generally of the same class, without limiting the number of objects; for example, a first object can be one or more. Furthermore, "and / or" indicates at least one of the connected objects, and the character " / " generally indicates that the preceding and following objects have an "or" relationship.

[0060] The adaptive conditional enhancement method, apparatus, device, and storage medium for face generation provided in embodiments of the present invention are described below with reference to Figures 1-8.

[0061] Figure 1 is a flowchart of the adaptive conditional enhancement method for face generation provided by the present invention. As shown in Figure 1, the method includes the following steps:

[0062] S110, Input the real face image into the diffusion model to obtain the latent feature map output by the VAE encoder in the diffusion model;

[0063] S120, At time step t, add actual noise to the latent feature map to obtain a latent noise map;

[0064] S130, Input the text prompt into the CLIP text encoder to obtain the text features output by the CLIP text encoder;

[0065] S140, Input the semantic mask image into the VAE encoder of the copied U-Net encoding module to obtain the latent mask image output by the VAE encoder; the pixel values in the semantic mask image represent the face part categories;

[0066] S150, the latent mask image, the latent noise image, the time step t, and the text features are input into the denoising U-Net in the diffusion model to obtain the predicted noise output by the U-Net;

[0067] S160, Model training is performed with the goal of minimizing the difference between the predicted noise and the actual noise.

[0068] It should be noted that the execution subject of the adaptive conditional enhancement method for face generation provided in this application embodiment can be a server or a computer device such as a mobile phone, tablet computer, laptop computer, handheld computer, vehicle electronic device, wearable device, ultra-mobile personal computer (UMPC), netbook, or personal digital assistant (PDA).

[0069] In this embodiment of the invention, based on the Stable Diffusion (SD) generative model, semantic masking and textual conditions are introduced as additional controls to achieve multimodal input. The SD model is a deep learning model primarily used for image generation. It converts the input image into a latent space representation through a VAE encoder, making it easier to model and manipulate. During the diffusion process, each step requires knowledge of the current time step, typically achieved by mapping the time step to a high-dimensional space, thus better capturing the impact of the time step on the generation process. The U-Net architecture is the core of the SD model, used to perform the actual denoising process. U-Net accepts the latent space representation and time step embedding as input and attempts to predict the noise added to the original data. After a series of denoising steps, the decoder part of the VAE is used to convert the latent space representation back to the image space, ultimately generating a new image. In the specific implementation, during the training phase, each iteration selects a batch of images from the dataset and gradually adds noise to these images according to a predetermined noise schedule. This process simulates the gradual transformation from a clear image to pure noise; using an image with added noise as input, the SD model is trained to predict the noise added to the original image; after training, starting from pure noise, the trained model is used to gradually remove the noise, eventually obtaining a clear image.
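The training-time noising and the noise-prediction objective described above can be illustrated with a short sketch. It rests on assumptions that the text does not fix: a linear beta schedule with 1000 steps, the standard closed-form noising used by DDPM-style models, and mean squared error as the "difference" being minimized. The 4x64x64 latent is a random stand-in for a VAE-encoded face, and all function names are hypothetical.

```python
import numpy as np

def make_alpha_bar(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative signal-retention factors for an assumed linear beta schedule."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def add_noise(z0, t, alpha_bar, rng):
    """Noise a clean latent z0 to time step t in closed form; return (z_t, eps)."""
    eps = rng.standard_normal(z0.shape)                       # the "actual noise"
    zt = np.sqrt(alpha_bar[t]) * z0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return zt, eps

def noise_prediction_loss(eps_pred, eps):
    """Mean squared error between predicted and actual noise (assumed loss form)."""
    return float(np.mean((eps_pred - eps) ** 2))

rng = np.random.default_rng(0)
alpha_bar = make_alpha_bar()
z0 = rng.standard_normal((4, 64, 64))        # stand-in for a latent feature map
zt, eps = add_noise(z0, t=500, alpha_bar=alpha_bar, rng=rng)

loss_perfect = noise_prediction_loss(eps, eps)                 # ideal predictor
loss_zero = noise_prediction_loss(np.zeros_like(eps), eps)     # trivial predictor
```

In a real training loop the second argument of the loss would come from the conditional U-Net, which also receives the latent mask image, the time step, and the text features.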

[0070] In S110, the VAE (Variational Auto-Encoder) encoder achieves efficient feature representation and learning by compressing high-dimensional image data into a low-dimensional latent space, providing a foundation for subsequent processing.

[0071] In S140, the pixel values in the semantic mask map represent the categories of facial parts. If the face is divided into n parts, the pixel values in the semantic mask map range from 0 to n-1, and each value corresponds to one part.

[0072] Optionally, training data is generated using the CelebAMask-HQ dataset. The original label map of CelebAMask-HQ contains 19 facial parts, so each pixel in the generated semantic mask map has a value of 0-18 (representing the 19 facial categories, such as 0=background, 1=skin, 2=left eyebrow, etc.). The training data consists of three parts: semantic masks (512x512), text, and real faces (512x512). The semantic mask data serves as the conditional embedding of the framework; the text is the basic conditional input for both the framework and the diffusion model; and the real faces provide the latent features of the base diffusion model, which are noised into Gaussian noise before the denoising process begins.
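As an illustration of the label encoding described above, the following sketch builds a toy 512x512 mask with values in 0-18, checks the value range, and expands it into one binary plane per facial part. The one-hot expansion is a common preprocessing step and is an assumption here; the text itself feeds the mask map to a VAE encoder and does not prescribe this representation. The rectangular regions and the function names are hypothetical.

```python
import numpy as np

NUM_PARTS = 19  # number of CelebAMask-HQ facial categories stated in the text

def validate_mask(mask, num_parts=NUM_PARTS):
    """Check that a mask is a 2-D integer array with values in [0, num_parts - 1]."""
    if mask.ndim != 2:
        raise ValueError("mask must be 2-D (height, width)")
    if not np.issubdtype(mask.dtype, np.integer):
        raise TypeError("mask must hold integer class labels")
    if mask.min() < 0 or mask.max() >= num_parts:
        raise ValueError("labels must lie in [0, num_parts - 1]")
    return mask

def one_hot_mask(mask, num_parts=NUM_PARTS):
    """Expand (H, W) labels into (num_parts, H, W) binary planes."""
    validate_mask(mask, num_parts)
    return np.eye(num_parts, dtype=np.float32)[mask].transpose(2, 0, 1)

# Toy mask: background everywhere, a square of skin, a strip of left eyebrow.
mask = np.zeros((512, 512), dtype=np.int64)
mask[128:384, 128:384] = 1   # 1 = skin (label from the example in the text)
mask[200:215, 170:230] = 2   # 2 = left eyebrow
planes = one_hot_mask(mask)
```

Each pixel belongs to exactly one plane, which makes the pixel-level spatial constraint carried by the mask explicit.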

[0073] Figure 2 is a schematic diagram of the neural network framework provided by the present invention. As shown in Figure 2, real faces are processed into low-dimensional latent features by the variational autoencoder (VAE) of the Stable Diffusion model. During the noise addition process, noise is added step by step according to the time step until the latent becomes a Gaussian noise map. The semantic mask map is embedded into different layers of the U-Net encoding module through the VAE. The text data is encoded by the CLIP (Contrastive Language-Image Pre-Training) text encoder and fed into the model.

[0074] Understandably, there are no strict timing constraints on the processing of real face images, text prompts, and semantic mask images.

[0075] In S150, the latent mask image and the text features are fed into the copied U-Net encoding module.

[0076] After training is completed in S160, face image generation is performed through the reverse diffusion process. Specifically, an initial latent map is sampled from a standard normal distribution, and the text prompt and semantic mask map are prepared. From time step T down to 0, at each step the latent map, the time step t, the text prompt, and the semantic mask map are input into the U-Net, the noise is predicted, and the latent map for the next step is updated according to the noise schedule. Finally, the latent map at time step 0 is obtained, and the VAE decoder restores the denoised latent feature map to image space to generate the final image.
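The sampling loop described above can be sketched as follows. The text does not specify the update rule, so this example assumes a deterministic DDIM-style step and a linear beta schedule. The trained conditional U-Net is replaced by a toy oracle that already knows the clean latent, which serves only to show that the loop converges when the noise prediction is exact; `make_oracle`, `sample`, and the placeholder condition arguments are hypothetical.

```python
import numpy as np

def make_alpha_bar(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative signal-retention factors for an assumed linear beta schedule."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def make_oracle(target, alpha_bar):
    """Toy stand-in for the trained U-Net: returns the exact noise for a known target."""
    def predict_noise(z, t, text_features, latent_mask):
        return (z - np.sqrt(alpha_bar[t]) * target) / np.sqrt(1.0 - alpha_bar[t])
    return predict_noise

def sample(shape, text_features, latent_mask, alpha_bar, predict_noise, rng):
    """Deterministic DDIM-style reverse diffusion from the last time step to 0."""
    z = rng.standard_normal(shape)                    # initial latent from N(0, I)
    for t in range(len(alpha_bar) - 1, -1, -1):
        eps = predict_noise(z, t, text_features, latent_mask)
        # Estimate the clean latent implied by the current noise prediction.
        z0_hat = (z - np.sqrt(1.0 - alpha_bar[t]) * eps) / np.sqrt(alpha_bar[t])
        ab_prev = alpha_bar[t - 1] if t > 0 else 1.0
        # Move to the previous (less noisy) time step.
        z = np.sqrt(ab_prev) * z0_hat + np.sqrt(1.0 - ab_prev) * eps
    return z                                          # would be passed to the VAE decoder

rng = np.random.default_rng(0)
alpha_bar = make_alpha_bar()
target = rng.standard_normal((4, 8, 8))               # pretend clean face latent
result = sample(target.shape, text_features=None, latent_mask=None,
                alpha_bar=alpha_bar,
                predict_noise=make_oracle(target, alpha_bar), rng=rng)
```

With a real model, `text_features` and `latent_mask` would carry the CLIP features and the encoded semantic mask, and the returned latent would be decoded to an image.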

[0077] The adaptive conditional enhancement method for face generation provided in this invention involves inputting a real face image into a diffusion model to obtain a latent feature map output by the VAE encoder in the diffusion model; adding actual noise to the latent feature map at time step t to obtain a latent noise map; inputting text prompts into a CLIP text encoder to obtain text features output by the CLIP text encoder; inputting a semantic mask map into a VAE encoder that replicates the U-Net encoding module to obtain a latent mask map output by the VAE encoder, where the pixel values in the semantic mask map represent the face part categories; inputting the latent mask map, the latent noise map, the time step t, and the text features into a denoising U-Net in the diffusion model to obtain the predicted noise output by the U-Net; and training the model with the goal of minimizing the difference between the predicted noise and the actual noise. This invention introduces semantic mask conditions, which provide precise spatial structural information such as the position and contour of facial features and accurately control the spatial layout of the generated object. Text description conditions provide rich semantic descriptions, allowing flexible specification of visual attributes of the generated object, such as skin color, hairstyle, and facial expression details. Combining the two ensures both structural accuracy and richness of detail, achieving pixel-level positioning and semantic-level description. When the text description is ambiguous, the semantic mask provides structural constraints that resist noise in the text description, resolve its potential ambiguities, and ensure the quality of face generation.

[0078] In an optional embodiment, the diffusion model is optimized as follows:

[0079] A normalization block is inserted between each layer of the copied U-Net encoding module; the normalization block includes an Adaptive Region Attention Normalization (ARAN) module; the ARAN module processes the input feature map based on the following steps:

[0080] The input feature map is input into the self-attention layer of the ARAN module to obtain the enhanced conditional feature map output by the self-attention layer;

[0081] The enhanced conditional feature map is input into the first convolutional layer and the first activation layer connected in sequence in the ARAN module to obtain the scaling factor output by the first activation layer;

[0082] The enhanced conditional feature map is input into the second convolutional layer and the second activation layer connected in sequence in the ARAN module to obtain the offset factor output by the second activation layer;

[0083] After instance normalization of the input feature map, the scaling factor and the offset factor are used to transform it to obtain the first output feature map.

[0084] Figure 3 is a schematic structural diagram of the Adaptive Region Attention Normalization (ARAN) module provided by the present invention. As shown in Figure 3, the ARAN module includes a self-attention layer, convolutional layers with SiLU activation, and dynamic normalization. The self-attention layer extracts global region information from the conditional feature map, enhancing the model's ability to perceive key regions (such as the eyes and nose). The Conv+SiLU layers extract local features from the attention-enhanced feature map through convolution and use the SiLU (Swish) activation function to enhance non-linear expressive power, generating the scaling factor γ and the offset factor β. The scaling factor γ and offset factor β are then used to perform an affine transformation on the input feature map, achieving condition-guided feature adjustment.

[0085] [Formula not reproduced in this text.]

[0086] where hint is the input semantic mask condition; A(hint) is the self-attention mechanism; x is the original input feature; ⊙ represents element-wise multiplication; σ represents the Sigmoid function; W_att is the weight matrix for self-attention; InstanceNorm() represents instance normalization; W_γ is the weight matrix of the scaling factor γ; and W_β is the weight matrix of the offset factor β.
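One plausible reading of the ARAN steps and the symbols defined above is sketched below: self-attention over the condition, two Conv+SiLU branches producing γ and β, and an affine transform of the instance-normalized input. Several details are assumptions because the formula itself is not available here: single-head attention over spatial positions, 1x1 convolutions, random untrained weights, and equal channel counts for `x` and `hint`. All function and parameter names are hypothetical.

```python
import numpy as np

def silu(z):
    """SiLU (Swish): z * sigmoid(z)."""
    return z / (1.0 + np.exp(-z))

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def instance_norm(x, eps=1e-5):
    """Normalize each channel of a (C, H, W) map over its spatial positions."""
    mean = x.mean(axis=(1, 2), keepdims=True)
    var = x.var(axis=(1, 2), keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def self_attention(hint, w_q, w_k, w_v):
    """Single-head self-attention treating each spatial position as a token."""
    c, h, w = hint.shape
    tokens = hint.reshape(c, h * w).T                 # (H*W, C)
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    attn = softmax(q @ k.T / np.sqrt(k.shape[-1]))    # (H*W, H*W)
    return (attn @ v).T.reshape(c, h, w)

def conv1x1(x, weight):
    """1x1 convolution: weight has shape (C_out, C_in)."""
    return np.einsum("oc,chw->ohw", weight, x)

def aran(x, hint, params):
    enhanced = self_attention(hint, params["w_q"], params["w_k"], params["w_v"])
    gamma = silu(conv1x1(enhanced, params["w_gamma"]))   # scaling factor
    beta = silu(conv1x1(enhanced, params["w_beta"]))     # offset factor
    return gamma * instance_norm(x) + beta               # condition-guided affine

C = 4
rng = np.random.default_rng(0)
params = {name: rng.normal(0.0, 0.5, (C, C))
          for name in ("w_q", "w_k", "w_v", "w_gamma", "w_beta")}
x = rng.standard_normal((C, 8, 8))       # feature map to be modulated
hint = rng.standard_normal((C, 8, 8))    # condition features (e.g. from the mask)
out = aran(x, hint, params)
```

Because γ and β are computed per position from the attended condition, the modulation can differ between facial regions rather than being a single global scale and shift.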

[0087] The adaptive conditional enhancement method for face generation provided in this invention dynamically adjusts the feature distribution based on semantic mask and text, enhances the model's perception of key regions through a self-attention mechanism, improves the model's expressive power through convolution and SiLU, and inserts it into each layer of the copied U-Net encoding module to strengthen the edge detail control capability of semantic mask conditions throughout the face conditional generation process, thereby enhancing the generation control capability of U-Net.

[0088] In an optional embodiment, the normalization block processes the input feature map based on the following steps:

[0089] The input feature map is input to the first ARAN module to obtain the first output feature map output by the first ARAN module;

[0090] The first output feature map is input into the first SiLU layer to obtain the first activation feature map output by the first SiLU layer;

[0091] The first activation feature map is input into the first convolutional layer to obtain the first convolutional feature map output by the first convolutional layer.

[0092] The first convolutional feature map is input into the second ARAN module to obtain the second output feature map output by the second ARAN module;

[0093] The second output feature map is input into the second SiLU layer to obtain the second activation feature map output by the second SiLU layer;

[0094] The second activation feature map is input into the second convolutional layer to obtain the second convolutional feature map output by the second convolutional layer;

[0095] The input feature map and the second convolutional feature map are fused to obtain the output feature map.

[0096] As shown in Figure 3, the normalization block ARAN_Resblk applies a skip connection between the input feature map and the second convolutional feature map, the latter being obtained by passing the input sequentially through the first ARAN, the first SiLU activation layer, the first Conv layer, the second ARAN, the second SiLU activation layer, and the second Conv layer, to obtain the final output feature map.
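The ordering of the normalization block can be sketched as below. To keep the example short, each ARAN module is replaced by a placeholder (`simple_aran`, instance normalization with a fixed scale and offset), the convolutions are 1x1, and the fusion in the skip connection is assumed to be element-wise addition; none of these choices is confirmed by the text, and the names are hypothetical.

```python
import numpy as np

def silu(z):
    """SiLU (Swish): z * sigmoid(z)."""
    return z / (1.0 + np.exp(-z))

def conv1x1(x, weight):
    """1x1 convolution on a (C, H, W) map: weight has shape (C_out, C_in)."""
    return np.einsum("oc,chw->ohw", weight, x)

def simple_aran(x, gamma=1.0, beta=0.0, eps=1e-5):
    """Placeholder for an ARAN module: instance norm followed by a fixed affine."""
    mean = x.mean(axis=(1, 2), keepdims=True)
    var = x.var(axis=(1, 2), keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

def aran_resblk(x, aran1, aran2, w_conv1, w_conv2):
    h = aran1(x)                 # first ARAN
    h = silu(h)                  # first SiLU layer
    h = conv1x1(h, w_conv1)      # first convolutional layer
    h = aran2(h)                 # second ARAN
    h = silu(h)                  # second SiLU layer
    h = conv1x1(h, w_conv2)      # second convolutional layer
    return x + h                 # skip connection (fusion assumed to be addition)

C = 4
rng = np.random.default_rng(0)
x = rng.standard_normal((C, 8, 8))
w1 = rng.normal(0.0, 0.5, (C, C))
w2 = rng.normal(0.0, 0.5, (C, C))
out = aran_resblk(x, simple_aran, simple_aran, w1, w2)

# With a zero second convolution the block reduces to the identity mapping.
identity_out = aran_resblk(x, simple_aran, simple_aran, w1, np.zeros((C, C)))
```

The identity case shows why the residual path helps: the block can leave the original features untouched and learn only a correction on top of them.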

[0097] The adaptive conditional augmentation method for face generation provided in the embodiments of the invention performs affine transformation on the input feature map by each ARAN module. Through skip connections, the model makes a trade-off between the original features and the transformed features, avoiding the loss of original input information. Furthermore, skip connections allow the model to learn residual information instead of directly overwriting the original features, thereby improving the training stability and generation quality of the model.

[0098] In an optional embodiment, a face attention adapter is further included after the CLIP text encoder;

[0099] The face attention adapter includes two adaptation sub-modules, each with a residual connection;

[0100] Each of the adaptation sub-modules includes a feedforward network and a multi-head attention layer connected in sequence.

[0101] Figure 4 is a schematic structural diagram of the face attention adapter provided by the present invention. As shown in Figure 4, a FaceAttAdapter is designed and inserted after the CLIP text encoder. The adapter consists of two identical adaptation sub-modules. The input and output of the first adaptation sub-module are joined by a residual connection, as are the input and output of the second adaptation sub-module, to avoid vanishing gradients. Each adaptation sub-module mainly includes a feedforward network and a multi-head attention mechanism. The feedforward network extracts features, and the multi-head attention captures long-range dependencies in the input sequence.

[0102] The adaptive conditional enhancement method for face generation provided in this invention further processes the semantic features extracted by CLIP through a feedforward network and multi-head self-attention. It performs more detailed feature adjustments and enhancements based on CLIP extraction. Multi-head attention can capture long-range dependencies in the text, enabling the model to better understand complex semantic descriptions. Furthermore, multi-head attention can emphasize certain specific regions, such as glasses and noses, through appropriate weight allocation, thus optimizing face generation. Residual connections ensure effective gradient propagation, thereby accelerating the model training process and improving model stability.

[0103] In an optional embodiment, the feedforward network includes a layer normalization layer, a GEGLU activation layer, a dropout layer, and a linear transformation layer connected in sequence.

[0104] ;

[0105] ;

[0106] where $X$ is the input feature; $Q$, $K$, and $V$ are the query, key, and value of the multi-head attention; $W_Q$, $W_K$, and $W_V$ are the weights of the query, key, and value, respectively; $T$ is the time step; $d_k$ is the dimension of the key; LayerNorm() denotes layer normalization; FeedForward1 is the feedforward network; Attention() is single-head attention; MultiHeadAttn1() is multi-head attention; and $W_{res1}$ is the weight of the residual connection.
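
As an illustration of the single-head Attention() named above, the following sketch assumes it is the standard scaled dot-product attention, with the query, key, and value obtained by multiplying the input features by their respective weights. That reading is an assumption, since the formulas themselves are not reproduced here; the helper names and the toy inputs are likewise hypothetical.

```python
import math
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of a (n x m) and b (m x p)."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a: Matrix) -> Matrix:
    """Swap rows and columns."""
    return [list(col) for col in zip(*a)]


def softmax(row: List[float]) -> List[float]:
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(x: Matrix, w_q: Matrix, w_k: Matrix, w_v: Matrix) -> Matrix:
    """Single-head attention: softmax(Q K-transposed / sqrt(d_k)) V."""
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    d_k = len(k[0])  # dimension of the key
    scores = matmul(q, transpose(k))
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v)


# Usage: two tokens with 2-dimensional features and identity projections.
eye = [[1.0, 0.0], [0.0, 1.0]]
tokens = [[1.0, 0.0], [0.0, 1.0]]
attended = attention(tokens, eye, eye, eye)
```

A multi-head layer would run several such heads on separate projections and concatenate their outputs; that step is omitted to keep the sketch short.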

[0107] The adaptive conditional enhancement method for face generation provided in this invention accelerates training speed through a layer normalization layer, enabling the model to reach the optimal solution more quickly. The GEGLU activation function combined with the Dropout layer effectively enhances the model's non-linear expressive power, helping to better capture detailed features. Compared to traditional ReLU or other simple activation functions, the GEGLU activation function provides stronger non-linear expressive power. Dropout randomly discards a portion of the neuron outputs, forcing the model to learn more robust feature representations rather than relying on specific neuron connections, thus preventing overfitting. The linear transformation layer adjusts the dimensionality of the features to meet the needs of subsequent mapping processing.
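
The feedforward chain described above (layer normalization, GEGLU, dropout, linear transformation) might look as follows in plain Python. This is a sketch, not the patented implementation: the GEGLU form used here (project to twice the hidden width, then multiply one half by the GELU of the other half) is the commonly used definition and is assumed to be the one intended, and all weights, sizes, and the dropout rate are made-up values.

```python
import math
import random
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def layer_norm(x: Vector, eps: float = 1e-5) -> Vector:
    """Normalize a feature vector to zero mean and unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def gelu(v: float) -> float:
    """Exact GELU: v * Phi(v), with Phi the standard normal CDF."""
    return 0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0)))


def linear(x: Vector, w: Matrix, b: Vector) -> Vector:
    """y = W x + b, with W stored as a list of output rows."""
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi for row, bi in zip(w, b)]


def geglu(x: Vector, w: Matrix, b: Vector) -> Vector:
    """Project to 2*h values, then gate: value_half * GELU(gate_half)."""
    proj = linear(x, w, b)
    h = len(proj) // 2
    return [val * gelu(gate) for val, gate in zip(proj[:h], proj[h:])]


def dropout(x: Vector, p: float, training: bool, rng: random.Random) -> Vector:
    """Inverted dropout: zero with probability p, rescale survivors by 1/(1-p)."""
    if not training or p == 0.0:
        return list(x)
    return [0.0 if rng.random() < p else v / (1.0 - p) for v in x]


def feed_forward(x, w_in, b_in, w_out, b_out, p=0.1, training=False, rng=None):
    """LayerNorm -> GEGLU -> Dropout -> Linear."""
    rng = rng or random.Random(0)
    h = geglu(layer_norm(x), w_in, b_in)
    h = dropout(h, p, training, rng)
    return linear(h, w_out, b_out)


# Usage with hypothetical weights: 3 input features, hidden width 2, 3 outputs.
x_in = [1.0, 2.0, 3.0]
w_in = [[0.1, 0.0, 0.0], [0.0, 0.1, 0.0], [0.0, 0.0, 0.1], [0.1, 0.1, 0.1]]
b_in = [0.0, 0.0, 0.0, 0.0]
w_out = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b_out = [0.0, 0.0, 0.0]
y = feed_forward(x_in, w_in, b_in, w_out, b_out)
```

The final linear layer is where the feature dimensionality is adjusted; here it maps the hidden width of 2 back to 3 outputs.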

[0108] In an optional embodiment, each downsampling block of the replicated U-Net encoding module is followed by a conditional fusion module, which determines the weights of the latent mask map and the text features based on the following steps:

[0109] The time step t is input into the first linear layer of the conditional fusion module to obtain the first transformation feature output by the first linear layer;

[0110] The first transformation feature is input into the ReLU activation layer of the conditional fusion module to obtain the second transformation feature output by the ReLU activation layer;

[0111] The second transformation feature is input into the second linear layer of the conditional fusion module to obtain the third transformation feature output by the second linear layer;

[0112] The third transformation feature is input into the Sigmoid activation layer of the conditional fusion module to obtain the prediction result output by the Sigmoid activation layer;

[0113] The prediction result is used as the weight of the latent mask map.

[0114] Figure 5 is a schematic diagram of the conditional fusion module provided by the present invention. As shown in Figure 5, the ScalePredictor module dynamically predicts the control scale of the training conditions at different time steps, in order to coordinate the weights of the different conditions. The module includes two linear layers, a ReLU activation function, and a Sigmoid function.

[0115] $$\mathrm{scale}(t) = \sigma\left(W_2 \cdot \mathrm{ReLU}\left(W_1 \cdot t + b_1\right) + b_2\right)$$;

[0116] where the input is the time step $t$, which ranges from 1000 to 0; $W_1$ and $W_2$ are the weights of the two linear layers; $b_1$ and $b_2$ are the bias terms of the two linear layers; and $\sigma$ denotes the Sigmoid function.
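
The four steps of the conditional fusion module (linear layer, ReLU, linear layer, Sigmoid) can be sketched as a small class. The hidden width, the random initialization, and the rescaling of the time step to [0, 1] before the first layer are assumptions made for the example; in the described method the weights and biases would be learned during training.

```python
import math
import random


class ScalePredictor:
    """Linear -> ReLU -> Linear -> Sigmoid, mapping a time step to a weight in (0, 1)."""

    def __init__(self, hidden: int = 8, seed: int = 0) -> None:
        rng = random.Random(seed)
        # Randomly initialised parameters; in practice these would be learned.
        self.w1 = [rng.uniform(-1.0, 1.0) for _ in range(hidden)]
        self.b1 = [rng.uniform(-1.0, 1.0) for _ in range(hidden)]
        self.w2 = [rng.uniform(-1.0, 1.0) for _ in range(hidden)]
        self.b2 = rng.uniform(-1.0, 1.0)

    def __call__(self, t: int, t_max: int = 1000) -> float:
        # Assumption: the time step is rescaled to [0, 1] before the first layer.
        x = t / t_max
        # First linear layer followed by ReLU.
        hidden = [max(0.0, w * x + b) for w, b in zip(self.w1, self.b1)]
        # Second linear layer, then Sigmoid to squash the result into (0, 1).
        z = sum(w * h for w, h in zip(self.w2, hidden)) + self.b2
        return 1.0 / (1.0 + math.exp(-z))


# Usage: predict a control scale at several time steps from 1000 down to 0.
predictor = ScalePredictor()
scales = [predictor(t) for t in (1000, 750, 500, 250, 0)]
```

Because the output passes through a Sigmoid, every predicted scale lies strictly between 0 and 1 and can be used directly as a weight on the latent mask map.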

[0117] Here, the control scale scale(t) predicted at each time step t is applied to the copied U-Net encoding module throughout the framework; if the scale of the mask condition is s, then the control scale in the latent feature representation process is s+1, thereby controlling the features of each layer in the copied U-Net encoding module.

[0118] If a fixed value is used for training and sampling in place of the dynamic scale predicted by the control scale predictor, different results are generated: a control scale of 1 fits the semantic mask condition better, whereas a control scale of 0.5 fits the text condition better.

[0119] The adaptive conditional enhancement method for face generation provided by the embodiments of the invention maps the input features to a higher-dimensional feature space through a linear layer, introduces nonlinearity through the ReLU activation function to enable the model to learn more complex feature relationships, further transforms through a second linear layer to output the final scale factor, and compresses the output to the [0,1] interval through the Sigmoid function, which is used as a weight factor for control conditions.

[0120] In summary, the adaptive conditional enhancement method for face generation provided by this invention uses a Stable Diffusion (SD) model as the base model, with an additional neural network framework added to the copied encoding block. During model training, the SD-V1.5 pre-trained weights are loaded first and the CLIP parameters are frozen, so that only the newly inserted modules need to be trained, which reduces the total number of parameters to be computed. Training proceeds as follows. In the first step, real face images, face semantic mask images, and text descriptions from the dataset are input into the SD model to learn the face distribution, with Gaussian noise added gradually during the noise addition process. In the second step, the semantic mask image is embedded into the copied encoding block through the VAE, while the text information is extracted by CLIP and passed into the copied encoding block; the scale output by the predictor, which ranges from 0 to 1, controls each layer of the copied encoding block. In the third step, at the initial stage of the denoising process, when the time step is relatively large, the scale is at its maximum and the semantic mask condition has a significant impact on the generated face; as the time step decreases, the text condition has a greater impact than the semantic mask condition. Finally, the two conditions coordinate with each other during the denoising process to guide face generation.
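
The noise addition and the training objective can be illustrated with the usual closed-form forward process of a denoising diffusion model. The text does not state the noise schedule, so the linear beta schedule and its end points below are assumptions, as are the helper names and the toy latent vector; the U-Net itself is not modelled, only the quantity it is trained to predict.

```python
import math
import random
from typing import List, Tuple


def alpha_bar_schedule(num_steps: int = 1000, beta_start: float = 1e-4,
                       beta_end: float = 0.02) -> List[float]:
    """Cumulative products of (1 - beta_t) for a linear beta schedule."""
    alpha_bars, prod = [], 1.0
    for i in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * i / (num_steps - 1)
        prod *= 1.0 - beta
        alpha_bars.append(prod)
    return alpha_bars


def add_noise(z0: List[float], t: int, alpha_bars: List[float],
              rng: random.Random) -> Tuple[List[float], List[float]]:
    """Return (z_t, eps) with z_t = sqrt(a_bar_t) * z0 + sqrt(1 - a_bar_t) * eps."""
    a_bar = alpha_bars[t]
    eps = [rng.gauss(0.0, 1.0) for _ in z0]
    z_t = [math.sqrt(a_bar) * z + math.sqrt(1.0 - a_bar) * e
           for z, e in zip(z0, eps)]
    return z_t, eps


def noise_mse(predicted: List[float], actual: List[float]) -> float:
    """Training objective: mean squared error between predicted and actual noise."""
    return sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual)


rng = random.Random(0)
alpha_bars = alpha_bar_schedule()
latent = [0.5, -1.0, 0.25, 2.0]  # stand-in for a VAE latent feature map
noisy_latent, actual_noise = add_noise(latent, t=999, alpha_bars=alpha_bars, rng=rng)
loss_perfect = noise_mse(actual_noise, actual_noise)  # perfect prediction
loss_zero_guess = noise_mse([0.0] * len(actual_noise), actual_noise)
```

At large time steps the cumulative product is close to zero, so the noisy latent is almost pure noise; this is the regime in which the text describes the semantic mask condition as having the most influence.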

[0121] The effectiveness of the adaptive conditional enhancement method for face generation provided by this invention will be explained below with reference to specific simulation experiments.

[0122] The CelebAMask-HQ face dataset (512×512) was used, comprising 30,000 face pairs. The learning rate was 1e-5, and the model was trained for 5 epochs. The weights from the final epoch were used to sample 500 pairs of face images for further evaluation of the generation quality.

[0123] The following evaluation indicators will be used:

[0124] FID (Fréchet Inception Distance) is an important metric for evaluating the quality of generated images. It measures the similarity between the distribution of generated images and the distribution of real images. The lower the value, the better.
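
For reference, FID is conventionally computed from Inception feature statistics of the two image sets. In the formula below, $\mu_r, \Sigma_r$ and $\mu_g, \Sigma_g$ denote the mean and covariance of the features of the real and generated images, respectively; these symbols are introduced here for illustration and are not used elsewhere in this document.

```latex
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\left( \Sigma_r \Sigma_g \right)^{1/2} \right)
```

The value is zero when the two feature distributions have identical means and covariances, which is why lower is better.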

[0125] mIoU calculates the ratio (IoU) of the intersection and union of the "predicted region" and the "true region", and takes the average value over all categories. The higher the value, the better.
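
A minimal sketch of the mIoU computation on flattened label maps follows. The class indices and the tiny example masks are made up for illustration, and the choice to skip classes that appear in neither mask is one common convention rather than something stated in the text.

```python
from typing import List, Sequence


def mean_iou(pred: Sequence[int], true: Sequence[int], num_classes: int) -> float:
    """Mean IoU over classes that appear in the prediction or the ground truth."""
    ious: List[float] = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, true) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, true) if p == c or t == c)
        if union > 0:  # skip classes absent from both masks
            ious.append(inter / union)
    return sum(ious) / len(ious)


# Flattened 2x3 label maps with illustrative classes 0, 1, and 2.
pred_mask = [0, 1, 1, 2, 2, 2]
true_mask = [0, 1, 2, 2, 2, 2]
miou = mean_iou(pred_mask, true_mask, num_classes=3)
```

In this example the per-class IoU values are 1/1, 1/2, and 3/4, so the mean is 0.75.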

[0126] CLIP-Score measures semantic relevance using cosine similarity; a higher value is better.
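
The cosine similarity underlying CLIP-Score can be sketched as below. The embedding vectors are hypothetical placeholders; in an actual evaluation they would come from the CLIP image and text encoders, which are outside the scope of this example.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical image and text embeddings pointing in the same direction.
image_embedding = [0.2, 0.1, 0.7]
text_embedding = [0.4, 0.2, 1.4]
aligned = cosine_similarity(image_embedding, text_embedding)
unrelated = cosine_similarity([1.0, 0.0], [0.0, 1.0])
```

Embeddings that point in the same direction score close to 1 regardless of their magnitude, while orthogonal embeddings score 0.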

[0127] Figure 6 is a schematic diagram of the simulation experiment results provided by the present invention. As shown in Figure 6, the adaptive conditional enhancement method for face generation provided by this invention achieves an FID of 53.3614, an mIoU of 60.5719%, and a CLIP-Score of 16.2073%. Adding only the adaptive region attention normalization module to the basic framework improved mIoU by 1.1319% (60.4595% vs 59.3276%), showing that the normalization module improves semantic mask alignment. Adding only the face attention adapter improved CLIP-Score by 0.1612% (16.1561% vs 15.9949%), indicating that adapter fine-tuning improves text alignment. Adding both the normalization module and the adapter module together gave an mIoU and a CLIP-Score of 60.2666% and 16.1333%, respectively; the control scale predictor then improved these two metrics by a further 0.3353% and 0.0740%, respectively, while reducing FID by 4.4177 (53.3614 vs 57.7791). This significantly improved the quality of face generation, indicating that coordinated control conditions can generate more realistic faces.

[0128] The adaptive conditional enhancement method for face generation provided by this invention can meet the requirements of multimodal input conditions and has a high degree of alignment between semantic mask and text conditions. Regarding the semantic mask condition, thanks to the adaptive region attention normalization module, facial edge features are enhanced, resulting in more complete and smooth edge feature generation. For the text condition, a face attention adapter is used to fine-tune CLIP, reducing computation while enhancing the detailed texture features and semantic understanding of the face (such as a bright face and dotted whiskers). Furthermore, the input of multiple conditions leads to certain conflicts in the constraints of face generation; the control scale predictor ensures that the conditions are coordinated, thereby enhancing the generation quality.

[0129] The adaptive condition enhancement device for face generation provided in the embodiments of this application will be described below. The adaptive condition enhancement device for face generation described below can be referred to in correspondence with the adaptive condition enhancement method for face generation described above.

[0130] Figure 7 is a schematic diagram of the adaptive condition enhancement device for face generation provided by the present invention. As shown in Figure 7, the adaptive condition enhancement device for face generation may include, but is not limited to:

[0131] The image encoding module 710 is used to: input a real face image into a diffusion model to obtain a latent feature map output by the VAE encoder in the diffusion model;

[0132] The noise addition module 720 is used to: add actual noise to the latent feature map at time step t to obtain a latent noise map;

[0133] The text encoding module 730 is used to: input the text prompt into the CLIP text encoder to obtain the text features output by the CLIP text encoder;

[0134] The mask encoding module 740 is used to: input a semantic mask image into the VAE encoder of the copy U-Net encoding module to obtain a latent mask image output by the VAE encoder; the pixel values of the pixels in the semantic mask image represent the face part categories;

[0135] The noise prediction module 750 is used to: input the latent mask image, the latent noise image, the time step t and the text features into the denoising U-Net in the diffusion model to obtain the predicted noise output by the U-Net;

[0136] The model training module 760 is used to train the model with the goal of minimizing the difference between the predicted noise and the actual noise.

[0137] It should be noted that the adaptive condition enhancement device for face generation provided in this embodiment of the invention can execute the adaptive condition enhancement method for face generation described in any of the above embodiments during specific operation, which will not be elaborated in this embodiment.

[0138] Figure 8 is a schematic diagram of the physical structure of an example electronic device. As shown in Figure 8, the electronic device may include: a processor 810, a communications interface 820, a memory 830, and a communication bus 840, wherein the processor 810, the communications interface 820, and the memory 830 communicate with each other via the communication bus 840. The processor 810 can call logical instructions in the memory 830 to execute an adaptive conditional enhancement method for face generation, the method including:

[0139] A real face image is input into the diffusion model to obtain the latent feature map output by the VAE encoder in the diffusion model;

[0140] At time step t, actual noise is added to the latent feature map to obtain a latent noise map;

[0141] The text prompt is input into the CLIP text encoder to obtain the text features output by the CLIP text encoder;

[0142] The semantic mask image is input into the VAE encoder that replicates the U-Net encoding module to obtain the latent mask image output by the VAE encoder; the pixel values of the pixels in the semantic mask image represent the face part categories;

[0143] The latent mask image, the latent noise image, the time step t, and the text features are input into the denoising U-Net in the diffusion model to obtain the predicted noise output by the U-Net;

[0144] The model is trained with the goal of minimizing the difference between the predicted noise and the actual noise.

[0145] Furthermore, the logical instructions in the aforementioned memory 830 can be implemented as software functional units and, when sold or used as independent products, can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, or the part that contributes to the prior art, or a part of the technical solution, can be embodied in the form of a software product. This computer software product is stored in a storage medium and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as USB flash drives, portable hard drives, read-only memory (ROM), random access memory (RAM), magnetic disks, or optical disks.

[0146] On the other hand, the present invention also provides a computer program product, the computer program product comprising a computer program that can be stored on a non-transitory computer-readable storage medium, wherein when the computer program is executed by a processor, the computer is able to execute the adaptive conditional enhancement method for face generation provided by the above methods, the method comprising:

[0147] A real face image is input into the diffusion model to obtain the latent feature map output by the VAE encoder in the diffusion model;

[0148] At time step t, actual noise is added to the latent feature map to obtain a latent noise map;

[0149] The text prompt is input into the CLIP text encoder to obtain the text features output by the CLIP text encoder;

[0150] The semantic mask image is input into the VAE encoder that replicates the U-Net encoding module to obtain the latent mask image output by the VAE encoder; the pixel values of the pixels in the semantic mask image represent the face part categories;

[0151] The latent mask image, the latent noise image, the time step t, and the text features are input into the denoising U-Net in the diffusion model to obtain the predicted noise output by the U-Net;

[0152] The model is trained with the goal of minimizing the difference between the predicted noise and the actual noise.

[0153] In another aspect, the present invention also provides a non-transitory computer-readable storage medium having a computer program stored thereon, which, when executed by a processor, implements the adaptive conditional enhancement method for face generation provided by the methods described above, the method comprising:

[0154] A real face image is input into the diffusion model to obtain the latent feature map output by the VAE encoder in the diffusion model;

[0155] At time step t, actual noise is added to the latent feature map to obtain a latent noise map;

[0156] The text prompt is input into the CLIP text encoder to obtain the text features output by the CLIP text encoder;

[0157] The semantic mask image is input into the VAE encoder that replicates the U-Net encoding module to obtain the latent mask image output by the VAE encoder; the pixel values of the pixels in the semantic mask image represent the face part categories;

[0158] The latent mask image, the latent noise image, the time step t, and the text features are input into the denoising U-Net in the diffusion model to obtain the predicted noise output by the U-Net;

[0159] The model is trained with the goal of minimizing the difference between the predicted noise and the actual noise.

[0160] The device embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate. The components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the modules can be selected to achieve the purpose of this embodiment according to actual needs. Those skilled in the art can understand and implement this without any creative effort.

[0161] Through the above description of the embodiments, those skilled in the art can clearly understand that each embodiment can be implemented by means of software plus necessary general-purpose hardware platforms, and of course, it can also be implemented by hardware. Based on this understanding, the above technical solutions, in essence or the part that contributes to the prior art, can be embodied in the form of a software product. This computer software product can be stored in a computer-readable storage medium, such as ROM / RAM, magnetic disk, optical disk, etc., and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute the methods described in the various embodiments or some parts of the embodiments.

[0162] Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present invention, and not to limit them; although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some of the technical features; and these modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims

1. An adaptive conditional enhancement method for face generation, characterized by comprising:
A real face image is input into the diffusion model to obtain the latent feature map output by the VAE encoder in the diffusion model;
At time step t, actual noise is added to the latent feature map to obtain a latent noise map;
The text prompt is input into the CLIP text encoder to obtain the text features output by the CLIP text encoder;
The semantic mask image is input into the VAE encoder that replicates the U-Net encoding module to obtain the latent mask image output by the VAE encoder; the pixel values of the pixels in the semantic mask image represent the face part categories;
The latent mask image, the latent noise image, the time step t, and the text features are input into the denoising U-Net in the diffusion model to obtain the predicted noise output by the U-Net;
The model is trained with the goal of minimizing the difference between the predicted noise and the actual noise;
The diffusion model is optimized as follows:
A normalization block is inserted between each layer of the copied U-Net encoding module; the normalization block includes an Adaptive Region Attention Normalization (ARAN) module; the ARAN module processes the input feature map based on the following steps:
The input feature map is input into the self-attention layer of the ARAN module to obtain the enhanced conditional feature map output by the self-attention layer;
The enhanced conditional feature map is input into the first convolutional layer and the first activation layer connected in sequence in the ARAN module to obtain the scaling factor output by the first activation layer;
The enhanced conditional feature map is input into the second convolutional layer and the second activation layer connected in sequence in the ARAN module to obtain the offset factor output by the second activation layer;
After instance normalization of the input feature map, the scaling factor and the offset factor are used to transform it to obtain the first output feature map.

2. The method of claim 1, wherein the normalization block processes the input feature map based on the following steps:
The input feature map is input to the first ARAN module to obtain the first output feature map output by the first ARAN module;
The first output feature map is input into the first SiLU layer to obtain the first activation feature map output by the first SiLU layer;
The first activation feature map is input into the first convolutional layer to obtain the first convolutional feature map output by the first convolutional layer;
The first convolutional feature map is input into the second ARAN module to obtain the second output feature map output by the second ARAN module;
The second output feature map is input into the second SiLU layer to obtain the second activation feature map output by the second SiLU layer;
The second activation feature map is input into the second convolutional layer to obtain the second convolutional feature map output by the second convolutional layer;
The input feature map and the second convolutional feature map are fused to obtain the output feature map.

3. The method of claim 1, wherein a face attention adapter is further included after the CLIP text encoder; the face attention adapter includes two adaptation sub-modules connected via residual connections; each of the adaptation sub-modules includes a feedforward network and a multi-head attention layer connected in sequence.

4. The method of claim 3, wherein the feedforward network comprises a layer normalization layer, a GEGLU activation layer, a dropout layer, and a linear transformation layer connected in sequence.

5. The adaptive conditional enhancement method for face generation of any one of claims 1-4, wherein each downsampling block of the copied U-Net encoding module is followed by a conditional fusion module, which determines the weights of the latent mask map and the text features based on the following steps:
The time step t is input into the first linear layer of the conditional fusion module to obtain the first transformation feature output by the first linear layer;
The first transformation feature is input into the ReLU activation layer of the conditional fusion module to obtain the second transformation feature output by the ReLU activation layer;
The second transformation feature is input into the second linear layer of the conditional fusion module to obtain the third transformation feature output by the second linear layer;
The third transformation feature is input into the Sigmoid activation layer of the conditional fusion module to obtain the prediction result output by the Sigmoid activation layer;
The prediction result is used as the weight of the latent mask map.

6. An adaptive conditional enhancement device for face generation, characterized by comprising:
The image encoding module is used to: input a real human face image into the diffusion model to obtain the latent feature map output by the VAE encoder in the diffusion model;
The noise addition module is used to: add actual noise to the latent feature map at time step t to obtain a latent noise map;
The text encoding module is used to: input text prompts into the CLIP text encoder to obtain the text features output by the CLIP text encoder;
The mask encoding module is used to: input a semantic mask image into the VAE encoder that replicates the U-Net encoding module to obtain a latent mask image output by the VAE encoder; the pixel values of the pixels in the semantic mask image represent the face part categories;
The noise prediction module is used to: input the latent mask image, the latent noise image, the time step t, and the text features into the denoising U-Net in the diffusion model to obtain the predicted noise output by the U-Net;
The model training module is used to train the model with the goal of minimizing the difference between the predicted noise and the actual noise;
The diffusion model is optimized as follows:
A normalization block is inserted between each layer of the copied U-Net encoding module; the normalization block includes an Adaptive Region Attention Normalization (ARAN) module; the ARAN module processes the input feature map based on the following steps:
The input feature map is input into the self-attention layer of the ARAN module to obtain the enhanced conditional feature map output by the self-attention layer;
The enhanced conditional feature map is input into the first convolutional layer and the first activation layer connected in sequence in the ARAN module to obtain the scaling factor output by the first activation layer;
The enhanced conditional feature map is input into the second convolutional layer and the second activation layer connected in sequence in the ARAN module to obtain the offset factor output by the second activation layer;
After instance normalization of the input feature map, the scaling factor and the offset factor are used to transform it to obtain the first output feature map.

7. An electronic device comprising a memory, a processor, and a computer program stored on the memory and running on the processor, characterized in that the computer program is executed by the processor to implement the adaptive conditional enhancement method for face generation according to any one of claims 1 to 5.

8. A non-transitory computer-readable storage medium having stored thereon a computer program, characterized in that the computer program is executed by a processor to implement the adaptive conditional enhancement method for face generation according to any one of claims 1 to 5.

9. A computer program product comprising a computer program, characterized in that the computer program is executed by a processor to implement the adaptive conditional enhancement method for face generation according to any one of claims 1 to 5.