Image generation method, system, device, and medium
An ID encoder-Q-Former refinement link and a two-stage AdaLN injection mechanism across MM-DiT and DiT address the loss of identity information and the imprecise text control that are common in portrait generation, producing high-quality personalized portraits that stay consistent with the reference identity and conform to the requested style.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- BEIJING JIZHI DIGITAL TECH CO LTD
- Filing Date
- 2026-03-24
- Publication Date
- 2026-06-12
AI Technical Summary
Existing technologies struggle to simultaneously preserve the identity of the reference face and adhere to the text style in portrait generation, resulting in issues such as identity distortion, misalignment between text and generated results, easy incorporation of noise in complex backgrounds, loss of facial details, and poor compatibility between different text styles and the reference portrait.
An ID encoder-Q-Former refinement link improves the purity of the facial features. Combined with the two-stage AdaLN injection mechanism of MM-DiT and DiT, precise text-visual collaboration is achieved through branch self-attention and cross-modal stitching, so that high-quality personalized portraits can be generated.
It achieves a high degree of preservation and consistent restoration of the identity features of the target object while strictly adhering to the text description, significantly improving the realism, controllability and identity fidelity of the generated image.
Figure CN122199728A_ABST
Abstract
Description
Technical Field
[0001] This application relates to the technical field of image generation, and in particular to an image generation method, system, device and medium.
Background Technology
[0002] In the field of portrait generation, the need to preserve the identity of the reference face and follow the text style customization is becoming increasingly prominent. However, it faces the challenge of inaccurate multimodal feature fusion. Although technologies such as VAE (Variational Autoencoder), CLIP (Contrastive Language-Image Pre-Training), and ArcFace (Additive Angular Margin Loss) have laid the foundation for cross-modal generation, it is difficult to balance identity consistency and style controllability.
[0003] Related technologies employ CLIP-guided VAE-Transformer algorithms (such as the SD (Stable Diffusion) face fine-tuning model), injecting facial features only in a single stage during the later generation phase. This can easily lead to identity distortion and misalignment between the text and the generated result. Another related technology uses an ID encoder + single-stage fusion algorithm, simply overlaying and fusing facial features extracted using ArcFace without feature refinement or continuous enhancement of identity information. It is also prone to noise incorporation and loss of facial details in complex backgrounds. Furthermore, most of these related technologies use a single fusion method, resulting in poor adaptability to different text styles and reference portraits.
Summary of the Invention
[0004] The embodiments of this application aim to at least partially solve one of the technical problems in the related art. Therefore, the purpose of the embodiments of this application is to provide an image generation method, system, device, and medium that improves the quality of image generation.
[0005] This application provides an image generation method, comprising: acquiring original image data and text description data, and extracting target object image data from the original image data; encoding the original image data and text description data respectively to obtain object latent feature data and text feature data, and encoding and attention calculation on the target object image data to obtain object identity feature data; performing preliminary cross-modal feature fusion processing and feature alignment processing based on the object latent feature data, text feature data, and object identity feature data to obtain cross-modal fused feature data; performing deep feature fusion processing based on the cross-modal fused feature data and object identity feature data to obtain deep fused feature data; and generating a target image corresponding to the text description data and target object image data based on the deep fused feature data.
[0006] For example, cross-modal feature fusion processing and feature alignment processing are performed based on object latent feature data, text feature data, and object identity feature data to obtain cross-modal fused feature data. This includes: performing self-attention calculation on object latent feature data to obtain internal association feature data; performing self-attention calculation on text feature data to obtain enhanced text feature data; performing cross-modal feature fusion processing based on internal association feature data and object identity feature data to obtain fused image feature data; performing nonlinear activation processing, residual processing, and normalization processing on the fused image feature data and enhanced text feature data based on a multilayer perceptron to obtain processed fused image feature data and processed enhanced text feature data; and performing feature alignment processing based on the processed fused image feature data and processed enhanced text feature data to obtain cross-modal fused feature data.
[0007] For example, feature alignment processing is performed on the processed fused image feature data and the processed enhanced text feature data to obtain cross-modal fused feature data, including: performing a first convolution processing on the processed fused image feature data to obtain first image feature data; and performing concatenation processing on the first image feature data and the processed enhanced text feature data to obtain cross-modal fused feature data.
[0008] For example, deep feature fusion processing is performed on cross-modal fusion feature data and object identity feature data to obtain deep fusion feature data, including: performing global self-attention calculation on the cross-modal fusion feature data to obtain text-image global association feature data; performing attention enhancement calculation on the text-image global association feature data and object identity feature data to obtain object identity enhancement feature data; and performing nonlinear activation processing, residual processing and normalization processing on the object identity enhancement feature data based on a multilayer perceptron to obtain deep fusion feature data.
[0009] For example, generating a target image corresponding to text description data and target object image data based on deep fusion feature data includes: performing a second convolution process on the deep fusion feature data to obtain target feature data; and performing decoding processing on the target feature data to obtain the target image corresponding to the text description data and target object image data.
[0010] For example, encoding and attention calculations are performed on target object image data to obtain object identity feature data, including: performing identity encoding processing on target object image data to obtain initial object identity feature data; and performing self-attention and cross-attention enhancement calculations on the initial object identity feature data to obtain object identity feature data.
[0011] For example, performing self-attention and cross-attention enhancement calculations on the initial object identity feature data to obtain object identity feature data includes: performing self-attention calculation based on the query vector to obtain the target query vector, wherein the query vector is obtained by training the Q-Former model; and performing cross-attention calculation based on the target query vector and the initial object identity feature data to obtain the object identity feature data.
[0012] Another embodiment of this application provides an image generation system, comprising: an acquisition module for acquiring original image data and text description data, and extracting target object image data from the original image data; an encoding module for encoding the original image data and text description data respectively to obtain object latent feature data and text feature data, and encoding and attention calculation on the target object image data to obtain object identity feature data; a first fusion module for performing preliminary cross-modal feature fusion processing and feature alignment processing based on the object latent feature data, text feature data, and object identity feature data to obtain cross-modal fused feature data; a second fusion module for performing deep feature fusion processing based on the cross-modal fused feature data and object identity feature data to obtain deep fused feature data; and a generation module for generating a target image corresponding to the text description data and the target object image data based on the deep fused feature data.
[0013] Another embodiment of this application provides an electronic device having a computer program stored thereon, which, when executed by a processor, implements the steps of the method of any of the above embodiments.
[0014] Another embodiment of this application provides a computer-readable storage medium having a computer program stored thereon, which, when executed by a processor, implements the steps of the method of any of the above embodiments.
[0015] In the above embodiments, the image generation method includes: acquiring original image data and text description data, and extracting target object image data from the original image data; encoding the original image data and text description data respectively to obtain object latent feature data and text feature data, and encoding and attention calculation on the target object image data to obtain object identity feature data; performing preliminary cross-modal feature fusion processing and feature alignment processing based on the object latent feature data, text feature data, and object identity feature data to obtain cross-modal fused feature data; performing deep feature fusion processing based on the cross-modal fused feature data and object identity feature data to obtain deep fused feature data; and generating a target image corresponding to the text description data and target object image data based on the deep fused feature data. By encoding the original image, text description, and target object image separately, latent object features, text features, and high-fidelity object identity features are generated. Then, through preliminary cross-modal feature fusion and feature alignment processing, the text semantics and visual features are accurately associated. Finally, through deep feature fusion, the identity information is stably embedded into the generation process. Ultimately, under the premise of strictly following the text description content, a high degree of preservation and consistent restoration of the target object's identity features is achieved. This effectively solves the problems of easy loss of identity information and imprecise text control in image generation, and significantly improves the realism, controllability, and identity fidelity of the generated image.
Attached Figure Description
[0016] Figure 1 is a flowchart of an image generation method provided by an embodiment of this application; Figure 2 is a flowchart of another image generation method provided by an embodiment of this application; Figure 3 is a block diagram of an image generation system provided by another embodiment of this application; Figure 4 is a block diagram of an electronic device provided by another embodiment of this application.
Detailed Implementation
[0017] The embodiments of this application are described in detail below. Examples of the embodiments are shown in the accompanying drawings, wherein the same or similar reference numerals denote the same or similar elements or elements having the same or similar functions throughout. The embodiments described below with reference to the accompanying drawings are exemplary and intended to explain this application, and should not be construed as limiting this application.
[0018] In the field of portrait generation, there is a growing demand for high-quality generation that combines reference portrait identity with customized text style. However, related methods struggle to simultaneously satisfy both "preserving facial identity" and "following text style": First, the fusion of facial features and multimodal features is inaccurate, easily leading to dilution of identity information or impure feature extraction; second, insufficient text-visual collaboration often results in generated results that do not match the text or are distorted in detail; and third, the ability to balance identity and style is weak, resulting in poor flexibility. Although technologies such as VAE (Variational Autoencoder), CLIP (Contrastive Language-Image Pre-Training), and ArcFace (Additive Angular Margin Loss for Deep Face Recognition) have laid the foundation for cross-modal generation, they still struggle to balance identity consistency and style controllability.
[0019] Related technologies employ CLIP-guided VAE-Transformer algorithms (such as the SD (Stable Diffusion) face fine-tuning model), injecting facial features only in a single stage during the later generation phase, which easily leads to identity distortion and misalignment between text and generated results. Another related technology uses an ID (Identification) encoder + single-stage fusion algorithm, simply overlaying and fusing facial features extracted using ArcFace without feature refinement and failing to continuously enhance identity information. It is also prone to noise incorporation and loss of facial details in complex backgrounds. Furthermore, most of the aforementioned technologies employ a single fusion method, resulting in poor adaptability to different text styles and reference portraits.
[0020] In view of this, this application aims to solve the above problems and proposes an image generation method. It improves the purity of facial features through an ID encoder-Q-Former refinement link, strengthens identity information by combining a two-stage AdaLN (Adaptive Layer Normalization) injection mechanism of MM-DiT (Multi-Modal Diffusion Transformer) and DiT (Diffusion Transformer), and achieves precise text-visual collaboration by combining branch self-attention and cross-modal stitching. Finally, it generates a high-quality personalized portrait that retains the identity of the reference face and conforms to the style of the text.
[0021] It is important to note that obtaining users' personal information, such as facial data, is only permitted with the user's authorization.
[0022] Figure 1 is a flowchart illustrating the image generation method provided in this application.
[0023] As shown in Figure 1, the image generation method 100 provided in this application embodiment includes, for example, steps S110-S150.
[0024] Step S110: Obtain the original image data and text description data, and extract the target object image data from the original image data.
[0025] For example, the original image data includes a portrait image; the text description data includes style requirements for the portrait image, such as hairstyle, accessories, facial expression, and clothing; and the target object image data is, for example, the face region of the portrait image. The extraction of the target object image data from the original image data is implemented based on a neural network model, such as the MTCNN (Multi-Task Cascaded Convolutional Networks) detection model.
[0026] Step S120: Encode the original image data and text description data to obtain object latent feature data and text feature data, and encode and perform attention calculation on the target object image data to obtain object identity feature data.
[0027] For example, the original image data is encoded by a VAE encoder, and the text description data is encoded by a CLIP encoder that adopts ViT-L/14 (Vision Transformer-Large/14). The target object image data is encoded by a pre-trained ArcFace identity encoder (ID encoder). The attention calculation is implemented based on a Q-Former model, which can include multiple layers and multiple learnable query vectors (Query); the attention calculation can include self-attention calculation, cross-attention calculation, etc.
[0028] Step S130: Based on the object's latent feature data, text feature data, and object identity feature data, perform preliminary cross-modal feature fusion processing and feature alignment processing to obtain cross-modal fused feature data.
[0029] For example, the preliminary cross-modal feature fusion processing is implemented based on the MM-DiT module. The MM-DiT module includes multiple sub-modules, such as six sub-modules. Each sub-module includes an image branch and a text branch. The preliminary cross-modal feature fusion processing includes performing self-attention calculation on the object latent feature data and on the text feature data, and fusing the self-attended object latent feature data with the object identity feature data through adaptive layer normalization (AdaLN) to obtain fused image feature data. The fused image feature data and the self-attended text feature data are input into a feedforward network (including a multi-layer perceptron (MLP)) for nonlinear activation processing, residual processing, and normalization processing. Feature alignment processing includes feature convolution and feature concatenation.
[0030] Step S140: Perform deep feature fusion processing based on cross-modal fusion feature data and object identity feature data to obtain deep fusion feature data.
[0031] For example, the feature deep fusion processing is implemented based on the DiT module, which includes multiple sub-modules, such as 12 sub-modules. The feature deep fusion processing includes performing global self-attention calculation on the cross-modal fused feature data, and fusing the object identity feature data with the cross-modal fused feature data obtained by global self-attention calculation through adaptive layer normalization (AdaLN) to obtain object identity enhanced feature data. The object identity enhanced feature data is then input into a feedforward network (including a multi-layer perceptron (MLP)) for nonlinear activation processing, residual processing, and normalization processing to obtain deep fused feature data.
[0032] Step S150: Based on the deep fusion feature data, generate a target image corresponding to the text description data and the target object image data.
[0033] For example, the deep fusion feature data is convolved, and the convolved deep fusion feature data is input into the VAE decoder for decoding to obtain the target image corresponding to the text description data and the target object image data.
[0034] In the above embodiments, by encoding the original image, text description, and target object image respectively, latent object features, text features, and high-fidelity object identity features are generated. Then, through preliminary cross-modal feature fusion and feature alignment processing, the text semantics and visual features are accurately associated. After deep feature fusion, the identity information is stably embedded into the generation process. Finally, under the premise of strictly following the text description content, a high degree of preservation and consistent restoration of the target object's identity features is achieved. This effectively solves the problems of easy loss of identity information and inaccurate text control in image generation, and significantly improves the realism, controllability, and identity fidelity of the generated image.
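The data flow of steps S110-S150 can be sketched as a short driver function. This is only an illustrative skeleton: the stage names (`detect_face`, `vae_encode`, `clip_encode`, `id_encode`, `q_former`, `mm_dit`, `dit`, `vae_decode`) are hypothetical placeholders that do not appear in the application, and the string-building stubs stand in for real models purely to make the call order visible.

```python
def generate(image, text, stages):
    """Run steps S110-S150; `stages` maps hypothetical stage names to callables."""
    face = stages["detect_face"](image)                        # S110: extract the face region
    latent = stages["vae_encode"](image)                       # S120: object latent features
    text_feat = stages["clip_encode"](text)                    # S120: text features
    identity = stages["q_former"](stages["id_encode"](face))   # S120: refined identity features
    fused = stages["mm_dit"](latent, text_feat, identity)      # S130: first identity injection
    deep = stages["dit"](fused, identity)                      # S140: second identity injection
    return stages["vae_decode"](deep)                          # S150: decode the target image


# Stub stages that only record which inputs they received.
stubs = {
    "detect_face": lambda x: f"face({x})",
    "vae_encode": lambda x: f"vae({x})",
    "clip_encode": lambda t: f"clip({t})",
    "id_encode": lambda f: f"arc({f})",
    "q_former": lambda f: f"qf({f})",
    "mm_dit": lambda l, t, i: f"mmdit({l},{t},{i})",
    "dit": lambda f, i: f"dit({f},{i})",
    "vae_decode": lambda d: f"dec({d})",
}
trace = generate("img", "txt", stubs)
```

In the resulting trace the refined identity features appear twice, once inside the MM-DiT stage and once inside the DiT stage, which mirrors the two-stage injection described above.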
[0035] In one example, step S120 encodes and performs attention calculations on the target object image data to obtain object identity feature data, including: performing identity encoding processing on the target object image data to obtain initial object identity feature data; and performing self-attention and cross-attention enhancement calculations on the initial object identity feature data to obtain object identity feature data.
[0036] Specifically, the extracted face image (target object image data) is input into the pre-trained ArcFace identity encoder, which outputs face identity features containing unique identity information (initial object identity feature data). ArcFace focuses on extracting identity features to ensure face consistency. The initial object identity feature data is then input into the Q-Former model for refinement, resulting in object identity feature data. The Q-Former model consists of multiple layers and includes multiple learnable vectors (Query). Self-attention calculation is performed among the learnable vectors, and cross-attention calculation is performed between the learnable vectors and the initial object identity feature data.
[0037] For example, ArcFace features (initial object identity feature data) are input into a 6-layer Q-Former model (configured with 8 learnable queries). Self-attention and cross-attention enhance the discriminative power of the features, outputting refined face features. Q-Former further strengthens the semantic expressive power of the features, providing a stable signal for subsequent identity preservation.
[0038] In one example, self-attention and cross-attention enhancement calculations are performed on the initial object identity feature data to obtain the object identity feature data. This includes: performing self-attention calculation based on the query vector to obtain the target query vector, wherein the query vector is obtained by training the Q-Former model; and performing cross-attention calculation based on the target query vector and the initial object identity feature data to obtain the object identity feature data.
[0039] Specifically, the query vector (learnable vector) includes multiple vectors, such as 8 learnable vectors. Attention calculation includes queries, keys, and values. Self-attention calculation involves generating corresponding queries, keys, and values for each query vector through a linear transformation. Attention weights are obtained by calculating the dot product of the query and all keys and normalizing it. The values are then weighted and summed using these weights to update the representation of each query vector, thereby obtaining the target query vector and achieving full fusion of the internal contextual information of the query vector.
[0040] Cross-attention computation, for example, uses the target query vector as the query, maps the initial object identity feature data to keys and values respectively, then calculates the dot product of the query and the key, and obtains the attention weight distribution after scaling and normalization; finally, based on the attention weight distribution, the values are selectively aggregated to generate a refined query vector representation that integrates the initial object identity feature data, i.e., the object identity feature data.
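The self-attention and cross-attention steps of paragraphs [0039]-[0040] can be illustrated with a minimal, dependency-free sketch of a single Q-Former-style layer. The feature width (16), the number of identity tokens (5), the random weights, and all helper names are assumptions made for illustration; only the count of 8 learnable queries comes from the text, and a trained model would use learned weights and several stacked layers.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(q_in, kv_in, w_q, w_k, w_v):
    """Scaled dot-product attention in which q_in attends over kv_in."""
    q, k, v = matmul(q_in, w_q), matmul(kv_in, w_k), matmul(kv_in, w_v)
    scale = math.sqrt(len(q[0]))
    k_t = [list(col) for col in zip(*k)]
    weights = [softmax([s / scale for s in row]) for row in matmul(q, k_t)]
    return matmul(weights, v), weights


def rand_matrix(rows, cols, rng):
    """Small random matrix standing in for trained parameters."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
dim = 16                                   # assumed feature width
queries = rand_matrix(8, dim, rng)         # 8 learnable query vectors
id_features = rand_matrix(5, dim, rng)     # stand-in for ArcFace output (token count assumed)

# Self-attention among the queries gives the target query vectors.
w_self = [rand_matrix(dim, dim, rng) for _ in range(3)]
target_queries, _ = attention(queries, queries, *w_self)

# Cross-attention: target queries are the queries, identity features give keys and values.
w_cross = [rand_matrix(dim, dim, rng) for _ in range(3)]
refined_id, cross_weights = attention(target_queries, id_features, *w_cross)
```

Each row of `cross_weights` is a probability distribution over the identity tokens, so every refined query is a weighted aggregate of the initial identity features.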
[0041] In one example, cross-modal feature fusion and feature alignment are performed based on object latent feature data, text feature data, and object identity feature data to obtain cross-modal fused feature data. This includes: performing self-attention calculation on the object latent feature data to obtain internal association feature data; performing self-attention calculation on the text feature data to obtain enhanced text feature data; performing cross-modal feature fusion based on the internal association feature data and object identity feature data to obtain fused image feature data; performing nonlinear activation, residual processing, and normalization processing on the fused image feature data and enhanced text feature data based on a multilayer perceptron to obtain processed fused image feature data and processed enhanced text feature data; and performing feature alignment based on the processed fused image feature data and processed enhanced text feature data to obtain cross-modal fused feature data.
[0042] Specifically, the preliminary cross-modal feature fusion processing is implemented based on the MM-DiT module, which includes multiple sub-modules, such as six sub-modules. Each sub-module includes an image branch and a text branch. Each sub-module performs self-attention calculation, adaptive layer normalization (AdaLN), feedforward network processing, residual processing, normalization processing, and feature alignment processing. These sub-modules are connected in series. Self-attention calculation is implemented based on the self-attention module. The image branch performs self-attention calculation on the latent portrait features (object latent feature data) to learn the internal feature associations of the image, obtaining internal association feature data. The text branch performs self-attention calculation on text semantic features (text feature data) to enhance text semantic consistency, obtaining enhanced text feature data. The self-attention module parameters are the same for both the image and text branches. Feature alignment processing includes feature convolution and feature concatenation.
[0043] Refined facial features (object identity feature data) are injected into the image branch features (internal association feature data) through adaptive layer normalization (AdaLN) to obtain fused image feature data. The fused image feature data and the enhanced text feature data are then input into a feedforward network (including a Multi-Layer Perceptron (MLP)) for nonlinear activation processing, residual processing, and normalization processing to obtain processed fused image feature data and processed enhanced text feature data. The MLP consists of two layers. Nonlinear activation processing is implemented based on GELU (Gaussian Error Linear Unit). Residual processing involves adding the processed result of each layer to its input data to obtain the output of each layer, and then normalizing the output of each layer (Layer Normalization). Based on the processed fused image feature data and processed enhanced text feature data, feature alignment processing is performed to obtain cross-modal fused feature data.
[0044] In the above embodiments, the sub-module retains modal characteristics through branch self-attention, and AdaLN realizes the dynamic fusion of facial features and image features, ensuring that identity information is naturally integrated into the generation process.
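The application does not give a formula for the AdaLN injection, so the following is a sketch under a common formulation: each token is layer-normalized, then modulated by a scale and a shift that are linear projections of a conditioning vector. Mean-pooling the refined identity tokens into one conditioning vector, the `(1 + scale)` form, and all names and numbers are assumptions for illustration.

```python
import math


def layer_norm(x, eps=1e-6):
    """Normalize one token to zero mean and unit variance (no learned affine)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def linear(vec, weight, bias):
    """Apply a linear map; `weight` is a list of output rows."""
    return [sum(w * v for w, v in zip(row, vec)) + b for row, b in zip(weight, bias)]


def adaln(tokens, cond, w_scale, b_scale, w_shift, b_shift):
    """Modulate normalized tokens with a scale and shift predicted from `cond`."""
    scale = linear(cond, w_scale, b_scale)
    shift = linear(cond, w_shift, b_shift)
    return [
        [n * (1.0 + s) + t for n, s, t in zip(layer_norm(tok), scale, shift)]
        for tok in tokens
    ]


# Assumed pooling: average the refined identity tokens into one conditioning vector.
identity_tokens = [[1.0, 0.0, 2.0], [3.0, 2.0, 0.0]]
pooled = [sum(col) / len(col) for col in zip(*identity_tokens)]

image_tokens = [[1.0, 2.0, 3.0, 4.0], [0.0, 0.0, 2.0, 2.0]]
w_scale = [[0.1, 0.0, 0.0]] * 4     # toy projection: scale = 0.1 * pooled[0]
w_shift = [[0.0, 0.5, 0.0]] * 4     # toy projection: shift = 0.5 * pooled[1]
fused_image = adaln(image_tokens, pooled, w_scale, [0.0] * 4, w_shift, [0.0] * 4)
```

Because the scale and shift depend on the identity features, the same image tokens are modulated differently for different reference faces, which is the "dynamic fusion" the paragraph above refers to.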
[0045] In one example, feature alignment processing is performed on the processed fused image feature data and the processed enhanced text feature data to obtain cross-modal fused feature data, including: performing a first convolution processing on the processed fused image feature data to obtain first image feature data; and performing concatenation processing on the first image feature data and the processed enhanced text feature data to obtain cross-modal fused feature data.
[0046] Specifically, the image branch features (processed fused image feature data) are mapped to the same dimension as the text branch features (processed enhanced text feature data) through a 1×1 convolution (first convolution processing). The image sequence (first image feature data) and the text sequence (processed enhanced text feature data) are concatenated along the sequence length dimension to obtain cross-modal fused feature data.
[0047] In the above embodiments, the unified dimension ensures that the subsequent model can process two modal features simultaneously, and the splicing operation constructs a complete "text-image" feature sequence.
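The alignment step of paragraph [0046] can be sketched as follows. On a token sequence, a 1×1 convolution is the same linear map applied independently at every position, so it is written here as a per-token projection. The toy widths (image width 2, text width 3), the weights, and the image-first concatenation order are assumptions for illustration.

```python
def conv1x1(tokens, weight, bias):
    """1x1 convolution over a sequence: one shared linear map per position."""
    return [
        [sum(w * v for w, v in zip(row, tok)) + b for row, b in zip(weight, bias)]
        for tok in tokens
    ]


image_tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]   # 3 image tokens, width 2
text_tokens = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]      # 2 text tokens, width 3

# Project image tokens from width 2 to the text width 3.
weight = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]          # 3 output x 2 input channels
bias = [0.0, 0.0, 0.0]
first_image = conv1x1(image_tokens, weight, bias)

# Concatenate along the sequence-length dimension (image tokens first, by assumption).
cross_modal = first_image + text_tokens
```

After the projection both modalities share one channel width, so the concatenated sequence of five tokens can be passed to a single global self-attention layer.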
[0048] In one example, deep feature fusion processing is performed based on cross-modal fusion feature data and object identity feature data to obtain deep fusion feature data. This includes: performing global self-attention calculation on the cross-modal fusion feature data to obtain text-image global association feature data; performing attention enhancement calculation on the text-image global association feature data and object identity feature data to obtain object identity enhancement feature data; and performing nonlinear activation processing, residual processing, and normalization processing on the object identity enhancement feature data based on a multilayer perceptron to obtain deep fusion feature data.
[0049] Specifically, the feature deep fusion processing is implemented based on the DiT module, which includes multiple sub-modules, such as 12 sub-modules. Each sub-module performs self-attention calculation, adaptive layer normalization (AdaLN), feedforward network processing, residual processing, and normalization processing. The multiple sub-modules are connected in series. Cross-modal fusion feature data is input into the attention module to learn the global association between text and image features (text-image global association feature data). Then, refined facial features (object identity feature data) are injected into the attention output (text-image global association feature data) via AdaLN to strengthen the continuous influence of identity features, resulting in object identity enhancement feature data. This enhanced feature data is then input into a feedforward network (including a Multi-Layer Perceptron (MLP)) for nonlinear activation processing, residual processing, and normalization to obtain deep fusion feature data. The MLP consists of two layers. Nonlinear activation processing is implemented using GELU (Gaussian Error Linear Unit). Residual processing involves adding the processing result calculated by each layer of the MLP to the input data of each layer to obtain the output result of each layer, and then normalizing the output result of each layer.
[0050] In the above embodiments, the feature deep fusion model structure realizes fine fusion of cross-modal features, and the secondary injection of facial features ensures the consistency of identity in the generated images.
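The feedforward stage described in paragraph [0049] (a two-layer MLP with GELU, a residual connection per layer, and normalization of each layer's output) can be sketched for a single token as below. Keeping the width constant across layers so that the residual addition is well defined, applying GELU after each linear map, and the toy weights are all assumptions; the application does not fix these details.

```python
import math


def gelu(x):
    """Exact GELU: x times the standard normal CDF of x."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def layer_norm(x, eps=1e-6):
    """Normalize one token to zero mean and unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def mlp_layer(x, weight, bias):
    """One layer: linear map, GELU, add the layer input, then normalize."""
    h = [gelu(sum(w * v for w, v in zip(row, x)) + b) for row, b in zip(weight, bias)]
    return layer_norm([xi + hi for xi, hi in zip(x, h)])


def feed_forward(x, layers):
    """Apply the layers in order; each layer keeps the token width unchanged."""
    for weight, bias in layers:
        x = mlp_layer(x, weight, bias)
    return x


half_identity = [[0.5, 0.0, 0.0], [0.0, 0.5, 0.0], [0.0, 0.0, 0.5]]
layers = [(half_identity, [0.0, 0.0, 0.0]), (half_identity, [0.0, 0.0, 0.0])]
token = [1.0, -2.0, 0.5]            # one identity-enhanced token (toy values)
deep_token = feed_forward(token, layers)
```

The residual path lets the identity-enhanced signal pass through each layer unchanged when the MLP contributes little, while the normalization keeps token statistics stable across the 12 stacked sub-modules.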
[0051] In one example, generating a target image corresponding to text description data and target object image data based on deep fusion feature data includes: performing a second convolution process on the deep fusion feature data to obtain target feature data; and decoding the target feature data to obtain the target image corresponding to the text description data and target object image data.
[0052] Specifically, the features output by the DiT module (deep fusion feature data) are compressed by a 1×1 convolution (second convolution processing) and restored to a latent feature map (target feature data) of the same size as the output of the VAE encoder. The target feature data is then input into the VAE decoder, which restores the features to an RGB image (target image) of the original input image (original image data) size. The VAE decoder ensures the visual quality and detail representation of the generated image.
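The second convolution and the restoration to a latent feature map can be sketched as a per-token channel projection followed by a reshape into a height × width grid. The application does not say how the text tokens in the fused sequence are handled at this point; the sketch assumes the image tokens come first and simply drops the rest. All sizes and names are illustrative.

```python
def to_latent_map(tokens, weight, bias, height, width):
    """Project image tokens to latent channels and arrange them as a grid."""
    image_tokens = tokens[: height * width]   # assumption: image tokens lead the sequence
    projected = [
        [sum(w * v for w, v in zip(row, tok)) + b for row, b in zip(weight, bias)]
        for tok in image_tokens
    ]
    # Row-major reshape: `width` consecutive tokens form one row of the grid.
    return [projected[r * width:(r + 1) * width] for r in range(height)]


# 4 image tokens followed by 2 text tokens, each of width 3 (toy values).
tokens = [[float(i), float(i) + 0.5, 1.0] for i in range(6)]
weight = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]   # compress 3 channels to 2 latent channels
bias = [0.0, 0.0]
latent_map = to_latent_map(tokens, weight, bias, height=2, width=2)
```

The resulting grid has the spatial layout a VAE decoder expects, here 2 × 2 positions with 2 channels each.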
[0053] In another example, steps S130 and S140 above are repeated for iterative optimization.
[0054] For example, steps S130 and S140 involve a total of 50 iterations (the first 20 iterations learn text semantics (text description data), and the last 30 iterations enhance facial details (target object image data)). These multiple iterations balance the fusion weights of text semantics and facial features to output the final portrait (target image).
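The 50-iteration split can be written down as a simple schedule. The application states only the counts (20 iterations for text semantics, then 30 for facial details); how each phase changes the computation is not specified, so the sketch below returns a phase label and nothing more. The function name and labels are hypothetical.

```python
def phase_for(step, total_steps=50, text_steps=20):
    """Return the phase label for a zero-based iteration index."""
    if not 0 <= step < total_steps:
        raise ValueError("step out of range")
    return "text_semantics" if step < text_steps else "facial_details"


schedule = [phase_for(step) for step in range(50)]
```

A driver loop would consult `phase_for` before each repetition of steps S130 and S140 to decide which objective to emphasize.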
[0055] Figure 2 is a flowchart of another image generation method provided by an embodiment of this application. As shown in Figure 2, the image generation method includes S201-S207.
[0056] S201: Obtain the text description and portrait image, and perform portrait cutout on the portrait image to obtain a small image after cutout.
[0057] For example, the portrait image serves as the original image data, and the text description (text description data) includes style requirements for the portrait image, such as hairstyle, accessories, facial expression, and clothing. The small image after cutout (target object image data) is, for example, the face region of the portrait image. Portrait cutout is implemented based on a neural network model, such as the MTCNN (Multi-Task Cascaded Convolutional Networks) detection model.
[0058] S202: Encode the portrait image, the text description, and the small image after cutout.
[0059] For example, the input portrait image is fed into a pre-trained VAE encoder (the VAE model of the Stable Diffusion diffusion model), which outputs latent portrait features (object latent feature data). The VAE efficiently compresses image information while preserving key visual features; the latent feature space is better suited to subsequent model processing and reduces computational complexity. The user-input text description is fed into the CLIP-ViT-L/14 text encoder to generate text semantic features (text feature data). The CLIP text encoder, trained on large-scale image-text pairs, converts natural language into semantic vectors aligned with visual features, laying the foundation for cross-modal fusion. The small image after cutout (target object image data) is encoded using a pre-trained ArcFace identity encoder (ID encoder).
[0060] S203: Perform attention calculation based on the Q-Former model.
[0061] For example, the attention calculation may include self-attention calculation and cross-attention calculation. It is based on the Q-Former model, which may include multiple layers and multiple learnable query vectors. The encoded target object image data is input into the Q-Former model for attention calculation to obtain object identity feature data.
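The two attention steps of S203 can be sketched as follows: the learnable query vectors first attend to each other (self-attention), and the resulting target query vectors then read from the identity-encoder output (cross-attention). This is a minimal single-head sketch under stated assumptions. The learned projection matrices are omitted, the residual additions are an assumption, and all names (`attention`, `q_former_layer`) and values are hypothetical rather than taken from the application.

```python
import math


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    d = len(queries[0])
    scores = matmul(queries, transpose(keys))
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(weights, values)


def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def q_former_layer(query_vectors, id_tokens):
    # Self-attention: the query vectors attend to each other.
    target_queries = add(query_vectors, attention(query_vectors, query_vectors, query_vectors))
    # Cross-attention: the target queries read from the initial identity features.
    return add(target_queries, attention(target_queries, id_tokens, id_tokens))


# Hypothetical sizes: 2 learnable queries and 3 identity tokens, dimension 4.
queries = [[0.1, 0.2, 0.3, 0.4], [0.4, 0.3, 0.2, 0.1]]
id_tokens = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
identity_features = q_former_layer(queries, id_tokens)
```

The output has one row per query vector, so the number of queries fixes the size of the object identity feature data regardless of how many tokens the identity encoder produces.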
[0062] S204: Perform preliminary cross-modal feature fusion processing based on the MM-DiT module.
[0063] For example, the MM-DiT module includes N sub-modules, where N can be 6, and the sub-modules are connected in series. Each sub-module includes an image branch and a text branch, and performs self-attention calculation, adaptive layer normalization (AdaLN), feedforward network processing, residual processing, and normalization processing. The image branch performs self-attention calculation on the latent portrait features (object latent feature data) to learn the internal feature associations of the image and obtain internal correlation feature data. The text branch performs self-attention calculation on the text semantic features (text feature data) to enhance text semantic consistency and obtain enhanced text feature data. The self-attention module parameters of the image branch and the text branch are the same. The refined facial features (object identity feature data) are injected into the image branch features (internal correlation feature data) through AdaLN to obtain fused image feature data. The fused image feature data and the enhanced text feature data are then input into a feedforward network (including a multi-layer perceptron (MLP)) for nonlinear activation processing, residual processing, and normalization processing to obtain processed fused image feature data and processed enhanced text feature data. The MLP consists of two layers. The nonlinear activation processing is implemented using GELU (Gaussian Error Linear Unit). The residual processing adds the result calculated by each layer of the MLP to that layer's input data to obtain the output of each layer, which is then normalized with LayerNorm (layer normalization).
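The application does not give a formula for the AdaLN injection. One common formulation, used here purely as an assumption, normalizes each token and then applies a scale and shift regressed from the conditioning vector: `LayerNorm(x) * (1 + scale(c)) + shift(c)`, where `c` stands for the object identity feature data. The sketch below follows that form; the function names, weight matrices, and values are all hypothetical.

```python
import math


def layer_norm(vec, eps=1e-5):
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / math.sqrt(var + eps) for v in vec]


def linear(vec, weight, bias):
    return [b + sum(w * v for w, v in zip(row, vec)) for row, b in zip(weight, bias)]


def ada_ln(tokens, identity, w_scale, b_scale, w_shift, b_shift):
    """Modulate normalized tokens with a scale and shift derived from `identity`."""
    scale = linear(identity, w_scale, b_scale)
    shift = linear(identity, w_shift, b_shift)
    out = []
    for tok in tokens:
        normed = layer_norm(tok)
        out.append([n * (1.0 + s) + t for n, s, t in zip(normed, scale, shift)])
    return out


# Hypothetical sizes: one image token of dimension 4, identity vector of dimension 2.
tokens = [[1.0, 2.0, 3.0, 4.0]]
identity = [0.5, -0.5]
zeros_w = [[0.0, 0.0] for _ in range(4)]
zeros_b = [0.0] * 4

# With zero scale and shift, AdaLN reduces to plain layer normalization.
plain = ada_ln(tokens, identity, zeros_w, zeros_b, zeros_w, zeros_b)

# With a non-zero shift projection, the identity vector moves the token features.
w_shift = [[1.0, 0.0] for _ in range(4)]
modulated = ada_ln(tokens, identity, zeros_w, [1.0] * 4, w_shift, zeros_b)
```

Because the scale and shift depend only on the identity vector, the same identity signal is applied to every image token, which is one way to read the "injection" described above.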
[0064] S205: Perform concatenation processing.
[0065] For example, the processed fused image feature data is convolved (first convolution processing), and the result is concatenated with the processed enhanced text feature data to obtain cross-modal fused feature data.
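A minimal sketch of the concatenation step, assuming that the first convolution acts as a per-token projection that brings image features to the text feature width, and that concatenation is along the token axis. Both choices are assumptions; the text does not state the kernel size or the concatenation axis. The names `project_tokens` and `align_and_concat` are hypothetical.

```python
def project_tokens(tokens, weight, bias):
    """Per-token linear projection, equivalent to a 1x1 convolution over tokens."""
    return [
        [b + sum(w * v for w, v in zip(row, tok)) for row, b in zip(weight, bias)]
        for tok in tokens
    ]


def align_and_concat(image_tokens, text_tokens, weight, bias):
    first_image_features = project_tokens(image_tokens, weight, bias)
    if len(first_image_features[0]) != len(text_tokens[0]):
        raise ValueError("projected image width must match text feature width")
    # Concatenate along the token axis (assumed): image tokens, then text tokens.
    return first_image_features + text_tokens


# Hypothetical sizes: 2 image tokens of width 3, 1 text token of width 2.
image_tokens = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
text_tokens = [[0.1, 0.2]]
weight = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]
bias = [0.0, 0.0]
cross_modal = align_and_concat(image_tokens, text_tokens, weight, bias)
```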
[0066] S206: Perform deep feature fusion processing based on the DiT module.
[0067] For example, the DiT module includes N sub-modules, where N can be 12. Each sub-module performs self-attention calculation, adaptive layer normalization (AdaLN), feedforward network processing, residual processing, and normalization processing. Multiple sub-modules are connected in series. Cross-modal fusion feature data is input into the attention module to learn the global association between text and image features (text-image global association feature data). Then, refined facial features (object identity feature data) are injected into the attention output (text-image global association feature data) via AdaLN to strengthen the continuous influence of identity features, resulting in object identity enhancement feature data. This enhanced feature data is then input into a feedforward network (including a Multi-Layer Perceptron (MLP)) for nonlinear activation processing, residual processing, and normalization to obtain deep fusion feature data. The MLP consists of two layers. Nonlinear activation processing is implemented using GELU (Gaussian Error Linear Unit). Residual processing involves adding the processing result calculated by each layer of the MLP to the input data of each layer to obtain the output result of each layer, and then normalizing the output result of each layer.
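The feedforward part of a sub-module can be sketched as a two-layer MLP in which each layer applies GELU, adds the layer input back (residual processing), and then normalizes the result. The sketch assumes each layer keeps the feature width unchanged so that the residual addition is well defined, and it uses the exact erf-based GELU. The function names and the identity weights are hypothetical placeholders for learned parameters.

```python
import math


def gelu(x):
    """Exact GELU: x * Phi(x), with Phi the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def layer_norm(vec, eps=1e-5):
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / math.sqrt(var + eps) for v in vec]


def linear(vec, weight, bias):
    return [b + sum(w * v for w, v in zip(row, vec)) for row, b in zip(weight, bias)]


def mlp_block(vec, layers):
    """Run each (weight, bias) layer with GELU, residual addition, and LayerNorm."""
    out = vec
    for weight, bias in layers:
        activated = [gelu(v) for v in linear(out, weight, bias)]
        out = layer_norm([a + x for a, x in zip(activated, out)])
    return out


# Hypothetical two-layer MLP on a width-3 feature vector.
identity_w = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
layers = [(identity_w, [0.0, 0.0, 0.0]), (identity_w, [0.1, 0.1, 0.1])]
deep_fused = mlp_block([1.0, -2.0, 0.5], layers)
```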
[0068] S207: Input to the VAE decoder for decoding.
[0069] For example, the features (deeply fused feature data) output by the DiT module are compressed by a 1×1 convolution (second convolution processing) and restored to a latent feature map (target feature data) of the same size as the output of the VAE encoder; the target feature data is then input into the VAE decoder.
[0070] S208: Generate the image.
[0071] For example, the VAE decoder restores the features to the RGB image (target image) of the original input image (original image data) size.
[0072] In the above embodiments, the input consists of a portrait image (containing a clearly identifiable facial area) providing an identity reference and a natural language text description specifying style requirements (such as "a gentle portrait with a high ponytail, silver earrings, and a blue knit sweater"). Based on a two-stage facial feature injection and cross-modal fusion architecture, combined with an ID encoder-Q-Former refinement link, the output is a single new portrait image (target image) with the same resolution as the input reference image. This image can accurately retain the core identity features of the reference portrait (such as facial contours, facial proportions, and skin tone), while strictly matching the style details in the text description (such as hairstyle, accessories, clothing, and facial expression). The overall visual quality is clear, without obvious distortion or missing details, thus improving the quality of the generated image.
[0073] The image generation method and system proposed in this application achieve preliminary fusion of facial features and image features through AdaLN in the MM-DiT module, and then perform secondary enhancement injection through AdaLN in the DiT module, forming a progressive feature interaction of "preliminary fusion-deep enhancement". This solves the problem of inconsistent facial identity features and easy dilution by text semantics caused by traditional single-stage injection, and ensures the consistency of identity between the generated image and the original face image.
[0074] Figure 3 is a block diagram of an image generation system provided by another embodiment of this application.
[0075] As shown in Figure 3, another embodiment of this application provides an image generation system 300, which includes: an acquisition module 310, an encoding module 320, a first fusion module 330, a second fusion module 340, and a generation module 350.
[0076] The acquisition module 310 is used to acquire raw image data and text description data, and extract target object image data from the raw image data.
[0077] The encoding module 320 is used to encode the original image data and text description data respectively to obtain object latent feature data and text feature data, and to encode and perform attention calculation on the target object image data to obtain object identity feature data.
[0078] The first fusion module 330 is used to perform preliminary cross-modal feature fusion processing and feature alignment processing based on object latent feature data, text feature data and object identity feature data to obtain cross-modal fused feature data.
[0079] The second fusion module 340 is used to perform deep feature fusion processing based on cross-modal fusion feature data and object identity feature data to obtain deep fusion feature data.
[0080] The generation module 350 is used to generate a target image corresponding to the text description data and the target object image data based on the deep fusion feature data.
[0081] For example, the first fusion module 330 is further configured to perform self-attention calculation on the latent feature data of the object to obtain internal correlation feature data; perform self-attention calculation on the text feature data to obtain enhanced text feature data; perform cross-modal feature preliminary fusion processing based on the internal correlation feature data and the object identity feature data to obtain fused image feature data; perform nonlinear activation processing, residual processing and normalization processing on the fused image feature data and enhanced text feature data based on a multilayer perceptron to obtain processed fused image feature data and processed enhanced text feature data; and perform feature alignment processing based on the processed fused image feature data and processed enhanced text feature data to obtain cross-modal fused feature data.
[0082] For example, the first fusion module 330 is further configured to perform a first convolution process on the processed fused image feature data to obtain first image feature data; and to perform a concatenation process on the first image feature data and the processed enhanced text feature data to obtain cross-modal fused feature data.
[0083] For example, the second fusion module 340 is further configured to perform global self-attention calculation on the cross-modal fusion feature data to obtain text image global association feature data; perform attention enhancement calculation based on the text image global association feature data and object identity feature data to obtain object identity enhancement feature data; and perform nonlinear activation processing, residual processing and normalization processing on the object identity enhancement feature data based on a multilayer perceptron to obtain deep fusion feature data.
[0084] For example, the generation module 350 is further configured to perform a second convolution process on the deep fusion feature data to obtain target feature data; and to perform decoding processing on the target feature data to obtain a target image corresponding to the text description data and the target object image data.
[0085] For example, the encoding module 320 is further configured to perform identity encoding processing on the target object image data to obtain initial object identity feature data; and to perform self-attention and cross-attention enhancement calculations on the initial object identity feature data to obtain object identity feature data.
[0086] For example, the encoding module 320 is also used to perform self-attention calculation based on the query vector to obtain the target query vector, wherein the query vector is obtained by training the Q-Former model; and to perform cross-attention calculation based on the target query vector and the initial object identity feature data to obtain the object identity feature data.
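The data flow between the five modules of system 300 can be sketched as a small pipeline object. This is only a wiring sketch: each module is represented by a hypothetical callable, and the stub implementations below return placeholder strings instead of real tensors. The class name `ImageGenerationSystem` and the helper `_stub` are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class ImageGenerationSystem:
    acquire: Callable[..., Any]        # acquisition module 310
    encode: Callable[..., Any]         # encoding module 320
    first_fusion: Callable[..., Any]   # first fusion module 330
    second_fusion: Callable[..., Any]  # second fusion module 340
    generate: Callable[..., Any]       # generation module 350

    def run(self, raw_image, text):
        raw_image, text, target_object = self.acquire(raw_image, text)
        latent, text_feat, identity = self.encode(raw_image, text, target_object)
        cross_modal = self.first_fusion(latent, text_feat, identity)
        # The identity features are passed again for the second injection.
        deep = self.second_fusion(cross_modal, identity)
        return self.generate(deep)


trace = []


def _stub(name, result):
    """Build a placeholder module that records its call and returns `result`."""
    def fn(*args):
        trace.append(name)
        return result
    return fn


system = ImageGenerationSystem(
    acquire=_stub("acquire", ("img", "txt", "face")),
    encode=_stub("encode", ("latent", "text_feat", "identity")),
    first_fusion=_stub("first_fusion", "cross_modal"),
    second_fusion=_stub("second_fusion", "deep"),
    generate=_stub("generate", "target_image"),
)
result = system.run("img", "txt")
```

Note that the object identity feature data produced by the encoding module is consumed by both fusion modules, which mirrors the two-stage injection described in the method embodiments.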
[0087] Figure 4 is a block diagram of an electronic device provided by another embodiment of this application.
[0088] Another embodiment of this application provides an electronic device having a computer program stored thereon, which, when executed by a processor, implements the steps of the method of any of the above embodiments.
[0089] As shown in Figure 4, for ease of understanding, embodiments of this application illustrate a specific electronic device 400.
[0090] Electronic device 400 is intended to represent various forms of digital computers, such as laptop computers, desktop computers, workstations, personal digital assistants, servers, blade servers, mainframe computers, and other suitable computers. Electronic device 400 may also represent various forms of mobile devices, such as personal digital processors, cellular phones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are merely illustrative and are not intended to limit the implementation of the present disclosure described and/or claimed herein.
[0091] As shown in Figure 4, the electronic device 400 includes a computing unit 401, which can perform various appropriate actions and processes according to a computer program stored in a read-only memory (ROM) 402 or a computer program loaded from a storage unit 408 into a random access memory (RAM) 403. The RAM 403 may also store various programs and data required for the operation of the electronic device 400. The computing unit 401, ROM 402, and RAM 403 are interconnected via a bus 404. An input/output (I/O) interface 405 is also connected to the bus 404.
[0092] Multiple components in electronic device 400 are connected to the input/output (I/O) interface 405. These components include: an input unit 406, such as a keyboard or mouse; an output unit 407, such as various types of displays or speakers; a storage unit 408, such as a hard disk or optical disk; and a communication unit 409, such as a network interface card (NIC), modem, or wireless transceiver. The communication unit 409 allows electronic device 400 to exchange information/data with other devices through computer networks such as the Internet and/or various telecommunications networks.
[0093] The computing unit 401 can be a variety of general-purpose and/or special-purpose processing components with processing and computing capabilities. Some examples of the computing unit 401 include, but are not limited to, a central processing unit (CPU), a graphics processing unit (GPU), various special-purpose artificial intelligence (AI) computing chips, various computing units running machine learning model algorithms, a digital signal processor (DSP), and any suitable processor, controller, microcontroller, etc. The computing unit 401 performs the various methods described above. For example, in some embodiments, any one or more of the various methods described above can be implemented as a computer software program, which is tangibly contained in a machine-readable medium, such as storage unit 408. In some embodiments, part or all of the computer program can be loaded and/or installed on the electronic device 400 via ROM 402 and/or communication unit 409. When the computer program is loaded into RAM 403 and executed by the computing unit 401, one or more steps of any one or more of the various methods described above can be performed. Alternatively, in other embodiments, the computing unit 401 can be configured to perform any one or more of the various methods described above by any other suitable means (e.g., by means of firmware).
[0094] This application provides a computer-readable storage medium storing a computer program thereon, which, when executed by a processor, implements the steps of the method in any of the above embodiments.
[0095] It should be noted that the logic and/or steps represented in the flowchart or otherwise described herein can, for example, be considered a sequenced list of executable instructions for implementing logical functions, and can be embodied in any computer-readable medium for use by, or in conjunction with, an instruction execution system, apparatus, or device (such as a computer-based system, a system including a processor, or another system that can fetch instructions from an instruction execution system, apparatus, or device and execute them). For the purposes of this application, a "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transmit programs for use by, or in conjunction with, an instruction execution system, apparatus, or device. More specific examples (a non-exhaustive list) of computer-readable media include: electrical connections (electronic devices) having one or more wires, portable computer disks (magnetic devices), random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), fiber optic devices, and portable compact disc read-only memory (CD-ROM). Furthermore, the computer-readable medium can even be paper or another suitable medium on which the program is printed, because the program can be obtained electronically, for example by optically scanning the paper or other medium and then editing, interpreting, or otherwise processing it as necessary, and then stored in computer memory.
[0096] It should be understood that various parts of this application can be implemented using hardware, software, firmware, or a combination thereof. In the above embodiments, multiple steps or methods can be implemented using software or firmware stored in memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, it can be implemented using any one or a combination of the following techniques known in the art: discrete logic circuits having logic gates for implementing logical functions on data signals, application-specific integrated circuits (ASICs) having suitable combinational logic gates, programmable gate arrays (PGAs), field-programmable gate arrays (FPGAs), etc.
[0097] In the description of this application, the references to terms such as "one embodiment," "some embodiments," "example," "specific example," or "some examples," etc., indicate that a specific feature, structure, material, or characteristic described in connection with that embodiment or example is included in at least one embodiment or example of this application. In this application, the illustrative expressions of the above terms do not necessarily refer to the same embodiment or example. Furthermore, the specific features, structures, materials, or characteristics described may be combined in any suitable manner in one or more embodiments or examples.
[0098] In the description of this application, it should be understood that the terms "center", "longitudinal", "lateral", "length", "width", "thickness", "upper", "lower", "front", "rear", "left", "right", "vertical", "horizontal", "top", "bottom", "inner", "outer", "clockwise", "counterclockwise", "axial", "radial", "circumferential", etc., indicating the orientation or positional relationship based on the orientation or positional relationship shown in the accompanying drawings, are only for the convenience of describing this application and simplifying the description, and do not indicate or imply that the device or element referred to must have a specific orientation, or be constructed and operated in a specific orientation, and therefore should not be construed as a limitation of this application.
[0099] Furthermore, the terms "first," "second," etc., used in the embodiments of this application are for descriptive purposes only and should not be construed as indicating or implying relative importance, or implicitly specifying the number of technical features indicated in this embodiment. Therefore, features defined with terms such as "first" and "second" in the embodiments of this application can explicitly or implicitly indicate that the embodiment includes at least one of those features. In the description of this application, the word "multiple" means at least two or more, such as two, three, four, etc., unless otherwise explicitly and specifically defined in the embodiments.
[0100] In this application, unless otherwise explicitly specified or limited in the embodiments, the terms "installation," "connection," "joining," and "fixing" appearing in the embodiments should be interpreted broadly. For example, a connection can be a fixed connection, a detachable connection, or an integral part; it can also be a mechanical connection, an electrical connection, etc. Of course, it can also be a direct connection, or an indirect connection through an intermediate medium, or it can be the internal communication between two components, or the interaction between two components. Those skilled in the art can understand the specific meaning of the above terms in this application based on the specific implementation.
[0101] In this application, unless otherwise expressly specified and limited, a first feature being "above" or "below" a second feature can mean that the first feature is in direct contact with the second feature, or that the first feature is in indirect contact with the second feature through an intermediate medium. Furthermore, a first feature being "above," "on top of," or "over" a second feature can mean that the first feature is directly above or diagonally above the second feature, or simply that the first feature is at a higher level than the second feature. A first feature being "below," "beneath," or "under" a second feature can mean that the first feature is directly below or diagonally below the second feature, or simply that the first feature is at a lower level than the second feature.
[0102] Although embodiments of this application have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting this application. Those skilled in the art can make changes, modifications, substitutions and variations to the above embodiments within the scope of this application.
Claims
1. An image generation method, characterized in that, The method includes: Acquire raw image data and text description data, and extract target object image data from the raw image data; The original image data and the text description data are encoded to obtain object latent feature data and text feature data, respectively; and the target object image data is encoded and attention is calculated to obtain object identity feature data. Based on the object's latent feature data, the text feature data, and the object's identity feature data, preliminary cross-modal feature fusion processing and feature alignment processing are performed to obtain cross-modal fused feature data; Based on the cross-modal fusion feature data and the object identity feature data, deep feature fusion processing is performed to obtain deep fusion feature data; Based on the deep fusion feature data, a target image corresponding to the text description data and the target object image data is generated.
2. The method according to claim 1, characterized in that, The process of performing preliminary cross-modal feature fusion and feature alignment based on the object's latent feature data, text feature data, and object identity feature data to obtain cross-modal fused feature data includes: Self-attention calculation is performed on the latent feature data of the object to obtain internal correlation feature data; Self-attention calculation is performed on the text feature data to obtain enhanced text feature data; Based on the internal correlation feature data and the object identity feature data, a preliminary cross-modal feature fusion process is performed to obtain fused image feature data; The fused image feature data and the enhanced text feature data are subjected to nonlinear activation processing, residual processing and normalization processing based on a multilayer perceptron to obtain the processed fused image feature data and the processed enhanced text feature data. Based on the processed fused image feature data and the processed enhanced text feature data, feature alignment processing is performed to obtain cross-modal fused feature data.
3. The method according to claim 2, characterized in that, The feature alignment process based on the processed fused image feature data and the processed enhanced text feature data to obtain cross-modal fused feature data includes: The processed fused image feature data is subjected to a first convolution process to obtain the first image feature data; Cross-modal fusion feature data is obtained by concatenating the first image feature data and the processed enhanced text feature data.
4. The method according to claim 1, characterized in that, The deep feature fusion processing based on the cross-modal fusion feature data and the object identity feature data to obtain deep fusion feature data includes: Global self-attention calculation is performed on the cross-modal fusion feature data to obtain text-image global association feature data; Attention enhancement calculation is performed based on the text-image global association feature data and the object identity feature data to obtain object identity enhancement feature data; Based on a multilayer perceptron, nonlinear activation processing, residual processing, and normalization processing are performed on the object identity enhancement feature data to obtain deep fusion feature data.
5. The method according to claim 1, characterized in that, The step of generating a target image corresponding to the text description data and the target object image data based on the deep fusion feature data includes: The deep fusion feature data is subjected to a second convolution process to obtain the target feature data; The target feature data is decoded to obtain a target image corresponding to the text description data and the target object image data.
6. The method according to claim 1, characterized in that, The process of encoding and attention calculation on the target object image data to obtain object identity feature data includes: The target object image data is subjected to identity encoding processing to obtain initial object identity feature data; Self-attention and cross-attention enhancement calculations are performed on the initial object identity feature data to obtain object identity feature data.
7. The method according to claim 6, characterized in that, The step of performing self-attention and cross-attention enhancement calculations on the initial object identity feature data to obtain object identity feature data includes: Self-attention is calculated based on the query vector to obtain the target query vector, wherein the query vector is obtained by training the Q-Former model; Cross-attention calculation is performed based on the target query vector and the initial object identity feature data to obtain the object identity feature data.
8. An image generation system, characterized in that, The system includes: The acquisition module is used to acquire raw image data and text description data, and extract target object image data from the raw image data; The encoding module is used to encode the original image data and the text description data respectively to obtain object latent feature data and text feature data, and to encode and perform attention calculation on the target object image data to obtain object identity feature data. The first fusion module is used to perform preliminary cross-modal feature fusion processing and feature alignment processing based on the object's latent feature data, the text feature data, and the object's identity feature data to obtain cross-modal fused feature data; The second fusion module is used to perform deep feature fusion processing based on the cross-modal fusion feature data and the object identity feature data to obtain deep fusion feature data; The generation module is used to generate a target image corresponding to the text description data and the target object image data based on the deep fusion feature data.
9. An electronic device having a computer program stored thereon, characterized in that, When the computer program is executed by a processor, it implements the method described in any one of claims 1-7.
10. A computer-readable storage medium having a computer program stored thereon, characterized in that, When the computer program is executed by a processor, it implements the method described in any one of claims 1-7.