Text-to-image method and apparatus, and electronic device and storage medium

By replacing the attention structures in the U-shaped neural network model with a structure of linear complexity and applying model distillation, the method addresses the inability of existing text-to-image models to generate high-resolution images, enabling efficient generation of images with resolutions greater than 2k.

WO2026123825A1 · PCT designated stage · Publication Date: 2026-06-18 · CHINA TELECOM ARTIFICIAL INTELLIGENCE TECHNOLOGY (BEIJING) CO LTD

Patent Information

Authority / Receiving Office
WO · WO
Patent Type
Applications
Current Assignee / Owner
CHINA TELECOM ARTIFICIAL INTELLIGENCE TECHNOLOGY (BEIJING) CO LTD
Filing Date
2025-09-08
Publication Date
2026-06-18

AI Technical Summary

Technical Problem

Existing text-to-image models cannot generate images with a resolution greater than 2k under limited video memory because of the high computational complexity of the attention module.

Method used

Multiple attention structures are replaced with a target model structure with linear complexity, and the previously learned knowledge is transferred to the student model through model distillation to generate the target U-shaped neural network model.

🎯Benefits of technology

It enables the generation of images with resolutions greater than 2k under limited video memory conditions, improving the efficiency and performance of the model.

✦ Generated by Eureka AI based on patent content.

Smart Images

  • Figure CN2025119623_18062026_PF_FP_ABST
Patent Text Reader

Abstract

Provided in the embodiments of the present application are a text-to-image method and apparatus, and an electronic device and a storage medium. The method comprises: acquiring an initial U-shaped neural network model for performing image denoising processing during a text-to-image process, wherein the initial U-shaped neural network model comprises a plurality of groups of attention structures; replacing the plurality of groups of attention structures with a target model structure having linear complexity; on the basis of the initial U-shaped neural network model, performing model distillation on the U-shaped neural network model which has been subjected to structure replacement, so as to obtain a target U-shaped neural network model; and on the basis of the target U-shaped neural network model, performing text-to-image image denoising, so as to obtain a target image. By means of the embodiments of the present application, a plurality of groups of attention structures are replaced with a target model structure having linear complexity, and the computational time complexity of an attention module is reduced to linear complexity, so as to obtain a target U-shaped neural network model, and thus an image at a resolution of greater than 2k can be generated by means of text-to-image generation.

Description

Text-to-image method and apparatus, electronic device, and storage medium

[0001] This application claims priority to Chinese Patent Application No. 2024118055360, filed on December 9, 2024, entitled "A Method and Apparatus for Creating Graphics, Electronic Equipment, and Storage Medium", the entire contents of which are incorporated herein by reference.

Technical Field

[0002] This application relates to the field of image generation technology, and in particular to a text-to-image method and apparatus, electronic device, and storage medium.

Background Technology

[0003] Text-to-image generation is a task that generates images from natural language text. The goal is for a model to generate an image that matches the input description. This task requires the model not only to recognize specific objects and scenes in the text, but also to grasp abstract features such as style and emotion, and to transform textual information into latent visual features from which the corresponding image is generated. Text-to-image models can help designers quickly generate sketches and design concepts, and have broad application prospects in fields such as game and film production, advertising and marketing, education, and scientific research.

[0004] Current image denoising processes typically employ the Unet model, which includes multiple attention structures. Because of the computational time complexity of the attention module, images with a resolution greater than 2k cannot be generated with limited video memory.

Summary of the Invention

[0005] In view of the above problems, a text-to-image method and apparatus, an electronic device, and a storage medium are proposed to overcome or at least partially solve the above problems, including:

[0006] A text-to-image method, the method comprising:

[0007] An initial U-shaped neural network model is obtained for image denoising during the text-to-image process, the initial U-shaped neural network model including multiple attention structures;

[0008] Replace the multiple attention structures with a target model structure that has linear complexity;

[0009] Based on the initial U-shaped neural network model, model distillation is performed on the U-shaped neural network model after structure replacement to obtain the target U-shaped neural network model;

[0010] Text-to-image image denoising is performed according to the target U-shaped neural network model to obtain the target image.

[0011] In some embodiments, performing model distillation on the U-shaped neural network model after structure replacement, based on the initial U-shaped neural network model, includes:

[0012] Obtain the training sample data corresponding to the initial U-shaped neural network model and the first noise estimation data corresponding to the training sample data;

[0013] The training sample data is input into the U-shaped neural network model after structure replacement to obtain the second noise estimation data;

[0014] Using the first noise estimation data as the expected value for image denoising of the target U-shaped neural network model, the model parameters of the U-shaped neural network model after structure replacement are adjusted based on the first noise estimation data and the second noise estimation data to obtain the target U-shaped neural network model.

[0015] In some embodiments, the step of using the first noise estimation data as the expected value of image denoising for the target U-shaped neural network model, and adjusting the model parameters of the U-shaped neural network model after structure replacement based on the first noise estimation data and the second noise estimation data to obtain the target U-shaped neural network model, includes:

[0016] A loss function is constructed for the first noise estimation data and the second noise estimation data, using the first noise estimation data as the expected value of image denoising for the target U-shaped neural network model;

[0017] Based on the loss function, the model parameters of the U-shaped neural network model after structure replacement are adjusted to obtain the target U-shaped neural network model.

[0018] In some embodiments, the loss function is the square of the L2 norm of the difference between the second noise estimation data and the first noise estimation data.

[0019] In some embodiments, performing text-to-image image denoising according to the target U-shaped neural network model includes:

[0020] A pre-trained variational autoencoder is used to convert the initial image into latent space features;

[0021] Gaussian noise is added to the latent space features to obtain latent space noise features;

[0022] Obtain text information input by the user and encode the text information into text features;

[0023] The text features and the latent space noise features are input into the target U-shaped neural network model for image denoising, and the denoised features are output.

[0024] The target image is generated based on the denoising features.

[0025] In some embodiments, encoding the text information into text features includes:

[0026] Obtain the target language model for text encoding;

[0027] The text information is input into the target language model, and an encoder in the target language model is used to encode the text to obtain text features.

[0028] In some embodiments, the target model structure is a Mamba2 structure based on a state-space model.

[0029] In some embodiments, adjusting the model parameters of the U-shaped neural network model after the structure replacement based on the loss function includes:

[0030] Set the threshold for the loss function;

[0031] When the loss function value of the U-shaped neural network model after structure replacement is within the loss function threshold range, model distillation is stopped, and the current model is determined to be the target U-shaped neural network model.

[0032] In some embodiments, the multiple attention structures include self-attention modules and cross-attention modules.

[0033] A text-to-image apparatus, the apparatus comprising:

[0034] The initial U-shaped neural network model acquisition module is configured to acquire an initial U-shaped neural network model for image denoising during the text-to-image process, wherein the initial U-shaped neural network model includes multiple attention structures;

[0035] The structure replacement module is configured to replace the multiple sets of attention structures with a target model structure having linear complexity;

[0036] The model distillation module is configured to perform model distillation on the U-shaped neural network model after structure replacement, based on the initial U-shaped neural network model, to obtain the target U-shaped neural network model;

[0037] The target image generation module is configured to perform text-to-image image denoising according to the target U-shaped neural network model to obtain the target image.

[0038] An electronic device includes a processor, a memory, and a computer program stored in the memory and capable of running on the processor, wherein the computer program, when executed by the processor, implements the text-to-image method as described above.

[0039] A computer-readable storage medium storing a computer program that, when executed by a processor, implements the text-to-image method as described above.

[0040] The embodiments of this application have the following advantages:

[0041] In this embodiment, an initial U-shaped neural network model for image denoising during the text-to-image process is obtained, the initial U-shaped neural network model including multiple attention structures. The multiple attention structures are replaced with a target model structure having linear complexity. Based on the initial U-shaped neural network model, model distillation is performed on the U-shaped neural network model after structure replacement to obtain a target U-shaped neural network model. Text-to-image image denoising is then performed according to the target U-shaped neural network model to obtain the target image. Because the multiple attention structures are replaced with a target model structure having linear complexity, the computational time complexity of the attention module is reduced to linear complexity, so that the text-to-image process can generate images with a resolution greater than 2k. Furthermore, generating the target U-shaped neural network model through model distillation effectively utilizes previously learned knowledge, improving model efficiency and performance.

Attached Figure Description

[0042] To more clearly illustrate the technical solution of this application, the drawings used in the description of this application will be briefly introduced below. Obviously, the drawings described below are only some embodiments of this application. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0043] Figure 1 is a flowchart of a text-to-image method provided in an embodiment of this application;

[0044] Figure 2 is a flowchart of another text-to-image method provided in an embodiment of this application;

[0045] Figure 3 is a flowchart of another text-to-image method provided in an embodiment of this application;

[0046] Figure 4 is a schematic flow diagram of text-to-image generation according to an embodiment of this application;

[0047] Figure 5 is a schematic diagram of the structure of a text-to-image apparatus provided in an embodiment of this application.

Detailed Implementation

[0048] To make the above-mentioned objectives, features, and advantages of this application more apparent and understandable, the application will be further described in detail below with reference to the accompanying drawings and specific embodiments. Obviously, the described embodiments are only some, not all, of the embodiments of this application. All other embodiments obtained by those skilled in the art based on the embodiments of this application without inventive effort are within the scope of protection of this application.

[0049] Referring to Figure 1, a flowchart of a text-to-image generation method according to an embodiment of this application is shown, which may specifically include the following steps:

[0050] Step S101: Obtain an initial U-shaped neural network model for image denoising during the text-to-image process. The initial U-shaped neural network model includes multiple attention structures.

[0051] In practical applications, text-to-image processing can include stages such as image feature extraction, adding noise to image features, denoising image features, and decoding to generate the final image. Among these, a U-shaped neural network model, also known as the Unet model, can be used for image denoising. The Unet model employs a symmetrical U-shaped architecture and skip connections, effectively capturing both global and local image features. Current text-to-image denoising often uses an initial U-shaped neural network model that includes multiple attention structures, each consisting of a self-attention block and a cross-attention block. Both the self-attention and cross-attention blocks have high computational time complexity, so under limited video memory it is difficult to generate images with resolutions greater than 2k.

[0052] In this embodiment, the initial U-shaped neural network model is a fully trained Unet model used in the text-to-image process. Specifically, the Unet model mainly includes self-attention blocks and cross-attention blocks; one self-attention block and one cross-attention block form a group, and N groups in total constitute the Unet network θ. This network is trained with the noise-prediction loss $L_{noise}$:

$$L_{noise} = \mathbb{E}_{z_t,\,\epsilon,\,t,\,y}\left[\left\| \epsilon - \epsilon_\theta(z_t, t, y) \right\|_2^2\right]$$

[0053] Here, $L_{noise}$ is the noise-prediction loss, $\epsilon$ is the Gaussian noise added in one step of the text-to-image process, $z_t$ is the latent space noise features in the Unet model, $t$ is the number of times Gaussian noise has been added, $y$ is the text features used during image denoising, and $\epsilon_\theta(z_t, t, y)$ is the noise estimated by the Unet model for the latent space noise features $z_t$ when noise has been added $t$ times and the text feature $y$ is input.

[0054] In the Unet network, the self-attention module and the cross-attention module have the same structure. The difference between the self-attention module and the cross-attention module is that the self-attention module only uses image features z to mine feature self-attention information, while the cross-attention module uses image features z and text features y to mine cross-attention information.

[0055] Assume the image features and text features are $X \in \mathbb{R}^{n \times d}$ and the weight parameters are $W_Q, W_K \in \mathbb{R}^{d \times d'}$ and $W_V$, where $n$ is the number of tokens, $d$ is the feature dimension, and $d'$ is the attention dimension. Then

[0056] $$Q = XW_Q, \quad K = XW_K, \quad V = XW_V, \quad M = \mathrm{softmax}\left(QK^\top / \sqrt{d'}\right),$$

[0057] $$Y = MV$$

[0058] where $M$ is the attention map. The above formula represents a self-attention module when both Q and K come from image features, and a cross-attention module when Q comes from image features and K comes from text features.
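The attention computation described above can be sketched in plain Python. This is an illustrative sketch only, not the applicant's implementation: the function names are hypothetical, the softmax scaling by the square root of the attention dimension is the conventional choice, and taking V from the same source as K in the cross-attention case is an assumption.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax_rows(m):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    out = []
    for row in m:
        top = max(row)
        exps = [math.exp(v - top) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def attention(x_query, x_context, w_q, w_k, w_v):
    """Return (Y, M) with Y = M V and M = softmax(Q K^T / sqrt(d')).

    Self-attention: x_context is the same image features as x_query.
    Cross-attention: x_context holds text features (assumed source of K and V).
    """
    q = matmul(x_query, w_q)
    k = matmul(x_context, w_k)
    v = matmul(x_context, w_v)
    d_attn = len(w_q[0])
    k_t = [list(col) for col in zip(*k)]
    scores = [[s / math.sqrt(d_attn) for s in row] for row in matmul(q, k_t)]
    m = softmax_rows(scores)  # attention map, one row per query token
    return matmul(m, v), m

# Toy data: 3 image tokens and 2 text tokens with feature dimension 2.
image_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
text_tokens = [[0.5, -0.5], [2.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]  # placeholder weights, not learned values

y_self, m_self = attention(image_tokens, image_tokens, identity, identity, identity)
y_cross, m_cross = attention(image_tokens, text_tokens, identity, identity, identity)
```

In the self-attention case the attention map has one row and one column per image token, which is the source of the memory pressure at high resolution.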

[0059] The Unet model is trained using hundreds of millions of image-text pairs, which enables the Unet model to accurately convert noise into latent space features of the image based on text information. As a result, the complete image can be reconstructed based solely on noise and text features during the inference stage.

[0060] In the Unet model, the way Q (Query), K (Key), and V (Value) are calculated causes the computation of a single attention group to grow quadratically with the number of tokens, because the attention map relates every token to every other token. At the same time, the use of multiple attention structures in the Unet model results in excessively high data dimensionality, which is not conducive to computation in limited GPU memory.
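A rough back-of-the-envelope calculation illustrates why the attention map becomes the bottleneck. The numbers below rest on assumptions that are not stated in the application: a latent space downsampled by a factor of 8 per side, one token per latent position, and 2 bytes per attention-map entry for a single head.

```python
def attention_map_entries(resolution, downsample=8):
    """Return (token count n, n * n attention-map entries) for a square image.

    `downsample` is a hypothetical latent downsampling factor per side.
    """
    side = resolution // downsample
    n = side * side
    return n, n * n

BYTES_PER_ENTRY = 2  # assumed half-precision storage

for res in (512, 1024, 2048, 4096):
    n, entries = attention_map_entries(res)
    gib = entries * BYTES_PER_ENTRY / 2**30
    print(f"{res}px: {n} tokens, {entries} map entries, about {gib:.2f} GiB per head")
```

Under these assumptions, doubling the image side length multiplies the token count by 4 and the attention-map size by 16.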

[0061] Step S102: Replace multiple attention structures with the target model structure that has linear complexity;

[0062] After obtaining the initial U-shaped neural network model, multiple attention structures in the initial U-shaped neural network model can be improved. Specifically, multiple attention structures can be replaced by the target model structure with linear complexity, thereby reducing the computational time complexity of the attention module to linear, so as to generate higher resolution images on the basis of limited video memory.
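The replacement step can be pictured as a pass over the model's blocks. The sketch below is schematic only: `Block`, its `kind` labels, and `replace_attention` are hypothetical names, and real blocks would carry weights rather than a label. The `trainable` flag anticipates adjusting only the replaced parts during distillation.

```python
from dataclasses import dataclass

@dataclass
class Block:
    kind: str                 # e.g. "conv", "self_attention", "cross_attention"
    trainable: bool = False   # whether distillation may adjust this block

ATTENTION_KINDS = {"self_attention", "cross_attention"}

def replace_attention(blocks, replacement_kind="linear_ssm"):
    """Return a new block list in which every attention block is swapped out."""
    replaced = []
    for block in blocks:
        if block.kind in ATTENTION_KINDS:
            # New linear-complexity block: its parameters still need to be learned.
            replaced.append(Block(kind=replacement_kind, trainable=True))
        else:
            # Untouched block: keep it as it was and leave it frozen.
            replaced.append(Block(kind=block.kind, trainable=False))
    return replaced

initial_unet = [
    Block("conv"), Block("self_attention"), Block("cross_attention"),
    Block("conv"), Block("self_attention"), Block("cross_attention"),
]
student_unet = replace_attention(initial_unet)
```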

[0063] The target model structure is a Mamba2 structure based on the state-space model. A state-space model (SSM) is a mathematical model used to describe dynamic systems, helping to analyze and predict time series data by explaining the relationship between observed variables and hidden state variables. The Mamba2 structure is a sequence reasoning model with linear complexity, designed to efficiently process long sequence data. Mamba2's design goal is to achieve linear time complexity, thus providing a significant computational advantage when processing large-scale sequence data. Mamba2 achieves this goal by incorporating a state-space model.

[0064] In one embodiment of this application, the state-space model (SSM) can be expressed in the following form:

$$H_i = A_i \odot H_{i-1} + B_i^\top X_i, \quad Y_i = C_i H_i,$$

[0065] where $i$ is the index of the current token in the sequence, $H$ is the hidden state, $A_i$, $B_i$, and $C_i$ are input-dependent variables, and $\odot$ denotes element-wise multiplication. The formula means the following: $H_i$ is the hidden state at the current moment, obtained by updating the hidden state $H_{i-1}$ from the previous moment with the current input $X_i$; $A_i$ and $B_i$ are parameters related to the current time step that adjust how the hidden state is updated; $Y_i$ is the output at the current moment, calculated from the current hidden state $H_i$ and the parameter $C_i$.

[0066] Therefore, the above formula can be expressed as:

$$Y = \left( (C B^\top) \odot \tilde{A} \right) X$$

[0067] where $\tilde{A}$ is an $n \times n$ lower triangular matrix. This structure is Mamba2, which has linear complexity: by combining the product of $C$ and the transpose of $B$ with $\tilde{A}$, the computation of the state-space model can be simplified to linear complexity. The elements of $\tilde{A}$ are $\tilde{A}_{ij} = \prod_{k=j+1}^{i} A_k$ for $j \le i$, that is, the cumulative product of $A_k$.
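The relationship between the token-by-token state update and the lower-triangular matrix form can be checked numerically. The sketch below is an assumption-laden illustration rather than the Mamba2 implementation: it takes each A_i to be a scalar, B_i and C_i to be row vectors, and X_i to be a row of input features, and the function names are hypothetical. The matrix form is materialized explicitly here only for comparison, so this particular function is not itself linear-time; the recurrent form is the one whose cost grows linearly with sequence length.

```python
import random

def ssm_recurrent(a, b, c, x):
    """Token-by-token form: H_i = A_i * H_{i-1} + B_i^T X_i, then Y_i = C_i H_i."""
    d_state, d_feat = len(b[0]), len(x[0])
    h = [[0.0] * d_feat for _ in range(d_state)]
    ys = []
    for a_i, b_i, c_i, x_i in zip(a, b, c, x):
        h = [[a_i * h[r][col] + b_i[r] * x_i[col] for col in range(d_feat)]
             for r in range(d_state)]
        ys.append([sum(c_i[r] * h[r][col] for r in range(d_state))
                   for col in range(d_feat)])
    return ys

def ssm_matrix(a, b, c, x):
    """Matrix form: Y = ((C B^T) * A_tilde) X, elementwise product,
    with A_tilde[i][j] = product of A_k for k = j+1..i when j <= i, else 0."""
    n, d_feat = len(x), len(x[0])
    ys = []
    for i in range(n):
        row = [0.0] * d_feat
        decay = 1.0  # A_tilde[i][i] is an empty product
        for j in range(i, -1, -1):
            cb = sum(ck * bk for ck, bk in zip(c[i], b[j]))
            for col in range(d_feat):
                row[col] += cb * decay * x[j][col]
            decay *= a[j]  # extend the cumulative product for the next smaller j
        ys.append(row)
    return ys

rng = random.Random(0)
n, d_state, d_feat = 6, 3, 4
a = [rng.uniform(0.5, 1.0) for _ in range(n)]
b = [[rng.gauss(0, 1) for _ in range(d_state)] for _ in range(n)]
c = [[rng.gauss(0, 1) for _ in range(d_state)] for _ in range(n)]
x = [[rng.gauss(0, 1) for _ in range(d_feat)] for _ in range(n)]

y_rec = ssm_recurrent(a, b, c, x)
y_mat = ssm_matrix(a, b, c, x)
max_gap = max(abs(p - q) for r1, r2 in zip(y_rec, y_mat) for p, q in zip(r1, r2))
```

The recurrent form keeps only a fixed-size hidden state per step, which is why its memory does not grow with the square of the token count.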

[0068] Step S103: Based on the initial U-shaped neural network model, perform model distillation on the U-shaped neural network model after the structure replacement to obtain the target U-shaped neural network model;

[0069] After structural replacement, the replaced U-shaped neural network model may not achieve the desired image denoising effect. Therefore, model distillation can be performed on the replaced U-shaped neural network model with the image denoising effect corresponding to the pre-trained initial U-shaped neural network model as the target. Model distillation is a technique for model compression and acceleration. It transfers the knowledge of a complex, large, and well-trained teacher model (achieving high accuracy on the target task) to a simpler, smaller student model. This allows the student model to maintain high performance while having lower computational complexity and faster inference speed. During model distillation, the student model is typically smaller and has fewer parameters than the teacher model, thus reducing the model's storage space and computational resource requirements. Simultaneously, due to its smaller size and faster inference speed, the student model is suitable for running on resource-constrained devices (such as mobile devices and embedded systems). Furthermore, by learning the knowledge of the teacher model, the student model can achieve model compression and acceleration while maintaining high performance.

[0070] In this embodiment, the initial U-shaped neural network model is the teacher model, and the U-shaped neural network model after structure replacement is the student model. Distillation transfers the knowledge of the original Unet into the Unet whose attention structures have been replaced with the target model structure, thus effectively utilizing previously learned knowledge. Furthermore, after large-scale distillation, the new Unet structure θ′ possesses the same denoising capability as the original Unet structure θ, while also exhibiting linear time complexity, making it suitable for generating ultra-high resolution images.

[0071] Step S104: Perform text-to-image image denoising according to the target U-shaped neural network model to obtain the target image.

[0072] After the target U-shaped neural network model is generated, it can be applied to the text-to-image process in place of the initial U-shaped neural network model for image denoising, thereby obtaining a high-resolution target image with a resolution greater than 2k.

[0073] In this embodiment, an initial U-shaped neural network model for image denoising during the text-to-image process is obtained, the initial U-shaped neural network model including multiple attention structures. The multiple attention structures are replaced with a target model structure having linear complexity. Based on the initial U-shaped neural network model, model distillation is performed on the U-shaped neural network model after structure replacement to obtain a target U-shaped neural network model. Text-to-image image denoising is then performed according to the target U-shaped neural network model to obtain the target image. Replacing the multiple attention structures with a target model structure having linear complexity reduces the computational time complexity of the attention module to linear complexity, so that the text-to-image process can generate images with a resolution greater than 2k. Furthermore, generating the target U-shaped neural network model through model distillation effectively utilizes previously learned knowledge, improving model efficiency and performance.

[0074] Referring to Figure 2, a flowchart of another text-to-image method provided in an embodiment of this application is shown, which may specifically include the following steps:

[0075] Step S201: Obtain an initial U-shaped neural network model for image denoising during the text-to-image process. The initial U-shaped neural network model includes multiple attention structures.

[0076] Step S202: Replace multiple attention structures with the target model structure that has linear complexity;

[0077] Step S203: Obtain the training sample data corresponding to the initial U-shaped neural network model and the first noise estimation data corresponding to the training samples;

[0078] The training sample data for the initial U-shaped neural network model may include image features, noise, the number of times noise is added, and text features.

[0079] Step S204: Input the training sample data into the U-shaped neural network model after the structure replacement to obtain the second noise estimation data;

[0080] Following the training process of the initial U-shaped neural network model, the training sample data is input into the U-shaped neural network model after the structure is replaced to perform noise estimation, thereby obtaining the second noise estimation data.

[0081] Step S205: Using the first noise estimation data as the expected value of image denoising for the target U-shaped neural network model, the model parameters of the U-shaped neural network model after structure replacement are adjusted based on the first noise estimation data and the second noise estimation data to obtain the target U-shaped neural network model.

[0082] In practical applications, since the initial U-shaped neural network model is a pre-trained and usable model, its image denoising capability meets the requirements of text-to-image generation. Therefore, the first noise estimation data can be used as the expected value of the image denoising process of the target U-shaped neural network model to evaluate the denoising capability of the U-shaped neural network model after structure replacement. Then, by modifying the model parameters in the target model structure, the denoising capability of the U-shaped neural network model after structure replacement can be optimized until it reaches the denoising capability of the initial U-shaped neural network model.

[0083] In this embodiment of the application, the parameters of the unreplaced part of the U-shaped neural network model are maintained as before, and the model parameters are adjusted only for the target model structure of the replaced part, which can reduce the amount of model parameter modification. In one embodiment of this application, step S205 may include the following sub-steps:

[0084] Sub-step S11: Construct a loss function for the first noise estimation data and the second noise estimation data using the first noise estimation data as the expected value of the image denoising of the target U-shaped neural network model;

[0085] The loss function is the square of the L2 norm of the difference between the second noise estimation data and the first noise estimation data.

[0086] That is, the loss function is

$$L_{kd} = \left\| \epsilon_{\theta'}(z_t, t, y) - \epsilon_\theta(z_t, t, y) \right\|_2^2$$

[0087] where $L_{kd}$ denotes the loss function, $\epsilon_{\theta'}(z_t, t, y)$ represents the second noise estimation data of the U-shaped neural network model after structure replacement under the training sample data $(z_t, t, y)$, and $\epsilon_\theta(z_t, t, y)$ represents the first noise estimation data of the initial U-shaped neural network model under the training sample data $(z_t, t, y)$.
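The distillation loss described above, the square of the L2 norm of the difference between the second and first noise estimation data, can be written as a small helper. This is a minimal sketch that assumes both noise estimates have been flattened into equal-length lists of floats; the function name `kd_loss` is hypothetical.

```python
def kd_loss(second_noise, first_noise):
    """Squared L2 norm of (second noise estimate - first noise estimate).

    second_noise: output of the model after structure replacement (student).
    first_noise: output of the initial model (teacher), used as the target.
    """
    if len(second_noise) != len(first_noise):
        raise ValueError("noise estimates must have the same length")
    return sum((s - f) ** 2 for s, f in zip(second_noise, first_noise))

example_loss = kd_loss([1.0, 2.0], [0.0, 0.0])
```

The loss is zero exactly when the two estimates coincide, which is the condition the parameter adjustment drives toward.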

[0088] Sub-step S12: Based on the loss function, adjust the model parameters of the U-shaped neural network model after the structure replacement to obtain the target U-shaped neural network model.

[0089] The model parameters of the U-shaped neural network model after the structure replacement are adjusted based on the value calculated by the loss function, so that the second noise estimation data of the adjusted model is close to the first noise estimation data.

[0090] In one embodiment of this application, during the model distillation process, a loss function threshold can be set. When the loss function value of the model is within the range of the loss function threshold, the model distillation can be stopped, and the current model is determined to be the target U-shaped neural network model. If the loss function value of the model is not within the range of the loss function threshold, the model distillation continues.
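The sub-steps and the threshold-based stopping rule can be sketched as a toy training loop. Everything here is a stand-in chosen for illustration: the teacher is a fixed linear function of scalar inputs, the student has two trainable weights, plain gradient descent is used, and "within the threshold range" is interpreted as the mean loss being at or below the threshold. None of these choices comes from the application.

```python
def teacher(z_t, y):
    """Stand-in for the frozen initial model's noise estimate (first noise estimate)."""
    return 0.8 * z_t + 0.1 * y

def distill(samples, threshold=1e-6, lr=0.1, max_steps=10_000):
    """Adjust student weights until the mean squared gap to the teacher is small."""
    w = [0.0, 0.0]  # trainable parameters of the replaced structure (toy)
    loss = float("inf")
    for step in range(max_steps):
        grad = [0.0, 0.0]
        loss = 0.0
        for z_t, y in samples:
            first = teacher(z_t, y)           # expected value
            second = w[0] * z_t + w[1] * y    # student's (second) noise estimate
            diff = second - first
            loss += diff ** 2
            grad[0] += 2 * diff * z_t
            grad[1] += 2 * diff * y
        loss /= len(samples)
        if loss <= threshold:                 # stop distillation, keep current model
            return w, loss, step
        w = [wi - lr * gi / len(samples) for wi, gi in zip(w, grad)]
    return w, loss, max_steps

# Fixed toy samples of (z_t, y); real training data would be tensors.
samples = [(-1.0, 0.5), (0.3, -1.2), (1.5, 0.7), (-0.4, -0.9), (0.9, 1.1)]
weights, final_loss, steps_used = distill(samples)
```

If the threshold is never reached, the loop simply continues until `max_steps`, mirroring the "distillation continues" branch in the text.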

[0091] Step S206: Perform text-to-image image denoising according to the target U-shaped neural network model to obtain the target image.

[0092] In this embodiment, multiple attention structures can be replaced with a target model structure having linear complexity, reducing the computational time complexity of the attention module to linear complexity and yielding a target U-shaped neural network model, so that the text-to-image process can generate images with a resolution greater than 2k. At the same time, the target U-shaped neural network model is generated by comparing the noise estimate of the trained initial U-shaped neural network model with the noise estimate of the U-shaped neural network model after structure replacement, which effectively utilizes previously learned knowledge and improves model efficiency and performance.

[0093] Referring to Figure 3, a flowchart of another text-to-image method provided in an embodiment of this application is shown, which may specifically include the following steps:

[0094] Step S301: Obtain an initial U-shaped neural network model for image denoising during the text-to-image process. The initial U-shaped neural network model includes multiple attention structures.

[0095] Step S302: Replace multiple attention structures with the target model structure that has linear complexity;

[0096] Step S303: Based on the initial U-shaped neural network model, perform model distillation on the U-shaped neural network model after the structure replacement to obtain the target U-shaped neural network model;

[0097] Step S304: The initial image is converted into latent space features using a pre-trained variational autoencoder;

[0098] An initial image can be converted into latent space features using a pre-trained variational autoencoder (VAE). This process includes loading a pre-trained VAE model, preprocessing the input image, encoding the image into the latent space, and sampling the latent space features.

[0099] The VAE consists of two parts: an encoder and a decoder. The encoder maps input data (such as an image) to latent variables (such as mean and variance) in the latent space; the decoder reconstructs the input data from the latent variables in the latent space.

[0100] The pre-trained VAE selected in this embodiment has been trained on a large amount of data, effectively mapping input data to the latent space and generating high-quality samples from the latent space. The pre-trained VAE can be directly used for feature extraction tasks without retraining. The pre-trained model can be obtained from publicly available model libraries or as an appendix of research papers.

[0101] In practical applications, before loading the initial image into the VAE model, the initial image can be preprocessed to meet the input requirements of the VAE model. Preprocessing can include image scaling, normalization, and other operations. The preprocessed initial image is then input into the VAE model, where the encoder part encodes the preprocessed image into the latent space, obtaining latent variables (mean and variance). Reparameterization techniques are then used to sample latent variables from the mean and variance.
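The reparameterization step mentioned above can be sketched as follows. The sketch assumes the encoder returns a per-element mean and variance as flat lists; the function name is hypothetical, and many implementations work with the log-variance instead.

```python
import math
import random

def reparameterize(mean, variance, rng):
    """Sample z = mean + sqrt(variance) * eps, with eps drawn from N(0, 1)."""
    if any(v < 0 for v in variance):
        raise ValueError("variance must be non-negative")
    return [m + math.sqrt(v) * rng.gauss(0.0, 1.0) for m, v in zip(mean, variance)]

rng = random.Random(0)
latent = reparameterize([0.2, -1.0, 0.5], [0.1, 0.4, 0.05], rng)
deterministic = reparameterize([1.0, -2.0], [0.0, 0.0], random.Random(0))
```

Drawing the randomness separately from the mean and variance keeps the sample a simple function of the encoder outputs.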

[0102] Step S305: Add Gaussian noise to the latent space features to obtain latent space noise features;

[0103] Gaussian noise is a type of random noise whose values follow a normal distribution (i.e., a Gaussian distribution). The purpose of adding noise is to introduce randomness, making the generated image more varied and diverse.
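Adding Gaussian noise to latent features a given number of times can be sketched as below. The application does not specify a noise schedule, so the variance-preserving update with a constant `beta` is purely an assumption, as are the function and parameter names.

```python
import math
import random

def add_noise(latent, t, beta=0.02, rng=None):
    """Apply t noising steps: z <- sqrt(1 - beta) * z + sqrt(beta) * eps.

    `beta` is a hypothetical constant step size; eps is drawn from N(0, 1).
    """
    rng = rng or random.Random()
    z = list(latent)
    for _ in range(t):
        z = [math.sqrt(1.0 - beta) * v + math.sqrt(beta) * rng.gauss(0.0, 1.0)
             for v in z]
    return z

clean = [0.5, -0.5, 1.25]
noisy_a = add_noise(clean, t=10, rng=random.Random(1))
noisy_b = add_noise(clean, t=10, rng=random.Random(1))
untouched = add_noise(clean, t=0, rng=random.Random(1))
```

Passing a seeded generator makes the noising reproducible, which is convenient when the same noisy latent must be fed to two models for comparison.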

[0104] Step S306: Obtain the text information input by the user and encode the text information into text features;

[0105] In one embodiment of this application, encoding text information into text features includes: obtaining a target language model for text encoding; inputting the text information into the target language model; and using one encoder in the target language model to perform text encoding to obtain text features. Compared to existing text-generated image processes that use two or more encoders for text encoding, this simplifies the model structure.

[0106] In practical applications, the target language model can adopt the GLM4 general language model, which supports multiple languages. Unlike the common CLIP encoder (Contrastive Language-Image Pretraining) and T5 encoder (Text-To-Text Transfer Transformer), GLM4 (General Language Model 4) can greatly improve the representation ability of text features.

[0107] Step S307: Input the text features and latent space noise features into the target U-shaped neural network model to perform image denoising and output the denoised features;

[0108] In the process of image denoising, targeted denoising is achieved according to text features.

[0109] Step S308: Decode the denoised features to generate the target image.

[0110] The decoder of the variational autoencoder is used to decode the denoised features to obtain the target image.
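The data flow from VAE encoding through steps S305 to S308 can be sketched as a simple composition of stages. This is only an illustration of the ordering: every callable passed in below is a hypothetical stand-in (simple arithmetic, not a real VAE, language model, or U-shaped network), and the function name `text_to_image` is an assumption.

```python
def text_to_image(initial_image, prompt, vae_encode, add_noise,
                  encode_text, denoise, vae_decode):
    """Chain the stages in the order described by the method embodiment."""
    latent = vae_encode(initial_image)          # VAE encoding (precedes S305)
    noisy_latent = add_noise(latent)            # S305: add Gaussian noise
    text_features = encode_text(prompt)         # S306: encode the user text
    denoised = denoise(noisy_latent, text_features)  # S307: text-conditioned denoising
    return vae_decode(denoised)                 # S308: decode to the target image

calls = []  # records the order in which the stand-in stages run

def _encode(img):
    calls.append("vae_encode")
    return [p / 255.0 for p in img]

def _noise(z):
    calls.append("add_noise")
    return [v + 0.1 for v in z]                 # deterministic stand-in for noise

def _text(prompt):
    calls.append("encode_text")
    return [float(len(prompt))]

def _denoise(z, text_features):
    calls.append("denoise")
    return [v - 0.1 for v in z]                 # stand-in; ignores text_features

def _decode(z):
    calls.append("vae_decode")
    return [round(v * 255.0) for v in z]

image = text_to_image([0, 51, 255], "three cows under birch trees",
                      _encode, _noise, _text, _denoise, _decode)
```

Passing the stages in as arguments keeps the sketch independent of any particular model library.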

[0111] In this embodiment, multiple attention structures can be replaced with a target model structure having linear complexity, reducing the computational time complexity of the attention module to linear complexity and thus obtaining a target U-shaped neural network model. This allows text-to-image generation to produce images with a resolution greater than 2k. Furthermore, generating the target U-shaped neural network model through model distillation effectively utilizes previously learned knowledge, improving model efficiency and performance.
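The effect of quadratic versus linear complexity can be made concrete with a short calculation. The figures below rest on assumptions that are not stated in this application: a VAE that downsamples each spatial dimension by a factor of 8, one token per latent position, and attention scores stored in 16-bit floating point and fully materialized. Attention implementations that avoid materializing the score matrix, and attention applied at lower-resolution stages of the network, would change the absolute numbers but not the scaling.

```python
def token_count(height, width, downsample=8):
    """Number of latent positions (tokens) for an image of the given size."""
    return (height // downsample) * (width // downsample)

def attention_score_entries(n_tokens):
    """Entries in one full self-attention score matrix: quadratic in the token count."""
    return n_tokens * n_tokens

n_1k = token_count(1024, 1024)    # 16384 tokens
n_2k = token_count(2048, 2048)    # 65536 tokens

ratio_tokens = n_2k / n_1k        # 4.0: cost growth for a linear-complexity structure
ratio_attn = attention_score_entries(n_2k) / attention_score_entries(n_1k)  # 16.0

# Memory for one materialized score matrix at 2 bytes per entry (8 GiB at 2048 x 2048).
fp16_bytes = attention_score_entries(n_2k) * 2
```

Doubling the image side length therefore multiplies the cost of a linear-complexity structure by 4, but multiplies the attention score matrix by 16.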

[0112] Referring to Figure 4, a schematic diagram of a text-to-image processing flow according to an embodiment of this application is shown, which may include the following process:

[0113] The initial image on the left side of Figure 4 is obtained. The initial image is input into the encoding module of the VAE, and noise is added to generate the latent space noise feature Zt.

[0114] The user inputs the text "Three cows are grazing peacefully under white birch trees. The birch leaves are yellow, the grass is yellow, and there is a white cow in the distance. The whole scene creates a tranquil atmosphere." This text is input into the GLM4 large language model to generate the text encoding. The text encoding and the latent space noise feature Zt are then input into the image denoising module for denoising. In Figure 4, the upper part is the original Unet model, which consists of multiple self-attention blocks and cross-attention blocks. The lower part is the target Unet model, obtained by replacing the attention structures with the Mamba2 structure, which is based on the State Space Model (SSM), and then performing model distillation. The target Unet model denoises the latent space noise feature Zt according to the text features, yielding the denoised features. The denoised features are input into the decoding module of the VAE to obtain the generated target image (the image on the right side of Figure 4).
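To illustrate why a state-space structure has linear complexity in the sequence length, the following sketch runs a scalar linear recurrence over a token sequence in a single pass. It is a deliberately simplified stand-in and is not the Mamba2 structure itself, which among other differences uses input-dependent parameters and multi-dimensional states. The names `ssm_scan`, `a`, `b`, and `c` are hypothetical.

```python
def ssm_scan(xs, a, b, c):
    """Linear-time scan of the recurrence h_t = a * h_{t-1} + b * x_t, y_t = c * h_t."""
    h = 0.0
    ys = []
    for x in xs:              # one constant-cost update per token
        h = a * h + b * x     # the state summarizes all earlier tokens
        ys.append(c * h)
    return ys

# Impulse input: the response decays geometrically with factor a.
ys = ssm_scan([1.0, 0.0, 0.0, 0.0], a=0.5, b=1.0, c=2.0)
# ys == [2.0, 1.0, 0.5, 0.25]
```

Each output depends on all earlier inputs through the state h, yet the work per token is constant, whereas self-attention compares every token with every other token.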

[0115] It should be noted that, for the sake of simplicity, the method embodiments are described as a series of actions. However, those skilled in the art should understand that the embodiments of this application are not limited to the described order of actions, because according to the embodiments of this application, some steps can be performed in other orders or simultaneously. Secondly, those skilled in the art should also understand that the embodiments described in the specification are all preferred embodiments, and the actions involved are not necessarily required by the embodiments of this application.

[0116] Referring to Figure 5, a schematic diagram of a text-to-image apparatus according to an embodiment of this application is shown, which may specifically include the following modules:

[0117] The initial U-shaped neural network model acquisition module 501 is configured to acquire an initial U-shaped neural network model for image denoising processing during text-to-image processing, wherein the initial U-shaped neural network model includes multiple attention structures;

[0118] The structure replacement module 502 is configured to replace the multiple sets of attention structures with a target model structure having linear complexity;

[0119] The model distillation module 503 is configured to perform, based on the initial U-shaped neural network model, model distillation on the U-shaped neural network model after structure replacement to obtain the target U-shaped neural network model;

[0120] The target image generation module 504 is configured to perform text-to-image image denoising according to the target U-shaped neural network model to obtain the target image.

[0121] In one embodiment of this application, the model distillation module 503 may include the following sub-modules:

[0122] The sample acquisition submodule is configured to acquire the training sample data corresponding to the initial U-shaped neural network model and the first noise estimation data corresponding to the training sample data.

[0123] The second noise estimation determination submodule is configured to input the training sample data into the U-shaped neural network model after structure replacement to obtain the second noise estimation data.

[0124] The model parameter adjustment submodule is configured to use the first noise estimation data as the expected value for image denoising of the target U-shaped neural network model, and to adjust the model parameters of the U-shaped neural network model after structure replacement based on the first noise estimation data and the second noise estimation data to obtain the target U-shaped neural network model.
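The cooperation of these three submodules can be sketched as a toy distillation loop. Everything below is an assumption made for illustration: the teacher is a fixed scalar function standing in for the initial U-shaped neural network model, the student has a single trainable parameter `w`, plain gradient descent is used, and the learning rate and loss threshold values are hypothetical. Stopping once the loss falls within a threshold follows the condition recited in claim 8.

```python
def teacher_noise(x):
    """Stand-in for the initial (teacher) model: returns the first noise estimate."""
    return 0.8 * x

def distill(samples, lr=0.1, loss_threshold=1e-12, max_steps=1000):
    """Fit the toy student w * x to the teacher's noise estimates."""
    w = 0.0                                            # student parameter
    targets = [teacher_noise(x) for x in samples]      # first noise estimation data
    for step in range(max_steps):
        preds = [w * x for x in samples]               # second noise estimation data
        # Mean over samples of the squared difference between the two estimates.
        loss = sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(samples)
        if loss <= loss_threshold:                     # stop within the threshold
            return w, loss, step
        # Gradient of the loss with respect to w.
        grad = sum(2.0 * (p - t) * x
                   for p, t, x in zip(preds, targets, samples)) / len(samples)
        w -= lr * grad                                 # adjust the student parameter
    return w, loss, max_steps

w, final_loss, steps = distill([1.0, -2.0, 0.5, 1.5])
```

Only the student parameter is updated; the teacher is evaluated but never modified, which is how the knowledge of the initial model is transferred to the structure-replaced model.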

[0125] In one embodiment of this application, the model parameter adjustment submodule may include the following units:

[0126] The loss function construction unit is configured to construct a loss function for the first noise estimation data and the second noise estimation data, using the first noise estimation data as the expected value for image denoising of the target U-shaped neural network model;

[0127] The U-shaped neural network model determination unit is configured to adjust, based on the loss function, the model parameters of the U-shaped neural network model after structure replacement to obtain the target U-shaped neural network model.

[0128] In one embodiment of this application, the loss function is the square of the L2 norm of the difference between the second noise estimation data and the first noise estimation data.
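Written out, this loss is the sum of the squared element-wise differences between the two noise estimates. A minimal sketch follows; the function name `squared_l2_loss` and the sample values are hypothetical.

```python
def squared_l2_loss(second_estimate, first_estimate):
    """Square of the L2 norm of (second_estimate - first_estimate)."""
    if len(second_estimate) != len(first_estimate):
        raise ValueError("noise estimates must have the same length")
    return sum((s - f) ** 2 for s, f in zip(second_estimate, first_estimate))

first = [1.0, 0.0, 5.0]     # hypothetical first noise estimation data (expected value)
second = [1.0, 2.0, 3.0]    # hypothetical second noise estimation data
loss = squared_l2_loss(second, first)   # 0 + 4 + 4 = 8.0
```

The loss is zero exactly when the two noise estimates agree element by element.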

[0129] In one embodiment of this application, the target image generation module 504 may include:

[0130] The latent space feature transformation submodule is configured to use a pre-trained variational autoencoder to transform the initial image into latent space features;

[0131] The latent space noise feature submodule is configured to add Gaussian noise to the latent space feature to obtain the latent space noise feature;

[0132] The text information encoding submodule is configured to acquire text information input by the user and encode the text information into text features;

[0133] The denoising submodule is configured to input the text features and latent space noise features into the target U-shaped neural network model for image denoising and output denoised features.

[0134] The decoding submodule is configured to decode the denoised features to generate the target image.

[0135] In one embodiment of this application, the text information encoding submodule may include the following units:

[0136] The target language model acquisition unit is configured to acquire the target language model used for text encoding.

[0137] The text feature encoding unit is configured to input the text information into the target language model and use an encoder in the target language model to encode the text to obtain text features.

[0138] In one embodiment of this application, the target model structure is a Mamba2 structure based on a state-space model.

[0139] In this embodiment, an initial U-shaped neural network model for image denoising during the text-to-image process is obtained. This initial U-shaped neural network model includes multiple attention structures. These attention structures are replaced with a target model structure having linear complexity. Based on the initial U-shaped neural network model, model distillation is performed on the U-shaped neural network model after structure replacement to obtain a target U-shaped neural network model. Text-to-image image denoising is then performed according to the target U-shaped neural network model to obtain the target image. Replacing the multiple attention structures with a target model structure having linear complexity reduces the computational time complexity of the attention module to linear complexity. Therefore, text-to-image generation can produce images with a resolution greater than 2k. Furthermore, generating the target U-shaped neural network model through model distillation effectively utilizes previously learned knowledge, improving model efficiency and performance.

[0140] An embodiment of this application also provides an electronic device, which may include a processor, a memory, and a computer program stored in the memory and capable of running on the processor. When the computer program is executed by the processor, it implements the text-to-image method described above.

[0141] An embodiment of this application also provides a computer-readable storage medium on which a computer program is stored. When the computer program is executed by a processor, it implements the text-to-image method described above.

[0142] As the device embodiment is basically similar to the method embodiment, the description is relatively simple, and relevant parts can be found in the description of the method embodiment.

[0143] The various embodiments in this specification are described in a progressive manner, with each embodiment focusing on the differences from other embodiments. The same or similar parts between the various embodiments can be referred to each other.

[0144] Those skilled in the art will understand that embodiments of this application can be provided as methods, apparatus, or computer program products. Therefore, embodiments of this application can take the form of entirely hardware embodiments, entirely software embodiments, or embodiments combining software and hardware aspects. Furthermore, embodiments of this application can take the form of computer program products implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.

[0145] This application describes embodiments with reference to flowchart illustrations and / or block diagrams of methods, terminal devices (systems), and computer program products according to embodiments of this application. It should be understood that each block of the flowchart illustrations and / or block diagrams, and combinations of blocks in the flowchart illustrations and / or block diagrams, can be implemented by computer program instructions. These computer program instructions can be provided to a processor of a general-purpose computer, special-purpose computer, embedded processor, or other programmable data processing terminal device to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing terminal device, create means for implementing the functions specified in one or more blocks of the flowchart illustrations and / or one or more blocks of the block diagrams.

[0146] These computer program instructions may also be stored in a computer-readable storage medium that can direct a computer or other programmable data processing terminal device to function in a particular manner, such that the instructions stored in the computer-readable storage medium produce an article of manufacture including instruction means that implement the functions specified in one or more flowcharts and / or one or more block diagrams.

[0147] These computer program instructions may also be loaded onto a computer or other programmable data processing terminal equipment to cause a series of operational steps to be performed on the computer or other programmable terminal equipment to produce a computer-implemented process, such that the instructions, which execute on the computer or other programmable terminal equipment, provide steps for implementing the functions specified in one or more flowcharts and / or one or more block diagrams.

[0148] Although preferred embodiments of the present application have been described, those skilled in the art, upon learning the basic inventive concept, can make other changes and modifications to these embodiments. Therefore, the appended claims are intended to be interpreted as including the preferred embodiments as well as all changes and modifications falling within the scope of the embodiments of the present application.

[0149] Finally, it should be noted that in this document, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any such actual relationship or order between these entities or operations. Furthermore, the terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or terminal device that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or terminal device. Without further limitations, an element defined by the phrase "comprising one..." does not exclude the presence of other identical elements in the process, method, article, or terminal device that includes said element.

[0150] The above provides a detailed description of the text-to-image method, apparatus, electronic device, and storage medium. Specific examples have been used to illustrate the principles and implementation methods of this application. The descriptions of the above embodiments are only for the purpose of helping to understand the method and core ideas of this application. At the same time, for those skilled in the art, there will be changes in the specific implementation methods and application scope based on the ideas of this application. Therefore, the content of this specification should not be construed as a limitation of this application.

Claims

1. A text-to-image method, comprising: obtaining an initial U-shaped neural network model for image denoising during a text-to-image process, the initial U-shaped neural network model including multiple attention structures; replacing the multiple attention structures with a target model structure having linear complexity; performing, based on the initial U-shaped neural network model, model distillation on the U-shaped neural network model after structure replacement to obtain a target U-shaped neural network model; and performing text-to-image image denoising according to the target U-shaped neural network model to obtain a target image.

2. The method according to claim 1, wherein performing, based on the initial U-shaped neural network model, model distillation on the U-shaped neural network model after structure replacement comprises: obtaining training sample data corresponding to the initial U-shaped neural network model and first noise estimation data corresponding to the training sample data; inputting the training sample data into the U-shaped neural network model after structure replacement to obtain second noise estimation data; and using the first noise estimation data as the expected value for image denoising of the target U-shaped neural network model, adjusting the model parameters of the U-shaped neural network model after structure replacement based on the first noise estimation data and the second noise estimation data to obtain the target U-shaped neural network model.

3. The method according to claim 2, wherein using the first noise estimation data as the expected value for image denoising of the target U-shaped neural network model, and adjusting the model parameters of the U-shaped neural network model after structure replacement based on the first noise estimation data and the second noise estimation data to obtain the target U-shaped neural network model, comprises: constructing a loss function for the first noise estimation data and the second noise estimation data, using the first noise estimation data as the expected value for image denoising of the target U-shaped neural network model; and adjusting, based on the loss function, the model parameters of the U-shaped neural network model after structure replacement to obtain the target U-shaped neural network model.

4. The method according to claim 3, wherein, The loss function is the square of the L2 norm of the difference between the second noise estimation data and the first noise estimation data.

5. The method according to any one of claims 1 to 4, wherein performing text-to-image image denoising according to the target U-shaped neural network model comprises: using a pre-trained variational autoencoder to convert an initial image into latent space features; adding Gaussian noise to the latent space features to obtain latent space noise features; obtaining text information input by a user and encoding the text information into text features; inputting the text features and the latent space noise features into the target U-shaped neural network model for image denoising, and outputting denoised features; and decoding the denoised features to generate the target image.

6. The method according to claim 5, wherein encoding the text information into text features comprises: obtaining a target language model for text encoding; and inputting the text information into the target language model and using one encoder in the target language model to perform text encoding to obtain the text features.

7. The method according to claim 1, 2, 3, 4, or 6, wherein, The target model structure is a Mamba2 structure based on a state-space model.

8. The method according to claim 3, wherein adjusting, based on the loss function, the model parameters of the U-shaped neural network model after structure replacement comprises: setting a threshold for the loss function; and when the loss function value of the U-shaped neural network model after structure replacement is within the loss function threshold range, stopping model distillation and determining the current model to be the target U-shaped neural network model.

9. The method according to claim 1, wherein, The multiple attention structures include self-attention modules and cross-attention modules.

10. A text-to-image apparatus, comprising: an initial U-shaped neural network model acquisition module configured to acquire an initial U-shaped neural network model for image denoising during a text-to-image process, wherein the initial U-shaped neural network model includes multiple attention structures; a structure replacement module configured to replace the multiple attention structures with a target model structure having linear complexity; a model distillation module configured to perform, based on the initial U-shaped neural network model, model distillation on the U-shaped neural network model after structure replacement to obtain a target U-shaped neural network model; and a target image generation module configured to perform text-to-image image denoising according to the target U-shaped neural network model to obtain a target image.

11. An electronic device comprising a processor, a memory, and a computer program stored in the memory and capable of running on the processor, wherein the computer program, when executed by the processor, implements the text-to-image method as described in any one of claims 1 to 9.

12. A computer-readable storage medium storing a computer program that, when executed by a processor, implements the text-to-image method as described in any one of claims 1 to 9.