A fast stylized text-to-image generation method based on a diffusion model
By using a latent consistency model and a normalized hybrid self-attention mechanism, representative style features are extracted from a reference style image to guide text-to-image generation. This addresses the time-consuming fine-tuning and poor results of existing techniques and achieves fast, efficient stylized image generation.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- SHANGHAI ARTIFICIAL INTELLIGENCE INNOVATION CENT
- Filing Date
- 2025-05-16
- Publication Date
- 2026-06-23
AI Technical Summary
Existing stylized text-to-image generation techniques require fine-tuning of pre-trained large-scale diffusion models, which is time-consuming and often ineffective. In particular, when the reference style image differs greatly from the content image, the two-stage method takes longer at inference and produces poor results.
We utilize the self-consistency properties of latent consistency models (LCMs) to extract representative style features from reference style images and introduce a normalized hybrid self-attention mechanism to guide the text-to-image generation process, avoiding fine-tuning and inversion operations, and ensuring that the generated results are highly consistent with the distribution of the reference style images.
It enables the rapid generation of high-quality stylized images, reduces inference time and improves efficiency, and can generate images that meet the desired style using only a single style reference image.
Figure CN120543365B_ABST
Description
Technical Field
[0001] This invention relates to the field of image processing technology, specifically to a method for rapid stylized text-to-image generation based on a diffusion model.
Background Technology
[0002] Stylized text-to-image generation aims to generate images that conform to a desired style based on a small number of style images, representing a novel image generation paradigm. Although stylized image generation resembles neural style transfer, the two are fundamentally different. Neural style transfer takes a content image and a style image as input and solves an image transformation task, focusing on transferring the artistic style of the style image to the content image. In contrast, stylized image generation generates images that conform to a specific style based on given text prompts.
[0003] One direct approach is to first use a pre-trained text-to-image (T2I) model to generate an image from a content prompt, and then use a state-of-the-art style transfer method to convert the generated image into the style of the reference style image. However, this approach is mainly suitable when the reference style image and the content image are somewhat similar; when the two differ significantly, it may fail. An alternative approach is to use more reference style images and lightly train or fine-tune the model using LoRA or an adapter, or even fine-tune the entire T2I model. However, these methods require more time and effort and are less convenient than directly providing a reference style image.
[0004] Currently, these technologies still face challenges and drawbacks. They require adding additional adapters to pre-trained large-scale diffusion models for full or partial fine-tuning, which is time-consuming and limits their practical application. Furthermore, the two-stage approach (first generating an image with a text-to-image generation model and then stylizing it with an advanced neural style transfer method) often involves inversion of the diffusion model, which doubles the inference time and typically yields poor results.
Summary of the Invention
[0005] In view of this, the present invention provides a fast stylized text-to-image generation method based on a diffusion model, so as to solve at least the above-mentioned technical problems.
[0006] According to a first aspect of the present invention, a fast stylized text-to-image generation method based on a diffusion model is provided, comprising: S1, style feature extraction: inputting a reference style image with a resolution of 512×512×3, and extracting representative style features from the reference style image using the self-consistency of a latent consistency model; S2, stylized text-to-image generation: guiding the process of generating an image from text through the extracted representative style features and a normalized hybrid self-attention mechanism.
[0007] Optionally, the step of extracting representative style features from the input reference style image using the self-consistency of the latent consistency model includes: mapping the reference style image to a low-dimensional latent space through a pre-trained variational autoencoder to obtain an initial latent code; adding Gaussian noise to the initial latent code using a diffusion process noise injection formula to obtain a noisy latent code; inputting the noisy latent code into the noise prediction network of the latent consistency model, and extracting the style statistical features of each layer of the Transformer as the representative style features through a single-step denoising operation, wherein the noise prediction network achieves stable mapping of features at adjacent time steps by minimizing the consistency loss function.
[0008] Optionally, adding Gaussian noise to the initial latent code using the diffusion process noise injection formula yields the noisy latent code, expressed as:
[0009] $$z_t = \sqrt{\bar{\alpha}_t}\, z_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon$$
[0010] where $z_t$ is the noisy latent code, $t$ is the diffusion time step, $\epsilon$ is Gaussian noise, $z_0$ is the initial latent code, and $\bar{\alpha}_t$ is the cumulative noise-schedule coefficient at time step $t$.
[0011] Optionally, the process of guiding text to generate images through extracted representative style features and a normalized hybrid self-attention mechanism includes: fusing content features with extracted representative style features in the Transformer feature layer of the diffusion model to obtain intermediate output content features, and adjusting the intermediate output content features through a normalized hybrid self-attention mechanism to ensure that the generated stylized result is highly consistent with the distribution of the reference style image.
[0012] Optionally, the normalized hybrid self-attention mechanism specifically involves: in the self-attention module of the Transformer feature layer, mapping the intermediate output content features to query features, key features, and value features; replacing the key features and value features with the key features and value features of representative style features to obtain the replaced key features and value features; calculating the attention score matrix using the query features and the replaced key features and value features; performing a unified Softmax operation on the attention score matrix to obtain the fused attention weight matrix; and performing a weighted summation of the value features based on the fused attention weight matrix to obtain the stylized content features as the stylization result.
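[0013] Optionally, the normalized hybrid self-attention mechanism further includes: performing style distribution normalization on the content features before performing the unified Softmax operation on the attention score matrix, as follows:
[0014] $$\hat{f}_c^l = \sigma(f_s^l)\, \frac{f_c^l - \mu(f_c^l)}{\sigma(f_c^l)} + \mu(f_s^l)$$
[0015] where $\mu(f_c^l)$, $\sigma(f_c^l)$ and $\mu(f_s^l)$, $\sigma(f_s^l)$ denote the mean and standard deviation of the content features $f_c^l$ and of the representative style features $f_s^l$, respectively, and $\hat{f}_c^l$ denotes the normalized content features.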
[0016] According to a second aspect of the present invention, a fast stylized text-to-image generation system based on a diffusion model is provided, comprising: a style feature extraction module, used to input a reference style image with a resolution of 512×512×3, and extract representative style features from the reference style image using the self-consistency of a latent consistency model; and a stylized text-to-image generation module, used to guide the process of generating an image from text through the extracted representative style features and a normalized hybrid self-attention mechanism.
[0017] According to a third aspect of the present invention, an electronic device is provided, including a processor and a memory storing a program. The program includes instructions that, when executed by the processor, cause the processor to perform the steps performed by the method of the first aspect described above.
[0018] According to a fourth aspect of the present invention, a computer storage medium is provided having a computer program stored thereon, which, when executed by a processor, implements the method of the first aspect described above.
[0019] In summary, this invention leverages the self-consistency properties of latent consistency models (LCMs) to extract representative style statistics from reference style images to guide the stylization process. Furthermore, it introduces a normalized hybrid self-attention mechanism, enabling the model to query the most relevant style patterns from these style statistics and use them to adjust the content features of the intermediate output. This mechanism also ensures that the generated stylized results are highly consistent with the distribution of the reference style image. This invention can generate stylized images from text using only a single style reference image, offering faster inference speed and higher efficiency compared to other methods.
Brief Description of the Drawings
[0020] To more clearly illustrate the technical solutions in the embodiments of the present invention or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are only some embodiments recorded in the embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings.
[0021] Figure 1 is a flowchart of the steps of the present invention.
[0022] Figure 2 is a schematic diagram of the CLIP feature similarity at different time steps for different combinations of noise injection method and model.
[0023] Figure 3 is a visualization analysis, including principal component analysis, of the fusion of semantic and style information in stylized text-to-image generation.
[0024] Figure 4 is a comparison of the effects of the normalized hybrid self-attention mechanism on stylized text-to-image generation.
[0025] Figure 5 is a schematic diagram illustrating the impact of style distribution normalization.
[0026] Figure 6 is an overall flowchart of the present invention corresponding to Figure 1.
Detailed Implementation
[0027] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.
[0028] This invention presents a fast stylized text-to-image generation method based on a diffusion model. This method can generate high-quality stylized images from a given style image within six sampling steps. The invention proposes a novel stylized image generation method called UniSty, which utilizes a pre-trained large-scale diffusion model without fine-tuning or any additional optimization. Specifically, this invention leverages the self-consistency properties of latent consistency models (LCMs) to extract representative style statistics from a reference style image to guide the stylization process. Furthermore, this invention introduces a normalized mixture of self-attention mechanism, enabling the model to query the most relevant style patterns from these style statistics and use them to adjust the content features of the intermediate output. This mechanism also ensures that the generated stylized result is highly consistent with the distribution of the reference style image.
[0029] Referring to Figures 1 and 6, the present invention provides a fast stylized text-to-image generation method based on a diffusion model, which mainly includes:
[0030] S1. Style feature extraction: Input a reference style image with a resolution of 512×512×3, and extract representative style features from the reference style image using the self-consistency of the latent consistency model.
[0031] S2. Stylized text-to-image generation: guide the process of generating an image from text through the extracted representative style features and a normalized hybrid self-attention mechanism. After analyzing the advantages and disadvantages of various attention fusion modules, this invention proposes a normalized hybrid self-attention mechanism, which allows the model to query the most relevant style patterns from the style statistics and use them to adjust the content features of the intermediate output. This mechanism also ensures that the generated stylized results are highly consistent with the distribution of the reference style image.
[0032] Optionally, the step of extracting representative style features from the input reference style image using the self-consistency of the latent consistency model includes: mapping the reference style image to a low-dimensional latent space through a pre-trained variational autoencoder to obtain an initial latent code; adding Gaussian noise to the initial latent code using a diffusion process noise injection formula to obtain a noisy latent code; inputting the noisy latent code into the noise prediction network of the latent consistency model, and extracting the style statistical features of each layer of the Transformer as the representative style features through a single-step denoising operation, wherein the noise prediction network achieves stable mapping of features at adjacent time steps by minimizing the consistency loss function.
[0033] Optionally, adding Gaussian noise to the initial latent code using the diffusion process noise injection formula yields the noisy latent code, expressed as:
[0034] $$z_t = \sqrt{\bar{\alpha}_t}\, z_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon$$
[0035] where $z_t$ is the noisy latent code, $t$ is the diffusion time step, $\epsilon$ is Gaussian noise, $z_0$ is the initial latent code, and $\bar{\alpha}_t$ is the cumulative noise-schedule coefficient at time step $t$.
[0036] For example, extracting representative style features in step S1 specifically involves:
[0037] S11. As shown in Figure 2, this invention applies two noise injection methods to the style image, namely the forward noise addition process of the diffusion model (Equation 1, Eq. 1) and the DDIM Inversion technique, and uses two different models, Latent Consistency Models (LCMs) and Stable Diffusion (SD), for single-step denoising. The CLIP image encoder is then used to extract features from the original style image and from the denoised image, and their cosine similarity is calculated. The results in Figure 2 show that the performance of all three combinations gradually decreases as the time step increases. However, the performance gap between the Eq. 1 + LCMs combination and the DDIM Inversion + SD combination remains small. This indicates that LCMs can effectively extract representative style statistics from noisy style images, thus avoiding the time-consuming DDIM inversion operation. The fundamental reason lies in the optimization objective of LCMs: minimizing the difference between the consistency function outputs of adjacent samples. This allows LCMs to preserve representative style statistics, i.e., representative style features, even in single-step prediction.
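The similarity measurement in step S11 can be illustrated with a minimal sketch. It only shows the cosine-similarity computation; the 512-dimensional random vectors are hypothetical stand-ins for CLIP image embeddings, and the function name is an assumption, not part of the invention.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two feature vectors (flattened to 1-D)."""
    a = a.ravel().astype(np.float64)
    b = b.ravel().astype(np.float64)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return float(np.dot(a, b) / denom)

# Hypothetical stand-ins for CLIP embeddings of the original style image
# and of the image recovered by single-step denoising.
rng = np.random.default_rng(0)
feat_original = rng.standard_normal(512)
feat_denoised = feat_original + 0.1 * rng.standard_normal(512)

similarity = cosine_similarity(feat_original, feat_denoised)
```

A value close to 1 means the denoised image keeps most of the style information of the original; repeating the measurement over several time steps would give one curve per combination, as in the comparison described above.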
[0038] S12. Based on the findings described in step S11, this invention uses Latent Consistency Models (LCMs) to extract representative style statistics from the reference style image. Given an input reference style image $I_s$, the encoder $\mathcal{E}$ of a pre-trained variational autoencoder maps $I_s$ to a low-dimensional latent space, yielding the initial latent code $z_0 = \mathcal{E}(I_s)$. The noise injection formula of the diffusion process, $z_t = \sqrt{\bar{\alpha}_t}\, z_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon$, is then used to add Gaussian noise to the latent code, where $t$ is the diffusion time step, $\epsilon$ is Gaussian noise, and $\bar{\alpha}_t$ is the cumulative noise-schedule coefficient at time step $t$;
[0039] S13. The noisy latent code $z_t$ obtained in step S12 is input into the noise prediction network $\epsilon_\theta$ of the latent consistency model, and the style statistical features $f_s^l$ of each Transformer layer $l$ are extracted through a single-step denoising operation $\epsilon_\theta(z_t, t, c)$, where $c$ represents the text input. The noise prediction network achieves a stable mapping of features between adjacent time steps by minimizing the consistency loss function.
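The noise injection of step S12 can be sketched as follows. This assumes the standard forward diffusion process $z_t = \sqrt{\bar{\alpha}_t}\, z_0 + \sqrt{1-\bar{\alpha}_t}\, \epsilon$ with a linear beta schedule; the schedule values, the 4×64×64 latent shape, and the function names are illustrative assumptions, and the random array merely stands in for a real VAE latent.

```python
import numpy as np

def make_alpha_bar(num_steps: int = 1000,
                   beta_start: float = 1e-4,
                   beta_end: float = 2e-2) -> np.ndarray:
    """Cumulative product of (1 - beta_t) for an assumed linear beta schedule."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def add_noise(z0: np.ndarray, t: int, alpha_bar: np.ndarray,
              rng: np.random.Generator):
    """Forward diffusion: mix the clean latent with Gaussian noise at step t."""
    eps = rng.standard_normal(z0.shape)
    z_t = np.sqrt(alpha_bar[t]) * z0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return z_t, eps

rng = np.random.default_rng(0)
alpha_bar = make_alpha_bar()
# Stand-in for the VAE latent of a 512x512x3 style image (shape is an assumption).
z0 = rng.standard_normal((4, 64, 64))
z_t, eps = add_noise(z0, t=500, alpha_bar=alpha_bar, rng=rng)
```

Unlike DDIM inversion, this step needs no network evaluation: it is a single closed-form mixing of the latent with noise, after which one denoising pass of the LCM would expose the per-layer style features.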
[0040] Optionally, the process of guiding text to generate images through extracted representative style features and a normalized hybrid self-attention mechanism includes: fusing content features with extracted representative style features in the Transformer feature layer of the diffusion model to obtain intermediate output content features, and adjusting the intermediate output content features through a normalized hybrid self-attention mechanism to ensure that the generated stylized result is highly consistent with the distribution of the reference style image.
[0041] Optionally, the normalized hybrid self-attention mechanism specifically involves: in the self-attention module of the Transformer feature layer, mapping the intermediate output content features to query features, key features, and value features; replacing the key features and value features with the key features and value features of representative style features to obtain the replaced key features and value features; calculating the attention score matrix using the query features and the replaced key features and value features; performing a unified Softmax operation on the attention score matrix to obtain the fused attention weight matrix; and performing a weighted summation of the value features based on the fused attention weight matrix to obtain the stylized content features as the stylization result.
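[0042] Optionally, the normalized hybrid self-attention mechanism further includes: performing style distribution normalization on the content features before performing the unified Softmax operation on the attention score matrix, as follows:
[0043] $$\hat{f}_c^l = \sigma(f_s^l)\, \frac{f_c^l - \mu(f_c^l)}{\sigma(f_c^l)} + \mu(f_s^l)$$
[0044] where $\mu(f_c^l)$, $\sigma(f_c^l)$ and $\mu(f_s^l)$, $\sigma(f_s^l)$ denote the mean and standard deviation of the content features $f_c^l$ and of the representative style features $f_s^l$, respectively, and $\hat{f}_c^l$ denotes the normalized content features.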
[0045] For example, the normalized mixture of self-attention mechanism in step S2 is specifically as follows:
[0046] S21. Denote the content features of the $l$-th layer of the diffusion model's Transformer as $f_c^l$. The objective of this invention is to seamlessly integrate the representative style statistics $f_s^l$ of the $l$-th layer into the content features $f_c^l$, thereby obtaining the stylized content features $f_{cs}^l$. In a Transformer layer of the backbone network, the features $f_c^l$ are mapped by the self-attention module to query ($Q_c^l$), key ($K_c^l$), and value ($V_c^l$) features.
[0047] S22. This invention first replaces the key and value features with the key and value features of the representative style statistics, i.e., the representative style features. Intuitively, the query (Q) represents semantic information, such as image layout, while the key (K) and value (V) features represent style statistics, such as color, texture, and lighting. The stylized content features $f_{cs}^l$ can then be expressed as:
[0048] $$f_{cs}^l = \mathrm{Attn}(Q_c^l, K_s^l, V_s^l) = \mathrm{softmax}\!\left(\frac{Q_c^l (K_s^l)^\top}{\sqrt{d}}\right) V_s^l,$$
[0049] where $\mathrm{Attn}(\cdot)$ denotes the attention calculation, $\mathrm{softmax}(\cdot)$ represents the softmax activation function, and $d$ is the feature dimension. This invention observes that the formulation in this step often prioritizes capturing style statistics at the expense of the semantic information obtained from the prompt. For example, as shown in the first row of Figure 3, the method failed to generate the expected content corresponding to the prompts "girl" and "house". To further analyze this phenomenon, this invention performed principal component analysis (PCA) on the features or attention maps from different backbone network layers, including ResNet blocks and the query (Q) and key (K) layers in self-attention. The second row of Figure 3 visualizes the first three principal components, and the results further reveal that the semantic information is closer to the style reference image than to the given conditioning prompt.
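The key/value replacement of step S22 can be sketched with plain scaled dot-product attention. The token counts, feature dimension, and function names are assumptions for illustration; real features would come from the Transformer layers of the diffusion backbone rather than a random generator.

```python
import numpy as np

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    """Numerically stable softmax along the given axis."""
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def style_swapped_attention(q_c: np.ndarray, k_s: np.ndarray,
                            v_s: np.ndarray) -> np.ndarray:
    """Content queries attend to style keys/values instead of their own."""
    d = q_c.shape[-1]
    scores = q_c @ k_s.T / np.sqrt(d)   # (n_content, n_style) score matrix
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ v_s                # weighted sum of style values

rng = np.random.default_rng(0)
n_c, n_s, d = 6, 5, 8                    # hypothetical token counts and width
q_c = rng.standard_normal((n_c, d))      # content queries
k_s = rng.standard_normal((n_s, d))      # style keys
v_s = rng.standard_normal((n_s, d))      # style values
f_cs = style_swapped_attention(q_c, k_s, v_s)
```

Because every output row is a convex combination of style value rows only, nothing from the content keys or values survives, which is consistent with the loss of semantic information described above.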
[0050] S23. To address the above problem, the present invention further enhances the semantic information by reintroducing the semantic representation in $f_c^l$, as follows:
[0051] $$f_{cs}^l = \lambda \cdot \mathrm{Attn}(Q_c^l, K_s^l, V_s^l) + (1 - \lambda) \cdot \mathrm{Attn}(Q_c^l, K_c^l, V_c^l),$$
[0052] where $\lambda$ is a weight hyperparameter. This invention found that this method has poor robustness and is highly dependent on the choice of $\lambda$. For example, as shown in Figure 4, under the same $\lambda$ setting, the stylized images generated in the second row are superior to those in the first row in terms of semantic representation.
[0053] This invention argues that the performance instability stems from the fact that the two attention calculations in the formula of this step are performed separately. Recall that in the standard attention mechanism, negative scores are mapped to extremely small values by the exponential operation, which effectively reduces their contribution to the attention weight matrix. However, in the stylized text-to-image (T2I) task, if the style statistics differ significantly from the semantic information, the attention score matrix (before Softmax) between the content features $f_c^l$ and the representative style statistics $f_s^l$ may contain negative values in some rows, so that the exponential function fails to work effectively. Therefore, an additional weight coefficient $\lambda$ has to be introduced to balance the two components on the right-hand side of the equation.
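A small numeric sketch may make this argument concrete. It assumes the two-branch form is a convex combination weighted by $\lambda$ (an assumption about the exact formula) and uses toy two-dimensional features chosen so that every style score is strongly negative.

```python
import numpy as np

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    """Numerically stable softmax along the given axis."""
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def separate_mixed_attention(q, k_s, v_s, k_c, v_c, lam):
    """Two independent softmaxes, blended afterwards by lam (assumed form)."""
    d = q.shape[-1]
    w_s = softmax(q @ k_s.T / np.sqrt(d))   # style branch, rows sum to 1
    w_c = softmax(q @ k_c.T / np.sqrt(d))   # content branch, rows sum to 1
    out = lam * (w_s @ v_s) + (1.0 - lam) * (w_c @ v_c)
    return out, w_s, w_c

# One query that matches the content keys and mismatches the style keys.
q = np.array([[1.0, 0.0]])
k_c = np.array([[4.0, 0.0], [3.0, 0.0]])     # positive scores
k_s = np.array([[-4.0, 0.0], [-5.0, 0.0]])   # negative scores
v_c = np.array([[1.0, 0.0], [1.0, 0.0]])     # content values mark axis 0
v_s = np.array([[0.0, 1.0], [0.0, 1.0]])     # style values mark axis 1

out, w_s, w_c = separate_mixed_attention(q, k_s, v_s, k_c, v_c, lam=0.5)
```

Each branch is normalized on its own, so the style branch still carries a total weight of 1 even though all of its scores are negative. The blend therefore contains exactly the share of style fixed by `lam`, which is why the result depends so strongly on that hyperparameter.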
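[0054] S24. To avoid the above problems, the calculation of the attention score matrix must consider intra-class semantic differences and inter-class (semantic versus style) differences simultaneously, rather than calculating these components separately. This invention first rewrites the formula of step S23 in matrix form, obtaining the following expression:
[0055] $$f_{cs}^l = \begin{bmatrix} \lambda A_s^l & (1-\lambda) A_c^l \end{bmatrix} \begin{bmatrix} V_s^l \\ V_c^l \end{bmatrix}, \quad A_s^l = \mathrm{softmax}\!\left(\frac{Q_c^l (K_s^l)^\top}{\sqrt{d}}\right), \quad A_c^l = \mathrm{softmax}\!\left(\frac{Q_c^l (K_c^l)^\top}{\sqrt{d}}\right)$$
[0056] Here $d$ is the feature dimension. To achieve a unified attention score matrix calculation, this method forms a new matrix $S^l$ by merging $\lambda\, Q_c^l (K_s^l)^\top$ and $(1-\lambda)\, Q_c^l (K_c^l)^\top$. The present invention then rewrites $\begin{bmatrix} \lambda A_s^l & (1-\lambda) A_c^l \end{bmatrix}$ in the above formula as $A^l$:
[0057] $$S^l = \begin{bmatrix} \lambda\, Q_c^l (K_s^l)^\top & (1-\lambda)\, Q_c^l (K_c^l)^\top \end{bmatrix}, \quad A^l = \mathrm{softmax}\!\left(\frac{S^l}{\sqrt{d}}\right)$$
[0058] The above formula applies a single Softmax operation to both the inter-class and the intra-class information, avoiding the failure that may occur when the exponential function maps the attention score matrix to the attention weight matrix. The stylized content features $f_{cs}^l$ can then be obtained as follows:
[0059] $$f_{cs}^l = A^l \begin{bmatrix} V_s^l \\ V_c^l \end{bmatrix}$$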
[0060] The above formula enables the model to more adaptively aggregate semantic and style information, while reducing the strong dependence on the choice of λ.
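The effect of a unified Softmax can be sketched by concatenating style and content keys and values and normalizing once. This is a simplified illustration: any $\lambda$ weighting of the score blocks is left out, and the toy features and function names are assumptions.

```python
import numpy as np

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    """Numerically stable softmax along the given axis."""
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def unified_mixed_attention(q, k_s, v_s, k_c, v_c):
    """One softmax over the concatenated style and content score blocks."""
    d = q.shape[-1]
    k = np.concatenate([k_s, k_c], axis=0)   # style keys first, then content
    v = np.concatenate([v_s, v_c], axis=0)
    w = softmax(q @ k.T / np.sqrt(d))        # rows sum to 1 across both blocks
    return w @ v, w

# One query that matches the content keys and mismatches the style keys.
q = np.array([[1.0, 0.0]])
k_c = np.array([[4.0, 0.0], [3.0, 0.0]])
k_s = np.array([[-4.0, 0.0], [-5.0, 0.0]])
v_c = np.array([[1.0, 0.0], [1.0, 0.0]])
v_s = np.array([[0.0, 1.0], [0.0, 1.0]])

out, w = unified_mixed_attention(q, k_s, v_s, k_c, v_c)
style_mass = w[0, :k_s.shape[0]].sum()       # total weight given to style tokens
```

With a single normalization, style tokens that match the query poorly compete directly with content tokens and receive almost no weight, so the balance between the two sources is decided per query by the scores rather than by a fixed blend.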
[0061] The above formula uses each input region of the content features $f_c^l$ to query the most relevant information in the representative style statistics $f_s^l$. However, this method can sometimes lead to a mismatch between the style distributions of $f_{cs}^l$ and $f_s^l$, because it does not consider the global style distribution. For example, as shown in the first row of Figure 5, this mismatch manifests as a global color inconsistency between the generated stylized image and the reference style image.
[0062] Therefore, this invention introduces a style distribution normalization process, inspired by an existing method, that is applied before the above formula. This process can be expressed in the following mathematical form:
[0063] $$\hat{f}_c^l = \sigma(f_s^l)\, \frac{f_c^l - \mu(f_c^l)}{\sigma(f_c^l)} + \mu(f_s^l),$$
[0064] where $\mu(f_c^l)$, $\sigma(f_c^l)$ and $\mu(f_s^l)$, $\sigma(f_s^l)$ denote the mean and standard deviation of the content features $f_c^l$ and of the representative style statistics $f_s^l$, respectively. For example, as shown in Figure 5, after applying style distribution normalization, the global hue is more consistent with the style image than in the unnormalized case (top row). Finally, this invention refers to the above two processes collectively as normalized hybrid self-attention (normalized mixture of self-attention).
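The style distribution normalization can be sketched as a mean and standard deviation alignment. The sketch assumes features laid out as tokens × channels with statistics taken per channel over tokens, and adds a small `eps` for numerical safety; the layout, `eps`, and the function name are assumptions.

```python
import numpy as np

def style_distribution_normalize(f_c: np.ndarray, f_s: np.ndarray,
                                 eps: float = 1e-5) -> np.ndarray:
    """Shift and rescale content features to the style mean/std per channel."""
    mu_c = f_c.mean(axis=0, keepdims=True)
    sigma_c = f_c.std(axis=0, keepdims=True)
    mu_s = f_s.mean(axis=0, keepdims=True)
    sigma_s = f_s.std(axis=0, keepdims=True)
    # Whiten with content statistics, then re-color with style statistics.
    return sigma_s * (f_c - mu_c) / (sigma_c + eps) + mu_s

rng = np.random.default_rng(0)
f_c = rng.normal(loc=2.0, scale=3.0, size=(64, 8))    # hypothetical content features
f_s = rng.normal(loc=-1.0, scale=0.5, size=(48, 8))   # hypothetical style features
f_c_hat = style_distribution_normalize(f_c, f_s)
```

After this step the content features carry the global first- and second-order statistics of the style features, which is the property the text associates with a more consistent global hue.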
[0065] In summary, this invention leverages the self-consistency properties of latent consistency models (LCMs) to extract representative style statistics from reference style images to guide the stylization process. Furthermore, it introduces a normalized mixture of self-attention mechanism, enabling the model to query the most relevant style patterns from these style statistics and use them to adjust the content features of the intermediate output. This mechanism also ensures that the generated stylized results are highly consistent with the distribution of the reference style image. This invention can generate stylized images from text using only a single style reference image, offering faster inference speed and higher efficiency compared to other methods.
[0066] Extensive experiments have demonstrated the feasibility of the method presented in this invention. See Table 1 for a comparison of the effects of different text-to-image generation methods:
[0067] Table 1. Comparison of the effects of different text-to-image generation methods
[0068]
[0069] This invention also provides a fast stylized text-to-image generation system based on a diffusion model, comprising:
[0070] A style feature extraction module, used to take a reference style image with a resolution of 512×512×3 as input and extract representative style features from the reference style image using the self-consistency of the latent consistency model.
[0071] A stylized text-to-image generation module, used to guide the process of generating an image from text through the extracted representative style features and a normalized hybrid self-attention mechanism.
[0072] It should be understood that the fast stylized text-to-image generation system based on a diffusion model in this embodiment is used to implement the corresponding methods in the foregoing multiple method embodiments and has the beneficial effects of the corresponding method embodiments.
[0073] As another example, the present invention also provides an electronic device, which will now be described as an example of a hardware device that can be applied to various aspects of the present invention, serving as a server or client of the invention. The term "electronic device" is intended to represent various forms of digital electronic computer devices, such as laptop computers, desktop computers, workstations, personal digital assistants, servers, blade servers, mainframe computers, and other suitable computers. The electronic device can also represent various forms of mobile devices, such as personal digital processors, cellular phones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are merely illustrative and are not intended to limit the implementation of the invention described and / or claimed herein.
[0074] The electronic device may include a processor, a communications interface, memory, and a communications bus.
[0075] The processor, communication interface, and memory communicate with each other via a communication bus. The communication interface is used to communicate with other electronic devices or servers.
[0076] The processor is used to execute programs, specifically the relevant steps in the above method embodiments.
[0077] Specifically, the program may include program code, which includes computer operation instructions.
[0078] The processor may be a CPU, an Application Specific Integrated Circuit (ASIC), or one or more integrated circuits configured to implement embodiments of the present invention. The one or more processors included in a smart device may be of the same type, such as one or more CPUs; or they may be of different types, such as one or more CPUs and one or more ASICs.
[0079] Memory is used to store programs. Memory may include high-speed RAM, and may also include non-volatile memory, such as at least one disk drive.
[0080] When executed by a processor, the program is used to enable an electronic device to perform a fast stylized text-to-image generation method based on a diffusion model, as described in this invention.
[0081] Furthermore, the specific implementation of each step in the program can be found in the corresponding descriptions of the steps and units in the above method embodiments, and will not be repeated here. Those skilled in the art will clearly understand that, for the sake of convenience and brevity, the specific working processes of the devices and modules described above can be referred to the corresponding process descriptions in the foregoing method embodiments, and will not be repeated here.
[0082] An exemplary embodiment of the present invention also provides a computer storage medium storing a computer program, wherein when the computer program is executed by a processor, it implements the methods of the various embodiments of the present invention. The corresponding process descriptions in the foregoing method embodiments can be referred to, and will not be repeated here.
[0083] The methods described above according to embodiments of the present invention can be implemented in hardware, firmware, or as software or computer code that can be stored in a recording medium (such as a CD-ROM, RAM, floppy disk, hard disk, or magneto-optical disk), or as computer code originally stored on a remote recording medium or a non-transitory machine-readable medium and subsequently stored on a local recording medium, downloaded via a network. Thus, the methods described herein can be processed by software stored on a recording medium using a general-purpose computer, a dedicated processor, or programmable or dedicated hardware (such as an ASIC or FPGA). It is understood that the computer, processor, microprocessor controller, or programmable hardware includes storage components (e.g., RAM, ROM, flash memory, etc.) capable of storing or receiving software or computer code, which, when accessed and executed by the computer, processor, or hardware, implements the methods described herein. Furthermore, when a general-purpose computer accesses code used to implement the methods shown herein, the execution of the code transforms the general-purpose computer into a dedicated computer for executing the methods shown herein.
[0084] Specific embodiments of the invention have now been described. Other embodiments are within the scope of the appended claims. In some cases, the actions described in the claims can be performed in a different order and still achieve the desired result. Furthermore, the processes depicted in the drawings do not necessarily require a specific or sequential order to achieve the desired result. In some embodiments, multitasking and parallel processing can be advantageous.
[0085] It should be understood that although this specification is described according to various embodiments, not every embodiment contains only one independent technical solution. This way of describing the specification is only for clarity. Those skilled in the art should regard the specification as a whole. The technical solutions in each embodiment can also be appropriately combined to form other implementation methods that can be understood by those skilled in the art.
[0086] Finally, it should be noted that the above embodiments are only used to illustrate the embodiments of the present invention, and are not intended to limit the embodiments of the present invention. Those skilled in the art can make various changes and modifications without departing from the spirit and scope of the embodiments of the present invention. Therefore, all equivalent technical solutions also fall within the scope of the embodiments of the present invention, and the patent protection scope of the embodiments of the present invention should be defined by the claims.
Claims
1. A fast stylized text-to-image generation method based on a diffusion model, characterized in that it comprises: S1, style feature extraction: inputting a reference style image and extracting representative style features from the reference style image using the self-consistency of a latent consistency model; S2, stylized text-to-image generation: guiding the process of generating an image from text through the extracted representative style features and a normalized hybrid self-attention mechanism, including: in the Transformer feature layer of the diffusion model, fusing content features with the extracted representative style features to obtain intermediate output content features; and adjusting the intermediate output content features through the normalized hybrid self-attention mechanism to ensure that the generated stylized result is highly consistent with the distribution of the reference style image; wherein the normalized hybrid self-attention mechanism specifically involves: in the self-attention module of the Transformer feature layer, mapping the intermediate output content features to query features, key features, and value features; replacing the key features and value features with the key features and value features of the representative style features to obtain replaced key features and value features; calculating an attention score matrix using the query features and the replaced key features and value features; performing a unified Softmax operation on the attention score matrix to obtain a fused attention weight matrix; and performing a weighted summation of the value features based on the fused attention weight matrix to obtain the stylized content features as the stylization result; the stylized content features are expressed as: $$f_{cs}^l = \mathrm{softmax}\!\left(\frac{\begin{bmatrix} \lambda\, Q_c^l (K_s^l)^\top & (1-\lambda)\, Q_c^l (K_c^l)^\top \end{bmatrix}}{\sqrt{d}}\right) \begin{bmatrix} V_s^l \\ V_c^l \end{bmatrix}$$ where $\mathrm{softmax}(\cdot)$ represents the softmax activation function and $\lambda$ represents the weight hyperparameter; the normalized hybrid self-attention mechanism further includes: before performing the unified Softmax operation on the attention score matrix, performing style distribution normalization on the content features, as follows: $$\hat{f}_c^l = \sigma(f_s^l)\, \frac{f_c^l - \mu(f_c^l)}{\sigma(f_c^l)} + \mu(f_s^l)$$ where $\mu(f_s^l)$ denotes the mean of the representative style features $f_s^l$, $\mu(f_c^l)$ denotes the mean of the content features $f_c^l$, $\sigma(f_c^l)$ denotes the standard deviation of the content features $f_c^l$, and $\sigma(f_s^l)$ denotes the standard deviation of the representative style features $f_s^l$.
2. The method according to claim 1, characterized in that extracting representative style features from the input reference style image using the self-consistency of the latent consistency model comprises: mapping the reference style image to a low-dimensional latent space through a pre-trained variational autoencoder to obtain an initial latent code; adding Gaussian noise to the initial latent code using the diffusion process noise injection formula to obtain a noisy latent code; inputting the noisy latent code into the noise prediction network of the latent consistency model; and extracting the style statistical features of each Transformer layer through a single-step denoising operation as the representative style features; wherein the noise prediction network achieves a stable mapping of features at adjacent time steps by minimizing a consistency loss function.
3. The method according to claim 2, characterized in that adding Gaussian noise to the initial latent code using the diffusion process noise injection formula to obtain the noisy latent code is expressed by a formula (omitted in this text) defined in terms of the noisy latent code, the diffusion time step, the Gaussian noise, and the initial latent code.
4. A fast stylized text-to-image generation system based on a diffusion model, characterized in that it comprises:
a style feature extraction module, configured to take a reference style image as input and extract representative style features from the reference style image using the self-consistency of the latent consistency model;
a stylized text-to-image generation module, configured to guide the process of generating images from text using the extracted representative style features and a normalized hybrid self-attention mechanism, which includes: in the Transformer feature layer of the diffusion model, fusing content features with the extracted representative style features to obtain intermediate output content features; and adjusting the intermediate output content features using the normalized hybrid self-attention mechanism so that the generated stylized result is highly consistent with the distribution of the reference style image;
wherein the normalized hybrid self-attention mechanism comprises: in the self-attention module of the Transformer feature layer, mapping the intermediate output content features to query features, key features, and value features; replacing the key features and value features with the key features and value features of the representative style features to obtain replaced key and value features; calculating an attention score matrix using the query features and the replaced key and value features; performing a unified Softmax operation on the attention score matrix to obtain a fused attention weight matrix; and computing a weighted sum of the value features according to the fused attention weight matrix to obtain stylized content features as the stylized result;
the stylized content features are expressed by a formula (omitted in this text) defined in terms of the softmax activation function and a weight hyperparameter;
the normalized hybrid self-attention mechanism further comprises: before performing the unified Softmax operation on the attention score matrix, normalizing the content features to the style distribution by a formula (omitted in this text) defined in terms of the mean of the representative style features, the mean of the content features, the standard deviation of the content features, and the standard deviation of the representative style features.
5. An electronic device, characterized in that it comprises: a processor; and a memory storing a program; wherein the program includes instructions that, when executed by the processor, cause the processor to perform the steps of the method according to any one of claims 1-3.
6. A computer storage medium, characterized in that it stores a computer program that, when executed by a processor, implements the method according to any one of claims 1-3.
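The formulas referenced in claims 1 and 3 are not present in this text, so the following Python sketch is only one plausible reading of the claimed steps, not the patented implementation. It rests on several assumptions: the style-distribution normalization is taken to be an AdaIN-style shift and scale built from the four statistics the claim names; the weight hyperparameter (here `lam`) is assumed to scale the attention logits; the noise injection is assumed to be the standard forward-diffusion form with a cumulative coefficient `alpha_bar_t`; and the query is assumed to be projected from the normalized content features. All function and parameter names are hypothetical, and plain Python lists stand in for real Transformer features.

```python
import math
import random


def mean_std(rows):
    """Per-channel mean and (population) standard deviation over tokens."""
    n, dims = len(rows), len(rows[0])
    mu = [sum(r[j] for r in rows) / n for j in range(dims)]
    sd = [math.sqrt(sum((r[j] - mu[j]) ** 2 for r in rows) / n)
          for j in range(dims)]
    return mu, sd


def normalize_to_style(content, style, eps=1e-6):
    """ASSUMED AdaIN-style form: give content features the style mean/std."""
    mu_c, sd_c = mean_std(content)
    mu_s, sd_s = mean_std(style)
    return [[sd_s[j] * (r[j] - mu_c[j]) / (sd_c[j] + eps) + mu_s[j]
             for j in range(len(r))] for r in content]


def softmax(row):
    """Numerically stable softmax over one row of logits."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def matmul(a, b):
    """Multiply an (n x k) matrix by a (k x m) matrix, both as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def hybrid_attention(content, style, w_q, w_k, w_v, lam=1.0):
    """Query from normalized content; key and value replaced by style's."""
    q = matmul(normalize_to_style(content, style), w_q)
    k = matmul(style, w_k)  # replaced key features
    v = matmul(style, w_v)  # replaced value features
    d = len(q[0])
    # ASSUMPTION: the weight hyperparameter `lam` scales the logits.
    scores = [[lam * sum(qi[c] * kj[c] for c in range(d)) / math.sqrt(d)
               for kj in k] for qi in q]
    weights = [softmax(row) for row in scores]  # one softmax per query row
    return matmul(weights, v), weights


def add_noise(z0, alpha_bar_t, rng):
    """ASSUMED forward step: z_t = sqrt(a) * z0 + sqrt(1 - a) * eps."""
    return [math.sqrt(alpha_bar_t) * z
            + math.sqrt(1.0 - alpha_bar_t) * rng.gauss(0.0, 1.0)
            for z in z0]


# Toy usage: 5 content tokens and 6 style tokens with 4 channels each.
rng = random.Random(0)
content = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(5)]
style = [[rng.gauss(2, 3) for _ in range(4)] for _ in range(6)]
identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
out, weights = hybrid_attention(content, style, identity, identity, identity)
```

In this sketch each output token is a convex combination of the style value features, which is one way the result could stay close to the reference style distribution; a real system would take the projections and features from the pre-trained diffusion model rather than from identity matrices and random data.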