Method, apparatus, device and storage medium for visual generation

By introducing a guiding module into the diffusion model and dynamically adjusting the influence of conditional information, the problems of low inference efficiency and distribution bias in the diffusion model are solved, and efficient and stable visual generation is achieved.

CN122199713A · Pending · Publication Date: 2026-06-12 · BEIJING ZITIAO NETWORK TECH CO LTD +1

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
BEIJING ZITIAO NETWORK TECH CO LTD
Filing Date
2024-12-11
Publication Date
2026-06-12

Abstract

According to embodiments of the present disclosure, methods, apparatuses, devices and storage media for visual generation are provided. The method comprises: generating, by using a trained content generation model, visual content based on conditional information and by iteratively performing a plurality of inference operations, wherein the performing of a given inference operation of the plurality of inference operations comprises: determining, by using a guidance module associated with the content generation model, a guidance distribution embedding representation for the given inference operation based on a time step number of content generation of the content generation model and a predetermined guidance ratio, wherein parameters of the guidance module are determined in a model distillation process of the content generation model; and generating a processing result corresponding to the given inference operation based on the guidance distribution embedding representation and the conditional information, or based on the guidance distribution embedding representation and a processing result corresponding to a previous inference operation.

Description

Technical Field

[0001] The exemplary embodiments disclosed herein relate generally to the field of computers, and more particularly to methods, apparatuses, devices, and computer-readable storage media for generating visual content.

Background Technology

[0002] With the development of machine learning technology, more and more models are being developed to complete various tasks, including visual generation tasks. In visual generation tasks, machine learning models are trained to generate images or videos that match the content described by the input text. Because visual content such as images or videos contains a wealth of information, generating it is challenging for machine learning models. Currently, to meet the requirements of video generation quality, machine learning models are being constructed with increasingly complex structures and a growing number of parameters to support the demands of complex video generation tasks. Among these models, diffusion models are widely used, but their generation process requires multiple iterations, and both their model complexity and their inference efficiency leave room for improvement.

Summary of the Invention

[0003] In a first aspect of this disclosure, a method for visual content generation is provided. The method includes: generating visual content using a trained content generation model, based on conditional information and by iteratively executing multiple inference operations, wherein the execution of a given inference operation includes: determining a guided distribution embedding representation for the given inference operation based on a time step number of content generation of the content generation model and a predetermined guidance ratio, using a guidance module associated with the content generation model, wherein parameters of the guidance module are determined during model distillation of the content generation model, and the predetermined guidance ratio is configured to control a ratio between: the content generation process guided by the conditional information, and the content generation process without conditions; and generating a processing result corresponding to the given inference operation based on the guided distribution embedding representation and the conditional information, or based on the guided distribution embedding representation and the processing result corresponding to the previous inference operation.

[0004] In a second aspect of this disclosure, an apparatus for visual content generation is provided. The apparatus includes: a content generation module configured to generate visual content using a trained content generation model, based on conditional information and by iteratively executing a plurality of inference operations, wherein the execution of a given inference operation comprises: determining a guided distribution embedding representation for the given inference operation based on a time step number of content generation of the content generation model and a predetermined guidance ratio, using a guidance module associated with the content generation model, wherein parameters of the guidance module are determined during model distillation of the content generation model, and the predetermined guidance ratio is configured to control a ratio between: a content generation process guided by the conditional information, and a content generation process without conditions; and generating a processing result corresponding to the given inference operation based on the guided distribution embedding representation and the conditional information, or based on the guided distribution embedding representation and the processing result corresponding to the previous inference operation.

[0005] In a third aspect of this disclosure, an electronic device is provided. The device includes at least one processing unit; and at least one memory coupled to the at least one processing unit and storing instructions for execution by the at least one processing unit. When executed by the at least one processing unit, the instructions cause the device to perform the method of the first aspect.

[0006] In a fourth aspect of this disclosure, a computer-readable storage medium is provided. The computer-readable storage medium stores a computer program that can be executed by a processor to implement the method of the first aspect.

[0007] In a fifth aspect of this disclosure, a computer program product is provided. The computer program product is tangibly stored in a computer storage medium and includes computer-executable instructions that, when executed by a device, cause the device to perform the method of the first aspect.

[0008] It should be understood that the content described in this Summary section is not intended to identify key or essential features of the embodiments of this disclosure, nor is it intended to limit the scope of this disclosure. Other features of this disclosure will become readily apparent from the following description.

Brief Description of the Drawings

[0009] The above and other features, advantages, and aspects of the embodiments of this disclosure will become more apparent from the accompanying drawings and the following detailed description. In the drawings, the same or similar reference numerals denote the same or similar elements, wherein:

[0010] Figure 1 shows a schematic diagram of an example environment in which embodiments of the present disclosure can be implemented;

[0011] Figure 2 shows a schematic diagram of a model architecture according to some embodiments of the present disclosure;

[0012] Figure 3 shows a schematic diagram of the structure of a guidance module according to some embodiments of the present disclosure;

[0013] Figure 4 shows a flowchart of a process for visual content generation according to some embodiments of the present disclosure;

[0014] Figure 5 shows a schematic structural block diagram of an apparatus for visual content generation according to some embodiments of the present disclosure; and

[0015] Figure 6 shows a block diagram of an electronic device in which one or more embodiments of the present disclosure may be implemented.

Detailed Description

[0016] It is understood that before using the technical solutions disclosed in the various embodiments of this disclosure, users should be informed of the types, scope of use, and usage scenarios of the personal information involved in this disclosure in an appropriate manner in accordance with relevant laws and regulations, and user authorization should be obtained.

[0017] For example, upon receiving a user's active request, a prompt message is sent to the user to explicitly inform them that the requested operation will require the acquisition and use of the user's personal information. This allows the user to independently choose whether to provide personal information to the software or hardware, such as the electronic device, application, server, or storage medium performing the operations of this disclosed technical solution, based on the prompt message.

[0018] As an optional but non-limiting implementation, in response to a user's active request, sending a prompt message to the user can be done via a pop-up window, where the prompt message can be presented in text format. Furthermore, the pop-up window can also include a selection control allowing the user to choose "agree" or "disagree" to provide personal information to the electronic device.

[0019] It is understood that the above notification and user authorization process are merely illustrative and do not constitute a limitation on the implementation of this disclosure. Other methods that comply with relevant laws and regulations may also be applied to the implementation of this disclosure.

[0020] It is understood that the data involved in this technical solution (including but not limited to the data itself, the acquisition or use of the data) shall comply with the requirements of relevant laws, regulations and related provisions.

[0021] Embodiments of this disclosure will now be described in more detail with reference to the accompanying drawings. While some embodiments of this disclosure are shown in the drawings, it should be understood that this disclosure can be implemented in various forms and should not be construed as limited to the embodiments set forth herein. Rather, these embodiments are provided to provide a more thorough and complete understanding of this disclosure. It should be understood that the accompanying drawings and embodiments of this disclosure are for illustrative purposes only and are not intended to limit the scope of protection of this disclosure.

[0022] It should be noted that the headings of any section / subsection provided herein are not limiting. Various embodiments are described throughout this document, and embodiments of any type may be included under any section / subsection. Furthermore, embodiments described in any section / subsection may be combined in any way with any other embodiments described in the same section / subsection and / or different sections / subsections.

[0023] In this document, unless explicitly stated otherwise, performing a step in response to A does not mean that the step is performed immediately after A, but may include one or more intermediate steps.

[0024] In the description of embodiments of this disclosure, the term "comprising" and similar terms should be understood as open-ended inclusion, i.e., "including but not limited to". The term "based on" should be understood as "at least partially based on". The term "one embodiment" or "the embodiment" should be understood as "at least one embodiment". The term "some embodiments" should be understood as "at least some embodiments". The terms "first", "second", etc., may refer to different or the same objects. Other explicit and implicit definitions may also be included below.

[0025] As used herein, the term "model" refers to a system that learns the relationship between inputs and outputs from training data, enabling it to generate corresponding outputs for a given input after training. Model generation can be based on machine learning techniques. Deep learning is a class of machine learning algorithms that use multiple layers of processing units to process inputs and provide corresponding outputs. Herein, a "model" may also be referred to as a "machine learning model," a "machine learning network," or simply a "network," and these terms are used interchangeably. A model can also include different types of processing units or networks.

[0026] As used herein, a “unit,” “operation unit,” or “subunit” can consist of any suitable machine learning model or network. As used herein, a set of elements or similar expressions can include one or more such elements. For example, “a set of convolutional units” can include one or more convolutional units.

[0027] Machine learning typically comprises three phases: training, testing, and application (also known as inference). In the training phase, a given model is trained using a large amount of training data, iteratively updating parameter values until the model can consistently generate inferences that meet the expected goals from the training data. Through training, the model can be considered to have learned the relationship between inputs and outputs (also known as an input-output mapping) from the training data. The parameter values of the trained model are determined. In the testing phase, test inputs are applied to the trained model to test whether it can provide the correct output, thus determining the model's performance. The testing phase can sometimes be integrated into the training phase. In the application or inference phase, the trained model can be used to process actual model inputs based on the trained parameter values to determine the corresponding model output.

[0028] Diffusion models, also known as diffusion probabilistic models, are a type of generative model. These models generate data by simulating a diffusion process, inspired by physical processes such as thermal diffusion. Diffusion models include a forward diffusion process and a reverse diffusion process. A diffusion model generates new data samples by simulating a forward diffusion process with progressively added noise and then learning how to reverse this process.

[0029] In the forward diffusion process, noise is gradually added to the data, making it increasingly random through a series of steps until it resembles pure noise. This process can be viewed as a Markov chain, where Gaussian noise is added to the data at each step. The forward diffusion process can be represented as:

$$q(x_t \mid x_{t-1}) = \mathcal{N}\left(x_t;\ \sqrt{\alpha_t}\, x_{t-1},\ (1-\alpha_t) I\right)$$

where $x_t$ is the noisy data at step $t$, and $\alpha_t$ controls the amount of noise added. The forward diffusion process is performed during model training, and the data to which noise is added are the training samples.
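The Markov-chain noising described above can be sketched in a few lines. This is a minimal illustration, assuming the common parameterization in which each step scales the data by sqrt(alpha_t) and adds Gaussian noise with variance 1 - alpha_t; the function names and the constant schedule are hypothetical and not taken from this disclosure.

```python
import math
import random


def forward_diffusion_step(x_prev, alpha_t, rng):
    """One noising step: x_t = sqrt(alpha_t) * x_{t-1} + sqrt(1 - alpha_t) * noise."""
    keep = math.sqrt(alpha_t)          # how much of the previous data survives
    spread = math.sqrt(1.0 - alpha_t)  # standard deviation of the added noise
    return [keep * v + spread * rng.gauss(0.0, 1.0) for v in x_prev]


def forward_diffusion(x0, alphas, rng):
    """Apply the chain step by step; each step depends only on the previous one."""
    x = list(x0)
    for alpha_t in alphas:
        x = forward_diffusion_step(x, alpha_t, rng)
    return x


rng = random.Random(0)
x0 = [1.0, -2.0, 0.5]      # stand-in for a training sample
alphas = [0.99] * 10       # hypothetical constant schedule, for illustration only
x_T = forward_diffusion(x0, alphas, rng)
```

With alpha_t equal to 1 a step leaves the data unchanged, and smaller values inject more noise per step.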

[0030] In the reverse diffusion process (or reverse denoising), the model learns how to reverse the steps of adding noise. Starting with pure noise, the diffusion model progressively removes the noise, generating data that matches the training distribution. The reverse diffusion process is typically simulated using a neural network that predicts the amount of noise added at each step. The reverse diffusion process can be represented as:

$$p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\left(x_{t-1};\ \mu_\theta(x_t, t),\ \sigma_\theta^2(x_t, t) I\right)$$

where $\mu_\theta$ and $\sigma_\theta$ are learned by the model. After model training is complete, the model performing the reverse diffusion process can first sample from the noise distribution and then iteratively denoise the sample until the desired data is obtained.

[0031] In a diffusion model, a time step refers to one step of the forward diffusion process at which noise is added. The total number of steps, T, is usually a preset value representing how many steps are needed to transform the raw data into pure noise. At each time step t, Gaussian noise is added to the data according to a predetermined noise schedule. This process is sequential, and each step depends on the result of the previous step. In reverse diffusion, the time step gradually decreases from the total number of steps, T. For example, if the total number of time steps in the diffusion model is 1000, then in reverse diffusion the denoising operation is performed first at step 1000, then at step 999, and so on until the denoising operation at step 1, yielding the desired data.

[0032] In data generation, an inference step, or inference operation, of a diffusion model refers to one of the steps performed to recover the original data from pure noise during reverse diffusion. The number of inference steps directly affects the quality and speed of the generated data. Generally, more inference steps result in higher quality data, but also increase computational cost and time. In practical applications, the number of inference steps can be adjusted to balance generation quality and efficiency. In some embodiments, inference steps correspond to time steps, and each inference step can correspond to one or more time steps. For example, if the total number of time steps in the diffusion model is 1000, and the number of inference steps is set to 50, then each inference step can correspond to 20 time steps.
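The arithmetic in the example above (1000 time steps, 50 inference steps, 20 time steps per inference step) can be made concrete with a small schedule helper. This is a sketch under the assumption of a uniform stride; `inference_schedule` is a hypothetical name, and real samplers may space their steps differently.

```python
def inference_schedule(total_time_steps, num_inference_steps):
    """Return the time step visited by each inference step, from T down to the stride."""
    if total_time_steps % num_inference_steps != 0:
        raise ValueError("this sketch assumes the stride divides the total evenly")
    stride = total_time_steps // num_inference_steps
    # Count down from T; each entry stands for a block of `stride` time steps.
    return list(range(total_time_steps, 0, -stride))


schedule = inference_schedule(1000, 50)
stride = schedule[0] - schedule[1]
```

Here `schedule` starts at time step 1000 and ends at time step 20, so each of the 50 inference steps covers 20 time steps.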

[0033] Figure 1 shows a schematic diagram of an example environment 100 in which embodiments of the present disclosure can be implemented. In the environment 100 of Figure 1, it is desired to train and use a content generation model 130 configured for various application environments. In some embodiments, the content generation model 130 is configured to perform a visual generation task, which can generate visual content such as images or videos based on input conditional information. Such a machine learning model is sometimes also referred to as a content generation model. The input conditional information may be in text format.

[0034] As shown in Figure 1, environment 100 includes a model training system 150, a model distillation system 160, and a model application system 170. Before training, the parameter values of content generation model 130 can have initial values. During training, the parameter values of content generation model 130 can be updated and adjusted to obtain a trained content generation model 130. During the model training phase, content generation model 130 can be trained using model training system 150 based on a sample pair set 110 including multiple sample pairs 112 used for model training. Each sample pair used for training content generation model 130 can include sample condition information 120 serving as model input and sample visual content 122 (e.g., images or videos) corresponding to the sample condition information 120.

[0035] In some embodiments, the content generation model 130 can be implemented based on a diffusion model. During training, the diffusion-based content generation model 130 is trained by repeatedly adding noise to known data and predicting the added noise. During content generation, the diffusion-based content generation model 130 repeatedly performs a denoising process starting from random noise under certain constraints (e.g., the input conditional information is one of the constraints), thereby generating images or videos that meet the conditions.

[0036] In some cases, the trained machine learning models are quite complex with many parameters, which is particularly significant for diffusion models because they require multiple iterations to generate content. This leads to high computational and memory resource consumption and slow computation speed during model application. To accelerate model inference, in some embodiments, various model acceleration techniques (also known as model compression techniques), including model distillation and pruning, can be used to further simplify the trained machine learning model, thereby reducing the complexity of the model. As shown in Figure 1, after the content generation model 130 is trained, it can be distilled to obtain an accelerated machine learning model 144. During the model distillation stage, the parameter values of the content generation model 130 are further updated, for example, simplified, to obtain a smaller-scale machine learning model.

[0037] The basic principle of model distillation is to introduce a teacher model and transfer knowledge from this complex model to a simplified student model. The teacher model is typically a large, complex model with high accuracy and generalization performance, while the student model is a smaller, simpler model with lower computational and storage costs. The student model is the final model that needs to be deployed in the application. In the example of Figure 1, the student model refers to the content generation model 130, while the teacher model refers to the baseline model 140 introduced in the distillation stage.

[0038] After distillation, the trained content generation model 130 is provided for model application. In the model application phase, the model application system 170 can perform the content generation process using the trained content generation model 130 based on the input conditional information 172 to obtain visual content 174.

[0039] In Figure 1, the model training system 150, the model distillation system 160, and / or the model application system 170 may include any suitable computing system with computing capabilities, such as various computing devices / systems, terminal devices, servers, etc. Terminal devices may involve any type of mobile terminal, fixed terminal, or portable terminal, including mobile phones, desktop computers, laptop computers, notebook computers, netbook computers, tablet computers, media computers, multimedia tablets, or any combination thereof, including accessories and peripherals of these devices or any combination thereof. Servers include, but are not limited to, mainframes, edge computing nodes, computing devices in cloud environments, etc.

[0040] It should be understood that the components and arrangements in the environment 100 shown in Figure 1 are merely examples, and a computing system suitable for implementing the exemplary implementations described in this disclosure may include one or more different components, other components, and / or different arrangements. Although shown as different systems, in some embodiments, the model training system 150, the model distillation system 160, and / or the model application system 170 may also be integrated into the same device or system. Implementations of this disclosure are not limited in this respect.

[0041] To meet the demands of high-quality video generation, machine learning models are being constructed with increasingly complex structures and a growing number of parameters to support the requirements of complex video generation tasks. Therefore, model distillation techniques are needed to accelerate these models. Diffusion models are a class of generative models with outstanding performance, excelling in image generation, video generation, and other tasks, capable of generating high-quality, detailed content. Diffusion models generate data through a progressive inverse denoising process, with each step relying on complex computational processes to ensure the realism and diversity of the generated results.

[0042] In the content generation process of diffusion models, classifier-free guidance (CFG) is one of the key techniques for improving the performance of diffusion models. CFG balances generation quality and diversity during the generation process by introducing a control parameter (i.e., the guidance ratio). Specifically, CFG allows the model to generate data based on given conditional information (such as text descriptions or image labels) without requiring an additional classifier. CFG combines information from conditional models (related to conditions such as text prompts) and unconditional models (generated without conditions), thereby improving the alignment between the generated content and the input conditions.

[0043] The CFG method involves training a model capable of handling both unconditional generation (without specific conditions) and conditional generation (given specific conditions, such as text or labels). During training, the model leaves the conditions empty with a certain probability, thus learning to generate data with and without conditions. During inference, CFG uses a linear combination of conditional and unconditional score estimates. This means that the model considers the balance between unconditional and conditional generation during the generation process. The generation quality can be tuned using a guiding coefficient (CFG Scale), which controls the degree to which conditional information influences the generated samples.
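The two ingredients described above, leaving the condition empty with some probability during training and linearly combining the two estimates at inference, can be sketched as follows. This is a minimal illustration using one widely used form of the combination, uncond + scale * (cond - uncond); the function names, the drop probability, and the toy prediction values are assumptions and not taken from this disclosure.

```python
import random


def maybe_drop_condition(condition, drop_prob, rng):
    """Training-time trick: replace the condition by None with probability drop_prob."""
    return None if rng.random() < drop_prob else condition


def cfg_combine(pred_cond, pred_uncond, scale):
    """Inference-time linear combination of conditional and unconditional estimates."""
    return [u + scale * (c - u) for c, u in zip(pred_cond, pred_uncond)]


# Toy per-element predictions standing in for the two model outputs.
pred_cond = [3.0, -1.0]
pred_uncond = [1.0, 1.0]
guided = cfg_combine(pred_cond, pred_uncond, 2.0)

rng = random.Random(0)
kept = maybe_drop_condition("a bird on a branch", 0.0, rng)
dropped = maybe_drop_condition("a bird on a branch", 1.0, rng)
```

In this form a scale of 0 reproduces the unconditional estimate, a scale of 1 reproduces the conditional estimate, and larger scales push the result further toward the condition.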

[0044] However, CFG also introduces greater overhead during inference. Current diffusion models typically require 20-50 inference steps to generate satisfactory content. With CFG, each step requires evaluating both the conditional and the unconditional model, doubling the number of function evaluations (NFE) and significantly increasing inference time. In video generation tasks, this overhead is even more pronounced, limiting real-time generation and interactive applications.
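The cost argument above is simple arithmetic and can be checked directly. The helper below is a hypothetical illustration that counts model evaluations, assuming one forward pass per step without CFG and two (conditional plus unconditional) with CFG.

```python
def function_evaluations(num_inference_steps, uses_cfg):
    """Count model forward passes for a full generation run."""
    passes_per_step = 2 if uses_cfg else 1  # CFG needs a conditional and an unconditional pass
    return num_inference_steps * passes_per_step


nfe_without_cfg = function_evaluations(50, uses_cfg=False)
nfe_with_cfg = function_evaluations(50, uses_cfg=True)
```

For a 50-step run this gives 50 evaluations without CFG and 100 with it, which is the doubling that a single-pass distilled model aims to avoid.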

[0045] Various CFG distillation schemes for model acceleration have been proposed. These schemes reduce inference steps and accelerate generation by transferring the performance of the teacher model to a lightweight student model. Some distillation methods typically use a fixed CFG ratio during training. While this simplifies the implementation, it ignores the impact of different guidance ratios on model performance, resulting in a lack of flexibility in the inference phase. Moreover, a fixed CFG ratio can limit the accuracy of semantic content generation in the early stages of inference and may introduce unnecessary computational redundancy in the later stages. In addition, different generation tasks (such as text-to-image and text-to-video) and generation goals (such as high-quality or fast generation) have different requirements for CFG, and existing technologies cannot dynamically adjust the guidance ratio during the inference phase to adapt to diverse task requirements. Furthermore, the distribution of generated data varies significantly under different CFG ratios, and existing methods struggle to capture these distribution differences during training over a wide range of ratios, leading to unstable model performance in high- or low-ratio scenarios and exhibiting distribution bias issues.

[0046] In view of this, embodiments of the present disclosure propose a scheme for visual generation. According to various embodiments of the present disclosure, visual content is generated by using a trained content generation model, based on conditional information, and through iterative execution of multiple inference operations. The execution of a given inference operation includes: determining a guided distribution embedding representation for the given inference operation based on a time step number of content generation of the content generation model and a predetermined guidance ratio, using a guidance module associated with the content generation model. The parameters of the guidance module are determined during the model distillation process of the content generation model. The predetermined guidance ratio is configured to control the ratio between: the content generation process guided by the conditional information, and the content generation process without conditions. Depending on whether the given inference operation is the first operation or an intermediate operation, a processing result corresponding to the given inference operation is generated based on the guided distribution embedding representation and the conditional information, or based on the guided distribution embedding representation and the processing result corresponding to the previous inference operation.

[0047] In the embodiments of this disclosure, by introducing parameter learning of the guidance module during the model distillation process, the guidance module dynamically controls the influence of conditional information on the content generation process based on the current time step and guidance ratio in each inference. This alleviates the problems of insufficient model flexibility and difficulty in adapting to different task requirements caused by a fixed guidance ratio. In addition, it can significantly improve inference efficiency, especially solving the problem of redundant computation when combined with CFG technology, eliminating the need to execute conditional generation and unconditional generation processes in two separate paths.
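The control flow of this scheme can be sketched as a loop in which each inference operation first obtains an embedding from the time step and the guidance ratio, then runs a single model evaluation. This is a toy illustration only: the internal structure of the guidance module is not specified in this passage, so the sinusoidal embedding, the absence of learned parameters, and the names `guidance_module`, `generate`, and `toy_step` are all assumptions.

```python
import math


def sinusoidal(value, dim):
    """Fixed sinusoidal features of a scalar (an assumed, common embedding choice)."""
    half = dim // 2
    freqs = [1.0 / (10000.0 ** (i / half)) for i in range(half)]
    return [math.sin(value * f) for f in freqs] + [math.cos(value * f) for f in freqs]


def guidance_module(time_step, guidance_ratio, dim=8):
    """Hypothetical stand-in: embed the time step and the guidance ratio, then sum them."""
    t_emb = sinusoidal(time_step, dim)
    s_emb = sinusoidal(guidance_ratio, dim)
    return [a + b for a, b in zip(t_emb, s_emb)]


def generate(condition, schedule, guidance_ratio, step_fn):
    """Run one model evaluation per inference operation, with no separate unconditional pass."""
    result = None
    evaluations = 0
    for time_step in schedule:
        embedding = guidance_module(time_step, guidance_ratio)
        # The first operation starts from the conditional information,
        # later operations start from the previous processing result.
        step_input = condition if result is None else result
        result = step_fn(step_input, embedding)
        evaluations += 1
    return result, evaluations


def toy_step(step_input, embedding):
    """Placeholder for the distilled model; not a real denoiser."""
    shift = sum(embedding) / len(embedding)
    return [0.5 * v + 0.1 * shift for v in step_input]


output, evaluations = generate([1.0, 2.0], [1000, 750, 500, 250], 7.5, toy_step)
```

The point of the sketch is the evaluation count: four inference operations cost four model evaluations, and changing the guidance ratio only changes the embedding passed to each step.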

[0048] Furthermore, this solution addresses the issue of distribution bias, which makes it difficult for the model to operate stably under various guidance ratios, particularly leading to performance degradation in video generation tasks. This is because the guidance module learns under different guidance ratios during model distillation, thereby minimizing distribution bias and achieving better generation results during inference.

[0049] The following description will continue with reference to the accompanying drawings, which will provide some exemplary embodiments of this disclosure.

[0050] Figure 2 shows a schematic diagram of a model architecture 200 according to some embodiments of the present disclosure. The model architecture 200 can be implemented at the model distillation system 160 of Figure 1. Figure 2 illustrates the use of a baseline model 140 to perform model distillation on the content generation model 130, resulting in a content generation model that meets requirements. After model distillation, the model parameters of the content generation model 130 and the model parameters of the guidance module 210, which is additionally trained, are determined, enabling their use in model inference. Model distillation of the content generation model 130 and the guidance module 210 can be implemented at the model distillation system 160 of Figure 1, and the model application of the content generation model 130 and the guidance module 210 can be implemented at the model application system 170 of Figure 1. The following will first describe the model distillation process of the content generation model, and then describe the model inference process.

[0051] In some embodiments, the content generation model 130 is a model that completes the content generation process by iteratively executing multiple inference operations. The input to each inference operation may include the processing result of the previous inference operation (for the initial inference operation, its input includes the model input of the content generation model 130). The processing result of the last inference operation is the model output of the content generation model 130, i.e., the final generated visual content. The model input of the content generation model 130 may include conditional information for controlling content generation, which may include a textual description of the visual content to be generated, classification labels, etc. For example, if it is desired to generate an image of a bird, the conditional information may describe the content to be included in the image, or may be a classification label for a given bird. This conditional information may be referred to as the prompt information of the content generation model 130.

[0052] In embodiments of this disclosure, during the model distillation process, the baseline model 140 may be referred to as the teacher model, and the content generation model 130 may be referred to as the student model. The baseline model 140 is a trained model capable of generating data of expected quality. In some embodiments, the baseline model 140 is also a content generation model. In some embodiments, both the baseline model 140 and the content generation model 130 may be constructed based on a diffusion model, and therefore their content generation process can be shown as a denoising process. Figure 2 shows the denoising process 202 of the baseline model 140 and the denoising process 204 of the content generation model 130.

[0053] In some embodiments, the baseline model 140 performs the content generation process based on CFG. During the content generation process, the baseline model 140 can control the content generation process based on a guidance ratio s. This guidance ratio is configured to control the ratio between content generation guided by conditional information and content generation without conditions. That is, the guidance ratio (also known as the CFG ratio) is a value that controls the degree of influence of conditional information on the content generation process. When the guidance ratio is set to 0, content generation is unguided (i.e., conditional information is ignored), while a higher guidance ratio makes the generated content adhere more closely to the conditional information. It is generally desirable to control the ratio between conditional and unconditional generation processes. Unconditional generation allows the model some degree of freedom in content generation, while conditional generation allows the generated content to match the input conditional information more closely.
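
As a minimal sketch of how a guidance ratio s weights conditional against unconditional generation, the following Python snippet uses the common classifier-free guidance combination, uncond + s * (cond - uncond). This specific formula and the function name `cfg_combine` are assumptions made for illustration; the disclosure does not fix the exact form. With s = 0 the result equals the unconditional prediction, which matches the unguided case described above.

```python
from typing import List


def cfg_combine(cond: List[float], uncond: List[float], s: float) -> List[float]:
    """Blend conditional and unconditional predictions with guidance ratio s.

    Assumed form (common CFG convention): uncond + s * (cond - uncond).
    """
    if len(cond) != len(uncond):
        raise ValueError("predictions must have the same length")
    return [u + s * (c - u) for c, u in zip(cond, uncond)]


cond_pred = [1.0, 2.0]    # toy conditional prediction
uncond_pred = [0.0, 1.0]  # toy unconditional prediction
print(cfg_combine(cond_pred, uncond_pred, 0.0))  # equals uncond_pred (unguided)
print(cfg_combine(cond_pred, uncond_pred, 1.0))  # equals cond_pred
print(cfg_combine(cond_pred, uncond_pred, 3.0))  # pushed further toward the condition
```

Larger values of s move the output further along the direction from the unconditional to the conditional prediction.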

[0054] In some embodiments, the baseline model 140 can be a trained content generation model. During the model distillation phase, both the content generation model 130 and the baseline model 140 can be initialized using the same trained content generation model. However, during model distillation, the model parameters of the baseline model 140 remain unchanged, while the model parameters of the content generation model 130 are continuously updated. The model distillation process continuously reduces the number of model parameters in the content generation model 130, thereby reducing the model size of the content generation model 130 and improving inference efficiency. The optimization objective of model distillation is to ensure that, with a reduction in the number of model parameters, the difference between the visual content generated by the content generation model 130 and the visual content generated by the baseline model 140 remains within a target range; that is, it is expected that the reduction in the number of model parameters will not come at the cost of a significant decrease in content quality.

[0055] During the model training phase, samples can include a mixture of image and video data, with each sample including an image or video and its corresponding conditional information. During training, noise is added to the input image/video under different text-form conditional information, and the content generation model is used to predict the denoising function. In the model distillation phase, the student model (i.e., the content generation model) is initialized from the teacher model (i.e., the baseline model), and the newly added guidance module is zero-initialized, in order to train the distilled content generation model 130.

[0056] Unlike the fixed CFG guidance ratio in the baseline model 140, a guidance module 210 is introduced in the content generation model 130. This guidance module 210 is applied to each inference operation of the content generation model 130. The guidance module 210 can determine the guidance distribution embedding for the current inference operation based on the time step t 212 of content generation in the content generation model 130 and a predetermined guidance ratio s 214. This guidance distribution embedding, along with the input of the current inference operation (which may be conditional information or the processing result corresponding to the previous inference operation), generates the processing result corresponding to the current inference operation. The processing result corresponding to the current inference operation is then passed to the next inference process, and further processing is performed based on the guidance module 210.

[0057] The guidance module 210 can be represented as a guidance function of t and s, where t is the time step and s is the predetermined guidance ratio. In some embodiments, the guidance ratio s in model distillation can be initialized to s_0. The purpose of this guidance function is to dynamically adjust for the distribution deviation under different time steps and guidance ratios, so as to ensure distribution alignment during the generation process and thereby achieve efficient guided distillation.

[0058] It should be noted that, in the diffusion model, the number of time steps and the number of inference operations can be converted into each other. For example, as shown in Figure 2, in the denoising process corresponding to the content generation process, the diffusion model iterates from the total number of time steps T down to 0, while the inference operations iterate from 0 up to a predetermined count, with each inference operation corresponding to a predetermined number of time steps. For example, if the time step t ranges from 999 to 0, totaling 1000 time steps, and the inference operations range from 1 to 50, totaling 50 operations, then the first inference operation corresponds to the 20 time steps from t = 999 to t = 980 in the diffusion model, and so on, with the 50th inference operation corresponding to the 20 time steps from t = 19 to t = 0. Therefore, in some embodiments, the guidance function corresponding to the guidance module 210 can also be expressed as a function of the number of inference operations performed and the guidance ratio. Such a transformation is conceivable to those skilled in the art.
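
The correspondence between inference operations and time steps in the example above can be sketched as follows. The function name `time_steps_for_operation` and the assumption that the time steps divide evenly among the inference operations are illustrative choices, not part of the disclosure.

```python
def time_steps_for_operation(k: int, total_steps: int = 1000, num_ops: int = 50) -> range:
    """Return the diffusion time steps covered by the k-th inference operation (1-based).

    Time steps run from total_steps - 1 down to 0 and are split evenly across num_ops.
    """
    if total_steps % num_ops != 0:
        raise ValueError("this sketch assumes total_steps is divisible by num_ops")
    if not 1 <= k <= num_ops:
        raise ValueError("k must be in 1..num_ops")
    per_op = total_steps // num_ops
    start = total_steps - 1 - (k - 1) * per_op  # highest time step of this operation
    return range(start, start - per_op, -1)


first = time_steps_for_operation(1)
last = time_steps_for_operation(50)
print(first[0], first[-1], len(first))  # 999 980 20
print(last[0], last[-1], len(last))     # 19 0 20
```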

[0059] During model distillation, a reference visual content is generated using a baseline model 140 based on sample condition information and a sample guidance ratio s. Predicted visual content is then generated using an initialized content generation model 130 and its associated guidance module 210, based on the sample condition information and the sample guidance ratio. The model parameters of the content generation model 130 and the guidance module 210 are updated based on the differences between the generation results of the baseline model 140 and the content generation model 130 at multiple time steps.

[0060] As mentioned earlier, the optimization goal of model distillation is to ensure that the visual content generated by the content generation model 130 is consistent with the visual content generated by the baseline model 140 while reducing the number of model parameters; that is, the difference between the two remains within the target range. During model distillation, the guidance function of the guidance module 210 is trained to capture the distribution deviations across different time steps (changes in t) and guidance ratios (changes in s), so that its guidance results are consistent with the teacher model (i.e., the baseline model 140). In some embodiments, during model distillation, the sample guidance ratio used by the baseline model 140 and the content generation model 130 each time can be randomly sampled from a guidance ratio range (e.g., from s_min to s_max).

[0061] Therefore, in some embodiments, the optimization objective function of model distillation can be expressed as:

[0062]

[0063] In equation (1), the respective symbols denote the content generation model 130, the baseline model 140, and the guidance module 210; t represents the time step in the content generation process, and s_tea represents the guidance ratio of the baseline model 140. During the distillation process, the guidance ratio used by the content generation model 130 can be the same as that used by the baseline model 140. The two quantities compared in equation (1) are the generation result of the content generation model 130 at time step t and the generation result of the baseline model 140 at time step t.
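
A minimal sketch of a distillation objective of the kind described above is given below. It assumes a mean squared difference between student and teacher outputs over several time steps, and a guidance ratio drawn uniformly from a range; the squared-error form, the range [1, 7], and the toy stand-in models `toy_teacher` and `toy_student` are all hypothetical and are not taken from the disclosure.

```python
import random
from typing import Callable, List, Sequence

Vector = List[float]
Model = Callable[[Vector, int, float], Vector]


def distillation_loss(
    student: Model, teacher: Model, x_t: Vector, time_steps: Sequence[int], s: float
) -> float:
    """Mean squared difference between student and teacher outputs over time steps."""
    total, count = 0.0, 0
    for t in time_steps:
        pred = student(x_t, t, s)  # student output at time step t
        ref = teacher(x_t, t, s)   # teacher output at the same t and guidance ratio
        total += sum((p - r) ** 2 for p, r in zip(pred, ref))
        count += len(ref)
    return total / count


def toy_teacher(x: Vector, t: int, s: float) -> Vector:
    """Hypothetical frozen teacher."""
    return [s * v * (t / 1000.0) for v in x]


def toy_student(x: Vector, t: int, s: float) -> Vector:
    """Hypothetical student that is off by a constant 0.1."""
    return [s * v * (t / 1000.0) + 0.1 for v in x]


rng = random.Random(0)
s_min, s_max = 1.0, 7.0               # assumed guidance ratio range
s_sample = rng.uniform(s_min, s_max)  # sample guidance ratio for this step
loss = distillation_loss(toy_student, toy_teacher, [0.5, -0.5], [999, 500, 0], s_sample)
print(loss)  # close to 0.01, since every output differs by about 0.1
```

In an actual training loop the loss would drive updates to the student and guidance module parameters while the teacher stays fixed, as described in paragraph [0054].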

[0064] The processing of the guidance module 210 is described further below. In each inference operation, the inputs to the guidance module 210 include the time step t of content generation of the content generation model and the guidance ratio s, from which the guidance module 210 determines the guidance distribution embedding representation for the current inference operation (or, more specifically, the current time step). The guidance module 210 can determine a step embedding representation t_embed of the time step t of content generation, and determine a guidance embedding representation s_embed of the guidance ratio s. Then, using the projection matrix in the guidance module 210, the aggregated embedding representation of the step embedding representation t_embed and the guidance embedding representation s_embed is projected into the guidance distribution embedding representation.

[0065] Figure 3 shows a schematic diagram of an example structure of the guidance module 210. As shown in Figure 3, the guidance module 210 includes a multilayer perceptron (MLP) 302 and an activation layer (e.g., a sigmoid activation layer) 320. The MLP 302 and the activation layer 320 are configured to process the time step t and generate the step embedding representation t_embed, which is represented as follows:

[0066] $t_{\mathrm{embed}} = \sigma(\mathrm{MLP3}(t))$ (2)

[0067] where MLP3() represents MLP 302 (e.g., a 3-layer MLP), and σ() represents the activation layer 320, which can be a sigmoid activation function used to normalize the output of MLP 302.
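
Equation (2) can be illustrated with a small pure-Python sketch. The layer widths, the random weights, the scaling of t by the total number of time steps, the ReLU between layers, and the absence of bias terms are all assumptions chosen for brevity; only the overall shape, a 3-layer MLP followed by a sigmoid, comes from the description.

```python
import math
import random
from typing import List

Matrix = List[List[float]]


def make_layer(rng: random.Random, n_in: int, n_out: int) -> Matrix:
    """Create a random weight matrix with n_out rows and n_in columns."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]


def linear(weights: Matrix, x: List[float]) -> List[float]:
    """Matrix-vector product (no bias, for brevity)."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def sigmoid(v: float) -> float:
    return 1.0 / (1.0 + math.exp(-v))


def step_embedding(t: int, layers: List[Matrix], total_steps: int = 1000) -> List[float]:
    """Compute t_embed = sigmoid(MLP3(t)) for a scalar time step t."""
    h = [t / total_steps]  # assumed input scaling
    for i, w in enumerate(layers):
        h = linear(w, h)
        if i < len(layers) - 1:
            h = [max(0.0, v) for v in h]  # assumed ReLU between layers
    return [sigmoid(v) for v in h]  # sigmoid keeps every entry in (0, 1)


rng = random.Random(0)
layers = [make_layer(rng, 1, 8), make_layer(rng, 8, 8), make_layer(rng, 8, 4)]
t_embed = step_embedding(500, layers)
print(len(t_embed), all(0.0 < v < 1.0 for v in t_embed))  # 4 True
```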

[0068] In some embodiments, for computing the guidance embedding representation, the guidance module 210 further includes a routing module 304 and a mixture-of-experts (MoE) model consisting of multiple expert networks. Figure 3 shows multiple expert networks 310-1, 310-2, ..., 310-K (collectively referred to as expert networks 310, where K is the number of expert networks and K is an integer greater than 1).

[0069] In determining the guidance embedding representation s_embed, the routing module 304 normalizes the guidance ratio s to obtain s_norm. Then, the multiple expert networks 310 are used to determine multiple guidance embedding representations of the guidance ratio s, where each expert network 310 can be represented as f_router. Furthermore, the guidance module 210 can also determine multiple weights w_i of the guidance ratio s with respect to the multiple expert networks 310, expressed as w_i = softmax(s_norm). The weights for the multiple expert networks 310 form a weight matrix 306. Subsequently, a weighted summation is performed on the multiple guidance embedding representations using these weights to obtain the guidance embedding representation s_embed. The process of determining the guidance embedding representation s_embed can be represented as follows:

[0070] $s_{\mathrm{norm}} = \mathrm{Normalize}(s)$ (3)

[0071] $w_i = \mathrm{softmax}(s_{\mathrm{norm}})$ (4)

[0072]

[0073] In the guidance module 210, based on the mixture-of-experts (MoE) model, multiple routing paths can be assigned for different guidance ratios s. Each routing path learns a function f_router from an expert network, and the distribution weights w are calculated. By activating the optimal guidance paths using a Top-N strategy, the problem of training efficiency under different distributions is effectively addressed. Based on this process, a guidance embedding representation s_embed that better matches each guidance ratio s can be determined.
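
The routing described in paragraphs [0069] to [0073] can be sketched as follows. This is an illustration under several assumptions that are not in the disclosure: min-max normalization over a range [1, 7], a router that produces one logit per expert from a (slope, bias) pair, toy linear experts, and Top-N selection followed by renormalization of the kept weights. The names `normalize`, `guidance_embedding`, `router`, and `experts` are hypothetical.

```python
import math
from typing import Dict, List, Tuple


def normalize(s: float, s_min: float = 1.0, s_max: float = 7.0) -> float:
    """Assumed min-max normalization of the guidance ratio to [0, 1]."""
    return (s - s_min) / (s_max - s_min)


def softmax(logits: List[float]) -> List[float]:
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def guidance_embedding(
    s: float,
    router: List[Tuple[float, float]],
    experts: List[List[float]],
    top_n: int,
) -> Tuple[List[float], Dict[int, float]]:
    """Return s_embed and the renormalized weights of the top_n experts."""
    s_norm = normalize(s)
    # Hypothetical router: one (slope, bias) pair per expert produces a logit.
    weights = softmax([a * s_norm + b for a, b in router])
    keep = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)[:top_n]
    kept_total = sum(weights[i] for i in keep)
    kept = {i: weights[i] / kept_total for i in keep}
    # Hypothetical expert i: a linear map s_norm -> s_norm * experts[i].
    dim = len(experts[0])
    s_embed = [
        sum(w * experts[i][d] * s_norm for i, w in kept.items()) for d in range(dim)
    ]
    return s_embed, kept


router = [(2.0, 0.0), (-2.0, 0.0), (0.0, 0.5)]
experts = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
s_embed, kept = guidance_embedding(7.0, router, experts, top_n=2)
print(sorted(kept), s_embed)  # experts 0 and 2 are selected for s = 7
```

Only the selected experts contribute to the weighted sum, which is one way a Top-N strategy can limit the work done per guidance ratio.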

[0074] By combining the guidance embedding representation s_embed with the time step information t_embed, the guidance distribution embedding representation for each inference operation can be determined. Specifically, in the guidance module 210, the aggregation module 330 can be used to aggregate the step embedding representation and the guidance embedding representation to obtain the aggregated embedding representation. Then, the projection matrix W_proj can be used to project the aggregated embedding representation into the guidance distribution embedding representation. This is represented as follows:

[0075]
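
The aggregation and projection step can be sketched as below, assuming that aggregation means concatenation of t_embed and s_embed (the disclosure does not state the aggregation operator) and using a small hand-written matrix in place of a learned W_proj. The function name `project` is hypothetical.

```python
from typing import List


def project(
    t_embed: List[float], s_embed: List[float], w_proj: List[List[float]]
) -> List[float]:
    """Project the aggregated embedding with W_proj (one output per matrix row)."""
    aggregated = t_embed + s_embed  # assumed aggregation: concatenation
    if any(len(row) != len(aggregated) for row in w_proj):
        raise ValueError("each row of w_proj must match the aggregated length")
    return [sum(w * v for w, v in zip(row, aggregated)) for row in w_proj]


t_embed = [0.25, 0.75]
s_embed = [0.5, 0.125, 0.375]
w_proj = [
    [1.0, 0.0, 0.0, 0.0, 0.0],  # picks t_embed[0]
    [0.0, 1.0, 1.0, 0.0, 0.0],  # adds t_embed[1] and s_embed[0]
]
print(project(t_embed, s_embed, w_proj))  # [0.25, 1.25]
```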

[0076] In some embodiments, referring back to Figure 2, a guidance rate parameter r 220 may also be introduced in the guidance module 210. The guidance rate parameter is determined during the model distillation process of the content generation model 130, and the guidance rate parameter 220 is configured to control the switching between explicit guidance and implicit guidance based on conditional information. The guidance module 210 can determine the guidance distribution embedding representation for each inference operation further based on the guidance rate parameter r. In some embodiments, the guidance rate parameter r can be Fourier embedded and injected into each layer of the content generation model 130. In some embodiments, in each inference operation of the content generation model 130, the aggregation of the guidance distribution embedding representation and the Fourier embedding of the guidance rate parameter r can be used to generate the processing result of this inference operation.
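
One common way to Fourier embed a scalar such as the guidance rate parameter r is to evaluate sine and cosine at several frequencies. The sketch below assumes power-of-two frequencies and four frequency bands; these choices and the function name `fourier_embedding` are illustrative, since the disclosure does not specify the embedding details.

```python
import math
from typing import List


def fourier_embedding(r: float, num_frequencies: int = 4) -> List[float]:
    """Embed scalar r as [sin, cos] pairs at frequencies 1, 2, 4, ... (assumed)."""
    out: List[float] = []
    for k in range(num_frequencies):
        angle = 2.0 * math.pi * (2.0 ** k) * r
        out.extend([math.sin(angle), math.cos(angle)])
    return out


emb = fourier_embedding(0.3)
print(len(emb))  # 8
```

The resulting vector could then be aggregated with the guidance distribution embedding representation before being passed to the layers of the model.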

[0077] During the inference phase, the dynamic switching between explicit and implicit guidance is adjusted using the guidance rate parameter r. This balances generation quality and inference speed, making it suitable for multi-task generation scenarios. Explicit guidance is used to generate semantic content in the early stages, while implicit guidance is used to optimize computational efficiency in later stages.

[0078] Through model distillation, the parameters of the content generation model 130 and the guidance module 210 can be determined. Once the parameters are determined, the content generation model 130 and the guidance module 210 can be used for inference tasks. During inference, the guidance ratio s of the guidance module 210 can be predetermined; for example, it can be specified by the user.

[0079] In some embodiments, in each inference operation of the content generation model 130, a guidance value for a given inference operation can also be determined based on the time step t of content generation by the content generation model 130 and a predetermined guidance ratio s. In some embodiments, the guidance value can be determined based on the guidance distribution embedding representation generated by the guidance module 210, for example, by mapping the guidance distribution embedding representation to a predetermined range of values. In some embodiments, in response to determining that the guidance value exceeds a predetermined distribution alignment threshold λ, an explicitly guided content generation process is executed using the content generation model 130 and the guidance module 210 (i.e., conditional generation at each time step is explicitly controlled using the guidance ratio). Specifically, in response to determining that the guidance value exceeds the distribution alignment threshold λ, a first intermediate result p(x_t | c, s, t, r) is determined by performing a content generation process guided by conditional information, where c represents the input conditional information. The first intermediate result is the processing result at time step t generated based on the conditional information c, the guidance ratio s, the time step t, and the guidance rate parameter r.

[0080] Then, using the content generation model 130 and the guidance module 210, a second intermediate result p(x_t | s, t, r) is determined by performing a content generation process without conditions. The second intermediate result is the processing result at time step t generated based on the guidance ratio s, the time step t, and the guidance rate parameter r, without considering the conditional information c. The first intermediate result p(x_t | c, s, t, r) and the second intermediate result p(x_t | s, t, r) are aggregated based on the predetermined guidance ratio to determine the processing result p_t(x_t | c) corresponding to the inference operation. The explicitly guided content generation process can be represented by the upper part of formula (7) below, i.e., the content generation process for the case in which the guidance value exceeds the distribution alignment threshold λ.

[0081]

[0082] In some embodiments, in response to determining that the guidance value is less than the predetermined distribution alignment threshold λ, the content generation model 130 and the guidance module 210 are used to execute an implicitly guided content generation process (i.e., conditional generation at each time step is implicitly controlled using the guidance function). Specifically, the content generation model 130 and the guidance module 210 are used to determine the processing result corresponding to the inference operation by executing a content generation process guided by conditional information. The implicitly guided content generation process can be expressed as the lower part of formula (7) above, i.e., the content generation process for the case in which the guidance value is less than the distribution alignment threshold λ.

[0083] In some embodiments, the distribution alignment threshold λ is related to the visual generation task to be performed and is used to control the switching between explicit and implicit guidance. For example, suppose that during inference the guidance value derived from the guidance distribution embedding representation is mapped to the range [1, 7]. Then, when λ = 7, guidance is performed entirely implicitly (the lower part of formula (7)), and when λ = 1, guidance is performed entirely by CFG (the upper part of formula (7)).
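
The threshold-based switching can be sketched as follows, reading the description as: a guidance value above λ selects explicit guidance (two model evaluations, aggregated with the guidance ratio), and otherwise a single conditionally guided evaluation is used. The aggregation uses the common CFG form uncond + s * (cond - uncond), which is an assumption, and the toy callables stand in for the content generation model; `guided_step` and `count_modes` are hypothetical names.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Vector = List[float]


def guided_step(
    guidance_value: float,
    lam: float,
    s: float,
    conditional: Callable[[], Vector],
    unconditional: Callable[[], Vector],
) -> Tuple[str, Vector]:
    """Choose explicit or implicit guidance for one inference operation."""
    if guidance_value > lam:
        cond = conditional()      # first intermediate result
        uncond = unconditional()  # second intermediate result
        # Assumed CFG-style aggregation based on the guidance ratio s.
        return "explicit", [u + s * (c - u) for c, u in zip(cond, uncond)]
    return "implicit", conditional()  # single conditionally guided evaluation


def count_modes(values: Sequence[float], lam: float) -> Dict[str, int]:
    """Count how often each guidance mode is chosen for the given values."""
    counts = {"explicit": 0, "implicit": 0}
    for v in values:
        mode, _ = guided_step(v, lam, 3.0, lambda: [1.0], lambda: [0.0])
        counts[mode] += 1
    return counts


values = [1.5, 3.0, 5.0, 7.0]  # toy guidance values in the range [1, 7]
print(count_modes(values, 7.0))  # {'explicit': 0, 'implicit': 4}
print(count_modes(values, 1.0))  # {'explicit': 4, 'implicit': 0}
print(count_modes(values, 4.0))  # {'explicit': 2, 'implicit': 2}
```

Because the explicit branch evaluates the model twice and the implicit branch once, raising λ trades explicit control for fewer evaluations per inference operation.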

[0084] Based on formula (7) above, a general guided sampling formula p_t(x_t | c) is proposed. During the inference phase, the threshold λ controls the switching between explicit and implicit guidance, so that the generation process can take both speed and distribution flexibility into account.

[0085] Through the above embodiments, the solution disclosed herein can significantly improve inference speed, enhance multi-task adaptability, and optimize generation quality, solving the problems of slow inference and distribution bias in existing diffusion models, and providing an efficient and reliable solution for real-time generation and diversified generation tasks.

[0086] Figure 4 shows a flowchart of a process 400 for visual generation according to some embodiments of the present disclosure. Process 400 may be implemented at the model inference system 170 and/or the model distillation system 160. Process 400 is described below with reference to Figure 1.

[0087] At block 410, the model inference system 170 uses a trained content generation model to generate visual content based on conditional information and by iteratively executing multiple inference operations. The execution of a given inference operation includes: at block 412, determining a guidance distribution embedding representation for the given inference operation using a guidance module associated with the content generation model, based on the time step of content generation of the content generation model and a predetermined guidance ratio, wherein the parameters of the guidance module are determined during the model distillation process of the content generation model, and the predetermined guidance ratio is configured to control the ratio between the content generation process guided by the conditional information and the content generation process without conditions; and at block 414, generating a processing result corresponding to the given inference operation based on the guidance distribution embedding representation and the conditional information, or based on the guidance distribution embedding representation and the processing result corresponding to the previous inference operation.

[0088] In some embodiments, determining a guided distributed embedding representation for a given inference operation includes: determining a step embedding representation for the time step of content generation; determining a guided embedding representation for a predetermined guided ratio; and projecting an aggregated embedding representation of the step embedding representation and the guided embedding representation into a guided distributed embedding representation using a projection matrix in the guided module.

[0089] In some embodiments, the guidance module includes a plurality of expert networks, and wherein determining the guidance embedding representation of a predetermined guidance ratio includes: determining a plurality of guidance embedding representations of a predetermined guidance ratio using the plurality of expert networks; determining a plurality of weights of the predetermined guidance ratio with respect to the plurality of expert networks; and performing a weighted summation on the plurality of guidance embedding representations using the plurality of weights respectively to obtain the guidance embedding representation.

[0090] In some embodiments, determining the guided distribution embedding representation for a given inference operation includes: further determining the guided distribution embedding representation for a given inference operation based on a guidance rate parameter, utilizing a guidance module in the content generation model, wherein the guidance rate parameter is determined during the model distillation process of the content generation model, and the guidance rate parameter is configured to control the switching between explicit guidance and implicit guidance based on conditional information.

[0091] In some embodiments, the model distillation process of the content generation model includes: generating reference visual content by the model distillation system 160 using a baseline model based on sample condition information and sample guidance ratio; generating predicted visual content using an initialized content generation model and an associated guidance module based on sample condition information and sample guidance ratio; and updating the model parameters of the content generation model and the guidance module based on the differences between the generation results of the baseline model at multiple time steps and the generation results of the content generation model at multiple time steps.

[0092] In some embodiments, the sample guidance ratio is randomly sampled from a guidance ratio range.

[0093] In some embodiments, the execution of a given inference operation of a plurality of inference operations further includes: determining a guidance value for the given inference operation based on the time step of content generation of the content generation model and a predetermined guidance ratio; in response to determining that the guidance value exceeds a predetermined distribution alignment threshold, determining a first intermediate result by executing a content generation process guided by conditional information using the content generation model and the guidance module; determining a second intermediate result by executing a content generation process without conditions using the content generation model and the guidance module; and aggregating the first intermediate result and the second intermediate result based on the predetermined guidance ratio to determine the processing result corresponding to the given inference operation.

[0094] In some embodiments, the execution of a given inference operation of a plurality of inference operations includes, in response to determining that a guiding value is less than a predetermined distribution alignment threshold, using a content generation model and a guiding module, performing a content generation process guided by conditional information to determine the processing result corresponding to the given inference operation. In some embodiments, the distribution alignment threshold is related to the visual generation task to be performed.

[0095] In some embodiments, the visual generative model is constructed based on a diffusion model.

[0096] Figure 5 shows a schematic structural block diagram of an apparatus 500 for visual generation according to certain embodiments of the present disclosure. The apparatus 500 may be implemented as or included in the model inference system 170 and/or the model distillation system 160. The various modules/components in the apparatus 500 may be implemented by hardware, software, firmware, or any combination thereof.

[0097] As shown in the figure, the device 500 includes a content generation module 510, configured to generate visual content using a trained content generation model, based on conditional information and by iteratively executing multiple inference operations. For a given inference operation among the multiple inference operations, the content generation module 510 includes a guidance module 512, configured to determine a guided distribution embedding representation for the given inference operation based on the content generation model's content generation time steps and a predetermined guidance ratio, using the guidance module associated with the content generation model. The parameters of the guidance module are determined during the model distillation process of the content generation model, and the predetermined guidance ratio is configured to control the ratio between: the content generation process guided by conditional information, and the content generation process without conditions. For a given inference operation among the multiple inference operations, the content generation module 510 also includes a result generation module 514, configured to generate a processing result corresponding to the given inference operation based on the guided distribution embedding representation and conditional information, or based on the guided distribution embedding representation and the processing result corresponding to the previous inference operation.

[0098] In some embodiments, the guiding module 512 is further configured to: determine a step embedding representation of the time step of content generation; determine a guiding embedding representation of a predetermined guiding ratio; and project the aggregated embedding representation of the step embedding representation and the guiding embedding representation into a guiding distributed embedding representation using a projection matrix in the guiding module.

[0099] In some embodiments, the guidance module includes multiple expert networks, and the guidance module 512 is further configured to: determine multiple guidance embedding representations with a predetermined guidance ratio using the multiple expert networks; determine multiple weights of the predetermined guidance ratio with respect to the multiple expert networks; and perform a weighted summation on the multiple guidance embedding representations using the multiple weights respectively to obtain a guidance embedding representation.

[0100] In some embodiments, the guidance module 512 is further configured to: determine a guidance distribution embedding representation for a given inference operation based on a guidance rate parameter, using the guidance module in the content generation model, wherein the guidance rate parameter is determined during the model distillation process of the content generation model, and the guidance rate parameter is configured to control the switching between explicit guidance and implicit guidance based on conditional information.

[0101] In some embodiments, the model distillation process of the content generation model includes: generating reference visual content using a baseline model based on sample condition information and sample guidance ratio; generating predicted visual content using an initialized content generation model and an associated guidance module based on sample condition information and sample guidance ratio; and updating the model parameters of the content generation model and the guidance module based on the differences between the generation results of the baseline model at multiple time steps and the generation results of the content generation model at multiple time steps.

[0102] In some embodiments, the sample guidance ratio is randomly sampled from a guidance ratio range.

[0103] In some embodiments, the content generation module 510 further includes: a guidance value determination module configured to determine a guidance value for a given inference operation based on the time step of content generation in the content generation model and a predetermined guidance ratio; a first intermediate determination module configured to, in response to determining that the guidance value exceeds a predetermined distribution alignment threshold, determine a first intermediate result by performing a content generation process guided by conditional information using the content generation model and the guidance module; a second intermediate determination module configured to determine a second intermediate result by performing a content generation process without conditions using the content generation model and the guidance module; and a result determination module configured to aggregate the first intermediate result and the second intermediate result based on a predetermined guidance ratio to determine the processing result corresponding to the given inference operation.

[0104] In some embodiments, the content generation module 510 further includes an implicit execution module configured to, in response to determining that a guiding value is less than a predetermined distribution alignment threshold, utilize the content generation model and the guiding module to determine the processing result corresponding to a given inference operation by performing a content generation process guided by conditional information. In some embodiments, the distribution alignment threshold is related to the visual generation task to be performed.

[0105] In some embodiments, the visual generative model is constructed based on a diffusion model.

[0106] Figure 6 shows a block diagram of an electronic device 600 in which one or more embodiments of the present disclosure may be implemented. It should be understood that the electronic device 600 shown in Figure 6 is merely exemplary and should not be construed as limiting the functionality and scope of the embodiments described herein. The electronic device 600 shown in Figure 6 can be used to implement the model distillation system 160 or the model application system 170 of Figure 1, or the apparatus 500 of Figure 5.

[0107] As shown in Figure 6, electronic device 600 takes the form of a general-purpose electronic device. Components of electronic device 600 may include, but are not limited to, one or more processors or processing units 610, memory 620, storage device 630, one or more communication units 640, one or more input devices 650, and one or more output devices 660. Processing unit 610 may be a physical or virtual processor and is capable of performing various processes according to programs stored in memory 620. In a multiprocessor system, multiple processing units execute computer-executable instructions in parallel to improve the parallel processing capability of electronic device 600.

[0108] Electronic device 600 typically includes multiple computer storage media. Such media can be any available media accessible to electronic device 600, including but not limited to volatile and non-volatile media, removable and non-removable media. Memory 620 can be volatile memory (e.g., registers, cache, random access memory (RAM)), non-volatile memory (e.g., read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory), or some combination thereof. Storage device 630 can be removable or non-removable media and can include machine-readable media, such as flash drives, disks, or any other media that can be used to store information and/or data and can be accessed within electronic device 600.

[0109] Electronic device 600 may further include additional removable/non-removable, volatile/non-volatile storage media. Although not shown in Figure 6, disk drives for reading from or writing to removable, non-volatile disks (e.g., "floppy disks") and optical disk drives for reading from or writing to removable, non-volatile optical disks can be provided. In these cases, each drive can be connected to a bus (not shown) via one or more data media interfaces. Memory 620 may include computer program product 625 having one or more program modules configured to perform various methods or actions of various embodiments of this disclosure.

[0110] The communication unit 640 enables communication with other electronic devices via a communication medium. Additionally, the functionality of the components of the electronic device 600 can be implemented using a single computing cluster or multiple computing machines capable of communicating via communication connections. Therefore, the electronic device 600 can operate in a networked environment using logical connections to one or more other servers, network personal computers (PCs), or another network node.

[0111] Input device 650 can be one or more input devices, such as a mouse, keyboard, trackball, etc. Output device 660 can be one or more output devices, such as a monitor, speaker, printer, etc. As needed, electronic device 600 can also communicate via communication unit 640 with one or more external devices (not shown) such as storage devices and display devices, with one or more devices that enable a user to interact with electronic device 600, or with any device (e.g., network card, modem, etc.) that enables electronic device 600 to communicate with one or more other electronic devices. Such communication can be performed via an input / output (I / O) interface (not shown).

[0112] According to an exemplary implementation of this disclosure, a computer-readable storage medium is provided that stores computer-executable instructions thereon, wherein the computer-executable instructions are executed by a processor to implement the methods described above. According to an exemplary implementation of this disclosure, a computer program product is also provided, which is tangibly stored on a non-transitory computer-readable medium and includes computer-executable instructions, which are executed by a processor to implement the methods described above.

[0113] Various aspects of this disclosure are described herein with reference to flowchart illustrations and / or block diagrams of methods, apparatuses, devices, and computer program products implemented according to this disclosure. It should be understood that each block of the flowchart illustrations and / or block diagrams, and combinations of blocks in the flowchart illustrations and / or block diagrams, can be implemented by computer-readable program instructions.

[0114] These computer-readable program instructions can be provided to a processing unit of a general-purpose computer, a special-purpose computer, or other programmable data processing apparatus to produce a machine such that, when executed by the processing unit of the computer or other programmable data processing apparatus, they create means for implementing the functions / actions specified in one or more blocks of the flowchart and / or block diagram. These computer-readable program instructions can also be stored in a computer-readable storage medium that causes a computer, programmable data processing apparatus, and / or other device to operate in a particular manner. Thus, the computer-readable medium storing the instructions comprises an article of manufacture that includes instructions for implementing aspects of the functions / actions specified in one or more blocks of the flowchart and / or block diagram.

[0115] Computer-readable program instructions can be loaded onto a computer, other programmable data processing apparatus, or other device to cause a series of operational steps to be performed on the computer, other programmable data processing apparatus, or other device to produce a computer-implemented process, such that the instructions that execute on the computer, other programmable data processing apparatus, or other device implement the functions / actions specified in one or more blocks of the flowchart and / or block diagram.

[0116] The flowcharts and block diagrams in the accompanying drawings illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of this disclosure. In this regard, each block in a flowchart or block diagram may represent a module, segment, or portion of instructions, which contains one or more executable instructions for implementing the specified logical function. In some alternative implementations, the functions indicated in the blocks may occur in a different order than that indicated in the drawings. For example, two consecutive blocks may actually be executed substantially in parallel, or they may sometimes be executed in reverse order, depending on the functions involved. It should also be noted that each block in the block diagrams and / or flowcharts, and combinations of blocks in the block diagrams and / or flowcharts, may be implemented using a dedicated hardware-based system that performs the specified function or action, or using a combination of dedicated hardware and computer instructions.

[0117] Various implementations of this disclosure have been described above. These descriptions are exemplary rather than exhaustive, and are not limited to the disclosed implementations. Many modifications and variations will be apparent to those skilled in the art without departing from the scope and spirit of the described implementations. The terminology used herein is chosen to best explain the principles of the implementations, their practical applications, or their technical improvements over technologies in the market, or to enable others skilled in the art to understand the various implementations disclosed herein.
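
As a non-limiting illustration, the guidance module recited in the claims (a step embedding of the time step, a guidance embedding of the guidance ratio obtained as a weighted sum over expert networks, and a projection of their aggregate) may be sketched in Python as follows. This is a minimal sketch under stated assumptions and not the implementation of this disclosure: the sinusoidal encoding, the softmax gate, the use of summation as the aggregation, the random initialization (which merely stands in for parameters that would be determined during model distillation), and all names such as `GuidanceModule` are hypothetical.

```python
import math
import random


def sinusoidal_embedding(value, dim):
    """Encode a scalar as `dim` sine/cosine features (assumed encoding)."""
    half = dim // 2
    freqs = [math.exp(-math.log(10000.0) * i / half) for i in range(half)]
    return [math.sin(value * f) for f in freqs] + [math.cos(value * f) for f in freqs]


def matvec(matrix, vec):
    """Multiply a row-major matrix by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


class GuidanceModule:
    """Hypothetical guidance module: expert networks plus a projection matrix."""

    def __init__(self, dim, num_experts, seed=0):
        assert dim % 2 == 0, "dim must be even for the sine/cosine encoding"
        rng = random.Random(seed)

        def rand_matrix(rows, cols):
            return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

        self.dim = dim
        # Random values stand in for parameters learned during distillation.
        self.experts = [rand_matrix(dim, dim) for _ in range(num_experts)]
        self.gate = [rng.gauss(0.0, 1.0) for _ in range(num_experts)]
        self.projection = rand_matrix(dim, dim)

    def expert_weights(self, guidance_ratio):
        """One weight per expert for the given guidance ratio (assumed softmax gate)."""
        return softmax([g * guidance_ratio for g in self.gate])

    def __call__(self, time_step, guidance_ratio):
        # Step embedding representation of the time step.
        step_emb = sinusoidal_embedding(time_step, self.dim)
        # One guidance embedding per expert network.
        ratio_feat = sinusoidal_embedding(guidance_ratio, self.dim)
        expert_embs = [matvec(expert, ratio_feat) for expert in self.experts]
        # Weighted sum of the expert outputs gives the guidance embedding.
        weights = self.expert_weights(guidance_ratio)
        guide_emb = [
            sum(w * emb[i] for w, emb in zip(weights, expert_embs))
            for i in range(self.dim)
        ]
        # Aggregate (here: element-wise sum, an assumption) and project.
        aggregated = [s + g for s, g in zip(step_emb, guide_emb)]
        return matvec(self.projection, aggregated)


module = GuidanceModule(dim=8, num_experts=3)
guidance_distribution_embedding = module(time_step=10, guidance_ratio=7.5)
```

In this sketch the resulting vector would be supplied to the content generation model alongside the conditional information at each inference operation; how the model consumes it is not shown here.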

Claims

1. A method for generating visual content, comprising: generating visual content using a trained content generation model, based on conditional information and by iteratively executing a plurality of inference operations, wherein the execution of a given inference operation of the plurality of inference operations includes: determining, based on a time step of content generation of the content generation model and a predetermined guidance ratio, a guidance distribution embedding representation for the given inference operation using a guidance module associated with the content generation model, wherein parameters of the guidance module are determined during a model distillation process of the content generation model, and the predetermined guidance ratio is configured to control the ratio between a content generation process guided by the conditional information and a content generation process without conditions; and generating a processing result corresponding to the given inference operation based on the guidance distribution embedding representation and the conditional information, or based on the guidance distribution embedding representation and a processing result corresponding to a previous inference operation.

2. The method of claim 1, wherein determining the guidance distribution embedding representation for the given inference operation comprises: determining a step embedding representation of the time step of content generation; determining a guidance embedding representation of the predetermined guidance ratio; and projecting an aggregated embedding representation of the step embedding representation and the guidance embedding representation into the guidance distribution embedding representation using a projection matrix in the guidance module.

3. The method of claim 2, wherein the guidance module comprises a plurality of expert networks, and wherein determining the guidance embedding representation of the predetermined guidance ratio comprises: determining a plurality of guidance embedding representations of the predetermined guidance ratio using the plurality of expert networks; determining a plurality of weights of the predetermined guidance ratio with respect to the plurality of expert networks; and obtaining the guidance embedding representation by computing a weighted sum of the plurality of guidance embedding representations using the plurality of weights, respectively.

4. The method of claim 1, wherein determining the guidance distribution embedding representation for the given inference operation comprises: determining, further based on a guidance rate parameter, the guidance distribution embedding representation for the given inference operation using the guidance module, wherein the guidance rate parameter is determined during the model distillation process of the content generation model, and the guidance rate parameter is configured to control switching between explicit guidance and implicit guidance based on the conditional information.

5. The method according to claim 1, wherein the model distillation process of the content generation model includes: generating reference visual content using a benchmark model, based on sample condition information and a sample guidance ratio; generating predicted visual content using an initialized content generation model and the associated guidance module, based on the sample condition information and the sample guidance ratio; and updating model parameters of the content generation model and the guidance module based on differences between generation results of the benchmark model at multiple time steps and generation results of the content generation model at the same multiple time steps.

6. The method of claim 5, wherein the sample guidance ratio is randomly sampled from a guidance ratio range.

7. The method of claim 1, wherein the execution of the given inference operation of the plurality of inference operations further comprises: determining a guidance value for the given inference operation based on the time step of content generation of the content generation model and the predetermined guidance ratio; in response to determining that the guidance value exceeds a predetermined distribution alignment threshold, determining a first intermediate result by executing, using the content generation model and the guidance module, a content generation process guided by the conditional information; determining a second intermediate result by executing, using the content generation model and the guidance module, a content generation process without conditions; and aggregating the first intermediate result and the second intermediate result based on the predetermined guidance ratio to determine the processing result corresponding to the given inference operation.

8. The method of claim 7, wherein the execution of the given inference operation of the plurality of inference operations further comprises: in response to determining that the guidance value is less than the predetermined distribution alignment threshold, determining the processing result corresponding to the given inference operation by executing, using the content generation model and the guidance module, a content generation process guided by the conditional information.

9. The method of claim 8, wherein the distribution alignment threshold is related to the visual generation task to be performed.

10. The method of claim 1, wherein the content generation model is constructed based on a diffusion model.

11. An apparatus for generating visual content, comprising: a content generation module configured to generate visual content using a trained content generation model, based on conditional information and by iteratively executing a plurality of inference operations, wherein the execution of a given inference operation of the plurality of inference operations includes: determining, based on a time step of content generation of the content generation model and a predetermined guidance ratio, a guidance distribution embedding representation for the given inference operation using a guidance module associated with the content generation model, wherein parameters of the guidance module are determined during a model distillation process of the content generation model, and the predetermined guidance ratio is configured to control the ratio between a content generation process guided by the conditional information and a content generation process without conditions; and generating a processing result corresponding to the given inference operation based on the guidance distribution embedding representation and the conditional information, or based on the guidance distribution embedding representation and a processing result corresponding to a previous inference operation.

12. An electronic device, comprising: at least one processing unit; and at least one memory coupled to the at least one processing unit and storing instructions for execution by the at least one processing unit, the instructions, when executed by the at least one processing unit, causing the electronic device to perform the method according to any one of claims 1 to 10.

13. A computer-readable storage medium having a computer program stored thereon, the computer program being executable by a processor to implement the method according to any one of claims 1 to 10.

14. A computer program product tangibly stored in a computer storage medium and comprising computer-executable instructions that, when executed by a device, cause the device to perform the method according to any one of claims 1 to 10.
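
For illustration only, the threshold-gated inference operation of claims 7 and 8 may be sketched in Python as follows. The claims do not specify how the guidance value is computed or how the two intermediate results are aggregated, so this sketch makes two labeled assumptions: `guidance_value` is a hypothetical score (the guidance ratio scaled by the fraction of remaining steps), and the aggregation uses the commonly used classifier-free guidance combination `uncond + ratio * (cond - uncond)`. The `toy_model` callable is a stand-in for the content generation model with its guidance module and is not part of this disclosure.

```python
def guidance_value(time_step, total_steps, guidance_ratio):
    """Hypothetical guidance value derived from the time step and guidance ratio."""
    return guidance_ratio * time_step / total_steps


def inference_step(model, x, cond, time_step, total_steps, guidance_ratio, threshold):
    """Run one inference operation; return (result, number of model passes)."""
    value = guidance_value(time_step, total_steps, guidance_ratio)
    if value > threshold:
        # Above the distribution alignment threshold: explicit guidance,
        # one conditional pass and one unconditional pass.
        cond_out = model(x, cond, time_step, guidance_ratio)
        uncond_out = model(x, None, time_step, guidance_ratio)
        # Assumed aggregation: classifier-free guidance combination.
        result = [u + guidance_ratio * (c - u) for c, u in zip(cond_out, uncond_out)]
        return result, 2
    # Otherwise a single conditional pass suffices. The case of equality is
    # not addressed by the claims and is treated as a single pass here.
    return model(x, cond, time_step, guidance_ratio), 1


def toy_model(x, cond, time_step, guidance_ratio):
    """Stand-in model: shifts the input by the condition, or not at all."""
    shift = 0.0 if cond is None else cond
    return [xi + shift for xi in x]


# Early (noisy) step: guidance value 1.6 exceeds the threshold, two passes.
early_result, early_passes = inference_step(toy_model, [1.0, 2.0], 0.5, 8, 10, 2.0, 1.0)
# Late step: guidance value 0.4 is below the threshold, one pass.
late_result, late_passes = inference_step(toy_model, [1.0, 2.0], 0.5, 2, 10, 2.0, 1.0)
```

Under these assumptions, the single-pass branch halves the number of model evaluations for the inference operations that fall below the threshold.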