Image generation method and device based on conditional coding fusion, equipment and medium
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- PING AN TECH (SHENZHEN) CO LTD
- Filing Date
- 2024-10-30
- Publication Date
- 2026-06-16

Figure CN119399304B_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the fields of image generation technology and financial technology, and particularly to an image generation method, apparatus, device and medium based on conditional coding fusion.
Background Technology
[0002] In marketing scenarios within the financial sector, there is a need for rapid generation of a large number of images, such as poster marketing and social media marketing. These marketing scenarios require the frequent use of different images as backgrounds. Traditional marketing poster design relies on professional designers to create them manually, which is not only time-consuming but also costly, and obviously cannot meet the current image generation needs. Therefore, more and more image generation models have emerged to automatically generate images to meet the demand for large-scale image generation.
[0003] Currently, in common image generation tasks, various unsupervised learning methods require significant manual data annotation to map data points between two distributions, which reduces generation efficiency. GAN (Generative Adversarial Network) methods often suffer from numerical instability and mode collapse, require substantial engineering effort and manual tuning, and do not adapt well to different model architectures and datasets. Mainstream diffusion generative models produce excellent results, but generating a realistic image from a noise map requires hundreds or even thousands of denoising steps, which makes image generation slow and generation times extremely long in practical applications. Therefore, improving image generation efficiency while ensuring image quality is a pressing issue that needs to be addressed.
Summary of the Invention
[0004] In view of the shortcomings of the prior art, the purpose of this invention is to provide a method, apparatus, device and medium for image generation based on conditional coding fusion that can be applied to financial technology or other related fields. Its main purpose is to ensure the quality of image generation while shortening the image generation time and improving the efficiency of image generation.
[0005] The technical solution of the present invention is as follows:
[0006] The first aspect of this invention provides an image generation method based on conditional coding fusion, comprising:
[0007] Obtain the prompt information used to guide image generation and the randomly initialized noise map;
[0008] The prompt information is input into the trained conditional encoder for encoding processing, and the corresponding conditional feature vector is generated based on the encoding result.
[0009] The conditional feature vector and the noise map are summed to obtain the conditional noise map;
[0010] The conditional noise map is input into the trained image generation model to generate a target image corresponding to the conditional noise map. The image generation model is obtained by performing two-stage straight-line path training and conditional fine-tuning on a preset flow model.
[0011] A second aspect of the present invention provides an image generation apparatus based on conditional coding fusion, comprising:
[0012] The information acquisition module is used to acquire the prompt information used to guide image generation and the randomly initialized noise map;
[0013] The conditional encoding module is used to input the prompt information into the trained conditional encoder for encoding processing, and generate corresponding conditional feature vectors based on the encoding results;
[0014] The conditional fusion module is used to sum the conditional feature vector and the noise map to obtain a conditional noise map;
[0015] The image generation module is used to input the conditional noise map into the trained image generation model to generate a target image corresponding to the conditional noise map. The image generation model is obtained by performing two-stage straight-line path training and conditional fine-tuning on a preset flow model.
[0016] A third aspect of the present invention provides a computer device including at least one processor; and,
[0017] A memory communicatively connected to the at least one processor; wherein,
[0018] The memory stores instructions that can be executed by the at least one processor, which enables the at least one processor to perform the above-described image generation method based on conditional coding fusion.
[0019] A fourth aspect of the present invention provides a non-volatile computer-readable storage medium storing computer-executable instructions that, when executed by one or more processors, cause the one or more processors to perform the above-described image generation method based on conditional coding fusion.
[0020] Beneficial Effects: This invention discloses an image generation method, apparatus, device, and medium based on conditional coding fusion. Compared with existing technologies, this invention obtains prompt information for guiding image generation and a randomly initialized noise map; inputs the prompt information into a trained conditional encoder for encoding processing, and generates a corresponding conditional feature vector based on the encoding result; sums the conditional feature vector with the noise map to obtain a conditional noise map; and inputs the conditional noise map into a trained image generation model to generate a target image corresponding to the conditional noise map. The image generation model is obtained by performing two-stage straight-line path training and conditional fine-tuning on a preset flow model. By encoding the prompt information into a conditional feature vector, fusing it with the noise map, and inputting the result into the image generation model for straight-line path image generation, the generated image can be controlled under the guidance of the prompt information, which ensures image generation quality. At the same time, the transformation mapping along straight-line paths allows the image generation model to significantly shorten image generation time and improve image generation efficiency.
Attached Figure Description
[0021] To more clearly illustrate the solutions in this invention, the accompanying drawings used in the description of the embodiments of this invention will be briefly introduced below. Obviously, the drawings described below are some embodiments of this invention. For those skilled in the art, other drawings can be obtained from these drawings without creative effort.
[0022] Figure 1 is a schematic diagram of an application environment for the image generation method based on conditional coding fusion provided in an embodiment of the present invention;
[0023] Figure 2 is a flowchart of an image generation method based on conditional coding fusion provided in an embodiment of the present invention;
[0024] Figure 3 is a flowchart of step S202 in the image generation method based on conditional coding fusion provided in an embodiment of the present invention;
[0025] Figure 4 is a flowchart illustrating the training process of the image generation model provided in an embodiment of the present invention;
[0026] Figure 5 is a flowchart of step S406 in the training process of the image generation model provided in an embodiment of the present invention;
[0027] Figure 6 is a schematic diagram of the functional modules of the image generation device based on conditional coding fusion provided in an embodiment of the present invention;
[0028] Figure 7 is a schematic diagram of the hardware structure of a computer device provided in an embodiment of the present invention.
Detailed Implementation
[0029] To make the objectives, technical solutions, and effects of this invention clearer and more explicit, the invention is further described in detail below. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. The embodiments of the invention are described below in conjunction with the accompanying drawings.
[0030] The image generation method based on conditional coding fusion provided in this invention can be applied to the application environment shown in Figure 1, which includes a first terminal device 101, a second terminal device 102, a third terminal device 103, a network 104, and a server 105. The network 104 serves as a medium for providing communication links between the first terminal device 101, the second terminal device 102, the third terminal device 103, and the server 105. The network 104 may include various connection types, such as wired and/or wireless communication links.
[0031] Users can use the first terminal device 101, the second terminal device 102, and the third terminal device 103 to interact with the server 105 via the network 104 to receive or send messages, etc. Various communication client applications can be installed on the first terminal device 101, the second terminal device 102, and the third terminal device 103, such as knowledge reading applications, web browser applications, search applications, instant messaging tools, email clients, and / or social platform software, etc. (for example only).
[0032] The first terminal device 101, the second terminal device 102, and the third terminal device 103 can be various electronic devices with displays and support web browsing, including but not limited to smartphones, tablets, laptops, and desktop computers.
[0033] Server 105 can be a server providing various services, such as a backend server supporting the content browsed by users using the first terminal device 101, the second terminal device 102, and the third terminal device 103 (this is just an example). The backend server can analyze and process received user requests and other data, and feed back the processing results (such as web pages, information, or data obtained or generated according to user requests) to the terminal devices. Server 105 can be a cloud server, also known as a cloud computing server or cloud host, which is a host product in the cloud computing service system to solve the shortcomings of traditional physical hosts and VPS services ("Virtual Private Server", or simply "VPS"), such as high management difficulty and weak business scalability. Server 105 can also be a server for a distributed system or a server combined with blockchain.
[0034] It should be noted that the image generation method based on conditional coding fusion provided in this application embodiment can generally be executed by the first terminal device 101, the second terminal device 102, or the third terminal device 103. Correspondingly, the image generation apparatus based on conditional coding fusion provided in this embodiment can also be located in the first terminal device 101, the second terminal device 102, or the third terminal device 103. Alternatively, the image generation method based on conditional coding fusion provided in this embodiment can generally be executed by the server 105. Correspondingly, the image generation apparatus based on conditional coding fusion provided in this embodiment can generally be located in the server 105.
[0035] It should be understood that the number of terminal devices, networks, and servers listed above is merely illustrative. Depending on implementation needs, any number of terminal devices, networks, and servers can be used.
[0036] As shown in Figure 2, the image generation method based on conditional coding fusion provided in this embodiment of the invention specifically includes the following steps:
[0037] S201. Obtain the prompt information used to guide image generation and the randomly initialized noise map.
[0038] In this embodiment, during image generation, users can input different forms of prompts, such as prompt images or prompt text, to guide and control the image generation process. For example, in a financial image generation scenario, the prompt could be text describing a specific financial scenario, such as "the stock market's performance during an economic recession," or an existing financial chart image, such as a line graph showing the historical performance of a particular stock.
[0039] Simultaneously, a randomly initialized noise map is obtained, meaning the noise map is obtained through random initialization using, for example, Gaussian noise. Specifically, the noise intensity is determined in advance based on the required image size. The noise intensity can be a fixed value or a value that varies with time or other conditions. A random noise map is generated using a Gaussian distribution (normal distribution). For example, if the image size is 256×256, Gaussian noise is used for random initialization to generate a 256×256 Gaussian distributed random matrix, resulting in the corresponding noise map.
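As an illustration only, the random initialization described above can be sketched with NumPy as follows. The function name `make_noise_map`, the `intensity` scaling factor, and the fixed seed are assumptions introduced for this example and are not part of the embodiment.

```python
import numpy as np

def make_noise_map(height=256, width=256, intensity=1.0, seed=None):
    """Return a (height, width) matrix of Gaussian noise scaled by `intensity`.

    `intensity` is a hypothetical knob standing in for the predetermined
    noise intensity; the embodiment does not fix its value here.
    """
    rng = np.random.default_rng(seed)
    return intensity * rng.standard_normal((height, width))

# A 256x256 Gaussian-distributed random matrix, as in the example above.
noise_map = make_noise_map(256, 256, intensity=1.0, seed=0)
print(noise_map.shape)
```

Fixing the seed is only for reproducibility of the sketch; in use, a fresh random draw would give a different starting point for each generated image.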
[0040] The prompts provide the image generation model with context for generating specific types of images, while the randomly initialized noise map provides a random, unprocessed starting point for the generation process, enabling the model to generate images without relying on specific noise patterns and ensuring the reliability of the generated images.
[0041] S202. Input the prompt information into the trained conditional encoder for encoding processing, and generate the corresponding conditional feature vector based on the encoding result.
[0042] In this embodiment, the acquired prompt information, such as prompt text or prompt image, is input into a trained conditional encoder. This conditional encoder can uniformly encode different types of prompt information and generate corresponding conditional feature vectors, thereby converting the prompt information into a high-dimensional feature vector. This vector captures the core features of the prompt information and provides necessary contextual information for the generation process, so that the image generation process can be guided by the prompt information, ensuring that the generated image is highly relevant and can reflect the key features in the prompt information, such as market trends or specific financial products.
[0043] Specifically, a single-tower BERT model can be used as the structural foundation, and a contrastive learning framework can be used to train a single-tower encoder for text and images. This will map different forms of prompts to a unified feature representation space, thereby unifying user input and providing a unified conditional encoding result for subsequent image generation.
[0044] S203. The conditional feature vector and the noise map are summed to obtain the conditional noise map.
[0045] In this embodiment, the encoded conditional feature vector is added element-wise to the noise map at the pixel level. That is, the value of each element of the conditional feature vector is added to the value of the corresponding pixel in the noise map to obtain a new noise map, namely the conditional noise map. This conditional noise map integrates the information of the original noise map and the conditional feature vector, so that the conditional noise map contains both randomness and structured information guided by the prompt information. The fused conditional noise map is used as the new model input, so that the image generation model can maintain randomness while ensuring that the generated image conforms to the specific financial scenario, thereby improving the relevance and accuracy of the generated image.
[0046] S204. Input the conditional noise map into the trained image generation model to generate a target image corresponding to the conditional noise map. The image generation model is obtained by performing two-stage straight-line path training and conditional fine-tuning on a preset flow model.
[0047] In this embodiment, the fused conditional noise map is input into an image generation model that has undergone two-stage straight-line path training and conditional fine-tuning. During training, this image generation model learns the process of transforming and mapping two distribution samples, such as a random noise distribution and a training image distribution, through a straight-line path. This achieves the transformation and mapping of two unknown data distributions, thus achieving the effect of image generation. The trained image generation model can transform and map the conditional noise map through the most efficient straight-line path. This transformation and mapping method through a straight-line path can greatly shorten the image generation time and improve the image generation efficiency.
[0048] The conditional noise map is input into the image generation model, which gradually removes noise from the conditional noise map while introducing structured information that matches the conditional feature vector. Finally, a target image related to the prompt information is generated, such as a stock chart, market analysis chart, or financial product image. This process ensures image quality while generating images quickly.
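To illustrate why a straight-line path can shorten generation, the following minimal sketch integrates a velocity field from t=0 to t=1 with the Euler method. The function `toy_velocity` is a hypothetical stand-in for the trained image generation model and simply points at a fixed target; the integrator and the step counts are likewise assumptions for this example. Because the toy path is exactly straight, a single step and many steps arrive at the same end point.

```python
import numpy as np

def euler_sample(velocity, x0, num_steps):
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with Euler steps."""
    x = np.array(x0, dtype=float)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        x = x + dt * velocity(x, k * dt)
    return x

# Hypothetical 2x2 "target image" used only to build a toy velocity field.
target = np.array([[0.2, 0.8], [0.5, 0.1]])

def toy_velocity(x, t):
    # On the straight path x_t = t*target + (1-t)*x0 this equals target - x0.
    return (target - x) / (1.0 - t)

rng = np.random.default_rng(0)
conditional_noise_map = rng.standard_normal(target.shape)

one_step = euler_sample(toy_velocity, conditional_noise_map, num_steps=1)
many_steps = euler_sample(toy_velocity, conditional_noise_map, num_steps=50)
print(np.allclose(one_step, many_steps))
```

A curved path would not have this property: fewer Euler steps would then give a visibly different result, which is why straightening the path matters for speed.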
[0049] In the above embodiments, this invention discloses an image generation method based on conditional encoding fusion. The method involves acquiring prompting information to guide image generation and a randomly initialized noise map; inputting the prompting information into a trained conditional encoder for encoding processing, and generating a corresponding conditional feature vector based on the encoding result; summing the conditional feature vector with the noise map to obtain a conditional noise map; and inputting the conditional noise map into a trained image generation model to generate a target image corresponding to the conditional noise map. The image generation model is obtained by performing two-stage straight-line path training and conditional fine-tuning on a preset flow model. By encoding the prompting information to obtain a conditional feature vector, fusing it with the noise map, and then inputting it into the image generation model for straight-line path image generation, the generated image can be controlled under the guidance of the prompting information, ensuring image generation quality. Simultaneously, the image generation model, through the transformation mapping of straight-line paths, can significantly shorten image generation time and improve image generation efficiency.
[0050] In one embodiment, as shown in Figure 3, step S202 includes:
[0051] S301. Preprocess the prompt information according to its type to obtain several prompt tokens;
[0052] S302. Input the plurality of prompt tokens into the trained conditional encoder for encoding processing to obtain the feature vectors of the plurality of prompt tokens;
[0053] S303. Perform a specified pooling operation on the feature vectors of the plurality of prompt tokens according to the type of the prompt information to obtain the corresponding pooling vector;
[0054] S304. Map the pooling vector to a vector with the same dimension as the noise map to obtain the corresponding conditional feature vector.
[0055] In this embodiment, different preprocessing methods are used to obtain several prompt tokens for different types of prompt information input by the user. Specifically, when the prompt information is prompt text, the preprocessing method involves segmenting the prompt text, removing stop words, and lowercase conversion, thereby obtaining several text tokens as prompt tokens to be input into the conditional encoder. For example, when the prompt text is "the trend of the stock market during the economic recession", the preprocessing can obtain a series of text tokens for words such as "stock", "market", "during", "economy", "recession", "period", "of", and "trend".
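The text preprocessing described above (segmentation, stop-word removal, lowercase conversion) can be sketched as follows. The stop-word set `STOP_WORDS`, the regular expression, and the function name `text_to_tokens` are assumptions made for this example; the embodiment does not specify a particular segmenter or stop-word list, and the token list therefore differs from the example given in the text.

```python
import re

# Hypothetical stop-word list, for illustration only.
STOP_WORDS = {"the", "of", "a", "an"}

def text_to_tokens(prompt, stop_words=STOP_WORDS):
    """Lowercase the prompt, split it into words, and drop stop words."""
    words = re.findall(r"[a-z0-9']+", prompt.lower())
    return [w for w in words if w not in stop_words]

tokens = text_to_tokens("The trend of the stock market during the economic recession")
print(tokens)
# ['trend', 'stock', 'market', 'during', 'economic', 'recession']
```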
[0056] When the prompt information is a prompt image, the preprocessing method involves resizing the image to a specified size and then segmenting it into equal-sized blocks to obtain several image tokens as prompt tokens, ready to be input into the conditional encoder. For example, when a user inputs a line chart displaying stock trends as the prompt image, since the size of the images input by the user may vary, the preprocessing converts all user-input prompt images to a specified size, such as 256*256, and then segments the prompt image into several image blocks of equal size, for example, 16 image blocks, each image block being 16*16, thus obtaining a series of image tokens.
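A minimal sketch of this image preprocessing is given below for a single-channel image. Two assumptions are made and should be read as such: resizing uses nearest-neighbor sampling (the embodiment does not name a resizing method), and the 16 blocks are taken as a 4×4 grid. Note that 16 equal blocks of a 256*256 image are 64*64 each, whereas 16*16 blocks would give 256 blocks; the sketch follows the count of 16 blocks. The function names are hypothetical.

```python
import numpy as np

def resize_nearest(image, size=256):
    """Resize a 2-D image to (size, size) by nearest-neighbor sampling."""
    h, w = image.shape
    rows = np.arange(size) * h // size
    cols = np.arange(size) * w // size
    return image[rows][:, cols]

def split_into_blocks(image, grid=4):
    """Split a square 2-D image into grid*grid equal blocks, row by row."""
    p = image.shape[0] // grid
    blocks = image.reshape(grid, p, grid, p).swapaxes(1, 2)
    return blocks.reshape(grid * grid, p, p)

# A prompt image of arbitrary size (300x200 here) becomes 16 image tokens.
prompt_image = np.arange(300 * 200, dtype=float).reshape(300, 200)
resized = resize_nearest(prompt_image, size=256)
image_tokens = split_into_blocks(resized, grid=4)
print(resized.shape, image_tokens.shape)
```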
[0057] The preprocessed prompt tokens are input into a trained conditional encoder for encoding. This trained conditional encoder can uniformly encode different types of prompt tokens, namely text tokens or image tokens, mapping the features of text and images to a unified representation space, thereby obtaining feature vectors for the prompt tokens, namely feature vectors of text tokens or feature vectors of image tokens. The encoded feature vectors capture the semantic information of each text token or the visual information of each image token, thus providing rich contextual cues for generating images related to the prompt information.
[0058] In practice, the conditional encoder can be built on a single-tower BERT model and trained using a contrastive learning framework. Original images are manually annotated with text describing their content. During training, an original image paired with its own text description forms a positive pair, and an original image paired with the text description of another image forms a negative pair. Text tokens and image tokens are obtained from the original images and text descriptions in the training data. Each text token is a word from the input text, and the image tokens are obtained by dividing the input image into 16 equal-sized blocks. Since input images can vary in size, they are resized to 256*256 pixels during preprocessing, so that each image yields 16 patches. The text tokens and image tokens are concatenated as input, resulting in: [CLS], [t1], [t2], ..., [tn], [SEP], [p1], [p2], ..., [p16], where t represents a text token, p represents an image token, and n is the maximum number of words (for example, n = 128 allows the user to input up to 128 words). [CLS] and [SEP] are special tags. This concatenated sequence is input into the BERT model, which outputs a feature vector for each token and each special tag. The single-tower structure uses BERT as a unified encoder for both text and images, and text-image similarity is calculated from the [CLS] feature output. The [CLS] feature is input into a two-layer MLP network with a sigmoid activation function. If the text and image inputs form a pair, the training target is 1; otherwise, it is 0. The cross-entropy loss is calculated between the sigmoid output and the target value, and the gradient of the loss is backpropagated to train the BERT model until the loss converges, resulting in a trained single-tower conditional encoder.
The trained conditional encoder can represent images and text in a unified space, meaning that regardless of whether the user inputs text or an image, it produces a feature vector with a unified feature representation, thereby unifying the user input and improving the reliability of image generation guidance based on prompts.
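The input layout and the pairing objective described above can be sketched in plain Python as follows. Only the sequence construction and the sigmoid cross-entropy on a scalar score are shown; the BERT encoder and the two-layer MLP are not implemented, and the scalar `logit` is a hypothetical placeholder for the MLP output before the sigmoid. The function names are assumptions for this example.

```python
import math

def build_sequence(text_tokens, image_tokens):
    """Concatenate tokens as [CLS], t1..tn, [SEP], p1..p16."""
    return ["[CLS]", *text_tokens, "[SEP]", *image_tokens]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def pair_loss(logit, target):
    """Cross-entropy between sigmoid(logit) and a target of 1 (pair) or 0."""
    p = sigmoid(logit)
    return -(target * math.log(p) + (1 - target) * math.log(1 - p))

sequence = build_sequence(["stock", "market"], [f"p{i}" for i in range(1, 17)])
print(len(sequence), sequence[:4])

# A confident score for a true pair gives a small loss; the same score for a
# mismatched pair gives a large one.
print(pair_loss(5.0, 1), pair_loss(5.0, 0))
```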
[0059] After obtaining the corresponding feature vectors through encoding, different pooling operations are performed based on different types of prompts. Specifically, when the prompt is text, max pooling is performed on the feature vectors of the encoded text tokens, selecting the maximum value in each dimension for output, resulting in the corresponding text pooled vector. Max pooling helps highlight the most significant features, reduces feature dimensionality, and retains key semantic information. When the prompt is an image, mean pooling is performed on the feature vectors of the encoded image tokens, calculating the mean in each dimension for output, resulting in the corresponding image pooled vector. That is, if the user inputs an image, mean pooling is performed on the feature vectors of all image tokens, calculating the mean in each dimension, resulting in 16 vectors, which are reduced to 1 vector after mean pooling. Mean pooling helps integrate feature information, reduces feature dimensionality, and retains the overall visual information of the image tokens. In this embodiment, different pooling operations are used to obtain corresponding pooled vectors based on different user inputs. Here, the text pooled vector is labeled T, and the image pooled vector is labeled P.
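The two pooling operations can be sketched as follows, assuming the encoder output is a NumPy array with one row per token. The function name `pool_tokens` and the string labels `"text"` and `"image"` are hypothetical.

```python
import numpy as np

def pool_tokens(token_features, prompt_type):
    """Reduce (num_tokens, dim) features to one (dim,) vector.

    Text prompts use max pooling per dimension; image prompts use mean pooling.
    """
    if prompt_type == "text":
        return token_features.max(axis=0)
    if prompt_type == "image":
        return token_features.mean(axis=0)
    raise ValueError(f"unknown prompt type: {prompt_type}")

features = np.array([[1.0, 5.0], [3.0, 2.0]])
T = pool_tokens(features, "text")    # max per dimension
P = pool_tokens(features, "image")   # mean per dimension
print(T, P)
```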
[0060] After obtaining the text pooling vector T or the image pooling vector P, since the conditional feature vector and the noise image need to be added at the pixel level, the conditional feature vector must have the same spatial dimension as the noise image. For example, assuming the noise image is a 256×256 single-channel grayscale noise image, the generated conditional feature vector must also be a 256×256 two-dimensional matrix to be added element-wise with the noise image. Therefore, the pooling vectors obtained after the pooling operation, including the text pooling vector T or the image pooling vector P, are mapped through a fully connected layer to transform the text pooling vector T or the image pooling vector P into a vector with the same dimension as the noise image. This vector is the conditional feature vector. This ensures the consistency of the dimensionality between the conditional feature vector and the noise image, allowing them to be combined to guide the generation model. That is, in this embodiment, regardless of whether the user inputs text prompts or image prompts, they can be effectively converted into conditional feature vectors. Based on the conditional feature vectors, the image generation model is guided to output a highly relevant target image, ensuring the quality of image generation.
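The mapping through a fully connected layer, followed by the element-wise addition to the noise map, might look like the sketch below. The weights are random placeholders rather than trained parameters, and an 8×8 noise map with a 4-dimensional pooled vector is used only to keep the example small; for a 256×256 noise map the layer would output 65,536 values. All names are assumptions for this example.

```python
import numpy as np

def to_conditional_map(pooled, weight, bias, noise_shape):
    """Fully connected layer: (d,) pooled vector -> matrix shaped like the noise map."""
    return (weight @ pooled + bias).reshape(noise_shape)

rng = np.random.default_rng(0)
d, noise_shape = 4, (8, 8)                      # small sizes for illustration
num_pixels = noise_shape[0] * noise_shape[1]
weight = 0.1 * rng.standard_normal((num_pixels, d))  # placeholder, untrained
bias = np.zeros(num_pixels)

pooled = rng.standard_normal(d)                 # stands in for T or P
noise_map = rng.standard_normal(noise_shape)

conditional_feature = to_conditional_map(pooled, weight, bias, noise_shape)
conditional_noise_map = conditional_feature + noise_map  # element-wise sum
print(conditional_noise_map.shape)
```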
[0061] In one embodiment, as shown in Figure 4, the process of obtaining an image generation model by performing two-stage straight-line path training and conditional fine-tuning on a preset flow model includes:
[0062] S401. Construct the initial flow model and the training image dataset, and generate a training noise map with the same size as the training images;
[0063] S402. Using unpaired data randomly sampled from the training noise map and the training images, train the initial flow model for first-stage straight-line path generation to obtain the first-stage generation model;
[0064] S403. The training noise map is sampled using the first-stage generation model to obtain a sampled image paired with the training noise map;
[0065] S404. Based on the paired training noise map and sampled image, the first-stage generation model is trained for the second-stage straight path generation to obtain the second-stage generation model.
[0066] S405. Obtain the prompt sample and input the prompt sample into the trained conditional encoder for encoding processing, and generate the corresponding training conditional vector according to the encoding result.
[0067] S406. The second-stage generation model is fine-tuned using the training condition vector, the training noise map, and the original image that matches the prompt sample to obtain the image generation model.
[0068] In this embodiment, the training process of the image generation model includes two parts: one part is two-stage straight path training to learn the straight path mapping from one distribution to another; the other part is model fine-tuning training based on conditional features to enable the model to learn how to generate corresponding images according to given conditions (whether text or images).
[0069] In two-stage straight-path training, given samples from two distributions π0 and π1, the goal is to find a transport mapping T such that when Z0 ~ π0 (Z0 belongs to the π0 distribution), Z1 = T(Z0) ~ π1. The mapping T is implicitly defined by the following ordinary differential equation, or flow model:
[0070] $$\frac{dZ_t}{dt} = v(Z_t, t), \quad t \in [0, 1]$$
[0071] $Z_0$, sampled from $\pi_0$, is treated as a particle that moves continuously starting from t=0. At time t it moves with velocity $v(Z_t, t)$ until it reaches $Z_1$ at time t=1, and $Z_1$ follows the distribution $\pi_1$. If $v(Z_t, t)$ is a neural network, then the task of two-stage straight-line path training is to learn $v(Z_t, t)$ from the data so that $Z_1 \sim \pi_1$.
[0072] Suppose we randomly sample from two distributions, with $x_0 \sim \pi_0$ and $x_1 \sim \pi_1$, meaning that $x_0$ and $x_1$ are not paired. We can obtain the following by linear interpolation:
[0073] $$x_t = t x_1 + (1 - t) x_0, \quad t \in [0, 1]$$
[0074] Differentiating it with respect to t yields an ordinary differential equation:
[0075] $$\frac{dx_t}{dt} = x_1 - x_0$$
[0076] A forward-simulated $v(x_t, t)$ is learned to approximate this derivative: v is optimized to minimize the squared error between the velocity functions of the two systems ($v$ and $x_1 - x_0$, respectively). Therefore, the objective loss function for model training can be defined as:
[0077] $$\min_v \int_0^1 \mathbb{E}_{x_0 \sim \pi_0,\, x_1 \sim \pi_1}\left[ \left\| (x_1 - x_0) - v(x_t, t) \right\|^2 \right] dt$$
[0078] The integral from time 0 to time 1 runs from the initial time to the completion time, and $\mathbb{E}$ is the expectation over all samples $x_0$ and $x_1$, where $x_0 \sim \pi_0$ means $x_0$ is sampled from distribution $\pi_0$ and $x_1 \sim \pi_1$ means $x_1$ is sampled from distribution $\pi_1$. $x_1 - x_0$ is the difference between the initial data $x_0$ and the final data $x_1$; $v(x_t, t)$ is the velocity at time t, that is, the rate of change from $x_0$ to $x_t$ at time t; $\|\cdot\|^2$ denotes the L2 loss; and $dt$ is the differential element of time t.
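As a sketch of how this objective could be estimated on a batch, the following NumPy code draws a random time t for each sample, forms the linear interpolation, and averages the squared difference between x1 - x0 and the predicted velocity. The `velocity` argument is a hypothetical callable standing in for the neural network; no gradient step is shown, and the Monte Carlo estimate of the time integral is an assumption of this example.

```python
import numpy as np

def straight_path_loss(velocity, x0, x1, rng):
    """Monte Carlo estimate of the loss for a batch; rows are flattened samples."""
    t = rng.uniform(0.0, 1.0, size=(x0.shape[0], 1))  # one time per sample
    x_t = t * x1 + (1.0 - t) * x0                     # linear interpolation
    residual = (x1 - x0) - velocity(x_t, t)           # label minus prediction
    return float(np.mean(np.sum(residual ** 2, axis=1)))

rng = np.random.default_rng(0)
x0 = rng.standard_normal((32, 4))
shift = np.array([1.0, -2.0, 0.5, 0.0])
x1 = x0 + shift                                       # toy data: x1 - x0 is constant

def perfect_velocity(x_t, t):
    return np.broadcast_to(shift, x_t.shape)          # matches the label exactly

def zero_velocity(x_t, t):
    return np.zeros_like(x_t)

loss_perfect = straight_path_loss(perfect_velocity, x0, x1, rng)
loss_zero = straight_path_loss(zero_velocity, x0, x1, rng)
print(loss_perfect, loss_zero)
```

In the toy data the label x1 - x0 is the same for every sample, so a velocity equal to that constant drives the loss to essentially zero, while a zero velocity leaves the squared norm of the shift.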
[0079] Specifically, during training, an initial flow model and a training image dataset are first constructed. The initial flow model uses the U-net model, which is a reversible model structure conforming to the flow framework. A training noise map is generated through random initialization with Gaussian noise. The size of this training noise map is the same as the size of the training images, for example, 256*256 pixels. The initial flow model is then trained in the first stage of straight-line path generation using unpaired data randomly sampled from the training noise map and the training images. Specifically, random samples are drawn from two distributions, x0 ~ π0 and x1 ~ π1, where x0 is random noise sampled from the distribution π0 of the training noise map and x1 is a random data point sampled from the distribution π1 of the training images. The difference between x1 and x0 is used as the label value for the first stage. At this point, x0 and x1 are not paired. The x0 from the unpaired data is used as the input to the U-net. Each output of the U-net during training is denoted v(x_t, t). Using the objective loss function above, the L2 loss between the velocity v(x_t, t) and the first-stage label value is calculated, and the gradient of the loss is backpropagated to train the parameters of the U-net. When the loss value converges, the first-stage generation model Flow1 is obtained. After this stage is complete, the same U-net model is copied to serve as the starting point for the next stage of training, model distillation.
[0080] In the second stage of straight path generation training, the trained U-net model (i.e., the first-stage generation model Flow1) is first used to sample the training noise map offline, obtaining a sampled image paired with the training noise map. That is, for data x0 in the training noise map, the corresponding data x1 = Flow1(x0) can be sampled by the first-stage generation model. The resulting (x0, x1) is the paired data pair. Based on the paired data x0 in the training noise map and the data x1 in the sampled image, the first-stage generation model is trained for the second stage of straight path generation. At this time, the output x1 = Flow1(x0) of the first-stage generation model is used as the label for the second stage (i.e., x1 in the second stage is not the data of the real training image in the first stage, but the image data generated by Flow1). Model distillation begins, and the input of the model is still the noise image after Gaussian noise, in the same form as the input of the first stage. The L2 loss is calculated using the following formula:
[0081]
[0082] with x_t = t*x1 + (1 - t)*x0
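For reference, a loss consistent with the surrounding description (label x1 - x0, model output v(x_t, t), L2 norm, and the interpolation x_t given in [0082]) can be written as below. This is a reconstruction from those stated definitions, offered as an assumption rather than as the original formula; in the second stage x1 = Flow1(x0).

```latex
\min_{v} \int_{0}^{1} \mathbb{E}_{(x_0,\, x_1)}
\left[ \bigl\| (x_1 - x_0) - v(x_t, t) \bigr\|^{2} \right] dt,
\qquad x_t = t\,x_1 + (1 - t)\,x_0
```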
[0083] The target loss function for the second stage is the same as that for the first stage. The only difference is the data pairing. In the first stage of training, x0 and x1 are randomly or arbitrarily paired. In the second stage, x0 and x1 are paired using the Flow1 model trained in the first stage. That is, x0 has the same meaning in both formulas, but while x1 in the first stage is the training image data, x1 in the second stage is the output data of the first-stage generation model Flow1. After the second-stage training is completed, the second-stage generation model is obtained. This second-stage generation model can then be used as a generation model; that is, given a noise map randomly sampled from a Gaussian distribution as input, the second-stage generation model can output an image of the corresponding input size.
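The offline pairing step can be illustrated with a short sketch. It assumes that sampling with Flow1 amounts to integrating the learned velocity field from t = 0 to t = 1 with fixed Euler steps, which the text does not specify; `sample_with_flow` and `toy_flow1_velocity` are hypothetical names, and the constant velocity field is only a stand-in for the trained U-net.

```python
import numpy as np

def sample_with_flow(velocity_fn, x0, num_steps=10):
    """Integrate dx/dt = v(x, t) from t = 0 to t = 1 with fixed Euler steps."""
    x = np.array(x0, dtype=float)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        x = x + dt * velocity_fn(x, k * dt)
    return x

def toy_flow1_velocity(x, t):
    # Hypothetical stand-in for the trained Flow1 U-net: a constant velocity
    # field, chosen only so that the result is easy to check by hand.
    return np.full_like(x, 2.0)

rng = np.random.default_rng(0)
noise_maps = rng.standard_normal((3, 4, 4))                  # x0 drawn from pi_0
sampled = sample_with_flow(toy_flow1_velocity, noise_maps)   # x1 = Flow1(x0)
second_stage_labels = sampled - noise_maps                   # x1 - x0 on paired data
```

Because each `sampled[i]` is produced from `noise_maps[i]`, the two arrays are paired by index, which is the only difference from the randomly paired first-stage data.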
[0084] Preferably, in both the first and second stages of straight-path generation training, the noise intensity of the training noise map is controlled by a preset hyperparameter, and the noise intensity increases with the number of training steps. That is, when the noise map is randomly initialized with Gaussian noise during training, a hyperparameter is applied together with the Gaussian noise. This hyperparameter controls the intensity of the Gaussian noise and increases with the number of training steps. Initially, the noise intensity is relatively low so that the initial model can become familiar with the semantic features of the original image. As the number of training steps increases, the noise intensity increases linearly, which improves the model's generation capability. The noise intensity E is calculated using the formula E = (i / I) * 0.3, where i is the current training iteration number and I is the total number of iterations set for training.
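The schedule E = (i / I) * 0.3 can be expressed directly in code. In the sketch below, the function names are hypothetical, and applying E as a multiplicative scale on standard Gaussian noise is an assumption; the text states only that the hyperparameter controls the noise intensity.

```python
import numpy as np

def noise_intensity(step, total_steps, max_intensity=0.3):
    """Linear schedule E = (i / I) * 0.3, growing with the training step."""
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    return (step / total_steps) * max_intensity

def scaled_noise_map(shape, step, total_steps, rng):
    # Assumption: the hyperparameter multiplies standard Gaussian noise.
    return noise_intensity(step, total_steps) * rng.standard_normal(shape)

rng = np.random.default_rng(0)
early = scaled_noise_map((4, 4), step=10, total_steps=1000, rng=rng)
late = scaled_noise_map((4, 4), step=1000, total_steps=1000, rng=rng)
```

With this schedule the intensity is 0 at iteration 0 and reaches 0.3 at the final iteration.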
[0085] Furthermore, to improve the quality of generated images, conditional features are used to fine-tune the model and guide image generation. In the fine-tuning stage, prompt samples are first acquired and input into a trained conditional encoder for encoding, and corresponding training conditional vectors are generated from the encoding results. For example, to guide the model in generating specific financial images, specific financial terms, concepts, or small-scale financial datasets can be used as prompt samples. These prompt samples are input into the conditional encoder, which converts them into training conditional vectors that guide the image generation process.
[0086] The specific process of generating training conditional vectors based on different prompt sample types is the same as in the above embodiment. That is, if the prompt sample is text, all text token vectors output by the conditional encoder are max-pooled to obtain a text pooled vector T. If the prompt sample is an image, all image token vectors output by the conditional encoder are mean-pooled to obtain an image pooled vector P. The pooled vector T or P is input into a fully connected network layer and mapped to a vector with the same dimension as the training noise map, resulting in the training conditional vector. Based on the training conditional vector, the training noise map, and the original image matching the prompt sample, the trained second-stage generation model is conditionally fine-tuned, enabling the second-stage generation model to learn how to combine conditional information to generate specific images, thus obtaining an image generation model. The conditionally fine-tuned model can generate highly relevant images based on specific conditional information, improving the professionalism and practicality of the generated images.
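The type-dependent pooling and the mapping to the noise-map size can be sketched as follows. The names `pool_tokens` and `to_condition_vector`, the placeholder weights, and the final reshape to the map shape are assumptions made for illustration; in practice the token vectors would come from the conditional encoder and the fully connected layer would be trained.

```python
import numpy as np

def pool_tokens(token_vectors, prompt_type):
    """Max-pool text tokens or mean-pool image tokens along the token axis."""
    tokens = np.asarray(token_vectors, dtype=float)  # shape (num_tokens, dim)
    if prompt_type == "text":
        return tokens.max(axis=0)    # text pooled vector T
    if prompt_type == "image":
        return tokens.mean(axis=0)   # image pooled vector P
    raise ValueError(f"unsupported prompt type: {prompt_type}")

def to_condition_vector(pooled, weight, bias, map_shape):
    """Fully connected mapping to the noise-map size (placeholder weights)."""
    return (weight @ pooled + bias).reshape(map_shape)

tokens = [[1.0, 5.0], [3.0, 2.0], [2.0, 8.0]]   # three toy token vectors
text_vec = pool_tokens(tokens, "text")
image_vec = pool_tokens(tokens, "image")
cond = to_condition_vector(text_vec, np.ones((4, 2)), np.zeros(4), (2, 2))
```

Pooling removes the dependence on the number of tokens, so prompts of different lengths or types yield a vector of the same dimension before the mapping.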
[0087] In one embodiment, as shown in Figure 5, step S406 includes:
[0088] S501. Sum the training condition vector and the training noise map to obtain the training condition input map;
[0089] S502. Input the training condition input map into the second-stage generation model, and perform a residual connection between the training condition vector and the output image of the second-stage generation model to obtain the condition-enhanced image;
[0090] S503. Calculate the corresponding loss value based on the condition-enhanced image and the original image, and adjust the model parameters until the loss converges to obtain the image generation model.
[0091] In this embodiment, during the conditional fine-tuning stage, the training conditional vector is first summed with the training noise map x0 to obtain a new x0, which is the training conditional input map. This process combines conditional information with noise data during training so that the generative model can take this conditional information into account during generation. For example, if the training conditional vector represents a specific trend in the financial market, combining it with the noise map helps the generative model generate financial images that reflect this trend, thereby improving the relevance and accuracy of the generated images.
[0092] Based on the training conditional input map obtained through summation, the two-stage straight-line path training process is repeated. This time, the training conditional input map, incorporating the conditional vectors, is used as the model input, and the original image matching the prompt sample is used as the label. The difference is that for the model's output image, a residual connection is formed between the training conditional vectors and this output image. That is, by adding the conditional vectors to the output image, the influence of conditional information in the generated image is strengthened. For example, if the initial image does not clearly reflect the trend of the financial market, the residual connection can further strengthen this trend. Based on the conditionally strengthened image and the original image, the corresponding loss value is calculated according to the target loss function of the second stage, and the model parameters are adjusted. By adjusting the parameters of the generation model, the loss value is minimized until the generated image is sufficiently close to the original image, i.e., the loss converges. This yields a finely tuned image generation model. This image generation model can generate relevant images using efficient straight-line path mapping based on specific conditional information, ensuring generation quality while improving generation efficiency.
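Steps S501 to S503 can be sketched for a single sample as follows. This is a simplified illustration: `fine_tune_loss` and `model_fn` are hypothetical names, the model is an arbitrary callable standing in for the second-stage generation model, and a plain pixel-wise mean squared error replaces the second-stage target loss function for brevity.

```python
import numpy as np

def fine_tune_loss(model_fn, cond_vec, noise_map, original_image):
    """One forward pass of conditional fine-tuning (S501 to S503)."""
    cond_input = noise_map + cond_vec   # S501: training condition input map
    output = model_fn(cond_input)       # S502: second-stage model forward pass
    enhanced = output + cond_vec        # S502: residual connection with the condition
    # S503: loss against the original image matching the prompt sample
    # (pixel-wise MSE is an assumption of this sketch).
    loss = float(np.mean((enhanced - original_image) ** 2))
    return enhanced, loss

cond_vec = np.ones((2, 2))
noise_map = np.zeros((2, 2))
original = np.full((2, 2), 2.0)
enhanced, loss = fine_tune_loss(lambda x: x, cond_vec, noise_map, original)
```

The conditional vector enters twice, once in the input and once in the output, which is how the residual connection strengthens the influence of the condition on the generated image.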
[0093] It should be noted that there is no necessary order between the above steps. Those skilled in the art will understand from the description of the embodiments of the present invention that the above steps may have different execution orders in different embodiments, that is, they may be executed in parallel or in turn, etc.
[0094] With further reference to Figure 6, as an implementation of the method shown in Figure 2 above, the present invention provides an embodiment of an image generation device based on conditional coding fusion. This device embodiment corresponds to the method embodiment shown in Figure 2, and the device can be applied to various electronic devices.
[0095] As shown in Figure 6, the image generation device 60 based on conditional coding fusion described in this embodiment includes:
[0096] The information acquisition module 601 is used to acquire the prompt information used for guiding image generation and the randomly initialized noise map;
[0097] The conditional encoding module 602 is used to input the prompt information into the trained conditional encoder for encoding processing, and generate a corresponding conditional feature vector based on the encoding result;
[0098] The conditional fusion module 603 is used to sum the conditional feature vector and the noise map to obtain a conditional noise map;
[0099] Image generation module 604 is used to input the conditional noise map into a trained image generation model to generate a target image corresponding to the conditional noise map. The image generation model is obtained by performing two-stage straight-line path training and conditional fine-tuning on a preset flow model.
[0100] The module referred to in this invention is a series of computer program instruction segments that can perform a specific function, and is more suitable than a program for describing the execution process of image generation based on conditional coding fusion. For the specific implementation of each module, refer to the corresponding method embodiments above, which will not be repeated here.
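The way modules 601 to 604 fit together at inference time can be sketched as a single function. The names `generate_image`, `encoder_fn`, and `model_fn` are hypothetical; both callables are stubs standing in for the trained conditional encoder (including pooling and mapping) and the trained image generation model.

```python
import numpy as np

def generate_image(prompt, encoder_fn, model_fn, map_shape, rng):
    """Compose the four modules of device 60 for one prompt."""
    noise_map = rng.standard_normal(map_shape)   # module 601: random noise map
    cond_vec = encoder_fn(prompt, map_shape)     # module 602: conditional feature vector
    cond_noise = noise_map + cond_vec            # module 603: element-wise sum
    return model_fn(cond_noise)                  # module 604: target image

def stub_encoder(prompt, map_shape):
    # Stub: a constant map whose value depends on the prompt length.
    return np.full(map_shape, float(len(prompt)))

image = generate_image("trend", stub_encoder, lambda x: x,
                       (4, 4), np.random.default_rng(0))
```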
[0101] In one embodiment, the conditional encoding module 602 includes:
[0102] A preprocessing unit is used to preprocess the prompt information according to the type of the prompt information to obtain several prompt tokens;
[0103] The encoding unit is used to input the plurality of prompt tokens into the trained conditional encoder for encoding processing to obtain the feature vectors of the plurality of prompt tokens;
[0104] A pooling unit is used to perform a specified pooling operation on the feature vectors of the plurality of prompt tokens according to the type of the prompt information, so as to obtain the corresponding pooling vector;
[0105] The mapping unit is used to map the pooling vector to a vector with the same dimension as the noise map to obtain the corresponding conditional feature vector.
[0106] In one embodiment, when the prompt information is prompt text, the preprocessing unit is specifically used for:
[0107] The prompt text is segmented into words to obtain several text tokens, which are used as the prompt token.
[0108] The pooling unit is specifically used for:
[0109] Max pooling is performed on the feature vectors of the several text tokens, and the maximum value in each dimension is selected for output to obtain the corresponding text pooling vector.
[0110] In one embodiment, when the prompt information is a prompt image, the preprocessing unit is specifically used for:
[0111] After converting the prompt image to a specified size, perform equal-sized image segmentation to obtain several image tokens as the prompt token;
[0112] The pooling unit is specifically used for:
[0113] The feature vectors of the aforementioned image tokens are subjected to mean pooling, and the mean value in each dimension is calculated and output to obtain the corresponding image pooling vector.
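The equal-sized image segmentation can be sketched as below, assuming the prompt image has already been converted to the specified size and is a single-channel array; `image_to_tokens` and the patch size are hypothetical. Each token here is a flattened patch, which would then be passed to the conditional encoder.

```python
import numpy as np

def image_to_tokens(image, patch):
    """Split an (H, W) image into non-overlapping patch*patch tokens."""
    h, w = image.shape
    if h % patch or w % patch:
        raise ValueError("image size must be a multiple of the patch size")
    # Axes become (row block, column block, row in patch, column in patch).
    blocks = image.reshape(h // patch, patch, w // patch, patch).swapaxes(1, 2)
    return blocks.reshape(-1, patch * patch)   # one flattened row per token

image = np.arange(16).reshape(4, 4)
tokens = image_to_tokens(image, patch=2)       # four tokens of four pixels each
```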
[0114] In one embodiment, the device 60 further includes:
[0115] The building module is used to build the initial flow model and the training image dataset, and to generate a training noise map with the same size as the training images;
[0116] The first-stage training module is used to perform a first-stage straight-path generation training on the initial flow model using unpaired data randomly sampled from the training noise map and training images, to obtain the first-stage generation model.
[0117] The sampling module is used to sample the training noise map through the first stage generation model to obtain a sampled image paired with the training noise map;
[0118] The second-stage training module is used to perform a second-stage straight-path generation training on the first-stage generation model based on the paired training noise map and sampled image, so as to obtain the second-stage generation model.
[0119] The prompt encoding module is used to acquire prompt samples and input the prompt samples into the trained conditional encoder for encoding processing, and generate corresponding training conditional vectors based on the encoding results.
[0120] The conditional fine-tuning module is used to fine-tune the second-stage generation model using the training conditional vector, the training noise map, and the original image that matches the prompt sample, to obtain the image generation model.
[0121] In one embodiment, the device 60 further includes:
[0122] The noise intensity control module is used to control the noise intensity of the training noise map through preset hyperparameters in both the first stage of straight path generation training and the second stage of straight path generation training. The noise intensity increases with the increase of the number of training steps.
[0123] In one embodiment, the conditional fine-tuning module includes:
[0124] The summation unit is used to sum the training condition vector and the training noise map to obtain the training condition input map;
[0125] The input unit is used to input the training condition input image into the second-stage generative model, and to perform a residual connection between the training condition vector and the output image of the second-stage generative model to obtain a condition-enhanced image;
[0126] The loss calculation and parameter tuning unit is used to calculate the corresponding loss value and adjust the model parameters according to the enhanced image and the original image based on the conditions, until the loss converges, thus obtaining the image generation model.
[0127] In the above embodiments, this invention discloses an image generation device based on conditional encoding fusion. It acquires prompting information to guide image generation and a randomly initialized noise map; inputs the prompting information into a trained conditional encoder for encoding processing, and generates corresponding conditional feature vectors based on the encoding results; sums the conditional feature vectors with the noise map to obtain a conditional noise map; inputs the conditional noise map into a trained image generation model to generate a target image corresponding to the conditional noise map. The image generation model is obtained by performing two-stage straight-line path training and conditional fine-tuning on a preset flow model. By encoding the prompting information to obtain the conditional feature vector, fusing it with the noise map, and inputting it into the image generation model for straight-line path image generation, the generated image can be controlled under the guidance of the prompting information, ensuring image generation quality. Simultaneously, the image generation model, through straight-line path transformation mapping, can significantly shorten image generation time and improve image generation efficiency.
[0128] Another embodiment of the present invention provides a computer device. As shown in Figure 7, the computer device 70 includes:
[0129] One or more processors 701 and a memory 702; Figure 7 takes one processor 701 as an example. The processor 701 and the memory 702 can be connected via a bus or other means; Figure 7 takes connection via a bus as an example.
[0130] Processor 701 is used to perform various control logics of computer device 70, and can be a general-purpose processor, digital signal processor (DSP), application-specific integrated circuit (ASIC), field-programmable gate array (FPGA), microcontroller, ARM (Acorn RISC Machine) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or any combination of these components. Furthermore, processor 701 can also be any conventional processor, microprocessor, or state machine. Processor 701 can also be implemented as a combination of computing devices, such as a combination of DSP and microprocessor, multiple microprocessors, one or more microprocessors combined with DSP and / or any other such configuration.
[0131] The memory 702, as a non-volatile computer-readable storage medium, can be used to store non-volatile software programs, non-volatile computer-executable programs, and modules, such as the program instructions corresponding to the image generation method based on conditional coding fusion in the embodiments of the present invention. The processor 701 executes various functional applications and data processing of the computer device 70 by running the non-volatile software programs, instructions, and units stored in the memory 702, thereby implementing the image generation method based on conditional coding fusion in the above method embodiments.
[0132] The memory 702 may include a program storage area and a data storage area. The program storage area may store the operating system and applications required for at least one function; the data storage area may store data created based on the use of the computer device 70. Furthermore, the memory 702 may include high-speed random access memory and may also include non-volatile memory, such as at least one disk storage device, flash memory device, or other non-volatile solid-state storage device. In some embodiments, the memory 702 may optionally include memory remotely located relative to the processor 701, and these remote memories may be connected to the computer device 70 via a network. Examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof. One or more units stored in the memory 702, when executed by one or more processors 701, perform the steps of the image generation method based on conditional coding fusion in any of the above method embodiments.
[0133] In the above embodiments, the present invention discloses a computer device that acquires prompt information for guiding image generation and a randomly initialized noise map; inputs the prompt information into a trained conditional encoder for encoding processing, and generates a corresponding conditional feature vector based on the encoding result; sums the conditional feature vector with the noise map to obtain a conditional noise map; inputs the conditional noise map into a trained image generation model to generate a target image corresponding to the conditional noise map. The image generation model is obtained by performing two-stage straight-line path training and conditional fine-tuning on a preset flow model. By encoding the prompt information to obtain a conditional feature vector, fusing it with the noise map, and then inputting it into the image generation model for straight-line path image generation, the generated image can be controlled under the guidance of the prompt information, ensuring the quality of image generation. Simultaneously, the image generation model, through the transformation mapping of straight-line paths, can significantly shorten the image generation time and improve image generation efficiency.
[0134] This invention provides a non-volatile computer-readable storage medium storing computer-executable instructions. When these computer-executable instructions are executed by one or more processors, they perform the steps of the image generation method based on conditional coding fusion in any of the above method embodiments.
[0135] In the above embodiments, the present invention discloses a non-volatile computer-readable storage medium that acquires prompt information for guiding image generation and a randomly initialized noise map; inputs the prompt information into a trained conditional encoder for encoding processing, and generates a corresponding conditional feature vector based on the encoding result; sums the conditional feature vector with the noise map to obtain a conditional noise map; inputs the conditional noise map into a trained image generation model to generate a target image corresponding to the conditional noise map. The image generation model is obtained by performing two-stage straight-line path training and conditional fine-tuning on a preset flow model. By encoding the prompt information to obtain the conditional feature vector, fusing it with the noise map, and inputting it into the image generation model for straight-line path image generation, the generated image can be controlled under the guidance of the prompt information, ensuring the quality of image generation. Simultaneously, the image generation model, through the transformation mapping of straight-line paths, can significantly shorten the image generation time and improve image generation efficiency.
[0136] Through the above description of the embodiments, those skilled in the art can clearly understand that the methods of the above embodiments can be implemented by means of software plus necessary general-purpose hardware platforms. Of course, they can also be implemented by hardware, but in many cases the former is a better implementation method. Based on this understanding, the technical solution of the present invention, in essence, or the part that contributes to the prior art, can be embodied in the form of a software product. This computer software product is stored in a storage medium (such as ROM / RAM, magnetic disk, optical disk) and includes several instructions to cause a terminal device (which may be a mobile phone, computer, server, air conditioner, or network device, etc.) to execute the methods described in the various embodiments of the present invention.
[0137] This invention can be used in a wide variety of general-purpose or special-purpose computer system environments or configurations. Examples include: personal computers, server computers, handheld or portable devices, tablet devices, multiprocessor systems, microprocessor-based systems, set-top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, and distributed computing environments including any of the above systems or devices. This invention can be described in the general context of computer-executable instructions, such as program modules, that are executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc., that perform specific tasks or implement specific abstract data types. This invention can also be practiced in distributed computing environments where tasks are performed by remote processing devices connected via a communication network. In distributed computing environments, program modules can reside in local and remote computer storage media, including storage devices.
[0138] In summary, the image generation method, apparatus, device, and medium based on conditional encoding fusion disclosed in this invention include: acquiring prompt information for guiding image generation and a randomly initialized noise map; inputting the prompt information into a trained conditional encoder for encoding processing, and generating a corresponding conditional feature vector based on the encoding result; summing the conditional feature vector with the noise map to obtain a conditional noise map; and inputting the conditional noise map into a trained image generation model to generate a target image corresponding to the conditional noise map. The image generation model is obtained by performing two-stage straight-line path training and conditional fine-tuning on a preset flow model. By encoding the prompt information to obtain a conditional feature vector, fusing it with the noise map, and then inputting it into the image generation model for straight-line path image generation, the generated image can be controlled under the guidance of the prompt information, ensuring the quality of image generation. Simultaneously, the image generation model, through the transformation mapping of straight-line paths, can significantly shorten the image generation time and improve image generation efficiency.
[0139] Of course, those skilled in the art will understand that all or part of the processes in the above embodiments can be implemented by a computer program instructing related hardware (such as a processor, controller, etc.). The computer program can be stored in a non-volatile, computer-readable storage medium, and when executed, it can include the processes described in the above method embodiments. The storage medium can be a memory, magnetic disk, floppy disk, flash memory, optical storage, etc.
[0140] It should be noted that any software tools or components not belonging to this company appearing in the embodiments of this application are merely illustrative examples and do not represent actual use. It should be understood that the application of this invention is not limited to the examples described above. Those skilled in the art can make improvements or modifications based on the above description, and all such improvements and modifications should fall within the protection scope of the appended claims.
Claims
1. An image generation method based on conditional coding fusion, characterized in that it includes: Obtain different forms of prompt information used for guiding image generation, as well as a randomly initialized noise map; The prompt information is input into a trained conditional encoder for encoding, which maps the different forms of prompt information to a unified feature representation space, and corresponding conditional feature vectors are generated based on the encoding results; The conditional feature vector and the noise map are summed, the summation being performed element-wise at the pixel level, to obtain the conditional noise map; The conditional noise map is input into the trained image generation model to generate a target image corresponding to the conditional noise map. The image generation model is obtained by performing two-stage straight-line path training and conditional fine-tuning on a preset flow model. The process of obtaining an image generation model by performing two-stage straight-line path training and conditional fine-tuning on a preset flow model includes: Construct an initial flow model and a training image dataset, and generate a training noise map with the same size as the training images; The initial flow model undergoes first-stage straight-path generation training using unpaired data randomly sampled from the training noise map and the training images, thus obtaining the first-stage generation model; The training noise map is sampled using the first-stage generation model to obtain a sampled image paired with the training noise map; The first-stage generation model undergoes second-stage straight-path generation training based on the paired training noise map and sampled image, thus obtaining the second-stage generation model; Prompt samples are obtained and input into the trained conditional encoder for encoding processing, and corresponding training conditional vectors are generated based on the encoding results;
The second-stage generation model is fine-tuned using the training condition vector, the training noise map, and the original image that matches the prompt sample to obtain the image generation model.
2. The image generation method based on conditional coding fusion according to claim 1, characterized in that, The step of inputting the prompt information into a trained conditional encoder for encoding processing, and generating a corresponding conditional feature vector based on the encoding result, includes: The prompt information is preprocessed according to its type to obtain several prompt tokens; The aforementioned prompt tokens are input into the trained conditional encoder for encoding processing to obtain the feature vectors of the aforementioned prompt tokens; According to the type of the prompt information, the feature vectors of the plurality of prompt tokens are subjected to a specified pooling operation to obtain the corresponding pooling vector; The pooling vector is mapped to a vector with the same dimension as the noise map to obtain the corresponding conditional feature vector.
3. The image generation method based on conditional coding fusion according to claim 2, characterized in that, When the prompt message is a prompt text, the preprocessing of the prompt message according to its type to obtain several prompt tokens specifically includes: The prompt text is segmented into words to obtain several text tokens, which are used as the prompt token. The step of performing a specified pooling operation on the feature vectors of the plurality of prompt tokens according to the type of the prompt information to obtain the corresponding pooling vector specifically includes: Max pooling is performed on the feature vectors of the several text tokens, and the maximum value in each dimension is selected for output to obtain the corresponding text pooling vector.
4. The image generation method based on conditional coding fusion according to claim 2, characterized in that, When the prompt information is a prompt image, the preprocessing of the prompt information according to its type yields several prompt tokens, specifically including: After converting the prompt image to a specified size, perform equal-sized image segmentation to obtain several image tokens as the prompt token; The step of performing a specified pooling operation on the feature vectors of the plurality of prompt tokens according to the type of the prompt information to obtain the corresponding pooling vector specifically includes: The feature vectors of the aforementioned image tokens are subjected to mean pooling, and the mean value in each dimension is calculated and output to obtain the corresponding image pooling vector.
5. The image generation method based on conditional coding fusion according to claim 1, characterized in that, The method also includes: In both the first and second stages of straight path generation training, the noise intensity of the training noise map is controlled by preset hyperparameters, and the noise intensity increases with the number of training steps.
6. The image generation method based on conditional coding fusion according to claim 1, characterized in that, The step of fine-tuning the second-stage generation model using the training conditional vector, the training noise map, and the original image matching the prompt sample to obtain the image generation model includes: The training condition vector and the training noise map are summed to obtain the training condition input map; The training condition input map is input into the second-stage generation model, and a residual connection is performed between the training condition vector and the output image of the second-stage generation model to obtain a condition-enhanced image; The corresponding loss value is calculated based on the condition-enhanced image and the original image, and the model parameters are adjusted until the loss converges to obtain the image generation model.
7. An image generation device based on conditional coding fusion, characterized in that it includes: The information acquisition module is used to acquire different forms of prompt information used for guiding image generation and a randomly initialized noise map; The conditional encoding module is used to input the prompt information into the trained conditional encoder for encoding processing, map the different forms of prompt information to a unified feature representation space, and generate corresponding conditional feature vectors based on the encoding results; The conditional fusion module is used to sum the conditional feature vector and the noise map by performing element-wise addition at the pixel level to obtain the conditional noise map; The image generation module is used to input the conditional noise map into the trained image generation model to generate a target image corresponding to the conditional noise map. The image generation model is obtained by performing two-stage straight-line path training and conditional fine-tuning on a preset flow model. The building module is used to build the initial flow model and the training image dataset, and to generate a training noise map with the same size as the training images; The first-stage training module is used to perform a first-stage straight-path generation training on the initial flow model using unpaired data randomly sampled from the training noise map and training images, to obtain the first-stage generation model; The sampling module is used to sample the training noise map through the first-stage generation model to obtain a sampled image paired with the training noise map; The second-stage training module is used to perform a second-stage straight-path generation training on the first-stage generation model based on the paired training noise map and sampled image, so as to obtain the second-stage generation model;
The prompt encoding module is used to acquire prompt samples and input the prompt samples into the trained conditional encoder for encoding processing, and generate corresponding training conditional vectors based on the encoding results. The conditional fine-tuning module is used to fine-tune the second-stage generation model using the training conditional vector, the training noise map, and the original image that matches the prompt sample, to obtain the image generation model.
8. A computer device, characterized in that, Includes at least one processor; and, A memory communicatively connected to the at least one processor; wherein, The memory stores instructions executable by the at least one processor, which, when executed by the at least one processor, enables the at least one processor to perform the image generation method based on conditional coding fusion as described in any one of claims 1-6.
9. A non-volatile computer-readable storage medium, characterized in that, The non-volatile computer-readable storage medium stores computer-executable instructions that, when executed by one or more processors, cause the one or more processors to perform the image generation method based on conditional coding fusion as described in any one of claims 1-6.