Method, device and equipment for generating stylized image generation model, and storage medium

A guided diffusion model is pre-trained and then fine-tuned on images extracted from videos of a specific style. This addresses the problem of generating images in that style and reduces the cost of building the fine-tuning dataset.

CN116740204B · Active · Publication Date: 2026-06-19 · NETEASE (HANGZHOU) NETWORK CO LTD

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents (China)
Current Assignee / Owner: NETEASE (HANGZHOU) NETWORK CO LTD
Filing Date: 2023-03-09
Publication Date: 2026-06-19

AI Technical Summary

Technical Problem

Existing technologies struggle to efficiently generate images of specific styles, and building high-quality model fine-tuning datasets requires significant manpower and time.

Method used

A guided diffusion model is pre-trained and then fine-tuned on a dataset built from images taken from videos of a specific style. The model includes a diffusion noise layer and a guided denoising layer; the fine-tuned guided denoising layer is used to generate images of that style.

Benefits of technology

It achieves efficient generation of images with specific styles, saving manpower and time costs, and constructs a high-quality model fine-tuning dataset.



Abstract

This application discloses a method, apparatus, device, and storage medium for generating a stylized image generation model. The method includes: acquiring multiple first images and first descriptive text for the first images, wherein the multiple first images include images of various image styles; using the first images as pre-training input samples and the first descriptive text as guiding conditions to pre-train a guided diffusion model, the guided diffusion model including a diffusion noise layer and a guided denoising layer; acquiring multiple second images from a video with a target image style, and acquiring second descriptive text for the second images; the second descriptive text includes a description of the target image style; using the second images as fine-tuning input samples and the second descriptive text as guiding conditions to fine-tune the pre-trained guided diffusion model, obtaining a stylized image generation model including a guided denoising layer in the fine-tuned guided diffusion model; the model is used to generate images with the target image style.

Description

Technical Field

[0001] This application relates to the field of artificial intelligence technology, specifically to a method, apparatus, electronic device, and computer-readable storage medium for generating a stylized image generation model.

Background Technology

[0002] AI (Artificial Intelligence) painting has garnered significant attention in the field of artificial intelligence, leading to the emergence of numerous AI painting models and platforms. One applicable scenario for AI painting is the generation of images with a specific style, such as the style of "Big Fish & Begonia" or the style of Hayao Miyazaki, enabling style transfer. Therefore, further developing models capable of generating images with specific styles, building upon existing AI painting models, will become an important downstream task for AI painting.

Summary of the Invention

[0003] This application provides a method, apparatus, electronic device, and computer-readable storage medium for generating stylized image generation models to solve or at least partially solve the aforementioned problems. Details are as follows.

[0004] In a first aspect, this application provides a method for generating a stylized image generation model, the method comprising:

[0005] Acquire multiple first images and acquire first descriptive text for each first image; the multiple first images include images of various image styles;

[0006] Using the first images as pre-training input samples and the first descriptive text as guiding conditions, pre-train a guided diffusion model; the guided diffusion model includes a diffusion noise layer and a guided denoising layer;

[0007] Obtain multiple second images from a video with a target image style, and obtain second descriptive text for the second images; the second descriptive text includes a description of the target image style.

[0008] Using the second images as fine-tuning input samples and the second descriptive text as guiding conditions, fine-tune the pre-trained guided diffusion model to obtain a stylized image generation model that includes the guided denoising layer of the fine-tuned guided diffusion model; the stylized image generation model is used to generate images with the target image style.

[0009] In a second aspect, embodiments of this application also provide an apparatus for generating a stylized image generation model, the apparatus comprising:

[0010] A first acquisition module is used to acquire multiple first images and acquire first descriptive text for each first image; the multiple first images include images of various image styles;

[0011] The pre-training module is used to pre-train the guided diffusion model by using the first image as a pre-training input sample and the first descriptive text as a guiding condition; the guided diffusion model includes a diffusion noise layer and a guided denoising layer.

[0012] The second acquisition module is used to acquire multiple second images from a video with a target image style, and to acquire second descriptive text for the second images; the second descriptive text includes a description of the target image style.

[0013] The fine-tuning module is used to fine-tune the pre-trained guided diffusion model by taking the second images as fine-tuning input samples and the second descriptive text as guiding conditions, to obtain a stylized image generation model including the guided denoising layer of the fine-tuned guided diffusion model; the stylized image generation model is used to generate images with the target image style.

[0014] In a third aspect, embodiments of this application also provide an electronic device, including:

[0015] a processor; and

[0016] a memory for storing a program which, when the electronic device is powered on and the program is run by the processor, performs the method described in the first aspect.

[0017] In a fourth aspect, embodiments of this application also provide a computer-readable storage medium storing a program which, when executed by a processor, performs the method described in the first aspect.

[0018] Compared with the prior art, this application has the following advantages:

[0019] In this embodiment, a guided diffusion model with image generation capabilities is obtained through pre-training. Then, the pre-trained guided diffusion model is fine-tuned using images of a specific style obtained from videos of that style, thereby obtaining a stylized image generation model capable of generating images of that specific style. This stylized image generation model can then generate images with a particular style. Furthermore, by obtaining images of a specific style from videos of that style to construct a dataset for fine-tuning the model, the problem of not being able to find a large number of images for fine-tuning certain styles can be solved. This not only ensures that the fine-tuned model achieves the expected capabilities but also avoids consuming significant manpower and time to construct a high-quality model fine-tuning dataset, saving manpower and time costs associated with model fine-tuning.

Attached Figure Description

[0020] Figure 1 is a flowchart of a method for generating a stylized image generation model provided in an embodiment of this application;

[0021] Figure 2 is a schematic diagram of a guided diffusion model provided in an embodiment of this application;

[0022] Figure 3 is a schematic diagram of another guided diffusion model provided in an embodiment of this application;

[0023] Figure 4 is a schematic diagram of a stylized image generation model provided in an embodiment of this application;

[0024] Figure 5 is a block diagram of a stylized image generation model generation device provided in an embodiment of this application;

[0025] Figure 6 is a schematic diagram of the logical structure of an electronic device for generating a stylized image generation model, provided in an embodiment of this application.

Detailed Implementation

[0026] Many specific details are set forth in the following description to provide a full understanding of this application. However, this application can be implemented in many other ways different from those described herein, and those skilled in the art can make similar extensions without departing from the spirit of this application; therefore, this application is not limited to the specific embodiments disclosed below.

[0027] This application provides a method for generating a stylized image generation model. As shown in Figure 1, the method includes the following steps S10 to S40.

[0028] Step S10: Obtain multiple first images and obtain the first description text of the first images; the multiple first images include images of various image styles.

[0029] In this embodiment of the application, multiple first images, such as multiple realistic images, and a first descriptive text corresponding to each first image can be obtained. The first descriptive text can be obtained by annotating the first image and is used to describe the key points or characteristics expressed by the first image, so as to better understand the information in the first image.

[0030] For example, if the first image is a portrait, and the person is the focus of the first image, then the first descriptive text of the first image can describe some features or characteristics of the person, such as "little girl, front view, running". As another example, if the first image is a landscape, and the landscape is the focus of the first image, then the first descriptive text of the first image can describe some features or characteristics of the landscape, such as "sea, sunset".

[0031] In this embodiment, the first images are used to pre-train a basic image generation model. After pre-training on a large amount of image data, the image generation model has strong semantic capabilities and can generate high-quality images. Because the first images serve only to give the basic model general image generation capability, rather than the ability to generate images of one specific style, the acquired first images include images of various image styles.

[0032] Step S20: Use the first image as a pre-training input sample and the first descriptive text as a guiding condition to pre-train the guided diffusion model; the guided diffusion model includes a diffusion noise layer and a guided denoising layer.

[0033] In this embodiment of the application, the image generation model can be a guided diffusion model. After obtaining the first image and its corresponding first descriptive text, the first image can be used as a pre-training input sample, and the first descriptive text can be used as a guiding condition to pre-train the initial guided diffusion model.

[0034] Guided diffusion models are a type of diffusion model. The principle of diffusion models is to add noise to an image, learn the image information attenuation caused by the noise, and then use the learned pattern to generate an image. Therefore, the inference process of a diffusion model can generate an image from a randomly given noisy image through denoising. However, randomly inputting a noisy image obviously cannot generate the desired image as intended, thus requiring additional guidance conditions to obtain the desired image. These guidance conditions are used to guide the denoising process of the diffusion model, thereby obtaining the desired output. Therefore, a guided diffusion model is a diffusion model that can generate an image under the guidance of guidance conditions.

[0035] In this embodiment of the application, the first image can be used as a pre-training input sample, and the first descriptive text used to describe the key points or specific details in the first image can be used as a guiding condition to input the initial guided diffusion model. This allows the guided diffusion model to add noise to the first image through a diffusion noise layer, and then learn the image information attenuation caused by noise under the guidance of the first descriptive text through a guided denoising layer, thereby learning the pattern of generating images and completing the pre-training of the guided diffusion model.

[0036] Optionally, the guided diffusion model can be a text-guided Stable Diffusion model.

[0037] Step S30: Obtain multiple second images from the video with the target image style, and obtain second descriptive text for the second images; the second descriptive text includes a description of the target image style.

[0038] After pre-training is complete, the pre-trained guided diffusion model can be fine-tuned so that an image generation model that already has strong general image generation capability also gains the ability to generate images of a specific style.

[0039] In this embodiment, when it is necessary to generate a stylized image generation model with the image generation capability of a target image style, multiple second images can be obtained from a video with the target image style, as well as second descriptive text for the second images. The second descriptive text can be obtained by annotating the second images, at least to describe the image style of the second images, and can also be used to describe the key points or characteristics expressed by the second images, so as to better understand the information in the second images, especially the image style of the second images.

[0040] The difference between the acquired multiple second images and the acquired multiple first images is that the acquired multiple first images have multiple image styles, while the acquired multiple second images have a specific image style, namely the target image style. In other words, the image styles of the acquired multiple first images are diverse, while the image styles of the acquired multiple second images are uniform.

[0041] Based on the differences between the second image and the first image, the difference between the second descriptive text of the second image and the first descriptive text of the first image is that the first descriptive text does not need to include a description of the image style of the first image, while the second descriptive text needs to include a description of the image style of the second image. This is because, in this embodiment, it is necessary to guide the generation process of an image with a specific image style through a description of that specific image style. For example, the second descriptive text could be "Big Fish & Begonia style, little girl".
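The requirement that the second descriptive text carry the style description can be illustrated with a small helper. This is only a sketch: the function name `build_second_description` and the "style first, then content tags" layout are assumptions modeled on the example caption above, not a format prescribed by the application.

```python
def build_second_description(style: str, content_tags: list[str]) -> str:
    """Build a second descriptive text: style description first, then content tags.

    Hypothetical helper; the caption layout follows the example
    "Big Fish & Begonia style, little girl" from the text.
    """
    style = style.strip()
    if not style:
        # The second descriptive text must describe the target image style.
        raise ValueError("a target image style description is required")
    tags = [tag.strip() for tag in content_tags if tag.strip()]
    return ", ".join([f"{style} style"] + tags)


print(build_second_description("Big Fish & Begonia", ["little girl"]))
# prints: Big Fish & Begonia style, little girl
```

A first descriptive text, by contrast, could be built from the content tags alone, since it does not need to mention the image style.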

[0042] In addition, many image-based models are currently trained or fine-tuned on images crawled from image URLs. However, for some styles it is difficult to find a large number of images this way, for example the specific animation (or anime) style of a certain animated (or anime) movie, such as the style of Big Fish & Begonia or the style of Makoto Shinkai. As a result, constructing high-quality image data for fine-tuning a model toward a specific style requires a lot of manpower and time.

[0043] In this embodiment of the application, by obtaining images of a specific style from videos of a specific style and constructing a dataset for fine-tuning the model, the problem of not being able to find a large number of images for fine-tuning the model for certain styles can be solved. This not only enables the fine-tuned model to achieve the expected capabilities, but also avoids consuming a lot of manpower and time to construct a high-quality model fine-tuning dataset, thus saving manpower and time costs for model fine-tuning.

[0044] Step S40: Using the second images as fine-tuning input samples and the second descriptive text as guiding conditions, fine-tune the pre-trained guided diffusion model to obtain a stylized image generation model that includes the guided denoising layer of the fine-tuned guided diffusion model; the stylized image generation model is used to generate images with the target image style.

[0045] In this embodiment, a second image with the target image style can be used as a fine-tuning input sample, and a second descriptive text that includes a description of the target image style can be used as a guiding condition. Both are input into the pre-trained guided diffusion model, so that the model adds noise to the second image through the diffusion noise layer and then, through the guided denoising layer, learns the image information attenuation caused by the noise under the guidance of the second descriptive text. This allows the model to learn the pattern of generating images with the target image style, thus completing the fine-tuning of the guided diffusion model.

[0046] By removing the diffusion noise layer from the fine-tuned guided diffusion model and retaining the guided denoising layer, a stylized image generation model for generating images with the target image style can be obtained.
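The relationship between the fine-tuned guided diffusion model and the stylized image generation model can be sketched as a pair of data structures. The class names, the `Latent` alias, and the callable signatures below are illustrative assumptions; the application only states that the noise layer is dropped and the guided denoising layer is kept.

```python
from dataclasses import dataclass
from typing import Callable, List

Latent = List[float]  # stand-in for a feature image (assumption)


@dataclass
class GuidedDiffusionModel:
    # Diffusion noise layer: only needed while training.
    add_noise: Callable[[Latent], Latent]
    # Guided denoising layer: denoises under the guidance of descriptive text.
    guided_denoise: Callable[[Latent, str], Latent]


@dataclass
class StylizedImageGenerationModel:
    # Only the guided denoising layer is retained for generation.
    guided_denoise: Callable[[Latent, str], Latent]

    def generate(self, noise: Latent, target_text: str) -> Latent:
        return self.guided_denoise(noise, target_text)


def to_stylized_model(fine_tuned: GuidedDiffusionModel) -> StylizedImageGenerationModel:
    """Drop the diffusion noise layer and keep the guided denoising layer."""
    return StylizedImageGenerationModel(guided_denoise=fine_tuned.guided_denoise)
```

Keeping the two types separate makes it explicit that the generation path never touches the noise layer.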

[0047] The method for generating a stylized image generation model provided in this application proceeds as follows. First, multiple first images and their first descriptive texts are acquired, the multiple first images including images of various image styles. Next, the first images are used as pre-training input samples, and the first descriptive texts are used as guiding conditions, to pre-train a guided diffusion model that includes a diffusion noise layer and a guided denoising layer. Then, multiple second images are acquired from a video with a target image style, together with second descriptive texts for the second images, the second descriptive texts including descriptions of the target image style. Finally, the second images are used as fine-tuning input samples, and the second descriptive texts are used as guiding conditions, to fine-tune the pre-trained guided diffusion model; the guided denoising layer of the fine-tuned model is retained, yielding a stylized image generation model that can be used to generate images with the target image style.

[0048] In this embodiment, a guided diffusion model with image generation capabilities is obtained through pre-training. Then, the pre-trained guided diffusion model is fine-tuned using images of a specific style obtained from videos of that style, thereby obtaining a stylized image generation model capable of generating images of that specific style. This stylized image generation model can then generate images with a particular style. Furthermore, by obtaining images of a specific style from videos of that style to construct a dataset for fine-tuning the model, the problem of not being able to find a large number of images for fine-tuning certain styles can be solved. This not only ensures that the fine-tuned model achieves the expected capabilities but also avoids consuming significant manpower and time to construct a high-quality model fine-tuning dataset, saving manpower and time costs associated with model fine-tuning.

[0049] Based on the above embodiments, step S20 may optionally be implemented in the following ways, including:

[0050] S21: Using the first image as a pre-training input sample and the first descriptive text as a guiding condition, input the first image and the first descriptive text into the guided diffusion model so that the guided diffusion model outputs the first reconstructed image of the first image.

[0051] Specifically, the guided diffusion model obtains the first reconstructed image of the first image through the following steps:

[0052] Noise is added to the first image by the diffusion noise layer;

[0053] The guided denoising layer denoises the noised first image under the guidance of the first descriptive text, learning the image information attenuation characteristics of the first image caused by the added noise, and the first reconstructed image of the first image is obtained.

[0054] The above steps are explained in detail below through a specific embodiment. Referring to the guided diffusion model illustrated in Figure 2, the diffusion noise layer in the guided diffusion model can include an image encoding layer and an image noise layer. The image encoding layer encodes the image to obtain a first feature image. The image noise layer progressively adds Gaussian noise to the first feature image to obtain a Gaussian noise image.

[0055] Referring again to Figure 2, the guided denoising layer in the guided diffusion model can specifically include a text encoding layer, an image denoising layer, and an image decoding layer. The text encoding layer encodes the descriptive text of the image to obtain a text feature vector. The image denoising layer, guided by the text feature vector, progressively denoises the Gaussian noise image of the image based on a cross-attention mechanism to obtain a second feature image. The image decoding layer decodes the second feature image to reconstruct or generate the image.

[0056] In step S21, the first image is input into the diffusion noise layer of the guided diffusion model. The image encoding layer encodes the first image. During encoding, the probability distribution of the first image is first obtained, and then the image features of the first image are compressed according to this probability distribution to obtain a first feature image with the same probability distribution as the first image. For example, 8x compression can be used, compressing a 512*512*3 image feature into a 64*64*4 image feature; the 8x factor applies to the height and width, while the third dimension is the number of channels and is unrelated to the compression factor. The first feature image of the first image retains the main feature information of the first image. Optionally, as in the guided diffusion model shown in Figure 3, the image encoding layer can be implemented using an image encoder. Specifically, the image encoder can be the image encoder of a variational autoencoder (VAE), which includes an image encoder and an image decoder.
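The shape arithmetic in the 8x compression example can be checked with a short sketch. The function name `latent_shape` and its defaults are assumptions chosen to match the 512*512*3 to 64*64*4 example; only the height and width are divided by the compression factor.

```python
def latent_shape(height: int, width: int, factor: int = 8, latent_channels: int = 4):
    """Return the (height, width, channels) of the compressed feature image.

    The spatial dimensions shrink by `factor`; the channel count is a
    separate setting and is not derived from the compression factor.
    """
    if height % factor or width % factor:
        raise ValueError("height and width must be multiples of the compression factor")
    return (height // factor, width // factor, latent_channels)


shape = latent_shape(512, 512)
print(shape)  # prints: (64, 64, 4)

# Number of values before and after encoding.
pixels = 512 * 512 * 3
latents = shape[0] * shape[1] * shape[2]
print(pixels // latents)  # prints: 48
```

So although each spatial side shrinks by 8, the feature image in this example holds 48 times fewer values than the input image.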

[0057] Then, the first feature image of the first image is input to the image noise layer. The image noise layer gradually adds Gaussian noise to the first feature image to obtain the Gaussian noise image of the first image. For the guided diffusion model, this process is also called the forward process or the diffusion process.
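A minimal sketch of the forward (diffusion) process follows. The application does not specify a noise schedule, so the linear beta schedule and the per-step update x_t = sqrt(1 - beta_t) * x_{t-1} + sqrt(beta_t) * eps are assumptions borrowed from the common DDPM formulation; the feature image is flattened to a plain list for simplicity.

```python
import math
import random


def linear_betas(num_steps: int, beta_start: float = 1e-4, beta_end: float = 0.02):
    """Assumed linear noise schedule (not specified in the source text)."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def add_noise_progressively(feature_image, betas, rng):
    """Add Gaussian noise step by step and return every intermediate state."""
    x = list(feature_image)
    trajectory = [x]
    for beta in betas:
        keep = math.sqrt(1.0 - beta)   # how much of the previous state survives
        scale = math.sqrt(beta)        # standard deviation of the added noise
        x = [keep * v + scale * rng.gauss(0.0, 1.0) for v in x]
        trajectory.append(x)
    return trajectory


betas = linear_betas(1000)
trajectory = add_noise_progressively([1.0] * 16, betas, random.Random(0))
# Fraction of the original signal left after all steps.
remaining_signal = math.prod(math.sqrt(1.0 - b) for b in betas)
```

With this schedule the remaining signal fraction is far below 1, which is the "image information attenuation" that the guided denoising layer later learns to reverse.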

[0058] In step S21, the first descriptive text and the Gaussian noise image of the first image obtained through the diffusion noise layer are input into the guided denoising layer of the guided diffusion model. Specifically, the first descriptive text is input into the text encoding layer of the guided denoising layer. The text encoding layer encodes the first descriptive text, thereby extracting the main feature information of the first descriptive text and obtaining the text feature vector of the first descriptive text. Optionally, as in the guided diffusion model shown in Figure 3, the text encoding layer can be implemented using the text encoder of the CLIP (Contrastive Language-Image Pre-training) model. Optionally, when the descriptive text is in Chinese, the text encoder of the CLIP model should be a Chinese text encoder.

[0059] Then, the Gaussian noise image of the first image and the text feature vector of the first descriptive text are input into the image denoising layer. A cross-attention mechanism can be introduced into the image denoising layer, allowing it to progressively denoise the Gaussian noise image of the first image based on this mechanism. This progressive denoising process is guided by the text feature vector of the first descriptive text, thus reconstructing a feature image with the same probability distribution as the first image, i.e., the second feature image of the first image. Optionally, as in the guided diffusion model shown in Figure 3, the image denoising layer can be implemented using a U-Net model.
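The cross-attention mechanism mentioned above can be sketched in a few lines. This is a simplified single-head version with no learned projection matrices, which is an assumption made for brevity: queries come from positions of the noisy feature image, while keys and values come from the text feature vectors.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(image_queries, text_keys, text_values):
    """Scaled dot-product attention from image positions to text tokens."""
    dim = len(text_keys[0])
    outputs = []
    for query in image_queries:
        # Similarity of this image position to every text token.
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in text_keys
        ]
        weights = softmax(scores)
        # Weighted mix of the text values: this is how text guides denoising.
        mixed = [
            sum(w * value[j] for w, value in zip(weights, text_values))
            for j in range(len(text_values[0]))
        ]
        outputs.append(mixed)
    return outputs


attended = cross_attention(
    image_queries=[[1.0, 0.0]],
    text_keys=[[10.0, 0.0], [0.0, 10.0]],
    text_values=[[1.0, 0.0], [0.0, 1.0]],
)
```

In this toy call the single image position aligns with the first text token, so the output is dominated by that token's value vector.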

[0060] Subsequently, the second feature image of the first image is input to the image decoding layer. The image decoding layer can decode the second feature image of the first image to obtain the first reconstructed image of the first image. Optionally, as in the guided diffusion model shown in Figure 3, the image decoding layer can be implemented using an image decoder. Specifically, the image decoder can be the image decoder of a variational autoencoder (VAE).

[0061] In short, during the pre-training phase, the guided diffusion model obtains a Gaussian noise image by progressively adding noise to the first image, and then, guided by the first descriptive text, reconstructs the first image by progressively denoising the Gaussian noise image.

[0062] S22: Determine the first loss function value of the guided diffusion model based on the similarity between the first reconstructed image and the first image.

[0063] In this step, the loss function of the guided diffusion model represents its image reconstruction capability. The higher the similarity between the reconstructed image and the original image, the stronger the image reconstruction capability of the guided diffusion model. During the pre-training phase, the lower the similarity between the reconstructed image and the original image, the larger the first loss function value of the guided diffusion model; conversely, the higher the similarity between the reconstructed image and the original image, the smaller the first loss function value of the guided diffusion model. In this embodiment, the loss function value of the guided diffusion model is calculated once for each first reconstructed image of a first image.
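One simple way to realize a loss that grows as similarity drops is the mean squared error between the original and the reconstructed image. The application does not name a specific loss, so the choice of mean squared error and the flattened-list image representation are assumptions for illustration.

```python
def reconstruction_loss(original, reconstructed):
    """Mean squared error: 0 for identical images, larger for dissimilar ones."""
    if len(original) != len(reconstructed):
        raise ValueError("images must have the same number of values")
    squared = [(a - b) ** 2 for a, b in zip(original, reconstructed)]
    return sum(squared) / len(squared)


image = [0.2, 0.8, 0.5, 0.1]
close = [0.25, 0.75, 0.5, 0.1]   # high similarity to the original
far = [0.9, 0.1, 0.0, 0.9]       # low similarity to the original

loss_close = reconstruction_loss(image, close)
loss_far = reconstruction_loss(image, far)
```

Here `loss_close` is smaller than `loss_far`, matching the stated relationship between similarity and the loss function value.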

[0064] It should be noted that, in order to distinguish it from the loss function value of the guided diffusion model in the fine-tuning stage, the loss function value of the guided diffusion model calculated in the pre-training stage is called the first loss function value, and the loss function value of the guided diffusion model calculated in the fine-tuning stage is called the second loss function value.

[0065] S23: Adjust the model parameters of the guided diffusion model according to the first loss function value to achieve pre-training of the guided diffusion model.

[0066] Each time a first image is reconstructed, resulting in a first reconstructed image, the model parameters of the guided diffusion model are adjusted once, following a strategy of gradually reducing the value of the first loss function to less than a first preset value. This process of adjusting the model parameters multiple times based on all first images completes the pre-training of the guided diffusion model.
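The parameter-adjustment strategy can be sketched with a deliberately tiny stand-in model. Everything below is a toy under stated assumptions: the "model" is a single weight that reconstructs an image as `weight * image`, the loss is mean squared error, and plain gradient descent is used; none of these choices come from the application, which only requires one adjustment per reconstructed image until the loss falls below a preset value.

```python
def pretrain(samples, weight=0.0, learning_rate=0.1,
             first_preset_value=1e-4, max_epochs=1000):
    """Adjust the parameter once per sample until the loss is below the preset."""
    worst_loss = float("inf")
    for _ in range(max_epochs):
        worst_loss = 0.0
        for image in samples:
            reconstructed = [weight * v for v in image]
            errors = [r - v for r, v in zip(reconstructed, image)]
            loss = sum(e * e for e in errors) / len(image)
            # Gradient of the mean squared error with respect to the weight.
            grad = sum(2.0 * e * v for e, v in zip(errors, image)) / len(image)
            weight -= learning_rate * grad   # one adjustment per reconstructed image
            worst_loss = max(worst_loss, loss)
        if worst_loss < first_preset_value:
            break
    return weight, worst_loss


weight, final_loss = pretrain([[1.0, 0.5], [0.2, -0.4]])
```

Fine-tuning could reuse the same loop, starting from the pre-trained weight, feeding the second images, and comparing against a second preset value.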

[0067] Furthermore, the fine-tuning process of the guided diffusion model is similar to the pre-training process of the guided diffusion model; that is, step S40 can be implemented in the following ways:

[0068] S41: Use the second image as the fine-tuning input sample and the second descriptive text as the guiding condition. Input the second image and the second descriptive text into the pre-trained guided diffusion model so that the pre-trained guided diffusion model outputs the second reconstructed image of the second image.

[0069] Specifically, the guided diffusion model obtains the second reconstructed image of the second image through the following steps:

[0070] Noise is added to the second image by the diffusion noise layer;

[0071] The guided denoising layer denoises the noised second image under the guidance of the second descriptive text, learning the image information attenuation characteristics of the second image caused by the added noise, and the second reconstructed image of the second image is obtained.

[0072] The above steps are explained in detail below through a specific embodiment, with reference to step S21 and to Figure 2 and Figure 3. In step S41, the second image is input into the diffusion noise layer of the guided diffusion model. The image encoding layer encodes the second image. During the encoding process, the probability distribution of the second image is first obtained, and then the image features of the second image are compressed based on this probability distribution to obtain a first feature image with the same probability distribution as the second image. The first feature image of the second image retains the main feature information of the second image.

[0073] Then, the first feature image of the second image is input to the image noise layer. The image noise layer gradually adds Gaussian noise to the first feature image to obtain the Gaussian noise image of the second image. For the guided diffusion model, this process is also called the forward process or the diffusion process.

[0074] In step S41, the second descriptive text and the Gaussian noise image of the second image obtained through the diffusion noise layer are input into the guided denoising layer in the guided diffusion model. Specifically, the second descriptive text is input into the text encoding layer within the guided denoising layer. The text encoding layer encodes the second descriptive text, thereby extracting its main feature information, especially the style description features of the second image, to obtain the text feature vector of the second descriptive text.

[0075] Then, the Gaussian noise image of the second image and the text feature vector of the second descriptive text are input into the image denoising layer. The image denoising layer can perform stepwise denoising on the Gaussian noise image of the second image based on the cross-attention mechanism. The stepwise denoising process requires the guidance of the text feature vector of the second descriptive text, so that a feature image with the same probability distribution as the second image can be reconstructed, which is the second feature image of the second image.

[0076] Then, the second feature image of the second image is input into the image decoding layer, which can decode the second feature image of the second image to obtain the second reconstructed image of the second image.

[0077] In short, during the fine-tuning phase, the guided diffusion model obtains a Gaussian noise image by gradually adding noise to the second image, and then, guided by the second descriptive text, reconstructs the second image by gradually denoising the Gaussian noise image.

[0078] S42: Determine the second loss function value of the pre-trained guided diffusion model based on the similarity between the second reconstructed image and the second image.

[0079] In the embodiments of this application, the loss function value of the guided diffusion model is calculated once for each second reconstructed image of a second image.

[0080] S43: Adjust the model parameters of the pre-trained guided diffusion model according to the value of the second loss function to achieve fine-tuning of the pre-trained guided diffusion model.

[0081] Each time a second image is reconstructed, a second reconstructed image is obtained. Following a strategy of gradually reducing the value of the second loss function to below a second preset value, the model parameters of the pre-trained guided diffusion model are adjusted. This process of adjusting the model parameters multiple times based on all second images completes the fine-tuning of the pre-trained guided diffusion model, resulting in a stylized image generation model, such as... Figure 4 As shown.

[0082] Furthermore, embodiments of this application also provide an inference process that uses the stylized image generation model, that is, a process of generating images with a specific style through the stylized image generation model. Referring to Figure 4, the inference process specifically includes:

[0083] Obtain target description text including descriptions of the target image style, and obtain random Gaussian noise images;

[0084] A random Gaussian noise image and target description text are input into a stylized image generation model so that the stylized image generation model outputs an image with the style of the target image under the guidance of the target description text.

[0085] The process involves providing a target descriptive text that describes the style of the target image, and randomly generating a Gaussian noise image. Both are input into a stylized image generation model. The model then progressively denoises the random Gaussian noise image and decodes it to generate an image with the style of the target image. The progressive denoising process is guided by the target descriptive text.
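The control flow of this inference process can be sketched as below. The function `generate` and its parameters are hypothetical, and the three `toy_*` functions are stand-ins so that the sketch runs; they do not model a real text encoder, denoiser or decoder.

```python
import random

def generate(prompt, encode_text, denoise_step, decode, size, steps, rng):
    """Denoise a random Gaussian image step by step under text guidance, then decode."""
    text_vec = encode_text(prompt)
    x = [rng.gauss(0.0, 1.0) for _ in range(size)]    # random Gaussian noise image
    for t in reversed(range(steps)):
        x = denoise_step(x, t, text_vec)               # each step is guided by the text
    return decode(x)

# Stand-in components (assumptions); a trained model would supply the real ones.
def toy_encode_text(prompt):
    return [float(len(prompt))]

def toy_denoise_step(x, t, text_vec):
    return [0.5 * v for v in x]

def toy_decode(x):
    return list(x)

image = generate("a street in the target style", toy_encode_text,
                 toy_denoise_step, toy_decode, size=16, steps=10,
                 rng=random.Random(0))
```

Only the guided denoising side of the model is exercised here, which matches the statement that the stylized image generation model includes the guided denoising layer.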

[0086] Furthermore, the specific method for acquiring the second image is described in this embodiment, as follows.

[0087] Optionally, step S30 can be implemented in the following ways:

[0088] S31: Extract keyframes from the video with the style of the target image to obtain multiple second images.

[0089] In this embodiment, for videos with distinct styles, such as those resembling Big Fish & Begonia, Hayao Miyazaki animation, Makoto Shinkai manga, or One Piece anime, keyframes can be extracted from the video and used as the second images for fine-tuning the model.

[0090] Keyframe extraction ensures that the image frames obtained from the video contain effective image information, removes black screens, transitions, and redundant image frames, and avoids model overfitting caused by repeated content in the model fine-tuning dataset.

[0091] Further, optionally, prior to step S31, the generation of the stylized image generation model may further include the following steps:

[0092] S32: Determine the beginning and end video segments of the video;

[0093] S33: Perform image information richness detection on the image frames in the beginning and end video segments respectively;

[0094] S34: When the image information richness of the image frames in the opening video segment is less than the preset richness threshold, the opening video segment will be deleted from the video.

[0095] S35: When the image information richness of the image frames in the ending video segment is less than the preset richness threshold, the ending video segment will be deleted from the video.

[0096] In this embodiment, for videos such as movies, the beginning and end of the video are often used to display subtitles, which is not conducive to capturing the video style. Therefore, the image information richness of the beginning and end of the video can be detected. A low value indicates that the segment carries little image information, which makes the video style harder to capture. A beginning or ending segment with low image information richness can therefore be deleted, which prevents the subtitles at the beginning and end of the video from affecting the image generation effect of the model.

[0097] Specifically, the beginning and ending video segments can be determined based on the video's length. For example, the segment that occupies 3% of the video's length from the beginning can be designated as the beginning video segment, and the segment that occupies 3% of the video's length from the end can be designated as the ending video segment.
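As a minimal sketch of this segment selection, the function below returns frame-index ranges for the two segments. The name `edge_segments`, the use of frame indices rather than timestamps, and the cap that keeps the two segments from overlapping are assumptions added for illustration.

```python
def edge_segments(total_frames, fraction=0.03):
    """Return (beginning, ending) frame-index ranges, each covering `fraction` of the video."""
    n = max(1, round(total_frames * fraction))
    n = min(n, total_frames // 2)          # keep the two segments from overlapping
    return range(0, n), range(total_frames - n, total_frames)

beginning, ending = edge_segments(1000)    # frames 0..29 and 970..999
```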

[0098] Then, image frames in the opening and closing video segments are subjected to image information richness detection. Optionally, image frames can be randomly selected from the opening and closing video segments, and image information richness detection can be performed on the selected image frames. This avoids detecting too many image frames. In one embodiment, image information richness detection of image frames can be achieved using the following formula:

[0099] $rg = R - G$

[0100] $yb = \frac{1}{2}(R + G) - B$

[0101] $\sigma_{rgyb} = \sqrt{\sigma_{rg}^{2} + \sigma_{yb}^{2}}$

[0102] $\mu_{rgyb} = \sqrt{\mu_{rg}^{2} + \mu_{yb}^{2}}$

[0103] $C = \sigma_{rgyb} + 0.3 \cdot \mu_{rgyb}$

[0104] In the above formulas, R, G, and B represent the R (red) channel component, G (green) channel component, and B (blue) channel component of a pixel in an image frame, respectively. For ease of description, rg is referred to as the first parameter and yb as the second parameter. $\sigma_{rg}$ represents the standard deviation of the first parameter over all pixels in an image frame, $\sigma_{yb}$ represents the standard deviation of the second parameter over all pixels in an image frame, $\mu_{rg}$ represents the average value of the first parameter over all pixels in an image frame, $\mu_{yb}$ represents the average value of the second parameter over all pixels in the image frame, and C represents the image information richness of the image frame.
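The richness computation can be sketched in a few lines. This sketch assumes the opponent-color definitions rg = R - G and yb = (R + G)/2 - B and the root-sum-of-squares combinations of the Hasler and Süsstrunk colorfulness measure, which the final formula C = σ_rgyb + 0.3·μ_rgyb matches. The function name `richness`, the pixel-tuple input format, and the use of the population standard deviation are illustrative assumptions.

```python
import math
from statistics import fmean, pstdev

def richness(pixels):
    """Image information richness C for a list of (R, G, B) pixel tuples."""
    rg = [r - g for r, g, b in pixels]                 # first parameter
    yb = [0.5 * (r + g) - b for r, g, b in pixels]     # second parameter
    sigma = math.hypot(pstdev(rg), pstdev(yb))         # sigma_rgyb
    mu = math.hypot(fmean(rg), fmean(yb))              # mu_rgyb
    return sigma + 0.3 * mu

black_frame = [(0, 0, 0)] * 4
red_frame = [(255, 0, 0)] * 4
```

An all-black frame scores zero under this measure, while a saturated single-color frame scores well above zero because its mean term is large even though its standard deviation term vanishes.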

[0105] If the image information richness of the beginning video segment is less than a preset richness threshold (e.g., 20), the beginning video segment is deleted from the video; otherwise, it is retained. Likewise, if the image information richness of the ending video segment is less than the preset richness threshold, the ending video segment is deleted from the video; otherwise, it is retained.
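The deletion rule can be sketched as follows, under stated assumptions: each segment covers 3% of the frames, a few frames are sampled at random from each segment, and the richness of a segment is taken as the mean over the sampled frames. The text does not say how per-frame values are aggregated, so the averaging is an assumption, as are the names `trim_video`, `richness_of` and `samples`.

```python
import random

def trim_video(frames, richness_of, fraction=0.03, threshold=20.0, samples=5, seed=0):
    """Drop the beginning and/or ending segment when its sampled richness is below threshold."""
    rng = random.Random(seed)
    n = min(max(1, round(len(frames) * fraction)), len(frames) // 2)
    if n == 0:
        return list(frames)

    def is_poor(segment):
        # Randomly sample a few frames and average their richness (averaging is an assumption).
        picked = rng.sample(segment, min(samples, len(segment)))
        return sum(richness_of(f) for f in picked) / len(picked) < threshold

    start = n if is_poor(frames[:n]) else 0
    stop = len(frames) - n if is_poor(frames[-n:]) else len(frames)
    return list(frames[start:stop])

# Toy input: each "frame" is already its own richness value.
toy_video = [0] * 3 + [50] * 94 + [5] * 3
trimmed = trim_video(toy_video, richness_of=lambda f: f)
```

In the toy input both edge segments fall below the threshold of 20, so only the 94 middle frames remain.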

[0106] Accordingly, in step S31, keyframes can be extracted from the video processed by steps S32-S35.

[0107] Furthermore, step S31 can be implemented through the following process, including:

[0108] S311: Select multiple candidate image frames from a video with the target image style.

[0109] In this step, multiple candidate image frames can be selected from the video with the target image style at preset intervals. The preset interval can be a time interval or an interval between image frames, such as selecting one image frame every 1 second or one image frame every 20 frames.
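A minimal sketch of this sampling step is shown below. The function `candidate_indices` and its parameter names are hypothetical; it returns frame indices and leaves the decoding of the frames to whatever video library is in use.

```python
def candidate_indices(total_frames, fps=None, every_seconds=None, every_frames=None):
    """Indices of candidate frames taken at a fixed time interval or frame interval."""
    if every_frames is None:
        if fps is None or every_seconds is None:
            raise ValueError("give every_frames, or both fps and every_seconds")
        every_frames = max(1, round(fps * every_seconds))   # convert seconds to frames
    return list(range(0, total_frames, every_frames))

by_frames = candidate_indices(100, every_frames=20)
by_time = candidate_indices(100, fps=25, every_seconds=1)
```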

[0110] S312: Extract keyframes from each candidate image frame based on the structural similarity index between candidate image frames.

[0111] In this step, the structural similarity index between candidate image frames can be determined, and keyframes can be extracted from each candidate image frame accordingly. Specifically, the structural similarity index can be the MS-SSIM index (Multi-Scale Structural Similarity). The MS-SSIM index can examine the similarity between images from three aspects: brightness, contrast, and structure. The larger the MS-SSIM index, the higher the similarity between the two images.
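To show how brightness, contrast and structure enter the index, the sketch below computes a simplified single-window SSIM from global image statistics. It is not MS-SSIM, which additionally evaluates the contrast and structure terms over several downsampled scales, and it is not the windowed implementation found in image libraries. The name `global_ssim` is hypothetical, and the constants 0.01 and 0.03 are the conventional SSIM stabilizers.

```python
from statistics import fmean

def global_ssim(x, y, dynamic_range=255.0):
    """Single-window SSIM over two equal-length grayscale pixel sequences."""
    c1 = (0.01 * dynamic_range) ** 2       # stabilizes the brightness term
    c2 = (0.03 * dynamic_range) ** 2       # stabilizes the contrast/structure term
    mx, my = fmean(x), fmean(y)
    vx = fmean((a - mx) ** 2 for a in x)
    vy = fmean((b - my) ** 2 for b in y)
    cov = fmean((a - mx) * (b - my) for a, b in zip(x, y))
    brightness = (2 * mx * my + c1) / (mx * mx + my * my + c1)
    contrast_structure = (2 * cov + c2) / (vx + vy + c2)
    return brightness * contrast_structure

ramp = [10, 50, 90, 130]
```

Comparing `ramp` with itself gives a value of 1 up to rounding, while comparing it with its reverse gives a negative value because the structure term is anti-correlated.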

[0112] In one embodiment, this step can be implemented in the following ways:

[0113] Using the first candidate image frame in the video as the initial comparison frame, the following keyframe determination steps are executed repeatedly until the comparison frame is updated to the last candidate image frame in the video, at which point the loop exits.

[0114] The keyframe determination steps include:

[0115] If the first candidate image frame is the current comparison frame, determine the structural similarity index between the first candidate image frame and the second candidate image frame; the second candidate image frame is adjacent to the first candidate image frame and is located after the first candidate image frame in the current video.

[0116] When the structural similarity index between the first candidate image frame and the second candidate image frame is less than a preset similarity threshold, the second candidate image frame is determined as a key frame, the comparison frame is updated to the second candidate image frame, and the process returns to the key frame determination step.

[0117] The keyframe determination step also includes:

[0118] When the structural similarity index between the first candidate image frame and the second candidate image frame is greater than or equal to a preset similarity threshold, the second candidate image frame is deleted from the video, and the process returns to the keyframe determination step.

[0119] Specifically, the first candidate image frame in the video can be used as the initial comparison frame. The MS-SSIM index measures the similarity between the next candidate image frame and the comparison frame as a weighted combination of three factors: brightness, contrast, and structure. The two image frames to be compared are first read in and converted to grayscale images, and the MS-SSIM index between the two grayscale images is then calculated using the `skimage.metrics.structural_similarity` function. If the MS-SSIM index is less than a preset similarity threshold (e.g., 0.4), the candidate image frame following the current comparison frame is retained as a keyframe and is used as the comparison frame for the next keyframe determination step.

[0120] If the MS-SSIM index is greater than or equal to the preset similarity threshold, the next candidate image frame of the current comparison frame is considered not a key frame, and it is deleted. The comparison frame remains unchanged, that is, the current comparison frame continues to be used as the comparison frame required for the next key frame determination, and the next key frame determination step is executed.

[0121] In this way, the keyframe determination step is executed repeatedly until the comparison frame is updated to the last candidate image frame in the video. That is, the MS-SSIM index of the last candidate image frame and the comparison frame in the video is calculated. Then the loop is exited, and the keyframe extraction of the video is completed.
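The loop described in the preceding paragraphs can be sketched as follows. The similarity measure is passed in as a callable so the sketch stays independent of any library; note that `skimage.metrics.structural_similarity`, named above, computes single-scale SSIM on arrays, so a true MS-SSIM would need a different implementation. The name `extract_keyframes` is hypothetical, and leaving the initial comparison frame out of the keyframe list is an assumption, since the text only marks later frames as keyframes.

```python
def extract_keyframes(candidates, similarity, threshold=0.4):
    """Keep a candidate as a keyframe when it differs enough from the comparison frame."""
    if not candidates:
        return []
    comparison = candidates[0]            # initial comparison frame
    keyframes = []
    for frame in candidates[1:]:
        if similarity(comparison, frame) < threshold:
            keyframes.append(frame)       # sufficiently different: keep it
            comparison = frame            # later frames are compared against it
        # Otherwise the frame is dropped and the comparison frame is unchanged.
    return keyframes

# Toy example: frames are numbers and similarity falls off with their distance.
toy_similarity = lambda a, b: 1.0 - abs(a - b) / 10
kept = extract_keyframes([0, 1, 2, 9, 9.5, 20], toy_similarity)
```

In the toy example the frames 1 and 2 are too close to frame 0 and are dropped, 9 becomes a keyframe and the new comparison frame, 9.5 is dropped, and 20 is kept.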

[0122] In this embodiment of the application, the image determined as a keyframe needs to be representative, that is, it needs to be different from the content of other keyframes, so as to facilitate learning the same style from images with different content. If two images are highly similar, one of them can be selected as the keyframe.

[0123] The method provided in this application provides a stylized image generation model for generating images of a specific style, and enables fine-tuning of the stylized image generation model based on video keyframes. Specifically, keyframes can be extracted from video to construct a unified style image dataset, and image style descriptions can be added to the dataset for fine-tuning of the stylized image generation model. This method allows video footage to be used for fine-tuning of a pre-trained image generation model. Simultaneously, by utilizing keyframe extraction and information richness detection at the beginning and end of the video, the method ensures the effectiveness of the image information in the dataset, removes black screens, transitions, and redundant video frames, avoids model overfitting, and also avoids the negative impact of black screens and video beginning and ending subtitles on the model's performance.

[0124] Corresponding to the method for generating a stylized image generation model provided in the embodiments of this application, the embodiments of this application also provide an apparatus for generating a stylized image generation model. As shown in Figure 5, the device includes:

[0125] A first acquisition module is used to acquire multiple first images and acquire first descriptive text for each first image; the multiple first images include images of various image styles;

[0126] The pre-training module is used to pre-train the guided diffusion model by using the first image as a pre-training input sample and the first descriptive text as a guiding condition; the guided diffusion model includes a diffusion noise layer and a guided denoising layer.

[0127] The second acquisition module is used to acquire multiple second images from a video with a target image style, and to acquire second descriptive text for the second images; the second descriptive text includes a description of the target image style.

[0128] The fine-tuning module is used to fine-tune the pre-trained guided diffusion model by taking the second image as the fine-tuning input sample and the second descriptive text as the guiding condition, to obtain a stylized image generation model including the guided denoising layer in the fine-tuned guided diffusion model; the stylized image generation model is used to generate an image with the style of the target image.

[0129] Optionally, the second acquisition module includes:

[0130] The keyframe extraction submodule is used to extract keyframes from videos with the style of the target image, resulting in multiple second images.

[0131] Optionally, the device is further used for:

[0132] Determine the beginning and end video segments of the video;

[0133] The image frames in the opening video segment and the ending video segment are respectively subjected to image information richness detection;

[0134] When the image information richness of the image frames in the opening video segment is less than a preset richness threshold, the opening video segment will be deleted from the video.

[0135] When the image information richness of the image frames in the ending video segment is less than the preset richness threshold, the ending video segment is deleted from the video.

[0136] Optionally, the keyframe extraction submodule includes:

[0137] A selection unit is used to select multiple candidate image frames from a video with the target image style;

[0138] An extraction unit is used to extract keyframes from each of the candidate image frames based on the structural similarity index between the candidate image frames.

[0139] Optionally, the extraction unit is specifically used for:

[0140] The first candidate image frame in the video is used as the initial comparison frame. The following keyframe determination steps are executed repeatedly until the comparison frame is updated to the last candidate image frame in the video, and then the loop is exited.

[0141] The keyframe determination step includes:

[0142] If the first candidate image frame is the current comparison frame, a structural similarity index is determined between the first candidate image frame and the second candidate image frame; the second candidate image frame is adjacent to the first candidate image frame and is located after the first candidate image frame in the current video.

[0143] When the structural similarity index between the first candidate image frame and the second candidate image frame is less than a preset similarity threshold, the second candidate image frame is determined as a key frame, the comparison frame is updated to the second candidate image frame, and the process returns to execute the key frame determination step.

[0144] Optionally, the keyframe determination step further includes:

[0145] When the structural similarity index between the first candidate image frame and the second candidate image frame is greater than or equal to a preset similarity threshold, the second candidate image frame is deleted from the video, and the process returns to the keyframe determination step.

[0146] Optionally, the device further includes:

[0147] The third acquisition module is used to acquire target description text including descriptions of the target image style, and to acquire random Gaussian noise images;

[0148] A stylized image generation module is used to input the random Gaussian noise image and the target description text into the stylized image generation model, so that the stylized image generation model outputs an image with the style of the target image under the guidance of the target description text.

[0149] Optionally, the pre-training module is specifically used for:

[0150] The first image is used as a pre-training input sample, and the first descriptive text is used as a guiding condition. The first image and the first descriptive text are input into the guided diffusion model so that the guided diffusion model outputs a first reconstructed image of the first image.

[0151] The first loss function value of the guided diffusion model is determined based on the similarity between the first reconstructed image and the first image.

[0152] The model parameters of the guided diffusion model are adjusted according to the first loss function value to achieve pre-training of the guided diffusion model.

[0153] Optionally, the fine-tuning module is specifically used for:

[0154] The second image is used as the fine-tuning input sample, and the second descriptive text is used as the guiding condition. The second image and the second descriptive text are input into the pre-trained guided diffusion model so that the pre-trained guided diffusion model outputs a second reconstructed image of the second image.

[0155] Based on the similarity between the second reconstructed image and the second image, the second loss function value of the pre-trained guided diffusion model is determined;

[0156] The model parameters of the pre-trained guided diffusion model are adjusted according to the second loss function value to achieve fine-tuning of the pre-trained guided diffusion model.

[0157] Optionally, the fine-tuning module is more specifically used for:

[0158] Noise is added to the second image through the diffusion noise layer;

[0159] The guided denoising layer denoises the second image after adding noise under the guidance of the second descriptive text, so as to learn the image information attenuation characteristics of the second image caused by adding noise, and obtain the second reconstructed image of the second image.

[0160] Optionally, the diffusion noise layer includes an image coding layer and an image noise layer;

[0161] The image coding layer is used to encode the image to obtain a first feature image;

[0162] The image noise-adding layer is used to gradually add Gaussian noise to the first feature image to obtain a Gaussian noise image.

[0163] Optionally, the guided denoising layer includes a text encoding layer, an image denoising layer, and an image decoding layer;

[0164] The text encoding layer is used to encode the descriptive text of the image to obtain a text feature vector;

[0165] The image denoising layer is used to progressively denoise the Gaussian noise image of the image based on the cross-attention mechanism, guided by the text feature vector, to obtain the second feature image;

[0166] The image decoding layer is used to decode the second feature image in order to reconstruct or generate the image.

[0167] Corresponding to the method for generating a stylized image generation model provided in the embodiments of this application, the embodiments of this application also provide an electronic device for generating a stylized image generation model. As shown in Figure 6, the electronic device includes: a processor 601; and a memory 602 for storing a program for generating a stylized image generation model. After the device is powered on and the program for generating the stylized image generation model is run by the processor, the following steps are performed:

[0168] Acquire multiple first images and acquire first descriptive text for each first image; the multiple first images include images of various image styles;

[0169] The first image is used as a pre-training input sample, and the first descriptive text is used as a guiding condition to pre-train the guided diffusion model; the guided diffusion model includes a diffusion noise layer and a guided denoising layer.

[0170] Acquire multiple second images and acquire second descriptive text for the second images; the second images are images with the style of the target image, and the second descriptive text includes a description of the style of the target image;

[0171] Using the second image as the fine-tuning input sample and the second descriptive text as the guiding condition, the pre-trained guided diffusion model is fine-tuned to obtain a stylized image generation model that includes a guided denoising layer in the fine-tuned guided diffusion model; the stylized image generation model is used to generate images with the style of the target image.

[0172] Corresponding to the method for generating a stylized image generation model provided in the embodiments of this application, the embodiments of this application provide a computer-readable storage medium storing a program for generating a stylized image generation model. This program is executed by a processor to perform the following steps:

[0173] Acquire multiple first images and acquire first descriptive text for each first image; the multiple first images include images of various image styles;

[0174] The first image is used as a pre-training input sample, and the first descriptive text is used as a guiding condition to pre-train the guided diffusion model; the guided diffusion model includes a diffusion noise layer and a guided denoising layer.

[0175] Acquire multiple second images and acquire second descriptive text for the second images; the second images are images with the style of the target image, and the second descriptive text includes a description of the style of the target image;

[0176] Using the second image as the fine-tuning input sample and the second descriptive text as the guiding condition, the pre-trained guided diffusion model is fine-tuned to obtain a stylized image generation model that includes a guided denoising layer in the fine-tuned guided diffusion model; the stylized image generation model is used to generate images with the style of the target image.

[0177] It should be noted that for a detailed description of the apparatus, electronic device and computer-readable storage medium provided in the embodiments of this application, please refer to the relevant description of the method in the embodiments of this application, which will not be repeated here.

[0178] Although this application discloses preferred embodiments as described above, it is not intended to limit this application. Any person skilled in the art can make possible changes and modifications without departing from the spirit and scope of this application. Therefore, the scope of protection of this application should be determined by the scope defined in the claims of this application.

[0179] In a typical configuration, a computing device includes one or more processors (CPU), input/output interfaces, network interfaces, and memory.

[0180] Memory may include non-persistent storage in computer-readable media, such as random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash RAM. Memory is an example of computer-readable media.

[0181] 1. Computer-readable media include both permanent and non-permanent, removable and non-removable media that can store information by any method or technology. The information can be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, CD-ROM, digital versatile disc (DVD) or other optical storage, magnetic tape, magnetic disk storage or other magnetic storage media, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media, such as modulated data signals and carrier waves.

[0182] 2. Those skilled in the art will understand that embodiments of this application can be provided as methods, systems, or computer program products. Therefore, this application can take the form of a completely hardware embodiment, a completely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, this application can take the form of a computer program product embodied on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.

Claims

1. A method for generating a stylized image generation model, characterized in that the method includes: acquiring multiple first images and acquiring first descriptive text for each first image, the multiple first images including images of various image styles; pre-training a guided diffusion model using the first images as pre-training input samples and the first descriptive text as a guiding condition, the guided diffusion model including a diffusion noise layer and a guided denoising layer; acquiring multiple second images from a video with a target image style, and acquiring second descriptive text for the second images, the second descriptive text including a description of the target image style; and fine-tuning the pre-trained guided diffusion model using the second images as fine-tuning input samples and the second descriptive text as a guiding condition, to obtain a stylized image generation model that includes the guided denoising layer of the fine-tuned guided diffusion model, the stylized image generation model being used to generate images with the target image style.

2. The method of claim 1, wherein, The process of acquiring multiple second images from a video with a target image style includes: Keyframes are extracted from videos with the style of the target image to obtain multiple second images.

3. The method of claim 2, wherein, Before extracting keyframes from the video with the target image style to obtain multiple second images, the process also includes: Determine the beginning and end video segments of the video; The image frames in the beginning video segment and the end video segment are respectively subjected to image information richness detection; When the image information richness of the image frames in the opening video segment is less than a preset richness threshold, the opening video segment will be deleted from the video. When the image information richness of the image frames in the ending video segment is less than the preset richness threshold, the ending video segment is deleted from the video.

4. The method of claim 2, wherein, The process of extracting keyframes from a video with the style of the target image yields multiple second images, including: Select multiple candidate image frames from a video with the target image style; Keyframes are extracted from each of the candidate image frames based on the structural similarity index between the candidate image frames.

5. The method according to claim 4, characterized in that, Extracting keyframes from the video based on the structural similarity index between the candidate image frames includes: The first candidate image frame in the video is used as the initial comparison frame. The following keyframe determination steps are executed repeatedly until the comparison frame is updated to the last candidate image frame in the video, and then the loop is exited. The keyframe determination step includes: If the first candidate image frame is the current comparison frame, a structural similarity index is determined between the first candidate image frame and the second candidate image frame; the second candidate image frame is adjacent to the first candidate image frame and is located after the first candidate image frame in the current video. When the structural similarity index between the first candidate image frame and the second candidate image frame is less than a preset similarity threshold, the second candidate image frame is determined as a key frame, the comparison frame is updated to the second candidate image frame, and the process returns to execute the key frame determination step.

6. The method of claim 5, wherein, The keyframe determination step further includes: When the structural similarity index between the first candidate image frame and the second candidate image frame is greater than or equal to a preset similarity threshold, the second candidate image frame is deleted from the video, and the process returns to the keyframe determination step.

7. The method of claim 1, wherein, The method further includes: Obtain target descriptive text including descriptions of the target image style, and obtain a random Gaussian noise image; The random Gaussian noise image and the target description text are input into the stylized image generation model so that the stylized image generation model outputs an image with the style of the target image under the guidance of the target description text.

8. The method according to claim 1, characterized in that, The step of using the first image as a pre-training input sample and the first descriptive text as a guiding condition to pre-train the guided diffusion model includes: The first image is used as a pre-training input sample, and the first descriptive text is used as a guiding condition. The first image and the first descriptive text are input into the guided diffusion model so that the guided diffusion model outputs a first reconstructed image of the first image. The first loss function value of the guided diffusion model is determined based on the similarity between the first reconstructed image and the first image. The model parameters of the guided diffusion model are adjusted according to the first loss function value to achieve pre-training of the guided diffusion model.

9. The method of claim 1, wherein, The step involves using the second image as a fine-tuning input sample and the second descriptive text as a guiding condition to fine-tune the pre-trained guided diffusion model, resulting in a stylized image generation model that includes a guided denoising layer in the fine-tuned guided diffusion model. This includes: The second image is used as the fine-tuning input sample, and the second descriptive text is used as the guiding condition. The second image and the second descriptive text are input into the pre-trained guided diffusion model so that the pre-trained guided diffusion model outputs a second reconstructed image of the second image. Based on the similarity between the second reconstructed image and the second image, the second loss function value of the pre-trained guided diffusion model is determined; The model parameters of the pre-trained guided diffusion model are adjusted according to the second loss function value to achieve fine-tuning of the pre-trained guided diffusion model.

10. The method of claim 9, wherein, The step of inputting the second image and the second descriptive text into the pre-trained guided diffusion model, so that the pre-trained guided diffusion model outputs a second reconstructed image of the second image, includes: Noise is added to the second image through the diffusion noise layer; The guided denoising layer denoises the second image after adding noise under the guidance of the second descriptive text, so as to learn the image information attenuation characteristics of the second image caused by adding noise, and obtain the second reconstructed image of the second image.

11. The method of claim 1, wherein the diffusion noise layer includes an image coding layer and an image noise-adding layer; the image coding layer is used to encode the image to obtain a first feature image; and the image noise-adding layer is used to gradually add Gaussian noise to the first feature image to obtain a Gaussian noise image.
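
For illustration only, the two sub-layers of claim 11 can be sketched as an encoder followed by a step-by-step noising loop. The claim does not specify the encoder or the noise schedule, so everything concrete here is an assumption: `encode` is a hypothetical average-pooling stand-in for the image coding layer, and the update rule x_t = sqrt(1 - beta_t) * x_(t-1) + sqrt(beta_t) * eps with a linear `betas` schedule is borrowed from common diffusion-model practice.

```python
import math
import random


def encode(image, factor=2):
    """Hypothetical image coding layer: average-pool neighbouring values."""
    return [sum(image[i:i + factor]) / factor for i in range(0, len(image), factor)]


def add_noise_gradually(feature, betas, rng):
    """Add Gaussian noise step by step; returns every intermediate state."""
    x = list(feature)
    trajectory = [x]
    for beta in betas:
        # Shrink the signal slightly and mix in fresh Gaussian noise.
        x = [math.sqrt(1.0 - beta) * v + math.sqrt(beta) * rng.gauss(0.0, 1.0)
             for v in x]
        trajectory.append(x)
    return trajectory


rng = random.Random(0)
image = [0.2, 0.8, 0.5, 0.1]                    # flattened toy image (assumed data)
first_feature = encode(image)                   # the "first feature image"

steps = 10                                      # assumed number of noising steps
betas = [1e-4 + i * (0.02 - 1e-4) / (steps - 1) for i in range(steps)]
trajectory = add_noise_gradually(first_feature, betas, rng)
gaussian_noise_image = trajectory[-1]

# Fraction of the original signal amplitude that survives all steps.
signal_scale = math.prod(math.sqrt(1.0 - b) for b in betas)
```

The factor `signal_scale` shrinks towards zero as the number of steps grows, which is one way to read the "image information attenuation" that the guided denoising layer learns to undo.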

12. The method of claim 1, wherein the guided denoising layer includes a text encoding layer, an image denoising layer, and an image decoding layer; the text encoding layer is used to encode the descriptive text of the image to obtain a text feature vector; the image denoising layer is used to progressively denoise the Gaussian noise image of the image, based on a cross-attention mechanism and guided by the text feature vector, to obtain a second feature image; and the image decoding layer is used to decode the second feature image in order to reconstruct or generate the image.
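
For illustration only, the cross-attention mechanism named in claim 12 can be sketched as follows: each image feature vector acts as a query, and the text feature vectors act as keys and values. This is a simplified sketch under stated assumptions; the learned query, key, and value projections used in practice are omitted, and the function names and toy vectors are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross_attention(image_features, text_features):
    """Scaled dot-product attention from image queries to text keys/values."""
    dim = len(text_features[0])
    outputs, all_weights = [], []
    for query in image_features:
        scores = [dot(query, key) / math.sqrt(dim) for key in text_features]
        weights = softmax(scores)
        # Weighted sum of the text vectors: the text-guided signal for this query.
        out = [sum(w * value[d] for w, value in zip(weights, text_features))
               for d in range(dim)]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights


text_features = [[1.0, 0.0], [0.0, 1.0]]    # two toy text token vectors
image_features = [[2.0, 0.0], [0.0, 2.0]]   # two toy image feature vectors
guided, weights = cross_attention(image_features, text_features)
```

Each image position ends up weighting the text token it aligns with most strongly, which is how the text feature vector can steer each denoising step.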

13. A device for generating a stylized image generation model, characterized in that the device includes: a first acquisition module, used to acquire multiple first images and to acquire first descriptive text for each first image, the multiple first images including images of various image styles; a pre-training module, used to pre-train a guided diffusion model by using the first image as a pre-training input sample and the first descriptive text as a guiding condition, the guided diffusion model including a diffusion noise layer and a guided denoising layer; a second acquisition module, used to acquire multiple second images from a video with a target image style and to acquire second descriptive text for the second images, the second descriptive text including a description of the target image style; and a fine-tuning module, used to fine-tune the pre-trained guided diffusion model by using the second image as a fine-tuning input sample and the second descriptive text as a guiding condition, so as to obtain a stylized image generation model including the guided denoising layer of the fine-tuned guided diffusion model, the stylized image generation model being used to generate an image with the target image style.

14. An electronic device, comprising: a processor; and a memory for storing a program which, when the electronic device is powered on and the program is run by the processor, performs the method according to any one of claims 1-12.

15. A computer-readable storage medium, characterized in that a program is stored therein which, when executed by a processor, performs the method according to any one of claims 1-12.