Video generation method, model training method, device, apparatus, and medium

By acquiring semantically relevant reference images and text, utilizing video feature generation models and decoders, and combining cascaded models and optical flow frame interpolation techniques, the problems of blurriness and poor controllability in AI-generated videos were solved, achieving high-quality video generation.

CN116320216B · Active · Publication Date: 2026-06-12 · BEIJING BAIDU NETCOM SCI & TECH CO LTD


Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents(China)
Current Assignee / Owner: BEIJING BAIDU NETCOM SCI & TECH CO LTD
Filing Date: 2023-03-15
Publication Date: 2026-06-12

Application Information

Patent Timeline

15 Mar 2023: Application
12 Jun 2026: Publication
Publication number: CN116320216B

IPC: H04N5/262; G06T7/269; G06V10/82; G06V10/774; H04N19/44

CPC: H04N5/262; G06T7/269; G06V10/82; G06V10/774; H04N19/44; G06T2207/10016; G06T2207/20081; G06T2207/20084

AI Tagging

Application Domain

Television system details; Image enhancement


AI Technical Summary

⚠Technical Problem

Existing AI-generated video technologies produce blurry videos with poor content control and low quality, affecting the overall quality and effect of the videos.

⚗Method used

By acquiring semantically relevant reference images and text, and utilizing a pre-trained video feature generation model and video decoder, a target video feature sequence is generated and decoded. A cascaded model and optical flow-guided feature interpolation method are employed to improve the controllability and quality of the video content.

🎯Benefits of technology

The generated video content is controllable, the quality is greatly improved, the difficulty of the video generation model is reduced, and the video effect and efficiency are improved.

✦ Generated by Eureka AI based on patent content.


Abstract

The present disclosure provides a video generation method, a model training method, an apparatus, a device and a medium, relates to the technical field of artificial intelligence, specifically to the technical field of computer vision, deep learning and the like, and can be applied to the scene of AIGC and the like. The specific implementation scheme is as follows: a reference image and a text are obtained, wherein the reference image and the text are semantically related; a pre-trained video feature generation model is used to generate a target video feature sequence according to the features of the reference image and the features of the text; and a video decoder is used to decode the target video feature sequence to generate a target video. The present disclosure can improve the quality of the generated video.


Description

Technical Field

[0001] This disclosure relates to the field of artificial intelligence technology, specifically computer vision, deep learning, and other related technologies, and can be applied to scenarios such as AIGC. In particular, it relates to a video generation method, a model training method, an apparatus, a device, and a medium.

Background Technology

[0002] AI-generated videos are a very hot topic right now. Compared to traditional manual video production, AI-generated videos offer a revolutionary improvement in efficiency. Moreover, AI-generated videos can be used in a variety of applications, including film, animation, and human-computer interaction.

[0003] Existing AI-generated video technologies often produce blurry videos with poor content control and low video quality, thus affecting the overall quality and effect of the video.

Summary of the Invention

[0004] This disclosure provides a video generation method, a model training method, an apparatus, a device, and a medium.

[0005] According to one aspect of this disclosure, a video generation method is provided, comprising:

[0006] Acquiring a reference image and text, wherein the reference image and the text are semantically related;

[0007] Generating a target video feature sequence based on the features of the reference image and the features of the text using a pre-trained video feature generation model; and

[0008] Decoding the target video feature sequence using a video decoder to generate the target video.

[0009] According to another aspect of this disclosure, a method for training a video feature generation model is provided.

[0010] According to another aspect of this disclosure, a video generation apparatus is provided, comprising:

[0011] A reference image and text acquisition module is used to acquire a reference image and text, wherein the reference image and the text are semantically related;

[0012] The video feature sequence generation module is used to generate a target video feature sequence based on the features of the reference image and the features of the text using a pre-trained video feature generation model.

[0013] The video generation module is used to decode the target video feature sequence using a video decoder to generate the target video.

[0014] According to another aspect of this disclosure, a training apparatus for a video feature generation model is provided, comprising:

[0015] According to another aspect of this disclosure, an electronic device is provided, comprising:

[0016] At least one processor; and

[0017] A memory communicatively connected to the at least one processor; wherein,

[0018] The memory stores instructions that can be executed by the at least one processor, which, when executed by the at least one processor, enables the at least one processor to perform the video generation method or the training method for the video feature generation model described in any embodiment of this disclosure.

[0019] According to another aspect of this disclosure, a non-transitory computer-readable storage medium is provided that stores computer instructions for causing a computer to execute the video generation method or the training method of the video feature generation model described in any embodiment of this disclosure.

[0020] It should be understood that the description in this section is not intended to identify key or essential features of the embodiments of this disclosure, nor is it intended to limit the scope of this disclosure. Other features of this disclosure will become readily apparent from the following description.

Description of the Attached Figures

[0021] The accompanying drawings are provided to better understand this solution and do not constitute a limitation of this disclosure. Wherein:

[0022] Figure 1 is a schematic flowchart of a video generation method according to an embodiment of the present disclosure;

[0023] Figure 2 is a schematic flowchart of a video generation method according to an embodiment of the present disclosure;

[0024] Figure 3 is a flowchart illustrating the training method of a video feature generation model according to an embodiment of the present disclosure;

[0025] Figure 4 is a schematic diagram of the training process of an N-level video feature generation sub-model according to an embodiment of the present disclosure;

[0026] Figure 5 is a schematic flowchart of a video generation method according to an embodiment of the present disclosure;

[0027] Figure 6 is a prediction flowchart of a video generation method according to an embodiment of the present disclosure;

[0028] Figure 7 is a schematic diagram of the structure of a video generation apparatus according to an embodiment of the present disclosure;

[0029] Figure 8 is a schematic diagram of the structure of a training device for a video feature generation model according to an embodiment of the present disclosure;

[0030] Figure 9 is a block diagram of an electronic device used to implement the video generation method of the embodiments of this disclosure.

Detailed Implementation

[0031] The exemplary embodiments of this disclosure are described below with reference to the accompanying drawings, including various details of the embodiments to aid understanding, and should be considered merely exemplary. Therefore, those skilled in the art will recognize that various changes and modifications can be made to the embodiments described herein without departing from the scope and spirit of this disclosure. Similarly, for clarity and brevity, descriptions of well-known functions and structures are omitted in the following description.

[0032] Figure 1 is a flowchart illustrating a video generation method according to an embodiment of the present disclosure. This embodiment is applicable to automatic video generation, such as generating a video from text, generating a video from an image, or generating a new video from an input video. It relates to the field of artificial intelligence technology, specifically computer vision, deep learning, and other technical fields, and can be applied to AIGC (AI-Generated Content) scenarios. The method can be executed by a video generation apparatus, which is implemented in software and/or hardware and is preferably configured in an electronic device such as a computer, server, or smart terminal. As shown in Figure 1, the method specifically includes the following:

[0033] S101. Obtain a reference image and text, wherein the reference image and text are semantically related.

[0034] In text-to-video applications, this disclosure can automatically generate a video based on given text. The text can be user-inputted text, such as "pour water into a cup," "a match burning," or "penguins walking on the beach," or it can be text converted from spoken words. This disclosure does not limit the form or content of the acquired text. Then, a pre-trained text-to-image model can be used to generate a reference image based on this text. The text-to-image model can also generate a high-quality reference image, providing a foundation for subsequent video generation.

[0035] In the application of image-to-video generation, this disclosure can animate the content within a given image to automatically generate a video. The given image can serve as a reference image, and a pre-trained image-to-text model is used to generate semantically relevant text based on the reference image.

[0036] In video-to-video applications, this disclosure can generate a new video based on a given video, such as converting a video of a real-world scene into an anime-style video without altering its content. In this application, the reference image can be obtained by extracting a reference frame from the original input video and editing it, such as the first frame of the original video. Image editing can be implemented using a pre-trained text-to-image model, which will not be elaborated further here. Through image editing, not only can the quality of the reference image be improved, but the style of the original image can also be changed, or the original image can be modified based on other application requirements to generate a new image, thus enabling video generation based on specific needs, such as editing the video background or changing the clothing of characters in the video.

[0037] S102. Using a pre-trained video feature generation model, generate a target video feature sequence based on the features of the reference image and the text.

[0038] The video feature generation model is trained, for example, based on a diffusion model or a generative adversarial network. It predicts multiple consecutive frames of video features, i.e., a video feature sequence (for example, 16 frames of video features), from the features of a reference image and the features of text. During training, reference image samples and their semantically related text samples can be obtained from the video training samples. Features of the reference image samples and text samples are extracted using an image encoder and a text encoder, respectively. These features are used as the input of the video feature generation model, and the video feature sequence of the video training samples is used as the output to train the model. The video feature sequence can be extracted from the video training samples using any existing technology, which will not be elaborated here. As an example, the video feature generation model can use a Unet structure (a U-shaped neural network structure).

[0039] S103. Use a video decoder to decode the target video feature sequence to generate the target video.

[0040] After obtaining the target video feature sequence, the target video can be generated by decoding it using a decoder.

[0041] The technical solution of this disclosure first obtains a high-quality reference image, and then uses a video feature generation model to guide the generation of video based on the reference image, so that the generated video content is controllable and the quality is greatly improved, thereby improving the video effect.
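Steps S101 to S103 can be sketched as a three-stage pipeline. This is a minimal sketch, not the disclosed implementation: the function generate_video and all toy_* helpers are hypothetical placeholders standing in for trained encoders, the video feature generation model, and the video decoder.

```python
from typing import Callable, List, Sequence


def generate_video(
    reference_image: Sequence[float],
    text: str,
    encode_image: Callable,
    encode_text: Callable,
    feature_model: Callable,
    decode_video: Callable,
) -> List[List[float]]:
    """S101 inputs -> S102 target feature sequence -> S103 decoded video."""
    image_feat = encode_image(reference_image)
    text_feat = encode_text(text)
    feature_seq = feature_model(image_feat, text_feat)  # S102
    return decode_video(feature_seq)                    # S103


# Toy stand-ins (assumptions) so the sketch runs without trained networks.
def toy_encode_image(image):
    return [sum(image) / len(image)]


def toy_encode_text(text):
    return [float(len(text))]


def toy_feature_model(image_feat, text_feat, num_frames=16):
    # One feature vector per frame, conditioned on both inputs.
    return [[image_feat[0] + t, text_feat[0]] for t in range(num_frames)]


def toy_decode(feature_seq):
    # Expand every feature vector into a tiny 4-pixel "frame".
    return [[feat[0]] * 4 for feat in feature_seq]


video = generate_video(
    [0.0, 1.0], "a match burning",
    toy_encode_image, toy_encode_text, toy_feature_model, toy_decode,
)
```

The point of the structure is that the reference image and the text both condition the feature model, while decoding is a separate stage that can be trained and replaced independently.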

[0042] Figure 2 is a flowchart illustrating a video generation method according to an embodiment of the present disclosure. Based on the above embodiments, this embodiment further optimizes the video feature generation model. The video feature generation model is a cascaded model composed of multiple video feature generation sub-models. The cascaded model includes N levels of video feature generation sub-models, where the output of the previous level's video feature generation sub-model serves as the input of the next level's video feature generation sub-model, and N is a natural number greater than 1. Correspondingly, as shown in Figure 2, the method specifically includes the following:

[0043] S201. Obtain a reference image and text, wherein the reference image and text are semantically related.

[0044] S202. By downsampling the reference image, N reference images of different resolution levels are obtained, including the reference image itself.

[0045] For example, in N reference images with different resolution levels, the resolution increases progressively from the first level to the Nth level.

[0046] S203. Extract features from N reference images at different resolution levels.

[0047] For example, feature extraction is performed on the original reference image using an image encoder to obtain reference image features at resolution N, which is the highest resolution image feature. Then, the original reference image is downsampled once to reduce its resolution, and then feature extraction is performed again using the image encoder to obtain reference image features at resolution N-1. This process continues; by performing a second downsampling and feature extraction, even lower resolution reference image features can be obtained, with the first resolution being the lowest.

[0048] In one implementation, taking a three-level cascaded model as an example, the original reference image is downsampled twice to obtain a reference image of the first-level resolution. The features of the first-level resolution reference image can then be obtained by an image encoder. The original reference image is then downsampled once and feature extracted to obtain the features of the second-level resolution reference image. Finally, the original reference image is directly processed by an image encoder to extract features, resulting in the features of the third-level resolution reference image.
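The multi-resolution reference images of S202 can be illustrated as follows. This is a sketch under assumptions: the downsampling factor of 2 and the 2x2 averaging filter are illustrative choices that the text does not specify, and the names downsample_2x and build_pyramid are hypothetical.

```python
from typing import List

Image = List[List[float]]


def downsample_2x(img: Image) -> Image:
    """Halve height and width by averaging each 2x2 block (assumed filter)."""
    h, w = len(img) // 2, len(img[0]) // 2
    return [
        [
            (img[2 * r][2 * c] + img[2 * r][2 * c + 1]
             + img[2 * r + 1][2 * c] + img[2 * r + 1][2 * c + 1]) / 4.0
            for c in range(w)
        ]
        for r in range(h)
    ]


def build_pyramid(reference: Image, n_levels: int) -> List[Image]:
    """Return [level 1 (lowest resolution), ..., level N (original image)]."""
    levels = [reference]
    for _ in range(n_levels - 1):
        levels.append(downsample_2x(levels[-1]))
    levels.reverse()  # level 1 first, so resolution increases with the index
    return levels


reference = [[float(r * 8 + c) for c in range(8)] for r in range(8)]
pyramid = build_pyramid(reference, n_levels=3)
sizes = [(len(img), len(img[0])) for img in pyramid]
```

With a three-level cascade, the original image is level 3, one downsampling gives level 2, and two downsamplings give level 1. Each level would then be passed through the image encoder.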

[0049] S204. Input the features of the reference image at the first-level resolution and the features of the text into the first-level video feature generation sub-model.

[0050] The first-level video feature generation sub-model is the first model in the cascaded model. Its input text features are the same as those of the other video feature generation sub-models, while its input image features are those of the lowest-resolution reference image. The first-level video feature generation sub-model obtains the corresponding resolution-level video feature sequence based on the features of the reference image at the first-level resolution and the text features.

[0051] S205. For any current-level video feature generation sub-model other than the first-level video feature generation sub-model in the cascaded model, process it as follows: Upsample the video feature sequence output by the previous-level video feature generation sub-model of the current-level video feature generation sub-model, and input the upsampled video feature sequence, the features of the reference image at the current-level resolution, and the text features into the current-level video feature generation sub-model.

[0052] S206. Use the output of the Nth level video feature generation sub-model as the target video feature sequence.

[0053] S207. Use a video decoder to decode the target video feature sequence to generate the target video.

[0054] Taking the three-level cascaded model as an example, after obtaining the video feature sequence output by the first-level video feature generation sub-model, it is first upsampled. Then, the upsampled video feature sequence, the features of the reference image at the second-level resolution, and the text features are input into the second-level video feature generation sub-model. Similarly, the video feature sequence output by the second-level video feature generation sub-model is upsampled. The upsampled video feature sequence, the features of the reference image at the third-level resolution, and the text features are input into the third-level video feature generation sub-model. The output of the third-level video feature generation sub-model is the final target video feature sequence. Upsampling can transform the video feature sequence of the previous level into a video feature sequence corresponding to the resolution of the current level, for example, through bilinear interpolation.

[0055] When N is greater than 3, the number of cascaded video feature generation sub-models increases, but the processing method remains the same as described above, and will not be repeated here. In practical applications, the required cascaded models can be configured according to needs, and this disclosure does not impose any limitations on this.
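The cascade of S204 to S206 can be sketched as a loop over levels. This sketch makes several assumptions: nearest-neighbour upsampling is swapped in for the bilinear interpolation mentioned above to keep the code short, features are single-channel grids, and toy_sub_model is a hypothetical stand-in for a trained sub-model (it ignores the text features).

```python
def upsample_2x(seq):
    """Nearest-neighbour 2x spatial upsampling of every frame in a sequence."""
    out = []
    for frame in seq:
        rows = []
        for row in frame:
            wide = [v for v in row for _ in range(2)]
            rows.append(wide)
            rows.append(list(wide))
        out.append(rows)
    return out


def run_cascade(sub_models, image_feats, text_feat):
    """Run levels 1..N; each level refines the upsampled previous output."""
    prev = None
    for model, image_feat in zip(sub_models, image_feats):
        coarse = upsample_2x(prev) if prev is not None else None
        prev = model(coarse, image_feat, text_feat)
    return prev  # output of level N is the target video feature sequence


def toy_sub_model(coarse, image_feat, text_feat, num_frames=4):
    """Placeholder: adds the image feature to the coarse input, if any."""
    frames = []
    for t in range(num_frames):
        frames.append([
            [v + (coarse[t][r][c] if coarse else 0.0)
             for c, v in enumerate(row)]
            for r, row in enumerate(image_feat)
        ])
    return frames


# Image features at levels 1, 2, 3 (lowest resolution first).
image_feats = [[[1.0] * s for _ in range(s)] for s in (2, 4, 8)]
target = run_cascade([toy_sub_model] * 3, image_feats, text_feat=[0.0])
```

The same loop covers any N greater than 1, since only the lists of sub-models and image features grow.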

[0056] In one implementation, each video feature generation sub-model in the N-level video feature generation sub-model is trained separately.

[0057] Figure 3 is a flowchart illustrating the training method of a video feature generation model according to an embodiment of the present disclosure. As shown in Figure 3, the method includes:

[0058] S301. Obtain reference image samples from the video training samples, and obtain text samples that are semantically related to the reference image samples.

[0059] S302. Extract the features of the reference image samples and the text samples respectively, and extract the video feature sequence samples of the video training samples.

[0060] S303. The video feature generation model is trained by taking the features of the reference image sample and the features of the text sample as inputs and the video feature sequence sample as outputs; wherein, the video feature generation model is a cascaded model composed of multiple video feature generation sub-models.

[0061] This disclosure does not limit the method of obtaining video training samples. After obtaining video training samples, the first frame can be extracted from them, and high-quality reference image samples can be obtained through image editing. At the same time, reference image samples that meet different needs can also be obtained through image editing, such as changing the style of the image.

[0062] In one implementation, the cascaded model includes N levels of video feature generation sub-models, with the output of the previous level serving as the input to the next level. During training, each video feature generation sub-model in the cascaded model can be trained independently, thereby improving the flexibility of model training.

[0063] Specifically, training the video feature generation model includes:

[0064] The first-level video feature generation sub-model is initialized using a pre-trained text-to-image model, and then trained.

[0065] For any current-level video feature generation sub-model in the cascaded model other than the first-level video feature generation sub-model, training shall be performed as follows:

[0066] The current-level video feature generation sub-model is initialized using the model parameters obtained from training the previous-level video feature generation sub-model, and is then trained.

[0067] The sub-models for generating video features at each level have the same structure, and N is a natural number greater than 1.

[0068] The training of the first-level video feature generation sub-model includes:

[0069] By downsampling the reference image sample, N reference image samples at different resolution levels are obtained, including the reference image sample.

[0070] The features of the reference image sample at the first resolution and the features of the text sample are used as inputs to the first-level video feature generation sub-model. The video features at the first resolution of the video training sample are extracted as the output of the first-level video feature generation sub-model, and the first-level video feature generation sub-model is trained.

[0071] Figure 4 is a schematic diagram illustrating the training process of an N-level video feature generation sub-model according to an embodiment of this disclosure. As shown in Figure 4, the training process of the N-level video feature generation sub-model includes:

[0072] S401. Initialize the first-level video feature generation sub-model using the pre-trained text-to-image model, and in the network structure of the text-to-image model, add a 1D convolutional module with a time dimension after each 2D convolutional module, and add a temporal attention module after each spatial attention module.

[0073] The text-to-image generation model includes a convolutional module and a spatial attention module. By adding a temporal convolutional module and a temporal attention module, the video feature generation sub-model can fit video data and predict video feature sequences.
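The layer insertion described in S401 can be illustrated on a list of layer names. This is only a structural sketch: the layer names and the function inflate_for_video are hypothetical, and a real network would insert actual modules rather than strings.

```python
def inflate_for_video(layers):
    """Insert a temporal layer after each matching spatial layer."""
    inflated = []
    for name in layers:
        inflated.append(name)
        if name == "conv2d":
            inflated.append("temporal_conv1d")
        elif name == "spatial_attention":
            inflated.append("temporal_attention")
    return inflated


text_to_image_block = ["conv2d", "spatial_attention", "conv2d"]
video_block = inflate_for_video(text_to_image_block)
```

Because the original spatial layers are kept in place, the pre-trained text-to-image parameters can still be loaded into them, and only the added temporal layers start without pre-trained weights.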

[0074] S402. Perform image editing on the first frame of the video training sample to obtain a reference image sample, and obtain text samples that are semantically related to the reference image sample.

[0075] S403. By downsampling the reference image samples, N reference image samples of different resolution levels are obtained, including the reference image samples.

[0076] S404. Use the features of the reference image samples and text samples at the first-level resolution as input to the first-level video feature generation sub-model, extract the video features at the first-level resolution of the video training samples as output to the first-level video feature generation sub-model, and train the first-level video feature generation sub-model.

[0077] S405. For any current-level video feature generation sub-model in the cascaded model other than the first-level video feature generation sub-model, train it as follows: use the model parameters obtained by training the previous-level video feature generation sub-model to initialize the current-level video feature generation sub-model before training it. The structure of each level of video feature generation sub-model is the same.

[0078] In other words, the first-level video feature generation sub-model is initialized and trained using the parameters of a pre-trained text-to-image model. Subsequent levels of the video feature generation sub-model are then initialized and trained using the parameters obtained from training the previous level. This not only improves the overall efficiency of model training but also enhances the prediction performance of each individual level.
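The initialization chain described above can be sketched as a loop in which each level starts from the parameters of the level before it. The names train_cascade and toy_train_level are hypothetical, and the parameter dictionary is a placeholder for real model weights.

```python
import copy


def train_cascade(pretrained_t2i_params, n_levels, train_level):
    """Train levels 1..N in order, initializing each from the prior result."""
    trained = []
    init = pretrained_t2i_params              # level 1 starts from text-to-image
    for level in range(1, n_levels + 1):
        params = copy.deepcopy(init)          # initialize this level
        params = train_level(level, params)   # then train this level only
        trained.append(params)
        init = params                         # next level starts from here
    return trained


def toy_train_level(level, params):
    """Placeholder training step that records which level touched the params."""
    params["history"] = params["history"] + [f"level{level}"]
    return params


levels = train_cascade({"history": ["t2i"]}, 3, toy_train_level)
```

This works only because every level shares the same structure, so the parameters of one level can be copied directly into the next.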

[0079] Furthermore, initializing the current-level video feature generation sub-model with the model parameters obtained from training the previous-level video feature generation sub-model, and then training it, includes:

[0080] Initializing the current-level video feature generation sub-model using the trained model parameters of the previous-level video feature generation sub-model;

[0081] Upsampling the video features of the video training samples at the previous-level resolution;

[0082] Training the current-level video feature generation sub-model by using the features of the reference image sample at the current-level resolution, the text features, and the upsampled video features as inputs, and the video features of the video training samples at the current-level resolution as the output.
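The inputs and the output used to train one level can be assembled as in the following sketch. It assumes that the coarse video input comes from the training video's own features at the previous (lower) resolution level, which is what allows each level to be trained independently. The function names and the string placeholders are hypothetical.

```python
def make_training_example(level, video_feats, image_feats, text_feat, upsample):
    """Assemble (inputs, target) for one level; `level` is 1-based."""
    inputs = {"image": image_feats[level - 1], "text": text_feat}
    if level > 1:
        # Coarse input: ground-truth features of the previous level, upsampled.
        inputs["coarse_video"] = upsample(video_feats[level - 2])
    target = video_feats[level - 1]
    return inputs, target


def tag_upsample(feat):
    """Placeholder upsampler that only marks its argument."""
    return f"up({feat})"


video_feats = ["v_level1", "v_level2", "v_level3"]
image_feats = ["i_level1", "i_level2", "i_level3"]
first_inputs, first_target = make_training_example(
    1, video_feats, image_feats, "txt", tag_upsample)
third_inputs, third_target = make_training_example(
    3, video_feats, image_feats, "txt", tag_upsample)
```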

[0083] Therefore, the technical solution of this disclosure, through the implementation of a cascaded model, decomposes the prediction task. Each cascaded video feature generation sub-model predicts video features at its corresponding resolution level. Each level predicts more details of video features based on the previous level, and these predictions are then stacked sequentially to obtain more accurate and higher-quality prediction results. Furthermore, the training difficulty for each level is lower, as it processes features in a low-resolution feature space, significantly reducing the time required for model training and testing, and improving the efficiency of model training.

[0084] Furthermore, in applications that generate videos from videos, it is also necessary to obtain the video condition information of the video training samples and use this information as input for training the video feature generation sub-models at each level. This video condition information includes at least depth maps and target keypoint maps, thereby generating new videos guided by the original video. The new videos are consistent with the original videos in terms of objects and people, but only differ in style, background, or clothing.

[0085] Figure 5 is a schematic flowchart of a video generation method according to an embodiment of the present disclosure. This embodiment is a further optimization based on the above embodiments. As shown in Figure 5, the method includes:

[0086] S501. Obtain a reference image and text, wherein the reference image and text are semantically related.

[0087] S502. Using a pre-trained video feature generation model, generate a target video feature sequence based on the features of the reference image and the text.

[0088] S503. Interpolate frames into the target video feature sequence.

[0089] S504. Use a video decoder to decode the interpolated target video feature sequence to generate the target video.

[0090] Frame interpolation mainly refers to upsampling the video in the time dimension. For example, a 2-second video can be transformed into a 4-second or even longer video through frame interpolation to enhance the viewing experience. To ensure that the interpolated frames are neither blurry nor temporally jittery, the embodiments of this disclosure employ an optical flow-guided feature interpolation method.

[0091] Specifically, frame interpolation is performed on the target video feature sequence, including:

[0092] Predict the optical flow of a target video feature sequence using a pre-trained optical flow prediction model;

[0093] Based on optical flow, the target video feature sequence is interpolated in the feature space to obtain multiple initial interpolated frames;

[0094] Fine-tuning of multiple initial interpolated frames is performed using a pre-trained fine-tuning model.

[0095] One method for training the optical flow prediction model is to use an existing image-based optical flow prediction model as supervision. That is, first, the existing image-based optical flow prediction model is used to obtain the optical flow from video training samples. Then, the video feature sequences of the video training samples are used as input to the optical flow prediction model, and the corresponding optical flow is used as the output. This process trains the optical flow prediction model so that it can predict the optical flow based on any input video feature sequence.

[0096] Next, based on optical flow, the target video feature sequence is interpolated in the feature space to obtain multiple initial interpolated frames. For example, if the target video feature sequence includes 16 frames of video features, interpolation can yield 16 more interpolated frames, which, together with the original 16 frames, form 32 frames of video features, thus increasing the length of the video.

[0097] Furthermore, to improve the quality of frame interpolation and avoid issues such as image holes or target distortion, these initial interpolated frames need to be fine-tuned. Specifically, this fine-tuning can be based on a pre-trained fine-tuning model, which can be trained using a diffusion model or a generative adversarial network, details of which will not be elaborated here.

[0098] It should be noted that this disclosure utilizes optical flow to directly interpolate video features in the feature space, which can greatly reduce the computational cost of interpolating frames on the image.
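The flow-guided interpolation step can be illustrated on one-dimensional feature rows. This is a simplified sketch under assumptions: features are 1D rows with one scalar flow value per position, sampling is nearest-neighbour with border clamping, one frame is inserted per adjacent pair (giving n-1 new frames for n inputs), and the fine-tuning step is omitted. All function names are hypothetical.

```python
def sample(row, x):
    """Read the row at the nearest valid index to x (clamped at the borders)."""
    i = min(max(int(round(x)), 0), len(row) - 1)
    return row[i]


def midpoint_frame(f0, f1, flow):
    """Blend f0 and f1 after moving each halfway along the flow f0 -> f1."""
    return [
        0.5 * sample(f0, x - 0.5 * flow[x]) + 0.5 * sample(f1, x + 0.5 * flow[x])
        for x in range(len(f0))
    ]


def interpolate_sequence(frames, flows):
    """Insert one flow-guided frame between each adjacent pair of frames."""
    out = []
    for t in range(len(frames) - 1):
        out.append(frames[t])
        out.append(midpoint_frame(frames[t], frames[t + 1], flows[t]))
    out.append(frames[-1])
    return out


# A feature "blob" moves from index 2 to index 4, so the flow is +2 everywhere.
f0 = [0, 0, 1, 0, 0, 0, 0]
f1 = [0, 0, 0, 0, 1, 0, 0]
flow = [2] * 7
mid = midpoint_frame(f0, f1, flow)
sequence = interpolate_sequence([f0, f1], [flow])
```

In this example the interpolated frame places the blob at index 3, whereas a plain average of f0 and f1 would leave two half-strength copies at indices 2 and 4. That difference is the motivation for using optical flow rather than direct blending.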

[0099] In one implementation, to further improve video quality, an enhanced video decoder can be used to decode the interpolated target video feature sequence. Specifically, during training, the video decoder can be trained together with the image encoder. The image encoder can use an existing image autoencoder, while a temporal convolutional module and an attention module are added to the video decoder, enabling the model to model the video data and resulting in a more stable decoded video. During training, the input video samples can be degraded, for example, by blurring, adding noise, and compressing, to obtain a degraded video sample. This degraded video sample is then used as input to the video decoder, but the target video for the video decoder to learn from remains the original high-quality video sample, thus enhancing the image quality.
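The construction of degraded training pairs for the enhanced decoder can be sketched as follows. The specific degradations are assumptions chosen for brevity: a 3-tap box blur, Gaussian noise, and coarse quantization standing in for lossy compression. Frames are simplified to rows of values in [0, 1], and the function names are hypothetical.

```python
import random


def degrade(frame, rng, noise_std=0.05, levels=8):
    """Blur, add noise, then coarsely quantize one frame (a row of floats)."""
    n = len(frame)
    blurred = [
        (frame[max(i - 1, 0)] + frame[i] + frame[min(i + 1, n - 1)]) / 3.0
        for i in range(n)
    ]
    noisy = [v + rng.gauss(0.0, noise_std) for v in blurred]
    step = levels - 1
    # Quantization stands in for lossy compression in this sketch.
    return [round(min(max(v, 0.0), 1.0) * step) / step for v in noisy]


def make_decoder_pair(clean_video, seed=0):
    """Return (degraded input side, clean target) for decoder training."""
    rng = random.Random(seed)
    degraded = [degrade(frame, rng) for frame in clean_video]
    return degraded, clean_video


clean = [[0.0, 0.0, 1.0, 1.0, 0.0, 0.0] for _ in range(4)]
degraded, target = make_decoder_pair(clean)
```

The target always remains the original clean video, so the decoder is pushed to restore detail that the degradation removed.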

[0100] For example, the video decoder in this embodiment can generate a video with a resolution of 512*512 and a duration of 5 seconds at 25fps. Furthermore, depending on the requirements, other video super-resolution or frame interpolation methods in the prior art can be used to finally generate videos with higher resolutions such as 1080p or even 4K. Subsequent processing of the target video generated by the video decoder can be configured according to requirements, and this disclosure does not impose any limitations on it.

[0101] Figure 6 is a flowchart illustrating the prediction process of a video generation method according to an embodiment of this disclosure. As shown in the figure, text-to-video generation with a three-level cascaded model is taken as an example. The three-level cascaded model includes a first-level video feature generation sub-model 1, a second-level video feature generation sub-model 2, and a third-level video feature generation sub-model 3. First, an input text is obtained, and a reference image semantically related to the text is obtained using a text-to-image model. The features that the image encoder extracts from the original reference image are input into the third-level video feature generation sub-model 3. The original reference image is downsampled once to obtain a second-level resolution reference image, whose features, extracted by the image encoder, are input into the second-level video feature generation sub-model 2. The second-level resolution reference image is downsampled again to obtain a first-level resolution reference image, whose features, extracted by the image encoder, are input into the first-level video feature generation sub-model 1. The text features obtained from the input text by the text encoder are input into each of the video feature generation sub-models. The output of the first-level video feature generation sub-model is upsampled and input into the second-level video feature generation sub-model, and the output of the second-level video feature generation sub-model is in turn upsampled and input into the third-level video feature generation sub-model. The output of the third-level video feature generation sub-model is the target video feature sequence. This target video feature sequence then undergoes optical flow-guided feature interpolation: first, the optical flow of the target video feature sequence is predicted; then, frames are interpolated based on the predicted optical flow to obtain multiple relatively coarse initial interpolated frames; finally, a fine-tuning model refines the initial interpolated frames to obtain high-quality interpolated frames. These interpolated frames, together with the original video features, form the final video feature sequence, which is decoded by an enhanced video decoder to obtain the target video.
In the figure, generation sub-model 1 is labeled 16×4×16×16. The first 16 is the number of video feature frames, i.e., the length of the feature sequence. The 4 is the feature dimension, i.e., the number of channels. The last two 16s are the height and width of the video image, respectively. Generation sub-model 2 differs from generation sub-model 1 only in that the height and width of the generated video image are both 32, and generation sub-model 3 generates images with a height and width of 64, so the resolution increases level by level. In the N×4×64×64 label of the optical flow-guided feature interpolation block, N is the number of frames obtained after interpolation.
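The data flow of the three-level cascade described above can be sketched as follows. This is only an illustrative sketch, not the implementation of this disclosure: `encode_image` and `sub_model` are hypothetical stand-ins for the image encoder and the trained sub-models, the 512×512 reference image size and the 8× spatial reduction in the encoder are assumptions chosen so that the feature shapes match the 16×4×16×16, 16×4×32×32, and 16×4×64×64 labels in the figure, and average pooling and nearest-neighbour repetition are assumed for downsampling and upsampling.

```python
import numpy as np

FRAMES = 16  # length of the video feature sequence, as labeled in the figure


def downsample2x(x):
    """Halve the last two axes (H, W) by 2x2 average pooling."""
    *lead, h, w = x.shape
    return x.reshape(*lead, h // 2, 2, w // 2, 2).mean(axis=(-3, -1))


def upsample2x(x):
    """Double the last two axes (H, W) by nearest-neighbour repetition."""
    return x.repeat(2, axis=-2).repeat(2, axis=-1)


def encode_image(img):
    """Hypothetical image encoder stub: (3, H, W) -> (4, H/8, W/8)."""
    pooled = img
    for _ in range(3):
        pooled = downsample2x(pooled)
    extra = pooled.mean(axis=0, keepdims=True)
    return np.concatenate([pooled, extra], axis=0)


def sub_model(ref_feat, text_feat, prev=None):
    """Hypothetical sub-model stub returning features shaped (FRAMES, 4, S, S)."""
    out = np.broadcast_to(ref_feat, (FRAMES,) + ref_feat.shape).copy()
    out += text_feat.mean()      # text features condition every level
    if prev is not None:
        out += prev              # upsampled output of the previous level
    return out


def cascade_generate(ref_image, text_feat, n_levels=3):
    # Reference-image pyramid, reordered so the lowest resolution comes first.
    pyramid = [ref_image]
    for _ in range(n_levels - 1):
        pyramid.append(downsample2x(pyramid[-1]))
    pyramid.reverse()

    feats, shapes = None, []
    for level_image in pyramid:
        prev = None if feats is None else upsample2x(feats)
        feats = sub_model(encode_image(level_image), text_feat, prev)
        shapes.append(feats.shape)
    return feats, shapes  # feats is the target video feature sequence


target_features, level_shapes = cascade_generate(np.ones((3, 512, 512)), np.ones(8))
print(level_shapes)
```

Each level receives the reference-image features at its own resolution, the same text features, and (except for the first level) the upsampled output of the level below it.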

[0102] The technical solution of this disclosure utilizes mature text-to-image technology to first generate a high-quality reference image, and then uses this reference image to guide video generation. This makes the generated content controllable and significantly improves its quality, reducing the difficulty of the video generation model, allowing the model to focus more on dynamic synthesis. Furthermore, the entire solution greatly reduces the time required for training and testing in the feature space. Additionally, by employing a separate training method for the video decoder, it is possible to train the decoder using 4K resolution ultra-high-definition video, resulting in higher quality generated videos.

[0103] Figure 7 is a schematic diagram of a video generation apparatus according to an embodiment of the present disclosure. This embodiment is applicable to situations involving automatic video generation, such as generating video from text, generating video from images, or generating a new video from an input video. It relates to the field of artificial intelligence technology, specifically computer vision, deep learning, and other technical fields, and can be applied to AIGC (AI Generated Content) scenarios. This apparatus can implement the video generation method described in any embodiment of the present disclosure. As shown in Figure 7, the apparatus 700 specifically includes:

[0104] The reference image and text acquisition module 701 is used to acquire a reference image and text, wherein the reference image and the text are semantically related;

[0105] The video feature sequence generation module 702 is used to generate a target video feature sequence based on the features of the reference image and the features of the text using a pre-trained video feature generation model.

[0106] The video generation module 703 is used to decode the target video feature sequence using a video decoder to generate the target video.

[0107] Optionally, the reference image is generated based on the text using a pre-trained text-to-image model.

[0108] Optionally, the device further includes:

[0109] The reference frame editing module is used to extract reference frames from the original video, perform image editing on the reference frames, and obtain the reference image.

[0110] Optionally, the text may include text entered by the user.

[0111] Optionally, the text is generated based on the reference image using a pre-trained image-to-text model.

[0112] Optionally, the features of the reference image are extracted using an image encoder, and the features of the text are extracted using a text encoder.

[0113] Optionally, the video feature generation model is trained based on a diffusion model or a generative adversarial network.

[0114] Optionally, the video feature generation model is a cascaded model composed of multiple video feature generation sub-models.

[0115] Optionally, the cascaded model includes N levels of video feature generation sub-models, where the output of the previous level video feature generation sub-model is used as the input of the next level video feature generation sub-model, and N is a natural number greater than 1.

[0116] Optionally, the video feature sequence generation module includes:

[0117] The downsampling unit is used to obtain N reference images of different resolution levels, including the reference image, by downsampling the reference image.

[0118] A feature extraction unit is used to extract features from the N reference images at different resolution levels, respectively.

[0119] The first-level video feature generation sub-model processing unit is used to input the features of the reference image at the first-level resolution and the features of the text into the first-level video feature generation sub-model.

[0120] The current-level video feature generation sub-model processing unit is used to process any current-level video feature generation sub-model other than the first-level video feature generation sub-model in the cascaded model in the following manner: upsample the video feature sequence output by the previous-level video feature generation sub-model of the current-level video feature generation sub-model, and input the upsampled video feature sequence, the features of the reference image at the current-level resolution, and the features of the text into the current-level video feature generation sub-model;

[0121] The output unit is used to take the output of the Nth level video feature generation sub-model as the target video feature sequence.

[0122] Optionally, the device further includes:

[0123] A frame interpolation module is used to interpolate frames into the target video feature sequence.

[0124] Accordingly, the video generation module is specifically used for:

[0125] The target video is generated by decoding the interpolated target video feature sequence using a video decoder.

[0126] Optionally, the frame interpolation module includes:

[0127] An optical flow prediction unit is used to predict the optical flow of the target video feature sequence using a pre-trained optical flow prediction model;

[0128] The frame interpolation unit is used to interpolate the target video feature sequence in the feature space according to the optical flow to obtain multiple initial interpolated frames;

[0129] The fine-tuning unit is used to fine-tune the plurality of initial interpolated frames using a pre-trained fine-tuning model.
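The three units above (optical flow prediction, interpolation in the feature space, and fine-tuning) can be sketched as follows. This is a minimal sketch under stated assumptions rather than the method of this disclosure: the flow is passed in instead of being predicted by a pre-trained model, motion between adjacent feature frames is assumed to be linear, warping uses nearest-neighbour sampling, and `refine` is a hypothetical placeholder for the pre-trained fine-tuning model (identity by default).

```python
import numpy as np


def warp(feat, flow):
    """Backward-warp feat (C, H, W): out[:, y, x] = feat[:, y + flow_y, x + flow_x].

    flow has shape (2, H, W); flow[0] is the x displacement, flow[1] the y displacement.
    Nearest-neighbour sampling with border clamping (a simplification).
    """
    _, h, w = feat.shape
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    src_x = np.clip(np.rint(xs + flow[0]).astype(int), 0, w - 1)
    src_y = np.clip(np.rint(ys + flow[1]).astype(int), 0, h - 1)
    return feat[:, src_y, src_x]


def interpolate_pair(f0, f1, flow01, t):
    """Coarse feature frame at time t in (0, 1) between f0 and f1, assuming linear motion."""
    from_f0 = warp(f0, -t * flow01)          # move f0 forward by a fraction t of the flow
    from_f1 = warp(f1, (1.0 - t) * flow01)   # move f1 backward by the remaining fraction
    return (1.0 - t) * from_f0 + t * from_f1


def interpolate_sequence(features, flows, n_mid=1, refine=lambda frame: frame):
    """Insert n_mid refined frames between every pair of adjacent feature frames.

    features: (T, C, H, W); flows: (T - 1, 2, H, W), flow from frame i to frame i + 1.
    """
    out = []
    for i in range(len(features) - 1):
        out.append(features[i])
        for j in range(1, n_mid + 1):
            t = j / (n_mid + 1)
            coarse = interpolate_pair(features[i], features[i + 1], flows[i], t)
            out.append(refine(coarse))  # hypothetical fine-tuning model goes here
    out.append(features[-1])
    return np.stack(out)
```

With 16 input feature frames and one inserted frame per adjacent pair, the sketch returns 31 frames; the original frames are kept unchanged, consistent with the interpolated frames being combined with the original video features.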

[0130] Optionally, a time-series convolution module and an attention module may be added to the video decoder.

[0131] Optionally, during the training process of the video decoder, the input video samples are subjected to degradation processing, which includes blurring, adding noise, and compression.
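The degradation processing mentioned above can be illustrated with a small sketch. The specific operators are assumptions and are not specified in this disclosure: a 3×3 box filter stands in for blurring, additive Gaussian noise for noise, and coarse quantization is used as a crude stand-in for compression artifacts (a real pipeline would more likely use an actual image or video codec). The function names and default parameters are hypothetical.

```python
import numpy as np


def box_blur(frames, k=3):
    """Blur frames shaped (T, H, W, C) with a k x k box filter (k odd, edges replicated)."""
    pad = k // 2
    padded = np.pad(frames, ((0, 0), (pad, pad), (pad, pad), (0, 0)), mode="edge")
    h, w = frames.shape[1:3]
    out = np.zeros(frames.shape, dtype=np.float64)
    for dy in range(k):
        for dx in range(k):
            out += padded[:, dy:dy + h, dx:dx + w, :]
    return out / (k * k)


def degrade(frames, rng, noise_std=0.05, levels=16):
    """Apply blur, noise, and a quantization stand-in for compression to frames in [0, 1]."""
    blurred = box_blur(frames)
    noisy = blurred + rng.normal(0.0, noise_std, size=frames.shape)
    clipped = np.clip(noisy, 0.0, 1.0)
    # Quantize to a few intensity levels to mimic information loss from compression.
    return np.round(clipped * (levels - 1)) / (levels - 1)


rng = np.random.default_rng(0)
clean_clip = rng.random((4, 16, 16, 3))    # toy clip: 4 frames of 16x16 RGB
degraded_clip = degrade(clean_clip, rng)   # would be fed to the decoder training pipeline
```

Presumably the clean clip serves as the reconstruction target while the degraded clip is used on the input side, so that the decoder learns to produce clean output from imperfect inputs; this pairing is an interpretation, not a statement from the source.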

[0132] Figure 8 is a schematic diagram of a training apparatus for a video feature generation model according to an embodiment of the present disclosure. This embodiment is applicable to training a video feature generation model so that videos can be generated automatically based on the model, such as generating videos from text, from images, or generating a new video from an input video. It relates to the field of artificial intelligence technology, specifically computer vision and deep learning, and can be applied to AIGC (AI Generated Content) scenarios. This apparatus can implement the video generation method described in any embodiment of the present disclosure. As shown in Figure 8, the apparatus 800 specifically includes:

[0133] The acquisition module 801 is used to acquire reference image samples from video training samples and acquire text samples that are semantically related to the reference image samples;

[0134] The feature extraction module 802 is used to extract the features of the reference image sample and the text sample respectively, and to extract the video feature sequence samples of the video training sample;

[0135] The model training module 803 is used to train the video feature generation model by taking the features of the reference image sample and the features of the text sample as inputs to the video feature generation model and taking the video feature sequence sample as outputs to the video feature generation model.

[0136] The video feature generation model is a cascaded model composed of multiple video feature generation sub-models.

[0137] Optionally, each video feature generation sub-model in the cascaded model is trained separately.

[0138] Optionally, the cascaded model includes N levels of video feature generation sub-models, with the output of the previous level video feature generation sub-model serving as the input of the next level video feature generation sub-model.

[0139] Accordingly, the model training module includes:

[0140] The first-level video feature generation sub-model training unit is used to initialize the first-level video feature generation sub-model using a pre-trained text-to-image model and to train the first-level video feature generation sub-model.

[0141] The current-level video feature generation sub-model training unit is used to train any current-level video feature generation sub-model in the cascaded model other than the first-level video feature generation sub-model, in the following manner:

[0142] The current-level video feature generation sub-model is initialized using the model parameters obtained from training the previous-level video feature generation sub-model, and is then trained.

[0143] The sub-models for generating video features at each level have the same structure, and N is a natural number greater than 1.
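The level-by-level initialization described above can be sketched as follows. This is a schematic illustration only: parameters are modeled as a plain dictionary of lists, and `train_level` is a hypothetical stand-in for an actual training loop. Because all levels are stated to share the same structure, the trained parameters of one level can be copied one-to-one as the starting point of the next.

```python
import copy


def train_level(params, level):
    """Hypothetical training stub: returns updated parameters for the given level."""
    return {name: [w + 0.1 * level for w in weights] for name, weights in params.items()}


def train_cascade(text_to_image_params, n_levels):
    """Train N sub-models in order, warm-starting each from the previous result."""
    trained = []
    init = text_to_image_params          # level 1 starts from the text-to-image model
    for level in range(1, n_levels + 1):
        params = copy.deepcopy(init)     # identical structure, so weights map directly
        params = train_level(params, level)
        trained.append(params)
        init = params                    # the next level starts from this level's result
    return trained


pretrained = {"conv": [1.0, 2.0], "attn": [0.5]}
levels = train_cascade(pretrained, n_levels=3)
```

The deep copy keeps each level's parameters independent, which matches the option of training each sub-model separately.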

[0144] Optionally, the first-level video feature generation sub-model training unit includes:

[0145] The downsampling subunit is used to obtain N reference image samples at different resolution levels, including the reference image samples, by downsampling the reference image samples.

[0146] The first-level video feature generation sub-model training sub-unit is used to take the features of the reference image sample at the first-level resolution and the features of the text sample as input to the first-level video feature generation sub-model, extract the video features at the first-level resolution of the video training sample as output of the first-level video feature generation sub-model, and train the first-level video feature generation sub-model.

[0147] Optionally, the current-level video feature generation sub-model training unit includes:

[0148] An initialization sub-unit is used to initialize the current-level video feature generation sub-model using the model parameters of the previous-level video feature generation sub-model obtained after training.

[0149] An upsampling subunit is used to upsample the video features of the video training samples at the previous-level (lower) resolution;

[0150] The current-level video feature generation sub-model training sub-unit is used to take the features of the reference image at the current resolution, the text features, and the upsampled video features as input to the current-level video feature generation sub-model, extract the video features of the video training samples at the current resolution as output to the current-level video feature generation sub-model, and train the current-level video feature generation sub-model.

[0151] Optionally, the acquisition module includes:

[0152] A reference frame acquisition unit is used to extract reference frames from the video training samples;

[0153] An image editing unit is used to edit the reference frame to obtain the reference image sample.

[0154] Optionally, the text-to-image generation model includes a convolutional module and a spatial attention module, and the device further includes:

[0155] The network structure processing unit is used to add a 1D convolutional module with a temporal dimension after each 2D convolutional module and a temporal attention module after each spatial attention module in the network structure of the text-to-image generation model.
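The idea of following each 2D (spatial) convolution with a 1D convolution over the temporal dimension can be sketched as below. This is a simplified illustration, not the network of this disclosure: `spatial_layer` is a hypothetical stand-in for a pre-trained 2D convolution applied to each frame independently, and the temporal convolution shares one small kernel across all channels and pixels. Initializing the temporal kernel as an identity, so that the extended network initially behaves like the image model, is a common practice assumed here and is not stated in the source. The temporal attention module is omitted.

```python
import numpy as np


def spatial_layer(x):
    """Hypothetical stand-in for a pre-trained 2D convolution, applied frame by frame."""
    return x * 2.0


def temporal_conv1d(x, kernel):
    """1D convolution along the time axis of x shaped (T, C, H, W).

    kernel is an odd-length sequence of weights; the sequence is padded by
    repeating its first and last frames so that the output keeps T frames.
    """
    pad = len(kernel) // 2
    padded = np.pad(x, ((pad, pad), (0, 0), (0, 0), (0, 0)), mode="edge")
    out = np.zeros(x.shape, dtype=np.float64)
    for i, weight in enumerate(kernel):
        out += weight * padded[i:i + x.shape[0]]
    return out


def inflated_block(x, temporal_kernel):
    """Spatial 2D layer followed by the added temporal 1D layer."""
    return temporal_conv1d(spatial_layer(x), temporal_kernel)


video_features = np.random.rand(16, 4, 16, 16)
identity_kernel = [0.0, 1.0, 0.0]   # assumed initialization: temporal layer is a no-op
same_as_image_model = inflated_block(video_features, identity_kernel)
smoothed = inflated_block(video_features, [0.25, 0.5, 0.25])  # mixes adjacent frames
```

With the identity kernel the block reproduces the per-frame output of the spatial layer, while any other kernel lets information flow between adjacent frames.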

[0156] Optionally, the device further includes a video condition information processing module, specifically used for:

[0157] Obtain video condition information of the video training samples, wherein the video condition information includes at least a depth map and a target key point map;

[0158] The video condition information is used as input to the video feature generation sub-models at each level for training.

[0159] The above-described products can perform the methods provided in any embodiment of this disclosure, and have the corresponding functional modules and beneficial effects for performing the methods.

[0160] The collection, storage, use, processing, transmission, provision, and disclosure of user personal information involved in the technical solution disclosed herein comply with the provisions of relevant laws and regulations and do not violate public order and good morals.

[0161] According to embodiments of this disclosure, this disclosure also provides an electronic device, a readable storage medium, and a computer program product.

[0162] Figure 9 shows a schematic block diagram of an example electronic device 900 that can be used to implement embodiments of the present disclosure. The electronic device is intended to represent various forms of digital computers, such as laptop computers, desktop computers, workstations, personal digital assistants, servers, blade servers, mainframe computers, and other suitable computers. The electronic device may also represent various forms of mobile devices, such as personal digital processors, cellular phones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are merely illustrative and are not intended to limit the implementation of the present disclosure described and/or claimed herein.

[0163] As shown in Figure 9, device 900 includes a computing unit 901, which can perform various appropriate actions and processes based on a computer program stored in read-only memory (ROM) 902 or a computer program loaded into random access memory (RAM) 903 from storage unit 908. RAM 903 may also store various programs and data required for the operation of device 900. The computing unit 901, ROM 902, and RAM 903 are interconnected via bus 904. An input/output (I/O) interface 905 is also connected to bus 904.

[0164] Multiple components in device 900 are connected to I / O interface 905, including: input unit 906, such as keyboard, mouse, etc.; output unit 907, such as various types of monitors, speakers, etc.; storage unit 908, such as disk, optical disk, etc.; and communication unit 909, such as network card, modem, wireless transceiver, etc. Communication unit 909 allows device 900 to exchange information / data with other devices through computer networks such as the Internet and / or various telecommunications networks.

[0165] The computing unit 901 can be a variety of general-purpose and / or special-purpose processing components with processing and computing capabilities. Some examples of the computing unit 901 include, but are not limited to, a central processing unit (CPU), a graphics processing unit (GPU), various special-purpose artificial intelligence (AI) computing chips, various computing units running machine learning model algorithms, a digital signal processor (DSP), and any suitable processor, controller, microcontroller, etc. The computing unit 901 performs the various methods and processes described above, such as video generation methods. For example, in some embodiments, the video generation method may be implemented as a computer software program tangibly contained in a machine-readable medium, such as storage unit 908. In some embodiments, part or all of the computer program may be loaded and / or installed on device 900 via ROM 902 and / or communication unit 909. When the computer program is loaded into RAM 903 and executed by the computing unit 901, one or more steps of the video generation method described above may be performed. Alternatively, in other embodiments, the computing unit 901 may be configured to perform the video generation method by any other suitable means (e.g., by means of firmware).

[0166] Various embodiments of the systems and techniques described above herein can be implemented in digital electronic circuit systems, integrated circuit systems, field-programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), application-specific standard products (ASSPs), systems-on-a-chip (SoCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and / or combinations thereof. These various embodiments may include implementations in one or more computer programs that can be executed and / or interpreted on a programmable system including at least one programmable processor, which may be a dedicated or general-purpose programmable processor, capable of receiving data and instructions from a storage system, at least one input device, and at least one output device, and transmitting data and instructions to the storage system, the at least one input device, and the at least one output device.

[0167] The program code used to implement the methods of this disclosure may be written in any combination of one or more programming languages. This program code may be provided to a processor or controller of a general-purpose computer, special-purpose computer, or other programmable data processing apparatus, such that when executed by the processor or controller, the program code causes the functions / operations specified in the flowcharts and / or block diagrams to be implemented. The program code may be executed entirely on a machine, partially on a machine, as a standalone software package partially on a machine and partially on a remote machine, or entirely on a remote machine or server.

[0168] In the context of this disclosure, a machine-readable medium can be a tangible medium that may contain or store a program for use by or in conjunction with an instruction execution system, apparatus, or device. A machine-readable medium can be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium can be, but is not limited to, electronic, magnetic, optical, electromagnetic, infrared, or semiconductor systems, apparatus, or devices, or any suitable combination of the foregoing. More specific examples of machine-readable storage media include electrical connections based on one or more wires, portable computer disks, hard disks, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), optical storage devices, magnetic storage devices, or any suitable combination of the foregoing.

[0169] To provide interaction with a user, the systems and techniques described herein can be implemented on a computer having: a display device for displaying information to the user (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor); and a keyboard and pointing device (e.g., a mouse or trackball) through which the user provides input to the computer. Other types of devices can also be used to provide interaction with the user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user can be received in any form (including sound input, voice input, or tactile input).

[0170] The systems and technologies described herein can be implemented in computing systems that include backend components (e.g., as data servers), or computing systems that include middleware components (e.g., application servers), or computing systems that include frontend components (e.g., user computers with graphical user interfaces or web browsers through which users can interact with implementations of the systems and technologies described herein), or any combination of such backend, middleware, or frontend components. The components of the system can be interconnected via digital data communication of any form or medium (e.g., communication networks). Examples of communication networks include local area networks (LANs), wide area networks (WANs), blockchain networks, and the Internet.

[0171] Computer systems can include clients and servers. Clients and servers are generally geographically separated and typically interact via communication networks. The client-server relationship is established by computer programs running on the respective computers and having a client-server relationship with each other. A server can be a cloud server, also known as a cloud computing server or cloud host, a hosting product within the cloud computing service ecosystem that addresses the management difficulties and weak business scalability inherent in traditional physical hosting and VPS services. Servers can also be servers for distributed systems or servers integrated with blockchain technology.

[0172] Artificial intelligence (AI) is the study of enabling computers to simulate certain human thought processes and intelligent behaviors (such as learning, reasoning, thinking, and planning). It encompasses both hardware and software technologies. AI hardware technologies generally include sensors, dedicated AI chips, cloud computing, distributed storage, and big data processing. AI software technologies mainly include computer vision, speech recognition, natural language processing, machine learning / deep learning, big data processing, and knowledge graph technologies.

[0173] Cloud computing refers to a technology system that enables access to a shared pool of physical or virtual resources via a network. These resources can include servers, operating systems, networks, software, applications, and storage devices, and can be deployed and managed on demand and in a self-service manner. Cloud computing technology can provide efficient and powerful data processing capabilities for applications such as artificial intelligence and blockchain, as well as for model training.

[0174] Furthermore, according to embodiments of this disclosure, this disclosure also provides another electronic device, another readable storage medium, and another computer program product for performing one or more steps of the training method for the video feature generation model described in any embodiment of this disclosure. For the specific structure and program code, refer to the description of the embodiment shown in Figure 9, which is not repeated here.

[0175] It should be understood that steps may be reordered, added, or deleted using the various forms of processes shown above. For example, the steps described in this disclosure can be executed in parallel, sequentially, or in a different order, as long as the desired result of the technical solution provided in this disclosure can be achieved, and this is not limited herein.

[0176] The specific embodiments described above do not constitute a limitation on the scope of protection of this disclosure. Those skilled in the art should understand that various modifications, combinations, sub-combinations, and substitutions can be made according to design requirements and other factors. Any modifications, equivalent substitutions, and improvements made within the spirit and principles of this disclosure should be included within the scope of protection of this disclosure.

Claims

1. A video generation method, comprising: Acquire a reference image and text, wherein the reference image and the text are semantically related; Using a pre-trained video feature generation model, a target video feature sequence is generated based on the features of the reference image and the features of the text; wherein, the video feature generation model is a cascaded model composed of multiple video feature generation sub-models; the cascaded model includes N levels of video feature generation sub-models, with the output of the previous level video feature generation sub-model serving as the input of the next level video feature generation sub-model, where N is a natural number greater than 1; The target video feature sequence is decoded using a video decoder to generate the target video; The step of generating a target video feature sequence using a pre-trained video feature generation model based on the features of the reference image and the text includes: By downsampling the reference image, N reference images at different resolution levels, including the reference image, are obtained; Extract features from the N reference images at different resolution levels respectively; The features of the reference image at the first-level resolution and the features of the text are input into the first-level video feature generation sub-model; For any current-level video feature generation sub-model other than the first-level video feature generation sub-model in the cascaded model, the following processing method is used: the video feature sequence output by the previous-level video feature generation sub-model of the current-level video feature generation sub-model is upsampled, and the video feature sequence obtained by the upsampling, the features of the reference image at the current level resolution, and the features of the text are input into the current-level video feature generation sub-model. 
The output of the Nth level video feature generation sub-model is used as the target video feature sequence.

2. The method of claim 1, wherein, The reference image is generated based on the text using a pre-trained text-to-image model.

3. The method according to claim 1, further comprising: Reference frames are extracted from the original video, and image editing is performed on the reference frames to obtain the reference image.

4. The method of claim 1 or 2, wherein, The text includes text entered by the user.

5. The method according to claim 1 or 3, wherein, The text is generated from the reference image using a pre-trained image-to-text model.

6. The method according to any one of claims 1-3, wherein, The features of the reference image are extracted using an image encoder, and the features of the text are extracted using a text encoder.

7. The method according to any one of claims 1-3, wherein, The video feature generation model is trained based on a diffusion model or a generative adversarial network.

8. The method according to claim 1, further comprising: Frame interpolation is performed on the target video feature sequence; Accordingly, the step of decoding the target video feature sequence using a video decoder to generate the target video includes: The target video is generated by decoding the interpolated target video feature sequence using a video decoder.

9. The method according to claim 8, wherein, The step of interpolating frames into the target video feature sequence includes: Predict the optical flow of the target video feature sequence using a pre-trained optical flow prediction model; Based on the optical flow, the target video feature sequence is interpolated in the feature space to obtain multiple initial interpolated frames; The multiple initial interpolated frames are fine-tuned using a pre-trained fine-tuning model.

10. The method according to claim 1, wherein, The video decoder adds a time-series convolution module and an attention module.

11. The method according to claim 1, wherein, During the training process of the video decoder, the input video samples are subjected to degradation processing, which includes blurring, adding noise, and compression.

12. A training method for a video feature generation model, comprising: Obtain reference image samples from the video training samples, and obtain text samples that are semantically related to the reference image samples; The features of the reference image samples and the text samples are extracted respectively, and the video feature sequence samples of the video training samples are extracted; The video feature generation model is trained by using the features of the reference image sample and the features of the text sample as inputs and the video feature sequence sample as outputs. The video feature generation model is a cascaded model composed of multiple video feature generation sub-models; the cascaded model includes N levels of video feature generation sub-models, with the output of the previous level video feature generation sub-model serving as the input of the next level video feature generation sub-model. Training the video feature generation model includes: The first-level video feature generation sub-model is initialized using a pre-trained text-to-image model, and N reference image samples at different resolution levels, including the reference image samples, are obtained by downsampling the reference image samples. The features of the reference image sample at the first resolution and the features of the text sample are used as the input of the first-level video feature generation sub-model. The video features at the first resolution of the video training sample are extracted as the output of the first-level video feature generation sub-model, and the first-level video feature generation sub-model is trained. 
For any current-level video feature generation sub-model in the cascaded model other than the first-level video feature generation sub-model, training shall be performed as follows: The current-level video feature generation sub-model is initialized and then trained using the model parameters obtained from the training of the previous-level video feature generation sub-model. The sub-models for generating video features at each level have the same structure, and N is a natural number greater than 1.

13. The method according to claim 12, wherein, Each video feature generation sub-model in the cascaded model is trained separately.

14. The method according to claim 12, wherein, The step of initializing the current-level video feature generation sub-model using the model parameters obtained from the training of the previous-level video feature generation sub-model, and then training it, includes: The current-level video feature generation sub-model is initialized using the model parameters of the previous-level video feature generation sub-model obtained after training; The video features of the video training samples at the previous-level (lower) resolution are upsampled; The current-level video feature generation sub-model is trained by taking the features of the reference image at the current-level resolution, the features of the text, and the upsampled video features as inputs, and taking the video features of the video training samples extracted at the current-level resolution as the output of the current-level video feature generation sub-model.

15. The method according to claim 12, wherein, The step of obtaining reference image samples from video training samples includes: Extract reference frames from the video training samples; The reference frame is image edited to obtain the reference image sample.

16. The method according to any one of claims 12, 14, and 15, wherein, The text-to-image generation model includes a convolutional module and a spatial attention module, and the method further includes: In the network structure of the text-to-image generation model, a 1D convolutional module with a temporal dimension is added after each 2D convolutional module, and a temporal attention module is added after each spatial attention module.

17. The method according to any one of claims 12, 14, and 15, further comprising: Obtain video condition information of the video training samples, wherein the video condition information includes at least a depth map and a target key point map; The video condition information is used as input to the video feature generation sub-models at each level for training.

18. A video generation apparatus, comprising: a reference image and text acquisition module used to acquire a reference image and text, wherein the reference image and the text are semantically related; a video feature sequence generation module used to generate a target video feature sequence based on the features of the reference image and the features of the text using a pre-trained video feature generation model, wherein the video feature generation model is a cascaded model composed of multiple video feature generation sub-models, the cascaded model includes N levels of video feature generation sub-models, the output of the previous-level video feature generation sub-model serves as the input of the next-level video feature generation sub-model, and N is a natural number greater than 1; and a video generation module used to decode the target video feature sequence using a video decoder to generate the target video; wherein the video feature sequence generation module includes: a downsampling unit used to obtain, by downsampling the reference image, N reference images at different resolution levels, the N reference images including the reference image; a feature extraction unit used to extract features from each of the N reference images at different resolution levels; a first-level video feature generation sub-model processing unit used to input the features of the reference image at the first-level resolution and the features of the text into the first-level video feature generation sub-model; a current-level video feature generation sub-model processing unit used to process any current-level video feature generation sub-model in the cascaded model other than the first-level video feature generation sub-model in the following manner: upsampling the video feature sequence output by the previous-level video feature generation sub-model, and inputting the upsampled video feature sequence, the features of the reference image at the current-level resolution, and the features of the text into the current-level video feature generation sub-model; and an output unit used to take the output of the Nth-level video feature generation sub-model as the target video feature sequence.
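
The data flow through the cascaded model in claim 18 can be sketched in Python as below. This is a toy illustration under stated assumptions, not the patented model: the first level is taken to be the lowest resolution, each level differs by a factor of 2, the image size is divisible by 2^(N-1), and `toy_sub_model`, `encode_image`, and all other names are hypothetical stand-ins for trained networks.

```python
from typing import Callable, List, Optional

Frame = List[List[float]]   # H x W, single channel
Video = List[Frame]         # T x H x W


def downsample2x(img: Frame) -> Frame:
    """2x2 average pooling."""
    h, w = len(img) // 2, len(img[0]) // 2
    return [[(img[2 * y][2 * x] + img[2 * y][2 * x + 1]
              + img[2 * y + 1][2 * x] + img[2 * y + 1][2 * x + 1]) / 4.0
             for x in range(w)] for y in range(h)]


def upsample2x(frame: Frame) -> Frame:
    """Nearest-neighbour 2x upsampling."""
    out: Frame = []
    for row in frame:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


def build_pyramid(reference: Frame, n_levels: int) -> List[Frame]:
    """Return N images, lowest resolution first, the original reference last."""
    levels = [reference]
    for _ in range(n_levels - 1):
        levels.append(downsample2x(levels[-1]))
    return levels[::-1]


def run_cascade(reference: Frame, text_feat: float,
                sub_models: List[Callable[[Frame, float, Optional[Video]], Video]],
                encode_image: Callable[[Frame], Frame]) -> Video:
    """Level 1 sees reference and text features; later levels also see the
    upsampled output of the previous level. The last output is the target."""
    feats = [encode_image(img) for img in build_pyramid(reference, len(sub_models))]
    video = sub_models[0](feats[0], text_feat, None)
    for level in range(1, len(sub_models)):
        upsampled = [upsample2x(f) for f in video]
        video = sub_models[level](feats[level], text_feat, upsampled)
    return video


def toy_sub_model(ref_feat: Frame, text_feat: float,
                  prev_video: Optional[Video], num_frames: int = 3) -> Video:
    """Hypothetical stand-in for a trained sub-model (not a generative model)."""
    frames: Video = []
    for t in range(num_frames):
        frame = [[v + text_feat + t for v in row] for row in ref_feat]
        if prev_video is not None:
            frame = [[(a + b) / 2.0 for a, b in zip(r1, r2)]
                     for r1, r2 in zip(frame, prev_video[t])]
        frames.append(frame)
    return frames
```

For example, `run_cascade(ref, 1.0, [toy_sub_model] * 3, lambda img: img)` on an 8 x 8 reference passes through 2 x 2, 4 x 4, and 8 x 8 levels and returns three 8 x 8 feature frames.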

19. The apparatus according to claim 18, wherein the reference image is generated based on the text using a pre-trained text-to-image model.

20. The apparatus of claim 18, further comprising: a reference frame editing module used to extract reference frames from an original video and to perform image editing on the reference frames to obtain the reference image.

21. The apparatus according to claim 18 or 19, wherein the text includes text entered by the user.

22. The apparatus according to claim 18 or 20, wherein the text is generated from the reference image using a pre-trained image-to-text model.

23. The apparatus according to any one of claims 18-20, wherein the features of the reference image are extracted using an image encoder, and the features of the text are extracted using a text encoder.

24. The apparatus according to any one of claims 18-20, wherein the video feature generation model is trained based on a diffusion model or a generative adversarial network.

25. The apparatus of claim 18, further comprising: a frame interpolation module used to perform frame interpolation on the target video feature sequence; accordingly, the video generation module is specifically used to decode the interpolated target video feature sequence using the video decoder to generate the target video.

26. The apparatus according to claim 25, wherein the frame interpolation module includes: an optical flow prediction unit used to predict the optical flow of the target video feature sequence using a pre-trained optical flow prediction model; a frame interpolation unit used to interpolate the target video feature sequence in the feature space according to the optical flow to obtain a plurality of initial interpolated frames; and a fine-tuning unit used to fine-tune the plurality of initial interpolated frames using a pre-trained fine-tuning model.
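
The three units of claim 26 can be illustrated with a simplified Python sketch. It is not the patented method: the claims do not specify the warping scheme, so the sketch assumes linear motion, nearest-neighbour sampling with border clamping, and one inserted frame per adjacent pair. `predict_flow` and `refine` are hypothetical callables standing in for the pre-trained optical flow prediction model and fine-tuning model, and the sequence is assumed to be non-empty.

```python
from typing import Callable, List, Tuple

Feature = List[List[float]]             # H x W feature map, single channel
Flow = List[List[Tuple[float, float]]]  # per-position (dy, dx) motion from a to b


def sample(feat: Feature, y: float, x: float) -> float:
    """Nearest-neighbour lookup with coordinates clamped to the map."""
    h, w = len(feat), len(feat[0])
    yi = min(max(int(round(y)), 0), h - 1)
    xi = min(max(int(round(x)), 0), w - 1)
    return feat[yi][xi]


def interpolate_pair(feat_a: Feature, feat_b: Feature, flow: Flow,
                     t: float = 0.5) -> Feature:
    """Warp both neighbours toward time t along the flow and blend them.
    The flow is assumed to hold at the intermediate position as well."""
    h, w = len(feat_a), len(feat_a[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            dy, dx = flow[y][x]
            from_a = sample(feat_a, y - t * dy, x - t * dx)
            from_b = sample(feat_b, y + (1.0 - t) * dy, x + (1.0 - t) * dx)
            out[y][x] = (1.0 - t) * from_a + t * from_b
    return out


def interpolate_sequence(features: List[Feature],
                         predict_flow: Callable[[Feature, Feature], Flow],
                         refine: Callable[[Feature], Feature]) -> List[Feature]:
    """Insert one refined intermediate feature map between each adjacent pair."""
    out: List[Feature] = []
    for a, b in zip(features, features[1:]):
        initial = interpolate_pair(a, b, predict_flow(a, b))
        out.extend([a, refine(initial)])
    out.append(features[-1])
    return out
```

In this sketch, a feature that moves two positions to the right between two frames lands one position to the right in the inserted frame.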

27. The apparatus according to claim 18, wherein a temporal convolution module and an attention module are added to the video decoder.

28. The apparatus according to claim 18, wherein, during training of the video decoder, the input video samples are subjected to degradation processing, the degradation processing including blurring, adding noise, and compression.
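
The degradation processing named in claim 28 can be illustrated with a short Python sketch. The claims give no parameters, so everything here is an assumption: a 3 x 3 box blur, Gaussian noise with standard deviation `sigma`, pixel values in [0, 1], and coarse quantization as a simple stand-in for compression (a real pipeline might use an image or video codec instead). The function names are hypothetical.

```python
import random
from typing import List

Frame = List[List[float]]   # H x W, values in [0, 1]


def box_blur(img: Frame) -> Frame:
    """3x3 box blur; border pixels average over the neighbours that exist."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            vals = [img[yy][xx]
                    for yy in range(max(0, y - 1), min(h, y + 2))
                    for xx in range(max(0, x - 1), min(w, x + 2))]
            out[y][x] = sum(vals) / len(vals)
    return out


def add_noise(img: Frame, sigma: float, rng: random.Random) -> Frame:
    """Additive Gaussian noise, clipped back to [0, 1]."""
    return [[min(1.0, max(0.0, v + rng.gauss(0.0, sigma))) for v in row]
            for row in img]


def quantize(img: Frame, levels: int) -> Frame:
    """Coarse quantization used here as a stand-in for lossy compression."""
    step = levels - 1
    return [[round(v * step) / step for v in row] for row in img]


def degrade(frame: Frame, sigma: float = 0.05, levels: int = 16,
            seed: int = 0) -> Frame:
    """Blur, then add noise, then 'compress' one frame of a video sample."""
    rng = random.Random(seed)
    return quantize(add_noise(box_blur(frame), sigma, rng), levels)
```

Applying `degrade` to each frame of an input video sample yields the degraded input; the order of the three operations is a choice made for this sketch.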

29. A training apparatus for a video feature generation model, comprising: an acquisition module used to acquire reference image samples from video training samples and to acquire text samples that are semantically related to the reference image samples; a feature extraction module used to extract the features of the reference image samples and the features of the text samples respectively, and to extract video feature sequence samples from the video training samples; and a model training module used to train the video feature generation model by taking the features of the reference image samples and the features of the text samples as inputs of the video feature generation model and taking the video feature sequence samples as outputs of the video feature generation model; wherein the video feature generation model is a cascaded model composed of multiple video feature generation sub-models, the cascaded model includes N levels of video feature generation sub-models, and the output of the previous-level video feature generation sub-model serves as the input of the next-level video feature generation sub-model; accordingly, the model training module includes: a first-level video feature generation sub-model training unit used to initialize the first-level video feature generation sub-model using a pre-trained text-to-image model and to train the first-level video feature generation sub-model; and a current-level video feature generation sub-model training unit used to train any current-level video feature generation sub-model in the cascaded model other than the first-level video feature generation sub-model in the following manner: initializing the current-level video feature generation sub-model using the model parameters obtained from the training of the previous-level video feature generation sub-model, and then training it; wherein the video feature generation sub-models at each level have the same structure, and N is a natural number greater than 1; and the first-level video feature generation sub-model training unit includes: a downsampling subunit used to obtain, by downsampling the reference image samples, N reference image samples at different resolution levels, including the reference image samples; and a first-level video feature generation sub-model training subunit used to train the first-level video feature generation sub-model by taking the features of the reference image sample at the first-level resolution and the features of the text sample as the input of the first-level video feature generation sub-model, and taking the video features of the video training sample extracted at the first-level resolution as the output of the first-level video feature generation sub-model.

30. The apparatus according to claim 29, wherein each video feature generation sub-model in the cascaded model is trained separately.

31. The apparatus according to claim 29, wherein the current-level video feature generation sub-model training unit includes: an initialization subunit used to initialize the current-level video feature generation sub-model using the trained model parameters of the previous-level video feature generation sub-model; an upsampling subunit used to upsample the video features of the video training samples at the previous-level resolution; and a current-level video feature generation sub-model training subunit used to train the current-level video feature generation sub-model by taking the features of the reference image at the current-level resolution, the features of the text, and the upsampled video features as the input of the current-level video feature generation sub-model, and taking the video features of the video training samples extracted at the current-level resolution as the output of the current-level video feature generation sub-model.

32. The apparatus according to claim 29, wherein the acquisition module includes: a reference frame acquisition unit used to extract reference frames from the video training samples; and an image editing unit used to perform image editing on the reference frames to obtain the reference image samples.

33. The apparatus according to any one of claims 29, 31, and 32, wherein the text-to-image generation model includes a convolutional module and a spatial attention module, and the apparatus further includes: a network structure processing unit used to add, in the network structure of the text-to-image generation model, a 1D convolutional module along the temporal dimension after each 2D convolutional module and a temporal attention module after each spatial attention module.

34. The apparatus according to any one of claims 29, 31, and 32, further comprising a video condition information processing module used to: obtain video condition information of the video training samples, wherein the video condition information includes at least a depth map and a target key point map; and use the video condition information as input to the video feature generation sub-models at each level for training.

35. An electronic device, comprising: at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores instructions executable by the at least one processor, and the instructions, when executed by the at least one processor, enable the at least one processor to perform the video generation method of any one of claims 1-11, or the training method of the video feature generation model of any one of claims 12-17.

36. A non-transitory computer-readable storage medium storing computer instructions, wherein the computer instructions are used to cause a computer to execute the video generation method according to any one of claims 1-11, or the training method of the video feature generation model according to any one of claims 12-17.

37. A computer program product comprising a computer program that, when executed by a processor, implements the video generation method according to any one of claims 1-11, or the training method for the video feature generation model according to any one of claims 12-17.