Animation video generation method and apparatus
By acquiring the target guidance image and text prompts, determining the spatiotemporal guidance conditions, and using the spatiotemporal mask module, encoding module, and transformer module to generate a series of frames that are associated with the target guidance image and match the text prompts, the problem of uncontrollability and type limitation of existing video generation models is solved, and high-quality animated video generation is achieved.
Patent Information
- Authority / Receiving Office: WO
- Patent Type: Applications
- Current Assignee / Owner: SHANGHAI HODE INFORMATION TECH CO LTD
- Filing Date: 2025-08-25
- Publication Date: 2026-06-18
AI Technical Summary
Existing video generation models produce uncontrollable video content, have limited video types, and negatively impact user experience.
By acquiring the target guidance image and text prompts, the spatiotemporal guidance conditions are determined. The spatiotemporal mask module, encoding module, and transformer module are used to generate a series of frames that are associated with the target guidance image and match the text prompts, thus forming an animated video clip.
It enables the generation of controllable animated video clips under the guidance of spatiotemporal conditions and text prompts, improving the quality and diversity of video generation and making it suitable for various video generation tasks.
Description
Animated video generation method and apparatus
[0001] This application claims priority to Chinese Patent Application No. 202411853390.7, filed on December 13, 2024, entitled "Method and Apparatus for Generating Animated Videos", the entire contents of which are incorporated herein by reference.
Technical Field
[0002] This application relates to the field of artificial intelligence technology, and in particular to an animation video generation method, apparatus, computer equipment, computer-readable storage medium, and computer program product.
Background Technology
[0003] Generative AI-based video creation is an efficient and low-cost technology that greatly reduces the difficulty of video generation. However, the inventors have found that current video generation models still suffer from problems such as uncontrollable generated video content and limited video types, which negatively impact user experience.
[0004] It should be noted that the above content is not necessarily prior art, nor is it intended to limit the scope of patent protection of this application.
Summary of the Invention
[0005] This application provides an animation video generation method, apparatus, computer device, computer-readable storage medium, and computer program product to solve or alleviate one or more of the technical problems mentioned above.
[0006] One aspect of this application provides a method for generating animated videos, the method comprising:
[0007] Obtain the target guidance image and target text prompt;
[0008] Determine the spatiotemporal guidance conditions based on the target guidance image;
[0009] Based on the spatiotemporal guidance conditions and the guidance of the target text prompt, a series of frames that are associated with the target guidance image in time and space and match the target text prompt are generated to constitute the animated video clip.
[0010] Optionally, the method is implemented using a pre-trained video generation model, which includes: a spatiotemporal masking module, an encoding module, and a transformer module, wherein:
[0011] The spatiotemporal guidance conditions are determined by the spatiotemporal masking module based on the target guidance image. The spatiotemporal guidance conditions include a guidance sequence and a masking sequence. The guidance sequence is used to specify the position of the target guidance image in the animation video segment, and the masking sequence is used to define whether the series of frames are displayed in the animation video segment.
[0012] The guiding sequence and the mask sequence are encoded by the encoding module to obtain the guiding feature sequence and the mask encoding sequence;
[0013] The animated video clip is generated by the transformer module based on the guiding feature sequence, the mask encoding sequence, noise, and the target text prompt.
[0014] Optionally, the video generation model is obtained through the following operations:
[0015] Obtain a training dataset, which includes multiple target animation video clips, each of which has a corresponding text prompt. The target animation video clips are used to determine the guidance conditions.
[0016] Based on the multiple target animated video clips and corresponding text prompts and guidance conditions, the pre-trained base model is subjected to supervised fine-tuning to obtain the target video generation model, which is used to generate animated videos.
[0017] Optionally, obtain the training dataset, including:
[0018] Obtain the original video set, which includes multiple animated videos;
[0019] The multiple animated videos are segmented into scenes to obtain multiple video clips;
[0020] Based on preset filtering rules, the multiple target animation video segments are determined from the multiple video segments;
[0021] The multiple target animation video clips are input into a pre-trained video annotation model to obtain multiple text prompts through the video annotation model, with each text prompt corresponding to a target animation video clip;
[0022] The training dataset is constructed based on the multiple target animated video clips and the multiple text prompts.
[0023] Optionally, the base model includes: a spatiotemporal mask module, an encoding module, and a transformer module, and the target animation video segment includes a frame sequence composed of multiple frames;
[0024] Correspondingly, based on the multiple target animation video clips and corresponding text prompts and guidance conditions, the pre-trained base model is subjected to supervised fine-tuning to obtain the target video generation model, including:
[0025] The frame sequence is input to the spatiotemporal masking module to determine a guide frame from the plurality of frames, and a guide sequence and a mask sequence are determined based on the guide frame; wherein, the guide frame is used to guide the transformer module to generate a series of frames associated with the guide frame;
[0026] The guiding sequence and the mask sequence are input into the encoding module to obtain the guiding feature sequence and the mask encoding sequence through the encoding module;
[0027] Based on the guiding feature sequence, the mask encoding sequence, noise, and the corresponding text prompts, the transformer module is subjected to supervised fine-tuning to obtain the target video generation model.
[0028] Optionally, determining the guide sequence and mask sequence based on the guide frame includes:
[0029] The guide frame is placed at the target position in the guide sequence, and the other positions in the guide sequence correspond to a series of frames generated by the transformer module that are associated with the guide frame;
[0030] A first mask sequence is determined based on the guide sequence, the first mask sequence including a demasking mask, the demasking mask corresponding to the guide frame.
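The construction in [0028] to [0030] can be illustrated with a minimal sketch. It is not the application's implementation: the function name `build_guide_and_mask`, the use of `None` as a placeholder for frames still to be generated, and the convention that 1 marks the demasked guide position are all assumptions made for illustration.

```python
from typing import Any, List, Optional, Tuple


def build_guide_and_mask(
    guide_frame: Any, num_frames: int, target_pos: int = 0
) -> Tuple[List[Optional[Any]], List[int]]:
    """Place the guide frame at target_pos and derive a per-frame mask.

    Assumed convention (not stated in the source): mask value 1 marks the
    demasked guide position, 0 marks positions left for the model to generate.
    """
    if not 0 <= target_pos < num_frames:
        raise ValueError("target_pos must lie inside the sequence")
    guide_seq: List[Optional[Any]] = [None] * num_frames  # placeholders
    guide_seq[target_pos] = guide_frame
    mask_seq = [0] * num_frames
    mask_seq[target_pos] = 1
    return guide_seq, mask_seq


guide_seq, mask_seq = build_guide_and_mask("guide.png", num_frames=5, target_pos=2)
```

Under this sketch, choosing a different `target_pos` changes only where the guide frame constrains the generated sequence.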
[0031] Optionally, determining the guide sequence and mask sequence based on the guide frame includes:
[0032] The guide frame is placed at the target position in the guide sequence, and the other positions in the guide sequence correspond to a series of frames generated by the transformer module that are associated with the guide frame;
[0033] A target mask is obtained based on the frame sequence. The target mask includes a masked region and an unmasked region. The target mask is used to control the motion region of the series of frames. The masked region corresponds to the non-motion region, and the unmasked region corresponds to the motion region.
[0034] A second mask sequence is determined based on the guiding sequence and the target mask. The second mask sequence has the same length as the guiding sequence and includes multiple target masks.
[0035] Optionally, obtaining the target mask based on the frame sequence includes:
[0036] Foreground detection is performed on the first frame of the frame sequence to determine the foreground region of the first frame;
[0037] Based on the foreground region of the first frame, a foreground mask corresponding to each of the plurality of frames is generated, and the foreground mask is used to represent the motion region and non-motion region of the corresponding frame;
[0038] The target mask is generated based on the foreground mask corresponding to each of the multiple frames.
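As a rough sketch of [0033] to [0038], the per-frame foreground masks might be merged into one target mask and then repeated to the length of the guide sequence. The source does not say how the foreground masks are combined; the pixel-wise union used here, the function names, and the encoding (1 for the motion or unmasked region, 0 for the non-motion or masked region) are assumptions.

```python
from typing import List

# Assumed encoding: 1 = motion (unmasked) region, 0 = non-motion (masked) region.
Mask = List[List[int]]


def union_masks(foreground_masks: List[Mask]) -> Mask:
    """Merge per-frame foreground masks into one target mask (pixel-wise OR)."""
    if not foreground_masks:
        raise ValueError("at least one foreground mask is required")
    height, width = len(foreground_masks[0]), len(foreground_masks[0][0])
    target = [[0] * width for _ in range(height)]
    for mask in foreground_masks:
        for y in range(height):
            for x in range(width):
                if mask[y][x]:
                    target[y][x] = 1
    return target


def second_mask_sequence(target_mask: Mask, guide_len: int) -> List[Mask]:
    """Repeat the target mask so the result has the same length as the guide sequence."""
    return [[row[:] for row in target_mask] for _ in range(guide_len)]


target = union_masks([[[0, 1], [0, 0]], [[0, 0], [1, 0]]])
mask_seq = second_mask_sequence(target, guide_len=3)
```

Each entry of `mask_seq` is an independent copy, so editing one frame's mask does not alter the others.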
[0039] Optionally, based on the multiple target animation video clips and corresponding text prompts and guidance conditions, a pre-trained base model is subjected to supervised fine-tuning to obtain a target video generation model, including:
[0040] Determine the resolution and frame rate of each of the multiple target animation video segments;
[0041] Based on the resolution and frame rate of each of the multiple target animation video segments, the training dataset is divided into multiple subsets, and at least one of the resolution and frame rate of each subset is different from the other subsets;
[0042] The training priority for each subset is determined based on the resolution and frame rate corresponding to each subset.
[0043] Based on the training priority of each subset, the base model is sequentially fine-tuned using different subsets to obtain the target video generation model.
[0044] Here, the lower the resolution and/or frame rate of a subset, the higher its training priority.
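The subset ordering in [0040] to [0044] could be sketched as follows. The `Clip` record, the grouping key, and the tie-breaking rule (pixel count first, then frame rate) are illustrative assumptions; the source only states that lower resolution and/or frame rate means higher training priority.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple, Tuple


class Clip(NamedTuple):
    name: str
    width: int
    height: int
    fps: int


def ordered_subsets(clips: List[Clip]) -> List[Tuple[Tuple[int, int], List[Clip]]]:
    """Group clips by (pixel count, fps) and order subsets from low to high.

    Subsets earlier in the returned list would be used for fine-tuning first.
    """
    groups: Dict[Tuple[int, int], List[Clip]] = defaultdict(list)
    for clip in clips:
        groups[(clip.width * clip.height, clip.fps)].append(clip)
    return sorted(groups.items(), key=lambda item: item[0])


schedule = ordered_subsets([
    Clip("a", 1280, 720, 24),
    Clip("b", 640, 360, 8),
    Clip("c", 640, 360, 24),
    Clip("d", 640, 360, 8),
])
for (pixels, fps), subset in schedule:
    print(pixels, fps, [clip.name for clip in subset])
```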
[0045] Optionally, the animated video generation method further includes:
[0046] Obtain a benchmark dataset, which includes multiple benchmark animated video clips, each with a corresponding text prompt;
[0047] Based on the target video generation model, the benchmark animated video clip, and the corresponding text prompt, the target video is obtained;
[0048] Determine the visual quality and matching degree of the target video;
[0049] Based on the visual quality and the matching degree, benchmark test results are determined, and the benchmark test results are used to evaluate the performance of the target video generation model.
[0050] Optionally, the target video includes multiple frames, and the visual quality includes visual smoothness, which is used to represent the coherence between the multiple frames;
[0051] Correspondingly, determining the visual quality of the target video includes:
[0052] Obtain the visual features of the multiple frames;
[0053] Based on the visual features of the multiple frames, the similarity between two adjacent frames in the multiple frames is obtained sequentially;
[0054] Based on the similarity between two adjacent frames in the plurality of frames, a first average similarity is determined, which is used to represent the visual smoothness.
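A minimal sketch of the visual smoothness computation in [0052] to [0054] follows. The source does not name the similarity measure or the feature extractor; cosine similarity over plain feature vectors is an assumption, and the function names are hypothetical.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two feature vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def visual_smoothness(features: List[Sequence[float]]) -> float:
    """Average similarity between each pair of adjacent frames."""
    if len(features) < 2:
        raise ValueError("need at least two frames")
    sims = [cosine(features[i], features[i + 1]) for i in range(len(features) - 1)]
    return sum(sims) / len(sims)


# Three toy frame features: the first two are identical, the third is orthogonal.
score = visual_smoothness([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
```

In this toy input the two adjacent-pair similarities are 1 and 0, so the first average similarity is 0.5.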
[0055] Optionally, the visual quality includes motion attributes; correspondingly, determining the visual quality of the target video includes:
[0056] The corresponding text prompt is input into a pre-trained motion scoring model to obtain a first motion score through the motion scoring model;
[0057] The target video is input into the motion scoring model to obtain a second motion score;
[0058] Determine the similarity between the first motion score and the second motion score, the similarity being used to represent the motion attribute.
[0059] Optionally, the target video comprises multiple frames, and the visual quality includes visual appeal;
[0060] Correspondingly, determining the visual quality of the target video includes:
[0061] Multiple keyframes are determined from the multiple frames;
[0062] Aesthetic scores are obtained from the multiple keyframes to determine an average aesthetic score, which is used to represent the visual appeal.
[0063] Optionally, the target video includes multiple frames, and the matching degree includes the matching degree between the corresponding text prompt and the target video;
[0064] Correspondingly, obtaining the matching degree of the target video includes:
[0065] The corresponding text prompt and the multiple frames are input into a multimodal pre-trained model to obtain multiple matching scores, each matching score representing the degree of matching between the corresponding frame and the text prompt;
[0066] An average matching score is determined based on the multiple matching scores, and the average matching score is used to represent the degree of matching between the corresponding text prompt and the target video.
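The per-frame scoring and averaging in [0065] and [0066] might look like the sketch below. The multimodal pre-trained model is represented by a caller-supplied `score_fn`; the `toy_score` stand-in, which counts prompt words found in a frame's tag set, is purely hypothetical and exists only so the example runs.

```python
from typing import Callable, Sequence, Set


def average_match(
    prompt: str,
    frames: Sequence[Set[str]],
    score_fn: Callable[[str, Set[str]], float],
) -> float:
    """Score every frame against the prompt and return the mean score."""
    if not frames:
        raise ValueError("need at least one frame")
    scores = [score_fn(prompt, frame) for frame in frames]
    return sum(scores) / len(scores)


def toy_score(prompt: str, frame_tags: Set[str]) -> float:
    """Hypothetical stand-in for a multimodal model: fraction of prompt words
    that appear among the frame's tags."""
    words = prompt.lower().split()
    return sum(word in frame_tags for word in words) / len(words)


frames = [{"puppy", "spinning"}, {"puppy"}]
match = average_match("puppy spinning", frames, toy_score)
```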
[0067] Optionally, the target video includes multiple frames, including a guide frame, which is used to guide the target video generation model to generate the target video, and the matching degree includes the matching degree between the guide frame and the target video;
[0068] Correspondingly, determining the matching degree of the target video includes:
[0069] Obtain the target text from the corresponding text prompt, the target text being used to specify the image style of the guide frame;
[0070] The guiding frame and the target text are input into a multimodal pre-trained model to obtain a first matching score;
[0071] The multiple frames and the target text are input into the multimodal pre-trained model to obtain multiple second matching scores, each second matching score corresponding to one frame;
[0072] A second average similarity is determined based on the similarity between the first matching score and the plurality of second matching scores, and the second average similarity is used to represent the similarity between the guide frame and the target video.
[0073] Optionally, the target video includes multiple frames, the target video includes a target character, the benchmark animated video clip includes a benchmark character, and the matching degree includes the matching degree between the target character and the benchmark character;
[0074] Correspondingly, determining the matching degree of the target video includes:
[0075] The features of the benchmark character are obtained based on the benchmark animated video clip;
[0076] Multiple sample frames are obtained from the multiple frames;
[0077] The multiple sample frames are input into a pre-trained character detection model to obtain the features of the multiple target characters;
[0078] Based on the similarity between the features of the benchmark character and the features of the multiple target characters, a character similarity is determined, which is used to represent the matching degree between the target character and the benchmark character.
[0079] Another aspect of this application provides an animation video generation apparatus, the apparatus comprising:
[0080] The acquisition module is used to acquire a training dataset, which includes multiple target animation video clips, each of which has a corresponding text prompt. The target animation video clips are used to determine the guidance conditions.
[0081] The supervised fine-tuning module is used to perform supervised fine-tuning on the pre-trained base model based on the multiple target animation video clips and corresponding text prompts and guidance conditions to obtain the target video generation model, which is used to generate animation videos.
[0082] Another aspect of this application provides a computer device, including:
[0083] At least one processor; and
[0084] A memory that is communicatively connected to the at least one processor;
[0085] Wherein: the memory stores instructions executable by the at least one processor, the instructions being executed by the at least one processor to enable the at least one processor to perform the following operations:
[0086] Obtain the target guidance image and target text prompt;
[0087] Determine the spatiotemporal guidance conditions based on the target guidance image;
[0088] Based on the spatiotemporal guidance conditions and the guidance of the target text prompt, a series of frames that are associated with the target guidance image in time and space and match the target text prompt are generated to constitute the animated video clip.
[0089] Another aspect of this application provides a computer-readable storage medium storing computer instructions that, when executed by a processor, perform the following operations:
[0090] Obtain the target guidance image and target text prompt;
[0091] Determine the spatiotemporal guidance conditions based on the target guidance image;
[0092] Based on the spatiotemporal guidance conditions and the guidance of the target text prompt, a series of frames that are associated with the target guidance image in time and space and match the target text prompt are generated to constitute the animated video clip.
[0093] Another aspect of this application provides a computer program product including computer-readable instructions that, when executed by a processor, perform the following operations:
[0094] Obtain the target guidance image and target text prompt;
[0095] Determine the spatiotemporal guidance conditions based on the target guidance image;
[0096] Based on the spatiotemporal guidance conditions and the guidance of the target text prompt, a series of frames that are associated with the target guidance image in time and space and match the target text prompt are generated to constitute the animated video clip.
[0097] The embodiments of this application employing the above-described technical solution may have the following advantages:
[0098] The process involves acquiring a target guidance image and a target text prompt. Spatiotemporal guidance conditions are determined based on the target guidance image. Following the guidance of the spatiotemporal guidance conditions and the target text prompt, a series of frames that are temporally and spatially associated with the target guidance image and match the target text prompt are generated to constitute the animated video clip. It is understood that the technical solution of this application embodiment can generate an animated video clip composed of a series of frames that change temporally and spatially relative to the target guidance image and match the target text, under the guidance of spatiotemporal guidance conditions and text prompts. This is applicable to various video generation tasks and is particularly optimized for animated video generation.
Attached Figure Description
[0099] The accompanying drawings exemplify embodiments and form part of the specification, serving together with the textual description to explain exemplary implementations of the embodiments. The illustrated embodiments are for illustrative purposes only and do not limit the scope of the claims. Throughout the drawings, the same reference numerals refer to similar but not necessarily identical elements.
[0100] Figure 1 schematically illustrates a flowchart of an animation video generation method according to Embodiment 1 of this application;
[0101] Figure 2 schematically illustrates the overall flowchart of the animation video generation method according to Embodiment 1 of this application;
[0102] Figure 3 schematically illustrates the training flowchart of the video generation model according to Embodiment 1 of this application;
[0103] Figure 4 schematically illustrates the benchmark test results of the video generation model according to Embodiment 1 of this application;
[0104] Figure 5 schematically illustrates the motion accuracy test results of the video generation model according to Embodiment 1 of this application;
[0105] Figure 6 schematically illustrates the effect of the video generation model according to Embodiment 1 of this application;
[0106] Figure 7 schematically shows a block diagram of an animation video generation apparatus according to Embodiment 2 of this application; and
[0107] Figure 8 schematically illustrates a hardware architecture diagram of a computer device according to Embodiment 3 of this application.
Embodiments of the present invention
[0108] To make the objectives, technical solutions, and advantages of this application clearer, the following detailed description is provided in conjunction with the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative and not intended to limit the scope of this application. All other embodiments obtained by those skilled in the art based on the embodiments in this application without inventive effort are within the scope of protection of this application.
[0109] It should be noted that the descriptions involving "first," "second," etc., in the embodiments of this application are for descriptive purposes only and should not be construed as indicating or implying their relative importance or implicitly specifying the number of technical features indicated. Therefore, a feature defined with "first" or "second" may explicitly or implicitly include at least one of that feature. Furthermore, the technical solutions of the various embodiments can be combined with each other, but this must be based on the ability of those skilled in the art to implement them. If the combination of technical solutions is contradictory or impossible to implement, it should be considered that such a combination of technical solutions does not exist and is not within the scope of protection claimed in this application.
[0110] In the description of this application, it should be understood that the numerical labels before the steps do not indicate the order of the steps, but are only used to facilitate the description of this application and to distinguish each step, and therefore should not be construed as a limitation of this application.
[0111] First, definitions of the terms used in this application are provided:
[0112] Supervised Fine-Tuning (SFT): Targeted fine-tuning of an already trained model using labeled data to optimize its performance in a specific domain or task.
[0113] UGC (User-Generated Content): Content created and published by users rather than by professional producers.
[0114] MMD (MikuMikuDance): Animated videos created with virtual character models.
[0115] Secondly, to facilitate understanding of the technical solutions provided in the embodiments of this application by those skilled in the art, the relevant technologies are described below:
[0116] Generative artificial intelligence (AI) is an efficient and low-cost technology for video creation, greatly reducing the difficulty of video generation. However, the applicant understands that related video generation models still suffer from problems such as uncontrollable generated video content and limited video types, which negatively impact user experience.
[0117] Therefore, this application provides a technical solution for constructing a video generation model. In this technical solution: (1) a method for constructing an animation dataset (training dataset) is provided, which provides more than 10 million high-quality text-video pairs, which can improve the generalization ability and generation accuracy of the video generation model and meet diverse video generation needs; (2) a video generation model supporting multiple tasks is constructed, such as image-to-video generation, keyframe interpolation, and local image-guided animation; (3) the conditional generation capability of the video generation model is enhanced by using a spatiotemporal mask module to achieve motion control for specific regions, such as precise control of character actions; (4) a high-quality benchmark dataset is constructed, and a metric specifically for evaluating animation video generation is developed; (5) full-parameter supervised fine-tuning is performed on the pre-trained model to adapt to animation video generation tasks; (6) image generation is incorporated into a multi-task training framework to improve the generalization ability of the video generation model in diverse artistic styles. See below for details.
[0118] The technical solutions of this application are described below through several embodiments. It should be understood that these embodiments can be implemented in many different forms and should not be construed as being limited to the embodiments set forth herein.
[0119] Embodiment 1
[0120] Figure 1 schematically illustrates a flowchart of an animated video generation method according to Embodiment 1 of this application.
[0121] As shown in Figure 1, the animation video generation method may include steps S1 to S3, wherein:
[0122] Step S1: Obtain the target guidance image and target text prompt.
[0123] Step S2: Determine the spatiotemporal guidance conditions based on the target guidance image.
[0124] Step S3: Based on the spatiotemporal guidance conditions and the guidance of the target text prompt, generate a series of frames that are associated with the target guidance image in time and space and match the target text prompt, so as to constitute the animation video segment.
[0125] The animated video generation method provided in this embodiment obtains a target guidance image and a target text prompt. Spatiotemporal guidance conditions are determined based on the target guidance image. According to the guidance of the spatiotemporal guidance conditions and the target text prompt, a series of frames that are temporally and spatially associated with the target guidance image and match the target text prompt are generated to constitute the animated video segment. It can be seen that the technical solution of this application embodiment can generate an animated video segment composed of a series of frames that change temporally and spatially relative to the target guidance image and match the target text, under the guidance of spatiotemporal guidance conditions and text prompts. It is applicable to various video generation tasks and is particularly optimized for animated video generation.
[0126] The following, with reference to Figure 1, elaborates on each step in steps S1 to S3, as well as other optional steps.
[0127] Step S1: Obtain the target guidance image and target text prompt.
[0128] The target guidance image can be any type of image, such as a real photograph, a virtual character image, a screenshot from an animation or comic, or a frame from a video. There can be one or more target guidance images. The target text prompt is a textual description of the animated video clip to be generated and can cover scenes, events, characters, actions, art style, emotions, etc. For example, the target guidance image could be an image of a puppy, and the target text prompt could be "The puppy is spinning in circles."
[0129] Step S2: Determine the spatiotemporal guidance conditions based on the target guidance image.
[0130] Spatiotemporal guidance conditions can be determined based on the target guidance image. These conditions may include the position of the target guidance image in the animated video clip (a series of frames) to be generated, or the motion and non-motion regions of each frame.
[0131] Step S3: Based on the spatiotemporal guidance conditions and the guidance of the target text prompt, generate a series of frames that are associated with the target guidance image in time and space and match the target text prompt, so as to constitute the animation video segment.
[0132] For example, guided by spatiotemporal guidance conditions and target text prompts, animated video clips can be generated. An animated video clip may include a series of frames that are temporally and spatially associated with the target guiding image and match the target text prompts.
[0133] In this embodiment, under the guidance of spatiotemporal guidance conditions and text prompts, an animated video clip consisting of a series of frames that change in time and space relative to the target guidance image and match the target text can be generated. It is applicable to a variety of video generation tasks and has been optimized for animated video generation.
[0134] In an optional embodiment, the animated video clip can be generated using a video generation model. The video generation model includes a spatiotemporal masking module, an encoding module, and a transformer module. Specifically, the spatiotemporal masking module determines spatiotemporal guidance conditions based on the target guidance image. These conditions include a guidance sequence and a mask sequence. The guidance sequence specifies the position of the target guidance image within the animated video clip, and the mask sequence defines whether a series of frames are displayed within the animated video clip. The encoding module encodes the guidance sequence and mask sequence to obtain a guidance feature sequence and a mask encoding sequence. Based on the guidance feature sequence, the mask encoding sequence, noise, and the target text prompt, the transformer module generates the animated video clip.
[0135] In this embodiment, the spatiotemporal guidance conditions can be determined through the video generation model, and the generated animation video can be controlled based on the spatiotemporal guidance conditions to ensure that the result is controllable and improve the quality of video generation.
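The data flow between the three modules can be sketched as plain function composition. Everything here is an assumption for illustration: the function name `generate_clip`, the module signatures, and the use of one Gaussian scalar per frame as "noise" are stand-ins, not the model's actual interfaces or tensor shapes.

```python
import random
from typing import Any, Callable, List, Sequence, Tuple

MaskingModule = Callable[[Any, int], Tuple[List[Any], List[int]]]
EncodingModule = Callable[[List[Any], List[int]], Tuple[Sequence[Any], Sequence[Any]]]
TransformerModule = Callable[[Sequence[Any], Sequence[Any], List[float], str], List[Any]]


def generate_clip(
    image: Any,
    prompt: str,
    num_frames: int,
    masking_module: MaskingModule,
    encoding_module: EncodingModule,
    transformer_module: TransformerModule,
    seed: int = 0,
) -> List[Any]:
    """Wire the three modules together in the order the text describes."""
    # 1. Spatiotemporal masking module: guidance sequence + mask sequence.
    guide_seq, mask_seq = masking_module(image, num_frames)
    # 2. Encoding module: guidance feature sequence + mask encoding sequence.
    guide_feats, mask_codes = encoding_module(guide_seq, mask_seq)
    # 3. Noise input (one scalar per frame here, a deliberate simplification).
    rng = random.Random(seed)
    noise = [rng.gauss(0.0, 1.0) for _ in range(num_frames)]
    # 4. Transformer module: generate frames conditioned on all inputs.
    return transformer_module(guide_feats, mask_codes, noise, prompt)
```

Passing the modules in as callables keeps the sketch independent of any particular network architecture.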
[0136] In an optional embodiment, the video generation model is obtained through the following operations:
[0137] Step S100: Obtain a training dataset, which includes multiple target animation video clips, each of which has a corresponding text prompt. The target animation video clips are used to determine the guidance conditions.
[0138] The training dataset can include multiple sets of training data. Each set of training data includes a target animated video clip and its corresponding text prompt, i.e., a text-video pair. The target animated video clip can be collected and filtered from a large number of animated videos according to preset rules (such as whether resolution, frame rate, number of views, and interaction data meet requirements). The corresponding text prompt can include key information about the target animated video clip, such as its scene, events, characters, art style, and emotions. An example scheme is provided below.
[0139] In an optional embodiment, step S100 may include:
[0140] Step S200: Obtain the original video set, which includes multiple animated videos.
[0141] Step S202: Segment the multiple animated videos into scenes to obtain multiple animated video segments.
[0142] Step S204: Based on preset filtering rules, determine the multiple target animation video segments from the multiple animation video segments.
[0143] Step S206: Input the multiple target animation video clips into a pre-trained video annotation model to obtain multiple text prompts through the video annotation model, with each text prompt corresponding to a target animation video clip.
[0144] Step S208: Construct the training dataset based on the plurality of target animation video clips and the plurality of text prompts.
[0145] For example, as shown in Figure 2, an original video set can be obtained from large-scale, diverse, and high-quality popular animated films and television series (such as animated dramas, animated movies, UGC animations, etc.) and virtual animations (such as MMD, virtual idol videos, etc.). The original video set can include multiple animated videos (e.g., more than 1 million). Applying scene detection and segmentation (Shot Detection & Split) to the multiple animated videos yields multiple animated video segments. Based on preset filtering rules (such as whether the frame rate, resolution, duration, content, etc. meet the requirements), multiple target animated video segments (e.g., more than 10 million) can be obtained from the multiple animated video segments. Inputting the multiple target animated video segments into a pre-trained video annotation model (e.g., Qwen-VL2, a multimodal visual language model that supports image description, video description, etc.) yields a text prompt for each of the multiple target animated video segments. Based on the multiple target animated video segments and their corresponding text prompts, a large-scale and high-quality training dataset (text-video pairs) can be constructed. The data sources used to construct the training dataset in this application are legal and publicly available. All data has been anonymized and desensitized to ensure that no personal privacy or sensitive information is involved. The acquisition and use of the data are strictly compliant.
[0146] In this embodiment, scene detection, video segmentation, preset filtering rules, and video annotation are combined to generate multiple text-video pairs to construct a high-quality training dataset, thereby improving the training efficiency of the model and the ability to generate animated videos.
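The dataset construction flow (scene segmentation, rule-based filtering, annotation, pairing) can be sketched as a small pipeline. This is a minimal illustration only: `build_training_dataset` and the three injected callables (`split_scenes`, `passes_filter`, `annotate`) are hypothetical names that stand in for a shot detector, the preset filtering rules, and the video annotation model; none of them is taken from this application.

```python
from typing import Callable, Iterable, List, Tuple, TypeVar

Video = TypeVar("Video")
Clip = TypeVar("Clip")


def build_training_dataset(
    videos: Iterable[Video],
    split_scenes: Callable[[Video], Iterable[Clip]],  # hypothetical shot detector
    passes_filter: Callable[[Clip], bool],            # hypothetical filtering rules
    annotate: Callable[[Clip], str],                  # hypothetical annotation model
) -> List[Tuple[Clip, str]]:
    """Return (clip, text prompt) pairs for every clip that passes the filter."""
    pairs: List[Tuple[Clip, str]] = []
    for video in videos:
        # Segment each animated video into clips.
        for clip in split_scenes(video):
            # Keep only the clips that satisfy the preset filtering rules.
            if passes_filter(clip):
                # Pair each kept clip with its generated text prompt.
                pairs.append((clip, annotate(clip)))
    return pairs
```

Passing the three stages in as callables keeps the sketch independent of any particular detector or annotation model.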
[0147] In an optional embodiment, step S204 may include:
[0148] Step S300: Obtain the text coverage score, optical flow score, aesthetic score, and total frame count for each of the multiple animated video segments. The text coverage score represents the proportion of the text area relative to the frame in the animated video segment; the optical flow score represents the motion attribute of the animated video segment; and the aesthetic score represents the aesthetic quality of the animated video segment.
[0149] Step S302: The animated video segment whose text coverage score is lower than the corresponding preset threshold, whose optical flow score and aesthetic score are both higher than their respective preset thresholds, and whose total number of frames is within a preset range is determined as the target animated video segment.
[0150] For example, computer vision technology can be used to analyze animated video clips and calculate the corresponding text coverage scores, optical flow scores, aesthetic scores, and total frame counts. The text coverage score represents the proportion of text areas relative to the frame in the animated video clip. A higher text coverage score indicates a higher proportion of text in the frame and therefore a lower-quality animated video clip, such as end credits with numerous subtitles. The optical flow score represents the motion properties of the animated video clip. A lower optical flow score indicates less motion in the animated video clip and therefore a lower-quality clip, such as scenes with static images or fast flashbacks (with low saturation and blurring effects). The aesthetic score represents the aesthetic quality of the animated video clip. A lower aesthetic score indicates poorer aesthetic quality. As an example, animated video clips whose text coverage score is below the corresponding preset threshold, whose optical flow score and aesthetic score are above their respective preset thresholds, and whose total frame count is within a preset range can be identified as target animated video clips. Filtering based on total frame count can yield target animated video clips with durations within a preset range (e.g., 2s to 20s). Of course, the filtering rules can also be dynamically adjusted. For example, any one or more of these indicators can be used to select the target animated video clips from the multiple animated video clips.
[0151] In this embodiment, filtering rules are constructed based on four different dimensions to achieve accurate evaluation and effective screening of animated video clips, thereby obtaining high-quality target animated video clips that meet the standards.
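The four-dimensional filtering rule of steps S300 and S302 can be sketched as a single predicate. This is a minimal sketch under stated assumptions: the class names, field names, and all threshold values are hypothetical and chosen only for illustration, and the frame range assumes 24 fps clips of 2s to 20s.

```python
from dataclasses import dataclass


@dataclass
class ClipStats:
    text_coverage: float  # proportion of the frame covered by text
    optical_flow: float   # motion score; higher means more motion
    aesthetic: float      # aesthetic score; higher means better quality
    total_frames: int


@dataclass
class FilterRule:
    # All thresholds below are hypothetical illustration values.
    max_text_coverage: float = 0.05
    min_optical_flow: float = 1.0
    min_aesthetic: float = 4.5
    min_frames: int = 48    # 2s at an assumed 24 fps
    max_frames: int = 480   # 20s at an assumed 24 fps


def is_target_clip(stats: ClipStats, rule: FilterRule) -> bool:
    """A clip is kept only if it passes all four conditions."""
    return (
        stats.text_coverage < rule.max_text_coverage
        and stats.optical_flow > rule.min_optical_flow
        and stats.aesthetic > rule.min_aesthetic
        and rule.min_frames <= stats.total_frames <= rule.max_frames
    )
```

Dynamically adjusting the rules, as the description allows, would correspond to relaxing or dropping individual conditions in `is_target_clip`.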
[0152] Target animation video clips can be used to determine guiding conditions. These guiding conditions can include guiding frames (one or more keyframes), motion regions (regions in a video frame that exhibit significant changes), non-motion regions (regions in a video frame that do not exhibit significant changes), artistic style, etc. Guiding conditions are used in the model training process as a guide to help the model learn how to generate expected outputs. The model training process will be illustrated below with several examples.
[0153] Step S102: Based on the multiple target animation video clips and corresponding text prompts and guidance conditions, supervised fine-tuning is performed on the pre-trained base model to obtain the target video generation model, which is used to generate animation videos.
[0154] The base model can be a video generation model (e.g., a third-party general-purpose large model) pre-trained on a large-scale dataset (such as over 35 million video clips). The video clips can be natural videos (real-world footage) or animated videos. Starting from the base model, full-parameter supervised fine-tuning (SFT) can be performed using the training dataset to adapt the model specifically to animated video generation tasks. An exemplary scheme is provided below.
[0155] In an optional embodiment, the base model can be a text-to-video diffusion model based on DiT (Diffusion Transformer), as shown in Figure 3. The base model may include a spatiotemporal masking module, an encoding module, and a transformer module. Based on these three modules, the base model can generate videos aligned with text prompts. Exemplarily, step S102 may include:
[0156] Step S400: The frame sequence is input to the spatiotemporal masking module to determine a guide frame from the plurality of frames through the spatiotemporal masking module, and a guide sequence and a mask sequence are determined based on the guide frame; wherein, the guide frame is used to guide the transformer module to generate a series of frames associated with the guide frame.
[0157] Step S402: Input the guide sequence and the mask sequence into the encoding module to obtain the guide feature sequence and the mask encoding sequence through the encoding module.
[0158] Step S404: Based on the guiding feature sequence, the mask encoding sequence, noise, and the corresponding text prompts, the transformer module is subjected to supervised fine-tuning to obtain the target video generation model.
[0159] The target animation video clip consists of a frame sequence composed of multiple frames. The following example, using one set of training data, illustrates the model training process. Specifically, the frame sequence can be input into a spatiotemporal masking module, which can determine a guiding frame from multiple frames through uniform or random sampling. The guiding frame can include one or more frames in the frame sequence, such as the first frame, the last frame, and other frames. The guiding frame can be used to guide the transformer module in generating the video. Based on the guiding frame, a guiding sequence and a mask sequence can be determined. The guiding sequence can be used to specify the position of the guiding frame, and the mask sequence can be used to control the guiding effect on the generated video. As shown in Figure 3, the guiding sequence includes at least one frame, and the mask sequence can be a binary sequence or an equivalent sequence.
[0160] The guide sequence and mask sequence are input into the encoding module, which may include a 3D Causal VAE and a reprojection network. The 3D Causal VAE performs VAE encoding on the guide sequence, obtaining the guide feature sequence G = {G1, G2, ..., Gpi, ..., Gn}, where Gpi corresponds to the encoded guide frame, and Gj = 0 at other positions, j ≠ pi. The mask sequence can be represented as M = {M1, M2, ..., Mpi, ..., Mn}. The reprojection network encodes the mask sequence, obtaining the mask-encoded sequence Reproj(M).
[0161] The guiding feature sequence G, the mask encoding sequence Reproj(M), the noise $\epsilon_t$, and the corresponding text prompt (T5) are concatenated along the channel dimension to obtain:
[0162] $X = \mathrm{Concat}(G,\ \mathrm{Reproj}(M),\ \epsilon_t,\ T5)$;
[0163] By using X as input to the transformer module, supervised fine-tuning of the transformer module can be performed, ultimately yielding the target video generation model.
[0164] In this embodiment, a guiding frame is selected from multiple frames and added to the model training process as a guiding condition. This allows the model to learn to generate animated videos that conform to text prompts under the guidance of the guiding condition, thereby achieving keyframe interpolation, i.e., generating videos under the condition of one or more arbitrary frames.
[0165] In an optional embodiment, step S400 may include:
[0166] Step S500: Place the guide frame at the target position of the guide sequence, where the other positions in the guide sequence correspond to a series of frames generated by the transformer module that are associated with the guide frame.
[0167] Step S502: Determine a first mask sequence based on the guide sequence. The first mask sequence has the same length as the guide sequence and includes a demask and multiple masking masks, where the demask corresponds to the guide frame.
[0168] For example, after determining the guide frame, the guide frame can be placed at the target position of the guide sequence. The other positions of the guide sequence correspond to a series of frames associated with the guide frame generated by the transformer module. That is, the other positions of the guide sequence can be null values or masked frames, which await frame insertion. After the transformer module generates a series of frames, these frames are arranged in order according to the instructions of the guide sequence, thus forming an animated video, as shown in Figure 3. The corresponding first mask sequence can be determined based on the guide sequence. The first mask sequence has the same length as the guide sequence. The first mask sequence includes a demask and multiple masking masks, wherein the demask (value 1) corresponds to the position of the guide frame, and the masking mask (value 0) corresponds to the positions of multiple masked frames. The demask and masking masks have the same size as the guide frame. The first mask sequence can be represented as: M={M1, M2, ..., Mpi, ..., Mn}, where Mpi=1, Mj=0, j≠pi. The first mask sequence can be used to represent the guide frame.
[0169] In this embodiment, the guiding frame can be effectively located based on the guiding sequence and the first mask sequence, thereby improving the model's video generation capability under guiding conditions.
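The construction of the guide sequence and the first mask sequence (steps S500 and S502) can be sketched as follows. This is a simplified sketch under stated assumptions: the function name `build_guide_and_mask` is hypothetical, empty positions are represented by `None`, and each mask entry is a single 0 or 1 per frame, whereas the description uses masks of the same size as the guide frame.

```python
from typing import Iterable, List, Optional, Sequence, Tuple, TypeVar

Frame = TypeVar("Frame")


def build_guide_and_mask(
    frames: Sequence[Frame], guide_positions: Iterable[int]
) -> Tuple[List[Optional[Frame]], List[int]]:
    """Keep the guide frames in place and mark them with 1 in the mask sequence."""
    n = len(frames)
    positions = set(guide_positions)
    if any(p < 0 or p >= n for p in positions):
        raise ValueError("guide position out of range")
    # Guide sequence: guide frames at their target positions, None elsewhere
    # (the positions that the transformer module is expected to fill in).
    guide_sequence = [frames[i] if i in positions else None for i in range(n)]
    # First mask sequence: 1 (demask) at guide positions, 0 (masking mask) elsewhere.
    mask_sequence = [1 if i in positions else 0 for i in range(n)]
    return guide_sequence, mask_sequence
```

Choosing the first and last positions as guide positions corresponds to keyframe interpolation between two given frames.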
[0170] In an optional embodiment, step S400 may include:
[0171] Step S600: Place the guide frame at the target position of the guide sequence, where the other positions of the guide sequence correspond to a series of frames generated by the transformer module and associated with the guide frame.
[0172] Step S602: Obtain a target mask based on the frame sequence. The target mask includes a masked region and an unmasked region. The target mask is used to control the motion region of the series of frames. The masked region corresponds to the non-motion region, and the unmasked region corresponds to the motion region.
[0173] Step S604: Determine a second mask sequence based on the guiding sequence and the target mask. The second mask sequence has the same length as the guiding sequence and includes multiple target masks.
[0174] For example, after determining the guide frame, the guide frame can be placed at the target position of the guide sequence to obtain the guide sequence. A target mask MF can be obtained based on the frame sequence, and the target mask has the same size as the guide frame. The target mask can be used to control the motion region of the generated video. The target mask can include masked regions (value 0) and unmasked regions (value 1), where the masked regions correspond to non-motion regions and the unmasked regions correspond to motion regions. Accordingly, the second mask sequence can be represented as {MF, MF, ..., MF}, and the second mask sequence includes multiple target masks MF, with the same length as the guide sequence.
[0175] In this embodiment, the target mask can effectively control the moving and non-moving regions in video generation, thereby enhancing the dynamic consistency and motion control accuracy of the generated video and realizing local image-guided animation.
[0176] In an optional embodiment, step S602 may include:
[0177] Step S700: Perform foreground detection on the first frame of the frame sequence to determine the foreground region of the first frame.
[0178] Step S702: Based on the foreground region of the first frame, generate a foreground mask corresponding to each of the plurality of frames. The foreground mask is used to represent the motion region and non-motion region of the corresponding frame.
[0179] Step S704: Generate the target mask based on the foreground mask corresponding to each of the multiple frames.
[0180] For example, a foreground detector can be used to detect the foreground region in the first frame of a frame sequence. The foreground region is the area where a character or main object in the frame image is located. This foreground region is tracked in subsequent frames, generating a corresponding foreground mask for each frame. The foreground mask can be used to represent the motion and non-motion regions of the corresponding frame. Combining the foreground masks of multiple frames creates a unified mask, known as the target mask (MF). The target mask (MF) represents the union of all foreground regions in the frame sequence.
[0181] In this embodiment, foreground detection and tracking effectively extract moving and non-moving regions in video frames to generate a high-precision target mask, thereby improving the dynamic accuracy and regional consistency in the generated video.
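The target mask MF, described as the union of the per-frame foreground masks, and the second mask sequence {MF, MF, ..., MF} can be sketched as below. This is a minimal sketch: the function names are hypothetical, masks are plain nested lists of 0 and 1 with identical sizes, and foreground detection and tracking are assumed to have already produced the per-frame masks.

```python
from typing import List, Sequence

Mask = List[List[int]]  # H x W binary mask: 1 = motion region, 0 = non-motion region


def union_masks(foreground_masks: Sequence[Mask]) -> Mask:
    """Combine per-frame foreground masks into one target mask (pixel-wise OR)."""
    if not foreground_masks:
        raise ValueError("at least one foreground mask is required")
    height = len(foreground_masks[0])
    width = len(foreground_masks[0][0])
    target = [[0] * width for _ in range(height)]
    for mask in foreground_masks:
        for y in range(height):
            for x in range(width):
                if mask[y][x]:
                    target[y][x] = 1
    return target


def second_mask_sequence(target_mask: Mask, length: int) -> List[Mask]:
    """Repeat the target mask so the sequence matches the guide sequence length."""
    return [target_mask for _ in range(length)]
```

Because every position holds the same mask, the unmasked (motion) region stays fixed across all generated frames.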
[0182] In some embodiments, the non-motion regions of the guide frame can be used as video latent features, and the video latent feature representation z0 is obtained through VAE encoding. This video latent feature z0 can be used as input to the transformer module to ensure that static regions (non-motion regions) in the generated video follow the guidance of the guide frame.
[0183] In an optional embodiment, step S102 may include:
[0184] Step S800: Determine the resolution and frame rate corresponding to each of the plurality of target animation video segments.
[0185] Step S802: Based on the resolution and frame rate corresponding to each of the multiple target animation video segments, the training dataset is divided into multiple subsets, and at least one of the resolution and frame rate of each subset is different from the other subsets.
[0186] Step S804: Determine the training priority of each subset based on the resolution and frame rate corresponding to each subset.
[0187] Step S806: Based on the training priority of each subset, different subsets are used sequentially to perform supervised fine-tuning of the base model to obtain the target video generation model; wherein, the lower the resolution and / or frame rate of the subset, the higher the corresponding training priority.
[0188] For example, the training dataset can be divided into multiple subsets based on the resolution (e.g., 480p, 720p, 1080p, etc.) and frame rate (e.g., 8fps, 16fps, 24fps, etc.) of the target animation video clips. Each subset differs from every other subset in at least one of resolution and frame rate, for example: subset A (480p, 8fps), subset B (480p, 16fps), subset C (720p, 16fps), etc. Based on the resolution and frame rate of each subset, the training priority of each subset is determined. For example, the lower the resolution and / or frame rate of a subset, the higher its training priority, such as training priority: subset A > subset B > subset C. Based on the training priority of each subset, supervised fine-tuning of the base model is performed sequentially using the different subsets to obtain the target video generation model. For example, the model is first trained on subset A (480p, 8fps) for several epochs (e.g., 3 epochs) to enable it to capture basic spatiotemporal dynamics at lower frame rates. Then, the model is trained on subset B (480p, 16fps) for an additional number of epochs (e.g., 1.9 epochs) to refine temporal consistency and adapt to higher frame rates. Subsequently, the model can be fine-tuned on subset C (720p, 16fps) for a number of epochs (e.g., 2.3 epochs) to generate high-resolution, temporally coherent video output using the previously learned features.
[0189] In this embodiment, a training strategy from weak to strong is adopted to gradually improve the model's learning ability at different resolutions and frame rates, thereby improving model performance.
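The weak-to-strong schedule of steps S800 to S806 can be sketched as a grouping step followed by a sort. This is a minimal sketch under stated assumptions: the dictionary keys `height` and `fps` and both function names are hypothetical, and ties are broken by comparing resolution first and frame rate second, which the description does not specify.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

SubsetKey = Tuple[int, int]  # (vertical resolution, frames per second)


def group_by_resolution_and_fps(clips: List[dict]) -> Dict[SubsetKey, List[dict]]:
    """Split the dataset so that each subset has one (resolution, frame rate) pair."""
    subsets: Dict[SubsetKey, List[dict]] = defaultdict(list)
    for clip in clips:
        subsets[(clip["height"], clip["fps"])].append(clip)
    return dict(subsets)


def training_order(subsets: Dict[SubsetKey, List[dict]]) -> List[SubsetKey]:
    """Lower resolution and lower frame rate come first (higher training priority)."""
    # Assumption: tuples are compared lexicographically, so resolution is
    # compared before frame rate.
    return sorted(subsets)
```

With subsets A (480p, 8fps), B (480p, 16fps), and C (720p, 16fps), this ordering reproduces the A, B, C sequence of the example.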
[0190] In some embodiments, the training dataset can be adjusted or more rigorously selected based on the performance of the target video generation model. For example, the proportion of dialogue videos or videos with large motion amplitudes in the training dataset can be increased to obtain an ultra-high-quality dataset (such as 1 million) for fine-tuning in the final stage, which can significantly improve the quality of high-resolution videos.
[0191] In some embodiments, animated video clips without subtitles or watermarks can be used as a dataset to fine-tune the model. For example, the dataset could consist of 790,000 animated video clips that either have been cropped to remove subtitles or have no subtitles at all. After full-parameter supervised fine-tuning on this dataset for 5.5k iterations, the target video generation model can effectively avoid generating subtitles and watermarks, resulting in an overall performance improvement.
[0192] In some embodiments, the target animation video clips in the training dataset are of varying lengths. Specifically, target animation video clips ranging from 2 seconds to 8 seconds can be used for model training to maximize data utilization. This variable-length video hybrid training strategy allows the model to generate videos of flexible lengths.
[0193] In some embodiments, image generation can be incorporated into a multi-task training framework. That is, in addition to the training dataset, multiple images with diverse artistic styles are added for model training to improve the model's generalization ability across diverse artistic styles, which can effectively reduce the gap in video generation quality caused by differences in guide frame style.
[0194] The above embodiments exemplify the training process of the model. The following embodiments will exemplify the evaluation process of the model to better construct video generation models.
[0195] In an optional embodiment, the animated video generation method may further include:
[0196] Step S900: Obtain a benchmark dataset, which includes multiple benchmark animated video clips, each with a corresponding text prompt.
[0197] Step S902: Obtain the target video based on the target video generation model, the benchmark animation video clip, and the corresponding text prompts.
[0198] Step S904: Determine the visual quality and matching degree of the target video.
[0199] Step S906: Determine the benchmark test results based on the visual quality and the matching degree. The benchmark test results are used to evaluate the performance of the target video generation model.
[0200] For example, the benchmark dataset may include multiple benchmark animated video clips (e.g., 948, including 857 2D benchmark animated video clips and 91 3D benchmark animated video clips). These benchmark animated video clips have different action labels, such as: speaking, walking & running, eating, etc. There can be more than 100 action labels, and each action label can correspond to 10-30 benchmark animated video clips. The text prompts corresponding to the benchmark animated video clips can also be obtained through a video annotation model.
[0201] By inputting the text prompts into the trained target video generation model, the target video can be obtained. The benchmark test results are determined from the visual quality (e.g., visual smoothness, motion attributes, visual appeal) and the matching degree (text-video, image-video, character consistency) of the target video, and can be used to evaluate the performance of the target video generation model.
[0202] In this embodiment, a comprehensive benchmark dataset was constructed and a brand-new metric was developed to accurately evaluate the model's performance.
[0203] The metrics will be illustrated below through several examples.
[0204] In an optional embodiment, visual quality may include visual smoothness, which represents the coherence between the plurality of frames. Correspondingly, step S904 may include:
[0205] Step S1000: Obtain the visual features of the multiple frames.
[0206] Step S1002: Based on the visual features of the multiple frames, the similarity between adjacent frames in the multiple frames is obtained sequentially.
[0207] Step S1004: Based on the similarity between two adjacent frames in the plurality of frames, a first average similarity is determined, and the first average similarity is used to represent the visual smoothness.
[0208] For example, the visual features of multiple frames can be obtained through the CLIP model, and the similarity (such as cosine similarity) between adjacent frames can be calculated sequentially. Based on the similarity between adjacent frames in the multiple frames, a first average similarity can be obtained, which can reflect visual smoothness.
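[0209] The calculation method can be expressed as follows:
[0210] $S_{smooth} = \frac{1}{N-1}\sum_{i=1}^{N-1}\mathrm{Cos}\left(\mathrm{CLIP}(I_i),\ \mathrm{CLIP}(I_{i+1})\right)$;
[0211] Where $S_{smooth}$ represents the visual smoothness, $I_i$ represents the i-th frame, N represents the total number of frames, CLIP represents the feature extractor, and Cos represents the cosine similarity function.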
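The visual smoothness computation of steps S1000 to S1004 (average cosine similarity between adjacent frames) can be sketched in plain Python. This is a minimal sketch: the per-frame feature vectors are assumed to have been extracted beforehand (for example by a CLIP image encoder, which is not invoked here), and the function names are hypothetical.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two feature vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("zero-norm feature vector")
    return dot / (norm_a * norm_b)


def visual_smoothness(frame_features: List[Sequence[float]]) -> float:
    """Mean cosine similarity over the N - 1 pairs of adjacent frames."""
    if len(frame_features) < 2:
        raise ValueError("at least two frames are required")
    similarities = [
        cosine_similarity(frame_features[i], frame_features[i + 1])
        for i in range(len(frame_features) - 1)
    ]
    return sum(similarities) / len(similarities)
```

A video whose adjacent frames have nearly identical features scores close to 1, while abrupt changes between frames pull the average down.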
[0212] In an optional embodiment, visual quality may include motion attributes, which may be the range of motion of main elements (characters, objects, etc.) in the target video. Correspondingly, step S904 may include:
[0213] Step S1100: Input the corresponding text prompt into the pre-trained motion scoring model to obtain the first motion score through the motion scoring model.
[0214] Step S1102: Input the target video into the motion scoring model to obtain a second motion score.
[0215] Step S1104: Determine the similarity between the first motion score and the second motion score, wherein the similarity is used to represent the motion attribute.
[0216] For example, 10 million animated video clips and their corresponding motion descriptions can be collected, and the motion descriptions can be divided into 6 levels (from static to salient motion) to fine-tune the base model (such as CLIP), finally resulting in a motion scoring model. The motion scoring model is used to score text prompts and target videos, yielding a first motion score and a second motion score. The similarity between the first and second motion scores is calculated; this similarity can represent the motion attribute.
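[0217] The calculation formula can be expressed as follows:
[0218] $S_{motion} = \mathrm{Cos}\left(\mathrm{MCLIP}(V),\ \mathrm{MCLIP}(T_m)\right)$;
[0219] Where $S_{motion}$ represents the motion attribute, MCLIP represents the motion scoring model, V represents the target video, $T_m$ represents the motion-related text fragments in the text prompts, and Cos represents the cosine similarity function.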
[0220] In an optional embodiment, visual quality may include visual attractiveness. Correspondingly, step S904 may include:
[0221] Step S1200: Determine multiple keyframes from the multiple frames.
[0222] Step S1202: Obtain the aesthetic scores of the multiple keyframes to determine the average aesthetic score, which is used to represent the visual appeal.
[0223] For example, keyframes from multiple frames can be collected using a keyframe extraction method, and their aesthetic scores can be calculated. The average aesthetic score is then obtained to represent visual appeal.
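[0224] The calculation formula can be expressed as follows:
[0225] $S_{aes} = \frac{1}{K}\sum_{i=1}^{K}\mathrm{Aes}\left(I_i\right),\quad I_i \in \mathrm{KeyFrm}(I_1, I_2, \ldots, I_N)$;
[0226] Here, $S_{aes}$ represents the visual appeal, KeyFrm represents the keyframe extraction method, Aes represents the aesthetic evaluation method, and K represents the number of keyframes.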
[0227] In an optional embodiment, the matching degree may include the matching degree between the corresponding text prompt and the target video. Correspondingly, step S904 may include:
[0228] Step S1300: Input the corresponding text prompt and the multiple frames into a multimodal pre-trained model to obtain multiple matching scores, where each matching score represents the degree of matching between the corresponding frame and the text prompt.
[0229] Step S1302: Determine an average matching score based on the multiple matching scores, whereby the average matching score represents the degree of matching between the corresponding text prompt and the target video.
[0230] For example, a matching score between the text prompt and each frame can be determined using a multimodal pre-trained model to characterize the matching degree. An average matching score can be determined based on multiple matching scores to represent the matching degree between the text and the video.
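[0231] The calculation formula can be expressed as follows:
[0232] $S_{t2v} = \frac{1}{N}\sum_{i=1}^{N}\mathrm{Match}\left(I_i,\ T\right)$;
[0233] Where $S_{t2v}$ represents the average matching score between the text prompt and the target video, Match represents the matching score output by the multimodal pre-trained model for a frame and a text, $I_i$ represents the i-th frame, N represents the total number of frames, and T represents the text prompt.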
[0234] In an optional embodiment, the matching degree may include the matching degree between the guide frame and the target video.
[0235] Correspondingly, step S904 includes:
[0236] Step S1400: Obtain the target text from the corresponding text prompt, the target text being used to specify the image style of the guide frame.
[0237] Step S1402: Input the guiding frame and the target text into the multimodal pre-trained model to obtain the first matching score.
[0238] Step S1404: Input the multiple frames and the target text into the multimodal pre-trained model to obtain multiple second matching scores, each second matching score corresponding to one frame.
[0239] Step S1406: Based on the similarity between the first matching score and the plurality of second matching scores, a second average similarity is determined, wherein the second average similarity is used to represent the similarity between the guide frame and the target video.
[0240] For example, target text is extracted from text prompts; the target text is a text fragment used to specify the style of the guide frame image. A first matching score between the guide frame and the target text, and a second matching score between each frame and the target text, can be determined using a multimodal pre-trained model. A second average similarity can be determined based on the similarity between the first matching score and each of the second matching scores. The second average similarity can represent the degree of matching between the guide frame and the video.
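[0241] The calculation formula can be expressed as follows:
[0242] $S_{i2v} = \frac{1}{N}\sum_{i=1}^{N}\mathrm{Sim}\left(\mathrm{Match}(I_p,\ T_s),\ \mathrm{Match}(I_i,\ T_s)\right)$;
[0243] Where $S_{i2v}$ represents the second average similarity, Sim represents the similarity function, Match represents the matching score output by the multimodal pre-trained model, $I_p$ represents the guide frame, $I_i$ represents the i-th frame, N represents the total number of frames, and $T_s$ represents the target text.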
[0244] In an optional embodiment, the target video includes a target character, the benchmark animation video clip includes a benchmark character, and the matching degree may include the matching degree between the target character and the benchmark character.
[0245] Correspondingly, step S904 may include:
[0246] Step S1500: Obtain the features of the benchmark character based on the benchmark animation video clip.
[0247] Step S1502: Obtain multiple sample frames from the multiple frames.
[0248] Step S1504: Input the multiple sample frames into the pre-trained character detection model to obtain the features of the multiple target characters.
[0249] Step S1506: Based on the similarity between the features of the baseline character and the features of multiple target characters, determine the character similarity, which is used to represent the matching degree between the target character and the baseline character.
[0250] For example, GroundingDino and SAM can be applied to detect, segment, and identify each frame (or multiple sample frames) of the target video to extract the mask of the target character in each frame (or multiple sample frames). The mask of the target character is input into the character detection model to obtain the features of multiple target characters. The similarity between the features of multiple target characters and the features of a pre-stored benchmark character (obtained from a benchmark animation video clip) is calculated to obtain the character similarity. The character similarity can represent the character consistency, that is, the degree of matching between the target character and the benchmark character.
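[0251] The calculation formula can be expressed as follows:
[0252] $S_{char} = \frac{1}{S}\sum_{i=1}^{S}\mathrm{Sim}\left(\mathrm{Feat}(M_i),\ fea_c\right)$;
[0253] Where $S_{char}$ represents the character similarity, S represents the number of sample frames, $M_i$ represents the mask obtained from GroundingDino and SAM for the i-th sample frame, Feat represents the feature extracted by the character detection model, Sim represents the similarity function, and $fea_c$ represents the feature of the benchmark character.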
[0254] In some embodiments, in addition to the quantitative analysis described above, the model can also be evaluated by humans. For example, 20 volunteers can be asked to give scores based on multiple indicators mentioned above (such as visual smoothness, motion properties, etc.).
[0255] Figure 4 shows the benchmark test results obtained by evaluating the trained target video generation model according to the evaluation benchmark provided in the embodiments of this application. It can be seen that the human evaluation results and the benchmark test results are highly correlated, indicating that the evaluation benchmark provided in the embodiments of this application is feasible and effective, and can be used for standardized testing. Furthermore, single-frame guidance can achieve better video generation results, and adding more guidance frames can further improve character consistency and motion stability, producing animated videos with a larger range of motion and more realistic animation. As shown in Figures 5 and 6, motion region control of the generated video can effectively control the movable and immovable areas (characters and background) of the video, improving the video animation effect and ensuring alignment with various storylines. Even without motion control, the video generation model provided in the embodiments of this application still shows a certain degree of control. Here, AnimateAnything is a controllable video generation technology.
[0256] Example 2
[0257] Figure 7 schematically illustrates a block diagram of an animation video generation apparatus according to Embodiment 2 of this application. This apparatus can be divided into one or more program modules. One or more program modules are stored in a storage medium and executed by one or more processors to complete the embodiment of this application. The program module referred to in this embodiment is a series of computer program instruction segments capable of performing a specific function. The following description will specifically introduce the functions of each program module in this embodiment. As shown in Figure 7, the apparatus 1000 may include: an acquisition module 1100, a determination module 1200, and a generation module 1300, wherein:
[0258] The acquisition module 1100 is used to acquire the target guidance image and the target text prompt;
[0259] The determination module 1200 is used to determine the spatiotemporal guidance conditions based on the target guidance image;
[0260] The generation module 1300 is used to generate a series of frames that are associated with the target guidance image in time and space and match the target text prompt, based on the spatiotemporal guidance conditions and the guidance of the target text prompt, so as to constitute the animation video segment.
[0261] As an optional embodiment, the method is implemented using a pre-trained video generation model, which includes: a spatiotemporal masking module, an encoding module, and a transformer module, wherein:
[0262] The spatiotemporal guidance conditions are determined by the spatiotemporal masking module based on the target guidance image. The spatiotemporal guidance conditions include a guidance sequence and a mask sequence. The guidance sequence is used to specify the position of the target guidance image in the animation video segment, and the mask sequence is used to define whether the series of frames are displayed in the animation video segment.
[0263] The guiding sequence and the mask sequence are encoded by the encoding module to obtain the guiding feature sequence and the mask encoding sequence;
[0264] The animated video clip is generated by the transformer module based on the guiding feature sequence, the mask encoding sequence, noise, and the target text prompt.
[0265] As an optional embodiment, the video generation model is obtained through the following operations:
[0266] Obtain a training dataset, which includes multiple target animation video clips, each of which has a corresponding text prompt. The target animation video clips are used to determine the guidance conditions.
[0267] Based on the multiple target animated video clips and corresponding text prompts and guidance conditions, the pre-trained base model is subjected to supervised fine-tuning to obtain the target video generation model, which is used to generate animated videos.
[0268] As an optional embodiment, obtaining the training dataset includes:
[0269] Obtain the original video set, which includes multiple animated videos;
[0270] The multiple animated videos are segmented into scenes to obtain multiple video clips;
[0271] Based on preset filtering rules, the multiple target animation video segments are determined from the multiple video segments;
[0272] The multiple target animation video clips are input into a pre-trained video annotation model to obtain multiple text prompts through the video annotation model, with each text prompt corresponding to a target animation video clip;
[0273] The training dataset is constructed based on the multiple target animated video clips and the multiple text prompts.
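The dataset construction steps above can be sketched as follows. This is a minimal illustration, not the applicant's implementation: the names `build_training_dataset`, `split_scenes`, `keep_clip`, and `annotate` are hypothetical, and the scene segmentation, filtering rules, and video annotation model are passed in as placeholder callables because the specification does not fix them here.

```python
from typing import Callable, List, NamedTuple


class TrainingSample(NamedTuple):
    clip: str    # identifier of a target animation video clip
    prompt: str  # text prompt produced by the annotation model


def build_training_dataset(
    videos: List[str],
    split_scenes: Callable[[str], List[str]],  # hypothetical scene segmenter
    keep_clip: Callable[[str], bool],          # hypothetical preset filtering rule
    annotate: Callable[[str], str],            # hypothetical video annotation model
) -> List[TrainingSample]:
    """Segment each video into clips, filter them, and pair each kept clip with a prompt."""
    samples: List[TrainingSample] = []
    for video in videos:
        for clip in split_scenes(video):
            if keep_clip(clip):
                samples.append(TrainingSample(clip, annotate(clip)))
    return samples


# Toy stand-ins so the sketch runs end to end.
dataset = build_training_dataset(
    videos=["ep1", "ep2"],
    split_scenes=lambda video: [f"{video}_scene{i}" for i in range(2)],
    keep_clip=lambda clip: not clip.endswith("scene0"),
    annotate=lambda clip: f"caption for {clip}",
)
```

Passing the three stages in as callables keeps the sketch independent of any particular segmenter, filter, or annotation model.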
[0274] As an optional embodiment, the base model includes: a spatiotemporal mask module, an encoding module, and a transformer module, and the target animation video segment includes a frame sequence composed of multiple frames;
[0275] Correspondingly, based on the multiple target animation video clips and corresponding text prompts and guidance conditions, the pre-trained base model is subjected to supervised fine-tuning to obtain the target video generation model, including:
[0276] The frame sequence is input to the spatiotemporal masking module to determine a guide frame from the plurality of frames, and a guide sequence and a mask sequence are determined based on the guide frame; wherein, the guide frame is used to guide the transformer module to generate a series of frames associated with the guide frame;
[0277] The guiding sequence and the mask sequence are input into the encoding module to obtain the guiding feature sequence and the mask encoding sequence through the encoding module;
[0278] Based on the guiding feature sequence, the mask encoding sequence, noise, and the corresponding text prompts, the transformer module is subjected to supervised fine-tuning to obtain the target video generation model.
[0279] As an optional embodiment, determining the guide sequence and mask sequence based on the guide frame includes:
[0280] The guide frame is placed at the target position in the guide sequence, and the other positions in the guide sequence correspond to a series of frames generated by the transformer module that are associated with the guide frame;
[0281] A first mask sequence is determined based on the guide sequence, the first mask sequence including a demasking mask, the demasking mask corresponding to the guide frame.
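As a rough illustration of this embodiment, the sketch below places a guide frame at a target position and derives a per-position mask from the resulting sequence. The function name `build_guide_and_mask`, the use of `None` for positions to be generated, and the encoding 1 = demasking mask, 0 = frame to be generated are all assumptions made for the example; the specification does not prescribe a concrete representation.

```python
from typing import List, Optional, Tuple

Frame = str  # stand-in for an image tensor; hypothetical placeholder type


def build_guide_and_mask(
    guide_frame: Frame, num_frames: int, target_position: int
) -> Tuple[List[Optional[Frame]], List[int]]:
    """Place the guide frame at target_position and mark that position as unmasked."""
    if not 0 <= target_position < num_frames:
        raise ValueError("target_position must lie inside the sequence")
    # All other positions stay empty; they correspond to frames the model generates.
    guide_sequence: List[Optional[Frame]] = [None] * num_frames
    guide_sequence[target_position] = guide_frame
    # Assumed convention: 1 = demasking mask (frame is given), 0 = to be generated.
    mask_sequence = [1 if frame is not None else 0 for frame in guide_sequence]
    return guide_sequence, mask_sequence


guide_seq, mask_seq = build_guide_and_mask("guide.png", num_frames=5, target_position=0)
```

Changing `target_position` to the last or a middle index would, under the same assumptions, place the guide frame at the end or in the interior of the sequence.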
[0282] As an optional embodiment, determining the guide sequence and mask sequence based on the guide frame includes:
[0283] The guide frame is placed at the target position in the guide sequence, and the other positions in the guide sequence correspond to a series of frames generated by the transformer module that are associated with the guide frame;
[0284] A target mask is obtained based on the frame sequence. The target mask includes a masked region and an unmasked region. The target mask is used to control the motion region of the series of frames. The masked region corresponds to the non-motion region, and the unmasked region corresponds to the motion region.
[0285] A second mask sequence is determined based on the guiding sequence and the target mask. The second mask sequence has the same length as the guiding sequence and includes multiple target masks.
[0286] As an optional embodiment, obtaining the target mask based on the frame sequence includes:
[0287] Foreground detection is performed on the first frame of the frame sequence to determine the foreground region of the first frame;
[0288] Based on the foreground region of the first frame, a foreground mask corresponding to each of the plurality of frames is generated, and the foreground mask is used to represent the motion region and non-motion region of the corresponding frame;
[0289] The target mask is generated based on the foreground mask corresponding to each of the multiple frames.
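The following sketch shows one possible way to derive a target mask from per-frame foreground masks and to expand it into a mask sequence of the same length as the guide sequence. Both the pixel-wise union used to combine the foreground masks and the repetition of a single target mask across all positions are assumptions chosen for illustration, as are the names `union_masks` and `second_mask_sequence`; the specification states only that the target mask is generated from the foreground masks.

```python
from typing import List

Mask = List[List[int]]  # assumed: 1 = unmasked (motion region), 0 = masked (non-motion region)


def union_masks(foreground_masks: List[Mask]) -> Mask:
    """Pixel-wise OR of per-frame foreground masks (an assumed combination rule)."""
    height, width = len(foreground_masks[0]), len(foreground_masks[0][0])
    return [
        [int(any(mask[y][x] for mask in foreground_masks)) for x in range(width)]
        for y in range(height)
    ]


def second_mask_sequence(target_mask: Mask, sequence_length: int) -> List[Mask]:
    """Repeat the target mask so the mask sequence matches the guide sequence length."""
    return [[row[:] for row in target_mask] for _ in range(sequence_length)]


# Two tiny 2x3 foreground masks standing in for real per-frame detections.
foreground_masks = [
    [[0, 1, 0], [0, 0, 0]],
    [[0, 1, 1], [0, 0, 0]],
]
target_mask = union_masks(foreground_masks)
mask_sequence = second_mask_sequence(target_mask, sequence_length=4)
```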
[0290] As an optional embodiment, based on the multiple target animation video clips and corresponding text prompts and guidance conditions, a pre-trained base model is subjected to supervised fine-tuning to obtain a target video generation model, including:
[0291] Determine the resolution and frame rate of each of the multiple target animation video segments;
[0292] Based on the resolution and frame rate of each of the multiple target animation video segments, the training dataset is divided into multiple subsets, and at least one of the resolution and frame rate of each subset is different from the other subsets;
[0293] The training priority for each subset is determined based on the resolution and frame rate corresponding to each subset.
[0294] Based on the training priority of each subset, the base model is sequentially fine-tuned using different subsets to obtain the target video generation model.
[0295] Wherein, the lower the resolution and/or frame rate of a subset, the higher its training priority.
[0296] As an optional embodiment, the apparatus 1000 is further used for:
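The subset ordering described above can be sketched as a simple grouping and sort. The `Clip` type and the function `ordered_subsets` are hypothetical, and ranking by pixel count first and frame rate second is an assumed tie-break; the specification requires only that lower resolution and/or frame rate receive higher priority.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple, Tuple


class Clip(NamedTuple):
    name: str
    width: int
    height: int
    fps: int


def ordered_subsets(clips: List[Clip]) -> List[Tuple[Tuple[int, int], List[Clip]]]:
    """Group clips by (pixel count, frame rate) and order the groups for training."""
    subsets: Dict[Tuple[int, int], List[Clip]] = defaultdict(list)
    for clip in clips:
        subsets[(clip.width * clip.height, clip.fps)].append(clip)
    # Lower pixel count, then lower frame rate, is trained first (assumed tie-break).
    return sorted(subsets.items(), key=lambda item: item[0])


clips = [
    Clip("a", 480, 320, 8),
    Clip("b", 1280, 720, 24),
    Clip("c", 480, 320, 8),
    Clip("d", 480, 320, 16),
]
schedule = ordered_subsets(clips)
stage_names = [[clip.name for clip in subset] for _, subset in schedule]
```

A fine-tuning loop would then iterate over `schedule` in order, training on each subset in turn.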
[0297] Obtain a benchmark dataset, which includes multiple benchmark animated video clips, each with a corresponding text prompt;
[0298] Based on the target video generation model, the benchmark animated video clips, and the corresponding text prompts, the target video is obtained;
[0299] Determine the visual quality and matching degree of the target video;
[0300] Based on the visual quality and the matching degree, benchmark test results are determined, and the benchmark test results are used to evaluate the performance of the target video generation model.
[0301] As an optional embodiment, the target video includes multiple frames, and the visual quality includes visual smoothness, which is used to represent the coherence between the multiple frames;
[0302] Correspondingly, determining the visual quality of the target video includes:
[0303] Obtain the visual features of the multiple frames;
[0304] Based on the visual features of the multiple frames, the similarity between two adjacent frames in the multiple frames is obtained sequentially;
[0305] Based on the similarity between two adjacent frames in the plurality of frames, a first average similarity is determined, which is used to represent the visual smoothness.
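A minimal sketch of the smoothness computation follows. It assumes each frame's visual feature is a plain vector and that the similarity is cosine similarity; the specification names neither the feature extractor nor the similarity measure, so both are illustrative choices, as are the function names.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def visual_smoothness(features: List[Sequence[float]]) -> float:
    """Average similarity over each pair of adjacent frames."""
    if len(features) < 2:
        raise ValueError("at least two frames are needed")
    similarities = [
        cosine_similarity(features[i], features[i + 1])
        for i in range(len(features) - 1)
    ]
    return sum(similarities) / len(similarities)


# Toy 2-D features: frames 1 and 2 are identical, frame 3 is orthogonal to them.
smoothness = visual_smoothness([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
```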
[0306] As an optional embodiment, the visual quality includes motion attributes; correspondingly, determining the visual quality of the target video includes:
[0307] The corresponding text prompt is input into a pre-trained motion scoring model to obtain a first motion score through the motion scoring model;
[0308] The target video is input into the motion scoring model to obtain a second motion score;
[0309] Determine the similarity between the first motion score and the second motion score, the similarity being used to represent the motion attribute.
[0310] As an optional embodiment, the target video comprises multiple frames, and the visual quality includes visual appeal;
[0311] Correspondingly, determining the visual quality of the target video includes:
[0312] Multiple keyframes are determined from the multiple frames;
[0313] Aesthetic scores are obtained from the multiple keyframes to determine an average aesthetic score, which is used to represent the visual appeal.
[0314] As an optional embodiment, the target video includes multiple frames, and the matching degree includes the matching degree between the corresponding text prompt and the target video;
[0315] Correspondingly, obtaining the matching degree of the target video includes:
[0316] The corresponding text prompt and the multiple frames are input into a multimodal pre-trained model to obtain multiple matching scores, each matching score representing the degree of matching between the corresponding frame and the text prompt;
[0317] An average matching score is determined based on the multiple matching scores, and the average matching score is used to represent the degree of matching between the corresponding text prompt and the target video.
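The averaging step above can be illustrated as follows. The multimodal pre-trained model is represented by a scoring callable, `score_fn`, and the toy scorer `toy_score` (word overlap against per-frame tags) is purely a stand-in so the sketch runs; neither reflects the actual model used in this application.

```python
from typing import Callable, FrozenSet, List

FrameTags = FrozenSet[str]  # hypothetical stand-in for an image frame


def average_matching_score(
    prompt: str,
    frames: List[FrameTags],
    score_fn: Callable[[str, FrameTags], float],
) -> float:
    """Score each frame against the prompt and return the mean score."""
    if not frames:
        raise ValueError("at least one frame is needed")
    scores = [score_fn(prompt, frame) for frame in frames]
    return sum(scores) / len(scores)


def toy_score(prompt: str, frame: FrameTags) -> float:
    """Toy scorer: fraction of prompt words that appear among the frame's tags."""
    words = prompt.lower().split()
    return sum(word in frame for word in words) / len(words)


frames = [frozenset({"cat", "runs"}), frozenset({"cat"})]
avg_score = average_matching_score("cat runs", frames, toy_score)
```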
[0318] As an optional embodiment, the target video includes multiple frames, including a guide frame, which is used to guide the target video generation model to generate the target video, and the matching degree includes the matching degree between the guide frame and the target video;
[0319] Correspondingly, determining the matching degree of the target video includes:
[0320] Obtain the target text from the corresponding text prompt, the target text being used to specify the image style of the guide frame;
[0321] The guiding frame and the target text are input into a multimodal pre-trained model to obtain a first matching score;
[0322] The multiple frames and the target text are input into the multimodal pre-trained model to obtain multiple second matching scores, each second matching score corresponding to one frame;
[0323] A second average similarity is determined based on the similarity between the first matching score and the plurality of second matching scores, and the second average similarity is used to represent the similarity between the guide frame and the target video.
[0324] As an optional embodiment, the target video includes multiple frames, the target video includes a target character, the benchmark animated video clip includes a benchmark character, and the matching degree includes the matching degree between the target character and the benchmark character;
[0325] Correspondingly, determining the matching degree of the target video includes:
[0326] The features of the benchmark character are obtained based on the benchmark animated video clip;
[0327] Multiple sample frames are obtained from the multiple frames;
[0328] The multiple sample frames are input into a pre-trained character detection model to obtain the features of the multiple target characters;
[0329] Based on the similarity between the features of the benchmark character and the features of the multiple target characters, a character similarity is determined, which is used to represent the matching degree between the target character and the benchmark character.
[0330] Embodiment 3
[0331] Figure 8 schematically illustrates the hardware architecture of a computer device 10000 suitable for implementing an animation video generation method according to Embodiment 3 of this application. In some embodiments, the computer device 10000 may be a smartphone, wearable device, tablet computer, personal computer, vehicle terminal, game console, virtual device, workbench, digital assistant, set-top box, robot, or other terminal device. In other embodiments, the computer device 10000 may be a rack server, blade server, tower server, or cabinet server (including independent servers or server clusters composed of multiple servers). As shown in Figure 8, the computer device 10000 includes, but is not limited to: a memory 10010, a processor 10020, and a network interface 10030 that can communicate with each other via a system bus. Wherein:
[0332] The memory 10010 includes at least one type of computer-readable storage medium, including flash memory, hard disk, multimedia card, card-type memory (e.g., SD or DX memory), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, magnetic disk, optical disk, etc. In some embodiments, the memory 10010 may be an internal storage module of a computer device 10000, such as the hard disk or memory of the computer device 10000. In other embodiments, the memory 10010 may also be an external storage device of the computer device 10000, such as a plug-in hard disk, smart media card (SMC), secure digital (SD) card, flash card, etc., equipped on the computer device 10000. Of course, the memory 10010 may also include both the internal storage module and the external storage device of the computer device 10000. In this embodiment, the memory 10010 is typically used to store the operating system and various application software installed on the computer device 10000, such as program code for animation video generation methods. In addition, the memory 10010 can also be used to temporarily store various types of data that have been output or will be output.
[0333] In some embodiments, processor 10020 may be a central processing unit (CPU), controller, microcontroller, microprocessor, or other chip. Processor 10020 is typically used to control the overall operation of computer device 10000, such as performing control and processing related to data interaction or communication with computer device 10000. In this embodiment, processor 10020 is used to run program code stored in memory 10010 or process data.
[0334] Network interface 10030 may include a wireless network interface or a wired network interface, which is typically used to establish a communication link between computer device 10000 and other computer devices. For example, network interface 10030 is used to connect computer device 10000 to an external terminal via a network, establishing a data transmission channel and communication link between computer device 10000 and the external terminal. The network may be an intranet, the Internet, Global System for Mobile Communication (GSM), Wideband Code Division Multiple Access (WCDMA), 4G network, 5G network, Bluetooth, Wi-Fi, or other wireless or wired networks.
[0335] It should be noted that Figure 8 only shows a computer device with components 10010-10030, but it should be understood that it is not required to implement all the components shown, and more or fewer components may be implemented instead.
[0336] In this embodiment, the program code of the animation video generation method stored in memory 10010 can also be divided into one or more program modules and executed by one or more processors (such as processor 10020) to implement the embodiments of this application.
[0337] Embodiment 4
[0338] This application also provides a computer-readable storage medium storing computer-readable instructions thereon, wherein the computer-readable instructions, when executed by a processor, implement the steps of the animation video generation method in the embodiment.
[0339] In this embodiment, the computer-readable storage medium includes flash memory, hard disk, multimedia card, card-type memory (e.g., SD or DX memory), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, magnetic disk, optical disk, etc. In some embodiments, the computer-readable storage medium may be an internal storage unit of a computer device, such as the hard disk or memory of the computer device. In other embodiments, the computer-readable storage medium may also be an external storage device of the computer device, such as a plug-in hard disk, smart media card (SMC), secure digital (SD) card, flash card, etc., equipped on the computer device. Of course, the computer-readable storage medium may include both the internal storage unit and the external storage device of the computer device. In this embodiment, the computer-readable storage medium is typically used to store the operating system and various application software installed on the computer device, such as the program code of the animation video generation method in the embodiment. In addition, the computer-readable storage medium can also be used to temporarily store various types of data that have been output or will be output.
[0340] Embodiment 5
[0341] This application also provides a computer program product, including computer-readable instructions that, when executed by a processor, implement the methods described in the above embodiments.
[0342] Obviously, those skilled in the art should understand that the modules or steps of the embodiments of this application described above can be implemented using general-purpose computer devices. They can be centralized on a single computer device or distributed across a network of multiple computer devices. Optionally, they can be implemented using computer-executable program code, thereby storing them in a storage device for execution by a computer device. In some cases, the steps shown or described can be performed in a different order than those presented here, or they can be fabricated as separate integrated circuit modules, or multiple modules or steps can be fabricated as a single integrated circuit module. Thus, the embodiments of this application are not limited to any particular combination of hardware and software.
[0343] It should be noted that the above are merely preferred embodiments of this application and do not limit the scope of patent protection of this application. Any equivalent structural or procedural changes made using the content of this application's specification and drawings, or direct or indirect applications in other related technical fields, are similarly included within the scope of patent protection of this application.
Claims
1. A method for generating animated videos, wherein the method includes: obtaining a target guidance image and a target text prompt; determining spatiotemporal guidance conditions based on the target guidance image; and generating, based on the spatiotemporal guidance conditions and the guidance of the target text prompt, a series of frames that are associated with the target guidance image in time and space and match the target text prompt, so as to constitute an animated video clip.
2. The method according to claim 1, wherein, The method is implemented using a pre-trained video generation model, which includes a spatiotemporal masking module, an encoding module, and a transformer module, wherein: The spatiotemporal guidance conditions are determined by the spatiotemporal masking module based on the target guidance image. The spatiotemporal guidance conditions include a guidance sequence and a masking sequence. The guidance sequence is used to specify the position of the target guidance image in the animation video segment, and the masking sequence is used to define whether the series of frames are displayed in the animation video segment. The guiding sequence and the mask sequence are encoded by the encoding module to obtain the guiding feature sequence and the mask encoding sequence; The animated video clip is generated by the transformer module based on the guiding feature sequence, the mask encoding sequence, noise, and the target text prompt.
3. The method according to claim 2, wherein, The video generation model is obtained through the following operations: Obtain a training dataset, which includes multiple target animation video clips, each of which has a corresponding text prompt. The target animation video clips are used to determine the guidance conditions. Based on the multiple target animated video clips and corresponding text prompts and guidance conditions, the pre-trained base model is subjected to supervised fine-tuning to obtain the target video generation model, which is used to generate animated videos.
4. The method according to claim 3, wherein, Obtain the training dataset, including: Obtain the original video set, which includes multiple animated videos; The multiple animated videos are segmented into scenes to obtain multiple video clips; Based on preset filtering rules, the multiple target animation video segments are determined from the multiple video segments; The multiple target animation video clips are input into a pre-trained video annotation model to obtain multiple text prompts through the video annotation model, with each text prompt corresponding to a target animation video clip; The training dataset is constructed based on the multiple target animated video clips and the multiple text prompts.
5. The method according to claim 3, wherein, The base model includes: a spatiotemporal mask module, an encoding module, and a transformer module; the target animation video clip includes a frame sequence composed of multiple frames. Correspondingly, based on the multiple target animation video clips and corresponding text prompts and guidance conditions, the pre-trained base model is subjected to supervised fine-tuning to obtain the target video generation model, including: The frame sequence is input to the spatiotemporal masking module to determine a guide frame from the plurality of frames, and a guide sequence and a mask sequence are determined based on the guide frame; wherein, the guide frame is used to guide the transformer module to generate a series of frames associated with the guide frame; The guiding sequence and the mask sequence are input into the encoding module to obtain the guiding feature sequence and the mask encoding sequence through the encoding module; Based on the guiding feature sequence, the mask encoding sequence, noise, and the corresponding text prompts, the transformer module is subjected to supervised fine-tuning to obtain the target video generation model.
6. The method according to claim 5, wherein, Determining the guide sequence and mask sequence based on the guide frame includes: The guide frame is placed at the target position in the guide sequence, and the other positions in the guide sequence correspond to a series of frames generated by the transformer module that are associated with the guide frame; A first mask sequence is determined based on the guide sequence, the first mask sequence including a demasking mask, the demasking mask corresponding to the guide frame.
7. The method according to claim 5, wherein, Determining the guide sequence and mask sequence based on the guide frame includes: The guide frame is placed at the target position in the guide sequence, and the other positions in the guide sequence correspond to a series of frames generated by the transformer module that are associated with the guide frame; A target mask is obtained based on the frame sequence. The target mask includes a masked region and an unmasked region. The target mask is used to control the motion region of the series of frames. The masked region corresponds to the non-motion region, and the unmasked region corresponds to the motion region. A second mask sequence is determined based on the guiding sequence and the target mask. The second mask sequence has the same length as the guiding sequence and includes multiple target masks.
8. The method according to claim 7, wherein, Obtaining the target mask based on the frame sequence includes: Foreground detection is performed on the first frame of the frame sequence to determine the foreground region of the first frame; Based on the foreground region of the first frame, a foreground mask corresponding to each of the plurality of frames is generated, and the foreground mask is used to represent the motion region and non-motion region of the corresponding frame; The target mask is generated based on the foreground mask corresponding to each of the multiple frames.
9. The method according to any one of claims 3 to 8, wherein, Based on the multiple target animation video clips and corresponding text prompts and guidance conditions, a pre-trained base model is subjected to supervised fine-tuning to obtain a target video generation model, including: Determine the resolution and frame rate of each of the multiple target animation video segments; Based on the resolution and frame rate of each of the multiple target animation video segments, the training dataset is divided into multiple subsets, and at least one of the resolution and frame rate of each subset is different from the other subsets; The training priority for each subset is determined based on the resolution and frame rate corresponding to each subset. Based on the training priority of each subset, the base model is sequentially fine-tuned using different subsets to obtain the target video generation model.
10. The method according to any one of claims 3 to 8, wherein the method further includes: Obtain a benchmark dataset, which includes multiple benchmark animated video clips, each with a corresponding text prompt; Based on the target video generation model, the benchmark animated video clips, and the corresponding text prompts, the target video is obtained; Determine the visual quality and matching degree of the target video; Based on the visual quality and the matching degree, benchmark test results are determined, and the benchmark test results are used to evaluate the performance of the target video generation model.
11. The method according to claim 10, wherein, The target video comprises multiple frames, and the visual quality includes visual smoothness, which is used to represent the coherence between the multiple frames. Correspondingly, determining the visual quality of the target video includes: Obtain the visual features of the multiple frames; Based on the visual features of the multiple frames, the similarity between two adjacent frames in the multiple frames is obtained sequentially; Based on the similarity between two adjacent frames in the plurality of frames, a first average similarity is determined, which is used to represent the visual smoothness.
12. The method according to claim 10, wherein, The visual quality includes motion attributes; correspondingly, determining the visual quality of the target video includes: The corresponding text prompt is input into a pre-trained motion scoring model to obtain a first motion score through the motion scoring model; The target video is input into the motion scoring model to obtain a second motion score; Determine the similarity between the first motion score and the second motion score, the similarity being used to represent the motion attribute.
13. The method according to claim 10, wherein, The target video comprises multiple frames, and the visual quality includes visual appeal. Correspondingly, determining the visual quality of the target video includes: Multiple keyframes are determined from the multiple frames; Aesthetic scores are obtained from the multiple keyframes to determine an average aesthetic score, which is used to represent the visual appeal.
14. The method of claim 10, wherein, The target video includes multiple frames, and the matching degree includes the matching degree between the corresponding text prompt and the target video; Correspondingly, obtaining the matching degree of the target video includes: The corresponding text prompt and the multiple frames are input into a multimodal pre-trained model to obtain multiple matching scores, each matching score representing the degree of matching between the corresponding frame and the text prompt; An average matching score is determined based on the multiple matching scores, and the average matching score is used to represent the degree of matching between the corresponding text prompt and the target video.
15. The method according to claim 10, wherein, The target video includes multiple frames, including a guide frame. The guide frame is used to guide the target video generation model to generate the target video. The matching degree includes the matching degree between the guide frame and the target video. Correspondingly, determining the matching degree of the target video includes: Obtain the target text from the corresponding text prompt, the target text being used to specify the image style of the guide frame; The guiding frame and the target text are input into a multimodal pre-trained model to obtain a first matching score; The multiple frames and the target text are input into the multimodal pre-trained model to obtain multiple second matching scores, each second matching score corresponding to one frame; A second average similarity is determined based on the similarity between the first matching score and the plurality of second matching scores, and the second average similarity is used to represent the similarity between the guide frame and the target video.
16. The method of claim 10, wherein, The target video includes multiple frames, the target video includes a target character, the benchmark animation video clip includes a benchmark character, and the matching degree includes the matching degree between the target character and the benchmark character; Correspondingly, determining the matching degree of the target video includes: The features of the benchmark character are obtained based on the benchmark animated video clip; Multiple sample frames are obtained from the multiple frames; The multiple sample frames are input into a pre-trained character detection model to obtain the features of the multiple target characters; Based on the similarity between the features of the benchmark character and the features of the multiple target characters, a character similarity is determined, which is used to represent the matching degree between the target character and the benchmark character.
17. An animation video generation apparatus, wherein the apparatus includes: an acquisition module, used to acquire a target guidance image and a target text prompt; a determination module, used to determine spatiotemporal guidance conditions based on the target guidance image; and a generation module, used to generate, based on the spatiotemporal guidance conditions and the guidance of the target text prompt, a series of frames that are associated with the target guidance image in time and space and match the target text prompt, so as to constitute an animated video clip.
18. A computer device, comprising: at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores instructions that can be executed by the at least one processor to enable the at least one processor to perform the method of any one of claims 1 to 16.
19. A computer-readable storage medium, wherein, The computer-readable storage medium stores computer instructions that, when executed by a processor, implement the method as described in any one of claims 1 to 16.
20. A computer program product comprising computer-readable instructions, wherein, when executed by a processor, the computer-readable instructions implement the steps of the method as described in any one of claims 1 to 16.