Image animation adapters for diffusion models
The system addresses the challenges of I2V diffusion models by using lightweight neural network adapters to efficiently generate high-quality videos with flexible motion animation, reducing training costs and computational demands.
Patent Information
- Authority / Receiving Office
- WO · WO
- Patent Type
- Applications
- Current Assignee / Owner
- HUAWEI TECH CO LTD
- Filing Date
- 2024-12-10
- Publication Date
- 2026-06-18
AI Technical Summary
Current image-to-video (I2V) diffusion models are costly to train, requiring large datasets and substantial computing power. They also struggle to follow precise text prompt descriptions, often produce inconsistent results (especially when animating real-world images), and are limited to specific motion patterns.
A system using lightweight neural network adapters, such as self-attention and cross-attention adapters, is attached to a pre-trained diffusion model to learn motion types with minimal data and computational cost, enabling flexible animation of images into videos without text prompts.
The system efficiently generates high-quality videos with temporal coherence and flexibility to animate different subjects with various motion patterns, reducing training costs and computational requirements.
Description
IMAGE ANIMATION ADAPTERS FOR DIFFUSION MODELS
TECHNICAL FIELD
[0001] The present application relates to generative artificial intelligence and, more particularly, to image-to-video (I2V) generation systems and methods.
BACKGROUND
[0002] Diffusion models are a type of probabilistic generative model used in machine learning to generate data, by simulating transformation from random noise to a desired data distribution. These models are widely used for generating realistic text, images, videos, or other forms of data, and are often used in tasks like video generation.
[0003] BRIEF SUMMARY
[0004] In accordance with one aspect, the present disclosure describes a computer-implemented method for generating a video based on an input image using a pre-trained diffusion model. The method may include: receiving an input text prompt indicating a first motion type; training a first image animation adapter comprising a plurality of neural networks to learn the first motion type, wherein the first image animation adapter includes a self-attention adapter module that is paired with a self-attention layer of the pre-trained diffusion model; deploying the trained first image animation adapter by attaching to the pre-trained diffusion model to obtain a modified diffusion model; and processing the input image, by the modified diffusion model, to output the video.
[0005] In some implementations, the first image animation adapter may further include one or more of: a cross-attention adapter module that is paired with a cross-attention layer of the pre-trained diffusion model; a low-rank approximation layer that is paired with a temporal attention layer of the pre-trained diffusion model; or a text embedding comprising a feature vector based on the input text prompt.
[0006] In some implementations, the cross-attention adapter module may be configured to: for a given input video tensor, convert spatial token sequences associated with frames of the input video to a spatiotemporal token sequence by temporal pooling in order to obtain a query tensor in the cross-attention layer.
[0007] In some implementations, the cross-attention adapter module may be further configured to apply spatial down-sampling operation on the input video frame prior to projecting the spatiotemporal token sequence into a query embedding space.
[0008] In some implementations, the cross-attention adapter module may be further configured to associate each token of a query tensor in the cross-attention layer with a respective positional embedding representing at least one of a location or a frame number associated with the token.
[0009] In some implementations, the cross-attention adapter module may be further configured to fine-tune text conditioning by the cross-attention layer based on modifying encodings of the text embedding using low-rank adaptation modules.
[0010] In some implementations, the input text prompt may indicate a second motion type and the method may further include: training a second image animation adapter comprising a second cross-attention adapter module to learn the second motion type; and deploying a combination of the trained first image animation adapter and the trained second image animation adapter to obtain the modified diffusion model, wherein the output video is obtained by processing the input image by the modified diffusion model.
[0011] In some implementations, deploying the combination of the trained first image animation adapter and the second image animation adapter may include concatenating key and value tensors associated with each cross-attention adapter module using respective text embeddings.
[0012] In some implementations, the input text prompt may indicate a first subject that is identifiable in the input image and a movement associated with the first subject and the text embedding may be learnable.
[0013] In some implementations, the pre-trained diffusion model may be a text-to-video (T2V) model or a text-to-image (T2I) model.
[0014] In another aspect, a computing system is disclosed. The computing system includes a processor and memory coupled to the processor. The memory stores computer-executable instructions that, when executed by the processor, may configure the processor to: receive an input text prompt indicating a first motion type; train a first image animation adapter comprising a plurality of neural networks to learn the first motion type, wherein the first image animation adapter includes a self-attention adapter module that is paired with a self-attention layer of the pre-trained diffusion model; deploy the trained first image animation adapter by attaching to the pre-trained diffusion model to obtain a modified diffusion model; and process the input image, by the modified diffusion model, to output the video.
[0015] In another aspect, a non-transitory, computer-readable medium (CRM) is disclosed. The CRM stores instructions that, when executed by a processor, may configure the processor to: receive an input text prompt indicating a first motion type; train a first image animation adapter comprising a plurality of neural networks to learn the first motion type, wherein the first image animation adapter includes a self-attention adapter module that is paired with a self-attention layer of the pre-trained diffusion model; deploy the trained first image animation adapter by attaching to the pre-trained diffusion model to obtain a modified diffusion model; and process the input image, by the modified diffusion model, to output the video.
[0016] Other aspects and features of the present application will be understood by those of ordinary skill in the art from a review of the following description of examples in conjunction with the accompanying figures. Example embodiments of the present application are not limited to any particular operating system, system architecture, mobile device architecture, server architecture, or computer programming language.
BRIEF DESCRIPTION OF THE DRAWINGS
[0017] Embodiments are described in detail below, with reference to the following drawings:
[0018] FIG. 1A is a high-level operation diagram of an example computing device;
[0019] FIG. 1B illustrates a simplified organization of software components stored in memory of the example computing device;
[0020] FIG. 2 is a schematic diagram illustrating a high-level design of a video generation diffusion model and image animation adapters in accordance with embodiments of the present disclosure;
[0021] FIG. 3 illustrates a process flow of a cross-attention layer in a conventional video generation diffusion model;
[0022] FIG. 4A illustrates a conventional process for obtaining a Query tensor based on an input frame tensor of a video;
[0023] FIG. 4B illustrates a proposed process for obtaining a Query tensor based on an input frame tensor of a video;
[0024] FIG. 5 illustrates a process flow for combining multiple animation adapters; and
[0025] FIG. 6 shows, in flowchart form, an example method for generating a video based on an input image using a pre-trained video generation diffusion model.
[0026] Like reference numerals are used in the drawings to denote like elements and features.
[0027] DETAILED DESCRIPTION OF VARIOUS EMBODIMENTS
[0028] In the present application, the term “and/or” is intended to cover all possible combinations and sub-combinations of the listed elements, including any one of the listed elements alone, any sub-combination, or all of the elements, and without necessarily excluding additional elements.
[0029] In the present application, the phrase “at least one of ... or ...” is intended to cover any one or more of the listed elements, including any one of the listed elements alone, any sub-combination, or all of the elements, without necessarily excluding any additional elements, and without necessarily requiring all of the elements.
[0030] Generative artificial intelligence is a subset of artificial intelligence that uses generative models to produce data. The success of diffusion models, such as Stable Diffusion, has elevated image and video generation quality to unprecedented levels. In particular, text-to-image (T2I) and text-to-video (T2V) models abound. Image-to-video, also known as image animation, refers to a generative setting in which a user inputs a reference image, I, and the model outputs an “animated” version of I. I2V diffusion models are still far from mature as a viable commercial solution, for several reasons.
[0031] First, a diffusion model is costly to train. Most publicly available diffusion models (including Stable Diffusion) are T2I or T2V models that take a text prompt as the only input. Training an I2V diffusion model, either from scratch or by fine-tuning a T2V diffusion model, requires a massive dataset (typically millions of videos) and hundreds or thousands of GPU hours to optimize billions of parameters.
[0032] Second, most I2V diffusion models require a text prompt as input in addition to the image I. It is difficult to describe the desired dynamics precisely in a text prompt, and doing so requires heavy prompt engineering and hyperparameter tuning. Furthermore, I2V diffusion models often struggle to comply with I and the text prompt simultaneously.
[0033] Training I2V models typically requires a huge amount of data and computing power. A recent method named LAMP (Learn A Motion Pattern) enables a T2I diffusion model to learn a specific motion pattern from 8-16 videos on a single GPU, effectively reducing the cost of obtaining an I2V generation model. Although LAMP shows the potential of generating video from an input image with a T2I model at lower computational cost, it has a number of drawbacks.
[0034] LAMP learns the entire temporal module from scratch with few-shot data, and that module is typically at least 25% (and generally, over 33%) of the size of the base diffusion model. As a result, training convergence can be highly inconsistent across different training tasks. Furthermore, LAMP works for U-Net-based diffusion models only, as it relies on temporal convolutional layers attached to ResBlocks. As a consequence, LAMP has no support for another family of diffusion models, Diffusion Transformers (DiT), which are free of ResBlocks.
[0035] LAMP was originally designed to animate an image generated by a T2I diffusion model. For animating real world images, LAMP produces mixed results; issues like poor temporal coherency and frame quality can be frequently observed, making the generation quality far from satisfactory.
[0036] Under the LAMP setting, each learned I2V model can produce only one specific motion pattern, which greatly narrows the applicable scenes. Users may wish to animate different subjects with respective motion patterns, but the LAMP design cannot be parallelized: multiple models cannot be combined to produce multiple motions.
[0037] The present application discloses a system and methods for image-to-video generation. More specifically, given an input image I and a pre-trained diffusion model, the proposed system animates I, resulting in a video with motion that is specified by a user. The solution disclosed in the present application enables I2V generation with minimal data and computational cost, in contrast to large-scale GPU clusters and massive datasets. With a pre-trained diffusion model as the basis (preferably, a pre-trained T2V model), the solution is characterized by a set of lightweight neural networks, referred to herein as “an animation adapter”, attached to the pre-trained diffusion model. An animation adapter learns a single type of motion, e.g., cloud moving, waterfall cascading, and the like.
[0038] At the deployment stage, an animation adapter can be easily attached to the base diffusion model (for example, by implementing the attention adapter modules described herein with the base model) without any prompt engineering (i.e., no user-provided text prompt is needed). Users can also opt to attach multiple animation adapters, each corresponding to a respective subject that is identified in the video. The additional computational cost due to the animation adapter(s) is negligible.
[0039] The number of parameters in an animation adapter is less than 5% of the base model, making it easy to converge without affecting the prior knowledge within the base diffusion model. Moreover, in training, the animation adapter can be learned with the base diffusion model frozen under a few-shot setting. Data preparation is easy: only a few sample videos of the target motion (i.e., a dataset size on the order of 10^1) are required. Unlike most diffusion model training processes, which require text-image/video pairs as training data, the present solution does not rely on text captions.
[0040] The proposed system can be deployed on any computing device supporting the computation of deep neural networks. Specifically, a computer with high parallel computing performance (e.g., one having a GPU) may be an ideal deployment environment for the proposed system.
[0041] The proposed solution can be utilized in any diffusion model that is directed to I2V tasks, and the trained animation adapter can be integrated regardless of the particular architecture (e.g., U-Net, DiT, etc.).
[0042] Reference is first made to FIG. 1A which is a high-level operation diagram of an example computing device 105. In at least some embodiments, the example computing device 105 may be configured to implement the proposed I2V generation system of the present disclosure. The example computing device 105 includes a variety of modules. For example, as illustrated, the example computing device 105 may include a processor 100, a memory 110, an input interface module 120, an output interface module 130, and a communications module 140. As illustrated, the foregoing example modules of the example computing device 105 are in communication over a bus 150.
[0043] The processor 100 is a hardware processor and may, for example, be one or more ARM, Intel x86, PowerPC processors or the like.
[0044] The memory 110 allows data to be stored and retrieved. The memory 110 may include, for example, random access memory, read-only memory, and persistent storage. Persistent storage may be, for example, flash memory, a solid-state drive or the like. Read-only memory and persistent storage are a computer-readable medium. A computer-readable medium may be organized using a file system such as may be administered by an operating system governing overall operation of the example computing device 105.
[0045] The input interface module 120 allows the example computing device 105 to receive input signals. Input signals may, for example, correspond to input received from a user. The input interface module 120 may serve to interconnect the example computing device 105 with one or more input devices. Input signals may be received from input devices by the input interface module 120. Input devices may, for example, include one or more of a touchscreen input, keyboard, trackball or the like. In some embodiments, all or a portion of the input interface module 120 may be integrated with an input device. For example, the input interface module 120 may be integrated with one of the aforementioned input devices.
[0046] The output interface module 130 allows the example computing device 105 to provide output signals. Some output signals may, for example, allow provision of output to a user. The output interface module 130 may serve to interconnect the example computing device 105 with one or more output devices. Output signals may be sent to output devices by output interface module 130. Output devices may include, for example, a display screen such as a liquid crystal display (LCD) or a touchscreen display. Additionally, or alternatively, output devices may include devices other than screens such as, for example, a speaker, indicator lamps (such as, for example, light-emitting diodes (LEDs)), and printers. In some embodiments, all or a portion of the output interface module 130 may be integrated with an output device. For example, the output interface module 130 may be integrated with one of the aforementioned output devices.
[0047] The communications module 140 allows the example computing device 105 to communicate with other electronic devices and/or various communications networks. For example, the communications module 140 may allow the example computing device 105 to send or receive communications signals. Communications signals may be sent or received according to one or more protocols or according to one or more standards. For example, the communications module 140 may allow the example computing device 105 to communicate via a cellular data network according to one or more standards such as, for example, Global System for Mobile Communications (GSM), Code Division Multiple Access (CDMA), Evolution Data Optimized (EVDO), Long-term Evolution (LTE) or the like. Additionally, or alternatively, the communications module 140 may allow the example computing device 105 to communicate using near-field communication (NFC), via Wi-Fi™, using Bluetooth™ or via some combination of one or more networks or protocols. Contactless payments may be made using NFC. In some embodiments, all or a portion of the communications module 140 may be integrated into a component of the example computing device 105. For example, the communications module may be integrated into a communications chipset.
[0048] Software comprising instructions is executed by the processor 100 from a computer-readable medium. For example, software may be loaded into random-access memory from persistent storage of memory 110. Additionally, or alternatively, instructions may be executed by the processor 100 directly from read-only memory of memory 110.
[0049] FIG. 1B depicts a simplified organization of software components stored in memory 110 of the example computing device 105. As illustrated, these software components include an operating system 180 and application software 170.
[0050] The operating system 180 is software. The operating system 180 allows the application software 170 to access the processor 100, the memory 110, the input interface module 120, the output interface module 130 and the communications module 140. The operating system 180 may be, for example, Apple iOS™, Google’s Android™, Linux™, Microsoft Windows™, or the like.
[0051] The application software 170 adapts the example computing device 105, in combination with the operating system 180, to operate as a device performing particular functions. While a single application software 170 is illustrated in FIG. 1B, in operation, the memory 110 may include more than one application software 170 and different application software 170 may perform different operations.
[0052] Image-to-Video Generation Diffusion Models
[0053] FIG. 2 illustrates a video generation diffusion model 200 in accordance with example embodiments of the present application. The diffusion model 200 builds upon traditional diffusion models but is specifically designed to handle the spatiotemporal dynamics of video data. In particular, the diffusion model 200 incorporates specialized components to model the sequential frames of a video while preserving temporal consistency and spatial detail. The diffusion model 200 relies on a neural network, such as a U-Net architecture. In some implementations, the diffusion model 200 may have an encoder-decoder diffusion architecture with multi-scale attention modules designed to produce coherent videos from a conditioned input, such as text or image(s). The diffusion model 200 is composed of a plurality of modules, including Residual Blocks or ResBlocks (in U-Net architecture implementations), self-attention (SA) blocks, cross-attention (CA) blocks, and temporal attention (TA) blocks. An “attention function” is a mechanism used in machine learning to focus on the most relevant parts of input data when making predictions, and forms the core of self-attention, cross-attention, and temporal attention layers. The attention function works by computing a weighted sum of “Value” vectors based on their relevance, as determined by “Query” and “Key” vectors, and is often represented as:
[0054] where Q, K, V are Query, Key, and Value vectors based on the layer input. The Query is a vector representing what is currently being focused on or what information is being sought. The Key is a vector representing each piece of information in the input data, used to determine its relevance to the Query. The Value is a vector containing the actual information or data being attended to.
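As an illustration of the weighted-sum mechanism described above, the following is a minimal NumPy sketch. It assumes the common scaled dot-product form, softmax(QKᵀ/√d)V; that exact form, the function names `softmax` and `attention`, and the tensor sizes are assumptions made for illustration, not details taken from the present application.

```python
import numpy as np

def softmax(z, axis=-1):
    # Subtract the maximum along the axis for numerical stability.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Weighted sum of Value vectors, weighted by Query-Key relevance.

    Q: (n_q, d), K: (n_k, d), V: (n_k, d_v).
    The 1/sqrt(d) scaling is the assumed scaled dot-product form.
    """
    d = Q.shape[-1]
    scores = Q @ np.swapaxes(K, -1, -2) / np.sqrt(d)   # (n_q, n_k) relevance scores
    weights = softmax(scores, axis=-1)                 # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))    # 4 queries of dimension 8 (illustrative sizes)
K = rng.normal(size=(6, 8))    # 6 keys
V = rng.normal(size=(6, 16))   # 6 values of dimension 16
out, weights = attention(Q, K, V)
```

Each output row is a convex combination of the rows of `V`, which is what "weighted sum of Value vectors" amounts to in this sketch.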
[0055] The diffusion model 200 backbone may provide hierarchical feature extraction and reconstruction for each frame, capturing multi-scale details while managing spatial and temporal coherence across frames. In the case of U-Net architecture, the encoder processes each frame using a series of ResBlocks. Each ResBlock consists of convolutional layers with skip connections, allowing for deep feature extraction without gradient loss. The decoder mirrors the encoder, applying up-sampling ResBlocks to reconstruct the denoised frames at the original resolution.
[0056] Self-attention blocks cooperate with ResBlocks to refine spatial feature extraction by allowing long-range dependencies within each frame. Self-attention refines spatial features and ensures pixel-wise consistency within each frame. In particular, self-attention computes relationships between pixels within a single frame, providing a global context that enhances spatial coherence and produces high quality frame outputs.
[0057] Cross-attention blocks are included to condition each frame generation on external inputs, such as text prompts, previous frames, or style images. The external input (e.g., text embeddings or style features) is typically encoded as keys and values that interact with queries derived from the frame features. Cross-attention can be added at any position, and typically after each self-attention block, allowing conditioning on the initial frame generation and throughout diffusion steps for consistency.
[0058] Temporal-attention blocks ensure smooth transitions and continuity across frames, which is crucial for generating coherent video sequences. Temporal attention blocks can be added at any position, and typically after each cross-attention block. Temporal blocks are designed to capture dependencies from adjacent frames, enabling the model to learn motion patterns and temporal coherence.
[0059] The core component of the proposed system in FIG. 2 is an “animation adapter”, which comprises four learnable modules: a self-attention (SA)-Adapter, paired with each SA block; a cross-attention (CA)-Adapter, paired with each CA block; MotionLoRA modules, paired with each temporal-attention (TA) block; and text embeddings, one per adapter.
[0060] The SA-Adapter is a required module of the animation adapter, while the other three modules may be optionally integrated with the animation adapter. For example, users can opt to exclude the CA-Adapter, the MotionLoRA layers, and/or the text embedding for a trade-off between computational cost and performance. It should be noted that all learnable parameters of the modules of the animation adapter may be under 5% of the size of the base diffusion model.
[0061] During training, given a training dataset (e.g., of a size on the order of 10^1), the system may be trained with a denoising loss while keeping the constituent modules of the diffusion model 200 frozen.
[0062] A cross-attention layer captures the interactions between the visual features and the conditioning signal (i.e., a text prompt). For the base diffusion model (which may be a T2I or T2V model), the text prompt describes details of the image or video to generate. However, the role of a text prompt changes substantially in the I2V generation setting, since the input image I provides a vivid, and even more precise, description of the visual content in the generated video. Instead, the value of a text prompt lies in the information content beyond I, referred to as inter-frame dependency, e.g., the motion of objects and the camera.
[0063] In existing video generation diffusion models, cross-attention is defined by: $\mathrm{CA}(f, c; W_{Q,K,V,O}) = \mathrm{Att}(fW_Q, cW_K, cW_V)\,W_O$
[0064] where f is an embedding of an input image (i.e., a single frame), c is a text embedding, and $W_*$ are projection matrices. $Q = fW_Q$ encodes the visual features f, while K and V encode the text features c. In particular, the text prompt is encoded into embeddings used to condition each frame during the diffusion process. FIG. 3 diagrammatically illustrates a process flow of a cross-attention layer in a conventional video generation diffusion model, which processes a video frame tensor x and text embedding c and produces an output video frame tensor x*.
[0065] The cross-attention equation indicates that each frame f is processed independently of other frames, which may result in overlooking inter-frame dependency. Given an input video tensor $x \in \mathbb{R}^{B \times C \times F \times H \times W}$, previous techniques reshape x into a tensor in $\mathbb{R}^{(B \cdot F) \times (HW) \times C}$ consisting of B × F token sequences (corresponding to f above), each having HW C-dimensional tokens. In this case, a token sequence corresponds to an individual frame, and therefore cannot model cross-frame relationships.
[0066] To obtain the Query tensor, the model projects each spatial token sequence into a query embedding space, through a learnable linear layer represented by $W_Q$ (see, for example, FIG. 4A). The Query tensor can then be used in the cross-attention mechanism to determine relevance to the conditional input. Each frame has its own Query tensor generated from its spatial tokens. The queries from each frame are then used to interact with keys and values derived from the conditional input, such as a text prompt.
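The conventional per-frame cross-attention described above can be sketched in NumPy as follows. This is only an illustration under stated assumptions: the input layout (B, C, F, H, W), the helper names `frames_to_tokens` and `cross_attention`, the scaled dot-product form of Att, and all dimensions are hypothetical choices. The final lines perturb one frame to show that the output for a different frame is unaffected, which is the lack of inter-frame dependency noted in the text.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Assumed scaled dot-product attention.
    scores = Q @ np.swapaxes(K, -1, -2) / np.sqrt(Q.shape[-1])
    return softmax(scores, axis=-1) @ V

def frames_to_tokens(x):
    """(B, C, F, H, W) -> (B*F, H*W, C): one spatial token sequence per frame."""
    B, C, F, H, W = x.shape
    return x.transpose(0, 2, 3, 4, 1).reshape(B * F, H * W, C)

def cross_attention(f, c, W_Q, W_K, W_V, W_O):
    # Q comes from the visual tokens f; K and V come from the text embedding c.
    return attention(f @ W_Q, c @ W_K, c @ W_V) @ W_O

rng = np.random.default_rng(0)
B, C, F, H, W = 2, 8, 4, 3, 3          # illustrative sizes
T, D_text, D = 5, 12, 16               # text tokens, text dim, attention dim
x = rng.normal(size=(B, C, F, H, W))
c = rng.normal(size=(T, D_text))       # one text embedding shared by all frames
W_Q = rng.normal(size=(C, D))
W_K = rng.normal(size=(D_text, D))
W_V = rng.normal(size=(D_text, D))
W_O = rng.normal(size=(D, C))

out = cross_attention(frames_to_tokens(x), c, W_Q, W_K, W_V, W_O)

# Perturb frame 1 only; sequence index b*F + f selects batch b, frame f.
x_perturbed = x.copy()
x_perturbed[:, :, 1] += 1.0
out_perturbed = cross_attention(frames_to_tokens(x_perturbed), c, W_Q, W_K, W_V, W_O)
```

In this sketch `out[0]` (batch 0, frame 0) is identical before and after the perturbation, while `out[1]` (batch 0, frame 1) changes.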
[0067] The present application proposes modifications to the cross-attention layers of a pre-trained video generation diffusion model, represented by the CA-Adapter (i.e., cross-attention adapter module) .
[0068] In at least some implementations, the CA-Adapter changes the token sequences from those of single frames to that of the entire video. That is, the input video tensor x is reshaped so that a token sequence covers the entire video rather than a single frame. The spatial token sequences may be converted into a spatiotemporal token sequence (which represents features across both spatial and temporal dimensions) by pooling. For example, a pooling operation may be applied across the frames for each spatial location (patch) in the sequence of spatial tokens. The pooling options may include, for example, mean pooling, max pooling, and learnable pooling. By making the token sequence spatiotemporal, the Query includes information for the entire video, and not only for single frames, thus enabling cross-attention to model cross-frame information, i.e., information regarding motion depicted in the video.
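A minimal sketch of the temporal pooling step, assuming an input of shape (B, C, F, H, W) and an output of one token per spatial location, i.e., shape (B, H·W, C). The output shape and the function name `spatiotemporal_tokens` are assumptions for illustration; only mean and max pooling are shown, and learnable pooling is omitted.

```python
import numpy as np

def spatiotemporal_tokens(x, mode="mean"):
    """Pool across frames at each spatial location.

    x: (B, C, F, H, W) -> tokens: (B, H*W, C).
    Each token summarizes one spatial location over the whole video.
    """
    B, C, F, H, W = x.shape
    if mode == "mean":
        pooled = x.mean(axis=2)        # (B, C, H, W)
    elif mode == "max":
        pooled = x.max(axis=2)         # (B, C, H, W)
    else:
        raise ValueError(f"unsupported pooling mode: {mode}")
    return pooled.transpose(0, 2, 3, 1).reshape(B, H * W, C)

rng = np.random.default_rng(0)
B, C, F, H, W = 2, 8, 4, 3, 3          # illustrative sizes
x = rng.normal(size=(B, C, F, H, W))
tokens = spatiotemporal_tokens(x, mode="mean")

# Changing any single frame changes every pooled token, so a Query built
# from these tokens carries information about the whole video.
x_changed = x.copy()
x_changed[:, :, 3] += 1.0
tokens_changed = spatiotemporal_tokens(x_changed, mode="mean")
```

With mean pooling, adding 1 to one of F frames shifts every token by 1/F, in contrast to the per-frame token sequences of the conventional layer.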
[0069] Optionally, as shown in FIG. 4B, a spatial down-sampling layer (turning the frame resolution H×W into $H_s \times W_s$) can be inserted before the projection $W_Q$ to save computational cost. For example, the down-sampling may be implemented by an interpolate() operation in Python (NumPy / PyTorch), turning a tensor of shape (B, C, F, H, W) into one of shape (B, C, F, $H_s$, $W_s$), or vice versa. In this scenario, a paired up-sampling layer may need to be inserted after the cross-attention operation.
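The paired down-sampling and up-sampling can be sketched as below. Note the substitution: instead of an interpolate() call, this NumPy sketch uses block averaging for down-sampling and nearest-neighbour repetition for up-sampling. The integer factor `s` (so that H_s = H/s and W_s = W/s) and the function names are assumptions for illustration.

```python
import numpy as np

def downsample(x, s):
    """Average-pool (B, C, F, H, W) by an integer factor s along H and W."""
    B, C, F, H, W = x.shape
    if H % s or W % s:
        raise ValueError("H and W must be divisible by s")
    # Split H and W into (H//s, s) and (W//s, s) blocks, then average each block.
    return x.reshape(B, C, F, H // s, s, W // s, s).mean(axis=(4, 6))

def upsample(x, s):
    """Nearest-neighbour up-sampling by factor s along H and W."""
    return x.repeat(s, axis=3).repeat(s, axis=4)

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 8, 4, 6, 6))   # (B, C, F, H, W), illustrative sizes
small = downsample(x, 2)               # (2, 8, 4, 3, 3): 4x fewer spatial tokens
restored = upsample(small, 2)          # back to (2, 8, 4, 6, 6)
```

Halving each spatial side reduces the number of query tokens by a factor of four, which is where the saving in cross-attention cost comes from in this sketch.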
[0070] In some implementations, the text embedding c is made learnable so that the I2V model can learn the connection between the prompt word and the subject’s movement. For example, a text embedding layer may be integrated and trained along with the rest of the model. In this way, the model is able to learn text representations that are optimized for specific tasks. The text prompt may be set as a word or a short phrase corresponding to the subject to move (e.g., “waterfall” and “clouds” in the examples above), and the embedding c of these words is fine-tuned. One of the ways in which c can be fine-tuned is known as textual inversion, and may be implemented by the CA-Adapter. The details of implementing textual inversion may be found, for example, in “An image is worth one word: Personalizing text-to-image generation using textual inversion” (Gal, Rinon, et al., 2022, arXiv:2208.01618), the contents of which are incorporated herein by reference.
[0071] In some implementations, each token in the Query, Key, and Value tensors may be integrated with a respective “positional embedding”, a tensor specifying the location and frame number associated with the token. A positional embedding module may be integrated into a cross-attention layer (either before or after the projection $W_Q$ in FIG. 4B) to process token sequences. The design of the positional embedding function implemented by the module may be absolute positional embedding or rotary positional embedding (RoPE). These designs are described in detail in “Attention is All You Need” (Vaswani, Ashish, et al., 2017, arXiv:1706.03762) and “RoFormer: Enhanced Transformer with Rotary Position Embedding” (Su, Jianlin, et al., 2021, arXiv:2104.09864), the contents of which are incorporated herein by reference.
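One possible absolute positional embedding is sketched below, using the sinusoidal form from “Attention is All You Need”. Several choices here are assumptions for illustration only: the token sequence is taken to hold F·H·W tokens in frame-major order, the frame-number and location embeddings are simply summed, and the function names are hypothetical. RoPE is not shown.

```python
import numpy as np

def sinusoidal(positions, d):
    """Sinusoidal absolute positional embedding; d must be even."""
    positions = np.asarray(positions, dtype=np.float64)[:, None]   # (n, 1)
    freqs = 1.0 / (10000.0 ** (np.arange(0, d, 2) / d))            # (d/2,)
    angles = positions * freqs                                     # (n, d/2)
    pe = np.zeros((positions.shape[0], d))
    pe[:, 0::2] = np.sin(angles)   # even channels
    pe[:, 1::2] = np.cos(angles)   # odd channels
    return pe

def add_positional_embedding(tokens, F, H, W):
    """tokens: (B, F*H*W, C), frame-major. Adds frame and location embeddings."""
    B, N, C = tokens.shape
    if N != F * H * W or C % 2:
        raise ValueError("expected F*H*W tokens and an even channel count")
    idx = np.arange(N)
    frame = idx // (H * W)     # frame number of each token
    loc = idx % (H * W)        # flattened spatial location of each token
    pe = sinusoidal(frame, C) + sinusoidal(loc, C)   # summing is an assumption
    return tokens + pe[None]

F, H, W, C = 2, 2, 2, 8                       # illustrative sizes
tokens = np.zeros((1, F * H * W, C))
embedded = add_positional_embedding(tokens, F, H, W)
```

Tokens at the same spatial location in different frames receive different embeddings in this sketch, so the attention layer can tell them apart.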
[0072] Low-Rank Adaptation (LoRA) is a technique used to fine-tune large pre-trained machine learning models. It involves adapting only a small subset of the model’s parameters rather than the entire model, by leveraging low-rank matrix factorization. In some implementations, LoRA modules $\Delta W_{K,V,O}$ may be attached to the fixed, pre-trained weight matrices $W_{K,V,O}$ in order to fine-tune the text conditioning mechanism. The LoRA modules are employed to fine-tune $W_K$, $W_V$, and $W_O$, and the modification is represented by the equation: $\mathrm{CA}(x; W_{Q,K,V,O}, \Delta W_{K,V,O}, c) = \mathrm{Att}(xW_Q, c(W_K + \Delta W_K), c(W_V + \Delta W_V)) \cdot (W_O + \Delta W_O)$
[0073] The text embedding c is placed at the end of the parameters to emphasize that c has become part of the adapter’s parameters. As a result, x becomes the only input at inference time.
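The LoRA-modified cross-attention above can be sketched as follows. This is an illustration under assumptions: each ΔW is factored as a product of two small matrices of rank r, the second factor is zero-initialized (a common LoRA convention, so that ΔW = 0 before training), Att is taken to be scaled dot-product attention, and the class and function names and all dimensions are hypothetical.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Assumed scaled dot-product attention.
    scores = Q @ np.swapaxes(K, -1, -2) / np.sqrt(Q.shape[-1])
    return softmax(scores, axis=-1) @ V

class LoRA:
    """Low-rank update dW = A @ B with rank r much smaller than d_in, d_out."""
    def __init__(self, d_in, d_out, r, rng):
        self.A = rng.normal(scale=0.01, size=(d_in, r))
        self.B = np.zeros((r, d_out))      # zero init: dW = 0 before training

    def delta(self):
        return self.A @ self.B

    def num_params(self):
        return self.A.size + self.B.size

def ca_adapter(x, c, W_Q, W_K, W_V, W_O, dK, dV, dO):
    # Base weights stay frozen; only the LoRA factors (and c) would be trained.
    K = c @ (W_K + dK.delta())
    V = c @ (W_V + dV.delta())
    return attention(x @ W_Q, K, V) @ (W_O + dO.delta())

rng = np.random.default_rng(0)
N, C, T, D_text, D, r = 9, 8, 5, 12, 16, 2     # illustrative sizes
x = rng.normal(size=(N, C))
c = rng.normal(size=(T, D_text))               # text embedding, part of the adapter
W_Q, W_K = rng.normal(size=(C, D)), rng.normal(size=(D_text, D))
W_V, W_O = rng.normal(size=(D_text, D)), rng.normal(size=(D, C))
dK, dV, dO = LoRA(D_text, D, r, rng), LoRA(D_text, D, r, rng), LoRA(D, C, r, rng)

base = attention(x @ W_Q, c @ W_K, c @ W_V) @ W_O
untrained = ca_adapter(x, c, W_Q, W_K, W_V, W_O, dK, dV, dO)

dV.B = rng.normal(size=dV.B.shape)             # stand-in for a training update
adapted = ca_adapter(x, c, W_Q, W_K, W_V, W_O, dK, dV, dO)
```

With these sizes each LoRA module attached to $W_K$ holds r·(D_text + D) = 56 parameters against 192 in the full matrix, which illustrates why the adapter stays small relative to the base model.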
[0074] In some implementations, multiple animation adapters may be combined by concatenating the Key and Value tensors given by each adapter, using the respective text embeddings, at inference time. The combination of adapters may be represented by the equation:
[0075] The parallelism of multiple animation adapters is illustrated in FIG. 5. Each CA-Adapter i consists of 3 LoRA modules and the learned text embedding $c_i$.
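The concatenation of Key and Value tensors across adapters can be sketched as below. This is a sketch under assumptions, not the combination equation of the present application: each adapter is represented as a hypothetical dictionary holding its text embedding `c` and its ΔW matrices `dK` and `dV`, Att is taken to be scaled dot-product attention, and the base $W_O$ is used unchanged because how the per-adapter output updates are merged is not assumed here.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Assumed scaled dot-product attention.
    scores = Q @ np.swapaxes(K, -1, -2) / np.sqrt(Q.shape[-1])
    return softmax(scores, axis=-1) @ V

def combined_cross_attention(x, W_Q, W_K, W_V, W_O, adapters):
    """Concatenate the Keys and Values produced by each adapter's text embedding."""
    K = np.concatenate([a["c"] @ (W_K + a["dK"]) for a in adapters], axis=0)
    V = np.concatenate([a["c"] @ (W_V + a["dV"]) for a in adapters], axis=0)
    return attention(x @ W_Q, K, V) @ W_O, K, V

rng = np.random.default_rng(0)
N, C, D_text, D = 9, 8, 12, 16                 # illustrative sizes
x = rng.normal(size=(N, C))
W_Q, W_K = rng.normal(size=(C, D)), rng.normal(size=(D_text, D))
W_V, W_O = rng.normal(size=(D_text, D)), rng.normal(size=(D, C))

def make_adapter(num_tokens):
    # Random stand-ins for a trained adapter's parameters.
    return {
        "c": rng.normal(size=(num_tokens, D_text)),
        "dK": 0.1 * rng.normal(size=(D_text, D)),
        "dV": 0.1 * rng.normal(size=(D_text, D)),
    }

adapter_a, adapter_b = make_adapter(3), make_adapter(5)   # e.g. two motion types
out_ab, K_ab, V_ab = combined_cross_attention(x, W_Q, W_K, W_V, W_O, [adapter_a, adapter_b])
out_ba, _, _ = combined_cross_attention(x, W_Q, W_K, W_V, W_O, [adapter_b, adapter_a])
```

Because attention is a weighted sum over key-value pairs, the result in this sketch does not depend on the order in which the adapters are concatenated.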
[0076] In order to boost the generated video quality, we propose a technique that can be applied in the inference stage of the I2V model. Specifically, we perform adaptive instance normalization (AdaIN) on the latent video frames after the last iteration of sampling so that each frame’s pixel statistics (mean and standard deviation) align with those of the first frame. The technique does not require any training or optimization, and adds little computational cost.
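A minimal sketch of this inference-time AdaIN step, assuming latents of shape (B, C, F, H, W) and statistics computed per channel over the spatial dimensions of each frame. The per-channel choice, the `eps` constant, and the function name `adain_to_first_frame` are assumptions for illustration.

```python
import numpy as np

def adain_to_first_frame(latents, eps=1e-5):
    """Align each frame's per-channel mean and std with those of frame 0.

    latents: (B, C, F, H, W). No learned parameters are involved.
    """
    mean = latents.mean(axis=(3, 4), keepdims=True)        # (B, C, F, 1, 1)
    std = latents.std(axis=(3, 4), keepdims=True) + eps    # eps avoids division by zero
    ref_mean = mean[:, :, :1]                              # statistics of the first frame
    ref_std = std[:, :, :1]
    # Normalize every frame, then re-scale to the first frame's statistics.
    return (latents - mean) / std * ref_std + ref_mean

rng = np.random.default_rng(0)
B, C, F, H, W = 1, 4, 6, 8, 8                              # illustrative sizes
# Give each frame a different brightness and contrast to mimic drift.
latents = rng.normal(size=(B, C, F, H, W)) * np.linspace(1.0, 2.0, F)[None, None, :, None, None]
latents = latents + np.arange(F)[None, None, :, None, None]
aligned = adain_to_first_frame(latents)
```

The first frame passes through unchanged, and every other frame ends up with the first frame's per-channel mean and (up to `eps`) standard deviation.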
[0077] Reference is now made to FIG. 6 which shows, in flowchart form, an example method 600 for generating a video based on an input image, using a pre-trained video generation diffusion model. The operations of method 600 may be performed by a computing system, such as the computing device 105 of FIG. 1A. Specifically, a computing device that is configured to load and execute a video generation diffusion model may perform all of parts of the method 600.
[0078] In operation 602, the computing system receives an input text prompt indicating a first motion type. The input text prompt may be part of a request to the pre-trained diffusion model, which may be, for example, a text-to-video (T2V) model or a text-to-image (T2I) model. The diffusion model includes, at least, spatial blocks (including self-attention and cross-attention blocks) and a temporal attention block. The input text prompt may indicate a first subject that is identifiable in the input image and a movement associated with the first subject. In particular, the input text prompt may include a description of the first subject and/or a user-specified movement that is desired to be animated in connection with the first subject.
[0079] In operation 604, the computing system trains a first image animation adapter comprising a plurality of neural networks to learn the first motion type. The first image animation adapter may be an instance of the proposed animation adapter described in the present application. More particularly, the first image animation adapter includes a self-attention adapter module that is paired with a self-attention layer of the pre-trained diffusion model. In at least some implementations, the first image animation adapter includes (in addition to the self-attention adapter module) one or more of: a cross-attention adapter module that is paired with a cross-attention layer of the pre-trained diffusion model; a low-rank approximation layer that is paired with a temporal attention layer of the pre-trained diffusion model; or a text embedding comprising a feature vector based on the input text prompt. In some implementations, each cross-attention layer and temporal attention layer may be associated with multiple LoRA modules (for example, attached to WK, WV, WQ, and WO, respectively).
[0080] The attention layers of the diffusion model process input video tensors, which are multi-dimensional arrays representing video data in a format suitable for processing by the model. An input video tensor encodes spatial, temporal, and potentially additional features (e.g., color channels) of a video.
[0081] In some implementations, the cross-attention adapter module is configured to, for a given input video tensor, convert spatial token sequences associated with frames of the input video to a spatiotemporal token sequence by temporal pooling in order to obtain a query tensor in the cross-attention layer. The cross-attention adapter module may be further configured to apply a spatial down-sampling operation to the input video frames prior to projecting the spatiotemporal token sequence into a query embedding space.
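One possible reading of this query construction is sketched below. The sketch assumes that “temporal pooling” means gathering the per-frame spatial tokens of all frames into a single sequence, that the spatial down-sampling is average pooling by an integer factor, and that frames are laid out as (frames, height, width, channels). These interpretations, together with the function names, are assumptions made for illustration.

```python
import numpy as np

def spatial_downsample(frames, factor):
    """Average-pool each frame by `factor` along height and width."""
    f, h, w, c = frames.shape
    assert h % factor == 0 and w % factor == 0
    blocks = frames.reshape(f, h // factor, factor, w // factor, factor, c)
    return blocks.mean(axis=(2, 4))

def spatiotemporal_queries(frames, W_Q, factor=2):
    """Down-sample, pool tokens of all frames into one sequence, then project."""
    small = spatial_downsample(frames, factor)
    f, h, w, c = small.shape
    tokens = small.reshape(f * h * w, c)  # one spatiotemporal token sequence
    return tokens @ W_Q                   # projection into the query space

rng = np.random.default_rng(0)
frames = rng.normal(size=(4, 8, 8, 16))  # 4 frames of 8x8 tokens, 16 channels
W_Q = rng.normal(size=(16, 16))
queries = spatiotemporal_queries(frames, W_Q, factor=2)
```

In this example the down-sampling reduces the sequence from 4 × 8 × 8 = 256 tokens to 4 × 4 × 4 = 64 tokens, which illustrates why down-sampling before the projection keeps the cost of attending across frames manageable.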
[0082] In some implementations, the cross-attention adapter module is configured to associate each token of a query tensor in the cross-attention layer with a respective positional embedding representing at least one of a location or a frame number associated with the token.
[0083] In some implementations, the cross-attention adapter module is configured to fine-tune text conditioning by the cross-attention layer based on modifying encodings of the text embedding using low-rank adaptation modules.
[0084] In operation 606, the computing system deploys the trained first image animation adapter by attaching it to the pre-trained diffusion model to obtain a modified diffusion model. In particular, the learnable attention adapter modules, including the SA-Adapter, the CA-Adapter, and the MotionLoRA layer(s), may be implemented and combined with the SA layer, CA layer, and TA layer, respectively, and a text embedding module may be implemented for each animation adapter. In operation 608, the computing system processes the input image, by the modified diffusion model, to output the video. The output video includes an animation of the identified first subject so as to depict the user-specified movement of the first subject.
[0085] In some implementations, the input text prompt may also indicate a second motion type. The computing system is further configured to: train a second image animation adapter comprising a second cross-attention adapter module to learn the second motion type; and deploy a combination of the trained first image animation adapter and the trained second image animation adapter to obtain the modified diffusion model. The combination of the trained first image animation adapter and the second image animation adapter may be deployed by concatenating key and value tensors associated with each cross-attention adapter module using respective text embeddings, as described in the present disclosure. The output video is obtained by processing the input image by the modified diffusion model.
[0086] More generally, the computing system is configured to support a plurality of motion patterns. In particular, the proposed system can handle multiple input text prompts that specify different motion types.
[0087] The proposed technique can be combined with other existing inference techniques (e.g., shared-noise sampling, DCT-Init, or histogram equalization). Users can easily adjust the movement intensity by tuning one or more of the parameters.
[0088] The methods and/or processes described above, and steps thereof, may be realized in hardware, software or any combination of hardware and software suitable for a particular application. The hardware may include a general-purpose computer and/or dedicated computing device or specific computing device or particular aspect or component of a specific computing device. The processes may be realized in one or more microprocessors, microcontrollers, embedded microcontrollers, programmable digital signal processors or other programmable devices, along with internal and/or external memory. The processes may also, or instead, be embodied in an application specific integrated circuit, a programmable gate array, programmable array logic, or any other device or combination of devices that may be configured to process electronic signals. It will further be appreciated that one or more of the processes may be realized as computer executable code capable of being executed on a machine-readable medium.
[0089] The computer executable code may be created using a structured programming language such as C, an object oriented programming language such as C++, or any other high-level or low-level programming language (including assembly languages, hardware description languages, and database programming languages and technologies) that may be stored, compiled or interpreted to run on one of the above devices, as well as heterogeneous combinations of processors, processor architectures, or combinations of different hardware and software, or any other machine capable of executing program instructions.
[0090] Thus, in one aspect, each method described above, and combinations thereof, may be embodied in computer executable code that, when executing on one or more computing devices, performs the steps thereof. In another aspect, the methods may be embodied in systems that perform the steps thereof and may be distributed across devices in a number of ways, or all of the functionality may be integrated into a dedicated, standalone device or other hardware. In another aspect, the means for performing the steps associated with the processes described above may include any of the hardware and/or software described above. All such permutations and combinations are intended to fall within the scope of the present disclosure.
Claims
1. A computer-implemented method for generating a video based on an input image using a pre-trained diffusion model, the method comprising: receiving an input text prompt indicating a first motion type; training a first image animation adapter comprising a plurality of neural networks to learn the first motion type, wherein the first image animation adapter includes a self-attention adapter module that is paired with a self-attention layer of the pre-trained diffusion model; deploying the trained first image animation adapter by attaching to the pre-trained diffusion model to obtain a modified diffusion model; and processing the input image, by the modified diffusion model, to output the video.
2. The method of claim 1, wherein the first image animation adapter further includes one or more of: a cross-attention adapter module that is paired with a cross-attention layer of the pre-trained diffusion model; a low-rank approximation layer that is paired with a temporal attention layer of the pre-trained diffusion model; or a text embedding comprising a feature vector based on the input text prompt.
3. The method of claim 2, wherein the cross-attention adapter module is configured to: for a given input video tensor, convert spatial token sequences associated with frames of the input video to a spatiotemporal token sequence by temporal pooling in order to obtain a query tensor in the cross-attention layer.
4. The method of claim 3, wherein the cross-attention adapter module is further configured to apply spatial down-sampling operation on the input video frame prior to projecting the spatiotemporal token sequence into a query embedding space.
5. The method of claim 2, wherein the cross-attention adapter module is further configured to associate each token of a query tensor in the cross-attention layer with a respective positional embedding representing at least one of a location or a frame number associated with the token.
6. The method of claim 2, wherein the cross-attention adapter module is further configured to fine-tune text conditioning by the cross-attention layer based on modifying encodings of the text embedding using low-rank adaptation modules.
7. The method of claim 2, wherein the input text prompt also indicates a second motion type and wherein the method further comprises: training a second image animation adapter comprising a second cross-attention adapter module to learn the second motion type; and deploying a combination of the trained first image animation adapter and the trained second image animation adapter to obtain the modified diffusion model, wherein the output video is obtained by processing the input image by the modified diffusion model.
8. The method of claim 7, wherein deploying the combination of the trained first image animation adapter and the second image animation adapter comprises concatenating key and value tensors associated with each cross-attention adapter module using respective text embeddings.
9. The method of claim 2, wherein the input text prompt indicates a first subject that is identifiable in the input image and a movement associated with the first subject and wherein text embedding is learnable.
10. The method of claim 1, wherein the pre-trained diffusion model comprises a text-to-video (T2V) model or a text-to-image (T2I) model.
11. A computing system, comprising: a processor; and memory coupled to the processor, the memory storing computer-executable instructions that, when executed by the processor, configure the processor to: receive an input text prompt indicating a first motion type; train a first image animation adapter comprising a plurality of neural networks to learn the first motion type, wherein the first image animation adapter includes a self-attention adapter module that is paired with a self-attention layer of the pre-trained diffusion model; deploy the trained first image animation adapter by attaching to the pre-trained diffusion model to obtain a modified diffusion model; and process the input image, by the modified diffusion model, to output the video.
12. The computing system of claim 11, wherein the first image animation adapter further includes one or more of: a cross-attention adapter module that is paired with a cross-attention layer of the pre-trained diffusion model; a low-rank approximation layer that is paired with a temporal attention layer of the pre-trained diffusion model; or a text embedding comprising a feature vector based on the input text prompt.
13. The computing system of claim 12, wherein the cross-attention adapter module is configured to: for a given input video tensor, convert spatial token sequences associated with frames of the input video to a spatiotemporal token sequence by temporal pooling in order to obtain a query tensor in the cross-attention layer.
14. The computing system of claim 13, wherein the cross-attention adapter module is further configured to apply spatial down-sampling operation on the input video frame prior to projecting the spatiotemporal token sequence into a query embedding space.
15. The computing system of claim 12, wherein the cross-attention adapter module is further configured to associate each token of a query tensor in the cross-attention layer with a respective positional embedding representing at least one of a location or a frame number associated with the token.
16. The computing system of claim 12, wherein the cross-attention adapter module is further configured to fine-tune text conditioning by the cross-attention layer based on modifying encodings of the text embedding using low-rank adaptation modules.
17. The computing system of claim 11, wherein the input text prompt also indicates a second motion type and wherein the method further comprises: training a second image animation adapter comprising a second cross-attention adapter module to learn the second motion type; and deploying a combination of the trained first image animation adapter and the trained second image animation adapter to obtain the modified diffusion model, wherein the output video is obtained by processing the input image by the modified diffusion model.
18. The computing system of claim 17, wherein deploying the combination of the trained first image animation adapter and the second image animation adapter comprises concatenating key and value tensors associated with each cross-attention adapter module using respective text embeddings.
19. The computing system of claim 12, wherein the input text prompt indicates a first subject that is identifiable in the input image and a movement associated with the first subject and wherein text embedding is learnable.
20. A non-transitory, computer-readable medium storing computer-executable instructions that, when executed by a processor, configure the processor to: receive an input text prompt indicating a first motion type; train a first image animation adapter comprising a plurality of neural networks to learn the first motion type, wherein the first image animation adapter includes at least a self-attention adapter module that is paired with a self-attention layer of the pre-trained diffusion model; deploy the trained first image animation adapter by attaching to the pre-trained diffusion model to obtain a modified diffusion model; and process the input image, by the modified diffusion model, to output the video.