Vector graphic animation generation method and system based on implicit neural representation and text-to-video diffusion model

By combining implicit neural representations and a text-to-video diffusion model, efficient and high-quality vector animations are generated, solving the problems of low efficiency and low quality in existing technologies. This technology is applicable to fields such as animation production, game development, and advertising design.

WO2026124131A1 · PCT designated stage · Publication Date: 2026-06-18 · PEKING UNIV

Patent Information

Authority / Receiving Office: WO · WO
Patent Type: Applications
Current Assignee / Owner: PEKING UNIV
Filing Date: 2025-11-17
Publication Date: 2026-06-18

AI Technical Summary

Technical Problem

Existing technologies suffer from inefficiency, low animation quality, and weak shape and color constraints in vector animation production, especially in automated processing where it is difficult to generate high-quality vector animations.

Method used

A method based on implicit neural representation and a text-to-video diffusion model is adopted. An initial static video is generated by hierarchical implicit neural representation networks, motion information is extracted from the text prompt using the text-to-video diffusion model to produce a coarse-grained animation video, and that video is then refined by optical flow calculation into a high-quality fine-grained animation.

Benefits of technology

It improves the efficiency and quality of vector animation generation, maintains consistency in color and shape, lowers the barrier to entry for animation production, and is applicable to fields such as animation production, game development, and advertising design.

✦ Generated by Eureka AI based on patent content.


Abstract

The present invention belongs to the field of computer graphics. Disclosed are a vector graphic animation generation method and system based on implicit neural representation and a text-to-video diffusion model. The method comprises: on the basis of n hierarchical implicit neural representation networks distributed side by side, generating an initial static video corresponding to a vector graphic; using a text-to-video diffusion model to extract action information from a text prompt, so as to generate a coarse-grained animation video corresponding to the initial static video; and refining the coarse-grained animation video, so as to obtain a fine-grained animation video. The present invention improves the efficiency and quality of vector animation generation, and lowers the threshold of animation production.

Description

Vector graphics animation generation method and system based on implicit neural representation and text-to-video diffusion model

TECHNICAL FIELD

[0001] The present application relates to the field of computer graphics, in particular to the technology of automatic generation of vector graphics animation, and more particularly to a vector graphics animation generation method and system based on implicit neural representation and text-to-video diffusion model.

BACKGROUND

[0002] Vector graphics are widely used in industrial design, animation production, and web design due to their scalability and user-friendliness. Compared with traditional raster images, vector graphics are composed of well-defined graphical elements such as ellipses, straight lines, and Bezier curves. This structure ensures that vector graphics maintain their clarity when scaled, making them suitable for a variety of display devices and resolutions. However, the process of vector animation production often requires extensive manual operations, including the establishment of complex skeletons, the setting of motion trajectories, and time control, which results in low efficiency and a huge workload in animation production.

[0003] Some existing methods attempt to automate vector graphics animation through computers, but most of these methods suffer from insufficient flexibility, low animation quality, and poor constraints on color and shape. For example, some methods convert vector graphics into raster images before producing animations, resulting in the loss of the original properties of the graphics and the lack of consistency in shape and color in the generated animations. Therefore, a new automated method is needed to effectively generate high-quality vector animations to reduce the production threshold and improve efficiency.

SUMMARY

[0004] To address the above problems, the present application discloses a vector graphics animation generation method based on implicit neural representation and text-to-video diffusion model, which improves the efficiency and quality of vector animation generation and reduces the threshold of animation production.

[0005] To achieve the above-mentioned purposes, the technical solutions of the present application include the following contents.

[0006] A vector graphics animation generation method based on implicit neural representation and text-to-video diffusion model, the method comprising:

[0007] Generating an initial static video corresponding to the vector graphics based on n hierarchical implicit neural representation networks distributed side by side;

[0008] Extracting motion information from the text prompt words using a text-to-video diffusion model to generate a coarse-grained animation video corresponding to the initial static video;

[0009] Refining the coarse-grained animation video to obtain a fine-grained animation video.

[0010] Further, generating the initial static video corresponding to the vector graphics based on the n hierarchical implicit neural representation networks distributed side by side comprises:

[0011] parsing the vector graphics to obtain the hierarchical color information c_l and the shape region with the corresponding color for each layer l ∈ [1, L], where L is the total number of layers of the hierarchical implicit neural representation network;

[0012] optimizing the hierarchical implicit neural representation network using the mean square error loss of each layer, so that the color opacity m_l output by each layer is consistent with the shape region of the corresponding color in the input vector graphics;

[0013] obtaining a background color c_0 and filling the canvas with the background color c_0;

[0014] in each layer, performing color filling based on the color opacity m_l, the hierarchical color information c_l, and the shape region to generate the i-th initial video frame V_i in the initial static video.

[0015] Further, extracting the action information from the text prompt using the text-to-video diffusion model to generate the coarse-grained animation video corresponding to the initial static video comprises:

[0016] adding noise ε to the video frame V_i to obtain the noisy video frame V_i′(t) after the t-th noise addition, where t is a natural number and V_i is the i-th video frame in the initial static video;

[0017] inputting the text prompt and the noisy video frame V_i′(t) into the text-to-video diffusion model to obtain the predicted noise conditioned on the text prompt c;

[0018] optimizing the i-th hierarchical implicit neural representation network based on the difference between the predicted noise and the noise ε, so that the i-th hierarchical implicit neural representation network outputs an updated video frame;

[0019] letting t = t + 1 and repeating the step of adding noise ε to the video frame to obtain the noisy video frame V_i′(t) until the i-th hierarchical implicit neural representation network converges, thereby obtaining the i-th video frame in the coarse-grained animated video.

[0020] Further, the text-to-video diffusion model comprises a noise prediction model based on a transformer architecture.

[0021] Further, the refining of the coarse-grained animated video comprises:

[0022] calculating the optical flow from the vector graphics to the i-th video frame in the coarse-grained animated video;

[0023] displacing the parameters of the vector graphics based on the optical flow calculation result to obtain the i-th video frame in the fine-grained animated video.

[0024] A vector graphics animation generation system based on implicit neural representation and text-to-video diffusion model, the system comprises:

[0025] A static video generation module for generating an initial static video corresponding to a vector graphics based on n hierarchical implicit neural representation networks distributed side by side.

[0026] A coarse-grained video generation module for extracting action information in a text prompt word using a text-to-video diffusion model to generate a coarse-grained animated video corresponding to the initial static video.

[0027] A fine-grained video generation module for refining the coarse-grained animated video to obtain a fine-grained animated video.

[0028] An electronic device comprising a processor and a memory storing computer program instructions; the processor executes the computer program instructions to implement the vector graphics animation generation method based on implicit neural representation and text-to-video diffusion model as described in any of the above.

[0029] A computer readable storage medium having computer program instructions stored thereon, the computer program instructions being executed by a processor to implement the vector graphics animation generation method based on implicit neural representation and text-to-video diffusion model as described in any of the above.

[0030] Compared with existing technologies, this invention significantly improves the efficiency and quality of vector animation generation, can handle complex graphics and motion expressions, and maintains consistency in color and shape, thereby lowering the threshold for animation production. It can be widely used in animation production, game development, advertising design and other fields. Users only need to provide basic vector graphics and text prompts to automatically generate high-quality animation effects, providing convenience for creation.

BRIEF DESCRIPTION OF THE DRAWINGS

[0031] Figure 1 is a flowchart of a vector graphics animation generation method based on implicit neural representation and a text-to-video diffusion model.

DETAILED DESCRIPTION

[0032] To make the objectives, technical solutions, and advantages of this invention clearer, the invention will be further described in detail below through specific implementations and in conjunction with the accompanying drawings.

[0033] The purpose of this invention is to provide a vector graphics animation generation method based on implicit neural representation and a text-to-video diffusion model. This method reconstructs the input vector graphics using hierarchical implicit neural representation and automatically generates animations using text prompts. The process is shown in Figure 1.

[0034] Step 1: Generate the initial static video corresponding to the vector graphics based on n side-by-side hierarchical implicit neural representation networks.

[0035] This step can be divided into two parts: vector graphics reconstruction and static video generation.

[0036] In vector graphics reconstruction, a vector graphics file input by the user is received, and the color information of each layer and the shape region with the corresponding color in each layer are parsed. Let there be L layers, and let the layer color information be (c_1, c_2, ..., c_L). Then, a hierarchical implicit neural representation is used for image reconstruction. Traditionally, the input to an image's implicit neural representation is a coordinate (x, y), and the output is the corresponding color (R, G, B). To adapt to the hierarchical nature of vector graphics files, the output of the hierarchical implicit neural representation is instead the color opacity of each layer (m_1, m_2, ..., m_L), and the image it represents is the result of applying the colors layer by layer according to their opacity. Specifically, when rendering the image, the canvas is first filled with a background color c_0; then, for each layer l, the opacity m_l of the color at each coordinate is calculated using a neural network, and the corresponding color c_l is overlaid on the image according to the opacity m_l. The hierarchical implicit neural representation network is optimized using the mean squared error loss of each layer, so that the represented image reconstructs the input vector graphics.
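
The layer-by-layer rendering described above can be sketched as follows. This is a minimal illustration, not the application's implementation: the opacities are given as plain arrays rather than produced by neural networks, the overlay is assumed to be standard "over" alpha blending (the source only says the color is overlaid according to its opacity), and the function names composite_layers and layer_mse are hypothetical.

```python
import numpy as np

def composite_layers(background, colors, opacities):
    """Overlay L flat-colored layers on a background, bottom layer first.

    background: (3,) RGB in [0, 1]           (c_0)
    colors:     (L, 3) one RGB per layer     (c_1 .. c_L)
    opacities:  (L, H, W) values in [0, 1]   (m_1 .. m_L; assumed to come
                from the hierarchical networks in the real method)
    """
    num_layers, height, width = opacities.shape
    canvas = np.broadcast_to(background, (height, width, 3)).astype(float).copy()
    for l in range(num_layers):
        m = opacities[l][..., None]                     # (H, W, 1)
        # Assumed blending rule: standard "over" compositing.
        canvas = (1.0 - m) * canvas + m * colors[l]
    return canvas

def layer_mse(opacities, target_masks):
    """Per-layer mean squared error between opacity and target shape region."""
    return ((opacities - target_masks) ** 2).mean(axis=(1, 2))

# Tiny example: white background, a red band, and a blue square above it.
background = np.array([1.0, 1.0, 1.0])
colors = np.array([[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
masks = np.zeros((2, 4, 4))
masks[0, :2, :] = 1.0       # layer 1: top two rows
masks[1, 1:3, 1:3] = 1.0    # layer 2: central 2x2 square
image = composite_layers(background, colors, masks)
losses = layer_mse(masks, masks)   # zero when opacity matches the shape region
```

In this sketch, later layers cover earlier ones wherever their opacity is 1, which matches the layer ordering of a typical vector graphics file.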

[0037] In static video generation, the present application directly copies the reconstructed hierarchical implicit neural representation network multiple times to adapt to the number of video frames, generating an initial static video.
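
As a rough illustration of the copying step, the sketch below duplicates one set of reconstructed parameters once per frame. It assumes the network parameters can be held in a plain dictionary of arrays; the names make_static_video_params, "w", and "b" are hypothetical and not taken from the source.

```python
import copy
import numpy as np

def make_static_video_params(reconstructed_params, n_frames):
    """Return n_frames independent copies of the reconstructed parameters.

    Deep copies are used so that later per-frame optimization of one copy
    does not change the others or the original reconstruction.
    """
    return [copy.deepcopy(reconstructed_params) for _ in range(n_frames)]

# Hypothetical parameter set standing in for one reconstructed network.
params = {"w": np.ones((2, 2)), "b": np.zeros(2)}
frame_params = make_static_video_params(params, n_frames=4)

# Changing frame 0 leaves the other frames and the original untouched.
frame_params[0]["w"][0, 0] = 5.0
```

Because every copy renders the same image at first, the resulting video is static until the per-frame parameters are optimized.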

[0038] Step 2: Extract action information in the text prompt word using a text-to-video diffusion model to generate a coarse-grained animated video corresponding to the initial static video.

[0039] The present application inputs the initial video frames into a pre-trained text-to-video diffusion model with fixed parameters and performs multiple rounds of optimization on the parameters of the hierarchical implicit neural representation networks. The text-to-video diffusion model used is a transformer-based noise prediction model that has been pre-trained on natural video data and does not require further fine-tuning. Let the initial video frame be V. During optimization, noise ε is added to obtain the noisy video V′ = α_t V + σ_t ε, where t is the number of noise-addition steps and α_t and σ_t are parameters related to t. The text-to-video diffusion model is then used to obtain the predicted noise conditioned on the text prompt c. The score distillation sampling loss is computed from the difference between the predicted noise and the added noise, weighted by w(t), a parameter related to t, and backpropagation is performed to optimize all hierarchical implicit neural representation networks. After multiple rounds of optimization, the initial static video gradually produces motion and the difference between the added noise and the predicted noise gradually decreases, thereby extracting action information from the diffusion model.
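
For reference, the noising step and the score distillation sampling update described above can be written in the form commonly used in the literature. This is a sketch under assumptions: the symbol \hat{\varepsilon} for the model's predicted noise and \theta for the network parameters are introduced here for illustration, the noise is assumed to be standard Gaussian, and the exact weighting and sign convention used in the application may differ.

```latex
V' = \alpha_t V + \sigma_t \varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, I)

\nabla_\theta \mathcal{L}_{\mathrm{SDS}}
  = \mathbb{E}_{t,\varepsilon}\!\left[\, w(t)\,
      \bigl(\hat{\varepsilon}(V';\, t,\, c) - \varepsilon\bigr)\,
      \frac{\partial V}{\partial \theta} \right]
```

Here V is the frame rendered by the hierarchical networks, so the gradient flows through the rendering into \theta while the diffusion model itself stays fixed.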

[0040] In one embodiment, the process of generating a coarse-grained animated video includes the following steps.

[0041] Step 2.1: Add noise ε to the video frame V_i to obtain the noisy video frame V_i′(t) after the t-th noise addition, where t is a natural number and V_i is the i-th video frame in the initial static video;

[0042] Step 2.2: Input the text prompt and the noisy video frame V_i′(t) into the text-to-video diffusion model to obtain the predicted noise conditioned on the text prompt c;

[0043] Step 2.3: Based on the difference between the predicted noise and the noise ε, optimize the i-th hierarchical implicit neural representation network so that it outputs an updated video frame;

[0044] Step 2.4: Let t = t + 1 and repeat the step of adding noise ε to the video frame to obtain the noisy video frame V_i′(t) until the i-th hierarchical implicit neural representation network converges, thereby obtaining the i-th video frame in the coarse-grained animation video.
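
Steps 2.1 to 2.4 can be sketched as a loop. The example below is a toy illustration only: it updates pixel values directly instead of network parameters, omits the w(t) weighting, and replaces the text-to-video diffusion model with a hypothetical stand-in (toy_predict_noise) that pulls the frame toward a fixed target. The schedules alphas and sigmas, the learning rate, and the tolerance are all assumed values.

```python
import numpy as np

def optimize_frame(frame, predict_noise, alphas, sigmas, lr=0.1, tol=1e-4, seed=0):
    """Run the noise / predict / update loop on one frame.

    Returns the optimized frame and the number of iterations used.
    """
    rng = np.random.default_rng(seed)
    frame = np.asarray(frame, dtype=float).copy()
    steps = 0
    for t in range(len(alphas)):
        eps = rng.standard_normal(frame.shape)        # Step 2.1: sample noise
        noisy = alphas[t] * frame + sigmas[t] * eps   #           and add it
        eps_hat = predict_noise(noisy, t)             # Step 2.2: predicted noise
        residual = eps_hat - eps                      # Step 2.3: difference
        frame -= lr * residual                        #           drives the update
        steps = t + 1
        if np.abs(residual).mean() < tol:             # Step 2.4: stop on convergence
            break
    return frame, steps

# Hypothetical stand-in for the diffusion model: it behaves as if the clean
# frame should equal `target`, so the residual points away from the target.
target = np.full((4, 4), 0.5)
alphas = np.full(200, 0.8)
sigmas = np.full(200, 0.6)

def toy_predict_noise(noisy, t):
    return (noisy - alphas[t] * target) / sigmas[t]

result, steps = optimize_frame(np.zeros((4, 4)), toy_predict_noise, alphas, sigmas)
```

With a real diffusion model the residual would be backpropagated through the renderer into the i-th network rather than applied to the pixels.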

[0045] Step 3: Refining the coarse-grained animation video to obtain a fine-grained animation video.

[0046] The present application uses an optical flow method to calculate the optical flow from the original graphics to each frame of animation, and displaces the parameter points of the original vector graphics according to the optical flow to ensure smooth transition of the animation. When necessary, the mean square error loss is used to directly optimize each frame, enhancing flexibility.
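
The displacement of parameter points by optical flow can be sketched as follows. The example assumes a dense flow field is already available as an array with flow[y, x] = (dx, dy) in pixels and that control points are given as (x, y) pixel coordinates; how the flow is estimated is outside this sketch, and the function name displace_control_points is hypothetical.

```python
import numpy as np

def displace_control_points(points, flow):
    """Move each control point by the bilinearly sampled flow at its position.

    points: (N, 2) array of (x, y) pixel coordinates
    flow:   (H, W, 2) array with flow[y, x] = (dx, dy)   (assumed layout)
    """
    points = np.asarray(points, dtype=float)
    height, width, _ = flow.shape
    # Clamp sampling positions to the image so border points stay valid.
    x = np.clip(points[:, 0], 0, width - 1)
    y = np.clip(points[:, 1], 0, height - 1)
    x0 = np.floor(x).astype(int)
    y0 = np.floor(y).astype(int)
    x1 = np.minimum(x0 + 1, width - 1)
    y1 = np.minimum(y0 + 1, height - 1)
    fx = (x - x0)[:, None]
    fy = (y - y0)[:, None]
    # Bilinear interpolation of the flow vectors at the four neighbors.
    top = (1 - fx) * flow[y0, x0] + fx * flow[y0, x1]
    bottom = (1 - fx) * flow[y1, x0] + fx * flow[y1, x1]
    sampled = (1 - fy) * top + fy * bottom
    return points + sampled

# Uniform flow: every pixel moves 2 px right and 1 px up.
flow = np.zeros((8, 8, 2))
flow[..., 0] = 2.0
flow[..., 1] = -1.0
control_points = np.array([[1.5, 2.0], [6.0, 7.0]])
moved = displace_control_points(control_points, flow)
```

Because the output is still a set of control points, each displaced frame remains a vector graphic with the original colors and topology.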

[0047] Directly optimizing the parameters of vector graphics with a text-to-video diffusion model faces a large gap between the two data domains. Implicit neural representation, by contrast, has several useful properties: like vector graphics, it has effectively unlimited resolution, and it also allows shapes to be manipulated, so it can produce relatively free motion while shape and color remain constrained. The present application therefore uses implicit neural representation as a bridge between the parameter domain of vector graphics and the diffusion model, thereby generating a coarse-grained animation video of higher quality.

[0048] Based on the same concept, the present application also discloses a vector graphics animation generation system based on implicit neural representation and text-to-video diffusion model, which comprises:

[0049] a static video generation module for generating an initial static video corresponding to the vector graphics based on n hierarchical implicit neural representation networks distributed side by side;

[0050] a coarse-grained video generation module for extracting motion information in the text prompt words using a text-to-video diffusion model to generate a coarse-grained animation video corresponding to the initial static video;

[0051] a fine-grained video generation module for refining the coarse-grained animation video to obtain a fine-grained animation video.

[0052] Based on the same concept, the present application also discloses an electronic device, characterized in that the electronic device comprises a processor and a memory storing computer program instructions; the processor executes the computer program instructions to realize the vector graphics animation generation method based on implicit neural representation and text-to-video diffusion model according to any one of the above.

[0053] Based on the same concept, the application further discloses a computer readable storage medium, characterized in that computer program instructions are stored on the computer readable storage medium, and the computer program instructions are executed by a processor to implement the vector graphic animation generation method based on implicit neural representation and text-to-video diffusion model.

[0054] The above embodiments are only used to illustrate the technical solutions of the present application but not to limit the present application, and the technical solutions of the present application can be modified or replaced by equivalents without departing from the spirit and scope of the present application, and the protection scope of the present application should be subject to the description in the claims.

Claims

1. A vector graphics animation generation method based on implicit neural representation and a text-to-video diffusion model, characterized in that the method comprises: generating an initial static video corresponding to vector graphics based on n side-by-side hierarchical implicit neural representation networks; extracting action information from a text prompt using the text-to-video diffusion model to generate a coarse-grained animated video corresponding to the initial static video; and refining the coarse-grained animated video to obtain a fine-grained animated video.

2. The method according to claim 1, characterized in that generating the initial static video corresponding to the vector graphics based on n side-by-side hierarchical implicit neural representation networks comprises: parsing the vector graphics to obtain layered color information c_l and a shape region s_l with the corresponding color for each layer l ∈ [1, L], where L is the total number of layers of the hierarchical implicit neural representation network; optimizing the hierarchical implicit neural representation network using the mean squared error loss of each layer, so that the color opacity m_l output by each layer is consistent with the shape region of the corresponding color in the input vector graphics; obtaining a background color c_0 and filling the canvas with the background color c_0; and, in each layer, performing color filling based on the color opacity m_l, the layered color information c_l, and the shape region s_l to generate the i-th initial video frame V_i in the initial static video.

3. The method according to claim 1, characterized in that extracting action information from the text prompt using the text-to-video diffusion model to generate the coarse-grained animated video corresponding to the initial static video comprises: adding noise ε to the video frame V_i to obtain the noisy video frame V_i′(t) after the t-th noise addition, where t is a natural number and V_i is the i-th video frame in the initial static video; inputting the text prompt and the noisy video frame V_i′(t) into the text-to-video diffusion model to obtain the predicted noise conditioned on the text prompt c; optimizing the i-th hierarchical implicit neural representation network based on the difference between the predicted noise and the noise ε, so that the network outputs an updated video frame; and letting t = t + 1 and repeating the step of adding noise ε to the video frame to obtain the noisy video frame V_i′(t) until the i-th hierarchical implicit neural representation network converges, thereby obtaining the i-th video frame in the coarse-grained animated video.

4. The method according to claim 1, characterized in that the text-to-video diffusion model comprises a noise prediction model based on a transformer architecture.

5. The method according to claim 1, characterized in that refining the coarse-grained animated video to obtain the fine-grained animated video comprises: calculating the optical flow from the vector graphics to a video frame, the video frame being the i-th video frame in the coarse-grained animated video; and displacing the parameters of the vector graphics based on the optical flow calculation result to obtain the i-th video frame in the fine-grained animated video.

6. A vector graphics animation generation system based on implicit neural representation and a text-to-video diffusion model, characterized in that the system comprises: a static video generation module for generating an initial static video corresponding to vector graphics based on n side-by-side hierarchical implicit neural representation networks; a coarse-grained video generation module for extracting action information from a text prompt using the text-to-video diffusion model to generate a coarse-grained animated video corresponding to the initial static video; and a fine-grained video generation module for refining the coarse-grained animated video to obtain a fine-grained animated video.

7. An electronic device, characterized in that the electronic device comprises a processor and a memory storing computer program instructions; when the processor executes the computer program instructions, it implements the vector graphics animation generation method based on implicit neural representation and text-to-video diffusion model as described in any one of claims 1-5.

8. A computer-readable storage medium, characterized in that the computer-readable storage medium stores computer program instructions which, when executed by a processor, implement the vector graphics animation generation method based on implicit neural representation and text-to-video diffusion model as described in any one of claims 1-5.