Vector animation generation method and system based on sparse state modeling and rendering perception reinforcement
By using sparse state modeling and rendering-aware reinforcement learning, we have solved the problems of context explosion and topological instability in vector animation generation, and achieved efficient, topologically isomorphic and non-rigid deformation vector animation generation, which improves generation efficiency and semantic alignment accuracy in complex interactive scenarios.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- BEIHANG UNIV
- Filing Date
- 2026-04-16
- Publication Date
- 2026-06-26
AI Technical Summary
Existing technologies for vector animation generation suffer from problems such as context explosion, topological instability, rigidity bottlenecks in motion expression, and lack of geometric deformation capabilities, resulting in low generation efficiency, topological structure destruction, and difficulty in achieving non-rigid deformation.
We employ a sparse state modeling and rendering-aware reinforcement approach. Through a sparse state update mechanism and rendering-aware reinforcement learning, we reconstruct the animation generation process and introduce identity-priority motion planning and rendering-aware reinforcement learning training pipelines to ensure topological isomorphism and non-rigid deformation capabilities.
It achieves efficient vector animation generation, solves the problems of context explosion and topological instability, ensures high fidelity of topological isomorphism and non-rigid deformation, and improves generation efficiency and semantic alignment accuracy in complex interactive scenarios.
Figure CN122289474A_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the interdisciplinary fields of computer graphics, artificial intelligence, and natural language processing, and more specifically to a vector animation generation method and system based on sparse state modeling and enhanced rendering perception.
Background Technology
[0002] In the cutting-edge fields of modern digital experiences, user interface (UI) design, and web engineering, Scalable Vector Graphics (SVG) has become the de facto graphics standard due to its resolution independence, compact file size, and, most importantly, structural editability. Unlike traditional bitmaps (raster images) composed of pixel arrays, SVG is built on Extensible Markup Language (XML), defining visual graphics through mathematical formulas, geometric paths (such as Bézier curves), polygon combinations, and transformation matrices. This allows designers and developers to precisely manipulate geometric components on devices with varying resolutions without noticeable visual distortion. With the increasing richness of digital media formats, vector animation plays a crucial role in micro-interactions, loading indicators, narrative illustrations, and data visualization. Because SVG documents are essentially XML documents, web browsers' DOM (Document Object Model)-based APIs can directly interact with these images, achieving animation effects by changing the graphic's position, shape, or attributes on the page.
[0003] However, despite the rapid advances in artificial intelligence over the past few years, particularly the breakthroughs in visual fidelity achieved by pixel-based video generation (such as Sora, Hunyuan-Video, and Seedance), automatic text-to-SVG animation generation faces a significant technological gap and lags far behind other generative fields. Currently, the creation of high-quality vector animation still relies heavily on time-consuming and labor-intensive traditional animation production processes. Professional designers must manually create keyframes, set easing curves, and assign layer order in vector editing software (such as Adobe Illustrator or dedicated animation tools like SVGator). In fast-paced product iterations, constrained by strict timelines and team schedules, this cumbersome production method often leads to animations being downgraded or completely abandoned during the design phase.
[0004] Extending existing large-scale static image generation models or general natural language generation paradigms directly to time-dimensional vector animation suffers from a severe "representation mismatch" problem. Based on in-depth analysis of existing technical literature and practical cases, the field currently faces three main core technical shortcomings and bottlenecks:
[0005] First, there is the problem of context explosion and the resulting catastrophic identity drift. When existing autoregressive large language models (such as GPT-5.2, Gemini 3 Pro, and DeepSeek) attempt to generate SVG code frame by frame, the context window overflows rapidly. The cause is the verbosity of SVG syntax (numerous floating-point coordinates, style definitions, and path control points) combined with repeatedly outputting the complete Document Object Model (DOM) tree for each frame. Experimental data shows that generating a simple 24-frame vector animation would consume over 86,000 tokens if the complete code were rewritten for each frame, severely limiting the generation of long animations. Even more critically, regenerating static attributes at each time step exposes them to the inherent randomness of autoregressive decoding. Between consecutive frames, static features of background elements or objects that should not change (such as color and the geometry of stationary parts) may be unexpectedly modified by the model, causing the object's identity to flicker, misalign, or even collapse along the timeline. This phenomenon is known in academia as "identity drift."
[0006] Secondly, there are the problems of topological instability and extremely high inference latency. To avoid the difficulty of directly generating long code, some optimization-based methods (such as LiveSketch and AniClipart) iteratively optimize static vector sketches using differentiable rasterization and score distillation sampling (SDS) from pre-trained text-to-video diffusion models (such as Stable Video Diffusion). However, this paradigm reduces vector graphics to a collection of independent strokes, lacking global semantic awareness of closed shapes, fill rules, and complex occlusions. During motion deformation, this easily disrupts the original topological structure, causing previously closed fill regions to open, overlapping layers to merge, or lines to break. Furthermore, because the SDS iterative optimization process requires hundreds or even thousands of steps, a single generation takes minutes or longer, which fails to meet the industry's demands for real-time or interactive applications.
[0007] Finally, there is the rigidity bottleneck in motion representation and the lack of geometric deformation capability. Existing declarative animation generation frameworks (such as Keyframer) and current mainstream general-purpose large language models exhibit a strong inductive bias toward rigid motion when processing SVG animations. To avoid complex coordinate-level spatial reasoning, these systems can usually only synthesize simple animation rules based on CSS (Cascading Style Sheets) or SMIL (Synchronized Multimedia Integration Language). Mathematically, these representations are strictly limited to affine transformations; that is, they can only achieve translation, rotation, and uniform scaling of an entire DOM node. However, in actual high-level animation design, complex non-rigid deformations (such as waving flags, folding paper, deforming droplets, or the joint movements of living organisms) require algorithms that can reach directly into the SVG path definition, namely the d attribute of the <path> tag (i.e., the set of control points defining the Bézier curves), and perform fine-grained, point-by-point temporal operations on it. This geometric-level shape plasticity and temporal evolution capability exceeds the expressive limits of affine transformations and represents a blind spot that existing automated methods based on large models cannot reach.
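As an illustrative, non-limiting sketch of the distinction drawn above, the following Python snippet contrasts an affine map applied to all control points with a per-point displacement. The function names, the sample points, and the sinusoidal "wave" are assumptions introduced only for illustration. An affine map keeps the middle of three collinear control points at the midpoint of the outer two, whereas a per-point edit does not, which is why such deformations cannot be expressed through a single transform on the node.

```python
import math

def apply_affine(points, matrix):
    """Apply an SVG-style matrix(a b c d e f) to every point."""
    a, b, c, d, e, f = matrix
    return [(a * x + c * y + e, b * x + d * y + f) for x, y in points]

def wave_deform(points, amplitude, phase):
    """Hypothetical non-rigid edit: displace each point independently."""
    return [(x, y + amplitude * math.sin(0.1 * x + phase)) for x, y in points]

def middle_is_midpoint(points, tol=1e-9):
    """True if the second of three points is the midpoint of the other two."""
    (x0, y0), (x1, y1), (x2, y2) = points
    return abs(x1 - (x0 + x2) / 2) < tol and abs(y1 - (y0 + y2) / 2) < tol

# Three collinear, evenly spaced control points (illustrative values).
control_points = [(0.0, 0.0), (10.0, 0.0), (20.0, 0.0)]

# Rotation by 30 degrees followed by a translation of (5, 7).
theta = math.radians(30)
rotate_and_shift = (math.cos(theta), math.sin(theta),
                    -math.sin(theta), math.cos(theta), 5.0, 7.0)

rigid = apply_affine(control_points, rotate_and_shift)
waved = wave_deform(control_points, amplitude=5.0, phase=0.0)

print(middle_is_midpoint(rigid))  # affine maps preserve midpoints
print(middle_is_midpoint(waved))  # per-point edits generally do not
```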
[0008] In summary, existing multimodal large language models and computer vision optimization algorithms have failed to fundamentally solve the problems of structure preservation and non-rigid deformation in vector animation generation. There is an urgent need for a novel generative paradigm that can overcome the physical limitations of language model context length, ensure perfect topological isomorphism between long temporal frames, and endow the model with the ability to understand continuous visual feedback and perform low-level geometric path editing.
Summary of the Invention
[0009] In view of the above problems, this invention is proposed to provide a vector animation generation method and system based on sparse state modeling and enhanced rendering perception to overcome or at least partially solve the above problems. This invention aims to reconstruct the animation generation process from the traditional "frame-by-frame full sequence rewriting" to a sparse state update (SSU) mechanism based on a persistent SVG DOM tree through underlying representation innovation. This mechanism breaks the context constraints of long-term generation by greatly compressing redundant information and mathematically guarantees the topological isomorphism and identity persistence of elements not involved in motion.
[0010] Another objective of this invention is to introduce an "Identity-First Motion Planning" mechanism inspired by human cognition. Given that directly outputting low-level coordinates from a large language model can easily lead to misinterpretations, this invention proposes an explicit structure-bound chain of thought (Structure-Bound CoT) that guides the model to accurately align high-level semantic intent with the globally unique identifier (ID) of specific SVG nodes before generating low-level code. This ensures the physical plausibility of motion logic and the accuracy of object operations in complex multi-object interaction scenarios.
[0011] A further objective of this invention is to overcome the visual blind spots caused by training large language models in pure text space and the non-differentiability of the SVG rendering process, proposing a pioneering Rendering-Aware Reinforcement Learning training pipeline. Utilizing the Group Relative Policy Optimization (GRPO) algorithm and a state-of-the-art video-aware encoder (PE-Core), this invention constructs a hybrid reward closed loop connecting discrete code updates and continuous visual dynamic feedback. This endows the model with the ability to escape the local optima of simple affine transformations, enabling it to actively explore and directly manipulate the underlying path geometry (such as Bézier curve control points) to achieve high-fidelity non-rigid deformation.
[0012] To achieve the above objectives, the present invention adopts the following technical solution: In a first aspect, embodiments of the present invention provide a vector animation generation method based on sparse state modeling and enhanced rendering perception, comprising:
S1: Receive a natural language instruction prompt input by the user, initial static SVG code, and its corresponding initial bitmap image; perform structured parsing on the initial static SVG code to generate a persistent document object model tree structure, and assign globally unique identifiers to the nodes in the document object model tree structure; convert the initial bitmap image into a visual embedding vector sequence through a visual encoder, and perform feature fusion with a text embedding sequence containing the natural language instruction prompt and the document object model tree structure to obtain multimodal fusion features;
S2: Based on the multimodal fusion features, generate a structure-bound chain of thought using a multimodal large language model; the structure-bound chain of thought includes an entity recognition sub-step and a visual dynamic planning sub-step, which are used to map the abstract visual entities in the natural language instruction prompt to the globally unique identifiers of the nodes in the document object model tree structure, and to describe the temporal logic and motion causal relationships of the objects based on the globally unique identifiers to generate a motion blueprint;
S3: Using the motion blueprint as a constraint, generate attribute difference sequences for the target nodes involved in the motion blueprint using the multimodal large language model, wherein each attribute difference term is associated with the globally unique identifier of the corresponding target node; and serialize the attribute difference sequence into a sparse state update sequence;
S4: Based on the initial static SVG code and the sparse state update sequence, construct the final vector animation sequence.
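By way of a minimal, non-limiting sketch, step S4 can be pictured as replaying sparse updates onto one persistent tree and taking a snapshot per time step. The in-memory update format (a dictionary from time step to `(ID, attribute, value)` triples), the function names `build_frames` and `attr_of`, and the sample SVG are assumptions made for illustration and are not taken from the embodiments.

```python
import copy
import xml.etree.ElementTree as ET

INITIAL_SVG = (
    '<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 100 100">'
    '<rect id="n0" x="0" y="0" width="100" height="100" fill="#ffffff"/>'
    '<circle id="n1" cx="20" cy="50" r="10" fill="#ff0000"/>'
    '</svg>'
)

# Hypothetical sparse state update sequence:
# time step -> list of (node ID, attribute name, new value).
UPDATES = {
    1: [("n1", "cx", "40")],
    2: [("n1", "cx", "60"), ("n1", "fill", "#00ff00")],
}

def build_frames(initial_svg, updates, num_steps):
    """Replay sparse updates on a persistent tree; return one snapshot per step."""
    root = ET.fromstring(initial_svg)
    by_id = {el.get("id"): el for el in root.iter() if el.get("id")}
    frames = [copy.deepcopy(root)]  # baseline state at t = 0
    for t in range(1, num_steps + 1):
        for node_id, attr, value in updates.get(t, []):
            if node_id not in by_id:
                raise KeyError(f"unknown node ID: {node_id}")
            by_id[node_id].set(attr, value)  # only listed attributes change
        frames.append(copy.deepcopy(root))
    return frames

def attr_of(frame, node_id, attr):
    """Look up one attribute of one node in a snapshot."""
    for el in frame.iter():
        if el.get("id") == node_id:
            return el.get(attr)
    return None

frames = build_frames(INITIAL_SVG, UPDATES, num_steps=2)
print([attr_of(f, "n1", "cx") for f in frames])
```

Because the background node `n0` never appears in the update sequence, its attributes are identical in every snapshot, which is the property the persistent-tree representation is meant to provide.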
[0013] Preferably, S1 includes:
S1.1: Structured Parsing and Document Object Model Tree Structure Construction
Perform structured parsing operations on the received initial static SVG code to construct a persistent document object model tree structure with hierarchical nesting relationships; based on a topological traversal algorithm, assign a globally unique identifier to each node with a physical form in the document object model tree structure; the lifecycle of the globally unique identifier spans the entire animation sequence and is used to anchor the attribute changes of specific objects in subsequent steps;
S1.2: Multimodal Embedding Vector Generation
The initial bitmap image is encoded using a pre-trained visual encoder to extract its high-dimensional visual representation features, which are then projected into the token space of the multimodal large language model to generate a continuous sequence of visual embedding vectors. Simultaneously, the natural language instruction prompt and the document object model tree structure containing the globally unique identifiers generated in S1.1 are serialized into a discrete text embedding sequence;
S1.3: Dense Cross-Modal Feature Alignment and Fusion
The visual embedding vector sequence and the text embedding sequence are input into the Transformer decoder of the multimodal large language model; by employing a dense cross-modal self-attention mechanism, the attention weights between the visual embedding vectors and each token in the text embedding sequence are calculated. The continuous visual representation features in the initial bitmap image are thereby deeply semantically aligned with the discrete hierarchical structure and globally unique identifiers in the document object model tree structure, and the multimodal fusion features that integrate visual perception information and code structure information are output.
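As a non-limiting sketch of S1.1, identifiers can be assigned while walking the parsed tree in document order. Several points here are assumptions for illustration only: a pre-order document traversal stands in for the traversal algorithm, "nodes with a physical form" is interpreted as the drawable or grouping tags listed in `SHAPE_TAGS`, and the `node_<n>` naming scheme and the function name `assign_ids` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumed interpretation of "nodes with a physical form".
SHAPE_TAGS = {"g", "path", "rect", "circle", "ellipse",
              "line", "polyline", "polygon", "text"}

def local_name(tag):
    """Strip an XML namespace such as '{http://www.w3.org/2000/svg}rect'."""
    return tag.rsplit("}", 1)[-1]

def assign_ids(svg_text, prefix="node"):
    """Give every shape node a unique id; existing ids are kept as-is."""
    root = ET.fromstring(svg_text)
    used = {el.get("id") for el in root.iter() if el.get("id")}
    counter = 0
    tag_by_id = {}
    for el in root.iter():  # document (pre-order, depth-first) order
        name = local_name(el.tag)
        if name not in SHAPE_TAGS:
            continue
        if el.get("id") is None:
            while f"{prefix}_{counter}" in used:
                counter += 1
            new_id = f"{prefix}_{counter}"
            el.set("id", new_id)
            used.add(new_id)
        tag_by_id[el.get("id")] = name
    return root, tag_by_id

SAMPLE = (
    '<svg xmlns="http://www.w3.org/2000/svg">'
    '<g><circle cx="5" cy="5" r="2"/><rect id="bg" width="10" height="10"/></g>'
    '<path d="M0 0 L10 10"/>'
    '</svg>'
)

root, tag_by_id = assign_ids(SAMPLE)
print(tag_by_id)
```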
[0014] Preferably, S2 includes:
S2.1: Chain-of-Thought Prompt Construction and Context Injection
A structured prompt template containing few-shot examples is constructed, and the multimodal fusion features, the natural language instruction prompt, and the globally unique identifiers and hierarchical relationships of each node in the document object model tree structure are injected into the input context window of the multimodal large language model.
S2.2: Entity Recognition and Explicit ID Mapping
Guide the multimodal large language model to execute the entity recognition sub-step of the chain-of-thought reasoning process: the semantic subject in the natural language instruction prompt is analyzed to identify the abstract visual entity referred to by the instruction; guided by the multimodal fusion features, the identified abstract visual entities are matched with nodes in the document object model tree structure that have corresponding visual representations, and the globally unique identifier of each matched node is explicitly output to establish a binding relationship between natural language entities and globally unique identifiers.
S2.3: Visual Dynamic Planning and Motion Blueprint Generation
Guide the multimodal large language model to execute the visual dynamic planning sub-step of the chain-of-thought reasoning process: based on the binding relationship, the implicit temporal logic and motion causality in the instruction are analyzed to define the motion type required for each node bound to a globally unique identifier. The motion type includes rigid translation, rotation, and scaling, or non-rigid path deformation. For nodes requiring non-rigid path deformation, the evolution trajectory of their geometric control points is further planned. A structured motion blueprint is generated, described in natural language combined with pseudocode.
The blueprint explicitly lists the expected state changes and attribute modification strategies of each globally unique identifier node at each time step, serving as constraints for the subsequent generation of attribute difference sequences.
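The embodiments describe the motion blueprint as natural language combined with pseudocode. Purely as an illustrative assumption, the sketch below models one blueprint entry as a small record and checks that every entry is bound to a node that actually exists in the tree. The class `PlannedMotion`, the motion type names, and the rule that path deformation only applies to `path` elements (because only they carry a `d` attribute) are hypothetical choices, not part of the disclosure.

```python
from dataclasses import dataclass

RIGID_TYPES = {"translate", "rotate", "scale"}
NON_RIGID_TYPES = {"path_deform"}

@dataclass(frozen=True)
class PlannedMotion:
    """One hypothetical blueprint entry bound to a globally unique ID."""
    node_id: str
    motion_type: str
    start_step: int
    end_step: int

def validate_blueprint(motions, tag_by_id):
    """Return a list of human-readable problems; empty means the plan binds cleanly."""
    errors = []
    for m in motions:
        if m.node_id not in tag_by_id:
            errors.append(f"{m.node_id}: unknown node ID")
            continue
        if m.motion_type not in RIGID_TYPES | NON_RIGID_TYPES:
            errors.append(f"{m.node_id}: unknown motion type {m.motion_type!r}")
        # Assumption: only <path> elements carry a 'd' attribute to deform.
        if m.motion_type in NON_RIGID_TYPES and tag_by_id[m.node_id] != "path":
            errors.append(f"{m.node_id}: path deformation needs a <path> node")
        if not 0 <= m.start_step <= m.end_step:
            errors.append(f"{m.node_id}: invalid time range")
    return errors

tag_by_id = {"flag": "path", "pole": "rect"}
plan = [
    PlannedMotion("flag", "path_deform", 0, 12),
    PlannedMotion("pole", "path_deform", 0, 12),  # wrong node kind
    PlannedMotion("sun", "rotate", 0, 12),        # ID not in the tree
]
for problem in validate_blueprint(plan, tag_by_id):
    print(problem)
```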
[0015] Preferably, S3 includes:
S3.1: Keyframe State Extraction and Baseline Alignment
Analyze the motion blueprint to identify the key time steps defined therein and their corresponding target state descriptions; read the initial attribute values of each node in the document object model tree structure and set them as the baseline state at time step $t=0$; for each globally unique identifier node specified in the motion blueprint, extract its expected attribute values at each key time step to form a set of state changes to be processed.
S3.2: Strict Attribute Difference Calculation and Sparse State Update Generation
The set of state changes to be processed is traversed in ascending order of time steps, and normalization and strict attribute difference operations are performed on each candidate attribute of each globally unique identifier node. Specifically, the attribute values are first normalized, and the normalization process includes at least: removing redundant whitespace characters, unifying numerical precision, unifying the color expression format, unifying the coordinate representation format, and unifying the order of transformation parameters. Establish the target state table for the current time step $t$ and the historical state table for the previous time step $t-1$. For any attribute attr of any node ID, the normalized attribute value at time step $t$ is denoted $v_t$, and the normalized value of the same attribute at the previous time step $t-1$ is denoted $v_{t-1}$; if $v_t$ and $v_{t-1}$ are not equal, the attribute is determined to have changed effectively at time step $t$, and an attribute difference term $(\mathrm{ID}, \mathrm{attr}, v_t)$ is generated; if the two are equal, no difference term is generated. Aggregate all attribute difference terms that have been determined to have undergone effective changes within the same time step $t$ into a time step increment set $\Delta_t$, where $\Delta_t = \{(\mathrm{ID}, \mathrm{attr}, v_t) \mid v_t \neq v_{t-1}\}$. Here $\Delta_t$ represents the set of attribute increments at time step $t$, ID represents the globally unique identifier of the node, attr represents the attribute name, and $v_t$ represents the normalized value of attribute attr of node ID at time step $t$. The attributes include at least geometric attributes, transformation attributes, and appearance attributes. The path definition attribute d among the geometric attributes is used to represent non-rigid geometric deformation. The transformation attribute transform is used to represent translation, rotation, or scaling. The appearance attributes fill, stroke, opacity, and fill-opacity are used to represent fill color, stroke color, overall opacity, and fill opacity, respectively.
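The strict difference of S3.2 can be sketched as follows, as a non-limiting illustration. The normalizer below covers only a subset of the listed rules (whitespace, numeric precision, hexadecimal color form); the two-decimal precision, the state-table layout keyed by `(ID, attr)`, and the function names are assumptions made for this example.

```python
import re

_NUM = re.compile(r"-?(?:\d+\.\d*|\.\d+|\d+)")

def _fmt_number(match, precision):
    text = f"{float(match.group()):.{precision}f}"
    if "." in text:
        text = text.rstrip("0").rstrip(".")
    return "0" if text == "-0" else text

def normalize(value, precision=2):
    """Partial normalizer: whitespace, #rgb -> #rrggbb lowercase, number precision."""
    value = " ".join(value.split())
    if re.fullmatch(r"#[0-9a-fA-F]{3}", value):
        value = "#" + "".join(ch * 2 for ch in value[1:])
    if value.startswith("#"):
        return value.lower()  # colors are not passed through number rounding
    return _NUM.sub(lambda m: _fmt_number(m, precision), value)

def diff_step(history, target):
    """Delta_t = {(ID, attr, v_t) | v_t != v_{t-1}} for one time step."""
    delta = []
    for (node_id, attr), raw in target.items():
        v_t = normalize(raw)
        if history.get((node_id, attr)) != v_t:
            delta.append((node_id, attr, v_t))
    return sorted(delta)

def diff_sequence(baseline, targets):
    """Walk time steps in ascending order, updating the historical state table."""
    history = {key: normalize(v) for key, v in baseline.items()}
    deltas = {}
    for t in sorted(targets):
        deltas[t] = diff_step(history, targets[t])
        for node_id, attr, v_t in deltas[t]:
            history[(node_id, attr)] = v_t
    return deltas

baseline = {("n1", "cx"): "20", ("n1", "fill"): "#F00"}
targets = {
    1: {("n1", "cx"): "40.000", ("n1", "fill"): "#ff0000"},
    2: {("n1", "cx"): " 40 ", ("n1", "fill"): "#00FF00"},
}
deltas = diff_sequence(baseline, targets)
print(deltas)
```

In this example the fill at step 1 and the position at step 2 differ from the previous step only in spelling, so after normalization they produce no difference term.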
[0016] S3.3: Serialization Representation and Command Output Based on Special Control Tags
The hierarchical incremental sequence is converted into a one-dimensional serialized string according to a predetermined order, the predetermined order including at least: time steps are arranged in ascending order, nodes within the same time step are arranged according to globally unique identifier or DOM traversal order, and attributes under the same node are arranged according to a preset attribute priority. During the serialization process, a time step control tag is inserted for each time step, a node control tag is inserted for each node, and a separator between attribute name and attribute value is inserted for each attribute, forming a linear token sequence that can be directly processed by a large language model. The one-dimensional serialized string includes at least a time step control tag <|time=t|> and a node control tag <|ID=ID|>, and the corresponding attribute difference terms are concatenated in sequence after the corresponding control tags. Reserved characters in attribute values are escaped or encapsulated to avoid ambiguity between attribute value content and control tags. This yields a sparse state update sequence which, together with the initial static SVG code, is input into the subsequent animation building module to restore the complete animation state sequence step by step over time.
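A minimal sketch of such a serialization is given below. Only the two control tags `<|time=t|>` and `<|ID=ID|>` come from the text; the `attr="value";` layout, the use of JSON string quoting to encapsulate values, the escaping of `|` as `\u007c`, and the attribute priority tuple are assumptions chosen for illustration. Node IDs are assumed not to contain `|`.

```python
import json
import re

# Assumed attribute priority; unlisted attributes sort after these, by name.
ATTR_PRIORITY = ("d", "transform", "fill", "stroke", "opacity", "fill-opacity")

def _rank(attr):
    if attr in ATTR_PRIORITY:
        return (ATTR_PRIORITY.index(attr), attr)
    return (len(ATTR_PRIORITY), attr)

def encode_value(value):
    """Quote the value and hide '|' so '<|' or '|>' cannot occur inside it."""
    return json.dumps(value).replace("|", "\\u007c")

def serialize(deltas):
    out = []
    for t in sorted(deltas):
        out.append(f"<|time={t}|>")
        by_node = {}
        for node_id, attr, value in deltas[t]:
            by_node.setdefault(node_id, []).append((attr, value))
        for node_id in sorted(by_node):
            out.append(f"<|ID={node_id}|>")
            for attr, value in sorted(by_node[node_id], key=lambda av: _rank(av[0])):
                out.append(f"{attr}={encode_value(value)};")
    return "".join(out)

_TOKEN = re.compile(
    r'<\|time=(\d+)\|>|<\|ID=([^|]+)\|>|([\w:-]+)=("(?:[^"\\]|\\.)*");'
)

def deserialize(text):
    """Inverse of serialize for well-formed input."""
    deltas, t, node_id = {}, None, None
    for m in _TOKEN.finditer(text):
        if m.group(1) is not None:
            t = int(m.group(1))
            deltas[t] = []
        elif m.group(2) is not None:
            node_id = m.group(2)
        else:
            deltas[t].append((node_id, m.group(3), json.loads(m.group(4))))
    return deltas

deltas = {
    1: [("n1", "fill", "#00ff00"), ("n1", "d", "M0 0 L10 10")],
    2: [("n0", "note", "a<|time=9|>b")],
}
text = serialize(deltas)
print(text)
```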
[0017] Preferably, the multimodal large language model is obtained by end-to-end reinforcement learning fine-tuning based on the group relative policy optimization algorithm, and the fine-tuning process includes: The policy network autoregressively generates, in parallel and conditioned on the full context, 3 parallel candidate output sequence groups, each containing a reasoning chain and a sparse state update sequence. An integrated headless browser rendering agent based on the web technology stack is invoked to inject each candidate sparse state update sequence into a virtual DOM environment in real time, and to reconstruct and render it at fixed time steps into a continuous two-dimensional raster video clip. Using a pre-trained video-aware encoder, the video visual feature vector of the generated two-dimensional raster video clip and the text feature vector of the natural language instruction prompt are extracted from the intermediate layers of the network, and the cosine similarity between the two is calculated to construct a semantic alignment reward. A format validity binary reward is introduced to evaluate code syntax parsability, the degree to which the sequence length matches the target, and the topological validity of globally unique identifier references. The semantic alignment reward and the format validity binary reward are fused to construct a hybrid reward function. The relative reward of candidate sequences within the group is used to estimate the advantage function, and the parameters of the multimodal large language model are updated subject to a KL divergence constraint.
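One possible reading of the format validity binary reward is sketched below, as an assumption-laden illustration rather than the disclosed implementation. It checks three things on a serialized candidate: that control tags can be found and are in ascending order, that the number of time steps equals an expected count (treating "length matching" as an exact match, which is an assumption), and that every referenced ID exists in the tree. The function name and regular expressions are hypothetical.

```python
import re

TIME_TAG = re.compile(r"<\|time=(\d+)\|>")
ID_TAG = re.compile(r"<\|ID=([^|]+)\|>")

def format_validity_reward(ssu_text, known_ids, expected_steps):
    """Return 1.0 only if every check passes, otherwise 0.0."""
    times = [int(t) for t in TIME_TAG.findall(ssu_text)]
    ids = ID_TAG.findall(ssu_text)
    parsable = bool(times) and ssu_text.lstrip().startswith("<|time=")
    ascending = times == sorted(set(times))  # strictly increasing, no repeats
    length_ok = len(times) == expected_steps
    ids_ok = all(node_id in known_ids for node_id in ids)
    return 1.0 if (parsable and ascending and length_ok and ids_ok) else 0.0

known = {"n0", "n1"}
good = '<|time=1|><|ID=n1|>cx="40";<|time=2|><|ID=n1|>cx="60";'
bad_id = '<|time=1|><|ID=n9|>cx="40";<|time=2|><|ID=n1|>cx="60";'

print(format_validity_reward(good, known, expected_steps=2))
print(format_validity_reward(bad_id, known, expected_steps=2))
```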
[0018] Preferably, the hybrid reward function is:
[0019] $$R(D) = \lambda_{1}\, R_{\text{sem}} + \lambda_{2}\, R_{\text{fmt}}(D)$$
[0020] $$R_{\text{sem}} = \mathrm{CosineSim}\left(v_{\text{video}},\, v_{\text{text}}\right)$$
[0021] where $R$ represents the hybrid reward function; $R_{\text{sem}}$ represents the semantic alignment reward; $R_{\text{fmt}}$ represents the format validity binary reward; $v_{\text{video}}$ represents the video visual feature vector; $v_{\text{text}}$ represents the text feature vector; $D$ is the sparse state update sequence; $\lambda_{1}$ is the weight parameter for the semantic alignment reward term; $\lambda_{2}$ is the weight parameter for the format validity binary reward term; CosineSim(·,·) is the cosine similarity calculation function, used to calculate the similarity between the video visual feature vector and the text feature vector; $T$ is the time series length parameter, representing the total number of time steps or total number of frames corresponding to the video segment rendered after the candidate sparse state update sequence is injected into the virtual DOM environment.
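As a non-limiting numerical sketch, the fusion of the two rewards can be illustrated with a weighted sum. The weighted-sum form itself, the default weights `w_sem` and `w_fmt`, and the toy feature vectors are assumptions for illustration; in the embodiments the feature vectors come from the video-aware encoder, which is not reproduced here.

```python
import math

def cosine_sim(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def hybrid_reward(video_vec, text_vec, format_valid, w_sem=1.0, w_fmt=0.5):
    """Assumed fusion: weighted sum of semantic alignment and format validity."""
    r_sem = cosine_sim(video_vec, text_vec)
    r_fmt = 1.0 if format_valid else 0.0
    return w_sem * r_sem + w_fmt * r_fmt

# Toy vectors standing in for encoder outputs.
aligned = hybrid_reward([1.0, 0.0], [1.0, 0.0], format_valid=True)
orthogonal_invalid = hybrid_reward([1.0, 0.0], [0.0, 1.0], format_valid=False)
print(aligned, orthogonal_invalid)
```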
[0022] Preferably, estimating the advantage function using the relative reward of candidate sequences within the group, and updating the parameters of the multimodal large language model subject to a KL divergence constraint, includes: For the same input sample, the current policy network is used to generate a set of multiple candidate sparse state update sequences through parallel sampling. Calculate the total reward score for each candidate sparse state update sequence in the set, and normalize the scores of all candidate sparse state update sequences in the set using the mean and standard deviation within the set to obtain the relative advantage value of each candidate sparse state update sequence with respect to the other sequences in the set. If the reward score of a candidate sparse state update sequence is higher than the group average, its relative advantage value is positive; otherwise it is negative. This estimates the advantage of an action without relying on an independent value network. Construct a policy optimization objective function consisting of two parts: a clipped importance sampling term and a KL divergence penalty term.
In the clipped importance sampling term, the probability ratio of the new policy and the old policy generating the same candidate sparse state update sequence is calculated, and a proximal ratio clipping operation is performed on the ratio to limit it to a preset threshold range, producing a clipped probability ratio; the product of the clipped probability ratio and the relative advantage value is used as the main driving term of the policy update, preventing the policy from collapsing due to an excessively large single-step update. In the KL divergence penalty term, the KL divergence between the new policy currently being optimized and the base reference policy that has not been fine-tuned by reinforcement learning is calculated. This KL divergence value is then multiplied by a preset penalty coefficient and subtracted from the policy optimization objective function, constraining the behavior of the new policy from deviating from the grammatical and structural validity acquired during the base supervised fine-tuning stage.
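The group-relative advantage and the two-part objective can be sketched numerically as below, under stated assumptions. Log-probabilities are taken at the sequence level for brevity, the clipping range `clip_eps`, penalty coefficient `beta`, and stabilizer `eps` are placeholder values, and the KL term uses a common non-negative per-sample estimator; none of these specifics are given in the text. Following the description above, the driving term is the clipped ratio times the advantage; many PPO-style implementations additionally take the minimum with the unclipped product.

```python
import math
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one sampled group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def grpo_objective(logp_new, logp_old, logp_ref, advantages,
                   clip_eps=0.2, beta=0.04):
    """Average of clipped-ratio * advantage minus beta * KL estimate (to maximize)."""
    total = 0.0
    for ln, lo, lr, adv in zip(logp_new, logp_old, logp_ref, advantages):
        ratio = math.exp(ln - lo)                       # new / old probability
        clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
        # Assumed estimator of KL(new || ref); it is zero when ln == lr.
        kl = math.exp(lr - ln) - (lr - ln) - 1.0
        total += clipped * adv - beta * kl
    return total / len(advantages)

rewards = [1.0, 0.0, 0.5]            # toy hybrid rewards for one group
advantages = group_relative_advantages(rewards)
print(advantages)

# With identical new, old and reference policies the ratio is 1 and KL is 0,
# so the objective reduces to the mean advantage, which is zero by construction.
same = [-3.0, -4.0, -5.0]
print(grpo_objective(same, same, same, advantages))
```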
[0023] In a second aspect, embodiments of the present invention provide a vector animation generation system based on sparse state modeling and enhanced rendering perception, comprising:
A data receiving and multimodal fusion module, used to receive a natural language instruction prompt input by the user, initial static SVG code, and its corresponding initial bitmap image; perform structured parsing on the initial static SVG code to generate a persistent document object model tree structure, and assign globally unique identifiers to the nodes in the document object model tree structure; convert the initial bitmap image into a visual embedding vector sequence through a visual encoder, and perform feature fusion with a text embedding sequence containing the natural language instruction prompt and the document object model tree structure to obtain multimodal fusion features;
A structure-bound chain-of-thought generation module, used to generate a structure-bound chain of thought based on the multimodal fusion features using a multimodal large language model; the structure-bound chain of thought includes an entity recognition sub-step and a visual dynamic planning sub-step, which are used to map the abstract visual entities in the natural language instruction prompt to the globally unique identifiers of the nodes in the document object model tree structure, and to describe the temporal logic and motion causal relationships of the objects based on the globally unique identifiers to generate a motion blueprint;
A sparse state update sequence generation module, used to generate attribute difference sequences for the target nodes involved in the motion blueprint using the multimodal large language model, with the motion blueprint as a conditional constraint, wherein each attribute difference term is associated with the globally unique identifier of the corresponding target node; and to serialize the attribute difference sequence into a sparse state update sequence;
A vector animation building module, used to construct the final vector animation sequence based on the initial static SVG code and the sparse state update sequence.
[0024] In a third aspect, embodiments of the present invention provide a computer-readable storage medium having a computer program stored thereon, which, when executed by a processor, implements the vector animation generation method based on sparse state modeling and enhanced rendering perception described above.
[0025] In a fourth aspect, embodiments of the present invention provide a computer device, including a memory, a processor, and a computer program stored in the memory and executable on the processor. When the processor executes the program, it implements the vector animation generation method based on sparse state modeling and enhanced rendering perception described above.
[0026] As can be seen from the above technical solution, compared with the prior art, the present invention discloses a vector animation generation method and system based on sparse state modeling and enhanced rendering perception, which has the following beneficial effects: 1. Completely solves the context explosion problem of large models, achieving ultimate inference efficiency: Traditional autoregressive methods suffer from extremely high static information redundancy when generating animations due to the inherent defects of their representation architecture (i.e., full code output frame by frame). This invention reconstructs the underlying animation representation method and proposes a "Sparse State Update (SSU)" mechanism. By treating SVG as a persistent DOM tree and predicting only attribute difference sequences at time steps, this invention can filter out more than 85% of static redundant syntax.
[0027] 2. Provides mathematical-level topological isomorphism and absolute identity persistence guarantees, completely eliminating identity drift: Existing large models are prone to hallucinating irrelevant code or accidentally modifying the color and layer of background elements during sequence generation, causing originally static objects to flicker, misalign, or completely disappear, i.e., "identity drift." The sparse state modeling mechanism introduced in this invention forces all dynamic updates to be anchored exclusively to specific persistent IDs in the initial SVG DOM tree. Nodes not selected to participate in motion in the "identity-first motion planning" blueprint are never touched by the model decoder in the generated sequence. Therefore, nodes that do not participate in motion remain absolutely static throughout the entire time series, which mathematically eliminates "identity drift" or structural damage to non-participating objects. This ensures that the generated animation has perfect topological isomorphism, and its fidelity and structural stability far surpass those of schemes such as LiveSketch that rely on score distillation sampling (SDS) optimization, whether dealing with closed shapes, fill attributes, or complex occlusion relationships.
[0028] 3. Overcomes the rigidity bottleneck of affine transformations, so that large models directly master high-precision editing of geometric non-rigid deformations: Existing intelligent animation tools are often limited by the rigid transformation syntax of CSS or SMIL (i.e., matrix translation, rotation, and scaling). To enable large models to handle complex physical deformations (e.g., paper folding inward, biological tentacles waving, fluid edges blending), this invention pioneers a Rendering-Aware RL training pipeline. Through the group-relative policy evaluation mechanism of the GRPO algorithm, the system can efficiently use the high-quality continuous visual feedback extracted by the PE-Core video-aware encoder, applying it directly as a reward signal to adjust the discrete output of the model. This "semantic alignment reward" drives the large language model to break out of the local optimum of using only affine transformations (i.e., "conservative motion bias") and to actively explore and master fine-grained modification of the complex Bézier curve attributes (path data) in SVG <path> tags. This enables the system to accurately perform complex non-rigid physical deformations, greatly expanding the expressive boundaries and realism of text-generated vector animations.
[0029] 4. Improves semantic alignment and instruction-following accuracy in complex multi-object interaction scenarios, achieving a 100% success rate: To address the hallucinations of large language models under complex prompts, this invention employs an "identity-first motion planning" mechanism that forces the model to perform explicit chain-of-thought reasoning and entity binding before generating the final underlying code. Drawing on multimodal alignment, the large language model explicitly binds abstract concepts in natural language to specific code identifiers.
Description of the Drawings
[0030] To more clearly illustrate the technical solutions in the embodiments of the present invention or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are only embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on the provided drawings without creative effort.
[0031] Figure 1 is a schematic diagram of the vector animation generation method based on sparse state modeling and enhanced rendering perception provided in an embodiment of the present invention;
Figure 2 is a framework diagram of the vector animation generation method based on sparse state modeling and enhanced rendering perception provided in an embodiment of the present invention;
Figure 3 is a flowchart of the automatic topology normalization and dual-stream annotation pipeline used to construct the first large-scale high-quality vector animation benchmark dataset (SVGAnim-134k) provided in an embodiment of the present invention;
Figure 4 is a graph comparing the token consumption over the time series of full autoregression and sparse state update (SSU), quantifying the compression efficiency, provided in an embodiment of the present invention;
Figure 5 is a schematic diagram of the vector animation generation system based on sparse state modeling and enhanced rendering perception provided in an embodiment of the present invention.
Detailed Implementation
[0032] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.
[0033] This invention discloses a vector animation generation method based on sparse state modeling and enhanced rendering perception. As shown in Figure 1 and Figure 2, the method includes:
S1: Receive the natural language instruction, the initial static SVG code, and the corresponding initial bitmap image input by the user; perform structured parsing on the initial static SVG code to generate a persistent document object model tree structure, and assign globally unique identifiers to the nodes in the document object model tree structure; convert the initial bitmap image into a visual embedding vector sequence through a visual encoder, and fuse it with the text embedding sequence containing the natural language instruction and the document object model tree structure to obtain multimodal fusion features.
S2: Based on the multimodal fusion features, generate a structure-bound thought chain using a multimodal large language model. The structure-bound thought chain includes an entity recognition sub-step and a visual dynamic planning sub-step, which map the abstract visual entities in the natural language instruction to the globally unique identifiers of the nodes in the document object model tree structure and, based on those identifiers, describe the temporal logic and causal relationships of the objects' motion to generate a motion blueprint.
S3: Using the motion blueprint as a constraint, use the multimodal large language model to generate attribute difference sequences for the target nodes involved in the motion blueprint, where each attribute difference term is associated with the globally unique identifier of its target node; serialize the attribute difference sequence into a sparse state update sequence.
S4: Construct the final vector animation sequence from the initial static SVG code and the sparse state update sequence.
[0034] In this embodiment, the invention reconstructs the underlying animation representation. The initial static SVG is treated as a Document Object Model (DOM) tree that persists across the spatial and temporal dimensions of the animation. For any node in the DOM tree with a physical form (such as <path> (path), <g> (grouping), or <rect> (rectangle)), the system uses a topological traversal algorithm during the code standardization stage to assign a globally unique identifier (ID), such as ID=01 or ID=05, whose lifecycle spans the entire animation sequence.
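The ID assignment described above can be illustrated with a minimal sketch. It assumes that the "topological traversal" is a depth-first walk in document order, that IDs are zero-padded counters, and that they are stored in a hypothetical `data-id` attribute; none of these details are specified in the source.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"
# Tags treated as "nodes with a physical form" (an illustrative choice).
PHYSICAL_TAGS = {"path", "g", "rect"}


def assign_persistent_ids(svg_text: str) -> tuple[ET.Element, dict[str, ET.Element]]:
    """Parse SVG text and tag each physical node with a zero-padded ID in document order."""
    root = ET.fromstring(svg_text)
    index: dict[str, ET.Element] = {}
    counter = 0
    for node in root.iter():  # iter() walks the tree depth-first, in document order
        local_name = node.tag.split("}")[-1]  # strip the "{namespace}" prefix
        if local_name in PHYSICAL_TAGS:
            counter += 1
            node_id = f"{counter:02d}"
            node.set("data-id", node_id)  # hypothetical attribute name
            index[node_id] = node
    return root, index


svg = (
    f'<svg xmlns="{SVG_NS}" viewBox="0 0 100 100">'
    '<g><rect x="10" y="10" width="80" height="80" fill="#cccccc"/>'
    '<path d="M 20 20 L 80 80" stroke="#000000"/></g></svg>'
)
root, index = assign_persistent_ids(svg)
```

Because the IDs are written once on the first frame and never regenerated, every later update can refer to a node by ID alone.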
[0035] Specifically, S1 of the present invention includes:
S1.1: Structured Parsing and Document Object Model Tree Structure Construction. Perform structured parsing on the received initial static SVG code to build a persistent document object model tree structure with hierarchical nesting relationships; based on the topological traversal algorithm, assign a globally unique identifier to each node with a physical form in the document object model tree structure. Such nodes include path nodes, grouping nodes, rectangle nodes, and special-effect filter nodes. The lifecycle of the globally unique identifier spans the entire animation sequence and is used to anchor the attribute changes of specific objects in subsequent steps.
S1.2: Multimodal Embedding Vector Generation. The initial bitmap image is encoded using a pre-trained visual encoder to extract its high-dimensional visual representation features, which are then projected into the token space of the multimodal large language model to generate a continuous sequence of visual embedding vectors. At the same time, the natural language instruction and the document object model tree structure containing the globally unique identifiers generated in S1.1 are serialized into a discrete text embedding sequence.
S1.3: Dense Cross-Modal Feature Alignment and Fusion. The visual embedding vector sequence and the text embedding sequence are input into the Transformer decoder of the multimodal large language model. A dense cross-modal self-attention mechanism computes the attention weights between the visual embedding vectors and each token in the text embedding sequence, so that the continuous visual representation features of the initial bitmap image are semantically aligned with the discrete hierarchical structure and globally unique identifiers in the document object model tree structure. The output is a multimodal fusion feature that integrates visual perception information and code structure information.
[0036] In this embodiment, to achieve cross-modal mapping from abstract textual intent to precise underlying coordinate modification, the system employs a decoupled design during the inference process. This invention formalizes the generation task as a process of solving a joint probability distribution. Given an input set x=(I0, S0, P), where I0 represents the initial bitmap image corresponding to the initial static SVG code S0, S0 represents the input initial static SVG code, and P represents the natural language description instruction prompt; the ultimate goal is to autoregressively generate a sparse state update sequence D={Δt|t=1,2,...,T}.
[0037] To bridge the gap between abstract semantics and tens of thousands of underlying code parameters, this invention introduces an explicit intermediate latent variable: the Structure-Bound Chain-of-Thought (CoT), denoted C. The joint probability distribution under which the multimodal large language model generates the above difference sequence is therefore factorized into a two-stage, coarse-to-fine inference process:
$$p_\theta(C, D \mid x) = p_\theta(C \mid x)\, p_\theta(D \mid C, x)$$
[0038] In the above formula, $p_\theta(C \mid x)$ represents the conditional probability of generating a structure-bound thought chain C given the input set x, corresponding to the global planning stage; $p_\theta(D \mid C, x)$ represents the conditional probability of generating a sparse state update sequence D under the joint constraints of the structure-bound thought chain C and the input set x, corresponding to the underlying execution stage.
[0039] After receiving the input set x=(I0, S0, P), the multimodal large language model first generates the structure-bound thought chain, completing entity recognition and visual dynamic planning in sequence to output a motion blueprint for each node bound to a globally unique identifier. The first-frame SVG DOM tree obtained by parsing S0 is used only as an internal structural representation that participates in context construction, node localization, and attribute verification; it is not an independent input object alongside I0, S0, and P.
[0040] Entity Identification: Explicitly binds the visual elements described in the input instructions to specific globally unique identifiers (IDs) in the first frame of the SVG DOM tree, parsed from the initial static SVG code S0. Through powerful multimodal alignment capabilities, the model outputs something like: "The 'orange game console ball' in natural language corresponds to the persistent identifier ID=06 in the code structure, and the shell corresponds to ID=02."
[0041] Visual Dynamic Planning: Based on a specific anchored ID, it uses logical natural language to describe the subsequent temporal motion logic and causal constraints of the object. For example, the output is: "In order to achieve the text prompt 'The sphere gradually shrinks and falls below,' the node with ID=06 needs to generate a rigid downward translation on the vertical axis, while its path needs to be edited to show a non-rigid shrinking deformation in subsequent frames, and it needs to disappear in frame 15 (opacity=0)."
[0042] Specifically, S2 includes:
S2.1: Thought Chain Prompt Construction and Context Injection. Construct a structured prompt template containing few-shot examples, and inject the multimodal fusion features, the natural language instruction, and the globally unique identifiers and hierarchical relationships of each node in the document object model tree structure into the input context window of the multimodal large language model.
S2.2: Entity Recognition and Explicit ID Mapping. Guide the multimodal large language model to execute the entity recognition sub-step of the thought chain reasoning: analyze the semantic subject of the natural language instruction and identify the abstract visual entity it refers to; guided by the multimodal fusion features, match the identified abstract visual entities with the nodes in the document object model tree structure that have the corresponding visual representation, and explicitly output the globally unique identifier of each such node, establishing a binding relationship between natural language entities and globally unique identifiers.
S2.3: Visual Dynamic Planning and Motion Blueprint Generation. Guide the multimodal large language model to execute the visual dynamic planning sub-step of the thought chain reasoning: based on the binding relationship, analyze the temporal logic and motion causality implicit in the instruction to define the motion type required for each node bound to a globally unique identifier. The motion type is either rigid (translation, rotation, and scaling) or non-rigid path deformation. For nodes requiring non-rigid path deformation, the evolution trajectory of their geometric control points is further planned. A structured motion blueprint is then generated, described in natural language combined with pseudocode. The blueprint explicitly lists the expected state changes and attribute modification strategy of each globally unique identifier node at each time step, and serves as the constraint for the subsequent generation of attribute difference sequences.
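One way to picture the motion blueprint is as a small structured record per bound node. The sketch below is illustrative only: the class names, field names, and motion-type labels are hypothetical, and the source describes the blueprint as natural language combined with pseudocode rather than as a fixed schema. The example values follow the "orange game console ball" case above.

```python
from dataclasses import dataclass, field


@dataclass
class NodePlan:
    node_id: str             # persistent ID from the DOM tree, e.g. "06"
    entity: str              # natural-language entity bound to that ID
    motion_types: list[str]  # hypothetical labels, e.g. "translate", "path_deform"
    keyframes: dict[int, str] = field(default_factory=dict)  # time step -> expected state


@dataclass
class MotionBlueprint:
    plans: list[NodePlan]

    def target_ids(self) -> set[str]:
        """IDs of the nodes that are allowed to change in later steps."""
        return {plan.node_id for plan in self.plans}

    def unknown_ids(self, known_ids: set[str]) -> list[str]:
        """IDs referenced by the blueprint that are absent from the DOM tree."""
        return [plan.node_id for plan in self.plans if plan.node_id not in known_ids]


blueprint = MotionBlueprint(
    plans=[
        NodePlan(
            node_id="06",
            entity="orange game console ball",
            motion_types=["translate", "path_deform"],
            keyframes={15: "opacity=0"},
        )
    ]
)
```

Under this reading, any node whose ID is not in `target_ids()` is excluded from the attribute difference sequence generated in S3.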
[0043] In this embodiment, consider the set of all node attributes with dynamic potential at frame $t$. The Sparse State Update (SSU) of frame $t$, denoted $\Delta_t$, is defined as the strict set of attribute differences that occurred relative to the previous frame:
$$\Delta_t = \{(ID, attr, v_t) \mid v_t \neq v_{t-1}\}$$
[0044] In the above formula, ID represents the globally unique identifier of the DOM node, attr represents the name of the changed attribute, and $v_t$ represents the attribute value after normalization at time step t; the attributes include the path definition attribute d for non-rigid deformation, the transformation attribute transform for rigid motion, and fill, stroke, opacity, or fill-opacity for appearance changes. The entire complex animation sequence is thus redefined as a highly compressed combination of incremental sequences.
[0045] Furthermore, the sequence combination is uniformly represented as (S0, Δ1, ..., ΔT), where S0 represents only the initial static SVG code as the basic input for all subsequent sparse state updates; Δ1 to ΔT represent the attribute difference updates that occur at each time step relative to the previous frame. The first frame SVG DOM tree obtained by parsing S0 is only used as a structural representation for node localization and attribute verification.
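The representation (S0, Δ1, ..., ΔT) can be replayed into full per-frame states by applying each difference set to the previous frame. The following sketch assumes a simplified in-memory state (node ID mapped to an attribute dictionary) rather than a real SVG DOM; the type aliases and function name are hypothetical.

```python
from copy import deepcopy

State = dict[str, dict[str, str]]   # node ID -> {attribute name -> value}
Delta = list[tuple[str, str, str]]  # [(node ID, attribute name, new value), ...]


def replay(initial: State, deltas: list[Delta]) -> list[State]:
    """Rebuild every frame by applying each delta to a copy of the previous frame."""
    frames = [deepcopy(initial)]
    for delta in deltas:
        frame = deepcopy(frames[-1])
        for node_id, attr, value in delta:
            if node_id not in frame:
                # Updates may only target nodes that exist in the first frame.
                raise KeyError(f"delta references unknown node ID {node_id!r}")
            frame[node_id][attr] = value
        frames.append(frame)
    return frames


initial = {
    "02": {"fill": "#cccccc"},
    "06": {"transform": "translate(0,0)", "opacity": "1"},
}
deltas = [
    [("06", "transform", "translate(0,10)")],
    [],  # nothing changes at this time step
    [("06", "opacity", "0")],
]
frames = replay(initial, deltas)
```

Node 02 never appears in any delta, so it is carried forward unchanged in every frame, which is the property the text relies on for static elements.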
[0046] Serialized Representation Mechanism: When constructing the input and output prompt specifications for the large language model, the system uses special control tags to serialize the hierarchical difference tree into one dimension. Specifically, `<|time=t|>` marks the boundary of the current frame's time step, and `<|ID=ID|>` explicitly anchors the persistent SVG DOM node to be modified.
[0047] As can be seen from Figure 4, under the same 24-frame animation generation task, if the full SVG code is output frame by frame, the cumulative token consumption grows rapidly with the number of frames, approaching approximately 86,000 tokens in the later part of the sequence, which is very likely to overflow the context window. With the sparse state update representation of the present invention, the model only needs to output the initial static SVG code S0 and the difference terms Δ1 to ΔT that change at each time step, for a cumulative consumption of approximately 9,200 tokens and a compression ratio of approximately 9.86 times. This shows that by retaining the repetitive static topology in the persistent SVG DOM and sparsely modeling only the dynamic attributes, the present invention significantly reduces the context burden of long-sequence animation generation and lets the model's representational capacity focus on the actual motion state transitions, thereby improving the stability, scalability, and execution efficiency of long-sequence vector animation generation.
[0048] Specifically, S3 includes:
S3.1: Keyframe State Extraction and Baseline Alignment. Analyze the motion blueprint and identify the key time steps defined therein and their corresponding target state descriptions; read the initial attribute values of each node in the document object model tree structure and set them as the baseline state at time step t=0; for each globally unique identifier node specified in the motion blueprint, extract its expected attribute values at each key time step to form a set of state changes to be processed.
S3.2: Strict Attribute Difference Calculation and Sparse State Update Generation. Traverse the set of state changes to be processed in ascending order of time steps, and perform normalization and a strict attribute difference operation between the current time step t and the previous time step t−1 on each candidate attribute of each globally unique identifier node. Specifically, the attribute values are first normalized, which includes at least: removing redundant whitespace characters, unifying numerical precision, unifying the color representation format, unifying the coordinate representation format, and unifying the order of transformation parameters. A target state table for the current time step t and a historical state table for the previous time step t−1 are established. For any attribute attr of any node ID, the normalized attribute value at time step t is denoted $v_t$, and the normalized value of the same attribute at the previous time step t−1 is denoted $v_{t-1}$. If $v_t \neq v_{t-1}$, the attribute is determined to have changed effectively at time step t and an attribute difference term (ID, attr, $v_t$) is generated; if the two are equal, no difference term is generated.
[0049] All attribute difference terms determined to have undergone effective changes within the same time step t are aggregated into a time step increment set $\Delta_t = \{(ID, attr, v_t) \mid v_t \neq v_{t-1}\}$, where $\Delta_t$ represents the set of attribute increments at time step t, ID represents the globally unique identifier of the node, attr represents the attribute name, and $v_t$ represents the normalized value of attribute attr of node ID at time step t. Among the geometric attributes, the path definition attribute d represents non-rigid geometric deformation and the transformation attribute transform represents translation, rotation, or scaling; the appearance attributes fill, stroke, opacity, and fill-opacity represent fill color, stroke color, overall opacity, and fill opacity, respectively.
[0050] Furthermore, following the hierarchical relationship of "time step - node identifier - attribute item", all time step increment sets are organized into a hierarchical combination of increment sequences $D = \{\Delta_1, \Delta_2, \ldots, \Delta_T\}$, where each $\Delta_t$ contains only the attribute values added or changed relative to time step t−1, without repeating nodes and attributes that have not changed.
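The normalize-then-compare logic of S3.2 can be sketched as follows. The normalization here covers only whitespace collapsing, hex color case, and numeric rounding; the precision of two decimals and the helper names are assumptions, and the remaining normalization rules listed above (coordinate format, transform parameter order) are omitted for brevity.

```python
import re

State = dict[str, dict[str, str]]   # node ID -> {attribute name -> value}
Delta = list[tuple[str, str, str]]  # [(node ID, attribute name, normalized value), ...]


def normalize(value: str, precision: int = 2) -> str:
    """Collapse whitespace, lowercase hex colors, and round numbers to a fixed precision."""
    value = " ".join(value.split())
    if re.fullmatch(r"#[0-9a-fA-F]{3,8}", value):
        return value.lower()

    def fix(match: re.Match) -> str:
        rounded = round(float(match.group()), precision) + 0.0  # "+ 0.0" turns -0.0 into 0.0
        text = f"{rounded:.{precision}f}"
        return text.rstrip("0").rstrip(".") if "." in text else text

    return re.sub(r"-?\d+(?:\.\d+)?", fix, value)


def strict_diff(prev: State, curr: State) -> Delta:
    """Emit (ID, attr, v_t) only where the normalized value differs from time step t-1."""
    delta: Delta = []
    for node_id in sorted(curr):
        for attr in sorted(curr[node_id]):
            v_t = normalize(curr[node_id][attr])
            v_prev = prev.get(node_id, {}).get(attr)
            if v_prev is None or normalize(v_prev) != v_t:
                delta.append((node_id, attr, v_t))
    return delta


prev = {"02": {"fill": "#CCCCCC"}, "06": {"transform": "translate(0, 0)", "opacity": "1"}}
curr = {"02": {"fill": "#cccccc"}, "06": {"transform": "translate(0,  10.004)", "opacity": "1.0"}}
delta = strict_diff(prev, curr)
```

In this example only the `transform` of node 06 survives the comparison: the color case change and the `1` versus `1.0` opacity are removed by normalization, so they do not produce spurious difference terms.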
[0051] S3.3: Serialization Representation and Command Output Based on Special Control Tags. The hierarchical incremental sequence combination is converted into a one-dimensional serialized string in a predetermined order, which includes at least: time steps in ascending order; nodes within the same time step ordered by globally unique identifier or by DOM traversal order; and attributes under the same node ordered by a preset attribute priority. During serialization, a time step control tag is inserted for each time step, a node control tag is inserted for each node, and a separator between attribute name and attribute value is inserted for each attribute, forming a linear token sequence that a large language model can process directly. The one-dimensional serialized string must include at least the time step control tag <|time=t|> and the node control tag <|ID=ID|>, with the corresponding attribute difference terms concatenated in order after their control tags. Reserved characters in attribute values are escaped or encapsulated to avoid ambiguity between attribute value content and control tags. The result is a sparse state update sequence which, together with the initial static SVG code, serves as input to the subsequent animation construction module to restore the complete animation state sequence step by step.
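A round-trip sketch of this serialization is given below. Only the two control tags come from the text; the `attr="value";` layout, the use of JSON string quoting as the encapsulation, the `\u007c` escape for a literal `<|` inside values, and the alphabetical attribute order (standing in for the preset attribute priority) are all assumptions made for illustration.

```python
import json
import re

Delta = list[tuple[str, str, str]]  # [(node ID, attribute name, value), ...]

TOKEN = re.compile(r'<\|time=(\d+)\|>|<\|ID=([^|]+)\|>|([\w-]+)=("(?:[^"\\]|\\.)*");')


def serialize(deltas: list[Delta]) -> str:
    """Flatten time step -> node -> attribute into one string with control tags."""
    parts = []
    for t, delta in enumerate(deltas, start=1):
        parts.append(f"<|time={t}|>")
        by_node: dict[str, list[tuple[str, str]]] = {}
        for node_id, attr, value in delta:
            by_node.setdefault(node_id, []).append((attr, value))
        for node_id in sorted(by_node):
            parts.append(f"<|ID={node_id}|>")
            for attr, value in sorted(by_node[node_id]):
                # JSON quoting encapsulates the value; "<|" is escaped so it cannot mimic a tag.
                quoted = json.dumps(value).replace("<|", "<\\u007c")
                parts.append(f"{attr}={quoted};")
    return "".join(parts)


def deserialize(text: str) -> list[Delta]:
    """Inverse of serialize; assumes time steps are contiguous and start at 1."""
    deltas: list[Delta] = []
    node_id = None
    for match in TOKEN.finditer(text):
        time_step, new_id, attr, raw = match.groups()
        if time_step is not None:
            deltas.append([])
            node_id = None
        elif new_id is not None:
            node_id = new_id
        else:
            deltas[-1].append((node_id, attr, json.loads(raw)))
    return deltas


deltas = [
    [("06", "transform", "translate(0, 10)")],
    [],
    [("06", "opacity", "0"), ("02", "fill", "#ff0000")],
]
text = serialize(deltas)
```

Emitting a `<|time=t|>` tag even for an empty time step keeps frame indices aligned when the string is parsed back.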
[0052] In this embodiment, since the invention involves the pre-training and fine-tuning of large models, its effectiveness depends heavily on the quality, topological accuracy, and construction method of the dataset. Existing public datasets (such as DeepSVG or icon sketch datasets) either contain only static images or are in bitmap format, and lack structured dynamic labels, hierarchical grouping, and closed shapes. This invention therefore includes a dedicated automated data construction and preprocessing pipeline, with which the first large-scale benchmark dataset covering 134,000 high-quality vector animations, SVGAnim-134k, was constructed, as shown in Figure 3.
[0053] Step a: Lottie Data Acquisition and Topological Canonicalization: The system acquires a large number of high-quality parametric Lottie animation files (usually in JSON format) from professional platforms (such as the UI and web design resource library Flaticon). To transform them into an explicit sequence structure that can be learned by large models, the system expands each sequence frame using a built-in Node.js-based renderer. To ensure the learnability and generalization of subsequent neural networks, the following rigorous normalization process is performed: Standardize the bounding boxes of all input canvases to a uniform resolution.
[0054] By applying a coordinate transformation matrix, all absolute coordinates in the file are parsed and remapped to relative coordinates, which greatly reduces the spatiotemporal difficulty of large model regression prediction of dense floating-point values and effectively compresses the sequence length.
[0055] A tree-structured hash algorithm is used to guarantee that the SVG DOM trees of all expanded frames in the same sequence maintain strict topological invariance. At this point, a persistent global ID is assigned to every tag with dynamic potential (such as <path>, <rect>, <g> layer groupings, and special-effect filters such as <feColorMatrix>).
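The source does not specify the tree-structured hash algorithm. One plausible reading is a Merkle-style hash over tag names and child order that ignores attribute values, so that two frames hash equally exactly when their DOM topology matches. The sketch below follows that assumption; the function names are hypothetical.

```python
import hashlib
import xml.etree.ElementTree as ET


def topology_hash(node: ET.Element) -> str:
    """Hash tag names and child order only; attribute values are deliberately ignored."""
    child_hashes = "".join(topology_hash(child) for child in node)
    payload = f"{node.tag}({child_hashes})"
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def frames_share_topology(frames: list[str]) -> bool:
    """True when every frame's DOM tree has the same structure."""
    hashes = {topology_hash(ET.fromstring(svg)) for svg in frames}
    return len(hashes) <= 1


frame_a = '<svg><g><rect x="0"/><path d="M0 0 L1 1"/></g></svg>'
frame_b = '<svg><g><rect x="5"/><path d="M0 0 L2 3"/></g></svg>'  # same tree, new values
frame_c = '<svg><g><rect x="5"/></g></svg>'                        # path node missing
```

Here `frame_a` and `frame_b` differ only in attribute values and therefore share a hash, while `frame_c` drops a node and is rejected.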
[0056] Step b: Dynamic difference sequence extraction and dual-stream annotation: Using a powerful multimodal language model (such as the Doubao-Seed-1.6 VLM fine-tuned with customized instructions, used in this embodiment), multi-view analysis is performed on the standardized bitmap rendering video sequence V (the temporal video observations obtained by rasterizing the SVG animation sequence) and the corresponding initial static SVG code S0. The first-frame SVG DOM tree parsed from S0 is used only as an internal structural representation for node localization, hierarchical constraints, and consistency checks, and is not an independent input object of the multimodal language model at this point. On this basis, the system automatically generates two high-quality supervised text streams:
User-Centric Prompt: simulates the tone of a human user at the terminal and describes only the visual phenomena of the animation presented on the screen (e.g., "The water droplet in the middle of the screen is falling, and after hitting the bottom, it creates ripples that spread outwards"), providing the semantic conditional input.
[0057] Structure-Bound CoT: serves as the core logical ground truth and, following the format of Example 2, bridges the gap between the visual language and the underlying code nodes.
[0058] To eliminate potential hallucinations during VLM annotation (such as writing non-existent IDs into the thought chain or misidentifying polygon types), the system applies a strict ID consistency filter before data entry: it automatically scans the generated thought chains, uses regular expressions to extract all IDs they reference, and cross-validates these IDs against the DOM parse tree of the initial static SVG code. If a thought chain references an invalid ID or has a type mismatch, that sample and all its sequence frames are discarded immediately and silently. This mechanism ensures that all supervised training data in the dataset is anchored to the physical structure, laying a reliable foundation for the subsequent reinforcement learning stage.
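The ID consistency filter can be sketched with a regular expression and a set comparison. The `ID=NN` mention format follows the examples in this description, while the `data-id` attribute name is hypothetical; the type-mismatch check mentioned above is not covered by this sketch.

```python
import re
import xml.etree.ElementTree as ET

ID_PATTERN = re.compile(r"ID=(\d+)")


def referenced_ids(chain_of_thought: str) -> set[str]:
    """All node IDs that the thought chain mentions."""
    return set(ID_PATTERN.findall(chain_of_thought))


def passes_id_filter(chain_of_thought: str, svg_text: str) -> bool:
    """Keep a sample only if every referenced ID exists in the parsed first frame."""
    root = ET.fromstring(svg_text)
    known = {node.get("data-id") for node in root.iter() if node.get("data-id")}
    refs = referenced_ids(chain_of_thought)
    return bool(refs) and refs <= known


svg = '<svg><rect data-id="02"/><path data-id="06" d="M0 0"/></svg>'
cot_ok = "The ball corresponds to ID=06 and the shell corresponds to ID=02."
cot_bad = "The ball corresponds to ID=09."
```

A sample whose thought chain mentions no ID at all is also rejected here, on the assumption that an unbound chain carries no usable supervision.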
[0059] Step c, Stage I, Structured Supervised Fine-Tuning (SFT) training method: Using a subset of the constructed SVGAnim-134k (e.g., 123k SFT training samples), full-parameter supervised fine-tuning is performed on the backbone network of the initial multimodal large language model (e.g., a model with approximately 8 billion parameters). A causal language modeling paradigm is adopted: the model parameters are adjusted by minimizing the cross-entropy loss (equivalently, maximizing the joint likelihood), which provides the "cold-start" initialization of the model.
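For reference, a standard causal language modeling objective consistent with this description can be written as below. This is a generic form given as an assumption, not the formula of the original filing; the symbol $y$ (the target token sequence formed by the thought chain C followed by the sparse state update sequence D) is introduced here for illustration.

```latex
\mathcal{L}_{\mathrm{SFT}}(\theta)
  = -\,\mathbb{E}_{(x,\,C,\,D)}
    \left[ \sum_{j=1}^{|y|} \log p_\theta\!\left(y_j \mid y_{<j},\, x\right) \right],
  \qquad y = (C, D)
```

Summing the token-level log-probabilities over $y$ equals $\log p_\theta(C \mid x) + \log p_\theta(D \mid C, x)$, so minimizing this loss maximizes the joint likelihood of the two-stage factorization.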
[0060] At this stage, the model has mastered basic SVG syntax (SVG Literacy), DOM tree hierarchy awareness, the ability to output logical formats based on thought chains, and the ability to follow YAML / JSON differential prediction.
[0061] In this embodiment, in order to bridge the semantic gap between "discrete character code generation" and "continuous smooth visual perception", and to enable the large language model to learn to actively manipulate path data instead of simply applying matrix translation, this embodiment creatively proposes and applies a model-free reinforcement learning framework based on Group Relative Policy Optimization (GRPO).
[0062] In traditional reinforcement learning, the commonly used PPO (Proximal Policy Optimization) algorithm requires training a value network (critic model) of a scale comparable to the policy model to predict the baseline value of each state. When fine-tuning large language models with tens or hundreds of billions of parameters, this often exhausts GPU memory and incurs extremely high computational cost. The GRPO algorithm used in this invention dispenses with the critic model. Instead, it samples multiple candidate outputs for the same prompt to form an "exploration group" and compares the relative reward scores of the candidates within the group to estimate the baseline of the advantage function. This significantly reduces GPU memory pressure and guides the model more smoothly and efficiently toward policies that use geometric deformation within a vast action space.
[0063] Sub-step 4.1 Candidate code group sampling and establishment of a visual rendering closed loop based on a headless browser: Before each reinforcement learning update, the full contextual condition x=(I0, S0, P) is provided for each input. The large language model, acting as the policy network, then generates a group of G candidate output sequences {o1, ..., oG} by parallel sampling from its current probability distribution. Each candidate output is oi=(Ci, Di), where Ci is the reasoning chain and Di is the sparse state update sequence of the i-th candidate. The system uses a headless browser to inject each Di into a virtual DOM and render it into a continuous video clip for subsequent reward calculation.
[0064] Sub-step 4.2 Design and calculation of the dual-channel hybrid reward mechanism: The reinforcement learning signal designed for this system must account for the two most important criteria for large language models in code generation tasks: the semantic accuracy of the final rendered video and the syntactic validity of the underlying code. The system therefore uses a dual-objective reward function:
A. Semantic Alignment Reward: To enable the model to precisely control complex Bézier curve deformations (e.g., the instruction "the elephant raises its trunk, and the tip curls upward"), the system needs highly sensitive, high-fidelity visual similarity feedback. This embodiment employs an advanced multimodal video perception encoder (such as the Perception Encoder (PE-Core) model family developed by Meta). Conventional visual feature extractors often take only abstract classification features from the network's output layer, whereas the embeddings best suited to subtle spatial displacements, dynamic tracking, and fine-grained pixel textures reside in the intermediate hidden layers of the network. Therefore, this invention feeds the rendered two-dimensional raster video segment described above into the PE-Core intermediate encoding network, with its output layer truncated, to extract deep, high-dimensional video visual feature vectors. At the same time, a text feature vector is extracted from the user's natural language prompt by the text encoder of the same architecture. Finally, the system calculates the cosine similarity between these two continuous vectors, which serves as the dominant perceptual reward value.
[0065] This visual reward signal directly prompts the policy model to abandon the safe but rigid affine transformation matrix (the transform attribute) and instead actively explore the high-dimensional geometric space, despite its curse of dimensionality, finely altering the control points of the path d attribute to obtain higher perceptual reward scores.
[0066] B. Format Validity Validation and Constraint Reward: To prevent the large language model, in its pursuit of high scores during reinforcement exploration, from generating degenerate code, broken XML tags, or invalid JSON difference files, this embodiment incorporates a strict, rule-based binary constraint penalty term.
[0067] This negative feedback signal constrains the exploration of the candidate group to the region of outputs that are executable as code and structurally valid. In summary, the overall hybrid reward obtained by a single action sequence is defined as a linearly weighted sum of the semantic alignment reward and the binary format validity reward.
[0068] Typically, the two weighting coefficients are set proportionally. In the hybrid reward, the semantic alignment reward and the binary format validity reward are each scaled by their own weight parameter; the semantic alignment reward is computed from the visual feature vector of the video and the text feature vector; D is the sparse state update sequence; CosineSim(·,·) is the cosine similarity function, used to calculate the similarity between the video visual feature vector and the text feature vector; and T is the time series length parameter, representing the total number of time steps (or total number of frames) of the video segment rendered after the candidate sparse state update sequence is injected into the virtual DOM environment.
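The hybrid reward can be sketched as follows. The weights 0.8 and 0.2 and the reward values 1.0 and 0.0 for the format term are placeholders chosen for illustration and are not taken from the source; the feature vectors are assumed to be already pooled over the T rendered frames, and the validity check covers XML well-formedness only.

```python
import math
import xml.etree.ElementTree as ET


def cosine_sim(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def format_reward(svg_text: str) -> float:
    """Binary validity check: 1.0 if the text parses as XML, else 0.0 (assumed values)."""
    try:
        ET.fromstring(svg_text)
    except ET.ParseError:
        return 0.0
    return 1.0


def hybrid_reward(
    video_feat: list[float],
    text_feat: list[float],
    svg_text: str,
    w_sem: float = 0.8,  # placeholder weight for the semantic alignment term
    w_fmt: float = 0.2,  # placeholder weight for the format validity term
) -> float:
    """Linear weighted fusion of the semantic alignment and format validity rewards."""
    return w_sem * cosine_sim(video_feat, text_feat) + w_fmt * format_reward(svg_text)


reward = hybrid_reward([1.0, 0.0], [1.0, 0.0], "<svg/>")
```

In a real pipeline the two feature vectors would come from the video and text sides of the perception encoder, and `svg_text` would be the animation reconstructed from a candidate sparse state update sequence.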
[0069] Sub-step 4.3 Gradient update mechanism based on Group Relative Policy Optimization (GRPO): This invention updates the weight parameters of the multimodal large language model (i.e., the policy network) by maximizing a predefined objective function. The update follows the GRPO algorithm, whose objective function aims to maximize the advantage estimate of the policy network and incorporates key constraints to ensure the stability and safety of training.
[0070] Advantage evaluation: The model first samples a group of candidate sparse state update sequences in parallel for the same input. The total reward score of each sequence is calculated and normalized relative to the other sequences in the group (i.e., by subtracting the group mean and dividing by the group standard deviation), which efficiently estimates the relative advantage of each candidate. This relative evaluation mechanism avoids the computationally intensive value network of traditional methods.
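The group-relative normalization described above amounts to a z-score within each sampled group. A minimal sketch follows; the small constant `eps` is an assumption added to avoid division by zero when all candidates receive the same reward.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Subtract the group mean and divide by the group standard deviation."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation over the group
    return [(r - mu) / (sigma + eps) for r in rewards]


advantages = group_relative_advantages([0.2, 0.4, 0.6, 0.8])
```

Candidates that score above the group average receive a positive advantage and are reinforced, while those below the average are suppressed, without any learned value network.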
[0071] Policy gradient optimization and constraints: Policy parameters are updated using a modified proximal policy optimization (PPO) objective function. This objective function contains two core constraints that keep the learning of the new policy both efficient and safe. Proximal ratio clipping (PPO-style clipping): The objective function includes a clipping operation that limits how far the new policy can move from the old policy. Specifically, in the clipped importance sampling term, the probability ratio of the new policy to the old policy for generating the same candidate sparse state update sequence is calculated, and this ratio is clipped to a preset threshold range to produce a clipped probability ratio. The product of the clipped probability ratio and the relative advantage value serves as the main driving term for policy updates. This ensures that when the policy explores new, high-reward regions of the action space, no single step is too large, which prevents irreversible collapse of policy performance during training.
[0072] KL divergence penalty: The objective function also includes a Kullback-Leibler (KL) divergence penalty term. This term measures the difference between the new policy being optimized and the baseline reference policy that has not been fine-tuned by reinforcement learning. Penalizing this divergence (with strength controlled by the coefficient $\beta$) constrains the behavior of the new model so that reward signals during reinforcement learning do not pull it far from the syntactic and structural validity acquired during the supervised fine-tuning phase, thus ensuring the compliance of the final code output.
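The two constraints described above can be combined in a short sequence-level sketch. It is an illustration under assumptions rather than the patented objective: the name `grpo_objective` and its parameters are hypothetical, the surrogate takes the minimum of the unclipped and clipped terms as in standard PPO, and the KL term uses a common non-negative per-sample estimator that the source does not specify.

```python
import math
from typing import Sequence


def grpo_objective(
    logp_new: Sequence[float],    # log-prob of each sequence under the new policy
    logp_old: Sequence[float],    # log-prob under the policy that sampled the group
    logp_ref: Sequence[float],    # log-prob under the frozen reference policy
    advantages: Sequence[float],  # group-relative advantage of each sequence
    clip_eps: float,              # half-width of the clipping range (assumed symmetric)
    beta: float,                  # KL penalty coefficient
) -> float:
    """Mean clipped surrogate minus a beta-weighted KL penalty (to be maximized)."""
    n = len(advantages)
    if n == 0 or not (len(logp_new) == len(logp_old) == len(logp_ref) == n):
        raise ValueError("all inputs must be non-empty and of equal length")
    total = 0.0
    for ln, lo, lr, adv in zip(logp_new, logp_old, logp_ref, advantages):
        ratio = math.exp(ln - lo)
        clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        # Standard PPO form (assumption): pessimistic minimum of the two terms.
        surrogate = min(ratio * adv, clipped * adv)
        # Assumed estimator of KL(new || ref): exp(x) - x - 1 with x = log(ref/new).
        x = lr - ln
        kl = math.exp(x) - x - 1.0
        total += surrogate - beta * kl
    return total / n
```

A training loop would maximize this value (or minimize its negative) with respect to the parameters that produce `logp_new`; the log-probabilities here are plain floats so that the arithmetic can be checked by hand.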
[0073] Through several rounds of gradient updates in this closed-loop, full-process rendering-aware reinforcement learning, the large model acquires an understanding of complex spatial geometry and physical motion laws, and constructs a cross-modal mapping from text semantic features to physical node displacement features.
[0074] To demonstrate the technical effects and novelty of this invention in computer graphics and text-to-animation tasks, its core evaluation indicators are compared below with current mainstream techniques on the standardized SVGAnim-Test validation set, using a structured comparison table followed by technical analysis.
Table 1: Comprehensive Comparison of Core Techniques and Evaluation Performance of Vector Animation Generation Methods in Different Categories
[0075] Technical analysis of the data and metrics. On resolving the identity drift caused by context overload and memory decay in long-sequence models: according to the comparative data in Table 1, advanced large language models such as GPT-5.2 have an inherent weakness in autoregressive sequence output. When generating extremely long plain-text code streams, the Transformer cross-attention module usually diverges severely as the number of time steps grows. As a result, when predicting the code of the next frame, the model often encodes a static element that was blue in the previous frame as red, or changes its hierarchical nesting order; this is "identity drift". The invention proposes a data compression and architectural constraint technique, the Sparse State Update (SSU) mechanism, which describes only the points or variables that actually change over time, using very short code segments, and removes the static baseline constants from the prediction network's field of view. At the level of the underlying mechanism, this removes the possibility of the prediction network interfering with or tampering with non-target DOM nodes, achieving a topological structure hierarchy fidelity of up to 100%.
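The idea that the SSU mechanism records only what changes can be sketched as a difference between consecutive states. This is a minimal illustration, not the patented procedure: the names `State`, `Delta`, and `sparse_updates` are hypothetical, each state is assumed to be a mapping from node ID to attribute values, and attribute values are assumed to be already normalized strings.

```python
from typing import Dict, List, Tuple

State = Dict[str, Dict[str, str]]   # node ID -> {attribute name: value}
Delta = List[Tuple[str, str, str]]  # (node ID, attribute name, new value)


def sparse_updates(frames: List[State]) -> List[Delta]:
    """Return, for each time step t >= 1, only the attributes that differ from t-1.

    Static nodes and unchanged attributes never appear in the output, so they
    cannot be rewritten by mistake downstream.
    """
    deltas: List[Delta] = []
    for prev, cur in zip(frames, frames[1:]):
        delta = [
            (node_id, attr, value)
            for node_id in sorted(cur)
            for attr, value in sorted(cur[node_id].items())
            if prev.get(node_id, {}).get(attr) != value
        ]
        deltas.append(delta)
    return deltas


# Hypothetical example: only the door rotates; the handle never appears.
frames = [
    {"door": {"transform": "rotate(0)", "fill": "#00f"}, "handle": {"fill": "#333"}},
    {"door": {"transform": "rotate(30)", "fill": "#00f"}, "handle": {"fill": "#333"}},
    {"door": {"transform": "rotate(30)", "fill": "#00f"}, "handle": {"fill": "#333"}},
]
updates = sparse_updates(frames)
```

In this example the blue fill of the door is never restated after the first frame, which is the property the paragraph credits with preventing identity drift.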
[0076] The reinforcement learning algorithm GRPO plays a decisive role in allowing the model to escape local minima and converge. Consider the ablation experiment "VAnim (SFT only)", which has no reinforcement learning feedback loop. Although the use of CoT and SSU significantly reduces the failure rate of code generation (the success rate rises to 95.2%) and yields a semantic score of 0.268, better than general large language models (such as GPT-5.2 at 0.234), extensive qualitative comparison of rendered results led the inventors to find that networks trained only to this stage generally exhibit a persistent "lazy motion / conservative motion bias." For example, when the user's abstract prompt instructs the model "to smoothly open the left door of the cabinet to its maximum angle," a model that has only learned the language probability distribution may stop modifying the control points after decoding a weak but valid code snippet that merely "opens the door a small crack (a slight rotation)." This is because significant modifications to the code sequence would deviate sharply from the safe range established by the supervised corpus. Once the closed-loop reinforcement feedback module is introduced, with the GRPO group-relative reward based on continuous visual comparison (using the similarity score given by the multimodal encoder PE-Core), the reward incentive drives the policy model out of this "conservative minimum range" in the probabilistic sense. Under the dual constraints that ensure grammatical correctness, the model learns how to move the control points of Bézier curves more substantially, more precisely, and non-linearly. This further increases the rate at which the model executes instructions with realistic motion amplitude and physical logic, ultimately achieving a state-of-the-art score of 0.281 in perceptual similarity evaluation while bringing the grammatical compliance rate to 100.0%.
[0077] Based on the same inventive concept, embodiments of the present invention also provide a vector animation generation system based on sparse state modeling and enhanced rendering perception which, as shown in Figure 5, includes:
The data receiving and multimodal fusion module, used to receive the natural language instruction input by the user, the initial static SVG code, and its corresponding initial bitmap image; to perform structured parsing on the initial static SVG code to generate a persistent document object model tree structure and assign globally unique identifiers to the nodes in that tree structure; and to convert the initial bitmap image into a visual embedding vector sequence through a visual encoder and fuse it with the text embedding sequence containing the natural language instruction and the document object model tree structure to obtain multimodal fusion features;
The structure-bound thought chain generation module, used to generate a structure-bound thought chain based on the multimodal fusion features using a multimodal large language model. The structure-bound thought chain includes an entity recognition sub-step and a visual dynamic planning sub-step, which map the abstract visual entities in the natural language instruction to the globally unique identifiers of nodes in the document object model tree structure, and describe the temporal logic and motion causal relationships of objects based on those identifiers to generate a motion blueprint;
The sparse state update sequence generation module, used to generate, with the motion blueprint as a conditional constraint and using the multimodal large language model, attribute difference sequences for the target nodes involved in the motion blueprint, wherein each attribute difference term is associated with the globally unique identifier of the corresponding target node, and to serialize the attribute difference sequence into a sparse state update sequence;
The vector animation building module, used to construct the final vector animation sequence based on the initial static SVG code and the sparse state update sequence.
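How the final module might restore the full animation from the initial state and the sparse updates can be sketched as follows. This is a minimal illustration under assumptions: the name `rebuild_states` is hypothetical, the parsed SVG is represented as a mapping from node ID to attributes rather than a real DOM, and an update that references an unknown node ID is rejected so that the node set stays identical across frames.

```python
import copy
from typing import Dict, List, Tuple

State = Dict[str, Dict[str, str]]   # node ID -> {attribute name: value}
Delta = List[Tuple[str, str, str]]  # (node ID, attribute name, new value)


def rebuild_states(initial: State, updates: List[Delta]) -> List[State]:
    """Replay sparse updates on top of the initial state, one time step at a time.

    Returns the full state at t = 0, 1, ..., len(updates). Attributes not
    mentioned in an update carry over unchanged from the previous time step.
    """
    states = [copy.deepcopy(initial)]
    for delta in updates:
        nxt = copy.deepcopy(states[-1])
        for node_id, attr, value in delta:
            if node_id not in nxt:
                # Assumption: updates may only touch nodes present in the initial DOM.
                raise KeyError(f"unknown node ID: {node_id}")
            nxt[node_id][attr] = value
        states.append(nxt)
    return states


# Hypothetical example: a door rotates in two steps while the handle stays fixed.
initial = {"door": {"transform": "rotate(0)"}, "handle": {"fill": "#333"}}
updates = [[("door", "transform", "rotate(30)")], [], [("door", "transform", "rotate(60)")]]
states = rebuild_states(initial, updates)
```

Because every frame is derived from the same initial node set, the reconstructed sequence keeps the same node IDs throughout, which matches the topological isomorphism the system aims for.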
[0078] Since the system solves the problem on a principle similar to that of the aforementioned method, its implementation can refer to the implementation of the method, and the overlapping details are not described again.
[0079] This embodiment provides a computer device, including a memory and a processor. The memory stores a computer program that can run on the processor. When the processor executes the computer program, it implements a vector animation generation method based on sparse state modeling and enhanced rendering perception.
[0080] This embodiment provides a computer-readable storage medium storing a computer program. When the computer program is executed by a processor, it implements a vector animation generation method based on sparse state modeling and enhanced rendering perception.
[0081] Those skilled in the art will understand that all or part of the steps of the above method embodiments can be implemented by hardware related to program instructions. The aforementioned program can be stored in a computer-readable storage medium. When the program is executed, it performs the steps of the above method embodiments. The aforementioned storage medium includes various media capable of storing program code, such as mobile storage devices, read-only memory (ROM), random access memory (RAM), magnetic disks, or optical disks.
[0082] The above embodiments are only used to illustrate the technical solutions of the present invention, and are not intended to limit the scope of protection of the present invention; any equivalent substitutions or modifications made by those skilled in the art without departing from the concept of the present invention should fall within the scope of protection of the present invention.
Claims
1. A vector animation generation method based on sparse state modeling and rendering perception reinforcement, characterized in that it comprises:
S1: Receiving the natural language instruction prompts input by the user, the initial static SVG code, and its corresponding initial bitmap image; performing structured parsing on the initial static SVG code to generate a persistent Document Object Model (DOM) tree structure, and assigning a globally unique identifier to each node in the DOM tree structure; converting the initial bitmap image into a visual embedding vector sequence through a visual encoder, and fusing it with a text embedding sequence containing the natural language instruction prompts and the DOM tree structure to obtain multimodal fusion features;
S2: Based on the multimodal fusion features, generating a structure-bound thought chain using a multimodal large language model; the structure-bound thought chain includes an entity recognition sub-step and a visual dynamic planning sub-step, which are used to map the abstract visual entities in the natural language instruction prompts to the globally unique identifiers of the nodes in the document object model tree structure, and to describe the temporal logic and motion causal relationship of the objects based on the globally unique identifiers to generate a motion blueprint;
S3: Using the motion blueprint as a conditional constraint, generating attribute difference sequences for the target nodes involved in the motion blueprint using the multimodal large language model, wherein each attribute difference term is associated with the globally unique identifier of the corresponding target node; and serializing the attribute difference sequence into a sparse state update sequence;
S4: Based on the initial static SVG code and the sparse state update sequence, constructing the final vector animation sequence.
2. The method of claim 1, wherein S1 includes:
S1.1: Structured parsing and document object model tree structure construction
Perform structured parsing on the received initial static SVG code to construct a persistent document object model tree structure with hierarchical nesting relationships;
Based on a topological traversal algorithm, assign a globally unique identifier to each node with a physical form in the document object model tree structure; the lifecycle of the globally unique identifier spans the entire animation sequence and is used to anchor the attribute changes of specific objects in subsequent steps;
S1.2: Multimodal embedding vector generation
The initial bitmap image is encoded using a pre-trained visual encoder to extract its high-dimensional visual representation features, which are then projected into the token space of the multimodal large language model to generate a continuous sequence of visual embedding vectors;
Simultaneously, the natural language instruction prompts and the document object model tree structure containing the globally unique identifiers generated in S1.1 are serialized into a discrete text embedding sequence;
S1.3: Dense cross-modal feature alignment and fusion
The visual embedding vector sequence and the text embedding sequence are input into the Transformer decoder of the multimodal large language model;
Using a dense cross-modal self-attention mechanism, the attention weights between the visual embedding vectors and each token in the text embedding sequence are calculated; the continuous visual representation features of the initial bitmap image are semantically aligned with the discrete hierarchical structure and globally unique identifiers of the document object model tree structure; and the multimodal fusion features integrating visual perception information and code structure information are output.
3. The method of claim 1, wherein S2 includes:
S2.1: Thought chain prompt construction and context injection
A structured prompt template containing few-shot examples is constructed, and the multimodal fusion features, the natural language instruction prompts, and the globally unique identifiers and hierarchical relationships of each node in the document object model tree structure are injected into the input context window of the multimodal large language model;
S2.2: Entity recognition and explicit ID mapping
Guide the multimodal large language model to execute the entity recognition sub-step of the thought chain reasoning process: the semantic subject in the natural language instruction prompts is analyzed to identify the abstract visual entity referred to by the instruction; guided by the multimodal fusion features, the identified abstract visual entities are matched with the nodes in the document object model tree structure that have corresponding visual representations, and the globally unique identifier of each matched node is explicitly output to establish a binding relationship between natural language entities and globally unique identifiers;
S2.3: Visual dynamic planning and motion blueprint generation
Guide the multimodal large language model to execute the visual dynamic planning sub-step of the thought chain reasoning process: based on the binding relationship, the implicit temporal logic and motion causality in the instruction are analyzed to define the motion type required for each node bound to a globally unique identifier, the motion type including rigid translation, rotation and scaling, or non-rigid path deformation; for nodes requiring non-rigid path deformation, the evolution trajectory of their geometric control points is further planned;
A structured motion blueprint is generated, described in natural language combined with pseudocode. The blueprint explicitly lists the expected state changes and attribute modification strategies of each globally unique identifier node at each time step, serving as constraints for the subsequent generation of attribute difference sequences.
4. The method of claim 3, wherein S3 includes:
S3.1: Keyframe state extraction and baseline alignment
Analyze the motion blueprint to identify the key time steps defined therein and their corresponding target state descriptions;
Read the initial attribute values of each node in the document object model tree structure and set them as the baseline state at time step $t=0$;
For each globally unique identifier node specified in the motion blueprint, extract its expected attribute values at each key time step to form a set of state changes to be processed;
S3.2: Strict attribute difference calculation and sparse state update generation
The set of state changes to be processed is traversed in ascending order of time steps, and normalization and strict attribute difference operations are performed on each candidate attribute of each globally unique identifier node. Specifically, the attribute values are first normalized, the normalization including at least: removing redundant whitespace characters, unifying numerical precision, unifying color expression format, unifying coordinate representation format, and unifying the order of transformation parameters;
Establish the target state table for the current time step $t$ and the historical state table for the previous time step $t-1$; for any attribute attr of any node ID, the normalized attribute value at time step $t$ is denoted as $v_t$, and the normalized attribute value at the previous time step $t-1$ is denoted as $v_{t-1}$; if $v_t$ and $v_{t-1}$ are not equal, the attribute is determined to have changed effectively at time step $t$, and an attribute difference term (ID, attr, $v_t$) is generated; if the two are equal, no difference term is generated;
Aggregate all attribute difference terms that have been determined to have undergone effective changes within the same time step $t$ into a time step increment set $\Delta_t$, where $\Delta_t = \{(\mathrm{ID}, \mathrm{attr}, v_t) \mid v_t \neq v_{t-1}\}$; here $\Delta_t$ represents the set of attribute increments at time step $t$, ID represents the globally unique identifier of the node, attr represents the attribute name, and $v_t$ represents the normalized value of attribute attr of node ID at time step $t$; the attributes include at least geometric attributes, transformation attributes, and appearance attributes. The path definition attribute d among the geometric attributes is used to represent non-rigid geometric deformation. The transformation attribute transform is used to represent translation, rotation, or scaling. The appearance attributes fill, stroke, opacity, and fill-opacity are used to represent fill color, stroke color, overall opacity, and fill opacity, respectively;
S3.3: Serialization representation and command output based on special control tags
The combination of hierarchical increment sets is converted into a one-dimensional serialized string according to a predetermined order, the predetermined order including at least: time steps are arranged in ascending order, nodes within the same time step are arranged according to globally unique identifier or DOM traversal order, and attributes under the same node are arranged according to a preset attribute priority;
During the serialization process, a time step control tag is inserted for each time step, a node control tag is inserted for each node, and a separator between attribute name and attribute value is inserted for each attribute, forming a linear token sequence that can be directly processed by a large language model; the one-dimensional serialized string includes at least a time step control tag and a node control tag, and the corresponding attribute difference terms are concatenated in order after the corresponding control tags;
Reserved characters in attribute values are escaped or encapsulated to avoid ambiguity between attribute value content and control tags;
This yields a sparse state update sequence, which, together with the initial static SVG code, is input into the subsequent animation building module to restore the complete animation state sequence step by step over time.
5. The method according to any one of claims 1-4, characterized in that the multimodal large language model is obtained by end-to-end reinforcement learning fine-tuning based on the group relative policy optimization algorithm, the fine-tuning process including:
The policy network autoregressively generates, in parallel and conditioned on the full context, 3 candidate output sequence groups, each containing a reasoning chain and a sparse state update sequence;
An integrated headless browser rendering proxy based on the web technology stack is invoked to inject each candidate sparse state update sequence into the virtual DOM environment in real time, and to reconstruct and render it at fixed time steps into a continuous two-dimensional raster video clip;
Using a pre-trained video-aware encoder, the video visual feature vector of the generated two-dimensional raster video clip and the text feature vector of the natural language instruction prompts are extracted from the intermediate layers of the network, and the cosine similarity between the two is calculated to construct a semantic alignment reward;
A binary reward for format validity is introduced to evaluate code syntax parsability, the degree of match with the target sequence length, and the topological validity of globally unique identifier references;
The semantic alignment reward and the binary format-validity reward are fused to construct a hybrid reward function; the relative reward of candidate sequences within the group is used to estimate the advantage function, and the parameters of the multimodal large language model are updated under a KL divergence constraint.
6. The method as described in claim 5, characterized in that the hybrid reward function is a weighted fusion of the semantic alignment reward and the binary format-validity reward, in which the quantities are, respectively: the hybrid reward function; the semantic alignment reward; the binary format-validity reward; the video visual feature vector; the text feature vector; the sparse state update sequence D; the weight parameter of the semantic alignment reward term; and the weight parameter of the binary format-validity reward term; CosineSim(·,·) is the cosine similarity function, used to compute the similarity between the video visual feature vector and the text feature vector; T is the time series length parameter, representing the total number of time steps or frames of the video segment rendered after the candidate sparse state update sequence is injected into the virtual DOM environment.
7. The method as described in claim 5, characterized in that estimating the advantage function using the relative reward of candidate sequences within a group, and updating the parameters of the multimodal large language model under a KL divergence constraint, includes:
For the same input sample, the current policy network is used to generate a group of multiple candidate sparse state update sequences through parallel sampling;
The total reward score of each candidate sparse state update sequence in the group is calculated, and the scores of all candidate sequences in the group are normalized using the mean and standard deviation within the group to obtain the relative advantage value of each candidate sequence with respect to the other sequences in the group. If the reward score of a candidate sparse state update sequence is higher than the group average, its relative advantage value is positive; otherwise it is negative. This estimates the advantage of the action without relying on an independent value network;
A policy optimization objective function is constructed, consisting of two parts: a clipped importance sampling term and a KL divergence penalty term;
In the clipped importance sampling term, the probability ratio of the new policy to the old policy for generating the same candidate sparse state update sequence is calculated, and a proximal ratio clipping operation is applied to limit the ratio to a preset threshold range, producing a clipped probability ratio; the product of the clipped probability ratio and the relative advantage value is used as the main driving term of the policy update, preventing the policy from collapsing due to an excessively large single-step update;
In the KL divergence penalty term, the KL divergence between the new policy currently being optimized and the baseline reference policy that has not been fine-tuned by reinforcement learning is calculated. This KL divergence value is multiplied by a preset penalty coefficient and subtracted from the policy optimization objective function, constraining the behavior of the new policy from deviating from the syntactic and structural validity acquired during the supervised fine-tuning stage.
8. A vector animation generation system based on sparse state modeling and enhanced rendering perception, characterized in that it comprises:
The data receiving and multimodal fusion module, used to receive the natural language instruction prompts input by the user, the initial static SVG code, and its corresponding initial bitmap image; the initial static SVG code is structurally parsed to generate a persistent Document Object Model (DOM) tree structure, and a globally unique identifier is assigned to each node in the DOM tree structure; the initial bitmap image is converted into a visual embedding vector sequence through a visual encoder, and then fused with a text embedding sequence containing the natural language instruction prompts and the DOM tree structure to obtain multimodal fusion features;
The structure-bound thought chain generation module, used to generate a structure-bound thought chain based on the multimodal fusion features and using a multimodal large language model; the structure-bound thought chain includes an entity recognition sub-step and a visual dynamic planning sub-step, which are used to map the abstract visual entities in the natural language instruction prompts to the globally unique identifiers of the nodes in the document object model tree structure, and to describe the temporal logic and motion causal relationship of the objects based on the globally unique identifiers to generate a motion blueprint;
The sparse state update sequence generation module, used to generate attribute difference sequences for the target nodes involved in the motion blueprint using the multimodal large language model, with the motion blueprint as a conditional constraint, wherein each attribute difference term is associated with the globally unique identifier of the corresponding target node; and to serialize the attribute difference sequence into a sparse state update sequence;
The vector animation building module, used to construct the final vector animation sequence based on the initial static SVG code and the sparse state update sequence.
9. A computer-readable storage medium having a computer program stored thereon, characterized in that, When the program is executed by the processor, it implements the vector animation generation method based on sparse state modeling and rendering perception enhancement as described in any one of claims 1 to 8.
10. A computer device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that, When the processor executes the program, it implements the vector animation generation method based on sparse state modeling and rendering perception enhancement as described in any one of claims 1 to 8.