Vector graphics generation method and system based on closed-loop rendering visual self-feedback

A closed-loop rendering visual self-feedback method combines a multimodal large language model with a rendering and verification gating mechanism to address the geometric hallucinations and occlusion errors of existing vector graphics generation. The generated SVG is more compact and editable, and the method applies to both text-driven and image-driven vector graphics generation.

CN122244266A · Pending · Publication Date: 2026-06-19 · BEIHANG UNIV

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications (China)
Current Assignee / Owner
BEIHANG UNIV
Filing Date
2026-04-24
Publication Date
2026-06-19

AI Technical Summary

Technical Problem

Existing vector graphics generation methods cannot observe the intermediate canvas, which leads to frequent geometric hallucinations and occlusion errors. They lack the ability to map the current canvas incrementally to the next piece of code, are prone to repeated drawing and occlusion during generation, and lack an effective gating mechanism at the inference stage, resulting in bloated SVG structures and poor editability.

Method used

A closed-loop rendering-based visual self-feedback method is adopted. A multimodal large language model generates the current candidate vector code fragment, and a rendering and verification gating mechanism checks its validity. The intermediate visual canvas state is used as context to construct multi-round visual instruction sequences for training and generation, and fine-grained path decomposition is combined with an autoregressive prediction loss to optimize the model.

Benefits of technology

It significantly improves the model's ability to perceive local geometry and occlusion relationships, and generates more compact, complete and editable SVGs. It is suitable for text-to-vector and image-to-vector graphics generation, with high data efficiency and wide applicability.

✦ Generated by Eureka AI based on patent content.


Abstract

This invention discloses a vector graphics generation method and system based on closed-loop rendering visual self-feedback. The method includes: constructing an initial current context based on user conditional input and inputting it into a multimodal large language model to generate a current candidate vector code fragment; performing validity detection on the current candidate vector code fragment through a rendering and verification gating mechanism, and incorporating a fragment that passes the detection into the cumulative vector code sequence to form the current cumulative vector code sequence; rendering the current cumulative vector code sequence into the current intermediate visual canvas state; visually encoding the current intermediate visual canvas state and using it, along with the user conditional input and the current cumulative vector code sequence, to construct the context for the next round; and outputting the vector graphics code corresponding to the current cumulative vector code sequence when a termination condition is met, otherwise generating the next candidate vector code fragment from the next-round context. This invention significantly improves the model's ability to perceive local geometric structures, layer order, and occlusion relationships.

Description

Technical Field

[0001] This invention relates to the fields of computer graphics, computer vision, natural language processing, multimodal large language models and code generation, and more specifically to a vector graphics generation method and system based on closed-loop rendering visual self-feedback.

Background Technology

[0002] Scalable Vector Graphics (SVG) offers advantages such as resolution independence, compact file size, clear geometric structure, and ease of subsequent editing and front-end deployment, and has been widely used in icon design, interface illustration, web animation, brand graphics, and digital content production. In recent years, multimodal large language models have continuously improved in code generation and visual understanding, and directly using large models to generate SVG code has become an emerging research direction.

[0003] However, most existing mainstream vector graphics generation methods adopt an open-loop "blind drawing" paradigm: the model starts from a prompt or reference image and directly outputs the complete SVG code via autoregression, without being able to observe the intermediate canvas it has drawn at any point in the generation process. This type of method wrongly reduces an inherently visually dependent graphics generation problem to a pure text sequence prediction problem, and exhibits the following prominent drawbacks:

[0004] 1. The invisibility of the intermediate canvas leads to frequent geometric hallucinations and occlusion errors. The model cannot perceive the visual difference between the local primitives already drawn and the remaining target, and it has difficulty handling front-to-back occlusion, layer order, and local structural closure relationships, so it easily produces path code that is syntactically correct but visually wrong.

[0005] 2. Pre-trained models lack the incremental mapping ability to "write a stroke based on the current canvas". Even if the intermediate canvas image is simply inserted into the context, an untrained model cannot stably map the current visual state to the next piece of local geometric code, and performance degrades instead of improving.

[0006] 3. Generating a complete path in one go is difficult, and visual supervision is sparse. Original SVG files are usually stored in compressed form, and a single path often contains multiple visually independent elements. If the model is trained directly at the original path granularity, the changes in intermediate visual states are sparse, which is not conducive to the model establishing fine-grained local rendering capabilities.

[0007] 4. Lack of an effective gating mechanism in the inference phase. Existing autoregressive methods are prone to getting stuck in degenerate loops during inference, such as repeated drawing, invalid overlay, complete occlusion drawing, or drawing out of bounds. This not only wastes computation and tokens, but also results in a bloated SVG structure with poor editability.

[0008] 5. Reinforcement learning schemes that rely solely on scalar rewards have insufficient information utilization. Some schemes attempt to optimize the output of large models through scalar rewards such as rendering effectiveness and visual semantic alignment. However, scalar signals highly compress rich visual structural information and cannot replace the advantage of dense visual context brought by the model directly "seeing" the intermediate canvas.

[0009] Therefore, there is an urgent need for a new vector graphics generation scheme that enables multimodal large language models to continuously observe their own drawn results during the generation process and generate the next step of code accordingly. At the same time, it should be supplemented by a special training strategy for intermediate visual states and a gating mechanism for the inference stage to improve geometric accuracy, occlusion handling ability and structural editability.

Summary of the Invention

[0010] In view of the above problems, the present invention proposes a vector graphics generation method and system based on closed-loop rendering visual self-feedback to overcome or at least partially solve the above problems.

[0011] To achieve the above objectives, the present invention adopts the following technical solution. In a first aspect, the present invention provides a vector graphics generation method based on closed-loop rendering visual self-feedback, comprising:
S1. Receive user conditional input and construct an initial current context based on the user conditional input;
S2. Input the current context into the trained multimodal large language model and generate the current candidate vector code fragment through autoregression;
S3. Perform validity detection on the current candidate vector code fragment using a rendering and verification gating mechanism; if the current candidate vector code fragment passes the detection, merge it into the cumulative vector code sequence to form the current cumulative vector code sequence;
S4. Use the rendering engine to render the current cumulative vector code sequence into the current intermediate visual canvas state;
S5. After visually encoding the current intermediate visual canvas state, construct the context for the next round together with the user conditional input and the current cumulative vector code sequence;
S6. Determine whether the termination condition is met. If so, output the vector graphics code corresponding to the current cumulative vector code sequence; otherwise, return to S2 with the context of the next round.

[0012] Furthermore, the training steps of the multimodal large language model include:
Obtain original vector graphics samples, and perform deduplication and structural standardization on them;
Based on the processed original vector graphics samples, extract the corresponding user condition information;
Perform fine-grained path decomposition on the vector paths in the processed original vector graphics samples to obtain multiple atomic sub-paths expanded in rendering order, and convert each atomic sub-path into a local vector code fragment;
Following the cumulative drawing order, superimpose the local vector code fragments step by step, and render the result after each superposition on an intermediate canvas to generate the intermediate visual canvas image corresponding to the cumulative local vector code fragments at the current time step;
Interleave the user condition information, the local vector code fragments, and the corresponding intermediate visual canvas images to construct a multi-round visual instruction sequence;
Train the multimodal large language model with visual self-feedback based on the multi-round visual instruction sequence. During training, the autoregressive prediction loss is applied only to the local vector code fragments and the generated end marker, so that the model learns to predict the next local vector code fragment and to output the end marker based on the current intermediate visual canvas image.

[0013] Furthermore, the process of performing fine-grained path decomposition on the vector paths in the processed original vector graphics samples to obtain multiple atomic sub-paths expanded in rendering order includes:
Parse the drawing commands in the path data attribute of the vector path, and extract multiple candidate sub-paths with non-continuous start commands as boundaries;
Map each candidate sub-path to a two-dimensional geometric region, and establish a topological dependency graph based on inclusion, intersection, and transparency superposition constraints;
Based on the topological dependency graph, re-merge the candidate sub-paths located in the same connected component into one atomic sub-path that maintains consistent rendering semantics.
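The decomposition described above can be illustrated with a small sketch. It is not the patent's implementation and rests on several simplifying assumptions: the "non-continuous start commands" are taken to be moveto commands, path data uses only absolute `M`, `L`, and `Z` commands with plain decimal numbers, and bounding-box overlap stands in for the inclusion and intersection tests (transparency constraints are ignored). The names `split_subpaths`, `bbox`, `overlaps`, and `atomic_subpaths` are hypothetical.

```python
import re
from typing import Dict, List, Tuple

BBox = Tuple[float, float, float, float]  # (min_x, min_y, max_x, max_y)


def split_subpaths(d: str) -> List[str]:
    """Split path data into candidate sub-paths at each moveto command."""
    return [p.strip() for p in re.findall(r"[Mm][^Mm]*", d) if p.strip()]


def bbox(subpath: str) -> BBox:
    """Bounding box of a sub-path made of absolute M/L/Z commands only."""
    nums = [float(n) for n in re.findall(r"-?\d+(?:\.\d+)?", subpath)]
    xs, ys = nums[0::2], nums[1::2]
    return (min(xs), min(ys), max(xs), max(ys))


def overlaps(a: BBox, b: BBox) -> bool:
    """True when two boxes intersect or one contains the other."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def atomic_subpaths(d: str) -> List[str]:
    """Merge candidate sub-paths that share a connected component."""
    subs = split_subpaths(d)
    boxes = [bbox(s) for s in subs]
    parent = list(range(len(subs)))  # union-find forest over candidates

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # An edge of the dependency graph joins two overlapping candidates.
    for i in range(len(subs)):
        for j in range(i + 1, len(subs)):
            if overlaps(boxes[i], boxes[j]):
                parent[find(j)] = find(i)

    # Group by component, keeping the original drawing order.
    groups: Dict[int, List[str]] = {}
    for i, s in enumerate(subs):
        groups.setdefault(find(i), []).append(s)
    return [" ".join(g) for g in groups.values()]
```

In this sketch a square with an inner square (a hole under the even-odd fill rule) stays in one atomic sub-path, while a disjoint triangle becomes its own fragment. A production version would need a real path parser and exact region tests.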

[0014] Furthermore, the multi-round visual instruction sequence adopts the following interleaving structure:

$$S = \left[\, P,\; C_1,\; I_1,\; C_2,\; I_2,\; \dots,\; C_N,\; I_N,\; \langle\mathrm{end}\rangle \,\right]$$

[0015] where $S$ denotes the multi-round visual instruction sequence; $P$ denotes the user condition information; $N$ denotes the total number of steps in the complete generation process; $C_N$ denotes the local vector code fragment generated at time step $N$; $I_N$ denotes the intermediate visual canvas image obtained by rendering the accumulated code up to time step $N$; and $\langle\mathrm{end}\rangle$ denotes the generated end marker.
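As an illustration of how such an interleaved sequence might be assembled from one decomposed sample, the sketch below walks the fragments in drawing order and renders the accumulated code after each one. It assumes that each canvas image follows the fragment that produced it; `build_sequence` and the `render` callable are hypothetical names, and `render` stands in for whatever SVG rasterizer is used.

```python
from typing import Callable, List, Tuple

END = "<end>"

Item = Tuple[str, object]  # (kind, payload), kind in {"condition", "code", "image", "end"}


def build_sequence(
    condition: object,
    fragments: List[str],
    render: Callable[[List[str]], object],
) -> List[Item]:
    """Interleave the condition, code fragments, and cumulative canvas images."""
    sequence: List[Item] = [("condition", condition)]
    accumulated: List[str] = []
    for fragment in fragments:
        accumulated.append(fragment)
        sequence.append(("code", fragment))
        # Render everything drawn so far, not just the newest fragment.
        sequence.append(("image", render(list(accumulated))))
    sequence.append(("end", END))
    return sequence
```

With two fragments the result has the kinds `condition, code, image, code, image, end`, matching the interleaved layout described above.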

[0016] Furthermore, the autoregressive prediction loss is expressed as:

$$\mathcal{L}_{\mathrm{AR}} = -\frac{1}{\sum_{j=1}^{T} m_j} \sum_{i=1}^{T} m_i \log p_\theta\left(s_i \mid s_{<i}\right)$$

[0017] where $\mathcal{L}_{\mathrm{AR}}$ denotes the autoregressive prediction loss; $T$ denotes the total token length of the multi-round visual instruction sequence; $j$ denotes the index used when summing over all tokens; $\sum_{j=1}^{T} m_j$ denotes the total number of output tokens used in the loss calculation; $i$ denotes the index of the currently predicted token; $s_i$ denotes the $i$-th token in the multi-round visual instruction sequence; $p_\theta(s_i \mid s_{<i})$ denotes the probability that the multimodal large language model predicts for the $i$-th token given the preceding tokens; and $m_i$ is the output mask, where $m_i = 1$ indicates that the $i$-th token belongs to a local vector code fragment or the end marker, and $m_i = 0$ indicates that the $i$-th token belongs to the user conditional input or an intermediate visual canvas image.

[0018] Furthermore, performing validity detection on the current candidate vector code fragment through the rendering and verification gating mechanism specifically includes:
After tentatively incorporating the current candidate vector code fragment into the cumulative vector code sequence, use the rendering engine to generate the predicted canvas state;
Calculate the pixel-level visual difference between the predicted canvas state and the intermediate visual canvas state of the previous time step;
When the pixel-level visual difference is lower than the preset visual increment threshold, determine that the current candidate vector code fragment is a degenerate primitive, a completely occluded primitive, an invalid out-of-bounds primitive, or a redundant primitive, and treat it as having failed the validity detection;
Calculate the structural similarity or string similarity between the current candidate vector code fragment and the candidate vector code fragment that passed the validity detection at the previous time step;
When the structural similarity or string similarity is higher than the preset repetition threshold, determine that the current candidate vector code fragment triggers a repetitive drawing loop, and treat it as having failed the validity detection.
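The pixel-level visual difference can be sketched as a mean absolute difference over the canvas. This is an assumption for illustration: the text specifies only an average pixel difference, so the absolute-value distance, the grayscale `Canvas` representation, and the name `mean_pixel_difference` are hypothetical choices.

```python
from typing import List

Canvas = List[List[float]]  # H rows of W grayscale values in [0, 1]


def mean_pixel_difference(previous: Canvas, predicted: Canvas) -> float:
    """Average absolute per-pixel difference between two same-size canvases."""
    height, width = len(previous), len(previous[0])
    same_shape = (
        len(predicted) == height
        and all(len(row) == width for row in previous)
        and all(len(row) == width for row in predicted)
    )
    if not same_shape:
        raise ValueError("canvases must share the same H x W resolution")
    total = sum(
        abs(a - b)
        for row_prev, row_pred in zip(previous, predicted)
        for a, b in zip(row_prev, row_pred)
    )
    return total / (height * width)
```

A fragment that is fully occluded or lies outside the canvas leaves every pixel unchanged, so the difference is 0 and falls below any positive visual increment threshold.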

[0019] Further, in step S3, if the current candidate vector code fragment fails the validity detection, at least one of the following operations is performed based on the current iteration information:
Resample to obtain a new candidate vector code fragment;
Skip the current processing step and proceed to the next step;
Trigger early termination to end the generation process.

[0020] Furthermore, the user conditional input is a natural language prompt in the text-to-vector graphics task and a reference image in the image-to-vector graphics task; in both the text-to-vector graphics task and the image-to-vector graphics task, the intermediate visual canvas state is used as visual feedback.

[0021] In a second aspect, the present invention provides a vector graphics generation system based on closed-loop rendering visual self-feedback. The system applies the aforementioned vector graphics generation method based on closed-loop rendering visual self-feedback and includes:
A conditional input receiving module, used to receive user conditional input and construct an initial current context based on the user conditional input;
A local code generation module, used to input the current context into the trained multimodal large language model and generate the current candidate vector code fragment through autoregression;
A rendering and verification gating module, used to perform validity detection on the current candidate vector code fragment through the rendering and verification gating mechanism, and to incorporate a current candidate vector code fragment that passes the validity detection into the cumulative vector code sequence to form the current cumulative vector code sequence;
An intermediate canvas rendering module, used to render the current cumulative vector code sequence into the current intermediate visual canvas state using the rendering engine;
A visual self-feedback injection module, used to visually encode the current intermediate visual canvas state and then combine it with the user conditional input and the current cumulative vector code sequence to construct the context for the next round;
An output assembly module, used to determine whether the termination condition is met; if so, it outputs the vector graphics code corresponding to the current cumulative vector code sequence; otherwise, control returns to the local code generation module with the context of the next round.

[0022] In a third aspect, the present invention provides a computer device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the computer program to implement the above-described vector graphics generation method based on closed-loop rendering visual self-feedback.

[0023] As can be seen from the above technical solution, compared with the prior art, the vector graphics generation method and system based on closed-loop rendering visual self-feedback disclosed by the present invention has the following beneficial effects. By directly injecting the intermediate canvas rendering results into the context of a multimodal large language model, this invention significantly improves the model's ability to perceive local geometry, layer order, and occlusion relationships, thus avoiding the geometric hallucinations caused by existing open-loop methods that "only look at the prompt and not the canvas".

[0024] This invention employs a visual self-feedback (VSF) training strategy, enabling the model to learn the difference between the current canvas and the target state and to map this difference to the next local code fragment. Experiments show that models not trained with VSF degrade on multiple metrics even when they are simply given the intermediate canvas, demonstrating that the training strategy of this invention is crucial for achieving closed-loop generation capability.

[0025] Through fine-grained path decomposition and curriculum-style trajectory construction, this invention learns higher-density visual-code correspondences from fewer samples. In experiments, using only approximately 0.85M training samples, it matches or surpasses open-loop baseline models trained on larger-scale data, demonstrating excellent data efficiency.

[0026] This invention verifies the visual increment and repetition of candidate segments, and can filter out degenerate generation steps such as repeated outlines, completely occluded paths, and invalid primitives outside the canvas, ensuring that the final SVG is more compact, complete and editable.

[0027] Because this invention unifies user conditions into input conditions that can participate in multimodal context modeling, the same framework can simultaneously support text-to-SVG generation and image-to-SVG reconstruction, thus having a wider range of applications.

[0028] For icons or illustrations with multiple layers of occlusion, overlapping foreground and background, or complex local combinations, this invention can naturally learn the drawing logic from background to foreground based on the intermediate canvas, thereby better solving the ambiguity of occlusion that is difficult to cover by simple text descriptions.

[0029] The main effect of this invention does not depend on additional complex reinforcement learning training. The core closed-loop generation capability can be obtained through multi-round visual instruction fine-tuning. During inference, only a lightweight rendering and verification module needs to be added, which makes the method easy to integrate with existing multimodal large language model frameworks and SVG rendering engines.

Description of the Drawings

[0030] To more clearly illustrate the technical solutions in the embodiments of the present invention or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are only embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on the provided drawings without creative effort.

[0031] Figure 1 is a schematic diagram of the vector graphics generation method based on closed-loop rendering visual self-feedback provided in an embodiment of the present invention;
Figure 2 is a schematic diagram of the framework of the vector graphics generation method based on closed-loop rendering visual self-feedback provided in an embodiment of the present invention;
Figure 3 is a schematic diagram of the fine-grained path decomposition process provided in an embodiment of the present invention.

Detailed Implementation

[0032] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.

[0033] This invention discloses a vector graphics generation method based on closed-loop rendering visual self-feedback. As shown in Figure 1, it includes the following steps:
S1. Receive user conditional input and construct the initial current context based on the user conditional input;
S2. Input the current context into the trained multimodal large language model and generate the current candidate vector code fragment through autoregression;
S3. Perform validity detection on the current candidate vector code fragment through a rendering and verification gating mechanism; if the current candidate vector code fragment passes the detection, add it to the cumulative vector code sequence to form the current cumulative vector code sequence;
S4. Use the rendering engine to render the current cumulative vector code sequence into the current intermediate visual canvas state;
S5. After visually encoding the current intermediate visual canvas state, construct the context for the next round together with the user conditional input and the current cumulative vector code sequence;
S6. Determine whether the termination condition is met. If so, output the vector graphics code corresponding to the current cumulative vector code sequence; otherwise, return to S2 with the context of the next round.

[0034] Next, each of the above steps will be explained in detail.

[0035] In step S1 above, user conditional input is received, and an initial current context is constructed based on it; the user conditional input is a natural language description prompt and/or a reference image.
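A minimal sketch of step S1 follows, assuming the context is held as an ordered list of typed segments. The `Segment` type and `build_initial_context` function are hypothetical names introduced for illustration; the text does not prescribe a data structure.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Segment:
    kind: str        # "text", "image", or "code"
    payload: object  # prompt string, raw image bytes, or SVG fragment


def build_initial_context(
    prompt: Optional[str] = None,
    reference_image: Optional[bytes] = None,
) -> List[Segment]:
    """Build the first-round context from a prompt and/or a reference image."""
    if prompt is None and reference_image is None:
        raise ValueError("provide a prompt, a reference image, or both")
    context: List[Segment] = []
    if prompt is not None:
        context.append(Segment("text", prompt))
    if reference_image is not None:
        context.append(Segment("image", reference_image))
    return context
```

Keeping the condition as ordinary context segments is what lets the same loop serve both the text-to-vector and image-to-vector tasks.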

[0036] In step S2 above, the current context is input into the trained multimodal large language model, and the current candidate vector code fragment is generated by autoregression. The offline visual self-feedback training process of the multimodal large language model includes:
Step 1: Obtain complete original vector graphics samples (i.e., complete SVG code files), and perform deduplication and structural normalization on them;
Step 2: Extract the corresponding user condition information from the processed original vector graphics samples;
Step 3: Perform fine-grained path decomposition on the vector paths in the processed original vector graphics samples to obtain multiple atomic sub-paths or local primitives expanded in rendering order, specifically:
(1) Parse the drawing commands in the path data attribute (such as the d attribute) of the vector path, and extract multiple candidate sub-paths with non-continuous start commands as boundaries;
(2) Map each candidate sub-path to a two-dimensional geometric region, and establish a topological dependency graph based on inclusion relationships, intersection relationships, and transparency superposition constraints;
(3) Based on the topological dependency graph, re-merge the candidate sub-paths located in the same connected component into one atomic sub-path that maintains consistent rendering semantics, so as to avoid the fill-rule corruption, disappearing holes, or color-mixing errors caused by naive splitting.
Each atomic sub-path or local primitive is then converted into a local vector code fragment, which includes path fragments, primitive codes, attribute update codes, or a combination thereof;
Step 4: Following the cumulative drawing order, progressively overlay the local vector code fragments, and render the result of each overlay step on an intermediate canvas to generate the intermediate visual canvas image corresponding to the cumulative local vector code fragments at the current time step; this intermediate visual canvas image is visual trajectory data reflecting the sequential rendering result of the current cumulative vector code sequence;
Step 5: Interleave the user condition information, the local vector code fragments, and the corresponding intermediate visual canvas images to construct a multi-round visual instruction sequence; this multi-round visual instruction sequence adopts the following interleaving structure:

$$S = \left[\, P,\; C_1,\; I_1,\; C_2,\; I_2,\; \dots,\; C_N,\; I_N,\; \langle\mathrm{end}\rangle \,\right]$$

[0037] where $S$ denotes the multi-round visual instruction sequence; $P$ denotes the user condition information; $N$ denotes the total number of steps in the complete generation process; $C_1$ to $C_N$ denote the local vector code fragments generated in steps 1 to $N$, respectively; $I_1$ to $I_N$ denote the intermediate visual canvas images obtained by rendering the accumulated code up to the corresponding time step; and $\langle\mathrm{end}\rangle$ denotes the generated end marker.

[0038] Step 6: Perform visual self-feedback training on the multimodal large language model based on the multi-round visual instruction sequences. During training, the autoregressive prediction loss is applied only to the local vector code fragments and the generated end marker, so that the model learns to predict the next local vector code fragment and to output the end marker based on the current intermediate visual canvas image. No supervised loss is applied to the user conditional input or the inserted image tokens. In this invention, outputting the end marker means the following: when the multimodal large language model, by comparing the target visual effect corresponding to the user's conditional input with the current intermediate visual canvas state, determines that the current cumulative vector code sequence has completed the rendering of the main visible structures and that continuing to generate would lead to redundant overdraw, it outputs the end marker <end>.
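The selective supervision described in Step 6 amounts to building a 0/1 mask over the flattened token sequence. The sketch below assumes each segment has already been tokenized and tagged with a kind; `build_loss_mask` and the kind labels are hypothetical, and the token ids in the usage are made up.

```python
from typing import List, Tuple

SUPERVISED_KINDS = ("code", "end")  # everything else is context only


def build_loss_mask(
    segments: List[Tuple[str, List[int]]],
) -> Tuple[List[int], List[int]]:
    """Flatten (kind, token_ids) segments into token ids and a 0/1 loss mask."""
    tokens: List[int] = []
    mask: List[int] = []
    for kind, ids in segments:
        supervised = 1 if kind in SUPERVISED_KINDS else 0
        tokens.extend(ids)
        mask.extend([supervised] * len(ids))
    return tokens, mask


# Illustrative usage with made-up token ids.
example = [
    ("condition", [1, 2, 3]),
    ("code", [10, 11]),
    ("image", [900, 901, 902, 903]),
    ("code", [12]),
    ("image", [904, 905]),
    ("end", [99]),
]
example_tokens, example_mask = build_loss_mask(example)
```

Here only the four code and end-marker tokens carry a mask value of 1, so the condition and canvas image tokens shape the context without contributing to the loss.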

[0039] The above autoregressive prediction loss is expressed as:

$$\mathcal{L}_{\mathrm{AR}} = -\frac{1}{\sum_{j=1}^{T} m_j} \sum_{i=1}^{T} m_i \log p_\theta\left(s_i \mid s_{<i}\right)$$

[0040] where $\mathcal{L}_{\mathrm{AR}}$ denotes the autoregressive prediction loss; $T$ denotes the total token length of the multi-round visual instruction sequence; $j$ denotes the index used when summing over all tokens; $\sum_{j=1}^{T} m_j$ denotes the total number of output tokens used in the loss calculation; $i$ denotes the index of the currently predicted token; $s_i$ denotes the $i$-th token in the multi-round visual instruction sequence; $p_\theta(s_i \mid s_{<i})$ denotes the probability that the multimodal large language model predicts for the $i$-th token given the preceding tokens; and $m_i$ is the output mask, where $m_i = 1$ indicates that the $i$-th token belongs to a local vector code fragment or the end marker, and $m_i = 0$ indicates that the $i$-th token belongs to the user conditional input or an intermediate visual canvas image. Because the loss is also applied to the end marker, the model learns to compare the target visual effect corresponding to the user conditional input with the current intermediate visual canvas state, and thereby learns when to stop generating.
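A masked autoregressive loss of this kind can be computed as a negative log-likelihood averaged over the supervised tokens only. The sketch below assumes the per-token log-probabilities have already been gathered from the model; `masked_ar_loss` is a hypothetical name, and a real training pipeline would do this on tensors rather than Python lists.

```python
import math
from typing import List


def masked_ar_loss(log_probs: List[float], mask: List[int]) -> float:
    """Mean negative log-likelihood over the tokens whose mask value is 1."""
    if len(log_probs) != len(mask):
        raise ValueError("log_probs and mask must have the same length")
    n_out = sum(mask)  # number of supervised output tokens
    if n_out == 0:
        raise ValueError("mask selects no output tokens")
    return -sum(m * lp for lp, m in zip(log_probs, mask)) / n_out


# Two context tokens (mask 0) and two output tokens (mask 1).
example_loss = masked_ar_loss(
    [math.log(0.5), math.log(0.25), math.log(0.9), math.log(0.5)],
    [0, 1, 0, 1],
)
```

In the example only the second and fourth tokens count, so the loss is the average of `-log 0.25` and `-log 0.5`, which equals `1.5 * log 2`; the probabilities assigned to the masked context tokens have no effect.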

[0041] In the training process described above, independent atomic sub-paths are used as the basic generation units for step-by-step rendering. This increases the density of intermediate visual states and yields curriculum-style training samples that progress from easy to difficult.

[0042] In step S3 above, the current candidate vector code fragment is validated through the rendering and verification gating mechanism; if it passes the validation, it is merged into the cumulative vector code sequence to form the current cumulative vector code sequence.
(1) The validity detection of the rendering and verification gating mechanism specifically includes:
1) At step $t$, the model first proposes the current candidate vector code fragment $\hat{C}_t$. The system does not accept the candidate immediately; instead, it first tentatively appends $\hat{C}_t$ to the cumulative vector code sequence (i.e., the historically accepted fragments) $C_{1:t-1}$ and uses the rendering engine to generate the predicted canvas state $\hat{I}_t$:

$$\hat{I}_t = R\left(C_{1:t-1} \oplus \hat{C}_t\right)$$

[0043] where $R(\cdot)$ denotes the SVG rasterization rendering function, $\oplus$ denotes appending a fragment to a sequence, and $C_{1:t-1}$ denotes the sequence of candidate vector code fragments that have passed the validity check and been accepted up to step $t-1$. It should be noted that the current candidate vector code fragment is combined with the previously accepted fragments here only to construct the predicted canvas state for verification. The candidate fragment is formally incorporated into the cumulative vector code sequence only after it passes the validity check.
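The tentative combination can be sketched as building a trial SVG document without touching the accepted list. Two things here are assumptions rather than part of the described method: the fixed 200 x 200 canvas, and the XML well-formedness check, which is added only as a cheap guard before rasterization. `assemble_svg` and `tentative_document` are hypothetical names, and the rasterization itself is left to an external renderer.

```python
import xml.etree.ElementTree as ET
from typing import List, Optional

SVG_NS = "http://www.w3.org/2000/svg"


def assemble_svg(fragments: List[str], width: int = 200, height: int = 200) -> str:
    """Wrap code fragments, in drawing order, in a single SVG root element."""
    body = "\n".join(fragments)
    return (
        f'<svg xmlns="{SVG_NS}" width="{width}" height="{height}" '
        f'viewBox="0 0 {width} {height}">\n{body}\n</svg>'
    )


def tentative_document(accepted: List[str], candidate: str) -> Optional[str]:
    """Return the trial document, or None if the candidate is not well-formed.

    The accepted list is never mutated: the candidate is committed by the
    caller only after the gate has passed.
    """
    document = assemble_svg(accepted + [candidate])
    try:
        ET.fromstring(document)
    except ET.ParseError:
        return None
    return document
```

The returned string is what would be handed to the rendering function to obtain the predicted canvas state.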

[0044] Next, the pixel-level visual difference between the predicted canvas state and the intermediate visual canvas state of the previous time step is calculated:

$$\Delta\left(I_{t-1}, \hat{I}_t\right) = \frac{1}{H W} \sum_{h=1}^{H} \sum_{w=1}^{W} \left| I_{t-1}(h, w) - \hat{I}_t(h, w) \right|$$

[0045] where $\Delta(I_{t-1}, \hat{I}_t)$ denotes the average pixel difference between the intermediate visual canvas state $I_{t-1}$ of the previous time step (i.e., step $t-1$) and the predicted canvas state $\hat{I}_t$; $\hat{I}_t$ denotes the predicted canvas state obtained by combining the current candidate vector code fragment with the historically accepted fragments; $H$ and $W$ denote the height and width of the canvas resolution, respectively; $I_{t-1}$ is the intermediate visual canvas state of height $H$ and width $W$ generated at step $t-1$; and $\hat{I}_t$ is the predicted canvas state of height $H$ and width $W$ generated at step $t$.

[0046] When the pixel-level visual difference is lower than the preset visual increment threshold, the current candidate vector code fragment is determined to be a degenerate primitive, a completely occluded primitive, an invalid out-of-bounds primitive, or a redundant primitive; it is considered to have failed the validity check and is rejected.
2) Calculate the structural similarity or string similarity between the current candidate vector code fragment and the candidate vector code fragment that passed the validity check (i.e., was accepted) at the previous time step. When the structural similarity or string similarity is higher than the preset repetition threshold, the current candidate vector code fragment is determined to trigger a repetitive drawing loop; it is considered to have failed the validity check and is rejected.
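The two checks can be combined into one gate function, sketched below. The pixel difference is passed in as a precomputed number, string similarity is measured with the standard library's `difflib.SequenceMatcher`, and both thresholds (`delta_min`, `repeat_max`) are illustrative values, not values taken from the text. The name `gate` and the reason strings are hypothetical.

```python
from difflib import SequenceMatcher
from typing import Optional, Tuple


def gate(
    delta: float,
    candidate: str,
    last_accepted: Optional[str],
    delta_min: float = 0.001,
    repeat_max: float = 0.95,
) -> Tuple[bool, str]:
    """Decide whether a candidate fragment passes the validity check."""
    # Check 1: the candidate must visibly change the canvas.
    if delta < delta_min:
        return False, "no_visual_increment"
    # Check 2: the candidate must not nearly repeat the last accepted fragment.
    if last_accepted is not None:
        similarity = SequenceMatcher(None, candidate, last_accepted).ratio()
        if similarity > repeat_max:
            return False, "repetition"
    return True, "accepted"
```

Returning a reason alongside the decision keeps the rejection cause available to whatever policy decides between resampling, skipping, and early termination.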

[0047] (2) In this step, if the current candidate vector code fragment fails the validity check, at least one of the following operations is performed based on the current iteration information:
1) Resample to obtain a new candidate vector code fragment;
2) Skip the current processing step and proceed to the next step;
3) Trigger early termination (e.g., after rejection has been triggered several times in a row) to end the generation process.
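One possible way to choose among these three operations is an escalating policy: resample first, skip once the resampling budget for the step is spent, and terminate once too many steps in a row have been skipped. The text leaves the choice open, so the ordering, the counters, and the limits below are all assumptions, and `Action` and `on_rejection` are hypothetical names.

```python
from enum import Enum


class Action(Enum):
    RESAMPLE = "resample"
    SKIP = "skip"
    TERMINATE = "terminate"


def on_rejection(
    resamples_this_step: int,
    consecutive_skipped_steps: int,
    max_resamples: int = 3,
    max_skipped_steps: int = 2,
) -> Action:
    """Pick a recovery action after a candidate fragment is rejected."""
    if resamples_this_step < max_resamples:
        return Action.RESAMPLE   # try another sample for the same step
    if consecutive_skipped_steps < max_skipped_steps:
        return Action.SKIP       # give up on this step, move on
    return Action.TERMINATE      # repeated failures: stop generating
```

Bounding both counters guarantees that the generation loop cannot spin indefinitely on a degenerate candidate.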

[0048] In step S4 above, the current cumulative vector code sequence is rendered into the current intermediate visual canvas state using the rendering engine.
In step S5 above, the current intermediate visual canvas state is used as the visual context. After visual encoding, it is used together with the user conditional input and the current cumulative vector code sequence to construct the context for the next round, forming a closed-loop iterative generation process of "code generation - canvas rendering - visual feedback".
In step S6 above, it is determined whether the termination condition is met. If so, the vector graphics code corresponding to the current cumulative vector code sequence is output; otherwise, S2 is executed again with the context of the next round.
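Steps S1 to S6 can be tied together in a single loop, sketched below under stated assumptions. `propose`, `render`, and `accept` are hypothetical callables standing in for the multimodal model, the rendering engine, and the gating mechanism; `max_steps` and `max_rejections` are illustrative limits; and rejected candidates are handled by simply asking the model again, which corresponds to resampling.

```python
from typing import Callable, List

END = "<end>"


def generate(
    condition: object,
    propose: Callable[[object, List[str], object], str],
    render: Callable[[List[str]], object],
    accept: Callable[[object, object, str, List[str]], bool],
    max_steps: int = 64,
    max_rejections: int = 3,
) -> List[str]:
    """Closed-loop generation: propose, gate, render, feed the canvas back."""
    accepted: List[str] = []
    canvas = render(accepted)  # initial canvas with nothing drawn (assumption)
    rejections = 0
    for _ in range(max_steps):
        candidate = propose(condition, accepted, canvas)        # S2
        if candidate == END:                                    # S6
            break
        predicted = render(accepted + [candidate])              # trial render
        if accept(canvas, predicted, candidate, accepted):      # S3
            accepted.append(candidate)
            canvas = predicted                                  # S4 and S5
            rejections = 0
        else:
            rejections += 1
            if rejections >= max_rejections:                    # early stop
                break
    return accepted
```

Because the trial render of an accepted candidate is exactly the next canvas state, it is reused as the visual feedback for the following round rather than rendered a second time.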

[0049] The vector graphics generation method based on closed-loop rendering visual self-feedback provided by the present invention is applicable to text-driven vector graphics generation, image-driven vector graphics reconstruction, and automatic generation of structured editable SVG code. In the text-to-vector graphics task, user conditional input is a natural language prompt, and in the image-to-vector graphics task, it is a reference image; in both text-to-vector graphics and image-to-vector graphics tasks, the intermediate visual canvas state serves as visual feedback.

[0050] Based on the same inventive concept, embodiments of the present invention also provide a vector graphics generation system based on closed-loop rendering visual self-feedback. Applying the above-described vector graphics generation method based on closed-loop rendering visual self-feedback, the system includes: The conditional input receiving module is used to receive user conditional input and construct an initial current context based on the user conditional input; The local code generation module is used to input the current context into the trained multimodal large language model and generate the current candidate vector code fragments through autoregression. The rendering and verification gating module is used to perform validity checks on the current candidate vector code fragments through the rendering and verification gating mechanism; and to merge the current candidate vector code fragments that pass the validity check into the cumulative vector code sequence to form the current cumulative vector code sequence. The intermediate canvas rendering module is used to render the current accumulated vector code sequence into the current intermediate visual canvas state using the rendering engine; The visual self-feedback injection module is used to visually encode the current intermediate visual canvas state and then combine it with the user condition input and the current accumulated vector code sequence to construct the context for the next round. The output assembly module is used to determine whether the termination condition is met. If it is, the vector graphic code corresponding to the current cumulative vector code sequence is output; otherwise, the local code generation module is returned based on the context of the next round.

[0051] Since the principle by which this system solves the problem is similar to that of the aforementioned vector graphics generation method based on closed-loop rendering visual self-feedback, the implementation of the system can refer to the implementation of the method, and repeated details are omitted.

[0052] Based on the same inventive concept, embodiments of the present invention also provide a computer-readable storage medium storing a computer program thereon, which, when executed by a processor, implements the above-described vector graphics generation method based on closed-loop rendering visual self-feedback.

[0053] Based on the same inventive concept, embodiments of the present invention also provide a computer device, including a memory, a processor, and a computer program stored in the memory and executable on the processor. When the processor executes the computer program, it implements the above-described vector graphics generation method based on closed-loop rendering visual self-feedback.

[0054] The vector graphics generation method based on closed-loop rendering visual self-feedback provided by the present invention is illustrated below with a specific embodiment.

[0055] 1. Overall Implementation Process: As shown in Figure 2, the vector graphics generation process provided in this embodiment of the invention includes input condition reception, step-by-step local code generation, intermediate canvas rendering, visual feedback injection, rendering and verification gating, and final SVG output. Figure 2(a) shows a schematic diagram of the closed-loop rendering visual self-feedback generation process of the present invention: after the model outputs local SVG code at each step, the current intermediate canvas is rendered in real time and fed back to the model so that it can generate the next step of code. Figure 2(b) shows a schematic diagram of the traditional open-loop direct generation process, in which the model generates subsequent SVG code based solely on the text history or previous code, without intermediate visual feedback, and is therefore more prone to geometric errors, occlusion relationship errors, and visual quality degradation.

[0056] Specifically, let the user's conditional input be P. In text-to-vector graphics tasks, P is a natural language prompt; in image-to-vector graphics tasks, P is a reference target image; in some hybrid embodiments, P can include both a text prompt and a reference image. The system first feeds the user's conditional input into the multimodal large language model as the initial current context, and the model outputs the first local vector code fragment $C_1$. Then, using an SVG rendering engine, $C_1$ is rendered as an intermediate visual canvas image $I_1$, and $I_1$ is input into the model again as visual context, after which the model generates the second local vector code fragment. This process continues until the model outputs the end marker <end> or an external termination condition is met.

[0057] Specifically, the local vector code fragment generated in step 1 is denoted $C_1$, and the intermediate visual canvas image obtained by accumulating and rendering $C_1$ is denoted $I_1$; at any subsequent time step, the corresponding intermediate visual canvas image is obtained by rendering the local vector code fragments accumulated up to that time step.

[0058] In this process, the effective context of the model at any time step t can be expressed as:

$$\left(P,\; C_1,\; I_1,\; C_2,\; I_2,\; \dots,\; C_{t-1},\; I_{t-1}\right)$$

[0059] where $C_i$ denotes the local vector code fragment generated at time step $i$, and $I_i$ denotes the intermediate visual canvas image obtained by rendering the code accumulated up to time step $i$. Therefore, this invention is not based on the single-step translation paradigm of "outputting the complete SVG at once," but rather establishes a multi-round local drawing logic that uses intermediate visual states as a bridge. This mechanism is highly consistent with the human drawing process of "drawing a stroke, looking at it, and then drawing the next stroke," and is particularly suitable for geometric problems that can only be judged from the current canvas, such as occlusion, primitive stacking, and structural closure.

[0060] 2. Example 1: Construction of Visual Self-Feedback Training Data: As shown in Figure 3, in order to enable the multimodal large language model to learn to predict the next local code fragment based on the current canvas state, the embodiments of the present invention first reconstruct the traditional "user conditional input - complete SVG code" sample into a multi-step visual trajectory sample that interleaves "user conditional input - local code fragment - intermediate canvas". Figure 3(a) shows the intermediate drawing state without path splitting: multiple visually independent elements are compressed into a single complex path and appear all at once, so the intermediate visual states are sparse. Figure 3(b) shows the intermediate rendering state after fine-grained path decomposition: different local structures are split into atomic sub-paths that are drawn sequentially, so the intermediate visual feedback is denser, which helps the model gradually learn the spatial structure and rendering order.

[0061] In one embodiment, the training data comes from a publicly available SVG dataset. Raw SVG files are often stored in compressed form, so a single long <path> element often contains multiple visually independent sub-regions. If training samples are constructed directly at the original path granularity, the intermediate visual states will be too sparse, making it difficult for the model to learn fine-grained incremental rendering behavior. Therefore, this embodiment of the invention first performs fine-grained path decomposition on long paths.

[0062] Specifically, the path decomposition process includes the following steps: (1) Sub-path extraction. Parse the drawing commands in the d attribute, and use the non-continuous starting commands as the dividing boundaries to decompose the long path into multiple candidate sub-paths.

[0063] (2) Topological dependency analysis. Candidate sub-paths are treated as two-dimensional polygonal regions. Their inclusion relationships, intersection relationships, and transparency overlay constraints are analyzed, and a dependency graph is constructed accordingly. If two sub-paths have a hole definition relationship, a non-transparent overlapping relationship, or a rendering semantic coupling relationship, a connection is established in the dependency graph.

[0064] (3) Connected component merging. Based on the topological dependency graph, candidate sub-paths located in the same connected component are merged again, so that regions with common rendering semantics are still retained as the same atomic path unit, thereby avoiding the destruction of the original filling rules, hole structure or local transparent overlay due to mechanical splitting.
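The three decomposition steps above can be sketched as follows, under strong simplifying assumptions that are not part of the source: path data contains only absolute `M`, `L`, and `Z` commands, and bounding-box overlap is used as a coarse stand-in for the inclusion, intersection, and transparency analysis. All function names are hypothetical.

```python
import re
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (min_x, min_y, max_x, max_y)


def split_subpaths(d: str) -> List[str]:
    """Step 1: split path data at each absolute moveto (M) command."""
    return [s.strip() for s in re.findall(r"M[^M]*", d)]


def bbox(subpath: str) -> Box:
    """Bounding box of a sub-path made of absolute x/y coordinate pairs."""
    nums = [float(n) for n in re.findall(r"-?\d+(?:\.\d+)?", subpath)]
    xs, ys = nums[0::2], nums[1::2]
    return min(xs), min(ys), max(xs), max(ys)


def overlaps(a: Box, b: Box) -> bool:
    """Step 2 (proxy): two sub-paths depend on each other if boxes overlap."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def atomic_units(d: str) -> List[str]:
    """Step 3: merge sub-paths in the same connected component."""
    subs = split_subpaths(d)
    boxes = [bbox(s) for s in subs]
    parent = list(range(len(subs)))      # union-find forest

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(subs)):
        for j in range(i + 1, len(subs)):
            if overlaps(boxes[i], boxes[j]):
                parent[find(j)] = find(i)

    groups: Dict[int, List[str]] = {}
    for i, s in enumerate(subs):
        groups.setdefault(find(i), []).append(s)
    return [" ".join(g) for g in groups.values()]


# A square with a hole (two coupled sub-paths) plus a separate triangle.
d_attr = ("M0 0 L10 0 L10 10 L0 10 Z M2 2 L8 2 L8 8 L2 8 Z "
          "M20 20 L30 20 L30 30 Z")
units = atomic_units(d_attr)
```

Here the outer square and its hole stay together as one atomic unit, while the triangle becomes a separate unit. A real implementation would also need to handle relative commands, curves, and fill rules.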

[0065] After the above processing, each SVG sample is no longer composed of a small number of ultra-long paths, but is transformed into multiple local path units that are more suitable for gradual rendering and learning. In one specific embodiment, the average number of path units is increased from about 4 to about 6, and the density of intermediate canvas states is significantly improved.

[0066] After completing the path decomposition, the system constructs a training sequence according to the cumulative rendering order of the path units. Specifically, at step t, the first t local code fragments are accumulated and rendered into a canvas image $I_t$, which is inserted before the code fragment of step t+1 to form an interleaved multi-round sequence:

$$S = \left(P,\; C_1,\; I_1,\; C_2,\; I_2,\; \dots,\; C_N,\; I_N,\; \texttt{<end>}\right)$$

[0067] where S denotes the multi-round visual instruction sequence; P denotes the user condition information; N denotes the total number of steps in the complete generation process; $C_N$ denotes the local vector code fragment generated at time step N; $I_N$ denotes the intermediate visual canvas image obtained by rendering the code accumulated up to time step N; and <end> denotes the end marker.
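Building the interleaved sequence from a decomposed sample can be sketched as below. The role labels (`"cond"`, `"code"`, `"canvas"`, `"end"`), the function name, and the string-returning `render` stand-in are assumptions for illustration; a real pipeline would rasterize the accumulated SVG instead.

```python
from typing import Callable, List, Sequence, Tuple

Item = Tuple[str, object]  # (role, payload)


def build_sequence(
    prompt: str,
    fragments: Sequence[str],
    render: Callable[[Sequence[str]], object],
) -> List[Item]:
    """Interleave condition, code fragments, and cumulative canvases."""
    seq: List[Item] = [("cond", prompt)]
    for t in range(1, len(fragments) + 1):
        seq.append(("code", fragments[t - 1]))           # fragment of step t
        seq.append(("canvas", render(fragments[:t])))    # canvas after step t
    seq.append(("end", "<end>"))
    return seq


# Stand-in renderer: labels the canvas by how many fragments it contains.
seq = build_sequence(
    "a sun",
    ["<rect/>", "<circle/>"],
    lambda frags: f"canvas[{len(frags)}]",
)
roles = [role for role, _ in seq]
```

Each canvas entry depends only on the fragments accumulated so far, so the same routine covers samples of any length.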

[0068] 3. Example 2: Visual self-feedback training method: After completing the construction of the interwoven visual trajectory dataset, this embodiment of the invention uses a multi-round visual instruction fine-tuning method to train a multimodal large language model.

[0069] In one embodiment, the model's visual encoder encodes each intermediate visual canvas image $I_i$, and the resulting visual tokens are inserted into the text context. During training, given the historical context, the model must predict the next local vector code fragment and output <end> at the appropriate time.

[0070] Furthermore, the model outputs the end marker <end> when it determines, based on the difference between the target visual effect corresponding to the user's conditional input and the current intermediate visual canvas state, that the current cumulative canvas already contains the main visible structures and that further generation would lead to redundant overdraw.

[0071] Unlike traditional sentence-level supervision, the embodiments of this invention preferably apply the loss only to the output portion that the model actually needs to generate: autoregressive supervision is applied to each local vector code fragment $C_i$ and to the end marker <end>, while no loss is imposed on the user conditional input P or on the intermediate canvas image tokens. This training method forces the model to learn: (1) which structures are still missing between the current canvas and the target condition; (2) how to convert the missing structure into the correct next-step geometric code; (3) when to stop generating once the canvas is close enough to the target, so as to avoid overdrawing.
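The selective supervision described above amounts to a per-token output mask. A minimal sketch, assuming each segment of the sequence is tagged with a role and already tokenized (the role names and `build_loss_mask` are hypothetical):

```python
from typing import List, Sequence, Tuple

SUPERVISED_ROLES = ("code", "end")  # fragments and end marker receive loss


def build_loss_mask(segments: Sequence[Tuple[str, Sequence[str]]]) -> List[int]:
    """Return 1 for tokens the model must generate, 0 for context tokens."""
    mask: List[int] = []
    for role, tokens in segments:
        flag = 1 if role in SUPERVISED_ROLES else 0
        mask.extend([flag] * len(tokens))
    return mask


segments = [
    ("cond", ["a", "mountain"]),        # user conditional input: no loss
    ("code", ["<rect", "/>"]),          # local vector code fragment: loss
    ("canvas", ["v0", "v1", "v2"]),     # visual tokens: no loss
    ("end", ["<end>"]),                 # end marker: loss
]
mask = build_loss_mask(segments)
```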

[0072] In a preferred embodiment, the multimodal large language model is trained using the following loss function:

$$\mathcal{L}_{\mathrm{AR}} = -\frac{1}{\sum_{j=1}^{T} m_j} \sum_{k=1}^{T} m_k \log p_\theta\left(s_k \mid s_{<k}\right)$$

[0073] where $\mathcal{L}_{\mathrm{AR}}$ denotes the autoregressive prediction loss; $T$ denotes the total token length of the multi-round visual instruction sequence; $j$ is the index used when summing over all tokens; $\sum_{j=1}^{T} m_j$ is the total number of output tokens that enter the loss calculation; $k$ is the index of the currently predicted token; $s_k$ is the $k$-th token in the multi-round visual instruction sequence; $p_\theta(s_k \mid s_{<k})$ is the probability that the multimodal large language model assigns to the $k$-th token given the preceding tokens; and $m_k$ is the output mask, where $m_k = 1$ indicates that the $k$-th token belongs to a local vector code fragment or the end marker, and $m_k = 0$ indicates that it belongs to the user conditional input or an intermediate visual canvas image. Because the loss is applied to the end marker, the model can learn to compare the target visual effect corresponding to the user conditional input with the current intermediate visual canvas state, and thereby learn when to stop generating.
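As a numeric illustration of the masked loss defined above, the sketch below averages the negative log-probability over output tokens only. It works on plain Python lists of per-token probabilities; the function name and the toy numbers are assumptions, not values from the source.

```python
import math
from typing import Sequence


def masked_ar_loss(token_probs: Sequence[float], mask: Sequence[int]) -> float:
    """Mean negative log-likelihood over tokens whose mask entry is 1."""
    n_out = sum(mask)
    if n_out == 0:
        raise ValueError("mask selects no output tokens")
    # Context tokens (mask 0) are skipped entirely, so they never
    # contribute to the loss regardless of their predicted probability.
    nll = -sum(math.log(p) for p, m in zip(token_probs, mask) if m)
    return nll / n_out


# Four tokens; only the second and fourth are output tokens.
loss = masked_ar_loss([0.9, 0.5, 0.25, 0.5], [0, 1, 0, 1])
```

With both supervised tokens predicted at probability 0.5, the loss equals ln 2, independent of the probabilities at the masked-out positions.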

[0074] It is worth emphasizing that this invention discovers that simply providing an intermediate canvas to an untrained base model is insufficient to improve generation quality and may even lead to performance degradation. This is because the pre-trained model has not yet established an intrinsic mapping between the "visual intermediate state and incremental geometric code." Therefore, the aforementioned visual self-feedback training strategy is a key component in realizing the closed-loop rendering paradigm.

[0075] 4. Example 3: Closed-loop rendering inference method: After completing the visual self-feedback training, this invention employs a closed-loop rendering generation method during the inference phase.

[0076] For text-to-vector graphics generation tasks, the system first receives natural language prompts, such as "a mountain-style icon with a blue sky and sun." Based on the prompts, the model outputs the first step of local primitive code, such as a rectangle or path corresponding to the sky background. The current canvas is then rendered and re-injected into the model context. The model then continues to generate subsequent local code snippets such as the sun, mountains, and clouds based on the new context.
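For the mountain icon example, the accepted fragments could be assembled into a final SVG document roughly as follows. The fragment contents, the 224 viewBox size, and `assemble_svg` are illustrative assumptions; the well-formedness check uses the standard-library XML parser.

```python
import xml.etree.ElementTree as ET
from typing import Sequence


def assemble_svg(fragments: Sequence[str], size: int = 224) -> str:
    """Wrap accepted fragments, in drawing order, into one SVG document."""
    body = "\n".join(fragments)
    svg = (
        f'<svg xmlns="http://www.w3.org/2000/svg" '
        f'viewBox="0 0 {size} {size}">\n{body}\n</svg>'
    )
    ET.fromstring(svg)  # raises ParseError if the result is not well-formed
    return svg


# Background first, then foreground, mirroring the order in the text.
fragments = [
    '<rect width="224" height="224" fill="#87ceeb"/>',           # sky
    '<circle cx="170" cy="50" r="24" fill="#ffd700"/>',          # sun
    '<path d="M0 224 L90 90 L180 224 Z" fill="#556b2f"/>',       # mountain
]
svg = assemble_svg(fragments)
```

Because later elements paint over earlier ones in SVG, keeping the accepted order preserves the occlusion relationships the model saw on the intermediate canvases.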

[0077] For image-to-vector graphics reconstruction tasks, the system first receives a target reference image. The model then needs to progressively reconstruct vector graphics code consistent with the reference image using the same closed-loop method. In this task, the intermediate canvas serves not only as feedback on the content already drawn by the model but also as a basis for visual comparison with the reference target.

[0078] The closed-loop reasoning of this invention has the following advantages: (1) The model can see the image it has generated at each step, so it is easier to follow the natural drawing order from background to foreground and from large outline to local details; (2) When certain areas have been correctly drawn or completely occluded, the model can avoid meaningless redrawing based on the current canvas; (3) For complex illustrations with multiple layers and partial occlusion, the model can rely on the intermediate visual state to determine where the next stroke should be placed, rather than blindly relying on the long text history.

[0079] 5. Example 4: Rendering and Verification Gating Mechanism: To prevent duplicate outlining, invalid overlay, complete occlusion of primitives, or out-of-bounds primitives during closed-loop inference, this invention introduces a rendering-and-verify (RaV) gating mechanism in the inference phase.

[0080] At step t, the model first proposes the current candidate vector code fragment $\hat{C}_t$. The system does not accept the candidate immediately; it first combines the candidate with the cumulative vector code sequence $C_{1:t-1}$ (i.e., the historically accepted fragments) and renders the predicted canvas state $\hat{I}_t$:

$$\hat{I}_t = \mathrm{Render}\left(C_{1:t-1} \oplus \hat{C}_t\right)$$

[0081] Two types of detection are then performed: (1) Visual difference detection: The system calculates the pixel-level visual difference value between the intermediate visual canvas state $I_{t-1}$ of the previous time step and the predicted canvas state $\hat{I}_t$:

$$\Delta\left(I_{t-1}, \hat{I}_t\right) = \frac{1}{H \cdot W} \sum_{h=1}^{H} \sum_{w=1}^{W} \left\lVert I_{t-1}(h, w) - \hat{I}_t(h, w) \right\rVert$$

[0082] where $\Delta(I_{t-1}, \hat{I}_t)$ denotes the average pixel difference between the intermediate visual canvas state $I_{t-1}$ of the previous time step (i.e., step t-1) and the predicted canvas state $\hat{I}_t$; $\hat{I}_t$ denotes the predicted canvas state obtained by combining the current candidate vector code fragment with the historically accepted fragments; $H$ and $W$ denote the height and width of the canvas resolution, respectively; $I_{t-1}$ is the intermediate visual canvas state of height $H$ and width $W$ generated at step t-1; and $\hat{I}_t$ is the predicted canvas state of height $H$ and width $W$ generated at step t.

[0083] If the pixel-level visual difference value is lower than the preset visual increment threshold ε, it means that the current candidate segment has almost no effective visual contribution to the canvas, which may fall under one of the following situations: 1) it is drawn repeatedly with an existing path; 2) the drawn content is completely occluded by an existing foreground; 3) the drawing position is outside the visible canvas; 4) it only introduces a very small and meaningless noise change. In this case, the candidate segment is rejected.
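The visual difference check can be sketched on tiny grayscale canvases represented as nested lists with values in [0, 1]. The use of the mean absolute difference, the default `eps` value, and the function names are assumptions; the source only specifies an average pixel difference compared against a preset threshold ε.

```python
from typing import List

Canvas = List[List[float]]  # H rows of W grayscale values in [0, 1]


def mean_abs_diff(prev: Canvas, pred: Canvas) -> float:
    """Average per-pixel absolute difference between two same-size canvases."""
    h, w = len(prev), len(prev[0])
    total = sum(
        abs(prev[y][x] - pred[y][x]) for y in range(h) for x in range(w)
    )
    return total / (h * w)


def passes_visual_gate(prev: Canvas, pred: Canvas, eps: float = 0.01) -> bool:
    """Reject candidates whose visual contribution is below the threshold."""
    return mean_abs_diff(prev, pred) >= eps


blank = [[0.0] * 4 for _ in range(4)]
one_pixel = [row[:] for row in blank]
one_pixel[1][2] = 1.0                      # candidate changes a single pixel

delta = mean_abs_diff(blank, one_pixel)    # 1 changed pixel out of 16
accepted = passes_visual_gate(blank, one_pixel)
rejected = not passes_visual_gate(blank, blank)   # fully occluded or redundant
```

A candidate that is fully occluded, out of bounds, or an exact redraw leaves the canvas unchanged, so its difference is zero and it fails the gate for any positive threshold.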

[0084] (2) Repeatability test: The system further computes the structural similarity or string similarity between the current candidate vector code fragment $\hat{C}_t$ and the candidate vector code fragment $C_{t-1}$ that passed the validity check (i.e., was accepted) at the previous time step. If the similarity is higher than the preset repetition threshold $\tau_{sim}$, the model is considered to have entered a repetitive generation loop, such as repeatedly redrawing the same leaf outline, the same boundary, or the same local shadow, and the candidate fragment is rejected.
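One way to realize the string-similarity variant of this test is Python's standard `difflib.SequenceMatcher`, as sketched below. The choice of this similarity measure, the default threshold of 0.9, and the function name are assumptions; the source does not fix a particular metric.

```python
from difflib import SequenceMatcher
from typing import Optional


def is_repetition(
    candidate: str,
    last_accepted: Optional[str],
    tau_sim: float = 0.9,
) -> bool:
    """Flag a candidate that is nearly identical to the last accepted one."""
    if last_accepted is None:          # nothing accepted yet
        return False
    ratio = SequenceMatcher(None, candidate, last_accepted).ratio()
    return ratio > tau_sim


circle = '<circle cx="5" cy="5" r="3"/>'
rect = '<rect x="0" y="0" width="10" height="10"/>'

same = is_repetition(circle, circle)       # identical fragment
different = is_repetition(circle, rect)    # unrelated fragment
first = is_repetition(circle, None)        # first step, no history
```

A structural comparison (for example, on parsed element type and quantized coordinates) would be more robust to formatting differences than raw string similarity.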

[0085] In a preferred embodiment, if a candidate fragment is rejected, the system may trigger any of the following strategies: 1) Resample to obtain new candidate vector code snippets; 2) Skip the current processing step to proceed to the next step; 3) Trigger early termination (e.g., trigger rejection multiple times in a row) to end the generation process.
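How the three responses are chosen is left open in the text. One possible policy, sketched below, escalates with the number of consecutive rejections; the thresholds and the function name are purely hypothetical.

```python
def on_rejection(
    consecutive_rejections: int,
    max_resamples: int = 2,
    max_rejections: int = 4,
) -> str:
    """Pick a response to a rejected candidate from the iteration state."""
    if consecutive_rejections >= max_rejections:
        return "terminate"     # repeated failures: end generation early
    if consecutive_rejections <= max_resamples:
        return "resample"      # draw a new candidate for the same step
    return "skip"              # give up on this step and move on


decisions = [on_rejection(n) for n in range(1, 5)]
```

Resampling first keeps the step's intent, while the termination fallback bounds the cost of a model stuck in a repetition loop.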

[0086] By leveraging the aforementioned gating mechanism, this invention can significantly improve the stability of closed-loop generation and the simplicity of the final SVG code without introducing additional training burden.

[0087] 6. Example 5: System Implementation Method: This invention also provides a vector graphics generation system based on closed-loop rendering visual self-feedback, including but not limited to the following modules: The conditional input receiving module is used to receive user conditional input and construct an initial current context based on the user conditional input; The local code generation module is used to input the current context into the trained multimodal large language model and generate the current candidate vector code fragments through autoregression. The rendering and verification gating module is used to perform validity checks on the current candidate vector code fragments through the rendering and verification gating mechanism; and to merge the current candidate vector code fragments that pass the validity check into the cumulative vector code sequence to form the current cumulative vector code sequence. The intermediate canvas rendering module is used to render the current accumulated vector code sequence into the current intermediate visual canvas state using the rendering engine; The visual self-feedback injection module is used to visually encode the current intermediate visual canvas state and then combine it with the user condition input and the current accumulated vector code sequence to construct the context for the next round. The output assembly module is used to determine whether the termination condition is met. If it is, the vector graphic code corresponding to the current cumulative vector code sequence is output; otherwise, the local code generation module is returned based on the context of the next round.

[0088] In practical applications, this system can be deployed within a multimodal large language model inference framework with image encoding and text generation capabilities, and intermediate canvas rendering is achieved through a standard SVG rasterization engine. A fixed-resolution intermediate canvas, such as 224×224, can be used during the training phase; during the inference phase, rendering can be switched to a higher resolution depending on the application scenario. The preset visual increment threshold ε and the preset repetition threshold $\tau_{sim}$ can be set to empirical values or tuned on a validation set, depending on the target scenario.

[0089] 7. Example 6: Effect Description: (1) Experiments on the Illustration subset show that the Naive VSF scheme, which feeds the intermediate canvas to the model without VSF training, worsens LPIPS from 0.345 to 0.388. This indicates that simply connecting the intermediate visual canvas to the base model does not improve performance; the mapping "current canvas state - next local code fragment" must be established through visual self-feedback training.

[0090] (2) The embodiments of the present invention verified the text-to-vector graphics generation and image-to-vector graphics reconstruction capabilities on the standard MMSVGBench benchmark. Specifically, in the Text-to-SVG Icon subset experiment, the present invention, trained on only about 0.85M samples, is compared with OmniSVG trained on 2M samples: the FID decreases from 128.80 to 127.64 and the CLIP Score increases from 0.291 to 0.293. At the same time, the average number of <path> elements per sample increases from about 4 to about 6 after fine-grained path decomposition. These results demonstrate the advantages of this invention in data efficiency and intermediate visual state density: using only about 0.85M training samples, it matches or exceeds the performance of a larger-scale open-loop baseline model.

[0091] For example, in the task of converting icon-type text to vector graphics, the method of this invention achieves a lower FID index and excellent results in aesthetic scoring and human preference scoring; in the task of converting complex illustration-type images to vector graphics, the method of this invention achieves higher DINO similarity, lower LPIPS and MSE, indicating that it has significant advantages in maintaining semantic structure consistency and local geometric fidelity.

[0092] Furthermore, ablation experiments showed that: (1) If visual self-feedback training is not performed, the model will have difficulty understanding the intermediate canvas correctly, and may easily miss key local structures or produce semantic illusions. (2) If the rendering and verification gating mechanism is not executed, the model is prone to getting stuck in the repeated generation loop, resulting in incomplete graphics or redundant code. (3) The complete solution of the present invention is superior to the traditional open-loop solution in terms of data efficiency, structural coherence, occlusion handling and editability.

[0093] In summary, this invention provides a vector graphics generation technology that truly leverages the visual prior capabilities of multimodal large language models, possessing clear innovation and engineering application value.

[0094] The various embodiments in this specification are described in a progressive manner, with each embodiment focusing on its differences from other embodiments. Similar or identical parts between embodiments can be referred to interchangeably. For the apparatus disclosed in the embodiments, since they correspond to the methods disclosed in the embodiments, the description is relatively simple; relevant parts can be referred to the method section.

[0095] The above description of the disclosed embodiments enables those skilled in the art to make or use the invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of the invention. Therefore, the invention is not to be limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims

1. A vector graphics generation method based on closed-loop rendering visual self-feedback, characterized in that the method comprises: S1. receiving user conditional input and constructing an initial current context based on the user conditional input; S2. inputting the current context into the trained multimodal large language model and generating the current candidate vector code fragment through autoregression; S3. checking the validity of the current candidate vector code fragment through a rendering and verification gating mechanism, and, if the current candidate vector code fragment passes the validity check, incorporating it into the cumulative vector code sequence to form the current cumulative vector code sequence; S4. using the rendering engine to render the current cumulative vector code sequence into the current intermediate visual canvas state; S5. after visually encoding the current intermediate visual canvas state, constructing the context for the next round together with the user conditional input and the current cumulative vector code sequence; S6. determining whether the termination condition is met; if so, outputting the vector graphic code corresponding to the current cumulative vector code sequence; otherwise, returning to S2 based on the context of the next round.

2. The vector graphics generation method based on closed-loop rendering visual self-feedback as described in claim 1, characterized in that, The training steps of the multimodal large language model include: Obtain the original vector graphics sample, and perform deduplication and structural standardization on the original vector graphics sample; Based on the processed original vector graphic samples, extract the corresponding user condition information; Fine-grained path decomposition is performed on the vector paths in the processed original vector graphics sample to obtain multiple atomic sub-paths expanded in the rendering order, and each atomic sub-path is converted into a local vector code fragment; Following the cumulative drawing order, each local vector code fragment is superimposed step by step, and the result after each superposition is rendered on an intermediate canvas to generate an intermediate visual canvas image corresponding to the cumulative local vector code fragments at the current time step. The user condition information, local vector code fragments, and corresponding intermediate visual canvas images are interwoven and arranged to construct a multi-round visual instruction sequence; The multimodal large language model is trained using visual self-feedback based on the multi-round visual instruction sequence. During the training process, autoregressive prediction loss is applied only to the local vector code fragments and the generated end markers to train the multimodal large language model's ability to predict the next local vector code fragment and output the end marker based on the current intermediate visual canvas image.

3. The vector graphics generation method based on closed-loop rendering visual self-feedback as described in claim 2, characterized in that, The process involves fine-grained path decomposition of the vector paths in the processed original vector graphics sample to obtain multiple atomic sub-paths expanded in the rendering order, including: The drawing commands in the path data attributes of the vector path are parsed, and multiple candidate sub-paths are extracted with non-continuous start commands as boundaries. Each candidate sub-path is mapped to a two-dimensional geometric region, and a topological dependency graph is established based on inclusion, intersection and transparency superposition constraints. Based on the topological dependency graph, candidate sub-paths located in the same connected component are re-merged into an atomic sub-path that maintains consistent rendering semantics.

4. The vector graphics generation method based on closed-loop rendering visual self-feedback as described in claim 2, characterized in that the multi-round visual instruction sequence adopts the following interleaved structure: $$S = \left(P,\; C_1,\; I_1,\; C_2,\; I_2,\; \dots,\; C_N,\; I_N,\; \texttt{<end>}\right)$$ where S denotes the multi-round visual instruction sequence; P denotes the user condition information; N denotes the total number of steps in the complete generation process; $C_N$ denotes the local vector code fragment generated at time step N; $I_N$ denotes the intermediate visual canvas image obtained by rendering the code accumulated up to time step N; and <end> denotes the end marker.

5. The vector graphics generation method based on closed-loop rendering visual self-feedback as described in claim 2, characterized in that the autoregressive prediction loss is expressed as: $$\mathcal{L}_{\mathrm{AR}} = -\frac{1}{\sum_{j=1}^{T} m_j} \sum_{k=1}^{T} m_k \log p_\theta\left(s_k \mid s_{<k}\right)$$ where $\mathcal{L}_{\mathrm{AR}}$ denotes the autoregressive prediction loss; $T$ denotes the total token length of the multi-round visual instruction sequence; $j$ is the index used when summing over all tokens; $\sum_{j=1}^{T} m_j$ is the total number of output tokens that enter the loss calculation; $k$ is the index of the currently predicted token; $s_k$ is the $k$-th token in the multi-round visual instruction sequence; $p_\theta(s_k \mid s_{<k})$ is the probability that the multimodal large language model assigns to the $k$-th token given the preceding tokens; and $m_k$ is the output mask, where $m_k = 1$ indicates that the $k$-th token belongs to a local vector code fragment or the end marker, and $m_k = 0$ indicates that it belongs to the user conditional input or an intermediate visual canvas image.

6. The vector graphics generation method based on closed-loop rendering visual self-feedback as described in claim 1, characterized in that, The validity detection of the current candidate vector code fragment through the rendering and verification gating mechanism specifically includes: After incorporating the current candidate vector code fragment into the cumulative vector code sequence, the rendering engine is used to generate the predicted canvas state; Calculate the pixel-level visual difference between the predicted canvas state and the intermediate visual canvas state of the previous time step; When the pixel-level visual difference value is lower than the preset visual increment threshold, the current candidate vector code fragment is determined to be a degenerate primitive, a completely occluded primitive, an invalid out-of-bounds primitive, or a redundant primitive, and is considered to have failed the validity detection. Calculate the structural similarity or string similarity between the current candidate vector code fragment and the candidate vector code fragment that passed the validity test in the previous time step; When the structural similarity or string similarity is higher than the preset repetition threshold, the current candidate vector code fragment is determined to trigger a repetitive drawing loop and is considered to have failed the validity test.

7. The vector graphics generation method based on closed-loop rendering visual self-feedback as described in claim 1, characterized in that, in step S3, if the current candidate vector code fragment fails the validity detection, at least one of the following operations is performed based on the current iteration information: resampling to obtain a new candidate vector code fragment; skipping the current step and proceeding to the next step; or triggering early termination to end the generation process.
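
Claim 7 does not fix how the three fallback operations are selected. One possible policy, shown purely as an assumption, escalates from resampling to skipping to early termination based on two counters taken as the "current iteration information". The names `Action` and `choose_action` and the limits `max_resamples` and `max_skips` are hypothetical.

```python
from enum import Enum


class Action(Enum):
    RESAMPLE = "resample"
    SKIP = "skip"
    TERMINATE = "terminate"


def choose_action(resamples_this_step, consecutive_skips,
                  max_resamples=3, max_skips=2):
    """Pick a fallback after a candidate fragment fails validity detection.

    Escalation order (an illustrative choice, not taken from the claim):
    resample until the per-step budget is used up, then skip the step,
    then terminate once too many steps in a row have been skipped.
    """
    if resamples_this_step < max_resamples:
        return Action.RESAMPLE
    if consecutive_skips < max_skips:
        return Action.SKIP
    return Action.TERMINATE
```

Bounding both counters guarantees that generation stops even when the model keeps proposing invalid fragments.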

8. The vector graphics generation method based on closed-loop rendering visual self-feedback as described in claim 1, characterized in that the user conditional input is a natural language prompt in the text-to-vector graphics task and a reference image in the image-to-vector graphics task; in both the text-to-vector graphics task and the image-to-vector graphics task, the intermediate visual canvas state is used as visual feedback.

9. A vector graphics generation system based on closed-loop rendering visual self-feedback, characterized in that the system is used to implement the vector graphics generation method based on closed-loop rendering visual self-feedback as described in any one of claims 1-8, the system comprising: a conditional input receiving module, used to receive the user conditional input and construct an initial current context based on the user conditional input; a local code generation module, used to input the current context into the trained multimodal large language model and generate the current candidate vector code fragment autoregressively; a rendering and verification gating module, used to perform validity detection on the current candidate vector code fragment through the rendering and verification gating mechanism, and to incorporate a current candidate vector code fragment that passes the validity detection into the cumulative vector code sequence to form the current cumulative vector code sequence; an intermediate canvas rendering module, used to render the current cumulative vector code sequence into the current intermediate visual canvas state using the rendering engine; a visual self-feedback injection module, used to visually encode the current intermediate visual canvas state and combine it with the user conditional input and the current cumulative vector code sequence to construct the context for the next round; and an output assembly module, used to determine whether the termination condition is met and, if so, to output the vector graphics code corresponding to the current cumulative vector code sequence, and otherwise to return to the local code generation module with the context for the next round.
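
The way the modules of claim 9 cooperate can be sketched as a single loop. This is a simplified illustration, not the claimed system: the model, the rendering engine, and the gating check are passed in as the hypothetical callables `generate_fragment`, `render`, and `is_valid`; the context is a plain dictionary rather than an encoded multimodal prompt; a rejected fragment is simply skipped; and `max_steps` is an assumed safety limit. The sketch also reuses the predicted canvas from the gating step as the next intermediate canvas, which is an assumption made to avoid rendering twice.

```python
END_MARKER = "<end>"
SVG_OPEN = '<svg xmlns="http://www.w3.org/2000/svg">'


def generate_svg(condition, generate_fragment, render, is_valid, max_steps=16):
    """Closed-loop generation: propose, gate, render, feed the canvas back.

    generate_fragment(context) -> str   stand-in for the multimodal model
    render(fragments) -> canvas         stand-in for the rendering engine
    is_valid(prev_canvas, predicted_canvas, fragment, accepted) -> bool
    """
    accepted = []                 # cumulative vector code sequence
    canvas = render(accepted)     # initial (empty) canvas state
    for _ in range(max_steps):
        # Visual self-feedback: the current canvas is part of the context.
        context = {
            "condition": condition,
            "code": tuple(accepted),
            "canvas": canvas,
        }
        fragment = generate_fragment(context)
        if fragment == END_MARKER:
            break                 # termination condition met
        predicted = render(accepted + [fragment])
        if not is_valid(canvas, predicted, fragment, accepted):
            continue              # rejected fragment is skipped
        accepted.append(fragment)
        canvas = predicted
    return SVG_OPEN + "".join(accepted) + "</svg>"
```

Because rejected fragments never enter `accepted`, the output contains only primitives that passed the gate, which is what keeps the resulting SVG compact.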

10. A computer device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that when the processor executes the computer program, it implements the vector graphics generation method based on closed-loop rendering visual self-feedback as described in any one of claims 1 to 7.