A document parsing method and model training method and device based on diffusion trajectory preference optimization
By optimizing the diffusion trajectory through a sliding diffusion block mechanism and online Monte Carlo trajectory generation, the problems of slow inference speed and low diffusion decoding accuracy of autoregressive models are solved, achieving efficient, accurate and stable document parsing.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- SOUTH CHINA UNIV OF TECH
- Filing Date
- 2026-03-10
- Publication Date
- 2026-06-12
AI Technical Summary
In existing technologies, autoregressive models suffer from slow inference speed, and existing diffusion decoding methods suffer from low accuracy and unstable convergence in complex document parsing.
We employ a sliding diffusion block mechanism combined with autoregressive priors to achieve local parallel diffusion generation through a sliding window. We also introduce online Monte Carlo trajectory generation and diffusion trajectory preference optimization, and use a reinforcement learning framework to evaluate immediate accuracy and look-ahead consistency rewards to optimize the diffusion trajectory.
While maintaining high recognition accuracy, it significantly improves inference speed, reduces model computational overhead, increases decoding efficiency, and enhances trajectory stability and generation accuracy, making it particularly suitable for parsing complex structured data.
Figure CN122197861A_ABST
Description
Technical Field
[0001] This invention relates to the fields of artificial intelligence and image processing technology, and in particular to a document parsing method, model training method, and apparatus based on diffusion trajectory preference optimization.
Background Technology
[0002] Multimodal large language models excel in document parsing tasks that convert images into structured text such as Markdown or LaTeX. However, mainstream models primarily rely on an autoregressive (AR) decoding paradigm, which generates output token by token. This sequential prediction mechanism leads to significant latency bottlenecks, especially when processing long documents containing large amounts of text, complex tables, and formulas, where slow inference speed becomes a major obstacle to real-time deployment.
[0003] To address this issue, diffusion large language models (dLLMs) have emerged as a non-autoregressive alternative, achieving parallel token generation through iterative denoising. However, existing diffusion-based parallel decoding methods face two major challenges in document parsing tasks: first, fully parallel diffusion often results in recognition accuracy lower than the autoregressive baseline and struggles to handle high-density document structures; second, the performance of discrete diffusion heavily depends on the diffusion trajectory, and existing strategies based on fixed masks or heuristic n-gram priors cannot adapt to dynamically generated complex combinatorial spaces, leading to unstable decoding and low convergence efficiency.
Summary of the Invention
[0004] This invention aims to address the slow inference speed of existing autoregressive models and the low accuracy and unstable convergence of existing diffusion decoding methods in complex document parsing. It proposes a hybrid decoding framework called look-ahead diffusion-autoregressive. This framework utilizes a sliding diffusion block mechanism to achieve local parallel diffusion generation while preserving autoregressive priors (high recognition accuracy). During training, online Monte Carlo trajectory generation is introduced, dynamically constructing training samples based on real-time model states to mitigate exposure bias. A diffusion trajectory preference optimization is proposed, employing a reinforcement learning framework that combines immediate accuracy rewards and look-ahead consistency rewards to incentivize the model to select trajectories that are not only currently correct but also beneficial for future generation.
[0005] To achieve the above objectives, this invention provides a document parsing method based on diffusion trajectory preference optimization, comprising the following steps:
S1. Obtain the image features of the document to be parsed;
S2. Based on the document image features, construct a sliding window input that includes a stable context and a speculation window, where the speculation window contains initialization mask tokens;
S3. Based on the sliding window input, generate in parallel, through a discrete denoising mechanism, the probability distribution of the token at each position within the speculation window, to obtain the generated sequence of the current window;
S4. Verify the consistency between the generated sequence of the current window and the speculative sequence from the previous time step to determine the verified tokens;
S5. Update the stable context based on the verified tokens and, based on the updated stable context, slide the sliding window to perform decoding generation for the next time step.
[0006] Preferably, S1 includes: concatenating a stable context with a speculation window, the speculation window containing a currently generated speculative trajectory fragment and initialization mask tokens for filling mask positions.
[0007] Preferably, S2 includes: performing one model forward propagation, and using a discrete denoising mechanism to predict tokens in parallel at all positions within the speculation window, so as to simultaneously correct the speculative trajectory from the previous time step and fill in the new mask positions.
[0008] Preferably, S4 includes: starting from the beginning of the sliding window, comparing the window sequence generated at the current time step, token by token, with the speculative sequence generated at the previous time step until the first inconsistent token is found; and marking all tokens preceding the first inconsistent token as verified tokens.
[0009] Preferably, step S5 includes: determining the sliding step size based on the number of verified tokens, and moving the sliding window forward by the sliding step size; if the entire window passes verification, then moving it by a full window size.
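The verification and sliding rules described for S4 and S5 can be illustrated with a minimal Python sketch. The function names `count_verified` and `slide_step` are hypothetical and are not taken from the specification; tokens are represented as plain strings for readability.

```python
from typing import List


def count_verified(current: List[str], previous: List[str]) -> int:
    """Count leading positions where the current window output agrees with
    the speculative sequence from the previous time step (S4)."""
    verified = 0
    for cur_tok, prev_tok in zip(current, previous):
        if cur_tok != prev_tok:
            break  # first inconsistent token: stop accepting here
        verified += 1
    return verified


def slide_step(current: List[str], previous: List[str], window_size: int) -> int:
    """Sliding step for S5: the number of verified tokens, which equals the
    full window size when every position passes verification."""
    return min(count_verified(current, previous), window_size)
```

Under this reading, a window that is verified in full is simply the case in which the step equals the window size.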
[0010] This invention also provides a model training method, wherein the model trained by the method is used to execute the above-described parsing method during the inference phase. The method uses online Monte Carlo trajectory generation and includes the following steps:
Obtain the target sequence and randomly select the starting point of the window;
Divide the sliding window into a diffusion segment and an initial segment, and generate multiple candidate trajectories by random sampling within the diffusion segment;
Construct the training input based on the candidate trajectories and the initialization mask tokens;
Using the ground-truth sequence as the supervision target, train the model to recover correct results from incorrect speculations by minimizing the cross-entropy loss.
[0011] Preferably, after the step of constructing the training input based on the candidate trajectories, the method further includes a diffusion trajectory preference optimization step, including:
Calculate the immediate accuracy reward for each candidate trajectory, which is determined based on the length of the longest common prefix between the candidate trajectory and its corresponding ground-truth label;
Predict the content of the next adjacent window based on the candidate trajectory, and calculate the ratio of its common prefix with the future ground-truth label as the look-ahead consistency reward;
Based on the immediate accuracy reward and the look-ahead consistency reward, calculate a loss function using a reinforcement learning framework to optimize the model's selection of diffusion trajectories that benefit both current and future generation.
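As an illustration of the two rewards, the following Python sketch computes a longest-common-prefix (LCP) length and derives both rewards from it. The text states only that the immediate reward is "determined based on" the LCP length, so normalizing it by the label length is an assumption made here; all function names are hypothetical.

```python
from typing import Sequence


def lcp_length(a: Sequence[str], b: Sequence[str]) -> int:
    """Length of the longest common prefix of two token sequences."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def immediate_accuracy_reward(candidate: Sequence[str], truth: Sequence[str]) -> float:
    """LCP between a candidate trajectory and its ground-truth label,
    normalized by the label length (normalization is an assumption)."""
    if not truth:
        return 0.0
    return lcp_length(candidate, truth) / len(truth)


def lookahead_consistency_reward(predicted_next: Sequence[str],
                                 future_truth: Sequence[str]) -> float:
    """Ratio of the common prefix between the model's prediction for the
    next adjacent window and the future ground-truth label."""
    if not future_truth:
        return 0.0
    return lcp_length(predicted_next, future_truth) / len(future_truth)
```

A trajectory that matches the current label but leads to a wrong first token in the next window would score high on the first reward and zero on the second, which is the case the look-ahead term is meant to penalize.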
[0012] The present invention also provides a document parsing apparatus, comprising:
A feature extraction module, used to acquire the image of the document to be parsed and extract multimodal features;
A sliding diffusion decoding module, used to execute the sliding diffusion block mechanism, which combines autoregressive priors to generate diffusion trajectories in parallel within the sliding window;
A trajectory generation and optimization training module, used to perform online Monte Carlo trajectory generation during the training phase, and to calculate the diffusion trajectory preference optimization loss based on the immediate and look-ahead rewards to update the model parameters;
A verification and output module, used to verify the validity of the generated diffusion trajectory and output the final parsed text.
[0013] Compared with the prior art, the beneficial effects of the present invention are as follows: 1. The unity of high efficiency and accuracy: Through the sliding diffusion block mechanism, this invention achieves a significant improvement in inference speed (e.g., up to 2.3 times faster) while maintaining recognition accuracy comparable to or even higher than that of autoregressive expert models.
[0014] 2. Enhanced trajectory stability: Through online Monte Carlo trajectory generation, the training distribution is kept consistent with the dynamic distribution during inference, effectively solving the exposure bias problem in traditional non-autoregressive training.
[0015] 3. Look-ahead optimization: The diffusion trajectory preference optimization introduces a look-ahead consistency reward, which evaluates not only the accuracy of the current generation but also its contribution to the generation of future windows, thereby significantly reducing prediction uncertainty and accelerating the convergence of the sliding window. It is particularly suitable for parsing structured data such as formulas and tables.
[0016] 4. Strong generalization ability: This method can be used as a plug-and-play module in various multimodal language models of different sizes and architectures.
Description of the Drawings
[0017] To more clearly illustrate the technical solution of the present invention, the drawings used in the embodiments are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.
[0018] Figure 1 is a schematic diagram of the sliding diffusion block decoding process in an embodiment of the present invention;
Figure 2 is an overview diagram of the training process of the diffusion block decoding model in an embodiment of the present invention, including the online Monte Carlo trajectory generation and diffusion trajectory preference optimization modules;
Figure 3 is a schematic diagram of the joint optimization framework of reinforcement learning and instruction fine-tuning in an embodiment of the present invention.
Detailed Implementation
[0019] The embodiments of the present invention are described in detail below. Examples of these embodiments are shown in the accompanying drawings, wherein the same or similar reference numerals denote the same or similar elements or elements having the same or similar functions throughout. The embodiments described below with reference to the accompanying drawings are exemplary and are only used to explain the present invention, and should not be construed as limiting the present invention. The step numbers in the following embodiments are set only for ease of explanation, and there is no limitation on the order between the steps. The execution order of each step in the embodiments can be adaptively adjusted according to the understanding of those skilled in the art.
[0020] In the description of this invention, it should be understood that the orientation descriptions, such as up, down, front, back, left, right, etc., are based on the orientation or positional relationship shown in the accompanying drawings. They are only for the convenience of describing this invention and simplifying the description, and do not indicate or imply that the device or element referred to must have a specific orientation, or be constructed and operated in a specific orientation. Therefore, they should not be construed as limiting this invention.
[0021] In the description of this invention, "several" means one or more, "multiple" means two or more, "greater than," "less than," and "exceeding" are understood to exclude the stated number, while "above," "below," and "within" are understood to include the stated number. The use of "first" and "second" in the description is merely for distinguishing technical features and should not be construed as indicating or implying relative importance, or implicitly indicating the number of indicated technical features, or implicitly indicating the order of the indicated technical features.
[0022] Furthermore, in the description of this invention, unless otherwise stated, "multiple" means two or more. "And / or" describes the relationship between related objects, indicating that three relationships can exist. For example, A and / or B can represent: A alone, A and B simultaneously, or B alone. The character " / " generally indicates that the preceding and following related objects have an "or" relationship.
[0023] In the description of this invention, unless otherwise explicitly defined, terms such as "set up," "install," and "connect" should be interpreted broadly, and those skilled in the art can reasonably determine the specific meaning of the above terms in this invention in conjunction with the specific content of the technical solution.
Example 1
[0024] This embodiment describes in detail the specific execution steps of the present invention in the model inference stage, aiming to solve the decoding latency problem in long document parsing.
[0025] The method in this embodiment is based on a multimodal large language model (MLLM), which is configured to receive document image features as input and output structured text (such as Markdown or LaTeX). Figure 1 shows a schematic diagram of the sliding diffusion block decoding process. The specific steps are as follows:
S1. Obtain the image features of the document to be parsed.
[0026] At any time step during the decoding process, the model maintains a "stable context", which consists of the sequence of historical tokens that have been verified as correct.
[0027] S2. Based on the document image features, construct a sliding window input. The sliding window input includes a stable context and a speculation window, and the speculation window contains initialization mask tokens.
[0028] A fixed sliding window length is set (a preferred value is used in this embodiment). The model's input is the concatenation of two parts: the first part is the aforementioned stable context; the second part is the speculation window, which contains the currently generated speculative trajectory fragment and the initialization mask tokens (e.g., `<init>`).
[0029] S3. Based on the sliding window input, the probability distribution of the token at each position in the speculation window is generated in parallel through a discrete denoising mechanism to obtain the generated sequence of the current window.
[0030] The model performs one forward propagation and, through a discrete denoising mechanism, predicts in parallel the probability distribution of the token at each position within the window. This step simultaneously corrects the speculative trajectory from the previous round and fills in the new mask positions.
[0031] S4. Verify the consistency between the generated sequence of the current window and the speculative sequence from the previous time step to determine the verified tokens.
[0032] As shown in the verified-token matching process in Figure 1, the window sequence generated in the current step is compared with the speculative sequence generated in the previous time step.
[0033] Verification logic: Starting from the beginning of the window, compare token by token until the first inconsistent token is found.
[0034] Truncation and acceptance: Mark the sequence before the first inconsistent position as "verified tokens" and add it to the stable context.
[0035] Slide: Move the sliding window forward by a step equal to the number of verified tokens. If the entire window passes verification, move forward by the full window size.
[0036] S5. Update the stable context based on the verified tokens; based on the updated stable context, slide the sliding window to perform decoding generation for the next time step.
[0037] Through the above steps, this embodiment achieves a significant reduction in the number of forward propagations of the model while maintaining autoregressive accuracy, thereby accelerating the generation speed of decoded content.
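Steps S2 to S5 of this embodiment can be sketched as a decoding loop. The sketch below is a toy under stated assumptions, not the patented implementation: the real model conditions on document image features, which are omitted here; `denoise` is a hypothetical stand-in for one forward pass; `<eos>` is an assumed end-of-sequence marker; and `make_oracle` is an idealized denoiser that always predicts the ground truth, used only to make the loop runnable.

```python
from typing import Callable, List, Tuple

INIT = "<init>"  # initialization mask token named in the text
EOS = "<eos>"    # hypothetical end-of-sequence marker (assumption)

# One forward pass: (stable context, speculation window) -> tokens for the window.
Denoiser = Callable[[List[str], List[str]], List[str]]


def sliding_decode(denoise: Denoiser, window: int,
                   max_passes: int = 1000) -> Tuple[List[str], int]:
    """Return the decoded tokens and the number of forward passes used."""
    stable: List[str] = []
    speculation = [INIT] * window            # S2: window starts fully masked
    passes = 0
    while passes < max_passes:               # guard against non-convergence
        passes += 1
        current = denoise(stable, speculation)       # S3: parallel prediction
        verified = 0                                 # S4: prefix agreement
        for cur, prev in zip(current, speculation):
            if cur != prev:
                break
            verified += 1
        accepted = current[:verified]
        if EOS in accepted:
            stable.extend(accepted[:accepted.index(EOS)])
            break
        stable.extend(accepted)                      # S5: grow stable context
        # Slide: keep the unverified tail as speculation, pad with masks.
        speculation = current[verified:] + [INIT] * verified
    return stable, passes


def make_oracle(target: List[str]) -> Denoiser:
    """Idealized toy denoiser that always outputs the ground-truth tokens."""
    padded = target + [EOS]

    def denoise(stable: List[str], speculation: List[str]) -> List[str]:
        start = len(stable)
        return [padded[min(start + i, len(padded) - 1)]
                for i in range(len(speculation))]

    return denoise


tokens, n_passes = sliding_decode(make_oracle(list("abcdef")), window=4)
```

With this idealized denoiser each window is proposed on one pass and accepted in full on the next, so the number of forward passes is lower than the number of tokens; a real model would accept shorter prefixes whenever its successive predictions disagree.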
Example 2
[0038] This embodiment describes in detail the data construction method based on online Monte Carlo trajectory generation, which builds training samples so as to alleviate the "exposure bias" problem during the model training phase.
[0039] Traditional training uses static ground-truth values as input, whereas during inference the model must deal with errors it has generated itself. To bridge this gap, and with reference to the model training process shown in Figure 2, this embodiment employs the following dynamic data construction method:
1. Window splitting: Given a target sequence y in the training set and a randomly selected window starting point k, the fixed-length window is divided into two parts: a "diffusion segment" of length l_p and an "initial segment" of length l_n. The split point l_p is randomly sampled within a specified range.
[0040] 2. Online sampling: Freeze the gradient updates of the current model and feed the ground-truth historical context into the model. Multinomial sampling combined with temperature scaling is used to let the model generate multiple different candidate trajectories. These trajectories represent the model's current true capability distribution and may contain errors.
[0041] 3. Sample synthesis: Each generated candidate trajectory is concatenated with l_n initialization mask tokens (`<init>`) to construct the training input.
[0042] 4. Setting the supervision target: Although the input includes noisy trajectories generated by the model, the training target is always set to the ground-truth sequence. Minimizing the cross-entropy loss against this target forces the model to learn how to "recover" correct results from incorrect speculations.
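The four data construction steps above can be sketched in Python as follows. This is a minimal illustration with several assumptions: the sampling range for the split point is taken to be 1 to W-1 for a window of length W, the ground-truth history is prepended to the training input as context, and `toy_rollout` is a hypothetical stand-in for a gradient-free rollout of the current model. The cross-entropy computation itself is not shown.

```python
import math
import random
from typing import Callable, List, Sequence, Tuple

INIT = "<init>"  # initialization mask token named in the text


def sample_with_temperature(logits: Sequence[float], temperature: float,
                            rng: random.Random) -> int:
    """Multinomial sampling from softmax(logits / temperature)."""
    scaled = [l / temperature for l in logits]
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]  # numerically stable softmax
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


def build_training_example(
    y: Sequence[str], window: int,
    rollout: Callable[[List[str], int], List[str]],
    rng: random.Random,
) -> Tuple[List[str], List[str]]:
    """Requires len(y) >= window >= 2."""
    k = rng.randrange(0, len(y) - window + 1)  # 1. random window start
    l_p = rng.randint(1, window - 1)           #    split point (assumed range)
    l_n = window - l_p
    history = list(y[:k])                      # ground-truth context
    trajectory = rollout(history, l_p)         # 2. sampled, may contain errors
    inputs = history + trajectory + [INIT] * l_n   # 3. sample synthesis
    targets = list(y[k:k + window])            # 4. target is always ground truth
    return inputs, targets


def toy_rollout(vocab: List[str], y: Sequence[str], temperature: float,
                rng: random.Random) -> Callable[[List[str], int], List[str]]:
    """Hypothetical stand-in for the frozen model: favors the true token
    but leaves probability mass on wrong ones."""
    def rollout(history: List[str], length: int) -> List[str]:
        out = []
        for i in range(length):
            pos = len(history) + i
            logits = [2.0 if tok == y[pos] else 0.0 for tok in vocab]
            out.append(vocab[sample_with_temperature(logits, temperature, rng)])
        return out
    return rollout
```

In a training loop, the model's predictions over the window positions of `inputs` would be scored against `targets`, so that an erroneous sampled trajectory is paired with the correct answer.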
Example 3
[0043] This embodiment employs a training method based on diffusion trajectory preference optimization, and describes in detail how reinforcement learning concepts are used to optimize the quality of diffusion trajectories through a specific reward function design.
[0044] Figure 3 illustrates the joint reinforcement learning and instruction fine-tuning optimization framework. This embodiment employs a group-relative policy optimization framework, eliminating the need for an additional evaluator network. The core lies in the reward function:
1. Calculate the immediate accuracy reward: For each candidate trajectory generated in Example 2, compute the length of its longest common prefix with the ground-truth label at the corresponding position.
[0045] Here, Length denotes the length of a token sequence, and LCP denotes the length of the longest common prefix of two token sequences.
[0046] This reward is used to ensure the local accuracy of the current window generation.
[0047] 2. Calculate the look-ahead consistency reward: To assess the impact of the current trajectory on future generation, the following look-ahead steps are performed:
Forward inference: Treat the currently generated trajectory as known context, input it into the model, and let the model predict the content of the next adjacent window (i.e., the next l_n tokens).
[0048] Future verification: Compute the longest-common-prefix ratio between this prediction and the future ground-truth labels.
[0049] This reward is used to penalize trajectories that, although currently correct, cause subsequent generation to break down (for example, by causing Markdown table column errors).
[0050] 3. Loss function calculation: The rewards of the trajectories sampled in the same group are normalized to compute advantage values, and the policy gradient loss is calculated accordingly. The final total loss is a weighted sum of the individual loss terms.
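The group-wise normalization in step 3 can be sketched as below. Two points are assumptions rather than statements from the text: the two rewards are combined by a weighted sum with a hypothetical `look_weight`, and normalization uses the group mean and population standard deviation with a small epsilon, as is common in group-relative policy optimization.

```python
from statistics import mean, pstdev
from typing import List


def combined_reward(r_acc: float, r_look: float, look_weight: float = 1.0) -> float:
    """Combine immediate and look-ahead rewards (weighted sum is an assumption)."""
    return r_acc + look_weight * r_look


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize rewards within one group of trajectories sampled for the
    same window: subtract the group mean and divide by the group std."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Three candidate trajectories for the same window, as (r_acc, r_look) pairs.
group = [(1.0, 1.0), (1.0, 0.0), (0.5, 0.0)]
advantages = group_relative_advantages([combined_reward(a, b) for a, b in group])
```

Because the baseline is the group mean, no separate evaluator network is needed: a trajectory is reinforced only to the extent that it scores better than its siblings. In the example, the second trajectory is as accurate as the first on the current window but receives a lower advantage because its look-ahead reward is zero.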
Example 4
[0051] This embodiment provides a document parsing apparatus, relating to a hardware or software module architecture for performing the above-described method. The apparatus includes:
1. Feature Extraction Module: A visual encoder based on the Transformer architecture (such as the visual tower of Qwen-VL or InternVL) is used to convert the input document image into a high-dimensional feature vector.
[0052] 2. Trajectory Sampling Module: Used to perform the online Monte Carlo sampling described in Example 2 during training.
[0053] 3. Reward Calculation Module: Used to perform, during training, the two reward calculations described in Example 3 (the immediate accuracy reward and the look-ahead consistency reward) and feed the results back to the optimizer.
[0054] 4. Sliding Decoding Controller: Used to execute the window sliding, truncation, and verification logic in Example 1 during inference.
[0055] 5. Memory and processor: The memory stores the computer program, and the processor executes the program to realize the functions of the above modules.
Example 5
[0056] To verify the practical effectiveness of the method proposed in this invention, a comparative experiment was conducted on standard document parsing datasets. This section aims to demonstrate the beneficial effects of the invention and does not constitute a limitation on the technical solution itself.
[0057] 1. Experimental Setup
Datasets: OmniDocBench-1.5 (containing complex tables and formulas) and olmOCR-Bench (containing PDF-to-Markdown tasks) were used.
[0058] Baseline model: compared with existing techniques such as standard autoregressive (AR) models, Jacobi decoding, LADE, and EAGLE-3.
[0059] Hardware environment: NVIDIA A800 GPU, using Flash-Attention 2 acceleration.
[0060] 2. Experimental Result 1: OmniDocBench-1.5 Benchmark Test
As shown in Table 1, on the OmniDocBench-1.5 dataset, the proposed method (DARL) outperforms the baseline model and the other acceleration methods on all metrics.
[0061] Table 1
Results analysis:
In terms of efficiency: This invention achieves a 2.31x inference speedup, the highest among all compared methods; at the same time, the average number of iterations drops from 1091.5 for the baseline to 485.6, a reduction of more than 55%, showing that the sliding diffusion block mechanism can effectively reduce the computational overhead of the model.
[0062] In terms of accuracy: While significantly accelerating the process, the overall accuracy of this invention reached 87.69, which not only did not result in performance loss, but was also slightly higher than the original autoregressive baseline (87.57), verifying the effectiveness of diffusion trajectory preference optimization in maintaining generation quality.
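The stated iteration reduction on OmniDocBench-1.5 can be checked directly from the two iteration counts reported above. The short Python calculation below uses only those two figures; the variable names are illustrative.

```python
baseline_iters = 1091.5  # average iterations, autoregressive baseline
method_iters = 485.6     # average iterations, proposed method

# Fraction of forward iterations saved relative to the baseline.
reduction = 1.0 - method_iters / baseline_iters
print(f"{reduction:.1%}")  # prints 55.5%
```

The result, about 55.5%, is consistent with the "more than 55%" reduction stated in the efficiency analysis.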
[0063] 3. Experimental Result 2: olmOCR-Bench Benchmark Test
As shown in Table 2, the method of this invention also demonstrates significant advantages on the more challenging olmOCR-Bench dataset.
[0064] Table 2
Results analysis:
In terms of efficiency: This invention maintains a high speedup of 2.27x, and the average number of iterations is reduced to 519.3, well below that of other parallel decoding methods (such as MSN's 614.1).
[0065] In terms of accuracy: The advantages of this invention are more pronounced on long and complex documents, with an overall accuracy of 81.7, significantly higher than the autoregressive baseline (78.9) and the other acceleration methods (77.9). This indicates that, by introducing the look-ahead reward, this invention can generate more coherent and semantically accurate parsing results for complex documents.
[0066] 4. Experimental Conclusions
The combined results of the two experiments demonstrate that the document parsing method based on diffusion trajectory preference optimization described in this invention improves inference speed by approximately 2.3 times (2.31x and 2.27x on the two benchmarks) while reducing the number of forward propagations by more than 50%, and maintains or even significantly improves overall accuracy in complex document parsing tasks. This shows that this invention represents a significant technological advancement in balancing efficiency and accuracy.
[0067] In the foregoing description of this specification, references to terms such as "one embodiment," "another embodiment," or "some embodiments" indicate that a specific feature, structure, material, or characteristic described in connection with an embodiment or example is included in at least one embodiment or example of the present invention. In this specification, the illustrative expressions of the above terms do not necessarily refer to the same embodiment or example. Furthermore, the specific features, structures, materials, or characteristics described may be combined in any suitable manner in one or more embodiments or examples.
[0068] Although embodiments of the invention have been shown and described, those skilled in the art will understand that various changes, modifications, substitutions and alterations can be made to these embodiments without departing from the principles and spirit of the invention, the scope of which is defined by the claims and their equivalents.
[0069] The embodiments described above are merely preferred embodiments of the present invention and are not intended to limit the scope of the present invention. Various modifications and improvements made to the technical solutions of the present invention by those skilled in the art without departing from the spirit of the present invention should fall within the protection scope defined by the claims of the present invention.
Claims
1. A document parsing method based on diffusion trajectory preference optimization, characterized in that it includes the following steps:
S1. Obtain the image features of the document to be parsed;
S2. Based on the document image features, construct a sliding window input that includes a stable context and a speculation window, where the speculation window contains initialization mask tokens;
S3. Based on the sliding window input, generate in parallel, through a discrete denoising mechanism, the probability distribution of the token at each position within the speculation window, to obtain the generated sequence of the current window;
S4. Verify the consistency between the generated sequence of the current window and the speculative sequence from the previous time step to determine the verified tokens;
S5. Update the stable context based on the verified tokens and, based on the updated stable context, slide the sliding window to perform decoding generation for the next time step.
2. The document parsing method based on diffusion trajectory preference optimization according to claim 1, characterized in that S1 includes: concatenating a stable context with a speculation window, the speculation window containing the currently generated speculative trajectory fragment and initialization mask tokens used to fill the mask positions.
3. The document parsing method based on diffusion trajectory preference optimization according to claim 1, characterized in that S2 includes: performing one model forward propagation, and using a discrete denoising mechanism to predict tokens in parallel at all positions within the speculation window, so as to simultaneously correct the speculative trajectory from the previous time step and fill in the new mask positions.
4. The document parsing method based on diffusion trajectory preference optimization according to claim 1, characterized in that S4 includes: starting from the beginning of the sliding window, comparing the window sequence generated at the current time step, token by token, with the speculative sequence generated at the previous time step until the first inconsistent token is found; and marking all tokens preceding the first inconsistent token as verified tokens.
5. The document parsing method based on diffusion trajectory preference optimization according to claim 4, characterized in that, S5 includes: determining the sliding step size based on the number of verified tokens, and moving the sliding window forward by the sliding step size; if the entire window passes verification, then moving it by a full window size.
6. A model training method, wherein the model trained by the method is used to execute the parsing method according to any one of claims 1-5 during the inference phase, characterized in that the method uses online Monte Carlo trajectory generation and includes the following steps:
Obtain the target sequence and randomly select the starting point of the window;
Divide the sliding window into a diffusion segment and an initial segment, and generate multiple candidate trajectories by random sampling within the diffusion segment;
Construct the training input based on the candidate trajectories and the initialization mask tokens;
Using the ground-truth sequence as the supervision target, train the model to recover correct results from incorrect speculations by minimizing the cross-entropy loss.
7. The model training method according to claim 6, characterized in that, after the step of constructing the training input based on the candidate trajectories, the method further includes a diffusion trajectory preference optimization step, including:
Calculate the immediate accuracy reward for each candidate trajectory, which is determined based on the length of the longest common prefix between the candidate trajectory and its corresponding ground-truth label;
Predict the content of the next adjacent window based on the candidate trajectory, and calculate the ratio of its common prefix with the future ground-truth label as the look-ahead consistency reward;
Based on the immediate accuracy reward and the look-ahead consistency reward, calculate a loss function using a reinforcement learning framework to optimize the model's selection of diffusion trajectories that benefit both current and future generation.
8. A document parsing device, characterized in that it comprises:
A feature extraction module, used to acquire the image of the document to be parsed and extract multimodal features;
A sliding diffusion decoding module, used to execute the sliding diffusion block mechanism, which combines autoregressive priors to generate diffusion trajectories in parallel within the sliding window;
A trajectory generation and optimization training module, used to perform online Monte Carlo trajectory generation during the training phase, and to calculate the diffusion trajectory preference optimization loss based on the immediate and look-ahead rewards to update the model parameters;
A verification and output module, used to verify the validity of the generated diffusion trajectory and output the final parsed text.