A large model post-training acceleration method based on asynchronous speculative decoding

By dynamically scheduling and decoupled speculative decoding in a distributed environment, the problems of long-tail latency and resource waste in reinforcement learning of large language models are solved, and cross-device asynchronous communication and adaptive step size adjustment are realized, which significantly improves inference efficiency.

CN122197975A · Pending · Publication Date: 2026-06-12 · BEIHANG UNIV


Patent Information

Authority / Receiving Office: CN · China
Patent Type: Applications(China)
Current Assignee / Owner: BEIHANG UNIV
Filing Date: 2026-03-16
Publication Date: 2026-06-12

Application Information


IPC: G06N3/0455; G06N5/04; G06N3/092; G06N3/0499

AI Tagging

Application Domain

Biological models; Inference methods


AI Technical Summary

⚠Technical Problem

Existing technologies suffer from long-tail latency and wasted computational resources during reinforcement learning training of large language models, because generated samples differ greatly in length. Especially under synchronous update requirements, GPU nodes that finish short sequences sit idle while long sequences are still being generated slowly. Existing speculative decoding techniques cannot effectively utilize dynamically idle resources in a distributed environment.

⚗Method used

A decoupled asynchronous speculative decoding method with dynamic resource scheduling is adopted. GPU resources released after short sequences finish are dynamically repurposed to run a draft model, and cross-device speculative decoding accelerates the generation of long sequences. The fast inference of the draft model, combined with asynchronous communication and adaptive step size adjustment, optimizes resource utilization.

🎯Benefits of technology

It accelerates long-tail sequences in large model training, improves overall inference efficiency, significantly reduces resource waste, and achieves a 1.7x-2.6x speedup over the baseline.


Abstract

The application discloses a large model post-training acceleration method based on asynchronous speculative decoding, aiming to solve the long time consumption of the Rollout stage in the prior art. The method significantly accelerates inference while strictly ensuring that model performance is not affected. The core scheme divides the Rollout task nodes into draft nodes and target nodes. During inference, a draft node runs the draft model to quickly generate a speculative result and sends it asynchronously to the target node through a communication framework; after sending, it continues drafting without waiting, so drafting is non-blocking. The target node runs the target model to verify the received draft and asynchronously returns the result to update the output of the draft node. This combination of heterogeneous division of labor between nodes, non-blocking draft generation, and asynchronous verification constitutes an efficient asynchronous speculative decoding process that greatly improves decoding efficiency. Experiments show that on the Qwen2.5-32B and Qwen3-235B-A22B models, the method achieves 2.0x and 2.4x Rollout acceleration respectively, significantly better than the prior art.


Description

Technical Field

[0001] This invention belongs to the field of artificial intelligence technology, specifically relating to natural language processing (NLP) and reinforcement learning (RL) techniques, and particularly to a method for accelerating post-training of large language models based on heterogeneous speculative decoding. More specifically, the present invention relates to a method for accelerating the training of a large language model that uses heterogeneous computing resources to deploy a draft model and a target model, and solves the long-tail problem of inference through speculative decoding.

[0002] The main approach is to alleviate the long-tail inference problem in reinforcement learning post-training of large models by using heterogeneous speculative decoding. By deploying the draft model and the target model on different computing resources, the idle computing resources in the long-tail stage are fully utilized to accelerate inference.

Background Technology

[0003] 1. Current Status of Post-Training and Reinforcement Learning for Large Language Models

[0004] With the rapid development of deep learning technology, large language models based on the Transformer architecture have demonstrated outstanding general capabilities in natural language processing tasks. However, models pre-trained solely on massive amounts of text often fail to fully align with human intent, safety protocols, or specific values. To address this issue, the post-training stage becomes crucial. Post-training typically includes supervised fine-tuning and reinforcement learning based on human feedback, as well as reinforcement learning with verifiable rewards, which has gained popularity since 2025.

[0005] In the RLHF (Reinforcement Learning from Human Feedback) phase, commonly used algorithms include Proximal Policy Optimization (PPO) and its variants, such as Group Relative Policy Optimization (GRPO). This process typically comprises four core components: a policy model (Actor), a critic model, a reward model, and a reference model. Its core flow is an iterative loop involving sampling, evaluation, and updating.

[0006] Sampling: The policy model generates a response sequence based on the prompt.

[0007] Evaluation: The reward model scores the generated responses, and the critic model estimates the state value.

[0008] Update: Calculate the advantage function based on the reward signal and value estimate, and update the parameters of the policy model and the critic model using gradient descent.
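The update step above computes an advantage from rewards and value estimates. The following is a minimal sketch of one common way to do this, generalized advantage estimation (GAE). The source does not name a specific estimator, so the choice of GAE, the function name `gae_advantages`, and the default `gamma` and `lam` values are illustrative assumptions.

```python
from typing import List


def gae_advantages(rewards: List[float], values: List[float],
                   gamma: float = 0.99, lam: float = 0.95) -> List[float]:
    """Generalized advantage estimation over one finished episode.

    values[t] is the critic's estimate at step t. The value after the
    final step is taken as 0 because the episode has ended.
    (GAE is an assumed estimator; the source only says "advantage function".)
    """
    advantages = [0.0] * len(rewards)
    next_value = 0.0   # value of the state after the last step
    running = 0.0      # discounted sum of later TD errors
    for t in reversed(range(len(rewards))):
        # One-step temporal-difference error at step t.
        delta = rewards[t] + gamma * next_value - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
        next_value = values[t]
    return advantages
```

With `gamma = lam = 1`, each advantage reduces to the sum of remaining rewards minus the critic's estimate at that step.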

[0009] In practical large-scale reinforcement learning training, the sampling phase (Rollout) often occupies the majority of the entire training cycle, typically exceeding 80%. This is because the inference of large models is autoregressive, meaning that each token generation requires a complete model forward pass. Limited by GPU memory bandwidth, generation is far slower than parallel gradient computation. With the increase in the number of model parameters, such as scaling from 7B to 70B or even higher, and the expansion of context length, such as the introduction of long chains of thought (CoT), low sampling efficiency has become the primary bottleneck restricting the scalability of large-scale model training.

[0010] To address this challenge, various large-scale reinforcement learning training frameworks have emerged in the industry. Early frameworks, such as DeepSpeed-Chat, primarily employed a single massively parallel strategy, but lacked flexibility. Recently, a new generation of frameworks, represented by VeRL (Volcano Engine Reinforcement Learning), has received widespread attention. VeRL employs a hybrid programming model and is typically built upon distributed computing frameworks such as Ray. Its core innovation lies in decoupling training from inference:

[0011] Resource decoupling: VeRL allows the inference process and parameter update process of an Actor to be scheduled to different computing resource groups, or to be dynamically switched on the same resource group.

[0012] Data flow optimization: Synchronize model weights and transmit sampled data between the inference workflow and the training workflow through efficient communication primitives.

[0013] Although frameworks such as VeRL have improved resource utilization through architectural optimization, their underlying inference engines are still limited by the physical bottleneck of autoregressive generation, and the latency of a single inference is still relatively high. Especially in scenarios that require generating long texts (such as a chain of thought with thousands of tokens), inference throughput remains a bottleneck of the system.

[0014] 2. High-performance large model inference frameworks: vLLM and SGLang

[0015] To alleviate the memory bottleneck and improve throughput during large model inference, academia and industry have proposed a series of system-level optimization schemes for the Transformer architecture, among which vLLM (Virtual Large Language Model) and SGLang are the most typical.

[0016] 2.1 vLLM and PagedAttention Technology

[0017] vLLM is one of the most widely used frameworks for accelerating large model inference. Its core contribution lies in solving the memory fragmentation problem in large model inference. In traditional inference systems, key-value caches are typically pre-allocated in the form of contiguous memory blocks. Since the length of the generated sequence is unknown before decoding is complete, the system often needs to reserve memory according to the maximum possible length, such as 2048 or 4096. This static allocation mechanism leads to serious memory waste: on the one hand, there is reserved but unused "internal fragmentation," and on the other hand, there is "external fragmentation" caused by the discontinuity of memory. This greatly limits the number of concurrent requests that the system can handle simultaneously.

[0018] vLLM introduces the PagedAttention algorithm, inspired by the virtual memory paging management mechanism of operating systems. PagedAttention divides contiguous key-value caches into non-contiguous memory blocks, each containing a fixed number of tokens, such as 16 or 32 key-value pairs. The system maintains a block table to record the mapping relationship between logically contiguous tokens and non-contiguous blocks in physical memory. This mechanism brings the following significant advantages:

[0019] Zero-waste memory management: Memory is dynamically allocated on demand, eliminating internal fragmentation and allowing the KV Cache to almost fill all available memory.

[0020] Efficient memory sharing: In scenarios such as parallel sampling or beam search, multiple sequences may share the same prefix. vLLM allows different sequences to share physical memory blocks through a reference counting mechanism, thereby significantly reducing memory usage.

[0021] Continuous Batching: vLLM employs iterative scheduling, meaning that if a sequence finishes generating prematurely within a batch, the system can immediately insert a new request without waiting for the longest sequence in the entire batch to complete.
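The block table and reference-counted sharing described above can be illustrated with a toy allocator. This is a simplified sketch and not vLLM's implementation: the class names `BlockAllocator` and `Sequence` are hypothetical, only block bookkeeping is modeled (no tensors), and copy-on-write for a shared, partially filled last block is deliberately omitted.

```python
from typing import Dict, List, Tuple

BLOCK_SIZE = 16  # tokens per block; the text gives 16 or 32 as examples


class BlockAllocator:
    """Hands out physical block ids and tracks reference counts."""

    def __init__(self, num_blocks: int) -> None:
        self.free: List[int] = list(range(num_blocks))
        self.refcount: Dict[int, int] = {}

    def allocate(self) -> int:
        block = self.free.pop()
        self.refcount[block] = 1
        return block

    def share(self, block: int) -> int:
        self.refcount[block] += 1
        return block

    def release(self, block: int) -> None:
        self.refcount[block] -= 1
        if self.refcount[block] == 0:
            del self.refcount[block]
            self.free.append(block)


class Sequence:
    """Maps logical token positions to (physical block, offset) pairs."""

    def __init__(self, allocator: BlockAllocator) -> None:
        self.allocator = allocator
        self.block_table: List[int] = []
        self.num_tokens = 0

    def append_token(self) -> None:
        # Allocate a new block only when the current one is full (on demand).
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(self.allocator.allocate())
        self.num_tokens += 1

    def physical_slot(self, pos: int) -> Tuple[int, int]:
        return self.block_table[pos // BLOCK_SIZE], pos % BLOCK_SIZE

    def fork(self) -> "Sequence":
        # Share the prefix blocks with a new sequence (parallel sampling).
        # Copy-on-write for a partially filled last block is not modeled.
        child = Sequence(self.allocator)
        child.block_table = [self.allocator.share(b) for b in self.block_table]
        child.num_tokens = self.num_tokens
        return child

    def free(self) -> None:
        for block in self.block_table:
            self.allocator.release(block)
        self.block_table = []
        self.num_tokens = 0
```

A sequence of 20 tokens occupies two blocks here, and forking it adds no new blocks until the child diverges.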

[0022] While vLLM significantly improves overall system throughput by increasing batch size, its latency optimization for individual requests is limited. This is because for each specific token generation, the GPU still needs to load the complete model weights, and due to memory bandwidth constraints, the time for a single forward propagation is difficult to compress.

[0023] 2.2 SGLang and RadixAttention Techniques

[0024] As large-scale model applications become increasingly complex, such as multi-turn dialogues, chain-of-thought reasoning, and agent-based complex task planning, inference requests often exhibit highly structured characteristics. The SGLang (Structured Generation Language) framework emerged to address this need; it is not only an inference backend but also includes a frontend language for describing complex interaction flows.

[0025] SGLang introduces RadixAttention technology in its inference backend, a key-value cache management mechanism based on a radix tree. Unlike vLLM, which mainly focuses on prefix sharing within the same batch, RadixAttention aims to achieve automatic reuse of the key-value cache across requests.

[0026] LRU cache eviction strategy: SGLang maintains a radix tree in GPU memory, retaining the KV cache of past requests as nodes in the tree. When a new request arrives, the system performs prefix matching in the tree based on the prompt content. If a match is found, the pre-computed KV cache is reused directly without recomputing attention for the prefix, which significantly reduces time-to-first-token latency.

[0027] Structured generation and compiler optimization: SGLang treats the complex prompt-driven generation process defined on the front end as a computational graph, interpreting and optimizing it through compiler technology. For example, it can use regular expressions to constrain the sampling space or automatically identify subtasks that can be generated in parallel, thereby achieving more efficient operator scheduling at the underlying level.
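The cross-request prefix matching described above can be sketched with a plain token-level trie. This is an illustration only and not SGLang's implementation: a real radix tree compresses runs of tokens into single edges and stores KV-cache handles on nodes, and the LRU eviction mentioned in the text is omitted. The names `TrieNode` and `PrefixCache` are hypothetical.

```python
from typing import Dict, List


class TrieNode:
    def __init__(self) -> None:
        self.children: Dict[int, "TrieNode"] = {}


class PrefixCache:
    """Records token sequences of past requests and finds reusable prefixes."""

    def __init__(self) -> None:
        self.root = TrieNode()

    def insert(self, token_ids: List[int]) -> None:
        node = self.root
        for tok in token_ids:
            node = node.children.setdefault(tok, TrieNode())

    def match_prefix_len(self, token_ids: List[int]) -> int:
        """Number of leading tokens whose KV cache could be reused."""
        node = self.root
        matched = 0
        for tok in token_ids:
            child = node.children.get(tok)
            if child is None:
                break
            node = child
            matched += 1
        return matched
```

Only the tokens after the matched prefix would need a fresh prefill pass for the new request.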

[0028] In post-training scenarios of reinforcement learning, vLLM and SGLang are often integrated as underlying inference engines. However, both PagedAttention and RadixAttention primarily address memory management and multi-concurrency scheduling issues; they do not change the inherently serial nature of Transformer decoding—that is, the generation of the (N+1)th token depends on the computation result of the Nth token.

[0029] 3. Introduction to Speculative Decoding Techniques

[0030] To overcome the sequential limitations of autoregressive decoding and reduce inference latency, speculative decoding techniques have been proposed. This technique is based on a key observation: large models are not "difficult" at every step when generating text. For many common phrases, grammatical structures, or words easily inferred from context, a small-scale model often yields the same results as a large-scale model.

[0031] 3.1 Basic Principles and Process of Speculation Decoding

[0032] Speculative decoding employs a "draft-verification" paradigm, typically involving two models:

[0033] Target model: A large model with a huge number of parameters and powerful performance, but slow inference speed, such as a model with 70B parameters.

Draft model: A model with a small number of parameters and extremely fast inference speed, but slightly lower accuracy on some complex tasks (such as a model with 7B parameters or a specially distilled SSM model).

[0034] The standard speculative decoding process includes the following steps:

[0035] (1) Speculation phase: Based on the current context, the draft model quickly generates K candidate tokens (Draft Tokens) through autoregression. Since the number of parameters in the draft model is small, the computational cost of this step is much smaller than that of the target model.

[0036] (2) Validation phase: The target model processes the K candidate tokens in parallel. Since the Attention mechanism in the Transformer architecture allows the parallel computation of the key and value of all tokens, the target model only needs to perform one forward propagation to calculate the probability distribution of what it considers to be correct at these K positions.

[0037] (3) Accept / Reject Decision: The algorithm compares the output of the draft model with the probability distribution of the target model. In the greedy decoding scenario, a draft token is accepted if it matches the highest-probability token of the target model. In the nucleus sampling scenario, rejection sampling or speculative sampling algorithms are usually used to ensure that the distribution of the final output strictly conforms to the distribution of the target model, so that the quality of the generated tokens does not decrease due to the introduction of the draft model.

[0038] (4) Correction and continuation: If the i-th Token is rejected, all subsequent speculations are discarded, and the real Token calculated by the target model in step i is used as correction, and then the next round of loop is entered.
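Steps (1) to (4) above can be sketched for the greedy decoding case as follows. This is a toy under stated assumptions: models are plain Python functions from a token context to the next token, the verification loop calls the target once per position where a real Transformer would score all K positions in one forward pass, and `toy_target` and `toy_draft` are hypothetical stand-ins. The extra token emitted when all K drafts are accepted follows common practice and is not spelled out in the text.

```python
from typing import Callable, List

Model = Callable[[List[int]], int]  # context -> next token (greedy)


def speculative_step(target: Model, draft: Model,
                     context: List[int], k: int) -> List[int]:
    """One draft-verify round; returns the tokens to append to the context."""
    # (1) Speculation: the draft model proposes k tokens autoregressively.
    proposal: List[int] = []
    for _ in range(k):
        proposal.append(draft(context + proposal))
    # (2) Verification: compare each proposal with the target's own choice.
    accepted: List[int] = []
    for i in range(k):
        expected = target(context + proposal[:i])
        if proposal[i] != expected:
            # (4) Correction: drop the remaining guesses, keep the target token.
            accepted.append(expected)
            return accepted
        accepted.append(proposal[i])  # (3) Accept
    # All k accepted: the verifying pass also yields one further target token.
    accepted.append(target(context + proposal))
    return accepted


def generate(target: Model, draft: Model, prompt: List[int],
             max_new: int, k: int = 4) -> List[int]:
    out = list(prompt)
    while len(out) - len(prompt) < max_new:
        out.extend(speculative_step(target, draft, out, k))
    return out[:len(prompt) + max_new]


def toy_target(ctx: List[int]) -> int:
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[int]) -> int:
    # Agrees with the target except at every fifth position.
    return 0 if len(ctx) % 5 == 0 else (ctx[-1] + 1) % 10
```

Because every emitted token is either confirmed or produced by the target, the output is identical to plain greedy decoding with the target model; only the number of target rounds changes.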

[0039] 3.2 Advantages and limitations of speculative decoding techniques

[0040] The core advantage of speculative decoding lies in trading computation for time. It leverages the powerful parallel computing capabilities of modern GPUs, reducing the reliance on GPU memory bandwidth by adding only a small amount of computational load (running a draft model). If the draft model has a high prediction accuracy, the system can generate multiple tokens in a single forward propagation of the target model, thus achieving a 2x or even 3x end-to-end speedup.
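A short worked equation makes the "multiple tokens per target forward pass" claim concrete. It rests on a simplifying assumption that is not stated in the source: each draft token is accepted independently with the same probability $\alpha$. With $K$ draft tokens per round, and counting the one token the target contributes itself (a correction after a rejection, or an extra token after full acceptance), the expected number of tokens per target forward pass is:

```latex
\mathbb{E}[\text{tokens per target pass}]
  = \sum_{i=0}^{K-1} (i+1)\,\alpha^{i}(1-\alpha) \;+\; (K+1)\,\alpha^{K}
  = \sum_{j=0}^{K} \alpha^{j}
  = \frac{1-\alpha^{K+1}}{1-\alpha}
```

For example, with $\alpha = 0.8$ and $K = 4$ this gives $(1 - 0.8^{5}) / 0.2 \approx 3.36$ tokens per target pass, before subtracting the cost of running the draft model.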

[0041] However, existing speculative decoding techniques face severe challenges when applied to post-training of large-scale reinforcement learning models, which is precisely the technical pain point that this invention aims to address:

[0042] Competition arising from coupled computing resources: In traditional implementations, draft models and target models are typically deployed on the same GPU or the same group of GPU nodes. Since large models, especially Actor models in RL training, consume a significant amount of GPU memory, loading a draft model exacerbates memory pressure and can even lead to Out of Memory (OOM). Furthermore, the two models compete for GPU computing units during computation, causing pipeline stalls.

[0043] Static configurations cannot adapt to dynamic environments: Existing speculative decoding usually sets a fixed number of speculation steps K, such as 5. However, during RL training, as the model strategy is continuously updated, the distribution of generated content will change drastically, as will the batch size. A fixed K value cannot remain optimal during the "exploration" and "exploitation" phases, and speculative decoding may even be slower than direct decoding.
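The claim that a fixed K can make speculative decoding slower than direct decoding can be checked with a small cost model. This is a sketch under assumptions not taken from the source: draft tokens are accepted independently with probability `alpha`, one target forward pass costs 1 time unit, and one draft forward pass costs `draft_cost` units. All function names are illustrative.

```python
def expected_tokens(alpha: float, k: int) -> float:
    """Expected tokens emitted per draft-verify round with k draft tokens."""
    if alpha >= 1.0:
        return float(k + 1)
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


def speedup(alpha: float, k: int, draft_cost: float) -> float:
    """Tokens per unit time relative to plain decoding (1 token per unit)."""
    round_time = k * draft_cost + 1.0  # k draft passes plus one target pass
    return expected_tokens(alpha, k) / round_time


def best_k(alpha: float, draft_cost: float, k_max: int = 16) -> int:
    """Step count in 1..k_max with the highest modeled speedup."""
    return max(range(1, k_max + 1),
               key=lambda k: speedup(alpha, k, draft_cost))
```

Under this model, K = 5 with a draft pass costing 0.2 units gives a speedup above 2 when the acceptance rate is 0.9 but falls below 1 (a slowdown) when the acceptance rate is 0.3, and the best K differs between the two cases.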

[0044] 4. The problem of "long-tail reasoning" and resource idleness in existing technologies

[0045] In current large-scale reinforcement learning, especially when training reasoning and chain-of-thought capabilities, generation lengths differ significantly, which leads to a serious long-tail inference phenomenon and a huge waste of computing resources.

[0046] 4.1 The Pareto Principle and Long Tail Phenomenon in Generation Length

[0047] During the Rollout phase, the prompts received by the Actor model vary in difficulty. For simple instructions, the model may only need to generate a few dozen tokens to complete the task, with a very short response time. However, for complex logical reasoning or mathematical problems, the model often needs to generate extremely long chains of thought, which may reach thousands or even tens of thousands of tokens. As a result, within the same batch, most simple samples finish quickly while a small number of difficult samples are still being generated.

[0048] 4.2 Resource Idleness under Batch Processing Synchronization

[0049] Because reinforcement learning algorithms like PPO typically require synchronous updates, the system must wait for the last sample in a batch to finish generating before entering the Update phase. Once the short sequences in a batch are generated, the GPU computing units and memory bandwidth responsible for processing these sequences are essentially idle. In the existing vLLM / SGLang architecture, although memory can be released, computing resources are not effectively utilized. These GPU threads can only "idle" and wait until the longest "long-tail" sequence is generated. As generation progresses, the number of active requests in a batch gradually decreases, and the system's computing power utilization drops exponentially. In the latter half of generation, perhaps less than 10% of the GPUs are working, while the remaining 90% of GPU resources are waiting, constituting a huge waste of computing power.
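The idle-resource effect described above can be quantified with a few lines of arithmetic. The sketch below assumes one sequence per worker slot, one token per decoding step, and a hypothetical batch of nine short samples and one long one; none of these numbers come from the source.

```python
from typing import List


def active_fraction(lengths: List[int], step: int) -> float:
    """Fraction of sequences in the batch still generating at a given step."""
    return sum(1 for n in lengths if n > step) / len(lengths)


def mean_utilization(lengths: List[int]) -> float:
    """Average active fraction over a rollout that lasts max(lengths) steps."""
    return sum(lengths) / (max(lengths) * len(lengths))


# Hypothetical batch: nine short answers and one long chain of thought.
example_lengths = [100] * 9 + [10000]
```

For this batch, every slot is busy at step 50, only one in ten is busy at step 5000, and the average utilization over the whole synchronous rollout is about 11%.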

[0050] 4.3 Shortcomings of existing technologies

[0051] Existing speculative decoding techniques primarily focus on "how to make a single model run faster," neglecting "how to utilize idle resources in the cluster." Traditional speculative decoding requires pre-allocating resources to the draft model, failing to utilize dynamically generated idle resources at runtime. Existing scheduling frameworks, such as VeRL, while capable of scheduling tasks, lack fine-grained computational resource borrowing mechanisms. In other words, they cannot dynamically launch a draft model on GPU A immediately after the task on GPU A completes, to assist in accelerating long-tail tasks running on GPU B.

[0052] In summary, existing technologies face a contradiction when processing long-tail inference in RLHF training: the coexistence of "idle resources for short sequences" and "slow generation of long sequences." How to dynamically convert the idle GPU resources released from short sequences into computational power for draft models, and accelerate the generation of long-tail sequences through speculative decoding, thereby breaking the bottleneck of batch synchronization, is a key technical challenge that current large-scale model training systems urgently need to solve.

Summary of the Invention

[0053] The technical problem to be solved by this invention is that in the training process of reinforcement learning of existing large language models, there is a common problem of "long-tail delay" caused by huge differences in the length of generated samples, and the resulting serious waste of computing resources.

[0054] Specifically, in the rollout phase of large-scale reinforcement learning training, due to the complexity of the prompts and the randomness of the model's responses, the sequence lengths generated by different samples within the same batch often vary considerably. Since reinforcement learning algorithms typically require synchronous updates—meaning gradient calculations can only be performed after all samples in the batch have been generated—this leads to a significant bottleneck effect: GPU nodes processing simple, short sequences finish their tasks early and enter an idle waiting state, unable to release GPU memory or proceed to the next round of computation; while GPU nodes processing complex, long sequences become "laggards," still consuming significant time for autoregressive generation. While existing speculative decoding techniques can accelerate inference, they typically require the draft model and the target model to be statically deployed on the same device. This not only exacerbates the GPU memory pressure on a single card but also fails to utilize the dynamically available idle resources in the aforementioned distributed environment. Therefore, how to dynamically transform the fragmented idle computing power released after short sequences are completed into acceleration capabilities for long sequence generation, breaking the efficiency bottleneck of batch processing synchronization, is a key challenge that urgently needs to be addressed in the field of large-scale model training.

[0055] To address the aforementioned technical challenges, this invention proposes a decoupled asynchronous speculative decoding large-scale model training method based on dynamic resource scheduling. The core of this method lies in constructing a flexible computation graph. During the sampling phase of reinforcement learning, idle GPU resources, which are rendered unused due to the completion of short sequence generation, are identified and dynamically reconfigured as draft model worker nodes. The target model nodes, still running long sequence generation, are then assisted across devices via a network for speculative decoding, thereby achieving training acceleration by "trading idle computing power for overall time." The core process includes four stages: environment initialization, dynamic resource scheduling, decoupled speculative decoding execution, and adaptive step size adjustment.

[0056] Step 1: Initialize the training environment in the distributed GPU cluster

[0057] The method of this invention first initializes the large language model training environment in a distributed GPU cluster. All GPU nodes are initially configured as sampling workers running the complete policy model, and a draft model is also initialized; the draft model is then put into "sleep" mode and the GPU memory it occupies is released, so that a node can later switch quickly between the target model and the draft model. Finally, the nodes execute the autoregressive generation task for the current batch of data.

[0058] Step Two: Dynamic Resource Scheduling Phase

[0059] During generation, the system monitors the generation status and memory usage of each GPU node in real time through a global scheduler. When it detects that some GPU nodes have completed their assigned short-sequence generation tasks, it does not let them enter a sleep or idle state but immediately triggers a dynamic role switching mechanism: these idle GPU nodes quickly load and wake up a draft model with fewer parameters or faster inference, turning each of them into a "draft worker". At the same time, the system identifies one or more "target workers" with the longest remaining generation length and the longest expected time in the current batch, and establishes a high-bandwidth, low-latency communication link (for example, via Ray) between each draft worker and its target worker, forming a decoupled speculative decoding pair.
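The pairing decision made by the global scheduler can be sketched as a pure function. This is a hypothetical illustration: the `WorkerState` record, the `remaining_tokens` estimate, and the one-draft-worker-per-target policy are assumptions, since the source does not say how remaining length is estimated or how many draft workers a target may receive.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class WorkerState:
    worker_id: str
    remaining_tokens: int  # estimated tokens left; 0 means the worker is idle


def pair_idle_with_long_tail(workers: List[WorkerState]) -> Dict[str, str]:
    """Map each idle worker (future draft worker) to one target worker.

    Targets are served longest-remaining first, and each target receives
    at most one draft worker in this simplified policy.
    """
    idle = [w for w in workers if w.remaining_tokens == 0]
    busy = sorted((w for w in workers if w.remaining_tokens > 0),
                  key=lambda w: w.remaining_tokens, reverse=True)
    return {d.worker_id: t.worker_id for d, t in zip(idle, busy)}
```

A scheduler loop would call such a function whenever a worker reports completion, then wake the draft model on each newly paired idle worker.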

[0060] Step 3: Decoupled Speculative Decoding Execution Phase

[0061] Based on this, the present invention implements a decoupled speculative decoding process across devices. The draft worker, leveraging its inference speed advantage and the currently generated context, rapidly generates a prediction sequence containing multiple candidate tokens via autoregression, and sends the token IDs of this sequence to the corresponding target worker over the network. Upon receiving the candidate sequence, the target worker no longer generates tokens serially one by one, but instead performs a single parallel forward pass with the target model, verifying whether the candidate tokens conform to the sampling strategy of the target model (e.g., verification through rejection sampling or greedy matching). Subsequences that pass verification are adopted directly and appended to the final result, so that multiple tokens are generated in a single model iteration; at the position where verification fails, a correction is made based on the true distribution of the target model, and the latest corrected state is fed back to the draft worker for the next round of speculation. This process continues until the sequence generation of the long-tail nodes ends or the maximum length limit is reached, at which point the pairing is released and all nodes resynchronize to enter the parameter update phase.
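The non-blocking interaction between a draft worker and a target worker can be sketched in a single process with `asyncio` queues standing in for the network link. This is a toy under stated assumptions, not the patented system: greedy matching is used for verification, an `epoch` counter (a hypothetical device) lets the target discard drafts built on a rejected guess, the bounded queue limits how far the draft runs ahead, and `toy_target` and `toy_draft` are stand-in models.

```python
import asyncio
from typing import Callable, List

Model = Callable[[List[int]], int]  # context -> next token (greedy)


async def draft_worker(draft: Model, prompt: List[int], k: int,
                       to_target: asyncio.Queue, to_draft: asyncio.Queue) -> None:
    epoch, context = 0, list(prompt)
    while True:
        # Apply feedback if any has arrived; never wait for it.
        while not to_draft.empty():
            epoch, context = to_draft.get_nowait()
        chunk: List[int] = []
        for _ in range(k):
            chunk.append(draft(context + chunk))
        context = context + chunk             # keep speculating past this chunk
        await to_target.put((epoch, chunk))   # suspends only when the queue is full


async def target_worker(target: Model, prompt: List[int], max_new: int,
                        to_target: asyncio.Queue, to_draft: asyncio.Queue) -> List[int]:
    epoch, out = 0, list(prompt)
    while len(out) - len(prompt) < max_new:
        chunk_epoch, chunk = await to_target.get()
        if chunk_epoch != epoch:
            continue                          # stale draft built on a rejected guess
        for tok in chunk:
            expected = target(out)            # a real target scores the chunk in one pass
            out.append(expected)              # accepted token or correction
            if tok != expected:
                epoch += 1
                to_draft.put_nowait((epoch, list(out)))  # feed back corrected state
                break
    return out[:len(prompt) + max_new]


async def run(target: Model, draft: Model, prompt: List[int],
              max_new: int, k: int = 4) -> List[int]:
    to_target: asyncio.Queue = asyncio.Queue(maxsize=2)
    to_draft: asyncio.Queue = asyncio.Queue()
    drafter = asyncio.create_task(draft_worker(draft, prompt, k, to_target, to_draft))
    try:
        return await target_worker(target, prompt, max_new, to_target, to_draft)
    finally:
        drafter.cancel()                      # release the pairing


def toy_target(ctx: List[int]) -> int:
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[int]) -> int:
    return 0 if len(ctx) % 5 == 0 else (ctx[-1] + 1) % 10
```

Calling `asyncio.run(run(toy_target, toy_draft, [3], 20))` yields the same tokens as plain greedy decoding with `toy_target`. In a distributed deployment the two queues would be replaced by the cluster's communication framework.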

[0062] Step 4: Adaptive Step Size Adjustment Stage

[0063] Furthermore, to maximize acceleration and adapt to the network environment, this invention employs an adaptive step size adjustment strategy. The system continuously monitors the network transmission latency between the draft worker and the target worker, as well as the prediction acceptance rate of the draft model. When network latency is high, or when significant differences in model distribution lower the acceptance rate, the number of tokens speculated per round (the step size) is dynamically reduced to minimize unnecessary computation and communication overhead. Conversely, when network conditions are good and prediction accuracy is high, the step size is increased to fully utilize idle computing power. Moreover, the draft model can be a pre-distilled small-parameter model, a quantized copy of the policy model, or even an N-gram model trained on historical data, as long as its inference latency is significantly lower than that of the target model.

Attached Figure Description

[0064] Figure 1 is a schematic diagram of the large model post-training acceleration method based on asynchronous speculative decoding proposed in this invention. The diagram shows how two working nodes implement asynchronous speculative decoding to achieve a 3x inference acceleration.

[0065] Figure 2 is a flowchart of the Rollout process in the method of this invention.

[0066] Figure 3 is a timing diagram for cross-device speculative decoding.

Detailed Implementation

[0067] The technical solution, experimental method, and test results of the present invention will be further described in detail below with reference to the accompanying drawings and specific experimental embodiments.

[0068] This invention relates to the field of artificial intelligence technology, specifically to the topic of natural language processing and reinforcement learning. It proposes a method for accelerating the training of large models based on speculative decoding, which includes three main steps: initializing the training environment, dynamically converting idle computing nodes into draft models, and dynamically adjusting the draft generation length.

[0069] This invention constructs a distributed large-scale model reinforcement learning training system that supports dynamic resource scheduling. The system is built on the Ray distributed computing framework, using vLLM as the inference engine at the bottom layer. The hardware environment consists of a high-performance computing cluster of 256 GPU nodes equipped with 8*NVIDIA H100 GPUs. To verify the effectiveness of the decoupled speculative decoding method proposed in this invention, rigorous comparative experiments were designed in this embodiment.

[0070] The experimental steps are explained in detail below:

[0071] Target models: Qwen2.5-32B and Qwen3-235B-A22B were used as reinforcement learning policy models to be trained.

[0072] Draft models: Qwen2.5-0.5B and Qwen3-4B, which adopt the same architecture as the target models.

[0073] Experiments were run under three reinforcement learning algorithms: PPO, GRPO, and DAPO. Three schemes are compared: VeRL without any speculative decoding, VeRL with standard speculative decoding (VeRL+Spec), and the method proposed in this invention. In the baseline, all GPUs synchronously wait for the longest sequence to be generated. All implementations are based on VeRL and use a generation temperature of 1.0.

[0074] Experimental results:

[0075] To test the acceleration effect of the method of the present invention, a comparative experiment was conducted with two other methods.

[0076] In the 32B model experiment, the maximum generation length was 20K, and in the 235B model experiment, the maximum generation length was 64K.

[0077] Table 1. Rollout time and speedup ratio of Qwen2.5-32B model at different time steps under GRPO.

[0078]

| Time step | VeRL time (s) | VeRL+Spec time (s) | This invention time (s) | Acceleration ratio |
| --- | --- | --- | --- | --- |
| 100 | 315 | 173 | 118 | 2.65 |
| 125 | 374 | 195 | 139 | 2.68 |
| 150 | 332 | 189 | 123 | 2.69 |
| 175 | 351 | 355 | 166 | 2.10 |
| 200 | 366 | 336 | 202 | 1.81 |

[0079] Table 2. Rollout time and speedup ratio of Qwen2.5-32B model at different time steps under DAPO.

[0080]

| Time step | VeRL time (s) | VeRL+Spec time (s) | This invention time (s) | Acceleration ratio |
| --- | --- | --- | --- | --- |
| 100 | 328 | 173 | 126 | 2.60 |
| 125 | 400 | 232 | 192 | 2.08 |
| 150 | 399 | 309 | 197 | 2.02 |
| 175 | 478 | 514 | 231 | 2.07 |
| 200 | 418 | 415 | 235 | 1.78 |

[0081] Table 3. Rollout time and speedup ratio of Qwen2.5-32B model at different time steps under PPO.

[0082]
Time step   VeRL time (s)   VeRL+Spec time (s)   This invention time (s)   Speedup ratio
100         313             176                  119                       2.63
125         318             183                  131                       2.42
150         322             261                  203                       1.58
175         328             213                  163                       2.01
200         335             225                  183                       1.83

[0083] Table 4. Rollout time and speedup ratio of Qwen3-235B-A22B model at different time steps under GRPO.

[0084]
Time step   VeRL time (s)   VeRL+Spec time (s)   This invention time (s)   Speedup ratio
1           1523            1123                 855                       1.78
2           1533            951                  658                       2.32
3           929             701                  553                       1.67
n+1         1707            1153                 971                       1.75
n+2         1824            1147                 1010                      1.80
n+3         1914            1285                 947                       2.02

[0085] The experimental results above show that the method of the present invention outperforms the existing methods in acceleration, achieving a speedup of about 1.6x-2.7x over the VeRL baseline (the ratios in Tables 1 to 4 range from 1.58 to 2.69), which verifies the effectiveness of the present invention.
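The relation between the timing columns and the ratio column can be checked with a short sketch. It assumes the ratio is the VeRL rollout time divided by the rollout time of this invention; the source does not state the formula, so this definition is inferred from the numbers. The rows are copied from Table 1 (Qwen2.5-32B under GRPO), and the names `TABLE_1`, `speedup`, and `recompute` are illustrative only.

```python
# Rows from Table 1: (time step, VeRL s, VeRL+Spec s, this invention s, reported ratio)
TABLE_1 = [
    (100, 315, 173, 118, 2.65),
    (125, 374, 195, 139, 2.68),
    (150, 332, 189, 123, 2.69),
    (175, 351, 355, 166, 2.10),
    (200, 366, 336, 202, 1.81),
]


def speedup(baseline_s: float, method_s: float) -> float:
    """Assumed definition: baseline rollout time divided by method rollout time."""
    return baseline_s / method_s


def recompute(rows):
    """Return (time step, recomputed ratio, reported ratio) for each table row."""
    return [
        (step, speedup(verl, ours), reported)
        for step, verl, _spec, ours, reported in rows
    ]


if __name__ == "__main__":
    for step, ratio, reported in recompute(TABLE_1):
        print(f"step {step}: recomputed {ratio:.2f}, reported {reported:.2f}")
```

Under this assumed definition the recomputed values stay within about 0.02 of the reported ones for every row of Table 1.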

[0086] In summary, this invention proposes a method for accelerating large model post-training based on asynchronous speculative decoding. The Rollout task nodes are divided into draft nodes and target nodes. During inference, a draft node runs the draft model to quickly generate speculative results and sends them asynchronously to a target node through a communication framework. After sending, the draft node continues generating subsequent drafts without waiting, so the process is non-blocking. The target node runs the target model to verify the received drafts and asynchronously sends the results back to update the draft node's output. This invention can serve as a reference for accelerating inference in future large model post-training.
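The non-blocking exchange between a draft node and a target node can be illustrated with a minimal single-process sketch. It is a toy under stated assumptions, not the implementation of this invention: `target_next` and `draft_next` are hypothetical stand-ins for the two models, `asyncio` queues stand in for the cross-device communication framework, verification is greedy token matching, and the `epoch` tag and the bounded `lookahead` buffer are illustrative devices for discarding drafts built on a rejected prefix.

```python
import asyncio
from contextlib import suppress

VOCAB = 50


def target_next(prefix):
    """Hypothetical target model: a deterministic next-token function."""
    return (sum(prefix) * 31 + len(prefix) * 7 + 3) % VOCAB


def draft_next(prefix):
    """Hypothetical draft model: agrees with the target except at every 5th position."""
    tok = target_next(prefix)
    return (tok + 1) % VOCAB if len(prefix) % 5 == 4 else tok


async def draft_worker(prompt, k, to_target, feedback):
    prefix, epoch = list(prompt), 0
    while True:
        # Pick up corrections if any have arrived; never wait for them.
        while not feedback.empty():
            epoch, verified = feedback.get_nowait()
            prefix = list(verified)
        chunk = []
        for _ in range(k):
            chunk.append(draft_next(prefix + chunk))
        prefix += chunk
        await asyncio.sleep(0)               # yield, standing in for compute time
        await to_target.put((epoch, chunk))  # waits only if the lookahead buffer is full


async def target_worker(prompt, max_new, to_target, feedback):
    verified, epoch, accepted = list(prompt), 0, 0
    goal = len(prompt) + max_new
    while len(verified) < goal:
        chunk_epoch, chunk = await to_target.get()
        if chunk_epoch != epoch:
            continue  # stale draft built on a prefix that was later corrected
        # A real target model would score the whole chunk in one forward pass.
        for tok in chunk:
            expected = target_next(verified)
            verified.append(expected)
            if tok == expected:
                accepted += 1
            else:
                epoch += 1  # correction: tell the draft worker to resynchronize
                feedback.put_nowait((epoch, list(verified)))
                break
    return verified[len(prompt):goal], accepted


async def rollout(prompt, max_new=32, k=4, lookahead=2):
    to_target = asyncio.Queue(maxsize=lookahead)
    feedback = asyncio.Queue()
    drafter = asyncio.create_task(draft_worker(prompt, k, to_target, feedback))
    try:
        return await target_worker(prompt, max_new, to_target, feedback)
    finally:
        drafter.cancel()
        with suppress(asyncio.CancelledError):
            await drafter


def reference(prompt, max_new):
    """Plain autoregressive decoding with the target model only."""
    seq = list(prompt)
    for _ in range(max_new):
        seq.append(target_next(seq))
    return seq[len(prompt):]


if __name__ == "__main__":
    tokens, accepted = asyncio.run(rollout([1, 2, 3]))
    print(tokens == reference([1, 2, 3], 32), accepted)
```

Because every appended token is the target model's own choice, the output matches plain autoregressive decoding with the target model; the draft worker only determines how many tokens can be confirmed per verification round.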

Claims

1. A method for accelerating large model post-training based on asynchronous speculative decoding, characterized in that: the GPU nodes used in post-training of large models are divided into draft nodes and target nodes, and these two types of nodes are assembled into a pipeline during Rollout for inference and decoding, so that the time spent on verification masks the time spent on drafting, thereby accelerating inference. The method primarily addresses the long-tail latency problem in the sampling stage of large language model reinforcement learning post-training. Its core process includes four stages: environment initialization, dynamic resource scheduling, decoupled speculative decoding execution, and adaptive step size adjustment.

S1, Environment Initialization Phase: The system deploys a large language model training environment in a distributed GPU cluster, pre-configures all GPU nodes in the cluster as sampling workers that run the target policy model, and keeps a predefined draft model in GPU memory in "sleep" mode. The nodes then work together to start the autoregressive generation task for the current batch of data.

S2, Dynamic Resource Scheduling Phase: As the generation task progresses, the system monitors the memory usage, computing load, and sequence generation progress of each GPU node in real time through the global scheduler. Once it detects that some nodes have completed their assigned short-sequence tasks and entered an idle state, it immediately triggers a dynamic role switching mechanism that wakes up the idle nodes as "draft workers" running draft models, and establishes communication links over a high-performance network with the target model nodes that are still generating long sequences, forming heterogeneous decoupled speculative decoding pairs.

S3, Decoupled Speculative Decoding Execution Phase: The draft worker uses its computing power advantage to quickly generate multiple candidate tokens and sends them asynchronously to the target worker. The target worker then uses parallel forward propagation to perform probabilistic verification on the candidate sequence. Candidates that pass verification are appended directly to the generated result; candidates that fail are corrected, and the correction is fed back to the draft worker. The whole process runs in a non-blocking asynchronous pipeline.

S4, Adaptive Step Size Adjustment Phase: The system dynamically adjusts the inference step size K based on real-time network latency and predicted acceptance rate to maximize resource utilization and ensure end-to-end inference acceleration. The system establishes a multi-dimensional evaluation model that includes network transmission latency, the predicted acceptance rate of the draft model, and the computing power ratio between devices. When network bandwidth fluctuations are detected that increase transmission time, or when the acceptance rate stays below the preset accuracy threshold for multiple consecutive inference periods because of model distribution differences, the system automatically reduces the inference step size K in a stepwise manner to reduce communication overhead and the rollback cost of computation wasted on verification failures. Conversely, when the system is in a stable generation phase and the acceptance rate remains high, the system actively increases the step size K so that idle computing power generates more tokens in a single iteration, thereby approaching the optimal speedup ratio in real time as the runtime environment changes.

The proposed method is specifically optimized for the sampling phase (Rollout) in reinforcement learning post-training. It is compatible with various mainstream reinforcement learning algorithms, including PPO and GRPO, and supports application scenarios that require large-scale generation of long sequences, such as long chains of thought (CoT). In the GRPO algorithm environment, the method exploits the characteristics of group relative evaluation to quickly convert nodes that have completed sampling within a group into draft workers, thereby accelerating the generation of the remaining long-sequence samples in the same group and effectively alleviating the long-tail effect in reinforcement learning training.
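The stepwise adjustment of K described in stage S4 can be sketched as a small controller. This is only an illustration under assumed parameters: the thresholds, the latency limit, the `patience` count, the bounds on K, and the class name `StepSizeController` are all hypothetical, and the computing power ratio between devices from the evaluation model is omitted for brevity.

```python
from dataclasses import dataclass


@dataclass
class StepSizeController:
    k: int = 4                      # current step size K (assumed starting value)
    k_min: int = 1                  # assumed lower bound on K
    k_max: int = 16                 # assumed upper bound on K
    low_accept: float = 0.5         # assumed "preset accuracy threshold"
    high_accept: float = 0.8        # assumed level that counts as a high acceptance rate
    latency_limit_ms: float = 5.0   # assumed limit on transmission latency
    patience: int = 3               # consecutive periods required before K changes
    _low_streak: int = 0
    _high_streak: int = 0

    def update(self, accept_rate: float, latency_ms: float) -> int:
        """Feed one inference period's measurements and return the new K."""
        if latency_ms > self.latency_limit_ms:
            # Transmission got slower: shrink K by one step right away.
            self._low_streak = self._high_streak = 0
            self.k = max(self.k_min, self.k - 1)
            return self.k

        if accept_rate < self.low_accept:
            self._low_streak += 1
            self._high_streak = 0
        elif accept_rate >= self.high_accept:
            self._high_streak += 1
            self._low_streak = 0
        else:
            self._low_streak = self._high_streak = 0

        if self._low_streak >= self.patience:
            # Acceptance stayed low for several periods: reduce K stepwise.
            self.k = max(self.k_min, self.k - 1)
            self._low_streak = 0
        elif self._high_streak >= self.patience:
            # Stable phase with high acceptance: draft more tokens per iteration.
            self.k = min(self.k_max, self.k + 1)
            self._high_streak = 0
        return self.k


if __name__ == "__main__":
    ctrl = StepSizeController()
    for rate, latency in [(0.9, 1.0)] * 3 + [(0.9, 10.0)] + [(0.2, 1.0)] * 3:
        print(rate, latency, ctrl.update(rate, latency))
```

Requiring several consecutive periods before changing K keeps a single noisy measurement from causing oscillation, while a latency spike takes effect immediately because every draft chunk pays the transmission cost.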