Model inference acceleration method and apparatus, and computing device
Candidate token sequences drawn from a historical cache are batched together with candidate token sequences generated by draft models and verified in parallel. This addresses the high computational overhead and high response latency of large language model inference, and improves both the acceptance rate and the quality of the candidate token sequences.
Patent Information
- Authority / Receiving Office: CN · China
- Patent Type: Application (China)
- Current Assignee / Owner: XFUSION DIGITAL TECH CO LTD
- Filing Date: 2026-02-10
- Publication Date: 2026-06-23
AI Technical Summary
Large language models generate tokens one at a time during inference using an autoregressive approach, which results in high computational overhead and high response latency. Candidate tokens generated by a draft model are often of low quality, so the target model accepts only a small share of them.
High-quality candidate token sequences from the historical cache are combined with predicted candidate token sequences generated by multiple draft models into a single batch, which the large (target) model then verifies in parallel. High-quality candidate token sequences are selected using a first probability and a second probability.
While maintaining inference efficiency, the method significantly improves the acceptance rate of candidate tokens, reduces the number of calls to the large model, and improves the quality of the candidate sequences.
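The batching and parallel verification described above can be illustrated with a minimal sketch. It is not the claimed method: the summary does not define the first and second probabilities, so this sketch swaps in plain greedy prefix matching as the acceptance rule. The names `build_batch`, `accepted_prefix_len`, `verify_batch`, and the `target_next` callable are hypothetical, and the Python loop over the batch stands in for what would be a single batched forward pass of the target model.

```python
from typing import Callable, List, Sequence

Token = int
# Hypothetical target-model interface (an assumption, not from the patent):
# given a context, return the target model's greedy next token.
TargetNext = Callable[[Sequence[Token]], Token]


def build_batch(cached: List[List[Token]],
                drafted: List[List[Token]]) -> List[List[Token]]:
    """Merge cache and draft candidates into one batch, dropping duplicates."""
    seen = set()
    batch: List[List[Token]] = []
    for cand in cached + drafted:
        key = tuple(cand)
        if key and key not in seen:  # skip empty and repeated candidates
            seen.add(key)
            batch.append(list(cand))
    return batch


def accepted_prefix_len(prefix: Sequence[Token],
                        cand: Sequence[Token],
                        target_next: TargetNext) -> int:
    """Count leading candidate tokens that match the target's greedy choice."""
    ctx = list(prefix)
    n = 0
    for tok in cand:
        if target_next(ctx) != tok:
            break
        ctx.append(tok)
        n += 1
    return n


def verify_batch(prefix: Sequence[Token],
                 batch: List[List[Token]],
                 target_next: TargetNext) -> List[Token]:
    """Keep the longest accepted prefix in the batch, then append one
    token from the target model so every call makes progress."""
    best: List[Token] = []
    for cand in batch:  # stand-in for one batched target forward pass
        n = accepted_prefix_len(prefix, cand, target_next)
        if n > len(best):
            best = list(cand[:n])
    return best + [target_next(list(prefix) + best)]
```

With this acceptance rule, one verification call can emit several tokens when any candidate in the batch agrees with the target model, which is how the number of large-model calls falls as candidate quality rises.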
Smart Images

Figure CN122264084A_ABST