A model inference acceleration method and device, and a computing device

By merging candidate token sequences drawn from a historical cache with those generated by draft models and verifying them in parallel, the method addresses the high computational overhead and high response latency of large language model inference, while improving both the acceptance rate and the quality of candidate token sequences.

CN122264084A · Pending · Publication Date: 2026-06-23 · XFUSION DIGITAL TECH CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
XFUSION DIGITAL TECH CO LTD
Filing Date
2026-02-10
Publication Date
2026-06-23

AI Technical Summary

Technical Problem

Large language models generate tokens one at a time in an autoregressive manner during inference, which results in high computational overhead and high response latency. Candidate tokens generated by a draft model are often of low quality, so the target model accepts few of them.

Method used

High-quality candidate token sequences from the historical cache are combined with predicted candidate token sequences generated by multiple draft models into batch data, which a large model then verifies in parallel. High-quality candidate token sequences are selected using the first probability and the second probability.
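The batching step can be illustrated with a minimal sketch. The patent summary does not say how the batch is laid out, so everything here is an assumption for illustration: the function name `build_batch`, the padding id `PAD_ID`, the right-padding scheme, and the de-duplication of identical candidates are all hypothetical.

```python
from typing import List, Tuple

PAD_ID = 0  # hypothetical padding token id, not taken from the patent


def build_batch(
    context: List[int],
    candidates: List[List[int]],
    pad_id: int = PAD_ID,
) -> Tuple[List[List[int]], List[int]]:
    """Prefix each candidate with the shared context and right-pad to equal length.

    Returns (batch, lengths), where lengths[i] is the unpadded length of
    candidate i, so a verifier can ignore the padded positions.
    """
    # Assumption: identical candidates (e.g. one from the cache and one from
    # a draft model) are verified only once; empty candidates are skipped.
    seen = set()
    unique: List[List[int]] = []
    for cand in candidates:
        key = tuple(cand)
        if cand and key not in seen:
            seen.add(key)
            unique.append(cand)

    max_len = max((len(c) for c in unique), default=0)
    batch = [context + c + [pad_id] * (max_len - len(c)) for c in unique]
    lengths = [len(c) for c in unique]
    return batch, lengths


# Usage: one candidate from the historical cache, three from draft models.
cache_candidates = [[7, 8, 9]]
draft_candidates = [[7, 8], [7, 8, 9], [10]]
batch, lengths = build_batch([5, 6], cache_candidates + draft_candidates)
```

Each row of `batch` shares the same context prefix, which is what allows a single forward pass of the large model to score all candidates together.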

Benefits of technology

While maintaining inference efficiency, the method significantly improves the acceptance rate of candidate tokens, reduces the number of calls to the large model, and improves the quality of candidate sequences.

✦ Generated by Eureka AI based on patent content.

Smart Images

  • Figure CN122264084A_ABST
Patent Text Reader

Abstract

Embodiments of the present application provide a model inference acceleration method and device, and a computing device. The method comprises: obtaining a current inference context generated for an inference request; obtaining a plurality of candidate token sequences, the candidate token sequences comprising a first token sequence and/or a second token sequence; integrating the plurality of candidate token sequences into batch data; inputting the batch data into a large model for parallel verification; determining a target token sub-sequence based on a first probability, output by the large model, corresponding to each candidate token sequence and a second probability corresponding to each candidate token sequence, wherein the target token sub-sequence is a token sub-sequence that is consecutively accepted among the plurality of complete candidate sequences; and taking the target token sub-sequence as the inference output and continuing subsequent inference. The method can generate high-quality candidate sequences while maintaining high efficiency, and improves the target model's acceptance rate of candidate tokens.
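The abstract does not spell out how the two probabilities are combined, so the following is only a sketch under stated assumptions. It assumes the second probability is the proposal probability attached to each candidate token, applies the acceptance rule commonly used in speculative sampling (accept a token with probability min(1, p/q)), and assumes the target sub-sequence is the longest consecutively accepted prefix across candidates. The function names are hypothetical, and the correction-token resampling that full speculative sampling performs after a rejection is omitted.

```python
import random
from typing import List, Sequence


def accepted_prefix_len(
    first_probs: Sequence[float],
    second_probs: Sequence[float],
    rng: random.Random,
) -> int:
    """Count tokens accepted consecutively from the start of one candidate."""
    n = 0
    for p, q in zip(first_probs, second_probs):
        # Assumed rule: accept with probability min(1, p / q).
        # A token with no proposal probability (q <= 0) is rejected.
        ratio = min(1.0, p / q) if q > 0.0 else 0.0
        if rng.random() < ratio:
            n += 1
        else:
            break  # the first rejection ends the accepted run
    return n


def select_target_subsequence(
    candidates: Sequence[Sequence[int]],
    first_probs: Sequence[Sequence[float]],
    second_probs: Sequence[Sequence[float]],
    seed: int = 0,
) -> List[int]:
    """Return the longest consecutively accepted prefix among all candidates."""
    rng = random.Random(seed)
    best: List[int] = []
    for tokens, p, q in zip(candidates, first_probs, second_probs):
        n = accepted_prefix_len(p, q, rng)
        if n > len(best):
            best = list(tokens[:n])
    return best


# Usage: the third token of the first candidate has first probability 0,
# so it can never be accepted and only the prefix [7, 8] survives.
target = select_target_subsequence(
    candidates=[[7, 8, 9], [7, 8], [10]],
    first_probs=[[0.9, 0.8, 0.0], [0.9, 0.8], [0.0]],
    second_probs=[[0.5, 0.4, 0.6], [0.5, 0.4], [0.7]],
)
```

When several candidates tie on accepted length, this sketch keeps the first one it encounters; a real system might prefer a different tie-break.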