A model inference acceleration method and device, and a computing device

Candidate token sequences drawn from a historical cache and generated by draft models are integrated for parallel verification. This addresses the high computational overhead and high response latency of large language model inference, and improves both the acceptance rate and the quality of candidate token sequences.

CN122264084A · Pending · Publication Date: 2026-06-23 · XFUSION DIGITAL TECH CO LTD

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Applications(China)
Current Assignee / Owner: XFUSION DIGITAL TECH CO LTD
Filing Date: 2026-02-10
Publication Date: 2026-06-23

Application Information

IPC: G06N5/04

AI Tagging

Application Domain

Inference methods

Technology Topics

Batch processing, Algorithm

AI Technical Summary

Technical Problem

Large language models generate tokens one at a time in an autoregressive manner during inference, which leads to high computational overhead and high response latency. Candidate tokens generated by a draft model are often of low quality, so the target model accepts few of them.

Method used

High-quality candidate token sequences from the historical cache are integrated with predicted candidate token sequences generated by multiple draft models into batch data, which the large model then verifies in parallel. High-quality candidate token sequences are selected using the first probability and the second probability.
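
The integration step can be pictured with a minimal Python sketch. It is an illustration under stated assumptions, not the patent's implementation: the function name `build_batch`, the right-padding scheme, the `pad_id` value, and the removal of duplicate candidates are all hypothetical choices that the source does not describe.

```python
from typing import List, Sequence, Tuple


def build_batch(
    context: Sequence[int],
    cache_candidates: Sequence[Sequence[int]],
    draft_candidates: Sequence[Sequence[int]],
    pad_id: int = 0,  # hypothetical padding token id
) -> Tuple[List[List[int]], List[List[int]]]:
    """Merge candidates from both sources into one rectangular batch.

    Returns (rows, mask): each row is context + candidate, right-padded to
    a common width; mask marks real tokens with 1 and padding with 0.
    """
    seen = set()
    merged: List[List[int]] = []
    # Cache candidates come first; duplicates are dropped (an assumption)
    # so the large model does not verify the same sequence twice.
    for cand in list(cache_candidates) + list(draft_candidates):
        key = tuple(cand)
        if key and key not in seen:
            seen.add(key)
            merged.append(list(cand))

    width = len(context) + max((len(c) for c in merged), default=0)
    rows: List[List[int]] = []
    mask: List[List[int]] = []
    for cand in merged:
        row = list(context) + cand
        pad = width - len(row)
        rows.append(row + [pad_id] * pad)
        mask.append([1] * len(row) + [0] * pad)
    return rows, mask


rows, mask = build_batch([1, 2], [[3, 4]], [[3, 4], [5]])
```

Each row shares the same inference context, so a single forward pass of the large model can score every candidate at once.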

Benefits of technology

While maintaining inference efficiency, the method significantly improves the acceptance rate of candidate tokens, reduces the number of calls to the large model, and improves the quality of candidate sequences.

Abstract

Embodiments of the present application provide a model inference acceleration method and device, and a computing device. The method comprises: obtaining a current inference context generated for an inference request; obtaining a plurality of candidate token sequences, the candidate token sequences comprising a first token sequence and/or a second token sequence; integrating the plurality of candidate token sequences into batch data; inputting the batch data to a large model for parallel verification; determining a target token sub-sequence based on a first probability corresponding to each candidate token sequence output by the large model and a second probability corresponding to each candidate token sequence, wherein the target token sub-sequence is a token sub-sequence that is consecutively accepted in a plurality of complete candidate sequences; and taking the target token sub-sequence as the inference output and continuing subsequent inference. The method can generate high-quality candidate sequences while maintaining high efficiency, and improves the target model's acceptance rate for candidate tokens.
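
The selection of a target token sub-sequence can be sketched as follows. The abstract does not define the two probabilities or the acceptance rule, so this sketch makes several assumptions: the first probability is taken to be the large model's per-token probability, the second is taken to be the per-token probability from the candidate's source (assumed strictly positive), acceptance uses the standard speculative-sampling test min(1, p/q), and the longest accepted prefix across candidates is returned. The function names are hypothetical.

```python
import random
from typing import List, Sequence


def accepted_prefix_len(
    p_target: Sequence[float],
    p_draft: Sequence[float],
    rng: random.Random,
) -> int:
    """Count leading tokens accepted before the first rejection."""
    n = 0
    for p, q in zip(p_target, p_draft):
        # Standard speculative-sampling test (an assumption here):
        # accept the token with probability min(1, p / q); q must be > 0.
        if rng.random() < min(1.0, p / q):
            n += 1
        else:
            break
    return n


def select_target_subsequence(
    candidates: Sequence[Sequence[int]],
    p_target: Sequence[Sequence[float]],
    p_draft: Sequence[Sequence[float]],
    seed: int = 0,
) -> List[int]:
    """Return the longest consecutively accepted prefix over all candidates."""
    rng = random.Random(seed)
    best: List[int] = []
    for tokens, pt, pd in zip(candidates, p_target, p_draft):
        k = accepted_prefix_len(pt, pd, rng)
        if k > len(best):
            best = list(tokens[:k])
    return best


candidates = [[5, 6, 7], [5, 9, 2]]
p_target = [[0.9, 0.8, 0.0], [0.9, 0.0, 0.5]]  # assumed large-model probabilities
p_draft = [[0.5, 0.5, 0.5], [0.5, 0.5, 0.5]]   # assumed source-side probabilities
result = select_target_subsequence(candidates, p_target, p_draft)
```

In this example the first candidate keeps its first two tokens and the second keeps only one, so the accepted sub-sequence is the two-token prefix of the first candidate. Inference would then continue from the context extended by that prefix.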
