Large language model text answering method incorporating a draft answer and KV cache eviction

By incorporating draft answers and KV cache eviction into a large language model text answering method, the decline in answer quality that KV cache eviction causes in long-context scenarios is addressed, and answers are generated more efficiently under a small cache budget.

CN120849565B · Active · Publication Date: 2026-06-26 · HARBIN INST OF TECH

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
HARBIN INST OF TECH
Filing Date
2025-07-22
Publication Date
2026-06-26

AI Technical Summary

Technical Problem

In long-context scenarios, existing key-value (KV) cache eviction methods fail to reflect the information in the context as a whole, and the entries they retain do not match what the model actually attends to, so response quality declines.

Method used

The patent proposes a text answering method for large language models that combines draft answers with key-value (KV) cache eviction. The long text sequence is segmented and encoded, and the query vectors at the end of the query vector set are retained. Attention scores computed from these query vectors determine which key vectors and value vectors are important enough to keep, and autoregressive decoding over the retained vectors generates more accurate answers.
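The selection step described above can be illustrated with a minimal sketch. It is not the patented procedure: the function name `select_kv`, the `window` and `budget` parameters, the single attention head, and the choice to sum softmax attention weights over the retained queries are all assumptions made for illustration. It only shows the general idea of keeping the trailing query vectors, scoring every cached key against them, and retaining the highest-scoring key/value pairs.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def select_kv(queries, keys, values, window, budget):
    """Hypothetical helper: keep the `budget` key/value pairs that receive
    the most attention from the last `window` query vectors."""
    tail = queries[-window:]              # query vectors at the end of the set
    scale = 1.0 / math.sqrt(len(keys[0]))  # scaled dot-product attention
    importance = [0.0] * len(keys)
    for q in tail:
        logits = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
        for i, weight in enumerate(softmax(logits)):
            importance[i] += weight        # accumulate attention per key
    ranked = sorted(range(len(keys)), key=lambda i: importance[i], reverse=True)
    kept = sorted(ranked[:budget])         # restore original token order
    return kept, [keys[i] for i in kept], [values[i] for i in kept]


# Toy data (made up): four cached positions, three queries, keep two pairs.
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
values = [[10.0], [20.0], [30.0], [40.0]]
queries = [[0.0, -5.0], [2.0, 0.0], [1.0, 1.0]]

kept, k2, v2 = select_kv(queries, keys, values, window=2, budget=2)
print(kept)  # indices of the retained cache entries
```

Subsequent autoregressive decoding would then attend only to the retained pairs (`k2`, `v2`), which is where the memory saving comes from. In the method described here, a draft answer also informs which entries count as important; how that information enters the scoring is not specified in this summary, so the sketch leaves it out.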

Benefits of technology

For the same answer accuracy, the method uses less KV cache. It generates more accurate answers and reduces the GPU memory the model needs during answer generation.

✦ Generated by Eureka AI based on patent content.

Smart Images

  • Figure CN120849565B_ABST

Abstract

A large language model text answering method that integrates a draft answer with KV cache eviction, belonging to the field of text answer generation with large language models. It addresses the low answer quality of answering methods built on existing KV cache eviction. Information from the draft answer is used so that the small portion of the KV cache that is retained (K2 and V2) is the more important portion, and attention scores are introduced when selecting the retained KV cache (K2 and V2), so that information is considered more comprehensively and the model generates more accurate answers. The application is mainly used for large language models answering text questions.