Method and apparatus for compressing key-value pair cache data

By constructing a key-value tree using a tree-like hierarchical structure and a sliding window for eviction range, the problem of excessive memory usage in the key-value pair cache of large language models is solved. This achieves efficient compression of key-value pair data within a fixed cache space, improving the model's performance and computation speed in long-context tasks.

WO2026129690A1 · PCT designated stage · Publication Date: 2026-06-25 · HUAWEI TECH CO LTD

Patent Information

Authority / Receiving Office: WO · WO
Patent Type: Applications
Current Assignee / Owner: HUAWEI TECH CO LTD
Filing Date: 2025-08-13
Publication Date: 2026-06-25

Application Information

IPC: G06F12/0897

AI Tagging

Application Domain

Memory systems

Technology Topics

Tree root · Theoretical computer science

AI Technical Summary

Technical Problem

Key-value pair caching in large language models consumes too much memory in long-context scenarios or resource-constrained environments, leading to inefficiency.

Method used

The key-value pair cache data is organized using a tree-like hierarchical structure. The key-value tree is constructed by using a sliding window for eviction range and importance indicators to gradually reduce the number of key-value pairs, keep the cache space fixed, and prioritize retaining the key-value pairs corresponding to the most recent token.

Benefits of technology

It effectively reduces the memory overhead of key-value pair caching, ensures that the model runs efficiently in long-context tasks, reduces the loss of important information, and improves computation speed and context understanding capabilities.

✦ Generated by Eureka AI based on patent content.

Abstract

Provided are a method and apparatus for compressing key-value pair cache data, relating to the technical field of artificial intelligence (AI). The method comprises: acquiring key-value pair data, wherein the key-value pair data comprises N key-value pairs, and each key-value pair comprises a key vector and a value vector obtained by means of attention computation performed by an attention mechanism-based AI model in an inference process; caching the key-value pair data into a cache space; determining that N is greater than a preset threshold M, wherein the preset threshold M is determined on the basis of a preset size of the cache space; determining target key-value pairs from the key-value pair data on the basis of M key-value trees, wherein each key-value tree is a tree-like hierarchical structure, the key-value tree comprises a plurality of nodes, each node represents one or more key-value pairs, and key-value pairs located at root nodes of the key-value trees are the target key-value pairs; and storing the target key-value pairs in the cache space, and deleting non-target key-value pairs from among the N key-value pairs. In the present application, the key-value pair cache data is smoothly compressed by means of the tree structure, thereby significantly reducing memory overhead of the key-value pair cache.

Description

A method and apparatus for compressing key-value pair cached data

[0001] This application claims priority to Chinese Patent Application No. 202411854965.7, filed on December 16, 2024, entitled “A method and apparatus for compressing key-value pair cache data”, the entire contents of which are incorporated herein by reference.

Technical Field

[0002] This application relates to the field of artificial intelligence (AI) technology, and in particular to a method and apparatus for compressing key-value pair cached data.

Background Technology

[0003] Large language models (LLMs) exhibit impressive capabilities in understanding and generating text, performing tasks such as summarizing, question answering, and creative writing at a human level. To support efficient token generation, Transformer-based LLMs typically store key-value (KV) pairs of past tokens in memory, a process known as KV caching. However, for very long sequences, the memory required for KV caching can be several times that required to store model parameters, posing a significant challenge in long-context scenarios or resource-constrained environments.

Summary of the Invention

[0004] The embodiments of this application provide a method and apparatus for compressing key-value pair cache data, which uses a tree structure to smoothly compress key-value pair cache data, significantly reducing the memory overhead of the key-value pair cache.

[0005] In a first aspect, this application provides a method for compressing key-value pair cached data. The method includes acquiring key-value pair data, which includes N key-value pairs, each key-value pair including a key vector and a value vector obtained by an AI model based on an attention mechanism during inference, where N is a positive integer; caching the key-value pair data into a cache space; determining that N is greater than a preset threshold M, where the preset threshold M is determined based on a preset size of the cache space, and M is a positive integer; determining a target key-value pair from the key-value pair data based on M key-value trees, where the key-value trees are tree-like hierarchical structures, each key-value tree including several nodes, each node representing one or more key-value pairs among the N key-value pairs, and the key-value pair located at the root node of the key-value tree being the target key-value pair; storing the target key-value pair in the cache space; and deleting non-target key-value pairs from the N key-value pairs.
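The flow of the first aspect can be sketched as follows. This is a minimal illustration rather than the claimed implementation: the function name `compress_cache`, the `root_positions` argument, and the use of a dict keyed by token position are assumptions made for this example, and the construction of the key-value trees is not shown.

```python
def compress_cache(cache, root_positions, m):
    """Keep only target (root) key-value pairs once the cache exceeds m entries.

    cache: dict mapping token position -> (key_vector, value_vector).
    root_positions: set of positions located at the roots of the key-value
        trees (hypothetical input; tree construction is not shown here).
    m: preset threshold M, derived from the preset size of the cache space.
    """
    if len(cache) <= m:
        return dict(cache)  # N <= M: nothing needs to be deleted
    # N > M: store the target pairs and delete the non-target pairs
    return {pos: kv for pos, kv in cache.items() if pos in root_positions}


cache = {pos: ([float(pos)], [float(pos) * 2]) for pos in range(5)}  # N = 5
kept = compress_cache(cache, root_positions={0, 2, 3, 4}, m=4)
print(sorted(kept))  # positions of the retained key-value pairs
```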

[0006] The method for compressing key-value pair cache data provided in this application organizes the generated key-value pairs using a tree-like hierarchical structure. As the number of levels of the tree structure increases, the number of key-value pairs is gradually reduced, achieving smooth compression of the cached data and effectively reducing the memory overhead of the key-value pair cache. At the same time, the key-value pairs at the root nodes of the tree structure are used as the target key-value pairs, so that although the number of key-value pairs generated by the AI model grows linearly, only a fixed-size cache space is required.

[0007] Optionally, the attention-based AI model is a large language model.

[0008] In another possible implementation, the key-value pair data includes the key-value pairs corresponding to each token obtained by the large language model during the attention calculation in the decoding stage. Before determining the target key-value pairs from the key-value pair data based on the key-value tree, the method further includes: constructing a key-value tree based on the importance index of each key-value pair in the N key-value pairs and the elimination range sliding window. The elimination range sliding window slides periodically from beginning to end along the current key-value pair sequence as the decoding time step increases. The elimination range sliding window is used to select at least two key-value pairs from the current key-value pair sequence. The at least two key-value pairs are designated as sibling nodes, and the key-value pair with the largest importance index among the at least two key-value pairs is designated as the parent node. The current key-value pair sequence includes the key-value pairs already cached in the current cache space and the key-value pairs corresponding to the tokens decoded in the current decoding time step.

[0009] In this possible implementation, a key-value tree is constructed using the importance index of each key-value pair and the elimination range sliding window. As the decoding time step increases, the elimination range sliding window slides from beginning to end along the current key-value pair sequence, preferentially eliminating the key-value pairs of older, less important tokens and preferentially retaining those of the most recent tokens. This reduces the bias caused by retained key-value pairs being concentrated in a few regions and enhances the model's ability to handle tasks that require comprehensive context, such as long-form generation and detailed understanding.

[0010] In another possible implementation, a specific implementation of constructing a key-value tree based on the importance index of each key-value pair in N key-value pairs and the elimination range sliding window is as follows: determine the position of the elimination range sliding window in the current key-value pair sequence at the current decoding time step; take at least two key-value pairs selected by the elimination range sliding window from the current key-value pair sequence as sibling nodes; compare the importance index of each key-value pair in the at least two key-value pairs, and take the key-value pair with the largest importance index as the parent node; take the key-value pairs in the current key-value pair sequence that do not have a parent node as the root nodes of M key-value trees respectively.

[0011] The elimination range of the current key-value pair sequence is determined by using an elimination range sliding window. The level of the key-value tree is updated according to the key-value pairs within the elimination range. That is, the key-value pairs within the elimination range are treated as sibling nodes, and the key-value pairs with higher importance indices within the elimination range are treated as parent nodes. In this way, the level of the key-value tree increases smoothly as the elimination range sliding window moves, thereby achieving smooth compression of the key-value pair cache data.
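A single update of the key-value tree under this scheme might look like the sketch below. The `Node` class, the function `evict_in_window`, the tie-breaking rule, and the choice to promote the more important node in place (instead of creating a separate parent node) are all assumptions for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    position: int        # token position of the key-value pair
    importance: float    # importance index of the key-value pair
    children: list = field(default_factory=list)


def evict_in_window(roots, idx):
    """Merge the two roots selected by the window starting at 1-based idx.

    The selected pair become siblings; the one with the larger importance
    index becomes the parent and remains a root, while the other is demoted
    and leaves the root list. Ties keep the left node (an assumption).
    """
    left, right = roots[idx - 1], roots[idx]
    if left.importance >= right.importance:
        parent, child = left, right
    else:
        parent, child = right, left
    parent.children.append(child)
    return [node for node in roots if node is not child]


# M = 4 cached roots plus one newly decoded token gives M + 1 = 5 candidates.
roots = [Node(p, w) for p, w in enumerate([0.9, 0.2, 0.5, 0.4, 0.7])]
roots = evict_in_window(roots, idx=2)  # window covers the 2nd and 3rd pairs
print([n.position for n in roots])
```

Only the nodes left in `roots` would stay in the cache; the demoted node records the tree level but its key-value pair would be deleted.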

[0012] In another possible implementation, the importance index of each key-value pair is determined based on the historical average attention weight of the token corresponding to that key-value pair. The historical average attention weight of a token is the average of all attention weights calculated for that token during inference. For example, for a token sequence of length 17, when generating the 18th token, if the first token in the token sequence historically has 17 attention weights, then the historical average attention weight of the first token is the average of these 17 attention weights. This application uses the historical average attention weight to measure the importance of the key-value pairs corresponding to the tokens, avoiding the concentration of selected key-value pairs at the beginning of the key-value pair sequence, which could lead to the loss of important information.
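The historical average can be maintained incrementally, as in the following sketch. The class name `AttentionHistory` is hypothetical, the example uses a single attention head, and it ignores the re-indexing that would be needed once key-value pairs are evicted.

```python
class AttentionHistory:
    """Running mean of the attention weights each token has received."""

    def __init__(self):
        self.sums = []
        self.counts = []

    def update(self, weights):
        """weights[i] is the attention the current query pays to token i."""
        while len(self.sums) < len(weights):
            self.sums.append(0.0)
            self.counts.append(0)
        for i, w in enumerate(weights):
            self.sums[i] += w
            self.counts[i] += 1

    def average(self, i):
        """Historical average attention weight of token i."""
        return self.sums[i] / self.counts[i]


history = AttentionHistory()
for step_weights in ([1.0], [0.6, 0.4], [0.2, 0.3, 0.5]):
    history.update(step_weights)
print([round(history.average(i), 3) for i in range(3)])
```

Dividing by the number of weights a token has received, rather than using the raw sum, keeps early tokens from accumulating an advantage simply because they have been attended to for more steps.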

[0013] In another possible implementation, among the M key-value trees, the density of target key-value pairs in a key-value tree on the left is lower than that in a key-value tree on the right, and the key-value pairs on the left key-value tree are generated earlier than those on the right key-value tree, thus achieving a compressed structure that is sparse on the left and dense on the right.

[0014] In another possible implementation, the key-value pair cache data compression method provided in this application further includes: correcting the hierarchical structure of the key-value tree according to the currently generated key-value pairs, so that it can dynamically adapt to changes in data. This adaptability ensures that the cache remains relevant to the current context.

[0015] In another possible implementation, the elimination range sliding window is determined based on a variable idx, which indicates that the starting position of the elimination range sliding window is the idx-th key-value pair in the current key-value pair sequence and that its ending position is the (idx+1)-th key-value pair in the current key-value pair sequence; the variable idx changes periodically between 1 and M as the decoding time step increases.

[0016] For example, each time the decoding time step is incremented by 1, the variable idx is incremented by 1. When idx reaches M, idx becomes 1 in the next decoding time step. In this way, the variable idx cycles through 1, 2, 3, ..., M, so that the elimination range sliding window slides from the beginning to the end of the current key-value pair sequence. This scheme preferentially removes key-value pairs corresponding to more distant tokens while retaining key-value pairs corresponding to the latest tokens, so that context information is not lost. At the same time, through the periodic cycling of the variable idx between 1 and M, a smooth transition is always maintained on the right side, realizing smooth compression of the key-value pair data.
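The cycling of idx and its effect on the cache size can be illustrated with a small simulation. This is a sketch under simplifying assumptions: `next_idx` and `decode_with_fixed_cache` are hypothetical names, each token carries a fixed importance index, and ties keep the left key-value pair.

```python
def next_idx(idx, m):
    """Advance the 1-based window start, wrapping from m back to 1."""
    return 1 if idx >= m else idx + 1


def decode_with_fixed_cache(importances, m):
    """Stream tokens through a cache that holds at most m key-value pairs.

    importances: one importance index per decoded token (hypothetical input).
    Returns the retained token positions and the idx values used for eviction.
    """
    cache, idx, used = [], 1, []
    for pos, weight in enumerate(importances):
        cache.append((pos, weight))
        if len(cache) > m:
            a, b = cache[idx - 1], cache[idx]      # idx-th and (idx+1)-th pairs
            cache.remove(a if a[1] < b[1] else b)  # drop the less important one
            used.append(idx)
            idx = next_idx(idx, m)
    return [pos for pos, _ in cache], used


kept, used = decode_with_fixed_cache(
    [0.5, 0.1, 0.9, 0.3, 0.8, 0.2, 0.7, 0.6, 0.4, 0.95], m=4
)
print(kept, used)
```

Ten tokens are decoded, yet the cache never holds more than M = 4 key-value pairs after an eviction, and the recorded idx values wrap around from M to 1.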

[0017] In another possible implementation, the key-value pair data includes the key-value pairs corresponding to each token obtained by the large language model during the attention calculation in the prefill stage; before determining the target key-value pairs from the key-value pair data based on M key-value trees, the method further includes: dividing the N key-value pairs into multiple key-value pair blocks, each key-value pair block including key-value pairs corresponding to adjacent tokens in the input sequence; constructing the key-value tree based on the importance index of each key-value pair block in the multiple key-value pair blocks and M elimination range windows, wherein the M elimination range windows are used to divide the key-value pair block sequence into different key-value pair block groups, and different key-value pair blocks in each key-value pair block group are used as sibling nodes, and the key-value pair block with the largest importance index among the sibling nodes is used as the parent node.

[0018] Important or irrelevant information is usually spatially clustered, and selecting tokens individually may compromise both the integrity of the context and computational speed. Therefore, this application adopts a block-level elimination strategy in the context compression task (i.e., the compression task of key-value pair data) to preserve the integrity of the context and speed up model inference.

[0019] In another possible implementation, a specific implementation of constructing a key-value tree based on the importance index of each key-value pair block in multiple key-value pair blocks and M elimination range windows is as follows: using the M elimination range windows, the key-value pair block sequence is divided into M key-value pair block groups; different key-value pair blocks in each key-value pair block group are taken as sibling nodes; the importance index of each key-value pair block in the sibling nodes is compared, and the key-value pair block with the largest importance index is taken as the parent node; the key-value pair blocks without parent nodes in the M key-value pair block groups are taken as the root nodes of the M key-value trees respectively.
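The block-level selection described above can be sketched as follows. The function names, the fixed block size, and in particular the use of equal-sized contiguous groups are assumptions for this example; the text here does not state how the M elimination range windows are positioned.

```python
def split_into_blocks(n, block_size):
    """Partition token positions 0..n-1 into blocks of adjacent tokens."""
    return [list(range(s, min(s + block_size, n))) for s in range(0, n, block_size)]


def group_blocks(num_blocks, m):
    """Split block indices into m contiguous groups (equal windows assumed)."""
    bounds = [k * num_blocks // m for k in range(m + 1)]
    return [list(range(bounds[k], bounds[k + 1])) for k in range(m)]


def select_root_blocks(block_scores, m):
    """Per group, return the block with the largest importance index.

    Requires len(block_scores) >= m so that no group is empty.
    """
    groups = group_blocks(len(block_scores), m)
    return [max(group, key=lambda b: block_scores[b]) for group in groups]


blocks = split_into_blocks(n=12, block_size=2)   # 6 blocks of 2 tokens
scores = [0.1, 0.5, 0.3, 0.2, 0.4, 0.9]          # hypothetical importance indices
root_blocks = select_root_blocks(scores, m=3)
kept_positions = [pos for b in root_blocks for pos in blocks[b]]
print(root_blocks, kept_positions)
```

Keeping whole blocks means that neighbouring tokens survive or are evicted together, which is the property the block-level strategy relies on.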

[0020] In another possible implementation, the importance metric for each key-value pair block is determined based on observation information obtained by using the last key-value pair block in the key-value pair block sequence as the observation window. This avoids full attention computation, improves computation speed, and reduces computational overhead.

[0021] In another possible implementation, the observation information of each key-value pair block includes multiple attention weights for that key-value pair block. The multiple attention weights are the attention weights of the tokens corresponding to the key-value pairs in that key-value pair block relative to the tokens corresponding to the key-value pairs in the last key-value pair block. The importance index of each key-value pair block is determined based on the average attention weight of that key-value pair block, which is obtained by averaging its multiple attention weights.
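One way to compute such a block-level importance index is sketched below. The function name `block_importance` and the layout of `attn` (one row per query token in the observation window, one column per key) are assumptions, and causal masking inside the last block is ignored for brevity.

```python
def block_importance(attn, blocks):
    """Average attention weight per key-value pair block.

    attn[q][k]: attention weight that query token q of the observation
        window (the last block) assigns to key k.
    blocks: list of lists of key positions, one list per block.
    """
    scores = []
    for block in blocks:
        weights = [row[k] for row in attn for k in block]
        scores.append(sum(weights) / len(weights))
    return scores


# Two observation-window queries attending over four keys in two blocks.
attn = [
    [0.1, 0.1, 0.3, 0.5],
    [0.2, 0.0, 0.2, 0.6],
]
scores = block_importance(attn, blocks=[[0, 1], [2, 3]])
print([round(s, 3) for s in scores])
```

Only the rows belonging to the observation window are needed, so the full attention matrix over all query positions never has to be formed.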

[0022] In another possible implementation, the structure of the key-value tree is determined based on the analysis results obtained by performing wavelet transform analysis on the product of the attention weights and value vectors of each generated token. The analysis results indicate the contribution of the key-value pairs of each generated token to the newly generated token.

[0023] Performing wavelet transform analysis on the product of the attention weights and value vectors of already generated tokens reveals the contribution of the key-value pair of each already generated token to the currently generated token. For example, this application performs wavelet decomposition on the product of the attention weights and value vectors and plots the amplitude of the time-domain representation corresponding to frequency components at different levels. It was found that as the position approaches the end of the sequence, the time-domain amplitudes corresponding to all frequency components gradually increase, with higher frequencies increasing at a faster rate. This indicates that high-frequency information is denser closer to the generation end, meaning that the key-value pair data corresponding to tokens closer to the generation end contributes more to the current generation. This observation motivates the layered tree structure adopted in this application, which achieves smooth compression of key-value pair data by being sparse on the left and dense on the right.
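The kind of multi-level decomposition referred to here can be illustrated with a Haar wavelet. The text does not name the wavelet family or say how the vector-valued product is reduced to a one-dimensional signal, so the Haar choice, the function names, and the scalar `contribution` signal below are all assumptions.

```python
import math


def haar_step(signal):
    """One level of the Haar transform on an even-length sequence."""
    pairs = [(signal[i], signal[i + 1]) for i in range(0, len(signal) - 1, 2)]
    approx = [(a + b) / math.sqrt(2) for a, b in pairs]
    detail = [(a - b) / math.sqrt(2) for a, b in pairs]
    return approx, detail


def haar_decompose(signal, levels):
    """Return the final approximation and the detail coefficients per level."""
    approx, details = list(signal), []
    for _ in range(levels):
        approx, detail = haar_step(approx)
        details.append(detail)  # details[0] is the highest-frequency band
    return approx, details


# Hypothetical per-position contribution, e.g. a norm of
# (attention weight * value vector) for each already generated token.
contribution = [1.0, 1.0, 1.0, 1.0, 1.0, 3.0, 1.0, 5.0]
approx, details = haar_decompose(contribution, levels=2)
print([round(abs(d), 3) for d in details[0]])
```

In this toy signal the high-frequency amplitudes are zero at the start of the sequence and grow toward the end, mirroring the trend described above.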

[0024] In a second aspect, this application also provides a compression device for key-value pair cached data. The device includes an acquisition module, a caching module, a first determination module, a second determination module, and a compression module. The acquisition module is used to acquire key-value pair data, which includes N key-value pairs. Each key-value pair includes a key vector and a value vector obtained by an AI model based on an attention mechanism during inference. N is a positive integer. The caching module is used to cache the key-value pair data in a cache space. The first determination module is used to determine that N is greater than a preset threshold M, which is determined based on a preset size of the cache space. M is a positive integer. The second determination module is used to determine the target key-value pair from the key-value pair data based on M key-value trees. The key-value trees are tree-like hierarchical structures, each including several nodes. Each node represents one or more key-value pairs, and the key-value pair located at the root node of the key-value tree is the target key-value pair. The compression module is used to store the target key-value pair in the cache space and delete non-target key-value pairs from the N key-value pairs.

[0025] Optionally, the attention-based AI model is a large language model.

[0026] In another possible implementation, the key-value pair data includes the key-value pairs corresponding to each token obtained by the large language model during the attention calculation in the decoding stage; the compression device for key-value pair cache data provided in this application also includes a first construction module, which is used to construct a key-value tree based on the importance index of each key-value pair in N key-value pairs and the elimination range sliding window. The elimination range sliding window slides periodically from beginning to end along the current key-value pair sequence as the decoding time step increases. The elimination range sliding window is used to select at least two key-value pairs from the current key-value pair sequence. The at least two key-value pairs are used as sibling nodes, and the key-value pair with the largest importance index among the at least two key-value pairs is used as the parent node. The current key-value pair sequence includes the key-value pairs already cached in the current cache space and the key-value pairs corresponding to the tokens decoded in the current decoding time step.

[0027] In another possible implementation, the first construction module is specifically used to: determine the position of the elimination range sliding window in the current key-value pair sequence at the current decoding time step; designate at least two key-value pairs selected by the elimination range sliding window from the current key-value pair sequence as sibling nodes; compare the importance index of each key-value pair in the at least two key-value pairs, and designate the key-value pair with the largest importance index as the parent node; and designate the key-value pairs in the current key-value pair sequence that do not have a parent node as the root nodes of M key-value trees respectively.

[0028] In another possible implementation, the importance index of each key-value pair is determined based on the historical average attention weight of the token corresponding to that key-value pair.

[0029] In another possible implementation, the density of target key-value pairs in the left key-value tree of the M key-value trees is less than the density of target key-value pairs in the right key-value tree, and the key-value pairs on the left key-value tree are generated earlier than the key-value pairs on the right key-value tree.

[0030] In another possible implementation, the first construction module is also used to correct the hierarchical structure of the key-value tree based on the currently generated key-value pairs, enabling it to dynamically adapt to changes in the data. This adaptability ensures that the cache remains relevant to the current context.

[0031] In another possible implementation, the elimination range sliding window is determined based on a variable idx, which indicates that the starting position of the elimination range sliding window is the idx-th key-value pair in the current key-value pair sequence and that its ending position is the (idx+1)-th key-value pair in the current key-value pair sequence; the variable idx changes periodically between 1 and M as the decoding time step increases.

[0032] In another possible implementation, the key-value pair data includes the key-value pairs corresponding to each token obtained by the large language model during the attention calculation in the prefill stage; the compression device for key-value pair cache data provided in this application also includes a second construction module, which is used to: divide N key-value pairs into multiple key-value pair blocks, each key-value pair block including key-value pairs corresponding to adjacent tokens in the input sequence; construct the key-value tree based on the importance index of each key-value pair block in the multiple key-value pair blocks and M elimination range windows, wherein the M elimination range windows are used to divide the key-value pair block sequence into different key-value pair block groups, and different key-value pair blocks in each key-value pair block group are used as sibling nodes, and the key-value pair block with the largest importance index among the sibling nodes is used as the parent node.

[0033] In another possible implementation, a specific implementation of constructing a key-value tree based on the importance index of each key-value pair block in multiple key-value pair blocks and M elimination range windows is as follows: using the M elimination range windows, the key-value pair block sequence is divided into M key-value pair block groups; different key-value pair blocks in each key-value pair block group are taken as sibling nodes; the importance index of each key-value pair block in the sibling nodes is compared, and the key-value pair block with the largest importance index is taken as the parent node; the key-value pair blocks without parent nodes in the M key-value pair block groups are taken as the root nodes of the M key-value trees respectively.

[0034] In another possible implementation, the importance index of each key-value pair block is determined based on observation information of that key-value pair block, which is obtained by using the last key-value pair block in the key-value pair block sequence as the observation window.

[0035] In another possible implementation, the observation information of each key-value pair block includes multiple attention weights for that key-value pair block. The multiple attention weights are the attention weights of the tokens corresponding to the key-value pairs in that key-value pair block relative to the tokens corresponding to the key-value pairs in the last key-value pair block. The importance index of each key-value pair block is determined based on the average attention weight of that key-value pair block, which is obtained by averaging its multiple attention weights.

[0036] In another possible implementation, the structure of the key-value tree is determined based on the analysis results obtained by performing wavelet transform analysis on the product of the attention weights and value vectors of each generated token. The analysis results indicate the contribution of the key-value pairs of each generated token to the newly generated token.

[0037] In a third aspect, embodiments of this application provide a computing device, including a memory and a processor, wherein the memory stores instructions that, when executed by the processor, cause the method described in the first aspect or any possible implementation of the first aspect to be implemented.

[0038] In a fourth aspect, embodiments of this application provide a computer-readable storage medium having a computer program stored thereon, which, when executed by a processor, causes the method described in the first aspect or any possible implementation thereof to be implemented.

[0039] In a fifth aspect, embodiments of this application also provide a computer program or computer program product, which includes instructions that, when executed, cause a computer to perform the method described in the first aspect or any possible implementation thereof.

[0040] In a sixth aspect, embodiments of this application also provide a chip including at least one processor and a communication interface, the processor being configured to perform the method described in the first aspect or any possible implementation thereof.

[0041] It is understandable that the beneficial effects of the second to sixth aspects mentioned above can be found in the relevant descriptions in the first aspect mentioned above, and will not be repeated here for the sake of brevity.

Attached Figure Description

[0043] Figure 1 shows a schematic diagram of the location distribution of the tokens selected by the related technical solutions H2O, TOVA and the solution provided in the embodiments of this application (which can be referred to as TreeKV);

[0044] Figure 2 shows a schematic diagram of the distribution of the average high-frequency information component amplitude obtained by wavelet decomposition;

[0045] Figure 3 illustrates the compression process of the key-value pair cache data compression method provided in the embodiments of this application in an actual token sequence;

[0046] Figure 4 illustrates the compression process in the actual token sequence when the key-value pair cache data compression method provided in the embodiments of this application is applied to a long text compression scenario.

[0047] Figure 5 is a schematic diagram of the implementation process of a key-value pair cache data compression method provided in an embodiment of this application;

[0048] Figure 6 shows the negative log-likelihood (NLL) function curves of four compression schemes, TOVA, H2O, StreamingLLM, and TreeKV, with input lengths ranging from 0.1M to 1M.

[0049] Figure 7 shows the log-mean perplexity curves for three schemes: H2O, TreeKV, and TreeKV Select Left Token.

[0050] Figure 8 is a schematic diagram of a compression device for key-value pair cache data provided in an embodiment of this application;

[0051] Figure 9 is a schematic diagram of the structure of the computing device provided in the embodiment of this application.

Detailed Implementation

[0052] The term "and / or" used in this article describes the relationship between related objects, indicating that three relationships can exist. For example, A and / or B can represent three cases: A alone, A and B simultaneously, and B alone. The symbol " / " in this article indicates that the related objects are in an "or" relationship; for example, A / B means A or B.

[0053] The terms "first" and "second," etc., used in the specification and claims herein are used to distinguish different objects, not to describe a specific order of objects. It should be understood that such terms are interchangeable where appropriate; this is merely a way of distinguishing objects with the same properties in the description of embodiments of this application. Furthermore, the terms "comprising" and "having," and any variations thereof, are intended to cover non-exclusive inclusion, so that a process, method, system, product, or apparatus that comprises a list of elements is not necessarily limited to those elements, but may include other elements not expressly listed or inherent to such processes, methods, systems, products, or apparatus.

[0054] In the embodiments of this application, the terms "exemplary" or "for example" are used to indicate that something is an example, illustration, or description. Any embodiment or design that is described as "exemplary" or "for example" in the embodiments of this application should not be construed as being more preferred or advantageous than other embodiments or designs. Rather, the use of the terms "exemplary" or "for example" is intended to present the relevant concepts in a specific manner.

[0055] In the description of the embodiments of this application, unless otherwise stated, "multiple" means two or more, for example, multiple processing units means two or more processing units, multiple elements means two or more elements, etc.

[0056] The key-value pair cache data compression method and apparatus provided in this application can be applied to compress key-value cache data obtained by attention calculation during the inference process of AI models based on attention mechanisms, thereby reducing the memory overhead of key-value cache data. For example, it can be used to compress key-value cache data for long-context tasks in AI models for computer vision, speech recognition, and multimodal learning, and is particularly applicable to the compression of key-value cache data for large language models. The detailed implementation of the key-value pair cache data compression method and apparatus provided in this application is described below using a large language model as an example. The specific implementations for other types of AI models are similar and, for brevity, are not described in detail in this application.

[0057] To facilitate understanding of the key-value pair cache data compression method and apparatus provided in the embodiments of this application, some technical terms involved in the embodiments of this application will be briefly explained below.

[0058] Token: The basic unit in text, which can be a word, a character, a punctuation mark, or a subword. It is the smallest processing unit for text in a large language model. When a large language model infers on an input text sequence, it first divides the input text into multiple tokens, and then performs further inference and analysis token by token.

[0059] Transformer: A network based on multi-head attention mechanism that includes residual connections, layer normalization, and fully connected layers, capable of processing sequential data in parallel.

[0060] Key-value (KV) pairs: In a key-value pair, the "key" is a unique identifier, usually used to index and retrieve information; while the "value" is the data associated with that key.

[0061] KV caching: A mechanism in the Transformer model to store key-value pairs of generated tokens, thereby avoiding redundant calculations during subsequent inference and improving efficiency.

[0062] Prompt: refers to the input text or instruction provided when interacting with an artificial intelligence model, used to guide the model to generate corresponding output.

[0063] Language modeling is the task of predicting the probability distribution of the next word or character, which can help computers understand and generate natural language.

[0064] The negative log-likelihood (NLL) function is a metric that measures the difference between model predictions and actual data. The smaller the value, the better the model fits the data.

[0065] To address the issue of excessive memory consumption in key-value (KV) caching, various KV caching compression schemes (also known as KV caching eviction schemes) have been proposed in related technologies. These schemes typically involve two different scenarios: long text generation and long text compression.

[0066] In long text generation scenarios, related technical solutions include StreamingLLM and LM-Infinite, which only retain key-value pairs corresponding to the first and most recent tokens. Other solutions, such as H2O, TOVA, and Scissorhands, select tokens based on their importance scores. These methods not only control cache size but also allow the model to handle sequences longer than the pre-training context.

[0067] StreamingLLM and LM-Infinite found that attention scores are primarily concentrated on the initial and recent tokens in the KV cache, so their methods only retain the KV pairs corresponding to these tokens to reduce cache size. However, this can lead to the loss of important information because the KV pairs corresponding to tokens between the initial and sliding windows are directly discarded. H2O introduces a cache eviction method that greedily selects the KV pairs corresponding to important tokens based on a weight score calculated from the attention weights accumulated during generation. Scissorhands uses a similar strategy but binarizes the scores. TOVA uses the attention score of the last token to select tokens. However, these methods often ignore the structure of the key information distribution and simply evict tokens throughout the sequence.

[0068] Figure 1 illustrates the distribution of token positions selected by the related technical solutions H2O and TOVA and by the solution provided in this application embodiment (which can be referred to as TreeKV). Panels (a), (b), and (c) of Figure 1 show the distribution of the tokens selected by H2O, TOVA, and TreeKV, respectively, for a 512-length sequence randomly selected from PG19 with a cache size of 128. The selected token positions are quantized as 1 (dark color) and all others as 0 (light color), and the result is averaged over the 32 heads to reduce noise. A clear pattern emerges: H2O and TOVA exhibit significant regional bias because they neglect the token elimination range. This may lead to an oversimplified interpretation of the sequence and affect the model's ability to capture subtle interactions, thereby reducing performance in tasks requiring a holistic understanding.

[0069] In summary, the relevant technical solutions have some limitations. Purely position-based selection strategies may miss important tokens outside the predefined region. On the other hand, as shown in Figure 1, strategies based on global importance scores often exhibit strong regional bias (e.g., the tokens selected by the H2O scheme are concentrated at the beginning of the sequence, while the tokens selected by the TOVA scheme are concentrated at the end of the sequence). This limits the ability of the KV cache to maintain a global perspective and may affect the performance of large language models on complex and context-rich tasks.

[0070] In long text compression scenarios, related technical solutions, such as SnapKV and PyramidKV, also focus on retaining the key-value pairs corresponding to key tokens based on importance scores when compressing prompts.

[0071] SnapKV retains important tokens and their most recent tokens based on attention weights to obtain more detail. PyramidKV and PyramidInfer found that attention is widely distributed at lower levels and gradually concentrates at higher levels. Therefore, they adjust the size of the KV cache according to hierarchy and select tokens in a funnel-shaped manner. While these eviction strategies effectively reduce the size of the KV cache, their short-sighted focus on certain areas ignores the importance of comprehensive context within the broader narrative.

[0072] Another research direction focuses on structure-guided long context processing. FCA has developed a method for hierarchically selecting important tokens, and Fovea Transformer creates multi-scale trees to effectively capture long contextual dependencies. However, these studies are limited to encoder-based models and cannot be directly applied to pre-trained large language models without additional adjustments, thus limiting their applicability to generative models.

[0073] In view of this, embodiments of this application propose a method and apparatus for compressing key-value pair cache data. The generated KV pair data is organized in a tree-like hierarchical structure, and temporal locality is used to enhance smooth KV cache compression, thereby achieving effective compression of KV cache data while ensuring the performance of large language models.

[0074] The key-value pair cache data compression method provided in this application can be applied to KV cache compression in long text generation scenarios, as well as KV cache compression in long text compression scenarios. The implementation of the TreeKV scheme provided in this application is described below for both long text generation and long text compression scenarios.

[0075] Inspired by previous research applying frequency analysis to the hidden states of language models, the applicant of this application first uses wavelet transform to examine the frequency representation of the information contributed by tokens at different positions during the full attention generation process. The results show that as the position gradually approaches the end of the sequence, all frequency components gradually increase, with higher frequencies increasing at a faster rate. This indicates that as a token's position gets closer to the generation end, its contributed information not only increases but also tends to increase in difference from neighboring tokens.

[0076] Based on these insights, this application proposes a key-value pair cache data compression scheme, which can be called TreeKV. This is an intuitive, training-free method that achieves smooth KV cache compression through a tree structure. Unlike other cache eviction strategies, TreeKV optimizes computational efficiency and memory usage by maintaining an abstraction of the input sequence (i.e., selecting the most important KV pairs from the generated KV data through a tree structure for caching, thus extracting a summary of the input sequence), and promotes a structured and smooth transition between short-term and long-term contexts. By strategically removing past tokens while prioritizing the retention of recent ones, the proposed method minimizes bias from highly concentrated regions (see Figure 1(c), where the TreeKV scheme of this application is illustrated with gradually darkening colors, rather than being concentrated in a single region as in related technical solutions, indicating that the selected tokens in this application are not highly concentrated in a single region, thus fully preserving key contextual information). This enhances the model's ability to handle tasks requiring comprehensive context, such as long-form generation and detailed understanding.

[0077] Multi-level discrete wavelet transform allows signals to be represented using wavelets at multiple resolution levels. Unlike the sine and cosine functions used in the Fourier transform, wavelets are localized and can simultaneously represent frequency and time information (in this embodiment, the token position plays the role of time). In single-level discrete wavelet decomposition, the discrete signal x[n] is filtered by a low-pass filter g[n] and a high-pass filter h[n], and then downsampled. The approximation coefficients A[n] represent the low-frequency components of the signal and capture its main features and overall trends. The detail coefficients D[n] correspond to the high-frequency components of the signal and capture more subtle details. In multi-level discrete wavelet decomposition, the approximation coefficients A[n] are further decomposed through repeated single-level decomposition. The formula for multi-level discrete wavelet decomposition is shown below:
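For reference, the filter-and-downsample operation described in this paragraph is conventionally written as follows. This is the standard textbook form, offered as an assumption about notation; the level index j and the convention A_0 = x are introduced here only for illustration.

```latex
A_{1}[n] = \sum_{k} x[k]\, g[2n - k], \qquad
D_{1}[n] = \sum_{k} x[k]\, h[2n - k]

A_{j+1}[n] = \sum_{k} A_{j}[k]\, g[2n - k], \qquad
D_{j+1}[n] = \sum_{k} A_{j}[k]\, h[2n - k], \qquad A_{0} = x
```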

[0078] Where x represents the input signal of the multi-level discrete wavelet decomposition, n represents the frequency value, and k represents the position of the token in the generated sequence.

[0079] For example, for a sequence of length 512, we treat the product of the attention weight and value vector of the tokens preceding that position as a signal. We analyze it along the positional dimension using multi-level discrete wavelets (e.g., Haar transform) and reconstruct the signal using the detail coefficients of each level to obtain the time-domain representation of the frequency components at the corresponding level.

[0080] For example, given a sequence of length 512, when generating the 513th token using the last token as the query, a multi-level discrete wavelet decomposition is performed on the product of the attention weights (or attention scores) and value vectors of the previous 512 tokens, and the amplitudes of the time-domain representations corresponding to the frequency-domain components at different levels are plotted. For instance, the signal is first converted from the time domain to the frequency domain. In the frequency domain, only the top 5 frequency points D1, D2, D3, D4, and D5 are retained, and all other frequency points are set to zero. Then, the retained top 5 frequency points D1, D2, D3, D4, and D5 are inversely transformed back to the time domain to obtain the amplitudes Rec(D1), Rec(D2), Rec(D3), Rec(D4), and Rec(D5) of the high-frequency signal in the time domain.
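The per-level analysis described above can be sketched in Python with a Haar wavelet. This is a minimal illustration under stated assumptions: the signal is a short list of scalars standing in for the per-position product of attention weight and value vector, the values are made up, and the function names are hypothetical. Each `reconstruct_detail` call keeps only one level's detail coefficients, zeroes everything else, and inverts the transform to obtain Rec(D_j).

```python
import math
from typing import List, Tuple

SQRT2 = math.sqrt(2.0)


def haar_decompose(signal: List[float], levels: int) -> Tuple[List[float], List[List[float]]]:
    """Multi-level Haar DWT. Returns (A_levels, [D_1, ..., D_levels])."""
    if len(signal) % (2 ** levels) != 0:
        raise ValueError("signal length must be divisible by 2**levels")
    approx = list(signal)
    details = []
    for _ in range(levels):
        pairs = [(approx[i], approx[i + 1]) for i in range(0, len(approx), 2)]
        details.append([(a - b) / SQRT2 for a, b in pairs])  # high-pass, then downsample
        approx = [(a + b) / SQRT2 for a, b in pairs]         # low-pass, then downsample
    return approx, details


def _inverse_step(approx: List[float], detail: List[float]) -> List[float]:
    """Undo one Haar level: each (approx, detail) pair yields two samples."""
    out = []
    for a, d in zip(approx, detail):
        out.extend([(a + d) / SQRT2, (a - d) / SQRT2])
    return out


def reconstruct_detail(detail: List[float], level: int) -> List[float]:
    """Position-domain signal carried by the detail coefficients of one level alone."""
    current = _inverse_step([0.0] * len(detail), detail)
    for _ in range(level - 1):
        current = _inverse_step(current, [0.0] * len(current))
    return current


def reconstruct_approx(approx: List[float], levels: int) -> List[float]:
    """Position-domain signal carried by the final approximation coefficients alone."""
    current = list(approx)
    for _ in range(levels):
        current = _inverse_step(current, [0.0] * len(current))
    return current


# Toy signal (hypothetical values): contribution grows towards the end of the sequence.
signal = [0.1, 0.1, 0.2, 0.1, 0.3, 0.2, 0.6, 0.9]
approx, details = haar_decompose(signal, levels=3)
rec = [reconstruct_detail(d, j + 1) for j, d in enumerate(details)]  # Rec(D1), Rec(D2), Rec(D3)
```

Because the transform is linear, the per-level reconstructions plus the reconstructed approximation add back up to the original signal, which is a convenient sanity check when plotting their amplitudes.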

[0081] Figure 2 illustrates the distribution of the average high-frequency information component amplitudes obtained through wavelet decomposition. As shown in Figure 2, as the position approaches the end of the sequence, the average magnitude of the time-domain signal corresponding to all frequency components gradually increases, with higher frequencies increasing at a faster rate. This indicates that the high-frequency information is denser closer to the generation end. That is, the key-value pair corresponding to a token closer to the generation end contributes more to the current generation, and the difference between it and its neighboring tokens also tends to increase. Therefore, this application's embodiment employs a layered tree structure to achieve smooth compression of the KV cache data, in which the retained key-value pairs are sparse on the left and dense on the right.

[0082] For long text generation scenarios, the key-value pair cache data compression method provided in this application organizes the generated KV pair data into a tree-like hierarchical structure and keeps the retained key-value pairs sparse on the left and dense on the right to enhance the smoothness of KV cache compression.

[0083] The inference process of a large language model consists of two stages: prefilling and decoding. To accelerate the inference speed of a large language model, key-value pairs (KV pairs) containing generated tokens are cached in a KV cache during both the prefilling and decoding stages. In subsequent inference iterations, these KV pairs can be retrieved from the cache, avoiding redundant computation and thus accelerating attention processing, effectively increasing the inference speed of the large language model. However, the KV cache consumes a significant amount of GPU memory, especially in long text generation scenarios, where memory usage is particularly severe.

[0084] The key-value pair cache data compression method provided in this application embodiment is applied to the compression of KV cache data in long text generation scenarios. Specifically, it refers to the cache compression of KV pair data generated by the long token sequence during the decoding stage of a large language model.

[0085] In other words, the key-value pair cache data compression method provided in this application embodiment can be applied to the smooth compression of KV cache in the decoding stage of a large language model, so as to achieve the use of only a fixed-size cache space to cache the KV pair data generated in the decoding stage.

[0086] First, set the cache size parameter c. This parameter determines the number of key-value pairs that the large language model can cache during the decoding stage (also known as the length of the key-value pair sequence that can be cached, where the key-value pair sequence refers to the key-value pairs corresponding to the token sequence generated during decoding). For example, if the cache size c determines that the number of key-value pairs that can be cached is 4, it means that the number of key-value pairs that can be cached during the decoding stage of the large language model is 4.

[0087] During the decoding phase of the large language model, the key-value pairs of the decoded tokens are cached. As the decoding time step advances, the number of cached key-value pairs gradually increases. Compression of the key-value cache data starts when the number of key-value pairs of the decoded tokens exceeds the number that the set cache size can hold, for example when that number reaches 5 and the set cache size holds 4. From this point on, each time a new token (and with it a new key-value pair) is generated at a decoding time step, one key-value pair needs to be selected and evicted to keep the key-value cache size constant.

[0088] To ensure that discarded key-value pairs do not affect the context information needed for the next decoding step, this application embodiment draws on the wavelet transform analysis results: as the position of a token's key-value pair approaches the end of the generated sequence, the information contributed by the token increases. Therefore, the key-value pairs selected for retention should be sparse on the left and dense on the right of the key-value pair sequence corresponding to the generated token sequence. To achieve this compression effect, this application embodiment adopts a tree-like hierarchical structure to organize the key-value pairs of the generated tokens. By using the tree-like hierarchical structure, smooth compression of the key-value cache data can be achieved.

[0089] In this embodiment, when the number of key-value pairs of the decoded token is greater than 4, compression of the key-value cache data begins. Compression is performed once when a new key-value pair of a new token is generated at each decoding time step, eliminating one key-value pair to maintain a fixed key-value cache size. This application determines an elimination range window for each decoding time step. This elimination range window slides as the decoding time step increases and periodically slides along the current key-value pair sequence from beginning to end. The key-value pairs selected by the elimination range window are treated as sibling nodes. The importance index of each sibling node is compared, and the node with the higher importance index is selected as the parent node. Key-value pairs corresponding to sibling nodes with lower importance indices are eliminated (if a key-value pair is already cached, it is removed; if the key-value pair corresponds to a newly generated token, it is not cached). The root node is selected as the target key-value pair, and the target key-value pair is cached in the key-value cache. These nodes constitute a key-value tree.

[0090] Optionally, the sliding step size of the elimination range window is the same as the length of the elimination range window. For example, if the elimination range window can select 2 KV pairs each time, then the sliding step size of the elimination range window is also 2 each time, thus ensuring that the elimination range window does not miss any KV pairs.

[0091] In one example, a sliding window for the elimination range can be implemented by assigning a variable `idx` to each decoding time step. For instance, `idx` indicates the starting position of the elimination range window in the current KV pair sequence, and `idx+1` indicates the ending position. As the compression steps increase, the variable `idx` periodically changes between 1 and the number of KV pairs that the cache can hold. For example, if the cache size is 4, when the number of KV pairs in the decoded token exceeds 4, KV cache compression begins. As the decoding time steps increase, the value of the variable `idx` periodically changes between 1, 2, 3, and 4. For example, when the decoding time step is 5, the value of variable idx is 1; when the decoding time step is 6, the value of variable idx is 2; when the decoding time step is 7, the value of variable idx is 3; when the decoding time step is 8, the value of variable idx is 4; and when the decoding time step is 9, the value of variable idx is 1. Thus, the value of variable idx changes periodically between 1, 2, 3, and 4 as the decoding time step increases. This moves the elimination range from the beginning to the end of the current KV pair sequence, prioritizing the removal of KV pairs corresponding to more distant tokens, while also paying attention to the KV pairs corresponding to the latest tokens, ensuring that the KV cache remains relevant to the current context.
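The periodic behaviour of `idx` can be sketched as a small function. This is a minimal sketch, assuming 1-based decoding time steps and a window of two positions as in the example above; the function name `eviction_window` and the closed-form expression are illustrative choices that reproduce the listed values, not a formula stated in the text.

```python
from typing import Tuple


def eviction_window(step: int, cache_size: int) -> Tuple[int, int]:
    """Return (idx, idx + 1), the 1-based positions covered by the elimination range.

    Positions refer to the current KV pair sequence, i.e. the cached pairs
    followed by the newly generated pair.
    """
    if step <= cache_size:
        raise ValueError("no eviction is needed while the cache is not yet full")
    idx = (step - cache_size - 1) % cache_size + 1  # cycles 1, 2, ..., cache_size
    return idx, idx + 1


# With a cache size of 4, steps 5 to 9 give window starts 1, 2, 3, 4, 1.
starts = [eviction_window(step, cache_size=4)[0] for step in range(5, 10)]
```

When `idx` equals the cache size, the second position of the window is the newly generated key-value pair, so the newest token also takes part in a comparison once per cycle.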

[0092] The importance index used to compare the key-value pairs of two sibling nodes can be the historical average attention weight of the tokens corresponding to the key-value pairs, that is, the mean of the attention weights the token has received in the attention calculations for the tokens generated after it during decoding. For example, in a token sequence of length 18, if the key-value pair corresponding to the first token has never been eliminated, then when the 18th token is generated, the first token has taken part in 17 attention calculations and has 17 attention weights. The average attention weight of the first token is the mean of these 17 attention weights. Using the average rather than the cumulative attention weight as the importance index avoids concentrating the finally selected key-value pairs at the beginning of the key-value sequence, which would lose important contextual information later on.

[0093] Of course, in some other examples, the importance metric for key-value pairs can be measured in other ways, such as the median of historical attention weights. For example, if a token has 17 attention weights in its historical generation process, the median of these 17 attention weights can be taken as the importance metric for the key-value pair corresponding to that token.
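The two importance metrics mentioned above, and the reason a cumulative score favours early tokens, can be illustrated as follows. This is a sketch only: the function name `importance` and all numeric weights are hypothetical values chosen for the illustration.

```python
from statistics import mean, median
from typing import List


def importance(history: List[float], metric: str = "mean") -> float:
    """Importance index of one token from its recorded attention weights."""
    if not history:
        raise ValueError("the token has no recorded attention weights yet")
    if metric == "mean":
        return mean(history)
    if metric == "median":
        return median(history)
    raise ValueError("unknown metric: " + metric)


# Hypothetical histories: an early token attended to 17 times with small weights,
# and a recent token attended to only twice but with larger weights.
early_token = [0.05] * 17
recent_token = [0.30, 0.30]

# A cumulative score ranks the early token higher simply because it is older.
cumulative_prefers_early = sum(early_token) > sum(recent_token)
# The average (or the median) ranks the recent token higher.
average_prefers_recent = importance(recent_token) > importance(early_token)
median_prefers_recent = importance(recent_token, "median") > importance(early_token, "median")
```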

[0094] The following example illustrates the practical implementation of the key-value pair cache data compression method provided in this application. Suppose the token sequence generated by the large language model during decoding is a sequence of length 18, "The sunset slowly over the horizon, casting vibrant hues of orange and pink across the evening sky", and the KV cache size is 4. In the first decoding time step, the large language model decodes and generates the first token "The" and caches its key-value pair in the key-value cache. In the second decoding time step, it generates the second token "sunset", caches the key-value pair of the second token, and records the attention weight of the first token "The" (i.e., the attention weight of "The" when attention is calculated for "sunset" in the second decoding time step) for the subsequent calculation of the average attention weight. In the third decoding time step, it generates the third token "slowly", caches the key-value pair of the third token, and records the attention weights of "The" and "sunset". The fourth decoding time step proceeds in the same way for the fourth token "over". Thus, when the fifth token "the" arrives, the key-value cache contains the key-value pairs corresponding to four tokens, "The", "sunset", "slowly", and "over", and the compression process begins. To determine the elimination range, a variable idx is assigned to each generation step, and the elimination range is defined by (idx, idx+1). Of the idx-th and (idx+1)-th tokens, the one with the lower average attention weight is removed. At the fifth step idx is 1, so the elimination range is ("The", "sunset"), and the first token "The", which has the lower average attention weight, is permanently removed from the cache. The cache now contains "sunset", "slowly", "over", and "the". The variable idx iterates through 1, 2, 3, and 4, moving the eviction range from left to right, prioritizing the removal of older tokens while still paying attention to the newest ones. This ensures that the cached KV pairs conform to the distribution characteristics shown in Figure 1(c), minimizing bias from highly concentrated areas and enhancing the ability of large language models to handle tasks requiring comprehensive context.

[0095] Figure 3 illustrates the compression process of the key-value pair cache data compression method provided in this application embodiment on an actual token sequence. As shown in Figure 3, at the fifth decoding time step, the new KV pair corresponding to the token "the" arrives, and the KV cache contains four tokens: "The", "sunset", "slowly", and "over", each corresponding to a KV pair (in the diagram, the KV pairs are represented by their tokens; in reality, the cache stores the KV pair corresponding to each token). Since the number of KV pairs now exceeds the cache's capacity of 4, compression of the KV cache data begins. At this point, the variable `idx` equals 1, indicating that the eviction window starts at the first position of the current KV pair sequence (i.e., the KV pairs corresponding to the tokens "The", "sunset", "slowly", "over", and "the"). The eviction window selects the tokens "The" and "sunset" and treats them as sibling nodes. Their average attention weights are compared: "The" has an average attention weight of 0.1, while "sunset" has an average attention weight of 0.2. "sunset" wins and becomes the parent node of "The" and "sunset"; the KV pair corresponding to "The" is evicted and removed from the KV cache, and the key-value pair corresponding to "the" is cached in the key-value cache. At this time, "sunset", "slowly", "over", and "the" have no parent nodes and are all root nodes. That is, there are four key-value trees, rooted at "sunset", "slowly", "over", and "the", respectively.

[0096] In the sixth decoding time step, the new key-value pair corresponding to the token "horizon" arrives. The key-value cache contains the key-value pairs corresponding to the four tokens that were not eliminated in the previous step: "sunset", "slowly", "over", and "the". At this time, the variable idx = 2, and the elimination range is the key-value pair corresponding to the second token "slowly" and the third token "over". The key-value pairs corresponding to the second token "slowly" and the third token "over" are designated as sibling nodes. The average attention weight of "slowly" is 0.3, and the average attention weight of "over" is 0.2. "slowly" wins, and the key-value pair corresponding to "slowly" is designated as the parent node. The key-value pair corresponding to "over" is eliminated and removed from the key-value cache. The key-value pair corresponding to "horizon" is cached in the key-value cache.

[0097] Thus, we can see that in this embodiment, when the number of KV pairs of the decoded token is greater than 4, the level of the corresponding KV tree will be calibrated and updated for each newly generated KV pair of the token. For example, in the fifth decoding time step, the leftmost KV tree changes from one layer to a two-layer structure, that is, the height of the KV tree increases, and the height changes from 1 to 2. This adaptability ensures that the cache remains relevant to the current context.
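The growth of a key-value tree in the fifth and sixth decoding time steps can be sketched with a small node type. This is an illustrative sketch: `KVNode`, `merge_siblings`, and `apply_step` are hypothetical names, nodes hold a token string rather than actual key and value tensors, ties are resolved in favour of the more recent sibling as an assumption, and the weight of 0.0 given to a newly arrived token is a placeholder that is never compared in these two steps.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class KVNode:
    token: str
    avg_weight: float
    children: List["KVNode"] = field(default_factory=list)

    def height(self) -> int:
        """A single node has height 1; a parent is one higher than its tallest child."""
        return 1 + max((child.height() for child in self.children), default=0)


def merge_siblings(left: KVNode, right: KVNode) -> KVNode:
    """The sibling with the higher average attention weight becomes the parent."""
    winner = left if left.avg_weight > right.avg_weight else right
    return KVNode(winner.token, winner.avg_weight, [left, right])


def apply_step(roots: List[KVNode], new_node: KVNode, idx: int) -> List[KVNode]:
    """Append the new root, then merge the roots at 1-based positions idx and idx + 1."""
    seq = roots + [new_node]
    seq[idx - 1:idx + 1] = [merge_siblings(seq[idx - 1], seq[idx])]
    return seq


# Average attention weights 0.1, 0.2, 0.3, 0.2 are the values given in the example.
roots = [KVNode("The", 0.1), KVNode("sunset", 0.2), KVNode("slowly", 0.3), KVNode("over", 0.2)]
roots = apply_step(roots, KVNode("the", 0.0), idx=1)      # step 5: "sunset" beats "The"
after_step_5 = [node.token for node in roots]
roots = apply_step(roots, KVNode("horizon", 0.0), idx=2)  # step 6: "slowly" beats "over"
after_step_6 = [node.token for node in roots]
heights = [node.height() for node in roots]
```

Only the root of each tree corresponds to a cached key-value pair; the children are kept here purely to show how the height of the leftmost tree changes from 1 to 2.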

[0098] Referring to step 17 in Figure 3, the cache state in the KV cache and the unfolding of the complete eviction process are shown. The leftmost KV tree supports the merging of tokens from different levels. That is, in Figure 3, the level of the subtree of "sunset" on the left side of the leftmost KV tree is 4, while the level of the subtree of "vibrant" on the right side is 3. This realizes the merging of tokens from the 4th level node and tokens from the 3rd level node, always maintaining a smooth transition on the right side.

[0099] As shown in step 17 of Figure 3, the distribution of the tokens that have not been eliminated is sparse on the left and dense on the right. On the left, among the 12 tokens from "The" to "orange", only one token, "sunset", has not been eliminated, giving a density of 1/12. On the right, among the 5 tokens from "and" to "evening", three tokens, "pink", "across", and "evening", have not been eliminated, giving a density of 3/5.

[0100] The pseudocode implementation of the key-value pair cache data compression method provided in this application embodiment is as follows:
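The decoding-stage procedure of paragraphs [0086] to [0092] can be sketched in Python as follows. This is a minimal illustration under stated assumptions rather than a reproduction of the listing: tokens are identified by their 1-based positions, the callback `attention` is a hypothetical stand-in for the model's attention computation and is assumed to return one weight per position in the current sequence (the cached positions plus the new one, including a weight for the new token itself), and ties evict the older token.

```python
from typing import Callable, Dict, List


def treekv_decode(num_steps: int, cache_size: int,
                  attention: Callable[[int, List[int]], List[float]]) -> List[int]:
    """Return the token positions whose key-value pairs remain cached after num_steps."""
    cache: List[int] = []          # positions currently held in the KV cache
    total: Dict[int, float] = {}   # sum of attention weights received per position
    count: Dict[int, int] = {}     # number of attention weights received per position
    for step in range(1, num_steps + 1):
        seq = cache + [step]       # current KV pair sequence, new token last
        for pos, weight in zip(seq, attention(step, seq)):
            total[pos] = total.get(pos, 0.0) + weight
            count[pos] = count.get(pos, 0) + 1
        if len(seq) > cache_size:
            # Elimination range (idx, idx + 1), cycling 1..cache_size over the steps.
            idx = (step - cache_size - 1) % cache_size + 1
            left, right = seq[idx - 1], seq[idx]
            left_avg = total[left] / count[left]
            right_avg = total[right] / count[right]
            loser = left if left_avg <= right_avg else right
            seq.remove(loser)      # evict, or never cache if the loser is the new token
            del total[loser], count[loser]
        cache = seq
    return cache


# Toy attention: every position always receives the same weight. Positions 1 to 4 use
# the averages from the example (0.1, 0.2, 0.3, 0.2); 5 and 6 are placeholder values.
toy_weight = {1: 0.1, 2: 0.2, 3: 0.3, 4: 0.2, 5: 0.25, 6: 0.25}
kept = treekv_decode(6, 4, lambda step, seq: [toy_weight[p] for p in seq])
```

With these toy weights the sketch evicts position 1 ("The") at step 5 and position 4 ("over") at step 6, matching the two steps walked through for Figure 3.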

[0101] The key-value pair cache data compression method provided in this application embodiment can also be applied to long text compression scenarios to achieve smooth compression of KV cache data in scenarios with long input sequences.

[0102] Long text compression refers to the pre-filling stage of a large language model, where key-value data of a long input sequence is compressed. For example, if the length of the input token sequence is 1 million tokens, this obviously exceeds the limit that a large language model can handle. In this case, the key-value pair cache data compression method provided in this application embodiment can be used to compress it to 2,000 tokens, and only the key-value pair data corresponding to these 2,000 tokens are cached, which greatly reduces the size of the key-value cache in the pre-filling stage.

[0103] The pre-filling stage and the decoding stage have different characteristics. The pre-filling stage processes the input sequence, so the length of the token sequence to be processed is known in advance. For example, if the input prompt sequence has a length of 1,000, the length of the token sequence to be compressed is known and static. In the decoding stage, by contrast, the decoded tokens arrive one by one and the length of the token sequence to be processed is unknown, so the elimination range needs to be determined dynamically at each time step. The pre-filling stage therefore does not need a sliding elimination range window and can instead use static elimination ranges that divide the sequence all at once.

[0104] For example, the input token sequence (i.e., the prompt) can be directly divided into different elimination ranges. Based on the wavelet transform analysis, the sizes of the elimination ranges can gradually decrease from the beginning to the end of the input token sequence. The key-value pairs corresponding to the tokens in each elimination range are treated as sibling nodes, with the one with the highest importance index serving as the parent node. The key-value pairs at the root nodes of the resulting key-value trees are selected and cached.

[0105] Furthermore, most context compression techniques rely on token-level selection and ignore the fact that important or irrelevant information is often spatially clustered. Selecting individual tokens can compromise both the integrity of the context and the computational speed. Therefore, this application employs a block-level elimination strategy in the context compression task, applying the algorithm to blocks rather than tokens. The last block of the prompt is used as an observation window to "query" the importance score of the input, thus avoiding full attention computation. The selected blocks and the observation window block together form a new KV cache for generation.

[0106] Figure 4 illustrates the compression process of the key-value pair cache data compression method provided in this application when applied to a long text compression scenario, using an actual token sequence. As shown in Figure 4, the input token sequence is “The moonlit painted the world in silver, enchanting hearts with its magic and wonder.”, with a length of 16. First, the input token sequence is divided into blocks. For example, every two adjacent tokens are grouped into one block, resulting in a total of 8 blocks: “The moonlit”, “painted the”, “world in”, “silver,”, “enchanting hearts”, “with its”, “magic and”, and “wonder.”. Operating on blocks rather than individual tokens further increases compression efficiency.

[0107] To avoid full attention calculation, in this embodiment, the last block is used as the observation window to calculate the average attention weight of each block. The average attention weight of each block is used as an indicator to measure the importance of each block and as the basis for eliminating blocks.

[0108] For example, attention weights are calculated for the tokens in each block relative to the tokens in the last block. Each token in each block gets 2 attention weights, so each block gets 4 attention weights. These 4 attention weights are averaged to get the average attention weight of the block. Taking block 1 as an example, block 1 includes two tokens, "The" and "moonlit". The attention weights of "The" are calculated for "wonder" and "." respectively, resulting in two attention weights for "The". The attention weights of "moonlit" are calculated for "wonder" and "." respectively, resulting in two attention weights for "moonlit". The four attention weights of "The" and "moonlit" are added together and divided by 4 to calculate the average attention weight of block 1, which is 0.25. Similarly, the average attention weights of the other blocks are calculated, as shown in Figure 4: the average attention weight of block 2 is 0.10, block 3 is 0.10, block 4 is 0.05, block 5 is 0.15, block 6 is 0.05, block 7 is 0.15, and block 8 is 0.15. Blocks 1 to 4 are divided into elimination range 1, blocks 5 and 6 into elimination range 2, block 7 into elimination range 3, and block 8 into elimination range 4. In elimination range 1, block 1 has the highest average attention weight, so block 1 is retained and blocks 2 to 4 are eliminated. In elimination range 2, block 5 has the highest average attention weight, so block 5 is retained and block 6 is eliminated. Elimination ranges 3 and 4 each contain only one block, so nothing is eliminated from them. Thus, blocks 1, 5, 7, and 8 are the blocks to be retained. The key-value pairs corresponding to the tokens in block 1, block 5, block 7, and block 8 are cached, thereby achieving effective compression of the key-value cache data in the pre-filling stage.
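The selection in this example can be reproduced with a short Python sketch. The weights and elimination ranges are taken from the example above; the function name `select_retained_blocks` and the tie-breaking rule (the earliest block wins a tie) are assumptions made only for illustration and are not stated in this application.

```python
def select_retained_blocks(avg_weights, elimination_ranges):
    """Keep the block with the highest average attention weight in each range.

    avg_weights: dict mapping block number -> average attention weight.
    elimination_ranges: list of lists of block numbers.
    Ties go to the earliest block in the range (an assumption).
    """
    retained = []
    for blocks in elimination_ranges:
        # max() returns the first maximal element, so earlier blocks win ties.
        best = max(blocks, key=lambda b: avg_weights[b])
        retained.append(best)
    return retained


# Values from the example (Figure 4).
avg_weights = {1: 0.25, 2: 0.10, 3: 0.10, 4: 0.05,
               5: 0.15, 6: 0.05, 7: 0.15, 8: 0.15}
elimination_ranges = [[1, 2, 3, 4], [5, 6], [7], [8]]

retained = select_retained_blocks(avg_weights, elimination_ranges)
print(retained)  # [1, 5, 7, 8]
```

Eight blocks are reduced to four, one per elimination range, which matches the retained set in the example.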

[0109] Figure 5 is a schematic diagram illustrating the implementation flow of a key-value pair cache data compression method provided in an embodiment of this application. This method can be executed by any device, equipment, platform, or device cluster with computing capabilities. This application embodiment does not specifically limit the specific computing device executing this method; a suitable computing device can be selected for execution as needed. For example, it can be implemented by a terminal; that is, the key-value pair cache data compression method provided in this application embodiment can be deployed as software on a terminal device with an AI model based on an attention mechanism, compressing the KV cache and reducing GPU memory usage. Alternatively, it can be implemented by a server, which deploys an AI model based on an attention mechanism (e.g., a large language model). When the AI model performs inference, the KV cache is compressed, reducing GPU memory usage. For ease of description, the following does not distinguish the executing entity and refers to it as the executing device. The specific implementation of the key-value pair cache data compression method provided in this application embodiment is described. As shown in Figure 5, the key-value pair cache data compression method provided in this application embodiment includes at least steps S501 to S505.

[0110] In step S501, KV pair data is obtained.

[0111] In this embodiment, the key-value pair data refers to the key-value pair data generated during attention calculation in the large language model inference process. For example, it could be the key-value pair data corresponding to each token in the prompt sequence obtained by attention calculation on the prompt sequence during the pre-filling stage, or the key-value pair data corresponding to the tokens generated during the decoding stage. The method of obtaining key-value pair data of tokens by performing attention calculation in the pre-filling or decoding stages of the large language model is a mature existing technology, and for the sake of simplicity, it will not be described in detail in this embodiment.

[0112] In step S502, the KV pair data is cached in the KV cache.

[0113] Generated key-value pairs are cached in a key-value cache to avoid recomputing the key-value pairs of previous tokens, thus accelerating attention processing and speeding up the inference process of the large language model. For example, if the large language model decodes and generates the tokens "The", "sunset", "slowly", and "over", the key-value pairs corresponding to these tokens are cached in the key-value cache. When the next token is generated, the key-value pairs of the previously generated tokens are retrieved from the key-value cache instead of being recomputed.

[0114] In step S503, it is determined that the number N of generated KV pairs is greater than the preset threshold M.

[0115] In this embodiment of the application, a parameter M is set, which indicates the number of KV pairs that can be cached in a fixed-size KV cache. For example, M is 4. If the KV pair of the fifth token is generated in the fifth decoding time step, it is determined that the number of generated KV pairs is greater than the preset threshold, and compression of the KV cache data begins.

[0116] In step S504, the target KV pair is determined from the current KV pair data based on M KV trees.

[0117] In this embodiment, the generated key-value pair data is organized using the layered structure of a key-value tree. The number of key-value pairs is gradually reduced at each higher level of the tree, achieving smooth compression of the cached data and effectively reducing the memory overhead of key-value pair caching. At the same time, the root node of the tree structure is used as the target key-value pair, so that although the number of key-value pairs generated by the large language model grows linearly, only a fixed-size cache space is required.

[0118] For details on constructing a KV tree and determining KV pairs from the current KV pair data using a KV tree, please refer to the corresponding description above. For the sake of brevity, it will not be repeated here.

[0119] In step S505, the target KV pair is stored in the KV cache, and non-target KV pairs are deleted from the N KV pairs.

[0120] The key-value pairs that are not eliminated (the target key-value pairs) remain cached in the key-value cache, and the eliminated key-value pairs are permanently removed from the key-value cache, thus achieving smooth and effective compression of the key-value cache of large language models and reducing memory usage.
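As a rough illustration of how steps S501 to S505 fit together, the following Python sketch keeps a cache of at most M entries and calls an eviction policy whenever the count exceeds M. The class `FixedSizeKVCache`, the `choose_victim` callback, and the token names `t1` to `t5` are hypothetical. The placeholder policy shown (evict the oldest entry) is deliberately not the tree-based policy of this application; it only marks where such a policy would plug in.

```python
class FixedSizeKVCache:
    """Toy cache holding at most `capacity` (M) key-value pairs."""

    def __init__(self, capacity, choose_victim):
        self.capacity = capacity            # preset threshold M
        self.choose_victim = choose_victim  # hypothetical eviction policy
        self.entries = []                   # list of (token, key, value)

    def add(self, token, key, value):
        # S501 / S502: obtain the new KV pair and cache it.
        self.entries.append((token, key, value))
        # S503: check whether the number of pairs N exceeds M.
        while len(self.entries) > self.capacity:
            # S504: the policy returns the index of a non-target pair.
            victim = self.choose_victim(self.entries)
            # S505: delete the non-target pair; the rest stay cached.
            del self.entries[victim]


# Placeholder policy (assumption): always evict the oldest entry.
cache = FixedSizeKVCache(capacity=4, choose_victim=lambda entries: 0)
for step, token in enumerate(["t1", "t2", "t3", "t4", "t5"]):
    cache.add(token, key=[float(step)], value=[float(step)])

tokens = [token for token, _, _ in cache.entries]
print(tokens)  # ['t2', 't3', 't4', 't5']
```

With M set to 4, the fifth token triggers compression, as in the threshold example for step S503, and the cache size stays fixed afterwards.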

[0121] To verify the compression effect of the TreeKV scheme provided in this application, the TreeKV scheme is compared with four baseline schemes: the efficient KV eviction strategies StreamingLLM, H2O, and TOVA, and a full attention method that caches all keys and values. These five schemes are evaluated using sequences with context lengths of 4k, 8k, and 16k, while the efficient methods maintain a cache size of only 1k. When the context length exceeds the pre-training context limit of the LLM, the TreeKV scheme of this application outperforms all baseline methods. The results are shown in Tables 1 and 2. The scheme of this application shows significant improvement at the 16k context length, surpassing the second-place TOVA by 3.6% and 1.1% on the PG19 and OpenWebText2 datasets, respectively, while achieving a 16x reduction in KV cache size within a 16k context.

[0122] Table 1

[0123] Table 1 shows the perplexity of different compression algorithms based on the Llama-2-7B model on the PG19 dataset at different context lengths. The cache size for all efficient methods is set to 1024.

[0124] Table 2

[0125] Table 2 shows the perplexity of different compression algorithms based on the Llama-2-7B model on the OpenWebText2 dataset at different context lengths. The cache size for all efficient methods is set to 1024.

[0126] This application embodiment further examines the longest length that TreeKV can handle to determine whether an LLM pre-trained with a 4k context window can effectively model language for “infinite” text. This application embodiment concatenates the first 13 books from the PG19 test set to create a 1M-length example, and evaluates generation on it using Llama-2-7B with StreamingLLM, H2O, TOVA, and TreeKV. Figure 6 shows the negative log-likelihood (NLL) curves for input lengths from 0.1M to 1M. These curves show that TOVA and H2O gradually degrade in performance when handling longer sequences, resulting in significantly higher NLLs than StreamingLLM. In contrast, TreeKV consistently outperforms all baseline methods, including StreamingLLM, demonstrating a superior ability to handle longer inputs.

[0127] Figure 6 plots negative log-likelihood against sequence length for four cache compression methods (TOVA, H2O, StreamingLLM, and TreeKV) applied to a 1M-length sequence created by concatenating the first 13 books from the PG19 test set.

[0128] The embodiments of this application were also validated on the LongBench-E benchmark set. LongBench is a multi-task benchmark covering a variety of long-context tasks to evaluate a model's ability to handle extended text input. LongBench-E is a subset of LongBench, providing a balanced length distribution with similar data volumes across the 0-4k, 4-8k, and 8k+ length ranges, making it well-suited for evaluating model performance at different context lengths. LongBench-E contains 13 tasks categorized into 6 classes: single-document question answering, multi-document question answering, summarization, few-shot learning, synthetic tasks, and code completion. The average length of all subsets is approximately 11k. The TreeKV scheme is compared with two baseline methods: the efficient KV eviction policy H2O, and a full attention method that caches all keys and values.

[0129] The results are shown in Table 3, which summarizes the performance metrics for all tasks and lengths in the LongBench-E benchmark. The TreeKV scheme provided in this application demonstrates a significant improvement over the baseline method H2O. Overall, the TreeKV scheme outperforms the efficient KV elimination strategy H2O by an average absolute difference of 2.24, highlighting its ability to retain relevant information from extended text. However, the TreeKV scheme does not outperform the full attention model, leaving room for further improvement under minimal resource requirements.

[0130] Table 3 compares the performance of H2O and TreeKV on the LongBench-E benchmark using Llama-3-8B-Instruct as the base model. Results are reported for lengths of 0-4k, 4-8k, and 8k+, with a cache size of 2048. BR represents the best budget ratio for each length range. The better result between H2O and TreeKV is highlighted in bold. TreeKV outperformed H2O in 11 out of 13 tasks. Results for the full attention model are shown at the top of the table.

[0131] Table 3

[0132] An important question is which component of the TreeKV scheme provided in this application is the key one: the tree structure or the attention weight-based token selection mechanism. To investigate this, the approach was modified to always select the leftmost token within the elimination range, rather than the token with the highest attention weight. This change makes it possible to analyze the impact of the tree structure by itself, focusing on how the token hierarchy affects model performance without the changes introduced by weight differences.

[0133] Figure 7 shows the log-mean perplexity curves for three schemes, H2O, TreeKV, and TreeKV_Select_Left_Token, on the first book in PG19, which has a length of 65k tokens. The three methods are: 1) H2O, which greedily selects tokens based on cumulative attention weights; 2) TreeKV, which uses average attention weights to guide elimination while the tree structure ties elimination to temporal proximity; and 3) TreeKV_Select_Left_Token, a variant of TreeKV that always keeps the leftmost token without using attention scores. The results show that consistently selecting the leftmost token results in minimal change in perplexity, indicating that the tree structure plays a crucial role in shaping model decisions. In summary, the research in this application confirms that the tree structure is an important component of the scheme provided in this application.

[0134] The key-value pair cache data compression method provided in this application can be applied to large language models in various scenarios, such as long text generation. These scenarios include:

[0135] Creative Writing: Long text generation models such as LongWriter can help writers and content creators quickly generate story outlines, character descriptions, and plot developments, sparking creative inspiration. These models can generate coherent text exceeding 10,000 words, breaking through the limitations of previous AI models in terms of text length.

[0136] Casual Chat: The GPT2-based Chinese casual chatbot is trained on a GPT2 model to generate natural and fluent conversations. This chatbot can be used in various scenarios such as customer service, online chat, and language learning.

[0137] Content creation: Use long text generation models to generate blog posts, news reports, marketing copy, etc., to improve the efficiency of content production.

[0138] In the education sector, educators can use long text generation models to assist teaching, such as generating teaching materials, course content, or learning guides, to help students better understand and master knowledge points.

[0139] News media: News organizations can use long text generation technology to quickly generate news reports, in-depth analysis articles, or feature reports, thereby improving the efficiency of news production.

[0140] Technical Documentation Writing: In the technical field, long text generation models can help technical personnel write technical documents, development manuals, and API documentation, improving the efficiency and quality of document writing.

[0141] Scriptwriting: In the film and television production field, long text generation technology can assist screenwriters in generating initial drafts of scripts, providing creative inspiration and suggestions for plot development.

[0142] Game Development: In game development, long text generation technology can be used to generate in-game dialogue, story background, and task descriptions, enhancing the game's immersion and interactivity.

[0143] Automated report generation: In the fields of business intelligence and data analytics, long text generation technology can automatically generate reports and analysis summaries, helping decision-makers quickly obtain key information.

[0144] Another example is in long text compression scenarios. These include the following scenarios:

[0145] Long document comprehension: When processing long documents, such as academic papers, legal documents, or technical manuals, long text compression techniques can help models process and understand the main content of the document within a limited context. Through methods such as keyword extraction, sentence fusion, or text summarization, long text can be converted into shorter text, thereby retaining key information and reducing text length, enabling models to process long documents more effectively.

[0146] Webpage understanding: Webpage content typically contains a large amount of information. Long text compression techniques can summarize or extract key parts of webpage content, allowing users to quickly grasp the core content. This is especially useful for search engine optimization (SEO) and content recommendation systems, as they need to quickly understand webpage content and provide relevant search results or recommendations.

[0147] RAG (Retrieval-Augmented Generation) is a model that combines retrieval and generation. It assists in generation by retrieving relevant fragments, compensating for the shortcomings of language models in long text modeling. Long text compression techniques can reduce the model's memory burden. By compressing text information, RAG models can retrieve and generate content more efficiently, improving the performance of long text tasks.

[0148] Long text summarization: Long text summarization is a direct application of long text compression technology. By extracting key sentences and using graph models or Transformer-based models, summaries of long texts can be generated, which is very useful for content summaries, news aggregations, or rapid reading of research papers.

[0149] Long text classification: In long text classification tasks, such as sentiment analysis or topic classification, long text compression techniques can help models maintain performance when processing text beyond their original processing capabilities. By compressing text, models can process long texts more efficiently while retaining enough information for accurate classification.

[0150] Based on the same concept as the aforementioned embodiment of a key-value pair cache data compression method, this application also provides a key-value pair cache data compression device 800. This device 800 can be deployed on a server or terminal device to provide KV cache compression services, significantly reducing memory overhead. The key-value pair cache data compression device 800 includes units or modules for implementing the various steps in the key-value pair cache data compression method shown in Figures 3-5.

[0151] Figure 8 is a schematic diagram of a compression device for key-value pair cache data provided in an embodiment of this application. As shown in Figure 8, the compression device 800 for key-value pair cache data includes an acquisition module 801, a caching module 802, a first determination module 803, a second determination module 804, and a compression module 805. The acquisition module 801 acquires key-value pair data, which includes N key-value pairs. Each key-value pair includes a key vector and a value vector obtained by an AI model based on an attention mechanism during inference, where N is a positive integer. The caching module 802 caches the key-value pair data in a cache space. The first determination module 803 determines that N is greater than a preset threshold M, where M is a positive integer, determined based on a preset size of the cache space. The second determination module 804 determines target key-value pairs from the key-value pair data based on M key-value trees. The key-value trees are hierarchical tree structures, each containing several nodes, where each node represents one or more key-value pairs. The key-value pair located at the root node of the key-value tree is the target key-value pair. The compression module 805 stores the target key-value pairs in the cache space and deletes non-target key-value pairs from the N key-value pairs.

[0152] Optionally, the attention-based AI model is a large language model.

[0153] In another possible implementation, the key-value pair data includes the key-value pairs corresponding to each token obtained by the large language model during the attention calculation in the decoding stage; the compression device 800 for key-value pair cache data provided in this application also includes a first construction module 806, which is used to construct a key-value tree based on the importance index of each key-value pair in the N key-value pairs and the elimination range sliding window. The elimination range sliding window slides periodically from beginning to end along the current key-value pair sequence as the decoding time step increases. The elimination range sliding window is used to select at least two key-value pairs from the current key-value pair sequence. The at least two key-value pairs are used as sibling nodes, and the key-value pair with the largest importance index among the at least two key-value pairs is used as the parent node. The current key-value pair sequence includes the key-value pairs already cached in the current cache space and the key-value pairs corresponding to the tokens decoded in the current decoding time step.

[0154] In another possible implementation, the first construction module 806 is specifically used to: determine the position of the elimination range sliding window in the current key-value pair sequence at the current decoding time step; designate at least two key-value pairs selected by the elimination range sliding window from the current key-value pair sequence as sibling nodes; compare the importance index of each key-value pair in the at least two key-value pairs, and designate the key-value pair with the largest importance index as the parent node; and designate the key-value pairs in the current key-value pair sequence that do not have a parent node as the root nodes of M key-value trees respectively.

[0155] In another possible implementation, the importance index of each key-value pair is determined based on the historical average attention weight of the token corresponding to that key-value pair.
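One plausible reading of "historical average attention weight" is the mean of the attention weights a cached token has received over the decoding steps since it entered the cache. The following Python sketch tracks that mean incrementally. The class name `RunningAverage` and this interpretation are assumptions; the application does not specify how the average is maintained.

```python
class RunningAverage:
    """Incrementally tracks the mean of the attention weights seen so far."""

    def __init__(self):
        self.total = 0.0
        self.count = 0

    def update(self, weight):
        # Called once per decoding step with the weight this token received.
        self.total += weight
        self.count += 1

    @property
    def value(self):
        # Importance index under the assumed definition; 0.0 before any update.
        return self.total / self.count if self.count else 0.0


importance = RunningAverage()
for weight in [0.5, 0.25, 0.75]:   # made-up weights over three decoding steps
    importance.update(weight)

print(importance.value)  # 0.5
```

Keeping only a running total and a count means the history of individual weights does not have to be stored alongside each cached key-value pair.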

[0156] In another possible implementation, the density of target key-value pairs in the left key-value tree of the M key-value trees is less than the density of target key-value pairs in the right key-value tree, and the key-value pairs on the left key-value tree are generated earlier than the key-value pairs on the right key-value tree.

[0157] In another possible implementation, the first construction module 806 is also used to correct the hierarchical structure of the key-value tree based on the currently generated key-value pairs, enabling it to dynamically adapt to changes in the data. This adaptability ensures that the cache remains relevant to the current context.

[0158] In another possible implementation, the first elimination range sliding window is determined based on the variable idx, which indicates that the starting position of the first elimination range sliding window is the idx-th key-value pair in the current key-value pair sequence, and the ending position of the first elimination range sliding window is the (idx+1)-th key-value pair in the current key-value pair sequence; the variable idx changes periodically between 1 and M as the decoding time step increases.
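The decoding-stage behavior described in the preceding paragraphs can be sketched in Python as follows. The window covers the idx-th and (idx+1)-th entries (1-based) of the current sequence, the entry with the larger importance index is kept as the parent, and the other is evicted. The function names, the schedule in `next_idx` (advance by one position per step and wrap from M back to 1), the tie-breaking rule, and the static importance values are all assumptions for illustration; the application only states that idx changes periodically between 1 and M.

```python
def evict_with_sliding_window(sequence, importance, idx):
    """Evict one of the two entries covered by the window [idx, idx+1] (1-based).

    The entry with the larger importance index survives (it is the parent);
    on a tie the earlier entry is kept (an assumption).
    """
    left, right = idx - 1, idx  # convert to 0-based positions
    if importance[sequence[left]] >= importance[sequence[right]]:
        loser = right
    else:
        loser = left
    return sequence[:loser] + sequence[loser + 1:]


def next_idx(idx, m):
    """Advance idx cyclically through 1..M (assumed schedule)."""
    return idx % m + 1


M = 4
# Made-up, static importance values; in practice they change every step.
importance = {"a": 0.4, "b": 0.1, "c": 0.3, "d": 0.2, "e": 0.5, "f": 0.05}
cache = ["a", "b", "c", "d"]  # cache already holds M entries
idx = 1
for new_token in ["e", "f"]:
    # Current sequence = cached entries + the newly decoded token (M + 1 entries).
    cache = evict_with_sliding_window(cache + [new_token], importance, idx)
    idx = next_idx(idx, M)

print(cache)  # ['a', 'c', 'e', 'f']
```

Because the current sequence holds M + 1 entries at eviction time, every idx from 1 to M yields a valid two-entry window, and exactly one entry is removed per decoding step.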

[0159] In another possible implementation, the key-value pair data includes the key-value pairs corresponding to each token obtained by the large language model during the attention calculation in the pre-filling stage; the compression device 800 for key-value pair cache data provided in this application also includes a second construction module 807, which is used to: divide the N key-value pairs into multiple key-value pair blocks, each key-value pair block including key-value pairs corresponding to adjacent tokens in the input sequence; and construct the key-value tree based on the importance index of each key-value pair block in the multiple key-value pair blocks and M elimination range windows, wherein the M elimination range windows are used to divide the key-value pair block sequence into different key-value pair block groups, different key-value pair blocks in each key-value pair block group are used as sibling nodes, and the key-value pair block with the largest importance index among the sibling nodes is used as the parent node.

[0160] In another possible implementation, a specific implementation of constructing a key-value tree based on the importance index of each key-value pair block in the multiple key-value pair blocks and M elimination range windows is as follows: using the M elimination range windows, the key-value pair block sequence is divided into M key-value pair block groups; different key-value pair blocks in each key-value pair block group are taken as sibling nodes; the importance index of each key-value pair block in the sibling nodes is compared, and the key-value pair block with the largest importance index is taken as the parent node; the key-value pair blocks without parent nodes in the M key-value pair block groups are taken as the root nodes of the M key-value trees respectively.

[0161] In another possible implementation, the importance index of each key-value pair block is determined based on the observation information of each key-value pair block obtained by observing the last key-value pair block in the key-value pair block sequence as the observation window.

[0162] In another possible implementation, the observation information of each key-value pair block includes multiple attention weights for that key-value pair block. The multiple attention weights are the attention weights of the token corresponding to each key-value pair in the key-value pair block relative to the token corresponding to each key-value pair in the last key-value pair block. The importance index of each key-value pair block is determined based on the average attention weight of the key-value pair block, which is obtained by averaging the multiple attention weights of the key-value pair block.
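The averaging described here can be sketched in plain Python. The function name `block_average_attention`, the row and column layout of `obs_attention`, and the numeric values are assumptions chosen for illustration; the values are made up and do not come from any figure in this application.

```python
def block_average_attention(obs_attention, block_size):
    """Average attention weight per block, using the last block as the observation window.

    obs_attention[q][k] is the attention weight between observation-window
    token q (a token of the last block) and token k of the input sequence.
    Returns one average per block of `block_size` adjacent tokens.
    """
    num_tokens = len(obs_attention[0])
    averages = []
    for start in range(0, num_tokens, block_size):
        cols = range(start, min(start + block_size, num_tokens))
        # Collect every weight of every token in this block.
        weights = [row[k] for row in obs_attention for k in cols]
        averages.append(sum(weights) / len(weights))
    return averages


# Two observation tokens, four input tokens, blocks of two tokens each.
obs_attention = [
    [0.5, 0.25, 0.125, 0.125],
    [0.25, 0.25, 0.25, 0.25],
]
block_averages = block_average_attention(obs_attention, block_size=2)
print(block_averages)  # [0.3125, 0.1875]
```

With two observation tokens and two tokens per block, each block average is taken over four weights, as in the worked example where four weights are summed and divided by 4. Only the rows of the observation window are needed, so the full attention matrix does not have to be computed.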

[0163] In another possible implementation, the structure of the key-value tree is determined based on the analysis results obtained by performing wavelet transform analysis on the product of the attention weights and value vectors of each generated token. The analysis results indicate the contribution of the key-value pairs of each generated token to the newly generated token.

[0164] The key-value pair cache data compression device 800 according to the embodiments of this application can correspondingly execute the methods described in the embodiments of this application, and the above and other operations and/or functions of each module in the key-value pair cache data compression device 800 respectively implement the corresponding processes of the methods in Figures 3-5. For the sake of brevity, they will not be described again here.

[0165] This application also provides a computing device including at least one processor, a memory, and a communication interface, wherein the processor is used to execute the method described in Figures 3-5.

[0166] Figure 9 is a schematic diagram of the structure of the computing device provided in the embodiment of this application.

[0167] As shown in Figure 9, the computing device 900 includes at least one processor 901, a memory 902, and a communication interface 903. The processor 901, memory 902, and communication interface 903 are communicatively connected, which can be achieved via a wired (e.g., bus) or wireless connection. The communication interface 903 is used to send data to and/or receive data from other devices. The memory 902 stores computer instructions, which the processor 901 executes to perform the method described in the preceding method embodiments, thereby compressing the KV cache data and reducing video memory overhead.

[0168] It should be understood that, in the embodiments of this application, the processor 901 may be a central processing unit (CPU), or it may be other general-purpose processors, digital signal processors (DSPs), application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc. A general-purpose processor may be a microprocessor or any conventional processor.

[0169] The memory 902 may include read-only memory and random access memory, and provides instructions and data to the processor 901. The memory 902 may also include non-volatile random access memory. Optionally, the random access memory may be, for example, high bandwidth memory (HBM).

[0170] The memory 902 can be volatile memory or non-volatile memory, or it can include both. The non-volatile memory can be read-only memory (ROM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), or flash memory. The volatile memory can be random access memory (RAM), which is used as an external cache. By way of example, but not limitation, many forms of RAM are available, such as static random access memory (SRAM), dynamic random access memory (DRAM), synchronous dynamic random access memory (SDRAM), double data rate synchronous dynamic random access memory (DDR SDRAM), enhanced synchronous dynamic random access memory (ESDRAM), synchronous linked dynamic random access memory (SLDRAM), and direct rambus RAM (DR RAM).

[0171] It should be understood that the computing device 900 according to the embodiments of this application can execute the method shown in Figures 3-5 of the embodiments of this application. For a detailed description of the implementation of the method, please refer to the above text. For the sake of brevity, it will not be repeated here.

[0172] Embodiments of this application provide a computer-readable storage medium having a computer program stored thereon, wherein when the computer program is executed by a processor, the aforementioned method is implemented.

[0173] An embodiment of this application provides a chip including at least one processor and an interface, wherein the at least one processor obtains program instructions or data through the interface; the at least one processor is used to execute the program instructions to implement the method mentioned above.

[0174] Embodiments of this application provide a computer program or computer program product that includes instructions that, when executed, cause a computer to perform the methods mentioned above.

[0175] Those skilled in the art will further recognize that the units and algorithm steps of the various examples described in conjunction with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of both. To clearly illustrate the interchangeability of hardware and software, the components and steps of the various examples have been generally described in terms of functionality in the foregoing description. Whether these functions are implemented in hardware or software depends on the specific application and design constraints of the technical solution. Those skilled in the art can use different methods to implement the described functions for each specific application, but such implementation should not be considered beyond the scope of this application.

[0176] The steps of the methods or algorithms described in conjunction with the embodiments disclosed herein can be implemented using hardware, a software module executed by a processor, or a combination of both. The software module can be located in random access memory (RAM), main memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, hard disk, removable disk, CD-ROM, or any other form of storage medium known in the art.

[0177] The specific embodiments described above further illustrate the purpose, technical solution, and beneficial effects of this application. It should be understood that the above description is only a specific embodiment of this application and is not intended to limit the scope of protection of this application. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of this application should be included within the scope of protection of this application.

Claims

1. A method for compressing key-value pair cache data, characterized in that the method includes: acquiring key-value pair data, which includes N key-value pairs, wherein each key-value pair includes a key vector and a value vector obtained by an AI model based on an attention mechanism during inference, and N is a positive integer; caching the key-value pair data in a cache space; determining that N is greater than a preset threshold M, wherein the preset threshold M is determined based on a preset size of the cache space, and M is a positive integer; determining a target key-value pair from the key-value pair data based on M key-value trees, wherein the key-value trees are tree-like hierarchical structures, each including several nodes, each node represents one or more key-value pairs from the N key-value pairs, and the key-value pair located at the root node of a key-value tree is the target key-value pair; and storing the target key-value pair in the cache space, and deleting non-target key-value pairs from the N key-value pairs.

2. The method according to claim 1, characterized in that the AI model based on the attention mechanism is a large language model.

3. The method according to claim 2, characterized in that the key-value pair data includes the key-value pairs corresponding to each token obtained by the large language model during the attention calculation in the decoding stage; and the step of determining the target key-value pair from the key-value pair data based on the key-value tree also includes: constructing the key-value tree based on the importance index of each of the N key-value pairs and an elimination range sliding window, wherein the elimination range sliding window slides periodically from beginning to end along the current key-value pair sequence as the decoding time step increases; the elimination range sliding window is used to select at least two key-value pairs from the current key-value pair sequence; the at least two key-value pairs are used as sibling nodes, and the key-value pair with the largest importance index among the at least two key-value pairs is used as the parent node; and the current key-value pair sequence includes the key-value pairs already cached in the current cache space and the key-value pairs corresponding to the tokens decoded in the current decoding time step.

4. The method according to claim 3, characterized in that constructing the key-value trees based on the importance index of each of the N key-value pairs and the elimination range sliding window comprises: determining the position of the elimination range sliding window in the current key-value pair sequence at the current decoding time step; selecting, with the elimination range sliding window, at least two key-value pairs from the current key-value pair sequence as sibling nodes; comparing the importance indices of the at least two key-value pairs and taking the key-value pair with the largest importance index as the parent node; and using the key-value pairs in the current key-value pair sequence that have no parent node as the respective root nodes of the M key-value trees.
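
The construction steps of claim 4 can be illustrated with a minimal Python sketch. It is not the applicant's implementation: the names `KVNode` and `evict_one`, the two-wide window, the 0-based `idx`, and the tie-breaking rule (keep the earlier pair) are all assumptions made for illustration. Each node stores only a token position and an importance index rather than real key and value vectors.

```python
from dataclasses import dataclass, field


@dataclass
class KVNode:
    """One key-value pair, identified here only by its token position."""
    position: int
    importance: float
    children: list = field(default_factory=list)


def evict_one(sequence, idx):
    """Apply a two-wide elimination window starting at 0-based index idx.

    The two selected nodes become siblings. A parent node carrying the pair
    with the larger importance index replaces them in the sequence, so the
    sequence shrinks by one. Ties keep the earlier pair (an assumption).
    """
    if not 0 <= idx < len(sequence) - 1:
        raise ValueError("window must cover two pairs inside the sequence")
    left, right = sequence[idx], sequence[idx + 1]
    winner = left if left.importance >= right.importance else right
    parent = KVNode(winner.position, winner.importance, children=[left, right])
    return sequence[:idx] + [parent] + sequence[idx + 2:]


# M = 3 cached pairs plus one newly decoded pair give a sequence of M + 1 = 4.
seq = [KVNode(0, 0.5), KVNode(1, 0.1), KVNode(2, 0.9), KVNode(3, 0.3)]
roots = evict_one(seq, idx=0)
print([n.position for n in roots])  # [0, 2, 3]
```

After the step, every node left in the sequence has no parent, so the M remaining nodes are the roots whose key-value pairs stay in the cache; the pair at position 1 survives only as a leaf of the tree and would be deleted.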

5. The method according to claim 3 or 4, characterized in that the importance index of each key-value pair is determined based on the historical average attention weight of the token corresponding to that key-value pair.
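
One way to maintain a historical average attention weight per token is an incremental mean updated at every decoding step. The sketch below assumes the average is taken over decoding steps for a single attention head; the claim does not state how heads or layers are combined, and the class name `ImportanceTracker` is hypothetical.

```python
class ImportanceTracker:
    """Running mean of the attention weight each cached token has received."""

    def __init__(self):
        self.mean = {}   # token position -> average attention weight so far
        self.count = {}  # token position -> number of steps observed

    def update(self, weights):
        """weights maps token position -> attention weight at this step."""
        for pos, w in weights.items():
            n = self.count.get(pos, 0) + 1
            m = self.mean.get(pos, 0.0)
            # Incremental mean: new_mean = old_mean + (w - old_mean) / n
            self.mean[pos] = m + (w - m) / n
            self.count[pos] = n


tracker = ImportanceTracker()
tracker.update({0: 0.6, 1: 0.4})          # decoding step 1
tracker.update({0: 0.2, 1: 0.3, 2: 0.5})  # decoding step 2, new token at 2
```

The incremental form avoids storing the full attention history: only one mean and one counter per cached token are kept.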

6. The method according to any one of claims 3-5, characterized in that, among the M key-value trees, the density of target key-value pairs in a left key-value tree is less than the density of target key-value pairs in a right key-value tree, and the key-value pairs in the left key-value tree are generated earlier than the key-value pairs in the right key-value tree.

7. The method according to any one of claims 3-6, characterized in that the elimination range sliding window is determined based on a variable idx, wherein idx indicates that the starting position of the elimination range sliding window is the idx-th key-value pair in the current key-value pair sequence and that the ending position of the elimination range sliding window is the (idx+1)-th key-value pair in the current key-value pair sequence; and idx changes periodically between 1 and M as the decoding time step increases.
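
The periodic change of idx can be sketched as a wrap-around counter. The claim only states that idx varies periodically between 1 and M; advancing by exactly one position per decoding step, as below, is an assumption, and `window_bounds` is a hypothetical helper name.

```python
def window_bounds(step, m):
    """Return the 1-based (start, end) positions of the elimination window.

    step is the 0-based decoding time step and m is the cache threshold M.
    The start position idx cycles 1, 2, ..., M, 1, 2, ... and the end
    position is always idx + 1.
    """
    if m < 1:
        raise ValueError("M must be a positive integer")
    idx = step % m + 1
    return idx, idx + 1


# With M = 4 the window start cycles through 1, 2, 3, 4 and wraps around.
starts = [window_bounds(t, 4)[0] for t in range(6)]
print(starts)  # [1, 2, 3, 4, 1, 2]
```

Because the current sequence holds M + 1 pairs during a decoding step, the end position idx + 1 never exceeds the sequence length.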

8. The method according to claim 2, characterized in that the key-value pair data comprises the key-value pairs corresponding to each token obtained by the large language model during attention calculation in the prefill stage; and determining the target key-value pairs from the key-value pair data based on the M key-value trees further comprises: dividing the N key-value pairs into a plurality of key-value pair blocks, wherein each key-value pair block comprises the key-value pairs corresponding to adjacent tokens in the input sequence; and constructing the key-value trees based on an importance index of each of the plurality of key-value pair blocks and M elimination range windows, wherein the M elimination range windows are used to divide the key-value pair block sequence into different key-value pair block groups, the different key-value pair blocks in each key-value pair block group serve as sibling nodes, and the key-value pair block with the largest importance index among the sibling nodes serves as the parent node.

9. The method according to claim 8, characterized in that constructing the key-value trees based on the importance index of each of the plurality of key-value pair blocks and the M elimination range windows comprises: dividing the key-value pair block sequence into M key-value pair block groups using the M elimination range windows; taking the different key-value pair blocks in each key-value pair block group as sibling nodes; comparing the importance indices of the key-value pair blocks among the sibling nodes and taking the key-value pair block with the largest importance index as the parent node; and using the key-value pair blocks without parent nodes in the M key-value pair block groups as the respective root nodes of the M key-value trees.
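
The block-level steps of claims 8 and 9 can be sketched as follows. The sketch assumes fixed-size blocks and contiguous, near-equal block groups; neither the block size nor the way the M elimination range windows are placed is specified in the claims, and the helper names `split_into_blocks` and `group_roots` are hypothetical.

```python
def split_into_blocks(n_pairs, block_size):
    """Split positions 0..n_pairs-1 into blocks of adjacent positions."""
    return [list(range(start, min(start + block_size, n_pairs)))
            for start in range(0, n_pairs, block_size)]


def group_roots(block_importance, m):
    """Divide the block sequence into m contiguous groups and return, for
    each group, the index of the block with the largest importance index.

    Blocks in a group act as siblings; the winner is the group's root.
    Ties keep the earliest block (an assumption).
    """
    n = len(block_importance)
    if not 1 <= m <= n:
        raise ValueError("need 1 <= m <= number of blocks")
    roots = []
    for g in range(m):
        start, stop = g * n // m, (g + 1) * n // m
        roots.append(max(range(start, stop),
                         key=block_importance.__getitem__))
    return roots


blocks = split_into_blocks(n_pairs=12, block_size=2)  # 6 blocks of 2 pairs
kept = group_roots([0.1, 0.7, 0.3, 0.2, 0.9, 0.4], m=3)
print(kept)  # [1, 2, 4]
```

Here blocks 1, 2 and 4 are the roots of the three trees, so only the key-value pairs in `blocks[1]`, `blocks[2]` and `blocks[4]` would remain in the cache.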

10. The method according to claim 8 or 9, wherein the importance index of each key-value pair block is determined based on observation information of that key-value pair block, the observation information being obtained by using the last key-value pair block in the key-value pair block sequence as an observation window.

11. The method according to claim 10, characterized in that the observation information of each key-value pair block comprises a plurality of attention weights of that key-value pair block, the plurality of attention weights being the attention weights of the tokens corresponding to the key-value pairs in that key-value pair block relative to the tokens corresponding to the key-value pairs in the last key-value pair block; and the importance index of each key-value pair block is determined based on the average attention weight of that key-value pair block, which is obtained by averaging the plurality of attention weights of that key-value pair block.
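
The block importance described in claims 10 and 11 can be sketched as a mean over a slice of the attention matrix. The sketch assumes a single attention head and a matrix `attn` whose rows are the queries of the tokens in the last block (the observation window) and whose columns are all keys; the function name `block_importance` and the fixed block size are illustrative assumptions.

```python
def block_importance(attn, block_size):
    """Average attention weight that the observation window gives each block.

    attn[q][k] is the attention weight from query q (a token in the last
    block) to key k. Keys are grouped into consecutive blocks of block_size.
    """
    n_keys = len(attn[0])
    scores = []
    for start in range(0, n_keys, block_size):
        stop = min(start + block_size, n_keys)
        weights = [row[k] for row in attn for k in range(start, stop)]
        scores.append(sum(weights) / len(weights))
    return scores


# Two observation-window queries attending to four keys (two blocks of two).
attn = [
    [0.1, 0.1, 0.3, 0.5],
    [0.2, 0.2, 0.2, 0.4],
]
scores = block_importance(attn, block_size=2)
```

In this example the second block receives the higher average weight (about 0.35 against about 0.15), so it would win a comparison against the first block when the trees are built.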

12. The method according to any one of claims 2-11, characterized in that the structure of the key-value tree is determined based on analysis results obtained by performing wavelet transform analysis on the product of the attention weight and the value vector of each generated token, and the analysis results indicate the contribution of the key-value pair data of each generated token to a newly generated token.

13. An apparatus for compressing key-value pair cache data, characterized in that the apparatus comprises: an acquisition module, used to acquire key-value pair data comprising N key-value pairs, wherein each key-value pair comprises a key vector and a value vector obtained by an attention-based AI model during inference, and N is a positive integer; a caching module, used to cache the key-value pair data in a cache space; a first determining module, used to determine that N is greater than a preset threshold M, wherein M is determined based on a preset size of the cache space and is a positive integer; a second determining module, used to determine target key-value pairs from the key-value pair data based on M key-value trees, wherein each key-value tree is a tree-like hierarchical structure comprising several nodes, each node represents one or more key-value pairs, and the key-value pair located at the root node of each key-value tree is a target key-value pair; and a compression module, used to store the target key-value pairs in the cache space and delete the non-target key-value pairs from the N key-value pairs.

14. The apparatus according to claim 13, characterized in that the attention-based AI model is a large language model.

15. The apparatus according to claim 14, characterized in that the key-value pair data comprises the key-value pairs corresponding to each token obtained by the large language model during attention calculation in the decoding stage; and the apparatus further comprises a first construction module, used to construct the key-value trees based on an importance index of each of the N key-value pairs and an elimination range sliding window, wherein the elimination range sliding window slides periodically from the beginning to the end of a current key-value pair sequence as the decoding time step increases; the elimination range sliding window is used to select at least two key-value pairs from the current key-value pair sequence; the at least two key-value pairs serve as sibling nodes, and the key-value pair with the largest importance index among them serves as the parent node; and the current key-value pair sequence comprises the key-value pairs already cached in the cache space and the key-value pairs corresponding to the tokens decoded at the current decoding time step.

16. [Amended according to Rule 26, 22.08.2025] The apparatus according to claim 15, characterized in that the first construction module is specifically used to: determine the position of the elimination range sliding window in the current key-value pair sequence at the current decoding time step; select, with the elimination range sliding window, at least two key-value pairs from the current key-value pair sequence as sibling nodes; compare the importance indices of the at least two key-value pairs and take the key-value pair with the largest importance index as the parent node; and use the key-value pairs in the current key-value pair sequence that have no parent node as the respective root nodes of the M key-value trees.

17. The apparatus according to claim 14, characterized in that the key-value pair data comprises the key-value pairs corresponding to each token obtained by the large language model during attention calculation in the prefill stage; and the apparatus further comprises a second construction module, used to: divide the N key-value pairs into a plurality of key-value pair blocks, wherein each key-value pair block comprises the key-value pairs corresponding to adjacent tokens in the input sequence; and construct the key-value trees based on an importance index of each of the plurality of key-value pair blocks and M elimination range windows, wherein the M elimination range windows are used to divide the key-value pair block sequence into different key-value pair block groups, the different key-value pair blocks in each key-value pair block group serve as sibling nodes, and the key-value pair block with the largest importance index among the sibling nodes serves as the parent node.

18. The apparatus according to claim 17, characterized in that constructing the key-value trees based on the importance index of each of the plurality of key-value pair blocks and the M elimination range windows comprises: dividing the key-value pair block sequence into M key-value pair block groups using the M elimination range windows; taking the different key-value pair blocks in each key-value pair block group as sibling nodes; comparing the importance indices of the key-value pair blocks among the sibling nodes and taking the key-value pair block with the largest importance index as the parent node; and using the key-value pair blocks without parent nodes in the M key-value pair block groups as the respective root nodes of the M key-value trees.

19. A computing device comprising a memory and a processor, characterized in that the memory stores instructions that, when executed by the processor, cause the method according to any one of claims 1-12 to be implemented.

20. A computer-readable storage medium having a computer program stored thereon, characterized in that the computer program, when executed by a processor, causes the method according to any one of claims 1-12 to be implemented.