A Key-Value Caching Compression Method Based on Importance Awareness, Dynamic Hierarchy, and Clustering Fusion

By combining global importance awareness and dynamic hierarchical clustering, the problem of excessive memory overhead in long text reasoning of large language models is solved, and efficient compression of KV cache is achieved, maintaining the accuracy and computational efficiency of long-range context understanding.

CN122309754A · Pending · Publication Date: 2026-06-30 · DALIAN UNIV OF TECH

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
DALIAN UNIV OF TECH
Filing Date
2026-03-27
Publication Date
2026-06-30

AI Technical Summary

Technical Problem

During long-text inference, large language models incur excessive GPU memory overhead because the key-value cache grows linearly with sequence length, which limits the model's concurrent processing capability. Existing methods cannot effectively reduce the size of the key-value cache while preserving the accuracy of long-range context understanding.

Method used

The method uses importance-aware dynamic hierarchical partitioning and clustering fusion. It computes a token importance score by accumulating global attention, divides tokens into core-retention, to-be-merged, and discard sets using dynamic thresholds, performs global semantic clustering without positional restrictions followed by adaptive Gaussian-weighted fusion, and combines sliding-window buffering with dynamic importance updates to compress the KV cache efficiently.

Benefits of technology

Under limited GPU memory conditions, it effectively preserves the core semantics of long-range contexts, maintains high fidelity and computational efficiency in long text reasoning, and outperforms existing methods.



Abstract

This invention belongs to the field of large language model inference optimization and discloses a key-value (KV) cache compression method based on importance-aware dynamic hierarchical and clustering fusion. By combining importance-aware eviction with global semantic clustering fusion, it addresses the excessive GPU memory overhead of long-text inference in autoregressive large language models based on the Transformer architecture, achieving high-fidelity long-text inference under limited GPU memory. The global semantic clustering algorithm captures cross-paragraph semantic relationships in long contexts, compressing redundant background while preserving core semantics. This overcomes two obstacles of existing approaches: merging methods struggle to achieve global semantic awareness, and static strategies cannot adapt to dynamic changes. Through sliding-window incremental merging, newly generated tokens are buffered temporarily and then merged by similarity retrieval, which maintains local contextual coherence in the decoding stage, avoids information loss from premature compression, and improves overall long-text inference performance while balancing storage efficiency and semantic fidelity.

Description

Technical Field

[0001] This invention relates to the field of large language model inference optimization technology, and in particular to a key-value cache compression method based on importance-aware dynamic hierarchical and clustering fusion.

Background Technology

[0002] In recent years, Large Language Models (LLMs) have made groundbreaking progress in various natural language processing tasks. Models based on the Transformer architecture have continued to expand in size, and their ability to handle long contexts has significantly improved. However, this performance improvement is accompanied by a sharp increase in computational resource consumption, especially in the autoregressive decoding stage, where the key and value states of historical lexical units must be cached to form a KV cache, thus avoiding the time overhead of repeated computation. The size of the KV cache is linearly positively correlated with the length of the input sequence. When processing long texts or multi-turn dialogues, the ever-growing KV cache quickly consumes GPU memory, limiting the model's concurrent processing capabilities and even leading to memory overflow errors. How to effectively reduce the size of the KV cache while ensuring the model's accuracy in understanding long contexts has become a key issue for efficient inference in large language models. This section will systematically review existing research from two perspectives: an overview of large language model compression methods and large model compression methods based on KV caches, and analyze the limitations of current methods.

[0003] (1) Overview of large language model compression methods; Compression techniques for large language models aim to reduce the storage and computational overhead of the model while maintaining its original performance as much as possible, and are a key means to promote the deployment of large models in resource-constrained environments. Existing mainstream compression methods mainly fall into four technical paths: model pruning, model quantization, knowledge distillation, and low-rank decomposition. Model pruning reduces the model size by removing unimportant weights or structural units, and can be divided into unstructured pruning and structured pruning. Unstructured pruning performs sparsity processing at the granularity of individual weights. Although it can achieve high compression ratios, the resulting sparse patterns require specialized hardware acceleration. Structured pruning removes the overall structure, such as attention heads, feedforward network layers, or complete Transformer layers, maintaining the original regularity of the model and making it easier to achieve acceleration on general-purpose hardware. For example, LLM-Pruner uses a loss-sensitivity-based evaluation method to identify and remove redundant structures, while Sheared LLaMA induces structural sparsity by introducing regularization terms, thereby achieving regular compression of the model.

[0004] Model quantization techniques reduce storage and computational overhead by lowering the numerical precision of parameters and activation values. Existing research shows that the weight and activation distributions of large language models are heavy-tailed: the vast majority of values are concentrated in a narrow range, while a few outliers have a decisive impact on the model output. To address this characteristic, researchers have proposed several mixed-precision quantization strategies that compress ordinary values while maintaining high precision for outliers. GPTQ uses second-order Hessian matrix information to iteratively compensate for quantization errors, AWQ protects important weight channels through activation-aware scaling factors, and SmoothQuant transfers the quantization difficulty from the activations to the weight side, thereby maintaining precision under low-bit quantization.

[0005] Knowledge distillation obtains a compressed model by having a small student model mimic the output behavior of a large teacher model. In the field of language modeling, work such as MiniLLM has shown that a sequence-level knowledge distillation strategy can significantly reduce the size of the student model while maintaining the quality of its generation. Low-rank decomposition exploits the low-rank property of the weight matrix, decomposing it into a product of several smaller matrices to reduce the number of parameters. Since the singular values of the weight matrices in Transformer layers typically decay rapidly, retaining only the first k principal singular values is sufficient to approximately restore the main information of the original matrix. The above compression methods reduce the storage and computational requirements of the model from different perspectives, but they mainly act on the model parameters themselves and pay less attention to the KV cache that is generated dynamically during inference.

[0006] (2) Overview of large model compression methods based on KV caching; The key-value (KV) cache is a crucial component of autoregressive decoding in the Transformer architecture. By storing the key-value pairs of historical tokens, it avoids redundant computation and significantly improves inference efficiency. However, as sequence length increases, the storage pressure of the KV cache becomes increasingly prominent and turns into a major bottleneck for the throughput of large model services. To alleviate this pressure, researchers have explored KV cache compression from multiple angles. Following the classification used in the existing literature, KV cache compression strategies fall into five categories: token eviction methods, merging-based methods, quantization-based methods, sharing-based methods, and attention head reduction methods.

[0007] Token eviction methods evaluate the importance of each token and directly remove lower-scoring tokens from the cache. The core of these methods lies in designing effective importance metrics. Early research used attention scores within local windows as the criterion, assuming that historical tokens receiving high attention in the current generation step are more important. The H2O method filters tokens by global cumulative attention scores, retaining those with high cumulative attention over the history and discarding the rest. SnapKV combines a local window attention mechanism to retain important tokens together with their adjacent local context, which maintains semantic coherence. StreamingLLM, designed for streaming long-text generation, retains the attention-sink tokens at the start of the sequence and the most recently generated local window of tokens, which keeps model performance stable in long-text processing. These methods are computationally efficient and simple to implement. However, direct truncation easily loses fine-grained contextual information and implicit cross-paragraph dependencies in long texts, and in long-context tasks that require precise retrieval this can cause a significant drop in model performance.

[0008] Merging-based methods attempt to mitigate the information loss caused by direct eviction through feature fusion. The CaM method merges the value states of the tokens to be evicted into the retained tokens, preserving the contribution of the discarded tokens through a weighted average. The EMS method combines global and local importance scores to dynamically adjust the ratio between eviction and merging, balancing cache overhead against generation accuracy. However, existing merging strategies often rely on fixed compression ratios or are limited to physically adjacent local windows, and they cannot dynamically perceive globally semantically equivalent tokens in long-range contexts. Because they do not distinguish critical information from ordinary redundant information, indiscriminate merging of local tokens easily blurs core semantics and introduces feature noise.

[0009] Quantization-based methods compress storage space by performing low-bit quantization on the key-value cache. These methods draw on techniques accumulated from model quantization, compressing key-value vectors represented by 16-bit or 32-bit floating-point numbers into 8-bit, 4-bit, or even lower integer representations. The advantage of quantization methods is their predictable compression rate and regular computational process, but their effectiveness is limited by the impact of numerical precision loss on attention computation. Sharing-based methods utilize the reusability of key-value pairs from repeated terms or similar contexts to reduce cache size, for example, reusing historical caches in multi-turn dialogues where the same prefix appears multiple times. Attention head reduction methods identify redundant heads in multi-head attention mechanisms and remove their corresponding key-value caches entirely. Both of these methods exhibit good compression performance in specific scenarios, but their generalization ability needs improvement.

[0010] (3) Limitations of existing methods; Despite the progress made by the aforementioned KV cache compression methods, existing research still has the following limitations. First, eviction-based methods are prone to losing fine-grained information when GPU memory is limited, while merging-based methods struggle to achieve global semantic awareness. Second, most methods employ static compression strategies, which makes it difficult to adapt to the dynamic changes in token importance during long-text generation. Third, existing methods focus primarily on the compression ratio itself and give insufficient consideration to the computational overhead and generalization ability of the compressed cache. To address these issues, this invention proposes a dynamic KV cache compression method that combines importance-aware eviction with global similarity clustering and merging. By dynamically evaluating token importance, employing position-independent semantic clustering, and applying an adaptive fusion strategy, the method achieves a high compression ratio while effectively preserving core semantics and long-range contextual coherence.

Summary of the Invention

[0011] Current key-value cache compression methods mainly rely on token eviction driven by heuristic rules or on feature merging within local windows, and they fail to balance fine-grained information preservation against global semantic fusion in long-range contexts. This invention aims to overcome the excessive memory overhead caused by the linear growth of the key-value cache during long-text inference in large language models. Specifically, for autoregressive large language models based on the Transformer architecture, it proposes a dynamic key-value cache compression method that combines importance-aware eviction with global semantic clustering to achieve high-fidelity long-text inference under limited memory.

[0012] A key-value cache compression method based on importance-aware dynamic hierarchical and clustering fusion includes the following steps: S1: Select the target large language model, determine the model configuration and input context, and generate the original key-value cache; This invention is applicable to various autoregressive large language models based on the Transformer decoder architecture, including but not limited to the Llama series and the Mistral series. The model is loaded through a deep learning framework, and the input sequence consists of the initial prompt tokens and the subsequently generated tokens. In the pre-filling stage, the model performs a forward pass over the input sequence to produce the key and value states of each attention head in each layer, forming the original key-value cache. Given an input sequence of $T$ tokens, for an attention head in the $l$-th layer ($1 \le l \le L$) of the model, the original cached key and value sets can be represented as: $$K^{(l)} = [k_1, k_2, \dots, k_T] \in \mathbb{R}^{T \times d_k}, \quad V^{(l)} = [v_1, v_2, \dots, v_T] \in \mathbb{R}^{T \times d_v}$$ where $k_i$ and $v_i$ are the key and value vectors of the $i$-th token, and $d_k$ and $d_v$ are the hidden dimensions of keys and values, respectively. As autoregressive generation progresses, the sequence length $T$ keeps increasing, and the storage required by the key-value cache grows linearly with it.
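The linear growth described in S1 can be made concrete with a little arithmetic. The following is a minimal sketch, not part of the patent text: the function name `kv_cache_bytes` is hypothetical, and the default values assume the publicly documented Llama-2-7B configuration (32 layers, 32 key-value heads, head dimension 128) with FP16 storage.

```python
def kv_cache_bytes(seq_len, n_layers=32, n_kv_heads=32, head_dim=128,
                   bytes_per_elem=2, batch_size=1):
    """Bytes needed to hold the keys and values of `seq_len` tokens."""
    # Factor 2: one tensor for keys and one for values in every layer.
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return per_token * seq_len * batch_size


if __name__ == "__main__":
    for seq_len in (1024, 4096, 32768):
        gib = kv_cache_bytes(seq_len) / 2**30
        print(f"T={seq_len:>6}: {gib:.2f} GiB")
```

Under these assumed defaults each token costs 512 KiB, so the cache size scales in direct proportion to the sequence length and to the batch size.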

[0013] S2: Calculate token importance scores based on global attention accumulation; To evaluate the contribution of historical tokens to subsequent text generation, this invention employs a normalized importance evaluation mechanism based on global attention accumulation. For any token $i$ ($1 \le i \le T$) in the input sequence, the attention scores it receives across all subsequent generation steps are summed and divided by the number of subsequent tokens to eliminate positional bias, which defines its global importance score $I_i$: $$I_i = \frac{1}{T - i} \sum_{j=i+1}^{T} A_{j,i}$$ where $A_{j,i}$ is the attention weight the model assigns to the $i$-th historical token when generating the $j$-th token. This score reflects the global semantic value of the token in the long-range context. To reduce the computational overhead of attention in the pre-filling stage, the global attention distribution can be approximated by uniformly sampling probe tokens, with a sampling ratio of 10%.
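The scoring step in S2 can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the patented implementation: `importance_scores` is a hypothetical name, `attn[j][i]` is assumed to hold the weight that query step `j` assigns to token `i` in a causal attention matrix, indices are zero-based, and a token with no later (sampled) query step is given a score of 0.0, a choice the source does not specify.

```python
import random


def importance_scores(attn, sample_ratio=1.0, seed=0):
    """Average attention each token receives from later query steps."""
    n_tokens = len(attn)
    rows = list(range(n_tokens))
    if sample_ratio < 1.0:
        # Probe only a subset of query steps to approximate the distribution.
        rng = random.Random(seed)
        n_probe = max(1, round(sample_ratio * n_tokens))
        rows = sorted(rng.sample(rows, n_probe))
    scores = []
    for i in range(n_tokens):
        later = [j for j in rows if j > i]
        if not later:
            scores.append(0.0)  # assumption: no later probe, no evidence
        else:
            # Dividing by the number of later steps removes positional bias.
            scores.append(sum(attn[j][i] for j in later) / len(later))
    return scores
```

Without the division, early tokens would accumulate more attention simply because more query steps can see them; normalizing by the count of later steps puts all positions on the same scale.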

[0014] S3: Achieve a ternary partition of tokens through dynamic threshold segmentation; Based on the statistical distribution of the global importance scores $I_i$ calculated in S2, this invention employs a dynamic threshold strategy to partition the original index set into three parts. Specifically, it calculates the mean $\mu$ and standard deviation $\sigma$ of the global importance scores of all tokens, and introduces threshold coefficients $\alpha$ and $\beta$ to set dual thresholds: $$\tau_{high} = \mu + \alpha\sigma, \quad \tau_{low} = \mu - \beta\sigma$$ Based on these thresholds, the original sequence is divided into the following three mutually exclusive subsets: (1) Core retention set $\mathcal{C} = \{\, i \mid I_i \ge \tau_{high} \,\}$: contains tokens of high importance in the global semantics; their key-value states are fully preserved in the cache after compression.

[0015] (2) To-be-merged set $\mathcal{M} = \{\, i \mid \tau_{low} \le I_i < \tau_{high} \,\}$: contains moderately important tokens that are neither core semantics nor completely redundant. These tokens take part in the subsequent clustering and fusion operations, which reduce the number of cached entries.

[0016] (3) Discard set $\mathcal{D} = \{\, i \mid I_i < \tau_{low} \,\}$: contains low-importance tokens, whose key-value states are released from the cache to reduce memory overhead.

[0017] The threshold coefficients $\alpha$ and $\beta$ can be determined by grid search on a small validation set, with typical values ranging from 0.3 to 0.8.
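The three-way split in S3 can be sketched in a few lines. The sketch below assumes the dual thresholds take the form mean plus `alpha` standard deviations and mean minus `beta` standard deviations, and that boundary values fall into the higher set; neither detail is confirmed by the surviving text. The name `partition_tokens` is hypothetical.

```python
import statistics


def partition_tokens(scores, alpha=0.5, beta=0.5):
    """Split token indices into core, to-be-merged, and discard sets."""
    mu = statistics.fmean(scores)
    sigma = statistics.pstdev(scores)  # population standard deviation
    tau_high = mu + alpha * sigma      # assumed form of the upper threshold
    tau_low = mu - beta * sigma        # assumed form of the lower threshold
    core = [i for i, s in enumerate(scores) if s >= tau_high]
    merge = [i for i, s in enumerate(scores) if tau_low <= s < tau_high]
    drop = [i for i, s in enumerate(scores) if s < tau_low]
    return core, merge, drop
```

Because both thresholds are derived from the score distribution of the current sequence, the sizes of the three sets adapt to the input instead of following a fixed ratio.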

[0018] S4: Perform global semantic clustering without positional constraints on the to-be-merged set; Traditional merging methods are often limited to locally adjacent windows in terms of physical position, which makes it difficult to capture semantic relationships across paragraphs. For the to-be-merged set $\mathcal{M}$, this invention proposes a bottom-up hierarchical clustering algorithm without positional constraints. First, the algorithm computes the cosine similarity between all pairs of key vectors in $\mathcal{M}$ to construct a similarity matrix $S \in \mathbb{R}^{|\mathcal{M}| \times |\mathcal{M}|}$, in which $S_{ij}$ is the cosine similarity between $k_i$ and $k_j$. It then iteratively searches $S$ for the maximum similarity $s_{max}$. If $s_{max}$ is greater than the preset merging threshold $\theta$, the corresponding clusters are merged and the cluster representative vector is updated. This repeats until no more cluster pairs can be merged, which yields a semantic cluster set $\mathcal{G} = \{G_1, \dots, G_m\}$ that partitions $\mathcal{M}$.

[0019] The merging threshold $\theta$ controls the granularity of clustering: too high a value leads to insufficient clustering, while too low a value may incorrectly merge semantically unrelated tokens. A typical value range is 0.65 to 0.85.
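The position-free clustering of S4 can be illustrated with a small, unoptimized sketch. It assumes that the representative of a merged cluster is the plain average of the two representatives being merged, as the embodiment later states, and it recomputes pairwise similarities on every pass instead of maintaining the similarity matrix incrementally. The names `cosine` and `cluster_keys` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def cluster_keys(keys, theta=0.75):
    """Greedy bottom-up clustering of key vectors, ignoring positions."""
    clusters = [{"members": [i], "rep": list(k)} for i, k in enumerate(keys)]
    while len(clusters) > 1:
        best, pair = -2.0, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                sim = cosine(clusters[a]["rep"], clusters[b]["rep"])
                if sim > best:
                    best, pair = sim, (a, b)
        if best <= theta:
            break  # no remaining pair is similar enough to merge
        a, b = pair
        merged = {
            "members": clusters[a]["members"] + clusters[b]["members"],
            # Representative: average of the two merged representatives.
            "rep": [(x + y) / 2 for x, y in zip(clusters[a]["rep"],
                                                clusters[b]["rep"])],
        }
        clusters = [c for idx, c in enumerate(clusters) if idx not in pair]
        clusters.append(merged)
    return clusters
```

Since similarity is computed on key vectors only, two tokens that are far apart in the sequence can end up in the same cluster, which is the property that distinguishes this step from window-based merging. A production version would operate on tensors on the GPU; this sketch only shows the control flow.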

[0020] S5: Introduce hub selection and joint importance scoring for secondary screening; To reduce noise introduced during clustering, a hub selection and secondary screening mechanism is applied before fusion. For each cluster $G_c$, the token with the highest score is selected as the hub (pivot) token $p_c$. For every other token $j$ in the cluster, a joint importance score $J_j$ with respect to the hub is calculated; this score combines the attention association strength with the Euclidean distance in feature space. A screening threshold $\gamma$ is set, and only the tokens that pass the screening, denoted $\tilde{G}_c$, participate in the subsequent fusion. The screening threshold $\gamma$ controls the proportion of tokens retained within a cluster and typically ranges from 0.15 to 0.25.

[0021] S6: Use an adaptive Gaussian weighting strategy for intra-cluster feature fusion; In the feature fusion stage, this invention employs an adaptive Gaussian weighting strategy that performs a weighted fusion of the key-value states in the screened set $\tilde{G}_c$, centered on the hub token $p_c$. For each token $j$ in the set, a Gaussian kernel weight $w_j$ and a normalized weight $\tilde{w}_j = w_j / \sum_{m \in \tilde{G}_c} w_m$ are calculated, where the adaptive bandwidth parameter $\sigma_c$ of the kernel is computed dynamically from the token distribution within the cluster. The result is a single compressed key-value pair $(\hat{k}_c, \hat{v}_c)$ representing the cluster. In the fusion formula, a scalar multiplier is applied so that the cluster approximately retains its counting effect from the original sequence in subsequent Softmax attention calculations, which preserves the equivalence of the attention distribution.
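One possible reading of the Gaussian fusion in S6 is sketched below. Several details are assumptions made only for illustration, because the original formulas did not survive: the kernel is taken over the Euclidean distance between each key and the hub key, the adaptive bandwidth is taken to be the mean of those distances, and the count-preserving scalar mentioned in the text is omitted. The name `gaussian_fuse` is hypothetical.

```python
import math


def gaussian_fuse(keys, values, pivot):
    """Fuse a cluster into one key-value pair, weighted around the hub."""
    dists = [math.dist(k, keys[pivot]) for k in keys]
    sigma = sum(dists) / len(dists)   # assumed bandwidth: mean distance
    if sigma == 0.0:
        sigma = 1.0                   # all keys coincide with the hub
    raw = [math.exp(-d * d / (2 * sigma * sigma)) for d in dists]
    total = sum(raw)
    weights = [r / total for r in raw]  # normalized, sums to 1
    fused_key = [sum(w * k[d] for w, k in zip(weights, keys))
                 for d in range(len(keys[0]))]
    fused_value = [sum(w * v[d] for w, v in zip(weights, values))
                   for d in range(len(values[0]))]
    return fused_key, fused_value, weights
```

With this weighting the hub always receives the largest weight, and tokens far from the hub in key space contribute less to the fused representation, which limits the noise that outlying members can introduce.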

[0022] S7: Construct a sliding-window local incremental merging strategy to handle tokens newly generated during decoding; After entering the autoregressive decoding stage, the model must manage the cache entries of newly generated tokens. Compressing new tokens immediately may disrupt the coherence of the local context. Therefore, this invention introduces a sliding window $W$ with fixed capacity $w$. Newly generated tokens first enter the window for temporary storage; when the window is full, the oldest token $t_{old}$ is popped and processed.

[0023] For the popped token $t_{old}$, the algorithm computes the cosine similarity between its key vector and the representative vectors of the clusters in the current compressed cache, and extracts the $n$ most similar clusters as a candidate set. It then determines the maximum cosine similarity $s^*$ in the candidate set and the corresponding best-matching cluster $G^*$.

[0024] If $s^*$ exceeds the merging threshold, incremental merging is triggered, and the key-value state of the cluster is updated by a weighted average based on the number of tokens already in the cluster: $$\hat{k}^* \leftarrow \frac{|P^*|\,\hat{k}^* + k_{old}}{|P^*| + 1}, \quad \hat{v}^* \leftarrow \frac{|P^*|\,\hat{v}^* + v_{old}}{|P^*| + 1}$$ where $P^*$ is the set of original positions covered by the cluster. Otherwise, the token is considered to have distinct semantic features, and the algorithm creates a new, independent cluster for it.

[0025] The sliding window capacity $w$ controls the memory allocation ratio between local buffering and global compression and typically ranges from 16 to 64. The number of candidate clusters retrieved, $n$, typically ranges from 5 to 15.
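The decoding-time behaviour of S7 can be sketched with a small class. This is a simplified illustration under assumptions: the class name `IncrementalMerger` and its fields are hypothetical, a token leaves the window once the window holds more entries than its capacity, the merge test uses a strict comparison against a threshold `theta`, and candidate retrieval is an exact sort (a real system might use an approximate top-n search).

```python
import math
from collections import deque


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


class IncrementalMerger:
    def __init__(self, capacity=32, top_n=10, theta=0.75):
        self.capacity, self.top_n, self.theta = capacity, top_n, theta
        self.window = deque()   # recent tokens kept uncompressed
        self.clusters = []      # each: {"key", "value", "count"}

    def push(self, key, value):
        """Buffer a new token; process the oldest one on overflow."""
        self.window.append((list(key), list(value)))
        if len(self.window) > self.capacity:
            self._absorb(*self.window.popleft())

    def _absorb(self, key, value):
        # Retrieve the top-n most similar clusters, then take the best one.
        ranked = sorted(self.clusters,
                        key=lambda c: cosine(key, c["key"]),
                        reverse=True)[: self.top_n]
        if ranked and cosine(key, ranked[0]["key"]) > self.theta:
            best, n = ranked[0], ranked[0]["count"]
            # Count-weighted running average of the cluster state.
            best["key"] = [(n * a + b) / (n + 1)
                           for a, b in zip(best["key"], key)]
            best["value"] = [(n * a + b) / (n + 1)
                             for a, b in zip(best["value"], value)]
            best["count"] = n + 1
        else:
            self.clusters.append({"key": key, "value": value, "count": 1})
```

Keeping the most recent tokens uncompressed means the model always attends to an exact local context, and a token is only folded into a cluster after it has left that window.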

[0026] S8: Design a dynamic budget control and importance score update mechanism; As decoding progresses, the size of the compressed cache continues to grow. To keep GPU memory within a preset budget $B$, this invention designs a replacement mechanism based on dynamically updated global importance. The semantic value of a cluster changes with the generated context. Therefore, when a new token $t$ is generated, the algorithm computes the sum of the attention weights that this token assigns to all original positions $P_c$ covered by cluster $G_c$ and uses it as the importance increment of that cluster: $$I_c \leftarrow I_c + \sum_{p \in P_c} A_{t,p}$$ To reduce the computational overhead of inference over long sequences, this update can be batched and performed once every $\Delta$ steps, where $\Delta$ typically ranges from 8 to 16. When the number of clusters exceeds $B$, the system locates and removes the cluster with the lowest global importance score and releases the corresponding GPU memory.

[0027] The budget $B$ can be set according to the available GPU memory and application requirements. Typical values are 256 to 1024 semantic clusters, corresponding to a compression rate of 50% to 90% of the original key-value cache.
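The update-then-evict logic of S8 can be sketched as below. The data layout is an assumption made for illustration: each cluster is a dictionary holding the set of original positions it covers and its running importance score, and `attn_by_position` maps an original position to the attention weight it received (possibly already summed over a batch of decoding steps). The name `update_and_evict` is hypothetical.

```python
def update_and_evict(clusters, attn_by_position, budget):
    """Add attention mass to each cluster, then evict down to `budget`."""
    for cluster in clusters:
        # Importance increment: attention on all positions the cluster covers.
        cluster["score"] += sum(attn_by_position.get(p, 0.0)
                                for p in cluster["positions"])
    while len(clusters) > budget:
        victim = min(range(len(clusters)),
                     key=lambda i: clusters[i]["score"])
        clusters.pop(victim)  # release the least important cluster
    return clusters
```

Updating scores before evicting lets a cluster that was unimportant during pre-filling survive if the recently generated context has started attending to it.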

[0028] S9: Define compression ratio and fidelity constraints; The compressed key-value cache is denoted $(\tilde{K}, \tilde{V})$. The compression ratio is defined as the ratio of the compressed cache size to the original cache size: $$\rho = \frac{|\tilde{K}|}{|K|}$$ To ensure the accuracy of long-range contextual reasoning, compression must satisfy a fidelity constraint: for any subsequent query $q$, the output $\tilde{o}$ computed with the compressed cache $\tilde{K}$ and $\tilde{V}$ must differ from the output $o$ computed with the original full cache by no more than a tolerance: $$\| \tilde{o} - o \| \le \epsilon$$ where $\epsilon$ is a small positive number that characterizes the performance-loss threshold; it can be calibrated by computing the average relative error on a validation set.
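The two quantities in S9 can be measured with a short sketch. It assumes single-query scaled dot-product attention and uses the relative Euclidean error as the fidelity measure, which is one reasonable reading of "average relative error" rather than a confirmed definition. The function names are hypothetical.

```python
import math


def attend(query, keys, values):
    """Single-query scaled dot-product attention over a small cache."""
    scale = 1.0 / math.sqrt(len(query))
    logits = [scale * sum(a * b for a, b in zip(query, k)) for k in keys]
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]  # numerically stable softmax
    norm = sum(exps)
    return [sum(e / norm * v[d] for e, v in zip(exps, values))
            for d in range(len(values[0]))]


def relative_error(reference, approx):
    """Euclidean distance between outputs, relative to the reference."""
    diff = math.sqrt(sum((a - b) ** 2 for a, b in zip(reference, approx)))
    base = math.sqrt(sum(a * a for a in reference))
    return diff / base if base else diff


def compression_ratio(n_compressed, n_original):
    """Compressed cache size divided by original cache size."""
    return n_compressed / n_original
```

As a usage example, merging two identical cache entries into one without any count correction reduces the ratio to 2/3 but changes the attention output for a query aligned with the merged key, which illustrates why S6 applies a scalar to retain the counting effect of a cluster.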

[0029] S10: Design a multi-task joint optimization objective; The method of this invention is a training-free compression strategy that does not require optimizing model parameters through gradient descent. The effectiveness of the algorithm is evaluated by performance metrics on an experimental validation set, including but not limited to: (1) language modeling perplexity; (2) accuracy, F1 score, and ROUGE score on long-text benchmark tasks; (3) end-to-end inference latency and throughput; and (4) peak memory usage.

[0030] Compared with existing technologies, the advantages of this invention are as follows. It classifies tokens of different value in long contexts in a fine-grained way through global importance assessment and dynamic thresholding. It captures semantic relationships across paragraphs through position-unrestricted global semantic clustering and adaptive Gaussian-weighted fusion, preserving core semantics while compressing redundant background. It maintains the coherence of local contexts during the decoding stage through sliding-window buffering and incremental merging, avoiding information loss caused by premature compression. It achieves a dynamic balance between compression ratio and generation quality under limited GPU memory through dynamic importance updates and budget control. Experimental results show that on the LongBench multi-task benchmark and the Needle-in-a-Haystack precise retrieval task, the method of this invention achieves better performance than existing baseline methods under the same GPU memory budget, providing a solution for long-text inference that balances storage efficiency and semantic fidelity.

Attached Figure Description

[0031] Figure 1 is a diagram of the overall framework of the importance-aware dynamic hierarchical and clustering fusion key-value cache compression method of this invention.

[0032] Figure 2 is a schematic diagram of the global importance assessment and dynamic threshold partitioning of this invention.

[0033] Figure 3 is a flowchart of the global semantic clustering and adaptive Gaussian-weighted fusion process of the present invention.

[0034] Figure 4 is a schematic diagram of the local incremental merging strategy based on a sliding window according to the present invention.

[0035] Figure 5 is a schematic diagram of the dynamic budget control and importance score update mechanism of the present invention.

Detailed Implementation

[0036] The technical solution of the present invention will now be described in detail with reference to the accompanying drawings. Those skilled in the art should understand that the embodiments described herein are for illustrative purposes only and are not intended to limit the scope of protection of the present invention. Any modifications, equivalent substitutions, and improvements made within the spirit and principles of the present invention should be included within the scope of protection of the present invention.

[0037] This invention aims to efficiently compress the key-value cache of large language models in the long text reasoning process by adopting a dynamic hierarchical and cluster fusion method based on importance awareness, and to verify the technology on the LongBench long text benchmark using the Llama-2-7B-chat model as an example.

[0038] A key-value cache compression method based on importance-aware dynamic hierarchical and clustering fusion includes the following steps:
S1: Set up the experimental environment and load the pre-trained model;
S2: Generate the original key-value cache based on the input sequence;
S3: Calculate token importance scores based on global attention accumulation;
S4: Achieve a ternary partition of tokens through dynamic thresholding;
S5: Perform hierarchical clustering on the to-be-merged set;
S6: Introduce hub selection and joint importance scoring for secondary screening;
S7: Use an adaptive Gaussian weighting strategy for intra-cluster feature fusion;
S8: Generate the initial compressed cache;
S9: Construct a sliding window to temporarily store newly generated tokens;
S10: Perform similarity retrieval and incremental merging on the popped tokens;
S12: Update the global importance score of each cluster in real time;
S13: Implement budget control and eviction mechanisms;
S14: Evaluate the model's performance on multiple evaluation metrics.

[0039] 1. Experimental environment and model configuration; The experiments of this invention were conducted on a server equipped with an NVIDIA RTX A800 (80GB VRAM) GPU, running Ubuntu 20.04, using the PyTorch deep learning framework, and loading pre-trained models with the HuggingFace Transformers library. The base models used in the experiments included Llama-2-7B, Llama-3-8B-Instruct, and Mistral-7B-Instruct-v0.2. All models maintained their original parameters without fine-tuning or gradient updates. During model inference, FP16 data types were used to balance memory usage and computational accuracy.

[0040] 2. Method Design: (1) Generate the original key-value cache based on the input sequence; Given an input sequence containing the initial prompt tokens and subsequently generated tokens, in the pre-filling stage the model performs a forward pass over the input sequence to produce the key and value states of each attention head in each layer, forming the original key-value cache. For an attention head in the $l$-th layer of the model, the original cached key set and value set can be represented as: $$K^{(l)} = [k_1, k_2, \dots, k_T] \in \mathbb{R}^{T \times d_k}, \quad V^{(l)} = [v_1, v_2, \dots, v_T] \in \mathbb{R}^{T \times d_v}$$ where $T$ is the total length of the sequence generated so far, $k_i$ and $v_i$ are the key and value vectors of the $i$-th token, and $d_k$ and $d_v$ are the hidden dimensions of keys and values, respectively, in the Llama-2-7B model. This invention performs the subsequent compression operations on this original cache.

[0041] (2) Global importance assessment and ternary partitioning in the pre-filling stage; To assess the contribution of historical tokens to subsequent text generation, this invention employs a normalized importance evaluation mechanism based on global attention accumulation. For any token $i$ in the input sequence, the attention scores it receives in all subsequent generation steps are summed and divided by the number of subsequent tokens to eliminate positional bias, which defines its global importance score $I_i$: $$I_i = \frac{1}{T - i} \sum_{j=i+1}^{T} A_{j,i}$$ where $A_{j,i}$ is the attention weight the model assigns to the $i$-th historical token when generating the $j$-th token. This score reflects the global semantic value of the token in a long-range context. To reduce the computational overhead of attention in the pre-filling stage, this embodiment approximates the global attention distribution by uniformly sampling probe tokens. The sampling ratio is set to 10%, that is, 10% of the positions are randomly selected from the subsequent generation steps and only their attention distributions are computed. This approximation keeps the computational overhead within an acceptable range while keeping the importance assessment effective.

[0042] Based on the statistical distribution of the global importance scores calculated above, this invention employs a dynamic threshold strategy to achieve a ternary partitioning of the original index set. Specifically, it calculates the mean $\mu$ and standard deviation $\sigma$ of the global importance scores of all lexical units, and introduces threshold coefficients $\alpha$ and $\beta$ to set dual thresholds: $\tau_1 = \mu + \alpha\sigma$ and $\tau_2 = \mu - \beta\sigma$. Based on these thresholds, the original sequence is divided into the following three mutually exclusive subsets: I. Core retention set $S_{\text{core}}$ (scores not lower than $\tau_1$): contains lexical units of high importance in the global semantics, whose key-value states are fully preserved in the cache after compression.

[0043] II. Set to be merged $S_{\text{merge}}$ (scores between $\tau_2$ and $\tau_1$): contains moderately important lexical units that are neither core semantics nor completely redundant. These lexical units are used in the subsequent clustering and fusion operations to reduce feature dimensionality.

[0044] III. Discard set $S_{\text{drop}}$ (scores below $\tau_2$): contains low-importance lexical units, whose key-value states are released from the cache to reduce memory overhead.

[0045] The threshold coefficients $\alpha$ (for the first, upper threshold) and $\beta$ (for the second, lower threshold) are determined by a grid search on a small validation set (taken from the LongBench evaluation set). In this embodiment, $\alpha$ is set to 0.5 and $\beta$ is set to 0.5. If $\alpha$ is too high (e.g., greater than 1.0), the core retention set becomes excessively small, leading to the loss of critical information; if $\beta$ is too low (e.g., less than 0.3), the discard set becomes excessively large, leading to the accidental deletion of potentially useful information. The values above strike a balance between compression ratio and fidelity.
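
As a rough illustration of the ternary partitioning described in the paragraphs above, the following Python sketch splits positions by two thresholds derived from the mean and standard deviation of the scores. The function name `ternary_partition` and the use of the population standard deviation are assumptions; the source does not state which estimator is used.

```python
from statistics import mean, pstdev

def ternary_partition(scores, alpha=0.5, beta=0.5):
    """Hypothetical helper: split positions into core / merge / discard
    index lists using thresholds mu + alpha*sigma and mu - beta*sigma."""
    mu = mean(scores)
    sigma = pstdev(scores)  # assumption: population standard deviation
    upper = mu + alpha * sigma
    lower = mu - beta * sigma
    core = [i for i, s in enumerate(scores) if s >= upper]
    merge = [i for i, s in enumerate(scores) if lower <= s < upper]
    drop = [i for i, s in enumerate(scores) if s < lower]
    return core, merge, drop

scores = [0.9, 0.5, 0.5, 0.5, 0.1]
core, merge, drop = ternary_partition(scores)
print(core, merge, drop)
```

Because the thresholds follow the score distribution of each input, the same coefficients yield different absolute cut-offs for different contexts, which is the sense in which the partition is dynamic.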

[0046] (3) Global semantic clustering and adaptive Gaussian weighted fusion; Traditional merging methods are often limited to windows that are locally adjacent in physical position, making it difficult to capture semantic relationships across paragraphs. For the set to be merged $S_{\text{merge}}$, this invention proposes a bottom-up hierarchical clustering algorithm without positional constraints. First, the algorithm calculates the cosine similarity between all pairs of key vectors of the lexical units within $S_{\text{merge}}$ and constructs a similarity matrix $M$, where $M_{ij} = \cos(k_i, k_j)$. The cosine similarity is calculated as: $\cos(k_i, k_j) = \frac{k_i \cdot k_j}{\lVert k_i \rVert \, \lVert k_j \rVert}$. The algorithm then iteratively searches $M$ for the maximum similarity; if it is greater than the preset merging threshold $\theta$, the corresponding two clusters are merged and the cluster representative vector is updated (by taking the average of the two vectors), until no more cluster pairs can be merged. This ultimately generates a set of semantic clusters.

[0047] The merging threshold $\theta$ controls the granularity of clustering. If $\theta$ is too high (e.g., 0.95), the number of clusters stays close to the number of lexical units in the set to be merged, and the compression effect is not obvious; if $\theta$ is too low (e.g., 0.5), semantically irrelevant lexical units may be incorrectly merged, introducing noise. In this embodiment, a grid search is used to determine $\theta = 0.75$. This value achieved the best average performance across all LongBench tasks.
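
The position-free clustering step can be sketched in plain Python as follows. This is a naive illustration under stated assumptions, not the embodiment's code: `cosine` and `cluster_keys` are hypothetical names, key vectors are assumed to be non-zero, and the similarity matrix is recomputed on every iteration for clarity rather than efficiency.

```python
import math

def cosine(u, v):
    """Cosine similarity of two non-zero vectors given as lists."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def cluster_keys(keys, theta=0.75):
    """Hypothetical helper: bottom-up clustering of key vectors.
    Each cluster is (member_indices, representative_vector)."""
    clusters = [([i], list(k)) for i, k in enumerate(keys)]
    while len(clusters) > 1:
        best, pair = -1.0, None
        # Search all cluster pairs for the most similar representatives.
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                s = cosine(clusters[a][1], clusters[b][1])
                if s > best:
                    best, pair = s, (a, b)
        if best <= theta:
            break  # no remaining pair exceeds the merging threshold
        a, b = pair
        members = clusters[a][0] + clusters[b][0]
        # Representative update: average of the two representatives.
        rep = [(x + y) / 2 for x, y in zip(clusters[a][1], clusters[b][1])]
        clusters[a] = (members, rep)
        del clusters[b]
    return clusters

keys = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
for members, rep in cluster_keys(keys):
    print(members, rep)
```

Positions never enter the merge criterion, so two lexical units from distant paragraphs end up in the same cluster whenever their keys are similar enough.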

[0048] To reduce noise introduced during the clustering process, a hub selection and secondary screening mechanism is introduced before fusion. For each cluster, the lexical unit with the highest global importance score is selected as the pivot (hub) lexical unit $h$. A joint importance score is then calculated between each other lexical unit in the cluster and the pivot; this score combines the attention association strength with the Euclidean distance in the feature space. A screening threshold $\gamma$ is set, and only the lexical units whose joint importance score is not lower than $\gamma$ are retained, forming the set $R$ that participates in the subsequent fusion. The screening threshold $\gamma$ controls the proportion of lexical units retained within a cluster. If $\gamma$ is too high (e.g., 0.5), only the pivot lexical unit itself may remain in $R$, and the fusion degenerates into direct retention; if $\gamma$ is too low (e.g., 0.1), noise cannot be filtered effectively. In this embodiment, $\gamma$ is set to 0.2.

[0049] In the feature fusion stage, this invention employs an adaptive Gaussian weighting strategy: centered on the pivot lexical unit, the key-value states within the set $R$ are weighted and fused. For each lexical unit $j \in R$, a Gaussian kernel weight $\omega_j$ is calculated and then normalized: $\hat{\omega}_j = \omega_j / \sum_{l \in R} \omega_l$. The bandwidth of the Gaussian kernel is an adaptive parameter that is dynamically calculated from the distribution of the lexical units within the cluster, eliminating the need for manual setting. Ultimately, a single compressed key-value pair representing the cluster is generated: $\tilde{k} = \sum_{j \in R} \hat{\omega}_j k_j$ and $\tilde{v} = |R| \sum_{j \in R} \hat{\omega}_j v_j$. In the above formula, $\tilde{v}$ is multiplied by the scalar $|R|$ in order to approximately maintain the counting effect of this cluster in the original sequence during subsequent Softmax attention calculations, thus approximately preserving the attention distribution. Experiments have verified that this counting compensation mechanism improves accuracy by approximately 3-5 percentage points in precise retrieval tasks such as Needle-in-a-Haystack.
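
The fusion step might be sketched as below. Several details are assumptions rather than the patented formulas: the kernel is taken to be the standard form exp(-d²/(2σ²)) over the squared key distance to the pivot, and the adaptive bandwidth is taken to be the root-mean-square distance to the pivot, since the source text does not reproduce either formula. The function name `fuse_cluster` is likewise hypothetical. Only the weight normalization, the weighted sums, and the multiplication of the fused value by the set size follow the description directly.

```python
import math

def fuse_cluster(keys, values, pivot, sigma=None):
    """Hypothetical helper: fuse the retained set (keys, values) around
    the pivot index into one compressed key-value pair."""
    n = len(keys)
    # Squared Euclidean distance of every key to the pivot key.
    d2 = [sum((a - b) ** 2 for a, b in zip(k, keys[pivot])) for k in keys]
    if sigma is None:
        # Assumed bandwidth rule: RMS distance to the pivot (1.0 if all equal).
        sigma = math.sqrt(sum(d2) / n) or 1.0
    # Assumed Gaussian kernel form.
    w = [math.exp(-x / (2 * sigma ** 2)) for x in d2]
    total = sum(w)
    w_hat = [x / total for x in w]  # normalized fusion weights
    k_fused = [sum(w_hat[j] * keys[j][c] for j in range(n))
               for c in range(len(keys[0]))]
    # Count compensation: scale the fused value by the set size.
    v_fused = [n * sum(w_hat[j] * values[j][c] for j in range(n))
               for c in range(len(values[0]))]
    return k_fused, v_fused, w_hat

keys = [[0.0, 0.0], [1.0, 0.0], [3.0, 0.0]]
values = [[1.0], [1.0], [1.0]]
k_fused, v_fused, w_hat = fuse_cluster(keys, values, pivot=0)
print(k_fused, v_fused, w_hat)
```

In this toy case all values equal 1, so the compensated fused value is 3: one cached entry contributes roughly what the three original entries would have contributed together.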

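[0050] The key-value pairs of the core retention set $S_{\text{core}}$ and the $m$ cluster key-value pairs generated by merging the set to be merged are combined to form the initial compressed cache, whose size is $|S_{\text{core}}| + m$. The compression ratio is $\rho = (|S_{\text{core}}| + m) / N$, where $N$ is the length of the original sequence. In this embodiment, under a typical long text scenario, $\rho$ can reach between 35% and 50%.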

[0051] (4) Sliding window incremental merging strategy in the decoding stage; After entering the autoregressive decoding stage, newly generated lexical units must be cached. Directly compressing new lexical units may disrupt the coherence of the local context. Therefore, this invention introduces a sliding window $W$ with a fixed capacity $w$. Newly generated lexical units first enter the window for temporary storage; when the window is full ($|W| = w$), the oldest lexical unit is popped out for processing. The window capacity $w$ is a key hyperparameter that controls how video memory is divided between local buffering and global compression. Values that are too small (such as 0) cause the local context to be compressed immediately, breaking syntactic coherence; values that are too large (such as 128) take up too much of the storage budget of the global semantic clusters, affecting long-range dependencies. In this embodiment, the optimal value of $w$ is determined by sensitivity analysis.

[0052] For the popped lexical unit, the cosine similarity between its key vector and the representative vectors of the existing clusters in the current compressed cache is calculated, and the $K$ most similar clusters are extracted to form a candidate set; in this embodiment, $K$ is set to 10. Then, the maximum cosine similarity $s_{\max}$ in the candidate set and the corresponding best matching cluster $C^{*}$ are determined.

[0053] If $s_{\max} \ge \theta$ (the merging threshold $\theta$ is kept consistent with the pre-filling stage and set to 0.75), incremental merging is triggered, and the key-value state of the cluster $C^{*}$ is updated using a weighted average based on the number of lexical units already in the cluster, that is, the size of the set of original positions covered by the cluster. If $s_{\max} < \theta$, the lexical unit has distinct semantic features, and a new, independent cluster is created for it. This incremental merging mechanism ensures that new information can be incorporated into the cache without disrupting the existing cluster structure, while avoiding frequent compression of lexical units within the sliding window.
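
The decoding-stage behaviour described in the paragraphs above (window buffering, top-K candidate retrieval, threshold-gated merging) can be sketched as follows. The class name `IncrementalCache`, its dictionary layout, and the plain count-weighted average applied to both keys and values are assumptions for illustration; in particular, this sketch does not model the count compensation applied to fused values, and the exhaustive top-K scan stands in for whatever retrieval structure a real system would use.

```python
import math
from collections import deque

def cosine(u, v):
    """Cosine similarity of two non-zero vectors given as lists."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

class IncrementalCache:
    """Hypothetical sketch of sliding-window incremental merging."""

    def __init__(self, window=32, theta=0.75, top_k=10):
        self.window = deque()   # recent (key, value) pairs, uncompressed
        self.capacity = window
        self.theta = theta
        self.top_k = top_k
        self.clusters = []      # dicts with "key", "value", "count"

    def append(self, key, value):
        self.window.append((key, value))
        if len(self.window) > self.capacity:
            # The oldest buffered lexical unit leaves the window.
            self._merge(*self.window.popleft())

    def _merge(self, key, value):
        # Retrieve the top-K most similar clusters by representative key.
        scored = sorted(
            ((cosine(key, c["key"]), idx) for idx, c in enumerate(self.clusters)),
            reverse=True,
        )[: self.top_k]
        if scored and scored[0][0] >= self.theta:
            c = self.clusters[scored[0][1]]
            n = c["count"]
            # Count-weighted average keeps older members from being washed out.
            c["key"] = [(n * a + b) / (n + 1) for a, b in zip(c["key"], key)]
            c["value"] = [(n * a + b) / (n + 1) for a, b in zip(c["value"], value)]
            c["count"] = n + 1
        else:
            self.clusters.append({"key": list(key), "value": list(value), "count": 1})

cache = IncrementalCache(window=1)
cache.append([1.0, 0.0], [1.0, 1.0])
cache.append([0.9, 0.1], [3.0, 3.0])
cache.append([0.0, 1.0], [5.0, 5.0])
cache.append([1.0, 1.0], [0.0, 0.0])
print(len(cache.window), [c["count"] for c in cache.clusters])
```

A window capacity of 1 is used here only to keep the trace short; the trade-off between window size and the budget left for global clusters is discussed in the sensitivity analysis.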

[0054] (5) Dynamic budget control and importance score updates; As the decoding process progresses, the size of the compressed cache continues to grow. To keep video memory usage within the preset budget $B$, this invention designs a replacement mechanism based on dynamic updating of global importance. The semantic value of a cluster changes with the generated context. Therefore, when a new lexical unit $t$ is generated, the algorithm dynamically calculates the sum of the attention weights of that lexical unit to all original positions $P_C$ covered by a cluster $C$ and uses it as the importance increment of that cluster: $I_C \leftarrow I_C + \sum_{i \in P_C} a_{t,i}$, where $I_C$ is the accumulated global importance score of the cluster, initialized in the pre-filling stage as the sum of the global importance scores of all lexical units within the cluster. To reduce the computational overhead of large-scale sequence inference, this update can be performed as a batch calculation once every $T$ steps. In this embodiment, $T$ is set to 16, meaning that a batch update is performed for every 16 newly generated lexical units.

[0055] The budget $B$ represents the maximum number of semantic clusters that the compressed cache can hold. Its value depends on the actual video memory capacity and application requirements. In this embodiment, $B$ is set to 512, meaning that a maximum of 512 semantic clusters can be retained in the cache. For the Llama-2-7B model (32 layers, 32 attention heads), this corresponds to a memory footprint of approximately 32 (batch size) × 32 (layers) × 32 (heads) × 512 (budget) × 128 (head dimension) × 2 (K/V) × 2 (bytes) ≈ 8.59 GB, which enables efficient concurrent inference on a GPU with 80 GB of memory.
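
The footprint arithmetic above can be checked with a few lines of Python. The function name `kv_cache_bytes` and its parameter names are hypothetical; the factors are exactly those listed in the paragraph, with 2 bytes per element corresponding to 16-bit storage.

```python
def kv_cache_bytes(batch, layers, heads, budget, head_dim, bytes_per_elem=2):
    """Bytes needed to hold `budget` key-value entries per head."""
    kv = 2  # one key tensor and one value tensor
    return batch * layers * heads * budget * head_dim * kv * bytes_per_elem

total = kv_cache_bytes(batch=32, layers=32, heads=32, budget=512, head_dim=128)
print(total)        # 8589934592 bytes
print(total / 1e9)  # about 8.59 in decimal gigabytes
```

The product equals 2^33 bytes, that is, about 8.59 GB in decimal units or exactly 8 GiB.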

[0056] When the number of clusters in the compressed cache exceeds the budget $B$, the system locates the cluster with the lowest global importance score, removes it, and releases its memory resources. This eviction mechanism ensures that, even with a limited memory budget, the cache always retains the semantic information that contributes most to subsequent generation. In tasks requiring precise retrieval, such as Needle-in-a-Haystack, this mechanism allows target information (needles) to accumulate importance as subsequent lexical units attend to it, even if its initial importance is low, thus avoiding accidental eviction.
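
A minimal sketch of the importance update and budget-driven eviction is given below. The function name `update_and_evict`, the dictionary layout of a cluster, and the per-step (rather than batched) update are assumptions made to keep the example short; it is not presented as the embodiment's implementation.

```python
def update_and_evict(clusters, attn_to_positions, budget):
    """Hypothetical helper. Each cluster is a dict with 'name',
    'positions' (set of original indices) and 'score'.
    attn_to_positions maps an original position to the attention
    weight it receives from the newly generated lexical unit."""
    for c in clusters:
        # Importance increment: attention mass landing on the cluster's positions.
        c["score"] += sum(attn_to_positions.get(p, 0.0) for p in c["positions"])
    while len(clusters) > budget:
        # Evict the cluster with the lowest accumulated importance.
        clusters.remove(min(clusters, key=lambda c: c["score"]))
    return clusters

clusters = [
    {"name": "background", "positions": {0, 1}, "score": 0.3},
    {"name": "topic", "positions": {2}, "score": 0.5},
    {"name": "needle", "positions": {7}, "score": 0.1},
]
# The new lexical unit attends mostly to position 7 (the needle).
attn = {0: 0.05, 1: 0.05, 2: 0.1, 7: 0.8}
kept = update_and_evict(clusters, attn, budget=2)
print([c["name"] for c in kept])
```

In this toy trace the needle cluster starts with the lowest score but survives, because the attention it receives from the new lexical unit raises its score above that of the background cluster before eviction is decided.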

[0057] (6) Experimental data analysis; The hyperparameters of the method in this invention are all determined through grid search on a small validation set. The main parameter settings are shown in the table below: Table 1. Parameter Settings

[0058] During the testing phase, the model with the compressed cache was evaluated on the LongBench multi-task benchmark and the Needle-in-a-Haystack task. LongBench covers 16 subsets across six categories: multi-document question answering, single-document question answering, summarization, few-shot learning, synthetic tasks, and code tasks. Evaluation metrics included: F1 score (question answering tasks), ROUGE mean (summarization tasks), accuracy (few-shot and synthetic tasks), and edit similarity (code tasks). The Needle-in-a-Haystack task was tested with a 32K context length, and the evaluation metric was retrieval accuracy.

[0059] This embodiment uses an absolute cache budget (i.e., a limit on the total number of lexical units that can be retained for each attention head) as the core evaluation condition. The budget is set at different levels such as 256, 512, and 1024, and is compared with the full cache baseline and existing methods (StreamingLLM, H2O, SnapKV, CaM, EMS, etc.).

[0060] Experimental results on the LongBench multi-task benchmark demonstrate that, within the same budget, our method outperforms the baseline methods in tasks relying on global long-range cues, such as multi-document question answering and summarization. For example, with the Llama-3-8B model, our method scores 46.20 on HotpotQA, higher than H2O's 43.74. This performance improvement is primarily attributed to the global semantic clustering and adaptive Gaussian weighted fusion mechanism. This mechanism maps equivalent semantic entities and discrete cues across paragraphs to unified cluster centers, compressing redundant background while preserving the core semantics of the paragraphs.

[0061] In single-document question answering and few-shot learning tasks, this method can adaptively identify and retain core features that are highly relevant to the query through global attention accumulation and dynamic threshold partitioning, outperforming methods such as SnapKV that rely on local context preservation.

[0062] In synthetic retrieval and code completion tasks, this method achieves significant performance improvements. The sliding window mechanism preserves the integrity of the local code syntax tree and the most recently generated logic, avoiding information loss caused by premature compression; the semantic clusters based on cosine similarity provide global feature representations, enabling the model to accurately retrieve discrete target information or previously defined variables across long sequences.

[0063] In the Needle-in-a-Haystack task, this method maintains high retrieval accuracy under extreme context length and strictly limited cache budget (achieving 84.5% accuracy with an extreme budget of 128; further improving to 97.5% with a budget of 512), outperforming other baseline algorithms. This is mainly due to the dynamic importance update mechanism and global similarity merging strategy: even if the attention distribution of target information is weak in the initial stage, the algorithm can effectively integrate it into relevant semantic clusters through similarity matching in the feature space, reducing the risk of key facts being directly truncated.

[0064] To analyze the independent contribution of each core module, ablation experiments were conducted in this embodiment under a strictly limited video memory budget. Removing the clustering fusion mechanism resulted in the most significant performance degradation (a 6.15-point decrease in the LongBench average score), indicating that fusion operations based on global semantics preserve long-range contextual information more effectively than direct eviction strategies. Removing the sliding window buffer also caused a significant performance loss (a 2.81-point decrease), validating the necessity of local buffering in autoregressive decoding. Removing dynamic threshold partitioning and dynamic importance updates also reduced inference accuracy, confirming that static rules are ill-suited to the diversity of long text feature distributions.

[0065] In the comparative experiment of feature fusion strategies, adaptive Gaussian weighted fusion (comprehensive average score of 54.61) is better than average pooling fusion (52.18) and retaining only hub words (50.45), indicating that the distance-aware weighted fusion strategy based on hub words can more effectively preserve the core semantics and suppress the introduction of noise.

[0066] Under a limited total cache budget, a sensitivity analysis was performed on the sliding window size $w$. With the smallest window, the model's performance dropped significantly on both code and synthetic tasks; as $w$ increased, local context could be preserved completely and the scores for code and synthetic tasks rose steadily. However, when $w$ increases to 64 or even 128, the globally available budget decreases significantly, leading to a noticeable performance degradation in tasks that rely heavily on long-range, cross-paragraph features, such as multi-document question answering, single-document question answering, and summarization. A comprehensive evaluation across all task dimensions shows that, at the window size selected in this embodiment, the model achieves a better balance between maintaining the local sequence structure and preserving the global long-range semantics, and obtains the highest LongBench overall average score.

[0067] This invention addresses the memory limitations and throughput bottlenecks caused by the linear growth of key-value cache size in long-context reasoning stages of large language models. It proposes a key-value cache compression method based on importance-aware dynamic hierarchical and clustering fusion. In the pre-filling stage, this method accumulates lexical importance through global attention and uses dynamic thresholds to divide lexical units into a retain set, a merge set, and a discard set. For the merge set, hierarchical clustering and adaptive Gaussian weighted fusion are performed to generate an initial compressed cache. In the decoding stage, a sliding window buffer is constructed to hold new lexical units, and incremental merging or new cluster creation is performed through cosine similarity retrieval. Simultaneously, a dynamic importance update and budget control mechanism is designed to ensure limited memory usage. Experimental results show that in the LongBench multi-task benchmark and Needle-in-a-Haystack precise retrieval task, this method achieves superior performance compared to existing baseline models under the same memory budget, providing a solution for long-text reasoning that balances storage efficiency and semantic fidelity.

[0068] The above description is merely a preferred embodiment of the present invention and is not intended to limit the present invention in any way. Although the present invention has been disclosed above with reference to preferred embodiments, it is not intended to limit the present invention. Any person skilled in the art can make some modifications or alterations to the above-disclosed technical content to create equivalent embodiments without departing from the scope of the present invention. Any simple modifications, equivalent changes, and alterations made to the above embodiments based on the technical essence of the present invention without departing from the scope of the present invention shall still fall within the scope of the present invention.

Claims

1. A key-value cache compression method based on importance-aware dynamic hierarchical and clustering fusion, comprising the following steps:
S1: Select the target large language model, determine the input context, and generate the original key-value cache; the target large language model is an autoregressive model based on the Transformer decoder architecture which, after loading, accepts an input sequence containing initial prompt lexical units and subsequently generated lexical units; in the pre-filling stage, forward computation is performed on the input sequence to generate the key state and value state of each attention head in each layer, which constitute the original key-value cache;
S2: Calculate lexical unit importance scores based on global attention accumulation; for each lexical unit in the input sequence, the cumulative sum of the attention scores it receives in all subsequent generation steps is calculated and divided by the total number of lexical units after it to obtain its global importance score; the global importance score reflects the semantic value of the lexical unit in a long-range context;
S3: Achieve ternary classification of lexical units through dynamic threshold segmentation; based on the statistical distribution of the global importance scores of all lexical units, their mean and standard deviation are calculated, and a first threshold and a second threshold are set to divide the original lexical unit set into three mutually exclusive subsets: the core retention set, which consists of lexical units whose global importance score is not lower than the first threshold and whose key-value states are completely preserved after compression; the set to be merged, which consists of lexical units whose global importance score lies between the second threshold and the first threshold and which is used for subsequent clustering fusion; and the discard set, which consists of lexical units whose global importance score is below the second threshold and whose key-value states are released from the cache;
S4: Perform global semantic clustering without positional constraints on the set to be merged; the cosine similarity between all pairs of key vectors of the lexical units in the set to be merged is calculated to construct a similarity matrix; a bottom-up hierarchical clustering algorithm iteratively merges cluster pairs whose similarity is higher than a preset merging threshold and updates the cluster representative vector until no more cluster pairs can be merged, thus generating a semantic cluster set;
S5: Introduce hub selection and joint importance scoring for secondary screening; for each semantic cluster, the lexical unit with the highest global importance score within the cluster is selected as the pivot lexical unit; the joint importance score between each other lexical unit within the semantic cluster and the pivot lexical unit is calculated, the joint importance score combining the attention association strength and the Euclidean distance in the feature space; a screening threshold is set, and the lexical units whose joint importance score is not lower than the screening threshold are retained to form the set of lexical units that participate in subsequent fusion;
S6: Use an adaptive Gaussian weighting strategy to achieve intra-cluster feature fusion; centered on the pivot lexical unit, Gaussian kernel weights are calculated for the set of lexical units participating in the fusion, with an adaptive bandwidth parameter dynamically calculated from the distribution of lexical units within the semantic cluster; the Gaussian kernel weights are normalized to obtain the fusion weight of each lexical unit; the keys and values are weighted and summed separately to generate a single compressed key-value pair representing the semantic cluster; the Gaussian weighted sum of the value vectors is multiplied by the semantic cluster size |R| to approximately maintain the counting effect in the original sequence;
S7: Construct a local incremental merging strategy based on a sliding window to handle newly generated lexical units during the decoding stage; during the autoregressive decoding stage, a sliding window with a fixed capacity is set to temporarily store newly generated lexical units; when the window is full, the lexical unit that entered the window earliest is popped out; the cosine similarity between the key vector of the popped lexical unit and the compressed key vector of each semantic cluster in the current compressed cache is calculated, and the most similar candidate clusters are extracted; the maximum similarity and its corresponding best matching cluster are determined; if the maximum similarity is not lower than the preset merging threshold, the lexical unit is merged into the best matching cluster, and the key-value state of the cluster is updated by a weighted average based on the number of original lexical units in the cluster; otherwise, a new cluster is created for the popped lexical unit in the compressed cache;
S8: Design a dynamic budget control and importance score update mechanism; the maximum number of semantic clusters that can be retained in the compressed cache is set as the budget; during decoding, when a new lexical unit is generated, the sum of its attention weights to the original positions covered by each semantic cluster is dynamically calculated as the importance increment of that semantic cluster and accumulated into the global importance score of that semantic cluster; when the size of the compressed cache exceeds the budget, the cluster with the lowest global importance score is located and removed to release video memory resources;
S9: Define compression ratio and fidelity constraints; the compression ratio is the ratio of the size of the compressed key-value cache to the size of the original cache; compression must meet a fidelity constraint, meaning that for any subsequent query, the difference between the output calculated using the compressed cache and the output calculated using the original full cache must be controlled within a preset tolerance range.

2. The KV cache compression method based on importance-aware dynamic hierarchical and clustering fusion as described in claim 1, characterized in that, the global importance score $I_i$ in step S2 is calculated as: $I_i = \frac{1}{N-i}\sum_{t=i+1}^{N} a_{t,i}$, where $N$ is the total length of the input sequence and $a_{t,i}$ is the attention weight assigned to the $i$-th historical lexical unit when the model generates the $t$-th lexical unit.

3. The KV cache compression method based on importance-aware dynamic hierarchical and clustering fusion as described in claim 1, characterized in that, in step S3, the first threshold and the second threshold are determined from the mean $\mu$ and standard deviation $\sigma$ of the global importance scores, combined with a first threshold coefficient $\alpha$ and a second threshold coefficient $\beta$, specifically: $\tau_1 = \mu + \alpha\sigma$ and $\tau_2 = \mu - \beta\sigma$, where $\tau_1$ is the first threshold, $\tau_2$ is the second threshold, and $\alpha$ and $\beta$ are preset coefficients.

4. The KV cache compression method based on importance-aware dynamic hierarchical and clustering fusion according to claim 1, characterized in that, the preset merging threshold in step S4 controls the clustering granularity, with a value ranging from 0.65 to 0.85.

5. The KV cache compression method based on importance-aware dynamic hierarchical and clustering fusion according to claim 1, characterized in that, the joint importance score in step S5 is calculated from the attention association strength between a lexical unit $j$ and the pivot lexical unit $h$ and from the Euclidean distance $\lVert k_j - k_h \rVert_2$, where $k_j$ and $k_h$ are the key vectors of the lexical unit $j$ and the pivot lexical unit $h$ respectively; the screening threshold ranges from 0.15 to 0.25.

6. The KV cache compression method based on importance-aware dynamic hierarchical and clustering fusion according to claim 1, characterized in that, the adaptive Gaussian weighted fusion in step S6 is calculated as follows: first, the Gaussian kernel weight $\omega_j$ of each lexical unit $j \in R$ is calculated using an adaptive bandwidth parameter determined from the lexical units in $R$, where $R$ is the set of lexical units participating in the fusion; then the normalized weights $\hat{\omega}_j = \omega_j / \sum_{l \in R} \omega_l$ are calculated; finally, the compressed key-value pair is generated: $\tilde{k} = \sum_{j \in R} \hat{\omega}_j k_j$ and $\tilde{v} = |R| \sum_{j \in R} \hat{\omega}_j v_j$, where $k_j$ and $v_j$ are the key vector and value vector of lexical unit $j$, and $|R|$ is the number of lexical units participating in the fusion.

7. The KV cache compression method based on importance-aware dynamic hierarchical and clustering fusion according to claim 1, characterized in that, the capacity $w$ of the sliding window in step S7 ranges from 16 to 64; the number of candidate clusters $K$ ranges from 5 to 15; and the preset merging threshold is consistent with the merging threshold in step S4.

8. The KV cache compression method based on importance-aware dynamic hierarchical and clustering fusion according to claim 1, characterized in that, the budget $B$ in step S8 ranges from 256 to 1024; the importance score is updated in batches, with one batch update performed for every $T$ newly generated lexical units, where $T$ ranges from 8 to 16.

9. The KV cache compression method based on importance-aware dynamic hierarchical and clustering fusion according to claim 1, characterized in that, the fidelity constraint in step S9 is expressed as: $\lVert O - \hat{O} \rVert \le \epsilon$, where $O$ is the output calculated using the original full cache, $\hat{O}$ is the approximate output calculated using the compressed cache, and $\epsilon$ is a preset performance loss threshold.

10. The KV cache compression method based on importance-aware dynamic hierarchical and clustering fusion according to claim 1, characterized in that, It also includes extended steps for multi-turn dialogue scenarios: at each round of user input, the input is treated as a new query, the sum of attention weights of the query to each cluster in the current compressed cache is calculated as the dynamic importance increment of the cluster, and the clusters are sorted accordingly; when the cache size exceeds the budget, the cluster with the lowest dynamic importance is removed first.