A data management method for large language models
By dividing the key-value cache sequence of a large language model into sub-blocks and performing gating modulation based on the consistency ratio calculated using Euclidean distance and attention weights, the problems of memory overflow and key information omission in long text reasoning of large language models are solved, achieving efficient utilization of memory and improved reasoning accuracy.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- ZHONGNAN INFORMATION TECH (SHENZHEN) CO LTD
- Filing Date
- 2026-04-13
- Publication Date
- 2026-06-16
AI Technical Summary
In long text reasoning scenarios of large language models, existing cache management schemes are prone to memory overflow and loss of key information. Existing cache eviction methods, such as uniform compression or reliance on a single evaluation index, lead to resource waste and loss of key semantics.
The key-value cache sequence of the large language model is divided into multiple sub-blocks along the token position dimension. The dispersion coefficient and association concentration are calculated based on the Euclidean distance and attention weight within the sub-block. Gating modulation is performed through the consistency ratio, the cache retention quota is non-uniformly allocated, and tokens are filtered according to attention value within the sub-block.
Under conditions of limited video memory, it effectively avoids the omission of key details and global semantic breaks, improves the accuracy of model inference and resource utilization, and ensures the preservation of key information.
Description
Technical Field
[0001] This invention relates to the field of data processing, and more specifically to a data management method for large language models.
Background Technology
[0002] In recent years, Large Language Models (LLMs) have demonstrated outstanding performance in various natural language processing tasks. During the autoregressive decoding inference process of LLMs, a key-value cache (KV Cache) mechanism is typically introduced to avoid redundant computation. This mechanism stores the key and value vectors of historical tokens in GPU memory for later retrieval and reuse during generation. This mechanism significantly improves inference speed and is a fundamental data management component of current large language model inference engines.
[0003] As large-scale model applications become increasingly sophisticated, the length of user-input text is increasing dramatically, particularly in scenarios involving lengthy legal contracts, complex technical documents with extensive code and parameters, or extremely long business conversations. In these long-text reasoning scenarios, the amount of data in the key-value (KV) cache expands rapidly with the length of the context sequence. Since the memory capacity of server graphics processing units (GPUs) is extremely limited, a large KV cache can quickly exhaust available GPU memory, leading to memory overflow or forcing the system to reduce its concurrent throughput. Therefore, filtering and compressing the KV cache within limited GPU memory resources to maintain continuous reasoning of long texts has become an essential requirement in this scenario.
[0004] However, when faced with complex and lengthy documents in real-world business scenarios, existing caching and eviction methods that rely on uniform compression or a single evaluation metric are prone to causing serious anomalies in actual business operations. For example, when a user asks a large model to analyze a technical document containing lengthy background information and dense core technical parameters, a single metric can easily lead to misjudgments: it may be influenced by positional bias and retain the irrelevant opening paragraphs as important information, or it may mistake a redundant passage listing numerous irrelevant technical terms for crucial information. This single-dimensional misjudgment causes the system to waste valuable memory on low-value redundant information while discarding the truly critical core clauses or parameters. Ultimately, when the large model answers user questions about these core details, it suffers from severe omissions of key information or from fabrications (a phenomenon known as "hallucination"), causing the long-text question-answering and analysis business to fail completely.
Summary of the Invention
[0005] To address the issue that cache eviction methods employing uniform compression or single evaluation metrics can easily lead to serious anomalies in actual business operations, this invention proposes a data management method for large language models. The method includes: dividing the key-value cache sequence generated during the large language model inference process into multiple continuous sub-blocks along the token position dimension; for each sub-block, calculating a dispersion coefficient representing the local information density level based on the statistical dispersion of the Euclidean distance between the value vectors of each token within the sub-block and the mean vector of the sub-block; for each sub-block, calculating an association concentration degree representing the global semantic association concentration degree based on the uneven distribution of the cumulative degree of global attention to each token in the attention weight matrix generated during the large language model inference process within the sub-block; calculating a consistency ratio based on the numerical consistency between the dispersion coefficient and the association concentration degree, and gating the fusion value of the dispersion coefficient and the association concentration degree using the consistency ratio to generate retention weights for each sub-block; and non-uniformly distributing the available cache retention quota to each sub-block according to the retention weights, and filtering tokens within each sub-block based on their attention value to generate a compressed key-value cache sequence.
[0006] Compared to existing cache management schemes in large language model inference that employ uniform culling or rely on a single attention dimension, leading to the loss of key semantics, this invention combines local information density with global semantic association concentration and uses consistency ratio for gating modulation to non-uniformly allocate retention quotas. This significantly compresses key-value cache to reduce device memory consumption, while effectively avoiding the omission of key details and global semantic breaks in practical application scenarios such as long text generation or complex multi-turn dialogues, thus significantly improving the model's inference accuracy and resource utilization.
[0007] Furthermore, the method for calculating the dispersion coefficient includes: performing element-wise mean aggregation on the value vectors of all tokens within the sub-block to obtain the sub-block mean vector; calculating the Euclidean distance between the value vector of each token within the sub-block and the sub-block mean vector; calculating the arithmetic mean distance and the distance standard deviation for all Euclidean distance values within the sub-block; and using the ratio of the distance standard deviation to the arithmetic mean distance as the dispersion coefficient of the sub-block.
[0008] Compared to traditional feature extraction methods that rely on a single statistical value, this invention characterizes dispersion by calculating the ratio of the distance standard deviation to the arithmetic mean distance. This allows for a more accurate and robust measurement of the information richness within each local segment, thereby helping the model to more scientifically identify and retain key text segments with extremely high information density in application scenarios such as long text summarization.
[0009] Furthermore, the method for calculating the association concentration includes: performing arithmetic mean aggregation on the attention weight matrix along the attention head dimension to obtain an average attention matrix; performing summation on the average attention matrix in the column direction to obtain the attention value for each token position; arranging the attention values in ascending order within each sub-block, and calculating the association concentration of the sub-block based on the sorted attention values using the Gini coefficient formula.
[0010] Compared to the potential local focus bias that may result from directly using the original attention weights, this invention uses the aggregation and averaging along the attention head and introduces the Gini coefficient to assess the uneven distribution of attention. This allows for a more objective reflection of the concentration of semantic focus in the global context. In scenarios such as long document reading comprehension, it can ensure that the model accurately captures and retains the core clues that are widely relied upon throughout the entire document.
[0011] Furthermore, the consistency ratio is calculated as follows: take the smaller of the dispersion coefficient and the association concentration as the numerator, take the larger of the two as the denominator, and use the ratio of the numerator to the denominator as the consistency ratio of the sub-block.
[0012] Furthermore, the method for generating the retention weights includes: using the arithmetic mean of the dispersion coefficient and the association concentration to characterize the overall importance level of the sub-block; using the consistency ratio as a gating coefficient to perform confidence weighting on the overall importance level; and introducing a global normalization term so that the sum of the retention weights of all sub-blocks always equals the total number of sub-blocks, which ensures that the retention weights only change the quota allocation ratio between sub-blocks.
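The gating and normalization described in [0011] and [0012] can be sketched as follows. This is a minimal illustration, not the claimed implementation: the function name `retention_weights`, the guard constant `eps`, and the uniform fallback when every gated score is zero are assumptions introduced here, and the normalization is written as "gated score times number of sub-blocks divided by the sum of gated scores", which is one form that satisfies the stated sum constraint.

```python
import numpy as np

def retention_weights(dispersion, concentration, eps=1e-12):
    """Fuse two per-sub-block features into retention weights that sum to B.

    dispersion, concentration: 1-D sequences of equal length B.
    eps is a hypothetical guard against 0/0, not a value from the source.
    """
    d = np.asarray(dispersion, dtype=np.float64)
    g = np.asarray(concentration, dtype=np.float64)

    # Consistency ratio: smaller feature over larger feature, in [0, 1].
    consistency = np.minimum(d, g) / (np.maximum(d, g) + eps)

    # Overall importance: arithmetic mean of the two features.
    importance = 0.5 * (d + g)

    # Gating: importance is trusted more when both channels agree.
    gated = consistency * importance

    total = gated.sum()
    if total <= 0.0:
        # Assumed fallback: no usable signal, so weight every block equally.
        return np.ones_like(gated)

    # Global normalization: weights sum to the number of sub-blocks.
    return gated * (gated.size / total)
```

With inputs `[0.2, 0.8, 0.5]` and `[0.4, 0.8, 0.1]`, the second sub-block (both channels high and in agreement) receives the largest weight, while the third (channels disagree strongly) receives the smallest.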
[0013] Compared to unconstrained weight stacking mechanisms, this invention uses a consistency ratio as a gating factor for confidence weighting and introduces global normalization processing. This ensures the fairness of resource competition among different sub-blocks and maintains a constant total quota. In edge computing or model deployment scenarios with strictly limited video memory, the system can optimally allocate video memory within the available resource boundaries, avoiding the risk of video memory overflow caused by weight inflation.
[0014] Furthermore, the method of non-uniformly distributing the available cache retention quota of the system to each sub-block according to the retention weight also includes: determining the target total number of retention tokens based on the current available video memory capacity and the storage occupancy of a single token key-value pair; dividing the target total number of retention tokens by the total number of sub-blocks to obtain the basic retention amount; and multiplying the retention weight of each sub-block by the basic retention amount and rounding it down to obtain the number of retention tokens for that sub-block.
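A minimal sketch of the quota arithmetic in [0014], assuming the target total is obtained by integer division of available memory by the per-token key-value footprint. The names `allocate_quota`, `free_bytes`, and `bytes_per_token` are hypothetical.

```python
import numpy as np

def allocate_quota(weights, free_bytes, bytes_per_token):
    """Return (target_total, per_block) following the floor-based allocation.

    weights: retention weights, one per sub-block.
    free_bytes / bytes_per_token: assumed inputs describing available memory
    and the storage cost of one token's key-value pair.
    """
    weights = np.asarray(weights, dtype=np.float64)

    # Target total number of retained tokens that fits in available memory.
    target_total = free_bytes // bytes_per_token

    # Basic retention amount: equal share per sub-block.
    base = target_total / weights.size

    # Per-block quota: weight times base, rounded down.
    per_block = np.floor(weights * base).astype(np.int64)
    return target_total, per_block
```

For weights `[0.5, 1.5, 1.0, 1.0]`, 1000 free bytes, and 10 bytes per token, the target is 100 tokens and the floored quotas are `[12, 37, 25, 25]`, which sum to 99. Rounding down therefore leaves a shortfall of one token relative to the target.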
[0015] Compared to traditional fixed-length cache truncation strategies, this invention dynamically calculates the target retention amount by combining the current available video memory capacity and adaptively allocates the number of tokens according to the weight of each sub-block. This enables large language models to have high flexibility in different hardware environments. In practical application scenarios with large video memory fluctuations, such as multi-task concurrency, it not only ensures the stable operation of the system, but also maximizes the semantic benefits under specific hardware conditions.
[0016] Furthermore, it also includes: in response to a deviation between the sum of the rounded-down number of reserved tokens of all sub-blocks and the target total number of reserved tokens, performing a compensation process on the deviation, successively deducting positive deviations from the sub-blocks with the lowest retention weight, and successively supplementing negative deviations to the sub-blocks with the highest retention weight, until the total number of reserved tokens is strictly equal to the target total number of reserved tokens.
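The compensation step in [0016] can be sketched as below. The source does not say how many tokens are moved per sub-block at a time; this sketch assumes one token per sub-block per pass, visiting sub-blocks in weight order. The function name `compensate` is hypothetical.

```python
import numpy as np

def compensate(per_block, weights, target_total):
    """Adjust floored quotas until they sum exactly to target_total.

    A surplus is removed from the lowest-weight sub-blocks first;
    a shortfall is added to the highest-weight sub-blocks first.
    One token per sub-block per pass is an assumption of this sketch.
    """
    counts = np.asarray(per_block, dtype=np.int64).copy()
    weights = np.asarray(weights, dtype=np.float64)
    deviation = int(counts.sum()) - int(target_total)

    if deviation > 0:
        order = np.argsort(weights, kind="stable")    # lowest weight first
        step = -1
    else:
        order = np.argsort(-weights, kind="stable")   # highest weight first
        step = 1

    remaining = abs(deviation)
    while remaining > 0:
        progressed = False
        for b in order:
            if remaining == 0:
                break
            if step < 0 and counts[b] == 0:
                continue  # cannot deduct from an empty quota
            counts[b] += step
            remaining -= 1
            progressed = True
        if not progressed:
            break  # nothing left to deduct; avoid an endless loop
    return counts
```

Applied to the floored quotas `[12, 37, 25, 25]` with weights `[0.5, 1.5, 1.0, 1.0]` and a target of 100, the single missing token goes to the highest-weight sub-block, giving `[12, 38, 25, 25]`.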
[0017] Furthermore, the filtering of tokens based on attention value within each sub-block also includes: sorting all tokens within the sub-block from high to low according to their attention value; retaining the key-value cache data corresponding to tokens ranked within the number of retained tokens; and rearranging and concatenating the key-value pairs retained in each sub-block according to their position index in the original sequence from small to large to form the compressed key-value cache sequence.
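The per-sub-block selection and reordering in [0017] might look like the following sketch. It assumes the cache is held as arrays whose first axis is the token position and that sub-blocks are given as 0-based half-open `(start, stop)` ranges; both conventions, and the name `compress_kv`, are assumptions made for illustration.

```python
import numpy as np

def compress_kv(keys, values, attention, blocks, per_block):
    """Keep the top-attention tokens of each sub-block, in original order.

    keys, values: arrays with token position on axis 0.
    attention: 1-D array of per-token attention values.
    blocks: list of (start, stop) half-open ranges, one per sub-block.
    per_block: number of tokens to retain in each sub-block.
    """
    kept = []
    for (start, stop), quota in zip(blocks, per_block):
        scores = attention[start:stop]
        # Sort from high to low attention and keep the first `quota` tokens.
        top = np.argsort(-scores, kind="stable")[:quota]
        kept.append(start + top)

    # Restore ascending position order before concatenating the cache.
    index = np.sort(np.concatenate(kept))
    return keys[index], values[index], index
```

Sorting the retained indices at the end preserves the relative order of tokens from the original sequence, which is what the paragraph requires of the compressed cache.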
[0018] Furthermore, it also includes: after calculating the dispersion coefficients of all sub-blocks, performing outlier detection, and clamping any dispersion coefficient that deviates from the mean of all sub-blocks' dispersion coefficients by more than three standard deviations to the boundary value of that range.
[0019] Furthermore, it also includes: in response to detecting that the sequence formed by the dispersion coefficients and the sequence formed by the association concentrations are inconsistent in length, triggering degradation processing logic, discarding the dual-channel feature data of the current batch, and falling back to a degraded mode in which cache compression is performed on each sub-block with uniform weight.
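A small sketch of the degradation check in [0019], assuming the fused weighting is supplied as a callable. The names `select_weights` and `fused_weight_fn`, and the returned flag, are hypothetical.

```python
import numpy as np

def select_weights(dispersion, concentration, num_blocks, fused_weight_fn):
    """Return (weights, degraded) for the current batch.

    If the two feature sequences differ in length, the dual-channel data is
    discarded and every sub-block receives the same weight of 1.
    fused_weight_fn is an assumed callable implementing the normal path.
    """
    d = np.asarray(dispersion, dtype=np.float64)
    g = np.asarray(concentration, dtype=np.float64)

    if d.shape != g.shape:
        # Degraded mode: uniform weights, features from this batch ignored.
        return np.ones(num_blocks, dtype=np.float64), True

    return fused_weight_fn(d, g), False
```

Uniform weights of 1 keep the sum equal to the number of sub-blocks, so the downstream quota allocation works unchanged in degraded mode.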
[0020] The technical effects of this invention are as follows:
[0021] This invention proposes a dynamic compression mechanism for key-value (KV) caching in large language models based on dual-channel feature fusion. Overcoming the limitations of existing technologies that suffer from key semantic loss due to uniform truncation or single attention filtering, this scheme divides the KV cache into sub-blocks, extracts features from both the local information density and the global semantic association dimensions, and uses the consistency ratio for gating modulation to achieve a non-uniform dynamic allocation of cache retention quotas across sub-blocks. Combined with dynamic memory awareness, strict deviation compensation, and anomaly degradation mechanisms, this invention can adaptively and significantly reduce the memory consumption of model inference in memory-constrained practical application scenarios, while maintaining the integrity of contextual logic and generation quality.
Attached Figure Description
[0022] Figure 1 is a schematic flowchart illustrating a data management method for large language models according to an embodiment of the present invention;
[0023] Figure 2 is a bar chart comparing retention weight allocation in four text scenarios in an embodiment of the present invention.
Detailed Implementation
[0024] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.
[0025] The specific embodiments of the present invention will now be described in detail with reference to the accompanying drawings.
[0026] This embodiment describes a data management method for large language models that operates within an inference service computing system. This computing system includes at least one server equipped with a graphics processing unit (GPU) accelerator card featuring high-bandwidth video memory. The GPU has a video memory capacity of at least 80 GB and a video memory bandwidth of at least 2 TB/s. It is connected to the central processing unit (CPU) on the host side via a high-speed interconnect bus. The CPU has at least 64 physical cores, and the host-side system memory capacity is at least 512 GB, employing a multi-channel memory architecture to ensure high data throughput. The server's persistent storage layer uses a solid-state drive (SSD) array based on the Non-Volatile Memory Express (NVMe) protocol, with a sequential read bandwidth of at least 7 GB/s, used for temporary storage and recovery of overflow key-value cache data in long text inference tasks.
[0027] For example, the server deploys a large language model inference engine based on a Transformer decoder architecture. This inference engine interacts with the graphics processing unit through the Compute Unified Device Architecture (CUDA), maintaining the key-value cache data for the inference process in video memory. The large language model preferably has at least 7 billion parameters, and the number of attention heads is preferably set to 32, although in practical applications it can be adjusted from 8 to 128 depending on deployment requirements. The model's embedding dimension is preferably set to 4096, although depending on the model architecture it can be selected from 1024 to 8192. The inference engine receives text inference requests submitted by external clients through a network interface module. Each request contains long text sequence data to be processed.
[0028] An example of a data management method for large language models:
[0029] As shown in Figure 1, a data management method for large language models according to the present invention includes:
[0030] S101. Perform equal-width contiguous chunking along the token position dimension on the complete key-value cache sequence output by the inference engine to construct block-level analysis granularity.
[0031] First, in response to the inference engine receiving a long text inference request, the encoder module of the inference engine tokenizes the input text and completes the forward propagation computation, generating a complete key-value cache sequence in the graphics processing unit's video memory. The key-value cache sequence contains the paired key vectors and value vectors at $N$ token positions, where the dimension of each key vector and value vector equals the model's embedding dimension divided by the number of attention heads. In this embodiment, $N$ is preferably set to 8192, meaning the input sequence contains 8192 tokens. In practical applications, $N$ varies with the length of the input text and can range from 1024 to 131072.
[0032] Furthermore, the key-value cache sequence of length $N$ is divided along the token position dimension into $B$ non-overlapping contiguous sub-blocks, each containing the key-value pair data at $w$ consecutive token positions, where the sub-block width $w$ equals $N$ divided by $B$, rounded down: $w = \lfloor N / B \rfloor$. In this embodiment, $B$ is preferably set to 64, corresponding to each sub-block containing the key-value pairs of 128 tokens. In practical applications, depending on the sequence length and computing resources, $B$ can be adjusted within the range of 16 to 256. The chunking operation works as follows: the $b$-th sub-block, where $b$ ranges from 1 to $B$, contains all key-value pairs of the original sequence from position $(b-1)w + 1$ to position $bw$.
[0033] Then, if $N$ is not divisible by $B$, the actual number of tokens in the last sub-block will be greater than $w$. In this embodiment, for this boundary case, the remaining tokens at the end are merged into the last sub-block, so that the number of tokens in the last sub-block is $w$ plus the remainder $N \bmod B$.
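The equal-width chunking of step S101, including the boundary case where the trailing remainder is merged into the last sub-block, can be sketched as follows. The function name `split_blocks` and the use of 0-based half-open ranges (rather than the 1-based inclusive positions of the description) are choices made for this illustration.

```python
def split_blocks(seq_len, num_blocks):
    """Split positions 0..seq_len-1 into num_blocks contiguous ranges.

    Returns a list of (start, stop) half-open ranges. Every block has width
    seq_len // num_blocks, except the last one, which also absorbs the
    remainder seq_len % num_blocks.
    """
    width = seq_len // num_blocks
    if width == 0:
        raise ValueError("more sub-blocks than tokens")

    blocks = [(b * width, (b + 1) * width) for b in range(num_blocks)]

    # Boundary case: merge leftover tokens into the last sub-block.
    last_start, _ = blocks[-1]
    blocks[-1] = (last_start, seq_len)
    return blocks
```

For the preferred setting of 8192 tokens and 64 sub-blocks, every range has width 128. For 1000 tokens and 16 sub-blocks, the width is 62 and the last range covers 70 tokens (62 plus a remainder of 8).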
[0034] It should be noted that although this embodiment adopts an equal-width block strategy, in other embodiments where computing power allows, a semantic boundary block strategy based on punctuation mark positions or a storage alignment block strategy based on fixed byte size can also be used to achieve the same sequence partitioning effect.
[0035] After the above processing, a set of key-value cache sub-blocks is obtained. The set contains $B$ sub-blocks, each holding the key-value pairs of a run of consecutive token positions and serving as the basic analysis unit for subsequent block-level feature extraction.
[0036] S102. Calculate the semantic discreteness statistic along the value vector embedding dimension within each sub-block to extract the first channel feature representing the local information density level.
[0037] First, for each key-value cache sub-block output in step S101, the value vector data of all tokens within it is extracted. The value vector is a vector representation obtained after value projection matrix transformation in the large language model attention mechanism, carrying the information encoding of the corresponding token in the semantic space. In the processing scenario of long documents with mixed content, when the text region corresponding to a sub-block carries dense and diverse semantic information, the distribution of the value vectors of each token in the embedding space within that region tends to be far apart; when the text region corresponding to a sub-block contains redundant content with sparse information, the distribution of the value vectors of each token in the embedding space tends to be clustered together. Based on this statistical characteristic, the information density level of a region can be indirectly characterized by quantifying the dispersion of the value vectors within the sub-block.
[0038] Furthermore, all value vectors within the $b$-th sub-block are aggregated element-wise by mean; that is, the arithmetic mean of the $n_b$ value vectors contained in that sub-block is calculated in each embedding dimension component, yielding a sub-block mean vector with the same dimension as the individual value vectors. The arithmetic mean of the token value vectors within a sub-block characterizes the geometric center of that region in the semantic space, and subsequent steps use the degree to which each value vector deviates from this center as the measure of dispersion, so this center must be determined first. The sub-block mean vector $\mu_b$ is calculated as:
[0039] $$\mu_b = \frac{1}{n_b} \sum_{i \in S_b} v_i$$;
[0040] where $v_i$ is the value vector at the $i$-th token position; the summation range $S_b$ covers the position indices of all tokens within the sub-block; and $n_b$ is the number of tokens contained in the sub-block. Therefore, as the differences between the token value vectors within the sub-block increase, the mean vector tends towards a compromise representation of directional information, rather than being biased towards the semantic direction of any single token.
[0041] Then, for each token value vector within the $b$-th sub-block, its Euclidean distance to the mean vector of that sub-block is calculated. The Euclidean distance compresses the overall deviation between two vectors in a high-dimensional embedding space into a scalar value that accounts for the joint contribution of deviations across all embedding dimensions, making it suitable as a comprehensive measure of how far a vector deviates from the center. The Euclidean distance $d_{b,i}$ from the $i$-th token value vector in the $b$-th sub-block to the sub-block mean vector is calculated as:
[0042] $$d_{b,i} = \lVert v_i - \mu_b \rVert_2$$;
[0043] where $v_i$ is the value vector at the $i$-th token position; $\mu_b$ is the mean vector of the $b$-th sub-block calculated earlier in step S102; and $\lVert \cdot \rVert_2$ is the L2 norm operator, which takes the square root of the sum of the squares of the components of the vector difference. Therefore, when the semantic information carried by a token differs strongly from that of the other tokens in the sub-block, the Euclidean distance of its value vector from the mean center increases accordingly.
[0044] Next, a second-order statistical analysis is performed on the Euclidean distance values of all tokens within the $b$-th sub-block. Specifically, the arithmetic mean $\bar{d}_b$ and the standard deviation $\sigma_b$ of the Euclidean distance values within that sub-block are calculated. Relying solely on the mean distance cannot distinguish between two distribution patterns that have different implications for information density assessment: all tokens lying uniformly far from the center, and most tokens lying close to the center while a few tokens lie extremely far away. Introducing the standard deviation captures the fluctuation of the distance values around their mean, thus characterizing the intrinsic structure of the dispersion more accurately. The arithmetic mean distance $\bar{d}_b$ and the distance standard deviation $\sigma_b$ are calculated as:
[0045] $$\bar{d}_b = \frac{1}{n_b} \sum_{i \in S_b} d_{b,i}$$;
[0046] $$\sigma_b = \sqrt{\frac{1}{n_b} \sum_{i \in S_b} \left( d_{b,i} - \bar{d}_b \right)^2}$$;
[0047] where $d_{b,i}$ is the Euclidean distance from the $i$-th token value vector in the $b$-th sub-block to the mean vector of the sub-block; $\bar{d}_b$ is the arithmetic mean of all Euclidean distance values within the sub-block; and the summation covers all $n_b$ tokens within the sub-block. Therefore, the larger the differences among the token-to-center distances within the sub-block, the larger the standard deviation $\sigma_b$, and the more complex the hierarchical structure of the value vector distribution in space.
[0048] Subsequently, the first channel feature, namely the dispersion coefficient, is constructed from the standard deviation and the arithmetic mean distance. Because the text content of different sub-blocks occupies different positions and scales in the semantic space, directly comparing the absolute values of their standard deviations would introduce scale bias. Using the ratio of the standard deviation to the mean eliminates this scale difference across sub-blocks, making the dispersion coefficient a dimensionless relative metric. The dispersion coefficient $D_b$ of the $b$-th sub-block is calculated as:
[0049] $$D_b = \frac{\sigma_b}{\bar{d}_b + \epsilon}$$;
[0050] where $\sigma_b$ is the standard deviation of the Euclidean distances of the value vectors within the $b$-th sub-block; $\bar{d}_b$ is the arithmetic mean of the Euclidean distances of the value vectors within the sub-block; and $\epsilon$ is an extremely small positive number that prevents division by zero. In this embodiment, $\epsilon$ is preferably set to [value missing]; in practical applications, depending on the required numerical precision, it can be adjusted within the range [value missing] to [value missing].
[0051] From this formula, it can be seen that the value of the dispersion coefficient $D_b$ directly reflects how widely the value vectors within the corresponding sub-block are scattered in the semantic space relative to their mean center. When $\sigma_b$ increases relative to $\bar{d}_b$, $D_b$ increases, indicating that the tokens within the text area covered by the sub-block carry highly differentiated semantic information and that the distances of the value vectors to the mean center fluctuate significantly, so the information density of the region is high. Conversely, when $\sigma_b$ decreases relative to $\bar{d}_b$, $D_b$ decreases, indicating that the tokens within the text area covered by the sub-block carry similar, redundant semantic information and that the distances of the value vectors to the mean center are close to each other, so the information density of the region is low.
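The first-channel computation of step S102 (mean vector, Euclidean distances, and the ratio of their standard deviation to their mean) can be sketched as follows. The preferred value of the small constant is not legible in the source text, so the default `eps=1e-8` is purely an assumption, as is the use of the population standard deviation (NumPy's default `ddof=0`). The function name is hypothetical.

```python
import numpy as np

def dispersion_coefficient(values, eps=1e-8):
    """Dispersion coefficient of one sub-block.

    values: array of shape (n_tokens, dim) holding the value vectors.
    eps: assumed guard against division by zero (source value unknown).
    """
    values = np.asarray(values, dtype=np.float64)

    # Element-wise mean over tokens gives the sub-block center.
    mean_vec = values.mean(axis=0)

    # Euclidean distance of every token's value vector to the center.
    dist = np.linalg.norm(values - mean_vec, axis=1)

    # Standard deviation of the distances relative to their mean.
    return dist.std() / (dist.mean() + eps)
```

Four vectors placed symmetrically at equal distance from their center give a coefficient of essentially zero. Three identical vectors plus one distant vector, for example `[[0,0],[0,0],[0,0],[4,0]]`, give distances `[1, 1, 1, 3]` and a coefficient of about 0.577.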
[0052] For example, in one inference task of this embodiment, a mixed-content technical document containing 8192 tokens is processed. Consider how the dispersion coefficient changes from a sparse background description sub-block to an information-dense parameter definition sub-block. From the dispersion coefficient formula (the ratio of the distance standard deviation to the arithmetic mean distance), it can be seen that when the content corresponding to the sub-block changes from redundant formatted padding text to a dense list of key technical parameters, the semantic information carried by each token in the latter is highly differentiated, so the standard deviation rises significantly relative to the average distance, and the dispersion coefficient increases correspondingly.
[0053] Specifically, if an unrecoverable bit-flip error occurs in the graphics processing unit's memory and corrupts the value vector data of a sub-block, the calculated dispersion coefficients may contain extreme outliers. To address this, the system is configured to perform an outlier detection pass after calculating the dispersion coefficients of all sub-blocks. Any coefficient that deviates from the mean of all sub-blocks' dispersion coefficients by more than three standard deviations is clamped to the boundary of that range, which prevents abnormal data from interfering with subsequent weight calculations.
[0054] After the above processing, the dispersion coefficient sequence of the sub-blocks is obtained, in which the $b$-th element represents the local information density level of the $b$-th sub-block in the value vector space. This sequence serves as the input of the first channel.
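The three-standard-deviation clamping can be illustrated with a short sketch. It assumes that the mean and standard deviation are computed over all sub-blocks including the suspected outlier, which the description does not specify. The name `clamp_outliers` is hypothetical.

```python
import numpy as np

def clamp_outliers(coeffs, k=3.0):
    """Clamp values lying more than k standard deviations from the mean.

    coeffs: dispersion coefficients of all sub-blocks.
    The statistics include the outliers themselves (an assumption).
    """
    c = np.asarray(coeffs, dtype=np.float64)
    mu = c.mean()
    sd = c.std()

    # Values outside [mu - k*sd, mu + k*sd] are moved to the nearest bound.
    return np.clip(c, mu - k * sd, mu + k * sd)
```

Note that with a population standard deviation, a single value can exceed three standard deviations only when there are at least 11 sub-blocks, so the clamp has no effect on very short sequences of coefficients.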
[0055] S103. Perform multi-head aggregation and column-wise accumulation on the attention weight matrix along the query dimension, and calculate the concentration of attention distribution at the sub-block granularity to extract the second channel features.
[0056] First, the attention weight matrix is read from the last-layer attention module of the large language model inference engine. The attention weight matrix has dimensions $H \times N \times N$, that is, the number of attention heads by the sequence length by the sequence length. Each two-dimensional slice holds the normalized attention scores from all query positions to all key positions under one attention head. In this embodiment, the number of attention heads is preferably set to 32, and the sequence length is the value $N$ determined in step S101.
[0057] It should be noted that although this embodiment uses the weight matrix of the last layer attention module as the data source, in other embodiments, the average value of the multi-layer attention weights or the weighted fusion value of the intermediate layer and the last layer can also be used to achieve the same feature extraction effect.
[0058] Furthermore, an arithmetic mean aggregation operation is performed on the attention weight matrix along the attention head dimension. This averages the attention score of each query position and key position pair across all attention heads, resulting in an average attention matrix whose dimensions are the sequence length by the sequence length. Since different attention heads learn different attention patterns over the input sequence during training, the weight distribution of a single attention head may be biased. Averaging across all attention heads yields a more representative overall attention pattern. The average attention matrix $\bar{A}$ is calculated as:
[0059] $$\bar{A} = \frac{1}{H} \sum_{h=1}^{H} A^{(h)}$$;
[0060] where $H$ is the number of attention heads; $A^{(h)}$ is the attention weight matrix of the $h$-th attention head, with dimensions $N \times N$; and $N$ is the sequence length. Therefore, as the number of attention heads involved in the averaging increases, the ability of the average attention matrix to suppress the accidental biases of individual attention heads increases correspondingly.
[0061] Then, a column-wise summation is performed on the average attention matrix. That is, for each key position, the average attention scores of all query positions for that key position are summed, thus obtaining the attention level of each token position. The essence of the attention mechanism lies in the fact that query positions extract information from key positions. The more query positions access a key position with a high attention score, the more widely the information carried by that key position is relied upon during the reasoning process of the entire sequence. Column-wise summation accumulates this global degree of reliance into a scalar value. The attention level $a_j$ of each token position is calculated as:
[0062] $$a_j = \sum_{i=1}^{N} \bar{A}_{ij}$$
[0063] where $\bar{A}_{ij}$ is the element in row $i$ and column $j$ of the average attention matrix, representing the average attention weight of query position $i$ on key position $j$, and $N$ is the sequence length. Therefore, when a token position is attended to by more positions in the sequence with higher attention weights during inference, the attention level $a_j$ of that token position increases correspondingly, reflecting the higher importance of this token as a global information source.
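The column-wise summation can be sketched as follows, assuming the head-averaged attention matrix is a NumPy array with query positions along the rows and key positions along the columns. The name `attention_levels` and the sample matrix are illustrative assumptions.

```python
import numpy as np

def attention_levels(avg_attn: np.ndarray) -> np.ndarray:
    """Sum each column: total attention that every key position receives."""
    return avg_attn.sum(axis=0)

# Head-averaged toy matrix (assumed values); rows are queries, columns are keys.
avg_attn = np.array([
    [1.0, 0.00, 0.00],
    [0.7, 0.30, 0.00],
    [0.4, 0.25, 0.35],
])
levels = attention_levels(avg_attn)  # approximately [2.1, 0.55, 0.35]
```

Because every row of a softmax attention matrix sums to one, the attention levels of a sequence always sum to the sequence length, so the levels describe how a fixed attention budget is split among token positions.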
[0064] Next, a quantitative analysis of the concentration of the attention distribution is performed at the sub-block granularity. Specifically, for the $k$-th sub-block, the attention levels of the token positions within the sub-block are extracted and arranged in ascending order. The Gini coefficient, a classic statistic for measuring the degree of distributional inequality, can effectively distinguish between two scenarios: all tokens receiving moderate attention evenly, and a very small number of tokens receiving high attention while the rest receive almost none. Therefore, the correlation concentration (also referred to as association concentration) $G_k$ of the $k$-th sub-block is calculated as:
[0065] $$G_k = \frac{\sum_{i=1}^{n} (2i - n - 1)\, a_{(i)}}{n \sum_{i=1}^{n} a_{(i)} + \varepsilon}$$
[0066] where $i$ is the rank index of the attention levels within the sub-block after sorting in ascending order, ranging from 1 to $n$; $a_{(i)}$ is the attention level ranked $i$-th within the $k$-th sub-block; $n$ is the number of tokens contained in the sub-block; and $\varepsilon$ is an extremely small positive number that prevents division by zero. The correlation concentration $G_k$ takes values in the left-closed, right-open interval $[0, 1)$. In this embodiment, $\varepsilon$ is preferably set to [value missing]; however, in practical applications, depending on the required numerical precision, $\varepsilon$ can be adjusted within the range of [range missing].
[0067] From this formula, it can be seen that the value of the correlation concentration $G_k$ directly reflects the degree to which attention resources within the corresponding sub-block are concentrated on a few tokens. When $G_k$ increases, a small number of tokens within the sub-block attract the majority of the attention allocated within that block. This means that these tokens play an irreplaceable role as information sources during global reasoning, and the sub-block contains high-value semantic anchors. Conversely, when $G_k$ approaches zero, the tokens within the sub-block receive similar levels of attention. Attention resources are evenly and diffusely distributed within the sub-block, meaning that there are no prominent semantic key nodes in the block, and its information contribution is dispersed and compressible.
[0068] For example, it can be seen from the formula that when a sub-block changes from a plain background description to a clause containing definitions of core terms, the attention levels of the few defining tokens in the latter are much higher than those of other tokens within the same sub-block, because these defining tokens are frequently accessed by subsequent queries that heavily reference those terms. Consequently, the contribution of higher-ranking tokens to the rank-weighted sum after sorting increases sharply, and the correlation concentration $G_k$ increases correspondingly.
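The behaviour described in the two preceding paragraphs can be checked with a small sketch. It assumes the standard rank-weighted form of the Gini coefficient with a small stabilizing constant in the denominator; the function name `gini_concentration` and the default `eps=1e-8` are assumptions, since the embodiment's own constant is not recoverable from the text.

```python
def gini_concentration(levels, eps=1e-8):
    """Gini-style concentration of the attention levels inside one sub-block.

    levels: non-negative attention levels of the tokens in the sub-block.
    eps:    assumed stabilizer that avoids division by zero.
    """
    ordered = sorted(levels)              # ascending order, ranks 1..n
    n = len(ordered)
    weighted = sum((2 * i - n - 1) * a for i, a in enumerate(ordered, start=1))
    return weighted / (n * sum(ordered) + eps)

uniform_block = gini_concentration([1.0, 1.0, 1.0, 1.0])   # evenly spread
anchored_block = gini_concentration([0.0, 0.0, 0.0, 4.0])  # one dominant token
```

With four tokens, the evenly spread block scores zero and the block dominated by a single token scores about 0.75, the largest value reachable for that block size, which is why the measure stays strictly below one.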
[0069] After the above processing, the correlation concentration sequence of the sub-blocks is obtained, in which the $k$-th element represents the degree of global semantic association concentration of the $k$-th sub-block in the attention weight allocation dimension. This sequence is used as the second channel input.
[0070] S104. Perform a dual-channel consistency ratio check on the dispersion coefficient of the first channel and the correlation concentration of the second channel, and merge them to generate adaptive retention weights for each sub-block.
[0071] First, read the dispersion coefficient sequence of the sub-blocks output in step S102 and the correlation concentration sequence of the sub-blocks output in step S103.
[0072] Furthermore, the engineering necessity of introducing a dual-channel verification mechanism needs to be clarified. If only the dispersion coefficient is relied upon for cache quota allocation, when the text region corresponding to a sub-block happens to contain a large number of unrelated proper noun enumerations, the distribution of value vectors in the embedding space may exhibit high dispersion. However, these tokens are not key information sources in global inference, leading to the misallocation of reserved quotas to low-value regions. If only the association concentration is relied upon for quota allocation, when the attention mechanism has an inherent positional encoding bias for the sequence start position, sub-blocks located at the beginning of the sequence may exhibit high association concentration. However, this concentration does not stem from true semantic importance, also leading to misallocation of quotas. Based on this, a dual-channel consistency ratio is constructed to simultaneously verify the judgment results of both channels.
[0073] Then, the dual-channel consistency ratio is calculated for each sub-block. The ratio of the smaller to the larger of the two channel feature values measures the degree of agreement between two independent metrics in assessing the importance of the same object. The ratio approaches one when the two channels give similar assessments, and approaches zero when they give significantly different assessments, thus forming a mechanism to suppress false alarms from a single channel. The consistency ratio $R_k$ of the $k$-th sub-block is calculated as:
[0074] $$R_k = \frac{\min(D_k, G_k)}{\max(D_k, G_k) + \varepsilon}$$
[0075] where $D_k$ is the dispersion coefficient of the $k$-th sub-block obtained in step S102; $G_k$ is the correlation concentration of the $k$-th sub-block obtained in step S103; $\min(\cdot)$ takes the smaller of the two channel feature values; $\max(\cdot)$ takes the larger of the two channel feature values; and $\varepsilon$ is an extremely small positive number that prevents division by zero, with the same value as in step S102. The consistency ratio $R_k$ is greater than zero and does not exceed one.
[0076] From this formula, it can be seen that the value of the consistency ratio $R_k$ directly reflects the degree of agreement between the two independent channels in assessing the importance of the $k$-th sub-block. When the dispersion coefficient $D_k$ and the correlation concentration $G_k$ are similar, the ratio of the smaller value to the larger value is close to one. As $R_k$ increases, the two independent evaluation dimensions provide mutually corroborating judgments for the sub-block; these judgments have high confidence, and the overall importance signal of the sub-block should subsequently be allowed to pass largely unattenuated. When $D_k$ and $G_k$ differ significantly, the ratio of the smaller value to the larger value approaches zero. A decrease in $R_k$ indicates a contradiction between the two evaluation dimensions; the higher value may be a false alarm, and the overall importance signal of this sub-block should subsequently be suppressed.
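A minimal sketch of the min-over-max consistency ratio follows. The function name `consistency_ratio`, the default `eps=1e-8`, and the sample inputs are assumptions chosen for illustration.

```python
def consistency_ratio(dispersion, concentration, eps=1e-8):
    """Smaller channel value divided by the larger one (eps is an assumed stabilizer)."""
    smaller = min(dispersion, concentration)
    larger = max(dispersion, concentration)
    return smaller / (larger + eps)

agreeing = consistency_ratio(0.5, 0.5)   # both channels give the same score
conflict = consistency_ratio(0.6, 0.3)   # one channel scores twice the other
```

The ratio is symmetric in its two arguments, so it does not matter which channel is the larger one; only the size of the disagreement lowers the gate.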
[0077] Next, an adaptive retention weight is constructed based on the consistency ratio and the mean of the two-channel features. The arithmetic mean of the two-channel features represents the overall importance level of the sub-block, and the consistency ratio is used as a gating modulation coefficient to weight the overall importance by confidence. Simultaneously, a global normalization term is introduced to ensure that the sum of the retention weights of all sub-blocks remains constant, thus changing only the quota allocation ratio between sub-blocks without altering the total global retention. The retention weight $W_k$ of the $k$-th sub-block is calculated as:
[0078] $$W_k = \frac{M \cdot R_k \cdot \dfrac{D_k + G_k}{2}}{\sum_{j=1}^{M} R_j \cdot \dfrac{D_j + G_j}{2} + \varepsilon}$$
[0079] where $R_k$ is the consistency ratio of the $k$-th sub-block; $D_k$ is the dispersion coefficient of the $k$-th sub-block; $G_k$ is the correlation concentration of the $k$-th sub-block; $(D_k + G_k)/2$ is the arithmetic mean of the two-channel features; $M$ is the total number of sub-blocks; and the summation term in the denominator is the sum of the overall importance of all sub-blocks after consistency ratio gating. The retention weight $W_k$ takes one, its average value over the $M$ sub-blocks, as the benchmark. A value greater than one indicates that the sub-block should receive a retention amount exceeding the uniform quota, while a value less than one indicates that the sub-block should receive a retention amount below the uniform quota. $\varepsilon$ is an extremely small positive number that prevents division by zero, the same as in step S103.
[0080] As can be seen from this formula, the magnitude of the retention weight $W_k$ is regulated by three factors. First, when the dispersion coefficient $D_k$ increases, the dual-channel mean term increases and $W_k$ tends to increase. Second, when the correlation concentration $G_k$ increases, the dual-channel mean term likewise increases and $W_k$ tends to increase. Third, even if the dual-channel mean is high, a low consistency ratio $R_k$ acts as a low gating coefficient that suppresses the weight of the sub-block, so that $W_k$ cannot obtain a quota commensurate with its apparent importance. The function of this mechanism is:
[0081] A sub-block will only be allocated a higher cache retention quota if it exhibits high feature values in both the value vector dispersion dimension and the attention association concentration dimension, and the evaluation results of the two dimensions are consistent.
[0082] For example, it can be seen from the formula that, assuming the dispersion coefficient of a certain sub-block is 0.8 and its correlation concentration is 0.75, the consistency ratio is close to 0.94 because the feature values of the two channels are similar; multiplying it by the dual-channel mean of 0.775 yields a relatively high gated importance value of about 0.73. For another sub-block, the dispersion coefficient is 0.9 but the correlation concentration is only 0.1. Although its dispersion coefficient is higher, its consistency ratio is only about 0.11, so its gated importance value, about 0.06 given the dual-channel mean of 0.5, is significantly lower than that of the former. Its retention weight is reduced accordingly, which effectively suppresses the false alarm raised by a single channel.
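The worked example can be reproduced with a short sketch of the gated, globally normalized retention weight. It assumes the arithmetic-mean fusion and the min-over-max gate described for this embodiment; the function name `retention_weights` and the default `eps=1e-8` are assumptions.

```python
def retention_weights(dispersion, concentration, eps=1e-8):
    """Gate the two-channel mean by the consistency ratio, then normalize.

    The returned weights sum to (approximately) the number of sub-blocks,
    so a weight of 1.0 corresponds to the uniform quota.
    """
    if len(dispersion) != len(concentration):
        raise ValueError("both channels must cover the same sub-blocks")
    gated = []
    for d, g in zip(dispersion, concentration):
        ratio = min(d, g) / (max(d, g) + eps)   # consistency ratio
        gated.append(ratio * (d + g) / 2.0)     # gated overall importance
    total = sum(gated) + eps
    return [len(gated) * value / total for value in gated]

# The two sub-blocks from the example: (0.8, 0.75) and (0.9, 0.1).
weights = retention_weights([0.8, 0.9], [0.75, 0.1])
```

With only these two sub-blocks, the weights come out at roughly 1.86 and 0.14: the sub-block on which both channels agree receives most of the quota, even though the other sub-block has the higher dispersion coefficient.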
[0083] As shown in Figure 2, in the formatted text filling scenario, because both the local information density and the global semantic relevance are low, the retention weights assigned by the single-channel and dual-channel mechanisms are both at a low level. In the key technical parameter list scenario, both mechanisms assign higher retention weights. However, in the proper noun enumeration scenario and the sequence start position scenario, the single-channel mechanism gives a higher retention weight because it is misled by a single feature. The dual-channel consistency verification mechanism introduced in this invention significantly reduces the overall importance in these two abnormal scenarios by comparing the consistency of the two channels, effectively correcting the resource misallocation caused by a single indicator.
[0084] Specifically, if network transmission jitter or asynchronous computation scheduling causes a discrepancy in the output sequence lengths of steps S102 and S103, the system is configured to perform a length alignment check before calculating the consistency ratio. In response to detecting a mismatch in the lengths of the two sequences, the system will trigger exception handling logic, discard the dual-channel feature data of the current batch, fall back to a degraded mode that performs cache compression with uniform weights, and report the exception event to the system log module.
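The length alignment check and the uniform-weight fallback can be sketched as a thin wrapper. All names here (`weights_with_fallback`, the `kv_cache` logger, the `compute_weights` callable) are hypothetical, and the standard `logging` module stands in for the system log module mentioned in the text.

```python
import logging

logger = logging.getLogger("kv_cache")  # stand-in for the system log module

def weights_with_fallback(dispersion, concentration, num_blocks, compute_weights):
    """Return retention weights, degrading to uniform weights on a length mismatch.

    compute_weights: callable taking the two channel sequences and returning
                     one retention weight per sub-block (assumed interface).
    """
    if len(dispersion) != len(concentration):
        logger.warning(
            "channel length mismatch (%d vs %d); using uniform weights",
            len(dispersion), len(concentration),
        )
        return [1.0] * num_blocks  # uniform quota for every sub-block
    return compute_weights(dispersion, concentration)

# Mismatched channels: the dual-channel data is discarded for this batch.
degraded = weights_with_fallback([0.4, 0.6], [0.5], 2, lambda d, g: [0.0, 2.0])
```

A weight of 1.0 for every sub-block reproduces the uniform quota, so the downstream allocation step needs no special case for the degraded mode.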
[0085] It should be noted that although this embodiment uses the arithmetic mean to fuse the dual-channel features, in other embodiments, the geometric mean or harmonic mean can also be used to achieve the same dual-channel fusion effect. Furthermore, although this embodiment uses the ratio of the minimum to the maximum value as a consistency measure, in other embodiments, the cosine similarity or Pearson correlation coefficient between the two channel feature values can also be used as a consistency measure.
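The three fusion options named above can be compared with a small sketch. The function name `fuse`, the `mode` strings, and the default `eps=1e-8` are assumptions; the inputs are assumed to be non-negative, as dispersion coefficients and Gini-style concentrations are.

```python
import math

def fuse(dispersion, concentration, mode="arithmetic", eps=1e-8):
    """Fuse the two channel values with the selected mean (non-negative inputs)."""
    if mode == "arithmetic":
        return (dispersion + concentration) / 2.0
    if mode == "geometric":
        return math.sqrt(dispersion * concentration)
    if mode == "harmonic":
        return 2.0 * dispersion * concentration / (dispersion + concentration + eps)
    raise ValueError(f"unknown fusion mode: {mode}")

# A sub-block whose channels disagree strongly (0.9 versus 0.1).
fused = {mode: fuse(0.9, 0.1, mode) for mode in ("arithmetic", "geometric", "harmonic")}
```

For this disagreeing pair, the arithmetic mean gives 0.5, the geometric mean about 0.3, and the harmonic mean about 0.18, so the geometric and harmonic variants already penalize disagreement before any consistency gating is applied.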
[0086] After the above processing, an adaptive retention weight sequence of the sub-blocks is obtained, in which the $k$-th element represents the relative retention quota that the $k$-th sub-block should obtain in the subsequent cache compression, and which serves as the core control parameter of the non-uniform compression strategy.
[0087] S105. Based on the adaptive retention weight, the global cache retention quota is unevenly distributed to each sub-block, and within the sub-block, token-level filtering and compression reorganization are performed according to the order of importance.
[0088] First, determine the target total number of tokens to be retained, $K$. $K$ is calculated by the system based on the available video memory capacity of the current graphics processing unit and the storage footprint of the key-value pair of a single token. In this embodiment, $K$ is preferably set to 2048, which means retaining the key-value cache data of 2048 tokens out of 8192 original tokens, that is, retaining 25% of the original tokens, corresponding to a compression rate of 75%. However, in practical applications, depending on hardware resources and inference quality requirements, $K$ can be adjusted so that the retained proportion lies within the range of 10% to 50%.
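One way to derive the target total from a memory budget is sketched below. It assumes the usual key-value cache layout, in which each token stores one key vector and one value vector per layer and per key-value head; the function name and the model dimensions (32 layers, 8 key-value heads, head dimension 128, 2-byte elements, 256 MiB budget) are hypothetical and were chosen only so that the result matches the 2048 tokens of this embodiment.

```python
def target_retained_tokens(free_bytes, num_layers, num_kv_heads, head_dim, bytes_per_element):
    """Number of tokens whose key-value pairs fit into the available memory."""
    # Factor 2: one key vector and one value vector per layer and KV head.
    bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_element
    return free_bytes // bytes_per_token

# Hypothetical configuration: 128 KiB per token, 256 MiB budget.
k_total = target_retained_tokens(
    free_bytes=256 * 1024**2,
    num_layers=32,
    num_kv_heads=8,
    head_dim=128,
    bytes_per_element=2,
)
retained_ratio = k_total / 8192  # share of the 8192 original tokens that is kept
```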
[0089] Further, the retention weights of the sub-blocks output in step S104 are converted into the specific number of retained tokens for each sub-block. For the $k$-th sub-block, the number of tokens it should retain, $T_k$, is calculated as follows:
[0090] The base retention amount of each sub-block under the uniform strategy serves as the benchmark and is multiplied by the retention weight of that sub-block as an adjustment multiplier, so that high-weight sub-blocks receive a retention amount exceeding the benchmark while the retention amount of low-weight sub-blocks is correspondingly compressed. The number of retained tokens $T_k$ of the $k$-th sub-block is calculated as:
[0091] $$T_k = \left\lfloor W_k \cdot \frac{K}{M} \right\rfloor$$
[0092] where $W_k$ is the retention weight of the $k$-th sub-block obtained in step S104; $K$ is the target total number of retained tokens; $M$ is the total number of sub-blocks; $K/M$ is the base retention amount of each sub-block under the uniform strategy; and $\lfloor \cdot \rfloor$ denotes rounding down. In this embodiment, the base retention amount is $K/M = 32$ tokens.
[0093] From this formula, it can be seen that when the retention weight $W_k$ is greater than one, $T_k$ is greater than the base retention amount of 32, and the sub-block receives a retention amount exceeding the uniform quota; when the retention weight $W_k$ is less than one, $T_k$ is less than the base retention amount of 32, and the retention amount of that sub-block is reduced below the base quota. Through this differentiated quota allocation, sub-blocks corresponding to information-intensive regions retain more key token key-value pairs, while sub-blocks corresponding to information-sparse regions retain only a small number of representative token key-value pairs.
[0094] Then, because the rounding operation may cause a slight deviation between the sum of the numbers of tokens retained in all sub-blocks and the target total number of retained tokens, the system performs compensation processing on this deviation. Specifically, the difference between the sum of the retained tokens after rounding and the target total number of retained tokens is calculated: if the difference is positive, one retained token is deducted from the sub-block with the lowest retention weight; if the difference is negative, one retained token is added to the sub-block with the highest retention weight. This is repeated until the total number of retained tokens is strictly equal to the target total number of retained tokens.
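The quota conversion and the compensation step can be sketched together. This is one possible reading, not the definitive procedure: it rounds down, then hands out or removes one token at a time while cycling through the sub-blocks in order of retention weight. The function name `allocate_quotas` is hypothetical, and the sketch does not cap a quota at the size of its sub-block.

```python
import math

def allocate_quotas(weights, target_total):
    """Convert retention weights into per-sub-block token counts summing to target_total."""
    m = len(weights)
    base = target_total / m                        # uniform base retention amount
    quotas = [math.floor(w * base) for w in weights]

    ascending = sorted(range(m), key=lambda k: weights[k])
    descending = ascending[::-1]
    diff = sum(quotas) - target_total

    step = 0
    while diff > 0:                                # too many: trim lowest weights first
        k = ascending[step % m]
        if quotas[k] > 0:
            quotas[k] -= 1
            diff -= 1
        step += 1
    step = 0
    while diff < 0:                                # too few: top up highest weights first
        quotas[descending[step % m]] += 1
        diff += 1
        step += 1
    return quotas

quotas = allocate_quotas([1.86, 0.14], target_total=64)  # base amount is 32
```

Here rounding down gives 59 and 4 tokens, one short of the target of 64, and the missing token goes to the sub-block with the higher weight, giving 60 and 4.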
[0095] Next, token-level filtering is performed within each sub-block. For each sub-block, the attention level of each token within the sub-block, already calculated in step S103, is read, and the tokens within the sub-block are sorted by attention level from highest to lowest. The key-value cache data corresponding to the top-ranked tokens, up to the number of retained tokens allocated to that sub-block, is retained, and the key-value cache data of the remaining tokens is discarded.
[0096] Subsequently, the key-value pairs retained from each sub-block are rearranged and concatenated in ascending order of their indices in the original sequence, forming a compressed key-value cache sequence whose length equals the target total number of retained tokens. The compressed key-value cache sequence preserves the relative positional relationships between tokens in the original sequence and can directly replace the original key-value cache data as input to the attention calculation module of the large language model for subsequent autoregressive decoding inference.
[0097] In particular, if multiple tokens within a sub-block have the exact same level of attention during the token-level filtering process and cannot be strictly truncated according to their ranking, the system is configured to adopt a strategy of prioritizing the tokens with smaller position indices among tokens with the same level of attention, that is, prioritizing the tokens that are earlier in the original sequence.
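The filtering, tie-breaking, and reordering steps can be sketched as an index selection. The sketch assumes equal-sized contiguous sub-blocks and returns the original positions to keep, which would then be used to gather the key and value tensors; the function name `select_tokens` and the sample values are hypothetical.

```python
def select_tokens(levels, block_size, quotas):
    """Return the sorted original positions of the tokens to retain.

    levels:     attention level of every token position in the sequence.
    block_size: number of tokens per sub-block (assumed equal for all blocks).
    quotas:     number of tokens to retain in each sub-block.
    """
    kept = []
    for k, quota in enumerate(quotas):
        start = k * block_size
        positions = range(start, min(start + block_size, len(levels)))
        # Highest attention first; equal levels fall back to the smaller index.
        ranked = sorted(positions, key=lambda j: (-levels[j], j))
        kept.extend(ranked[:quota])
    return sorted(kept)  # restore original sequence order

levels = [0.9, 0.1, 0.5, 0.5, 0.2, 0.2, 0.7, 0.1]
kept_positions = select_tokens(levels, block_size=4, quotas=[2, 1])
```

In the first sub-block, positions 2 and 3 tie at 0.5, and the earlier position 2 wins the second slot, so the retained positions are 0, 2, and 6.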
[0098] It should be noted that although this embodiment uses attention level as the sorting criterion for token-level filtering within sub-blocks, other embodiments can also use a weighted comprehensive score of dispersion coefficient and attention level, or a sorting strategy based on the cosine distance between the token embedding vector and the sub-block mean vector, to achieve the same filtering effect within sub-blocks. Furthermore, although this embodiment uses a direct discarding method to handle unretained token key-value data, in other embodiments with sufficient video memory overflow space, the discarded key-value data can be temporarily stored in the host-side system memory or solid-state drive for later retrieval when backtracking is required during subsequent inference.
Claims
1. A data management method for large language models, characterized in that, The method includes: dividing the key-value cache sequence generated during the inference process of a large language model into multiple contiguous sub-blocks along the token position dimension; For each sub-block, the dispersion coefficient, which characterizes the local information density level, is calculated based on the statistical dispersion of the Euclidean distance between the value vector of each token in the sub-block and the mean vector of the sub-block. For each sub-block, the degree of uneven distribution of the cumulative global attention of each token in the attention weight matrix generated by the large language model inference process is used to calculate the correlation concentration, which represents the degree of global semantic association concentration. The consistency ratio is calculated based on the degree of consistency between the dispersion coefficient and the correlation concentration, and the fusion value of the dispersion coefficient and the correlation concentration is gated and modulated in combination with the consistency ratio to generate the retention weight of each sub-block. Based on the retention weight, the available cache retention quota of the system is non-uniformly distributed to each sub-block, and within each sub-block, tokens are filtered according to their attention value to generate a compressed key-value cache sequence.
2. The data management method for large language models according to claim 1, characterized in that, The method for calculating the dispersion coefficient includes: Perform element-wise mean aggregation on the value vectors of all tokens within the sub-block to obtain the sub-block mean vector; Calculate the Euclidean distance between the value vector of each token in the sub-block and the mean vector of the sub-block; Calculate the arithmetic mean distance and standard deviation of all Euclidean distance values within a sub-block; The ratio of the standard deviation of the distance to the arithmetic mean distance is used as the dispersion coefficient of the sub-block.
3. The data management method for large language models according to claim 1, characterized in that, The method for calculating the correlation concentration includes: The average attention matrix is obtained by performing arithmetic mean aggregation on the attention weight matrix along the attention head dimension; The attention value for each token position is obtained by performing a summation operation on the average attention matrix along its column direction. Within each sub-block, the attention values are arranged in ascending order, and the correlation concentration of the sub-block is calculated based on the sorted attention values using the Gini coefficient formula.
4. The data management method for large language models according to claim 1, characterized in that, The consistency ratio is calculated as follows: The smaller of the dispersion coefficient and the correlation concentration is taken as the numerator, and the larger of the two is taken as the denominator. The ratio of the numerator to the denominator is taken as the consistency ratio of the sub-block.
5. The data management method for large language models according to claim 1, characterized in that, The methods for generating the retained weights include: The overall importance level of the sub-block is characterized by the arithmetic mean of the dispersion coefficient and the correlation concentration. The overall importance level is weighted with confidence using the consistency ratio as a gating coefficient. A global normalization term is introduced to ensure that the sum of the retention weights of all sub-blocks is always equal to the total number of sub-blocks, so as to guarantee that the retention weights only change the quota allocation ratio between sub-blocks.
6. The data management method for large language models according to claim 1, characterized in that, Based on the retention weight, the available cache retention quota of the system is non-uniformly distributed to each sub-block, including: The target total number of reserved tokens is determined based on the current available video memory capacity and the storage usage of a single token key-value pair; The base retention amount is obtained by dividing the total number of target reserved tokens by the total number of sub-blocks; Multiply the retention weight of each sub-block by the base retention amount and round down to get the number of retention tokens for that sub-block.
7. A data management method for large language models according to claim 6, characterized in that, Also includes: In response to a discrepancy between the sum of the rounded-down number of reserved tokens in all sub-blocks and the target total number of reserved tokens, a compensation process is performed on the discrepancy. Positive discrepancies are deducted sequentially from the sub-blocks with the lowest retention weight, and negative discrepancies are sequentially added to the sub-blocks with the highest retention weight, until the total number of reserved tokens is strictly equal to the target total number of reserved tokens.
8. A data management method for large language models according to claim 6, characterized in that, Within each sub-block, filtering tokens according to their attention value includes: Sort all tokens within a sub-block according to their attention value from highest to lowest. Retain the key-value cache data corresponding to tokens whose ranking is within the number of retained tokens; The key-value pairs retained in each sub-block are rearranged and concatenated according to their position indices in the original sequence from smallest to largest to form the compressed key-value cache sequence.
9. A data management method for large language models according to claim 1, characterized in that, Also includes: After calculating the dispersion coefficients of all sub-blocks, outlier detection is performed, clamping dispersion coefficient values that exceed three standard deviations of the mean dispersion coefficients of all sub-blocks to the boundary values of that range.
10. A data management method for large language models according to claim 1, characterized in that, Also includes: In response to the detection that the sequence formed by the dispersion coefficients and the sequence formed by the correlation concentrations are of different lengths, a degradation processing logic is triggered, discarding the dual-channel feature data of the current batch and falling back to the degradation mode that performs cache compression on each sub-block with uniform weights.