Method and apparatus for optimizing cache during large language model inference

By evaluating the caching preferences of the attention layers of large models layer by layer, dynamically allocating cache regions and updating cache data, the memory limitation problem of large models in long text processing is solved, achieving more efficient use of cache resources and higher generation quality.

WO2026138159A1 · PCT designated stage · Publication Date: 2026-07-02 · ALIPAY (HANGZHOU) DIGITAL SERVICE TECHNOLOGY CO LTD

Patent Information

Authority / Receiving Office
WO · WO
Patent Type
Applications
Current Assignee / Owner
ALIPAY (HANGZHOU) DIGITAL SERVICE TECHNOLOGY CO LTD
Filing Date
2025-10-30
Publication Date
2026-07-02

AI Technical Summary

Technical Problem

When processing long texts, the increased computational load of the attention layer in existing large models leads to memory limitations in hardware devices. Existing caching optimization methods cannot effectively utilize cache resources, thus affecting the generation quality.

Method used

During the pre-filling stage, caching operations are performed layer by layer for multiple attention layers in the large model. By evaluating the changes of attention layers in the spatial and temporal dimensions, cache regions are dynamically allocated, and previous caches are updated in a cascading manner to optimize cache resource utilization.

Benefits of technology

Cache resources are used more fully within a limited memory budget, with almost no impact on the generation quality of large models. The method adapts to various contexts and memory budgets, reduces peak memory usage, and avoids information loss.


Figure CN2025131212_02072026_PF_FP_ABST

Abstract

Provided in the embodiments of the present description is a method for optimizing a cache during large language model inference. The method comprises: in a pre-filling stage, performing cache operations layer by layer for a plurality of attention layers in a large language model. The cache operation for any ith layer comprises: acquiring a target attention matrix of the ith layer; on the basis of the distributions of row data and column data of the target attention matrix, determining a first indicator value and a second indicator value, respectively; on the basis of the first indicator value and the second indicator value, determining an ith preference score corresponding to the ith layer; on the basis of the ith preference score, determining from a total cache area a target cache area allocated to the ith layer, and storing in the target cache area attention data of target characters in input text; and on the basis of the ith preference score, updating prior cache areas of layers preceding the ith layer, and updating attention data of characters stored therein.

Description

Caching Optimization Methods and Devices in Large Model Inference

[0001] This application claims priority to Chinese Patent Application No. 202411931981.1, filed on December 25, 2024, entitled "Cache Optimization Method and Apparatus in Large Model Inference", the entire contents of which are incorporated herein by reference.

Technical Field

[0002] This specification relates to one or more embodiments in the field of computer technology, and in particular to a cache optimization method and apparatus for large model inference.

Background Technology

[0003] Large Language Models (LLMs), often simply called large models, are artificial intelligence models with a large number of parameters, specifically designed to process and generate natural language text. They are trained on massive amounts of text data, are typically based on the Transformer architecture, and have multiple self-attention layers containing billions or even hundreds of billions of parameters. The more layers a model has, the better it can generally understand and generate complex text structures.

[0004] Currently, large models have significantly improved their ability to process long texts; for example, some large models can handle more than 128k characters (tokens). However, as text length increases, the computational load of the attention layer in large models increases dramatically. A common strategy is to store intermediate results of the keys and values computed for the attention layer during the large model's inference process, also known as a key-value cache (KV cache). This significantly reduces computational complexity because the large model does not need to recalculate the keys and values of all previous characters when generating new characters; instead, it can efficiently generate new characters by referencing the information in the cache. However, due to the memory limitations of hardware devices such as GPUs, optimizing the cache during large model inference is crucial.

Summary of the Invention

[0005] This specification describes one or more embodiments of a caching optimization method and apparatus for large model inference, which can effectively perform key-value caching within limited memory resources, thereby ensuring the generation quality of large models.

[0006] Firstly, a caching optimization method for large model inference is provided, comprising: in the pre-filling stage, performing caching operations layer by layer for multiple attention layers in the large model, wherein the caching operation for any i-th layer includes:

[0007] Obtain the target attention matrix of the i-th layer, which contains the attention coefficients between characters in the input text calculated based on the parameters of the i-th layer;

[0008] Based on the distribution of the row data of the target attention matrix, a first index value representing spatial dispersion is determined; and based on the distribution of the column data of the target attention matrix, a second index value representing temporal offset is determined.

[0009] Based on the first indicator value and the second indicator value, determine the i-th preference score corresponding to the i-th layer;

[0010] Based on the i-th preference score, determine the target cache region allocated for the i-th layer in the total cache region, and store the attention data of the target characters in the input text therein;

[0011] Based on the i-th preference score, update the prior cache regions of each layer before the i-th layer, and update the attention data of the characters stored therein.

[0012] Secondly, a cache optimization device is provided, comprising:

[0013] The acquisition unit is used to acquire the target attention matrix of the i-th layer, which contains the attention coefficients between characters in the input text calculated based on the parameters of the i-th layer;

[0014] The determining unit is configured to determine a first index value representing spatial dispersion based on the distribution of the row data of the target attention matrix; and to determine a second index value representing temporal offset based on the distribution of the column data of the target attention matrix.

[0015] The determining unit is further configured to determine the i-th preference score corresponding to the i-th layer based on the first index value and the second index value.

[0016] A storage unit is used to determine the target cache region allocated for the i-th layer in the total cache region based on the i-th preference score, and to store attention data of the target character in the input text therein;

[0017] The update unit is used to update the prior cache regions of each layer before the i-th layer according to the i-th preference score, and to update the attention data of the characters stored therein.

[0018] Thirdly, a computer-readable storage medium is provided having a computer program stored thereon, which, when executed in a computer, causes the computer to perform the method of the first aspect.

[0019] Fourthly, a computing device is provided, including a memory and a processor, wherein the memory stores executable code, and the processor executes the executable code to implement the method of the first aspect.

[0020] The cache optimization method and apparatus for large model inference provided in one or more embodiments of this specification perform caching operations layer by layer for multiple attention layers in the large model during the pre-filling stage. Specifically, for the current attention layer, the preference of the attention layer for caching is dynamically evaluated by considering the changes in the attention coefficient of the attention layer in the spatial and temporal dimensions, and a corresponding cache is allocated to the attention layer based on the preference. The caches of the previous attention layers are also updated in a cascading manner. As a result, the limited memory resources can be utilized better and more fully, while having almost no impact on the generation quality of the large model.

Attached Figure Description

[0021] To more clearly illustrate the technical solutions of the embodiments in this specification, the accompanying drawings used in the description of the embodiments will be briefly introduced below. Obviously, the accompanying drawings described below are only some embodiments of this specification. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0022] Figure 1 shows a comparison of the effects of the traditional key-value caching method and the proposed solution.

[0023] Figure 2 is a schematic diagram of an implementation scenario of an embodiment disclosed in this specification;

[0024] Figure 3 shows a flowchart of a cache optimization method in large model inference according to an embodiment of this specification;

[0025] Figure 4 shows a schematic diagram of the target attention matrix in one example of this specification;

[0026] Figure 5 shows a schematic diagram of a cache optimization method in one example of this specification;

[0027] Figure 6 shows a schematic diagram of a cache optimization apparatus according to one embodiment of this specification.

Detailed Implementation

[0028] The solution provided in this specification will now be described with reference to the accompanying drawings.

[0029] As mentioned earlier, key-value caching is required to improve the efficiency of generating characters from large models.

[0030] Currently, key-value caching is mainly carried out in the following two ways:

[0031] First, for each attention layer in the large model, a cache region of the same size is allocated, and key-value pairs of characters are stored according to a predetermined size. However, this approach does not take into account the heterogeneity of the model and each attention layer within the model, thus failing to effectively utilize cache resources.

[0032] Second, for each attention layer of the large model, a corresponding cache region is allocated according to a fixed proportion based on pre-observed patterns. This method is only applicable to specific models and cannot be generalized to multiple models.

[0033] To address this, this solution proposes that during the pre-filling stage, caching operations be performed layer by layer for multiple attention layers in the large model. Specifically, for the current attention layer, the preference of the attention layer for caching is dynamically evaluated by considering the changes in the attention coefficient of the attention layer in the spatial and temporal dimensions. Based on this preference, a corresponding cache is allocated to the attention layer, and the caches of previous attention layers are updated in a cascading manner. This allows for better and more efficient use of limited memory resources, while having almost no impact on the generation quality of the large model.

[0034] Figure 1 shows a comparison of the effects of traditional key-value caching methods and the proposed solution. In Figure 1, the upper part shows two traditional key-value caching methods. The left side shows a key-value caching method based on a uniform allocation ratio, where for any input text, the cache allocation ratio of each of the four attention layers of the large model is 0.25. The right side shows a key-value caching method based on a fixed allocation pattern, where for any input text, the cache allocation ratio of the four attention layers of the large model is fixed at 0.50, 0.34, 0.18, and 0.02.

[0035] In this scheme, for input text A or input text B, two metric values are first determined for each of the four attention layers of the large model. One metric value represents the spatial dispersion of the attention coefficients of that attention layer, and the other represents the temporal offset of the attention coefficients. Next, based on the two metric values for each attention layer, a corresponding layer preference is determined, and a corresponding cache is allocated to each attention layer of the large model based on this preference. Specifically, for input text A, the cache allocation ratios for the four attention layers of the large model can be 0.25, 0.19, 0.27, and 0.29. For input text B, the cache allocation ratios for the four attention layers of the large model can be 0.31, 0.11, 0.35, and 0.23.

[0036] It should be understood that Figure 1 is merely an illustrative example. In practice, large models may include more than four attention layers, such as ten or even more, and this specification does not limit this. Furthermore, the cache allocation ratio for each attention layer is not limited to the values mentioned above.

[0037] As can be seen, in this scheme, the cache allocation ratio between different layers can be adjusted according to layer preferences to adapt to various contexts and given memory budgets. This allows for better and more efficient use of limited memory resources, while having almost no impact on the generation quality.

[0038] Figure 2 is a schematic diagram of an implementation scenario of an embodiment disclosed in this specification. In Figure 2, the inference process of the large model includes a pre-filling stage and a decoding stage.

[0039] In the pre-filling stage, caching is performed layer by layer for multiple attention layers in the large model. Specifically, for any i-th layer, firstly, based on the distribution of attention coefficients in the target attention matrix corresponding to the i-th layer, two indicators representing spatial dispersion and temporal offset are determined. Next, based on these two indicators, a preference score is determined for the i-th layer. Finally, based on this preference score, a corresponding cache region is determined to store the attention data of the target characters in the input text, and the prior cache regions of all layers before the i-th layer are updated based on the preference score, along with the attention data of the characters stored therein.

[0040] During the decoding stage, the generated characters for the input text are output based on the attention data of the characters stored in the cache regions corresponding to the multiple attention layers.

[0041] It should be understood that in practice, the above decoding stage will be executed iteratively multiple times to output multiple generated characters.

[0042] As mentioned earlier, in this scheme, during the pre-filling stage, caching operations are performed layer by layer for multiple attention layers in the large model. Since the caching operations for each layer are similar, the following explanation uses any i-th layer as an example to illustrate the corresponding caching operations.

[0043] Figure 3 shows a flowchart of a caching optimization method in large model inference according to an embodiment of this specification. This method can be executed by any device, apparatus, platform, or cluster of devices with computing and processing capabilities. As shown in Figure 3, the method may include the following steps:

[0044] Step S302: Obtain the target attention matrix of the i-th layer, which contains the attention coefficients between characters in the input text calculated based on the parameters of the i-th layer.

[0045] Generally speaking, for each attention layer of a large model, the attention matrix corresponding to that attention layer can be calculated based on the character representations output by the previous attention layer for the characters in the input text, as well as the parameters of that attention layer (i.e., the query matrix, key matrix, and value matrix). This matrix contains the attention coefficients between characters in the input text.

[0046] Taking an input text of length $S$ (i.e., containing $S$ characters) as an example, the size of the attention matrix above is $S \times S$, which can be represented as follows:

$A = \mathrm{softmax}\left(\frac{QK^{\top}}{\sqrt{D}}\right)$ (Formula 1)

[0047] In Formula 1, $D$ is the dimension of the character representation, and $Q$ and $K$ are the query vectors and key vectors, respectively, where $Q = XW_Q$ and $K = XW_K$. Here, $X$ represents the character representations output by the previous attention layer for the characters in the input text, and $W_Q$ and $W_K$ are the query matrix and the key matrix, respectively.

[0048] It should be understood that in the above attention matrix of size S×S (hereinafter also called the original attention matrix), the rows from top to bottom (and likewise the columns from left to right) correspond to the characters of the input text from first to last.
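The calculation of the original attention matrix can be sketched in Python as follows. This is a minimal illustration that assumes the standard scaled dot-product form softmax(QK^T / sqrt(D)); the helper names `matmul`, `softmax`, and `attention_matrix` are hypothetical, and causal masking and multi-head splitting are omitted for brevity.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one row."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_matrix(x, w_q, w_k):
    """Return the S x S matrix softmax(Q K^T / sqrt(D)), with Q = X W_Q and K = X W_K."""
    q = matmul(x, w_q)
    k = matmul(x, w_k)
    d = len(q[0])  # assumption: D is taken as the width of the projected vectors
    k_t = [list(col) for col in zip(*k)]
    scores = matmul(q, k_t)
    return [softmax([s / math.sqrt(d) for s in row]) for row in scores]


# Toy input: S = 3 characters with D = 2, identity projections (illustrative values).
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
a = attention_matrix(x, identity, identity)
```

Each row of `a` sums to 1, so a row can be read as the distribution of one character's attention over all characters.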

[0049] In this scheme, the target attention matrix of the i-th layer mentioned above can refer to the submatrix of the original attention matrix corresponding to the i-th layer within a preset observation window, where the size of the observation window can be $S_w \times S$ or $S_w \times (S - S_w)$.

[0050] Where the size of the observation window is $S_w \times S$, the target attention matrix can be represented as $A_i[-S_w:, :]$, as shown on the left side of Figure 4 (filled with diagonal lines); that is, it contains the last $S_w$ rows and all $S$ columns of the original attention matrix. Where the size of the observation window is $S_w \times (S - S_w)$, the target attention matrix can be represented as $A_i[-S_w:, :-S_w]$, as shown on the right side of Figure 4 (filled with diagonal lines); that is, it contains the last $S_w$ rows and columns 1 to $S - S_w$ of the original attention matrix.

[0051] In summary, the target attention matrix described above always includes the last $S_w$ rows. It should be understood that these last $S_w$ rows correspond to the last $S_w$ characters of the input text (also known as the most recent characters); similarly, the last $S_w$ columns of the original attention matrix also correspond to the most recent characters. That is, in the case where the target attention matrix is $A_i[-S_w:, :-S_w]$, the target attention matrix does not include the columns corresponding to the most recent characters.

[0052] Of course, in practice, the original attention matrix corresponding to the i-th layer can be directly used as the target attention matrix corresponding to the i-th layer, and this specification does not limit this.

[0053] Furthermore, in practice, since a multi-head attention mechanism is usually used, the same observation window can be used to select the corresponding target attention matrix from the original attention matrix corresponding to each attention head. Then, the selected target attention matrices are combined (e.g., averaged) to obtain the final target attention matrix. Then, the following steps are performed based on the final target attention matrix.
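The two observation windows and the combination across attention heads can be sketched as follows. This is an illustrative Python sketch only: the function names are hypothetical, averaging is just the example combination named above, and 0 < S_w < S is assumed.

```python
def observation_window(attn, s_w, include_recent_columns=True):
    """Keep the last s_w rows; optionally drop the last s_w columns."""
    rows = attn[-s_w:]
    if include_recent_columns:
        return [list(r) for r in rows]        # A[-S_w:, :]
    return [list(r[:-s_w]) for r in rows]     # A[-S_w:, :-S_w]


def average_heads(windows):
    """Element-wise mean of per-head target matrices that share one shape."""
    n = len(windows)
    return [[sum(vals) / n for vals in zip(*rows)] for rows in zip(*windows)]


# Toy 4 x 4 matrix (S = 4) with S_w = 2; the values are placeholders, not real attention.
attn = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
full_window = observation_window(attn, 2)            # S_w x S
trimmed_window = observation_window(attn, 2, False)  # S_w x (S - S_w)
merged = average_heads([[[1.0, 3.0]], [[3.0, 5.0]]])
```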

[0054] Step S304: Determine the index value E representing spatial dispersion based on the distribution of row data in the target attention matrix; and determine the index value V representing temporal offset based on the distribution of column data in the target attention matrix.

[0055] Specifically, for each target row in the target attention matrix, the corresponding information entropy is calculated, and then the information entropies are summed to obtain the aforementioned index value E. For each target column in the target attention matrix, the corresponding variance is calculated, and then the variances are summed to obtain the index value V.

[0056] It should be understood that the above is merely an example of how to calculate the indicator values E and V. In practice, other calculations can be performed on each target row or column in the target attention matrix to obtain the two indicator values. For example, when calculating the indicator value V, the calculation of variance can be replaced by the calculation of covariance or standard deviation, etc.
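As a minimal sketch of the example calculation described above, the following Python code sums the row entropies to obtain E and the column variances to obtain V. The function names, the natural logarithm, and the use of population variance are assumptions made for illustration; the specification does not fix them.

```python
import math
from statistics import pvariance


def spatial_dispersion(target):
    """Index value E: sum over rows of the entropy -sum(p * log p)."""
    return sum(-sum(p * math.log(p) for p in row if p > 0) for row in target)


def temporal_shift(target):
    """Index value V: sum over columns of the (population) variance."""
    return sum(pvariance(col) for col in zip(*target))


# Toy target attention matrix: one evenly spread row and one fully concentrated row.
target = [[0.5, 0.5], [1.0, 0.0]]
e_value = spatial_dispersion(target)
v_value = temporal_shift(target)
```

In this toy matrix only the evenly spread first row contributes entropy, which matches the reading of E as a measure of how widely a layer spreads its attention.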

[0057] Step S306: Determine the i-th preference score corresponding to the i-th layer based on the index value E and the index value V.

[0058] In one example, the product of index value E raised to the power of T1 and index value V raised to the power of T2 can be determined as the i-th preference score, where T1 and T2 are temperature parameters, which are used to control the influence of index value E and index value V on the i-th preference score, respectively.

[0059] For example, the i-th preference score can be calculated using the following formula:

$P_i = E\left(A_i[-S_w:, :-S_w]\right)^{T1} \cdot V\left(A_i[-S_w:, :-S_w]\right)^{T2}$ (Formula 2)

[0060] Here, $P_i$ is the i-th preference score, $A_i[-S_w:, :-S_w]$ represents the target attention matrix, the $E()$ function represents calculating the information entropy of each target row and then summing them, the $V()$ function represents calculating the variance of each target column and then summing them, and T1 and T2 are temperature parameters.

[0061] It should be noted that the importance of attention dispersion and shift may vary under different models and cache constraints. This solution can dynamically adjust the importance of attention dispersion and shift by using the temperature parameters mentioned above, thus enabling it to flexibly adapt to various scenarios.
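The effect of the temperature parameters can be illustrated with a short Python sketch of the product-of-powers example described above. The function name, the default values, and the numeric inputs are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
def preference_score(e_value, v_value, t1=1.0, t2=1.0):
    """Preference score as E raised to T1 multiplied by V raised to T2."""
    return (e_value ** t1) * (v_value ** t2)


# With T1 = T2 = 1 the score is the plain product of the two index values.
plain = preference_score(4.0, 9.0)
# Exponents below 1 dampen both index values.
damped = preference_score(4.0, 9.0, t1=0.5, t2=0.5)
# Setting T2 = 0 removes the influence of V entirely.
entropy_only = preference_score(4.0, 9.0, t1=1.0, t2=0.0)
```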

[0062] Of course, in practice, the product of index value E and index value V can be directly used to determine the i-th preference score corresponding to the i-th layer. Alternatively, other operations besides exponentiation can be performed on index value E and index value V to determine the i-th preference score. As long as the influence of the two index values on the i-th preference score can be flexibly adjusted, this specification does not limit this.

[0063] It should be understood that when calculating the preference score of the i-th layer based on the above-mentioned index values E and V, the caching requirements of each layer are actually dynamically evaluated by considering attention in both spatial and temporal dimensions.

[0064] It should also be understood that since the above-mentioned index values E and V are calculated based on the target attention matrix, and this target attention matrix changes with different input texts, this scheme will calculate different preference scores for different input texts, thereby ensuring adaptability.

[0065] Step S308: Based on the i-th preference score, determine the target cache region allocated to the i-th layer in the total cache region, and store the attention data of the target characters in the input text therein.

[0066] Specifically, the preference scores for each layer preceding the i-th layer can be obtained, and the obtained preference scores are summed with the i-th preference score to obtain a summation result. That is, the i preference scores corresponding to the first i layers are summed. Then, the ratio of the i-th preference score to the summation result can be determined as the cache allocation ratio for the i-th layer, and the product of this cache allocation ratio and the preset total number of characters can be determined as the target number of characters for the i-th layer. Finally, based on the target number of characters for the i-th layer, the target cache area allocated to the i-th layer in the total cache area is determined.

[0067] In one example, the target number of characters can be determined using the following formula:

$B_i = \frac{P_i}{\sum_{k=1}^{i} P_k} \cdot B_{total}$ (Formula 3)

[0068] Here, $B_i$ represents the target number of characters corresponding to the i-th layer, $P_i$ is the preference score corresponding to the i-th layer, $P_k$ is the preference score corresponding to the k-th layer among the first i layers, $P_i / \sum_{k=1}^{i} P_k$ is the cache allocation ratio corresponding to the i-th layer, and $B_{total}$ is the preset total number of characters.

[0069] More specifically, the cache occupancy of a single character (i.e., the cache occupancy of a single character's key-value pair) can be pre-defined, and the target number of characters $B_i$ is then multiplied by the cache occupancy of a single character to obtain the target occupancy for the i-th layer. Then, a region matching this target occupancy can be allocated within the total cache area as the target cache region.

[0070] It should be noted that since the target number of characters is calculated based on the i-th preference score, it can also be considered that the cache region is allocated to the i-th layer based on the i-th preference score. Since the i-th preference score will be adjusted accordingly with different input texts, this scheme actually implements a preference-first adaptive allocation strategy.

[0071] Of course, in practice, the above-mentioned preset total number of characters can also be replaced by the preset total cache usage. Thus, after determining the cache allocation ratio corresponding to the i-th layer, the determined cache allocation ratio can be directly multiplied by the preset total cache usage to obtain the target usage corresponding to the i-th layer.
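The allocation described above can be sketched in Python as follows. This is a hedged illustration: the function names are hypothetical, rounding down to a whole number of characters is an assumption (the specification does not state a rounding rule), and the per-character occupancy of 4096 bytes is a made-up value.

```python
def target_character_count(scores_so_far, b_total):
    """scores_so_far holds P_1 .. P_i; returns the target number B_i for layer i."""
    ratio = scores_so_far[-1] / sum(scores_so_far)  # cache allocation ratio of layer i
    return int(ratio * b_total)                     # assumption: round down


def target_occupancy(char_count, occupancy_per_character):
    """Target occupancy of a layer given the occupancy of one character's KV pair."""
    return char_count * occupancy_per_character


first_layer = target_character_count([2.0], 1000)        # only one layer scored so far
second_layer = target_character_count([1.0, 3.0], 1000)  # ratio 3 / (1 + 3)
occupancy = target_occupancy(second_layer, 4096)
```

The first layer initially receives the whole budget because it is the only layer scored at that point; later layers reduce it through the cascading update.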

[0072] Finally, since the target cache region corresponding to layer i only stores the attention data of the target character, a removal operation is actually performed on layer i as well (that is, the attention data of non-target characters in the input text is removed).

[0073] The following describes the process of determining the target character stored in the target cache area.

[0074] This scheme determines the target characters based on the aforementioned target attention matrix, and as mentioned earlier, the target attention matrix can be $A_i[-S_w:, :]$ or $A_i[-S_w:, :-S_w]$. Therefore, the following explains the method for determining the target characters in these two cases.

[0075] First, in the case where the target attention matrix is $A_i[-S_w:, :]$, that is, when it includes the target columns corresponding to the most recent characters in the input text, the importance score of each most recent character is set to an arbitrarily large target number. For each of the remaining characters in the input text other than the most recent characters, the mean and variance of its corresponding target column in the target attention matrix are calculated, and the mean and variance are then combined by weighting to obtain the importance score of that character. In this way, the importance score of each character in the time dimension can be obtained.

[0076] After obtaining the importance score of each character in the input text in the time dimension, $B_i$ characters are selected from these characters as the target characters. For example, the characters can be sorted from highest to lowest importance score, and the top $B_i$ characters in the sorted list can then be selected as the target characters.

[0077] It should be understood that since the importance score of the most recent character is the target large number, it will definitely be selected as the target character. Therefore, the most recent character mentioned above can also be called the reserved character.

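[0078] In one example, the importance score for each character can be calculated using the following formula:

$I_i[n] = \begin{cases} \mathrm{Mean}\left(A_i[-S_w:, n]\right) + \gamma \cdot \mathrm{Var}\left(A_i[-S_w:, n]\right), & \text{if the } n\text{-th character is not a most recent character} \\ \Omega, & \text{otherwise} \end{cases}$ (Formula 4)

[0079] Here, $I_i[n]$ represents the importance score of the n-th character, $A_i[-S_w:, n]$ represents the n-th column of the target attention matrix, $\mathrm{Mean}()$ is the mean function, which measures the sustained importance of attention, $\mathrm{Var}()$ is the variance function, which measures the variability of attention, $\gamma$ is the weighting coefficient, and $\Omega$ is the target large number.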

[0080] As can be seen, this scheme will consider the importance of characters based on multiple factors.

[0081] It should be understood that in practice, Formula 4 can be modified in various ways. For example, γ can be removed, and the mean and variance can be summed directly. Alternatively, Var() can be replaced with a function that calculates the standard deviation or covariance, and so on.
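For the first case, where the target attention matrix includes the columns of the most recent characters, the scoring and selection can be sketched in Python as follows. This is an illustrative sketch under stated assumptions: γ is applied to the variance term, population variance is used, Ω is represented by infinity, and the function names are hypothetical.

```python
from statistics import mean, pvariance


def importance_scores(target, s_w, gamma=1.0, omega=float("inf")):
    """target is A[-S_w:, :] with S_w rows and S columns; returns one score per character."""
    s = len(target[0])
    scores = []
    for n, col in enumerate(zip(*target)):
        if n >= s - s_w:
            scores.append(omega)  # most recent characters are always kept
        else:
            scores.append(mean(col) + gamma * pvariance(col))
    return scores


def select_target_characters(scores, b_i):
    """Indices of the b_i highest-scoring characters, returned in text order."""
    ranked = sorted(range(len(scores)), key=lambda n: scores[n], reverse=True)
    return sorted(ranked[:b_i])


# Toy window with S = 4 and S_w = 2 (illustrative values only).
target = [[0.1, 0.6, 0.1, 0.2], [0.3, 0.2, 0.3, 0.2]]
scores = importance_scores(target, s_w=2)
kept = select_target_characters(scores, b_i=3)
```

Here the two most recent characters (indices 2 and 3) are retained unconditionally, and the one remaining slot goes to the earlier character with the higher combined mean and variance.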

[0082] Secondly, in the case where the target attention matrix is $A_i[-S_w:, :-S_w]$, that is, when it does not include the target columns corresponding to the most recent characters in the input text, the most recent characters are used as reserved characters, and the characters corresponding to the target columns of the target attention matrix are used as candidate characters. Then, target candidate characters are selected from the candidate characters such that the number of target candidate characters plus the number of reserved characters is exactly equal to the target number of characters $B_i$.

[0083] The process of selecting target candidate characters can be as follows: for each candidate character, calculate the mean and variance of its corresponding target column in the target attention matrix, and then combine the mean and variance by weighting to obtain the importance score of the candidate character. Afterwards, sort the candidate characters from highest to lowest importance score, and select the top-ranked target number of candidate characters as the target candidate characters. The target characters are then formed from these target candidate characters and the reserved characters. Here, the target number is the difference between the aforementioned target number of characters $B_i$ and the number of reserved characters.

[0084] Considering that the importance scores of each target character will be needed in the subsequent update of the target cache region corresponding to the i-th layer, the importance scores of the reserved characters not included in the target attention matrix can be assigned any large target number to ensure that the attention data corresponding to them will not be removed.
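The second case can be sketched in Python as follows. This is a minimal illustration: the function name is hypothetical, γ is assumed to weight the variance term, Ω is represented by infinity, and B_i is assumed to be at least the number of reserved characters.

```python
from statistics import mean, pvariance


def select_with_reserved(target, s, s_w, b_i, gamma=1.0, omega=float("inf")):
    """target is A[-S_w:, :-S_w]; its columns are the candidate characters 0 .. S-S_w-1."""
    reserved = list(range(s - s_w, s))     # the most recent characters
    quota = max(b_i - len(reserved), 0)    # how many candidates may be kept
    scores = {n: mean(col) + gamma * pvariance(col)
              for n, col in enumerate(zip(*target))}
    ranked = sorted(scores, key=scores.get, reverse=True)
    chosen = sorted(ranked[:quota]) + reserved
    # Reserved characters have no column in the target matrix, so they are given
    # the large number omega and cannot be removed by later updates.
    kept_scores = {n: scores.get(n, omega) for n in chosen}
    return chosen, kept_scores


# Toy window with S = 4 and S_w = 2 (illustrative values only).
target = [[0.1, 0.6], [0.3, 0.2]]
chosen, kept_scores = select_with_reserved(target, s=4, s_w=2, b_i=3)
```

Storing the scores alongside the chosen indices reflects the note above that the importance scores are needed again when the cache region of this layer is later updated.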

[0085] Finally, the attention data for the target character described in this scheme refers to the key vector and value vector determined using the character representation of the target character at layer (i-1) and the corresponding key matrix and value matrix at layer i.

[0086] Specifically, the key vector can be represented as $K = XW_K$, and the value vector can be represented as $V = XW_V$. Here, $X$ is the character representation output by the (i-1)th layer for the target character, and $W_K$ and $W_V$ are the key matrix and the value matrix (i.e., the attention parameters of the i-th layer), respectively.

[0087] In practice, the attention data mentioned above is also referred to as key-value pairs or KV pairs.
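How the key-value pairs of only the target characters might be computed and stored can be sketched as follows. The function names and the dictionary-based cache layout are hypothetical choices for illustration; the specification does not prescribe a storage format.

```python
def project(vector, matrix):
    """Multiply a row vector by a matrix given as a list of rows."""
    return [sum(v * m for v, m in zip(vector, col)) for col in zip(*matrix)]


def cache_kv_pairs(x, w_k, w_v, target_indices):
    """Return {character index: (key vector, value vector)} for the target characters only."""
    return {n: (project(x[n], w_k), project(x[n], w_v)) for n in target_indices}


# Toy representations of 3 characters output by the previous layer (illustrative values).
x = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
w_k = [[1.0, 0.0], [0.0, 1.0]]
w_v = [[2.0, 0.0], [0.0, 2.0]]
layer_cache = cache_kv_pairs(x, w_k, w_v, target_indices=[0, 2])
```

Character 1 is not a target character in this toy example, so no key-value pair is stored for it, which corresponds to the removal operation described earlier.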

[0088] Step S310: Based on the i-th preference score, update the prior cache regions of each layer before the i-th layer, and update the attention data of the characters stored therein.

[0089] As mentioned earlier, this scheme performs caching operations layer by layer. Therefore, when reaching the i-th layer, all previous layers have already undergone caching operations, meaning that corresponding cache regions have been allocated for each of the previous layers. However, since the caching requirements of the i-th layer were not considered when allocating cache regions for the previous layers, it is necessary to update the existing cache regions of the previous layers.

[0090] It should be noted that in this scheme, the update methods for the prior cache regions of each layer before the i-th layer are similar. Therefore, the following uses any target layer in each layer as an example to illustrate the corresponding update method.

[0091] Specifically, for the target layer, the i-th preference score corresponding to the i-th layer is summed with the preference scores corresponding to all layers preceding the i-th layer; that is, the sum of the i preference scores corresponding to the first i layers is calculated. Based on the ratio of the target attention score corresponding to the target layer to the summation result, and the preset total number of characters, the number of updated characters for the target layer is determined. Based on the number of updated characters, the target prior cache region corresponding to the target layer is updated.

[0092] It should be understood that the target attention score mentioned above is calculated when caching operations are performed on the target layer.

[0093] Secondly, the ratio of the target attention score to the summation result for the target layer can be understood as the update allocation ratio for the target layer, that is, the cache allocation ratio redefined for the target layer after considering the cache requirements of the i-th layer.

[0094] Finally, the number of updated characters mentioned above can be calculated based on Formula 3; it is only necessary to replace $P_i$ in Formula 3 with the target attention score.

[0095] It should be noted that after calculating the number of updated characters, this number can be multiplied by the cache usage corresponding to a single character to obtain the update usage for the target layer. Then, the existing cache area of the target layer can be updated to a cache area matching this update usage.

[0096] It should be understood that, for the target layer, the updated cache area will be smaller than the initially allocated cache area because the denominator in Formula 3, which calculates the number of updated characters, becomes larger (the i-th preference score has been added). Therefore, after performing the update cache area operation, some of the data stored within it needs to be removed. The following describes this removal process.
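
The shrinking of the target layer's cache area can be checked with a small Python sketch. It assumes that the number of characters is the allocation ratio multiplied by the preset total number of characters and rounded down; the rounding rule, the function names, and the toy scores are assumptions made for illustration only.

```python
def char_count(layer_score, scores_so_far, total_chars):
    """Characters granted to a layer: its share of the summed scores times the total.

    Rounding down with int() is an assumption of this sketch.
    """
    return int(layer_score / sum(scores_so_far) * total_chars)


def update_usage(num_chars, usage_per_char):
    """Cache usage matching a character count (usage_per_char is per character)."""
    return num_chars * usage_per_char


TOTAL_CHARS = 1000   # hypothetical preset total number of characters
scores = [3.0, 1.0]  # hypothetical scores of the target layer and of layer i

before = char_count(scores[0], scores[:1], TOTAL_CHARS)  # only the target layer exists
after = char_count(scores[0], scores, TOTAL_CHARS)       # layer i has been added
print(before, after, update_usage(after, 2))
```

Because the denominator grows once the i-th score is added, `after` is smaller than `before`, so the target layer must give up part of its stored data.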

[0097] Specifically, the characters stored in the prior cache area corresponding to the target layer can be sorted from high to low according to their importance scores. The top-ranked characters, whose count equals the number of updated characters, are retained; the remaining characters are identified as characters to be removed, and their attention data is removed.

[0098] Assuming the target layer is represented as layer $l$, the corresponding removal operation can be represented as $\mathrm{EVICT}(\{K_l, V_l\}, B_l, I_l)$, where $K_l$ is a matrix consisting of the key vectors of each character stored in the prior cache region of layer $l$, $V_l$ is a matrix consisting of the value vectors of each character stored in the prior cache region of layer $l$, $B_l$ is the number of updated characters, and $I_l$ is a vector consisting of the importance scores of each character stored in the prior cache region of layer $l$. The removal operation means: retain the key-value pairs of the $B_l$ characters with the highest importance scores in $I_l$. The retained key-value pairs are those selected by the index set $D_l$,

[0099] where $D_l = \mathrm{TopK}(B_l, I_l)$.

[0100] Here, $D_l$ denotes the indices of the key-value pairs of the $B_l$ characters with the highest importance scores in $I_l$.
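
The removal operation described above can be sketched in Python as follows. The function names `top_k` and `evict` are hypothetical stand-ins for TopK and EVICT. Keeping the retained pairs in their original order and breaking ties in favor of earlier characters are assumptions of this sketch.

```python
def top_k(budget, importance):
    """Indices of the `budget` highest importance scores, returned in original order."""
    ranked = sorted(range(len(importance)), key=lambda j: importance[j], reverse=True)
    return sorted(ranked[:budget])


def evict(keys, values, budget, importance):
    """Retain only the key-value pairs (and scores) selected by top_k."""
    kept = top_k(budget, importance)
    return (
        [keys[j] for j in kept],
        [values[j] for j in kept],
        [importance[j] for j in kept],
    )


# Hypothetical cache of four characters; the last one is a reserved character
# whose importance was set to an arbitrarily large number.
keys = ["k0", "k1", "k2", "k3"]
values = ["v0", "v1", "v2", "v3"]
importance = [0.2, 0.9, 0.1, float("inf")]
print(evict(keys, values, 2, importance))
```

Because the reserved character carries an arbitrarily large importance score, it survives every eviction as long as the budget is at least the number of reserved characters.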

[0101] Similar to the update method for the target layer described above, the prior cache areas of other layers preceding the i-th layer can be updated, as well as the attention data of the characters stored therein. These details will not be elaborated upon here.

[0102] In practice, update operations for each layer preceding the i-th layer can be executed in parallel, thereby reducing the time complexity of update operations for each layer to the level of a single-layer update operation, thus improving the efficiency of cache operations.

[0103] It should be noted that this approach, when removing key-value pairs based on character importance scores, actually considers the changing importance of characters over time, overcoming the limitation of existing methods that often neglect temporal dynamics. Furthermore, since the character importance score is determined by comprehensively considering multiple factors such as the sustained importance and changes in attention, this approach is a robust removal strategy that can tolerate attention shifts.

[0104] At this point, the caching operation for the i-th layer is complete. Next, the caching operation can be performed for the (i+1)-th layer, until the last layer is reached.

[0105] Specifically, when performing caching operations on the (i+1)-th layer, the prior cache areas of the first i layers will be updated (for example, for the i-th layer, its target cache area will be updated; for the target layer, its cache area will be updated again), and the attention data of the characters stored therein will be updated. For the specific update method, refer to the description of the target layer above, which will not be repeated here.

[0106] In summary, this scheme is a dynamic cache management method. Only after caching operations have been performed on the last of the attention layers can the cache allocation ratio of each attention layer be finally determined, as well as the cache area of each attention layer and the attention data of the characters stored therein.

[0107] It should be noted that the dynamic cache management method provided by this solution can maintain effective attention data with limited memory resources. In other words, this solution can ensure the generation quality of large models with limited memory resources.

[0108] The following examples illustrate the caching optimization method for large model inference provided in the embodiments of this specification.

[0109] Figure 5 illustrates a caching optimization method in one example of this specification. In Figure 5, the large model contains four attention layers, and caching operations are performed sequentially on these four attention layers:

[0110] First, the first layer is processed (not shown in the figure). Since only the preference score P0 of the first layer exists, the corresponding cache allocation ratio is 1; the total cache area is therefore allocated to the first layer and stores the key-value pairs of all characters in the input text.

[0111] Then, proceeding to the second layer, the corresponding preference score P1 is calculated first, and based on the preference scores P1 and P0, the cache allocation ratio for the second layer is determined to be 0.26, along with the matching cache region. Next, a removal operation is performed on the key-value pairs of the second layer, and the remaining key-value pairs are stored in that cache region. Finally, based on the preference scores P1 and P0, the cache allocation ratio for the first layer is determined to be 0.74, along with the corresponding cache region, and a removal operation is performed on the key-value pairs stored in the cache region of the first layer.

[0112] Next, proceeding to the third layer, the corresponding preference score P2 is calculated first, and based on the preference scores P2, P1, and P0, the cache allocation ratio for the third layer is determined to be 0.46, along with the matching cache region. Next, a removal operation is performed on the key-value pairs of the third layer, and the remaining key-value pairs are stored in that cache region. Finally, based on the preference scores P2, P1, and P0, the cache allocation ratios for the second layer and the first layer are determined to be 0.14 and 0.40, respectively, along with the corresponding cache regions, and a removal operation is performed on the key-value pairs stored in the cache regions of the second layer and the first layer.

[0113] Finally, in the fourth layer, the corresponding preference score P3 is calculated first. Based on the preference scores P3, P2, P1, and P0, the cache allocation ratio for the fourth layer is determined to be 0.23, along with the corresponding cache region. Next, a removal operation is performed on the key-value pairs of the fourth layer, and the remaining key-value pairs are stored in that cache region. Finally, based on the preference scores P3, P2, P1, and P0, the cache allocation ratios for the third, second, and first layers are determined to be 0.35, 0.11, and 0.31, respectively, along with the corresponding cache regions, and a removal operation is performed on the key-value pairs stored in the cache regions of the third, second, and first layers.

[0114] At this point, caching operations have been performed on all four attention layers, thus fixing the cache allocation ratios for layers 1-4 as {0.31, 0.11, 0.35, 0.23}, together with their respective cache areas.

[0115] As can be seen from Figure 5, at each layer, the KV cache budget is reallocated to the previous layers based on the preference score obtained at the current layer, so as to always maintain the given cache budget size.
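
The cascading reallocation in this example can be traced with a short Python sketch. The preference scores below are hypothetical: they are simply chosen proportional to the final ratios {0.31, 0.11, 0.35, 0.23}, since the actual values of P0 to P3 are not given. The intermediate ratios therefore come out close to, but not necessarily identical to, the rounded values quoted above.

```python
def cascade_ratios(scores):
    """For each newly processed layer i, return the ratios of layers 0..i.

    Each ratio is the layer's preference score divided by the sum of the
    scores of all layers processed so far, so every step sums to 1.
    """
    history = []
    for i in range(len(scores)):
        total = sum(scores[: i + 1])
        history.append([s / total for s in scores[: i + 1]])
    return history


# Hypothetical preference scores P0..P3 (assumed, not taken from the specification).
scores = [0.31, 0.11, 0.35, 0.23]
history = cascade_ratios(scores)
for step, ratios in enumerate(history, start=1):
    print(f"after layer {step}:", [round(r, 2) for r in ratios])
```

Every step redistributes the same fixed budget, which is why the ratios of earlier layers shrink each time a new layer is processed.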

[0116] It should be understood that Figure 5 is merely an illustrative example. In practice, large models may include more than four attention layers, such as ten or even more, and this specification does not limit this. Furthermore, the cache allocation ratio for each attention layer is not limited to the values mentioned above.

[0117] In summary, the caching optimization method for large model inference provided in the embodiments of this specification has the following advantages:

[0118] First, this scheme allows for a global view of cache size allocation, thereby optimally allocating memory resources based on the different attention mechanisms of each layer. In other words, when allocating memory resources, this scheme can utilize a global attention model to consider the unique characteristics of each layer.

[0119] Secondly, this solution analyzes the layer-specific key-value caching preferences during the pre-filling stage and uses these preferences to manage the cache budget in a cascading manner, which can effectively reduce peak memory usage to the target level.

[0120] Furthermore, this solution proposes a robust removal strategy that can tolerate attention shifts. When attention shifts occur in long contexts, this solution can avoid the information loss problem caused by overly aggressive removal strategies.

[0121] Finally, this solution consistently outperforms current baseline methods across various models and memory constraints, particularly in low-memory scenarios.

[0122] Corresponding to the cache optimization method in the large model inference described above, one embodiment of this specification also provides a cache optimization apparatus, as shown in FIG. 6, which may include:

[0123] The acquisition unit 602 is used to acquire the target attention matrix of the i-th layer, which contains the attention coefficients between characters in the input text calculated based on the parameters of the i-th layer.

[0124] The determining unit 604 is used to determine a first index value representing spatial dispersion based on the distribution of row data of the target attention matrix; and to determine a second index value representing temporal offset based on the distribution of column data of the target attention matrix.

[0125] The determining unit 604 is also used to determine the i-th preference score corresponding to the i-th layer based on the first index value and the second index value.

[0126] The storage unit 606 is used to determine the target cache region allocated for the i-th layer in the total cache region based on the i-th preference score, and to store the attention data of the target characters in the input text therein.

[0127] The update unit 608 is used to update the prior cache area of each layer before the i-th layer according to the i-th preference score, and update the attention data of the characters stored therein.

[0128] In one embodiment, the determining unit 604 is specifically used for:

[0129] For each target row in the target attention matrix, calculate the corresponding information entropy, and then sum the information entropies to obtain the first index value;

[0130] For each target column in the target attention matrix, calculate the corresponding variance, and then sum the variances to obtain the second index value.
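
The two index values described above can be sketched in Python as follows. The sketch assumes that every row and column of the matrix is a target row or column, that the entropy uses the natural logarithm, and that the variance is the population variance; these choices and the function names are assumptions, not requirements of this specification.

```python
import math


def row_entropy_sum(attn):
    """First index value: sum over rows of the information entropy of each row."""
    total = 0.0
    for row in attn:
        # Zero coefficients contribute nothing (the limit of p*log(p) is 0).
        total += -sum(p * math.log(p) for p in row if p > 0.0)
    return total


def column_variance_sum(attn):
    """Second index value: sum over columns of the population variance."""
    total = 0.0
    for col in zip(*attn):
        mean = sum(col) / len(col)
        total += sum((a - mean) ** 2 for a in col) / len(col)
    return total


# Hypothetical 2x2 causal attention matrix; each row sums to 1.
attn = [[1.0, 0.0],
        [0.5, 0.5]]
print(row_entropy_sum(attn), column_variance_sum(attn))
```

A row that spreads its attention evenly has high entropy (spatial dispersion), while a column whose coefficients change strongly from row to row has high variance (temporal offset).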

[0131] In one embodiment, the determining unit 604 is further specifically used for:

[0132] The product of the first index value raised to the power of T1 and the second index value raised to the power of T2 is determined as the i-th preference score, where T1 and T2 are used to control the influence of the first index value and the second index value on the i-th preference score, respectively.
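
A one-line Python sketch of this product is given below. The function name and the default exponent values are hypothetical; the specification does not state concrete values for T1 and T2.

```python
def preference_score(first_index, second_index, t1=1.0, t2=1.0):
    """Preference score = (first index value ** T1) * (second index value ** T2)."""
    return (first_index ** t1) * (second_index ** t2)


# With T1 = T2 = 1 the two index values contribute equally; setting an
# exponent to 0 removes the influence of the corresponding index value.
print(preference_score(2.0, 3.0))          # both index values at full weight
print(preference_score(2.0, 3.0, t1=0.0))  # first index value ignored
```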

[0133] In one embodiment, storage unit 606 includes:

[0134] The acquisition submodule 6062 is used to acquire the preference scores corresponding to each layer before the i-th layer, and sum the preference scores with the i-th preference score to obtain the summation result;

[0135] The calculation submodule 6064 is used to determine the number of target characters corresponding to the i-th layer based on the ratio of the i-th preference score to the summation result and the preset total number of characters;

[0136] The determination submodule 6066 is used to determine the target cache area based on the number of target characters.

[0137] In one embodiment, each target column in the target attention matrix corresponds to a character in the input text, and the apparatus further includes:

[0138] The calculation unit 610 is used to calculate the importance score of each character in the time dimension;

[0139] The selection unit 612 is used to select, from the characters and based on the importance scores, a number of characters equal to the number of target characters as the target characters.

[0140] In one embodiment, the calculation unit 610 is specifically used for:

[0141] For reserved characters in the input text, an arbitrarily large target number is determined as the importance score of the reserved character; the reserved characters include a number of characters counted backward from the end of the input text;

[0142] For the remaining characters excluding the reserved characters, calculate the mean and variance for each target column in the target attention matrix, and then weight and combine the mean and variance to obtain the importance score for the remaining character.
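
The importance-score calculation in this embodiment can be sketched in Python as follows. The weighting is shown as a simple linear combination with hypothetical weights `alpha` and `beta`; the exact weighting, the use of the population variance, and the use of infinity as the "large number" are assumptions of this sketch.

```python
def importance_scores(attn, num_reserved, alpha=1.0, beta=1.0, large=float("inf")):
    """Score each column (character) of the attention matrix.

    The last `num_reserved` characters receive an arbitrarily large score;
    every other character receives alpha * mean + beta * variance of its column.
    """
    n_cols = len(attn[0])
    scores = []
    for j, col in enumerate(zip(*attn)):
        if j >= n_cols - num_reserved:
            scores.append(large)  # reserved character: never evicted
            continue
        mean = sum(col) / len(col)
        var = sum((a - mean) ** 2 for a in col) / len(col)
        scores.append(alpha * mean + beta * var)
    return scores


# Hypothetical 3x3 causal attention matrix; the last character is reserved.
attn = [[1.0, 0.0, 0.0],
        [0.5, 0.5, 0.0],
        [0.2, 0.3, 0.5]]
print(importance_scores(attn, num_reserved=1))
```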

[0143] In one embodiment, each target column in the target attention matrix corresponds to a candidate character, and each candidate character is the remaining characters in the input text excluding reserved characters. The reserved characters include a number of characters in the input text starting from the end and moving forward. The above apparatus further includes:

[0144] The calculation unit 610 is used to calculate the importance score of each candidate character in the time dimension based on each target column of the target attention matrix.

[0145] The selection unit 612 is used to select a target number of target candidate characters from the candidate characters based on the importance scores. The target number is obtained by subtracting the number of reserved characters from the number of target characters.

[0146] The forming unit 614 is used to form the target characters from the target candidate characters and the reserved characters.
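
The selection and forming steps of this embodiment can be sketched in Python as follows. The sketch assumes that the target number equals the number of target characters minus the number of reserved characters, and that the reserved characters occupy the positions immediately after the candidates. The function name and the index layout are hypothetical.

```python
def form_target_characters(candidate_scores, num_reserved, num_target):
    """Return the positions of the target characters.

    candidate_scores[j] is the importance score of candidate character j;
    reserved characters are assumed to follow the candidates in the text.
    """
    n_candidates = len(candidate_scores)
    # Reserved characters are always kept, so only the remainder is selected.
    n_pick = max(num_target - num_reserved, 0)
    ranked = sorted(range(n_candidates), key=lambda j: candidate_scores[j], reverse=True)
    picked = sorted(ranked[:n_pick])
    reserved = list(range(n_candidates, n_candidates + num_reserved))
    return picked + reserved


# Hypothetical input: 4 candidates, 2 reserved characters, budget of 4 characters.
print(form_target_characters([0.1, 0.7, 0.4, 0.2], num_reserved=2, num_target=4))
```

Because the reserved characters are excluded from the attention matrix in this embodiment, they never compete with the candidates for the remaining budget.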

[0147] In one embodiment, the attention data for the target character includes the key vector and value vector determined using the character representation for the target character at layer (i-1) and the corresponding key matrix and value matrix at layer i.

[0148] In one embodiment, the update unit 608 is specifically used for:

[0149] For any target layer among the layers preceding the i-th layer, calculate the sum of the i-th preference score and the preference scores corresponding to each of the layers preceding the i-th layer;

[0150] Determine the number of updated characters for the target layer based on the ratio of the target attention score to the summation result and the preset total number of characters;

[0151] Update the target prior cache area corresponding to the target layer based on the number of updated characters.

[0152] In one embodiment, the update unit 608 includes:

[0153] The sorting submodule 6082 is used to sort the characters stored in the target prior cache area from high to low according to their corresponding importance scores, and to determine the remaining characters, other than the top-ranked characters whose count equals the number of updated characters, as the characters to be removed.

[0154] The removal submodule 6084 is used to remove the attention data of the characters to be removed from the target prior cache area.

[0155] The functions of each functional module of the apparatus in the above embodiments of this specification can be implemented through the steps of the above method embodiments. Therefore, the specific working process of the apparatus provided in one embodiment of this specification will not be repeated here.

[0156] The cache optimization apparatus provided in one embodiment of this specification can effectively perform key-value caching within limited memory resources, thereby ensuring the generation quality of large models.

[0157] According to another embodiment, a computer-readable storage medium is also provided, on which a computer program is stored, which, when executed in a computer, causes the computer to perform the method described in conjunction with FIG. 3.

[0158] According to another embodiment, a computing device is also provided, including a memory and a processor, wherein executable code is stored in the memory, and the processor executes the executable code to implement the method described in conjunction with FIG. 3.

[0159] The various embodiments in this specification are described in a progressive manner. Similar or identical parts between embodiments can be referred to mutually. Each embodiment focuses on describing the differences from other embodiments. In particular, the medium or device embodiments are basically similar to the method embodiments, so the description is relatively simple; relevant parts can be referred to the descriptions of the method embodiments.

[0160] The foregoing has described specific embodiments of this specification. Other embodiments are within the scope of the appended claims. In some cases, the actions or steps recited in the claims may be performed in a different order than that shown in the embodiments and may still achieve the desired result. Furthermore, the processes depicted in the drawings do not necessarily require the specific or sequential order shown to achieve the desired result. In some embodiments, multitasking and parallel processing are possible or may be advantageous.

[0161] The specific embodiments described above further illustrate the purpose, technical solution, and beneficial effects of this specification. It should be understood that the above description is only a specific embodiment of this specification and is not intended to limit the scope of protection of this specification. Any modifications, equivalent substitutions, improvements, etc., made on the basis of the technical solution of this specification should be included within the scope of protection of this specification.

Claims

1. A method for cache optimization in large model inference, comprising: During the pre-filling stage, caching operations are performed layer by layer for multiple attention layers in the large model. The caching operation for any i-th layer includes: Obtain the target attention matrix of the i-th layer, which contains the attention coefficients between characters in the input text calculated based on the parameters of the i-th layer; Based on the distribution of the row data of the target attention matrix, a first index value representing spatial dispersion is determined; and based on the distribution of the column data of the target attention matrix, a second index value representing temporal offset is determined. Based on the first indicator value and the second indicator value, determine the i-th preference score corresponding to the i-th layer; Based on the i-th preference score, determine the target cache region allocated for the i-th layer in the total cache region, and store the attention data of the target characters in the input text therein; Based on the i-th preference score, update the prior cache regions of each layer before the i-th layer, and update the attention data of the characters stored therein.

2. The method according to claim 1, wherein, The determination of the first index value characterizing spatial dispersion includes: For each target row in the target attention matrix, calculate the corresponding information entropy, and then sum the information entropies to obtain the first index value; For each target column in the target attention matrix, calculate the corresponding variance, and then sum the variances to obtain the second index value.

3. The method of claim 1, wherein, Determining the i-th preference score corresponding to the i-th layer includes: The product of the first index value raised to the power of T1 and the second index value raised to the power of T2 is determined as the i-th preference score, wherein T1 and T2 are used to control the influence of the first index value and the second index value on the i-th preference score, respectively.

4. The method according to claim 1, wherein, The determination of the target cache region allocated for the i-th layer in the total cache area includes: Obtain the preference scores corresponding to each layer before the i-th layer, and sum the preference scores with the i-th preference score to obtain the summation result; The number of target characters corresponding to the i-th layer is determined based on the ratio of the i-th preference score to the summation result and the preset total number of characters. The target cache region is determined based on the number of target characters.

5. The method according to claim 4, wherein, Each target column in the target attention matrix corresponds to a character in the input text, and the method further includes: Calculate the importance score of each character in the time dimension; Based on the importance score, a number of characters equal to the target character number are selected from the characters as the target character.

6. The method of claim 5, wherein, The calculation of the importance score of each character in the time dimension includes: For the reserved characters in the input text, any target large number is determined as the importance score of the reserved character; the reserved characters include several characters in the input text starting from the end and moving forward; For the remaining characters other than the reserved characters, the mean and variance are calculated for each target column in the target attention matrix, and the mean and variance are weighted and combined to obtain the importance score corresponding to the remaining character.

7. The method of claim 4, wherein each target column in the target attention matrix corresponds to a candidate character; the candidate characters are the remaining characters in the input text excluding reserved characters; the reserved characters include several characters counted backward from the end of the input text; the method further includes: calculating, based on each target column of the target attention matrix, the importance score of each candidate character in the time dimension; selecting, based on the importance scores, a target number of target candidate characters from the candidate characters, the target number being obtained by subtracting the number of reserved characters from the number of target characters; and forming the target characters from the target candidate characters and the reserved characters.

8. The method of claim 1, wherein the attention data for the target character includes the key vector and value vector determined using the character representation output by the (i-1)-th layer for the target character and the key matrix and value matrix corresponding to the i-th layer.

9. The method according to claim 1, wherein, The update of the prior cache regions of each layer before layer i includes: For any target layer, calculate the sum of the i-th preference score and the corresponding preference scores for each layer; The number of characters to be updated for the target layer is determined based on the ratio of the target attention score to the summation result and the preset total number of characters. Based on the number of updated characters, update the target prior cache area corresponding to the target layer.

10. The method according to claim 9, wherein the update of the attention data of the characters stored therein includes: sorting the characters stored in the target prior cache area from high to low according to their corresponding importance scores, and determining the remaining characters, other than the top-ranked characters whose count equals the number of updated characters, as the characters to be removed; and removing the attention data of the characters to be removed from the target prior cache area.

11. A cache optimization apparatus, comprising: The acquisition unit is used to acquire the target attention matrix of the i-th layer, which contains the attention coefficients between characters in the input text calculated based on the parameters of the i-th layer; The determining unit is configured to determine a first index value representing spatial dispersion based on the distribution of the row data of the target attention matrix; and to determine a second index value representing temporal offset based on the distribution of the column data of the target attention matrix. The determining unit is further configured to determine the i-th preference score corresponding to the i-th layer based on the first index value and the second index value. A storage unit is used to determine the target cache region allocated for the i-th layer in the total cache region based on the i-th preference score, and to store attention data of the target character in the input text therein; The update unit is used to update the prior cache regions of each layer before the i-th layer according to the i-th preference score, and to update the attention data of the characters stored therein.

12. The apparatus according to claim 11, wherein, The determining unit is specifically used for: For each target row in the target attention matrix, calculate the corresponding information entropy, and then sum the information entropies to obtain the first index value; For each target column in the target attention matrix, calculate the corresponding variance, and then sum the variances to obtain the second index value.

13. The apparatus according to claim 11, wherein, The determining unit is also specifically used for: The product of the first index value raised to the power of T1 and the second index value raised to the power of T2 is determined as the i-th preference score, wherein T1 and T2 are used to control the influence of the first index value and the second index value on the i-th preference score, respectively.

14. The apparatus of claim 11, wherein, The storage unit includes: The acquisition submodule is used to acquire each preference score corresponding to each layer before the i-th layer, and sum each preference score with the i-th preference score to obtain the summation result; The calculation submodule is used to determine the number of target characters corresponding to the i-th layer based on the ratio of the i-th preference score to the summation result and the preset total number of characters; The determination submodule is used to determine the target cache area based on the number of target characters.

15. The apparatus according to claim 14, wherein each target column in the target attention matrix corresponds to a character in the input text, and the apparatus further includes: a calculation unit used to calculate the importance score of each character in the time dimension; and a selection unit used to select from the characters, based on the importance scores, a number of characters equal to the number of target characters as the target characters.

16. The apparatus according to claim 15, wherein, The computing unit is specifically used for: For the reserved characters in the input text, any target large number is determined as the importance score of the reserved character; the reserved characters include several characters in the input text starting from the end and moving forward; For the remaining characters other than the reserved characters, the mean and variance are calculated for each target column in the target attention matrix, and the mean and variance are weighted and combined to obtain the importance score corresponding to the remaining character.

17. The apparatus according to claim 14, wherein each target column in the target attention matrix corresponds to a candidate character; the candidate characters are the remaining characters in the input text excluding reserved characters; the reserved characters include a number of characters counted backward from the end of the input text; the apparatus further includes: a calculation unit used to calculate the importance score of each candidate character in the time dimension based on each target column of the target attention matrix; a selection unit used to select a target number of target candidate characters from the candidate characters based on the importance scores, the target number being obtained by subtracting the number of reserved characters from the number of target characters; and a forming unit used to form the target characters from the target candidate characters and the reserved characters.

18. The apparatus according to claim 11, wherein the attention data for the target character includes the key vector and value vector determined using the character representation output by the (i-1)-th layer for the target character and the key matrix and value matrix corresponding to the i-th layer.

19. The apparatus according to claim 11, wherein, The update unit is specifically used for: For any target layer, calculate the sum of the i-th preference score and the corresponding preference scores for each layer; The number of characters to be updated for the target layer is determined based on the ratio of the target attention score to the summation result and the preset total number of characters. Based on the number of updated characters, update the target prior cache area corresponding to the target layer.

20. The apparatus according to claim 19, wherein the update unit includes: a sorting submodule used to sort the characters stored in the target prior cache area from high to low according to their corresponding importance scores, and to determine the remaining characters, other than the top-ranked characters whose count equals the number of updated characters, as the characters to be removed; and a removal submodule used to remove the attention data of the characters to be removed from the target prior cache area.

21. A computer-readable storage medium having a computer program stored thereon, wherein, When the computer program is executed in the computer, it causes the computer to perform the method according to any one of claims 1-10.

22. A computing device comprising a memory and a processor, wherein, The memory stores executable code, and when the processor executes the executable code, it implements the method of any one of claims 1-10.