Data processing method and apparatus
By prefetching key KV caches for future generation steps in parallel with computing query vectors and importance scores, the method addresses the low throughput of large language models and improves the generation efficiency and accuracy of the model.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- HUAWEI TECH CO LTD
- Filing Date
- 2024-12-31
- Publication Date
- 2026-06-30
AI Technical Summary
Large language models suffer from low throughput because key-value cache (KV cache) processing during generation is serial and cannot be parallelized effectively, which degrades the model's performance and efficiency.
The method prefetches key KV caches for future generation steps in parallel with computing query vectors and importance scores, and uses target information from historical generation steps to predict relevance for future steps, thereby optimizing how key KV caches are selected and used.
This improves the throughput and generation efficiency of large language models, enhances the prediction accuracy and resource utilization efficiency of the models, and keeps performance stable and accurate when processing large amounts of data.
Figure CN122309561A_ABST
Abstract
Description
Technical Field
[0001] This application relates to the field of computer technology, and more particularly to data processing methods and apparatus.
Background Technology
[0002] Large language models (LLMs) are deep learning models trained on large amounts of text data, enabling them to generate natural language text or understand the meaning of language text. These models can provide in-depth knowledge and language production on a wide range of topics by being trained on massive datasets.
[0003] Given an input token, the LLM processes it through multiple internal transformer layers to generate output tokens. Each transformer layer processes the input vector and generates three vectors: query (Q), key (K), and value (V). An importance score (e.g., attention score) is calculated from the Q and K vectors. This importance score is then applied to the V vector, and the result serves as the output of the self-attention layer in each LLM transformer layer. The K and V vectors for each layer are cached layer-by-layer as a KV cache.
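The relationship between Q, K, V, and the importance score can be illustrated with a minimal sketch. It assumes the common scaled dot-product form with a softmax, which this application does not itself specify; the function name `attention` and the plain-list vector representation are illustrative only.

```python
import math

def attention(q, keys, values):
    """Scaled dot-product attention for one query vector (assumed form)."""
    d = len(q)
    # Raw scores: dot product of the query with every cached key, scaled by sqrt(d).
    logits = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
    # Softmax turns the raw scores into importance scores that sum to 1.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    scores = [e / total for e in exps]
    # Apply the importance scores to the value vectors (weighted sum).
    dim = len(values[0])
    out = [sum(s * v[j] for s, v in zip(scores, values)) for j in range(dim)]
    return scores, out
```

Here `scores` plays the role of the importance score vector for one generation step, and `out` is the output of the self-attention computation for that query.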
[0004] Key-value (KV) caching is an important technique for accelerating inference in large language models by trading space for time. During the computation of large language models, each layer generates a large number of K and V vectors, and the calculation of future importance scores depends on the corresponding Q vectors and all historical K and V vectors. Therefore, the historical K and V vectors of each layer can be stored in a KV cache to avoid redundant computation and reduce inference costs.
Summary of the Invention
[0005] This application provides a data processing method and apparatus to improve the throughput of large language models. To achieve the above objectives, this application adopts the following technical solutions:
[0006] In a first aspect, embodiments of this application provide a data processing method applied to an edge-side device or a cloud-side device. The method includes: prefetching key KV caches for at least one future generation step at a first time, and calculating the attention score for the at least one future generation step based on the key KV caches. The first time is any time of the T-th generation step, which is used to generate token T, and the at least one future generation step includes the (T+N)-th generation step, where T and N are positive integers.
[0007] In the current technical architecture, the generation step includes calculating the query Q-vector, estimating the key KV cache, reading the key KV cache, and calculating the importance score. Since reading the key KV cache and calculating the query vector cannot be performed in parallel, the key KV cache must be read only after the query vector calculation is complete, thus delaying the process of calculating the importance score based on the key KV cache. To address this performance bottleneck, embodiments of this application propose an innovative solution. In this solution, while calculating the query vector or importance score, one or more key KV caches from future generation steps can be prefetched in parallel. That is, the generation step of this application includes calculating the query vector, calculating the importance score, and prefetching the key KV cache for at least one future generation step. In this way, after calculating the query vector in a future generation step, the prefetched key KV cache can be used immediately to calculate the importance score, thereby significantly improving the throughput of the large language model and enabling the large language model to generate more tokens per second.
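The overlap between computation and prefetching described above can be sketched as follows. This is only an illustration under stated assumptions: a background thread stands in for the asynchronous prefetch, and `select_key_kv`, `load_kv`, and `compute_step` are hypothetical callbacks supplied by the caller, not interfaces defined by this application.

```python
from concurrent.futures import ThreadPoolExecutor

def run_steps(num_steps, n_ahead, select_key_kv, load_kv, compute_step):
    """Overlap the prefetch for step t + n_ahead with the computation of step t."""
    pending = {}   # step index -> Future holding that step's prefetched key KV cache
    outputs = []
    with ThreadPoolExecutor(max_workers=1) as pool:
        for t in range(1, num_steps + 1):
            # Start the prefetch for a future step; it runs in the background.
            future_step = t + n_ahead
            if future_step <= num_steps:
                pending[future_step] = pool.submit(load_kv, select_key_kv(future_step))
            # Use the prefetched cache when one exists; otherwise load on demand.
            if t in pending:
                kv = pending.pop(t).result()
            else:
                kv = load_kv(select_key_kv(t))
            # Query vector and importance score computation for step t.
            outputs.append(compute_step(t, kv))
    return outputs
```

In this sketch the first `n_ahead` steps still load their key KV cache on demand; every later step finds its cache already requested while earlier steps were being computed.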
[0008] In some implementations, the predicted target information for each generation step in at least one future generation step can be determined based on the target information of historical generation steps; the key KV cache can then be determined based on the predicted target information. The historical generation steps include at least one generation step from generation step 1 to generation step T, the target information includes an importance score vector and / or a query vector, and the predicted target information includes a predicted importance score vector and / or a predicted query vector.
[0009] Understandably, during sequence generation, the target information between steps typically exhibits coherence and relevance. Leveraging this inherent connection, we can predict the target information for subsequent steps using the target information generated in previous steps. This prediction allows us to obtain predicted target information for future steps. Once we have this information, we can further utilize it to optimize our model. Specifically, importance scores are calculated based on query vectors and key-value pairs (K and V), so we can apply the predicted target information to filter out the key-value pairs with the highest relevance for caching. Furthermore, by using target information from historical steps to determine the predicted target information for a specific future step (e.g., step T+N), we can effectively reduce the accuracy loss caused by sparse compression and asynchronous prefetching operations in the key-value cache (KV cache). This approach not only improves the model's prediction accuracy but also optimizes resource utilization efficiency, ensuring the stability of model performance when processing large amounts of data.
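As one possible illustration of predicting target information from historical steps, the sketch below averages past importance score vectors with exponentially decaying weights. The application does not name a specific predictor, so the averaging rule, the `decay` parameter, and the assumption that all historical vectors cover the same cached positions are hypothetical choices made for the example.

```python
def predict_scores(history, decay=0.5):
    """Exponentially weighted average of past importance score vectors.

    `history` is ordered oldest to newest; the newest vector weighs most.
    Assumes every vector has the same length (same cached positions).
    """
    if not history:
        raise ValueError("history must not be empty")
    length = len(history[0])
    weighted = [0.0] * length
    total_weight = 0.0
    # Walk from newest to oldest; each older step counts `decay` times less.
    for age, vec in enumerate(reversed(history)):
        w = decay ** age
        total_weight += w
        for i, s in enumerate(vec):
            weighted[i] += w * s
    return [x / total_weight for x in weighted]
```

The same scheme could be applied to query vectors instead of importance score vectors; a learned predictor would be another option.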
[0010] In some implementations, the historical generation steps include the T-th generation step, and the predicted target information can be determined based on the target information after the second time step. The second time step is the time when the query vector or importance score is calculated in the T-th generation step.
[0011] Understandably, the historical generation step includes the T-th generation step, which indicates that the target information for future generation steps needs to be estimated using the target information from the T-th generation step. Therefore, to ensure the accuracy of the estimation, the estimation of the target information for future generation steps can only begin after the target information has been obtained in the T-th generation step (i.e., after the second time step).
[0012] In some implementations, a first importance score can be determined based on the predicted importance score vector, wherein the first importance score is the largest of the multiple importance scores in the predicted importance score vector; and the key KV cache can be determined based on the first importance score, wherein the key KV cache includes the KV cache of the first importance score.
[0013] Understandably, the importance score is directly proportional to the relevance of the generation step, indicating that the higher the predicted importance score, the stronger its relevance to future generation steps. Based on this logic, we can adopt a strategy of using the key-value caches that predict the highest importance scores as key-value caches for future generation steps. Furthermore, by utilizing importance scores to estimate key-value caches, we can not only effectively reduce memory consumption but also better model the spatiotemporal correlations in the attention mechanism. This method has strong universality and demonstrates significant advantages in improving model accuracy.
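The selection by largest predicted importance score can be sketched as a top-k pick over the predicted importance score vector. The function name and the parameter `k` (how many cache entries to keep) are assumptions for illustration.

```python
import heapq

def select_key_positions(predicted_scores, k):
    """Return the positions of the k largest predicted importance scores, in position order."""
    top = heapq.nlargest(k, range(len(predicted_scores)),
                         key=predicted_scores.__getitem__)
    return sorted(top)
```

For example, `select_key_positions([0.1, 0.5, 0.05, 0.35], 2)` picks positions 1 and 3, so only the K and V vectors at those positions would be prefetched as the key KV cache.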
[0014] In some implementations, the predicted importance score vector can be divided into blocks based on the spatial location or similarity of the predicted importance scores in the predicted importance score vector blocks to obtain multiple predicted importance score blocks for each generation step in the at least one future generation step; a second importance score is determined based on the multiple predicted importance score blocks, wherein the second importance score is the importance score of the multiple predicted importance score blocks with the largest values among the multiple predicted importance score blocks; and the key KV cache is determined based on the second importance score, wherein the key KV cache includes the KV cache of the second importance score.
[0015] Understandably, segmenting the predicted importance score vector into blocks based on spatial location or similarity allows these scores to be analyzed and processed more efficiently. The magnitude of the scores in a block is directly proportional to the block's relevance to the generation step; that is, the larger the predicted importance scores in a block, the higher the block's relevance to future generation steps. Based on this principle, we can adopt a strategy of using the KV caches corresponding to the importance scores in the highest-scoring predicted importance score blocks as the key KV caches for future generation steps. The advantage of this approach is that it ensures that the most critical contextual information is prioritized and utilized during generation, thereby improving the quality and efficiency of generation. In this way, we can more precisely control and optimize the generation process, ensuring that the generated content not only conforms to the expected semantics and style but also excels in structure and coherence.
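A block-wise variant can be sketched as below. It assumes blocks are formed by spatial location as contiguous, fixed-size runs of positions, and that a block is ranked by the sum of its scores; both are illustrative choices, since the application also allows similarity-based grouping and does not fix the ranking rule.

```python
def select_key_blocks(predicted_scores, block_size, num_blocks):
    """Pick the highest-scoring contiguous blocks and return their positions."""
    # Split the score vector into contiguous blocks of `block_size` positions.
    blocks = [(start, predicted_scores[start:start + block_size])
              for start in range(0, len(predicted_scores), block_size)]
    # Rank blocks by total score (assumed ranking rule) and keep the best ones.
    best = sorted(blocks, key=lambda b: sum(b[1]), reverse=True)[:num_blocks]
    positions = []
    for start, block in sorted(best, key=lambda b: b[0]):
        positions.extend(range(start, start + len(block)))
    return positions
```

Fetching whole blocks rather than scattered positions tends to produce contiguous reads, which is one reason a block granularity may be preferred in practice.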
[0016] In some implementations, the importance score of the cached KV cache can be determined based on the predicted query vector, wherein the cached KV cache includes the KV cache of at least one generation step from the first generation step to the Tth generation step; a third importance score can be determined based on the importance score of the cached KV cache, wherein the third importance score is one of the largest importance scores among the importance scores of the cached KV cache; and the key KV cache can be determined based on the third importance score, wherein the key KV cache includes the KV cache with the third importance score.
[0017] Understandably, by predicting the query vector, we can predict the importance scores of both the cached key-value cache and future generation steps. The importance score is directly proportional to the relevance of the generation step; that is, the higher the importance score of a cached key-value cache, the higher its relevance to future generation steps. Based on this logic, we can adopt a strategy of selecting the key-value caches corresponding to the highest importance scores of the cached key-value caches and using them as the key key-value caches for future generation steps. This method can effectively improve the cache hit rate, thereby optimizing overall query efficiency and performance.
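When the predicted target information is a query vector rather than a score vector, the selection can be sketched by first scoring every cached key against the predicted query. The dot-product score and the function names are assumptions for illustration.

```python
def importance_of_cached(predicted_query, cached_keys):
    """Score each cached key by its dot product with the predicted query vector."""
    return [sum(q * k for q, k in zip(predicted_query, key)) for key in cached_keys]

def select_by_predicted_query(predicted_query, cached_keys, cached_values, k):
    """Return (position, key, value) for the k highest-scoring cached entries."""
    scores = importance_of_cached(predicted_query, cached_keys)
    best = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return [(i, cached_keys[i], cached_values[i]) for i in sorted(best)]
```

The returned entries are the candidates to prefetch as the key KV cache for the future generation step.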
[0018] In some implementations, the importance score of the cached KV cache can be determined based on the predicted query vector, wherein the cached KV cache includes the KV cache of at least one generation step from the first generation step to the Tth generation step; the importance score of the cached KV cache is divided into multiple importance score blocks based on the spatial location or similarity of the importance score of the cached KV cache; a fourth importance score is determined based on the multiple importance score blocks, wherein the fourth importance score is the importance score of the multiple importance score blocks with the largest value; and the key KV cache is determined based on the fourth importance score, wherein the key KV cache includes the KV cache with the fourth importance score.
[0019] Understandably, by deeply analyzing and processing the importance scores of cached key-value (KV) caches, we can more effectively assess the importance of these cached items. The magnitude of the importance score is directly proportional to its relevance to the generation step; this means that the larger the importance score block of a cached KV cache, the higher its relevance to future generation steps. Based on this logic, we can adopt a strategy of selecting multiple predicted cached KV cache blocks with the highest importance scores and considering the KV caches pointed to by the corresponding importance scores in these blocks as key KV caches for future generation steps.
[0020] In some implementations, the aforementioned predicted query vector can be input into the database to obtain the aforementioned key KV cache.
[0021] Understandably, databases used to output key-value caches based on input query vectors have relatively low development costs. This is because databases can effectively utilize existing data structures and indexing mechanisms when processing queries, thereby quickly locating and retrieving the required information. Therefore, obtaining key-value caches through database lookups not only improves data retrieval efficiency but also significantly reduces the cost of acquiring key-value caches. This approach has demonstrated its economic efficiency and practicality in many application scenarios, especially in situations requiring frequent access to and updates of large amounts of data.
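The database lookup can be pictured with a toy in-memory store. The class `KeyKVStore`, its methods, and the dot-product similarity are hypothetical; a real deployment would presumably rely on an existing database with a suitable index rather than the linear scan shown here.

```python
class KeyKVStore:
    """Toy in-memory stand-in for a database queried with a predicted query vector."""

    def __init__(self):
        self._rows = []  # each row: (position, key vector, value vector)

    def insert(self, position, key, value):
        self._rows.append((position, key, value))

    def lookup(self, query, k):
        """Return the k rows whose key has the largest dot product with the query."""
        def similarity(row):
            return sum(q * x for q, x in zip(query, row[1]))
        return sorted(self._rows, key=similarity, reverse=True)[:k]
```

The caller only passes the predicted query vector and receives KV entries back, so the selection logic stays inside the store.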
[0022] In some implementations, the target information of the above-mentioned generation step T can be saved, including the importance score vector and / or query vector.
[0023] It is understandable that by continuously collecting and updating the target information to be predicted during the inference process of large models, we can significantly improve the prediction accuracy of key-value caches. This continuous information collection and updating mechanism is crucial for ensuring that the model can accurately identify and predict key information when processing large amounts of data. It not only enhances the model's ability to understand data, but also enables the model to more accurately predict data related to the target information, thereby providing more accurate and efficient services in practical applications such as natural language processing and image recognition.
[0024] In some implementations, the target information of the aforementioned (T-N)-th generation step is deleted.
[0025] Understandably, target information with low relevance to future generation steps (such as the T+Nth generation step) will significantly affect the prediction accuracy of the key KV cache for those future steps. Therefore, to ensure the prediction accuracy of the key KV cache, we can take measures to remove portions of the target information that are low in relevance to future generation steps. This approach helps improve the model's efficiency and accuracy in processing future steps because it reduces unnecessary information interference, allowing the model to focus more on data closely related to future steps.
[0026] In some implementations, if the number of saved pieces of target information exceeds a threshold, the target information of the aforementioned (T-N)-th generation step is deleted.
[0027] Understandably, prolonged estimation of critical KV cache can accumulate a large amount of useless data. This data not only consumes valuable storage resources but may also negatively impact the estimation results of critical KV cache, thereby reducing its accuracy. Therefore, to ensure the prediction accuracy of critical KV cache, measures should be taken to delete the oldest saved information when the amount of saved target information exceeds a certain threshold. In this way, outdated or irrelevant target information can be effectively removed, thus preventing it from adversely affecting cache performance and accuracy.
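The save-and-evict behaviour can be sketched with a bounded history that drops the oldest entry once a threshold is exceeded. The class name `TargetInfoHistory` and the parameter `max_entries` are illustrative, and evicting strictly by age is an assumption based on the description above.

```python
from collections import deque

class TargetInfoHistory:
    """Keeps target information for recent generation steps, oldest first."""

    def __init__(self, max_entries):
        self._max_entries = max_entries
        self._entries = deque()

    def save(self, step, scores=None, query=None):
        """Store the importance score vector and/or query vector of one step."""
        self._entries.append((step, scores, query))
        # Delete the oldest saved target information once the threshold is exceeded.
        while len(self._entries) > self._max_entries:
            self._entries.popleft()

    def steps(self):
        return [entry[0] for entry in self._entries]
```

With `max_entries` set to 3, saving steps 1 through 5 leaves only steps 3, 4, and 5 available for predicting future target information.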
[0028] In a second aspect, embodiments of this application provide another data processing method that can be applied to an edge-cloud system, which includes a cloud-side device and an edge-side device. The method includes: the cloud-side device prefetching key KV cache for at least one future generation step at a first moment, wherein the first moment is any moment of the Tth generation step, and the Tth generation step is used to generate token T; and the edge-side device calculating the importance score of at least one future generation step based on the key KV cache, where T and N are positive integers.
[0029] In some implementations, the cloud-side device determines the predicted target information for each generation step in at least one future generation step based on the target information of the historical generation steps. The historical generation steps include at least one generation step from the first generation step to the Tth generation step. The target information includes an importance score vector and / or a query vector. The predicted target information includes a predicted importance score vector and / or a predicted query vector. The cloud-side device determines the key KV cache based on the predicted target information.
[0030] In some implementations, the aforementioned historical generation step includes a T-th generation step. After the second time point, the cloud-side device determines the aforementioned predicted target information based on the aforementioned target information. The aforementioned second time point is the time when the query vector or importance score is calculated in the T-th generation step.
[0031] In some implementations, the cloud-side device determines a first importance score based on the predicted importance score vector, wherein the first importance score is the largest of the multiple importance scores in the predicted importance score vector; the cloud-side device determines the key KV cache based on the first importance score, wherein the key KV cache includes the KV cache of the first importance score.
[0032] In some implementations, the cloud-side device divides the predicted importance score vector into blocks based on the spatial location or similarity of the predicted importance scores in the predicted importance score vector blocks to obtain multiple predicted importance score blocks for each generation step in the at least one future generation step; the cloud-side device determines a second importance score based on the multiple predicted importance score blocks, wherein the second importance score is the importance score of the multiple predicted importance score blocks with the largest values among the multiple predicted importance score blocks; the cloud-side device determines the key KV cache based on the second importance score, wherein the key KV cache includes the KV cache of the second importance score.
[0033] In some implementations, the cloud-side device determines the importance score of the cached KV cache based on the predicted query vector, wherein the cached KV cache includes the KV cache of at least one generation step from the first generation step to the Tth generation step; the cloud-side device determines a third importance score based on the importance score of the cached KV cache, wherein the third importance score is one of the largest importance scores among the importance scores of the cached KV cache; the cloud-side device determines the key KV cache based on the third importance score, wherein the key KV cache includes the KV cache with the third importance score.
[0034] In some implementations, the cloud-side device determines the importance score of the cached KV cache based on the predicted query vector. The cached KV cache includes the KV cache of at least one generation step from the first generation step to the Tth generation step. The cloud-side device divides the importance score of the cached KV cache into multiple importance score blocks based on the spatial location or similarity of the importance score of the cached KV cache. The cloud-side device determines a fourth importance score based on the multiple importance score blocks. The fourth importance score is the importance score of the multiple importance score blocks with the largest value. The cloud-side device determines the key KV cache based on the fourth importance score. The key KV cache includes the KV cache with the fourth importance score.
[0035] In some implementations, the cloud-side device inputs the predicted query vector into the database to obtain the key KV cache.
[0036] In some implementations, the cloud-side device stores the target information of the Tth generation step, which includes an importance score vector and / or a query vector.
[0037] In some implementations, the cloud-side device deletes the target information of the aforementioned (T-N)-th generation step.
[0038] In some implementations, the cloud-side device deletes the target information of the (T-N)-th generation step if the number of saved pieces of target information exceeds a threshold.
[0039] In a third aspect, embodiments of this application provide a data processing apparatus, which may be an electronic device, a module applied to an electronic device (such as a processor, chip, or chip system), or a logic node, logic module, or software capable of implementing all or part of the functions of an electronic device. The apparatus includes a transceiver unit and a processing unit.
[0040] The transceiver unit is used to prefetch key KV caches for at least one future generation step at a first time, where the first time is any time of the T-th generation step, and the T-th generation step is used to generate token T, where T and N are positive integers.
[0041] The processing unit is used to calculate the importance score of at least one future generation step based on the aforementioned key KV cache.
[0042] In some implementations, the processing unit is further configured to: determine the predicted target information for each generation step in at least one future generation step based on the target information of the historical generation steps, wherein the historical generation steps include at least one generation step from the first generation step to the Tth generation step, the target information includes an importance score vector and / or a query vector, and the predicted target information includes a predicted importance score vector and / or a predicted query vector; and determine the key KV cache based on the predicted target information.
[0043] In some implementations, the processing unit is specifically used to: determine the predicted target information based on the target information after the second time step, wherein the second time step is the time when the query vector or importance score is calculated in the Tth generation step.
[0044] In some implementations, the processing unit is specifically used to: determine a first importance score based on the predicted importance score vector, wherein the first importance score is one of the largest importance scores in the predicted importance score vector; and determine the key KV cache based on the first importance score, wherein the key KV cache includes the KV cache of the first importance score.
[0045] In some implementations, the processing unit is specifically used to: divide the predicted importance score vector into blocks based on the spatial location or similarity of the predicted importance scores in the predicted importance score vector blocks to obtain multiple predicted importance score blocks for each generation step in the at least one future generation step; determine a second importance score based on the multiple predicted importance score blocks, wherein the second importance score is the importance score of the multiple predicted importance score blocks with the largest values among the multiple predicted importance score blocks; and determine the key KV cache based on the second importance score, wherein the key KV cache includes the KV cache of the second importance score.
[0046] In some implementations, the processing unit is specifically used to: determine the importance score of the cached KV cache based on the predicted query vector, wherein the cached KV cache includes the KV cache of at least one generation step from the first generation step to the Tth generation step; determine a third importance score based on the importance score of the cached KV cache, wherein the third importance score is one of the largest importance scores among the importance scores of the cached KV cache; and determine the key KV cache based on the third importance score, wherein the key KV cache includes the KV cache with the third importance score.
[0047] In some implementations, the processing unit is specifically used to: determine the importance score of the cached KV cache based on the predicted query vector, wherein the cached KV cache includes the KV cache of at least one generation step from the first generation step to the Tth generation step; divide the importance score of the cached KV cache into multiple importance score blocks based on the spatial location or similarity of the importance score of the cached KV cache; determine a fourth importance score based on the multiple importance score blocks, wherein the fourth importance score is the importance score of the multiple importance score blocks with the largest value; and determine the key KV cache based on the fourth importance score, wherein the key KV cache includes the KV cache with the fourth importance score.
[0048] In some implementations, the processing unit is specifically used to: input the predicted query vector into the database to obtain the key KV cache.
[0049] In some implementations, the aforementioned transceiver unit is also used to: store the target information of the aforementioned generation step T, wherein the target information includes an importance score vector and / or a query vector.
[0050] In some implementations, the aforementioned transceiver unit is also used to: delete the target information of the aforementioned (T-N)-th generation step.
[0051] In some implementations, the aforementioned transceiver unit is specifically used to: delete the target information of the aforementioned (T-N)-th generation step when the number of saved pieces of target information exceeds a certain threshold.
[0052] In a fourth aspect, embodiments of this application also provide a cloud-side device, which includes a transceiver unit and a processing unit.
[0053] The transceiver unit is used to prefetch key KV caches for at least one future generation step at a first time, where the first time is any time of the T-th generation step, and the T-th generation step is used to generate token T, where T and N are positive integers.
[0054] The processing unit is used to calculate the importance score of at least one future generation step based on the aforementioned key KV cache.
[0055] In some implementations, the processing unit is further configured to: determine the predicted target information for each generation step in at least one future generation step based on the target information of the historical generation steps, wherein the historical generation steps include at least one generation step from the first generation step to the Tth generation step, the target information includes an importance score vector and / or a query vector, and the predicted target information includes a predicted importance score vector and / or a predicted query vector; and determine the key KV cache based on the predicted target information.
[0056] In some implementations, the processing unit is specifically used to: determine the predicted target information based on the target information after the second time step, wherein the second time step is the time when the query vector or importance score is calculated in the Tth generation step.
[0057] In some implementations, the processing unit is specifically used to: determine a first importance score based on the predicted importance score vector, wherein the first importance score is one of the largest importance scores in the predicted importance score vector; and determine the key KV cache based on the first importance score, wherein the key KV cache includes the KV cache of the first importance score.
[0058] In some implementations, the processing unit is specifically used to: divide the predicted importance score vector into blocks based on the spatial location or similarity of the predicted importance scores in the predicted importance score vector blocks to obtain multiple predicted importance score blocks for each generation step in the at least one future generation step; determine a second importance score based on the multiple predicted importance score blocks, wherein the second importance score is the importance score of the multiple predicted importance score blocks with the largest values among the multiple predicted importance score blocks; and determine the key KV cache based on the second importance score, wherein the key KV cache includes the KV cache of the second importance score.
[0059] In some implementations, the processing unit is specifically used to: determine the importance score of the cached KV cache based on the predicted query vector, wherein the cached KV cache includes the KV cache of at least one generation step from the first generation step to the Tth generation step; determine a third importance score based on the importance score of the cached KV cache, wherein the third importance score is one of the largest importance scores among the importance scores of the cached KV cache; and determine the key KV cache based on the third importance score, wherein the key KV cache includes the KV cache with the third importance score.
[0060] In some implementations, the processing unit is specifically used to: determine the importance score of the cached KV cache based on the predicted query vector, wherein the cached KV cache includes the KV cache of at least one generation step from the first generation step to the Tth generation step; divide the importance score of the cached KV cache into multiple importance score blocks based on the spatial location or similarity of the importance score of the cached KV cache; determine a fourth importance score based on the multiple importance score blocks, wherein the fourth importance score is the importance score of the multiple importance score blocks with the largest value; and determine the key KV cache based on the fourth importance score, wherein the key KV cache includes the KV cache with the fourth importance score.
[0061] In some implementations, the processing unit is specifically used to: input the predicted query vector into the database to obtain the key KV cache.
[0062] In some implementations, the aforementioned transceiver unit is also used to: store the target information of the aforementioned generation step T, wherein the target information includes an importance score vector and / or a query vector.
[0063] In some implementations, the aforementioned transceiver unit is also used to: delete the target information of the aforementioned (T-N)-th generation step.
[0064] In some implementations, the aforementioned transceiver unit is specifically used to: delete the target information of the aforementioned (T-N)-th generation step when the number of saved pieces of target information exceeds a certain threshold.
[0065] In a fifth aspect, embodiments of this application also provide an end-side device, which includes a transceiver unit and a processing unit.
[0066] The transceiver unit is used to prefetch key KV caches for at least one future generation step at a first time, where the first time is any time of the T-th generation step, and the T-th generation step is used to generate token T, where T and N are positive integers.
[0067] The processing unit is used to calculate the importance score of at least one future generation step based on the aforementioned key KV cache.
[0068] In some implementations, the processing unit is further configured to: determine the predicted target information for each generation step in at least one future generation step based on the target information of the historical generation steps, wherein the historical generation steps include at least one generation step from the first generation step to the Tth generation step, the target information includes an importance score vector and / or a query vector, and the predicted target information includes a predicted importance score vector and / or a predicted query vector; and determine the key KV cache based on the predicted target information.
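The predictor itself is not specified in this paragraph. Purely as an assumed stand-in, the sketch below predicts the next query vector as an exponentially weighted average of the query vectors of historical generation steps; the function name and the decay value are hypothetical.

```python
from typing import List, Sequence


def predict_query(
    history: Sequence[Sequence[float]], decay: float = 0.5
) -> List[float]:
    """Predict a future query vector from historical query vectors.

    Assumption: recent steps matter more, so the vector from the most
    recent step gets weight 1, the one before it `decay`, and so on.
    """
    if not history:
        raise ValueError("at least one historical query vector is required")
    last = len(history) - 1
    weights = [decay ** (last - i) for i in range(len(history))]
    total = sum(weights)
    dim = len(history[0])
    return [
        sum(w * q[d] for w, q in zip(weights, history)) / total
        for d in range(dim)
    ]


# Hypothetical query vectors of three historical generation steps.
history = [[0.0, 4.0], [2.0, 2.0], [4.0, 0.0]]
predicted = predict_query(history)
```

The predicted vector can then be scored against the cached keys to determine the key KV cache before the real query vector of the future step exists.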
[0069] In some implementations, the processing unit is specifically used to: determine the predicted target information based on the target information after the second time step, wherein the second time step is the time when the query vector or importance score is calculated in the Tth generation step.
[0070] In some implementations, the processing unit is specifically used to: determine a first importance score based on the predicted importance score vector, wherein the first importance score is one of the largest importance scores in the predicted importance score vector; and determine the key KV cache based on the first importance score, wherein the key KV cache includes the KV cache of the first importance score.
[0071] In some implementations, the processing unit is specifically used to: divide the predicted importance score vector into blocks, based on the spatial location or similarity of the predicted importance scores in that vector, to obtain multiple predicted importance score blocks for each generation step in the at least one future generation step; determine a second importance score based on the multiple predicted importance score blocks, wherein the second importance score is the importance score of the blocks with the largest values among the multiple predicted importance score blocks; and determine the key KV cache based on the second importance score, wherein the key KV cache includes the KV cache of the second importance score.
[0072] In some implementations, the processing unit is specifically used to: determine the importance score of the cached KV cache based on the predicted query vector, wherein the cached KV cache includes the KV cache of at least one generation step from the first generation step to the Tth generation step; determine a third importance score based on the importance score of the cached KV cache, wherein the third importance score is one of the largest importance scores among the importance scores of the cached KV cache; and determine the key KV cache based on the third importance score, wherein the key KV cache includes the KV cache with the third importance score.
[0073] In some implementations, the processing unit is specifically used to: determine the importance score of the cached KV cache based on the predicted query vector, wherein the cached KV cache includes the KV cache of at least one generation step from the first generation step to the Tth generation step; divide the importance score of the cached KV cache into multiple importance score blocks based on the spatial location or similarity of the importance score of the cached KV cache; determine a fourth importance score based on the multiple importance score blocks, wherein the fourth importance score is the importance score of the multiple importance score blocks with the largest value; and determine the key KV cache based on the fourth importance score, wherein the key KV cache includes the KV cache with the fourth importance score.
[0074] In some implementations, the processing unit is specifically used to: input the predicted query vector into the database to obtain the key KV cache.
[0075] In some implementations, the aforementioned transceiver unit is also used to: store the target information of the aforementioned generation step T, wherein the target information includes an importance score vector and / or a query vector.
[0076] In some implementations, the aforementioned transceiver unit is also used to: delete the target information of the (T-N)th generation step.
[0077] In some implementations, the aforementioned transceiver unit is specifically used to: delete the target information of the (T-N)th generation step when the number of saved target information entries exceeds a certain threshold.
[0078] In a sixth aspect, embodiments of this application also provide an edge-cloud system, including at least one cloud-side device and at least one edge-side device, wherein the at least one cloud-side device includes the cloud-side device described in the fourth aspect above or any possible implementation thereof, and the at least one edge-side device includes the end-side device described in the fifth aspect above or any possible implementation thereof.
[0079] In a seventh aspect, embodiments of this application also provide an end-side device, which includes a plurality of processors and a memory, wherein the plurality of processors execute programs or instructions stored in the memory to enable the end-side device to implement the method described in the first aspect or any possible implementation thereof.
[0080] In an eighth aspect, embodiments of this application also provide a cloud-side device, which includes a plurality of processors and a memory. The plurality of processors execute programs or instructions stored in the memory to enable the cloud-side device to implement the method described in the first aspect or any possible implementation thereof.
[0081] In a ninth aspect, embodiments of this application also provide a data processing apparatus, the apparatus comprising: at least one processor, which, when the at least one processor executes program code or instructions, implements the method described in the first aspect or any possible implementation thereof.
[0082] Optionally, the data processing apparatus may be a chip or a chip system.
[0083] Optionally, the device may further include at least one memory for storing the program code or instructions.
[0084] In a tenth aspect, embodiments of this application also provide a chip, including: an input interface, an output interface, and at least one processor. Optionally, the chip further includes a memory. The at least one processor is used to execute code in the memory, and when the at least one processor executes the code, the chip implements the method described in the first aspect or any possible implementation thereof.
[0085] Optionally, the chip described above can also be an integrated circuit.
[0086] In an eleventh aspect, embodiments of this application also provide a computer-readable storage medium for storing a computer program, the computer program including instructions for implementing the method described in the first aspect or any possible implementation thereof.
[0087] In one possible implementation, the computer-readable storage medium is a non-transitory computer-readable medium.
[0088] In a twelfth aspect, embodiments of this application also provide a computer program product containing instructions that, when run on a computer, cause the computer to implement the method described in the first aspect or any possible implementation thereof.
[0089] The data processing apparatus, computer storage medium, computer program product, and chip provided in the embodiments of this application are all used to execute the data processing method provided above. Therefore, for the beneficial effects they can achieve, refer to the beneficial effects of the data processing method provided above, which are not repeated here.
Description of the Drawings
[0090] To more clearly illustrate the technical solutions in the embodiments of this application, the accompanying drawings used in the description of the embodiments will be briefly introduced below. Obviously, the accompanying drawings described below are only some embodiments of the embodiments of this application. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.
[0091] Figure 1 is a schematic diagram of a large language model inference system;
[0092] Figure 2 is a timing diagram of inference in a large language model;
[0093] Figure 3 is a flowchart of a data processing method provided in an embodiment of this application;
[0094] Figure 4 is a schematic diagram of a data processing system provided in an embodiment of this application;
[0095] Figure 5 is a timing diagram of a data processing method provided in an embodiment of this application;
[0096] Figure 6 is a timing diagram of another data processing method provided in an embodiment of this application;
[0097] Figure 7 is a flowchart of a data processing method provided in an embodiment of this application;
[0098] Figure 8 is a schematic diagram of a timing prediction module provided in an embodiment of this application;
[0099] Figure 9 is a timing diagram of yet another data processing method provided in an embodiment of this application;
[0100] Figure 10 is a timing diagram of yet another data processing method provided in an embodiment of this application;
[0101] Figure 11 is a flowchart of another data processing method provided in an embodiment of this application;
[0102] Figure 12 is a flowchart of another data processing method provided in an embodiment of this application;
[0103] Figure 13 is a schematic diagram of the structure of a data processing device provided in an embodiment of this application;
[0104] Figure 14 is a schematic diagram of the structure of a chip provided in an embodiment of this application;
[0105] Figure 15 is a schematic diagram of the structure of an electronic device provided in an embodiment of this application.
Detailed Implementation
[0106] The technical solutions of the embodiments of this application will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the embodiments of this application, and not all embodiments. Based on the embodiments of this application, all other embodiments obtained by those of ordinary skill in the art without creative effort are within the protection scope of the embodiments of this application.
[0107] In this document, the term "and / or" merely describes the relationship between related objects, indicating that three relationships are possible. For example, A and / or B can represent three situations: A exists alone, A and B exist simultaneously, and B exists alone.
[0108] The terms "first" and "second," etc., in the specification and drawings of the embodiments of this application are used to distinguish different objects or to distinguish different treatments of the same object, rather than to describe a specific order of objects.
[0109] Furthermore, the terms "comprising" and "having," and any variations thereof, used in the description of the embodiments of this application are intended to cover non-exclusive inclusion. For example, a process, method, system, product, or device that includes a series of steps or units is not limited to the steps or units listed, but may optionally include other steps or units not listed, or may optionally include other steps or units inherent to these processes, methods, products, or devices.
[0110] It should be noted that in the description of the embodiments of this application, the words "exemplarily" or "for example" are used to indicate examples, illustrations, or explanations. Any embodiment or design scheme described as "exemplarily" or "for example" in the embodiments of this application should not be construed as being more preferred or advantageous than other embodiments or design schemes. Specifically, the use of the words "exemplarily" or "for example" is intended to present the relevant concepts in a specific manner.
[0111] The following explains the terminology used in the solutions provided in the embodiments of this application:
[0112] Large language models (LLMs) are a class of deep learning-based artificial intelligence models designed to process and generate natural language text. Trained on large-scale text data, these models can understand and generate text similar to human language, performing various natural language processing tasks including text generation, translation, and sentiment analysis.
[0113] The core of large language models lies in training with massive datasets. Through layered neural network structures, they learn and simulate the complex rules of human language, thereby achieving near-human-level text generation capabilities. Compared to traditional natural language processing models, large language models can better understand and generate natural text, while also demonstrating a certain level of logical thinking and reasoning ability.
[0114] Large language models typically employ a similar Transformer architecture and pre-training objective as small models, but the main difference lies in increasing model size, training data, and computational resources.
[0115] The large language model operation process includes a prefilling stage and a decoding stage.
[0116] Prefilling refers to the process of generating the first token from a large model based on the complete input sequence (prompt).
[0117] Prefilling focuses more on efficient allocation of computing resources and is suitable for using more complex parallelization techniques (such as tensor parallelism or pipeline parallelism) to improve overall performance.
[0118] Decoding refers to the process from generating the first token until the termination condition is met (such as encountering the end character or reaching the maximum length limit).
[0119] Decoding mainly focuses on optimization measures to effectively manage a large number of intermediate variables (especially key-value caches) to ensure good response speed and service quality even in large-scale deployment scenarios.
[0120] Attention models are widely used in deep learning. Their core idea is to assign different weights to different parts of the input sequence, thereby helping the model better focus on important information. Attention models typically employ an encoder-decoder framework, where the encoder converts the input sequence into a fixed-length vector representation, and the decoder generates the output sequence based on this vector representation.
[0121] The core of the Attention model lies in introducing an attention mechanism, enabling the model to assign different weights to different parts of the input sequence when generating the output. The specific steps are as follows:
[0122] Calculate attention weights: For each output word, calculate its similarity to each word in the input sequence to determine the contribution of each input word to the output word. Common similarity calculation methods include dot product and cosine similarity.
[0123] Weighted summation: Based on the calculated attention weights, the vectors of the input words are summed to obtain a weighted vector representation.
[0124] Context vector: The weighted vector representation is concatenated with the previous hidden state of the Decoder to obtain the context vector. This vector contains information from the input sequence related to the current output word, helping the Decoder to generate the output word more effectively.
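The three steps above can be sketched as follows. This is a minimal illustration, assuming cosine similarity as the similarity measure and a softmax to turn similarities into weights; the function names and the example vectors are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def context_vector(
    decoder_state: Sequence[float], encoder_states: Sequence[Sequence[float]]
) -> Tuple[List[float], List[float]]:
    """Return the attention weights and the resulting context vector."""
    # Step 1: similarity of the decoder state to each input position,
    # normalised into weights (softmax is an assumption of this sketch).
    sims = [cosine(decoder_state, h) for h in encoder_states]
    exps = [math.exp(s) for s in sims]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Step 2: weighted sum of the input vectors.
    dim = len(encoder_states[0])
    weighted = [
        sum(w * h[j] for w, h in zip(weights, encoder_states))
        for j in range(dim)
    ]
    # Step 3: concatenate with the previous decoder hidden state.
    return weights, weighted + list(decoder_state)


weights, context = context_vector([1.0, 0.0], [[2.0, 0.0], [0.0, 3.0]])
```

Here the first input vector points in the same direction as the decoder state, so it receives the larger weight.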
[0125] Attention models can be used for machine translation, speech recognition, image processing, question answering systems, text classification, or recommendation systems.
[0126] The Transformer model is a deep learning architecture primarily used in the field of natural language processing (NLP). By introducing a self-attention mechanism, this model can weigh the importance of different positions in the input sequence when generating the output, thus better handling long-range dependencies.
[0127] At the heart of the Transformer is its self-attention mechanism, which allows the model to consider all positions in the input sequence simultaneously, rather than processing them step-by-step as in traditional recurrent neural networks (RNNs) or convolutional neural networks (CNNs). The self-attention mechanism calculates attention weights between each position and other positions, then outputs these weighted position vectors. Furthermore, the Transformer includes multiple encoder and decoder layers, each consisting of multiple attention mechanism modules and feedforward neural network modules, used to encode the input sequence into a high-dimensional feature vector representation and decode the vector into the target sequence.
[0128] Attention Score is a key concept in machine learning and natural language processing, especially in Transformer models. It represents the degree of attention a word or token gives to other words or tokens when processing text.
[0129] The calculation of the Attention Score typically involves the following steps:
[0130] Query (Q), Key (K), and Value (V): In the Self-Attention process, each input token (usually a word vector) is first transformed into three vectors through a linear transformation: Query, Key, and Value.
[0131] Dot product and scaling: The query vector of each token is multiplied (dot product) by the key vectors of all tokens, and the result is scaled by a scaling factor (usually the reciprocal of the square root of the key vector dimension).
[0132] Softmax function: The scaled dot product result is processed by the Softmax function to obtain the attention score of the token to all other tokens.
[0133] In the Transformer model, the core of the self-attention mechanism utilizes these attention scores. Each token has a corresponding attention score with all other tokens, and these scores are used to weight and sum the value vectors of all tokens to obtain the contextual representation of the current token. This mechanism enables the model to better understand the contextual relationships in the text, thereby improving the performance of NLP tasks.
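The calculation described above can be sketched for a single query as follows. This is a plain-Python illustration of scaled dot-product attention; the linear transformations that produce Q, K, and V are omitted, and the example vectors are assumptions.

```python
import math
from typing import List, Sequence, Tuple


def softmax(xs: Sequence[float]) -> List[float]:
    """Numerically stable softmax."""
    peak = max(xs)
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention(
    query: Sequence[float],
    keys: Sequence[Sequence[float]],
    values: Sequence[Sequence[float]],
) -> Tuple[List[float], List[float]]:
    """Return (attention scores, context representation) for one query."""
    scale = 1.0 / math.sqrt(len(query))  # reciprocal of sqrt(key dimension)
    logits = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    scores = softmax(logits)
    dim_v = len(values[0])
    output = [
        sum(s * v[j] for s, v in zip(scores, values)) for j in range(dim_v)
    ]
    return scores, output


# Hypothetical key and value vectors of two tokens.
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0]]
scores, output = attention([1.0, 0.0], keys, values)
```

The scores sum to 1, and the output leans toward the value vector of the token whose key best matches the query.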
[0134] KV cache is a technique used to accelerate the inference process of large language models, especially in Transformer models. By caching the key and value matrices (K and V) in the attention mechanism, KV cache avoids redundant calculations, thereby significantly improving inference efficiency.
[0135] During the inference process of the Transformer model, each time a new token is generated, the query vector (Q) of that token is calculated and multiplied (dot product) with the key matrix held in the cache. Attention weights are then calculated using the softmax function. These weights are applied to the value matrix in the cache, and the final output is obtained through a weighted summation. This avoids repeatedly calculating the key and value matrices of all previous tokens.
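A minimal sketch of one decoding step with a KV cache is given below. It assumes the query, key, and value vectors of the new token have already been computed and are passed in directly; the class and function names are hypothetical.

```python
import math
from typing import List, Sequence


class KVCache:
    """Stores the key and value vectors of every token processed so far."""

    def __init__(self) -> None:
        self.keys: List[List[float]] = []
        self.values: List[List[float]] = []

    def append(self, key: Sequence[float], value: Sequence[float]) -> None:
        self.keys.append(list(key))
        self.values.append(list(value))


def decode_step(
    query: Sequence[float],
    key: Sequence[float],
    value: Sequence[float],
    cache: KVCache,
) -> List[float]:
    """Attend from the new token over all cached tokens plus itself."""
    # Only the new token's K and V are added; earlier ones are reused.
    cache.append(key, value)
    scale = 1.0 / math.sqrt(len(query))
    logits = [
        scale * sum(q * k for q, k in zip(query, cached_key))
        for cached_key in cache.keys
    ]
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    weights = [e / total for e in exps]
    return [
        sum(w * v[j] for w, v in zip(weights, cache.values))
        for j in range(len(value))
    ]


cache = KVCache()
outputs = []
for q, k, v in [
    ([1.0, 0.0], [1.0, 0.0], [1.0, 2.0]),
    ([0.0, 1.0], [0.0, 1.0], [3.0, 4.0]),
    ([1.0, 1.0], [1.0, 1.0], [5.0, 6.0]),
]:
    outputs.append(decode_step(q, k, v, cache))
```

Each step appends exactly one key and one value, so the work per step grows with the cache length rather than requiring all earlier keys and values to be recomputed.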
[0136] Key KV cache: KV cache that is highly relevant to the token to be generated among the cached KV caches.
[0137] The generation step is the step used to generate the token, which includes the token reasoning process (such as calculating the query vector and calculating the Attention Score).
[0138] Large language models can generate tokens based on input. The token generation process requires calculating the importance score of each token to be generated. Calculating the importance score requires the Q-vector, K-vector, and V-vector of the token to be generated, as well as the K-vector and V-vector of the already generated tokens. To avoid repeatedly calculating the K-vector and V-vector of the already generated tokens, the K-vector and V-vector of the already generated tokens can be saved as a KV cache.
[0139] For example, to calculate the importance score of token 1000, we need the Q vector, K vector, and V vector of token 1000, as well as the K vector and V vector of tokens 1 to 999 that have already been generated. The K vector and V vector of tokens 1 to 999 can be saved as a KV cache.
[0140] For long output sequences, due to the large number of output tokens, the corresponding KV cache is large, potentially occupying hundreds of GB or even hundreds of TB of memory. This makes it impossible to store the data in the memory of graphics processing units (GPUs) / neural network processing units (NPUs), which typically only provide a few GB or tens of GB. Therefore, the industry usually offloads the KV cache from GPU / NPU memory to central processing unit (CPU) memory or large-capacity storage media such as solid-state drives (SSDs) or hard disk drives (HDDs).
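To give a sense of scale, the size of a KV cache can be estimated with simple arithmetic. The model dimensions below (32 layers, 32 key-value heads, head dimension 128, 2 bytes per value) are hypothetical and chosen only for illustration.

```python
def kv_cache_bytes(
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    seq_len: int,
    bytes_per_value: int = 2,
    batch_size: int = 1,
) -> int:
    """Estimate KV cache size in bytes; the factor 2 covers K and V."""
    return (
        2
        * num_layers
        * num_kv_heads
        * head_dim
        * seq_len
        * bytes_per_value
        * batch_size
    )


# Hypothetical model: 32 layers, 32 KV heads, head dimension 128, 16-bit values.
per_token = kv_cache_bytes(32, 32, 128, seq_len=1)
total_gib = kv_cache_bytes(32, 32, 128, seq_len=1_000_000) / 2**30
```

Under these assumptions each token occupies 512 KiB, so a sequence of one million tokens needs roughly 488 GiB, well beyond the memory of a single GPU / NPU.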
[0141] Correspondingly, each time the GPU / NPU calculates the importance score of the token to be generated, it needs to read the entire cached KV cache from the large-capacity storage medium. However, due to the large size of the KV cache, reading the entire cached KV cache takes a long time, resulting in a large KV cache transmission latency.
[0142] To reduce KV cache transmission latency, only key KV caches that are highly relevant to the token to be generated can be read from the cached KV cache. When calculating the importance score of the token to be generated, the key KV cache is used.
[0143] As shown in Figure 1, in the related art, during the prefilling stage of a large language model, the first processor of an electronic device can offload all KV caches stored in the first processor's memory and transfer them to the memory of the second processor. Offloading helps manage memory resources and keeps the electronic device running smoothly when it performs complex tasks. Subsequently, during the decoding stage of the large language model, the second processor of the electronic device, after calculating the query vector for each generation step, loads the key KV cache corresponding to that generation step. Using the query vector of each generation step together with these key KV caches, the second processor can calculate the importance score of each generation step and output the corresponding result.
[0144] As shown in Figure 2, the generation steps (such as the T-th and (T+1)-th generation steps in the figure) include calculating the query vector, estimating the key KV cache, reading the key KV cache, and calculating the importance score. Since the reading of the key KV cache and the calculation of the query vector cannot be performed in parallel, the key KV cache cannot be read until the query vector is calculated, which delays the process of calculating the importance score based on the key KV cache.
[0145] To address this performance bottleneck, embodiments of this application provide a data processing method and apparatus for improving the throughput of large models. By optimizing the data processing flow and algorithms, embodiments of this application aim to reduce data transmission latency and improve data processing efficiency, thereby increasing the throughput of large models.
[0146] The solutions provided in this application can be used for online inference acceleration scenarios in long-sequence, multimodal, and high-order inference systems. They can be combined with lookup-based computation (i.e., KV cache retrieval such as vector database retrieval) systems and KV cache offloading systems. These systems and methods can work together to further improve the speed and accuracy of data processing, providing strong support for complex inference tasks.
[0147] The solutions provided in this application can also be used for key KV cache identification and KV cache transmission volume compression in scenarios such as KV cache sparsity and inference acceleration, KV cache retrieval, prefill decoding (PD) separation, edge-cloud collaboration, and distributed KV cache management. By applying these technologies, data transmission volume can be effectively reduced, system latency lowered, and overall processing capacity improved, thereby achieving more efficient data processing and inference acceleration in various application scenarios.
[0148] The solutions provided in this application can be executed by cloud-side devices, edge devices, or edge-cloud systems.
[0149] For example, edge devices may include servers, computers, laptop computers, mobile phones, tablets, mice, remote controls, styluses, set-top boxes, routers, cameras, monitors, smart displays, wireless data cards, personal digital assistants (PDAs), smartwatches, smart bracelets, wireless headphones, electronic whiteboards, virtual reality (VR) devices, augmented reality (AR) devices, smart home devices (e.g., refrigerators, televisions, air conditioners, washing machines, rice cookers, table lamps, electricity meters, etc.), smart robots, robotic arms, workshop equipment, or other electronic devices.
[0150] Cloud-side devices can include servers, computers, laptops, mobile phones, tablets, mice, remote controls, styluses, set-top boxes, routers, cameras, monitors, smart displays, wireless data cards, personal digital assistants, smartwatches, smart bracelets, wireless headphones, electronic whiteboards, virtual reality devices, augmented reality devices, smart home devices (e.g., refrigerators, televisions, air conditioners, washing machines, rice cookers, desk lamps, electricity meters, etc.), smart robots, robotic arms, workshop equipment, or other electronic devices.
[0151] Figure 3 illustrates a data processing method provided by an embodiment of this application. As shown in Figure 3, the method includes:
[0152] S301. Prefetch the key KV cache for at least one future generation step at the first moment.
[0153] Here, the first moment refers to any moment in the T-th generation step. T and N are positive integers, for example, N can be 1.
[0154] Here, the Tth generation step specifically refers to the step used to generate token T. For example, the 999th generation step is specifically used to generate token 999, while the 1000th generation step is specifically used to generate token 1000.
[0155] At least one future generation step includes the T+Nth generation step.
[0156] Furthermore, the key KV caches for future generation steps (such as the T+Nth generation step) are those KV caches that are most relevant to future generation steps (such as the T+Nth generation step) among all cached KV caches.
[0157] In this process, at least one future generation step may contain multiple subsequent generation steps.
[0158] These subsequent generation steps can be consecutive or non-consecutive.
[0159] For example, the subsequent generation steps can be the (T+1)th generation step, the (T+2)th generation step, and the (T+3)th generation step.
[0160] As another example, the subsequent generation steps can be the (T+1)th generation step, the (T+3)th generation step, and the (T+5)th generation step.
[0161] When there are multiple subsequent generation steps, the key KV caches for these steps can be prefetched in parallel or serially.
[0162] When the key KV caches of multiple generation steps are prefetched serially, prefetching can be performed in the chronological order of these generation steps.
[0163] For example, assuming that the future generation steps include generation step T+1, generation step T+2, and generation step T+3, when serially prefetching the key KV cache of multiple generation steps, the key KV cache of generation step T+1 is prefetched first. After the key KV cache of generation step T+1 is prefetched, the key KV cache of generation step T+2 is prefetched next. After the key KV cache of generation step T+2 is prefetched, the key KV cache of generation step T+3 is prefetched last.
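The difference between serial and parallel prefetching can be sketched as follows. The loader function is a hypothetical stand-in for reading the key KV cache of one generation step from large-capacity storage, and the use of a thread pool is an assumption of this sketch.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, Iterable


def load_key_kv(step: int) -> str:
    """Hypothetical stand-in for reading one step's key KV cache."""
    return f"key-kv-for-step-{step}"


def prefetch_serial(steps: Iterable[int]) -> Dict[int, str]:
    """Prefetch one step after another, in chronological order."""
    return {step: load_key_kv(step) for step in sorted(steps)}


def prefetch_parallel(steps: Iterable[int]) -> Dict[int, str]:
    """Issue all reads at once and collect the results in step order."""
    ordered = sorted(steps)
    with ThreadPoolExecutor(max_workers=max(1, len(ordered))) as pool:
        return dict(zip(ordered, pool.map(load_key_kv, ordered)))


T = 10
future_steps = [T + 1, T + 2, T + 3]
serial = prefetch_serial(future_steps)
parallel = prefetch_parallel(future_steps)
```

Both variants return the same data; the parallel variant may finish sooner when the storage medium can serve several reads concurrently, while the serial variant guarantees that the nearest generation step is ready first.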
[0164] The following description uses one of the generation steps in at least one future generation step as an example to illustrate the solution provided in this application. For specific implementation methods of other generation steps in at least one future generation step, please refer to the description of that generation step.
[0165] For example, the electronic device can prefetch the key KVcache of the (T+1)th generation step at any time during the Tth generation step.
[0166] As shown in Figure 4, at any point during the T-th generation step of the decoding phase, the second processor in the electronic device can prefetch the key KV cache of the (T+1)th generation step. The key KV cache of the (T+1)th generation step is transferred (loaded) from the second processor's memory to the first processor's memory, which reduces the time spent exchanging data between processors and keeps data processing continuous. The electronic device can therefore respond more quickly to complex data processing needs and maintain efficient, stable operation under high load.
[0167] As shown in Figure 5, at the first moment of generation step T, the key KV cache required for generation step T+1 can be prefetched. Because the key KV cache is predicted and loaded ahead of time, it is ready when generation step T+1 needs it, which reduces waiting time and improves processing speed.
[0168] S302. Calculate the importance score of at least one future generation step based on the key KV cache of at least one future generation step.
[0169] Specifically, the importance score of each generation step can be calculated based on the key KV cache of each generation step in at least one future generation step.
[0170] For example, assuming that at least one future generation step includes generation step T+1, generation step T+2, and generation step T+3, the importance score of generation step T+1 can be calculated based on the key KV cache of generation step T+1, the importance score of generation step T+2 can be calculated based on the key KV cache of generation step T+2, and the importance score of generation step T+3 can be calculated based on the key KV cache of generation step T+3.
[0171] For example, an electronic device can use the key KV cache of the T+Nth generation step to calculate the importance score of the T+Nth generation step, a process based on a time series prediction mechanism.
[0172] As shown in Figure 4, the first processor in the electronic device can use the prefetched key KV cache of the (T+1)th generation step and the query vector of the (T+1)th generation step to calculate and output the importance score of the (T+1)th generation step.
[0173] As shown in Figure 5, after obtaining the key KV cache for the (T+N)th generation step, the electronic device can calculate the importance score of the (T+N)th generation step based on the query vector and key KV cache of the (T+N)th generation step.
[0174] In some implementations, the aforementioned importance score can be defined as the Attention Score, a scoring method used for attention mechanisms in deep learning models.
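The source does not give a formula for the importance score. As a minimal sketch, assuming the standard scaled dot-product attention weight (softmax of q·k / sqrt(d)), the score of one query vector against a set of cached key vectors could be computed as follows. The function name `attention_scores` is illustrative and does not come from the application.

```python
import math

def attention_scores(query, keys):
    """Scaled dot-product attention weights of one query over a list of key vectors.

    Assumption: the importance score is the usual softmax(q.k / sqrt(d)) weight.
    """
    d = len(query)
    logits = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    peak = max(logits)                       # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]
```

The returned weights sum to 1, and a key that is better aligned with the query receives a larger weight.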
[0175] Referring to Figure 6, in the current technical architecture a generation step includes calculating the query vector, estimating the key KV cache, reading the key KV cache, and calculating the importance score. Because reading the KV cache and calculating the query vector cannot be performed in parallel, the KV cache can be read only after the query vector calculation is complete, which delays the calculation of the importance score based on the KV cache. To address this performance bottleneck, the embodiments of this application prefetch the key KV cache for at least one future generation step in parallel with calculating the query vector or the importance score. That is, a generation step of this application includes calculating the query vector, calculating the importance score, and prefetching the key KV cache for at least one future generation step. In this way, once the query vector has been calculated in a future generation step, the prefetched key KV cache can be used immediately to calculate the importance score, which improves the throughput of the large language model and enables it to generate more tokens per second.
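The overlap described in paragraph [0175] can be sketched with a background worker that fetches the key KV cache for a later step while the current step is being computed. This is an illustration only: `fetch_kv`, `compute_query`, `compute_score`, and the `lookahead` parameter are hypothetical stand-ins, and a real system would overlap device computation with memory or storage transfers rather than Python threads.

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_kv(step):
    """Hypothetical stand-in for reading the key KV cache of `step` from slow storage."""
    return {"step": step, "kv": [step] * 4}

def compute_query(step):
    """Hypothetical stand-in for the query-vector computation of `step`."""
    return [float(step)] * 4

def compute_score(query, kv):
    """Hypothetical stand-in for the importance-score computation."""
    return sum(q * k for q, k in zip(query, kv["kv"]))

def generate(num_steps, lookahead=1):
    scores = []
    with ThreadPoolExecutor(max_workers=1) as pool:
        # The first `lookahead` steps have no earlier step to overlap with.
        pending = {t: pool.submit(fetch_kv, t) for t in range(1, lookahead + 1)}
        for t in range(1, num_steps + 1):
            nxt = t + lookahead
            if nxt <= num_steps:
                # Prefetch for step t + lookahead while step t is computed.
                pending[nxt] = pool.submit(fetch_kv, nxt)
            query = compute_query(t)
            kv = pending.pop(t).result()     # blocks only if the prefetch is not done yet
            scores.append(compute_score(query, kv))
    return scores
```

With `lookahead=1`, the fetch for step T+1 is issued at the start of step T, so step T+1 waits for its data only if that fetch takes longer than the computation of step T.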
[0176] As shown in Figure 7, in some implementations, the following steps may be performed before prefetching the key KV cache for at least one future generation step:
[0177] S303. Determine the predicted target information for at least one future generation step based on the target information of the historical generation steps.
[0178] For example, as shown in Figure 4, the target information of at least one future generation step can be predicted based on the target information of the historical generation steps (that is, the historical target information).
[0179] In this process, the history generation step covers at least one generation step from the first generation step to the Tth generation step.
[0180] For example, the history generation steps can be the entire sequence from the 1st generation step to the Tth generation step.
[0181] As another example, the historical generation steps can also be the segment from the (T-N)th generation step to the Tth generation step.
[0182] In these steps, the target information typically includes an importance score vector and / or a query vector.
[0183] For example, the target information could include only the importance score vector.
[0184] For example, the target information can also include only the query vector.
[0185] Of course, the target information can also include both importance score vectors and query vectors.
[0186] Accordingly, the predicted target information includes a predicted importance score vector and / or a predicted query vector.
[0187] For example, the predicted target information could include only the predicted importance score vector.
[0188] Alternatively, the predicted target information can include only the predicted query vector.
[0189] Similarly, the predicted target information can also include both the predicted importance score vector and the predicted query vector.
[0190] In this field, the process of determining the predicted target information for at least one future generation step using the target information of historical generation steps can employ any method conceived by those skilled in the art, and the embodiments of this application do not specifically limit this. In other words, the embodiments of this application do not restrict what specific technical means can be used to achieve the conversion from historical data to future predicted target information.
[0191] For example, a time-series prediction module can be used to determine the predicted target information for at least one future generation step based on the target information of historical generation steps. The structure of the time-series prediction module is shown in Figure 8. In the figure, each square diagram is a heatmap of the target information (attention score vector and / or query vector): the vertical axis represents the input-output sequence, the horizontal axis represents the KV cache token sequence, and the fill pattern represents the magnitude of the target information. The scheme provided in this application models the sequence dimension as a time series and the token dimension as a spatial sequence, and uses the time-series prediction module to predict the target information for subsequent steps. Specifically, the target information of the historical generation steps is provided as input to the time-series prediction module, which outputs the predicted target information for generation step T+N.
[0192] In the implementation of the time series prediction module, a deep learning model can be used to learn the patterns in information through supervised training. This deep learning model can capture complex patterns and dependencies in the data, thereby improving the accuracy of predictions.
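The application describes a trained deep learning model for this prediction and does not specify its architecture. Purely to illustrate the input and output shape of such a module, the sketch below swaps in a much simpler technique, an exponentially weighted average of past importance score vectors. The function name `predict_scores` and the `decay` parameter are assumptions made for this example.

```python
def predict_scores(history, decay=0.5):
    """Predict the next importance score vector from past ones.

    Stand-in for the learned time-series module: an exponentially weighted
    average in which the newest vector has weight 1, the previous one
    `decay`, then `decay**2`, and so on.
    """
    if not history:
        raise ValueError("history must contain at least one score vector")
    width = len(history[-1])
    predicted = [0.0] * width
    weight_sum = 0.0
    weight = 1.0
    for vector in reversed(history):
        # Older vectors may be shorter because the KV cache grows each step;
        # positions they do not cover contribute 0.
        for i, value in enumerate(vector[:width]):
            predicted[i] += weight * value
        weight_sum += weight
        weight *= decay
    return [p / weight_sum for p in predicted]
```

For instance, with history `[[0.0, 1.0], [1.0, 0.0]]` and `decay=0.5`, the newer vector counts twice as much as the older one, so the prediction leans toward the first token position.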
[0193] For example, the top-k operation can be used to determine the predicted target information for at least one future generation step based on the target information of historical generation steps. The top-k operation involves selecting the k elements with the highest scores or values from a given dataset. This method predicts the target information for future steps by filtering out the most important information.
[0194] In some implementations, the historical generation steps include the Tth generation step, and the predicted target information is determined based on the target information after a second moment, where the second moment is the moment at which the query vector or the importance score has been calculated in the Tth generation step.
[0195] As shown in Figure 9, assume that the historical generation steps include the Tth generation step, the target information includes the query vector, and the predicted target information includes the predicted query vector. In that case, the predicted query vector for the T+Nth generation step can be determined from the query vectors of the historical generation steps only after the moment at which the query vector is obtained within the query-vector calculation period of the Tth generation step (i.e., the second moment in Figure 9). The key KV cache for the T+Nth generation step is then determined based on the predicted query vector. Waiting until this moment ensures that the prediction uses the query vector of the Tth generation step.
[0196] As shown in Figure 10, assume that the historical generation steps include the Tth generation step, the target information includes the importance score, and the predicted target information includes the predicted importance score. In that case, the predicted importance score for the T+Nth generation step can be determined from the importance scores of the historical generation steps only after the moment at which the importance score is obtained within the importance-score calculation period of the Tth generation step (i.e., the second moment in Figure 10). The key KV cache for the T+Nth generation step is then determined based on the predicted importance score. In this way, the prediction of the importance score is based on sufficient historical information, which improves the reliability of the prediction.
[0197] Understandably, because the historical generation steps include the Tth generation step, the target information of the T+Nth generation step is estimated using the target information of the Tth generation step. Therefore, to ensure the accuracy of the estimation, the estimation for the T+Nth generation step can begin only after the target information has been obtained in the Tth generation step (i.e., after the second moment).
[0198] S304. Determine the key KV cache for at least one future generation step based on the predicted target information of at least one future generation step.
[0199] For example, as shown in Figure 4, the location of the key tokens can be derived from the predicted target information. Once the location of the key tokens is determined, the key KV cache can be located and prefetched. This step matters for model performance because it ensures rapid access to key information during generation.
[0200] Understandably, during sequence generation, the target information between generation steps typically exhibits coherence and correlation. Leveraging this inherent connection, we can predict the target information for subsequent generation steps using the target information generated in previous steps. This prediction allows us to obtain predicted target information for future generation steps. Once we have this information, we can further utilize it to optimize our model. Specifically, importance scores are calculated based on query vectors and key-value pairs (K and V), so we can apply the predicted target information to filter out the key-value pairs with the highest relevance for caching. Furthermore, by using the target information from historical generation steps to determine the predicted target information for a specific future generation step (e.g., step T+N), we can effectively reduce the accuracy loss caused by sparse compression and asynchronous prefetching operations in the key-value cache (KV cache). This approach not only improves the model's prediction accuracy but also optimizes resource utilization efficiency, ensuring the stability of model performance when processing large amounts of data.
[0201] In some implementations, the first importance score can be determined based on the predicted importance score vector; and the key KV cache can be determined based on the first importance score.
[0202] The predicted importance score vector contains multiple predicted importance scores.
[0203] The first importance score is one of the largest importance scores among the multiple predicted importance scores in the predicted importance score vector.
[0204] The aforementioned key KV cache includes the KV cache corresponding to the first importance score.
[0205] For example, K first importance scores can be selected from the predicted importance score vector; the key KV cache can be determined based on the K first importance scores.
[0206] Where K is the pre-defined number of KV caches that need to be loaded.
[0207] For example, assuming K is set to 128, the predicted importance score vector contains 1000 different importance scores. In this case, we can select the top 128 scores from these 1000 importance scores and consider them as the first importance scores, i.e., the key tokens. Based on these 128 first importance scores, we can determine the location information of the key KV cache for the T+Nth generation step. Once we have this location information, we can pre-load the key KV cache required for the T+Nth generation step to ensure the smoothness of the generation process.
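The selection in paragraph [0207] amounts to a top-K over token positions. A minimal sketch using the Python standard library follows; the function name `top_k_positions` is illustrative, and the returned positions stand for the location information of the key KV cache.

```python
import heapq

def top_k_positions(predicted_scores, k):
    """Return the positions of the k largest predicted scores, in ascending position order."""
    best = heapq.nlargest(k, range(len(predicted_scores)),
                          key=predicted_scores.__getitem__)
    return sorted(best)
```

With K = 128 and a predicted importance score vector of 1000 entries, the function returns 128 positions, and only the KV cache entries at those positions need to be prefetched for the T+Nth generation step.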
[0208] Understandably, the importance score is directly proportional to the relevance to the generation step: the higher a predicted importance score, the stronger the relevance of the corresponding KV cache to future generation steps (e.g., the T+Nth generation step). Based on this logic, the KV caches with the highest predicted importance scores can be used as the key KV caches for future generation steps (e.g., the T+Nth generation step). Furthermore, estimating the key KV cache from importance scores not only reduces memory consumption but also better models the spatiotemporal correlations in the attention mechanism. This method is broadly applicable and helps improve model accuracy.
[0209] In some implementations, the predicted importance score vector can be divided into blocks based on the spatial location or similarity of the predicted importance scores in the vector, to obtain multiple predicted importance score blocks for each generation step in the at least one future generation step; a second importance score is determined based on the multiple predicted importance score blocks, and the key KV cache is determined based on the second importance score.
[0210] Each predicted importance score block contains multiple predicted importance scores, so each predicted importance score block can also be viewed as a dataset or vector.
[0211] The second importance score refers to an importance score in the predicted importance score block(s) with the largest value among the multiple predicted importance score blocks.
[0212] The key KV cache includes the KV cache corresponding to the second importance score.
[0213] Spatial location refers to the position of the predicted importance score in the predicted importance score vector, or the position of the KV cache corresponding to the predicted importance score in the KV cache vector.
[0214] Similarity refers to the cosine similarity, high-dimensional spatial distance, or other similarity measures between the predicted importance score and other scores in the predicted importance score vector; similarly, it can also refer to the cosine similarity, high-dimensional spatial distance, or other similarity measures between the KVcache corresponding to the predicted importance score and other KV caches in the KV cache vector.
[0215] For example, the predicted importance score vector can be divided into several predicted importance score blocks. Then, based on the values of these blocks, the K / b blocks with the largest values are selected. The importance scores contained in the selected K / b blocks are used as the K second importance scores, from which the key KV cache can be determined.
[0216] For example, the predicted importance score vector consists of 120 predicted importance scores, which are divided into 12 predicted importance score blocks, each containing 10 (i.e., b) predicted importance scores. Next, the value of each block is determined based on the 10 predicted importance scores in each block. Subsequently, the two highest-valued predicted importance score blocks are selected, and the 20 (i.e., K) predicted importance scores from these two blocks are designated as the second importance scores.
[0217] In this process, the value of the predicted importance score block can be the maximum, minimum, or average of the multiple predicted importance scores it contains, which reflects the characteristics of the predicted importance score block.
[0218] The size of the block is represented by b, which is the number of importance scores contained in each predicted importance score block.
[0219] For example, assume K is set to 128 and b is set to 2. When the predicted importance score vector contains 1000 importance scores, these scores can be divided into 500 predicted importance score blocks, from which the 64 blocks with the highest values are selected. The 128 importance scores within these 64 blocks serve as the second importance scores (key tokens). Based on these 128 second importance scores, the location information of the key KV cache for at least one future generation step can be determined, and the key KV cache for that generation step can then be prefetched.
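The block-wise selection of paragraphs [0215] to [0219] can be sketched as follows. The sketch covers only blocking by spatial location (contiguous positions); blocking by similarity is not shown. The function name `top_block_positions` and the `reduce` parameter, which selects how a block's value is derived from its scores, are assumptions made for this example.

```python
def top_block_positions(predicted_scores, k, b, reduce=max):
    """Split scores into contiguous blocks of size b and keep the k // b highest-valued blocks.

    `reduce` maps the scores of one block to the block's value (maximum by default).
    Returns the positions covered by the selected blocks, in ascending order.
    """
    if b <= 0 or k % b != 0:
        raise ValueError("b must be positive and k must be a multiple of b")
    n = len(predicted_scores)
    blocks = [range(start, min(start + b, n)) for start in range(0, n, b)]
    ranked = sorted(blocks,
                    key=lambda blk: reduce(predicted_scores[i] for i in blk),
                    reverse=True)
    chosen = ranked[: k // b]
    return sorted(i for blk in chosen for i in blk)
```

With 1000 scores, K = 128 and b = 2, the function forms 500 blocks, keeps 64 of them and returns 128 positions. With 120 scores, K = 20 and b = 10, it keeps 2 of 12 blocks. Passing `reduce=min`, or an averaging function such as `statistics.fmean`, changes how the block value is computed.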
[0220] Understandably, segmenting the predicted importance score vector into blocks based on spatial location or similarity allows these scores to be analyzed and processed more efficiently. The value of an importance score block is directly proportional to its relevance to the generation step; that is, the larger the value of a predicted importance score block, the higher its relevance to future generation steps (e.g., the T+Nth generation step). Based on this principle, the KV cache corresponding to the importance scores in the highest-valued predicted importance score blocks can be used as the key KV cache for future generation steps (e.g., the T+Nth generation step). The advantage of this approach is that the most critical contextual information is prioritized during generation, which improves the quality and efficiency of generation. In this way, the generation process can be controlled and optimized more precisely, so that the generated content conforms to the expected semantics and style and remains structurally coherent.
[0221] In some implementations, the importance score of the cached KV cache can be determined based on the predicted query vector, wherein the cached KV cache includes the KV cache of at least one generation step from the first generation step to the Tth generation step; a third importance score can be determined based on the importance score of the cached KV cache, wherein the third importance score is one of the largest importance scores among the importance scores of the cached KV cache; and the key KV cache can be determined based on the third importance score, wherein the key KV cache includes the KV cache with the third importance score.
[0222] Understandably, the predicted query vector can be used to predict the importance scores of the cached KV cache with respect to future generation steps (e.g., the T+Nth generation step). The importance score is directly proportional to the relevance to the generation step; that is, the higher the importance score of a cached KV cache, the higher its relevance to future generation steps (e.g., the T+Nth generation step). Based on this logic, the cached KV caches with the highest importance scores can be selected and used as the key KV caches for future generation steps (e.g., the T+Nth generation step). This method can effectively improve the cache hit rate, thereby optimizing overall query efficiency and performance.
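When the predicted target information is a query vector rather than a score vector, the importance scores first have to be computed against the cached keys. A minimal sketch follows, assuming the score is the dot product between the predicted query and each cached key vector. Softmax normalization is omitted because it is monotonic and does not change the ranking. The function name `key_positions_from_query` is illustrative.

```python
import heapq

def key_positions_from_query(predicted_query, cached_keys, k):
    """Rank cached key vectors by dot product with the predicted query; return the top-k positions."""
    scores = [sum(q * x for q, x in zip(predicted_query, key)) for key in cached_keys]
    best = heapq.nlargest(k, range(len(scores)), key=scores.__getitem__)
    return sorted(best)
```

The returned positions identify the cached KV entries with the third importance scores, that is, the key KV cache to prefetch.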
[0223] In some implementations, the importance score of the cached KV cache can be determined based on the predicted query vector, wherein the cached KV cache includes the KV cache of at least one generation step from the first generation step to the Tth generation step; the importance score of the cached KV cache is divided into multiple importance score blocks based on the spatial location or similarity of the importance score of the cached KV cache; a fourth importance score is determined based on the multiple importance score blocks, wherein the fourth importance score is the importance score of the multiple importance score blocks with the largest value; and the key KV cache is determined based on the fourth importance score, wherein the key KV cache includes the KV cache with the fourth importance score.
[0224] Understandably, analyzing the importance scores of the cached KV cache block by block makes it possible to assess the importance of the cached items more effectively. The value of an importance score block is directly proportional to the relevance to the generation step, meaning that the larger the value of an importance score block of the cached KV cache, the higher its relevance to future generation steps (such as the T+Nth generation step). Based on this logic, the importance score blocks with the highest values can be selected, and the KV caches corresponding to the importance scores in these blocks can be used as the key KV caches for future generation steps (such as the T+Nth generation step).
[0225] In some implementations, the aforementioned predicted query vector can be input into the database to obtain the aforementioned key KVcache.
[0226] Understandably, databases used to output key-value caches based on input query vectors have relatively low development costs. This is because databases can effectively utilize existing data structures and indexing mechanisms when processing queries, thereby quickly locating and retrieving the required information. Therefore, obtaining key-value caches through database lookups not only improves data retrieval efficiency but also significantly reduces the cost of acquiring key-value caches. This approach has demonstrated its economic efficiency and practicality in many application scenarios, especially in situations requiring frequent access to and updates of large amounts of data.
[0227] As shown in Figure 4, after the importance score is calculated and output, the historical target information can be updated.
[0228] In some implementations, the target information of the above-mentioned generation step T can be saved, including the importance score vector and / or query vector.
[0229] It is understandable that by continuously collecting and updating the target information to be predicted during the inference process of large models, we can significantly improve the prediction accuracy of key-value caches. This continuous information collection and updating mechanism is crucial for ensuring that the model can accurately identify and predict key information when processing large amounts of data. It not only enhances the model's ability to understand data, but also enables the model to more accurately predict data related to the target information, thereby providing more accurate and efficient services in practical applications such as natural language processing and image recognition.
[0230] In some implementations, the target information of the (T-N)th generation step is deleted.
[0231] Understandably, target information with low relevance to future generation steps (such as the T+Nth generation step) will significantly affect the prediction accuracy of the key KV cache for those future steps. Therefore, to ensure the prediction accuracy of the key KV cache, we can take measures to remove portions of the target information that are low in relevance to future generation steps. This approach helps improve the model's efficiency and accuracy in processing future steps because it reduces unnecessary information interference, allowing the model to focus more on data closely related to future steps.
[0232] In some implementations, if the number of saved pieces of target information exceeds a threshold, the target information of the (T-N)th generation step is deleted.
[0233] Understandably, prolonged estimation of critical KV cache can accumulate a large amount of useless data. This data not only consumes valuable storage resources but may also negatively impact the estimation results of critical KV cache, thereby reducing its accuracy. Therefore, to ensure the prediction accuracy of critical KV cache, measures should be taken to delete the oldest saved information when the amount of saved target information exceeds a certain threshold. In this way, outdated or irrelevant target information can be effectively removed, thus preventing it from adversely affecting cache performance and accuracy.
[0234] In some implementations, historical target information can be a queue, where the latest target information is added each time, and the oldest target information is deleted.
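The save-and-evict behavior of paragraphs [0228] to [0234] maps onto a bounded queue. A minimal sketch using `collections.deque` with `maxlen` follows; the class name `TargetHistory` and its methods are illustrative, and `capacity` plays the role of the threshold.

```python
from collections import deque

class TargetHistory:
    """Fixed-capacity store of per-step target information (oldest entry dropped first)."""

    def __init__(self, capacity):
        # A deque with maxlen discards the oldest item automatically when full.
        self._items = deque(maxlen=capacity)

    def save(self, step, target_info):
        """Record the target information (score vector and/or query vector) of one step."""
        self._items.append((step, target_info))

    def steps(self):
        return [step for step, _ in self._items]

    def vectors(self):
        return [info for _, info in self._items]
```

After saving steps 1 to 5 into a history of capacity 3, only steps 3, 4 and 5 remain, so the predictor always sees the most recent window of target information.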
[0235] The method provided in this application embodiment can be applied to electronic devices including a first processor and a second processor.
[0236] The method provided in the embodiments of this application is described below with reference to the above-mentioned electronic device. As shown in Figure 11, the method includes:
[0237] S1101, The first processor determines the predicted target information for at least one future generation step based on the target information of the historical generation steps.
[0238] The first processor can be a CPU or a switch chip.
[0239] S1102, The first processor determines the key KV cache for at least one future generation step based on the predicted target information of at least one future generation step.
[0240] S1103, the first processor prefetches the key KV cache for at least one future generation step at the first moment.
[0241] S1104, The second processor calculates the importance score of at least one future generation step based on the key KV cache of at least one future generation step.
[0242] The second processor can be a GPU, NPU, artificial intelligence (AI) accelerator, heterogeneous accelerator (X processing unit, XPU), or data processing unit (DPU).
[0243] For the implementation of S1101 to S1104 described above, refer to S301 to S304 described above; details are not repeated here.
[0244] It should be noted that the implementation described above is merely illustrative and does not represent the only execution mode in actual operation. In actual execution, steps S1101 to S1104 can be completed by the first processor alone, by the second processor independently, or even further, by the first and second processors working together to complete these steps.
[0245] When steps S1101 to S1104 are executed collaboratively by the first processor and the second processor, the first processor will be responsible for executing at least one step from S1101 to S1104, while the second processor will be responsible for executing the remaining steps. This division of labor can improve processing efficiency and ensure the successful completion of the task.
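The division of labor between the two processors can be sketched as a producer and a consumer connected by a bounded queue. This is only an illustration: the two Python threads stand in for the first processor (for example a CPU) and the second processor (for example a GPU or NPU), and the prefetched data and the score computation are hypothetical placeholders.

```python
import queue
import threading

def first_processor(steps, out_q):
    """Stand-in for S1101 to S1103: predict, select and prefetch the key KV cache per step."""
    for t in steps:
        key_kv = [float(t), float(t) + 1.0]   # hypothetical prefetched data
        out_q.put((t, key_kv))
    out_q.put(None)                            # sentinel: no more steps

def second_processor(in_q, results):
    """Stand-in for S1104: compute an importance score from each prefetched key KV cache."""
    while True:
        item = in_q.get()
        if item is None:
            break
        t, key_kv = item
        results[t] = sum(key_kv)               # placeholder for the real score computation

def run(steps):
    q = queue.Queue(maxsize=2)   # bounded: the prefetcher runs at most two steps ahead
    results = {}
    producer = threading.Thread(target=first_processor, args=(steps, q))
    consumer = threading.Thread(target=second_processor, args=(q, results))
    producer.start()
    consumer.start()
    producer.join()
    consumer.join()
    return results
```

The bounded queue limits how far ahead the prefetching side can run, which caps the memory held by prefetched KV cache entries that have not yet been consumed.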
[0246] The method provided in the embodiments of this application can also be applied to end-cloud systems that include cloud-side devices and end-side devices.
[0247] The method provided in the embodiments of this application is described below in conjunction with the aforementioned end-cloud system. As shown in Figure 12, the method includes:
[0248] S1201, The end-side device determines the predicted target information for at least one future generation step based on the target information of the historical generation steps.
[0249] S1202, The end-side device determines the key KV cache for at least one future generation step based on the predicted target information of at least one future generation step.
[0250] S1203, The end-side device prefetches the key KV cache of at least one future generation step at the first moment.
[0251] S1204. The cloud-side device calculates the importance score of at least one future generation step based on the key KV cache of at least one future generation step.
[0252] For the implementation of S1201 to S1204 described above, refer to S301 to S304 described above; details are not repeated here.
[0253] It should be noted that the implementation method described above is merely an illustrative example and does not represent the only execution mode in actual operation. In actual execution, steps S1201 to S1204 can be completed by the cloud-side device alone, by the end-side device alone, or even further, by the cloud-side device and the end-side device working together to complete these steps.
[0254] When steps S1201 to S1204 are executed collaboratively by the cloud-side device and the end-side device, the cloud-side device is responsible for executing at least one of steps S1201 to S1204, while the end-side device is responsible for executing the remaining steps. This division of labor can improve processing efficiency.
[0255] The following describes a data processing apparatus used to perform the above data processing methods.
[0256] It is understood that, in order to achieve the above-mentioned functions, the data processing device includes hardware and / or software modules corresponding to the execution of each function. Based on the algorithm steps of the various examples described in conjunction with the embodiments disclosed herein, the embodiments of this application can be implemented in hardware or a combination of hardware and computer software. Whether a function is executed in a manner that drives hardware or computer software depends on the specific application and design constraints of the technical solution. Those skilled in the art can use different methods to implement the described functions for each specific application in conjunction with the embodiments, but such implementation should not be considered beyond the scope of the embodiments of this application.
[0257] This application embodiment can divide the data processing device into functional modules according to the above method example. For example, each function can be divided into its own functional modules, or two or more functions can be integrated into one processing module. The integrated modules can be implemented in hardware. It should be noted that the module division in this application embodiment is illustrative and only represents one logical functional division. In actual implementation, there may be other division methods.
[0258] Figure 13 illustrates a possible structure of the data processing apparatus involved in the above embodiments. This apparatus can be an end-side device or a cloud-side device, or a module (such as a processor, chip, or chip system) applied to the end-side device or cloud-side device, or a logic node, logic module, or software capable of implementing all or part of the functions of the end-side device or cloud-side device. As shown in Figure 13, the data processing device 1300 may include a transceiver unit 1301 and a processing unit 1302.
[0259] The transceiver unit 1301 is used to prefetch the key KV cache for at least one future generation step at a first moment. The first moment is any moment within the Tth generation step, and the Tth generation step is used to generate token T. The at least one future generation step includes the T+Nth generation step, where T and N are positive integers.
[0260] Processing unit 1302 is used to calculate the importance score of at least one future generation step based on the aforementioned key KV cache.
[0261] In some implementations, the processing unit 1302 is further configured to: determine the predicted target information for each generation step in the at least one future generation step based on the target information of the historical generation steps, wherein the historical generation steps include at least one generation step from the first generation step to the Tth generation step, the target information includes an importance score vector and / or a query vector, and the predicted target information includes a predicted importance score vector and / or a predicted query vector; and determine the key KV cache based on the predicted target information.
[0262] In some implementations, the processing unit 1302 is specifically used to: determine the predicted target information based on the target information after a second moment, wherein the second moment is the moment at which the query vector or the importance score has been calculated in the Tth generation step.
[0263] In some implementations, the processing unit 1302 is specifically used to: determine a first importance score based on the predicted importance score vector, wherein the first importance score is one of the largest importance scores in the predicted importance score vector; and determine the key KV cache based on the first importance score, wherein the key KV cache includes the KV cache of the first importance score.
[0264] In some implementations, the processing unit 1302 is specifically used to: divide the predicted importance score vector into blocks according to the spatial position or similarity of the predicted importance scores in the predicted importance score vector blocks to obtain multiple predicted importance score blocks for each generation step in the at least one future generation step; determine a second importance score based on the multiple predicted importance score blocks, wherein the second importance score is the importance score of the multiple predicted importance score blocks with the largest values; and determine the key KV cache based on the second importance score, wherein the key KV cache includes the KV cache of the second importance score.
[0265] In some implementations, the processing unit 1302 is specifically used to: determine the importance score of the cached KV cache based on the predicted query vector, wherein the cached KV cache includes the KV cache of at least one generation step from the first generation step to the Tth generation step; determine a third importance score based on the importance score of the cached KV cache, wherein the third importance score is one of the largest importance scores among the importance scores of the cached KV cache; and determine the key KV cache based on the third importance score, wherein the key KV cache includes the KV cache with the third importance score.
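The path from a predicted query vector to the key KV cache can be sketched in Python as below. It assumes that the importance score is a softmax over scaled dot products between the predicted query and the cached keys, in line with the attention score given as an example in the Background; the function names and the parameter `k` are hypothetical.

```python
import math
from typing import List, Sequence


def importance_scores(query: Sequence[float],
                      cached_keys: Sequence[Sequence[float]]) -> List[float]:
    """Softmax of scaled dot products between one query and all cached keys."""
    if not cached_keys:
        return []
    scale = 1.0 / math.sqrt(len(query))
    logits = [scale * sum(q * k for q, k in zip(query, key))
              for key in cached_keys]
    peak = max(logits)  # subtract the maximum for numerical stability
    weights = [math.exp(logit - peak) for logit in logits]
    total = sum(weights)
    return [weight / total for weight in weights]


def key_positions_for_query(predicted_query: Sequence[float],
                            cached_keys: Sequence[Sequence[float]],
                            k: int) -> List[int]:
    """Positions of the k cached entries scoring highest for the predicted query."""
    scores = importance_scores(predicted_query, cached_keys)
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:max(k, 0)])
```

Because the query here is only a prediction, the selected positions are candidates for prefetching; the actual importance scores of the future step are still computed once its real query vector is available.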
[0266] In some implementations, the processing unit 1302 is specifically used to: determine the importance scores of the cached KV cache based on the predicted query vector, wherein the cached KV cache includes the KV cache of at least one generation step from the first generation step to the Tth generation step; divide the importance scores of the cached KV cache into multiple importance score blocks based on the spatial position or similarity of the importance scores of the cached KV cache; determine a fourth importance score based on the multiple importance score blocks, wherein the fourth importance score is the importance score of the blocks with the largest values among the multiple importance score blocks; and determine the key KV cache based on the fourth importance score, wherein the key KV cache includes the KV cache corresponding to the fourth importance score.
[0267] In some implementations, the processing unit 1302 is specifically used to: input the predicted query vector into the database to obtain the key KV cache.
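Inputting the predicted query vector into a database can be pictured as a similarity search over stored keys. The sketch below is a hypothetical in-memory stand-in, not the database of the application: the class `KVDatabase`, its methods, the brute-force scan, and the use of cosine similarity are all assumptions for illustration.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def _cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return sum(x * y for x, y in zip(a, b)) / norm if norm else 0.0


class KVDatabase:
    """Hypothetical store of (key, value) pairs searched by key similarity."""

    def __init__(self) -> None:
        self._entries: List[Tuple[Vector, Vector]] = []

    def add(self, key: Vector, value: Vector) -> int:
        """Store one KV pair and return its position."""
        self._entries.append((key, value))
        return len(self._entries) - 1

    def query(self, predicted_query: Vector,
              top_k: int) -> List[Tuple[int, Vector, Vector]]:
        """Return (position, key, value) for the top_k most similar keys."""
        ranked = sorted(
            range(len(self._entries)),
            key=lambda i: _cosine(predicted_query, self._entries[i][0]),
            reverse=True,
        )
        return [(i, *self._entries[i]) for i in ranked[:max(top_k, 0)]]
```

A production system would more likely use an approximate nearest-neighbor index than a linear scan; the interface of passing a predicted query in and getting KV entries out is the point of the sketch.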
[0268] In some implementations, the transceiver unit 1301 is further configured to: store the target information of the Tth generation step, wherein the target information includes an importance score vector and / or a query vector.
[0269] In some implementations, the transceiver unit 1301 is also used to: delete the target information of the (T-N)th generation step.
[0270] In some implementations, the transceiver unit 1301 is specifically used to: delete the target information of the (T-N)th generation step when the number of saved pieces of target information exceeds a threshold.
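The storing and deleting of target information described above behaves like a bounded history. The following Python sketch assumes that the step whose target information is deleted is step T minus N, that is, the oldest saved step once more than N entries are held; the class name `TargetInfoStore` and its methods are hypothetical.

```python
from collections import OrderedDict
from typing import List, Optional, Sequence


class TargetInfoStore:
    """Keeps the target information of at most `threshold` generation steps."""

    def __init__(self, threshold: int) -> None:
        if threshold < 1:
            raise ValueError("threshold must be at least 1")
        self._threshold = threshold
        self._by_step: "OrderedDict[int, List[float]]" = OrderedDict()

    def save(self, step: int, target_info: Sequence[float]) -> Optional[int]:
        """Save one step's target information.

        Returns the number of the step whose information was deleted,
        or None if the threshold was not exceeded.
        """
        self._by_step[step] = list(target_info)
        if len(self._by_step) > self._threshold:
            # Drop the oldest saved step (insertion order is preserved).
            evicted_step, _ = self._by_step.popitem(last=False)
            return evicted_step
        return None

    def steps(self) -> List[int]:
        """Generation steps currently held, oldest first."""
        return list(self._by_step)
```

With a threshold of N and one save per generation step, saving step T removes step T minus N, so memory use for the history stays constant as generation proceeds.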
[0271] This application also provides a chip, which can be the chip of the data processing device described above. Figure 14 shows a schematic diagram of a chip 1400. The chip 1400 includes one or more processors 1401 and interface circuits 1402.
[0272] Optionally, the chip 1400 may also include a bus 1403.
[0273] Processor 1401 may be an integrated circuit chip with signal processing capabilities. In implementation, each step of the above data processing method can be completed through integrated logic circuits in the hardware of processor 1401 or through software instructions.
[0274] Optionally, the processor 1401 described above may be a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), or other programmable logic devices, discrete gate or transistor logic devices, or discrete hardware components. It can implement or execute the various methods and steps disclosed in the embodiments of this application. The general-purpose processor may be a microprocessor or any conventional processor.
[0275] The interface circuit 1402 can be used to send or receive data, instructions or information. The processor 1401 can use the data, instructions or other information received by the interface circuit 1402 to process the data, instructions or other information, and can send the processed information out through the interface circuit 1402.
[0276] Optionally, the chip may also include memory, which may include read-only memory and random access memory, providing operation instructions and data to the processor. A portion of the memory may also include non-volatile random access memory (NVRAM).
[0277] Optionally, the memory stores executable software modules or data structures, and the processor can execute corresponding operations by calling the operation instructions stored in the memory (which may be stored in the operating system).
[0278] Optionally, the chip can be used in the data processing apparatus or data processing device involved in the embodiments of this application. Optionally, the interface circuit 1402 can be used to output the execution result of the processor 1401. For the data processing methods provided by one or more embodiments of this application, please refer to the foregoing embodiments, which will not be repeated here.
[0279] It should be noted that the functions of the processor 1401 and the interface circuit 1402 can be implemented through hardware design, software design, or a combination of hardware and software; no restrictions are imposed here.
[0280] Figure 15 is a schematic diagram of the structure of an electronic device provided in an embodiment of this application. The electronic device can be a data processing device, a chip within the data processing device, or a functional module. For example, as shown in Figure 15, the electronic device 1500 includes a processor 1501, a transceiver 1502, and a communication line 1503.
[0281] The processor 1501 is used to execute any step of the data processing method provided in the embodiments of this application, and in the process of executing any step of the data processing method provided in the embodiments of this application, it may choose to call the transceiver 1502 and the communication line 1503 to complete the corresponding operation.
[0282] Furthermore, the electronic device 1500 may also include a memory 1504. The processor 1501, the memory 1504, and the transceiver 1502 can be connected via a communication line 1503.
[0283] The processor 1501 can be a processor, a general-purpose processor, a network processor (NP), a digital signal processor (DSP), a microprocessor, a microcontroller, a programmable logic device (PLD), or any combination thereof. The processor 1501 can also be other devices with processing capabilities, such as circuits, devices, or software modules, without limitation.
[0284] Transceiver 1502 is used to communicate with other devices or other communication networks, such as Ethernet, radio access network (RAN), wireless local area network (WLAN), etc. Transceiver 1502 can be a module, circuit, transceiver, or any device capable of enabling communication.
[0285] The transceiver 1502 is mainly used for sending and receiving commands and information, and may include a transmitter and a receiver to send and receive commands and information, respectively; operations other than sending and receiving commands and information are implemented by the processor.
[0286] Communication line 1503 is used to transmit information between the various components included in electronic device 1500.
[0287] In one design, the processor can be viewed as a logic circuit, and the transceiver as an interface circuit.
[0288] Memory 1504 is used to store instructions. These instructions can be computer programs.
[0289] The memory 1504 can be volatile memory or non-volatile memory, or it can include both. The non-volatile memory can be read-only memory (ROM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), or flash memory. The volatile memory can be random access memory (RAM), which is used as an external cache. By way of example, but not limitation, many forms of RAM are available, such as static random access memory (SRAM), dynamic random access memory (DRAM), synchronous dynamic random access memory (SDRAM), double data rate synchronous dynamic random access memory (DDR SDRAM), enhanced synchronous dynamic random access memory (ESDRAM), synchlink dynamic random access memory (SLDRAM), and direct Rambus random access memory (DR RAM). The memory 1504 can also be a compact disc read-only memory (CD-ROM) or other optical disc storage (including compact discs, laser discs, optical discs, digital versatile discs, Blu-ray discs, etc.), magnetic disk storage media, or other magnetic storage devices. It should be noted that the memory in the systems and methods described herein is intended to include, but is not limited to, these and any other suitable types of memory.
[0290] It should be noted that the memory 1504 can exist independently of the processor 1501, or it can be integrated with the processor 1501. The memory 1504 can be used to store instructions, program code, or some data, etc. The memory 1504 can be located inside or outside the electronic device 1500, without limitation. The processor 1501 is used to execute the instructions stored in the memory 1504 to implement the methods provided in the above embodiments of this application.
[0291] In one example, the processor 1501 may include one or more processor cores, for example, processor cores 0 and 1 in Figure 15.
[0292] As an optional implementation, the electronic device 1500 includes multiple processors; for example, in addition to the processor 1501 in Figure 15, it may also include a processor 1507.
[0293] As an optional implementation, the electronic device 1500 also includes an output device 1505 and an input device 1506. For example, the input device 1506 is a device such as a keyboard, mouse, microphone, or joystick, and the output device 1505 is a device such as a display screen or speaker.
[0294] It should be noted that the electronic device 1500 can be a chip system or a device with a structure similar to that shown in Figure 15. The chip system can be composed of chips or include chips and other discrete components. Actions, terminology, etc., involved in the various embodiments of this application can be cross-referenced without limitation. The message names or parameter names in the messages used for interaction between devices in the embodiments of this application are merely examples; other names can be used in specific implementations without limitation. Furthermore, the structural composition shown in Figure 15 does not constitute a limitation on the electronic device 1500; in addition to the components shown in Figure 15, the electronic device 1500 may include more or fewer components than shown in Figure 15, a combination of certain components, or a different arrangement of components.
[0295] The processor and transceiver described in this application can be implemented on integrated circuits (ICs), analog ICs, radio frequency integrated circuits, mixed-signal ICs, application-specific integrated circuits (ASICs), printed circuit boards (PCBs), electronic devices, etc. The processor and transceiver can also be manufactured using various IC process technologies, such as complementary metal-oxide-semiconductor (CMOS), N-type metal-oxide-semiconductor (NMOS), P-type metal-oxide-semiconductor (PMOS), bipolar junction transistor (BJT), bipolar CMOS (BiCMOS), silicon germanium (SiGe), gallium arsenide (GaAs), etc.
[0296] This application also provides a data processing apparatus, which includes at least one processor. When the at least one processor executes program code or instructions, it implements the data processing method described above.
[0297] Optionally, the device may further include at least one memory for storing the program code or instructions.
[0298] This application also provides a computer storage medium storing computer instructions. When the computer instructions are executed on a data processing device, the data processing device performs the aforementioned related method steps to implement the data processing method described above.
[0299] This application also provides a computer-readable storage medium for storing a computer program, the computer program including instructions for performing the data processing method in the above embodiments.
[0300] In specific implementations, the computer-readable storage medium in the above embodiments can be a non-transitory computer-readable storage medium. The non-transitory computer-readable storage medium can include non-volatile media such as ROM, or some volatile media such as some RAM.
[0301] This application also provides a computer program product that, when run on a computer, causes the computer to perform the aforementioned steps to implement the data processing method described in the above embodiments.
[0302] This application also provides a data processing apparatus, which may specifically be a chip, integrated circuit, component, or module. Specifically, the apparatus may include a connected processor and a memory for storing instructions, or the apparatus may include at least one processor for fetching instructions from external memory. When the apparatus is running, the processor can execute the instructions to cause the chip to perform the data processing methods described in the above method embodiments.
[0303] It should be understood that in various embodiments of this application, the sequence number of each process does not imply the order of execution. The execution order of each process should be determined by its function and internal logic, and should not constitute any limitation on the implementation process of this application.
[0304] Those skilled in the art will recognize that the units and algorithm steps of the various examples described in conjunction with the embodiments disclosed herein can be implemented in electronic hardware, or a combination of computer software and electronic hardware. Whether these functions are implemented in hardware or software depends on the specific application and design constraints of the technical solution. Those skilled in the art can use different methods to implement the described functions for each specific application, but such implementation should not be considered beyond the scope of the embodiments of this application.
[0305] Those skilled in the art will understand that, for convenience and brevity, for the specific working processes of the systems, devices, and units described above, reference may be made to the corresponding processes in the foregoing method embodiments, which will not be repeated here.
[0306] In the several embodiments provided in this application, it should be understood that the disclosed systems, apparatuses, and methods can be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative: the division of the units described above is only a logical functional division, and in actual implementation there may be other division methods; for instance, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. Furthermore, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection between devices or units through some interfaces, and may be electrical, mechanical, or in other forms.
[0307] The units described above as separate components may or may not be physically separate. The components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units can be selected to achieve the purpose of the embodiments of this application, depending on actual needs.
[0308] In addition, the functional units in the various embodiments of this application can be integrated into one processing unit, or each unit can exist physically separately, or two or more units can be integrated into one unit.
[0309] If the aforementioned functions are implemented as software functional units and sold or used as independent products, they can be stored in a computer-readable storage medium. Based on this understanding, the technical solutions of this application, essentially, or the parts that contribute to the prior art, or parts of the technical solutions, can be embodied in the form of a software product. This computer software product is stored in a storage medium and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of this application. The aforementioned storage medium includes various media capable of storing program code, such as USB flash drives, portable hard drives, read-only memory (ROM), random access memory (RAM), magnetic disks, or optical disks.
[0310] The above description is merely a specific implementation of the embodiments of this application, but the protection scope of the embodiments of this application is not limited thereto. Any variations or substitutions that can be easily conceived by those skilled in the art within the technical scope disclosed in the embodiments of this application should be included within the protection scope of the embodiments of this application. Therefore, the protection scope of the embodiments of this application should be determined by the protection scope of the claims.
Claims
1. A data processing method, the method being applied to an end-side device or a cloud-side device, characterized in that the method includes: at a first moment, prefetching a key key-value cache (key KV cache) for at least one future generation step, wherein the first moment is any moment of the T-th generation step, the T-th generation step is used to generate token T, the at least one future generation step includes the T+N-th generation step, and T and N are positive integers; and calculating the importance score of the at least one future generation step based on the key KV cache.
2. The method of claim 1, wherein, The method further includes: The predicted target information for at least one future generation step is determined based on the target information of the historical generation steps, wherein the historical generation steps include at least one generation step from the first generation step to the Tth generation step, the target information includes an importance score vector and / or a query Q vector, and the predicted target information includes a predicted importance score vector and / or a predicted query vector. The key KV cache is determined based on the predicted target information.
3. The method of claim 2, wherein, The historical generation step includes the Tth generation step, and the step of determining the predicted target information for at least one future generation step based on the target information of the historical generation step includes: After the second time point, the predicted target information is determined based on the target information, where the second time point is the time when the query vector or importance score is calculated in the T-th generation step.
4. The method according to claim 2 or 3, characterized in that determining the key KV cache based on the predicted target information includes: determining a first importance score based on the predicted importance score vector, wherein the first importance score is the largest of the multiple importance scores in the predicted importance score vector; and determining the key KV cache based on the first importance score, wherein the key KV cache includes the KV cache corresponding to the first importance score.
5. The method according to any one of claims 2 to 4, characterized in that determining the key KV cache based on the predicted target information includes: dividing the predicted importance score vector into blocks based on the spatial location or similarity of the predicted importance scores in the predicted importance score vector, to obtain multiple predicted importance score blocks for the at least one future generation step; determining a second importance score based on the multiple predicted importance score blocks, wherein the second importance score is the importance score of the blocks with the largest values among the multiple predicted importance score blocks; and determining the key KV cache based on the second importance score, wherein the key KV cache includes the KV cache corresponding to the second importance score.
6. The method according to any one of claims 2 to 5, characterized in that determining the key KV cache based on the predicted target information includes: determining the importance scores of the cached KV cache based on the predicted query vector, wherein the cached KV cache includes the KV cache of at least one generation step from the first generation step to the Tth generation step; determining a third importance score based on the importance scores of the cached KV cache, wherein the third importance score is the largest of the multiple importance scores of the cached KV cache; and determining the key KV cache based on the third importance score, wherein the key KV cache includes the KV cache corresponding to the third importance score.
7. The method according to any one of claims 2 to 6, characterized in that determining the key KV cache based on the predicted target information includes: determining the importance scores of the cached KV cache based on the predicted query vector, wherein the cached KV cache includes the KV cache of at least one generation step from the first generation step to the Tth generation step; dividing the importance scores of the cached KV cache into multiple importance score blocks based on the spatial location or similarity of the importance scores of the cached KV cache; determining a fourth importance score based on the multiple importance score blocks, wherein the fourth importance score is the importance score of the blocks with the largest values among the multiple importance score blocks; and determining the key KV cache based on the fourth importance score, wherein the key KV cache includes the KV cache corresponding to the fourth importance score.
8. The method according to any one of claims 2 to 7, characterized in that, Determining the key KV cache based on the predicted target information includes: The predicted query vector is input into the database to obtain the key KV cache.
9. The method according to any one of claims 1 to 8, characterized in that, The method further includes: Save the target information of the Tth generation step, the target information including the importance score vector and / or query vector.
10. The method according to any one of claims 1 to 9, characterized in that the method further includes: deleting the target information of the (T-N)th generation step.
11. The method of claim 10, wherein deleting the target information of the (T-N)th generation step includes: if the number of saved pieces of target information exceeds a threshold, deleting the target information of the (T-N)th generation step.
12. A data processing method applied to an edge-cloud system, the edge-cloud system comprising edge-side devices and cloud-side devices, characterized in that the method includes: the cloud-side device prefetches a key KV cache for at least one future generation step at a first moment, wherein the first moment is any moment of the T-th generation step, the T-th generation step is used to generate token T, the at least one future generation step includes the T+N-th generation step, and T and N are positive integers; and the edge-side device calculates the importance score of the T+N-th generation step based on the key KV cache.
13. The method according to claim 12, characterized in that, The method further includes: The cloud-side device determines the predicted target information for at least one future generation step based on the target information of the historical generation steps. The historical generation steps include at least one generation step from the first generation step to the Tth generation step. The target information includes an importance score vector and / or a query vector. The predicted target information includes a predicted importance score vector and / or a predicted query vector. The cloud-side device determines the key KV cache based on the predicted target information.
14. The method according to claim 13, characterized in that, The historical generation step includes the Tth generation step. The cloud-side device determines the predicted target information for at least one future generation step based on the target information of the historical generation step, including: The cloud-side device determines the predicted target information based on the target information after the second time point, where the second time point is the time when the query vector or importance score is calculated in the Tth generation step.
15. The method according to any one of claims 12 to 14, characterized in that, The method further includes: The cloud-side device stores the target information of the Tth generation step, and the target information includes an importance score vector and / or a query vector.
16. The method according to any one of claims 12 to 15, characterized in that the method further includes: the cloud-side device deletes the target information of the (T-N)th generation step.
17. A data processing apparatus, characterized in that, The data processing device is an end-side device or a cloud-side device, including: a transceiver unit and a processing unit; The transceiver unit is used to prefetch key KV cache for at least one future generation step at a first time, where the first time is any time of the T-th generation step, the T-th generation step is used to generate token T, and the at least one future generation step includes the T+N-th generation step, where T and N are positive integers. The processing unit is used to calculate the importance score of the T+Nth generation step based on the key KV cache.
18. The apparatus according to claim 17, characterized in that, The processing unit is also used for: The predicted target information for at least one future generation step is determined based on the target information of the historical generation steps, wherein the historical generation steps include at least one generation step from the first generation step to the Tth generation step, the target information includes an importance score vector and / or a query vector, and the predicted target information includes a predicted importance score vector and / or a predicted query vector. The key KV cache is determined based on the predicted target information.
19. The apparatus according to claim 18, characterized in that, The history generation step includes the Tth generation step, and the processing unit is specifically used for: After the second time point, the predicted target information is determined based on the target information, where the second time point is the time when the query vector or importance score is calculated in the Tth generation step.
20. The apparatus according to claim 18 or 19, characterized in that the processing unit is specifically used for: determining a first importance score based on the predicted importance score vector, wherein the first importance score is the largest of the multiple importance scores in the predicted importance score vector; and determining the key KV cache based on the first importance score, wherein the key KV cache includes the KV cache corresponding to the first importance score.
21. The apparatus according to any one of claims 18 to 20, characterized in that the processing unit is specifically used for: dividing the predicted importance score vector into blocks based on the spatial location or similarity of the predicted importance scores in the predicted importance score vector, to obtain multiple predicted importance score blocks for the at least one future generation step; determining a second importance score based on the multiple predicted importance score blocks, wherein the second importance score is the importance score of the blocks with the largest values among the multiple predicted importance score blocks; and determining the key KV cache based on the second importance score, wherein the key KV cache includes the KV cache corresponding to the second importance score.
22. The apparatus according to any one of claims 18 to 21, characterized in that the processing unit is specifically used for: determining the importance scores of the cached KV cache based on the predicted query vector, wherein the cached KV cache includes the KV cache of at least one generation step from the first generation step to the Tth generation step; determining a third importance score based on the importance scores of the cached KV cache, wherein the third importance score is the largest of the multiple importance scores of the cached KV cache; and determining the key KV cache based on the third importance score, wherein the key KV cache includes the KV cache corresponding to the third importance score.
23. The apparatus according to any one of claims 18 to 22, characterized in that the processing unit is specifically used for: determining the importance scores of the cached KV cache based on the predicted query vector, wherein the cached KV cache includes the KV cache of at least one generation step from the first generation step to the Tth generation step; dividing the importance scores of the cached KV cache into multiple importance score blocks based on the spatial location or similarity of the importance scores of the cached KV cache; determining a fourth importance score based on the multiple importance score blocks, wherein the fourth importance score is the importance score of the blocks with the largest values among the multiple importance score blocks; and determining the key KV cache based on the fourth importance score, wherein the key KV cache includes the KV cache corresponding to the fourth importance score.
24. The apparatus according to any one of claims 18 to 23, characterized in that, The processing unit is specifically used for: The predicted query vector is input into the database to obtain the key KV cache.
25. The apparatus according to any one of claims 17 to 24, characterized in that, The transceiver unit is also used for: Save the target information of the Tth generation step, the target information including the importance score vector and / or query vector.
26. The apparatus according to any one of claims 17 to 25, characterized in that the transceiver unit is also used for: deleting the target information of the (T-N)th generation step.
27. The apparatus according to claim 26, characterized in that the transceiver unit is specifically used for: if the number of saved pieces of target information exceeds a threshold, deleting the target information of the (T-N)th generation step.
28. A cloud-side device applied to an edge-cloud system, the edge-cloud system comprising an edge-side device and a cloud-side device, characterized in that, Includes a transceiver unit and a processing unit; The transceiver unit is used to prefetch key KV cache for at least one future generation step at a first time, where the first time is any time of the T-th generation step, the T-th generation step is used to generate token T, and the at least one future generation step includes the T+N-th generation step, where T and N are positive integers. The processing unit is used to calculate the importance score of the T+Nth generation step based on the key KV cache.
29. The device according to claim 28, characterized in that, The processing unit is used for: The predicted target information for at least one future generation step is determined based on the target information of the historical generation steps, wherein the historical generation steps include at least one generation step from the first generation step to the Tth generation step, the target information includes an importance score vector and / or a query vector, and the predicted target information includes a predicted importance score vector and / or a predicted query vector. The key KV cache is determined based on the predicted target information.
30. The device according to claim 29, characterized in that, The historical generation steps include the Tth generation step, and the processing unit is specifically used for: After the second time point, the predicted target information is determined based on the target information, where the second time point is the time when the query vector or importance score is calculated in the Tth generation step.
31. The device according to any one of claims 28 to 30, characterized in that, The transceiver unit is also used for: Save the target information of the Tth generation step, the target information including the importance score vector and / or query vector.
32. The device according to any one of claims 28 to 31, characterized in that, The transceiver unit is also used for: Delete the target information of the (T-N)th generation step.
33. An edge-side device applied to an edge-cloud system, the edge-cloud system comprising an edge-side device and a cloud-side device, characterized in that, Includes a transceiver unit and a processing unit; The transceiver unit is used to prefetch key KV cache for at least one future generation step at a first time, where the first time is any time of the T-th generation step, the T-th generation step is used to generate token T, and the at least one future generation step includes the T+N-th generation step, where T and N are positive integers. The processing unit is used to calculate the importance score of the T+Nth generation step based on the key KV cache.
34. The device according to claim 33, characterized in that, The processing unit is used for: The predicted target information for at least one future generation step is determined based on the target information of the historical generation steps, wherein the historical generation steps include at least one generation step from the first generation step to the Tth generation step, the target information includes an importance score vector and / or a query vector, and the predicted target information includes a predicted importance score vector and / or a predicted query vector. The key KV cache is determined based on the predicted target information.
35. The device according to claim 34, characterized in that, The historical generation steps include the Tth generation step, and the processing unit is specifically used for: After the second time point, the predicted target information is determined based on the target information, where the second time point is the time when the query vector or importance score is calculated in the Tth generation step.
36. The device according to any one of claims 33 to 35, characterized in that, The transceiver unit is also used for: Save the target information of the Tth generation step, the target information including the importance score vector and / or query vector.
37. The device according to any one of claims 33 to 36, characterized in that, The transceiver unit is also used for: Delete the target information of the (T-N)th generation step.
38. An edge-cloud system, comprising at least one cloud-side device and at least one edge-side device, characterized in that, The at least one cloud-side device includes the cloud-side device according to any one of claims 28 to 32, and the at least one edge-side device includes the edge-side device according to any one of claims 33 to 37.
39. A cloud-side device, comprising multiple processors and a memory, characterized in that, The plurality of processors execute programs or instructions stored in memory to cause the cloud-side device to implement the method of any one of claims 1 to 11.
40. An edge-side device, comprising multiple processors and a memory, characterized in that, The plurality of processors execute programs or instructions stored in memory to cause the edge-side device to implement the method of any one of claims 1 to 11.
41. A data processing apparatus, comprising at least one processor and a memory, characterized in that, The at least one processor executes a program or instructions stored in a memory to cause the data processing apparatus to implement the method of any one of claims 1 to 11.
42. A chip comprising at least one processor and a memory, characterized in that, The at least one processor executes a program or instructions stored in a memory to cause the chip to implement the method of any one of claims 1 to 11.
43. A computer-readable storage medium for storing a computer program, characterized in that, When the computer program is run on a computer or processor, it causes the computer or processor to perform the method according to any one of claims 1 to 11.
44. A computer program product, the computer program product comprising instructions, characterized in that, When the instructions are executed on a computer or processor, the instructions cause the computer or processor to perform the method of any one of claims 1 to 11.
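
For illustration only, the bookkeeping recited in the apparatus and device claims above (saving the target information of each generation step, deleting the oldest saved step once a threshold is exceeded, predicting target information for a future step, and determining the key KV cache from that prediction) can be sketched as follows. This is a minimal sketch under stated assumptions, not the claimed implementation: the names `TargetInfoStore` and `select_key_entries` are hypothetical, the target information is taken to be an importance score vector of fixed length, the predictor is a simple element-wise mean over the saved steps, and the key KV cache is taken to be the top-k positions by predicted score. The claims do not fix any of these choices.

```python
from collections import OrderedDict


class TargetInfoStore:
    """Hypothetical store for per-step target information (importance score vectors)."""

    def __init__(self, max_saved):
        # Assumed threshold on the number of saved steps; with max_saved == N,
        # saving step T evicts step T-N.
        self.max_saved = max_saved
        self._by_step = OrderedDict()

    def save(self, step, scores):
        """Save the target information of one generation step, evicting the oldest if needed."""
        self._by_step[step] = list(scores)
        while len(self._by_step) > self.max_saved:
            self._by_step.popitem(last=False)  # delete the oldest saved step

    def steps(self):
        """Return the generation steps whose target information is currently saved."""
        return list(self._by_step)

    def predict(self):
        """Predict a future step's score vector (assumption: element-wise mean of saved vectors)."""
        vectors = list(self._by_step.values())
        return [sum(column) / len(vectors) for column in zip(*vectors)]


def select_key_entries(predicted_scores, k):
    """Return the k cache positions with the highest predicted scores, in ascending order."""
    ranked = sorted(range(len(predicted_scores)),
                    key=lambda i: predicted_scores[i], reverse=True)
    return sorted(ranked[:k])


# Usage: with a threshold of 2, saving step 3 deletes step 1.
store = TargetInfoStore(max_saved=2)
store.save(1, [1.0, 0.0, 0.0, 0.0])
store.save(2, [0.0, 0.5, 0.5, 0.0])
store.save(3, [0.0, 0.25, 0.5, 0.25])
predicted = store.predict()
positions_to_prefetch = select_key_entries(predicted, k=2)
```

In this sketch the positions returned by `select_key_entries` stand in for the key KV cache that would be prefetched during the Tth generation step for use in the (T+N)th generation step; the transfer itself is outside the scope of the example.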