Data processing method and apparatus
By separating the key-value cache between the computing nodes and the storage nodes, and using speculative tokens and attention scores to select key-value pairs with high attention, the storage and computing overhead problem of the key-value cache mechanism is solved without changing the model structure, thereby improving the inference efficiency of the model.
Patent Information
- Authority / Receiving Office: WO
- Patent Type: Application
- Current Assignee / Owner: HUAWEI TECH CO LTD
- Filing Date: 2025-11-11
- Publication Date: 2026-06-18
AI Technical Summary
Existing key-value (KV) caching mechanisms cannot reduce the cost of the KV cache without a drawback: they either change the model structure (which rules out training-free optimization), lose information through compression, or increase inter-node communication overhead and latency.
Separate key-value (KV) caches between compute nodes and storage nodes. Prefetch and filter high-interest KV pairs through storage nodes, extract only necessary KVs for computation on compute nodes, and use speculative tokens and attention scores to select relevant KVs, thereby reducing computation and storage overhead.
It improves the inference efficiency of the model, reduces the storage and communication overhead of computing nodes, and maintains the performance of the model without changing the model structure.
Abstract
Description
A data processing method and apparatus
[0001] This application claims priority to Chinese Patent Application No. 202411815269.5, filed on December 10, 2024, entitled "A Data Processing Method and Apparatus", the entire contents of which are incorporated herein by reference.
Technical Field
[0002] This application relates to the field of artificial intelligence, and more particularly to a data processing method and apparatus.
Background Technology
[0003] Large language models (LLMs) based on Transformers are a significant innovation in deep learning. They utilize self-attention mechanisms to process input data, effectively capturing long-range dependencies in sequences. The core structure of a Transformer model includes an encoder and a decoder, but many LLMs, such as the GPT series, primarily use the decoder. Self-attention allows the model to consider information from all other words in the sequence when processing each word, thus generating context-sensitive representations. The key-value (KV) cache mechanism is an optimization of the Transformer model, mainly used in the inference stage. When generating text, the model needs to predict the next word step by step. Traditional methods recalculate all self-attention keys and values at each step. The KV cache, however, stores the calculated keys and values after the first calculation. In subsequent steps, the model only needs to add new keys and values to the existing ones, significantly reducing computation and improving inference speed.
[0004] Existing approaches to the key-value (KV) cache each have a drawback: some alter the model structure, which rules out training-free optimization; some compress the KVs, which causes information loss; and some store the KVs in the storage space of other computing nodes, which increases communication overhead between nodes. Therefore, how to maintain the KV cache without introducing additional training overhead or latency has become a pressing issue.
Summary of the Invention
[0005] This application provides a data processing method and apparatus that can achieve higher model inference performance without changing the model structure and without additional training.
[0006] In view of this, in a first aspect, this application provides a data processing method applied to a computing node. The method comprises: acquiring an input sequence, which can be determined from data to be processed, where the data to be processed may include, but is not limited to, text, image, audio, or video data; and inputting the input sequence into a large model to output multiple words, where the large model includes at least one attention module and each attention module outputs at least one key-value (KV) pair. During inference, the KV pairs generated in one or more previous inference steps are stored in a storage node, and the output process for a word other than the first word among the multiple words includes: obtaining at least one first KV pair from the storage node based on a first speculative token; running the large model based on a first output token and the at least one first KV pair to obtain a second output token and a second speculative token; and outputting the second word. Here, the first output token represents the first word; the at least one first KV pair is obtained in advance from the storage node; the first speculative token is output synchronously during the calculation of the first output token and is an approximation of the second output token; the second output token represents the second word; the first word and the second word are adjacent words among the multiple words; and the second speculative token is used to calculate the word following the second word and to obtain, in advance from the storage node, the KV pairs associated with that word.
[0007] In this embodiment, the computing node offloads the KV pairs to the storage node, and during inference only the KV pairs with high attention need to be fetched. Because the KV pairs that receive high attention during inference largely determine the quality of the model's output, fetching only these more important KV pairs maintains model performance while reducing the storage and computational overhead of the computing node, thereby improving the overall inference efficiency of the model.
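The control flow described in paragraphs [0006] and [0007] can be sketched as a short loop. This is only an illustration under stated assumptions: `prefetch`, `toy_model`, and `generate` are hypothetical names, KVs are scalars rather than tensors, and the way `toy_model` produces its speculative token is a placeholder, since the model internals are not specified at this point in the application.

```python
from typing import Dict, List, Tuple

KV = Tuple[float, float]  # toy (key, value) pair; real KVs are tensors


def prefetch(storage: Dict[int, KV], speculative_token: float, k: int) -> List[KV]:
    """Return the k stored KV pairs whose keys score highest against the speculative token."""
    ranked = sorted(storage.values(), key=lambda kv: kv[0] * speculative_token, reverse=True)
    return ranked[:k]


def toy_model(token: float, kvs: List[KV]) -> Tuple[float, float, KV]:
    """Placeholder for the large model: returns (output token, speculative token, new KV)."""
    context = sum(value for _, value in kvs) / max(len(kvs), 1)
    output = 0.5 * token + 0.5 * context
    speculative = output  # assumption: a cheap guess at the next output token
    return output, speculative, (token, output)


def generate(prompt_kvs: List[KV], token: float, speculative: float,
             steps: int, k: int) -> List[float]:
    storage = dict(enumerate(prompt_kvs))  # KVs offloaded to the storage node
    outputs = []
    for _ in range(steps):
        fetched = prefetch(storage, speculative, k)   # only k pairs reach the computing node
        token, speculative, new_kv = toy_model(token, fetched)
        storage[len(storage)] = new_kv                # offload the newly produced KV
        outputs.append(token)
    return outputs


outputs = generate([(1.0, 2.0), (-1.0, 4.0), (0.5, 1.0)],
                   token=1.0, speculative=1.0, steps=3, k=2)
```

The loop runs the prefetch and the model call one after the other for readability; since the text says the KV pairs are obtained in advance, a real system would presumably issue the prefetch for the next word while the current word is still being computed.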
[0008] In one possible implementation, the aforementioned method further includes: after calculating the first speculative token, obtaining the attention score of the first speculative token for each key-value pair; and based on the attention score of the first speculative token, selecting at least one first key-value pair from the key-value pairs stored in the storage node. The first speculative token can be understood as an approximate token for the next word, thus the approximate token can be used to select the key-value pairs that may be needed for the next word inference. In this embodiment, key-value pairs can be selected based on the attention score of the speculative token, thus requiring only the reading of the relevant key-value pairs for the next calculation, rather than reading all key-value pairs.
[0009] In one possible implementation, the aforementioned process of selecting at least one pair of first KVs from the KVs stored in the storage node based on the attention score of the first speculative token includes: selecting the top k pairs of KVs from the KVs stored in the storage node, arranged in descending order of the attention score of the first speculative token, as at least one pair of first KVs, where k is a positive integer.
[0010] In this embodiment, the top-k key-value pairs can be selected according to the attention score, thereby pre-extracting the key-value pairs required for the next word inference, reducing the storage and computational overhead of the computing nodes, and improving the overall inference efficiency of the model.
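The top-k selection described above can be illustrated with a small example. It is a sketch only: the function names and the toy two-dimensional keys are assumptions, and the speculative token is represented directly by its query vector.

```python
import math
from typing import List, Sequence


def attention_scores(query: Sequence[float], keys: Sequence[Sequence[float]]) -> List[float]:
    """Softmax of the scaled dot products between one query and every key."""
    d = len(query)
    logits = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


def top_k_indices(scores: Sequence[float], k: int) -> List[int]:
    """Indices of the k highest scores, in descending order of score."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]


keys = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]  # keys held by the storage node
speculative_query = [1.0, 0.0]                            # query of the speculative token
scores = attention_scores(speculative_query, keys)
selected = top_k_indices(scores, k=2)                     # positions of the KV pairs to fetch
```

Only the KV pairs at the `selected` positions would then be read from the storage node for the next inference step.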
[0011] In one possible implementation, the aforementioned obtaining of the attention score of the first speculative token for each KV pair may include: when the computing node also caches quantized low-precision KV pairs, determining the attention score of the first speculative token based on the low-precision KV pairs.
[0012] In this embodiment of the application, the quantized low-precision KV pairs are cached in the computing node. The attention score corresponding to the first speculative token can be determined based on the low-precision KV pairs, thereby identifying the KV pairs that are likely to be relevant to the inference of the next word.
[0013] In one possible implementation, the aforementioned method further includes: if the currently calculated word is the first word, inputting the data in the input sequence into the large model to obtain a third output token and multiple KV pairs; and offloading the multiple KV pairs to the storage node.
[0014] In this embodiment of the application, for the inference of the first word, the input sequence can be input into the large model to perform a complete inference pass computed with full-precision KV pairs, thereby outputting the accurate token corresponding to the first word and providing accurate initial data for the inference of subsequent words.
[0015] In one possible implementation, the aforementioned method further includes: obtaining at least one second KV pair from the multiple KV pairs based on the third output token; and calculating a third speculative token based on the at least one second KV pair, the third speculative token being used to determine the KV pairs associated with the second word.
[0016] In this implementation, after the first word is inferred, the output token of the first word can be used together with the low-precision cached KV pairs to calculate an approximate token for the next word, so that before the next word is inferred, the KV pairs with higher attention can be selected based on the approximate token.
[0017] In one possible implementation, the aforementioned method further includes: quantizing multiple pairs of key-value pairs to obtain quantized multiple pairs of low-precision key-value pairs, wherein the quantized multiple pairs of low-precision key-value pairs are used to determine the speculative token corresponding to the next word of the current word.
[0018] In this embodiment, the computing node can store the quantized low-precision key-value pairs locally, which will occupy less storage space in the computing node. For key-value pairs that are of higher interest during inference, higher-precision key-value pairs can be prefetched from the storage node.
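A rough sketch of this two-tier arrangement follows. The application does not state the quantization scheme, so symmetric per-vector int8 quantization is assumed here purely for illustration, only the keys are quantized, and all names are hypothetical.

```python
from typing import List, Sequence, Tuple


def quantize_int8(vec: Sequence[float]) -> Tuple[List[int], float]:
    """Symmetric per-vector int8 quantization: returns (integer codes, scale)."""
    max_abs = max(abs(x) for x in vec)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    return [round(x / scale) for x in vec], scale


def approx_score(query: Sequence[float], codes: Sequence[int], scale: float) -> float:
    """Approximate dot product between a query and a quantized key."""
    return scale * sum(q * c for q, c in zip(query, codes))


# Full-precision keys live on the storage node.
full_precision_keys = [[0.8, 0.1], [0.1, 0.9], [0.7, 0.3], [-0.5, 0.2]]
# The computing node keeps only the low-precision copies.
local_cache = [quantize_int8(key) for key in full_precision_keys]

query = [1.0, 0.0]  # query of the speculative token (toy value)
scores = [approx_score(query, codes, scale) for codes, scale in local_cache]
picked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:2]
fetched = [full_precision_keys[i] for i in picked]  # prefetch at full precision
```

In this toy case the ranking computed from the int8 copies matches the ranking from the full-precision keys, so the same two KV pairs are prefetched while the computing node stores one byte per element instead of a full-precision value.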
[0019] In one possible implementation, the aforementioned computing nodes may include, but are not limited to, GPUs, NPUs, or TPUs, and the storage nodes may include, but are not limited to, CPU storage space, disks, or other nodes that can be used to store data.
[0020] In a second aspect, this application provides a data processing apparatus applied to a computing node, comprising:
[0021] The acquisition module is used to acquire the input sequence;
[0022] The inference module is used to input the input sequence into the large model and output multiple words. The large model includes at least one attention module, and the output of each attention module includes at least one key-value pair.
[0023] The inference module's output process for a word other than the first word includes: obtaining at least one pair of first key-value pairs from the storage node based on a first speculative token; running a large model based on a first output token and at least one pair of first key-value pairs to obtain a second output token and a second speculative token, where the first output token represents the first word, the at least one pair of first key-value pairs is obtained in advance from the storage node, the first speculative token is output synchronously during the calculation of the first output token, the first speculative token is an approximation token of the second output token, the second output token represents the second word, the first word and the second word are adjacent words among multiple words, the second speculative token is used to calculate the next word of the second word and to obtain the key-value pairs associated with the next word of the second word in advance from the storage node; and outputting the second word.
[0024] The effects achieved by the second aspect or any optional implementation of the second aspect can be referred to the description of the first aspect or any optional implementation of the first aspect, and will not be repeated hereafter.
[0025] In one possible implementation, the aforementioned inference module is further configured to: after calculating the first speculative token, obtain the attention score of the first speculative token for each key value (KV); and based on the attention score of the first speculative token, select at least one pair of first key values (KV) from the KV stored in the storage node.
[0026] In one possible implementation, the aforementioned inference module is used to select the top k pairs of KVs from the KVs stored in the storage node, arranged in descending order of the attention score of the first speculative token, as at least one first KV pair, where k is a positive integer.
[0027] In one possible implementation, the aforementioned inference module is configured to determine the attention score of the first speculative token based on the low-precision KV pairs when the computing node also caches quantized low-precision KV pairs.
[0028] In one possible implementation, the aforementioned inference module is configured to: if the currently calculated word is the first word, input the data in the input sequence into the large model to obtain a third output token and multiple KV pairs; and offload the multiple KV pairs to the storage node.
[0029] In one possible implementation, the aforementioned inference module is configured to: obtain at least one second KV pair from the multiple KV pairs based on the third output token; and calculate a third speculative token based on the at least one second KV pair, the third speculative token being used to determine the KV pairs associated with the second word.
[0030] In one possible implementation, the aforementioned inference module is further configured to: quantize multiple pairs of key-value pairs to obtain quantized multiple pairs of low-precision key-value pairs, which are then used to determine the speculative token corresponding to the next word of the current word.
[0031] In a third aspect, embodiments of this application provide a computing device including a processor and a memory, wherein the processor and the memory are interconnected via a circuit, and the processor calls program code in the memory to perform the functions of the method described in any implementation of the first aspect.
[0032] In a fourth aspect, embodiments of this application provide a digital processing chip, the chip including a processing unit and a communication interface, wherein the processing unit obtains program instructions through the communication interface and executes them to perform the processing-related functions described in the first aspect or any optional implementation of the first aspect.
[0033] In a fifth aspect, embodiments of this application provide a computer-readable storage medium including instructions that, when executed on a computer, cause the computer to perform the method described in the first aspect or any optional implementation thereof.
[0034] In a sixth aspect, embodiments of this application provide a computer program product comprising a computer program or instructions that, when executed by a processor, cause the processor to perform the method described in the first aspect or any optional implementation thereof.
Brief Description of the Drawings
[0035] Figure 1 is a schematic diagram of the architecture of a cloud service system provided in an embodiment of this application;
[0036] Figure 2 is a schematic diagram of the structure of a computing device provided in an embodiment of this application;
[0037] Figure 3 is a flowchart illustrating a data processing method provided in an embodiment of this application;
[0038] Figure 4 is a flowchart illustrating another data processing method provided in an embodiment of this application;
[0039] Figure 5 is a schematic diagram of the structure of a data processing device provided in an embodiment of this application;
[0040] Figure 6 is a schematic diagram of another computing device provided in an embodiment of this application.
Detailed Implementation
[0041] The technical solutions of the embodiments of this application will be described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of this application, and not all embodiments. Based on the embodiments of this application, all other embodiments obtained by those of ordinary skill in the art without creative effort are within the scope of protection of this application.
[0042] First, for ease of understanding, some terms or concepts involved in the embodiments of this application will be introduced.
[0043] (1) Neural Network
[0044] Neural networks can be composed of neural units. A neural unit can be an operation unit that takes $x_s$ and an intercept of 1 as inputs, and the output of the operation unit can be expressed as:
$$h_{W,b}(x) = f\left(\sum_{s=1}^{n} W_s x_s + b\right)$$
[0045] where $s = 1, 2, \ldots, n$, $n$ is a natural number greater than 1, $W_s$ is the weight of $x_s$, and $b$ is the bias of the neural unit. $f$ is the activation function of the neural unit, used to introduce nonlinear characteristics into the neural network and convert the input signal of the neural unit into an output signal. The output signal of this activation function can be used as the input of the next convolutional layer; the activation function can be the sigmoid function. A neural network is a network formed by connecting many such individual neural units together; that is, the output of one neural unit can be the input of another neural unit. The input of each neural unit can be connected to the local receptive field of the previous layer to extract the features of that local receptive field, which can be a region composed of several neural units.
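As a small worked example of a single neural unit, the sketch below computes f(sum over s of W_s * x_s + b) with the sigmoid as the activation function f. The function names and the input values are illustrative assumptions.

```python
import math
from typing import Sequence


def sigmoid(z: float) -> float:
    """Sigmoid activation function."""
    return 1.0 / (1.0 + math.exp(-z))


def neural_unit(x: Sequence[float], w: Sequence[float], b: float) -> float:
    """Output of one neural unit: activation of the weighted input sum plus the bias."""
    return sigmoid(sum(w_s * x_s for w_s, x_s in zip(w, x)) + b)


# 0.5 * 1.0 + (-0.25) * 2.0 + 0.0 = 0.0, and sigmoid(0.0) = 0.5
output = neural_unit(x=[1.0, 2.0], w=[0.5, -0.25], b=0.0)
```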
[0046] (2) Deep Neural Networks
[0047] A deep neural network (DNN), also known as a multilayer neural network, can be understood as a neural network with multiple intermediate layers. Based on the position of these layers, the internal neural network of a DNN can be divided into three categories: input layer, intermediate layers, and output layer. Generally, the first layer is the input layer, the last layer is the output layer, and the layers in between are considered intermediate layers, or hidden layers.
[0048] Although DNNs appear complex, each layer can be represented by the linear relational expression $\vec{y} = \alpha(W\vec{x} + \vec{b})$, where $\vec{x}$ is the input vector, $\vec{y}$ is the output vector, $\vec{b}$ is the offset vector (also known as the bias parameter), $W$ is the weight matrix (also called the coefficients), and $\alpha()$ is the activation function. Each layer simply applies this operation to the input vector $\vec{x}$ to obtain the output vector $\vec{y}$. Because DNNs have many layers, the number of coefficients $W$ and offset vectors $\vec{b}$ is also large. These parameters are defined in a DNN as follows, taking the coefficient $W$ as an example: in a three-layer DNN, the linear coefficient from the 4th neuron in the second layer to the 2nd neuron in the third layer is defined as $W^{3}_{24}$. The superscript 3 is the layer in which the coefficient $W$ is located, and the subscripts correspond to the output index 2 in the third layer and the input index 4 in the second layer.
[0049] In summary, the coefficient from the k-th neuron in layer L-1 to the j-th neuron in layer L is defined as $W^{L}_{jk}$.
[0050] It's important to note that the input layer does not have a W parameter. In deep neural networks, more intermediate layers allow the network to better represent complex real-world situations. Theoretically, the more parameters a model has, the higher its complexity and "capacity," meaning it can perform more complex learning tasks. Training a deep neural network is essentially the process of learning the weight matrix, with the ultimate goal of obtaining the weight matrix of all layers in the trained deep neural network (a weight matrix formed by the vectors W from many layers).
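To make the per-layer operation and the coefficient indexing concrete, the sketch below evaluates one layer as activation(W x + b). It is illustrative only: ReLU is chosen as the activation for easy arithmetic, the names are hypothetical, and the code uses 0-based indices where the text counts neurons from 1.

```python
from typing import List


def relu(z: float) -> float:
    """ReLU activation, used here only because it keeps the arithmetic easy to follow."""
    return max(0.0, z)


def layer_forward(W: List[List[float]], x: List[float], b: List[float]) -> List[float]:
    """One layer: W[j][k] is the coefficient from input neuron k to output neuron j."""
    return [
        relu(sum(W[j][k] * x[k] for k in range(len(x))) + b[j])
        for j in range(len(W))
    ]


W = [[1.0, -1.0],
     [0.5, 0.5]]
x = [2.0, 1.0]
b = [0.0, -1.0]
y = layer_forward(W, x, b)  # [relu(2 - 1 + 0), relu(1 + 0.5 - 1)]
```

Stacking several such calls, each with its own W and b, gives the forward pass of a multilayer network.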
[0051] (3) Large Model
[0052] Large models are large-scale models. The "large" in large models can be reflected in many aspects, such as large data scale, large-scale parallel computing capabilities, and larger model structures.
[0053] (4) Language model (LM)
[0054] Language models play a crucial role in Natural Language Processing (NLP), where their task is to predict the probability of a sentence occurring in a language. For example, a language model is typically constructed as a probability distribution p(s) of a string s, where p(s) attempts to reflect the frequency of string s as a sentence. It can be applied to scenarios such as text recognition or machine translation. In the embodiments of this application, the NLP models mentioned below include language models.
[0055] (5) Large Language Model (LLM)
[0056] A large language model (LLM) refers to a language model containing hundreds of billions (or more) parameters trained on massive amounts of text data. It is a natural language processing model based on deep learning. These models can process large amounts of text data to learn the grammatical and semantic rules of natural language. LLMs can be applied to text generation, machine translation, question answering systems, text summarization, and sentiment analysis, offering advantages such as strong generative capabilities, high adaptability, accurate prediction, and strong scalability. For example, in movie recommendation scenarios, a large language model can generate descriptions of movie scenes, including genre, main actors, and plot, enabling the system to better recommend similar movies. Large language models can also generate recommendation reasons; for example, e-commerce websites can use large language models to generate reasons for recommending products, such as product quality, price, and features, allowing users to better understand the value of the product.
[0057] (6) Transformer
[0058] A Transformer is a feature extraction network, distinct from a convolutional neural network, that includes an encoder and a decoder. In some cases, a Transformer-based model may omit the encoder and include only the decoder. The aforementioned language models or large language models can be models based on the Transformer architecture.
[0059] Encoder: Learns features, such as pixel features, within the global receptive field using self-attention.
[0060] Decoder: Learns the features of the desired module, such as the features of the output box, through self-attention and cross-attention.
[0061] For example, a Transformer layer may include an attention network and a feedforward network module. Taking natural language processing as an example, the attention network uses the attention mechanism to calculate the relevance between words and obtain the corresponding weight values, yielding context-dependent word representations; it is the core part of the Transformer structure. The feedforward network further transforms the obtained representations to produce the final output of the Transformer layer. In addition to these two components, a residual connection (Add) and layer normalization (Norm) are applied around each of them to optimize the output of the Transformer layer.
[0062] (7) Attention mechanism
[0063] Attention mechanisms can quickly extract important features from sparse data. They provide an effective modeling approach for capturing global contextual information through QKV (queries, keys, values). Assuming the input is a query Q and the context is stored as key-value pairs (K, V), the attention mechanism is essentially a mapping function from the query to a series of key-value pairs. Attention assigns a weight coefficient to each element in the sequence, which can also be understood as soft addressing: if each element in the sequence is stored in (K, V) form, attention performs addressing by calculating the similarity between Q and K. The similarity between Q and K reflects the importance of the corresponding V, i.e., its weight, and a weighted sum of the V values then gives the final feature value.
[0064] Attention calculation mainly consists of three steps. The first step is to calculate the similarity between the query and each key to obtain weights; common similarity functions include dot product, concatenation, and perceptron. The second step typically uses a softmax function to normalize these weights; softmax yields a probability distribution in which the weight coefficients sum to 1 and also emphasizes the weights of important elements. The third step is to compute the weighted sum of the values using these weights to obtain the final feature value. The specific calculation formula is as follows:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d}}\right)V$$
[0065] where d is the dimension of the matrices Q and K.
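The three steps (similarity, softmax normalization, weighted sum of the values) can be written out directly. The sketch below uses the scaled dot product as the similarity function and plain Python lists instead of a tensor library; the function names are illustrative.

```python
import math
from typing import List

Matrix = List[List[float]]


def softmax(row: List[float]) -> List[float]:
    """Normalize a list of similarities into weights that sum to 1."""
    largest = max(row)
    exps = [math.exp(x - largest) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q: Matrix, K: Matrix, V: Matrix) -> Matrix:
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V, one query row at a time."""
    d = len(K[0])
    output = []
    for q in Q:
        # Step 1: similarity between the query and every key.
        logits = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K]
        # Step 2: normalize the similarities into weights.
        weights = softmax(logits)
        # Step 3: weighted sum of the value vectors.
        output.append([sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))])
    return output


# Two identical keys receive equal weight, so the output is the mean of the two values.
result = attention(Q=[[1.0, 0.0]], K=[[1.0, 0.0], [1.0, 0.0]], V=[[1.0, 2.0], [3.0, 4.0]])
```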
[0066] Furthermore, attention includes self-attention and cross-attention. Self-attention can be understood as a special type of attention in which Q, K, and V are derived from the same input, whereas in cross-attention they are derived from different inputs. Attention uses the similarity between features (e.g., the inner product) as weights to integrate the queried features into an updated value for the current feature. Self-attention is attention computed from the feature map's own features.
[0067] For convolutional networks, the kernel size limits the receptive field, often requiring multiple layers to focus on the entire feature map. Self-attention, on the other hand, has the advantage of global focus; it can obtain global spatial information about the feature map through simple lookups and assignments.
[0068] (8) Loss Function
[0069] In training deep neural networks, to make the network's output as close as possible to the target value, we compare the network's prediction with the target value and update the weight vector of each layer based on the difference. (There is usually an initialization process before the first update, which pre-configures the parameters of each layer.) For example, if the network's prediction is too high, the weight vector is adjusted so that it predicts a lower value. This adjustment continues until the deep neural network can predict the target value or a value very close to it. Therefore, we need to predefine how to measure the difference between the predicted and target values, which is the role of the loss function or objective function: both are important equations for measuring this difference. Taking the loss function as an example, a higher output value (loss) indicates a greater difference, so training the deep neural network becomes a process of minimizing this loss. Common loss functions include mean squared error, cross-entropy, logarithmic, and exponential loss functions. For example, mean squared error can be used as the loss function, defined as $\mathrm{MSE} = \frac{1}{N}\sum_{n=1}^{N}(y_n - \hat{y}_n)^2$, where $y_n$ is the target value, $\hat{y}_n$ is the predicted value, and $N$ is the number of samples. The specific loss function can be selected based on the actual application scenario.
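As a quick numeric illustration of the mean squared error mentioned above, the following sketch averages the squared differences between targets and predictions. The function name and the sample values are illustrative.

```python
from typing import Sequence


def mean_squared_error(targets: Sequence[float], predictions: Sequence[float]) -> float:
    """Average of the squared differences between target and predicted values."""
    if len(targets) != len(predictions):
        raise ValueError("targets and predictions must have the same length")
    return sum((t - p) ** 2 for t, p in zip(targets, predictions)) / len(targets)


# Two predictions are exact and one is off by 2, so the loss is (0 + 0 + 4) / 3.
loss = mean_squared_error([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
```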
[0070] (9) Backpropagation algorithm
[0071] Neural networks can employ backpropagation (BP) to correct the parameters of the initial neural network model during training, so that the error loss of the model becomes progressively smaller. Specifically, the input signal is propagated forward until the output produces an error loss, and the parameters of the initial model are updated by propagating this error loss information backward, so that the error loss converges. Backpropagation is thus a backward pass driven by the error loss, aimed at obtaining the optimal parameters of the neural network model, such as the weight matrix. It includes gradient backpropagation and parameter update. Gradient backpropagation refers to calculating the gradient of each parameter in the reverse order of forward propagation. Parameter update uses the calculated gradients to compute new parameters, which serve as the model parameters in the next training iteration.
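The two phases, gradient backpropagation and parameter update, can be seen in a deliberately tiny example: a chain of two scalar weights with a squared-error loss. The network shape, the learning rate, and the names are assumptions chosen so the gradients can be checked by hand.

```python
from typing import List, Tuple


def train(x: float, target: float, w1: float, w2: float,
          lr: float, steps: int) -> Tuple[float, float, List[float]]:
    """Train y = w2 * (w1 * x) by gradient descent on the loss (y - target)^2."""
    losses = []
    for _ in range(steps):
        # Forward propagation.
        h = w1 * x
        y = w2 * h
        losses.append((y - target) ** 2)
        # Gradient backpropagation: apply the chain rule from the output back to the input.
        grad_y = 2.0 * (y - target)
        grad_w2 = grad_y * h
        grad_h = grad_y * w2
        grad_w1 = grad_h * x
        # Parameter update.
        w1 -= lr * grad_w1
        w2 -= lr * grad_w2
    return w1, w2, losses


w1, w2, losses = train(x=1.0, target=2.0, w1=1.0, w2=1.0, lr=0.1, steps=100)
```

After training, the product `w1 * w2` is close to the target of 2.0 and the recorded loss has fallen from its initial value of 1.0, which is the convergence of the error loss that the paragraph describes.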
[0072] Secondly, the method provided in this application can be deployed on server clusters, cloud platforms, or other devices with computing capabilities.
[0073] For example, in one scenario, the method provided in this application can be deployed on a cloud platform to provide cloud services to users through a client.
[0074] For example, Figure 1 shows a schematic diagram of the structure of a cloud service system provided in this application. As shown in Figure 1, the cloud service system 10 may include a data center 11, a cloud platform 12, and a client 13.
[0075] The cloud platform 12 may specifically include a server cluster or other devices with computing capabilities. Optionally, the cloud platform 12 can work with other devices, such as data storage, routers, and load balancers. The cloud platform 12 can use data in a data storage system, or call program code in the data storage system, to implement the method steps provided in the embodiments of this application.
[0076] Data center 11 can be used to store data for cloud platform 12 to query or write data, etc.
[0077] The cloud platform 12 can provide services to users in the form of a client. Users can operate on the client 13 to interact with the cloud platform 12 on data or to request services from the cloud platform 12. This client can be deployed on personal computers, computer workstations, smartphones, tablets, laptops, and smart cars, etc.
[0078] In one possible implementation, the cloud platform 12 is used to implement the method provided in the embodiments of this application. That is, a large model is deployed in the cloud platform 12, the data processing method provided in the embodiments of this application is executed based on the large model, output data is obtained based on the data input by the client, and the output data is sent to the client.
[0079] It should be noted that client 13 is an optional device. For example, the inference process can be executed directly in the cloud platform 12 using the data stored in the data center 11, without the need for client 13.
[0080] The aforementioned large-scale model can specifically refer to a Transformer-based LLM. Transformer-based LLMs are a significant innovation in deep learning, utilizing a self-attention mechanism to process input data and effectively capture long-range dependencies in sequences. While the core structure of a Transformer includes an encoder and a decoder, many LLMs, such as the GPT series, primarily utilize the decoder portion. The self-attention mechanism allows the model to consider information from all other words in the sequence when processing each word, thereby generating context-sensitive representations.
[0081] The Key-Value (KV) cache mechanism is an optimization of the Transformer model, primarily used in the inference phase. When generating text, the model needs to predict the next word progressively. Traditional approaches typically recalculate all self-attention keys and values at each step. The KV cache, however, stores the calculated keys and values after the initial computation. In subsequent steps, the model only needs to add new keys and values to the existing set, significantly reducing computation and improving inference speed. This makes Transformer-based LLMs more efficient in text generation, capable of handling longer input sequences while maintaining high performance. By combining self-attention and KV caching, LLMs achieve outstanding performance in complex natural language processing tasks.
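The saving from the KV cache can be illustrated by counting how many key/value projections are computed with and without it. The sketch is a toy: `project` stands in for the real key and value projections, and the token values are arbitrary.

```python
from typing import List, Tuple


def project(token: float) -> Tuple[float, float]:
    """Toy stand-in for computing the key and value of one token."""
    return 2.0 * token, 3.0 * token


def decode_without_cache(tokens: List[float]) -> int:
    """Recompute keys and values for every token seen so far at each step."""
    computed = 0
    for step in range(1, len(tokens) + 1):
        kvs = [project(t) for t in tokens[:step]]
        computed += len(kvs)
    return computed


def decode_with_cache(tokens: List[float]) -> Tuple[int, List[Tuple[float, float]]]:
    """Compute keys and values only for the newest token and append them to the cache."""
    cache = []
    computed = 0
    for token in tokens:
        cache.append(project(token))
        computed += 1
    return computed, cache


tokens = [0.1, 0.2, 0.3, 0.4, 0.5]
without_cache = decode_without_cache(tokens)   # 1 + 2 + 3 + 4 + 5 projections
with_cache, cache = decode_with_cache(tokens)  # one projection per token
```

The projection count grows quadratically with sequence length without the cache and linearly with it, at the price of keeping every cached pair in memory.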
[0082] The ability to handle long sequences is crucial for LLMs, as it directly impacts their performance in tasks such as document processing, retrieval-augmented generation, and in-context learning. Common Transformer architectures in LLMs rely on key-value (KV) caches to avoid redundant computations during decoding. However, the size of the KV cache grows linearly with the sequence length, leading to significant memory overhead. For example, when processing sequences of length 2k with a batch size of 16, the KV cache of LLaMA 2-7B holds about 8.4B elements, exceeding the number of parameters in the model itself. Since the memory of computational units (e.g., GPU VRAM) is typically limited, the KV cache becomes a bottleneck restricting LLM deployment, especially on edge devices.
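The 8.4B figure can be reproduced with a short calculation. The sketch below assumes the commonly cited LLaMA 2-7B shape (32 layers, hidden size 4096) and reads "2k" as 2000 tokens; these values and the function name `kv_cache_elements` are assumptions and are not stated in this application.

```python
def kv_cache_elements(layers, hidden_size, seq_len, batch_size):
    # Factor 2: one key vector and one value vector per token per layer.
    return 2 * layers * hidden_size * seq_len * batch_size

# Assumed model shape: 32 layers, hidden size 4096; "2k" read as 2000 tokens.
elements = kv_cache_elements(layers=32, hidden_size=4096,
                             seq_len=2000, batch_size=16)
print(elements)  # 8388608000, i.e. about 8.4 billion elements

# At 16 bits (2 bytes) per element this is roughly 16.8 GB.
bytes_fp16 = elements * 2
```

Because the count is linear in both sequence length and batch size, doubling either one doubles the cache.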
[0083] For example, in an existing efficient KV architecture, the size of the KV cache is directly determined by the model architecture. By modifying the model structure, the size of the KV cache can be reduced. Multi-Query Attention (MQA) reduces cache size by sharing keys and values across all attention heads, while Grouped Query Attention (GQA) groups attention heads and shares keys and values only within each group. YOCO allows the latter layers of an LLM to reuse keys and values computed in the former layers. Multi-Head Latent Attention (MLA) reparameterizes keys and values as linear projections of low-rank space vectors. Furthermore, some schemes modify the attention mechanism to avoid linear growth of the KV cache, such as RWKV, RetNet, or state-space models. These schemes alter the model architecture and therefore must be applied before pre-training begins, making them unsuitable for training-free optimization during the existing LLM inference phase.
[0084] For example, some existing post-training compression KV cache schemes employ methods such as deletion, merging, and quantization of KV pairs. StreamLLM retains only the most recent tokens and a small number of initial tokens; H2O, Scissorhands, and RoCo use attention scores to measure the importance of KV pairs and greedily delete unimportant tokens; FastGen introduces four strategies to determine which tokens to retain and selects the optimal strategy combination for each attention head during the pre-filling phase; CaM and D2O selectively merge KV pairs about to be deleted with those currently being retained, thus preserving some information; MiniCache leverages the similarity of KV caches between adjacent layers for cross-layer merging; KIVI and KVQuant perform channel-wise quantization of keys and token-wise quantization of values, compressing the KV cache to 2 bits; ZipCache introduces channel-separable token quantization for even higher compression ratios. These schemes can reduce the KV cache size during the inference phase without training. However, since compression inherently leads to information loss, greedy compression in the current step may discard potentially useful information for future steps, potentially reducing LLM performance.
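To make the information-loss concern concrete, the following minimal sketch shows a retention policy of the "few initial tokens plus most recent tokens" kind described above. The function name and parameters are hypothetical and do not reproduce the code of any of the cited schemes.

```python
def retained_positions(seq_len, num_initial, num_recent):
    """Positions kept by a 'few initial tokens plus recent window' policy."""
    initial = range(min(num_initial, seq_len))
    recent = range(max(seq_len - num_recent, 0), seq_len)
    return sorted(set(initial) | set(recent))

kept = retained_positions(seq_len=10, num_initial=2, num_recent=3)
print(kept)  # [0, 1, 7, 8, 9]
```

In this example the KV pairs at positions 2 to 6 are discarded permanently, so a later token that would have attended to them can no longer do so.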
[0085] For example, some existing key-value offloading and prefetching schemes, such as FlexGen, offload the key-value cache to CPU memory or even disk and find the optimal offloading strategy; Huggingface's transformers library also implements a simple key-value cache offloading method, where each layer prefetches the key-value cache required by the next layer; InfLLM divides long-range contexts into offloaded memory units and, during inference, only retrieves units related to the current token for attention computation. Although these schemes reduce the use of video random-access memory (VRAM) without losing key-value cache information, frequent CPU-GPU communication significantly increases inference latency.
[0086] Therefore, this application provides a data processing method that caches key-value pairs in storage nodes and requires no model training. The computing node simultaneously decodes the current token and an approximate token for the next token, and uses the approximate token to select the more relevant key-value pairs to move from the storage node. This reduces the number of key-value pairs that need to be moved and achieves lower key-value access overhead without changing the model structure.
[0087] First, the device that performs the method provided in the embodiments of this application can be referred to as a computing device. The computing device may include the aforementioned cloud platform, or other devices with computing capabilities, such as servers or server clusters, or even terminals with computing capabilities.
[0088] For example, the structure of the computing device provided in this application embodiment can be as shown in Figure 2. The computing device may include multiple nodes. In this application embodiment, the multiple nodes are divided into computing nodes and storage nodes.
[0089] In this embodiment, a computing node refers to a node in a computing device that has computing capabilities. This may include, but is not limited to, GPUs, TPUs, NPUs, or other neural network accelerators. In some scenarios, a neural network accelerator may be simply referred to as an XPU.
[0090] In this embodiment, a storage node refers to a node in a computing device that can be used to store data. This may include, but is not limited to, the CPU's storage space, a disk, or other spaces available for data storage. Figure 2 uses the CPU as an example of a storage node, but it can be replaced with a disk or other spaces available for data storage.
[0091] The method provided in this application embodiment can be executed by a computing node, that is, the steps of running a large model can be executed by a computing node.
[0092] The methods provided in the embodiments of this application will be described below.
[0093] Referring to Figure 3, a flowchart of a data processing method provided in an embodiment of this application is shown below.
[0094] 31. Obtain the input sequence.
[0095] The input sequence can be a one-dimensional or multi-dimensional vector, which can be used to represent the input corpus.
[0096] Specifically, the input sequence can be the output sequence after processing input data such as text, images, audio, or video through a neural network, or it can be a sequence directly input by the user.
[0097] For example, in one possible scenario, if the input data is text, the text can be directly converted into a representation vector using the embedding layer. That is, the vector corresponding to each word in the text can be queried from the embedding vocabulary to obtain the input sequence.
[0098] For example, in one possible scenario, if the input data is an image or video, a visual model can be used to extract features from the image or video data and encode them into vectors to obtain the input sequence.
[0099] For example, in one possible scenario, if the input data is audio data, the audio can be converted into text, and then the text can be converted into a representation vector using an embedding layer to obtain the input sequence.
[0100] For example, in one possible scenario, the large model in this application embodiment can be a multimodal large model, and the input data can include multimodal data. Using the multimodal model, the multimodal data is mapped to the same sequence to obtain the input sequence.
[0101] 32. Input the input sequence into the large model and output multiple words.
[0102] After obtaining the input sequence, it is fed into a large model, which outputs multiple words, each of which can be represented as one or more characters. Within the large model, these words can be represented by tokens. For different downstream tasks, corresponding downstream task modules can be deployed within the large model, which can perform operations on the tokens. For example, this large model can be used to perform downstream tasks such as end-to-end speech translation, end-to-end speech dialogue, image title generation, optical character recognition (OCR) reading aloud, or video-to-text conversion. The specific downstream tasks performed can be determined based on the actual application scenario.
[0103] Specifically, a large model may include multiple network layers, and each network layer may include one or more modules. Each network layer may include one or more attention modules, which are modules deployed based on attention mechanisms. The attention mechanism can be referred to in the aforementioned terminology introduction section, and will not be repeated here.
[0104] For the output process of the first word, the input sequence can be fed into the large model, which infers layer by layer to output the first word, that is, the first output token. During inference, the key-value pairs (KVs) output by the attention modules are stored in the storage nodes. Before decoding the next word, an approximate token for the second word is calculated based on the data cached by the computing nodes; for ease of distinction, this is called a speculative token. When decoding any word other than the first word, the current word can be decoded and output based on the output token of the previous word and the speculative token.
[0105] The following describes the operation process of the attention module during the decoding of one of the words that is not the first word.
[0106] 321. Obtain at least one first KV pair from the storage node based on the first speculative token.
[0107] In the process of decoding the previous word, the first output token and the first speculative token of the previous word can be decoded simultaneously. After obtaining the first speculative token, at least one pair of first key-value pairs can be obtained from the storage node.
[0108] Optionally, when selecting the first KV, the KVs calculated while inferring and outputting the first output token can also be considered as candidates.
[0109] Typically, after the first speculative token is obtained during the inference process for a word, the remaining inference steps of that process can continue to execute. Based on the first speculative token, at least one pair of first key-value pairs can be determined, so that the key-value pairs required for inferring the next word are obtained from the storage node in advance. In other words, while the computing node is computing, the communication resources between the computing node and the storage node can be used to transmit key-value pairs, thereby improving the utilization of the computing node's communication resources and the overall inference efficiency.
[0110] Optionally, when retrieving at least one first KV pair from the storage node, all KVs can be retrieved from the storage node, or the more relevant KV can be selected as the first KV based on the attention score of the first speculative token.
[0111] In one possible implementation, after calculating the first speculative token, the attention score of the first speculative token for each key-value pair can be obtained; based on the attention score of the first speculative token, at least one pair of first key-value pairs can be selected from the key-value pairs stored in the storage node. In this embodiment, the speculative token can be used as an approximate token for the next word, thereby selecting the key-value pair corresponding to the next word based on the approximate token, thus selecting a more relevant key-value pair for the next word without having to obtain all key-value pairs, reducing the communication overhead between the computing node and the storage node.
[0112] Furthermore, from the key-value pairs (KVs) stored in the storage node, the top k pairs of KVs, arranged from highest to lowest according to the attention score of the first speculative token, can be selected as at least one first KV pair. Here, k is a positive integer, which can be a pre-set value or a value determined during training. In this embodiment, attention scores can be calculated based on approximate tokens and each KV, thereby selecting the KV with the highest attention score as the first KV. This means that a more relevant KV can be selected as the KV corresponding to the next word, without needing to extract all KVs, thus reducing communication overhead between the computing node and the storage node.
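The top-k selection described above can be sketched as follows. This is a minimal illustration using the Python standard library; the function name `select_top_k` and the score values are made up for the example.

```python
import heapq

def select_top_k(attention_scores, k):
    """Return the indices of the k highest-scoring KV pairs, best first."""
    k = min(k, len(attention_scores))
    return heapq.nlargest(k, range(len(attention_scores)),
                          key=attention_scores.__getitem__)

# Made-up attention scores of the first speculative token, one per stored KV.
scores = [0.05, 0.40, 0.10, 0.30, 0.15]
first_kv_indices = select_top_k(scores, k=2)
print(first_kv_indices)  # [1, 3]
```

Only the KV pairs at the returned indices would then be fetched from the storage node.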
[0113] Furthermore, to predict the attention score of the next token, the compute node can cache the quantized low-precision key-value pairs. If the compute node also caches the quantized low-precision key-value pairs, the low-precision key-value pairs can be used to calculate the attention score of the first speculative token. This can be combined with the aforementioned attention score calculation formula: The first speculative token is used as an approximation of Q. Attention scores for each pair of KVs are calculated by combining the cached low-precision KVs. Based on these low-precision KV attention scores, the top k pairs of KVs, ranked from highest to lowest, are selected from the storage nodes as the first KV. In this embodiment, the cached low-precision KVs in the compute node and the approximate Q output in the previous inference process can be used to filter the KVs required for the next inference process. Therefore, the compute node does not need to store all KVs in full precision; it can extract the more relevant KVs from the storage node before inference, reducing storage requirements on the compute node and improving the utilization of communication resources between the compute node and the storage node.
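A minimal sketch of this filtering step is given below, under stated assumptions: the storage node is modelled as plain Python lists, sign quantization stands in for whatever low-precision format is actually used, and the query vector of the speculative token is made up. None of these choices is prescribed by this application.

```python
import math

def approx_attention_scores(approx_query, low_precision_keys):
    """Softmax of scaled dot products between an approximate Q and coarse keys."""
    scale = 1.0 / math.sqrt(len(approx_query))
    logits = [scale * sum(q * k for q, k in zip(approx_query, key))
              for key in low_precision_keys]
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Full-precision KV pairs live on the storage node (modelled as lists).
storage_keys = [[0.9, 0.1], [-0.8, 0.3], [0.7, 0.6], [-0.1, -0.9]]
storage_values = [[1.0], [2.0], [3.0], [4.0]]
# The compute node keeps only a coarse copy (sign quantization, an assumption).
low_precision_keys = [[1.0 if x >= 0 else -1.0 for x in key]
                      for key in storage_keys]

speculative_query = [1.0, 0.5]  # approximate Q from the speculative token
scores = approx_attention_scores(speculative_query, low_precision_keys)
k = 2
top_k = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
# Only the selected full-precision pairs are moved to the compute node.
prefetched = [(storage_keys[i], storage_values[i]) for i in top_k]
```

The scores are computed entirely from data already on the compute node; the storage node is contacted only for the k selected pairs.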
[0114] 322. Obtain the second output token and the second speculative token by running the large model based on the first output token and at least one pair of first KV.
[0115] During the decoding process, a large model is run based on the first output token from the previous decoding output and at least one prefetched first key-value pair to obtain the second output token and the second speculative token for the current inference.
[0116] After the second speculative token is calculated, at least one relevant key-value pair can be filtered from the storage node for use in the next inference.
[0117] 323. Output the second word.
[0118] After obtaining the second output token, the second word can be obtained based on the second output token.
[0119] Specifically, the downstream task module in the large model can be used to output the second word, which is the word following the first word mentioned above. The second word can be any word other than the first word among the multiple words output by the large model.
[0120] In this embodiment, the calculated key-value pairs (KVs) can be stored in a storage node. Before using the KVs during the decoding process, the relevant KVs can be prefetched from the storage node based on the speculative tokens from the previous decoding process. That is, the computing node does not need to store all KVs or store high-precision KVs; it can prefetch the KVs required for the next decoding. This can reduce the consumption of computing node resources and make full use of the communication resources between the computing node and the storage node, thereby maximizing the use of the storage resources of the computing node, the storage resources of the storage node, and the communication resources between the computing node and the storage node, and improving the overall inference efficiency of the model.
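The control flow of steps 321 to 323 can be sketched as a loop. Everything below is a toy stand-in chosen for illustration: tokens and KV pairs are integers, and `toy_model` and `toy_select` are hypothetical helpers, not a real model or the selection rule of this application.

```python
def decode(first_output, first_speculative, storage_node,
           run_model, select_kv, steps):
    """Control flow of steps 321 to 323 with caller-supplied stand-ins."""
    output_token, speculative_token = first_output, first_speculative
    words = []
    for _ in range(steps):
        # 321: fetch the KV pairs chosen via the previous speculative token.
        indices = select_kv(speculative_token, storage_node)
        first_kv = [storage_node[i] for i in indices]
        # 322: run the model on the previous output token and the fetched KVs.
        output_token, speculative_token, new_kv = run_model(output_token, first_kv)
        storage_node.append(new_kv)  # newly computed KV goes to the storage node
        # 323: emit the word for the new output token.
        words.append(output_token)
    return words

def toy_model(token, kv_pairs):
    nxt = token + 1
    return nxt, nxt + 1, nxt  # output token, speculative guess, new KV entry

def toy_select(speculative_token, storage, k=2):
    # Pick the k stored entries closest to the speculative token.
    order = sorted(range(len(storage)),
                   key=lambda i: abs(storage[i] - speculative_token))
    return order[:k]

storage = [0, 1]
words = decode(first_output=1, first_speculative=2, storage_node=storage,
               run_model=toy_model, select_kv=toy_select, steps=3)
print(words)    # [2, 3, 4]
print(storage)  # [0, 1, 2, 3, 4]
```

Each iteration consumes the speculative token produced by the previous iteration, which is what allows the fetch in step 321 to be issued ahead of the computation in step 322.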
[0121] Furthermore, for the inference of the first word, the data in the input sequence can be fed into the large model to obtain a third output token and multiple key-value pairs. This third output token represents the first word, and the multiple key-value pairs are then offloaded to the storage node. Thus, during the inference process of the first word, the key-value pairs are offloaded to the storage node after the computation of each attention layer is completed, and the full set of high-precision key-value pairs is used to obtain an accurate output token for the first word.
[0122] For the approximate token of the second word, at least one pair of second key-value pairs can be obtained from the multiple key-value pairs based on the third output token. A third speculative token is then calculated based on the at least one pair of second key-value pairs, and this third speculative token is used to determine the key-value pairs associated with the second word. In other words, full-precision key-value pairs are used to calculate the approximate token of the next token. Therefore, when inferring the second word, it is not necessary to extract all key-value pairs from the storage node; only the key-value pairs with higher attention need to be extracted. This reduces the memory access overhead of key-value pairs during inference and improves the overall inference efficiency of the model.
[0123] Furthermore, during the inference process for the first word, the computing node can quantize the multiple key-value pairs output during inference, obtaining multiple quantized low-precision key-value pairs, and store the quantized low-precision key-value pairs locally. These quantized low-precision key-value pairs are used to determine the speculative token corresponding to the word that follows the current word. Therefore, in this embodiment, the computing node can locally store low-precision key-value pairs, reducing the storage footprint of the computing node and improving the utilization of communication resources between the computing node and the storage node. This further enhances the utilization of storage and communication resources in the computing device, thereby improving the overall inference efficiency of the model.
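One simple way to obtain low-precision key-value pairs is uniform min-max quantization. The sketch below is an assumption made for illustration; this application does not specify the quantization scheme, and the function names are hypothetical.

```python
def quantize(vector, bits):
    """Uniform min-max quantization of one vector to integer codes."""
    levels = (1 << bits) - 1
    lo, hi = min(vector), max(vector)
    step = (hi - lo) / levels if hi > lo else 1.0
    codes = [round((x - lo) / step) for x in vector]
    return codes, lo, step

def dequantize(codes, lo, step):
    """Map integer codes back to approximate floating-point values."""
    return [lo + c * step for c in codes]

key = [-1.0, -0.2, 0.3, 1.0]
codes, lo, step = quantize(key, bits=2)  # 2 bits: codes lie in 0..3
approx_key = dequantize(codes, lo, step)
print(codes)  # [0, 1, 2, 3]
```

The compute node would keep only the codes plus the two scalars `lo` and `step`, while the original vector is offloaded to the storage node.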
[0124] The foregoing has described the method flow provided by the embodiments of this application. The method flow provided by the embodiments of this application will be further described below in conjunction with specific application scenarios.
[0125] For example, the method flow provided in this application can be shown in Figure 4, in which the method provided in the embodiments of this application is divided into an initial stage and an intermediate stage for description.
[0126] In the initial stage, the input sequence is used as input to the LLM, and the complete inference process is performed for the first word.
[0127] In the intermediate stage, the output token and speculative token obtained from the previous inference can be used to perform inference and output the current output token and speculative token.
[0128] Typically, the attention mechanism in LLM is quite sparse, meaning that during decoding, it's crucial to ensure that the small number of key-value pairs most relevant to the current query are fully present in GPU memory. Based on this, this application considers reducing inference costs from multiple dimensions. Two strategies can be used to avoid latency caused by data offloading: 1. Reduce the number of key-value pairs prefetched to the GPU, only retrieving the few most important to the current query; 2. Select important key-value pairs well before the attention layer, allowing prefetching to be performed in parallel with GPU computation.
[0129] Therefore, this application provides a training-free inference scheme to achieve the above objectives. A low-precision copy of the KV cache is stored in VRAM, while the original KV cache is kept in CPU memory. Before each attention layer computation begins, it is assumed that the top k most relevant high-precision KV pairs guessed in the previous step have been prefetched into VRAM. In each decoding step, the LLM simultaneously decodes two tokens: one for calculating the model's output, and the other for predicting the most likely KV pair to be focused on in the next decoding step. These 16-bit KV pairs most relevant to the next step are prefetched into VRAM in parallel before the next attention computation.
[0130] Since the main bottleneck in the LLM decoding process is memory access speed, GPU utilization is relatively low. This means that decoding two tokens in parallel adds hardly any latency. Furthermore, since the embodiments of this application can prefetch the KV pairs required for attention one step in advance, prefetching and computation can be performed in parallel, thereby avoiding an increase in inference latency.
[0131] The following example, in which GPUs serve as computing nodes and CPU storage space serves as the storage node, illustrates the initial and intermediate stages.
[0132] I. Initial Stage
[0133] In the initial stage, the input sequence can be fed into the LLM, and inference can be performed once. Layer by layer, the calculated key-value pairs are used to output the accurate token corresponding to the first word, i.e., the output token T1 corresponding to the first word. During inference, a layer-by-layer offloading approach can be used to offload the key-value pairs to the CPU's storage space. The GPU can locally quantize full-precision key-value pairs into low-precision key-value pairs. Therefore, after offloading the full-precision key-value pairs from the previous layer to the CPU, the full-precision key-value pairs are compressed into low-precision key-value pairs to provide cache space for the full-precision key-value pairs of the next layer.
[0134] Before decoding the next word, since the GPU's memory only caches low-precision key-value pairs, in order to extract the high-precision key-value pairs needed for the next word's calculation, T1 can be used as the input to the large model. Combined with the cached low-precision key-value pairs, the approximate token T2' for the next word is output, which is the aforementioned speculative token for the second word. Furthermore, when using T1 as the input to the large model, the attention score of T1 is recorded to calculate the top k key-value pairs that are more important when inferring the second word. And after this inference is completed, these top k key-value pairs are prefetched from the CPU's memory.
[0135] That is, after the initial stage is completed, the tokens T1 and T2' and the first k KV pairs are stored in the GPU cache, and the next word is inferred based on the cache.
[0136] II. Intermediate Stage
[0137] In the intermediate stage, taking the decoding process of the (t+1)th word as an example, before decoding (i.e., after decoding the previous word), the GPU memory includes: the output token Tt, the speculative token T'{t+1}, and the first k key-value pairs associated with T'{t+1}. After decoding, two tokens are output: T{t+1} and T'{t+2}.
[0138] During the decoding stage of the previous word, the first k KV pairs arranged according to attention scores during the attention calculation process of T'{t+1} are recorded, and these first k KV pairs are prefetched from the CPU storage space immediately after the end of this inference. That is, the first k KV pairs associated with T'{t+1} are likely to be needed in the next step of decoding T{t+1}.
[0139] Specifically, during the decoding stage of the (t+1)th word, Tt is input into the LLM. After inference based on the first k key-value pairs associated with T'{t+1}, T{t+1} is output. To provide more accurate data for the inference of the next word, T{t+1} is used as input to the LLM, and inference is performed using locally cached key-value pairs to output the approximate token T'{t+2} for the next word. During attention calculation, the first k key-value pairs with higher attention scores are recorded, and these pairs are prefetched from the CPU's storage space after the current inference is completed. That is, while decoding the current token, the top-k key-value pairs of the next token are guessed, allowing for prefetching of the top-k key-value pairs. This enables CPU-GPU communication to run in parallel with model computation on the GPU, reducing inference latency.
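The overlap between prefetching and computation can be illustrated with a background thread. This is a sketch only: `prefetch_top_k` and `remaining_compute` are hypothetical stand-ins that sleep instead of performing a real CPU-to-GPU copy or GPU computation, and exactly which computation the transfer overlaps with is an assumption of the example.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def prefetch_top_k(cpu_store, indices):
    """Stand-in for a CPU-to-GPU copy of the selected KV pairs."""
    time.sleep(0.05)  # pretend transfer latency
    return {i: cpu_store[i] for i in indices}

def remaining_compute():
    """Stand-in for model computation that runs on the GPU meanwhile."""
    time.sleep(0.05)
    return "T_next"

cpu_store = {i: (f"k{i}", f"v{i}") for i in range(8)}
guessed_top_k = [5, 2, 7]  # recorded during the speculative token's attention

with ThreadPoolExecutor(max_workers=1) as pool:
    # The transfer runs in the background ...
    transfer = pool.submit(prefetch_top_k, cpu_store, guessed_top_k)
    # ... while computation proceeds on the main thread.
    token = remaining_compute()
    # The prefetched pairs are collected before they are needed.
    gpu_cache = transfer.result()
```

With the two stand-ins overlapped, the wall-clock time is close to the longer of the two rather than their sum.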
[0140] Meanwhile, during inference, the KV pairs generated by the GPU that are not among the top k are offloaded to the CPU's storage space, reducing the storage space occupied on the GPU.
[0141] After this inference is completed, the GPU caches T{t+1}, T'{t+2}, and the first k key-value pairs associated with T'{t+2}. Decoding is thus performed by storing a low-precision copy of the key-value cache in VRAM and decoding an accurate token together with an approximate (speculative) token.
[0142] The decoding process for subsequent words can be deduced from the decoding process for the (t+1)th word, and will not be elaborated further below.
[0143] Typically, to achieve parallel prefetching and computation, key-value pairs (KV pairs) need to be prefetched as early as possible. This raises a key challenge: how to determine which KV pairs are important before the attention operation occurs? In practice, it is not necessary to prefetch the exact top k KV pairs; it is sufficient that the prefetched KV pairs achieve a high hit rate, that is, that they contain the vast majority of the KV pairs with significant attention scores. Therefore, this application provides a scheme for pre-guessing the top-k KV pairs of the next token: by storing a low-precision copy of the KV cache in VRAM and decoding an accurate token together with an approximate (speculative) token, the top-k KV pairs of the next token can be guessed with a high hit rate.
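The notion of hit rate used above can be made concrete with a small calculation. The definition below (the fraction of the exact top-k KV pairs that also appear in the guessed top-k) is one plausible reading, and the score values are made up for illustration.

```python
def top_k_indices(scores, k):
    """Set of indices of the k largest scores."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return set(order[:k])

def hit_rate(guessed_scores, exact_scores, k):
    """Fraction of the exact top-k KV pairs that the guess also selected."""
    guessed = top_k_indices(guessed_scores, k)
    exact = top_k_indices(exact_scores, k)
    return len(guessed & exact) / k

exact = [0.02, 0.35, 0.08, 0.30, 0.25]    # made-up scores, accurate token
guessed = [0.05, 0.40, 0.20, 0.25, 0.10]  # made-up scores, speculative token
rate = hit_rate(guessed, exact, k=3)      # 2 of the 3 exact top-k were guessed
```

Here the guess selects pairs 1, 2 and 3 while the exact top 3 are pairs 1, 3 and 4, so two of the three needed pairs would already have been prefetched.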
[0144] When predicting the attention score for the next word, the required data includes a history key cache in video memory (VRAM) and an approximate representation of the next query. For the history key cache, training-free KV cache quantization can be applied; for example, a 2-bit or even 1-bit approximation of the KV cache can be stored in VRAM. For the approximate representation of the next query, this application decodes an additional "speculative token" in parallel with the input token in each decoding step to approximate the next output token. Therefore, the entire set of KV pairs can be offloaded to CPU storage. With the KV cache offloaded, the GPU only needs to load the top-k KV cache to recover most of the attention performance, achieving high-performance LLM inference with lower CPU-GPU communication overhead.
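The VRAM saving can be estimated with simple arithmetic. The sketch below assumes a model shape of 32 layers with hidden size 4096, a 2000-token context, a 2-bit copy, and k = 64 prefetched 16-bit pairs; all of these numbers are illustrative assumptions rather than values taken from this application, and quantization metadata is ignored.

```python
def vram_bytes(num_tokens, elements_per_token, low_bits, top_k, full_bits=16):
    """VRAM used by a low-precision copy plus top-k full-precision KV pairs."""
    low_precision_copy = num_tokens * elements_per_token * low_bits / 8
    prefetched_top_k = top_k * elements_per_token * full_bits / 8
    return low_precision_copy + prefetched_top_k

# Assumed shape: keys and values, 32 layers, hidden size 4096.
elements_per_token = 2 * 32 * 4096
# Baseline: the whole 16-bit cache for 2000 tokens kept in VRAM.
full = 2000 * elements_per_token * 16 / 8
mixed = vram_bytes(2000, elements_per_token, low_bits=2, top_k=64)
ratio = mixed / full  # about 0.157 under these assumptions
```

Under these assumptions the compute node holds roughly one sixth of the memory that the full 16-bit cache would need, with the remainder kept on the storage node.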
[0145] Compared to existing solutions that perform LLM inference on the GPU, this application adds only one pre-decoding step and decodes two tokens simultaneously during decoding. This additional pre-decoding step is negligible in the overall sentence generation. Although the number of decoded tokens increases, the model weights and KV cache used by both tokens are the same. Since the decoding process of large language models is limited by memory access speed, decoding two tokens simultaneously allows for shared access to the model weights and KV cache without introducing additional latency. Furthermore, since prefetching is performed entirely in parallel with GPU operations, the overall inference latency is almost unaffected. Therefore, in this embodiment, a high-precision LLM inference process can be achieved with lower GPU storage costs and without increasing computational load.
[0146] After the input sequence has been fully decoded, the complete token sequence can be processed for downstream tasks. For example, in LLM-based tasks such as natural language processing, dialogue systems, and machine translation, efficient key-value cache management can improve the inference speed and performance of the model; and when running LLMs on resource-constrained devices (such as smartphones and IoT devices), it can improve operational efficiency and reduce power consumption.
[0147] For example, the solutions provided in the embodiments of this application can be deployed in an LLM inference engine or library to provide efficient LLM inference services that support various application scenarios, or in customized GPUs or accelerators for high-performance computing and AI applications with the corresponding software pre-installed. The specific form of service provided to users can be a cloud service, a software product, a software tool, a device deploying the solutions provided in the embodiments of this application, or an LLM inference service, and can be determined according to the actual application scenario.
[0148] The foregoing has described the method flow provided in the embodiments of this application. The following describes the structure of the apparatus for executing the foregoing method flow.
[0149] Referring to Figure 5, a schematic diagram of a data processing device provided in an embodiment of this application is shown. This data processing device can be applied to a computing node, and the device includes:
[0150] Module 501 is used to acquire the input sequence;
[0151] The inference module 502 is used to input the input sequence into the large model and output multiple words. The large model includes at least one attention module, and the output of each attention module includes at least one pair of key-value pairs (KV).
[0152] The inference module 502 executes the following output process for a word other than the first word among multiple words: obtaining at least one pair of first key-value pairs from the storage node based on the first speculative token; running a large model based on the first output token and at least one pair of first key-value pairs to obtain a second output token and a second speculative token, wherein the first output token represents the first word, the at least one pair of first key-value pairs is obtained in advance from the storage node, the first speculative token is output synchronously during the calculation of the first output token, the first speculative token is an approximate token of the second output token, the second output token represents the second word, the first word and the second word are adjacent words among multiple words, the second speculative token is used to calculate the next word of the second word and to obtain in advance from the storage node the key-value pairs associated with the next word of the second word; and outputting the second word.
[0153] In one possible implementation, the aforementioned inference module 502 is further configured to: after calculating the first speculative token, obtain the attention score of the first speculative token for each key value (KV); and based on the attention score of the first speculative token, select at least one pair of first key values (KV) from the KV stored in the storage node.
[0154] In one possible implementation, the aforementioned inference module 502 is used to filter out the top k pairs of KVs arranged from high to low according to the attention score of the first speculative token from the KVs stored in the storage node as at least one pair of first KVs, where k is a positive integer.
[0155] In one possible implementation, where the computing node also caches the quantized low-precision KVs, the aforementioned inference module 502 is used to determine the attention score of the first speculative token based on the low-precision KVs.
[0156] In one possible implementation, the aforementioned inference module 502 is used to: if the currently calculated word is the first word, input the data in the input sequence into the large model to obtain the third output token and multiple pairs of key-value pairs; and offload the multiple pairs of key-value pairs to the storage node.
[0157] In one possible implementation, the aforementioned inference module 502 is configured to: obtain at least one pair of second KVs from the multiple pairs of KVs based on the third output token; and calculate a third speculative token based on the at least one pair of second KVs, the third speculative token being used to determine the KVs associated with the second word.
[0158] In one possible implementation, the aforementioned inference module 502 is further configured to: quantize multiple pairs of key-value pairs to obtain quantized multiple pairs of low-precision key-value pairs, which are then used to determine the speculative token corresponding to the next word of the current word.
[0159] Figure 6 shows a schematic diagram of the hardware structure of a computing device 60 provided in an embodiment of this application. This computing device 60 can be used to implement the steps of the methods shown in Figures 2 to 4, and may specifically include the aforementioned computing nodes or computing devices.
[0160] The computing device 60 shown in Figure 6 may include a processor 601, a memory 602, a communication interface 603, and a bus 604. The processor 601, the memory 602, and the communication interface 603 can be connected via the bus 604.
[0161] The processor 601 is the control center of the computing device 60. It can be a general-purpose central processing unit (CPU) or other general-purpose processors. The general-purpose processor can be a microprocessor or any conventional processor, such as a GPU or NPU, and can be adapted to the actual application scenario.
[0162] As an example, processor 601 may include one or more CPUs, and may also include other processors, such as the CPU, NPU or GPU shown in Figure 6.
[0163] The memory 602 may be a read-only memory (ROM) or other type of static storage device capable of storing static information and instructions, random access memory (RAM) or other type of dynamic storage device capable of storing information and instructions, or electrically erasable programmable read-only memory (EEPROM), disk storage media or other magnetic storage devices, or any other medium capable of carrying or storing desired program code in the form of instructions or data structures and accessible by a computer, but is not limited thereto.
[0164] In one possible implementation, the memory 602 may exist independently of the processor 601. The memory 602 can be connected to the processor 601 via a bus 604 and is used to store data, instructions, or program code. When the processor 601 calls and executes the instructions or program code stored in the memory 602, it can implement the methods provided in the embodiments of this application, such as the methods shown in Figures 2 to 4.
[0165] In another possible implementation, the memory 602 can also be integrated with the processor 601.
[0166] The communication interface 603 is used for connecting the computing device 60 to other devices via a communication network, which may be Ethernet, radio access network (RAN), wireless local area network (WLAN), etc. The communication interface 603 may include a receiving unit for receiving data and a transmitting unit for sending data.
[0167] Bus 604 can be an industry standard architecture (ISA) bus, a peripheral component interconnect (PCI) bus, or an extended industry standard architecture (EISA) bus. This bus can be divided into address bus, data bus, control bus, etc. For ease of illustration, only one thick line is used in Figure 6, but this does not indicate that there is only one bus or one type of bus.
[0168] It should be noted that the structure shown in Figure 6 does not constitute a limitation on the computing device 60. In addition to the components shown in Figure 6, the computing device 60 may include more or fewer components than shown, or combine certain components, or have different component arrangements.
[0169] Through the above description of the embodiments, those skilled in the art can clearly understand that this application can be implemented by means of software plus necessary general-purpose hardware, or it can be implemented by special-purpose hardware including application-specific integrated circuits, special-purpose CPUs, special-purpose memory, special-purpose components, etc. Generally, any function performed by a computer program can be easily implemented by corresponding hardware, and the specific hardware structure used to implement the same function can also be diverse, such as analog circuits, digital circuits, or special-purpose circuits. However, for this application, a software implementation is in most cases the preferred implementation. Based on this understanding, the technical solution of this application, in essence, or the part that contributes over the prior art, can be embodied in the form of a software product. This computer software product is stored in a readable storage medium, such as a computer floppy disk, USB flash drive, mobile hard disk, read-only memory (ROM), random access memory (RAM), magnetic disk, or optical disk, and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute the methods described in the various embodiments of this application.
[0170] In the above embodiments, implementation can be achieved, in whole or in part, through software, hardware, firmware, or any combination thereof. When implemented in software, it can be implemented, in whole or in part, as a computer program product.
[0171] This application also provides a computer-readable storage medium storing a program for training a model or performing inference tasks, which, when run on a computer, causes the computer to perform all or part of the steps in the methods described in the embodiments shown in Figures 2 to 4 above.
[0172] This application also provides a digital processing chip. This digital processing chip integrates circuitry for implementing the aforementioned processor or processor functions, and one or more interfaces. When the digital processing chip integrates a memory, it can perform the method steps of any one or more of the foregoing embodiments. When the digital processing chip does not integrate a memory, it can be connected to an external memory via a communication interface. The digital processing chip implements the method steps of any one or more of the foregoing embodiments based on the program code stored in the external memory.
[0173] This application also provides a computer program product comprising one or more computer instructions. When the computer program instructions are loaded and executed on a computer, all or part of the processes or functions described in this application are generated. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or other programmable device. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another. For example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center via wired (e.g., coaxial cable, fiber optic, digital subscriber line (DSL)) or wireless (e.g., infrared, radio, microwave, etc.) means. The computer-readable storage medium may be any available medium that a computer can access, or a data storage device, such as a server or data center, that integrates one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., a solid-state disk (SSD)).
[0174] The data processing apparatus or computing device provided in the embodiments of this application can be a chip, which includes a processing unit and a communication unit. The processing unit can be, for example, a processor, and the communication unit can be, for example, an input/output interface, pins, or circuits. The processing unit can execute computer-executable instructions stored in a storage unit to cause the chip in the server to execute the methods described in the embodiments shown in Figures 2 to 4 above. Optionally, the storage unit is a storage unit within the chip, such as a register or cache. The storage unit can also be a storage unit located outside the chip, such as a read-only memory (ROM) or another type of static storage device that can store static information and instructions, or a random access memory (RAM).
[0175] Specifically, the aforementioned processing unit or processor can be a central processing unit (CPU), a neural-network processing unit (NPU), a graphics processing unit (GPU), a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc. A general-purpose processor can be a microprocessor or any conventional processor.
[0176] It should also be noted that the device embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the modules can be selected to achieve the purpose of this embodiment according to actual needs. In addition, in the device embodiment drawings provided in this application, the connection relationship between modules indicates that they have a communication connection, which can be implemented as one or more communication buses or signal lines.
[0180] The terms "first," "second," etc., in the specification, claims, and accompanying drawings of this application are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. It should be understood that such data can be interchanged where appropriate so that the embodiments described herein can be implemented in a sequence other than that illustrated or described herein. The term "and / or" in this application is merely a description of the relationship between related objects, indicating that three relationships can exist. For example, A and / or B can represent: A alone, A and B simultaneously, and B alone. Additionally, the character " / " generally indicates that the preceding and following related objects are in an "or" relationship. Furthermore, the terms "comprising" and "having," and any variations thereof, are intended to cover non-exclusive inclusion. For example, a process, method, system, product, or device that includes a series of steps or modules is not necessarily limited to those steps or modules explicitly listed, but may include other steps or modules not explicitly listed or inherent to such processes, methods, products, or devices. The naming or numbering of steps in this application does not imply that the steps in the method flow must be executed in the time / logical order indicated by the naming or numbering. The execution order of the named or numbered process steps can be changed according to the technical purpose to be achieved, as long as the same or similar technical effect can be achieved. The division of modules in this application is a logical division. In actual applications, there may be other division methods. For example, multiple modules may be combined into or integrated into another system, or some features may be ignored or not executed. 
In addition, the coupling or direct coupling or communication connection between the modules shown or discussed may be through some ports, and the indirect coupling or communication connection between modules may be electrical or other similar forms, which are not limited in this application. Furthermore, the modules or sub-modules described as separate components may or may not be physically separated, may or may not be physical modules, or may be distributed in multiple circuit modules. Some or all of the modules can be selected to achieve the purpose of the solution in this application according to actual needs.
Claims
1. A data processing method, characterized in that the method is applied to a computing node and comprises: obtaining an input sequence; and inputting the input sequence into a large model to output multiple words, wherein the large model includes at least one attention module, and the output of each attention module includes at least one key-value pair (KV); wherein the output process for a word other than the first word among the multiple words comprises: obtaining at least one pair of first KVs from a storage node based on a first speculative token; running the large model based on a first output token and the at least one pair of first KVs to obtain a second output token and a second speculative token, wherein the first output token represents a first word, the at least one pair of first KVs is obtained in advance from the storage node, the first speculative token is output synchronously during the calculation of the first output token, the first speculative token is an approximate token of the second output token, the second output token represents a second word, the first word and the second word are adjacent words among the multiple words, and the second speculative token is used to calculate the next word after the second word and to obtain in advance, from the storage node, the KVs associated with the next word after the second word; and outputting the second word.
2. The method according to claim 1, characterized in that the method further comprises: after calculating the first speculative token, obtaining the attention score of the first speculative token for each KV; and selecting, based on the attention score of the first speculative token, the at least one pair of first KVs from the KVs stored in the storage node.
3. The method according to claim 2, characterized in that selecting the at least one pair of first KVs from the KVs stored in the storage node based on the attention score of the first speculative token comprises: selecting, from the KVs stored in the storage node, the top k pairs of KVs ranked in descending order of the attention score of the first speculative token as the at least one pair of first KVs, where k is a positive integer.
4. The method according to claim 2 or 3, characterized in that obtaining the attention score of the first speculative token for each KV comprises: if the computing node also caches quantized low-precision KVs, determining the attention score of the first speculative token based on the low-precision KVs.
5. The method according to any one of claims 1-4, characterized in that the method further comprises: if the word being calculated is the first word, inputting the data in the input sequence into the large model to obtain a third output token and multiple pairs of KVs; and offloading the multiple pairs of KVs to the storage node.
6. The method according to claim 5, characterized in that the method further comprises: obtaining at least one pair of second KVs from the multiple pairs of KVs based on the third output token; and calculating a third speculative token based on the at least one pair of second KVs, wherein the third speculative token is used to determine the KVs associated with the second word.
7. The method according to claim 5 or 6, characterized in that the method further comprises: quantizing the multiple pairs of KVs to obtain multiple pairs of low-precision KVs, wherein the multiple pairs of low-precision KVs are used to determine the speculative token corresponding to the next word after the current word.
8. A data processing apparatus, characterized in that the apparatus is applied to a computing node and comprises: an acquisition module, configured to obtain an input sequence; and an inference module, configured to input the input sequence into a large model to output multiple words, wherein the large model includes at least one attention module, and the output of each attention module includes at least one key-value pair (KV); wherein the output process of the inference module for a word other than the first word among the multiple words comprises: obtaining at least one pair of first KVs from a storage node based on a first speculative token; running the large model based on a first output token and the at least one pair of first KVs to obtain a second output token and a second speculative token, wherein the first output token represents a first word, the at least one pair of first KVs is obtained in advance from the storage node, the first speculative token is output synchronously during the calculation of the first output token, the first speculative token is an approximate token of the second output token, the second output token represents a second word, the first word and the second word are adjacent words among the multiple words, and the second speculative token is used to calculate the next word after the second word and to obtain in advance, from the storage node, the KVs associated with the next word after the second word; and outputting the second word.
9. The apparatus according to claim 8, characterized in that the inference module is further configured to: after calculating the first speculative token, obtain the attention score of the first speculative token for each KV; and select, based on the attention score of the first speculative token, the at least one pair of first KVs from the KVs stored in the storage node.
10. The apparatus according to claim 9, characterized in that the inference module is configured to select, from the KVs stored in the storage node, the top k pairs of KVs ranked in descending order of the attention score of the first speculative token as the at least one pair of first KVs, where k is a positive integer.
11. The apparatus according to claim 9 or 10, characterized in that the inference module is configured to determine the attention score of the first speculative token based on quantized low-precision KVs when the computing node also caches the quantized low-precision KVs.
12. The apparatus according to any one of claims 8-11, characterized in that the inference module is configured to: if the word being calculated is the first word, input the data in the input sequence into the large model to obtain a third output token and multiple pairs of KVs; and offload the multiple pairs of KVs to the storage node.
13. The apparatus according to claim 12, characterized in that the inference module is configured to: obtain at least one pair of second KVs from the multiple pairs of KVs based on the third output token; and calculate a third speculative token based on the at least one pair of second KVs, wherein the third speculative token is used to determine the KVs associated with the second word.
14. The apparatus according to claim 12 or 13, characterized in that the inference module is further configured to: quantize the multiple pairs of KVs to obtain multiple pairs of low-precision KVs, wherein the multiple pairs of low-precision KVs are used to determine the speculative token corresponding to the next word after the current word.
15. A computing device, characterized in that the device includes a memory and a processor; the memory stores code, and the processor is configured to execute the code, wherein when the code is executed, the device performs the steps of the method according to any one of claims 1-7.
16. A computer storage medium, characterized in that the computer storage medium stores instructions that, when executed by a computer, cause the computer to perform the method according to any one of claims 1-7.
17. A computer program product, characterized in that the computer program product stores instructions that, when executed by a computer, cause the computer to perform the method according to any one of claims 1-7.