Key-value (KV) cache memory management
Performing ping-pong switching between KV cache buffers on a DSP or NPU, rather than on the CPU, addresses the high CPU power consumption of LLM text generation tasks and improves computational efficiency and performance.
Patent Information
- Authority / Receiving Office: CN · China
- Patent Type: Applications (China)
- Current Assignee / Owner: QUALCOMM INC
- Filing Date: 2023-11-30
- Publication Date: 2026-06-19
AI Technical Summary
Large language models (LLMs) rely on a central processing unit (CPU) to perform memory shift operations in text generation tasks, which consumes significant power and degrades performance.
A computing engine other than the CPU, such as a digital signal processor (DSP) or neural processing unit (NPU), manages the input and output operations of the AI model by ping-pong switching between two KV cache buffers of the same size. The same computing engine performs the splicing and slicing operations.
This reduces the time and power consumption of the CPU in KV cache management, improving computing efficiency and performance.
Figure CN122249788A_ABST
Description
Technical Field
[0001] Aspects of this disclosure relate generally to generative artificial intelligence (AI) systems, and more specifically to a key-value (KV) cache memory management system for generative AI systems.
Background Technology
[0002] Artificial neural networks can comprise interconnected groups of artificial neurons (e.g., neuron models). An artificial neural network (ANN) can be a computing device or can be represented as a method to be performed by a computing device. A convolutional neural network (CNN) is a type of feedforward ANN. A CNN can comprise a collection of neurons, where each neuron has a receptive field and the neurons collectively tile an input space. CNNs such as deep convolutional neural networks (DCNs) have numerous applications. Specifically, these neural network architectures are used in a variety of technologies, such as image recognition, speech recognition, acoustic scene classification, keyword retrieval, autonomous driving, and other classification tasks.
[0003] Given the many useful applications of neural networks, the need for them to solve increasingly complex problems in other application areas is growing. One area being explored is generative artificial intelligence.
[0004] Large Language Models (LLMs) have made significant strides in the field of natural language understanding and are gaining popularity for text-generative tasks and tasks involving modeling information from both textual and visual domains. LLMs can receive prompts from users and then generate responses or completions. In generating responses, LLMs compute inferences. Inference is the process of using a trained model to generate predictions, perform computations, or make evaluations based on input data. During inference, a pre-trained language model receives input data (which can be a sequence of words, sentences, or paragraphs) and processes that information through its neural network architecture. The model then produces outputs based on its learned parameters and the patterns it captures from the training data.
Summary of the Invention
[0005] In various aspects of this disclosure, a processor-implemented method includes: allocating a first key-value (KV) cache buffer and a second KV cache buffer from system memory, the first KV cache buffer and the second KV cache buffer having the same size. The method further includes: computing a first inference using a first neural network. The method further includes: outputting a first set of new KV cache data computed during the first inference to the first KV cache buffer. The method further includes: computing a second inference using a second neural network, the second inference occurring after the first inference, the second neural network receiving input from the first KV cache buffer for the second inference. The method further includes: outputting a second set of new KV cache data computed during the second inference to the second KV cache buffer. The method further includes: computing a third inference using the second neural network, the third inference occurring after the second inference, the second neural network receiving input from the second KV cache buffer for the third inference. The method further includes: outputting a third set of cache data computed during the third inference to the first KV cache buffer.
[0006] Other aspects of this disclosure relate to an apparatus. The apparatus has at least one memory and one or more processors coupled to the at least one memory. The processor is configured to: allocate a first key-value (KV) cache buffer and a second KV cache buffer from system memory, the first KV cache buffer and the second KV cache buffer having the same size. The processor is further configured to: compute a first inference using a first neural network. The processor is further configured to: output a first set of new KV cache data computed during the first inference to the first KV cache buffer. The processor is further configured to: compute a second inference using a second neural network, the second inference occurring after the first inference, the second neural network receiving input from the first KV cache buffer for the second inference. The processor is further configured to: output a second set of new KV cache data computed during the second inference to the second KV cache buffer. The processor is further configured to: compute a third inference using the second neural network, the third inference occurring after the second inference, the second neural network receiving input from the second KV cache buffer for the third inference. The processor is further configured to: output a third set of cache data computed during the third inference to the first KV cache buffer.
[0007] In other aspects of this disclosure, a non-transitory computer-readable medium having program code recorded thereon is disclosed. The program code is executed by at least one processor and includes program code for allocating a first key-value (KV) cache buffer and a second KV cache buffer from system memory, the first and second KV cache buffers having the same size. The program code also includes program code for computing a first inference using a first neural network. The program code further includes program code for outputting a first set of new KV cache data computed during the first inference to the first KV cache buffer. The program code also includes program code for computing a second inference using a second neural network, the second inference occurring after the first inference, the second neural network receiving input from the first KV cache buffer for the second inference. The program code also includes program code for outputting a second set of new KV cache data computed during the second inference to the second KV cache buffer. The program code further includes program code for computing a third inference using the second neural network, the third inference occurring after the second inference, the second neural network receiving input from the second KV cache buffer for the third inference. The program code also includes program code for outputting the third set of cached data calculated during the third inference to the first KV cache buffer.
[0008] Other aspects of this disclosure relate to an apparatus. The apparatus includes components for allocating a first key-value (KV) cache buffer and a second KV cache buffer from system memory, the first and second KV cache buffers having the same size. The apparatus also includes components for computing a first inference using a first neural network. The apparatus further includes components for outputting a first set of new KV cache data computed during the first inference to the first KV cache buffer. The apparatus also includes components for computing a second inference using a second neural network, the second inference occurring after the first inference, the second neural network receiving input from the first KV cache buffer for the second inference. The apparatus also includes components for outputting a second set of new KV cache data computed during the second inference to the second KV cache buffer. The apparatus further includes components for computing a third inference using the second neural network, the third inference occurring after the second inference, the second neural network receiving input from the second KV cache buffer for the third inference. The apparatus also includes components for outputting a third set of cache data computed during the third inference to the first KV cache buffer.
[0009] Additional features and advantages of this disclosure will be described below. Those skilled in the art will understand that this disclosure can be readily used as the basis for modifying or designing other structures for implementing the same purposes as this disclosure. Those skilled in the art will also recognize that such equivalent constructions do not depart from the teachings of this disclosure as set forth in the appended claims. Novel features considered characteristic of this disclosure, in both their organization and manner of operation, along with further objects and advantages, will be better understood when considered in conjunction with the accompanying drawings. However, it is to be clearly understood that each drawing is provided for illustrative and descriptive purposes only and is not intended to be a definition of a limitation of this disclosure.
Attached Figure Description
[0010] The features, substance, and advantages of this disclosure will become more apparent when understood in conjunction with the accompanying drawings, in which the same reference numerals identify corresponding parts throughout.
[0011] Figure 1 illustrates an example implementation of a neural network using a system-on-a-chip (SoC), including a general-purpose processor, according to certain aspects of this disclosure.
[0012] Figures 2A, 2B, and 2C are diagrams illustrating neural networks according to various aspects of this disclosure.
[0013] Figure 2D is a diagram illustrating an exemplary deep convolutional network (DCN) according to various aspects of this disclosure.
[0014] Figure 3 is a block diagram illustrating an exemplary deep convolutional network (DCN) according to various aspects of this disclosure.
[0015] Figure 4 is a block diagram illustrating an exemplary software architecture that enables modularization of artificial intelligence (AI) functions according to various aspects of this disclosure.
[0016] Figures 5 and 6 are block diagrams illustrating a lexical generation model.
[0017] Figure 7 is a block diagram illustrating a key-value (KV) cache management pipeline for natural language processing (NLP).
[0018] Figure 8 is a network diagram illustrating an improved KV cache management pipeline according to various aspects of this disclosure.
[0019] Figure 9A is a flowchart illustrating a process for updating the KV cache according to various aspects of this disclosure.
[0020] Figure 9B is a block diagram illustrating the KV cache memory during updates according to various aspects of this disclosure.
[0021] Figure 10 is a table comparing a conventional KV cache management pipeline with an improved KV cache management pipeline according to various aspects of this disclosure.
[0022] Figure 11 is a flowchart illustrating a processor-implemented method for storing KV cache information according to various aspects of this disclosure.
Detailed Implementation
[0023] The detailed description that follows, taken in conjunction with the accompanying drawings, is intended as a description of various configurations and is not intended to represent the only configurations in which the described concepts can be practiced. To provide a comprehensive understanding of the various concepts, the detailed description includes specific details. However, it will be apparent to those skilled in the art that these concepts can be practiced without these specific details. In some instances, to avoid obscuring such concepts, well-known structures and components are shown in block diagram form.
[0024] Based on the teachings, those skilled in the art will recognize that the scope of this disclosure is intended to cover any aspect of this disclosure, whether implemented independently of or in combination with any other aspect of this disclosure. For example, an apparatus or method may be implemented using any number of the aspects described. Furthermore, the scope of this disclosure is intended to cover such apparatuses or methods practiced using other structures, functionalities, or structures and functionalities that complement or differ from the various aspects of this disclosure described. It should be understood that any aspect of this disclosure may be embodied by one or more elements of the claims.
[0025] The word “exemplary” is used to mean “serving as an example, instance, or illustration.” Any aspect described as “exemplary” need not be interpreted as superior to or better than other aspects.
[0026] While specific aspects have been described, numerous variations and substitutions of these aspects fall within the scope of this disclosure. Although some benefits and advantages of preferred aspects have been mentioned, the scope of this disclosure is not intended to be limited to a particular benefit, use, or purpose. Rather, aspects of this disclosure are intended to be broadly applicable to different technologies, system configurations, networks, and protocols, some of which are illustrated by way of example in the accompanying drawings and the following description of preferred aspects. The detailed description and drawings are merely illustrative and not limiting of this disclosure, the scope of which is defined by the appended claims and their equivalents.
[0027] As described, large language models (LLMs) have made significant progress in the field of natural language understanding and are becoming increasingly popular for text-generative tasks as well as tasks involving modeling information from textual and visual domains.
[0028] One type of LLM is the Bidirectional Encoder Representations from Transformers (BERT) model. The BERT model applies bidirectional training of the transformer to language modeling. The BERT model has become a popular choice among LLMs because of its broad natural language processing (NLP) capabilities. NLP tasks involve reading, analyzing, and deriving meaning from text and spoken words by combining linguistic, statistical, and machine learning methods to understand human language. While individual NLP tasks are typically solved by separate models created for each specific task, the BERT model solves many common NLP tasks.
[0029] However, LLMs such as LLaMA models, which perform autoregressive inference to generate lexical units (tokens) one at a time, typically rely on a central processing unit (CPU) to perform model input and output operations. To perform these input and output operations, the CPU performs memory shifts on the key-value (KV) cache buffer at each inference step. Memory shifts consume significant power and degrade the overall performance of the CPU.
[0030] Various aspects of this disclosure relate to an improved key-value cache management system that utilizes a computing engine other than the CPU to manage the input and output operations of an artificial intelligence (AI) model. In some aspects, a processor implementing the improved KV cache management system may allocate two KV cache buffers of equal size and perform a ping-pong switch between the two KV cache buffers after each inference. That is, after a first inference, the first inference output is stored in the first KV cache buffer. Information from the first KV cache buffer is used as input for a second inference. After the second inference, the second inference output is stored in the second KV cache buffer. Information from the second KV cache buffer is used as input for a third inference, and the third inference output is stored in the first KV cache buffer. Then, information from the first KV cache buffer is used as input for a fourth inference, and the fourth inference output is stored in the second KV cache buffer. The processor may continue to perform the ping-pong switch between the two KV cache buffers until the final inference of the AI model.
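The alternation described above can be sketched in a few lines of Python. This is an illustration only, not the implementation disclosed in the patent: `run_ping_pong` and the buffer names `"first"` and `"second"` are hypothetical, and the model call is replaced by a placeholder string.

```python
def run_ping_pong(num_inferences):
    """Record, for each inference, which buffer it reads and which it writes."""
    # Two equally sized buffers; only their roles alternate between inferences.
    buffers = [{"name": "first", "data": None}, {"name": "second", "data": None}]
    write_idx = 0  # the first inference writes to the first buffer
    trace = []
    for step in range(1, num_inferences + 1):
        read_idx = 1 - write_idx
        # The first inference has no previously written cache to read from.
        source = buffers[read_idx]["name"] if step > 1 else None
        # Placeholder for the KV data a real model would produce.
        buffers[write_idx]["data"] = f"output of inference {step}"
        trace.append((step, source, buffers[write_idx]["name"]))
        # Ping-pong: flip the index; no cache data is moved here.
        write_idx = 1 - write_idx
    return trace


for step, source, target in run_ping_pong(4):
    print(f"inference {step}: reads {source}, writes {target}")
```

The trace reproduces the sequence in the paragraph: the second inference reads the first buffer and writes the second, the third reads the second and writes the first, and so on.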
[0031] During the ping-pong switching operation, a computational engine other than the CPU (such as a digital signal processor (DSP) or neural processing unit (NPU)) can perform splicing and slicing operations on the KV cache memory. First, the computational engine can splice the model output with information stored in the previous KV cache buffer (or input cache buffer) and store the spliced data in a temporary cache buffer. Then, the computational engine can perform a slicing operation by copying only a portion of the temporary cache buffer to the output KV cache buffer.
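A minimal sketch of the splice-then-slice step follows, using plain Python lists in place of device memory. The function name `splice_and_slice` is hypothetical, and the choice to keep the most recent entries is an assumption: the text says only that a portion of the temporary buffer is copied.

```python
def splice_and_slice(input_buf, model_output, output_buf):
    """Splice new KV entries onto the input cache, then slice into output_buf."""
    # Splice: concatenate the previous cache with the new model output
    # into a temporary buffer.
    temp = list(input_buf) + list(model_output)
    # Slice: copy only as many entries as the output buffer holds.
    # Assumption: the most recent entries are the ones kept.
    keep = len(output_buf)
    output_buf[:] = temp[len(temp) - keep:]
    return output_buf


input_cache = ["kv1", "kv2", "kv3", "kv4"]
output_cache = [None] * 4  # same size as the input buffer
splice_and_slice(input_cache, ["kv5"], output_cache)
print(output_cache)
```

The input buffer is left untouched, which is what allows the two buffers to exchange roles on the next inference.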
[0032] Specific aspects of the subject matter described in this disclosure can be implemented to achieve one or more of the following potential advantages. In some examples, the described techniques (such as allocating two KV cache buffers and alternating between them) reduce power consumption and improve performance because the computing engine, rather than the CPU, performs the slicing and splicing operations. The CPU only switches the pointers for the first and second KV cache buffers before each new inference. Therefore, the CPU spends less time and power implementing the improved KV cache management system than it would implementing a conventional KV cache management system.
[0033] Figure 1 illustrates an example implementation of a system-on-a-chip (SOC) 100, which may include a central processing unit (CPU) 102 or a multi-core CPU configured for KV cache management. Variables (e.g., neural signals and synaptic weights), system parameters associated with a computing device (e.g., a neural network with weights), delays, frequency bin information, and task information may be stored in a memory block associated with a neural processing unit (NPU) 108, a memory block associated with the CPU 102, a memory block associated with a graphics processing unit (GPU) 104, a memory block associated with a digital signal processor (DSP) 106, a memory block 118, or may be distributed across multiple blocks. Instructions executed at the CPU 102 may be loaded from the program memory associated with the CPU 102 or from the memory block 118.
[0034] The SOC 100 may also include additional processing blocks tailored for specific functions, such as the GPU 104, the DSP 106, a connectivity block 110 (which may include fifth-generation (5G) connectivity, fourth-generation long-term evolution (4G LTE) connectivity, Wi-Fi connectivity, USB connectivity, Bluetooth connectivity, etc.), and a multimedia processor 112 capable of, for example, detecting and recognizing gestures. In one specific implementation, the NPU 108 is implemented within the CPU 102, the DSP 106, and/or the GPU 104. The SOC 100 may also include a sensor processor 114, an image signal processor (ISP) 116, and/or a navigation module 120, which may include a global positioning system.
[0035] The SOC 100 can be based on ARM, RISC-V (RISC-5), or any reduced instruction set computing (RISC) architecture. In various aspects of this disclosure, the instructions loaded into the CPU 102 may include code for allocating a first key-value (KV) cache buffer and a second KV cache buffer from system memory, the first and second KV cache buffers having the same size. The instructions loaded into the CPU 102 may also include code for computing a first inference using a first neural network. The instructions loaded into the CPU 102 may also include code for outputting a first set of new KV cache data computed during the first inference to the first KV cache buffer. The instructions loaded into the CPU 102 may also include code for computing a second inference using a second neural network, the second inference occurring after the first inference, the second neural network receiving input from the first KV cache buffer for the second inference. The instructions loaded into the CPU 102 may also include code for outputting a second set of new KV cache data computed during the second inference to the second KV cache buffer. The instructions loaded into the CPU 102 may also include code for computing a third inference using the second neural network, the third inference occurring after the second inference, the second neural network receiving input from the second KV cache buffer for the third inference. The instructions loaded into the CPU 102 may also include code for outputting a third set of cache data computed during the third inference to the first KV cache buffer.
[0036] Deep learning architectures perform object recognition tasks by learning to represent inputs at progressively higher levels of abstraction in each layer, thereby constructing useful feature representations of the input data. In this way, deep learning addresses a major bottleneck of traditional machine learning. Before deep learning, machine learning methods for object recognition problems often relied heavily on human-designed features, possibly in conjunction with shallow classifiers. Shallow classifiers could be two-class linear classifiers, where a weighted sum of feature vector components is compared to a threshold to predict which class the input belongs to. Human-designed features could be templates or kernels customized for a specific problem domain by engineers with domain expertise. In contrast, while deep learning architectures can learn to represent features similar to those that human engineers might design, this requires training. Furthermore, deep networks can learn to represent and recognize novel types of features that humans might not have considered.
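The shallow two-class linear classifier mentioned above can be written in a few lines. This is a sketch: the function name `linear_classify` and the weights, features, and threshold in the usage lines are made up for illustration.

```python
def linear_classify(features, weights, threshold):
    """Return 1 if the weighted sum of the features exceeds the threshold, else 0."""
    score = sum(w * x for w, x in zip(weights, features))
    return 1 if score > threshold else 0


# Hypothetical hand-designed features and weights.
weights = [0.5, 0.25]
print(linear_classify([1.0, 2.0], weights, threshold=0.9))  # weighted sum is 1.0
print(linear_classify([0.0, 1.0], weights, threshold=0.9))  # weighted sum is 0.25
```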
[0037] Deep learning architectures can learn hierarchical structures of features. For example, if presented with visual data, the first layer can learn to recognize relatively simple features in the input stream, such as edges. In another example, if presented with auditory data, the first layer can learn to recognize spectral power at specific frequencies. The second layer, taking the output of the first layer as input, can learn to recognize combinations of features, such as simple shapes in visual data or combinations of sounds in auditory data. For example, higher layers can learn to represent complex shapes in visual data or words in auditory data. Even higher layers can learn to recognize common visual objects or spoken phrases.
[0038] Deep learning architectures perform particularly well when applied to problems with a natural hierarchical structure. For example, the classification of motorized vehicles can benefit from first learning to identify features such as wheels, windshields, and others. These features can then be combined in different ways at higher levels to identify cars, trucks, and airplanes.
[0039] Neural networks can be designed to have multiple connectivity patterns. In feedforward networks, information is passed from lower layers to higher layers, where each neuron in a given layer communicates with neurons in higher layers. As described above, hierarchical representations can be built in successive layers of a feedforward network. Neural networks can also have recurrent or feedback (also known as top-down) connections. In recurrent connections, the output from a neuron in a given layer can be passed to another neuron in the same layer. Recurrent architectures can help identify patterns across more than one block of input data that is sequentially delivered to the neural network. Connections from neurons in a given layer to neurons in lower layers are called feedback (or top-down) connections. Networks with many feedback connections can be helpful when the recognition of higher-level concepts can aid in discerning specific lower-level features of the input.
[0040] The connections between layers in a neural network can be fully connected or locally connected. Figure 2A illustrates an example of a fully connected neural network 202. In the fully connected neural network 202, neurons in the first layer can transmit their outputs to each neuron in the second layer, such that each neuron in the second layer receives input from each neuron in the first layer. Figure 2B illustrates an example of a locally connected neural network 204. In the locally connected neural network 204, neurons in a first layer can connect to a limited number of neurons in a second layer. More generally, the locally connected layers of the locally connected neural network 204 can be configured such that each neuron in the layer has the same or a similar connectivity pattern, but with connection strengths that can have different values (e.g., 210, 212, 214, and 216). The connectivity pattern of locally connected layers can produce spatially distinct receptive fields in higher layers because neurons in higher layers in a given region can receive inputs that are tuned, through training, to the characteristics of a restricted portion of the total input to the network.
[0041] An example of a locally connected neural network is a convolutional neural network. Figure 2C illustrates an example of a convolutional neural network 206. The convolutional neural network 206 can be configured such that the connection strengths associated with the inputs for each neuron in the second layer are shared (e.g., 208). Convolutional neural networks may be well-suited for problems where the spatial location of the input is meaningful.
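The difference between locally connected and convolutional layers can be illustrated in one dimension. In this sketch, both function names and the example weights are hypothetical: a locally connected layer gives each output position its own weights, and a convolutional layer is the special case in which every position shares one kernel.

```python
def locally_connected_1d(inputs, weights_per_position):
    """Each output position i has its own private weights over a local window."""
    window = len(weights_per_position[0])
    n_out = len(inputs) - window + 1
    return [
        sum(w * inputs[i + j] for j, w in enumerate(weights_per_position[i]))
        for i in range(n_out)
    ]


def conv_1d(inputs, shared_kernel):
    """A convolutional layer: the same kernel is reused at every position."""
    n_out = len(inputs) - len(shared_kernel) + 1
    return locally_connected_1d(inputs, [shared_kernel] * n_out)


signal = [1, 2, 3, 4]
print(locally_connected_1d(signal, [[1, 0], [0, 1], [1, 1]]))  # distinct weights
print(conv_1d(signal, [1, -1]))                                # shared weights
```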
[0042] One type of convolutional neural network is the deep convolutional network (DCN). Figure 2D illustrates a detailed example of a DCN 200 designed to recognize visual features from an image 226 input from an image capture device 230, such as an in-vehicle camera. The DCN 200 in this example can be trained to identify traffic signs and the numbers provided on them. Of course, the DCN 200 can be trained for other tasks, such as identifying lane markings or traffic lights.
[0043] Supervised learning can be used to train the DCN 200. During training, an image (such as image 226 of a speed limit sign) can be presented to the DCN 200, and a forward pass can then be computed to produce an output 222. The DCN 200 may include a feature extraction part and a classification part. Upon receiving image 226, convolutional layer 232 may apply a convolutional kernel (not shown) to image 226 to generate a first set 218 of feature maps. As an example, the convolutional kernel used for convolutional layer 232 may be a 5x5 kernel that generates a 28x28 feature map. In this example, because four different feature maps are generated in the first set of feature maps 218, four different convolutional kernels are applied to image 226 at convolutional layer 232. A convolutional kernel may also be referred to as a filter or convolutional filter.
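The relationship between kernel size and feature-map size can be checked with a short sketch. The 32x32 input size is an assumption (the text gives only the 5x5 kernel and the 28x28 result), as are stride 1 and no padding. Both function names are hypothetical.

```python
def conv_output_size(input_size, kernel_size, stride=1, padding=0):
    """Spatial size of a convolution output along one dimension."""
    return (input_size + 2 * padding - kernel_size) // stride + 1


def conv2d_valid(image, kernel):
    """Slide the kernel over the image with stride 1 and no padding.

    As is conventional in deep learning, the kernel is not flipped
    (strictly speaking, this is cross-correlation).
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + di][j + dj] * kernel[di][dj]
                for di in range(kh) for dj in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


# Assumed 32x32 input: a 5x5 kernel then yields a 28x28 feature map.
print(conv_output_size(32, 5))
image = [[1] * 32 for _ in range(32)]
kernel = [[1] * 5 for _ in range(5)]
feature_map = conv2d_valid(image, kernel)
print(len(feature_map), len(feature_map[0]))
```

Applying four different kernels to the same image in this way would produce the four feature maps of the first set 218.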
[0044] The first set of feature maps 218 can be subsampled by a max-pooling layer (not shown) to generate a second set of feature maps 220. The max-pooling layer reduces the size of the first set of feature maps 218. That is, the size of the second set of feature maps 220 (e.g., 14×14) is smaller than the size of the first set of feature maps 218 (e.g., 28×28). The reduced size provides similar information to subsequent layers while reducing memory consumption. The second set of feature maps 220 can be further convolved via one or more subsequent convolutional layers (not shown) to generate one or more subsequent sets of feature maps (not shown).
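The size reduction from 28x28 to 14x14 is consistent with a 2x2 max-pooling window applied with stride 2. The window size and stride are assumptions, since the text states only the input and output sizes. The function `max_pool_2x2` is a hypothetical sketch:

```python
def max_pool_2x2(feature_map):
    """Keep the maximum of each non-overlapping 2x2 block."""
    rows, cols = len(feature_map), len(feature_map[0])
    return [
        [
            max(feature_map[i][j], feature_map[i][j + 1],
                feature_map[i + 1][j], feature_map[i + 1][j + 1])
            for j in range(0, cols - 1, 2)
        ]
        for i in range(0, rows - 1, 2)
    ]


# A 28x28 map filled with increasing values, for illustration only.
fmap = [[r * 28 + c for c in range(28)] for r in range(28)]
pooled = max_pool_2x2(fmap)
print(len(pooled), len(pooled[0]))
```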
[0045] In the example of Figure 2D, the second set of feature maps 220 is convolved to generate a first feature vector 224. Furthermore, the first feature vector 224 is further convolved to generate a second feature vector 228. Each feature of the second feature vector 228 may include a number corresponding to a possible feature of image 226, such as "sign", "60", and "100". A softmax function (not shown) converts the numbers in the second feature vector 228 into probabilities. Thus, the output 222 of DCN 200 can be the probability that image 226 includes one or more features.
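A softmax function of the kind referred to above can be sketched as follows. The labels are taken from the text, and the scores are invented for illustration.

```python
import math


def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    # Subtracting the maximum keeps exp() from overflowing
    # and does not change the result.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


labels = ["sign", "60", "100"]
scores = [2.0, 1.0, 0.1]  # hypothetical values in the second feature vector
for label, p in zip(labels, softmax(scores)):
    print(f"{label}: {p:.3f}")
```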
[0046] In this example, the probabilities for "sign" and "60" in output 222 are higher than the probabilities for other values in output 222 (such as "30", "40", "50", "70", "80", "90", and "100"). Before training, output 222 generated by DCN 200 may be incorrect. Therefore, the error between output 222 and the target output can be calculated. The target output is the ground truth of image 226 (e.g., "sign" and "60"). The weights of DCN 200 can then be adjusted so that output 222 of DCN 200 is more closely aligned with the target output.
[0047] To adjust the weights, the learning algorithm computes the gradient vector of the weights. The gradient indicates by how much the error will increase or decrease as the weights are adjusted. At the top layers, the gradient corresponds directly to the values of the weights connecting the activated neurons in the penultimate layer to the neurons in the output layer. In lower layers, the gradient depends on the values of the weights and the error gradient computed in the higher layers. The weights can then be adjusted to reduce the error. This method of adjusting weights is called "backpropagation" because it involves "passing backward" through the neural network.
[0048] In practice, the error gradient of the weights can be calculated using a small number of examples to make the calculated gradient approximate the true error gradient. This approximation method is called stochastic gradient descent. Stochastic gradient descent can be repeated until the achievable error rate of the entire system stops decreasing or until the error rate reaches a target level. After learning, a new image (e.g., a speed limit sign in image 226) can be presented to DCN 200, and output 222 can be generated through the forward pass of DCN 200. This output can be considered as an inference or prediction of DCN 200.
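Stochastic gradient descent can be illustrated on a deliberately tiny problem: fitting a single weight `w` so that `w * x` matches `y`. Everything in this sketch is an assumption chosen for illustration (the one-parameter model, the squared-error loss, the learning rate, and the batch size), and it is unrelated to the DCN 200.

```python
import random


def sgd_fit_slope(xs, ys, lr=0.1, batch_size=4, steps=200, seed=0):
    """Fit y ~ w * x by stochastic gradient descent on squared error."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    w = 0.0
    for _ in range(steps):
        # Estimate the gradient from a small random subset of the examples.
        batch = rng.sample(range(len(xs)), batch_size)
        grad = sum(2.0 * (w * xs[i] - ys[i]) * xs[i] for i in batch) / batch_size
        # Step against the gradient to reduce the error.
        w -= lr * grad
    return w


xs = [0.1 * i for i in range(1, 21)]
ys = [3.0 * x for x in xs]  # the true slope is 3
print(sgd_fit_slope(xs, ys))
```

Each step uses only four of the twenty examples, so the gradient is an approximation of the true error gradient, yet the estimate still converges toward the true slope.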
[0049] Deep Belief Networks (DBNs) are probabilistic models that include multiple layers of hidden nodes. DBNs can be used to extract hierarchical representations of training datasets. DBNs are obtained by stacking layers of Restricted Boltzmann Machines (RBMs). An RBM is a type of artificial neural network that learns a probability distribution from a set of inputs. Because RBMs can learn a probability distribution without information about the class each input should be classified into, they are often used for unsupervised learning. Using a hybrid paradigm of supervised and unsupervised learning, the bottom RBMs of a DBN can be trained in an unsupervised manner and used as feature extractors, while the top RBM can be trained in a supervised manner (on the joint distribution of inputs from the previous layer and the target classes) and used as a classifier.
[0050] DCNs are networks of convolutional networks configured with additional pooling and normalization layers. DCNs have achieved state-of-the-art performance on many tasks. DCNs can be trained using supervised learning, in which both the input and output targets are known for many exemplars and are used to modify the network's weights using gradient descent.
[0051] DCNs can be feedforward networks. Furthermore, as described above, connections from neurons in the first layer of a DCN to a set of neurons in the next higher layer are shared across neurons in the first layer. The feedforward and shared connections of a DCN can be used for fast processing. For example, the computational cost of a DCN may be much smaller than that of a similarly sized neural network that includes recurrent or feedback connections.
[0052] The processing at each layer of a convolutional network can be thought of as a spatially invariant template or basis projection. If the input is first decomposed into multiple channels, such as the red, green, and blue channels of a color image, then a convolutional network trained on that input can be thought of as three-dimensional, with two spatial dimensions along the image's axes and a third dimension capturing color information. The output of the convolutional connections can be thought of as forming a feature map in the next layer, where each element in the feature map (e.g., 220) receives input from a range of neurons in the previous layer (e.g., feature map 218) and from each of the multiple channels. The values in the feature map can be further processed with a non-linearity, such as rectification, max(0,x). Values from neighboring neurons can be further pooled, which corresponds to downsampling and provides additional local invariance and dimensionality reduction. Normalization, which corresponds to whitening, can also be applied through lateral inhibition between neurons in the feature map.
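As a non-limiting illustration of the rectification and pooling steps described above, the following Python sketch applies max(0, x) to a small feature map and then pools it with a non-overlapping 2 × 2 window. The function names, the window size, and the sample values are assumptions chosen for illustration and are not taken from this disclosure.

```python
def relu(feature_map):
    """Apply the rectification max(0, x) to every element of a 2-D feature map."""
    return [[max(0.0, x) for x in row] for row in feature_map]


def max_pool_2x2(feature_map):
    """Downsample by taking the maximum of each non-overlapping 2x2 window.

    Assumes an even number of rows and columns (an illustrative simplification).
    """
    pooled = []
    for r in range(0, len(feature_map), 2):
        row = []
        for c in range(0, len(feature_map[0]), 2):
            window = (
                feature_map[r][c], feature_map[r][c + 1],
                feature_map[r + 1][c], feature_map[r + 1][c + 1],
            )
            row.append(max(window))
        pooled.append(row)
    return pooled


# Hypothetical 4x4 feature map; pooling reduces it to 2x2.
feature_map = [
    [-1.0, 2.0, 0.5, -3.0],
    [4.0, -2.0, -0.5, 1.0],
    [0.0, -1.0, -2.0, -4.0],
    [3.0, 1.0, -1.0, -0.5],
]
pooled = max_pool_2x2(relu(feature_map))
```

Each pooled value summarizes a 2 × 2 neighborhood, which is the downsampling and local invariance referred to above.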
[0053] Figure 3 is a block diagram illustrating a DCN 350. The DCN 350 can include multiple layers of different types based on connectivity and weight sharing. As shown in Figure 3, the DCN 350 includes convolutional blocks 354A and 354B. Each of the convolutional blocks 354A and 354B can be configured with a convolutional layer (CONV) 356, a normalization layer (LNorm) 358, and a max pooling layer (MAX POOL) 360.
[0054] Although only two convolutional blocks 354A and 354B are shown, this disclosure is not limited thereto, and any number of convolutional blocks 354A and 354B may be included in the DCN 350 according to design preferences.
[0055] The convolutional layer 356 may include one or more convolutional filters that can be applied to the input data to generate feature maps. The normalization layer 358 may normalize the output of the convolutional filters. For example, the normalization layer 358 may provide whitening or lateral inhibition. The max pooling layer 360 may provide spatial downsampling aggregation to achieve local invariance and dimensionality reduction.
[0056] The parallel filter banks of a deep convolutional network can be loaded onto the CPU 102 or GPU 104 of an SOC 100 (e.g., as shown in Figure 1) to achieve high performance and low power consumption. In an alternative implementation, the parallel filter banks can be loaded onto the DSP 106 or ISP 116 of the SOC 100. In addition, the DCN 350 can access other processing blocks that may exist on the SOC 100, such as the sensor processor 114 and navigation module 120, which are dedicated to sensors and navigation, respectively.
[0057] The DCN 350 may also include one or more fully connected layers 362 (FC1 and FC2). The DCN 350 may also include a logistic regression (LR) layer 364. Weights (not shown) to be updated are located between each of the layers 356, 358, 360, 362, and 364 of the DCN 350. The output of each layer (e.g., 356, 358, 360, 362, and 364) can be used as input to the next layer in the DCN 350 to learn hierarchical feature representations from the input data 352 (e.g., images, audio, video, sensor data, and/or other input data) supplied at the first of the convolutional blocks 354A. The output of the DCN 350 is a classification score 366 of the input data 352. The classification score 366 may be a set of probabilities, where each probability is the probability that the input data includes a feature from a feature set.
[0058] Figure 4 is a block diagram illustrating an exemplary software architecture 400 that can modularize artificial intelligence (AI) functions. Using the architecture 400, applications can be designed that cause various processing blocks of an SOC 420 (e.g., a CPU 422, a DSP 424, a GPU 426, and/or an NPU 428), which may be similar to the SOC 100 of Figure 1, to support KV cache management for accelerating token generation for an AI application 402, according to various aspects of this disclosure. The architecture 400 may, for example, be included in a computing device such as a smartphone.
[0059] AI application 402 can be configured to invoke functions defined in user space 404, which may, for example, provide the detection and recognition of a scene indicating the current location of the computing device (including architecture 400). For example, AI application 402 may configure microphones and cameras differently depending on whether the recognized scene is an office, lecture hall, restaurant, or outdoor environment (such as a lake). AI application 402 may make requests to compiled program code associated with libraries defined in AI Function Application Programming Interface (API) 406. This request may ultimately rely on the output of a deep neural network configured to provide inferred responses based on, for example, video and location data.
[0060] A runtime engine 408 (which may be compiled code of a runtime framework) may be further accessible to the AI application 402. The AI application 402 may cause the runtime engine 408 to request an inference, for example, at specific time intervals or when triggered by an event detected by the user interface of the AI application 402. When caused to provide an inference response, the runtime engine 408 may in turn send a signal to an operating system in an operating system (OS) space 410, such as a kernel 412, running on the SOC 420. In some examples, the kernel 412 may be a LINUX kernel. The operating system may then manage the KV cache memory for the CPU 422, DSP 424, GPU 426, NPU 428, or some combination thereof. The CPU 422 may be directly accessible by the operating system, while other processing blocks may be accessed via drivers, such as drivers 414, 416, or 418 for the DSP 424, GPU 426, or NPU 428, respectively. In one example, the deep neural network may be configured to run on a combination of processing blocks, such as the CPU 422, DSP 424, and GPU 426, or on the NPU 428.
[0061] Figures 5 and 6 are diagrams illustrating examples of token generation models. Referring to Figure 5, a Bidirectional Encoder Representations from Transformers (BERT) key-value (KV) model 500 is illustrated. The BERT_KV model 500 provides a token generation pipeline for the BERT large language model (LLM) 512. Since some neural processors, such as the NPU 108, may only support static models with a fixed input shape, the BERT_KV model 500 can fix the input shape based on a maximum input length. In doing so, the BERT_KV model 500 can take the data from the user prompt (e.g., the valid data) and add padding (e.g., zeros) to the remaining portion of the input, such that the input size is fixed at the maximum length (e.g., 1024 characters).
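As a non-limiting illustration of fixing the input shape, the following Python sketch pads a variable-length prompt with zeros up to a fixed maximum length and builds a matching mask that marks the valid positions. The function name, the token identifiers, the choice of right-side padding, and the small maximum length are assumptions made for this sketch.

```python
def pad_to_fixed_length(token_ids, max_length, pad_id=0):
    """Return (padded_ids, attention_mask), both of length max_length.

    Valid tokens get mask value 1; padding positions get mask value 0.
    Right-side padding is an assumption; the text only says that the
    remaining portion of the input is padded.
    """
    if len(token_ids) > max_length:
        raise ValueError("prompt is longer than the fixed input length")
    pad_count = max_length - len(token_ids)
    padded = list(token_ids) + [pad_id] * pad_count
    mask = [1] * len(token_ids) + [0] * pad_count
    return padded, mask


prompt_ids = [101, 7592, 2088, 102]  # hypothetical token ids
padded, mask = pad_to_fixed_length(prompt_ids, max_length=8)
```

Because the padded input always has the same length, a static model with a fixed input shape can accept prompts of any length up to the maximum.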
[0062] The BERT_KV model 500 can receive input text 502. Input text 502 can include system prompts and user prompts. System prompts can be standard prompts, which may, for example, indicate a greeting to the user and / or instructions for operating the model (e.g., "Please enter a question"). User prompts can include user input, such as LLM tasks. User prompts can have variable lengths.
[0063] The input text 502 can be provided to a tokenizer 504. The tokenizer 504 can divide the input text 502 into multiple parts called tokens 506. The tokens 506 can include sub-parts of the input text 502, such as character sequences (e.g., the average length of a token can be about four characters), words, or phrases. In the BERT_KV model 500, the tokenizer 504 processes the input text 502, which can include, but is not limited to, words, sentences, paragraphs, or documents, and can generate all of the tokens 506 of the input text 502. The tokenizer 504 can then provide all of the tokens 506 to the BERT LLM 512 at once.
[0064] Position embedding 510 can be applied to maintain information related to the order of the tokens 506. Combined attention masks 508 can also be applied to identify the more salient tokens (e.g., of the tokens 506) corresponding to the input text 502. The BERT LLM 512 can then generate a prediction for an individual subsequent token 514. The generated token 514 can be considered a completion, such as a subsequent word in a response or output. The BERT LLM 512 can be configured to generate multiple tokens 514 in an autoregressive manner by writing the generated tokens 514 to memory (such as a KV tensor buffer 516) to preserve an internal state KV$ 518 (e.g., a data structure, called a KV cache, that may represent the keys (K) and values (V) of previously generated tokens). The internal state KV$ 518 can be read from the KV tensor buffer 516, which can append current KV cache data to past cache data. A run ID is a unique identifier assigned to each execution of the BERT_KV model 500, and it is used to update the attention mask and positional embedding 520. The attention mask and positional embedding 520 can be updated and fed as input to the internal state KV$ 518. As the BERT LLM 512 generates each token 514, the generated token 514 can be processed by the inverse tokenizer 524, which reverses the operation of the tokenizer 504, to generate output text 526 (e.g., a sequence of characters, a word, or a phrase). This process implements the KV$ model and repeats in this manner. During this process, the KV$ model can produce the generated token 522 as model output. With each iteration, the generated token 522 can update the internal state KV$ 518, which can be written to the KV tensor buffer 516 and then loaded as subsequent input.
[0065] Figure 6 shows a KV model 600 for token generation. The KV model 600 includes components similar to those of the BERT_KV model 500, which can perform functions similar to those described with reference to Figure 5. However, in the KV model 600 of Figure 6, the tokens 606 are fed to the BERT LLM 512 one token at a time, instead of all at once. In the KV model 600, there is one input, namely, a token 606. The BERT LLM 512 runs a first inference loop that processes the input tokens one by one until all of the text input has been processed, as shown by the last generated token 614, which can be detokenized to generate an output 626. The KV tensor buffer 516 can be updated based on the latest and previously generated KV cache data. The first inference loop can be repeated until all of the tokens 606 corresponding to the input have been processed by the BERT LLM 512. A second loop, which can begin after the first loop, performs autoregressive inference. The second loop generates a token 622 at each step of the second inference 618, which can be detokenized by the inverse tokenizer 624 and output as the output 626. The KV tensor buffer 516 can then be updated based on the latest and previously generated KV cache data. Throughout the process, the run IDs used for both the BERT LLM 512 and the second inference 618 can be used to update the attention mask and the positional embedding 520.
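As a non-limiting illustration of the two loops described above, the following Python sketch first consumes a prompt one token at a time while filling a KV cache, and then generates tokens autoregressively by feeding each output back in as the next input. The function toy_model is a hypothetical stand-in for the LLM, and its arithmetic, the EOS identifier, and the list-based cache are assumptions made only so that the control flow can be run.

```python
EOS = 0  # hypothetical end-of-sequence token id


def toy_model(token, kv_cache):
    """Stand-in for the LLM: return (next_token, kv_entry) for one input token.

    The arithmetic is arbitrary; it only makes the output depend on the
    current token and on the cached history, as a real model's output would.
    """
    kv_entry = token * 2                      # pretend key/value for this token
    next_token = (token + sum(kv_cache)) % 7  # pretend prediction
    return next_token, kv_entry


def generate(prompt_tokens, max_new_tokens):
    if not prompt_tokens:
        raise ValueError("prompt must contain at least one token")
    kv_cache = []
    next_token = None
    # First loop: consume the prompt one token at a time, filling the KV cache.
    for token in prompt_tokens:
        next_token, kv_entry = toy_model(token, kv_cache)
        kv_cache.append(kv_entry)
    # Second loop: autoregressive generation, feeding each output back in.
    generated = []
    while next_token != EOS and len(generated) < max_new_tokens:
        generated.append(next_token)
        next_token, kv_entry = toy_model(next_token, kv_cache)
        kv_cache.append(kv_entry)
    return generated, kv_cache


generated, kv_cache = generate([3, 5], max_new_tokens=4)
```

In this sketch the cache grows by one entry per processed token, so after a two-token prompt and four generated tokens it holds six entries.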
[0066] Figure 7 is a block diagram illustrating a KV cache management pipeline 700 for natural language processing (NLP). The pipeline 700 can be processed by a CPU (such as the CPU 102 of Figure 1) and can be implemented by various NLP models (such as the BERT_KV model 500 of Figure 5). In Figure 7, the pipeline 700 begins when the neural network receives a prompt 702. The prompt 702 may include one or more tokens. A token may include a sub-part of the prompt, such as a sequence of characters (e.g., the average length of a token may be about four characters), a word, or a phrase. The prompt 702 illustrated in Figure 7 has a length of n, where n indicates the number of prompt tokens in the prompt 702. Additionally, the prompt 702 may be padded with one or more padding segments. "Padding" refers to the technique of adding extra bytes or bits to align data elements or structures with specific memory boundaries. Memory segments are padded for optimization purposes, especially in architectures where memory access is more efficient when data is aligned on certain byte boundaries.
[0067] Upon receiving the prompt 702, the AI model (such as the BERT model) performs a first inference. As a result of the first inference, the AI model outputs a first generated token 706. The AI model also outputs KV tensors to a KV cache buffer 704, which is managed as an application-level component by a KV manager. In the example illustrated in Figure 7, the computed KV tensors are mapped to the AI model output and stored by the KV manager in the KV cache buffer 704. After the first inference, the KV cache buffer 704 contains n KV tensors, where n indicates the number of prompt tokens in the prompt 702. The KV cache buffer 704 may also be padded with one or more padding segments such that the combination of the KV tensors and the padding segments equals a predefined maximum context length. The maximum context length may be predefined based on, for example, the architecture of the memory hosting the KV cache buffer 704.
[0068] After the first inference, the AI model performs a second inference. The second inference, and all subsequent inferences, can be performed by the KV cache model instead of the BERT model. During the second inference, the AI model takes the KV cache buffer 704 and the first generated token 706 as input and outputs a first KV tensor 708 and a second generated token 710. The first KV tensor 708 is concatenated to the KV cache buffer 704, which requires the CPU to shift the contents of the KV cache buffer 704 by one memory segment. To accommodate the additional memory segment occupied by the first KV tensor 708, one padding memory segment is removed from the KV cache buffer 704. The CPU can also concatenate the second generated token 710 to the first generated token 706.
[0069] Following the second inference, the AI model performs a third inference. During the third inference, the AI model takes the KV cache buffer 704 and the second generated token 710 as input and outputs a second KV tensor 712 and a third generated token 714. The second KV tensor 712 is concatenated to the KV cache buffer 704, which again requires the CPU to shift the contents of the KV cache buffer 704 by one memory segment. To accommodate the additional memory segment occupied by the second KV tensor 712, another padding memory segment is removed from the KV cache buffer 704. The CPU may also concatenate the third generated token 714 to the first generated token 706 and the second generated token 710.
[0070] The AI model can perform additional inferences not illustrated in Figure 7. After each inference, the AI model concatenates a KV tensor to the KV cache buffer 704 and generates a new token. The AI model may stop after detecting that an end-of-sentence (EOS) token 838 was generated in the current step, or once the KV cache buffer 704 reaches its maximum context length.
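As a non-limiting illustration of the per-inference cost in the pipeline 700, the following Python sketch shifts every segment of a fixed-size buffer by one position and writes the new KV tensor into the freed slot, consuming one padding segment each time. The left-side placement of the padding, the buffer length, and the string labels standing in for KV tensors are assumptions made for this sketch.

```python
PAD = None  # marker for an unused (padding) segment


def cpu_shift_append(kv_buffer, new_kv):
    """Shift every segment one slot toward index 0 and store new_kv in the last slot.

    Every element of the buffer is touched on each call, which models the
    per-inference shift performed by the CPU in the conventional pipeline.
    Left-side padding is an assumption made for this sketch.
    """
    if kv_buffer[0] is not PAD:
        raise OverflowError("KV cache buffer reached its maximum context length")
    for i in range(len(kv_buffer) - 1):
        kv_buffer[i] = kv_buffer[i + 1]  # one element copy per slot
    kv_buffer[-1] = new_kv


# After the first inference: n = 3 prompt KV tensors plus 3 padding segments.
kv_buffer = [PAD, PAD, PAD, "kv_p1", "kv_p2", "kv_p3"]
cpu_shift_append(kv_buffer, "kv_708")  # second inference
cpu_shift_append(kv_buffer, "kv_712")  # third inference
```

The amount of data moved per inference grows with the maximum context length rather than with the size of the new KV tensor, which is the overhead addressed by the pipeline 800.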
[0071] Figure 8 is a network diagram illustrating an improved KV cache management pipeline 800 according to various aspects of this disclosure. The pipeline 800 is similar to the pipeline 700 in that, as illustrated, both pipelines implement a BERT model for the first inference and a KV cache model for any remaining inferences. Unlike the pipeline 700, the pipeline 800 implements an improved method for managing the cache memory.
[0072] The pipeline 800 implements an LLM. Due to its large size, the LLM is divided into four partitions 802, 804, 806, and 808. Each partition is shown vertically in the example of Figure 8. The LLM also processes multiple inferences, including the first inference, second inference, third inference, fourth inference, and so on illustrated in the example of Figure 8. Each partition implements two buffers, namely one of the first KV cache buffers 810, 812, 814, 816 and one of the second KV cache buffers 820, 822, 824, 826, with each buffer having the same size. Each buffer can be stored in system memory, such as double data rate (DDR) memory. A respective one of the first buffers 810, 812, 814, 816 and a respective one of the second buffers 820, 822, 824, 826 are assigned to each of the partitions 802, 804, 806, and 808. The second inference and each subsequent even-numbered inference receive input from the first KV cache buffers 810, 812, 814, 816 and output to the second KV cache buffers 820, 822, 824, 826. The third inference and each subsequent odd-numbered inference receive input from the second KV cache buffers 820, 822, 824, 826 and output to the first KV cache buffers 810, 812, 814, 816. As illustrated in Figure 8, the first inference may not receive input from a KV cache buffer, but may still output to the first KV cache buffers 810, 812, 814, 816.
[0073] The pipeline 800 begins when the LLM receives a user prompt 830. After receiving the user prompt 830, the processor splits the BERT model into the partitions 802, 804, 806, and 808. Hidden states 832 are passed between the partitions 802, 804, 806, and 808 during each inference. Each inference outputs to the logits operator 834, whose output, after post-processing, is received as the input token 836 for the subsequent inference. If the logit output from the current inference step is, after post-processing, an EOS token 838, then the EOS token 838 can be treated as a completion signal to stop inference.
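As a non-limiting illustration of the post-processing and stop condition described above, the following Python sketch converts each inference's logits into a token and stops when that token is the EOS token. This disclosure does not specify the post-processing, so greedy selection of the largest logit is an assumption, as are the function names, the EOS identifier, and the sample logits.

```python
EOS_TOKEN_ID = 2  # hypothetical id for the EOS token


def post_process(logits):
    """Greedy post-processing (an assumption): return the index of the largest logit."""
    return max(range(len(logits)), key=lambda i: logits[i])


def run_until_eos(logits_per_inference):
    """Collect tokens from successive inferences, stopping at the EOS token."""
    tokens = []
    for logits in logits_per_inference:
        token = post_process(logits)
        if token == EOS_TOKEN_ID:
            break  # EOS acts as the completion signal
        tokens.append(token)  # otherwise the token feeds the next inference
    return tokens


outputs = run_until_eos([
    [0.1, 2.0, 0.3, 0.0],  # largest logit at index 1
    [0.2, 0.1, 0.0, 1.5],  # largest logit at index 3
    [0.0, 0.4, 3.2, 0.1],  # largest logit at index 2, which is EOS
    [9.0, 0.0, 0.0, 0.0],  # never evaluated
])
```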
[0074] As discussed, the first KV cache buffers 810, 812, 814, 816 and the second KV cache buffers 820, 822, 824, 826 have the same size n. The size n can be based on the size of the input prompt 830. After each inference, the LLM performs a ping-pong switch between the first KV cache buffers 810, 812, 814, 816 and the second KV cache buffers 820, 822, 824, 826. That is, in one inference the first KV cache buffers 810, 812, 814, 816 are treated as the inputs and the second KV cache buffers 820, 822, 824, 826 store the outputs. In the next inference, the first KV cache buffers 810, 812, 814, 816 store the outputs and the second KV cache buffers 820, 822, 824, 826 are treated as the inputs. In this way, the pipeline 800 may not require the CPU to shift and copy memory segments as frequently as in the pipeline 700 of Figure 7.
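As a non-limiting illustration of the ping-pong schedule, the following Python sketch returns which buffer serves as input and which as output for a given inference. The function name and the string labels are assumptions made for this sketch; the schedule itself follows the even and odd assignment described for Figure 8.

```python
def buffer_roles(inference_number):
    """Return (input_buffer, output_buffer) labels for a 1-based inference number."""
    if inference_number < 1:
        raise ValueError("inference numbers start at 1")
    if inference_number == 1:
        return None, "first"      # the first inference reads no KV cache buffer
    if inference_number % 2 == 0:
        return "first", "second"  # even-numbered inferences
    return "second", "first"      # odd-numbered inferences from the third onward


schedule = [buffer_roles(i) for i in range(1, 6)]
```

The output buffer of one inference is always the input buffer of the next, so no cache data needs to be moved between inferences.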
[0075] Figure 9A is a flowchart illustrating a process for updating the KV cache according to various aspects of this disclosure. Figure 9B is a block diagram illustrating the KV cache memory during an update according to various aspects of this disclosure.
[0076] In Figures 9A and 9B, a compute engine other than the CPU can perform the operations on the cache buffers. In some aspects, the compute engine can be part of a device (such as the SOC 100). In other aspects, the compute engine can be a neural processing unit (NPU), a neural signal processor (NSP), a matrix processing unit (MPU), or a Hexagon tensor processor (HTP). For example, an NPU can perform the slicing and concatenation operations described with reference to Figure 9A.
[0077] In Figure 9A, the compute engine receives, as input, the past cache input 902 and the new cache output 904 from an inference. Although the new cache output 904 illustrated in Figure 9A is 1 × 128 tokens in size, in some aspects it can be larger or smaller. For example, it could be 8 × 128 tokens. At box 906, the compute engine concatenates the new cache output 904 to the past cache input 902. After concatenating the new cache output 904 to the past cache input 902, the compute engine can store the resulting information in a temporary cache buffer 908. If the compute engine is an NPU, the NPU can concatenate the new cache output 904 to the past cache input 902 using the same concatenation operation that the NPU uses in matrix multiplication operations within a neural network. The NPU can also use accelerator instructions (e.g., Hexagon Vector Extensions (HVX) copy instructions) to copy the cache.
[0078] After the compute engine concatenates the new cache output 904 with the past cache input 902, at box 910, the compute engine slices a portion of memory from the past cache. The portion of memory sliced from the past cache input 902 can be equal in size to the new cache output 904. For example, if the past cache input size is 1023 × 128 tokens and the new cache output size is 1 × 128 tokens, the compute engine can concatenate the past cache input 902 and the new cache output 904 to create a temporary cache buffer 908 of 1024 × 128 tokens. The compute engine can then slice 1 × 128 tokens from the past cache input 902, resulting in a final output cache buffer 912 of 1023 × 128 tokens.
[0079] To perform the slicing operation at box 910, the compute engine can reuse the concatenation operation's output buffer and copy the temporary cache information from the temporary cache buffer 908, starting at an index equal to the starting index of the temporary cache buffer 908 plus the size of the new cache output 904. For example, if the new cache output is 1 × 128 tokens, and the temporary cache buffer 908 starts at index 0 and ends at index 1023, the compute engine can copy from the temporary cache buffer 908 starting at index 1 and continue copying temporary cache data until index 1023. In another example not illustrated in Figure 9A, the new cache output 904 is 8 × 128 tokens in size, and the past cache input is 1016 × 128 tokens in size. The temporary cache buffer 908 starts at index 0 and ends at index 1023. The compute engine can copy from the temporary cache buffer 908 starting at index 8 and continue copying temporary cache data until index 1023. Additionally, the compute engine can copy the temporary cache buffer 908 to the new output cache buffer 912. When the first KV cache buffers 810, 812, 814, 816 or the second KV cache buffers 820, 822, 824, 826 of Figure 8 are treated as the KV cache buffer outputs, the same cache update can be performed on the compute engine for either the first or the second KV cache buffers. By slicing the cached data, the compute engine performs, more efficiently, a cache update similar to the shift operation performed by the CPU in Figure 7.
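As a non-limiting illustration of the concatenation at box 906 and the slicing at box 910, the following Python sketch builds the temporary cache and then copies it from an offset equal to the size of the new cache output. Python lists stand in for the cache memory, each row stands for one token's 128-wide KV entry, and the function name and fill values are assumptions made for this sketch; an actual compute engine would perform the copy with its own instructions.

```python
def update_cache(past_cache, new_cache):
    """Concatenate new_cache onto past_cache, then drop the oldest rows.

    Each row stands for one token's KV entry.
    Returns (temporary_cache, output_cache).
    """
    temporary_cache = past_cache + new_cache  # concatenation (box 906)
    start = len(new_cache)                    # copy offset equals the new cache size
    output_cache = temporary_cache[start:]    # slicing (box 910)
    return temporary_cache, output_cache


WIDTH = 128
past = [[float(i)] * WIDTH for i in range(1023)]  # 1023 x 128 past cache input
new = [[-1.0] * WIDTH]                            # 1 x 128 new cache output
temporary, output = update_cache(past, new)
```

With these sizes the temporary cache has 1024 rows and the output cache has 1023 rows: the oldest row is excluded from the copy and the newest row sits at the end.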
[0080] The compute engine creates a sliding window by manipulating the start and end indices of the cache buffer when copying cached data. The sliding window fetches cached data along an axis in preparation for the next inference. Although the start and end indices of the cache can change dynamically, the size of the sliding window remains constant.
[0081] Figure 9B illustrates various aspects of the sliding window. In Figure 9B, the past cache input 902 includes a padding segment 920 on the left and a tensor 925 on the right. During the operation illustrated in Figure 9A, the compute engine appends the new tensor (T3) 905 to the past cache input 902, thereby creating the temporary cache buffer 908 (also referred to as the temporarily updated cache). The compute engine can then copy the temporary cache buffer 908 to the output cache buffer 912 during the slicing operation. As discussed, the copy operation begins at an index equal to the starting index of the temporary cache buffer 908 plus the size of the newly appended cache output 904. The copy operation can be completed at the end of the maximum context length of the temporary cache buffer 908.
[0082] In some aspects, the temporary cache buffer 908 has a maximum context length. If the temporary cache buffer 908 has a maximum context length of 1024 memory addresses, spanning indices 0 to 1023, then the copy operation can start at index n (where n is the size of the newly concatenated cache output 904) and end at index 1023. In one example, illustrated in Figure 9A, the maximum context length of the temporarily updated cache is 1024; the starting index of the temporarily updated cache is 0, and the ending index is 1023. The new cache output size is 1 × 128 tokens. Therefore, the copy operation starts at index 1 and ends at the maximum context length minus 1 (e.g., 1023). In a second example, the maximum context length of the temporary cache buffer 908 is 1024; the starting index of the temporarily updated cache is 0, and the ending index is 1023. The new cache output size is 8 × 128 tokens, while the past cache input 902 is 1016 × 128 tokens. Therefore, the copy operation starts at index 8 and ends at the maximum context length minus 1 (e.g., 1023).
[0083] In some aspects, the copy operation can truncate both ends of the temporarily updated cache. One reason for truncating both ends of the temporarily updated cache is that, if the new cache output includes both valid and invalid information, only the valid information is expected to be used for inference operations. To truncate the invalid information, the compute engine can record a maximum new cache size m, which indicates the predefined maximum size of the new cache output. In a copy operation that truncates both ends of the temporarily updated cache, the starting index is n and the ending index is the maximum context length minus m, plus n, minus 1. In one example, the maximum context length of the temporarily updated cache is 1024; the starting index of the temporarily updated cache is 0, and the ending index is 1023. The maximum new cache size is 8, and the actual size of the new valid cache output is 5. The past cache input 902 is 1016 × 128. Here, the copy operation starts at index 5 and ends at index 1020.
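As a non-limiting illustration of the index arithmetic for the copy operation, the following Python sketch computes the inclusive start and end indices from the maximum context length, the valid new cache size n, and the maximum new cache size m. The function name and the convention that omitting m means the whole new cache output is valid are assumptions made for this sketch.

```python
def copy_window(max_context_length, n, m=None):
    """Return the inclusive (start, end) indices of the copy from the temporary cache.

    n: actual (valid) size of the new cache output.
    m: predefined maximum size of the new cache output. When m is None, the
       new cache output is assumed to be fully valid (m = n), so only the
       front of the temporary cache is truncated.
    """
    if m is None:
        m = n
    if not 0 <= n <= m <= max_context_length:
        raise ValueError("expected 0 <= n <= m <= max_context_length")
    start = n
    end = max_context_length - m + n - 1
    return start, end


single_token = copy_window(1024, 1)       # new cache output of size 1
eight_tokens = copy_window(1024, 8)       # new cache output of size 8
truncated = copy_window(1024, 5, m=8)     # 5 valid entries out of a maximum of 8
```

These calls reproduce the three worked examples: (1, 1023), (8, 1023), and (5, 1020). In each case the window holds the maximum context length minus m entries, so its size does not depend on how many of the new entries are valid.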
[0084] Figure 10 is a table illustrating a comparison between a conventional KV cache management pipeline and an improved KV cache management pipeline according to various aspects of this disclosure. In the comparison illustrated in Figure 10, a CPU implementing a conventional KV cache management pipeline (such as the pipeline 700) performs all cache input and output operations. In contrast, for a CPU implementing the solution illustrated in Figures 8, 9A, and 9B, the only cache input and output operation performed by the CPU is pointer switching. A separate compute engine (such as an NPU or DSP) can perform the concatenation and slicing operations.
[0085] To facilitate switching between the first and second KV cache buffers, the processor can switch the input buffer pointer from the first KV cache buffer to the second KV cache buffer before computing the third inference, and switch the output buffer pointer from the second KV cache buffer to the first KV cache buffer. To switch a pointer, the processor can reassign the memory address stored in the pointer variable so that it points to a different location in memory. Pointer switching operations can be implemented in program code by using assignment statements, such as "setMemHandle()", to assign the address of another variable or memory location to the pointer variable. By changing the value held by the pointer, the processor redirects the pointer's reference, allowing the processor to access and manipulate different memory locations or objects. An example of program code for performing pointer switching is shown in the second (CPU) column of the "New Hybrid Solutions" row of Figure 10.
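As a non-limiting illustration of pointer switching, the following Python sketch swaps the roles of two buffers by reassigning references rather than copying data. Python object references stand in for pointer variables; the class name, method name, and buffer sizes are assumptions made for this sketch, and the sketch does not reproduce the signature of "setMemHandle()", which is not given in this disclosure.

```python
class KVBufferHandles:
    """Holds references to the two KV cache buffers and their current roles."""

    def __init__(self, first_buffer, second_buffer):
        self.input_buffer = first_buffer
        self.output_buffer = second_buffer

    def switch(self):
        """Swap the roles by reassigning references; no cache data is copied."""
        self.input_buffer, self.output_buffer = self.output_buffer, self.input_buffer


first = bytearray(16)   # stand-ins for two same-sized buffers in system memory
second = bytearray(16)
handles = KVBufferHandles(first, second)  # roles for the second inference
handles.switch()                          # roles for the third inference
```

The cost of the switch is independent of the buffer size, which is why it is the only cache input and output operation left on the CPU in the improved pipeline.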
[0086] Figure 11 is a flowchart illustrating a processor-implemented method 1100 for storing KV cache information according to various aspects of this disclosure. For example, the processor-implemented method 1100 may be executed by one or more processors, such as CPUs (e.g., 102, 422), GPUs (e.g., 104, 426), and/or other processing units (e.g., DSP 424, NPU 428, NSP, and/or MPU).
[0087] In some aspects, the processor-implemented method 1100 may include: allocating a first KV cache buffer and a second KV cache buffer from system memory, the first KV cache buffer and the second KV cache buffer having the same size (box 1102). The first KV cache buffer and the second KV cache buffer may be stored in, for example, double data rate (DDR) memory. In some aspects, the processor-implemented method 1100 may also include: utilizing a first neural network to compute a first inference (box 1104). The first neural network may be, for example, a BERT network.
[0088] In some aspects, the processor-implemented method 1100 may include: outputting a first set of new KV cache data computed during the first inference to a first KV cache buffer (box 1106). For example, the processor may perform BERT inference and output one or more resulting tokens to the first KV cache buffer. In some aspects, the processor-implemented method 1100 may also include: utilizing a second neural network to compute a second inference, the second inference occurring after the first inference, the second neural network receiving input from the first KV cache buffer for the second inference (box 1108). For example, a KV cache model may use data from the first KV cache buffer as input to perform inference.
[0089] In some aspects, the processor-implemented method 1100 may further include outputting a second set of new KV cache data computed during the second inference to the second KV cache buffer (box 1110). For example, a KV cache model may output the inferred data to a second KV cache buffer reserved in DDR memory. Additionally, the processor may first concatenate the second set of new KV cache data generated by the second inference with past KV cache data to create a temporarily updated KV cache, and then slice the temporarily updated KV cache to shift it in a first-in-first-out (FIFO) manner. The concatenation and slicing operations may be performed outside the CPU. For example, an NPU or DSP may implement the concatenation and slicing operations (using vector copy instructions to implement the slicing). As used herein, the phrase "shifting the temporarily updated KV cache in a FIFO manner" generally means that older memory portions within the temporarily updated KV cache are removed from the temporarily updated KV cache, or are excluded from the copy operation when the temporarily updated KV cache is copied.
[0090] In some aspects, the processor-implemented method 1100 may further include: utilizing a second neural network to compute a third inference, the third inference occurring after the second inference, the second neural network receiving input from a second KV cache buffer for the third inference (box 1112). To facilitate switching between the first and second KV cache buffers, the processor may switch the input buffer pointer from the first KV cache buffer to the second KV cache buffer before computing the third inference, and switch the output buffer pointer from the second KV cache buffer to the first KV cache buffer.
[0091] In some aspects, the processor-implemented method 1100 may further include outputting a third set of cache data computed during the third inference to the first KV cache buffer (box 1114). At this point, two KV cache buffers have been allocated in the processor-implemented method 1100. The device implementing the processor-implemented method 1100 may continue to switch between the first KV cache buffer and the second KV cache buffer for input and output.
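As a non-limiting illustration of how boxes 1102 through 1114 fit together, the following Python sketch allocates two same-sized buffers, writes the first inference's output to the first buffer, and then alternates input and output roles for the later inferences. The two toy functions are hypothetical stand-ins for the first and second neural networks, and their arithmetic, the left-side padding, and the small context length are assumptions made so that the flow can be run.

```python
PAD = 0


def toy_first_inference(prompt):
    """Stand-in for the first neural network: one KV entry per prompt token."""
    return [t * 10 for t in prompt]


def toy_kv_inference(input_cache, token):
    """Stand-in for the second neural network: a new KV entry and the next token."""
    return [token * 10], token + 1


def run(prompt, num_inferences, max_context=8):
    if not 0 < len(prompt) <= max_context:
        raise ValueError("prompt must fit in the KV cache buffer")
    # Box 1102: allocate two KV cache buffers of the same size.
    first = [PAD] * max_context
    second = [PAD] * max_context
    # Boxes 1104 and 1106: the first inference outputs to the first buffer.
    kv = toy_first_inference(prompt)
    first[max_context - len(kv):] = kv
    input_buf, output_buf = first, second
    token = prompt[-1] + 1
    for _ in range(num_inferences - 1):
        # Boxes 1108 and 1112: the KV cache model reads the current input buffer.
        new_kv, next_token = toy_kv_inference(input_buf, token)
        # Concatenate, then slice off the oldest entries (FIFO shift).
        temporary = input_buf + new_kv
        # Boxes 1110 and 1114: store the update in the current output buffer.
        output_buf[:] = temporary[len(new_kv):]
        # Switch roles for the next inference; only references change.
        input_buf, output_buf = output_buf, input_buf
        token = next_token
    return first, second


first, second = run([1, 2, 3], num_inferences=3)
```

After three inferences the second buffer holds the result of the second inference and the first buffer holds the result of the third inference, matching the alternation described above.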
[0092] Example
[0093] Aspect 1: A processor-implemented method comprising: allocating a first key-value (KV) cache buffer and a second KV cache buffer from system memory, the first KV cache buffer and the second KV cache buffer having the same size; computing a first inference using a first neural network; outputting a first set of new KV cache data computed during the first inference to the first KV cache buffer; computing a second inference using a second neural network, the second inference occurring after the first inference, the second neural network receiving input from the first KV cache buffer for the second inference; outputting a second set of new KV cache data computed during the second inference to the second KV cache buffer; computing a third inference using the second neural network, the third inference occurring after the second inference, the second neural network receiving input from the second KV cache buffer for the third inference; and outputting a third set of cache data computed during the third inference to the first KV cache buffer.
[0094] Aspect 2: The processor-implemented method according to aspect 1, further comprising, before outputting the second set of new KV cache data: concatenating the second set of new KV cache data generated by the second inference with past KV cache data to create a temporarily updated KV cache; and slicing the temporarily updated KV cache to shift the temporarily updated KV cache in a first-in-first-out (FIFO) manner.
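A minimal sketch of the concatenate-then-slice update in this aspect, assuming the cache is a Python list of entries with a fixed capacity. The function name `update_kv_cache` and the `capacity` parameter are hypothetical and are used only for illustration.

```python
def update_kv_cache(past, new, capacity):
    """Concatenate `new` onto `past`, then keep only the newest `capacity` entries."""
    temp = past + new                      # temporarily updated KV cache
    start = max(0, len(temp) - capacity)   # number of oldest entries to drop
    return temp[start:]                    # FIFO shift: oldest entries leave first


print(update_kv_cache([1, 2, 3, 4], [5], capacity=4))  # [2, 3, 4, 5]
print(update_kv_cache([1, 2], [3], capacity=4))        # [1, 2, 3]
```

The temporary list is one entry longer than the buffer whenever the cache is already full, so the slice restores the fixed size that lets the two KV cache buffers stay equally sized.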
[0095] Aspect 3: The processor-implemented method according to aspect 1 or 2, wherein the concatenating and the slicing are performed outside a central processing unit (CPU).
[0096] Aspect 4: The processor-implemented method according to any of the foregoing aspects, further comprising implementing the slicing using vector copy instructions.
[0097] Aspect 5: The processor-implemented method according to any of the foregoing aspects, further comprising: switching an input buffer pointer from the first KV cache buffer to the second KV cache buffer before computing the third inference, and switching an output buffer pointer from the second KV cache buffer to the first KV cache buffer.
[0098] Aspect 6: An apparatus comprising: at least one memory; and at least one processor coupled to the at least one memory, the at least one processor being configured to: allocate a first key-value (KV) cache buffer and a second KV cache buffer from system memory, the first KV cache buffer and the second KV cache buffer having the same size; compute a first inference using a first neural network; output a first set of new KV cache data computed during the first inference to the first KV cache buffer; compute a second inference using a second neural network, the second inference occurring after the first inference, the second neural network receiving input from the first KV cache buffer for the second inference; output a second set of new KV cache data computed during the second inference to the second KV cache buffer; compute a third inference using the second neural network, the third inference occurring after the second inference, the second neural network receiving input from the second KV cache buffer for the third inference; and output a third set of cache data computed during the third inference to the first KV cache buffer.
[0099] Aspect 7: The apparatus according to aspect 6, wherein the at least one processor is further configured to, before outputting the second set of new KV cache data: concatenate the second set of new KV cache data generated by the second inference with past KV cache data to create a temporarily updated KV cache; and slice the temporarily updated KV cache to shift the temporarily updated KV cache in a first-in-first-out (FIFO) manner.
[0100] Aspect 8: The apparatus according to aspect 6 or 7, wherein the at least one processor is further configured to perform the concatenating and the slicing outside a central processing unit (CPU).
[0101] Aspect 9: The apparatus according to any one of aspects 6 to 8, wherein the at least one processor is further configured to implement the slicing using vector copy instructions.
[0102] Aspect 10: The apparatus according to any one of aspects 6 to 9, wherein the at least one processor is further configured to: switch an input buffer pointer from the first KV cache buffer to the second KV cache buffer before computing the third inference, and switch an output buffer pointer from the second KV cache buffer to the first KV cache buffer.
[0103] Aspect 11: A non-transitory computer-readable medium having program code recorded thereon, the program code being executed by a processor and comprising: program code for allocating from system memory a first key-value (KV) cache buffer and a second KV cache buffer, the first KV cache buffer and the second KV cache buffer having the same size; program code for computing a first inference using a first neural network; program code for outputting a first set of new KV cache data computed during the first inference to the first KV cache buffer; program code for computing a second inference using a second neural network, the second inference occurring after the first inference, the second neural network receiving input from the first KV cache buffer for the second inference; program code for outputting a second set of new KV cache data computed during the second inference to the second KV cache buffer; program code for computing a third inference using the second neural network, the third inference occurring after the second inference, the second neural network receiving input from the second KV cache buffer for the third inference; and program code for outputting a third set of cache data computed during the third inference to the first KV cache buffer.
[0104] Aspect 12: The non-transitory computer-readable medium according to aspect 11, wherein the program code includes: program code for concatenating the second set of new KV cache data generated by the second inference with past KV cache data to create a temporarily updated KV cache; and program code for slicing the temporarily updated KV cache to shift the temporarily updated KV cache in a first-in-first-out (FIFO) manner.
[0105] Aspect 13: The non-transitory computer-readable medium according to aspect 11 or 12, wherein the program code includes program code for performing the concatenating and the slicing outside a central processing unit (CPU).
[0106] Aspect 14: The non-transitory computer-readable medium according to any one of aspects 11 to 13, wherein the program code includes program code for implementing the slicing using vector copy instructions.
[0107] Aspect 15: The non-transitory computer-readable medium according to any one of aspects 11 to 14, wherein the program code includes program code for: switching an input buffer pointer from the first KV cache buffer to the second KV cache buffer before computing the third inference, and switching an output buffer pointer from the second KV cache buffer to the first KV cache buffer.
[0108] Aspect 16: An apparatus comprising: means for allocating from system memory a first key-value (KV) cache buffer and a second KV cache buffer, the first KV cache buffer and the second KV cache buffer having the same size; means for computing a first inference using a first neural network; means for outputting a first set of new KV cache data computed during the first inference to the first KV cache buffer; means for computing a second inference using a second neural network, the second inference occurring after the first inference, the second neural network receiving input from the first KV cache buffer for the second inference; means for outputting a second set of new KV cache data computed during the second inference to the second KV cache buffer; means for computing a third inference using the second neural network, the third inference occurring after the second inference, the second neural network receiving input from the second KV cache buffer for the third inference; and means for outputting a third set of cache data computed during the third inference to the first KV cache buffer.
[0109] Aspect 17: The apparatus according to aspect 16, further comprising: means for concatenating the second set of new KV cache data generated by the second inference with past KV cache data to create a temporarily updated KV cache; and means for slicing the temporarily updated KV cache to shift the temporarily updated KV cache in a first-in-first-out (FIFO) manner.
[0110] Aspect 18: The apparatus according to aspect 16 or 17, wherein the means for concatenating and the means for slicing are located outside a central processing unit (CPU).
[0111] Aspect 19: The apparatus according to any one of aspects 16 to 18, further comprising means for implementing the slicing using vector copy instructions.
[0112] Aspect 20: The apparatus according to any one of aspects 16 to 19, further comprising means for: switching an input buffer pointer from the first KV cache buffer to the second KV cache buffer before computing the third inference, and switching an output buffer pointer from the second KV cache buffer to the first KV cache buffer.
[0113] The various operations of the methods described above can be performed by any suitable means capable of performing the corresponding functions. These means may include various hardware and/or software components and/or modules, including but not limited to circuits, application-specific integrated circuits (ASICs), or processors. Generally, where operations are illustrated in the accompanying drawings, these operations may have corresponding counterpart means-plus-function components with similar numbering.
[0114] As used, the term "determining" encompasses a wide variety of actions. For example, "determining" may include calculating, computing, processing, deriving, investigating, looking up (e.g., looking up in a table, a database, or another data structure), ascertaining, and the like. Additionally, "determining" may include receiving (e.g., receiving information), accessing (e.g., accessing data in a memory), and the like. Furthermore, "determining" may include resolving, selecting, choosing, establishing, and the like.
[0115] As used, a phrase referring to "at least one of" a list of items refers to any combination of those items, including single members. As an example, "at least one of: a, b, or c" is intended to cover: a, b, c, a-b, a-c, b-c, and a-b-c.
[0116] The various exemplary logic blocks, modules, and circuits described in this disclosure can be implemented or executed using a general-purpose processor, digital signal processor (DSP), application-specific integrated circuit (ASIC), field-programmable gate array (FPGA) or other programmable logic device (PLD), discrete gate or transistor logic components, discrete hardware components, or any combination thereof designed to perform the described functions. While the general-purpose processor may be a microprocessor, in alternative embodiments, the processor may be any commercially available processor, controller, microcontroller, or state machine. The processor may also be implemented as a combination of computing devices, such as a combination of a DSP and a microprocessor, multiple microprocessors, one or more microprocessors combined with a DSP core, or any other such configuration.
[0117] The steps or algorithms of the methods described in this disclosure may be directly embodied in hardware, software modules executed by a processor, or a combination of both. The software modules may reside in any form of storage medium known in the art. Some examples of usable storage media include random access memory (RAM), read-only memory (ROM), flash memory, erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), registers, hard disks, removable disks, CD-ROMs, and the like. Software modules may include a single instruction or multiple instructions and may be distributed across several different code segments, across different programs, and across multiple storage media. The storage medium may be coupled to the processor, enabling the processor to read information from and write information to the storage medium. Alternatively, the storage medium may be integral with the processor.
[0118] The disclosed method includes one or more steps or actions for implementing the described method. The steps and / or actions of the method may be interchanged without departing from the scope of the claims. In other words, unless a specific order of steps or actions is specified, the order and / or use of a particular step and / or action may be modified without departing from the scope of the claims.
[0119] The described functionality can be implemented in hardware, software, firmware, or any combination thereof. If implemented in hardware, an example hardware configuration may include a processing system within the device. This processing system may be implemented with a bus architecture. Depending on the specific application and overall design constraints of the processing system, the bus may include any number of interconnecting buses and bridges. The bus can link various circuits together, including processors, machine-readable media, and bus interfaces. The bus interface can be used to connect, among other things, a network adapter to the processing system via the bus. The network adapter can be used to implement signal processing functions. In some aspects, user interfaces (e.g., keypads, displays, mice, joysticks, etc.) may also be connected to the bus. The bus may also link various other circuits, such as timing sources, peripherals, voltage regulators, power management circuits, etc., which are well known in the art and will not be described further.
[0120] A processor may be responsible for managing the bus and general-purpose processing, including executing software stored on a machine-readable medium. A processor may be implemented using one or more general-purpose processors and / or special-purpose processors. Examples include microprocessors, microcontrollers, DSP processors, and other circuitry that can execute software. Software should be interpreted broadly as instructions, data, or any combination thereof, whether referred to as software, firmware, middleware, microcode, hardware description language, or otherwise. By way of example, a machine-readable medium may include random access memory (RAM), flash memory, read-only memory (ROM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), registers, disks, optical disks, hard disks, or any other suitable storage medium, or any combination thereof. A machine-readable medium may be embodied as a computer program product. A computer program product may include packaging material.
[0121] In a hardware implementation, machine-readable media can be part of the processing system separate from the processor. However, as those skilled in the art will readily understand, machine-readable media, or any portion thereof, can be external to the processing system. By way of example, machine-readable media may include a transmission line, a carrier wave modulated by data, and/or a computer product separate from the device, all of which may be accessed by the processor via the bus interface. Alternatively or additionally, machine-readable media, or any portion thereof, may be integrated into the processor, such as in the case of a cache and/or a general-purpose register file. Although the various components discussed may be described as having a specific location, such as a local component, they may also be configured in various ways, such as certain components being configured as part of a distributed computing system.
[0122] The processing system may be configured as a general-purpose processing system having one or more microprocessors providing processor functionality and external memory providing at least a portion of the machine-readable media, all of which are linked together with other supporting circuitry via an external bus architecture. Alternatively, the processing system may include one or more neuromorphic processors for implementing the described neuron models and models of neural systems. As another alternative, the processing system may be implemented using an application-specific integrated circuit (ASIC) having a processor, bus interface, user interface, supporting circuitry, and at least a portion of the machine-readable media integrated on a single chip, or using one or more field-programmable gate arrays (FPGAs), programmable logic devices (PLDs), controllers, state machines, gated logic components, discrete hardware components, or any other suitable circuitry, or any combination of circuits capable of performing the various functionalities described throughout this disclosure. Those skilled in the art will recognize how best to implement the described functionality of the processing system depending on the specific application and the overall design constraints imposed on the system as a whole.
[0123] Machine-readable media may include multiple software modules. These software modules include instructions that, when executed by a processor, cause the processing system to perform various functions. Software modules may include send and receive modules. Each software module may reside in a single storage device or be distributed across multiple storage devices. For example, when a triggering event occurs, a software module may be loaded from a hard disk drive into RAM. During the execution of a software module, the processor may load some of the instructions into a cache to improve access speed. One or more cache lines may then be loaded into a general-purpose register file for processor execution. When the functionality of a software module is referred to below, it will be understood that such functionality is implemented by the processor when executing the instructions from that software module. Furthermore, it should be understood that aspects of this disclosure result in improvements to the functionality of a processor, computer, machine, or other system implementing such aspects.
[0124] If implemented in software, the functions may be stored on or transmitted over a computer-readable medium as one or more instructions or code. Computer-readable media include both computer storage media and communication media, including any medium that facilitates the transfer of a computer program from one location to another. Storage media can be any available medium accessible to a computer. By way of example and not limitation, such computer-readable media may include RAM, ROM, EEPROM, CD-ROM or other optical disc storage, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to carry or store the desired program code in the form of instructions or data structures and is accessible to a computer. Additionally, any connection is properly termed a computer-readable medium. For example, if the software is transmitted from a website, server, or other remote source using coaxial cable, optical fiber, twisted pair, digital subscriber line (DSL), or wireless technologies such as infrared (IR), radio, and microwave, then such coaxial cable, optical fiber, twisted pair, DSL, or wireless technologies such as infrared, radio, and microwave are included in the definition of medium. Disk and disc, as used, include compact disc (CD), laser disc, optical disc, digital versatile disc (DVD), floppy disk, and Blu-ray® disc, where disks usually reproduce data magnetically, while discs reproduce data optically with lasers. Therefore, in some aspects, computer-readable media may include non-transitory computer-readable media (e.g., tangible media). Furthermore, in other aspects, computer-readable media may include transitory computer-readable media (e.g., signals). Combinations of the above should also be included within the scope of computer-readable media.
[0125] Therefore, certain aspects may include a computer program product for performing the presented operations. For example, such a computer program product may include a computer-readable medium on which instructions are stored (and / or encoded) that can be executed by one or more processors to perform the described operations. In some aspects, the computer program product may include packaging material.
[0126] Furthermore, it should be understood that modules and/or other suitable means for performing the described methods and techniques may be downloaded and/or otherwise obtained by a user terminal and/or base station where applicable. For example, such a device can be coupled to a server to facilitate the transfer of means for performing the described methods. Alternatively, the various methods described can be provided via storage means (e.g., RAM, ROM, or a physical storage medium such as a CD or floppy disk) so that the user terminal and/or base station can obtain the various methods once the storage means is coupled to or provided to the device. Furthermore, any other suitable technique for providing the described methods and techniques to the device may be utilized.
[0127] It should be understood that the claims are not limited to the precise configurations and components illustrated above. Various modifications, variations, and alterations may be made to the arrangement, operation, and details of the methods and apparatus described above without departing from the scope of the claims.
Claims
1. A processor-implemented method, comprising: allocating a first key-value (KV) cache buffer and a second KV cache buffer from system memory, the first KV cache buffer and the second KV cache buffer having the same size; computing a first inference using a first neural network; outputting a first set of new KV cache data computed during the first inference to the first KV cache buffer; computing a second inference using a second neural network, the second inference occurring after the first inference, the second neural network receiving input from the first KV cache buffer for the second inference; outputting a second set of new KV cache data computed during the second inference to the second KV cache buffer; computing a third inference using the second neural network, the third inference occurring after the second inference, the second neural network receiving input from the second KV cache buffer for the third inference; and outputting a third set of cache data computed during the third inference to the first KV cache buffer.
2. The processor-implemented method according to claim 1, further comprising, before outputting the second set of new KV cache data: concatenating the second set of new KV cache data generated by the second inference with past KV cache data to create a temporarily updated KV cache; and slicing the temporarily updated KV cache to shift the temporarily updated KV cache in a first-in-first-out (FIFO) manner.
3. The processor-implemented method according to claim 2, wherein the concatenating and the slicing are performed outside a central processing unit (CPU).
4. The processor-implemented method according to claim 3, further comprising implementing the slicing using vector copy instructions.
5. The processor-implemented method according to claim 1, further comprising: before computing the third inference, switching an input buffer pointer from the first KV cache buffer to the second KV cache buffer, and switching an output buffer pointer from the second KV cache buffer to the first KV cache buffer.
6. An apparatus comprising: at least one memory; and at least one processor coupled to the at least one memory, the at least one processor being configured to: allocate a first key-value (KV) cache buffer and a second KV cache buffer from system memory, the first KV cache buffer and the second KV cache buffer having the same size; compute a first inference using a first neural network; output a first set of new KV cache data computed during the first inference to the first KV cache buffer; compute a second inference using a second neural network, the second inference occurring after the first inference, the second neural network receiving input from the first KV cache buffer for the second inference; output a second set of new KV cache data computed during the second inference to the second KV cache buffer; compute a third inference using the second neural network, the third inference occurring after the second inference, the second neural network receiving input from the second KV cache buffer for the third inference; and output a third set of cache data computed during the third inference to the first KV cache buffer.
7. The apparatus of claim 6, wherein the at least one processor is further configured to, before outputting the second set of new KV cache data: concatenate the second set of new KV cache data generated by the second inference with past KV cache data to create a temporarily updated KV cache; and slice the temporarily updated KV cache to shift the temporarily updated KV cache in a first-in-first-out (FIFO) manner.
8. The apparatus of claim 7, wherein the at least one processor is further configured to perform the concatenating and the slicing outside a central processing unit (CPU).
9. The apparatus of claim 8, wherein the at least one processor is further configured to implement the slicing using vector copy instructions.
10. The apparatus of claim 6, wherein the at least one processor is further configured to: switch an input buffer pointer from the first KV cache buffer to the second KV cache buffer before computing the third inference, and switch an output buffer pointer from the second KV cache buffer to the first KV cache buffer.
11. A non-transitory computer-readable medium having program code recorded thereon, the program code being executed by at least one processor and comprising: program code for allocating a first key-value (KV) cache buffer and a second KV cache buffer from system memory, the first KV cache buffer and the second KV cache buffer having the same size; program code for computing a first inference using a first neural network; program code for outputting a first set of new KV cache data computed during the first inference to the first KV cache buffer; program code for computing a second inference using a second neural network, the second inference occurring after the first inference, the second neural network receiving input from the first KV cache buffer for the second inference; program code for outputting a second set of new KV cache data computed during the second inference to the second KV cache buffer; program code for computing a third inference using the second neural network, the third inference occurring after the second inference, the second neural network receiving input from the second KV cache buffer for the third inference; and program code for outputting a third set of cache data computed during the third inference to the first KV cache buffer.
12. The non-transitory computer-readable medium of claim 11, wherein the program code further comprises: program code for concatenating the second set of new KV cache data generated by the second inference with past KV cache data to create a temporarily updated KV cache; and program code for slicing the temporarily updated KV cache to shift the temporarily updated KV cache in a first-in-first-out (FIFO) manner.
13. The non-transitory computer-readable medium of claim 12, wherein the program code further comprises program code for performing the concatenating and the slicing outside a central processing unit (CPU).
14. The non-transitory computer-readable medium of claim 13, wherein the program code further comprises program code for implementing the slicing using vector copy instructions.
15. The non-transitory computer-readable medium of claim 11, wherein the program code further comprises program code for: switching an input buffer pointer from the first KV cache buffer to the second KV cache buffer before computing the third inference, and switching an output buffer pointer from the second KV cache buffer to the first KV cache buffer.
16. An apparatus comprising: means for allocating a first key-value (KV) cache buffer and a second KV cache buffer from system memory, the first KV cache buffer and the second KV cache buffer having the same size; means for computing a first inference using a first neural network; means for outputting a first set of new KV cache data computed during the first inference to the first KV cache buffer; means for computing a second inference using a second neural network, the second inference occurring after the first inference, the second neural network receiving input from the first KV cache buffer for the second inference; means for outputting a second set of new KV cache data computed during the second inference to the second KV cache buffer; means for computing a third inference using the second neural network, the third inference occurring after the second inference, the second neural network receiving input from the second KV cache buffer for the third inference; and means for outputting a third set of cache data computed during the third inference to the first KV cache buffer.
17. The apparatus of claim 16, further comprising: means for concatenating the second set of new KV cache data generated by the second inference with past KV cache data to create a temporarily updated KV cache; and means for slicing the temporarily updated KV cache to shift the temporarily updated KV cache in a first-in-first-out (FIFO) manner.
18. The apparatus of claim 17, wherein the means for concatenating and the means for slicing are located outside a central processing unit (CPU).
19. The apparatus of claim 18, further comprising means for implementing the slicing using vector copy instructions.
20. The apparatus of claim 16, further comprising: means for switching an input buffer pointer from the first KV cache buffer to the second KV cache buffer before computing the third inference; and means for switching an output buffer pointer from the second KV cache buffer to the first KV cache buffer before computing the third inference.