Tuning large language models for next sentence prediction
By decomposing the multi-head attention mechanism into independent single-head attention operations and executing them in parallel, the problem of low computational efficiency of large language models is solved, improving computational speed and hardware efficiency while reducing power consumption.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- QUALCOMM INC
- Filing Date
- 2024-10-08
- Publication Date
- 2026-06-19
AI Technical Summary
Existing large-scale language models suffer from low computational efficiency, high power consumption, and increased hardware design complexity when using multi-head attention mechanisms, especially in neural signal processors that handle neural network operations.
The multi-head attention mechanism is decomposed into independent single-head attention operations, which are executed in parallel to reduce the need for data reshaping and transposition, thereby improving computational efficiency and data throughput.
By performing single-head attention operations in parallel, computational overhead is reduced, the transformer's computational speed and hardware efficiency are improved, and power consumption is reduced.
Figure CN122249815A_ABST
Abstract
Description
[0001] Cross-reference to related applications
[0002] This application claims priority to U.S. Patent Application No. 18/522,149, filed November 28, 2023, entitled “TUNING LARGE LANGUAGE MODELS FOR NEXT SENTENCE PREDICTION”, the entire disclosure of which is expressly incorporated herein by reference.
Technical Field
[0003] Aspects of this disclosure generally relate to neural networks, and more specifically to tuning large language models for next-sentence prediction.
Background
[0004] Artificial neural networks can comprise interconnected groups of artificial neurons (e.g., neuron models). An artificial neural network (ANN) can be a computing device or represented as a method to be performed by a computing device. A convolutional neural network (CNN) is a type of feedforward ANN.
[0005] A network can comprise an array of neurons, each with its own receptive field, which collectively divide the input space. Convolutional neural networks, such as deep convolutional neural networks (DCNs), have numerous applications. Specifically, these neural network architectures are used in a variety of technologies, such as image recognition, speech recognition, acoustic scene classification, keyword retrieval, autonomous driving, and other classification tasks.
[0006] Given the many useful applications of neural networks, there is a growing need to use them to solve increasingly complex problems in further application areas. One area being explored is generative artificial intelligence. Large language models (LLMs) have made significant progress in natural language understanding and are gaining popularity for text generation tasks and tasks involving modeling information from both textual and visual domains. LLMs can receive prompts from users and then generate responses or completions.
Summary
[0007] In one aspect of this disclosure, a processor-implemented method includes generating a set of single-head attention (SHA) operations based on a plurality of attention heads in a multi-head attention (MHA) mechanism, each SHA operation corresponding to a respective attention head of the plurality of attention heads associated with the MHA mechanism. The processor-implemented method further includes executing each SHA operation of the set independently and in parallel across hardware blocks of a device associated with a neural network model. The processor-implemented method also includes generating an MHA output based on the parallel execution of the set of SHA operations.
[0008] Another aspect of this disclosure relates to an apparatus including components for generating a set of single-head attention (SHA) operations based on a plurality of attention heads in a multi-head attention (MHA) mechanism, each SHA operation corresponding to a respective attention head of the plurality of attention heads associated with the MHA mechanism. The apparatus also includes components for executing each SHA operation of the set independently and in parallel across hardware blocks of a device associated with a neural network model. The apparatus further includes components for generating an MHA output based on the parallel execution of the set of SHA operations.
[0009] In another aspect of this disclosure, a non-transitory computer-readable medium having program code recorded thereon is disclosed. The program code is executed by a processor and includes program code for generating a set of single-head attention (SHA) operations based on a plurality of attention heads in a multi-head attention (MHA) mechanism, each SHA operation corresponding to a respective attention head of the plurality of attention heads associated with the MHA mechanism. The program code also includes program code for executing each SHA operation of the set independently and in parallel across hardware blocks of a device associated with a neural network model. The program code further includes program code for generating an MHA output based on the parallel execution of the set of SHA operations.
[0010] Another aspect of this disclosure relates to an apparatus having one or more processors and one or more memories coupled to the one or more processors. The memories store instructions that, when executed by the one or more processors, cause the apparatus to generate a set of single-head attention (SHA) operations based on a plurality of attention heads in a multi-head attention (MHA) mechanism, each SHA operation corresponding to a respective attention head of the plurality of attention heads associated with the MHA mechanism. Execution of the instructions also causes the apparatus to execute each SHA operation of the set independently and in parallel across hardware blocks of a device associated with a neural network model. Execution of the instructions further causes the apparatus to generate an MHA output based on the parallel execution of the set of SHA operations.
[0011] Additional features and advantages of this disclosure will be described below. Those skilled in the art will understand that this disclosure can be readily used as the basis for modifying or designing other structures for carrying out the same purposes as this disclosure. Those skilled in the art will also recognize that such equivalent constructions do not depart from the teachings of this disclosure as set forth in the appended claims. Novel features considered characteristic of this disclosure, in both their organization and manner of operation, along with further objects and advantages, will be better understood when considered in conjunction with the accompanying drawings. However, it is to be clearly understood that each drawing is provided for illustrative and descriptive purposes only and is not intended as a definition of the limits of this disclosure.
Brief Description of the Drawings
[0012] The features, nature, and advantages of this disclosure will become more apparent when understood in conjunction with the accompanying drawings, in which like reference numerals identify corresponding parts throughout.
[0013] Figure 1 illustrates an example implementation of a neural network using a system-on-a-chip (SOC), including a general-purpose processor, according to certain aspects of this disclosure.
[0014] Figures 2A, 2B, and 2C are diagrams illustrating neural networks according to various aspects of this disclosure.
[0015] Figure 2D is a diagram illustrating an exemplary deep convolutional network (DCN) according to various aspects of this disclosure.
[0016] Figure 3 is a block diagram illustrating an exemplary deep convolutional network (DCN) according to various aspects of this disclosure.
[0017] Figure 4 is a block diagram illustrating an exemplary software architecture that enables modularization of artificial intelligence (AI) functions according to various aspects of this disclosure.
[0018] Figures 5 and 6 are diagrams illustrating a word generation model according to various aspects of this disclosure.
[0019] Figure 7 is a flowchart illustrating an example of a process for processing multi-head attention (MHA) inputs by an MHA mechanism of a transformer model, according to various aspects of this disclosure.
[0020] Figure 8 is a flowchart illustrating an example of a process for parallel processing of MHA inputs by multiple single-head attention (SHA) mechanisms of a transformer model, according to various aspects of this disclosure.
[0021] Figure 9 is a flowchart illustrating an example of a process for generating SHA mechanisms from an MHA mechanism associated with a transformer model, according to various aspects of this disclosure.
Detailed Description
[0022] The detailed description that follows, taken in conjunction with the accompanying drawings, is intended as a description of various configurations and is not intended to represent the only configurations in which the described concepts can be practiced. To provide a comprehensive understanding of the various concepts, the detailed description includes specific details. However, it will be apparent to those skilled in the art that these concepts can be practiced without these specific details. In some instances, to avoid obscuring such concepts, well-known structures and components are shown in block diagram form.
[0023] Based on the teachings, those skilled in the art will recognize that the scope of this disclosure is intended to cover any aspect of this disclosure, whether implemented independently of or in combination with any other aspect of this disclosure. For example, an apparatus or method may be implemented using any number of the aspects described. Furthermore, the scope of this disclosure is intended to cover such apparatuses or methods practiced using other structures, functionalities, or structures and functionalities that complement or differ from the various aspects of this disclosure described. It should be understood that any aspect of this disclosure may be embodied by one or more elements of the claims.
[0024] The word “exemplary” is used to mean “serving as an example, instance, or illustration.” Any aspect described as “exemplary” need not be interpreted as superior to or better than other aspects.
[0025] While specific aspects have been described, numerous variations and substitutions of these aspects fall within the scope of this disclosure. Although some benefits and advantages of preferred aspects have been mentioned, the scope of this disclosure is not intended to be limited to a particular benefit, use, or purpose. Rather, aspects of this disclosure are intended to be broadly applicable to different technologies, system configurations, networks, and protocols, some of which are illustrated by way of example in the accompanying drawings and the following description of preferred aspects. The detailed description and drawings are merely illustrative and not limiting of this disclosure, the scope of which is defined by the appended claims and their equivalents.
[0026] Large language models (LLMs) are examples of models for understanding natural language. LLMs are increasingly popular for text generation tasks and tasks involving modeling information from both textual and visual domains.
[0027] In many cases, LLMs use a transformer-based architecture (e.g., a transformer model) to process sequences of data, particularly text. In an LLM, the transformer receives a sequence of lexical units (which can be, for example, words, parts of words, or characters) and processes these units through its layers. As data flows through each layer, the model learns increasingly complex representations of the input data. This allows LLMs to perform a wide range of language-related tasks with high fluency and coherence, such as text completion, translation, summarization, and question answering.
[0028] The attention layer (e.g., an attention head) of the transformer determines which parts of the input are important and how those parts should affect the prediction task. An attention head processes data with the shape [sequence length, depth]. That is, for each position in the input sequence (sequence length), the attention head processes a vector of a specific size (depth). This vector represents the features or encoded information at that position in the sequence. In some cases, the attention layer can use a multi-head attention (MHA) mechanism, referred to interchangeably as MHA, to process and understand the input data in a parallel and comprehensive manner. In such cases, MHA processes data with the shape [number of heads, sequence length, depth]. The MHA mechanism extends the capabilities of a single-head attention (SHA) mechanism, referred to interchangeably as SHA, by allowing simultaneous processing across multiple heads. The core computation of SHA can be represented as $\text{Attention} = V\,\mathrm{softmax}(QK^{T})$, where Q, K, and V denote the query, key, and value. The MHA mechanism enhances SHA by introducing parallelism, allowing the transformer to attend to different parts of the input simultaneously. This improves the transformer's ability to capture various aspects of the input data. Specifically, SHA focuses on a single set (query, key, and value), while MHA splits that focus across multiple sets, each potentially capturing different contextual nuances from the input sequence.
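The single-head computation described above can be sketched in a few lines of standard-library Python. This is only an illustration, not the implementation from the disclosure: the function names and the toy inputs are hypothetical, the sketch uses the common row-vector convention (softmax of the query-key scores multiplied by the values), and the 1/sqrt(depth) scaling is an assumption borrowed from conventional transformers rather than something the text states.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def single_head_attention(Q, K, V):
    """Q, K, V are nested lists of shape [sequence length, depth]."""
    depth = len(Q[0])
    scale = 1.0 / math.sqrt(depth)  # assumption: conventional scaling, not from the text
    out = []
    for q in Q:
        # One row of Q K^T: this query compared against every key.
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in K]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out

# Hypothetical toy input: 3 positions, depth 2.
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
result = single_head_attention(Q, K, V)
print(result)
```

The output keeps the [sequence length, depth] shape of the input, and each output row is a convex combination of the value rows.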
[0029] For example, in the context of natural language processing, each head can focus on different types of relationships between words, such as syntactic and semantic relationships. This allows the transformer to have a more comprehensive understanding of the text. However, using multiple heads has drawbacks in terms of efficiency for neural signal processors (NSPs) that handle neural network operations. In most cases, various computations of MHA can be processed on dedicated hardware cores within the NSP. For example, matrix multiplication can be assigned to a hardware core optimized for such operations, while the softmax function that normalizes attention weights can be processed on a vector processing core. The use of multiple dedicated cores can lead to inefficiencies. The need to reshape and transpose data tensors for different stages of MHA computation can also introduce additional computational overhead. These extra steps increase the amount of processing power and time allocated to perform attention operations, potentially slowing down overall computation and reducing the throughput of the NSP. Furthermore, the increased complexity of managing data across different cores can increase power consumption and may require more complex hardware designs to maintain efficiency.
[0030] Various aspects of this disclosure relate to improving MHA by dividing MHA into separate SHA operations. In some examples, each SHA operation can be executed independently to achieve parallelization across hardware blocks, thereby reducing the number of reshaping and transpose layers.
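As a rough sketch of this decomposition, the Python below treats each attention head as an independent SHA call on its own [sequence length, depth] slices and dispatches the calls through a thread pool. Several things here are assumptions made for illustration: the thread pool merely stands in for separate hardware blocks (CPython threads do not give true parallelism for this workload), the per-head results are combined by concatenation, the 1/sqrt(depth) scaling follows convention, and the names `sha` and `mha_from_parallel_sha` are hypothetical. Because each head only ever sees 2-D data, no [number of heads, sequence length, depth] tensor has to be reshaped or transposed.

```python
import math
from concurrent.futures import ThreadPoolExecutor

def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def sha(head):
    """One single-head attention operation on a (Q, K, V) triple,
    each of shape [sequence length, depth]."""
    Q, K, V = head
    scale = 1.0 / math.sqrt(len(Q[0]))  # assumption: conventional scaling
    out = []
    for q in Q:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in K]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out

def mha_from_parallel_sha(heads, max_workers=None):
    """Run one SHA per head independently, then concatenate per position."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        per_head = list(pool.map(sha, heads))  # order of heads is preserved
    seq_len = len(per_head[0])
    # Assumption: the MHA output is the per-position concatenation of heads.
    return [[x for head_out in per_head for x in head_out[t]]
            for t in range(seq_len)]

# Hypothetical toy input: 2 heads, 2 positions, depth 2.
heads = [
    ([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]], [[1.0, 2.0], [3.0, 4.0]]),
    ([[0.5, 0.5], [1.0, -1.0]], [[0.0, 1.0], [1.0, 0.0]], [[5.0, 6.0], [7.0, 8.0]]),
]
mha_output = mha_from_parallel_sha(heads)
print(mha_output)
```

Running the heads sequentially and concatenating gives the same result, which is the property that makes the per-head split safe to schedule independently.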
[0031] Specifically, MHA allows the transformer to remember and pay attention to different parts of the text as it generates or processes language, enabling it to maintain context and understand nuances in longer passages. MHA generates three vectors from the input data: query (Q), key (K), and value (V). The attention layer uses these vectors to create a set of attention scores based on comparisons between the query and the key. These scores are normalized using a softmax function to determine how much focus (that is, "attention") is given to the corresponding value. The result is a weighted sum of value vectors that carries both the original information and the context obtained from the attention process. The model can perform this operation simultaneously using multiple heads.
[0032] Specific aspects of the subject matter described in this disclosure can be implemented to achieve one or more of the following potential advantages. In some examples, the described techniques for decomposing MHA into independent SHA operations can improve computational efficiency. Specifically, these techniques can increase data throughput by shortening the lifetime of data in local memory and distributing data transfer loads more evenly across memory resources. Additionally, the described methods can reduce the need for data reshaping and transposition, thereby potentially reducing computational overhead and improving the overall speed of the attention mechanism within the transformer.
[0033] Figure 1 illustrates an example implementation of a system-on-a-chip (SOC) 100, which may include a central processing unit (CPU) 102 or a multi-core CPU configured to tune a large language model. Variables (e.g., neural signals and synaptic weights), system parameters associated with a computational device (e.g., a neural network with weights), latency, frequency bin information, and task information may be stored in a memory block associated with a neural processing unit (NPU) 108, a memory block associated with the CPU 102, a memory block associated with a graphics processing unit (GPU) 104, a memory block associated with a digital signal processor (DSP) 106, or a memory block 118, or may be distributed across multiple blocks. Instructions executed at the CPU 102 may be loaded from a program memory associated with the CPU 102 or from the memory block 118.
[0034] The SOC 100 may also include additional processing blocks tailored for specific functions, such as a GPU 104, a DSP 106, a connectivity block 110 (which may include fifth-generation (5G) connectivity, fourth-generation long-term evolution (4G LTE) connectivity, Wi-Fi connectivity, USB connectivity, Bluetooth connectivity, etc.), and a multimedia processor 112 capable of, for example, detecting and recognizing gestures. In one implementation, the NPU 108 is implemented within the CPU 102, the DSP 106, and/or the GPU 104. The SOC 100 may also include a sensor processor 114, an image signal processor (ISP) 116, and/or a navigation module 120, which may include a global positioning system.
[0035] The SOC 100 can be based on ARM, RISC-V (RISC-5), or any Reduced Instruction Set Computing (RISC) architecture. In various aspects of this disclosure, the instructions loaded into the general-purpose processor 102 may include code for generating a set of single-head attention (SHA) operations based on a plurality of attention heads in a multi-head attention (MHA) mechanism, each SHA operation corresponding to a respective attention head of the plurality of attention heads associated with the MHA mechanism; code for executing each SHA operation of the set independently and in parallel across hardware blocks of a device associated with a neural network model; and code for generating an MHA output based on the parallel execution of the set of SHA operations.
[0036] Deep learning architectures perform object recognition tasks by learning to represent inputs at progressively higher levels of abstraction in each layer, thereby constructing useful feature representations of the input data. In this way, deep learning addresses a major bottleneck of traditional machine learning. Before deep learning, machine learning methods for object recognition problems often relied heavily on human-designed features, possibly in conjunction with shallow classifiers. Shallow classifiers could be two-class linear classifiers, where a weighted sum of feature vector components is compared to a threshold to predict which class the input belongs to. Human-designed features could be templates or kernels customized for a specific problem domain by engineers with domain expertise. In contrast, while deep learning architectures can learn to represent features similar to those that human engineers might design, this requires training. Furthermore, deep networks can learn to represent and recognize novel types of features that humans might not have considered.
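The shallow two-class linear classifier mentioned above reduces to a weighted sum compared against a threshold. A minimal sketch follows; the weights, threshold, and feature vectors are hypothetical values chosen only to show the mechanics, and in the scenario the text describes the features themselves would be hand-designed.

```python
def linear_classify(features, weights, threshold):
    """Return class 1 if the weighted sum of features exceeds the threshold, else 0."""
    score = sum(w * x for w, x in zip(weights, features))
    return 1 if score > threshold else 0

# Hypothetical hand-picked parameters for three human-designed features.
weights = [0.8, -0.5, 0.3]
threshold = 0.2

first = linear_classify([1.0, 0.2, 0.5], weights, threshold)   # score is about 0.85
second = linear_classify([0.1, 1.0, 0.0], weights, threshold)  # score is about -0.42
print(first, second)
```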
[0037] Deep learning architectures can learn hierarchical structures of features. For example, if presented with visual data, the first layer can learn to recognize relatively simple features in the input stream, such as edges. In another example, if presented with auditory data, the first layer can learn to recognize spectral power at specific frequencies. The second layer, taking the output of the first layer as input, can learn to recognize combinations of features, such as simple shapes in visual data or combinations of sounds in auditory data. For example, higher layers can learn to represent complex shapes in visual data or words in auditory data. Even higher layers can learn to recognize common visual objects or spoken phrases.
[0038] Deep learning architectures perform particularly well when applied to problems with a natural hierarchical structure. For example, the classification of motorized vehicles can benefit from first learning to identify features such as wheels, windshields, and others. These features can then be combined in different ways at higher levels to identify cars, trucks, and airplanes.
[0039] Neural networks can be designed to have multiple connectivity patterns. In feedforward networks, information is passed from lower layers to higher layers, where each neuron in a given layer communicates with neurons in higher layers. As described above, hierarchical representations can be built in successive layers of a feedforward network. Neural networks can also have recurrent or feedback (also known as top-down) connections. In recurrent connections, the output from a neuron in a given layer can be passed to another neuron in the same layer. Recurrent architectures can help identify patterns across more than one block of input data that is sequentially delivered to the neural network. Connections from neurons in a given layer to neurons in lower layers are called feedback (or top-down) connections. Networks with many feedback connections can be helpful when the recognition of higher-level concepts can aid in discerning specific lower-level features of the input.
[0040] The connections between layers in a neural network can be fully connected or locally connected. Figure 2A illustrates an example of a fully connected neural network 202. In the fully connected neural network 202, a neuron in the first layer can transmit its output to every neuron in the second layer, so that each neuron in the second layer receives input from every neuron in the first layer. Figure 2B illustrates an example of a locally connected neural network 204. In the locally connected neural network 204, a neuron in the first layer can connect to a limited number of neurons in the second layer. More generally, the locally connected layers of the locally connected neural network 204 can be configured such that each neuron in a layer has the same or a similar connectivity pattern, but with connection strengths that can have different values (e.g., 210, 212, 214, and 216). The connectivity pattern of locally connected layers can produce spatially distinct receptive fields in higher layers because neurons in a higher layer in a given region can receive inputs that are tuned, through training, to the characteristics of a restricted portion of the total input to the network.
[0041] An example of a locally connected neural network is a convolutional neural network. Figure 2C illustrates an example of a convolutional neural network 206. The convolutional neural network 206 can be configured such that the connection strengths associated with the inputs for each neuron in the second layer are shared (e.g., 208). Convolutional neural networks may be well-suited for problems where the spatial location of the input is meaningful.
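The difference between a locally connected layer and a convolutional layer can be made concrete with a one-dimensional sketch: both connect each output neuron to a small window of the input, but only the convolutional layer reuses one set of weights at every position. The function names, window size, and numbers below are hypothetical, and, as is usual in deep learning, the "convolution" is computed without flipping the kernel.

```python
def locally_connected_1d(inputs, weights_per_position):
    """Each output neuron i has its own weights over the window starting at i."""
    outputs = []
    for i, w in enumerate(weights_per_position):
        window = inputs[i:i + len(w)]
        outputs.append(sum(wi * xi for wi, xi in zip(w, window)))
    return outputs

def convolution_1d(inputs, shared_weights):
    """Same connectivity pattern, but every output neuron shares one weight set."""
    n_out = len(inputs) - len(shared_weights) + 1
    return locally_connected_1d(inputs, [shared_weights] * n_out)

inputs = [1.0, 2.0, 3.0, 4.0, 5.0]

# Locally connected: three output neurons, three independent weight sets.
local_out = locally_connected_1d(
    inputs, [[1.0, 0.0, -1.0], [0.0, 1.0, 0.0], [2.0, 0.0, 0.0]])

# Convolutional: the first weight set is shared across all positions.
conv_out = convolution_1d(inputs, [1.0, 0.0, -1.0])
print(local_out, conv_out)
```

With a window of three over five inputs, the locally connected layer stores nine weights while the convolutional layer stores three, which is the saving that weight sharing provides.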
[0042] One type of convolutional neural network is the deep convolutional network (DCN). Figure 2D illustrates a detailed example of a DCN 200 designed to recognize visual features from an image 226 input from an image capture device 230 (such as an in-vehicle camera). The DCN 200 in this example can be trained to identify traffic signs and the numbers provided on them. Of course, the DCN 200 can be trained for other tasks, such as identifying lane markings or traffic lights.
[0043] Supervised learning can be used to train the DCN 200. During training, an image (such as image 226 of a speed limit sign) can be presented to the DCN 200, and a forward pass can then be computed to produce output 222. The DCN 200 may include a feature extraction part and a classification part. Upon receiving image 226, convolutional layer 232 may apply convolutional kernels (not shown) to image 226 to generate a first set of feature maps 218. As an example, the convolutional kernel used for convolutional layer 232 may be a 5×5 kernel that generates 28×28 feature maps. In this example, because four different feature maps are generated in the first set of feature maps 218, four different convolutional kernels are applied to image 226 at convolutional layer 232. Convolutional kernels may also be referred to as filters or convolutional filters.
[0044] The first set of feature maps 218 can be subsampled by a max-pooling layer (not shown) to generate a second set of feature maps 220. The max-pooling layer reduces the size of the first set of feature maps 218. That is, the size of the second set of feature maps 220 (e.g., 14×14) is smaller than the size of the first set of feature maps 218 (e.g., 28×28). The reduced size provides similar information to subsequent layers while reducing memory consumption. The second set of feature maps 220 can be further convolved via one or more subsequent convolutional layers (not shown) to generate one or more subsequent sets of feature maps (not shown).
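The feature-map sizes in this example follow from simple arithmetic. The sketch below reproduces the 28×28 and 14×14 figures under stated assumptions that the text does not spell out: a 32×32 input image, a stride of 1 with no padding for the 5×5 convolution, and a 2×2 max-pooling window with stride 2. The helper names are hypothetical.

```python
def conv_output_size(input_size, kernel_size, stride=1, padding=0):
    """Spatial size of a convolution output along one dimension."""
    return (input_size + 2 * padding - kernel_size) // stride + 1

def pool_output_size(input_size, window, stride=None):
    """Spatial size after pooling; stride defaults to the window size."""
    stride = window if stride is None else stride
    return (input_size - window) // stride + 1

input_size = 32                                  # assumption: 32x32 input image
conv_size = conv_output_size(input_size, 5)      # 5x5 kernel, stride 1, no padding
pool_size = pool_output_size(conv_size, 2)       # assumption: 2x2 window, stride 2
print(conv_size, pool_size)
```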
[0045] In the example of Figure 2D, the second set of feature maps 220 is convolved to generate a first feature vector 224. The first feature vector 224 is then further convolved to generate a second feature vector 228. Each feature of the second feature vector 228 may include a number corresponding to a possible feature of image 226, such as "sign", "60", and "100". A softmax function (not shown) converts the numbers in the second feature vector 228 into probabilities. Thus, the output 222 of DCN 200 can be the probability that image 226 includes one or more features.
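How a softmax turns the numbers in a feature vector into probabilities can be shown with a short sketch. The labels echo the example features named above, but the raw scores are hypothetical values invented for illustration.

```python
import math

def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(scores)                       # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

labels = ["sign", "30", "60", "100"]
scores = [4.0, 0.5, 3.5, 0.2]             # hypothetical values of the feature vector
probs = dict(zip(labels, softmax(scores)))

for label, p in probs.items():
    print(f"{label}: {p:.3f}")
```

Larger scores map to larger probabilities, so in this made-up case "sign" and "60" dominate the output.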
[0046] In this example, the probabilities for "sign" and "60" in output 222 are higher than the probabilities for other numbers in output 222 (such as "30", "40", "50", "70", "80", "90", and "100"). Before training, output 222 generated by DCN 200 may be incorrect. Therefore, the error between output 222 and the target output can be calculated. The target output is the baseline ground truth (e.g., "sign" and "60") of image 226. The weights of DCN 200 can then be adjusted so that output 222 of DCN 200 is more closely aligned with the target output.
[0047] To adjust the weights, the learning algorithm computes the gradient vector of the weights. The gradient indicates by how much the error will increase or decrease as the weights are adjusted. At the top layers, the gradient corresponds directly to the values of the weights connecting the activated neurons in the penultimate layer to the neurons in the output layer. In lower layers, the gradient depends on the values of the weights and the error gradient computed in the higher layers. The weights can then be adjusted to reduce the error. This method of adjusting weights is called "backpropagation" because it involves "passing backward" through the neural network.
[0048] In practice, the error gradient of the weights can be calculated using a small number of examples to make the calculated gradient approximate the true error gradient. This approximation method is called stochastic gradient descent. Stochastic gradient descent can be repeated until the achievable error rate of the entire system stops decreasing or until the error rate reaches a target level. After learning, a new image (e.g., a speed limit sign in image 226) can be presented to DCN 200, and output 222 can be generated through the forward pass of DCN 200. This output can be considered as an inference or prediction of DCN 200.
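Stochastic gradient descent can be sketched on a deliberately tiny problem. The example below fits a two-parameter linear model rather than a DCN, estimating the error gradient from a small random batch at each step. The model, data, learning rate, batch size, and step count are all hypothetical choices for illustration.

```python
import random

def mse(data, w, b):
    """Mean squared error of the model y = w * x + b over the whole data set."""
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)

def sgd_linear_fit(data, lr=0.05, batch_size=4, steps=2000, seed=0):
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    for _ in range(steps):
        # A small random batch approximates the gradient over all examples.
        batch = rng.sample(data, batch_size)
        grad_w = sum(2 * (w * x + b - y) * x for x, y in batch) / batch_size
        grad_b = sum(2 * (w * x + b - y) for x, y in batch) / batch_size
        # Move the weights against the gradient to reduce the error.
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Hypothetical noiseless training data drawn from y = 2x + 1.
data = [(i / 10.0, 2.0 * (i / 10.0) + 1.0) for i in range(20)]

initial_loss = mse(data, 0.0, 0.0)
w, b = sgd_linear_fit(data)
final_loss = mse(data, w, b)
print(initial_loss, final_loss)
```

In a deep network the per-weight gradients would come from backpropagation, but the update rule applied to each weight has the same form.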
[0049] Deep belief networks (DBNs) are probabilistic models that include multiple layers of hidden nodes. DBNs can be used to extract hierarchical representations of training datasets. DBNs are obtained by stacking layers of restricted Boltzmann machines (RBMs). An RBM is a type of artificial neural network that learns a probability distribution from a set of inputs. Because RBMs can learn a probability distribution without information about the class into which each input should be classified, they are often used for unsupervised learning. Using a hybrid paradigm of supervised and unsupervised learning, the bottom RBMs of a DBN can be trained in an unsupervised manner and used as feature extractors, while the top RBM can be trained in a supervised manner (on the joint distribution of inputs from the previous layer and the target class) and used as a classifier.
[0050] DCNs are networks of convolutional networks configured with additional pooling and normalization layers. DCNs have achieved state-of-the-art performance on many tasks. DCNs can be trained using supervised learning, in which both the input and the output target are known for many exemplars and are used to modify the network's weights using gradient descent.
[0051] DCNs can be feedforward networks. Furthermore, as described above, connections from neurons in the first layer of a DCN to a set of neurons in the next higher layer are shared across neurons in the first layer. The feedforward and shared connections of a DCN can be used for fast processing. For example, the computational cost of a DCN may be much smaller than that of a similarly sized neural network that includes recurrent or feedback connections.
[0052] The processing at each layer of a convolutional network can be thought of as a spatially invariant template or basis projection. If the input is first decomposed into multiple channels, such as the red, green, and blue channels of a color image, then a convolutional network trained on that input can be thought of as three-dimensional, with two spatial dimensions along the image's axes and a third dimension capturing color information. The output of the convolutional connections can be thought of as forming a feature map in the next layer, where each element in the feature map (e.g., 220) receives input from a range of neurons in the previous layer (e.g., feature map 218) and from each of the multiple channels. The values in the feature map can be further processed with a non-linearity, such as rectification, max(0, x). Values from neighboring neurons can be further pooled, which corresponds to downsampling and provides additional local invariance and dimensionality reduction. Normalization, which corresponds to whitening, can also be applied through lateral inhibition between neurons in the feature map.
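Rectification followed by pooling can be sketched on a small feature map. The 4×4 values and the non-overlapping 2×2 max-pooling window are hypothetical choices for illustration; the text does not fix a window size.

```python
def relu(feature_map):
    """Rectification: apply max(0, x) to every value."""
    return [[max(0.0, v) for v in row] for row in feature_map]

def max_pool_2x2(feature_map):
    """Downsample by taking the maximum over non-overlapping 2x2 neighborhoods."""
    rows, cols = len(feature_map), len(feature_map[0])
    return [[max(feature_map[r][c], feature_map[r][c + 1],
                 feature_map[r + 1][c], feature_map[r + 1][c + 1])
             for c in range(0, cols - 1, 2)]
            for r in range(0, rows - 1, 2)]

# Hypothetical 4x4 feature map.
feature_map = [[-1.0, 2.0, 0.5, -3.0],
               [4.0, -2.0, 1.5, 0.0],
               [0.0, -1.0, -2.0, -0.5],
               [3.0, 1.0, -4.0, 2.5]]

pooled = max_pool_2x2(relu(feature_map))
print(pooled)  # [[4.0, 1.5], [3.0, 2.5]]
```

The 4×4 map shrinks to 2×2, and each output value no longer depends on exactly where within its window the strongest response occurred, which is the local invariance the paragraph refers to.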
[0053] Figure 3 This is a block diagram illustrating a DCN 350. A DCN 350 can include multiple layers of different types based on connectivity and weight sharing. For example... Figure 3 As shown, DCN 350 includes convolutional blocks 354A and 354B. Each convolutional block in convolutional blocks 354A and 354B can be configured with a convolutional layer (CONV) 356, a normalization layer (LNorm) 358, and a max pooling layer (MAX POOL) 360.
[0054] Although only two convolutional blocks 354A and 354B are shown, this disclosure is not limited thereto, and any number of convolutional blocks 354A and 354B may be included in the DCN 350 according to design preferences.
[0055] Convolutional layer 356 may include one or more convolutional filters that can be applied to the input data to generate feature maps. Normalization layer 358 may normalize the output of the convolutional filters. For example, normalization layer 358 may provide whitening or lateral suppression. Max pooling layer 360 may provide spatial downsampling aggregation to achieve local invariance and dimensionality reduction.
[0056] Parallel filter banks of deep convolutional networks can be loaded onto the CPU 102 or GPU 104 of an SOC 100 (e.g., Figure 1) to achieve high performance and low power consumption. In an alternative implementation, a parallel filter bank can be loaded onto the DSP 106 or ISP 116 of the SOC 100. In addition, the DCN 350 can access other processing blocks that may exist on the SOC 100, such as the sensor processor 114 and navigation module 120, which are dedicated to sensors and navigation, respectively.
[0057] The DCN 350 may also include one or more fully connected layers 362 (FC1 and FC2). The DCN 350 may also include a logistic regression (LR) layer 364. Weights (not shown) to be updated are located between each of the layers 356, 358, 360, 362, and 364 of the DCN 350. The output of each layer (e.g., 356, 358, 360, 362, and 364) can be used as input to the next layer in the DCN 350 (e.g., 356, 358, 360, 362, and 364) to learn hierarchical feature representations from the input data 352 (e.g., images, audio, video, sensor data, and / or other input data) supplied at the first convolutional block 354A. The output of the DCN 350 is a classification score 366 of the input data 352. The classification score 366 may be a set of probabilities, where each probability is the probability that the input data includes a feature from the feature set.
[0058] Figure 4 is a block diagram illustrating an exemplary software architecture 400 that enables modularization of artificial intelligence (AI) functionality. Using architecture 400, applications can be designed that enable various processing blocks of an SOC 420 (which may be similar to the SOC 100 of Figure 1), such as a CPU 422, a DSP 424, a GPU 426, and / or an NPU 428, to support tuning a large language model for next-sentence prediction (NSP) for AI applications 402 according to various aspects of this disclosure. Architecture 400 may, for example, be included in a computing device (such as a smartphone).
[0059] AI application 402 can be configured to call functions defined in user space 404, which may, for example, provide an LLM, process the input of an LLM, or provide a generative AI application. AI application 402 can make requests for compiled program code associated with libraries defined in the AI Function Application Programming Interface (API) 406. This request may ultimately rely on the output of a deep neural network configured to provide inferred responses based on, for example, video and location data.
[0060] The runtime engine 408 (which may be compiled code of the runtime framework) may be further accessible to the AI application 402. When the runtime engine 408 is enabled to provide an inference response, the runtime engine may then signal the operating system (OS) space 410, such as the kernel 412, running on the SOC 420. In some examples, the kernel 412 may be a Linux kernel. The operating system may then enable LLM tuning to be performed on the CPU 422, DSP 424, GPU 426, NPU 428, or some combination thereof. The CPU 422 may be directly accessible by the operating system, while other processing blocks may be accessed via drivers such as drivers 414, 416, or 418 for the DSP 424, GPU 426, or NPU 428, respectively. In one example, a deep neural network may be configured to run on a combination of processing blocks such as the CPU 422, DSP 424, and GPU 426, or may run on the NPU 428.
[0061] Large language models (LLMs) are examples of models for understanding natural language. LLMs are increasingly popular for text generation tasks and tasks involving modeling information from both textual and visual domains.
[0062] In many cases, LLMs use a transformer-based architecture (e.g., a transformer model) to process sequences of data, particularly text. In an LLM, the transformer receives a sequence of lexical units (which can be, for example, words, parts of words, or characters) and processes these units through its layers. As data flows through each layer, the model learns increasingly complex representations of the input data. This allows LLMs to perform a wide range of language-related tasks with high fluency and coherence, such as text completion, translation, summarization, and question answering.
[0063] In an LLM, token generation begins with tokenization, where the input text is divided into lexical units, or tokens (which can be complete words, subwords, or even individual characters), forming basic units that the model can interpret. These lexical units are then transformed into numerical vectors through an embedding process that captures the semantic meaning of each lexical unit and encodes that meaning into a format that the model can process.
[0064] Since the transformer lacks inherent sequence processing capabilities, positional encoding is added to these embeddings to provide the model with sequence order information. The tokens are then fed into the layers of the transformer, each layer comprising a multi-head attention (MHA) mechanism and a feedforward network. The MHA mechanism allows the model to selectively weigh the importance of different parts of the input sequence, thus refining its focus based on their relevance to the current processing point.
[0065] The final layer generates a set of output vectors, each representing the model's contextual understanding. To predict the next lexical unit, the output vector at the current end of the sequence is passed through a dense layer equipped with softmax activation. This produces a probability distribution across the entire vocabulary, indicating the likelihood that each lexical unit will follow the existing sequence. The selection of the subsequent lexical unit can be deterministic, using the ArgMax function to select the most probable lexical unit, or stochastic, introducing randomness by sampling from the distribution. The process continues iteratively, with each newly predicted lexical unit being added to the sequence and the model running again on the expanded input until a predefined stopping criterion is met. This recursive process allows LLMs to generate context-rich, coherent text sequences that often bear a striking resemblance to human-generated text.
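The two selection strategies named above (ArgMax and sampling) can be contrasted with a small sketch. The four-entry vocabulary and its scores are hypothetical, and the helper names are chosen for illustration only.

```python
import math
import random


def softmax(logits):
    """Convert raw scores into a probability distribution."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def pick_greedy(probs):
    """Deterministic choice: index of the most probable entry (ArgMax)."""
    return max(range(len(probs)), key=lambda i: probs[i])


def pick_sampled(probs, rng):
    """Stochastic choice: draw one index according to the distribution."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


logits = [1.0, 3.0, 0.5, 2.0]  # hypothetical scores for a 4-entry vocabulary
probs = softmax(logits)
greedy_id = pick_greedy(probs)                       # always the same index
sampled_id = pick_sampled(probs, random.Random(0))   # depends on the random draw
```

In an autoregressive loop, the chosen index would be appended to the sequence and the model run again until a stopping criterion is met.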
[0066] Figures 5 and 6 are diagrams illustrating various aspects of lexical unit generation models according to this disclosure. Figure 5 illustrates an example of a first KV model 500, which provides a pipeline for lexical unit generation for a large language model (LLM) 512. Since some neural processors (such as the NPU 108) may only support static models with a fixed input shape, the first KV model 500 can fix the input shape based on the maximum input length. In doing so, the first KV model 500 can take the data from the user prompt (e.g., valid data) and add padding (e.g., zeros) to the remaining portion of the input, such that the input size can be fixed at the maximum length (e.g., 1024 characters).
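The fixed-shape padding described above can be sketched as follows. The maximum length of 8, the padding identifier 0, and the function name are assumptions made to keep the example small; the text gives 1024 as an example maximum.

```python
MAX_LEN = 8  # hypothetical; the text gives 1024 as an example maximum
PAD_ID = 0   # hypothetical padding value (the text mentions zeros)


def pad_to_fixed(token_ids, max_len=MAX_LEN, pad_id=PAD_ID):
    """Return (padded_ids, valid_flags) with a fixed length of max_len.

    valid_flags marks real prompt entries with 1 and padding with 0.
    """
    if len(token_ids) > max_len:
        raise ValueError("prompt longer than the fixed input length")
    n_pad = max_len - len(token_ids)
    padded = token_ids + [pad_id] * n_pad
    valid = [1] * len(token_ids) + [0] * n_pad
    return padded, valid


padded_ids, valid_flags = pad_to_fixed([5, 6, 7])
```

The input shape is the same for every prompt, so a static model can consume it, at the cost of processing the padded positions as well.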
[0067] The first KV model 500 can receive input text 502. Input text 502 may include system prompts and user prompts. System prompts may be standard prompts, which may, for example, instruct the user on greetings and / or instructions for operating the model (e.g., "Please enter a question"). User prompts may include user input, such as LLM tasks. User prompts may have variable lengths.
[0068] Input text 502 can be provided to a tokenizer 504. The tokenizer 504 can divide the input text 502 into multiple parts called lexical units 506. Lexical units 506 can include sub-parts of the input text 502, such as character sequences (e.g., the average length of a lexical unit can be about four characters), words, or phrases. In the first KV model 500, the tokenizer 504 processes the input text 502, which may include, but is not limited to, words, sentences, paragraphs, or documents, and can generate all of the lexical units 506 for the input text 502. The tokenizer 504 can then provide all of the lexical units 506 to the LLM 512 at once.
[0069] Position embedding 510 can be applied to maintain information related to the order of the lexical units 506. Attention mask 508 can also be applied to identify the more salient lexical units (e.g., 506) corresponding to the input text 502. LLM 512 can then generate a prediction of a single subsequent lexical unit 514. The generated lexical unit 514 can be considered a completion, such as a subsequent word in the response or output. LLM 512 can be configured to generate multiple lexical units 514 in an autoregressive manner by writing the generated lexical unit 514 to memory (such as a KV tensor buffer 516) to preserve the internal state KV$ (e.g., a data structure called a KV cache, which can represent the keys (K) and values (V) of previously generated lexical ... As LLM 512 generates each lexical unit 514, the generated lexical unit 514 can be processed by de-segmenter 524, which reverses the operation of the tokenizer 504, to generate output text 526 (e.g., a sequence of characters, words, or phrases). This process can be repeated in this manner. With each iteration, the generated lexical unit 522 can update the internal state KV$ 518 and can be written to the KV tensor buffer 516 to update the KV tensor buffer 516, which can then be loaded as subsequent input.
[0070] Figure 6 shows a second KV model 600 for lexical unit generation. The second KV model 600 includes components similar to those of the first KV model 500 described with reference to Figure 5. However, in the example of Figure 6, the second KV model 600 can supply lexical units 606 to the LLM 512 one unit at a time, instead of supplying all lexical units at once. In the second KV model 600, there is one input, lexical unit 606, and the LLM 512 generates a first inference, shown as the last generated lexical unit 614, which can be de-segmented to generate an output 626. The KV tensor buffer 516 can be updated based on the last generated lexical unit 614, and the generated lexical unit can be discarded at box 610. The next lexical unit 606 corresponding to the input can be received and can be appended to the internal state loaded from the KV tensor buffer 516 to generate a second lexical unit 622 at the second inference 618, which can be de-segmented by the de-segmenter 524 and output as output 626. The KV tensor buffer 516 can then be updated based on the last generated lexical unit 614, and the generated lexical unit can be discarded at box 610. This process can be repeated until all of the lexical units 606 corresponding to the input have been added to the loop and processed by the LLM 512.
[0071] In short, the first KV model 500 shown in Figure 5 takes in the entire input prompt (e.g., lexical units 506) at once and generates the first lexical unit and a KV cache for the input prompt. Conversely, the second KV model 600 takes in the input prompt one lexical unit at a time and generates the first lexical unit and a KV cache for the prompt. Therefore, the first-lexical-unit latency of the second KV model 600 can be less than that of the first KV model 500. Since the first KV model 500 takes in the entire input prompt (e.g., system prompt and user prompt) at once, the input dimension can be fixed at a maximum length (e.g., 1024 characters). Therefore, the first-lexical-unit latency of the first KV model 500 can also be fixed at the time required to process the maximum input length (e.g., 2.2 seconds), regardless of the length of the input prompt. On the other hand, since the second KV model 600 takes in and processes the input prompt one lexical unit at a time, its first-lexical-unit latency can depend on the length of the input prompt (e.g., 100 milliseconds per lexical unit).
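The trade-off can be made concrete with the example figures quoted above (2.2 seconds fixed, 100 milliseconds per lexical unit). The sketch below is arithmetic only. The function names are hypothetical, and the linear cost model for the second KV model is an assumption that follows the per-unit example figure.

```python
FIXED_LATENCY_S = 2.2      # example figure from the text for the first KV model 500
PER_UNIT_LATENCY_S = 0.1   # example figure from the text for the second KV model 600


def first_unit_latency_model_500(prompt_len):
    """Fixed cost: the padded maximum-length input is always processed."""
    return FIXED_LATENCY_S


def first_unit_latency_model_600(prompt_len):
    """Cost grows with the prompt (assumed linear in the number of units)."""
    return prompt_len * PER_UNIT_LATENCY_S


short_prompt = first_unit_latency_model_600(10)   # about 1.0 s, below 2.2 s
long_prompt = first_unit_latency_model_600(30)    # about 3.0 s, above 2.2 s
```

Under these example numbers the second KV model reaches the first output sooner for prompts shorter than roughly 22 lexical units, and later for longer prompts.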
[0072] As discussed, the transformer's attention layer can utilize a multi-head attention (MHA) mechanism to process and understand input data in a parallel and comprehensive manner. The MHA mechanism improves upon the capabilities of a single-head attention (SHA) mechanism by allowing simultaneous processing across multiple heads. The core computation of SHA can be represented as: $\mathrm{Attention} = \mathrm{softmax}(QK^{T})\,V$, where Q, K, and V denote the query, key, and value. The MHA mechanism enhances SHA by introducing parallelism, allowing the transformer to attend to different parts of the input simultaneously. This improves the transformer's ability to capture various aspects of the input data. Specifically, SHA focuses on a single set (query, key, and value), while MHA splits that focus across multiple sets, each potentially capturing different contextual nuances from the input sequence.
[0073] For example, in the context of natural language processing, each head can focus on different types of relationships between words, such as syntactic and semantic relationships. This allows the transformer to have a more comprehensive understanding of the text. Figure 7 is a flowchart illustrating an example of a process 700 for processing MHA inputs by an MHA mechanism of a transformer model according to various aspects of this disclosure. Process 700 begins with an MHA input (shown as MHAInput) and ends with an MHA output (shown as MHAOutput).
[0074] In the example of Figure 7, the MHA input can be an input tensor that is split into three paths 702, 704, and 706 for parallel processing. Each path's tensor has a shape of 1×1×256×96. Each path 702, 704, and 706 undergoes a first matrix multiplication operation 708 with a matrix of shape 96×96, resulting in a corresponding tensor with an unchanged shape of 1×1×256×96. Each tensor can then be reshaped to a new shape of 1×256×3×32. In the first matrix multiplication operation 708, the two tensors from paths 704 and 706 are transposed to 1×3×256×32, and a matrix multiplication (MatMul) operation is performed, resulting in a shape of 1×3×256×256. A multiplication (Mul) operation is performed, multiplying the tensor element-wise with a broadcast tensor of shape 1, resulting in a 1×3×256×256 tensor. A softmax operation is then performed to normalize the scores in the attention mechanism.
[0075] At the first path 702, the reshaped tensor is transposed to 1×3×256×32. A second matrix multiplication operation 710 is performed on the transposed tensor and the result of the softmax operation, resulting in a tensor with a shape of 1×3×256×32. The tensor is transposed back to 1×256×3×32 and then reshaped to 1×1×256×96. Finally, the tensor is multiplied again by a 96×96 matrix, producing an MHA output with the same shape as the input tensor.
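The shape bookkeeping of process 700 can be traced with a short NumPy sketch. The weights are random placeholders, the mapping of tensors to query, key, and value is illustrative, and the value of the broadcast scalar (here 1/sqrt(32)) is an assumption, since the text only states that a broadcast tensor of shape 1 is multiplied in.

```python
import numpy as np

rng = np.random.default_rng(0)
SEQ, DIM, HEADS = 256, 96, 3
HEAD_DIM = DIM // HEADS  # 32


def softmax(x, axis=-1):
    """Numerically stable softmax along one axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


x = rng.standard_normal((1, 1, SEQ, DIM))  # MHA input, 1x1x256x96
# Placeholder 96x96 projection matrices (random, for shape checking only).
w_q, w_k, w_v, w_o = (rng.standard_normal((DIM, DIM)) * 0.1 for _ in range(4))


def project_and_split(inp, w):
    y = inp @ w                              # 1x1x256x96
    y = y.reshape(1, SEQ, HEADS, HEAD_DIM)   # reshape to 1x256x3x32
    return y.transpose(0, 2, 1, 3)           # transpose to 1x3x256x32


q, k, v = (project_and_split(x, w) for w in (w_q, w_k, w_v))
scores = q @ k.transpose(0, 1, 3, 2)         # MatMul, 1x3x256x256
scores = scores * (1.0 / np.sqrt(HEAD_DIM))  # Mul with an assumed scalar
weights = softmax(scores)                    # normalize attention scores
ctx = weights @ v                            # second MatMul, 1x3x256x32
ctx = ctx.transpose(0, 2, 1, 3)              # back to 1x256x3x32
ctx = ctx.reshape(1, 1, SEQ, DIM)            # back to 1x1x256x96
mha_output = ctx @ w_o                       # final 96x96 MatMul
```

Each attention call in this form needs three reshapes and four transposes around the matrix multiplications, which is the overhead the disclosure goes on to address.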
[0076] As shown in the example of Figure 7, the input tensor undergoes a series of complex transformations and operations within the MHA, including reshaping, transposition, and various matrix multiplications. Process 700 ultimately yields an output tensor that can be used by subsequent layers in the model. Specifically, process 700 generates three vectors from the input data: query (Q), key (K), and value (V). The attention layer can use these vectors to create a set of attention scores based on comparisons between the query and the key. These scores, after being normalized using a softmax function, determine the amount of attention (e.g., focus) that should be given to the corresponding value. The result is a weighted sum of value vectors, carrying both the original information and the context obtained from the attention process. The model can perform this operation simultaneously based on the use of multiple heads.
[0077] As discussed, MHA may reduce the efficiency of neural signal processors (NSPs) that handle neural network operations. An example of an NSP is the NPU 108 described with reference to Figure 1. In most cases, various computations of MHA can be processed on dedicated hardware cores within the NSP. For example, matrix multiplication can be assigned to a hardware core optimized for such operations, while the softmax function that normalizes attention weights can be processed on a vector processing core. The use of multiple dedicated cores can lead to inefficiency. The need to reshape and transpose data tensors for different stages of MHA computation can also introduce additional computational overhead. These additional steps increase the amount of processing power and time allocated to perform attention operations, potentially slowing down overall computation and reducing the throughput of the NSP. Furthermore, the increased complexity of managing data across different cores can increase power consumption and may require more complex hardware designs to maintain efficiency.
[0078] Various aspects of this disclosure relate to improving MHA by splitting it into separate SHA operations. In such aspects, each SHA operation can be executed independently to achieve parallelization across hardware blocks, thereby reducing the number of reshaping and transpose layers. In some examples, multiple single-head attention (SHA) operations can be specified instead of a single MHA operation. Each SHA operation can be associated with a head of the MHA. Additionally, each SHA can be executed independently, thereby increasing parallelization across different hardware blocks. Increased parallelization reduces the dependency of certain operations on specific cores. Additionally, increased parallelization reduces the number of reshaping and transpose layers, which simplifies the overall computation process.
[0079] In some examples, for quantization runtimes that use per-tensor activation quantization, the intermediate activations of each SHA can be quantized independently. This independent quantization mitigates the accuracy loss that may occur due to quantization, thus providing more robust performance in quantized models. Quantization is the process of converting a continuous range of values (e.g., floating-point values) into a finite range of discrete values (e.g., integers), allowing machine learning models (such as transformer models) to be implemented on hardware with limited precision.
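The benefit of quantizing each head independently can be illustrated with a toy example. The sketch assumes symmetric 8-bit per-tensor quantization with a single scale per tensor, and the activation values for the two heads are hypothetical; the disclosure does not specify a particular quantization scheme.

```python
def quantize_per_tensor(values, num_bits=8):
    """Symmetric per-tensor quantization: one scale for the whole tensor."""
    qmax = 2 ** (num_bits - 1) - 1
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return q, scale


def max_abs_error(values, q, scale):
    """Largest absolute difference after dequantizing q with scale."""
    return max(abs(v - qi * scale) for v, qi in zip(values, q))


# Hypothetical activations: head A has a small range, head B a large one.
head_a = [0.01, -0.02, 0.015, 0.005]
head_b = [50.0, -80.0, 120.0, 10.0]

# Fused tensor: both heads share one scale, dominated by head B.
q_fused, s_fused = quantize_per_tensor(head_a + head_b)
err_shared = max_abs_error(head_a, q_fused[:len(head_a)], s_fused)

# Independent quantization: head A gets its own scale.
q_a, s_a = quantize_per_tensor(head_a)
err_independent = max_abs_error(head_a, q_a, s_a)
```

With the shared scale, every value of head A rounds to zero; with its own scale, head A keeps most of its resolution.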
[0080] In some conventional systems, tensors can store the output of each attention head. That is, a tensor of size [number of heads, sequence length, depth] can be computed by the attention head. In contrast to conventional systems, key (K) tensors and value (V) tensors are split and stored independently for each head. That is, multiple head tensors of size [sequence length, depth] can be written. This approach reduces the time data resides in memory and balances memory usage over time.
[0081] Autoregressive models are a type of neural network used in natural language processing and other sequential data tasks. These models generate predictions sequentially, meaning that each new output is conditioned on previous outputs. In the context of language models, an autoregressive model can predict the next word in a sentence given all the previous words. In such models, a cache stores key (K) tensors and value (V) tensors, which are internal representations of the input data generated during the inference process. When the autoregressive model makes a prediction, it uses this cache to quickly access information about previous parts of the sequence without having to recompute them from scratch. That is, the model reads the existing cache of tensors during inference and writes back the entire updated cache.
[0082] An autoregressive model generates one output at a time. Therefore, in some examples, for each inference step, only a new set of keys and values might be added to the cache. That is, instead of updating and storing the entire cache after each individual prediction, only the newly generated keys and values can be updated. This process reduces the demand on memory bandwidth and storage because the model only saves the latest changes instead of rewriting the entire cache. This is analogous to adding new entries to a diary instead of rewriting the entire diary every time a new event occurs.
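The append-only update, combined with per-head storage, can be sketched with plain Python lists. The class name, the write counter, and the vector contents are hypothetical and serve only to compare the number of entries written against rewriting the whole cache at every step.

```python
class PerHeadKVCache:
    """Hypothetical cache holding one key list and one value list per head."""

    def __init__(self, num_heads):
        self.keys = [[] for _ in range(num_heads)]
        self.values = [[] for _ in range(num_heads)]
        self.entries_written = 0  # counts key/value pairs written so far

    def append(self, new_keys, new_values):
        """Write only the newest key/value vector for each head."""
        for h, (k, v) in enumerate(zip(new_keys, new_values)):
            self.keys[h].append(k)
            self.values[h].append(v)
            self.entries_written += 1

    def seq_len(self):
        return len(self.keys[0])


NUM_HEADS, STEPS = 2, 3
cache = PerHeadKVCache(NUM_HEADS)
for step in range(STEPS):
    # Placeholder vectors standing in for the keys/values of the new unit.
    step_keys = [[float(step)] * 4 for _ in range(NUM_HEADS)]
    step_values = [[float(step)] * 4 for _ in range(NUM_HEADS)]
    cache.append(step_keys, step_values)

# For comparison: entries written if the whole cache were rewritten each step.
full_rewrite_entries = sum(NUM_HEADS * (t + 1) for t in range(STEPS))
```

The gap between the two counts grows with sequence length, which is where the reduction in memory bandwidth comes from.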
[0083] Furthermore, in some such examples, cache maintenance can be delegated to the calling application. That is, the calling application, which uses the model for inference, controls the cache update process. By doing so, the amount of data transferred back and forth between the CPU and NSP is reduced, thereby lowering overhead and reducing potential bottlenecks associated with moving large amounts of data.
[0084] Rotary positional encoding (RoPE) can be used in transformers to integrate the order of tokens into the transformer's understanding of the sequence. Unlike regular positional encoding, which adds a fixed vector to the token embedding, RoPE combines token and positional encodings in a rotation-invariant manner. This is done using sine and cosine functions, which associate positional information with token representations without adding encoding vectors. The sine and cosine functions used in RoPE are computationally intensive and not particularly efficient when executed in a quantized runtime. Quantized models can be useful for deployment on devices with limited processing power, such as mobile phones or Internet of Things (IoT) devices, where full-precision computation may be prohibitive.
[0085] To address this inefficiency, in some examples, the computation of RoPE embeddings can be offloaded to the CPU instead of being processed on the NSP. The NSP is dedicated hardware designed to accelerate neural network computations. The NSP may not be optimized for the types of operations associated with RoPE. By pre-computing the RoPE tensor on the CPU for the next inference step while the current inference is still running on the NSP, the embeddings associated with the tensor can be prepared when needed without introducing latency into the processing pipeline. This approach leverages the CPU's capabilities to handle these specific mathematical operations more efficiently, thus not only saving NSP resources but also simplifying the overall computational process.
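A sketch of what the CPU might pre-compute is shown below. It uses the commonly used RoPE formulation (angle = position × base^(−2i/d), applied to consecutive channel pairs) with a base of 10000; both the formulation and the pairing convention are assumptions, since the disclosure does not fix them, and the function names are hypothetical.

```python
import math


def rope_tables(position, head_dim, base=10000.0):
    """Cosine and sine values for one position, one angle per channel pair."""
    half = head_dim // 2
    angles = [position * base ** (-2.0 * i / head_dim) for i in range(half)]
    return [math.cos(a) for a in angles], [math.sin(a) for a in angles]


def apply_rope(vec, cos, sin):
    """Rotate channel pairs (x0, x1), (x2, x3), ... by the tabulated angles."""
    out = []
    for i in range(len(cos)):
        x0, x1 = vec[2 * i], vec[2 * i + 1]
        out.append(x0 * cos[i] - x1 * sin[i])
        out.append(x0 * sin[i] + x1 * cos[i])
    return out


# Tables for the next inference step could be prepared ahead of time on the CPU.
next_cos, next_sin = rope_tables(position=5, head_dim=4)
rotated = apply_rope([1.0, 2.0, 3.0, 4.0], next_cos, next_sin)
```

Once the tables exist, applying them needs only multiplications and additions, so the trigonometric functions never have to run on the accelerator.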
[0086] Pre-computation improves the parallelism of system operations. For example, while the NSP is busy with an inference task, the CPU can work ahead of time to prepare RoPE embeddings for subsequent tasks. This parallelization of tasks improves throughput and reduces latency in the transformer model.
[0087] In some cases, an attention mask can be specified to focus the transformer (e.g., the model) during the inference process, effectively guiding the attention mechanism to focus on certain parts of the input while ignoring others. This selective focus can be useful in tasks such as language translation or text generation. The attention mask may change each time inference is performed, adapting to new inputs and contexts. This dynamic nature means that the attention mask can be recomputed for each step in sequence generation. Recomputing an attention mask typically involves changing the data type of mask elements, which could be a conversion from integers to floating-point numbers and vice versa. These conversions can be resource-intensive and may slow down processing on the NSP.
[0088] In some examples, to mitigate the computational overhead on the NSP, attention masks can be pre-computed on the CPU. CPUs are typically more general-purpose and capable of managing control flow operations (such as loops and conditional statements), which can be specified to generate these attention masks. Pre-computing attention masks on the CPU allows the NSP to remain dedicated to processing deep learning operations. Simultaneously, the CPU handles the control flow logic, which would otherwise impose additional complexity and processing time on the NSP. This division of labor simplifies the inference process and improves the overall efficiency of the model, thereby reducing inference time.
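The kind of loop-and-branch logic that suits a CPU can be sketched as follows. The mask semantics used here (causal ordering plus exclusion of padded positions), the additive floating-point representation, and the constant NEG are assumptions for illustration; the disclosure does not prescribe a specific mask layout.

```python
NEG = -1e9  # hypothetical large negative value standing in for "masked out"


def build_attention_mask(valid_len, max_len):
    """Additive float mask: 0.0 where query i may attend to key j, NEG elsewhere.

    Assumes causal attention (j <= i) and that positions at or beyond
    valid_len are padding that no query should attend to.
    """
    mask = []
    for i in range(max_len):
        row = []
        for j in range(max_len):
            allowed = j <= i and j < valid_len
            row.append(0.0 if allowed else NEG)
        mask.append(row)
    return mask


# Built directly in floating point, so no integer-to-float cast is needed later.
mask = build_attention_mask(valid_len=2, max_len=4)
```

Because the mask is produced as floating-point values on the CPU, the accelerator can add it to the attention scores without any data type conversion.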
[0089] Figure 8 is a flowchart illustrating an example process 800 for parallel processing of MHA inputs by multiple SHA mechanisms of a transformer model, according to various aspects of this disclosure. In the example of Figure 8, the MHA mechanism is divided into multiple SHA mechanisms, each of which corresponds to a head of the MHA mechanism (such as the MHA mechanism described with reference to Figure 7). Each SHA mechanism is associated with one of paths 802, 804, and 806. For ease of explanation, the example of Figure 8 shows only three SHA paths. This disclosure is not limited to three SHA paths.
[0090] Process 800 begins with an MHA input (shown as MHAInput) and ends with an MHA output (shown as MHAOutput). The input to the MHA begins with a four-dimensional tensor of size 1×1×256×96, which is then fed into multiple SHAs, each operating in parallel. This segmentation allows the MHA to independently process different portions of the input data, with each SHA performing its own set of computations.
[0091] In each SHA path 802, 804, and 806, the initial step involves a matrix multiplication (MatMul) with a transformation matrix of shape 96×32, thereby modifying the shape of the input data (1×1×256×32) to facilitate further processing. In some paths, the output of the matrix multiplication is transposed to reshape the data to 1×1×32×256. The reshaped data can be multiplied with the output of another matrix multiplication to produce an output of shape 1×1×256×256. A multiplication (Mul) operation is performed, causing the tensor to be multiplied element-wise with a broadcast tensor of shape 1, resulting in a 1×1×256×256 tensor. A softmax operation is performed to normalize the scores in the attention mechanism.
[0092] Following softmax, the normalized scores are multiplied (via matrix multiplication) with the value tensor, allowing the model to assign appropriate weights to each part of the input based on its contextual meaning. The result is a tensor with a shape of 1×1×256×32. The outputs from the independent processing of each SHA are then concatenated, merging the different perspectives collected by each SHA from the same input data. The concatenated output undergoes a final matrix multiplication as a synthesis step to integrate the insights from each head.
[0093] The final output of this process is a tensor that preserves the shape of the original input, indicating that multiple attention perspectives have been successfully merged into a unified representation. This output is then processed by subsequent layers of the transformer.
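Process 800 can be sketched in NumPy and checked against a fused, reshape-based computation. The weights are random placeholders, the scalar 1/sqrt(32) is an assumed value for the broadcast multiplier, and the heads are evaluated in a simple Python loop here; mapping each head to its own hardware block is outside the scope of the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
SEQ, DIM, HEADS = 256, 96, 3
HEAD_DIM = DIM // HEADS  # 32
SCALE = 1.0 / np.sqrt(HEAD_DIM)  # assumed value of the broadcast scalar


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


x = rng.standard_normal((1, 1, SEQ, DIM))
# One placeholder 96x32 projection per head for query, key, and value.
w_q = [rng.standard_normal((DIM, HEAD_DIM)) * 0.1 for _ in range(HEADS)]
w_k = [rng.standard_normal((DIM, HEAD_DIM)) * 0.1 for _ in range(HEADS)]
w_v = [rng.standard_normal((DIM, HEAD_DIM)) * 0.1 for _ in range(HEADS)]
w_o = rng.standard_normal((DIM, DIM)) * 0.1


def single_head_attention(inp, wq, wk, wv):
    q, k, v = inp @ wq, inp @ wk, inp @ wv   # each 1x1x256x32
    scores = q @ np.swapaxes(k, -1, -2)      # 1x1x256x256
    return softmax(scores * SCALE) @ v       # 1x1x256x32


# Each head is independent of the others, so the calls could run in parallel.
head_outputs = [single_head_attention(x, w_q[h], w_k[h], w_v[h])
                for h in range(HEADS)]
concat = np.concatenate(head_outputs, axis=-1)  # 1x1x256x96
sha_output = concat @ w_o                       # final 96x96 MatMul


def fused_reference(inp):
    """Reshape/transpose-based MHA using the same weights, for comparison."""
    def split(y):
        return y.reshape(1, SEQ, HEADS, HEAD_DIM).transpose(0, 2, 1, 3)
    q = split(inp @ np.concatenate(w_q, axis=1))
    k = split(inp @ np.concatenate(w_k, axis=1))
    v = split(inp @ np.concatenate(w_v, axis=1))
    ctx = softmax((q @ k.transpose(0, 1, 3, 2)) * SCALE) @ v
    return ctx.transpose(0, 2, 1, 3).reshape(1, 1, SEQ, DIM) @ w_o


mha_output = fused_reference(x)
```

The per-head path needs only one transpose per head (for the key) and no reshapes, while producing the same result as the fused path up to floating-point rounding.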
[0094] Figure 9 is a flowchart illustrating an example of a process 900 for generating multiple SHA mechanisms from an MHA associated with a transformer model, according to various aspects of this disclosure. For example, process 900 may be executed by one or more processors, such as CPUs (e.g., 102, 422), GPUs (e.g., 104, 426), and / or other processing units (e.g., DSP 424, NPU 428). Process 900 begins at block 902 by generating a set of single-head attention (SHA) operations based on multiple attention heads in the multi-head attention (MHA) mechanism, each SHA operation corresponding to a specific attention head in the set of attention heads associated with the MHA mechanism. At block 904, process 900 executes each of the set of SHA operations independently and in parallel across hardware blocks of the device associated with the neural network model. At block 906, process 900 generates an MHA output based on the parallel execution of each of the set of SHA operations.
[0095] Specific implementation examples are provided in the following numbered clauses.
[0096] Clause 1. A processor-implemented method comprising: generating a set of single-head attention (SHA) operations based on a plurality of attention heads in a multi-head attention (MHA) mechanism, each SHA operation corresponding to a corresponding attention head in a set of attention heads associated with the MHA mechanism; independently and in parallel executing each of the set of SHA operations among hardware blocks of a device associated with a neural network model; and generating an MHA output based on the parallel execution of each of the set of SHA operations.
[0097] Clause 2. The processor-implemented method according to Clause 1, the processor-implemented method further comprising quantizing the intermediate activation of each of the set of SHA operations independently in the quantized runtime environment via per-tensor activation quantization.
[0098] Clause 3. The processor-implemented method according to any one of Clauses 1 to 2, the processor-implemented method further comprising: splitting a key (K) tensor and a value (V) tensor associated with the output of the set of attention heads into a set of tensors, each of the tensors corresponding to one of the attention heads in the set of attention heads and having a dimension defined by a sequence length and a depth; and writing each of the tensors in the set of tensors into local memory.
[0099] Clause 4. A processor-implemented method according to any one of Clauses 1 to 3, the processor-implemented method further comprising: reading an existing tensor from a cache during inference; writing only the latest cache entry updated during the inference to the cache; and enabling the cache to be accessed by one or more processors associated with the neural network model by invoking an application-managed cache maintenance operation.
[0100] Clause 5. The processor-implemented method according to any one of Clauses 1 to 4, the processor-implemented method further comprising, while performing the current inference on one or more neural network processors associated with the neural network model, pre-computing a rotary position encoding (RoPE) tensor on one or more central processing units (CPUs) associated with the neural network model.
[0101] Clause 6. The method according to any one of Clauses 1 to 5, the method further comprising: pre-computing one or more attention masks; and applying the one or more attention masks during inference, wherein each of the one or more attention masks changes with each inference.
[0102] Clause 7. An apparatus comprising a processor, a memory coupled to the processor, and instructions stored in the memory and operable, when executed by the processor, to cause the apparatus to perform any one of Clauses 1 to 6.
[0103] Clause 8. An apparatus comprising at least one component for performing any one of Clauses 1 to 6.
[0104] Clause 9. A computer program comprising code for causing a device to perform any one of Clauses 1 to 6.
[0105] The various operations of the methods described above can be performed by any suitable component capable of performing the corresponding function. These components may include various hardware and / or software components and / or modules, including but not limited to circuits, application-specific integrated circuits (ASICs), or processors. Generally, where operations are illustrated in the accompanying drawings, these operations may have corresponding counterpart means-plus-function components with similar numbering.
[0106] As used, the term "determine" encompasses a wide variety of actions. For example, "determine" can include calculating, computing, processing, deriving, investigating, looking up (e.g., looking up in a table, database, or other data structure), ascertaining, etc. Additionally, "determine" can include receiving (e.g., receiving information), accessing (e.g., accessing data in memory), etc. Furthermore, "determine" can include resolving, selecting, choosing, establishing, etc.
[0107] As used, a phrase referring to "at least one of" a list of items refers to any combination of these items, including a single member. As an example, "at least one of a, b, or c" is intended to cover: a, b, c, ab, ac, bc, and abc.
[0108] The various exemplary logic blocks, modules, and circuits described in this disclosure can be implemented or executed using a general-purpose processor, digital signal processor (DSP), application-specific integrated circuit (ASIC), field-programmable gate array (FPGA) or other programmable logic device (PLD), discrete gate or transistor logic component, discrete hardware component, or any combination thereof designed to perform the described functions. While the general-purpose processor may be a microprocessor, in alternative embodiments, the processor may be any commercially available processor, controller, microcontroller, or state machine. The processor may also be implemented as a combination of computing devices, such as a combination of a DSP and a microprocessor, multiple microprocessors, one or more microprocessors combined with a DSP core, or any other such configuration.
[0109] The steps or algorithms of the methods described in this disclosure may be directly embodied in hardware, a software module executed by a processor, or a combination of both. The software module may reside in any form of storage medium known in the art. Some examples of usable storage media include random access memory (RAM), read-only memory (ROM), flash memory, erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), registers, hard disks, removable disks, CD-ROMs, and the like. The software module may include a single instruction or multiple instructions and may be distributed across several different code segments, across different programs, and across multiple storage media. The storage medium may be coupled to the processor, enabling the processor to read information from and write information to the storage medium. Alternatively, the storage medium may be integral with the processor.
[0110] The disclosed method includes one or more steps or actions for implementing the described method. The steps and / or actions of the method may be interchanged without departing from the scope of the claims. In other words, unless a specific order of steps or actions is specified, the order and / or use of a particular step and / or action may be modified without departing from the scope of the claims.
[0111] The described functionality can be implemented in hardware, software, firmware, or any combination thereof. If implemented in hardware, an example hardware configuration may include a processing system within the device. This processing system may utilize a bus architecture. Depending on the specific application and overall design constraints of the processing system, the bus may include any number of interconnecting buses and bridges. The bus can link various circuits together, including processors, machine-readable media, and bus interfaces. The bus interface can be used to connect, among other things, a network adapter to the processing system via the bus. The network adapter can be used to implement signal processing functions. In some aspects, user interfaces (e.g., keypads, displays, mice, joysticks, etc.) may also be connected to the bus. The bus may also link various other circuits, such as timing sources, peripherals, voltage regulators, power management circuits, etc., which are well known in the art and will not be described further.
[0112] A processor may be responsible for managing the bus and general-purpose processing, including executing software stored on a machine-readable medium. A processor may be implemented using one or more general-purpose processors and / or special-purpose processors. Examples include microprocessors, microcontrollers, DSP processors, and other circuitry that can execute software. Software should be interpreted broadly as instructions, data, or any combination thereof, whether referred to as software, firmware, middleware, microcode, hardware description language, or otherwise. By way of example, a machine-readable medium may include random access memory (RAM), flash memory, read-only memory (ROM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), registers, disks, optical disks, hard disks, or any other suitable storage medium, or any combination thereof. A machine-readable medium may be embodied as a computer program product. A computer program product may include packaging material.
[0113] In a hardware implementation, machine-readable media may be part of the processing system, separate from the processor. However, as those skilled in the art will readily understand, machine-readable media, or any portion thereof, can be external to the processing system. By way of example, machine-readable media may include transmission lines, carrier waves modulated by data, and / or computer components separate from the device, all accessible to the processor via a bus interface. Alternatively or additionally, machine-readable media, or any portion thereof, may be integrated into the processor, such as in the case of a cache and / or a general-purpose register file. Although the various components discussed may be described as having a specific location, such as local components, they may also be configured in various ways, such as certain components being configured as part of a distributed computing system.
[0114] The processing system may be configured as a general-purpose processing system having one or more microprocessors providing processor functionality and external memory providing at least a portion of the machine-readable media, all of which are linked together with other supporting circuitry via an external bus architecture. Alternatively, the processing system may include one or more neuromorphic processors for implementing the described neuron models and models of neural systems. As another alternative, the processing system may be implemented using an application-specific integrated circuit (ASIC) having a processor, bus interface, user interface, supporting circuitry, and at least a portion of the machine-readable media integrated on a single chip, or using one or more field-programmable gate arrays (FPGAs), programmable logic devices (PLDs), controllers, state machines, gated logic, discrete hardware components, or any other suitable circuitry, or any combination of circuitry capable of performing the various functionalities described throughout this disclosure. Those skilled in the art will recognize how best to implement the described functionality of the processing system, depending on the specific application and the overall design constraints imposed on the system as a whole.
[0115] Machine-readable media may include multiple software modules. These software modules include instructions that, when executed by a processor, cause the processing system to perform various functions. Software modules may include send and receive modules. Each software module may reside in a single storage device or be distributed across multiple storage devices. For example, when a triggering event occurs, a software module may be loaded from a hard disk drive into RAM. During the execution of a software module, the processor may load some of the instructions into a cache to improve access speed. One or more cache lines may then be loaded into a general-purpose register file for processor execution. When the functionality of a software module is referred to below, it will be understood that such functionality is implemented by the processor when executing the instructions from that software module. Furthermore, it should be understood that aspects of this disclosure result in improvements to the functionality of a processor, computer, machine, or other system implementing such aspects.
[0116] If implemented in software, the functions may be stored on, or transmitted over, a computer-readable medium as one or more instructions or code. Computer-readable media include both computer storage media and communication media, including any medium that facilitates the transfer of a computer program from one location to another. A storage medium can be any available medium accessible to a computer. By way of example and not limitation, such computer-readable media may include RAM, ROM, EEPROM, CD-ROM or other optical disc storage, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to carry or store the desired program code in the form of instructions or data structures and is accessible to a computer. Additionally, any connection is properly termed a computer-readable medium. For example, if the software is transmitted from a website, server, or other remote source using coaxial cable, optical fiber, twisted pair, digital subscriber line (DSL), or wireless technologies such as infrared (IR), radio, and microwave, then such coaxial cable, optical fiber, twisted pair, DSL, or wireless technologies such as infrared, radio, and microwave are included in the definition of medium. Disk and disc, as used, include compact disc (CD), laser disc, optical disc, digital versatile disc (DVD), floppy disk, and Blu-ray® disc, where disks usually reproduce data magnetically, while discs reproduce data optically with lasers. Therefore, in some aspects, computer-readable media may include non-transitory computer-readable media (e.g., tangible media). Furthermore, in other aspects, computer-readable media may include transitory computer-readable media (e.g., signals). Combinations of the above should also be included within the scope of computer-readable media.
[0117] Therefore, certain aspects may include a computer program product for performing the presented operations. For example, such a computer program product may include a computer-readable medium on which instructions are stored (and / or encoded) that can be executed by one or more processors to perform the described operations. In some aspects, the computer program product may include packaging material.
[0118] Furthermore, it should be understood that modules and / or other suitable components for performing the described methods and techniques may be downloaded and / or otherwise obtained by the user terminal and / or base station where applicable. For example, such devices can be coupled to a server to facilitate the transfer of components for performing the described methods. Alternatively, the various methods described can be provided via storage components (e.g., RAM, ROM, physical storage media such as CDs or floppy disks) so that the user terminal and / or base station can obtain the various methods once the storage component is coupled to or provided to the device. Furthermore, any other suitable technique for providing the described methods and techniques to the device may be utilized.
[0119] It should be understood that the claims are not limited to the precise configurations and components illustrated above. Various modifications, variations, and alterations may be made to the arrangement, operation, and details of the methods and apparatus described above without departing from the scope of the claims.
Claims
1. A processor-implemented method comprising: generating a set of single-head attention (SHA) operations based on a plurality of attention heads in a multi-head attention (MHA) mechanism, each SHA operation corresponding to a respective attention head in a set of attention heads associated with the MHA mechanism; independently and in parallel executing each SHA operation in the set of SHA operations between hardware blocks of a device associated with a neural network model; and generating an MHA output based on the parallel execution of each SHA operation in the set of SHA operations.
2. The processor-implemented method of claim 1, further comprising independently quantizing intermediate activations of each SHA operation in the set of SHA operations via per-tensor activation quantization in a quantized runtime environment.
3. The processor-implemented method of claim 1, further comprising: splitting key (K) and value (V) tensors associated with outputs of the set of attention heads into a set of tensors, each tensor in the set of tensors corresponding to one attention head in the set of attention heads and having dimensions defined by a sequence length and a depth; and writing each tensor in the set of tensors to a local memory.
4. The processor-implemented method of claim 1, further comprising: reading existing tensors from a cache during inference; writing only the most recent cache entries that are updated during the inference to the cache; and enabling the cache to be accessed by one or more processors associated with the neural network model via invoking application-managed cache maintenance operations.
5. The processor-implemented method of claim 1, further comprising precomputing a rotary position encoding (RoPE) tensor on one or more central processing units (CPUs) associated with the neural network model while a current inference is being executed on one or more neural network processors associated with the neural network model.
6. The processor-implemented method of claim 1, further comprising: precomputing one or more attention masks; and applying the one or more attention masks during inference, wherein each attention mask in the one or more attention masks changes with each inference.
7. An apparatus comprising: means for generating a set of single-head attention (SHA) operations based on a plurality of attention heads in a multi-head attention (MHA) mechanism, each SHA operation corresponding to a respective attention head in a set of attention heads associated with the MHA mechanism; means for independently and in parallel executing each SHA operation in the set of SHA operations between hardware blocks of a device associated with a neural network model; and means for generating an MHA output based on the parallel execution of each SHA operation in the set of SHA operations.
8. The apparatus of claim 7, further comprising means for independently quantizing intermediate activations of each SHA operation in the set of SHA operations via per-tensor activation quantization in a quantized runtime environment.
9. The apparatus of claim 7, further comprising: means for splitting key (K) and value (V) tensors associated with outputs of the set of attention heads into a set of tensors, each tensor in the set of tensors corresponding to one attention head in the set of attention heads and having dimensions defined by a sequence length and a depth; and means for writing each tensor in the set of tensors to a local memory.
10. The apparatus of claim 7, further comprising: means for reading existing tensors from a cache during inference; means for writing only the most recent cache entries that are updated during the inference to the cache; and means for enabling the cache to be accessed by one or more processors associated with the neural network model via invoking application-managed cache maintenance operations.
11. The apparatus of claim 7, further comprising means for precomputing a rotary position encoding (RoPE) tensor on one or more central processing units (CPUs) associated with the neural network model while a current inference is being executed on one or more neural network processors associated with the neural network model.
12. The apparatus of claim 7, further comprising: means for precomputing one or more attention masks; and means for applying the one or more attention masks during inference, wherein each attention mask in the one or more attention masks changes with each inference.
13. An apparatus comprising: one or more processors; and one or more memories coupled to the one or more processors and storing instructions that, when executed by the one or more processors, cause the apparatus to: generate a set of single-head attention (SHA) operations based on a plurality of attention heads in a multi-head attention (MHA) mechanism, each SHA operation corresponding to a respective attention head in a set of attention heads associated with the MHA mechanism; independently and in parallel execute each SHA operation in the set of SHA operations among hardware blocks of a device associated with a neural network model; and generate an MHA output based on the parallel execution of each SHA operation in the set of SHA operations.
14. The apparatus of claim 13, wherein execution of the instructions further causes the apparatus to independently quantize the intermediate activation of each of the set of SHA operations in a quantized runtime environment via per-tensor activation quantization.
15. The apparatus of claim 13, wherein execution of the instructions further causes the apparatus to: split key (K) and value (V) tensors associated with outputs of the set of attention heads into a set of tensors, each tensor in the set of tensors corresponding to one attention head in the set of attention heads and having dimensions defined by a sequence length and a depth; and write each tensor in the set of tensors to a local memory.
16. The apparatus of claim 13, wherein execution of the instructions further causes the apparatus to: read existing tensors from a cache during inference; write only the most recent cache entries that are updated during the inference to the cache; and enable the cache to be accessed by one or more processors associated with the neural network model via invoking application-managed cache maintenance operations.
17. The apparatus of claim 13, wherein execution of the instructions further causes the apparatus to precompute a rotary position encoding (RoPE) tensor on one or more central processing units (CPUs) associated with the neural network model while a current inference is being executed on one or more neural network processors associated with the neural network model.
18. The apparatus of claim 13, wherein execution of the instructions further causes the apparatus to: precompute one or more attention masks; and apply the one or more attention masks during inference, wherein each attention mask in the one or more attention masks changes with each inference.
19. A non-transitory computer-readable medium having program code recorded thereon, the program code being executed by one or more processors and comprising: program code for generating a set of single-head attention (SHA) operations based on a plurality of attention heads in a multi-head attention (MHA) mechanism, each SHA operation corresponding to a respective attention head in a set of attention heads associated with the MHA mechanism; program code for independently and in parallel executing each SHA operation in the set of SHA operations among hardware blocks of a device associated with a neural network model; and program code for generating an MHA output based on the parallel execution of each SHA operation in the set of SHA operations.
20. The non-transitory computer-readable medium of claim 19, wherein the program code further comprises program code for independently quantizing the intermediate activation of each of the set of SHA operations in the quantized runtime environment via per-tensor activation quantization.
21. The non-transitory computer-readable medium of claim 19, wherein the program code further comprises: program code for splitting key (K) and value (V) tensors associated with outputs of the set of attention heads into a set of tensors, each tensor in the set of tensors corresponding to one attention head in the set of attention heads and having dimensions defined by a sequence length and a depth; and program code for writing each tensor in the set of tensors to a local memory.
22. The non-transitory computer-readable medium of claim 19, wherein the program code further comprises: program code for reading existing tensors from a cache during inference; program code for writing only the most recent cache entries that are updated during the inference to the cache; and program code for enabling the cache to be accessed by one or more processors associated with the neural network model via invoking application-managed cache maintenance operations.
23. The non-transitory computer-readable medium of claim 19, wherein the program code further comprises program code for precomputing a rotary position encoding (RoPE) tensor on one or more central processing units (CPUs) associated with the neural network model while a current inference is being executed on one or more neural network processors associated with the neural network model.
24. The non-transitory computer-readable medium of claim 19, wherein the program code further comprises: program code for precomputing one or more attention masks; and program code for applying the one or more attention masks during inference, wherein each attention mask in the one or more attention masks changes with each inference.
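The decomposition recited in claim 1, together with the per-head tensor split of claim 3, can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the applicant's implementation. The function names (`split_heads`, `single_head_attention`, `mha_via_sha`) are hypothetical, and worker threads merely stand in for the hardware blocks. It also assumes that the Q, K, and V inputs are already projected, that scaled dot-product attention is used within each head, and that the MHA output is formed by concatenating the head outputs. The final output projection, masking, RoPE, and quantization are omitted.

```python
import math
from concurrent.futures import ThreadPoolExecutor


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax_rows(m):
    """Apply a numerically stable softmax to each row of a matrix."""
    out = []
    for row in m:
        peak = max(row)
        exps = [math.exp(value - peak) for value in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def split_heads(x, num_heads):
    """Split a [seq_len, num_heads * depth] matrix into num_heads
    separate [seq_len, depth] matrices, one per attention head."""
    depth = len(x[0]) // num_heads
    return [[row[h * depth:(h + 1) * depth] for row in x] for h in range(num_heads)]


def single_head_attention(q, k, v):
    """One SHA operation: softmax(q @ k^T / sqrt(depth)) @ v."""
    depth = len(q[0])
    k_t = [list(col) for col in zip(*k)]
    scores = matmul(q, k_t)
    scaled = [[s / math.sqrt(depth) for s in row] for row in scores]
    return matmul(softmax_rows(scaled), v)


def mha_via_sha(q, k, v, num_heads):
    """Build one SHA operation per head, run them independently, and
    concatenate the per-head results along the feature axis."""
    q_heads, k_heads, v_heads = (split_heads(t, num_heads) for t in (q, k, v))
    # Assumption: threads only model independent hardware blocks here; in
    # CPython they do not give real parallel speed-up for this workload.
    with ThreadPoolExecutor(max_workers=num_heads) as pool:
        head_outputs = list(pool.map(single_head_attention, q_heads, k_heads, v_heads))
    return [sum((head[i] for head in head_outputs), []) for i in range(len(q))]


if __name__ == "__main__":
    q = [[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0]]
    k = [[0.5, 0.1, 0.2, 0.3], [0.4, 0.9, 0.7, 0.6]]
    v = [[1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0]]
    for row in mha_via_sha(q, k, v, num_heads=2):
        print(row)
```

Because every head works on its own [sequence length, depth] tensors, no head needs a reshape or transpose of the combined multi-head tensor, and no head depends on another head's intermediate results.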