Dynamic scaling of neural network based on input sequence length

Dynamic scaling of MHA layers in transformer networks optimizes compute cycles and memory usage by adapting to actual input sequence lengths, addressing inefficiencies in static compilation and padding methods.

WO2026142734A1 · PCT designated stage · Publication Date: 2026-07-02 · INTEL CORP


Patent Information

Authority / Receiving Office: WO · WO
Patent Type: Applications
Current Assignee / Owner: INTEL CORP
Filing Date: 2025-05-20
Publication Date: 2026-07-02

Application Information

Patent Timeline

20 May 2025: Application
02 Jul 2026: Publication (WO2026142734A1)

IPC: G06N3/045; G06N3/08; G06F8/38; G06F40/284

AI Tagging

Technology Topics

Parallel computing Processing element


AI Technical Summary

Technical Problem

Transformer networks face inefficiencies due to static compilation for maximum sequence lengths, leading to increased memory footprint and performance loss when processing input sequences significantly shorter than the maximum length, and existing solutions like pre-compiling for different shapes or padding with sparsity are computationally inefficient and memory-intensive.

Method used

Dynamic scaling of Multi-Head Attention (MHA) layers in transformer networks by compiling for maximum sequence length and processing actual sequence length at runtime, partitioning workloads into subtensors, and selectively executing or padding tensors to optimize compute cycles and memory usage.

Benefits of technology

This approach reduces inference time, power consumption, and improves performance by dynamically adapting to input sequence lengths, achieving better efficiency and reduced memory access.


Abstract

A system may facilitate dynamic scaling of multi-head attention (MHA) layers in transformer networks. The system may include a compiler and a neural processing unit (NPU). The compiler may generate workload descriptors that define a plurality of workloads for performing an operation in an MHA layer. The compiler may generate the workload descriptors based on the maximum sequence length (i.e., the maximum number of tokens) that the transformer network supports. The workloads may correspond to various portions of the maximum sequence length, respectively. The NPU may dynamically scale the MHA layer during runtime. The input sequence length for an execution of the transformer network may be less than the maximum sequence length. The NPU may select one or more workloads from the plurality of workloads based on the sequence length and the workload descriptors pre-generated by the compiler. The NPU may execute the selected workload(s) and skip the other workload(s).


Description

DYNAMIC SCALING OF NEURAL NETWORK BASED ON INPUT SEQUENCE LENGTH

Cross-Reference to Related Application

[0001] This application claims the benefit of U.S. Provisional Patent Application No. 63/738,216, filed December 23, 2024, and entitled "DYNAMIC SCALING OF NEURAL NETWORK BASED ON INPUT SEQUENCE LENGTH," which is incorporated by reference in its entirety for all purposes.

Technical Field

[0002] This disclosure relates generally to neural networks (also referred to as "deep neural networks" or "DNNs"), and more specifically, to dynamic scaling of DNNs (e.g., transformer networks) based on input sequence lengths.

Background

[0003] DNNs are used extensively for a variety of artificial intelligence applications ranging from computer vision to speech recognition and natural language processing due to their ability to achieve high accuracy. However, the high accuracy comes at the expense of significant computation cost. DNNs have extremely high computing demands as there can be a large number of operations as well as a large amount of data to read and write. Therefore, techniques to improve efficiency for executing DNNs are needed.

Brief Description of the Drawings

[0004] Embodiments can be readily understood by the following detailed description in conjunction with the accompanying drawings. To facilitate this description, like reference numerals designate like structural elements. Embodiments are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings.

[0005] FIG. 1 illustrates an example transformer model, in accordance with various embodiments.

[0006] FIG. 2 illustrates an example embedding layer, in accordance with various embodiments.

[0007] FIGS. 3A and 3B illustrate an example multi-head attention (MHA) layer, in accordance with various embodiments.

[0008] FIG. 4 illustrates neural network operations in an MHA layer, in accordance with various embodiments.

[0009] FIG. 5 illustrates dynamic scaling of a matrix multiplication (MatMul) operation, in accordance with various embodiments.

[0010] FIG. 6 illustrates a dynamic scaling mode of an intermediate MatMul operation, in accordance with various embodiments.

[0011] FIG. 7 illustrates another dynamic scaling mode of the intermediate MatMul operation, in accordance with various embodiments.

[0012] FIG. 8 illustrates dynamic scaling of a Softmax operation, in accordance with various embodiments.

[0013] FIG. 9 illustrates dynamic scaling of another Softmax operation, in accordance with various embodiments.

[0014] FIG. 10 illustrates dynamic scaling of a final MatMul operation, in accordance with various embodiments.

[0015] FIG. 11 is a block diagram of a DNN system, in accordance with various embodiments.

[0016] FIG. 12 is a block diagram of a DNN module, in accordance with various embodiments.

[0017] FIG. 13 illustrates an example sparse cell, in accordance with various embodiments.

[0018] FIG. 14 illustrates an example sparse cell array, in accordance with various embodiments.

[0019] FIG. 15 illustrates an example processing element (PE), in accordance with various embodiments.

[0020] FIG. 16 illustrates positional encoding, in accordance with various embodiments.

[0021] FIG. 17 illustrates an example linear classifier, in accordance with various embodiments.

[0022] FIG. 18 illustrates a first inference stage of a transformer model, in accordance with various embodiments.

[0023] FIG. 19 illustrates subsequent inference stages of the transformer model, in accordance with various embodiments.

[0024] FIG. 20 is a flowchart of a method for executing a neural network layer, in accordance with various embodiments.

[0025] FIG. 21 is a block diagram of an example computing device, in accordance with various embodiments.

Detailed Description

Overview

[0026] The last decade has witnessed a rapid rise in artificial intelligence (AI) based data processing, particularly based on DNNs. DNNs are widely used in the domains of computer vision, speech recognition, image, and video processing mainly due to their ability to achieve beyond human-level accuracy. A DNN typically includes a sequence of layers. A DNN layer may include one or more neural network operations (also referred to as "neural network operations"), such as convolution, matrix multiplication, layer normalization, batch normalization, Softmax operation, pooling, elementwise operation, linear operation, nonlinear operation, and so on.

[0027] Input or output data of neural network operations may be arranged in data structures called tensors. A tensor is a data structure having multiple elements across one or more dimensions. Examples of tensors include vector (which is one-dimensional (1D) tensor), matrix (which is two-dimensional (2D) tensor), three-dimensional (3D) tensors, four-dimensional (4D) tensors, and even higher dimensional tensors. A dimension of a tensor may correspond to an axis, e.g., an axis in a coordinate system. A dimension may be measured by the number of data points along the axis. The dimensions of a tensor may define the shape of the tensor. A DNN layer may receive one or more input tensors and compute an output tensor from the one or more input tensors. Taking a convolutional layer for example, the input tensors include an activation tensor (also referred to as "input feature map (IFM)" or "input activation tensor") including one or more activations (also referred to as "input elements") and a weight tensor. The weight tensor may be a kernel (a 2D weight tensor), a filter (a 3D weight tensor), or a group of filters (a 4D weight tensor). A convolution may be performed on the input activation tensor and weight tensor to compute an output activation tensor in the convolutional layer.

[0028] The significant improvements in DNN model size and accuracy coupled with the rapid increase in computing power of execution platforms have led to the adoption of DNN applications even within resource constrained mobile and edge devices that have limited energy availability. DNN models may be executed, e.g., for training or inference, by DNN accelerators. A DNN accelerator may be or include one or more data processing units. A data processing unit may also be referred to as a compute block or compute tile. A data processing unit may include PEs that can carry out neural network operations. A PE may include a multiply-accumulate (MAC) unit that is configured to perform MAC operations.

[0029] Neural Processing Units (NPUs) are typically designed for CNNs where the network structure is static and can be compiled with fixed parameters. The compiled network can be predictably executed during inference. The transformer network is a relatively new type of neural network architecture. It can be used to improve the performance of natural language processing (NLP) tasks, such as machine translation and language modeling. An innovation of the transformer network is the MHA mechanism, which allows the model to selectively focus on different parts of the input sequence when making predictions. The attention mechanism typically works by computing a weighted sum of the input sequence, where the weights are learned based on how relevant each input element is to the current prediction. This can allow the model to dynamically adjust its focus depending on the context and the task at hand. The runtime of the attention layer in transformers can scale quadratically with sequence length because the self-attention mechanism computes similarity scores for all pairs of positions in an input sequence. This usually means that for an input sequence of length n, there are n^2 pairs of positions, leading to a quadratic computational complexity. In decoder networks, where the context may be extremely sparse because of its autoregressive nature, the solution may be to run the network with dynamic shapes.
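The quadratic growth described above can be seen in a minimal single-head sketch. It is an illustration only: the function name `attention_weights`, the use of plain Python lists, and the scaled dot-product form are assumptions and are not taken from this disclosure.

```python
import math

def attention_weights(queries, keys):
    """Return an n x n matrix of attention weights for one head.

    Illustrative only: scaled dot-product scores followed by a row-wise
    Softmax. One score is computed for every (query, key) pair, which is
    why the work grows with the square of the sequence length.
    """
    d = len(queries[0])
    weights = []
    for q in queries:
        # One similarity score per key position.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        # Numerically stable Softmax over the row.
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # 3 positions, 2 features each
w = attention_weights(tokens, tokens)
print(len(w), len(w[0]))  # 3 rows and 3 columns, i.e., 9 scores for n = 3
```

Each row of the result sums to 1, and the matrix has one entry per pair of positions.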

[0030] In an example of performing a language translation task, the input sentence may be first tokenized. For instance, the total input length may be 3. Next, the first inference is performed, and the tokenized input sequence is passed through the decoder network. The input may be denoted as [T_i(0), T_i(1), T_i(2)]. As this is a classification task, the network generates a probability distribution over each possible next token. A token T_o(0) is sampled. Next, the second inference is performed. For the second inference, Input = [T_i(0), T_i(1), T_i(2), T_o(0)]. The next inference may be performed on an input with length 5. A new output token T_o(2) is sampled. This process continues until the last inference is performed.
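The growth of the input from one inference to the next can be sketched as a short loop. This is a hedged illustration: `decode` and the stand-in `sample_next` callback are hypothetical names, and a real model would sample from the predicted probability distribution rather than from a fixed rule.

```python
def decode(prompt_tokens, sample_next, num_new_tokens):
    """Run autoregressive inferences and record each input length.

    `sample_next` is a hypothetical callback that stands in for the
    network plus the sampling step; it receives the current sequence
    and returns one new token.
    """
    sequence = list(prompt_tokens)
    lengths = []
    for _ in range(num_new_tokens):
        lengths.append(len(sequence))           # input length of this inference
        sequence.append(sample_next(sequence))  # append the sampled output token
    return sequence, lengths

prompt = ["Ti(0)", "Ti(1)", "Ti(2)"]
# Stand-in sampler: names the k-th generated token "To(k)".
sequence, lengths = decode(prompt, lambda s: f"To({len(s) - len(prompt)})", 3)
print(lengths)   # input lengths seen by successive inferences
print(sequence)
```

With a prompt of three tokens, successive inferences see inputs of length 3, 4, and 5, so the input shape changes on every run.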

[0031] Transformer networks can be more dynamic than CNNs. This can expose limitations in the static compiler model used for CNNs. When a transformer model is being compiled, it is usually compiled for the maximum sequence length supported by the model. Consequently, at inference time the transformer networks cannot be optimally executed. The input to the transformer model could be much smaller than the maximum sequence length, but the full sequence length is processed as it is not feasible to recompile the transformer model based on the input sequence length. As an example, for a transformer supporting an input prompt length of 512 tokens, the static model would need to size the input and all intermediate layers to support up to 512 tokens. In practice, the actual prompt length could be much smaller (e.g., 64 tokens), which could allow for a much smaller footprint. However, due to the static compilation, the hardware is not able to take advantage of the smaller prompt length, which results in a larger memory footprint and performance loss.
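The size of the gap in the 512-token versus 64-token example can be estimated with simple arithmetic. The sketch below counts only the attention score matrix of a single head, which is an assumption made for illustration; the helper name `attention_score_elements` is hypothetical.

```python
def attention_score_elements(seq_len: int, num_heads: int = 1) -> int:
    """Number of elements in the attention score matrices.

    Assumption for illustration: each head holds one seq_len x seq_len
    score matrix; other tensors in the layer are ignored.
    """
    return num_heads * seq_len * seq_len

MAX_SEQ_LEN = 512     # length the model is statically compiled for
ACTUAL_SEQ_LEN = 64   # length of the prompt actually received

static_elems = attention_score_elements(MAX_SEQ_LEN)
actual_elems = attention_score_elements(ACTUAL_SEQ_LEN)
wasted_fraction = 1 - actual_elems / static_elems

print(static_elems, actual_elems)   # 262144 4096
print(static_elems // actual_elems) # 64
print(wasted_fraction)              # 0.984375
```

Under these assumptions, a statically sized score matrix is 64 times larger than the prompt needs, so roughly 98% of its entries correspond to positions that carry no input tokens.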

[0032] A currently available solution is to pre-compile several models reshaped for different input shapes. This method can work well when the number of different shapes is small enough to afford the increased time for multiple reshapes and compilations as well as the increased amount of consumed memory. However, even when the number of different shapes is small, this method still leads to inefficient use of memory. As this method does not scale well, it is usually used in combination with padding. The input sequence is padded with pad tokens in many currently available approaches. This does not require special handling of different input shapes but has the clear disadvantage of being compute inefficient, since the networks mostly act on pad tokens, which are not useful for computing the last token. Another currently available approach is to pad the input sequence with sparsity. This can help mitigate the inefficiency, but there can still be a large penalty with processing sparse elements in the input sequence. Also, there is a memory overhead for storing a sparsity bitmap. This approach is not trivial as it typically requires sparsity pattern generation based on the sequence length.

[0033] Embodiments of the present disclosure may improve on at least some of the challenges and issues described above by dynamically scaling MHA layers of transformer networks. In an example of dynamic scaling, a transformer model is compiled to the maximum sequence length and the hardware (e.g., NPU) processes the actual sequence length at runtime to skip or scale workloads to reduce the inference time of the statically compiled transformer network. With minimal runtime configuration, this approach can dynamically change aspects of the statically compiled model to significantly improve the performance of the hardware and reduce power consumption.

[0034] In various embodiments of the present disclosure, a computing system, e.g., an AI system, may facilitate dynamic scaling of MHA layers of transformer networks. The system may include a compiler, which may be implemented by a central processing unit (CPU), and an NPU. The compiler may compile transformer networks and the NPU may execute compiled transformer networks for performing AI tasks. The compiler may compile a transformer network based on maximum sequence length supported by the transformer network. The transformer network may have been generated or trained for processing sequences with lengths up to the maximum sequence length. The compiler may partition the workload of executing an operation in an MHA layer, such as a MatMul operation or Softmax operation, into a plurality of workloads based on the maximum sequence length. For instance, the compiler may divide an input tensor having the maximum sequence length as a dimension into subtensors and each subtensor may correspond to a single workload in which the subtensor is processed. Such a subtensor is also referred to as a primitive. The sequence length of a subtensor is also referred to as a primitive sequence length, which may indicate a fixed number of tokens. The maximum sequence length may be a multiple of the primitive sequence length. In an example, the maximum sequence length is 512, and the primitive sequence length is 128. The compiler may determine the primitive sequence length based on one or more configurations of the NPU. The compiler may also generate workload descriptors that define the plurality of workloads. A workload descriptor may include a primitive ID that identifies the corresponding subtensor. A workload may have multiple primitive IDs, e.g., when the compiler partitions multiple input tensors of the operation or when the compiler compiles an input tensor on multiple dimensions. The compiler may provide the primitive sequence length and workload descriptors to the NPU.
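The compile-time partitioning described above can be sketched as follows. This is a minimal illustration under stated assumptions: the `WorkloadDescriptor` fields, the function name `compile_workloads`, and the one-dimensional token ranges are hypothetical and are not the descriptor format of this disclosure.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class WorkloadDescriptor:
    """Hypothetical descriptor: one workload per primitive (subtensor)."""
    primitive_id: int
    start_token: int  # first token covered by the primitive
    end_token: int    # one past the last token covered

def compile_workloads(max_seq_len: int, primitive_seq_len: int):
    """Split the maximum sequence length into fixed-size primitives."""
    if max_seq_len % primitive_seq_len != 0:
        raise ValueError("max_seq_len must be a multiple of primitive_seq_len")
    count = max_seq_len // primitive_seq_len
    return [
        WorkloadDescriptor(i, i * primitive_seq_len, (i + 1) * primitive_seq_len)
        for i in range(count)
    ]

# Example from the text: maximum length 512, primitive length 128.
descriptors = compile_workloads(512, 128)
for d in descriptors:
    print(d)
```

With these values the sketch yields four descriptors with primitive IDs 0 through 3, each covering 128 consecutive token positions.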

[0035] The NPU may dynamically scale the MHA layer during runtime based on the primitive sequence length and workload descriptors. For an execution of the transformer network, the sequence length (e.g., the number of tokens input into the transformer network) may be smaller than the maximum sequence length. The NPU may select one or more workloads from the plurality of workloads based on the actual sequence length, primitive sequence length, and workload descriptors. The NPU may execute the selected workload(s) and skip the other workload(s). To facilitate the dynamic scaling of the statically compiled transformer network, the NPU may pad one or more input tensors of the neural network operation. For instance, the NPU may add one or more zeros into an input tensor to make a dimension of the padded input tensor be a multiple of the primitive sequence length.
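The runtime selection and padding steps can be sketched in a few lines. This is an illustration only, assuming that primitives are laid out contiguously along the sequence dimension so that the first primitives cover the actual tokens; `select_workloads` and `pad_rows` are hypothetical names, not NPU interfaces.

```python
def select_workloads(seq_len: int, max_seq_len: int, primitive_seq_len: int):
    """Return (executed, skipped) primitive IDs for the actual sequence length."""
    if not 1 <= seq_len <= max_seq_len:
        raise ValueError("seq_len must be between 1 and max_seq_len")
    total = max_seq_len // primitive_seq_len
    needed = -(-seq_len // primitive_seq_len)  # ceiling division
    return list(range(needed)), list(range(needed, total))

def pad_rows(rows, primitive_seq_len: int):
    """Append all-zero rows until the row count is a multiple of the primitive length."""
    width = len(rows[0])
    padded_len = -(-len(rows) // primitive_seq_len) * primitive_seq_len
    return rows + [[0.0] * width for _ in range(padded_len - len(rows))]

executed, skipped = select_workloads(seq_len=64, max_seq_len=512, primitive_seq_len=128)
print(executed, skipped)  # [0] [1, 2, 3]

# Three token rows padded up to one primitive of four rows.
print(pad_rows([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]], primitive_seq_len=4))
```

For a 64-token input, only the first of four primitives is executed and the remaining three are skipped; the input is zero-padded up to the 128-token primitive boundary rather than to the full 512 tokens.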

[0036] The approach in this disclosure allows a pre-compiled model to be dynamically scaled at runtime. The dynamic configuration required at runtime is minimal and can be programmed once per inference. This allows an effective way to take advantage of dynamism in transformer networks. The approach in this disclosure can lead to fewer compute cycles, less data movement, and fewer memory accesses. A significant speedup can be achieved on MatMul operations and Softmax operations in transformer networks. A better performance of the NPU can be achieved with lower power consumption for transformer networks.

[0037] For purposes of explanation, specific numbers, materials and configurations are set forth in order to provide a thorough understanding of the illustrative implementations. However, it would be apparent to one skilled in the art that the present disclosure may be practiced without the specific details and/or that the present disclosure may be practiced with only some of the described aspects. In other instances, well known features are omitted or simplified in order not to obscure the illustrative implementations.

[0038] Further, references are made to the accompanying drawings that form a part hereof, and in which is shown, by way of illustration, embodiments that may be practiced. It is to be understood that other embodiments may be utilized, and structural or logical changes may be made without departing from the scope of the present disclosure. Therefore, the following detailed description is not to be taken in a limiting sense.

[0039] Various operations may be described as multiple discrete actions or operations in turn, in a manner that is most helpful in understanding the claimed subject matter. However, the order of description should not be construed as to imply that these operations are necessarily order dependent. In particular, these operations may not be performed in the order of presentation. Operations described may be performed in a different order from the described embodiment. Various additional operations may be performed or described operations may be omitted in additional embodiments.

[0040] For the purposes of the present disclosure, the phrase "A or B" or the phrase "A and/or B" means (A), (B), or (A and B). For the purposes of the present disclosure, the phrase "A, B, or C" or the phrase "A, B, and/or C" means (A), (B), (C), (A and B), (A and C), (B and C), or (A, B, and C). The term "between," when used with reference to measurement ranges, is inclusive of the ends of the measurement ranges.

[0041] The description uses the phrases "in an embodiment" or "in embodiments," which may each refer to one or more of the same or different embodiments. The terms "comprising," "including," "having," and the like, as used with respect to embodiments of the present disclosure, are synonymous. The disclosure may use perspective-based descriptions such as "above," "below," "top," "bottom," and "side" to explain various features of the drawings, but these terms are simply for ease of discussion, and do not imply a desired or required orientation. The accompanying drawings are not necessarily drawn to scale. Unless otherwise specified, the use of the ordinal adjectives "first," "second," and "third," etc., to describe a common object, merely indicates that different instances of like objects are being referred to and are not intended to imply that the objects so described must be in a given sequence, either temporally, spatially, in ranking or in any other manner.

[0042] In the following detailed description, various aspects of the illustrative implementations are described using terms commonly employed by those skilled in the art to convey the substance of their work to others skilled in the art.

[0043] The terms "substantially," "close," "approximately," "near," and "about," generally refer to being within +/- 20% of a target value as described herein or as known in the art. Similarly, terms indicating orientation of various elements, e.g., "coplanar," "perpendicular," "orthogonal," "parallel," or any other angle between the elements, generally refer to being within +/- 5-20% of a target value as described herein or as known in the art.

[0044] In addition, the terms "comprise," "comprising," "include," "including," "have," "having" or any other variation thereof, are intended to cover a non-exclusive inclusion. For example, a method, process, device, or DNN accelerator that comprises a list of elements is not necessarily limited to only those elements but may include other elements not expressly listed or inherent to such method, process, device, or DNN accelerators. Also, the term "or" refers to an inclusive "or" and not to an exclusive "or."

[0045] The systems, methods and devices of this disclosure each have several innovative aspects, no single one of which is solely responsible for all desirable attributes disclosed herein. Details of one or more implementations of the subject matter described in this specification are set forth in the description below and the accompanying drawings.

[0046] FIG. 1 illustrates an example transformer network 100, in accordance with various embodiments. The transformer network 100 may also be referred to as a transformer or transformer model. The transformer network 100 may transform input sequences into output sequences. In some embodiments, the transformer network 100 is a DNN that can learn context and meaning by tracking relationships in sequential data, such as sequential words in a sentence, sequential audio signals, sequential images, and so on. In an example, the transformer network 100 may be a large language model (LLM). The transformer network 100 includes an encoder block 110, a decoder block 120, and a head block 130. In other embodiments, functionality attributed to a component of the transformer network 100 may be accomplished by a different component included in the transformer network 100 or a different model or module.

[0047] Further, different or additional components may be included in the transformer network 100. In an example, the transformer network 100 may be an encoder model that includes an encoder component (e.g., the encoder block 110) without a decoder component (e.g., the decoder block 120). The encoder may process an input sequence and generate a fixed-size representation, which can be used for various downstream tasks such as classification, translation, or generation. The final layer of the encoder may produce a fixed-size representation of the input sequence. This representation may be obtained by computing a weighted sum of the output sequence from the last layer, where the weights are again learned based on the relevance of each element in the sequence to the overall representation. Such encoder models can be used when it is unnecessary to generate a specific output sequence, but it is needed to classify the input sequence.

[0048] In another example, the transformer network 100 may be a decoder model that includes a decoder component (e.g., the decoder block 120) without an encoder component (e.g., the encoder block 110). The decoder may generate an output sequence based on a fixed-size representation of an input sequence. The decoder may be autoregressive in nature, which means that it may generate the output sequence by generating one element at a time, based on previously generated elements. In some cases, the decoder takes one or more initial tokens that represent the beginning of the output sequence. It then generates one output token at a time, conditioned on the previously generated tokens, until an end-of-sequence token is generated, or a maximum length is reached. Successive runs of the decoder work on the previous input combined with the previously generated token. This type of model can be used as an autoregressive model to generate text based on user input.

[0049] In the embodiments of FIG. 1, the transformer network 100 may be a seq2seq model with a transformer architecture, e.g., BART or T5. The encoder block 110 and decoder block 120 may be trained together to map an input sequence to an output sequence. In another embodiment, the transformer network 100 may be an encoder-only transformer architecture, e.g., BERT-like models, which does not include the decoder block 120. In yet another embodiment, the transformer network 100 may be a decoder-only transformer architecture, e.g., GPT-like models, which does not include the encoder block 110. The transformer network 100 can be used in machine translation and other AI tasks where the input and output sequences may have different lengths. In some embodiments, the transformer network 100 may be generated and trained to support sequences up to a certain length. A sequence length may be the number of tokens in the sequence. In an example, the transformer network 100 may support a maximum sequence length of 512, meaning the transformer network 100 can process sequences with 512 tokens or fewer. When the transformer network 100 is deployed to do AI tasks, it may receive sequences of various lengths. Some sequences can be significantly shorter than the maximum sequence length. In some embodiments, the transformer may be compiled based on the maximum sequence length. For instance, workloads for executing neural network operations in the transformer may be defined using the maximum sequence length during the compilation stage, even though actual input sequence lengths may be different from the maximum sequence length. The operations in the transformer can be dynamically scaled based on the actual lengths of input sequences during runtime, i.e., during the execution of the transformer. A sequence received by the transformer network 100 is referred to as an input sequence, and tokens in input sequences are referred to as input tokens. Input tokens may be words, phrases, sentences, symbols, images, audio signals, other types of input tokens, or some combination thereof. Certain aspects of dynamic scaling are described below in conjunction with FIGS. 5-10. Certain aspects of compiling transformer networks are described below in conjunction with FIG. 11.

[0050] The encoder block 110 receives input sequences and generates matrix representations of the input sequences. In the embodiments of FIG. 1, the encoder block 110 receives an input 101 and generates an encoder output 102. The input 101 may be an input prompt. In some embodiments, the input 101 may include one or more input tokens. In an example, the input 101 may include a prompt received from a user of the transformer network 100. The prompt may include a question or request made by the user. A word in the prompt may be an input token. The encoder output 102 may include one or more vectors that are contextualized representations of the input 101. Each vector in the encoder output 102 may represent a token in the input 101 with contextual understanding.

[0051] The encoder block 110 includes an embedding layer 113, a positional encoding layer 115, and a plurality of layers 140 (individually referred to as "layer 140"). In other embodiments, the encoder block 110 may have different, fewer, or more components. Also, the arrangement of the components in the encoder block 110 may be different from the arrangement shown in FIG. 1. For the purpose of illustration, the encoder block 110 has N layers in FIG. 1, where N is an integer. Each layer 140 may include one or more neural network operations. The layers 140 may transform a sequence of embeddings into a representation that encapsulates the learned information from the input 101. Different layers 140 may have different internal parameters, e.g., different weights, bias, or other types of internal parameters. In some embodiments, the layers 140 have identical components. The components in a layer 140 may be layers and may also be referred to as sub-layers of the layer 140. As shown in FIG. 1, a layer 140 includes four sub-layers: an MHA layer 141, an add & norm layer 142, a feed forward layer 143, and another add & norm layer 144.

[0052] The decoder block 120 iteratively generates outputs 103 using encoded representations generated by the encoder block 110. The decoder block 120 includes an embedding layer 123, a positional encoding layer 125, and a plurality of layers 150 (individually referred to as "layer 150"). For the purpose of illustration, the decoder block 120 has N layers in FIG. 1, where N is an integer. In the embodiments of FIG. 1, the number of layers 150 in the decoder block 120 is the same as the number of layers 140 in the encoder block 110. In other embodiments, the number of layers 150 in the decoder block 120 may be different from the number of layers 140 in the encoder block 110. Each layer 150 may include one or more neural network operations. Different layers 150 may have different internal parameters. In some embodiments, the layers 150 may have identical components. The components in a layer 150 may be layers and may also be referred to as sub-layers of the layer 150. As shown in FIG. 1, a layer 150 includes six sub-layers: an MHA layer 151, an add & norm layer 152, an MHA layer 153, another add & norm layer 154, a feed forward layer 155, and another add & norm layer 156.

[0053] In some embodiments, a sequence of inference stages is performed in the decoder block 120 using encoder outputs, e.g., the encoder output 102. A matrix may be predicted through each inference stage. The outputs 103 may include a plurality of matrices. Each matrix may be further processed in the head block 130 to predict a token. The plurality of matrices may be used to predict a sequence of tokens. For the first inference stage, the decoder block 120 may receive one or more start tokens as input tokens and compute a first matrix from the input tokens and the output of the encoder block 110. The first matrix may be used by the head block 130 to predict a first token. The predicted token may be used as a new input token, in addition to the start token(s), in the second inference stage. Similarly, a second token may be predicted through the second inference stage and may be used in the third inference stage. This iteration may continue until all the inference stages are complete.

[0054] The head block 130 receives the output of the decoder block 120 and processes it in a linear layer 133 and a Softmax layer 135. A linear operation may be performed on the output of the decoder block 120 in the linear layer 133. The linear operation may include a multiplication of the output of the decoder block 120 with a weight matrix. The output of the linear layer 133 may be a vector. In some embodiments, the head block 130 may function as a classifier. The number of data elements in the vector computed in the linear layer 133 may depend on the number of classes involved. In an example where there are M classes, where M is an integer, the vector computed in the linear layer 133 may have M data elements representing the prediction for the M classes, respectively.

[0055] The output of the linear layer 133 may be input into the Softmax layer 135. A Softmax function may be applied on the output of the linear layer 133 to compute probability scores. A probability score may have a value in the range from 0 to 1. In some embodiments, a probability value is computed for each data element in the vector computed in the linear layer 133. The highest one of the probability scores may be the key. The corresponding index of the key may point to the token that the transformer network 100 predicts as the next in the sequence. The final output of the transformer network 100 may be the sequence of predicted tokens. In some embodiments, the head block 130 may be a language modeling head.
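
As a minimal sketch of the Softmax step, the following computes probability scores from a vector of M = 4 values and picks the index of the highest score. The input values are made up, and the function names are illustrative rather than taken from the source.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable Softmax over a vector of M values."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_token(logits: List[float]) -> int:
    """Return the index of the highest probability score."""
    probs = softmax(logits)
    return max(range(len(probs)), key=probs.__getitem__)


logits = [1.0, 3.0, 0.5, 2.0]  # made-up output of the linear layer, M = 4
probs = softmax(logits)
print(predict_token(logits))
```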

[0056] An embedding layer (e.g., the embedding layer 113 or the embedding layer 123) converts an input of the embedding layer (e.g., the input 101 or the outputs 103) into one or more embeddings. An embedding may be a vector, which is also referred to as an embedding vector or a vector embedding. The vector embedding may include a sequence of data elements. In some embodiments, the embedding layer 113 may generate a plurality of embeddings, each of which may be converted from a different input token in the input 101. The embeddings may capture the semantic meaning of the tokens in the input 101. The embeddings may be numerical representations that capture the relationships or meanings of words, phrases, or other data types. In an example where the input 101 is a prompt including a sequence of words, the embedding layer 113 may generate an embedding from each word in the input 101. The embedding layer 123 in the decoder block 120 may generate a plurality of embeddings from tokens received by the decoder block 120 in a similar manner as the embedding layer 113. Certain aspects of embedding layers are described below in conjunction with FIG. 2.

[0057] A positional encoding layer (e.g., the positional encoding layer 115 or the positional encoding layer 125) performs positional encoding on embeddings generated in the corresponding embedding layer. In some embodiments, the positional encoding layer may apply one or more positional encoding vectors (e.g., a positional encoding vector 104 or positional encoding vector 105) on vector embeddings from the corresponding embedding layer to generate new vector embeddings that represent the embeddings with positional context. The positional encoding vector may encode information about the position of the embedding in a sequence of embeddings. In some embodiments, the positional encoding layer performs an addition operation on a positional encoding vector and a vector embedding. The addition operation may be elementwise addition. The positional encoding layer may output an embedding matrix that includes the vector embeddings computed in the positional encoding layer. Certain aspects of positional encoding layers are described below in conjunction with FIG. 16.
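
The elementwise addition can be sketched as follows. The source does not say how the positional encoding vectors are produced, so the vectors here are arbitrary made-up values, and the function name is hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def add_positional_encoding(embeddings: Matrix, positional_vectors: Matrix) -> Matrix:
    """Elementwise addition of one positional vector per vector embedding."""
    if len(embeddings) != len(positional_vectors):
        raise ValueError("need one positional vector per embedding")
    return [
        [e + p for e, p in zip(emb, pos)]
        for emb, pos in zip(embeddings, positional_vectors)
    ]


embeddings = [[1.0, 2.0], [3.0, 4.0]]            # two embeddings of dimension 2
positional_vectors = [[0.0, 0.5], [1.0, 0.25]]   # made-up positional values
embedding_matrix = add_positional_encoding(embeddings, positional_vectors)
print(embedding_matrix)
```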

[0058] An MHA layer (e.g., the MHA layer 141, the MHA layer 151, or the MHA layer 153) may implement a multi-head attention mechanism, which may be a multi-head self-attention mechanism or a multi-head cross-attention mechanism. In some embodiments, the MHA layer 141 or the MHA layer 151 may implement a self-attention mechanism. For self-attention, the queries, keys, and values may come from the same place. For instance, for the MHA layer 141, the queries, keys, and values may all come from the positional encoding layer 115. For the MHA layer 151, the queries, keys, and values may all come from the positional encoding layer 125. The self-attention mechanism may enable the transformer network 100 to relate each token with other tokens. The MHA layer may compute attention scores from embeddings generated in the corresponding positional encoding layer. In some embodiments, the MHA layer may receive one or more queries, one or more keys, and one or more values. In some embodiments, the MHA layer has a number of heads that receive different linearly projected versions of the queries, keys, and values and produce outputs in parallel that are then used to generate the final result.

[0059] In some embodiments, the queries, keys, and values input into the MHA layer 141 may be computed from vector embeddings generated by the positional encoding layer 115. The queries, keys, and values input into the MHA layer 151 may be computed from vector embeddings generated by the positional encoding layer 125. A query, key, or value may be a vector that represents a token in a sequence. In some embodiments, a query matrix $Q \in \mathbb{R}^{N \times h}$ may be computed by multiplying an embedding matrix $X \in \mathbb{R}^{N \times d}$ (e.g., an embedding matrix computed in a positional encoding layer) with a weight matrix $W_q \in \mathbb{R}^{d \times h}$, where d is the dimension of a vector embedding, N is the number of vector embeddings in the embedding matrix, and h is the number of attention heads. Each row in the query matrix may be a query. A key matrix $K \in \mathbb{R}^{N \times h}$ may be computed by multiplying an embedding matrix $X \in \mathbb{R}^{N \times d}$ (e.g., an embedding matrix computed in a positional encoding layer) with a weight matrix $W_k \in \mathbb{R}^{d \times h}$. Each row in the key matrix may be a key. A value matrix $V \in \mathbb{R}^{N \times h}$ may be computed by multiplying an embedding matrix $X \in \mathbb{R}^{N \times d}$ (e.g., an embedding matrix computed in a positional encoding layer) with a weight matrix $W_v \in \mathbb{R}^{d \times h}$. Each row in the value matrix may be a value.
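
A small worked example of the query computation, with N = 2 embeddings of dimension d = 3 and a weight matrix with 2 columns. All values are made up, and the plain-Python `matmul` helper is an illustrative assumption; the key and value matrices would be computed the same way with their own weight matrices.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Multiply an (n x k) matrix by a (k x m) matrix."""
    inner = len(b)
    if any(len(row) != inner for row in a):
        raise ValueError("columns of a must equal rows of b")
    cols = len(b[0])
    return [
        [sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)]
        for row in a
    ]


X = [[1.0, 0.0, 2.0],      # embedding matrix, N = 2 rows, d = 3 columns
     [0.0, 1.0, 1.0]]
W_q = [[1.0, 0.0],         # weight matrix, d = 3 rows, 2 columns
       [0.0, 1.0],
       [1.0, 1.0]]
Q = matmul(X, W_q)         # each row of Q is one query
print(Q)
```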

[0060] In some embodiments, the MHA layer 151 may implement masked multi-head self-attention. The MHA layer 151 may prevent positions from attending to subsequent positions. For instance, each token in the sequence may not be influenced by future tokens. This masking can ensure that the predictions of a particular position can depend on known outputs at positions before it and not depend on unknown outputs at positions after it.
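
One common way to realize this kind of masking is an additive mask that places negative infinity at every future position before the Softmax. The sketch below assumes that approach; the source does not state which mask values are used here, and the scores are made up.

```python
import math
from typing import List

Matrix = List[List[float]]


def causal_mask(n: int) -> Matrix:
    """Additive mask: 0 for current and earlier positions, -inf for later ones."""
    return [[0.0 if j <= i else float("-inf") for j in range(n)] for i in range(n)]


def softmax_row(row: List[float]) -> List[float]:
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


scores = [[0.1, 0.2, 0.3],
          [0.4, 0.5, 0.6],
          [0.7, 0.8, 0.9]]
mask = causal_mask(3)
masked = [[s + m for s, m in zip(s_row, m_row)] for s_row, m_row in zip(scores, mask)]
weights = [softmax_row(row) for row in masked]
print(weights[0])  # the first position can only attend to itself
```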

[0061] In some embodiments, the MHA layer 153 may implement a cross-attention mechanism, such as encoder-decoder cross-attention. The MHA layer 153 may use outputs from the previous layer (i.e., the add & norm layer 152) as queries and use outputs from the encoder block 110 as keys and values. The cross-attention can align the encoder's input with the decoder's, empowering the decoder block 120 to identify and emphasize the most relevant parts of the encoder's input. Certain aspects of MHA layers are described below in conjunction with FIGS. 3A and 3B.

[0062] An add & norm layer in the transformer network 100, such as the add & norm layers 142, 144, 152, 154, and 156, has an addition operation followed by a layer normalization operation. The addition operation may be an addition of the output of the preceding layer and the input of the preceding layer. The preceding layer is a layer that is arranged right before the add & norm layer. For example, the preceding layer of the add & norm layer 142 is the MHA layer 141. As another example, the preceding layer of the add & norm layer 154 is the MHA layer 153.

[0063] Then the layer normalization operation is applied on the result of the addition operation, which may be denoted as LayerNorm(x + sublayer(x)), where LayerNorm denotes layer normalization, x is the input of the preceding layer, and sublayer(x) denotes the output of the preceding layer. In some embodiments, the layer normalization operation may include a sequence of computations. In an example, the layer normalization operation may include a mean computation, which may be denoted as $\mu_{xy} = \frac{1}{Z} \times \sum_{z=1}^{Z} A_{xyz}$, where $A_{xyz}$ denotes a data element in the input tensor, x may be the positional index of the data element in one of the spatial dimensions, y may be the positional index of the data element in the other one of the spatial dimensions, z may be the positional index of the data element in the channel dimension, Z may be the number of channels, and $\mu_{xy}$ denotes the output of the mean computation, which may be a 2D matrix. The mean computation may be a channel-wise reduction operation. The layer normalization operation may convert $\mu_{xy}$ to a 3D tensor $\mu_{xyz}$, e.g., by replicating every data element over Z output points.

[0064] The layer normalization operation may also include an elementwise subtraction, which may be denoted as $D_{xyz} = A_{xyz} - \mu_{xyz}$. The layer normalization operation may further include a variance computation denoted as $\sigma^2_{xy} = \sum_{z=1}^{Z} D^2_{xyz}$ and a division computation denoted as $M_{xy} = \frac{1}{\sqrt{\frac{1}{Z} \times (\sigma^2_{xy} + \epsilon \times Z)}}$. $M_{xy}$ may be a 2D tensor. The layer normalization operation may also convert $M_{xy}$ to a 3D tensor $M_{xyz}$, e.g., by replicating every data element over Z output points. Further, the layer normalization operation may have an element multiplication denoted as $A'_{xyz} = \frac{A_{xyz} - \mu_{xyz}}{\sqrt{\frac{1}{Z} \times (\sigma^2_{xy} + \epsilon \times Z)}} = (A_{xyz} - \mu_{xyz}) \times \frac{1}{\sqrt{\frac{1}{Z} \times (\sigma^2_{xy} + \epsilon \times Z)}} = D_{xyz} \times M_{xyz}$. The layer normalization operation may further compute $A''_{xyz} = A'_{xyz} + \frac{\beta_z}{\gamma_z}$ and $LN_{xyz} = A''_{xyz} \times \gamma_z$. $LN_{xyz}$ may be the output of the layer normalization operation.
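
The mean, subtraction, variance, division, and multiplication steps can be traced for the channel vector at a single spatial position. This is a sketch under stated assumptions: the variance step is read as a plain sum of squared deviations (with the division by the channel count folded into the later division step), the value of epsilon is made up, and the final scale and shift step is omitted.

```python
import math
from typing import List


def normalize_channels(a: List[float], eps: float = 1e-5) -> List[float]:
    """Normalize the channel vector at one spatial position (x, y)."""
    num_channels = len(a)
    mean = sum(a) / num_channels                  # mean computation
    deviations = [v - mean for v in a]            # elementwise subtraction
    sum_sq = sum(d * d for d in deviations)       # sum of squared deviations
    # division step: 1 / sqrt((1 / Z) * (sum_sq + eps * Z))
    inv_std = 1.0 / math.sqrt((sum_sq + eps * num_channels) / num_channels)
    return [d * inv_std for d in deviations]      # elementwise multiplication


normalized = normalize_channels([1.0, 2.0, 3.0, 4.0])
print(normalized)
```

After these steps the channel values have a mean of approximately zero and a variance of approximately one.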

[0065] A feed forward layer (e.g., the feed forward layer 143 and the feed forward layer 155) may be a position-wise fully-connected feed forward network. In an example, the feed forward layer may include two linear layers with an activation function in between. An example of the activation function is Rectified Linear Unit (ReLU).
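
A minimal sketch of such a feed forward layer applied to one position: two linear layers with a ReLU in between. The weights, biases, and layer widths are made-up values chosen so the result is easy to check by hand; the use of bias terms is an assumption.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def linear(x: Vector, w: Matrix, b: Vector) -> Vector:
    """Compute x @ w + b for a single position."""
    return [sum(xi * w[i][j] for i, xi in enumerate(x)) + b[j] for j in range(len(b))]


def relu(v: Vector) -> Vector:
    return [max(0.0, t) for t in v]


def feed_forward(x: Vector, w1: Matrix, b1: Vector, w2: Matrix, b2: Vector) -> Vector:
    """Two linear layers with a ReLU activation in between."""
    return linear(relu(linear(x, w1, b1)), w2, b2)


x = [1.0, -2.0]
w1 = [[1.0, -1.0, 0.5], [0.0, 1.0, 1.0]]   # 2 inputs -> 3 hidden units
b1 = [0.0, 0.0, 0.0]
w2 = [[1.0], [1.0], [1.0]]                 # 3 hidden units -> 1 output
b2 = [0.5]
ffn_output = feed_forward(x, w1, b1, w2, b2)
print(ffn_output)
```

Here the hidden values are [1.0, -3.0, -1.5]; ReLU zeroes the two negative values, and the second linear layer adds the bias 0.5.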

[0066] FIG. 2 illustrates an embedding operation in an embedding layer 200, in accordance with various embodiments. The embedding layer 200 may be an example of the embedding layer 113 or the embedding layer 123 in FIG. 1. As shown in FIG. 2, the embedding layer 200 receives an input sequence 201, which includes three words 202, 203, and 204. Each word may be treated as a token. In other embodiments, the input sequence 201 may include fewer or more tokens. The number of tokens in the input sequence 201 may be the length of the input sequence 201. There may be a limit on the length of the input sequence 201. For example, the length of the input sequence 201 may be not more than a predetermined sequence length that the transformer including the embedding layer 200 can support. The transformer may have been designed, trained, or compiled for processing sequences of the predetermined sequence length and shorter lengths.

[0067] As shown in FIG. 2, the embedding layer 200 generates a vector embedding 205 from the word 202. The embedding layer 200 also generates a vector embedding 206 from the word 203. The embedding layer 200 further generates a vector embedding 207 from the word 204. In the embodiments of FIG. 2, the vector embeddings 205, 206, and 207 have the same dimension, i.e., they each have five data elements. In other embodiments, the vector embedding 205, 206, or 207 may have a different dimension. Also, the input to the embedding layer 200 may be data of a type other than words, such as audio signals, images, and so on.
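
The mapping from three tokens to three five-element vector embeddings can be illustrated with a lookup table. The words and the embedding values below are hypothetical; a trained embedding layer would supply its own learned values.

```python
from typing import Dict, List

# Hypothetical five-element embeddings for three example words (made-up values).
EMBEDDING_TABLE: Dict[str, List[float]] = {
    "dynamic": [0.1, 0.3, -0.2, 0.7, 0.0],
    "sequence": [0.5, -0.1, 0.4, 0.2, -0.3],
    "length": [-0.4, 0.6, 0.1, 0.0, 0.9],
}


def embed(tokens: List[str]) -> List[List[float]]:
    """Look up one vector embedding per input token."""
    return [EMBEDDING_TABLE[token] for token in tokens]


input_sequence = ["dynamic", "sequence", "length"]
vector_embeddings = embed(input_sequence)
print(len(vector_embeddings), len(vector_embeddings[0]))  # tokens, embedding dimension
```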

[0068] In some embodiments where the embedding layer 200 is in an encoder (e.g., the encoder block 110), the input sequence 201 may be an input received by the encoder, such as a prompt made by a user. The input sequence 201 may remain the same during inference of the encoder. In some embodiments where the embedding layer 200 is in a decoder (e.g., the decoder block 120), the input sequence 201 may change and the dimension of the input sequence 201 may be dynamic during inference of the decoder. In an example, the decoder inference may include a sequence of inference stages. Each inference stage may be conducted for predicting a token. For the first inference stage, the input sequence 201 may include one or more start tokens. For each subsequent inference stage (e.g., the second inference stage, the third inference stage, etc.), the input sequence 201 may include tokens predicted in the previous inference stages. The dimension of the input sequence may be increased by one after each inference stage.

[0069] FIGS. 3A and 3B illustrate an example MHA layer 300, in accordance with various embodiments. The MHA layer 300 may be an example of the MHA layer 141 or the MHA layer 151 in FIG. 1. As shown in FIG. 3A, the MHA layer 300 includes linear layers 310, 320, and 330, a MatMul layer 340, a scale layer 350, a Softmax layer 360, another MatMul layer 370, a concatenation layer 380, and another linear layer 390. In other embodiments, the MHA layer 300 may include fewer, more, or different layers.

[0070] The MHA layer 300 receives three input matrices: a query matrix 301, a key matrix 302, and a value matrix 303, which are inputs of the linear layers 310, 320, and 330, respectively. The linear layers 310, 320, and 330 are in a linear block 315 of the MHA layer 300. In some embodiments, the MHA layer 300 includes a plurality of linear blocks that includes the linear block 315. For the purpose of illustration, the MHA layer 300 includes h linear blocks in FIG. 3A, where h is an integer. Each of the linear blocks may have the same layers as the linear block 315. Each linear block may compute three parameter matrices from the query matrix 301, key matrix 302, and value matrix 303, respectively. For instance, the linear layer 310 may perform a multiplication of the query matrix 301 with a weight matrix to compute a parameter matrix 304 shown in FIG. 3B. The linear layer 320 may perform a multiplication of the key matrix 302 with a weight matrix to compute a parameter matrix 305 shown in FIG. 3B. The linear layer 330 may perform a multiplication of the value matrix 303 with a weight matrix to compute a parameter matrix 306 shown in FIG. 3B.

[0071] The parameter matrix 304 may be denoted as $QW_i^Q$, where Q is the query matrix 301 and $W_i^Q \in \mathbb{R}^{d_{\text{model}} \times d_q}$ is the weight matrix. The parameter matrix 305 may be denoted as $KW_i^K$, where K is the key matrix 302 and $W_i^K \in \mathbb{R}^{d_{\text{model}} \times d_k}$ is the weight matrix. The parameter matrix 306 may be denoted as $VW_i^V$, where V is the value matrix 303 and $W_i^V \in \mathbb{R}^{d_{\text{model}} \times d_v}$ is the weight matrix. i may indicate the index of the head. $d_q$ is the dimension of a query vector. $d_k$ is the dimension of a key vector. $d_v$ is the dimension of a value vector. In some embodiments, $d_q = d_k = d_v = d_{\text{model}} / h$.

[0072] The MatMul layer 340, scale layer 350, mask layer 355, Softmax layer 360, and MatMul layer 370 are in an attention block 325 of the MHA layer. The attention block 325 may implement a scaled dot-product attention mechanism. In some embodiments, the MHA layer 300 includes a plurality of attention blocks that includes the attention block 325. For the purpose of illustration, the MHA layer 300 includes h attention blocks in FIG. 3A. Each of the attention blocks may have the same layers as the attention block 325. The linear block 315 and attention block 325 may constitute a head of the MHA layer 300. As the MHA layer 300 has h linear blocks and h attention blocks, the MHA layer 300 has h heads. A head may be denoted as $\text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)$.

[0073] A matrix multiplication operation may be performed on the parameter matrices 304 and 305 in the MatMul layer 340, which computes a score matrix 307. In some embodiments, the score matrix 307 may establish the degree of emphasis each token should place on other tokens. The score matrix 307 may include a plurality of scores. Each token may be assigned a score in relation to other tokens within the same time step. A higher score may indicate a higher focus or emphasis. The score matrix 307 may be scaled in the scale layer 350. In some embodiments, the score matrix 307 is scaled down in the scale layer 350 by dividing the scores in the score matrix by the square root of the dimension of the query vector and the key vector, which may be denoted as $\sqrt{d_k}$. The output of the scale layer 350 may be a scaled matrix 308, which includes adjusted scores. The mask layer 355 may be optional in some embodiments. The mask layer 355 may add an attention mask (which may be an input to the attention block 325) to the output of the scale layer 350 to mask out some elements in the output of the scale layer 350. The positions of the masked-out elements may be defined by the attention mask. A Softmax function may be applied on the scaled matrix 308 in the Softmax layer 360 to compute an attention weight matrix 309. The attention weight matrix 309 includes attention weights. The attention weights may be probability values ranging from 0 to 1. The Softmax function may emphasize high scores while diminishing low scores, which can enhance the model's ability to determine which tokens should get more attention.

[0074] In the MatMul layer 370, a matrix multiplication operation is performed on the attention weight matrix 309 computed in the Softmax layer 360 and the parameter matrix 306 computed from value matrix 303 in the linear layer 330. The result of the matrix multiplication operation is a single-head output matrix 311, which is an output of the attention block 325.
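
The sequence of steps in the attention block (MatMul, scale, Softmax, MatMul) can be sketched for a single head in plain Python. The input matrices are made-up values, the helper names are illustrative, and the optional mask step is left out.

```python
import math
from typing import List, Tuple

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m: Matrix) -> Matrix:
    return [list(col) for col in zip(*m)]


def softmax_row(row: List[float]) -> List[float]:
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(q: Matrix, k: Matrix, v: Matrix) -> Tuple[Matrix, Matrix]:
    """Single-head scaled dot-product attention; returns (output, attention weights)."""
    d_k = len(k[0])
    scores = matmul(q, transpose(k))                                  # score matrix
    scaled = [[s / math.sqrt(d_k) for s in row] for row in scores]    # scale step
    weights = [softmax_row(row) for row in scaled]                    # attention weights
    return matmul(weights, v), weights                                # weights x values


q = [[1.0, 0.0], [0.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0]]
head_output, attention_weights = attention(q, k, v)
print(attention_weights)
```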

[0075] As the MHA layer 300 has h attention blocks, there are h single-head output matrices. The single-head output matrices are concatenated in the concatenation layer 380 to form a concatenated matrix. A linear operation (also referred to as "linear transformation") is performed on the concatenated matrix using a weight matrix in the linear layer 390. In some embodiments, the MHA may be denoted as $\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \text{head}_2, \ldots, \text{head}_h) W^O$, where Concat denotes concatenation, and $W^O \in \mathbb{R}^{h d_v \times d_{\text{model}}}$ is the weight matrix in the linear layer 390.

[0076] FIG. 4 illustrates neural network operations in an MHA layer 400, in accordance with various embodiments. An example of the MHA layer 400 may be the MHA layer 300 in FIGS. 3A and 3B. The MHA layer 400 can be broken down into Matrix-Matrix workloads, including MatMul workloads, and a Softmax workload. In the embodiments of FIG. 4, the MHA layer 400 includes matrix multiplication operations: Q-MatMul 410, K-MatMul 420, V-MatMul 430, MatMul 440, and MatMul 450. The MHA layer 400 also includes a Softmax operation (Softmax 460) and a transpose operation (transpose 470). In other embodiments, the MHA layer 400 may have fewer, more, or different operations.

[0077] In the embodiments of FIG. 4, the Q-MatMul 410 is a multiplication of a query matrix with a weight matrix. The number of columns in the query matrix may equal the number of rows in the weight matrix. The result of the Q-MatMul 410 is a resulting matrix 401, which may have the same number of rows as the query matrix and the same number of columns as the weight matrix. In some embodiments, the spatial size of the query matrix is 1 x S x D, where S denotes a scalable dimension, which is the variable input sequence length, and D denotes a model dimension, which is a fixed dimension. The spatial size of the weight matrix is 1 x D x D / H, where H denotes the number of heads in the MHA layer 400, which is fixed. The fixed dimensions D and D / H may be predetermined when the transformer network was generated or trained. The spatial size of the resulting matrix 401 is 1 x S x D / H.

[0078] The K-MatMul 420 is a multiplication of a key matrix with another weight matrix. The number of columns in the key matrix may equal the number of rows in the weight matrix. The result of the K-MatMul 420 is a resulting matrix 402, which may have the same number of rows as the key matrix and the same number of columns as the weight matrix. In some embodiments, the spatial size of the key matrix is 1 x S x D, the spatial size of the weight matrix is 1 x D x D / H, and the spatial size of the resulting matrix 402 is 1 x S x D / H.

[0079] The V-MatMul 430 is a multiplication of a value matrix with yet another weight matrix. The number of columns in the value matrix may equal the number of rows in the weight matrix. The result of the V-MatMul 430 is a resulting matrix 403, which may have the same number of rows as the value matrix and the same number of columns as the weight matrix. In some embodiments, the spatial size of the value matrix is 1 x S x D, the spatial size of the weight matrix is 1 x D x D / H, and the spatial size of the resulting matrix 403 is 1 x S x D / H.

[0080] The MatMul 440 is performed on the resulting matrix 401 and resulting matrix 402 and produces a matrix 404. In some embodiments, the resulting matrix 402 may be transposed to 1 x D / H x S before the MatMul 440 is performed. The spatial size of the matrix 404 may be 1 x S x S. The MatMul 440 may be a MatMul + rescale operation, in which the MatMul operation on the resulting matrix 401 and resulting matrix 402 may be followed by a scaling operation. The matrix 404 is further processed in the Softmax 460. The Softmax 460 produces a matrix 405, whose spatial size is also 1 x S x S. The Softmax 460 may include Softmax summation or Softmax elementwise operation.

[0081] The transpose 470 is performed on the resulting matrix 403, which produces a matrix 406 whose spatial size is 1 x D / H x S. Each row in the resulting matrix 403 may be transposed to a column in the matrix 406. Each column in the resulting matrix 403 may be transposed to a row in the matrix 406.

[0082] The matrix 405 and matrix 406 are then input into the MatMul 450. The MatMul 450 is a multiplication operation on the two matrices and produces a matrix 407. The spatial size of the matrix 407 is 1 x S x D / H. The matrix 407 may be the output tensor of the MHA layer 400. The matrix 407 may be further processed in one or more other layers of the transformer network.
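
The spatial sizes quoted for FIG. 4 can be collected in one place to show which dimensions scale with the input sequence length S. The helper below is a hypothetical bookkeeping sketch that only restates the sizes given in the text; the example values of S, D, and H are made up.

```python
from typing import Dict, Tuple

Shape = Tuple[int, int, int]


def mha_shapes(S: int, D: int, H: int) -> Dict[str, Shape]:
    """Spatial sizes of the tensors in FIG. 4 for sequence length S."""
    if D % H != 0:
        raise ValueError("D must be divisible by H")
    head_dim = D // H
    return {
        "query_key_value_input": (1, S, D),
        "weight": (1, D, head_dim),
        "matrix_401_402_403": (1, S, head_dim),
        "matrix_404": (1, S, S),            # output of MatMul 440
        "matrix_405": (1, S, S),            # output of Softmax 460
        "matrix_406": (1, head_dim, S),     # output of transpose 470
        "matrix_407": (1, S, head_dim),     # output of MatMul 450
    }


shapes = mha_shapes(S=196, D=512, H=8)
for name, shape in shapes.items():
    print(name, shape)
```

Only the weight tensor is independent of S; the intermediate matrices 404 and 405 grow with the square of S.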

[0083] FIG. 5 illustrates dynamic scaling of a MatMul operation 500, in accordance with various embodiments. The MatMul operation 500 may be a MatMul in an MHA layer of a transformer network. In some embodiments, the MatMul operation 500 is operated as a convolution. Examples of the MatMul operation 500 may include the Q-MatMul 410, K-MatMul 420, and V-MatMul 430 in FIG. 4.

[0084] As shown in FIG. 5, the MatMul operation 500 has two input tensors: input tensor 510 and input tensor 520. The input tensor 510 may be a query tensor, key tensor, or value tensor. The input tensor 520 may be a weight tensor. In some embodiments, a dot product is performed between each row of the input tensor 510 and each column of the input tensor 520 to generate a single point in the output tensor 530. The spatial size of the input tensor 510 may be 1 x S x D, where S denotes the spatial dimension of the input tensor 510, which is the height of the input tensor 510 and equals the maximum sequence length that the transformer network can support, and D denotes the model dimension and is the depth of the input tensor 510. The spatial size of the input tensor 520 may be 1 x D x D / H, where H denotes the number of heads in the MHA layer, D is the height of the input tensor 520, and D / H is the depth of the input tensor 520. The MatMul operation 500 on the input tensor 510 and input tensor 520 results in an output tensor 530. The spatial size of the output tensor 530 may be 1 x S x D / H. In FIG. 5, the input tensor 510, input tensor 520, and output tensor 530 are shown as 2D matrices. In other embodiments, the input tensor 510, input tensor 520, or output tensor 530 may be a 3D tensor.

[0085] The transformer network may be compiled by a compiler before it is executed. The compiler may generate configuration parameters that can be used by an NPU to execute the neural network operations in the transformer network. In an implementation without dynamic scaling, the compiler may generate a workload descriptor defining a single workload to execute the MatMul operation 500. In implementations with dynamic scaling, the compiler may break the MatMul operation 500 into multiple workloads. For the purpose of illustration, the compiler breaks the MatMul operation 500 into four workloads in the embodiments of FIG. 5 by partitioning the input tensor 510 into four subtensors 515A-515D (collectively referred to as "subtensors 515" or "subtensor 515") in its spatial dimension. Each of the four workloads corresponds to a respective one of the subtensors 515. The spatial dimension of the subtensors 515 may be the same and may be a configuration parameter determined by the compiler. This configuration parameter may be referred to as primitive sequence length. The primitive sequence length may indicate the number of tokens to be processed in a single workload. The compiler may determine the primitive sequence length based on one or more hardware configurations of the NPU, such as the number of PEs in the NPU, the arrangement of the PEs, and so on. The compiler may determine a desirable primitive sequence length that can maximize the utilization of the PEs in the NPU for executing the MatMul operation 500.

[0086] The compiler may generate workload descriptors for the four workloads. The workload descriptor for each workload may include a PRIMITIVE ID inserted for the input tensor 510. For instance, the PRIMITIVE ID of the workload for the subtensor 515A may be denoted as PRIMITIVE.A = 0, the PRIMITIVE ID of the workload for the subtensor 515B may be denoted as PRIMITIVE.A = 1, the PRIMITIVE ID of the workload for the subtensor 515C may be denoted as PRIMITIVE.A = 2, and the PRIMITIVE ID of the workload for the subtensor 515D may be denoted as PRIMITIVE.A = 3. For N primitives, this would be N workloads.

[0087] The NPU may support reading a scaled sequence length from memory and use this scaled sequence length to determine whether to process or skip a workload based on the PRIMITIVE ID in the workload. The scaled sequence length may be the length of the input sequence received by the transformer network, which may be equal to or smaller than the maximum sequence length that the transformer network can support. In some embodiments, the NPU may execute the following pseudo code to determine whether to process or skip a workload:

IF (SP * PRIMITIVE.A > SCALED_SEQ_LENGTH)
    SKIP WORKLOAD
ELSE
    EXECUTE WORKLOAD

where SP represents the primitive sequence length, PRIMITIVE.A is the primitive ID of the workload under consideration, and SCALED_SEQ_LENGTH represents the scaled sequence length. The NPU may always execute the workload for the subtensor 515A as PRIMITIVE.A of the workload is 0 and SP * PRIMITIVE.A would always be smaller than SCALED_SEQ_LENGTH. In an example where the maximum sequence length is 512, the primitive sequence length is 128, and the scaled sequence length is 196, the NPU may also execute the workload for the subtensor 515B as SP * PRIMITIVE.A < SCALED_SEQ_LENGTH but may skip the workload for the subtensor 515C as SP * PRIMITIVE.A > SCALED_SEQ_LENGTH. Similarly, the NPU may determine not to execute the workload for the subtensor 515D as SP * PRIMITIVE.A > SCALED_SEQ_LENGTH. In this example, the first 2 primitives are executed, even though part of the second primitive does not contain valid data. The result is an output tensor 535, the height of which is the scaled sequence length rounded up to the nearest primitive boundary.
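
The skip-or-execute decision for the example above (maximum sequence length 512, primitive sequence length 128, scaled sequence length 196) can be reproduced with a short sketch. It models only the comparison in the pseudo code, not an actual NPU; the function and variable names are hypothetical.

```python
from typing import List


def select_workloads(max_seq_len: int, primitive_len: int, scaled_seq_len: int) -> List[int]:
    """Return the primitive IDs whose workloads would be executed."""
    num_primitives = max_seq_len // primitive_len
    executed = []
    for primitive_a in range(num_primitives):
        if primitive_len * primitive_a > scaled_seq_len:
            continue  # SKIP WORKLOAD
        executed.append(primitive_a)  # EXECUTE WORKLOAD
    return executed


executed_ids = select_workloads(max_seq_len=512, primitive_len=128, scaled_seq_len=196)
output_height = len(executed_ids) * 128  # height rounded up to a primitive boundary
print(executed_ids, output_height)
```

Primitives 0 and 1 are executed and primitives 2 and 3 are skipped, so the output height is 256 rather than 512.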

[0088] In some embodiments, the compiler allocates memory for all primitives as the scaled sequence length is not known at compile time, so processing the unused half of the second primitive would not cause any issues. In some embodiments, the layer processing time decreases linearly with sequence length. The decrease in processing time may be close to linear when a relatively large number (e.g., 8) of primitives are used and may be less even when a relatively small number (e.g., 4) of primitives are used. Regardless of the number of primitives, the actual processing time can be the same as or close to the ideal processing time and give a significant performance uplift compared to the implementation without dynamic shapes.
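
The effect of the number of primitives can be illustrated with a rough model. The sketch assumes, purely for illustration, that processing time is proportional to the number of executed primitives and that the executed count is the scaled sequence length rounded up to a primitive boundary; the sequence lengths are made-up values and no measured timings are implied.

```python
import math


def executed_fraction(max_seq_len: int, num_primitives: int, scaled_seq_len: int) -> float:
    """Fraction of primitives executed when rounding up to a primitive boundary."""
    primitive_len = max_seq_len // num_primitives
    executed = math.ceil(scaled_seq_len / primitive_len)
    return executed / num_primitives


ideal = 300 / 512  # fraction of the maximum length that holds valid data
with_4 = executed_fraction(512, 4, 300)
with_8 = executed_fraction(512, 8, 300)
print(ideal, with_4, with_8)
```

With 8 primitives the executed fraction (0.625) is closer to the ideal fraction (about 0.586) than with 4 primitives (0.75), because the rounding step is finer.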

[0089] FIG. 6 illustrates a dynamic scaling mode of an intermediate MatMul operation 600, in accordance with various embodiments. The intermediate MatMul operation 600 may be a MatMul in an MHA layer of a transformer network. In some embodiments, the intermediate MatMul operation 600 is operated as a convolution. Examples of the intermediate MatMul operation 600 may include the MatMul 440 in FIG. 4.

[0090] The intermediate MatMul operation 600 includes a MatMul operation on two input tensors: input tensor 610 and input tensor 620. In some embodiments, a dot product is performed between each row of the input tensor 610 and each column of the input tensor 620 to generate a single point in the output tensor 630. The spatial size of the input tensor 610 may be 1 x S x D / H, where S denotes the spatial dimension of the input tensor 610, which is the height of the input tensor 610 and equals the maximum sequence length that the transformer network can support, D denotes the model dimension, H denotes the number of heads in the MHA layer, and D / H is the depth of the input tensor 610. The spatial size of the input tensor 620 may also be 1 x S x D / H. The intermediate MatMul operation 600 on the input tensor 610 and input tensor 620 results in an output tensor 630. The spatial size of the output tensor 630 may be 1 x S x S. In FIG. 6, the input tensor 610, input tensor 620, and output tensor 630 are shown as 2D matrices. In other embodiments, the input tensor 610, input tensor 620, or output tensor 630 may be a 3D tensor. In some embodiments, the intermediate MatMul operation 600 may include a scaling operation in addition to the MatMul operation. The output of the MatMul operation may be scaled down.

[0091] The transformer network may be compiled by a compiler before it is executed. The compiler may generate configuration parameters that can be used by an NPU to execute the neural network operations in the transformer network. In an implementation without dynamic scaling, the compiler may generate a workload descriptor defining a single workload to execute the intermediate MatMul operation 600. In implementations with dynamic scaling, the compiler may break the intermediate MatMul operation 600 into multiple workloads. In the embodiments of FIG. 6, the compiler breaks the intermediate MatMul operation 600 by partitioning the input tensor 610 in the spatial dimension and partitioning the output tensor 630 in the spatial dimension. In some embodiments, the compiler may generate two workload descriptors for each workload: a PRIMITIVE ID ("PRIMITIVE.A") inserted for the input tensor 610 and a PRIMITIVE ID ("PRIMITIVE.B") inserted for the input tensor 620. For N primitives, this results in $N^2$ workloads. In the example shown in FIG. 6, there are four primitives in each input tensor, and there are 16 workloads in total. In other embodiments, PRIMITIVE.A and PRIMITIVE.B for a workload may constitute a single workload descriptor.

[0092] The compiler partitions the input tensor 610 into four subtensors 615A-615D (collectively referred to as "subtensors 615" or "subtensor 615") and partitions the input tensor 620 into four subtensors 625A-625D (collectively referred to as "subtensors 625" or "subtensor 625") in the spatial dimension. Each of the 16 workloads corresponds to a respective one of the subtensors 615 and a respective one of the subtensors 625. The spatial dimension of the subtensors 615 (or the subtensors 625) may be the same and may be a configuration parameter determined by the compiler. This configuration parameter may be referred to as primitive sequence length. The compiler may determine the primitive sequence length based on one or more hardware configurations of the NPU, such as the number of PEs in the NPU, the arrangement of the PEs, and so on. The compiler may determine a desirable primitive sequence length that can maximize the utilization of the PEs in the NPU for executing the intermediate MatMul operation 600.

[0093] The NPU may support reading a scaled sequence length from memory and using this scaled sequence length to determine whether to process or skip a workload based on the PRIMITIVE ID in the workload. The scaled sequence length may be the length of the input sequence received by the transformer network, which may be equal to or smaller than the maximum sequence length that the transformer network can support. In some embodiments, the NPU may execute pseudo code to determine whether to process or skip a workload. Taking the workload for the subtensor 615A and the subtensor 625A for example, the NPU may execute the following pseudo code:

    IF (SP * PRIMITIVE.A >= SCALED_SEQ_LENGTH)
        SKIP WORKLOAD
    ELSE IF (SP * PRIMITIVE.B >= SCALED_SEQ_LENGTH)
        SKIP WORKLOAD
    ELSE
        EXECUTE WORKLOAD

where SP represents the primitive sequence length, PRIMITIVE.A is the primitive ID for the input tensor 610, PRIMITIVE.B is the primitive ID for the input tensor 620, and SCALED_SEQ_LENGTH represents the scaled sequence length.

[0094] The NPU may always execute the workload for the subtensor 615A and subtensor 625A as both PRIMITIVE.A and PRIMITIVE.B are 0. In an example where the maximum sequence length is 512, the primitive sequence length is 128, and the scaled sequence length is 196, the NPU may also execute the workload for the subtensor 615A and subtensor 625B, the workload for the subtensor 615B and subtensor 625A, and the workload for the subtensor 615B and subtensor 625B. The NPU may skip the other 12 workloads. The NPU may produce an output tensor 635, the height and depth of which are the scaled sequence length rounded up to the nearest primitive boundary. The depth dimension on the output tensor 635 is scaled. The output tensor 635 may be the input tensor to a Softmax function in the next layer of the transformer network.
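The skip test and the worked example above (maximum sequence length 512, primitive sequence length 128, scaled sequence length 196) can be checked with a short Python sketch. The function name `should_execute` is an assumption introduced for illustration; the comparisons mirror the pseudo code.

```python
def should_execute(primitive_a, primitive_b, sp, scaled_seq_length):
    """Mirror the two-ID skip test: skip when either primitive starts past the scaled length."""
    if sp * primitive_a >= scaled_seq_length:
        return False  # SKIP WORKLOAD
    if sp * primitive_b >= scaled_seq_length:
        return False  # SKIP WORKLOAD
    return True  # EXECUTE WORKLOAD


SP = 128
N = 512 // SP  # four primitives per input tensor
executed = [(a, b) for a in range(N) for b in range(N)
            if should_execute(a, b, SP, 196)]
skipped_count = N * N - len(executed)
```

Here `executed` contains the four pairs `(0, 0)`, `(0, 1)`, `(1, 0)` and `(1, 1)`, and `skipped_count` is 12, matching the example.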

[0095] In some embodiments, the scaled sequence length may not align to a primitive boundary, as shown in FIG. 6. To allow the primitives to be consumed by the Softmax function in the NPU on the next layer, output locations in the depth dimension which are greater than the scaled sequence length may be overwritten by the maximum negative value. This overwrite of output data in the depth dimension may be implemented by the following code that is executable by the NPU:

    IF (SP * PRIMITIVE.B + OUTPUT_CHANNEL <= SCALED_SEQ_LENGTH)
        DO NOT OVERWRITE OUTPUT DATA
    ELSE
        OVERWRITE OUTPUT DATA WITH MAX NEGATIVE VALUE
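The following Python sketch illustrates why overwriting with the maximum negative value makes the padded channels harmless to a subsequent Softmax: their exponentials underflow to zero. Several details are assumptions made for illustration only: the helper names, the use of `-sys.float_info.max` as a stand-in for the NPU's maximum negative value, and the choice to count OUTPUT_CHANNEL from 1 within a primitive (the source does not state the indexing convention; with this choice the `<=` comparison keeps exactly SCALED_SEQ_LENGTH channels).

```python
import math
import sys

MAX_NEGATIVE = -sys.float_info.max  # stand-in for the NPU's maximum negative value


def overwrite_padding(row, primitive_b, sp, scaled_seq_length):
    """Apply the overwrite rule to one primitive-wide slice of output channels."""
    out = []
    for i, value in enumerate(row):
        output_channel = i + 1  # assumption: channels counted from 1 within the primitive
        if sp * primitive_b + output_channel <= scaled_seq_length:
            out.append(value)          # DO NOT OVERWRITE OUTPUT DATA
        else:
            out.append(MAX_NEGATIVE)   # OVERWRITE WITH MAX NEGATIVE VALUE
    return out


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


# SP = 4 and scaled length 6: the second primitive (PRIMITIVE.B = 1) holds channels 5 to 8.
masked = overwrite_padding([1.0, 2.0, 3.0, 4.0], primitive_b=1, sp=4, scaled_seq_length=6)
probs = softmax(masked)
```

In this small case the first two channels keep their values, the last two are overwritten, and the overwritten channels receive a Softmax weight of exactly zero.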

[0096] In some embodiments, the ideal processing time may decrease quadratically with sequence length. The actual processing time with dynamic scaling can be significantly less compared to the case without dynamic scaling, even though it may be some way off the ideal processing time. As the compiler generates N^2 workloads, the software and hardware overhead per workload can be the reason that there is a difference between the actual and ideal processing time.

[0097] FIG. 7 illustrates another dynamic scaling mode of the intermediate MatMul operation 600, in accordance with various embodiments. Different from the embodiments of FIG. 6 in which the compiler breaks the intermediate MatMul operation 600 based on the spatial dimension of both the input tensor 610 and input tensor 620, the compiler breaks the intermediate MatMul operation 600 based on the spatial dimension of input tensor 610 in the embodiments of FIG. 7. The compiler partitions the input tensor 610 into four subtensors 615A-615D (collectively referred to as "subtensors 615" or "subtensor 615") but does not partition the input tensor 620. For each workload, the compiler may generate a workload descriptor including a PRIMITIVE ID inserted for tensor A. For N primitives, this results in N workloads. For the purpose of simplicity and illustration, FIG. 7 shows four primitives, which result in four workloads.

[0098] The NPU may execute the following code to determine whether to execute or skip a workload:

    IF (SP * PRIMITIVE.A >= SCALED_SEQ_LENGTH)
        SKIP WORKLOAD
    ELSE
        EXECUTE WORKLOAD

In the example shown in FIG. 7, the first two primitives are executed and the other two are skipped by the NPU. The intermediate MatMul operation 600 produces the output tensor 635. The output tensor 635 may be input into a Softmax function in the transformer network. The NPU may overwrite the output tensor 635 in the depth dimension by executing the following code:

    IF (OUTPUT_CHANNEL <= SCALED_SEQ_LENGTH)
        DO NOT OVERWRITE OUTPUT DATA
    ELSE
        OVERWRITE OUTPUT DATA WITH MAX NEGATIVE VALUE

[0099] In the embodiments of FIG. 7, the depth dimension on the output tensor 635 may be scaled but there may be no primitives in this dimension. To reduce unnecessary computation, the output depth dimension in the workload descriptor may be scaled based on SCALED_SEQUENCE_LENGTH. The output depth dimension may be scaled using the following equation:

    DESCRIPTOR_OP_WORKLOAD_DEPTH = roundup(SCALED_SEQUENCE_LENGTH / SP) * SP

where DESCRIPTOR_OP_WORKLOAD_DEPTH denotes the depth of the output tensor, and SP denotes the primitive sequence length. Additional output channel(s) beyond the end of the scaled sequence length may be written, for instance in embodiments with a hardware constraint of the NPU where the number of output channels needs to be a multiple of 16.
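As a rough sketch of the single-ID mode of FIG. 7, the snippet below evaluates the skip test per primitive and the rounded-up output depth. The function names are hypothetical, and the 512/128/196 values are borrowed from the earlier example rather than stated for FIG. 7.

```python
def scaled_workload_depth(scaled_seq_length, sp):
    """roundup(SCALED_SEQUENCE_LENGTH / SP) * SP, using integer arithmetic."""
    return (scaled_seq_length + sp - 1) // sp * sp


def executed_primitives(max_seq_len, sp, scaled_seq_length):
    """IDs of the primitives that pass the single-ID skip test."""
    return [p for p in range(max_seq_len // sp) if sp * p < scaled_seq_length]


depth = scaled_workload_depth(196, 128)        # rounds 196 up to 256
primitives = executed_primitives(512, 128, 196)  # primitives 0 and 1 are executed
```

With these inputs two of the four primitives are executed and the output depth becomes 256, which is also a multiple of 16.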

[0100] In some embodiments, the reduction in processing time in the dynamic scaling mode illustrated in FIG. 7 can track the ideal uplift more closely compared to the dynamic scaling mode illustrated in FIG. 6 and is closer to quadratic speed up compared to the mode without dynamic shape. For this dynamic shape implementation, the compiler generates N workloads - so the per workload overhead can be smaller, which can lead to a performance that is closer to the ideal performance.

[0101] The dynamic scaling mode illustrated in FIG. 6 can result in more workloads, where each workload processes a smaller amount of data. The dynamic scaling mode illustrated in FIG. 7 can result in fewer workloads where each workload processes a larger amount of data. Both methods may perform the same function. In some embodiments, the more optimal method may be chosen from the two dynamic scaling modes during characterization.
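To make the trade-off between the two modes concrete, the sketch below counts compiled and executed workloads for each mode. The function name and the dictionary layout are illustrative assumptions; the counts follow directly from the skip tests, since a primitive with ID p passes when SP * p is smaller than the scaled sequence length.

```python
def workload_counts(max_seq_len, sp, scaled_seq_length):
    """Compare workload counts for the two-ID (FIG. 6) and one-ID (FIG. 7) modes."""
    n = max_seq_len // sp                    # primitives per partitioned tensor
    k = (scaled_seq_length + sp - 1) // sp   # primitives that pass the skip test
    return {
        "fig6": {"compiled": n * n, "executed": k * k},
        "fig7": {"compiled": n, "executed": k},
    }


counts = workload_counts(512, 128, 196)
```

For a maximum length of 512, a primitive length of 128 and a scaled length of 196, the FIG. 6 mode compiles 16 workloads and executes 4, while the FIG. 7 mode compiles 4 and executes 2. Which mode performs better on a given NPU would still have to be measured during characterization.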

[0102] FIG. 8 illustrates dynamic scaling of a Softmax operation 800, in accordance with various embodiments. The Softmax operation 800 may be part of a Softmax function in an MHA layer of a transformer network. For instance, the Softmax operation 800 may be part of the Softmax 460 in FIG. 4. The compute for the Softmax function may scale quadratically. For an input sequence of length n, there may be n^2 compute operations. When the sequence length is scaled, the saving on compute may likewise be on the order of n^2. There may be two processing modes required from the NPU for dynamic scaling of the Softmax function: a Softmax summation and a Softmax elementwise operation. There are other features required for implementing Softmax on the NPU, such as numerical scaling, exponent and divide, but these do not need to be dynamically scaled and are beyond the scope of this document.
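A small arithmetic sketch of the quadratic relationship, reusing the 512 and 196 sequence lengths from the earlier MatMul example. The operation count is an idealized n * n that ignores constant factors and the padding to the primitive boundary; the function name is an assumption.

```python
def quadratic_ops(seq_len):
    """Idealized operation count for a sequence of the given length."""
    return seq_len ** 2


full_ops = quadratic_ops(512)    # compiled for the maximum sequence length
scaled_ops = quadratic_ops(196)  # actual input sequence length
saving_fraction = 1 - scaled_ops / full_ops
```

Under these assumptions the scaled case needs 38,416 operations instead of 262,144, a reduction of roughly 85 percent.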

[0103] In some embodiments, the Softmax operation 800 is a Softmax summation. The Softmax operation 800 may be compiled by using a tensor 810. The tensor 810 has a spatial dimension and a depth dimension. The spatial dimension and depth dimension may both equal the maximum sequence length of the transformer network. The Softmax operation 800 may include a summation in the depth dimension on each spatial point in the tensor 810. The spatial size of the output tensor of the Softmax operation 800 may be the same as the spatial size of the tensor 810. During inference of the transformer network, the actual tensor input into the Softmax operation 800 may be a tensor 815, which is smaller than the tensor 810. The tensor 815 has a depth 801, which is the scaled sequence length that is shorter than the maximum sequence length. The spatial dimension of the tensor 815 may also equal the scaled sequence length. As described above, the tensor 815 may be padded from the scaled sequence length to the nearest primitive boundary. In the example shown in FIG. 8, the primitive sequence length is represented by SP, and the tensor 815 is padded with a padding length 802 so that the depth dimension equals two primitive sequence lengths. The NPU may pad the tensor 815 by adding zeros to the right edge of the tensor 815.

[0104] In some embodiments, there are two aspects to the scaling for the Softmax operation 800. Firstly, in the spatial dimension workloads may be skipped based on a comparison of the primitive ID against the scaled sequence length as follows:

    IF (SP * PRIMITIVE.A >= SCALED_SEQ_LENGTH)
        SKIP WORKLOAD
    ELSE
        EXECUTE WORKLOAD

Secondly, in the depth dimension the input depth is modified. Consistent with the output in the previous layer, the input tensor depth is modified as follows:

    DESCRIPTOR_IP_WORKLOAD_DEPTH = roundup(SCALED_SEQUENCE_LENGTH / SP) * SP

where DESCRIPTOR_IP_WORKLOAD_DEPTH denotes the depth of the input tensor, and SP denotes the primitive sequence length. This dynamic scaling mode can be closer to quadratic speed up compared to the mode without dynamic shape.
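The two aspects can be combined in a short sketch of a scaled summation over a tensor compiled for the maximum sequence length. The nested-list tensor, the function name, and the use of `None` to mark rows of skipped workloads are assumptions made for illustration.

```python
def softmax_summation(tensor, sp, scaled_seq_length):
    """Sum each spatial row over the scaled depth, skipping whole primitives when possible."""
    max_len = len(tensor)
    # DESCRIPTOR_IP_WORKLOAD_DEPTH = roundup(SCALED_SEQUENCE_LENGTH / SP) * SP
    depth = (scaled_seq_length + sp - 1) // sp * sp
    sums = [None] * max_len  # None marks rows that belong to skipped workloads
    for primitive_a in range(max_len // sp):
        if sp * primitive_a >= scaled_seq_length:
            continue  # SKIP WORKLOAD
        for row in range(sp * primitive_a, sp * (primitive_a + 1)):
            sums[row] = sum(tensor[row][:depth])  # EXECUTE WORKLOAD
    return sums


# Maximum length 8, SP = 2, scaled length 3: valid data is 3 x 3, the rest is zero padding.
tensor = [[1.0 if r < 3 and c < 3 else 0.0 for c in range(8)] for r in range(8)]
sums = softmax_summation(tensor, sp=2, scaled_seq_length=3)
```

Only the first two primitives (rows 0 to 3) are summed, and each sum covers four channels instead of eight.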

[0105] FIG. 9 illustrates dynamic scaling of another Softmax operation 900, in accordance with various embodiments. The Softmax operation 900 may be a Softmax elementwise operation. It may be part of a Softmax function in an MHA layer of a transformer network. For instance, the Softmax operation 900 may be part of the Softmax 460 in FIG. 4. In some embodiments, elementwise operations are required for addition and multiplication operations in the Softmax function. As shown in FIG. 9, the Softmax operation 900 is divided into multiple workloads, where each workload has two primitive IDs in the workload descriptor. The division of the Softmax operation 900 may be performed by a compiler using a tensor 910. The tensor 910 may represent an input tensor of the Softmax operation 900. The output tensor of the Softmax operation 900 may have the same spatial size as the tensor 910. The tensor 910 may have a spatial dimension and a depth dimension, each of which equals the maximum sequence length of the transformer network. The two primitive IDs for a workload may correspond to the two dimensions, respectively, of the tensor 910.

[0106] During inference of the transformer network, the NPU may determine whether to skip or execute each elementwise workload based on the following conditions:

    IF (SP * PRIMITIVE.A >= SCALED_SEQ_LENGTH)
        SKIP WORKLOAD
    ELSE IF (SP * PRIMITIVE.B >= SCALED_SEQ_LENGTH)
        SKIP WORKLOAD
    ELSE
        EXECUTE WORKLOAD

[0107] FIG. 10 illustrates dynamic scaling of a final MatMul operation 1000, in accordance with various embodiments. The final MatMul operation 1000 may be a MatMul operation in an MHA layer of a transformer network. For instance, the final MatMul operation 1000 may be an example of the MatMul 450 in FIG. 4.

[0108] In some embodiments, a compiler may compile the final MatMul operation 1000 based on the maximum sequence length of the transformer network. As shown in FIG. 10, the final MatMul operation 1000 may be compiled based on an input tensor 1010 and input tensor 1020. The spatial size of the input tensor 1010 is 1 x S x S, where S denotes the maximum sequence length. The spatial size of the input tensor 1020 may be 1 x S x D / H, where D denotes the model dimension, and H denotes the number of heads in the MHA layer. The final MatMul operation 1000 on the input tensor 1010 and input tensor 1020 produces an output tensor 1030, the spatial size of which is 1 x S x D / H. In some embodiments, the final MatMul operation 1000 may have scaling on both the spatial and depth dimension of the input tensor 1010 and on the depth dimension on the input tensor 1020. The compiler may partition the final MatMul operation 1000 into multiple workloads, where each workload has two primitive IDs in the workload descriptor. The two primitive IDs of a workload may correspond to the input tensor 1010 and input tensor 1020, respectively. For instance, the primitive ID corresponding to the input tensor 1010 may be referred to as PRIMITIVE.A, and the primitive ID corresponding to the input tensor 1020 may be referred to as PRIMITIVE.B. The primitive sequence length in the input tensor 1010 or the input tensor 1020 is shown as SP in FIG. 10.

[0109] The actual input tensors of the final MatMul operation 1000 may have different spatial sizes from the input tensor 1010 and input tensor 1020. For instance, during an execution of the transformer network, the final MatMul operation 1000 has an input tensor 1015 and input tensor 1025. The spatial dimension and depth dimension of the input tensor 1015 may both equal a scaled sequence length. The depth dimension of the input tensor 1025 may also equal the scaled sequence length, while the spatial dimension of the input tensor 1025 is D / H. When the NPU executes the final MatMul operation 1000 using the input tensor 1015 and input tensor 1025, the NPU may determine to skip or execute each of the workloads based on the following conditions:

    IF (SP * PRIMITIVE.A >= SCALED_SEQ_LENGTH)
        SKIP WORKLOAD
    ELSE
        EXECUTE WORKLOAD

where SCALED_SEQ_LENGTH denotes the scaled sequence length. In some embodiments, for any executed workload, additional scaling may be performed in the input depth dimension on both the input tensor 1010 and input tensor 1020. That can further reduce the amount of compute and improve efficiency.

[0110] The input depth dimension may be scaled as follows:

    DESCRIPTOR_IP_WORKLOAD_DEPTH = roundup(SCALED_SEQ_LENGTH / SP) * SP

In some embodiments, channels beyond the scaled sequence length are written with max negative value and are processed by the NPU during the execution of the final MatMul operation 1000. In an example, the maximum sequence length is 512, the primitive sequence length is 128, and the scaled sequence length is 196. Channels 0-195 have valid data, and channels 196-255 are padded with max negative values in the previous layers. The NPU may process channels 0-255. Channels 256-511 may be considered invalid so may not be processed.
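The channel ranges in this example follow from the rounding formula, as the sketch below shows. The function name and the three category labels are assumptions chosen for illustration; channel indices are zero-based, as in the example.

```python
def classify_channels(max_seq_len, sp, scaled_seq_length):
    """Split the channel indices into valid, padded, and unprocessed ranges."""
    # DESCRIPTOR_IP_WORKLOAD_DEPTH = roundup(SCALED_SEQ_LENGTH / SP) * SP
    processed_depth = (scaled_seq_length + sp - 1) // sp * sp
    return {
        "valid": range(0, scaled_seq_length),
        "padded": range(scaled_seq_length, processed_depth),
        "unprocessed": range(processed_depth, max_seq_len),
    }


channels = classify_channels(512, 128, 196)
```

For these inputs the valid range is 0 to 195, the padded range is 196 to 255, and the unprocessed range is 256 to 511.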

[0111] FIG. 11 is a block diagram of an AI system 1100, in accordance with various embodiments. The whole AI system 1100 or a part of the AI system 1100 may be implemented in one or more computing devices, such as the computing device 2100 in FIG. 21. The AI system 1100 can generate and execute DNNs, such as transformer-based models (e.g., the transformer networks described above), convolution-based models, and so on. As shown in FIG. 11, the AI system 1100 includes a DNN module 1101 and an NPU 1102. In other embodiments, alternative configurations, different or additional components may be included in the AI system 1100. For instance, the AI system 1100 may include multiple DNN modules or multiple DNN accelerators. Further, functionality attributed to a component of the AI system 1100 may be accomplished by a different component included in the AI system 1100 or a different system. In some embodiments, the DNN module 1101 and NPU 1102 may include different types of processing units. In an example, the DNN module 1101 may be implemented by one or more CPUs or graphics processing units (GPUs). The NPU 1102 may also be referred to as a neural processing unit, AI accelerator, or AI processor. The DNN module 1101 and NPU 1102 may be implemented in the same chip or separate chips.

[0112] The DNN module 1101 facilitates generation and deployment of DNNs. In some embodiments, the DNN module 1101 may generate and train DNNs. For instance, the DNN module 1101 can define the layered architecture of a DNN. The DNN module 1101 can also determine the internal parameters of the DNN through a DNN training process. The DNN module 1101 may also determine one or more hyperparameters that define how the DNN is trained. An example hyperparameter is a sparsity ratio that defines the sparsity level of one or more deep learning tensors for the DNN.

[0113] The DNN module 1101 may deploy trained, compressed, or validated DNNs for use in AI applications. In some embodiments, the DNN module 1101 may distribute trained, compressed, or validated DNNs to devices or systems which may use the DNNs to perform tasks (e.g., image classification, motion planning, etc.) for which the DNNs were trained. In other embodiments, the DNN module 1101 may facilitate deployment of the DNNs using the NPU 1102. For instance, the DNN module 1101 may receive data from a device or system coupled with the AI system 1100 and input the received data (or data generated by the DNN module 1101, e.g., based on the received data) into a DNN. The DNN module 1101 may generate instructions (e.g., configuration files) that control the operation of the NPU 1102 during the DNN execution. The DNN module 1101 may receive an output of the DNN from the NPU 1102. The DNN module 1101 may transmit the output of the DNN (or a result of processing the output of the DNN by the DNN module 1101) to the device or system. In some embodiments, the DNN module 1101 may control execution processes of trained, compressed, or validated DNNs. The DNN module 1101 may function as a compiler for DNNs executed by the NPU 1102. The DNN module 1101 may perform compilation of DNNs and generate compilation descriptors, based on which the DNNs may be executed.

[0114] The DNN module 1101 may generate executable transformer models. The DNN module 1101 may also facilitate execution of transformer models by the NPU 1102. For instance, the DNN module 1101 may be the host for the execution of neural network operations in transformer models by the NPU 1102, e.g., the host for transformer model inference. In some embodiments, the DNN module 1101 may receive an inference request, which may be a request to have a transformer model make a prediction based on input data. The DNN module 1101 may facilitate cached inference of the transformer model, in which attention tensors (e.g., key tensors, value tensors, etc.) may be cached and reused in the inference of the transformer model. The inference for making the prediction may include a sequence of inference stages, which generates a sequence of predicted tokens. The sequence of predicted tokens may be the prediction of the transformer model.

[0115] In some embodiments, the DNN module 1101 may facilitate dynamic scaling of transformer networks based on input sequence length. For instance, the DNN module 1101 may facilitate dynamic scaling of MHA layers in transformer networks. The DNN module 1101 may identify MatMul operations and Softmax operations in the MHA layer. The DNN module 1101 may generate configuration parameters (e.g., workload descriptors) for the identified operations based on the maximum sequence length that the transformer network can support. The DNN module 1101 may determine a plurality of workloads for an operation by dividing an input tensor of the operation into a plurality of subtensors. The input tensor may have a dimension that equals the maximum sequence length. The dimension of each subtensor may be referred to as a primitive sequence length. The maximum sequence length may be a multiple of the primitive sequence length. The DNN module 1101 may determine the primitive sequence length based on one or more configurations of the NPU 1102. A workload may be for processing a single subtensor. A workload descriptor for the workload may be a primitive ID that identifies the subtensor. In some embodiments, the DNN module 1101 may generate multiple workload descriptors for a single workload. For instance, the DNN module 1101 may divide the workload of executing an operation into a plurality of workloads by partitioning multiple input tensors of the operation. Each of the plurality of workloads may be a workload of processing a subtensor in each of the input tensors. In an example of the operation having two input tensors, a workload may have two workload descriptors for identifying the corresponding subtensor in each of the two input tensors, respectively. The DNN module 1101 may provide the workload descriptors to the NPU 1102 for the NPU 1102 to execute the operation through dynamic scaling. Certain aspects of the DNN module 1101 are provided below in conjunction with FIG. 12.

[0116] The NPU 1102 executes DNNs provided by the DNN module 1101. For instance, the NPU 1102 can execute a transformer network by running neural network operations in the transformer network. The process of carrying out a neural network operation is also referred to as a process of executing the neural network operation or a process of performing the neural network operation. The execution of the DNN may be for training the DNN or for using the DNN to perform AI tasks. The NPU 1102 may be an example of the NPUs described above.

[0117] As shown in FIG. 11, the NPU 1102 includes a memory 1110, a DMA (direct memory access) engine 1120, and compute blocks 1130 (individually referred to as "compute block 1130"). In other embodiments, alternative configurations, different or additional components may be included in the NPU 1102. For example, the NPU 1102 may include more than one memory 1110 or DMA engine 1120. As another example, the NPU 1102 may include a single compute block 1130. As yet another example, the NPU 1102 may include one or more digital signal processors. Further, functionality attributed to a component of the NPU 1102 may be accomplished by a different component included in the NPU 1102 or by a different module or system (e.g., the DNN module 1101). A component of the NPU 1102 may be implemented in hardware, software, firmware, or some combination thereof.

[0118] The memory 1110 stores data associated with neural network operations performed by the NPU 1102. The memory 1110 may be a system memory. In some embodiments, the memory 1110 includes a dynamic random-access memory (DRAM). When the NPU 1102 executes operations in transformer models, at least part of the memory 1110 may be used to implement KV caches, such as self-attention KV caches and cross-attention KV caches, in the transformer models. The KV caches may be updated during inference of the transformer model. Layout of data in the KV caches may be determined to optimize the efficiency of the DNN accelerator 1101. The memory 1110 may also store sparsity masks. A sparsity mask may be a sparsity tensor that indicates a sparsity pattern in an input tensor of a neural network operation. As an example, a sparsity mask may indicate the sparsity pattern in a vector along the input channel dimension of an input tensor. The sparsity mask may be a sparsity bitmap that includes one or more zero bits and one or more one bits. Some sparsity masks may be used for distributing different segments of the tensor to different PEs for improving PE utilization.

[0119] In some embodiments, the memory 1110 may store data to be used by the compute blocks 1130 for DNN execution. The memory 1110 may store weights, such as weights of convolutional layers, which are determined by training DNNs. The memory 1110 may further store inputs to DNN layers or outputs of DNN layers, such as data generated by the compute blocks 1130 from performing neural network operations in DNNs. Example neural network operations include convolutions (also referred to as "convolutional operations"), layer normalization operations, Softmax operations, MatMul operations, pooling operations, elementwise operations, activation functions, other types of neural network operations, or some combination thereof. The memory 1110 may be a main memory of the NPU 1102. In some embodiments, the memory 1110 includes one or more DRAMs.

[0120] The DMA engine 1120 facilitates data transfer between the memory 1110 and local memories of the compute blocks 1130. For example, the DMA engine 1120 can read data from the memory 1110 and write data into a local memory of a compute block 1130. As another example, the DMA engine 1120 can read data from a local memory of a compute block 1130 and write data into the memory 1110. The DMA engine 1120 provides a DMA feature that allows the compute block 1130 to initiate data transfer between the memory 1110 and the local memories of the compute blocks 1130 and to perform other operations while the data transfer is being conducted. In some embodiments, the DMA engine 1120 may read tensors from the memory 1110 and modify the tensors in a way that is optimized for the compute block 1130 before it writes the tensors into the local memories of the compute blocks 1130.

[0121] The compute blocks 1130 perform neural network operations in DNNs. For instance, a compute block 1130 may execute a DNN layer (e.g., an MHA layer) by running one or more neural network operations (e.g., MatMul operations, Softmax operations, etc.) in the DNN layer. The compute blocks 1130 may be capable of running various types of neural network operations, such as MatMul operation, Softmax operation, convolution, pooling, elementwise operation, linear operation, nonlinear operation, and so on. Deep learning operations performed by the compute blocks 1130 include tensor operations, i.e., operations whose inputs are tensors or operations whose outputs are tensors. In an example, the compute block 1130 receives one or more input tensors and performs a MatMul or Softmax operation. The result of the operation may be an output tensor, which can be further computed, e.g., by the compute block 1130 or another compute block 1130. A compute block 1130 may execute a layer, or a portion of a layer, at a time. In some embodiments, the operations of the DNN layers may be run by multiple compute blocks 1130 in parallel. For instance, multiple compute blocks 1130 may each perform a portion of a workload for a neural network operation. Data may be shared between the compute blocks 1130. A compute block 1130 may also be referred to as an NPU, a compute block, or a compute tile.

[0122] In the embodiments of FIG. 11, each compute block 1130 includes a local memory 1140, a dynamic scaling module 1150, a load module 1160, a processing engine 1170, a post-processing engine 1180, and a drain module 1190. Some or all the components of the compute block 1130 can be implemented on the same chip. In other embodiments, alternative configurations, different or additional components may be included in the compute block 1130. Further, functionality attributed to a component of the compute block 1130 may be accomplished by a different component included in the compute block 1130, a different compute block 1130, another component of the NPU 1102, or a different system. A component of the compute block 1130 may be implemented in hardware, software, firmware, or some combination thereof.

[0123] The local memory 1140 is local to the corresponding compute block 1130. In the embodiments of FIG. 11, the local memory 1140 is inside the compute block 1130. In other embodiments, the local memory 1140 may be outside the compute block 1130. Data in the local memory 1140 may be transferred to or from the memory 1110, e.g., through the DMA engine 1120. For instance, KV caches may be copied from the memory 1110 to the local memory 1140. In some embodiments, data in the local memory 1140 may be transferred to or from the local memory of another compute block 1130. The local memory 1140 may store data received, used, or generated by the dynamic scaling module 1150, the load module 1160, the processing engine 1170, the post-processing engine 1180, or the drain module 1190. Examples of the data may include input activations, weights, output activations, sparsity bitmaps, and so on.

[0124] In some embodiments, the local memory 1140 may store tensors to be processed by the processing engine 1170 or the post-processing engine 1180. The tensors may be input tensors of neural network operations. The local memory 1140 may also store tensors generated by the processing engine 1170 or the post-processing engine 1180. The tensors may be output tensors of neural network operations. In some embodiments, the local memory 1140 may store dense tensors (e.g., dense activation tensors, dense weight tensors, etc.), sparse tensors (e.g., sparse activation tensors, sparse weight tensors, etc.), and so on. A dense tensor may be a tensor from which zero-valued elements (if any) are not removed. A dense tensor may be converted to a sparse tensor by removing one or more zero-valued elements in the dense tensor. A sparse tensor may also be referred to as a compressed tensor or packed tensor. The process of converting a dense tensor to a sparse tensor may be referred to as sparsity encoding. Sparsity encoding may also generate a sparsity tensor. Each element in the sparsity tensor may correspond to a different element in the dense tensor and indicate whether the element in the dense tensor is zero or not. The sparsity tensor may indicate positions of elements of the sparse tensor in the dense tensor. The sparsity tensor may be a sparsity bitmap, each element of which is a bit. A sparse tensor may be converted to a dense tensor through a densifying process, in which one or more zeros may be added to the sparse tensor based on the sparsity tensor.
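Sparsity encoding and densifying can be sketched on a one-dimensional tensor as follows. The function names are hypothetical, and the sketch assumes that every zero-valued element is removed, which is one possible choice since the text only requires that one or more zeros be removed.

```python
def sparsity_encode(dense):
    """Return (sparse tensor, sparsity bitmap) for a flat list of numbers."""
    bitmap = [1 if value != 0 else 0 for value in dense]  # one bit per dense element
    sparse = [value for value in dense if value != 0]     # zeros removed (assumption: all)
    return sparse, bitmap


def densify(sparse, bitmap):
    """Rebuild the dense tensor by re-inserting zeros where the bitmap holds a 0 bit."""
    values = iter(sparse)
    return [next(values) if bit else 0 for bit in bitmap]


dense = [0, 3, 0, 0, 7, 1, 0]
sparse, bitmap = sparsity_encode(dense)
restored = densify(sparse, bitmap)
```

Here `sparse` is `[3, 7, 1]`, `bitmap` is `[0, 1, 0, 0, 1, 1, 0]`, and `restored` equals the original dense tensor.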

[0125] In some embodiments, the local memory 1140 includes one or more SRAMs. The local memory 1140 may be byte-addressable, and each memory address identifies a single byte (eight bits) of storage. In some embodiments, the local memory 1140 may include memory banks. The number of data banks in the local memory 1140 may be 16, 64, 128, 256, 512, 1024, 2048, or other numbers. A memory bank may include a plurality of storage units. In an example, a data bank may include 8, 16, 64, or a different number of storage units. A memory bank or a storage unit in a memory bank may have a memory address. In an example, a storage unit may store a single byte, and data larger than a single byte may be stored in storage units with consecutive memory addresses, i.e., adjacent storage units. For instance, a storage unit can store an integer number in the INT8 format, whereas two storage units may be needed to store a number in the FP16 or BF16 format, which has 16 bits. In some embodiments, 16 bits can be transferred from the local memory 1140 in a single read cycle. In other embodiments, 16 bits can be transferred from the local memory 1140 in multiple read cycles, such as two cycles.
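The byte counts mentioned above can be reproduced with Python's standard `struct` module. This is only an illustration of the data formats, not of the NPU's memory: the `store` helper, the base address, and the little-endian byte order are assumptions, and the BF16 value is obtained by truncating an FP32 encoding because `struct` has no BF16 format code.

```python
import struct


def store(base_address, payload):
    """Map each byte of the payload to its own storage unit at consecutive addresses."""
    return {base_address + offset: byte for offset, byte in enumerate(payload)}


int8_payload = struct.pack("<b", -5)    # INT8: one byte, one storage unit
fp16_payload = struct.pack("<e", 1.5)   # FP16 (IEEE 754 half precision): two bytes
# BF16 keeps the upper 16 bits of the FP32 encoding (truncation, no rounding).
bf16_payload = struct.pack("<f", 1.5)[2:]

layout = store(0x100, fp16_payload)     # two adjacent storage units
```

The FP16 value occupies the two adjacent addresses `0x100` and `0x101`, while the INT8 value would fit in a single storage unit.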

[0126] The dynamic scaling module 1150 performs dynamic scaling of neural network operations, such as operations in MHA layers of transformer networks. In some embodiments, the dynamic scaling module 1150 receives configuration parameters from the DNN module 1101 and performs dynamic scaling based on the configuration parameters. The configuration parameters for a neural network operation may include workload descriptors that define workloads for executing the neural network operation. The workload descriptors may define a predetermined number of workloads, such as workloads identified by a compiler. The configuration parameters may also include one or more primitive sequence lengths. The dynamic scaling module 1150 may identify the scaled sequence length of the neural network operation and determine which one(s) of the workloads should be executed based on the scaled sequence length. The scaled sequence length may be the length of an input sequence received by the transformer network and may be shorter than the maximum sequence length of the transformer network. For each of the workloads identified by the compiler, the dynamic scaling module 1150 may determine to skip or execute the workload based on the workload descriptor(s), the primitive sequence length(s), and the scaled sequence length. In some embodiments, the dynamic scaling module 1150 may execute pseudo code (such as the pseudo code described above) using the workload descriptor(s) to make the determination.

[0127] In some embodiments, the dynamic scaling module 1150 may pad one or more input tensors of the neural network operation. For instance, the dynamic scaling module 1150 may add one or more zeros into an input tensor to make a dimension of the input tensor be a multiple of a primitive sequence length. The dynamic scaling module 1150 may pad the input tensor either before or after it determines which workload(s) to execute. After determining to execute a workload, the dynamic scaling module 1150 may instruct the load module 1160 to load data for the workload to the processing engine 1170 or post-processing engine 1180. The dynamic scaling module 1150 may also instruct the processing engine 1170 or post-processing engine 1180 to perform computations in the workload.
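A minimal sketch of zero padding to a primitive boundary, assuming the tensor is a list of rows and that zeros are appended at the end of each row. The function name `pad_rows` is hypothetical.

```python
def pad_rows(tensor, sp):
    """Append zeros to each row so that its length becomes a multiple of sp."""
    padded = []
    for row in tensor:
        target = (len(row) + sp - 1) // sp * sp  # next multiple of sp
        padded.append(list(row) + [0] * (target - len(row)))
    return padded


padded = pad_rows([[1, 2, 3], [4, 5, 6]], sp=2)
```

Each three-element row gains one trailing zero, and a row whose length is already a multiple of `sp` is returned unchanged.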

[0128] The load module 1160 loads data from the local memory 1140 to the processing engine 1170 or to the post-processing engine 1180. The load module 1160 may read tensors from the local memory 1140. The tensors may include sparsity masks, query tensors, key tensors, value tensors, activation tensors, weight tensors, and so on. In some embodiments, the load module 1160 may load data based on determinations made by the dynamic scaling module 1150. For instance, the load module 1160 may load tensors padded by the dynamic scaling module 1150 to the processing engine 1170 or post-processing engine 1180. A dimension of a padded tensor may be a multiple of a primitive sequence length determined by the DNN module 1101.

[0129] In some embodiments, the load module 1160 may select different data to transmit to the processing engine 1170 in different sparsity modes. For instance, in the combined sparsity mode, the load module 1160 may transmit both an activation sparsity tensor and a weight sparsity tensor of a layer to the processing engine 1170. In the activation sparsity mode, the load module 1160 may transmit the activation sparsity tensor but not the weight sparsity tensor to the processing engine 1170. In the weight sparsity mode, the load module 1160 may transmit the weight sparsity tensor but not the activation sparsity tensor to the processing engine 1170. In the dense mode, the load module 1160 does not transmit either the activation sparsity tensor or the weight sparsity tensor to the processing engine 1170. In some embodiments, for the purpose of sparsity acceleration, one input tensor of a MatMul operation may be treated as an activation tensor, and the other input tensor of the MatMul operation may be treated as a weight tensor.
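The four sparsity modes amount to a small selection table. The sketch below encodes that table; the enum `SparsityMode` and the function `tensors_to_transmit` are hypothetical names introduced only for illustration.

```python
from enum import Enum

class SparsityMode(Enum):
    DENSE = "dense"
    ACTIVATION = "activation"
    WEIGHT = "weight"
    COMBINED = "combined"

def tensors_to_transmit(mode):
    """Return which sparsity tensors are sent to the processing engine."""
    send_activation = mode in (SparsityMode.ACTIVATION, SparsityMode.COMBINED)
    send_weight = mode in (SparsityMode.WEIGHT, SparsityMode.COMBINED)
    return {"activation_sparsity": send_activation, "weight_sparsity": send_weight}

for mode in SparsityMode:
    print(mode.value, tensors_to_transmit(mode))
```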

[0130] The processing engine 1170 performs operations in DNNs. The processing engine 1170 may accelerate neural network operations based on sparsity in data. In some embodiments, the processing engine 1170 may operate in a dense mode in which sparsity acceleration is not performed. The processing engine 1170 may include one or more processing cells. In some embodiments, the processing cells may be arranged in one or more rows and one or more columns in the processing engine 1170. Each processing cell may include PEs that may be arranged in an array that includes rows and columns. All the PEs in the processing engine 1170 may constitute a bigger array that includes more rows and columns.

[0131] An example PE may be or may include one or more multiply-accumulate (MAC) units that can perform MAC operations. In some embodiments (e.g., embodiments where the compute block 1130 executes a convolutional layer), a computation in an MAC unit may be an MAC operation on an activation operand and a weight operand. The activation operand may be an activation tensor that may include one or more activations in the input tensor of the convolution. Different activations may be in different ICs. The weight operand may be a weight tensor that may include one or more weights in the filter of the convolution. The values of the weights are determined through training the DNN. The weights in the weight operand may be in different ICs.

[0132] In some embodiments, an MAC unit includes one or more multipliers for performing multiplications. An MAC unit may also include one or more accumulators ("adders") for performing accumulations. A column of MAC units is referred to as an MAC column. An MAC column may be associated with one or more MAC lanes. An MAC lane is a path for loading data, e.g., by the load module 1160, into an MAC column. An MAC lane may also be referred to as a data transmission lane or data loading lane. An MAC column may have multiple MAC lanes. The loading bandwidth of the MAC column is an aggregation of the loading bandwidths of all the MAC lanes associated with the MAC column. With a certain number of MAC lanes, data can be fed into the same number of independent MAC units simultaneously. In some embodiments where an MAC column has four MAC lanes for feeding activations or weights into the MAC column and each MAC lane has a bandwidth of 16 bytes, the four MAC lanes have a total loading bandwidth of 64 bytes.

[0133] In some embodiments, a sequence of multiplications produces a product operand that includes a sequence of products. The MAC operation may also include accumulations in which multiple product operands are accumulated to produce an output operand of the MAC unit. The processing engine 1170 may output multiple output operands at a time, each of which is generated by a different MAC unit. In a standard convolution, MAC operations may include accumulations across the channels. For instance, as opposed to generating an output operand, an MAC unit may accumulate products across different channels to generate a single output point.

[0134] In some embodiments, the processing engine 1170 may perform MAC operations in quantized neural network operations, such as MAC operations in a quantized MatMul operation. In some embodiments, an MAC unit in the processing engine 1170 may receive quantized data elements as inputs and compute a quantized MAC result. The quantized MAC result may be a quantized value in an integer format and may be the output of the MAC unit. In some embodiments, the MAC unit may also include a quantization multiplier that can multiply a quantization scale with the quantized MAC result, and the output of the MAC unit may be a real value in a floating-point format. The MAC unit may include no quantization subtractors as zero-point offsetting is not needed for the MAC operations in quantized neural network operations.
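A minimal sketch of such a quantized MAC operation follows. It assumes integer inputs with no zero-point offset, as stated above, and an optional single quantization scale applied to the integer accumulator. The function name `quantized_mac` and the example values are hypothetical.

```python
def quantized_mac(activations, weights, scale=None):
    """Multiply-accumulate over integer operands, optionally rescaled to a real value."""
    acc = 0
    for a, w in zip(activations, weights):
        acc += a * w  # integer product accumulated in integer format
    if scale is None:
        return acc  # quantized (integer) MAC result
    return acc * scale  # real-valued output after the quantization multiplier

print(quantized_mac([1, 2, 3], [4, 5, 6]))             # 32
print(quantized_mac([1, 2, 3], [4, 5, 6], scale=0.5))  # 16.0
```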

[0135] In some embodiments, the processing engine 1170 may include sparsity acceleration logic for facilitating sparsity acceleration. For instance, each processing cell in the processing engine 1170 may include one or more sparsity modules. In an example, each MAC column or each MAC row may have a corresponding sparsity module that accelerates MAC operations in the MAC column or MAC row. In some embodiments, a sparsity module accelerates computations in the processing engine 1170 based on sparsity in activations, sparsity in weights, or both. The sparsity module may include a storage unit that stores a sparsity tensor, which may be loaded to the storage unit by the load module 1160. The sparsity tensor may be an activation sparsity tensor, a weight sparsity tensor, or a combined sparsity tensor.

[0136] An activation sparsity tensor may be the sparsity tensor of an activation tensor and has the same number of elements as the activation tensor. An element in the activation sparsity tensor may indicate whether the corresponding element in the activation tensor is zero or not. For instance, a zero-valued element in the activation sparsity tensor may indicate that the corresponding element in the activation tensor is zero. A one-valued element in the activation sparsity tensor may indicate that the corresponding element in the activation tensor is nonzero. A weight sparsity tensor may be the sparsity tensor of a weight tensor and has the same number of elements as the weight tensor. An element in the weight sparsity tensor may indicate whether the corresponding element in the weight tensor is zero or not. For instance, a zero-valued element in the weight sparsity tensor may indicate that the corresponding element in the weight tensor is zero. A one-valued element in the weight sparsity tensor may indicate that the corresponding element in the weight tensor is nonzero. The sparsity module may generate a combined sparsity tensor using an activation sparsity tensor and a weight sparsity tensor. For instance, the sparsity module may multiply an element of the activation sparsity tensor with a corresponding element of the weight sparsity tensor to compute an element of the combined sparsity tensor. The positions of the three elements in their corresponding sparsity tensors may match. In some embodiments, each element in a sparsity tensor may be a bit, and the sparsity tensor may be referred to as a sparsity bitmap.
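The relationship between the three bitmaps can be sketched as follows, using flat Python lists in place of tensors. The function names and the example values are illustrative assumptions.

```python
def sparsity_bitmap(tensor):
    """One bit per element: 1 if the element is nonzero, 0 if it is zero."""
    return [1 if x != 0 else 0 for x in tensor]

def combined_bitmap(activation_bits, weight_bits):
    """Element-wise product of two bitmaps at matching positions."""
    return [a * w for a, w in zip(activation_bits, weight_bits)]

activations = [3, 0, 5, 0]
weights = [2, 7, 0, 0]
act_bits = sparsity_bitmap(activations)   # [1, 0, 1, 0]
wgt_bits = sparsity_bitmap(weights)       # [1, 1, 0, 0]
print(combined_bitmap(act_bits, wgt_bits))  # [1, 0, 0, 0]
```

A bit in the combined bitmap is one only where both the activation and the weight are nonzero, which is exactly where a multiplication contributes to the result.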

[0137] The sparsity module may use the sparsity tensor to identify activations and weights to be used in MAC operations by the MAC units. In an embodiment where the processing engine 1170 operates in the combined sparsity mode, the sparsity module may identify activations and weights that correspond to nonzero valued elements of a combined sparsity tensor. In an embodiment where the processing engine 1170 operates in the activation sparsity mode, the sparsity module may identify activations and weights that correspond to nonzero valued elements of an activation sparsity tensor. In an embodiment where the processing engine 1170 operates in the weight sparsity mode, the sparsity module may identify activations and weights that correspond to nonzero valued elements of a weight sparsity tensor. The sparsity module may be bypassed in the dense mode as no sparsity acceleration would be conducted.
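As a rough illustration of how a sparsity tensor gates MAC operations, the sketch below performs a multiplication only where the selected bitmap holds a nonzero element. The function name `sparse_mac` and the returned count of performed multiplications are assumptions added to make the effect of each mode visible.

```python
def sparse_mac(activations, weights, bitmap):
    """Accumulate products only at positions where the bitmap bit is set."""
    acc = 0
    performed = 0
    for a, w, bit in zip(activations, weights, bitmap):
        if bit:
            acc += a * w
            performed += 1
    return acc, performed

activations = [3, 0, 5, 0]
weights = [2, 7, 0, 0]
print(sparse_mac(activations, weights, [1, 1, 1, 1]))  # dense mode: (6, 4)
print(sparse_mac(activations, weights, [1, 0, 1, 0]))  # activation sparsity: (6, 2)
print(sparse_mac(activations, weights, [1, 0, 0, 0]))  # combined sparsity: (6, 1)
```

The accumulated value is the same in every mode because the skipped products are zero; only the number of multiplications changes.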

[0138] The post-processing engine 1180 processes outputs of the processing engine 1170. The post-processing engine 1180 may include one or more post-processing elements (PPEs). In some embodiments, the PPEs in the post-processing engine 1180 may be arranged in an array that has rows and columns. In some embodiments, the post-processing engine 1180 computes activation functions. The post-processing engine 1180 may receive outputs of the processing engine 1170 as inputs to the activation functions. In addition or alternative to activation functions, the post-processing engine 1180 may perform other types of post processing on outputs of the processing engine 1170. For instance, the post-processing engine 1180 may apply a bias on an output of the processing engine 1170. In some embodiments, the post-processing engine 1180 may be bypassed for certain neural network operations. In some embodiments, the post-processing engine 1180 may include one or more adders. An adder may accumulate data elements computed by the processing engine 1170 to compute an output data element of a neural network operation.

[0139] The drain module 1190 drains data from the processing engine 1170 or from the post-processing engine 1180. The drain module may write the data to the local memory 1140. The drained data may be tensors, such as output tensors of neural network operations. In some embodiments, the drain module 1190 may drain data on a cell level. For each processing cell, the drain module 1190 may drain outputs of PEs in the processing cell based on a row index or column index of each PE. For instance, the drain module 1190 may use a sequence of cycles to drain data from a processing cell. The drain module 1190 may drain the output of some of the PEs in each cycle. The sequence of the cycles may be configured based on a configuration parameter indicating the operation mode of the load module 1160.

[0140] In some embodiments, the drain module 1190 includes sparsity encoding logic that can convert outputs of the processing engine 1170 from a dense format to a sparse format. For instance, the drain module 1190 may be implemented with one or more sparsity encoders. A sparsity encoder converts dense data to compressed data based on sparsity in the dense data. For instance, the sparsity encoder may remove zeros in an activation tensor computed by the processing engine 1170 to convert the activation tensor to a compressed activation tensor. The sparsity encoder may also generate sparsity tensors, including activation sparsity tensors.

[0141] In some embodiments, the data drained from the processing engine 1170 may be at least part of an output tensor of a neural network operation. The sparsity encoder may generate a compressed version of the output tensor. The sparsity encoder may identify every zero-valued activation in the output tensor and remove these activations from the output tensor to generate a compressed activation tensor (aka "sparse activation tensor"). The sparsity encoder may also generate one or more sparsity tensors for the output tensor. A sparsity tensor may correspond to a portion of the output tensor. The sparsity tensor may include sparsity elements (e.g., bits), each of which corresponds to a different activation in the vector and indicates whether the corresponding activation is zeroed or not.
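A minimal sketch of this dense-to-sparse conversion is shown below, together with the inverse step, to show that the compressed tensor and the bitmap jointly preserve the original data. The function names `sparsity_encode` and `sparsity_decode` are hypothetical, and flat lists stand in for tensors.

```python
def sparsity_encode(dense):
    """Drop zeros and record their positions in a bitmap."""
    bitmap = [1 if x != 0 else 0 for x in dense]
    compressed = [x for x in dense if x != 0]
    return compressed, bitmap

def sparsity_decode(compressed, bitmap):
    """Rebuild the dense data from the compressed values and the bitmap."""
    values = iter(compressed)
    return [next(values) if bit else 0 for bit in bitmap]

dense = [0, 4, 0, 0, 7, 1]
compressed, bitmap = sparsity_encode(dense)
print(compressed, bitmap)  # [4, 7, 1] [0, 1, 0, 0, 1, 1]
print(sparsity_decode(compressed, bitmap) == dense)  # True
```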

[0142] The drain module 1190 may write the compressed activation tensor and the one or more sparsity tensors into the local memory 1140. The sparse activation tensor and the one or more sparsity tensors may be further loaded to the memory 1110, e.g., through the DMA engine 1120. Additionally or alternatively, the sparse activation tensor and the one or more sparsity tensors may be loaded by the load module 1160 to the processing engine 1170 for further computation, e.g., for performing a neural network operation in the next layer. In some embodiments, the post-processing engine 1180 and drain module 1190 may be located on a drain path of the compute block 1130.

[0143] FIG. 12 is a block diagram of a DNN module 1200, in accordance with various embodiments. The DNN module 1200 may be an embodiment of the DNN module 1101 in FIG. 11. As shown in FIG. 12, the DNN module 1200 includes an interface module 1210, a training module 1220, a compressing module 1230, a validating module 1240, a compiler 1250, and a datastore 1260. In other embodiments, alternative configurations, different or additional components may be included in the DNN module 1200. Further, functionality attributed to a component of the DNN module 1200 may be accomplished by a different component included in the DNN module 1200 or a different module or system.

[0144] The interface module 1210 facilitates communications of the DNN module 1200 with other modules or systems. For example, the interface module 1210 establishes communications between the DNN module 1200 and an external database to receive data that can be used to train DNNs or input into DNNs to perform tasks. As another example, the interface module 1210 transmits configuration parameters to the NPU 1102 for configuring components of the NPU 1102 for DNN execution. As yet another example, the interface module 1210 supports the DNN module 1200 to distribute DNNs to other systems, e.g., computing devices configured to apply DNNs to perform tasks.

[0145] The training module 1220 trains DNNs by using a training dataset. The training module 1220 forms the training dataset. In an example where the training module 1220 trains a DNN to recognize objects in images, the training dataset includes training images and training labels. The training labels describe ground-truth classifications of objects in the training images. In some embodiments, each label in the training dataset corresponds to an object in a training image. In some embodiments, a part of the training dataset may be used to initially train the DNN, and the rest of the training dataset may be held back as a validation subset used by the validating module 1240 to validate performance of a trained DNN. The portion of the training dataset not including the tuning subset and the validation subset may be used to train the DNN.

[0146] The training module 1220 also determines hyperparameters for training the DNN. Hyperparameters are variables specifying the DNN training process. Hyperparameters are different from parameters inside the DNN (e.g., weights of filters). In some embodiments, hyperparameters include variables determining the architecture of the DNN, such as number of hidden layers, etc. Hyperparameters also include variables which determine how the DNN is trained, such as batch size, number of epochs, etc. A batch size defines the number of training samples to work through before updating the parameters of the DNN. The batch size is the same as or smaller than the number of samples in the training dataset. The training dataset can be divided into one or more batches. The number of epochs defines how many times the entire training dataset is passed forward and backwards through the entire network, i.e., the number of times that the deep learning algorithm works through the entire training dataset. One epoch means that each training sample in the training dataset has had an opportunity to update the parameters inside the DNN. An epoch may include one or more batches. The number of epochs may be 1, 10, 100, 1000, or even larger.

[0147] The training module 1220 defines the architecture of the DNN, e.g., based on some of the hyperparameters. The architecture of the DNN includes an input layer, an output layer, and a plurality of hidden layers. The input layer of a DNN may include tensors (e.g., a multidimensional array) specifying attributes of the input image, such as the height of the input image, the width of the input image, and the depth of the input image (e.g., the number of bits specifying the color of a pixel in the input image). The output layer includes labels of objects in the input layer. The hidden layers are layers between the input layer and output layer. The hidden layers include one or more convolutional layers and one or more other types of layers, such as pooling layers, fully-connected layers, normalization layers, Softmax or logistic layers, and so on. The convolutional layers of the DNN abstract the input image to a feature map that is represented by a tensor specifying the feature map height, the feature map width, and the feature map channels (e.g., red, green, blue images include 3 channels). A pooling layer is used to reduce the spatial volume of the input image after convolution. It is used between two convolutional layers. A fully-connected layer involves weights, biases, and neurons. It connects neurons in one layer to neurons in another layer. It is used to classify images between different categories by training.

[0148] In the process of defining the architecture of the DNN, the training module 1220 also adds an activation function to a hidden layer or the output layer. An activation function of a layer transforms the weighted sum of the input of the layer to an output of the layer. The activation function may be, for example, a ReLU activation function, a tangent activation function, or other types of activation functions.

[0149] After the training module 1220 defines the architecture of the DNN, the training module 1220 inputs a training dataset into the DNN. The training dataset includes a plurality of training samples. An example of a training sample includes an object in an image and a ground-truth label of the object. The training module 1220 modifies the parameters inside the DNN ("internal parameters of the DNN") to minimize the error between labels of the training objects that are generated by the DNN and the ground-truth labels of the objects. The internal parameters include weights of filters in the convolutional layers of the DNN. In some embodiments, the training module 1220 uses a cost function to minimize the error.

[0150] The training module 1220 may train the DNN for a predetermined number of epochs. The number of epochs is a hyperparameter that defines the number of times that the deep learning algorithm would work through the entire training dataset. One epoch means that each sample in the training dataset has had an opportunity to update internal parameters of the DNN. After the training module 1220 finishes the predetermined number of epochs, the training module 1220 may stop updating the parameters in the DNN. The DNN having the updated parameters is referred to as a trained DNN.

[0151] The compressing module 1230 compresses DNNs. For instance, the compressing module 1230 may add pruning operations to DNN layers to reduce computational complexity or memory usage. A pruning operation may prune weight tensors of a DNN layer by changing one or more nonzero weights of the layer to zeros. The modification may be done before, during, or after training. Weights may be pruned during training, during inference, or a combination of both. The compressing module 1230 may determine a sparsity ratio for a DNN layer. The sparsity ratio may be a ratio of the number of zero-valued weights to the total number of weights in the layer. The compressing module 1230 may perform the pruning operation until the sparsity ratio of the DNN layer meets a target sparsity ratio, such as 10%, 20%, 30%, and so on.
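The sparsity ratio and the prune-until-target loop can be sketched as below. The order in which weights are pruned is not specified above; this sketch assumes smallest-magnitude weights are zeroed first, which is a common convention but only an assumption here. The function names are hypothetical.

```python
def sparsity_ratio(weights):
    """Fraction of weights in the layer that are zero."""
    return sum(1 for w in weights if w == 0) / len(weights)

def prune_to_target(weights, target_ratio):
    """Zero out weights, smallest magnitude first (assumed), until the target ratio is met."""
    pruned = list(weights)
    nonzero = [i for i, w in enumerate(pruned) if w != 0]
    for i in sorted(nonzero, key=lambda i: abs(pruned[i])):
        if sparsity_ratio(pruned) >= target_ratio:
            break
        pruned[i] = 0
    return pruned

layer = [0.9, -0.1, 0.4, 0.0, -0.7, 0.2, 0.05, 0.3, -0.6, 0.8]
pruned = prune_to_target(layer, target_ratio=0.3)
print(sparsity_ratio(layer), sparsity_ratio(pruned))  # 0.1 0.3
```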

[0152] In some embodiments, the compressing module 1230 may select a structured sparsity pattern for a DNN layer and prune weights of the DNN layer to reach the structured sparsity pattern. The structured sparsity pattern may be represented by a structured sparsity ratio N:M. In the pruning process, the compressing module 1230 may divide a kernel into weight blocks, each of which includes M consecutive weights. For each of the weight blocks, the compressing module 1230 may select N element(s) and change the value of the unselected element(s) in the weight block to zero. The compressing module 1230 may generate sparsity maps that indicate weight sparsity. In some embodiments, the compressing module 1230 may generate a sparsity map for each weight block. The sparsity map may include M sparsity elements corresponding to the M weights in the weight block. Each sparsity element may indicate whether the corresponding weight is zero or not. In some embodiments, the compressing module 1230 may compress sparsity maps to generate compressed maps. A compressed map has fewer elements than the sparsity map from which the compressed map is generated. The compressing module 1230 may write the sparsity maps or compressed maps into the memory 1110 or the local memory 1140.
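The N:M block pruning described above can be sketched as follows. How the N elements are selected is not stated; this sketch assumes the N largest-magnitude weights in each block are kept. The function name `prune_n_of_m` and the example kernel are hypothetical.

```python
def prune_n_of_m(weights, n, m):
    """Keep n weights per block of m consecutive weights; zero the rest.

    Returns the pruned weights and one sparsity map (m bits) per block.
    Selection by largest magnitude is an assumption of this sketch.
    """
    if len(weights) % m != 0:
        raise ValueError("number of weights must be a multiple of m")
    pruned, maps = [], []
    for start in range(0, len(weights), m):
        block = weights[start:start + m]
        ranked = sorted(range(m), key=lambda i: abs(block[i]), reverse=True)
        keep = set(ranked[:n])
        new_block = [w if i in keep else 0 for i, w in enumerate(block)]
        pruned.extend(new_block)
        maps.append([1 if w != 0 else 0 for w in new_block])
    return pruned, maps

kernel = [0.1, -0.9, 0.5, 0.2, 0.7, 0.0, -0.3, 0.6]
pruned, maps = prune_n_of_m(kernel, n=2, m=4)
print(pruned)  # [0, -0.9, 0.5, 0, 0.7, 0, 0, 0.6]
print(maps)    # [[0, 1, 1, 0], [1, 0, 0, 1]]
```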

[0153] In some embodiments, the compressing module 1230 may select one or more layers in a DNN and modify each selected layer with a pruning operation. For instance, the compressing module 1230 may select computationally complex layers, such as layers with large filters. For a pruning operation of a layer or of a type of layer, the compressing module 1230 may determine a weight threshold that would not cause a loss of the accuracy of the DNN to exceed an accuracy loss constraint. A pruning operation may modify weights having absolute values below the weight threshold to zeros and leave the other weights unchanged. The weight pruning can reduce memory storage as zero-valued weights may not be stored. Also, the number of operations in the layer can be reduced as computations on zero-valued weights can be skipped without impacting the output of the layer. In some embodiments, the compressing module 1230 may also measure energy saving, final DNN accuracy, or layer-wise sparsity caused by pruning operations.

[0154] After compressing a DNN, the compressing module 1230 may fine-tune the DNN, e.g., through a retraining process. The compressing module 1230 may fine-tune DNNs after weights are pruned. In some embodiments, the fine-tuning process is a retraining or further training process. For instance, after weights in a DNN are pruned, the compressing module 1230 may further train the DNN by inputting a training dataset into the DNN. The values of the unpruned weights in the DNN may be modified based on outputs of the DNN and ground-truth labels of the training samples in the training dataset. In some embodiments, the values of the pruned weights (i.e., zero) are not changed during the fine-tuning process. For instance, the compressing module 1230 may place a mask over a pruned weight block and the mask can prevent values in the pruned weight blocks from being changed during the fine-tuning process. In other embodiments, the values of all weights, including the pruned weights, may be changed during the fine-tuning process. After one or more cycles of retraining and weight changing by the compressing module 1230, the compressing module 1230 may perform a new pruning process, e.g., by selecting weight blocks and pruning the selected weight blocks. In some embodiments, the weight pruning process may be repeated multiple times before the fine-tuning process is done. In some embodiments, the number of epochs in the fine-tuning process may be different from the number of epochs in the training process in which the pre-pruning values of the weights are determined. For instance, the fine-tuning process may have fewer epochs than the training process. In an example, the number of epochs in the fine-tuning process may be relatively small, such as 2, 3, 12, and so on.
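The masking behavior during fine-tuning can be illustrated with a single update step. The sketch assumes a plain gradient-descent update and derives the mask from which weights are currently zero; the function name `masked_update`, the learning rate, and the gradient values are hypothetical.

```python
def masked_update(weights, grads, lr):
    """One gradient step in which pruned (zero) weights are held at zero."""
    # Mask bit is 0 for pruned weights and 1 for unpruned weights.
    mask = [0 if w == 0 else 1 for w in weights]
    return [w - lr * g * m for w, g, m in zip(weights, grads, mask)]

weights = [1.0, 0.0, -0.5]   # the middle weight has been pruned
grads = [0.5, 2.0, 0.5]      # assumed gradients from a training batch
print(masked_update(weights, grads, lr=0.5))  # [0.75, 0.0, -0.75]
```

Dropping the mask (using all ones) would correspond to the alternative embodiment in which pruned weights are also allowed to change.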

[0155] The validating module 1240 verifies accuracy of trained or compressed DNNs. In some embodiments, the validating module 1240 inputs samples in a validation dataset into a trained DNN and uses the outputs of the DNN to determine the model accuracy. In some embodiments, a validation dataset may be formed of some or all the samples in the training dataset. Additionally or alternatively, the validation dataset includes additional samples, other than those in the training sets. In some embodiments, the validating module 1240 may determine an accuracy score measuring the precision, recall, or a combination of precision and recall of the DNN. The validating module 1240 may use the following metrics to determine the accuracy score: Precision = TP / (TP + FP) and Recall = TP / (TP + FN), where TP is the number of true positives, FP is the number of false positives, and FN is the number of false negatives. Precision may be how many the DNN correctly predicted (TP) out of the total it predicted (TP + FP), and recall may be how many the DNN correctly predicted (TP) out of the total number of objects that did have the property in question (TP + FN). The F-score (F-score = 2 * P * R / (P + R), where P is precision and R is recall) unifies precision and recall into a single measure.
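The three metrics can be computed directly from the counts of true positives, false positives, and false negatives. The sketch below follows the formulas above; returning 0.0 when a denominator is zero is an assumption of this sketch, and the function name and example counts are hypothetical.

```python
def precision_recall_fscore(tp, fp, fn):
    """Compute precision, recall, and F-score from TP, FP, and FN counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    total = precision + recall
    fscore = 2 * precision * recall / total if total else 0.0
    return precision, recall, fscore

# Assumed example: 6 correct detections, 2 spurious detections, 6 missed objects.
p, r, f = precision_recall_fscore(tp=6, fp=2, fn=6)
print(p, r)  # 0.75 0.5
print(round(f, 3))
```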

[0156] The validating module 1240 may compare the accuracy score with a threshold score. In an example where the validating module 1240 determines that the accuracy score of the DNN is less than the threshold score, the validating module 1240 instructs the training module 1220 to re-train the DNN. In one embodiment, the training module 1220 may iteratively re-train the DNN until the occurrence of a stopping condition, such as the accuracy measurement indicating that the DNN may be sufficiently accurate, or a number of training rounds having taken place.

[0157] The compiler 1250 compiles information of DNNs to executable instructions that can be executed, e.g., by the NPU 1102, to carry out neural network operations in DNNs. In some embodiments, the compiler 1250 may generate a graph representing a DNN. The graph may include nodes and edges. A node may represent a specific neural network operation in the DNN. An edge may connect two nodes and represent a connection between the two corresponding neural network operations. In an example, an edge may encode a tensor that flows from one of the neural network operations to the other neural network operation. The tensor may be an output tensor of the first neural network operation and an input tensor of the second neural network operation. The edge may encode one or more attributes of the tensor, such as size, shape, storage format, and so on. The compiler 1250 may use the graph to generate executable DNNs. For instance, the compiler may generate computer program instructions (e.g., compilation descriptors) for executing DNNs. The instructions may be stored in registers associated with components of the NPU 1102.
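A graph of this kind can be sketched with two small data classes. The class names, field names, operation names, and tensor shape below are all hypothetical; they only illustrate nodes as operations and edges as tensors carrying attributes.

```python
from dataclasses import dataclass, field

@dataclass
class Edge:
    """A tensor flowing from one operation to another, with its attributes."""
    src: str
    dst: str
    shape: tuple
    storage_format: str = "dense"

@dataclass
class Graph:
    nodes: list = field(default_factory=list)
    edges: list = field(default_factory=list)

    def add_op(self, name):
        self.nodes.append(name)

    def connect(self, src, dst, shape, storage_format="dense"):
        self.edges.append(Edge(src, dst, shape, storage_format))

    def inputs_of(self, name):
        return [e for e in self.edges if e.dst == name]

# Assumed example: the MatMul -> Softmax -> MatMul chain of an attention head.
g = Graph()
for op in ("matmul_qk", "softmax", "matmul_av"):
    g.add_op(op)
g.connect("matmul_qk", "softmax", shape=(128, 128))
g.connect("softmax", "matmul_av", shape=(128, 128))
print([(e.src, e.shape) for e in g.inputs_of("softmax")])
# [('matmul_qk', (128, 128))]
```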

[0158] The compiler 1250 may generate configuration parameters that facilitate data read or data write, such as a configuration parameter that indicates the number of data elements to be processed (e.g., the number of data elements in a tile), a configuration parameter that indicates the memory address where an input data element may be fetched, a configuration parameter that indicates the memory address where an output data element may be stored, a configuration parameter that indicates the memory address where other configuration parameters may be stored, and so on.

[0159] In some embodiments, the compiler 1250 generates configuration parameters for dynamic scaling of transformer networks based on input sequence lengths. The compiler 1250 may identify MatMul operations and Softmax operations in an MHA layer of a transformer network. The compiler 1250 may generate configuration parameters (e.g., workload descriptors) for the identified operations based on the maximum sequence length that the transformer network can support. The transformer network may have been generated or trained for processing sequences with lengths up to the maximum sequence length.

[0160] The compiler 1250 may partition a workload for executing a neural network operation with the maximum sequence length into a plurality of workloads. The maximum sequence length is known before runtime, e.g., before inference of the transformer network. The compiler 1250 may divide an input tensor of the neural network operation into subtensors. Each workload may be for processing a respective one of the subtensors. The input tensor may have a dimension (e.g., spatial dimension or depth dimension) that equals the maximum sequence length. The corresponding dimension of each subtensor may be referred to as a primitive sequence length. The maximum sequence length may be a multiple of the primitive sequence length. The compiler 1250 may determine the primitive sequence length based on one or more configurations of the NPU 1102, such as the number of PEs in the processing engine 1170, the number of PEs in a PE row, the number of PEs in a PE column, and so on. The compiler 1250 may determine a desirable primitive sequence length that can maximize the utilization of PEs in the NPU 1102.
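The compile-time partitioning along the sequence dimension can be sketched as follows. The function name `partition_sequence`, the tuple layout, and the example lengths are assumptions; the sketch only shows the maximum sequence length being split into equal primitives.

```python
def partition_sequence(max_seq_len, primitive_len):
    """Split [0, max_seq_len) into primitives of primitive_len elements each.

    Returns (primitive_id, start, end) tuples with end exclusive.
    """
    if max_seq_len % primitive_len != 0:
        raise ValueError("maximum sequence length must be a multiple of the primitive length")
    count = max_seq_len // primitive_len
    return [(pid, pid * primitive_len, (pid + 1) * primitive_len) for pid in range(count)]

print(partition_sequence(512, 128))
# [(0, 0, 128), (1, 128, 256), (2, 256, 384), (3, 384, 512)]
```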

[0161] The compiler 1250 may generate workload descriptors that define the workloads. Workload descriptors may identify which subtensors are to be processed in the workloads. Subtensors may also be referred to as primitives, and a workload descriptor may be or include a primitive ID. The primitive ID identifies the corresponding subtensor in the input tensor. For instance, the primitive ID may indicate the position of the subtensor in the input tensor. In some embodiments, the compiler 1250 may generate multiple workload descriptors for a single workload. For example, the compiler 1250 may divide the neural network operation into a plurality of workloads by partitioning multiple input tensors (e.g., two input tensors) of the operation. A single workload may be for processing two subtensors from the two input tensors, respectively. The compiler 1250 may determine two primitive IDs for the two subtensors to be processed in the workload. As another example, the compiler 1250 may divide the neural network operation into a plurality of workloads by partitioning an input tensor on multiple dimensions (e.g., two dimensions). The compiler 1250 may determine two primitive IDs corresponding to the two dimensions, respectively. The two primitive IDs may identify a single subtensor in the input tensor.
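For the case where a workload carries two primitive IDs, a hedged sketch is given below. It assumes a tensor partitioned along two dimensions that both span the sequence length, so each workload descriptor is a pair of primitive IDs, and a workload is kept at runtime only if both primitives overlap the actual sequence. The dictionary keys and function names are hypothetical.

```python
import math
from itertools import product

def make_descriptors(max_seq_len, primitive_len):
    """One descriptor per subtensor, identified by a (row, column) pair of primitive IDs."""
    count = max_seq_len // primitive_len
    return [{"row_id": r, "col_id": c} for r, c in product(range(count), repeat=2)]

def active_descriptors(descriptors, primitive_len, scaled_len):
    """Keep descriptors whose primitives overlap the actual sequence in both dimensions."""
    needed = math.ceil(scaled_len / primitive_len)
    return [d for d in descriptors if d["row_id"] < needed and d["col_id"] < needed]

descriptors = make_descriptors(max_seq_len=512, primitive_len=128)
active = active_descriptors(descriptors, primitive_len=128, scaled_len=200)
print(len(descriptors), len(active))  # 16 4
```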

[0162] The compiler 1250 may provide workload descriptors to the NPU 1102 for executing the transformer network. As described above, the compiler 1250 generates the workload descriptors based on the maximum sequence length. The actual lengths of sequences input into the transformer network for inference may not be the maximum sequence length. The NPU 1102 can perform dynamic scaling to handle the difference, as described above.

[0163] The datastore 1260 stores data received, generated, used, or otherwise associated with the DNN module 1200. For example, the datastore 1260 stores the datasets used by the training module 1220, compressing module 1230, and validating module 1240, such as the hyperparameters for training DNNs, internal parameters of trained DNNs (e.g., weights, etc.), data for sparsity acceleration (e.g., sparsity masks, etc.), and so on. The datastore 1260 may also store data used by the compiler 1250 for compiling DNNs and configuration parameters generated by the compiler 1250. The datastore 1260 may include one or more memories. In the embodiment of FIG. 12, the datastore 1260 is a component of the DNN module 1200. In other embodiments, the datastore 1260 may be external to the DNN module 1200 and communicate with the DNN module 1200 through a network.

[0164] FIG. 13 illustrates an example sparse cell 1300, in accordance with various embodiments. The sparse cell 1300 may be at least part of a processing engine, e.g., the processing engine 1170 in FIG. 11. The sparse cell 1300 includes 16 MAC units 1310 (individually referred to as "MAC unit 1310"), which constitute an MAC array having four rows and four columns. The MAC array has a spatial shape of 4x4, meaning the height of the MAC array is four and the width of the MAC array is also four. The sparse cell 1300 also includes 16 weight register files 1320 (individually referred to as "weight register file 1320"), 16 activation register files 1330 (individually referred to as "activation register file 1330"), four row buffers 1340 (individually referred to as "row buffer 1340"), and sparsity modules 1366 (individually referred to as "sparsity module 1366"). In other embodiments, the sparse cell 1300 may include fewer, more, or different components. For example, the sparse cell 1300 may include a different number of MAC units 1310, weight register files 1320, activation register files 1330, row buffers 1340, or sparsity modules 1366. As another example, the sparse cell 1300 may include column buffers in lieu of or in addition to the row buffers 1340. Also, the shape (e.g., the height or width) of the MAC array may be different.

[0165] The MAC units 1310 are configured to perform MAC operations. Each MAC unit 1310 may include one or more multipliers and one or more adders. A multiplier may multiply an activation with a weight at a time to compute a product. In some embodiments (e.g., embodiments where the MAC unit 1310 includes multiple multipliers), the multipliers may operate simultaneously to process multiple activation-weight pairs and compute multiple products in one cycle. An adder may accumulate products computed by the multipliers. Even though not shown in FIG. 13, the sparse cell may include an adder tree including a plurality of adder tiers. The first tier may receive outputs of a plurality of MAC units 1310. The number of adders in the first tier may be half of the number of the MAC units 1310, and each adder may accumulate the outputs of two MAC units 1310. The second tier may receive outputs of adders in the first tier. The number of adders in the second tier may be half of the number of adders in the first tier, and each adder in the second tier may accumulate the outputs of two adders in the first tier. The adder tree may include one or more other tiers. The last tier may include a single adder that accumulates outputs of adders in the second last tier to compute a partial sum of the sparse cell 1300.
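
The tiered accumulation described above can be sketched in software as a pairwise reduction. This is an illustrative model only: the function name `adder_tree` is hypothetical, and passing an odd leftover value through to the next tier unchanged is an assumption that the text does not address (with 16 MAC units every tier is even).

```python
def adder_tree(values):
    """Reduce MAC outputs pairwise, tier by tier.

    Returns the final partial sum and the number of adders used in each tier.
    """
    tier = list(values)
    if not tier:
        raise ValueError("need at least one MAC output")
    adders_per_tier = []
    while len(tier) > 1:
        # Each adder accumulates the outputs of two neighbours from the previous tier.
        next_tier = [tier[i] + tier[i + 1] for i in range(0, len(tier) - 1, 2)]
        adders_per_tier.append(len(next_tier))
        if len(tier) % 2:
            # Assumption: an unpaired value is forwarded without an adder.
            next_tier.append(tier[-1])
        tier = next_tier
    return tier[0], adders_per_tier


# Sixteen MAC outputs, as in the 4x4 array of FIG. 13 (values are made up).
partial_sum, tiers = adder_tree(range(1, 17))
print(partial_sum, tiers)
```

For 16 inputs the tiers use 8, 4, 2, and 1 adders, matching the halving described in the paragraph.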

[0166] The weight register files 1320 store weights to be processed in MAC operations. In the embodiments of FIG. 13, four weight register files 1320 are grouped into a storage set that stores data to be used by a column of MAC units 1310. There are four storage sets corresponding to the four columns of MAC units 1310. In some embodiments, a weight register file 1320 may correspond to an MAC unit 1310 and store data to be processed by the MAC unit. In some embodiments, the four weight register files 1320 for a single column of MAC units 1310 constitute a data storage unit of the column.

[0167] The activation register files 1330 store activations to be processed in MAC operations. In the embodiments of FIG. 13, four activation register files 1330 are grouped into a storage set that stores data to be used by a row of MAC units 1310. There are four storage sets corresponding to the four rows of MAC units 1310. In some embodiments, an activation register file 1330 may correspond to an MAC unit 1310 and store data to be processed by the MAC unit. In some embodiments, the four activation register files 1330 for a single row of MAC units 1310 constitute a data storage unit of the row. The row buffers 1340 store outputs of the MAC units 1310. Each row buffer 1340 may drain outputs of a single row of MAC units 1310.

[0168] The sparsity module 1366 facilitates dynamic sparsity-based acceleration or workload distribution in the sparse cell 1300. In the embodiments of FIG. 13, each sparsity module 1366 includes a sparsity tensor storage unit 1365 and a control logic 1367. The sparsity tensor storage unit 1365 stores combined sparsity tensors. A combined sparsity tensor stored in the sparsity tensor storage unit 1365 may correspond to an activation tensor and a weight tensor. A nonzero element in the combined sparsity tensor may correspond to a nonzero activation-weight pair that includes a nonzero activation and a nonzero weight. The position of the nonzero activation in the activation tensor may match the position of the nonzero weight in the weight tensor. The product of the nonzero activation and nonzero weight would be nonzero.
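
One way to picture a combined sparsity tensor is as the elementwise AND of an activation bitmap and a weight bitmap. The sketch below assumes a one-bit-per-element bitmap encoding and flat operands; the function names are hypothetical and the text does not specify how the combined tensor is actually produced.

```python
def sparsity_bitmap(tensor):
    """Mark each nonzero element with 1 and each zero element with 0."""
    return [1 if x != 0 else 0 for x in tensor]


def combined_bitmap(activations, weights):
    """A position is set only when both the activation and the weight are nonzero."""
    if len(activations) != len(weights):
        raise ValueError("operands must have the same number of elements")
    act_bits = sparsity_bitmap(activations)
    wt_bits = sparsity_bitmap(weights)
    return [a & w for a, w in zip(act_bits, wt_bits)]


acts = [3, 0, 5, 0, 2, 7, 0, 1]  # made-up activation operand
wts = [4, 6, 0, 0, 1, 0, 0, 2]   # made-up weight operand
combined = combined_bitmap(acts, wts)
print(combined)
```

Only positions 0, 4, and 7 hold a nonzero activation-weight pair here, so only those three products can be nonzero.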

[0169] The control logic 1367 may control transmission of weights stored in the weight register files 1320 and activations stored in the activation register files 1330 to the MAC units 1310 based on sparsity tensors. For instance, the control logic 1367 may select a subset of the weights stored in the weight register files 1320 and select a subset of activations stored in the activation register files 1330 based on a sparsity tensor. The control logic 1367 may transmit the selected weights and activations to the MAC units 1310 for performing MAC operations. The other weights stored in the weight register files 1320 or the other activations stored in the activation register files 1330 are skipped from computation. In the embodiments of FIG. 13, each sparsity module 1366 controls sparsity acceleration or workload distribution in a respective MAC unit 1310. As the sparsity acceleration or workload distribution is based on both weight sparsity and activation sparsity, 16 sparsity modules 1366 are used for accelerating computations in the 16 MAC units 1310.

[0170] As shown in FIG. 13, the sparse cell 1300 is associated with multiplexers (MUXs) 1303, 1304, 1305, and 1306. In other embodiments, the sparse cell 1300 may be associated with a different number of MUXs or other devices. The MUX 1303 facilitates loading weights, e.g., from the local memory 1140, into the weight register files 1320. The MUX 1304 facilitates loading activations, e.g., from the local memory 1140, into the activation register files 1330. The MUX 1305 facilitates loading sparsity tensors into the sparsity tensor storage unit 1365. The MUX 1306 may be a drain MUX that can facilitate draining outputs of the MAC units 1310, e.g., to the local memory 1140.

[0171] FIG. 14 illustrates a sparse cell array 1400, in accordance with various embodiments. The sparse cell array 1400 may be an example of the processing engine 1170 in FIG. 11. In FIG. 14, the sparse cell array 1400 includes sparse cells 1410 (individually referred to as "sparse cell 1410") arranged in four columns and four rows, an activation memory 1420, and a weight memory 1430. The sparse cell array 1400 may also be referred to as a data processing unit. In other embodiments, the sparse cell array 1400 may include fewer, more, or different components. For instance, the sparse cell array 1400 may include a different number of columns, rows, or sparse cells 1410.

[0172] Each sparse cell 1410 may perform sparsity accelerated MAC operations. The sparse cells 1410 may facilitate dynamic sparsity mode. For instance, the sparsity modes of a sparse cell 1410 may be dynamically changed between a combined sparsity mode, an activation sparsity mode, a weight sparsity mode, and a dense mode. An embodiment of a sparse cell 1410 may be the sparse cell 1300 in FIG. 13. The activation memory 1420 stores activations, such as activations in input tensors of neural network operations. Activations may be loaded from the activation memory 1420 to sparse cells 1410. The weight memory 1430 stores weights, such as weights in filters of neural network operations. Weights may be loaded from the weight memory 1430 to sparse cells 1410. The activation memory 1420 or weight memory 1430 may be a buffer. In other embodiments, the sparse cell array 1400 may include a dense data memory and a sparse data memory in lieu of the activation memory 1420 and weight memory 1430. The dense data memory may store dense tensors, e.g., dense tensors generated by the load module 1760. The sparse data memory may store sparse tensors.

[0173] The sparse cell array 1400 may also execute matrix multiplications in attention layers of transformer models. In an example of a matrix multiplication operation on a query tensor and a key tensor, one of the activation memory 1420 and weight memory 1430 may be used to store the query tensor and the other one may be used to store the key tensor. In an example of a matrix multiplication operation on attention weights and a value tensor, one of the activation memory 1420 and weight memory 1430 may be used to store the attention weights and the other one may be used to store the value tensor.

[0174] FIG. 15 illustrates an example PE 1500, in accordance with various embodiments. The PE 1500 may be a unit component of a sparse cell, e.g., the sparse cell 1300 or the sparse cell 1410. In the embodiments of FIG. 15, the PE 1500 includes an MAC unit 1505, an activation register file 1510, a weight register file 1520, an output register file 1550, and a sparsity accelerator 1560. The MAC unit 1505 includes a multiplier 1530 and an adder 1540. In other embodiments, the PE 1500 may include fewer, more, or different components.

[0175] The activation register file 1510 stores an activation operand, which may be a context. The activation register file 1510 may be an example of the activation register files 1330 in FIG. 13. The weight register file 1520 stores a weight operand. The weight register file 1520 may be an example of the weight register files 1320 in FIG. 13. The activation operand and weight operand may be loaded from a memory (e.g., the memory 1140) into the activation register file 1510 and the weight register file 1520, respectively. The sparsity accelerator 1560 receives a sparsity bitmap 1515 that corresponds to the sparse tensor in the weight register file 1520. The sparsity bitmap 1515 may be a combined sparsity bitmap when the MAC unit 1505 operates in a combined sparsity mode. The sparsity bitmap 1515 may be an activation sparsity bitmap when the MAC unit 1505 operates in an activation sparsity mode. The sparsity bitmap 1515 may be a weight sparsity bitmap when the MAC unit 1505 operates in a weight sparsity mode. The sparsity bitmap 1515 may have the same size (e.g., the same number of elements) as or a larger size than the activation operand or the weight operand.

[0176] Using the sparsity bitmap 1515, the sparsity accelerator 1560 selects four activations from the activation register file 1510 and selects four weights from the weight register file 1520. The sparsity accelerator 1560 transmits the selected activations and weights to the multiplier 1530. These selected data elements correspond to the nonzero valued elements of the sparsity bitmap 1515. The four selected activations and the four selected weights may constitute four activation-weight pairs. The multiplier 1530 may compute a product based on each activation-weight pair and therefore, compute four products in total. The four products may be provided to the adder 1540. Even though FIG. 15 shows a single multiplier 1530, the MAC unit 1505 may include multiple multipliers that can perform multiple multiplication operations at the same time.

[0177] The adder 1540 accumulates the four products and computes a unit-level internal partial sum. The four unselected elements of the dense tensor are not processed to save power and time, which would not impact the value of the unit-level internal partial sum. For instance, when the dense tensor is a dense activation tensor, the weights corresponding to the unselected activations are zeros so the products of the unselected activations and the weights would all be zero and have no contribution to the unit-level internal partial sum or other partial sums computed by the sparse cell. Similarly, when the dense tensor is a dense weight tensor, the activations corresponding to the unselected weights are zeros so the products of the unselected weights and the activations would all be zero and have no contribution to the unit-level internal partial sum or other partial sums computed by the sparse cell. In other embodiments, the MAC unit 1505 may operate in a dense mode in which the sparsity bitmap 1515 is not used and the sparsity accelerator 1560 is inactive. The MAC unit 1505 may process all the activations in the activation operand and all the weights in the weight operand.
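
The claim that skipping unselected elements leaves the partial sum unchanged can be checked with a small model. This is a sketch under stated assumptions: the operands are flat lists of eight made-up values, the bitmap is a weight sparsity bitmap, and `sparse_mac` and `dense_mac` are hypothetical names, not components of the PE 1500.

```python
def sparse_mac(activations, weights, bitmap):
    """Multiply-accumulate only the positions flagged in the bitmap.

    Returns the partial sum and the number of multiplications performed.
    """
    selected = [(a, w) for a, w, bit in zip(activations, weights, bitmap) if bit]
    return sum(a * w for a, w in selected), len(selected)


def dense_mac(activations, weights):
    """Reference result: process every activation-weight pair."""
    return sum(a * w for a, w in zip(activations, weights))


activations = [3, 9, 5, 8, 2, 7, 6, 1]  # dense activation operand
weights = [4, 0, 0, 0, 1, 0, 3, 2]      # sparse weight operand, four nonzeros
weight_bitmap = [1 if w != 0 else 0 for w in weights]

partial_sum, multiplies = sparse_mac(activations, weights, weight_bitmap)
print(partial_sum, multiplies, dense_mac(activations, weights))
```

Four multiplications are performed instead of eight, and the result equals the dense-mode result because every skipped product has a zero weight.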

[0178] In some embodiments, the PE 1500 receives one or more PE-level internal partial sums from one or more other PEs. The adder 1540 or an accumulator (not shown in FIG. 15) can accumulate the one or more PE-level internal partial sums with the PE-level internal partial sum of the PE 1500 and store the result of the accumulation (i.e., a multi-PE internal partial sum) in the output register file 1550. The one or more other PEs may be in the same column as the PE 1500 in a sparse cell. The multi-PE internal partial sum may be a column-level internal partial sum. In some embodiments, the PE-level internal partial sum of the PE 1500 or the multi-PE internal partial sum may be sent to one or more other PEs for further accumulation.

[0179] FIG. 16 illustrates a positional encoding operation in a positional encoding layer, in accordance with various embodiments. The positional encoding layer may be an example of the positional encoding layer 115 or the positional encoding layer 125 in FIG. 1. The positional encoding operation includes an addition of a vector embedding 1610 and a positional encoding vector 1620. The vector embedding 1610 may be generated by an embedding layer. The positional encoding vector 1620 may encode information of the position of the token represented by the vector embedding 1610 in a sequence of tokens. The positional encoding operation computes a vector embedding 1630, which represents the token with positional context. In some embodiments, the positional encoding operation may be an elementwise addition operation. A data element in the vector embedding 1630 may equal the sum of a data element in the vector embedding 1610 and a data element in the positional encoding vector 1620. In the embodiments of FIG. 16, the vector embedding 1610, positional encoding vector 1620, and vector embedding 1630 have the same dimension, i.e., they each have five data elements. In other embodiments, the vector embedding 1610, positional encoding vector 1620, or vector embedding 1630 may have a different dimension.
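
As a minimal sketch of the elementwise addition, the following uses five-element vectors as in FIG. 16. The numeric values and the function name `add_positional_encoding` are made up for illustration; how the positional encoding vector itself is computed is not covered here.

```python
def add_positional_encoding(embedding, positional):
    """Elementwise sum of a vector embedding and a positional encoding vector."""
    if len(embedding) != len(positional):
        raise ValueError("both vectors must have the same dimension")
    return [e + p for e, p in zip(embedding, positional)]


embedding = [0.5, -1.0, 2.0, 0.0, 1.5]    # stand-in for vector embedding 1610
positional = [0.0, 1.0, 0.5, -0.5, 0.25]  # stand-in for positional encoding vector 1620
encoded = add_positional_encoding(embedding, positional)
print(encoded)
```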

[0180] FIG. 17 illustrates an example linear classifier 1700, in accordance with various embodiments. The linear classifier 1700 may be used in transformer models. In some embodiments, the linear classifier 1700 may generate tokens based on outputs of decoders. The linear classifier 1700 may be an example of the head block 130 in FIG. 1. As shown in FIG. 17, the linear classifier 1700 includes a linear layer 1710 and a Softmax layer 1720. In other embodiments, the linear classifier 1700 may include fewer, more, or different components.

[0181] The linear layer 1710 is provided with a matrix 1701. The matrix 1701 may be an output of a decoder, e.g., the decoder block 120. A linear transformation may be performed on the matrix 1701 and a weight matrix in the linear layer 1710. The weight matrix may include weights, which are internal parameters of the linear layer 1710. The linear layer outputs a vector 1702. In some embodiments, the dimension of the vector 1702 (e.g., the total number of elements in the vector 1702) may be equal to the total number of classes associated with the AI task being performed by the transformer model. The vector 1702 is provided to the Softmax layer 1720. The Softmax layer 1720 generates a vector 1703 from the vector 1702. In some embodiments, the dimension of the vector 1703 may equal the dimension of the vector 1702. Each element in the vector 1703 may correspond to a predicted token and may indicate a probability score of the predicted token. The probability score may indicate the probability that the prediction is correct. A predicted token 1704 having the highest probability score may be selected and output from the linear classifier 1700.
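
The linear layer, Softmax layer, and highest-score selection can be sketched as below. Several things are assumptions made for the example: the input is a single row vector rather than the full matrix 1701, there are only three classes, the weights are made-up numbers, and the function names are hypothetical.

```python
import math


def linear(x, weight):
    """Multiply a row vector x (length d) by a weight matrix (d rows, one column per class)."""
    return [sum(x[i] * weight[i][j] for i in range(len(x)))
            for j in range(len(weight[0]))]


def softmax(logits):
    """Convert scores to probabilities; subtracting the max keeps exp() stable."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict(x, weight):
    """Return the index of the most probable class and the probability vector."""
    probs = softmax(linear(x, weight))
    token = max(range(len(probs)), key=probs.__getitem__)
    return token, probs


x = [1.0, 2.0]               # stand-in for one row of the decoder output
weight = [[0.1, 0.5, -0.2],  # made-up weights: 2 features x 3 classes
          [0.3, -0.1, 0.4]]
token, probs = predict(x, weight)
print(token, probs)
```

The probability vector has the same dimension as the linear layer output, and the selected token is the index with the highest probability score.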

[0182] The output of the linear classifier 1700 may be the output of the transformer model. The execution of the linear classifier 1700 may be performed multiple times during inference of the transformer model. For instance, the transformer model may have multiple inference stages, and the linear classifier 1700 may be executed at least once in each inference stage. The dimensions of the vectors and matrices shown in FIGS. 2-5 are example dimensions used for purpose of illustration and simplicity. Any of the vectors and matrices used or computed by operations illustrated in FIGS. 2-5 may have different dimensions.

[0183] FIG. 18 illustrates a first inference stage of a transformer model 1800, in accordance with various embodiments. The transformer model 1800 includes an encoder 1810, a decoder 1820, and a head 1830. An example of the transformer model 1800 may be the transformer network 100 in FIG. 1. In the embodiments of FIG. 18, the encoder 1810 receives an input tensor 1801. The input tensor 1801 may be a feature map extracted from one or more images, text documents, audio files, videos, other types of data, or some combination thereof. In some embodiments, the input tensor 1801 may be generated by another neural network, e.g., a CNN. The encoder 1810 generates an output tensor 1802 from the input tensor 1801. The shape of the output tensor 1802 may be denoted as [batch size, SLencoder, dmodel], where SLencoder may be the dimension along the X axis (i.e., the width of the output tensor 1802), and dmodel may be the dimension along the Y axis (i.e., the height of the output tensor 1802). The encoder 1810 may include a plurality of layers arranged in a sequence, such as the layers inside the encoder block 110 in FIG. 1. The output tensor 1802 is provided to the decoder 1820.

[0184] The decoder 1820 receives the output tensor 1802 and an input sequence 1803. The input sequence 1803 may be a sequence of tokens. A token may be a numerical representation of an input signal, such as a word, image, audio signal, video signal, etc. The dimension of the input sequence 1803, which may be denoted as SLinput, may be the total number of tokens in the input sequence 1803. For the purpose of illustration and simplicity, SLinput is 4. In other embodiments, the input sequence 1803 may have a different shape. For instance, the input sequence 1803 may be a 2D tensor. The dimension of the 2D tensor along the X axis may be SLinput, while the dimension of the 2D tensor along the Y axis may be a batch size indicating the number of batches in the input sequence 1803.

[0185] The decoder 1820 computes an output tensor 1804, a self-attention key tensor 1805, a self-attention value tensor 1806, a cross-attention key tensor 1807, and a cross-attention value tensor 1808. In some embodiments, the shape of the output tensor 1804 may be denoted as [batch size, SLinput, dmodel]. The shape of the self-attention key tensor 1805 or the shape of the self-attention value tensor 1806 may be denoted as N x [batch size, h, SLinput, dhead], where N is the number of identical layers in the decoder (e.g., the number of layers 150 in the decoder block 120), h is the total number of heads in an MHA layer, and dhead is the dimension of a query vector, key vector, or value vector. In some embodiments, dmodel = h x dhead. The shape of the self-attention key tensor 1805 or self-attention value tensor 1806 may alternatively be denoted as N x [batch size, SLinput, dmodel]. The shape of the cross-attention key tensor 1807 or the shape of the cross-attention value tensor 1808 may be denoted as N x [batch size, h, SLencoder, dhead] or N x [batch size, SLencoder, dmodel].
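
The shape relationships above can be checked with a little arithmetic. In this sketch the N identical layers are represented as a leading tuple dimension, which is an assumption about how "N x [...]" would be laid out; the function names and the example sizes (6 layers, 8 heads, head dimension 64, 10 encoder positions) are hypothetical.

```python
def decoder_cache_shapes(num_layers, batch_size, heads, d_head, sl_input, sl_encoder):
    """Return the per-head and merged-head shapes quoted in the text as tuples."""
    d_model = heads * d_head  # d_model = h x d_head
    return {
        "output": (batch_size, sl_input, d_model),
        "self_kv_per_head": (num_layers, batch_size, heads, sl_input, d_head),
        "self_kv_merged": (num_layers, batch_size, sl_input, d_model),
        "cross_kv_per_head": (num_layers, batch_size, heads, sl_encoder, d_head),
        "cross_kv_merged": (num_layers, batch_size, sl_encoder, d_model),
    }


def num_elements(shape):
    """Total number of data elements in a tensor of the given shape."""
    total = 1
    for dim in shape:
        total *= dim
    return total


shapes = decoder_cache_shapes(num_layers=6, batch_size=1, heads=8, d_head=64,
                              sl_input=4, sl_encoder=10)
for name, shape in shapes.items():
    print(name, shape, num_elements(shape))
```

The per-head and merged-head notations describe the same number of elements, since merging heads only folds h and dhead into dmodel.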

[0186] The output tensor 1804 may be provided to the head 1830 and the head 1830 outputs a predicted token 1809. The shape of the token 1809 may be denoted as [batch size, 1]. For the purpose of illustration and simplicity, batch size is 1 in FIG. 18. In other embodiments, batch size may be a larger number. The predicted token 1809 may be stored in a buffer. In some embodiments, the predicted token 1809 may be used to update the input sequence 1803. For instance, the predicted token 1809 may be added to the right of the input sequence 1803. The updated input sequence may be used as the input sequence in the second inference stage. In the second inference stage, the decoder 1820 may receive the updated input sequence and the output tensor 1802 for predicting another token. The output tensor 1802 may remain the same during inference of the decoder 1820. Certain aspects of subsequent inference stages are described below in conjunction with FIG. 19.

[0187] In some embodiments, the self-attention key tensor 1805 and the self-attention value tensor 1806 may be provided to a self-attention layer in the decoder 1820; an example of such a self-attention layer is the MHA layer 151. The self-attention key tensor 1805 may be stored in a self-attention key cache. The self-attention key cache may have the same shape as the self-attention key tensor 1805. The self-attention value tensor 1806 may be stored in a self-attention value cache. The self-attention value cache may have the same shape as the self-attention value tensor 1806.

[0188] In some embodiments, the decoder 1820 computes the self-attention key tensor 1805 and the self-attention value tensor 1806 from the input sequence 1803. The input sequence 1803 may be dynamic during inference of the decoder 1820. For instance, a new token may be added to the input sequence 1803 after each inference stage, as described above. As the input sequence 1803 changes, the self-attention key tensor 1805 and the self-attention value tensor 1806 would also change. For instance, the dimension of the self-attention key tensor 1805 or the self-attention value tensor 1806 along the X axis may increase as SLinput increases. The self-attention key cache and the self-attention value cache may change during all the inference stages of the decoder 1820 to accommodate the changes in the self-attention key tensor 1805 and the self-attention value tensor 1806.

[0189] In some embodiments, the cross-attention key tensor 1807 and the cross-attention value tensor 1808 may be provided to a cross-attention layer in the decoder 1820; an example of such a cross-attention layer is the MHA layer 153. The cross-attention key tensor 1807 may be stored in a cross-attention key cache. The cross-attention key cache may have the same shape as the cross-attention key tensor 1807. The cross-attention value tensor 1808 may be stored in a cross-attention value cache. The cross-attention value cache may have the same shape as the cross-attention value tensor 1808. In some embodiments, the decoder 1820 computes the cross-attention key tensor 1807 and the cross-attention value tensor 1808 from the output tensor 1802 generated in the encoder 1810. As the output tensor 1802 does not change during inference of the decoder 1820, the cross-attention key tensor 1807 and the cross-attention value tensor 1808 may remain the same during all the inference stages of the decoder 1820. The cross-attention key cache and the cross-attention value cache may remain the same during all the inference stages of the decoder 1820.

[0190] FIG. 19 illustrates subsequent inference stages of the transformer model, in accordance with various embodiments. In the second inference stage, the decoder 1820 may reuse the self-attention key tensor 1805, self-attention value tensor 1806, cross-attention key tensor 1807, and cross-attention value tensor 1808. The decoder 1820 also receives the predicted token 1809. The decoder 1820 may compute self-attention key vectors from the predicted token 1809 and concatenate the self-attention key vectors with the self-attention key tensor 1805 to generate a new self-attention key tensor 1815. For instance, a self-attention key vector for each head may be added to the right of a self-attention key matrix in the self-attention key tensor 1805, and the self-attention key vector and the self-attention key matrix may correspond to the same head. The elements highlighted with a dot pattern in the self-attention key tensor 1815 are the self-attention key vectors generated from the predicted token 1809.

[0191] Similarly, the decoder 1820 may compute self-attention value vectors from the predicted token 1809 and concatenate the self-attention value vectors with the self-attention value tensor 1806 to generate a new self-attention value tensor 1816. For instance, a self-attention value vector for each head may be added to the right of a self-attention value matrix in the self-attention value tensor 1806, and the self-attention value vector and the self-attention value matrix may correspond to the same head. The elements highlighted with a dot pattern in the self-attention value tensor 1816 are the self-attention value vectors generated from the predicted token 1809.

[0192] The decoder 1820 also generates an output tensor 1814. The decoder 1820 may generate the output tensor 1814 using the new self-attention key tensor 1815 and new self-attention value tensor 1816. The output tensor 1814 is used by the head 1830 to generate another predicted token 1819. The predicted token 1819 is the output of the transformer model 1800 in the second inference stage.

[0193] One or more other subsequent inference stages may be conducted. In each subsequent inference stage, the decoder 1820 receives a token predicted in the previous inference stage, a self-attention key tensor generated in the previous inference stage, a self-attention value tensor generated in the previous inference stage, the cross-attention key tensor 1807, and the cross-attention value tensor 1808. The decoder 1820 may, in the subsequent inference stage, generate a larger self-attention key tensor and a larger self-attention value tensor, in addition to an output tensor which can be used by the head 1830 to predict a new token.
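
The per-stage growth of the self-attention key tensor can be sketched as appending one new vector per head to a cache. This is a simplified model for a single decoder layer and a batch size of 1: the nested-list layout, the function `append_to_cache`, and the placeholder key values are assumptions for illustration, and the projection that would produce real key vectors from the predicted token is omitted.

```python
def append_to_cache(cache, new_vectors):
    """Append one new vector per head to the right of that head's cached matrix.

    cache: list over heads; each entry is a list of per-token vectors.
    new_vectors: list over heads; one vector computed from the newly predicted token.
    """
    if len(cache) != len(new_vectors):
        raise ValueError("need exactly one new vector per head")
    # Build a new cache so the tensor from the previous stage is left untouched.
    return [head_cache + [vec] for head_cache, vec in zip(cache, new_vectors)]


h, d_head, sl_input = 2, 3, 4

# First inference stage: keys for the SLinput tokens of the input sequence.
key_cache = [[[0.0] * d_head for _ in range(sl_input)] for _ in range(h)]
lengths = [len(key_cache[0])]

for stage in range(2, 5):  # second, third, and fourth inference stages
    # Placeholder for the key vectors computed from the previously predicted token.
    new_keys = [[float(stage)] * d_head for _ in range(h)]
    key_cache = append_to_cache(key_cache, new_keys)
    lengths.append(len(key_cache[0]))

print(lengths)
```

The same routine would apply to the self-attention value cache, while the cross-attention caches would simply be reused unchanged in every stage.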

[0194] In embodiments where the total number of inference stages is N, the input sequence 1803 is updated to an input sequence 1813 after N - 1 inference stages. In the last inference stage (i.e., the Nth inference stage), the decoder 1820 may receive the predicted token generated in the (N - 1)th inference stage, the self-attention key tensor generated in the (N - 1)th inference stage, the self-attention value tensor generated in the (N - 1)th inference stage, the cross-attention key tensor 1807, and the cross-attention value tensor 1808. The decoder 1820 may generate a self-attention key tensor 1825 and a self-attention value tensor 1826 using the predicted token generated in the (N - 1)th inference stage, the self-attention key tensor generated in the (N - 1)th inference stage, and the self-attention value tensor generated in the (N - 1)th inference stage. The dimension of the self-attention key tensor 1825 or self-attention value tensor 1826 along the X axis is SLinput + N. The decoder 1820 also generates an output tensor 1824, which is used by the head 1830 to generate the last predicted token 1829. The N tokens predicted by the transformer model in the N inference stages may constitute an output tensor 1839, which may be the final output of the transformer model.

[0195] FIG. 20 is a flowchart of a method 2000 for executing a transformer network, in accordance with various embodiments. The method 2000 may be performed by the NPU 1102 in FIG. 11. Although the method 2000 is described with reference to the flowchart illustrated in FIG. 20, many other methods for executing a transformer network may alternatively be used. For example, the order of execution of the steps in FIG. 20 may be changed. As another example, some of the steps may be changed, eliminated, or combined.

[0196] The NPU 1102 receives 2010 one or more workload descriptors for a neural network operation in a transformer network. The one or more workload descriptors indicate a plurality of workloads. A workload of the plurality of workloads is for executing the neural network operation based on a fixed number of tokens. In some embodiments, the fixed number is determined based on a configuration of the NPUs. In some embodiments, the configuration of the NPUs comprises a number or an arrangement of processing elements in the NPUs.

[0197] In some embodiments, the plurality of workloads are in a sequence, and a workload descriptor of the one or more workload descriptors indicates a position of the workload in the sequence. In some embodiments, the one or more workload descriptors are generated based on a predetermined sequence length. The predetermined sequence length indicates a maximum number of tokens supported by the transformer network.

[0198] The NPU 1102 selects 2020 one or more workloads from the plurality of workloads based on the one or more workload descriptors and a sequence length, the sequence length indicating a number of tokens input into the transformer network. In some embodiments, the NPU 1102 reads the sequence length from a memory.

[0199] The NPU 1102 executes 2030 the one or more workloads by performing the neural network operation based on the tokens input into the transformer network. In some embodiments, the NPU 1102 bypasses one or more other workloads of the plurality of workloads. In some embodiments, the plurality of workloads are for executing an MHA layer in the transformer network, and the neural network operation comprises a matrix multiplication. In some embodiments, the NPU 1102 pads an input tensor of the neural network operation, and executes the neural network operation using the padded input tensor.
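
The select, execute, and bypass steps of the method 2000 can be sketched as follows. This is a simplified software model, not the NPU 1102 logic: it assumes workloads are ordered along the sequence dimension, that each covers a fixed number of tokens, and that padding only fills the last executed workload. The function name, the maximum sequence length of 128, and the 16 tokens per workload are made-up values for illustration.

```python
import math


def select_workloads(num_workloads, tokens_per_workload, seq_len):
    """Split workload positions into executed and bypassed sets for the actual sequence length.

    Also returns how many padding tokens fill the last executed workload.
    """
    max_len = num_workloads * tokens_per_workload
    if not 0 <= seq_len <= max_len:
        raise ValueError("sequence length must be between 0 and the maximum sequence length")
    needed = math.ceil(seq_len / tokens_per_workload)
    executed = list(range(needed))                 # positions the NPU would run
    bypassed = list(range(needed, num_workloads))  # positions compiled for the maximum length but skipped
    pad = needed * tokens_per_workload - seq_len   # assumption: pad only up to workload granularity
    return executed, bypassed, pad


MAX_SEQ_LEN = 128         # hypothetical predetermined (maximum) sequence length
TOKENS_PER_WORKLOAD = 16  # hypothetical fixed number of tokens per workload
num_workloads = MAX_SEQ_LEN // TOKENS_PER_WORKLOAD

executed, bypassed, pad = select_workloads(num_workloads, TOKENS_PER_WORKLOAD, seq_len=40)
print(executed, bypassed, pad)
```

With 40 input tokens, three of the eight compiled workloads run and five are bypassed, compared with padding the input to 128 tokens and running all eight.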

[0200] In some embodiments, the neural network operation is an operation on a tensor. The one or more workload descriptors indicate a partition of the tensor in a dimension corresponding to the sequence length. In some embodiments, the neural network operation is a multiplication of a first tensor and a second tensor. The one or more workload descriptors comprise a first group of workload descriptors indicating a partition of the first tensor and a second group of workload descriptors indicating a partition of the second tensor. In some embodiments, the neural network operation is a Softmax operation on a tensor. The one or more workload descriptors comprise a first group of workload descriptors indicating a partition of the tensor in a dimension and a second group of workload descriptors indicating a partition of the tensor in another dimension.

[0201] FIG. 21 is a block diagram of an example computing device 2100, in accordance with various embodiments. In some embodiments, the computing device 2100 can be used as at least part of the AI system 1100. A number of components are illustrated in FIG. 21 as included in the computing device 2100, but any one or more of these components may be omitted or duplicated, as suitable for the application. In some embodiments, some or all of the components included in the computing device 2100 may be attached to one or more motherboards. In some embodiments, some or all of these components are fabricated onto a single system on a chip (SoC) die. Additionally, in various embodiments, the computing device 2100 may not include one or more of the components illustrated in FIG. 21, but the computing device 2100 may include interface circuitry for coupling to the one or more components. For example, the computing device 2100 may not include a display device 2106, but may include display device interface circuitry (e.g., a connector and driver circuitry) to which a display device 2106 may be coupled. In another set of examples, the computing device 2100 may not include an audio input device 2118 or an audio output device 2108 but may include audio input or output device interface circuitry (e.g., connectors and supporting circuitry) to which an audio input device 2118 or audio output device 2108 may be coupled.

[0202] The computing device 2100 may include a processing device 2102 (e.g., one or more processing devices). The processing device 2102 processes electronic data from registers and / or memory to transform that electronic data into other electronic data that may be stored in registers and / or memory. The computing device 2100 may include a memory 2104, which may itself include one or more memory devices such as volatile memory (e.g., DRAM), nonvolatile memory (e.g., read-only memory (ROM)), high bandwidth memory (HBM), flash memory, solid state memory, and / or a hard drive. In some embodiments, the memory 2104 may include memory that shares a die with the processing device 2102. In some embodiments, the memory 2104 includes one or more non-transitory computer-readable media storing instructions executable to perform operations for executing transformer networks (e.g., the method 2000 described in conjunction with FIG. 20) or some operations performed by one or more components of the AI system 1100 in FIG. 11, such as operations performed by the NPU 1102. The instructions stored in the one or more non-transitory computer-readable media may be executed by the processing device 2102.

[0203] In some embodiments, the computing device 2100 may include a communication chip 2112 (e.g., one or more communication chips). For example, the communication chip 2112 may be configured for managing wireless communications for the transfer of data to and from the computing device 2100. The term "wireless" and its derivatives may be used to describe circuits, devices, systems, methods, techniques, communications channels, etc., that may communicate data through the use of modulated electromagnetic radiation through a nonsolid medium. The term does not imply that the associated devices do not contain any wires, although in some embodiments they might not.

[0204] The communication chip 2112 may implement any of a number of wireless standards or protocols, including but not limited to Institute of Electrical and Electronics Engineers (IEEE) standards including Wi-Fi (IEEE 802.11 family), IEEE 802.16 standards (e.g., IEEE 802.16-2005 Amendment), Long-Term Evolution (LTE) project along with any amendments, updates, and / or revisions (e.g., advanced LTE project, ultramobile broadband (UMB) project (also referred to as "3GPP2"), etc.). IEEE 802.16 compatible Broadband Wireless Access (BWA) networks are generally referred to as WiMAX networks, an acronym that stands for worldwide interoperability for microwave access, which is a certification mark for products that pass conformity and interoperability tests for the IEEE 802.16 standards. The communication chip 2112 may operate in accordance with a Global System for Mobile Communication (GSM), General Packet Radio Service (GPRS), Universal Mobile Telecommunications System (UMTS), High Speed Packet Access (HSPA), Evolved HSPA (E-HSPA), or LTE network. The communication chip 2112 may operate in accordance with Enhanced Data for GSM Evolution (EDGE), GSM EDGE Radio Access Network (GERAN), Universal Terrestrial Radio Access Network (UTRAN), or Evolved UTRAN (E-UTRAN). The communication chip 2112 may operate in accordance with Code-division Multiple Access (CDMA), Time Division Multiple Access (TDMA), Digital Enhanced Cordless Telecommunications (DECT), Evolution-Data Optimized (EV-DO), and derivatives thereof, as well as any other wireless protocols that are designated as 3G, 4G, 5G, and beyond. The communication chip 2112 may operate in accordance with other wireless protocols in other embodiments. The computing device 2100 may include an antenna 2122 to facilitate wireless communications and / or to receive other wireless communications (such as AM or FM radio transmissions).

[0205] In some embodiments, the communication chip 2112 may manage wired communications, such as electrical, optical, or any other suitable communication protocols (e.g., Ethernet). As noted above, the communication chip 2112 may include multiple communication chips. For instance, a first communication chip 2112 may be dedicated to shorter-range wireless communications such as Wi-Fi or Bluetooth, and a second communication chip 2112 may be dedicated to longer-range wireless communications such as global positioning system (GPS), EDGE, GPRS, CDMA, WiMAX, LTE, EV-DO, or others. In some embodiments, a first communication chip 2112 may be dedicated to wireless communications, and a second communication chip 2112 may be dedicated to wired communications.

[0206] The computing device 2100 may include battery / power circuitry 2114. The battery / power circuitry 2114 may include one or more energy storage devices (e.g., batteries or capacitors) and / or circuitry for coupling components of the computing device 2100 to an energy source separate from the computing device 2100 (e.g., AC line power).

[0207] The computing device 2100 may include a display device 2106 (or corresponding interface circuitry, as discussed above). The display device 2106 may include any visual indicators, such as a heads-up display, a computer monitor, a projector, a touchscreen display, a liquid crystal display (LCD), a light-emitting diode display, or a flat panel display, for example.

[0208] The computing device 2100 may include an audio output device 2108 (or corresponding interface circuitry, as discussed above). The audio output device 2108 may include any device that generates an audible indicator, such as speakers, headsets, or earbuds, for example.

[0209] The computing device 2100 may include an audio input device 2118 (or corresponding interface circuitry, as discussed above). The audio input device 2118 may include any device that generates a signal representative of a sound, such as microphones, microphone arrays, or digital instruments (e.g., instruments having a musical instrument digital interface (MIDI) output).

[0210] The computing device 2100 may include a GPS device 2116 (or corresponding interface circuitry, as discussed above). The GPS device 2116 may be in communication with a satellite-based system and may receive a location of the computing device 2100, as known in the art.

[0211] The computing device 2100 may include another output device 2110 (or corresponding interface circuitry, as discussed above). Examples of the other output device 2110 may include an audio codec, a video codec, a printer, a wired or wireless transmitter for providing information to other devices, or an additional storage device.

[0212] The computing device 2100 may include another input device 2120 (or corresponding interface circuitry, as discussed above). Examples of the other input device 2120 may include an accelerometer, a gyroscope, a compass, an image capture device, a keyboard, a cursor control device such as a mouse, a stylus, a touchpad, a bar code reader, a Quick Response (QR) code reader, any sensor, or a radio frequency identification (RFID) reader.

[0213] The computing device 2100 may have any desired form factor, such as a handheld or mobile computer system (e.g., a cell phone, a smart phone, a mobile internet device, a music player, a tablet computer, a laptop computer, a netbook computer, an ultrabook computer, a personal digital assistant (PDA), an ultramobile personal computer, etc.), a desktop computer system, a server or other networked computing component, a printer, a scanner, a monitor, a set-top box, an entertainment control unit, a vehicle control unit, a digital camera, a digital video recorder, or a wearable computer system. In some embodiments, the computing device 2100 may be any other electronic device that processes data.

[0214] The following paragraphs provide various examples of the embodiments disclosed herein.

[0215] Example 1 provides a computing system, including a compiler to generate one or more workload descriptors for a neural network operation in a transformer network, the one or more workload descriptors indicating a plurality of workloads, in which a workload of the plurality of workloads is for executing the neural network operation based on a fixed number of tokens; and an NPU to: receive the one or more workload descriptors from the compiler, select one or more workloads from the plurality of workloads based on the one or more workload descriptors and a sequence length, the sequence length indicating a number of tokens input into the transformer network, and execute the selected one or more workloads by performing the neural network operation based on the tokens input into the transformer network.

[0216] Example 2 provides the computing system of example 1, in which the neural network operation includes a matrix multiplication for an MHA layer in the transformer network.

[0217] Example 3 provides the computing system of example 1 or 2, in which the compiler is to generate the one or more workload descriptors based on a predetermined sequence length, the predetermined sequence length indicating a maximum number of tokens supported by the transformer network.

[0218] Example 4 provides the computing system of any one of examples 1-3, in which the compiler is further to determine the fixed number based on a configuration of the NPU, the configuration of the NPU including a number or an arrangement of processing elements in the NPU.

[0219] Example 5 provides the computing system of any one of examples 1-4, in which the NPU is further to pad an input tensor of the neural network operation, in which the NPU is to execute the selected one or more workloads using the padded input tensor.

[0220] Example 6 provides the computing system of any one of examples 1-5, in which the plurality of workloads are in a sequence, in which a workload descriptor of the one or more workload descriptors indicates a position of the workload in the sequence.

[0221] Example 7 provides the computing system of any one of examples 1-6, in which the NPU is further to bypass one or more other workloads of the plurality of workloads.

[0222] Example 8 provides the computing system of any one of examples 1-7, in which the neural network operation is an operation on a tensor, in which the one or more workload descriptors indicate a partition of the tensor in a dimension corresponding to the sequence length.

[0223] Example 9 provides the computing system of any one of examples 1-7, in which the neural network operation is a multiplication of a first tensor and a second tensor, in which the one or more workload descriptors include a first group of workload descriptors indicating a partition of the first tensor and a second group of workload descriptors indicating a partition of the second tensor.

[0224] Example 10 provides the computing system of any one of examples 1-7, in which the neural network operation is a Softmax operation on a tensor, in which the one or more workload descriptors include a first group of workload descriptors indicating a partition of the tensor in a dimension and a second group of workload descriptors indicating a partition of the tensor in another dimension.

[0225] Example 11 provides a method of executing a transformer network, the method including receiving one or more workload descriptors for a neural network operation in a transformer network, the one or more workload descriptors indicating a plurality of workloads, in which a workload of the plurality of workloads is for executing the neural network operation based on a fixed number of tokens; selecting one or more workloads from the plurality of workloads based on the one or more workload descriptors and a sequence length, the sequence length indicating a number of tokens input into the transformer network; and executing the selected one or more workloads by performing the neural network operation based on the tokens input into the transformer network.

[0226] Example 12 provides the method of example 11, in which the neural network operation includes a matrix multiplication for an MHA layer in the transformer network.

[0227] Example 13 provides the method of example 11 or 12, in which the one or more workload descriptors are generated based on a predetermined sequence length, the predetermined sequence length indicating a maximum number of tokens supported by the transformer network.

[0228] Example 14 provides the method of any one of examples 11-13, in which the fixed number is determined based on a configuration of an NPU.

[0229] Example 15 provides the method of any one of examples 11-14, further including padding an input tensor of the neural network operation, in which the selected one or more workloads are executed using the padded input tensor.

[0230] Example 16 provides the method of any one of examples 11-15, in which the plurality of workloads are in a sequence, in which a workload descriptor of the one or more workload descriptors indicates a position of the workload in the sequence.

[0231] Example 17 provides the method of any one of examples 11-16, further including bypassing one or more other workloads of the plurality of workloads.

[0232] Example 18 provides the method of any one of examples 11-17, in which the neural network operation is an operation on a tensor, in which the one or more workload descriptors indicate a partition of the tensor in a dimension corresponding to the sequence length.

[0233] Example 19 provides the method of any one of examples 11-17, in which the neural network operation is a multiplication of a first tensor and a second tensor, in which the one or more workload descriptors include a first group of workload descriptors indicating a partition of the first tensor and a second group of workload descriptors indicating a partition of the second tensor.

[0234] Example 20 provides the method of any one of examples 11-17, in which the neural network operation is a Softmax operation on a tensor, in which the one or more workload descriptors include a first group of workload descriptors indicating a partition of the tensor in a dimension and a second group of workload descriptors indicating a partition of the tensor in another dimension.

[0235] Example 21 provides one or more non-transitory computer-readable media storing instructions executable to perform operations for executing a transformer network, the operations including receiving one or more workload descriptors for a neural network operation in a transformer network, the one or more workload descriptors indicating a plurality of workloads, in which a workload of the plurality of workloads is for executing the neural network operation based on a fixed number of tokens; selecting one or more workloads from the plurality of workloads based on the one or more workload descriptors and a sequence length, the sequence length indicating a number of tokens input into the transformer network; executing the selected one or more workloads by performing the neural network operation based on the tokens input into the transformer network; and bypassing one or more other workloads of the plurality of workloads.

[0236] Example 22 provides the one or more non-transitory computer-readable media of example 21, in which the neural network operation includes a matrix multiplication for an MHA layer in the transformer network.

[0237] Example 23 provides the one or more non-transitory computer-readable media of example 21 or 22, in which the one or more workload descriptors are generated based on a predetermined sequence length, the predetermined sequence length indicating a maximum number of tokens supported by the transformer network.

[0238] Example 24 provides the one or more non-transitory computer-readable media of any one of examples 21-23, in which the fixed number is determined based on a configuration of an NPU.

[0239] Example 25 provides the one or more non-transitory computer-readable media of example 24, in which the configuration of the NPU includes a number or an arrangement of processing elements in the NPU.

[0240] Example 26 provides the one or more non-transitory computer-readable media of any one of examples 21-25, in which the plurality of workloads are in a sequence, in which a workload descriptor of the one or more workload descriptors indicates a position of the workload in the sequence.

[0241] Example 27 provides the one or more non-transitory computer-readable media of any one of examples 21-26, in which the operations further include reading the sequence length from a memory.

[0242] Example 28 provides the one or more non-transitory computer-readable media of any one of examples 21-27, in which the neural network operation is an operation on a tensor, in which the one or more workload descriptors indicate a partition of the tensor in a dimension corresponding to the sequence length.

[0243] Example 29 provides the one or more non-transitory computer-readable media of any one of examples 21-27, in which the neural network operation is a multiplication of a first tensor and a second tensor, in which the one or more workload descriptors include a first group of workload descriptors indicating a partition of the first tensor and a second group of workload descriptors indicating a partition of the second tensor.

[0244] Example 30 provides the one or more non-transitory computer-readable media of any one of examples 21-27, in which the neural network operation is a Softmax operation on a tensor, in which the one or more workload descriptors include a first group of workload descriptors indicating a partition of the tensor in a dimension and a second group of workload descriptors indicating a partition of the tensor in another dimension.
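The select, pad, execute, and bypass flow recited in the examples above can be sketched as follows. This is a simplified illustration under stated assumptions, not the disclosed implementation: the `Workload` type, the helper names, the selection rule (a workload is selected when its first token index is below the sequence length), and the values 1024, 256, and 300 are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class Workload:
    """Hypothetical workload: covers a fixed number of tokens at a position in a sequence."""
    position: int  # position of the workload in the sequence of workloads
    tokens: int    # fixed number of tokens the workload processes


def compile_workloads(max_seq_len: int, tokens_per_workload: int) -> List[Workload]:
    """Compile-time step: create enough workloads to cover the maximum sequence length."""
    count = -(-max_seq_len // tokens_per_workload)  # ceiling division
    return [Workload(i, tokens_per_workload) for i in range(count)]


def select_workloads(
    workloads: Sequence[Workload], seq_len: int
) -> Tuple[List[Workload], List[Workload]]:
    """Run-time step: split workloads into those to execute and those to bypass."""
    capacity = sum(w.tokens for w in workloads)
    if not 1 <= seq_len <= capacity:
        raise ValueError("sequence length outside the supported range")
    selected: List[Workload] = []
    bypassed: List[Workload] = []
    for w in workloads:
        first_token = w.position * w.tokens
        # Assumed rule: a workload is needed if it covers at least one real token.
        (selected if first_token < seq_len else bypassed).append(w)
    return selected, bypassed


def pad_tokens(tokens: Sequence[int], selected: Sequence[Workload], pad_value: int = 0) -> List[int]:
    """Pad the input only up to the tokens covered by the selected workloads."""
    padded_len = sum(w.tokens for w in selected)
    return list(tokens) + [pad_value] * (padded_len - len(tokens))


# Assumed values: maximum of 1024 tokens, 256 tokens per workload, 300 actual tokens.
workloads = compile_workloads(max_seq_len=1024, tokens_per_workload=256)
selected, bypassed = select_workloads(workloads, seq_len=300)
padded = pad_tokens(list(range(300)), selected)
```

With these assumed values, the workloads at positions 0 and 1 are selected, those at positions 2 and 3 are bypassed, and the input is padded to 512 tokens rather than to the 1024-token maximum.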

[0245] The above description of illustrated implementations of the disclosure, including what is described in the Abstract, is not intended to be exhaustive or to limit the disclosure to the precise forms disclosed. While specific implementations of, and examples for, the disclosure are described herein for illustrative purposes, various equivalent modifications are possible within the scope of the disclosure, as those skilled in the relevant art can recognize. These modifications may be made to the disclosure in light of the above detailed description.

Claims

1. A computing system, comprising: a compiler to generate one or more workload descriptors for a neural network operation in a transformer network, the one or more workload descriptors indicating a plurality of workloads, wherein a workload of the plurality of workloads is for executing the neural network operation based on a fixed number of tokens; and a neural processing unit to: receive the one or more workload descriptors from the compiler, select one or more workloads from the plurality of workloads based on the one or more workload descriptors and a sequence length, the sequence length indicating a number of tokens input into the transformer network, and execute the selected one or more workloads by performing the neural network operation based on the tokens input into the transformer network.

2. The computing system of claim 1, wherein the neural network operation comprises a matrix multiplication for a multi-head attention layer in the transformer network.

3. The computing system of claim 1 or 2, wherein the compiler is to generate the one or more workload descriptors based on a predetermined sequence length, the predetermined sequence length indicating a maximum number of tokens supported by the transformer network.

4. The computing system of any one of claims 1-3, wherein the compiler is further to determine the fixed number based on a configuration of the neural processing unit, the configuration comprising a number of processing elements or an arrangement of the processing elements in the neural network.

5. The computing system of any one of claims 1-4, wherein the neural processing unit is further to pad an input tensor of the neural network operation, wherein the neural processing unit is to execute the selected one or more workloads using the padded input tensor.

6. The computing system of any one of claims 1-5, wherein the plurality of workloads are in a sequence, wherein a workload descriptor of the one or more workload descriptors indicates a position of the workload in the sequence.

7. The computing system of any one of claims 1-6, wherein the neural processing unit is further to bypass one or more other workloads of the plurality of workloads.

8. The computing system of any one of claims 1-7, wherein the neural network operation is an operation on a tensor, wherein the one or more workload descriptors indicate a partition of the tensor in a dimension corresponding to the sequence length.

9. The computing system of any one of claims 1-7, wherein the neural network operation is a multiplication of a first tensor and a second tensor, wherein the one or more workload descriptors comprise a first group of workload descriptors indicating a partition of the first tensor and a second group of workload descriptors indicating a partition of the second tensor.

10. The computing system of any one of claims 1-7, wherein the neural network operation is a Softmax operation on a tensor, wherein the one or more workload descriptors comprise a first group of workload descriptors indicating a partition of the tensor in a dimension and a second group of workload descriptors indicating a partition of the tensor in another dimension.

11. A method of executing a transformer network, the method comprising: receiving one or more workload descriptors for a neural network operation in a transformer network, the one or more workload descriptors indicating a plurality of workloads, wherein a workload of the plurality of workloads is for executing the neural network operation based on a fixed number of tokens; selecting one or more workloads from the plurality of workloads based on the one or more workload descriptors and a sequence length, the sequence length indicating a number of tokens input into the transformer network; and executing the selected one or more workloads by performing the neural network operation based on the tokens input into the transformer network.

12. The method of claim 11, wherein the plurality of workloads are for executing a multi-head attention layer in the transformer network, wherein the neural network operation comprises a matrix multiplication.

13. The method of claim 11 or 12, wherein the one or more workload descriptors are generated based on a predetermined sequence length, the predetermined sequence length indicating a maximum number of tokens supported by the transformer network.

14. The method of any one of claims 11-13, wherein the fixed number is determined based on a configuration of a neural processing unit, the configuration comprising a number of processing elements or an arrangement of the processing elements in the neural network.

15. The method of any one of claims 11-14, further comprising: padding an input tensor of the neural network operation, wherein the selected one or more workloads are executed using the padded input tensor.

16. The method of any one of claims 11-15, wherein the plurality of workloads are in a sequence, wherein a workload descriptor of the one or more workload descriptors indicates a position of the workload in the sequence.

17. The method of any one of claims 11-16, further comprising: bypassing one or more other workloads of the plurality of workloads.

18. The method of any one of claims 11-17, wherein the neural network operation is an operation on a tensor, wherein the one or more workload descriptors indicate a partition of the tensor in a dimension corresponding to the sequence length.

19. The method of any one of claims 11-17, wherein the neural network operation is a multiplication of a first tensor and a second tensor, wherein the one or more workload descriptors comprise a first group of workload descriptors indicating a partition of the first tensor and a second group of workload descriptors indicating a partition of the second tensor.

20. The method of any one of claims 11-17, wherein the neural network operation is a Softmax operation on a tensor, wherein the one or more workload descriptors comprise a first group of workload descriptors indicating a partition of the tensor in a dimension and a second group of workload descriptors indicating a partition of the tensor in another dimension.

21. One or more non-transitory computer-readable media storing instructions executable to perform operations for executing a transformer network, the operations comprising: receiving one or more workload descriptors for a neural network operation in a transformer network, the one or more workload descriptors indicating a plurality of workloads, wherein a workload of the plurality of workloads is for executing the neural network operation based on a fixed number of tokens; selecting one or more workloads from the plurality of workloads based on the one or more workload descriptors and a sequence length, the sequence length indicating a number of tokens input into the transformer network; and executing the selected one or more workloads by performing the neural network operation based on the tokens input into the transformer network.

22. The one or more non-transitory computer-readable media of claim 21, wherein the plurality of workloads are for executing a multi-head attention layer in the transformer network, wherein the neural network operation comprises a matrix multiplication.

23. The one or more non-transitory computer-readable media of claim 21 or 22, wherein the one or more workload descriptors are generated based on a predetermined sequence length, the predetermined sequence length indicating a maximum number of tokens supported by the transformer network.

24. The one or more non-transitory computer-readable media of any one of claims 21-23, wherein the fixed number is determined based on a number or an arrangement of processing elements in a neural processing unit.

25. The one or more non-transitory computer-readable media of any one of claims 21-24, wherein the neural network operation is an operation on a tensor, wherein the one or more workload descriptors indicate a partition of the tensor in a dimension corresponding to the sequence length.