Training and fine-tuning of neural networks on neural processing units

CN122264005A · Pending · Publication Date: 2026-06-23 · INTEL CORP

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Applications(China)
Current Assignee / Owner: INTEL CORP
Filing Date: 2025-11-21
Publication Date: 2026-06-23

AI Technical Summary

Technical Problem

Existing neural network training and fine-tuning methods suffer from inefficiency and accuracy loss on neural processing units (NPUs), especially in edge devices and embedded systems, because methods developed for general-purpose processors (e.g., GPUs) and tensor processing units (TPUs) are difficult to apply directly.

Method used

Forward and backward operations are offloaded to the NPU, matrix multiplication operations are performed by a MatMul kernel, and gradients are computed automatically by an automatic differentiation module, so that training and fine-tuning can be performed efficiently.

Benefits of technology

It enables real-time, personalized model training and fine-tuning on edge devices, reducing latency, enhancing privacy and data security, improving energy efficiency, adapting to non-stationary data, and reducing bandwidth and cloud processing costs.

✦ Generated by Eureka AI based on patent content.

Abstract

This disclosure relates to training and fine-tuning of neural networks on a neural processing unit. A kernel on a neural processing unit can perform matrix multiplication (MatMul) on tensors of different dimensions. A neural network can be trained through forward and backward operations, both of which can be offloaded to the kernel. For the forward operation, the kernel can execute a layer by performing MatMul on an input tensor and a weight tensor and generate an output tensor. A loss can be computed. For the backward operation, the kernel can compute a weight gradient of the loss by performing MatMul on a gradient of the output tensor and the input tensor, and compute an input gradient of the loss by performing MatMul on the gradient of the output tensor and the weight tensor. The gradient of the output tensor can be computed by an automatic differentiation module. The weight tensor can be updated based on the input gradient and the weight gradient.

Description

Cross-Reference to Related Applications

[0001] This application claims the benefit of U.S. Provisional Patent Application 63/738,177, filed December 23, 2024, entitled “Training and Fine-tuning a Neural Network on a Neural Processing Unit,” which is incorporated herein by reference in its entirety.

Technical Field

[0002] This disclosure generally relates to neural networks (also known as “deep neural networks” or “DNNs”), and more specifically, to training and fine-tuning DNNs on a neural processing unit (NPU).

Background

[0003] Because of their high accuracy, deep neural networks (DNNs) are widely used in a variety of artificial intelligence (AI) applications, ranging from computer vision to speech recognition and natural language processing. However, this high accuracy comes at the cost of significant computational complexity. DNNs need to be trained before they can be used for AI tasks. For some applications, pre-trained DNNs may require further fine-tuning. The computational demands of training or fine-tuning DNNs are extremely high, as these processes can involve numerous operations and large amounts of data reading and writing.

Summary of the Invention

[0004] According to one aspect of this application, a method for training a neural network is provided, comprising: providing an input tensor and a weight tensor of a layer in the neural network to a neural processing unit to train the neural network through a training process including a forward operation and a backward operation; offloading the forward operation to a matrix multiplication (MatMul) kernel on the neural processing unit, the MatMul kernel being used to perform the layer by performing a first MatMul operation on the input tensor and the weight tensor and generating an output tensor of the layer; offloading the backward operation to the MatMul kernel, the MatMul kernel being used to compute a gradient of a loss by performing a second MatMul operation on the gradient of the output tensor and the input tensor; and training the layer by updating the weight tensor based on the gradient of the loss.

[0005] According to another aspect of this application, one or more non-transitory computer-readable media are provided storing instructions executable to perform operations for training a neural network, the operations comprising: providing input tensors and weight tensors of layers in the neural network to a neural processing unit to train the neural network through a training process including forward and backward operations; offloading the forward operation to a matrix multiplication (MatMul) kernel on the neural processing unit, the MatMul kernel being configured to perform the layer by performing a first MatMul operation on the input tensors and the weight tensors and generating an output tensor of the layer; offloading the backward operation to the MatMul kernel, the MatMul kernel being configured to compute a gradient of a loss by performing a second MatMul operation on the gradient of the output tensor and the input tensor; and training the layer by updating the weight tensors based on the gradient of the loss.

[0006] According to another aspect of this application, an apparatus is provided, comprising: a computer processor for executing computer program instructions; and a non-transitory computer-readable storage memory storing the computer program instructions executable by the computer processor to perform operations for training a neural network, the operations comprising: providing input tensors and weight tensors of layers in the neural network to a neural processing unit to train the neural network through a training process including forward operations and backward operations; offloading the forward operations to a matrix multiplication (MatMul) kernel on the neural processing unit, the MatMul kernel being configured to perform the layer by performing a first MatMul operation on the input tensors and the weight tensors and generating an output tensor of the layer; offloading the backward operations to the MatMul kernel, the MatMul kernel being configured to compute a gradient of a loss by performing a second MatMul operation on the gradient of the output tensor and the input tensor; and training the layer by updating the weight tensors based on the gradient of the loss.

Brief Description of the Drawings

[0007] The various embodiments will be readily understood from the following detailed description taken in conjunction with the accompanying drawings. For ease of description, similar reference numerals denote similar structural elements. In the accompanying figures, embodiments are shown by way of example rather than limitation.

[0008] Figure 1 shows a block diagram of an AI system according to various embodiments.

[0009] Figure 2 shows an example convolution according to various embodiments.

[0010] Figure 3 illustrates MatMul operations according to various embodiments.

[0011] Figure 4 illustrates operations in the forward pass of a DNN training process according to various embodiments.

[0012] Figure 5 shows a MatMul kernel on the NPU to which the forward pass is offloaded according to various embodiments.

[0013] Figure 6 shows a MatMul kernel on the NPU to which the backward pass is offloaded according to various embodiments.

[0014] Figure 7 is a flowchart of a method for training a DNN according to various embodiments.

[0015] Figure 8 shows an example transformer model according to various embodiments.

[0016] Figure 9 shows an example CNN according to various embodiments.

[0017] Figure 10 is a block diagram of a neural processing unit (NPU) according to various embodiments.

[0018] Figure 11 shows an example sparse cell according to various embodiments.

[0019] Figure 12 shows an example sparse cell array according to various embodiments.

[0020] Figure 13 shows an example processing element (PE) according to various embodiments.

[0021] Figure 14 is a block diagram of an example computing device according to various embodiments.

Detailed Description

Overview

[0022] The past decade has witnessed the rapid rise of AI-based data processing technologies, particularly those based on Deep Neural Networks (DNNs). DNNs are widely used in computer vision, speech recognition, and image and video processing, primarily due to their ability to achieve accuracy surpassing human levels. DNNs typically consist of sequences of layers. A DNN layer can include one or more operations, such as matrix multiplication, convolution, interpolation, layer normalization, batch normalization, SoftMax operations, pooling, element-wise operations, linear operations, non-linear operations, and so on. These operations are known as deep learning operations or neural network operations.

[0023] Neural network operations can be tensor operations. The input or output data of a neural network operation can be arranged in a data structure called a tensor. Taking a convolutional layer as an example, the input tensor includes an activation tensor (also called an "input feature map (IFM)" or "input activation tensor") and a weight tensor. The activation tensor includes one or more activation values (also called "input elements"). The weight tensor can be a kernel (2D weight tensor), a filter (3D weight tensor), or a filter group (4D weight tensor). Convolution can be performed on the input activation tensor and the weight tensor to compute the output activation tensor in the convolutional layer.
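As a rough illustration of the convolution described above, the following is a minimal sketch in plain Python with no third-party dependencies. It assumes a single-channel 2D input activation tensor, a 2D kernel, stride 1, and no padding, and it computes cross-correlation (no kernel flip), which is what DNN frameworks commonly call convolution. The function name `conv2d` and the sample values are hypothetical and are not taken from this disclosure.

```python
def conv2d(activations, kernel):
    """Slide a 2D kernel over a 2D input feature map (stride 1, no padding)."""
    in_h, in_w = len(activations), len(activations[0])
    k_h, k_w = len(kernel), len(kernel[0])
    out_h, out_w = in_h - k_h + 1, in_w - k_w + 1
    output = [[0.0] * out_w for _ in range(out_h)]
    for y in range(out_h):
        for x in range(out_w):
            acc = 0.0
            # Multiply-accumulate over the window covered by the kernel.
            for ky in range(k_h):
                for kx in range(k_w):
                    acc += activations[y + ky][x + kx] * kernel[ky][kx]
            output[y][x] = acc
    return output


# Hypothetical 3x3 input feature map (IFM) and 2x2 kernel.
ifm = [[1.0, 2.0, 3.0],
       [4.0, 5.0, 6.0],
       [7.0, 8.0, 9.0]]
kernel = [[1.0, 0.0],
          [0.0, -1.0]]
ofm = conv2d(ifm, kernel)  # 2x2 output activation tensor
```

Each output element is the sum of element-wise products between the kernel and one window of the input; a filter (3D) or filter group (4D) would add loops over channels and filters.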

[0024] A tensor is a data structure that has multiple elements along one or more dimensions. Examples of tensors include vectors (which are one-dimensional (1D) tensors), matrices (which are two-dimensional (2D) tensors), three-dimensional (3D) tensors, four-dimensional (4D) tensors, and tensors with higher dimensions. The dimensions of a tensor can correspond to axes (e.g., axes in a coordinate system). Dimensions can be measured by the number of data points along an axis. The dimensions of a tensor can define the shape of the tensor. A DNN layer can receive one or more input tensors and compute an output tensor based on those input tensors. In some embodiments, a 3D tensor can have X, Y, and Z dimensions. The X dimension of a tensor can be a horizontal dimension, and its length can be the width of the tensor; the Y dimension can be a vertical dimension, and its length can be the height of the tensor; the Z dimension can be a channel dimension, and its length can be the number of channels. The coordinates of elements along a dimension can be integers from 0 to (L-1), where L is the length of the tensor in that dimension. For example, the x-coordinate of the first element in a row can be 0, the x-coordinate of the second element in a row can be 1, and so on. Similarly, the y-coordinate of the first element in a column can be 0, the y-coordinate of the second element can be 1, and so on. A 4D tensor can have a fourth dimension, which can indicate the number of batches in the operation.

[0025] In recent years, the rapid development of AI and deep learning has highlighted the need for more efficient and high-performance hardware accelerators designed specifically for DNN workloads. General-purpose processors, such as central processing units (CPUs) and graphics processing units (GPUs), have proven insufficient for certain deep learning applications, especially when faced with resource constraints in mobile, embedded, and edge environments. This inadequacy can be overcome by using NPUs, which are typically designed to efficiently handle computationally intensive tasks in DNN training and inference. While NPUs offer significant advantages in energy efficiency and processing speed, training and fine-tuning DNNs on these architectures presents unique challenges.

[0026] Training or fine-tuning a DNN typically involves using a dataset to teach it to make accurate predictions. The training or fine-tuning process usually involves iteratively updating the DNN's internal parameters (e.g., weights) to minimize a loss function that measures the difference between the DNN's predictions and reference values (e.g., ground-truth values). Training and fine-tuning DNNs on specialized hardware like NPUs can involve a unique set of technical constraints and requirements. Due to their fixed-function hardware design and limited support for the floating-point precision typically required for gradient-based optimization, NPUs are generally optimized for inference rather than training. Therefore, training on these devices often requires workarounds or tweaks to optimize data flow, minimize memory usage, and avoid precision losses that could degrade model performance.

[0027] Currently available training methods primarily rely on GPUs or Tensor Processing Units (TPUs), for which a wide range of techniques and tools has been established. However, due to fundamental architectural differences, these techniques and tools generally cannot be directly transferred to NPUs. Many NPUs are built around optimized tensor operations and fixed-function pipelines, which differ significantly from the flexible, programmable pipelines of GPUs and TPUs. Furthermore, state-of-the-art DNN models feature increasingly complex architectures, including recurrent networks, convolutional networks, and transformer-based networks, requiring substantial computational power and large amounts of data movement across memory hierarchies. Each layer of these models (especially in fine-tuning, where layers can be frozen or adjusted based on previous training) requires precise handling of weights, biases, and gradients, posing a challenge to NPUs.

[0028] However, enabling training on an NPU offers significant benefits, allowing for greater flexibility, efficiency, and responsiveness in machine learning applications deployed on edge devices, embedded systems, and mobile platforms. Typically, DNNs are trained on high-performance GPUs or TPUs in centralized data centers and then deployed for inference on dedicated hardware such as NPUs. While effective for many applications, this approach has clear limitations for scenarios requiring low-latency processing directly on the device, continuous learning, and rapid adaptation. Enabling training on an NPU addresses several key technical and practical needs.

[0029] For example, edge-adaptive and personalized models are needed. Training directly on the NPU allows models to adapt to constantly changing environments or user-specific data at the edge. For instance, models in wearable health devices can be fine-tuned to suit an individual's unique patterns, or smart home devices can learn user preferences and continuously improve their models without relying on cloud-based update cycles. Real-time learning and reduced latency are also needed. Edge devices typically operate in real-time scenarios where latency is critical, such as in autonomous driving or industrial automation. By allowing the NPU to train or fine-tune models in the field, systems can adapt to changing conditions without the latency associated with sending data to remote servers, waiting for updates, and then redeploying the model. Enhanced privacy and data security are also needed. Training on the NPU mitigates privacy and security concerns by keeping data on the device rather than transmitting it to a centralized server. This is particularly important for applications involving sensitive data, such as healthcare, where keeping data on the device helps meet regulatory requirements and reassures users about data privacy. Furthermore, bandwidth efficiency and cost savings are needed. Continuously sending data to the cloud for retraining can consume significant bandwidth, especially in Internet of Things (IoT) environments where numerous devices generate large amounts of data. Localized training on NPUs reduces reliance on network infrastructure and saves bandwidth and associated cloud processing costs, making large-scale IoT deployments more scalable. Effective adaptation to non-stationary data is also necessary. Many real-world applications encounter non-stationary data, where the data distribution changes over time. This typically requires models that can dynamically adapt rather than relying on static, pre-trained networks. Training on NPUs enables real-time adaptation to these distribution changes, improving the model's robustness and accuracy under unpredictable conditions. Energy efficiency is also crucial. NPUs are highly energy efficient compared to general-purpose processors, especially in the matrix and tensor operations common in neural networks. Training on NPUs optimized for low-power processing allows for energy-efficient model updates, making it feasible to run and train deep learning models even in resource-constrained environments.

[0030] Some embodiments of this disclosure can address at least some of the challenges and problems mentioned above by providing methods for effectively training and fine-tuning DNNs on an NPU. For example, the forward and backward passes during the training or fine-tuning process can be directly offloaded to the NPU, and an automatic differentiation module can be seamlessly integrated with the training flow to automatically compute gradients.

[0031] In various embodiments of this disclosure, a kernel on the NPU can be designed to perform MatMul operations on tensors of various dimensions. This kernel can also be referred to as a MatMul kernel. The process of training or fine-tuning a DNN can be a process of updating the weights in the DNN to improve the accuracy of the DNN. For example, the weights are updated to minimize the difference between the DNN's predictions and reference data (e.g., ground-truth values). The fine-tuning process can be a process of retraining a previously trained model. The following description of DNN training also applies to fine-tuning. The training process can include forward and backward propagation through the layers of the DNN. Forward propagation is also referred to as a forward pass because data passes through the layers of the DNN in the order of their arrangement, e.g., from the input layer to the hidden layers and then to the output layer. Backward propagation is also referred to as a backward pass because data is passed backward through the layers of the DNN. Operations in forward propagation (“forward operations”) and operations in backward propagation (“backward operations”) can be converted into MatMul operations. Forward and backward operations can be offloaded to the kernel. For a forward operation, the input tensor and the weight tensor of a layer can be provided to the kernel. The kernel executes the layer by performing a first MatMul operation on the input tensor and the weight tensor, producing the layer's output tensor. A loss is computed by applying a loss function to the output tensor and one or more reference values. For the backward operation corresponding to the forward operation, the kernel computes the weight gradient of the loss by performing a second MatMul operation on the gradient of the output tensor and the input tensor, and computes the input gradient of the loss by performing a third MatMul operation on the gradient of the output tensor and the weight tensor. The gradient of the output tensor can be computed using an automatic differentiation module running on the NPU or a CPU. The input gradient can be backpropagated to previous layers of the neural network. The input tensor can be the output tensor of the previous layer. The weight tensor can be updated based on the input gradient and the weight gradient to minimize the loss. The kernel can perform a series of forward and backward operations until the accuracy of the DNN reaches a desired level.
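The three MatMul operations described above can be sketched as follows, in plain Python with nested lists standing in for 2D tensors. This is an illustrative sketch under stated assumptions, not the kernel of this disclosure: it assumes a linear layer Y = XW, for which the standard gradients are dL/dW = Xᵀ(dL/dY) and dL/dX = (dL/dY)Wᵀ; it assumes a sum-of-squares loss whose output gradient is written by hand in place of an automatic differentiation module; and it applies a plain gradient-descent update using the weight gradient, while the input gradient is returned for backpropagation to a previous layer. All function names (`matmul`, `forward`, `backward`, `sgd_step`, `loss_and_output_grad`) are hypothetical.

```python
def matmul(a, b):
    """Multiply an (m x k) matrix by a (k x n) matrix given as nested lists."""
    inner, cols = len(b), len(b[0])
    assert all(len(row) == inner for row in a), "inner dimensions must match"
    return [[sum(row[p] * b[p][j] for p in range(inner)) for j in range(cols)]
            for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def forward(x, w):
    """First MatMul: output tensor Y = X @ W."""
    return matmul(x, w)


def loss_and_output_grad(y, target):
    """Assumed loss 0.5 * sum((y - t)^2); its gradient w.r.t. y is (y - t).
    This hand-written gradient stands in for an automatic differentiation module."""
    grad_y = [[yi - ti for yi, ti in zip(y_row, t_row)]
              for y_row, t_row in zip(y, target)]
    loss = 0.5 * sum(g * g for row in grad_y for g in row)
    return loss, grad_y


def backward(x, w, grad_y):
    """Second MatMul: dL/dW = X^T @ dL/dY. Third MatMul: dL/dX = dL/dY @ W^T."""
    grad_w = matmul(transpose(x), grad_y)
    grad_x = matmul(grad_y, transpose(w))
    return grad_w, grad_x


def sgd_step(w, grad_w, lr):
    """Plain gradient-descent weight update (an assumed optimizer)."""
    return [[w_ij - lr * g_ij for w_ij, g_ij in zip(w_row, g_row)]
            for w_row, g_row in zip(w, grad_w)]


# Hypothetical values: a batch of one sample with two features, two outputs.
x = [[1.0, 2.0]]
w = [[0.5, -1.0],
     [1.5, 2.0]]
target = [[1.0, 0.0]]

y = forward(x, w)
loss_before, grad_y = loss_and_output_grad(y, target)
grad_w, grad_x = backward(x, w, grad_y)   # grad_x would go to the previous layer
w = sgd_step(w, grad_w, lr=0.1)
loss_after, _ = loss_and_output_grad(forward(x, w), target)
```

The forward pass and both gradient computations reduce to the same `matmul` routine applied to differently arranged operands, which is why a single MatMul kernel can serve both passes.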

[0032] Using the methods of this disclosure, on-device training and fine-tuning can be performed directly on the NPU, enabling real-time model adaptation and personalized AI solutions. This approach provides the possibility of continuous and autonomous learning at the edge device and empowers AI systems to become smarter, more personalized, and more responsive. It reduces the need for additional infrastructure, minimizes latency, and enhances data privacy, making it ideal for applications in dynamic, data-sensitive environments. These advantages are likely to be particularly impactful as AI expands into areas requiring real-time adaptation, privacy, and cost-effective scalability, such as healthcare, smart cities, autonomous driving systems, and IoT networks.

[0033] For illustrative purposes, specific figures, materials, and configurations have been set forth to provide a thorough understanding of the illustrative implementation. However, it will be apparent to those skilled in the art that this disclosure may be practiced without specific details, and/or may be practiced using only some of the aspects described. In other instances, well-known features have been omitted or simplified so as not to obscure the illustrative embodiments.

[0034] Furthermore, reference is made to the accompanying drawings, which form part of this disclosure and in which embodiments that may be practiced are shown by way of illustration. It should be understood that other embodiments may be utilized and structural or logical changes may be made without departing from the scope of this disclosure. Therefore, the following detailed description should not be construed as limiting.

[0035] Various operations can be described as a plurality of discrete actions or operations in sequence, in a manner most conducive to understanding the claimed subject matter. However, the order of description should not be construed as implying that these operations are necessarily order dependent. In particular, these operations may not be performed in the order presented. The described operations may be performed in a different order than in the described embodiments. Various additional operations may be performed, or the described operations may be omitted in additional embodiments.

[0036] For the purposes of this disclosure, the phrase "A or B" or the phrase "A and/or B" refers to (A), (B), or (A and B). For the purposes of this disclosure, the phrase "A, B, or C" or the phrase "A, B, and/or C" refers to (A), (B), (C), (A and B), (A and C), (B and C), or (A, B, and C). When used to refer to a measurement range, the term "between" includes the endpoints of the measurement range.

[0037] This description uses the phrases "in one embodiment" or "in an embodiment," both of which can refer to one or more of the same or different embodiments. Terms such as "comprising," "including," "having," etc., used with respect to embodiments of this disclosure are synonyms. This disclosure may use perspective-based descriptions such as "above," "below," "top," "bottom," and "side" to interpret various features of the drawings; however, these terms are merely for ease of discussion and do not imply any desired or required direction. The drawings are not necessarily drawn to scale. Unless otherwise stated, the use of ordinal adjectives such as "first," "second," and "third" to describe common objects indicates only different instances of the similar objects referred to and is not intended to imply that the objects described must be arranged in a given order, whether temporally, spatially, in rank, or otherwise.

[0038] In the following detailed description, terms commonly used by those skilled in the art are used to describe various aspects of the illustrative implementations in order to convey the substance of their work to others skilled in the art.

[0039] The terms “substantially,” “close to,” “approximately,” “near,” and “about” generally refer to values within +/- 20% of the target value as described herein or known in the art. Similarly, terms indicating the orientation of various elements, such as “coplanar,” “perpendicular,” “orthogonal,” “parallel,” or any other angle between elements, generally refer to values within +/- 5-20% of the target value as described herein or known in the art.

[0040] Furthermore, the terms “comprising,” “including,” “having,” or any other variations thereof are intended to cover non-exclusive inclusion. For example, a method, process, apparatus, or DNN accelerator that includes a list of elements is not necessarily limited to those elements, but may include other elements not expressly listed or inherent to such a method, process, apparatus, or DNN accelerator. Additionally, the term “or” refers to inclusive “or” rather than exclusive “or.”

[0041] The systems, methods, and apparatuses disclosed herein each have several innovative aspects, no single one of which is solely responsible for all the desired properties disclosed herein. Details of one or more implementations of the subject matter described herein are set forth in the following description and figures.

[0042] Figure 1 is a block diagram of an AI system 100 according to various embodiments. The AI system 100 includes a DNN module 110, a CPU 120A, and an NPU 120B. In other embodiments, the AI system 100 may include alternative configurations, different, or additional components. For example, the AI system 100 may include multiple CPUs or NPUs. Furthermore, the AI system 100 may include other types of processing units, such as GPUs. Additionally, the functionality implemented by the components of the AI system 100 may be implemented by other components included in the AI system 100 or by other systems. For example, the functionality implemented by the DNN module 110 may be implemented by a module or system on the CPU 120A or NPU 120B.

[0043] DNN module 110 facilitates the generation and deployment of DNNs. In some embodiments, DNN module 110 can train and fine-tune the DNN. DNN module 110 can offload operations during the DNN training and fine-tuning process to NPU 120B. DNN module 110 can also deploy the trained or fine-tuned DNN for AI applications (e.g., language processing, image classification, motion planning, etc.). In some embodiments, DNN module 110 can facilitate the deployment of DNNs using NPU 120B. For example, DNN module 110 can offload DNN inference operations to NPU 120B. DNN inference can be the process of executing a trained or fine-tuned DNN to perform an AI task. In other embodiments, DNN module 110 can distribute the trained or fine-tuned DNN to devices or systems that can use the DNN to perform a task for which the DNN has been trained.

[0044] As shown in Figure 1, the DNN module 110 includes an interface module 130, a training module 140, an automatic differentiation module 150, a compression module 160, a compiler 170, and a data storage device 180. In other embodiments, the DNN module 110 may include alternative configurations, different, or additional components. Furthermore, the functionality implemented by the components of the DNN module 110 may be implemented by other components included in the DNN module 110 or by other modules or systems. In some embodiments, the DNN module 110 may be executed on a computer system including the AI system 100. The DNN module 110 may run on an operating system of the computer system. The DNN module 110 may use a processing unit in the computer system, such as a CPU 120A or another CPU.

[0045] Interface module 130 facilitates communication between DNN module 110 and other modules or systems. In some embodiments, interface module 130 can establish communication between DNN module 110 and an external database to receive datasets that can be used to train or fine-tune the DNN. Interface module 130 can also receive datasets to be processed by the trained or fine-tuned DNN to perform AI tasks. In some embodiments, interface module 130 can receive requests to train, fine-tune, or deploy the DNN. These requests can be received from applications running on the device where DNN module 110 resides. For example, DNN module 110 can be executed on a computing device, and the request can be received from applications running on the operating system of that computing device (e.g., word processing applications, image processing applications, browser applications, etc.). Interface module 130 can forward requests or datasets for training or fine-tuning the DNN to training module 140. Interface module 130 can forward requests or datasets for deploying the DNN to deployment module 160. In some embodiments, interface module 130 can distribute the trained or fine-tuned DNN to other systems, such as computing devices configured to apply the DNN to perform AI tasks.

[0046] Training module 140 trains and fine-tunes the DNN. In various embodiments, the fine-tuning process is considered a training process. For example, the fine-tuning process can be a retraining or further training process. Training module 140 can use a training dataset to train the DNN. Training module 140 can generate a training dataset. The training dataset can include training samples and reference values. Training samples can be the input to the DNN. Reference values can represent the correct predictions made by the DNN based on the training samples. Reference values can be true values or validated values. In an example where training module 140 trains the DNN to recognize objects in images, training module 140 can generate a training dataset that includes training images and training labels. Training labels describe the true classification of objects in the training images. In some embodiments, each label in the training dataset corresponds to an object in the training image. In some embodiments, a portion of the training dataset can be used to initially train the DNN, and the remainder of the training dataset can be reserved as a validation subset, used by training module 140 to validate the performance of the trained DNN. A portion of the training dataset that does not include the validation subset can be used to train the DNN.
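The reservation of a validation subset described above could look roughly like the sketch below. It is a plain-Python illustration under stated assumptions: the 20% validation fraction, the shuffling before the split, the fixed seed, and the name `split_dataset` are all hypothetical choices that this disclosure does not specify.

```python
import random


def split_dataset(samples, labels, validation_fraction=0.2, seed=0):
    """Return (training pairs, validation pairs) as disjoint (sample, label) lists."""
    assert len(samples) == len(labels), "each sample needs a reference value"
    indices = list(range(len(samples)))
    random.Random(seed).shuffle(indices)          # assumed: shuffle before splitting
    n_val = int(len(indices) * validation_fraction)
    val_idx, train_idx = indices[:n_val], indices[n_val:]
    train = [(samples[i], labels[i]) for i in train_idx]
    val = [(samples[i], labels[i]) for i in val_idx]
    return train, val


# Hypothetical dataset: ten training images, each with a class label.
images = [f"image_{i}" for i in range(10)]
labels = [i % 2 for i in range(10)]
train_set, val_set = split_dataset(images, labels)
```

The training portion is used to update the DNN's parameters, and the held-out validation portion is used only to measure the performance of the trained DNN.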

[0047] Training module 140 can determine the hyperparameters used to train the DNN. Hyperparameters are variables that specify the DNN training process. Hyperparameters are different from the parameters inside the DNN (e.g., filter weights). In some embodiments, hyperparameters include variables that determine the DNN architecture, such as the number of hidden layers. Hyperparameters also include variables that determine how the DNN is trained, such as batch size, number of epochs, etc. Batch size defines the number of training samples processed before the DNN's internal parameters are updated. The batch size is the same as or smaller than the number of samples in the training dataset. The training dataset can be divided into one or more batches. The number of batches can define the number of times the DNN's internal parameters are updated within a single epoch. The number of epochs can define the number of times the entire training dataset is passed forward and backward through the entire network. One epoch means that each training sample in the training dataset has had an opportunity to update the DNN's internal parameters. The number of epochs can be 1, 5, 10, 50, 100, 500, 1000, or greater. An epoch can include one or more batches. Training module 140 can train the DNN for a predetermined number of epochs. After the training module 140 completes the predetermined number of epochs, the training module 140 can stop updating the parameters in the DNN. The DNN with the updated parameters is called the trained DNN.
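The relationship between batch size, batches, and epochs described above can be made concrete with a short plain-Python sketch. The helper names and the sample numbers are hypothetical, and the sketch assumes that a final, smaller batch is kept when the dataset size is not a multiple of the batch size, which this disclosure does not specify.

```python
import math


def updates_per_epoch(num_samples, batch_size):
    """Number of parameter updates in one epoch (one update per batch)."""
    if not 1 <= batch_size <= num_samples:
        raise ValueError("batch size must be between 1 and the dataset size")
    return math.ceil(num_samples / batch_size)


def iterate_batches(samples, batch_size):
    """Yield consecutive batches; the last one may be smaller (an assumption)."""
    for start in range(0, len(samples), batch_size):
        yield samples[start:start + batch_size]


samples = list(range(10))   # stand-ins for ten training samples
batch_size = 4
epochs = 3

total_updates = 0
for _ in range(epochs):
    for batch in iterate_batches(samples, batch_size):
        # A forward pass, backward pass, and parameter update would run here.
        total_updates += 1
```

With ten samples and a batch size of four, each epoch has three batches, so three epochs produce nine parameter updates.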

[0048] In some embodiments, the training module 140 can define the architecture of the DNN, for example, based on some hyperparameters. The architecture of the DNN includes an input layer, an output layer, and multiple hidden layers. The input layer of the DNN can include tensors (e.g., multidimensional arrays) specifying attributes of the input image, such as the height, width, and depth of the input image (e.g., specifying the number of bits of color in the pixels of the input image). The output layer includes labels for objects in the input layer. Hidden layers are layers between the input and output layers. Hidden layers include one or more convolutional layers and one or more other types of layers, such as pooling layers, fully connected layers, normalization layers, SoftMax, or logistic layers. The convolutional layers of the DNN abstract the input image into a feature map, which is represented by a tensor specifying the feature map height, feature map width, and feature map channels (e.g., red, green, and blue images include 3 channels). Pooling layers are used to reduce the spatial volume of the input image after convolution. They are used between two convolutional layers. Fully connected layers involve weights, biases, and neurons. They connect neurons in one layer to neurons in another layer and are used to classify images into different categories through training. During the definition of the DNN architecture, the training module 140 also adds activation functions to the hidden or output layers. The activation function of a layer transforms the weighted sum of the layer's inputs into the layer's output. Activation functions can be, for example, ReLU activation functions, hyperbolic tangent (tanh) activation functions, or other types of activation functions.
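As a small illustration of how an activation function transforms the weighted sum of a layer's inputs into its output, the following plain-Python sketch applies ReLU and the hyperbolic tangent to one neuron. The function names and the numeric values are hypothetical and chosen only for illustration.

```python
import math


def relu(z):
    """Rectified linear unit: passes positive values, clamps negatives to zero."""
    return max(0.0, z)


def tanh(z):
    """Hyperbolic tangent: squashes any real value into the range (-1, 1)."""
    return math.tanh(z)


def neuron_output(inputs, weights, bias, activation):
    """Compute activation(weighted sum of inputs + bias) for a single neuron."""
    weighted_sum = sum(x * w for x, w in zip(inputs, weights)) + bias
    return activation(weighted_sum)


inputs = [1.0, -2.0]
weights = [0.5, 1.0]
bias = 0.5                      # weighted sum = 0.5 - 2.0 + 0.5 = -1.0

relu_out = neuron_output(inputs, weights, bias, relu)   # negative sum is clamped
tanh_out = neuron_output(inputs, weights, bias, tanh)   # negative sum stays negative
```

The same weighted sum yields different layer outputs depending on the activation chosen, which is why the activation function is part of the architecture definition.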

[0049] To train the DNN, training module 140 feeds training samples into the DNN. Training module 140 modifies the parameters internal to the DNN (“internal parameters of the DNN”) to minimize the error between the DNN’s predictions and the target value. The target value can be used as a reference value to measure the loss during training. The target value can be a real value (e.g., a value indicating the true value) or a value verified to be accurate or true. The internal parameters can be learnable parameters whose values can be optimized by training the DNN. Internal parameters include weights, such as weights in a convolutional filter, weights in an MHA layer, etc.

[0050] In some embodiments, the training module 140 may define stages in the training process. For example, for each training sample or each epoch, the training module 140 defines forward pass, backward pass, and optimization processes. During the forward pass, data is passed forward through the DNN layers. For example, data (e.g., activation values) is passed from the input layer to the hidden layers, and then to the output layer. The output of the DNN (indicating the DNN's prediction) can be generated at the last layer (which may be the DNN's output layer). This part of the forward pass may be the inference process, where the DNN is executed to process the training samples and make predictions. The inference process can be represented as follows: $y = F(x, w)$, where $y$ is the output of the DNN, $F$ is the network architecture, $x$ is the input to the DNN, and $w$ denotes the internal parameters (e.g., weights).

[0051] Training module 140 can train the DNN using gradient descent. After generating the DNN output, the loss can be calculated. Training module 140 can define a loss function that measures the loss during forward propagation. The loss measures the difference between the DNN output and the actual value. It provides an error metric that the optimization algorithm can use to update the internal parameters during optimization. In some embodiments, a loss function 420 can be selected from various types of loss functions, such as mean squared error (MSE), cross-entropy loss, mean absolute error (MAE), Huber loss, hinge loss, cosine similarity, Poisson loss, etc. The calculation of the loss function can be expressed as follows: $L = \frac{1}{N}\sum_{i=1}^{N} G(y_i, \hat{y}_i)$, where $L$ is the loss, $G$ is the loss function, $y_i$ is the DNN output for the $i$-th training sample, $\hat{y}_i$ is the corresponding reference value (or values), and $N$ is the number of training samples in the batch.
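As a rough illustration of the forward pass followed by a loss computation, the sketch below runs a single MatMul layer and averages a mean squared error over the batch. It is a NumPy stand-in rather than NPU code; the function names `forward` and `mse_loss`, the tensor values, and the choice of MSE are assumptions made for the example.

```python
import numpy as np

def forward(x, w):
    """Single MatMul layer standing in for the network: y = x @ w."""
    return x @ w

def mse_loss(y, y_ref):
    """Squared error summed per sample and averaged over the N samples in the batch."""
    n = y.shape[0]
    return float(np.sum((y - y_ref) ** 2) / n)

x = np.array([[1.0, 2.0], [3.0, 4.0]])   # N = 2 samples, 2 features each
w = np.array([[0.5], [-1.0]])            # 2 x 1 weight tensor
y_ref = np.array([[0.0], [0.0]])         # reference values

y = forward(x, w)           # [[-1.5], [-2.5]]
loss = mse_loss(y, y_ref)   # (2.25 + 6.25) / 2
print(loss)  # 4.25
```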

[0052] During backpropagation, data propagates backward, and the DNN runs backward. The data can be gradients calculated using the loss. A gradient can be the partial derivative of a function (e.g., a loss function) with respect to its inputs; this partial derivative can be the slope of the function. Gradients calculated during backpropagation can measure how the error or loss changes relative to changes in the weights. Gradients calculated during backpropagation can include the output gradient, the input gradient, and the weight gradient. The output gradient of a layer can be the gradient relative to the layer's output and can be represented as $\partial L / \partial Y_l$. The input gradient of a layer can be the gradient relative to the layer's input and can be represented as $\partial L / \partial X_l$. The weight gradient can be the gradient relative to each parameter of the layer and can be represented as $\partial L / \partial W_l$, where $l$ is the layer index. Training module 140 can define a MatMul operation to compute the weight gradient and another MatMul operation to compute the input gradient. The input gradient can be defined as $\frac{\partial L}{\partial X_l} = \frac{\partial L}{\partial Y_l}\,\frac{\partial Y_l}{\partial X_l}$, where $X_l$ is the layer input, $W_l$ denotes the layer parameters, and $Y_l$ is the layer output. The weight gradient can be defined as $\frac{\partial L}{\partial W_l} = \frac{\partial L}{\partial Y_l}\,\frac{\partial Y_l}{\partial W_l}$. In some embodiments, the layer executed in the forward pass can be represented as $Y_l = X_l W_l$. Therefore, the function of the input gradient can be transformed into $\frac{\partial L}{\partial X_l} = \frac{\partial L}{\partial Y_l}\, W_l^{T}$, and the function of the weight gradient can be transformed into $\frac{\partial L}{\partial W_l} = X_l^{T}\, \frac{\partial L}{\partial Y_l}$, where $\partial L / \partial Y_l$ is the output gradient. $X_l$ can be the input tensor of the layer (e.g., the activation tensor). $W_l$ can be the layer weight tensor. In some embodiments, $\partial L / \partial W_l$ can be a tensor with the same spatial shape as $W_l$. Input gradients can be propagated to previous layers. Weight gradients can be used to update parameters through the optimization process.
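The two MatMul operations of the backward pass can be sketched in NumPy as below, assuming the layer computes its output as input times weights. The helper name `backward`, the toy loss (the sum of all outputs, so the output gradient is all ones), and the finite-difference check are illustrative assumptions, not part of the described system.

```python
import numpy as np

def backward(x, w, grad_y):
    """Gradients for a layer y = x @ w, given grad_y = dL/dy."""
    grad_x = grad_y @ w.T   # input gradient, same shape as x
    grad_w = x.T @ grad_y   # weight gradient, same shape as w
    return grad_x, grad_w

def toy_loss(x, w):
    """Toy loss L = sum(y); chosen only so that dL/dy is all ones."""
    return float(np.sum(x @ w))

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 5))
w = rng.standard_normal((5, 3))

grad_y = np.ones((4, 3))
grad_x, grad_w = backward(x, w, grad_y)

# Finite-difference check of one weight-gradient entry.
eps = 1e-6
w_pert = w.copy()
w_pert[2, 1] += eps
numeric = (toy_loss(x, w_pert) - toy_loss(x, w)) / eps
print(abs(numeric - grad_w[2, 1]) < 1e-4)  # True
```

Both gradients come out of a plain MatMul, which is why the same kernel that runs the forward pass can, in principle, also serve the backward pass.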

[0053] During optimization, an optimization function can be used to update the internal parameters. Training module 140 can define optimization functions. An example optimization function could be: $w_{N+1} = w_N - \eta\,\frac{\partial L}{\partial w_N}$, where $\eta$ is the learning rate, $N$ is the index of the current batch, and $N+1$ is the index of the next batch.
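A plain gradient-descent update of this kind can be sketched as follows. The function name `sgd_step`, the learning rate, and the toy quadratic loss are assumptions chosen for illustration; other optimization functions are equally possible.

```python
def sgd_step(weights, grads, learning_rate):
    """Return the weights for the next batch: w_next = w - learning_rate * grad."""
    return [w - learning_rate * g for w, g in zip(weights, grads)]

# Toy problem: minimize L(w) = (w - 3)^2, whose gradient is 2 * (w - 3).
weights = [0.0]
for _ in range(100):                       # one update per (hypothetical) batch
    grads = [2.0 * (w - 3.0) for w in weights]
    weights = sgd_step(weights, grads, learning_rate=0.1)

print(round(weights[0], 6))  # 3.0
```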

[0054] In some embodiments, the training module 140 can offload MatMul operations from the forward and backward propagation to a MatMul kernel on the NPU 120B. This MatMul kernel can perform MatMul operations on tensors of various spatial shapes and dimensions. Thus, the MatMul kernel can perform MatMul operations in the forward propagation (e.g., MatMul operations within layers) and MatMul operations in the backward propagation (e.g., MatMul operations for computing input and weight gradients). The computation of the loss function can be performed by the same MatMul kernel or another kernel on the NPU 120B.

[0055] As shown above, the input to the MatMul operation in the backpropagation includes the output gradient. The training module 140 can deploy an automatic differentiation module 150 to compute the output gradient during the backpropagation. The training module 140 can leverage the capabilities of the automatic differentiation module 150 to integrate automatic differentiation and seamless gradient tracking into the training flow, thereby reducing the need for manual configuration of backpropagation computation. The training module 140 can instruct the compiler 170 to integrate the automatic differentiation module 150 into executable instructions (e.g., code) for performing the training process. In some embodiments, it can automatically run the functions in the automatic differentiation module 150 when the NPU 120B executes the executable instructions. In other embodiments, the automatic differentiation module 150 can use the CPU 120A instead of the NPU 120B. The automatic differentiation module 150 provides automatic differentiation capabilities, allowing it to seamlessly offload computationally intensive forward and backward propagations to the NPU while leaving the remaining control flow to the system CPU. The integration of the Automatic Differentiation Module 150 enables end-to-end gradient tracking and updates on the NPU without requiring users to manually configure backward computation for each layer, making it easier and more efficient for real-time training applications.

[0056] Automatic differentiation module 150 can automatically compute the derivatives of tensor operations. In some embodiments, automatic differentiation module 150 can track tensor operations during training, such as one or more MatMul operations during forward passes. For example, automatic differentiation module 150 can construct a dynamic computation graph tracking one or more MatMul operations. Automatic differentiation module 150 can also record the inputs and outputs of one or more MatMul operations. Automatic differentiation module 150 can compute the gradient of the output relative to all tensors that require gradients using the chain rule. An example of automatic differentiation module 150 is PyTorch Autograd. The functionality of automatic differentiation module 150 allows training loops to compute gradients and update weights without recompiling the DNN. Training module 140 can seamlessly offload computationally intensive operations (e.g., MatMul operations) to NPU 120B while leaving the remaining control flow to CPU 120A. By integrating the automatic differentiation module 150, the NPU 120B can perform end-to-end gradient tracking and updates without requiring manual configuration of backpropagation for each layer, making it more user-friendly and efficient for real-time training applications. This approach preserves the speed and efficiency of the NPU 120B because forward and backward propagation can be performed locally on the NPU 120B, and weights remain accessible and modifiable in the NPU 120B's memory.
This approach is advantageous compared to currently available frameworks that are typically designed for inference and require specific adaptations to effectively support training on NPUs, especially for layers with specific backpropagation and runtime requirements (e.g., dropout and layer normalization), and for nonlinear operations that can introduce complexity when computing gradients (e.g., max pooling and ReLU), which typically require control flow operations.
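To make the idea of a dynamic computation graph concrete, the sketch below implements a toy reverse-mode automatic differentiation tape for MatMul and sum operations. It is not PyTorch Autograd and not the automatic differentiation module 150; the `Tensor` class and all of its methods are hypothetical and exist only to show how recorded operations and the chain rule yield gradients without hand-written backward code per layer.

```python
import numpy as np

class Tensor:
    """Hypothetical minimal tensor that records the operation that produced it."""

    def __init__(self, data, parents=(), backward_fn=None):
        self.data = np.asarray(data, dtype=np.float64)
        self.grad = np.zeros_like(self.data)
        self.parents = parents          # tensors this one was computed from
        self.backward_fn = backward_fn  # maps this tensor's grad to the parents' grads

    def matmul(self, other):
        out = Tensor(self.data @ other.data, parents=(self, other))
        # Recorded inputs are reused here: the same two MatMul forms as a layer's backward pass.
        out.backward_fn = lambda g: (g @ other.data.T, self.data.T @ g)
        return out

    def sum(self):
        out = Tensor(self.data.sum(), parents=(self,))
        out.backward_fn = lambda g: (g * np.ones_like(self.data),)
        return out

    def backward(self):
        # Order the recorded graph topologically, then apply the chain rule in reverse.
        order, seen = [], set()

        def visit(t):
            if id(t) not in seen:
                seen.add(id(t))
                for p in t.parents:
                    visit(p)
                order.append(t)

        visit(self)
        self.grad = np.ones_like(self.data)
        for t in reversed(order):
            if t.backward_fn is not None:
                for parent, g in zip(t.parents, t.backward_fn(t.grad)):
                    parent.grad = parent.grad + g

x = Tensor([[1.0, 2.0], [3.0, 4.0]])
w = Tensor([[0.5, 0.0], [0.0, 2.0]])
loss = x.matmul(w).sum()   # the graph is built while the forward pass runs
loss.backward()
print(w.grad)              # [[4. 4.] [6. 6.]]
```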

[0057] In some embodiments, training module 140 facilitates mixed-precision training on the NPU. For example, BF16 (bfloat16) and FP16 (half-precision floating-point) formats can be used to significantly improve computational efficiency and reduce memory bandwidth requirements. BF16 and FP16 can be ideal for training DNNs because they offer a balance between accuracy and performance. Using these formats allows for faster matrix multiplication and gradient computation while reducing memory footprint without a significant loss of accuracy. The NPU hardware may include dedicated support for BF16 and FP16 operations, enabling high-speed tensor computations directly in these formats. For example, the NPU may include one or more memories capable of storing floating-point data. Furthermore, the NPU may include multipliers, adders, data paths, or other components that support floating-point data formats. Additionally, the NPU architecture can be optimized to handle higher-precision accumulations, mitigating the effects of numerical instability typically associated with low-precision formats. This hardware-based support for mixed-precision training maximizes the throughput of matrix multiplication operations, improves energy efficiency, and accelerates training, enabling the deployment of complex neural network training workflows on resource-constrained edge devices.
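The benefit of higher-precision accumulation can be illustrated with a small NumPy experiment. The sketch below adds 4096 FP16 ones sequentially, once into an FP16 accumulator and once into an FP32 accumulator; the function name `accumulate` and the test data are assumptions, and the loop only emulates in software what an accumulator in hardware might do.

```python
import numpy as np

def accumulate(values, acc_dtype):
    """Sequentially add FP16 values into an accumulator of the given precision."""
    acc = acc_dtype(0)
    for v in values:
        acc = acc_dtype(acc + acc_dtype(v))
    return acc

ones_fp16 = np.ones(4096, dtype=np.float16)

sum_fp16 = accumulate(ones_fp16, np.float16)  # stalls once FP16 spacing exceeds 1
sum_fp32 = accumulate(ones_fp16, np.float32)  # higher-precision accumulation

print(float(sum_fp16), float(sum_fp32))  # 2048.0 4096.0
```

FP16 has an 11-bit significand, so once the running sum reaches 2048 the next representable value is 2050 and adding 1 no longer changes the result. Keeping the operands in FP16 while accumulating in a wider format avoids this stall.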

[0058] In some embodiments, the training module 140 can also validate the accuracy of the trained or fine-tuned DNN. In some embodiments, the training module 140 feeds samples from the validation dataset into the trained DNN and uses the DNN's output to determine model accuracy. In some embodiments, the validation dataset may consist of some or all of the samples from the training dataset. Additionally or alternatively, the validation dataset includes additional samples outside the training dataset. In some embodiments, the training module 140 can determine an accuracy score that measures the precision, recall, or a combination of precision and recall of the DNN. The training module 140 can determine the accuracy score using the following metrics: precision = TP / (TP + FP) and recall = TP / (TP + FN). Precision can be the number of instances the DNN predicted correctly (TP, or true positives) divided by the total number of instances it predicted (TP + FP, where FP is false positives). Recall can be the number of instances the DNN predicted correctly (TP) divided by the total number of objects that have the attribute in question (TP + FN, where FN is false negatives). The F-score (F-score = 2 * P * R / (P + R), where P is precision and R is recall) unifies precision and recall into a single metric.
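The accuracy metrics above can be computed as in the following sketch. The function name `precision_recall_f` and the sample counts are hypothetical, and returning 0.0 when a denominator is zero is a convention chosen for the example.

```python
def precision_recall_f(tp, fp, fn):
    """Return (precision, recall, F-score) from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f_score = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f_score

p, r, f = precision_recall_f(tp=8, fp=2, fn=8)
print(p, r)  # 0.8 0.5
```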

[0059] Training module 140 can compare the accuracy score with a threshold score. In one example, if training module 140 determines that the accuracy score of the DNN is below the threshold score, training module 140 retrains the DNN. In one embodiment, training module 140 can iteratively retrain the DNN until a stopping condition occurs, such as an accuracy measurement indicating that the DNN is sufficiently accurate or a number of training epochs having been performed.

[0060] Compression module 160 compresses the DNN. For example, compression module 160 may add compression operations to DNN layers to reduce computational complexity or memory usage. Compression operations may modify the weights in the DNN layers. This modification may be performed before, during, or after training. In some embodiments, compression module 160 may select one or more layers in the DNN and modify each selected layer using compression operations. For example, compression module 160 may select computationally complex layers, such as layers with a large number of weights. For compression operations on a layer or a class of layers, compression module 160 may determine a weight threshold that will not cause the DNN's accuracy loss to exceed accuracy loss constraints. Compression operations may modify weights with absolute values higher than the weight threshold to lower precision values or zero, while keeping other weights unchanged.

[0061] After compressing the DNN, the compression module 160 can instruct the training module 140 to fine-tune the DNN. During this fine-tuning process, the values of the unpruned weights of the DNN can be modified, while the values of the pruned weights (i.e., zero) remain unchanged. For example, the compression module 160 can place a mask on a pruned weight block, and this mask can prevent the values in the pruned weight block from changing during the fine-tuning process. In other embodiments, the values of all weights (including pruned weights) can be changed during the fine-tuning process. After the fine-tuning process, the compression module 160 can perform a new pruning process (e.g., by changing more weights to zero). In some embodiments, the weight pruning process can be repeated multiple times before the fine-tuning process is complete. In some embodiments, the number of epochs in the fine-tuning process can differ from the number of epochs in the training process that determines the pre-pruning values of the weights. For example, the fine-tuning process can have fewer epochs than the training process. In one example, the number of epochs in the fine-tuning process can be relatively small, such as 2, 3, 4, 5, etc.
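One way to keep pruned weights fixed during fine-tuning is to multiply each update by a binary mask, as in the sketch below. The function name `masked_update`, the example weights and gradients, and the way the mask is derived (any weight that is already zero counts as pruned) are assumptions made for illustration.

```python
import numpy as np

def masked_update(weights, grads, mask, learning_rate):
    """Gradient step that leaves pruned positions (mask == 0) at zero."""
    return (weights - learning_rate * grads) * mask

weights = np.array([[0.8, 0.0], [0.0, -0.3]])   # zeros are pruned weights
mask = (weights != 0).astype(weights.dtype)     # 1 = unpruned, 0 = pruned
grads = np.array([[0.1, 0.5], [-0.2, 0.4]])

updated = masked_update(weights, grads, mask, learning_rate=0.5)
print(updated)  # pruned positions stay 0; the others move to 0.75 and -0.5
```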

[0062] Compiler 170 compiles the DNN to generate instructions (e.g., configuration parameters, etc.) that can be executed by CPU 120A or NPU 120B to perform neural network operations in the DNN for training or deployment purposes. In some embodiments, compiler 170 can generate a graph representing the DNN. This graph can include nodes and edges. Nodes can represent specific neural network operations in the DNN. Edges can connect two nodes and represent a connection between two corresponding neural network operations. In one example, an edge can encode a tensor flowing from one neural network operation to another. This tensor can be the output tensor of the first neural network operation and the input tensor of the second neural network operation. Edges can encode one or more properties of the tensor, such as size, shape, storage format, etc. Compiler 170 can use this graph to generate an executable DNN. For example, the compiler can generate computer program instructions for executing the DNN.

[0063] In some embodiments, compiler 170 can generate configuration parameters that can be used to configure components of NPU 120B for DNN execution. These configuration parameters can be stored in one or more configuration registers associated with the components of NPU 120B. In some embodiments, compiler 170 can compile the DNN before it is trained. During training, compiler 170 may not perform compilation. Compiler 170 can recompile the DNN after it has been trained. Compiler 170 can perform different compilations before and after training. For example, compiler 170 can compile the DNN before training based on the condition that its internal parameters will change during training. Compiler 170 can compile the DNN after training based on the condition that its internal parameters will remain unchanged.

[0064] Data storage device 180 stores data received, generated, used, or otherwise associated with DNN module 110. For example, data storage device 180 stores the dataset used by training module 140 to train or fine-tune the DNN. Data storage device 180 may also store data generated by training module 140, such as hyperparameters used to train the DNN, internal parameters of the trained DNN (e.g., weights), etc. Data storage device 180 may also store data generated by compression module 160, such as compressed weights. Data storage device 180 may store instructions, configuration parameters, or other data generated by compiler 170. Data storage device 180 may include one or more memories. In the embodiment of Figure 1, the data storage device 180 is a component of the DNN module 110. In other embodiments, the data storage device 180 may be external to the DNN module 110 and communicate with the DNN module 110 via a network.

[0065] CPU 120A may be a general-purpose processing unit. NPU 120B may be designed to accelerate DNNs. In some embodiments, NPU 120B may leverage parallel processing or data sparsity to accelerate DNN execution. CPU 120A may be used to control DNN training or deployment. For example, training module 140 or compiler 170 may be run using CPU 120A. In some embodiments (e.g., in an embodiment where AI system 100 is part of a computing device such as a personal computer, smartphone, or tablet), CPU 120A may also be used to run other applications, such as word processing applications, image processing applications, browsing applications, etc. NPU 120B may be used to perform computationally intensive operations (e.g., the MatMul operations described above) to train or deploy DNNs. CPU 120A and NPU 120B may be collectively referred to as "heterogeneous processing units 120," and individually as "heterogeneous processing unit 120." The heterogeneous processing units 120 may be implemented on separate chips. For example, each heterogeneous processing unit 120 may be implemented as a separate chip. Certain aspects of the NPU are described below in conjunction with Figures 10-13.

[0066] Figure 2 illustrates an example convolution according to various embodiments. The convolution can be a deep learning operation in a convolutional layer of a DNN. The convolution can be performed on an activation tensor 210 and filters 220 (individually referred to as "filter 220"). The filters 220 can constitute the weight tensor of the convolution. The result of the convolution is an output tensor 230. In some embodiments, the convolution is performed by an NPU (e.g., the NPU 120B in Figure 1). A convolution can include one or more MatMul operations. For example, each MatMul operation can be performed on the activation tensor 210 and a single filter 220.

[0067] The activation tensor 210 can be computed in previous layers of the DNN. In some embodiments (e.g., where the convolutional layer is the first layer of the DNN), the activation tensor 210 can be an image. In the embodiment of Figure 2, the activation tensor 210 includes activation values (also referred to as "input activation values," "elements," or "input elements") arranged in a 3D matrix. The activation tensor 210 can also be referred to as the input tensor of the convolution. Input elements are data points in the activation tensor 210. The activation tensor 210 has a spatial size $H_{in} \times W_{in} \times C_{in}$, where $H_{in}$ is the height of the 3D matrix (i.e., the length along the Y-axis, representing the number of activation values in a column of the 2D matrix of each input channel), $W_{in}$ is the width of the 3D matrix (i.e., the length along the X-axis, representing the number of activation values in a row of the 2D matrix of each input channel), and $C_{in}$ is the depth of the 3D matrix (i.e., the length along the Z-axis, representing the number of input channels). For simplicity and illustration, the activation tensor 210 has a spatial size of 7×7×3, meaning the activation tensor 210 includes three input channels, each with a 7×7 2D matrix. Each input element in the activation tensor 210 can be represented by (X, Y, Z) coordinates. In other embodiments, the height, width, or depth of the activation tensor 210 may be different.

[0068] Each filter 220 includes weights arranged in a 3D matrix. The values of the weights can be determined by training a DNN. A filter 220 has a spatial size $H_f \times W_f \times C_f$, where $H_f$ is the height of the filter (i.e., the length along the Y-axis, representing the number of weights in each column of a kernel), $W_f$ is the width of the filter (i.e., the length along the X-axis, representing the number of weights in each row of a kernel), and $C_f$ is the depth of the filter (i.e., the length along the Z-axis, representing the number of channels). In some embodiments, $C_f$ equals the number of input channels of the activation tensor 210. For simplicity and illustration, each filter 220 in Figure 2 has a spatial size of 3×3×3, meaning that the filter 220 includes three convolutional kernels with a spatial size of 3×3. In other embodiments, the height, width, or depth of filter 220 may vary. The spatial size of the convolutional kernel is smaller than the spatial size of the 2D matrix of each input channel in activation tensor 210.

[0069] Activation values or weights can occupy one or more bytes in memory. The number of bytes for an activation value or weight can depend on the data format. For example, when activation values or weights are in INT8 format, each activation value or weight occupies one byte. When activation values or weights are in FP16 format, each activation value or weight occupies two bytes. Other data formats can be used for activation values or weights.
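The memory footprint of a tensor follows directly from its element count and data format. The sketch below uses a 7×7×3 tensor as an example; the lookup table `BYTES_PER_ELEMENT` and the function name `tensor_bytes` are hypothetical, and the BF16 and FP32 entries are standard format widths added for comparison.

```python
# Bytes per element for a few common data formats.
BYTES_PER_ELEMENT = {"INT8": 1, "FP16": 2, "BF16": 2, "FP32": 4}

def tensor_bytes(shape, data_format):
    """Return the number of bytes needed to store a dense tensor of the given shape."""
    count = 1
    for dim in shape:
        count *= dim
    return count * BYTES_PER_ELEMENT[data_format]

print(tensor_bytes((7, 7, 3), "INT8"))  # 147
print(tensor_bytes((7, 7, 3), "FP16"))  # 294
```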

[0070] In the convolution, each filter 220 slides over the activation tensor 210 and generates a 2D matrix for an output channel in the output tensor 230. In the embodiment of Figure 2, the 2D matrix has a spatial size of 5×5. The output tensor 230 includes activation values (also referred to as "output activation values," "elements," or "output elements") arranged in a 3D matrix. Output activation values are data points in the output tensor 230. The output tensor 230 has a spatial size $H_{out} \times W_{out} \times C_{out}$, where $H_{out}$ is the height of the 3D matrix (i.e., the length along the Y-axis, representing the number of output activation values in a column of the 2D matrix of each output channel), $W_{out}$ is the width of the 3D matrix (i.e., the length along the X-axis, representing the number of output activation values in a row of the 2D matrix of each output channel), and $C_{out}$ is the depth of the 3D matrix (i.e., the length along the Z-axis, representing the number of output channels). $C_{out}$ can be equal to the number of filters 220 in the convolution. $H_{out}$ and $W_{out}$ can depend on the height and width of the activation tensor 210 and of each filter 220. In one example where the kernel size is 1×1, $H_{out}$ and $W_{out}$ can be equal to the height and width of the activation tensor 210, respectively.
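The dependence of the output height and width on the input and kernel sizes can be sketched with the usual convolution size formula. The function name `conv_output_size` is hypothetical, and the stride and padding parameters are assumptions that the text does not specify; with a stride of 1 and no padding, a 7-wide input and a 3-wide kernel give the 5-wide output of the example, and a 1×1 kernel preserves the input size.

```python
def conv_output_size(in_size, kernel_size, stride=1, padding=0):
    """Output length along one spatial axis of a convolution."""
    return (in_size + 2 * padding - kernel_size) // stride + 1

print(conv_output_size(7, 3))  # 5
print(conv_output_size(7, 1))  # 7
```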

[0071] As part of the convolution, multiply-accumulate (MAC) operations are performed on a 3×3×3 subtensor 215 of the activation tensor 210 (highlighted with a dotted pattern in Figure 2) and each filter 220. The result of performing the MAC operations on subtensor 215 and a filter 220 is an output activation value. In some embodiments (e.g., embodiments where the convolution is an integer convolution), the output activation value may include 8 bits, i.e., one byte. In other embodiments (e.g., embodiments where the convolution is a floating-point convolution), the output activation value may include more than one byte. For example, the output element may include two bytes.

[0072] After completing the MAC operations on subtensor 215 and all filters 220, vector 235 is produced. Vector 235 is highlighted with a dotted pattern in Figure 2. Vector 235 comprises a sequence of output activation values arranged along the Z-axis. The output activation values in vector 235 have the same (X, Y) coordinates, but these output activation values correspond to different output channels and have different Z coordinates. The dimension of vector 235 along the Z-axis can be equal to the total number of output channels in output tensor 230. After generating vector 235, further MAC operations are performed to generate additional vectors until output tensor 230 is generated. In the embodiment of Figure 2, the output tensor 230 is computed in Z-major order. When the output tensor 230 is computed in ZXY format, the vectors adjacent to vector 235 along the X-axis can be computed immediately after vector 235. When the output tensor 230 is computed in ZYX format, the vectors adjacent to vector 235 along the Y-axis can be computed immediately after vector 235. The output tensor 230 can be rearranged (e.g., via drain module 390) and stored in memory (e.g., local memory 1040) in either X-major or Y-major order.
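The way each subtensor yields one output vector along the Z-axis can be sketched with a naive NumPy convolution. The function name `conv2d`, the number of filters (4), the filter values, and the stride of 1 with no padding are all assumptions for illustration; the loop order (X innermost) loosely mirrors a ZXY traversal.

```python
import numpy as np

def conv2d(activations, filters):
    """Naive convolution, stride 1, no padding.

    activations: (H, W, C_in); filters: (num_filters, KH, KW, C_in).
    Returns an output tensor of shape (H - KH + 1, W - KW + 1, num_filters).
    """
    h, w, c_in = activations.shape
    n_f, kh, kw, c_f = filters.shape
    assert c_f == c_in, "filter depth must match the number of input channels"
    out = np.zeros((h - kh + 1, w - kw + 1, n_f))
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            sub = activations[y:y + kh, x:x + kw, :]
            # One output vector along Z: multiply-accumulate of the
            # subtensor with every filter.
            out[y, x, :] = np.tensordot(filters, sub, axes=([1, 2, 3], [0, 1, 2]))
    return out

acts = np.arange(7 * 7 * 3, dtype=np.float64).reshape(7, 7, 3)
filt = np.ones((4, 3, 3, 3))   # 4 hypothetical filters of size 3x3x3
out = conv2d(acts, filt)
print(out.shape)     # (5, 5, 4)
print(out[0, 0, 0])  # 675.0
```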

[0073] In some embodiments, multiple MAC units can perform MAC operations on a 3×3×3 subtensor (e.g., subtensor 215) and a filter 220. One or more MAC units can receive an input operand (e.g., the activation operand 217 shown in Figure 2) and a weight operand (e.g., the weight operand 227 shown in Figure 2). The activation operand 217 includes a sequence of activation values with the same (x, y) coordinates but different z coordinates. The activation operand 217 includes an activation value from each input channel in the activation tensor 210. The weight operand 227 includes a sequence of weights with the same (x, y) coordinates but different z coordinates. The weight operand 227 includes a weight from each channel in the filter 220. The activation values in the activation operand 217 and the weights in the weight operand 227 can be fed sequentially into a MAC unit. The MAC unit can receive one activation value and one weight (an "activation-weight pair") at a time and multiply the activation value by the weight. The position of the activation value in the activation operand 217 can match the position of the weight in the weight operand 227. The activation value and the weight can correspond to the same channel.

[0074] Activation values or weights can be floating-point numbers. Floating-point numbers can have various data formats, such as FP32, FP16, BF16, etc. Floating-point numbers can be positive or negative numbers with a decimal point. A floating-point number can be represented by a bit sequence, which includes one or more bits representing the sign of the floating-point number (e.g., positive or negative), bits representing the exponent of the floating-point number, and bits representing the mantissa of the floating-point number. The mantissa is the part of a floating-point number that represents the significant digits of the number. Multiplying the mantissa by the base raised to the power of the exponent gives the actual value of the floating-point number.
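The sign, exponent, and mantissa fields can be inspected directly for FP16. The sketch below assumes the IEEE 754 half-precision layout (1 sign bit, 5 exponent bits with a bias of 15, 10 stored mantissa bits) and handles only normal numbers; the function names `decode_fp16` and `fp16_value` are hypothetical.

```python
import struct

def decode_fp16(value):
    """Split a number stored as IEEE 754 half precision into (sign, exponent, mantissa)."""
    (bits,) = struct.unpack("<H", struct.pack("<e", value))
    sign = bits >> 15
    exponent = (bits >> 10) & 0x1F   # 5 exponent bits, bias 15
    mantissa = bits & 0x3FF          # 10 stored mantissa bits
    return sign, exponent, mantissa

def fp16_value(sign, exponent, mantissa):
    """Rebuild a normal FP16 value (subnormals, infinities, and NaN are not handled)."""
    return (-1) ** sign * (1 + mantissa / 1024) * 2.0 ** (exponent - 15)

print(decode_fp16(1.5))            # (0, 15, 512)
print(decode_fp16(-2.0))           # (1, 16, 0)
print(fp16_value(*decode_fp16(1.5)))  # 1.5
```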

[0075] In some embodiments, the output activation values in the output tensor 230 may be further processed based on one or more activation functions before being written to memory or input to the next layer of the DNN. Processing based on one or more activation functions may be at least part of the post-processing of the convolution. In some embodiments, post-processing may include one or more other computations, such as offset computation, bias computation, etc. The result of post-processing may be stored in the local memory of the computation block and used as input to the next DNN layer.

[0076] Figure 3 illustrates a MatMul operation according to various embodiments. The MatMul operation is performed on tensors 310 and 320 and produces tensor 330. In some embodiments, the MatMul operation may be an operation in a DNN layer. Tensor 310 may be generated in a previous layer, and tensor 320 may include the internal parameters of the DNN layer. Tensor 330 may be the output or an intermediate tensor of the DNN layer. The DNN layer may be a convolutional layer, a multi-head attention (MHA) layer, or another type of layer. The MatMul operation may be performed during the forward pass of the DNN training process. In other embodiments, the MatMul operation may be performed during the backward pass of the DNN training process. The MatMul operation may be performed to compute gradients. For example, tensor 330 may be a tensor of input gradients of a loss function or a tensor of weight gradients of the loss function.

[0077] For illustration, tensors 310 and 320 are 2D tensors. Tensor 310 has a spatial size of 1×4×5. Tensor 320 has a spatial size of 1×5×3. In some embodiments, a dot product is performed between a row of tensor 310 and a column of tensor 320 to generate a single data point in tensor 330. Tensor 330 has a spatial size of 1×4×3. In other embodiments, tensors 310, 320, or 330 may have other spatial sizes. Tensors 310, 320, or 330 may be 3D tensors.
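The shapes in this example can be reproduced with NumPy, which treats the leading dimension of size 1 as a batch dimension. The arrays `a`, `b`, and `c` stand in for tensors 310, 320, and 330, and their values are arbitrary.

```python
import numpy as np

a = np.arange(20, dtype=np.float64).reshape(1, 4, 5)   # stand-in for tensor 310
b = np.arange(15, dtype=np.float64).reshape(1, 5, 3)   # stand-in for tensor 320
c = a @ b                                              # stand-in for tensor 330

print(c.shape)  # (1, 4, 3)

# Each element of the result is the dot product of one row and one column.
row, col = a[0, 0, :], b[0, :, 0]
print(float(np.dot(row, col)), float(c[0, 0, 0]))  # 90.0 90.0
```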

[0078] Figure 4 illustrates operations in the forward pass of a DNN training process according to various embodiments. This training process is used to train a DNN 410. The forward pass can be a process in which the DNN 410 is executed to predict the output for a given input, and the difference between the DNN's prediction and the accurate prediction is measured. The accurate prediction can be the true value. In the embodiment of Figure 4, the forward pass includes executing the DNN 410 and executing the loss function 420.

[0079] Executing DNN 410 may include performing MatMul operations. DNN 410 receives input 401 and has an internal parameter set 402. The internal parameter set 402 includes learnable parameters in DNN 410. In the example of Figure 4, the execution of DNN 410 is represented as follows: $y = F(x, w)$, where $F$ represents the architecture of DNN 410 (e.g., a parameterizable function in DNN 410), $x$ represents the input 401, $w$ represents the internal parameter set 402, and $y$ represents the output 403 predicted by DNN 410.

[0080] Output 403 and reference prediction result 404 are input into loss function 420. Reference prediction result 404 can be a prediction result verified as true or accurate. In some embodiments, reference prediction result 404 includes one or more reference values representing the true labels of input 401. Input 401 and reference prediction result 404 can be data from the training dataset used during training. In the example of Figure 4, the execution of loss function 420 is represented as follows: $L = G(y, \hat{y})$, where $G$ represents the loss function 420, $y$ represents the output 403, $\hat{y}$ represents the reference prediction result 404, and $L$ represents the loss 405. The loss 405 indicates the difference between the output 403 of DNN 410 and the reference prediction result 404.

[0081] In some embodiments, the forward pass can be represented as: $L = \frac{1}{N}\sum_{i=1}^{N} G(F(x_i, w), \hat{y}_i)$, where $N$ can be the number of training samples in the batch, and $x_i$ and $\hat{y}_i$ are the input and the reference prediction result of the $i$-th training sample. The loss $L$ can be used to update one or more internal parameters of the DNN in a single pass. After the forward pass, a backward pass can be performed, where the gradient is computed and the internal parameter set 402 is updated according to the gradient to minimize the loss 405. The training process may include multiple forward passes and multiple backward passes. Some aspects of the backward pass are described below in conjunction with Figure 6.

[0082] Figure 5 illustrates offloading a forward pass to a MatMul kernel 510 on an NPU according to various embodiments. An example of the forward pass may be the forward pass in Figure 4. An example of the NPU may be the NPU 120B described in Figure 1. The MatMul kernel 510 can include one or more computational components within the NPU. For example, the MatMul kernel 510 can be a processing engine within the NPU. The MatMul kernel 510 can be designed to accommodate tensors of various dimensions. The NPU also includes a loss function kernel 520. The MatMul kernel 510 and the loss function kernel 520 can have the same type of computational components or different types of computational components.

[0083] As shown in Figure 5, the MatMul kernel 510 receives input 501 and weights 502. Input 501 can be a training sample. Weights 502 are internal parameters of the DNN and are learnable, meaning that the values of weights 502 can be changed by training the DNN. The MatMul kernel 510 can perform MatMul operations within the DNN, the result of which is output 503. Output 503 and output reference 504 are provided to the loss function kernel 520. The loss function kernel 520 can apply a loss function to output 503 and output reference 504 to calculate loss 505. In some embodiments, input 501, weights 502, or output reference 504 can be transferred to the MatMul kernel 510 or loss function kernel 520 via a DMA engine. Output 503 or loss 505 can be stored in local memory of the MatMul kernel 510 or loss function kernel 520 for further computation, such as computation in backpropagation.

[0084] Figure 6 shows a MatMul kernel 610 for offloading a backward pass to an NPU according to various embodiments. In some embodiments, the MatMul kernel 610 may be the MatMul kernel 510 in Figure 5. In other embodiments, the MatMul kernel 610 may be another kernel that coexists with the MatMul kernel 510 on the same NPU. The backward pass in Figure 6 is executed after the forward pass in Figure 5. The backward pass is executed by the MatMul kernel 610 and an automatic differentiation module 620. The automatic differentiation module 620 can also run on the NPU. Alternatively, the automatic differentiation module 620 can run on a CPU, such as the CPU 120A in Figure 1.

[0085] As shown in Figure 6, output 503, output reference 504, and loss 505 are provided to the automatic differentiation module 620. The automatic differentiation module 620 automatically calculates output gradient 601. Output gradient 601 can be the gradient of the loss with respect to the layer output. Output gradient 601, along with input 501 and weights 502, is provided to the MatMul kernel 610 to calculate input gradient 602 and weight gradient 603. Input gradient 602 can be the gradient of the loss with respect to the layer input. In some embodiments, the MatMul kernel 610 can perform a MatMul operation on weights 502 and output gradient 601 to calculate input gradient 602. For a layer that computes $Y = XW$, this MatMul operation can be represented as $\frac{\partial L}{\partial X} = \frac{\partial L}{\partial Y} W^{T}$, where $\frac{\partial L}{\partial X}$ denotes input gradient 602, $\frac{\partial L}{\partial Y}$ denotes output gradient 601, and $W$ denotes weights 502. The input gradient 602 can be propagated back to preceding layers, hence the term backpropagation.

[0086] The weight gradient 603 may include the gradient of the loss with respect to each weight. In some embodiments, the MatMul kernel 610 may perform a MatMul operation on the input 501 and the output gradient 601 to compute the weight gradient 603. For a layer that computes $Y = XW$, this MatMul operation can be represented as $\frac{\partial L}{\partial W} = X^{T} \frac{\partial L}{\partial Y}$, where $W$ denotes the weight tensor of the layer, $\frac{\partial L}{\partial W}$ denotes weight gradient 603, $\frac{\partial L}{\partial Y}$ denotes output gradient 601, and $X$ denotes input 501. $\frac{\partial L}{\partial W}$ can be the gradient of the loss function with respect to $W$. In some embodiments, $\frac{\partial L}{\partial W}$ can be a tensor with the same shape as $W$.

[0087] During backpropagation, the input to a layer can be the gradient of the loss with respect to the layer output. By performing the two MatMul operations described above, the input gradient and the weight gradient can be computed for each layer. The input gradient can be propagated backward through the layers of the DNN. An optimization process can be performed based on the weight gradient to update the weights in the DNN.
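
The two backward MatMul operations can be illustrated with a minimal sketch in plain Python. It assumes a single layer that computes Y = XW and a loss equal to the sum of the output elements, so that the output gradient is a tensor of ones; the helper names `matmul`, `transpose`, `forward`, and `backward` are hypothetical and are not part of the disclosure.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    """Swap rows and columns."""
    return [list(col) for col in zip(*m)]

def forward(x, w):
    """Forward pass of one layer: Y = X W."""
    return matmul(x, w)

def backward(x, w, grad_y):
    """Backward pass: two MatMul operations on the output gradient."""
    grad_x = matmul(grad_y, transpose(w))  # input gradient:  dL/dX = dL/dY . W^T
    grad_w = matmul(transpose(x), grad_y)  # weight gradient: dL/dW = X^T . dL/dY
    return grad_x, grad_w

x = [[1.0, 2.0], [3.0, 4.0]]                   # input: 2 samples, 2 features
w = [[0.5, -1.0, 0.0], [1.0, 0.0, 2.0]]        # weights: 2 x 3
y = forward(x, w)
grad_y = [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0]]    # assumed output gradient for L = sum(Y)
grad_x, grad_w = backward(x, w, grad_y)
```

Here `grad_x` has the shape of the input and `grad_w` has the shape of the weights, which is what allows the former to be passed to a preceding layer and the latter to be used in a weight update.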

[0088] Figure 7 is a flowchart of a method 700 for training a DNN according to various embodiments. Method 700 can be performed by the AI system 100 in Figure 1. Although method 700 is described with reference to the flowchart shown in Figure 7, many other methods for training DNNs can alternatively be used. For example, the order of execution of the steps in Figure 7 can be changed. As another example, some of the steps can be changed, eliminated, or combined.

[0089] AI system 100 provides (710) the input tensors and weight tensors of layers in a neural network to a neural processing unit to train the neural network through a training process. The training process includes forward operations and backward operations. In some embodiments, the forward operation includes forward propagation of data through layers in the neural network. In some embodiments, the backward operation includes backward propagation of data through layers in the neural network. In some embodiments, the input tensors or weight tensors have FP16 or BF16 values.

[0090] AI system 100 offloads (720) the forward operation to a MatMul kernel on the neural processing unit. The MatMul kernel is used to execute a layer by performing a first MatMul operation on the input tensor and the weight tensor, and to produce the output tensor of the layer. In some embodiments, the MatMul kernel is configured to perform MatMul operations on tensors with different dimensions.

[0091] AI system 100 offloads (730) the backward operation to the MatMul kernel. The MatMul kernel is used to compute the gradient of the loss by performing a second MatMul operation on the gradient of the output tensor and the input tensor. In some embodiments, during the forward operation, AI system 100 computes the loss by applying a loss function to the output tensor of the layer and one or more reference values. In some embodiments, during the backward operation, AI system 100 computes the gradient of the output tensor based on the loss, the output tensor of the layer, and the one or more reference values. In some embodiments, the gradient of the output tensor is computed using an automatic differentiation module. In some embodiments, the automatic differentiation module is offloaded to the neural processing unit.

[0092] In some embodiments, the input tensor is the output of a previous layer in the neural network. The AI system 100 propagates the input gradient of the loss from the layer to the previous layer.

[0093] AI system 100 trains (740) the layer by updating the weight tensor based on the gradient of the loss. In some embodiments, the gradient of the loss is the weight gradient of the loss. The MatMul kernel is also used to compute the input gradient of the loss in the backward operation by performing a third MatMul operation on the gradient of the output tensor and the weight tensor. The weight tensor is also updated based on the input gradient of the loss.
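
The steps of method 700 can be sketched as a small training loop. This is only an illustration under stated assumptions: a single linear layer, a mean squared error loss, a hand-derived output gradient standing in for the automatic differentiation module, and plain gradient descent with a hypothetical learning rate `lr`. Because the sketch has a single layer, the input gradient is not needed and is omitted.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    """Swap rows and columns."""
    return [list(col) for col in zip(*m)]

def train(x, ref, w, lr=0.1, steps=200):
    """Repeat forward operation, backward operation, and weight update."""
    losses = []
    n = len(x)
    for _ in range(steps):
        # Forward operation (720): first MatMul on input and weight tensors.
        y = matmul(x, w)
        diff = [[yi - ri for yi, ri in zip(yr, rr)] for yr, rr in zip(y, ref)]
        # Loss: mean squared error against the reference values (assumed loss).
        losses.append(sum(d * d for row in diff for d in row) / n)
        # Gradient of the output tensor (derived by hand for this loss).
        grad_y = [[2.0 * d / n for d in row] for row in diff]
        # Backward operation (730): second MatMul gives the weight gradient.
        grad_w = matmul(transpose(x), grad_y)
        # Training (740): update the weight tensor from the weight gradient.
        w = [[wi - lr * gi for wi, gi in zip(wr, gr)] for wr, gr in zip(w, grad_w)]
    return w, losses

x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # three training samples
ref = [[2.0], [-1.0], [1.0]]               # reference outputs
trained_w, losses = train(x, ref, [[0.0], [0.0]])
```

On this toy data the loss shrinks with each step and the weights approach 2 and -1, the values that reproduce the reference outputs.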

[0094] Figure 8 shows an example transformer model 800 according to various embodiments. Transformer model 800 can transform an input sequence into an output sequence. In some embodiments, transformer model 800 is a deep neural network (DNN) that can learn context and meaning by tracking relationships in sequential data (e.g., sequential words in a sentence, sequential audio signals, sequential images, etc.). For example, transformer model 800 can be part of a large language model (LLM). Transformer model 800 can be an example of the DNN described above. As shown in Figure 8, the transformer model 800 includes an encoder block 810, a decoder block 820, and a head block 830. In other embodiments, the transformer model 800 may include different or additional components. Furthermore, the functionality implemented by a component of the transformer model 800 may be implemented by other components included in the transformer model 800 or by other models or modules.

[0095] Encoder block 810 receives the input sequence and generates a matrix representation of the input sequence. In the embodiment of Figure 8, encoder block 810 receives input 801 and generates encoder output 802. Input 801 may be an input prompt. In some embodiments, input 801 may include one or more input tokens, such as words, phrases, sentences, images, audio signals, other types of input tokens, or combinations thereof. For example, input 801 may include a prompt received from a user of transformer model 800. The prompt may include a question or request from the user. Words in the prompt may be input tokens. Encoder output 802 may include one or more vectors that are contextual representations of input 801. Each vector in encoder output 802 may represent a token in input 801 together with its context.

[0096] Encoder block 810 includes an embedding layer 813, a positional encoding layer 815, and multiple layers 840 (collectively referred to as "layers 840"). In other embodiments, encoder block 810 may have different, fewer, or more components. Furthermore, the arrangement of components in encoder block 810 may differ from the arrangement shown in Figure 8. For illustration, the encoder block 810 in Figure 8 has N layers, where N is an integer. Each layer 840 may include one or more neural network operations. A layer 840 can transform an embedding sequence into a representation that encapsulates information learned from input 801. Different layers 840 may have different internal parameters, such as different weights, biases, or other types of internal parameters. In some embodiments, the layers 840 have the same components. Components in a layer 840 can themselves be layers, also referred to as sublayers of the layer 840. As shown in Figure 8, layer 840 includes four sublayers: a multi-head attention (MHA) layer 841, an add & normalization layer 842, a feedforward layer 843, and another add & normalization layer 844.

[0097] Decoder block 820 iteratively generates output 803 using the encoded representation generated by encoder block 810. Decoder block 820 includes an embedding layer 823, a positional encoding layer 825, and multiple layers 850 (collectively referred to as "layers 850"). For illustration, the decoder block 820 in Figure 8 has N layers, where N is an integer. In the embodiment of Figure 8, the number of layers 850 in decoder block 820 is the same as the number of layers 840 in encoder block 810. In other embodiments, the number of layers 850 in decoder block 820 may be different from the number of layers 840 in encoder block 810. Each layer 850 may include one or more neural network operations. Different layers 850 may have different internal parameters. In some embodiments, the layers 850 may have the same components. Components in a layer 850 may themselves be layers, also referred to as sublayers of the layer 850. As shown in Figure 8, layer 850 includes six sublayers: an MHA layer 851, an add & normalization layer 852, another MHA layer 853, another add & normalization layer 854, a feedforward layer 855, and another add & normalization layer 856.

[0098] In some embodiments, a series of inference stages is performed in decoder block 820 using the encoder output (e.g., encoder output 802). A prediction matrix can be generated in each inference stage. Output 803 may include multiple matrices. Each matrix can be further processed in head block 830 to predict a token. The multiple matrices can be used to predict a sequence of tokens. In a first inference stage, decoder block 820 may receive one or more start tokens as input tokens and compute a first matrix based on the input tokens and the output of encoder block 810. The first matrix can be used by head block 830 to predict a first token. In a second inference stage, the predicted token, in addition to the start token(s), can be used as new input tokens. Similarly, a second token can be predicted in the second inference stage and used in a third inference stage. This iteration can continue until all inference stages are completed.

[0099] Head block 830 receives the output of decoder block 820 and processes it in linear layer 833 and softmax layer 835. Linear operations can be performed on the output of decoder block 820 in linear layer 833. The linear operations may include multiplying the output of decoder block 820 by a weight matrix. The output of linear layer 833 may be a vector. In some embodiments, head block 830 may act as a classifier. The number of data elements in the vector computed in linear layer 833 may depend on the number of classes involved. For example, if there are M classes, where M is an integer, the vector computed in linear layer 833 may have M data elements, each representing a prediction for one of the M classes.

[0100] The output of linear layer 833 can be fed into softmax layer 835. A softmax function can be applied to the output of linear layer 833 to compute probability scores. The values of the probability scores can range from 0 to 1. In some embodiments, a probability score can be computed for each data element in the vector computed in linear layer 833. The highest probability score can be a key score. The index of the key score can point to the word that the transformer model 800 predicts as the next word in the sequence. The final output of the transformer model 800 can be a sequence of predicted words. In some embodiments, head block 830 can be a language modeling head.
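
As an illustration of the head block, the following sketch applies a linear operation to one decoder output vector, converts the result to probability scores with softmax, and selects the index of the highest score. The three-word vocabulary, the weight values, and the function names `softmax` and `head` are assumptions made for the example only.

```python
import math

def softmax(values):
    """Convert scores to probabilities; subtract the max for numerical stability."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def head(decoder_vec, weight, vocab):
    """Linear layer followed by softmax; returns the predicted word and all scores."""
    # Linear operation: multiply the decoder output by the weight matrix.
    logits = [sum(x * w for x, w in zip(decoder_vec, col)) for col in zip(*weight)]
    probs = softmax(logits)
    # The index of the highest probability score points to the predicted word.
    best = max(range(len(probs)), key=probs.__getitem__)
    return vocab[best], probs

vocab = ["cat", "dog", "bird"]                    # hypothetical M = 3 classes
weight = [[1.0, 0.0, -1.0], [0.0, 2.0, 0.0]]      # hypothetical 2 x 3 weight matrix
word, probs = head([0.5, 1.0], weight, vocab)
```

The vector produced by the linear operation has one element per class, and the softmax output sums to 1, so each score can be read as a probability.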

[0101] An embedding layer (e.g., embedding layer 813 or embedding layer 823) transforms the input of the embedding layer (e.g., input 801 or output 803) into one or more embeddings. An embedding can be a vector, also known as an embedding vector or vector embedding. A vector embedding can include a series of data elements. In some embodiments, embedding layer 813 can generate multiple embeddings, each derived from a different input token in input 801. The embeddings can capture the semantic meaning of the tokens in input 801. An embedding can be a numerical representation that captures the relationships or meaning of words, phrases, or other data types. For example, if input 801 is a prompt containing a sequence of words, embedding layer 813 can generate an embedding from each word in input 801. Embedding layer 823 in decoder block 820 can generate multiple embeddings from the tokens received by decoder block 820 in a manner similar to embedding layer 813.

[0102] A positional encoding layer (e.g., positional encoding layer 815 or positional encoding layer 825) performs positional encoding on the embeddings generated in the respective embedding layer. In some embodiments, the positional encoding layer may apply one or more positional encoding vectors (e.g., positional encoding vector 804 or positional encoding vector 805) to the vector embeddings from the respective embedding layer to generate new vector embeddings representing embeddings with positional context. The positional encoding vectors may encode information about the position of the embeddings in the embedding sequence. In some embodiments, the positional encoding layer performs an addition operation on the positional encoding vectors and the vector embeddings. The addition operation may be element-wise addition. The positional encoding layer may output an embedding matrix that includes the vector embeddings computed in the positional encoding layer.
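
The element-wise addition performed in a positional encoding layer can be sketched as follows. The text does not say how the positional encoding vectors are obtained; the sinusoidal scheme used here is an assumption chosen for illustration, and both function names are hypothetical.

```python
import math

def sinusoidal_encoding(num_positions, dim):
    """Build one positional encoding vector per position (assumed sinusoidal scheme)."""
    encoding = []
    for pos in range(num_positions):
        row = []
        for i in range(dim):
            angle = pos / (10000 ** (2 * (i // 2) / dim))
            # Even indices use sine, odd indices use cosine.
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        encoding.append(row)
    return encoding

def add_positional_encoding(embeddings, encoding):
    """Element-wise addition of positional encoding vectors and vector embeddings."""
    return [[e + p for e, p in zip(erow, prow)]
            for erow, prow in zip(embeddings, encoding)]

# Two identical embeddings at different positions.
embeddings = [[0.1, 0.2, 0.3, 0.4], [0.1, 0.2, 0.3, 0.4]]
encoded = add_positional_encoding(embeddings, sinusoidal_encoding(2, 4))
```

The two input embeddings are identical, yet the rows of `encoded` differ, which is how the output embedding matrix carries positional context.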

[0103] MHA layers (e.g., MHA layer 841, MHA layer 851, or MHA layer 853) can implement a multi-head attention mechanism, which can be a multi-head self-attention mechanism or a multi-head cross-attention mechanism. In some embodiments, MHA layer 841 or MHA layer 851 can implement a self-attention mechanism. For self-attention, the queries, keys, and values can come from the same place. For example, for MHA layer 841, the queries, keys, and values can all come from positional encoding layer 815. For MHA layer 851, the queries, keys, and values can all come from positional encoding layer 825. The self-attention mechanism enables the transformer model 800 to relate each token to the other tokens. The MHA layer can compute attention scores based on the embeddings generated in the respective positional encoding layer. In some embodiments, the MHA layer can receive one or more queries, one or more keys, and one or more values. In some embodiments, the MHA layer has multiple heads that receive different linearly projected versions of the queries, keys, and values and generate outputs in parallel, which are then used to generate the final result.

[0104] In some embodiments, the queries, keys, and values input to MHA layer 841 can be computed based on vector embeddings generated by positional encoding layer 815. The queries, keys, and values input to MHA layer 851 can be computed based on vector embeddings generated by positional encoding layer 825. A query, key, or value can be a vector representing a token in the sequence. In some embodiments, a query matrix $Q \in \mathbb{R}^{N \times h}$ is computed by multiplying an embedding matrix $X \in \mathbb{R}^{N \times d}$ (for example, the embedding matrix computed in the positional encoding layer) by a weight matrix $W_q \in \mathbb{R}^{d \times h}$, where d is the dimension of the vector embeddings, N is the number of vector embeddings in the embedding matrix, and h is the number of attention heads. Each row in the query matrix can be a query. A key matrix $K \in \mathbb{R}^{N \times h}$ is computed by multiplying the embedding matrix $X$ by a weight matrix $W_k \in \mathbb{R}^{d \times h}$. Each row in the key matrix can be a key. A value matrix $V \in \mathbb{R}^{N \times h}$ is computed by multiplying the embedding matrix $X$ by a weight matrix $W_v \in \mathbb{R}^{d \times h}$. Each row in the value matrix can be a value.

[0105] In some embodiments, the MHA layer 851 can implement masked multi-head self-attention. The MHA layer 851 can prevent a position from focusing on subsequent positions. For example, each word in the sequence can be unaffected by future words. This masking ensures that the prediction of a particular position can depend on the known output of the positions preceding it, but not on the unknown output of the positions following it.

[0106] In some embodiments, the MHA layer 853 can implement a cross-attention mechanism, such as encoder-decoder cross-attention. The MHA layer 853 can use the output of the previous layer (i.e., the output of the add & normalization layer 852) as the queries and the output of the encoder block 810 as the keys and values. Cross-attention can align the encoder input with the decoder input, enabling the decoder block 820 to identify and emphasize the most relevant parts of the encoder input.

[0107] In some embodiments, the MHA layer includes linear layers, a MatMul layer, a scaling layer, a Softmax layer, another MatMul layer, a concatenation layer, and another linear layer. These layers may be arranged in sequence. The MHA layer may receive three input matrices: a query matrix, a key matrix, and a value matrix, which are the inputs to three linear layers, respectively. The linear layers may include matrix multiplication (MatMul) operations. For example, the first linear layer may multiply the query matrix by a weight matrix to compute a first parameter matrix. The first parameter matrix may be represented as $Q_i = Q W_i^Q$, where $Q$ is the query matrix and $W_i^Q$ is the weight matrix. The second linear layer can multiply the key matrix by a weight matrix to compute a second parameter matrix. The second parameter matrix can be represented as $K_i = K W_i^K$, where $K$ is the key matrix and $W_i^K$ is the weight matrix. The third linear layer can multiply the value matrix by a weight matrix to compute a third parameter matrix. The third parameter matrix can be represented as $V_i = V W_i^V$, where $V$ is the value matrix and $W_i^V$ is the weight matrix. $i$ can denote the index of the head, $d_q$ is the dimension of a query vector, $d_k$ is the dimension of a key vector, and $d_v$ is the dimension of a value vector. In some embodiments, the linear layers can be within a linear block of the MHA layer. In some embodiments, the MHA layer can include multiple linear blocks. For example, the MHA layer includes h linear blocks. The linear blocks can have the same layers as each other. Each linear block can compute three parameter matrices from the query matrix, the key matrix, and the value matrix, respectively.

[0108] The MatMul layer, scaling layer, mask layer, Softmax layer, and other MatMul layer can be located within an attention block of the MHA layer. The attention block can implement a scaled dot-product attention mechanism. In some embodiments, the MHA layer includes multiple attention blocks, including the attention block described above. For illustration, the MHA layer includes h attention blocks. These attention blocks can have the same layers as each other. A linear block and an attention block can constitute a head of the MHA layer. When the MHA layer has h linear blocks and h attention blocks, the MHA layer has h heads. A head can be represented as $\mathrm{head}_i = \mathrm{Attention}(Q W_i^Q, K W_i^K, V W_i^V)$.

[0109] In the MatMul layer, matrix multiplication can be performed on the parameter matrices computed from the query matrix and the key matrix to compute a score matrix. In some embodiments, this score matrix can establish the degree of emphasis each token should receive relative to other tokens. The score matrix can include multiple scores. Each token can be assigned a score relative to other tokens within the same time step. A higher score can indicate higher attention or emphasis. The score matrix can be scaled in the scaling layer. In some embodiments, this is achieved by dividing the scores in the score matrix by the square root of the dimension of the query vectors and the key vectors (which can be represented as $\sqrt{d_k}$). The scaling layer thus scales down the score matrix. The output of the scaling layer can be a scaled matrix that includes the adjusted scores. The mask layer is optional in some implementations. The mask layer adds an attention mask (which can be an input to the attention block) to the output of the scaling layer to mask some elements in the scaling layer output. The positions of the masked elements can be defined by the attention mask. A softmax function can be applied to the scaled matrix in the Softmax layer to compute an attention weight matrix. The attention weight matrix includes attention weights. The attention weights can be probability values from 0 to 1. The softmax function can emphasize high scores while diminishing low scores, which can enhance the model's ability to determine which tokens should receive more attention.

[0110] In the other MatMul layer, a matrix multiplication operation is performed on the attention weight matrix computed in the Softmax layer and the parameter matrix computed from the value matrix in the corresponding linear layer. The result of this matrix multiplication operation is a single-head output matrix, which is the output of the attention block.
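
The attention block described above (MatMul, scaling, optional mask, softmax, MatMul) can be sketched in plain Python. The additive mask of zeros and negative infinity shown in `causal_mask` is one common way to keep a position from attending to later positions and is an assumption here, as are all function names.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    """Swap rows and columns."""
    return [list(col) for col in zip(*m)]

def softmax(values):
    """Row-wise softmax; exp(-inf) evaluates to 0, so masked entries get zero weight."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def attention(q, k, v, mask=None):
    """Scaled dot-product attention; returns (output matrix, attention weights)."""
    d_k = len(k[0])
    scores = matmul(q, transpose(k))                                 # MatMul layer
    scaled = [[s / math.sqrt(d_k) for s in row] for row in scores]   # scaling layer
    if mask is not None:                                             # optional mask layer
        scaled = [[s + m for s, m in zip(srow, mrow)]
                  for srow, mrow in zip(scaled, mask)]
    weights = [softmax(row) for row in scaled]                       # Softmax layer
    return matmul(weights, v), weights                               # other MatMul layer

def causal_mask(n):
    """Additive mask: 0 where attention is allowed, -inf for later positions."""
    return [[0.0 if j <= i else float("-inf") for j in range(n)] for i in range(n)]

tokens = [[1.0, 0.0], [0.0, 1.0]]
out, weights = attention(tokens, tokens, tokens, causal_mask(2))
```

With the mask applied, the first token can attend only to itself, so its attention weights are 1 and 0, while the second token distributes its weights over both positions.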

[0111] When an MHA layer has h attention blocks, there can be h single-head output matrices. These single-head output matrices are concatenated in the concatenation layer to form a concatenated matrix. A linear operation (also known as a "linear transformation") is performed on the concatenated matrix using a weight matrix in the corresponding linear layer. In some embodiments, MHA can be represented as $\mathrm{MHA}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h) W^O$, where $\mathrm{Concat}$ denotes concatenation and $W^O$ is the weight matrix in the corresponding linear layer.
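
A compact sketch of the full MHA computation follows: each head projects the query, key, and value matrices with its own weights, runs scaled dot-product attention, and the single-head outputs are concatenated and multiplied by an output weight matrix. The two-head configuration, the projection weights that simply select halves of the embedding, and the identity output weights are illustrative assumptions, and masking is left out to keep the example short.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(values):
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def attention(q, k, v):
    """Scaled dot-product attention without masking."""
    d_k = len(k[0])
    scores = matmul(q, [list(col) for col in zip(*k)])
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v)

def multi_head_attention(q, k, v, heads, w_o):
    """heads: one (W_q, W_k, W_v) tuple per head; w_o: output weight matrix."""
    outputs = [attention(matmul(q, wq), matmul(k, wk), matmul(v, wv))
               for wq, wk, wv in heads]
    # Concatenate the single-head outputs row by row.
    concat = [[x for out in outputs for x in out[r]] for r in range(len(q))]
    # Final linear transformation of the concatenated matrix.
    return matmul(concat, w_o)

x = [[1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 4.0]]   # N = 2 tokens, d = 4
first_half = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0], [0.0, 0.0]]
second_half = [[0.0, 0.0], [0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
heads = [(first_half,) * 3, (second_half,) * 3]    # h = 2 heads
identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
mha_out = multi_head_attention(x, x, x, heads, identity)
```

Because each head here works on two of the four embedding dimensions, the concatenated matrix has the same width as the input, which is the usual reason for splitting the embedding dimension across heads.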

[0112] In transformer model 800, the add & normalization layers (e.g., add & normalization layers 842, 844, 852, 854, and 856) have an addition operation followed by a layer normalization operation. The addition operation can be the addition of the output and the input of the previous layer. The previous layer is the layer that directly precedes the add & normalization layer. For example, the layer preceding add & normalization layer 842 is MHA layer 841. As another example, the layer preceding add & normalization layer 854 is MHA layer 853.

[0113] Then, the layer normalization operation is applied to the result of the addition operation, which can be represented as $\mathrm{LayerNorm}(x + \mathrm{Sublayer}(x))$, where $\mathrm{LayerNorm}$ denotes layer normalization, $x$ is the input of the previous layer, and $\mathrm{Sublayer}(x)$ denotes the output of the previous layer. In some embodiments, the layer normalization operation may include a series of calculations. For example, the layer normalization operation may include a mean calculation, which can be represented as $\mu_{xy} = \frac{1}{Z} \sum_{z=1}^{Z} A_{xyz}$, where $A_{xyz}$ denotes a data element in the input tensor, x can be the index of the data element in one spatial dimension, y can be the index of the data element in another spatial dimension, z can be the index of the data element in the channel dimension, and Z can be the number of channels. The output of the mean calculation can be a 2D matrix. The mean calculation can be a channel-wise reduction operation. The layer normalization operation can convert $\mu_{xy}$ to a 3D tensor $\mu_{xyz}$, for example, by replicating each data element across Z output points.

[0114] The layer normalization operation can also include an element-wise subtraction, which can be represented as $D_{xyz} = A_{xyz} - \mu_{xyz}$. The layer normalization operation can also include a variance calculation, represented as $\sigma^2_{xy} = \frac{1}{Z} \sum_{z=1}^{Z} D_{xyz}^2$, and a division calculation, represented as $M_{xy} = \frac{1}{\sqrt{\sigma^2_{xy} + \epsilon}}$, where $\epsilon$ is a small constant. $M_{xy}$ can be a 2D tensor. The layer normalization operation also converts $M_{xy}$ to a 3D tensor $M_{xyz}$, for example, by replicating each data element across Z output points. Furthermore, the layer normalization operation can have an element-wise multiplication, represented as $A'_{xyz} = D_{xyz} \times M_{xyz}$. The layer normalization operation can include further computations, the result of which can be the output of the layer normalization operation.
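
The sequence of layer normalization calculations (mean, element-wise subtraction, variance, division, element-wise multiplication) can be sketched as follows for a tensor indexed as `a[x][y][z]`, with z as the channel dimension. The small constant `eps` and the function name `layer_norm` are assumptions, and any further computations applied after the multiplication are left out.

```python
import math

def layer_norm(a, eps=1e-5):
    """Normalize each (x, y) position of a 3D tensor over its channel dimension z."""
    out = []
    for plane in a:
        out_plane = []
        for channels in plane:
            z = len(channels)
            mean = sum(channels) / z                     # mean calculation (reduction over z)
            diff = [c - mean for c in channels]          # element-wise subtraction
            var = sum(d * d for d in diff) / z           # variance calculation
            inv_std = 1.0 / math.sqrt(var + eps)         # division calculation
            out_plane.append([d * inv_std for d in diff])  # element-wise multiplication
        out.append(out_plane)
    return out

# One spatial row with two positions and three channels each.
a = [[[1.0, 2.0, 3.0], [4.0, 4.0, 4.0]]]
normalized = layer_norm(a)
```

Reusing `mean` and `inv_std` for every channel at a position plays the role of replicating the 2D results across the channel dimension. A position whose channels are all equal normalizes to zeros, which is where `eps` prevents a division by zero.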

[0115] Feedforward layers (e.g., feedforward layers 843 and 855) can be position-wise fully connected feedforward networks. For example, a feedforward layer can include two linear layers with an activation function in between. An example activation function is the rectified linear unit (ReLU).
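
A position-wise feedforward layer of this kind can be sketched as two linear layers with ReLU in between. The weight shapes, bias terms, and values below are made up for illustration, and the function names are hypothetical.

```python
def linear(x, w, b):
    """Apply x @ w + b to every row (position) of x independently."""
    return [[sum(xi * wi for xi, wi in zip(row, col)) + bj
             for col, bj in zip(zip(*w), b)]
            for row in x]

def relu(x):
    """Rectified linear unit: keep positive values, replace the rest with 0."""
    return [[max(0.0, v) for v in row] for row in x]

def feed_forward(x, w1, b1, w2, b2):
    """Two linear layers with a ReLU activation in between."""
    return linear(relu(linear(x, w1, b1)), w2, b2)

x = [[1.0, -2.0]]                                   # one position, 2 features
w1 = [[1.0, 0.0, -1.0], [0.0, 1.0, 1.0]]            # 2 x 3 (hypothetical)
b1 = [0.0, 0.0, 0.0]
w2 = [[2.0, 0.0], [0.0, 3.0], [1.0, 1.0]]           # 3 x 2 (hypothetical)
b2 = [0.5, 0.5]
ffn_out = feed_forward(x, w1, b1, w2, b2)
```

In this example the hidden vector is [1, -2, -3] before the activation and [1, 0, 0] after it, so only the first hidden unit contributes to the output.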

[0116] Figure 9 shows an example CNN 900 according to various embodiments. The CNN 900 can be trained or deployed by the AI system 100 in Figure 1. The CNN 900 can be an example of the DNN described above. For illustration, the CNN 900 includes a sequence of layers comprising multiple convolutional layers 910 (collectively referred to as "convolutional layer 910"), multiple pooling layers 920 (collectively referred to as "pooling layer 920"), and multiple fully connected layers 930 (collectively referred to as "fully connected layer 930"). In other embodiments, the CNN 900 may include fewer, more, or different layers. During execution of the CNN 900, the layers perform tensor computations that include many tensor operations, such as convolution, interpolation, pooling operations, element-wise operations (e.g., element-wise addition, element-wise multiplication, etc.), other types of tensor operations, or some combination of these operations.

[0117] Convolutional layer 910 summarizes the presence of features in the input of CNN 900. Convolutional layer 910 acts as a feature extractor. The first layer of CNN 900 is a convolutional layer 910. In the example, convolutional layer 910 performs a convolution operation on the input tensor 940 (also known as input feature map (IFM) 940) and filter 950. As shown in Figure 9, the IFM 940 is represented by a 7×7×3 three-dimensional (3D) matrix. The IFM 940 includes three input channels, each represented by a 7×7 two-dimensional (2D) matrix. Each row of the 7×7 2D matrix contains 7 input elements (also called input points), and each column contains 7 input elements. The filter 950 is represented by a 3×3×3 3D matrix. The filter 950 includes three kernels, each corresponding to a different input channel of the IFM 940. A kernel is a 2D matrix of weights, where the weights are arranged in columns and rows. The kernel can be smaller than the IFM. In the embodiment of Figure 9, each kernel is represented by a 3×3 2D matrix. Each row of the 3×3 kernel contains 3 weights, and each column also contains 3 weights. The weights can be initialized and then updated using gradient descent via backpropagation. The magnitude of the weights can represent the importance of filter 950 in extracting features from IFM 940.

[0118] The convolution involves multiply-accumulate (MAC) operations on the input elements in IFM 940 and the weights in filter 950. The convolution can be either a standard convolution 963 or a depthwise convolution 983. In a standard convolution 963, the entire filter 950 slides over the IFM 940. All input channels are combined to produce an output tensor 960 (also called an output feature map (OFM) 960). OFM 960 is represented by a 5×5 2D matrix. Each row of the 5×5 2D matrix contains 5 output elements (also called output points), and each column also contains 5 output elements. For illustration, in the embodiment of Figure 9, the standard convolution includes one filter. In embodiments with multiple filters, the standard convolution can produce multiple output channels (OCs) in the OFM 960.

[0119] The multiplication applied between a kernel-sized patch of IFM 940 and the kernel can be a dot product. The dot product is an element-wise multiplication between the kernel-sized patch of IFM 940 and the corresponding kernel, followed by a summation, which always produces a single value. Because it produces a single value, this operation is often called a "scalar product." Using a kernel smaller than IFM 940 is intentional because it allows the same kernel (a set of weights) to be multiplied by IFM 940 multiple times at different points on IFM 940. Specifically, the kernel is applied systematically, from left to right and top to bottom, to each overlapping kernel-sized patch of IFM 940. Multiplying the kernel by IFM 940 once results in a single value. Since the kernel is applied to IFM 940 multiple times, the results of the multiplications form the output elements of a 2D matrix. Thus, the 2D output matrix from standard convolution 963 (i.e., OFM 960) is called an OFM.
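
The sliding dot product of a standard convolution can be sketched for the sizes used in this example: a 7×7×3 IFM and a 3×3×3 filter producing a 5×5 OFM. The sketch assumes a stride of 1 and no padding, which is consistent with those sizes; the channel-first layout `ifm[channel][row][col]`, the all-ones test data, and the function name are assumptions.

```python
def standard_conv(ifm, filt):
    """Standard convolution with stride 1 and no padding.

    ifm[channel][row][col] and filt[channel][row][col]; all input channels
    are combined into a single 2D output feature map.
    """
    k = len(filt[0])                      # kernel height and width
    out_size = len(ifm[0]) - k + 1        # 7 - 3 + 1 = 5 for the example sizes
    ofm = []
    for r in range(out_size):             # top to bottom
        row = []
        for c in range(out_size):         # left to right
            acc = 0.0
            # Multiply-accumulate over every channel of the kernel-sized patch.
            for ch in range(len(ifm)):
                for i in range(k):
                    for j in range(k):
                        acc += ifm[ch][r + i][c + j] * filt[ch][i][j]
            row.append(acc)
        ofm.append(row)
    return ofm

ifm = [[[1.0] * 7 for _ in range(7)] for _ in range(3)]    # 7 x 7 x 3 of ones
filt = [[[1.0] * 3 for _ in range(3)] for _ in range(3)]   # 3 x 3 x 3 of ones
ofm = standard_conv(ifm, filt)
```

With all-ones data, every output element is the sum of 27 products, one for each weight in the filter.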

[0120] In depthwise convolution 983, the input channels are not combined. Instead, MAC operations are performed on an individual input channel and an individual kernel, producing an OC. As shown in Figure 9, depthwise convolution 983 produces a depth output tensor 980. The depth output tensor 980 is represented by a 5×5×3 3D matrix. The depth output tensor 980 includes three OCs, each represented by a 5×5 2D matrix. Each row of the 5×5 2D matrix contains 5 output elements, and each column also contains 5 output elements. Each OC is the result of MAC operations performed on an input channel of the IFM 940 and a kernel of the filter 950. For example, the first OC (dot pattern) is the result of MAC operations on the first input channel (dot pattern) and the first kernel (dot pattern); the second OC (horizontal stripe pattern) is the result of MAC operations on the second input channel (horizontal stripe pattern) and the second kernel (horizontal stripe pattern); and the third OC (diagonal stripe pattern) is the result of MAC operations on the third input channel (diagonal stripe pattern) and the third kernel (diagonal stripe pattern). In such a depthwise convolution, the number of input channels equals the number of OCs, and each OC corresponds to a different input channel. The input channels and output channels are collectively referred to as depth channels. After the depthwise convolution, pointwise convolution 993 is performed on the depth output tensor 980 and the 1×1×3 tensor 990 to produce OFM 960.
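
The depthwise convolution followed by a pointwise convolution can be sketched as two small functions. As before, a stride of 1, no padding, and a channel-first layout are assumed, and the constant-valued test data and the pointwise weights of 1 are chosen only to make the result easy to check by hand.

```python
def depthwise_conv(ifm, filt):
    """Convolve each input channel with its own kernel; channels are not combined."""
    k = len(filt[0])
    out = len(ifm[0]) - k + 1
    return [[[sum(ifm[ch][r + i][c + j] * filt[ch][i][j]
                  for i in range(k) for j in range(k))
              for c in range(out)]
             for r in range(out)]
            for ch in range(len(ifm))]

def pointwise_conv(tensor, weights):
    """1x1 convolution: weighted sum across channels at every output point."""
    size = len(tensor[0])
    return [[sum(w * tensor[ch][r][c] for ch, w in enumerate(weights))
             for c in range(size)]
            for r in range(size)]

# 7 x 7 x 3 IFM whose channels hold the constants 1, 2, and 3.
ifm = [[[float(ch + 1)] * 7 for _ in range(7)] for ch in range(3)]
filt = [[[1.0] * 3 for _ in range(3)] for _ in range(3)]   # three 3 x 3 kernels
depth_out = depthwise_conv(ifm, filt)                      # 5 x 5 x 3
ofm = pointwise_conv(depth_out, [1.0, 1.0, 1.0])           # 5 x 5
```

The depth output keeps one 5×5 channel per input channel (values 9, 18, and 27 here), and the pointwise step combines them into a single 5×5 map.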

[0121] OFM 960 is then passed to the next layer in the sequence. In some embodiments, OFM 960 is passed through an activation function. An example activation function is the rectified linear unit (ReLU). ReLU is a computation that directly returns the value provided as input, or returns 0 if the input is 0 or less. Convolutional layer 910 can receive several images as input and compute the convolution of each of them with each kernel. This process can be repeated several times. For example, OFM 960 is passed to a subsequent convolutional layer 910 (i.e., the convolutional layer 910 that follows, in the sequence, the one that produced OFM 960). The subsequent convolutional layer 910 performs a convolution on OFM 960 with a new kernel and generates a new feature map. The new feature map can also be normalized and resized. The new feature map can be convolved again by further subsequent convolutional layers 910, and so on.

[0122] In some embodiments, the convolutional layer 910 has four hyperparameters: the number of kernels, the kernel size (e.g., the kernel size is F×F×D pixels), the stride S with which the window corresponding to the kernel is moved across the image (e.g., a stride of 1 means moving the window one pixel at a time), and the zero padding P (e.g., adding a black outline P pixels thick to the input image of the convolutional layer 910). The convolutional layer 910 can perform various types of convolutions, such as 2D convolution, dilated (atrous) convolution, spatially separable convolution, depthwise separable convolution, transposed convolution, etc. CNN 900 includes 96 convolutional layers 910. In other embodiments, CNN 900 may include other numbers of convolutional layers.

[0123] Pooling layer 920 downsamples the feature maps generated by the convolutional layers, for example, by summarizing the presence of features in local patches of the feature maps. Pooling layer 920 is positioned between two convolutional layers 910: a preceding convolutional layer 910 (the convolutional layer 910 before pooling layer 920 in the layer sequence) and a subsequent convolutional layer 910 (the convolutional layer 910 after pooling layer 920 in the layer sequence). In some embodiments, pooling layer 920 is added after convolutional layer 910, for example, after an activation function (e.g., ReLU, etc.) has been applied to OFM 960.

[0124] Pooling layer 920 receives feature maps generated by the preceding convolutional layer 910 and applies pooling operations to these feature maps. Pooling operations reduce the size of the feature maps while preserving their important characteristics. Therefore, pooling operations improve the efficiency of the DNN and help avoid overfitting. Pooling layer 920 can perform pooling operations using average pooling (calculating the average value of each local patch of the feature map), max pooling (calculating the maximum value of each local patch of the feature map), or a combination of both. The size of the pooling operation is smaller than the size of the feature map. In various embodiments, the pooling operation is 2×2 pixels applied with a stride of 2 pixels, thus reducing the size of the feature map by a factor of 2 in each dimension, i.e., reducing the number of pixels or values in the feature map to one-quarter of its original size. In one example, pooling layer 920 applied to a 6×6 feature map produces a 3×3 output pooled feature map. The output of pooling layer 920 is fed into the subsequent convolutional layer 910 for further feature extraction. In some embodiments, pooling layer 920 operates on each feature map individually to create a new set of the same number of pooled feature maps.
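
The 6×6 to 3×3 example above can be reproduced with a short sketch of max and average pooling. The function name `pool2d` and the sample feature map are illustrative assumptions; the 2×2 window and stride of 2 follow the text.

```python
def pool2d(fmap, size=2, stride=2, mode="max"):
    """Pool a 2D feature map with a size x size window.

    mode="max" takes the maximum of each window; any other value
    takes the average of each window.
    """
    reduce = max if mode == "max" else (lambda vals: sum(vals) / len(vals))
    rows = range(0, len(fmap) - size + 1, stride)
    cols = range(0, len(fmap[0]) - size + 1, stride)
    return [[reduce([fmap[r + i][c + j]
                     for i in range(size) for j in range(size)])
             for c in cols]
            for r in rows]


# Made-up 6x6 feature map whose element at (r, c) is 6*r + c.
fmap = [[r * 6 + c for c in range(6)] for r in range(6)]
pooled_max = pool2d(fmap, mode="max")
pooled_avg = pool2d(fmap, mode="avg")
```

Both results are 3×3, so the 36 input values are reduced to 9, one-quarter of the original count.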

[0125] Fully connected layer 930 is the last layer of the DNN. Fully connected layer 930 may or may not be convolutional. Fully connected layer 930 receives input operands. The input operands define the outputs of convolutional layer 910 and pooling layer 920 and include the values of the final feature map generated by the last pooling layer 920 in the sequence. Fully connected layer 930 applies a linear combination and an activation function to the input operands and generates a vector. This vector can include as many elements as there are classes: element i represents the probability that the image belongs to class i. Therefore, each element is between 0 and 1, and the sum of all elements is 1. These probabilities are calculated by the last fully connected layer 930 using either a logistic function (for binary classification) or a SoftMax function (for multi-class classification) as the activation function. In some embodiments, fully connected layer 930 multiplies each input element by a weight, sums the results, and then applies the activation function (e.g., logistic if N=2, SoftMax if N>2, where N is the number of classes). This is equivalent to multiplying the input operands by a matrix including the weights.
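
A minimal sketch of the multi-class case described above: each output is a weighted sum of the inputs, and SoftMax turns the sums into probabilities that add up to 1. The names `softmax` and `fully_connected` and all numeric values are assumptions for illustration. Subtracting the maximum before exponentiating is a common numerical-stability step that the text does not mention.

```python
import math


def softmax(logits):
    """Convert a list of scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fully_connected(inputs, weights):
    """Multiply the input vector by a weight matrix, then apply SoftMax.

    weights holds one row per class; each row has one weight per input.
    """
    logits = [sum(w * x for w, x in zip(row, inputs)) for row in weights]
    return softmax(logits)


# Made-up example: 2 inputs, 3 classes.
probs = fully_connected([1.0, 2.0], [[1, 0], [0, 1], [1, 1]])
predicted_class = probs.index(max(probs))
```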

[0126] Figure 10 is a block diagram of an NPU 1000 according to various embodiments. The NPU 1000 can execute a DNN. For example, the NPU 1000 can execute layers in a DNN by executing neural network operations within those layers. These layers can be arranged sequentially, and the NPU 1000 can execute these layers sequentially. Execution of the DNN can be used to train the DNN or to perform AI tasks using the DNN. The NPU 1000 can also perform computations during backpropagation when training the DNN. The NPU 1000 can be an example of the NPU 120B in Figure 1. As shown in Figure 10, the NPU 1000 includes a memory 1010, a DMA engine 1020, and compute blocks 1030 (referred to individually as "compute block 1030"). In other embodiments, the NPU 1000 may include alternative configurations or different or additional components. For example, the NPU 1000 may include multiple memories 1010 or DMA engines 1020. Alternatively, the NPU 1000 may include a single compute block 1030. Furthermore, the functionality implemented by the components of the NPU 1000 may be performed by other components within the NPU 1000 or by other systems. The components of the NPU 1000 may be implemented through hardware, software, firmware, or a combination thereof.

[0127] Memory 1010 stores data related to neural network operations performed by NPU 1000. In some embodiments, memory 1010 may store data to be used by computation block 1030 to perform neural network operations. Memory 1010 may store the inputs and outputs of the DNN. Memory 1010 may also store activation values (e.g., input and output activation values for neural network operations) and weights (e.g., weights determined by training the DNN) in the DNN. In some embodiments, memory 1010 may store activation values and weights with floating-point precision, such as FP4, SF4, NF4, FP16, BF16, FP32, etc. Memory 1010 may also store quantized activation values or weights. Memory 1010 includes one or more dynamic random access memories (DRAMs).

[0128] DMA engine 1020 facilitates data transfer between memory 1010 and computation block 1030. For example, DMA engine 1020 can read data from memory 1010 and write the data to the local memory of computation block 1030. Similarly, DMA engine 1020 can read data from the local memory of computation block 1030 and write the data to memory 1010. For instance, DMA engine 1020 can read the input activation values and weights of a convolution from memory 1010 and load the input activation values and weights into one or more computation blocks 1030. DMA engine 1020 can also write the convolution output activation values computed by one or more computation blocks 1030 to memory 1010. DMA engine 1020 provides DMA functionality, allowing data transfer between memory 1010 and the local memory of computation block 1030 to be initiated within computation block 1030, and enabling other operations to be performed concurrently with the data transfer. In some embodiments, the DMA engine 1020 may read tensors from the memory 1010 and modify the tensors in a manner optimized for the computation block 1030 before writing them into the local memory of the computation block 1030.

[0129] Computation block 1030 executes neural network operations in a DNN. For example, computation block 1030 can execute a DNN layer by running one or more deep learning operations within the DNN layer. Computation block 1030 can execute one layer or a portion of a layer at a time. In some embodiments, the operations of a DNN layer can be run in parallel by multiple computation blocks 1030. For example, multiple computation blocks 1030 can each execute a portion of the workload of the neural network operations. Data can be shared among computation blocks 1030. Computation block 1030 can also be referred to as a computation tile. Computation block 1030 can run various types of neural network operations, such as convolution, matrix multiplication, softmax operations, pooling, element-wise operations, linear operations, non-linear operations, etc. The neural network operations executed by computation block 1030 include tensor operations, i.e., operations with tensors as input or tensors as output. For example, computation block 1030 receives an input tensor and one or more convolution kernels, and performs convolution on the input tensor and the convolution kernels. The result of convolution can be an output tensor, which can be further computed, for example by computation block 1030 or another computation block 1030.
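
One way to picture several computation blocks each executing a portion of a layer's workload is to split a matrix multiplication by rows. The text does not say how the workload is partitioned, so the row-wise split, the function names, and the sequential loop standing in for parallel blocks are all assumptions of this sketch.

```python
def matmul(a, b):
    """Plain matrix multiplication of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def split_rows(a, num_blocks):
    """Split the rows of a into at most num_blocks contiguous chunks."""
    step = max(1, -(-len(a) // num_blocks))  # ceiling division
    return [a[i:i + step] for i in range(0, len(a), step)]


def tiled_matmul(a, b, num_blocks):
    """Each chunk stands for the share of one (hypothetical) block.

    The chunks are processed one after another here; on real hardware
    they could run in parallel.
    """
    out = []
    for chunk in split_rows(a, num_blocks):
        out.extend(matmul(chunk, b))
    return out


a = [[1, 2], [3, 4], [5, 6], [7, 8]]
b = [[1, 0, 2], [0, 1, 3]]
result = tiled_matmul(a, b, num_blocks=2)
```

Because each output row depends on only one input row, the chunks are independent and the concatenated result equals the unsplit product.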

[0130] In the embodiment of Figure 10, each compute block 1030 includes a local memory 1040, a digital signal processor (DSP) 1050, and a data processing unit (DPU) 1055. The DPU 1055 includes an input delivery unit (IDU) 1060, a processing engine 1070, a post-processing engine 1080, and an output delivery unit (ODU) 1090. Some or all of the components of compute block 1030 may be implemented on the same chip. In other embodiments, compute block 1030 may include alternative configurations or different or additional components. Furthermore, the functionality implemented by the components of compute block 1030 may be performed by other components included in compute block 1030, another compute block 1030, other components of NPU 1000, or another system. The components of compute block 1030 may be implemented by hardware, software, firmware, or a combination thereof.

[0131] Local memory 1040 is local to the corresponding compute block 1030. Both the DSP 1050 and DPU 1055 can access local memory 1040. In the embodiment of Figure 10, local memory 1040 is located inside compute block 1030. In other embodiments, local memory 1040 may be located outside compute block 1030. Data in local memory 1040 may be transferred to or from memory 1010, for example, via DMA engine 1020. In some embodiments, data in local memory 1040 may be transferred to or from the local memory of another compute block 1030. Local memory 1040 may store data received, used, or generated by IDU 1060, processing engine 1070, post-processing engine 1080, or ODU 1090. Examples of data may include input activation values, weights, output activation values, configuration parameters, etc.

[0132] In some embodiments, local memory 1040 includes one or more static random access memories (SRAMs). Local memory 1040 may be byte-addressable, with each memory address identifying one byte (8 bits) of storage space. In some embodiments, local memory 1040 may include memory banks. The number of memory banks in local memory 1040 may be 16, 64, 128, 256, 512, 1024, 2048, or other values. A memory bank may include multiple memory cells. For example, a memory bank may include 8, 16, 64, or other numbers of memory cells. A memory bank or a memory cell within a bank may have a memory address. For example, a memory cell may store one byte, and data larger than one byte may be stored in memory cells with contiguous memory addresses, i.e., adjacent memory cells. For example, a memory cell may store an integer in INT8 format, while a number in FP16 or BF16 format (with 16 bits) may require two memory cells. In some embodiments, 16-bit data may be transferred from local memory 1040 in a single read cycle. In other embodiments, 16-bit data can be transferred from local memory 1040 over multiple read cycles (e.g., two cycles).
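
The byte-addressable layout described above can be illustrated with Python's standard `struct` module, which supports 8-bit integers (`b`) and IEEE 754 half-precision floats (`e`). The 16-byte `memory` buffer, the helper names and the little-endian byte order are assumptions of this sketch. BF16 is not shown because `struct` has no format code for it.

```python
import struct

memory = bytearray(16)  # hypothetical bank of 16 one-byte memory cells


def store_int8(addr, value):
    """Store a signed 8-bit integer; returns the number of cells used."""
    struct.pack_into("<b", memory, addr, value)
    return 1


def store_fp16(addr, value):
    """Store an FP16 value in two adjacent cells (little-endian assumed)."""
    struct.pack_into("<e", memory, addr, value)
    return 2


def load_fp16(addr):
    """Read an FP16 value back from two adjacent cells."""
    return struct.unpack_from("<e", memory, addr)[0]


cells_int8 = store_int8(0, -5)   # occupies address 0
cells_fp16 = store_fp16(2, 1.5)  # occupies addresses 2 and 3
restored = load_fp16(2)
```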

[0133] The DSP 1050 performs computations in DNN layers, including computations in neural network operations based on grouped quantization. In some embodiments, the DSP 1050 can perform general-purpose computations, such as addition, subtraction, multiplication, division, logical operations, bitwise operations, and other nonlinear computations (implemented via table lookup or polynomial approximation). The DSP 1050 can be a Very Long Instruction Word (VLIW) processor. In some embodiments, the DSP 1050 can have an architecture optimized for the operational requirements of digital signal processing. In some embodiments, the DSP 1050 can perform some computations in a neural network operation, while other computations in that neural network operation can be performed by the DPU 1055. The DSP 1050 can support non-traditional operations or non-MatMul or non-convolution-based operations in the DNN.

[0134] In some embodiments, the DSP 1050 can operate according to a clock signal. For example, the timing of instructions executed by the DSP 1050 can be synchronized with a clock signal. In some embodiments, the DSP 1050 can be pipelined with a DMA engine 1020 or a DPU 1055 to enable parallel computing and improve overall performance. The DSP 1050 can be implemented on a microprocessor chip, which can be separate from the chip implementing the DPU 1055. In some embodiments, the DSP 1050 can be a streaming hybrid architecture vector engine (SHAVE) processor. Although Figure 10 shows a single DSP, compute block 1030 can include multiple DSPs. These DSPs can be arranged in an array.

[0135] IDU 1060 loads data from local memory 1040 into processing engine 1070 or post-processing engine 1080. IDU 1060 can read tensors from local memory 1040. These tensors may include activation tensors, weight tensors, etc. IDU 1060 can perform grouped loading of activation values or weights. In some embodiments, IDU 1060 can read data from local memory 1040 and write the data into storage units of processing engine 1070. For example, IDU 1060 can load activation values into the activation register file of processing engine 1070 and load weights into the weight register file of processing engine 1070. IDU 1060 may have an activation reader for loading activation values and a weight reader for loading weights. In some embodiments, IDU 1060 can read configuration parameters from local memory 1040 and load these configuration parameters into configuration registers or other configurable components (e.g., LUTs) of processing engine 1070 or post-processing engine 1080.

[0136] Processing engine 1070 performs operations in the DNN. Processing engine 1070 may include one or more processing units. In some embodiments, processing units may be arranged in processing engine 1070 in one or more rows and one or more columns. Each processing unit may include processing elements (PEs), which may be arranged in an array including rows and columns. All PEs in processing engine 1070 may form a larger array including more rows and columns. Example PEs may be or may include one or more multiply-accumulate (MAC) units that perform MAC operations. In some embodiments (e.g., embodiments where computation block 1030 performs convolutional layers), computation in a MAC unit may be performing MAC operations on activation operands and weight operands. Activation operands may be activation tensors, which may include one or more activation values in the input tensor of the convolution. Different activation values may be in different input channels. Weight operands may be weight tensors, which may include one or more weights in the filters of the convolution. The values of the weights are determined by training the DNN or by compressing the neural network operations after training. The weights in the weight operands may be in different input channels. In some embodiments, activation operands or weight operands are vectors along the input channel dimension.

[0137] In some embodiments, a MAC unit includes one or more multipliers for performing multiplication. A MAC unit may also include one or more accumulators (“adders”) for performing accumulation. A MAC unit may also include one or more shifters to facilitate mixed-precision computation. A column of MAC units is referred to as a MAC column. A MAC column may be associated with one or more MAC lanes. A MAC lane is a path used to load data into a MAC column, for example, by an IDU 1060. A MAC lane may also be referred to as a data transfer lane or a data loading lane. A MAC column may have multiple MAC lanes. The loading bandwidth of a MAC column is the sum of the loading bandwidths of all MAC lanes associated with that MAC column. Using a certain number of MAC lanes, data can be fed into the same number of independent MAC units simultaneously. In some embodiments, a MAC column has four MAC lanes to feed activation values or weights into the MAC column, each MAC lane may have 16 bytes of bandwidth, and the four MAC lanes may have a total loading bandwidth of 64 bytes.

[0138] In some embodiments, the processing unit may have a sparse logic unit for accelerating computation in the DNN based on data sparsity. For example, the sparse logic unit may acquire or generate a sparse bitmap, use the sparse bitmap to identify non-zero values in the activation register file or weight register file, and send the non-zero values to the PE to perform computation, while skipping zero values in the activation register file or weight register file.

[0139] Post-processing engine 1080 processes the output of processing engine 1070. Post-processing engine 1080 may include one or more post-processing elements (PPEs). In some embodiments, the PPEs in post-processing engine 1080 may be arranged in an array having rows and columns. In some embodiments, post-processing engine 1080 computes an activation function. Post-processing engine 1080 may receive the output of processing engine 1070 as input to the activation function. In addition to or instead of the activation function, post-processing engine 1080 may perform other types of post-processing on the output of processing engine 1070. For example, post-processing engine 1080 may apply a bias to the output of processing engine 1070. In some embodiments, post-processing engine 1080 may be bypassed for certain neural network operations.

[0140] ODU 1090 ejects data from processing engine 1070 or post-processing engine 1080 (e.g., from register files within processing engine 1070 or post-processing engine 1080). ODU 1090 can write the ejected data to local memory 1040. The ejected data can be tensors, such as the output tensors of neural network operations. In some embodiments, ODU 1090 can eject data at the cell level. For each processing unit, ODU 1090 can eject the output of a PE in the processing unit based on the row or column index of each PE. For example, ODU 1090 can use a periodic sequence to eject data from a processing unit. ODU 1090 can eject the output of some PEs in each cycle. The periodic sequence can be configured based on configuration parameters indicating the operating mode of IDU 1060.

[0141] In some embodiments, ODU 1090 includes sparse encoding logic that can convert the output of processing engine 1070 from a dense format to a sparse format. For example, ODU 1090 can be implemented using one or more sparse encoders. A sparse encoder converts dense data into compressed data based on the sparsity in the dense data. For example, a sparse encoder can remove zeros from the data computed by processing engine 1070. A sparse encoder can also generate a sparsity map representing the sparsity of the dense data.

[0142] In some embodiments, the data ejected from the processing engine 1070 may be output data elements of a DNN layer. A sparse encoder can generate a compressed version of the output tensor. The sparse encoder can identify each zero activation value in the output tensor and remove these activation values from the output tensor to generate a compressed activation tensor (also known as a "sparse activation tensor"). The sparse encoder can also generate one or more sparsity maps for the output tensor. The sparsity map can indicate the sparsity of at least a portion of the output tensor. The sparsity map may include sparsity elements (e.g., bits), each element corresponding to a different activation value in a vector and indicating whether the corresponding activation value is zero.
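
The compression described above can be sketched as follows: zeros are dropped from the output vector and a one-bit-per-element map records where the non-zero values were. The function names are hypothetical, and the bit polarity (1 for non-zero, 0 for zero) is an assumption, since the text only says each bit indicates whether the value is zero.

```python
def sparse_encode(dense):
    """Return (compressed values, sparsity map) for a dense vector.

    Assumed polarity: map bit 1 = non-zero value kept, 0 = zero removed.
    """
    sparsity_map = [1 if v != 0 else 0 for v in dense]
    compressed = [v for v in dense if v != 0]
    return compressed, sparsity_map


def sparse_decode(compressed, sparsity_map):
    """Rebuild the dense vector from the compressed values and the map."""
    values = iter(compressed)
    return [next(values) if bit else 0 for bit in sparsity_map]


dense = [0, 3, 0, 0, 7, 1, 0, 2]  # made-up output activations
compressed, sparsity_map = sparse_encode(dense)
roundtrip = sparse_decode(compressed, sparsity_map)
```

The map makes the compression lossless: the compressed tensor and the map together are enough to restore the dense tensor.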

[0143] ODU 1090 can write the compressed activation tensor and one or more sparsity maps to local memory 1040. The sparse activation tensor and one or more sparsity maps can be further loaded into memory 1010 (e.g., via DMA engine 1020). Additionally or alternatively, the sparse activation tensor and one or more sparsity maps can be loaded by IDU 1060 into processing engine 1070 for further computation, e.g., for performing deep learning operations at the next layer.

[0144] Figure 11 shows an example sparse unit 1100 according to various embodiments. The sparse unit 1100 may be a processing unit in a processing engine (e.g., the processing engine 1070 in Figure 10). The sparse unit 1100 includes 16 MAC units 1110 (referred to individually as "MAC unit 1110") forming a 4×4 MAC array. The MAC array has a 4×4 spatial shape, meaning the height and width of the MAC array are both four. The sparse unit 1100 also includes 16 weight register files 1120 (referred to individually as "weight register file 1120"), 16 activation register files 1130 (referred to individually as "activation register file 1130"), four row buffers 1140 (referred to individually as "row buffer 1140"), and acceleration modules 1160 (referred to individually as "acceleration module 1160"). In other embodiments, the sparse unit 1100 may include fewer, more, or different components. For example, the sparse unit 1100 may include other numbers of MAC units 1110, weight register files 1120, activation register files 1130, row buffers 1140, or acceleration modules 1160. As another example, the sparse unit 1100 may include column buffers to replace or supplement the row buffers 1140. Additionally, the shape (e.g., height or width) of the MAC array may vary.

[0145] MAC unit 1110 is configured to perform MAC operations. Each MAC unit 1110 may include one or more multipliers and one or more adders. A multiplier may multiply one activation value and one weight at a time to compute a product. In some embodiments (e.g., embodiments where MAC unit 1110 includes multiple multipliers), the multipliers may operate simultaneously to process multiple activation-weight pairs and compute multiple products in one cycle. Adders may accumulate the products computed by the multipliers. Although not shown in Figure 11, the sparse unit 1100 may include an adder tree comprising multiple adder layers. A first layer may receive the outputs of multiple MAC units 1110. The number of adders in the first layer may be half the number of MAC units 1110, and each adder may sum the outputs of two MAC units 1110. A second layer may receive the outputs of the adders in the first layer. The number of adders in the second layer may be half the number of adders in the first layer, and each adder in the second layer may sum the outputs of two adders in the first layer. The adder tree may include one or more other layers. The final layer may include a single adder that sums the outputs of the adders in the penultimate layer to produce a partial sum of the sparse unit 1100.
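
The layered reduction described above can be sketched in a few lines. With 16 MAC outputs, the layers contain 8, 4, 2 and 1 adders. The function name `adder_tree` and the input values are illustrative, and the pass-through of a leftover value when a layer has an odd number of inputs is an assumption not covered by the text.

```python
def adder_tree(values):
    """Sum values pairwise, layer by layer.

    Returns (partial_sum, number_of_adder_layers).
    """
    layer = list(values)
    if not layer:
        raise ValueError("adder tree needs at least one input")
    num_layers = 0
    while len(layer) > 1:
        # Each adder in this layer sums two outputs of the previous layer.
        next_layer = [layer[i] + layer[i + 1]
                      for i in range(0, len(layer) - 1, 2)]
        if len(layer) % 2:  # assumed: an unpaired value passes through
            next_layer.append(layer[-1])
        layer = next_layer
        num_layers += 1
    return layer[0], num_layers


mac_outputs = list(range(1, 17))  # made-up outputs of 16 MAC units
partial_sum, num_layers = adder_tree(mac_outputs)
```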

[0146] Weight register file 1120 stores the weights to be processed in MAC operations. In the embodiment of Figure 11, four weight register files 1120 are grouped into a storage set that stores data to be used by a column of MAC units 1110. There are four storage sets corresponding to the four columns of MAC units 1110. In some embodiments, a weight register file 1120 may correspond to a MAC unit 1110 and store data to be processed by that MAC unit. In some embodiments, all 16 weight register files 1120 constitute a weight storage unit.

[0147] Activation register file 1130 stores the activation values to be processed during MAC operations. In the embodiment of Figure 11, four activation register files 1130 are grouped into a storage set that stores data to be used by a row of MAC units 1110. There are four storage sets corresponding to the four rows of MAC units 1110. In some embodiments, an activation register file 1130 may correspond to a MAC unit 1110 and store data to be processed by that MAC unit. In some embodiments, all 16 activation register files 1130 constitute an activation storage unit. A row buffer 1140 stores the output of MAC units 1110. Each row buffer 1140 can eject the output of a single row of MAC units 1110.

[0148] The acceleration module 1160 uses a mixed weight format to accelerate computation in the sparse unit 1100. In the embodiment of Figure 11, each acceleration module 1160 can control the computational acceleration of a different MAC unit 1110. The number of acceleration modules 1160 in the sparse unit 1100 is the same as the number of MAC units 1110 in the sparse unit 1100. In other embodiments, an acceleration module 1160 can control the acceleration of multiple MAC units 1110. As shown in Figure 11, each acceleration module 1160 includes a storage unit 1165 and control logic 1167. The storage unit 1165 stores a mixed-format map. The control logic 1167 can control the distribution of the stored weights and activation values from the weight register file 1120 and the activation register file 1130 to the MAC unit 1110 based on the mixed-format map. In some embodiments, the control logic 1167 can distribute weight operands and corresponding activation operands to the MAC unit 1110 for MAC operations. Weight operands can be sub-blocks (e.g., columns) of weight blocks. All weights in a weight operand can be in the same output channel and have the same spatial location, but these weights can be in different input channels.

[0149] In some embodiments, a weight operand may include one or more uncompressed weights and one or more compressed weights. The way control logic 1167 distributes compressed weights to MAC unit 1110 may differ from the way control logic 1167 distributes uncompressed weights. In some embodiments (e.g., embodiments in which the compressed weights are zeros), control logic 1167 may select non-zero weights stored in weight register file 1120 based on the mixed-format map and distribute these non-zero weights to MAC unit 1110 for computation. Control logic 1167 may also distribute activation values corresponding to non-zero weights from activation register file 1130 to MAC unit 1110. Control logic 1167 may ignore zero weights and corresponding activation values, thereby skipping these weights and activation values in the computation.

[0150] In other embodiments (e.g., embodiments where the precision of compressed weights is lower than that of uncompressed weights), control logic 1167 may distribute compressed and uncompressed weights to MAC unit 1110 in different ways. For example, control logic 1167 may distribute a compressed weight to MAC unit 1110 within one computation cycle, but distribute an uncompressed weight to MAC unit 1110 over multiple computation cycles. MAC unit 1110 may have a multiplier that can compute the product of a compressed weight and its corresponding activation value within one computation cycle. For an uncompressed weight, the multiplier may compute multiple products. Each of these products may be the result of multiplying a portion of the uncompressed weight by its corresponding activation value within one computation cycle. One or more of these products may be shifted and then added to one or more other products to compute the product of the uncompressed weight and the activation value. As another example, control logic 1167 can distribute multiple compressed weights to MAC unit 1110 within one computation cycle, but distribute one uncompressed weight to MAC unit 1110 within one computation cycle. In this example, MAC unit 1110 can have multiple multipliers that can compute multiple products for the uncompressed weight within one computation cycle, where each multiplier can multiply a portion of the uncompressed weight by its corresponding activation value. Each multiplier can also multiply a compressed weight by its corresponding activation value within one computation cycle, allowing the multiple multipliers to process multiple compressed weights within a single computation cycle.
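
The shift-and-add scheme described above can be illustrated with integer arithmetic. In this sketch an uncompressed weight is an unsigned 8-bit integer processed as two 4-bit portions, one portion per computation cycle, while a 4-bit compressed weight needs a single cycle. The bit widths, the unsigned encoding and the function name `multiply_in_parts` are assumptions made for illustration, not details from the disclosure.

```python
def multiply_in_parts(weight, activation, part_bits=4, total_bits=8):
    """Multiply an unsigned weight by an activation, one portion per cycle.

    Each cycle multiplies one part_bits-wide slice of the weight by the
    activation; the slice products are shifted and added together.
    Returns (product, cycles_used).
    """
    mask = (1 << part_bits) - 1
    product = 0
    cycles = 0
    for shift in range(0, total_bits, part_bits):
        part = (weight >> shift) & mask           # one slice of the weight
        product += (part * activation) << shift   # shift, then accumulate
        cycles += 1
    return product, cycles


# 8-bit "uncompressed" weight: two 4-bit portions, two cycles.
full_product, full_cycles = multiply_in_parts(0xB7, 5)
# 4-bit "compressed" weight: a single portion, one cycle.
small_product, small_cycles = multiply_in_parts(9, 5, total_bits=4)
```

The same decomposition applies when several multipliers work in parallel: each multiplier takes one slice, and the shifted slice products are summed within the same cycle.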

[0151] As shown in Figure 11, sparse unit 1100 is associated with multiplexers (MUXs) 1103, 1104, 1105, and 1106. In other embodiments, sparse unit 1100 may be associated with other numbers of MUXs or other devices. MUX 1103 facilitates loading weights (e.g., from local memory 1040) into weight register file 1120. MUX 1104 facilitates loading activation values (e.g., from local memory 1040) into activation register file 1130. MUX 1105 facilitates loading a mixed-format map into storage unit 1165. MUX 1106 may be a drain MUX, which may facilitate ejecting the output of MAC unit 1110 (e.g., to local memory 1040).

[0152] Figure 12 shows an example sparse unit array 1170 according to various embodiments. The sparse unit array 1170 can be an example of the processing engine 1070 in Figure 10. In the embodiment of Figure 12, the sparse unit array 1170 includes sparse units 1180 (referred to individually as "sparse unit 1180") arranged in four columns and four rows, an activation value memory 1190, and a weight memory 1195. In other embodiments, the sparse unit array 1170 may include fewer, more, or different components. For example, the sparse unit array 1170 may include other numbers of columns, rows, or sparse units 1180.

[0153] Each sparse unit 1180 can perform accelerated MAC operations. The MAC operations in the sparse unit 1180 can be accelerated based on a mixed weight format. An embodiment of the sparse unit 1180 may be the sparse unit 1100 in Figure 11. The activation value memory 1190 stores activation values, such as those in the input tensor of a neural network operation. Activation values can be loaded from the activation value memory 1190 into the sparse unit 1180, for example, into an activation register file. The weight memory 1195 stores weights, such as those in a filter of a neural network operation. Weights can be loaded from the weight memory 1195 into the sparse unit 1180, for example, into a weight register file. The activation value memory 1190 or the weight memory 1195 can be a buffer.

[0154] Figure 13 shows an example PE 1300 according to various embodiments. PE 1300 may be a unit component of a processing unit (e.g., a processing unit in the processing engine 1070 in Figure 10). In the embodiment of Figure 13, PE 1300 includes a MAC unit 1305, an activation register file 1310, a weight register file 1320, an output register file 1350, and a sparsity accelerator 1360. MAC unit 1305 includes a multiplier 1330 and an adder 1340. In other embodiments, PE 1300 may include fewer, more, or different components.

[0155] Activation register file 1310 stores activation operands, which can be contexts. Activation register file 1310 can be an example of the activation register file 1130 in Figure 11. Weight register file 1320 stores weight operands. Weight register file 1320 can be an example of the weight register file 1120 in Figure 11. Activation operands and weight operands can be loaded from memory (e.g., local memory 1040) into activation register file 1310 and weight register file 1320, respectively. The sparsity accelerator 1360 receives a sparse bitmap 1315 corresponding to the sparse tensors in weight register file 1320. When MAC unit 1305 operates in combined computation mode, sparse bitmap 1315 can be a combined sparse bitmap. When MAC unit 1305 operates in activation computation mode, sparse bitmap 1315 can be an activation sparse bitmap. When MAC unit 1305 operates in weight computation mode, sparse bitmap 1315 can be a weight sparse bitmap. Sparse bitmap 1315 can have the same size (e.g., the same number of elements) as, or a larger size than, the activation operands or weight operands.

[0156] Using sparse bitmap 1315, sparsity accelerator 1360 selects four activation values from activation register file 1310 and four weights from weight register file 1320. Sparsity accelerator 1360 transmits the selected activation values and weights to multiplier 1330. These selected data elements correspond to the non-zero elements of sparse bitmap 1315. These four selected activation values and four selected weights can form four activation-weight pairs. Multiplier 1330 can compute a product based on each activation-weight pair, thus computing a total of four products. These four products can be provided to adder 1340. Although Figure 13 shows a single multiplier 1330, the MAC unit 1305 may include multiple multipliers that can perform multiple multiplication operations simultaneously.

[0157] Adder 1340 sums these four products and computes a PE-level internal partial sum. The four unselected elements of the dense tensor are not processed, which saves power and time without affecting the value of the PE-level internal partial sum. For example, when the dense tensor is a dense activation tensor, the weights corresponding to the unselected activation values are zero, so the product of the unselected activation values and the weights is zero, contributing nothing to the PE-level internal partial sum or other partial sums computed by the sparse unit. Similarly, when the dense tensor is a dense weight tensor, the activation values corresponding to the unselected weights are zero, so the product of the unselected weights and the activation values is zero, contributing nothing to the PE-level internal partial sum or other partial sums computed by the sparse unit. In other embodiments, MAC unit 1305 can operate in a dense mode, in which the sparse bitmap 1315 is not used and the sparsity accelerator 1360 is inactive. MAC unit 1305 can then process all activation values in the activation operand and all weights in the weight operand.
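
The bitmap-driven selection and the claim that skipping zero products leaves the partial sum unchanged can be checked with a small sketch. It assumes a weight sparse bitmap with one bit per element, where 1 marks a non-zero weight. The operand length of 8, the values and the function names are made up for illustration.

```python
def sparse_mac(activations, weights, bitmap):
    """Multiply only the pairs whose bitmap bit is set, then sum.

    Returns (partial_sum, number_of_multiplications).
    """
    selected = [(a, w) for a, w, bit in zip(activations, weights, bitmap)
                if bit]
    products = [a * w for a, w in selected]
    return sum(products), len(products)


def dense_mac(activations, weights):
    """Reference: multiply every activation-weight pair, then sum."""
    return sum(a * w for a, w in zip(activations, weights))


activations = [1, 2, 3, 4, 5, 6, 7, 8]       # dense activation operand
weights = [0, 2, 0, 0, 3, 0, 1, 4]           # sparse weight operand
bitmap = [1 if w != 0 else 0 for w in weights]

partial_sum, multiplications = sparse_mac(activations, weights, bitmap)
```

Only four of the eight pairs are multiplied, yet the result equals the dense computation because every skipped product would have been zero.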

[0158] The cell-level internal partial sums can be stored in the output register file 1350. In some embodiments, the cell-level internal partial sums can be used multiple times. For example, the activation operand can represent N data blocks in the input tensor of the convolution, where N is an integer greater than 1. Instead of processing all N data blocks to compute N cell-level internal partial sums, the cell-level internal partial sums are computed once and used N times in the convolutional layer as N cell-level internal partial sums.

[0159] In some embodiments, PE 1300 receives one or more PE-level internal partial sums from one or more other PEs. Adder 1340 or an accumulator (not shown in Figure 13) can accumulate the one or more PE-level internal partial sums with the PE-level internal partial sum of PE 1300 and store the accumulated result (i.e., the multi-PE internal partial sum) in output register file 1350. The one or more other PEs can be in the same column as PE 1300 in a sparse cell. The multi-PE internal partial sum can be a column-level internal partial sum. In some embodiments, the PE-level internal partial sum of PE 1300 or the multi-PE internal partial sum can be sent to one or more other PEs for further accumulation.
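
As a rough illustration of this accumulation, the following Python sketch sums the partial sums of several PEs in one column. The names `pe_partial_sum` and `column_partial_sum`, and the operand values, are hypothetical; the sketch only shows that accumulating per-PE partial sums gives the same value as one dot product over the concatenated operands.

```python
def pe_partial_sum(activations, weights):
    """Partial sum computed inside a single PE."""
    return sum(a * w for a, w in zip(activations, weights))


def column_partial_sum(pe_operands):
    """Accumulate the partial sums of all PEs in one column, one PE at a time."""
    accumulated = 0
    for activations, weights in pe_operands:
        accumulated += pe_partial_sum(activations, weights)
    return accumulated


# Hypothetical column of three PEs, each holding (activations, weights).
column = [([1, 2], [3, 4]), ([5, 6], [7, 8]), ([9, 10], [11, 12])]
column_sum = column_partial_sum(column)

# Reference: a single dot product over the concatenated operands.
all_acts = [a for acts, _ in column for a in acts]
all_wts = [w for _, wts in column for w in wts]
reference = sum(a * w for a, w in zip(all_acts, all_wts))
print(column_sum, reference)
```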

[0160] Figure 14 is a block diagram of an example computing device 2000 according to various embodiments. In some embodiments, the computing device 2000 may be used as at least a portion of a DNN system 300. Multiple components are shown in Figure 14 as being included in computing device 2000, but any one or more of these components may be omitted or duplicated to suit the application. In some embodiments, some or all of the components included in computing device 2000 may be attached to one or more motherboards. In some embodiments, some or all of these components are manufactured on a single system-on-a-chip (SoC) die. Furthermore, in various embodiments, computing device 2000 may not include one or more of the components shown in Figure 14, but may include interface circuitry for coupling to said one or more components. For example, the computing device 2000 may not include the display device 2006, but may include display device interface circuitry (e.g., connector and driver circuitry) to which the display device 2006 may be coupled. In another set of examples, the computing device 2000 may not include the audio input device 2018 or the audio output device 2008, but may include audio input or output device interface circuitry to which the audio input device 2018 or the audio output device 2008 may be coupled.

[0161] Computing device 2000 may include processing device 2002 (e.g., one or more processing devices). Processing device 2002 processes electronic data from registers and / or memory to convert the electronic data into other electronic data that can be stored in registers and / or memory. Computing device 2000 may include memory 2004, which itself may include one or more memory devices, such as volatile memory (e.g., DRAM), non-volatile memory (e.g., read-only memory (ROM)), high-bandwidth memory (HBM), flash memory, solid-state memory, and / or a hard disk drive. In some embodiments, memory 2004 may include memory sharing a die with processing device 2002. In some embodiments, memory 2004 includes one or more non-transitory computer-readable media storing instructions executable to perform operations for training a DNN (e.g., method 700 described in conjunction with Figure 7) or some operations performed by one or more components of the AI system 100. Instructions stored in the one or more non-transitory computer-readable media can be executed by the processing device 2002.

[0162] In some embodiments, computing device 2000 may include communication chip 2012 (e.g., one or more communication chips). For example, communication chip 2012 may be configured to manage wireless communication for transmitting data to and from computing device 2000. The term "wireless" and its derivatives can be used to describe circuits, devices, systems, methods, technologies, communication channels, etc., which can transmit data via a non-solid medium using modulated electromagnetic radiation. The term does not imply that the associated devices contain no wires, although in some embodiments they might not.

[0163] The communication chip 2012 can implement any of a number of wireless standards or protocols, including but not limited to Institute of Electrical and Electronics Engineers (IEEE) standards such as Wi-Fi (IEEE 802.11 family) and IEEE 802.16 standards (e.g., the IEEE 802.16-2005 amendment), and the Long Term Evolution (LTE) project, along with any amendments, updates, and / or revisions (e.g., the LTE-Advanced project, the Ultra Mobile Broadband (UMB) project (also known as "3GPP2"), etc.). Broadband Wireless Access (BWA) networks compatible with IEEE 802.16 are often referred to as WiMAX networks, an acronym for Worldwide Interoperability for Microwave Access, which is a certification mark for products that have passed conformance and interoperability testing for the IEEE 802.16 standards. The communication chip 2012 can operate according to Global System for Mobile Communication (GSM), General Packet Radio Service (GPRS), Universal Mobile Telecommunications System (UMTS), High Speed Packet Access (HSPA), Evolved HSPA (E-HSPA), or LTE networks. The communication chip 2012 can also operate according to Enhanced Data for GSM Evolution (EDGE), GSM EDGE Radio Access Network (GERAN), Universal Terrestrial Radio Access Network (UTRAN), or Evolved UTRAN (E-UTRAN). The communication chip 2012 can operate according to Code-Division Multiple Access (CDMA), Time Division Multiple Access (TDMA), Digital Enhanced Cordless Telecommunication (DECT), Evolution-Data Optimized (EV-DO) and its derivatives, as well as any other wireless protocol designated as 3G, 4G, 5G, and beyond. In other embodiments, the communication chip 2012 can operate according to other wireless protocols. The computing device 2000 may include an antenna 2022 to facilitate wireless communication and / or to receive other wireless communications (e.g., AM or FM radio transmissions).

[0164] In some embodiments, the communication chip 2012 can manage wired communications such as electrical, optical, or any other suitable communication protocol (e.g., Ethernet). As described above, the communication chip 2012 may include multiple communication chips. For example, a first communication chip 2012 may be dedicated to short-range wireless communications such as Wi-Fi or Bluetooth, and a second communication chip 2012 may be dedicated to long-range wireless communications such as GPS, EDGE, GPRS, CDMA, WiMAX, LTE, EV-DO, or others. In some embodiments, the first communication chip 2012 may be dedicated to wireless communications, and the second communication chip 2012 may be dedicated to wired communications.

[0165] The computing device 2000 may include a battery / power circuit 2014. The battery / power circuit 2014 may include one or more energy storage devices (e.g., batteries or capacitors) and / or circuitry for coupling components of the computing device 2000 to a power source (e.g., AC line power) that is separate from the computing device 2000.

[0166] The computing device 2000 may include a display device 2006 (or the corresponding interface circuitry described above). For example, the display device 2006 may include any visual indicator, such as a head-up display, computer monitor, projector, touch screen display, liquid crystal display (LCD), light-emitting diode display, or flat panel display.

[0167] The computing device 2000 may include an audio output device 2008 (or a corresponding interface circuit as described above). For example, the audio output device 2008 may include any device that generates audible indicators, such as a speaker, headphones, or earphones.

[0168] The computing device 2000 may include an audio input device 2018 (or a corresponding interface circuit as described above). The audio input device 2018 may include any device that generates a signal representing sound, such as a microphone, microphone array, or digital musical instrument (e.g., a musical instrument with a Musical Instrument Digital Interface (MIDI) output).

[0169] The computing device 2000 may include a GPS device 2016 (or a corresponding interface circuit as described above). As is known in the art, the GPS device 2016 can communicate with a satellite-based system and can receive the location of the computing device 2000.

[0170] The computing device 2000 may include other output devices 2010 (or corresponding interface circuitry as described above). Examples of other output devices 2010 may include audio codecs, video codecs, printers, wired or wireless transmitters for providing information to other devices, or additional storage devices.

[0171] The computing device 2000 may include other input devices 2020 (or corresponding interface circuits as described above). Examples of other input devices 2020 may include accelerometers, gyroscopes, compasses, image capture devices, keyboards, cursor control devices such as mice, styluses, touchpads, barcode readers, Quick Response (QR) code readers, any sensors, or radio frequency identification (RFID) readers.

[0172] The computing device 2000 can have any desired form factor, such as a handheld or mobile computer system (e.g., a mobile phone, smartphone, mobile internet device, music player, tablet computer, laptop computer, netbook computer, ultrabook computer, personal digital assistant (PDA), ultraportable personal computer, etc.), desktop computer system, server or other networked computing component, printer, scanner, monitor, set-top box, entertainment control unit, vehicle control unit, digital camera, digital video recorder, or wearable computer system. In some embodiments, the computing device 2000 can be any other electronic device that processes data.

[0173] The following paragraphs provide various examples of the embodiments disclosed herein.

[0174] Example 1 provides a method for training a DNN, comprising: providing an input tensor and a weight tensor of a layer in the DNN to an NPU to train the DNN through a training process including a forward operation and a backward operation; offloading the forward operation to a MatMul kernel on the NPU, the MatMul kernel being used to perform the layer by performing a first MatMul operation on the input tensor and the weight tensor and generating an output tensor of the layer; offloading the backward operation to the MatMul kernel, the MatMul kernel being used to compute a gradient of a loss by performing a second MatMul operation on the gradient of the output tensor and the input tensor; and training the layer by updating the weight tensor based on the gradient of the loss.

[0175] Example 2 provides the method described in Example 1, wherein the gradient of the loss is the weight gradient of the loss, wherein the MatMul kernel is further configured to compute the input gradient of the loss for the backward operation by performing a third MatMul operation on the gradient of the output tensor and the weight tensor, wherein the weight tensor is further updated based on the input gradient of the loss.
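
The three MatMul operations of Examples 1 and 2 can be sketched in plain Python as follows. This is an illustrative sketch under stated assumptions, not the disclosed kernel: the tensor layout (input of shape batch x in, weight of shape in x out), the transposes, the squared-error loss, the learning rate `lr`, and the plain gradient-descent update are all assumptions, and `matmul`, `transpose`, and `loss_fn` are hypothetical helpers standing in for the NPU MatMul kernel.

```python
def matmul(a, b):
    """Matrix product of two row-major nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(m):
    return [list(row) for row in zip(*m)]


def loss_fn(y, target):
    """Assumed loss: half the sum of squared errors."""
    return 0.5 * sum((yv - tv) ** 2
                     for yr, tr in zip(y, target) for yv, tv in zip(yr, tr))


x = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]       # input tensor, shape (3, 2)
w = [[0.1, -0.2], [0.3, 0.4]]                  # weight tensor, shape (2, 2)
target = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # reference values

# Forward operation: first MatMul produces the output tensor.
y = matmul(x, w)
loss_before = loss_fn(y, target)

# Gradient of the output tensor. For the assumed loss it is y - target;
# in the text this value comes from an automatic differentiation module.
grad_y = [[yv - tv for yv, tv in zip(yr, tr)] for yr, tr in zip(y, target)]

# Backward operation: second MatMul gives the weight gradient (X^T . dY),
# third MatMul gives the input gradient (dY . W^T).
grad_w = matmul(transpose(x), grad_y)
grad_x = matmul(grad_y, transpose(w))

# Assumed update rule: one step of plain gradient descent on the weights.
lr = 0.01
w_new = [[wv - lr * gv for wv, gv in zip(wr, gr)] for wr, gr in zip(w, grad_w)]
loss_after = loss_fn(matmul(x, w_new), target)
print(loss_before, loss_after)
```

The weight gradient has the shape of the weight tensor and the input gradient has the shape of the input tensor, which is why the same MatMul routine can serve both directions once the operands are transposed appropriately.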

[0176] Example 3 provides the method of Example 2, wherein the input tensor is the output of a previous layer in the DNN, and wherein the method further includes: propagating the input gradient of the loss from the layer to the previous layer.

[0177] Example 4 provides a method as described in any one of Examples 1-3, further comprising: during the forward operation, calculating the loss by applying a loss function to the output tensor of the layer and one or more reference values.

[0178] Example 5 provides a method according to any one of Examples 1-4, further comprising: during the backward operation, calculating the gradient of the output tensor based on the loss, the output tensor of the layer, and one or more reference values.
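
Examples 4 and 5 do not name a loss function. As one possible instance, the sketch below assumes a softmax cross-entropy loss with integer class labels as the reference values, and writes the gradient of the output tensor analytically in place of an automatic differentiation module. The function names `softmax`, `cross_entropy`, and `output_gradient` and the sample values are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over one row of outputs."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(outputs, labels):
    """Mean negative log-likelihood of the reference labels."""
    total = 0.0
    for row, label in zip(outputs, labels):
        total -= math.log(softmax(row)[label])
    return total / len(outputs)


def output_gradient(outputs, labels):
    """Gradient of the mean cross-entropy with respect to the output tensor."""
    n = len(outputs)
    grads = []
    for row, label in zip(outputs, labels):
        probs = softmax(row)
        # d(loss)/d(output) = (softmax - one_hot(label)) / batch size
        grads.append([(p - (1.0 if j == label else 0.0)) / n
                      for j, p in enumerate(probs)])
    return grads


outputs = [[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]]  # output tensor of the layer
labels = [0, 2]                                 # reference values

loss = cross_entropy(outputs, labels)       # computed in the forward operation
grad_y = output_gradient(outputs, labels)   # computed in the backward operation
print(loss, grad_y)
```

The resulting `grad_y` is the tensor that the backward MatMul operations would consume to produce the weight gradient and the input gradient.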

[0179] Example 6 provides the method of any one of Examples 1-5, wherein the gradient of the output tensor is computed using an automatic differentiation module, wherein the automatic differentiation module is offloaded to the NPU.

[0180] Example 7 provides a method as described in any one of Examples 1-6, wherein the input tensor or the weight tensor comprises half-precision floating-point values or full-precision floating-point values.

[0181] Example 8 provides a method as described in any one of Examples 1-7, wherein the MatMul kernel is configured to perform MatMul operations on tensors of different dimensions.
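
One possible reading of "tensors of different dimensions" is that the same MatMul routine accepts a higher-rank input tensor together with a two-dimensional weight tensor. The sketch below illustrates that reading only; the helper names `ndim`, `matmul2d`, and `matmul` are hypothetical, and the source does not specify how the kernel handles rank.

```python
def ndim(t):
    """Number of nesting levels of a nested-list tensor."""
    depth = 0
    while isinstance(t, list):
        t = t[0]
        depth += 1
    return depth


def matmul2d(a, b):
    """Matrix product of two 2-D nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def matmul(a, b):
    """Apply a 2-D weight tensor b to an input a of rank 2 or higher."""
    if ndim(a) > 2:
        # Recurse over the leading (batch-like) axis.
        return [matmul(sub, b) for sub in a]
    return matmul2d(a, b)


# Rank-3 input of shape (2, 2, 3) and rank-2 weight of shape (3, 2).
x3 = [[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11, 12]]]
w2 = [[1, 0], [0, 1], [1, 1]]

y3 = matmul(x3, w2)       # shape (2, 2, 2)
y2 = matmul(x3[0], w2)    # the same routine on a rank-2 input, shape (2, 2)
print(y3, y2)
```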

[0182] Example 9 provides one or more non-transitory computer-readable media storing instructions executable to perform operations for training a DNN, the operations including: providing an input tensor and a weight tensor of a layer in the DNN to an NPU to train the DNN through a training process including a forward operation and a backward operation; offloading the forward operation to a matrix multiplication (MatMul) kernel on the NPU, the MatMul kernel being configured to perform the layer by performing a first MatMul operation on the input tensor and the weight tensor and generating an output tensor of the layer; offloading the backward operation to the MatMul kernel, the MatMul kernel being configured to compute a gradient of a loss by performing a second MatMul operation on the gradient of the output tensor and the input tensor; and training the layer by updating the weight tensor based on the gradient of the loss.

[0183] Example 10 provides one or more non-transitory computer-readable media as described in Example 9, wherein the gradient of the loss is the weight gradient of the loss, wherein the MatMul kernel is further configured to compute the input gradient of the loss for the backward operation by performing a third MatMul operation on the gradient of the output tensor and the weight tensor, wherein the weight tensor is further updated based on the input gradient of the loss.

[0184] Example 11 provides one or more non-transitory computer-readable media as described in Example 10, wherein the input tensor is the output of a previous layer in the DNN, wherein the operations further include: propagating the input gradient of the loss from the layer to the previous layer.

[0185] Example 12 provides one or more non-transitory computer-readable media as described in any one of Examples 9-11, wherein the operations further include: during the forward operation, calculating the loss by applying a loss function to the output tensor of the layer and one or more reference values.

[0186] Example 13 provides one or more non-transitory computer-readable media as described in any one of Examples 9-12, wherein the operations further include: during the backward operation, calculating the gradient of the output tensor based on the loss, the output tensor of the layer, and one or more reference values.

[0187] Example 14 provides one or more non-transitory computer-readable media as described in any one of Examples 9-13, wherein the gradient of the output tensor is computed using an automatic differentiation module, wherein the automatic differentiation module is offloaded to the NPU.

[0188] Example 15 provides one or more non-transitory computer-readable media as described in any one of Examples 9-14, wherein the input tensor or the weight tensor comprises half-precision floating-point values or brain-float values.

[0189] Example 16 provides an apparatus comprising: a computer processor for executing computer program instructions; and a non-transitory computer-readable storage memory storing the computer program instructions executable by the computer processor to perform operations for training a deep neural network (DNN), the operations including: providing an input tensor and a weight tensor of a layer in the DNN to an NPU to train the DNN through a training process including a forward operation and a backward operation; offloading the forward operation to a matrix multiplication (MatMul) kernel on the NPU, the MatMul kernel being configured to perform the layer by performing a first MatMul operation on the input tensor and the weight tensor and generating an output tensor of the layer; offloading the backward operation to the MatMul kernel, the MatMul kernel being configured to compute a gradient of a loss by performing a second MatMul operation on the gradient of the output tensor and the input tensor; and training the layer by updating the weight tensor based on the gradient of the loss.

[0190] Example 17 provides the apparatus described in Example 16, wherein the gradient of the loss is the weight gradient of the loss, wherein the MatMul kernel is further configured to compute the input gradient of the loss for the backward operation by performing a third MatMul operation on the gradient of the output tensor and the weight tensor, wherein the weight tensor is further updated based on the input gradient of the loss.

[0191] Example 18 provides the apparatus described in Example 17, wherein the input tensor is the output of a previous layer in the DNN, and wherein the operations further include: propagating the input gradient of the loss from the layer to the previous layer.

[0192] Example 19 provides an apparatus as described in any one of Examples 16-18, wherein the operations further include: during the forward operation, calculating the loss by applying a loss function to the output tensor of the layer and one or more reference values; and during the backward operation, calculating the gradient of the output tensor based on the loss, the output tensor of the layer, and one or more reference values.

[0193] Example 20 provides the apparatus of any one of Examples 16-19, wherein the gradient of the output tensor is calculated using an automatic differentiation module, wherein the automatic differentiation module is offloaded to the NPU.

[0194] The foregoing description of the embodiments illustrated herein, including the content described in the abstract, is not intended to be exhaustive or to limit the present disclosure to the precise forms disclosed. Although specific implementations and examples of the present disclosure have been described herein for illustrative purposes, various equivalent modifications can be made within the scope of this disclosure, as will be appreciated by those skilled in the art. These modifications can be made to the present disclosure based on the foregoing detailed description.

Claims

1. A method for training a neural network, comprising: providing an input tensor and a weight tensor of a layer in the neural network to a neural processing unit to train the neural network through a training process that includes a forward operation and a backward operation; offloading the forward operation to a matrix multiplication (MatMul) kernel on the neural processing unit, the MatMul kernel being used to perform the layer by performing a first MatMul operation on the input tensor and the weight tensor and to generate an output tensor of the layer; offloading the backward operation to the MatMul kernel, the MatMul kernel being used to compute a gradient of a loss by performing a second MatMul operation on a gradient of the output tensor and the input tensor; and training the layer by updating the weight tensor based on the gradient of the loss.

2. The method according to claim 1, wherein the gradient of the loss is a weight gradient of the loss, wherein the MatMul kernel is further used to compute an input gradient of the loss for the backward operation by performing a third MatMul operation on the gradient of the output tensor and the weight tensor, and wherein the weight tensor is further updated based on the input gradient of the loss.

3. The method according to claim 2, wherein the input tensor is an output of a previous layer in the neural network, and wherein the method further includes: propagating the input gradient of the loss from the layer to the previous layer.

4. The method according to any one of claims 1-3, further comprising: during the forward operation, calculating the loss by applying a loss function to the output tensor of the layer and one or more reference values.

5. The method according to any one of claims 1-3, further comprising: during the backward operation, calculating the gradient of the output tensor based on the loss, the output tensor of the layer, and one or more reference values.

6. The method according to any one of claims 1-3, wherein the gradient of the output tensor is calculated using an automatic differentiation module, and wherein the automatic differentiation module is offloaded to the neural processing unit.

7. The method according to any one of claims 1-3, wherein the input tensor or the weight tensor includes half-precision floating-point values or full-precision floating-point values.

8. The method according to any one of claims 1-3, wherein the MatMul kernel is configured to perform MatMul operations on tensors of different dimensions.

9. One or more non-transitory computer-readable media storing instructions executable to perform operations for training a neural network, the operations including: providing an input tensor and a weight tensor of a layer in the neural network to a neural processing unit to train the neural network through a training process that includes a forward operation and a backward operation; offloading the forward operation to a matrix multiplication (MatMul) kernel on the neural processing unit, the MatMul kernel being used to perform the layer by performing a first MatMul operation on the input tensor and the weight tensor and to generate an output tensor of the layer; offloading the backward operation to the MatMul kernel, the MatMul kernel being used to compute a gradient of a loss by performing a second MatMul operation on a gradient of the output tensor and the input tensor; and training the layer by updating the weight tensor based on the gradient of the loss.

10. The one or more non-transitory computer-readable media according to claim 9, wherein the gradient of the loss is a weight gradient of the loss, wherein the MatMul kernel is further used to compute an input gradient of the loss for the backward operation by performing a third MatMul operation on the gradient of the output tensor and the weight tensor, and wherein the weight tensor is further updated based on the input gradient of the loss.

11. The one or more non-transitory computer-readable media according to claim 10, wherein the input tensor is an output of a previous layer in the neural network, and wherein the operations further include: propagating the input gradient of the loss from the layer to the previous layer.

12. The one or more non-transitory computer-readable media according to any one of claims 9-11, wherein the operations further include: during the forward operation, calculating the loss by applying a loss function to the output tensor of the layer and one or more reference values.

13. The one or more non-transitory computer-readable media according to any one of claims 9-11, wherein the operations further include: during the backward operation, calculating the gradient of the output tensor based on the loss, the output tensor of the layer, and one or more reference values.

14. The one or more non-transitory computer-readable media according to any one of claims 9-11, wherein the gradient of the output tensor is calculated using an automatic differentiation module, and wherein the automatic differentiation module is offloaded to the neural processing unit.

15. The one or more non-transitory computer-readable media according to any one of claims 9-11, wherein the input tensor or the weight tensor includes half-precision floating-point values or full-precision floating-point values.

16. The one or more non-transitory computer-readable media according to any one of claims 9-11, wherein the MatMul kernel is configured to perform MatMul operations on tensors of different dimensions.

17. An apparatus comprising: a computer processor for executing computer program instructions; and a non-transitory computer-readable storage device storing computer program instructions executable by the computer processor to perform operations for training a neural network, the operations including: providing an input tensor and a weight tensor of a layer in the neural network to a neural processing unit to train the neural network through a training process that includes a forward operation and a backward operation; offloading the forward operation to a matrix multiplication (MatMul) kernel on the neural processing unit, the MatMul kernel being used to perform the layer by performing a first MatMul operation on the input tensor and the weight tensor and to generate an output tensor of the layer; offloading the backward operation to the MatMul kernel, the MatMul kernel being used to compute a gradient of a loss by performing a second MatMul operation on a gradient of the output tensor and the input tensor; and training the layer by updating the weight tensor based on the gradient of the loss.

18. The apparatus according to claim 17, wherein the gradient of the loss is a weight gradient of the loss.

19. The apparatus according to claim 18, wherein the MatMul kernel is further used to compute an input gradient of the loss for the backward operation by performing a third MatMul operation on the gradient of the output tensor and the weight tensor, and wherein the weight tensor is further updated based on the input gradient of the loss.

20. The apparatus according to claim 19, wherein the input tensor is an output of a previous layer in the neural network, and wherein the operations further include: propagating the input gradient of the loss from the layer to the previous layer.

21. The apparatus according to any one of claims 17-20, wherein the operations further include: during the forward operation, calculating the loss by applying a loss function to the output tensor of the layer and one or more reference values.

22. The apparatus according to any one of claims 17-20, wherein the operations further include: during the backward operation, calculating the gradient of the output tensor based on the loss, the output tensor of the layer, and one or more reference values.

23. The apparatus according to any one of claims 17-20, wherein the gradient of the output tensor is calculated using an automatic differentiation module, and wherein the automatic differentiation module is offloaded to the neural processing unit.

24. The apparatus according to any one of claims 17-20, wherein the input tensor or the weight tensor includes half-precision floating-point values or full-precision floating-point values.

25. The apparatus according to any one of claims 17-20, wherein the MatMul kernel is configured to perform MatMul operations on tensors of different dimensions.