NPU and method thereof for generating kernels of an artificial neural network model
By generating modulation kernels and utilizing basic kernels and kernel filters, the problems of slow processing speed and high power consumption caused by frequent memory reads are solved, achieving more efficient neural network processing.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- DEEPX CO LTD
- Filing Date
- 2021-11-11
- Publication Date
- 2026-06-26
AI Technical Summary
In the inference process of artificial neural network models, frequent reading of node and kernel weight values from memory leads to slow processing speed and high power consumption. Furthermore, because kernels are highly similar to one another, they can be expressed by simple equations, which reduces the number of reads.
By generating a modulation kernel, utilizing a basic kernel and kernel filters, memory read operations are reduced, power consumption is lowered, and processing speed is increased.
It reduces memory read time and power consumption, and improves the operating efficiency of artificial neural network processors.
Figure CN114692854B_ABST
Description
[0001] Cross-references to related applications
[0002] This application claims priority to Korean Patent Application No. 10-2020-0186375, filed with the Korean Intellectual Property Office on December 29, 2020, the disclosure of which is incorporated herein by reference.
Technical Field
[0003] This disclosure relates to artificial neural networks.
Background Technology
[0004] Humans possess intelligence capable of recognition, classification, reasoning, prediction, and control / decision-making. Artificial intelligence (AI) refers to the artificial imitation of human intelligence.
[0005] The human brain is composed of a large number of nerve cells called neurons. Each neuron is connected to hundreds to thousands of other neurons through connections called synapses. Modeling how biological neurons work and the connections between them to simulate human intelligence is known as an artificial neural network (ANN) model. In other words, an artificial neural network is a system in which nodes simulating neurons are connected in a layered structure.
[0006] ANN models are classified into single-layer neural networks and multi-layer neural networks based on the number of layers. A typical multi-layer neural network consists of an input layer, hidden layers, and an output layer. Here, the input layer is the layer that receives external data, and the number of neurons in the input layer is the same as the number of input variables. The hidden layer is located between the input layer and the output layer, receiving signals from the input layer to extract features and transmitting the features to the output layer. The output layer receives signals from the hidden layer and outputs the received signals to the outside. The input signals between neurons are multiplied by the connection strength of each connection from zero (0) to one (1), and then summed. If the sum is greater than the threshold of the neuron, the neuron is activated and implemented as the output value through an activation function.
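The weighted sum and threshold comparison described above can be illustrated with a minimal sketch. The function name `neuron_output`, the step activation, and the example values are assumptions made for illustration only and do not come from this disclosure.

```python
def neuron_output(inputs, weights, threshold):
    """Weighted sum of inputs followed by a step activation.

    Hypothetical helper: the step activation is an assumed, simplest-case
    choice; other activation functions could be substituted.
    """
    if len(inputs) != len(weights):
        raise ValueError("inputs and weights must have the same length")
    # Multiply each input signal by its connection strength, then sum.
    total = sum(x * w for x, w in zip(inputs, weights))
    # The neuron is activated only if the sum exceeds its threshold.
    return 1 if total > threshold else 0


# Connection strengths lie between 0 and 1, as in the description above.
print(neuron_output([1.0, 0.5, 0.0], [0.9, 0.2, 0.7], threshold=0.8))
```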
[0007] Meanwhile, in order to achieve higher levels of artificial intelligence, artificial neural networks with an increased number of hidden layers are called deep neural networks (DNNs).
[0008] There are many types of DNNs, but as is well known, Convolutional Neural Networks (CNNs) are good at extracting features from input data and recognizing feature patterns.
[0009] A convolutional neural network (CNN) is a type of neural network that functions similarly to image processing in the human brain's visual cortex. CNNs are well-known for their suitability for image processing.
[0010] Referring to Figure 7, a convolutional neural network (CNN) is configured with alternating convolutional channels and pooling channels. In a CNN, most of the computation time is consumed by convolution operations. A CNN identifies objects by extracting image features for each channel with a matrix-type kernel and by providing robustness to changes such as translation or distortion through pooling. For each channel, a feature map is obtained by convolving the input data with the kernel, and an activation function, such as a rectified linear unit (ReLU), is applied to generate the activation map of the corresponding channel. Pooling can then be applied. The neural network that actually classifies the patterns is located at the end of the feature extraction neural network and is called a fully connected layer. In the computational processing of a CNN, most of the computation is performed through convolution or matrix multiplication. Therefore, the necessary kernels are fetched from memory very frequently, and a large portion of the operating time of a CNN is spent fetching the kernel corresponding to each channel from memory.
[0011] The memory consists of multiple storage cells, each with a unique memory address. When an artificial neural network processor generates a command to read a kernel stored in memory, there may be a delay of several clock cycles until the memory cell corresponding to that memory address is accessed.
[0012] Therefore, there is a problem that the time and power consumed in reading the necessary kernel from memory and performing convolution are very large.
Summary of the Invention
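The per-channel sequence of convolution, activation, and pooling described above can be sketched as follows. This is only an illustrative sketch under stated assumptions: stride 1, no padding, a 2×2 max pooling window, and the helper names `conv2d`, `relu`, and `max_pool2x2` are all hypothetical choices not specified in this disclosure.

```python
def conv2d(image, kernel):
    """Slide the kernel over the image (stride 1, no padding).

    As is customary for CNNs, the kernel is not flipped (cross-correlation).
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


def relu(feature_map):
    """Rectified linear unit applied element-wise."""
    return [[max(0, v) for v in row] for row in feature_map]


def max_pool2x2(activation_map):
    """Non-overlapping 2x2 max pooling (assumed window size)."""
    return [
        [max(activation_map[r][c], activation_map[r][c + 1],
             activation_map[r + 1][c], activation_map[r + 1][c + 1])
         for c in range(0, len(activation_map[0]) - 1, 2)]
        for r in range(0, len(activation_map) - 1, 2)
    ]


image = [[1, 2, 0, 1],
         [0, 1, 3, 1],
         [2, 1, 0, 0],
         [1, 0, 1, 2]]
kernel = [[1, 0, -1],
          [1, 0, -1],
          [1, 0, -1]]

feature_map = conv2d(image, kernel)       # [[0, 2], [-1, -1]]
activation_map = relu(feature_map)        # [[0, 2], [0, 0]]
print(max_pool2x2(activation_map))        # [[2]]
```

Every output element of `conv2d` reuses the same kernel values, which is why the kernel corresponding to each channel must be available for each such operation.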
[0013] The inventors of this disclosure have recognized that during the inference operation of an artificial neural network model, the NPU frequently reads the weight values of nodes and / or kernels of each layer of the artificial neural network model from a separate memory.
[0014] The inventors of this disclosure have recognized that the neural processing unit (NPU) is slow and consumes a lot of energy when reading the weight values of nodes and / or kernels of an artificial neural network model from a separate memory.
[0015] The inventors of this disclosure have recognized that the kernels of trained artificial neural network models have a very high degree of similarity to each other.
[0016] The inventors of this disclosure have recognized that even if the weights of some kernels of an artificial neural network model are partially adjusted within a certain range, the inference accuracy of the artificial neural network model will not decrease significantly.
[0017] Therefore, the inventors of this disclosure have recognized that kernels that are highly similar to each other can be expressed by a simple equation using a reference kernel.
[0018] Furthermore, the inventors of this disclosure have recognized that even if the model is trained or retrained to make the similarity between the kernels of the artificial neural network model very high, i.e., to make the deviation between kernels very small, the inference accuracy of the artificial neural network model can be maintained at a commercially usable level.
[0019] Therefore, the inventors of this disclosure have recognized that artificial neural network models can be trained by setting a cost function during training to improve target accuracy and minimize the maximum deviation between the reference kernel and other kernels of the artificial neural network model.
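One possible form of such a cost function is sketched below. The disclosure does not specify how the accuracy term and the deviation term are combined; the weighted sum, the coefficient `lam`, and the function names are assumptions made purely for illustration.

```python
def max_kernel_deviation(reference, kernels):
    """Largest absolute element-wise difference between the reference
    kernel and any of the other kernels."""
    return max(
        abs(w - ref)
        for kernel in kernels
        for row_k, row_ref in zip(kernel, reference)
        for w, ref in zip(row_k, row_ref)
    )


def total_cost(accuracy_loss, reference, kernels, lam=0.1):
    """Assumed combined cost: accuracy loss plus a weighted penalty on the
    maximum deviation from the reference kernel (lam is hypothetical)."""
    return accuracy_loss + lam * max_kernel_deviation(reference, kernels)


reference = [[1, 2], [3, 4]]
others = [
    [[1, 2], [3, 5]],   # deviates by 1
    [[0, 2], [3, 4]],   # deviates by 1
    [[1, 4], [3, 4]],   # deviates by 2
]
print(max_kernel_deviation(reference, others))        # 2
print(total_cost(0.5, reference, others, lam=0.25))   # 1.0
```

Minimizing a cost of this kind during training would push the other kernels toward the reference kernel while still rewarding accuracy.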
[0020] Furthermore, the inventors of this disclosure have recognized that the processing speed of a system for processing artificial neural network models can be improved and / or its power consumption reduced by minimizing the reading of node and / or kernel weight values from a separate memory and by using simple operations within a neural processing unit (NPU) to compute and use the weight values of nodes and / or kernels close to the reference node and / or kernel.
[0021] Therefore, one aspect of this disclosure is to provide a neural processing unit and its operation method that can generate modulation kernels with a simple algorithm, reduce the number of memory read operations, and reduce power consumption.
[0022] However, this disclosure is not limited thereto, and other aspects will be clearly understood by those skilled in the art from the following description.
[0023] According to embodiments of this disclosure, a neural processing unit (NPU) including circuitry is provided. The circuitry may include: at least one processing element (PE) configured to process operations of an artificial neural network (ANN) model; and at least one memory configured to store a first kernel and a first kernel filter. The NPU may be configured to generate a first modulation kernel based on the first kernel and the first kernel filter.
[0024] The first kernel may include a K×M matrix, where K and M are integers, and the K×M matrix may include at least one first weight value or multiple weight values applicable to the first layer of the ANN model.
[0025] The first kernel filter can be configured to be generated based on the difference between at least one kernel weight value of the first kernel and at least one modulation kernel weight value of the first modulation kernel.
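The relationship between a kernel, a kernel filter, and a modulation kernel can be sketched as follows, assuming that the kernel filter is the element-wise difference and that the modulation kernel is regenerated by element-wise addition. The function names and example values are hypothetical.

```python
def make_kernel_filter(base_kernel, target_kernel):
    """Kernel filter as the element-wise difference (target minus base)."""
    return [
        [t - b for b, t in zip(row_b, row_t)]
        for row_b, row_t in zip(base_kernel, target_kernel)
    ]


def apply_kernel_filter(base_kernel, kernel_filter):
    """Regenerate a modulation kernel by adding the filter to the base kernel."""
    return [
        [b + f for b, f in zip(row_b, row_f)]
        for row_b, row_f in zip(base_kernel, kernel_filter)
    ]


base = [[8, 7, 6], [5, 4, 3], [2, 1, 0]]
target = [[8, 8, 6], [5, 3, 3], [2, 1, 1]]

kernel_filter = make_kernel_filter(base, target)
print(kernel_filter)                             # [[0, 1, 0], [0, -1, 0], [0, 0, 1]]
print(apply_kernel_filter(base, kernel_filter))  # equals target
```

Because similar kernels differ only slightly, the filter values stay small in this sketch even though the kernel weights themselves are larger.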
[0026] The first kernel filter is set during the training process of the ANN model.
[0027] The circuit can be configured to generate the first modulation kernel based on the first kernel and the first kernel filter.
[0028] The circuit can be configured to generate a second modulation kernel based on the first kernel and the second kernel filter. The second kernel filter can be generated by applying a mathematical function to the first kernel filter, and the mathematical function can include at least one of a delta function, a rotation function, a transpose function, a bias function, and a global weight function.
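As an illustrative sketch, a second kernel filter could be derived from the first by a rotation or transpose function and then combined with the first kernel. The 90-degree clockwise rotation, the element-wise addition, and all names below are assumptions for illustration, not details taken from this disclosure.

```python
def transpose(matrix):
    """Swap rows and columns."""
    return [list(col) for col in zip(*matrix)]


def rotate90(matrix):
    """Rotate the matrix clockwise by 90 degrees."""
    return [list(col) for col in zip(*matrix[::-1])]


def add(kernel, kernel_filter):
    """Element-wise addition of a kernel and a kernel filter."""
    return [
        [k + f for k, f in zip(row_k, row_f)]
        for row_k, row_f in zip(kernel, kernel_filter)
    ]


first_kernel = [[5, 5], [5, 5]]
first_filter = [[0, 1], [0, 0]]

# Only the first filter needs to be stored; the second is computed from it.
second_filter = rotate90(first_filter)
print(second_filter)                        # [[0, 0], [0, 1]]
print(add(first_kernel, second_filter))     # [[5, 5], [5, 6]]
print(transpose(first_filter))              # [[0, 0], [1, 0]]
```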
[0029] The circuit can be configured to generate a modulation kernel based on one of the first kernel, the first kernel filter, the mathematical function applied to the first kernel or the first kernel filter, the coefficients applied to the first kernel or the first kernel filter, and the offset applied to the first kernel or the first kernel filter.
[0030] The at least one memory may also be configured to store mapping information between at least one kernel and at least one kernel filter to generate at least one modulation kernel.
[0031] The ANN model includes information about the bit allocation of a first weight bit, which is included in a first kernel filter used for the first mode.
[0032] The NPU can operate in one of several modes, including: a first mode in which a first portion of a plurality of weight bits included in the first kernel is applied to the ANN model; and a second mode in which all of the plurality of weight bits included in the first kernel are applied to the ANN model. If the first portion is activated according to the first mode, the weight bits in the first portion can be selected.
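The two modes can be sketched as a simple bit selection. The disclosure does not state which bits form the first portion; the sketch below assumes unsigned 8-bit weights whose first portion is the upper 4 bits, and the function name and parameters are hypothetical.

```python
def select_weight_bits(weight, mode, total_bits=8, first_part_bits=4):
    """Return the weight bits that are active in the given mode.

    Assumptions (not from the disclosure): weights are unsigned, and the
    first portion consists of the most significant bits.
    """
    if not 0 <= weight < (1 << total_bits):
        raise ValueError("weight does not fit in total_bits")
    if mode == 1:
        # First mode: only the first portion of the weight bits is applied.
        return weight >> (total_bits - first_part_bits)
    if mode == 2:
        # Second mode: all weight bits are applied.
        return weight
    raise ValueError("mode must be 1 or 2")


weight = 0b10110110
print(bin(select_weight_bits(weight, mode=1)))  # 0b1011
print(bin(select_weight_bits(weight, mode=2)))  # 0b10110110
```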
[0033] The first kernel may include multiple weight bits grouped into a first part and a second part, and the first part and the second part may be configured to be used selectively.
[0034] The first kernel filter can be configured such that the bit width of the values in the first kernel filter is smaller than the bit width of the weights of the first kernel.
[0035] According to another embodiment of this disclosure, a method for driving an artificial neural network (ANN) model is provided. The method may include: performing multiple operations on the ANN model; and storing multiple kernels and multiple kernel filters for use in the multiple operations. The multiple operations may include generating multiple modulation kernels based on at least one of the multiple kernels and a corresponding kernel filter among the multiple kernel filters.
[0036] The plurality of operations performed on the ANN model may further include: setting an arbitrary kernel in the plurality of kernels of the ANN model, the arbitrary kernel corresponding to the base kernel in the plurality of kernels; and setting an arbitrary kernel filter in the plurality of kernel filters for the arbitrary kernel corresponding to the base kernel.
[0037] The plurality of operations performed for the ANN model may further include: training the ANN model based on a training dataset and a validation dataset according to an accuracy cost function and a weight size cost function; and determining mapping data between the base kernel among the plurality of kernels and any kernel filter among the plurality of kernel filters.
[0038] The plurality of operations performed for the ANN model can be performed by a neural processing unit (NPU) including circuitry, the circuitry including at least one processing element (PE) and at least one memory. The plurality of operations performed for the ANN model may further include: reading a first kernel from the plurality of kernels from the at least one memory; performing a first operation by applying the first kernel from the plurality of kernels to a first layer of the ANN model or a first channel of the ANN model; reading the kernel filter from the at least one memory; generating a first modulation kernel based on the first kernel from the plurality of kernels and the first kernel filter from the plurality of kernel filters; and performing a second operation for the ANN model by applying the first modulation kernel to a second layer of the ANN model or a second channel of the ANN model.
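The order of these steps can be sketched as follows. The dictionary standing in for the memory, the key names, the element-wise addition used to form the modulation kernel, and the single multiply-accumulate used as the "operation" are all simplifying assumptions for illustration.

```python
def apply_kernel(patch, kernel):
    """Multiply-accumulate of one input patch with a kernel of the same size."""
    return sum(p * k
               for row_p, row_k in zip(patch, kernel)
               for p, k in zip(row_p, row_k))


def run_two_channels(memory, patch_1, patch_2):
    # Read the first kernel from memory.
    kernel = memory["kernel_1"]
    # First operation: apply the first kernel to the first layer or channel.
    out_1 = apply_kernel(patch_1, kernel)
    # Read the kernel filter from memory.
    kernel_filter = memory["kernel_filter_1"]
    # Generate the first modulation kernel (assumed element-wise addition).
    modulation_kernel = [
        [k + f for k, f in zip(row_k, row_f)]
        for row_k, row_f in zip(kernel, kernel_filter)
    ]
    # Second operation: apply the modulation kernel to the second layer or channel.
    out_2 = apply_kernel(patch_2, modulation_kernel)
    return out_1, out_2


memory = {
    "kernel_1": [[1, 0], [0, 1]],
    "kernel_filter_1": [[0, 1], [0, 0]],
}
print(run_two_channels(memory, [[1, 2], [3, 4]], [[2, 1], [1, 3]]))  # (5, 6)
```

In this sketch the second kernel is never read from memory; it is regenerated from the first kernel and the kernel filter.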
[0039] According to another embodiment of this disclosure, an apparatus is provided. The apparatus may include: a semiconductor substrate on which conductive patterns are formed; at least one first memory electrically connected to the semiconductor substrate and configured to store information about a first kernel; and at least one neural processing unit (NPU) electrically connected to the substrate and configured to access the at least one first memory, the NPU including semiconductor circuitry comprising: at least one processing element (PE) configured to process operations of an artificial neural network (ANN) model, and at least one internal memory configured to store information about a first kernel filter. If the information about the first kernel is read from the at least one first memory, the first kernel may be stored in the at least one internal memory, and the operations of the ANN model may include generating a first modulation kernel based on the first kernel and the first kernel filter.
[0040] According to this disclosure, by generating at least one basic kernel and processing the convolution operation of the convolutional neural network, the power consumption required to read the corresponding kernel for each convolution operation can be reduced, and the memory read time can be reduced.
[0041] According to this disclosure, by utilizing a basic kernel and kernel filters, the number of kernels and / or the size of data stored in memory can be reduced.
[0042] Furthermore, due to the reduction in the amount of data read from memory to the core of the artificial neural network processor and / or the reduction in the number of memory read requests, it has the effect of reducing power consumption and memory read time.
[0043] Furthermore, according to this disclosure, the amount of data transfer and / or the number of memory read requests between the memory and the neural processing unit can be reduced. Since the occurrence of data starvation and / or idle time in the artificial neural network processor is reduced, the operating efficiency of the artificial neural network processor can be improved.
Attached Figure Description
[0044] Figure 1 is a schematic diagram illustrating a neural processing unit according to the present disclosure.
[0045] Figure 2 is a schematic diagram illustrating a processing element that can be applied to the processing element array of this disclosure.
[0046] Figure 3 is an example diagram showing a modified implementation of the neural processing unit 100 shown in Figure 1.
[0047] Figure 4 is a schematic diagram illustrating an exemplary artificial neural network model.
[0048] Figure 5A is an example diagram showing the configuration of an ANN driver including the neural processing unit 100 of Figure 1 or Figure 3. Figure 5B is an example diagram illustrating energy consumption during the operation of the neural processing unit 100.
[0049] Figure 6A is an example diagram showing a modified configuration of an ANN driver including the neural processing unit 100 of Figure 1 or Figure 3.
[0050] Figure 6B is an example diagram showing a modified configuration of an ANN driver including the neural processing unit 100 of Figure 1 or Figure 3.
[0051] Figure 7 is a diagram illustrating the basic structure of a convolutional neural network.
[0052] Figure 8 is a diagram showing the input data of the convolutional layer and the kernel used for convolution operations.
[0053] Figure 9 is a diagram illustrating the operation of a convolutional neural network that uses a kernel to generate activation maps.
[0054] Figure 10 is a general diagram illustrating, for better understanding, the operations of the convolutional neural network described in Figures 7 to 9.
[0055] Figure 11 is a diagram showing the generation of the kernel filter.
[0056] Figure 12 is an example diagram illustrating how to restore the original kernel or generate a kernel similar to the original kernel.
[0057] Figure 13 is an example diagram illustrating another example of restoring the original kernel or generating a kernel similar to the original kernel.
[0058] Figure 14 is an example diagram illustrating another example of restoring the original kernel or generating a kernel similar to the original kernel.
[0059] Figure 15 shows an example of generating another kernel by rotating the base kernel.
[0060] Figure 16 shows an example of generating another kernel by transposing the base kernel.
[0061] Figure 17 shows an example of generating another kernel by transposing the base kernel.
[0062] Figure 18 is an example diagram showing the kernel generation algorithm (or kernel recovery algorithm) arranged in a table for better understanding.
[0063] Figure 19 is an example diagram illustrating the concept of using multiple basic kernels and multiple kernel filters to recover the structure of an artificial neural network (e.g., a CNN) model.
[0064] Figure 20 is a flowchart illustrating the steps used to determine the basic kernel and kernel filters.
[0065] Figure 21 is a flowchart illustrating the steps after the kernel of a convolutional neural network is restored.
[0066] Figure 22 is an exemplary flowchart of the operation of the neural processing unit of Figure 1 or Figure 3.
[0067] Figures 23A and 23B are example diagrams showing the active bits of the kernel for each mode.
Detailed Implementation
[0068] The specific structural or step-by-step descriptions of embodiments according to the concept disclosed in this specification or application are given merely to illustrate those embodiments. Embodiments according to the concept of this disclosure may be embodied in various forms and should not be construed as limited to the embodiments described in this specification or application.
[0069] Because embodiments based on the concept of this disclosure can have various modifications and can take various forms, specific embodiments will be shown in the accompanying drawings and described in detail in this specification or application. However, this is not intended to limit embodiments based on the concept of this disclosure in terms of a particular form of disclosure, and should be understood to include all modifications, equivalents, and alternatives within the spirit and scope of this disclosure.
[0070] Terms such as first and / or second may be used to describe various elements, but these elements should not be limited by these terms. The terms above are used only to distinguish one element from another; for example, without departing from the scope of the conception according to this disclosure, a first element may be referred to as a second element, and similarly, a second element may be referred to as a first element.
[0071] When one element is said to be "connected" or "in contact" with another element, it should be understood that the element may be directly connected to or in contact with the other element, but that other elements may also be positioned between them. On the other hand, when it is said that an element is "directly connected" or "directly in contact" with another element, it should be understood that there are no other elements between them. Other expressions describing the relationship between elements, such as "between" and "immediately adjacent" or "nearby" and "closely adjacent," should be interpreted similarly.
[0072] The terminology used herein is for the purpose of describing particular embodiments only and is not intended to limit this disclosure. Singular expressions may include plural expressions unless the context clearly specifies otherwise.
[0073] It should be understood that, as used herein, terms such as “comprising” or “having” are intended to indicate the presence of the stated features, quantities, steps, actions, components, parts, or combinations thereof, but do not preclude the possibility of the addition or presence of at least one other feature or quantity, step, action, element, part, or combination thereof.
[0074] Unless otherwise defined, all terms used herein, including technical or scientific terms, shall have the same meaning as commonly understood by one of ordinary skill in the art to which this disclosure pertains. Terms such as those defined in common dictionaries shall be construed as having meanings consistent with their meanings in the relevant technical context and shall not be construed as having ideal or overly formal meanings unless expressly defined in this specification.
[0075] When describing the implementation scheme, descriptions of technical content that is well-known in the technical field to which this disclosure pertains and is not directly related to this disclosure may be omitted. This is to more clearly convey the main points of this disclosure without obscuring them by omitting unnecessary descriptions.
[0076] In the following text, for the purpose of understanding the disclosure presented in this specification, the terminology used in this specification will be briefly summarized.
[0077] NPU: An abbreviation for Neural Processing Unit, which can refer to a processor specifically designed for computing artificial neural network models, separate from the Central Processing Unit (CPU).
[0078] ANN: An abbreviation for Artificial Neural Network. In an attempt to mimic human intelligence, it refers to a network of nodes connected by synapses in a layered structure, mimicking the neurons in the human brain.
[0079] Information about the structure of an artificial neural network: Information that includes the number of layers, the number of nodes in each layer, the value of each node, the computational processing methods, the weight matrix applied to each node, and so on.
[0080] Information about data locality in artificial neural networks: Information that allows neural processing units to predict the order of operations of an artificial neural network model processed by a neural processing unit based on the order of data access requests to individual memory locations.
[0081] DNN: an abbreviation for Deep Neural Network, which can refer to increasing the number of hidden layers in an artificial neural network to achieve a higher level of artificial intelligence.
[0082] CNN: An abbreviation for Convolutional Neural Network, a type of neural network that functions similarly to image processing in the human brain's visual cortex. Convolutional neural networks are well known for their suitability for image processing and are generally considered good at extracting features from input data and recognizing patterns of those features.
[0083] Kernel: It can refer to the weight matrix applied to a CNN.
[0084] Basic kernel: It can refer to the representative kernel among the multiple kernels applied to a CNN.
[0085] Kernel filter: It can refer to a value, or a matrix containing values, used to generate or regenerate another kernel from a basic kernel.
[0086] The present disclosure will be described in detail below with reference to the accompanying drawings, which illustrate preferred embodiments of the present disclosure.
[0087] Figure 1 shows a neural processing unit according to this disclosure.
[0088] The neural processing unit (NPU) 100 shown in Figure 1 is a processor specifically designed to perform operations on artificial neural networks.
[0089] An artificial neural network is a network of artificial neurons that, upon receiving multiple inputs or stimuli, multiply the inputs by weights and sum them, add a bias, and transform and transmit the resulting value through an activation function. An artificial neural network trained in this way can be used to output inference results from input data.
[0090] The neural processing unit 100 may be a semiconductor implemented as an electrical / electronic circuit. The electrical / electronic circuit may include multiple electronic components (e.g., transistors and capacitors). The neural processing unit 100 may include a processing element (PE) array 110, an NPU internal memory 120, an NPU scheduler 130, and an NPU interface 140. Each of the processing element array 110, the NPU internal memory 120, the NPU scheduler 130, and the NPU interface 140 may be a semiconductor circuit connected to multiple transistors. Therefore, some of these may be difficult to identify and distinguish with the human eye and may only be identifiable by their operation. For example, any circuit may operate as the processing element array 110 or as the NPU scheduler 130.
[0091] The neural processing unit 100 may include a processing element array 110, an NPU internal memory 120 configured to store an artificial neural network model that can be inferred by the processing element array 110, and an NPU scheduler 130 configured to control the processing element array 110 and the NPU internal memory 120 based on information about the data locality or structure of the artificial neural network model. Here, the artificial neural network model may include information about the data locality or structure of the artificial neural network model. The artificial neural network model may refer to an AI recognition model trained to perform a specific inference function.
[0092] The processing element array 110 can perform operations of an artificial neural network. For example, when input data is input, the processing element array 110 can train the artificial neural network. After training is complete, if input data is input, the processing element array 110 can perform operations to derive inference results through the trained artificial neural network.
[0093] The NPU interface 140 can communicate via a system bus with the various components (e.g., memory) of the ANN driver shown in Figure 5A, Figure 6A, or Figure 6B.
[0094] For example, the neural processing unit 100 can load the data of the artificial neural network model stored in the memory 200 of Figure 5A, Figure 6A, or Figure 6B into the NPU internal memory 120 through the NPU interface 140.
[0095] The NPU scheduler 130 can be configured to control the operation of the processing element array 110 for inference operations of the neural processing unit 100 and read and write sequences of the NPU internal memory 120.
[0096] The NPU scheduler 130 can be configured to control the processing element array 110 and the NPU internal memory 120 by analyzing data locality information or information about the structure of the artificial neural network model.
[0097] The NPU scheduler 130 can analyze the structure of the artificial neural network model for operation within the processing element array 110. The artificial neural network model may include artificial neural network data that can store node data for each layer, information about the locality or structure of the layer arrangement data, and weight data for each connection network connecting the nodes of each layer. The artificial neural network data can be stored in memory provided within the NPU scheduler 130 or in the NPU internal memory 120. The NPU scheduler 130 can access the memory 200 of Figure 5A, Figure 6A, or Figure 6B to utilize the necessary data. However, this disclosure is not limited thereto; that is, data locality information or information about the structure of an artificial neural network model can be generated from data such as node data and weight data. Weight data may also be referred to as weight kernels. Node data may also be referred to as feature maps. For example, data defining the structure of an artificial neural network model can be generated when designing the model or completing training. However, this disclosure is not limited thereto.
[0098] The NPU scheduler 130 can schedule the operation sequence of an artificial neural network model based on the data locality information or structural information of the artificial neural network model.
[0099] The NPU scheduler 130 can obtain the memory address values of the node data of the layers and the weight data of the connections in the artificial neural network model based on the data locality information or structural information of the model. For example, the NPU scheduler 130 can obtain the memory address values of the node data of the layers and the weight data of the connections in the artificial neural network model stored in memory. Therefore, the NPU scheduler 130 can retrieve the node data of the layers and the weight data of the connections in the artificial neural network model to be driven from memory 200 and store them in the NPU's internal memory 120. Each layer's node data can have its own corresponding memory address value. Each connection's weight data can have its own corresponding memory address value.
[0100] The NPU scheduler 130 can schedule the operation sequence of the processing element array 110 based on the data locality information or structural information of the artificial neural network model (e.g., information about the locality information or structure of the layers of the artificial neural network model).
[0101] Because the NPU scheduler 130 performs scheduling based on the data locality or structural information of the artificial neural network model, its operation can differ from the general CPU scheduling concept. General CPU scheduling aims for the best efficiency by considering fairness, efficiency, stability, and response time. That is, it schedules jobs so that the most processing is performed in the same amount of time, taking into account priority and operation time.
[0102] Taking into account data such as the priority order of each process and the processing time of operations, traditional CPUs use algorithms for scheduling tasks.
[0103] However, the NPU scheduler 130 can determine the processing sequence based on information about the data locality or structure of the artificial neural network model.
[0104] Furthermore, the NPU scheduler 130 can determine the processing sequence to be used based on information about the data locality or structure of the artificial neural network model and / or information about the data locality or structure of the neural processing unit 100.
[0105] However, this disclosure is not limited to information regarding the locality or structure of the neural processing unit 100. For example, the processing order can be determined by utilizing at least one of the following items of information regarding the locality or structure of the neural processing unit 100: the memory size of the NPU internal memory 120, the hierarchical structure of the NPU internal memory 120, the number of processing elements PE1 to PE12, and the operator architecture of processing elements PE1 to PE12. That is, information regarding the locality or structure of the neural processing unit 100 may include at least one of the following: the memory size of the NPU internal memory 120, the hierarchical structure of the NPU internal memory 120, and the operator architecture of processing elements PE1 to PE12. The memory size of the NPU internal memory 120 may include information about the memory capacity. The hierarchical structure of the NPU internal memory 120 may include information about the connection relationships between specific levels of each hierarchical structure. The operator architecture of processing elements PE1 to PE12 may include information about the components within the processing elements.
[0106] The neural processing unit 100 according to embodiments of the present disclosure may include at least one processing element, an NPU internal memory 120 for storing an artificial neural network model that can be inferred by the at least one processing element, and an NPU scheduler 130 configured to control the at least one processing element and the NPU internal memory 120 based on data locality information or structural information of the artificial neural network model. The NPU scheduler 130 may be configured to further receive information regarding the data locality or structure of the neural processing unit 100. Furthermore, that information may include at least one of the memory size of the NPU internal memory 120, the hierarchical structure of the NPU internal memory 120, the number of the at least one processing element, and the operator architecture of the at least one processing element.
[0107] Based on the structure of the artificial neural network model, calculations are performed sequentially for each layer. In other words, once the structure of the artificial neural network model is determined, the sequence of operations for each layer can be determined. This sequence of operations or data flow based on the structure of the artificial neural network model can be defined as the data locality of the artificial neural network model at the algorithmic level.
[0108] When the compiler compiles the neural network model to be executed in the neural processing unit 100, it can reconstruct the neural network data locality at the neural processing unit-memory level.
[0109] In other words, the data locality of the neural network model at the neural processing unit-memory level can be configured according to the compiler, the algorithm applied to the neural network model, and the operating characteristics of the neural processing unit 100.
[0110] For example, even with the same artificial neural network model, the artificial neural network data locality of the model to be processed can be configured differently depending on the method by which the neural processing unit 100 computes the corresponding artificial neural network model. This method includes factors such as feature map tiling, the stationary technique of the processing elements, the number of processing elements in the neural processing unit 100, the cache size for the feature map and weights in the neural processing unit 100, the memory hierarchy in the neural processing unit 100, and the algorithm characteristics of the compiler that determine the sequence of computational operations of the neural processing unit 100 used to process the artificial neural network model. This is because, depending on the above factors, the neural processing unit 100 can determine the order of data required for each operation differently in units of clock cycles, even when processing the same artificial neural network model.
[0111] The compiler can configure the neural network data locality of the neural network model at the neural processing unit-memory level in the word unit of the neural processing unit 100 to determine the data sequence required for physical operation processing.
[0112] In other words, the neural network data locality of an artificial neural network model existing at the neural processing unit-memory level can be defined as information that enables the neural processing unit 100 to predict the operation sequence of the artificial neural network model processed by the neural processing unit 100 based on the sequence of data access requests requested from the memory 200.
[0113] The NPU scheduler 130 can be configured to store information about the data locality or structure of the artificial neural network.
[0114] In other words, even using only information about the data locality or structure of the artificial neural network model, the NPU scheduler 130 can determine the processing sequence. Specifically, the NPU scheduler 130 can determine the operation sequence by using information about the data locality or structure from the input layer to the output layer of the artificial neural network. For example, input layer operations can be scheduled first, and output layer operations can be scheduled last. Therefore, when information about the data locality or structure of the artificial neural network model is provided to the NPU scheduler 130, all operation sequences of the artificial neural network model can be known. Thus, it has the effect of being able to determine all scheduling sequences.
[0115] Furthermore, the NPU scheduler 130 can determine the processing sequence by considering data locality information or structural information about the artificial neural network model and data locality information or information about the structure of the neural processing unit 100. Additionally, the NPU scheduler 130 can optimize processing for each determined sequence.
[0116] Therefore, when the NPU scheduler 130 receives information about the data locality or structure of the artificial neural network model and information about the data locality or structure of the neural processing unit 100, it has the effect of further improving the computational efficiency of each scheduling sequence determined by the data locality or structure information of the artificial neural network model. For example, the NPU scheduler 130 can obtain network data with four artificial neural network layers and three layers of weight data connecting each layer. In this case, a method by which the NPU scheduler 130 schedules the processing sequence based on information about the data locality or structure of the artificial neural network model is described below by way of example.
[0117] For example, the NPU scheduler 130 can schedule the input data for an inference operation to be set as the node data of the first layer, which is the input layer of the artificial neural network model, and then schedule a multiplication and accumulation (MAC) operation to be performed first on the node data of the first layer and the weight data of the first connection network corresponding to the first layer. However, the examples of this disclosure are not limited to MAC operations; multipliers and adders can be used to perform artificial neural network operations, and the multipliers and adders can be modified and implemented in various ways. In the following, for ease of description, the corresponding operation may be referred to as the first operation, the result of the first operation as the first operation value, and the corresponding schedule as the first schedule.
[0118] For example, the NPU scheduler 130 can set the first operation value as the node data of the second layer corresponding to the first connection network, and can schedule the MAC operation of the node data of the second layer and the weight data of the second connection network corresponding to the second layer to be executed after the first schedule. In the following, for ease of description, the corresponding operation may be referred to as the second operation, the result of the second operation as the second operation value, and the corresponding schedule as the second schedule.
[0119] For example, the NPU scheduler 130 can set the second operation value as the node data of the third layer corresponding to the second connection network, and can schedule the MAC operation of the node data of the third layer and the weight data of the third connection network corresponding to the third layer to be executed after the second schedule. In the following, for ease of description, the corresponding operation may be referred to as the third operation, the result of the third operation as the third operation value, and the corresponding schedule as the third schedule.
[0120] For example, the NPU scheduler 130 can set the third operation value to the node data of the fourth layer corresponding to the third connection network, and can schedule the inference results stored in the node data of the fourth layer to be stored in the NPU internal memory 120. In the following text, for ease of description, the corresponding schedule can be referred to as the fourth schedule.
[0121] In summary, the NPU scheduler 130 can control the NPU internal memory 120 and the processing element array 110 to execute operations in a first, second, third, and fourth scheduling sequence. That is, the NPU scheduler 130 can be configured to control the NPU internal memory 120 and the processing element array 110 to execute operations in a set scheduling sequence.
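The four-schedule example above can be sketched in Python. This is a minimal illustration under stated assumptions, not the scheduler of this disclosure: the names `mac`, `build_schedule`, and `run_schedule` are hypothetical, node data are plain lists, and each connection network is a list of weight rows (one row per input node).

```python
def mac(nodes, weights):
    """Multiply-accumulate: one output node per weight column."""
    n_out = len(weights[0])
    return [sum(x * row[j] for x, row in zip(nodes, weights)) for j in range(n_out)]


def build_schedule(num_layers):
    """Derive the operation order from the model structure alone.

    Schedules 1..num_layers-1 are MAC operations on successive layers;
    the last schedule stores the inference result.
    """
    steps = [("mac", k) for k in range(1, num_layers)]
    steps.append(("store", num_layers))
    return steps


def run_schedule(input_data, weight_list):
    """Execute the schedules in order (hypothetical stand-in for the NPU)."""
    nodes = list(input_data)          # node data of the first (input) layer
    internal_memory = {}              # toy stand-in for NPU internal memory
    for op, layer in build_schedule(len(weight_list) + 1):
        if op == "mac":
            # The operation value of layer k becomes node data of layer k+1.
            nodes = mac(nodes, weight_list[layer - 1])
        else:
            internal_memory["inference_result"] = nodes
    return internal_memory["inference_result"]
```

With four layers and three connection networks, `build_schedule(4)` yields three MAC schedules followed by one store schedule, matching the first to fourth schedules described above.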
[0122] In summary, the neural processing unit 100 according to the embodiments of this disclosure can be configured to schedule processing sequences based on the structure of layers of an artificial neural network and the operation sequence data corresponding to that structure.
[0123] For example, the NPU scheduler 130 can be configured to schedule processing sequences based on structural data from the input layer to the output layer of the artificial neural network model or data locality information of the artificial neural network.
[0124] The NPU scheduler 130 controls the NPU internal memory 120 by utilizing scheduling sequences based on the artificial neural network model structure data or the artificial neural network data locality information. Therefore, it has the effect of improving the operating speed of the neural processing unit and the memory reuse rate.
[0125] Due to the nature of the artificial neural network operation driven by the neural processing unit 100 according to the embodiments of this disclosure, the operation value of one layer characteristically becomes the input data of the next layer.
[0126] Therefore, when the neural processing unit 100 controls the NPU internal memory 120 according to the scheduling sequence, it has the effect of improving the memory reuse rate of the NPU internal memory 120. Memory reuse can be determined by the number of times data stored in memory is read. For example, if specific data is stored in memory, and then that specific data is read only once, and then the corresponding data is deleted or overwritten, the memory reuse rate can be 100%. For example, if specific data is stored in memory, and that specific data is read four times, and then the corresponding data is deleted or overwritten, the memory reuse rate may be 400%. Memory reuse rate can be defined as the number of times initially stored data is reused. That is, memory reuse can mean reusing data stored in memory or a specific memory address storing specific data.
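The reuse-rate definition above (one read gives 100%, four reads give 400%) can be illustrated with a small read counter. This is only a sketch; the class name `TrackedBuffer` and its methods are hypothetical and do not appear in this disclosure.

```python
class TrackedBuffer:
    """Data stored once in memory, with a counter of how often it is read."""

    def __init__(self, data):
        self._data = data
        self.reads = 0

    def read(self):
        self.reads += 1
        return self._data

    def reuse_rate_percent(self):
        # Reuse rate as defined in the text: 100% per read of the stored data.
        return self.reads * 100
```

For example, a buffer that is read four times before being overwritten reports a reuse rate of 400%.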
[0127] Specifically, when the NPU scheduler 130 is configured to receive structural data or data locality information of the artificial neural network model, and the sequence of artificial neural network operations can be determined from the provided structural data or data locality information, the NPU scheduler 130 recognizes that the operation result of the node data of a specific layer of the artificial neural network model and the weight data of a specific connection network becomes the node data of the next corresponding layer.
[0128] Therefore, the NPU scheduler 130 can reuse the memory address value storing the result of a specific operation in subsequent operations. This improves memory reuse efficiency.
[0129] For example, the first operation value of the first schedule is set to the node data of the second layer of the second schedule. Specifically, the NPU scheduler 130 can reset the memory address value corresponding to the first operation value of the first schedule stored in the NPU internal memory 120 to the memory address value corresponding to the node data of the second layer of the second schedule. That is, the memory address value can be reused. Therefore, since the NPU scheduler 130 reuses the data at the memory address of the first schedule, the NPU internal memory 120 has the effect of utilizing the second layer node data of the second schedule without a separate memory write operation.
[0130] For example, the second operation value of the second scheduling described above is set to the node data of the third layer of the third scheduling. Specifically, the NPU scheduler 130 can reset the memory address value corresponding to the second operation value of the second scheduling stored in the NPU internal memory 120 to the memory address value corresponding to the node data of the third layer of the third scheduling. That is, the memory address value can be reused. Therefore, since the NPU scheduler 130 reuses the data at the memory address of the second scheduling, the NPU internal memory 120 has the effect of utilizing the third layer node data of the third scheduling without a separate memory write operation.
[0131] For example, the third operation value of the third schedule described above is set to the node data of the fourth layer of the fourth schedule. Specifically, the NPU scheduler 130 can reset the memory address value corresponding to the third operation value of the third schedule stored in the NPU internal memory 120 to the memory address value corresponding to the node data of the fourth layer of the fourth schedule. That is, the memory address value can be reused. Therefore, since the NPU scheduler 130 reuses the data at the memory address of the third schedule, the NPU internal memory 120 has the effect of utilizing the fourth layer node data of the fourth schedule without a separate memory write operation.
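The address reuse described in the three preceding examples can be sketched as follows. This is a toy model under stated assumptions: `Memory`, `dot_layer`, and `infer` are hypothetical names, memory is a dictionary from address to value, and the non-reusing variant copies each result to a fresh input address purely for comparison.

```python
class Memory:
    """Toy stand-in for the NPU internal memory with a write counter."""

    def __init__(self):
        self.cells = {}
        self.writes = 0

    def write(self, addr, value):
        self.cells[addr] = value
        self.writes += 1

    def read(self, addr):
        return self.cells[addr]


def dot_layer(nodes, weights):
    """MAC of node data with one connection network (rows = input nodes)."""
    return [sum(x * row[j] for x, row in zip(nodes, weights))
            for j in range(len(weights[0]))]


def infer(mem, input_data, weight_list, reuse):
    in_addr = 0
    mem.write(in_addr, input_data)
    next_free = 1
    for weights in weight_list:
        out_addr = next_free
        next_free += 1
        mem.write(out_addr, dot_layer(mem.read(in_addr), weights))
        if reuse:
            # The address of the operation value is reset to be the address
            # of the next layer's node data: no extra write is needed.
            in_addr = out_addr
        else:
            # Without structure information, the result is copied to a
            # separate input address before the next operation.
            in_addr = next_free
            next_free += 1
            mem.write(in_addr, mem.read(out_addr))
    return mem.read(in_addr)
```

In this sketch, a model with three connection networks needs four writes with address reuse and seven without, while producing the same result.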
[0132] Furthermore, the NPU scheduler 130 can also be configured to control the NPU internal memory 120 by determining the scheduling sequence and memory reuse. In this case, the NPU scheduler 130 can provide efficient scheduling by analyzing the artificial neural network model structure data or the locality information of the artificial neural network data. Moreover, since the data required for memory reuse operations is not copied and stored in the NPU internal memory 120, it has the effect of reducing memory usage. Furthermore, the NPU scheduler 130 has the effect of improving the efficiency of the NPU internal memory 120 by calculating the reduced memory usage due to memory reuse.
[0133] Furthermore, the NPU scheduler 130 can be configured to monitor the resource usage of the NPU internal memory 120 and the resource usage of processing elements PE1 to PE12 based on the structural data of the neural processing unit 100. Therefore, it has the effect of improving the hardware resource utilization efficiency of the neural processing unit 100.
[0134] The NPU scheduler 130 of the neural processing unit 100 according to the embodiments of this disclosure has the effect of reusing memory by utilizing artificial neural network model structure data or artificial neural network data locality information.
[0135] In other words, when the artificial neural network model is a deep neural network, the number of layers and connections can be significantly increased, which can further maximize the effect of memory reuse.
[0136] In other words, if the neural processing unit 100 does not recognize the structural data or the locality information of the artificial neural network data and the operation sequence of the artificial neural network model, the NPU scheduler 130 cannot determine whether to reuse the storage of values in the NPU internal memory 120. Therefore, the NPU scheduler 130 unnecessarily generates the memory addresses required for each processing operation, and essentially the same data must be copied from one memory address to another. This results in unnecessary memory read / write operations and duplicate values stored in the NPU internal memory 120, potentially leading to unnecessary memory waste.
[0137] Processing element array 110 refers to a configuration in which a plurality of processing elements PE1 to PE12 are arranged, the processing elements PE1 to PE12 being configured to compute node data and weight data of the connection network of an artificial neural network. Each processing element may be configured to include a multiplication and accumulation (MAC) operator and / or an arithmetic logic unit (ALU) operator. However, embodiments according to this disclosure are not limited thereto. Processing element array 110 may be referred to as a plurality of processing elements, and each processing element may operate independently of each other, or a group of processing elements may operate as a group.
[0138] Although Figure 2 illustrates a plurality of processing elements by way of example, operators implemented as a tree of multiple multipliers and adders can also be arranged in parallel in place of the MAC in one processing element. In this case, the processing element array 110 may be referred to as at least one processing element comprising multiple operators.
[0139] The processing element array 110 is configured to include a plurality of processing elements PE1 to PE12. The plurality of processing elements PE1 to PE12 in Figure 2 are merely examples for descriptive convenience, and the number of processing elements PE1 to PE12 is not limited thereto. The size or number of processing element arrays 110 can be determined by the number of processing elements PE1 to PE12. The size of the processing element array 110 can be implemented in the form of an N×M matrix. Here, N and M are integers greater than zero. The processing element array 110 can include N×M processing elements. That is, it can have at least one processing element.
[0140] The size of the processing element array 110 can be designed taking into account the characteristics of the artificial neural network model in which the neural processing unit 100 operates. In other words, the number of processing elements can be determined by considering the amount of data in the artificial neural network model to be operated, the required operating speed, the required power consumption, etc. The amount of data in the artificial neural network model can be determined based on the number of layers in the artificial neural network model and the amount of weight data in each layer.
[0141] Therefore, the size of the processing element array 110 of the neural processing unit 100 according to the embodiments of this disclosure is not limited thereto. As the number of processing elements in the processing element array 110 increases, the parallel computing power available to the running artificial neural network model increases, but the manufacturing cost and physical size of the neural processing unit 100 may also increase.
[0142] For example, the artificial neural network model operating in the neural processing unit 100 can be an artificial neural network trained to detect thirty specific keywords, i.e., an AI keyword recognition model. In this case, considering the computational complexity, the size of the processing element array 110 of the neural processing unit 100 can be designed to be 4×3. In other words, the neural processing unit 100 can be configured to include twelve processing elements. However, it is not limited to this, and the number of multiple processing elements PE1 to PE12 can be selected, for example, from 8 to 16,384. That is, the embodiments of this disclosure are not limited in terms of the number of processing elements.
[0143] The processing element array 110 is configured to perform functions such as addition, multiplication, and accumulation required for artificial neural network operations. In other words, the processing element array 110 can be configured to perform multiplication and accumulation (MAC) operations.
[0144] In the following, the first processing element PE1 of the processing element array 110 is described with reference to Figure 2.
[0145] Figure 2 is a schematic conceptual diagram of one processing element (i.e., PE1) of the array of processing elements PE1 to PE12 of Figure 1 that can be applied to this disclosure.
[0146] Referring again to Figure 1, the neural processing unit 100 according to an embodiment of the present disclosure includes: a processing element array 110; an NPU internal memory 120 configured to store an artificial neural network model that can be inferred by the processing element array 110 or to store at least some data of the artificial neural network model; and an NPU scheduler 130 configured to control the processing element array 110 and the NPU internal memory 120 based on the artificial neural network model structure data or artificial neural network data locality information. The processing element array 110 can be configured to quantize and output MAC operation results. However, embodiments of the present disclosure are not limited thereto.
[0147] The NPU's internal memory 120 can store all or part of the artificial neural network model, depending on the memory size and data volume of the model.
[0148] Referring to Figure 2, the first processing element PE1 can be configured to include a multiplier 111, an adder 112, an accumulator 113, and a bit quantization unit 114. However, embodiments according to this disclosure are not limited thereto, and the processing element array 110 can be modified in consideration of the computational characteristics of artificial neural networks.
[0149] Multiplier 111 multiplies the received (N)-bit data and (M)-bit data. The output of multiplier 111 is (N+M)-bit data, where N and M are integers greater than zero. A first input unit for receiving (N)-bit data can be configured to receive values with characteristics such as variability, and a second input unit for receiving (M)-bit data can be configured to receive values with characteristics such as constancy. When the NPU scheduler 130 distinguishes between variable and constant values, it improves the memory reuse rate of the NPU internal memory 120. However, the input data of multiplier 111 is not limited to constant and variable values. That is, according to embodiments of this disclosure, since the input data of the processing element can be manipulated by understanding the characteristics of variable and constant values, the computational efficiency of the neural processing unit 100 can be improved. However, the neural processing unit 100 is not limited to the characteristics of constant and variable values of the input data.
[0150] Here, a value with variable-like properties, or a variable, means that the value at the memory address storing the corresponding value is updated whenever the input data is updated. For example, the node data of each layer can be a MAC operation value to which the weight data of the artificial neural network model has been applied. When object recognition is performed on moving image data using the corresponding artificial neural network model, the node data of each layer changes because the input image changes every frame.
[0151] Here, the meaning of values with constant-like characteristics or the meaning of constants means that the memory address storing the corresponding value is preserved, regardless of how the input data is updated. For example, the weight data of the connection network is the only inference criterion of the artificial neural network model. Even if the artificial neural network model infers object recognition from moving image data, the weight data of the connection network can remain unchanged.
[0152] In other words, multiplier 111 can be configured to receive a variable and a constant. More specifically, the variable value input to the first input unit can be node data of a layer in the artificial neural network, which can be input data of the input layer, accumulated values of the hidden layers, and accumulated values of the output layer. The constant value input to the second input unit can be weight data of the connection network of the artificial neural network.
[0153] The NPU scheduler 130 can be configured to improve memory reuse by taking into account the characteristics of constant values.
[0154] The variable values are the calculated values of each layer, and the NPU scheduler 130 can control the NPU internal memory 120 to identify reusable variable values and reuse the memory based on the artificial neural network model structure data or the locality information of the artificial neural network data.
[0155] The constant values are the weight data for each network. The NPU scheduler 130 can control the NPU internal memory 120 to identify the constant values of the reusable connected networks and reuse the memory based on the artificial neural network model structure data or the locality information of the artificial neural network data.
[0156] In other words, the NPU scheduler 130 identifies reusable variable values and reusable constant values based on the structural data or locality information of the artificial neural network model, and the NPU scheduler 130 can be configured to control the NPU internal memory 120 to reuse the memory.
[0157] When zero is input to either the first or second input unit of multiplier 111, the processing element knows that the result of the operation is zero even if the operation is not performed. Therefore, the operation of multiplier 111 can be restricted so that the operation is not performed.
[0158] For example, when zero is input to one of the first and second input units of multiplier 111, multiplier 111 can be configured to operate in a zero-skipping mode.
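The behaviour of bypassing the multiplier on a zero input can be sketched in software as follows. This is an illustrative model only; the function names and the `stats` dictionary are hypothetical, and a hardware implementation would gate the multiplier rather than branch.

```python
def multiply_zero_skip(a, b, stats):
    """Return a * b, skipping the multiplication when either input is zero."""
    if a == 0 or b == 0:
        stats["skipped"] += 1     # result is known to be zero without computing
        return 0
    stats["multiplied"] += 1
    return a * b


def mac_zero_skip(nodes, weights):
    """Accumulate products of node data and weight data with zero skipping."""
    stats = {"skipped": 0, "multiplied": 0}
    acc = 0
    for x, w in zip(nodes, weights):
        acc += multiply_zero_skip(x, w, stats)
    return acc, stats
```

For node data `[0, 2, 3, 0]` and weight data `[5, 0, 4, 7]`, only one of the four multiplications is actually performed.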
[0159] The number of bits of data input to the first and second input units can be determined based on the quantization of the node data and weight data of each layer of the artificial neural network model. For example, the node data of the first layer can be quantized to five bits, and the weight data of the first layer can be quantized to seven bits. In this case, the first input unit can be configured to receive five bits of data, and the second input unit can be configured to receive seven bits of data.
[0160] When quantized data stored in the NPU's internal memory 120 is input to the input terminal of the processing element, the neural processing unit 100 can control the number of quantization bits to be converted in real time. That is, the number of quantization bits for each layer can be different, and when the number of bits of the input data is converted, the processing element can be configured to receive bit information from the neural processing unit 100 in real time and convert the number of bits in real time to generate the input data.
[0161] Accumulator 113 uses adder 112 to accumulate the operation value of multiplier 111 with the value held in accumulator 113 over L loops. Therefore, the data in the output unit and input unit of accumulator 113 can have (N+M+log2(L)) bits, where L is a positive integer.
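The (N+M+log2(L)) bit width can be checked numerically. The sketch below assumes unsigned operands and rounds log2(L) up to the next integer when L is not a power of two; both are assumptions not stated in the text, and the function names are hypothetical.

```python
def accumulator_bits(n_bits, m_bits, loops):
    """Bits needed to accumulate `loops` products of N-bit and M-bit values."""
    # (loops - 1).bit_length() equals ceil(log2(loops)) for loops >= 1.
    return n_bits + m_bits + (loops - 1).bit_length()


def worst_case_sum(n_bits, m_bits, loops):
    """Largest possible accumulated value for unsigned inputs."""
    max_product = (2 ** n_bits - 1) * (2 ** m_bits - 1)
    return max_product * loops
```

Using the five-bit node data and seven-bit weight data mentioned earlier, sixteen accumulation loops give a 16-bit accumulator, and the worst-case sum fits within those 16 bits.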
[0162] When the accumulation is complete, the accumulator 113 may receive an initialization reset to initialize the data stored in the accumulator 113 to zero. However, embodiments according to this disclosure are not limited thereto.
[0163] The bit quantization unit 114 can reduce the number of bits of data output from the accumulator 113. The bit quantization unit 114 can be controlled by the NPU scheduler 130. The number of bits of quantized data can be output as X bits, where X is a positive integer. According to the above configuration, the processing element array 110 is configured to perform MAC operations, and the processing element array 110 has the function of quantizing and outputting the result of the MAC operation. In particular, as the number of L loops increases, this quantization has the effect of further reducing power consumption. In addition, if power consumption is reduced, it also has the effect of reducing the heat generation of the edge device. In particular, reducing heat generation has the effect of reducing the possibility of failure due to the high temperature of the neural processing unit 100.
[0164] The X-bit output data of the bit quantization unit 114 can be node data of the next layer or input data of a convolution. If the artificial neural network model has been quantized, the bit quantization unit 114 can be configured to receive quantization information from the artificial neural network model. However, it is not limited to this; the NPU scheduler 130 can be configured to extract quantization information by analyzing the artificial neural network model. Therefore, the X-bit output data can be converted to the quantized number of bits so as to correspond to the quantized data size, and then output. The X-bit output data of the bit quantization unit 114 can be stored in the NPU internal memory 120 with the quantized number of bits.
[0165] The processing element array 110 of the neural processing unit 100 according to an embodiment of this disclosure includes a multiplier 111, an adder 112, an accumulator 113, and a bit quantization unit 114. The processing element array 110 can reduce the number of bits of (N+M+log2(L)) bit data output from the accumulator 113 to X bits via the bit quantization unit 114. The NPU scheduler 130 can control the bit quantization unit 114 to reduce the number of bits of the output data from the least significant bit (LSB) to the most significant bit (MSB) by a predetermined number of bits. When the number of bits of the output data is reduced, power consumption, computational load, and memory usage can be reduced. However, when the number of bits is reduced below a certain length, the inference accuracy of the artificial neural network model may rapidly decrease. Accordingly, the reduction in the number of bits of the output data (i.e., the degree of quantization) can be determined based on the reduction in power consumption, computational load, and memory usage in relation to the decrease in the inference accuracy of the artificial neural network model. The degree of quantization can also be determined by determining the target inference accuracy of the artificial neural network model and testing it while gradually reducing the number of bits. The degree of quantization can be determined for each operation value of each layer.
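One possible reading of this bit reduction is to drop low-order bits of the accumulator output with a right shift. The sketch below assumes unsigned values and plain truncation without rounding or saturation; the text does not specify these details, and the function names are hypothetical.

```python
def quantize_bits(value, in_bits, out_bits):
    """Reduce an unsigned in_bits-wide value to out_bits by dropping LSBs."""
    shift = max(in_bits - out_bits, 0)
    return value >> shift


def truncation_error(value, in_bits, out_bits):
    """Information lost by the reduction, measured at the original scale."""
    shift = max(in_bits - out_bits, 0)
    return value - ((value >> shift) << shift)
```

Comparing `truncation_error` across decreasing values of `out_bits` gives a rough sense of the trade-off described above: fewer output bits reduce storage and computation but lose more of the original value, so the degree of quantization would in practice be chosen against a target inference accuracy.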
[0166] According to the first processing element PE1, by adjusting the number of bits of the N-bit and M-bit data of the multiplier 111 and reducing the operation value to X bits by the bit quantization unit 114, the processing element array 110 has the effect of improving the MAC operation speed while reducing power consumption, and has the effect of performing convolution operations of artificial neural networks more efficiently.
[0167] The NPU internal memory 120 of the neural processing unit 100 may be a memory system configured with consideration of the MAC operation characteristics and power consumption characteristics of the processing element array 110.
[0168] For example, taking into account the MAC operation characteristics and power consumption characteristics of the processing element array 110, the neural processing unit 100 can be configured to reduce the number of bits in the operation values of the processing element array 110.
[0169] The NPU internal memory 120 of the neural processing unit 100 can be configured to minimize the power consumption of the neural processing unit 100.
[0170] Considering the amount of data and operation steps of the artificial neural network model to be operated, the NPU internal memory 120 of the neural processing unit 100 can be a memory system configured to control the memory with low power.
[0171] Considering the amount of data and the operation steps of the artificial neural network model, the NPU internal memory 120 of the neural processing unit 100 can be a low-power memory system configured to reuse specific memory addresses where weight data is stored.
[0172] The neural processing unit 100 can provide various activation functions for providing nonlinearity. For example, it can provide the sigmoid function, the hyperbolic tangent function, or the ReLU function. Activation functions can be selectively applied after the MAC operation. The operation value after applying the activation function can be called an activation map.
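As a minimal sketch of how an activation function might be selectively applied after the MAC operation, the Python example below maps a list of MAC operation values to an activation map. The selection table `ACTIVATIONS` and the function `activation_map` are hypothetical names used only for illustration.

```python
import math

def sigmoid(v: float) -> float:
    return 1.0 / (1.0 + math.exp(-v))

def relu(v: float) -> float:
    return max(0.0, v)

# Hypothetical selection table: the function applied after the MAC operation is chosen by name.
ACTIVATIONS = {"sigmoid": sigmoid, "tanh": math.tanh, "relu": relu}

def activation_map(mac_values, name="relu"):
    """Apply the selected activation function to each MAC operation value."""
    fn = ACTIVATIONS[name]
    return [fn(v) for v in mac_values]
```

For example, `activation_map([-1.0, 2.0])` applies ReLU and returns `[0.0, 2.0]`.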
[0173] Figure 3 shows a modified example of the neural processing unit 100 shown in Figure 1.
[0174] The neural processing unit 100 shown in Figure 3 is basically the same as the neural processing unit 100 shown as an example in Figure 1, except for the processing element array 310. Therefore, for convenience, redundant descriptions are omitted below.
[0175] The processing element array 110 shown as an example in Figure 3 is configured to include a plurality of processing elements PE1 to PE12 and a plurality of register files RF1 to RF12 corresponding to the respective processing elements PE1 to PE12.
[0176] The plurality of processing elements PE1 to PE12 and the plurality of register files RF1 to RF12 shown in Figure 3 are merely examples for ease of description, and the numbers of processing elements and register files are not limited thereto.
[0177] The size or number of the processing element array 110 can be determined by the number of the plurality of processing elements PE1 to PE12 and the plurality of register files RF1 to RF12. The processing element array 110 and the plurality of register files RF1 to RF12 can be implemented in the form of an N×M matrix, where N and M are integers greater than zero.
[0178] The array size of the processing element array 110 can be designed with reference to the characteristics of the artificial neural network model in which the neural processing unit 100 operates. In other words, the memory size of the register file can be determined by taking into account the data size operated by the artificial neural network model, the required operating speed, and the required power consumption, etc.
[0179] The register files RF1 to RF12 of the neural processing unit 100 are static memory units directly connected to the processing elements PE1 to PE12. For example, the register files RF1 to RF12 may include flip-flops and/or latches. The register files RF1 to RF12 can be configured to store the MAC operation values of the corresponding processing elements PE1 to PE12. The register files RF1 to RF12 can be configured to provide weight data and/or node data to the NPU internal memory 120 or to receive weight data and/or node data from the NPU internal memory 120.
[0180] Figure 4 illustrates an exemplary artificial neural network model.
[0181] The operation of an exemplary artificial neural network model 110a that can operate in the neural processing unit 100 will be described below.
[0182] The exemplary artificial neural network model 110a of Figure 4 may be an artificial neural network trained by the neural processing unit 100, by the device shown in Figure 5A, by the device shown in Figure 6A or 6B, or by a separate machine learning device. The artificial neural network model 110a may be an artificial neural network trained to perform various inference functions, such as object recognition and speech recognition.
[0183] Artificial neural network model 110a can be a deep neural network (DNN).
[0184] However, the artificial neural network model 110a according to the embodiments of this disclosure is not limited to deep neural networks.
[0185] For example, the artificial neural network model 110a can be implemented as a model such as VGG, VGG16, DenseNet, and a fully convolutional network (FCN) with an encoder-decoder structure, or a deep neural network (DNN) such as SegNet, DeconvNet, DeepLAB V3+, U-net, SqueezeNet, AlexNet, ResNet18, MobileNet-v2, GoogLeNet, ResNet-v2, ResNet50, ResNet101, Inception-v3, etc. However, this disclosure is not limited to the models mentioned above. Furthermore, the artificial neural network model 110a can be an ensemble model based on at least two different models.
[0186] The artificial neural network model 110a can be stored in the NPU internal memory 120 of the neural processing unit 100. Alternatively, the artificial neural network model 110a can be stored in the memory 200 of the device 1000 shown in Figure 5A or of the device 1000 shown in Figure 6A or 6B, and then loaded into the neural processing unit 100 when the artificial neural network model 110a is operated.
[0187] Hereinafter, the process by which the neural processing unit 100 performs inference with the exemplary artificial neural network model 110a will be described with reference to Figure 4.
[0188] The artificial neural network model 110a is an exemplary deep neural network model configured to include an input layer 110a-1, a first connection network 110a-2, a first hidden layer 110a-3, a second connection network 110a-4, a second hidden layer 110a-5, a third connection network 110a-6, and an output layer 110a-7. However, this disclosure is not limited to the artificial neural network model shown in Figure 4. The first hidden layer 110a-3 and the second hidden layer 110a-5 can be referred to as a plurality of hidden layers.
[0189] The input layer 110a-1 may include, for example, input nodes x1 and x2. That is, the input layer 110a-1 may include node data with two node values. The NPU scheduler 130 shown in Figure 1 or Figure 3 can set, in the NPU internal memory 120 shown in Figure 1 or Figure 3, a memory address at which the input data of the input layer 110a-1 is stored.
[0190] The first connection network 110a-2 may include, for example, connections having six weight values that connect each node of the input layer 110a-1 to each node of the first hidden layer 110a-3. The NPU scheduler 130 shown in Figure 1 or Figure 3 can set, in the NPU internal memory 120, a memory address at which the weight data of the first connection network 110a-2 is stored. Each weight value is multiplied by the corresponding input node value, and the accumulated value of the products is stored in the first hidden layer 110a-3.
[0191] The first hidden layer 110a-3 may include, for example, nodes a1, a2, and a3. That is, the first hidden layer 110a-3 may include node data containing three node values. The NPU scheduler 130 shown in Figure 1 or Figure 3 can set, in the NPU internal memory 120, a memory address at which the node values of the first hidden layer 110a-3 are stored.
[0192] The second connection network 110a-4 may include, for example, connections having nine weight values that connect each node of the first hidden layer 110a-3 to each node of the second hidden layer 110a-5. Each connection network includes its own weight values. The NPU scheduler 130 shown in Figure 1 or Figure 3 can set, in the NPU internal memory 120, a memory address at which the weight values of the second connection network 110a-4 are stored. The weight values of the second connection network 110a-4 are multiplied by the node values input from the first hidden layer 110a-3, and the accumulated value of the products is stored in the second hidden layer 110a-5.
[0193] The second hidden layer 110a-5 may include, for example, nodes b1, b2, and b3. That is, the second hidden layer 110a-5 may include information about the values of the three nodes. The NPU scheduler 130 may set a memory address for storing information about the node values of the second hidden layer 110a-5 in the NPU internal memory 120.
[0194] The third connection network 110a-6 may include, for example, connections having six weight values that connect each node of the second hidden layer 110a-5 to each node of the output layer 110a-7. The NPU scheduler 130 may set, in the NPU internal memory 120, a memory address for storing the weight values of the third connection network 110a-6. The weight values of the third connection network 110a-6 are multiplied by the node values input from the second hidden layer 110a-5, and the accumulated value of the products is stored in the output layer 110a-7.
[0195] Output layer 110a-7 may include, for example, nodes y1 and y2. That is, output layer 110a-7 may include information about the values of the two nodes. NPU scheduler 130 may set a memory address for storing information about the node values of output layer 110a-7 in NPU internal memory 120.
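The layer structure described above (two input nodes, two hidden layers of three nodes each, and two output nodes) can be sketched in Python as follows. The weight layout `weights[k][j][i]` and the function names are assumptions made for illustration, and activation functions are omitted for brevity.

```python
# Node counts per layer: x1..x2, a1..a3, b1..b3, y1..y2.
LAYER_SIZES = [2, 3, 3, 2]

def connection_weight_counts(sizes):
    """Number of weight values in each connection network between adjacent layers."""
    return [sizes[i] * sizes[i + 1] for i in range(len(sizes) - 1)]

def forward(inputs, weights):
    """weights[k][j][i] is the weight from node i of layer k to node j of layer k+1."""
    values = list(inputs)
    for layer in weights:
        # Each output node is the accumulated sum of (weight * input node value).
        values = [sum(w * v for w, v in zip(row, values)) for row in layer]
    return values

# Example with all weights set to 1 (illustrative values only).
ones = [[[1] * n_in for _ in range(n_out)]
        for n_in, n_out in zip(LAYER_SIZES, LAYER_SIZES[1:])]
y = forward([1, 2], ones)
```

The counts returned by `connection_weight_counts(LAYER_SIZES)` match the six, nine, and six weight values of the first, second, and third connection networks.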
[0196] In other words, the NPU scheduler 130 can analyze or receive the structure of the artificial neural network model for operation in the processing element array 110. The artificial neural network model may include artificial neural network data that may include node values for each layer, locality information or structural information about the layer layout data, or information about the weight values of each network connecting the nodes of each layer.
[0197] Since the NPU scheduler 130 is provided with structural data of the exemplary neural network model 110a or locality information of artificial neural network data, the NPU scheduler 130 is also able to analyze the sequence of operations from input to output of the artificial neural network model 110a.
[0198] Therefore, considering the scheduling sequence, the NPU scheduler 130 can set, in the NPU internal memory 120, the memory address at which the MAC operation value of each layer is stored. For example, the data at a specific memory address can be the MAC operation value of the input layer 110a-1 and the first connection network 110a-2, and can at the same time be the input data of the first hidden layer 110a-3. However, this disclosure is not limited to MAC operation values, and MAC operation values can also be referred to as artificial neural network operation values.
[0199] At this point, since the NPU scheduler 130 knows that the MAC operation results of the input layer 110a-1 and the first connection network 110a-2 are the input data of the first hidden layer 110a-3, the same memory address can be used. That is, the NPU scheduler 130 can reuse MAC operation values based on the artificial neural network model structure data or the locality information of the artificial neural network data. Therefore, the NPU internal memory 120 can provide the effect of memory reuse.
[0200] In other words, the NPU scheduler 130 stores the MAC operation value of the artificial neural network model 110a in a specific memory address of the NPU internal memory 120 according to the scheduling sequence, and the specific memory address storing the MAC operation value can be used as input data for the MAC operation of the next scheduling sequence.
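The memory reuse described above can be sketched as a small address table in Python. This is only an illustration under simplifying assumptions: memory is modeled as a flat list, each node value occupies one address, and the names `assign_addresses`, `write`, and `read` are hypothetical.

```python
def assign_addresses(layer_names, sizes, base=0):
    """Assign each layer's node data one contiguous address range."""
    table, addr = {}, base
    for name, size in zip(layer_names, sizes):
        table[name] = range(addr, addr + size)
        addr += size
    return table

names = ["input", "hidden1", "hidden2", "output"]
table = assign_addresses(names, [2, 3, 3, 2])
memory = [0] * 10  # assumed flat internal memory, one value per address

def write(name, values):
    """Store MAC operation values at the addresses set for the given layer."""
    for a, v in zip(table[name], values):
        memory[a] = v

def read(name):
    """Read the same addresses back as input data of the next scheduling step."""
    return [memory[a] for a in table[name]]

# The MAC results of (input layer x first connection network) are written once...
write("hidden1", [7, 8, 9])
# ...and read from the same addresses as the input of the next MAC stage, without copying.
next_inputs = read("hidden1")
```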
[0201] MAC operation from the perspective of the first processing element PE1
[0202] The MAC operation will be described in detail based on the first processing element PE1. The first processing element PE1 can be designated to perform the MAC operation of node a1 of the first hidden layer 110a-3.
[0203] First, the first processing element PE1 inputs the x1 node data of input layer 110a-1 into the first input unit of multiplier 111, and inputs the weight value between x1 node and a1 node into the second input unit of multiplier 111. Adder 112 adds the operation value of multiplier 111 and the operation value of accumulator 113. At this time, when the number of iterations L is zero, the accumulated value is zero because there is no accumulated value. Therefore, the operation value of adder 112 will be the same as the operation value of multiplier 111. In this case, the counter value for iteration L can be 1.
[0204] Next, the first processing element PE1 inputs the x2 node value of input layer 110a-1 to the first input unit of multiplier 111, and inputs the weight value between x2 node and a1 node to the second input unit of multiplier 111. Adder 112 adds the operation value of multiplier 111 and the operation value of accumulator 113. At this time, when the number of iterations L is 1, the product of the x1 node value calculated in the previous step and the weight between x1 node and a1 node is stored. Therefore, adder 112 generates the MAC operation value of x1 node and x2 node corresponding to a1 node.
[0205] Third, the NPU scheduler 130 can terminate the MAC operation of the first processing element PE1 based on the structural data of the artificial neural network model or the locality information of the artificial neural network data. At this time, the accumulator 113 can be initialized by inputting an initialization reset. That is, the counter value of the Lth cycle can be initialized to zero.
[0206] The bit quantization unit 114 can be appropriately controlled based on the accumulated value. More specifically, as the number of cycles L increases, the number of bits in the output value increases. At this time, the NPU scheduler 130 can remove predetermined low bits, so that the number of bits in the operation value of the first processing element PE1 becomes X bits.
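The steps described for the first processing element PE1 can be sketched in Python as follows. The class `ProcessingElement` and the sample node and weight values are hypothetical; the sketch models only the multiplier, adder, accumulator, and loop counter, and leaves out the bit quantization unit.

```python
class ProcessingElement:
    """Minimal model of one processing element: multiplier, adder, accumulator."""

    def __init__(self):
        self.acc = 0    # accumulator 113; zero before the first loop
        self.loops = 0  # counter for the number of loops L

    def mac(self, node_value, weight):
        product = node_value * weight  # multiplier 111
        self.acc = product + self.acc  # adder 112 adds the product to the accumulated value
        self.loops += 1
        return self.acc

    def reset(self):
        """Initialization reset issued when the scheduler ends the MAC operation."""
        self.acc = 0
        self.loops = 0

pe1 = ProcessingElement()
x1, x2 = 3, 5                 # illustrative input node values
w_x1_a1, w_x2_a1 = 2, 4       # illustrative weights toward node a1
first = pe1.mac(x1, w_x1_a1)  # L = 0: adder output equals the multiplier output
a1 = pe1.mac(x2, w_x2_a1)     # L = 1: previous product is added to the new product
loops_used = pe1.loops
pe1.reset()
```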
[0207] MAC operation from the perspective of the second processing element PE2
[0208] The MAC operation will be described in detail based on the second processing element PE2. The second processing element PE2 can be specified to perform the MAC operation of node a2 of the first hidden layer 110a-3.
[0209] First, the second processing element PE2 inputs the x1 node value of input layer 110a-1 to the first input unit of multiplier 111, and inputs the weight value between x1 node and a2 node to the second input unit of multiplier 111. Adder 112 adds the operation value of multiplier 111 and the operation value of accumulator 113. At this time, when the number of iterations L is zero, the accumulated value is zero because there is no accumulated value. Therefore, the operation value of adder 112 will be the same as the operation value of multiplier 111. In this case, the counter value for iteration L can be 1.
[0210] Next, the second processing element PE2 inputs the x2 node value of input layer 110a-1 to the first input unit of multiplier 111, and inputs the weight value between x2 node and a2 node to the second input unit of multiplier 111. Adder 112 adds the operation value of multiplier 111 and the operation value of accumulator 113. At this time, when the number of iterations L is 1, the product value of x1 node value calculated in the previous step and the weight between x1 node and a2 node is stored. Therefore, adder 112 generates the MAC operation value of x1 node and x2 node corresponding to a2 node.
[0211] Third, the NPU scheduler 130 can terminate the MAC operation of the second processing element PE2 based on the structural data of the artificial neural network model or the locality information of the artificial neural network data. At this time, the accumulator 113 can be initialized by inputting an initialization reset. That is, the counter value of L cycles can be initialized to zero. The bit quantization unit 114 can be appropriately controlled according to the accumulated value.
[0212] MAC operation from the perspective of the third processing element PE3
[0213] The MAC operation will be described in detail based on the third processing element PE3. The third processing element PE3 can be specified to perform the MAC operation of node a3 of the first hidden layer 110a-3.
[0214] First, the third processing element PE3 inputs the x1 node value of input layer 110a-1 to the first input unit of multiplier 111, and inputs the weight value between x1 node and a3 node to the second input unit of multiplier 111. Adder 112 adds the operation value of multiplier 111 and the operation value of accumulator 113. At this time, when the number of iterations L is zero, the accumulated value is zero because there is no accumulated value. Therefore, the operation value of adder 112 will be the same as the operation value of multiplier 111. In this case, the counter value for iteration L can be 1.
[0215] Next, the third processing element PE3 inputs the x2 node value of input layer 110a-1 to the first input unit of multiplier 111, and inputs the weight value between x2 node and a3 node to the second input unit of multiplier 111. Adder 112 adds the operation value of multiplier 111 and the operation value of accumulator 113. At this time, when the number of iterations L is 1, the product of the x1 node value calculated in the previous step and the weight between x1 node and a3 node is stored. Therefore, adder 112 generates the MAC operation value of x1 node and x2 node corresponding to a3 node.
[0216] Third, the NPU scheduler 130 can terminate the MAC operation of the third processing element PE3 based on the structural data of the artificial neural network model or the locality information of the artificial neural network data. At this time, the accumulator 113 can be initialized by inputting an initialization reset. That is, the counter value of L cycles can be initialized to zero. The bit quantization unit 114 can be appropriately controlled according to the accumulated value.
[0217] Therefore, the NPU scheduler 130 of the neural processing unit 100 can simultaneously use three processing elements PE1 to PE3 to perform MAC operations on the first hidden layer 110a-3.
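The simultaneous operation of PE1 to PE3 can be sketched as a lockstep loop in which one input node value per step is supplied to all three processing elements. The function name `layer_mac` and the numeric values are hypothetical, and sequential Python code stands in for parallel hardware here.

```python
def layer_mac(inputs, weight_rows):
    """weight_rows[j] holds the weights from every input node to output node j."""
    accs = [0] * len(weight_rows)      # one accumulator per processing element
    for step, x in enumerate(inputs):  # one input node value per step, shared by all PEs
        accs = [acc + x * row[step] for acc, row in zip(accs, weight_rows)]
    return accs

# x1 = 1, x2 = 2; rows are the weights toward a1, a2, and a3 (illustrative values).
a_values = layer_mac([1, 2], [[1, 2], [3, 4], [5, 6]])
```

After two steps each accumulator holds the MAC operation value of its node, so the three node values of the first hidden layer become available together.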
[0218] MAC operation from the perspective of the fourth processing element PE4
[0219] The MAC operation will be described in detail based on the fourth processing element PE4. The fourth processing element PE4 can be specified to perform the MAC operation of node b1 of the second hidden layer 110a-5.
[0220] First, the fourth processing element PE4 inputs the value of node a1 of the first hidden layer 110a-3 into the first input unit of multiplier 111, and inputs the weight value between node a1 and node b1 into the second input unit of multiplier 111. Adder 112 adds the operation value of multiplier 111 to the operation value of accumulator 113. At this time, when the number of iterations L is zero, the accumulated value is zero because there is no accumulated value. Therefore, the operation value of adder 112 will be the same as the operation value of multiplier 111. In this case, the counter value for iteration L can be 1.
[0221] Next, the fourth processing element PE4 inputs the value of node a2 of the first hidden layer 110a-3 to the first input unit of multiplier 111, and inputs the weight value between node a2 and node b1 to the second input unit of multiplier 111. Adder 112 adds the operation value of multiplier 111 and the operation value of accumulator 113. At this time, when the number of iterations L is 1, the product of the value of node a1 calculated in the previous step and the weight between node a1 and node b1 is stored. Therefore, adder 112 generates the MAC operation value of node a1 and node a2 corresponding to node b1. In this case, the counter value of iteration L can be 2.
[0222] Third, the fourth processing element PE4 inputs the value of node a3 of the first hidden layer 110a-3 to the first input unit of multiplier 111, and inputs the weight value between node a3 and node b1 to the second input unit of multiplier 111. Adder 112 adds the operation value of multiplier 111 and the operation value of accumulator 113. At this time, when the number of iterations L is 2, the MAC operation values of nodes a1 and a2 corresponding to node b1 calculated in the previous step are stored. Therefore, adder 112 generates the MAC operation values of nodes a1, a2, and a3 corresponding to node b1.
[0223] Fourth, the NPU scheduler 130 can terminate the MAC operation of the fourth processing element PE4 based on the structural data of the artificial neural network model or the locality information of the artificial neural network data. At this time, the accumulator 113 can be initialized by inputting an initialization reset. That is, the counter value of L cycles can be initialized to zero. The bit quantization unit 114 can be appropriately controlled according to the accumulated value.
[0224] MAC operation from the perspective of the fifth processing element PE5
[0225] The MAC operation will be described in detail based on the fifth processing element PE5. The fifth processing element PE5 can be designated to perform the MAC operation of node b2 of the second hidden layer 110a-5.
[0226] First, the fifth processing element PE5 inputs the value of node a1 of the first hidden layer 110a-3 into the first input unit of multiplier 111, and inputs the weight value between node a1 and node b2 into the second input unit of multiplier 111. Adder 112 adds the operation value of multiplier 111 to the operation value of accumulator 113. At this time, when the number of iterations L is zero, the accumulated value is zero because there is no accumulated value. Therefore, the operation value of adder 112 will be the same as the operation value of multiplier 111. In this case, the counter value for iteration L can be 1.
[0227] Next, the fifth processing element PE5 inputs the value of node a2 of the first hidden layer 110a-3 to the first input unit of multiplier 111, and inputs the weight value between node a2 and node b2 to the second input unit of multiplier 111. Adder 112 adds the operation value of multiplier 111 and the operation value of accumulator 113. At this time, when the number of iterations L is 1, the product of the value of node a1 calculated in the previous step and the weight between node a1 and node b2 is stored. Therefore, adder 112 generates the MAC operation value of node a1 and node a2 corresponding to node b2. In this case, the counter value of iteration L can be 2.
[0228] Third, the fifth processing element PE5 inputs the value of node a3 of the first hidden layer 110a-3 into the first input unit of multiplier 111, and inputs the weight value between node a3 and node b2 into the second input unit of multiplier 111. Adder 112 adds the operation value of multiplier 111 and the operation value of accumulator 113. At this time, when the number of iterations L is 2, the MAC operation values of nodes a1 and a2 corresponding to node b2 calculated in the previous step are stored. Therefore, adder 112 generates the MAC operation values of nodes a1, a2, and a3 corresponding to node b2.
[0229] Fourth, the NPU scheduler 130 can terminate the MAC operation of the fifth processing element PE5 based on the structural data of the artificial neural network model or the locality information of the artificial neural network data. At this time, the accumulator 113 can be initialized by inputting an initialization reset. That is, the counter value of L cycles can be initialized to zero. The bit quantization unit 114 can be appropriately controlled according to the accumulated value.
[0230] MAC operation from the perspective of the sixth processing element PE6
[0231] The MAC operation will be described in detail based on the sixth processing element PE6. The sixth processing element PE6 can be designated to perform the MAC operation of node b3 of the second hidden layer 110a-5.
[0232] First, the sixth processing element PE6 inputs the value of node a1 of the first hidden layer 110a-3 into the first input unit of multiplier 111, and inputs the weight value between node a1 and node b3 into the second input unit of multiplier 111. Adder 112 adds the operation value of multiplier 111 and the operation value of accumulator 113. At this time, when the number of iterations L is zero, the accumulated value is zero because there is no accumulated value. Therefore, the operation value of adder 112 will be the same as the operation value of multiplier 111. In this case, the counter value for iteration L can be 1.
[0233] Next, the sixth processing element PE6 inputs the value of node a2 of the first hidden layer 110a-3 to the first input unit of multiplier 111, and inputs the weight value between node a2 and node b3 to the second input unit of multiplier 111. Adder 112 adds the operation value of multiplier 111 and the operation value of accumulator 113. At this time, when the number of iterations L is 1, the product of the value of node a1 calculated in the previous step and the weight between node a1 and node b3 is stored. Therefore, adder 112 generates the MAC operation value of node a1 and node a2 corresponding to node b3. In this case, the counter value of iteration L can be 2.
[0234] Third, the sixth processing element PE6 inputs the value of node a3 of the first hidden layer 110a-3 into the first input unit of multiplier 111, and inputs the weight value between node a3 and node b3 into the second input unit of multiplier 111. Adder 112 adds the operation value of multiplier 111 and the operation value of accumulator 113. At this time, when the number of iterations L is 2, the MAC operation values of nodes a1 and a2 corresponding to node b3 calculated in the previous step are stored. Therefore, adder 112 generates the MAC operation values of nodes a1, a2, and a3 corresponding to node b3.
[0235] Fourth, the NPU scheduler 130 can terminate the MAC operation of the sixth processing element PE6 based on the structural data of the artificial neural network model or the locality information of the artificial neural network data. At this time, the accumulator 113 can be initialized by inputting an initialization reset. That is, the counter value of L cycles can be initialized to zero. The bit quantization unit 114 can be appropriately controlled according to the accumulated value.
[0236] Therefore, the NPU scheduler 130 of the neural processing unit 100 can simultaneously use three processing elements PE4 to PE6 to perform MAC operations on the second hidden layer 110a-5.
[0237] MAC operation from the perspective of the seventh processing element PE7
[0238] The MAC operation will be described in detail based on the seventh processing element PE7. The seventh processing element PE7 can be designated to perform the MAC operation of the y1 node of the output layer 110a-7.
[0239] First, the seventh processing element PE7 inputs the value of node b1 of the second hidden layer 110a-5 into the first input unit of multiplier 111, and inputs the weight value between node b1 and node y1 into the second input unit of multiplier 111. Adder 112 adds the operation value of multiplier 111 to the operation value of accumulator 113. At this time, when the number of iterations L is zero, the accumulated value is zero because there is no accumulated value. Therefore, the operation value of adder 112 will be the same as the operation value of multiplier 111. In this case, the counter value for iteration L can be 1.
[0240] Next, the seventh processing element PE7 inputs the value of node b2 from the second hidden layer 110a-5 to the first input unit of multiplier 111, and inputs the weight value between node b2 and node y1 to the second input unit of multiplier 111. Adder 112 adds the operation value of multiplier 111 and the operation value of accumulator 113. At this time, when the number of iterations L is 1, the product of the value of node b1 calculated in the previous step and the weight between node b1 and node y1 is stored. Therefore, adder 112 generates the MAC operation value of node b1 and node b2 corresponding to node y1. In this case, the counter value of iteration L can be 2.
[0241] Third, the seventh processing element PE7 inputs the value of node b3 of the second hidden layer 110a-5 into the first input unit of multiplier 111, and inputs the weight value between node b3 and node y1 into the second input unit of multiplier 111. Adder 112 adds the operation value of multiplier 111 and the operation value of accumulator 113. At this time, when the number of iterations L is 2, the MAC operation values of nodes b1 and b2 corresponding to node y1 calculated in the previous step are stored. Therefore, adder 112 generates the MAC operation values of nodes b1, b2, and b3 corresponding to node y1.
[0242] Fourth, the NPU scheduler 130 can terminate the MAC operation of the seventh processing element PE7 based on the structural data of the artificial neural network model or the locality information of the artificial neural network data. At this time, the accumulator 113 can be initialized by inputting an initialization reset. That is, the counter value of L cycles can be initialized to zero. The bit quantization unit 114 can be appropriately controlled according to the accumulated value.
[0243] MAC operation from the perspective of the eighth processing element PE8
[0244] The MAC operation will be described in detail based on the eighth processing element PE8. The eighth processing element PE8 can be designated to perform the MAC operation of the y2 node of output layer 110a-7.
[0245] First, the eighth processing element PE8 inputs the value of node b1 of the second hidden layer 110a-5 to the first input unit of multiplier 111, and inputs the weight value between node b1 and node y2 to the second input unit of multiplier 111. Adder 112 adds the operation value of multiplier 111 and the operation value of accumulator 113. At this time, when the number of iterations L is zero, the accumulated value is zero because there is no accumulated value. Therefore, the operation value of adder 112 will be the same as the operation value of multiplier 111. In this case, the counter value for iteration L can be 1.
[0246] Next, the eighth processing element PE8 inputs the value of node b2 from the second hidden layer 110a-5 to the first input unit of multiplier 111, and inputs the weight value between node b2 and node y2 to the second input unit of multiplier 111. Adder 112 adds the operation value of multiplier 111 and the operation value of accumulator 113. At this time, when the number of iterations L is 1, the product of the value of node b1 calculated in the previous step and the weight between node b1 and node y2 is stored. Therefore, adder 112 generates the MAC operation value of node b1 and node b2 corresponding to node y2. In this case, the counter value of iteration L can be 2.
[0247] Third, the eighth processing element PE8 inputs the value of node b3 of the second hidden layer 110a-5 into the first input unit of multiplier 111, and inputs the weight value between node b3 and node y2 into the second input unit of multiplier 111. Adder 112 adds the operation value of multiplier 111 and the operation value of accumulator 113. At this time, when the number of iterations L is 2, the MAC operation values of nodes b1 and b2 corresponding to node y2 calculated in the previous step are stored. Therefore, adder 112 generates the MAC operation values of nodes b1, b2, and b3 corresponding to node y2.
[0248] Fourth, the NPU scheduler 130 can terminate the MAC operation of the eighth processing element PE8 based on the structural data of the artificial neural network model or the locality information of the artificial neural network data. At this time, the accumulator 113 can be initialized by inputting an initialization reset. That is, the counter value of L cycles can be initialized to zero. The bit quantization unit 114 can be appropriately controlled according to the accumulated value.
[0249] Therefore, the NPU scheduler 130 of the neural processing unit 100 can use two processing elements PE7 to PE8 simultaneously to perform MAC operations of the output layer 110a-7.
[0250] When the MAC operation of the eighth processing element PE8 is completed, the inference operation of the artificial neural network model 110a can be completed. That is, it can be determined that the artificial neural network model 110a has completed the inference operation of one frame. If the neural processing unit 100 infers video data in real time, the image data of the next frame can be input to the x1 and x2 input nodes of the input layer 110a-1. In this case, the NPU scheduler 130 can store the image data of the next frame in the memory address that stores the input data of the input layer 110a-1. If this process is repeated for each frame, the neural processing unit 100 can process the inference operation in real time. In addition, it also has the effect of reusing preset memory addresses.
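The frame-by-frame reuse of the input memory address can be sketched as follows. The function `run_frames`, the flat-list memory model, and the use of `sum` as a stand-in for the full inference operation are all assumptions made for illustration.

```python
def run_frames(frames, infer, memory, input_addr):
    """Store every frame at the same input addresses and run inference on it."""
    outputs = []
    for frame in frames:
        end = input_addr + len(frame)
        memory[input_addr:end] = frame  # the next frame overwrites the previous one
        outputs.append(infer(memory[input_addr:end]))
    return outputs

memory = [0] * 10
# Two frames, each supplying values for input nodes x1 and x2.
results = run_frames([[1, 2], [3, 4]], infer=sum, memory=memory, input_addr=0)
```

Because every frame is written to the same preset addresses, the memory footprint for input data does not grow with the number of frames processed.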
[0251] To summarize the case of the artificial neural network model 110a of Figure 4, the NPU scheduler 130 of the neural processing unit 100 can determine an operation scheduling sequence for the inference operations of the artificial neural network model 110a based on the structural data or locality information of the artificial neural network model 110a. The NPU scheduler 130 can set the memory addresses required by the NPU internal memory 120 based on the operation scheduling sequence. The NPU scheduler 130 can also set memory addresses for memory reuse based on the structural data or locality information of the artificial neural network model 110a. The NPU scheduler 130 can execute inference operations by specifying the processing elements PE1 to PE8 required for the inference operation.
[0252] Furthermore, if the amount of weight data connected to a node increases by L, the number of L iterations of the accumulator in the processing element can be set to L-1. That is, even if the amount of weight data in the artificial neural network increases, inference operations can still be performed easily by increasing the number of accumulations of the accumulator.
[0253] In other words, the NPU scheduler 130 of the neural processing unit 100 according to the embodiments of this disclosure can control the processing element array 110 and the NPU internal memory 120 based on structural data or artificial neural network data locality information, including structural data of the input layer 110a-1, the first connection network 110a-2, the first hidden layer 110a-3, the second connection network 110a-4, the second hidden layer 110a-5, the third connection network 110a-6 and the output layer 110a-7 of the artificial neural network model.
[0254] In other words, the NPU scheduler 130 can set the memory address values corresponding to the node data of the input layer 110a-1, the weight data of the first connection network 110a-2, the node data of the first hidden layer 110a-3, the weight data of the second connection network 110a-4, the node data of the second hidden layer 110a-5, the weight data of the third connection network 110a-6, and the node data of the output layer 110a-7 in the NPU internal memory 120.
[0255] The scheduling of the NPU scheduler 130 will be described in detail below. The NPU scheduler 130 can schedule the operation sequence of the artificial neural network model based on the structure data of the artificial neural network model or the locality information of the artificial neural network data.
[0256] The NPU scheduler 130 can obtain the memory address values of the node data of the layers of the artificial neural network model and the weight data of the connected network based on the structure data of the artificial neural network model or the locality information of the artificial neural network data.
[0257] For example, the NPU scheduler 130 can obtain the memory address values of the node data of the layers and the weight data of the connections of the artificial neural network model stored in the main memory. Therefore, the NPU scheduler 130 can retrieve the node data of the layers and the weight data of the connections of the artificial neural network model to be driven from the main memory and store them in the NPU's internal memory 120. Each layer's node data can have its own corresponding memory address value. Each connection's weight data can also have its own corresponding memory address value.
[0258] The NPU scheduler 130 can schedule the operation sequence of the processing element array 110 based on the structural data or locality information of the artificial neural network model, such as the layer arrangement structure data or locality information of the artificial neural network model built at compile time.
[0259] For example, the NPU scheduler 130 can obtain network connection data describing four artificial neural network layers and the three sets of weight values connecting those layers. In this case, the following example illustrates how the NPU scheduler 130 schedules processing sequences based on the structural data of the artificial neural network model or the locality information of the artificial neural network data.
[0260] For example, the NPU scheduler 130 sets the input data for the inference operation as the node data of the first layer (the first layer is the input layer 110a-1 of the artificial neural network model 110a), and can first schedule to execute the MAC operation of the node data of the first layer and the weight data of the first connection network corresponding to the first layer. In the following text, for the sake of convenience, the corresponding operation may be referred to as the first operation, the result of the first operation may be referred to as the first operation value, and the corresponding schedule may be referred to as the first schedule.
[0261] For example, the NPU scheduler 130 sets the first operation value to the node data of the second layer corresponding to the first connection network, and can schedule the MAC operation of the node data of the second layer and the weight data of the second connection network corresponding to the second layer to be executed after the first scheduling. In the following text, for ease of description, the corresponding operation can be referred to as the second operation, the result of the second operation can be referred to as the second operation value, and the corresponding scheduling can be referred to as the second scheduling.
[0262] For example, the NPU scheduler 130 sets the second operation value to the node data of the third layer corresponding to the second connection network, and can schedule the MAC operation of the node data of the third layer and the weight data of the third connection network corresponding to the third layer to be executed after the second schedule. In the following text, for the convenience of description, the corresponding operation may be referred to as the third operation, the result of the third operation may be referred to as the third operation value, and the corresponding schedule may be referred to as the third schedule.
[0263] For example, the NPU scheduler 130 sets the third operation value to the node data of the fourth layer (i.e., the output layer 110a-7) corresponding to the third connection network, and can schedule the inference results stored in the node data of the fourth layer to be stored in the NPU internal memory 120. For ease of description, the corresponding schedule may be referred to as the fourth schedule. The inference result value can be sent to and used by various elements of the device 1000.
[0264] For example, if the inference result value is the result value of detecting a specific keyword, the neural processing unit 100 sends the inference result to the central processing unit so that the device 1000 can perform an operation corresponding to the specific keyword.
[0265] For example, the NPU scheduler 130 can drive the first processing element PE1 to the third processing element PE3 in the first schedule.
[0266] For example, the NPU scheduler 130 can drive the fourth processing element PE4 to the sixth processing element PE6 in the second schedule.
[0267] For example, the NPU scheduler 130 can drive the seventh processing element PE7 to the eighth processing element PE8 in the third schedule.
[0268] For example, the NPU scheduler 130 can output inference results in the fourth schedule.
[0269] In summary, the NPU scheduler 130 can control the NPU internal memory 120 and the processing element array 110 to execute operations in a first, second, third, and fourth scheduling sequence. That is, the NPU scheduler 130 can be configured to control the NPU internal memory 120 and the processing element array 110 to execute operations in a set scheduling sequence.
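The four-step scheduling sequence summarized above can be sketched as follows. This is a minimal illustration under stated assumptions: the layer sizes (2, 3, 3, 2), the weight values, and the function names `mac_layer` and `run_schedules` are hypothetical, and activation functions and quantization are omitted.

```python
def mac_layer(node_data, weights):
    """One schedule: each output node is the MAC of all input nodes with its weights."""
    return [sum(x * w for x, w in zip(node_data, row)) for row in weights]


# Hypothetical weights. Each row holds the weights feeding one output node.
connection_networks = [
    [[1, 0], [0, 1], [1, 1]],            # first connection network: 2 -> 3 nodes
    [[1, 0, 0], [0, 1, 0], [0, 0, 1]],   # second connection network: 3 -> 3 nodes
    [[1, 1, 1], [1, 0, -1]],             # third connection network: 3 -> 2 nodes
]
# Processing elements driven in the first, second, and third schedules.
pe_groups = [["PE1", "PE2", "PE3"], ["PE4", "PE5", "PE6"], ["PE7", "PE8"]]


def run_schedules(input_nodes):
    log = []
    node_data = input_nodes  # node data of the first layer
    for step, (weights, pes) in enumerate(zip(connection_networks, pe_groups), start=1):
        assert len(pes) == len(weights)  # one processing element per output node
        node_data = mac_layer(node_data, weights)  # operation value becomes next layer's node data
        log.append((step, pes))
    log.append((4, []))  # fourth schedule: output the inference result, no MAC
    return node_data, log


result, log = run_schedules([2, 3])
```

The fixed order of the list is what allows every later step to be known in advance, which is the property the scheduler relies on.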
[0270] In summary, the neural processing unit 100 according to embodiments of this disclosure can be configured to schedule processing sequences based on the structure of layers in an artificial neural network and operation sequence data corresponding to that structure. The scheduled processing order may include at least one operation. For example, since the neural processing unit 100 can predict the sequence of all operations, it can also schedule subsequent operations, or it can schedule operations according to a specific sequence.
[0271] The NPU scheduler 130 improves memory reuse by using scheduling sequences to control the NPU internal memory 120 based on artificial neural network model structure data or locality information of artificial neural network data.
[0272] Due to the nature of the artificial neural network operations driven by the neural processing unit 100 according to the embodiments of this disclosure, the operation values of one layer may have the characteristic of becoming input data for the next layer.
[0273] Therefore, when the neural processing unit 100 controls the NPU internal memory 120 according to the scheduling sequence, it has the effect of improving the memory reuse rate of the NPU internal memory 120.
[0274] Specifically, if the NPU scheduler 130 is configured to receive structural data or locality information of the artificial neural network model, it can determine the sequence of computations to be performed on the artificial neural network based on the provided structural data or locality information. The NPU scheduler 130 can determine that the computation results of node data in a specific layer and weight data in a specific connection network of the artificial neural network model become the node data of the next layer. Therefore, the NPU scheduler 130 can reuse, in subsequent operations, the memory address value at which an operation result is stored.
[0275] For example, the first operation value of the first schedule described above is set as the node data of the second layer of the second schedule. Specifically, the NPU scheduler 130 can reset the memory address value corresponding to the first operation value of the first schedule stored in the NPU internal memory 120 to the memory address value corresponding to the node data of the second layer of the second schedule. That is, the memory address value can be reused. Therefore, by reusing the memory address value of the first schedule by the NPU scheduler 130, the NPU internal memory 120 has the effect of using the second layer node data of the second schedule without a separate memory write operation.
[0276] For example, the second operation value of the second schedule described above is set to the node data of the third layer of the third schedule. Specifically, the NPU scheduler 130 can reset the memory address value corresponding to the second operation value of the second schedule stored in the NPU internal memory 120 to the memory address value corresponding to the node data of the third layer of the third schedule. That is, the memory address value can be reused. Therefore, by reusing the memory address value of the second schedule by the NPU scheduler 130, the NPU internal memory 120 has the effect of using the third layer node data of the third schedule without a separate memory write operation.
[0277] For example, the third operation value of the third schedule described above is set to the node data of the fourth layer of the fourth schedule. Specifically, the NPU scheduler 130 can reset the memory address value corresponding to the third operation value of the third schedule stored in the NPU internal memory 120 to the memory address value corresponding to the node data of the fourth layer of the fourth schedule. That is, the memory address value can be reused. Therefore, because the NPU scheduler 130 reuses the memory address value of the third schedule, the NPU internal memory 120 can provide the fourth layer node data of the fourth schedule without a separate memory write operation.
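The address reuse described in the three examples above can be sketched as follows. This is a toy model, not the disclosed memory controller: the class `NpuMemory`, the address `0x100`, and the stored values are assumptions made for illustration.

```python
class NpuMemory:
    """Toy NPU internal memory: address -> data, plus a logical-name -> address table."""

    def __init__(self):
        self.cells = {}
        self.labels = {}
        self.writes = 0  # number of data write operations performed

    def write(self, label, address, data):
        self.cells[address] = data
        self.labels[label] = address
        self.writes += 1

    def relabel(self, old_label, new_label):
        """Reuse an address: only the mapping changes, no data is copied or rewritten."""
        self.labels[new_label] = self.labels.pop(old_label)

    def read(self, label):
        return self.cells[self.labels[label]]


mem = NpuMemory()
# The first schedule stores its operation value once (hypothetical address and data).
mem.write("first operation value", 0x100, [2, 3, 5])
# The second schedule treats the same address as the node data of the second layer.
mem.relabel("first operation value", "second layer node data")
second_layer_nodes = mem.read("second layer node data")
```

The write counter stays at one, which corresponds to the statement that the next schedule can use the data without a separate memory write operation.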
[0278] Furthermore, the NPU scheduler 130 may also be configured to control the NPU internal memory 120 by determining the scheduling sequence and whether to reuse memory. In this case, the NPU scheduler 130 can provide optimized scheduling by analyzing the artificial neural network model structure data or the locality information of the artificial neural network data. Moreover, since the data required for memory reuse operations is not copied and stored in the NPU internal memory 120, it has the effect of reducing memory usage. Furthermore, the NPU scheduler 130 optimizes the NPU internal memory 120 by calculating the reduced memory usage through memory reuse.
[0279] According to the neural processing unit 100 of the embodiments of this disclosure, the first processing element PE1 can be configured such that a first input having N bits receives a variable value and a second input having M bits receives a constant value. Furthermore, such a configuration can be applied to other processing elements of the processing element array 110. That is, one input of the processing element can be configured to receive a variable value, while the other input can be configured to receive a constant value. Therefore, this has the effect of reducing the number of times the constant value data is updated.
[0280] At this time, the NPU scheduler 130 utilizes the structural data or data locality information of the artificial neural network model 110a, and can set the node data of the input layer 110a-1, the first hidden layer 110a-3, the second hidden layer 110a-5, and the output layer 110a-7 as variables, and set the weight data of the first connection network 110a-2, the second connection network 110a-4, and the third connection network 110a-6 as constants. That is, the NPU scheduler 130 can distinguish between constant values and variable values. However, this disclosure is not limited to constant and variable data types. Essentially, the reuse rate of the NPU internal memory 120 can be improved by distinguishing between frequently changing values and infrequently changing values.
[0281] In other words, the NPU internal memory 120 can be configured to retain the connection weight data stored in it while the inference operation of the neural processing unit 100 continues. Therefore, it has the effect of reducing memory read / write operations.
[0282] In other words, the NPU internal memory 120 can be configured to reuse MAC operation values stored in the NPU internal memory 120 while continuing inference operations.
[0283] In other words, for each processing element in the processing element array 110, the data update frequency of the memory address storing the N-bit input data of the first input unit can be greater than the data update frequency of the memory address storing the M-bit input data of the second input unit. That is, the data update frequency of the second input unit can be lower than that of the first input unit.
[0284] Figure 5A shows an artificial neural network (ANN) driving device including the neural processing unit 100 of Figure 1 or Figure 3, and Figure 5B shows the energy consumed during the operation of the neural processing unit 100.
[0285] Referring to Figure 5A, the ANN driving device 1000 may include a neural processing unit 100, a memory 200, a kernel generator 300, and a substrate 400.
[0286] Conductive patterns can be formed on substrate 400. Furthermore, neural processing unit 100, memory 200, and kernel generator 300 can be coupled to substrate 400 to be electrically connected to the conductive patterns. The conductive patterns can operate as a system bus allowing communication between neural processing unit 100, memory 200, and kernel generator 300.
[0287] The neural processing unit 100 may include the components shown in Figure 1 or Figure 3.
[0288] Memory 200 is a device for storing data under the control of a host device such as a computer or smartphone. Memory 200 may include volatile memory and non-volatile memory.
[0289] Volatile storage devices are storage devices that store data only while powered on and lose the stored data when power is cut off. Volatile memory can include static random access memory (SRAM), dynamic random access memory (DRAM), etc.
[0290] The memory 200 may include solid-state drives (SSDs), flash memory, magnetic random access memory (MRAM), phase-change RAM (PRAM), ferroelectric RAM (FeRAM), hard disks, static random access memory (SRAM), dynamic random access memory (DRAM), etc.
[0291] The following description focuses on convolutional neural networks (CNNs), which are a type of deep neural network (DNN) among artificial neural networks.
[0292] A convolutional neural network (CNN) can be a combination of one or more convolutional layers, pooling layers, and fully connected layers. CNNs have a structure suitable for learning and inference on two-dimensional data and can be trained using the backpropagation algorithm.
[0293] In the examples disclosed herein, for each channel of the convolutional neural network, there exists a kernel for extracting features from the input image of that channel. The kernel can consist of a two-dimensional matrix. The kernel performs convolution operations while traversing the input data. The size of the kernel can be arbitrarily determined, as can the stride of the kernel traversing the input data. The degree of matching between the kernel and all input data for each kernel can be a feature map or an activation map.
[0294] Since convolution is an operation consisting of a combination of input data and a kernel, activation functions such as ReLU can be applied to add non-linearity. When an activation function is applied to a feature map that is the result of a convolution operation, it can be called an activation map.
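The relationship between a kernel, a feature map, and an activation map can be sketched as follows. This is a minimal single-channel illustration with no padding: the function names, the 3x3 input, and the 2x2 kernel are assumptions for the example. As is common in CNN practice, the kernel is applied without flipping.

```python
def conv2d(image, kernel, stride=1):
    """Slide the kernel over the input and return the feature map."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = (len(image) - kh) // stride + 1
    out_w = (len(image[0]) - kw) // stride + 1
    feature_map = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            acc = 0
            for di in range(kh):
                for dj in range(kw):
                    acc += image[i * stride + di][j * stride + dj] * kernel[di][dj]
            row.append(acc)
        feature_map.append(row)
    return feature_map


def relu(feature_map):
    """Apply ReLU element-wise, turning a feature map into an activation map."""
    return [[max(0, v) for v in row] for row in feature_map]


# Hypothetical input data and kernel weights.
image = [
    [1, 2, 0],
    [0, 1, 3],
    [2, 0, 1],
]
kernel = [
    [1, -1],
    [-1, 1],
]

feature_map = conv2d(image, kernel)   # [[0, 4], [-3, -1]]
activation_map = relu(feature_map)    # [[0, 4], [0, 0]]
```

Every output element reads every kernel weight once, which is why the number of kernel reads grows quickly with input size and channel count.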
[0295] Convolutional neural networks may include AlexNet, SqueezeNet, VGG16, ResNet152, and MobileNet. The number of multiplications required for one inference operation in each neural network model is 727 MFLOPs, 837 MFLOPs, 16 MFLOPs, 11 MFLOPs, and 579 MFLOPs, respectively, and the data sizes of all weights, including the kernel, are 233 MB, 5 MB, 528 MB, 230 MB, and 16 MB, respectively. Therefore, it can be seen that these models require a considerable amount of hardware resources and power.
[0296] Traditionally, these kernels are stored in memory for each corresponding channel, and the input data is processed by reading them from memory for each convolution process. For example, as shown in Figure 5B, for a 32-bit data read operation during convolution, the NPU internal memory 120 of the neural processing unit 100, which functions as SRAM, may consume 5 pJ of energy, while the memory 200, which functions as DRAM, may consume 640 pJ. By comparison, an 8-bit addition operation consumes 0.03 pJ, a 16-bit addition operation consumes 0.05 pJ, a 32-bit addition operation consumes 0.1 pJ, and an 8-bit multiplication operation consumes 0.2 pJ. In other words, memory access consumes a considerable amount of power and degrades overall performance. Specifically, the energy consumed when reading the kernel from memory 200 is 128 times the energy consumed when reading the kernel from the internal memory of the neural processing unit 100.
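The 128-fold figure follows directly from the per-read energies quoted above. The sketch below reproduces that arithmetic; the per-operation values are taken from the text, while the 3x3 kernel workload and the helper name `read_energy_pj` are assumptions added for illustration.

```python
# Energy per operation in picojoules, as quoted in the text for Figure 5B.
ENERGY_PJ = {
    "sram_read_32b": 5.0,    # NPU internal memory 120
    "dram_read_32b": 640.0,  # memory 200
    "add_8b": 0.03,
    "add_16b": 0.05,
    "add_32b": 0.1,
    "mul_8b": 0.2,
}


def read_energy_pj(num_reads, source):
    """Total energy for num_reads 32-bit reads from 'sram' or 'dram'."""
    return num_reads * ENERGY_PJ[f"{source}_read_32b"]


dram_vs_sram = ENERGY_PJ["dram_read_32b"] / ENERGY_PJ["sram_read_32b"]  # 128.0

# Hypothetical workload: one 3x3 kernel of 32-bit weights, each weight read once.
kernel_reads = 3 * 3
dram_energy = read_energy_pj(kernel_reads, "dram")  # 5760.0 pJ
sram_energy = read_energy_pj(kernel_reads, "sram")  # 45.0 pJ
```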
[0297] In other words, the operating speed of memory 200 is slower than that of neural processing unit 100, and its power consumption per unit operation is relatively higher. Therefore, minimizing read operations of memory 200 contributes to reducing the power consumption of device 1000.
[0298] To overcome this inefficiency, this disclosure provides a method and system for generating artificial neural network kernels that improve computational performance by minimizing the data movement involved in calling the kernel from memory 200 during each convolution process, thereby reducing the overall hardware resources and power consumed by data movement.
[0299] Specifically, the memory 200 may include a basic kernel storage device 210, a kernel filter storage device 220, and a kernel generation algorithm storage device 230.
[0300] According to the examples in this disclosure, multiple kernels can be generated from a base kernel according to rules determined by a kernel generation algorithm (or kernel recovery algorithm).
[0301] The memory 200 can be configured with a base kernel storage device 210 for storing the base kernel, a kernel filter storage device 220 for storing kernel filters, and a kernel generation algorithm storage device 230 by allocating regions. The base kernel storage device 210, kernel filter storage device 220, and kernel generation algorithm storage device 230 can be configured by setting the memory address of the memory 200. However, this disclosure is not limited thereto.
[0302] In Figure 5A, the basic kernel storage device 210, kernel filter storage device 220, and kernel generation algorithm storage device 230 are shown as stored in memory 200. However, according to one example, they can be stored in the NPU internal memory 120 included in the neural processing unit 100. Furthermore, although the kernel generator 300 is shown as independent of the neural processing unit 100, the kernel generator 300 can be located within the neural processing unit 100, as shown in Figure 6A.
[0303] The basic kernel storage device 210 can store basic kernels that serve as the basis for kernel generation. A basic kernel can be a kernel used to generate the kernels of another layer, of a channel of another layer, and / or of another channel of the same layer. There need not be only one basic kernel; according to one example, multiple basic kernels can exist. Each basic kernel can have different weight values.
[0304] A basic kernel can be applied on a per-channel or per-layer basis. For example, in a color image, a basic kernel can be applied to each of the RGB channels, and feature maps can be generated from the basic kernel applied to each channel. As another example, the kernel of another layer can be generated based on the basic kernel.
[0305] In other words, a kernel for computing feature maps of another channel can be generated from the base kernel. Therefore, the ANN driver 1000 can select appropriate weights and assign them as the base kernel based on the kernel generation algorithm (or kernel recovery algorithm) used to generate kernels corresponding to each channel and / or layer. The base kernel can be determined with reference to the kernel filter described later.
[0306] For example, the ANN driver device 1000 can determine, through a learning process, the kernel whose weights yield the statistically highest inference accuracy as the base kernel.
[0307] For example, an arbitrary kernel with weights that minimize the average weight difference between kernels of multiple channels and / or layers can be set as the base kernel.
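One way to read the criterion above is as a search for the kernel with the smallest mean absolute weight difference to all other kernels. The sketch below makes that reading concrete under stated assumptions: the candidates are restricted to the existing kernels, the distance measure is mean absolute difference, and the function names and 2x2 weight values are hypothetical. The disclosure itself does not fix these choices.

```python
def mean_abs_difference(a, b):
    """Mean absolute element-wise difference between two equally sized kernels."""
    diffs = [abs(x - y) for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b)]
    return sum(diffs) / len(diffs)


def choose_base_kernel(kernels):
    """Return the index of the kernel with the smallest average difference to the others."""
    def cost(i):
        others = [k for j, k in enumerate(kernels) if j != i]
        return sum(mean_abs_difference(kernels[i], k) for k in others) / len(others)

    return min(range(len(kernels)), key=cost)


# Hypothetical kernels of three channels (or layers).
kernels = [
    [[8, 8], [8, 8]],
    [[7, 8], [8, 7]],
    [[6, 8], [8, 6]],
]

base_index = choose_base_kernel(kernels)  # the middle kernel is closest to both others
```

Choosing a central kernel keeps the resulting difference values small, which matters for the storage size of the kernel filters.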
[0308] However, this disclosure is not limited to the examples above, and the basic kernel can be determined according to various algorithms.
[0309] The kernel filter storage device 220 can store kernel filters generated based on the difference between the base kernel and other kernels (i.e., the delta Δ value).
[0310] In the case of a trained convolutional neural network, many final kernels are stored. The ANN driver 1000 selects at least some of these kernels as base kernels. Additionally, kernels not selected as base kernels can be converted into kernel filters corresponding to the base kernels and then stored. That is, if kernel filters are applied to base kernels, the original kernel can be recovered or a kernel similar to the original kernel can be generated. A kernel recovered in this way can be called a modified kernel or an updated kernel. In other words, the original kernel can be divided into base kernels and kernel filters.
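The split of an original kernel into a base kernel and a kernel filter, and its recovery, can be sketched as follows. This is a minimal illustration assuming the filter is the element-wise difference (original minus base) and that recovery is element-wise addition; the function names and weight values are hypothetical.

```python
def make_kernel_filter(base, kernel):
    """Kernel filter as the element-wise delta between a kernel and the base kernel."""
    return [[k - b for b, k in zip(row_b, row_k)] for row_b, row_k in zip(base, kernel)]


def apply_kernel_filter(base, kernel_filter):
    """Recover a kernel by adding the kernel filter back onto the base kernel."""
    return [[b + d for b, d in zip(row_b, row_d)] for row_b, row_d in zip(base, kernel_filter)]


# Hypothetical trained kernels: one chosen as base, one converted to a filter.
base = [[8, 8], [8, 8]]
original = [[7, 8], [9, 8]]

kernel_filter = make_kernel_filter(base, original)    # [[-1, 0], [1, 0]]
recovered = apply_kernel_filter(base, kernel_filter)  # equals the original kernel
```

With an exact delta the recovered kernel equals the original; a coarser or quantized filter would instead yield a kernel similar to the original.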
[0311] Kernel generator 300 can read the basic kernel, kernel filters, and kernel generation algorithm (including mapping information) from memory 200 and store them in internal memory 310. It then restores the original kernel or generates a kernel similar to the original kernel and transmits it to neural processing unit 100. Once the basic kernel, kernel filters, and mapping information are stored in the internal memory of kernel generator 300, kernel generator 300 may not need to access memory 200 again. Therefore, by accessing internal memory 310 instead of memory 200, the energy consumed by reads can be reduced by up to a factor of 128.
[0312] The ANN driver 1000 can recover the original kernel required for each layer or channel of the artificial neural network, or generate a kernel similar to the original kernel, by having the kernel generator 300 selectively apply the basic kernel and kernel filters. Therefore, by storing only the reference basic kernel and kernel filters in the memory 200 instead of all kernels corresponding to each layer or each channel, improved storage efficiency can be achieved compared to storing all kernels.
[0313] As a concrete example, the weight value included in the first kernel of the first layer (or the first channel) can be eight, and the weight value included in the second kernel of the second layer (or the second channel) can be seven. To store the first kernel of the first layer (or the first channel) and the second kernel of the second layer (or the second channel), 4 bits of memory may be required respectively.
[0314] According to the example in this disclosure, the difference (i.e., the Δ value) between weight value eight and weight value seven is one. Therefore, it may only require one bit of memory to store this difference one.
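The bit counts in this example can be checked with a short calculation. The sketch below assumes unsigned magnitudes and a fixed 4-bit field for each original weight, as in the text; how the sign of the difference is conveyed (for example through the mapping information) is an assumption left outside the example.

```python
first_weight = 8   # weight of the first kernel
second_weight = 7  # weight of the second kernel

# Minimum number of bits needed to represent each magnitude.
first_bits = first_weight.bit_length()   # 8 is 0b1000, so 4 bits
delta = first_weight - second_weight     # the delta value is 1
delta_bits = delta.bit_length()          # 1 is 0b1, so 1 bit

# Storing both kernels in 4-bit fields versus storing the base plus the delta.
bits_both_kernels = 4 + 4
bits_base_plus_filter = first_bits + delta_bits
```

For this single pair of weights the storage drops from 8 bits to 5 bits; the saving scales with the number of weights whose delta is small.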
[0315] Figure 6A shows a modified configuration of the ANN driving device including the neural processing unit 100 of Figure 1 or Figure 3.
[0316] Referring to Figure 6A, the ANN driving device 1000 may include a neural processing unit 100, a memory 200, a kernel generator 300, and a substrate 400.
[0317] Conductive patterns can be formed on substrate 400. Additionally, neural processing unit 100 and memory 200 can be coupled to substrate 400 to be electrically connected to the conductive patterns. The conductive patterns can act as a system bus allowing neural processing unit 100 and memory 200 to communicate with each other.
[0318] The memory 200 may include a basic kernel storage device 210, a kernel filter storage device 220, and a kernel generation algorithm storage device 230. In addition to the components shown in Figure 1 or Figure 3, the neural processing unit 100 may also include a kernel generator 150. In Figure 5A, the kernel generator 300 is shown positioned outside the neural processing unit 100, whereas in Figure 6A, the kernel generator 150 is shown disposed in the neural processing unit 100.
[0319] The kernel generator 150 can generate (restore) the original kernel of the corresponding layer or channel based on the basic kernel and kernel filter stored in the memory 200 according to the kernel generation algorithm (or kernel restoration algorithm).
[0320] Figure 6B shows a modified configuration of the ANN driving device including the neural processing unit 100 of Figure 1 or Figure 3.
[0321] Referring to Figure 6B, the ANN driving device 1000 may include a neural processing unit 100, a memory 200, a kernel generator 300, and a substrate 400.
[0322] An electrically conductive pattern can be formed on the substrate 400. Additionally, the neural processing unit 100 and the memory 200 can be coupled to the substrate 400 to be electrically connected to the conductive pattern. The conductive pattern can act as a system bus allowing the neural processing unit 100 and the memory 200 to communicate with each other.
[0323] The memory 200 may include a basic kernel storage device 210, a kernel filter storage device 220, a kernel generation algorithm storage device 230, and a kernel generator 240.
[0324] In Figure 6A, kernel generator 150 is shown as being located within neural processing unit 100, whereas in Figure 6B, kernel generator 240 is shown as being located in memory 200.
[0325] In other words, if the memory 200 operates at the same speed as the neural processing unit 100 and has built-in computing capabilities, and the power consumption per unit of operation is improved and very low, then the memory 200 may include a kernel generator 240.
[0326] The kernel generation algorithm storage device 230 of Figures 5A, 6A, and 6B is described in detail below.
[0327] The kernel generation algorithm (or kernel recovery algorithm) may include mapping information between the base kernel, the corresponding kernel filters, and the recovered (modulated) kernel. This is described later with reference to Figure 18.
[0328] A kernel generation algorithm (or kernel recovery algorithm) can be an algorithm agreed upon in advance as a method for minimizing the size of the kernel filter (i.e., the data size) through a learning process.
[0329] Kernel generation algorithms (or kernel recovery algorithms) can be generated based on algorithms that have been determined to have the best accuracy through a series of learning processes.
[0330] The kernel generation algorithm (or kernel recovery algorithm) may include at least a portion of the kernel (i.e., the matrix containing weight values) used in the artificial neural network, the number of channels, the number of layers, input data information, the algorithm processing method, and the order in which the kernels are retrieved from memory 200. Specifically, the kernel generation algorithm (or kernel recovery algorithm) may indicate a method for generating (or recovering) the kernel of a specific layer.
[0331] At least one base kernel can be used to create kernel filters. The base kernel does not necessarily have to be the kernel of the first layer; a kernel from any layer or any channel can be determined as the base kernel.
[0332] The kernel generator 300 can apply the base kernel on a per-layer basis, using the kernel of a reference layer as the base kernel to generate kernel filters for another layer. Furthermore, within a layer, at least one kernel can be determined as the base kernel for each channel, and kernel filters can be generated based on the base kernel.
[0333] In one example, there might be cases where the input data contains only three RGB channels, as well as cases using dozens or more channels. The kernel generator 300 can generate kernel filters based on various techniques using different base kernels for each channel.
[0334] Furthermore, various techniques for generating another kernel from a base kernel can be applied differently to each layer or each channel. Specifically, techniques for generating another kernel from a base kernel can include one of the following: a first method of using the base kernel as is in another layer or channel; a second method of using a kernel filter; a third method of modifying the base kernel itself without considering the kernel filter; and a fourth method of modifying both the kernel filter and the base kernel simultaneously.
[0335] Specifically, the third method, which modifies the basic kernel itself, can be implemented by changing the order in which data is retrieved from memory 200. Data stored in memory 200 can be represented as addresses indicating their locations. For example, in memory 200, locations can be represented by column addresses and row addresses. An artificial neural network can alter the order in which each data value from the basic kernel is received by sending the modified addresses to memory 200, based on a kernel generation algorithm (or kernel recovery algorithm).
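Modifying the basic kernel purely by changing the read order can be sketched as follows. This is a toy model, not the disclosed memory interface: the dictionary standing in for memory 200, the 3x3 kernel values, and the helper `read_kernel` are assumptions for illustration.

```python
# Toy memory: (row address, column address) -> weight of a 3x3 basic kernel (values 1..9).
memory = {(r, c): 3 * r + c + 1 for r in range(3) for c in range(3)}


def read_kernel(address_order):
    """Fetch weights in the given address order and pack them into 3x3 rows."""
    values = [memory[address] for address in address_order]
    return [values[i:i + 3] for i in range(0, 9, 3)]


# Address sequences sent to memory. Only the order differs, not the stored data.
as_is = [(r, c) for r in range(3) for c in range(3)]
transposed = [(c, r) for r in range(3) for c in range(3)]
rotated_180 = [(2 - r, 2 - c) for r in range(3) for c in range(3)]

kernel_as_is = read_kernel(as_is)              # [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
kernel_transposed = read_kernel(transposed)    # [[1, 4, 7], [2, 5, 8], [3, 6, 9]]
kernel_rotated = read_kernel(rotated_180)      # [[9, 8, 7], [6, 5, 4], [3, 2, 1]]
```

A single stored copy of the basic kernel yields three distinct kernels, so no additional kernel storage or arithmetic is needed for these variants.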
[0336] For example, the kernel generation algorithm (or kernel recovery algorithm) can instruct: the first layer (or the first channel) to use the basic kernel as is; the kernels corresponding to the second layer (or the second channel) to the fourth layer (or the fourth channel) to be generated by rotating the basic kernel; the kernel corresponding to the fifth layer (or the fifth channel) to be generated by transposing the basic kernel; the kernels corresponding to the sixth layer (or the sixth channel) to the eighth layer (or the eighth channel) to be generated by adding or subtracting kernel filters to the basic kernel; and the kernel corresponding to the ninth layer to be rotated while multiplying by the kernel filter.
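The per-layer rules in this example can be sketched as a small dispatch function. Several details are assumptions not stated in the text: layers 2 to 4 are taken to rotate the basic kernel clockwise by 90, 180, and 270 degrees, layers 6 to 8 are shown with addition only, the multiplication in layer 9 is taken to be element-wise and applied before a 90-degree rotation, and all names and weight values are hypothetical.

```python
def rotate90(kernel):
    """Rotate a kernel 90 degrees clockwise."""
    return [list(row) for row in zip(*kernel[::-1])]


def transpose(kernel):
    return [list(row) for row in zip(*kernel)]


def add(kernel, kernel_filter):
    return [[k + f for k, f in zip(rk, rf)] for rk, rf in zip(kernel, kernel_filter)]


def multiply(kernel, kernel_filter):
    return [[k * f for k, f in zip(rk, rf)] for rk, rf in zip(kernel, kernel_filter)]


def generate_kernel(layer, base, filters):
    """Generate the kernel of a layer (or channel) from the basic kernel."""
    if layer == 1:
        return base                                   # use the basic kernel as is
    if 2 <= layer <= 4:
        kernel = base
        for _ in range(layer - 1):                    # assumed: 90, 180, 270 degrees
            kernel = rotate90(kernel)
        return kernel
    if layer == 5:
        return transpose(base)
    if 6 <= layer <= 8:
        return add(base, filters[layer])              # subtraction is a negative filter
    if layer == 9:
        return rotate90(multiply(base, filters[9]))   # assumed order: multiply, then rotate
    raise ValueError(f"no rule for layer {layer}")


# Hypothetical basic kernel and kernel filters.
base = [[1, 2], [3, 4]]
filters = {6: [[1, 0], [0, -1]], 7: [[0, 0], [0, 0]], 8: [[-1, -1], [-1, -1]], 9: [[2, 2], [2, 2]]}

kernels = {layer: generate_kernel(layer, base, filters) for layer in range(1, 10)}
```

In this arrangement only the basic kernel, four small filters, and the rule table need to be stored to produce nine kernels.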
[0337] In particular, the third method of modifying the basic kernel itself may be effective for training convolutional neural networks for object recognition. For example, if rotation and transposition are applied, images rotated at various angles can be effectively trained and recognized. That is, when an artificial neural network learns to recognize a specific object, the recognition rate for rotated or transposed images of that object can also be improved if there is a first kernel obtained by rotating the basic kernel and a second kernel obtained by transposing the basic kernel. In other words, an artificial neural network that has learned only a frontal face may fail to recognize a face rotated by 180 degrees, because the positions of the eyes, nose, and mouth are reversed. In particular, according to the examples of this disclosure, each corresponding kernel can be obtained by rotating or transposing the basic kernel rather than by being read separately from memory 200. Therefore, in terms of memory reads, power consumption is reduced.
[0338] The methods for generating another kernel from a basic kernel are not limited to this; various algorithms that can be implemented by the user through programs can be utilized.
[0339] As described above, by applying the kernel filter to the base kernel, the original kernel can be restored or a kernel similar to the original kernel can be generated. Therefore, this effectively reduces the capacity of memory 200. In other words, if the base kernel is chosen such that the kernel filter value is minimized, the data size of the kernel filter can be minimized, and the bit width of the data storing the kernel filter weights can be minimized.
[0340] In other words, even if the kernels of all layers (or channels) are not stored in memory 200, other kernels can be regenerated using only the basic kernel. Therefore, memory usage can be effectively reduced and operating speed can be improved.
[0341] Furthermore, by using predetermined kernel filters for each layer, the required memory can be reduced compared to storing the original kernel for each layer, and the kernel filters for each layer, determined after the training process, can be flexibly applied according to the level of artificial intelligence needs. Therefore, user-customized artificial intelligence optimized for the user's environment can be provided.
[0342] Figure 7 illustrates the basic structure of a convolutional neural network.
[0343] Referring to Figure 7, when moving from the current layer to the next layer, the convolutional neural network can reflect the weights between layers through convolution and transfer the weights to the next layer.
[0344] For example, a convolution is defined by two main parameters: the size of the patch extracted from the input data (typically a 1×1, 3×3, or 5×5 matrix) and the depth of the output feature map (the number of kernels). These convolutions may start at a depth of 32, continue to a depth of 64, and end at a depth of 128 or 256.
[0345] Convolution can be performed by sliding a 3×3 or 5×5 window across a 3D input feature map, stopping at any position, and extracting a 3D block of surrounding features.
[0346] Each of these 3-D blocks can be transformed into a one-dimensional (1-D) vector through a tensor product with the same trained weight matrix, which is called the weights. These vectors can then be spatially recombined to form a 3-D output feature map. All spatial locations in the output feature map can correspond to the same locations in the input feature map.
[0347] A convolutional neural network can include convolutional layers that perform convolution operations between the input data and the kernel (i.e., the weight matrix) trained through multiple gradient update iterations during training. If (m, n) is the kernel size and W is the weight value, the convolutional layer can perform convolution between the input data and the weight matrix by calculating the dot product.
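The dot product mentioned above can be written out explicitly. The following is a sketch under stated assumptions: the symbols X (input data), Y (output feature map), and the indices i, j, a, b are introduced here for illustration, and a stride of 1 with no padding is assumed; only (m, n) and W come from the text.

```latex
Y_{i,j} \;=\; \sum_{a=0}^{m-1} \sum_{b=0}^{n-1} W_{a,b}\, X_{i+a,\; j+b}
```

Each output element is thus the sum of the element-wise products between the m × n kernel and the m × n region of the input whose top-left corner is at (i, j).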
[0348] A convolutional neural network can be tuned or trained so that input data leads to a specific output estimate. The convolutional neural network can be tuned using backpropagation based on a comparison between the ground truth and the output estimate until the output estimate gradually matches or approaches the ground truth.
[0349] Convolutional neural networks can be trained by adjusting the weights between neurons based on the difference between the ground truth and the actual output.
[0350] Figure 8 shows the input data 300 of the convolutional layer and the kernel 340 used for convolution operations.
[0351] Input data 300 can be an image or an image displayed as a two-dimensional matrix consisting of rows 310 of a specific size and columns 320 of a specific size. Input data 300 can have multiple channels 330, where channels 330 can represent the number of color components of the input data image.
[0352] Meanwhile, kernel 340 can be a set of common parameters used for convolution to extract features from specific portions of the input data 300 while scanning them. Similar to the input data image, kernel 340 can be configured with rows 350 of a specific size, columns 360 of a specific size, and a specific number of channels 370. Typically, the size of rows 350 and columns 360 of kernel 340 is set to the same, and the number of channels 370 can be the same as the number of channels 330 in the input data image.
[0353] Figure 9 shows the operation of a convolutional neural network that uses kernels to generate activation maps.
[0354] Kernel 410 can ultimately generate feature map 430 by traversing input data 420 at specified intervals and performing convolutions. When kernel 410 is applied to a portion of input data 420, convolution can be performed by multiplying the input data value at a specific location of that portion with the value at the corresponding location of kernel 410, and then summing all the generated values.
[0355] This convolution process generates computed values for the feature map, and each time kernel 410 iterates through input data 420, it generates the resulting convolution values to configure feature map 430. Each element value of the feature map is transformed into feature map 430 through the activation function of the convolutional layer.
[0356] In Figure 9, the input data 420 to the convolutional layer is represented by a 4×4 two-dimensional matrix, and the kernel 410 is represented by a 3×3 two-dimensional matrix. However, the sizes of the input data 420 and the kernel 410 of the convolutional layer are not limited to these and can be varied according to the performance and requirements of the convolutional neural network including the convolutional layer.
[0357] As shown in the figure, when input data 420 is fed into the convolutional layer, kernel 410 traverses the input data 420 at a predetermined interval (e.g., 1), and element-wise multiplication is performed by multiplying each value of the input data 420 by the value at the same position in kernel 410. Kernel 410 can traverse the input data 420 at regular intervals and sum the values obtained through these multiplications.
[0358] Specifically, kernel 410 assigns the element-wise multiplication value "fifteen", calculated at a specific position 421 of the input data 420, to the corresponding element 431 of the feature map 430. Kernel 410 assigns the element-wise multiplication value "sixteen", calculated at the next position 422 of the input data 420, to the corresponding element 432 of the feature map 430. Kernel 410 assigns the element-wise multiplication value "six", calculated at the next position 423 of the input data 420, to the corresponding element 433 of the feature map 430. Next, kernel 410 assigns the element-wise multiplication value "fifteen", calculated at the next position 424 of the input data 420, to the corresponding element 434 of the feature map 430.
[0359] As described above, when kernel 410 assigns the element-wise multiplication values of all corresponding positions calculated during the traversal of input data 420 to feature map 430, feature map 430 with a size of 2×2 can be completed.
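The traversal described above can be sketched in a few lines. The input values and kernel weights below are made up and do not reproduce the values shown in Figure 9; the function name `convolve2d` and its `stride` parameter are likewise assumptions. The sketch only illustrates how a 3×3 kernel sliding over a 4×4 input at an interval of 1 yields a 2×2 feature map.

```python
def convolve2d(data, kernel, stride=1):
    """Slide `kernel` over `data`, multiplying element-wise and summing."""
    rows, cols = len(data), len(data[0])
    k_rows, k_cols = len(kernel), len(kernel[0])
    feature_map = []
    for top in range(0, rows - k_rows + 1, stride):
        out_row = []
        for left in range(0, cols - k_cols + 1, stride):
            acc = 0
            for a in range(k_rows):
                for b in range(k_cols):
                    acc += data[top + a][left + b] * kernel[a][b]
            out_row.append(acc)
        feature_map.append(out_row)
    return feature_map


# Hypothetical 4x4 input and 3x3 kernel (not the values of Figure 9).
input_data = [
    [1, 0, 1, 0],
    [0, 1, 1, 0],
    [1, 0, 1, 1],
    [0, 1, 0, 1],
]
kernel = [
    [1, 0, 1],
    [0, 1, 0],
    [1, 0, 1],
]

feature_map = convolve2d(input_data, kernel)
print(feature_map)  # [[5, 2], [1, 4]]
```

With a 4×4 input, a 3×3 kernel, and an interval of 1, there are only two valid positions along each axis, which is why the result is 2×2.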
[0360] At this point, if the input data 510 consists of, for example, three channels (R channel, G channel, B channel), then a feature map for each channel can be generated by convolution, in which the same kernel or a different kernel for each channel traverses the data of each channel of the input data 420 and performs element-wise multiplication and addition.
[0361] Figure 10 illustrates the operation of the convolutional neural network described with reference to Figures 7 to 9.
[0362] Referring to Figure 10, the input image is shown, for example, as a two-dimensional matrix of size 5×5. Furthermore, the example of Figure 10 uses three nodes, namely Channel 1, Channel 2, and Channel 3.
[0363] First, the convolution operation of the first layer will be described.
[0364] The input image is convolved with the first kernel of channel one at the first node of the first layer, and the result is output as feature map 1. Furthermore, the input image is convolved with the second kernel of channel two at the second node of the first layer, and the result is output as feature map 2. Furthermore, the input image is convolved with the third kernel of channel three at the third node, and the result is output as feature map 3.
[0365] Next, the pooling operations used for the second layer will be described.
[0366] Feature map 1, feature map 2, and feature map 3 output from the first layer are input to the three nodes of the second layer. The second layer can receive the feature map output from the first layer as input and perform pooling. Pooling can reduce the size of a matrix or emphasize specific values. Pooling methods can include max pooling, average pooling, and min pooling. Max pooling is used to collect the maximum value within a specific region of the matrix, while average pooling can be used to find the average value within a specific region.
[0367] In the example of Figure 10, pooling is used to reduce the feature map size from a 5×5 matrix to a 4×4 matrix.
[0368] Specifically, the first node of the second layer receives feature map 1 from channel one as input, performs pooling, and outputs, for example, a 4×4 matrix. The second node of the second layer receives feature map 2 from channel two as input, performs pooling, and outputs, for example, a 4×4 matrix. The third node of the second layer receives feature map 3 from channel three as input, performs pooling, and outputs, for example, a 4×4 matrix.
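The pooling step can be sketched as below. The text does not state the pooling window or interval; a 2×2 window moved at an interval of 1 is assumed here because it is one setting that turns a 5×5 matrix into a 4×4 matrix. The function names `pool2d` and `mean`, the `reducer` parameter, and the feature map values are hypothetical.

```python
def pool2d(feature_map, window=2, stride=1, reducer=max):
    """Apply `reducer` (e.g. max, min, or a mean) to each window x window region."""
    rows, cols = len(feature_map), len(feature_map[0])
    pooled = []
    for top in range(0, rows - window + 1, stride):
        out_row = []
        for left in range(0, cols - window + 1, stride):
            region = [feature_map[top + a][left + b]
                      for a in range(window) for b in range(window)]
            out_row.append(reducer(region))
        pooled.append(out_row)
    return pooled


def mean(values):
    return sum(values) / len(values)


# Hypothetical 5x5 feature map holding the values 0..24 row by row.
feature_map = [[r * 5 + c for c in range(5)] for r in range(5)]

max_pooled = pool2d(feature_map, reducer=max)   # max pooling
avg_pooled = pool2d(feature_map, reducer=mean)  # average pooling
min_pooled = pool2d(feature_map, reducer=min)   # min pooling
```

Passing the reduction as a parameter keeps the three pooling variants named in the text in one routine.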
[0369] Next, the convolution operation of the third layer will be described.
[0370] The first node of the third layer receives the output from the first node of the second layer as input, performs convolution with the fourth kernel, and outputs the result. The second node of the third layer receives the output from the second node of the second layer as input, performs convolution with the fifth kernel of channel two, and outputs the result. Similarly, the third node of the third layer receives the output from the third node of the second layer as input, performs convolution with the sixth kernel of channel three, and outputs the result.
[0371] In this way, convolution and pooling are repeated, and finally, as shown in Figure 7, the output can be generated by a fully connected layer. The corresponding output can then be fed back into the artificial neural network for image recognition.
[0372] Figure 11 shows the generation of a kernel filter.
[0373] For artificial neural network models, there may be multiple kernels used for multiple layers and / or multiple channels.
[0374] The example of Figure 11 illustrates the kernel of the u-th layer (or u-th channel) and the kernel of the i-th layer (or i-th channel). Multiple such kernels can be stored in memory 200 as shown in Figure 5A, Figure 6A, or Figure 6B.
[0375] The u-th layer / channel may include a first kernel indicated by diagonal stripes, and the i-th layer / channel may include a second kernel indicated by a grid pattern.
[0376] One of the first kernel and the second kernel can be set as the base kernel.
[0377] Kernel filters can be generated by performing an arbitrary operation alpha (α) on multiple kernels. For example, as shown in Figure 11, an arbitrary operation α or transformation is performed on the first kernel of the u-th layer / channel and the second kernel of the i-th layer / channel to generate a kernel filter. Operation α can include, for example, addition, multiplication, division, combinations of arithmetic operations, convolution, and various other operations.
[0378] Since the generated kernel filter has a smaller bit width than the original kernel, the advantage is that it reduces the burden of accessing memory 200.
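To make the bit-width argument concrete, the sketch below takes operation α to be an element-wise difference between two kernels. That choice is an assumption made for illustration; the text leaves α open. The kernel values and the helper names `kernel_filter` and `signed_bit_width` are also hypothetical.

```python
def kernel_filter(base, other):
    """Element-wise difference `other - base` (assumed choice of operation alpha)."""
    return [[o - b for b, o in zip(b_row, o_row)]
            for b_row, o_row in zip(base, other)]


def signed_bit_width(matrix):
    """Smallest two's-complement width able to hold every element of `matrix`."""
    width = 1
    for row in matrix:
        for value in row:
            # A non-negative v needs v.bit_length() + 1 bits (sign bit included);
            # a negative v needs (-v - 1).bit_length() + 1 bits.
            magnitude = value if value >= 0 else -value - 1
            width = max(width, magnitude.bit_length() + 1)
    return width


# Two similar kernels with made-up 16-bit-range weights.
first_kernel = [[1200, -830], [415, 9050]]
second_kernel = [[1203, -835], [410, 9061]]

filt = kernel_filter(first_kernel, second_kernel)
print(filt)                            # [[3, -5], [-5, 11]]
print(signed_bit_width(first_kernel))  # 15
print(signed_bit_width(filt))          # 5
```

When the two kernels are similar, the differences are small numbers, so the filter fits in far fewer bits per element than either kernel.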
[0379] The kernel filter generated in Figure 11 can be stored in the NPU internal memory 120 of Figure 1 or in the memory 200 of Figure 5A, 6A, or 6B.
[0380] As shown in Figure 11, kernel filters are generated from kernels of different layers, but kernel filters can also be generated from kernels of the same layer.
[0381] According to one example, the u-th layer and the i-th layer can be adjacent layers or distant layers, or at least three kernels can be combined in various ways to generate kernel filters.
[0382] As shown in Figure 11, if the kernel filter is generated and stored in memory 120 or 200, then the kernel generator 300 of Figure 5A, the kernel generator 150 of Figure 6A, or the kernel generator 240 of Figure 6B can restore the original kernel or generate a kernel similar to the original kernel by combining a base kernel and a kernel filter.
[0383] During the training of an artificial neural network (e.g., CNN) model, kernel filters can be set to have a small bit width (or small bit size). For example, when training an artificial neural network (such as a CNN) model, if multiple kernel filter candidates can be generated, any one with the smallest bit width (or small bit size) can be selected as the kernel filter.
[0384] Kernel filters can be generated using various combinations of programmable kernels. For example, during training, a convolutional neural network is trained in a direction that minimizes the differences between kernels in adjacent layers while simultaneously minimizing the difference between the estimated and target values. In this case, the kernel filter can be determined based on the differences between the kernels of the layers. Alternatively, kernel filters can be generated using methods other than addition, multiplication, division, combinations of arithmetic operations, and convolution operations between the kernels of the layers.
[0385] Figure 12 shows an example of restoring the original kernel or generating a kernel similar to the original kernel.
[0386] Referring to Figure 12, the basic kernel is illustrated as a 4×4 matrix. If the matrix elements of the basic kernel have a bit width of 16 bits, the total data size of the 4×4 matrix is 256 bits. The kernel filter in Figure 12 is also illustrated as a 4×4 matrix. If the matrix elements of the kernel filter have a bit width of 5 bits, the total data size of the 4×4 kernel filter is 80 bits.
[0387] When a recovery operation is performed based on the basic kernel and the first kernel filter, a first recovery (or modulation) kernel can be generated.
[0388] Furthermore, when performing a recovery operation based on the basic kernel and the second kernel filter, a second recovery (or modulation) kernel can be generated.
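The two recovery operations of Figure 12 can be sketched as follows, assuming the recovery operation is an element-wise addition of the kernel filter to the basic kernel. The text does not fix the operation, so this is only one possibility. All matrix values are made up, with the filter values chosen to fit in a signed 5-bit range; `apply_filter` and `data_size_bits` are hypothetical helper names.

```python
def apply_filter(kernel, kernel_filter):
    """Recover a kernel by adding a filter element-wise (assumed operation)."""
    return [[k + f for k, f in zip(k_row, f_row)]
            for k_row, f_row in zip(kernel, kernel_filter)]


def data_size_bits(rows, cols, bit_width):
    """Total storage for a rows x cols matrix at the given per-element width."""
    return rows * cols * bit_width


# Hypothetical 4x4 basic kernel (16-bit elements).
base_kernel = [
    [100, -200, 300, -400],
    [50, 60, -70, 80],
    [1000, -1000, 500, -500],
    [7, 8, 9, 10],
]
# Hypothetical 4x4 kernel filters (each value fits in a signed 5-bit range).
first_filter = [
    [1, -1, 2, -2],
    [0, 3, -3, 4],
    [5, -5, 6, -6],
    [7, -7, 8, -8],
]
second_filter = [
    [-4, 4, 0, 1],
    [2, -2, 1, -1],
    [0, 0, 3, -3],
    [15, -16, 1, 0],
]

# Both recovered kernels are derived directly from the same basic kernel.
first_recovered = apply_filter(base_kernel, first_filter)
second_recovered = apply_filter(base_kernel, second_filter)

print(data_size_bits(4, 4, 16))  # 256 bits for the basic kernel
print(data_size_bits(4, 4, 5))   # 80 bits for each kernel filter
```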
[0389] In Figure 12, the first and second recovery (or modulation) kernels are represented, by way of example, as 4 × 4 matrices. However, alternatively, the first or second recovery (or modulation) kernel may be, for example, larger or smaller than the matrix size of the basic kernel. For example, the first recovery (or modulation) kernel may be a 5 × 5 matrix, and the second recovery (or modulation) kernel may be a 3 × 3 matrix. Conversely, the first recovery (or modulation) kernel may be a 3 × 3 matrix, and the second recovery (or modulation) kernel may be a 5 × 5 matrix.
[0390] Figure 13 shows another example of restoring the original kernel or generating a kernel similar to the original kernel.
[0391] Referring to Figure 13, the basic kernel is shown, as an example, as a 4×4 matrix. If the matrix elements of the basic kernel have a bit width of 16 bits, the total data size of the 4×4 matrix is 256 bits. The kernel filter in Figure 13 is also shown, as an example, as a 4×4 matrix. If the matrix elements of the kernel filter have a bit width of 5 bits, the total data size of the 4×4 kernel filter is 80 bits.
[0392] When a recovery operation is performed based on the basic kernel and the first kernel filter, a first recovery (or modulation) kernel can be generated.
[0393] Unlike the example of Figure 12, in Figure 13 the second kernel filter may be applied not to the basic kernel but to the first recovered (or modulated) kernel.
[0394] Specifically, when a recovery operation is performed based on the first recovery (or modulation) kernel and the second kernel filter, a second recovery (or modulation) kernel can be generated.
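The chained recovery of Figure 13 differs from a direct recovery only in which kernel the second filter is applied to. The sketch below again assumes element-wise addition as the recovery operation, uses 2×2 matrices with made-up values for brevity, and introduces the hypothetical helper `apply_filter`.

```python
def apply_filter(kernel, kernel_filter):
    """Recover a kernel by adding a filter element-wise (assumed operation)."""
    return [[k + f for k, f in zip(k_row, f_row)]
            for k_row, f_row in zip(kernel, kernel_filter)]


base_kernel = [[10, 20], [30, 40]]   # hypothetical values
first_filter = [[1, -1], [2, -2]]
second_filter = [[3, 0], [-3, 1]]

# Chained recovery: the second filter is applied to the first recovered kernel.
first_recovered = apply_filter(base_kernel, first_filter)
second_recovered = apply_filter(first_recovered, second_filter)

# For contrast, applying the second filter directly to the basic kernel.
second_direct = apply_filter(base_kernel, second_filter)

print(first_recovered)   # [[11, 19], [32, 38]]
print(second_recovered)  # [[14, 19], [29, 39]]
print(second_direct)     # [[13, 20], [27, 41]]
```

In the chained form, the second filter only has to encode the difference from the first recovered kernel, which may be smaller than its difference from the basic kernel.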
[0395] In Figure 13, the first and second recovery (or modulation) kernels are represented, by way of example, as 4 × 4 matrices. However, unlike this, the first or second recovery (or modulation) kernel can be, for example, larger or smaller than the matrix size of the basic kernel. For example, the first recovery (or modulation) kernel can be a 5 × 5 matrix, and the second recovery (or modulation) kernel can be a 3 × 3 matrix. Alternatively, the first recovery (or modulation) kernel can be a 3 × 3 matrix, and the second recovery (or modulation) kernel can be a 5 × 5 matrix.
[0396] Figure 14 shows another example of restoring the original kernel or generating a kernel similar to the original kernel.
[0397] Referring to Figure 14, the basic kernel is shown, as an example, as a 4×4 matrix. If the matrix elements of the basic kernel have a bit width of 16 bits, the total data size of the 4×4 matrix is 256 bits. The kernel filter in Figure 14 is also shown, as an example, as a 4×4 matrix. If the matrix elements of the kernel filter have a bit width of 5 bits, the total data size of the 4×4 kernel filter is 80 bits.
[0398] Unlike the example of Figure 13, in Figure 14 a second recovered (or modulated) kernel can be generated by performing an arbitrary operation on the first kernel filter and the second kernel filter.
[0399] Specifically, when performing a recovery operation based on the basic kernel and the first kernel filter, a first recovery (or modulation) kernel can be generated.
[0400] Furthermore, when a recovery operation is performed based on the first kernel filter and the second kernel filter, a second recovery (or modulation) kernel can be generated.
[0401] In Figure 14, the first and second recovery (or modulation) kernels are represented, by way of example, as 4 × 4 matrices. However, alternatively, the first or second recovery (or modulation) kernel may be, for example, larger or smaller than the matrix size of the basic kernel. For example, the first recovery (or modulation) kernel may be a 5 × 5 matrix, and the second recovery (or modulation) kernel may be a 3 × 3 matrix. Conversely, the first recovery (or modulation) kernel may be a 3 × 3 matrix, and the second recovery (or modulation) kernel may be a 5 × 5 matrix.
[0402] Figure 15 shows an example of generating another kernel by rotating the base kernel.
[0403] Referring to Figure 15, another basic kernel can be generated by rotating the basic kernel. Compared to the examples in Figures 12 to 14, in the example of Figure 15 another kernel can be generated by modifying the base kernel itself without using a kernel filter.
[0404] Therefore, compared with the examples of Figures 12 to 14, in which both the basic kernel and the kernel filter must be loaded from memory, this has the effect of reducing the amount of data to be transferred. Furthermore, depending on the required AI performance, it can be applied together with a kernel filter to operate at low power.
[0405] Figure 16 shows an example of generating another kernel by transposing the base kernel.
[0406] Referring to Figure 16, another base kernel can be generated by transposing the base kernel. Compared to the examples in Figures 12 to 14, the example of Figure 16 can also generate another kernel by modifying the base kernel itself without using a kernel filter.
[0407] Therefore, compared with the examples of Figures 12 to 14, in which both the base kernel and the kernel filter must be loaded from memory, this reduces the amount of data transferred. Furthermore, depending on the required AI performance, it can be applied together with a kernel filter to operate at lower power consumption.
[0408] The rotation shown in Figure 15 and the transpose shown in Figure 16 are merely examples, and kernels can be generated using various algorithms that can be implemented as programs. Various kernel generation methods, including rotation and transpose, can be appropriately selected and applied simultaneously, and convolutional neural networks can perform operations to find the optimal combination.
[0409] Figure 17 shows an example of generating another kernel by transposing the base kernel.
[0410] Referring to Figure 17, the basic kernel is shown, as an example, as a 4×4 matrix.
[0411] When a recovery operation is performed based on the basic kernel and the first kernel filter, a first recovery (or modulation) kernel can be generated.
[0412] Furthermore, when the first recovered (or modulated) kernel is transposed as shown in Figure 16, a second recovered (or modulated) kernel can be generated.
[0413] Furthermore, if the first kernel filter is rotated, a third recovered (or modulated) kernel can be generated.
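The combination of filtering, transposition, and rotation described for Figure 17 can be sketched as below. Two points are assumptions made here, not statements of the disclosure: the recovery operation is taken to be element-wise addition, and the third kernel is taken to be the basic kernel plus the rotated first kernel filter, which is only one possible reading of the text. The 2×2 values and the helper names are hypothetical.

```python
def transpose(matrix):
    """Swap rows and columns."""
    return [list(col) for col in zip(*matrix)]


def rotate_cw(matrix):
    """Rotate 90 degrees clockwise."""
    return [list(col) for col in zip(*matrix[::-1])]


def add(a, b):
    """Element-wise addition (assumed recovery operation)."""
    return [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


base_kernel = [[10, 20], [30, 40]]   # hypothetical values
first_filter = [[1, 2], [3, 4]]

first_recovered = add(base_kernel, first_filter)              # filter applied
second_recovered = transpose(first_recovered)                 # then transposed
third_recovered = add(base_kernel, rotate_cw(first_filter))   # rotated filter

print(first_recovered)   # [[11, 22], [33, 44]]
print(second_recovered)  # [[11, 33], [22, 44]]
print(third_recovered)   # [[13, 21], [34, 42]]
```

All three kernels are derived from one stored basic kernel and one stored kernel filter.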
[0414] Figure 18 shows kernel generation (or kernel recovery) algorithms in a table for better understanding.
[0415] A kernel generation algorithm (or kernel recovery algorithm) can be an algorithm that defines the computational processing method for input data through a training process. A kernel generation algorithm (or kernel recovery algorithm) can be generated based on an algorithm that has been determined to have optimal accuracy through a series of training processes.
[0416] Kernel generation algorithms (or kernel recovery algorithms) may include the number of layers used, input data information, arithmetic processing methods, and the order in which the kernel is retrieved from memory.
[0417] In addition, the kernel generation algorithm (or kernel recovery algorithm) may include information, i.e., mapping information, for recovering the original kernel of a specific layer or generating a kernel similar to the original kernel.
[0418] Methods for recovering other original kernels (i.e., modulated kernels or recovered kernels) using a base kernel or generating kernels similar to the original kernel can be applied differently for each layer or each channel. Specifically, methods for recovering the original kernel or generating kernels similar to the original kernel can include one of the following: a first method of using the base kernel as is in another layer or channel; a second method of using a kernel filter; a third method of modifying the base kernel itself without considering the kernel filter; and a fourth method of modifying both the kernel filter and the base kernel simultaneously.
[0419] For example, a kernel generation algorithm (or kernel restoration algorithm) can instruct: the first layer to use the base kernel as is; the kernels corresponding to the second to fourth layers to be generated by rotating the base kernel; the kernels corresponding to the fifth layer to be generated by transposing the base kernel; the weights corresponding to the sixth to eighth layers to be generated by adding or subtracting kernel filters to the base kernel; and the kernels corresponding to the ninth layer to be generated by rotating the kernel while adding kernel filters.
[0420] According to this disclosure, if kernel filters for each layer are used during training, the amount of memory used can be reduced compared to storing the entire kernel (i.e., weight matrix) for each layer. Furthermore, the kernel filters between layers determined during training can be flexibly adjusted according to the level of artificial intelligence required. Therefore, it has the effect of providing user-customized artificial intelligence optimized for the user's environment.
[0421] Referring to Figure 18, the first kernel of channel one in layer one is determined as the basic kernel in layer one. The second kernel of channel two in layer one can be recovered (or generated) by combining the first kernel, which corresponds to the basic kernel, and the second kernel filter. The third kernel of channel three in layer one can be recovered (or generated) from the first kernel, which corresponds to the basic kernel, the first kernel filter, and a rotation. Although not shown in the table, information about whether the rotation is performed on the first kernel or the first kernel filter may also be required. The fourth kernel of channel four in layer one can be recovered (or generated) from the first kernel, the second kernel filter, and a transpose. Although not shown in the table, information about whether the transpose is performed on the first kernel or the second kernel filter may also be required.
[0422] Meanwhile, the kernel of channel one in layer eleven is the tenth kernel, and can also be the base kernel of layer eleven. The eleventh kernel of channel two in layer eleven can be recovered (or generated) by combining the tenth kernel corresponding to the base kernel and the sixth kernel filter. Additionally, the twelfth kernel of channel two in layer eleven can be recovered (or generated) by combining the tenth kernel corresponding to the base kernel and the eighth kernel filter.
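Mapping information of this kind can be pictured as a small lookup table that a kernel generator interprets. The sketch below is loosely modeled on the layer-one entries described above and is not the table of Figure 18. The dictionary layout, the field names `base`, `filter`, `transform`, and `target`, the 2×2 values, and the use of element-wise addition as the combining operation are all assumptions. The `target` field records whether a rotation or transpose applies to the base kernel or to the kernel filter, which is the extra information the text says may be required.

```python
def transpose(matrix):
    return [list(col) for col in zip(*matrix)]


def rotate_cw(matrix):
    return [list(col) for col in zip(*matrix[::-1])]


def add(a, b):
    return [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


# Hypothetical stored data: one base kernel and two kernel filters.
BASE_KERNELS = {"kernel_1": [[1, 2], [3, 4]]}
KERNEL_FILTERS = {"filter_1": [[1, 0], [0, 1]], "filter_2": [[0, 1], [2, 0]]}
TRANSFORMS = {"rotate": rotate_cw, "transpose": transpose}

# Hypothetical mapping information keyed by (layer, channel).
MAPPING = {
    (1, 1): {"base": "kernel_1", "filter": None, "transform": None, "target": None},
    (1, 2): {"base": "kernel_1", "filter": "filter_2", "transform": None, "target": None},
    (1, 3): {"base": "kernel_1", "filter": "filter_1", "transform": "rotate", "target": "base"},
    (1, 4): {"base": "kernel_1", "filter": "filter_2", "transform": "transpose", "target": "filter"},
}


def restore_kernel(layer, channel):
    """Rebuild the kernel for (layer, channel) from its mapping entry."""
    entry = MAPPING[(layer, channel)]
    kernel = [row[:] for row in BASE_KERNELS[entry["base"]]]
    kfilter = KERNEL_FILTERS[entry["filter"]] if entry["filter"] else None
    if entry["transform"]:
        transform = TRANSFORMS[entry["transform"]]
        if entry["target"] == "base":
            kernel = transform(kernel)
        else:
            kfilter = transform(kfilter)
    return add(kernel, kfilter) if kfilter is not None else kernel


restored = {key: restore_kernel(*key) for key in MAPPING}
```

Only the mapping entries, the base kernel, and the filters need to be stored; every other kernel is rebuilt on demand.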
[0423] The size of the information used to restore the kernel of a specific layer or to generate a kernel similar to the original kernel (i.e., the mapping information, such as the table shown in Figure 18) is at most tens or hundreds of kilobytes (kB). Therefore, the storage capacity is significantly reduced compared to the size required to store the entire kernels of all layers (e.g., hundreds of megabytes as in known techniques).
[0424] Figure 19 illustrates the concept of restoring the structure of an artificial neural network (e.g., CNN) model using multiple basic kernels and multiple kernel filters.
[0425] As shown in Figure 19, multiple kernel filters can exist corresponding to the first basic kernel, and multiple kernel filters can exist corresponding to the second basic kernel. In the example shown in Figure 19, each basic kernel can be, for example, 256 bits, and each kernel filter can be, for example, 16 bits.
[0426] Figure 19 shows that the first and second layer kernels are restored (or generated) when operations are performed by combining the first base kernel and the corresponding kernel filters, and that the third and fourth layer kernels are restored (or generated) when operations are performed by combining the second base kernel and the corresponding kernel filters.
[0427] In the example shown in Figure 19, since there are four layers, each requiring three kernels, a total of twelve kernels are used. In this case, storing twelve 256-bit kernels in memory would require 3,072 bits. However, when using kernel filters, the total memory size required is reduced to 672 bits, which includes two 256-bit basic kernels and five 16-bit kernel filters for each basic kernel. As mentioned above, using kernel filters has the advantage of significantly reducing the required memory size.
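The memory comparison for the Figure 19 example can be checked with simple arithmetic. The sketch assumes that five kernel filters are associated with each of the two basic kernels (ten filters in total), which is the reading under which the stated 672-bit total holds; the constant names are hypothetical.

```python
BASE_KERNEL_BITS = 256      # size of one basic kernel (from the example)
KERNEL_FILTER_BITS = 16     # size of one kernel filter (from the example)
LAYERS = 4
KERNELS_PER_LAYER = 3
NUM_BASE_KERNELS = 2
FILTERS_PER_BASE = 5        # assumption: five filters per basic kernel

total_kernels = LAYERS * KERNELS_PER_LAYER

# Storing every kernel in full.
bits_without_filters = total_kernels * BASE_KERNEL_BITS

# Storing two basic kernels plus the filters for the remaining ten kernels.
bits_with_filters = (NUM_BASE_KERNELS * BASE_KERNEL_BITS
                     + NUM_BASE_KERNELS * FILTERS_PER_BASE * KERNEL_FILTER_BITS)

print(total_kernels)         # 12
print(bits_without_filters)  # 3072
print(bits_with_filters)     # 672
```

Under these assumptions, the filter-based layout needs 672 / 3072 ≈ 22% of the storage of the full layout.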
[0428] Figure 20 shows the process for determining the basic kernel and kernel filters.
[0429] The procedure shown in Figure 20 can be executed during machine learning of artificial neural networks such as convolutional neural networks. Machine learning can be an algorithm in which a computer learns from data, discovers patterns on its own, and learns to take appropriate actions. For example, machine learning can include supervised learning, unsupervised learning, and reinforcement learning.
[0430] In step S2001, the kernel (i.e., a matrix including weight values) of each layer and channel to be applied to the artificial neural network model (e.g., a convolutional neural network model) can be determined.
[0431] For example, when the input information is an image and the image can be divided into three channels: red, green, and blue, three kernels for each of the three channels can be determined for each layer. Specifically, three kernels for the three channels can be determined in the first layer, and three kernels for the three channels can be determined in the second layer. Alternatively, when the input image can be divided into five channels, five kernels for each of the five channels can be determined for each layer. Alternatively, multiple kernel candidates can be determined for each channel. For example, if two kernel candidates are determined for each channel, and there are five channels, a total of ten kernel candidates can be determined.
[0432] In step S2003, at least one basic kernel can be selected from multiple kernels to be applied to each layer and channel. The selected basic kernel can minimize the bit width (or data size) of the kernel filter.
[0433] In step S2005, a kernel filter can be determined based on the correlation between the selected base kernel and other kernels.
[0434] For example, when there are three kernels in the three channels of the first layer, any one of the three kernels can be selected as the base kernel of the first layer. Specifically, the first kernel can be selected as the base kernel from the first kernel, the second kernel, and the third kernel. Furthermore, the first kernel filter can be determined based on the correlation between the base kernel and the second kernel, while the second kernel filter can be determined based on the correlation between the base kernel and the third kernel.
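Steps S2003 and S2005 can be sketched together: try each kernel as the base kernel, derive the kernel filters for the remaining kernels, and keep the choice whose filters need the fewest bits. The sketch assumes that the "correlation" is an element-wise difference and that the cost is the largest signed bit width among the resulting filters; both are illustrative choices, as are the kernel values and the function names.

```python
def difference(base, other):
    """Kernel filter as the element-wise difference `other - base` (assumed)."""
    return [[o - b for b, o in zip(b_row, o_row)]
            for b_row, o_row in zip(base, other)]


def signed_bit_width(matrix):
    """Smallest two's-complement width able to hold every element."""
    width = 1
    for row in matrix:
        for value in row:
            magnitude = value if value >= 0 else -value - 1
            width = max(width, magnitude.bit_length() + 1)
    return width


def select_base_kernel(kernels):
    """Return (index, filter_bit_width, filters) for the cheapest base kernel."""
    best = None
    for idx, candidate in enumerate(kernels):
        filters = [difference(candidate, other)
                   for j, other in enumerate(kernels) if j != idx]
        cost = max(signed_bit_width(f) for f in filters)
        if best is None or cost < best[1]:
            best = (idx, cost, filters)
    return best


# Three hypothetical kernels for the three channels of one layer.
kernels = [
    [[10, 20], [30, 40]],
    [[14, 24], [34, 44]],
    [[19, 29], [39, 49]],
]

base_index, filter_bits, filters = select_base_kernel(kernels)
print(base_index)   # 1 (the middle kernel is closest to both others)
print(filter_bits)  # 4
```

Picking the kernel that lies "between" the others keeps every difference small, which is what allows a narrow filter bit width.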
[0435] As another example, when there are three kernels with three channels in the first layer and three kernels with three channels in the second layer, one of six kernels can be selected as the base kernel for both the first and second layers. Specifically, the third kernel can be selected as the base kernel from the first to the third kernel in the first layer and the fourth to the sixth kernels in the second layer. The kernel filter can be determined based on the correlation between the third kernel as the base kernel and other kernels.
[0436] As another example, when layers one, two, and three exist, assuming each layer has three kernels for three channels, there are a total of nine kernels. In this case, one of five kernels, including the three kernels from layer one and two kernels from layer two, can be selected as the first base kernel. Furthermore, one of four kernels, including the remaining kernel from layer two and the three kernels from layer three, can be selected as the second base kernel.
[0437] As another example, suppose there are three layers and three channels (e.g., a red channel, a green channel, and a blue channel). Then, in the first layer, there is a first kernel for the red channel, a second kernel for the green channel, and a third kernel for the blue channel; in the second layer, there is a fourth kernel for the red channel, a fifth kernel for the green channel, and a sixth kernel for the blue channel; and in the third layer, there is a seventh kernel for the red channel, an eighth kernel for the green channel, and a ninth kernel for the blue channel. At this point, one of three kernels (i.e., the first kernel for the red channel in the first layer, the fourth kernel for the red channel in the second layer, and the seventh kernel for the red channel in the third layer) can be chosen as the first basic kernel. Similarly, one of the second kernel for the green channel in the first layer, the fifth kernel for the green channel in the second layer, and the eighth kernel for the green channel in the third layer can be chosen as the second basic kernel. Similarly, one of the third kernel for the blue channel in the first layer, the sixth kernel for the blue channel in the second layer, and the ninth kernel for the blue channel in the third layer can be chosen as the third basic kernel. Typically, the three kernels in the three layers for a single channel (e.g., the red channel) may be similar to each other. Therefore, one of the three kernels can be chosen as the basic kernel, and the other two kernels can be recovered using kernel filters. Furthermore, since the three kernels in the three layers of a channel (e.g., the red channel) may be similar to each other, the bit width (or bit size) of the kernel filter can be reduced.
[0438] Meanwhile, multiple candidates may exist for a kernel filter, but the kernel filter that meets the predefined rules can be finally selected from multiple candidates through the training process.
[0439] Predefined rules may include the kernel filter's bit width (or bit size), computational cost, cost-benefit ratio, power consumption, accuracy, or a combination thereof.
[0440] For example, kernel filters can be set during the training of an ANN model by applying global weighting functions, which include delta functions with accuracy and weight reduction rates, coefficient functions, rotation functions, transpose functions, bias functions, and cost functions.
[0441] As a specific example, a kernel filter with the smallest bit width (or bit size) and highest accuracy can be selected from multiple kernel filter candidates. The selection of the kernel filter can be updated for each training iteration of the artificial neural network and can be finally completed after training is finished.
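One possible reading of this selection rule is sketched below in Python. The disclosure does not state how bit width and accuracy are traded off, so the sketch assumes an accuracy floor followed by a lexicographic choice (smallest bit width first, then highest accuracy). The `FilterCandidate` type, the `select_filter` function, and all numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FilterCandidate:
    name: str         # which function produced the filter (illustrative label)
    bit_width: int    # bits needed to store each filter value
    accuracy: float   # accuracy measured with the recovered kernel


def select_filter(candidates, min_accuracy):
    """Among candidates meeting the accuracy floor, prefer the smallest
    bit width and break ties by the highest accuracy (assumed rule)."""
    eligible = [c for c in candidates if c.accuracy >= min_accuracy]
    if not eligible:
        raise ValueError("no kernel filter candidate meets the accuracy floor")
    return min(eligible, key=lambda c: (c.bit_width, -c.accuracy))


# Made-up candidates evaluated during one training iteration.
candidates = [
    FilterCandidate("delta", bit_width=4, accuracy=0.912),
    FilterCandidate("coefficient", bit_width=2, accuracy=0.905),
    FilterCandidate("rotation", bit_width=1, accuracy=0.861),
]
chosen = select_filter(candidates, min_accuracy=0.90)
```

Calling `select_filter` again after each training iteration would correspond to updating the selection per iteration and fixing it once training is finished.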
[0442] In step S2007, mapping information between the basic kernel, the corresponding kernel filter, and the recovery (modulation) kernel can be stored. This mapping information can be stored in... Figure 5A, or in the kernel generation algorithm storage device 230 in the memory 200 of Figure 6.
[0443] Figure 21 shows the application after the kernel of the convolutional neural network is restored.
[0444] Referring to Figure 21, the application can start from process S2007 of Figure 20. However, if there is a long time difference between process S2007 and process S2101, they can be regarded as separate processes. Alternatively, they can be regarded as separate because they are executed by different devices. For example, the procedure shown in Figure 20 can be executed on a device with high-performance computing capabilities, while the procedure shown in Figure 21 can be executed by a device including the neural processing unit 100 of Figure 1 or Figure 3.
[0445] In step S2101, the kernel generator 300 of Figure 5A, the kernel generator 150 in the neural processing unit 100 of Figure 6A, or the kernel generator 240 in the memory 200 of Figure 6B reads a kernel generation algorithm (i.e., kernel recovery algorithm) including mapping information, a basic kernel, and a kernel filter from the memory 200.
[0446] For example, the kernel generator 300 of Figure 5A can store the extracted basic kernel, kernel filters, and mapping information in the internal memory 310. Alternatively, the kernel generator 150 in the neural processing unit 100 of Figure 6A can store the extracted basic kernel, kernel filters, and mapping information in the NPU's internal memory 120. Once the basic kernel, kernel filters, and mapping information are stored in the internal memory 120, the neural processing unit 100 may not need to access the memory 200 again. Thus, by allowing the neural processing unit 100 to access the NPU's internal memory 120 instead of the memory 200, power consumption can be reduced by up to a factor of 128.
[0447] In step S2103, the kernel generator 300 of Figure 5A generates a restored (or modulated) kernel based on the mapping information, the basic kernel, and the kernel filter, and then sends it to the neural processing unit 100. Alternatively, the kernel generator 150 in the neural processing unit 100 of Figure 6A generates a restored (or modulated) kernel based on the mapping information, the basic kernel, and the kernel filter. Alternatively, the kernel generator 240 in the memory 200 of Figure 6B generates a restored (or modulated) kernel and then sends it to the neural processing unit 100. Through these operations, the original kernel can be restored or a kernel similar to the original kernel can be generated.
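Steps S2101 and S2103 can be sketched in Python as a lookup followed by a reconstruction. The layout of the mapping information is not specified in this disclosure, so the sketch assumes a table that maps each kernel to be restored to a basic kernel and an optional kernel filter, and it assumes the filter is applied by element-wise addition. All identifiers and values are hypothetical.

```python
# Contents assumed to have been read from memory 200 in step S2101.
base_kernels = {"base_red": [[100, 52], [-30, 7]]}
kernel_filters = {
    "delta_l2": [[1, -2], [1, 1]],
    "delta_l3": [[3, -1], [-1, -1]],
}
# Mapping information: restored kernel -> (basic kernel, kernel filter or None).
mapping = {
    "layer1_red": ("base_red", None),        # basic kernel used as is
    "layer2_red": ("base_red", "delta_l2"),
    "layer3_red": ("base_red", "delta_l3"),
}


def restore_kernel(name):
    """Step S2103 sketch: rebuild one kernel from its mapping entry."""
    base_id, filter_id = mapping[name]
    base = base_kernels[base_id]
    if filter_id is None:
        return [row[:] for row in base]
    filt = kernel_filters[filter_id]
    # Assumed reconstruction rule: restored = base + filter, element-wise.
    return [[b + f for b, f in zip(brow, frow)] for brow, frow in zip(base, filt)]


restored = {name: restore_kernel(name) for name in mapping}
```

In this sketch only one full-width kernel and two narrow filters are held in storage, and the remaining kernels are rebuilt on demand.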
[0448] A kernel can be restored using at least one base kernel. The base kernel is not necessarily the first-level kernel; it can be any level kernel or any channel kernel.
[0449] In other words, a recovered (modulated) kernel can be generated based on at least one of a base kernel and a kernel filter. For example, when the kernel filter is represented as a coefficient function of the base kernel, a recovered (or modulated) kernel can be generated by applying the coefficients to the base kernel. As a more specific example, the coefficient function can be a constant value added or multiplied relative to all elements of the base kernel (e.g., 2).
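The coefficient case can be sketched in a few lines of Python. The function name `apply_coefficient` and its `op` parameter are hypothetical; the sketch only illustrates a constant being added to, or multiplied with, every element of the base kernel.

```python
def apply_coefficient(base_kernel, coefficient, op="multiply"):
    """Generate a modulated kernel by applying one constant to all elements."""
    if op == "multiply":
        return [[w * coefficient for w in row] for row in base_kernel]
    if op == "add":
        return [[w + coefficient for w in row] for row in base_kernel]
    raise ValueError(f"unsupported operation: {op}")


base_kernel = [[1, -2], [3, 4]]
scaled = apply_coefficient(base_kernel, 2, op="multiply")
shifted = apply_coefficient(base_kernel, 2, op="add")
```

In this case the kernel filter reduces to a single stored constant plus an indication of which operation to apply.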
[0450] In one example, there might be cases where the input data consists of only three RGB channels, as well as cases using dozens or more channels. Kernel generators 150, 300, or 240 can generate several raw kernels (i.e., modulated kernels or restored kernels) based on several basic kernels for each channel, depending on various techniques.
[0451] Furthermore, various techniques for generating another kernel from a base kernel can be applied differently to each layer or each channel. Specifically, methods for generating kernels can include one of the following: a first method that uses the base kernel as is for other layers or channels, a second method that modifies the base kernel itself, and so on.
[0452] Specifically, the second method (i.e., modifying the base kernel itself) can be implemented by altering the order in which data is retrieved from memory. Data stored in memory can be represented by addresses indicating their locations. For example, a location can be represented by column and row addresses in memory. A convolutional neural network can change the order in which each data value from the base kernel is received by the kernel generation algorithm (or kernel recovery algorithm) by sending the modified addresses to memory.
[0453] For example, the kernel generation algorithm (or kernel recovery algorithm) can be instructed to use the base kernel as is for the first layer, rotate the base kernel to generate weights corresponding to the second to fourth layers, and transpose the base kernel to generate weights corresponding to the fifth layer.
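The address-reordering idea can be sketched in Python as follows. The sketch assumes a square n×n base kernel stored row-major in a flat memory, and it assumes that the rotations for the second to fourth layers are 90, 180, and 270 degrees clockwise; the disclosure only says "rotate". The names `read_addresses`, `fetch`, and `layer_plan` are hypothetical.

```python
def read_addresses(n, op):
    """Addresses (row-major offsets) to read, in output order, so that the
    values arrive already transformed. r and c index the output kernel."""
    if op == "identity":
        return [r * n + c for r in range(n) for c in range(n)]
    if op == "transpose":
        return [c * n + r for r in range(n) for c in range(n)]
    if op == "rotate90":  # 90 degrees clockwise
        return [(n - 1 - c) * n + r for r in range(n) for c in range(n)]
    raise ValueError(f"unsupported operation: {op}")


def fetch(memory, n, op, times=1):
    """Read the kernel through the reordered addresses `times` times."""
    data = list(memory)
    for _ in range(times):
        data = [data[a] for a in read_addresses(n, op)]
    return data


# Base kernel [[1, 2, 3], [4, 5, 6], [7, 8, 9]] stored row-major.
memory = [1, 2, 3, 4, 5, 6, 7, 8, 9]

# Assumed per-layer instructions: (operation, number of applications).
layer_plan = {
    1: ("identity", 1),
    2: ("rotate90", 1),
    3: ("rotate90", 2),
    4: ("rotate90", 3),
    5: ("transpose", 1),
}
weights = {layer: fetch(memory, 3, op, times) for layer, (op, times) in layer_plan.items()}
```

No transformed copy of the kernel is stored in this sketch; each layer's weights come from the same nine stored values read in a different order.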
[0454] Traditionally, the entire kernel must be loaded from memory each time an operation is performed on each layer or channel, which is inefficient. However, according to the disclosure in this specification, the basic kernel and the restored kernel (or modulated kernel) can be generated in real time. Therefore, the frequency of access to memory 200 can be reduced, thereby significantly reducing power consumption.
[0455] In step S2105, the neural processing unit 100 performs matrix multiplication, convolution, or pooling using a basic kernel.
[0456] In step S2107, the neural processing unit 100 uses a recovery (modulation) kernel to perform matrix multiplication, convolution, or pooling.
[0457] In step S2109, the neural processing unit 100 can perform artificial neural network operations by using the output of matrix multiplication or convolution.
[0458] Figure 22 shows operation according to the mode of the neural processing unit of Figure 1 or Figure 3, and Figures 23A and 23B show the active bits of the kernel for each mode.
[0459] In step S2201, the neural processing unit 100 can determine its operating mode. This determination can be performed based on preset control information or control signals. For example, when the neural processing unit 100 receives a control signal from the outside indicating that it is to operate in a particular mode, the neural processing unit 100 can determine the operating mode based on the corresponding control signal.
[0460] The operation mode can include multiple operation modes.
[0461] Mode 1 can be the maximum performance operation mode, and mode 2 can be the low power operation mode or the low performance operation mode.
[0462] Mode 1 can be used to implement highly complex artificial neural network models without considering power consumption. Alternatively, Mode 1 can be used to process highly complex input data through an artificial neural network model. In Mode 1, convolution or pooling can be performed using all bits of a kernel or kernel filter.
[0463] Mode 2 can be used to account for power consumption or to implement a low-complexity artificial neural network model. Alternatively, Mode 2 can be used to process low-complexity input data through an artificial neural network model.
[0464] In Mode 2, convolution or pooling can be performed using only certain bits of the kernel or kernel filter. For this purpose, the kernel or kernel filter can be divided into multiple regions. Mode 2 can be divided into several sub-modes 2-1, 2-2, and 2-3.
[0465] In step S2203, the neural processing unit 100 can select weight bits in any kernel region based on a determined operating mode.
[0466] As shown in Figure 23A, the first kernel is a 4×4 matrix, and the bit width of each element is 8 bits. Mode 1 can, for example, select and use all 8 bits of each element.
[0467] As shown in Figure 23B, mode 2-1 can use the bits of any one region of the weight bits. As shown in the figure, mode 2-1 can use the 4 bits of the first region out of a total of 8 bits. Mode 2-2 can use the 4 bits of the second region out of a total of 8 bits.
[0468] Mode 2-3 can use only some bits of some elements of, for example, a 4×4 matrix. According to mode 2-3, for example, a 3×3 matrix can be selected and used, where each element has a size of 4 bits.
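The bit selection for each mode can be sketched in Python as shown below. Several details are assumptions made only for illustration: weights are treated as unsigned 8-bit values, the "first region" is taken to be the upper 4 bits and the "second region" the lower 4 bits, and mode 2-3 is taken to use the top-left 3×3 block with the upper 4 bits of each element. The function names and sample weights are hypothetical.

```python
def select_weight_bits(weight, mode):
    """Return the active bits of one unsigned 8-bit weight for a mode."""
    w = weight & 0xFF
    if mode == "1":
        return w            # all 8 bits
    if mode in ("2-1", "2-3"):
        return w >> 4       # first region: assumed upper 4 bits
    if mode == "2-2":
        return w & 0x0F     # second region: assumed lower 4 bits
    raise ValueError(f"unknown mode: {mode}")


def select_kernel(kernel, mode):
    """Apply the mode to a whole kernel; mode 2-3 also crops to 3x3."""
    size = 3 if mode == "2-3" else len(kernel)
    return [
        [select_weight_bits(w, mode) for w in row[:size]]
        for row in kernel[:size]
    ]


# Hypothetical 4x4 kernel with 8-bit elements.
kernel_4x4 = [
    [0xA5, 0x3C, 0xF0, 0x0F],
    [0x12, 0x34, 0x56, 0x78],
    [0x9A, 0xBC, 0xDE, 0xFF],
    [0x00, 0x11, 0x22, 0x33],
]
mode_1 = select_kernel(kernel_4x4, "1")
mode_2_1 = select_kernel(kernel_4x4, "2-1")
mode_2_2 = select_kernel(kernel_4x4, "2-2")
mode_2_3 = select_kernel(kernel_4x4, "2-3")
```

Halving the number of active bits per weight, or shrinking the kernel as in mode 2-3, reduces the amount of weight data involved in each convolution, which is consistent with the low-power purpose of Mode 2.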
[0469] In step S2205, the neural processing unit 100 can perform convolution by using weight bits selected in any kernel.
[0470] The examples shown in the specification and accompanying drawings are provided only to facilitate the description of the subject matter of this disclosure and to provide specific examples to aid in understanding this disclosure, and are not intended to limit the scope of this disclosure. It will be apparent to those skilled in the art to which this disclosure pertains that other modifications based on the technical spirit of this disclosure may be implemented in addition to the examples disclosed herein.
Claims
1. A neural processing unit (NPU) comprising circuitry, the circuitry comprising: at least one processing element (PE) configured to process operations of an artificial neural network (ANN) model; at least one memory configured to store a base kernel and a first kernel filter derived from the base kernel; and an internal memory, wherein the circuitry is configured to determine, based on a kernel generation algorithm, a base kernel with the minimum values of the first kernel filter and the second kernel filter, thereby minimizing the data bit width used to store the weights of the first kernel filter and the second kernel filter, and to generate a first modulation kernel based on the base kernel and the first kernel filter, and wherein the circuitry is configured to store mapping information between the base kernel, the first kernel filter, and the base kernel, the first kernel filter, and the first modulation kernel into the internal memory.
2. The NPU according to claim 1, wherein the base kernel comprises a K×M matrix, where K and M are integers, and the K×M matrix includes at least one first weight value or multiple weight values applicable to the first layer of the ANN model.
3. The NPU according to claim 1, wherein the first kernel filter is configured to be generated based on the difference between at least one kernel weight value of the base kernel and at least one modulation kernel weight value of the first modulation kernel.
4. The NPU of claim 1, wherein the first kernel filter is set during the training process of the ANN model.
5. The NPU of claim 1, wherein the circuitry is configured to generate a second modulation kernel based on the first modulation kernel and the second kernel filter.
6. The NPU according to claim 5, wherein the second kernel filter is configured to be generated by applying a mathematical function to the first kernel filter, and the mathematical function includes at least one of the following: a delta function, a rotation function, a transpose function, a bias function, and a global weight function.
7. The NPU according to claim 1, wherein the at least one memory can also be configured to store mapping information between at least one kernel and at least one kernel filter to generate at least one modulation kernel.
8. The NPU according to claim 1, wherein the ANN model includes information about the bit allocation of a first weight bit, which is included in a first kernel filter used for the first mode.
9. The NPU of claim 1, wherein the NPU operates in one of a plurality of modes, the plurality of modes including: a first mode, in which a first portion of the multiple weight bits included in the first kernel is applied to the ANN model; and a second mode, in which all of the multiple weight bits included in the first kernel are applied to the ANN model.
10. The NPU of claim 9, wherein if the first portion is activated according to the first mode, the weight bit in the first portion is selected.
11. The NPU according to claim 1, wherein the first kernel of the first layer of the ANN model includes multiple weight bits grouped into a first part and a second part, and the first part and the second part are configured to be used selectively.
12. The NPU of claim 1, wherein the first kernel filter is configured such that the bit width of the values in the first kernel filter is smaller than the bit width of the weights of the first kernel.
13. A method of operating a neural processing unit (NPU), the method comprising: From the multiple kernels contained in the artificial neural network (ANN) model, at least one basic kernel is selected, and from the multiple kernel filters, at least one kernel filter corresponding to the at least one basic kernel is selected. The artificial neural network model is trained to select a kernel filter with a relatively small data bit width and relatively high accuracy from the plurality of kernel filters. The training is based on an accuracy cost function and a weight size cost function. The accuracy cost function is configured to improve inference accuracy, and the weight size cost function is configured to reduce the data bit width of the kernel filter according to a predefined weight size reduction rate of the kernel filter in the neural processing unit. The training ANN model is used to determine the mapping data between the updated base kernel and the updated kernel filter corresponding to the updated base kernel; Based on the determined mapping data, the updated base kernel, and the updated kernel filter, a modulation kernel is generated; and The updated base kernel, the updated kernel filter, and the mapping data are stored in the internal memory of the neural processing unit.
14. The method of claim 13, wherein the method is performed by the neural processing unit (NPU) including circuitry, the circuitry including at least one processing element (PE) and at least one memory.
15. The method of claim 14, wherein the method performed for the ANN model further comprises: reading the first kernel of the plurality of kernels from the at least one memory; performing a first operation by applying the first kernel of the plurality of kernels to the first layer of the ANN model or the first channel of the ANN model; reading the kernel filter from the at least one memory; generating a first modulation kernel based on the first kernel among the plurality of kernels and the first kernel filter among the plurality of kernel filters; and performing a second operation on the ANN model by applying the first modulation kernel to the second layer or the second channel of the ANN model.
16. An artificial neural network driving device, comprising: a semiconductor substrate on which conductive patterns are formed; at least one first memory electrically connected to the semiconductor substrate and configured to store information about the basic kernel; and at least one neural processing unit (NPU) electrically connected to the substrate and configured to access the at least one first memory, the NPU including semiconductor circuitry comprising: at least one processing element (PE) configured to process operations of an artificial neural network (ANN) model, and at least one internal memory configurable to store information about the first kernel filter obtained through the basic kernel transformation, wherein the operation of the ANN model includes determining, based on a kernel generation algorithm, a basic kernel with the minimum values of the first kernel filter and the second kernel filter, thereby minimizing the data bit width used to store the weights of the first kernel filter and the second kernel filter, and generating a first modulation kernel based on the basic kernel and the first kernel filter, and wherein the semiconductor circuitry is configured to store mapping information between the basic kernel, the first kernel filter, and the basic kernel, the first kernel filter, and the first modulation kernel into the at least one internal memory.