System with multiple buses and method of controlling processing cores in the system

By assigning a priority to each input tensor, the controller prioritizes high-priority memory access operations, thus resolving the bottleneck and data starvation issues when multiple neural processing units access memory simultaneously, improving data processing efficiency and reducing power consumption.

CN122195891A · Pending · Publication Date: 2026-06-12 · DEEPX CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
DEEPX CO LTD
Filing Date
2025-11-26
Publication Date
2026-06-12

AI Technical Summary

Technical Problem

In existing technologies, when multiple neural processing units or processing cores access memory simultaneously, bus bottlenecks and data starvation periods are likely to occur, leading to increased memory access time and affecting data processing efficiency and power consumption.

Method used

By determining the priority of each input tensor, the controller can control the bus circuitry to prioritize high-priority memory access operations, thereby reducing data starvation and improving bus bandwidth utilization.

Benefits of technology

This effectively reduces the data starvation period of the neural processing unit, improves data processing efficiency, and reduces power consumption.

✦ Generated by Eureka AI based on patent content.

Smart Images

  • Figure CN122195891A_ABST

Patent Text Reader

Abstract

According to one example of the present disclosure, a system can be provided. The system can include: at least one processing core configured to perform a compute operation of at least one neural network model associated with a tensor; at least one memory circuit configured to store the tensor; a plurality of bus circuits operably coupled to the at least one processing core and the at least one memory circuit, the plurality of bus circuits configured to send the tensor from the at least one memory circuit to the at least one processing core in response to receiving a request for a read operation or a write operation; and a controller operably coupled to the plurality of bus circuits, the controller configured to determine a priority of each bus circuit for each tensor for the read operation or the write operation.

Description

Cross-References to Related Applications

[0001] This application claims priority to Korean Patent Application No. 10-2024-0183303, filed on December 11, 2024, with the Korean Intellectual Property Office, the disclosure of which is incorporated herein by reference.

Technical Field

[0002] This disclosure relates to systems and methods for controlling processing cores.

Background Technology

[0003] Humans possess intelligence such as recognition, classification, inference, prediction, and control / decision-making. Artificial intelligence (AI) is the artificial imitation of human intelligence.

[0004] The human brain is composed of a large number of nerve cells called neurons. Each neuron is connected to hundreds to thousands of other neurons through connections called synapses. To mimic human intelligence, the operation of biological neurons and the connections between neurons are modeled in neural network (NN) models. In other words, a neural network is a system that mimics the connected nodes in the layered structure of neurons.

Summary of the Invention

[0005] Embodiments relate to determining the priority of memory operations associated with processing a neural network model using multiple bus circuits. A system includes at least one processing core, multiple bus circuits, and a controller. The at least one processing core is configured to perform computational operations on input tensors to generate output tensors. The input and output tensors are associated with at least one neural network model. At least one memory circuit stores the input and output tensors. The multiple bus circuits are operatively coupled to the at least one processing core and the at least one memory circuit. In response to receiving a request for a read operation, the bus circuits send input tensors from the at least one memory circuit to the at least one processing core, and in response to receiving a request for a write operation, send output tensors from the at least one processing core to the at least one memory circuit. The controller is operatively coupled to the bus circuits. The controller determines the priority of a read operation for each input tensor or a write operation for each output tensor and controls the bus circuits to send each input tensor or each output tensor according to the determined priority.

[0006] In one or more embodiments, the bus circuit includes: a first bus configured to perform a read operation for reading data from at least one memory circuit; and a second bus configured to perform a write operation for writing data to at least one memory circuit.

[0007] In one or more embodiments, the controller is configured to determine the priority of each input tensor by comparing the duration of the computation cycle of each input tensor with the duration of the memory access cycle associated with the next input tensor after each input tensor.
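The comparison in [0007] can be sketched as follows. This is a minimal illustration that assumes the cycle counts are known in advance; the function name `read_priority`, the two levels `LOW` and `HIGH`, and the rule "promote the read when it is longer than the computation" are assumptions made for the example, not the claimed implementation.

```python
LOW, HIGH = 0, 1  # hypothetical priority levels


def read_priority(compute_cycles: int, next_read_cycles: int) -> int:
    """Priority for fetching the next input tensor.

    compute_cycles: duration of the computation cycle of the current tensor.
    next_read_cycles: duration of the memory access cycle of the next tensor.
    Assumed rule: if the next read cannot finish while the current tensor is
    still being computed, the core would stall, so the read is promoted.
    """
    return HIGH if next_read_cycles > compute_cycles else LOW


# (compute cycles of tensor i, read cycles of tensor i+1); made-up numbers
schedule = [(1000, 400), (300, 900), (500, 500)]
priorities = [read_priority(c, r) for c, r in schedule]
print(priorities)  # [0, 1, 0]
```

Only the second pair is promoted, because its read (900 cycles) outlasts the computation (300 cycles) that could otherwise hide it.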

[0008] In one or more embodiments, the controller is configured to determine the priority of each tensor by comparing the duration of a computation cycle at the processing core with the duration of a memory cycle, wherein the memory cycle includes a write cycle of the previous tensor preceding each input tensor and a read cycle of the next input tensor following each input tensor.

[0009] In one or more embodiments, the controller is configured to, in response to determining that data starvation has occurred or is predicted to occur in a first processing core of the at least one processing core, increase the bus bandwidth of a first read cycle assigned to the first processing core by reducing the bus bandwidth of a second read cycle assigned to a second processing core of the at least one processing core.
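The reallocation in [0009] amounts to moving bandwidth from one core's read cycle to another's. A minimal sketch, assuming bandwidth is tracked as a percentage share per core; the function `shift_bandwidth`, the core names, and the share values are hypothetical.

```python
def shift_bandwidth(shares: dict, starving: str, donor: str, amount: int) -> dict:
    """Return new per-core bandwidth shares after moving bandwidth to a starving core.

    The donor never drops below zero, and the total share is preserved.
    """
    moved = min(amount, shares[donor])
    updated = dict(shares)  # leave the caller's mapping untouched
    updated[donor] -= moved
    updated[starving] += moved
    return updated


shares = {"core0": 50, "core1": 50}  # percent of read-bus bandwidth (assumed unit)
updated = shift_bandwidth(shares, starving="core0", donor="core1", amount=20)
print(updated)  # {'core0': 70, 'core1': 30}
```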

[0010] In one or more embodiments, the controller is configured to increase the priority of sending input tensors via multiple bus circuits during the read cycle of at least one processing core in response to determining that data starvation has occurred or is predicted to occur in the processing core, thereby increasing the bus bandwidth allocated to the processing core.

[0011] In one or more embodiments, the controller is configured to reduce the bandwidth of multiple bus circuits assigned to at least one of the processing cores in response to determining that the processing core is in a compute-constrained state.

[0012] In one or more embodiments, the controller is configured to: receive a signal from at least one of the processing cores indicating that a data starvation has occurred at the processing core; and increase the priority of sending input tensors to the processing core in response to receiving the signal.

[0013] In one or more implementations, multiple bus circuits operate separately for read and write operations.

[0014] In one or more embodiments, each of at least one processing core includes a plurality of processing elements (PEs), wherein the plurality of PEs includes at least one of a multiplication and accumulation (MAC) operator circuit, an adder tree circuit, or an arithmetic logic unit (ALU) operator circuit.

[0015] In one or more embodiments, the priority of memory access operations includes a first priority, a second priority, and a third priority, wherein the second priority is higher than the first priority, and the third priority is higher than the first priority and the second priority.

[0016] In one or more embodiments, determining whether data starvation has occurred or is expected to occur further includes: counting by a counter while performing a memory access operation; and determining that data starvation has occurred in response to the count value reaching a threshold.
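The counter-based detection in [0016] can be sketched as below. The class name `StarvationCounter`, the per-cycle `tick` interface, and the choice to reset the counter when the access completes are assumptions for illustration; the threshold is simply passed in as a constant.

```python
class StarvationCounter:
    """Counts cycles spent on a pending memory access and flags starvation."""

    def __init__(self, threshold: int):
        self.threshold = threshold
        self.count = 0

    def tick(self, access_pending: bool) -> bool:
        """Advance one clock cycle; return True once the count reaches the threshold."""
        if access_pending:
            self.count += 1
        else:
            self.count = 0  # assumed: a completed access restarts the count
        return self.count >= self.threshold


counter = StarvationCounter(threshold=3)
trace = [True, True, False, True, True, True]  # is an access pending in each cycle?
flags = [counter.tick(pending) for pending in trace]
print(flags)  # [False, False, False, False, False, True]
```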

[0017] In one or more implementations, the threshold is pre-calculated during compilation.

Attached Figure Description

[0018] Figures 1A and 1B are diagrams illustrating the bottleneck that occurs when read and write operations are performed by neural processing units in a conventional control system.

[0019] Figure 2 is a schematic diagram illustrating a system for controlling a processing core according to an example of this disclosure.

[0020] Figure 3 is a schematic diagram illustrating a processing element according to an example of this disclosure.

[0021] Figure 4 is a schematic diagram illustrating an example neural network.

[0022] Figure 5 is a table showing the energy consumption per unit operation of a neural processing unit according to an example of this disclosure.

[0023] Figures 6A and 6B are diagrams illustrating a system that performs memory access operations using one bus for read operations and another bus for write operations, according to an example of this disclosure.

[0024] Figures 7A and 7B are diagrams illustrating an example of performing read and write operations based on bus utilization, according to an example of this disclosure.

[0025] Figures 8A and 8B are diagrams illustrating an example operation for reducing the latency of a tensor in the event of bus congestion in a system for controlling the processing core, according to a first example of this disclosure.

[0026] Figure 9 is a flowchart illustrating a method for controlling the processing core according to the first example of this disclosure.

[0027] Figure 10 is a diagram illustrating a method for determining the priority of a transmitted tensor according to the first example of this disclosure.

[0028] Figure 11 is a diagram illustrating how, according to the first example of this disclosure, data processing speed can be improved by determining priorities so as to reduce data starvation periods in processing cores.

[0029] Figure 12 is a flowchart illustrating an example of a method for determining priority when controlling the processing core according to a second example of this disclosure.

[0030] Figure 13 is a timing diagram illustrating an example of a data starvation signal generated at runtime in the processing core, according to the second example of this disclosure.

[0031] Figure 14 is a timing diagram illustrating a method for determining priorities to reduce delays identified based on counter values, according to a third example of this disclosure.

[0032] Figure 15 is a flowchart illustrating a method for determining the priority of a transmitted tensor according to a fourth example of this disclosure.

[0033] Figure 16 is a diagram illustrating how, according to the fourth example of this disclosure, data processing speed can be improved by determining priorities so as to reduce data starvation periods.

[0034] Figure 17 is a flowchart illustrating a method for controlling a neural processing unit according to another example of this disclosure.

[0035] Figure 18 is a diagram illustrating an example of a method for determining priorities within a method for controlling the processing core, according to another example of this disclosure.

[0036] Figure 19 is a diagram illustrating an improvement in data processing speed according to another example of this disclosure, achieved by allocating priorities in a method for controlling neural processing units so as to minimize data starvation periods.

Detailed Implementation

[0037] The specific structures or step-by-step descriptions of examples based on the concepts disclosed in this specification or application are merely illustrative and serve only to explain examples based on those concepts.

[0038] Examples based on the concepts of this disclosure may be embodied in various forms. Examples based on the concepts of this disclosure should not be construed as limited to the examples described in this specification or application.

[0039] Various modifications can be applied to the examples based on the concepts of this disclosure. This disclosure can take many forms. Therefore, specific examples are shown in the accompanying drawings and described in detail in this disclosure. However, this is not intended to limit the examples based on the concepts of this disclosure to the specific forms of disclosure. Therefore, it should be understood that all modifications, equivalents, or alternatives that fall within the spirit and scope of this disclosure are included in this disclosure.

[0040] Terms such as first and / or second may be used to describe various components. However, this disclosure should not be limited to the foregoing terms. These terms are used only for the purpose of distinguishing one component from another. For example, without departing from the scope of the claims according to the concepts of this disclosure, a first element may be referred to as a second element, and similarly, a second element may be referred to as a first element.

[0041] When an element is said to be "connected to" or "in contact with" another element, it is understood that the element may be directly connected to or in contact with that other element, but other elements may be positioned between them. On the other hand, when it is said that an element is "directly connected to" or "in direct contact with" another element, it should be understood that there are no other elements between them. Other expressions describing the relationship between elements, such as "between" and "immediately between," or "adjacent to" and "directly adjacent to," should be interpreted similarly.

[0042] In this disclosure, expressions such as “A or B”, “at least one of A and / or B” or “one or more of A and / or B” may include all possible combinations thereof. For example, “A or B”, “at least one of A and B” or “at least one of A or B” may mean: (1) containing at least one A, (2) containing at least one B, or (3) containing both at least one A and at least one B.

[0043] As used herein, expressions such as “first,” “second,” or “first or second” may modify various elements regardless of their order and / or importance. These expressions are used only to distinguish one element from others and do not limit the elements. For example, “first user equipment” and “second user equipment” may refer to different user equipment regardless of their order or importance. For example, a first element may be named a second element without departing from the scope of the claims described in this disclosure, and similarly, a second element may be renamed a first element.

[0044] The terminology used in this disclosure is for the purpose of describing specific examples only and may not be intended to limit the scope of other examples. Singular expressions may include plural expressions unless the context clearly specifies otherwise. The terminology used herein, including technical or scientific terms, may have the same meaning as commonly understood by one of ordinary skill in the art as described herein.

[0045] In this disclosure, terms defined in general dictionaries may be interpreted as having the same or similar meaning as in the relevant technical context. Unless explicitly defined herein, they should not be interpreted in an ideal or overly formal sense. In some cases, even terms defined in this disclosure should not be interpreted as excluding examples of this disclosure.

[0046] The terminology used herein is for the purpose of describing specific examples only and is not intended to limit this disclosure. Unless the context clearly specifies otherwise, singular expressions include plural expressions. In this specification, terms such as “comprising” or “having” are intended to indicate the presence of the described features, quantities, steps, operations, components, parts, or combinations thereof. Therefore, it should be understood that the presence or addition of one or more other features, numbers, steps, operations, components, parts, or combinations thereof is not excluded.

[0047] Unless otherwise defined, all terms used herein, including technical or scientific terms, shall have the same meaning as commonly understood by one of ordinary skill in the art to which this disclosure pertains. Terms such as those defined in common dictionaries shall be interpreted as having a meaning consistent with their meaning in the relevant technical context. Unless expressly defined in this disclosure, they shall not be interpreted in an ideal or overly formal sense.

[0048] Each feature of the various examples of this disclosure can be combined, either partially or entirely, or with each other. Those skilled in the art will fully understand that the various examples of this disclosure are technically capable of various interlocking and driving mechanisms. Each example of this disclosure can be implemented independently of each other, or can be implemented together in an associated relationship.

[0049] In describing examples, descriptions of technical content that is well-known in the technical field to which this disclosure pertains and is not directly related to this disclosure may be omitted. Omitting such unnecessary descriptions conveys the essential points of this disclosure more clearly, without obscuring them.

[0050] <Terminology Definition> For ease of understanding this disclosure, the following is a brief overview of the terminology used herein.

[0051] NPU: an abbreviation for Neural Processing Unit, which can refer to a processor specifically designed for computing neural network models independent of the Central Processing Unit (CPU).

[0052] SoC: An abbreviation for System-on-a-Chip, which refers to a semiconductor chip that integrates at least one processor and various circuit elements of an electronic system into a single integrated circuit (IC). An SoC can integrate digital circuits, analog circuits, mixed-signal circuits, and radio frequency processing circuits on a single semiconductor chip. An SoC can contain at least one processor. For example, the at least one processor that can be included in an SoC can be at least one of a Central Processing Unit (CPU), a Digital Signal Processor (DSP), an Image Signal Processor (ISP), a Graphics Processing Unit (GPU), and a Neural Processing Unit (NPU). An SoC can contain at least one memory. For example, the memory that can be included in an SoC can be at least one of Random Access Memory (RAM), Read-Only Memory (ROM), and cache memory. An SoC can include a high-speed data bus, such as AXI, AHB, APB, etc., for efficient communication between multiple IP blocks included in the SoC. An SoC can include at least one interface for connecting to external devices and sensors, such as PCIe, USB, I2C, SPI, UART, GPIO, etc. An SoC can include an on-chip power management unit that regulates the voltage and power distribution on the semiconductor chip. A System-on-a-Chip (SoC) can include communication interfaces that integrate wired and wireless communication protocols (such as Ethernet, Wi-Fi, Bluetooth, and cellular connectivity) for data transmission. SoCs can be manufactured using a variety of packaging technologies.

[0053] NN: an abbreviation for Neural Network, a network of nodes connected in a layered structure, mimicking the way neurons in the human brain connect through synapses to imitate human intelligence.

[0054] Neural network model information includes: information about the network structure, the number of layers, the connections within each layer, the parameters of each layer, the computational processing methods, the activation functions, the data type of each layer's parameters (e.g., floating-point or integer), and the bit width of each parameter. Each layer's parameters can be represented using a tensor of a certain size. During the compilation process, at least one layer can be divided into tiled tensors based on the computational circuit architecture and internal memory size. Based on the parameter sizes of each tensor (e.g., the size of the input parameters and the size of the weight parameters) and the required computational algorithms (e.g., matrix multiplication, activation functions, and softmax functions), the clock cycles for the computational circuitry processing the tensors and the clock cycles for data transfer to memory can be calculated.
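The cycle estimates mentioned in [0054] can be approximated at compile time from tensor sizes. The sketch below assumes a tiled matrix multiplication, a fixed number of multiply-accumulate operations per cycle, and a fixed bus width; the function names and all numeric parameters are hypothetical.

```python
import math


def compute_cycles(m: int, k: int, n: int, macs_per_cycle: int) -> int:
    """Clock cycles to multiply an (m x k) tile by a (k x n) tile."""
    return math.ceil(m * k * n / macs_per_cycle)


def transfer_cycles(num_elements: int, bytes_per_element: int,
                    bus_bytes_per_cycle: int) -> int:
    """Clock cycles to move a tensor of the given size over the bus."""
    return math.ceil(num_elements * bytes_per_element / bus_bytes_per_cycle)


m, k, n = 64, 64, 64
c = compute_cycles(m, k, n, macs_per_cycle=1024)
# Weight tile of k x n 8-bit integers over an assumed 8-byte-per-cycle bus
t = transfer_cycles(k * n, bytes_per_element=1, bus_bytes_per_cycle=8)
print(c, t)  # 256 512
```

With these made-up numbers, fetching the weight tile takes twice as long as the computation that consumes it, which is the kind of imbalance a compiler could flag in advance.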

[0055] DNN: an abbreviation for Deep Neural Network, which can refer to increasing the number of hidden layers in a neural network to achieve higher levels of artificial intelligence.

[0056] CNN: an abbreviation for Convolutional Neural Network, a type of neural network that functions similarly to the human brain's visual cortex when processing images. Convolutional neural networks are known to be very well-suited for image processing and are renowned for their ability to extract features from input data and identify patterns within those features.

[0057] Transformer: A Transformer neural network is a DNN based on attention techniques. It utilizes many matrix multiplication operations. A Transformer can receive input values and parameters such as query (Q), key (K), and value (V) to obtain output values, i.e., attention (Q, K, V). Based on the output values (i.e., attention (Q, K, V)), the Transformer can handle various inference operations. Transformers have been actively used in language generation models.
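The text does not spell out how attention (Q, K, V) is computed. The sketch below uses the widely used scaled dot-product form, softmax(QKᵀ / √d_k)V, as an assumption, written in plain Python so that the matrix multiplications it relies on are visible; the helper names and sample matrices are illustrative.

```python
import math


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """Scaled dot-product attention for small list-of-lists matrices."""
    d_k = len(K[0])
    out = []
    for q in Q:
        # One score per key: dot(q, k) / sqrt(d_k)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Weighted sum of the value rows
        out.append([sum(w * v[c] for w, v in zip(weights, V))
                    for c in range(len(V[0]))])
    return out


Q = [[1.0, 0.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0]]
result = attention(Q, K, V)
print(result)  # the first component is larger because Q matches the first key
```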

[0058] Kernel: Refers to the weights of an NxM convolution matrix. Each layer of a neural network model has multiple kernels; the number of kernels can be referred to as the number of channels, the number of filters, etc.

[0059] Tensors: Tensors are multidimensional matrix parameters processed by neural network models. Tensors can refer to various parameters of a neural network model, such as weights, feature maps, kernels, and attention parameters. A tensor can refer to the input parameters input to a neural processing unit and the output parameters computed by that unit. A tensor can be the parameters of a piece of data computed by the neural processing unit at a time. A neural network model can include multiple layers, and each layer can be configured to contain at least one tensor. For example, the input parameters of the first layer of a neural network model can be called the first tensor, the weight parameters of the first layer can be called the second tensor, and the output parameters of the first layer can be called the third tensor. For example, the input parameters of the first layer of a first neural network model can be called the first tensor, while the output parameters of the first layer of a second neural network model can be called the second tensor.

[0060] Neural network (NN) models are classified into "single-layer neural networks" and "multi-layer neural networks" based on the number of layers. A typical multi-layer neural network consists of an input layer, hidden layers, and an output layer. (1) The input layer is the layer that receives external data, and the number of neurons in the input layer is the same as the number of input variables. (2) The hidden layer is located between the input layer and the output layer. It receives signals from the input layer, extracts features, and passes them to the output layer. (3) The output layer receives signals from the hidden layer and outputs them. The input signals between neurons are multiplied by their respective weights (whose values are between 0 and 1) and then summed. If the sum is greater than the neuron's threshold, the neuron is activated and outputs the signal through an activation function.
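The neuron behavior described above (weighted sum, threshold, activation) can be sketched as follows. The function name `neuron_output`, the identity activation used by default, and the sample values are assumptions for illustration.

```python
def neuron_output(inputs, weights, threshold, activation=lambda s: s):
    """Weighted sum of the inputs; the neuron fires only above its threshold."""
    total = sum(x * w for x, w in zip(inputs, weights))
    if total > threshold:
        return activation(total)  # activated: pass the sum through the activation
    return 0.0  # not activated


inputs = [1.0, 0.5, 2.0]
weights = [0.2, 0.4, 0.1]  # each weight lies between 0 and 1
fired = neuron_output(inputs, weights, threshold=0.5)   # sum is about 0.6, above 0.5
silent = neuron_output(inputs, weights, threshold=0.7)  # sum is below 0.7
print(fired, silent)
```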

[0061] On the other hand, neural networks with more hidden layers to achieve higher levels of artificial intelligence are called deep neural networks (DNNs). There are many types of DNNs, but convolutional neural networks (CNNs) are known for their ability to extract features from input data and identify patterns in those features.

[0062] Convolutional neural networks (CNNs) are neural networks whose function is similar to that of the human brain's visual cortex, which processes images. CNNs are known to be suitable for image classification, object detection, and other applications.

[0063] Convolutional Neural Networks (CNNs) consist of repeated convolution and pooling stages across channels. Convolution operations account for the majority of computation time in a CNN. A CNN identifies objects by extracting features from the image in each channel using kernels in matrix form, and gains robustness to changes such as translation or distortion through pooling. In each channel, a feature map is obtained by convolving the input data and the kernel, and an activation function such as ReLU (Rectified Linear Unit) is applied to generate the activation map for that channel. Pooling can then be applied. The neural network that actually classifies the patterns is located at the end of the feature extraction neural network and is called a fully connected layer. In the computational processing of a CNN, most operations are performed via convolution or matrix multiplication. The necessary kernels are read from memory very frequently. A significant portion of the operating time of a CNN is spent reading the kernel corresponding to each channel from memory. However, the examples in this disclosure are not limited to CNNs and can be applied to transformer neural networks, etc.
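The convolution, activation, and pooling steps described above can be sketched for a single channel in plain Python. This is a minimal illustration assuming stride 1, no padding, and 2x2 max pooling; as is common in deep-learning practice, the "convolution" here does not flip the kernel. The function names and sample data are hypothetical.

```python
def conv2d(image, kernel):
    """Slide the kernel over the image (stride 1, no padding) to get a feature map."""
    kh, kw = len(kernel), len(kernel[0])
    out_h, out_w = len(image) - kh + 1, len(image[0]) - kw + 1
    return [[sum(image[i + a][j + b] * kernel[a][b]
                 for a in range(kh) for b in range(kw))
             for j in range(out_w)]
            for i in range(out_h)]


def relu(fmap):
    """Rectified Linear Unit applied element-wise to produce the activation map."""
    return [[max(0, v) for v in row] for row in fmap]


def max_pool2x2(fmap):
    """Keep the largest value in each non-overlapping 2x2 window."""
    return [[max(fmap[i][j], fmap[i][j + 1], fmap[i + 1][j], fmap[i + 1][j + 1])
             for j in range(0, len(fmap[0]) - 1, 2)]
            for i in range(0, len(fmap) - 1, 2)]


image = [[1, 2, 0],
         [0, 1, 3],
         [2, 1, 0]]
kernel = [[1, -1],
          [0, 1]]
feature_map = conv2d(image, kernel)
pooled = max_pool2x2(relu(feature_map))
print(feature_map, pooled)  # [[0, 5], [0, -2]] [[5]]
```

Every output element of `conv2d` reads the whole kernel again, which hints at why kernel reads from memory dominate CNN operation.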

[0064] Memory (also referred to herein as "memory circuitry") can be categorized into memory sections (e.g., main memory or off-chip memory), internal memory, on-chip memory, etc. Each memory may include multiple memory cells, each with a unique memory address. In particular, whenever a neural processing unit calls weight parameters stored in memory or calls other parameters, there may be a delay of several clock cycles before accessing the memory cell corresponding to the memory address.

[0065] The neural processing unit can perform memory access operations to write data to or read data from memory, as well as computational operations to compute a neural network model based on that data.

[0066] Recently, systems with multiple neural processing units or multiple processing cores contained within neural processing units have been developed. These systems are configured to simultaneously send data to or receive data from memory.

[0067] In such a system, multiple neural processing units or multiple processing cores may simultaneously attempt to access memory via the bus. In this case, the bus handling data communication can prioritize the first memory access command to arrive. When using this scheme, contention for memory access on the bus may occur. Furthermore, the next memory access operation may be delayed until the first memory access operation completes, which could lead to data starvation.
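The effect of serving the first-arriving command first can be seen in a small simulation. The sketch below assumes a single shared bus that handles one transfer at a time; the function name `fcfs_wait_times` and the request values are made up for the example.

```python
def fcfs_wait_times(requests):
    """Cycles each request waits before its transfer starts under first-come, first-served.

    requests: list of (arrival_cycle, transfer_cycles), already in arrival order.
    """
    bus_free_at = 0
    waits = []
    for arrival, duration in requests:
        start = max(arrival, bus_free_at)  # wait until the bus is released
        waits.append(start - arrival)
        bus_free_at = start + duration
    return waits


# One long transfer arrives first; two short ones arrive just after it.
requests = [(0, 800), (10, 100), (20, 100)]
print(fcfs_wait_times(requests))  # [0, 790, 880]
```

The two short transfers wait far longer than they take to complete, and the cores behind them have no data to work on in the meantime.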

[0068] This contention can further increase the processing time and power required to read parameters from memory and perform AI operations in the neural processing unit. Furthermore, if a memory access operation is not completed, a data starvation period occurs, during which the neural processing unit cannot perform computational operations because the required data is not available.

[0069] On the other hand, the processing time of memory access operations used to perform computations by the neural processing unit is related to the size of the data. That is, the larger the data size, the greater the amount of data transferred, which can increase data transfer time. The time it takes for the neural processing unit to process AI operations using the data provided by memory access operations is related to the complexity of the computational algorithm. That is, as AI computational algorithms become more complex, the computational load increases, which can increase data computation time. Therefore, the processing time of memory access operations and the processing time of AI algorithm operations may differ. For example, the time to complete a memory access operation within a specific interval may be shorter or longer than the time required to complete the computational operation.

[0070] A memory-constrained state occurs when the completion time of a memory access operation is longer than the completion time of a computation operation. This state can occur when computation is limited by memory access speed rather than the computational capacity of the neural processing unit. In this state, data starvation may occur, during which the neural processing unit is inactive as it waits to fetch data from or write data to memory.

[0071] Conversely, when the completion time of a computational operation is longer than the completion time of a memory access operation, it is called a computationally constrained state. A computationally constrained state may occur when the processing power of the neural processing unit is a limiting factor, for example, when AI computation time takes longer than memory access time. In this state, the neural processing unit may experience data starvation due to the inefficiency of AI computation or memory bandwidth allocation. Therefore, a data starvation period, during which the neural processing unit cannot perform computational operations, can occur in either a memory-constrained or a computationally constrained state.
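The two states defined in [0070] and [0071] can be told apart by comparing the two completion times. The sketch below additionally assumes that the memory access overlaps fully with the computation, so the compute circuitry idles only for the part of the access that the computation cannot hide; it therefore models only stalls caused by slow memory access. The function name and labels are hypothetical.

```python
def classify_interval(memory_cycles: int, compute_cycles: int):
    """Return (state, idle_cycles) for one interval of overlapped access and compute."""
    if memory_cycles > compute_cycles:
        # Memory is the bottleneck: the core waits for the remaining transfer.
        return "memory-constrained", memory_cycles - compute_cycles
    if compute_cycles > memory_cycles:
        # Compute is the bottleneck: data is ready before the core needs it.
        return "compute-constrained", 0
    return "balanced", 0


print(classify_interval(900, 300))  # ('memory-constrained', 600)
print(classify_interval(200, 700))  # ('compute-constrained', 0)
```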

[0072] According to one aspect of this disclosure, the Quality of Service (QoS) priority of read and write operations for accessing memory of each neural processing unit or processing core is modified to improve the efficiency of direct memory access (DMA) read and / or write operations. Furthermore, the QoS priority of read and write operations for accessing memory of each neural processing unit or processing core can be set based on analysis of memory-constrained and computation-constrained states to improve the efficiency of DMA read and write operations. When a data starvation period is predicted, the bus bandwidth allocated to a neural processing unit or processing core can be reallocated based on tensors, since the time to complete a memory access operation of a tensor is shorter or longer than the time to complete a computation operation of another tensor, thereby enabling the computing circuitry to operate without data starvation periods, improving data processing performance and reducing power consumption.

[0073] An example of this disclosure will be described below with reference to the accompanying drawings.

[0074] Figures 1A and 1B are diagrams illustrating a bottleneck that occurs when read and write operations are performed for each of multiple neural processing units in a conventional control system.

[0075] Traditional control systems control the operation of neural processing units to access data from memory (i.e., memory access operations) and / or control the operation of neural processing units to compute data provided to them (i.e., computation operations).

[0076] First, Figure 1A illustrates a scenario where the memory is connected to a bus, and the bus is implemented as a single-bus architecture. In this case, the memory cannot process both read and write operations simultaneously, but can only perform one of the read or write operations associated with one of the neural processing units.

[0077] On the other hand, Figure 1B illustrates a scenario where the memory is connected to a bus implemented as dual independent buses. In this case, although the memory can perform read and write operations simultaneously via the two buses, a bottleneck occurs because the read and write operations of multiple neural processing units are processed sequentially.

[0078] Furthermore, if there is a difference in the request time between read and write operations, the corresponding neural processing unit will inevitably experience a data starvation period. That is, as shown in Figures 1A and 1B, when multiple neural processing units (NPUs) or processing cores simultaneously attempt to access memory, traditional control systems handle these attempts on a first-come, first-served basis. Consequently, multiple NPUs compete for memory access. This competition increases the time required for each NPU to complete its memory access operation. Furthermore, because memory access operations are not completed in a timely manner, data starvation occurs within the NPU, preventing it from initiating computational operations.

[0079] Even without competition among multiple neural processing units or multiple processing cores, the time spent completing a memory access operation and the time spent completing a computation operation are not always equal at any given point in time. Therefore, in some intervals, the memory access operation time may be longer than the computation operation time (i.e., a memory-constrained state), which can also lead to data starvation periods during which the computation circuitry does not operate.

[0080] In the scenario depicted in Figure 1B, the bus, lacking a specific scheduling criterion, builds a sequential queue on a first-come, first-served basis, so that the read and write operations for each tensor are served in the order in which the neural processing units request them. Consequently, data starvation frequently occurs across multiple neural processing units. In other words, with the sequential bus queue used for neural network model operations in a conventional control system, bus bottlenecks can occur frequently because of data starvation and the resulting degradation of effective bus bandwidth.
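The first-come, first-served behavior described above can be illustrated with a minimal sketch. It assumes a hypothetical single shared bus on which each request holds the bus for a fixed number of cycles; the names `Request` and `fcfs_stall_cycles` and all cycle counts are illustrative and do not appear in this disclosure.

```python
from dataclasses import dataclass


@dataclass
class Request:
    npu: str            # hypothetical label of the requesting NPU
    arrival: int        # cycle at which the request reaches the bus
    memory_cycles: int  # cycles the transfer occupies the bus


def fcfs_stall_cycles(requests):
    """Serve requests strictly in arrival order and return, per NPU,
    how many cycles it waited for the bus before its transfer started."""
    bus_free_at = 0
    stalls = {}
    for req in sorted(requests, key=lambda r: r.arrival):
        start = max(bus_free_at, req.arrival)
        stalls[req.npu] = start - req.arrival
        bus_free_at = start + req.memory_cycles
    return stalls


requests = [
    Request("npu0", arrival=0, memory_cycles=100),
    Request("npu1", arrival=1, memory_cycles=10),
    Request("npu2", arrival=2, memory_cycles=10),
]
stalls = fcfs_stall_cycles(requests)
```

In this toy case, two short transfers wait behind one long transfer simply because it arrived first, which is the kind of starvation period the paragraph describes.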

[0081] To address this issue, a system according to an example of this disclosure can determine the priority of the tensors of a particular neural processing unit (NPU), among the tensors associated with multiple NPUs that are transmitted on the bus, based on 1) the memory access operation time on the bus and 2) the computation operation time of each tensor at the NPU, thereby reducing the data starvation period of the NPU. In other words, when the transmission times of multiple tensors corresponding to multiple NPUs requesting use of the bus overlap, a system according to an example of this disclosure can determine the priority of the competing tensors, thereby reducing the data starvation time of the computation circuitry of a particular NPU in the system. In this way, data starvation of the NPU caused by memory bandwidth limitations and memory latency during read and write operations can be mitigated or eliminated.

[0082] In the following, the neural processing unit according to the examples of this disclosure may also be referred to as a processing core. For example, a neural processing unit may refer to a semiconductor chip formed on a substrate including at least one processing core. In other words, multiple processing cores may be configured as part of a neural processing unit. As described herein, a processing core may refer to computational circuitry configured to process operations of a neural network model. In various examples of this disclosure, neural processing units and processing cores may be substantially equivalent to each other.

[0083] In some examples, the first controller 1100 and the second controller 100 can be integrated to form a controller. The controller can be referred to as a control circuit.

[0084] Figure 2 is a schematic diagram illustrating a system for controlling a processing core according to an example of this disclosure. Figure 2 illustrates a neural processing unit comprising multiple processing cores, along with multiple peripheral devices used for the computation of the neural processing unit. Therefore, the neural processing unit and the multiple peripheral devices can be referred to as a system. At least some components of the system may include a system-on-a-chip (SoC).

[0085] Referring to Figure 2, the neural processing unit (NPU) 1000 of system 10000 may include multiple processing cores 1000-1, ..., 1000-n and may be configured to communicate with a central processing unit (CPU) 2000, a memory 3000, an image sensor 4000, and / or a decoder 5000 to perform various neural network inference functions. Furthermore, each processing core of the neural processing unit 1000 may be configured to be controlled via a corresponding first controller 1100. Although the neural processing unit (NPU) 1000 is described as including multiple processing cores 1000-1, ..., 1000-n, this is only an example; other embodiments may include at least one processing core, and the number of processing cores is not limited.

[0086] Each of the neural processing unit 1000, central processing unit 2000, memory 3000, image sensor 4000, decoder 5000, and / or bus 6000 according to one example of this disclosure may be formed as a separate semiconductor circuit, or at least a portion thereof may be integrated within a single package, and this disclosure is not limited thereto. In some examples of this disclosure, bus 6000 may comprise multiple buses, such as a first bus 6100 and a second bus 6200.

[0087] According to various examples, the neural processing unit 1000 of system 10000 can be patterned on the same semiconductor die as the central processing unit 2000.

[0088] According to various examples, the neural processing unit 1000, central processing unit 2000, and memory 3000 of system 10000 can be patterned on the same semiconductor die.

[0089] According to various examples, the neural processing unit 1000 of system 10000 may include a semiconductor die connected to the central processing unit 2000 via chiplet technology. When chiplet technology is applied, an interposer may also be included.

[0090] According to various examples, a system 10000 including a neural processing unit 1000, a central processing unit 2000, and a memory 3000 can be constructed from semiconductor dies connected via chiplet technology.

[0091] Each of the aforementioned components is characterized by its operational function, and each component can be embodied in a circuit board, silicon substrate, resistor, transistor, etc. Therefore, each component can be a semiconductor circuit with many transistors connected to it, some of which may be difficult to identify and distinguish with the naked eye, and may be identifiable only by their operation. Each component in Figure 2 can be called a corresponding circuit unit.

[0092] Each of the central processing unit 2000, memory 3000, image sensor 4000, and decoder 5000 can communicate via bus 6000 to send data to and receive data from each processing core 1000-1, … , 1000-n.

[0093] According to one example of this disclosure, bus 6000 can be configured to process read and write operations of each tensor sequentially based on a defined priority. In this case, bus 6000 can be an Advanced eXtensible Interface (AXI) bus. However, this disclosure is not limited thereto, and each processing core 1000-1, …, 1000-n can be configured to be directly coupled to at least one of the elements described above.

[0094] Furthermore, according to another example of this disclosure, bus 6000 may be configured with multiple buses. For example, a separate bus for reading 6100 (hereinafter referred to as the "first bus") and a bus for writing 6200 (hereinafter referred to as the "second bus") may be configured. In this case, each of the first bus 6100 and / or the second bus 6200 may be an AXI bus. However, this disclosure is not limited thereto, and each processing core 1000-1, …, 1000-n may be configured to be directly coupled to at least one of the aforementioned elements.

[0095] The neural processing unit 1000 can be defined as a processor specifically designed for operations on the neural network model. In particular, the neural processing unit 1000 can be specifically designed for matrix multiplication or convolution operations, which account for the majority of computations in the neural network model.

[0096] Neural network models are based on artificial neural networks, which receive multiple inputs or stimuli, multiply them by their respective weights, sum the products, add a bias, and then transform and transmit the result through an activation function. Neural network models trained in this way can be used to output inference results from input data. These inference results can be for object detection, image classification, event detection, pose estimation, lexical generation, natural language generation, image generation, and more.
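As a minimal sketch of the weighted-sum-and-activation step described above, the following computes one artificial neuron. The sigmoid is chosen here only for illustration; the function name `neuron` and the sample values are assumptions, not part of this disclosure.

```python
import math


def neuron(inputs, weights, bias):
    """Weighted sum of the inputs plus a bias, passed through a sigmoid."""
    if len(inputs) != len(weights):
        raise ValueError("inputs and weights must have the same length")
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1.0 / (1.0 + math.exp(-z))


# 1.0 * 0.5 + 2.0 * (-0.25) + 0.0 = 0.0, and sigmoid(0.0) = 0.5
out = neuron([1.0, 2.0], [0.5, -0.25], bias=0.0)
```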

[0097] The neural processing unit 1000 can be a semiconductor implemented as an electrical / electronic circuit. The electrical / electronic circuitry may be intended to include multiple electronic components (e.g., transistors, capacitors).

[0098] In the case of a neural network model based on transformer and / or CNN, the neural processing unit 1000 can selectively process matrix multiplication operations, convolution operations, etc., depending on the architecture of the neural network.

[0099] For example, in each layer of a convolutional neural network (CNN), the input feature map corresponding to the input data and the kernel corresponding to the weights can be matrices comprising multiple channels. A convolution operation can be performed on the input feature map and the kernel, generating an output feature map for each channel. An activation function can be applied to the output feature map to generate an activation map for the corresponding channel. Pooling can then be applied to the activation map. In this disclosure, activation maps can be collectively referred to as output feature maps, while each feature map and weight can be referred to as a tensor.
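The convolution, activation, and pooling sequence described above can be sketched for a single channel as follows. This is an illustrative sketch only: it assumes no padding, stride 1, a ReLU activation, and 2x2 max pooling, and the function names and sample values are hypothetical.

```python
def conv2d(feature_map, kernel):
    """Valid (no padding), stride-1 2D convolution as used in CNNs
    (cross-correlation: the kernel is not flipped)."""
    kh, kw = len(kernel), len(kernel[0])
    oh = len(feature_map) - kh + 1
    ow = len(feature_map[0]) - kw + 1
    return [
        [
            sum(
                feature_map[i + di][j + dj] * kernel[di][dj]
                for di in range(kh)
                for dj in range(kw)
            )
            for j in range(ow)
        ]
        for i in range(oh)
    ]


def relu(fmap):
    """Element-wise ReLU activation."""
    return [[max(0, v) for v in row] for row in fmap]


def max_pool_2x2(fmap):
    """Non-overlapping 2x2 max pooling; assumes even height and width."""
    return [
        [
            max(fmap[i][j], fmap[i][j + 1], fmap[i + 1][j], fmap[i + 1][j + 1])
            for j in range(0, len(fmap[0]), 2)
        ]
        for i in range(0, len(fmap), 2)
    ]


input_fmap = [
    [1, 0, 2],
    [0, 1, 0],
    [2, 0, 1],
]
kernel = [
    [1, -1],
    [-1, 1],
]
output_fmap = conv2d(input_fmap, kernel)   # output feature map
activation_map = relu(output_fmap)         # activation map
pooled = max_pool_2x2(activation_map)      # pooled result
```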

[0100] However, the examples disclosed herein are not limited to this; the output feature maps can be subjected to matrix multiplication, convolution, and other operations.

[0101] Furthermore, the output feature map according to the example of this disclosure is to be understood in a broad sense. For example, the output feature map may be the result of matrix multiplication or convolution operations. Therefore, the plurality of processing elements (PEs) included in the processing element array 400 may be modified to further include processing circuit units for additional algorithms.

[0102] The neural processing unit 1000 can be configured to include multiple processing elements (PEs) for processing convolutions and matrix multiplications required for neural network operations.

[0103] The neural processing unit 1000 can be configured to include corresponding computational circuits suitable for matrix multiplication, convolution, activation function, pooling, stride, batch normalization, skip connection, concatenation, quantization, pruning, padding, softmax, and attention operations required for neural network operations.

[0104] For example, the neural processing unit 1000 may be configured to include one or more circuits for the special function unit (SFU) 500 for processing at least one of the algorithms described above: activation function operation, pooling operation, stride operation, batch normalization operation, skip connection operation, concatenation operation, quantization operation, pruning operation, padding operation, softmax operation, and attention operation.

[0105] Multiple tensors transmitted to the neural processing unit 1000 via bus 6000 can be configured to be controlled by a first controller 1100. Specifically, the first controller 1100 can be configured to determine which tensors cause data starvation based on the cycle time of memory access operations and computation operations (i.e., the number of clock cycles spent processing the tensor) before each processing core 1000-1, …, 1000-n directly accesses memory 3000 to read and / or write to memory 3000, and to determine the priority processing of the identified tensors and subsequent tensors. Bus 6000 can be configured to process each tensor sequentially based on the determined priority. Therefore, the first controller 1100 can be configured to ensure that lower-priority tensors yield bus 6000 bandwidth to higher-priority tensors, thereby preventing data starvation.
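One way to picture the priority decision described above is the following sketch. The ordering rule used here (compute cycles minus memory cycles, smallest first) is an assumed heuristic chosen for illustration; this disclosure does not state that the first controller 1100 uses this exact rule, and the names `TensorRequest`, `slack`, and `prioritize` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class TensorRequest:
    name: str
    memory_cycles: int   # minimum cycles to move the tensor over the bus
    compute_cycles: int  # cycles the core spends computing on the tensor


def slack(req):
    # Assumed heuristic: compute time available to hide the memory access.
    # Negative slack marks a memory-bound tensor that risks starvation.
    return req.compute_cycles - req.memory_cycles


def prioritize(requests):
    """Order competing requests so the most starvation-prone come first."""
    return sorted(requests, key=slack)


pending = [
    TensorRequest("t0", memory_cycles=40, compute_cycles=120),
    TensorRequest("t1", memory_cycles=90, compute_cycles=30),
    TensorRequest("t2", memory_cycles=60, compute_cycles=60),
]
order = [r.name for r in prioritize(pending)]
starving = [r.name for r in pending if slack(r) < 0]
```

Under this assumed rule, the memory-bound tensor `t1` is served first and the compute-bound tensor `t0` yields the bus, matching the idea that lower-priority tensors give up bandwidth to higher-priority ones.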

[0106] According to one example of this disclosure, if the bus 6000 is divided into a first bus 6100 for reading and a second bus 6200 for writing, the first controller 1100 can relinquish bus bandwidth for the write operation on a tensor or for the read operation on the tensors that follow it.

[0107] On the other hand, each tensor can have its own data size, and the first controller 1100 can calculate the number of clock cycles based on the size of each tensor sent on the bus 6000. Furthermore, the compiler can pre-calculate the number of clock cycles for processing the corresponding tensor in the processing core when compiling the corresponding neural network model. Therefore, the first controller 1100 can obtain cycle information for each tensor calculated at compile time. As will be further described, the clock cycles for the memory access operation of each tensor are called memory cycles, and the clock cycles for the computation operation are called computation cycles. Moreover, the pre-calculated number of computation cycles is unlikely to change when the neural processing unit actually processes the corresponding tensor. This is the case when the neural processing unit is a dedicated AI accelerator configured to process neural network models. In contrast, the pre-calculated number of memory cycles can be a minimum and can dynamically increase above that minimum for various reasons, such as bandwidth contention on the bus or low priority in the sequential queue. Therefore, the pre-acquired memory cycles can refer to the minimum number of memory cycles.
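The relationship between tensor size and the minimum number of memory cycles can be sketched as follows. The sketch assumes a bus that moves a fixed number of bytes per clock cycle with no protocol overhead; `bus_bytes_per_cycle` and `wait_cycles` are hypothetical parameters not defined in this disclosure.

```python
def min_memory_cycles(tensor_bytes, bus_bytes_per_cycle):
    """Lower bound on the bus cycles needed to transfer a tensor,
    assuming one full-width transfer per cycle and no contention."""
    if bus_bytes_per_cycle <= 0:
        raise ValueError("bus width must be positive")
    # Integer ceiling division: a partial last transfer still costs a cycle.
    return (tensor_bytes + bus_bytes_per_cycle - 1) // bus_bytes_per_cycle


def observed_memory_cycles(tensor_bytes, bus_bytes_per_cycle, wait_cycles):
    """Actual cycles: the minimum plus cycles spent waiting in the queue."""
    return min_memory_cycles(tensor_bytes, bus_bytes_per_cycle) + wait_cycles


minimum = min_memory_cycles(tensor_bytes=4096, bus_bytes_per_cycle=16)
contended = observed_memory_cycles(4096, 16, wait_cycles=50)
```

The compile-time value corresponds to `minimum`; contention or a low queue position can only add to it, never reduce it.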

[0108] Specifically, the neural processing unit 1000 may include a second controller 100, a direct memory access (DMA) unit 200, internal memory 300, a processing element array 400, and special function units 500. However, in describing the neural processing unit 1000, the following description will be limited to a single processing core 1000-1. This is merely for ease of description, and the description can be applied substantially equally to any processing core included in the neural processing unit 1000.

[0109] The components of the processing core 1000-1 are distinguished by their operational functions, and each component can be formed using at least one of a substrate, a resistive element, and a transistor. Therefore, each component can be a semiconductor circuit with many transistors connected thereon, some of which may be difficult to identify and distinguish with the naked eye and may only be identifiable by their operation. Thus, each functional unit of the processing core 1000-1 can be called a circuit unit.

[0110] The second controller 100 can be configured to control the operations of DMA 200, internal memory 300, processing element array 400, and SFU 500 that are associated with computing each neural network model. The second controller 100 can be directly or indirectly coupled to each of DMA 200, internal memory 300, processing element array 400, and SFU 500 so as to communicate with them. For example, the second controller 100 can adjust the cache size of each tensor stored in internal memory 300 for each computational step based on the capacity of internal memory 300. The second controller 100 can be configured to control processing core 1000-1 based on the machine code (e.g., binary code) of the compiled neural network model.

[0111] For example, a compiler can generate machine code that determines the read / write sequence of neural network model data, as well as information about the processing sequences of neural network layers, the operation sequences of convolution multiplication, the operation sequences of matrix multiplication, and the read / write operation sequences of DMA data. These sequences are determined based on the hardware characteristics of the processing core 1000-1, such as the number of processing elements, memory capacity, functional circuit units within special function units, and the presence of post-processing units. Therefore, the second controller 100 can control the processing core 1000-1 based on the machine code. Machine code can be referred to as binary code, executable code, etc.

[0112] The second controller 100 can obtain scheduling information based on the directed acyclic graph (DAG) of the neural network model compiled by the compiler. This scheduling information schedules the sequence of operations of the neural network model to be executed by the processing core 1000-1. A computational step can be processed in a tensor unit. Here, the compiler can determine the scheduling operations that can accelerate the operation of the neural network model by determining the number of PEs of the processing core 1000-1, the size of the internal memory 300, the parameter size of each layer of the neural network model, etc. According to the computational schedule, the second controller 100 can be configured to control the number of PEs required for each computational step and control the read and write operations of parameters in the internal memory 300 of each computational step. The compiler can efficiently schedule computational steps based on its understanding of the hardware architecture and capabilities of the processing core 1000-1. The compiler can determine the data order required to compute the neural network model based on the sequence of operations of the neural network layers, convolutions, and / or matrix multiplications, and can generate compiled machine code. The parameters input to the neural processing unit in a computational step are called input tensors, while the parameters output from the neural processing unit in a computational step are called output tensors.
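As an illustration of deriving an execution order from a directed acyclic graph, the sketch below uses a topological sort from the Python standard library. The layer names and dependencies are hypothetical, and a topological sort is only one possible way to obtain a valid order; this disclosure does not state which algorithm the compiler uses.

```python
from graphlib import TopologicalSorter

# Hypothetical layer graph: each key maps to the set of layers it depends on.
dag = {
    "conv1": set(),
    "conv2": {"conv1"},
    "skip": {"conv1"},
    "add": {"conv2", "skip"},
    "softmax": {"add"},
}

# static_order() yields the nodes so that every layer appears after
# all of the layers it depends on.
schedule = list(TopologicalSorter(dag).static_order())
```

A real compiler would additionally weigh the number of PEs, the internal memory size, and the parameter size of each layer, as the paragraph above describes.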

[0113] In some examples, processing core 1000-1 can be configured to include an embedded compiler. Based on the above configuration, processing core 1000-1 can be configured to generate machine code upon receiving one or more file inputs in various AI software framework formats. For example, AI software frameworks may include TensorFlow, PyTorch, Keras, XGBoost, mxnet, DARKNET, ONNX, etc. However, the examples disclosed herein are not limited to any specific AI software framework.

[0114] DMA 200 can be configured to access memory 3000 via bus 6000 and request reads and / or writes to memory 3000. Processing core 1000-1 can receive various data associated with the neural network model from memory 3000 via DMA 200. Memory 3000 can be included in a system-on-chip (SoC) or can be configured as a separate memory device.

[0115] Internal memory 300 can be memory located in the on-chip area of processing core 1000-1, and can be memory used to cache or store data processed in the on-chip area. That is, internal memory 300 can also be called cache memory.

[0116] Furthermore, internal memory 300 can read from memory 3000 and store at least some data required for computing the neural network model. This at least some data may be referred to as tensors. Internal memory 300 can be configured to store all or part of the neural network model based on the storage capacity setting for each parameter and the data size of each layer of the neural network model. Parameters of representative data processed for the operation of the neural network model may include at least one of attention parameters, KV (key-value) cache parameters, activation map parameters, input feature map parameters, output feature map parameters, and weight parameters.

[0117] Specifically, internal memory 300 can read parameters corresponding to the input data from memory 3000 and store them. Furthermore, internal memory 300 can read parameters corresponding to the output data from processing element array 400 and store them. As further described below, the parameters included in the neural network model can include input values and weights. Input or output values read from or written to internal memory 300 can include at least one of activation parameters, feature map parameters, KV cache parameters, attention parameters, etc.

[0118] Internal memory 300 may include at least one of the following types of memory: register file, ROM, SRAM, DRAM, resistive RAM, magnetoresistive RAM, phase-change RAM, ferroelectric RAM, flash memory, HBM, etc. According to one example of this disclosure, internal memory 300 may be SRAM, and is configured such that SRAM is advantageous in terms of computational processing speed. Furthermore, internal memory 300 may be organized into at least one memory cell (e.g., a memory bank, etc.). Internal memory 300 may include homogeneous memory or heterogeneous memory.

[0119] Furthermore, the data stored in the storage units of internal memory 300 (e.g., parameters of the neural network model) is not fixed to one of the attention, KV cache, activation map, input feature map, weights, and output feature map, but can be changed to another of the attention, KV cache, activation map, input feature map, weights, and output feature map as needed. In other words, by changing the memory allocation of internal memory 300, the utilization efficiency of internal memory 300 can be improved; that is, the size of each tensor stored in internal memory 300 can vary for each computation step.

[0120] The processing element array 400 can be configured to contain multiple processing elements that perform multiplication and accumulation (MAC) operations.

[0121] Each element of the processing element array 400 can be configured to perform an operation by receiving input, such as an input feature map corresponding to the input data and / or a kernel corresponding to the weights of the neural network.

[0122] Processing elements can be configured to perform functions such as addition, multiplication, and accumulation required for processing neural network models. To this end, each processing element may include at least one of the following: MAC (multiplication and accumulation) operator, adder tree, and ALU (arithmetic logic unit) operator.

[0123] For example, a processing element can receive an input feature map and weights, perform convolution calculations, and output an output feature map. Furthermore, the array of processing elements 400 or the processing elements themselves can be referred to as an artificial intelligence (AI) computing unit.

[0124] In another example, a processing element (PE) can use the input feature map and weights as input to perform a generalized matrix multiplication (GEMM) operation, or matrix multiplication, to output an output feature map. More specifically, the processing element (PE) can multiply the input feature map in matrix form with a weight matrix, and then add a bias to the matrix to output an output feature map in matrix form. In particular, matrix multiplication can be performed at high speed through parallel processing in a neural processing unit, enabling efficient handling of matrix multiplication operations.
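The matrix multiplication with bias described above can be sketched in plain Python as follows. The function name `gemm_bias` and the sample matrices are illustrative assumptions; the sketch is sequential and does not model the parallel processing of the hardware.

```python
def gemm_bias(a, b, bias):
    """Compute a @ b + bias for list-of-lists matrices.
    a is (m x k), b is (k x n); bias has length n and is added to every row."""
    m, k, n = len(a), len(b), len(b[0])
    if any(len(row) != k for row in a) or len(bias) != n:
        raise ValueError("shape mismatch")
    return [
        [sum(a[i][p] * b[p][j] for p in range(k)) + bias[j] for j in range(n)]
        for i in range(m)
    ]


ifm = [[1, 2], [3, 4]]   # input feature map in matrix form
w = [[5, 6], [7, 8]]     # weight matrix
out = gemm_bias(ifm, w, bias=[1, -1])
```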

[0125] As another example, the processing element PE may include a circuit system designed to receive only integer type parameters as input. In this case, the input parameters of the processing element PE can be converted into integers of a specific width and stored in internal memory 300. According to the above configuration, power consumption can be effectively reduced compared to processors that support floating-point parameters, and it can be efficiently implemented on the device.

[0126] The SFU 500 can handle multiple activation functions to give the output feature map nonlinearity.

[0127] The activation function processed by the special function unit 500 may include, but is not limited to, the SiLU function, the Softmax function, the sigmoid function, the hyperbolic tangent (tanh) function, the ReLU function, the Leaky-ReLU function, the Maxout function, or the ELU function, which produces a non-linear output value relative to the input value.

[0128] On the other hand, it may be technically difficult to support all activation functions in the processing core 1000-1. Therefore, the processing core 1000-1 can approximate various activation functions using piecewise linear function approximation algorithms and piecewise linear function processing circuitry. These activation functions can be selectively applied after MAC operations. The operation values to which activation functions are applied can be called the activation map.
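A piecewise linear approximation of an activation function can be sketched as follows. The evenly spaced breakpoints, the range [-4, 4], and the 16 segments are assumptions chosen for illustration; this disclosure does not specify how the breakpoints are selected.

```python
import bisect
import math


def build_pwl_table(fn, lo, hi, segments):
    """Sample fn at evenly spaced breakpoints on [lo, hi]."""
    xs = [lo + (hi - lo) * i / segments for i in range(segments + 1)]
    ys = [fn(x) for x in xs]
    return xs, ys


def pwl_eval(xs, ys, x):
    """Linear interpolation between breakpoints; clamps outside the range."""
    if x <= xs[0]:
        return ys[0]
    if x >= xs[-1]:
        return ys[-1]
    i = bisect.bisect_right(xs, x) - 1
    t = (x - xs[i]) / (xs[i + 1] - xs[i])
    return ys[i] + t * (ys[i + 1] - ys[i])


xs, ys = build_pwl_table(math.tanh, -4.0, 4.0, segments=16)

# Largest deviation from the exact tanh on a fine grid over [-4, 4].
max_error = max(
    abs(pwl_eval(xs, ys, x / 100) - math.tanh(x / 100)) for x in range(-400, 401)
)
```

With a table like this, evaluating the function needs only a lookup, one multiplication, and one addition per value, which is why such approximations suit fixed hardware.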

[0129] In addition, the SFU 500 can be configured to include floating-point multiplier circuitry to perform floating-point operations.

[0130] As another example, the SFU 500 can be configured to communicate with the processing element PE and can include a circuit system designed to receive integer parameters output from the processing element PE. In this case, the SFU 500 can be further configured to include dequantization circuitry configured to convert the integer parameters to floating-point parameters. Furthermore, the SFU 500 can be configured to process activation function operations with floating-point parameters. Additionally, the SFU 500 can also be configured to include quantization circuitry configured to convert the floating-point parameters to integer parameters at the end of the activation function operation. According to the above configuration, the SFU 500 can be configured to process floating-point operations by dequantizing the integer parameters when floating-point operations are required, and to requantize the result. In other words, a neural processing unit according to an example of this disclosure can include processing element circuitry configured to process integer parameters and an SFU connected thereto via pipelines, and the SFU can include quantization and dequantization circuitry and can be configured to process activation function operations with floating-point parameters. According to the above configuration, SFU 500 communicates effectively with the processing element PE, which supports only integer parameters, and can convert and process parameter types directly even without external circuitry. That is, neural processing unit 1000 is configured to receive integer-formatted tensors via bus 6000 according to a request from DMA 200, and store the integer-formatted tensors in internal memory 300. The processing element PE can be configured to compute the integer-formatted tensors. SFU 500 can be configured to receive the integer-formatted tensors computed by the processing element PE as input, convert them to floating-point tensors, process the result of at least one special function, convert them back to integer-formatted tensors, and store them in internal memory 300. Neural processing unit 1000 can send the results stored in internal memory 300 to memory 3000 via bus 6000 according to a request from DMA 200.
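The integer-in, integer-out path through the SFU described above can be sketched as follows. The affine quantization scheme, the scales `1 / 16` and `1 / 128`, the 8-bit signed output range, and the sigmoid activation are all assumptions made for illustration and are not taken from this disclosure.

```python
import math


def dequantize(q, scale, zero_point=0):
    """Map an integer code to a real value (affine quantization)."""
    return (q - zero_point) * scale


def quantize(x, scale, zero_point=0, bits=8):
    """Map a real value to a signed integer code, saturating at the range ends."""
    qmin, qmax = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    q = round(x / scale) + zero_point
    return max(qmin, min(qmax, q))


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sfu_activation(int_values, in_scale, out_scale, fn):
    """Integer in, integer out; the activation itself runs in floating point."""
    return [quantize(fn(dequantize(q, in_scale)), out_scale) for q in int_values]


pe_output = [-64, 0, 64]  # integer results from the processing elements
sfu_output = sfu_activation(pe_output, in_scale=1 / 16, out_scale=1 / 128, fn=sigmoid)
```

Only integers cross the boundary between the processing elements and internal memory; the floating-point values exist only inside `sfu_activation`.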

[0131] The detailed configuration of the processing element is now described with reference to Figure 3. Figure 3 is a schematic diagram illustrating a processing element according to an example of the present disclosure. The processing element PE 410 can be configured to include a multiplier 411, an adder 412, an accumulator 413, and a bit quantization unit 414. However, the example according to the present disclosure is not limited to this architecture, and the array of processing elements can be modified to take into account the computational characteristics of the target neural network model.

[0132] Multiplier 411 multiplies the input (N)-bit data by (M)-bit data. The result of the multiplier 411 is output as (N+M)-bit data, where N and M are integers greater than zero. A first input can be configured to receive (N)-bit data, and a second input can be configured to receive (M)-bit data. The first input can be configured to receive an activation value, and the second input can be configured to receive a weight value. The second controller 100 can control the internal memory 300 to reuse parameters stored in the internal memory 300 according to machine code. Parameter reuse may mean that parameters stored in the internal memory 300 are not deleted, copied, or moved to memory 3000, but are reused in subsequent operations. According to the above configuration, as shown in Figure 5, this has the effect of reducing the power consumed by operations on memory 3000. In Figure 5, a 32-bit SRAM read refers to the energy required to read one bit of data from internal memory 300, and a 32-bit DRAM read refers to the energy required to read one bit of data from memory 3000 via bus 6000. It also has the effect of eliminating the latency that occurs when the neural processing unit 1000 sends data to and receives data from memory 3000 via bus 6000.

[0133] In other words, the second controller 100 can obtain reusable variable parameters and reusable constant parameters based on the machine code of the compiled neural network model. Therefore, the second controller 100 can be configured to control the internal memory 300 to reuse parameters stored in the internal memory 300.

[0134] The processing element can constrain the operation of multiplier 411 so that, when zero is input at one of the first and second inputs of multiplier 411, multiplier 411 does not perform the operation, because the product of zero and any number is known to be zero without performing the multiplication.

[0135] For example, when zero is input to one of the first and second inputs of multiplier 411, multiplier 411 can be configured to operate in a zero-jumping (zero-skipping) manner. For zero-jumping, each processing element PE included in the processing element array 400 can be individually enabled or disabled. The second controller 100 can be configured to provide an enable or disable signal to each processing element PE on a per-clock-cycle basis. When a processing element PE is disabled, multiplier 411 can be configured to be disabled. Therefore, the power consumed by the operation of multiplier 411 can be reduced. For example, Figure 5 provides information about the power consumption of the multiplier. Adder 412 can also be configured to be disabled when the processing element PE is disabled. Therefore, the power consumed by the operations of adder 412 can be reduced. For example, Figure 5 provides information about the power consumption of the adder.
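The effect of skipping multiplications with a zero operand can be sketched in software as follows. This is a behavioral illustration only: the function name `mac_with_zero_skip` and the sample operands are hypothetical, and the sketch does not model the enable and disable signals of the hardware.

```python
def mac_with_zero_skip(activations, weights):
    """Dot product that skips the multiply-add whenever either operand is zero.
    Returns the accumulated value and the number of skipped multiplications."""
    acc = 0
    skipped = 0
    for a, w in zip(activations, weights):
        if a == 0 or w == 0:
            skipped += 1  # multiplier and adder stay idle; acc is kept as is
            continue
        acc += a * w
    return acc, skipped


acc, skipped = mac_with_zero_skip([3, 0, 5, 2], [4, 7, 0, 1])
```

The accumulated result is identical to a full dot product, while half of the multiplications in this example are never performed.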

[0136] In some examples, each processing element PE can be designed to receive a corresponding control signal from the second controller 100 to control (i.e., enable or disable) the zero-jump operation.

[0137] In some examples, each multiplier 411 of each processing element PE can be designed to receive a corresponding control signal from the first controller 1100 for controlling zero-jumping operations. According to the above configuration, the power consumption of the multipliers can be reduced by zero-jumping.

[0138] In some examples, each adder 412 of each processing element PE can be designed to receive a corresponding control signal from the second controller 100 for controlling zero-jumping operations. According to the above configuration, zero-jumping can reduce the power consumption of the adders.

[0139] In some examples, each of the multipliers 411 and adders 412 of each processing element PE can be designed to simultaneously receive a corresponding control signal from the second controller 100 for controlling zero-jumping operations. According to the above configuration, the power consumption of the multipliers and adders can be reduced by zero-jumping.

[0140] In some examples, the weights are trained constant parameters, and the machine code of the compiled neural network model including the weights can be programmed to input a corresponding control signal, for controlling the zero-jump operation, to each processing element PE that receives a zero weight value.

[0141] The number of bits of data input to the first and second inputs can be determined based on the quantization of the node data and the weight data of the corresponding layers in the neural network model. For example, the node data of the first layer can be quantized to 5 bits, while the weight data of the first layer can be quantized to 7 bits. In this case, the first input can be configured to receive 5 bits of data, while the second input can be set to receive 7 bits of data; that is, the number of bits of data input to each input can be different.

[0142] The processing element PE can be configured to receive quantization information of the data input to each input. The neural network data locality information can include quantization information of both the input and output data of the processing element PE.

[0143] In some examples, the processing core 1000-1 can be controlled so that when the quantization bit width information is input to the input of the processing element PE, the quantized data stored in the internal memory 300 is dynamically converted. That is, different tensors can have different quantization bit widths, and the processing element PE can be configured to generate input data by receiving bit width information from the processing core 1000-1 in real time as the bit width of the input data is converted.

[0144] Accumulator 413 uses adder 412 to accumulate, over multiple (L) loops, the operation values of multiplier 411 and the value held in accumulator 413. Therefore, the data at the input and output of accumulator 413 can have a width of (N+M+log2(L)) bits, where L is a positive integer.
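
The bit-width expression above can be checked with a short calculation. The following is a minimal sketch, not part of the disclosure: the function name `accumulator_bits` is hypothetical, and rounding log2(L) up to a whole number of bits is an assumption, since the text gives the expression without rounding.

```python
def accumulator_bits(n_bits: int, m_bits: int, loops: int) -> int:
    """Width needed to accumulate `loops` products of an N-bit and an M-bit operand."""
    if loops < 1:
        raise ValueError("loops must be a positive integer")
    # A single product needs N + M bits.
    # (loops - 1).bit_length() equals ceil(log2(loops)) for loops >= 1 (assumed rounding).
    return n_bits + m_bits + (loops - 1).bit_length()


# Example: 5-bit node data, 7-bit weights, 16 accumulation loops.
print(accumulator_bits(5, 7, 16))
```

With the 5-bit and 7-bit quantization mentioned earlier and L = 16, the sketch yields 5 + 7 + 4 = 16 bits.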

[0145] Once the accumulator 413 has finished accumulating, it can receive an initialization reset signal to initialize the data stored inside the accumulator 413 to zero. However, the examples according to this disclosure are not limited thereto.

[0146] Accumulator 413 is configured to store the accumulated value even when zero-jumping is enabled in the corresponding processing element PE. Therefore, subsequent values can be accumulated even when zero-jumping is enabled.
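
The interaction between zero-jumping and the accumulator can be illustrated in software. This is a minimal sketch under stated assumptions: the function `mac_with_zero_skip` is hypothetical, it models only a zero weight as the skip condition, and it counts multiplications as a rough stand-in for the power saved.

```python
def mac_with_zero_skip(inputs, weights, skip_zeros=True):
    """Return (accumulated value, number of multiplications actually performed)."""
    acc = 0
    multiplies = 0
    for x, w in zip(inputs, weights):
        if skip_zeros and w == 0:
            # Skipped step: no multiply or add, the accumulator keeps its value.
            continue
        acc += x * w
        multiplies += 1
    return acc, multiplies


print(mac_with_zero_skip([1, 2, 3, 4], [2, 0, 0, 5]))         # skipping enabled
print(mac_with_zero_skip([1, 2, 3, 4], [2, 0, 0, 5], False))  # skipping disabled
```

Both calls produce the same accumulated value; only the number of multiplications differs.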

[0147] The bit quantization unit 414 can reduce the bit width of the data output from the accumulator 413. The bit quantization unit 414 can be controlled by the second controller 100. The bit width of the quantized data can be output as (X) bits, where X is a positive integer. According to the above configuration, the processing element array is configured to perform MAC operations and to quantize and output the MAC operation results. In particular, as the number of (L) loops increases, this quantization further reduces power consumption. Reducing power consumption also reduces heat generation in edge devices, which in turn reduces the possibility of malfunction caused by a high temperature of the processing core 1000-1.

[0148] The output data (X) bits of the bit quantization unit 414 can be equal to or different from the (N) bits and / or (M) bits. For example, the (X) bits can be set to a bit width such that no overflow of the output data (X) bits occurs based on the maximum value that can be accumulated in the accumulator 413. For example, the (X) bits can be 16 bits, 24 bits, or 32 bits.

[0149] According to an example of this disclosure, the processing element array of a processing core 1000-1 may include a multiplier 411, an adder 412, an accumulator 413, and a bit quantization unit 414. The bit quantization unit 414 can reduce the number of data bits (N+M+log2(L)) output from the accumulator 413 to (X) bits. A second controller 100 can control the bit quantization unit 414 to reduce the number of bits in the output data from the least significant bit (LSB) to a predetermined number of the most significant bit (MSB).
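
One way to picture the bit reduction is as a right shift that discards low-order bits. The sketch below is an assumption about how such a reduction could be modeled in software, not the circuit of the disclosure; `quantize_accumulator` and its parameters are hypothetical names.

```python
def quantize_accumulator(value: int, acc_bits: int, out_bits: int) -> int:
    """Keep the top `out_bits` of an `acc_bits`-wide value by dropping LSBs."""
    if out_bits >= acc_bits:
        return value  # nothing to drop
    # Python's >> is an arithmetic shift, so negative values round toward -infinity.
    return value >> (acc_bits - out_bits)


# An 8-bit value 0b10110110 reduced to its 4 most significant bits.
print(bin(quantize_accumulator(0b10110110, 8, 4)))
```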

[0150] In some examples, the quantization level can be determined separately for each tensor of the neural network model.

[0151] According to the processing element PE, by adjusting the number of bits of the (N)-bit data and (M)-bit data of the multiplier 411 and the number of bits of the operation value (X)-bit determined by the bit quantization unit 414, the processing element array has the effect of preventing MAC operation overflow.

[0152] Figure 4 is a schematic diagram illustrating an example neural network. An exemplary convolutional neural network can be a combination of one or more convolutional layers, pooling layers, and fully connected layers. Convolutional neural networks have a structure suitable for training and inference on two-dimensional data and can be trained via a backpropagation algorithm.

[0153] In one example disclosed herein, the convolutional neural network has a kernel for each channel, which extracts features from the input image for that channel. The kernels can be organized as a two-dimensional matrix and perform convolution operations as they traverse the input data. The size of the kernels can be arbitrary, and the stride of the kernel traversing the input data can also be arbitrary. The convolution result of the entire input data for each kernel can be called a feature map or activation map.

[0154] As used herein, a kernel can include a single set of weights or multiple sets of weights. The number of kernels per layer can be referred to as the number of channels.

[0155] Therefore, since convolution is a combination of input data and kernels, an activation function can be subsequently applied to add non-linearity. When an activation function is applied to a feature map that is the result of a convolution operation, it can be called an activation map.

[0156] Referring to Figure 4, a convolutional neural network can contain at least one convolutional layer, at least one pooling layer, and at least one fully connected layer. For example, convolution can be defined by two main parameters: the size of the input data (typically a 1×1, 3×3, or 5×5 matrix) and the depth of the output feature map (the number of kernels). These key parameters can be computed using convolution. These convolutions can start at a depth of 32, continue to a depth of 64, and terminate at a depth of 128 or 256. A convolution operation can be understood as sliding a kernel of size 3×3 or 5×5 across the input data (an input image matrix), multiplying each weight of the kernel by the corresponding element of the overlapping region of the input image matrix, and then summing the products.
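
The sliding multiply-and-sum described above can be written out directly. The following is a minimal single-channel sketch, assuming no padding and plain Python lists; the function name `conv2d` and the `stride` parameter are illustrative and not taken from the disclosure. As is usual for neural networks, the kernel is not flipped.

```python
def conv2d(image, kernel, stride=1):
    """Slide `kernel` over `image` and sum the element-wise products at each position."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = (len(image) - kh) // stride + 1
    out_w = (len(image[0]) - kw) // stride + 1
    feature_map = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            total = 0
            for di in range(kh):
                for dj in range(kw):
                    # Multiply each kernel weight by the overlapping input element.
                    total += kernel[di][dj] * image[i * stride + di][j * stride + dj]
            row.append(total)
        feature_map.append(row)
    return feature_map


image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
kernel = [[1, 0], [0, 1]]
print(conv2d(image, kernel))
```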

[0157] Activation functions can be applied to the resulting output feature map to obtain the final output activation map. Pooling layers can perform pooling operations to downsample the output data (i.e., the activation map) to reduce the size of the feature map. For example, pooling operations can include, but are not limited to, max pooling and / or average pooling.

[0158] Max pooling uses a kernel and outputs the maximum value in the feature map region where the kernel slides and overlaps. Average pooling outputs the average value in the feature map region where the kernel slides and overlaps. Because these pooling operations reduce the size of the feature map, they also reduce the number of parameters in the feature map.
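
The two pooling operations can be sketched as follows. This is an illustrative example only: the function `pool2d`, its default 2×2 window, and its stride of 2 are assumptions rather than values given in the disclosure.

```python
def pool2d(feature_map, size=2, stride=2, mode="max"):
    """Downsample a 2D feature map with max or average pooling."""
    pooled = []
    for i in range(0, len(feature_map) - size + 1, stride):
        row = []
        for j in range(0, len(feature_map[0]) - size + 1, stride):
            # Collect the values under the pooling window.
            window = [feature_map[i + di][j + dj]
                      for di in range(size) for dj in range(size)]
            if mode == "max":
                row.append(max(window))
            else:
                row.append(sum(window) / len(window))
        pooled.append(row)
    return pooled


fm = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
print(pool2d(fm, mode="max"))
print(pool2d(fm, mode="avg"))
```

A 4×4 feature map becomes 2×2 in both modes, which is the size reduction the paragraph refers to.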

[0159] Fully connected layers can classify the data output from pooling layers into multiple categories (i.e., estimates) and output the classified categories and their scores. The data output from pooling layers is in the form of a 3D feature map, which can be converted into a 1D vector and input into the fully connected layer.

[0160] In one example, with further reference to Figure 2, a neural network model processed by processing core 1000-1 according to one example of this disclosure may be related to image classification and object detection.

[0161] For example, the input data of the processing element array 400 of the neural processing unit 1000 processing the above-mentioned neural network model can be image data, and the output data of the processing element array 400 can be multiple bounding box data of the input image. Each of the multiple bounding box data can contain bounding box coordinate data and class data. The bounding box coordinate data can contain a location confidence score, height data, width data, X coordinate data, and Y coordinate data. Assuming that the shape of the bounding box is rectangular, the bounding box coordinate data can contain the height data, width data, X coordinate data, and Y coordinate data as described above. However, the shape of the bounding box is not limited to a rectangle, and can instead be a pentagon, a polygon with more sides, or a circle. Therefore, the quantity and type of bounding box coordinate data can vary according to the shape of the bounding box. In addition, the class data can contain multiple classes classified as existing within the bounding box and their scores.

[0162] Figure 5 is a table showing the energy consumption per unit operation of a neural processing unit according to an example of this disclosure. Power reduction techniques for the internal memory 300 of the processing core 1000-1 will be described. Figure 5 schematically shows the energy consumed per unit operation of the processing core 1000-1. Energy consumption can be divided into memory access, addition, and multiplication operations. The unit of energy is the picojoule (pJ). "8b Add" refers to the 8-bit integer addition operation of adder 412. An 8-bit integer addition operation consumes 0.03 pJ of energy. "16b Add" refers to the 16-bit integer addition operation of adder 412. A 16-bit integer addition operation consumes 0.05 pJ of energy. "32b Add" refers to the 32-bit integer addition operation of adder 412. A 32-bit integer addition operation consumes 0.1 pJ of energy. "16b FPAdd" refers to the 16-bit floating-point addition operation of adder 412. A 16-bit floating-point addition operation consumes 0.4 pJ of energy. "32b FPAdd" refers to the 32-bit floating-point addition operation of adder 412. A 32-bit floating-point addition operation consumes 0.9 pJ of energy. "8bMult" refers to the 8-bit integer multiplication operation of multiplier 411. An 8-bit integer multiplication operation consumes 0.2 pJ of energy. "32bMult" refers to the 32-bit integer multiplication operation of multiplier 411. A 32-bit integer multiplication operation consumes 3.1 pJ of energy. "16bFPMult" refers to the 16-bit floating-point multiplication operation of multiplier 411. A 16-bit floating-point multiplication operation consumes 1.1 pJ of energy. "32bFPMult" refers to the 32-bit floating-point multiplication operation of multiplier 411. A 32-bit floating-point multiplication operation consumes 3.7 pJ of energy. "32b SRAM read" refers to the read access of 32-bit data when internal memory 300 is static random access memory (SRAM). Reading 32-bit data from internal memory 300 consumes 5 pJ of energy. "32b DRAM read" refers to the read access of 32-bit data when main memory 3000 is DRAM. Reading 32 bits of data from memory 3000 to internal memory 300 consumes 640 pJ of energy.

[0163] When processing core 1000-1 performs 32-bit floating-point multiplication and 8-bit integer multiplication, the energy consumption per unit operation differs by a factor of approximately 18.5. Likewise, reading 32-bit data from memory 3000 configured as DRAM consumes approximately 128 times the energy of reading 32-bit data from internal memory 300 configured as SRAM.
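
The two ratios follow directly from the per-operation figures quoted for Figure 5. The short sketch below reproduces the arithmetic; the dictionary `ENERGY_PJ` and the helper `energy_ratio` are hypothetical names introduced only for this illustration, and the values are those stated in the description.

```python
# Energy per unit operation in picojoules, as quoted in the description of Figure 5.
ENERGY_PJ = {
    "8b Add": 0.03, "16b Add": 0.05, "32b Add": 0.1,
    "16b FPAdd": 0.4, "32b FPAdd": 0.9,
    "8bMult": 0.2, "32bMult": 3.1,
    "16bFPMult": 1.1, "32bFPMult": 3.7,
    "32b SRAM read": 5.0, "32b DRAM read": 640.0,
}


def energy_ratio(op_a: str, op_b: str) -> float:
    """How many times more energy op_a consumes than op_b."""
    return ENERGY_PJ[op_a] / ENERGY_PJ[op_b]


print(round(energy_ratio("32bFPMult", "8bMult"), 1))          # multiplication comparison
print(round(energy_ratio("32b DRAM read", "32b SRAM read")))  # memory read comparison
```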

[0164] In other words, from a power consumption perspective, power consumption increases with the bit width of the data. Furthermore, floating-point operations consume more power than integer operations. Additionally, reading data from DRAM significantly increases power consumption.

[0165] Therefore, the internal memory 300 of the processing core 1000-1 according to an example of this disclosure can be configured to include a high-speed static memory such as an SRAM transistor, but not DRAM. However, the neural network processing unit according to the example of this invention is not limited to SRAM.

[0166] For example, internal memory 300 may not contain DRAM, and internal memory 300 may be configured to contain static memory that is configured to have relatively higher read and write speeds and consume relatively less power than memory 3000.

[0167] Therefore, the internal memory 300 of the processing core 1000-1 according to an example of this disclosure can be configured to have relatively high read and write speeds compared to the memory 3000, and relatively low power consumption for inference operations of the neural network model.

[0168] Static memory, such as SRAM, MRAM, STT-MRAM, eMRAM, and OST-RAM, can be driven at high speeds. Furthermore, MRAM, STT-MRAM, eMRAM, and OST-RAM are static memories and possess non-volatile characteristics. Therefore, non-volatile memories capable of high-speed operation (such as MRAM) can have the effect that, upon restarting after a power failure, there is no need to provide the neural network model from memory 3000 again. However, the examples according to this disclosure are not limited to this.

[0169] According to the above configuration, the processing core 1000-1 has the effect of significantly reducing DRAM power consumption during inference operations of the neural network model. Furthermore, the SRAM storage cell of the internal memory 300 may include, for example, four to six transistors to store one bit of data. However, the examples according to this disclosure are not limited thereto. Additionally, the MRAM storage cell of the internal memory 300 may include, for example, a magnetic tunnel junction (MTJ) and a transistor to store one bit of data.

[0170] Figures 6A and 6B are diagrams illustrating a system that performs memory access operations using a bus for read operations and another bus for write operations, according to an example of this disclosure.

[0171] Referring to Figure 6A, the system may include a single bus 6000 and multiple processing cores 1000-1 and 1000-2, corresponding to an example of this disclosure. Referring to Figure 6B, the system may include multiple buses 6100 and 6200 and multiple processing cores 1000-1 and 1000-2, corresponding to another example of this disclosure.

[0172] Referring to Figures 6A and 6B, the transfer of multiple tensors to each processing core 1000-1 and 1000-2 via bus 6000 or multiple buses 6100 and 6200 can be configured to be controlled by the first controller 1100. However, the two processing cores 1000-1 and 1000-2 shown in Figures 6A and 6B are for illustrative purposes only and do not limit the number of processing cores. It should be understood that a neural processing unit may contain at least one processing core, and each of the processing cores shown in Figures 6A and 6B can be replaced by a neural processing unit.

[0173] First, the system of Figure 6A, which includes bus 6000, can be configured so that memory 3000 is accessed via bus 6000 for read and / or write requests of the first processing core 1000-1 and the second processing core 1000-2. Specifically, before each processing core 1000-1 and 1000-2 directly accesses memory 3000 for read and / or write operations, the first controller 1100 can be configured to determine which tensors cause data starvation based on the clock cycles of the memory access and computation operations for each tensor (i.e., the number of clock cycles required to process the tensor), and to determine the priority of processing the identified tensors and subsequent tensors. Bus 6000 can be configured to process each tensor sequentially based on the determined priority. Therefore, the first controller 1100 can be configured to yield bandwidth on bus 6000 to higher-priority tensors, thereby eliminating or at least reducing data starvation.

[0174] The system of Figure 6B, which comprises multiple buses 6100 and 6200, can be configured so that memory 3000 is accessed via a first bus 6100 for read operations and a second bus 6200 for write operations, for read and / or write requests of a first processing core 1000-1 and a second processing core 1000-2. Specifically, before each processing core (e.g., 1000-1, 1000-2) directly accesses memory (e.g., 3000) for read and / or write operations, the system can be configured to identify tensors causing data starvation based on the cycles of the memory access and computation operations for each tensor (i.e., the number of clock cycles spent processing the tensor), and to determine the priority between write operations on the identified tensors and read operations on the next tensor. The first bus 6100 and the second bus 6200 can be configured to process each tensor sequentially based on the determined priority. Therefore, the first controller 1100 can be configured to prevent or at least reduce the occurrence of data starvation by yielding the bus bandwidth of lower-priority tensors to higher-priority tensors. For example, if a data starvation period is predicted to occur in a specific tensor, the bus bandwidth allocated to low-priority tensors can be freed up for high-priority tensors by assigning high priority to read operations of the identified specific tensor and low priority to write operations of the next tensor. Therefore, the delay during which the first bus 6100 must wait for the write operation of the next tensor can be reduced.
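
The effect of assigning priorities to pending memory requests can be pictured with a priority queue. The sketch below is a software illustration only and does not describe the bus circuitry of the disclosure: the function `order_requests`, the request names, and the numeric priority values are all hypothetical.

```python
import heapq


def order_requests(requests):
    """Serve higher-priority requests first; equal priorities keep arrival order.

    `requests` is a list of (name, priority) pairs in arrival order.
    """
    # Negate the priority because heapq pops the smallest item first;
    # the arrival index breaks ties.
    heap = [(-priority, index, name)
            for index, (name, priority) in enumerate(requests)]
    heapq.heapify(heap)
    return [heapq.heappop(heap)[2] for _ in range(len(heap))]


pending = [
    ("write of next tensor", 1),     # low priority (assumed value)
    ("read of starving tensor", 3),  # high priority (assumed value)
    ("write of another tensor", 1),
]
print(order_requests(pending))
```

The read of the tensor predicted to starve is served first even though it arrived second.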

[0175] Figures 7A and 7B are diagrams illustrating examples of performing read and write operations on a bus according to another example of this disclosure. Figure 7A shows an example of read and write operations performed in a system such as that of Figure 6A, containing a bus 6000 and multiple processing cores 1000-1 and 1000-2. Figure 7B shows an example of read and write operations performed in a system such as that of Figure 6B, containing multiple buses 6100 and 6200 and multiple processing cores 1000-1 and 1000-2.

[0176] Referring to Figures 7A and 7B, read and write operations are alike in that both are memory access operations, but they differ in the components that execute them. Therefore, multiple buses 6100 and 6200 can be operated separately for read operations and write operations. Furthermore, if two or more buses are provided, buses for read operations and buses for write operations can be set separately among the multiple buses. In this case, the number of buses for read operations and the number of buses for write operations can be the same or different, and the ratio or method of setting the numbers is unrestricted. Specifically, in a read operation, memory 3000 transmits the requested data. After each processing core transmits the address information it intends to access, memory 3000 transfers the requested amount of data to bus 6000 within a specified period. This data should be continuously provided to each processing core. For example, if a processing core sends a request to memory 3000 to access 400 data patches starting at address 0x8000_0000, memory 3000 may consume additional time beyond the time required to send the 400 data patches. Furthermore, in a write operation, when the processing core provides write data, the address information to be used for the write operation and the data to be written to that address can be provided via bus 6000. For example, when a particular processing core sends a request to write 400 data patches to memory starting at address 0x8000_0000, additional time may be consumed to provide the 400 data patches via bus 6000 to perform the actual write operation.

[0177] In Figure 7A, bus 6000 is a single bus and therefore cannot perform bidirectional communication simultaneously with two or more processing cores. If every processing core performs read operations at the same time, memory 3000 has no problem continuously providing large amounts of read data in the direction of the processing cores. However, if the processing cores perform a mix of read and write operations, either the read operation, which sends large amounts of read data to the processing cores, or the write operation, which sends large amounts of write data to memory 3000, may have to wait, resulting in data starvation. This makes data reading or writing inefficient over time.

[0178] As shown in Figure 7A, when the first processing core 1000-1 sends a read operation request to the memory 3000, and the second processing core 1000-2 sends a write operation request to the memory 3000 and provides write data to the memory 3000, the memory 3000 cannot respond to the read operation request from the first processing core 1000-1 until the write operation of the second processing core 1000-2 is completed, and vice versa. Furthermore, when the write operation of the second processing core 1000-2 is completed, the memory 3000 generates a response to the read operation request from the first processing core 1000-1 and provides the read data to the first processing core 1000-1. As explained above, a response to the write operation of the second processing core 1000-2 cannot be executed until the read operation of the first processing core 1000-1 is completed.

[0179] In Figure 7B, multiple buses are provided: a first bus 6100 for read operations and a second bus 6200 for write operations. When there are two or more processing cores, this allows data to be sent to and received from each processing core while data is simultaneously sent to and received from memory. In other words, data from memory 3000 to each processing core occupies the first bus 6100, and data from each processing core to memory 3000 occupies the second bus 6200, thus achieving bidirectional communication. It also reduces the latency required to transmit read / write access requests and read / write access responses. In other words, if the buses for read operations and write operations are not separate, as in Figure 7A, the processing core cannot transmit write data while read data is being transferred from memory 3000 via bus 6000. Therefore, write data can be sent to memory 3000 along with the write request via bus 6000 only after all read data has been sent, so the write operation must wait for the read data transfer to finish. However, in Figure 7B, since the bus for read operations (e.g., the first bus) and the bus for write operations (e.g., the second bus) are separate, when read data from memory 3000 is sent to the processing core via the first bus 6100, the processing core can send write data to memory 3000 via the second bus 6200. As a result, memory 3000 can perform write operations even when outputting read data, thereby eliminating or at least reducing the waiting time for write operations.
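
The difference between the single-bus and split-bus arrangements can be approximated with an idealized timing model. This is a rough sketch under simplifying assumptions: request and response overhead is ignored, transfers are measured in abstract cycles, and the function `transfer_cycles` is hypothetical.

```python
def transfer_cycles(read_cycles: int, write_cycles: int, split_buses: bool) -> int:
    """Idealized cycles to finish one read transfer and one write transfer."""
    if split_buses:
        # Separate read and write buses: the transfers overlap.
        return max(read_cycles, write_cycles)
    # Single shared bus: one transfer must wait for the other.
    return read_cycles + write_cycles


print(transfer_cycles(400, 400, split_buses=False))  # single bus
print(transfer_cycles(400, 400, split_buses=True))   # separate read and write buses
```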

[0180] Although hardware resources increase with the separate provision of the first bus 6100 and the second bus 6200, the increase is not significant compared to the hardware resources occupied by the processing core that performs AI algorithm calculations.

[0181] Examples of this disclosure will be described in detail below with reference to Figures 8 to 16, and another example of this disclosure will be described in detail with reference to Figures 17 to 19. The examples of this disclosure will be explained as first through fourth examples.

[0182] Figures 8A and 8B are diagrams illustrating an example operation for reducing latency in processing a tensor in the event of bus congestion in a system used to control a processing core, according to a first example of this disclosure. Figures 8A and 8B may also be referenced for other examples of this disclosure.

[0183] Referring to Figures 8A and 8B, when tensor n of a neural network model is processed in the processing core, it is preferable to prefetch the data required for the subsequent tensor n+1 operations via DMA (the use of DMA to transfer tensor n+1 is referred to as "DMA n+1" below). T_n refers to the time it takes for the processing core to perform calculations on tensor n. T_d refers to the time spent by DMA transferring tensor n+1 to the internal memory of the processing core. DMA transfer of tensor n+1 can be performed, for example, as a prefetch operation. If the DMA prefetch for tensor n+1 is completed before the processing core completes the computation of tensor n, the processing core can process the computation of tensor n+1 without data starvation. Conversely, if the DMA prefetch for tensor n+1 is not completed when the operation on tensor n is completed, the computation of tensor n+1 may be delayed.

[0184] As shown in case 1 of Figure 8A, when the prefetch time T_d of DMA n+1 is shorter than the time T_n required for the processing core to compute tensor n, computational operations on tensor n+1 can begin without delay. That is, while the processing core computes tensor n during T_n, DMA 200 can transfer the parameters required for the computation operation using tensor n+1 to internal memory 300 via bus 6000. However, in case 2, if the DMA n+1 time increases to T′_d, the computation operation using tensor n+1 begins only after the tensor n operation terminates and a further waiting time T_w elapses.
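
The two cases can be expressed as a simple relation between the compute time and the prefetch time. The sketch below assumes that the DMA prefetch of tensor n+1 starts at the same moment as the computation of tensor n, which the description implies but does not state as a formula; the function `stall_time` is a hypothetical name.

```python
def stall_time(t_n: float, t_d: float) -> float:
    """Waiting time T_w before tensor n+1 can start.

    t_n: time to compute tensor n; t_d: time for DMA n+1 to finish.
    Assumes both start together, so the core waits only for the overshoot.
    """
    return max(0, t_d - t_n)


print(stall_time(t_n=100, t_d=80))   # case 1: prefetch finishes first, no wait
print(stall_time(t_n=100, t_d=130))  # case 2: congested DMA, core waits
```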

[0185] In another example of this disclosure, referring to case 1 of Figure 8A, in which no bus congestion occurs, tensor n is computed within time T_n, the DMA n+1 prefetch is completed within time T_d, and the tensor n+1 operations then begin without delay. In other words, during T_n, operations on tensor n are performed in processing core 1000-1, and while processing core 1000-1 is operating on tensor n, DMA 200 can, within T_d, write data from the operation of tensor n-1 to memory 3000 via the first bus 6100 and store parameters required for the operation of tensor n+1 in internal memory 300 via the second bus 6200. However, in case 2 of Figure 8A, if bus congestion occurs during DMA n+1, the DMA n+1 time increases to T′_d, and the computation of tensor n+1 begins only after a waiting time T_w has elapsed following the completion of tensor n.

[0186] Therefore, the system according to the examples of this disclosure can adjust the order and / or timing of operations performed on each tensor. As shown in Figure 8B, the QoS of the DMA is adjusted to reduce T_w, i.e., the latency (the time during which the computational circuitry is not running), even when bus congestion occurs in the DMA. The QoS control signals for the DMA can be represented by parameters such as (i) a parameter indicating the priority or urgency of memory requests, (ii) parameters associated with bus bandwidth (e.g., guaranteed bandwidth, maximum bandwidth, percentage of total bandwidth), (iii) parameters indicating the allowable latency of data transfer, (iv) buffer parameters, (v) jitter parameters, and (vi) packet loss parameters. As described below, one or more of these QoS parameters can be controlled on a tensor basis so that the processing core can prefetch tensors for computational operations in a timely and efficient manner. The QoS parameters can be generated, adjusted, and / or controlled by the first controller 1100. The first controller 1100 can deliver the QoS control signals to the DMA 200 of a specific neural processing unit. In some examples, when the DMA 200 includes circuitry for controlling QoS, the DMA 200 can generate the QoS control signals instead of receiving them from the first controller 1100. The QoS control signals can be referred to as sideband signals.

[0187] The time T_n for performing the tensor n computation can be determined at compile time or monitored in real time; that is, T_n can be determined statically and / or dynamically by the first controller 1100. Furthermore, the amount of DMA n+1 data to be transferred within that time period is determined. However, unlike T_n, the operation time T_d of DMA n+1 may be difficult to calculate or predict, because the bandwidth that can be allocated to DMA varies depending on one or more conditions of bus 6000, the first bus 6100, or the second bus 6200. Furthermore, bus 6000 of the system of the first example of this disclosure, or the first bus 6100 and the second bus 6200 of the system of another example of this disclosure, can be allocated in real time for transferring data between various other circuits (e.g., CPU, PCIe), and not only between memory and one or more processing cores. Therefore, in practice T_d can exceed its theoretical value depending on the actual state of the bus.

[0188] The interval during which the processing core computes tensor n, i.e., the clock cycles for computing tensor n, is called T_n. On the other hand, the interval during which the data needed by the processing core to compute tensor n+1 is transferred via DMA, i.e., the memory cycle of DMA n+1, is called T_d. However, T_d can increase variably depending on the situation on bus 6000, the first bus 6100, or the second bus 6200.

[0189] Comparing the periods T_n and T_d, if T_n is much greater than T_d, as in case 1 of Figure 8B, the prefetch of DMA n+1 completes well before the computation of tensor n. In this case, the system can determine that, even considering various dynamic situations on bus 6000, the first bus 6100, and the second bus 6200, T_d is highly likely to remain less than T_n, because the DMA has sufficient time margin.

[0190] On the other hand, if T_n is not significantly greater than T_d, as in case 2 of Figure 8B, the prefetch of DMA n+1 completes only slightly before the computation of tensor n. Therefore, if bus congestion leaves insufficient time for the data transfer using DMA, the probability that T_d becomes greater than T_n increases.

[0191] In other words, when the value of T_d / T_n is equal to or greater than 1, the system according to the example can prioritize the DMA n+1 sent via the bus based on the ratio of T_n to T_d. That is, the system according to the example of this disclosure can be configured to preferentially process the DMA n+1 transmitted on the bus based on the ratio of T_n to T_d, i.e., based on the value of T_d / T_n and / or a preset threshold.

[0192] Furthermore, the system according to the example of this disclosure can increase the priority of the DMA n+1 sent through the bus based on the ratio of T_n to T_d and the bus congestion level, where the value of T_d / T_n is compared with a predetermined threshold. The level of congestion in the bus can be determined based on the bandwidth sharing of various additional circuits connected to the bus. The higher the level of bus congestion, the greater the likelihood that T_d increases.

[0193] The threshold for T_d / T_n can be set to, for example, 0.9. Therefore, when T_d / T_n ≥ 0.9, the system can be configured to determine that the transfer of tensor n+1 may be delayed by bus congestion, and the bus transfer priority of DMA n+1 is increased to prevent delayed transfer of tensor n+1. However, the above threshold can be appropriately determined according to the degree of bus congestion, and this disclosure is not limited thereto.

[0194] Alternatively or additionally, a threshold for the inverted ratio T_n / T_d can be set to, for example, 1.1. Therefore, when T_n / T_d ≤ 1.1, the system can be configured to determine that the transfer of tensor n+1 may be delayed by bus congestion, and the bus transfer priority of DMA n+1 is increased to prevent delayed transfer of tensor n+1. However, the above threshold can be appropriately determined according to the degree of bus congestion, and this disclosure is not limited thereto.
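
The two threshold tests described above can be compared side by side. The sketch below uses the example thresholds 0.9 and 1.1 from the description; the function names are hypothetical. Note that the two example thresholds are close but not exact reciprocals (1 / 0.9 ≈ 1.111), so the tests can disagree for values near the boundary.

```python
def should_boost_by_ratio(t_n: float, t_d: float, threshold: float = 0.9) -> bool:
    """Raise the priority of DMA n+1 when T_d / T_n reaches the threshold."""
    if t_n <= 0 or t_d <= 0:
        raise ValueError("t_n and t_d must be positive")
    return t_d / t_n >= threshold


def should_boost_by_inverse(t_n: float, t_d: float, threshold: float = 1.1) -> bool:
    """Same decision expressed with the inverted ratio T_n / T_d."""
    if t_n <= 0 or t_d <= 0:
        raise ValueError("t_n and t_d must be positive")
    return t_n / t_d <= threshold


for t_d in (50, 90.5, 95):
    print(t_d, should_boost_by_ratio(100, t_d), should_boost_by_inverse(100, t_d))
```

For T_n = 100, both tests agree at T_d = 50 and T_d = 95, while T_d = 90.5 falls in the narrow band where only the first test triggers.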

[0195] As mentioned above, using T_d / T_n to adjust the DMA's QoS parameters so that bus bandwidth is prioritized for transferring tensor n+1 makes effective use of the bus for DMA and reduces the latency T_w associated with processing tensor n+1. For example, if the value of T_d / T_n is less than a preset threshold, the QoS parameter can be reduced, thereby lowering the priority associated with accessing the bus. If the value of T_d / T_n is not less than the preset threshold, the QoS parameter can be increased, thereby increasing the priority associated with accessing the bus.
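
The adjustment rule in this paragraph can be sketched as a small update function. Everything beyond the rule itself is an assumption made for illustration: the function `adjust_qos`, the step size of 1, and the clamping range of 0 to 15 are hypothetical and are not specified in the disclosure.

```python
def adjust_qos(qos: int, t_n: float, t_d: float, threshold: float = 0.9,
               step: int = 1, qos_min: int = 0, qos_max: int = 15) -> int:
    """Lower the QoS value when T_d / T_n is below the threshold, otherwise raise it."""
    if t_n <= 0:
        raise ValueError("t_n must be positive")
    if t_d / t_n < threshold:
        qos -= step  # enough time margin: lower the bus access priority
    else:
        qos += step  # prefetch at risk of being late: raise the priority
    # Keep the value inside the assumed valid range.
    return max(qos_min, min(qos_max, qos))


print(adjust_qos(8, t_n=100, t_d=50))  # ample margin
print(adjust_qos(8, t_n=100, t_d=95))  # tight margin
```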

[0196] Therefore, the control system can improve the performance of each processing core by increasing the priority of the DMA n+1 tensor, which may experience delays in transmission to the neural processing unit due to factors such as bus congestion.

[0197] QoS mechanisms can be used to prioritize traffic on the bus, manage bandwidth allocation, and reduce latency, jitter, and packet loss to improve overall system performance. QoS parameters can be controlled or adjusted to achieve, in particular, the following objectives: Bandwidth allocation: Control the bus to ensure that each processing core has sufficient bus bandwidth to receive and transmit data through the bus for its operation.

[0198] Priority Level: Each tensor can be assigned a specific priority level. Based on the priority, bus bandwidth can be allocated differently and / or the order of data requests in the sequential queue on the bus can be adjusted. The bus may contain additional sequential queue memory.

[0199] Traffic shaping: It can control data flow to improve or ensure performance, reduce latency, and ensure bandwidth.

[0200] Resource reservation: High-priority circuit units (such as neural processing units) can be reserved to maintain bus performance.

[0201] Figure 9 is a flowchart illustrating a method of controlling the processing core according to a first example of this disclosure. This control method can be executed by a first controller 1100 that controls the neural processing unit 1000. Referring to Figure 9, the first controller 1100 can determine at least one data starvation period of the neural processing unit 1000 based on the memory access operations of the neural processing unit 1000 to the memory 3000 for each tensor and the computation operations on the data.

[0202] In this configuration, computational operations and memory access operations for each tensor can be performed within a given bus bandwidth so that each neural processing unit 1000 can communicate with the memory 3000.

[0203] To determine data starvation periods, the first controller 1100 can compare the computation cycle and memory cycle of each tensor in the neural processing unit 1000. For this purpose, the first controller 1100 can receive or monitor computation cycle and memory cycle information for each tensor.

[0204] Specifically, the first controller 1100 can compare the first processing time (i.e., computation cycle) required to complete a computational operation on a specific tensor with the second processing time (i.e., memory cycle) required to complete a memory access operation on the next tensor following the specific tensor, and identify the difference between the first and second processing times as a data starvation period. The first and second processing times are unique characteristics of the tensor determined based on the size of the tensor parameters and the complexity of the computational algorithm in the neural network model. Therefore, the first and second processing times can be pre-analyzed during the compilation phase of the neural network model.
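
The comparison described above can be sketched as follows, assuming the computation cycles and memory cycles of each tensor are already known from compilation. The class `TensorCycles`, the function `starvation_periods`, and the cycle counts in the usage note are hypothetical and serve only as an illustration.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class TensorCycles:
    name: str
    compute_cycles: int  # first processing time of this tensor
    memory_cycles: int   # second processing time needed to load this tensor


def starvation_periods(tensors: List[TensorCycles]) -> Dict[str, int]:
    """Estimate the data starvation period in front of each tensor.

    The compute time of tensor k is compared with the memory time of
    tensor k+1. A positive difference is reported as the starvation
    period of tensor k+1; otherwise the period is zero.
    """
    result: Dict[str, int] = {}
    for current, nxt in zip(tensors, tensors[1:]):
        gap = nxt.memory_cycles - current.compute_cycles
        result[nxt.name] = max(gap, 0)
    return result
```

With the hypothetical sequence `[TensorCycles("n", 100, 50), TensorCycles("n+1", 80, 120), TensorCycles("n+2", 60, 70)]`, the load of tensor n+1 takes 20 cycles longer than the computation on tensor n, so a starvation period is reported for n+1 and none for n+2.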

[0205] Next, the first controller 1100 controls the priority of memory access operations performed by the neural processing unit 1000 for each tensor in S120, so that data starvation periods do not occur or are reduced. The first controller 1100 can be configured to determine the priority of memory access operations performed by the neural processing unit 1000 for tensors in which at least one data starvation period occurs. The first controller 1100 can be configured to control the bus based on the determined priority processing conditions.

[0206] When the first controller 1100 determines that a neural processing unit performing tensor computation operations may suffer at least one data starvation period, the first controller 1100 may adjust QoS parameters to eliminate or reduce the identified data starvation period. That is, if the second processing time (i.e., memory cycle) for completing a memory access operation is relatively longer than the first processing time (i.e., computation cycle), resulting in a data starvation period, the first controller 1100 may give high priority to memory access operations of tensors to be read during the second processing time. Conversely, if the first processing time for completing a computation operation is sufficiently long relative to the second processing time, the first controller 1100 may be configured to yield bus bandwidth by giving low priority to the corresponding memory access operations.

[0207] If the first processing time is longer than the second processing time, the first controller 1100 may assign a lower priority to the neural processing unit requesting a memory access operation with a relatively low probability of data starvation, so that the bus prioritizes the allocation of bus bandwidth to circuits other than the neural processing unit (e.g., other neural processing units, other processing cores, CPU, decoder, image sensor, etc.).

[0208] In one aspect, if the second processing time is longer than the first processing time, the first controller 1100 may increase the priority of neural processing units requesting memory access operations to reduce or eliminate data starvation periods of neural processing units, and the bus may process memory access request operations of neural processing units first to further utilize additional available bus bandwidth.

[0209] In some examples, the first controller 1100 may grant a relatively higher bandwidth on the bus to a particular neural processing unit than to others, based on a first processing time and a second processing time requested by each of the plurality of neural processing units for each tensor. The bandwidth allocation on the bus can be dynamically adjusted to reduce the data starvation period associated with each tensor. Therefore, the data starvation period of the plurality of neural processing units included in system 10000 can be reduced or eliminated.

[0210] In other words, even if the second processing time to complete the memory access operation of a specific neural processing unit increases slightly, as long as the first processing time to complete the computation operation is long enough, that neural processing unit can still give up at least a portion of its bus bandwidth to other neural processing units (i.e., neural processing units whose memory access operations overlap with its own).

[0211] On the other hand, if the second processing time for completing a memory access operation of a specific neural processing unit is long enough compared to the first processing time for completing a computational operation, more memory access opportunities can be obtained by acquiring bus bandwidth from one or more other neural processing units (where memory access operations overlap with those of the specific neural processing unit in time), thereby reducing the time when the computational circuitry of the specific neural processing unit is not running, and thus completing the memory access operation faster.

[0212] Figure 10 is a diagram illustrating an exemplary method for determining priority in the method of controlling a processing core according to a first example of this disclosure. C(n) represents a first processing time, which is a computation cycle for completing a computational operation on a specific tensor, and D(n+1) represents a second processing time, which is a memory cycle for completing a memory access operation on the next tensor after that specific tensor. When the second processing time is longer than the first processing time, i.e., when the data starvation level (e.g., D(n+1) / C(n)) is greater than a first threshold Th1 (e.g., Th1 is set to 1), the first controller 1100 can assign a higher priority to the memory access request corresponding to D(n+1). Therefore, by obtaining bus bandwidth from another neural processing unit, the D(n+1) operation of the specific neural processing unit can be accelerated. Thus, the total time for processing data is reduced because the time during which the computing circuit is not running (i.e., the data starvation period) is reduced.

[0213] Furthermore, if the second processing time is less than the first processing time, i.e., if the data starvation level is less than a second threshold Th2 (e.g., Th2 is set to 1), the first controller 1100 can assign a lower priority to the memory access request corresponding to D(n+1) because the first processing time is longer. Therefore, at least a portion of the bandwidth allocated to the D(n+1) operation can be given to one or more other neural processing units. As a result, the data starvation period of the one or more other neural processing units is reduced or eliminated, thereby reducing the total time spent by all neural processing units processing data.

[0214] On the other hand, when the first processing time and the second processing time are equal, that is, when the data starvation level is equal to the third threshold (e.g., 1), the first controller 1100 can assign normal priority to the corresponding memory access request, because this corresponds to the case where there is no data starvation period.

[0215] In other words, the example system can calculate the data starvation level for a specific tensor and determine that the tensor has high priority by comparing the data starvation level with a first threshold. Furthermore, the system can determine that the tensor has low priority by comparing the data starvation level with a second threshold. Additionally, the system can maintain the tensor's priority when the data starvation level is equal to a third threshold. The first and second thresholds can be equal. Furthermore, the second and third thresholds can be equal.

[0216] In some examples, the first threshold can be greater than the third threshold. The second threshold can be less than the third threshold. The third threshold can be a range between the first and second thresholds. Specifically, for example, the first threshold can be 1. When the data starvation level is 1, the corresponding tensor theoretically has no data starvation, but because of the various overheads and bandwidth contention that may occur on the bus, data starvation is considered likely to occur, even if only temporarily, and therefore the priority should be increased. The second threshold can be 0.8. If the data starvation level is 0.8, even if various overheads and bandwidth contention occur on the bus, the corresponding tensor is probabilistically not data-starved, and there is enough bus bandwidth to yield, so the priority can be decreased. When the data starvation level is between 0.8 and 1, even considering the various overheads and bandwidth contention on the bus, data starvation can be considered unlikely to occur, but there may not be enough bus bandwidth to yield. In other words, the system can be configured to calculate the data starvation level for each tensor, increase the priority of the tensor based on the first threshold, decrease the priority of the tensor based on the second threshold, which is different from the first threshold, and maintain the priority of the tensor based on a value between the first and second thresholds (i.e., the third threshold).
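
The three-way decision described in the preceding paragraphs can be sketched as follows. This is a minimal illustration, assuming the example thresholds Th1 = 1 and Th2 = 0.8 and treating a level equal to a threshold as having reached it, as in the example above. The names `Priority` and `classify` are hypothetical.

```python
from enum import Enum


class Priority(Enum):
    LOW = 0
    NORMAL = 1
    HIGH = 2


def classify(d_next: float, c_cur: float,
             th1: float = 1.0, th2: float = 0.8) -> Priority:
    """Map the level D(n+1) / C(n) to a priority.

    d_next: memory cycle D(n+1) of the next tensor.
    c_cur:  computation cycle C(n) of the current tensor.
    th1, th2: example thresholds from the text (1 and 0.8).
    """
    if c_cur <= 0:
        raise ValueError("c_cur must be positive")
    level = d_next / c_cur
    if level >= th1:
        return Priority.HIGH    # starvation likely: raise priority
    if level <= th2:
        return Priority.LOW     # enough margin: yield bus bandwidth
    return Priority.NORMAL      # between the thresholds: keep priority
```

Setting `th1` and `th2` to the same value reproduces the simpler case in which the first and second thresholds are equal.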

[0217] Figure 11 is a diagram illustrating how, according to a first example of this disclosure, data processing speed can be improved by assigning priorities so as to reduce data starvation periods. Figure 11 illustrates the memory cycle (MEM) and computation cycle (COMP) for each tensor processed by each neural processing unit (NPU). Each NPU can store the necessary parameters in its internal memory within the corresponding memory cycle for each tensor, and then use the parameters stored in the internal memory to process the neural network model's operations within the corresponding computation cycle. In other words, for an NPU to process a tensor, the NPU's DMA first transfers the tensor to the NPU's internal memory via the command bus within the memory cycle by sending a memory operation request, and then, within the computation cycle, the NPU's processing element performs computations using the tensor stored in the internal memory.

[0218] Referring to Figure 11, NPU0 refers to a neural processing unit. NPU0 can contain a single processing core or multiple processing cores. For example, NPU0 can correspond to the neural processing unit 1000 of Figure 2 or the processing core 1000-1 of Figure 2. NPU1 refers to another neural processing unit. For example, NPU1 can correspond to the processing core 1000-n of Figure 2.

[0219] Referring to (a) of Figure 11, the first processing time (i.e., computation cycle) of the operation on the data in the first tensor n+1 processed by NPU0 is shorter than the second processing time (i.e., memory cycle) of the memory access operation on the data in the second tensor n+2. Therefore, a data starvation period occurs between the computation cycles of the first tensor n+1 and the second tensor n+2 of NPU0 until the memory cycle of the second tensor n+2 of NPU0 is completed. To reduce it, the bus bandwidth of the memory cycle of the second tensor m+2 of NPU1, which at least partially overlaps with the memory cycle of the second tensor n+2 of NPU0 on the time axis, can be utilized. That is, since the memory cycle of the second tensor m+2 of NPU1 is completed before the computation cycle of the first tensor m+1 of NPU1 is completed, there is a bandwidth margin in the bus bandwidth of the memory cycle of the second tensor m+2 of NPU1 before the computation cycle of the second tensor m+2 of NPU1 begins. Therefore, at least a portion of the bus bandwidth allocated to the memory cycle of the second tensor m+2 of NPU1 can be relinquished to NPU0 while causing essentially no data starvation between the computation cycles of the first tensor m+1 and the second tensor m+2 of NPU1.

[0220] In other words, according to an example system, the memory cycles and computation cycles of consecutively processed tensors can be compared to determine one or more data starvation periods or one or more bandwidth reservation periods.

[0221] In other words, the system according to the first example can determine data starvation periods between consecutive tensors, each of which is processed in a first neural processing unit among multiple neural processing units. Furthermore, the system according to the first example of this disclosure can be configured to identify, among tensors processed in a second neural processing unit among multiple neural processing units, a tensor whose reserved bus transfer bandwidth can be at least partially relinquished to another tensor whose extended transfer on the bus may have caused or has caused a data starvation period. Therefore, the system according to one example can reallocate bus bandwidth previously allocated to a neural processing unit with sufficient bus bandwidth to another neural processing unit experiencing or potentially experiencing a data starvation period.

[0222] For example, as shown in (b), by assigning high priority to memory access operations of NPU0's second tensor n+2 and low priority to memory access operations of NPU1's second tensor m+2, at least a portion of the bus bandwidth allocated to NPU1 can be reallocated to NPU0, thereby reducing the data starvation period of NPU0. At this point, NPU1 may essentially not experience data starvation in its second tensor m+2 because NPU1 is in a bandwidth reservation period.

[0223] On the other hand, as shown in (a), the memory cycle of NPU1's third tensor m+3 is longer than the computation cycle of NPU1's second tensor m+2. Therefore, a data starvation period occurs between the computation cycles of NPU1's second tensor m+2 and NPU1's third tensor m+3 until the memory cycle of NPU1's third tensor m+3 is completed. Since the computation cycle of NPU0's second tensor n+2 is longer than the memory cycle of NPU0's third tensor n+3, there is bandwidth margin within the memory cycle of NPU0's third tensor n+3. Therefore, at least a portion of the bus bandwidth allocated to the memory cycle of NPU0's third tensor n+3 can be reallocated to facilitate the transmission of NPU1's third tensor m+3 via the bus.

[0224] Therefore, the controller of the system in the first example (e.g., the first controller) can adjust the priority of memory access operations of the third tensor n+3 of NPU0 and the third tensor m+3 of NPU1, respectively. Thus, as shown in (b), by assigning a high priority to the memory access operations of the third tensor m+3 of NPU1 and a low priority to the third tensor n+3 of NPU0, at least a portion of the bus bandwidth of NPU0 can be reallocated to NPU1 to reduce the data starvation period of the third tensor m+3 of NPU1.

[0225] In summary, the system according to the first example can be configured to determine the data starvation period for each tensor of a neural processing unit, determine the bus bandwidth reservation period for tensors of another neural processing unit that overlap with the data starvation period along the time axis, and reduce the data starvation period by differentiating the priorities of the tensors with data starvation periods and the tensors with bandwidth reservation periods. Referring to (a) and (b) of Figure 11, when the memory cycles of tensors with data starvation periods and tensors with bandwidth reservation periods overlap at least in part, the processing time of NPU0 and NPU1 can be reduced by adjusting the priority of the memory access operations for each tensor.
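
The pairing of a tensor with a data starvation period on one NPU and a tensor with bandwidth margin on another NPU can be sketched as follows. This is an illustration only: the function `pair_priorities`, the string labels, and the cycle counts in the usage note are hypothetical, and the sketch assumes the two memory cycles are already known to overlap in time.

```python
from typing import Tuple


def pair_priorities(c0: int, d0: int, c1: int, d1: int) -> Tuple[str, str]:
    """Assign priorities to two overlapping memory cycles.

    c0, d0: computation cycle of the current tensor and memory cycle
            of the next tensor on NPU0.
    c1, d1: the same quantities on NPU1.
    Returns (priority for NPU0, priority for NPU1).
    """
    starved0, margin0 = d0 > c0, d0 < c0
    starved1, margin1 = d1 > c1, d1 < c1
    if starved0 and margin1:
        return ("high", "low")   # NPU1 yields bus bandwidth to NPU0
    if starved1 and margin0:
        return ("low", "high")   # NPU0 yields bus bandwidth to NPU1
    return ("normal", "normal")  # no complementary pair: leave unchanged
```

With hypothetical cycle counts, `pair_priorities(60, 100, 120, 70)` corresponds to the situation of tensors n+2 and m+2 in (a) of Figure 11, and `pair_priorities(120, 70, 60, 100)` to that of tensors n+3 and m+3.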

[0226] In the following, a second and a third example of the present disclosure will be described, wherein an operation to check the state information of each neural processing unit in real time is performed without comparing the computation cycle and memory cycle of each tensor of the neural processing unit 1000, thereby giving each tensor an appropriate priority to reduce data starvation.

[0227] Figure 12 is a diagram illustrating an example of a method for determining priority in a method for controlling processing cores according to a second example of this disclosure. A first controller 1100 can receive status information from a second controller 100 for each processing core 1000-1, ..., 1000-n, and can determine, based on the received status information, whether each processing core 1000-1, ..., 1000-n is in a busy state in step S210. The busy state indicates whether the processing core is in use or is currently processing. The status information can be updated in real time.

[0228] Next, in step S210, the priority of each processing core 1000-1, …, 1000-n can be determined. Processing cores that are not in a busy state can be given high priority (S221), while processing cores that are in a busy state can be given low priority (S222). A busy state for a processing core indicates that it is performing a computational operation on a specific tensor; therefore, it is given low priority because the memory access operation for the next tensor does not need to be executed quickly. A non-busy state for a processing core indicates a data-starved state, where the computational operation for the next tensor has not been executed; therefore, it is given high priority because the memory access operation for the next tensor needs to be executed quickly.

[0229] The sequential queues of bus 6000 can be reordered according to the adjusted priority. The reordered sequential queues can be stored in a sequential queue memory or reordered. According to a second example of this disclosure, the first controller 1100 can determine the busy state of each NPU and reorder the sequential queues of bus 6000 according to priority. However, this disclosure is not limited thereto; the sequential queues on bus 6000 can also be configured to be reordered by at least one of the second controller 100, CPU 2000, DMA 200, or bus 6000.
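
The reordering of the sequential queue by busy state can be sketched as follows. This is a minimal illustration, not the disclosed circuit: the `Request` record and the `reorder_queue` function are hypothetical, and a stable sort is assumed so that requests with the same priority keep their original order.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Request:
    core_id: int   # processing core that issued the memory access request
    tensor: str    # label of the tensor to be transferred


def reorder_queue(queue: List[Request], busy: Dict[int, bool]) -> List[Request]:
    """Move requests from non-busy (data-starved) cores to the front.

    busy maps core_id to True when the core is still computing.
    False sorts before True, and Python's sort is stable, so the
    original order is preserved within each priority class.
    """
    return sorted(queue, key=lambda r: busy[r.core_id])
```

For a queue holding requests from cores 0, 1, and 2 in that order, where only core 1 is not busy, the request from core 1 moves to the front and the other two keep their relative order.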

[0230] Figure 13 is a diagram illustrating an example of a data starvation signal generated during the runtime of the processing core according to the second example of this disclosure. Figure 13 is similar to Figure 11; for the sake of brevity, repeated explanations are omitted here.

[0231] First, the second controller 100 for each processing core 1000-1, …, 1000-n can send its status information in the form of data starvation signals 11 and 12. A data starvation signal is a signal indicating that the processing element PE is in an idle state during the runtime of the processing core. A data starvation signal indicates that the computational operation of the processing core has stopped. The status information for each processing core 1000-1, …, 1000-n can be generated by checking the status of the components controlled by each second controller 100. Each data starvation signal can be an independent signal. As will be further described below, signals indicating idle state information can be generated by the second controller 100.

[0232] Referring to (a) of Figure 13, while the computation operation COMP of tensor n is being executed, NPU0 is in a busy state and disables the first data starvation signal 11. For example, the disable signal can be a first-level signal, while the activation signal can be a second-level signal. Alternatively, the disable signal can be a second-level signal, while the activation signal can be a first-level signal. The first level can be represented by a low-level voltage signal, while the second level can be represented by a high-level voltage signal with a voltage higher than the low-level voltage signal. Furthermore, NPU0 can disable the first data starvation signal 11 in a busy state while processing the computation operation COMP of the first tensor n+1. Between tensor n and the first tensor n+1 of NPU0, no data starvation period caused by the memory access operation MEM occurs. NPU0 activates the first data starvation signal 11 in a data starvation state, in which the computation operation COMP of the first tensor n+1 is completed, but the computation operation COMP of the second tensor n+2 has not yet been executed.

[0233] Therefore, as shown in (b) of Figure 13, the first controller 1100 can be configured to give higher priority to the processing core that has enabled the first data starvation signal 11. Therefore, the bus bandwidth of the memory access operation MEM of the high-priority tensor n+2 can be increased. Thus, compared to (a) of Figure 13, the data starvation period shown in (b) can be reduced.

[0234] Furthermore, in some examples, when the first data starvation signal 11 is activated while the second data starvation signal 12 of another processing core is deactivated, the first controller 1100 can assign a lower priority to the other processing core. In this case, the first data starvation period shown in (b) of Figure 13 can be further reduced.

[0235] Meanwhile, as shown in (a) of Figure 13, NPU1 can disable the second data starvation signal 12 in a busy state while processing the computation operation COMP of tensor m. Subsequently, NPU1 can also disable the second data starvation signal 12 in a busy state while executing the computation operation COMP of the first tensor m+1. Therefore, there is no data starvation period caused by the memory access operation MEM between tensor m and the first tensor m+1 of NPU1.

[0236] In a busy state while performing the computation operation COMP of the second tensor m+2, NPU1 can disable the second data starvation signal 12. NPU1 can also activate the second data starvation signal 12 in a data starvation state, in which the computation operation COMP of the second tensor m+2 has been completed, but the computation operation COMP of the third tensor m+3 has not yet been executed.

[0237] Therefore, as shown in (b) of Figure 13, the first controller 1100 can be configured to give higher priority to the processing core that enables the second data starvation signal 12. This can increase the bus bandwidth of the memory access operation MEM of the high-priority tensor m+3. Thus, compared to (a) of Figure 13, the data starvation period shown in (b) can be reduced.

[0238] Furthermore, in some examples, when the second data starvation signal 12 is enabled while the first data starvation signal 11 of another processing core is disabled, the first controller 1100 can be configured to give lower priority to the other processing core. In this case, the second data starvation period shown in (b) of Figure 13 can be further reduced.

[0239] As described above, the first controller 1100 can be configured to perform real-time priority processing by checking the busy status in real time based on data starvation signals 11 and 12 received from each processing core 1000-1, ..., 1000-n, rather than comparing the cycles of each tensor processed by the multiple processing cores 1000-1, ..., 1000-n. Furthermore, by assigning the priority of each processing core 1000-1, ..., 1000-n based on whether it is busy, each processing core can yield bus bandwidth, thereby shortening the data starvation period and enabling the computing circuitry to operate quickly.
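
The signal-driven prioritization of the second example can be sketched as follows. This is an illustration under assumptions: the function `priorities_from_signals` and its string labels are hypothetical, and the lowering of cores whose signal is inactive follows the optional behavior described for some examples above.

```python
from typing import Dict


def priorities_from_signals(signals: Dict[str, bool]) -> Dict[str, str]:
    """Derive per-core priorities from data starvation signals.

    signals maps a core name to True when its data starvation
    signal is active (the core is idle, waiting for data).
    """
    any_active = any(signals.values())
    priorities: Dict[str, str] = {}
    for core, active in signals.items():
        if active:
            priorities[core] = "high"    # starved core gets bus bandwidth first
        elif any_active:
            priorities[core] = "low"     # busy core yields to a starved core
        else:
            priorities[core] = "normal"  # nobody is starved: no change
    return priorities
```

When signal 11 of NPU0 is active and signal 12 of NPU1 is inactive, this sketch returns a high priority for NPU0 and a low priority for NPU1.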

[0240] Figure 14 is a diagram illustrating an example method, according to a third example of this disclosure, for determining priorities to reduce delays identified by the count values of a counter (e.g., a counter circuit). For brevity, explanations of Figure 14 that repeat those of Figure 12 and Figure 13 are omitted here.

[0241] According to the third example, a counter can be provided at a specific location within system 10000. The counter may be contained in one of the first controller 1100, neural processing unit 1000, bus 6000, and CPU 2000, and this disclosure is not limited to the location of the counter. In Figure 2, counter 110 is shown as being included in the first controller 1100. Referring to Figure 14, each processing core 1000-1, …, 1000-n may include a counter and may be configured to perform counting when a memory access operation is performed, based on a counter threshold for the memory access operation. In this case, the number of clock cycles in each memory access operation MEM can be pre-calculated based on the tensor size, and the maximum counter value can be the sum of a pre-calculated number of clock cycles (e.g., 0 to t clock cycles) determined during compilation based on the tensor size, plus a certain number of clock cycles (e.g., 0 to 100 clock cycles).

[0242] On the other hand, a counter threshold for memory access operations can be preset, and each processing core 1000-1, …, 1000-n can count based on this counter threshold. The counter value is incremented while a memory access operation is being performed.

[0243] When the counter value is lower than the counter threshold, each processing core 1000-1, …, 1000-n determines that no data starvation period has occurred and disables the data starvation signal for the remaining interval.

[0244] If the counter value of each processing core 1000-1, …, 1000-n exceeds a preset threshold, a data starvation period is determined to have occurred, and a data starvation signal is activated during the period A exceeding the preset maximum counter value. This activation can remain active until the memory access operation is completed. Therefore, the interval during which the data starvation signal is activated is given high priority by the first controller 1100. Based on the higher priority of the memory access operation due to the activation of the data starvation signal, the bus bandwidth of the memory access operation increases, which leads to a reduction in the data starvation period. When the memory access instruction completes, the counter can be reset, and the priority can be reduced again.

[0245] For example, as shown in Figure 14, when the counter threshold is set to t+100 clock cycles, each of the processing cores 1000-1, …, 1000-n can disable the data starvation signal during intervals when the counter value is below the counter threshold, and enable the data starvation signal during intervals when the counter value exceeds the counter threshold. The threshold can be appropriately determined by considering the characteristics of various communication networks; that is, when the counter threshold is exceeded, a bottleneck is determined to have occurred on the bus.
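
The counter-based signal of the third example can be sketched as a per-cycle trace. This is a simplified software model, not the counter circuit itself: the function `starvation_signal_trace` and its parameters are hypothetical, with `expected_cycles` standing for the pre-calculated t clock cycles and `margin` for the additional clock cycles (100 in the example above).

```python
from typing import List


def starvation_signal_trace(expected_cycles: int, actual_cycles: int,
                            margin: int = 100) -> List[bool]:
    """Model the data starvation signal during one memory access operation.

    The counter increments once per clock cycle while the memory
    access is in progress. The signal is active (True) only in
    cycles where the counter exceeds expected_cycles + margin.
    """
    threshold = expected_cycles + margin
    trace: List[bool] = []
    counter = 0
    for _ in range(actual_cycles):
        counter += 1
        trace.append(counter > threshold)
    # The counter would be reset here, once the memory access completes.
    return trace
```

In this model, a memory access that finishes within the threshold never activates the signal, while one that overruns it keeps the signal active from the first cycle past the threshold until the access completes.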

[0246] As described above, by providing a counter that counts memory access operations for each processing core 1000-1, …, 1000-n, and by allowing the first controller 1100 to determine the busy state of each processing core via a correspondingly activated data starvation signal, the first controller 1100 can dynamically adjust the priority of each processing core. Furthermore, by allowing each processing core 1000-1, …, 1000-n to yield bus bandwidth through prioritization based on whether it is in a busy state, the computing circuitry can operate with reduced data starvation periods.

[0247] The following describes a fourth example, which applies the first example and the second and/or third examples of this disclosure. Figure 15 is a diagram illustrating the prioritization process according to the fourth example of this disclosure. In step S310, the first controller 1100 can compare the clock cycles of the computational operation and the memory access operation for each tensor of the neural processing unit 1000 to identify at least one data starvation period. Information regarding the duration of each computational cycle and each memory cycle for each tensor of the neural network model can be included in the neural network model to be processed by the neural processing unit. In step S310, the first controller 1100 can be configured to compare a first processing time T1 (e.g., computational cycle) for completing a computational operation on a specific tensor with a second processing time T2 (e.g., memory cycle) for completing a memory access operation that reads the data required for the computational operation of the tensor following that specific tensor, and to determine that an interval in which the second processing time T2 is longer than the first processing time T1 is a potential data starvation period. If the first processing time is not greater than the second processing time, i.e., if it can be determined that the computational cycle will complete faster than the memory cycle, the system can be configured to determine that a data starvation period may occur. Therefore, the first processing time T1 and the second processing time T2 for each tensor can be compared. The first processing time T1 and the second processing time T2 are included in the neural network model and can be provided to the neural processing unit in advance. If the first processing time T1 is greater than the second processing time T2, the tensor is determined to be within the computationally constrained (CB) interval. If the first processing time T1 is shorter than the second processing time T2, the tensor is determined to be within the memory-constrained (MB) interval. The first and second processing time information can be included in the corresponding neural network model and are initial values determined based on the tensor size. The second processing time can change in real time according to the real-time bandwidth allocation of the bus.

[0248] Next, based on the data starvation signal generated by the second controller 100, it can be determined in real time whether the NPU is in a busy state. Specifically, for memory-constrained tensors, based on the data starvation signal received from the second controller 100 of each processing core 1000-1, …, 1000-n, the first controller 1100 can determine in S321 whether the corresponding processing core is in a busy state. If the computation cycle completes faster than the memory cycle, the start of the computation cycle for subsequent tensors may be delayed until the memory cycle is completed, i.e., a data starvation period may occur.

[0249] In step S310, the memory cycle of a tensor that is predicted to encounter a data starvation period, as determined based on the first processing time T1 and the second processing time T2, can be assigned either a default priority or a high priority.

[0250] If it is determined in step S321 that the processing core is in a busy state (S331), then although a data starvation period was predicted in step S310, the busy state of the NPU confirms that a data starvation period has not actually occurred, and the first controller 1100 can assign the default priority. Therefore, in step S331, the default priority can be retained on the basis that the processing core is still performing computational operations.

[0251] In step S321, if the corresponding processing core is not in a busy state S332, that is, if it is predicted in step S310 that a data starvation period may occur on the corresponding processing core, and it is confirmed in step S321 that the computation on the corresponding processing core has actually stopped, then the first controller 1100 can give the corresponding processing core a high priority. Therefore, the corresponding processing core can be regarded as being in a data starvation state DS, and can be configured to receive high bus bandwidth with a high priority setting.

[0252] On the other hand, for computationally constrained tensors, based on the data starvation signal received from the second controller 100 of each processing core 1000-1, …, 1000-n, the second controller 200 can determine whether the corresponding processing core is in a busy state S322. If the completion of the computation cycle is later than the memory cycle, the start of the next tensor computation cycle is unlikely to be delayed until the memory cycle is completed, i.e., a data starvation period is unlikely to occur.

[0253] In step S310, the priority of the memory cycle assigned to the tensor that is predicted not to experience a data starvation period, determined based on the first processing time and the second processing time, can be either low priority or high priority.

[0254] In step S322, if the corresponding processing core is in a busy state S333, meaning that the data starvation period is predicted to be unlikely in step S310, and it is confirmed that the NPU is in a busy state, then the first controller 1100 can assign a low priority. Therefore, in step S322, the first controller 1100 can determine that the data starvation period is highly unlikely based on the reason that the processing core is still computing, and can adjust the priority to a low priority.

[0255] If it is determined in step S322 that the processing core is not in a busy state, the first controller 1100 can assign it a high priority. In other words, step S310 predicted that a data starvation period was unlikely to occur in the processing core, but contrary to the prediction, step S322 confirms that the processing core has actually stopped working. Therefore, the corresponding processing core can be identified as being in a data starvation state DS and can be configured to receive high bus bandwidth through a high priority setting. That is, the computation cycle and memory cycle of each tensor can be compared to first determine the probability of a data starvation period occurring, and it can then be determined in real time whether a data starvation period has actually occurred. Furthermore, as the second processing time becomes longer relative to the first processing time, the probability of a data starvation period occurring may become higher. Therefore, when a data starvation period occurs, the system can allocate higher bus bandwidth with priority, and otherwise allocate bus bandwidth according to the default priority or the low priority based on the characteristics of the first and second processing times. The first and second processing times are inherent characteristics determined by the size of the tensor parameters of the neural network model and the complexity of the computation algorithm. Therefore, the first and second processing times can be pre-analyzed during the compilation phase of the neural network model. In other words, the fourth example of this disclosure can provide optimal bus bandwidth allocation by taking into account both pre-analyzed static computation scheduling information and the real-time bandwidth contention of the various data communications occupying the actual bus.
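The two-stage decision of Figure 15 (static prediction in S310, real-time confirmation in S321/S322) can be summarized in a short sketch. The function name `fourth_example_priority` and the single-letter return values are illustrative assumptions; the text above also allows other outcomes (for example, a high priority assigned directly in S310), which this sketch does not model.

```python
def fourth_example_priority(t1: int, t2: int, idle: bool) -> str:
    """Return 'H', 'D', or 'L' for one tensor's memory cycle.

    t1   -- pre-analyzed compute cycles (first processing time)
    t2   -- pre-analyzed memory cycles (second processing time)
    idle -- True when the core reports it is not busy (data starvation signal)
    """
    # S310: static prediction. T1 <= T2 means starvation is predicted.
    starvation_predicted = t1 <= t2

    # S321 / S322: real-time check. An idle core is actually starved,
    # so it gets high priority regardless of the prediction (S332 and [0255]).
    if idle:
        return "H"

    # The core is still computing: keep the default priority for
    # memory-constrained tensors (S331), lower it for
    # computation-constrained tensors (S333).
    return "D" if starvation_predicted else "L"


for t1, t2, idle in [(40, 90, False), (40, 90, True), (100, 60, False), (100, 60, True)]:
    print(t1, t2, idle, fourth_example_priority(t1, t2, idle))
```

The design point the sketch tries to capture is that the static comparison only selects between the default and low levels, while the high level is reserved for a starvation condition that is observed at run time.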

[0256] In other words, the fourth example could allow for finer-grained prioritization, thereby enabling the reallocation of bus bandwidth to allow for more efficient operation at the neural processing unit.

[0257] Figure 16 is a graph illustrating how data processing speed is improved by assigning priorities to reduce data starvation periods, according to the fourth example. Referring to Figure 16(a), if the control method according to the fourth example of this disclosure is not applied, data starvation periods may occur for some tensors of NPU0. Specifically, a first data starvation period DS1 may occur from the completion of the computation cycle COMP of the first tensor n+1 of NPU0 to the completion of the memory cycle MEM of the second tensor n+2. Then, a second data starvation period DS2 may occur from the completion of the computation cycle COMP of the second tensor n+2 of NPU0 to the completion of the memory cycle MEM of the third tensor n+3. That is, tensors with memory-constrained characteristics can have data starvation periods. On the other hand, as shown in Figure 16(a), even without applying the control method according to the fourth example of this disclosure, no data starvation occurs in NPU1. Specifically, the computation cycle of every tensor in NPU1 is longer than the corresponding memory cycle. In this case, data starvation may not occur on NPU1. That is, tensors with computation-constrained characteristics may not experience data starvation.
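The timing relationship behind DS1 and DS2 can be illustrated with a small timeline model. This is a simplified sketch under stated assumptions, not the timing of Figure 16: it assumes the memory cycle of the next tensor starts when the computation cycle of the current tensor starts, and all cycle counts and the name `simulate_core` are invented for illustration.

```python
from __future__ import annotations


def simulate_core(mem: list[int], comp: list[int]) -> tuple[int, list[int]]:
    """Return (total cycles, starvation cycles before each following tensor).

    mem[k]  -- memory cycles needed to load tensor k
    comp[k] -- compute cycles needed to process tensor k
    """
    t = mem[0]  # the first tensor must be loaded before any compute starts
    stalls = []
    for k in range(len(comp)):
        comp_start = t
        comp_end = comp_start + comp[k]
        if k + 1 < len(comp):
            # Assumed overlap: loading tensor k+1 begins with compute of tensor k.
            mem_end = comp_start + mem[k + 1]
            stall = max(0, mem_end - comp_end)  # starvation period, if any
            stalls.append(stall)
            t = comp_end + stall
        else:
            t = comp_end
    return t, stalls


# Without prioritization: the third load (50 cycles) outlasts the second compute (30).
print(simulate_core([10, 30, 50, 20], [40, 30, 60, 10]))
# With more bandwidth for that load (50 -> 35 cycles) the stall shrinks.
print(simulate_core([10, 30, 35, 20], [40, 30, 60, 10]))
```

In this model, shortening a single memory cycle reduces the stall that follows the preceding computation cycle, which is the effect the fourth example aims for on NPU0.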

[0258] Referring to Figure 15 and Figure 16(a), according to the fourth example of this disclosure, the first controller can compare the computation cycle and memory cycle information of each tensor processed on NPU0 and NPU1. That is, the first processing time and the second processing time of each tensor to be processed by each NPU can be compared. The processing time information may be part of the scheduling information of the neural network model. The processing time information may be contained in the corresponding neural network model as initial values determined based on the tensor size. The second processing time may vary in real time according to the real-time bandwidth allocation of the bus. If the first processing time of a tensor is greater than the second processing time, the tensor is determined to be in a computation-constrained (CB) interval. If the first processing time of a tensor is less than the second processing time, the tensor is determined to be in a memory-constrained (MB) interval. Specifically, since the computation cycle COMP of the zeroth tensor n of NPU0 is completed after the memory cycle MEM of the first tensor n+1, the interval is determined to be a computation-constrained (CB) interval. Here, as an example, all tensors on NPU1 are also identified as computation-constrained (CB) intervals. Since the computation cycles COMP of the first tensor n+1 and the second tensor n+2 of NPU0 are completed before the memory cycles MEM of the second tensor n+2 and the third tensor n+3 of NPU0, respectively, these intervals are identified as memory-constrained (MB) intervals. This may correspond to step S310 of Figure 15.

[0259] Referring to Figure 15 and Figure 16(b), according to the fourth example of this disclosure, the system can be configured to generate data starvation signals in real time. The first controller can determine in real time whether each processing core is in a data starvation state based on the data starvation signal IDLE generated by the second controller. The data starvation signal IDLE can be enabled when the NPU is not busy and disabled when the NPU is busy. The first controller can be configured to dynamically check whether the data starvation signal IDLE is activated when processing computation-constrained (CB) tensors. The first controller can likewise be configured to dynamically check whether the data starvation signal IDLE is activated when processing memory-constrained (MB) tensors. Specifically, the data starvation signal IDLE is disabled for the intervals of the zeroth tensor n and the first tensor n+1 that are determined to be computation-constrained (CB) intervals of NPU0. The data starvation signal IDLE for the first tensor n+1 and the second tensor n+2, determined to be a memory-constrained (MB) interval of NPU0, is disabled and then enabled after the computation cycle COMP of the first tensor n+1 is completed. The data starvation signal IDLE for the second tensor n+2 and the third tensor n+3, determined to be a memory-constrained (MB) interval of NPU0, is disabled and then enabled after the computation cycle COMP of the second tensor n+2 is completed. The data starvation signal IDLE is disabled for all tensors m, m+1, m+2, and m+3 of NPU1, which are determined to be computation-constrained (CB) intervals. This corresponds to steps S321 and S322 of Figure 15.

[0260] Referring to Figure 15 and Figure 16(b), when the data starvation signal IDLE is disabled, the system according to the fourth example can set the bus priority of a tensor with memory-constrained (MB) characteristics to the default priority D, which can correspond to step S331 of Figure 15. When the data starvation signal IDLE is disabled, the system according to the fourth example can set the bus priority of a tensor with computation-constrained (CB) characteristics to the low priority L, which can correspond to step S333 of Figure 15. When the data starvation signal IDLE is activated, the system according to the fourth example can set the bus priority of the corresponding tensor to the high priority H. That is, if the data starvation signal IDLE is activated, the first controller can assign the high priority H to the tensor corresponding to that signal regardless of whether it has memory-constrained or computation-constrained characteristics, which can correspond to step S332 of Figure 15. The data starvation signal is described above with reference to Figure 13.

[0261] The benefits of the fourth example of this disclosure are described with reference to Figure 16(b). Figure 16(a) shows the situation before this disclosure is applied. Figure 16(b) shows the situation after the fourth example of this disclosure is applied.

[0262] According to the fourth example of this disclosure, when multiple tensors compete for bandwidth on the system bus, the bus can be configured to allocate relatively higher bandwidth to tensors with relatively higher priority. For example, the bus can allocate higher bandwidth to memory-constrained tensors than to computation-constrained tensors. Therefore, if low-priority tensors and normal-priority tensors compete on the bus, the bus can be configured to process the normal-priority tensors first. For example, if memory-constrained tensors and computation-constrained tensors compete on the bus, the bus can reorder the sequence queue and process the memory-constrained tensors first. Furthermore, since its computation cycle COMP is shorter than the corresponding memory cycle MEM, a memory-constrained (MB) tensor can be determined to be short of memory bandwidth. Since its computation cycle COMP is longer than the corresponding memory cycle MEM, a computation-constrained (CB) tensor can be determined to have spare memory bandwidth. For tensors whose bandwidth is increased, the duration of the memory cycle may be shortened; conversely, for tensors whose bandwidth is decreased, the duration of the memory cycle may be lengthened.
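The two bus behaviors mentioned above, allocating more bandwidth to higher priorities and reordering the sequence queue, might be sketched as follows. The numeric weights (4, 2, 1), the proportional-share policy, and the function names are assumptions chosen only for illustration; the disclosure does not specify how the bus circuit divides bandwidth.

```python
from __future__ import annotations

# Hypothetical weights: a higher priority gets a larger share of the bus.
WEIGHTS = {"H": 4, "D": 2, "L": 1}
RANK = {"H": 0, "D": 1, "L": 2}


def split_bandwidth(total: float, priorities: dict[str, str]) -> dict[str, float]:
    """Divide the total bus bandwidth in proportion to each requester's weight."""
    total_weight = sum(WEIGHTS[p] for p in priorities.values())
    return {name: total * WEIGHTS[p] / total_weight for name, p in priorities.items()}


def reorder_queue(queue: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """Move higher-priority requests to the front; equal priorities keep their order."""
    return sorted(queue, key=lambda request: RANK[request[1]])


# Equal priorities: neither core yields bandwidth.
print(split_bandwidth(12.0, {"NPU0": "L", "NPU1": "L"}))
# Default versus low: the low-priority core yields part of its share.
print(split_bandwidth(12.0, {"NPU0": "D", "NPU1": "L"}))
print(reorder_queue([("m+2", "L"), ("n+2", "D"), ("n+3", "H")]))
```

With equal priorities the split is even, which mirrors the case where neither NPU yields bandwidth; with default versus low priority, the memory-constrained requester receives the larger share and its memory cycle shortens accordingly.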

[0263] Referring to Figure 16(a) and Figure 16(b), the memory cycle MEM interval of the first tensor n+1 of NPU0 is identified as a computation-constrained (CB) interval, and the first controller assigns a low priority L to the memory cycle MEM of the first tensor n+1 based on the data starvation signal IDLE. The memory cycle MEM interval of the first tensor m+1 of NPU1, which competes with the aforementioned tensor, is also identified as a computation-constrained (CB) interval, and the first controller assigns a low priority L to the memory cycle MEM of the first tensor m+1 based on the data starvation signal IDLE. In this case, since the bus priorities of the memory access operations of the first tensor n+1 of NPU0 and the first tensor m+1 of NPU1 are equal, in (b) according to the fourth example of this disclosure, compared with (a), NPU0 and NPU1 neither yield bus bandwidth to nor receive bus bandwidth from each other, and therefore the memory cycles of NPU0 and NPU1 may not change substantially.

[0264] The memory cycle MEM interval of the second tensor n+2 of NPU0 is identified as a memory-constrained (MB) interval, and the first controller assigns a default priority D to the memory cycle MEM of the second tensor n+2 based on the data starvation signal IDLE, and then assigns a high priority H when the data starvation signal is activated. The memory cycle MEM interval of the second tensor m+2 of NPU1, which competes with the above tensor, is identified as a computation-constrained (CB) interval, and the first controller assigns a low priority L to the memory cycle MEM of the second tensor m+2 based on the data starvation signal IDLE. The priority of the memory cycle MEM of the second tensor n+2 of NPU0 is the default priority D, while the priority of the memory cycle MEM of the second tensor m+2 of NPU1 is the low priority L. Therefore, in (b) according to the fourth example of this disclosure, compared with (a), NPU1 can yield a predetermined bus bandwidth to NPU0, or the bus sequence queue can be reordered according to the priorities of NPU0 and NPU1. In this scenario, the duration of the memory cycle MEM of NPU0's second tensor n+2 is shortened, while the duration of the memory cycle MEM of NPU1's second tensor m+2 is increased. Furthermore, when the computation cycle COMP of NPU0's first tensor n+1 is completed, the data starvation signal IDLE is activated in NPU0, and the priority of the memory cycle MEM of NPU0's second tensor n+2 is changed to the high priority H. Therefore, in (b) according to the fourth example of this disclosure, compared with (a), an even greater amount of bus bandwidth is allocated to NPU0 than to NPU1. In this case, the duration of the memory cycle MEM of NPU0's second tensor n+2 is further shortened, while the duration of the memory cycle MEM of NPU1's second tensor m+2 is further increased.

[0265] Therefore, the first data starvation period DS1 of NPU0 in (a) is reduced to the first data starvation period DS1' in (b), thus improving the processing speed of NPU0. At the same time, the duration of the memory cycle MEM of the second tensor m+2 of NPU1 is increased, and the first interval M1 with spare bandwidth in (a) is reduced to the first interval M1' with spare bandwidth in (b). Even with the longer memory cycle MEM of the second tensor m+2 of NPU1, the computation of NPU1 is not delayed because spare bandwidth remains. Therefore, the computation speed of NPU1 is maintained even though part of the bus bandwidth allocated to NPU1 is yielded to NPU0.

[0266] The memory cycle MEM interval of the third tensor n+3 of NPU0 is identified as a memory-constrained (MB) interval, and the first controller assigns a default priority D to the memory cycle MEM of the third tensor n+3 based on the data starvation signal IDLE, and then assigns a high priority H when the data starvation signal IDLE is activated. The memory cycle MEM interval of the third tensor m+3 of NPU1, which competes with the above tensor, is identified as a computation-constrained (CB) interval, and the first controller assigns a low priority L to the memory cycle MEM of the third tensor m+3 based on the data starvation signal IDLE. That is, the priority of the memory cycle MEM of the third tensor n+3 of NPU0 is the default priority D, while the priority of the memory cycle MEM of the third tensor m+3 of NPU1 is the low priority L. Therefore, in (b) according to the fourth example of this disclosure, compared with (a), NPU1 can yield a predetermined bus bandwidth to NPU0, or the bus sequence queue can be reordered according to the priorities of NPU0 and NPU1. In this scenario, the duration of the memory cycle MEM of the third tensor n+3 of NPU0 is shortened, while the duration of the memory cycle MEM of the third tensor m+3 of NPU1 is increased. Here, when the computation cycle COMP of the second tensor n+2 of NPU0 is completed, the data starvation signal IDLE is activated in NPU0, and the priority of the memory cycle MEM of the third tensor n+3 of NPU0 is changed to the high priority H. Therefore, in Figure 16(b) according to the fourth example of this disclosure, compared with Figure 16(a), an even larger amount of bus bandwidth is allocated to NPU0 than to NPU1. In this case, the memory cycle MEM of NPU0's third tensor n+3 is further shortened, while the memory cycle MEM of NPU1's third tensor m+3 is further lengthened. Therefore, the second data starvation period DS2 of NPU0 in (a) is reduced to the second data starvation period DS2' in (b). Thus, the processing speed of NPU0 can be improved. At the same time, the duration of the memory cycle MEM of NPU1's third tensor m+3 is increased, and the second interval M2 with spare bandwidth in (a) is reduced to the second interval M2' with spare bandwidth in (b). Even though the memory cycle MEM of NPU1's third tensor m+3 is lengthened, the computation of NPU1 is not delayed because spare bandwidth remains. Therefore, the computation speed of NPU1 is maintained even though part of the bus bandwidth allocated to NPU1 is reallocated to NPU0.

[0267] In other words, the system according to the fourth example can assign one of a first priority (e.g., low priority L) or a second priority (e.g., default priority D) to each tensor of the neural network model based on information in the neural network model, and also assign a third priority based on a first signal (e.g., a data starvation signal) generated by the neural processing unit processing the corresponding tensor. The third priority is a higher priority than the first or second priority, and the bus can send tensor data of the third priority before tensor data of the first or second priority. The second priority is a higher priority than the first priority, and the bus can send tensor information of the second priority before tensor data of the first priority. The first and second priorities can be determined based on pre-obtained information, and the third priority can be dynamically determined based on a dynamically generated first signal. Therefore, the system according to the fourth example of this disclosure can be configured to adjust the priority of the bus for a specific interval of each tensor processed in real time to reduce the data starvation period of the neural network model processed on at least one of a plurality of processing cores.

[0268] According to the fourth example, by first comparing the cycles of computational operations and memory access operations of each tensor of the neural processing unit 1000 to identify data starvation periods (e.g., corresponding to the first example), and then further identifying the data starvation state of each processing core and dynamically prioritizing it (e.g., corresponding to the second and third examples), data starvation periods can be reduced more effectively.

[0269] Figure 17 is a flowchart describing a control method for a neural processing unit according to another example of this disclosure. The control method can be executed by a first controller 1100 that controls one or more processing cores 1000-1, …, 1000-n.

[0270] As shown in Figure 17, in step S1100 the first controller 1100 can identify at least one data starvation period based on the access operations to the memory 3000 and the computation operations for each tensor of one or more processing cores 1000-1, …, 1000-n.

[0271] At this point, computational operations and memory access operations for each tensor can be performed within the bus bandwidth allocated to each of at least one of the processing cores 1000-1, …, 1000-n, to communicate with memory 3000. The bus bandwidth can be set differently for each tensor of the first bus 6100 and the second bus 6200.

[0272] In S1100, the first controller 1100 can identify at least one data starvation period by comparing the computation cycle and memory cycle of each tensor of one or more processing cores 1000-1, …, 1000-n. For this purpose, the first controller 1100 can be configured to receive or monitor the computation cycle and memory cycle information of each tensor.

[0273] Specifically, the first controller 1100 can be configured to compare a first processing time with a second processing time, the first processing time being the number of computation cycles required to complete the computational operation on a specific tensor, and the second processing time being the number of memory cycles required to complete the memory access operation for the tensor that follows the specific tensor, and to identify the interval by which the second processing time exceeds the first processing time as a data starvation period. In this case, the second processing time can be the sum of the time consumed in writing the computation result of the previous tensor to the memory 3000 and the time consumed in reading the data used to compute the next tensor. More specifically, the first processing time and the second processing time are determined by the parameter size of the tensors in the neural network model and the complexity of the computation algorithm. Therefore, the first processing time and the second processing time can be pre-analyzed during the compilation phase of the neural network model.
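The per-tensor comparison described above can be worked through on a small table of pre-analyzed cycle counts. This is a sketch under assumptions: the cycle values are invented, the function name `starvation_periods` is hypothetical, and only tensors that have both a previous and a next tensor are evaluated.

```python
from __future__ import annotations


def starvation_periods(compute: list[int], write: list[int], read: list[int]) -> dict[int, int]:
    """Predict the starvation cycles that follow each tensor's computation.

    compute[n] -- cycles to compute tensor n (first processing time)
    write[n]   -- cycles to write tensor n's result to memory
    read[n]    -- cycles to read the data needed to compute tensor n
    """
    periods = {}
    for n in range(1, len(compute) - 1):
        t1 = compute[n]
        # Second processing time: write-back of the previous result plus
        # the read for the next tensor, as described in the text.
        t2 = write[n - 1] + read[n + 1]
        periods[n] = max(0, t2 - t1)
    return periods


compute = [50, 40, 80, 30]
write = [10, 20, 10, 10]
read = [30, 30, 40, 25]
print(starvation_periods(compute, write, read))
```

For tensor 1 the memory work (10 + 40 cycles) outlasts the 40-cycle computation, so a 10-cycle starvation period is predicted; for tensor 2 the 80-cycle computation covers the 45 cycles of memory work and no starvation is predicted.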

[0274] Next, the first controller 1100 can control the priority of memory access operations for each tensor of one or more processing cores 1000-1, …, 1000-n in S1200, thereby preventing or reducing data starvation. The first controller 1100 can be configured to prioritize the read operation of the next tensor of a particular processing core when the computation process of the current tensor of that core is predicted to experience a data starvation period. The first controller 1100 can be configured to control the first bus 6100 and the second bus 6200 based on the determined priorities.

[0275] In S1200, the first controller 1100 can identify a processing core that performs computational operations on tensors that include at least one data starvation period. The first controller 1100 can be configured to adjust the bus bandwidth associated with the identified processing core, thereby reducing or eliminating its data starvation period.

[0276] In other words, if the second processing time for the processing core to complete the write operation to the previous tensor and the read operation to the next tensor as a memory access operation is relatively longer than the first processing time to complete the computation operation on the tensor, then the first controller 1100 can be configured to give high priority to the memory access operations of the processing core and reallocate bus bandwidth accordingly. Conversely, if the first processing time is sufficiently longer than the second processing time, and it is acceptable to increase the second processing time, then the first controller 1100 can be configured to give low priority to the memory access operations of the processing core and reallocate bus bandwidth accordingly.

[0277] If the first processing time is longer than the second processing time, the first controller 1100 can be configured to give priority to other processing cores whose memory access requests have a relatively high probability of data starvation, so that part of the bus bandwidth allocated to the processing core is used for the operations of other circuits (e.g., other neural processing units, other processing cores, a CPU, a decoder, an image sensor, etc.).

[0278] On the other hand, if the second processing time is longer than the first processing time, the first controller 1100 can increase the priority of the corresponding processing core requesting a memory access operation to reduce or eliminate data starvation, and the bus can be configured to use more available bus bandwidth to prioritize memory access request operations of the corresponding processing core. In this case, additional available bus bandwidth can be reallocated by giving lower priority to other processing cores with a relatively lower probability of data starvation.

[0279] In some examples, the first controller 1100 can be configured to allocate a relatively higher bus bandwidth to a particular neural processing unit than to other neural processing units, based on the first and second processing times of the corresponding tensor requested by each of the plurality of processing cores 1000-1, …, 1000-n. Here, the bus bandwidth allocation can be dynamically adjusted to reduce or eliminate data starvation for each tensor. This advantageously has the effect of reducing data starvation for the multiple processing cores contained in the system 10000.

[0280] In other words, if the first processing time to complete a computational operation is long enough, even if the second processing time to complete a memory access operation of a specific processing core increases slightly, the bus bandwidth can be reallocated to another processing core (e.g., a processing core with overlapping intervals in memory access operations).

[0281] On the other hand, if the second processing time for completing a memory access operation of a specific processing core is sufficiently long compared to the first processing time for completing a computation operation, that processing core (e.g., a processing core whose memory access operations overlap with those of other processing cores) can obtain more memory access opportunities by receiving additional bus bandwidth from the other processing cores, thereby completing the memory access operation sooner and reducing the time during which the computing circuit is not running.

[0282] Figure 18 is a diagram illustrating an example of a method for determining priorities in the control of a processing core according to another example of this disclosure, and explains step S1200 of Figure 17 in more detail.

[0283] Referring to Figure 18, C(n) represents the processing time to complete a computational operation on a specific tensor, W(n-1) represents the processing time to complete a memory access operation (e.g., a write operation) for the tensor preceding the specific tensor, and R(n+1) represents the processing time to complete a memory access operation (e.g., a read operation) for the tensor following the specific tensor. In other words, C(n) is the first processing time, which is the computation cycle, and the sum of W(n-1) and R(n+1) is the second processing time, which is the memory cycle.

[0284] Referring to Figure 18, if the second processing time is longer than the first processing time, i.e., if the data starvation level (e.g., (W(n-1)+R(n+1)) / C(n)) is greater than the first threshold Th1 (e.g., Th1 is 1), then the first controller 1100 gives high priority to the memory access operations (e.g., write operations and/or read operations) corresponding to W(n-1)+R(n+1). Therefore, by reallocating bus bandwidth from other processing cores, the operations of W(n-1)+R(n+1) can be accelerated. As a result, the total time for processing data is reduced as the time during which the computing circuit is not running (i.e., the data starvation period) decreases. Conversely, if the second processing time is shorter than the first processing time, i.e., if the data starvation level is lower than the second threshold Th2 (e.g., Th2 is 1), then the first processing time is the longer one, and the memory access operations corresponding to W(n-1)+R(n+1) are given low priority. Therefore, at least a portion of the bus bandwidth allocated to the W(n-1)+R(n+1) operations can be reallocated to one of the other processing cores. Consequently, as the idle time of the computing circuitry on the one or more other processing cores that receive the reallocated bus bandwidth decreases, the total time required to process data also decreases.

[0285] Meanwhile, if the first processing time and the second processing time are the same, that is, if the data starvation level is equal to the third threshold (e.g., 1), the first controller 1100 assigns a normal priority to maintain the current state, because there is no data starvation period.
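The ratio test of Figure 18 can be expressed compactly. The sketch below assumes the example values given in the text (all three thresholds equal to 1); the function names and the string labels for the priorities are illustrative only.

```python
def starvation_level(w_prev: float, r_next: float, c_cur: float) -> float:
    """Data starvation level: (W(n-1) + R(n+1)) / C(n)."""
    return (w_prev + r_next) / c_cur


def priority_from_level(level: float, th1: float = 1.0, th2: float = 1.0) -> str:
    """Map the level to a priority using the example thresholds Th1 = Th2 = 1."""
    if level > th1:
        return "high"    # memory cycle longer than compute cycle
    if level < th2:
        return "low"     # compute cycle longer than memory cycle
    return "normal"      # equal times: no starvation period, keep the state


level_a = starvation_level(10, 40, 40)  # 50 memory cycles against 40 compute cycles
level_b = starvation_level(20, 25, 80)  # 45 memory cycles against 80 compute cycles
print(level_a, priority_from_level(level_a))
print(level_b, priority_from_level(level_b))
print(priority_from_level(1.0))
```

Using a ratio rather than a difference makes the threshold independent of tensor size, which may be why the text expresses the comparison this way; that motivation is an inference, not something the disclosure states.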

[0286] In other words, a system according to an example of this disclosure can be configured to calculate the data starvation level of a specific tensor and assign high priority to the tensor by comparing the level to a first threshold. Furthermore, a system according to an example of this disclosure can be configured to assign low priority to the tensor by comparing the data starvation level to a second threshold. Additionally, a system according to an example of this disclosure can be configured to maintain the tensor's priority when the data starvation level is equal to a third threshold. Here, the first threshold and the second threshold can be the same. Furthermore, the second threshold and the third threshold can be the same.

[0287] In some examples, the first threshold can be greater than the third threshold, and the second threshold can be less than the third threshold. Here, the third threshold can be a range between the first and second thresholds. Specifically, the first threshold can be, for example, 1. If the data starvation level is 1, the corresponding tensor theoretically has no data starvation, but because of various overheads and bandwidth contention on the bus, data starvation may still occur briefly, so the tensor can be given high priority. The second threshold can be, for example, 0.8. If the data starvation level is 0.8, the corresponding tensor can be considered unlikely to experience data starvation even with various overheads and bandwidth contention on the bus, and it has enough bus bandwidth to yield, so its priority can be reduced. The third threshold can be the range between the first and second thresholds. When the data starvation level is, for example, between 1 and 0.8, data starvation is unlikely to occur even considering the various overheads and bandwidth contention on the bus, but there is not enough bandwidth to yield. In other words, the system can be configured to calculate the data starvation level of each tensor, increase the priority of the corresponding tensor based on the first threshold, decrease the priority of the corresponding tensor based on the second threshold, which is different from the first threshold, and maintain the priority of the corresponding tensor when the level lies between the first and second thresholds (i.e., at the third threshold).
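When the first and second thresholds differ, the range between them acts as a hold band. A minimal sketch, assuming the example values 1 and 0.8 from the text and treating both thresholds as inclusive (an assumption based on the statements that a level of exactly 1 can be given high priority and a level of exactly 0.8 can have its priority reduced):

```python
def update_priority(current: str, level: float,
                    th_raise: float = 1.0, th_lower: float = 0.8) -> str:
    """Raise, lower, or keep a tensor's priority based on its starvation level.

    current  -- the priority the tensor holds now (kept inside the hold band)
    level    -- data starvation level, (W(n-1) + R(n+1)) / C(n)
    th_raise -- first threshold (example value 1)
    th_lower -- second threshold (example value 0.8)
    """
    if level >= th_raise:
        return "high"   # starvation likely once bus overhead is considered
    if level <= th_lower:
        return "low"    # enough margin to yield bandwidth
    return current      # third threshold: the range between the two


print(update_priority("normal", 1.0))
print(update_priority("normal", 0.8))
print(update_priority("normal", 0.9))
```

The gap between the two thresholds leaves a safety margin for bus overhead: a tensor whose memory cycle is only slightly shorter than its computation cycle neither receives extra bandwidth nor gives any up.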

[0288] The specific operation of assigning priorities according to examples of this disclosure, and the resulting improvement in data processing speed, are described with reference to Figure 19. Figure 19 is a diagram illustrating the data processing speed improvement achieved by assigning priorities to reduce data starvation, according to another example of this disclosure.

[0289] Referring to Figure 19, the diagram illustrates the memory cycle MEM (e.g., read cycle RD and write cycle WR) and the computation cycle COMP for each tensor processed by each processing core. Each processing core can be configured to store parameters in internal memory during the memory cycle corresponding to each tensor, and to use the parameters stored in internal memory to process the operations of the neural network model during the corresponding computation cycle. In other words, for a processing core to process a tensor, the processing core's DMA first commands the bus to send the tensor to the processing core's internal memory during a memory cycle. Then, the processing element of the processing core computes the corresponding tensor stored in internal memory during a computation cycle.

[0290] Referring to Figure 19, Core0 can refer to a single processing core. For example, Core0 can correspond to the processing core 1000-1 in Figure 2. Core0 can also refer to a neural processing unit. For example, Core0 can correspond to the neural processing unit 1000 in Figure 2. Core0 may contain one or more processing cores 1000-1, …, 1000-n.

[0291] On the other hand, Core1 can refer to a processing core. For example, Core1 can correspond to another processing core 1000-n in Figure 2. Core1 can also refer to a neural processing unit. For example, Core1 can correspond to another neural processing unit. Core1 can contain at least one processing core.

[0292] Referring to Figure 19(a), as an example, the first processing time (i.e., the computation cycle) of the data operation corresponding to the first tensor m+1 processed by Core1 is shorter than the second processing time (i.e., the memory cycle including the write cycle and the read cycle) of the data write operation based on the initial tensor m and the data read operation corresponding to the second tensor m+2. Therefore, a data starvation period may occur between the computation cycle of the first tensor m+1 and the computation cycle of the second tensor m+2 of Core1, lasting until the write cycle of the initial tensor m and the read cycle of the second tensor m+2 of Core1 are completed.

[0293] In addition, as shown in Figure 19(a), the first processing time (i.e., the computation cycle) of the data operation corresponding to the first tensor n+1 processed by Core0 is longer than the second processing time (i.e., the memory cycle including the write cycle and the read cycle) of the data write operation based on the initial tensor n and the data read operation corresponding to the second tensor n+2. In other words, the write cycle of the initial tensor n and the read cycle of the second tensor n+2 of Core0 are completed before the computation cycle of the first tensor n+1 of Core0 is completed. Therefore, the read cycle of the second tensor n+2 of Core0 has a bandwidth margin until the computation cycle of the second tensor n+2 of Core0 begins.
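The comparison in the two paragraphs above reduces to the sign of the difference between the computation cycle and the memory cycle. The sketch below illustrates this with made-up durations in arbitrary time units; the function name `slack` and all numbers are hypothetical and are not taken from Figure 19.

```python
def slack(comp_cycle, write_prev, read_next):
    """Computation time minus memory time for one tensor.

    The memory cycle is the write cycle of the previous tensor plus the
    read cycle of the next tensor.
    Negative result: the core must wait (data starvation period).
    Positive result: the transfers finish early (bandwidth margin).
    """
    return comp_cycle - (write_prev + read_next)


# Core1: compute m+1 while writing m and reading m+2 (hypothetical durations).
core1 = slack(comp_cycle=4, write_prev=3, read_next=3)
# Core0: compute n+1 while writing n and reading n+2 (hypothetical durations).
core0 = slack(comp_cycle=8, write_prev=2, read_next=3)

print("Core1 slack:", core1)  # -2: starvation period of 2 time units
print("Core0 slack:", core0)  # 3: bandwidth margin of 3 time units
```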

[0294] The read cycle of the second tensor m+2 of Core1 at least partially overlaps with the read cycle of the second tensor n+2 of Core0, so the bandwidth margin of Core0 can be utilized. Therefore, even if at least a portion of the bus bandwidth assigned to the read cycle of the second tensor n+2 of Core0 is reallocated to Core1, there may be no substantial data starvation period between the computation cycles of the first tensor n+1 and the second tensor n+2 of Core0.

[0295] For example, as shown in Figure 19(b), by applying low priority to the read operation of Core0's second tensor n+2 and high priority to the read operation of Core1's second tensor m+2, the bus bandwidth allocated to Core0's read operations can be reallocated to Core1, thereby reducing Core1's data starvation period. At this point, since Core0 has sufficient bus bandwidth allocated to the read operation of the second tensor n+2, there may be no data starvation period for Core0's second tensor n+2 even with at least a portion of the bus bandwidth reallocated.

[0296] Meanwhile, as another example, as shown in Figure 19(a), the memory cycle comprising the write cycle of the first tensor n+1 of Core0 and the read cycle of the third tensor n+3 of Core0 is longer than the computation cycle of the second tensor n+2 of Core0. Therefore, a data starvation period may occur between the computation cycles of the second tensor n+2 and the third tensor n+3 of Core0, lasting until the write cycle of the first tensor n+1 and the read cycle of the third tensor n+3 of Core0 are completed.

[0297] In addition, as shown in Figure 19(a), the computation cycle of the second tensor m+2 processed by Core1 is longer than the memory cycle that includes the write cycle of the first tensor m+1 and the read cycle of the third tensor m+3. In other words, the write cycle of the first tensor m+1 and the read cycle of the third tensor m+3 of Core1 are completed before the computation cycle of the second tensor m+2 of Core1 is completed. Therefore, there is a bandwidth margin for the read cycle of the third tensor m+3 of Core1 before the computation cycle of the third tensor m+3 of Core1 begins.

[0298] The read cycle of the third tensor n+3 of Core0 at least partially overlaps with the read cycle of the third tensor m+3 of Core1, so the bandwidth margin of Core1 can be utilized. Therefore, even if at least a portion of the bus bandwidth assigned to the read cycle of the third tensor m+3 of Core1 is reallocated to Core0, there may be no substantial data starvation period between the computation cycle of the second tensor m+2 of Core1 and the computation cycle of the third tensor m+3 of Core1.

[0299] For example, as shown in Figure 19(b), by applying low priority to the read operation of Core1's third tensor m+3 and high priority to the read operation of Core0's third tensor n+3, the bus bandwidth allocated to Core1's read operations can be reallocated to Core0, thereby reducing Core0's data starvation period. At this point, since Core1 has sufficient bus bandwidth allocated to the read operation of its third tensor m+3, there may be no data starvation period for Core1's third tensor m+3 even with at least a portion of the bus bandwidth reallocated.
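A toy model can make the reallocation of Figure 19(b) concrete for the first case, in which Core1's read is prioritized over Core0's. Two cores split a fixed bus bandwidth, and each core's transfer time is its transfer size divided by its share. All names and numbers here are hypothetical, and the model assumes that the two transfers overlap completely and that the split is static, which simplifies the bus behavior described above.

```python
TOTAL_BANDWIDTH = 10.0  # hypothetical units of data per time unit


def starvation(transfer_size, bandwidth, comp_cycle):
    """Time a core idles because its transfer outlasts the overlapping computation."""
    return max(0.0, transfer_size / bandwidth - comp_cycle)


def evaluate(core0_share):
    """Return (Core0 starvation, Core1 starvation) for a given Core0 bandwidth share."""
    bw0 = TOTAL_BANDWIDTH * core0_share
    bw1 = TOTAL_BANDWIDTH - bw0
    # Core0 is compute-heavy (long computation, small transfer): it has margin.
    core0 = starvation(transfer_size=20.0, bandwidth=bw0, comp_cycle=8.0)
    # Core1 is memory-heavy (short computation, large transfer): it starves.
    core1 = starvation(transfer_size=30.0, bandwidth=bw1, comp_cycle=4.0)
    return core0, core1


print("equal split:      ", evaluate(0.5))
print("Core1 prioritized:", evaluate(0.25))
```

With these numbers, the equal split leaves Core1 waiting while Core0 finishes its transfer early, and shifting part of Core0's share to Core1 removes Core1's wait without introducing one for Core0.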

[0300] According to the above disclosure, based on the system for controlling the processing core, priority QoS for read and write operations for accessing the memory of each neural processing unit or each processing core can be applied to achieve efficient operation between read and write operations during DMA operation.

[0301] Furthermore, according to this disclosure, when a data starvation period is predicted because the time required to complete a memory access operation is shorter or longer than the time required to complete a computation operation, bus bandwidth can be reallocated according to priority QoS to enable the computing circuit to operate without a data starvation period, thereby improving data processing performance and reducing power consumption.

[0302] Furthermore, according to this disclosure, by applying a high priority to NPU0 and a low priority to NPU1 at a certain point in time, the bus bandwidth of NPU1 can be yielded to NPU0 to reduce the duration of data starvation.

[0303] The implementation relates to a system including at least one processing core configured to perform computational operations on at least one neural network model associated with a tensor. At least one memory circuit is configured to store the tensor. A plurality of bus circuits are operatively coupled to the at least one processing core and the at least one memory circuit. The plurality of bus circuits are configured to send tensors from the at least one memory circuit to the at least one processing core in response to receiving a request for a read or write operation. A controller is operatively coupled to the plurality of bus circuits. The controller is configured to determine the priority of each tensor on each bus circuit in a read or write operation.

[0304] In one or more embodiments, multiple bus circuits may include a first bus configured to perform tensor read operations and a second bus configured to perform tensor write operations.

[0305] In one or more embodiments, the controller may be configured to determine the priority of each tensor by comparing the duration of a read cycle using the first bus, the duration of a write cycle using the second bus, and the duration of a tensor computation cycle.

[0306] In one or more embodiments, the controller may be configured to determine the priority of each tensor by comparing the computation cycle of a first tensor among the tensors at the processing core with a memory cycle that includes the write cycle of the tensor preceding the first tensor and the read cycle of the tensor following the first tensor.
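One way to picture this comparison across a sequence of tensors is the sketch below. It is illustrative only: the `TensorTiming` record, the function `read_priorities`, and the two-level "HIGH"/"LOW" outcome are assumptions, and the durations are invented values in arbitrary time units.

```python
from dataclasses import dataclass


@dataclass
class TensorTiming:
    read: float     # duration of the read cycle (first bus)
    compute: float  # duration of the computation cycle
    write: float    # duration of the write cycle (second bus)


def read_priorities(timings):
    """Priority for the read of tensor i+1, decided against the computation of tensor i.

    The memory cycle compared with compute[i] is write[i-1] + read[i+1].
    """
    result = {}
    for i in range(1, len(timings) - 1):
        memory_cycle = timings[i - 1].write + timings[i + 1].read
        starving = memory_cycle > timings[i].compute
        result[i + 1] = "HIGH" if starving else "LOW"
    return result


timings = [
    TensorTiming(read=2, compute=5, write=3),
    TensorTiming(read=2, compute=4, write=2),
    TensorTiming(read=3, compute=9, write=2),
    TensorTiming(read=1, compute=3, write=1),
]
print(read_priorities(timings))
```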

[0307] In one or more embodiments, the controller may be configured to, in response to determining that data starvation is predicted to occur or has occurred in a first processing core of the at least one processing core, increase the bus bandwidth of a first read cycle assigned to the first processing core by reducing the bus bandwidth of a second read cycle assigned to a second processing core of the at least one processing core.

[0308] In one or more embodiments, the controller may be configured to, in response to determining that data starvation is predicted to occur or has occurred in a first processing core of the at least one processing core, increase the priority of sending tensors for the read cycle of the first processing core, thereby increasing the bus bandwidth allocated to the first processing core.

[0309] In one or more embodiments, the controller may be configured to, in response to determining that at least one processing core is in a compute-constrained state, control at least one of a plurality of bus circuits to reallocate at least a portion of the bandwidth allocated to at least one processing core relative to a read cycle.

[0310] In one or more embodiments, the controller may be configured to determine the priority of read cycles that send tensors to at least one processing core via bus circuitry in response to receiving a data hunger signal.
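A signal-driven priority update of this kind might look like the following sketch. The class `BusController`, its method names, the numeric priority scale, and the cap of 3 are all hypothetical choices made for illustration; the disclosure does not specify this interface.

```python
class BusController:
    """Tracks one read priority per core; larger numbers mean higher priority."""

    def __init__(self, core_ids, default_priority=1, max_priority=3):
        self.max_priority = max_priority
        self.read_priority = {core: default_priority for core in core_ids}

    def on_data_hunger_signal(self, core_id):
        """Raise the read priority of the signalling core, capped at max_priority."""
        current = self.read_priority[core_id]
        self.read_priority[core_id] = min(current + 1, self.max_priority)

    def next_read(self, pending_cores):
        """Pick the pending core whose read cycle should be served first."""
        return max(pending_cores, key=lambda core: self.read_priority[core])


controller = BusController(["Core0", "Core1"])
controller.on_data_hunger_signal("Core1")
print(controller.next_read(["Core0", "Core1"]))
```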

[0311] In one or more implementations, multiple bus circuits can operate individually for each read and write operation.

[0312] In one or more embodiments, each of at least one processing core may include a plurality of processing elements (PEs), wherein the plurality of PEs include at least one of a multiplication and accumulation (MAC) operator, an adder tree, or an arithmetic logic unit (ALU) operator.

[0313] The implementation also relates to a method that includes determining, based on the duration of a computation cycle for a specific tensor and the duration of a memory cycle for a subsequent tensor of that specific tensor, whether data starvation has occurred or is predicted to occur in at least one processing core configured to perform computation of at least one neural network model associated with a tensor. In response to this determination, a memory access operation priority for each tensor via at least one of a plurality of bus circuits is determined.

[0314] In one or more embodiments, the method may include performing tensor-associated computational operations and memory access operations based on the bandwidth of each of a plurality of bus circuits respectively coupled to at least one processing core and at least one memory circuit.

[0315] In one or more embodiments, determining whether data starvation has occurred or is predicted to occur may include determining the memory access operation priority of each tensor via at least one of the multiple bus circuits by comparing the duration of a memory cycle (which includes the duration of a read cycle using a first bus in the multiple bus circuits and the duration of a write cycle using a second bus in the multiple bus circuits) with the duration of a tensor computation cycle.

[0316] In one or more embodiments, determining whether data starvation has occurred or is predicted to occur may include: determining the priority of memory access operations on each tensor via at least one of a plurality of bus circuits by comparing the duration of a computation cycle of a first tensor in a tensor at a specific processing core in at least one processing core with the duration of a memory cycle (which includes a write cycle of the previous tensor of the first tensor and a read cycle of the subsequent tensor of the first tensor).

[0317] In one or more embodiments, determining the memory access operation priority for each tensor via at least one of a plurality of bus circuits may include: in response to determining that data starvation is predicted to occur or has already occurred in a first processing core of the at least one processing core, increasing the bus bandwidth of a first read cycle assigned to the first processing core by reducing the bus bandwidth of a second read cycle assigned to a second processing core of the at least one processing core.

[0318] In one or more embodiments, determining the memory access operation priority for each tensor via at least one of a plurality of bus circuits may include: in response to determining that data starvation is predicted to occur or has occurred in a first processing core of the at least one processing core, increasing the priority of transmitting the tensor for the read cycle of the first processing core, thereby increasing the bus bandwidth allocated to the first processing core.

[0319] In one or more embodiments, determining the memory access operation priority for each tensor via at least one of a plurality of bus circuits may include: in response to determining that at least one processing core is in a computationally constrained state, reallocating at least a portion of the bus bandwidth of at least one of the plurality of bus circuits assigned to at least one processor core relative to the read cycle.

[0320] In one or more embodiments, determining the memory access operation priority for each tensor via at least one of a plurality of bus circuits may include: in response to receiving a data hunger signal, determining the priority of a read cycle that sends the tensor to at least one processing core via at least one of the plurality of bus circuits.

[0321] In one or more embodiments, prioritizing memory access operations for each tensor via at least one of a plurality of bus circuits may include: reallocating bus bandwidth allocated for read cycles of memory cycles at a specific bus circuit among the plurality of bus circuits.

[0322] In one or more embodiments, memory access operation priorities may include first to third priorities. The second priority is higher than the first priority, and the third priority is higher than the first and second priorities.
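The ordering of the three priorities can be expressed with an ordered enumeration, as in the sketch below. The names `AccessPriority` and `arbitrate` and the tensor labels are hypothetical, and the rule that requests of equal priority keep their arrival order is an assumption.

```python
from enum import IntEnum


class AccessPriority(IntEnum):
    FIRST = 1   # lowest
    SECOND = 2  # higher than FIRST
    THIRD = 3   # higher than FIRST and SECOND


def arbitrate(requests):
    """Order (tensor_name, priority) requests from highest to lowest priority.

    sorted() is stable, so requests with equal priority keep their arrival order.
    """
    return sorted(requests, key=lambda request: request[1], reverse=True)


requests = [
    ("n+2", AccessPriority.FIRST),
    ("m+2", AccessPriority.THIRD),
    ("k+1", AccessPriority.SECOND),
]
print([name for name, _ in arbitrate(requests)])
```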

[0323] The examples of this disclosure disclosed herein and in the accompanying drawings are for the purpose of explaining the technical content of this disclosure and promoting understanding of this disclosure, and are not intended to limit the scope of this disclosure.

[0324] [National R&D projects supporting this invention] [Project Identifier] 2710008571 [Task ID] II20248 [Department Name] Department of Science and Information Technology [Name of Task Management (Professional) Institution] Institute of Information & Communication Technology Planning and Evaluation [Research Project Title] Development (Design) of PIM Artificial Intelligence Semiconductor Core Technologies [Research Task Title] Development of CXL-based Multi-DRAM Module PIM Semiconductor Technology Considering Memory Coherence [Name of the organization performing the task] DeepX CO., LTD. [Research Period] 2024.01.01~2024.12.31.

Claims

1. A system comprising: at least one processing core configured to perform computational operations on an input tensor to generate an output tensor, the input tensor and the output tensor being associated with at least one neural network model; at least one memory circuit configured to store the input tensor and the output tensor; a plurality of bus circuits operatively coupled to the at least one processing core and the at least one memory circuit, the plurality of bus circuits being configured to: in response to receiving a read operation request, send the input tensor from the at least one memory circuit to the at least one processing core, and in response to receiving a write operation request, send the output tensor from the at least one processing core to the at least one memory circuit; and a controller operatively coupled to the plurality of bus circuits and configured to determine the priority of the read operation for each of the input tensors or the write operation for each of the output tensors, and to control the plurality of bus circuits to send each of the input tensors or each of the output tensors according to the determined priority.

2. The system of claim 1, wherein the plurality of bus circuits comprises: a first bus configured to perform the read operation for reading data from the at least one memory circuit, and a second bus configured to perform the write operation for writing data to the at least one memory circuit.

3. The system of claim 2, wherein the controller is configured to determine the priority of each input tensor by comparing the duration of a computation cycle for each input tensor with the duration of a memory access cycle associated with the next input tensor following each input tensor.

4. The system of claim 2, wherein the controller is configured to determine the priority of each of the input tensors by comparing the duration of a computation cycle of each input tensor at the processing core with the duration of a memory cycle, the memory cycle comprising a write cycle of the previous tensor preceding each input tensor and a read cycle of the next input tensor following each input tensor.

5. The system of claim 1, wherein the controller is configured to, in response to determining that data starvation is predicted to occur or has occurred in a first processing core of the at least one processing core, increase the bus bandwidth of a first read cycle assigned to the first processing core by decreasing the bus bandwidth of a second read cycle assigned to a second processing core of the at least one processing core.

6. The system of claim 1, wherein the controller is configured to, in response to determining that data starvation is predicted to occur or has occurred in a processing core of the at least one processing core, increase the priority of sending input tensors via the plurality of bus circuits during the read cycle of that processing core, thereby increasing the bus bandwidth allocated to that processing core.

7. The system of claim 1, wherein the controller is configured to reduce the bandwidth of the plurality of bus circuits assigned to the processing core in response to determining that the processing core is in a compute-constrained state.

8. The system of claim 1, wherein the controller is configured to: receive a signal from one of the at least one processing core, the signal indicating that data starvation has occurred at that processing core; and in response to receiving the signal, increase the priority of sending input tensors to that processing core.

9. The system of claim 1, wherein the plurality of bus circuits are individually operated for the read operation and the write operation.

10. The system of claim 1, wherein each of the at least one processing core comprises a plurality of processing elements (PEs), wherein the plurality of PEs comprises at least one of a multiplication and accumulation (MAC) operator circuit, an adder tree circuit, or an arithmetic logic unit (ALU) operator circuit.

11. A method comprising: determining whether data starvation has occurred or is predicted to occur in at least one processing core; in response to determining whether the data starvation has occurred or is predicted to occur in the at least one processing core, determining the priority of memory access operations for each input tensor via a plurality of bus circuits; sending, according to the determined priority, each input tensor from memory to the at least one processing core via a first bus circuit of the plurality of bus circuits; processing each input tensor by the at least one processing core to generate each output tensor, the input tensors and the output tensors being associated with at least one neural network model; and sending each of the output tensors from the at least one processing core to the memory via a second bus circuit of the plurality of bus circuits.

12. The method of claim 11, wherein determining whether the data starvation has occurred or is predicted to occur further comprises: comparing the duration of the computation cycle for each input tensor with the duration of the memory access cycle associated with the next input tensor following each input tensor.

13. The method of claim 11, wherein determining whether the data starvation has occurred or is predicted to occur further comprises: comparing the duration of the computation cycle for each input tensor at the processing core with the duration of the memory cycle, the memory cycle including the write cycle of the previous tensor preceding each input tensor and the read cycle of the next input tensor following each input tensor.

14. The method of claim 11, wherein determining the priority of each input tensor comprises: in response to determining that a data starvation period is predicted to occur or has already occurred in a first processing core of the at least one processing core, increasing the bus bandwidth of a first read cycle allocated to the first processing core by reducing the bus bandwidth of a second read cycle allocated to a second processing core of the at least one processing core.

15. The method of claim 11, wherein determining the priority of each input tensor comprises: in response to determining that a data starvation period is predicted to occur or has already occurred in a processing core of the at least one processing core, increasing the priority of sending input tensors via the plurality of bus circuits during the read cycles of that processing core, to increase the bus bandwidth allocated to that processing core.

16. The method of claim 11, wherein determining the priority of each input tensor comprises: in response to determining that a processing core of the at least one processing core is in a compute-constrained state, reducing the bandwidth of the plurality of bus circuits assigned to that processing core.

17. The method of claim 11, further comprising: receiving, from a processing core of the at least one processing core, a signal indicating that data starvation has occurred; and in response to receiving the signal, increasing the priority of sending input tensors to that processing core.

18. The method of claim 11, wherein the priority of the memory access operation includes a first priority, a second priority, and a third priority. The second priority is higher than the first priority, and the third priority is higher than both the first priority and the second priority.

19. The method of claim 11, wherein determining whether the data starvation has occurred or is predicted to occur further comprises: performing counting, using a counter, concurrently with memory access operations; and in response to the count value reaching a threshold, determining that the data starvation has occurred.

20. The method of claim 19, wherein the threshold is pre-calculated during compilation.
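As a non-limiting illustration of the counter-based determination in claims 19 and 20, the sketch below counts clock ticks while a memory access is in flight and reports starvation once the count reaches a threshold. The class name `StarvationCounter`, its methods, and the choice to count one tick per clock are assumptions; the threshold is simply passed in as a constant, standing in for a value pre-calculated during compilation.

```python
class StarvationCounter:
    """Counts ticks during a memory access and flags starvation at a threshold."""

    def __init__(self, threshold):
        # Assumed to be supplied by the compiler ahead of time.
        self.threshold = threshold
        self.count = 0

    def start_access(self):
        """Reset the counter when a new memory access operation begins."""
        self.count = 0

    def tick(self):
        """Advance one clock while the access is in flight.

        Returns True once the count has reached the threshold.
        """
        self.count += 1
        return self.count >= self.threshold


counter = StarvationCounter(threshold=4)
counter.start_access()
flags = [counter.tick() for _ in range(5)]
print(flags)
```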