Instruction processing device, acceleration unit, and server

By introducing on-chip memory and direct memory access modules into the hardware acceleration unit of the neural network model, the sub-operations of the neural network application are decomposed and executed in parallel, solving the problem of low efficiency in the prior art and achieving more efficient computing performance and data processing speed.

CN115222015B · Active · Publication Date: 2026-06-12 · ALIBABA INNOVATION PRIVATE LIMITED

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
ALIBABA INNOVATION PRIVATE LIMITED
Filing Date
2021-04-21
Publication Date
2026-06-12

AI Technical Summary

Technical Problem

Existing hardware acceleration units suffer from inefficiency and frequent memory access when processing neural network models, especially when dealing with large-scale data, resulting in insufficient computational performance.

Method used

A novel instruction processing device and acceleration unit are designed, employing on-chip memory and direct memory access modules. By decomposing neural network applications into sub-operations and executing them in parallel on multiple execution units, the reliance on external memory is reduced, and data access efficiency is improved.

Benefits of technology

By optimizing the instruction set architecture and parallel processing, the computational performance and efficiency of neural network models have been significantly improved, the frequency of access to external memory has been reduced, and the data processing speed has been increased.

✦ Generated by Eureka AI based on patent content.

Abstract

Disclosed are an instruction processing apparatus and an acceleration unit. The instruction processing apparatus includes a plurality of instruction buffers, a register file including a plurality of entries, a selector, a parser, and an operation unit. The selector is configured to parse a command type and a buffer identification from a command, and provide received data and the buffer identification to the parser if the command type is configuration, or to the operation unit if the command type is execution. The parser is configured to parse an instruction sequence from the data, store the instruction sequence into an instruction buffer matched with the buffer identification, and store an operand of each instruction in the register file. The operation unit is configured to drive the instruction buffer matched with the buffer identification to execute each instruction therein one by one to generate a control signal, and a plurality of execution units are configured to perform corresponding operations based on the received control signal and the operand. The apparatus can be dedicated to processing various neural network applications.

Description

Technical Field

[0001] This disclosure relates to the field of neural networks, and more particularly to an instruction processing device, an acceleration unit, and a server.

Background Technology

[0002] Neural networks (NNs) have emerged as one of the most prominent technologies in the last decade, achieving breakthroughs and generating numerous practical applications in fields such as speech, image processing, big data, and biomedicine. At the same time, the industry is increasingly focused on improving the execution efficiency of neural network models, primarily through two approaches: on the software side, performance enhancement through algorithmic optimization; and on the hardware side, performance improvement through the design of various hardware acceleration units for neural network model execution. Regardless of the type of hardware acceleration unit, designing a sound instruction set architecture is crucial.

Summary of the Invention

[0003] The purpose of this disclosure is to provide an instruction processing apparatus and an acceleration unit that uses a novel instruction set architecture specifically for accelerating neural network models.

[0004] This disclosure provides an instruction processing apparatus, including: multiple instruction buffers, a register file, a selector, a parser, and an operation unit, wherein the register file includes multiple entries.

[0005] The selector is used to parse the command type and buffer identifier from the received command. If the command type is configuration, the received data and the buffer identifier are provided to the parser. If the command type is execution, the buffer identifier is provided to the operation unit.

[0006] The parser is used to parse the instruction sequence from the data, store the instruction sequence in the instruction buffer that matches the buffer identifier, and store the operands of each instruction in the instruction sequence in multiple entries of the register file.

[0007] The operation unit is used to drive the instruction buffer that matches the buffer identifier to execute each instruction therein one by one to generate control signals. The control signals of each instruction and the operands of the instruction in the register file are sent to multiple execution units, and each execution unit performs a corresponding operation based on the received control signals and operands.
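The configure/execute flow of paragraphs [0005] to [0007] can be illustrated with a small software model. This is only a sketch under stated assumptions: the class name `InstructionProcessor`, the command strings `"config"` and `"exec"`, and the `(opcode, operands)` tuple layout are hypothetical and are not taken from the disclosure, which describes a hardware device.

```python
from dataclasses import dataclass, field


@dataclass
class InstructionProcessor:
    """Toy model of the selector, parser and operation unit (hypothetical names)."""
    num_buffers: int = 4
    buffers: dict = field(default_factory=dict)        # buffer id -> list of opcodes
    register_file: dict = field(default_factory=dict)  # (buffer id, index) -> operands

    def handle(self, command, data=None):
        # Selector: parse the command type and buffer identifier from the command.
        cmd_type, buf_id = command
        if not 0 <= buf_id < self.num_buffers:
            raise ValueError(f"unknown buffer id {buf_id}")
        if cmd_type == "config":
            return self._parse(buf_id, data)
        if cmd_type == "exec":
            return self._execute(buf_id)
        raise ValueError(f"unknown command type {cmd_type!r}")

    def _parse(self, buf_id, data):
        # Parser: store the instruction sequence in the matching buffer and
        # the operands of each instruction in register-file entries.
        self.buffers[buf_id] = [opcode for opcode, _ in data]
        for index, (_, operands) in enumerate(data):
            self.register_file[(buf_id, index)] = operands
        return len(data)

    def _execute(self, buf_id):
        # Operation unit: step through the buffer one instruction at a time and
        # emit a (control signal, operands) pair for the execution units.
        signals = []
        for index, opcode in enumerate(self.buffers.get(buf_id, [])):
            signals.append((f"ctrl:{opcode}", self.register_file[(buf_id, index)]))
        return signals
```

In this model, a `("config", 1)` command with an instruction list fills buffer 1, and a later `("exec", 1)` command replays that buffer without resending the instructions.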

[0008] Optionally, the parser determines the operands of each instruction when it is executed on each execution unit based on the original operands of each instruction in the instruction sequence.

[0009] Optionally, the instruction processing apparatus further includes a scalar calculation unit, configured to calculate the operands of a specific instruction and update the operands of that specific instruction in the register file with the new operands.

[0010] Optionally, the instruction processing device supports multiple predefined instructions, and the instruction sequence consists of one or more of the multiple predefined instructions.

[0011] Optionally, the plurality of predefined instructions include a data loading instruction, and each execution unit obtains a first vector and a second vector according to the control signal of the data loading instruction and stores them in a first queue and a first buffer.

[0012] Optionally, the plurality of predefined instructions include a multiplication-accumulation instruction, wherein each execution unit outputs two values from the first queue and the second buffer according to the control signal of the multiplication-accumulation instruction to perform a multiplication-accumulation operation.

[0013] Optionally, the corresponding entries in the register file also store scalar pipeline states, which are used to specify the attributes of the first queue, and the attributes of the first queue are one of the following: first-in-first-out queue, rotating queue, and sliding window queue.

[0014] Optionally, the plurality of predefined instructions include data storage instructions, and each execution unit stores the intermediate calculation results generated by the execution unit into an external memory according to the control signal of the data storage instructions.

[0015] Optionally, the plurality of predefined instructions include special function instructions, and each execution unit activates the corresponding special function unit according to the control signal of the special function instruction.

[0016] Optionally, the instruction sequence comes from a specified neural network application, and the operation unit uses different instruction pipelines for different neural network applications.

[0017] Optionally, the specified neural network application is one of the following: matrix multiplication, convolution, and depthwise convolution.

[0018] Optionally, the operation unit uses a two-stage instruction pipeline of decoding stage and fetch stage when processing instruction sequences for matrix multiplication; and uses a four-stage instruction pipeline of decoding stage, fetch stage, scalar processing stage and write-back stage when processing convolution or depthwise convolution.
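The application-dependent pipeline choice in paragraphs [0016] to [0018] amounts to a lookup from application type to a list of stages. A minimal sketch follows; the dictionary keys and the function name `pipeline_for` are hypothetical labels chosen for illustration, while the stage lists follow paragraph [0018].

```python
# Stage lists follow paragraph [0018]; the string labels are illustrative only.
PIPELINES = {
    "matmul": ("decode", "fetch"),
    "conv": ("decode", "fetch", "scalar", "write_back"),
    "depthwise_conv": ("decode", "fetch", "scalar", "write_back"),
}


def pipeline_for(application):
    """Return the instruction pipeline stages used for a neural network application."""
    try:
        return PIPELINES[application]
    except KeyError:
        raise ValueError(f"unsupported application: {application!r}") from None
```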

[0019] In a second aspect, a cluster is provided, comprising the instruction processing device described in any of the preceding claims and a plurality of execution units coupled to the instruction processing device, wherein the cluster receives commands and data transmitted together with the commands.

[0020] In a third aspect, an acceleration unit for executing neural network models is provided, comprising:

[0021] a direct memory access module;

[0022] an on-chip memory including multiple storage units;

[0023] multiple cluster groups, each including multiple clusters, each cluster including an instruction processing device as described in any one of claims 1 to 12 and an execution unit coupled to the instruction processing device;

[0024] a command processor, used to decompose the operation associated with a specified neural network application into multiple sub-operations, convert the sub-operations into instruction sequences to be executed on the clusters, specify the operation data for each instruction sequence, load the operation data of the sub-operations over multiple transfers through the direct memory access module, and store the instruction sequences and operation data corresponding to the multiple clusters contained in each cluster group into the corresponding storage units;

[0025] Multiple distribution units are coupled to the multiple storage units respectively, and are also coupled to the multiple cluster groups respectively. Each distribution unit reads instruction sequences and operation data from the storage unit coupled to it, and sends the instruction sequences and operands to the multiple instruction processing devices coupled to it respectively.

[0026] In a fourth aspect, a server is provided, comprising:

[0027] The aforementioned acceleration unit;

[0028] A processing unit is configured to send an instruction to the acceleration unit to drive the acceleration unit to execute the specified neural network application;

[0029] A memory for storing weight data and activation data for the specified neural network application.

[0030] The instruction processing apparatus provided in this disclosure is essentially an instruction set architecture for processing neural network applications. It can be used to build acceleration units for various neural network models to improve the hardware acceleration capabilities of neural network models.

Attached Figure Description

[0031] The above and other objects, features, and advantages of this disclosure will become clearer from the following description of embodiments with reference to the accompanying drawings, in which:

[0032] Figure 1 is a hierarchical structure diagram of a data center;

[0033] Figure 2 is a three-dimensional structural diagram of a data center;

[0034] Figure 3 is a schematic structural diagram of a cloud server of a general architecture in a data center;

[0035] Figure 4 is a more detailed structural diagram of the cloud server in Figure 3;

[0036] Figure 5a is a design diagram of an exemplary PE cluster;

[0037] Figure 5b is a design diagram of another exemplary PE cluster;

[0038] Figure 5c is a schematic diagram of the cluster control unit in Figure 5a and Figure 5b;

[0039] Figure 6a is a schematic diagram of matrix multiplication;

[0040] Figure 6b and Figure 6c are schematic diagrams of convolution and depthwise convolution;

[0041] Figures 7a to 7c are three pieces of pseudocode;

[0042] Figure 8 is a schematic diagram of an exemplary two-dimensional matrix multiplication;

[0043] Figures 9a-9i show nine options for deploying the matrix multiplication shown in Figure 8 onto a PE array.

Detailed Implementation

[0044] The present disclosure is described below based on embodiments, but it is not limited to these embodiments. In the detailed description of the present disclosure below, certain specific details are described in detail. Those skilled in the art will fully understand the present disclosure even without these details. To avoid obscuring the substance of the present disclosure, well-known methods, processes, and procedures are not described in detail. Furthermore, the accompanying drawings are not necessarily drawn to scale.

[0045] The following terms are used in this document.

[0046] Acceleration Unit: A processing unit designed to improve data processing speed in specialized areas (e.g., image processing, neural network operations) where general-purpose processors are inefficient. It is often used together with a general-purpose CPU and under the CPU's control to perform specific tasks, thereby improving computer processing efficiency in those areas. It can also be called an AI processing unit and may include graphics processing units (GPUs), central processing units (CPUs), application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), and dedicated AI acceleration hardware (e.g., acceleration units).

[0047] On-chip memory: Memory that is used independently within a primary core or a secondary core and cannot be shared.

[0048] Command processor: The command interface between the acceleration unit and the central processing unit (CPU) that drives the acceleration unit. The command processor receives the instructions that the CPU directs the acceleration unit to execute, and then distributes these instructions to the various components within the acceleration unit for execution. It is also responsible for synchronizing the various components within the acceleration unit.

[0049] Lifecycle (lifetime): An operand is not necessarily used throughout the entire instruction sequence. The lifetime of an operand is the portion of the instruction sequence between the instruction in which it first appears and the last instruction in which it is used. In other words, after its lifetime the operand is no longer used, and there is no need for it to remain in on-chip memory.
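The lifetime definition in paragraph [0049] can be made concrete with a short sketch that scans an instruction sequence and records the first and last use of each operand. The function names and the `(opcode, operands)` instruction layout are assumptions made for illustration; the disclosure does not specify how lifetimes are computed.

```python
def operand_lifetimes(instructions):
    """Return {operand: (first_index, last_index)} for a list of (opcode, operands)."""
    lifetimes = {}
    for index, (_, operands) in enumerate(instructions):
        for name in operands:
            first, _ = lifetimes.get(name, (index, index))
            lifetimes[name] = (first, index)  # extend the lifetime to this use
    return lifetimes


def releasable_after(instructions):
    """Map an instruction index to the operands whose lifetime ends at that index."""
    ends = {}
    for name, (_, last) in operand_lifetimes(instructions).items():
        ends.setdefault(last, []).append(name)
    return ends
```

An operand listed under index i by `releasable_after` is not referenced by any later instruction, so under this model its on-chip storage could be reused after instruction i completes.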

[0050] Neural networks, generally referring to Artificial Neural Networks (ANNs), are algorithmic networks that mimic the behavioral characteristics of animal neural networks to perform distributed parallel information processing. A classic neural network, also the simplest neural network structure, consists of three layers: an input layer, an output layer, and intermediate layers (also known as hidden layers). Each of the input, output, and intermediate layers contains multiple nodes.

[0051] Neural network model: In a neural network, nodes are mathematically represented, resulting in a mathematical model of the nodes. The numerous mathematical models of nodes in a neural network constitute the neural network model.

[0052] Deep learning models: The concept of deep learning originates from the study of neural networks. A neural network containing multiple intermediate layers is called a deep learning network. Therefore, in this sense, a deep learning model is also a type of neural network model. Both deep learning models and neural network models must be trained. Sample data is input into a pre-designed network structure, feature information is extracted through multiple intermediate layers, and the weights of each node are continuously adjusted based on the output of the output layer, making the output of the output layer increasingly closer to the preset result, until the final weights are determined. A trained deep learning model can be truly applied to real-world scenarios, and the usage of the deep learning model in real-world scenarios can be collected to further optimize the deep learning model.

[0053] A node is the smallest independent unit of computation in a deep learning model. It receives input, processes it using its own weights or other model parameters (such as hyperparameters), and then produces an output. A deep learning model can include various specific operations such as convolution and pooling, and consequently, various operation nodes such as convolution nodes and pooling nodes. A deep learning model has multiple layers, each with multiple nodes, and the output of each node is the input of the node in the next layer. Furthermore, a node includes the program code for the specific operation and related data. For example, a convolution operation node includes the program code used for the convolution operation and some data used in the convolution.

[0054] An operator is a set of operations built into a deep learning model to achieve a specific function. Each layer of a deep learning model can contain multiple such operators. In the TensorFlow framework, they are called operations, and in the Caffe framework, they are called layers. Operators can be viewed as a further implementation based on nodes; one operator can correspond to one or more nodes. Therefore, operators and nodes sometimes correspond to the same program and data.

[0055] Instruction set (instruction set architecture): The set of instructions supported by the chip for performing operations. For example, it mainly supports deep learning operators such as Convolution, Pooling, and ROI.

[0056] Data Center

[0057] Figure 1 illustrates the layered structure of a data center, which is one scenario to which this disclosure applies.

[0058] Data centers are globally collaborative networks of specific devices used to transmit, accelerate, display, compute, and store data information on the internet infrastructure. In the future, data centers will become a key competitive asset for businesses. With the widespread application of data centers, artificial intelligence and other technologies are increasingly being used in them. Neural networks, as a crucial technology in artificial intelligence, are already being extensively applied to big data analytics and computation within data centers.

[0059] In traditional large data centers, the network architecture is typically the three-layer structure shown in Figure 1, that is, a hierarchical inter-networking model. This model consists of the following three layers:

[0060] Access Layer 103: Sometimes also called the edge layer, this layer includes access switches 130 and the servers 140 connected to them. Each server 140 is the processing and storage entity of the data center; the processing and storage of large amounts of data in the data center are performed by these servers. Access switches 130 are used to connect these servers to the data center. One access switch 130 connects multiple servers 140. Access switches 130 are typically located at the top of the rack, so they are also called top-of-rack switches; they physically connect the servers.

[0061] Aggregation Layer 102: Sometimes also called the distribution layer, this includes aggregation switches 120. Each aggregation switch 120 connects multiple access switches and provides other services such as firewalls, intrusion detection, and network analysis.

[0062] Core Layer 101: Includes Core Switch 110. Core Switch 110 provides high-speed forwarding for packets entering and leaving the data center and provides connectivity for multiple aggregation layers. The entire data center network is divided into L3 routing networks and L2 routing networks, and Core Switch 110 typically provides a flexible L3 routing network for the entire data center network.

[0063] Typically, aggregation switch 120 serves as the boundary between L2 and L3 layer routing networks. Below aggregation switch 120 is the L2 network, and above it is the L3 network. Each aggregation switch group manages one Point of Delivery (POD), and each POD contains an independent VLAN network. Server migration within a POD does not require modification of IP addresses and default gateways, as one POD corresponds to one L2 broadcast domain.

[0064] The Spanning Tree Protocol (STP) is typically used between the aggregation switches 120 and the access switches 130. STP ensures that only one aggregation switch 120 is available for a given VLAN network; the other aggregation switches 120 are used only in case of failure (dashed lines in the figure). In other words, horizontal scaling is not possible at the aggregation layer, because even if multiple aggregation switches 120 are added, only one will be operational.

[0065] Figure 2 shows the physical connections between the components of the layered data center of Figure 1. As shown in Figure 2, a core switch 110 connects to multiple aggregation switches 120, an aggregation switch 120 connects to multiple access switches 130, and an access switch 130 connects to multiple servers 140.

[0066] Cloud Server

[0067] The cloud server 140 is the actual equipment in the data center. Because the cloud server 140 operates at high speed to perform various tasks such as matrix calculations, image processing, machine learning, compression, and search ranking, it typically includes a central processing unit (CPU) and various acceleration units to complete these tasks efficiently, as shown in Figure 3. Acceleration units include, for example, one of the following: acceleration units dedicated to neural networks, data transmission units (DTUs), graphics processing units (GPUs), application-specific integrated circuits (ASICs), or field-programmable gate arrays (FPGAs). Each acceleration unit is introduced below, taking Figure 3 as an example.

[0068] Data Transmission Unit (DTU) 260: This is a wireless terminal device specifically designed for converting serial data to IP data or vice versa for transmission over a wireless communication network. The main function of the DTU is to wirelessly transmit data from remote devices back to the backend center. At the front end, the DTU and the client's device are connected via an interface. After powering on, the DTU first registers with the mobile GPRS network and then establishes a socket connection with the backend center configured within the DTU. The backend center acts as the server for the socket connection, and the DTU acts as the client. Therefore, the DTU and the backend software work together to enable wireless data transmission between the front-end device and the backend center after the connection is established.

[0069] Graphics Processing Unit (GPU) 240: This is a processor dedicated to image and graphics-related computations. By using a GPU, the limitation of insufficient computing units in the CPU is overcome. By employing a large number of computing units specifically for graphics calculations, the graphics card reduces its dependence on the CPU and takes over some of the computationally intensive image processing tasks that the CPU originally handled.

[0070] Application-Specific Integrated Circuit (ASIC): This refers to an integrated circuit designed and manufactured to meet the specific requirements of a user and the needs of a particular electronic system. Because such integrated circuits are customized according to user requirements, their structure is often adapted to those specific user requirements.

[0071] Field-Programmable Gate Arrays (FPGAs) are a further development based on programmable devices such as PALs and GALs. They emerged as a semi-custom circuit in the field of Application-Specific Integrated Circuits (ASICs), addressing the shortcomings of custom circuits while overcoming the limitation of the limited number of gate circuits in traditional programmable devices.

[0072] Acceleration Unit 230 for Neural Network Models: This is a processing unit using a data-driven parallel computing architecture to handle a large number of operations (such as convolution and pooling) at each neural network node. Since the data and intermediate results in the numerous operations (such as convolution and pooling) at each neural network node are closely linked throughout the computation process and are frequently used, existing CPU architectures suffer from inefficiencies due to the limited memory capacity within the CPU core, necessitating frequent access to external memory. By employing an acceleration unit with on-chip memory of suitable capacity for neural network computation, frequent access to external memory is avoided, significantly improving processing efficiency and computational performance.

[0073] While the acceleration unit 230 offers significantly higher execution efficiency than ordinary processors for specific applications or domains, it is still subject to the control of the processing unit 220. Taking an acceleration unit dedicated to deep learning models as an example, the memory 210 stores various deep learning models, including the neurons and their weights. When needed, these deep learning models are deployed to an acceleration unit 230 by a processing unit 220 in Figure 3. Specifically, the processing unit 220 can inform the acceleration unit 230 of the storage location of the deep learning model in the memory 210 in the form of instructions. The acceleration unit 230 can then address according to these locations and store the instructions to be executed in its on-chip memory. The processing unit 220 can also send the instructions to be executed to the acceleration unit 230 in the form of instructions, and the acceleration unit 230 receives the instructions and stores them in the on-chip memory. The acceleration unit 230 can also obtain input data in the above manner. Once the acceleration unit 230 obtains the instructions to be executed and the input data, it performs inference calculations. The weight data of the nodes can be included in the instruction sequence of the deep learning model and retrieved from the memory 210 by the acceleration unit 230. Of course, the weight data of the nodes can also be stored independently and retrieved from the memory 210 by the acceleration unit 230 when needed. The processing unit 220 is a hardware unit with scheduling and control capabilities, generally a central processing unit (CPU), microcontroller, microprocessor, or other hardware unit.

[0074] Acceleration Unit of the Embodiments of the Present Disclosure

[0075] With reference to Figure 4, the following describes the internal structures of the processing unit 220 and the acceleration unit 2301 provided in the embodiments of this disclosure, as well as how the processing unit 220 controls the operation of the acceleration unit 2301.

[0076] As shown in Figure 4, the processing unit 220 includes multiple processor cores 222 and a cache 221 shared by the multiple processor cores 222. Each processor core 222 includes an instruction fetch unit 223, an instruction decode unit 224, an instruction issue unit 225, and an instruction execution unit 226.

[0077] The instruction fetch unit 223 moves the instruction to be executed from memory 210 into an instruction register (which may be one of the registers in the register file 229 shown in Figure 4 that is used to store instructions), and receives or calculates the next fetch address according to a fetch algorithm, which may include, for example, incrementing or decrementing the address based on the instruction length.

[0078] After the instruction is fetched, the processing unit 220 enters the instruction decoding stage. The instruction decoding unit 224 decodes the fetched instruction according to a predetermined instruction format to obtain the operand fetch information required by the fetched instruction, thereby preparing for the operation of the instruction execution unit 226. Operand fetch information includes, for example, pointers to immediate values, registers, or other software / hardware that can provide source operands.

[0079] The instruction issuing unit 225 is located between the instruction decoding unit 224 and the instruction execution unit 226. It is used for instruction scheduling and control to efficiently allocate each instruction to different instruction execution units 226, making parallel operation of multiple instructions possible.

[0080] After instruction issuing unit 225 sends an instruction to instruction execution unit 226, instruction execution unit 226 begins executing the instruction. However, if instruction execution unit 226 determines that the instruction should be executed by an acceleration unit, it forwards it to the corresponding acceleration unit for execution. For example, if the instruction is a neural network inference instruction, instruction execution unit 226 will not execute the instruction but will instead send it to acceleration unit 230 via the bus for execution.

[0081] The acceleration unit 2301 includes a bus channel 231, a direct memory access module 235, an on-chip memory 236, a distribution unit 237, a command processor 238, and a PE array.

[0082] Bus channel 231 is the channel through which instructions enter and exit the acceleration unit 230. Depending on the mechanism, bus channel 231 may include PCIe channel 232, I2C channel 233, and JTAG channel 234. PCIe, or PCI-Express, is a high-speed serial computer expansion bus standard proposed by Intel in 2001 to replace older PCI, PCI-X, and AGP bus standards. PCIe is a high-speed serial point-to-point dual-channel high-bandwidth transmission system where connected devices are allocated dedicated channel bandwidth and do not share bus bandwidth. It primarily supports active power management, error reporting, end-to-end reliable transmission, hot-plugging, and Quality of Service (QoS) functions. Its main advantage is its high data transfer rate, and it has considerable development potential. Currently, most PCIe buses are PCIe Gen3, but this embodiment can also use PCIe Gen4, i.e., a bus channel conforming to the PCI-Express 4.0 standard. I2C channel 233 is a simple, bidirectional two-wire synchronous serial bus channel developed by Philips. It requires only two wires to transmit information between devices connected to the bus. JTAG is short for Joint Test Action Group, and is the common name for IEEE standard 1149.1, also known as Standard Test Access Ports and Boundary Scan Architecture. This standard is used to verify the functionality of designed and manufactured printed circuit boards. JTAG was officially standardized in 1990 by IEEE document 1149.1-1990, and in 1994, supplementary documentation was added to describe the Boundary Scan Description Language (BSDL). Since then, this standard has been widely adopted by electronics companies worldwide. Boundary scan has almost become synonymous with JTAG. JTAG Channel 234 is a bus channel conforming to this standard.

[0083] The Direct Memory Access (DMA) module 235 implements a function provided by some computer bus architectures that allows data to be written directly from an external device (such as external memory) to the on-chip memory 236 of the acceleration unit 2301. This method significantly improves the data access efficiency of the acceleration unit 2301 compared to obtaining data through the processing unit 220. Because of this mechanism, the acceleration unit 230 can directly access memory 210 to read the weights and activation data of the deep learning model. Although the diagram shows the DMA module 235 located between the command processor 238 and the bus channel 231, the design of the acceleration unit 2301 is not limited to this. Furthermore, in some hardware designs, each PE unit can include a DMA module 235 to directly read data from an external device and write it to the on-chip memory 236.

[0084] In neural network models, applications such as matrix operations, convolution, and depthwise convolution involve a large amount of input data, which typically cannot be imported into the acceleration unit 2301 all at once. Therefore, when it is determined that an application cannot be completed in one pass, the command processor 238 of the acceleration unit 2301 in this embodiment decomposes the neural network application to be executed into multiple sub-operations, converts the sub-operations into instruction sequences (each containing multiple instructions) to be executed on each PE cluster of the multiple PE cluster groups, loads the operation data required by each sub-operation over multiple transfers through the direct memory access module 235, specifies the operation data for each instruction sequence, and finally stores the instruction sequences and operation data corresponding to the multiple PE clusters contained in each PE cluster group into the corresponding storage units. Optionally, the operation data of a sub-operation is evenly distributed across the instruction sequences.
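As an illustration of the decomposition in paragraph [0084], the sketch below computes a matrix product in several sub-operations, each of which loads only part of the left-hand matrix and spreads the loaded rows evenly over a number of clusters. It is a software analogy under stated assumptions: splitting along the rows of `a`, the parameter names `rows_per_load` and `num_clusters`, and both function names are hypothetical, and the slicing merely stands in for a DMA transfer.

```python
def split_evenly(total, parts):
    """Split range(total) into `parts` contiguous, near-equal (start, stop) chunks."""
    base, extra = divmod(total, parts)
    chunks, start = [], 0
    for i in range(parts):
        stop = start + base + (1 if i < extra else 0)
        chunks.append((start, stop))
        start = stop
    return chunks


def matmul_in_sub_operations(a, b, rows_per_load, num_clusters):
    """Compute a @ b (lists of lists) while loading `rows_per_load` rows of `a` at a time."""
    n_cols = len(b[0])
    result = []
    for load_start in range(0, len(a), rows_per_load):   # one sub-operation per load
        tile = a[load_start:load_start + rows_per_load]   # stands in for a DMA load
        for start, stop in split_evenly(len(tile), num_clusters):
            for row in tile[start:stop]:                  # rows assigned to one cluster
                result.append(
                    [sum(x * b[k][j] for k, x in enumerate(row)) for j in range(n_cols)]
                )
    return result
```

Because the split here is along output rows, the per-cluster results are simply concatenated; other splits (for example along the reduction dimension) would need an explicit integration step.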

[0085] It should be noted that each sub-operation produces an intermediate result, so the intermediate results of multiple sub-operations must be integrated into the final result. Because the intermediate results are generated in the PE clusters, where storage space is limited, they cannot be kept there indefinitely. The instruction sequence must therefore include writing the intermediate results from the PE cluster back to the corresponding storage unit, or exporting them to memory 210 via the corresponding storage unit. The integration step, performed after all or part of the sub-operations are completed, can be carried out in several ways. For example, the intermediate results of multiple PE clusters coupled to the same distribution unit (in Figure 4, the PE clusters belonging to the same row) are integrated first, and then the intermediate results of multiple PE cluster groups are integrated.
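As a non-limiting illustration of the decomposition and integration described above, the following Python sketch splits one dot product into sub-operations, computes an intermediate result for each, and then integrates them. The function names and the chunk size are hypothetical and chosen only for illustration; the sketch models the data flow, not the hardware.

```python
def split_into_sub_operations(weights, activations, chunk):
    """Split a dot product into sub-operations of at most `chunk` elements each."""
    for start in range(0, len(weights), chunk):
        yield weights[start:start + chunk], activations[start:start + chunk]


def run_sub_operation(w_chunk, a_chunk):
    """Stand-in for one PE cluster: multiply-accumulate over one chunk of data."""
    return sum(w * a for w, a in zip(w_chunk, a_chunk))


def integrate(partials):
    """Integration step: combine the intermediate results into the final result."""
    return sum(partials)


weights = [1, 2, 3, 4, 5, 6]
activations = [6, 5, 4, 3, 2, 1]

# Each sub-operation yields one intermediate result.
partials = [run_sub_operation(w, a)
            for w, a in split_into_sub_operations(weights, activations, chunk=2)]
result = integrate(partials)
```

The integrated result equals the dot product computed in a single pass, which is the property the integration step must preserve.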

[0086] As shown in the figure, the command processor 238 is coupled to the memory 236, which is divided into multiple storage units. Each storage unit is coupled to a corresponding distribution unit, and each distribution unit is coupled to a PE cluster group consisting of multiple PE clusters. Each distribution unit retrieves the instruction sequence and operation data executable on the PE clusters from its coupled storage unit and distributes them to the PE clusters coupled to it. It should be noted that each PE cluster group is designed to contain the same number of PE clusters, and each PE cluster has the same function and hardware structure. Therefore, the instruction sequences deployed on the PE clusters can be identical, and the instruction sequence needs to be sent to each PE cluster, together with its operation data, only during the first sub-operation; in subsequent sub-operations, only new operation data is sent to the PE clusters.

[0087] As an example in the diagram, there are n storage units, n distribution units, and PE clusters arranged in n rows and m columns. Each distribution unit is coupled to a row of PE clusters via a first bus. If the PE clusters in a row need to obtain the same data, the distribution unit broadcasts the data to the PE clusters in that row via the first bus. Otherwise, the distribution unit sends the instruction sequence and operation data separately to each PE cluster coupled to it via the first bus. As shown in the diagram, each PE cluster further includes k functionally identical PE units, thus forming a three-dimensional PE array with dimensions n*m*k, where m, n, and k are all integers greater than 1. Of course, based on the same inventive concept, two-dimensional or higher-dimensional PE arrays can also be designed.

[0088] Figure 5a is a design diagram of an exemplary PE cluster. As shown in the diagram, the PE cluster 500 includes a cluster control unit 602 and multiple functionally identical PE units coupled to the cluster control unit 602. The cluster control unit 602 receives an instruction sequence that includes a data loading instruction. The cluster control unit 602 controls each PE unit to execute the same instruction sequence and, through the control signals it generates, can control the execution of the data loading instruction so that the PE units load different operation data from different data addresses and therefore obtain different intermediate results.

[0089] The PE controller 501 is contained in each PE unit. Each PE unit also includes a data loading unit 502, a weight queue 503, an input buffer 504, an index comparison unit 505, a selector 511, a multiply-accumulate unit 506, a buffer 508, an output queue 513, selectors 514-516, a special control unit 509, and a special function unit 510.

[0090] The data loading unit 502 is responsible for receiving input data from the distribution unit 601 and storing the input data into the weight queue 503 or the input buffer 504 according to the data type of the input data. The data types of the input data include weight data and activation data. The weight data is stored in the weight queue 503, and the activation data is stored in the input buffer 504. At the same time, the data loading unit 502 generates a bitmask for the activation data by checking whether each value of the activation data (i.e., each item of the matrix) is equal to 0. The bitmask of the activation data thus indicates whether each value of the activation data is 0. For example, when a value is 0, the corresponding bit of the bitmask is set to 0.
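The bitmask generation described above can be sketched in a few lines of Python. The function name `make_bitmask` is hypothetical, and the activation data is modelled as a flat list for simplicity.

```python
def make_bitmask(activations):
    """Return one bit per activation value: 0 if the value is 0, otherwise 1."""
    return [0 if value == 0 else 1 for value in activations]


activations = [0.0, 1.5, 0.0, -2.0]
bitmask = make_bitmask(activations)
```

Only positions whose bit is 1 need to take part in later multiply-accumulate operations.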

[0091] In some embodiments, when compiling and deploying a sparse neural network model, the processing unit 220 organizes and stores the weight data in the form of "non-zero values + weight indices". Therefore, when the weight data enters the PE cluster through the distribution unit 601, the weight data loaded into the weight queue 503 consists of weight indices and the corresponding non-zero values (in the weight queue 503 in the figure, different patterns are used to mark the weight indices and the non-zero values corresponding to the weight indices). In other embodiments, before the weight data enters the weight queue 503, the distribution unit 601 and the command processor 238 complete the conversion of the weight data into the form of "non-zero values + weight indices" for organization and storage. Both of these implementations are particularly suitable for sparse neural network models.

[0092] Referring to the diagram, to achieve streaming storage of weight data, the weight queue 503 adopts a queue-like architecture. The storage units constituting the weight queue 503 can be shift registers, and the queue can form a loopback path to support the reuse of weight data during convolution operations. The loopback path means that the queue is connected end-to-end: when a write and/or read operation is performed at the tail of the queue, the next write and/or read returns to the head of the queue.
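The loopback behaviour can be illustrated with a small software model. The class name `LoopbackQueue` is hypothetical; the sketch only models the read pointer wrapping from the tail back to the head, not the shift-register implementation.

```python
class LoopbackQueue:
    """Queue whose read pointer wraps from the tail back to the head."""

    def __init__(self, values):
        self._values = list(values)
        self._read = 0

    def pop(self):
        """Return the next value without discarding it, so it can be reused."""
        value = self._values[self._read]
        # After reading the tail entry, the next read returns to the head.
        self._read = (self._read + 1) % len(self._values)
        return value


queue = LoopbackQueue([10, 20, 30])
reads = [queue.pop() for _ in range(5)]
```

Because entries are not discarded when read, the same weights can be replayed for successive positions of a convolution.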

[0093] Input buffer 504 stores activation data and a bitmask generated based on the activation data. Although not shown, the activation data here should include an activation index and an activation value corresponding to the activation index, plus the bitmask based on the activation index stored in input buffer 504.

[0094] The index comparison unit 505 is responsible for generating the payload, which refers to matrix operations based on non-zero weights and activation data. The index comparison unit 505 includes an adder and a comparator. The adder adds the weight index and the base address (the weight index is received from the weight queue 503, and the base address is obtained from the cluster control unit 602) to obtain the input index. The comparator receives the input index from the adder and compares it with the index value output by the input buffer 504. If they are the same and the bitmask indicates that the corresponding value is not 0, a control signal is generated and provided to the control terminal of the selector 511, causing the input buffer 504 to output the value corresponding to the input index and provide it to the multiply-accumulate unit 506. The multiply-accumulate unit 506 performs multiply-accumulate operations, stops the multiply-accumulate operation according to the control signal from the PE controller 501, and outputs the accumulation result to the buffer 508.
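A simplified software model of the index comparison and multiply-accumulate path is sketched below. It assumes that weights arrive as (weight index, non-zero value) pairs and that the input buffer is a dense list addressed by the input index; the function name `sparse_mac` and the sample data are hypothetical.

```python
def sparse_mac(weight_entries, base, activations, bitmask):
    """Accumulate weight * activation only where the activation is non-zero."""
    acc = 0
    for weight_index, weight_value in weight_entries:
        input_index = base + weight_index      # adder: weight index + base address
        if bitmask[input_index] == 1:          # bitmask check: activation is non-zero
            acc += weight_value * activations[input_index]  # multiply-accumulate
    return acc


activations = [0, 3, 0, 5, 2]
bitmask = [0, 1, 0, 1, 1]
weights = [(0, 4), (1, 2), (3, 7)]             # (weight index, non-zero value)

result_base0 = sparse_mac(weights, 0, activations, bitmask)
result_base1 = sparse_mac(weights, 1, activations, bitmask)
```

Changing the base address shifts which activations the same weights are applied to, while multiplications by zero activations are skipped in both cases.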

[0095] As shown in the figure, as an optional embodiment, the multiply-accumulate unit 506 includes a multiplier 5066, an adder 5061, a selector 5062, multiple accumulation buffers 5063, multiple selectors 5064, a selector 5065, and a selector 5067. The output of multiplier 5066 is coupled to the input of adder 5061. The output of adder 5061 is coupled to the input of selector 5062. Multiple outputs of selector 5062 are coupled to the multiple accumulation buffers 5063 respectively. The multiple accumulation buffers 5063 are coupled to the inputs of the multiple selectors 5064. The two outputs of each selector 5064 are coupled to the inputs of selector 5065 and selector 5067 respectively. The output of selector 5065 is coupled to the input of adder 5061. The output of selector 5067 is coupled to an external buffer 508.

[0096] In the multiply-accumulate unit 506, the product generated by multiplier 5066 is accumulated by adder 5061. The accumulated result is input to selector 5062, which determines, based on a control signal from PE controller 501, which of the accumulation buffers 5063 stores the accumulated result. The multiple accumulation buffers 5063 are shown as four homogeneous accumulation buffers in the figure. The accumulated result stored in an accumulation buffer 5063 is transferred to different submodules depending on the operation. As shown in the figure, the accumulated result can be transferred to adder 5061 via selectors 5064 and 5065 for continued accumulation. The accumulated result can also be stored in output queue 513 via buffer 508 and selectors 515 and 516. Output queue 513 can store the accumulated results of multiple operations. These intermediate results can be transferred to memory 236 via distribution unit 601, and can further be transferred to external memory 210. The accumulated result can also be held in the output queue 513 as an intermediate result for an extended period and provided to the four accumulation buffers 5063 when appropriate, so that multiple accumulated results can be accumulated again. The accumulated result can also be provided to the special function unit 510 via selector 516. The accumulated result in the output queue 513 can also be provided to the special function unit 510 via selector 514.

[0097] Special function unit (SFU) 510 is used to execute all special functions required by the neural network model (e.g., activation functions or reduction). SFU 510 can be coupled to multiple parallel PE units via a message queue/FIFO interface. SFU 510 has its own instruction path and operates asynchronously with respect to all parallel PE units. Therefore, SFU 510 needs only a small number of hardware operators to match the throughput of multiple PE units while minimizing area and power consumption. Depending on the specific application, SFU 510 can operate in two modes: chained mode and decoupled mode. Chained mode is typically used for element-wise special functions, such as the activation functions of a neural network model. Ordinarily, data in the accumulation buffers 5063 of the multiply-accumulate unit 506 is written to the output queue 513, then SFU 510 reads from the output queue 513 to execute the special function and writes the final result back to the output queue 513. In chained mode, however, the data in the accumulation buffers is transferred directly to SFU 510 instead of to the output queue 513. In this way, SFU 510 only needs the local output buffer address corresponding to each PE unit, and accesses to the output queue 513 are reduced by 2/3. Decoupled mode is typically used for special functions, such as reduction, that require data from parallel PE units (the input data is interleaved across all PE units). When executing these special functions, data in the queue of SFU 510 uses tags/tokens to identify which PE unit the data belongs to. Using the tags/tokens, SFU 510 can effectively determine whether the current special function has been completed. Unlike chained mode, decoupled mode requires a global output buffer address to flexibly access the output data of any PE unit.

[0098] Figure 5b is a design diagram of another exemplary PE cluster. This design takes load balancing into account. As shown in the diagram, the hardware modules related to load balancing include: an index comparison unit 505, load queues 518 and 519, a load balancer 517, and a selector 520.

[0099] As shown in the figure, the index comparison unit 505 is responsible for generating the payload, which refers to matrix operations based on non-zero weights and activation data. The index comparison unit 505 includes an adder and a comparator. The adder adds the weight index and the base address (the weight index is received from the weight queue 503, and the base address is obtained from the cluster control unit 602) to obtain the input index corresponding to the current weight value. The comparator uses this input index to look up the bitmask in the input buffer 504. If the bitmask bit corresponding to the input index is equal to 1, a payload is found, and the input index is stored in the load queue 518. Note that the index comparison unit 505 processes two weight indices in parallel and generates 0, 1, or 2 payloads depending on activation sparsity.

[0100] Based on the design of the index comparison unit 505, the PE unit adopts a two-level load queue design. The first-level load queue 518 uses a double-width design to match the push rate of the index comparison unit 505 (0, 1, or 2 payloads in parallel). The second-level load queue 519 adopts a standard single-width design so that one input index (corresponding to one payload) can be popped at a time. The first-level and second-level load queues are therefore loosely synchronized to handle the width mismatch.

[0101] In this embodiment, load balancing is implemented as follows. When the number of input indices (i.e., the number of payloads) in the load queue 518 of the left PE unit is significantly less than the number of input indices in the load queue 518 of the right PE unit, the load is imbalanced. When the load balancer 517 in the right PE unit detects this imbalance, it performs load balancing at run time: input indices popped from the local load queue 518 are pushed to the load queue 518 of the left PE unit instead of to the local load queue 519. Furthermore, to support load balancing, the capacity of the input buffer 504 and the weight queue 503 in each PE unit needs to be doubled, with the added portion used to store the activation and weight data of the right PE unit. This allows the left PE unit, after obtaining an input index from the right PE unit, to retrieve the activation and weight data directly from its own input buffer 504 and weight queue 503 for calculation. Of course, load balancing is not limited to this embodiment, and other implementations are possible. For example, the roles of the left and right PE units in this embodiment can be swapped. As another example, each PE unit can periodically report its load status to the distribution unit 601, which, based on the received load status of each PE unit, sends the data and instructions to be executed to the PE units with lower load.
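The routing decision made by the load balancer can be sketched as follows. This is a minimal model under stated assumptions: the threshold that defines "significantly less" is hypothetical (the text does not give a value), the left PE unit does not drain its queue during the example, and the function name `route_one` is illustrative only.

```python
from collections import deque


def route_one(left_q518, right_q518, right_q519, threshold=2):
    """Pop one input index from the right PE's first-level queue and route it.

    The index goes to the left PE's first-level queue when the right queue
    holds at least `threshold` more entries than the left one (assumed rule);
    otherwise it goes to the right PE's own second-level queue.
    """
    if not right_q518:
        return None
    imbalance = len(right_q518) - len(left_q518)
    index = right_q518.popleft()
    if imbalance >= threshold:
        left_q518.append(index)
        return "left"
    right_q519.append(index)
    return "local"


left = deque([7])
right = deque([1, 2, 3, 4, 5])
right_second = deque()

routes = []
while right:
    routes.append(route_one(left, right, right_second))
```

In this run the first two indices are handed to the left PE unit, after which the queues are even and the remaining indices stay local.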

[0102] As shown in the figure, the input index popped from the load queue 519 is used to obtain the corresponding non-zero activation data. The popped input index is provided to the control terminal of selector 511, which outputs the activation value matching the input index to multiplier 5066. Multiplier 5066 simultaneously receives the weight value matching the input index from weight queue 503 and performs the multiplication. The product generated by multiplier 5066 is then provided to adder 5061 for accumulation. Adder 5061 also receives the partial sum of the PE unit on the right as input. The accumulated result output by adder 5061 is provided to selector 5062 to determine which buffer stores it. As shown in the figure, each PE unit has four homogeneous accumulation buffers to maintain the partial results of different inputs. Depending on the operation to be performed, the data in the accumulation buffers is transferred to different sub-modules: it can be provided to adder 5061 for further accumulation; provided by selector 516 to the input terminal of the selector 520 of the left PE unit; stored as an output result in the output queue 513; or provided as input data to special function units for functions such as activation and reduction.

[0103] It should be understood that the PE cluster structure shown in Figure 5b is an improved design based on Figure 5a. Therefore, structures and functions already described with reference to Figure 5a are not described again in detail for Figure 5b. Figure 5c is a schematic diagram of the cluster control unit in Figures 5a and 5b. The cluster control unit 602 is located in the PE cluster and is used to generate multiple control signals for the multiple PE units coupled to it. As shown in the figure, the cluster control unit 602 includes a selection unit 603, multiple instruction buffers 1 to X, a scalar calculation unit 604, a register file 605, a parser 606, and an operation unit 607. The selection unit 603 includes one input terminal and two output terminals, the two output terminals being coupled to the parser 606 and the operation unit 607 respectively. The parser 606 is coupled to the multiple instruction buffers 1-X. The operation unit 607 is coupled to the multiple instruction buffers 1-X. All instruction buffers 1-X are coupled to PE units 1-X (for simplicity, only the lines connecting instruction buffers 1-X to PE unit 1 are shown in the figure).

[0104] Instruction buffers 1-X are used to store the instruction sequence for a specified neural network application (target application). Note that each instruction buffer can store multiple instructions and execute them according to its own program counter. Each instruction buffer uses an implicit loopback program counter (PC) to automatically issue repeated instructions. These instruction buffers can operate independently for different hardware modules, and each instruction buffer enjoys an independent instruction pipeline unaffected by the execution of other instruction sequences (decoupled execution pipeline). Notably, for convenient and efficient control of the instruction pipeline, these instruction buffers can trigger each other's execution. The operands of each instruction in the instruction buffer are stored in different entries in register file 605 (register file 605 in the figure includes entries 1-N). Scalar computation unit 604 is responsible for performing scalar computations on the operands of the executed instructions and updating the corresponding entries in the register file based on the computation results. Furthermore, since the PE unit performs vector multiplication, the scalar calculation unit 604 can also receive partial intermediate calculation results from multiple PE units and perform scalar calculations (e.g., scalar accumulation). The scalar calculation results can be stored in the corresponding entries in the register file. This also speeds up scalar calculations.

[0105] When the cluster control unit 602 is working, the selection unit 603 in the cluster control unit 602 receives commands and data from the distribution unit 601, obtains the command type and buffer identifier from the command, and determines whether the command type is configuration or execution. If it is configuration, the received data and buffer identifier are provided to the parser 606. If it is execution, the buffer identifier is provided to the operation unit 607.

[0106] Parser 606 is used to parse the instruction sequence from the data, store the instruction sequence into the instruction buffer that matches the buffer identifier, and store the operands of each instruction in the instruction sequence into the corresponding entries in the register file. Optionally, the parser also determines, based on the original operands of each instruction in the instruction sequence, the operands of each instruction when it is executed on each PE unit, and stores the determined operands into the corresponding entries in the register file. For example, if the original operand of an instruction is a data address, and the instruction is to be allocated to multiple PE units for execution, the data address for execution on each PE unit is determined based on the original data address.

[0107] Operation unit 607 receives a buffer identifier and drives the instruction buffer that matches the buffer identifier to execute each instruction in it one by one to generate control signals. The control signals of each instruction and the operands of the instruction in the register file are sent to multiple execution units. Each execution unit performs the corresponding operation based on the received control signals and operands.

[0108] In practice, the processing of a specified neural network application can be divided into two phases: a configuration phase and an execution phase. In the configuration phase, instruction buffers 1-X are initialized with the instruction sequence of the specified neural network application, and each entry in the register file is initialized with the operands of each instruction in that instruction sequence. In the execution phase, the distribution unit 601 sends the operation data of the specified neural network application to each PE cluster. Simultaneously, the distribution unit 601 sends a command containing the identifier (ID) of the instruction buffer to be selected for execution. The operation unit 607 uses this identifier to drive the execution of each instruction in the instruction sequence pre-loaded into the corresponding instruction buffer. During execution, these instructions generate control signals for each PE unit. Each PE unit loads its own operation data from the operation data of the specified neural network application based on the received control signals. The PE unit can then perform the corresponding operations based on its own operation data and the received control signals.
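The interplay of the selection unit, parser, and operation unit across the two phases can be modelled in software as follows. This is an illustrative sketch only: the class and method names, the string command types, and the representation of an instruction as an (opcode, operand) pair are assumptions, not the hardware encoding.

```python
class ClusterControlUnit:
    """Toy model of the selection unit, parser, and operation unit."""

    def __init__(self, num_buffers):
        self.instruction_buffers = [[] for _ in range(num_buffers)]
        self.register_file = {}  # (buffer id, instruction slot) -> operand

    def handle(self, command_type, buffer_id, data=None):
        """Selection unit: route by command type."""
        if command_type == "CONFIG":
            return self._parse(buffer_id, data)
        if command_type == "EXE":
            return self._operate(buffer_id)
        raise ValueError(f"unknown command type: {command_type}")

    def _parse(self, buffer_id, data):
        """Parser: store opcodes in the buffer and operands in the register file."""
        self.instruction_buffers[buffer_id] = [opcode for opcode, _ in data]
        for slot, (_, operand) in enumerate(data):
            self.register_file[(buffer_id, slot)] = operand
        return []

    def _operate(self, buffer_id):
        """Operation unit: issue each instruction in order with its operand."""
        return [(opcode, self.register_file[(buffer_id, slot)])
                for slot, opcode in enumerate(self.instruction_buffers[buffer_id])]


ccu = ClusterControlUnit(num_buffers=2)
config_signals = ccu.handle("CONFIG", 1, [("LD", 0), ("MAC", 4), ("ST", 8)])
exec_signals = ccu.handle("EXE", 1)
```

The configuration command produces no control signals; the later execution command replays the stored sequence, so only the buffer identifier has to accompany the operation data.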

[0109] Furthermore, the distribution unit 601 uses a packed design when distributing instructions/data to the PE cluster. This packed design means that data and commands are stacked together. During the configuration phase, the command specifies which instruction buffer is selected to store the instruction sequence it is associated with. During the execution phase, the command specifies which instruction buffer is selected to process the operation data it is associated with.

[0110] Furthermore, when processing a specific neural network application, its corresponding instruction pipeline and PE unit hardware pipeline are separate and will not interfere with each other. Depending on the different neural network applications, the acceleration unit of this disclosure embodiment uses two types of instruction pipelines.

[0111] The first type of instruction pipeline consists of two stages: a decode (ID) stage and an operand fetch (FO) stage. In the decode stage, control signals are generated based on the opcode of the current instruction. In the operand fetch stage, the required operands (e.g., buffer addresses, accumulation buffer identifiers (IDs), etc.) are fetched from the register file. Each PE unit is then instructed to process the data associated with the current instruction. In this case, the instruction pipeline only provides the necessary information to each PE unit, and no further operations are required. It should be noted that this type of instruction pipeline is designed for neural network applications that do not require updating the operands of instructions, such as matrix multiplication (spMV and spMM).

[0112] The second type of instruction pipeline comprises four stages: the decode (ID) stage, the operand fetch (FO) stage, the scalar computation unit (SU) stage, and the write-back (WB) stage. The first two stages (ID and FO) are identical to those of the first type of instruction pipeline. Unlike the first type, however, the second type is designed for neural network applications that require updating the operands of instructions after execution on the PE unit. These neural network applications include convolution (spCONV) and depthwise convolution (DCONV). In the scalar computation unit (SU) stage, the scalar computation unit 604 performs scalar computations to update the corresponding operands of the current instruction. The updated operands are then written back to the corresponding entries in the register file in the write-back stage.
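The difference between the two pipeline types can be sketched as a single function with an optional update step. The function name, the string form of the control signal, and the example update rule (advancing an address by 4) are all hypothetical; they stand in for whatever scalar computation a given application requires.

```python
def run_instruction(opcode, reg_index, register_file, update=None):
    """Model one instruction passing through the pipeline.

    With update=None only the ID and FO stages run (first pipeline type).
    With an update function the SU and WB stages also run (second type).
    """
    control_signal = f"ctrl:{opcode}"          # ID: derive control signal from opcode
    operand = register_file[reg_index]         # FO: fetch operand from register file
    if update is not None:
        new_operand = update(operand)          # SU: scalar computation on the operand
        register_file[reg_index] = new_operand  # WB: write the operand back
    return control_signal, operand


register_file = [0, 16]

# First pipeline type: the operand is left unchanged.
first = run_instruction("MAC", 1, register_file)
after_first = list(register_file)

# Second pipeline type: the operand is updated for the next issue.
second = run_instruction("MAC", 1, register_file, update=lambda addr: addr + 4)
after_second = list(register_file)
```

In the second case the instruction that is issued next from the same buffer entry sees the updated operand, which is what allows one stored instruction to walk through successive data positions.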

[0113] In this disclosure, the instruction design is implemented based on a finite state machine (FSM) and data-dependent instructions. Specifically, the behavior of the PE unit depends on three key factors: the commands stacked with the data; the instructions stored in the instruction buffer; and the state of the scalar pipeline. The specific design of these three factors is described below.

[0114] First, during the configuration phase, the distribution unit 601 transmits the instruction sequence for the specified neural network application and a command containing an identifier (ID) of the instruction buffer to the PE cluster. At this stage, the identifier of the instruction buffer in the command indicates the instruction buffer in which the instruction sequence transmitted with the command should be stored. During the execution phase, the distribution unit sends the operation data (which may be part or all of the operation data for the specified neural network application) and a command containing an identifier (ID) of the instruction buffer to the PE cluster. At this stage, the identifier of the instruction buffer in the command indicates which buffer's instruction sequence will process the operation data transmitted with the command. An exemplary command design is shown in the table below.

[0115] Table 1

[0116]

[0117] The command block uses 8 bits to minimize the overhead of data transfer between the distribution unit and the PE cluster. Commands are divided into two types: configuration commands and execution commands, corresponding to the configuration and execution phases respectively. Referring to Table 1, the opcode (CONFIG or EXE) occupies the first bit. In configuration commands, the second and third bits are reserved, and the fourth to eighth bits are used to identify the instruction buffer. In execution commands, the second bit is a flag determining whether the PE unit should remove the accumulated result, and the third bit is a flag indicating whether the current data segment of the PE unit is the last data segment of the input data; if so, the accumulated result is written back to the output buffer. As in configuration commands, the fourth to eighth bits of execution commands are used to identify the instruction buffer.
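A possible decoding of the 8-bit command is sketched below. Two points are assumptions rather than facts from the text: the "first bit" is taken to be the most significant bit, and the opcode values (0 for CONFIG, 1 for EXE) are chosen arbitrarily. The two flag fields follow the opcode bit in the order in which the flags are described above, and the field names are hypothetical.

```python
def decode_command(byte):
    """Decode an 8-bit command, treating the first bit as the most significant."""
    opcode = "EXE" if (byte >> 7) & 1 else "CONFIG"   # assumed: 0=CONFIG, 1=EXE
    fields = {"opcode": opcode, "buffer_id": byte & 0b11111}  # last five bits
    if opcode == "EXE":
        # The two bits after the opcode are flags in execution commands
        # and reserved in configuration commands.
        fields["remove_accumulated"] = bool((byte >> 6) & 1)
        fields["last_segment"] = bool((byte >> 5) & 1)
    return fields


exe_fields = decode_command(0b10100011)
config_fields = decode_command(0b00000101)
```

Five identifier bits allow up to 32 instruction buffers to be addressed by a single command byte.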

[0118] Table 2 illustrates the instruction design in the instruction buffer. These four instructions cover all operations involved in neural network applications (spMV, spMM, spCONV, and Depth-wise CONV).

[0119] Table 2

[0120]

[0121] As shown in Table 2, the first and second bits are used to encode the opcode. For the multiply-accumulate instruction (MAC), the third and fourth bits indicate whether load balancing is enabled and whether the special function unit (SPU) is active, respectively. The fifth to eighth bits are used to encode the base address of the input buffer, the ninth and tenth bits are used to encode the identifier of the accumulation buffer, and the last two bits are reserved. For data load instructions and data store instructions (LD/ST), the fifth to eighth bits are used to encode the base address of the input/output buffer to be loaded/stored. For SPU (special function unit) instructions, the third and fourth bits indicate the type of special function to be executed. Note that in this design, writing directly to the output buffer is also encoded as a type of special function (Non-SF), and the fifth to eighth bits and the ninth to twelfth bits encode the buffer addresses of the input and output data of the special function, respectively.
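The MAC field layout described above implies a 12-bit instruction word, which can be packed and unpacked as in the following sketch. The numeric opcode value for MAC is hypothetical, the first bit is assumed to be the most significant, and the function names are illustrative.

```python
MAC_OPCODE = 0b00  # hypothetical value; the actual opcode encoding is not given here


def encode_mac(balance, spu, input_base, acc_id):
    """Pack a MAC instruction into a 12-bit word (first bit = most significant)."""
    if not 0 <= input_base < 16:
        raise ValueError("input buffer base address must fit in 4 bits")
    if not 0 <= acc_id < 4:
        raise ValueError("accumulation buffer identifier must fit in 2 bits")
    return ((MAC_OPCODE << 10)        # bits 1-2: opcode
            | (int(balance) << 9)     # bit 3: load balancing enabled
            | (int(spu) << 8)         # bit 4: special function unit active
            | (input_base << 4)       # bits 5-8: input buffer base address
            | (acc_id << 2))          # bits 9-10: accumulation buffer identifier
                                      # bits 11-12: reserved (left as 0)


def decode_mac(word):
    """Unpack the fields of a 12-bit MAC instruction word."""
    return {
        "opcode": (word >> 10) & 0b11,
        "balance": bool((word >> 9) & 1),
        "spu": bool((word >> 8) & 1),
        "input_base": (word >> 4) & 0b1111,
        "acc_id": (word >> 2) & 0b11,
    }


word = encode_mac(balance=True, spu=False, input_base=5, acc_id=2)
fields = decode_mac(word)
```

The two-bit accumulation buffer identifier is sufficient to select one of four accumulation buffers.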

[0122] The states of the scalar pipeline primarily control the behavior of the weight queue. There are three scalar pipeline states, each applied to different neural network applications. For spMV and spMM, the weight queue acts as a shifting queue, and the corresponding scalar pipeline state is defined as the first-in, first-out queue state (SQ). Because spCONV requires the weight queue to support rotation over the loopback path, the corresponding state is defined as the rotation queue state (RQ).

[0123] In depthwise convolution (Depth-wise CONV), the weight queue actually stores the activation data. To minimize data transfer between the distribution unit and the PE cluster, the overlapping regions generated by the sliding window are fully exploited, popping only the data within the overlapping regions instead of popping all activation data from the weight queue. In this way, the weight queue operates in sliding window mode, and the corresponding scalar pipeline state is called the sliding window queue state (SW). In summary, in the SQ state, data is popped in a first-in, first-out manner; in the RQ state, data is popped cyclically (the loopback path implements weight reuse); and in the SW state, data is loaded into and exported from the PE unit.

[0124] The following describes the spMV instruction sequence. Instruction sequences for other applications can be generated by modifying the spMV instruction sequence. Based on the spMV instruction sequence, one can understand how to use this set of instructions to execute other applications. The spMV instruction sequence is divided into five main stages.

[0125] The first stage is configuring the instruction buffer (i.e., CONFIG above) before the execution stage.

[0126] The second stage involves writing activation data and generating a bitmask during execution (the corresponding instruction is LD).

[0127] The third stage is writing the partial accumulated result (the corresponding instruction is LD).

[0128] The fourth stage is execution (the corresponding instruction is MAC, including Balance and SPU indication information).

[0129] The fifth stage is reading (the corresponding instruction is ST).

[0130] The following focuses on the instructions executed on the PE cluster. In the first stage, only configuration-related instructions are executed to initialize the instruction buffer, scalar computation unit, and register file. In the second and third stages, LD instructions are executed to load data into the PE cluster. In the fourth stage, MAC instructions with different configurations (shifting, load balancing, SPU, etc.) are executed, based on commands from the distribution unit and the configurations made prior to the execution phase. In the fifth stage, ST instructions are executed to export data from the PE cluster.

[0131] Based on the instruction sequence of spMV, we can generate instruction sequences for other neural network applications. For spMM, the fourth-stage instruction buffer should contain multiple MAC instructions. These MAC instructions are issued and executed according to the instruction buffer's own PC. For spCONV, after the pre-configured MAC instructions, the weight queue rotates according to the state of the scalar pipeline. Depth-wise CONV uses a similar instruction sequence to spMV, the only difference being that the roles of weights and activation data are reversed in depth-wise convolution.

[0132] The neural network application is mapped to the acceleration unit of this disclosure embodiment for execution.

[0133] The acceleration unit supports various neural network applications, including matrix multiplication, convolution, and depthwise convolution. The most basic operations in these applications are multiplication and accumulation; therefore, the PE unit designed in the disclosed embodiments primarily performs multiplication and accumulation operations. A detailed description based on neural network applications follows.

[0134] Figure 6a is a schematic diagram of matrix multiplication. As shown in Figure 6a, the activation data is an m*k two-dimensional matrix, where m represents rows and k represents columns. The weight data is a k*n matrix, where k represents rows and n represents columns. Therefore, the output data is an m*n matrix, where m represents rows and n represents columns. For example, if A is a 2*3 matrix, B is a 3*2 matrix, and C is the matrix product of A and B, which is a 2*2 matrix, the operation process is as follows.

[0135]

[0136]

[0137]
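The 2*3 by 3*2 example can be sketched in a few lines of Python. This is only an illustration: the numeric values of A and B below are assumed, not taken from the disclosure, and the helper name matmul is hypothetical. The innermost statement corresponds to the multiply-and-accumulate step that the PE units are described as performing.

```python
def matmul(a, b):
    """Multiply an m*k matrix by a k*n matrix using multiply-accumulate steps."""
    m, k, n = len(a), len(b), len(b[0])
    assert all(len(row) == k for row in a), "inner dimensions must agree"
    out = [[0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            acc = 0
            for p in range(k):
                acc += a[i][p] * b[p][j]  # one multiply-accumulate (MAC)
            out[i][j] = acc
    return out


# Assumed example values; the disclosure's own numbers are not reproduced here.
A = [[1, 2, 3],
     [4, 5, 6]]      # 2*3 activation matrix (m=2, k=3)
B = [[7, 8],
     [9, 10],
     [11, 12]]       # 3*2 weight matrix (k=3, n=2)
C = matmul(A, B)     # 2*2 output matrix (m=2, n=2)
print(C)
```

Each element of C is the sum of k = 3 products, so the whole example needs m*n*k = 12 multiply-accumulate steps.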

[0138] As shown in Figures 6b and 6c, convolution and depthwise convolution involve more dimensions. Referring to Figure 6b, the activation data, weight data, and output data are all four-dimensional matrices (herein, one-dimensional and two-dimensional matrices are referred to as low-dimensional matrices, and matrices of three or more dimensions are referred to as high-dimensional matrices). The parameters of the activation data are [b, w, h, c_in], the parameters of the weight data are [c_out, l, l, c_in], and the parameters of the output data are [b, w, h, c_out]. For ease of understanding, this example can be read as a convolution operation on image data: b is the number of images, w and h are the width and height of each image, and c_in is the number of channels (for example, c_in equals 3 for an RGB image). The convolution operation can be understood as scanning each image (the cuboid defined by c_in, w, and h in the figure) with an l*l*c_in convolution kernel to obtain an output image. The calculation is as follows: first, take the inner product of the l*l matrix with the corresponding feature elements in the two-dimensional image and sum the inner product values; then sum these inner-product sums at the corresponding coordinates across the c_in channels, and use the result as the value at the corresponding coordinate of the two-dimensional feature map. In other words, one l*l*c_in convolution kernel applied to an image defined by [w, h, c_in] yields one two-dimensional feature map of size w*h, and c_out l*l*c_in convolution kernels applied to an image defined by [w, h, c_in] yield a c_out*w*h output feature map. Since there are b images used as activation data, b c_out*w*h output feature maps are ultimately obtained.
Referring to Figure 6c, the calculation process of depthwise convolution is as follows: first, take the inner product of the l*l convolution kernel with the corresponding feature elements in the input two-dimensional image, and use the sum of the inner product values as the value at the corresponding coordinate of the output two-dimensional feature map. Here c, the number of channels of the input and of the convolution kernel, remains unchanged and is used as the number of channels of the output image, so that b c*w*h feature maps are finally obtained.

[0139] From the above, it can be seen that the basic operations of convolution and depthwise convolution are matrix operations (multiplication and summation); convolution and depthwise convolution simply involve more dimensions. During program processing, however, the high-dimensional matrix operations of convolution and depthwise convolution can be converted into multiple iterations of low-dimensional matrix operations. Taking Figures 6a-6c as an example, b*w*h in Figures 6b-6c corresponds to m in Figure 6a, c_in corresponds to k in Figure 6a, and c_out corresponds to n in Figure 6a. In this way, the convolution and depthwise convolution shown in Figures 6b-6c are converted into multiple iterations of multiplying an m*k two-dimensional matrix by a k*n two-dimensional matrix. Executing a neural network application also involves using the direct memory access (DMA) module 235 to load the data required for each operation into the on-chip memory 236.
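One way to picture this conversion is the following Python sketch. It rests on several assumptions that are not stated in the disclosure: stride 1, an odd kernel size l, zero padding so the output keeps the w*h size, and one iteration per kernel tap (l*l iterations in total). In every iteration a (b*w*h) x c_in matrix is multiplied by a c_in x c_out matrix and the partial results are accumulated. The function names are hypothetical, and conv_direct is included only as a reference to check the result.

```python
def matmul(a, b):
    """Plain m*k by k*n matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def conv_as_matmuls(act, wgt):
    """Convolution over act[b][w][h][c_in] with wgt[c_out][l][l][c_in] as l*l matmuls."""
    b, w, h, c_in = len(act), len(act[0]), len(act[0][0]), len(act[0][0][0])
    c_out, l = len(wgt), len(wgt[0])
    pad = l // 2
    m = b * w * h                                   # m corresponds to b*w*h
    out = [[0] * c_out for _ in range(m)]
    zero = [0] * c_in
    for dx in range(l):
        for dy in range(l):
            lhs = []                                # m*k matrix, k = c_in
            for img in range(b):
                for x in range(w):
                    for y in range(h):
                        sx, sy = x + dx - pad, y + dy - pad
                        inside = 0 <= sx < w and 0 <= sy < h
                        lhs.append(act[img][sx][sy] if inside else zero)
            # k*n matrix, n = c_out: kernel tap (dx, dy) of every output channel
            rhs = [[wgt[o][dx][dy][c] for o in range(c_out)] for c in range(c_in)]
            part = matmul(lhs, rhs)
            for r in range(m):
                for o in range(c_out):
                    out[r][o] += part[r][o]         # accumulate across iterations
    return [[[out[(img * w + x) * h + y] for y in range(h)]
             for x in range(w)] for img in range(b)]


def conv_direct(act, wgt):
    """Reference implementation: direct nested-loop convolution."""
    b, w, h, c_in = len(act), len(act[0]), len(act[0][0]), len(act[0][0][0])
    c_out, l = len(wgt), len(wgt[0])
    pad = l // 2
    res = [[[[0] * c_out for _ in range(h)] for _ in range(w)] for _ in range(b)]
    for img in range(b):
        for x in range(w):
            for y in range(h):
                for o in range(c_out):
                    for dx in range(l):
                        for dy in range(l):
                            sx, sy = x + dx - pad, y + dy - pad
                            if 0 <= sx < w and 0 <= sy < h:
                                for c in range(c_in):
                                    res[img][x][y][o] += act[img][sx][sy][c] * wgt[o][dx][dy][c]
    return res


# Small deterministic test data: b=2, w=4, h=3, c_in=2, c_out=3, l=3.
act = [[[[(i + 2 * x + 3 * y + c) % 5 for c in range(2)] for y in range(3)]
        for x in range(4)] for i in range(2)]
wgt = [[[[(o + dx - dy + c) % 3 - 1 for c in range(2)] for dy in range(3)]
        for dx in range(3)] for o in range(3)]
print(conv_as_matmuls(act, wgt) == conv_direct(act, wgt))
```

Other decompositions are possible, for example folding the kernel taps into the k dimension; the sketch only shows that a four-dimensional convolution can be expressed as repeated two-dimensional multiply-and-accumulate work.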

[0140] In implementation, there are various ways to convert the high-dimensional matrix operations of convolution and depthwise convolution into multiple iterative low-dimensional matrix operations. This embodiment defines three mapping methods: input stationary mapping, weight stationary mapping, and output stationary mapping. The command processor 238 can select one of these mapping methods when processing neural network applications. For each neural network application, the preferred mapping method should reduce data transfer between the acceleration unit 2301 and the external memory 210. To this end, the acceleration unit 2301 can be configured with a preferred mapping method for each neural network application so that the corresponding method is used when executing each neural network application.

[0141] The following section will use matrix multiplication as an example to introduce these three mapping methods.

[0142] The core idea of the input stationary mapping method is to keep the activation data in the PE array for as long as possible. This is illustrated by the pseudocode example shown in Figure 7a. The pseudocode contains multiple iterations (the number of iterations is determined by iter_n0, iter_k0, and iter_m0), and each iteration specifies one two-dimensional matrix multiplication performed on the PE array. The input matrices of this multiplication are denoted by i (activation data) and w (weight data), and the output matrix is denoted by o. For i, the row start and end indices within the two-dimensional matrix (derived from the activation data of the high-dimensional matrix) are defined by m_start and m_end, and the column start and end indices are defined by k_start and k_end. Similarly, for w, the row start and end indices within the two-dimensional matrix (derived from the weight data of the high-dimensional matrix) are defined by k_start and k_end, and the column start and end indices are defined by n_start and n_end. The same applies to o.

[0143] As can be seen from the pseudocode, in the loop conditions of the nested loop, n changes before k, and k changes before m. Therefore, the two-dimensional matrix defined by k*n from the weight data changes before the two-dimensional matrix defined by m*k from the activation data. Thus, while m and k remain constant and n changes, the two-dimensional matrix defined by m*k stays deployed on the PE array for a period of time, while the two-dimensional matrices defined by k*n are continuously loaded from external memory and sent to the PE array. When k changes, the two-dimensional matrix defined by m*k changes, at which point the new m*k matrix is loaded from external memory into the PE array. Furthermore, the output two-dimensional matrix defined by m*n sometimes needs to be written back to memory 210. It should be noted that if the PE array can hold all the two-dimensional matrices defined by m*k, there is no need to use the input stationary mapping method.
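The loop order described above can be sketched as follows. Figure 7a is not reproduced here, so this Python sketch is only an approximation of it: the tile sizes tm, tk, and tn, the helper tiles, and the load counters are assumptions introduced for illustration. The n loop is innermost (changes first) and the m loop is outermost (changes last), so an m*k activation tile stays in place while k*n weight tiles stream past it.

```python
def tiles(total, step):
    """Return (start, end) index pairs covering range(total) in chunks of step."""
    return [(s, min(s + step, total)) for s in range(0, total, step)]


def input_stationary_matmul(i_mat, w_mat, tm, tk, tn):
    """Tiled matmul with the activation tile held while weight tiles change."""
    m, k, n = len(i_mat), len(w_mat), len(w_mat[0])
    o_mat = [[0] * n for _ in range(m)]
    loads = {"i": 0, "w": 0}
    for m_start, m_end in tiles(m, tm):              # m changes last
        for k_start, k_end in tiles(k, tk):
            loads["i"] += 1                          # new m*k tile placed on the PE array
            for n_start, n_end in tiles(n, tn):      # n changes first
                loads["w"] += 1                      # k*n tile streamed in
                for r in range(m_start, m_end):
                    for c in range(n_start, n_end):
                        for p in range(k_start, k_end):
                            o_mat[r][c] += i_mat[r][p] * w_mat[p][c]
    return o_mat, loads


# Assumed example data: 4*8 activation matrix and 8*8 weight matrix.
I = [[r + c for c in range(8)] for r in range(4)]
W = [[(r * c) % 3 for c in range(8)] for r in range(8)]
O, loads = input_stationary_matmul(I, W, tm=2, tk=4, tn=4)
reference = [[sum(I[r][p] * W[p][c] for p in range(8)) for c in range(8)] for r in range(4)]
print(O == reference, loads)
```

With these tile sizes there are two tiles along each of m, k, and n, so the activation tile is loaded 4 times and the weight tile 8 times, which reflects the activation data being held longer than the weight data.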

[0144] The core idea of the output stationary mapping method is to keep the output data in the on-chip memory 236 for as long as possible. The corresponding pseudocode is shown in Figure 7b. Its analysis is analogous to the analysis above and is not repeated here. It should be noted that when all the activation data can be stored in the on-chip memory 236, there is no need to use the input stationary data loading method.

[0145] The core idea of the weight stationary mapping method is to keep the weight data in the on-chip memory 236 for as long as possible. The corresponding pseudocode is shown in Figure 7c. Its analysis is analogous to the analysis above and is not repeated here. It should be noted that the weight stationary mapping method can only be used when the weight data and the computation are separate. If the weight data and the computation overlap, the weight stationary mapping method cannot be used. When using the weight stationary mapping method, the command processor 238 needs to write the current partial result data (calculated by the PE array) back to memory 210 before loading new activation data into the on-chip memory 236.
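The difference between the three mapping methods can be illustrated by counting how often each operand tile changes under a given loop order. The following Python sketch is an assumption-laden model rather than the pseudocode of Figures 7a-7c: the loop orders "mnk" and "knm" chosen for the output stationary and weight stationary methods are guesses consistent with the stated core ideas, the function name count_tile_loads is hypothetical, and storage capacity is ignored.

```python
from itertools import product


def count_tile_loads(order, nm, nk, nn):
    """Count how often each operand tile changes for a loop nest `order` (outer to inner).

    nm, nk, nn are the numbers of tiles along the m, k and n dimensions.
    """
    extents = {"m": nm, "k": nk, "n": nn}
    keys = {"activation": ("m", "k"), "weight": ("k", "n"), "output": ("m", "n")}
    loads = dict.fromkeys(keys, 0)
    last = dict.fromkeys(keys, None)
    # product() varies its last argument fastest, i.e. the innermost loop.
    for idx in product(*(range(extents[d]) for d in order)):
        pos = dict(zip(order, idx))
        for name, dims in keys.items():
            tile = tuple(pos[d] for d in dims)
            if tile != last[name]:       # a different tile is needed: count a load
                loads[name] += 1
                last[name] = tile
    return loads


# Assumed loop orders (outer to inner) for the three mapping methods.
for label, order in [("input stationary", "mkn"),
                     ("output stationary", "mnk"),
                     ("weight stationary", "knm")]:
    print(label, count_tile_loads(order, 2, 2, 2))
```

In each case the operand that is meant to stay in place changes 4 times while the other two change 8 times, which is the property that lets a preferred mapping reduce data transfer for a given application.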

[0146] When implementing the above mapping methods, pipelining of the data transfers also needs to be considered. Referring to the pseudocode in Figure 7a, when the PE array performs the (k+1)th iteration, it first loads the activation data and weight data for this iteration from the on-chip memory 236. The activation data and weight data for this iteration were loaded from memory 210 into the on-chip memory 236 by the command processor 238 during the kth iteration. Note that the on-chip memory 236 is a global storage area, and each storage unit is designed according to the ping-pong principle. This design divides each storage unit into two units: the first unit is used to load data from memory 210, and the second unit is used to provide data to the PE array. Therefore, during PE computation, the activation and weight data for the next iteration are transferred from memory 210 to the on-chip memory 236 and from the on-chip memory 236 to the PE array. Thus, if the computation time of the PE array is longer than the time spent loading activation and weight data from memory 210, the loading time is hidden within the computation time of the PE array, which helps improve the execution efficiency of the acceleration unit. In the final iteration, the activation and weight data for the first iteration of the next group need to be prepared. At the same time, the output data is written back from the PE array to the on-chip memory 236 in the final iteration. The operation of writing the output data from the on-chip memory 236 back to memory 210 is performed in the first iteration of the next group.
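The effect of the ping-pong design on execution time can be estimated with a simple timing model. The Python sketch below is a simplification under stated assumptions: per-iteration load and compute times are given as plain numbers, write-back of output data is ignored, and the function name total_time is hypothetical.

```python
def total_time(load, compute, double_buffered):
    """Return the time needed to process all iterations.

    load[i] is the time to fetch the data of iteration i from external memory,
    and compute[i] is the PE array compute time of iteration i.
    """
    assert load and len(load) == len(compute)
    if not double_buffered:
        return sum(load) + sum(compute)          # load and compute strictly alternate
    t = load[0]                                  # the first load cannot be hidden
    for i in range(len(compute)):
        nxt = load[i + 1] if i + 1 < len(load) else 0
        t += max(compute[i], nxt)                # the next load overlaps this compute
    return t


# Assumed timings in arbitrary units.
print(total_time([3, 3, 3, 3], [5, 5, 5, 5], double_buffered=False))
print(total_time([3, 3, 3, 3], [5, 5, 5, 5], double_buffered=True))
print(total_time([6, 6, 6, 6], [5, 5, 5, 5], double_buffered=True))
```

When compute time exceeds load time (the second call), only the first load is exposed. When loading is slower (the third call), the PE array waits on memory in every iteration and the benefit shrinks.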

[0147] Data segmentation method implemented in the acceleration unit of this disclosure embodiment

[0148] As described above, the command processor 238 loads the data required for each iteration into the various storage units of the on-chip memory 236 via the direct memory access module 235, and then distributes the data to the PE cluster via the distribution unit. The PE cluster further distributes the data to the PE units. In this process, the distribution unit typically segments the matrix according to the dimensions m, n, and k to obtain a matrix that can be distributed to the PE cluster.

[0149] Referring to Figure 8, the activation data is a two-dimensional matrix with 4 rows and 8 columns, the weight data is a two-dimensional matrix with 8 rows and 8 columns, and the output matrix is a two-dimensional matrix with 4 rows and 8 columns. The following describes in detail how the matrix multiplication shown in Figure 8 is deployed to a 2x2 PE array, which includes PE clusters (0,0), (1,0), (0,1), and (1,1). In our design, each PE cluster is a 2D mesh. Therefore, when mapping the above matrix to the PE array, there are three choices (m, n, or k) in each dimension, for a total of nine choices.

[0150] Figures 9a-9i show the nine options for deploying the matrix multiplication shown in Figure 8 to the PE array. In the figures, I, W, and O denote the activation data, the weight data, and the output matrix, respectively, of the matrix multiplication performed on the corresponding PE cluster.
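The nine options can be enumerated programmatically. The Python sketch below is one reading of the text, not a reproduction of the figures: it assumes that the first coordinate x of PE cluster (x, y) selects the slice along the first split dimension, that y selects the slice along the second, that a dimension chosen twice is cut into four parts indexed by x + 2*y, and that Figures 9a to 9i follow the order (m,m), (m,n), (m,k), (n,m), (n,n), (n,k), (k,m), (k,n), (k,k). The function name cluster_slices is hypothetical.

```python
from itertools import product

DIMS = {"m": 4, "k": 8, "n": 8}   # sizes from the Figure 8 example


def cluster_slices(x_axis, y_axis, x, y):
    """Return the I, W and O slices handled by PE cluster (x, y) as strings."""
    rng = {d: (0, size) for d, size in DIMS.items()}
    if x_axis == y_axis:
        cuts = [(x_axis, 4, x + 2 * y)]          # same dimension split four ways
    else:
        cuts = [(x_axis, 2, x), (y_axis, 2, y)]  # each dimension split in half
    for dim, parts, idx in cuts:
        step = DIMS[dim] // parts
        rng[dim] = (idx * step, (idx + 1) * step)

    def fmt(name, rows, cols):
        return "%s[%d:%d,%d:%d]" % (name, *rng[rows], *rng[cols])

    return fmt("I", "m", "k"), fmt("W", "k", "n"), fmt("O", "m", "n")


# Print the assignment of every cluster for all nine options.
for letter, (x_axis, y_axis) in zip("abcdefghi", product("mnk", repeat=2)):
    print("Figure 9%s (split %s, %s)" % (letter, x_axis, y_axis))
    for y, x in product(range(2), repeat=2):
        print("  cluster (%d,%d):" % (x, y), *cluster_slices(x_axis, y_axis, x, y))
```

Slices use half-open ranges, so I[0:2,0:8] covers rows 1 and 2 and all eight columns, matching the notation used for Figures 9a-9i.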

[0151] In Figure 9a, PE cluster (0,0) multiplies the first row of the activation data (i.e., I[0:1,0:8]) by the weight data (i.e., W[0:8,0:8]), and the result is the first row of the output data (i.e., O[0:1,0:8]). In I[0:1,0:8], the index [0:1,0:8] specifies the rows and columns of the input data: 0:1 denotes the first row, and 0:8 denotes columns 1 to 8. W[0:8,0:8] denotes the matrix formed by rows 1 to 8 and columns 1 to 8 of the weight data, i.e., the complete weight data. O[0:1,0:8] denotes the matrix formed by the first row and columns 1 to 8 of the output data. The same notation is used throughout Figures 9a-9i and is not explained again below. PE cluster (1,0) multiplies the second row of the activation data (i.e., I[1:2,0:8]) by the weight data (i.e., W[0:8,0:8]), and the result is the second row of the output matrix (i.e., O[1:2,0:8]). PE cluster (0,1) multiplies the third row of the activation data (i.e., I[2:3,0:8]) by the weight data (i.e., W[0:8,0:8]), and the result is the third row of the output matrix (i.e., O[2:3,0:8]). PE cluster (1,1) multiplies the fourth row of the activation data (i.e., I[3:4,0:8]) by the weight data (i.e., W[0:8,0:8]), and the result is the fourth row of the output matrix (i.e., O[3:4,0:8]).

[0152] As can be seen from Figure 9a, the input and output matrices participating in the matrix multiplication on PE clusters (0,0) to (1,1) are different, but the weight data participating in the matrix multiplication on PE clusters (0,0) to (1,1) are the same. That is, PE clusters (0,0) to (1,1) share the weight data.

[0153] In Figure 9b, PE cluster (0,0) multiplies the first two rows of the activation data (i.e., I[0:2,0:8]) by the first four columns of the weight data (i.e., W[0:8,0:4]), and the result is the first two rows and first four columns of the output data (i.e., O[0:2,0:4]). PE cluster (1,0) multiplies the last two rows of the activation data (i.e., I[2:4,0:8]) by the first four columns of the weight data (i.e., W[0:8,0:4]), and the result is the last two rows and first four columns of the output matrix (i.e., O[2:4,0:4]). PE cluster (0,1) multiplies the first two rows of the activation data (i.e., I[0:2,0:8]) by the last four columns of the weight data (i.e., W[0:8,4:8]), and the result is the first two rows and last four columns of the output matrix (i.e., O[0:2,4:8]). PE cluster (1,1) multiplies the last two rows of the activation data (i.e., I[2:4,0:8]) by the last four columns of the weight data (i.e., W[0:8,4:8]), and the result is the last two rows and last four columns of the output matrix (i.e., O[2:4,4:8]).

[0154] As can be seen from Figure 9b, the input and output matrices participating in the matrix multiplication on PE clusters (0,0) to (1,1) are different, but PE clusters (0,0) and (1,0) use the same weight data, and PE clusters (0,1) and (1,1) use the same weight data.

[0155] In Figure 9c, PE cluster (0,0) multiplies the first two rows and first four columns of the activation data (i.e., I[0:2,0:4]) by the first four rows of the weight data (i.e., W[0:4,0:8]), and the result is the first two rows of the output data (i.e., O[0:2,0:8]). PE cluster (1,0) multiplies the last two rows and first four columns of the activation data (i.e., I[2:4,0:4]) by the first four rows of the weight data (i.e., W[0:4,0:8]), and the result is the last two rows of the output matrix (i.e., O[2:4,0:8]). PE cluster (0,1) multiplies the first two rows and last four columns of the activation data (i.e., I[0:2,4:8]) by the last four rows of the weight data (i.e., W[4:8,0:8]), and the result is the first two rows of the output matrix (i.e., O[0:2,0:8]). PE cluster (1,1) multiplies the last two rows and last four columns of the activation data (i.e., I[2:4,4:8]) by the last four rows of the weight data (i.e., W[4:8,0:8]), and the result is the last two rows of the output matrix (i.e., O[2:4,0:8]).

[0156] As can be seen from Figure 9c, the matrices output by PE clusters (0,0) and (0,1) cover the same region of the output matrix, and the values at corresponding positions in the two matrices need to be added together to obtain the final values. Similarly, the matrices output by PE clusters (1,0) and (1,1) cover the same region of the output matrix, and the values at corresponding positions in the two matrices need to be added together to obtain the final values.

[0157] In Figure 9d, PE cluster (0,0) multiplies the first two rows of the activation data (i.e., I[0:2,0:8]) by the first four columns of the weight data (i.e., W[0:8,0:4]), and the result is the first two rows and first four columns of the output data (i.e., O[0:2,0:4]). PE cluster (1,0) multiplies the first two rows of the activation data (i.e., I[0:2,0:8]) by the last four columns of the weight data (i.e., W[0:8,4:8]), and the result is the first two rows and last four columns of the output matrix (i.e., O[0:2,4:8]). PE cluster (0,1) multiplies the last two rows of the activation data (i.e., I[2:4,0:8]) by the first four columns of the weight data (i.e., W[0:8,0:4]), and the result is the last two rows and first four columns of the output matrix (i.e., O[2:4,0:4]). PE cluster (1,1) multiplies the last two rows of the activation data (i.e., I[2:4,0:8]) by the last four columns of the weight data (i.e., W[0:8,4:8]), and the result is the last two rows and last four columns of the output matrix (i.e., O[2:4,4:8]).

[0158] As can be seen from Figure 9d, the output matrices of PE clusters (0,0) to (1,1) are combined to obtain the final matrix multiplication result.

[0159] In Figure 9e, PE cluster (0,0) multiplies the activation data (i.e., I[0:4,0:8]) by the first two columns of the weight data (i.e., W[0:8,0:2]), and the result is the first two columns of the output data (i.e., O[0:4,0:2]). PE cluster (1,0) multiplies the activation data (i.e., I[0:4,0:8]) by the third and fourth columns of the weight data (i.e., W[0:8,2:4]), and the result is the third and fourth columns of the output matrix (i.e., O[0:4,2:4]). PE cluster (0,1) multiplies the activation data (i.e., I[0:4,0:8]) by the fifth and sixth columns of the weight data (i.e., W[0:8,4:6]), and the result is the fifth and sixth columns of the output matrix (i.e., O[0:4,4:6]). PE cluster (1,1) multiplies the activation data (i.e., I[0:4,0:8]) by the seventh and eighth columns of the weight data (i.e., W[0:8,6:8]), and the result is the seventh and eighth columns of the output matrix (i.e., O[0:4,6:8]).

[0160] As can be seen from Figure 9e, the output matrices of PE clusters (0,0) to (1,1) are combined to obtain the final matrix multiplication result.

[0161] In Figure 9f, PE cluster (0,0) multiplies the first four columns of the activation data (i.e., I[0:4,0:4]) by the first four rows and first four columns of the weight data (i.e., W[0:4,0:4]), and the result is the first four columns of the output data (i.e., O[0:4,0:4]). PE cluster (1,0) multiplies the first four columns of the activation data (i.e., I[0:4,0:4]) by the first four rows and last four columns of the weight data (i.e., W[0:4,4:8]), and the result is the last four columns of the output matrix (i.e., O[0:4,4:8]). PE cluster (0,1) multiplies the last four columns of the activation data (i.e., I[0:4,4:8]) by the last four rows and first four columns of the weight data (i.e., W[4:8,0:4]), and the result is the first four columns of the output matrix (i.e., O[0:4,0:4]). PE cluster (1,1) multiplies the last four columns of the activation data (i.e., I[0:4,4:8]) by the last four rows and last four columns of the weight data (i.e., W[4:8,4:8]), and the result is the last four columns of the output matrix (i.e., O[0:4,4:8]).

[0162] As can be seen from Figure 9f, the corresponding values of the output matrices on PE cluster (0,0) and PE cluster (0,1) are added together to obtain the final values, and the corresponding values of the output matrices on PE cluster (1,0) and PE cluster (1,1) are added together to obtain the final values. The resulting combined matrix is the result of the matrix multiplication.

[0165] In Figure 9g, PE cluster (0,0) multiplies the first two rows and first four columns of the activation data (i.e., I[0:2,0:4]) by the first four rows of the weight data (i.e., W[0:4,0:8]), and the result is the first two rows of the output data (i.e., O[0:2,0:8]). PE cluster (1,0) multiplies the first two rows and last four columns of the activation data (i.e., I[0:2,4:8]) by the last four rows of the weight data (i.e., W[4:8,0:8]), and the result is the first two rows of the output matrix (i.e., O[0:2,0:8]). PE cluster (0,1) multiplies the third and fourth rows and first four columns of the activation data (i.e., I[2:4,0:4]) by the first four rows of the weight data (i.e., W[0:4,0:8]), and the result is the third and fourth rows of the output matrix (i.e., O[2:4,0:8]). PE cluster (1,1) multiplies the last two rows and last four columns of the activation data (i.e., I[2:4,4:8]) by the last four rows of the weight data (i.e., W[4:8,0:8]), and the result is the last two rows of the output matrix (i.e., O[2:4,0:8]).

[0166] As can be seen from Figure 9g, the corresponding values of the output matrices on PE cluster (0,0) and PE cluster (1,0) are added together to obtain the final values, and the corresponding values of the output matrices on PE cluster (0,1) and PE cluster (1,1) are added together to obtain the final values. The resulting combined matrix is the result of the matrix multiplication.

[0167] In Figure 9h, PE cluster (0,0) multiplies the first four columns of the activation data (i.e., I[0:4,0:4]) by the first four rows and first four columns of the weight data (i.e., W[0:4,0:4]), and the result is the first four columns of the output data (i.e., O[0:4,0:4]). PE cluster (1,0) multiplies the last four columns of the activation data (i.e., I[0:4,4:8]) by the last four rows and first four columns of the weight data (i.e., W[4:8,0:4]), and the result is the first four columns of the output matrix (i.e., O[0:4,0:4]). PE cluster (0,1) multiplies the first four columns of the activation data (i.e., I[0:4,0:4]) by the first four rows and last four columns of the weight data (i.e., W[0:4,4:8]), and the result is the last four columns of the output matrix (i.e., O[0:4,4:8]). PE cluster (1,1) multiplies the last four columns of the activation data (i.e., I[0:4,4:8]) by the last four rows and last four columns of the weight data (i.e., W[4:8,4:8]), and the result is the last four columns of the output matrix (i.e., O[0:4,4:8]).

[0168] As can be seen from Figure 9h, the corresponding values of the output matrices on PE cluster (0,0) and PE cluster (1,0) are added together to obtain the final values, and the corresponding values of the output matrices on PE cluster (0,1) and PE cluster (1,1) are added together to obtain the final values. The resulting combined matrix is the result of the matrix multiplication.

[0169] In Figure 9i, PE cluster (0,0) multiplies the first two columns of the activation data (i.e., I[0:4,0:2]) by the first two rows of the weight data (i.e., W[0:2,0:8]), and the result is the output data (i.e., O[0:4,0:8]). PE cluster (1,0) multiplies the third and fourth columns of the activation data (i.e., I[0:4,2:4]) by the third and fourth rows of the weight data (i.e., W[2:4,0:8]), and the result is the output matrix (i.e., O[0:4,0:8]). PE cluster (0,1) multiplies the fifth and sixth columns of the activation data (i.e., I[0:4,4:6]) by the fifth and sixth rows of the weight data (i.e., W[4:6,0:8]), and the result is the output matrix (i.e., O[0:4,0:8]). PE cluster (1,1) multiplies the last two columns of the activation data (i.e., I[0:4,6:8]) by the seventh and eighth rows of the weight data (i.e., W[6:8,0:8]), and the result is the output matrix (i.e., O[0:4,0:8]).

[0170] As can be seen from Figure 9i, the corresponding values of the output matrices of PE clusters (0,0) to (1,1) are added together to obtain the final matrix multiplication result.

[0171] In summary, splitting along the m direction (the row direction of the activation data) means that different PE clusters process different rows of the activation data and of the output matrix, while sharing the same weight data. The number of PE clusters participating in the computation can be determined from the number of valid rows of the activation data. For example, in spMV (sparse matrix-vector multiplication), only one PE cluster is valid (where the row and column directions of the PE array both hold different m slices).

[0172] Splitting along the n direction (the column direction of the weight data) means that different PE clusters compute different output matrix slices along the n direction, while sharing the same input matrix slices. Under this splitting method, different PE clusters require different weight data. If the reuse of the weight data in the computation is low (that is, m is small), the data transmission latency becomes more severe.

[0173] Splitting along the k direction (the row direction of the weight data) means that different PE clusters compute partial sums of the same output matrix slice. Under this splitting method, different PE clusters do not share data during computation. Furthermore, the partial sums generated by the different clusters need to be accumulated to obtain the final result.
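The difference between combining output slices and accumulating partial sums can be checked with a small emulation of a 2x2 PE array. The Python sketch below is illustrative only: the function name partitioned_matmul, the cluster indexing (x + 2*y when the same dimension is split twice), and the example data are assumptions, and the matrix sizes must divide evenly by the number of parts.

```python
from itertools import product


def partitioned_matmul(i_mat, w_mat, x_axis, y_axis):
    """Emulate a 2x2 PE array; return (output, partial sums added into O[0][0])."""
    size = {"m": len(i_mat), "k": len(w_mat), "n": len(w_mat[0])}
    out = [[0] * size["n"] for _ in range(size["m"])]
    writers = [[0] * size["n"] for _ in range(size["m"])]
    for x, y in product(range(2), repeat=2):
        rng = {d: (0, s) for d, s in size.items()}
        if x_axis == y_axis:
            cuts = [(x_axis, 4, x + 2 * y)]
        else:
            cuts = [(x_axis, 2, x), (y_axis, 2, y)]
        for dim, parts, idx in cuts:
            step = size[dim] // parts
            rng[dim] = (idx * step, (idx + 1) * step)
        # Work done by cluster (x, y) on its own slices.
        for r in range(*rng["m"]):
            for c in range(*rng["n"]):
                out[r][c] += sum(i_mat[r][p] * w_mat[p][c] for p in range(*rng["k"]))
                writers[r][c] += 1
    return out, writers[0][0]


# Assumed example data with the Figure 8 shapes (4*8 times 8*8).
I = [[r + c for c in range(8)] for r in range(4)]
W = [[(r * c) % 3 for c in range(8)] for r in range(8)]
reference = [[sum(I[r][p] * W[p][c] for p in range(8)) for c in range(8)] for r in range(4)]

for x_axis, y_axis in product("mnk", repeat=2):
    out, partial_sums = partitioned_matmul(I, W, x_axis, y_axis)
    print(x_axis, y_axis, out == reference, partial_sums)
```

Splits that involve only m and n write each output element once, so the cluster outputs are simply placed side by side. Every split along k adds another partial sum per output element: two when k is split once and four when both array dimensions split k.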

[0174] The acceleration unit provided in this embodiment decomposes the instructions to be executed of the neural network model into multiple sub-operations that are performed in multiple iterations. The operation data of the multiple sub-operations is obtained through the direct memory access module, and then the multiple sub-operations are deployed to the PE array for execution. Since the PE array includes three-dimensional PE units, each of which can perform a set operation, the parallel execution of the three-dimensional PE units can achieve hardware acceleration for neural network applications.

[0175] Furthermore, the step of decomposing the instructions to be executed of the neural network model into multiple sub-operations performed in multiple iterations and deploying the multiple sub-operations onto the PE array involves: converting the operation of activation data and weight data of the high-dimensional matrix into the operation of activation data and weight data of the low-dimensional matrix performed in multiple iterations, and deploying the operation of activation data and weight data of the low-dimensional matrix onto the PE array. Each PE unit can be used to perform one-dimensional matrix multiplication operations, and the results of one-dimensional multiplication operations can be accumulated together, which helps to achieve hardware acceleration for neural network applications.

[0176] Furthermore, since neural network models mainly include several key neural network applications, such as matrix multiplication, convolution, and depthwise convolution, these key neural network applications can be converted into iterative operations on activation data and weight data of low-dimensional matrices, thereby achieving hardware acceleration of neural network applications and, in turn, hardware acceleration of neural network models.

[0177] Furthermore, while each neural network application can employ different mapping methods to map its activation and weight data into iterative low-dimensional matrix operations, the inventors have found that, given the inherent characteristics of each neural network application, a preferred mapping method can be used that reduces data movement between the external memory and the PE array compared with the other mapping methods. For example, for matrix multiplication, the preferred mapping method is the input stationary mapping method.

[0178] The commercial value of the embodiments disclosed herein

[0179] This disclosure provides a novel instruction set architecture for neural network applications, which can be used as an acceleration unit for neural network models. Such acceleration units have already been widely adopted, meaning that the acceleration units provided by this disclosure have real-world application scenarios and therefore have market potential and commercial value.

[0180] Those skilled in the art will understand that this disclosure can be implemented as a system, method, and computer program product. Therefore, this disclosure can be implemented as entirely hardware, entirely software (including firmware, resident software, and microcode), or a combination of software and hardware. Furthermore, in some embodiments, this disclosure can also be implemented as a computer program product contained in one or more computer-readable media, the computer-readable media containing computer-readable program code.

[0181] Any combination of one or more computer-readable media may be used. A computer-readable medium can be a computer-readable signal medium or a computer-readable storage medium. A computer-readable storage medium is, for example, but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination thereof. More specific examples of a computer-readable storage medium include: an electrical connection of one or more wires, a portable computer disk, hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), optical storage, magnetic storage, or any suitable combination thereof. In this document, a computer-readable storage medium can be any tangible medium containing or storing a program that can be used by or in conjunction with a processing unit, apparatus, or device.

[0182] Computer-readable signal media may include data signals propagated in baseband or as part of a carrier wave, carrying computer-readable program code. Such propagated data signals may take various forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination thereof. A computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium that is capable of sending, propagating, or transmitting a program for use by or in connection with an instruction system, apparatus, or device.

[0183] The program code contained on a computer-readable medium may be transmitted using any suitable medium, including but not limited to wireless, wire, optical fiber, RF, and any suitable combination thereof.

[0184] Computer program code for executing embodiments of this disclosure can be written in one or more programming languages or combinations thereof. The programming languages include object-oriented programming languages such as JAVA and C++, and may also include conventional procedural programming languages such as C. The program code can be executed entirely on the user's computer, partially on the user's computer, as a standalone software package, partially on the user's computer and partially on a remote computer, or entirely on a remote computer or server. In cases involving remote computers, the remote computer can be connected to the user's computer via any type of network, including a local area network (LAN) or a wide area network (WAN), or it can be connected to an external computer (e.g., via the Internet using an Internet service provider).

[0185] The above description is merely a preferred embodiment of this disclosure and is not intended to limit this disclosure. Various modifications and variations can be made to this disclosure by those skilled in the art. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of this disclosure should be included within the scope of protection of this disclosure.

Claims

1. An instruction processing apparatus, comprising: a plurality of instruction buffers, a register file, a selector, a parser, and an operation unit, wherein the register file includes a plurality of entries; the selector is configured to parse a command type and a buffer identifier from a received command, to provide the received data and the buffer identifier to the parser if the command type is configuration, and to provide the buffer identifier to the operation unit if the command type is execution; the parser is configured to parse an instruction sequence from the data, store the instruction sequence in the instruction buffer that matches the buffer identifier, and store the operands of each instruction in the instruction sequence in the corresponding entry of the register file; and the operation unit is configured to drive the instruction buffer that matches the buffer identifier to execute each instruction therein one by one to generate control signals, wherein the control signal of each instruction and the operands of that instruction in the register file are sent to a plurality of execution units, and each execution unit performs a corresponding operation based on the received control signal and operands.
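As an illustration only, the command routing recited in claim 1 can be modeled in software. The following is a minimal sketch, not the claimed hardware: the class name `InstructionProcessor`, the string encodings of the two command types, the `(opcode, operands)` data layout, and the callable execution units are all assumptions introduced for this example.

```python
from dataclasses import dataclass, field

# Hypothetical command-type encodings; the claim only names the two types.
CONFIG, EXECUTE = "configuration", "execution"


@dataclass
class InstructionProcessor:
    num_buffers: int = 4
    # buffer id -> list of opcodes (the stored instruction sequence)
    buffers: dict = field(default_factory=dict)
    # (buffer id, instruction index) -> operands (one register-file entry each)
    register_file: dict = field(default_factory=dict)

    def handle(self, command, data=None, execution_units=()):
        """Selector: parse command type and buffer id, then route the command."""
        cmd_type, buf_id = command
        if not 0 <= buf_id < self.num_buffers:
            raise ValueError(f"unknown buffer id {buf_id}")
        if cmd_type == CONFIG:
            self._parse(buf_id, data)
            return []
        if cmd_type == EXECUTE:
            return self._run(buf_id, execution_units)
        raise ValueError(f"unknown command type {cmd_type!r}")

    def _parse(self, buf_id, data):
        """Parser: store opcodes in the buffer and operands in the register file."""
        self.buffers[buf_id] = [opcode for opcode, _ in data]
        for index, (_, operands) in enumerate(data):
            self.register_file[(buf_id, index)] = operands

    def _run(self, buf_id, execution_units):
        """Operation unit: issue the buffered instructions one by one."""
        results = []
        for index, opcode in enumerate(self.buffers.get(buf_id, [])):
            control_signal = {"opcode": opcode}
            operands = self.register_file[(buf_id, index)]
            # Every execution unit receives the same control signal and operands.
            results.append([unit(control_signal, operands) for unit in execution_units])
        return results


def echo_unit(control_signal, operands):
    """Stand-in execution unit that just reports what it received."""
    return (control_signal["opcode"], operands)


proc = InstructionProcessor()
proc.handle((CONFIG, 0), data=[("LOAD", (0, 16)), ("MAC", (16,))])
trace = proc.handle((EXECUTE, 0), execution_units=[echo_unit, echo_unit])
```

In this sketch a configuration command only fills a buffer and the register file, while an execution command replays the buffer that matches the identifier, which mirrors the two paths out of the selector.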

2. The instruction processing apparatus according to claim 1, wherein the parser determines the operands of each instruction when it is executed on each execution unit based on the original operands of each instruction in the instruction sequence.

3. The instruction processing apparatus according to claim 1, wherein the instruction processing apparatus further includes a scalar calculation unit configured to calculate the operands of a specific instruction and update the operands of that specific instruction in the register file with the new operands.

4. The instruction processing apparatus according to claim 1, wherein the instruction processing apparatus supports a plurality of predefined instructions, and the instruction sequence consists of one or more of the plurality of predefined instructions.

5. The instruction processing apparatus according to claim 4, wherein the plurality of predefined instructions include a data loading instruction, and each execution unit obtains a first vector and a second vector according to the control signal of the data loading instruction and stores them in a first queue and a first buffer.

6. The instruction processing apparatus according to claim 4, wherein the plurality of predefined instructions include a multiplication-accumulation instruction, and each execution unit outputs two values from the first queue and the first buffer according to the control signal of the multiplication-accumulation instruction to perform a multiplication-accumulation operation.
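For illustration, the data loading and multiplication-accumulation behavior of claims 5 and 6 might be modeled as below. This is a sketch under stated assumptions: the class `ExecutionUnit` and its method names are hypothetical, the first vector is assumed to go to the first queue and the second vector to the first buffer, the queue is assumed to behave as a first-in-first-out queue, and the buffer is assumed to be addressed by an index.

```python
from collections import deque


class ExecutionUnit:
    """Toy execution unit with a first queue, a first buffer, and an accumulator."""

    def __init__(self):
        self.first_queue = deque()  # assumed first-in-first-out behavior
        self.first_buffer = []      # assumed index-addressable buffer
        self.accumulator = 0

    def load(self, first_vector, second_vector):
        """Data loading: first vector into the queue, second vector into the buffer."""
        self.first_queue.extend(first_vector)
        self.first_buffer = list(second_vector)

    def mac(self, buffer_index):
        """Multiply-accumulate: one value from the queue, one from the buffer."""
        a = self.first_queue.popleft()
        b = self.first_buffer[buffer_index]
        self.accumulator += a * b
        return self.accumulator


unit = ExecutionUnit()
unit.load([1, 2, 3], [4, 5, 6])
for i in range(3):
    unit.mac(i)
print(unit.accumulator)  # 1*4 + 2*5 + 3*6 = 32
```

Three multiplication-accumulation steps over the loaded vectors yield their dot product, which is the building block of the matrix multiplication and convolution applications named in the later claims.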

7. The instruction processing apparatus according to claim 6, wherein the corresponding entries in the register file further store scalar pipeline states, which specify an attribute of the first queue, the attribute being one of the following: first-in-first-out queue, rotating queue, and sliding window queue.

8. The instruction processing apparatus according to claim 4, wherein the plurality of predefined instructions include a data storage instruction, and each execution unit stores the intermediate calculation results that it generates into external memory according to the control signal of the data storage instruction.

9. The instruction processing apparatus according to claim 4, wherein the plurality of predefined instructions include a special function instruction, and each execution unit starts the corresponding special function unit according to the control signal of the special function instruction.

10. The instruction processing apparatus according to claim 1, wherein the instruction sequence comes from a specified neural network application, and the operation unit uses different instruction pipelines for different neural network applications.

11. The instruction processing apparatus according to claim 10, wherein the specified neural network application is one of the following: matrix multiplication, convolution, and depthwise convolution.

12. The instruction processing apparatus according to claim 10, wherein, when processing the instruction sequence for matrix multiplication, the operation unit uses a two-stage instruction pipeline consisting of a decoding stage and a fetch stage, and when processing the instruction sequence for convolution or depthwise convolution, the operation unit uses a four-stage instruction pipeline consisting of a decoding stage, a fetch stage, a scalar processing stage, and a write-back stage.

13. A cluster comprising an instruction processing apparatus as described in any one of claims 1 to 12 and a plurality of execution units coupled to the instruction processing apparatus, the cluster receiving commands and data transmitted together with the commands.

14. An acceleration unit for executing a neural network model, comprising: a direct memory access module; an on-chip memory including a plurality of storage units; a plurality of cluster groups, each cluster group including a plurality of clusters, each cluster including an instruction processing apparatus as described in any one of claims 1 to 12 and an execution unit coupled to the instruction processing apparatus; a command processor configured to decompose an operation associated with a specified neural network application into a plurality of sub-operations, convert the sub-operations into instruction sequences to be executed on the clusters, specify the operation data for each instruction sequence, load the operation data of the sub-operations multiple times through the direct memory access module, and store the instruction sequences and operation data corresponding to the clusters contained in each cluster group into the corresponding storage unit; and a plurality of distribution units coupled to the plurality of storage units and the plurality of cluster groups, respectively, wherein each distribution unit reads instruction sequences and operation data from the storage unit coupled to it and sends the instruction sequences and operands to the plurality of instruction processing apparatuses coupled to it.
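As a rough illustration of the decomposition and distribution recited in claim 14, the sketch below splits a matrix multiplication into row-block sub-operations and assigns them to per-cluster-group storage units. The row-block tiling, the round-robin assignment, and all function names are assumptions made for this example; the claim does not prescribe a particular decomposition or scheduling policy, and the loop here runs sequentially where the hardware would work in parallel.

```python
def matmul(a, b):
    """Plain reference matrix multiplication on nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def decompose(a, rows_per_sub_op):
    """Split the rows of `a` into blocks; each block is one sub-operation."""
    return [(start, a[start:start + rows_per_sub_op])
            for start in range(0, len(a), rows_per_sub_op)]


def distribute(sub_ops, num_cluster_groups):
    """Place sub-operations into storage units (one per cluster group), round-robin."""
    storage_units = [[] for _ in range(num_cluster_groups)]
    for i, sub_op in enumerate(sub_ops):
        storage_units[i % num_cluster_groups].append(sub_op)
    return storage_units


def run(a, b, rows_per_sub_op=2, num_cluster_groups=2):
    """Execute every sub-operation and reassemble the full result."""
    storage_units = distribute(decompose(a, rows_per_sub_op), num_cluster_groups)
    result = [None] * len(a)
    for storage_unit in storage_units:    # a distribution unit reads its storage unit
        for start, block in storage_unit:  # and hands each sub-operation to a cluster
            for offset, row in enumerate(matmul(block, b)):
                result[start + offset] = row
    return result


a = [[1, 2], [3, 4], [5, 6]]
b = [[1, 2], [3, 4]]
print(run(a, b))  # [[7, 10], [15, 22], [23, 34]]
```

Because each row block depends only on its own rows of the first matrix and on the whole second matrix, the sub-operations are independent, which is what allows them to be spread across clusters.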

15. A server, comprising: the acceleration unit as described in claim 14; a processing unit configured to send an instruction to the acceleration unit to drive the acceleration unit to execute the specified neural network application; and a memory for storing weight data and activation data for the specified neural network application.